Summary
Data has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect their interactions with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset are doing to make it easier for engineers to build those experiences.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Shruti Bhat about the growth of real-time data applications and the systems required to support them
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what is driving the adoption of real-time analytics?
- Architectural patterns for real-time analytics
- Sources of latency in the path from data creation to end user
- End-user/customer expectations for time to insight
- Differing expectations between internal and external consumers
- Scales of data that are reasonable for real-time vs. batch
- What are the most interesting, innovative, or unexpected ways that you have seen real-time architectures implemented?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rockset?
- When is Rockset the wrong choice?
- What do you have planned for the future of Rockset?
Contact Info
- @shrutibhat on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Rockset
- Embedded Analytics
- Confluent
- Kafka
- AWS Kinesis
- Lambda Architecture
- Data Observability
- Data Mesh
- DynamoDB Streams
- MongoDB Change Streams
- Bigeye
- Monte Carlo Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels, all thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health statuses. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today. That's S-I-F-F-L-E-T.
Your host is Tobias Macey, and today I'm interviewing Shruti Bhat about the growth of real time data applications and the systems required to support them. So, Shruti, can you start by introducing yourself?
[00:01:59] Unknown:
Hey, Tobias. Thanks for having me here. Hey, everyone. I'm Shruti Bhat. I'm chief product officer at Rockset, the real time analytics platform built for the cloud.
[00:02:09] Unknown:
You were on the show a while ago when we first talked about Rockset shortly after you had launched it. But for folks who haven't listened back to that episode, which I'll link in the show notes, can you give a bit of an overview about how you first got started working in data?
[00:02:23] Unknown:
So I was previously at VMware and then at another startup which got acquired by Oracle. And once I was at Oracle talking to all the data teams, the biggest question people were asking was, how do I go from data to app? Right? People had figured out how to build a lake, how to use transactional databases, how to do big data analytics, how to do warehouses. But this new phenomenon of building data products or data apps was just starting, and that is still a big challenge. So I was looking around trying to figure out a good answer, and I met Venkat and Dhruba, who were coming out of Facebook, where they built the Facebook newsfeed.
If you think about it, the Facebook newsfeed is one giant data app. It takes a lot of real time information on who's clicking on what, takes all of your historical information on who are your friends, what have you liked in the past, and then builds this very personalized news feed for you. The news feed is a great example of personalization. So that's really how I got into it. And since then, I think
[00:03:24] Unknown:
we've been seeing a whole lot of new data apps and data products being built. In this question of real time data and to your point about building applications on top of these datasets, what are the kind of main use cases for these low latency datasets and the types of turnaround time that people are expecting and maybe how that factors into whether they are an internal versus an external consumer of that data?
[00:03:51] Unknown:
Oh, great question. Yeah. Putting it as two categories, user facing analytics and then internal operational analytics, is a very clear way of thinking about it. On the user facing analytics side, your end consumers are your customers. So it's embedded in your product. It is sometimes showing up as live dashboards for your customers. Oftentimes, it's alerts for your customers. It could be personalization and recommendation. So ecommerce site: your customer is there. They purchased some things in the past. They're clicking through some things. How do you recommend the right product before the user leaves that session? That's one great example.
Logistics and delivery tracking. I mean, everything that Uber and Instacart and all the guys out there have done, that's a great example. But we see even a lot more of that in supply chain. For example, one of our largest customers, Command Alkon, we published a story. They track 80% of the cement mixers in the United States. And if you think about a cement mixer, it's constantly spinning. If it's raining, you gotta reroute it. If your contractor or your crew is late, you gotta reroute it. So it makes sense. Right? Logistics delivery tracking for your end customers embedded in your application. That's a great example.
More user facing analytics: we see stuff like in game personalization. Everybody's buying a shield right now? Tell people to go buy a sword. Well, that's in game monetization and personalization. So this is very common on the user facing analytics side. But even internal analytics, it's not always BI dashboards. On the internal side, I think of it as: if you have analysts building weekly reports, great. That's not a real time thing. You should be doing that on a warehouse. But if you have people on the ground making day to day decisions, that's operational analytics. And as an example of this, we have a major fintech company that's doing anomaly detection and fraud detection.
So this is one of the top 3 buy now, pay later companies. You have millions of transactions happening across thousands of merchants. How do you catch that anomaly and have your risk team take action in real time? That's one example of internal operational analytics. Still a data app, but it's internal. Another customer, one of the major airlines, I can't mention them, but you know who they are. Well, you know, every time your flight is delayed and they're trying to reroute you, what do you think they're using? They have to look up so many different things to see where is the flight right now, what crew do I have, which flight is overbooked, what's the cheapest way to reroute you to your destination in the fastest time? That's internal operational analytics. So all of these are great examples of user facing analytics and operational analytics where low latency queries matter and/or real time data matters. But low latency queries always matter.
[00:06:48] Unknown:
And in this context of real time data and being able to build analytics on top of it, the overall space of streaming data and the different technologies available to support that have been years in the making and they still, in some cases, feel like they're getting to the point of maturation, but maybe not quite there yet. And of course, that's going to be different depending on what company you're at or what technologies you're using. And in this question of real time data and access to it, I'm wondering if you've seen that the growth of adoption for these real time analytics use cases has been driving the improvement in those technologies, or if it's the other way around where the presence of those technologies and the realization that this is a possibility is driving the adoption of real time data?
[00:07:35] Unknown:
It's a little bit of a loop. It's a flywheel. Right? It's not one or the other. Both of them are kinda happening together. I remember when I first talked to the Confluent team behind Kafka, I think 5, 6 years ago, they kinda laid it out in terms of phases, and they said phase 0 was already done, where a lot of people were collecting real time data. But they were just getting into phase 1, which is, finally, how do you get value out of that? And how do you actually build apps on top of that? And I think we're seeing a lot of that phase 1 today, where they're still building these apps. And phase 2 is where you go from that to a lot of automation, where the default expectation is not a real time dashboard. The default expectation is you have a program that monitors things for you, taps you on the shoulder, and tells you what to do, and that is phase 2. We're not even fully there yet. We're still in phase 1, where people are just starting to harness the real time data and make decisions on that in real time. And we see it go both ways. We come into accounts where they invested so much in, say, Kinesis or Kafka or even, I would say, CDC streams. So it's not the event stream, but CDC streams coming from Mongo or DynamoDB.
So real time data, you know, little nuance: it's not always coming from Kafka and Kinesis as an event stream. Sometimes it's coming as a CDC stream from Postgres, you know, from Oracle. Mongo's done it beautifully. Dynamo streams is beautiful. But we see people invested in or really bought into CDC streams and event streams, and that's one of the reasons they're looking at how to make decisions in real time using that. But oftentimes, it even goes the other way. We come in. We prove that you can reduce the cost of your overall operation. You can get, you know, more speed. Well, suddenly, they'll go and invest more in their real time infrastructure. So it goes both ways, I would say.
[00:09:37] Unknown:
In the space of real time analytics, whether that's embedded analytics for end users or low latency analytics for internal consumers, you know, for instance, the flight rerouting capability that you were mentioning, what are some of the common architectural patterns that teams have settled on to be able to build out and support these use cases, and maybe some of the points of friction or operational complexities that come about as a result of these distributed systems problems that are inherent to working with data, particularly at high velocity and high volume?
[00:10:13] Unknown:
Yeah. Yeah. The scale definitely makes a big difference here. So the biggest friction, I'll start with the friction, what not to do, and then go into what to do. What not to do is don't try to do things the batchy way. Because, you know, in the batch world, certain things that we've learned, we've so ingrained into ourselves, and they don't work in the real time world. So you have to almost unlearn some of the things. Start from first principles. Go back to first principles. And the minute you get into first principle thinking, you unlock a ton of value. What do I mean by this? Take, for example, preprocessing.
Take, for example, data modeling. How should you think about it? We've all learned in the batch world, the more you invest in preprocessing and data modeling, that's the right way to go, because that's how you operate at scale. That's the clean way of doing things. If you don't want things to break downstream, if you wanna control your costs at massive scale, in the batch world you're being taught that you have to do a lot of this: do the data modeling upfront, invest in that upfront, do a lot of denormalization if you must, do the preprocessing, do the pre aggregations offline, because that's how you save cost.
In the real time world, a lot of those things you have to question and go back to first principles. What I mean by that is, the more hops you add, the two things you're doing are: a, you're introducing more data latency, and, b, you're adding a ton of complexity, because in real time, once things break, it's very, very hard to go back and, you know, fix your real time data pipelines if you have a very complex, multi hop kind of real time data pipeline. So I would say there are a couple of things to keep in mind. One is simplicity scales, complexity doesn't.
The more real time you're going, go simple. Right? Have fewer moving parts, because that's how you achieve scale, and that's how you achieve speed at scale. And when you say fewer moving parts, that also means really question where and how you do your preprocessing and your data modeling and your aggregations. So I'll give you a great example. The fintech company that I was talking about, they had built this whole thing, anomaly detection, on a batch pipeline with their previous approach. An amazing, savvy team, and they brought it down to 6 hours, but their business is still exposed to risk for 6 hours. And the cost was crazy high because they're taking massive volumes of data and running through these multiple hops.
When Rockset came in, we looked at the whole thing and said, wait. Cut that step. Cut this step. Cut this step. And we're still doing pre aggregation. So it's not that we're not doing pre aggregations, but here's the new architecture that they have. They've gone from the stream, so it's Kinesis, to Rockset directly. But as they go into Rockset, it's all schemaless. It's all JSON. So they do not do any data modeling. They do not do any schemas, completely schemaless. So they completely eliminated the need for, you know, dealing with schemas.
It's very deeply nested JSON, but they've eliminated having to unnest their JSON. Again, this is super important. You wanna eliminate these steps. They don't unnest their JSON anywhere else. And they still do what we call roll ups or pre aggregations, but they do them in real time. Now, actually, last time we spoke, we didn't have this new capability we've since added, which is you can do a SQL transform in Rockset before the data is indexed and stored. So as the Kinesis stream comes in, they're rolling it up, and roll ups basically allow them to reduce their storage cost, I think, by almost 10x.
But the beauty of it is, now in real time, like, the data's arriving within 1 to 2 seconds. They've pre aggregated the data. The only two parts they have here are Kinesis and Rockset, and now they're running SQL queries on it to catch anomalies. Now on top of that, yes, they've built alerts and triggers and some really interesting things, but this is the end to end pipeline. It's Kinesis, Rockset, and whatever they wanna use on their end to alert whenever there's a fraud. So you see, it's become a very massive scale real time system, but with very few moving parts.
And when we did the TCO analysis, they actually cut their cost, I think, almost by half, and they're able to achieve the speed where, instead of 6 hours, they get alerted within 1 to 2 seconds of something happening. It's a very different architectural pattern.
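To make that architecture concrete, below is a minimal Python sketch of the ingest-time rollup pattern described here: raw stream events are folded into per-merchant, per-minute aggregates as they arrive, and the anomaly check runs as a simple query over the rolled-up state. The event fields, the one-minute window, and the threshold are hypothetical; in Rockset the rollup step would be expressed as a SQL transform applied before the data is indexed, not as application code.

```python
# A minimal sketch of the "aggregate at ingest" pattern, not Rockset's actual
# rollup API. Field names (merchant_id, amount), the one-minute window, and
# the threshold are hypothetical.
from collections import defaultdict
from datetime import datetime, timezone

# rolled-up state: (merchant_id, minute_bucket) -> running totals
rollups = defaultdict(lambda: {"txn_count": 0, "total_amount": 0.0})

def ingest(event: dict) -> None:
    """Fold one raw stream event into a per-merchant, per-minute rollup.

    Storing only the rollup (instead of every raw event) is what drives the
    roughly 10x storage reduction mentioned in the episode.
    """
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    bucket = ts.replace(second=0, microsecond=0)   # one-minute granularity
    key = (event["merchant_id"], bucket)
    rollups[key]["txn_count"] += 1
    rollups[key]["total_amount"] += event["amount"]

def anomalous_merchants(threshold_per_minute: int = 1000):
    """The query side: flag merchants whose per-minute volume spikes."""
    return [
        (merchant, bucket, agg["txn_count"])
        for (merchant, bucket), agg in rollups.items()
        if agg["txn_count"] > threshold_per_minute
    ]

# Example: two moving parts only, the stream feeding ingest() and the query
# that alerts. No separate batch preprocessing hop.
ingest({"ts": 1660000000, "merchant_id": "m-42", "amount": 19.99})
print(anomalous_merchants())
```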
[00:14:54] Unknown:
For organizations that have already invested in building out analytical capabilities, whether that's by building a data warehouse or a data lake, or whether they're doing analytics directly off of their application architectures, what are some of the motivating factors for moving into this real time space, and some of the considerations that they're typically working through as far as what are the overall costs going to be, do I have the necessary talent pool to be able to support this infrastructure and support these approaches? You know, what are the changes that we need to make in terms of how we think about what types of analytics we're providing or what we're building? And, you know, is it generally an either/or, or do people generally settle on, okay, we are going to use batch for these use cases and real time for those use cases? Just some of those considerations and the breakdown that they go through as they start to think about what capabilities they can offer in a more real time manner.
[00:15:52] Unknown:
Oh, it's definitely both. There's a place for both. It really depends on your use case. I know, you know, every data engineer has probably heard this a thousand times: pick the right tool for the job, match the tool to the use case. The biggest consideration I'm seeing, though, in this economy, one thing has changed, and I hope all the data engineers and data architects listening are paying attention to what's happening in the economy and how they need to get their projects approved in the next 6 months. It's gonna be very different from what was true in the last 6 months. We're already seeing this. If you look now, suddenly, there's a lot more power in the hands of CFOs.
Whereas in the past, you could do a lot of innovation projects that got approved by your CTOs or CPOs. I'm myself in a product role, but suddenly, I have to really pay attention to what's happening on the financial side, and it's a different equation now. So if you wanna get your projects approved, I would say going forward, it is gonna come down to what is the price performance. If you can't prove that, you know, you can achieve the same thing with lower cost and better performance, it's gonna be incredibly hard to make those shifts in the next 6 months. What we see is, going back to your earlier point, the use cases are at a place where people are already trying to do this, and consumers are demanding this.
If it's user facing analytics, you're doing it because your customers are demanding certain features. Your customers are demanding certain analytical capabilities in your product. And at that point, you have a couple of choices. You either try to build it on your existing warehouse, say, you know, something like Snowflake or Redshift, or you try to build it on your existing transactional database, like, say, Mongo or Postgres or even Oracle, if you will. You only have these 2 things in your toolkit if you're not using something like Rockset already. And what we found is that when you try to do it on either of these things, for user facing analytics particularly, the price performance is just not there. Why do I say that?
With warehouses, for example, we've gone in and seen this in actual customer scenarios. For user facing analytics, two things are different. One, you have a lot of frequent updates. And when you have a lot of frequent updates and you try to do this on your warehouse, very quickly you'll find that the warehouse does these very expensive merge operations. Why? Because it was designed for a batch world, and it's assuming that you're gonna batch your updates. So no wonder, if you don't batch your updates and you try to send it very frequent updates, you blow through your credits in no time. And we've had a lot of customers do this and say, oh, I'm forced to take my CDC stream and batch it in 15 minute or 30 minute or 1 hour increments because otherwise, my warehouse cost goes through the roof. So that's one consideration.
The other big consideration is your queries. With user facing analytics, your queries are different. You're no longer doing these very scan oriented, scan intensive queries. I'll give you an example. If you're running a weekly report, you might say, what is the average ASP in Europe this week compared to last week or, you know, compared to last year? We are comparing, like, ASP across regions. Well, that's very commonly scan oriented. You know? You have to scan through everything to compute an average. Sure, a warehouse does a really good job there. But think about personalization use cases, logistics tracking use cases: tell me everything you know about Tobias right now, from, you know, his purchase history to his click stream to his interests.
That's a very selective query. And if you try to do this on a warehouse, you can see why you're wasting a ton of compute, because it does brute force scanning for the most part. So this is the real problem, which is, with the new kind of data apps or data products that developers are building, suddenly, they're not analyst style access patterns. Your data access pattern is different. Your data load pattern is different. Developers deal with data differently. So if you have a developer working with it and you're trying to force them to use a warehouse, no wonder your cost is blowing up, because it's just not the right tool for the job.
So this is where, you know, we've seen, yes, if you're using it for an analyst doing a weekly report, you're probably getting the best price performance, because you can have them, you know, spin up that warehouse, just run the ad hoc queries, spin it down, and you're working on yesterday's data. Perfect. But if you're dealing with developers, that whole paradigm is broken. No wonder your price performance is not working for you. So when we've gone in, we've actually seen up to, and I'm gonna say up to, it's a very marketing claim, but I've seen up to half the cost and double the performance simply by switching from a warehouse to something like Rockset, which is a real time analytics platform. And I don't say just Rockset. I would say, you know, choose the right real time analytics platform that works for you, but choose the right tool for the job. A warehouse is probably burning a hole in your pocket. And if you can make a case to your CFO that you're gonna double the performance and cut the bill in half, that project is getting approved.
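As an illustration of the access-pattern difference described here, below are two hypothetical queries over the same data: a scan-heavy analyst report versus the highly selective, per-user lookup a data app issues on every page load. The table and column names are invented for the example and are not tied to any specific warehouse or to Rockset.

```python
# Two hypothetical queries to make the access-pattern difference concrete.
# Table and column names are made up for illustration.

# Analyst-style, scan-oriented: touches (almost) every row to compute an average.
weekly_report_sql = """
SELECT region, AVG(selling_price) AS avg_asp
FROM orders
WHERE order_date >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY region;
"""

# App-style, highly selective: everything about one user, right now.
# With an index on user_id this reads a handful of rows; brute-force scanning
# the same tables on every page load is where the compute bill explodes.
user_profile_sql = """
SELECT o.order_id, o.item, o.selling_price, c.page, c.clicked_at
FROM orders o
JOIN clickstream c ON c.user_id = o.user_id
WHERE o.user_id = :user_id
ORDER BY c.clicked_at DESC
LIMIT 50;
"""

print(weekly_report_sql, user_profile_sql)
```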
[00:21:21] Unknown:
To your point about choosing the right real time platform, what are the different capabilities or feature sets, or maybe operational capabilities or operational models, that teams need to be thinking about as they're making those determinations? So, you know, whether it's compliance reasons or the kinds of interfaces, inputs, and outputs that are supported, just what are the attributes of these different platforms that are going to vary, and how do teams need to think about which one is going to be the right fit for their needs?
[00:21:55] Unknown:
Yeah. Great question. I think I didn't answer your previous question on what the training for the team should be, or how you decide based on people capability. So that's one for sure. But let's build it out. You know, I'm jamming with you here. If you are evaluating something for user facing analytics or your internal operational analytics, it's a real time platform. What should your eval look like? How do you set up the right POC? How do you ask the right questions to your vendor? So I would definitely put price performance at the top of the list. And again, price performance is the right way to think about it. Don't think of it just as cost or TCO or performance only.
The way you think of price performance is almost like miles per gallon, which is, you know, we were just talking about this before. In the old world, when gas was cheap, you could just say, I don't care about miles per gallon. I just want the fastest car. Today, that doesn't work anymore. Gas is crazy expensive. You should care about miles per gallon. Similarly, you know, CFOs suddenly have very tight budgets, so you should care about price performance. And price performance basically means, see how much compute you're spending. The biggest thing when it comes to these workloads is compute.
You're not gonna really burn a hole in your pocket with storage. You know, storage today is so commoditized. And also, most of the user facing analytics and operational analytics projects that we've seen, they don't have petabyte scale datasets. They tend to have, you know, tens to hundreds of terabytes. If it's a petabyte scale dataset, chances are you're doing offline, you know, year over year kind of analysis, and that should live in your warehouse. But for user facing and operational analytics, for real time analytics, we see tens to hundreds of terabytes, especially because you're doing roll ups, pre aggregations, and you're doing retention policies.
If you manage it the right way, storage is not the thing. Compute is the thing that's really blowing up your cost. So set up a price performance test and really measure, for every hour of compute, whatever you wanna peg it at, 4 CPUs, 8 CPUs: for every hour of 8 CPUs I spend, what kind of performance am I getting? Every minute, every hour, however you wanna measure it. So pick your dataset, run it on the different vendors, and say, this is the real price performance for compute.
And that's almost like a compute efficiency metric that you should care about, because this applies to real time analytics and even to warehouses. I mean, everybody who's spending a ton of money on Snowflake will tell you it's not the storage, it's the compute that is really expensive. So build that compute efficiency metric. Anytime you're running a Snowflake warehouse 24 by 7, you should ask yourself, is this, you know, the right workload? Maybe I should go do a compute efficiency comparison against some real time analytics platform. Build price performance as your number one thing. Compute efficiency matters.
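A rough sketch of that "miles per gallon" calculation: the function below just turns a POC measurement (queries served, CPUs, hours, price per CPU-hour) into queries per compute-hour and dollars per million queries, which is the kind of apples-to-apples number you can put in front of a CFO. The figures in the example call are made up purely to show the arithmetic.

```python
# A rough sketch of the compute-efficiency metric described above. The numbers
# in the example call are hypothetical.

def compute_efficiency(queries_served: int,
                       cpus: int,
                       hours: float,
                       dollars_per_cpu_hour: float) -> dict:
    compute_hours = cpus * hours
    cost = compute_hours * dollars_per_cpu_hour
    return {
        "queries_per_compute_hour": queries_served / compute_hours,
        "dollars_per_million_queries": cost / queries_served * 1_000_000,
    }

# Hypothetical POC run: 8 vCPUs for 24 hours serving 2M queries at $0.05/vCPU-hr.
print(compute_efficiency(queries_served=2_000_000, cpus=8, hours=24,
                         dollars_per_cpu_hour=0.05))
```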
The second one is what you called out, which is people. You have the people that you have. You have the skills that you have. Yes, they can learn new things, but these people already probably know SQL. So, you know, put some of the requirements down. SQL matters not only because of people, but also because it's the thing that everybody's standardizing on in the industry today. If it doesn't speak SQL natively, if it's anything that says "SQL like," it's not SQL. So, yeah, you really wanna put down what your people are capable of. And generally, to me, that comes down to SQL.
Again, continuing on the people theme, there's the question: are your people in the business of managing infrastructure or managing data? All the open source tools are amazing because they let you go super deep. They let you look at the code. But if you're not manipulating that code, if you're not, like, an active open source contributor, what it ends up doing is putting a lot of burden on the team to manage that infrastructure. You know? Open source is amazing. We are not open source. Rockset is not open source. But we study a lot of the open source tools, and we think they're amazing if you're running in the data center on prem, because now you can control your hardware. Now you can eke every bit of price performance out of it.
In the cloud, it's a different story. If you're running it in the cloud, again, this is where you should ask yourself, where am I gonna run this? If I'm running it on prem and I wanna control the environment that much, open source makes sense. If I'm running it in the cloud and I can't spin up and spin down instances every minute, every second, I'm actually doing something wrong. So that's the philosophy we take at Rockset, which is we're fully managed in the cloud. We have auto scaling ingestors, auto scaling pods, auto scaling compute.
The reason we do that is you shouldn't have to worry about anything. We should be, you know, not only spinning up but also spinning down every second. That's how you get, you know, again, going back to compute efficiency. So three things we touched on: price performance; people, which is, you know, SQL; and then operational efficiency, which is, what is it that matters? Do you want to manage it on prem, or do you wanna do this in the cloud? And then going back to security and compliance, that's a big one. I think we run into this a lot, especially if you're using managed services like Rockset. You wanna ask all the compliance questions.
You wanna ask the security questions. You wanna ask about, you know, all the good stuff like private link. Do you connect to my VPC via private link? So a bunch of security questions. Encryption: do you encrypt everything, you know, at rest, in flight? What's going on there? You have to understand what the key management is, and there are lots of good things on the encryption side. But, generally, what we've seen is, if the compliance is in place, like SOC 2 Type 2, HIPAA, all the compliances in place, and these people are passing the security reviews of all the big companies, they probably have the right systems in place. And, you know, if I go back to our early years, we didn't have some of these compliance things in place. So we weren't ready to sell to the enterprise. Over the last few years, we have gotten there.
So by now, we have a security team. We have the whole enchilada down. So you should be able to go back, ask them all the right questions, put your security team in touch with their security team, and as a data engineer, not worry so much about it. Just say, hey, as long as you pass my company security review, you're good to go. And then the last thing I would say is what I mentioned earlier, which is, let's say, all of these make sense. Price performance makes sense. You know, you have SQL, so you have interoperability with your ecosystem.
You have the operational, you know, cloud model that works for you. You have the security and compliance. But the last one that's super important is what I touched upon earlier, which is, how do you minimize the number of moving parts? Because this is, I think, the thing that trips up people the most as they're going into this world of real time. Simplicity, simplicity, simplicity. So if you find that you have to do 10 hops, something's wrong. That's not gonna scale. You're going to constantly be debugging your real time data pipeline. But if you find that you can build a simple thing with three pieces, that is a very, very scalable model. So if you need to denormalize, something's broken.
Do you really need to, you know, use third party systems, whether it's for data modeling or for schema management? Ask yourself if something's broken. The fewer moving parts you have, the cleaner it is, and now you know it's something that scales, because you're not building for today. You're building for the next few years.
[00:30:09] Unknown:
The biggest challenge with modern data systems is understanding what data you have, where it is located and who is using it. SelectStar's data discovery platform solves that out of the box with a fully automated catalog that includes lineage from where the data originated all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your DBT, Snowflake, Tableau, Looker or whatever you are using and select star will set everything up in just a few hours. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
And on that question of simplicity and the number of moving parts and also to your earlier point about the size of the data that you're working with in these real time systems is the question of when you're talking about real time, generally, you care about what has been happening in the past few seconds, minutes, hours. You know, what are the reasonable time horizons that people generally settle on as far as how much of the data do I want to push into my real time workflows and be able to surface in these embedded analytics use cases or real time analytics use cases?
And then how much of it do I then age out into these batch infrastructures, whether it's the data lake or the data warehouse? And then that introduces the question of whether we just recreated the Lambda architecture with some newer technologies and, you know, all of the problems that go along with that.
[00:31:38] Unknown:
So on the question of what you should keep in your real time systems, you know, if you look at it purely from the lens of time series data, yes, a lot of those questions pop up, because when you think real time, you think time series. But I wanna challenge that a little bit, because time series, yes, it is one part of it. When you talk about sensor data, when you talk about click stream data, again, it goes back to event streams. And when you talk about event streams, sure, there's a lot of "I only care about the last 7 days." Let me give you an example for sensor data. Let's say you're UPS. Right? And you have these smart drop boxes. Whenever somebody drops something off, you wanna reroute the truck immediately to go pick that package up in the most efficient way.
Suddenly, somebody adds a new field called temperature, and temperature starts showing up. And, you know, like I said, if you have fewer moving parts, none of your pipelines will break. Temperature will just propagate through your stream and into, say, something like Rockset, and you can start working with it. But the real question is, do you care about temperature every second? Do you care about temperature 5 times a second, which is how often the sensor is sending it? Probably not. You only care about, well, if somebody's shipping wine, you wanna be careful about the temperature in that drop box. But every 5 minutes, that's good enough.
Maybe even every hour is good enough, and that's a determination that every customer makes. So it's not just 7 days or how old. It's also at what granularity you store it. So when it's event streams and time series, you have to ask yourself two questions. One is, how much data do I want to store? 7 days, 1 day, or even, many times, many months or many years. That's okay, as long as you're storing it at the right granularity. So we have people who are aggregating it on a per day basis. And if you start aggregating on a per day basis, you can actually store many months of data. Or sometimes they'll say, I want to store the last 7 days at a certain granularity.
And anything older than that, I can store at, you know, certain different granularities. And all of that should be seamlessly handled in your analytics platform. So suddenly, it's not about only send the most recent data here and send the rest there. It's really about, again, rethinking your architecture and saying, for time series data, you should be able to set it up for the right granularity and also think about retention in the right way. If you do both, oftentimes with event streams, you're okay. But the second big one I talked about is CDC streams. This doesn't apply as often to CDC streams, because CDC streams are coming from your transactional database.
They're getting updates, and this is where, you know, upserts matter. Again, with Rockset, as you know, Tobias, we went upserts first. This is why we're one of the few real time analytics platforms talking about CDC streams so vocally. With CDC streams, anything that doesn't handle upserts is super inefficient, because you're doing merges. Elasticsearch, same problem. Snowflake, same problem. I mean, any other system you take, they're not doing upserts. That's a problem. Now with CDC streams, it's not the Lambda architecture anymore, because you're bringing in almost everything that's happening in your transactional database, and that volume is oftentimes not that big. You're only bringing the right tables. So I might need, for example, let's take personalization.
I only need the purchases that Tobias has made to be able to, you know, personalize your experience in the ecommerce platform. Maybe you have a lot more tables in your transactional system. I don't need those. So you selectively bring in the tables and the fields. Like, you only bring in the things that you need from your transactional database. And there, roll ups don't matter, because it's not time series. And retention policies don't matter, because I need to keep all the transactions that you've ever made. But, again, the volume of that data is very, very small. We're never talking petabytes in a transactional system.
We're talking very small data.
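Since the difference between upserts and batch merges can be hard to picture, here is a minimal sketch of a CDC consumer applying changes in place, keyed by primary key. The event shape is hypothetical (loosely modeled on common CDC formats) and is not Rockset's, MongoDB's, or DynamoDB's actual wire protocol.

```python
# A minimal sketch of why upserts matter for CDC streams: each change event
# carries a primary key, and the consumer applies it in place instead of
# accumulating batches to merge later. The event shape is hypothetical.

orders: dict = {}   # primary key -> current row

def apply_cdc_event(event: dict) -> None:
    """Apply one insert/update/delete from a CDC stream, in place."""
    key = event["key"]
    if event["op"] == "delete":
        orders.pop(key, None)
    else:  # "insert" and "update" both become an upsert
        orders[key] = {**orders.get(key, {}), **event["after"]}

# The analytics copy stays a second or two behind the transactional database,
# with no 15-minute batch-and-merge step in front of it.
apply_cdc_event({"op": "insert", "key": "order-1",
                 "after": {"status": "placed", "total": 42.0}})
apply_cdc_event({"op": "update", "key": "order-1",
                 "after": {"status": "shipped"}})
print(orders["order-1"])   # {'status': 'shipped', 'total': 42.0}
```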
[00:35:54] Unknown:
To that question of data volume, what are the kind of breaking points between, I need to run this analysis in a batch fashion using my data warehouse or my data lake, or this is a small enough volume of data where it makes sense to run this in a real time environment because I can, you know, recompute these aggregations or recompute these analyses fast enough to be able to account for the newer data that's flowing in while building off of the pre aggregations that I've done or the, you know, rolling windowed aggregations that I'm managing. And then as you do get outside those windows for these different reasons that we were just discussing, you know, how to manage either aging that out into the batch system or whether you see folks typically doing a double write where they will propagate it to both the batch and the real time environments at the same time?
[00:36:48] Unknown:
Yeah. Great question. So, typically, it's not based on the data size. At least whenever we've spoken to customers, it's more about whether you need those low latency queries, whether it's user facing analytics, whether it's operational analytics. And who is gonna consume this is the first question we ask. If it is for an analyst doing ad hoc queries, or for a data scientist doing offline training, that is a batch use case. That's the best, you know, way to think about it. But if it's operational analytics, it's people on the ground making day to day decisions, or it's user facing analytics where it's embedded inside your product and it's developers building apps on it.
Then no matter the size of the data, there is a way to do this in real time where you can make it much more price performant. And, again, it goes back to what granularity, how do you do this rolling window, how do you drop the fields that are not necessary. And it always comes down to if this is the use case, it can be done in real time at a much lower cost, oftentimes, than doing it on a batch system. It constantly trips people up because they think real time is more expensive. And every time we've gone in and done this, we've been able to go and, like, cut the cost in half compared to doing it on a batch system simply because you're using the right tool for the job.
And if you do it in the right way with the rolling windows and bringing the right data in, suddenly your cost is actually much lower, and you're getting the performance you need. So I would say don't tie it to the data size or data volume. I've literally seen what I jokingly call or lovingly call data torrents. They're not data streams anymore. Right? They're like many, many terabytes a day. And the more you try to do that in a batch fashion, the more your cost is actually going up. But if you bring it into a real time system that's built for this where, you know, it's got all the time series optimizations.
It's got the rolling windows. It's got the ability to store only the copies you actually need at the right granularity. It's got the aging built in. You can actually handle that volume of data at much lower cost. So it's driven by the use case, that's the first thing. It's also driven by the queries, not just the data. I think this is my biggest challenge, and, you know, I will be very transparent. When I say real time analytics, people often think real time data and think of it as data never stops coming. But the thing that people forget is the queries.
In this world, queries never stop coming. This is high QPS. This is data apps. This is "the queries never sleep," unlike analysts, you know, who go to sleep and the queries stop. So it's not just that data never stops coming. If you really have these use cases, the queries never stop coming. That is the number one thing in picking a real time analytics solution. And, again, if the queries never stop coming, that means your warehouse is running 24 by 7. That's a tell. That's a big tell. Every time you're doing that, you're probably misusing it. The other big one is the pattern of your queries. Like, if you have a lot of selectivity, you have WHERE clauses in your queries, and you're sending it to a warehouse that's doing scans, then you're probably getting poor price performance.
So it's both. Data never stops coming, and queries never stop coming. Hopefully, I answered your question. I might have gone a little bit off topic.
[00:40:19] Unknown:
No. It's definitely good. The point about the continuity of queries, and not just the continuity of data, is an interesting one as well, and one that, as you said, is not something that people generally think about when they're exploring this problem of real time data. And I'm curious what you have seen as the ways that people take advantage of the information that that constancy of queries provides to them, particularly for the embedded analytics use case where you're exposing this interactive capability of being able to explore the data that is pertinent to that customer: how the ways that they are exploring that data give you more information about what they care about and how you might think about, you know, driving your own product direction?
[00:41:08] Unknown:
Yeah. So the types of queries coming in are very interesting, for so many different reasons. One is, you can see, for example, if it's, you know, very selective queries, if you have a lot of WHERE clauses, that's an obvious one. But the second one is, again, with customer facing analytics, this is where developers and product managers are really trying to iterate towards what the customer needs, and it's not always obvious. Right? Whereas if you're building reports for your executive, you already know what your executive wants, because the executive comes in and says, I need to be able to report on blah.
But in this new world, the queries not only never stop coming, they never stop changing. This is again a real tell, because developers are iterating, and product managers, I'm a PM, so I can say this, really don't know what the users want. They're trying to iterate towards it as quickly as they can. They can learn, but they don't know on day 1. Right? And this is the agile way of building products, so you have to embrace it, which means you have to give them flexibility. You cannot say, I'm gonna have this very rigid data model, and you've asked me for this particular query, so I'm just gonna optimize the heck out of it for that one query. Because they'll ship the product, and the next day the customer says, oh, I wish I could ask this question this way, and suddenly you come back to a whole new access pattern.
So the classical knowledge in data engineering has been: understand your access patterns and optimize the shit out of it. Am I allowed to say that word? The heck out of it, for your customers. Well, what do you do if your internal customers are developers and they keep coming back to you? Queries never stop changing. So I'm actually saying embrace that. Embrace the flexibility of queries. Embrace the flexibility of your data model. Embrace that agile way of working. And to do that, again, you need to be able to say, whether it's search, aggregations, joins, no matter what the pattern of the queries, it should just be fast out of the box. It should be price performant.
You can't paint yourself into a corner, because you know your PMs are gonna come with a road map improvement, like, one month down the line, and you better be ready for it. So how do you embrace that agility? There has to be a word for this; I haven't come across it. I hear a lot about data modeling in the modern data stack. There's this whole debate happening, but what it doesn't anticipate is that your queries are going to constantly keep changing on you. Your data is going to constantly keep changing on you in the world of data apps, and you should actually embrace that. This is our approach. Right? The flexibility of queries. How do you get that? This is why we're indexing everything.
We're saying, if we index everything and the database indexes are themselves mutable, then data changes? No problem. You go update it in place. Queries change? No problem, because it's already been indexed for search, aggregations, and joins. You don't have to go build new indexes or obsess about, what am I gonna do now that they suddenly wanna do a join with that new dataset? You gotta kinda embrace that.
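To illustrate the "index everything" idea in the abstract, here is a toy sketch: every field of every JSON document, including nested fields, goes into a field-to-value-to-documents index, so an ad hoc filter on any field becomes a lookup instead of a scan. This is a conceptual illustration only, not how Rockset's converged index is actually implemented.

```python
# Toy "index everything" sketch: field -> value -> {doc_id}. Conceptual only.
from collections import defaultdict

index = defaultdict(lambda: defaultdict(set))
docs = {}

def index_doc(doc_id, doc, prefix=""):
    """Index every field of a (possibly nested) JSON document."""
    if not prefix:
        docs[doc_id] = doc
    for field, value in doc.items():
        path = f"{prefix}{field}"
        if isinstance(value, dict):      # nested JSON fields get indexed too
            index_doc(doc_id, value, prefix=f"{path}.")
        else:
            index[path][value].add(doc_id)

def where_equals(field, value):
    """Any field can back a WHERE clause without upfront index planning."""
    return set(index[field][value])

index_doc("d1", {"user": "tobias", "interest": {"topic": "baseball cards"}})
index_doc("d2", {"user": "shruti", "interest": {"topic": "databases"}})
print(where_equals("interest.topic", "baseball cards"))   # {'d1'}
```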
[00:44:24] Unknown:
As you were discussing the constant changing of the queries and what people are trying to explore, it brought to mind the other major trend that's been happening in the data ecosystem around data quality, data observability, sort of lineage, and also the question of governance and compliance and all of these nonfunctional requirements for the data. And I'm wondering how that manifests in this real time world because there's been a growing consensus about how to think about this in the world of data lakes and data warehouses. I'm wondering how you're seeing that manifest in these kind of end user facing applications.
[00:45:04] Unknown:
Yeah. We're absolutely seeing it. Data observability especially becomes even more important, because think about where observability started. Right? It started with DevOps. And now you're talking about developers building products on top of this. So observability has become super important, and it's only extended in the sense that suddenly end to end latency matters. It's not just what you've seen in the data lake and data warehouse world, but it's even more. It's end to end data latency. The way we've worked with this is, literally, we have a metrics endpoint which, you know, people plug into their Datadog or Prometheus, and they basically monitor Rockset like they monitor any production database.
So observability has become even more critical. Data quality, for sure, matters. But again, in the real time world, the best thing is, whether it's coming from an event stream and you're making these decisions based on real time data, or it's coming from a CDC stream, oftentimes there's also another copy of the data somewhere. Like you mentioned, you know, you might be doing dual writes. Sometimes you're also storing it in the lake for eternity for compliance reasons or, you know, for your own historical analysis that you might wanna do. So you're also doing that. So there is a copy of the data somewhere.
We also look at, for example, in our case, CDC streams coming in from Postgres or Mongo. Well, great. That is your system of record, but, say, Rockset is now syncing to that, staying in sync with that. We become the source of truth. So when people are doing analytics, the source of truth is often Rockset. And, again, data quality matters a lot. However, in this real time world, the way we do it is we make sure that you are able to constantly sync with your data source and that you're only 1 to 2 seconds behind. And, you know, in the event stream world, oh my god. So many questions around eventual consistency.
How do you handle out of order events? All of these things matter, and this is why I keep saying a real time analytics system should be able to handle all of that natively. You can't hand-wave your way around it.
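As a small illustration of treating the real time analytics layer like any other production service, here is a sketch that scrapes a Prometheus-style metrics endpoint and flags when end-to-end data latency drifts past a threshold. The URL, metric name, and threshold are placeholders, not Rockset's actual endpoint or metric names; in practice Prometheus or the Datadog agent would handle this scraping for you.

```python
# Sketch: poll a Prometheus-format metrics page and alert on data latency.
# The endpoint URL and metric name below are hypothetical placeholders.
import urllib.request
from typing import Optional

METRICS_URL = "http://localhost:9090/metrics"          # hypothetical endpoint
LATENCY_METRIC = "ingest_end_to_end_latency_seconds"   # hypothetical metric
THRESHOLD_SECONDS = 5.0

def scrape_latency(url: str = METRICS_URL) -> Optional[float]:
    """Parse a Prometheus exposition page for the latency gauge, if present."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith(LATENCY_METRIC):
                return float(line.rsplit(" ", 1)[-1])
    return None

if __name__ == "__main__":
    latency = scrape_latency()
    if latency is not None and latency > THRESHOLD_SECONDS:
        print(f"ALERT: data is {latency:.1f}s behind the source "
              f"(threshold {THRESHOLD_SECONDS}s)")
```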
[00:47:24] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability.
Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. In your experience interacting with customers and other folks who are building these real time applications, what are some of the most interesting or innovative or unexpected ways that you have seen these real time architectures implemented, or uses of these real time data streams?
[00:48:46] Unknown:
Oh my god. Customers are always surprising us. Of the really interesting examples that come to mind, I thought this whole, like, risk analytics thing was super interesting, because we didn't even realize it, but as we were talking to multiple people in that fintech company, it became clearer and clearer that it's not just fraud. Like, when people say risk, they think of fraud immediately. But it's not just fraud. You know? There are also other types of risk that your business is exposed to which you don't even realize. For example, let's say Apple Pay stopped working in West Africa.
That's not necessarily a fraudulent transaction. But by the time they figured that out on their platform and went and fixed it, 6 hours have gone by, they have lost 6 hours of revenue from West Africa, and that could be millions of dollars. So risk is not always fraud. Risk analytics is becoming more and more interesting, because, yes, it catches fraud, but it also catches a lot of, you know, lost revenue opportunities. And that's been super interesting as we're seeing how people are now doing risk analytics. Again, going back to the economy and these times, people are thinking about risk very differently.
They're thinking about risk to their revenue. They're thinking about risk to, you know, even the sales ops world. A major risk is, as the economy changes and your sales team is figuring it out, everybody's adapting to the new world. There's a major risk if you find out only after you've closed your quarter that you really, you know, are so behind on where you need to be. You need to know that every week, every month, so that you can adapt. In uncertain times, there's a lot of risk, and we're going into the most uncertain time, I think, for a lot of us.
And in uncertain times, risk is everything. It's sales ops. It's marketing ops. It's finance ops. It's really knowing what's happening in your business and adapting what you're doing every day. So that's been really interesting for us, when people suddenly took the term risk analytics and moved it from fraud detection to all of the other things in uncertain times. It's been fascinating to watch how they're using real time to prevent their business from going off the rails or getting derailed by the economy. So that's been one.
What are other really interesting ones? I keep sharing some of these very interesting ones. I'll share a real customer example, Whatnot. This is so cool to see. This is live streaming. I don't know if you folks have heard of it, but it's a very cool company doing live streaming. So this is ecommerce where people get on live. It's a buy, sell marketplace with live video streaming. So, again, new ways of engaging people in ecommerce. And their use case was so fascinating, because how do you do recommendations for live streams? Because you don't have a lot of history. Right? As the stream is happening, it's becoming more and more popular, and you have to recommend the right one. I need to know that Tobias is really into, I don't know, baseball cards, so I can recommend to you that there's a live stream happening about baseball cards that's becoming really popular right now.
And that kind of stuff is super hard to do, and they've actually published a really cool blog on how and why they moved away from Elasticsearch to Rockset for this live streaming example. Because you would think of Elasticsearch for this use case. Right? That's the default that you go to. But suddenly, they wanna join a lot of data, and Elasticsearch doesn't do joins. So they started using Rockset, and that I thought was a very, very cool, interesting use case: just moving away from Elasticsearch to Rockset and doing a bunch of joins. On the logistics tracking side, my favorite example continues to be, you know, again, this is me, I guess, because whenever I see these underrepresented, I wanna say underrepresented in the data world perhaps, industries like heavy construction. It's not digitized. I love the fact that they're digitizing something like heavy construction infrastructure, building better roads, better bridges, by using real time tracking for cement mixers.
That's really cool. Being able to join data from what your contractors are doing on-site and weather information so you can reroute cement trucks in real time. Think of how much money that saves. And we're not talking houses. You know? We're talking bridges and we're talking, you know, roads and all the heavy construction. Well, that's a massive number of contractors and massive taxpayer money going into it. And we're really proud to save taxpayer money by digitizing heavy construction. So we see a lot of this kind of stuff, which, in the grand scheme of things, Facebook has. You know? Uber already has this. What we want is to bring it to the people who don't have it. The people... I know this is a data engineering show, and I want a lot more data engineers in the industry.
Reality is there aren't enough data engineers to go around. So what about that construction company out in the Midwest that cannot hire thousands of data engineers? How can we give the two data engineers they do have superpowers to go run this kind of massive scale operation with price performance and all the good stuff?
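To make the Whatnot-style recommendation example above concrete, here is a minimal sketch of the join-heavy pattern it describes: combining live stream popularity with per-user interests in a single SQL query. The table and column names are hypothetical, and SQLite stands in for a real-time analytics database purely for illustration.

```python
# Toy illustration of the join-heavy recommendation pattern described above.
# Table/column names are hypothetical; a real deployment would run a similar
# query against a real-time analytics database, not SQLite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stream_events (stream_id TEXT, category TEXT, viewed_at INTEGER);
CREATE TABLE user_interests (user_id TEXT, category TEXT);

INSERT INTO stream_events VALUES
  ('s1', 'baseball_cards', 1000), ('s1', 'baseball_cards', 1001),
  ('s1', 'baseball_cards', 1002), ('s2', 'sneakers', 1001);

INSERT INTO user_interests VALUES ('tobias', 'baseball_cards');
""")

# Join live popularity (events in a recent window) with per-user interests so
# we recommend streams that are trending *and* relevant to this user.
rows = conn.execute("""
SELECT e.stream_id, COUNT(*) AS recent_views
FROM stream_events e
JOIN user_interests u ON u.category = e.category
WHERE u.user_id = ? AND e.viewed_at >= ?
GROUP BY e.stream_id
ORDER BY recent_views DESC
LIMIT 5
""", ("tobias", 1000)).fetchall()

print(rows)  # e.g. [('s1', 3)]
```

The point of the sketch is the shape of the query, not the engine: as soon as the recommendation needs a join between fast-moving event data and another dataset, a document search index becomes an awkward fit.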
[00:54:20] Unknown:
In your experience of working in this space of real time analytics and fast data streams and building and managing and directing the Rockset product, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:54:35] Unknown:
I think the most interesting one I've learned is that almost everything we do, we think of initially as, you know, user workflows or, you know, other projects, but they always come back to price performance. We do a bunch of things on the ingest side. For example, rollups are a great example. Initially, we were thinking, how do we make it easy for people? We start with ease of use and we see the workflows. We see what people are struggling with. And so how do we make it easy for people to do these, you know, constant rollups or real time aggregations? But at the end of the day, it's about price performance.
It's about that compute efficiency. Because, yes, ease of use matters, and, yes, saving people time matters. But as we built out the project, the biggest thing for us was, wait a minute, this is again about compute efficiency, because if we aggregate your data in this way, your queries are much faster, because you just move that aggregation from the query to the ingest in real time. So your queries are much faster, and the compute cost is cut in half. So it always comes down to price performance and compute efficiency in this world. And that, I think, has been the biggest learning and, of course, the most challenging thing we've had to do because, yes, everybody wants real time. You know? I'll give you an example.
10 years ago, if you had asked me, do I want everything shipped to me the next day? I would have said yes with a big asterisk and said, not if it costs me $50 for shipping. I'm not gonna do it. Right? I want it, but I can't afford it. So I would maybe do it for one or two things 10 years ago. But today, I want free shipping for everything, right? Think of the kind of stuff that we expect, free shipping and, you know, next day shipping. You would never have done this 10 years ago, and the only reason it's possible is because you don't have to pay through the nose for it. So that's the big moment for us, which is everybody wants real time for more and more use cases. Everybody wants low latency queries.
Fast is better than slow any day. Right? I mean, who doesn't want fast queries? And when people say I don't need it, what they really mean is I can't afford it. And the only way that you can change the game is by making it so compute efficient that it's even more efficient than doing it in the batchy way. And the minute we do that, which we've already done in a bunch of use cases, the minute you can prove that, now the data engineer can say, I'm giving you double the performance at half the cost. That is a win win for the data team and the consumers.
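As a rough illustration of the rollup idea described above, here is a minimal sketch of moving an aggregation from query time to ingest time. The event shape and the minute-level grain are assumptions for the example, not Rockset's actual rollup configuration.

```python
# Minimal sketch of moving aggregation from query time to ingest time (the
# "rollup" idea). Event fields and the minute-level grain are illustrative
# assumptions, not any particular product's rollup syntax.
from collections import defaultdict

# Ingest path: instead of storing every raw event, keep a running aggregate
# keyed by (metric, minute). Each incoming event updates one small counter.
rollup = defaultdict(lambda: {"count": 0, "sum": 0.0})

def ingest(event):
    key = (event["metric"], event["ts"] // 60)  # minute-level grain
    bucket = rollup[key]
    bucket["count"] += 1
    bucket["sum"] += event["value"]

# Query path: reads touch the small rollup table, not the raw event stream,
# so the same question costs far less compute at query time.
def avg_per_minute(metric):
    return {
        minute: b["sum"] / b["count"]
        for (m, minute), b in rollup.items()
        if m == metric
    }

for i in range(120):
    ingest({"metric": "latency_ms", "ts": i, "value": float(i % 10)})

print(avg_per_minute("latency_ms"))  # {0: 4.5, 1: 4.5}
```

The trade-off is the one described in the conversation: a little extra work on every ingested event buys much cheaper, much faster queries, which is where the price-performance win comes from.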
[00:57:25] Unknown:
For folks who are interested in exploring these real time analytics use cases, what are the situations where Rockset is the wrong choice?
[00:57:32] Unknown:
It goes back to the two things. If your queries are, you know, weekly reports and you only go and run them once a week, it's absolutely the wrong choice. Right? We're not built for analysts doing weekly reports. We're built for developers building data products. So, yes, your analyst might come to you saying, can you make this go faster? And you might be tempted to go put it on something like Rockset because, oh, it's so much faster and cheaper, but not really, because we're not built for those. And what do I mean by we're not built for those? We are indexing your data because we're anticipating that you're gonna have a lot of queries. And if you're only gonna do that one query a week, you know, something like Presto, where you pay through the nose for that one query, is actually the right thing to do, because you don't have a lot of queries and you should maybe be optimizing for something else. Another analogy I like to use is think of retailers. I work with a lot of retailers these days. Why do they have distribution centers as well as retail stores?
A distribution center is actually called a warehouse. So, you know, funny analogy. Why do you need a physical warehouse as well as a store? Because you're optimizing for two different things, and you still need that warehouse to pack away a lot of your boxes for infrequent use. You're only going and pulling a box at a time, infrequently. Once a week, you go to the distribution center and pull out a box to ship to your retail store. That's the perfect use of your warehouse, right, because you're storing things somewhere where the dollar per square foot is very cheap.
Same thing with a data warehouse: your dollar per gig is very, very low. It's a great place to pack away a lot of data for infrequent use. On the Rockset side, that's not what we're built for. So if you're doing infrequent, analyst-style queries once a week, I would recommend not doing it there. On the other side, what is a retail store optimized for? It's optimized for the customer access patterns. It optimizes for the best experience for your customers, optimizes for a lot of foot traffic coming into the store, optimizes for your revenue. That's a retail store. And similarly, we're optimized for compute efficiency. We're optimized for that low latency customer experience where queries never stop coming and data never stops coming. So you cut your compute cost, but, you know, you're making a trade off with your storage cost. And that's why, again, it always comes back to the right tools for the job.
[01:00:15] Unknown:
As you continue to build and evolve and grow the Rockset product, what are some of the things you have planned for the near to medium term, and maybe anything specific to these real time applications?
[01:00:25] Unknown:
Lots of enhancements coming. The biggest one, I would say, is we continue to push the envelope on price performance. We're so excited about some of the benchmarks we recently published, and we're hoping to publish more of these as we're seeing what's actually happening. These benchmarks are hard because it depends on the access patterns of your queries. So while these are still benchmarks, we've been thinking maybe we can actually publish some of the actual queries, with customers' permission, and show that for these access patterns you'll get, you know, a better bang for your buck. So looking forward to pushing the envelope there and publishing some of the actual, very exciting results we're seeing. This whole, like, you know, cut your cost in half and double your performance kind of thing.
The other big one, I'll give you a sneak preview, because my PR team, which is listening, probably is not gonna like it if I announce the whole thing. We're announcing something pretty soon. But it's this whole notion of, as you go to massive scale, now you have multiple use cases. Right? The data mesh is famous for decentralized access, allowing every team to have its own access. So how do you have your cake and eat it too? How do you allow people to process data in real time, but also isolate the different use cases so that each of them can have their own access patterns? And this comes back to compute efficiency. This comes back to compute isolation across different use cases.
So we have some really cool stuff coming here, which gives you your real time data, centralized access to that real time data, but also isolates compute for the different use cases. I'm intentionally not giving you a name for it. I'm not telling you what it is because we're gonna announce it with a big bang, hopefully very soon.
[01:02:13] Unknown:
Are there any other aspects of this real time data applications ecosystem and the ways that you're addressing it at Rockset that we didn't discuss yet that you'd like to cover before we close out the show?
[01:02:26] Unknown:
I think the thing that I keep going back to is when you think about real time data, don't think only event streams. Really think about CDC streams because, I kid you not, this is the most exciting thing that I've seen with transactional databases. I think DynamoDB Streams nailed it. Mongo now has change streams, which is really amazing. Postgres, MySQL, Oracle, I think they've all started to talk about this a lot more. But tap into those CDC streams. It is unbelievable what you can do when you start tapping into your CDC streams. And as you tap into your CDC streams, pay attention to how your downstream system handles updates.
Make sure it's mutable. Make sure it handles upserts, because the minute you have CDC streams, you are gonna get inserts, updates, and deletes. It's no longer insert only. So really pay attention to that mutability. Really pay attention to upserts. And that is very, very interesting in terms of the use cases it unlocks. So the only thing is don't think only event streams and time series data. Think CDC streams, because CDC streams are also real time streams.
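For readers following along, here is a minimal sketch of the upsert handling described above when applying a CDC stream to a mutable downstream store. The change event shape is an assumption for illustration; real CDC formats such as Debezium's differ in detail.

```python
# Minimal sketch of applying a CDC stream to a mutable downstream store.
# Once CDC is in play you receive inserts, updates, and deletes, so the
# destination must upsert rather than append. Event shape is assumed.
state = {}  # primary key -> latest version of the row

def apply_cdc_event(event):
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        # Upsert: overwrite whatever version of the row we had before.
        state[key] = event["row"]
    elif op == "delete":
        state.pop(key, None)

changes = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "pending"}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "shipped"}},
    {"op": "delete", "key": 1},
]
for change in changes:
    apply_cdc_event(change)

print(state)  # {} -- the row was inserted, updated, then deleted
```

An append-only destination replaying the same three events would report a stale or phantom row, which is exactly why mutability and upsert support matter downstream of CDC.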
[01:03:43] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:57] Unknown:
I would say it's still that kind of observability of the end to end, you know, especially in the real time world. What we see is you're seeing more data apps, you're seeing more data products. There's a lot of work happening in terms of, you know, being able to monitor across hops, but still, end to end data latency is hard to monitor. We provide as much information as we can, and still, some of the things that customers are constantly debugging are, 'Wait, what happened on this system? What happened on my source? What happened on my destination?' I think every data engineer has run into this at some point or another.
And if you want to build production systems on this, it's not enough to have low latency. It's not enough to have compute efficiency. You need to have that level of data observability and monitoring across the whole path. And that, I think, is still developing. A lot of work is being done. I think Bigeye and Monte Carlo are great tools out there, but I still think there's a lot of work to be done. So I'd love to see more and more work there, and we are hoping to work more closely with these vendors too.
[01:05:05] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share your experiences working in the space of real time analytics and embedded use cases and the architectural and logistical challenges of being able to build and maintain these systems. I appreciate all the time and energy that you and your team at Rockset are putting into supporting these use cases. So thank you again for your time, and I hope you enjoy the rest of your day. Of course. Thank you so much for having me here.
[01:05:39] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Data Complexity and Quality Issues
Interview with Shruti Bhat: Real-Time Data Applications
Use Cases for Low Latency Datasets
Growth and Adoption of Real-Time Analytics
Architectural Patterns and Operational Complexities
Motivations for Real-Time Analytics
Evaluating Real-Time Platforms
Data Retention and Granularity in Real-Time Systems
Batch vs. Real-Time Analysis
Data Quality and Observability in Real-Time Systems
Innovative Uses of Real-Time Architectures
Lessons Learned in Real-Time Analytics
Future Plans for Rockset