Summary
The process of exposing your data through a SQL interface has many possible pathways, each with its own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Rockset is and your motivation for creating it?
- What are some of the use cases that it enables which would otherwise be impractical or intractable?
- How does Rockset fit into the infrastructure and workflow of data teams and what portions of a typical stack does it replace?
- Can you describe how the Rockset platform is architected and how it has evolved as you onboard more customers?
- Can you describe the flow of a piece of data as it traverses the full lifecycle in Rockset?
- How is your storage backend implemented to allow for speed and flexibility in the query layer?
- How does it manage distribution, balancing, and durability of the data?
- What are your strategies for handling node and region failure in the cloud?
- You have a whitepaper describing your architecture as being oriented around microservices on Kubernetes in order to be cloud agnostic. How do you handle the case where customers have data sources that span multiple cloud providers or regions and the latency that can result?
- How is the query engine structured to allow for optimizing so many different query types (e.g. search, graph, timeseries, etc.)?
- With Rockset handling a large portion of the underlying infrastructure work that a data engineer might be involved with, what are some ways that you have seen them use the time that they have gained and how has that benefitted the organizations that they work for?
- What are some of the most interesting/unexpected/innovative ways that you have seen Rockset used?
- When is Rockset the wrong choice for a given project?
- What have you found to be the most challenging and the most exciting aspects of building the Rockset platform and company?
- What do you have planned for the future of Rockset?
Contact Info
- Venkat
- Shruti
- @shrutibhat on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, for the latest on the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Rockset
- Oracle
- VMware
- Rube Goldberg Machine
- SnowflakeDB
- Protocol Buffers
- Spark
- Presto
- Apache Kafka
- RocksDB
- InnoDB
- Lucene
- Log Structured Merge Tree (LSM tree)
- Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, or you want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data.
He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as Dataversity, Corinium Global Intelligence, Alluxio, and Data Council.
Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events, and to take advantage of our partner discounts to save money when you register. Your host is Tobias Macey, and today I'm interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data. So, Shruti, can you start by introducing yourself? Thanks so much for having us, Tobias.
[00:02:35] Unknown:
My name is Shruti, and I am head of product here at Rockset. Prior to Rockset, I was at Oracle for a little while, where I focused on strategies around IoT and AI. And prior to that,
[00:02:48] Unknown:
I spent some time at VMware and another startup. And Venkat, how about you? Hi, thanks for having us, Tobias. My name is Venkat. I'm the founder and CEO of Rockset. Prior to starting the company, I was managing online data infrastructure at Facebook from 2007 to 2015, and helped scale the online data back end from roughly 40 to 50 million monthly active users to about a billion and a half monthly active users. This was the set of systems and services responsible for storing and serving all Facebook user data, and all the user-facing Facebook products were built on top of it. The philosophy there was to make it easy to build data-driven products within Facebook. And now with Rockset, we carry the same mission to make it easy to build data-driven products and services, but for anybody building applications in the cloud. And the requirements at those different levels of scale is definitely
[00:03:43] Unknown:
interesting. I'm curious how the demands on the infrastructure scaled along with that user growth.
[00:03:52] Unknown:
If we had scaled linearly from where we started, I don't think Facebook would have been a profitable company. So it had to scale sublinearly. But the thing we used to say a lot is that in those 8 years, the user growth was 2 orders of magnitude, but the demand on the data infrastructure was probably 3 to 4 orders of magnitude more, because the product got a lot richer and lots and lots of other features were launched, and that added a lot more demand on the online data management system than user growth alone. I used to joke that the most predictable thing about Facebook was actually user growth; it's the product launches, and the impact a new product launch would have on the online data infrastructure,
[00:04:41] Unknown:
that were the harder problem: harder to predict, harder to capacity plan, and everything else. So after leaving Facebook, you ended up starting Rockset. And I'm curious if you can describe a bit about what it is that you have built there and your motivation and inspiration for creating it. Yes.
[00:04:59] Unknown:
So it didn't happen immediately. I spent about a year really trying to understand what the problems worth solving in the real world are. As much as Facebook was a great learning experience, the real world is a lot different. The complexities are different, the challenges are different. And I spent a lot of time talking to people; I didn't have a company, I didn't have a product, so it wasn't really a sales pitch. I just started talking to a lot of real-world companies out there, the banks I do business with, the hospitals I go to when I get sick, companies like that that are way outside the Facebooks and the Googles of the world, and just understanding what problems are worth solving. That journey is what gave both the insight and the motivation to go do something about it. And the motivation really came from how complex it was to build data-driven products and services in the current age we live in. On one end of the spectrum, the previous generation of technologies, whether it's transaction processing systems or warehouses and data lakes or what have you, have taken the previous generation of problems and gotten to a pretty good solution that is good enough for those use cases.
But more and more, that is not where companies were struggling. They were struggling with building operational analytics, real-time dashboards, and real-time insights for their operational teams. And in order to build that, I used to joke that almost everybody was building a Rube Goldberg machine of sorts, duct-taping together multiple disparate data management technologies in order to build those operational dashboards and APIs for real-time data processing. So that's what started us on this journey and really motivated us to figure out: in this day and age, what would a data management system look like that is built ground-up to exploit the cloud, built ground-up for this new age of data management problems? And what is the simplest and most powerful product we can build that makes it really simple, easy, and accessible for a lot of people to go make their companies and their products a lot more data-driven? That, at a high level, was our motivation to go do something in this space. So at the 10,000 foot view, what you're doing in some ways could be
[00:07:35] Unknown:
equated to what other projects such as Snowflake are doing, as far as using cloud technologies to scale compute and storage independently, ingesting data, and then performing analytics on it in SQL. So I'm curious if you can dig a bit more into the types of use cases that Rockset is particularly suited to, and some of the areas where you would use it in place of a cloud data warehouse or a data lake approach? Sure, let me dig into that a little bit. I think
[00:08:09] Unknown:
the most common use cases we see are actually very different from something that a data warehouse can handle. For example, real-time dashboards, or a developer API, or if you're trying to build a microservice on your data. It's incredibly hard to process a lot of your data in real time for two reasons today: first, the data is always flowing; second, the shape of that data is more fluid. What I mean by that is you have more streaming data and you have more semi-structured data. So how do you possibly process that in real time and build microservices or dashboards on it? But I can give you a more concrete example, and I'm actually going to take a very real one here. Say you're building a gaming platform that uses DynamoDB for transactions. Right?
And you want to help all your game developers by giving them APIs so they can monetize their games. They can ask things like: hey, who's using a certain type of sword in this game right now, and what have they purchased in the past? Very quickly, these queries are asking real-time questions but also trying to join that with some of the historical data. But they need to know it right now so they can monetize their game. You're talking about complex queries involving joins, subqueries, and aggregations, but you also need them processed really fast. Now this is not something that a data warehouse can handle
[00:09:45] Unknown:
simply because the kind of data latency and query latency you're looking at would not be feasible. If I could add something to that: warehouses are very good at batch analytics. But when it comes to real time, and all the examples that Shruti pointed out, in lots of cases you end up abusing a data warehouse or a transactional system to be able to build those kinds of architectures. We've built Rockset ground-up to embrace that, both in terms of the speed at which data is flowing and how fluid the shape and structure of the data can be as it evolves over time, and we keep up with all of that on the fly for you. So that's where we largely differentiate from existing
[00:10:29] Unknown:
products out there. And so with Rockset, it seems that the primary use cases it's aiming at are situations where you need to run smaller queries, but at faster rates than what you would be doing with a data warehouse or a data lake; where you're doing small-scale aggregates, on the order of maybe megabytes to gigabytes as opposed to terabytes to petabytes; and where you want to do it on data as it's arriving, rather than having to ingest it and then do some measure of cleaning on it? Yes, I think
[00:11:04] Unknown:
maybe an introduction into how somebody might use Rockset would be helpful here, just so we can get some clarity on what that would look like. Say you're using DynamoDB for transactional data, or you're using Kafka for a lot of real-time event logs, and it's coming in real time. Let's say that data is coming in as JSON or protobufs or some other semi-structured format. Now, to use Rockset, you get an account, you point us at these 2 datasets, whether it's a DynamoDB table or a Kafka topic, and in real time it will sync all of the data and transform it into a fully indexed, really fast, fully structured SQL table. You don't need any other pipelines or structural transformation logic in order to achieve that. And we will automatically use, in this particular example, a DynamoDB stream, or tail Kafka topics, to keep itself in sync. The shape of the data can evolve, new fields can get added, fields can change type, and our SQL representation of that in Rockset will automatically keep itself in sync. On top of that, we have a distributed, very fast SQL query engine, so you can instantly start asking really complex questions, including joins and aggregations and what have you, just using SQL. You can now go build your app or your dashboard or your API just using SQL. Now, in terms of batch analytics, yes, you're right. If your primary job is to store a lot of data and do occasional query processing on it, Rockset is not a good fit. You want to use a data lake, maybe use an open format to store it, and occasionally use a batch compute system like Spark, or maybe even Presto-type data management systems, to do that. But if you want to build a system of engagement on this data, then you need a powerful system that gives you the power of SQL. You need it to be fast, so you can build interactive applications and APIs on top of it. And it needs to be happening in real time, so you're not working on historical data but on live data that is coming from your system of record or from your real-time streams. Those are the use cases where we shine. And so we can do thousands of queries per second.
The speed of the queries works differently because of our architecture, because we index everything. The speed of your queries in a typical warehouse would be dependent on the size of your dataset. But because we index everything, the speed of your queries in Rockset only depends on the selectivity of the query. Your dataset can still be massive, but if you compute the WHERE clause of your SQL query and then do, let's say, a COUNT(*) on that result, that is what determines the speed of the Rockset query, as opposed to other warehouses where it really depends on the size of the entire dataset being stored. And so, yes, we can manage tens to hundreds of terabytes of data and still do sub-second query processing on it. But again, the query speed really depends on the selectivity of your query as opposed to the total size of your dataset.
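To ground that claim, here is a minimal sketch of what a selective query might look like against Rockset's REST query endpoint. The regional endpoint URL, header format, response shape, and the `game_events` collection are assumptions made for illustration based on Rockset's public API documentation, not details from the conversation:

```python
# A minimal sketch, assuming Rockset's documented REST query endpoint.
# Substitute your own API key, region, and collection names.
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
ENDPOINT = "https://api.rs2.usw2.rockset.com/v1/orgs/self/queries"

# Because every field is indexed, the latency of this query tracks the
# number of rows matching the WHERE clause, not the collection's total size.
sql = """
SELECT COUNT(*) AS active_sword_users
FROM commons.game_events
WHERE item_type = 'sword'
"""

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={"sql": {"query": sql}},
)
resp.raise_for_status()
print(resp.json()["results"])
```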
[00:14:16] Unknown:
In fact, my favorite customer quote along those lines was from somebody who said: I did everything that's state of the art, and yet the best I could do was 30 minutes of latency and multiple hops. Right? That right there tells you what state of the art looks like. This was an ecommerce company that uses Kafka for streaming data, generating millions of events per day from their web and mobile apps, and they need to ask some complex questions in real time. Like: are there fraudulent transactions? What are my customers buying right now? Did our competitor just drop their price? They're tracking all these metrics, and now all they have to do is send that raw data from Kafka to Rockset
[00:14:58] Unknown:
and answer all these questions just using SQL in real time. So that's the big difference. So in a quote, unquote typical architecture that somebody might be running, it sounds like some of the components that Rockset replaces are the ELT or workflow engine and the online analytical processing engine that you might be doing some of these queries in, and then possibly a few other components. I'm wondering if you can talk through some examples of pieces of infrastructure that people have been able to entirely do away with by moving to Rockset, and some of the ways that it fits into the rest of their overall architecture?
[00:15:39] Unknown:
So I think it would be nice to think about this at a high level: what does a modern data architecture look like in the cloud? You would definitely need a transactional data management system for your system of record. You would need something like Kafka for your real-time event processing, which probably drains into some kind of a data lake in the cloud, Amazon S3 or GCS or what have you. And you would use systems like Presto or Spark for batch compute on top of that data lake. And then, on top of the data in your real-time stream, on your lake, or from your system of record, you are going to build lots of systems of engagement: for your marketing operations team, sales operations, support operations, security ops, even dev ops. That is the set of applications and use cases that we shine at. Typically, what we end up replacing there, for one set of use cases, is something like Kafka landing data into Elasticsearch-like systems; if all you need is a single-table query on some real-time streams, then you would use something like that. In other use cases, we also see complex ELT or ETL scripts periodically dumping data into something like Postgres, where they don't really need the transactional capability of Postgres; they just need a good data serving system for their systems of engagement. And that is where all the hops and the complexity and the non-repeatable nature of the solutions come into the picture, because they can't really keep up with both the real-time data and the fluidity and structural changes those streams are going to see over a period of time. So those are the things we end up replacing. A lot of the time it's 2 or 3 technologies cobbled together, but a lot of the time the serving system at the end of those pipelines ends up being either Elasticsearch or something like Postgres, which is what we end up replacing in a lot of places. So can you talk through the way that Rockset itself is implemented in order to be able
[00:17:56] Unknown:
to handle these various use cases, and to be able to supplant those different pieces of infrastructure that people have been relying on? Sure. Our CTO has written a lot
[00:18:06] Unknown:
of good blog posts about it. We call it the ALT architecture, for aggregators, leaves, and tailers. The fundamental contribution of this architecture is how to exploit the fluidity of the hardware in the cloud. In the aggregator-leaf-tailer architecture, the tailers are the components that ingest data and keep the Rockset collections and datasets in sync with whatever the source of the data is. They automatically scale based on the amount of data flowing through the system, and they scale independently of the leaf nodes, which are where both the data indexing and the data storage happen. The leaves scale independently of the other parts of the system, based on just how much data is being managed within a particular account in Rockset. And then the aggregators are, again, distributed in nature, and they do the distributed SQL query processing on top of the data stored in the leaves.
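As a rough mental model of that independence, here is a deliberately simplified sketch, not Rockset's actual scheduler, in which each tier is sized off its own load signal:

```python
# Toy model of the aggregator-leaf-tailer (ALT) split described above.
# Each tier scales on its own metric; the point is the independence.
import math
from dataclasses import dataclass

@dataclass
class TierScaler:
    name: str
    signal: str              # the load metric this tier watches
    capacity_per_node: float # how much of that metric one node handles

    def nodes_needed(self, observed_load: float) -> int:
        # At least one node; otherwise ceil(load / per-node capacity).
        return max(1, math.ceil(observed_load / self.capacity_per_node))

tailers = TierScaler("tailers", "ingest MB/s", capacity_per_node=50)
leaves = TierScaler("leaves", "indexed GB stored", capacity_per_node=500)
aggregators = TierScaler("aggregators", "queries/sec", capacity_per_node=200)

# A burst of ingest grows only the tailer tier; storage growth grows only
# the leaves; a spike in query traffic grows only the aggregators.
print(tailers.nodes_needed(400))      # 8 tailer nodes
print(leaves.nodes_needed(2_000))     # 4 leaf nodes
print(aggregators.nodes_needed(350))  # 2 aggregator nodes
```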
There's a lot more we have written about it in the blog posts by our CTO, Dhruba. I encourage your listeners to go check that out at rockset.com/blog. And we build our own schedulers that auto-scale every component independently of the others, based on both the demand and the SLAs that our customers are asking from us. And on the storage layer,
[00:19:40] Unknown:
I'm curious what you're relying on to allow for the speed and flexibility that's necessary in the query layer: for being able to serve those queries with such low latency as data is ingested, and with the amount of flexibility necessary to cover such disparate data types and the semi-structured nature of the information that you're ingesting?
[00:20:04] Unknown:
Yes. This is where we spend a lot of time. I could probably talk for another hour on all the technical innovations that have enabled a system like Rockset, but at a high level I would start with a couple of things. Number one: one technology that we use a lot is a storage engine called RocksDB, which Dhruba helped create, and I was part of the team. In our previous life, when we were working together, we worked on this storage engine called RocksDB that's quite popular as a high-performance storage engine. The best way to understand this is: InnoDB was the most popular storage engine for MySQL.
Lucene is kind of like the storage engine for Elasticsearch. And for Rockset, RocksDB, and an extension of RocksDB which we call RocksDB Cloud, is the storage engine. There's a lot of innovation and work that has gone into RocksDB, and then we have made it work much, much better in the cloud with RocksDB Cloud. So that is a very core component that allows us to build a flexible architecture. We've also written about this in our blog posts; I welcome your listeners to go check that out. The second thing I would say is our type system.
Almost every other SQL-based data management system out there is statically typed and strongly typed. I would say our type system is strongly typed, but it is dynamically typed, which is why we don't need people to describe the shape of the data to us ahead of time before they can start ingesting data into Rockset. And this is why we are able to put a very fast SQL layer on any NoSQL data sources or data formats that you have. And how do we do that? Our type system is inspired by how the language runtimes of dynamically typed programming languages look.
You know, whether it's HHVM or other JIT runtimes like V8, runtimes that are able to produce very efficient execution even for dynamically typed programming languages. Our execution engine and our internal data representation, a data structure we call ivalue, work very much the same way, and Tudor from Rockset has written a pretty good blog post about that too. That allows us to represent the data very efficiently, even though we don't know the type of the data ahead of time, and do very efficient data processing on it.
And our indexing systems have also come a long way. Our indexing algorithms are a lot more inspired by the distributed search technologies that we helped build in our past lives than by columnar data management systems and SQL-based data management systems. So at a high level, we stand on the shoulders of giants. We have drawn a lot of amazing technical ideas from search-based systems, from the runtimes of dynamically typed languages, and from distributed SQL optimizers and distributed SQL query execution strategies.
And I think we're able to bring the best of some really awesome ideas from all these different disciplines into a single product with Rockset, which is why we're able to provide a very powerful abstraction: fast SQL over any of your data. That is probably the simplest way to present a lot of these innovations to all of our customers.
[00:23:52] Unknown:
And in terms of failure modes and recovery, I'm curious how you approach distribution and balancing of the data and ensuring proper replication. And then, for recovery, I'm curious what you're relying on to replace failed nodes and redistribute data.
[00:24:13] Unknown:
Excellent question. So RocksDB Cloud, which I briefly mentioned, is the key ingredient here. I'll give you a very high level view of how it works, and then an example of what happens on a failure, and I think that will answer your question pretty well. RocksDB is a storage engine; it's a log-structured merge tree implementation. It's open source, and there's a lot of literature around it that you can go read. RocksDB Cloud is an extension of that for the cloud, where all the data persisted on a local instance, let's say an EC2 instance in AWS, is also backed up to S3 as an S3 object. So every time RocksDB creates a new SST file locally, RocksDB Cloud's implementation will also make a copy in S3. What that allows us to do is cleanly separate the performance layer of our back end from the durability layer. The queries are fast because there is a full copy of the data on local SSDs, which are very, very fast, so we are able to provide very high performance queries on these distributed datasets being managed in Rockset, but we also have a durable copy of every one of those SST files in S3. So say a pod fails or a node fails, and a different node picks up that particular shard or dataset: it opens the RocksDB Cloud instance, realizes that local SST files are missing, and automatically downloads the latest backups kept in S3 to the local instance before it opens the database and serves it. That gives us a lot of fluidity, where recovery is a very normal part of the process. We can scale up and down and deal with node failures, pod failures, and disk failures very naturally, because all we have to do is kill that node, without having to worry about any recovery, and start a new replica elsewhere. And even if you have 2 replicas and 1 replica is down, we can bring up another replica without adding any more load to the current replicas serving traffic, because the downloads happen from S3, completely isolated from production traffic. So RocksDB Cloud is a very simple idea on top of RocksDB, but it gives us so much fluidity in terms of resource management; durability, recovery, and failover semantics are just so much cleaner and simpler, which allows us to fluidly scale the infrastructure up and down in the cloud more than any other distributed data management system that I've had the pleasure of using prior to building Rockset.
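A toy sketch of that idea may help, with an in-memory map standing in for S3. This is not Rockset's actual code, just an illustration of the flush-to-S3 and recover-from-S3 flow described above:

```python
# Conceptual sketch of the RocksDB Cloud idea: every SST file written
# locally is also backed up to object storage, so a replacement node can
# rebuild its local state from S3 alone, without touching live replicas.
class ObjectStore:
    """Stand-in for S3: a durable map of object name -> bytes."""
    def __init__(self):
        self._objects = {}
    def put(self, key, data): self._objects[key] = data
    def get(self, key): return self._objects[key]
    def keys(self): return list(self._objects)

class RocksDBCloudNode:
    def __init__(self, store: ObjectStore):
        self.store = store
        self.local_ssts = {}  # fast local SSD copy, ephemeral

    def flush_sst(self, name: str, data: bytes):
        self.local_ssts[name] = data  # queries served from local SSD
        self.store.put(name, data)    # durability comes from S3

    def recover(self):
        # A fresh node downloads whatever it is missing from S3,
        # adding zero read load to the surviving replicas.
        for key in self.store.keys():
            if key not in self.local_ssts:
                self.local_ssts[key] = self.store.get(key)

s3 = ObjectStore()
node_a = RocksDBCloudNode(s3)
node_a.flush_sst("000001.sst", b"sorted key-value runs...")

node_b = RocksDBCloudNode(s3)  # node_a "fails"; node_b takes over the shard
node_b.recover()
assert node_b.local_ssts == node_a.local_ssts
```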
[00:27:03] Unknown:
And another thing that I noticed while preparing for this interview is that the whitepaper available on your site mentions that you're relying on Kubernetes for deployment and scaling of your microservices, in order to remain cloud agnostic. And yet you're saying that the durability and reliability of the storage layer relies on cloud object storage. So I'm curious how the cloud agnosticism has been playing out for you, in terms of being able to interface with different cloud object stores, but also, for the cases where somebody might be operating across different cloud environments,
[00:27:56] Unknown:
how you approach latency mitigation for those users? Yeah, it's a great question. The way Rockset is built, it can run on any cloud, you're right, because of how we use Kubernetes. But right now, it lives in AWS. The key thing to remember is that we ingest and index all the data. So even if your data lives in another cloud, or even on-prem, we will sync with your data source and ingest all that data. And then, most importantly, we don't federate back to the source. Right? We terminate all the queries, and this is why, no matter where your data lives, once Rockset has ingested it, we will give you very reliable query performance.
And this is very important. Right? If you're a developer building microservices, you really need that kind of reliable query performance. We cannot afford to federate back to the source. Like you mentioned, if your data lives in a different cloud or on-prem, it would take us forever to serve those queries. So that's the most important thing. And our indexing, Venkat touched on this a little bit, is very interesting in that we do search indexing as well as columnar, and we have a document index as well. That's what gives us the very, very low query latency.
Now, one thing you might ask me is: well, what are you going to do if you have flash floods? Sure, we're ingesting data wherever it lives, but what if it's, say, a Kafka stream and you have a flash flood of data coming in? Again, as Venkat was saying, we scale our ingesters separately from everything else. So if the SLA is a few seconds of data lag, say the 2 seconds of data lag that you expect from DynamoDB or from your Kafka stream before you can start querying it in Rockset, all we have to do is make sure that we scale up our ingesters to handle those flash floods and still meet your data latency requirements.
So that's the way to think about it. Currently, we live in AWS. That hasn't stopped us from ingesting data from Google Cloud Storage; we have customers using Google Cloud Storage, and we have a native integration with GCS
[00:30:21] Unknown:
that customers can use right away. Can you talk a bit more about the indexing strategy that you're using to allow for optimizing so many different query types, such as search and graph and time series, which I know you highlight in your documentation, and any challenges or complications that you've run into in the process of enabling those different use cases?
[00:30:43] Unknown:
So yes, I can talk about that. Out of the box, we index all the fields, and we organize all incoming data into primarily 3 general-purpose indexes: an inverted index on all those fields, which you can also think of as a search index; a columnar representation of all those fields; and a row-based representation of all those fields. There are a few implications of that. The biggest positive is that our optimizer understands these different strategies. So if you run a query that says, find me the average age across the entire dataset, it understands that it should probably just do a very quick columnar scan, which is the fastest way to retrieve all the data it needs to process the query.
But if instead you have the same query with a very complicated WHERE clause, where you're only looking at certain records with certain fields having certain values, then the optimizer will automatically use the search index to very quickly go from, let's say, the billions and billions of records you have to the few hundred thousand that match, and then, once the relevant records and fields are retrieved using the search index, do the post-processing in parallel using our distributed SQL aggregation engine.
So the combination of our indexing strategy, built hand in hand with our query optimizer, is really the key essence of making this work. Today, we build 3 general-purpose indexes, and we just added geo indexing as a new feature. New types of indexes are something we're always thinking about, along with better and better automatic optimization strategies that exploit particular patterns in the data and in the queries, so that we can continue to provide really good performance across a large spectrum of SQL queries for our users.
And the one other really important aspect of all of this is time series data. That is a first-order concept in Rockset: every record that comes in has a special field that we call event time. By default, it will be the time at which that particular record was inserted into Rockset, but you can instead supply it from any other field that might exist in your dataset. And this event time is an optimization that is built into our indexing strategy.
So whenever you have a query that has a complicated WHERE clause on event time, or an ORDER BY on event time, let's say descending because you want the most recent records of all the ones that match, or you want to do aggregations across this particular field, that is, out of the box, really highly optimized in our query processing. You should expect very, very fast performance for time series data, because our indexing strategy takes the time of every record as a first-order indexing concept, and our optimizer is aware of that concept. So yeah, at a high level, our indexing strategy combined with our custom-built SQL query optimizer is why I think we're able to provide really good performance for such a wide spectrum of query workloads.
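Here is a deliberately tiny sketch of that "index everything three ways" idea. It is purely illustrative; Rockset's real structures live inside RocksDB, and field names here are invented:

```python
# Toy converged index: each ingested document lands in a row store, a
# columnar store, and an inverted index, so a planner can pick whichever
# representation is cheapest for a given query.
from collections import defaultdict

row_store = {}                     # doc id -> full document
column_store = defaultdict(dict)   # field -> {doc id -> value}
inverted_index = defaultdict(set)  # (field, value) -> matching doc ids

def ingest(doc_id, doc):
    row_store[doc_id] = doc
    for field, value in doc.items():
        column_store[field][doc_id] = value
        inverted_index[(field, value)].add(doc_id)

ingest(1, {"user": "ada", "item": "sword", "price": 9})
ingest(2, {"user": "bob", "item": "shield", "price": 5})
ingest(3, {"user": "ada", "item": "sword", "price": 12})

# Highly selective predicate -> the inverted index jumps straight to the
# handful of matching rows, regardless of total collection size.
matches = inverted_index[("item", "sword")]

# Full-collection aggregate -> a columnar scan of one field is cheapest.
prices = column_store["price"]
avg_price = sum(prices.values()) / len(prices)

print(sorted(matches), avg_price)
```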
[00:34:12] Unknown:
Can you talk through a bit of the user experience of getting started with Rockset and integrating with it? And then, because Rockset is able to handle a number of the different systems that a data engineer might otherwise have been managing, what are some of the ways that you've seen them use the extra time they gain by implementing Rockset?
[00:34:34] Unknown:
Great question. In almost every case, people can very quickly, within the first few minutes of starting to use the platform, go from, let's say, a DynamoDB table or a Kafka topic to a fully schematized and fully indexed SQL table. As I said, all you need to do is go to rockset.com and create an account, it's self-service, then give us read permission using cross-account role-based access or IAM user-based access if you are using DynamoDB; or, if you're using Kafka, download and install a Kafka Connect plugin to connect your Kafka cluster to Rockset using secure API keys and what have you.
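For sources without a managed connector, documents can also be pushed in directly over Rockset's REST write API. The sketch below assumes the endpoint path and payload shape from Rockset's public API documentation; the workspace, collection, and key are placeholders, and with the managed integrations described above Rockset handles this syncing for you:

```python
# A minimal sketch, assuming Rockset's documented write API. Raw JSON
# documents are appended to a collection; no schema is declared up front.
import requests

API_KEY = "YOUR_API_KEY"  # hypothetical placeholder
URL = ("https://api.rs2.usw2.rockset.com"
       "/v1/orgs/self/ws/commons/collections/game_events/docs")

docs = [
    {"user": "ada", "item": "sword", "price": 9},
    {"user": "bob", "item": "shield"},  # the shape can drift freely
]

resp = requests.post(
    URL,
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={"data": docs},
)
resp.raise_for_status()
```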
And in a matter of minutes, you have records flowing into a SQL table. And as Shruti pointed out, even if your DynamoDB table is multiple terabytes, really, really large, we auto-scale ingest, and still within a few hours you should expect it to turn into a very fast SQL table on which you can ask any question via SQL. And so what we actually see with all of our customers, especially data engineers and developers, is that they get to move a lot faster. What used to be 3 people working for 3 weeks on a particular project now happens in a single day. Data engineers are now spending a lot more time with their business stakeholders, understanding their needs and giving them access to the data they need, from the correct data sources, in the correct format. And they're spending a lot less time wrangling data pipelines, scaling them, monitoring them, and doing a lot of DevOps on their data pipelines.
And more than that, they spend less time fixing the ad hoc scripts they had written for structural transformations, to take NoSQL data and load it into SQL-based data management systems. And then the developers using us to build search APIs and other microservices, or serverless data APIs, on top of their data: they now have a fully isolated, live SQL replica of their DynamoDB table or their Kafka topic. And they're able to build microservices without having to worry about affecting production transaction processing systems.
They're able to move a lot faster, and not worry about scaling things or building pipelines to keep everything in sync. So again, they can spend more time with product managers and their end users, make sure the product they're building is actually useful and producing the results they want, and build a more engaging, useful product for their end customers. It's just giving a lot of time back to our immediate users, who are developers and data engineers, so they get to spend that time adding value to their business and to their product,
[00:37:29] Unknown:
as opposed to running data pipelines and fixing ad hoc ETL scripts. And what are some of the most interesting or unexpected or innovative ways that you've seen people using the Rockset platform?
[00:37:39] Unknown:
Oh, gosh. One of our favorite ones, actually, is a state government heritage program. It so happens that we're all nature lovers here at Rockset; in fact, if you go to Venkat's house, you'll find a wall full of posters of national parks. And here's this heritage program in a state government that's actually helping preserve natural species in several states, through a real-time app that serves as an animal and plant field guide. You can go in, you can search, filter, and drill down on different species based on different criteria, and that's helping them find species in the wild and preserve some of the natural species out there. So that just gives us the warm fuzzies. We really like that use case.
And then another interesting one, apart from the types of use cases we've described, is this other company, an automated checkout company with an AI-based system. Interestingly, what they're doing is helping their developers move faster. The data engineering team is saying: hey, if I have to give my developers access to my database in production every time they want to go develop and test a new feature for our core product, it's going to take us a lot longer and slow us all down. So instead, they use Rockset as a way to give APIs to all of their developers to just play with the data, experiment faster, and develop their feature on it, and they don't have to worry about how that would impact the database in production.
So we thought that was a very interesting, cool way of accelerating your developer lifecycle, and of how data engineers and developers start working together when you give them new tools in their toolkit.
[00:39:41] Unknown:
Yeah, it's definitely interesting to think about the possibility of mirroring application databases into Rockset, for making it easier for developers to gain access to them, but also for being able to expose them to business users or business intelligence engines, so that you don't have to add read replicas or transform them into some sort of flat file representation to load into a data lake technology. And I'm curious if you've heard any feedback from people using it for that type of implementation.
[00:40:15] Unknown:
That's almost a perfect summary of the most common use cases we see: real-time streams, not just systems of record as in the example Shruti talked about, but a combination of the two, and then giving that access out. So you're building operational dashboards for your business teams, or giving it to your developers so they can build serverless data APIs, let's say using AWS Lambda or other function-as-a-service offerings in the cloud, so they can build very fast APIs on top of that data, combining joins and aggregations and what have you across both systems of record and all the behavioral data coming in from event-based systems like Kafka. So that's a very good summary of the most popular use cases we see from our customer and user base right now. And then, in terms of your experience of building and growing the technical and business aspects of Rockset, I'm curious what you've found to be some of the most challenging and exciting aspects.
It's a great question. What's most challenging is probably also what's most exciting, because it's fun to tackle hard challenges. For the founding team, in terms of technical challenges, one of the things is that we've had the experience of both using and building lots of large-scale data management systems in our past lives. But when it comes to building those data management systems in the cloud, as an industry I think we're still scratching the surface of the possibility here. Because if you think about what a data stack should look like in the cloud, whether you rent 1 machine for a 100 hours or a 100 machines for 1 hour, it costs you exactly the same.
And so, for the same price, things should be infinitely faster, relatively speaking at least, in the cloud. No real software stack can actually do that yet, which basically means the software stack has to be reimagined and rebuilt for the cloud, especially when it comes to data management. And I think the eventual data management in the cloud will be entirely serverless, end to end. That is also one of the biggest value propositions of Rockset, because we are an entirely end-to-end serverless data management stack. You don't have to think about how many nodes or pods you need, or how much RAM and CPU. You simply point us at your data and start writing SQL, and we automatically scale. You pay for what you use; there are no idle clusters, and there's no capacity planning or provisioning that you need to worry about. Building that is very challenging, but it's very exciting to see it come to light, and to see our customers leverage all the hard work that has gone on behind the scenes to very quickly build amazing applications and APIs using Rockset. So I would say that captures it: building a serverless data management system for the new age is really challenging, but also super exciting when you see it come to fruition. And then looking forward, what are some of the
[00:43:31] Unknown:
new features and improvements that you have planned for the future of Rockset?
[00:43:37] Unknown:
It's a very, very long list. Fundamentally, when we think about the ecosystem we need to work with, there are 2 major sets of things. Right? Where is your data, and what kind of applications or systems of engagement are you building? On one end, we want Rockset to be able to have a live sync and a live connector with wherever our customer's data happens to reside, whether it is a SQL transactional database, a NoSQL transactional system, or real-time streams, in 1 cloud or multiple clouds, Kafka or some other system like that, or a cloud SaaS like Salesforce, where your data is all locked up. We want to make it very, very easy for people to create a live SQL table replica of any of that, without any pipelines or ETL or anything like that. So that is 1 set of things that we'll always keep expanding and growing.
And then on the other side, there are a lot of different systems of engagement and internal tools that people want to build within their organizations. We see our customers connecting Rockset with things like Retool for building internal tools on top of Rockset. We see people integrating us with AWS Lambda for building serverless data APIs, and we want that workflow to be a lot better and smoother. And there are a lot of JDBC- or REST-based, SQL-based dashboarding tools like Tableau and Redash and Superset, and more tools like that, whose ecosystem we also want to be working with. In fact, we recently announced our integration with Grafana, so now Grafana dashboards can be powered by Rockset. So yeah, in both these dimensions, we see more and more coverage and more and more work that we happen to be doing. And in terms of our own back end, as I said, we're just scratching the surface with building 3 general-purpose indexes.
There's a lot of interesting research happening right now around learned indexes and other indexing strategies that use advanced ML and deep learning ideas to build the most appropriate index for a given dataset. In some cases, that works even better than the kinds of indexing strategies we have in place. So how we can use more ML and deep learning techniques to build a better optimizer and a better indexing strategy is also something we're constantly thinking about.
[00:46:26] Unknown:
I think our vision there, really, is: how do we make it easy to build more data-driven products and services? Venkat always likes to say that back when he was at Facebook, the philosophy was that no developer should ever be bottlenecked on their infrastructure. And most of our customers, when we talk to the data team, really believe in that vision; they don't want the infrastructure to ever be a bottleneck. So where we're going is: you should never have to worry about the source of the data, you should not have to worry about the shape of your data, and you should not have to worry about what type of queries are going to come in. Those 3 things are really the 3 main questions that probably keep data engineers up at night. Right? So we're saying: hey, if we can take these 3 things and make them really, really simple, and give you the kind of performance and scale that you expect, then data engineers should be able to go do more valuable things for their business, versus just babysitting their infrastructure.
[00:47:29] Unknown:
And one thing that you mentioned there is some of the research being done to enable these more complex and interesting index types. I'm curious, being a startup, how much of your time and resources you're able to dedicate to research, or whether you're currently in the mode where you're going full force on product development and iterating on improvement cycles?
[00:47:52] Unknown:
I would say the majority of our time, almost the overwhelming majority, is about building the product that our customers want, that our immediate customers want and our potential customers want. We keep in touch with the literature, but it's more applied research, as opposed to research in isolation. We stay aware of all the ways we could solve a problem, and if, for a particular use case or a particular feature we're developing based on what the product needs and our customers need, a particular approach that comes out of the literature is the right thing, we'll go build it. But it's a lot more applied, is how we think about it. We're not necessarily trying to invent new learned indexes. Once in a while we experiment with them, let's say in a hackathon or something like that: take a benchmark, take a workload, and see if we're able to replicate the results and understand them better, so that for the right feature, the right customer, or the right use case, we know there's another trick up our sleeve that we can employ.
So we keep in touch with that; we're very well connected with that ecosystem, and it's something we're always keeping an eye on. But I don't think it makes it into our product roadmap until we see a clear need,
[00:49:14] Unknown:
for either the product or for one of our customers' use cases. Are there any other aspects of the work that you're doing at Rockset, or the use cases that you're targeting, that we didn't discuss yet which you would like to cover before we close out the show?
[00:49:28] Unknown:
Yeah, I think one thing is: when we say SQL on NoSQL, it's a kind of catchy way of saying it, but really, what is it? It's saying that we are not a transactional system. Right? We're able to do all these things because we're giving up on transactions. So if you think about it, we're sort of an operational analytical database in some sense; we call it a search and analytics system. But what does that mean? It means, a, we are not a transactional system, so we're not your system of record. And, b, we're not really a warehouse; we're not for historical, batch-type analysis of petabytes of data. Our sweet spot is millisecond queries on fairly large datasets, terabytes of data.
And what does that mean for our use cases? It always comes back to: how do you build your real-time dashboards, give developers APIs, and build your microservices without that 30-minute lag and multiple hops? So it all comes down to, for us, the 2 core tenets that we always think about: speed and simplicity. How do we give our customers speed and simplicity? That's literally the one thing we always ask ourselves before we make any new feature decisions. Alright, is there anything else that we should cover?
[00:50:48] Unknown:
No. I thought that was pretty good. From your perspective, is this,
[00:50:52] Unknown:
is this everything? Have you gone through all the questions? Do you have any other questions for us? No, we ran through all our questions. So, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. What's the biggest gap? I think
[00:51:17] Unknown:
the biggest gap, as you said, is in the space where lots of data is coming from different data sources and people are building systems of engagement on it. The gap that we see, and that we hope to fill, is where a lot of the complexity lies, where all the Rube Goldberg machines are built. There's just not one good solution built ground-up for that set of use cases. All databases, on both ends of the spectrum, are very good at managing data where they are the source of truth, but not when the data is coming from elsewhere, or getting synced from elsewhere. That, I think, is the fundamental gap we see in data management systems. The older generation, the previous generation of technologies, are very good at transaction processing or analytical processing at scale. But operational analytics and operational data processing is the big frontier where it still feels like there's a huge gap in the tooling, which is what we hope to fill for all of our customers.
[00:52:27] Unknown:
Well, thank you both very much for taking the time today to join me and discuss the work that you're doing with Rockset. It's definitely a very interesting platform that empowers a lot of interesting use cases that were previously fairly impractical or intractable. So thank you for all of your efforts on that, and I look forward to seeing how you progress toward your goals. And I hope you enjoy the rest of your day. Thank you so much. Thank you for having us. We really enjoyed this conversation,
[00:52:52] Unknown:
and appreciate you having us on the show. I have only one question for you, Tobias, before we sign off. If your listeners want to follow up with Rockset, it would be cool for them to know that they can just go and sign up for an account at rockset.com. They don't need to contact sales or anything like that. Is there a way
[00:53:12] Unknown:
to communicate that to your listeners at this point? Well, you're still being recorded, so you just told them. And also, I'll add links to the show notes for anybody who does want to follow through and
[00:53:22] Unknown:
click through and sign up and test it out. Yeah, if you're interested in Rockset, you can go and sign up for an account. No credit card required; it's free to get started, and the first 2 gigabytes are free forever. Just go to rockset.com, click on the sign up button, and get started if this is something that you think would be helpful for you. Alright, well, thank you again, and have a good rest of your day. Thank you. Thank you. Bye.
[00:53:52] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsors
Interview with Shruti Bhat and Venkat Venkataramani
Venkat's Journey from Facebook to Rockset
Rockset's Use Cases and Differentiation
Replacing Traditional Data Infrastructure with Rockset
Rockset's Architecture and Implementation
Cloud Agnosticism and Data Ingestion
Indexing Strategy and Query Optimization
User Experience and Developer Benefits
Innovative Use Cases of Rockset
Challenges and Excitement in Building Rockset
Future Plans and Features for Rockset
Closing Thoughts and Contact Information