Summary
The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more
- Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do two things: watch us build a data estate in 15 minutes and start for free today.
- Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council today
- Your host is Tobias Macey and today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Grainite is and the story behind it?
- What are the personas that you are focused on addressing with Grainite?
- What are some of the most complex aspects of building streaming data applications in the absence of something like Grainite?
- How does Grainite work to reduce that complexity?
- What are some of the commonalities that you see in the teams/organizations that find their way to Grainite?
- What are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. where they would be spending effort on a fully managed streaming architecture?
- Can you describe how Grainite is architected?
- How have the design and goals of the platform changed/evolved since you first started working on it?
- What does your internal build vs. buy process look like for identifying where to spend your engineering resources?
- What is the process for getting Grainite set up and integrated into an organization's technical environment?
- What is your process for determining which elements of the platform to expose as end-user features and customization options vs. keeping internal to the operational aspects of the product?
- Once Grainite is running, can you describe the day 0 workflow of building an application or data flow?
- What are the day 2 - N capabilities that Grainite offers for ongoing maintenance/operation/evolution of those applications?
- What are the most interesting, innovative, or unexpected ways that you have seen Grainite used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grainite?
- When is Grainite the wrong choice?
- What do you have planned for the future of Grainite?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Grainite
- BigTable
- Spanner
- Firestore
- OpenCensus
- Citrix
- NetScaler
- J2EE
- RocksDB
- Pulsar
- SQL Server
- MySQL
- RAFT Protocol
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Data Council: ![Data Council Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/3WD2in1j.png) Join us at the event for the global data community, Data Council Austin. From March 28-30th 2023, we'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount off tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit: [dataengineeringpodcast.com/data-council](https://www.dataengineeringpodcast.com/data-council) Promo Code: dataengpod20
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit [RudderStack.com/DEP](https://rudderstack.com/dep) to learn more
- TimeXtender: ![TimeXtender Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/35MYWp0I.png) TimeXtender is a holistic, metadata-driven solution for data integration, optimized for agility. TimeXtender provides all the features you need to build a future-proof infrastructure for ingesting, transforming, modelling, and delivering clean, reliable data in the fastest, most efficient way possible. You can't optimize for everything all at once. That's why we take a holistic approach to data integration that optimises for agility instead of fragmentation. By unifying each layer of the data stack, TimeXtender empowers you to build data solutions 10x faster while reducing costs by 70%-80%. We do this for one simple reason: because time matters. Go to [dataengineeringpodcast.com/timextender](https://www.dataengineeringpodcast.com/timextender) today to get started for free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you tired of dealing with the headache that is the modern data stack? It's supposed to make building smarter, faster, and more flexible data infrastructure a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it, it's all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to work properly. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70 to 80% on costs.
If you're fed up with the modern data stack, give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender where you can do 2 things. Watch them build a data estate in 15 minutes and start for free today. Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real time with your own JavaScript or Python code. Join the RudderStack transformation challenge today for a chance to win a $1,000 cash prize just by submitting a transformation to the open source RudderStack transformation library. Visit dataengineeringpodcast.com/rudderstack today to learn more.
Your host is Tobias Macey. And today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications. So, Ashish, can you start by introducing yourself?
[00:01:49] Unknown:
Yeah. Absolutely. Thanks for having us, Tobias. Ashish Kumar. I'm CEO and cofounder at Grainite. Prior to Grainite, which we've been working on for just about 3 and a half years now, I used to be at Google. I was at Google for roughly 11 years. Most recently at Google, I ran all of the database teams there. So Bigtable, Spanner, Datastore, Firestore. Also Kansas, which is not something that external folks to Google would be familiar with; ran all of the user data for all Google properties, as well as the telemetry teams. This is Census and OpenCensus. Prior to databases, I ran teams in hardware, display ads, and all of developer tools for Google. Before Google, a couple of startups, Sun Microsystems,
[00:02:33] Unknown:
as well. And Abhishek, how about yourself? Sure. Hey. This is Abhishek. I am the CTO and cofounder here. Before Grainite, I was at Citrix. I looked after all of their products, but I specialized in networking. I grew up on the networking side of engineering as well. I was the CTO for NetScaler, before that the chief architect for it, and before that, the chief security architect for it. So I have a little bit of the paranoid blood in me as well. I came in through the acquisition of my previous startup, which was an application firewall. Before that at Sun Microsystems, I was the blueprints architect for J2EE, and we are now talking 25 years back. And, you know, in those days, Ashish used to work on the application server side at Sun Microsystems while I worked on the standards side of J2EE.
Before that, I had 1 more startup that was acquired by Microsoft, and I was at Microsoft before then.
[00:03:26] Unknown:
And going back to you, Ashish, do you remember how you first got started working in data?
[00:03:30] Unknown:
Yeah. Absolutely. Look, you know, I think infrastructure is where the game's at. And I think right from the beginning, I remember the days very early on, and I'm gonna date myself, I wrote a TCP driver on top of a Mac, System 7. So networking infrastructure is where things are interesting for me. And I think as Abhishek was mentioning, even at Sun, as we were doing the application server work, which is essentially, you know, the newfound middleware for building all enterprise applications. At that time, I was working with some of the largest companies you can imagine. Right?
Major League Baseball, NFL, eBay. Right? E-Trade. They were all sort of running these hacks that were PHP and CGI and load balancers and MySQLs and so on. And we showed them that there was a different way of simplifying that hacky infrastructure into something that could be standardized. Just in roughly 1 year, there were 30,000 new enterprise application developers that were building applications. Essentially, en masse, all of these enterprises moved into a single infrastructure that would do it. We got into data because, you know, when you're working on infrastructure, data is, like, always there. But just working at Google, both on the ads products and then later running the database teams, I got to talk to many large enterprises that were building applications on top of our systems.
And I realized that the problem, like, even though we've been working on data for so long, like, you know, Oracle has been around for so long. Right? And, you know, data products have been around for so long. It isn't a solved problem. People have built databases. People have built caches. People have built streams. Like, there's just too much going on there. And that's where I think I met with Abhishek, and Abhishek will share the rest of it. Now, like, so
[00:05:18] Unknown:
if you ask me how I first got started in data, I graduated from the University of Wisconsin–Madison, and the computer science department there was affectionately known as the department of databases and other computer sciences. In other words, we basically all studied databases and innovated in databases while we were there. I had, like, 4 buddies there. We used to do our assignments together. We were project partners and so on. 1 of them went on to invent RocksDB. 1 of them became the GM of SQL Server. 1 of them wrote Pulsar. 1 of them worked on the MySQL side of things. So I was the black sheep for all these years because I was the 1 who went into networking and was looking at TCP packets and, you know, how many cycles does it take to have a packet be transmitted on the wire while these guys were all having fun with data.
So data has always been calling me. In the world of networking, you know, networking practitioners in general are shy of stateful processing. Like, well, you know, we are a network device. What do we know about state? And that had always been, like, the red line between being a networking guy or a data guy. And as somebody building load balancers, I always felt that, you know, you could do a lot more in the network if you had some state and if you could keep that state durably and remember it and make sure it remained consistent. We built some of that into NetScaler, but not to the same extent that you would build in a database, for example. And eventually it was my turn to do data when we decided to build a data analytics pipeline at Citrix.
And as we built that, it was surprisingly difficult, and we followed all of the best practices. We looked at all of the goodies that existed in the open source and elsewhere, and it turned out that the net result was still incredibly difficult to program and incredibly difficult to operate. So that's where I went looking, found Ashish, and then we started brainstorming on how to make modern data easier for the modern developer.
[00:07:30] Unknown:
And so that brings us now to what you're building at Grainite. I'm wondering if you can give a bit more detail about what it is that you're focused on, what you're building, and some of the story behind how it got started and why you decided that this was where you really wanted to spend your time and energy at this point in your life? Yeah. Tobias, like, you know, I think I mentioned that some of the infrastructure,
[00:07:52] Unknown:
that people were needing to build as part of delivering these applications, as I saw at Google, was getting really, really complex. Right? I remember we launched Firestore back in 2018 in beta, and in 2019, we made it generally available. There were a million sort of new accounts that people created on Firestore in that 1 year beta period. We saw many, many large customers sort of go take that beta and build applications that were deployed on the web, including, you know, for election tracking. Right? How the states were and how the different elections were leaning. And so my assumption was with Firestore, we really simplified this infrastructure.
Application developers could simply talk to Firestore, no server code that needs to be deployed. But when I went and actually talked to some of these companies, I realized that they were still piecing together 8 to 10 different products to build this. Also, talking to Abhishek, we sort of realized that most applications now are not of the form where you are writing into a web form and saving into an Oracle database. Right? Most applications and most enterprises are looking to derive real time insights from data that they already have or that they're continuously getting so that they can find out more about their customers, find out more about what's happening with their own services, and deliver new functionality to the users, whether that's ads personalization, whether that's delivering a personalized experience to the user, whether that's a customer transitioning from a phone call to a web interface, or a web interface to a phone call and keeping their state, right, as all of that happens. These are all, you know, while they might not seem apparent, these are all streaming applications. You've got events that are flowing in that you need to be able to transition, you know, from 1 medium to the other or derive real time insights.
What we found was that this was too complex to build. And given the current state, it's going to keep getting more and more complex because people are continuing to add point products in between. So we decided to create Grainite to make these kinds of applications significantly easier. That was the goal of it. And, you know, why don't we have Abhishek maybe describe what that system is. Yeah. Let me
[00:10:09] Unknown:
I was going to talk about, you know, my experience and how I came looking for Ashish. So let me start with that, and then I'll describe the system a little bit more as well. So Ashish has been practicing this art for a long time now. For me, you know, we started out and we had this grand vision that on every system call that goes on any given device, we would like to intercept it and be able to declare whether this system call is okay or whether it is suspect and should be blocked or return an error on it. And these were the days when, you know, a lot of information had been leaked on the Internet. There was a lot of talk about, you know, doing this sort of interception and analysis.
And I'm like, this can't be all that difficult. You know? We have all of our users' data. And if you kind of put yourself into my shoes, you know, I am the technology fellow in charge of a bunch of products at Citrix. Citrix is a mid sized, you know, Fortune 500 company, and it's a technology software company. So we have got a bunch of talented software engineers within the company, and, you know, it seemed like taking this foray into data and data pipelines wouldn't be all that difficult. So I went on stage 1 of those days and announced to the rest of the world that we are going to build this thing and expect it to be there in about 9 months. And to cut the long story short, for the 1st 9 months, we couldn't even hire the set of skills that is required to put a pipeline like this together. Something that would collect data from 200,000,000 users, something that would analyze the data in real time and be ready to intercept a system call, for every system call. And, you know, on any given system, you might have thousands of these system calls occurring per second. So that was quite a humbling experience.
We ran into skills problems. It took us a long time to hire people. Then we built the pipeline using, you know, a stream ingest system, then something to process the stream, and something to record the results so that you could consult them when you are blocking the system calls. It took us about 3 years all said and done to build that thing. We scaled down our goals quite significantly. And then when it came time to operate it, like people talk about day 2 operations, we found that the thing we had built was quite complex to begin with, but it was even more complicated to operate it. Operating a pipeline at scale with a large number of users and keeping it live, every day is a mystery. Right? Every day you would run into something or the other, either a scaling problem or a shard mismatch or a mysterious lag in your pipeline where things are not quite showing up in the time that you expected them to show up.
And to kind of put a cherry on top, we got the demand for this product to be about 10 times what we had anticipated. And so now imagine that, you know, you have designed this thing and the Internet was built using pizza boxes and all that stuff. And now I go to my architects and say, hey. You know, can we just add 10 more pizza boxes here and go? And they're like, no. You know? Wait a second. If you want that much performance, we need to go revisit the architecture. And I'm like, you know, we are a technology company. We have renamed our engineering organization to Citrix Labs as opposed to Citrix just to attract the right talent. And here, 3 years later, you are telling me that I can't scale it 10x. And that was like a revelation. I started kind of looking at other people who were building similar pipelines, talking to them, and I saw the same story repeat over and over. People had built these things because, you know, the timing was right. They were desperate to have something out in the market, but they were not happy with what they had produced. And that's when I met Ashish, and I'm like, Ashish, tell me how you guys do this at Google because, you know, clearly, we are doing something wrong. We don't know something that you know. And Ashish's answer was, you know, that guy who invented that thing in distributed systems, he works for my team. That other guy who invented that other thing in databases, he works for my team. And then he started listing all of these people whose papers I had been reading. Like, they all work for me.
And I tell him, you know, that may work for Google, but how do you solve this for the rest of the world? And that's where we kind of started putting our heads together and saying, you know, if we created this sort of abstraction or these sorts of guarantees, then the developers would be free to write the code that is demanded of their domain objects and business logic and not have to worry about all of the vagaries of distributed systems and clouds and rate limits and idempotencies. Like, I could go on and on about all of these things that you don't have when you write code in the cloud, but you do have when you write it on Grainite.
[00:15:00] Unknown:
And there are any number of different directions that I would love to take this conversation. But before we get into all of that, in terms of the product that you're building at Grainite, you mentioned that you're just trying to solve this challenge of real time streaming pipelines and being able to scale them up or down. And in terms of the ways that you've been approaching that and the design of the system and the ways that you're thinking about marketing, who are the core personas that you're focused on targeting, and how has that helped to inform the prioritization and deployment of what you're building?
[00:15:36] Unknown:
Primarily, our focus has been on leaders in the organization that are either leading engineering teams or data teams. And the reason why that is is because these are the people that have both the application requirements as well as the burden of delivering as well as managing and operating those applications. And just the complexity that Abhishek and I talked about is just so high that, you know, people are sort of working 18 by 7 just to build this and then keep it up and running, and we can provide simplification on that. We can deliver sort of lower cost of ownership and all of that with this simpler architecture.
[00:16:15] Unknown:
And in terms of the kind of distributed systems elements of this, you mentioned that in your initial experiences, Abhishek, of building this at Citrix, you said, okay, we've got something running. Now we wanna scale it up. And then it was a matter of, oh, well, actually, we're gonna have to rebuild the whole thing if that's what you want. And I'm wondering how you've approached some of those aspects of distributed systems challenges and the nonlinear scalability of these systems and, in particular, offering it as a product, some of the kind of multitenancy aspects of building this in a way that you are able to sustain it and keep it running and keep pushing it forward without having to re-architect and rebuild every 6 months or a year?
[00:17:00] Unknown:
I think if you put yourselves in the shoes of this enterprise developer, today, the way they put this infrastructure together, the industry demands them to make a bunch of architectural decisions about components that they don't know much about. Right? So I am putting in a streaming ingest. I am putting something to process behind it. I am adding a database. I am maybe adding a cache in the mix. I've got a bunch of sidekicks floating around as well. Like, Zookeeper is running in probably 3 or 4 instances by the time you are all set and done in the middle. And each 1 of these products is going to have a number of tunables.
And somebody who's building this for the first time or the second time or, you know, hasn't lived and breathed it for a long time, they are going to have to, like, stick a thumb into the wind and come up with some values for these tunables. And you end up with the system, you exercise it, you characterize it at a certain performance point. And then when it comes time to, here you go, let's go scale it 10x and see if it would work, some of the assumptions that you had made when you had first made those decisions are no longer borne out. Right? So things like the number of shards that you want to set up, the number of nodes that you want to set up, the amount of parallelism that you want in the system, the rate at which you want new nodes to appear in response to traffic. Right? So if you're expecting a spike at a certain rate, and now you receive a spike that rises faster than you expected, then you have a problem. So all of these operational decisions get hardwired into the infrastructure, and that creates a problem.
The way we thought about it with Grainite was that there are 2 problems here. 1 is that I am asking people who have no expertise in this area to start making decisions about things like parallelism. So in Grainite, from very early on, we said that parallelism is going to be a property of the workload, and it would have nothing to do with how you deployed it. You could have deployed it on 1 node or 1,000 nodes. But if your workload has no dependencies between 2 objects, then those 2 objects should be allowed to proceed in parallel. And if it does have a dependency between 2 objects, then those 2 executions have to be serialized.
And in terms of expressing this desire, it is fairly simple for anybody to understand the guarantee. It's fairly simple for them to expect it. It's fairly difficult to ensure or deliver on this guarantee, but that's where we define the Grainite boundary and say let the system figure this out on its own. Let's understand data processing with its dependencies, and let's make sure that when we process the data, like all systems, we are internally sharded or segmented as well, but the number of segments or the number of shards has no bearing on the amount of parallelism.
So that's 1 point of simplification and what allows us to create a system that would continue to scale regardless of where you took it. There are a few other things I can talk about in terms of the separation between compute and storage, making sure that if you are compute heavy or if you are IOPS heavy, or if your workload was compute heavy in the morning but has now become IOPS heavy, those are not things that your operator has to wake up in the middle of the night and retune or reconfigure the system for. The system needs to understand those intrinsically and tune itself. And there's a lot that goes into the architecture to make sure that all of these tall claims that I'm making can actually happen behind the scenes.
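Grainite's actual API is not shown in this conversation, but the guarantee Abhishek describes, parallelism as a property of the workload rather than the deployment, can be illustrated with a toy sketch. Everything here (class and method names included) is hypothetical; the point is only that operations targeting the same object run one at a time while operations on independent objects run concurrently, regardless of how many workers exist underneath:

```python
import threading
from collections import defaultdict


class KeySerializedExecutor:
    """Toy illustration: events that target the same object (key) are
    serialized by a per-key lock, while events for different keys may
    run concurrently. The number of threads or shards underneath has
    no bearing on this guarantee."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)
        self._guard = threading.Lock()  # protects lock-table creation

    def submit(self, key, fn, *args):
        with self._guard:
            lock = self._locks[key]
        t = threading.Thread(target=self._run, args=(lock, fn) + args)
        t.start()
        return t

    def _run(self, lock, fn, *args):
        with lock:  # serializes events that share a key
            fn(*args)
```

A real system would also have to keep this property across process boundaries and node failures, which is where the hard distributed-systems work lives; the sketch only shows the contract a developer would reason about.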
[00:20:44] Unknown:
And you've mentioned a number of the different challenges associated with building these real time and streaming pipelines because of the fact that it's not just a matter of, I need a way to get data from a to b. There are all these incidental concerns, particularly as you talk about scaling and being robust to failures and retries. And in terms of the ways that teams who aren't at Google or Facebook scale are experiencing these challenges, I'm wondering what are some of the commonalities in the specific tipping points that they hit that then bring them to Grainite and bring them to the point where they say, we don't wanna have to worry about all this stuff. I just want something that works. Just give me an API.
[00:21:26] Unknown:
I think 1 of the phrases that has stuck with me and almost surprised me in talking to a lot of different customers has been this concept that, you know, I haven't met a database or data infrastructure technology that I haven't liked. And I I have heard this from many like, when I first heard it, it was cute, but then when I started hearing it again and again from our prospects, like, wait a second, why is it that you haven't met a technology that you haven't like? How come your search for the perfect technology is still ongoing? And then when you dig or double click down on it, you realize that 2 things are happening. 1st is that the half life of technology in this area is very short. Like, in 2 years, the landscape changes and there is, like, a new a new darling of the of the data platform world. And, you know, we we all got to move to that. We we were doing generation 1 streaming. Now we are doing generation 2. Now we are sure enough doing generation 5 of streaming platform.
And so so people are people are a little bit of in this teasing their tail mode if I if I may say so. So that's that's 1 part of where we found the where we found the customers kind of suffering. The other thing I felt is again hearing from these guys is that they had all built these pipelines and I asked them 2 questions. 1st is, you know, did it take you just as long as you thought it would take? And because, you know, I was smarting from my own experience, and I'm like, let me ask these guys. And invariably, in all cases, it has taken them way longer than they thought it would take. And in fact, many times I hear this that, you know, we thought we would we would have built 50 or 60 applications by now. We are still on our second application.
And part of the problem is that the people that we used to build the application, they didn't come free when they were done with building that application, because I had to leave half of that team behind to operate it, because those are the only people who can figure out how to operate that thing. Right? So they were not happy with what they had ended up with. And in particular, you know, you talk to them, and they would tell you about things like: my pipeline, 364 out of 365 days, it works okay. But on the 1 day that it matters the most, it kind of doesn't work. The consumer lag is too high. My events are not getting processed. I am having to tell my users, even when I send them a second factor of authentication, to expect to wait 5 minutes to receive your login token.
And, like, I'm talking to a large payment processing company, and he told me that his business owners are telling him to use their payment processing infrastructure to deliver second factor tokens, because their second factor token pipeline is not reliable enough. So I heard all of these different concerns. The gist of them is that they have built a pipeline, it has taken them too long to build it, and they are not happy with what they have ended up with.
[00:24:21] Unknown:
Yeah. And if I could add to that, I think, look, streaming applications are actually relatively simple conceptually. What you're doing is you've got an event that's come in. You're going to find out more about the event, like do some master data lookup and figure out who the event is about and what the event is. You're going to read some state from some database. You're going to modify that state and write it back. I mean, it's as simple as that. But just in this simple process, I described 3 or 4 products already, and the challenge comes from the fact that there are 3 or 4 disconnected products. And so the biggest problems, I think, in streaming come from the fact that when you consume an event, to the point that you materialize the effects of it into some database or table that you can serve from, those 2 are not atomic.
Right? You can actually crash any time in the middle during that continuum. And when you do, you essentially have to be able to recover. Now, you know, people that are building analytics systems might say, hey, it's okay, we can lose some events at a time. But if you're building a payment system or you're building fraud detection, that 1 event that you lost might be the reason why fraud actually gets through. And so that's where the complexity of streaming systems comes in. And what we have built is a system where the consumption of the event and the materialization of its effects is actually atomic.
There is no system like that anywhere in the world.
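The read-modify-write loop described above, and why consuming an event and materializing its effects must happen as one atomic step, can be sketched in plain Python. Everything here is illustrative (the class name, the offset-based duplicate check, the in-memory dict); it is not the Grainite API, just a toy model of the guarantee being discussed, assuming in-order delivery on a single partition.

```python
# Toy sketch: applying an event's effects and recording its offset in
# one step means a redelivered (retried) event is applied exactly once.
# Illustrative names only; not the Grainite API.

class AtomicStore:
    """State plus the last applied offset, mutated together."""

    def __init__(self):
        self.state = {}
        self.last_offset = -1

    def apply(self, offset, key, delta):
        # Both mutations happen together; a crash before this point
        # leaves neither, so redelivery after recovery is safe.
        if offset <= self.last_offset:
            return  # duplicate delivery: already applied, skip
        self.state[key] = self.state.get(key, 0) + delta
        self.last_offset = offset

store = AtomicStore()
# The third event is a redelivery of the second (e.g. after a retry).
events = [(0, "acct-1", 100), (1, "acct-1", -40), (1, "acct-1", -40)]
for offset, key, delta in events:
    store.apply(offset, key, delta)

print(store.state["acct-1"])  # 60: the redelivered event counted once
```

When the state write and the offset write are in separate systems, a crash between them is exactly the non-atomic gap described above: you either lose the event or apply it twice.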
[00:25:47] Unknown:
Given the level of operational complexity of running these systems, it's a wonder that everybody who comes across this as an objective doesn't just throw up their hands and say, forget it. I'm done. Real time is just not gonna happen. Let's just go back to batch. And so in terms of the capabilities that folks are looking for when they say, we really need this real time system, what are some of the common threads that you see? And particularly once they migrate to Grainite, some of the kind of higher order or higher value projects that they're able to focus on when they are freed from all of the operational headache of these real time systems?
[00:26:28] Unknown:
Yeah. Exactly. Like, I mean, there's a numbers value. Like, look at TCO over a period of time, and you're gonna save this much, and you're gonna be able to free up your team by that much. But the key part, or the sort of wow moment with Grainite, is the actual simplification of the coding and the operational effort of building that application. So remember we talked earlier about how both Abhishek and I saw the transition from the PHP CGI hackery to the J2EE world. That's exactly what we're talking about here. Right? So things that would take you thousands of lines of code to write, because you're dealing with all of these failures and retries and remembering what you've processed versus not, probably take you 100 lines of code on top of Grainite, because, you know, you've got these integrated atomicity and consistency guarantees that are provided by the platform at scale.
So that's sort of the core thing that wows people when they first start using the platform, how simple it becomes. Once they make that transition, all of a sudden we start getting, like, oh, but, you know, I can now do this. Right? I've always thought about doing that, but I knew I couldn't do it, so in fact I batched my events. I was talking to 1 of these financial institutions as well that essentially needs to check events as they're happening, and these are trading events, and they need to go figure out which ones essentially seem suspicious. Right?
1 of the problems that they have is that they're running through a Hadoop based infrastructure, and they can't deal with the real-timeness of these events. And so they actually have to batch these events. So 1 of the things that they're looking at Grainite to do is, can Grainite essentially make this more real time, or at least near real time if we can get there, at scale. And that's essentially the kind of thing that people are able to open up: new use cases that they never thought were possible, doing them much faster. And just like Abhishek mentioned earlier, where you have backlogs of 50 to 60 applications, you can actually catch up to that backlog and get it done.
[00:28:31] Unknown:
And so digging more into the Grainite architecture itself, can you give a bit of a system overview of how you've approached this problem, and in particular, the pieces that you are able to pull off the shelf because they fit your operational requirements and the scalability requirements, and the pieces where you decided that you had to engineer it yourself to be able to fulfill the overall Grainite platform?
[00:28:59] Unknown:
Ashish mentioned a number of different guarantees that you need for simplification. Right? So you don't want the developers on top of Grainite to have to worry about all of these distributed systems concerns. And at the highest level, if you think of the Grainite architecture as a means of providing this guarantee, there are 2 core ideas behind it. 1 of them is what I call unified consensus, and the other 1 I call unified journal. So any system that you take in the data world, it's going to have a write-ahead log or a journal of some kind. And if it's a distributed system, it's also going to have a consensus algorithm that allows you to make all of the different nodes coherent with each other. And so by the time you have got 7 or 8 different systems, you have 7 or 8 different consensus algorithms, and you have got 7 or 8 different write-ahead logs or journals.
And then each 1 of these systems is, by the way, writing the data in triplicate for durability purposes. So you have got 27 copies of the data, and 8 times up the consensus stack, 8 times down the consensus stack, 8 times down the journal in each 1 of these systems, and 8 times out of the journal. And everywhere there is a system boundary between 2 of these systems, you are going to see that the handoff is not exact. And you can make that handoff exact by running a distributed transaction coordinator, but then you would end up with maybe 100 events per second, and that wouldn't be quite sufficient.
And so nobody actually runs a distributed transaction coordinator. Instead, they tell their developers to write code to deal with the situation that, you know, if you send 1,000,000 events, not all 1,000,000 events will show up on the other side. Or some of those events might even get duplicated, so you would have data loss and duplication at the same time. So by unifying the journal and by unifying the consensus algorithm, we are able to wrap the entire processing, from the point that the data enters Grainite to the point that it leaves, into a strongly consistent distributed system, and it's orders of magnitude more efficient.
But even more important than efficiency in this case is that when a developer writes code against a system with a unified journal and a unified consensus algorithm, they don't have to worry about retries and locking and ordering and linearization.
[00:31:29] Unknown:
All of that gets taken care of by the system below them. In streaming systems, I know that there are several perennially hard problems, but some of them are things like being able to do cross stream joins, where you want to be able to say, I have this piece of data, and now I want to pull this other piece of data from another stream to make sure that they coincide. Or being able to do transactional actions on a stream or multiple streams, where you say, I want to be able to submit this record to this stream and ensure that it gets delivered, but I only want it to be delivered if I'm also able to commit this other record to this other stream. And I'm wondering if any of those types of problems are, if not completely solvable, at least more tractable in the system that you're building at Grainite.
[00:32:15] Unknown:
Yeah. Yeah. I think that's a great question. And I fully agree that from an end user's perspective, those are the kind of challenging problems in distributed streaming systems. If you zoom out a little bit, what you would find is that what's at the root of these problems is the need to do things statefully. In the sense that when an event happens, I would like to remember something about that event such that when another event happens later, I am able to consult that state and do something that combines that state with this new event. Right? If I had this primitive which could do things statefully, I could use it for stateful joins. I could use it for conditional propagation.
We have, in fact, even built an entire Rete rules engine using the Grainite guarantees, and it was pleasantly surprising how easy it became for us to build that on top of Grainite. So to your point, these are common things that people like to do. They require state and context, and they require exactly-once messaging in order to be able to propagate these messages. Like, the notion of transaction that you mentioned, that I would like to have 2 things happen and only when both of them happen should an event be triggered or should a third action be taken. Those can all be captured with distributed state and exactly-once messaging.
Now neither of these concepts is new, and many systems talk about them. But in every other system, these are expensive concepts. It's like, you know, you can use it if you really want to, if you are pressed against the wall with no other choice. But know that if you used exactly-once messaging, we are going to be running a coordinated transaction behind the scenes, and expect no more than 1,000 events per second through that. Or it's the same thing with scratchpad state or context: you can very easily overload your database if you start storing that in there. And so the design point in Grainite is that we want to do a read or a write transaction.
I should invert that. We do a read-modify-write transaction at the cost of a read or a write. Right? You would see other systems where they are either write optimized or read optimized. In Grainite, we are able to think of the world as a read-modify-write transaction. So that's part 1. And we are able to do exactly-once messaging at the cost of non-exactly-once messaging. Right? So it's cheap messaging, it's cheap state, while maintaining both of them as consistent. And the reason we are able to pull that off is because we have a unified consensus and a unified journal underneath.
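The stateful cross-stream join described above (remember something about one event so a later event on another stream can consult it) can be sketched with nothing more than a per-key dict. This is a plain-Python illustration of the primitive, not the Grainite API; the stream names "orders" and "payments" and the buffering policy are assumptions for the example.

```python
# Illustrative stateful join of two streams on a shared key: each event
# is remembered in per-key state until its partner from the other
# stream arrives, at which point the joined record is emitted.
from collections import defaultdict

pending = defaultdict(dict)  # key -> {stream_name: payload}
joined = []

def on_event(stream, key, payload):
    pending[key][stream] = payload
    # Join fires only once both streams have contributed for this key.
    if {"orders", "payments"} <= pending[key].keys():
        joined.append((key, pending[key]["orders"], pending[key]["payments"]))
        del pending[key]  # per-key state is cleared once the join fires

on_event("orders", "o-17", {"amount": 25})
on_event("payments", "o-99", {"status": "ok"})  # no partner yet: buffered
on_event("payments", "o-17", {"status": "ok"})  # completes the join for o-17

print(joined)  # [('o-17', {'amount': 25}, {'status': 'ok'})]
```

The hard part in a real system is exactly what the episode discusses: this `pending` state must survive crashes, and the emit into `joined` must be exactly-once with respect to the consumed events.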
[00:34:59] Unknown:
Join in with the event for the global data community, Data Council Austin. From March 28th to 30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering, and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, and investors. Get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year. Visit dataengineeringpodcast.com/data-council today.
And as far as the build versus buy equation, I'm wondering what your process looked like and continues to look like as you build and scale the system, and some of the ways that that has influenced the overall design and evolution of the platform as you went from your initial idea of, hey, let's build a platform that actually solves stream processing so that nobody has to solve it again for the umpteenth time, and just some of that kind of process of discovery and evolution that you've gone through over the past few years? Yeah. No. That's a great question as well, and especially
[00:36:20] Unknown:
me coming from a networking background, when I first started looking at this problem, my first instinct was that, you know, there are 35 different open source databases and probably a dozen different streaming systems, and surely we could use 1 of these as a starting point. So we kind of did our thought experiments, and we did our analysis, and what we found out is that this idea of unifying the journal and the consensus algorithm would require rocket science. You would basically have to excise almost half of an existing open source project and replace it with our code in order to unify that journal. In particular, our journal and our consensus algorithm are not head-of-line blocking constructs, in the sense that usually a consensus algorithm and a journal are euphemisms for things proceeding in a single line, as in, you know, everything is going to be queued 1 behind the other, and we're going to do them 1 at a time: 1, 2, 3, 4, sequentially.
In Grainite, that's not the case. In our world, consensus can proceed in parallel, and the journal is able to deliver entries out of order as well, because, like I said, parallelism is a property of the workload. It's not a property of the journal. It's not a property of the number of nodes. And so to be able to impart that sort of dependency understanding to an existing open source project was mission impossible. We had also been working on a storage engine that is designed to be much faster than any other storage engine that we have known so far. And so we married those 2 things together.
And in short, the storage engine, the processing layer, as well as the stream ingest layer within Grainite are all homegrown. There are things, for example, we do take advantage of the Raft consensus protocol in our configuration and control plane. We take advantage of Linux, of course, and a number of other open source tools on the tooling side, but the core of the data plane is all built from scratch. And as far as
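The non-head-of-line-blocking idea above, that parallelism comes from the workload's own dependency structure rather than from the journal, can be illustrated with a toy scheduler: entries touching different keys commit independently, while entries on the same key stay ordered. This is not Grainite's actual algorithm, just a minimal sketch of the dependency idea being described.

```python
# Toy illustration of "parallelism is a property of the workload":
# journal entries are grouped into per-key lanes; lanes can proceed in
# parallel, while order is preserved within each lane (each key).
from collections import defaultdict

def schedule(entries):
    """Group (seq, key, op) journal entries into independent lanes."""
    lanes = defaultdict(list)
    for seq, key, op in entries:
        lanes[key].append((seq, op))  # arrival order kept within a key
    return dict(lanes)

log = [(1, "a", "w1"), (2, "b", "w2"), (3, "a", "w3"), (4, "c", "w4")]
lanes = schedule(log)
print(lanes["a"])  # [(1, 'w1'), (3, 'w3')]: per-key order is kept
```

A strictly sequential journal would force `w2` and `w4` to wait behind the writes to key `a`, even though nothing about the workload requires that.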
[00:38:31] Unknown:
the onboarding and integration process for an organization that is using Grainite and becoming a customer, what is the overall workflow of getting it installed and integrated into their existing data systems and just some of the technical capabilities
[00:38:47] Unknown:
that are necessary prerequisites to be able to make effective use of Grainite? Yeah. Look. We're a young organization, and the answer is, as we're building this up, we have a simple way for organizations to onboard. 1 of the early decisions that Abhishek and I took was that we wanted to make sure that we could run everywhere. It shouldn't be that we can only run on 1 of the public clouds or something like that, even though we might find a lot of customers on that cloud. And second, that it should be really easy. Right? Remember, we're trying to simplify that entire experience for the developer, but similarly, we wanna simplify that entire experience for the operator as well. And so we took very light dependencies. We run in Kubernetes, we take light dependencies on Kubernetes, and we require some form of block storage underneath us. We provide scripts that allow Grainite to be installed within the customer's public cloud or private cloud. We run under each of the public clouds, AWS, Azure, and GCP, as well as under OpenShift.
And customers are easily able to run our scripts to get a cluster configured and set up within their environment such that the entire product is
[00:39:55] Unknown:
running over there. And then once you have the Grainite platform running, in terms of the end user experience, what has been your overall philosophy about which are the pieces that you expose as a platform capability, and the API design elements of that, and what are the components of the overall operating runtime that you try to keep internal and kind of black box, so that customers don't spend too much of their time trying to do any sort of fine tuning or
[00:40:26] Unknown:
worrying about things that they don't actually need to care about or shouldn't want to care about. Yeah. Exactly. So 1 of those things, I think, Abhishek already covered, was segments. Like, we don't want people to think of these artificial entities of hashing keys into segments or partitions, and then have to reason about which segment they're reading, or deal with parallelism. Parallelism is a property of the workload. In terms of the things that we want to expose to customers via APIs, as well as, you know, in general on our system, our sort of driving principle is, can we get to the point, and I talk about developers first, can we get to the point where developers can focus purely on the business logic that they need to write, and not worry about any of the distributed systems concerns, or even properties of Grainite for that matter. But then on top of that, we think about, like, from an operational perspective, what does it mean to operate a system that's doing all of this work on behalf of the developer in the background?
Remember that the eventual place where we want to land is this idea that a developer can go back to the sweet old days, when people used to code on an 8088 PC and there was a single thread that you could write your program with. So anybody that's writing event handlers or database logic should operate as if they have the entire machine to themselves. They've got an event. They've got some state. They need to merge the 2. They need to call out to some things, but then they're done. Right? And they don't need to think about anything else that's happening in the system, and Grainite sort of abstracts that away. And that's the driving principle around the interface, as well as what we expose back to customers in terms of tunables and so on. Right? All of the parallelism and scale managed by Grainite, all of the functionality and integrations made easy on the API, and then finally, developers should have the power to write
[00:42:17] Unknown:
constructs within their domain. And so once Grainite is running, the organization has it integrated, and they're starting to plan out the applications or data flows that they want to build, what are some of those design considerations that they should be thinking about? Some of the user experience aspects of stream processing and the specifics around it that you have worked to simplify, whether through SDKs or API design, and just some of the overall process of, you know, validating whether an idea is feasible and then going through the process of getting it implemented and deployed for kind of day 0, day 1 operations?
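The "whole machine to yourself" handler model Ashish described above (got an event, got some state, merge the 2, done) can be sketched as a pure function the developer writes, with everything else left to the platform. The handler signature and the driver loop here are assumptions for illustration, not the Grainite SDK.

```python
# Sketch of the single-threaded handler mental model: the developer
# writes a function of (event, state) with no locks or retry logic;
# the platform (simulated by the plain loop below) owns ordering,
# retries, and persistence.

def handle(event, state):
    """Merge one event into per-key state; pure business logic only."""
    state = dict(state)  # treat state as immutable input
    state["count"] = state.get("count", 0) + 1
    state["total"] = state.get("total", 0) + event["value"]
    return state

# Stand-in for the platform's delivery loop.
state = {}
for event in [{"value": 10}, {"value": 5}]:
    state = handle(event, state)

print(state)  # {'count': 2, 'total': 15}
```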
[00:42:56] Unknown:
Yeah. Absolutely. And, you know, like I mentioned, we're a young organization. We're working with pretty advanced, and very supportive and collaborative, customers at the moment that have been sort of guiding our road map and guiding our direction as well, in terms of core IAM integrations and core other technology integrations that we need to provide on the platform. 1 of the things that I'm now proud of is we work with several banking institutions now to go through their security review, their infrastructure reviews, and so on, to be ready for deploying the product in, you know, highly complex and regulated environments. And so we've got the base infrastructure that allows us to be deployed in pretty much any environment.
Remember I mentioned that, you know, we will provide you the scripts so you can run Grainite within your network. We also offer an option where we can manage the cluster in a peered VPC for you. And in that case, you don't even have the management burden. And over time, we're also going to offer a fully managed SaaS offering in the future. Now, as we work with our customers, there are key integration points like encryption for data at rest and data in motion, role based access control, integration with their IAM providers, audit logs, and so on. These are the systems that we've built thanks to feedback that we've gotten from early adopters over the last year. So, you know, I think we do have many of the capabilities that most complex environments require. And, obviously, as a young organization, we're always learning from our customers and turning around as we realize additional things that they might need. You know, I think Abhishek mentioned, as an example, how as he talks to folks, we've heard this comment that, you know, I haven't found a data platform I didn't like so far. So pretty much every organization that we go into has, you know, tens of different databases and different systems, and hundreds of scripts that depend on each 1 of the incarnations of those systems. So many times, integration is the piece that we have to delve into and help out with. Like, how do you introduce Grainite into an existing environment? And we've done a lot of work in terms of simplifying it so that, you know, within a week, we can drop into any environment and start showing value with Grainite.
[00:45:20] Unknown:
And so in terms of the actual types of applications and data flows that people are building, I'm wondering if you can give some kind of common examples. And, also, as I think about streaming systems, it seems like every single streaming system that somebody builds eventually turns into some sort of SQL database. And I'm wondering if that's a capability that you're actively pursuing, or something that you are afraid of kind of incidentally building by accident, and just kind of your thoughts on that aspect of the space as well? Yeah. Let me let Abhishek take the SQL part of it. But let me maybe describe some of the applications that people are building on top of Grainite and some of the scenarios
[00:46:01] Unknown:
that we're seeing. So as I mentioned, financial services is a common sort of user base. And you can imagine, like, the amount of data, you know, the ability to turn around that data and deliver new sort of functionality to the users is quite critical in that space. So as an example, in the financial services space, 1 of our customers is doing something very simple, right? 1 would imagine this would be something that's already there. Like, hey, look, I get a document from my end customer, and I need to record it into my system. And then I want to send out a notification to my customer saying that it's been done. Right? And there are regulations around this. If you don't deliver that message, if you don't save it, there are regulations around this, and there are audits that are done in terms of validating if that's been done. But very simple: I've got a Twilio API, I can call into that API. I've got a Salesforce API, I can call into that API.
But then the problem happens: what if Twilio puts you on a rate limit, or, you know, Salesforce puts you on a rate limit, because you're hitting them too fast, because you got a sudden burst of load that you needed to deal with. Well, at that point, you're out of luck, right, because those requests are gonna fail, and the audit requirements don't go away for you. So you essentially need to know that you're retrying, right, and you need to have logs to track what's actually happening there. And so it's a very simple use case. It's called guaranteed message delivery. I wanna talk to external systems from my application, and every place where I currently talk to them, I have to build this retry logic and deal with the failures and generate logs so that I can show later to the auditors where we're at. But now you put Grainite between your applications, multiple applications, and whatever your endpoints are, and Grainite takes care of that. A more complex example is 1 which is, again, in financial services, in a payment transaction. Right? You go past an RFID reader.
A payment action essentially gets triggered and goes to the bank. The bank then goes to a central banking system, charges your wallet on another system, and the entire orchestration and the payment transaction happens within hundreds of milliseconds. And that's the entire flow, right, including where systems might fail and all of that. So that entire orchestration engine, and the core of that, is being built on top of Grainite. Similarly, in other verticals, 1 popular use case that we have is this idea of collecting data from devices and being able to detect anomalies, detect patterns, detect shapes of what's happening, being able to predict failures: this area around AIOps, if you may, in general.
And so we have several folks that are either collecting data from network devices or collecting data from website pixels, like customer interactions, and all they're trying to do is figure out what's happening with the system overall, how customers or devices are experiencing the system, and where the problems might be. And Grainite is particularly good because of the thing that Abhishek mentioned, the stateful ability to process streams. It's actually particularly good in terms of collecting data from a lot of places, even though they're arriving out of order or arriving late, being able to join them, being able to predict, and then being able to call out to machine learning models that could essentially predict what might be happening. And, Abhishek, do you wanna take the SQL question? Yeah. Let me take that.
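The guaranteed message delivery pattern discussed above, retrying a rate-limited external call (Twilio, Salesforce) while keeping an audit trail for regulators, is the kind of logic teams hand-build at every call site. A minimal sketch, assuming a generic `send` callable and an in-memory audit list (a real system would persist both the queue and the trail durably):

```python
# Minimal sketch of guaranteed delivery: retry with exponential
# backoff and record every attempt for the auditors. Illustrative
# only; not a Twilio/Salesforce client or the Grainite API.
import time

def deliver(message, send, max_attempts=5, base_delay=0.01):
    """Retry `send(message)` with backoff; return the audit trail."""
    audit = []
    for attempt in range(1, max_attempts + 1):
        try:
            send(message)
            audit.append((attempt, "delivered"))
            return audit
        except RuntimeError as err:  # e.g. a 429 rate-limit response
            audit.append((attempt, f"failed: {err}"))
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"gave up after {max_attempts} attempts: {audit}")

# Simulated endpoint that rate-limits the first two calls.
calls = {"n": 0}
def flaky_send(msg):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("429 Too Many Requests")

trail = deliver("2fa-token", flaky_send)
print(trail[-1])  # (3, 'delivered')
```

The point in the episode is that this boilerplate, plus making the queue and the audit log crash-safe, gets repeated at every external call site unless the platform provides it.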
[00:49:24] Unknown:
We call it the SQL question internally because it comes up with great regularity. There are 2 kinds of people who ask this question. You know, the first kind are the traditional database practitioners, and this is a person who understands normalized schemas. That's how their application is, and they are particularly interested in how many joins you can do, and cross table, and this and that. In their mind, they're imagining a SQL query that will fit 2 or 3 pages on a screen. And we tell them that, you know, you could consider Grainite as a normalized database, and you could run these SQL queries, and we would get you similar performance to your existing database, but there wouldn't be anything special to write home about. Where the specialty comes from is when you start thinking of your database as a denormalized schema.
In fact, there are lots of people that we meet who have already gone through the microservices journey, and so they are very familiar with this notion of denormalizing the schema. And with Grainite, you can almost think of it as a denormalized schema on steroids, in the sense that the data model that we provide to you in our database goes even 1 step beyond what you would expect from something like Bigtable. Beyond wide columns and a large number of column families, we give you sorted column families. What that means is that you can have not just a user record, you can have an entire dossier of a user inside each row of Grainite.
This dossier of a user could contain the entire click stream history of this user over the last 10 years, or the entire order history of this user over the last 10 years, and we would be responsible for taking these large datasets and fitting them into a single row of Grainite. Sometimes, when I have my marketing hat on, I say that we are giving you a MySQL table inside each row of Grainite. So if you had a billion users, you've got a billion MySQL tables' worth of tables in there. And that sort of data model allows you to almost accomplish the best of both worlds, but the right way to query this model is not usually with SQL.
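The "dossier per row" model described above, one logical row holding several sorted column families such as a user's full click history keyed by timestamp, can be sketched with a sorted list per family. This is a plain-Python illustration of the data model idea, not Grainite's storage format; the family names and keys are made up for the example.

```python
# Toy model of a row with sorted column families: each family behaves
# like a small sorted table living inside the row, supporting ordered
# inserts and range scans.
import bisect

class Row:
    def __init__(self):
        self.families = {}  # family name -> sorted list of (sort_key, value)

    def put(self, family, sort_key, value):
        entries = self.families.setdefault(family, [])
        bisect.insort(entries, (sort_key, value))  # keep sort-key order

    def scan(self, family, lo, hi):
        """Range scan within one family of this row."""
        entries = self.families.get(family, [])
        return [v for k, v in entries if lo <= k <= hi]

user = Row()  # one user's "dossier"
user.put("clicks", "2023-01-02T10:00", "/pricing")
user.put("clicks", "2023-01-01T09:00", "/home")  # inserted out of order
user.put("orders", "2023-01-03", "order-42")

print(user.scan("clicks", "2023-01-01", "2023-01-31"))
# ['/home', '/pricing']: entries come back in sort-key order
```

This is the sense in which each row resembles "a MySQL table inside a row": per-family ordered scans replace many of the queries a normalized schema would answer with joins.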
We do have a version of SQL working in the lab. We can enable people to use SQL. But, typically, once I'm talking to somebody who's with me so far, they are going to be asking for GraphQL. And so that's the direction that we are currently headed. Today, we have API based access, and you can access this data and query it and serve it via API. But expect higher order languages like SQL
[00:52:04] Unknown:
and GraphQL to show up soon. And so once you go beyond the kind of day 0, day 1 process of designing and developing an application and getting it deployed, what are the capabilities that Grainite offers for the ongoing maintenance and updates of those applications, going from days 2 through n?
[00:52:24] Unknown:
Yeah. Essentially, having a system that's dealing with data, there's a core set of capabilities that 1 needs to provide. Both Abhishek and I worked in environments like that. We do all of the standard things that are required, like your backup, restore, DR, high availability, everything that you would need in order to continue operating the system. Remember that 1 of the key reasons why we created Grainite was that we wanted to reduce the operational burden alongside the developer burden, right, of developing these applications. So we've done everything that we can such that the system learns from the workload that's being exhibited, as opposed to crashing and burning as soon as a large workload shows up. You've probably heard of this term called hotspotting in databases, especially in distributed databases.
In fact, our system operates better when there are hotspots or when there are hotspot workloads. So we've designed it so that, you know, operators are not pulling their hair out when they see spikes in traffic. And, essentially, that's what's needed for streaming systems. So we've got all of the standard capabilities, but also we've made it so that operations on Grainite are significantly easier. 1 of the key things I like to call out, in addition to, you know, all of the core data things, is this idea that we expose a Prometheus endpoint on Grainite, which not only tells you what's happening with the Grainite cluster, like what's happening with the nodes, how much memory, how much storage, where the delays are, what's happening inside the queues within Grainite, which, you know, we obviously expose an operational guide for, and our developers and our engineers understand even more. But on top of that, because Grainite now becomes the orchestrator for your application, because, you know, we're calling out to your application handler, and the application handler is sending data back to Grainite to persist and then making callouts to other microservices.
We're actually able to tell you how your application is doing in that same Prometheus endpoint. We can tell you where the delays are and what the issues might be. And using our APIs, you can actually report counters and gauges that are compatible with Prometheus for your data. So think of this as data-level observability exposed within the application directly through that same Prometheus endpoint, where you can measure everything from the ground up, from the Grainite cluster all the way to how many payment transactions you successfully processed in the last hour, all in the same endpoint.
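As a rough illustration of the idea described above, here is a minimal sketch, assuming nothing about Grainite's actual API, of application-level counters and gauges rendered in the Prometheus text exposition format so they can be served from the same endpoint as cluster metrics. The class and metric names are hypothetical.

```python
# Hypothetical sketch: application-level counters and gauges emitted in the
# Prometheus text exposition format ("# TYPE" line, then "name value"), so
# one endpoint can serve both cluster and application metrics.
# This is NOT Grainite's real metrics API; names are illustrative only.
from collections import defaultdict


class MetricsRegistry:
    def __init__(self):
        self.counters = defaultdict(float)  # monotonically increasing
        self.gauges = {}                    # point-in-time values

    def inc(self, name, amount=1.0):
        self.counters[name] += amount

    def set_gauge(self, name, value):
        self.gauges[name] = value

    def render(self):
        """Render all metrics in Prometheus text exposition format."""
        lines = []
        for name, value in sorted(self.counters.items()):
            lines.append(f"# TYPE {name} counter")
            lines.append(f"{name} {value}")
        for name, value in sorted(self.gauges.items()):
            lines.append(f"# TYPE {name} gauge")
            lines.append(f"{name} {value}")
        return "\n".join(lines) + "\n"


metrics = MetricsRegistry()
metrics.inc("payments_processed_total", 42)  # e.g. from an event handler
metrics.set_gauge("queue_depth", 7)
print(metrics.render())
```

In practice an application would use an official Prometheus client library rather than hand-rolling the format, but the sketch shows why a single scrape target can carry both infrastructure and business-level metrics.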
[00:54:57] Unknown:
And so in your experience of building this product and platform and working with your customers and end users, what are some of the most interesting or innovative or unexpected ways that you've seen Grainite used, or some of the most interesting ways that Grainite has enabled these teams to focus on these kinds of higher-order, higher-value problems?
[00:55:19] Unknown:
Yeah. I can probably provide some workflow examples. I'm sure Abhishek has several of these as well. But just in terms of usage, when I started working on Google databases, I always thought data was a solved problem. And then I saw all sorts of challenges, and all sorts of new algorithms and new approaches being used. And, obviously, Grainite is one of those. We're creating a brand new platform with a new set of rules around how streaming data can be used. Similarly, I thought that simple things like needing to call out to an external API, things like this example that I just talked about, guaranteed message delivery, would have been solved by someone before.
And, no. In fact, it's pretty hard, and people struggle with this. Real enterprise applications struggle with this. Similarly, we've all heard of this term ETL for the longest time. Right? It's been around for so long. And one would think that these things are all done and solved. But take a look at any of these projects that provide ETL solutions, many of them open source, and look at the amount of code that one needs to write to actually integrate into that environment, or even to run them and keep them up and running. I was surprised, as I talked to customers, by how scared they are to take some of these open source ETL solutions and run them and support them themselves.
And I think that's an area where Grainite
[00:56:57] Unknown:
really can help out. Yeah. So I think that's a very interesting question. Let me start with something interesting, which happened almost during the first year of Grainite. I was working with a very senior engineer within our team. This guy has implemented file systems and disk subsystems before, and we had a mysterious problem in the code where once in a while we would refuse to read the disk. We would declare that this data is corrupt and not read it.
And this guy spent about a week on it, came back, and said, I think this is a bug in the Linux kernel, and I trust the analysis. Remember, this was during the times of COVID, so collaboration was already slightly difficult, but I couldn't help but show him my look of incredulity: are you sure it's a bug in the Linux kernel? Turns out that it was, in fact, a bug in the Linux kernel, and it only manifested when you are running on a cloud with a network-mounted volume. We were really proud of what we accomplished at that time in two ways. First is that the system detected the problem and would not proceed when it figured that the data that it was trying to read was not quite there.
And second is that it was a gratifying feeling to have found a bug in the kernel. We reported it, we got it upstreamed, and it's now part of every VM that that cloud ships. So that's on the interesting side. On the challenging side, one of the surprises that I found, and both Ashish and I have been building five-nines systems all our lives, so we are particularly passionate about how we test these systems, because nobody gets to five nines just by wishing so. And we found that in the world of distributed systems, even though there are things like Jepsen tests available, even though every database has talked about their testing methodology.
And there are some open source test frameworks available as well. But there is absolutely nothing out there that will test your system to a level where you can feel very confident about the code not being the single point of failure in it. So we had to develop our own test system that can sequence things the way we want them sequenced, produce errors that the developers didn't anticipate, and so on. And it's a love-hate relationship with that system, because it is merciless. You make a small mistake, it will find it, and it will tell you about it. But I would have expected that this industry, over 30 years, would have built something like this. In defense of the industry, the thing that we have built is somewhat peculiar to our system, but it is still something that we think we can generalize at some point and contribute. That's on the challenging side. I will stand by and say testing these systems, especially the ones with the kinds of guarantees we talk about, is incredibly difficult. It's still more of an art than a science.
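The refuse-to-read behavior from the kernel-bug anecdote can be sketched as checksum verification on every read: a reader that declares data corrupt and raises an error rather than returning it when the stored checksum does not match. The record layout here (a 4-byte CRC32 prefix) is purely an assumed illustration, not Grainite's actual on-disk format.

```python
# Hypothetical sketch: checksum-on-read, refusing to return data whose
# stored checksum does not match what was written. The 4-byte CRC32
# prefix layout is an assumption for illustration only.
import struct
import zlib


def write_record(payload: bytes) -> bytes:
    """Prefix the payload with its CRC32 so later corruption is detectable."""
    return struct.pack(">I", zlib.crc32(payload)) + payload


def read_record(blob: bytes) -> bytes:
    """Return the payload, or raise if the stored checksum does not match."""
    (stored,) = struct.unpack(">I", blob[:4])
    payload = blob[4:]
    if zlib.crc32(payload) != stored:
        raise IOError("checksum mismatch: refusing to return corrupt data")
    return payload


blob = write_record(b"hello")
assert read_record(blob) == b"hello"

# Simulate corruption (e.g. a faulty kernel path on a network-mounted volume)
corrupted = blob[:-1] + bytes([blob[-1] ^ 0xFF])
try:
    read_record(corrupted)
except IOError:
    print("corruption detected")
```

The key design point is that the reader fails loudly instead of silently propagating bad bytes, which is what let the team trace the problem down to the kernel rather than to their own code.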
[01:00:05] Unknown:
Absolutely. And as somebody who has worked in operations for a while now, one of my favorite jokes is that you can have as many nines as you want, but I get to pick where I put the decimal point. That's right. And so in terms of your personal experiences of building this business and working in this industry, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:00:28] Unknown:
Yeah. I think we covered some of that as part of the last question as well. But as an example, you were talking about this idea of ETL, how complex that world still is, and people moving more to ELT-type systems. We found that integration is obviously the name of the game. Very early on, Abhishek and I talked about how we don't want to be in the connectors business, but we're a data platform, and when we get deployed into an enterprise, we have to be able to integrate. What we were pleasantly surprised by, though, was that these integrations on top of Grainite also see a factor-of-10 reduction in the amount of effort that it takes to write them. Simply because, if you can run a connector directly on the Grainite platform, the entire job of writing it, managing it, and operating it becomes so much simpler. So we actually added this capability on Grainite called tasks, which is essentially a framework that allows you to write as many of these connectors as you want on top of Grainite and just makes it really, really easy. And so I think that was definitely very, very valuable.
And now we see pretty much every customer deployment that we go into needs it.
[01:01:49] Unknown:
And that also brings up another question that we haven't touched on yet, which is the aspect of building and visualizing the topologies of these applications that are running across the streaming data, because you do need to be able to understand what the interdependencies are and how to manage some of the guarantees between these different stages. Are there any elements of what you're building at Grainite to help with that problem as well? Yeah. Exactly. That's at the heart of the application. So
[01:02:17] Unknown:
Grainite has this concept of applications, and a Grainite cluster can run many applications. Right? And if you notice, in streaming systems, there are no applications. You have DAGs. You might have multiple DAGs running as part of what you consider the application, and there are different components being used. In a Grainite application, you're able to describe how data flows between the different pieces: what triggers user code to process an event, how user code can call other user code, and so on. So there is an establishment of relationships there. And because Grainite is sitting at the heart of it, orchestrating the movement of these messages and data flows, it is actually able to give you a view into how data is moving within your application.
Right? And it's able to tell you, from this endpoint to this endpoint, this many messages passed, and at p99, this was the latency on processing, and this was the delay. And you can actually build a graph of your running application to see exactly where the data entered the system, how long it took, where it got stuck, where it's getting processed, and where you might need to make improvements. So we actually deliver that out of the box, both in a developer view as well as for production applications.
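The per-edge view described here, message counts and p99 latency between endpoints, can be sketched roughly as follows. The recording interface and edge names are hypothetical; only the idea of aggregating counts and a p99 per dataflow edge comes from the description above.

```python
# Hypothetical sketch: per-edge message counts and p99 latency for a
# dataflow graph. Edge names and the recording API are illustrative,
# not Grainite's actual interface.
from collections import defaultdict


class FlowGraph:
    def __init__(self):
        # (source, destination) -> list of observed latencies in ms
        self.latencies_ms = defaultdict(list)

    def record(self, src, dst, latency_ms):
        self.latencies_ms[(src, dst)].append(latency_ms)

    def edge_stats(self):
        """Return {(src, dst): (message_count, p99_latency_ms)} per edge."""
        stats = {}
        for edge, samples in self.latencies_ms.items():
            ordered = sorted(samples)
            # nearest-rank p99: index ceil(0.99 * n) - 1
            idx = max(0, -(-99 * len(ordered) // 100) - 1)
            stats[edge] = (len(ordered), ordered[idx])
        return stats


g = FlowGraph()
for ms in range(1, 101):  # 100 samples, 1..100 ms, on one edge
    g.record("ingest", "enrich", ms)
count, p99 = g.edge_stats()[("ingest", "enrich")]
print(count, p99)  # 100 messages, p99 = 99 ms
```

A production system would keep histograms or sketches instead of raw samples, but the aggregation per edge is what makes it possible to render the running application as an annotated graph.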
[01:03:28] Unknown:
And for people who are considering building streaming applications and they're looking at Grainite, what are the cases where it is the wrong choice?
[01:03:37] Unknown:
Yeah. So look, if you're not building streaming applications, clearly, Grainite is the wrong choice. Right? I mean, if you want to build a web form application that updates an Oracle database, perhaps Oracle is the right choice there. But if you're looking to build an application that can start with, let's say, 0.5 TPS, transactions per second, and grow to 100,000 transactions per second, one thing I can guarantee is that with any infrastructure that you pick, you're going to rearchitect it 3 or 4 times over the next couple of years, or whatever your duration is for scaling that system. Grainite is one of those systems that you won't need to rearchitect. The application that you build for 0.5 TPS is the one that's actually going to take you all the way as you scale up. All you're doing is increasing the load on it, because Grainite is set up in a way that it's designed for these parallel-processing streaming workloads. So if your workload is not about streaming, if it is actually more about, say, distributed transactions across multiple rows in a database, perhaps Grainite is not the right choice. Perhaps scale is not what you're looking for eventually.
Right? I haven't run into a streaming application that doesn't
[01:04:55] Unknown:
fit Grainite today. That sums it up. There are some people who are looking primarily for stream storage. Like, I'll process it as a batch pipeline later, but my streaming application is only going to take the stream and store it somewhere. And there are a number of stream storage products out there, and if that's all you need to do, and your processing is in fact using Spark that runs a batch every day or something, you probably don't need Grainite there. But if you are trying to get these results in a reasonable amount of time, in half a second or a second or less, then Grainite can definitely do some good there. Yeah. In fact, that reminds me of one more thing I'd love to add. Like,
[01:05:39] Unknown:
you know, we've been surprised by the way people are implementing their application pipelines. With these data pipelines, there's a class of pipelines which are batch pipelines, and there's a class which we'll call streaming pipelines. When Abhishek and I use the term streaming pipeline, we don't distinguish between the two. Regardless of whether you have batch or streaming, the code doesn't need to change. Grainite takes care of that. If you have batch workloads coming in, Grainite automatically has batching effects internally. And even though your code is still dealing with one event at a time, Grainite automatically deals with that batch as well. We were talking to one of our advisors, who teaches ML courses, and he was walking us through how these feature stores and some of these ML pipelines are being built today with the folks that they are working with. And it became clear that people are thinking of batch pipelines completely distinctly
[01:06:40] Unknown:
from streaming pipelines. And Grainite's power is the ability to take that streaming pipeline and integrate the batch pipeline into it. So no separate code: one for batch, one for streaming. And as you continue to develop and iterate on the Grainite platform and work with some of your early customers, what are some of the things you have planned for the near to medium term, or some of the specific projects or capabilities that you're excited to dig into? Yeah. I think the North Star for us is to continue to make the developer experience and
[01:07:12] Unknown:
the ability to operate and build these applications simpler. Right? So with that said, we have combined these things. We have a simple API in front of it. We have a novel data model. But, for example, we don't have a document-oriented data model. These are things that we can build on top of Grainite fairly easily, and you would expect to see a lot of the new features and innovation going in the direction of making Grainite more familiar and easier to program and operate for developers.
[01:07:45] Unknown:
Are there any other aspects of the work that you're doing at Grainite, or the overall space of streaming data applications and the operational challenges of building them, that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I would like to make just a generic statement or observation about the industry.
[01:08:04] Unknown:
And the notion is this: typically, the industry responds like a pendulum. And up until now, what has happened is that streaming applications were a new and novel concept as part of the digital transformation. Once the batch pipelines and so forth clicked, you were ready to go into streaming. And when people first went into streaming, it wasn't clear what they were going to do with it, so they wanted to retain all of the flexibility that they could get. They wanted to put it together themselves. They wanted to use best-of-breed components. They wanted to retain the freedom to be able to change one way or the other, because they didn't know any better. They didn't know whether Storm was the in thing or the out thing, or whether Flink is the new thing, or whether there's going to be something after that, so they wanted to keep that freedom. Over the years, what has happened as a result is that the set of widgets that you see in this space is increasingly becoming specialized and niche, in the sense that, hey, if you had this corner case of a corner case, then we have built a new kind of open source project or a new kind of product that's going to help you deal with that. At the same time, I think the pendulum of use cases has started swinging in the other direction. And what I mean by that is that everybody we talk to is building almost the same kind of pipeline.
They've got a batch pipeline, then they've got a streaming pipeline on top of that, then they've got a reconciliation pipeline on top of that. And these things are basically all stateful processing pipelines. And as the use case gets more into focus and more predictable, you have to worry less and less about what you might need to change in the future that you don't anticipate. It's now time to start converging that infrastructure and say, hey, instead of having to build it all from scratch again and again, what if you had something out of the box, so that you didn't have to reinvent the wheel, you didn't have to pave your own road, but you could instead start focusing on writing your application logic and delivering these applications? Yeah. And I think one thing that we think about here is that building these kinds of applications
[01:10:15] Unknown:
remains too hard and too expensive, and I think it is unfair that only larger companies that can attract talent in the thousands should be able to deliver these capabilities when everybody needs them. And what we're finding is that with Grainite, or platforms like Grainite, and I'm pretty certain we're going to see many other folks think about the ideas that Grainite is delivering and come up with similar models that simplify operations and development, systems like this are essentially going to make it easier for all enterprises to build applications with the same kind of scale and properties that a Google or a Facebook or a Netflix is building today, without needing to hire thousands of experts in each of these products.
[01:11:02] Unknown:
And I think we're pretty excited about that journey, and we hope to see the industry move in that direction as well. Alright. Well, for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Look, I think it is the same exact thing. Like, today, if you take a look at building these applications, you need to find experts
[01:11:35] Unknown:
in the different technologies and products that exist. You need to piece together these snowflake pipelines, and not Snowflake the product, but in the sense that every one of these instances of these pipelines that you create is slightly different. Like, I talk to customers that have 10 implementations of these pipelines, and each one of them is slightly different than the others. Right? And they have to stick with those, because some architect decided that's the best way to pull those together. And this is just, if I may use the word, nonsense. There's no reason to be looking at this as PHP CGI hackery. I think there's a better and standard way of doing this now.
[01:12:19] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Grainite to simplify the work of actually building streaming applications, rather than spending all of your time in the care and feeding of the operational aspects of getting the data to the application. I appreciate all of the time and energy that both of you and the rest of your team are putting into that, and I hope you enjoy the rest of your day. Yeah. Thanks, Tobias. Thanks for having us. Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning.
Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Ashish Kumar and Abhishek Chauhan
The Vision and Mission of Grainite
Target Audience and Market Strategy
Grainite Architecture and Technical Challenges
Customer Use Cases and Applications
Future Plans and Industry Insights