Summary
In this episode of the Data Engineering Podcast Rajan Goyal, CEO and co-founder of Datapelago, talks about improving efficiencies in data processing by reimagining system architecture. Rajan explains the shift from hyperconverged to disaggregated and composable infrastructure, highlighting the importance of accelerated computing in modern data centers. He discusses the evolution from proprietary to open, composable stacks, emphasizing the role of open table formats and the need for a universal data processing engine, and outlines Datapelago's strategy to leverage existing frameworks like Spark and Trino while providing accelerated computing benefits.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by re-imagining the system architecture
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the main factors that contribute to performance challenges in data lake environments?
- The different components of open data processing systems have evolved from different starting points with different objectives. In your experience, how has that un-planned and un-synchronized evolution of the ecosystem hindered the capabilities and adoption of open technologies?
- The introduction of a new cross-cutting capability (e.g. Iceberg) has typically taken a substantial amount of time to gain support across different engines and ecosystems. What do you see as the point of highest leverage to improve the capabilities of the entire stack with the least amount of co-ordination?
- What was the motivating insight that led you to invest in the technology that powers Datapelago?
- Can you describe the system design of Datapelago and how it integrates with existing data engines?
- The growth in the generation and application of unstructured data is a notable shift in the work being done by data teams. What are the areas of overlap in the fundamental nature of data (whether structured, semi-structured, or unstructured) that you are able to exploit to bridge the processing gap?
- What are the most interesting, innovative, or unexpected ways that you have seen Datapelago used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datapelago?
- When is Datapelago the wrong choice?
- What do you have planned for the future of Datapelago?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Datapelago
- MIPS Architecture
- ARM Architecture
- AWS Nitro
- Mellanox
- Nvidia
- Von Neumann Architecture
- TPU == Tensor Processing Unit
- FPGA == Field-Programmable Gate Array
- Spark
- Trino
- Iceberg
- Delta Lake
- Hudi
- Apache Gluten
- Intermediate Representation
- Turing Completeness
- LLVM
- Amdahl's Law
- LSM Tree == Log-Structured Merge Tree
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey, and today I'm interviewing Rajan Goyal about how to drastically improve efficiencies in data processing by reimagining the system architecture. So, Rajan, can you start by introducing yourself?
[00:00:59] Rajan Goyal:
Sure. First of all, pleasure to be here, Tobias. Thank you for taking the time. This is Rajan Goyal. I'm CEO and cofounder at an early stage data company, Datapelago. I have been here in the Valley for the last thirty-plus years. This is my third startup, but we'll talk more about my journey. Happy to be here, Tobias. And do you remember how you first got started working in data and why it has kept your attention for this long? Yeah. Well, my journey to the data world has been unorthodox, I would say. In my previous life, I used to build security systems, like network security, and then I went deep into the technology, building processors, domain-specific processors, or what today's buzzword calls accelerated computing.
So the idea was, how do we make specialized processors, which are programmable and so on, to do a specific function, like security, deep packet inspection, and so on. That was my previous life. We were building both MIPS- and ARM-based processors, and we were pretty successful, relatively speaking. And just before we put our hands on the data processing problem, we were working on the data movement problem, which sounds boring, but all the modern cloud infrastructure, Tobias, is built on that principle. You have this disaggregated architecture. If I may take a few seconds to explain that: if you look at AWS or Google or Azure, you as a user want to take a server, and you say, give me 32 cores, four drives, one GPU, 25 gig network, or some other variant. Guess what? There is no physical server with that SKU. You are composing a software-defined server on the fly from a disaggregated set of infrastructure.
Alright? So you have your CPU racks, your storage racks, your GPU racks, and drive racks, all connected over a high speed network, and then you're dynamically composing this SKU that you're looking for. So in my previous life, I was working on technologies to enable this modern infrastructure that we see and enjoy today at AWS or any public cloud. And in fact, AWS Nitro, if you're familiar with that, or for your audience, is built on some of the technology that I built in previous lives. So coming back to your question, I have been spending more time on making the infrastructure of data centers more efficient, how to move data efficiently between compute and compute or storage and compute and so on. What occurred to me was that, with Mellanox, with Nvidia, with my previous companies, all these companies have more or less solved that problem. Obviously, it has to continue to improve, but it is fundamentally solved. So I observed that the next bottleneck will be in the computation. Now you're able to move data. Let's actually process it. And there are two hills in data processing, Tobias. One is taken by AI model training and inference, which is also data processing, but in a different way. And the other is the data processing itself to prepare for AI or for classic analytics. And that, I think, was left untouched and did not keep up with the shift in the infrastructure that was happening, which I was fortunate to be part of in my previous life. And, hence, my attention got to this. Okay, this is a very large, $180 billion market today.
So it's a pretty large business. The problem is big. It's relevant, so why don't we apply ourselves to solve it? That's how a more bottom-up engineering mind took me to this world of data, and ever since, it has been fascinating. It's validating what we started with, but that's how we got in. I personally got into this world of data coming from the computing, storage, and networking world.
[00:04:38] Tobias Macey:
And digging into some of those hardware-level capabilities and constraints: most people who are working in this space are going to be fairly familiar with the system architecture of their laptops or desktops, the Von Neumann architecture, which has been prevalent for decades at this point. And as you mentioned, because of the requirements and the high throughput and high scale of these data centers and the cloud providers, they have a slightly different way of composing those pieces together to be able to give that flexibility. And I'm wondering, from your perspective and from the understanding that you've gathered from working in that space, how that fundamentally changes the ways that people need to be aware of the compute capabilities and how they link together, versus the mental model they're likely working from with that Von Neumann approach of collocated CPU, RAM, and disk, and how that factors into the capabilities of actual data processing and data movement in those environments?
[00:05:42] Rajan Goyal:
Yeah. Yeah. So let's break your question into maybe three topics, if I may, to give the proper context. The first is the classic idea of, I think Nutanix invented this word, hyperconverged. The idea was you put everything in a single pizza box, if I may use that example: collocated CPU, DRAM, storage, networking. That becomes your cookie-cutter unit, and you replicate it a hundred, a thousand, or ten thousand times to get your scale. Right? Then came the model of disaggregated but composable infrastructure that we just talked about. Now, in that regard, things are far apart. Obviously, networking has solved the problem of making it appear low latency and so on, and of not spending CPU cycles on that, but it is still not a collocated world. So now you have storage racks and compute racks separated over the network. The third shift, subsequent to that, is that the compute itself is no longer homogeneous. In the last decade, we as an industry standardized on Intel or AMD x86 architecture: you'd write software once, Intel was supposed to double the performance every eighteen to twenty-four months, and life is good. Right? But that train has stopped. Moore's law has flattened, and you're not doubling CPU performance generation to generation. So what's next? The next step function is accelerated computing. Now you have these heterogeneous elements: CPUs, GPUs, FPGAs, TPUs, and, you know, AWS has their own chips. You're no longer finding a homogeneous server. You're finding a heterogeneous server. So that's the reality of the infrastructure that is happening. Now, coming back to your overall theme, if this is the world we are born into, and you are a data practitioner writing Python code or PySpark code or SQL queries, you cannot be bothered with GPU kernels or FPGA programming. Right? So how do the people who are building these platforms bridge that gap? That's where companies like Datapelago play that role, to build intermediate layers between the Python programmer and the underlying heterogeneous infrastructure. That's maybe the high-level view of how the stack should look. So there are two sides. One is the users of the data platform. The other is the builders of the data platform. In our world, we are building the data platforms using this infrastructure.
But our goal, or the goal, should be: how do we keep the same abstraction that people are used to today? If I have to tell a Python programmer how to write a CUDA kernel, it's not gonna work. Right? So how do we abstract it out to a level where, for today's data scientists or business analysts or the actual people writing applications, that abstraction remains the same, and we can hide this complexity with a modular software layer? That is the new problem statement. And, hence, Datapelago, and other companies like us, want to bridge the gap that has emerged because of this shift in the underlying infrastructure, Tobias.
[00:08:48] Tobias Macey:
Another division in the data ecosystem, and there are several, is open versus closed, where you have these vertically integrated, closed, proprietary stacks in the form of Redshift, Snowflake, BigQuery, systems like that. They're built for a specific problem, and they're optimized because they control the end-to-end use case, versus these open and composable stacks that have been gaining ground, especially in the past three to four years of the data lakehouse architecture, which is an outgrowth of all of the work that came from Hadoop and a lot of the processing that happened there, and then these different offshoot projects. And now they've largely oriented around a handful of table formats that make it easier for the different compute engines to execute against a specific target, so Iceberg, Hudi, Delta Lake. And because those are all evolving separately from each other and are developed by different teams, coordination is done generally by loose consensus or, you know, very lengthy RFC and proposal processes.
And I'm curious how you see the variance in terms of the capabilities and the evolution of those two ecosystems of these proprietary vertically integrated systems versus these open and composable stacks and how that affects the overall efficiency and capabilities of those systems.
[00:10:15] Rajan Goyal:
First, let me answer this more fundamentally, Tobias. You know, I told you just now that in my previous life we did this disaggregation at the infrastructure level. From hyperconverged to disaggregated architecture was exactly the same paradigm, if you observe it: instead of building one hyperconverged server where everything is integrated, your drives, your processor, your memory, your network card, all in one box, you take a disaggregated view, with storage separated from compute and compute separated from GPUs and so on. A very similar paradigm shift is happening here, to use your example, from Snowflake or Redshift, with monolithic storage and execution engines, to a disaggregated world where you have your open-table-format-based data lake versus an execution engine. I think that fundamental shift has to happen because one box does not fit every workload. People need different kinds of data formats, different data, different execution capabilities. So this disaggregation has to happen, and that's what the industry went and did. That's already the accepted norm. Even the, let's call them converged architectures, like Redshift or Snowflake, are also embracing Iceberg, using external tables and open table formats. So that's already happening. Now, what it enables is that once you are in this new paradigm, the best engine should win, because you as an enterprise land your data once into your data lake, wrapped in your choice of open table format or otherwise, and then you want your choice of engine. So there is no vendor lock-in, whether it's proprietary or open source. That's the fundamental architectural enablement: the architecture allows you as an enterprise to pick the engine of your choice. That's one part. The second part is that, obviously, even in this lakehouse paradigm, Tobias, you will see some open source engines, like Spark or Trino, and some closed source or proprietary ones, even Databricks; they are not pushing their performance improvements back to open source. So you will have open source and closed source choices, and rightfully so. Some have better performance and command a premium; that's a business decision. But where open source plays a role is in the standards: the open table format, the connectors, the file format, what kind of connectors you use on the SQL side, and so on. If you open those up, then this paradigm has long legs. That's one point. If I may extend my argument one step further: Snowflake or Redshift were designed for structured, tabular data, for a different problem. But now the data is shifting, so you want a different engine. You don't want to be forced to ingest the data into an existing data warehouse before you can process it. You have to have these external connectors to the data in the data lake. So it is bound to happen that you build a specialized engine for processing unstructured data. Now, processing unstructured data requires something totally different: you use models. It's no longer just joins and sorts; it's metadata extraction to process unstructured data. So the engine has to marry those two, the past world and the new world. Hence, I'm a full supporter, and I think it's going to enable a lot of innovation in data formats, data engines, and the choice of infrastructure required for each engine, rather than having a monolithic system. That's why I kind of expanded your question, Tobias, but I think I've given you enough to cover it.
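To make the engine-choice point concrete, here is a minimal PySpark sketch of registering an Iceberg catalog over a data lake. The catalog name, warehouse path, and table are placeholders, and it assumes the matching iceberg-spark-runtime package is on the classpath; once the table lives in the lake in an open format, any Iceberg-aware engine such as Spark or Trino can query it without the data moving or being tied to one vendor.

```python
from pyspark.sql import SparkSession

# Illustrative only: catalog name, warehouse location, and table are made up.
spark = (
    SparkSession.builder
    .appName("iceberg-engine-choice")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

# Write once into the open table format...
spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, payload STRING) USING iceberg")

# ...and any Iceberg-aware engine (Spark here, Trino elsewhere) can read it.
spark.sql("SELECT count(*) FROM lake.db.events").show()
```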
[00:14:03] Tobias Macey:
Yeah. Absolutely. And on the point of processing these unstructured data sources, and the fact that we can't necessarily rely on the SQL engines that have been the heavy lifters for several years now, it makes me think that we're almost starting to drift back towards the MapReduce era, where you have to write custom code to churn through all of the data and hope that you got it right, because otherwise you're going to have to go back and start over. Thankfully, we've gotten better at parallelizing that with systems like Ray and Spark and at getting better efficiencies out of it. But we're no longer in the case where we have a single language target to process the majority of data. We have these bespoke and fractal use cases and tool chains for working with these unstructured data sources.
[00:14:52] Rajan Goyal:
That's true. And I think there's a push and pull here, Tobias. There is value in preserving the existing language, because you have a large base of users who are familiar with it, and that is fine. So one dimension is to extend the language semantics as well as the syntax to support these new operators. The other is that, obviously, people are using, or want to use, Python or some other general purpose language to express their intent. So I think it is still to be proven and seen where we would land once it's all settled. But I do see initiatives that are valuable in enabling an existing user who is familiar with SQL, for example, to work with unstructured and semi-structured data, because if you buy this argument of the lakehouse with a single engine and a unified data store, there's value in preserving the northbound API. From that user's view it's just two layers, and they can use it. But, obviously, it will take time to get there. The other thing you mentioned, yes: Ray, having a flattened, if I may call it that, distributed framework, where you as a user control what your app is doing and then use the power of Ray to just distribute it, is a wonderful idea, but it puts more of the power, and the burden of how to write it, on the user. The SQL world gives you the power of a planner and optimizer and so on, which is kind of missing there, and which we are familiar with in the data world. But, yes, I think it's still to be proven which language we will settle on over time.
[00:16:25] Tobias Macey:
On the point of the overall ecosystem, particularly for these disaggregated stacks having to do a lot of process and coordination to make any step changes in capabilities: for instance, if you were to introduce a new table format today, it would probably be two to five years before it was actually usable in all of the different processing engines that someone might want to work with. Iceberg support, for instance, landed in Spark initially, and then there was some initial support in Trino for a while, as well as a lot of investment in Dremio, etcetera. The Lance table format is another example of a newcomer, where work is starting to incorporate it into Trino and some of these other engines. And because it can take so long for these new capabilities to be broadly available without bringing in a completely different tool chain or rearchitecting the data stack, I'm curious what you see as the points of highest leverage in that ecosystem of disaggregated systems to have the broadest impact and the fastest adoption without having to do that rearchitecting and reimplementation?
[00:17:44] Rajan Goyal:
I think, fundamentally, and this is what we just talked about in the previous question, there is merit in standardizing the interfaces. An open table format is one of those interfaces: how do you deal with your data? The more we standardize and agree on one format, the better it is for the engines to take advantage of it. That's one part. But I guess there are two dimensions in which, in this context, the status quo will be challenged. One is the efficiency part: whether the current format is good or efficient for the different application access patterns on the data. The second is that, as the data is changing, because today's table formats are mostly for structured data, what is the abstraction for multimodal, unstructured data? How do we evolve that into a more common format? So it will evolve, but I don't think it's going to fundamentally disturb the relationship between the engine and the data. That's how we see it. But from my point of view personally, and where my expertise lies, the format plays a certain role in the overall system's performance or efficiency.
At a certain point, it's good enough. Then you want to focus on the execution side and the data processing and take advantage of the new infrastructure. That's where we are focused. Whatever the industry leads us to from the file format point of view, we will support it, but let's focus on the execution side. Because if you look at the enterprise bill, 60 to 80% of the cost of goods you are paying is compute, not so much storage or networking. So compute is where the bulk of the spending is, from an energy and cost point of view. And as the data gets more and more complex, it is going to be a more and more acute problem to optimize. That's where we are focused: how do we make the compute part of data processing more efficient while still leveraging all these initiatives in the open table formats and so on.
[00:19:58] Tobias Macey:
One of the outgrowths of the past three to four years in the broader data ecosystem that I think gives a much better point of integration to have a broad impact as well is the broad adoption of the Arrow ecosystem for data interoperability. And when I was looking through the documentation on your site, I also saw that you hook into, or have early support for, the Apache Gluten project, which to my understanding is a marriage of the Spark execution and orchestration system with high performance libraries from, I believe, ClickHouse and similar systems. And so, digging now into what you're building at Datapelago, I'm wondering if you can talk to what was the point at which you decided that this was the problem you wanted to solve, and maybe some of the strategy that you've developed around how to actually tackle it?
[00:20:53] Rajan Goyal:
Yeah. First, the in-memory format, like Arrow, and the corresponding engines have their own place from an efficiency point of view. But now we are in a world of, as I said, heterogeneous computing. If you use GPUs and FPGAs, is the same in-memory format sufficient, or do you extend it with hardware-software co-design? At Datapelago, we have relooked at that and extended it. So that's one part of the problem. The second one is that, regarding Gluten and other such technologies: Spark is a wonderful and extensible distributed framework, but it was designed with its own engine, written in Java, a JVM execution engine. What Gluten allows you to do, using an analogy from my previous life in the networking world, Tobias, where people used to build monolithic routers with the control plane and the data plane in one box, is very similar in my mind: how do you separate the control plane of Spark from its data plane, which is the execution engine? Gluten plays that role, punching holes on both sides so that you can plug in your choice of execution engine as a back end. Think of it like that. So that's the value of Gluten. Now, given those two clarifications as context, our strategy and vision has been: what's the gap in the industry? The gap is that the data engines, the data platforms we're talking about, were written ten years back, when infrastructure looked different. Now the infrastructure is changing underneath.
So what do you do? Either you say, I'm going to reinvent a new Spark for this new world, but then what about the whole body of existing applications and workloads written for Spark? Or you figure out a way to insert the benefits of accelerated computing into the existing ecosystem. That's the view Datapelago took, and many of the techniques I have mentioned, and some we have not publicly mentioned, are in service of that. Our view, if I now tie together all the points we have talked about, is that the world in our mind looks like the following. You have a unified data lake; you store all your structured, unstructured, and semi-structured data there. You have a universal data processing engine, which is what we are building, and it runs on the modern infrastructure. It handles any data.
And on top, which is more important, it enables any engine, any data engine written in the past, to take advantage of the new hardware and the new data. That's the view we take. And then Gluten, and many other technologies that we have developed internally as well as adopted from open source, enable us to fulfill that vision, so that we are not reinventing Spark, we are not reinventing Trino. We want to give the benefit of accelerated computing to existing frameworks. Hence, we have carved out what we call the universal data processing engine to deliver those benefits. That's our view, Tobias. There have been attempts in the past where people built their own SQL engines, GPU powered or FPGA powered. My personal view is that those attempts did not succeed, not because the GPU or the FPGA was not good enough; I think this last point I just mentioned, the insertion path, was one of the reasons why adoption was not there.
Because if you don't have friction at insertion, people can embrace the technology. And given how enterprises are already spending millions of dollars, with thousands of users writing code, it would be a nonstarter to suggest, okay, now I have another engine, can you learn again and stand it up? That transition, that migration, is not easy. So the way we are doing it at Datapelago is to not disrupt the northbound APIs, if I may use that term. The users keep using whatever they are comfortable with, but still get the benefit of accelerated computing. And the lakehouse paradigm has a big role to play, because if that separation had not happened, separating the data and the data lake from the execution engine of your choice, we would not have been able to innovate this way. So there is a lot of wonderful past work that we are leveraging and standing on.
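As a rough sketch of that "plug a different execution engine in behind Spark's control plane" idea: enabling a native backend such as Gluten is typically a matter of cluster or session configuration rather than application changes. The exact plugin class and settings vary across Gluten releases and backends (Velox, ClickHouse), so treat these keys and values as illustrative, not as a Datapelago or Gluten reference.

```python
from pyspark.sql import SparkSession

# Illustrative configuration only: plugin class names and required settings
# differ across Gluten versions and backends. The point is that the SQL and
# application code stay the same; only the session/cluster config changes.
spark = (
    SparkSession.builder
    .appName("accelerated-spark-sketch")
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "4g")
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    .getOrCreate()
)

# An existing workload runs unchanged; supported operators are offloaded to
# the native/accelerated backend, and the rest fall back to vanilla Spark.
spark.sql("SELECT count(*) FROM parquet.`s3a://my-bucket/events/`").show()
```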
[00:25:28] Tobias Macey:
For people who are looking to bring in the Datapelago capabilities and are already running Trino or Spark, what does that integration path look like, particularly if they're running those systems via some sort of vendor like Databricks or Starburst?
[00:25:43] Rajan Goyal:
Oh, so let me, yeah. I mean, we have not announced the product yet, but let me generalize it. Obviously, we have customers who are using it. The idea is, let's first stay in the open source world, even though the technology is applicable to any engine built with this paradigm. Think about a sidecar, or training wheels, attached to an existing car, or whatever analogy you want to use. Datapelago's UDPE is that sidecar, those accelerated wheels, which you attach to an existing cluster, an existing Spark. You attach it, and all of a sudden your same workloads run faster.
And underneath, it consumes the newer breed of servers available in the cloud; there's no need to introduce new hardware. AWS, GCP, Azure, and many other clouds offer lots of server choices at very good price-performance points with those accelerated computing instances, as I call them. When I say accelerated computing instances, I mean GPUs from AMD or Nvidia, from a Tesla T4 to the higher end, or FPGAs, and going forward, TPUs. There are plenty of choices there. The user doesn't need to worry about it; our software, our technology, handles which computing element is the right one to use. That's one part of the problem. But most importantly, the insertion is: when you create a cluster, you point to our binary, or whatever you want to call it, and then magically the system shrinks in size but runs faster. That's how we insert. Let me pause here, Tobias. You should ask questions if we need more clarity than that. No. That's definitely very useful.
[00:27:32] Tobias Macey:
And to your point, the selection of which compute unit to use is definitely one that, as somebody who is writing a SQL query or some processing logic for unstructured data that I want to do some entity extraction on, I don't wanna have to think about at that level of detail. So being able to just say, I want this to work, and then have it federated out to the appropriate compute unit, is valuable. And then, obviously, there are also questions around data gravity, making sure that those compute units are situated appropriately close to the actual storage media. And particularly in that space of structured versus semi-structured versus unstructured data, where you do have those differences in language interface, I'm wondering about the work that you're doing at the point that it gets to your layer. Is it largely just an intermediate representation, because you've already been abstracted away from the specifics of the language runtime? Or is there some other translation that you're doing from the query language and the execution plan, to be able to say, okay, I'm dealing with unstructured data over here, so I need to do these different compute operations, versus, I'm running a SQL query, so I know I'm going to be dealing with Parquet files or JSON data over here? Just curious about some of the ways that you're thinking about that problem as well, of being able to unify, for the end user, the differentiation between those data sources so that they don't have to think about where they're integrating your product.
[00:28:57] Rajan Goyal:
Yeah. Good question. So, yes, absolutely. First, if you don't do it the way I am going to describe, I don't think it's going to fly. You want to keep this complexity, of where to run which workload or which computing element to use, away from the user. If you're writing Python or PySpark or SQL, you just continue to write as is, and it's the job of the underlying stack to figure out which computing element is the right one. And you did mention a couple of the points. One is data affinity, and the second is the right performance per dollar, or performance per whatever metric, for the processor executing that. And many more things go into the mix for that algorithm to decide. Because, you know, when moving data, there are high level principles: you want to move compute closer to the data rather than data to the compute. So there are high level principles you want to keep in mind while doing this kind of scheduling and runtime management of resources for the operators. That's for sure required.
And the second part is that, when it comes to handling the data types, as you asked: yes, absolutely correct. Let me call it chaotic data. When it is more chaotic data than structured data, you need a lot more energy to extract what you're looking for out of it, and hence the accelerated computing instances play a better role, in terms of efficiency, than a general purpose CPU for processing that chaotic data. That's one point. The second is that all of these processors, Tobias, are Turing complete, so you can run anything on a CPU as well; it's going to be functionally correct. But if you want to break the problem down to the right computing element, maybe you configure the gates in an FPGA to parse JSON more efficiently than doing it on a GPU. Maybe that's the right thing to do. Or if you're doing PDF extraction, nowadays it's all done with models, so there is no need to hand-parse the PDF; you can just run a model, and a GPU is better for that. So depending on your algorithm and your data type, the choice of the engine or the computing element may change. And, obviously, data affinity and all the other things you mentioned in your question are also taken into consideration.
So, hence, it is well served for the industry and the user if all that complexity is hidden from the end user. Otherwise, adoption will be too slow, and it will break the system. That's what companies like us are working hard to solve: how do I re-express, if I may use the word, the algorithm for this new machine, for the new animal that is handed to us. And if I may quote Jensen Huang, NVIDIA's CEO, he called it an insanely hard problem. Because we are not in a world where you are handed a 100-gigahertz processor and you just recompile the software for it. Now you have to rethink: okay, I need a new sort, or new JSON parsing, or new PDF extraction.
How do I re-express that algorithm for the new target machine? That is the problem. And all of that is buried and deeply integrated into the underlying stack, in the case of what we build at Datapelago. But I'm sure many other companies are also working to solve that problem.
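To make the scheduling idea concrete, here is a hypothetical Python sketch, not Datapelago's actual algorithm, of the kind of heuristic such a layer might apply when weighing data type, compute intensity, and data affinity to pick a compute element. The profile fields, thresholds, and mappings are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class OperatorProfile:
    data_type: str           # "structured", "semi_structured", or "unstructured"
    compute_per_byte: float  # rough ratio of compute work to bytes moved
    data_is_local: bool      # is the data already near this node?

def pick_compute_element(op: OperatorProfile) -> str:
    """Hypothetical heuristic: send chaotic, compute-heavy work to accelerators,
    keep IO-bound or lightweight work on the CPU close to the data."""
    if op.compute_per_byte < 0.1:
        return "cpu"    # little compute to accelerate (Amdahl's law)
    if op.data_type == "unstructured":
        return "gpu"    # model-based extraction favors GPUs
    if op.data_type == "semi_structured":
        return "fpga"   # e.g. JSON parsing mapped onto configurable gates
    return "gpu" if op.data_is_local else "cpu"

print(pick_compute_element(OperatorProfile("semi_structured", 2.0, True)))  # fpga
print(pick_compute_element(OperatorProfile("structured", 0.05, False)))     # cpu
```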
[00:32:17] Tobias Macey:
And so, if I can be a little bit reductive to summarize what you're doing, the way I'm thinking about it is that it's analogous to being the LLVM for data compute.
[00:32:37] Rajan Goyal:
Yeah. I mean, that's a big, high bar, if we may get there. But, yes, that's a good way to put it, although maybe there's a simpler analogy I could mention. But, yeah, we can leave it at that, Tobias.
[00:32:53] Tobias Macey:
That makes sense. As you mentioned, LLVM is a high bar. It has had a lot of engineering time go into it, and it's being used in a number of different use cases. And as you said, the problem that you're tackling is also, to quote you, insanely hard. Given that, I'm curious what are some of the engineering challenges you're tackling, some of the foundational computing elements that you're having to reconsider or look at from different angles to understand how best to approach the problem efficiently?
And also, to tack on one last question to this run-on question: because it is a very engineering-heavy effort, I imagine there is a lot of work you have to do to think about how to market it to people, both to consumers and to potential investors, to make sure that you're given the appropriate time and space to actually invest in that engineering effort.
[00:33:55] Rajan Goyal:
Yeah. Yeah. Makes sense. So let's tackle the last one first. Yes, it's like I'm using your light to shine on where to go. You have taken me pretty deep into the stack, and hence the technology details come out. But otherwise, you're right: from the user's perspective and the investor's perspective, we want to make it as simple and frictionless to insert and understand as possible. That's the whole UDPE idea: accelerating Spark and Trino without changing the application. The things we just talked about enable us to do so. All of those complexities are inside our stack, inside the software, not visible to the user, whether it's the choice of hardware, the size of the cluster, where to schedule work, or how to break up algorithms. This is all done by our software, so nothing is exposed to the user. The second, bigger part of your question was: what are the complexities, the challenges we are solving? One we just touched on is re-expressing your algorithm for the new primitives of the computing element we are offered. That's one part. The second part is that it is a large distributed systems problem, Tobias, and there's something called Amdahl's law. If you focus on one function to accelerate, and that function is only used a fraction of the time, you will only get that fraction's portion of acceleration in the overall impact, from a value proposition point of view. So you need to be very, very careful with Amdahl's law, that you're not spending energy on a small portion of the workload. And in distributed systems, it's the IO: the data movement from storage to compute, compute to compute, and even within a node, from your memory or drives to the CPU. And now, when you're talking about CPUs and GPUs, you're moving data from the CPU over PCIe to the GPU and back. Those are all the flash points or hotspots.
You have to be very, very careful and design a system that is impedance-matched between compute and IO. It's a pretty sophisticated distributed system. Not, again, to scare the audience or the user, but this is all done underneath by the software: to, a, pick the right computing element; b, pick the right algorithm to run on that element; and, c, build an impedance-matched distributed system so that the user actually gets the value out of it. That's what makes this problem insanely hard, those three things we just talked about. And the fourth thing is to make it appear to the user as if nothing has changed, which makes it exponentially harder. But as we were talking about these meta concepts, the lakehouse, open table formats, separation of compute from storage, Gluten, all of those things have enabled us to realize our vision while making it seamless from a user point of view. But there is no escape from fundamentally solving the three things I just told you; otherwise, it's not going to deliver the value, the potential, that this new architecture has. I hope this is clear, Tobias.
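For reference, Amdahl's law quantifies the point being made here: if a fraction p of the total work is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). A small worked example in Python, with numbers chosen purely for illustration:

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Accelerating an operator that accounts for only 30% of runtime, even by 100x,
# yields under 1.5x end to end -- hence the attention to IO and data movement.
print(round(amdahl_speedup(0.3, 100), 2))  # ~1.42
print(round(amdahl_speedup(0.9, 10), 2))   # ~5.26
```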
[00:37:19] Tobias Macey:
Absolutely. And as you're talking about the distributed systems nature of the problem, and the fact that you have to impedance match the compute with the IO capabilities, it also brings in the question of what low level storage primitives you're working with, especially because it sounds like you aren't in a position to standardize or enforce that, since you're processing already existing data. So it's not like you're building a full database system where you can decide between an LSM tree, or copy-on-write versus merge-on-read. You have to work with the data as it already exists across a wide variety of storage approaches and be able to process it as efficiently as possible.
[00:37:59] Rajan Goyal:
That's correct. Yes. I think the industry has already shown, let me say it this way, that it is willing to pay the price of disaggregating and getting away from these proprietary control structures wrapped around the data. There is a penalty you pay for that, but the trade-off is that now you are disaggregated, and that is an architectural knob with more benefits. That is already realized; people are willing to pay that cost. Where we come into the picture is: okay, you are already in deficit because of that departure from the monolithic architecture to this disaggregated architecture. How do we bridge that gap and even make it better? And hence, what we just went over, this accelerated computing, the impedance matching, and all those things, enables us to not only cover the deficit, but also make it future proof as the data size grows and the computing capability improves. That's how I look at it. But you're right that the starting point is that we don't have the luxury of storing data in a format friendlier to us, and we don't want to do that. That's where vendor lock-in comes in, and enterprises don't want that. They want the freedom to store once and use as many times as they like. If I'm storing in a proprietary format suitable for one engine, I'm locked in by definition to that engine. That is an okay compromise only as long as you're able to deliver on the promise. Hence, I think accelerated computing plays a big role in realizing the overall vision of the lakehouse, of this separation.
Otherwise, that deficit will become so huge that people will ignore the future benefits, and the pendulum may swing back to the monolithic approach because the open one is not delivering on its promise. So I think we have some role, in a humble way, in enabling this vision and making it more future proof using accelerated computing and the things we just talked about.
[00:39:50] Tobias Macey:
Yeah. And in terms of the impact on behavior and prioritization for teams who are working with these data lakehouse systems: today, maybe they are running a Trino cluster, they execute a query, and it takes some number of seconds because it's executing across hundreds of gigabytes or terabytes of data. Then they bring in Datapelago, and that drops from multiple seconds down to single seconds or subseconds, whatever the order of magnitude improvement is. I'm curious how that shift in speed and efficiency and capability changes the ways that they think about what data to use for which problems.
[00:40:34] Rajan Goyal:
Yeah. I think, you know, there is a certain angle to this. Right now, the way the business model works, you're always paying for consumption of something, in this case compute resources. If you shrink that, it directly translates to TCO savings, unless the vendor is charging a premium for the performance improvements, which many companies do. But otherwise, fundamentally, the performance has the benefit of reducing your TCO. That's one part I wanted to address. I missed your second part, Tobias. Can you repeat the second half of the question? Yes.
[00:41:14] Tobias Macey:
So just wondering how the increase in speed and efficiency of that data processing changes the ways that teams think about what data to use for which problems and how it maybe unlocks greater potential for bringing larger volumes of data to bear on the organizational challenges.
[00:41:32] Rajan Goyal:
Right. So two parts to that. One is the business answer, and then there's a technical answer. The business answer I hear from many CIOs is: the TCO savings on my current workload frees up my budget to do exploratory work that I would not have done otherwise, because they want to do AI processing or gen AI or fine tuning, whatever is the thing today. So that is the business advantage of reducing TCO on current spend. That's one part. The second part is that there are many use cases we make practical which were not viable earlier, because now we are able to meet SLAs with this new architecture without needing to put the data into, let's say, a proprietary data warehouse. If you can do that, it enables many, many new use cases. For example, in the telco world, we are involved with a company that has to meet certain deadlines by morning: all the data ingested overnight needs to be processed and ready by the time the next day starts. Earlier, given the budget, they would have to cut corners on either the quality of the processing or the amount of data they handle, and that affects their relevance and their ability to stay competitive the next day. So the speedup, and the extra volume of data or deeper analysis they can do within the same time frame, enables them to be sharper and more competitive the next day. That's a real use case we are enabling.
But right now, given the current macroeconomics and otherwise, everybody is under stress, I would say, to reduce their TCO. Because, all said and done, this pay-as-you-go model looks okay in the beginning, but the moment you start using it at scale, the reality starts sinking in on the cost. That's where many people are grappling, and the technology we are building at Datapelago enables them to address that fundamentally. We are not playing a margin game of a 10% savings here and there; we are fundamentally changing the price and performance assumptions. In that case, we enable them to reduce their TCO, help them scale, as in the telco example I gave you, or free up the budget to do things they want to do but for which, practically speaking, there is no budget.
[00:44:00] Tobias Macey:
In the work that you have done with some of your early customers and design partners, what are some of the most interesting or unexpected ways that you have seen those capabilities applied to their business problems?
[00:44:16] Rajan Goyal:
I think, to be honest, I will not have that many data points to generalize from yet, so let me caution my next answer with that. But, you know, there are many verticals we have not touched yet: manufacturing, whether it is in cars or in the semiconductor fab, the shop floor, the factory, where so many sensors are generating data. So there are lots of use cases where efficiency, performance, and latency reduction are important, and we will enable that. But right now, more or less, I can generalize that the TCO savings is important.
Using that savings to increase the amount of data they process, and how that helps, is one part. The second part, which is also obvious now in machine learning, in model development and model training workloads specifically, is that data scientist time is very important, and how much time they spend on processing and preparing data matters, because it gets in the way of how iteratively and how fast you can train models. So that's another area where we help, speeding up the data preprocessing for model training, which translates to saving the data scientist's time and so on. So there are a few higher level values we see, but next time we talk, I can give you a little bit better insight than that. Yeah.
[00:46:01] Tobias Macey:
And in your work of investing in this problem space and working at this layer of the technology stack, which sounds like it is maybe even a little bit higher up than some of your previous companies, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:46:19] Rajan Goyal:
Well, you learn a lot, Tobias. First, as an entrepreneur, as a CEO, as a founder, it's learning every day. I say that building a feature versus building a product versus building a company is an exponential scale of complexity. So there is a lot of learning, technically. We are breaking new ground every day, because many of the areas we have to solve are unsolved problems. You know, how do you move data, or how do you process data whose size is in terabytes when your GPU memory is only 16 gigabytes? How do you manage that? That's a technical problem to solve, and there are many more such things. Then there are business model related problems to solve: how do we insert into the user's space without disrupting the ecosystem? That's another place to innovate.
And then, obviously, third is that it's a fast evolving market and problem set. Data types are changing and so on. How do you build an architecture which has legs for future proofing? So there are a lot of areas where we are learning, and I'm personally learning daily, while working on this problem.
[00:47:48] Tobias Macey:
And for people who are looking for ways to increase the efficiency and speed and flexibility of their data systems, what are the cases where Datapelago is the wrong choice and won't have the impact that they're hoping for?
[00:48:04] Rajan Goyal:
I think one is a fundamental answer, which is that our first attempt is to do an impedance match between data movement and compute; if computation is the bottleneck, we will free it up. That's the somewhat fundamental answer I'm giving you, Tobias. So if you find a workload where you're doing very little processing, but you're mostly just moving data around, there is probably not much of a role for a GPU to play. That's the fundamental answer. But I would not give up too early in that journey, because we are still evolving and solving. Right now, with today's state of the art from Datapelago, we look for cases where you're processing complex data, you are spending a lot of compute cycles on it, and there is data movement from storage, but relative to that you're spending a lot more on compute. That's the highest order bit for seeing where Datapelago will be applicable, because we are attacking the computation part of the problem more than anything. We are efficient in selecting how to move data, sure, and there are many techniques like dynamic filtering or row-group-based pruning in Parquet to be very selective in only reading the data you need; those are complementary techniques to us. But once that is done, if you are still reading too much data and the ratio of IO to compute is skewed, then there is probably not much value Datapelago will add right now.
In my previous life, we solved the data movement problem; now, in the data context, it's the computational problem we're solving.
[00:50:05] Tobias Macey:
As you continue to build and invest in and move towards general availability of the technology that you're building at Datapelago, what are some of the capabilities or specific projects that you're excited to dig into, or any of the features that you're building towards?
[00:50:21] Rajan Goyal:
Yeah. The good news, or the bad news, however you want to take it, is that it's an endless world. Why is it endless? Let me share a few dimensions in which we want to, and will, go. One is the world of data platforms itself. You handle Spark, you handle Trino, and then there are many more that can benefit from this accelerated computing. That's one dimension. The second is the data itself: data types are changing to multimodal, unstructured, and semi-structured data, and the kind of computation you do, the kind of models you want to bring in, is another dimension in which the platform will evolve. And the third is the application itself: what is your distributed framework? Is it Ray or is it Spark?
That is another dimension to go in. Fourth could be multi-cloud, which is obvious, from hybrid and on-prem to the public clouds. And now there is this new wave of GPU clouds emerging; how about we address data processing there? So there are many, many dimensions ahead of us, Tobias, to go and innovate in, and many of them are on our roadmap and being actively worked on.
[00:51:38] Tobias Macey:
Are there any other aspects of the work that you're doing at Datapelago, the specific problem of efficiencies in data lake architectures, or the overall evolution of data processing, particularly as generative AI brings unstructured data more into the forefront, that we didn't discuss yet and that you would like to cover before we close out the show?
[00:52:00] Rajan Goyal:
Not really. I think you covered pretty broad areas. I hope this is useful. We have not covered how data processing for the generative AI world is evolving, but we can save that for next time. But, yes, there is a very fascinating world emerging in front of us. Obviously, the data is changing, but so is the kind of data processing. Actually, I will leave this thought for you and your audience: the two pipelines are also converging. People want to apply analytics primitives along with unstructured data processing together. What does that workload look like?
Where, you want to do joins as well as extract metadata from PDF. Right? So that's a problem to be solved, and that's where I think the direction, platforms will take. Like, on a single platform, how do I do unified processing, processing both for structured and unstructured data? Right? That's that's the next frontier, I would call it, for for us as well as an industry.
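As a rough illustration of the converged pipeline described here, a PySpark sketch might look like the following; the pypdf library, the file paths, and the column names are assumptions for illustration only, not anything from Datapelago's product.

```python
import io

from pypdf import PdfReader  # assumed to be installed on the executors
from pyspark.sql import SparkSession, functions as F, types as T

spark = SparkSession.builder.appName("converged-pipeline-sketch").getOrCreate()

# UDF that pulls a couple of metadata fields out of raw PDF bytes.
@F.udf(T.StructType([
    T.StructField("title", T.StringType()),
    T.StructField("author", T.StringType()),
]))
def pdf_metadata(content):
    try:
        info = PdfReader(io.BytesIO(content)).metadata
        return (info.title, info.author)
    except Exception:
        return (None, None)

# Unstructured side: raw PDFs read as binary files (path, content, ...).
pdf_docs = (
    spark.read.format("binaryFile")
         .option("pathGlobFilter", "*.pdf")
         .load("s3://example-bucket/contracts/")
         .withColumn("meta", pdf_metadata(F.col("content")))
         .select("path", "meta.title", "meta.author")
)

# Structured side: an ordinary Parquet table of contract records.
contracts = spark.read.parquet("s3://example-bucket/contract_records/")

# The "converged" step: a relational join keyed on the extracted metadata.
pdf_docs.join(contracts, pdf_docs.title == contracts.contract_title, "left").show()
```

The point of the sketch is only that the relational half (the join) and the unstructured half (the extraction UDF) run in one plan on one platform, which is the convergence Rajan is describing.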
[00:53:03] Tobias Macey:
For anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:53:19] Rajan Goyal:
What I can answer is that with generative AI, there is a lot of innovation that has to happen, and it is already happening. How do you use natural language to express your intent, for example? From natural language to SQL, or natural language to Python, or whatever. There are many initiatives in that space. I think that is probably one area where the industry will evolve, learn, and improve, to make the interface easier so that even people who are not familiar with SQL can use your data platform. So that's one area, but I'm sure the world is full of smart people, and they will think of something better than what I can contribute right now on that problem.
[00:54:06] Tobias Macey:
Absolutely. And everybody's always filling in different gaps, which then expose other ones, so it's a very fractal problem space. Thank you very much for taking the time today to join me and share the work that you're doing at Datapelago and your insights into the fundamental challenges of data efficiency in these disaggregated architectures, both in the software and the hardware layers, as we work in these cloud environments. I appreciate all the time and energy that you're putting into improving some of those efficiencies, and I hope you enjoy the rest of your day.
[00:54:38] Rajan Goyal:
Thank you, Tobias. It was wonderful and a pleasure, and many of your questions were insightful. I hope it is useful for the audience. Thank you for doing this.
[00:54:56] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Data Engineering Podcast
Interview with Rajan Goyal: Improving Data Processing Efficiencies
Understanding System Architectures in Data Centers
Open vs Closed Data Ecosystems
Challenges in Disaggregated Systems
Datapelago's Approach to Data Processing
Unifying Structured and Unstructured Data Processing
Impact of Datapelago on Data Processing Efficiency
Lessons Learned and Future Directions