Summary
Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data being generated continue to double every few years, requiring further advances in platform capabilities just to keep up. As the sophistication increases, so does the complexity, creating challenges for the user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve them.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Jignesh Patel about the research that he is conducting on technical scalability and user experience improvements around data management
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by summarizing your current areas of research and the motivations behind them?
- What are the open questions today in technical scalability of data engines?
- What are the experimental methods that you are using to gain understanding in the opportunities and practical limits of those systems?
- As you strive to push the limits of technical capacity in data systems, how does that impact the usability of the resulting systems?
- When performing research and building prototypes of the projects, what is your process for incorporating user experience into the implementation of the product?
- What are the main sources of tension between technical scalability and user experience/ease of comprehension?
- What are some of the positive synergies that you have been able to realize between your teaching, research, and corporate activities?
- In what ways do they produce conflict, whether personally or technically?
- What are the most interesting, innovative, or unexpected ways that you have seen your research used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on research of the scalability limits of data systems?
- What is your heuristic for when a given research project needs to be terminated or productionized?
- What do you have planned for the future of your academic research?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Carnegie Mellon University
- Parallel Databases
- Genomics
- Proteomics
- Moore's Law
- Dennard Scaling
- Generative AI
- Quantum Computing
- Voltron Data
- Von Neumann Architecture
- Two's Complement
- Ottertune
- dbt
- Informatica
- Mozart Data
- DataChat
- Von Neumann Bottleneck
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
Hello, and welcome to the Data Engineering podcast, the show about modern data management. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data.
Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey, and today I'm interviewing Jignesh Patel about the research that he is conducting on technical scalability and user experience improvements around data management. So, Jignesh, can you start by introducing yourself?
[00:01:20] Unknown:
Yes. Hi. Nice to talk to you and to your audience. I'm Jignesh Patel. I'm a professor in computer science at Carnegie Mellon. I've been working in the area of data for about 25 years now, and I've worked on things in data across the spectrum, through the different ages that the data ecosystem has gone through: from parallel databases to streaming databases to mobile databases, to using databases for genomics and proteomics and other biological applications, to where we are right now, where we are trying to use GenAI to make data analytics far easier for humans, so they can get insights from data.
[00:02:00] Unknown:
And you mentioned that you've been in the space for a while. Do you remember how you first got started working in data?
[00:02:06] Unknown:
Yeah. I first started working in data when I came to the University of Wisconsin as a grad student. This was in the early nineties. I actually came there to do computer architecture, but Wisconsin had an amazing group, one of the leading groups in databases at that time. And once I started taking a couple of classes there, that's how I decided to switch over to databases. So it was not the plan that I had, but it was the strength of the group that was at Wisconsin at that time that really drew me into databases.
[00:02:42] Unknown:
You are, as you said, a professor. You work at Carnegie Mellon, which is one of the leading schools for database research today. And I'm wondering if you can just start by giving a bit of a summary of some of the current areas of research that you're focused on and what it is about those subjects that motivates you to invest the time and energy required to gain meaningful results.
[00:03:05] Unknown:
Perfect. Sounds great. And maybe a little bit of context. Carnegie Mellon is where, many computer scientists will say, AI was invented. And if you go back to the birth of the study of data in academia, Wisconsin, Berkeley, and Purdue were among the earliest schools that really started to do that. So I've been really fortunate to be at powerhouses of data and AI. And, of course, at Carnegie Mellon, both data and AI are present today. Of course, the data research ecosystem and product ecosystem have gone through different phases. Where my research is today, and where I think many of the interesting, forward-looking research problems are (and today's forward-looking research problems are very likely the products that will make a difference in a few years), is along the two edges. I just alluded to how I started initially as a grad student being attracted to architecture, which is, you know, making processors and storage devices and things like that that get used as the computing substrate on which you build your algorithms and software.
And today, my research is broken into two parts. One is on the architecture end of the spectrum, and the other is on the human end of the spectrum. So think about what we do in data platforms today. Right? The data platforms are largely software. They will run on some hardware, and we want these data platforms to work with large volumes of data. We want them to be extremely fast, and we want them to be versatile. And, of course, we want all of that to happen in a cost-effective fashion. At the other end, we want these data platforms to be very easy to use by humans of all types, not just programmers, and there's a ton of research in there. So the first part of my research, which is purely in academia right now, is on the data architecture side. So what's the interesting aspect over there?
So here's the backdrop. In many enterprises, data has been doubling in size roughly every two years or so, and this is growth that has been happening for 30 or 40 years for many organizations. In the past, the way you dealt with that was to say: okay, I've got a data platform, and it's doubling in volume every few years. I obviously can't pay twice for all of my analytics, all of my queries, every two years. That would be unsustainable. So I need to keep the cost the same, or perhaps even start to lower it. The one big boost we used to get in the past, for data platforms to keep up with that demand while keeping cost constant, was to say: let's just upgrade to the latest hardware.
Because everyone was riding Moore's Law and the underlying principles of Dennard scaling, which meant that if I upgraded my computing substrate to the latest generation of storage, compute, and memory devices, which were all 2x faster, and my data volume doubled, I kept that constant cost profile for my analytics pipeline. But all of that has stopped, and a big part of my research at Carnegie Mellon now is how we build long-term sustainable platforms where we can keep up with this growth in data demand (and it's not just growth, but we are asking deeper and deeper questions of data, which adds additional stress) and still keep this cost balance. The gift of Moore's Law hasn't fully ended yet, but we all know that, you know, five years out, it probably doesn't keep giving us the dividends it has for the last 30 to 50 years. So that's one end of the spectrum, and the other end of the spectrum is using GenAI to make data platforms more programmable. And I can talk about that other part, but before that, let me turn it over to you and see if you have questions.
[00:06:43] Unknown:
You mentioned Moore's Law as our saving grace for the past few years, and we are still somewhat benefiting from that by increasing the number of transistors, but we're not getting better clock speeds. We are adding more cores, and we're starting to reach the logical limit of that as well. And as we go down the nanometer scale, we start to hit physical limitations of what we can even fit on a chip, which brings up the specter of quantum computing. And I'm wondering what the viability is of that as our saving grace for the next few decades, and if there's any analogous equivalent in quantum processing to the idea of Moore's Law?
[00:07:20] Unknown:
Yeah. Great question. You pointed out that Moore's Law is not dead. I agree. We are still getting denser packaging of transistors, and the other big thing that's happening is that now we are going 3D. Right? Storage and chips are all becoming three-dimensional; it used to be all planar and two-dimensional. So there's some life in that packaging story, but energy dissipation becomes a problem. So we'll continue to get the gift of Moore's Law, or the behavior that we've been expecting of hardware, for a little while, but not forever. You know, I don't think anyone says that beyond a decade we are going to keep seeing that, and even that for some is a stretch.
Great question about quantum computing, and that certainly has the potential to revolutionize certain aspects of computer science, especially the ones in which you're trying to solve an algorithmic problem and trying to find some optimization; there are potentially huge opportunities over there and, of course, in crypto. But there's a well-known result, now more than two decades old, that for some of the core data problems like sorting, you can't do it any faster even if you have an ideal quantum computer. Furthermore, at this point many organizations are working with terabytes, and many organizations are now working with petabytes of data. You can't even push all of that data through a compute unit. So quantum computing for that type of data analytics, I don't think that's a possibility, at least as far as I can see.
It certainly might have implications in certain smaller components of what you do in the broader data ecosystem, but it's a different problem space. So we need to start finding ways to get the data-to-insights pipeline through more traditional methods, and nothing other than the traditional semiconductor-based hardware substrate ecosystem is likely to be the answer for a very long time.
[00:09:23] Unknown:
And also with quantum, it will likely bring up the same problems that we're having now with GPUs, where it is a coprocessor. It's not going to supplant classical computing, and we're likely to hit a point where, as it gains popularity and adoption, we're not going to have enough capacity for it. And so I wonder if we'll end up back in the time-sharing model, where everybody can submit their requests in batch, and you just have to wait for them to come back.
[00:09:50] Unknown:
Yeah. And, look, I'm not an expert in quantum computing, but today you can go and rent a quantum computer from many of the cloud providers. Yes, it is harder to get time on that, definitely compared to a GPU. With a coprocessor, often in a data-intensive environment, the processors have to be sitting very close to each other, because the IO, the cost of transferring data from one side to the other, is often the bottleneck; it's called the von Neumann bottleneck. That's already a big problem in CPU-GPU databases. We don't fully know how to use GPUs well for large-scale data platforms, and there are some big companies working on that; one of the leading companies that does that is Voltron Data down in the Bay Area. But there are lots of hard problems even with the simpler processing substrate. And, as I said, I'm not an expert in quantum computing, but that's not something nearly anyone is really looking at as a viable computing substrate for this type of data processing. For cryptography, you know, code cracking, stuff like that, obviously, that's where all the excitement is. But for the data world, I think that's quite far out. There have been research papers that have explored using it for certain components, but nothing I can see becoming mainstream anytime soon, for very fundamental reasons. And unless those fundamental reasons get solved, which probably needs a totally different type of quantum computer and totally different ways of getting data in and out at high speed, that's not a viable path for the data direction.
[00:11:19] Unknown:
Continuing on your point of IO being the biggest bottleneck as we scale the volume and complexity of data and the types of analytics that we're trying to build on top of it, what are the future directions that we can look to to try to realize that either constant or declining cost as the volumes of data increase, whether that is in terms of the physical hardware or some of the semantics of how we work with data or the ways that we think about storing and accessing data? I'm wondering what are some of the areas of research that you're focused on to help address those problems.
[00:11:55] Unknown:
Yeah. That's a great question. So the part that we are focused on is something a little speculative, which computer scientists and architects and data folks have kept coming back to, which is this: the traditional von Neumann architecture says that I've got a compute device and I've got a storage device. They are connected by some communication component, and you have to pull the data through that communication channel to the computing device, do stuff with it, and when you're done computing, push it back. Right? So there are two separate devices, and today that's largely how your laptop or your individual server or even your phone works, all the way up to entire cloud data centers, which have a compute portion of the cloud and a storage portion of the cloud. So that separation of compute and storage exists everywhere. But as you can imagine, it is very inefficient.
In many data analysis pipelines, you are going to scan a large amount of data, and, really, the core of the compute that you're going to do is going to be on a very small fraction. And in many data pipelines, you have a very small number of cycles per byte of data that you're going to access. So there's been this idea, in different forms, for the last 30-ish years, which is to say: can we push compute to the storage? Right? Why are we bringing data through what is effectively a narrow straw, one that is relatively getting narrower and narrower because the device capacity for storage is increasing faster than the channel capacity to pull data out? Why can we not think about devices not as pure storage devices and pure compute devices, but have devices that can do both storage and compute, so you're not pulling stuff in and out of the device across these two separate modes of working with data? And so this idea of pushing compute inside storage, or pushing compute closer to storage, has been around for 30 years in a variety of different forms.
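To make the near-data-processing idea concrete, here is a minimal toy sketch (the cost model, names, and numbers are illustrative, not from any real device): it contrasts shipping every record to the CPU with running a cheap filter next to the storage, counting the bytes that cross the channel.

```python
# Toy model of near-data processing: count bytes crossing the
# storage-to-CPU channel when filtering a million 4-byte integers.

import random

random.seed(42)
data = [random.randrange(10_000) for _ in range(1_000_000)]
RECORD_BYTES = 4  # assume fixed-width 4-byte integers

def pull_everything(records, threshold):
    """Classic von Neumann style: ship every record to the CPU, filter there."""
    transferred = len(records) * RECORD_BYTES
    matches = [r for r in records if r < threshold]
    return matches, transferred

def filter_in_storage(records, threshold):
    """Hypothetical smart device: a tiny filter runs next to the data,
    so only matching records ever cross the channel."""
    matches = [r for r in records if r < threshold]  # happens "inside" storage
    transferred = len(matches) * RECORD_BYTES
    return matches, transferred

for fn in (pull_everything, filter_in_storage):
    matches, moved = fn(data, 50)
    print(f"{fn.__name__}: {len(matches):,} matches, {moved:,} bytes moved")
```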
We are spending a fair amount of our time looking at that. But by the way, none of that work has quite become a reality just yet. Right? You still have the separation, as I just said; even the cloud, at a high level, has the separation principle for a variety of reasons, but it's inefficient. The reason a lot of these techniques have not had a big commercial impact is that it's very hard to figure out the right amount of compute to push into the storage without blowing up the cost of manufacturing the device. Suppose I've got memory or flash storage and I want to put smart compute inside it (by the way, we already do that in some forms in practical storage devices today). The question is how much compute do I put in there, how programmable is that compute, and what else can I do with it? And all of those considerations matter because many of these storage devices are very low-margin devices. If you say, I'm going to put $5 more into a $100 device, that's way too much; even a dollar is sometimes a little too much. So what we are looking at is taking a very fundamental, arguably very theoretical and academic approach, which is to go back and pretend we were in the 1950s or 1960s, when we were just starting to build these computing systems. I'll give you an example of a very fundamental question that we are asking. Today, if I represent a number and store it in digital form, I'm going to convert it into a two's complement representation and store it in the device.
For the rest of this, I'll make my example be in decimal form. Right? So imagine I've got 4-digit numbers that I want to store, and I'm storing the number 1,000, which would be the digits 1 0 0 0 in decimal form. The number 2,314 would be 2 3 1 4, and so on. Now imagine I also had numbers in there like 5 and 6 and so on. If you look at the digit representation of those, all the leading digits are going to be zeros. And what we typically do in a computer when we are storing, let's say, an array of numbers is store it so that we have the first number represented in storage first, then the second number, and so on. Now when you're searching over these numbers and I say, find me everything that is less than 5, I'm actually going to go through all the digits for all the numbers before I can find my answer.
But now imagine we said we're going to represent numbers in a totally different way. I'm going to represent the thousands position for each number first and keep the thousands digits for all the numbers together. So if I go and fetch some data from memory, I'll get the thousands digit for each of the numbers first, before I get the hundreds place, the tens place, and the units place. And now with that, you can come up with a completely different class of algorithms. Say I've got 10 numbers, and I just look at the thousands digit for each of them. If all of them are nonzero and I'm looking for everything that is equal to 5 or less than 5, I can simply say: for these 10 numbers, I don't even need to look at the remaining digits for any of them. I can algorithmically guarantee you that the answer is not present in this set of numbers.
So that's the way we are thinking. We are going back to early design and asking: what's the fundamental encoding of numbers? What's the fundamental way we want to represent them in storage? And then can we come up with a completely new class of algorithms that have algorithmic superiority in search compared to existing methods? So we think that in this space, there are two ways we will win and solve this long-term data problem. One is by rethinking algorithms from the ground up to be aware that storage and compute can go together, so I can push down specific algorithms that require very little compute and get this benefit. And the second is to design the computing substrates that are low cost, very cheap, and can actually be put into storage devices in an economical way.
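As a rough illustration of the digit-major layout he describes, here is a minimal sketch (the function names and the fixed 4-digit width are my own choices for the example): the thousands digits for all numbers are stored together, so a search for small values can often stop after inspecting only that leading digit plane.

```python
# Sketch of the digit-major layout: store the thousands digits for all
# numbers together, then the hundreds digits, and so on, so a search can
# often stop after inspecting only the leading digit plane.

def to_digit_planes(numbers, width=4):
    """Return `width` lists; planes[0] holds every number's leading digit."""
    digit_strings = [str(n).zfill(width) for n in numbers]
    return [[int(s[i]) for s in digit_strings] for i in range(width)]

def may_contain_leq(planes, threshold=5):
    """Cheap prune using only the leading (thousands) digit plane: if every
    number has a nonzero leading digit, every number is >= 1000, so no
    number can be <= a small threshold like 5."""
    return any(d == 0 for d in planes[0])

numbers = [1000, 2314, 7777, 4096]
planes = to_digit_planes(numbers)
if not may_contain_leq(planes, threshold=5):
    print("pruned: no value <= 5 can exist; skipped the remaining digits")
```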
So it's a long answer and futuristic, but that's kind of the way we are thinking. We're imagining it's the 1950s, and asking: if we were doing this from scratch, knowing what we know now, would we do things differently? And we're having a ton of fun exploring these ideas, which are admittedly a little bit theoretical right now. Another element of this problem of being able to get the data that you want through the processor in as efficient a manner as possible,
[00:18:21] Unknown:
indexing has long been one of the primary ways of doing that, at least in the context of database systems or database-like systems. And I know that there's also been a lot of recent research in terms of how best to construct and maintain those indexes, and I'm wondering if there is anything noteworthy in that space that you've been focused on, whether to rethink how we build these indices, how we think about applying indices, or ways that indices can maybe be applied outside of the constraints of a database engine.
[00:18:52] Unknown:
Yeah. Those are great questions. There's this really tough balance between saying, I want to build an index, but I don't know which index to build. So there's the first part of the problem: just knowing which indexes to build. And there's a ton of research asking, can machine learning be used to automatically study workloads and build indices? My colleague at CMU, Andy Pavlo, has a whole startup that is looking at problems like that, of physical database design and automatically tuning it. Then there are all kinds of new techniques to try to figure out whether you can build index structures and summary structures that have lower cost than traditional indices.
So, structures with very, very low cost that still give you some of the larger filtering benefits of indices; that's an active area of research, and we've got some efforts going on in that direction. There are also really interesting aspects because now data is scattered across different storage devices and spread across the network. So how do you index that? Sometimes data is replicated, and so when I'm searching and trying to answer a query, I can look at any one of those replicas. So there's a ton of open-ended research problems in that entire space, and there are some automated tools out there to help you with that.
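One familiar example of such a low-cost summary structure is a zone map: per-block min/max values that let a scan skip blocks that cannot match. Here is a minimal sketch I'm adding for illustration (the block size and names are arbitrary, and this is not any particular system's implementation):

```python
# Sketch of a zone map: per-block min/max summaries that let a scan skip
# blocks that cannot contain matching rows, at far lower cost than a B-tree.

def build_zone_map(values, block_size=1024):
    zones = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        zones.append((start, min(block), max(block)))
    return zones

def scan_with_zone_map(values, zones, lo, hi, block_size=1024):
    """Only touch blocks whose [min, max] range overlaps [lo, hi]."""
    hits = []
    for start, zmin, zmax in zones:
        if zmax < lo or zmin > hi:
            continue  # whole block skipped without reading it
        block = values[start:start + block_size]
        hits.extend(v for v in block if lo <= v <= hi)
    return hits

values = list(range(100_000))  # clustered data is where zone maps shine
zones = build_zone_map(values)
print(len(scan_with_zone_map(values, zones, 10, 20)))  # -> 11
```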
[00:20:09] Unknown:
This also brings to mind some of the lessons that we learned from the beginnings of the big data era, when the common wisdom was to just throw all the data in there: it'll be useful eventually, we don't know what we're going to do with it right now, but just keep it all. Now, as big data has become more widely adopted, we have a better understanding of how to actually apply useful algorithms and analytics on top of that data, and the regulatory environment has shifted. It's very much a case of only store the data that you actually have utility for, because otherwise it's going to cost you, both monetarily and potentially in terms of reputation if there's a breach, or in terms of fines if you are violating any regulations. And I'm wondering what you have seen in terms of some of the ways that we can design systems to assist in that upfront pruning of data, rather than just throwing all the data in a big black box and hoping that we get some value out of it down the road?
[00:21:08] Unknown:
Yeah. No. Great question. I think there's still a lot of what you described, which is throw the data in and find value later. One of the big transformations that has happened is that in the past, people would say: to construct a data analysis pipeline, I'm going to extract, then transform, then load it into a database, and then start my analysis. Then there was this whole paradigm shift of saying: I'm going to extract, load, and then transform, so I don't need to get the schema right up front. But more realistically now, especially when you see things like lakehouses, the whole idea is to throw the data into some storage subsystem, which may be structured, semi-structured, or unstructured, have some sort of metadata manager that can evolve over time, and then build your data analysis pipelines on top, where all of these components are not linear anymore. For a specific task, maybe I'm trying to build a machine learning model to do something. I may be looking at some portion of the data sitting in a structured database, a relational system, maybe Snowflake or something like that. At the same time, I may have new data that has just come in and is sitting in Parquet files, or even in unstructured files sitting in the file system. I might write some custom code in Python to extract stuff from it and blend all of this together to get some features and build them into a pipeline. So data is everywhere, and very linear ways of saying data lands here and has to be processed before it goes through don't fit, even though that's often still the predominant method.
In many emerging applications, what enterprises want is flexibility, so that you can deal with data and not have to wait for it to be formally loaded into a warehouse before you can do things. Sometimes the speed with which you're getting insights from data that's constantly arriving is really the highest value proposition. Right? The value of an insight often decays with time. The longer you have to wait to get the data into shape, through processes either human or engineered, before you can do any analysis with it, the more value is lost. So that whole ecosystem is evolving, but it's very clear that we want more flexible, compositional structures for doing structured and unstructured analysis, because analysis today ranges from the very traditional business intelligence type of work, to more augmented methods that might use machine learning to drive insights, perhaps still over structured data. And then the third part is where it's unstructured, and you're dealing with richer sets of data.
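As a small illustration of that "query it where it lands" flexibility, here is a hedged sketch (the file and column names are made up, and DuckDB is just one possible engine choice): a freshly landed Parquet file is joined against a CSV export from the warehouse without formally loading either.

```python
# Sketch: blend a just-landed Parquet drop with a warehouse extract,
# without loading either into a warehouse first. File, table, and column
# names here are hypothetical.

import duckdb

con = duckdb.connect()  # in-memory analytical engine

result = con.execute("""
    SELECT c.region,
           count(*)      AS orders,
           avg(o.amount) AS avg_amount
    FROM read_parquet('orders.parquet') AS o
    JOIN read_csv_auto('customers.csv') AS c USING (customer_id)
    GROUP BY c.region
    ORDER BY orders DESC
""").fetchdf()

print(result)
```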
Through all of that, one of the big challenges is that it's becoming harder and harder to write analysis pipelines, and it's very programmatically driven today. So there's been a ton of work where people have talked about no-code and low-code methods to allow people to do analysis of this sort, and this is kind of where the other end of the spectrum of my research is: using GenAI to allow people to generate these analysis pipelines, but to do it in a way that requires them to write no code and to use generative AI machinery to actually tell the system what to do. My startup DataChat essentially addresses this problem. You point it at a dataset.
We work with structured data. You ask it a question, and it produces the analysis for you. As part of that analysis, it may write SQL queries, it may write machine learning pipelines, it may do a combination of those, it may do visualization, and it presents the result to you. So for data in its different forms, there's the time-to-live for data; that's one consideration for sure. People don't want to hang on to data forever unless they have a reason to. But there's also the richness of data, and the richness with which you need to get insights from that data, and there are just so many more tools. And then there's the human aspect: all of that, if it requires increasing human expense to get the insights, is unsustainable too. Just as it was unsustainable on the hardware end to say I'm going to double my cost every time I double the data volume, you can't say I'm going to double my human cost for programming if I double my analysis needs. That's the other end of the spectrum, where some of these GenAI tools (the stuff we are doing in DataChat is one of many examples) address the other big challenge for the industry and for the field.
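To sketch the shape of that no-code, GenAI-driven flow (this is a deliberately simplified illustration, not DataChat's actual implementation; `call_llm` is a hypothetical stand-in for whatever model API you use):

```python
# Simplified sketch of the natural-language-to-pipeline idea: an LLM drafts
# SQL from a question plus the schema, and the generated SQL is kept as a
# verifiable "recipe" the user can inspect before trusting the answer.

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in your provider's client here."""
    raise NotImplementedError

def answer_question(question: str, schema: str) -> str:
    prompt = (
        "Given this schema:\n" + schema +
        "\nWrite a single SQL query answering: " + question +
        "\nReturn only SQL."
    )
    sql = call_llm(prompt)
    # The generated SQL *is* the transparent, reproducible recipe:
    # a user can read, verify, change, and rerun it.
    return sql
```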
[00:25:28] Unknown:
And in that space of user experience and usability of these data systems, as we get more sophisticated with the types of data that we're storing and the ways that we're analyzing the data, finding the data is always a problem. So that's the first step in utility, but then understanding what to do with it, the semantics of that data, the organizational aspects of what the data really means in the context of my business, all of these are barriers to a seamless user experience. And I'm wondering what are some of the opportunities for improving the interfaces and the semantic understanding that these data engines can contain at a fundamental level, and some of the ways that they can help to give hints to the end users, without the end user having to go and get their PhD in data management just to be able to answer a simple question. Yeah, great questions. I think there are three components to it. One is that today, the whole tooling ecosystem to even discover
[00:26:28] Unknown:
where to look in this vast lakehouse is nonexistent, and I know a lot of people are working on it. We have a research project at CMU that is just starting out to explore some of these aspects. Today, it is not uncommon to go to a large enterprise and see that they have a warehouse or a lakehouse with 100,000, if not a million, datasets sitting around, collected over time, even after pruning. A dataset might be a table, and that table might have tens or hundreds of columns in it. So you're really saying: I've got a million, or tens of millions, or sometimes even more schemas describing what's in the data.
It's not just the data values; just the description of the data is large. How do I look through it? Sometimes it's super complicated even to say, what is the profit that I made? That's a complicated question. There's a financial version of it, the methods that get used for reporting purposes in financial statements and things like that. But then there are other definitions, where even something as basic as pricing gets tricky. If I'm a retailer, do I look at all the items that were checked out from the cart as the data is flowing in? What happens about returns? What happens about projected returns? If I'm trying to do analysis on orders that were just placed, do I expect that 25% of them are going to get returned at a certain time of the year? We know that return rates go up around the holiday shopping season. So it's very complicated to even define simple things, and you don't even know where to look. That's the first challenge. Second is the semantic complexity: even the notion of something as simple as how much did I make last week is hard, and that's where many of these tools come in; you see excitement around dbt and a whole bunch of semantic tooling mechanisms. That's the second component.
For the discovery component, there really isn't much. For the semantic component, dbt and tools like that exist. And then you get to that programming layer, with all of the complexity we talked about. So even before you get to that programming level, you're exactly right: we often don't know where to look. And even when we know where we want to look, we need some sort of agreement, and the ability to communicate across different members of the team, or different teams in an organization, about the semantic meaning of the things that we see in the database, so that we can all be on the same page and then start to trust the analytics pipeline down the line. And there isn't a clean separation between these pieces. Today, when you see someone constructing a data science pipeline, let's say in a notebook environment, all of these are blended in. They are written in code. They are not queryable. They are not transparent. If I gave you a notebook that is 10,000 lines long running a core pipeline, and I'm no longer in your organization, it will be very hard for whoever picks that up to understand what's going on in that notebook, because all of these things are blended in programmatically, and it's a mess.
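One way to picture the alternative is pulling a semantic definition out of that 10,000-line notebook into a small, inspectable declaration. A minimal sketch, with a made-up metric and table (this is not the format of dbt or any particular semantic-layer tool):

```python
# Sketch of pulling a semantic definition out of notebook code into a
# small, inspectable declaration that the whole team can read and query.

from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    sql: str          # the agreed-upon computation
    description: str  # the business meaning, documented next to the math

profit = Metric(
    name="weekly_profit",
    sql="""
        SELECT sum(o.amount - o.cost) AS weekly_profit
        FROM orders AS o
        WHERE o.returned = FALSE
          AND o.order_date >= current_date - INTERVAL 7 DAY
    """,
    description="Net of returns; excludes orders flagged as returned.",
)

# Anyone on the team can inspect the definition instead of digging through
# a long notebook to find where "profit" was computed.
print(profit.description)
```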
[00:29:22] Unknown:
And so given the fact that there is so much complexity, we have gotten to a place where we have to work across at least two or three different tools and systems just to be able to answer a simple question. What are some of the forward-looking design considerations, system architectures, and platform evolutions that we can look to to simplify that? Maybe 10 or 15 years ago, we had systems like Informatica, where it was an all-in-one, vertically integrated solution. Then we went to the modern data stack, where we have a dozen different tools, each of which wants to own different, overlapping pieces of the puzzle, and now we're starting to see the pendulum swing back the other direction, where we are recomposing a vertically integrated solution out of the individual components of the data stack with things like Mozart Data. What are some of the ways that we, as engineers and system integrators, should be thinking about how to build cohesive platforms and cohesive experiences, so that our end users aren't struggling and spending their entire day just trying to figure out what they're supposed to be doing and how?
[00:30:25] Unknown:
Yeah, great question. I think, practically, from a systems architecture and data engineering perspective, you want to keep the tool ecosystem as lean as possible. You hit the nail right on the head: a lot of these tools have overlapping components, and it's so common to see a team of 12 data engineers where each one pulls in their favorite tool. Before you know it, you've got a dozen tools in the ecosystem, and maybe all you needed was two or three. And even if you boil it down to a few, it's a question of how well the process and the methodology for using those tools is set at a systematic level: what will be used when, and how do you keep track of all of that, especially as all of these tools change over time. So I think that part is pretty straightforward engineering 101: run a good dev shop, a good engineering shop. Keep it lean, keep it clean, only bring things in when you need to, document everything, and have processes that go beyond the tool integration set. The second aspect, which is a little bit futuristic and goes a little bit into where DataChat is going, is that we look at it from the other end of the spectrum. We say all of this engineering support is a means to an end. The end is to enable the end user to ask a question and get an answer in a way that is transparent and reproducible.
So more than saying, I want to make it easy for someone to compose a programmatic pipeline, how about we complement that, or flip it, and say: we want to make it easy for anyone to ask any question and get an answer, and then get the pipeline that they can verify in a way that says, hey, this matches the semantics I need. So that's kind of where we are going. We are saying, whether it is data science (which includes SQL and machine learning and data cleaning and feature engineering and all of that) or visualization, we'll give you one UI, one interface, which is a chat box. Type your question in, and we'll generate all of that. Along with it, we'll give you a recipe: the precise steps that document what happened at each step. This is the semantic definition that we came up with for the definition of profit; you can verify it, you can change it. The tools will evolve, and if you make the management of the tools the key task of the data engineering team, then you're not serving the end user. You could also come at it from the other end, which is what we are doing in DataChat: blow up that portion of it, make it easy for anyone to ask, but build trust and verification into the system. Yes, the semantic definition layer might change, from maybe dbt or just Python code right now to something else, but the interface that you want to keep constant is enabling that end user to ask these questions, with that trust and verification layer behind it, because the tools will change and evolve. And given that you are researching both sides of this equation, the user experience and how to improve the utility of these data systems, as well as the scalability aspects and how we can push more data through these systems without having to double the cost every two years, what are the elements of tension that exist in answering those two questions,
[00:33:43] Unknown:
and what are the opportunities for incorporating those perspectives in
[00:33:47] Unknown:
the evolution of the fundamental platform components that we build? Yeah, great questions. And there's a big unification across both ends of the spectrum. The unification is time on the human side, which is basically the same as cost. I want a fast system to deal with the scalability problem on the architecture end of the spectrum that we talked about, but I want exactly that same speed on the other end, because a lot of analytics today is human-in-the-loop computation. If you fire up a question, let's say to DataChat, and it's going to take 30 seconds to come back, but I could have a faster hardware-software system that brings that answer back in half the time, guess what you win on? You win on human time, and human time is really expensive. So, ultimately, cost is the driving factor across both; human time is the same as cost, and that's the unification. You need the faster stuff to do more, and humans are going to be impatient. Right? Even in ChatGPT, when you punch in your question and press enter, if that thing took a minute to come back versus five seconds, your user experience and your ability to actually use the tool to do real work would completely change. So at the end of the day, it's one thing: speed matters. On both ends of the spectrum, faster is better, for very different reasons, but that's a unifying KPI across both. Faster is better.
[00:35:08] Unknown:
And so as you are conducting your research, you're doing it in the context of a lab environment with your research group, and you're hoping that the outcome of this research will have some meaningful impact on the industry a number of years down the road. I'm wondering what are some of the strategies that you use to get some real-world context around these problems and the solutions that you're building, and to feed that back into the research, so that you're doing it in a way that is directionally beneficial to the outcome that you're hoping to achieve? Yeah. That's a great question. That's a tough question. My research philosophy has always been
[00:35:48] Unknown:
work on interesting things that are at least a few years out. You know, I don't know if anyone can see more than five years out, but pick something that is a 3-year or 5-year challenge. You do not know where it will go; it may totally fail. The nice thing about being in academia is that if you pick a problem that is interesting and hard and nontrivial, and you explain that problem to someone and they say, this is interesting, it's not a fool's errand to work on this problem, but I don't know the answer to it, then that's all you need to start a good research project. The super interesting part about academia is that if you're working on something interesting, even if you come up with a result that doesn't become an industry product, that's okay. You'll have trained students in it, because that is an essential part of being an academic researcher.
And perhaps one could argue that in the long term, the biggest thing that academics produce is students, right, who go on and do other things and build that future leadership. You're always trying to look ahead, and you don't know what's going to happen 3 or 5 years from now. But if once in a while your research starts to become a reality, then you go spin it out into a startup, and all kinds of fun things happen. Right? And DataChat is my fourth startup. I've been lucky that it has worked out many times; it hasn't worked out every time, but that's the gift of academia, especially in the US, where it's very easy to go and work on speculative problems. It has to be something you can convince funding agencies is worth looking into. You have these smart students at great places like CMU; you open up your door and the brightest and smartest students from across the world walk into your office. What a gift. You work on interesting stuff, and if something happens, you go do a spin-out. I feel like I'm in heaven.
[00:37:33] Unknown:
And in that balance between your research activities and your commercial enterprises with these start ups, what are some of the beneficial
[00:37:43] Unknown:
feedback loops that you've been able to build up, and what are some of the sources of tension that exist between those two aspects of your work? Yeah. The sources of tension, I'll hit that first: of course, there's always this issue that if you do work in a university and then you spin it out, what's the ownership of the underlying work? All my startups have been in conjunction with the university, so, you know, I feel like, if I'm at a university and I do something interesting, it's because of the university, so let's play ball with them. Different people have different philosophies, but it's never an easy answer. There's always discussion, there's always negotiation, there's always contractual stuff, and lawyers get involved. So there are some non-fun parts of it. The second part is that even in academia, if you are working on an interesting problem, industry is often pretty interested in getting engaged with you at an early point in time. And once you have even a crude prototype that you can deploy, even in a limited setting, you always learn things that you would have never expected once something becomes real and actual users start to play with it. Because people will do crazy stuff that, even in the wildest imagination, you can't quite imagine.
And then all of a sudden it becomes real, and what's super interesting is that it nearly always generates new research problems for you to think about that you wouldn't have come up with if you had just tried to dream about them in your office. But you have to start by dreaming first. Right? If you just go and ask people what they want, they may not quite have it. So it's that combination: you have to have a dose of practical reality plus a dose of aspirational, creative thinking, and you have to have both of those parts in any successful research project.
[00:39:22] Unknown:
And as you have been conducting your research and working in these different startup enterprises, what are some of the most interesting or innovative or unexpected ways that you've seen your research applied?
[00:39:34] Unknown:
I think the most unexpected part is when you start to deal with real workloads and real constraints. You start to realize that things that seem simple or trivial actually turn out to be really complex. So, just the practical components of making things work in real life, with cost considerations that are real. Right? Someone's writing a check. If you're trying to train an LLM on the specific task at hand, for example, stuff like what we do at DataChat, all of a sudden the cost component is no longer abstract. You're actually writing a check for those hardware resources. So you're just at a different level, where you start thinking very, very carefully about things like estimating what is actually going to run and developing methods to do that estimation, learning how to do low-cost A/B testing as you go, and searching for different architectural configurations for the system.
So, very macro-level stuff that is abstract and potentially not interesting in the academic setting, but often not even realizable in the academic setting, because you need large teams of engineers to be able to build a big system like DataChat. Those are super interesting things that are very hard, if not impossible, to study in academia, but are front and center, and quintessentially interesting problems that show up once things start to become real in enterprises and in startups.
[00:41:01] Unknown:
And in your own research that you're doing, what are some of the most interesting or unexpected or challenging lessons that you've learned?
[00:41:09] Unknown:
Yeah. I think the most challenging lesson is: don't give up the first time you get a negative result, which will happen. If you pick a challenging problem, you'll sometimes hit your head against the wall, maybe for years. And if the question is still valid, if it is tantalizingly important long term, you sometimes just have to stay at it. It takes patience, and sometimes it may take multiple students, because even PhD students may only be with you for five or six years, and sometimes an interesting problem may take longer than that. So staying with the problem across several student generations matters; I know attention spans are getting shorter and shorter, but sometimes the payoff happens when you work on something for an extended amount of time. And have there been any particularly interesting or informative dead ends that you've encountered along your journey? Yeah, the part that we started out with, where we are looking at encoding techniques and saying, let's revisit that: we actually started working on it about 10 years ago, got some good early results, then kept hitting a wall.
And now I think we are on to a new line of thinking, which is along the lines of rethinking those encodings from the ground up.
[00:42:21] Unknown:
And as you continue to work on these hard problems, you try to forecast the solutions that we're going to need 3 to 5 years out, as you were saying. I'm curious what your heuristic is for when a particular research project needs to be either killed or put into production.
[00:42:42] Unknown:
Yeah. I think put into production is easy. Right? If you have something that is interesting and exciting, you pitch it to a couple of VCs. Well, first, before you pitch, you see if you can get your students and collaborators excited to spin it out into something like a startup. Once you do that, then you go and see if you can pitch it to VCs. Many of them are extremely sharp; they see a lot. And if you can't get a VC's attention, then something's probably wrong; you missed it. Right? Because you should be able to convince someone to put money into a good idea. And once you have all of that, then you can get the ball rolling.
And academic research also requires funding. Right? You're trying to convince funding agencies to fund you, and the VC game is different; the work has to be more mature by the time you get to that. So it's a spectrum, but luckily there are well-defined mechanisms for it. If you can't convince someone, say a student, to work on an interesting, far-out problem that may seem crazy, or if you can't convince a VC to fund you, then something's wrong. You have to reexamine it and ask: how do I refine what I'm doing? Am I on the wrong path? Should I sunset this, or pause it until I can get someone else to be more interested in the problem? That's the way I think about it. I know there are different ways; if you're a pure theory person or a pure math person, you could stick to a problem by yourself. But for the type of things that I do in systems, you need collaborators, you need students, you need larger teams. So you have to convince someone it's a good idea, and that's, for me, a good measure.
[00:44:09] Unknown:
And as you look to the future, what are some of the problems that you anticipate we're going to have to address as we continue to build and scale these complex systems and take on complex data challenges? What are some of the areas of focus that you have planned for the near to medium term, or any particular projects or problem areas that you or someone else should dig into?
[00:44:32] Unknown:
Yeah. It's again at the two ends of the spectrum. To broaden out, on the architecture side there's just so much diversity in the different ways to architect storage and computing devices. So I'm working with collaborators from other universities and at CMU who are hardware folks to understand that ecosystem and see what's possible. The design space is vast, so there's a ton of work to do there and lots of interesting subspaces. On the other end of the spectrum, I think we are just getting started with all of the uses of GenAI for improving human productivity in getting insights from systems and things of that sort. We're still starting to better understand how to use these LLMs in ways that protect the privacy of the data and of the communication between the platform that's using the GenAI technology and the application.
There's also a huge cost component: are small models the future in certain cases, or are they still quite far behind the large models? And the large models are getting larger and larger, and there are all kinds of different architectures. So there's lots of interesting stuff in just that space of how to use these economically, and when to use which components, and lots of interesting subspaces, including that data discovery piece that I mentioned. We don't know where to look. And even when you know where to look, you don't know how to use many of these new, advanced methods, especially in the GenAI space, because it's all moving so fast. So I think just about anywhere you look, there are lots and lots of pockets of interesting problems at the two ends of the spectrum. I would say the middle is kind of boring. Go to the edges. It's wide open.
[00:46:11] Unknown:
Are there any other aspects of the research that you're focused on, the problem spaces that are still open to be explored, or some of the other work that you're involved in that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think there's a huge amount of interest,
[00:46:29] Unknown:
in general, in terms of the future of LLMs: how open should they be, and what does openness even mean? Is open weights open enough? Probably not. I think in academia, one of the challenges when you're working on some of these large LLM models is that very few institutions have the resources it takes to build one of these LLMs from scratch in a realistic fashion. So there are lots of research problems, and if you especially look at the space of GenAI, there are certain things that you can do better in industry today. If you are at OpenAI or at Google and have been building these large language models for five years now, which is an eternity, you know all the deep systems engineering tricks, a lot of insights that will never get written down in papers. It's very hard for someone in academia to go and say, I'm going to take on that project. First, you don't have those five years of detailed engineering tricks, or trade secrets, for doing things in an efficient fashion. Second, it takes a lot of resources, millions, if not tens or hundreds of millions of dollars, to build one of these. So there are certain components that are just uniquely well positioned right now in industry in that exciting space. And as academics, it's like: okay, do you go to industry and spend some time over there if you're deeply interested in stuff like that? Luckily, there are lots of interesting, far-reaching problems that require you to start with something that might be a large language model and do stuff with it, and there's a ton of work going on there. But, you know, this is certainly kind of unique, where in the past it was often the case that the deepest core component of some new technology was done in academia.
You could arguably say that building a large language model is one of those core constructs, and that is better done right now, arguably, in industry, because the resources and the large engineering teams that you often need to do that are available only over there right now. So there's a little bit of a difference in terms of where things go. If you're working in that space, you have to ask which problems you can practically achieve in academia, and that's sort of a new thing for many parts of computer science.
[00:48:48] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think data discovery is probably the biggest one that comes to mind.
[00:49:07] Unknown:
You know, we do not have good ways to find out where to even start looking.
[00:49:12] Unknown:
Absolutely. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing in your research and the ways that you have been applying it in the commercial sector. It's definitely a very interesting body of topics that you're focused on, and I'm glad that you and your collaborators are working to improve our capabilities in this space. So I appreciate all of the time and energy that you're putting into that, and I hope you enjoy the rest of your day. Thank you. Take care.
[00:49:43] Unknown:
Bye.
[00:49:45] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Jignesh Patel
Current Research Focus
Future of Quantum Computing
Challenges in Data Scalability
Innovations in Indexing
Data Pruning and Storage Strategies
User Experience in Data Systems
Building Cohesive Data Platforms
Balancing Research and Commercialization
Feedback Loops Between Research and Startups
Challenges and Lessons in Research
Future Research Directions
Open Questions in Large Language Models
Closing Remarks