Summary
Transactional databases used in applications are optimized for fast reads and writes of small numbers of records with relatively simple queries. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely, the team at Materialize has created an engine that builds queryable views of your data as it is continually updated from the stream of changes generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Materialize is and the problems that you are aiming to solve with it?
- What was your motivation for creating it?
- What use cases does Materialize enable?
- What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?
- How does it fit into the broader ecosystem of data tools and platforms?
- What are some of the use cases that Materialize is uniquely able to support?
- How is Materialize architected and how has the design evolved since you first began working on it?
- Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?
- What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?
- In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?
- A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?
- Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?
- What are the considerations and constraints on maintaining the full history of the source data within Materialize?
- For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?
- What is the workflow for defining and maintaining a set of views?
- What are some of the complexities that users might face in ensuring the ongoing functionality of those views?
- For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?
- The Materialize product is currently pre-release. What are the remaining steps before launching it?
- What do you have planned for the future of the product and company?
Contact Info
- frankmcsherry on GitHub
- @frankmcsherry on Twitter
- Blog
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Materialize
- Timely Dataflow
- Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
- Naiad: A Timely Dataflow System
- Differential Privacy
- PageRank
- Data Council Presentation on Materialize
- Change Data Capture
- Debezium
- Apache Spark
- Flink
- Go language
- Rust
- Haskell
- Rust Borrow Checker
- GDB (GNU Debugger)
- Avro
- Apache Calcite
- ANSI SQL 92
- Correlated Subqueries
- OOM (Out Of Memory) Killer
- Log-Structured Merge Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances, and they've got GPU instances as well.
Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Corinium Global Intelligence, ODSC, and Data Council.
Upcoming events include the Software Architecture Conference, Strata Data, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events and take advantage of our partner discounts to save money when you register today.
[00:01:38] Unknown:
Your host is Tobias Macey, and today I'm interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data capture. So, Frank, can you start by introducing yourself? Sure. I'm Frank McSherry. I'm chief scientist at Materialize, Inc.
[00:01:54] Unknown:
I used to be much more of an academic. I did a lot of research on computer systems and data processing going back 15 or 20 years, so a bit all over the place. I did a bit of work with differential privacy back when that was just beginning, and, you know, a lot of things related to data processing — big data processing mostly — understanding a bit about how computations move data around at large scale and what they need to do this efficiently, what sort of support they need. And do you remember how you first got involved in the area of data management? Yeah, it goes back a ways. As a grad student doing computer science-y whatnots, I was a theoretician originally, proving theorems and doing math and stuff like that, and the work that I did related to large graph analysis, basically. Back at the time — this was around 2000 or so — PageRank was just barely a thing, and people were pulling out other sorts of analyses about what you do with really large graphs: the web at the time, but also other sorts of social graphs, things that emerged out of people's interactions rather than being planned ahead of time like a network topology. And this led to a bunch of practical questions, like: if someone gave you a billion edges, what would you actually do with that? How do you even make that work? At the time I was at Microsoft Research, and back then places like Microsoft and Google, and, you know, AltaVista-type places, were the ones working at this scale, and it was very bespoke technology that people were using. As part of wanting to work on these sorts of problems, I got involved in projects at Microsoft — in particular the Dryad and DryadLINQ projects, which were these cool DSLs embedded in C# that trick people into writing declarative programs, in the way that you see now in Spark and things like that. That work caused people to realize — caused me, for sure, to realize — that a lot of the programming idioms we use to work with large data render down pretty nicely to data-parallel compute patterns that you might ask a computer to perform if you could manually control where all the bits and bytes go, but through an API that is much more pleasant, a bit more declarative, lifting the abstractions up to something that is — I wouldn't say like SQL, but a bit more "let me just explain what I want to be true about the output rather than exactly how to get there." That project burbled around for a little while and did a lot of really cool things.
And from my point of view, it led to the Naiad project, which was the next thing that I did there. That was where I got a bit more directly involved in building systems for data manipulation and data management — though, to be honest, a lot of what I do is more about computing on data rather than managing it per se; there are other people who work a lot harder on data management than I do. Hopefully that gives some context: the past fifteen years or so, up until about five years ago, were about getting a bit of experience with what people might plausibly need when they want to shuffle around
[00:04:56] Unknown:
gigabytes or terabytes of data. Yeah, it definitely gives a good amount of context for where you're ending up now with the Materialize project. So I'm wondering if you can describe a bit about what it is and some of the problems that you're aiming to solve by building it. Yeah, that's perfect. So
[00:05:12] Unknown:
Materialize is, in some ways, a natural extension of some of the research that we did, both back at Microsoft and at various academic institutions in between. We took what we learned, when we were putting on our science hats, about how to incrementally maintain computations that people specify — which was great, and we felt very smart when we did it — but it wasn't super accessible to your average person who just wants to be able to type some SQL, press enter, see the result, and, if they press up and enter over and over again and ask the same question repeatedly, have it be really fast. A lot of what Materialize is going after is a refinement of the experience of this underlying technology. We're borrowing very heavily from this work on incrementally maintaining computations — SQL-style computations, but others more broadly — and trying to refine it to the point that you can grab the Materialize binary and turn it on, and it looks a lot like a standard database experience. You use something like psql to log into it, you get to look at all sorts of relations there, and you get to ask queries just like you normally would using standard SQL constructs. And it has some really nice performance properties if you ask the same questions over and over again. If you create views and create indexes, you can start seeing results for what appear to be very complicated questions very quickly, if we happen to already be sitting on the appropriate data assets. So the lived experience is meant to be a lot like a database that just happens to be fast when you re-ask questions that you've asked before.
And one of the problems that we're trying to solve is that a lot of the workflows for people using databases at the moment re-ask the same questions, or similar classes of questions. A classic example, though not the only one, is dashboarding, where dashboards are repeatedly hitting your database with the same classes of questions over and over again: show me stats on how many engagements we had in the past however many minutes or hours. The only thing limiting the freshness of that data is the database's ability to handle however many tens or hundreds of these queries that are repeatedly being asked — everyone's got a dashboard open. If we can make that incredibly fast, then all these things can just be updating in real time — real time like milliseconds, not real time like minute by minute. So we're making a lot of these experiences more pleasant: ones where data are fresh down to the second, and people's brains are thinking up new questions to ask second by second, and we're removing the speed bumps and roadblocks in either of those two dimensions.
Either it takes a while for data to show up and actually get into the system, or, with some of the other frameworks out there, it takes a while to turn a new query into an actual running computation. We're going to remove both of those and make some workloads immediately much, much better — and, my guess is, open the door for a lot of new classes of workloads that people wouldn't really have attempted before because they were just so painful. But that's a little bit speculative.
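To make the dashboard pattern concrete, here is a rough sketch of the workflow described above in Materialize's SQL interface: define the expensive aggregation once as a materialized view, then let every dashboard refresh read the maintained result. The table and column names are hypothetical, and the statements are indicative of the general shape of the workflow rather than a verbatim transcript of any particular Materialize release.

```sql
-- Hypothetical dashboard aggregate, maintained incrementally as new
-- order events flow in from the upstream source.
CREATE MATERIALIZED VIEW sales_by_region AS
SELECT region,
       count(*)    AS order_count,
       sum(amount) AS revenue
FROM orders
GROUP BY region;

-- Each dashboard refresh is a cheap read of the already-maintained result,
-- rather than a full re-aggregation over the orders table.
SELECT * FROM sales_by_region ORDER BY revenue DESC;
```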
[00:08:07] Unknown:
And in terms of how it fits into the overall life cycle and workflow of data, I'm wondering if you can give an overview of a typical architecture: where the data is coming from, how it gets loaded into Materialize, and where it sits on the axis from the transactional workload, where data is going into the application database, to the analytical workload, where it might live in a data lake or a data warehouse — and any of the other surrounding ecosystem that it might tie into or that might feed the Materialize platform.
[00:08:41] Unknown:
That's a great question. There are definitely several different stories you could go with here, and I'll start with probably the most simple one, the one that's most natural and least invasive or disruptive. One way to think about what Materialize does is that it's very much an analytic tool. Materialize itself is not, at least at first, planning on acting as a source of truth. So you have your transactional database that's holding on to all the information you have about your sales, your customers, your products, or whatever events are going on in your system, and Materialize attaches to the output of that database. Your database is typically going to be producing something like a binlog — that's MySQL's term — or some other change data capture stream. There are other tools, not ours — Debezium, for example, which we'll come back to — that attach to various databases, transform this exhaust into a more standard format, and produce consistently shaped change logs that land in Kafka, and then we just pick things up out of Kafka. So we're decoupled from the transactional processor. You can think of us in some ways as a bolt-on analytics accelerator, something like that. We see what happens in your transactional database, and we give you fast read-only access to views over that data. So in some ways we're like a read replica, or we behave a bit like one — it's not the actual protocol we're using at the moment, but you install Materialize just downstream from your source of truth, and we don't interfere with the source of truth in any way. That's roughly where it fits into the ecosystem. If you want to feed it with other sources of data — say you've got Kafka streams that are being populated by something other than a database, maybe IoT streams of measurements and observations coming in — then probably, with a little bit of tap dancing, it's easy to adjust that into something we can consume. But at the moment we're very much looking at relational data models and the sorts of properties they have: things like primary keys, foreign keys, the sorts of dataflows that come out of SQL-style queries. Does that roughly describe where it fits into the sorts of tools you're thinking of?
[00:10:46] Unknown:
Yeah. My understanding of it, from your description here and also from the presentation that you gave at Data Council, which I'll link in the show notes, is that in some ways it can be thought of as a better use case for a read replica, where you can get some actual improvements in the overall capabilities and characteristics of your workflow, as well as being able to incorporate additional data sources as you mentioned, rather than it just being an exact copy of what's in your transactional system accessed in a way that's not going to impact it directly.
So it seems like there are a lot of performance and capability gains while at the same time being able to access the same transactional records without impacting the application that's actually feeding in that data.
[00:11:28] Unknown:
Right, that seems right. One way to think about it is that if you were going to build an analytic data processor, you wouldn't need to architect it the same way you build a transactional data processor. Of course, the people building transactional data processors have been doing that for a long time and have a lot of expertise. But if the only thing you needed to do was support analytic queries — or, say, streaming updates to analytic queries — you have the ability to use a totally different design. And indeed, this is what data warehousing tools do: they have totally different designs from relational databases. What we're doing — think of it as streaming data warehousing, analytic queries over continually changing data — is picking a different architecture for the problem than a traditional database would use, and dropping it downstream of the traditional database.
Using this totally different architecture decouples the resources you might invest in the transactional side from the analytic side. Materialize is a scale-up, scale-out sort of solution, where you can throw a bunch more resources at it if you have a large volume of queries, and you don't actually need to go and get a bigger database to handle that. At the same time, it does some cool things that traditional data warehouses don't do with respect to frequency of updates and continual data monitoring. And so this is a good opportunity for us to dig a bit deeper into the overall system architecture and maybe talk about how you're able to handle these analytical workloads and ingest the data stream in near real time to continually update these queries. Yeah.
So Materialize has several components. At the high level, there are several different modules that you might not think of as part of the data processing plane, so I just want to call those out ahead of time. There are, for sure, components that mediate client access to the system — things that handle SQL sessions coming in and make sure that everyone's got the right state stashed in the right places. This is control plane stuff; we'll get to the juicy data plane stuff in just a moment. On the control plane, people or tools connect in through SQL sessions. They have a little chat with the central coordinator for Materialize, and this is where we figure out which relations have which schemas, if you want to see the columns on particular relations. All this metadata is managed outside of the scalable data plane, so that we can do things like query planning, and answer "tell me about the structure of the data" questions, interactively. At that point the coordinator, with the help of each of the users who have actual questions, assembles queries — these big SQL queries that we think of as probably working over lots of data and subject to continual changes in their inputs — and plans each one out as a dataflow. If you're familiar with tools like Flink or Spark, certain big data tools reimagine SQL-style or data-parallel queries as directed dataflow graphs, where nodes represent computation and edges represent the movement of data. This is exactly the same sort of take we're going to have. We reimagine these queries as dataflows, which is great for streaming, because what we really want to do in a streaming setting is just chill out while nothing is happening. If no inputs have changed, we just hang out, and as new changes arrive at various input collections, we want to prompt computation only where we actually need to do work. We'd love to be told, essentially by the movement of the data itself, which operators — which bits of our query — actually need to be updated. So if we've got a query that depends on five different relations, it's great to be told: this is the only one that's changed — your customer file is the only one that's changed — so push forward changes to the customer file. And if a customer changed their address, and that's not actually a field that's picked up by the query, great: we go quiet at that point and tell people the answers are correctly up to date.
Nothing changed. So the coordinator is thinking through how to plan your query as a dataflow computation, and it gets deployed to this data-parallel back end. That is where all the work happens. This is where we have, let's say, 32 concurrent workers, or hundreds of concurrent workers — although it turns out the numbers get to be a fair bit smaller with Materialize and timely dataflow, because we've made some different architectural decisions from projects like Spark and Flink that allow us to move a fair bit faster. At the moment it's not really thousands of cores; we can do the same sorts of work on tens of cores. The dataflow is partitioned across each of these data-parallel workers, and they all share the same set of dataflows. Think of a dataflow as a map for the computation: what needs to happen, and when changes show up, where we need to direct them next. All the workers share the same map, and they all collaboratively do work for each of the operators. So say we've got a join somewhere in here — there are two streams of data coming in and we're supposed to join them together.
The work associated with that join gets sliced up and sharded across all of our workers. If we have 32 workers, each worker performs roughly one thirty-second of the join — it's responsible for one thirty-second of the keys in that join. And, I guess, a really important architectural difference from prior work here is that each of the workers in the system is running its own little scheduler. Each worker is able to handle each of the operators in the dataflow graph, and rather than spinning up threads for each of these operators and letting the operating system figure out what gets done, timely dataflow is bouncing around between operators at microsecond timescales to perform this work. This gives us a really big edge over other bits of technology in this space — other scale-out data-parallel processing platforms like Spark and Flink — which partition the dataflows in a way that doesn't allow them to share state. That's the main thing going on in Materialize that's exciting: sharing state between queries.
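As a concrete sketch of the change-propagation and shared-state points above, consider a view that joins two hypothetical CDC-fed relations, plus a second view layered on top of it. A change to a column the views never read produces no update to the maintained results, and the downstream view reuses the state the engine is already maintaining instead of building its own copy. The names are placeholders and the syntax is indicative only.

```sql
-- The view reads only customers.id, customers.name, and orders.amount, so a
-- change to, say, customers.address yields no change in the maintained output.
CREATE MATERIALIZED VIEW customer_order_totals AS
SELECT c.id, c.name, sum(o.amount) AS total_spent
FROM customers AS c
JOIN orders AS o ON o.customer_id = c.id
GROUP BY c.id, c.name;

-- A second view defined over the first reuses the join and aggregation that
-- are already being maintained, rather than re-reading and re-indexing inputs.
CREATE MATERIALIZED VIEW big_spenders AS
SELECT id, name, total_spent
FROM customer_order_totals
WHERE total_spent > 10000;
```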
Happy to dive into that, but I also just want to pause for a moment and make sure that I'm still on target for the question that you had.
[00:17:36] Unknown:
Yeah, that's definitely all useful. As I was looking through the timely dataflow repository, I noticed that the documentation mentions it was based on your prior work on Naiad, which you mentioned. I was also interested in the fact that you decided to implement it in Rust, where a lot of the existing and prior work for distributed computation, particularly in the data space, is based on either C and C++ or JVM languages. So I'm curious what you found as far as the existing ecosystem in Rust for being able to implement things like timely dataflow, and some of the surrounding tools and libraries for being able to build something like Materialize, and the overall benefits that you've been able to realize from using Rust as your implementation language. Yeah, it's a great question, actually.
[00:18:29] Unknown:
The historical explanation for using Rust isn't nearly as compelling a story as you might think. About five years ago we were all at Microsoft, and Microsoft shut down the lab we were working at. At the time, everything we had written was in C#, and we were collectively thinking, maybe we should learn a new language and grow a little bit. I announced to all of my coworkers that I was going to go learn Go, and because they're very supportive people, they told me that was dumb. In particular, there was a cool blog post called "Go is not good" that I think is still relevant now. The author basically argues — and I don't think Go is bad — that it lacks several good things that languages have been introducing, and called out two languages they thought were good: Haskell and Rust. And that led me to pick up Rust instead.
These were early days; this was before Rust was 1.0, and things were still changing. Basically, I think I got pretty lucky — Rust worked out pretty well. In particular, one of the main pain points that people have with Rust is the lifetime and borrowing system, which I think is great, but it's really not that stressful in data processing frameworks. All these dataflow systems move responsibility for data around between workers, at which point there's just not very much borrowing going on. So the main pain point for a lot of people using Rust doesn't manifest nearly as much as it might in a different domain.
And the rails that Rust puts on everything, to make sure you don't screw things up, have just been wonderful. One thing that I tell people — it's sort of a joke — is that timely dataflow, differential dataflow, all these things seem to work pretty well, they're actually used in production by some people, and I don't actually know how to use GDB or the LLDB debugger. It's just never come up. Rust has done a great job, in my experience, of removing the class of error where you're pulling your hair out, not understanding why the program is even doing that thing — I didn't ask it to do that. So in my experience so far with Rust, essentially all of the errors are just sitting there right in front of you in the code. You have bugs for sure, but they're delightfully debuggable, as opposed to someone somewhere else in the code doing a crazy thing that overwrote some of your memory, where you just can't, for the life of you, figure it out. The flip side of the question I think you're actually asking is more about the ecosystem: given that Rust is such a new language, you'd love to just pull supporting things off the shelf — adapters to different databases, or support for Avro and JSON and stuff like that. What's there, and what's missing? There wasn't nearly as much when folks were just starting out. You'd look for a nice MySQL adapter — well, there wasn't one back then.
There were a bunch of issues like that. They're starting to sort themselves out, and I would say there are still some gaps, but at least we haven't really run hard into any of them — we've found something we can use for each of these problems. In some cases people would love to just grab a thing off the shelf. As an example, we wrote some of our own query planning because the query planning framework we're familiar with, stuff like Calcite, just isn't available in Rust. If we were a Java shop, we would possibly have just borrowed that and skipped thinking about some of these things, and that would have been interesting.
We would have ended up doing different sorts of work: we wouldn't have spent as much time thinking about query planning, but instead we would have hit our heads on a lot of performance debugging. The anecdote I tell people from Naiad is that getting the Naiad system up and running took, if I remember correctly, just a few months, and from that point on it was a year and a half of performance engineering, working around the various vagaries of memory-managed systems, where they don't do exactly what you've asked of them. So you end up doing different sorts of things. I think Rust is great. Locally, people also think Rust is great, but obviously there's some selection bias going on here: I think several of the people who came to work at Materialize did so in part because they're interested in Rust and wanted to take a swing at it.
[00:22:34] Unknown:
And in terms of the system architecture as it stands now, I'm curious how that compares to your initial design, and how it's been updated as you have dug deeper into the problem space and started working with other people to vet your ideas and do some alpha or beta testing, as the case may be. Yeah, it's a good question. Let's see — there have been a few evolutions, so let me just mention a few of them.
[00:22:58] Unknown:
Not all of them are really about Materialize itself; some predate it. You have things like timely and differential dataflow, which are essentially libraries that you can link against: you write some Rust code, link against them, and you get a fairly performant outcome. I thought this was great — it worked wonderfully for me, and I was totally fine with it. So I would say the main architectural change, in starting up Materialize and working with a bunch of people who came from real database backgrounds, was understanding a different architecture that is going to require, for example, not having compilation on the critical path: figuring out how to design things so that when a person types a query and presses enter, the result flows as quickly as possible into execution, without wandering through, say, 30 seconds of figuring out the right compiled code to drop into a binary and run.
Architecturally, from that point on, we've had minor revisions where we've moved some of the responsibilities for certain bits of thinking between modules, and we've broken things up into more modules, but I don't feel that we've had substantial architectural revisions. We've had a lot of features that we weren't initially thinking would be crucial. More and more people said, yeah, it's going to be important for you to not just pull data out of relational databases — people are also going to show up with plain text log files, so make sure you understand exactly what you should do when someone points you at one of those. And after that we had to go figure out how to do column extraction from plain text. It's not fundamentally complicated — lots of people have done these sorts of things before — and we didn't need to change the architecture of the system, but we needed to be careful about exactly how we wanted to position it and what features we wanted the system to have. It's still very much what we think of as a fairly performant data processing plane plus a coordinator control plane that drives it. The exact words the coordinator uses to drive the data plane have wobbled a little bit, and exactly what the coordinator does with people's SQL has for sure changed, but there hasn't been any fundamental rearchitecting yet. And my guess, based on the customers we're interacting with, is that the near future holds a lot of feature requests as opposed to fundamental redesigns — in the six-month time frame we have a list of things we want to implement rather than things to fundamentally rethink. At the same time, it's also early days for us, so it's not the right time to do a redesign anyhow. Even if we were worried that something wasn't set up the way it should be, the right thing to do at the moment is to try it out on the domains where it works and see if we get traction there, as opposed to trying to be all things to all people.
[00:25:42] Unknown:
And as you've mentioned a few times, the primary interface for Materialize is through SQL, and I know that you have taken pains to ensure that you're fully compliant with the ANSI SQL-92 standard. I'm curious what the edge cases or pain points were that you had to deal with in being able to support all of that syntax,
[00:26:00] Unknown:
and possibly any extensions or built-in functions that you've added to fill the specific use cases that Materialize is targeting. Yeah, so there are a few, and they have different sorts of answers. Some things just require a bit more work. SQL 92 has correlated subqueries as a thing you can do, and correlated subqueries can have their own correlated subqueries. This is especially painful in a dataflow setting, because you need to turn all of these subqueries into actual bits of dataflow. A lot of databases might think about building a dataflow to describe the course of execution, or exactly where they should pull data from, but they can always, at some level, bail out if a query gets complicated and just do some nested-loop joins that iterate through whatever table you've put together. So there are some nice escape hatches in standard relational databases that we don't really have access to, because we actually do need to turn every silly query out there into something we can run as a dataflow. The standard papers you would read about decorrelating subqueries, from the early 2000s — they didn't cover all the cases.
We had to track down some more recent work out of Munich about how to decorrelate everything, not just the most appealing sorts of queries. So that's a case where we just have to do more work, which is fine — we learned a lot by doing it, and it's good to have learned it, because now we understand better what's easy, what's hard, and what we can discourage people from doing. There are other cases where things are just weird. SQL has runtime exceptions, right? If you have a subquery that is supposed to have only one row in its result, you can ask: is my row equal to that row? And if that subquery actually has either zero or multiple rows, it's supposed to be a runtime exception — you would suspend query processing and return control back to the user. That's not really a thing we can do in this incremental dataflow setting. We can't really suspend execution; the dataflow is supposed to keep running, and if we suspend execution, we're killing the dataflow, which is disruptive. So we have to think about the right way to communicate back to a person that they did something like divide by zero or access an array out of bounds. And it's quite possible that we might bend the spec a little bit, so that an out-of-bounds array access is not an exception but a null — I think that's what Postgres does — or if you divide by zero, maybe that should be a null rather than a runtime exception.
And we do query analysis ahead of time to try to confirm that your denominators must be non-zero, in which case we can confirm that the result will never be null. But if you might just be dividing two arbitrary things, one input by the other, we need a behavior that won't take down the dataflow when you give us weird input, but at the same time hopefully doesn't totally confuse people who might have expected to see a runtime exception — people who have been using their database to do data validation. If you have a mis-parsed integer, what should we do? Good question. So those are some examples where we might make a bit of a departure from the spec, just because the query processing idioms of SQL back in 1992 aren't the same as the view maintenance requirements that we have nowadays.
[00:29:01] Unknown:
Yeah, it's definitely interesting to see all of the different ways that people deal with SQL, because it's used so widely and so broadly, and there are so many different customizations that are necessary to fit a particular use case, or edge cases that don't make sense for the original context SQL was designed for, because of the ways that we have expanded the space of data processing and data analytics.
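For readers unfamiliar with the two SQL-92 edge cases discussed above, here is a small illustration using hypothetical tables: a correlated subquery of the sort that must be decorrelated into joins before it can run as a dataflow, and a division that a batch database could abort with a runtime error but a continuously maintained view has to handle with a total definition (for example, producing NULL).

```sql
-- Correlated subquery: the inner query references o.customer_id from the
-- outer query. A dataflow engine cannot loop per outer row, so this gets
-- rewritten (decorrelated) into joins and aggregations before execution.
SELECT o.id, o.amount
FROM orders AS o
WHERE o.amount > (
    SELECT avg(i.amount)
    FROM orders AS i
    WHERE i.customer_id = o.customer_id
);

-- Division that may hit zero: a one-shot query can stop with a runtime
-- error, but a long-running dataflow cannot suspend, so the engine needs a
-- defined result for every input (here made explicit with NULLIF).
SELECT id, amount / NULLIF(quantity, 0) AS unit_price
FROM orders;
```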
[00:29:27] Unknown:
So I think for the moment we're definitely trying to do some things well and find the customers for them. You know, SQL 92 was quite a while ago — we're trying to do those things well and seeing how far we get with that, as opposed to racing ahead and bolting on all of the newest extensions and features. We're looking, for example, at adding JSON support at the moment, which we'll probably do because enough people are asking for it. But I personally very much want us to end up being very good at a specific set of things that we understand, so we can tell people: if you want this, we do it very well — as opposed to having 27 different features that are all sort of half implemented and may or may not work in some corner cases.
The local metric we want is for the README page explaining what Materialize can do for you to have no asterisks on it that say "except, actually, this doesn't really work." A lot of other systems out there will say, oh yeah, absolutely you can join your data — but the asterisk is that it only works for append-only streams, or you need to make sure the partitioning is correct, or you have to understand what a table is versus a stream. We don't want any of those. We want it to be dead simple to use for people whose problems fit in an admittedly restricted class, and then grow out from there as we gain confidence that we are providing an experience that is predictable, unsurprising, and performant.
[00:30:50] Unknown:
And then for people who are using Materialize, can you talk through what's involved in getting it set up, and talk through the life cycle of the data as it flows from the source database or source data stream into storage — do you essentially duplicate what's already in the source, or are you just making sure that you have what's necessary to perform the queries that you care about?
[00:31:13] Unknown:
Right, great question. So, what happens at the moment — and I should say that all of this is subject to change as we learn that people hate it or love it or want something slightly different — but the way the world works at the moment in Materialize is that we presume you have, let's say, MySQL or something: a well-accepted source-of-truth database. And somewhere nearby you have, let's say, Kafka: a place to put streamy-ish data that is persistent and performant. There's a tool out there that we use, and sort of recommend people use at the moment, called Debezium, which attaches to various databases — essentially like a read replica, or by reading the binlog; there are a few different strategies depending on the database. You turn this thing on, you point it at your database, and it starts emitting Kafka topics for each of the relations that you named when you turned on change data capture. The topics that Debezium produces into Kafka basically contain before-and-after statements about rows in those relations. They say: a change happened; a row used to be this before, and now it is this afterwards, at this timestamp.
Then you turn on Materialize, and in Materialize you start typing things like CREATE SOURCE FROM, and you name the topic there — or a pattern for a class of topics — and Materialize will go out, open up all of those topics, start reading through each of them, start pulling in the changes, and present each topic as a relation that you can query, and you can start mixing and matching all of these in different queries. So this is roughly — sorry, this is half the answer to the question; I'll get to the rest in a moment — but this is roughly what you need to do to get started with Materialize. You have a relational database that's holding on to your data, and you transform its changes, using a tool like Debezium, into a change log with a particular structure in Kafka.
Then with Materialize you just point at the Kafka topics, and we'll start slurping in the data for you. In terms of how you avoid being an entire replica of the database — and in particular of the full history of all the changes to the database — what you can do in Materialize is, first, select subsets of the relations that you want to bring in, and second, as you bring them in, filter the relations down, based either on predicates or on projections, to just the data that you need. So if it turns out that you're only interested in analyzing customer data and sales data, and only five of the columns from them, you're more than welcome to slice those things down. And Materialize has, through the differential dataflow it's built on, some compaction technology internally that makes sure we're not using any more footprint than the resident size of whatever relation you're sitting on. So although the history might go back days or weeks, unless you ask for it we don't need to keep that days- or weeks-long history around, and we'll just give you query answers from now going forward.
It's flexible, though. You could, in principle, say: please load the whole history and don't compact any of it, at which point we look a bit more like a temporal data processor — we'll show you the full history of your query going back as far as we have update history. But it is definitely the case that if you have, say, a few gigs of data that you're planning on analyzing interactively, we will have those few gigs of data live in memory. If your data is 10 terabytes and you want to do random access to it and play around with it, we're going to try to pull in 10 terabytes of data, and we might need to tell you about the cluster mode at that point, or try to give you some advice on thinning down the records a little so that you don't have quite so much of a footprint. But Materialize is going to manage all of its own data. It's not going to turn around to the source database and dump any of the analytical workload back onto it. We're mirroring the data so that we can handle all of our workloads without either interfering with things upstream or finding ourselves caught off-footed, not actually having the data we need indexed correctly. We'll see how that works. At the moment it has been fine: a lot of the people we've talked with, when they tell us what they actually need interactive access to, it's a surprisingly smaller volume of data than everything they've kept in their source of truth or everything they've ever dumped in their data lake — it's a much smaller subset that they're interested in doing work over.
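A hedged sketch of the ingestion workflow just described: declare a Debezium-formatted Kafka topic as a source, then define a view that filters and projects it down to only the rows and columns the analysis needs, which is also what bounds the state Materialize keeps resident. The broker address, topic, schema registry URL, and column names are placeholders, and the option syntax is indicative of early Materialize releases rather than guaranteed to match any particular version.

```sql
-- Declare a CDC topic produced by Debezium as a source (placeholder endpoints).
CREATE SOURCE customers_src
FROM KAFKA BROKER 'kafka:9092' TOPIC 'dbserver.shop.customers'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://schema-registry:8081'
ENVELOPE DEBEZIUM;

-- Keep only the columns and rows the analysis needs; this reduced footprint
-- is what Materialize maintains in memory.
CREATE MATERIALIZED VIEW active_customers AS
SELECT id, name, region
FROM customers_src
WHERE status = 'active';
```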
[00:35:37] Unknown:
And as far as the scaling and storage of data within Materialize, you mentioned needing to keep the data in memory to run these analyses. I'm curious what the strategy is as far as spilling to disk when you exceed the bounds of memory, and what the scaling strategies and scaling axes are for Materialize. And in the process of describing that, maybe talk about where it falls with respect to the CAP theorem. Yeah, good question.
[00:36:00] Unknown:
With respect to the CAP theorem, it sacrifices availability — that one's easy. All of the timely dataflow work, going back to Naiad, sacrifices availability; it's all fail-stop. If something goes wrong — you lose access to some of the workers or something like that — we stop giving answers, basically, because we can no longer confirm that they're correct. That's where it lands. The CAP theorem is often applied to geo-distributed systems that are likely to suffer partitions, where you have to say what happens in that case. Naiad and timely dataflow and Materialize are definitely not that — you can think of them as living, possibly, on a single computer to start with, some large many-core machine or a rack or something like that: much more tightly coupled systems, where experiencing a network partition would be very surprising.
It could happen, but this is why we choose to sacrifice availability rather than consistency. In terms of memory and spilling: all the internal data structures, although they're in memory, are log-structured merge tree type things. There are a bunch of big slabs of allocations, and the implementation just naturally — if you run it on an operating system that doesn't have an OOM killer — pages out to virtual memory, and that's totally fine. One can also manually drop each of these slabs of the log-structured merge tree onto disk and just memory-map them back in; that's super easy too, if you'd rather they literally be on disk and memory-mapped in, as opposed to in memory and then paged out. There's a bit of work that we still have to do, and are planning on doing, regarding Materialize's own persistence story, and it's almost certainly going to involve taking these slabs of log-structured merge tree state, shipping them off to S3, and bringing them back when appropriate. But unlike a lot of the JVM systems that have big hash maps and shoot themselves in the head when they hit 64 gigs, these systems are super happy to just use the native mechanisms provided by the operating system, and performance doesn't seem to be particularly impacted.
[00:38:01] Unknown:
And Materialize itself is a business, and the Materialize product is your initial offering, at least. So I'm curious if you can talk through your motivations for forming a company around this, and some of your overall business strategy. Yeah, sure.
[00:38:16] Unknown:
The motivation is super simple. I'm not going to source this correctly, but there's an expression: if you want to go quickly, go by yourself; if you want to go far, go with friends. Arjun, the cofounder, didn't use those exact words, but pointed out that while me hacking on timely dataflow was super interesting, if we actually wanted to see whether this had legs and could go anywhere, it was going to need to involve other people: people to write various adapters, write documentation, and exercise parts of the system that I had no idea how they were meant to work, like SQL compliance and stuff like that. And a company is the right mechanism to bring together a bunch of people and make sure that they get paid and are looked after, so that they can actually build something that's larger than a research platform that's fun for writing blog posts.
So part of the motivation for the company — there are a few different dimensions here — was that if you want to do something interesting with all of this work, timely dataflow, differential dataflow, whatnot, you need a framework to bring together some people and make sure they get paid, and that's going to involve building a thing other people might want and making sure we get paid for doing it. There's other motivation too. The team's motivation, I think, is much more that this is an area that's wide open: the work on Naiad and timely dataflow and differential dataflow is great, from my point of view, and undercapitalized upon. So why not try to actually take it and show people what it can do?
And at the same time, make a lot of people in the enterprise infrastructure space really happy, collect paychecks from them, and just build something really cool and impressive. That's the motivation for the formation of the company, which doesn't say anything about the business model yet. My understanding is that enterprise tech has this very honest, healthy business model where the companies you're building technology for are relatively well funded, and if you actually do something valuable for them, they will pay you; if you haven't done something particularly valuable for them, you probably won't get paid. So it's not nearly as rough as consumer-facing technology, where you have to hope to ride some zeitgeist of excitement. If you show up and make a large, you know, Fortune 2000 company's life much better, great — they're delighted, and we just need to make sure that we actually do that for enough people. And the state of the product is currently pre-release, and you have a sign-up option on your landing page for being able to receive news. I'm curious what some of the remaining steps are before you're ready for a general launch. Yeah, so at the moment we're basically spending another month or two sanding off some rough edges. The plan is absolutely, in a few months' time, to throw this out there and make it publicly available, so people can grab the source code or the binary and use it on their laptop, just to try it out and see: do I like the look and feel of this sort of thing? Might I want to tell my boss that this is really cool and we should get some of it? And the remaining steps — we have a roadmap for this in about the two-month time frame — are mostly making sure various known issues are addressed. For example, what is the look and feel of loading up CSV files from your file system, as opposed to change logs from Kafka — does that actually work properly, and are people delighted by how it works? And some integration with existing BI tools. If everyone had to just type raw SQL in, that's fine — a certain class of person likes that — but other classes of people like the look and feel of Tableau and Looker a bit more, and there are some open source BI tools that we're making sure we're compatible with, so that people can point them at Materialize and get a bit more of an interactive query-building experience, rather than having to build their queries in some large text buffer and swap them in. So it's mostly fit-and-finish types of issues like that. The code is currently in a state where, if we handed it to people, they could try it out as long as they understood it has warts, and it would be fine. There's just a little bit of expectation management when we hand something out to people: if they go and try it and realize that it doesn't exactly do what they need, well, we probably should have fixed that beforehand. So we're doing a bit of that in the next month or two. Are there any other aspects of your work on Materialize
[00:42:21] Unknown:
or the underlying libraries
[00:42:23] Unknown:
or the overall space of being able to build a real-time analytics engine on streaming data that we didn't discuss yet that you'd like to cover before we close out the show?

I mean, there's a lot of other really cool stuff for sure. I don't want to force any of it on people. There's for sure a lot of thinking that one needs to go through and understand, just around the semantics of streaming data. What Materialize does, for example, is incremental view maintenance: it takes a SQL query and maintains it as your data change. That doesn't cover the full space of everything a person might possibly want to do with streaming data. So there's definitely a hurdle, one I'm constantly anxious about, which is how many of the things that people actually want to do, of which there's an unbounded number, actually look like incremental view maintenance of SQL. In many cases, people have something totally different in mind, and you tell them, if only you had said that slightly differently, it would be a perfect fit for you; is there any negotiation available here? So there's occasionally a bit of getting people to think through what they really need to see out of their streaming data. That's been pretty interesting, and the big shift, getting people to move from thinking about point-in-time queries to something that they actually want to see change as time moves on, is probably the hardest part. Getting other people to change their brains is really hard. Possibly not going to happen. We'll have to see. But that's one of the main things I worry about in terms of whether we're going to fit and click with a lot of people: can we get them to understand the sorts of things that we're good at doing? Can they change their needs, basically? Do their needs line up with what we're capable of doing? We think so. We think there are enough people out there, but this is a conversation we have. We get a lot of folks saying, well, let me think for a second and just get my head around what it is that you actually do. I think that's probably going to continue to be the case for streaming SQL for a while. What computers are actually able to do here is pretty interesting; how do we wrap that up in a language as pleasant as SQL, an easy way to think about what I want to have happen to my streaming data? As my data change, what do I want to have happen? There's an unbounded amount of other stuff to talk about too, but let me throw that out as the main one that I think about.
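To make "incremental view maintenance" concrete, here is a minimal sketch using the open source differential-dataflow crate that underlies Materialize. This is not Materialize's code and nothing from the episode: the (user, page) event schema, the "views per page" query, and the sample updates are assumptions chosen for illustration. The point is that the dataflow is defined once, and only the parts of the answer that change are reported as the input changes.

```rust
// Requires the `timely` and `differential-dataflow` crates as dependencies.
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::Count;

fn main() {
    timely::execute_from_args(std::env::args(), move |worker| {
        // Hypothetical input collection of (user_id, page_id) page-view events.
        let mut views = InputSession::<usize, (u32, u32), isize>::new();

        // Build the dataflow once; differential dataflow maintains it from then on.
        worker.dataflow(|scope| {
            views
                .to_collection(scope)
                .map(|(_user, page)| page)
                .count()
                // Each output is ((page, count), time, diff): a retraction (-1)
                // of an old count or an assertion (+1) of a new one.
                .inspect(|change| println!("{:?}", change));
        });

        // Initial data at time 0.
        views.advance_to(0);
        views.insert((1, 10));
        views.insert((2, 10));
        views.insert((3, 20));

        // Later, user 2 navigates from page 10 to page 20; the maintained
        // counts are updated incrementally rather than recomputed.
        views.advance_to(1);
        views.remove((2, 10));
        views.insert((2, 20));
    })
    .expect("timely computation did not complete");
}
```

With the sample updates above, the change at time 1 produces only four output records, retracting the old counts for pages 10 and 20 and asserting the new ones, which is the kind of incremental behavior Materialize then exposes behind a SQL interface.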
[00:44:27] Unknown:
Well, for anybody who does want to get in touch with you, follow along with the work that you're doing, and maybe have some follow-on conversations about the topics we discussed today, I'll have you add your preferred contact information to the show notes. And as a final question, I'm interested in getting your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

Oh, yeah. So that's tricky.
[00:44:51] Unknown:
For sure, I have a biased take on this, just based on what I've been doing for the past few years. But I would say that a lot of what we've done in the past year at Materialize, and before that, has really been fighting with the impedance mismatch between different tools and technologies. For example, we're using Kafka to get our data out and move it around, and Kafka just doesn't provide the right information about the state of the streams. And that's fine, you can work around it, but a lot of the tooling and tech in data management involves a lot of working around things, even though in many of these cases people sort of know what the right answer is; just not everyone got on board with things at the right time. Hadoop took off in the early days, for example, because it was so easy, and even when Hadoop took off, people already knew that this wasn't the right way to be doing things, but it was so fast out the door that people just piled onto it. There's a lot of other tech like that. So for me, the thing that I always struggle with is figuring out how, if possible, to roll back some of this early-out-the-gate tech that missed important design elements and swap in things once we've figured out how to do them a bit better. But it's hard to do. It's hard to swap performant and better-architected tech in for things that missed important things but are still, you know, super valuable for people. I feel like if that were actually resolved in a pleasant way, there would be a lot more off-the-shelf tooling that you could just take down and slot in for a whole class of tools. Like, if all databases used the same protocol to talk to each other and had the same interpretation of SQL, our lives would be a lot easier. They don't, right? They all make crazy decisions, each of them different, and complying with all of them just takes a lot of time, when we could be doing more cool things. I'm sure that's not the biggest issue the tech faces, but it's the one that, you know, if you look at the dent on my forehead from what I've been hitting my head against, it has that shape.

Alright. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Materialize.
[00:46:49] Unknown:
It's definitely an interesting project and one that fits a need in between the transactional engines and a lot of the analytical engines that we've got, as far as being able to keep real-time information for people who are looking to do fast iterative queries or keep tabs on the current state of their data. So thank you for all of your work on that front, and I hope you enjoy the rest of your day.

Hey, thanks so much. I appreciate the time to talk, and all the questions have been very thoughtful. Appreciate it.
[00:47:21] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Interview with Frank McSherry
Overview of Materialize
Materialize's Role in Data Architecture
System Architecture and Data Flow
Choosing Rust for Implementation
Evolution of System Design
SQL Compliance and Challenges
Setting Up Materialize
Scaling and Storage Strategies
Business Strategy and Formation
Steps to General Launch
Future of Real-Time Analytics
Biggest Gaps in Data Management Tooling
Closing Remarks