Summary
The term "real-time data" brings with it a combination of excitement, uncertainty, and skepticism. The promise of insights that are always accurate and up to date is appealing to organizations, but the technical realities to make it possible have been complex and expensive. In this episode Arjun Narayan explains how the technical barriers to adopting real-time data in your analytics and applications have become surmountable by organizations of all sizes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
- Your host is Tobias Macey and today I’m interviewing Arjun Narayan about the benefits of real-time data for teams of all sizes
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your conception of real-time data is and the benefits that it can provide?
- types of organizations/teams who are adopting real-time
- consumers of real-time data
- locations in data/application stacks where real-time needs to be integrated
- challenges (technical/infrastructure/talent) involved in adopting/supporting streaming/real-time
- lessons learned working with early customers that influenced design/implementation of Materialize to simplify adoption of real-time
- types of queries that are run on materialize vs. warehouse
- how real-time changes the way stakeholders think about the data
- sourcing real-time data
- What are the most interesting, innovative, or unexpected ways that you have seen real-time data used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Materialize to support real-time data applications?
- When is real-time the wrong choice?
- What do you have planned for the future of Materialize and real-time data?
Contact Info
- @narayanarjun on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Your host is Tobias Macey. And today, I'm interviewing Arjun Narayan about the benefits of real time data for teams of all sizes. So, Arjun, can you start by introducing yourself?
[00:01:52] Unknown:
Hi. I'm Arjun Narayan. I am the cofounder and CEO of Materialize,
[00:01:56] Unknown:
the streaming database for all your real time needs. And do you remember how you first got started working in data?
[00:02:02] Unknown:
I did a PhD in databases, which I stumbled into, which is a hard thing to stumble into. But that's really sort of how this journey started. I was interested in computer science. I wanted to do more computer science. I applied to some PhD programs, specifically around distributed systems, networking, and, you know, back then, I guess it was called big data. And as I got closer to the weeds in all of those problems, I repeatedly had the thought that these problems are database problems that the database people have thought really hard about for multiple decades. And maybe we should just be using the tried and true solutions, and we should make the databases more scalable rather than getting people into distributed systems sort of rolling their own database, and I got more and more opinionated over time that way.
And then I loved it. And so I worked for a small Series A startup at the time, Cockroach Labs, which is now, you know, a large, wonderful company. And that was sort of my introduction into databases.
[00:02:58] Unknown:
Now we're talking about the use of real time data because as we have increased the scalability and performance of the data systems that we're running, we're starting to say, okay, now we can actually do this faster. And so to frame the conversation, I'm wondering if you can give your conception of what real time data is and what that means and some of the benefits that it can provide versus the, you know, juxtaposition of what people will generally term batch systems.
[00:03:28] Unknown:
Yeah. Totally. So the first thing is that real time data has, like, an enormous amount of baggage associated with it, because people typically associate the sort of desired end goal, which is data that is fresh and fast and up to date, with the difficulty in the implementation details of building the horrendous pipelines that can even deliver that end result. And people go, oh, I don't know if I wanna deal with real time data. But if you just sort of step back and ask, do you want your data to be up to date and fresh? If I could wave a magic wand and say, you know, all of your data was up to date all the time, pretty much everyone would take that deal. Right? Particularly as long as the tools would allow you to sort of do arbitrary historical time travel. Right? Then real time data is always better, sort of strictly better. Right? The problem is that so far, when people have tried to adopt real time data, they have had to get their hands dirty in increasingly unpleasant ways and deal with a lot of complex building blocks that they don't necessarily have to in batch. Right? So in batch data, you have these tremendous amounts of tools and ecosystem that really help you wrangle all of your data with a lot of ease. And one of these sort of classic tool slash ecosystem things is SQL. Right? You just write a SQL query.
Tons of stuff happens behind the scenes. You don't really know or care. Right? You're like, I don't know if this takes a bunch of distributed systems under the hood, microservices in the cloud. Like, I type SQL. I get answer. Like, this is a wonderful way to live, right, for the vast majority of people. And, you know, until now, that has not existed in the world of real time with the exact same fidelity and ease of use as it has in batch. People have had to build Kafka clusters and write microservices and do all that stuff manually.
And if we could move away from that paradigm, if we can make real time as easy as batch, my contention is most people would choose real time, because why not? As long as it was as easy and as cheap, you know? The one situation where I could see people not wanting real time, or where they would prefer batch tools, is in the world of data exploration. Right? So when you are doing sort of data science, or you are in this exploration mode where you are simply thinking about, what am I even looking at? This looks funny, that sort of moment. You typically want to reduce as many moving parts as possible, which means you might wanna fix the dataset. Right? So let's just look at how the world was at midnight last night, and then let's probe and look for correlations. In this exploratory mode, I think batch will always be better simply because there are a variety of optimizations and tools and techniques for making these ad hoc exploratory queries very fast and responsive that a batch system can optimize for. But beyond that, any sort of recurring pipeline, any sort of job that runs every night, if you fix the query, my contention would be if a query is ever run a second time, then real time is always gonna be preferable for the user to batch.
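To make the "run a second time" point concrete, here is a minimal sketch of the same rollup written first as a batch query and then as a continuously maintained view, assuming a hypothetical orders table and Materialize-style CREATE MATERIALIZED VIEW support; the names and columns are illustrative, not from the episode.

```sql
-- Batch habit: rerun this rollup on a schedule and accept that the result
-- is stale between runs.
SELECT customer_id,
       sum(amount) AS total_spend,
       count(*)    AS order_count
FROM orders
GROUP BY customer_id;

-- Real-time habit: declare the same query once as a materialized view and let
-- the engine keep the result incrementally up to date as new orders arrive.
CREATE MATERIALIZED VIEW customer_spend AS
SELECT customer_id,
       sum(amount) AS total_spend,
       count(*)    AS order_count
FROM orders
GROUP BY customer_id;

-- Consumers just read the view; freshness becomes the engine's problem.
SELECT * FROM customer_spend WHERE total_spend > 1000;
```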
[00:06:43] Unknown:
Yeah. It's definitely a good framing for it. And as far as the kind of typical challenges that are associated with streaming, before we get too much further into kind of the meat of the discussion, I'd like to frame some of the ways that that manifests and maybe some of the ways that modern systems have been able to either circumvent or paper over those challenges. So things that come to mind are: it's very difficult to be able to join across moving datasets. It's very difficult to know what the proper window sizes are if you're trying to run an aggregate. It's very difficult to be able to do kind of long time span historical aggregates with moving data, because you have to decide, okay, you know, if I wanna compute the state of the world, you know, from the beginning of time, then that's gonna take a long time, and by the time it finishes, the answer will have changed. And so I'm wondering if we can maybe talk to some of those classical issues and some of the ways that we're able to either ignore or resolve or kind of paper over some of those challenges.
[00:07:46] Unknown:
Yes. Many of these concerns that you have raised typically are those faced by practitioners that have adopted streaming technologies and are trying to build something in streaming. Streaming today typically means sort of deploying a Kafka cluster. That's pretty much the way that people move data from point a to point b. And then using some sort of stream processing tool like Kafka Streams or Apache Flink or something of that sort in order to do the query processing. And these stream processing platforms have sort of proto SQL layers on top, which do not support the full fidelity of the sort of standard SQL that a batch database would support.
The problem that folks typically encounter is that the stream processing systems fail to scale at meaningful throughputs of messages per second and of aggregate historical data size for queries that are very stateful. So, typically, what people start doing is they start arbitrarily reducing the amount of state that the system has to wrangle so it doesn't fall over. So they typically impose some kind of window size: sort of, let's do this query, but let's look at the aggregate sum or join or something of that sort over a recent historical window, let's say the last 10,000 messages. Right? Like, what this does is it restricts the system from having to hold all the messages all the time, and so it's able to make some amount of progress without falling over.
The problem is that may not be the query that I want. Right? Like, if I'm doing a join on some primary key, there may not be a join between two Kafka topics where you're guaranteed to find the match in those last 10,000 messages. Right? You have restricted the set of messages for consideration too much compared to the actual semantics of the messages that are flowing through the system. So when you start to have to deal with what are fundamentally implementation details because of the inherent limitations of the system, you can't reason about just the semantics of the query you're trying to get done. Right? So you are no longer in this realm of, like, I am an analyst. I am thinking purely about the business needs. Like, I have this information here, this other information there. I wanna join these two things together, and I want the result, and I want it to be up to date. Right? Like, if you're able to sort of think in that declarative way without having to think about, oh, well, now I have to consider the window sizes so that the system doesn't fall over, then you can actually make a lot of progress. And that is what batch systems are wonderful at. Right? Like, in the batch system world, you wouldn't have to think about, you know, whether that was in the last 10,000 messages, because a large cloud native batch data warehouse will just gobble up billions of messages for breakfast and give you the output of your join.
These challenges that come from the inherent limitations of some specific stream processors, in my contention, and this is an opinionated sort of biased contention, have been holding us back. Right? We need to get to a world where the underlying streaming systems are at par or close to par with the capabilities of batch data warehouses, where the user doesn't have to think about these things. What does that mean? It has to be scalable and able to deal with extremely large amounts of state management at the end of the day, such that the user doesn't even have to care about state management. Right? So streaming state management needs to basically no longer be a consideration, and people arbitrarily imposing window size restrictions on the histories of their Kafka topics needs to stop being a thing. Right? And as long as that is a thing, streaming will never be as easy as batch.
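As a hedged illustration of the window-size problem described above, assuming two hypothetical streams already exposed as relations (orders and shipments), the query the analyst actually means has no window in it; the windowed variant is only a workaround for a processor that cannot hold enough state.

```sql
-- What the analyst means: match every order with its shipment whenever the
-- shipment arrives, and keep the result continuously up to date.
CREATE MATERIALIZED VIEW order_status AS
SELECT o.order_id,
       o.customer_id,
       s.shipped_at
FROM orders o
LEFT JOIN shipments s ON s.order_id = o.order_id;

-- The workaround a capacity-limited stream processor forces on you: only
-- consider "recent" events, so matches that straddle the window boundary are
-- silently lost. (Illustrative only; windowing syntax differs across engines.)
SELECT o.order_id, s.shipped_at
FROM orders o
JOIN shipments s ON s.order_id = o.order_id
WHERE o.created_at > now() - INTERVAL '1 hour'
  AND s.shipped_at > now() - INTERVAL '1 hour';
```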
[00:11:36] Unknown:
As far as the types of organizations or teams that have, to this point, been adopting real time, I'm wondering if you see any commonalities or general themes of the motivating factors that have pushed them to making that engineering investment and making it tractable, and some of the ways that you can see the evolution of real time adoption continuing as these systems become more mature and sophisticated and easier to operate?
[00:12:08] Unknown:
Yeah. So, today, the only people who have been able to meaningfully deploy and manage and wrangle streaming infrastructure, and this has been getting better. Life has been getting sort of better for folks. It used to be, 5 years ago, you needed an entire team to manage your on prem Kafka clusters, and the move to sort of cloud native, fully managed and hosted Kafka services means that you can do it with a leaner setup. But even today, the folks who run and manage streaming infrastructure tend to be on the production infrastructure side. It is not something that is accessible to a business analytics team, the kind of team that would use a Snowflake or a Redshift. Right? The folks who are productive in adding meaningful business value, doing analytics, using fully managed services like Snowflake and Redshift cannot today be productive here, and they have to sort of cross over to the production team to help deploy some real time infrastructure and perhaps even write some Java code for some Kafka Streams application or some Flink application that sort of gets munged into that and orchestrated by some Kafka connectors. You've crossed teams entirely here. Right? And I forget the provenance of this joke, but, basically, anything that requires two VPs to commit resources basically means it's never gonna get done. Right? So we're talking about a two VPs problem at this point.
And the infrastructure needs to get easy enough that it can be handled by the analytics team. And that doesn't mean that that's the end goal here. Everyone can benefit, including on the infrastructure side. The infrastructure side also buys and uses hosted OLTP databases. Right? So we need to get our streaming infrastructure to be as simple as an OLTP database or a cloud data warehouse, such that the number of productive users can go up dramatically.
[00:14:00] Unknown:
The other interesting angle of this topic is when we're talking about real time data, we're generally talking about the need to be able to have up to date information for some sort of product or business purpose, and I'm wondering if you can talk to who you see as the general consumers of that real time data and maybe if there are any kind of common personas that you see across those use cases.
[00:14:24] Unknown:
That's a wonderful question. Let me start by talking about the amazingness of batch data warehouses. Right? Like, batch data warehouses are used by a variety of personas, but typically analysts or data engineers who are producing some result for human consumption. Right? So as long as there's a human in the loop, batch actually works pretty darn well today. And while real time may be a nice to have, and, you know, if it was click, click, click, just as easy as batch, they might prefer the dashboard to sort of be constantly up to date. Today, if you're going to produce a report that goes to a human being, batch works great. Where real time starts to become a must have, not just a nice to have, is when you want to embed that into an automated action. So you are going to take the result of this SQL query and do something automated. That may be automated marketing. Right? That may be automated segmentation of your users, automated email activation. And today, what ends up happening is the analytics team has these insights, and they're able to, say, push this data back into Salesforce for human, you know, sales reps to take a look at and make sort of these business judgment calls.
But they are unable to deploy this in an automated way without having to go back to that infrastructure engineering team. And that is a persona. The analyst, the business owner, the sort of line-of-business owner that wants to take a manual action and turn it into an automated action and fold it back into their core application, that really is the persona that wants real time infrastructure today that is on par with the batch infrastructure that they are productive with and use happily.
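One hedged sketch of what turning an insight into an automated action can look like in SQL: maintain the segment as a view, then stream its changes back out so a downstream service can react. The tables, the topic, and the CREATE SINK clauses are assumptions (Materialize-style syntax that varies by version), not details from the episode.

```sql
-- Continuously maintain the segment we want to act on.
CREATE MATERIALIZED VIEW at_risk_users AS
SELECT user_id, count(*) AS bad_experiences
FROM support_tickets
WHERE severity = 'high'
GROUP BY user_id
HAVING count(*) >= 2;

-- Stream changes to that segment back out to Kafka so an automated consumer
-- (marketing automation, email activation, etc.) can react without a human
-- in the loop. Connection and format clauses are version-specific sketches.
CREATE SINK at_risk_users_sink
FROM at_risk_users
INTO KAFKA CONNECTION kafka_conn (TOPIC 'at-risk-users')
FORMAT JSON
ENVELOPE DEBEZIUM;
```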
[00:16:15] Unknown:
Beyond just the infrastructure aspects of being able to manage the flow of data, you know, whether that's running a Kafka cluster, what are the ways that the end to end stack needs to be properly integrated or architected to be able to source and generate and process these real time flows, versus just a periodic batch job of reach out to this other system, grab a bunch of information, and pull it into this other system?
[00:16:52] Unknown:
The first one is just moving the data in real time from all the various systems where the data already lives, right? So no data system is in a vacuum. You have to first start by pulling data out of other systems. A huge amount of data today lives in OLTP databases, right, a Postgres, a MySQL, an Oracle. And so change data capture ends up being an important, sort of day one, aspect of real time data. Right? So batch ETL no longer really works, because if you've already had to wait for your data to be exfiltrated sort of once an hour or once a day, there's no potential for real time computation downstream. Right? So just moving your data in real time out of your OLTP systems using a change data capture system. Debezium is a popular one, but many of these databases also have native sort of log replication things that you can scrape off of or push into Kafka.
The second one is existing real time data that already exists. Right? So a classic example of this is Segment web events. Right? Like, you can just get real time Segment events that are coming off of your website or something of that sort. And then you typically want to think about the downstream computations that you are going to do. And over here, I have an opinionated take, which is most people, and this is not an absolute binding universal, but I think most people, most of the time, can get most of the things they want done with just SQL. So what they really want is a system that can connect to all of these input sources, be that databases or web events or arbitrary Kafka events, and then they just wanna write some SQL queries. And today, they do that in batch fine. Right? So you can take all of the stuff and load it into a Snowflake or a Redshift and then write SQL queries. And our contention is people just wanna write that same SQL and have the result of that SQL query just always be fresh.
And that is what Materialize does. Right? So Materialize is a database that under the hood is a full sort of stream processor. But from the user experience level, it's just them writing Postgres queries that stay up to date. You can't get everything done this way. Like, there are times where you do want to write some kind of imperative code. Not everything fits SQL, and not everything fits SQL ergonomically. Right? Like, there's some things that you could wrangle into SQL, and, just like that line, your scientists were so preoccupied with whether or not they could, they never stopped to think whether or not they should. Right? Like, you can write some god awful SQL that you wish you could have written in some other language. But there are also computations that the current sort of SQL standards just don't fit, imperative sort of paradigms.
And those things are always gonna have to live as custom microservices. Right? But the contention that I would make is that 90% plus of what you were forced to push into microservices to get real time, or simply live with as high latency batch, can be done using SQL on top of your streaming inputs.
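A minimal sketch of that "connect the sources, then just write SQL" flow, assuming Materialize-flavored DDL (a Postgres CDC source plus a Kafka source); the connection objects, decoding details, and all table and column names are illustrative assumptions that vary by version.

```sql
-- Change data capture from the OLTP database (here: a Postgres publication).
CREATE SOURCE pg_repl
FROM POSTGRES CONNECTION pg_conn (PUBLICATION 'app_pub')
FOR ALL TABLES;

-- Data that is already streaming, e.g. website events landing in Kafka.
-- (Decoding is simplified; depending on version the payload may arrive as a
-- single jsonb column that you project out in a view first.)
CREATE SOURCE web_events
FROM KAFKA CONNECTION kafka_conn (TOPIC 'web-events')
FORMAT JSON;

-- Downstream, the work is "just SQL": a join across both worlds, kept fresh
-- as either side changes.
CREATE MATERIALIZED VIEW signups_with_first_visit AS
SELECT u.id, u.email, min(e.received_at) AS first_seen
FROM users AS u
JOIN web_events AS e ON e.user_id = u.id
GROUP BY u.id, u.email;
```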
[00:19:52] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudder. As far as the work of being able to consume these real time data streams, once you have the underlying infrastructure, you've got the integration set up.
Are there any kind of educational aspects that you found necessary for data teams or analysts to be able to construct the way that they're thinking about the problem in a way that is more conducive to working with a continuous flow of data versus what might be more kind of natural or familiar to them coming from a batch world where rather than saying, you're going to construct this query where you're aggregating across all data in the entire system as opposed to, you need to think about this differently because what you're actually trying to answer is what is the state of the world right now versus what has been the state of the world across all time? So not everything that can be computed efficiently
[00:21:19] Unknown:
in sort of a one shot computation can be translated into something that can be efficiently incrementally maintained. A good way to sort of get some intuition around this is to think of something where the answer set just changes dramatically second by second. Right? You are pulling out the result of some query where if you ran it this second and then you reran it the next second, you're just gonna get a wildly different result. Right? So you do need to think about those queries where the result set changes in some manageable way, because you are going to be doing work, or you're gonna be asking whatever system you're using, be it Materialize or something else, to do work proportional at the very minimum to the rate of evolution of the result set. Right? You might genuinely want a fire hose of downstream computation that changes very quickly, but then you're gonna need sort of the amount of resources, you know, in terms of CPU cores or what you will, that is able to handle that. The key piece of education that we have to do is that it is worth thinking about what you actually want in this result set, because a lot of the time, people, you know, tend to think about it in terms of the amount of inputs, and it's really in terms of resource allocation. We need resources to deal with the amount of inputs that are changing.
And if you reframe it in terms of the amount of outputs that change, there are opportunities to use vastly fewer compute resources and have something that may have a sort of fire hose of data coming in, but is able to quickly look at that data, make some incremental changes, and discard the data, and hold much less state than people think is necessary, because incremental maintenance of the query results using resources proportional to the outputs is much, much, much less than sort of hoovering in all the data, computing everything, throwing away 99% of it, emitting an output, and then sort of redoing it in a non incremental sort of rinse repeat fashion.
And that means that people come in with an expectation, or they've been trained by their batch systems and by the sort of hoover-the-data, rinse-repeat batch pipelines that they've built, and we need to educate them in thinking sort of incrementally.
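A small sketch of the "think in terms of how fast the output changes" lesson, using a hypothetical page_views stream; both statements are ordinary CREATE MATERIALIZED VIEW declarations, and the difference is entirely in how big and how volatile the result set is.

```sql
-- Fire hose in, small result out: the result has one row per (page, day), so
-- the state the engine keeps is proportional to that small output; raw events
-- can be folded into the counts and then discarded.
CREATE MATERIALIZED VIEW daily_page_counts AS
SELECT page, date_trunc('day', viewed_at) AS day, count(*) AS views
FROM page_views
GROUP BY page, date_trunc('day', viewed_at);

-- Fire hose in, fire hose out: the result changes as fast as the input does,
-- so no amount of incremental cleverness avoids per-event work downstream.
CREATE MATERIALIZED VIEW enriched_views AS
SELECT v.*, u.plan
FROM page_views v
JOIN users u ON u.id = v.user_id;
```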
[00:23:42] Unknown:
Another aspect of this is the kind of current state of the world for streaming has been fairly, I don't want to say static, but it has been complicated for a long time. And I'm wondering what you see as the contribution of that state to the general attitude of teams towards their appetite for even starting to think about adopting real time or incorporating real time into their applications, and what they see as that kind of barrier to entry or initial required investment, and maybe some of the ways that the reality is no longer what they assume it to be.
[00:24:23] Unknown:
We encounter this all the time. Right? Like, in fact, we go out of our way to not say streaming, because the associations that people have with the word streaming are that of grief and despair and pain. Right? I don't think it is inaccurate to say that sort of streaming today is roughly where Hadoop was, you know, 15 years ago, which is this giant confusing mess of things that you sort of have to constantly wrap your head around and deploy and staff up teams of, you know, tens of engineers to even just keep running. This is only gonna go away with results. Right? So people have to be able to use real time systems, get real time results where the underlying engine is a stream processor, without really thinking about stream processing or streaming.
And it's only once that happens for a large enough ecosystem of folks who no longer have that prior association with streaming are things gonna change. Right? I think there's a lot of scar tissue and a lot of existing associations that it's simply gonna be a matter of time before that no longer is the case. That said, I do want to take a moment to talk about the fact that existing streaming systems, at least they exist. Right? Like, they are delivering a lot of value to a lot of folks who are able to build and deploy these systems. Right? So the same as with the Hadoop ecosystem. Right? There were lots of companies that had full at scale deployments of clusters and such and were getting value out of those systems.
The problem with that world was it required that you had an organization that could hire a hundred Hadoop experts. Right? And not every organization can do that. Right? The democratization of the ability to use big data and get results and be sort of data driven, so to speak, really came about because with cloud native batch data warehouses, organizations that couldn't hire Hadoop engineers got the capabilities of organizations that could hire Hadoop engineers. Right? And I think the same thing is true of streaming. Right? So if you are a streaming first company, great example, Netflix.
Right? Netflix can get whatever they want done using streaming, because they can hire the best of the best, and they can build and maintain whatever they need to maintain. Not every company has that luxury. Right? If you are a company that is not in Silicon Valley, that cannot sort of snap your fingers and have your recruiting team hire, you know, 20 streaming fluent microservice architects, you just have no options today.
[00:27:12] Unknown:
As you have worked with organizations and they've started to incorporate real time data into their products, into their analytics, I'm curious what you have seen as the ways that that changes the way that they think about the information that they need, the types of analyses that they are interested in, and the ways that data is applied and incorporated across the organization?
[00:27:40] Unknown:
The first is that folks tend to get pulled back into production; there's a sort of gravitational pull. Right? So what has in the past had analysts sort of sitting only near the exhaust of the data and then pushing things back to human stakeholders throughout the organization turns into a closed continuous loop, which means that folks who have not had to think about the problems of running production systems have to start thinking about production systems and thinking about their deliverables in a production system. What this means is, if you have previously been a stakeholder that has only dealt with cloud data warehouses and then handing data back, you know, manually, human to human, to other members of your organization, and now you are delivering something in an automated fashion, you now have to think about an on call rotation, getting alerts, getting paged if these systems go down. That is something that the production folks have always had from day one. Right? Like, if you are a DBA, you've always had this. And so that's, I think, the biggest mindset shift. It's sort of, as you go to real time, as you go to production, you have to involve a set of considerations that you may not have had to have before. And that is, I think, an ecosystem wide journey with things like reverse ETL or pushing things back into production systems. Folks have to sort of change the way they think. The second one is the expanded capabilities that they can take advantage of. We've had some sort of funny or, I think, you know, pleasant interactions where folks have artificially restricted themselves by sort of staying in the narrow lanes that their streaming systems have allowed them in the past.
And once they realize that those, you know, restrictions are lifted, they have a wonderful set of opportunities in front of them. The biggest one of these is joining data from multiple data sources. Right? This is something that has been extremely difficult using, you know, previous generation streaming tools. Once you can start to look at a join between your production OLTP data and the log from your Segment web events, right, the list of possibilities, the features you can build, widely expands. And so people oftentimes go into sort of this flurry of, oh, I can do this. I can do this. I can do this. That's very pleasant to see as somebody who builds data systems, to just see the possibility space expand.
[00:30:08] Unknown:
To your point about kind of managing state and the other earlier question about computing the history of the world, another aspect of real time and streaming that has been a kind of mainstay of the space for a while is the need to have two separate systems in order to effectively combine historical analysis with real time updates, where you have your batch system where you do these big expensive computations where you can compute across everything, and then you have your streaming system which is just for the most recent data, and then you have to periodically age things out of that into your batch systems.
And there have been evolutions of that with things like Pulsar and Kafka having the ability to tier their storage so that it's not all sitting hot on disk. They're able to archive that out to S3, but then still be able to use the Kafka cluster and the Kafka APIs or Pulsar APIs to, you know, work across that historical data as well. And I'm curious what you see as the current state of the art for being able to work across the real time and the historical data and maintain that statefulness, and kind of what the trade offs are of either still requiring two different systems to be able to be effective, or if you're able to actually integrate everything into one system, one interface, and one kind of way of working with the data.
[00:31:38] Unknown:
I will take the opinionated stance that that is horribly broken, and people should never have to sort of manually think about aging, or about building a Lambda architecture that has a manual reconciliation sort of process to move data from real time to batch. I do not think there will be one system that people use to solve all of their problems. Right? The key distinction I would make is people should choose different tools because they have different business needs. Right? And they have different teams that have different workflows for which different tools are suited. I think it's wonderful that there are many databases and many data platforms for different folks, you know, but it should be based off of actual sort of different sets of stakeholders having different sort of UX or workflow needs, not because the systems under the hood sort of need to have that data migrated. Right? When a single stakeholder has to use multiple systems, I think that does nothing but slow them down. I think real time data platforms like Materialize will sit side by side with systems like Snowflake or Redshift, because there are different stakeholders and there are different data pipelines.
For instance, BI tools that are doing data exploration. Right? They're probably gonna run off of the batch system. But if you have to build some sort of manual reconciliation thing that's pulling data from your Snowflake and from Materialize, or from a batch system and a real time system, because the real time system can't handle the scale and the dataset size of the batch system, that is a flaw. That is bad. Right? So we take the belief that the real time system should be able to deal with all the state management, with the seamless, completely transparent usage of cloud object stores like S3 to deal with terabytes and terabytes of state, without the user having to sort of lift a finger. And that is a good thing, and users should demand that of all of their real time infrastructure.
[00:33:41] Unknown:
And so as you are working with your customers at Materialize and communicating with other players in the ecosystem, I'm wondering what you see as the broader trends and evolutions in the space and the kind of underlying technologies, and some of the ways that the feedback that you've gotten from working with some of your early customers and design partners has influenced the way that you think about the implementation of the Materialize platform and the ways that you want to expose the interfaces and shape the interactions with the underlying data for these end users?
[00:34:18] Unknown:
I think the biggest shift is this sort of tendency for everything to go to prod. Right? Like, the fact that we have to, you know, support and enable users who are new to running prod services to start thinking in a prod services mindset. Right? So I don't mean in terms of us having an on call sort of production alerting system. That goes without saying. I mean users who have gone from not having to do that, because they run batch jobs that they sort of check in the morning to see if they succeeded or if they failed and hit rerun or something of that sort, to these stakeholders having to have their own notion of alerting and sort of real time continuous testing, real time continuous integration testing before they change a query, because there's essentially a live migration that needs to happen, because they have these running services and dependencies.
Because they've moved from batch to real time. Right? These are new considerations, and these are challenges that we have had to sort of help educate people through. It's also very gratifying, because they are now able to build and deploy production services that before would have to roll through another team, right, where they would say this is a thing the business needs to do, and then they'd have to sort of make a case that this other team would need to be staffed, the one that already has the alerts, that already has the on call rotation, that already manages the streaming infrastructure, and sort of petition that team to do it. It's gratifying to enable folks to self serve and build these production real time services without needing to cross that team boundary, but then there's also the challenge of having to educate them through the hard parts
[00:35:58] Unknown:
of that. In your work of building Materialize, I actually had one of your co-founders, Frank, on a few years ago, fairly early on in your product launch. And I'm curious if you can talk to some of the ways that the product focus and the ways that you think about building this real time substrate have evolved over the past few years, from when you first launched to where you are
[00:36:25] Unknown:
today? That's a wonderful question, because we have had a dramatic improvement, or upgrade, to Materialize through making it a cloud native service since you talked to Frank. Right? So when we first started Materialize, all we wanted was to show that SQL materialized views were even possible on top of real time changing data. Right? So Materialize was a binary. It ran on a single machine, and it simply connected to Kafka clusters. And it could handle a decent amount of volume, but it was fundamentally, even though it could horizontally scale out, a single service. Right? And we have built a next generation cloud native Materialize with separated storage and compute, using cloud object store and persistent storage, that supports multiple use cases that all can share the data but can stay isolated from each other and independently scale. What this fundamentally means is separate compute clusters that can sort of feed into each other, that can stay siloed, that can be replicated so that you can have redundancy in the case of faults or outages.
And all of this is new, and we recently announced it just a couple of months ago in our early access. We have users that have come from the all on premise world, where they were themselves running and orchestrating multiple VMs, each with its own independent Materialize binary, into using a single unified system that allows them to share the underlying inputs while having isolated compute experiences. And that is what's new since you talked to Frank.
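For a rough picture of what shared inputs with isolated compute looks like, here is a hypothetical sketch using the cluster and replica DDL that the cloud platform's early access introduced; the cluster names, sizes, and exact clause shapes are assumptions that vary by release.

```sql
-- Two compute clusters that share the same underlying sources and storage but
-- are isolated from each other; the second carries two replicas for redundancy.
CREATE CLUSTER analytics REPLICAS (r1 (SIZE 'small'));
CREATE CLUSTER serving REPLICAS (r1 (SIZE 'medium'), r2 (SIZE 'medium'));

-- Work is pinned to a cluster, so a heavy analytical view cannot starve the
-- low-latency serving workload.
SET cluster = analytics;
CREATE MATERIALIZED VIEW daily_rollup AS
SELECT page, count(*) AS views
FROM page_views
GROUP BY page;
```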
[00:38:08] Unknown:
Build data pipelines, not DAGs. That's the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high cardinality joins, aggregations, upserts, and window operations. Output can be streamed into a data lake for query engines like Presto, Trino, or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose.
Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake and run unlimited transformation pipelines for free. That way, data engineers and data users can process to their heart's content without worrying about their cloud bill. Go to dataengineeringpodcast.com/upsolver today to get a 30 day trial with unlimited data and see for yourself how to untangle your DAGs. In terms of the way that you think about the space, I'm curious if there are any early ideas or assumptions about what was possible or what people wanted out of the space that have been challenged or invalidated as you have gone through this journey of building the platform, building the business, and working with customers, and then going through this rearchitecture of the underlying technology?
[00:39:37] Unknown:
So when we first built Materialize, it had no persistence layer. Right? Like, it depended on upstream data sources being endlessly replayable. Typically, you know, this meant a Kafka cluster with infinite historical retention, or it required a Kafka cluster where the compaction settings that it used were completely compatible with replaying the computation. Materialize was a purely in memory, ephemeral system. This did not work for a lot of people, for a variety of reasons, one being sort of tight coupling to the underlying Kafka system. We always had a long term plan of building a persistent system for Materialize, and we quickly came to the realization that this was in fact a fundamental requirement for Materialize's wider adoption. So our cloud native platform comes with persistence by default. Right? Persistence using the cloud native S3 object store, sort of infinitely scaling persistence while still being low latency.
That was, I would say, the biggest shift between sort of when we first started launching Materialize and putting that in front of customers, and Materialize today. In your
[00:40:52] Unknown:
experience of building the platform and working with customers and working with other players in this ecosystem? What are some of the most interesting or innovative or unexpected ways that you've seen real time data applied?
[00:41:04] Unknown:
One of the wonderful things about building a data system is having a business that is horizontal, where you're not coming in tied to a specific industry vertical, which means you get exposed to a wide variety of use cases that come from folks that do have this vertical specific expertise. Right? And so we find ourselves constantly delighted by the things that are possible. And I think this is true of all databases, which are sort of ubiquitously used across all sorts of verticals. The biggest one that we find, and this may not have been new to other folks at Materialize, but to me personally, was the fact that Materialize allows our users to unify their machine learning workflows and do real time sort of online prediction very seamlessly. Right? So, typically, without a system like Materialize, what you would have to do as a user would be to do some sort of training, extract some features, and then you would take those features and, you know, the weights, and then build a separate system that was streaming that would do prediction based off of the now streaming inputs, and then feed that into some kind of cache, and then have that cache be hit by your production application. Right? So now you've got, you know, a training system. You've got a stream processor.
You've got a Redis cache. You have some sort of orchestration workflow to sort of periodically recalculate the feature weights in a batch offline fashion, extract those weights, and then update the streaming pipeline. And all of that complexity is gone, so you can just focus on the core machine learning, by unifying batch and streaming, because you can fundamentally run those complicated training queries on the same system that's doing the online feature prediction. That to me was, I would say, the most personally gratifying.
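A hedged sketch of that unified feature-serving pattern: features are maintained as a view over the raw streams, and the "prediction" is just another view the application reads directly, with no separate stream processor or cache to keep in sync. The tables, the feature definitions, and the linear weights are all illustrative assumptions, not details from the episode.

```sql
-- Feature values maintained continuously from the raw event stream
-- (time-windowing of features is elided to keep the sketch simple).
CREATE MATERIALIZED VIEW user_features AS
SELECT user_id,
       count(*)                 AS order_count,
       coalesce(sum(amount), 0) AS total_spend
FROM orders
GROUP BY user_id;

-- Online scoring as a view: a simple linear combination whose weights came
-- out of an offline training job; the numbers here are placeholders.
CREATE MATERIALIZED VIEW churn_scores AS
SELECT user_id,
       0.8 - 0.02 * order_count - 0.001 * total_spend AS churn_score
FROM user_features;

-- The application issues ordinary point lookups against the view.
SELECT churn_score FROM churn_scores WHERE user_id = 42;
```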
[00:43:02] Unknown:
In your work of building the Materialize platform and exploring the technology and working with your customers, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?
[00:43:13] Unknown:
Well, I'd like to say I was mostly eyes wide open starting this journey. I had a lot of benefit from seeing and being a part of building CockroachDB, from when it was, you know, a single node database that would fall over every day to a, you know, tens of nodes, extremely resilient cluster. But the number one challenge, I think I mostly went into this eyes wide open, but there's always more than you think. It's a long journey. Right? It just takes a very long amount of time to build a rock solid production database from scratch. None of this is built in a single sprint. None of this is built with a single team.
And so recruiting, building, and managing a large engineering organization that's building a product that has a very long time before you get, you know, market signal is a challenging process. It's very different from a business where within six months, you can put this in front of customers and then iterate with that customer feedback. Right? It took us many years with zero customers to build the very first version, because nobody buys 80% of a database. Right? Like, you really do have to get that fit and finish to a very, very high bar before you will have your first customer. And so that is a challenging process, and I like to think I went into it mostly eyes wide open.
But it always takes, you know, 20% longer than even your most pessimistic assumption.
[00:44:51] Unknown:
For people who are weighing their options about how to approach their use of data and the way that they architect their data systems, what are the cases where real time is the wrong choice?
[00:45:03] Unknown:
I think real time does not make sense for data exploration. I touched upon this a little bit earlier. But if you are in this genuine exploratory mindset where you don't even really know what you're looking for, and you're sort of poking around and trying to understand your users better, let's say you've just been tasked with an open ended problem, I think a batch system is simply gonna be better, because you are going to be running an extremely high number of one off queries. You know? You're gonna be sitting there going, what is that correlated with that? What is this column when I join it with this other table? Right? And I think a real time system will confuse rather than clarify, because in that mode, you want to reduce the number of variables that are changing.
And the dataset itself moving, the ground shifting under your feet, is one more complication you do not need. It is only once you have come to some conclusion from that exploratory process, where you've said, this is the thing that predicts customer churn: when customers have two bad experiences with my company in a short period of time, we need to immediately do x y z, or else they are never going to use our service. You'll come to some conclusion like that. Right? So that is the point at which real time becomes the right choice. But if you introduce real time into that data exploratory process, it will not help. It will simply hurt.
[00:46:29] Unknown:
As you continue to work in this space and build out and evolve the Materialize platform and business, what are some of the things you have planned for the near to medium term, and some of the ways that you're looking to improve that end user experience around real time data?
[00:46:46] Unknown:
There's always larger scale. There's always more performance work that one can do. We're quite proud of what we've built. We are still in early access. Next year, we will make Materialize generally available, which means that anyone anywhere in the world can sort of click, click, sign up and get access to Materialize. There's a lot of scaling and operational work that goes on behind the scenes. So a lot of that is really making Materialize more accessible to the widest number of users possible, making it available on every cloud service. Right? So today, Materialize is AWS only. There are also some cool features that are on our to do list. Right? So today, Materialize is SQL only, and I wanna sort of plant the seed that, you know, things like user defined functions and triggers and recursive queries, things that are, you know, on the more bold and ambitious sort of end of the spectrum, are things that we are still to build. Right? I'm particularly personally excited about recursive queries. Right? Our underlying stream processor is very capable of handling meaty recursive queries, and most SQL databases have sort of weak, if at all existent, support for WITH RECURSIVE, and I'd like to change that. So that's something that we really wanna do in the coming years.
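Since WITH RECURSIVE comes up here as a roadmap item rather than a shipped feature, the following is just a standard-SQL example of the kind of query meant, walking a hypothetical employees table to compute each person's depth in an org chart.

```sql
-- Classic recursive query: traverse a hierarchy stored as (id, manager_id).
WITH RECURSIVE org_chart (id, name, depth) AS (
    SELECT id, name, 0
    FROM employees
    WHERE manager_id IS NULL          -- the root of the hierarchy

    UNION ALL

    SELECT e.id, e.name, o.depth + 1
    FROM employees e
    JOIN org_chart o ON e.manager_id = o.id
)
SELECT id, name, depth
FROM org_chart
ORDER BY depth, name;
```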
[00:48:04] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:20] Unknown:
I think the biggest gap is, while there have been some advances in bringing good software development practices, sort of repeatable builds, repeatable testing, I still think that data tooling isn't quite at the place of software tooling, where we have, you know, repeatable builds, continuous integration, continuous data validation, continuous data testing. We can make more progress there, such that building and maintaining data applications has the same level of quality that we can assure our customers, or our internal customers, as deploying software does. Right? So we really need to get to the point where it's like a fully automated GitHub with continuous integration tests, nightlies, immediate ability to roll back, you know, bugs, immediate alerts and notifications when there's sort of a data quality issue.
[00:49:13] Unknown:
And I think we're making progress there as an ecosystem, but there's still more work to do. Well, thank you very much for taking the time today to join me and share the work that you're doing at Materialize to help improve the experience and capabilities for people who are working with real time data. It's definitely a very interesting and constantly evolving space. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much for having me.
[00:49:44] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Data Complexity
Interview with Arjun Narayan
Real-Time Data vs Batch Systems
Challenges in Streaming Data
Adoption of Real-Time Data
Infrastructure for Real-Time Data
Educational Aspects for Data Teams
Current State of Streaming Systems
Impact of Real-Time Data on Organizations
Combining Historical and Real-Time Data
Trends and Evolutions in Real-Time Data
Evolution of Materialize Platform
Lessons Learned in Real-Time Data
When Real-Time Data is the Wrong Choice
Future Plans for Materialize
Biggest Gaps in Data Management Tooling
Closing Remarks