Summary
Streaming data sources are becoming more widely available as the tools to handle their storage and distribution mature. However, it is still a challenge to analyze this data as it arrives, while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static, without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven, shares his journey with the technology that powers the platform and how he and his team are pouring their energy into the community edition so that you can use it freely in your own work.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
- Your host is Tobias Macey and today I’m interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Deephaven is and the story behind it?
- What is the role of Deephaven in the context of an organization’s data platform?
- What are the upstream and downstream systems and teams that it is likely to be integrated with?
- Who are the target users of Deephaven and how does that influence the feature priorities and design of the platform?
- How do Deephaven's use cases and user experience compare with Materialize?
- What are the different components that comprise the suite of functionality in Deephaven?
- How have you architected the system?
- What are some of the ways that the goals/design of the platform have changed or evolved since you started working on it?
- What are some of the impedance mismatches that you have had to address between supporting different language environments and data access patterns? (e.g. batch/streaming/ML and Python/Java/R)
- Can you describe some common workflows that a data engineer might build with Deephaven?
- What are the avenues for collaboration across data roles and stakeholders?
- What went into the licensing choice and governance model?
- What are the most interesting, innovative, or unexpected ways that you have seen Deephaven used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Deephaven?
- When is Deephaven the wrong choice?
- What do you have planned for the future of Deephaven?
Contact Info
- @pete_paco on Twitter
- @deephaven on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Deephaven
- Materialize
- Arrow Flight
- ksqlDB
- Redpanda
- Pandas
- NumPy
- Numba
- Barrage
- Debezium
- JPy
- Sabermetrics
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data. So, Pete, can you start by introducing yourself?
[00:02:07] Unknown:
Pete Goddard, the CEO of Deephaven. Very nice to meet you, and thanks for having me on. Yeah. And do you remember how you first got involved in the area of data? I'm an aero and astro engineer from college, and then I made a sideways step into the world of Wall Street and quantitative trading. I'm old enough that it was a time when that was a little less obvious of a path than it might be today. I've spent the last, certainly at least 20 years thinking about data as a very important driver for business and, you know, fundamentally working with teams on systems to try and derive value from that data.
[00:02:46] Unknown:
And so now you have started working on the Deephaven project and the Deephaven engine and business there. I'm wondering if you can describe a bit more about what that is and some of the functionality that you're looking to provide and some of the story behind how it came about and why this is the particular problem space that you wanted to spend your time and energy on?
[00:03:06] Unknown:
So, you know, Deephaven is a query engine that's built from the ground up to be excellent with real time data. When, you know, we think of real time data, we want it to be great with real time data by itself as well as in combination with static or maybe historical data for context. That query engine, probably its reason to exist hinges on a couple of things. First, its ability to unify streams and batches into the same concept and same process of, you know, working with data. And the second is an incremental update model that exists behind the scenes, which allows for really good stuff when you're working with data that changes.
So Deephaven is, you know, that engine built from scratch because we thought it was really important to us first and then to others. And then, secondarily, I'd say Deephaven is a word we use for the framework around that. The engine is, you know, in many ways, either different than or, I might immodestly suggest, ahead of some other engines that exist out there. So if you wanna make people productive with it, you have to take on the task of handling integrations as well as delivering user experience, because, you know, many of the tools that people are using, you know, aren't necessarily equipped for dynamic data. Most of them are organized in the direction of static data. So we've taken on both the engine and the framework work. Yeah. The project that came to mind first when I was starting to poke at the Deephaven documentation and figure out the use cases
[00:04:43] Unknown:
and where it is applicable was the work that's being done with Materialize, where they're pulling these streaming table updates off of Kafka queues and being able to provide a Postgres compatible wire format to be able to run these queries across dynamic and continually updating datasets. And I'm curious how you would characterize the Deephaven project and use cases in comparison to something like Materialize?
[00:05:10] Unknown:
It's a great question, and I appreciate the comparison. I think Deephaven is a bit less known than Materialize is. The short answer is, though, we've been doing this for a long time. Deephaven was first, let's say, a hedge fund that I founded that's still a very active quantitative hedge fund today, and we started building this in 2012 because we needed a data system that was good with both historical data and with real time data. So the idea of our update model that we started then and have evolved in the last 10 years is probably quite consistent with at least the concepts of Materialize's update model, in that, you know, it's founded on a couple of principles. The first is that tables are super powerful. They're very intuitive, and there's huge libraries and ecosystems and lots of people that understand what tables are all about. But in our case, we think of tables not only as batch and having static state, but really as a flow of deltas.
You know, our update model, at an API level and at an engine level, is tracking adds, modifies, deletes, and shifts such that many, many operations can be done in such a way that the incremental work to compute results is much, much smaller. This is really, really valuable both for delivering or for supporting complex use cases that have lots of steps, let's say, where you have incremental updates that are helpful every step along the way. But, also, you know, our technology, and this is the core, our technology makes it so that we have many people that are more or less Excel type of users. And all of a sudden, with little scripts, they can interact with real time data on their own. So for us, that's really an interesting part of the story, to be able to face a number of personas.
[00:06:58] Unknown:
As you mentioned, the foundational work that you started on that has become Deephaven began in this hedge fund that you were running. And I'm curious what it is about this technology and this problem domain that motivated you to extract this out into its own business and leave that hedge fund to focus so much on this project and this product to make it more broadly available?
[00:07:23] Unknown:
It's a valuable question. Many people in my family ask me it every year or two, trying to check in that I made the right decision there. The reality is, when we first started building this product for our own use back in, you know, late 2011 or 2012, we had some pretty simple needs, we thought. We were going from a high frequency business, which, you know, is its own kind of math and computer science. And, frankly, we were doing it in options, which is, you know, a very, very large universe compared to stocks. We wanted to take our team and do other math-y and computer science-y things with it that were also quite scalable.
So we said, well, we need a data system to do that. We don't need to be targeting really, really low latency. Right? Low latency today is FPGAs and microwave systems, submicrosecond turnaround. We're like, well, if you're gonna do something more scalable, you just need a generally good data system. You don't need that type of capability. So we wanted something that was general purpose. The formative questions of that really shaped a journey. Right? We said, well, what should the data system do? And we wanted a lot of people across the firm to use it: quants, data scientists, developers, but also, like, trader types and portfolio manager types. So think of that as BI types outside of capital markets.
We wanted to be great with historical and real time data. But, you know, we knew that table operations were gonna be important for all of this, but we also knew, you know, all the good stuff was gonna have other code. At the time, that was mostly Java and C++ to us. Now, you know, that means also Python and Python and Python, but also Rust and Go. And then we wanted user experiences. You know? We knew that as we faced the team, you know, at the time 100-plus people, they were gonna wanna see data and have roll ups and pivots that changed in real time and all of that type of stuff.
So, you know, we incrementally built those things for our own use because it was fundamental to driving the business. We only built it because there was nothing that existed out there. We wanted to buy it. We just didn't see it out there until we rolled our own. And to your question, we spun it out, you know, in late 2016 for, I guess, a few reasons. One, the engineers and I were just fascinated by it. Two, we felt like we had witnessed tremendous impact and were therefore quite bullish about its relevance beyond the hedge fund. And three, the timing was right, where I was really interested in software more than I was interested in buying low and selling high at that point in my life. So, you know, those three things, in combination with an engineering team that was just up for it and excited about it, you know, led to the founding of Deephaven as its own standalone thing, you know, the five year path we've been on since then. In terms of
[00:10:15] Unknown:
the core use cases that Deephaven is focused on and the usage and interaction patterns, I'm curious where you would place it in the overall stack or ecosystem of an organization's data platform. Like, do you expect that it will replace certain business intelligence use cases? Is it something that you might use in conjunction with or instead of a Kafka stream or something like ksqlDB or just curious what your framing is when you're talking to people who are coming from the data ecosystem to help them understand what the applicability of it is and some of the tools or systems that it will either augment or replace.
[00:10:56] Unknown:
I think my answer has me scratching my own head a little bit, because my answer to your question of which of these is mostly yes. I think it probably makes sense to understand that though we have many years servicing big Wall Street customers with an enterprise product, the product that we are investing in exclusively at this point, and evolving very actively, is our community product, our source available product. So I think it's really with that in mind that, you know, the one that is out there and available to all the people that are listening to this podcast is likely the one to talk about.
At its core, we believe that streaming tables are very important. We used to have to argue that real time data mattered, and then that phrase, which was overused, real time, right, just became messy. And then all of a sudden, Kafka blew up and Confluent blew up, and we don't have to make that argument. And people understand, like, oh, okay, streams are a thing. We all agree that streams matter. And I think a lot of people would agree it's growing in terms of its relevance. I would go as far as, say, like, 2027, 2030.
If dynamic data isn't at the front of how you think about data, we probably just see the world a little differently. So streams are important, but we think we have this concept of a streaming table. And that's really a construct out in the open that we think is something that is very powerful and that we hope to nurture to a point where it's ubiquitous, both as something that serves the Deephaven engine, but also something that is in support of many, many other applications. So the first investment we've made in going to the community is to deliver an open API that essentially uses Arrow Flight payloads to support a gRPC based package that describes tables that are changing: updating tables, streaming tables, if you like.
That's really the core piece. Once you have that core piece, again, we encourage others to explore it. We hope that we can build a community of data software developers around that. But then the next layer is to think about the Deephaven engine. And when you think about the Deephaven engine, yes, it is very reasonable to compare it to ksqlDB. I'd suggest if you have Kafka streams and you want to build applications or do analytics with them, use AI on them, Deephaven, in many cases, might be an easier or better or higher performance engine to use than ksqlDB, even though it's Kafka. Certainly, we are not intending to compete with the Kafka API, you know, and those event streams. We think our streaming tables are a complementary thing, and our data engine can certainly sit on top of a number of source technologies, Kafka just being one of them, Redpanda being another. But, also, you know, in the streaming world where I come from, Solace is relevant, you know, but then also proprietary APIs and vendor APIs with real time data really matter. Setting up web scrapers as sources for real time data or web applications.
These are all both direct sources that are interesting to Deephaven as well as indirect ones that are washed through Kafka.
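To make that open API idea concrete, here is a minimal sketch of a client process pulling one of those streaming tables over the gRPC/Arrow Flight transport. It assumes the pydeephaven client package and a Community server on localhost:10000 with a table named `trades` already bound in its script scope; the exact method names (`open_table`, `snapshot`) are assumptions about the client API, not a definitive reference.

```python
# Hypothetical sketch: pull a server-side streaming table into a client
# process over Deephaven's gRPC / Arrow Flight ("Barrage") transport.
# Assumes the pydeephaven client package; method names may differ by version.
from pydeephaven import Session

session = Session(host="localhost", port=10000)  # assumed default Community port

# Fetch a table that a server-side script has bound under the name "trades".
trades = session.open_table("trades")

# Materialize the current state as an Arrow table for local inspection;
# a full Barrage subscription would stream incremental updates instead.
snapshot = trades.snapshot()
print(snapshot.num_rows)

session.close()
```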
[00:14:19] Unknown:
And as far as the user personas for Deephaven, I'm curious how you think about the categories of interaction for this framework and this engine where, you know, I'm sure that you have data engineers who are interested in being able to use it to be able to do exploratory analysis and transformation of their data streams to be able to figure out where to ship it or what transformations to make. Obviously, data scientists will be excited to be able to use it for being able to execute their Python code on these data streams. But I'm just curious sort of who you think about as you're designing the different features and functionality and user experience patterns.
[00:15:00] Unknown:
It's a great question and something we've really tried to be mindful of, particularly over the last year as we rearchitected, rewrote, and modularized our code base to be something that we think is attractive to the community. Again, our history comes from one team, one dream type of thinking. Let's all get around one data store and one single source of truth for streams, and let's not at all tolerate false dichotomies. So let's not at all think data scientists and data developers should use different sources. Let's not at all tolerate thinking that analytics are different than applications. Let's just build stuff on top of this common place. That's really where we come from. But in bringing it to the community, we understand that simple matters and straightforward matters, and we want to have smaller building blocks for people to handle first. So in particular, we focused on the two distinct personas you suggested.
One is data developers, and the second is data scientists. Data scientists are a little bit easier to put together nowadays simply because there are some famous words and some famous patterns that seem to represent them quite well. The word Python seems to go there. AI seems to characterize a lot of them, even Pandas and NumPy. You know, these are sort of where communities coalesce and usage patterns for development. So when we think about Deephaven and its intersection with the data science community, we want to look at those tools.
We wanna understand those workflows, and we wanna make ourselves a valuable complement to the way things happen. So with features there, we think, oh, this needs to be easy to deploy as a Docker image, but you need to be able to deploy it locally. Right now, we're working on making it available just as a Python library. Certainly, our REPL, our exploratory experience in a browser, is amazing. It can do things that I think you would really, really find valuable. But we're working hard to make sure that the widgets for that are available in Jupyter, for example, so that you can have real time ticking tables in Jupyter notebooks. So with data scientists, we are more or less putting a full embrace around Jupyter, Python, you know, and AI libraries to make sure that Deephaven is integrated with the toolkits that they want. And in particular, one thing we're focusing on is real time AI. We think real time is super sexy. AI is super sexy in terms of words, but then, oh, hey, I have a lot of Kafka streams, and I want to, you know, do sophisticated stuff with them.
That's not so easy from an infrastructure point of view. We think that Deephaven makes that easy in combination with our streaming tables and a learn library that we built that really couples the capabilities of dynamic data in tables with Python libraries generally, but, you know, TensorFlow and PyTorch and scikit-learn specifically, through NumPy.
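As a concrete illustration of that pattern, here is a hedged sketch of coupling a ticking table to a scikit-learn model through NumPy inside a server-side Python session. The module paths (deephaven.time_table, deephaven.numpy) are assumptions about the Community API, and wiring the function into the update graph via the learn library is only described in a comment, not shown.

```python
# A hedged sketch of the "real-time AI" workflow: couple a ticking Deephaven
# table to a scikit-learn model through NumPy. Module paths are assumptions
# about the Community Python API; the learn library mentioned above would
# wire this into the update graph so it re-runs as rows arrive.
from sklearn.linear_model import SGDRegressor
from deephaven import time_table
from deephaven import numpy as dhnp

# A synthetic ticking source: one new row per second.
ticking = time_table("PT1S").update([
    "X = 0.1 * i",
    "Y = 2.0 * X + Math.random()",
])

model = SGDRegressor()

def train_on_current_state(t):
    """Pull the table's current rows into NumPy and partially fit the model."""
    data = dhnp.to_numpy(t, cols=["X", "Y"])
    if data.shape[0] > 0:
        model.partial_fit(data[:, [0]], data[:, 1])

# Calling it once illustrates the data handoff; in practice the learn library
# (or a scheduled job) would invoke it on each update cycle.
train_on_current_state(ticking)
```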
[00:18:10] Unknown:
So that's the data scientist persona. I'd be happy to talk about the data developer persona, but perhaps I should take a breath. The other interesting element of this pairing of personas is the question of collaboration and what are the interaction patterns between the data developer, who might be trying to use this for figuring out their transformations or executing the transformations or building out a library of tooling that the data scientists can use, and the data scientists, who are trying to build these models, understand the performance and the tuning of those model parameters, being able to do feature extraction, and then maybe being able to feed back to the data engineers and data developers what
[00:18:52] Unknown:
source systems they might wanna connect with or data models they need to be able to have available for powering those model development workflows. I mean, you have it exactly right. Imagine you're a business manager at a company, and that company might be a hedge fund. Right? You wouldn't tolerate much conversation about, like, oh, let's have, you know, these guys that work over here and these other people, you know, that work over here, and here's the APIs between them, and those all need to be supported. That would feel not right to you. It's always felt not right to us for a decade now. Right? So we've built Deephaven so that it's really sort of a pub sub mesh of these streaming tables, so that the work one person is doing is easily consumable by another person. And that doesn't just mean people. For your audience, it means a service. Like, we have workers. Right? Those workers have access to batch data and to streams.
There can be smart federation to make sure they only get the stuff that they should get. And then as you use Deephaven, the Deephaven table API, or arguably our Deephaven syntax for working with tables, table operations, or as you deliver Python or Java or another language to the code to do sophisticated things, you're doing it in such a way that you name tables. You know, you could think of it maybe analogously to having topics, you know, topics that you exhaust to Kafka. So all of these workers have named tables, and all of the named tables can be available to any other worker that wants to subscribe to them. So you create this mesh of lots and lots of different processes doing work and serving each other data in real time as it flows through a directed acyclic graph. Right? So this can support pipeline workflows. This can support parallel workflows. These can support complex workflows that combine the two, and all of it will update incrementally in real time.
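A hedged sketch of that mesh of named streaming tables: worker A publishes a ticking table by binding it in its script scope, and worker B subscribes to it by URI and keeps deriving. The URI scheme, the hostname worker-a, and the deephaven.uri.resolve call are assumptions about the Community API.

```python
# --- On worker A (a Deephaven server-side script) ---
# Binding a table to a variable in the script scope publishes it by name.
from deephaven import time_table
best_bid = time_table("PT1S").update([
    "Symbol = `AAPL`",
    "Bid = 100 + Math.random()",
])

# --- On worker B (a second Deephaven process) ---
# Resolve worker A's table by URI; it keeps ticking here as A's table ticks,
# and further operations simply extend the shared directed acyclic graph.
from deephaven.uri import resolve
remote_bid = resolve("dh+plain://worker-a:10000/scope/best_bid")  # assumed URI form
alerts = remote_bid.where("Bid > 100.5")
```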
[00:21:00] Unknown:
And as far as the architecture and system components that power these different aspects of the Deephaven experience, I'm wondering if you can talk through some of the technological underpinnings and some of the ways that you have approached the decisions about how this is structured to be able to support these use cases. And given the fact that you mentioned that you've just gone through a major rewrite, some of the ways that you have gained lessons from the initial earlier work that you've done and ways that the ecosystem of tooling and structure has been able to simplify your work of rebuilding the system.
[00:21:40] Unknown:
You know, maybe we could just start with some basics and see where the conversation goes. At its core, Deephaven is a Java application. Okay? It's a column oriented query engine, which probably doesn't surprise you given the feature and performance characteristics that I've talked about so far. You know, there are certainly Java experiences and Groovy experiences, and Scala experiences now on their way, but it's developed as a Python first experience. So that probably is the greatest standout in regards to the evolution of the last number of years, particularly as we've journeyed towards community: we really wanted all of that to feel very Pythonic, idiomatic, and to be, you know, naturally integrated with the ecosystem of libraries and tools that exist out there for Python. So one of the keys to that is a pretty substantial lift a couple of years ago to change the architecture to be array oriented at its lowest level, so that, both from a performance point of view, we can operate on data in chunks, which is vital for performance, but then also as one moves between languages. And, again, remember for us, our users are delivering, you know, code to the data, compiling it all down in the same, you know, process, and doing potentially much more sophisticated work than classic table operations.
Right? We wanted to make sure that as they did that, if they had to cross over, you know, language barriers, the cost of doing that was amortized over a great amount of data. So I think that was, you know, a really important bit of work, and all of it's done. Understand, as we're rearchitecting anything in the engine, we're really optimizing every single operation for every single data type. And the user can remain happily blind about whether the data is static underneath or dynamic underneath. They can use exactly 100% all the same stuff and be totally blind as to which is going on. Under the covers, as you can imagine, we've got a fairly optimized form for static sources versus sources that actually do have table updates going on.
So I think that array orientation was important. And as part of that, you know, considering how to deliver a nice integration with CPython and NumPy so that it's really first class. You know, thinking about how to integrate Numba so that, you know, to the extent that people want, you know, Numba accelerated processes to work, they can just do so. So I think that's a very important part of both our current capabilities and the journey that you're asking about.
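Here is a hedged sketch of what that Numba integration can look like in practice: a compiled, vectorized function referenced from a table-update formula, so the engine's chunked, array-oriented evaluation can feed it column data efficiently. It assumes the Deephaven Community Python API; details may vary by version.

```python
# Assumed API: deephaven.empty_table plus a Numba-compiled ufunc referenced
# from a query formula. Numba compiles weighted_mid rather than interpreting
# it per call, which fits the chunked evaluation described above.
from numba import vectorize, float64
from deephaven import empty_table

@vectorize([float64(float64, float64)])
def weighted_mid(bid, ask):
    # Compiled by Numba into a ufunc.
    return 0.4 * bid + 0.6 * ask

quotes = empty_table(1_000_000).update([
    "Bid = 100.0 + 0.01 * i",
    "Ask = Bid + 0.05",
    "Mid = weighted_mid(Bid, Ask)",  # Python function referenced by the query
])
```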
[00:24:28] Unknown:
In terms of the optimizations that you build for these different data types and being able to manage them in these arrays, I imagine that part of the array orientation is to be able to use the single instruction, multiple data capabilities of more modern CPUs. But to the data type question in particular, I'm curious how you think about constraining the available types so that you can provide these optimizations and balancing that with being able to support more complex or richer data types or data objects to be able to support the flexibility for the end user where maybe they want to use JSON objects or they want to use, you know, geometries for dealing with geospatial data or maybe they're dealing with, you know, some more complex data types that are domain specific and just some of the ways that you're able to balance the speed and optimization and first class support for data types with this flexibility and being able to manage that across these boundaries of streaming and static.
[00:25:28] Unknown:
You're really speaking towards where we're going. And these types of conversations excite me a lot, but I have to be probably modest in both, you know, accommodating your wisdom as well as potentially deferring to my team. So the data types that have typically fascinated our users are classic Java and Python data types. Right? The ones that you would think of, but also, you mentioned JSON. Obviously, that would be standard to support, and things of that nature. Another thing that is very, very important is date-times. We think Deephaven is relevant across a variety of industries. We think many of those industries will be interested in time as a fundamental thing. Wall Street thinks time is a fundamental thing. I mean, IoT data feels very time driven.
Real time gaming feels time driven. Health care telemetry feels like time matters. Right? So date-times have been a very important data type to support. And you can imagine there's quite a bit of infrastructure that goes into being able to support date-times and time series joins, even in a relational type of pattern, in a first class way. So we've done work on the data types that matter to us, some of the ones that you're talking about. For example, I wouldn't suggest we are first class in supporting geospatial data types, you know, in their full modern richness.
Right now, that would probably fall out as a POJO or something like that in our system. It could certainly be handled. You could certainly work on that object within Deephaven, because Deephaven fundamentally is just bring your code to the data. It's a server. You can make it work in Java or Python or wrap C++ or something like that. But for a use case like that, I wouldn't think we'd have optimized performance out of the box. And it would be really interesting for our team, or really, hypothetically, the team that moves forward with Deephaven as a contributing group, to receive some direction from the community about this being a priority and try and optimize other data types, you know, with those use cases in mind that you suggested.
[00:27:34] Unknown:
And another interesting area is that because your kind of foundational primitive of the interaction is this table structure, people who are familiar with working with databases are going to have experience using user defined functions for being able to push functionality down into the database engine rather than having to pull the information out, process it, and push it back. But given the fact that the primary interaction pattern for Deephaven is code native, I'm curious how that influences the way that you think about what is a user defined function that needs to get pushed down into the engine to live closer to the data and closer into the sort of memory space of that server, versus being able to execute in the sort of user level, user land, where they're running their Python code with their libraries and dependencies or what have you? You know, we support
[00:28:30] Unknown:
both patterns, and I think they would feel pretty natural. Right? So, I mean, in some respects, Pandas is not an entirely dissimilar model, right, where you can push user defined functions towards the tables and the server just operates directly on the data. There are many patterns that are very important to data developers. I don't wanna accidentally starve the conversation of the data developer persona. We care about them every bit as much, where, you know, either they're writing a Java client or a Python client or a JavaScript client, and we really have an amazing JavaScript API, and they want to interact with the server from there, or they wanna use, you know, a more declarative QST or something like that, where they're, you know, delivering code to the server and getting back results. So I think we are mindfully trying to support both usage patterns, you know, across a few languages.
[00:29:28] Unknown:
In terms of that cross language support and the fact that you are working across batch and streaming and working to support machine learning workloads, I'm curious, what are some of the impedance mismatches that you've had to deal with and some of the ways that you're able to kind of sand off the sharp edges of that experience so that you can support such a broad range of applications?
[00:29:51] Unknown:
Yeah. I mean, so I think the biggest challenge in approaching the broad community, okay, is an expectation that comes from SQL, or even specifically OLTP and transactional SQL. Oftentimes, when you have, you know, a data engine that has table operations as a core competency that people want, you know, there's a natural hope, I think, that SQL will be supported and that transactionality will be fundamental. We have a much more Kafka like approach in regards to consistency and transactionality. We think that for many use cases, both for OLAP, but also just, you know, even just for general feeds and data driven application development.
You know, we think, you know, that model of consistency, you know, is sufficient. So it'll be interesting to see where the community wants us to go in regards to integration with SQL. We certainly have ideas about how to map SQL's, you know, select and update to our table API. But we wanna be careful in doing that such that we're not, you know, mismanaging expectations about the type of data structure that exists behind that. I think other than SQL, we found pretty smooth integrations with a lot of the tooling that exists out there. As you know, almost all of it is made only for batch data. So, oh, if somebody wants to work with Deephaven from R, we've delivered solutions where, you know, you can snapshot an updating table every n seconds into an R data frame that now all of your R code is gonna work on. Right? And that's analogous to, oh, I really wanna work in Pandas. I have all this stuff that works in Pandas.
Great. You can use Deephaven as, I don't know, like a transformation engine that then feeds your Pandas code or something like that, or something that simply just joins a couple of Kafka streams in real time and then feeds that, or does predicate pushdown on your Parquet files and then joins them with real time streams from a web application or an IoT device and then delivers them to your Pandas tables or something like that. So we've tried to understand the constructs that people rely on and the tools that they rely on and be as interoperable as possible, while at the same time trying to also champion the view that, look, data changes, and streaming tables are an interesting way to think about it. And, hey, data software world, here's this API called Barrage that we hope you'll find interesting to consider in regards to communicating dynamic tables across the wire.
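A hedged sketch of the "snapshot an updating table every n seconds and hand it to pandas" workflow mentioned above, assuming the Deephaven Community Python API; the module path deephaven.pandas and the snapshot_when method name are assumptions rather than a definitive reference.

```python
# Assumed API: time_table, snapshot_when, and deephaven.pandas.to_pandas.
from deephaven import time_table
from deephaven import pandas as dhpd

live = time_table("PT1S").update(["Price = 100 + Math.random()"])

# A five-second trigger; each time it ticks, the snapshot refreshes with the
# then-current contents of `live`. Renaming avoids a column-name clash.
trigger = time_table("PT5S").view(["SnapTime = Timestamp"])
periodic = live.snapshot_when(trigger)

# Convert the latest snapshot into a pandas DataFrame for downstream code.
df = dhpd.to_pandas(periodic)
print(df.tail())
```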
[00:32:38] Unknown:
StreamSets DataOps platform is the world's first single platform for building smart data pipelines across hybrid and multi cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end to end data integration platform that's built for constant change. Amp up your productivity with an easy to navigate interface and hundreds of prebuilt connectors, and get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all of your data pipelines, the full transparency and control you desire for your data operations.
Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' Professional Tier will receive 2 months free after their first month. In terms of the structure of the tables, one of the perennial problems when you're dealing with any sort of structured schemas is evolution and changes in the source data systems. And I'm curious what types of machinery you have available for being able to surface alerts or errors in the event that the schema of a Kafka stream changes, and so now instead of it being an integer, it has evolved to a float, or from a float to an integer, and so now the computation that you're using to create a derived table is not accurate, or it's starting to error out, or, you know, it has, you know, gone outside of a certain standard deviation from what it had been. And I'm just curious some of the ways that you're able to integrate some of these kind of data quality checks into the execution of these real time tables.
[00:34:28] Unknown:
It's an important question that you're asking. On the community side today, what would happen is you would error out, because, again, what we've tried to deliver in the community side is simpler building blocks that people can understand, and we wouldn't wanna obfuscate something sophisticated like what you just suggested there. We certainly have experience. There are solutions that exist in our enterprise product. They're being used by the big hedge funds and banks out there. Many of them use Deephaven as their full data life cycle management system. Right? And they're doing that for taking real time feeds. They're transforming and merging it into historical files. They're doing all sorts of data validation and data cleaning as part of that exercise. And they're doing it, you know, as I suggested, both in real time as well as, oh, they just inherited a batch from some vendor or something like that.
So from our perspective, that is, you know, logic that is introduced on top of the lowest layer, not something that's fundamental to the lowest layer today. The lowest layer obviously wouldn't be happy to the extent that a data type changed, and the community version would error out at this point.
[00:35:34] Unknown:
In terms of getting onboarded into Deephaven or running the community edition yourself, I'm wondering if you can talk through some of the setup and infrastructure that's necessary to support it, but more interestingly, the work of integrating various data sources and then being able to feed that into different downstream systems. So just being able to figure out what is the scope of data sources that you can work with and consume from, and some of the ways that you might build additional experiences or logic or analyses that are being powered by the streaming computation that Deephaven provides?
[00:36:13] Unknown:
Easiest way to get going is to just, you know, download a Docker image and, you know, launch locally or in the cloud. We make several available, in Python and Java, essentially, but also sort of with or without various AI packages. Again, you know, integrating with machine learning libraries that exist out there is fundamental to some users, but other users would find that gear unnecessary and heavy. So we don't want to make that part of the image. There are other patterns for deployment, but I think that is the fastest one to get going. In regards to accessing data, there's a suite of integrations that we think will be fully supported within a couple of months. So we started with, again, we have experience with sort of the whole range, but we're porting it in smart ways to community so that it's very modularized and plays well with our gRPC based API. So in particular, we focused first on Parquet and Redpanda and Kafka as both ingest and exhaust. We support change data capture ingestion. So, you know, some of your listeners have heard people talk about, you know, integrations with Debezium and things like that, and that would work. We just wrote from scratch, we think, a very compelling CSV parser in Java. We wrote one because we needed better performance than the Apache Commons version that was out there, and we needed to support type inference in a first class way, where some of the other, you know, good performing Java CSV parsers did fine, but they didn't have the inference. So it's pretty much one line of code to run for a file or a CSV, or we have a resolver where you could get data from a web source or something like that very easily in one line of code.
Parquet or Redpanda, as you know, is probably 4 or 5 lines, where more or less you're just having to configure, you know, source information and topic details and things of this nature. In regards to exhaust, it's mostly, you know, a mirror image of that, I would say, other than exhausting CDC isn't something that makes a lot of sense to us there right now. But importantly, on the exhaust side, we think one of the really valuable things to consider is sending it from one Deephaven worker to another, from one Deephaven process to another Deephaven process, as I suggested, over this Arrow Flight compatible API that we have. And then in regards to consuming the data, there's sort of application consumers and eyeball and finger consumers. Right? And for the eyeball and finger kind, there is a very rich open API that supports all of our clients and their client APIs. And one of the client APIs that sits on top of that is a JavaScript API.
And on top of that, we have, you know, a very rich environment, you know, written in React, for exploring data, for inheriting tables in real time, for seeing the work product of what I call this, you know, pub sub of streaming tables, you know, as one Deephaven process talks to another. It does many important things that you would expect in an analytical interface. Oh, I wanna play with tables. I wanna filter things. I wanna create new columns here from the UI without doing anything else. Or I wanna link one table to another and double click and have fancy things happen. It's quite rich in that regard. And, again, it's all engineered, even within a browser, to support data that changes.
So real time data ticking in, you know, and data at scale. We just put a blog post out there about how we can, you know, the blog post was rendering, I think the number was, a quadrillion rows in the browser. A pretty atypical number, but it also is not something that other browser grids can support. So we work on a number of these problems.
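A hedged sketch of the ingest patterns Pete describes above (CSV in roughly one line, Parquet and Kafka or Redpanda in a handful), assuming the Deephaven Community Python modules; the file paths, URL-free examples, and broker address are hypothetical, and exact signatures may differ by version.

```python
# Assumed Deephaven Community Python API for ingest.
from deephaven import read_csv, parquet, dtypes as dht
from deephaven import kafka_consumer as kc

# CSV: roughly one line, with the type-inferring parser doing the work.
trades_csv = read_csv("/data/trades.csv")  # hypothetical path

# Parquet: another one-liner that yields a static table.
history = parquet.read("/data/trades.parquet")  # hypothetical path

# Kafka (or Redpanda): a few lines of source and topic configuration yield a
# ticking table that downstream table operations can subscribe to.
live = kc.consume(
    {"bootstrap.servers": "redpanda:9092"},  # hypothetical broker address
    topic="trades",
    key_spec=kc.KeyValueSpec.IGNORE,
    value_spec=kc.json_spec([("symbol", dht.string), ("price", dht.double)]),
)
```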
[00:40:00] Unknown:
Yeah. The browser interaction is definitely something that's interesting because, as you said, it's not something that most people are focused on. If you're gonna try and load data into a browser, you're going to try and condense it and figure out, you know, what are the useful aggregates so that I can downsample this information to render it to the end user, because somebody who's eyeballing it isn't going to want to look at all quadrillion rows of data. They are going to care about what is the actual aggregate information, but it's definitely valuable to be able to feed that all to the browser without completely crushing the user's laptop. For sure. I mean, it would be silly of me to suggest that we're sending a quadrillion
[00:40:35] Unknown:
rows and any number of columns to the browser. Right? You know, it's just the smart viewport support that's going on there, and that extends not just to these tables that are changing in real time, but, you know, you could imagine a pivot view or a roll up or an aggregation that has real time data underneath it. You know, that can be a pretty sophisticated problem from a UI integration perspective. So our customers have asked for these things, and we've delivered them. And now we have, we think, upgraded many aspects of them and delivered them to the community out there on GitHub.
[00:41:08] Unknown:
You mentioned at the beginning that you have a source available licensing model, and I'm wondering if you can speak to the decision process that went into choosing a license and some of the ways that you think about the governance model and the boundaries between the community edition and the enterprise edition. It's a fair question. A couple things just to remind you of going in here are that I am probably less expert at that question than you are, and, also, we have a bit of a strange history, right, in that
[00:41:37] Unknown:
we went from being an internal product to an enterprise product to a community first product. We really are fully open and fully committed to community development. So to your question, when we made that decision, and there's a number of stakeholders you need to convince at a company like ours to make that decision, we wanted to lean as heavily as we could into the spirit of open source while satisfying the different views and different priorities of those stakeholders. So we went to the point that felt exactly right at that moment in time, which was, this is new to Deephaven.
Let's operate in as good a faith as possible. Let's make it as simple as possible to understand. Let's make sure that we can be 100% committed in spirit. And so what we did is we looked at all the licenses out there. We felt, as many of the other cloud companies or many of the other data infrastructure companies had determined, that we should maybe protect a tiny percent of use cases, because otherwise it might compromise our ability to invest in the product, and we wanted to invest in the community product. So then we read everyone else's licenses, and our assessment was that if it needed an FAQ, that was bad. Like, we wanted the license to stand on its own without an explanation.
So we wrote our own source available license for the engine. And this is just the core engine; we'll talk about other open source projects that we support that are under a different license. But for just the core engine, we prohibited exactly one thing. We made it very technical. It has the word schema in it. It doesn't have anything to do with businesses, and we wanted to make obvious to any developer reader or any business reader the tiny sliver of the world that was not supported in the license. And we think in doing that, we service a huge community, and we look to partner with them, and we look for them to direct the product.
[00:43:31] Unknown:
And you mentioned that beyond just the core engine, you have a number of other open source projects that form the constellation of the whole experience. And I'm curious what you've selected as the license for those. And then just broadly across that constellation of projects, how you think about the governance model and the long term sustainability of the ecosystem?
[00:43:53] Unknown:
We not only have a few other projects of our own, we participate in other big projects, and we sort of pair on some important but lesser known projects. So in all cases, they're, you know, OSI compliant licenses. Most of them are Apache. The ones that we control that are not Deephaven Core, which, again, we just spoke of a moment ago, are Apache. When we think about code that right now our people are the sole contributors of, we think of it as, let's be as modular as possible and really think about delivering this software in compartmentalized ways where it can be valuable to others. So one of the projects we put out there, for example, standalone with an Apache license, is called WebUI.
Right? And this is our JavaScript React application as a standalone with all the good stuff I described before, but you could have it served by any of a number of things, including many data engines that somebody might describe as competitive to ours or at least swimming in the same pool as we do. We put that out there in good faith because we know it serves the community to potentially be collaborating both with the community and with those other providers in making a better JavaScript, you know, experience for exploring data. That's an example of something we control. Another project that we're very active in, but that's much more in partnership than exclusive to us, is jpy, which is, you know, a bidirectional bridge between Java and Python, per the architecture I talked about a few minutes ago. You can imagine how that's very important.
And so there's a company over in Europe that started that project, and we got quite involved. The two teams form most of the contributing group there. And then, of course, the headliner here is probably the one I've mentioned several times now, which is our Barrage API. And, frankly, though it's its own project, we see it as a tail on the dog of Apache Arrow and Apache Arrow Flight, and the compatibility with that and the accommodation of everything Arrow and Arrow Flight is just a fully embraced first principle of Barrage. It's a defining principle of Barrage. And so, you know, that license, that governance, certainly, to us, feels like one that is always gonna be very aligned with Apache Arrow.
[00:46:18] Unknown:
In terms of the applications of Deephaven that you have built or that you have seen others build, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:46:28] Unknown:
Most of my experience, and I am anxious for this to change, revolves around, you know, the things that banks and hedge funds and stock exchanges and other capital market players do. In that space, it really runs the gamut. There are users of Deephaven that only use it for signal farming of static data in Python. In the world of everything Pythonic out there, they find that Deephaven is very important for this because it brings the team together, and it services AI, time series, and relational use cases all in one. You know, at the other extreme, there's some very large, important capital market players that use Deephaven in the critical path of trading. So you can think of order management. You can think of real time pre trade risk or pre trade compliance and surveillance, or for algorithmic trading.
Maybe you have something. You have an order management system that's sending stuff in at, you know, submillisecond latencies, but your signals are gonna change every second or every minute or every 10 minutes or something like that. And Deephaven would be very relevant for being that second system, that general purpose system of feeding in new signals. Those are all automated sort of application use cases, but there's also many use cases where somebody's eyes or fingers are involved, where, you know, somebody is doing some sort of trading that combines a robot and their own intuition and reaction to things. So they're setting parameters on a screen via Deephaven input tables, and then automated trading is happening in the background. Right? Sort of combining, again, you know, Deephaven's code goes to data. So, you know, this idea of, I have a few different processes that pipeline to create a workflow like that, is pretty straightforward.
In bringing it to the community, we're very interested in where this goes. You know, in particular, we think there's many use cases that exist at the intersection, you know, real time AI, you know, that Deephaven can serve. There are people that are building, you know, recommendation algorithms, you know, clickstream applications. You know, it's just real time Kafka feeds coming in, you know, business logic or machine learning code, and then, you know, tables and table updates that are then exhaust out the other side for any of a variety of consumers across our API. So I think those are examples. But in the community version, we really like ideas that are fun and simple. So, for example, someone just created a toy. Like, it was crazy easy for them to build where, you know, within Deephaven, they listen to the Twitter API in real time for today's Wordle of the day. Have you heard of Wordle? I've heard references to it, but I haven't actually tried it out. They listened to today's Wordle of the Day on Twitter.
They just see a bunch of pictures of squares with different colors. And in real time, you can crowdsource what the Wordle answer of the day is. Not to cheat, but just to prove that, hey, this is cool, that you just literally have to listen to the power of the masses over the course of a couple of minutes. And without anyone telling you a letter, you can know definitively what the word of the day is with logic that, you know, I would think any well organized high school student that was motivated could deliver. That's a pretty compelling stack to put together in one little application, so we really like that. But IoT, crypto, blockchain, they have tons of real time data that matters. You know, I love the idea of real time sports. So I'm a sabermetrics guy, and I just would love Major League Baseball to open up the fire hose of data coming off those cameras that are in all the stadiums right now. It's an exciting world out there, and streams are an important part of it. Streams in the context of static data is important, and we think streaming tables are a really interesting way to
[00:50:25] Unknown:
deliver applications and analytics in that world. In your experience of working on this project and this problem domain for so many years and turning it into a business and now a community oriented project? What are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:42] Unknown:
We've learned a lot of lessons along the way. I think we were a little internally focused at a time when Python was not a driving energy within our company to the same extent it was out in the world. So we were a little late in reacting to that, and we've been very focused on software, and that has been somewhat in contrast to other, you know, potentially better, more maturely funded companies that have been focused both on software and cloud delivery. So I think it's been interesting to arrive in the last couple of years, particularly as we were very focused on community, to try and parse where software stops and where cloud starts and really how that should best work and what customers and users really want there. So there are certainly some, for example, compelling and really huge and famous cloud solutions that are very innovative for their cloud capabilities and their enterprise suite of integrations that we think are not as forward leaning just in regards to the core software and what the core engine can do and the performance on single threads and the handling of real time data, things that we think are important to us. So there's this dichotomy or this interesting puzzle that's pretty multidimensional that I think, you know, we're approaching really with curiosity, and we're hoping that the community can inform the product as to how to handle all of that. One thing that we think we've been lucky on is, though we faced enterprise for a number of years before we became focused on open, we thought we were lucky that we faced very sophisticated teams.
They weren't just using the product; they were evolving it. They were demanding about this thing and that thing. Maybe we just got lucky that they were sophisticated, but because they were good and they were open in their directions, we think the product moved in a very modern and very important way, though a unique way. That was exciting and helpful. And it's been kind of interesting that, now that we're open and all we're thinking about is interoperability and extensibility, as we look out in the world, so many of our architectural principles are consistent with some of the other important players that are considered revolutionary and really interesting. It's like, oh, nice, maybe we made some pretty okay decisions, because our architecture is lining up with what the world seems to want, even though we weren't engaged in open source in the last 4 years.
[00:53:24] Unknown:
For people who are interested in being able to build analyses or transformations on real time data, or to merge across streaming and batch systems, what are the cases where Deephaven is the wrong choice, and maybe you're better suited with something like Kafka or Pulsar or some of these streaming cloud data lake engines?
[00:53:47] Unknown:
So I think there are 2 cases. The first is if you have a lot of legacy and you want to service it all, and it doesn't play well with a new transport technology. People still make this bet, surprisingly. We very much embrace open formats, but we see this in the capital markets with customers: oh, I have a closed format, and there are famous ones. Can we, the customer, write a parser for that format and deliver it to Deephaven? And then can we write a shim layer between the applications that typically face that other thing, to now have them face Deephaven and take advantage of all the downstream coolness of Deephaven?
It starts to feel pretty squishy about whether that's right. If you have a legacy system where it's, oh, you just put updates on Kafka, and now you get Deephaven, that's a totally different animal. But if you really need to get into the guts of a legacy system and build a bunch of custom transformers or communication systems, that starts to feel pretty tough. The other case where there's at least a conversation to have about whether Deephaven is the right fit is if transactionality is the defining characteristic of your workflow: if you really are mostly OLTP, the analytics is a small afterthought, and all of your contemplation of those analytics is going to be thinking about the transactionality of the data even when you're analyzing it. Going back to your earlier question, this is 1 of the ways in which we contrast with Materialize. With Materialize, it feels like there's Cockroach in regards to OLTP, and then if you want an incremental update of that, that's where Materialize is, and they are very focused on transactionality.
We tend to think a good proxy is: the closer you are to transactionality, the farther you are from really cool math and some of the sophisticated AI stuff going on. In those cases, Kafka might not be your answer either. The consistency model that Kafka embraces is of the same order of magnitude as what Deephaven does, so if Kafka is relevant, I would think Deephaven is quite relevant. But if you have a lot of legacy code, or you're really mostly about transactions and that's foremost in how you think about data, then I think our gear is still relevant, but you might want to think harder about it.
[00:56:14] Unknown:
As you continue to build out the Deephaven product and the business and invest in these community offerings, what are some of the things you have planned for the near to medium term, or areas that you're particularly excited to dig into?
[00:56:30] Unknown:
The most immediate priority is just making delivering Deephaven as a library a very elegant experience, particularly in Python and in Java. We want any such client or application to just inherit the goodness of the Deephaven engine, with all of the deployment that you would expect. That's very important. We have also invested over the last many months, and are testing now, some very cool infrastructure for plug-ins for Deephaven. We're using 1 word, plug-ins, for both the server side plug-ins and the JavaScript client plug-ins, such that it should be quite straightforward to extend Deephaven for many tools that are important. For example, 1 of the driving catalysts for this was that, though they're engineered for static data, we knew there are many cases where somebody's working in Deephaven and seeing all these real time visualizations.
They're doing real time exploration, but then they want to use matplotlib, and they just want matplotlib to render in our exploratory UI. Instead of engineering that specifically, we built a general form version of it: we tried to service it in general form through plug-ins, with 1 specific wiring for matplotlib. So we think plug-ins are very, very important. We are also investing heavily in clients across our languages to make sure that they're first class in using the Barrage API for getting real time table updates and publishing real time or streaming tables to the server.
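To sketch the matplotlib case Pete mentions: with the matplotlib plug-in installed, an ordinary figure built in the server-side Python session can be rendered by the web UI. The ticking table setup and module paths below are illustrative assumptions, not the plug-in's documented interface.

```python
# A minimal sketch (assumed module paths; assumes the matplotlib plug-in is installed)
# of rendering a matplotlib figure built from a Deephaven table in the exploratory UI.
import matplotlib.pyplot as plt
from deephaven import time_table
from deephaven.pandas import to_pandas

# Hypothetical ticking table of simulated values.
ticking = time_table("PT1S").update(["Y = Math.sin(0.1 * ii)"])

# Take a static snapshot of the last 100 rows and plot it with plain matplotlib;
# the plug-in is what lets the resulting figure show up in the web console.
df = to_pandas(ticking.tail(100))
fig, ax = plt.subplots()
ax.plot(df["Y"])
ax.set_title("Snapshot of a ticking Deephaven table")
```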
2 more things. We're continuing to evolve and prove out use cases of our learn library, which is more or less the handshake between Python machine learning modules and our Deephaven streaming tables, so that real time AI is very easy. That is fully delivered, but we're investing in example use cases of it and other battle hardening. And then the last thing is somewhat speculative: we have this idea around many different widgets for streaming tables, where it's very easy to publish them in a lightweight way and very easy to consume them in a lightweight way, and we think that may open up a whole world of ideas.
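To make the learn handshake concrete, here is a rough sketch of wiring a Python function to columns of a ticking table via deephaven.learn. The gather/scatter helper names and exact signatures are assumptions based on the library's general shape and may differ by version.

```python
# A minimal sketch (assumed module paths and signatures) of the deephaven.learn
# handshake: gather table columns into NumPy, run a Python "model", scatter results back.
import numpy as np
from deephaven import learn, time_table
from deephaven.learn import gather

# Hypothetical ticking source table.
source = time_table("PT1S").update(["X = 0.1 * ii", "Y = Math.sin(X)"])

def model(features):
    # Stand-in for a real ML model: predict Y from the first feature column.
    return np.sin(features[:, 0])

def table_to_numpy(rows, cols):
    # Gather the requested rows/columns into a 2-D NumPy array of doubles.
    return gather.table_to_numpy_2d(rows, cols, np_type=np.double)

def numpy_to_column(data, idx):
    # Scatter one prediction back into the output column.
    return data[idx]

# The result is a table with a new 'Pred' column that keeps updating as 'source' ticks.
result = learn.learn(
    table=source,
    model_func=model,
    inputs=[learn.Input(["X"], table_to_numpy)],
    outputs=[learn.Output("Pred", numpy_to_column, "double")],
    batch_size=100,
)
```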
[00:58:48] Unknown:
So are there any other aspects of the work that you're doing at Deephaven, or the overall space of streaming and batch data or streaming data analytics, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:00] Unknown:
I think we spent quite a bit of time on many of the things that are important to us. When we think of the project, we get very excited about a single data engine and its interoperable framework being extremely relevant for a data driven developer as well as a classic data scientist persona, whether they're building AI applications or doing analytics. We look forward to engaging with the community around the product and around those topics,
[00:59:30] Unknown:
and to seeing where all of these innovations that people are putting out there might go. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:50] Unknown:
My perspective on this is that the biggest gap is around somewhat singular solutions that, under the covers, put many modern tools together. Deephaven has already thought about this in terms of stream and batch: let's put them together under 1 solution. But 1 could even think of compute, storage, and networking all having very innovative solutions, and of trying to put them together in a turnkey and easy fashion. In many cases, we understand cloud innovation, and we understand the options that are available to developers and DevOps people who can configure selections to deliver the solutions they want. But we think general interoperability and ease of use around all of these respective themes, in an integrated fashion, is really where tremendous opportunity lies.
[01:00:49] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Deephaven. It's definitely a very interesting project and product, and a very challenging space to operate in, so I appreciate all the time and effort that you've put into making it more accessible and more tractable. Thank you again for the time, and I hope you enjoy the rest of your day. Well, I enjoyed the time with you. It was time well spent. Thank you so much. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Pete Goddard Begins
Overview of Deephaven and Its Functionality
Comparison with Materialize
Motivation Behind Deephaven
Core Use Cases and Ecosystem Placement
User Personas and Collaboration
Architecture and System Components
Cross-Language Support and Impedance Mismatches
Handling Schema Evolution and Data Quality
Getting Started with Deephaven
Licensing and Governance Model
Applications and Use Cases
Lessons Learned and Challenges
When Deephaven is the Wrong Choice
Future Plans and Exciting Areas
Closing Remarks