Summary
Streaming data sources are becoming more widely available as the tools to handle their storage and distribution mature. However, it is still a challenge to analyze this data as it arrives, while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static, without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven, shares his journey with the technology that powers the platform and how he and his team are pouring their energy into the community edition so that you can use it freely in your own work.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
- Your host is Tobias Macey and today I’m interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Deephaven is and the story behind it?
- What is the role of Deephaven in the context of an organization’s data platform?
- What are the upstream and downstream systems and teams that it is likely to be integrated with?
- Who are the target users of Deephaven and how does that influence the feature priorities and design of the platform?
- How do Deephaven's use cases and user experience compare with Materialize?
- What are the different components that comprise the suite of functionality in Deephaven?
- How have you architected the system?
- What are some of the ways that the goals/design of the platform have changed or evolved since you started working on it?
- What are some of the impedance mismatches that you have had to address between supporting different language environments and data access patterns? (e.g. batch/streaming/ML and Python/Java/R)
- Can you describe some common workflows that a data engineer might build with Deephaven?
- What are the avenues for collaboration across data roles and stakeholders?
- What went into the licensing choice and governance model?
- What are the most interesting, innovative, or unexpected ways that you have seen Deephaven used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Deephaven?
- When is Deephaven the wrong choice?
- What do you have planned for the future of Deephaven?
Contact Info
- @pete_paco on Twitter
- @deephaven on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Deephaven
- Materialize
- Arrow Flight
- ksqlDB
- Redpanda
- Pandas
- NumPy
- Numba
- Barrage
- Debezium
- JPy
- Sabermetrics
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data. So, Pete, can you start by introducing yourself?
[00:02:07] Unknown:
Pete Goddard, the CEO of Deephaven. Very nice to meet you, and thanks for having me on. Yeah. And do you remember how you first got involved in the area of data? I'm an aero and astro engineer from college, and then I made a sideways step into the world of Wall Street and quantitative trading. I'm old enough that it was a time when that was a little less obvious of a path than it might be today. I've spent the last, certainly at least 20 years thinking about data as a very important driver for business and, you know, fundamentally working with teams on systems to try and derive value from that data.
[00:02:46] Unknown:
And so now you have started working on the Deephaven project and the Deephaven engine and business there. I'm wondering if you can describe a bit more about what that is and some of the functionality that you're looking to provide and some of the story behind how it came about and why this is the particular problem space that you wanted to spend your time and energy on?
[00:03:06] Unknown:
So, you know, Deephaven is a query engine that's built from the ground up to be excellent with real time data. When, you know, we think of real time data, we want it to be great with real time data by itself as well as in combination with static or maybe historical data for context. That query engine, probably its reason to exist hinges on a couple of things. First, its ability to unify streams and batches into the same concept and same process of, you know, working with data. And the second is an incremental update model that exists behind the scenes, which allows for really good stuff when you're working with data that changes.
So Deephaven is, you know, that engine built from scratch because we thought it was really important to us first and then to others. And then, secondarily, I'd say Deephaven is a word we use for the framework around that. The engine is, you know, in many ways, either different than or, I might immodestly suggest, ahead of some other engines that exist out there. So if you wanna make people productive with it, you have to take on the task of handling integrations as well as delivering user experience, because, you know, many of the tools that people are using, you know, aren't necessarily equipped for dynamic data. Most of them are organized in the direction of static data. So we've taken on both the engine and the framework work. Yeah. The project that came to mind first when I was starting to poke at the Deephaven documentation and figure out the use cases
[00:04:43] Unknown:
and where it is applicable was the work that's being done with Materialize, where they're pulling these streaming table updates off of Kafka queues and being able to provide a Postgres compatible wire format to be able to run these queries across dynamic and continually updating datasets. And I'm curious how you would characterize the Deephaven project and use cases in comparison to something like Materialize?
[00:05:10] Unknown:
It's a great question, and I appreciate the comparison. I think Deephaven is a bit less known than Materialize is. The short answer is, though, we've been doing this for a long time. Deephaven was first, let's say, a hedge fund that I founded that's still a very active quantitative hedge fund today, and we started building this in 2012 because we needed a data system that was good with both historical data and with real time data. So the idea of our update model that we started then and have evolved in the last 10 years is probably quite consistent with at least the concepts of Materialize's update model, in that, you know, it's founded on a couple of principles. The first is that tables are super powerful. They're very intuitive, and there's huge libraries and ecosystems and lots of people that understand what tables are all about. But in our case, we think of tables not only as batch and having static state, but really as a flow of deltas.
You know, our update model, at an API level and at an engine level, is tracking adds, modifies, deletes, and shifts such that many, many operations can be done in such a way that the incremental work to compute results is much, much smaller. This is really, really valuable both for delivering or for supporting complex use cases that have lots of steps, let's say, where you have incremental updates that are helpful every step along the way. But, also, you know, our technology, and this is the core, our technology makes it so that we have many people that are more or less Excel type of users. And all of a sudden, with little scripts, they can interact with real time data on their own. So for us, that's really an interesting part of the story, to be able to face a number of personas.
[00:06:58] Unknown:
As you mentioned, the foundational work that you started on that has become Deephaven began in this hedge fund that you were running. And I'm curious what it is about this technology and this problem domain that motivated you to extract this out into its own business and leave that hedge fund to focus so much on this project and this product to make it more broadly available?
[00:07:23] Unknown:
It's a valuable question. Many people in my family ask me it every year or two, trying to check in that I made the right decision there. The reality is, when we first started building this product for our own use back in, you know, late 2011 or 2012, we had some pretty simple needs, we thought. We were going from a high frequency business, which, you know, is its own kind of math and computer science. And, frankly, we were doing it in options, which is, you know, a very, very large universe compared to stocks. We wanted to take our team and do other math-y and computer science-y things with it that were also quite scalable.
So we said, well, we need a data system to do that. We don't need to be targeting really, really low latency. Right? Low latency today is FPGAs and microwave systems, submicrosecond turnaround. We're like, well, if you're gonna do something more scalable, you just need a generally good data system. You don't need that type of capability. So we wanted something that was general purpose. The formative questions of that really shaped a journey. Right? We said, well, what should the data system do? And we wanted a lot of people across the firm to use it: quants, data scientists, developers, but also, like, trader types and portfolio manager types. So think of that as BI types outside of capital markets.
We wanted to be great with historical and real time data. But, you know, we knew that table operations were gonna be important for all of this, but we also knew, you know, all the good stuff was gonna have other code. At the time, that was mostly Java and C++ to us. Now, you know, that means also Python and Python and Python, but also Rust and Go. And then we wanted user experiences. You know? We knew that as we faced the team, you know, at the time 100-plus people, they were gonna wanna see data and have roll ups and pivots that changed in real time and all of that type of stuff.
So, you know, we incrementally built those things for our own use because it was fundamental to driving the business. We only built it because there was nothing that existed out there. We wanted to buy it. We just didn't see it out there until we rolled our own. And to your question, we spun it out, you know, in late 2016 for, I guess, a few reasons. One, the engineers and I were just fascinated by it. Two, we felt like we had witnessed tremendous impact and were therefore quite bullish about its relevance beyond the hedge fund. And three, the timing was right, where I was really interested in software more than I was interested in buying low and selling high at that point in my life. So, you know, those three things, in combination with an engineering team that was just up for it and excited about it, you know, led to the founding of Deephaven as its own standalone thing, you know, the five year path we've been on since then. In terms of
[00:10:15] Unknown:
the core use cases that Deephaven is focused on and the usage and interaction patterns, I'm curious where you would place it in the overall stack or ecosystem of an organization's data platform. Like, do you expect that it will replace certain business intelligence use cases? Is it something that you might use in conjunction with or instead of a Kafka stream or something like ksqlDB or just curious what your framing is when you're talking to people who are coming from the data ecosystem to help them understand what the applicability of it is and some of the tools or systems that it will either augment or replace.
[00:10:56] Unknown:
I think my answer has me scratching my own head a little bit, because my answer to your question of which of these is mostly yes. I think it probably makes sense to understand that though we have many years servicing big Wall Street customers with an enterprise product, the product that we are investing in exclusively at this point, and evolving very actively, is our community product, our source available product. So I think it's really with that in mind that, you know, the one that is out there and available to all the people that are listening to this podcast is likely the one to talk about.
At its core, we believe that streaming tables are very important. We used to have to argue that real time data mattered, and then that phrase, which was overused, real time, right, just became messy. And then all of a sudden, Kafka blew up and Confluent blew up, and we don't have to make that argument. And people understand, like, oh, okay, streams are a thing. We all agree that streams matter. And I think a lot of people would agree it's growing in terms of its relevance. I would go as far as, say, like, 2027, 2030.
If dynamic data isn't at the front of how you think about data, we probably just see the world a little differently. So streams are important, but we think we have this concept of a streaming table. And that's really a construct out in the open that we think is something that is very powerful and that we hope to nurture to a point where it's ubiquitous, both as something that serves the Deephaven engine, but also something that is in support of many, many other applications. So the first investment we've made in going to the community is to deliver an open API that essentially uses Arrow Flight payloads to support a gRPC based package that describes tables that are changing: updating tables, streaming tables, if you like.
That's really the core piece. Once you have that core piece, again, we encourage others to explore it. We hope that we can build a community of data software developers around that. But then the next layer is to think about the Deephaven engine. And when you think about the Deephaven engine, yes, it is very reasonable to compare it to ksqlDB. I'd suggest if you have Kafka streams and you want to build applications or do analytics with them, use AI on them, Deephaven, in many cases, might be an easier or better or higher performance engine to use than ksqlDB, even though it's Kafka. Certainly, we are not intending to compete with the Kafka API, you know, and those event streams. We think our streaming tables are a complementary thing, and our data engine can certainly sit on top of a number of source technologies, Kafka just being one of them, Redpanda being another. But, also, you know, in the streaming world where I come from, Solace is relevant, you know, but then also proprietary APIs and vendor APIs with real time data really matter. Setting up web scrapers as sources for real time data or web applications.
These are all both direct sources that are interesting to Deephaven as well as indirect ones that are washed through Kafka.
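To make that open API idea concrete, here is a minimal sketch of a client process pulling one of those streaming tables over the gRPC/Arrow Flight transport. It assumes the pydeephaven client package and a Community server on localhost:10000 with a table named `trades` already bound in its script scope; the exact method names (`open_table`, `snapshot`) are assumptions about the client API, not a definitive reference.

```python
# Hypothetical sketch: pull a server-side streaming table into a client
# process over Deephaven's gRPC / Arrow Flight ("Barrage") transport.
# Assumes the pydeephaven client package; method names may differ by version.
from pydeephaven import Session

session = Session(host="localhost", port=10000)  # assumed default Community port

# Fetch a table that a server-side script has bound under the name "trades".
trades = session.open_table("trades")

# Materialize the current state as an Arrow table for local inspection;
# a full Barrage subscription would stream incremental updates instead.
snapshot = trades.snapshot()
print(snapshot.num_rows)

session.close()
```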
[00:14:19] Unknown:
And as far as the user personas for Deephaven, I'm curious how you think about the categories of interaction for this framework and this engine where, you know, I'm sure that you have data engineers who are interested in being able to use it to be able to do exploratory analysis and transformation of their data streams to be able to figure out where to ship it or what transformations to make. Obviously, data scientists will be excited to be able to use it for being able to execute their Python code on these data streams. But I'm just curious sort of who you think about as you're designing the different features and functionality and user experience patterns.
[00:15:00] Unknown:
It's a great question and something we've really tried to be mindful of, particularly over the last year as we rearchitected, rewrote, and modularized our code base to be something that we think is attractive to the community. Again, our history comes from one team, one dream type of thinking. Let's all get around one data store and one single source of truth for streams, and let's not at all tolerate false dichotomies. So let's not at all think data scientists and data developers should use different sources. Let's not at all tolerate thinking that analytics are different than applications. Let's just build stuff on top of this common place. That's really where we come from. But in bringing it to the community, we understand that simple matters and straightforward matters, and we want to have smaller building blocks for people to handle first. So in particular, we focused on the two distinct personas you suggested.
One is data developers, and the second is data scientists. Data scientists are a little bit easier to put together nowadays simply because there are some famous words and some famous patterns that seem to represent them quite well. The word Python seems to go there. AI seems to characterize a lot of them, even Pandas and NumPy. You know, these are sort of where communities coalesce and usage patterns for development. So when we think about Deephaven and its intersection with the data science community, we want to look at those tools.
We wanna understand those workflows, and we wanna make ourselves a valuable complement to the way things happen. So with features there, we think, oh, this needs to be easy to deploy as a Docker image, but you need to be able to deploy it locally. Right now, we're working on making it available just as a Python library. Certainly, our REPL, our exploratory experience in a browser, is amazing. It can do things that I think you would really, really find valuable. But we're working hard to make sure that the widgets for that are available in Jupyter, for example, so that you can have real time ticking tables in Jupyter notebooks. So with data scientists, we are more or less putting a full embrace around Jupyter, Python, you know, and AI libraries to make sure that Deephaven is integrated with the toolkits that they want. And in particular, one thing we're focusing on is real time AI. We think real time is super sexy. AI is super sexy in terms of words, but then, oh, hey, I have a lot of Kafka streams, and I want to, you know, do sophisticated stuff with them.
That's not so easy from an infrastructure point of view. We think that Deephaven makes that easy in combination with our streaming tables and a learn library that we built that really couples the capabilities of dynamic data in tables with Python libraries generally, but, you know, TensorFlow and PyTorch and scikit-learn specifically, through NumPy.
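As a concrete illustration of that pattern, here is a hedged sketch of coupling a ticking table to a scikit-learn model through NumPy inside a server-side Python session. The module paths (deephaven.time_table, deephaven.numpy) are assumptions about the Community API, and wiring the function into the update graph via the learn library is only described in a comment, not shown.

```python
# A hedged sketch of the "real-time AI" workflow: couple a ticking Deephaven
# table to a scikit-learn model through NumPy. Module paths are assumptions
# about the Community Python API; the learn library mentioned above would
# wire this into the update graph so it re-runs as rows arrive.
from sklearn.linear_model import SGDRegressor
from deephaven import time_table
from deephaven import numpy as dhnp

# A synthetic ticking source: one new row per second.
ticking = time_table("PT1S").update([
    "X = 0.1 * i",
    "Y = 2.0 * X + Math.random()",
])

model = SGDRegressor()

def train_on_current_state(t):
    """Pull the table's current rows into NumPy and partially fit the model."""
    data = dhnp.to_numpy(t, cols=["X", "Y"])
    if data.shape[0] > 0:
        model.partial_fit(data[:, [0]], data[:, 1])

# Calling it once illustrates the data handoff; in practice the learn library
# (or a scheduled job) would invoke it on each update cycle.
train_on_current_state(ticking)
```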
[00:18:10] Unknown:
So that's the data scientist persona. I'd be happy to talk about the data developer persona, but perhaps I should take a breath. The other interesting element of this pairing of personas is the question of collaboration and what are the interaction patterns between the data developer, who might be trying to use this for figuring out their transformations or executing the transformations or building out a library of tooling that the data scientists can use, and the data scientists, who are trying to build these models, understand the performance and the tuning of those model parameters, being able to do feature extraction, and then maybe being able to feed back to the data engineers and data developers what
[00:18:52] Unknown:
source systems they might wanna connect with or data models they need to be able to have available for powering those model development workflows. I mean, you have it exactly right. Imagine you're a business manager at a company, and that company might be a hedge fund. Right? You wouldn't tolerate much conversation about, like, oh, let's have, you know, these guys that work over here and these other people, you know, that work over here, and here's the APIs between them, and those all need to be supported. That would feel not right to you. It's always felt not right to us for a decade now. Right? So we've built Deephaven so that it's really sort of a pub sub mesh of these streaming tables, so that the work one person is doing is easily consumable by another person. And that doesn't just mean people. For your audience, it means a service. Like, we have workers. Right? Those workers have access to batch data and to streams.
There can be smart federation to make sure they only get the stuff that they should get. And then as you use Deephaven, the Deephaven table API, or arguably our Deephaven syntax for working with tables, table operations, or as you deliver Python or Java or another language to the code to do sophisticated things, you're doing it in such a way that you name tables. You know, you could think of it maybe analogously to having topics, you know, topics that you exhaust to Kafka. So all of these workers have named tables, and all of the named tables can be available to any other worker that wants to subscribe to them. So you create this mesh of lots and lots of different processes doing work and serving each other data in real time as it flows through a directed acyclic graph. Right? So this can support pipeline workflows. This can support parallel workflows. These can support complex workflows that combine the two, and all of it will update incrementally in real time.
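A hedged sketch of that mesh of named streaming tables: worker A publishes a ticking table by binding it in its script scope, and worker B subscribes to it by URI and keeps deriving. The URI scheme, the hostname worker-a, and the deephaven.uri.resolve call are assumptions about the Community API.

```python
# --- On worker A (a Deephaven server-side script) ---
# Binding a table to a variable in the script scope publishes it by name.
from deephaven import time_table
best_bid = time_table("PT1S").update([
    "Symbol = `AAPL`",
    "Bid = 100 + Math.random()",
])

# --- On worker B (a second Deephaven process) ---
# Resolve worker A's table by URI; it keeps ticking here as A's table ticks,
# and further operations simply extend the shared directed acyclic graph.
from deephaven.uri import resolve
remote_bid = resolve("dh+plain://worker-a:10000/scope/best_bid")  # assumed URI form
alerts = remote_bid.where("Bid > 100.5")
```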
[00:21:00] Unknown:
And as far as the architecture and system components that power these different aspects of the Deephaven experience, I'm wondering if you can talk through some of the technological underpinnings and some of the ways that you have approached the decisions about how this is structured to be able to support these use cases. And given the fact that you mentioned that you've just gone through a major rewrite, some of the ways that you have gained lessons from the initial earlier work that you've done and ways that the ecosystem of tooling and structure has been able to simplify your work of rebuilding the system.
[00:21:40] Unknown:
You know, maybe we could just start with some basics and see where the conversation goes. At its core, Deephaven is a Java application. Okay? It's a column oriented query engine, which probably doesn't surprise you given the feature and performance characteristics that I've talked about so far. You know, there are certainly Java experiences and Groovy experiences, and Scala experiences now on their way, but it's developed as a Python first experience. So that probably is the greatest standout in regards to the evolution of the last number of years, particularly as we've journeyed towards community: we really wanted all of that to feel very Pythonic, idiomatic, and to be, you know, naturally integrated with the ecosystem of libraries and tools that exist out there for Python. So one of the keys to that is a pretty substantial lift a couple of years ago to change the architecture to be array oriented at its lowest level, so that, both from a performance point of view, we can operate on data in chunks, which is vital for performance, but then also as one moves between languages. And, again, remember for us, our users are delivering, you know, code to the data, compiling it all down in the same, you know, process, and doing potentially much more sophisticated work than classic table operations.
Right? We wanted to make sure that as they did that, if they had to cross over, you know, language barriers, the cost of doing that was amortized over a great amount of data. So I think that was, you know, a really important bit of work, and all of it's done. Understand, as we're rearchitecting anything in the engine, we're really optimizing every single operation for every single data type. And the user can remain happily blind about whether the data is static underneath or dynamic underneath. They can use exactly 100% all the same stuff and be totally blind as to which is going on. Under the covers, as you can imagine, we've got a fairly optimized form for static sources versus sources that actually do have table updates going on.
So I think that array orientation was important. And as part of that, you know, considering how to deliver a nice integration with CPython and NumPy so that it's really first class. You know, thinking about how to integrate Numba so that, you know, to the extent that people want, you know, Numba accelerated processes to work, they can just do so. So I think that's a very important part of both our current capabilities and the journey that you're asking about.
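Here is a hedged sketch of what that Numba integration can look like in practice: a compiled, vectorized function referenced from a table-update formula, so the engine's chunked, array-oriented evaluation can feed it column data efficiently. It assumes the Deephaven Community Python API; details may vary by version.

```python
# Assumed API: deephaven.empty_table plus a Numba-compiled ufunc referenced
# from a query formula. Numba compiles weighted_mid rather than interpreting
# it per call, which fits the chunked evaluation described above.
from numba import vectorize, float64
from deephaven import empty_table

@vectorize([float64(float64, float64)])
def weighted_mid(bid, ask):
    # Compiled by Numba into a ufunc.
    return 0.4 * bid + 0.6 * ask

quotes = empty_table(1_000_000).update([
    "Bid = 100.0 + 0.01 * i",
    "Ask = Bid + 0.05",
    "Mid = weighted_mid(Bid, Ask)",  # Python function referenced by the query
])
```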
[00:24:28] Unknown:
In terms of the optimizations that you build for these different data types and being able to manage them in these arrays, I imagine that part of the array orientation is to be able to use the single instruction, multiple data capabilities of more modern CPUs. But to the data type question in particular, I'm curious how you think about constraining the available types so that you can provide these optimizations and balancing that with being able to support more complex or richer data types or data objects to be able to support the flexibility for the end user where maybe they want to use JSON objects or they want to use, you know, geometries for dealing with geospatial data or maybe they're dealing with, you know, some more complex data types that are domain specific and just some of the ways that you're able to balance the speed and optimization and first class support for data types with this flexibility and being able to manage that across these boundaries of streaming and static.
[00:25:28] Unknown:
You're really speaking towards where we're going. And these types of conversations excite me a lot, but I have to be probably modest in both, you know, accommodating your wisdom as well as potentially deferring to my team. So the data types that have typically fascinated our users are classic Java and Python data types. Right? The ones that you would think of, but also, you mentioned JSON. Obviously, that would be standard to support, and things of that nature. Another thing that is very, very important is date-times. We think Deephaven is relevant across a variety of industries. We think many of those industries will be interested in time as a fundamental thing. Wall Street thinks time is a fundamental thing. I mean, IoT data feels very time driven.
Real time gaming feels time driven. Health care telemetry feels like time matters. Right? So date-times have been a very important data type to support. And you can imagine there's quite a bit of infrastructure that goes into being able to support date-times and time series joins, even in a relational type of pattern, in a first class way. So we've done work on the data types that matter to us, some of the ones that you're talking about. For example, I wouldn't suggest we are first class in supporting geospatial data types, you know, in their full modern richness.
Right now, that would probably fall out as a POJO or something like that in our system. It could certainly be handled. You could certainly work on that object within Deephaven, because Deephaven fundamentally is just bring your code to the data. It's a server. You can make it work in Java or Python or wrap C++ or something like that. But for a use case like that, I wouldn't think we'd have optimized performance out of the box. And it would be really interesting for our team, or really, hypothetically, the team that moves forward with Deephaven as a contributing group, to receive some direction from the community about this being a priority and try and optimize other data types, you know, with those use cases in mind that you suggested.
[00:27:34] Unknown:
And another interesting area is that because your kind of foundational primitive of the interaction is this table structure, people who are familiar with working with databases are going to have experience using user defined functions for being able to push functionality down into the database engine rather than having to pull the information out, process it, and push it back. But given the fact that the primary interaction pattern for Deephaven is code native, I'm curious how that influences the way that you think about what is a user defined function that needs to get pushed down into the engine to live closer to the data and closer into the sort of memory space of that server, versus being able to execute in the sort of user level, user land, where they're running their Python code with their libraries and dependencies or what have you? You know, we support
[00:28:30] Unknown:
both patterns, and I think they would feel pretty natural. Right? So, I mean, in some respects, Pandas is not an entirely dissimilar model, right, where you can push user defined functions towards the tables and the server just operates directly on the data. There are many patterns that are very important to data developers. I don't wanna accidentally starve the conversation of the data developer persona. We care about them every bit as much, where, you know, either they're writing a Java client or a Python client or a JavaScript client, and we really have an amazing JavaScript API, and they want to interact with the server from there, or they wanna use, you know, a more declarative QST or something like that, where they're, you know, delivering code to the server and getting back results. So I think we are mindfully trying to support both usage patterns, you know, across a few languages.
[00:29:28] Unknown:
In terms of that cross language support and the fact that you are working across batch and streaming and working to support machine learning workloads, I'm curious, what are some of the impedance mismatches that you've had to deal with and some of the ways that you're able to kind of sand off the sharp edges of that experience so that you can support such a broad range of applications?
[00:29:51] Unknown:
Yeah. I mean, so I think the biggest challenge in approaching the broad community, okay, is an expectation that comes from SQL, or even specifically OLTP and transactional SQL. Oftentimes, when you have, you know, a data engine that has table operations as a core competency that people want, you know, there's a natural hope, I think, that SQL will be supported and that transactionality will be fundamental. We have a much more Kafka like approach in regards to consistency and transactionality. We think that for many use cases, both for OLAP, but also just, you know, even just for general feeds and data driven application development.
You know, we think, you know, that model of consistency, you know, is sufficient. So it'll be interesting to see where the community wants us to go in regards to integration with SQL. We certainly have ideas about how to map SQL's, you know, select and update to our table API. But we wanna be careful in doing that such that we're not, you know, mismanaging expectations about the type of data structure that exists behind that. I think other than SQL, we found pretty smooth integrations with a lot of the tooling that exists out there. As you know, almost all of it is made only for batch data. So, oh, if somebody wants to work with Deephaven from R, we've delivered solutions where, you know, you can snapshot an updating table every n seconds into an R data frame that now all of your R code is gonna work on. Right? And that's analogous to, oh, I really wanna work in Pandas. I have all this stuff that works in Pandas.
Great. You can use Deephaven as, I don't know, like a transformation engine that then feeds your Pandas code or something like that, or something that simply just joins a couple of Kafka streams in real time and then feeds that, or does predicate pushdown on your Parquet files and then joins them with real time streams from a web application or an IoT device and then delivers them to your Pandas tables or something like that. So we've tried to understand the constructs that people rely on and the tools that they rely on and be as interoperable as possible, while at the same time trying to also champion the view that, look, data changes, and streaming tables are an interesting way to think about it. And, hey, data software world, here's this API called Barrage that we hope you'll find interesting to consider in regards to communicating dynamic tables across the wire.
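A hedged sketch of the "snapshot an updating table every n seconds and hand it to pandas" workflow mentioned above, assuming the Deephaven Community Python API; the module path deephaven.pandas and the snapshot_when method name are assumptions rather than a definitive reference.

```python
# Assumed API: time_table, snapshot_when, and deephaven.pandas.to_pandas.
from deephaven import time_table
from deephaven import pandas as dhpd

live = time_table("PT1S").update(["Price = 100 + Math.random()"])

# A five-second trigger; each time it ticks, the snapshot refreshes with the
# then-current contents of `live`. Renaming avoids a column-name clash.
trigger = time_table("PT5S").view(["SnapTime = Timestamp"])
periodic = live.snapshot_when(trigger)

# Convert the latest snapshot into a pandas DataFrame for downstream code.
df = dhpd.to_pandas(periodic)
print(df.tail())
```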
[00:32:38] Unknown:
StreamSets DataOps platform is the world's first single platform for building smart data pipelines across hybrid and multi cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end to end data integration platform that's built for constant change. Amp up your productivity with an easy to navigate interface and hundreds of prebuilt connectors, and get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all of your data pipelines, the full transparency and control you desire for your data operations.
Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' Professional Tier will receive 2 months free after their first month. In terms of the structure of the tables, one of the perennial problems when you're dealing with any sort of structured schemas is evolution and changes in the source data systems. And I'm curious what types of machinery you have available for being able to surface alerts or errors in the event that the schema of a Kafka stream changes, and so now instead of it being an integer, it has evolved to a float, or from a float to an integer, and so now the computation that you're using to create a derived table is not accurate, or it's starting to error out, or, you know, it has, you know, gone outside of a certain standard deviation from what it had been. And I'm just curious some of the ways that you're able to integrate some of these kind of data quality checks into the execution of these real time tables.
[00:34:28] Unknown:
It's an important question that you're asking. On the community side today, what would happen is you would error out, because, again, what we've tried to deliver in the community side is simpler building blocks that people can understand, and we wouldn't wanna obfuscate something sophisticated like what you just suggested there. We certainly have experience. There are solutions that exist in our enterprise product. They're being used by the big hedge funds and banks out there. Many of them use Deephaven as their full data life cycle management system. Right? And they're doing that for taking real time feeds. They're transforming and merging it into historical files. They're doing all sorts of data validation and data cleaning as part of that exercise. And they're doing it, you know, as I suggested, both in real time as well as, oh, they just inherited a batch from some vendor or something like that.
So from our perspective, that is, you know, logic that is introduced on top of the lowest layer, not something that's fundamental to the lowest layer today. The lowest layer obviously wouldn't be happy to the extent that a data type changed, and the community version would error out at this point.
[00:35:34] Unknown:
In terms of getting onboarded into Deephaven or running the community edition yourself, I'm wondering if you can talk through some of the setup and infrastructure that's necessary to support it, but more interestingly, the work of integrating various data sources and then being able to feed that into different downstream systems. So just being able to figure out what is the scope of data sources that you can work with and consume from, and some of the ways that you might build additional experiences or logic or analyses that are being powered by the streaming computation that Deephaven provides?
[00:36:13] Unknown:
Easiest way to get going is to just, you know, download a Docker image and, you know, launch locally or in the cloud. We make several available, in Python and Java, essentially, but also sort of with or without various AI packages. Again, you know, integrating with machine learning libraries that exist out there is fundamental to some users, but other users would find that gear unnecessary and heavy. So we don't want to make that part of the image. There are other patterns for deployment, but I think that is the fastest one to get going. In regards to accessing data, there's a suite of integrations that we think will be fully supported within a couple of months. So we started with, again, we have experience with sort of the whole range, but we're porting it in smart ways to community so that it's very modularized and plays well with our gRPC based API. So in particular, we focused first on Parquet and Redpanda and Kafka as both ingest and exhaust. We support change data capture ingestion. So, you know, some of your listeners have heard people talk about, you know, integrations with Debezium and things like that, and that would work. We just wrote from scratch, we think, a very compelling CSV parser in Java. We wrote one because we needed better performance than the Apache Commons version that was out there, and we needed to support type inference in a first class way, where some of the other, you know, good performing Java CSV parsers did fine, but they didn't have the inference. So it's pretty much one line of code to run for a file or a CSV, or we have a resolver where you could get data from a web source or something like that very easily in one line of code.
Parquet or Redpanda, as you know, is probably 4 or 5 lines, where more or less you're just having to configure, you know, source information and topic details and things of this nature. In regards to exhaust, it's mostly, you know, a mirror image of that, I would say, other than exhausting CDC isn't something that makes a lot of sense to us there right now. But importantly, on the exhaust side, we think one of the really valuable things to consider is sending it from one Deephaven worker to another, from one Deephaven process to another Deephaven process, as I suggested, over this Arrow Flight compatible API that we have. And then in regards to consuming the data, there's sort of application consumers and eyeball and finger consumers. Right? And for the eyeball and finger kind, there is a very rich open API that supports all of our clients and their client APIs. And one of the client APIs that sits on top of that is a JavaScript API.
And on top of that, we have, you know, a very rich environment, you know, written in React, for exploring data, for inheriting tables in real time, for seeing the work product of what I call this, you know, pub sub of streaming tables, you know, as one Deephaven process talks to another. It does many important things that you would expect in an analytical interface. Oh, I wanna play with tables. I wanna filter things. I wanna create new columns here from the UI without doing anything else. Or I wanna link one table to another and double click and have fancy things happen. It's quite rich in that regard. And, again, it's all engineered, even within a browser, to support data that changes.
So real time data ticking in, you know, and data at scale. We just put a blog post out there about how we can, you know, the blog post was rendering, I think the number was, a quadrillion rows in the browser. A pretty atypical number, but it also is not something that other browser grids can support. So we work on a number of these problems.
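A hedged sketch of the ingest patterns Pete describes above (CSV in roughly one line, Parquet and Kafka or Redpanda in a handful), assuming the Deephaven Community Python modules; the file paths, URL-free examples, and broker address are hypothetical, and exact signatures may differ by version.

```python
# Assumed Deephaven Community Python API for ingest.
from deephaven import read_csv, parquet, dtypes as dht
from deephaven import kafka_consumer as kc

# CSV: roughly one line, with the type-inferring parser doing the work.
trades_csv = read_csv("/data/trades.csv")  # hypothetical path

# Parquet: another one-liner that yields a static table.
history = parquet.read("/data/trades.parquet")  # hypothetical path

# Kafka (or Redpanda): a few lines of source and topic configuration yield a
# ticking table that downstream table operations can subscribe to.
live = kc.consume(
    {"bootstrap.servers": "redpanda:9092"},  # hypothetical broker address
    topic="trades",
    key_spec=kc.KeyValueSpec.IGNORE,
    value_spec=kc.json_spec([("symbol", dht.string), ("price", dht.double)]),
)
```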
[00:40:00] Unknown:
Yeah. The browser interaction is definitely something that's interesting because, as you said, it's not something that most people are focused on. If you're gonna try and load data into a browser, you're going to try and condense it and figure out, you know, what are the useful aggregates so that I can downsample this information to render it to the end user, because somebody who's eyeballing it isn't going to want to look at all quadrillion rows of data. They are going to care about what is the actual aggregate information, but it's definitely valuable to be able to feed that all to the browser without completely crushing the user's laptop. For sure. I mean, it would be silly of me to suggest that we're sending a quadrillion
[00:40:35] Unknown:
rows and any number of columns to the browser. Right? You know, it's just the smart viewport support that's going on there, and that extends not just to these tables that are changing in real time, but, you know, you could imagine a pivot view or a roll up or an aggregation that has real time data underneath it. You know, that can be a pretty sophisticated problem from a UI integration perspective. So our customers have asked for these things, and we've delivered them. And now we have, we think, upgraded many aspects of them and delivered them to the community out there on GitHub.
[00:41:08] Unknown:
You mentioned at the beginning that you have a source available licensing model, and I'm wondering if you can speak to the decision process that went into choosing a license and some of the ways that you think about the governance model and the boundaries between the community edition and the enterprise edition. It's a fair question. A couple things just to remind you of going in here are that I am probably less expert at that question than you are, and, also, we have a bit of a strange history, right, in that
[00:41:37] Unknown:
we went from being an internal product to an enterprise product to a community first product. We really are fully open and fully committed to community development. So to your question, when we made that decision, and there's a number of stakeholders you need to convince at a company like ours to make that decision, we wanted to lean as heavily as we could into the spirit of open source while satisfying the different views and different priorities of those stakeholders. So we went to the point that felt exactly right at that moment in time, which was, this is new to Deephaven.
Let's operate in as good a faith as possible. Let's make it as simple as possible to understand. Let's make sure that we can be 100% committed in spirit. And so what we did is we looked at all the licenses out there. We felt, as many of the other cloud companies or many of the other data infrastructure companies had determined, that we should maybe protect a tiny percent of use cases, because otherwise it might compromise our ability to invest in the product, and we wanted to invest in the community product. So then we read everyone else's licenses, and our assessment was that if it needed an FAQ, that was bad. Like, we wanted the license to stand on its own without an explanation.
So we wrote our own source available license for the engine. And this is just the core engine; we'll talk about other open source projects that we support that are under a different license. But for just the core engine, we prohibited exactly one thing. We made it very technical. It has the word schema in it. It doesn't have anything to do with businesses, and we wanted to make obvious to any developer reader or any business reader the tiny sliver of the world that was not supported in the license. And we think in doing that, we service a huge community, and we look to partner with them, and we look for them to direct the product.
[00:43:31] Unknown:
And you mentioned that beyond just the core engine, you have a number of other open source projects that form the constellation of the whole experience. And I'm curious what you've selected as the license for those. And then just broadly across that constellation of projects, how you think about the governance model and the long term sustainability of the ecosystem?
[00:43:53] Unknown:
We not only have a few other projects of our own, we participate in other big projects, and we sort of pair on some important but lesser known projects. So in all cases, they're, you know, OSI compliant licenses. Most of them are Apache. The ones that we control that are not Deephaven Core, which, again, we just spoke of a moment ago, are Apache. When we think about code that right now our people are the sole contributors of, we think of it as, let's be as modular as possible and really think about delivering this software in compartmentalized ways where it can be valuable to others. So one of the projects we put out there, for example, standalone with an Apache license, is called WebUI.
Right? And this is our JavaScript React application as a standalone with all the good stuff I described before, but you could have it served by any of a number of things, including many data engines that somebody might describe as competitive to ours or at least swimming in the same pool as we do. We put that out there in good faith because we know it serves the community to potentially be collaborating both with the community and with those other providers in making a better JavaScript, you know, experience for exploring data. That's an example of something we control. Another project that we're very active in, but that's much more in partnership than exclusive to us, is jpy, which is, you know, a bidirectional bridge between Java and Python, per the architecture I talked about a few minutes ago. You can imagine how that's very important.
And so there's a company over in Europe that started that project, and we got quite involved. The two teams form most of the contributing group there. And then, of course, the headliner here is probably the one I've mentioned several times now, which is our Barrage API. And, frankly, though it's its own project, we see it as a tail on the dog of Apache Arrow and Apache Arrow Flight, and the compatibility with that and the accommodation of everything Arrow and Arrow Flight is just a fully embraced first principle of Barrage. It's a defining principle of Barrage. And so, you know, that license, that governance, certainly, to us, feels like one that is always gonna be very aligned with Apache Arrow.
[00:46:18] Unknown:
In terms of the applications of Deephaven that you have built or that you have seen others build, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:46:28] Unknown:
Most of my experience, and I am anxious for this to change, revolves around, you know, the things that banks and hedge funds and stock exchanges and other capital market players do. In that space, it really runs the gamut. There are users of Deephaven that only use it for signal farming of static data in Python. In the world of everything Pythonic out there, they find that Deephaven is very important for this because it brings the team together, and it services AI, time series, and relational use cases all in one. You know, at the other extreme, there's some very large, important capital market players that use Deephaven in the critical path of trading. So you can think of order management. You can think of real time pre trade risk or pre trade compliance and surveillance, or for algorithmic trading.
Maybe you have something. You have an order management system that's sending stuff in at, you know, submillisecond latencies, but your signals are gonna change every second or every minute or every 10 minutes or something like that. And Deephaven would be very relevant for being that second system, that general purpose system of feeding in new signals. Those are all automated sort of application use cases, but there's also many use cases where somebody's eyes or fingers are involved, where, you know, somebody is doing some sort of trading that combines a robot and their own intuition and reaction to things. So they're setting parameters on a screen via Deephaven input tables, and then automated trading is happening in the background. Right? Sort of combining, again, you know, Deephaven's code goes to data. So, you know, this idea of, I have a few different processes that pipeline to create a workflow like that, is pretty straightforward.
In bringing it to the community, we're very interested in where this goes. You know, in particular, we think there's many use cases that exist at the intersection, you know, real time AI, you know, that Deephaven can serve. There are people that are building, you know, recommendation algorithms, you know, clickstream applications. You know, it's just real time Kafka feeds coming in, you know, business logic or machine learning code, and then, you know, tables and table updates that are then exhaust out the other side for any of a variety of consumers across our API. So I think those are examples. But in the community version, we really like ideas that are fun and simple. So, for example, someone just created a toy. Like, it was crazy easy for them to build where, you know, within Deephaven, they listen to the Twitter API in real time for today's Wordle of the day. Have you heard of Wordle? I've heard references to it, but I haven't actually tried it out. They listened to today's Wordle of the Day on Twitter.
They just see a bunch of pictures of squares with different colors. And in real time, you can crowdsource what the Wordle answer of the day is. Not to cheat, but just to prove that, hey, this is cool, that you just literally have to listen to the power of the masses over the course of a couple of minutes. And without anyone telling you a letter, you can know definitively what the word of the day is with logic that, you know, I would think any well organized high school student that was motivated could deliver. That's a pretty compelling stack to put together in one little application, so we really like that. But IoT, crypto, blockchain, they have tons of real time data that matters. You know, I love the idea of real time sports. So I'm a sabermetrics guy, and I just would love Major League Baseball to open up the fire hose of data coming off those cameras that are in all the stadiums right now. It's an exciting world out there, and streams are an important part of it. Streams in the context of static data is important, and we think streaming tables are a really interesting way to
[00:50:25] Unknown:
deliver applications and analytics in that world. In your experience of working on this project and this problem domain for so many years and turning it into a business and now a community oriented project? What are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:42] Unknown:
We've learned a lot of lessons along the way. I think we were a little internally focused at a time when Python was not a driving energy within our company to the same extent it was out in the world. So we were a little late in reacting to that, and we've been very focused on software, and that has been somewhat in contrast to other, you know, potentially better, more maturely funded companies that have been focused both on software and cloud delivery. So I think it's been interesting to arrive in the last couple of years, particularly as we were very focused on community, to try and parse where software stops and where cloud starts and really how that should best work and what customers and users really want there. So there are certainly some, for example, compelling and really huge and famous cloud solutions that are very innovative for their cloud capabilities and their enterprise suite of integrations that we think are not as forward leaning just in regards to the core software and what the core engine can do and the performance on single threads and the handling of real time data, things that we think are important to us. So there's this dichotomy or this interesting puzzle that's pretty multidimensional that I think, you know, we're approaching really with curiosity, and we're hoping that the community can inform the product as to how to handle all of that. One thing that we think we've been lucky on is, though we faced enterprise for a number of years before we became focused on open, we thought we were lucky that we faced very sophisticated teams.
They weren't just using the product; they were evolving it. They were demanding about this thing and that thing. Maybe we just got lucky that they were sophisticated, but because they were good and they were open in their directions, we think the product moved in a very modern and very important way, though a unique way. That was exciting and helpful. And it's been kind of interesting that, now that we're open and all we're thinking about is interoperability and extensibility, as we look out in the world, so many of our architectural principles are consistent with some of the other important players that are considered revolutionary and really interesting. It's like, oh, nice, maybe we made some pretty okay decisions, because our architecture is lining up with what the world seems to want, even though we weren't engaged in open source in the last 4 years.
[00:53:24] Unknown:
For people who are interested in being able to build analyses or transformations on real time data, or to merge across streaming and batch systems, what are the cases where Deephaven is the wrong choice, and maybe you're better suited with something like Kafka or Pulsar or some of these streaming cloud data lake engines?
[00:53:47] Unknown:
So I think there are 2 cases. The first is if you have a lot of legacy and you want to service it all, and it doesn't play well with a new transport technology. People still make this bet, surprisingly. We very much embrace open formats, but we see this in the capital markets with customers: oh, I have a closed format, and there are famous ones. Can we, the customer, write a parser for that format and deliver it to Deephaven? And then can we write a shim layer between the applications that typically face that other thing, to now have them face Deephaven and take advantage of all the downstream coolness of Deephaven?
It starts to feel pretty squishy about whether that's right. If you have a legacy system where it's, oh, you just put updates on Kafka, and now you get Deephaven, that's a totally different animal. But if you really need to get into the guts of a legacy system and build a bunch of custom transformers or communication systems, that starts to feel pretty tough. The other case where there's at least a conversation to have about whether Deephaven is the right fit is if transactionality is the defining characteristic of your workflow: if you really are mostly OLTP, the analytics is a small afterthought, and all of your contemplation of those analytics is going to be thinking about the transactionality of the data even when you're analyzing it. Going back to your earlier question, this is 1 of the ways in which we contrast with Materialize. With Materialize, it feels like there's Cockroach in regards to OLTP, and then if you want an incremental update of that, that's where Materialize is, and they are very focused on transactionality.
We tend to think a good proxy is: the closer you are to transactionality, the farther you are from really cool math and some of the sophisticated AI stuff going on. In those cases, Kafka might not be your answer either. The consistency model that Kafka embraces is of the same order of magnitude as what Deephaven does, so if Kafka is relevant, I would think Deephaven is quite relevant. But if you have a lot of legacy code, or you're really mostly about transactions and that's foremost in how you think about data, then I think our gear is still relevant, but you might want to think harder about it.
[00:56:14] Unknown:
As you continue to build out the Deephaven product and the business and invest in these community offerings, what are some of the things you have planned for the near to medium term, or areas that you're particularly excited to dig into?
[00:56:30] Unknown:
The most immediate priority is just making delivering Deephaven as a library a very elegant experience, particularly in Python and in Java. We want any such client or application to just inherit the goodness of the Deephaven engine, with all of the deployment that you would expect. That's very important. We have also invested over the last many months, and are testing now, some very cool infrastructure for plug-ins for Deephaven. We're using 1 word, plug-ins, for both the server side plug-ins and the JavaScript client plug-ins, such that it should be quite straightforward to extend Deephaven for many tools that are important. For example, 1 of the driving catalysts for this was that, though they're engineered for static data, we knew there are many cases where somebody's working in Deephaven and seeing all these real time visualizations.
They're doing real time exploration, but then they want to use matplotlib, and they just want matplotlib to render in our exploratory UI. Instead of engineering that specifically, we built a general form version of it: we tried to service it in general form through plug-ins, with 1 specific wiring for matplotlib. So we think plug-ins are very, very important. We are also investing heavily in clients across our languages to make sure that they're first class in using the Barrage API for getting real time table updates and publishing real time or streaming tables to the server.
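To sketch the matplotlib case Pete mentions: with the matplotlib plug-in installed, an ordinary figure built in the server-side Python session can be rendered by the web UI. The ticking table setup and module paths below are illustrative assumptions, not the plug-in's documented interface.

```python
# A minimal sketch (assumed module paths; assumes the matplotlib plug-in is installed)
# of rendering a matplotlib figure built from a Deephaven table in the exploratory UI.
import matplotlib.pyplot as plt
from deephaven import time_table
from deephaven.pandas import to_pandas

# Hypothetical ticking table of simulated values.
ticking = time_table("PT1S").update(["Y = Math.sin(0.1 * ii)"])

# Take a static snapshot of the last 100 rows and plot it with plain matplotlib;
# the plug-in is what lets the resulting figure show up in the web console.
df = to_pandas(ticking.tail(100))
fig, ax = plt.subplots()
ax.plot(df["Y"])
ax.set_title("Snapshot of a ticking Deephaven table")
```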
2 more things. We're continuing to evolve and prove out use cases of our learn library, which is more or less the handshake between Python machine learning modules and our Deephaven streaming tables, so that real time AI is very easy. That is fully delivered, but we're investing in example use cases of it and other battle hardening. And then the last thing is somewhat speculative: we have this idea around many different widgets for streaming tables, where it's very easy to publish them in a lightweight way and very easy to consume them in a lightweight way, and we think that may open up a whole world of ideas.
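To make the learn handshake concrete, here is a rough sketch of wiring a Python function to columns of a ticking table via deephaven.learn. The gather/scatter helper names and exact signatures are assumptions based on the library's general shape and may differ by version.

```python
# A minimal sketch (assumed module paths and signatures) of the deephaven.learn
# handshake: gather table columns into NumPy, run a Python "model", scatter results back.
import numpy as np
from deephaven import learn, time_table
from deephaven.learn import gather

# Hypothetical ticking source table.
source = time_table("PT1S").update(["X = 0.1 * ii", "Y = Math.sin(X)"])

def model(features):
    # Stand-in for a real ML model: predict Y from the first feature column.
    return np.sin(features[:, 0])

def table_to_numpy(rows, cols):
    # Gather the requested rows/columns into a 2-D NumPy array of doubles.
    return gather.table_to_numpy_2d(rows, cols, np_type=np.double)

def numpy_to_column(data, idx):
    # Scatter one prediction back into the output column.
    return data[idx]

# The result is a table with a new 'Pred' column that keeps updating as 'source' ticks.
result = learn.learn(
    table=source,
    model_func=model,
    inputs=[learn.Input(["X"], table_to_numpy)],
    outputs=[learn.Output("Pred", numpy_to_column, "double")],
    batch_size=100,
)
```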
[00:58:48] Unknown:
So are there any other aspects of the work that you're doing at Deephaven, or the overall space of streaming and batch data or streaming data analytics, that we didn't discuss yet that you'd like to cover before we close out the show?
[00:59:00] Unknown:
I think we spent quite a bit of time on many of the things that are important to us. When we think of the project, we get very excited about a single data engine and its interoperable framework being extremely relevant for a data driven developer as well as a classic data scientist persona, whether they're building AI applications or doing analytics. We look forward to engaging with the community around the product and around those topics,
[00:59:30] Unknown:
and to seeing where all of these innovations that people are putting out there might go. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:59:50] Unknown:
My perspective on this is that the biggest gap is around somewhat singular solutions that, under the covers, put many modern tools together. Deephaven has already thought about this in terms of stream and batch: let's put them together under 1 solution. But 1 could even think of compute, storage, and networking all having very innovative solutions, and of trying to put them together in a turnkey and easy fashion. In many cases, we understand cloud innovation, and we understand the options that are available to developers and DevOps people who can configure selections to deliver the solutions they want. But we think general interoperability and ease of use around all of these respective themes, in an integrated fashion, is really where tremendous opportunity lies.
[01:00:49] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing at Deephaven. It's definitely a very interesting project and product, and a very challenging space to operate in, so I appreciate all the time and effort that you've put into making it more accessible and more tractable. Thank you again for the time, and I hope you enjoy the rest of your day. Well, I enjoyed the time with you. It was time well spent. Thank you so much. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Pete Goddard Begins
Overview of Deephaven and Its Functionality
Comparison with Materialize
Motivation Behind Deephaven
Core Use Cases and Ecosystem Placement
User Personas and Collaboration
Architecture and System Components
Cross-Language Support and Impedance Mismatches
Handling Schema Evolution and Data Quality
Getting Started with Deephaven
Licensing and Governance Model
Applications and Use Cases
Lessons Learned and Challenges
When Deephaven is the Wrong Choice
Future Plans and Exciting Areas
Closing Remarks