Summary
Modern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform designed to let you focus on using the data that you collect, without worrying about how to make it reliable. In this episode Eventador Founder and CEO Kenny Gorman describes how the platform is architected, the challenges inherent to managing reliable streams of data, the simplicity offered by a SQL interface, and the interesting projects that his customers have built on top of it. This was an interesting inside look at building a business on top of open source stream processing frameworks and how to reduce the burden on end users.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Your host is Tobias Macey and today I’m interviewing Kenny Gorman about the Eventador streaming SQL platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what the Eventador platform is and the story behind it?
- How has your experience at ObjectRocket influenced your approach to streaming SQL?
- How do the capabilities and developer experience of Eventador compare to other streaming SQL engines such as ksqlDB, Pulsar SQL, or Materialize?
- What are the main use cases that you are seeing people use for streaming SQL?
- How does it fit into an application architecture?
- What are some of the design changes in the different layers that are necessary to take advantage of the real time capabilities?
- Can you describe how the Eventador platform is architected?
- How has the system design evolved since you first began working on it?
- How has the overall landscape of streaming systems changed since you first began working on Eventador?
- If you were to start over today what would you do differently?
- What are some of the most interesting and challenging operational aspects of running your platform?
- What are some of the ways that you have modified or augmented the SQL dialect that you support?
- What is the tipping point for when SQL is insufficient for a given task and a user might want to leverage Flink?
- What is the workflow for developing and deploying different SQL jobs?
- How do you handle versioning of the queries and integration with the software development lifecycle?
- What are some data modeling considerations that users should be aware of?
- What are some of the sharp edges or design pitfalls that users should be aware of?
- What are some of the most interesting, innovative, or unexpected ways that you have seen your customers use your platform?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of building and scaling Eventador?
- What do you have planned for the future of the platform?
Contact Info
- Blog
- @kennygorman on Twitter
- kgorman on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Eventador
- Oracle DB
- Paypal
- EBay
- Semaphore
- MongoDB
- ObjectRocket
- RackSpace
- RethinkDB
- Apache Kafka
- Pulsar
- PostgreSQL Write-Ahead Log (WAL)
- ksqlDB
- Pulsar SQL
- Materialize
- PipelineDB
- Apache Flink
- Timely Dataflow
- FinTech == Financial Technology
- Anomaly Detection
- Network Security
- Materialized View
- Kubernetes
- Confluent Schema Registry
- ANSI SQL
- Apache Calcite
- PostgreSQL
- User Defined Functions
- Change Data Capture
- AWS Kinesis
- Uber AthenaX
- Netflix Keystone
- Ververica
- Rockset
- Backpressure
- Keen.io
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, a 40 gigabit public network, fast object storage, and a brand new managed Kubernetes platform, you've got everything you need to run a fast, reliable, and bulletproof data platform. And for your machine learning workloads, they've got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Kenny Gorman about the Eventador streaming SQL platform. So Kenny, can you start by introducing yourself? Hi. My name is Kenny Gorman. I'm co-founder and CEO at Eventador. And do you remember how you first got involved in the area of data management?
[00:01:05] Unknown:
Sure. So, my background is actually as an Oracle DBA way back when. I worked with my co-founder many, many years ago now at PayPal and eBay, and that's actually how we met. He was at eBay, and I was at PayPal. And we had some significant database problems back then scaling those stacks. We were both relatively early on, and we had problems in the Oracle database realm that others just hadn't seen before. And we worked on some of those hard problems around storage and Veritas and Oracle and Sun systems, and both started working together and really enjoyed each other's company and fixing hard problems. And yeah, that's really kind of where my love for data systems and databases and streaming systems really started. Yeah. The early days of the web were definitely an interesting time for actually finding out what the limitations are of the systems that were available at the time. Yeah. We had this one thing. It's just kind of anecdotal, but it's kind of funny. You know, you're supposed to be able to add a column to a table in a relational database just in real time. Like there's no, you know, locking behavior really to that. You're not locking up a table or causing some sort of semaphore. But ultimately, under the covers you are, and our database was so busy that we couldn't even do these things in real time on production systems, because there was just no time to get latches in the system. And I remember looking at the Oracle guy next to us who actually worked for Oracle, and he's like, yeah, we haven't seen that before. I said, okay. Well, that's great. We're in uncharted territory here.
Good times.
[00:02:30] Unknown:
Yeah. That's definitely great when the people who build the system say, I have no idea. Right. And so starting there, I can see how the interest in data has continued. And I know that you have also been involved in the ObjectRocket platform and most recently with Eventador. So I'm wondering if you can just start by describing a bit about what the Eventador platform is and some of the story behind it and how you came about founding it. Sure, sure. Yeah. I'll just kind of tell you what Eventador is real quick, and then, you know, kind of go back and talk about how we got here. But essentially, Eventador is a platform
[00:03:02] Unknown:
that allows you to build applications on streams of data. The design was that we wanted to allow developers and data scientists and backend engineers and folks to be able to just query streams of data, just like it was a database. And, you know, Kafka and streaming technologies and the distributed log have kind of taken off in the last few years, but ultimately, it's very hard to kind of make sense of that data, kind of wrap your head around the new paradigm of streaming data, and then, you know, ultimately, build applications that are really cool on top of those data streams. And that's really what the Eventador platform is designed to solve, all in one. And the idea kind of came about, so just to give a little bit of history, when my co-founder Eric and myself were working on MongoDB problems, actually around 2010.
And we started to figure out that MongoDB, back then, was very early. I think I was one of the first few customers who really used it in production and fell in love with it. It was awesome. But it had some quirky corner cases where it just didn't work very fast. It had a heavy-duty locking kind of design for the database engine and it made it slow under certain circumstances. And so we kind of figured out that a lot of that had to do with IO, and started building around SSDs. Sure enough, it's kind of a brute force mechanism, but SSDs made Mongo very fast. And so using some of the projects that came out of Facebook where we could do hybrid SSD and file system layers, and then ultimately just pure SSD, we figured out that Mongo could just be really, really fast, and we started to kind of build a product around making MongoDB enterprise grade, which is kind of what we called it. And that's when we founded ObjectRocket, and ultimately sold that to Rackspace, and grew that business to be a relatively big business. But along the way, we started to have a lot of our customers, and at this time we had billions and billions of documents under management.
In Mongo, it's a document, not a row, but ultimately, billions and billions of rows under management. And customers were asking us these questions like, Hey, how do I get a millisecond response time on this 2-billion-document collection? And the answer was like, You're never going to get that. I mean, theoretically, disks have to move to make this happen, or buffers have to be retrieved. There's so much index retrieval, plans have to be followed. There's just a ton of work under the covers for a database engine to return the data in that kind of a mechanism, and it just wasn't really feasible. And so, we started to ask, What are you doing? Why do you need to have this type of performance from your Mongo cluster?
What's the thing that's driving this? And it was these really kind of new school use cases where, for instance, and I won't mention the customer, but it was like, imagine if you wanted to look around a room and find a date, and you wanted it to be very centric to the people that are in that room at that time. Maybe you're at a place having dinner, or you're at a bar or something, and you wanted to get a sense of the customers that could be available for dating in the room. And we thought, well, okay, that's going to be a very, very real-time kind of situation. People are coming and going, how are they tracking this data? And more and more customers started to have these use cases in similar ways. And we thought, okay, there's a pattern here, and this is super interesting in terms of database engineering. How is the world going to handle these kinds of requests and use cases? And if there's sensitivity towards performance and real time, then there's got to be data systems behind them. There's got to be a backbone to make this work. And that's kind of when we fell in love with Kafka. And I know that there was also the heyday of RethinkDB
[00:06:31] Unknown:
that was trying to promise some of that same capability of push-based data delivery, where as soon as a record was entered into the database, you could subscribe to that and then get a push notification based on that. And I'm wondering what the differences are that led you
[00:06:51] Unknown:
to build out with something based on Kafka and some of these stream-native data platforms? Right. No, that's a very good question. I think, ultimately, we can say at this point, like, there's different technology stacks that we've employed for various uses over the years. Right? And in my mind, and I think, you know, it's a relatively popular opinion, is that the data sphere is not shrinking, and it's not, you know, one technology or another technology. It's really that as data engineering professionals, we have more choices going forward. And now it's up to us to pick the right platform or choice for the particular use case. And I think, you know, in terms of real time things, distributed log architectures, whether it's, you know, Kafka or Pulsar or something else, are fantastically cool and fantastically high performance because it's an append-only structure. I mean, we did this in databases way back when, when we had to make the database super fast and had to handle a lot of inserts. We would just, you know, say we wouldn't do any updates, we'd make it an insert-only table. And that's a very kind of old school and crufty way of doing a similar kind of thing. In fact, you know, it should be obvious that all the most popular databases have a distributed log built into them. It's for their recovery structures. So Oracle has a redo log, Postgres has WAL logs. Ultimately, these are append-only designs, and they're that way for a reason. They have to be high performance, and then, you know, they're typically designed to replay a stream of events. If you just pop that out of the database, so to speak, and, you know, create a new infrastructure piece based on just that, you kind of end up with something like Kafka. At least that's how I think about it, and so that's really kind of the right architecture. It's not really a database, but it kind of is. And I think that the hybrid of this pub/sub style of interface with a durable and distributed back end is really the sweet spot for a lot of these architectures.
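To make the append-only log idea concrete, here is a minimal sketch using the plain Kafka Java client: a producer appends a record to the end of a topic, and a consumer replays the log from an offset. The broker address, topic name, and payload are made-up placeholders, and error handling is omitted; this is only an illustration of the pattern being described, not anything Eventador-specific.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class AppendOnlyLogSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // assumed broker
        producerProps.put("key.serializer", StringSerializer.class.getName());
        producerProps.put("value.serializer", StringSerializer.class.getName());

        // Producers only ever append to the end of a topic partition,
        // much like a database writing to its redo log or WAL.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("clicks", "user-42", "{\"page\": \"/checkout\"}"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "example-reader");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", StringDeserializer.class.getName());
        consumerProps.put("value.deserializer", StringDeserializer.class.getName());

        // Consumers replay the durable log from an offset of their choosing --
        // the pub/sub interface over a distributed, append-only backend.
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("clicks"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("offset=%d key=%s value=%s%n",
                        record.offset(), record.key(), record.value());
            }
        }
    }
}
```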
[00:08:39] Unknown:
And as you mentioned, there are a few different approaches to these streaming and event-driven capabilities. And in recent months, there have been a lot more projects coming out to focus on being able to provide SQL interfaces on these streams, particularly in things like Kafka or Pulsar. And I'm wondering how the capabilities of what you've built into Eventador and the overall developer experience compare to some of those other systems, such as ksqlDB or the SQL layer built into Pulsar or, recently, the Materialize project?
[00:09:10] Unknown:
Right. So first of all, it's super exciting that we've seen so much energy in this space. When we first started Eventador, our initial prototype involved SQL. In fact, it involved PipelineDB, which we actually used, and now that's part of Confluent. And those guys are super smart. PipelineDB was super cool, it was built on Postgres, and we really used it because it made it obvious that we needed a SQL layer, and it made it obvious that you needed to be able to materialize results. The only problem was it just didn't scale like something like Flink. So we set off building our service kind of without that and built it from the ground up to what we have today. If you're looking at, you know, kind of ksqlDB and where that's at, you know, it's obviously very Kafka-centric. It's built into Kafka. It uses Kafka for scalability and command and control and coordination.
And that's fine. That's great. That's a great tool. I think what we view ourselves as is more of a platform. Soup to nuts, start to finish, top to bottom, ingest-to-your-application kind of capabilities for the enterprise. And I think that's kind of where our heads are at is, hey, look, you're going to need to have schema detection tools. You're going to need to have procedural logic, and on-the-fly creation of user-defined functions, and things like this. Enterprise features that, when someone's really serious about SQL and productionalizing jobs based on SQL, that's where we fit in real nicely. The Materialize guys are super smart guys, great product, obviously. It's built with different languages and such, and on the Timely Dataflow platform, very cool. And it's been great to see, you know, kind of where the energy is going. A lot of cool innovation happening in our space. A lot of really smart people are working on it. I think we're humbled and excited to be part of it. So, yeah, just kind of excited to see, you know, how things go from here. And in terms of the main use cases that you're seeing people using for these streaming SQL applications, you mentioned at the outset the sort of location-based and very real-time nature of the application that one of your first customers was looking to build. I'm wondering what are some of the other ways that you've been seeing the platform used, and how it fits into the overall application architecture, and how people are reconsidering the ways that they build and deploy the types of applications they're building on these event streams? Yeah. That's a good question. So, you know, the major industries or verticals or whatever you want to, you know, kind of call them, that are sort of early adopters in this whole realm. We've seen Fintech be a big part of that. Obviously, processing financial transactions in real time is important, and doing things like fraud detection and anomaly detection is a big part of this for fraud pipelines and other patterns.
We've seen everything from kind of cryptocurrency to traditional banking and all over the place, including things like micro loans on checkout and all those kinds of different, you know, subsets of the Fintech space. So that's really exciting. And I think there's a ton of area to innovate there for cool products and services from various vendors. I know you see Capital One leading the charge there. It's just customers are growing to expect more real time applications, and they want to be delighted when they use their iOS app, or log in on a webpage or something, and that data is refreshed and interacted with in real time. And I think that's really where streaming data comes in, is Fintech and banking has been so batch-oriented for so long that, you know, any kind of real time product that kind of comes in can be such a game changer and competitive advantage for a company. So that's one big one. We've seen NetSec a lot, network security, in terms of things like, obviously, finding anomalies and attack vectors for intrusion detection.
Obviously, just the raw data of packets flying over networks is an interesting streaming data problem in and of itself. So we see some traction there. IoT is a huge one; obviously, that's a very broad bucket of use cases, but everything from the mobility space to automotive to aerospace are kind of the three big ones there, where folks are trying to make sense of streaming data. In general, IoT is interesting because data is being generated at a massive rate, sometimes more than once a second an event is being produced. And then aggregating those events, like in some sort of real time ETL pipeline, is very interesting and required in many cases. And then in a lot of cases, how they materialize that data out to apps is a problem space. So, it's very common today to take your data and put it into Kafka, because that's easy. It's easy to just publish some data to Kafka. Cool, no problem. But it's also very common to then pull it out of Kafka and stick it into a database. And that's so that you can materialize it and read it with apps. And you know, what's interesting today is that that kind of ruins the whole point of a streaming architecture. It is kind of going backwards to some degree. And the reason that's very common is because, again, like I said, it's hard to materialize those results. So, one of the things that we've built into our platform is materialized views, and that just allows customers to skip that entire database layer and read right from the streams. And that's important in IoT, especially because there's just so much data. The aggregation and storage of it is a big problem space. And then, I think lastly, real time manufacturing. That is one that, you know, I'm personally excited about and I think is going to grow over time. You know, building better products and understanding yields and how your supply chain and supply lines are working is a real time problem because there's a dollar value assigned to not doing it in real time, so it's costly to perform poorly in that area. It's also opportunity cost that you could be leaving on the table in terms of being more competitive in your market space and things like that. So I think that is a super interesting area, and I think it's going to continue to grow. The industry giants out there are leading the charge there and doing very interesting things.
[00:14:58] Unknown:
And I think, you know, kind of the rest of the world has to catch up, and I'm excited to see that happen. One of the interesting things about running these aggregates on the event data as it's being transmitted into things like Kafka, or whatever your streaming engine of choice happens to be, is that in a lot of ways it makes it easier to actually build meaningful analysis on top of it, because you already have the context co-located with the data as it's streaming in, and the timeliness of it. Whereas if, as you said, you're then replicating it back out into a database and, you know, fully normalizing it, it can become much more challenging to be able to run those same types of analyses because of the complicated logic that you have to incorporate into trying to recapture some of that context and time sensitivity. And I'm wondering what your thoughts are on the value of actually storing the raw events after they have propagated through the system, and whether you would recommend that it's more useful to store the aggregated data in more long-term storage because the raw events at that point are less valuable in isolation.
[00:16:04] Unknown:
Actually, that's a really good question. You know, I think we see it kind of both ways, frankly. Companies are starting to kind of double down on streaming architectures. Right? It's starting to become normal to take every click that's ever happened on your shopping site or whatever and jam that into Kafka. That's just kind of normal operating business now. And the question is, like, what do I do with that data and how do I make value out of it? And it's not a one-size-fits-all kind of thing in these architectures. For instance, hey, maybe I want to have a stream processor running on that. Maybe I'm going to run that in SQL and I'm looking for certain events that are happening, and then I'm going to send an alarm. Like maybe I'm going to send an email out to the marketing team. Or another area is real time experiments. Right? So I'm running an experiment on my website, and oh, it's not performing, it's really performing really bad. I wanna know that right now because I gotta make the change, so I'll run an alert on that. That's fine. You can build that in the Eventador platform and you're off to the races. But you may also wanna store that detailed data somewhere else, right? Maybe you're dumping just that raw stream to S3, and maybe you will run some sort of, you know, batch process on it later. That's another way we see that's very common. Or maybe you're going to put it into a data warehouse, and maybe that data warehouse is expensive, which they often are. In that case, you would maybe pre-aggregate the data ahead of time and then just put fixed aggregates into the data warehouse. So maybe you're going to do it by week, by month, by day. And that way you're saving a ton of space with the raw data in something that's very expensive, which traditionally data warehouses have been. All that stuff is really enabled by platforms like ours because you can take that raw data coming in from Kafka, you can build SQL processors on it, you can route data to different places, you can pre-aggregate it, you can dump it to S3. So all those things are kind of on the table. And I think the best enterprises are leveraging tools in this way and thinking about it like, hey, get it into Kafka, and now let's think about, in a very serious way, how do I get the data out? And where do I send it? And who are my consumers downstream?
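As a rough illustration of that "route the raw stream one way, pre-aggregate it another" pattern, here is a hedged sketch using Apache Flink's Table API with SQL strings, in the spirit of what's being described rather than Eventador's actual configuration. The connector options, topic, bucket paths, and field names are all assumptions; one statement archives the raw events cheaply while another writes much smaller daily rollups.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class RouteAndPreAggregate {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Hypothetical Kafka-backed source of click events with an event-time watermark.
        tEnv.executeSql(
            "CREATE TABLE clicks (" +
            "  user_id STRING," +
            "  url     STRING," +
            "  ts      TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'clicks'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json')");

        // Cheap long-term archive of the raw events (assumed S3 bucket).
        tEnv.executeSql(
            "CREATE TABLE raw_archive (user_id STRING, url STRING, ts TIMESTAMP(3)) " +
            "WITH ('connector' = 'filesystem', 'path' = 's3://example-bucket/clicks/', 'format' = 'json')");

        // Much smaller daily rollup destined for the expensive warehouse or downstream apps.
        tEnv.executeSql(
            "CREATE TABLE daily_clicks (window_day TIMESTAMP(3), url STRING, clicks BIGINT) " +
            "WITH ('connector' = 'filesystem', 'path' = 's3://example-bucket/daily_clicks/', 'format' = 'json')");

        // One job fans the stream out to both destinations.
        StatementSet statements = tEnv.createStatementSet();
        statements.addInsertSql("INSERT INTO raw_archive SELECT user_id, url, ts FROM clicks");
        statements.addInsertSql(
            "INSERT INTO daily_clicks " +
            "SELECT TUMBLE_START(ts, INTERVAL '1' DAY) AS window_day, url, COUNT(*) AS clicks " +
            "FROM clicks GROUP BY TUMBLE(ts, INTERVAL '1' DAY), url");
        statements.execute();
    }
}
```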
[00:17:59] Unknown:
And in terms of how you've actually built out Eventador, I know that you have mentioned your affinity for Kafka. I'm wondering if you can just discuss the overall architecture of the platform?
[00:18:17] Unknown:
There are a few tenets in various areas that we've tried to solve. So, you know, it's a service-based design. It's built on Kubernetes. Some of our services use open source technologies like Flink, we've mentioned that, and Kafka. But there's some, like, higher-level components that were really important as part of our design. So first of all, schemas are hard. You know, the schema that most enterprises have is not, I mean, if you think your enterprise has one schema and it's perfect, you're probably lying to yourself. Most data is very messy. Most companies are working through trying to standardize and ensure that their schemas are good, but it is tough. And in a lot of cases, it's the Wild West out there. So dealing with nested data structures, dealing with foreign data feeds, dealing with different departments, you know, that's a big piece of the challenge, and that's why we created input transforms and have the deep nesting capabilities in our SQL engine, because we wanted to be able to address those feeds. We wanted to allow customers to mutate those schemas on the fly if they needed to. We integrate with things like schema registry if needed, and try to get that piece of the puzzle really nailed down. The other piece of it is a good SQL engine. So we leverage Apache Flink for the core SQL engine, but that's not really the whole picture. Because if you use SQL and Flink today, what you're doing is you're actually writing Java or Scala code, and you're putting SQL into that Java or Scala code as a string, and it's evaluated at runtime. And, you know, as we know, SQL is an iterative, declarative language. You want to be able to iterate on your data structures and on your data with SQL. You want to play around with it, you want to understand maybe what groupings are important, or what timeframes are important. And then ultimately, you decide, okay, that's a good piece of SQL, I'm going to send that off and let that be a job, and run continuously.
So our SQL engine is pretty cool in the sense that it allows you to iterate on streams of data. So it understands the schema, you can describe them, it uses ANSI standard SQL based on Calcite, and that works nicely with Flink. And if you make a mistake, like in your grammar, like a syntax error, it will tell you right away, so you don't have to wait for that job to run, and redeploy, and go through CD, and all that kind of stuff. So it really allows you to explore the data, play with it, treat it like a regular SQL terminal. And then, you know, as you hit execute, it creates a job and sends it off to the server. It's highly scalable, robust, has robust capabilities around fault tolerance and things that frankly, Flink brings to the table, and you're off to the races. And then lastly is this notion of materialized views. So, we have an engine that manages materialized views for us, that's something that we wrote. It manages the update and maintenance of materialized views, the indexing strategies, and really lets developers then address that data with any key, not just lookup by key, it's not just a simple key value store, it's an actual robust, restful endpoint. And so customers can build queries, they can send in parameters, they can assign predicates, and then build queries and build applications off of it. So those are kind of the, you know, from kind of the input and ingest to processing and then ultimately to materialization.
Those are the big pieces that we've added on top of the puzzle, using things like Kubernetes and Flink.
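For readers who haven't seen the pattern the answer refers to, this is roughly what "SQL as a string inside Java, evaluated at runtime" looks like with Flink's Table API. The table and field names are placeholders, the Kafka-backed source and the upsert-capable sink are assumed to have been declared elsewhere, and Eventador's own editor, job launcher, and materialized view engine are not shown.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class IterateThenDeploy {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Assume a Kafka-backed "sensor_readings" table and an upsert-capable
        // "sensor_rollup" sink have already been declared with CREATE TABLE.

        // Exploration: run the query interactively and print the continuously
        // updating result while you iterate on the SQL. (print() keeps emitting
        // updates until cancelled, so in practice you run either this step or
        // the deployment step, not both in one program.)
        Table rollup = tEnv.sqlQuery(
            "SELECT device_id, AVG(temperature) AS avg_temp, COUNT(*) AS readings " +
            "FROM sensor_readings GROUP BY device_id");
        rollup.execute().print();

        // Deployment: once the SQL looks right, submit it as a long-running job
        // that keeps the downstream view continuously up to date.
        tEnv.executeSql(
            "INSERT INTO sensor_rollup " +
            "SELECT device_id, AVG(temperature) AS avg_temp, COUNT(*) AS readings " +
            "FROM sensor_readings GROUP BY device_id");
    }
}
```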
[00:21:25] Unknown:
And given the fact that you are materializing the views, I'm wondering what you're using as the durable storage for that transient data. Is it something that you're writing back to a Kafka topic? Or do you have some sort of caching layer for being able to pull the current values from the materialized view based on the aggregates as they're updated? Good question. So today, we're based on the Postgres engine, and that gives us a ton of flexibility in terms of I mean, first of all, it's a very good SQL database. It's a very well known
[00:21:55] Unknown:
and relatively bug-free database engine, and then our management layer that lives on top of it, that treats it like a materialized view versus a database, is a big part of that as well. So our software mixed with, you know, some of the Postgres open source goodness allows us to ultimately present those materialized views to the customer and keep them updated, age them out, and, you know, do all the things you'd expect from a view that's being updated in real time. Another element of the schema support and being able to capture the appropriate context of data as it's flowing in is the idea of enriching the records as they're being inserted into Kafka. And I'm wondering what type of support you have for being able to either join across different topics or materialized views, or add a default set of data that gets injected as part of that input transform. So first of all, all those things are possible on our platform. We're a little bit different in that we treat every input, whether it's Kafka or anything else, as a virtual table. And you can join virtual tables. So if you have a Kafka topic in the EU region and one in the US and you need to join those in a piece of SQL, that's fine. No problem. Super easy. If you need to join something like Redis, maybe you've got, you know, a common use case, like, I don't know why I'm making this up, but like a voter application, and maybe you wanna take geo coordinates and look them up by county so you can populate a map or something like that. You might just have that geo-to-county lookup in a Redis database. So we allow you to actually enrich in real time from Redis. So that's pretty cool, super easy to use, as well as just using user-defined functions. So with user-defined functions, if you have a static data structure you need to use or some sort of lookup table or lookup hash, you can go ahead and use that in real time as well. So we kind of see all of the above. Sometimes, you know, I think people start off with more static datasets and, you know, actually put it in the UDF. Then they say, oh, it'd be great if I could update this in the back end via some sort of database technology. And then, you know, Redis is a great fit for that. So we see that as well. And then, obviously, sources and sinks, which are the inputs and outputs of any kind of stream processing system, are important. So S3 is an interesting one. I think going forward, we'll have more support for more of what I'll call legacy databases.
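Here is a hedged sketch of what treating two regional Kafka topics as combinable virtual tables and enriching rows through a user-defined function might look like in plain Flink SQL driven from Java. Eventador's JavaScript UDFs and Redis integration aren't shown; the scalar function below stands in for that kind of lookup, the regional tables are assumed to have been declared elsewhere, and all names and thresholds are made up.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.functions.ScalarFunction;

public class UnionAndEnrich {

    // Stand-in for the "geo coordinates to county" lookup described above;
    // a real deployment might call out to Redis or a static lookup table instead.
    public static class CountyLookup extends ScalarFunction {
        public String eval(Double lat, Double lon) {
            if (lat == null || lon == null) {
                return "unknown";
            }
            // Hypothetical hard-coded mapping, purely for illustration.
            return (lat > 30.0 && lat < 31.0 && lon < -97.0 && lon > -98.0) ? "Travis" : "other";
        }
    }

    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Assume eu_events and us_events were declared as Kafka-backed tables
        // (one topic per region) with the same columns: user_id, lat, lon, ts.
        tEnv.createTemporarySystemFunction("COUNTY_LOOKUP", CountyLookup.class);

        // Combine the regional feeds and enrich every row through the UDF.
        tEnv.executeSql(
            "SELECT region, COUNTY_LOOKUP(lat, lon) AS county, COUNT(*) AS events " +
            "FROM (" +
            "  SELECT 'eu' AS region, lat, lon FROM eu_events " +
            "  UNION ALL " +
            "  SELECT 'us' AS region, lat, lon FROM us_events" +
            ") GROUP BY region, COUNTY_LOOKUP(lat, lon)").print();
    }
}
```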
But today, if you want to do change data capture or you just wanna use Kafka or any of those kinds of things, all those things are possible and work quite well. And then at the time that you started Eventador,
[00:24:17] Unknown:
my guess is that Kafka was still the preeminent streaming platform for being able to handle this durable append only storage. But the landscape has evolved fairly significantly in the past, even just a year or 2. And I'm wondering what types of changes have occurred in the overall ecosystem since you first began working on Aventador. And if you were to start over today, how would the current state of the industry influence your design choices, and what would you do differently?
[00:24:45] Unknown:
Yes. There are other options out there today. I think that Kafka is sort of the de facto standard. Each cloud kind of has their own little version of that as well, and I think that's kind of in second place at this point, you know, things like Kinesis. And, you know, I think if I look back, knowing what I know now, you know, I told you we created this initial prototype where we had kind of this end vision of materializing results through streams with a SQL processor in the middle, I would have actually tried harder just to kind of get to the end result. I mean, I think we, early on, and I want to be humble about saying this, I think we maybe didn't necessarily predict where the market was going, but we kind of maybe felt that it was going to go this way, that it just kind of made sense. And, you know, here we are now and, you know, materializing views on streams is a real thing. There's multiple people doing it. It's a well-known design paradigm. We didn't know that when we first started Eventador. We just knew that this would be a cool way to build things on streams.
And I think if I knew where we're at now, in terms of the macro picture of the market, I would have just tried harder to get here faster. Instead, we went through phases where we thought, hey, it's going to be a managed Kafka world or a managed Flink world, and our software and our platform was going to be a smaller part of it. I think ultimately, we figured out that kind of the reverse was true. Kafka is very prevalent. Apache Flink is a very amazing piece of software, written by very smart folks and a great list of maintainers, and it continues to grow in popularity and is focused on correct results and good state management. So it is a great piece of software to work with and work on. And we're bringing what we can to the table, and I think, you know, the sum of the parts really equals a really nice and powerful platform. And I think, like I said, if we would have known that before, I would have just tried to jump right to the end result. But you never know when you're building something, I suppose. Another element
[00:26:37] Unknown:
of building on top of all these different types of systems is the operational complexity that it brings along with it, because distributed systems are hard in any case, but layering multiple of them together and then providing your own SLAs on top of them as a reliable foundation for other people's applications is even harder. I'm wondering what you have found to be some of the most interesting or challenging operational aspects of building and running your platform on top of these systems.
[00:27:05] Unknown:
Yeah. And that's a really good point to make, is that, you know, if you want to build this kind of real time architecture in your company, in many cases you can. You'll have the talent to do it. Engineering talent is prevalent these days. And ultimately, these applications are business critical. It's tempting to say, I'll just go build this stuff, you know, I'll use Kafka and I'll use Flink and I'll build this stuff myself. And, you know, companies have. You've seen Uber build AthenaX. You've seen Keystone by Netflix, and there's others. But those companies have put massive resource and massive time and energy into building these platforms. And ultimately, like, it just lets the customer self-serve, the development customer self-serve, and build apps, and that's great. I think that if you're not Uber or you're not one of these gigantic company monoliths, then you probably do need to think about vendors in the space, us or someone else, to help glue together a lot of these pieces. Because, you know, like I said earlier, stream processing is a different mental state than database engineering, to some degree. The same thing if you're coming from the Hadoop, distributed batch data landscape; these are very similar, but also very different kinds of mind shifts in how you deal with streaming data, how you think about late data, how you think about schemas in a schemaless world. How do you mix those two things? Obviously SQL requires strongly typed data and a fixed schema, but it's also pretty common for people just to jam whatever JSON into Kafka. How do you reconcile those two? How do you get a strongly typed schema out of a JSON blob that people are throwing into Kafka? And then how do you support all that all day, all night, the whole time, and at planet scale? And I think that in those cases, you do need a good partner, you do need someone who's very good with support and understands the underlying technology stacks to help you keep these things running. You know, I said earlier that a lot of these are production, business-critical applications. And once you start to compete in your market space with real time apps, maybe you're running machine learning models, or maybe you're building really, really cool dashboards that customers are attracted to, so they're buying more of your product. Once you start getting those things up and running, you can't just back away from them. They're production jobs, and it's very important to your business. So, again, getting a good partner that understands that stack and the nuances within the stack, I think, is a pretty key and core thing. And then, the state of the art, I think, the kind of end state that's great for companies, is that they can self-serve data. They can look at a stream of data, they can inspect it, they can pick the pieces they want, they can build their own filters and aggregations, and then they can self-serve it to whatever application they're building. Like that's the Holy Grail, I think. And that's why, you know, AthenaX and the other platforms were built, is so that, you know, the data engineering folks, typically there's a few of them, even in big companies, can keep up with the demand from the business in terms of building real time apps.
And I think that's a really nice goal for us from a design aspect for Eventador. And I think companies should be looking and thinking in that way as well, because there's only going to be more applications being built on streaming data going forward,
[00:30:23] Unknown:
and, you know, how are you gonna keep up? And so that's why support and the right platform, I think, matter a ton. You mentioned too that you're building on top of Kubernetes for being able to manage the actual infrastructure level. And I'm curious what you have found to be some of the useful elements of that platform and some of the challenges for being able to manage these stateful workloads on it. Right.
[00:30:47] Unknown:
So if you've used Kubernetes, you'll know that, I think, it's a love and a hate thing. Look, I'm not a Kubernetes expert, the team is, but what I experience, where I'm at, is I see a lot of, hey, Kubernetes makes X really easy, and Kubernetes makes Y really hard. And so systems are more opaque from a support standpoint. It is harder to figure out where something went wrong and rectify it. At the same time, when we need to scale a cluster, it makes it very easy. And if we have faults or problems, rescheduling jobs and delivering a seamless experience to the customer is easier. So it's a love-hate thing. I think Kubernetes is still relatively early. It's very powerful and makes a ton of sense for what we're doing, I think. But going forward, I think it'll get better and we'll frankly develop even richer processes
[00:31:41] Unknown:
for dealing with support and how we quickly identify broken things and fix them. And then going back to the SQL support, you mentioned that you support ANSI SQL using Calcite as the engine for being able to handle the parsing of that. I'm wondering what are some of the ways that you have found it necessary to extend or augment the SQL dialect for being able to support streaming workloads, and some of the different ways that user-defined functions are incorporated into these workloads?
[00:32:13] Unknown:
Sure. Yeah. That's a good question. So just, you know, if folks don't know, Apache Flink has a SQL API, and that's ultimately what we leverage, based on the things I said before. And the dialect it understands is Calcite, which is, for the most part, ANSI SQL. And Calcite's very broad, so Flink doesn't support 100% of Calcite, because some of that stuff is more batch related and doesn't really apply. But if you go to the Calcite page and you're wondering, you know, if a piece of SQL works or not, then Calcite's the right place to look that up. I'll say that Flink has been on a development journey, and with the acquisition of Ververica by Alibaba and the Blink planner that Alibaba has brought to the table, I think even more SQL capabilities and richer SQL capabilities are coming forward, and that's great. And, you know, look, it already does stuff that's above and beyond the dialect for just simply Calcite, which is things like MATCH_RECOGNIZE. And MATCH_RECOGNIZE is doing complex event processing in SQL in a single SQL statement. And it's really amazing. We use it a lot here. And so that's kind of an add-on from that standpoint. And then like you mentioned, UDFs. So the way we do UDFs is a little different. We kind of hearken back to our MongoDB days where JavaScript was the procedural language for the engine, and we have that here too. So the JavaScript actually gets compiled down to Java and runs in the JVM and is sent out to the machines, and you can actually define that in real time. So we have an editor, you can build a snippet of JavaScript. It can be something simple like converting numbers or, I don't know, looking up a particular value in a hash, or it can be farming out to a REST endpoint to enrich your data. And obviously that'd be slower, but we have all that kind of capability. So case logic and procedural logic and, you know, anything under the sun really that JavaScript supports, you can go ahead and build a UDF with that, and then you're off to the races. You know, it's not one of those things where you have to recompile the server or build new JARs or any of that stuff. You just define it and use it in your SQL, just like you did in your favorite database, whether it was MySQL or Postgres or any of these things. It works the same way. You create a function, and it's available to use right away, and you're off to the races. And we also use that JavaScript engine for our input transforms. You know, input transforms are cool because if your data is coming in and maybe you have arrays without keys, or maybe you've got messy data that needs to be normalized, or maybe some of the data is just bogus, like you just want to drop the data with a certain element and just not process it and only accept data with a certain schema or something like that, input transforms work great for that. That's also in JavaScript. So, you know, if you're using Eventador on a day-to-day basis, you're writing SQL statements in Calcite SQL.
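To give a flavor of the MATCH_RECOGNIZE feature mentioned a moment ago, below is a hedged sketch of a Flink SQL pattern query driven from Java, looking for a tiny "probe" charge followed quickly by a large one on the same card, a classic fraud-style pattern. The table, columns, and thresholds are invented, the "payments" table is assumed to be a Kafka-backed table with an event-time attribute, and this is standard Flink SQL rather than anything Eventador-specific.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class MatchRecognizeSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // Assume "payments" was declared elsewhere with card_id, amount, and an
        // event-time column ts (with a watermark), e.g. backed by a Kafka topic.
        tEnv.executeSql(
            "SELECT * FROM payments " +
            "MATCH_RECOGNIZE (" +
            "  PARTITION BY card_id " +          // evaluate the pattern per card
            "  ORDER BY ts " +                   // in event-time order
            "  MEASURES " +
            "    PROBE.amount AS probe_amount, " +
            "    BIG.amount   AS big_amount, " +
            "    BIG.ts       AS flagged_at " +
            "  ONE ROW PER MATCH " +
            "  AFTER MATCH SKIP PAST LAST ROW " +
            "  PATTERN (PROBE BIG) WITHIN INTERVAL '10' MINUTE " +
            "  DEFINE " +
            "    PROBE AS PROBE.amount < 2.00, " +
            "    BIG   AS BIG.amount   > 500.00" +
            ")").print();
    }
}
```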
You're augmenting that with JavaScript, you know, with input transforms and UDFs, and then you're creating RESTful endpoints that you're then pulling into your app. So we deploy an endpoint and you can use that in your application, and then, you know, you get JSON data returned right from that endpoint. So those are kind of the different pieces of where they fit in and how UDFs kind of sit in the middle. As far as the workflow of somebody building on top of Eventador, how does it integrate into the overall development
[00:35:28] Unknown:
cycle and the software development life cycle of being able to version and deploy the different queries or different user-defined functions that are powering people's applications, and making sure that any schema transformations that happen are deployed at the same time as the applications that are going to rely on those different forms of the data? Eventador has a notion of projects.
[00:35:51] Unknown:
In Eventador you can write, you know, Java or SQL. You can just write Flink jobs and deploy them. It doesn't have to be SQL. Although we find most customers, you know, enjoy using SQL because it's much easier and quicker to get the job done in most cases. In that case, we use projects, and we have a projects interface that's integrated with GitHub. You can plug that right into your CICD pipeline and, you know, integrate with the rest of your organization that way, teams and all of that. Now that's just for Java. SQL's coming. That's something we're working on right now. So the next thing you'll see, you know, coming out from us is that you'll be able to check in the various components of your SQL job, whether it be the SQL, the UDF, input transforms, all the things that make your job work properly, the configuration for the materialized view, all that stuff. And then that lives in your source repo, and then, you know, when we launch a job, you can just launch from the repo. So that makes productionalizing and, you know, having a source repository, versioning, and CICD real easy. And we do that today for Java, and then SQL's next. What have you found to be the tipping point where SQL isn't sufficient for being able to complete a particular task and somebody would want to drop down to writing the job specifically for Flink or being able to approach it in a more procedural fashion? We get asked that a lot, and there's a number of dimensions to the answer that I think are interesting. So the first one is that, look, if someone's a Flink shop and maybe they've got a couple Flink jobs running and now the business is relying on them and they're nervous about supporting them and how do I scale them, they can port those jobs over to Eventador. No problem. You know, we'll just run those right off the bat and bam, you're off to the races. So from a transition and a migration standpoint, it works really nicely to just keep using that Java going forward. So that's one thing. The other thing is, it just depends on who's on the team, and what the mix of engineers is, and the timeframes that they have to complete a project. So, we had one customer, it's kind of funny, Java guy, total dyed-in-the-wool Java guy, and he's like, yeah, I would like to write this in Java, but I'm going to write it in SQL. And we said, well, why are you going to write it in SQL if you like Java so much? And he said, because I'm the only one who knows Java here, and if I write it in Java, then I'm going to be on the hook all night long for support and on call and all that stuff. And I'd rather write it in SQL so the nine other people that work in my department can maintain it and change the job and mutate it over time if it needs to. And I thought that was a very mature decision. I thought, okay, he's not just thinking about himself, he's thinking about the team and what we can support going forward. And, you know, look, he's looking out for himself too a little bit. He didn't wanna be woken up in the middle of the night. So I thought it was funny. So, you know, I think we see a mix of both.
There are some libraries that, you know, you can pull in with Java, and maybe you're used to working in IntelliJ, that's your home. Then we support that too, and that's no big deal. But I think, you know, look, it's like a 90/10 thing. 90% of almost all workloads can be expressed in SQL relatively robustly. 10% of the time you might need a specific library or you have a specific, you know, workflow that you're used to or a code pattern that you're used to and you'd rather do it in Java or Scala. So I think that's kind of where the breakdown is. Maybe it's 80/20 or whatever depending on the organization. But for the most part, most things are expressed in SQL, and it's just because it's simpler in most cases. So I think that's kind of where we see it heading.
[00:38:59] Unknown:
And as developers are working on building out these streaming workloads and building applications on top of them and trying to schematize the input data, what are some of the sharp edges or design pitfalls or data modeling considerations
[00:39:14] Unknown:
that they should be aware of? Right. That is a huge topic in and of itself. Schema management and data quality is, honestly, a whole topic in and of itself. Some of the things that we've seen trip people up, though, are if people are pulling in data from multiple foreign sources, sometimes that's like a partner, sometimes that's maybe a foreign REST API where they're pulling in billing data or something. Other times it might be just departmental differences. Department A and department B have different data structures for essentially the same thing. So that can be very challenging. And I think a lot of companies, there's different philosophies out there. A lot of companies say, hey, we want you to adhere to this rigid standard on the schema and here's a document on how to do it, and here's a central schema repository, and kind of that approach. And we've seen that work and be successful. Another approach is that people say, well, let's not burden them with that. Otherwise, projects will stall, you know, waiting for schema approval or whatever. Let's just let them put the data into Kafka in the format that feels native to them, and then we'll build the tools that allow that data to be used effectively in the rest of the enterprise. And that's kind of where we fall. I think we're more powerful in that case. There's another company called Rockset out there. They automatically detect and handle schemas. We do that too. They're more batch oriented, although they're kind of, you know, seeing the writing on the wall, I suppose, too, and going more towards streaming. So that's great. So, you know, I think we see that kind of, you know, let the data just be the native format and then build the tools to handle it in a more robust way kind of design more often than not. And again, that's the one that we work better with. And that was the philosophy of MongoDB way back when, right? The schemaless design. And ultimately everything has a schema, you know, and the world doesn't exist without schema. But let's move it to the part where we absolutely need it and let that data be persisted and saved in Kafka and then used in an effective way with things like input transforms or UDFs. Or even just being able to use nested structures in SQL is another way to deal with that. So, I guess that's kind of my thoughts there. I know that's not exactly answering what you're saying, but I think that's the real struggle: where do we fall on the militant level of deciding that there's a fixed schema, or do we let kind of a
[00:41:29] Unknown:
loosely defined schema be the norm? Yeah. The universal answer to any technical question, it depends. Yeah. Right.
[00:41:38] Unknown:
That was only one: it depends. Maybe I'll have more. I don't know.
[00:41:43] Unknown:
And what are some of the most interesting or innovative or unexpected ways that you've seen your customers using your platform and building on top of these boundless event streams?
[00:41:52] Unknown:
We had this thing early on that I thought was super cool. Like I told you, we created input transforms and UDFs in JavaScript, and we saw folks starting to use them. And I remember we were on a support call and they were talking to each other about our platform, you know, with our vernacular, like it was normal, teaching each other our platform. And I thought, man, this is the greatest thing ever. I love it when we come up with an idea to solve their problem and they're using it to solve a problem, and then they're teaching each other to solve that problem with the toolset. This one customer I was thinking of, they really doubled down on the input transforms, hundreds of lines of JavaScript code to process incoming data, just because of the nature of that incoming data. And I think their use case is quite unique, not everybody has the same problem. But the fact that they were able to just use that out of the gate and kind of got it and then went with it, I was pretty proud and excited to see that happen. It was
[00:42:55] Unknown:
quite cool. And in terms of your experience of building and scaling the Eventador platform and just working in the data management space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process? I think streaming data is harder
[00:43:10] Unknown:
than we thought it would be overall. You know, I think, just like anything, it's kind of an iceberg in the sense that you start putting data into Kafka and you're like, okay, I get it. And maybe you create a consumer group and you've got a microservice and you're pulling data out and you're like, okay, this is cool. But that's the moment with streaming data. But to then go from that to my company runs and my production applications run on streaming data is a whole different thing entirely. It's sort of like in the MongoDB days, it's like, okay, it's super easy to put data into a Mongo database, but now I need to shard that database, and now I need to index those nested structures or whatever.
It starts to get complicated quick, and I think, you know, ultimately, when we're thinking about things like out-of-order data and backpressure, streaming data is a mindset shift from database technologies of old. It's extremely powerful in your applications, but it does require you to think differently about how it all works. And I think that's one of the things we have really worked hard on: to make it feel and work like a database, and to isolate folks from as much of the detail of how the sausage is made, so to speak, for streaming systems as possible. And that's all about using SQL and all about materialized views and those types of things. So, from a high level, that's kind of where I think it was harder than we thought. I also think that, and I said this at the start to some degree, I'm pleasantly surprised and excited to see more folks in the space. You know, most of the time, I guess you'd say, it would be bad to have a lot of competitors. But I think if I just take it from a personal career standpoint and from a craft standpoint, the thing we're building and the thing we care so much about and build every day, think about all night, I'm really excited to see a lot of other folks in the space. I mentioned Materialize earlier and obviously the Confluent folks with ksqlDB, Rockset I mentioned, and there's others.
Keen.io, and there's a few other ones that are great companies building really cool things, and it's nice to see this ecosystem evolving and customers being successful with these stacks. Whether it's us or someone else, I like to see that the community is getting it and growing, and it's really, really satisfying to see someone build something cool and, like, say, oh, that was kind of easy. And you're like, yeah, well, okay, I'm glad it was easy for you. We figured out some of the hard parts of that for you, and I'm glad we did. So I don't know. That's kind of where I suppose I land on that. And what do you have planned for the future of the Eventador platform? Sure. So, you know, we want to continue to make application developers, data scientists, data engineers successful with streaming data. The theme, I guess, I keep talking about is, this stuff is hard. There's a lot to it. And ultimately you just want to build that killer application, and your knowledge shouldn't have to expand all the way into the depths of all these different components.
Maybe your company already has Kafka and you just want to make sense of that data and just drop in and start building. And so those are the folks we want to continue to make successful with our product and build great things. And I think the world, especially now, has more data problems than ever. Data is now and will continue to be the most important thing that a company has beyond its people. And you have to make sense of that data, and you have to use it competitively or you're gonna die. And using streaming data to do that is the killer app. I think that's super exciting. So, you know, our path is to build more APIs, do things like automatic fault detection and automatic scaling, and use the clouds more effectively in terms of cost controls and scalability components. So we have feature sets in all those different areas. I think our REST API is very robust in terms of its capabilities around secondary keys and operators and things like that, but we want to continue to build on that too. We know that flexibility in the API and usability for developers is gonna be key.
[00:47:07] Unknown:
So that's one area we're gonna continue to invest in and build on. Are there any other aspects of the work that you're doing at Eventador or the overall space of streaming SQL and event streams and building applications on this real time data that we didn't discuss that you'd like to cover before we close out the show? No, you know, I think that's a pretty good
[00:47:26] Unknown:
overview of streaming stacks these days. I think, you know, the next year and a half or two years is gonna be really exciting for this field. I think that ultimately customers are going to understand that they should be building self-service platforms for this stuff, that their customers or developers or data scientists need to be able to have access to this data. I think data science in general is going to grow to understand that streaming data can be a big part. I think today, if you asked a data scientist, what does data look like? I don't think, necessarily, across the board, a huge segment of them would say, oh, it's streaming. I think that's going to change. And I think we'll understand that data in motion has more value than data at rest across the board. And we'll see some more exciting innovations coming from those teams of people. So I think that's exciting. But yeah, that's about it, I think.
[00:48:16] Unknown:
For anybody who wants to follow along with the work that you're doing or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Sure. You know, I think, ultimately,
[00:48:32] Unknown:
I think users of data in an enterprise don't know what data they have. They don't know how to use that data effectively in their apps or in their projects, whatever they're building. And I think data discovery and understanding the schemas, as we talked about a little bit, and having central repositories of this data easily available, not just in the streaming sense, I think across the board, has been a big challenge for companies. I think, as I said, you know, look, it's not just databases anymore. It's databases and batch and, you know, queuing and all sorts of different data systems that make up a robust information architecture these days. And I think that just means more sprawl and more confusion. And I think that's a big area that is still kind of uncharted and left unfixed, so to speak, or unaddressed to some degree. Well, thank you very much for taking the time today to join me and share your experience
[00:49:31] Unknown:
been great to see people like you out there trying to tackle it. So I appreciate all of your time and effort on that front, and I hope you enjoy the rest of your day. Thanks, Tobias. You too.
[00:49:44] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com
[00:49:50] Unknown:
to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Kenny Gorman and Eventador
Early Challenges in Data Management
Founding ObjectRocket and Eventador
Differences in Streaming Technologies
Use Cases for Streaming SQL
Architectural Choices and Challenges
Operational Complexities and Support
SQL and User Defined Functions
Development Workflow and Versioning
Data Modeling and Schema Management
Lessons Learned and Future Plans
Closing Thoughts and Future of Streaming Data