Summary
One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artyom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Artyom Keydunov and Pavel Tiunov about Cube.js a framework for building analytics APIs to power your applications and BI dashboards
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Cube is and the story behind it?
- What are the main use cases and platform architectures that you are focused on?
- Who are the target personas that will be using and managing Cube.js?
- The name comes from the concept of an OLAP cube. Can you discuss the applications of OLAP cubes and their role in the current state of the data ecosystem?
- How does the idea of an OLAP cube compare to the recent focus on a dedicated metrics layer?
- What are the pieces of a data platform that might be replaced by Cube.js?
- Can you describe the design and architecture of the Cube platform?
- How has the focus and target use case for the Cube platform evolved since you first started working on it?
- One of the perpetually hard problems in computer science is cache management. How have you approached that challenge in the pre-aggregation layer of the Cube framework?
- What is your overarching design philosophy for the API of the Cube system?
- Can you talk through the workflow of someone building a cube and querying it from a downstream system?
- What do the iteration cycles look like as you go from initial proof of concept to a more sophisticated usage of Cube.js?
- What are some of the data modeling steps that are needed in the source systems?
- The perennial problem of embedding SQL into another host language or DSL is how to deal with validation and developer tooling. What are the utilities that you and the community have built to reduce friction while writing the definitions of a cube?
- What are the methods available for maintaining visibility across all of the cubes defined within and across installations of Cube.js?
- What are the opportunities for composing multiple cubes together to form a higher level aggregation?
- What are the most interesting, innovative, or unexpected ways that you have seen Cube.js used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?
- When is Cube the wrong choice?
- What do you have planned for the future of Cube?
Contact Info
- Artyom
- Pavel
- @paveltiunov87 on Twitter
- paveltiunov on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Cube.js
- Statsbot
- chart.js
- Highcharts
- D3
- OLAP Cube
- dbt
- Superset
- Streamlit
- Parquet
- Hasura
- ksqlDB
- Materialize
- Meltano
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Artyom Keydunov and Pavel Tiunov about Cube.js, a framework for building analytics APIs to power your applications and BI dashboards. So, Artyom, can you start by introducing yourself? Hi, everyone. My name is Artyom Keydunov.
[00:02:11] Unknown:
I'm really excited to be here today.
[00:02:14] Unknown:
Together with Pavel, I started Cube.js as an open source project 3 years ago, and I'm happy to share more about it today. And, Pavel, how about yourself? Yeah. Sure. Hi, everyone. My name is Pavel, and I am cofounder and CTO. And, yeah, I started to work with Artyom back in the day in 2016, before we started Cube in 2019.
[00:02:37] Unknown:
Going back to you, Artem, do you remember how you first got involved in the area of data?
[00:02:40] Unknown:
As Pavel mentioned, I met him, and I got involved in the area, when we started to work together on Statsbot. Statsbot was a Slack application I created in 2016. The idea was to make analytical data from different places, such as Google Analytics, Salesforce, and Mixpanel, accessible as dashboards in Slack and later in other places such as Microsoft Teams. Statsbot was installed by thousands of users. And for large users, we started to see many scaling issues around building efficient pipelines, aggregations, and data modeling to eventually consume the data downstream in Slack and in Microsoft Teams.
With all of that complexity, we were helping our clients understand how to optimize their data pipelines. That's when I really got involved in the data management area.
[00:03:37] Unknown:
And, Pavel, how did you first get involved in data?
[00:03:39] Unknown:
I was working a lot, back in the day, in enterprise software, and specifically building BIs from scratch. So before I ever stepped into the startup ecosystem, I was building BIs from scratch for 10 years, mostly consulting jobs. So I had a pretty decent prior experience with data.
[00:03:59] Unknown:
And so that brings us now to the cube project that you're both working on and that you have started a business around. I'm wondering if you can just give a bit of an overview about what it is that you're building there, some of the story behind how it came about, and why this is an area that you feel is worth spending your time and energy on.
[00:04:16] Unknown:
So, yeah, you can think of Cube as a headless business intelligence. It has a JSON-based data modeling layer, similar to the data modeling layers that other BIs have. It has access control management to implement row-level and column-level security. And finally, it has a caching layer, similar to BigQuery's BI Engine. The main difference is that we are API-based and developer-centric. We don't do charts, hence we are headless. Cube started as a project to power Statsbot, the company Pavel and I started before. When we were working on Statsbot, we faced a problem: we needed to connect to many data sources, but also to provide a universal facade to all downstream data consumers.
In our case, those were Slack applications, Microsoft Teams, and BI dashboards. So we realized that we needed to have data modeling, security, and aggregation in a single unified layer upstream from the data consumers. That's how we built Cube initially, as part of Statsbot. In terms of the use cases
[00:05:32] Unknown:
and the workflows that it enables, what are some of the primary focuses that you are building around, and some of the existing challenges in the space that make Cube.js a potential solution for people who are building their own data systems?
[00:05:50] Unknown:
The major use case is embedded data analytics. Cube powers dashboard and reporting features inside customer-facing applications. Since we are headless, we provide only the API and some front-end SDKs downstream inside these customer-facing applications. Our users usually use charting libraries like Chart.js or Highcharts or D3 to display data consumed from Cube's API. I would say that for data sources, we are mostly focused on data warehouses and data lakes. That's where we see the most usage. We also see some usage with transactional databases.
Very often, especially recently, our users use transformation tools like dbt upstream, especially when they use it with the data warehouses. Cube itself is written in Rust and TypeScript, but since we distribute it as a Docker container, it really can fit in any architecture. We see companies building in Java, in Node.js, in Python, in Ruby, and they just run Cube as a Docker container inside their architecture. Maybe, Pavel, you have some insights into other architectures and use cases.
[00:07:06] Unknown:
Yeah. And there are some really, I would say, advanced use cases we see with Cube, like automation ones. It's quite unusual right now, but we're looking towards it. And right now, as we shipped our Cube SQL API, which allows you to connect different downstream tools which consume data, this automation becomes much more feasible. One of the interesting use cases we saw is basically some automation on top of IoT sensors, which query Cube on a scheduled basis to trigger some alerts. Yeah, a very interesting use case.
[00:07:46] Unknown:
In terms of the audience that you're building for, who are the sort of target users and personas that you think about as you initially built out the project? And now that it's been released as an open source framework and that you're turning it into a business, some of the ways that you think about the personas as you identify potential new features or improvements to add in?
[00:08:08] Unknown:
We started with the application developer as a persona. We were focused on application developers. But over time, we started to see more data engineers be involved in Cube projects within companies. I feel that we just see the bigger trend that data analysts are becoming more like data engineers. We see a lot of encouragement in the space for applying software engineering best practices to the data space. So I think that's the trend that affects Cube as well, and we see more and more data engineers using Cube lately.
[00:08:51] Unknown:
In terms of the naming, I know that it is inspired by the idea of an OLAP cube, which is a certain way of being able to model your data so that you can answer useful questions in a way that is maybe not as easy to get at or as performant when you're pulling it directly from a transactional system. I'm wondering if you can talk to some of the applications of OLAP cubes and their role in the current state of the data ecosystem and some of the ways that what you're doing in the Cube.js project makes them more accessible or more relevant in the modern data ecosystem.
[00:09:27] Unknown:
Right. Yeah. I'll probably just quickly comment on the name thing, and then Pavel can share his perspective on the OLAP cubes. That's true, the name comes from the OLAP cube, but it's more that when we started to build Cube at Statsbot, we didn't have a name for that project, because it was just the Statsbot engine or something. But we had this main object in the system, which we called a cube. And cubes in Cube, we should probably call them hypercubes or something, but cubes in Cube map to physical tables in a data warehouse, or to derived tables, and contain measures, dimensions, and join relationships, some sort of abstraction. And we just thought that cube may be a good name for that, because historically, people used to call things like that a cube, even when we are not one-to-one with OLAP cubes. It just felt like the closest thing to the idea of the cube.
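As a rough, hypothetical sketch of the mapping described above, a data model file might look something like this: a cube pointing at a physical table, with measures and dimensions defined over its columns. The `Orders` cube, table, and column names are illustrative, not from the episode.

```javascript
// schema/Orders.js — hypothetical cube mapped to a physical `orders` table.
cube('Orders', {
  // The SQL backing the cube: a table reference or any SELECT statement.
  sql: `SELECT * FROM public.orders`,

  measures: {
    count: {
      type: 'count',
    },
    totalAmount: {
      // An aggregation defined over a column of the underlying table.
      sql: `amount`,
      type: 'sum',
    },
  },

  dimensions: {
    status: {
      sql: `status`,
      type: 'string',
    },
    createdAt: {
      sql: `created_at`,
      type: 'time',
    },
  },
});
```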
[00:10:25] Unknown:
Yeah. And, also, when people hear about OLAP cubes, they think about something slow, materialized, and really outdated. But in fact, cubes were introduced as a math concept, basically coming from multidimensional analysis. And in fact, it means a multidimensional dataset with variable granularity, which can be provided to users as more granular or less granular data. At first, it was introduced to solve an optimization problem, but it turned out it's actually a very great framework to design and model data.
[00:11:04] Unknown:
In terms of the idea of an OLAP cube and what you're doing at Cube.js and the recent introduction of the idea of a dedicated metrics layer, I'm wondering if you can draw some comparisons across those 3 elements.
[00:11:19] Unknown:
Some people claim that the metrics layer is just a fancy name for OLAP cubes. I think, ideologically, they are close in trying to solve the same problem, to create a data modeling and abstraction layer. There are many tactical differences in how they approach the problem, though, mainly because they're from different generations. New tools and new technologies enable us right now to think about these problems differently. So that's why we have this new way of thinking about dedicated metrics layers.
[00:11:55] Unknown:
Yeah. And also this concept of cubes, again, it was introduced back in the day. And for our data community, it may not seem well used and popular, but it is actually used in most data tools right now, because the base parts of a cube you can find almost everywhere. It's measures and dimensions. In fact, it is a cube when there is a measure and a dimension.
[00:12:19] Unknown:
As far as the overall modern data stack, as people are starting to coin the phrase, where each different concern within the data platform is a separate tool or a different service, I'm wondering how you see Cube.js and the Cube Cloud platform that you're building out sitting within that ecosystem, and maybe some of the pieces that it either augments or potentially replaces.
[00:12:46] Unknown:
We've been thinking a lot about that, especially now when we recently started to see more data engineers use Cube. When we built Cube initially, we were not optimized for, say, any transformation upstream, for example. We built it in a way that people would be able to use it with the raw data, without any required transformation in the source system. People would be able to just define metrics and run pre-aggregations and sort of transform data. But over time, we started to see more dbt users, for example, and we thought that makes a lot of sense. I mean, that's a great tool to run transformations.
And it makes our lives much easier if people come to Cube already with transformed data. So that's why right now we really encourage users to use dbt to transform data upstream of Cube, and then use Cube to actually define metrics and access control and caching if needed. So I think we continue to do that, to see how we fit with the different tools. As Pavel mentioned, the Cube SQL API is a good example. We don't want to do charts. That's a hard line for us. We don't want to go into the visualization business, but there are a lot of great tools to do that.
And the BIs, obviously, always continue to evolve, and there are some good open source ones, like Metabase or Superset. So we want to integrate with all of them. In fact, we already integrate with Superset. And there are also tools like the data apps; Streamlit is a good example. So we want to be able to provide APIs and a sort of headless abstraction to tools like Streamlit so they can visualize data downstream. So we kind of play in the middleware here.
And that's why the question of how we fit into the ecosystem is really very important
[00:14:51] Unknown:
for you. In terms of the actual implementation of the platform and some of the design elements that go into how it factors into the data platform, I'm wondering if you can just talk to the architecture and implementation details and some of the ways that the design has evolved or the goals have changed as you went from when you started to where you are today?
[00:15:13] Unknown:
Cube has 3 main components: the JSON-based metrics framework, the APIs, and a caching layer. Developers and data engineers develop metrics using the Cube metrics framework. Then data consumers, such as BI tools or in-app analytics, query metrics through the API. It could be the REST API, the SQL API, or the GraphQL API. And finally, the caching layer can be used to cache the metrics calculations to speed up some of the API requests.
[00:15:48] Unknown:
And in terms of architecture, we decided to go with a very distributed architecture, like modern BI stacks. So Cube consists, first of all, of a horizontal layer of API instances, which can scale horizontally and handle hundreds or even thousands of queries per second, and they are aware of the caching of queries. So API instances can go directly to raw data or through the cache layer. And the cache layer is a distributed cache layer called Cube Store. It's basically tailored to work with data warehouses at scale. So it is designed to provide a fixed response time at a basically unlimited scale of data, and we are aiming at billions of rows per single table.
And we also have refresh workers, which basically populate the cache in the background. So that's the high-level architectural view.
[00:16:50] Unknown:
And so you mentioned that one of the core elements of the architecture is the caching layer that you're using to improve the overall performance for people who are using the API for either powering user applications or to accelerate the display of dashboards in the business intelligence layer. And as everyone who has worked with computers long enough knows, one of the hardest problems in computer science is cache invalidation and making sure that things are up to date and that you're not holding on to information too long, but that you're also making sure that you pull in the information that you need ahead of time. And so I'm curious how you have approached the overall challenge of making sure that you have the right data in the cache in that pre-aggregation layer so that people can answer the questions that they're looking to answer and then still being able to have a graceful degradation when they're starting to dig into information that isn't already available in that cache?
[00:17:45] Unknown:
Pre-aggregations are sort of a materialized tables layer. We aggregate data first in the source data warehouse, download it, reformat it for fast querying, and then insert it into our cache layer. And Cube does all the orchestration of that process. The way the refresh works is either time-based, where users define the interval to refresh, like every 2 hours or every day at 8 AM Pacific time, or it could be based on a condition in the source data warehouse. An example here would be to check for the max timestamp in a table, and if it changed, we rebuild the materialized table in the cache.
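A minimal sketch of those two refresh strategies on a pre-aggregation, reusing the hypothetical `Orders` cube from earlier; the intervals and SQL are illustrative, not a definitive configuration.

```javascript
// Hypothetical Orders cube showing both refresh strategies described above.
cube('Orders', {
  sql: `SELECT * FROM public.orders`,

  measures: {
    count: { type: 'count' },
  },

  dimensions: {
    createdAt: { sql: `created_at`, type: 'time' },
  },

  preAggregations: {
    // Time-based refresh: rebuild the rollup on a fixed interval.
    ordersByDay: {
      measures: [CUBE.count],
      timeDimension: CUBE.createdAt,
      granularity: 'day',
      refreshKey: { every: '2 hour' },
    },
    // Condition-based refresh: rebuild only when the max timestamp in the source table changes.
    ordersTotal: {
      measures: [CUBE.count],
      refreshKey: { sql: `SELECT MAX(created_at) FROM public.orders` },
    },
  },
});
```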
[00:18:32] Unknown:
Yeah. And from a technical perspective, under the hood, that's actually materialized and persisted as Parquet files. So it's columnar storage at scale. We use distributed file systems like S3 or GCS or MinIO to basically provide the storage. And there's a separation of storage and compute here: when these tables are queried, we have a partial in-memory cache of the tables on top, on multiple nodes, and Cube Store is basically designed to work on dozens and even hundreds of nodes to distribute the load horizontally.
So every query can be answered with a fixed response time; we are aiming for sub-second response times here and basically scale to really large datasets, like millions of rows.
[00:19:30] Unknown:
Given that you are materializing the information into the parquet files and distributed storage, how do you make sure that you are cleaning up those files after they've been invalidated so that you don't have them laying around and costing extra storage space and money or potentially polluting the cache with mismatched data because you're accidentally reading some older cache files when there's a newer cache file available and just the overall kind of management of that life cycle?
[00:20:00] Unknown:
This is why we have these refresh worker instances. They are basically the process which runs in the background and checks the freshness of all data pieces. It is quite usual that people will use partitions, and partition the data refresh, in order to minimize the cost of a refresh. So when you have, for example, time series data, you don't need to refresh the whole table over and over. You need to refresh just the delta, just the recent data, in case the other data is immutable for you. The refresh worker just marks partitions for garbage collection, and in the background there is a hot swap between older and newer partitions. Once new partitions arrive, 10 minutes after that the old partitions will be garbage collected. But for the user, it's basically transparent.
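A sketch of how that partitioned, incremental refresh might be expressed; the `Events` cube is hypothetical, and while the option names follow Cube's pre-aggregation settings, treat the details as illustrative.

```javascript
// Hypothetical Events cube: the rollup is split into monthly partitions so a
// refresh only rebuilds the recent delta, and superseded partitions are
// hot-swapped and garbage collected in the background.
cube('Events', {
  sql: `SELECT * FROM public.events`,

  measures: {
    count: { type: 'count' },
  },

  dimensions: {
    timestamp: { sql: `timestamp`, type: 'time' },
  },

  preAggregations: {
    eventsByDay: {
      measures: [CUBE.count],
      timeDimension: CUBE.timestamp,
      granularity: 'day',
      // Split the rollup into monthly partitions.
      partitionGranularity: 'month',
      refreshKey: {
        every: '1 hour',
        // Only re-check and rebuild the most recent partitions.
        incremental: true,
        updateWindow: '1 day',
      },
    },
  },
});
```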
[00:20:55] Unknown:
And in terms of the overall kind of design philosophy and user experience for people who are building on top of the Cube.js platform and for people who are designing the data aggregations that they want to expose, how do you think about the overall usage and design elements and developer experience for people who are integrating Cube into their data platforms?
[00:21:21] Unknown:
Overall, we try to advocate for applying software engineering best practices to the data space in general. I mean, version control, isolated environments, and we're making all the required primitives in Cube to make it very easy to follow best practices while working on Cube projects. We are developer-centric, and everything is code-based. That is, the definitions and configuration are code-based, and usually it's being stored in a version control system. In fact, our cloud product is fully based on the version control system. So even if you're not using any Git today, Cube Cloud will still store everything as if it is a Git project.
We also always try to stand on the shoulders of giants here in terms of the API design and rely on existing solutions and best practices in the ecosystem in general. An example here would be the way we manage access control in Cube. We rely very heavily on JSON Web Tokens and their ecosystem, because they seem to be the main standard for access control tokens in web applications nowadays.
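A minimal sketch of how a claim carried in a JSON Web Token might drive row-level security through Cube's configuration. The hook shown here is `queryRewrite` (named `queryTransformer` in older versions), and the `tenant_id` claim and `Orders.tenantId` member are hypothetical placeholders.

```javascript
// cube.js configuration file (sketch). Cube verifies the caller's JWT and exposes
// its claims as the security context, which can be used to rewrite every query.
module.exports = {
  queryRewrite: (query, { securityContext }) => {
    // Row-level security: force a filter on the tenant encoded in the token.
    if (!securityContext || !securityContext.tenant_id) {
      throw new Error('No tenant_id claim in the security context');
    }
    query.filters.push({
      member: 'Orders.tenantId',
      operator: 'equals',
      values: [securityContext.tenant_id],
    });
    return query;
  },
};
```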
[00:22:41] Unknown:
Because of the fact that you're sitting in this in-between layer, between the data storage and the data visualization or downstream consumers of the information, what are some of the points of tension or challenges that that introduces? You have to deal with such different audiences, both in the producer and the consumer layer, as well as differences in expertise: data engineers and data analysts are going to be very familiar with data modeling and how things live in the data warehouse, whereas business intelligence engineers or web developers who are consuming the API are going to have a much different set of expertise in terms of how they're accessing the data. How do you think about the collaboration across that boundary?
[00:23:29] Unknown:
It's kind of hard, because we try to make exactly this bridge between data people and application developers. They have different expertise on both ends. We have many conversations with data people where they mention that they don't have enough experience, skills, or even bandwidth to build something custom on the front end. That's why they want existing tools like BIs to work with Cube, or data apps like Streamlit. That's the whole reason why we released the SQL API, to be able to connect to these BIs.
On the other hand, front-end engineers who plan to use Cube in their projects to power dashboards sometimes have very limited knowledge and understanding of the data and data modeling. In that case, there are a lot of challenges in making the product help them, because you still have to write SQL. We don't try to replace SQL; it's basically all SQL-based. So there's no easy solution here other than to provide more content and best practices around that. I think what also helps is, as you mentioned before, the sort of explosion of the data stack right now, where every tool solves its own problem. And, again, I've mentioned dbt before, but I think dbt could be very helpful in that case too. When tables are already transformed upstream by the data team, then the application developer can just leverage Cube, plug into these transformed tables, and do just a one-to-one mapping of cubes to tables. They would not need to write all the custom SQL, and that would help to speed up the adoption and the development cycle of the cubes. So that's kind of why we decided to plug into more and more tools, to leverage them and speed up the development process.
[00:25:26] Unknown:
Yeah. And also, Cube actually helps bridge a gap between the data engineering teams and the front-end teams, because once you define your data model using cubes, there is an API layer for that which you can set in place and fix, so the front-end team can rely on it. But the data which goes into these cubes is very flexible. So the data engineering team is not tied to the data definitions, the transformation tools they use, or even the databases they use, so they can replace them. While the front-end team is fixed on the API layer and can be sure that it won't change.
[00:26:08] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. And as far as the workflow of actually building out a set of metrics and API endpoints for being able to consume in some of these downstream applications, whether it's the business intelligence layer or an embedded analytics use case, I'm wondering if you can talk about some of the considerations as far as data modeling in the source systems, some of the ways that you're approaching the cube definitions and how that maps to different API endpoints, some of the considerations that go into that mapping as you're building it out, and also the people involved at each of the stages of defining, designing, and implementing these endpoints and the metrics that they're pulling from.
[00:27:46] Unknown:
I feel the first step would be to connect to the data warehouse. Obviously, once we have that connection working, we would need to build at least one cube. We have auto-generation of cubes from tables in a data warehouse, so it's very easy to quickly bootstrap and generate some simple cubes with basic measures, like count, and dimensions just being mapped to columns. Overall, we can think about cubes as kind of an abstraction layer that usually maps to physical tables in a data warehouse, and we create all these measures and dimensions as metrics on top of cubes. Once we have at least a few cubes to play with, we can test them in what we call the Cube Playground, which is a developer sandbox tool to test metrics definitions.
And once we're satisfied with our metrics, we can connect to Cube from a downstream tool, for example, from Apache Superset. In that case, we would connect through the Cube SQL API. It's basically a MySQL connection. So from Superset, we would connect to Cube as if it were MySQL, specifying host, user, and password as usual. And now cubes will be treated as tables, and users will be able to use the Superset UI to click and select what they want to query and build charts and dashboards. They would also be able to craft the SQL by hand. The only caveat here is measures, because measures are already aggregated by Cube. It doesn't make sense to run aggregate functions on them again when querying through the Cube SQL API.
But for compatibility with downstream tools, Cube supports aggregate functions on measures, as long as they match the measure type. For example, if you have a count measure, you can only use a count aggregate on this measure; you cannot use a max or min, and Cube will return an error saying so. So, yeah, that would be an example of querying from the BI tool. For in-app analytics, it's usually either the REST API or the GraphQL API. Users specify the measures, dimensions, and filters they want to query in a request. Then Cube processes the request, uses the metrics definitions, generates SQL for the data warehouse, executes that SQL, and sends the data back. And finally, users will be able to use any charting libraries they want to visualize the data in the application.
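To make the in-app path concrete, a rough sketch of querying the REST API with Cube's JavaScript client; the API URL, token, and member names are placeholders.

```javascript
import cubejs from '@cubejs-client/core';

// Placeholder endpoint and token; in a real app the token would be a signed JWT.
const cubejsApi = cubejs('YOUR_API_TOKEN', {
  apiUrl: 'https://example.com/cubejs-api/v1',
});

async function loadOrders() {
  // Describe the measures, dimensions, and filters; Cube generates and runs the warehouse SQL.
  const resultSet = await cubejsApi.load({
    measures: ['Orders.count', 'Orders.totalAmount'],
    dimensions: ['Orders.status'],
    timeDimensions: [
      { dimension: 'Orders.createdAt', dateRange: 'last 30 days', granularity: 'day' },
    ],
    filters: [{ member: 'Orders.status', operator: 'equals', values: ['completed'] }],
  });

  // The result set can then be fed to any charting library (Chart.js, Highcharts, D3, ...).
  console.log(resultSet.tablePivot());
}
```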
[00:30:25] Unknown:
So once people decide that they want to use Cube.js and they've built out an initial proof of concept of here's a single metric and a single endpoint, so they've been able to prove out that they can connect to the source system, build a cube definition, and then access it, what do the iteration cycles look like going from that initial proof of concept to a more sophisticated and widely deployed usage of Cube.js, both in terms of the infrastructure management that goes into it and also the developer cycles of being able to maintain the existing definitions while adding new ones without breaking potential downstream consumers?
[00:31:06] Unknown:
There are mainly 3 areas of work here: data modeling, security or access control management, and cache configuration. Data modeling is the foundation of everything. Once we have our first metrics, we probably want to add more and more metrics. So the iteration cycle usually involves changes in the data model, creating and updating metrics and then testing them, then applying security rules if needed, and then, finally, caching if required. These areas can go very deep depending on the sophistication of the use case. If it's embedded analytics, it's usually very multi-tenancy driven: different users can have different databases, and different users may have different metrics. So there's going to be a lot of configuration around dynamically generating metrics, dynamically updating them, and connecting to different databases for different data. In a more BI-like use case, it would be more just data modeling questions, like how you structure your metrics.
So it can be very easily consumable by people downstream and by downstream tools. But overall, we try to follow best practices and just the general flow of developing software, because everything is code-based. The iteration looks like: you make changes in the code, you test them, and once you feel confident, you go into the staging environment and test them so as not to break production. For that, in the cloud, typically, we run isolated environments, which are Git-based. So you usually run your production environment from the master branch. And when you want to make a change, you create a feature branch. You do all the changes in the metrics framework in a different branch, then you deploy that branch in the same sort of configuration as you have on master, to test. You can run some end-to-end tests against the API. You can hook it up into your BI dashboard to test that everything works and you didn't break any old metrics.
And once it's ready, you can just merge to the master, and then we will deploy it to production. So, again, it's very developer-centric and everything is code-based, so that's why we try to follow the best software engineering practices.
[00:33:42] Unknown:
Again, from the infrastructure perspective, there is a lot about versioning of changes. When you just add new definitions of members, like measures or dimensions, it's simple; it's usually a no-op. But when you change definitions, it becomes much trickier, especially, for example, if you have a cache in place. So usually you want to deploy it as a blue-green deployment here. For example, you have a high-load production in place and you don't want to interrupt your users, and you basically want to replace one definition with another one. So what we have in Cube Cloud, for example, is basically a feature called pre-aggregations warm-up. It basically takes the new version of the deployment which should be deployed, and in the background it just warms up all the cache, all the new cache definitions you just defined.
And when it's done, it just switches the API version from one schema to another. So in that way, your users cannot even notice any difference and just start receiving the new numbers instantly.
[00:34:53] Unknown:
Another challenge of the sort of development of data applications like this is making sure that you are matching definitions and metadata, particularly across those boundaries that we talked about between the data producers and data management professionals and the application developers and business intelligence engineers? And especially because you're dealing at an abstraction layer of the SQL and the database layout and the warehouse, how are you working to ensure consistency for people who are building these cube definitions and the API consumers and ensuring that you have that correctness from the data warehouse through to the dashboard?
[00:35:37] Unknown:
We started to see that question many times in the community. I think, overall, the general space of observability with data lineage has started to grow, and it has very interesting tools that people use nowadays. We don't provide a lot out of the box right now to do that. We provide a meta endpoint, which users can use to introspect the schema and build their own metrics catalogs and test against. But our plan here would be to integrate with more data lineage and data observability tools so they can work with Cube, so end users can be sure that all of the data is in a correct state as it goes from the data warehouse through Cube to the final destinations.
And it's all correct, and it's all tested, and it's all predictable. So it will be an interesting journey for us to integrate with all of these tools. We're excited about it.
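For reference, that kind of introspection might look roughly like the following with the JavaScript client's `meta()` call; the endpoint and token are placeholders, and the exact response shape may differ by version.

```javascript
import cubejs from '@cubejs-client/core';

const cubejsApi = cubejs('YOUR_API_TOKEN', {
  apiUrl: 'https://example.com/cubejs-api/v1',
});

// Introspect the schema: list every cube with its measures and dimensions,
// the raw material a home-grown metrics catalog would be built from.
async function listMetrics() {
  const meta = await cubejsApi.meta();
  for (const cube of meta.cubes) {
    console.log(cube.name);
    console.log('  measures:', cube.measures.map((m) => m.name));
    console.log('  dimensions:', cube.dimensions.map((d) => d.name));
  }
}
```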
[00:36:39] Unknown:
As far as the kind of visibility aspect, too, I imagine that ties into your goal of integrating with some of these data lineage tools: being able to understand, as a consumer of these APIs, what all of the definitions are that are available for the different metrics or cubes that I want to consume from, and being able to see at a glance, this is the piece of information that I care about, so this is what I want to pull into either this business intelligence dashboard or this axis of the chart that I'm developing.
[00:37:11] Unknown:
I think that there are 2 things to it. One, obviously, can be solved by integrating with data observability and data lineage tools. But there is also a second part, which is more about the metrics catalog and which is kind of connected to the documentation piece of it, where people want to understand and learn what metrics they have and how they can query them, and maybe add some annotation layer to it. Our community shows us that they want it. We know some companies have already built something like that on top of Cube's introspection API.
We want to have it as part of our product eventually, some sort of a metrics catalog or data catalog. What we don't want to do is what data observability and data lineage tools do. So first, we'll probably try to understand where the boundary is and what makes sense for Cube to build in that area of visibility and metrics catalog. And then once we define the scope, we'll go and build those features in Cube.
[00:38:16] Unknown:
In terms of the composition of metrics, I'm wondering how you approach the ability of users to be able to say, I've got this cube that pertains to this particular attribute of my business, so maybe this is the way to identify unique customers, and then I've got another cube that is a way to identify the number of sales for a given period. But now I want to be able to compose them together to aggregate these sales by unique customer and then present that to the business intelligence dashboard. And just some of the aspects of being able to build some of these higher-level abstractions from more granular building blocks.
[00:38:56] Unknown:
That's something that people definitely want and need. Right now, what's possible already in Cube is that it's possible to join cubes. So if you have a sales cube and we have a salespeople cube, we can join these 2 cubes to attribute sales to a specific salesperson and to see who is the best performing salesperson right now. So it's possible to join cubes already, just as regular tables. What we're thinking of is mainly 2 things. One is that sometimes it makes sense to join cubes in a different context, just to create a different composition of the cubes, which will create some additional context. Like, we want to see these five cubes in the context of marketing and join them one way. And then we want to look at these cubes from a different context and join them a little bit differently and maybe add some more cubes here. So some kind of additional abstraction on top of this. That's one thing that we're thinking about right now. We don't have that abstraction layer in Cube, but that's something that seems needed by our users.
And the second area here is more like composite metrics. Measures and dimensions are fine, but they are very granular. What if you want to go up and say, what is our best performing acquisition channel? What is that? Is it a question? Is it a metric? How should it be composed of more granular objects? So that's something that we've been working on right now, thinking about what tools we can build in Cube to let users express those higher-level abstraction objects.
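Going back to the sales and salespeople example above, a rough sketch of what such a cube join might look like in the data schema; the cubes, tables, and columns are hypothetical. A query for `Sales.totalAmount` grouped by `Salespeople.name` would then answer the best-performing-salesperson question.

```javascript
// Hypothetical cubes illustrating a join used to attribute sales to salespeople.
cube('Salespeople', {
  sql: `SELECT * FROM public.salespeople`,

  dimensions: {
    id: { sql: `id`, type: 'number', primaryKey: true },
    name: { sql: `name`, type: 'string' },
  },
});

cube('Sales', {
  sql: `SELECT * FROM public.sales`,

  joins: {
    // Each sale belongs to one salesperson, so Sales measures can be
    // grouped by Salespeople dimensions.
    Salespeople: {
      relationship: 'belongsTo',
      sql: `${CUBE}.salesperson_id = ${Salespeople}.id`,
    },
  },

  measures: {
    totalAmount: { sql: `amount`, type: 'sum' },
  },

  dimensions: {
    id: { sql: `id`, type: 'number', primaryKey: true },
  },
});
```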
[00:40:38] Unknown:
The other interesting problem that you're taking on with the Cube project is the idea of how to embed the SQL dialect in a sort of host language or host data structure in a way that doesn't drive the end users insane. And so I'm wondering if you can talk to some of the ways that you think about how to manage these SQL statements and SQL fragments, and being able to build up the logic in a non-SQL dialect in a way that is approachable and maintainable, and just some of the utilities that you and the community have built to reduce the level of friction when trying to create these definitions.
[00:41:22] Unknown:
I think that's a real problem and something that we take seriously in trying to figure out what would be the best way to solve it. I mentioned the dbt integration before. I think that's one area that helps, because if you transform everything upstream already, what do you do in a cube? You mostly just point to that transformed table and say, my users cube will be built out of this users table in my data warehouse, which is already transformed by dbt. But you would still need to write some SQL for dimensions and measures, and you need to embed that SQL. One thing we're working on as part of the cloud is kind of a Cube IDE, where we help to write it and give all these useful tips to show what could be done here. I think there is some ongoing work in the community around types, to provide more plugins to the IDEs, like VS Code or something, so you can have autocompletion and some useful tips too.
Maybe, Pavel, you have heard something interesting in that area from the community
[00:42:32] Unknown:
too? Yes. Some other examples of it, I guess, from what we've seen in the community: a lot of interest in support for various dialects, for schema flavors. So we get requests for ES6 support, TypeScript support, and also we are looking towards supporting dbt as a main layer of metric definitions. That way, you can combine your definitions on the dbt layer, which is basically in YAML, with cube definitions in Cube itself, because they can be mixed, like a compound definition.
In that way, you'll have the metrics definitions in dbt, which are really tied to your data, and on the other hand, you'll have the definitions of the cache layer on the Cube side. So this is the type of stuff we see among the community as the demanded features, along with many more formats, like YAML itself, to use with Cube.
[00:43:38] Unknown:
And in terms of the kind of development and management and deployment of a Cube platform, obviously, there's the open source offering. But I'm wondering what you see as some of the challenges that end users face, or some of the points of friction that they encounter, when they're trying to get it set up for themselves, and some of the ways that you're hoping to alleviate that with the Cube Cloud platform that you've launched recently.
[00:44:01] Unknown:
We indeed launched our cloud platform recently to tackle some of the issues we found when we were working with open source users in the community. They are around developing and staging and even going into production. I think the main issues that we see are how to apply the best engineering practices to work with Cube projects. Like, what would be the best way to test changes in a cube reliably, to create isolated environments? What is the best way to trace issues and then debug slow queries, and understand why they're slow and what you can do to improve the performance of those queries? And also around the collaboration features and more visibility into the caching layer.
And probably the major part is infrastructure. Cube has a lot of moving parts as an infrastructure project. It has refresh workers to keep the cache warm. It has API instances. It needs to scale up and load balance the API requests, but also handle the memory footprint of the metrics definitions, because it can get quite big, especially for multi-tenant applications. And then, eventually, there is the caching layer itself, which is sort of a distributed query engine backed by a distributed file format. So there are a lot of moving parts and a lot of questions: how do we scale caching, how do we scale API instances? So our idea for Cube Cloud was: let's solve all these infrastructure issues so people can just run Cube.
They don't have to worry about scaling and provisioning and management, and we also provide them a lot of tools to follow the engineering best practices, like how to create a separate environment to test changes, how to debug issues, how to have more visibility into caching and queries and all of that. It's very new to us. Again, we just launched it, but I truly believe that Cube Cloud is the best way to develop and run Cube applications in production.
[00:46:09] Unknown:
In terms of the usage of the Cube platform and some of the ways that you've applied it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:46:19] Unknown:
There is one use case I would love to share. One large tech company is building an internal end-to-end data platform for data transformation, data modeling, pipelines, and data quality management. I cannot really share a lot about it, unfortunately, but they will open source it soon, so then I will be able to. But I'm very excited about it, because Cube is the core component of that system which is responsible for the metrics layer. And it's great to see how Cube can power a larger end-to-end data platform.
[00:46:57] Unknown:
In your own work on building the Cube project and starting the business around it and launching the platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:08] Unknown:
One of the trickiest things we actually bumped into in the early days is probably quite a simple problem that every data engineer has had, but it's really tricky to solve in a generic way. The problem is called either the row multiplication issue, or it's also sometimes called the fan-out issue. When you are joining some tables together and you're trying to calculate a metric on the 'one' side of a one-to-many join, this metric gets multiplied. And it turns out there are not a lot of tools in SQL to address this problem, because it was never designed to handle this analytical stuff. In Cube, you can see the really tricky SQL code that gets generated for this use case, but, yeah, it is generally solved in Cube.
[00:48:02] Unknown:
And so for people who are interested in being able to have a shared abstraction layer across their data infrastructure or their application databases, and they want to be able to access things either via SQL or APIs, or they just want to be able to build a data-powered experience for their end users, what are some of the cases where Cube is the wrong choice and they might be better suited with either a different metrics layer, building something in-house, or some of the other patterns that we've seen in the data ecosystem?
[00:48:32] Unknown:
I think if the organization is very small and they rely only on one BI and they don't need to use any other downstream tool. Maybe they use Looker right now, which has sort of a metrics framework in it already. Maybe that would be a better choice if everything is in there, they don't need to consume the data in other downstream BI tools or applications, and they're satisfied with the UI of Looker. That's probably where we don't really need to be used, and they can use that other metrics framework. The other use case, which may sound weird, is that many people try to use Cube for CRUD operations and ask us, hey, can I write data back with Cube and just create, read something, and update? Cube is not good for it, and we never will make it so. There is the OLTP workload and the OLAP workload, and Cube is really OLAP for the cloud, not OLTP.
And we like to pair with tools like Hasura to create these abstracted data access layers through GraphQL for all the CRUD-related things. Then Cube can really power the analytics part of it.
[00:49:46] Unknown:
And so as you continue to build the Cube.js framework and open source project, build the cloud offering, and grow the business, what are some of the things you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[00:50:00] Unknown:
I think, first, the core areas of Cube where we provide most of our value right now are metrics definitions, security, and caching. And I think even in this conversation, we touched on a lot of things already that are on our roadmap, like a data catalog or more abstractions on top of cubes to build these joined contexts. We definitely want to continue to improve the areas where we provide most of the value. But also, I feel that because we are middleware, we need to be very native in the ecosystem and integrate with as many tools as possible, both upstream and downstream. For upstream, it would be more data warehouses, more databases.
I'm particularly excited about the streaming space, and I know there are a lot of great companies doing interesting things in the SQL-over-streams space. ksqlDB has been around, obviously, for a long time now, but there are some companies, like Materialize and others, that I'm really excited about, and we'd like Cube to work on top of them. I think it will open a lot of new opportunities for real-time analytics, and we will be able to bring that upstream from the stream processor to all of the BIs and all of the embedded applications through Cube and make it real-time. That would be really, really great.
Also, integrations with tools like dbt would make a lot of sense for us upstream. And downstream, we mostly follow users here, in terms of what kinds of tools they want to consume Cube with. Downstream, BIs are probably the biggest item on our roadmap. Right now, we just want to make sure that every BI works great with Cube. And then we want to look more into reverse ETL tools, so users would be able to use the same metrics they defined in Cube, but run a reverse ETL process and get data through Cube into the CRM, for marketing automation. So, yeah, again, we'd love to have more and more integrations with the ecosystem.
[00:52:14] Unknown:
Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think there is an interesting space,
[00:52:32] Unknown:
which some people would call DataOps, or DevOps for data, where the main idea is how to apply software engineering best practices, version control, automated end-to-end tests, and isolated environments to data teams and workflows. It feels like there is a gap in tools to solve these problems. It's very, very early, but the overall direction just feels right. I know there are some great and smart teams, like Meltano, actively thinking about that problem. So I'm excited to see how this space develops.
[00:53:08] Unknown:
I think it will take some time for all of this; it's still very early in the data space itself, very early to figure out those integrations and interconnections. As we mentioned, a lot of these DevOps tools and all of those data observability tools are just getting ramped up, and it will take some time to figure out all the interconnections in the ecosystem here.
[00:53:33] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing on Cube.js and the Cube Cloud platform. It's definitely a very interesting project and one that tackles a very interesting problem space. So I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Thank you for having us today.
[00:53:57] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to CubeJS with Artyom Keydunov and Pavel Tiunov
Origins and Early Challenges of CubeJS
Primary Use Cases and Workflows of CubeJS
Target Users and Personas for CubeJS
OLAP Cubes and Metrics Layer in Modern Data Ecosystem
Architecture and Implementation of CubeJS
Caching Layer and Performance Optimization
Developer Experience and Best Practices
Building Metrics and API Endpoints
Managing Changes and Deployments
Ensuring Consistency and Correctness
Composing Metrics and Higher-Level Abstractions
Embedding SQL in Host Languages
Challenges and Solutions in Cube Cloud Platform
Interesting Use Cases and Applications
Lessons Learned and Unexpected Challenges
When CubeJS is Not the Right Choice
Future Plans and Roadmap for CubeJS
Biggest Gaps in Data Management Tooling