Summary
One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artyom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Artyom Keydunov and Pavel Tiunov about Cube.js a framework for building analytics APIs to power your applications and BI dashboards
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Cube is and the story behind it?
- What are the main use cases and platform architectures that you are focused on?
- Who are the target personas that will be using and managing Cube.js?
- The name comes from the concept of an OLAP cube. Can you discuss the applications of OLAP cubes and their role in the current state of the data ecosystem?
- How does the idea of an OLAP cube compare to the recent focus on a dedicated metrics layer?
- What are the pieces of a data platform that might be replaced by Cube.js?
- Can you describe the design and architecture of the Cube platform?
- How has the focus and target use case for the Cube platform evolved since you first started working on it?
- One of the perpetually hard problems in computer science is cache management. How have you approached that challenge in the pre-aggregation layer of the Cube framework?
- What is your overarching design philosophy for the API of the Cube system?
- Can you talk through the workflow of someone building a cube and querying it from a downstream system?
- What do the iteration cycles look like as you go from initial proof of concept to a more sophisticated usage of Cube.js?
- What are some of the data modeling steps that are needed in the source systems?
- The perennial problem of embedding SQL into another host language or DSL is how to deal with validation and developer tooling. What are the utilities that you and the community have built to reduce friction while writing the definitions of a cube?
- What are the methods available for maintaining visibility across all of the cubes defined within and across installations of Cube.js?
- What are the opportunities for composing multiple cubes together to form a higher level aggregation?
- What are the most interesting, innovative, or unexpected ways that you have seen Cube.js used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?
- When is Cube the wrong choice?
- What do you have planned for the future of Cube?
Contact Info
- Artyom
- Pavel
- @paveltiunov87 on Twitter
- paveltiunov on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Cube.js
- Statsbot
- chart.js
- Highcharts
- D3
- OLAP Cube
- dbt
- Superset
- Streamlit
- Parquet
- Hasura
- ksqlDB
- Materialize
- Meltano
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Artyom Keydunov and Pavel Tiunov about Cube.js, a framework for building analytics APIs to power your applications and BI dashboards. So, Artyom, can you start by introducing yourself? Hi, everyone. My name is Artyom Keydunov.
[00:02:11] Unknown:
I'm really excited to be here today.
[00:02:14] Unknown:
Together with Pavel, I started Cube.js as an open source project 3 years ago, and I'm happy to share more about it today. And, Pavel, how about yourself? Yeah. Sure. Hi, everyone. My name is Pavel, and I am cofounder and CTO. And, yeah, I started to work with Artyom back in the day in 2016, before we started Cube in 2019.
[00:02:37] Unknown:
Going back to you, Artem, do you remember how you first got involved in the area of data?
[00:02:40] Unknown:
As Pavel mentioned, I met him, and I got involved in the area, when we started to work together on Statsbot. Statsbot was a Slack application I created in 2016. The idea was to make analytical data from different places, such as Google Analytics, Salesforce, and Mixpanel, accessible as dashboards in Slack and later in other places such as Microsoft Teams. Statsbot was installed by thousands of users. And for large users, we started to see many scaling issues around building efficient pipelines, aggregations, and data modeling to eventually consume the data downstream in Slack and in Microsoft Teams.
With all of that complexity, we were helping our clients understand how to optimize their data pipelines. That's when I really got involved in the data management area.
[00:03:37] Unknown:
And, Pavel, how did you first get involved in data?
[00:03:39] Unknown:
I was working a lot, back in the day, in enterprise software, and specifically building BIs from scratch. So before I ever stepped into the startup ecosystem, I was building BIs from scratch for 10 years, mostly consulting jobs. So I had a pretty decent prior experience with data.
[00:03:59] Unknown:
And so that brings us now to the cube project that you're both working on and that you have started a business around. I'm wondering if you can just give a bit of an overview about what it is that you're building there, some of the story behind how it came about, and why this is an area that you feel is worth spending your time and energy on.
[00:04:16] Unknown:
So, yeah, you can think of Cube as a headless business intelligence. It has a JSON-based data modeling layer, similar to the data modeling layers that other BIs have. It has access control management to implement row-level and column-level security. And finally, it has a caching layer, similar to BigQuery's BI Engine. The main difference is that we are API-based and developer-centric. We don't do charts, hence we are headless. Cube started as a project to power Statsbot, the company Pavel and I started before. When we were working on Statsbot, we faced a problem: we needed to connect to many data sources, but also to provide a universal facade to all downstream data consumers.
In our case, those were Slack applications, Microsoft Teams, and BI dashboards. So we realized that we needed to have data modeling, security, and aggregation in a single unified layer upstream from the data consumers. That's how we built Cube initially, as part of Statsbot. In terms of the use cases
[00:05:32] Unknown:
and the workflows that it enables, what are some of the primary focuses that you are building around, and some of the existing challenges in the space that make Cube.js a potential solution for people who are building their own data systems?
[00:05:50] Unknown:
The major use case is embedded data analytics. Cube powers dashboard and reporting features inside customer-facing applications. Since we are headless, we provide only the API and some front-end SDKs downstream inside these customer-facing applications. Our users usually use charting libraries like Chart.js or Highcharts or D3 to display data consumed from Cube's API. I would say that for data sources, we are mostly focused on data warehouses and data lakes. That's where we see the most usage. We also see some usage with transactional databases.
Very often, especially recently, our users use transformation tools like dbt upstream, especially when they use it with the data warehouses. Cube itself is written in Rust and TypeScript, but since we distribute it as a Docker container, it really can fit in any architecture. We see companies building in Java, in Node.js, in Python, in Ruby, and they just run Cube as a Docker container inside their architecture. Maybe, Pavel, you have some insights into other architectures and use cases.
[00:07:06] Unknown:
Yeah. And there are some really, I would say, advanced use cases we see with Cube, like automation ones. It's quite unusual right now, but we're looking towards it. And right now, as we shipped our Cube SQL API, which allows you to connect different downstream tools which consume data, this automation becomes much more feasible. One of the interesting use cases we saw is basically some automation on top of IoT sensors, which query Cube on a scheduled basis to trigger some alerts. Yeah, a very interesting use case.
[00:07:46] Unknown:
In terms of the audience that you're building for, who are the sort of target users and personas that you think about as you initially built out the project? And now that it's been released as an open source framework and that you're turning it into a business, some of the ways that you think about the personas as you identify potential new features or improvements to add in?
[00:08:08] Unknown:
We started with the application developer as a persona. We were focused on application developers. But over time, we started to see more data engineers be involved in Cube projects within companies. I feel that we just see the bigger trend that data analysts are becoming more like data engineers. We see a lot of encouragement in the space for applying software engineering best practices to the data space. So I think that's the trend that affects Cube as well, and we see more and more data engineers using Cube lately.
[00:08:51] Unknown:
In terms of the naming, I know that it is inspired by the idea of an OLAP cube, which is a certain way of being able to model your data so that you can answer useful questions in a way that is maybe not as easy to get at or as performant when you're pulling it directly from a transactional system. I'm wondering if you can talk to some of the applications of OLAP cubes and their role in the current state of the data ecosystem and some of the ways that what you're doing in the Cube.js project makes them more accessible or more relevant in the modern data ecosystem.
[00:09:27] Unknown:
Right. Yeah. I'll probably just quickly comment on the name thing, and then Pavel can share his perspective on the OLAP cubes. That's true, the name comes from the OLAP cube, but it's more that when we started to build Cube at Statsbot, we didn't have a name for that project, because it was just the Statsbot engine or something. But we had this main object in the system, which we called a cube. And cubes in Cube, we should probably call them hypercubes or something, but cubes in Cube map to physical tables in a data warehouse, or to derived tables, and contain measures, dimensions, and join relationships, some sort of abstraction. And we just thought that cube may be a good name for that, because historically, people used to call things like that a cube, even when we are not one-to-one with OLAP cubes. It just felt like the closest thing to the idea of the cube.
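As a rough, hypothetical sketch of the mapping described above, a data model file might look something like this: a cube pointing at a physical table, with measures and dimensions defined over its columns. The `Orders` cube, table, and column names are illustrative, not from the episode.

```javascript
// schema/Orders.js — hypothetical cube mapped to a physical `orders` table.
cube('Orders', {
  // The SQL backing the cube: a table reference or any SELECT statement.
  sql: `SELECT * FROM public.orders`,

  measures: {
    count: {
      type: 'count',
    },
    totalAmount: {
      // An aggregation defined over a column of the underlying table.
      sql: `amount`,
      type: 'sum',
    },
  },

  dimensions: {
    status: {
      sql: `status`,
      type: 'string',
    },
    createdAt: {
      sql: `created_at`,
      type: 'time',
    },
  },
});
```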
[00:10:25] Unknown:
Yeah. And, also, when people hear about OLAP cubes, they think about something slow, materialized, and really outdated. But in fact, cubes were introduced as a math concept, basically coming from multidimensional analysis. And in fact, it means a multidimensional dataset with variable granularity, which can be provided to users as more granular or less granular data. At first, it was introduced to solve an optimization problem, but it turned out it's actually a very great framework to design and model data.
[00:11:04] Unknown:
In terms of the idea of an OLAP cube and what you're doing at Cube.js and the recent introduction of the idea of a dedicated metrics layer, I'm wondering if you can draw some comparisons across those 3 elements.
[00:11:19] Unknown:
Some people claim that the metrics layer is just a fancy name for OLAP cubes. I think, ideologically, they are close in trying to solve the same problem, to create a data modeling and abstraction layer. There are many tactical differences in how they approach the problem, though, mainly because they're from different generations. New tools and new technologies enable us right now to think about these problems differently. So that's why we have this new way of thinking about dedicated metrics layers.
[00:11:55] Unknown:
Yeah. And also this concept of cubes, again, it was introduced back in the day. And for our data community, it may not seem well used and popular, but it is actually used in most data tools right now, because the base parts of a cube you can find almost everywhere. It's measures and dimensions. In fact, it is a cube when there is a measure and a dimension.
[00:12:19] Unknown:
As far as the overall modern data stack, as people are starting to coin the phrase, where each different concern within the data platform is a separate tool or a different service, I'm wondering how you see Cube.js and the Cube Cloud platform that you're building out sitting within that ecosystem, and maybe some of the pieces that it either augments or potentially replaces.
[00:12:46] Unknown:
We've been thinking a lot about that, especially now when we recently started to see more data engineers use Cube. When we built Cube initially, we were not optimized for, say, any transformation upstream, for example. We built it in a way that people would be able to use it with the raw data, without any required transformation in the source system. People would be able to just define metrics and run pre-aggregations and sort of transform data. But over time, we started to see more dbt users, for example, and we thought that makes a lot of sense. I mean, that's a great tool to run transformations.
And it makes our lives much easier if people come to Cube already with transformed data. So that's why right now we really encourage users to use dbt to transform data upstream of Cube, and then use Cube to actually define metrics and access control and caching if needed. So I think we continue to do that, to see how we fit with the different tools. As Pavel mentioned, the Cube SQL API is a good example. We don't want to do charts. That's a hard line for us. We don't want to go into the visualization business, but there are a lot of great tools to do that.
And the BIs, obviously, always continue to evolve, and there are some good open source ones, like Metabase or Superset. So we want to integrate with all of them. In fact, we already integrate with Superset. And there are also tools like the data apps; Streamlit is a good example. So we want to be able to provide APIs and a sort of headless abstraction to tools like Streamlit so they can visualize data downstream. So we kind of play in the middleware here.
And that's why the question of how we fit into the ecosystem is really very important
[00:14:51] Unknown:
for you. In terms of the actual implementation of the platform and some of the design elements that go into how it factors into the data platform, I'm wondering if you can just talk to the architecture and implementation details and some of the ways that the design has evolved or the goals have changed as you went from when you started to where you are today?
[00:15:13] Unknown:
Cube has 3 main components: the JSON-based metrics framework, the APIs, and a caching layer. Developers and data engineers develop metrics using the Cube metrics framework. Then data consumers, such as BI tools or in-app analytics, query metrics through the API. It could be the REST API, the SQL API, or the GraphQL API. And finally, the caching layer can be used to cache the metrics calculations to speed up some of the API requests.
[00:15:48] Unknown:
And in terms of architecture, we decided to go with a very distributed architecture, like modern BI stacks. So Cube consists, first of all, of a horizontal layer of API instances, which can scale horizontally and handle hundreds or even thousands of queries per second, and they are aware of the caching of queries. So API instances can go directly to raw data or through the cache layer. And the cache layer is a distributed cache layer called Cube Store. It's basically tailored to work with data warehouses at scale. So it is designed to provide a fixed response time at a basically unlimited scale of data, and we are aiming at billions of rows per single table.
And we also have refresh workers, which basically populate the cache in the background. So that's the high-level architectural view.
[00:16:50] Unknown:
And so you mentioned that one of the core elements of the architecture is the caching layer that you're using to improve the overall performance for people who are using the API for either powering user applications or to accelerate the display of dashboards in the business intelligence layer. And as everyone who has worked with computers long enough knows, one of the hardest problems in computer science is cache invalidation and making sure that things are up to date and that you're not holding on to information too long, but that you're also making sure that you pull in the information that you need ahead of time. And so I'm curious how you have approached the overall challenge of making sure that you have the right data in the cache in that pre-aggregation layer so that people can answer the questions that they're looking to answer and then still being able to have a graceful degradation when they're starting to dig into information that isn't already available in that cache?
[00:17:45] Unknown:
Pre-aggregations are sort of a materialized tables layer. We aggregate data first in the source data warehouse, download it, reformat it for fast querying, and then insert it into our cache layer. And Cube does all the orchestration of that process. The way the refresh works is either time-based, where users define the interval to refresh, like every 2 hours or every day at 8 AM Pacific time, or it could be based on a condition in the source data warehouse. An example here would be to check for the max timestamp in a table, and if it changed, we rebuild the materialized table in the cache.
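A minimal sketch of those two refresh strategies on a pre-aggregation, reusing the hypothetical `Orders` cube from earlier; the intervals and SQL are illustrative, not a definitive configuration.

```javascript
// Hypothetical Orders cube showing both refresh strategies described above.
cube('Orders', {
  sql: `SELECT * FROM public.orders`,

  measures: {
    count: { type: 'count' },
  },

  dimensions: {
    createdAt: { sql: `created_at`, type: 'time' },
  },

  preAggregations: {
    // Time-based refresh: rebuild the rollup on a fixed interval.
    ordersByDay: {
      measures: [CUBE.count],
      timeDimension: CUBE.createdAt,
      granularity: 'day',
      refreshKey: { every: '2 hour' },
    },
    // Condition-based refresh: rebuild only when the max timestamp in the source table changes.
    ordersTotal: {
      measures: [CUBE.count],
      refreshKey: { sql: `SELECT MAX(created_at) FROM public.orders` },
    },
  },
});
```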
[00:18:32] Unknown:
Yeah. And from a technical perspective, under the hood, that's actually materialized and persisted as Parquet files. So it's columnar storage at scale. We use distributed file systems like S3 or GCS or MinIO to basically provide the storage. And there's a separation of storage and compute here: when these tables are queried, we have a partial in-memory cache of the tables on top, on multiple nodes, and Cube Store is basically designed to work on dozens and even hundreds of nodes to distribute the load horizontally.
So every query can be answered with a fixed response time; we are aiming for sub-second response times here and basically scale to really large datasets, like millions of rows.
[00:19:30] Unknown:
Given that you are materializing the information into the parquet files and distributed storage, how do you make sure that you are cleaning up those files after they've been invalidated so that you don't have them laying around and costing extra storage space and money or potentially polluting the cache with mismatched data because you're accidentally reading some older cache files when there's a newer cache file available and just the overall kind of management of that life cycle?
[00:20:00] Unknown:
This is why we have these refresh worker instances. They are basically the process which runs in the background and checks the freshness of all data pieces. It is quite usual that people will use partitions, and partition the data refresh, in order to minimize the cost of a refresh. So when you have, for example, time series data, you don't need to refresh the whole table over and over. You need to refresh just the delta, just the recent data, in case the other data is immutable for you. The refresh worker just marks partitions for garbage collection, and in the background there is a hot swap between older and newer partitions. Once new partitions arrive, 10 minutes after that the old partitions will be garbage collected. But for the user, it's basically transparent.
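A sketch of how that partitioned, incremental refresh might be expressed; the `Events` cube is hypothetical, and while the option names follow Cube's pre-aggregation settings, treat the details as illustrative.

```javascript
// Hypothetical Events cube: the rollup is split into monthly partitions so a
// refresh only rebuilds the recent delta, and superseded partitions are
// hot-swapped and garbage collected in the background.
cube('Events', {
  sql: `SELECT * FROM public.events`,

  measures: {
    count: { type: 'count' },
  },

  dimensions: {
    timestamp: { sql: `timestamp`, type: 'time' },
  },

  preAggregations: {
    eventsByDay: {
      measures: [CUBE.count],
      timeDimension: CUBE.timestamp,
      granularity: 'day',
      // Split the rollup into monthly partitions.
      partitionGranularity: 'month',
      refreshKey: {
        every: '1 hour',
        // Only re-check and rebuild the most recent partitions.
        incremental: true,
        updateWindow: '1 day',
      },
    },
  },
});
```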
[00:20:55] Unknown:
And in terms of the overall kind of design philosophy and user experience for people who are building on top of the Cube.js platform and for people who are designing the data aggregations that they want to expose, how do you think about the overall usage and design elements and developer experience for people who are integrating Cube into their data platforms?
[00:21:21] Unknown:
Overall, we try to advocate for applying software engineering best practices to the data space in general. I mean, version control, isolated environments, and we're making all the required primitives in Cube to make it very easy to follow best practices while working on Cube projects. We are developer-centric, and everything is code-based. That is, the definitions and configuration are code-based, and usually it's being stored in a version control system. In fact, our cloud product is fully based on the version control system. So even if you're not using any Git today, Cube Cloud will still store everything as if it is a Git project.
We also always try to stand on the shoulders of giants here in terms of the API design and rely on existing solutions and best practices in the ecosystem in general. An example here would be the way we manage access control in Cube. We rely very heavily on JSON Web Tokens and their ecosystem, because they seem to be the main standard for access control tokens in web applications nowadays.
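A minimal sketch of how a claim carried in a JSON Web Token might drive row-level security through Cube's configuration. The hook shown here is `queryRewrite` (named `queryTransformer` in older versions), and the `tenant_id` claim and `Orders.tenantId` member are hypothetical placeholders.

```javascript
// cube.js configuration file (sketch). Cube verifies the caller's JWT and exposes
// its claims as the security context, which can be used to rewrite every query.
module.exports = {
  queryRewrite: (query, { securityContext }) => {
    // Row-level security: force a filter on the tenant encoded in the token.
    if (!securityContext || !securityContext.tenant_id) {
      throw new Error('No tenant_id claim in the security context');
    }
    query.filters.push({
      member: 'Orders.tenantId',
      operator: 'equals',
      values: [securityContext.tenant_id],
    });
    return query;
  },
};
```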
[00:22:41] Unknown:
Because of the fact that you're sitting in this in-between layer, between the data storage and the data visualization or downstream consumers of the information, what are some of the points of tension or challenges that that introduces? You have to deal with such different audiences, both in the producer and the consumer layer, as well as differences in expertise: data engineers and data analysts are going to be very familiar with data modeling and how things live in the data warehouse, whereas business intelligence engineers or web developers who are consuming the API are going to have a much different set of expertise in terms of how they're accessing the data. How do you think about the collaboration across that boundary?
[00:23:29] Unknown:
It's kind of hard, because we try to make exactly this bridge between data people and application developers. They have different expertise on both ends. We have many conversations with data people where they mention that they don't have enough experience, skills, or even bandwidth to build something custom on the front end. That's why they want existing tools like BIs to work with Cube, or data apps like Streamlit. That's the whole reason why we released the SQL API, to be able to connect to these BIs.
On the other hand, front-end engineers who plan to use Cube in their projects to power dashboards sometimes have very limited knowledge and understanding of the data and data modeling. In that case, there are a lot of challenges in making the product help them, because you still have to write SQL. We don't try to replace SQL; it's basically all SQL-based. So there's no easy solution here other than to provide more content and best practices around that. I think what also helps is, as you mentioned before, the sort of explosion of the data stack right now, where every tool solves its own problem. And, again, I've mentioned dbt before, but I think dbt could be very helpful in that case too. When tables are already transformed upstream by the data team, then the application developer can just leverage Cube, plug into these transformed tables, and do just a one-to-one mapping of cubes to tables. They would not need to write all the custom SQL, and that would help to speed up the adoption and the development cycle of the cubes. So that's kind of why we decided to plug into more and more tools, to leverage them and speed up the development process.
[00:25:26] Unknown:
Yeah. And also, Cube actually helps bridge a gap between the data engineering teams and the front-end teams, because once you define your data model using cubes, there is an API layer for that which you can set in place and fix, so the front-end team can rely on it. But the data which goes into these cubes is very flexible. So the data engineering team is not tied to the data definitions, the transformation tools they use, or even the databases they use, so they can replace them. While the front-end team is fixed on the API layer and can be sure that it won't change.
[00:26:08] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
DataFold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. And as far as the workflow of actually building out a set of metrics and API endpoints for being able to consume in some of these downstream applications, whether it's the business intelligence layer or an embedded analytics use case, I'm wondering if you can talk about some of the considerations as far as data modeling in the source systems, some of the ways that you're approaching the cube definitions and how that maps to different API endpoints, some of the considerations that go into that mapping as you're building it out, and also the people involved at each of the stages of defining, designing, and implementing these endpoints and the metrics that they're pulling from.
[00:27:46] Unknown:
I feel the first step would be to connect to the data warehouse. Obviously, once we have that connection working, we would need to build at least one cube. We have auto-generation of cubes from tables in a data warehouse, so it's very easy to quickly bootstrap and generate some simple cubes with basic measures, like count, and dimensions just being mapped to columns. Overall, we can think about cubes as kind of an abstraction layer that usually maps to physical tables in a data warehouse, and we create all these measures and dimensions as metrics on top of cubes. Once we have at least a few cubes to play with, we can test them in what we call the Cube Playground, which is a developer sandbox tool to test metrics definitions.
And once we're satisfied with our metrics, we can connect to Cube from a downstream tool, for example, from Apache Superset. In that case, we would connect through the Cube SQL API. It's basically a MySQL connection. So from Superset, we would connect to Cube as if it were MySQL, specifying host, user, and password as usual. And now cubes will be treated as tables, and users will be able to use the Superset UI to click and select what they want to query and build charts and dashboards. They would also be able to craft the SQL by hand. The only caveat here is measures, because measures are already aggregated by Cube. It doesn't make sense to run aggregate functions on them again when querying through the Cube SQL API.
But for compatibility with downstream tools, Cube supports aggregate functions on measures, as long as they match the measure type. For example, if you have a count measure, you can only use a count aggregate on this measure; you cannot use a max or min, and Cube will return an error saying so. So, yeah, that would be an example of querying from the BI tool. For in-app analytics, it's usually either the REST API or the GraphQL API. Users specify the measures, dimensions, and filters they want to query in a request. Then Cube processes the request, uses the metrics definitions, generates SQL for the data warehouse, executes that SQL, and sends the data back. And finally, users will be able to use any charting libraries they want to visualize the data in the application.
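To make the in-app path concrete, a rough sketch of querying the REST API with Cube's JavaScript client; the API URL, token, and member names are placeholders.

```javascript
import cubejs from '@cubejs-client/core';

// Placeholder endpoint and token; in a real app the token would be a signed JWT.
const cubejsApi = cubejs('YOUR_API_TOKEN', {
  apiUrl: 'https://example.com/cubejs-api/v1',
});

async function loadOrders() {
  // Describe the measures, dimensions, and filters; Cube generates and runs the warehouse SQL.
  const resultSet = await cubejsApi.load({
    measures: ['Orders.count', 'Orders.totalAmount'],
    dimensions: ['Orders.status'],
    timeDimensions: [
      { dimension: 'Orders.createdAt', dateRange: 'last 30 days', granularity: 'day' },
    ],
    filters: [{ member: 'Orders.status', operator: 'equals', values: ['completed'] }],
  });

  // The result set can then be fed to any charting library (Chart.js, Highcharts, D3, ...).
  console.log(resultSet.tablePivot());
}
```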
[00:30:25] Unknown:
So once people decide that they want to use Cube.js and they've built out an initial proof of concept of here's a single metric and a single endpoint, so they've been able to prove out that they can connect to the source system, build a cube definition, and then access it, what do the iteration cycles look like going from that initial proof of concept to a more sophisticated and widely deployed usage of Cube.js, both in terms of the infrastructure management that goes into it and also the developer cycles of being able to maintain the existing definitions while adding new ones without breaking potential downstream consumers?
[00:31:06] Unknown:
There are mainly 3 areas of work here: data modeling, security or access control management, and cache configuration. Data modeling is the foundation of everything. Once we have our first metrics, we probably want to add more and more metrics. So the iteration cycle usually involves changes in the data model, creating and updating metrics and then testing them, then applying security rules if needed, and then, finally, caching if required. These areas can go very deep depending on the sophistication of the use case. If it's embedded analytics, it's usually very multi-tenancy driven: different users can have different databases, and different users may have different metrics. So there's going to be a lot of configuration around dynamically generating metrics, dynamically updating them, and connecting to different databases for different data. In a more BI-like use case, it would be more just data modeling questions, like how you structure your metrics.
So it can be very easily consumable by people downstream and by downstream tools. But overall, we try to follow best practices and just the general flow of developing software, because everything is code-based. The iteration looks like: you make changes in the code, you test them, and once you feel confident, you go into the staging environment and test them so as not to break production. For that, in the cloud, typically, we run isolated environments, which are Git-based. So you usually run your production environment from the master branch. And when you want to make a change, you create a feature branch. You do all the changes in the metrics framework in a different branch, then you deploy that branch in the same sort of configuration as you have on master, to test. You can run some end-to-end tests against the API. You can hook it up into your BI dashboard to test that everything works and you didn't break any old metrics.
And once it's ready, you can just merge to the master, and then we will deploy it to production. So, again, it's very developer-centric and everything is code-based, so that's why we try to follow the best software engineering practices.
[00:33:42] Unknown:
Again, from the infrastructure perspective, there is a lot about versioning of changes. When you just add new definitions of members, like measures or dimensions, it's simple; it's usually a no-op. But when you change definitions, it becomes much trickier, especially, for example, if you have a cache in place. So usually you want to deploy it as a blue-green deployment here. For example, you have a high-load production in place and you don't want to interrupt your users, and you basically want to replace one definition with another one. So what we have in Cube Cloud, for example, is basically a feature called pre-aggregations warm-up. It basically takes the new version of the deployment which should be deployed, and in the background it just warms up all the cache, all the new cache definitions you just defined.
And when it's done, it just switches the API version from one schema to another. So in that way, your users cannot even notice any difference and just start receiving the new numbers instantly.
[00:34:53] Unknown:
Another challenge of the sort of development of data applications like this is making sure that you are matching definitions and metadata, particularly across those boundaries that we talked about between the data producers and data management professionals and the application developers and business intelligence engineers? And especially because you're dealing at an abstraction layer of the SQL and the database layout and the warehouse, how are you working to ensure consistency for people who are building these cube definitions and the API consumers and ensuring that you have that correctness from the data warehouse through to the dashboard?
[00:35:37] Unknown:
We started to see that question many times in the community. I think, overall, the general space of observability with data lineage has started to grow, and it has very interesting tools that people use nowadays. We don't provide a lot out of the box right now to do that. We provide a meta endpoint, which users can use to introspect the schema and build their own metrics catalogs and test against. But our plan here would be to integrate with more data lineage and data observability tools so they can work with Cube, so end users can be sure that all of the data is in a correct state as it goes from the data warehouse through Cube to the final destinations.
And it's all correct, and it's all tested, and it's all predictable. So it will be an interesting journey for us to integrate with all of these tools. We're excited about it.
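For reference, that kind of introspection might look roughly like the following with the JavaScript client's `meta()` call; the endpoint and token are placeholders, and the exact response shape may differ by version.

```javascript
import cubejs from '@cubejs-client/core';

const cubejsApi = cubejs('YOUR_API_TOKEN', {
  apiUrl: 'https://example.com/cubejs-api/v1',
});

// Introspect the schema: list every cube with its measures and dimensions,
// the raw material a home-grown metrics catalog would be built from.
async function listMetrics() {
  const meta = await cubejsApi.meta();
  for (const cube of meta.cubes) {
    console.log(cube.name);
    console.log('  measures:', cube.measures.map((m) => m.name));
    console.log('  dimensions:', cube.dimensions.map((d) => d.name));
  }
}
```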
[00:36:39] Unknown:
As far as the kind of visibility aspect, too, I imagine that ties into your goal of integrating with some of these data lineage tools: being able to understand, as a consumer of these APIs, what all of the definitions are that are available for the different metrics or cubes that I want to consume from, and being able to see at a glance, this is the piece of information that I care about, so this is what I want to pull into either this business intelligence dashboard or this axis of the chart that I'm developing.
[00:37:11] Unknown:
I think that there are 2 things to it. One, obviously, can be solved by integrating with data observability and data lineage tools. But there is also a second part, which is more about the metrics catalog and which is kind of connected to the documentation piece of it, where people want to understand and learn what metrics they have and how they can query them, and maybe add some annotation layer to it. Our community shows us that they want it. We know some companies have already built something like that on top of Cube's introspection API.
We want to have it as part of our product eventually, some sort of a metrics catalog or data catalog. What we don't want to do is what data observability and data lineage tools do. So first, we'll probably try to understand where the boundary is and what makes sense for Cube to build in that area of visibility and metrics catalog. And then once we define the scope, we'll go and build those features in Cube.
[00:38:16] Unknown:
In terms of the composition of metrics, I'm wondering how you approach the ability of users to be able to say, I've got this cube that pertains to this particular attribute of my business, so maybe this is the way to identify unique customers, and then I've got another cube that is a way to identify the number of sales for a given period. But now I want to be able to compose them together to aggregate these sales by unique customer and then present that to the business intelligence dashboard. And just some of the aspects of being able to build some of these higher-level abstractions from more granular building blocks.
[00:38:56] Unknown:
That's something that people definitely want and need. Right now, what's possible already in Cube is that it's possible to join cubes. So if you have a sales cube and we have a salespeople cube, we can join these 2 cubes to attribute sales to a specific salesperson and to see who is the best performing salesperson right now. So it's possible to join cubes already, just as regular tables. What we're thinking of is mainly 2 things. One is that sometimes it makes sense to join cubes in a different context, just to create a different composition of the cubes, which will create some additional context. Like, we want to see these five cubes in the context of marketing and join them one way. And then we want to look at these cubes from a different context and join them a little bit differently and maybe add some more cubes here. So some kind of additional abstraction on top of this. That's one thing that we're thinking about right now. We don't have that abstraction layer in Cube, but that's something that seems needed by our users.
And the second area here is more like composite metrics. Measures and dimensions are fine, but they are very granular. What if you want to go up and say, what is our best performing acquisition channel? What is that? Is it a question? Is it a metric? How should it be composed of more granular objects? So that's something that we've been working on right now, thinking about what tools we can build in Cube to let users express those higher-level abstraction objects.
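Going back to the sales and salespeople example above, a rough sketch of what such a cube join might look like in the data schema; the cubes, tables, and columns are hypothetical. A query for `Sales.totalAmount` grouped by `Salespeople.name` would then answer the best-performing-salesperson question.

```javascript
// Hypothetical cubes illustrating a join used to attribute sales to salespeople.
cube('Salespeople', {
  sql: `SELECT * FROM public.salespeople`,

  dimensions: {
    id: { sql: `id`, type: 'number', primaryKey: true },
    name: { sql: `name`, type: 'string' },
  },
});

cube('Sales', {
  sql: `SELECT * FROM public.sales`,

  joins: {
    // Each sale belongs to one salesperson, so Sales measures can be
    // grouped by Salespeople dimensions.
    Salespeople: {
      relationship: 'belongsTo',
      sql: `${CUBE}.salesperson_id = ${Salespeople}.id`,
    },
  },

  measures: {
    totalAmount: { sql: `amount`, type: 'sum' },
  },

  dimensions: {
    id: { sql: `id`, type: 'number', primaryKey: true },
  },
});
```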
[00:40:38] Unknown:
The other interesting problem that you're taking on with the Cube project is the idea of how to embed the SQL dialect in a sort of host language or host data structure in a way that doesn't drive the end users insane. And so I'm wondering if you can talk to some of the ways that you think about how to manage these SQL statements and SQL fragments, and being able to build up the logic in a non-SQL dialect in a way that is approachable and maintainable, and just some of the utilities that you and the community have built to reduce the level of friction when trying to create these definitions.
[00:41:22] Unknown:
I think that's a real problem and something that we take seriously in trying to figure out what would be the best way to solve it. I mentioned the dbt integration before. I think that's one area that helps, because if you transform everything upstream already, what do you do in a cube? You mostly just point to that transformed table and say, my users cube will be built out of this users table in my data warehouse, which is already transformed by dbt. But you would still need to write some SQL for dimensions and measures, and you need to embed that SQL. One thing we're working on as part of the cloud is kind of a Cube IDE, where we help to write it and give all these useful tips to show what could be done here. I think there is some ongoing work in the community around types, to provide more plugins to the IDEs, like VS Code or something, so you can have autocompletion and some useful tips too.
Maybe, Pavel, you have heard something interesting in that area from the community
[00:42:32] Unknown:
too? Yes. Some other examples of it, I guess, from what we've seen in the community: a lot of interest in support for various dialects, for schema flavors. So we get requests for ES6 support, TypeScript support, and also we are looking towards supporting dbt as a main layer of metric definitions. That way, you can combine your definitions on the dbt layer, which is basically in YAML, with cube definitions in Cube itself, because they can be mixed, like a compound definition.
In that way, you'll have the metrics definitions in dbt, which are really tied to your data, and on the other hand, you'll have the definitions of the cache layer on the Cube side. So this is the type of stuff we see among the community as the demanded features, along with many more formats, like YAML itself, to use with Cube.
[00:43:38] Unknown:
And in terms of the kind of development and management and deployment of a Cube platform, obviously, there's the open source offering. But I'm wondering what you see as some of the challenges that end users face, or some of the points of friction that they encounter, when they're trying to get it set up for themselves, and some of the ways that you're hoping to alleviate that with the Cube Cloud platform that you've launched recently.
[00:44:01] Unknown:
We indeed launched our cloud platform recently to tackle some of the issues we found when we were working with open source users in the community. They are around developing and staging and even going into production. I think the main issues that we see are how to apply the best engineering practices to work with Cube projects. Like, what would be the best way to test changes in a cube reliably, to create isolated environments? What is the best way to trace issues and then debug slow queries, and understand why they're slow and what you can do to improve the performance of those queries? And also around the collaboration features and more visibility into the caching layer.
And probably the major part is infrastructure. Cube has a lot of moving parts as an infrastructure project. It has refresh workers to keep the cache warm. It has API instances. It needs to scale up and load balance the API requests, but also handle the memory footprint of the metrics definitions, because it can get quite big, especially for multi-tenant applications. And then, eventually, there is the caching layer itself, which is sort of a distributed query engine backed by a distributed file format. So there are a lot of moving parts and a lot of questions: how do we scale caching, how do we scale API instances? So our idea for Cube Cloud was: let's solve all these infrastructure issues so people can just run Cube.
They don't have to worry about scaling and provisioning and management, and we also provide them a lot of tools to follow the engineering best practices, like how to create a separate environment to test changes, how to debug issues, how to have more visibility into caching and queries and all of that. It's very new to us. Again, we just launched it, but I truly believe that Cube Cloud is the best way to develop and run Cube applications in production.
[00:46:09] Unknown:
In terms of the usage of the Cube platform and some of the ways that you've applied it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:46:19] Unknown:
There is one use case I would love to share. One large tech company is building an internal end-to-end data platform for data transformation, data modeling, pipelines, and data quality management. I cannot really share a lot about it, unfortunately, but they will open source it soon, so then I will be able to. But I'm very excited about it, because Cube is the core component of that system which is responsible for the metrics layer. And it's great to see how Cube can power a larger end-to-end data platform.
[00:46:57] Unknown:
In your own work on building the Cube project and starting the business around it and launching the platform, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:47:08] Unknown:
One of the trickiest things we actually bumped into in the early days is probably quite a simple problem that every data engineer has had, but it's really tricky to solve in a generic way. The problem is called either the row multiplication issue, or it's also sometimes called the fan-out issue. When you are joining some tables together and you're trying to calculate a metric on the 'one' side of a one-to-many join, this metric gets multiplied. And it turns out there are not a lot of tools in SQL to address this problem, because it was never designed to handle this analytical stuff. In Cube, you can see the really tricky SQL code that gets generated for this use case, but, yeah, it is generally solved in Cube.
[00:48:02] Unknown:
And so for people who are interested in being able to have a shared abstraction layer across their data infrastructure or their application databases, and they want to be able to access things either via SQL or APIs, or they just want to be able to build a data-powered experience for their end users, what are some of the cases where Cube is the wrong choice and they might be better suited with either a different metrics layer, building something in-house, or some of the other patterns that we've seen in the data ecosystem?
[00:48:32] Unknown:
I think if the organization is very small and they rely only on one BI and they don't need to use any other downstream tool. Maybe they use Looker right now, which has sort of a metrics framework in it already. Maybe that would be a better choice if everything is in there, they don't need to consume the data in other downstream BI tools or applications, and they're satisfied with the UI of Looker. That's probably where we don't really need to be used, and they can use that other metrics framework. The other use case, which may sound weird, is that many people try to use Cube for CRUD operations and ask us, hey, can I write data back with Cube and just create, read something, and update? Cube is not good for it, and we never will make it so. There is the OLTP workload and the OLAP workload, and Cube is really OLAP for the cloud, not OLTP.
And we like to pair with tools like Hasura to create these abstracted data access layers through GraphQL for all the CRUD-related things. Then Cube can really power the analytics part of it.
[00:49:46] Unknown:
And so as you continue to build the Cube.js framework and open source project, build the cloud offering, and grow the business, what are some of the things you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[00:50:00] Unknown:
I think, first, the core areas of Cube where we provide most of our value right now are metrics definitions, security, and caching. And I think even in this conversation, we touched on a lot of things already that are on our roadmap, like a data catalog or more abstractions on top of cubes to build these joined contexts. We definitely want to continue to improve the areas where we provide most of the value. But also, I feel that because we are middleware, we need to be very native in the ecosystem and integrate with as many tools as possible, both upstream and downstream. For upstream, it would be more data warehouses, more databases.
I'm particularly excited about the streaming space, and I know there are a lot of great companies doing interesting things in the SQL-over-streams space. ksqlDB has been around, obviously, for a long time now, but there are some companies, like Materialize and others, that I'm really excited about, and we'd like Cube to work on top of them. I think it will open a lot of new opportunities for real-time analytics, and we will be able to bring that upstream from the stream processor to all of the BIs and all of the embedded applications through Cube and make it real-time. That would be really, really great.
Also, integrations with tools like dbt would make a lot of sense for us upstream. And downstream, we mostly follow users here, in terms of what kinds of tools they want to consume Cube with. Downstream, BIs are probably the biggest item on our roadmap. Right now, we just want to make sure that every BI works great with Cube. And then we want to look more into reverse ETL tools, so users would be able to use the same metrics they defined in Cube, but run a reverse ETL process and get data through Cube into the CRM, for marketing automation. So, yeah, again, we'd love to have more and more integrations with the ecosystem.
[00:52:14] Unknown:
Well, for anybody who wants to get in touch with either of you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think there is an interesting space,
[00:52:32] Unknown:
which some people would call DataOps, or DevOps for data, where the main idea is how to apply software engineering best practices, version control, automated end-to-end tests, and isolated environments to data teams and workflows. It feels like there is a gap in tools to solve these problems. It's very, very early, but the overall direction just feels right. I know there are some great and smart teams, like Meltano, actively thinking about that problem. So I'm excited to see how this space develops.
[00:53:08] Unknown:
I think it will take some time for all of this; it's still very early in the data space itself, very early to figure out those integrations and interconnections. As we mentioned, a lot of these DevOps tools and all of those data observability tools are just getting ramped up, and it will take some time to figure out all the interconnections in the ecosystem here.
[00:53:33] Unknown:
Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing on Cube.js and the Cube Cloud platform. It's definitely a very interesting project and one that tackles a very interesting problem space. So I appreciate all the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. Thank you for having us today.
[00:53:57] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to CubeJS with Artyom Keydunov and Pavel Tiunov
Origins and Early Challenges of CubeJS
Primary Use Cases and Workflows of CubeJS
Target Users and Personas for CubeJS
OLAP Cubes and Metrics Layer in Modern Data Ecosystem
Architecture and Implementation of CubeJS
Caching Layer and Performance Optimization
Developer Experience and Best Practices
Building Metrics and API Endpoints
Managing Changes and Deployments
Ensuring Consistency and Correctness
Composing Metrics and Higher-Level Abstractions
Embedding SQL in Host Languages
Challenges and Solutions in Cube Cloud Platform
Interesting Use Cases and Applications
Lessons Learned and Unexpected Challenges
When CubeJS is Not the Right Choice
Future Plans and Roadmap for CubeJS
Biggest Gaps in Data Management Tooling