Summary
Maintaining a single source of truth for your data is one of the biggest challenges in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. To enable those use cases while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the technical elements of what it means to have a "semantic layer"?
- In the past couple of years there was a rapid hype cycle around the "metrics layer" and "headless BI", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts?
- What are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.)
- At what point does it become necessary/beneficial for a team to adopt such a service?
- What are the challenges involved in retrofitting a semantic layer into a production data system?
- evolution of requirements/usage patterns
- technical complexities/performance and cost optimization
- What are the most interesting, innovative, or unexpected ways that you have seen Cube used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?
- When is Cube/a semantic layer the wrong choice?
- What do you have planned for the future of Cube?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Cube
- Semantic Layer
- Business Objects
- Tableau
- Looker
- Mode
- Thoughtspot
- LightDash
- Embedded Analytics
- Dimensional Modeling
- Clickhouse
- Druid
- BigQuery
- Starburst
- Pinot
- Snowflake
- Arrow Datafusion
- Metabase
- Superset
- Alation
- Collibra
- Atlan
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting https://get.datafold.com/replication-de-podcast.
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development life cycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'd like to welcome back Artyom Keydunov to talk about the role of the semantic layer in your data platform. So, Artyom, can you start by introducing yourself?
[00:01:49] Unknown:
Yeah. Thank you for having me today. My name is Artyom. I'm the co-founder and CEO at Cube. I started Cube as an open source project in 2019, and then with my co-founder in 2020, we built a company around it to keep developing the open source project, but also to introduce a commercial cloud version of Cube, which is called Cube Cloud.
[00:02:13] Unknown:
And do you remember how you first got started working in data?
[00:02:16] Unknown:
I was just a software engineer, and I think, you know, as a software engineer, you always work with data. At some point I started to lead a project, and the project involved collecting a lot of data. We were building an interesting product: the software was deployed in schools, and kids were using the software on a daily basis. The idea was to train the software to improve, so it could learn from the kids based on their actions and adapt the exercises and challenges that were presented to them based on their previous input. To accomplish that at scale, across all of these schools, we had to build a big data pipeline. I think that was my first real data project.
[00:03:06] Unknown:
As you mentioned, you started the Cube project a few years ago. That was at the early stages of what some people called the metrics layer, other people called the semantic layer, and other people would call headless BI. I'm wondering if you can start, for folks who aren't familiar with all of these terms and the ways that they fit into the overall scope of data platforms and data management, with what the technical elements are that constitute a semantic layer.
[00:03:33] Unknown:
Right. I think if we think about the semantic layer as just a concept, an idea, it's been around for a long time. If you look back at BusinessObjects and MicroStrategy, all the older generation of BI tools, they all had a semantic layer as part of them. Essentially, what a semantic layer is is a way for the BI to translate high level, analytical style queries into relational or tabular queries. Right? When we move around in the UI of the BI, when we drag, like, active users with a breakdown by state or country, the BI would generate the correct SQL and execute that SQL against a specific warehouse. So that's where it's usually been considered and called a semantic layer. In fact, BusinessObjects even had a patent for that for, like, 10 years, and MicroStrategy successfully defended against it. So the semantic layer as part of the BI has always been around. But then, probably by 2017 and 2018, a lot of people started to talk about the standalone metrics layer or standalone semantic layer. And the need for this mostly arose from the fact that we reached a point where we had so many different BI tools and data consumption tools that it was unclear what was the source of truth.
And I think it happened partly because a lot of people wanted to democratize access to data, you know, bring in more specialized data visualization and data consumption tools. But the cloud also contributed a lot to that, because it became easier to buy and use software than it was, like, 20 years ago. So a newer generation of tools like Tableau and Looker started it all, but then we got Mode, we got ThoughtSpot, we got all of these tools, some of them cloud native, that a lot of organizations started to adopt. And in the later generation you got Sigma, you got Lightdash, and all of these tools. So the number is only growing. And at some point, data teams started to think, okay.
We have so many different tools and places to define metrics, to define semantics, so what should be the source of truth? And that's where a lot of organizations, and the data community in general, started to talk about headless BI, the metrics layer, the semantic layer. There were a few different names for it. I feel like the metrics store specifically came into the picture because a few organizations, Lyft and Airbnb, if I recall correctly, built software internally that they called a metrics store, which was some kind of a semantic layer. And that's why a lot of people started to use the term metrics store. But I think that idea never really came to the market as a general purpose solution. So that's why the metrics store idea started to fade out, and it all kind of got replaced with a more universal semantic layer. That's what people are talking about.
[00:06:38] Unknown:
And as you mentioned, the semantic layer is the point in the overall data system where you convert from raw information into some sort of contextualized business domain objects. And to your point, for the most part, throughout the earlier history of data systems, that was a dedicated component of the business intelligence system because it was monolithic. That was the canonical reference for anybody who wanted to go and explore the data. Now, with this disaggregation of the overall data stack, there is a larger number of consumers. Not everybody is using the same tool chain. Not everybody has the same frame of reference or the same context, and I'm curious how that has complicated the question of even being able to define what the appropriate granularity and scope of those business domain objects are, even beyond just the technical elements of being able to represent them in a system that can be viewed as the source of truth.
[00:07:42] Unknown:
Right. Yeah. The more data we have, the more data consumers we have, and the more need to consume data, that obviously increases the scale of complexity around data modeling in general. And I think there are a few ideas, even orthogonal to the semantic layer, that have arisen in recent years to address some of these issues. A data mesh is one of those examples. Right? Like, how do we hold the knowledge of specific data modeling and metrics modeling aspects inside domain teams so they can model their metrics and their definitions, but at the same time keep centralized governance?
I think if we zoom out here and think about what the problem is in general, it's always a balance between well governed data and flexibility at the end of the analysis. In some world, we could theoretically model every possible measure and dimension, with every possible granularity, and present that to the end data consumer. And that would be ideal. The thing is that it's just not possible to do. Right? There is always going to be something missing: some granularity is missing, measures are missing, dimensions are missing. So there's always going to be some data modeling at the edge where an analyst or a business consumer working with the data would need to do some last mile transformation, maybe to look at the data differently.
I think that for the data teams and for the vendors who are involved in that process, the idea is how do we help to keep the balance between governance on one end and flexibility
[00:09:30] Unknown:
of the last mile analysis on the other end. Before we dig too much into more of the technical elements, as you said, there was this flash in the pan moment of the metrics layer being touted as this separate layer of the stack, that it needed to be a dedicated system. And I'm curious if you can give your assessment of the current state of the ecosystem, the state of the market, as to when and whether the dedicated technical layer is merited, as opposed to it being a feature of one of the other components of a data platform.
[00:10:09] Unknown:
Yeah. I think around 2018 to 2020, that's where we had the recent wave of ideas around headless BI, the metrics store, the metrics layer. As I mentioned, I think the big driver was this explosion of different data consumption and data visualization tools on one hand, and, on the other, the fact that a few big tech companies developed internal solutions and found some success with those internal solutions. That catalyzed the idea of the standalone, dedicated metrics store, and a few companies were started to address this problem. I think, overall, if you look five years or so later at the market, many of those companies are not around anymore, or they got acquired, or they pivoted and changed direction. So Cube is one of the few companies that is still doing the semantic layer. And the general idea moved a little bit from metrics store or metrics layer into the semantic layer specifically. And it's not only the change in the name, but more an understanding of the place of that layer, the place of that technology, rather than being more like a catalog of the metrics.
It is now being considered more a place to do the multidimensional data modeling, sort of universally, for all of these tools. I think the state of today, as I see it, is that the place of the semantic layer is on top of the data warehouse. It's a fully virtual layer that doesn't hold any data. It lets data engineers develop measures, metrics, aggregates, and dimensions to some extent, although much of the dimensional modeling can be done upstream in a transformation tool. So it's mostly measures, metrics, and relationships, and then you expose these models to all the data consumption tools.
And by data consumption tools, we mean not only BI, but also applications, so data apps. Right? That's embedded analytics and a whole variety of standalone internal and external apps. I think that currently, when we talk with buyers and talk to the market in general, the expectation from a semantic layer is a data modeling framework that supports a wide range of metrics modeling specifically, and relationships with joins. It would be APIs that can support BI connectivity but also app connectivity, some sort of a caching layer, and some sort of a security layer.
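To make that data modeling framework more concrete, here is a minimal sketch of what such a model can look like in Cube-style YAML, with measures, dimensions, and a join relationship. The `orders`/`users` entities and the column names are illustrative examples for this write-up, not something discussed in the episode.

```yaml
cubes:
  - name: orders
    sql_table: public.orders            # a table or staging model in the warehouse

    measures:
      - name: count
        type: count
      - name: total_revenue
        sql: amount
        type: sum

    dimensions:
      - name: status
        sql: status
        type: string
      - name: created_at
        sql: created_at
        type: time

    joins:
      - name: users
        sql: "{CUBE}.user_id = {users}.id"
        relationship: many_to_one
```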
[00:12:47] Unknown:
And so, as you mentioned, with the semantic layer or the headless BI, you can use that as an additional modeling tool to be able to say what it means for a purchase to be completed, or how to calculate revenue, all of these business layer semantic objects that you want to be able to ask questions about. But, also, data warehousing as an overall practice was originally designed to be able to address some of those questions. And I'm curious how you have seen the idea of this dedicated layer for building these more conceptual domain objects off of the core factual information that you have in the data warehouse, how that impacts the way that teams think about the data modeling approach, data delivery, who is responsible for which elements of that technical versus semantic modeling, etcetera?
[00:13:49] Unknown:
Right. Great question. I think, from an organization perspective, it's usually data engineers who work on both the transformation and the semantic modeling. It really depends on whether the organization has some sort of data mesh where different domain groups contribute to specific areas. Cube is code first, as are some other tools in the modern data stack. So I feel like with code first, you always get the benefit of collaboration, because adding a new measure is essentially just a pull request. That can enable a sort of mesh architecture, if you like, where a marketing embedded team that wants to contribute to marketing metrics specifically can do that, and then the central governance team is going to review it. But at the end of the day, it's data engineers, regardless of how they structure it within the organization.
The other part of your question, if I understand it correctly, was what goes into transformation versus what goes into the semantic modeling, which is always a big question when an organization starts to adopt semantic modeling. That's one of the first questions: should we model that right in the warehouse and materialize it in the warehouse, versus should we do it in the semantic model? What we recommend, the usual setup, is to do dimensional modeling in the warehouse and to keep your models normalized, say, in a dbt oriented way, what they would call staging models. So you get to a point where you build your staging models and they look more like normalized entities. And then the semantic layer would be the denormalization point. That's where you bring your normalized models into the semantic layer, you start describing measures, you start describing relationships.
And then, based on these relationships, you start to build your multidimensional data marts, which are denormalized. They can potentially contain multiple measures and multiple dimensions, all packaged together as a single data asset. That's like a data mart, which is multidimensional. It can look like a table, right, but it's essentially multidimensional. And then you expose them to different data consumption tools. You can design them differently: they can have one measure and then a set of dimensions that can be used with that measure, or you can have multiple measures and multiple dimensions inside one data mart. I see both ways, and they're equally good. But the high level idea is that you do dimensional modeling and normalization at the data warehouse level, and then denormalization and measures in the semantic model.
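Continuing the illustrative model sketched earlier, the denormalization point he describes corresponds to a view in Cube-style YAML: it packages measures and dimensions from one or more cubes into a single multidimensional data mart exposed to consumers. The name and the exact `join_path`/`includes` layout below are a rough sketch under that assumption, not a definitive schema.

```yaml
views:
  - name: orders_overview           # the outward-facing data mart
    cubes:
      - join_path: orders           # pull members from the illustrative orders cube
        includes:
          - count
          - total_revenue
          - status
          - created_at
```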
[00:16:35] Unknown:
For teams who have already been building out a data platform, they have their warehouse, they've already done their dimensional modeling, they've got a business intelligence system where they've already done some of that semantic modeling, or maybe they've built those mart layers in the data warehouse and they're just doing a one-to-one representation in their BI. What does the adoption and migration process look like for incorporating the semantic layer as that canonical source of access for being able to query and interact with those semantic domain objects from multiple different consumers?
[00:17:12] Unknown:
Yeah. Good question. It depends on the surface of the migration. If you have only, for example, Looker as your major BI tool, and your organization is planning to move to, say, Tableau, and maybe start supporting Excel and maybe doing some embedded analytics, at this point you're starting to realize, okay, Looker is great, I like my LookML models, but I can use them only within Looker. Right? And I now need to use all of these different data visualization tools. And that's where you probably need something like Cube, a standalone semantic layer, and you migrate all of the Looker models to that tool. Migration from Looker is straightforward: LookML is code based, Cube is code based. We even have a code editor where you can just migrate things. So I've seen migrations like this, at a smaller to medium scale, happen really quickly. On the other hand, if we engage with organizations that have a lot of modeling in a legacy BI, or even something more modern like Tableau, where you've still got hundreds of workbooks and a lot of metrics being copied across those workbooks, that might take some time. I think on our end, and for any semantic layer vendor, we can offer a lot of blueprints and best practices on a migration. But depending on the size of the data, it sometimes takes longer, sometimes it can be faster.
But I would say migrating from Looker to Cube, that's always been the fastest route.
[00:18:49] Unknown:
Another question in the adoption process, or the evaluation phase, of whether and when to bring in this separate semantic layer: one of the use cases is that the semantic layer can also act as a means of providing caching and accelerating the response times from your warehouse. And I'm curious how the underlying warehouse technology impacts that overall calculus. Maybe I'm using Snowflake, and so I need to make sure that I'm minimizing cost and minimizing load on it because of the pricing model. Or I'm using a data lakehouse system, and so maybe I need to be able to respond to queries faster. In those situations, the performance benefits seem fairly obvious. But if you're in the situation of using a Druid or a ClickHouse, which is optimized for interactive query speed, how does that influence the ways that teams think about whether and how to incorporate the semantic layer as a separate technical component of their stack? Yep. Great question. Most of our customers
[00:19:53] Unknown:
are on Snowflake, Databricks, BigQuery, Redshift, and Starburst now. So I would say that's where Cube adds most of the value, around the caching as well. The ClickHouse, Pinot, and Druid of the world, I feel like, are mostly being used for low latency, real time use cases, maybe to power, you know, the LinkedIn feed. Right? That's what Pinot was created for. So maybe less general purpose analytics and more specific use cases within an organization. That's why, naturally, we just see less of them. But when we think about the warehouse, caching can add two main benefits.
One is that it can speed up performance, and second, it can save on warehouse cost. I'll just explain how caching works so we can understand its benefits. The caching in Cube starts with a definition of what you want to cache. Usually we want to cache a few measures and a few dimensions together, which we call a pre-aggregation, or a rollup table. It can be bigger or smaller, and you can have many of them in your application. So data engineers can design what they want to cache and what kind of pre-aggregation tables they want to build. Now, what happens is that Cube will generate a query in the background that goes to the data source, say Snowflake, and runs essentially a group-by query in Snowflake to get all the data needed for the pre-aggregation table. Then we tell Snowflake to export it into a specific S3 bucket or a GCS bucket, so essentially into some object storage. Then Cube will read from that object storage and ingest it into its own storage.
Inside that storage, we optimize it. We do some magic to just make it faster, and then it remains in Cube's storage. Now, in the background, we can keep going to Snowflake and refreshing the data, and we can make it incremental. There are a lot of different strategies for how you refresh a pre-aggregation. But now, as the data is inside Cube, Cube has aggregate awareness. The idea of aggregate awareness is that Cube can understand that a query can be served from the aggregate instead of the raw data, so we can just get it out of the aggregate and serve that query. Even without all of the Cube optimizations, it's usually faster to query from the aggregated table than from the raw data. Right? So that's definitely a performance benefit. On top of that, we do a lot of optimizations. We have a specific engine for top-k queries, for example, which are very popular in analytical workloads. So that's why the cached queries are usually faster than their data warehouse counterparts. Now, on the cost side, because you don't really need to go to your Snowflake anymore, you don't need to keep it running. So if all of your aggregates need to be refreshed daily, you can just run Snowflake for four hours, maybe midnight to 4 AM, then suspend your Snowflake, and all the queries during the work day will be served from the cache. So that's a cost benefit, a cost saving opportunity.
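As a rough sketch of how such a pre-aggregation might be declared in Cube-style YAML, carrying on the hypothetical orders model from earlier; the member names and the refresh schedule here are illustrative assumptions, not a prescribed configuration:

```yaml
cubes:
  - name: orders
    sql_table: public.orders
    # ... measures and dimensions as sketched earlier ...

    pre_aggregations:
      - name: orders_by_status_daily
        measures:
          - CUBE.count
          - CUBE.total_revenue
        dimensions:
          - CUBE.status
        time_dimension: CUBE.created_at
        granularity: day
        refresh_key:
          every: 1 day        # e.g. rebuild from the warehouse overnight, serve from cache all day
```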
[00:23:04] Unknown:
Now that the conceptual elements of the dedicated semantic layer have been around for a few years, there have been a few different entrants into the market, some of which are still viable, some of which have faded or been acquired and integrated into other components of the stack. I'm curious how you have seen the overall understanding of the purpose and application of this conceptual technical element shift, and some of the ways that that has influenced the adoption and usage patterns of systems such as Cube? I think it
[00:23:39] Unknown:
definitely feels like it's maturing. In the earlier days, I would say 2019, 2020, there was a lot of excitement around the idea of the metrics layer or metrics store, but it was very unclear how to use it, what exactly it is. Right? Is it a catalog of the metrics where people can collaborate and comment, like, oh, this is a nice metric, who created it, can you tell me more about the meaning of that metric? That was more the use case of the metrics store. And on the other hand, there was the headless BI side of that story, which was more tailored towards embedded analytics use cases, more like how we build a data app. So there was a lot of uncertainty about how the semantic layer should work and what place it should take in the data stack. I feel like in the last few years it has definitely matured in terms of understanding.
So now when I talk to someone at a conference, or just someone out in the world from the data community, most people know about the semantic layer, and they understand what benefits a semantic layer can bring to the table. So that definitely feels different now. We haven't touched on that yet, but in terms of the use cases and the place of the semantic layer, recently more people have started to talk about how it can help with an AI based stack, especially around the SQL generation to the warehouse.
I think that was probably the most recent addition to the use cases that can be powered by a semantic layer. But other than that, for most of the core use cases, it feels like we are settling down on the use cases for the semantic layer specifically.
[00:25:33] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source to target replication. Leverage Datafold's fast cross database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today.
And in the technical evolution of the Cube product in particular, since that is what you're most familiar with, but also just generally in the ways that people are thinking about the implementation and integration of the semantic layer, I'm wondering what are some of the most challenging engineering aspects of building the system, in terms of ease of adoption and developer experience, while at the same time trying to balance those aspects of performance and ease of integration.
[00:26:42] Unknown:
In terms of challenges, I think we have at least three big, complex areas of deep engineering. One is what we call the SQL API. Essentially, it's an interface through which Cube can be queried. When we think about different data consumption and data visualization tools, they can currently get data from Cube through four interfaces: the REST API, the GraphQL API, the SQL API, and MDX. For example, for Tableau to connect to Cube, Tableau would connect to Cube as if it were a Postgres database. So Tableau will just generate SQL code for Postgres and then send that SQL code to Cube. Cube will take that SQL, look at it, and create new SQL code, the real SQL based on the data model and all the calculations, that needs to be sent to Snowflake. Then Cube will send that SQL to Snowflake, execute it, and send the results all the way back to Tableau. Right? So the big technical challenge is how we translate from that incoming SQL from Tableau into the outgoing SQL to Snowflake. Because we essentially need to implement a SQL language: building a SQL parser, SQL planner, and SQL analyzer that can also understand multidimensional queries. So that's been a big challenge. Our team is using a lot of math, cutting edge math here, to essentially rewrite the query from the multidimensional form into the tabular form.
So that's been an interesting area of development. The other two big areas: one is essentially building our caching engine, because we want to make sure it's fast. It's built on top of Apache Arrow DataFusion, which is a query engine written in Rust on top of Apache Arrow, but we added a lot of our own development to that. It's all open source too, so it's easy to check it out. It's written in Rust, as is our SQL API endpoint. And the final challenging part from an engineering perspective is the data modeling framework itself: how do you deal with things like fan-outs and traps, how do you make sure that you can model different measures, build correct relationships, all of that? So that's a big piece of it. I would say these three areas are the most technically challenging: the SQL API, the caching in Cube Store, and the data modeling framework.
[00:29:21] Unknown:
In terms of the developer experience, as you were saying earlier around the data modeling question, you would probably use Cube as the so-called mart layer, in dbt parlance. And so for people who are maybe using dbt as their overall workflow, what are their options for being able to effectively treat Cube as that materialization and transformation layer, so that they can use a single interface for doing all of their modeling, but have the option of being able to split the underlying compute substrate across the boundaries of warehouse versus semantic layer?
[00:29:58] Unknown:
Yeah. That's a great question. We have a decent number of dbt users, so we've been thinking about that problem: how do we design the developer workflow so that it's a very straightforward and streamlined process going from transformations in dbt to the modeling in Cube? The good thing here is that dbt is code first. Right? So, essentially, we can combine the two products because they're both code first. What we built is essentially a blueprint, but also a code integration between Cube and dbt.
So what users can do is, once they build their staging models in dbt, we can bring them over to Cube through our Python library. Essentially, it can read the manifest file with all the models defined in dbt. And then, based on these models, Cube users can create cubes. Cubes are the first layer in our semantic modeling. We have cubes, which are usually normalized entities; they're usually a one-to-one mapping to staging models in dbt, with all the dimensions. And then, on top of cubes, users define all the measures. Once that piece is done, and the dimension part kind of happens automatically because you're just bringing your definitions over from dbt, then you define what we call views on the Cube side. A view is a denormalization point. It's essentially just saying, I want to take these five cubes, put them all together as a data mart, and then expose it to the world. So views are the more outward facing representations of the data model. That's how the workflow usually happens right now. And many of our users keep it under the same repo.
So it's a dbt folder, essentially, and a Cube folder, and you can first do your dimension changes in dbt and then work on the measure related changes in Cube.
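A rough sketch of what that manifest driven modeling can look like, assuming a Jinja helper in the spirit of Cube's dbt integration that exposes the dbt models to the YAML templates; the helper names (`dbt_models()`, `model.sql_table`, `column.type`) and the file layout here are illustrative assumptions rather than a definitive API:

```yaml
# cubes generated from dbt staging models: dimensions carried over from dbt,
# measures added on the Cube side
cubes:
  {%- for model in dbt_models() %}
  - name: "{{ model.name }}"
    sql_table: "{{ model.sql_table }}"

    dimensions:
      {%- for column in model.columns %}
      - name: "{{ column.name }}"
        sql: "{{ column.name }}"
        type: "{{ column.type }}"
      {%- endfor %}

    measures:
      - name: count
        type: count
  {%- endfor %}
```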
[00:31:56] Unknown:
So when you were on the show, about two years ago at this point, I think you had just started the process of doing a big rewrite in Rust. I think that the original name was actually Cube.js, because you were focusing on JavaScript as that execution layer. I'm curious how that evolution of the technical underpinnings of the project and of the product has shifted the overall scope and goals of the project, and some of the ways that it has shifted your thinking about how best to manage this as a production grade project for people to be able to rely on, and some of the ways that you can incorporate additional features because of those performance gains.
[00:32:43] Unknown:
Yeah. That's true. When we first started Cube, we called it Cube.js because it was written in JavaScript, on the Node.js runtime, and our data modeling framework was JavaScript based. When we were releasing it, we thought we could either call it Cube or call it Cube.js. The reason to call it Cube.js was to try to minimize the noise, because if you just go and Google "cube" you find a bunch of stuff, and it would be hard to find a new open source project. So what we did is we called it Cube.js. I think it definitely helped people find the open source project more easily, but it also created a bit of a wrong impression about the product, because many JavaScript charting libraries have .js in their name. Think about Chart.js, Highcharts, and all of that. So many users, many people from the open source community, were thinking about Cube as a charting library. So we thought maybe we needed to get rid of the JavaScript. And at the same time, as we started to think about getting rid of JavaScript in the name, we realized that we would need to change the data modeling language from JavaScript to YAML and Python based, because in the data engineering community you just have more people familiar with YAML and Python based workflows than with JavaScript. So we kicked off that process. And a separate stream was that we kicked off the process of rewriting, or making changes to, the code base, going from JavaScript to Rust. We built our caching engine fully in Rust, which is based on DataFusion, and then we built the SQL API, which is the connection point between BI tools and Cube, fully in Rust as well. And then we started to rewrite some of the data modeling framework pieces in Rust too. So I feel like Rust definitely helps us speed things up in many areas. Caching was a very obvious choice when we started to build our own caching engine: we were thinking either let's do it in C++ or in Rust, because you cannot do it in JavaScript, obviously, for performance reasons. But now, as we've adopted more Rust at Cube and more and more core developers are very fluent in Rust, we're seeing a lot of opportunities, especially around the data modeling framework, to rewrite things in Rust. That will help us speed up the query generation.
SQL query generation is not usually a big problem in terms of latency, but in some edge cases it can be. Sometimes it can take, like, a second to generate, maybe in 0.1 percent of cases, but it still can happen. So in those areas, we can use Rust to improve the SQL generation. And in general, on the transport side, you still need to transform data when you load it from Snowflake, transform it in a different way, and send it back to the tool that requested it. That's a CPU intensive kind of workload that can be rewritten in Rust to improve the performance. So we see a lot of opportunities to use Rust right now to improve performance.
[00:36:09] Unknown:
Being open source as the foundational component of the technical stack obviously helps with the adoption process, because you don't necessarily have to go through the whole sales cycle just to be able to test something out. It also simplifies the work of being able to incorporate the Cube product earlier into the development cycle, as well as throughout CI and into production. I'm wondering how you're thinking about the questions of project governance for the open source, as well as the product and business strategies, in order to understand where the boundary lines are between when you want to use the open source and when you want to use the paid product, and how to manage that delicate balance of not cannibalizing the open source in favor of the business and not letting the business go under in favor of supporting the open source.
[00:37:01] Unknown:
Good question. It's a question I get asked very often. I think we try to be as honest as possible here, and I feel like transparency is key regardless of what you're doing. Because as long as you're being transparent, and you say what is here, what is there, and what is on the open source roadmap versus what is on the cloud roadmap, you manage the expectations of the community, and being very honest always pays off. At the end of the day, it's features. Right? Some of the features are going to be in open source, some of the features are going to be in the commercial product, and there is no way around it. And by features, I'm thinking on a bigger scale, because many software projects are layered, like onions.
You've got a bunch of stuff in the core, and then you put something on top of that. So some features on top of that can go into the cloud and commercial product, and some features may go into open source. Many of these decisions are tactical, ad hoc decisions. But as long as you're being honest and saying, these features are in open source and we're not going to take them away, they're going to remain in open source, but these other features are going to be in cloud. If you want those features, you can buy them or you can write them. It's open source: you can create your own library, you can build another layer in that onion yourself. But you have to maintain it. Right? You have to write that code.
Writing code means you spend some time, and someone needs to pay for that time. Right? So at the end of the day, someone is going to pay for that work. But there is an option: if the ecosystem is open, organizations or individuals can develop their own plugins, extensions, or layers around it. And as an open source maintainer, as long as you are very open and transparent, saying, here are the core features, the core open source product, and here are the extension points for how you can build your own stuff. If you want to build it, build it. If you don't want to build it, you can buy it in the commercial product. That's the framework we're trying to go with, and then you have to deal with a lot of tactical decisions about what exactly needs to go where.
[00:39:23] Unknown:
And to that point, the technical decisions of what belongs in the open source versus not go beyond just the question of whether it's in the open source versus the paid product; there's also the question of whether it belongs in the scope of Cube at all, or whether it should be part of a completely separate piece of the technical stack or a completely separate project. I'm curious what are some of the tensions that you are coming up against as people adopt Cube and have their own concepts of what should be in scope versus what shouldn't, or things that they're trying to make it do that it wasn't designed to do, and some of the potential directionality that that influences as you continue the evolution of Cube, and just some of the ways you think about what is in scope, what is out of scope, and what is just pie in the sky thinking?
[00:40:13] Unknown:
We have many of those, but probably the one which always follows us is the BI and visualization part of it. In Cube, we have what we call a playground. Think of the playground as, you know, the query builder interface in any BI tool. Looker, for example, I think calls it Explore: you have a bunch of measures and dimensions, usually on the left side, then you have a chart on the right side, and you can select and drag and drop things to build a chart. Every BI has it. We have it, and we call it a playground. The reason we have it is that when you build your data model, you probably want to test your data model to make sure the numbers are actually correct, to know what kind of measures you have and how you can play with them. So you need that tool, and we've had that tool from the very beginning.
We recently did a big update to that tool, and it now looks nicer. And it always triggers the conversation: oh, now Cube is a BI tool because it has that. And I don't think we are a BI tool just because we have it, even though, it's true, that's a big part of a BI. But that's something we've always been thinking about a lot at Cube: do we want it to become a BI tool? I don't think we do. I don't think the vision for Cube is to be a BI tool. Are we going to have some features that overlap with BI? Yes. This query builder, the playground, is a good example, because we just have to have it so people can test the data model. But that's an area where we try to keep the balance, and we don't want to go into the BI world. We don't want people to think about Cube as a replacement for Tableau or a replacement for Looker, because we're not going to invest a lot of time in charts, in dashboards, and all of that. We are focused on data modeling. We are focused on the metadata, on things related to metadata management, to the data model, to governance in general.
That's what we are. And I know that, historically, that's probably been a big part of the BI. So from that perspective, yes, we are close to the BI, but we're not trying to replace Tableau from the visualization perspective.
[00:42:22] Unknown:
And in your experience of working on Cube, working in the space of the semantic layer and data modeling, what is one of the most interesting or innovative or unexpected ways that you've seen the Cube project used? Because Cube in open source is very modular, so you can just take pieces of Cube out and try to
[00:42:41] Unknown:
use maybe only the data modeling framework or, like, the SQL generation. So in open source, it's easier not to use everything together; you can just use pieces of it. I saw organizations building internal experimentation platforms with Cube being the framework to model metrics. They took our data modeling framework, changed it a little bit for their use case, and turned it into the metrics modeling framework for an experimentation, like A/B testing, platform, which we never intended Cube to be. Right? But it was an interesting use case. So because of this modularity and embeddability, you can see a lot of different random use cases.
[00:43:24] Unknown:
Another aspect of having the semantic layer decoupled from the business intelligence system is, as we talked about, having the potential for multiple different data consumers, multiple different clients, and access patterns. I'm wondering how you have seen that change the ways that data teams think about what it means to deliver data to the organization, and some of the ways that it has changed the organizational appetite for data exploration, given that it does open up the arena for having these multiple different bespoke use cases that don't necessarily all have to coordinate through a single tool.
[00:44:04] Unknown:
Yeah. It changes how the data teams think about the data in the sense that they start to think about data as a product, closer to how software engineering teams think about their work in general. A software engineering team, regardless of whether they're building a customer facing product or some internal product, thinks about delivering a product, which is usually a piece of software, and then iterating on that product: making changes, having a version control system, delivering updates. So I think with having a semantic layer in place that can potentially offer multiple APIs for the data consumers, but also in general with all the improvements that bring the data engineering workflow closer to the software engineering development cycle, you know, code first, different environments, version control, all of that, it feels like more data engineering teams just generally think like software engineering teams, in terms of delivering products.
And in the case of data, it's data products. Right? In the Cube world, that would be a Cube view, the multidimensional data mart, as your final data product. So now you have that artifact, that resource, that you can give to different teams. You can give it to your analysts, who can build a chart in Tableau. You can give it to the front end team, who can use it inside their front end application. You can even give it to the customer. We have customers who are selling this to their own customers, so they're essentially monetizing their data products: they're building some datasets and then exposing the SQL endpoint to their customers so they can just consume that data. So I think that's an interesting shift that I saw, as many teams started to think about their work as product work.
[00:45:54] Unknown:
And in your experience of working in the space, building Cube, working with customers and end users and the broader data community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:46:07] Unknown:
That's a very good question. I feel like the data ecosystem went through a few interesting eras, let's say, at least in the last four years or so. First, we all went through the high of zero interest rates, and a lot of people were thinking about data differently back then, that we needed more tools to solve every small problem in the data stack. And now we've all gone to the other side of the spectrum: we need fewer tools, and we need to consolidate. I think at different times, I saw different trends.
I think nowadays, what I'm wondering about is how the data ecosystem, and the data engineering ecosystem in general, will work with AI and what the crossover is. I go to some data conferences, you know, modern data stack conferences, and I don't see a lot of talks about AI. It feels like some talks are mandatory talks, you have to talk about AI because it's 2023, 2024. But I still don't see a lot of intersection between data and AI specifically. It feels like they're two different worlds right now, even though the foundation of AI is data. Right? You need to have data in place to run good AI. So I think that's something that was maybe a little unexpected when the whole AI hype started. I expected to see more people from the data community lean into AI, which didn't happen.
[00:47:46] Unknown:
And for people who are building their data systems, maybe they're trying to branch out to have more consumers of the data in their warehouse, what are the cases where Cube, or just more generally a semantic layer, is the wrong choice?
[00:48:01] Unknown:
I think if it's a small organization, say a 50 person org with their first data hire, there are so many things that person needs to do to deliver value, and maybe doing ad hoc reports is totally fine at this point. Just writing SQL in Snowflake and getting all the foundational pieces in place: have an ETL in place, make sure you have a warehouse, then probably choose the first BI tool, which can be, you know, Metabase or Superset open source, because at this stage you probably don't want to pay for an expensive BI tool. I think at this point the organization still doesn't need a semantic layer.
Now, maybe once it gets to the point where, okay, we have Metabase, and we have more people trying to use Metabase in a self serve way, how do we tell them what we have, what measures and metrics they should look at? At this point, the data team might have at least two or three people on it. That's probably the first time an organization can think about a semantic layer. But, generally, as an organization matures, as you get more data folks on the team, as you get more BI tools in place, the need for the semantic layer gets bigger. So I would say it's definitely a wrong choice for, like, a 50 person company with one data hire. But as we go bigger, the need for it increases.
[00:49:26] Unknown:
And as you continue to build and iterate and evolve the Cube project and invest further in this semantic modeling space, what are the things you have planned for the near to medium term? I think the big thing that we started to lean into
[00:49:39] Unknown:
is AI, last year and this year. I think one of the use cases for Cube, and for the semantic layer in general, is how we go from natural language to SQL, and that's something that can have many different applications. It can be used internally. You could wanna build a Slack bot. You could build an AI agent that incorporates some queries to your warehouse. But in general, when you need to have an AI agent that can execute queries against the warehouse, you probably need a semantic layer. Because if you think about it this way, we already have a lot of data in the warehouse. If a human needs to access that, we need to write SQL.
It's gonna be the same case for AI agents. If they need to access the data to calculate some analytics around something, they would need to go and write SQL against Snowflake. There is no way we're going to take all that structured data out of the warehouse and put it into the context, and even if you did that, it's not going to work because, you know, LLMs are probabilistic. So they need to write SQL. Now the question is, how can they write SQL? Can they generate a SQL query? I think they can, because they've seen so many examples of SQL queries out in the world, but they just don't know what exact SQL query to generate for your warehouse specifically. So the simplest approach would be, like, let's download the DDL for your tables and just give it to the AI. People did that. They ran benchmarks on it, and it only gives, like, 17% accuracy or something, because you just don't have enough context about your columns, about your information. So I think the solution here is to use a semantic layer or a knowledge graph, any way to describe your ontology, your semantics, and your data. You give all that context to the AI agent or LLM, and now it can generate really accurate SQL queries. And especially if you generate the SQL queries not directly against the warehouse, but back to your semantic layer, that can act as an additional validation point. It increases the accuracy even more, and you also get all of your security access, caching, all the benefits on top of this. So what we're building at Cube now is a few things, but the first foundational thing is an API endpoint where you can just send a text query to Cube, saying, like, hey, give me data, essentially. And Cube will generate a SQL query on top of it, execute that query, and do the same thing we do for BI tools right now, but for natural language. And that's going to be very accurate because we're going to use the same data model we already have. That can have a very wide range of applications in building chatbots, in building AI agents, and in having some sort of generative BI capabilities internally. So that's an exciting thing that we're working on right now.
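To make the pattern described above a bit more concrete, here is a minimal sketch of the flow: feed the LLM the semantic layer's model (measures, dimensions, descriptions) rather than raw DDL, have it emit a query against that model instead of raw warehouse SQL, and let the semantic layer validate the query, apply access control and caching, and compile it to SQL. The endpoint paths, payload shape, and llm() helper are illustrative assumptions for the sketch, not Cube's actual API.

```python
# Sketch only: hypothetical semantic-layer endpoints and a pluggable llm() callable.
import json
import requests

SEMANTIC_LAYER_URL = "https://semantic-layer.example.com"  # hypothetical deployment


def build_prompt(question: str, model: dict) -> str:
    """Embed the semantic model in the prompt so the LLM chooses from known
    measures and dimensions instead of guessing at warehouse columns."""
    return (
        "You answer analytics questions by emitting a JSON query against this "
        f"semantic model:\n{json.dumps(model, indent=2)}\n"
        'Respond with JSON of the form {"measures": [...], "dimensions": [...]} '
        "and nothing else.\n"
        f"Question: {question}"
    )


def answer(question: str, llm) -> list:
    # 1. Fetch the semantic model: names, types, human-readable descriptions.
    model = requests.get(f"{SEMANTIC_LAYER_URL}/meta").json()

    # 2. Ask the LLM for a semantic-layer query, not raw warehouse SQL.
    query = json.loads(llm(build_prompt(question, model)))

    # 3. The semantic layer validates the query against the model, applies
    #    security and caching, compiles it to SQL, and runs it on the warehouse.
    resp = requests.post(f"{SEMANTIC_LAYER_URL}/load", json={"query": query})
    resp.raise_for_status()
    return resp.json()["data"]
```

The key design point in the conversation is step 3: because the generated query targets the semantic layer rather than the warehouse, an invalid or out-of-scope query fails validation instead of producing wrong SQL.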
[00:52:34] Unknown:
Are there any other aspects of the Cube project specifically, or the overall space of the semantic layer, that we didn't discuss yet that you'd like to cover before we close out the show? I think that's it. I think we covered a lot of things. Thank you. Those were all great questions. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. That's a great question. I feel like the missing piece here is around
[00:53:11] Unknown:
the catalog that is closer to the business consumers and that is more friendly for business consumers to use. And I will go deeper on this right now. I think with catalogs, we got really good development in the last several years. There was the first wave of catalogs: Alation, Collibra. They offer the tools to do the inventory of your data and to create all these different descriptions of the data assets within your organization. A lot of teams successfully use them, and then we got a sort of newer generation of these catalogs, like Atlan. But I think the problem is that the catalog space is too big and too vague, in the sense that it spans everything from a low-level catalog, you know, like, pipelines and Airflow jobs, all the way up to the things that business users care about, like charts, dashboards, queries, all of that. So I feel what is missing is a little bit more focus on the side of the spectrum that data consumers care about: what actual dashboards exist, how we can find the data, what metrics we have, what charts we can use. And when I think about that problem, I'm wondering why no one is using AI for it. Maybe someone is doing that already, but I don't see products on the market right now that use AI for this, because it makes so much sense in terms of discoverability in a catalog. If I just joined a marketing team as a new hire, it would be good to go to some place and say, like, hey, I just joined the marketing team. What dashboards should I look at? What metrics should I worry about? So there are still a lot of things that AI can capture and change here from the cataloging perspective. I think we'll see a lot of good, interesting developments here soon. I'm sure someone is working on that right now. I just don't see it. Absolutely. The overall catalog and discoverability
[00:55:10] Unknown:
space is evolving, but I do think it is still a little bit underserved. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Cube and your perspective on the semantic layer and how it fits into the overall data platform and data ecosystem. It's definitely a very interesting problem space, and it's great to see the work that you and your team are putting into it. I hope you enjoy the rest of your day. Thank you. Thank you for having me today. That was a great conversation.
[00:55:42] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Artyom Keydunov
Artyom's Journey into Data Engineering
Understanding the Semantic Layer
Challenges in Data Modeling and Consumption
Evolution of the Metrics Layer
Data Warehousing and Semantic Layer Integration
Adoption and Migration to Semantic Layers
Performance and Caching Benefits
Market Maturity and Use Cases
Technical Challenges in Building Cube
Integration with dbt and Developer Workflow
Technical Evolution: From JavaScript to Rust
Open Source Governance and Business Strategy
Scope and Limitations of Cube
Data as a Product
Lessons Learned in Data Engineering
When Not to Use a Semantic Layer
Future Plans for Cube
Closing Remarks