Summary
Maintaining a single source of truth for your data is one of the biggest challenges in data engineering. Different roles and tasks in the business need their own ways to access and analyze the data in the organization. To enable those use cases while maintaining a single point of access, the semantic layer has evolved as a technological solution to the problem. In this episode Artyom Keydunov, creator of Cube, discusses the evolution and applications of the semantic layer as a component of your data platform, and how Cube provides speed and cost optimization for your data consumers.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold.
- Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started. Your first 30 days are free!
- Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
- Your host is Tobias Macey and today I'm interviewing Artyom Keydunov about the role of the semantic layer in your data platform
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the technical elements of what it means to have a "semantic layer"?
- In the past couple of years there was a rapid hype cycle around the "metrics layer" and "headless BI", which has largely faded. Can you give your assessment of the current state of the industry around the adoption/implementation of these concepts?
- What are the benefits of having a discrete service that offers the business metrics/semantic mappings as opposed to implementing those concepts as part of a more general system? (e.g. dbt, BI, warehouse marts, etc.)
- At what point does it become necessary/beneficial for a team to adopt such a service?
- What are the challenges involved in retrofitting a semantic layer into a production data system?
- evolution of requirements/usage patterns
- technical complexities/performance and cost optimization
- What are the most interesting, innovative, or unexpected ways that you have seen Cube used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?
- When is Cube/a semantic layer the wrong choice?
- What do you have planned for the future of Cube?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
Links
- Cube
- Semantic Layer
- Business Objects
- Tableau
- Looker
- Mode
- Thoughtspot
- LightDash
- Embedded Analytics
- Dimensional Modeling
- Clickhouse
- Druid
- BigQuery
- Starburst
- Pinot
- Snowflake
- Arrow Datafusion
- Metabase
- Superset
- Alation
- Collibra
- Atlan
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Starburst: ![Starburst Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/UpvN7wDT.png) This episode is brought to you by Starburst - a data lake analytics platform for data engineers who are battling to build and scale high quality data pipelines on the data lake. Powered by Trino, Starburst runs petabyte-scale SQL analytics fast at a fraction of the cost of traditional methods, helping you meet all your data needs ranging from AI/ML workloads to data applications to complete analytics. Trusted by the teams at Comcast and Doordash, Starburst delivers the adaptability and flexibility a lakehouse ecosystem promises, while providing a single point of access for your data and all your data governance allowing you to discover, transform, govern, and secure all in one place. Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Try Starburst Galaxy today, the easiest and fastest way to get started using Trino, and get $500 of credits free. [dataengineeringpodcast.com/starburst](https://www.dataengineeringpodcast.com/starburst)
- Datafold: ![Datafold](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/zm6x2tFu.png) This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting https://get.datafold.com/replication-de-podcast.
- Dagster: ![Dagster Logo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/jz4xfquZ.png) Data teams are tasked with helping organizations deliver on the premise of data, and with ML and AI maturing rapidly, expectations have never been this high. However data engineers are challenged by both technical complexity and organizational complexity, with heterogeneous technologies to adopt, multiple data disciplines converging, legacy systems to support, and costs to manage. Dagster is an open-source orchestration solution that helps data teams reign in this complexity and build data platforms that provide unparalleled observability, and testability, all while fostering collaboration across the enterprise. With enterprise-grade hosting on Dagster Cloud, you gain even more capabilities, adding cost management, security, and CI support to further boost your teams' productivity. Go to [dagster.io](https://dagster.io/lp/dagster-cloud-trial?source=data-eng-podcast) today to get your first 30 days free!
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Dagster offers a new approach to building and running data platforms and data pipelines. It is an open source, cloud native orchestrator for the whole development life cycle with integrated lineage and observability, a declarative programming model, and best in class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise class hosted solution that offers serverless and hybrid deployments, enhanced security, and on demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster today to get started, and your first 30 days are free. Data lakes are notoriously complex.
For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte scale SQL analytics fast at a fraction of the cost of traditional methods so that you can meet all of your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and DoorDash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first class support for Apache Iceberg, Delta Lake, and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey, and today I'd like to welcome back Artyom Keydunov to talk about the role of the semantic layer in your data platform. So, Artyom, can you start by introducing yourself?
[00:01:49] Unknown:
Yeah. Thank you for having me today. My name is Artyom. I'm the co-founder and CEO at Cube. I started Cube as an open source project in 2019, and then with my co-founder in 2020, we built a company around it to keep developing the open source project, but also to introduce a commercial cloud version of Cube, which is called Cube Cloud.
[00:02:13] Unknown:
And do you remember how you first got started working in data?
[00:02:16] Unknown:
I was just a software engineer, and I think, you know, as a software engineer, you always work with data. At some point I started to lead a project, and the project involved collecting a lot of data. We were building an interesting product: the software was deployed in schools, and kids were using the software on a daily basis. The idea was to train the software to improve, so it could learn from the kids based on their actions and adapt the exercises and challenges that were presented to them based on their previous input. To accomplish that at scale, across all of these schools, we had to build a big data pipeline. I think that was my first real data project.
[00:03:06] Unknown:
As you mentioned, you started the Cube project a few years ago. That was at the early stages of what some people called the metrics layer, other people called the semantic layer, and other people would call headless BI. I'm wondering if you can start, for folks who aren't familiar with all of these terms and the ways that they fit into the overall scope of data platforms and data management, with what the technical elements are that constitute a semantic layer.
[00:03:33] Unknown:
Right. I think if we think about the semantic layer as just a concept, an idea, it's been around for a long time. If you look back at BusinessObjects and MicroStrategy, all the older generation of BI tools, they all had a semantic layer as part of them. Essentially, what a semantic layer is is a way for the BI to translate high level, analytical style queries into relational or tabular queries. Right? When we move around in the UI of the BI, when we drag, like, active users with a breakdown by state or country, the BI would generate the correct SQL and execute that SQL against a specific warehouse. So that's where it's usually been considered and called a semantic layer. In fact, BusinessObjects even had a patent for that for, like, 10 years, and MicroStrategy successfully defended against it. So the semantic layer as part of the BI has always been around. But then, probably by 2017 and 2018, a lot of people started to talk about the standalone metrics layer or standalone semantic layer. And the need for this mostly arose from the fact that we reached a point where we had so many different BI tools and data consumption tools that it was unclear what was the source of truth.
And I think it happened partly because a lot of people wanted to democratize access to data, you know, bring in more specialized data visualization and data consumption tools. But the cloud also contributed a lot to that, because it became easier to buy and use software than it was, like, 20 years ago. So a newer generation of tools like Tableau and Looker started it all, but then we got Mode, we got ThoughtSpot, we got all of these tools, some of them cloud native, that a lot of organizations started to adopt. And in the later generation you got Sigma, you got Lightdash, and all of these tools. So the number is only growing. And at some point, data teams started to think, okay.
We have so many different tools and places to define metrics, to define semantics, so what should be the source of truth? And that's where a lot of organizations, and the data community in general, started to talk about headless BI, the metrics layer, the semantic layer. There were a few different names for it. I feel like the metrics store specifically came into the picture because a few organizations, Lyft and Airbnb, if I recall correctly, built software internally that they called a metrics store, which was some kind of a semantic layer. And that's why a lot of people started to use the term metrics store. But I think that idea never really came to the market as a general purpose solution. So that's why the metrics store idea started to fade out, and it all kind of got replaced with a more universal semantic layer. That's what people are talking about.
[00:06:38] Unknown:
And as you mentioned, the semantic layer is the point in the overall data system where you convert from raw information into some sort of contextualized business domain objects. And to your point, for the most part, throughout the earlier history of data systems, that was a dedicated component of the business intelligence system because it was monolithic. That was the canonical reference for anybody who wanted to go and explore the data. Now, with this disaggregation of the overall data stack, there is a larger number of consumers. Not everybody is using the same tool chain. Not everybody has the same frame of reference or the same context, and I'm curious how that has complicated the question of even being able to define what the appropriate granularity and scope of those business domain objects are, even beyond just the technical elements of being able to represent them in a system that can be viewed as the source of truth.
[00:07:42] Unknown:
Right. Yeah. The more data we have, the more data consumers we have, and the more need to consume data, that obviously increases the scale of complexity around data modeling in general. And I think there are a few ideas, even orthogonal to the semantic layer, that have arisen in recent years to address some of these issues. A data mesh is one of those examples. Right? Like, how do we hold the knowledge of specific data modeling and metrics modeling aspects inside domain teams so they can model their metrics and their definitions, but at the same time keep centralized governance?
I think if we zoom out here and think about what the problem is in general, it's always a balance between well governed data and flexibility at the end of the analysis. In some world, we could theoretically model every possible measure and dimension, with every possible granularity, and present that to the end data consumer. And that would be ideal. The thing is that it's just not possible to do. Right? There is always going to be something missing: some granularity is missing, measures are missing, dimensions are missing. So there's always going to be some data modeling at the edge where an analyst or a business consumer working with the data would need to do some last mile transformation, maybe to look at the data differently.
I think that for the data teams and for the vendors who are involved in that process, the idea is how do we help to keep the balance between governance on one end and flexibility
[00:09:30] Unknown:
of the last mile analysis on the other end. Before we dig too much into more of the technical elements, as you said, there was this flash in the pan moment of the metrics layer being touted as this separate layer of the stack, that it needed to be a dedicated system. And I'm curious if you can give your assessment of the current state of the ecosystem, the state of the market, as to when and whether the dedicated technical layer is merited, as opposed to it being a feature of one of the other components of a data platform.
[00:10:09] Unknown:
Yeah. I think around 2018 to 2020, that's where we had the recent wave of ideas around headless BI, the metrics store, the metrics layer. As I mentioned, I think the big driver was this explosion of different data consumption and data visualization tools on one hand, and, on the other, the fact that a few big tech companies developed internal solutions and found some success with those internal solutions. That catalyzed the idea of the standalone, dedicated metrics store, and a few companies were started to address this problem. I think, overall, if you look five years or so later at the market, many of those companies are not around anymore, or they got acquired, or they pivoted and changed direction. So Cube is one of the few companies that is still doing the semantic layer. And the general idea moved a little bit from metrics store or metrics layer into the semantic layer specifically. And it's not only the change in the name, but more an understanding of the place of that layer, the place of that technology, rather than being more like a catalog of the metrics.
It is now being considered more a place to do the multidimensional data modeling, sort of universally, for all of these tools. I think the state of today, as I see it, is that the place of the semantic layer is on top of the data warehouse. It's a fully virtual layer that doesn't hold any data. It lets data engineers develop measures, metrics, aggregates, and dimensions to some extent, although much of the dimensional modeling can be done upstream in a transformation tool. So it's mostly measures, metrics, and relationships, and then you expose these models to all the data consumption tools.
And by data consumption tools, we mean not only BI, but also applications, so data apps. Right? That's embedded analytics and a whole variety of standalone internal and external apps. I think that currently, when we talk with buyers and talk to the market in general, the expectation from a semantic layer is a data modeling framework that supports a wide range of metrics modeling specifically, and relationships with joins. It would be APIs that can support BI connectivity but also app connectivity, some sort of a caching layer, and some sort of a security layer.
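To make that data modeling framework more concrete, here is a minimal sketch of what such a model can look like in Cube-style YAML, with measures, dimensions, and a join relationship. The `orders`/`users` entities and the column names are illustrative examples for this write-up, not something discussed in the episode.

```yaml
cubes:
  - name: orders
    sql_table: public.orders            # a table or staging model in the warehouse

    measures:
      - name: count
        type: count
      - name: total_revenue
        sql: amount
        type: sum

    dimensions:
      - name: status
        sql: status
        type: string
      - name: created_at
        sql: created_at
        type: time

    joins:
      - name: users
        sql: "{CUBE}.user_id = {users}.id"
        relationship: many_to_one
```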
[00:12:47] Unknown:
And so, as you mentioned, with the semantic layer or the headless BI, you can use that as an additional modeling tool to be able to say what it means for a purchase to be completed, or how to calculate revenue, all of these business layer semantic objects that you want to be able to ask questions about. But, also, data warehousing as an overall practice was originally designed to be able to address some of those questions. And I'm curious how you have seen the idea of this dedicated layer for building these more conceptual domain objects off of the core factual information that you have in the data warehouse, how that impacts the way that teams think about the data modeling approach, data delivery, who is responsible for which elements of that technical versus semantic modeling, etcetera?
[00:13:49] Unknown:
Right. Great question. I think, from an organization perspective, it's usually data engineers who work on both the transformation and the semantic modeling. It really depends on whether the organization has some sort of data mesh where different domain groups contribute to specific areas. Cube is code first, as are some other tools in the modern data stack. So I feel like with code first, you always get the benefit of collaboration, because adding a new measure is essentially just a pull request. That can enable a sort of mesh architecture, if you like, where a marketing embedded team that wants to contribute to marketing metrics specifically can do that, and then the central governance team is going to review it. But at the end of the day, it's data engineers, regardless of how they structure it within the organization.
The other part of your question, if I understand it correctly, was what goes into transformation versus what goes into the semantic modeling, which is always a big question when an organization starts to adopt semantic modeling. That's one of the first questions: should we model that right in the warehouse and materialize it in the warehouse, versus should we do it in the semantic model? What we recommend, the usual setup, is to do dimensional modeling in the warehouse and to keep your models normalized, say, in a dbt oriented way, what they would call staging models. So you get to a point where you build your staging models and they look more like normalized entities. And then the semantic layer would be the denormalization point. That's where you bring your normalized models into the semantic layer, you start describing measures, you start describing relationships.
And then, based on these relationships, you start to build your multidimensional data marts, which are denormalized. They can potentially contain multiple measures and multiple dimensions, all packaged together as a single data asset. That's like a data mart, which is multidimensional. It can look like a table, right, but it's essentially multidimensional. And then you expose them to different data consumption tools. You can design them differently: they can have one measure and then a set of dimensions that can be used with that measure, or you can have multiple measures and multiple dimensions inside one data mart. I see both ways, and they're equally good. But the high level idea is that you do dimensional modeling and normalization at the data warehouse level, and then denormalization and measures in the semantic model.
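Continuing the illustrative model sketched earlier, the denormalization point he describes corresponds to a view in Cube-style YAML: it packages measures and dimensions from one or more cubes into a single multidimensional data mart exposed to consumers. The name and the exact `join_path`/`includes` layout below are a rough sketch under that assumption, not a definitive schema.

```yaml
views:
  - name: orders_overview           # the outward-facing data mart
    cubes:
      - join_path: orders           # pull members from the illustrative orders cube
        includes:
          - count
          - total_revenue
          - status
          - created_at
```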
[00:16:35] Unknown:
For teams who have already been building out a data platform, they have their warehouse, they've already done their dimensional modeling, they've got a business intelligence system where they've already done some of that semantic modeling, or maybe they've built those mart layers in the data warehouse and they're just doing a one-to-one representation in their BI. What does the adoption and migration process look like for incorporating the semantic layer as that canonical source of access for being able to query and interact with those semantic domain objects from multiple different consumers?
[00:17:12] Unknown:
Yeah. Good question. It depends on the surface of the migration. If you have only, for example, Looker as your major BI tool, and your organization is planning to move to, say, Tableau, and maybe start supporting Excel and maybe doing some embedded analytics, at this point you're starting to realize, okay, Looker is great, I like my LookML models, but I can use them only within Looker. Right? And I now need to use all of these different data visualization tools. And that's where you probably need something like Cube, a standalone semantic layer, and you migrate all of the Looker models to that tool. Migration from Looker is straightforward: LookML is code based, Cube is code based. We even have a code editor where you can just migrate things. So I've seen migrations like this, at a smaller to medium scale, happen really quickly. On the other hand, if we engage with organizations that have a lot of modeling in a legacy BI, or even something more modern like Tableau, where you've still got hundreds of workbooks and a lot of metrics being copied across those workbooks, that might take some time. I think on our end, and for any semantic layer vendor, we can offer a lot of blueprints and best practices on a migration. But depending on the size of the data, it sometimes takes longer, sometimes it can be faster.
But I would say migrating from Looker to Cube, that's always been the fastest route.
[00:18:49] Unknown:
Another question in the adoption process, or the evaluation phase, of whether and when to bring in this separate semantic layer: one of the use cases is that the semantic layer can also act as a means of providing caching and accelerating the response times from your warehouse. And I'm curious how the underlying warehouse technology impacts that overall calculus. Maybe I'm using Snowflake, and so I need to make sure that I'm minimizing cost and minimizing load on it because of the pricing model. Or I'm using a data lakehouse system, and so maybe I need to be able to respond to queries faster. In those situations, the performance benefits seem fairly obvious. But if you're in the situation of using a Druid or a ClickHouse, which is optimized for interactive query speed, how does that influence the ways that teams think about whether and how to incorporate the semantic layer as a separate technical component of their stack? Yep. Great question. Most of our customers
[00:19:53] Unknown:
are on Snowflake, Databricks, BigQuery, Redshift, and Starburst now. So I would say that's where Cube adds most of the value, around the caching as well. The ClickHouse, Pinot, and Druid of the world, I feel like, are mostly being used for low latency, real time use cases, maybe to power, you know, the LinkedIn feed. Right? That's what Pinot was created for. So maybe less general purpose analytics and more specific use cases within an organization. That's why, naturally, we just see less of them. But when we think about the warehouse, caching can add two main benefits.
One is that it can speed up performance, and second, it can save on warehouse cost. I'll just explain how caching works so we can understand its benefits. The caching in Cube starts with a definition of what you want to cache. Usually we want to cache a few measures and a few dimensions together, which we call a pre-aggregation, or a rollup table. It can be bigger or smaller, and you can have many of them in your application. So data engineers can design what they want to cache and what kind of pre-aggregation tables they want to build. Now, what happens is that Cube will generate a query in the background that goes to the data source, say Snowflake, and runs essentially a group-by query in Snowflake to get all the data needed for the pre-aggregation table. Then we tell Snowflake to export it into a specific S3 bucket or a GCS bucket, so essentially into some object storage. Then Cube will read from that object storage and ingest it into its own storage.
Inside that storage, we optimize it. We do some magic to just make it faster, and then it remains in Cube's storage. Now, in the background, we can keep going to Snowflake and refreshing the data, and we can make it incremental. There are a lot of different strategies for how you refresh a pre-aggregation. But now, as the data is inside Cube, Cube has aggregate awareness. The idea of aggregate awareness is that Cube can understand that a query can be served from the aggregate instead of the raw data, so we can just get it out of the aggregate and serve that query. Even without all of the Cube optimizations, it's usually faster to query from the aggregated table than from the raw data. Right? So that's definitely a performance benefit. On top of that, we do a lot of optimizations. We have a specific engine for top-k queries, for example, which are very popular in analytical workloads. So that's why the cached queries are usually faster than their data warehouse counterparts. Now, on the cost side, because you don't really need to go to your Snowflake anymore, you don't need to keep it running. So if all of your aggregates need to be refreshed daily, you can just run Snowflake for four hours, maybe midnight to 4 AM, then suspend your Snowflake, and all the queries during the work day will be served from the cache. So that's a cost benefit, a cost saving opportunity.
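As a rough sketch of how such a pre-aggregation might be declared in Cube-style YAML, carrying on the hypothetical orders model from earlier; the member names and the refresh schedule here are illustrative assumptions, not a prescribed configuration:

```yaml
cubes:
  - name: orders
    sql_table: public.orders
    # ... measures and dimensions as sketched earlier ...

    pre_aggregations:
      - name: orders_by_status_daily
        measures:
          - CUBE.count
          - CUBE.total_revenue
        dimensions:
          - CUBE.status
        time_dimension: CUBE.created_at
        granularity: day
        refresh_key:
          every: 1 day        # e.g. rebuild from the warehouse overnight, serve from cache all day
```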
[00:23:04] Unknown:
Now that the conceptual elements of the dedicated semantic layer have been around for a few years, there have been a few different entrants into the market, some of which are still viable, some of which have faded or been acquired and integrated into other components of the stack. I'm curious how you have seen the overall understanding of the purpose and application of this conceptual technical element shift, and some of the ways that that has influenced the adoption and usage patterns of systems such as Cube? I think it
[00:23:39] Unknown:
definitely feels like it's maturing. In the earlier days, I would say 2019, 2020, there was a lot of excitement around the idea of the metrics layer or metrics store, but it was very unclear how to use it, what exactly it is. Right? Is it a catalog of the metrics where people can collaborate and comment, like, oh, this is a nice metric, who created it, can you tell me more about the meaning of that metric? That was more the use case of the metrics store. And on the other hand, there was the headless BI side of that story, which was more tailored towards embedded analytics use cases, more like how we build a data app. So there was a lot of uncertainty about how the semantic layer should work and what place it should take in the data stack. I feel like in the last few years it has definitely matured in terms of understanding.
So now when I talk to someone at a conference, or just someone out in the world from the data community, most people know about the semantic layer, and they understand what benefits a semantic layer can bring to the table. So that definitely feels different now. We haven't touched on that yet, but in terms of the use cases and the place of the semantic layer, recently more people have started to talk about how it can help with an AI based stack, especially around the SQL generation to the warehouse.
I think that was probably the most recent addition to the use cases that can be powered by a semantic layer. But other than that, for most of the core use cases, it feels like we are settling down on the use cases for the semantic layer specifically.
[00:25:33] Unknown:
This episode is brought to you by Datafold, a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source to target replication. Leverage Datafold's fast cross database data diffing and monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold today.
And in the technical evolution of the Cube product in particular, since that is what you're most familiar with, but also just generally in the ways that people are thinking about the implementation and integration of the semantic layer, I'm wondering what are some of the most challenging engineering aspects of building the system, in terms of ease of adoption and developer experience, while at the same time trying to balance those aspects of performance and ease of integration.
[00:26:42] Unknown:
In terms of challenges, I think we have at least three big, complex areas of deep engineering. One is what we call the SQL API. Essentially, it's an interface through which Cube can be queried. When we think about different data consumption and data visualization tools, they can currently get data from Cube through four interfaces: the REST API, the GraphQL API, the SQL API, and MDX. For example, for Tableau to connect to Cube, Tableau would connect to Cube as if it were a Postgres database. So Tableau will just generate SQL code for Postgres and then send that SQL code to Cube. Cube will take that SQL, look at it, and create new SQL code, the real SQL based on the data model and all the calculations, that needs to be sent to Snowflake. Then Cube will send that SQL to Snowflake, execute it, and send the results all the way back to Tableau. Right? So the big technical challenge is how we translate from that incoming SQL from Tableau into the outgoing SQL to Snowflake. Because we essentially need to implement a SQL language: building a SQL parser, SQL planner, and SQL analyzer that can also understand multidimensional queries. So that's been a big challenge. Our team is using a lot of math, cutting edge math here, to essentially rewrite the query from the multidimensional form into the tabular form.
So that's been an interesting area of development. The other two big areas: one is essentially building our caching engine, because we want to make sure it's fast. It's built on top of Apache Arrow DataFusion, which is a query engine written in Rust on top of Apache Arrow, but we added a lot of our own development to that. It's all open source too, so it's easy to check it out. It's written in Rust, as is our SQL API endpoint. And the final challenging part from an engineering perspective is the data modeling framework itself: how do you deal with things like fan-outs and traps, how do you make sure that you can model different measures, build correct relationships, all of that? So that's a big piece of it. I would say these three areas are the most technically challenging: the SQL API, the caching in Cube Store, and the data modeling framework.
[00:29:21] Unknown:
In terms of the developer experience, as you were saying earlier around the data modeling question, you would probably use Cube as the so-called mart layer, in dbt parlance. And so for people who are maybe using dbt as their overall workflow, what are their options for being able to effectively treat Cube as that materialization and transformation layer, so that they can use a single interface for doing all of their modeling, but have the option of being able to split the underlying compute substrate across the boundaries of warehouse versus semantic layer?
[00:29:58] Unknown:
Yeah. That's a great question. We have a decent number of dbt users, so we've been thinking about that problem: how do we design the developer workflow so that it's a very straightforward and streamlined process going from transformations in dbt to the modeling in Cube? The good thing here is that dbt is code first. Right? So, essentially, we can combine the two products because they're both code first. What we built is essentially a blueprint, but also a code integration between Cube and dbt.
So what users can do is, once they build their staging models in dbt, we can bring them over to Cube through our Python library. Essentially, it can read the manifest file with all the models defined in dbt. And then, based on these models, Cube users can create cubes. Cubes are the first layer in our semantic modeling. We have cubes, which are usually normalized entities; they're usually a one-to-one mapping to staging models in dbt, with all the dimensions. And then, on top of cubes, users define all the measures. Once that piece is done, and the dimension part kind of happens automatically because you're just bringing your definitions over from dbt, then you define what we call views on the Cube side. A view is a denormalization point. It's essentially just saying, I want to take these five cubes, put them all together as a data mart, and then expose it to the world. So views are the more outward facing representations of the data model. That's how the workflow usually happens right now. And many of our users keep it under the same repo.
So it's a dbt folder, essentially, and a Cube folder, and you can first do your dimension changes in dbt and then work on the measure related changes in Cube.
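A rough sketch of what that manifest driven modeling can look like, assuming a Jinja helper in the spirit of Cube's dbt integration that exposes the dbt models to the YAML templates; the helper names (`dbt_models()`, `model.sql_table`, `column.type`) and the file layout here are illustrative assumptions rather than a definitive API:

```yaml
# cubes generated from dbt staging models: dimensions carried over from dbt,
# measures added on the Cube side
cubes:
  {%- for model in dbt_models() %}
  - name: "{{ model.name }}"
    sql_table: "{{ model.sql_table }}"

    dimensions:
      {%- for column in model.columns %}
      - name: "{{ column.name }}"
        sql: "{{ column.name }}"
        type: "{{ column.type }}"
      {%- endfor %}

    measures:
      - name: count
        type: count
  {%- endfor %}
```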
[00:31:56] Unknown:
So when you were on the show, about two years ago at this point, I think you had just started the process of doing a big rewrite in Rust. I think that the original name was actually Cube.js, because you were focusing on JavaScript as that execution layer. I'm curious how that evolution of the technical underpinnings of the project and of the product has shifted the overall scope and goals of the project, and some of the ways that it has shifted your thinking about how best to manage this as a production grade project for people to be able to rely on, and some of the ways that you can incorporate additional features because of those performance gains.
[00:32:43] Unknown:
Yeah. That's true. When we first started Cube, we called it Cube.js because it was written in JavaScript, on the Node.js runtime, and our data modeling framework was JavaScript based. When we were releasing it, we thought we could either call it Cube or call it Cube.js. The reason to call it Cube.js was to try to minimize the noise, because if you just go and Google "cube" you find a bunch of stuff, and it would be hard to find a new open source project. So what we did is we called it Cube.js. I think it definitely helped people find the open source project more easily, but it also created a bit of a wrong impression about the product, because many JavaScript charting libraries have .js in their name. Think about Chart.js, Highcharts, and all of that. So many users, many people from the open source community, were thinking about Cube as a charting library. So we thought maybe we needed to get rid of the JavaScript. And at the same time, as we started to think about getting rid of JavaScript in the name, we realized that we would need to change the data modeling language from JavaScript to YAML and Python based, because in the data engineering community you just have more people familiar with YAML and Python based workflows than with JavaScript. So we kicked off that process. And a separate stream was that we kicked off the process of rewriting, or making changes to, the code base, going from JavaScript to Rust. We built our caching engine fully in Rust, which is based on DataFusion, and then we built the SQL API, which is the connection point between BI tools and Cube, fully in Rust as well. And then we started to rewrite some of the data modeling framework pieces in Rust too. So I feel like Rust definitely helps us speed things up in many areas. Caching was a very obvious choice when we started to build our own caching engine: we were thinking either let's do it in C++ or in Rust, because you cannot do it in JavaScript, obviously, for performance reasons. But now, as we've adopted more Rust at Cube and more and more core developers are very fluent in Rust, we're seeing a lot of opportunities, especially around the data modeling framework, to rewrite things in Rust. That will help us speed up the query generation.
SQL query generation is not usually a big problem in terms of latency, but in some edge cases it can be. Sometimes it can take, like, a second to generate, maybe in 0.1 percent of cases, but it still can happen. So in those areas, we can use Rust to improve the SQL generation. And in general, on the transport side, you still need to transform data when you load it from Snowflake, transform it in a different way, and send it back to the tool that requested it. That's a CPU intensive kind of workload that can be rewritten in Rust to improve the performance. So we see a lot of opportunities to use Rust right now to improve performance.
[00:36:09] Unknown:
Being open source as the foundational component of the technical stack obviously helps with the adoption process, because you don't necessarily have to go through the whole sales cycle just to be able to test something out. It also simplifies the work of being able to incorporate the Cube product earlier into the development cycle, as well as throughout CI and into production. I'm wondering how you're thinking about the questions of project governance for the open source, as well as the product and business strategies, in order to understand where the boundary lines are between when you want to use the open source and when you want to use the paid product, and how to manage that delicate balance of not cannibalizing the open source in favor of the business and not letting the business go under in favor of supporting the open source.
[00:37:01] Unknown:
Good question. It's a question I get asked very often. I think we try to be as honest as possible here, and I feel like transparency is key regardless of what you're doing. Because as long as you're being transparent, and you say what is here, what is there, and what is on the open source roadmap versus what is on the cloud roadmap, you manage the expectations of the community, and being very honest always pays off. At the end of the day, it's features. Right? Some of the features are going to be in open source, some of the features are going to be in the commercial product, and there is no way around it. And by features, I'm thinking on a bigger scale, because many software projects are layered, like onions.
You've got a bunch of stuff in the core, and then you put something on top of that. So some features on top of that can go into the cloud and commercial product, and some features may go into open source. Many of these decisions are tactical, ad hoc decisions. But as long as you're being honest and saying, these features are in open source and we're not going to take them away, they're going to remain in open source, but these other features are going to be in cloud. If you want those features, you can buy them or you can write them. It's open source: you can create your own library, you can build another layer in that onion yourself. But you have to maintain it. Right? You have to write that code.
Writing code means you spend some time, and someone needs to pay for that time. Right? So at the end of the day, someone is going to pay for that work. But there is an option: if the ecosystem is open, organizations or individuals can develop their own plugins, extensions, or layers around it. And as an open source maintainer, as long as you are very open and transparent, saying, here are the core features, the core open source product, and here are the extension points for how you can build your own stuff. If you want to build it, build it. If you don't want to build it, you can buy it in the commercial product. That's the framework we're trying to go with, and then you have to deal with a lot of tactical decisions about what exactly needs to go where.
[00:39:23] Unknown:
And to that point, the technical decisions of what belongs in the open source versus not go beyond just the question of whether it's in the open source versus the paid product; there's also the question of whether it belongs in the scope of Cube at all, or whether it should be part of a completely separate piece of the technical stack or a completely separate project. I'm curious what are some of the tensions that you are coming up against as people adopt Cube and have their own concepts of what should be in scope versus what shouldn't, or things that they're trying to make it do that it wasn't designed to do, and some of the potential directionality that that influences as you continue the evolution of Cube, and just some of the ways you think about what is in scope, what is out of scope, and what is just pie in the sky thinking?
[00:40:13] Unknown:
We have many of those, but probably the one which always follows us is the BI and visualization part of it. In Cube, we have what we call a playground. Think of the playground as, you know, the query builder interface in any BI tool. Looker, for example, I think calls it Explore: you have a bunch of measures and dimensions, usually on the left side, then you have a chart on the right side, and you can select and drag and drop things to build a chart. Every BI has it. We have it, and we call it a playground. The reason we have it is that when you build your data model, you probably want to test your data model to make sure the numbers are actually correct, to know what kind of measures you have and how you can play with them. So you need that tool, and we've had that tool from the very beginning.
We recently did a big update to that tool, and it now looks nicer. And it always triggers the conversation: oh, now Cube is a BI tool because it has that. And I don't think we are a BI tool just because we have it, even though, it's true, that's a big part of a BI. But that's something we've always been thinking about a lot at Cube: do we want it to become a BI tool? I don't think we do. I don't think the vision for Cube is to be a BI tool. Are we going to have some features that overlap with BI? Yes. This query builder, the playground, is a good example, because we just have to have it so people can test the data model. But that's an area where we try to keep the balance, and we don't want to go into the BI world. We don't want people to think about Cube as a replacement for Tableau or a replacement for Looker, because we're not going to invest a lot of time in charts, in dashboards, and all of that. We are focused on data modeling. We are focused on the metadata, on things related to metadata management, to the data model, to governance in general.
That's what we are. And I know that, historically, that's probably been a big part of the BI. So from that perspective, yes, we are close to the BI, but we're not trying to replace Tableau from the visualization perspective.
[00:42:22] Unknown:
And in your experience of working on Cube, working in the space of the semantic layer and data modeling, what is one of the most interesting or innovative or unexpected ways that you've seen the Cube project used? Because Cube in open source is very modular, so you can just take pieces of Cube out and try to
[00:42:41] Unknown:
use maybe only the data modeling framework or, like, the SQL generation. So in open source, it's easier not to use everything together; you can just use pieces of it. I saw organizations building internal experimentation platforms with Cube being the framework to model metrics. They took our data modeling framework, changed it a little bit for their use case, and turned it into the metrics modeling framework for an experimentation, like A/B testing, platform, which we never intended Cube to be. Right? But it was an interesting use case. So because of this modularity and embeddability, you can see a lot of different random use cases.
[00:43:24] Unknown:
Another aspect of having the semantic layer decoupled from the business intelligence system is, as we talked about, having the potential for multiple different data consumers, multiple different clients, and access patterns. I'm wondering how you have seen that change the ways that data teams think about what it means to deliver data to the organization, and some of the ways that it has changed the organizational appetite for data exploration, given that it does open up the arena for having these multiple different bespoke use cases that don't necessarily all have to coordinate through a single tool.
[00:44:04] Unknown:
Yeah. It changes how the data teams think about the data in the sense that they start to think about data as a product, closer to how software engineering teams think about their work in general. A software engineering team, regardless of whether they're building a customer facing product or some internal product, thinks about delivering a product, which is usually a piece of software, and then iterating on that product: making changes, having a version control system, delivering updates. So I think with having a semantic layer in place that can potentially offer multiple APIs for the data consumers, but also in general with all the improvements that bring the data engineering workflow closer to the software engineering development cycle, you know, code first, different environments, version control, all of that, it feels like more data engineering teams just generally think like software engineering teams, in terms of delivering products.
And in the case of data, it's data products. Right? In the Cube world, that would be a Cube view, the multidimensional data mart, as your final data product. So now you have that artifact, that resource, that you can give to different teams. You can give it to your analysts, who can build a chart in Tableau. You can give it to the front end team, who can use it inside their front end application. You can even give it to the customer. We have customers who are selling this to their own customers, so they're essentially monetizing their data products: they're building some datasets and then exposing the SQL endpoint to their customers so they can just consume that data. So I think that's an interesting shift that I saw, as many teams started to think about their work as product work.
[00:45:54] Unknown:
And in your experience of working in the space, building Cube, working with customers and end users and the broader data community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:46:07] Unknown:
That's a very good question. I feel like the data ecosystem went through a few interesting eras, let's say, at least in the last four years or so. First, we all went through the high of zero interest rates, and a lot of people were thinking about data differently back then, that we needed more tools to solve every small problem in the data stack. And now we've all gone to the other side of the spectrum: we need fewer tools, and we need to consolidate. I think at different times, I saw different trends.
I think nowadays, what I'm wondering about is how the data ecosystem, and the data engineering ecosystem in general, will work with AI and what the crossover is. I go to some data conferences, you know, modern data stack conferences, and I don't see a lot of talks about AI. It feels like some talks are mandatory talks, you have to talk about AI because it's 2023, 2024. But I still don't see a lot of intersection between data and AI specifically. It feels like they're two different worlds right now, even though the foundation of AI is data. Right? You need to have data in place to run good AI. So I think that's something that was maybe a little unexpected when the whole AI hype started. I expected to see more people from the data community lean into AI, which didn't happen.
[00:47:46] Unknown:
And for people who are building their data systems, maybe they're trying to branch out to have more consumers of the data in their warehouse, what are the cases where Cube, or just more generally a semantic layer, is the wrong choice?
[00:48:01] Unknown:
I think if it's a small organization, say a 50 person org with their first data hire, there are so many things that person needs to do to deliver value, and maybe doing ad hoc reports is totally fine at this point. Just writing SQL in Snowflake and getting all the foundational pieces in place: have an ETL in place, make sure you have a warehouse, then probably choose the first BI tool, which can be, you know, Metabase or Superset open source, because at this stage you probably don't want to pay for an expensive BI tool. I think at this point the organization still doesn't need a semantic layer.
Now, maybe once it gets to the point where, okay, we have Metabase, and we have more people trying to use Metabase in a self serve way, how do we tell them what we have, what measures and metrics they should look at? At this point, the data team might have at least two or three people on it. That's probably the first time an organization can think about a semantic layer. But, generally, as an organization matures, as you get more data folks on the team, as you get more BI tools in place, the need for the semantic layer gets bigger. So I would say it's definitely a wrong choice for, like, a 50 person company with one data hire. But as we go bigger, the need for it increases.
[00:49:26] Unknown:
And as you continue to build and iterate and evolve the Cube project and invest further in this semantic modeling space, what are the things you have planned for the near to medium term? I think the big thing that we started to lean into
[00:49:39] Unknown:
is AI, last year and this year. I think one of the use cases for Cube, and for the semantic layer in general, is how we go from natural language to SQL, and that's something that can have many different applications. It can be used internally. You could wanna build a Slack bot. You could build an AI agent that incorporates some queries to your warehouse. But in general, when you need to have an AI agent that can execute queries against the warehouse, you probably need a semantic layer. Because if you think about it this way, we already have a lot of data in the warehouse. If a human needs to access that, we need to write SQL.
It's gonna be the same case for AI agents. If they need to access the data to calculate some analytics around something, they would need to go and write SQL against Snowflake. There is no way we're going to take all that structured data out of the warehouse and put it into the context, and even if you did that, it's not going to work because, you know, LLMs are probabilistic. So they need to write SQL. Now the question is, how can they write SQL? Can they generate a SQL query? I think they can, because they've seen so many examples of SQL queries out in the world, but they just don't know what exact SQL query to generate for your warehouse specifically. So the simplest approach would be, like, let's download the DDL for your tables and just give it to the AI. People did that. They ran benchmarks on it, and it only gives, like, 17% accuracy or something, because you just don't have enough context about your columns, about your information. So I think the solution here is to use a semantic layer or a knowledge graph, any way to describe your ontology, your semantics, and your data. You give all that context to the AI agent or LLM, and now it can generate really accurate SQL queries. And especially if you generate the SQL queries not directly against the warehouse, but back to your semantic layer, that can act as an additional validation point. It increases the accuracy even more, and you also get all of your security access, caching, all the benefits on top of this. So what we're building at Cube now is a few things, but the first foundational thing is an API endpoint where you can just send a text query to Cube, saying, like, hey, give me data, essentially. And Cube will generate a SQL query on top of it, execute that query, and do the same thing we do for BI tools right now, but for natural language. And that's going to be very accurate because we're going to use the same data model we already have. That can have a very wide range of applications in building chatbots, in building AI agents, and in having some sort of generative BI capabilities internally. So that's an exciting thing that we're working on right now.
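To make the pattern described above a bit more concrete, here is a minimal sketch of the flow: feed the LLM the semantic layer's model (measures, dimensions, descriptions) rather than raw DDL, have it emit a query against that model instead of raw warehouse SQL, and let the semantic layer validate the query, apply access control and caching, and compile it to SQL. The endpoint paths, payload shape, and llm() helper are illustrative assumptions for the sketch, not Cube's actual API.

```python
# Sketch only: hypothetical semantic-layer endpoints and a pluggable llm() callable.
import json
import requests

SEMANTIC_LAYER_URL = "https://semantic-layer.example.com"  # hypothetical deployment


def build_prompt(question: str, model: dict) -> str:
    """Embed the semantic model in the prompt so the LLM chooses from known
    measures and dimensions instead of guessing at warehouse columns."""
    return (
        "You answer analytics questions by emitting a JSON query against this "
        f"semantic model:\n{json.dumps(model, indent=2)}\n"
        'Respond with JSON of the form {"measures": [...], "dimensions": [...]} '
        "and nothing else.\n"
        f"Question: {question}"
    )


def answer(question: str, llm) -> list:
    # 1. Fetch the semantic model: names, types, human-readable descriptions.
    model = requests.get(f"{SEMANTIC_LAYER_URL}/meta").json()

    # 2. Ask the LLM for a semantic-layer query, not raw warehouse SQL.
    query = json.loads(llm(build_prompt(question, model)))

    # 3. The semantic layer validates the query against the model, applies
    #    security and caching, compiles it to SQL, and runs it on the warehouse.
    resp = requests.post(f"{SEMANTIC_LAYER_URL}/load", json={"query": query})
    resp.raise_for_status()
    return resp.json()["data"]
```

The key design point in the conversation is step 3: because the generated query targets the semantic layer rather than the warehouse, an invalid or out-of-scope query fails validation instead of producing wrong SQL.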
[00:52:34] Unknown:
Are there any other aspects of the Cube project specifically, or the overall space of the semantic layer, that we didn't discuss yet that you'd like to cover before we close out the show? I think that's it. I think we covered a lot of things. Thank you. Those were all great questions. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. That's a great question. I feel like the missing piece here is around
[00:53:11] Unknown:
the catalog that is closer to the business consumers and that is more friendly for business consumers to use. And I will go deeper on this right now. I think with catalogs, we got really good development in the last several years. There was the first wave of catalogs: Alation, Collibra. They offer the tools to do the inventory of your data and to create all these different descriptions of the data assets within your organization. A lot of teams successfully use them, and then we got a sort of newer generation of these catalogs, like Atlan. But I think the problem is that the catalog space is too big and too vague, in the sense that it spans everything from a low-level catalog, you know, like, pipelines and Airflow jobs, all the way up to the things that business users care about, like charts, dashboards, queries, all of that. So I feel what is missing is a little bit more focus on the side of the spectrum that data consumers care about: what actual dashboards exist, how we can find the data, what metrics we have, what charts we can use. And when I think about that problem, I'm wondering why no one is using AI for it. Maybe someone is doing that already, but I don't see products on the market right now that use AI for this, because it makes so much sense in terms of discoverability in a catalog. If I just joined a marketing team as a new hire, it would be good to go to some place and say, like, hey, I just joined the marketing team. What dashboards should I look at? What metrics should I worry about? So there are still a lot of things that AI can capture and change here from the cataloging perspective. I think we'll see a lot of good, interesting developments here soon. I'm sure someone is working on that right now. I just don't see it. Absolutely. The overall catalog and discoverability
[00:55:10] Unknown:
space is evolving, but I do think it is still a little bit underserved. Alright. Well, thank you very much for taking the time today to join me and share the work that you've been doing on Cube and your perspective on the semantic layer and how it fits into the overall data platform and data ecosystem. It's definitely a very interesting problem space, and it's great to see the work that you and your team are putting into it. I hope you enjoy the rest of your day. Thank you. Thank you for having me today. That was a great conversation.
[00:55:42] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview
Guest Introduction: Artyom Keydunov
Artyom's Journey into Data Engineering
Understanding the Semantic Layer
Challenges in Data Modeling and Consumption
Evolution of the Metrics Layer
Data Warehousing and Semantic Layer Integration
Adoption and Migration to Semantic Layers
Performance and Caching Benefits
Market Maturity and Use Cases
Technical Challenges in Building Cube
Integration with dbt and Developer Workflow
Technical Evolution: From JavaScript to Rust
Open Source Governance and Business Strategy
Scope and Limitations of Cube
Data as a Product
Lessons Learned in Data Engineering
When Not to Use a Semantic Layer
Future Plans for Cube
Closing Remarks