Summary
The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Your host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Metriql is and the story behind it?
- What are the characteristics and benefits of a "headless BI" system?
- What was your motivation to create and open-source Metriql as an independent project outside of your business?
- How are you approaching governance and sustainability of the project?
- How does Metriql compare to projects such as Airbnb’s Minerva or Transform’s platform?
- How does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?
- What are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?
- Can you describe how Metriql is implemented?
- How have the design and goals of the project changed or evolved since you began working on it?
- What are the most complex/difficult engineering elements of building a metrics layer?
- Can you describe the workflow of defining metrics?
- What have been your guiding principles in defining the user experience for working with metriql?
- What are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)
- What are the biggest challenges and limitations of creating metrics definitions purely in SQL?
- What are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?
- What are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?
- What are the most interesting, innovative, or unexpected ways that you have seen Metriql used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?
- When is Metriql the wrong choice?
- What do you have planned for the future of Metriql?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Metriql
- Rakam
- Hazelcast
- Headless BI
- Google Data Studio
- Superset
- Trino
- Supergrain
- The Missing Piece Of The Modern Data Stack article by Benn Stancil
- Metabase
- dbt
- dbt-metabase
- re_data
- OpenMetadata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack. So Burak, can you start by introducing yourself? Hey. It's Burak.
[00:02:07] Unknown:
I founded Rakam 5 years ago. I'm a data engineer turned entrepreneur. Right now we have a team at Rakam, and we have this Metriql as a new product under the company. And do you remember how you first got involved in data management? When I was studying, I was trying to build up some projects for people. They were mostly B2C projects, e-commerce or media streaming websites. After a few failures, I realized that I actually liked these technical challenges more, rather than building these B2C products. At that time, big data was the hype, and I decided to apply for internships.
That was, speaking specifically, one company called Hazelcast. They are an in-memory database, basically. Like, they are building an in-memory database. I learned, like, open source stuff, enterprise software, distributed systems, concurrency, working at Hazelcast long term, like, 1 year or so. That's how I got into this space. And so now that brings us to what you're building with Metriql.
[00:03:13] Unknown:
And I'm wondering if you can just discuss a bit about what it is and some of the story behind how it came to be and sort of the goals of the project.
[00:03:21] Unknown:
Metriql is headless BI. Some people also call it a metrics store. This is relatively new. It is like a business intelligence solution, but it doesn't come with the user interface. It sits in between the data tools and the data warehouse as a middleware layer. Your data tools connect to Metriql, and then we just rewrite the query and, like, run the query directly in your data warehouse. The goal is that, like, we usually talk in terms of the database tables and columns. But the business people, the people who analyze the data, usually talk in terms of the metrics. So Metriql lets you model your data in your database, define your metrics, and expose these metrics to the data tools that they are using so that you don't need to define them in every tool again and again.
And the story is that Rakam is actually a product analytics solution. Right now, we have this verticalized analytics tool, like a UI, which lets you analyze the data, product data, customer data in your database. So the way it works is that you connect to your data warehouse. You model your data. You can, like, mark a table as an events table. And you can tell the system that this is my user ID, this is my event timestamp, this is an event type called page view. And then we expose all these, like, views, the data, to product people. And the product people run behavioral analytics queries like funnels, retention, segmentation directly on top of their database. So this is what Rakam is all about. But building that product wasn't that easy. We had to build this data modeling language to be able to understand the data in a better way. And at some point, we realized that actually this metrics store, like, what we are building, is essential not only for this use case, the product analytics use case, but also for the other data tools, the BI tools. So we decided to separate Metriql from Rakam. This is actually the underlying architecture of Rakam, but we used a different name, Metriql.
And we started, like, building tight integrations with the BI tools. And, like, it's been around 5 months, I guess. It is relatively new, but we are still progressing and learning about the use cases.
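To give a rough picture of the event modeling described here, marking a warehouse table as an events dataset might look something like the sketch below in a YAML data model. The property names (category, mappings, and so on) are illustrative assumptions rather than Rakam's or Metriql's exact schema.

```yaml
# Hypothetical sketch: declaring a table as an events dataset and telling the
# system which columns hold the user id, event timestamp, and event type.
# Key names are assumptions for illustration only.
models:
  - name: pageviews
    meta:
      metriql:
        category: events              # treat this table as an event stream (assumed key)
        mappings:
          user_id: user_id            # column identifying the user
          event_timestamp: occurred_at
          event_type: "'pageview'"    # constant event type for this table
```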
[00:05:43] Unknown:
And so in terms of the sort of headless BI system, you mentioned that it is roughly equivalent to the idea of the metrics store, which is something that's been gaining some popularity in the past few months to a year or so. And I'm wondering if you can just talk through some of the similarities and distinctions about the idea of a headless BI and how it relates to a metrics store and sort of your thoughts on which terms to use where and sort of where that focus might end up. There are no strict boundaries. Headless BI, like, usually means that there is no UI. It's a BI solution. You still, like, model the data. You still define your metrics, but
[00:06:24] Unknown:
you are, like, exposing them through an API or through, like, an integration with the BI tools. Usually, the way it works is that most of these advanced BI tools provide a way for you to define your metrics inside the application. So if you are using Tableau, this is like Tableau expressions. If you are using Power BI, they have something called MDX. And if you are using Superset, you define your metrics as SQL. If you are using Metabase, they have their own expression language. So each tool has its own metric definition, but this is not standardized. We have also, like, seen this configuration-as-code trend recently.
So there are these dbt- and Airflow-like projects where you define your data as code, your data models as code. They transform the data. They test the data. But when it comes to the metrics, you still need to interact with the BI tools one by one. So there is this missing layer in between the transformation or data modeling tools and the BI tools. Metriql tries to fit into that area. So I think that headless BI is a better term than metrics store. Because when I think about a metrics store, I actually think of it as an end-to-end product, an end-to-end solution from data collection to data analysis. This is, like, what the other products are all about, Minerva or Transform's product. So that's why we usually use this headless BI term. And in terms
[00:08:03] Unknown:
of the product itself, you mentioned that you released it as open source. And I'm wondering what the motivation was for releasing this project to the community, given that it is something that is powering your business, and just the overall sort of goals and complexities involved in making that move?
[00:08:21] Unknown:
Actually, when we started Rakam 5 years ago, we had this product called Rakam API. It is still open source. So it started as an open source solution from the first day. And I learned some open source stuff while I was working at Hazelcast. And I realized that, I mean, this is a great opportunity for an engineer, promoting this project initially. So that's why I was mostly eager to build an open source project in the beginning. But when it comes to Metriql, I think such a solution, such a metrics store solution, should be open source to be able to get it adopted by data tools.
So there are many different products like Looker or Tableau. And if you just build a product and expect the other tools to integrate with you, it's not gonna work. Essentially, like, what we are trying to build in the future is that we are building these integrations one by one, but we expect the data tool vendors to integrate their products with Metriql in the long run. We got our first contribution last week, and I think if it weren't open source, no other, like, data tool vendor would try to integrate with Metriql. So in order to get adoption, we should be open source.
[00:09:40] Unknown:
That's what we are thinking of. And as far as the sort of ongoing management and governance and sustainability
[00:09:47] Unknown:
of the project, I'm wondering what the approach has been there. This is a relatively, like, new project. Like, we are not sure, 100% sure where this is gonna go, but we are doing lots of customer development, trying to understand, like, the use cases and trying to build up some success stories. But in general, I think the future is the integrations. So we are not gonna be trying to build up different features inside Metriql for data testing or for analyzing data for BI or try to build up, like, something end to end. Rather than doing that, we should be integrating with other tools. And if you wanna have an end-to-end tool, it should be easy enough for you to just put Metriql and put a metadata catalog tool that talks to Metriql and use the data warehouse that integrates with Metriql.
And then you will be able to use the system in a modular way, where you just have the option of picking your data tool, your metadata tool, or your testing tool. If we can manage that, I think there's a great opportunity over here. And in terms of
[00:11:05] Unknown:
the Metriql project, you mentioned Airbnb's Minerva and the Transform platform. Obviously, Minerva is a bit harder to gauge because it's not public, but just sort of how you think about the capabilities and the use cases of Metriql as compared to Transform and some of the other metrics layers that are starting to come into the industry.
[00:11:34] Unknown:
Like, Transform is also not public. So we are not able to just try out Transform. But I have read, like, their blog posts. I have read Minerva's blog posts. So I know the general concept. And my understanding is that Transform is something like Minerva, but rather than having everything inside, Transform instead connects to your data warehouse, whereas Minerva is an end-to-end tool. In Minerva, if you read their blog posts, you will see that they mention the data warehouse part, the storage part, the UI part. So it's a well-integrated product that is, like, being used internally at Airbnb.
And Metriql is actually, like, trying to solve this metrics problem rather than being an end-to-end tool. It doesn't have its own storage mechanism. It just connects to your data warehouse. It doesn't do any caching internally. Everything will be inside your data warehouse, and we have deep integration with dbt. So we are more, like, an open source version of these metrics stores, one that tries to integrate with the similar tools that you may already be using, that tries to reduce the friction. And also, like, if you look at Transform, it is similar in the sense of, like, connecting to the data warehouse. But they still have a UI, a UI where you can just mark your, like, metrics, see the anomaly detection, or see the trends, or collaborate with your team members.
So their product is bigger than what we are trying to do. We are just trying to solve this metric definition problem, and that's it. As far as the sort of utility of metrics, I'm wondering if there is any impact
[00:13:32] Unknown:
on how they're used or how applicable, you know, the usage of a metrics layer in a headless BI is for different industries or verticals or if it is something that is sort of universal, independent of the types of data sources that businesses might be working with? In the beginning, I thought that this is gonna be an,
[00:13:52] Unknown:
enterprise product. Only the big companies which use multiple BI tools and data tools are gonna be using it. But we got our first contract last month. They are an ETL tool called Improvado. Like, they have this tool for marketers to collect all their marketing data into their data warehouse and play with the data with their BI tool. So what we are building with them is that they have their own system to get all the data from marketing systems, but they are using Metriql, specifically our Google Data Studio integration, to be able to expose this data to their users.
If you know this BI tool called Google Data Studio, it's actually great when it comes to, like, a drag and drop interface for marketers. They just connect the data, analyze the data. And, like, the way it works with Data Studio is that people usually ingest the data into BigQuery and then connect BigQuery to Data Studio. So even though their primary database is ClickHouse, they had to push all the data to BigQuery and then define these metrics inside Google Data Studio manually. So we automated this process on behalf of them. And right now, they have our white labeled version of the Data Studio connector. They are able to provide the solution in an automated way to their end, like, customers. That's not something that I was planning in the beginning, but this is just one use case that is interesting for me. In terms of the sort of definition
[00:15:30] Unknown:
of the metrics, I'm wondering what are some of the limitations to the level of complexity that you can actually achieve with the sort of calculation and definition of a given set of metrics before the computation introduces too much latency or before it becomes too difficult to figure out how to sort of layer the logic in a way that is maintainable?
[00:15:52] Unknown:
Yeah. I mean, we try not to do some magical stuff. Instead, we try to push all the work into your data warehouse. But when you, like, think of the data consumption layer, I mean, which tools are, like, getting data from your data warehouse and letting you analyze it, most of them are BI tools, the drag and drop BI tools. And these BI tools have a specific way to generate the query, and each BI tool has its own way. Essentially, they are generating SQL, most of them. In order to build up these integrations, we realized that, I mean, we cannot be just building any API and expect these vendors to work with us. Instead, we need to come up with a clever way to speak their language. So we decided to act as a database, which is Trino. Like, I mean, we are using the Trino interface. It was called Presto beforehand.
So the data tools are connecting to Metriql, but they think that it's Trino, using the Trino interface. Instead of, right, like, entering the Presto URL, people enter the Metriql URL, and then we expose everything as database tables and columns. The metrics are exposed as database columns, for example. And then we build extra integrations for the BI tools, like, to be able to differentiate the dimensions from the metrics, for example. In order to be able to do that, we need to parse the SQL that they generate, understand the dimensions and metrics definitions that are being used, and convert it to our DSL.
And after that, we compile the query for the underlying data warehouse. We are essentially parsing the SQL and generating SQL for you using the internal interface. And the cool thing that we can do with this tool is that you can actually speed up your queries. We have something called aggregates, where you define your metric and dimension pairs that you want to speed up. Since we are living inside dbt, we push our transformation work into dbt. We create dbt models, incremental or table dbt models, and expect you to just update them, use them in your project.
And if you have an aggregate and you are using Tableau, for example, to be able to analyze a metric with a dimension, we get this query and try to figure out if we can answer this query using our aggregates, which are basically materialized tables. And then if we can, we use that materialized table instead of going to the raw data. So that way, we are actually helping the data analysts and saving their time so that they don't need to create dimensional tables every time they get a new reporting request.
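To make the aggregates idea concrete, here is a minimal sketch of how a metric/dimension pair could be pre-materialized through a dbt model definition. The property names are illustrative assumptions rather than Metriql's verified syntax; the point is that the roll-up becomes an incremental dbt model that Metriql can route matching queries to.

```yaml
# Hypothetical sketch of an aggregate defined under dbt's `meta` block.
# Key names are assumptions; the concept is a pre-computed metric/dimension pair.
models:
  - name: events
    meta:
      metriql:
        aggregates:
          daily_events_by_country:
            measures: [total_events]      # metric to pre-compute
            dimensions: [country]         # dimension to group by
            materialize:
              type: incremental           # generated as an incremental dbt model
```

When a BI tool such as Tableau then asks for total_events broken down by country, the query can be answered from this materialized table instead of scanning the raw events.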
[00:18:53] Unknown:
And digging more into the Metriql implementation, I'm wondering if you can just talk through some of the design and architecture of the system and some of the complex and challenging engineering aspects of actually getting this set up? So, like, the hardest part was actually parsing these SQL queries.
[00:19:12] Unknown:
Because, like, BI tools are generating really complex SQL queries. The data warehouse, like, solutions are able to, I mean, parse them, compile them. But we need to understand the query in a better way. And to be able to do some cool stuff like materialized views, we need to be able to, like, push down some of the parts of the query or apply the projections that you have in your select query. To be able to do that, we are now using Trino's SQL parser. Luckily, I had the, like, privilege to use PrestoDB a couple of years ago for Rakam initially.
So I knew about the engine, how it works, and the SQL dialect. So we hacked Trino, basically. That's what I call it. We use its engine to be able to do that. We are not using Trino to be able to process and compute data. Instead, we use it as a proxy layer. So we try to minimize the work as much as possible. That's why we don't have this, like, transformation engine; we push all the work into dbt rather than having our own scheduler or transformation layer. And the hardest part for us in this solution was to find a way for us to talk this SQL language in metric terms. So, partially, we are able to push some of the data modeling into the YAML files. For example, if you have a join relationship between 2 different datasets, you define them in the data model. But our SQL layer, which we call MQL, is exposed to the BI tools. And this MQL is actually not able to do joins, for example. We push all these joins, all the sophisticated stuff, into the YAML and try to keep the SQL as simple as possible.
That's how we are able to solve this, like, problem when it comes to, like, advanced use cases. We don't need to define these join relationships inside Tableau, inside the BI tool. Because if we supported joins in the SQL, we would be, like, having some complex queries, and then parsing them would be harder. 2 different problems.
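As an illustration of pushing joins into the YAML data model rather than into the SQL that BI tools generate, a relation between two datasets might be declared roughly like this; the field names are assumptions for the sake of the example, not the exact Metriql schema.

```yaml
# Hypothetical join relation declared in the YAML data model, so MQL queries
# coming from BI tools never need to contain a JOIN themselves.
# Key names are illustrative assumptions.
models:
  - name: orders
    meta:
      metriql:
        relations:
          customer:
            to: customers          # target dataset
            source: customer_id    # column on orders
            target: id             # column on customers
            type: left_join        # how the two datasets relate
```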
[00:21:36] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its DATA DIFF feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. So when you first started working on the project, I'm wondering what were some of the ideas about how it would be used or how you would approach the actual implementation, or some of the assumptions that you had about that which have been either proven incorrect or had to be changed as you iterated on it and started using it internally and sharing it with end users?
[00:22:52] Unknown:
Actually, I first heard this headless BI term from investors, like Base Case VC. Later, they became investors in Supergrain, which is also a metrics store, but they don't have, like, a public solution at the moment. And then I read this famous article from Benn, founder of Mode, about this missing layer in the data space. So thinking about that problem, I started thinking, the most powerful feature of Rakam is actually this metric layer, so how can I use Rakam's engine to solve this problem? And then I started to try out different BI tools like Tableau, like Metabase, to try to understand how they work, if we can just build native integrations for each BI tool or not. And it turned out that they all have different plugin mechanisms, and it's not gonna be easy for us to write Clojure for Metabase and build up native drivers for each BI tool. I ran into this, like, specific project called dbt-metabase.
It is an open source project, like, on GitHub, and it talks to dbt. You define the YAML inside your dbt YAML files, and then they synchronize them, or you write the column definitions, etcetera. And they talk to the Metabase API and update these definitions. So it was interesting. I got in touch with the maintainers. We had a few chats together. And looking at the, like, tools, I decided to, like, act like this database, Trino. And I had a few tries to build up native integrations with Tableau and other BI tools. It didn't work. After a few iterations, we came up with this Trino approach.
But I didn't know about the use cases. What I see in the industry is that there are different, like, semantic layers, unified semantic layers, like AtScale, Arcadia, that are actually trying to do something similar without the metrics. They speed up your BI queries. And Power BI also has this MDX, which is the metrics layer. Looking into each BI tool and the history of this industry, it looked like most of them are actually trying to solve similar problems, but they call it differently. If you look at the unified semantic layer, AtScale's product is similar to what we are actually doing. But the difference here is that we are only focusing on this metrics layer, and then we are trying to build up integrations with most of the tools rather than trying to be an enterprise tool that only integrates with a couple of products in a well-managed way.
[00:25:40] Unknown:
And so talking through the actual usage of Metriql, I'm wondering if you can describe the workflow and syntax of actually building a set of metrics and then exposing that to a business intelligence system or a, you know, command line client that wants to be able to interact with these metrics and use them for their own analytics workflow?
[00:26:02] Unknown:
So building a developer experience product is not that easy because the developer experience should be frictionless. So we thought about, I mean, building our own, like, way for people to connect to the data warehouses. And then we quickly realized that actually, dbt has most of the stuff. They have these profiles where you just write the database credentials in the YAML files, and they connect to your data warehouse and transform your data. So rather than building our own one, we decided to use dbt's experience. So you initially create a dbt project, and you write your dbt models. If you don't want to transform the data, you can just create a YAML file and then define a data model that points to a table inside your database.
Once you create a dbt project and your first YAML files, they have something called meta. Under the metriql property, you define your metrics. If you have an events table, you can just create a YAML file and then create measures, something like total users or total events, and run dbt. When you run dbt, it creates a file called manifest.json, which is a metadata file. And we have a command line application which takes this manifest.json file as an argument. So once you model your data and define your metrics, you use our, basically, API server to start an HTTP server reading your manifest file. We talk to dbt's profiles.yml file. So you don't need to define the credentials separately for Metriql. Instead, we just read dbt's credentials.
And talking to dbt's metadata, we expose everything as, like, these database tables to your BI tools. But for each BI tool, we have different integrations. If you are using Tableau and use Metriql with it, you use the Presto interface. You connect to Metriql using a Presto connection. And then we have a simple dashboard where you just click the tool that you want to use. For Tableau, for example, they have something called TDS files, Tableau data source files. You just select the dataset that you want to analyze in Tableau, and then we create a TDS file. When you double click the TDS file, Tableau opens up and asks for the credentials. When you enter the credentials, you'll see that Tableau works with your metrics defined in your YAML files.
But if you are using an open source BI tool like Superset, in the dashboard, we ask for your Superset credentials. You enter it and click sync. We talk to the Superset API and synchronize all these metrics. So you can just go to Superset and connect to the database, which is basically Metriql. And when you drag and drop the values in the query builder, Metriql gets the query, parses the query, understands the metrics. And then since we know the definition of these metrics, we just compile the query for your underlying database and return the results back to you. If you are using Google Sheets, we have a connector as well. You can just bring your data from your data warehouse into your Google Sheets using our Google Sheets plug-in. So we are not just building this tool for the BI tools, but for all the data tools. But the analytics engineers are the ones who are writing and building these data models, defining the metrics, and maintaining the metrics as a service.
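Putting that workflow together, a first metric definition for an events table might look roughly like the following inside a dbt schema file. The measure names and keys are illustrative assumptions; the workflow itself (define metrics under meta, run dbt to produce manifest.json, point the Metriql server at it, connect BI tools through the Trino/Presto interface) is as described above.

```yaml
# models/schema.yml -- hypothetical sketch of defining measures under dbt's `meta`.
# Key names are assumptions for illustration.
models:
  - name: events                         # the events table in your warehouse
    meta:
      metriql:
        measures:
          total_events:
            aggregation: count           # count of event rows
          total_users:
            column: user_id
            aggregation: count_unique    # distinct users
        dimensions:
          event_type:
            column: event_type
            type: string
# After running dbt, the generated target/manifest.json is what the Metriql
# server reads; warehouse credentials come from dbt's profiles.yml.
```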
[00:29:53] Unknown:
And as far as the actual sort of collaboration opportunities for being able to work with the business users and people who don't necessarily want to dig into the engineering aspects of the metrics layer, how can teams support that type of contribution where the business user is able to identify or define their own set of metrics and then be able to actually expose that back to Metriql so that other people can take advantage of those definitions?
[00:30:26] Unknown:
We are not designing this experience, like, the modeling experience, for the business people. The data people, the analytics engineers, are the ones who are modeling the data. The business stakeholders are usually consuming the data. The data that they want to analyze should be consistent. If you have a metric, it should be the same in all the data tools. But if you want to, like, just experiment and understand it, we also have the SQL layer, where you can write Jinja expressions inside the native SQL query for your data warehouse. We also have different reporting types, like SQL, MQL, segmentation, funnel, etcetera.
But the data modeling part is, like, defining the configuration in a configuration-as-code manner. So the data people write these YAML files and then commit them into their git repository and create a pull request and collaborate with their team members, which is not possible in a Tableau workflow or Superset workflow. Everything is committed, like, in git. So the business people are not the ones who are defining the metrics here. But they are requesting the data people to get a metric and the dimensions. The business people are able to drill down a metric into different dimensions, into different time frames, without needing the data people to model the data or create these dimensional tables. That's how we are helping the business people. But if they want to have a new metric, it should be in the YAML file. As far as being able to
[00:32:08] Unknown:
decide what is the developer experience, what is the interface for being able to actually define and propagate these metrics, what is the process of sort of working through in sort of like a CICD approach of testing these metrics and working through them? And what were some of the challenges that you encountered in figuring out how you wanted to actually expose these definitions in a way that was maintainable and understandable to the broadest set of users?
[00:32:35] Unknown:
So we use dbt mainly for everything. dbt has tests. dbt has documentation where you can just click your dbt model sources and see your metrics. So we are just focusing on the metrics side; for the testing side, you are expected to use dbt. But the challenge that we have is that since we are generating these SQLs in an ad hoc manner, we are not able to use dbt to, I mean, define, like, set up some alarms on your metrics or define the tests inside it. This is something we are, like, still brainstorming. But for example, for testing, there is a project called re_data that works on top of dbt. So if we can integrate Metriql with re_data, we can just, like, tell people that if you want to create an alarm for your metric and use Metabase, you can just, like, run this query, define the boundaries, and then you can set up an alarm. Or if you want to test the metrics in your continuous integration environment, you can add these lines into your YAML file for re_data and make re_data operate on top of Metriql metrics. So this is how we are planning to implement it in the long run.
Rather than building it ourselves, we are trying to integrate these open source tools.
[00:34:03] Unknown:
For people who want to get set up and running on their own infrastructure, what's the process for actually deploying it and maintaining an installation?
[00:34:12] Unknown:
So it is basically a Docker container, like, a Docker image. We use Docker Hub to push the new versions. So, essentially, what you need to do is that you just pass this manifest.json file as a URL, as an argument to Metriql, as an environment variable. And it starts an HTTP server, like, reading these environment variables and getting this manifest file from your servers or from your dbt documentation. We have 1-click installers as well. If you are using Heroku, if you are using Google Cloud, you can just click the deploy with Heroku button. And then we use Docker under the hood to be able to deploy it to your Heroku account. But if you are using Kubernetes, you can use our Docker image to deploy it into your own infrastructure. It's all Docker.
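Since the whole deployment is a Docker image driven by a manifest URL, a docker-compose sketch could look like the following. The image name, port, and environment variable names are assumptions inferred from the description above, not verified configuration; check the project documentation for the exact values.

```yaml
# docker-compose.yml -- hypothetical sketch of running the Metriql server.
# Image tag, port, and environment variable names are assumptions.
version: "3.8"
services:
  metriql:
    image: metriql/metriql:latest        # assumed image name on Docker Hub
    ports:
      - "5656:5656"                      # port exposed to BI tools (assumed value)
    environment:
      METRIQL_MANIFEST_JSON: "https://example.com/dbt/target/manifest.json"  # URL to the dbt manifest (assumed variable name)
```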
[00:35:05] Unknown:
It's all Docker. In terms of the current state of the ecosystem, I'm wondering what are some of the potential improvements to how the metric layer can be leveraged. What are some of the missing elements that that you think will be filled in over the coming years to be able to take full advantage of this metrics layer? And what are some of the sort of untapped opportunities for this aspect of the sort of data production and data usage?
[00:35:30] Unknown:
So right now, this is, like, a new project. The way we approach this problem is that we should be the ones who integrate, like, these third-party BI tools, or the third-party data tools. So for now, we are focusing more on the integrations. One thing that is interesting for me is, like, this metadata space. There are different metadata tools like Amundsen, Atlas. And in the short term, we should be able to integrate with them, like, seamlessly. There is one specific project called OpenMetadata. I'm still trying to understand how we can integrate it, how we can leverage these metadata layers, and try to make it useful and have the metadata layers in place. But the challenge is that most of the tools are talking in terms of the database tables or database columns. But we are talking in terms of the metrics. We have this bridge with MQL. But in the long run, I expect the tools to be talking in terms of metrics more often, like the reverse ETL tools or spreadsheets or, like, the notebooks, etcetera. This is something we will probably see in 5 to 10 years. People will be talking more about the metrics.
[00:36:51] Unknown:
And so in terms of the applications of Metriql, either in the work that you're doing at Rakam or in the community as people have started to adopt the tool, what are some of the most interesting or innovative or unexpected ways that you've seen it used? So for now, like, it will probably be the same one, since we have the contract.
[00:37:11] Unknown:
So previously, our customer was using, like, a transformation tool to ingest the data into BigQuery to be able to use Data Studio. But with Metriql, you can just connect to ClickHouse from Google Data Studio. You can just get the data from ClickHouse or Snowflake directly into your Google Sheets. You can build up some Slack application if you want to analyze this data. One interesting use case for me was this, like, ClickHouse use case specifically. But, like, we probably have a couple of production deployments at the moment, and that's it. Over time, we will probably see more interesting use cases, but that's it for now. So for people who are interested in being able to take advantage of the opportunities that a metrics layer provides,
[00:38:08] Unknown:
what are some of the cases where Metriql is the wrong choice?
[00:38:12] Unknown:
Metriql can be, like, an overhead for small companies. If you have a single BI tool and if you don't have analytics engineers or developers working on it, or if you don't have a data department, you probably don't have the resources to write these definitions in the YAML files and deploy Metriql internally. So if you are using, for example, Metabase and it is working for you, it will be much easier for you to define the metrics inside Metabase or inside Tableau. But if you are a large organization with a data team, where you have different data tools to consume the data, it's a good choice. Otherwise, if you are consuming the data already in a single way, like using a single BI tool, then Metriql can be overhead for you.
[00:39:09] Unknown:
And as you continue to use Metriql and iterate on it, I'm wondering what are some of the things that you have planned for the near to medium term, or if there are any particular areas of help or contribution that you're looking for with the open source project.
[00:39:24] Unknown:
So for now, like, we are trying to build up tutorials, write better documentation, come up with some success stories. But, essentially, like, if you are familiar with Looker, I tend to think of Metriql as LookML for all the BI tools, LookML for Superset or Metabase. And I expect Metriql, like, to fill this gap of metric definition and complete this picture of configuration as code, like the data configuration as code, and then try to build up a consistent way, provide a consistent, like, view on top of the data.
[00:40:06] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think that this configuration
[00:40:24] Unknown:
as code will be the new hype. People will be defining, like, their data as code. We see this in DevOps. Right now, people are writing YAML files and managing their infrastructure as code. And the software engineers are invading the data space at the moment. So right now, like, there is this new role called analytics engineering, like, analytics engineers. I think that, like, the experience will be more focused on the developer side, which was on the, like, BI side in the past. So the data world is unbundling right now. Like, we see different tools like metrics or metadata tools. Previously, it was just BI, and Tableau was handling everything internally in a single product. Right now, we are unbundling everything. We have the data warehouse. We have the ELT tools to load the data into our data warehouse.
And even in the data warehouse, we see that there are different technologies emerging, like Iceberg, Apache Iceberg, because people want to separate their compute from the storage. Right now, in the storage, we have these, like, standardized table formats, like Iceberg, so that people can pick their table format and layout and use it. In our world, we see this metrics layer, and I expect to see more metadata-related things for business people to interact with the data and understand the data and test it and also
[00:41:59] Unknown:
document it to be able to access it. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Rakam and on Metriql. It's definitely a very interesting project, and I'm excited to see an open source offering for this headless BI and metrics layer. So definitely appreciate all the time that you and your team and the community have put into it, and I look forward to seeing where it goes in the future. And I hope you enjoy the rest of your day. Thanks a lot. We are hiring, by the way. Like, you can get in touch with me from my email,
[00:42:29] Unknown:
emre@rakam.io. We are especially hiring people who are gonna be working on the engine itself, and also building some tutorials. So if it is interesting for you, please do get in touch.
[00:42:51] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Burak Emre Kabakcı
Building Metriql and Its Purpose
Headless BI and Metric Stores
Open Source and Community Contributions
Comparing Metriql with Other Tools
Technical Challenges and Solutions
Implementation and Architecture
Use Cases and Customer Stories
Collaboration and Business User Integration
Deployment and Maintenance
Future of Metrics Layer and Ecosystem
When Metriql is the Wrong Choice
Future Plans and Contributions
Biggest Gaps in Data Management Tools
Closing Remarks and Contact Information