Summary
The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Your host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Metriql is and the story behind it?
- What are the characteristics and benefits of a "headless BI" system?
- What was your motivation to create and open-source Metriql as an independent project outside of your business?
- How are you approaching governance and sustainability of the project?
- How does Metriql compare to projects such as Airbnb’s Minerva or Transform’s platform?
- How does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?
- What are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?
- Can you describe how Metriql is implemented?
- How have the design and goals of the project changed or evolved since you began working on it?
- What are the most complex/difficult engineering elements of building a metrics layer?
- Can you describe the workflow of defining metrics?
- What have been your guiding principles in defining the user experience for working with metriql?
- What are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)
- What are the biggest challenges and limitations of creating metrics definitions purely in SQL?
- What are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?
- What are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?
- What are the most interesting, innovative, or unexpected ways that you have seen Metriql used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?
- When is Metriql the wrong choice?
- What do you have planned for the future of Metriql?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Metriql
- Rakam
- Hazelcast
- Headless BI
- Google Data Studio
- Superset
- Trino
- Supergrain
- The Missing Piece Of The Modern Data Stack article by Benn Stancil
- Metabase
- dbt
- dbt-metabase
- re_data
- OpenMetadata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack. So Burak, can you start by introducing yourself? Hey. It's Burak.
[00:02:07] Unknown:
I founded Rakam 5 years ago. I'm a data engineer turned entrepreneur. Right now we have a team at Rakam, and we have this Metriql as a new product under the company. And do you remember how you first got involved in data management? When I was studying, I was trying to build up some projects for people. They were mostly B2C projects, e-commerce or media streaming websites. After a few failures, I realized that I actually liked these technical challenges more, rather than building these B2C products. At that time, big data was the hype, and I decided to apply for internships.
That was, speaking specifically, one company called Hazelcast. They are an in-memory database, basically. Like, they are building an in-memory database. I learned, like, open source stuff, enterprise software, distributed systems, concurrency, working at Hazelcast long term, like, 1 year or so. That's how I got into this space. And so now that brings us to what you're building with Metriql.
[00:03:13] Unknown:
And I'm wondering if you can just discuss a bit about what it is and some of the story behind how it came to be and sort of the goals of the project.
[00:03:21] Unknown:
Metriql is headless BI. Some people also call it a metrics store. This is relatively new. It is like a business intelligence solution, but it doesn't come with the user interface. It sits in between the data tools and the data warehouse as a middleware layer. Your data tools connect to Metriql, and then we just rewrite the query and, like, run the query directly in your data warehouse. The goal is that, like, we usually talk in terms of the database tables and columns. But the business people, the people who analyze the data, usually talk in terms of the metrics. So Metriql lets you model your data in your database, define your metrics, and expose these metrics to the data tools that they are using so that you don't need to define them in every tool again and again.
And the story is that Rakam is actually a product analytics solution. Right now, we have this verticalized analytics tool, like a UI, which lets you analyze the data, product data, customer data in your database. So the way it works is that you connect to your data warehouse. You model your data. You can, like, mark a table as an events table. And you can tell the system that this is my user ID, this is my event timestamp, this is an event type called page view. And then we expose all these, like, views, the data, to product people. And the product people run behavioral analytics queries like funnels, retention, segmentation directly on top of their database. So this is what Rakam is all about. But building that product wasn't that easy. We had to build this data modeling language to be able to understand the data in a better way. And at some point, we realized that actually this metrics store, like, what we are building, is essential not only for this use case, the product analytics use case, but also for the other data tools, the BI tools. So we decided to separate Metriql from Rakam. This is actually the underlying architecture of Rakam, but we used a different name, Metriql.
And we started, like, building tight integrations with the BI tools. And, like, it's been around 5 months, I guess. It is relatively new, but we are still progressing and learning about the use cases.
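To give a rough picture of the event modeling described here, marking a warehouse table as an events dataset might look something like the sketch below in a YAML data model. The property names (category, mappings, and so on) are illustrative assumptions rather than Rakam's or Metriql's exact schema.

```yaml
# Hypothetical sketch: declaring a table as an events dataset and telling the
# system which columns hold the user id, event timestamp, and event type.
# Key names are assumptions for illustration only.
models:
  - name: pageviews
    meta:
      metriql:
        category: events              # treat this table as an event stream (assumed key)
        mappings:
          user_id: user_id            # column identifying the user
          event_timestamp: occurred_at
          event_type: "'pageview'"    # constant event type for this table
```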
[00:05:43] Unknown:
And so in terms of the sort of headless BI system, you mentioned that it is roughly equivalent to the idea of the metrics store, which is something that's been gaining some popularity in the past few months to a year or so. And I'm wondering if you can just talk through some of the similarities and distinctions about the idea of a headless BI and how it relates to a metrics store and sort of your thoughts on which terms to use where and sort of where that focus might end up. There are no strict boundaries. Headless BI, like, usually means that there is no UI. It's a BI solution. You still, like, model the data. You still define your metrics, but
[00:06:24] Unknown:
you are, like, exposing them through an API or through, like, an integration with the BI tools. Usually, the way it works is that most of these advanced BI tools provide a way for you to define your metrics inside the application. So if you are using Tableau, this is like Tableau expressions. If you are using Power BI, they have something called MDX. And if you are using Superset, you define your metrics as SQL. If you are using Metabase, they have their own expression language. So each tool has its own metric definition, but this is not standardized. We have also, like, seen this configuration-as-code trend recently.
So there are these dbt- and Airflow-like projects where you define your data as code, your data models as code. They transform the data. They test the data. But when it comes to the metrics, you still need to interact with the BI tools one by one. So there is this missing layer in between the transformation or data modeling tools and the BI tools. Metriql tries to fit into that area. So I think that headless BI is a better term than metrics store. Because when I think about a metrics store, I actually think of it as an end-to-end product, an end-to-end solution from data collection to data analysis. This is, like, what the other products are all about, Minerva or Transform's product. So that's why we usually use this headless BI term. And in terms
[00:08:03] Unknown:
of the product itself, you mentioned that you released it as open source. And I'm wondering what the motivation was for releasing this project to the community, given that it is something that is powering your business, and just the overall sort of goals and complexities involved in making that move?
[00:08:21] Unknown:
Actually, when we started Rakam 5 years ago, we had this product called Rakam API. It is still open source. So it started as an open source solution from the first day. And I learned some open source stuff while I was working at Hazelcast. And I realized that, I mean, this is a great opportunity for an engineer, promoting this project initially. So that's why I was mostly eager to build an open source project in the beginning. But when it comes to Metriql, I think such a solution, such a metrics store solution, should be open source to be able to get it adopted by data tools.
So there are many different products like Looker or Tableau. And if you just build a product and expect the other tools to integrate with you, it's not gonna work. Essentially, like, what we are trying to build in the future is that we are building these integrations one by one, but we expect the data tool vendors to integrate their products with Metriql in the long run. We got our first contribution last week, and I think if it weren't open source, no other, like, data tool vendor would try to integrate with Metriql. So in order to get adoption, we should be open source.
[00:09:40] Unknown:
That's what we are thinking of. And as far as the sort of ongoing management and governance and sustainability
[00:09:47] Unknown:
of the project, I'm wondering what the approach has been there. This is a relatively, like, new project. Like, we are not sure, 100% sure where this is gonna go, but we are doing lots of customer development, trying to understand, like, the use cases and trying to build up some success stories. But in general, I think the future is the integrations. So we are not gonna be trying to build up different features inside Metriql for data testing or for analyzing data for BI or try to build up, like, something end to end. Rather than doing that, we should be integrating with other tools. And if you wanna have an end-to-end tool, it should be easy enough for you to just put Metriql and put a metadata catalog tool that talks to Metriql and use the data warehouse that integrates with Metriql.
And then you will be able to use the system in a modular way, where you just have the option of picking your data tool, your metadata tool, or your testing tool. If we can manage that, I think there's a great opportunity over here. And in terms of
[00:11:05] Unknown:
the Metriql project, you mentioned Airbnb's Minerva and the Transform platform. Obviously, Minerva is a bit harder to gauge because it's not public, but just sort of how you think about the capabilities and the use cases of Metriql as compared to Transform and some of the other metrics layers that are starting to come into the industry.
[00:11:34] Unknown:
Like, Transform is also not public. So we are not able to just try out Transform. But I have read, like, their blog posts. I have read Minerva's blog posts. So I know the general concept. And my understanding is that Transform is something like Minerva, but rather than having everything inside, Transform instead connects to your data warehouse, whereas Minerva is an end-to-end tool. In Minerva, if you read their blog posts, you will see that they mention the data warehouse part, the storage part, the UI part. So it's a well-integrated product that is, like, being used internally at Airbnb.
And Metriql is actually, like, trying to solve this metrics problem rather than being an end-to-end tool. It doesn't have its own storage mechanism. It just connects to your data warehouse. It doesn't do any caching internally. Everything will be inside your data warehouse, and we have deep integration with dbt. So we are more, like, an open source version of these metrics stores, one that tries to integrate with the similar tools that you may already be using, that tries to reduce the friction. And also, like, if you look at Transform, it is similar in the sense of, like, connecting to the data warehouse. But they still have a UI, a UI where you can just mark your, like, metrics, see the anomaly detection, or see the trends, or collaborate with your team members.
So their product is bigger than what we are trying to do. We are just trying to solve this metric definition problem, and that's it. As far as the sort of utility of metrics, I'm wondering if there is any impact
[00:13:32] Unknown:
on how they're used or how applicable, you know, the usage of a metrics layer in a headless BI is for different industries or verticals or if it is something that is sort of universal, independent of the types of data sources that businesses might be working with? In the beginning, I thought that this is gonna be an,
[00:13:52] Unknown:
enterprise product. Only the big companies which use multiple BI tools and data tools are gonna be using it. But we got our first contract last month. They are an ETL tool called Improvado. Like, they have this tool for marketers to collect all their marketing data into their data warehouse and play with the data with their BI tool. So what we are building with them is that they have their own system to get all the data from marketing systems, but they are using Metriql, specifically our Google Data Studio integration, to be able to expose this data to their users.
If you know this BI tool called Google Data Studio, it's actually great when it comes to, like, a drag and drop interface for marketers. They just connect the data, analyze the data. And, like, the way it works with Data Studio is that people usually ingest the data into BigQuery and then connect BigQuery to Data Studio. So even though their primary database is ClickHouse, they had to push all the data to BigQuery and then define these metrics inside Google Data Studio manually. So we automated this process on behalf of them. And right now, they have our white labeled version of the Data Studio connector. They are able to provide the solution in an automated way to their end, like, customers. That's not something that I was planning in the beginning, but this is just one use case that is interesting for me. In terms of the sort of definition
[00:15:30] Unknown:
of the metrics, I'm wondering what are some of the limitations to the level of complexity that you can actually achieve with the sort of calculation and definition of a given set of metrics before the computation introduces too much latency or before it becomes too difficult to figure out how to sort of layer the logic in a way that is maintainable?
[00:15:52] Unknown:
Yeah. I mean, we try not to do some magical stuff. Instead, we try to push all the work into your data warehouse. But when you, like, think of the data consumption layer, I mean, which tools are, like, getting data from your data warehouse and letting you analyze it, most of them are BI tools, the drag and drop BI tools. And these BI tools have a specific way to generate the query, and each BI tool has its own way. Essentially, they are generating SQL, most of them. In order to build up these integrations, we realized that, I mean, we cannot be just building any API and expect these vendors to work with us. Instead, we need to come up with a clever way to speak their language. So we decided to act as a database, which is Trino. Like, I mean, we are using the Trino interface. It was called Presto beforehand.
So the data tools are connecting to Metriql, but they think that it's Trino, using the Trino interface. Instead of, right, like, entering the Presto URL, people enter the Metriql URL, and then we expose everything as database tables and columns. The metrics are exposed as database columns, for example. And then we build extra integrations for the BI tools, like, to be able to differentiate the dimensions from the metrics, for example. In order to be able to do that, we need to parse the SQL that they generate, understand the dimensions and metrics definitions that are being used, and convert it to our DSL.
And after that, we compile the query for the underlying data warehouse. We are essentially parsing the SQL and generating SQL for you using the internal interface. And the cool thing that we can do with this tool is that you can actually speed up your queries. We have something called aggregates, where you define your metric and dimension pairs that you want to speed up. Since we are living inside dbt, we push our transformation work into dbt. We create dbt models, incremental or table dbt models, and expect you to just update them, use them in your project.
And if you have an aggregate and you are using Tableau, for example, to be able to analyze a metric with a dimension, we get this query and try to figure out if we can answer this query using our aggregates, which are basically materialized tables. And then if we can, we use that materialized table instead of going to the raw data. So that way, we are actually helping the data analysts and saving their time so that they don't need to create dimensional tables every time they get a new reporting request.
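To make the aggregates idea concrete, here is a minimal sketch of how a metric/dimension pair could be pre-materialized through a dbt model definition. The property names are illustrative assumptions rather than Metriql's verified syntax; the point is that the roll-up becomes an incremental dbt model that Metriql can route matching queries to.

```yaml
# Hypothetical sketch of an aggregate defined under dbt's `meta` block.
# Key names are assumptions; the concept is a pre-computed metric/dimension pair.
models:
  - name: events
    meta:
      metriql:
        aggregates:
          daily_events_by_country:
            measures: [total_events]      # metric to pre-compute
            dimensions: [country]         # dimension to group by
            materialize:
              type: incremental           # generated as an incremental dbt model
```

When a BI tool such as Tableau then asks for total_events broken down by country, the query can be answered from this materialized table instead of scanning the raw events.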
[00:18:53] Unknown:
And digging more into the Metriql implementation, I'm wondering if you can just talk through some of the design and architecture of the system and some of the complex and challenging engineering aspects of actually getting this set up? So, like, the hardest part was actually parsing these SQL queries.
[00:19:12] Unknown:
Because, like, BI tools are generating really complex SQL queries. The data warehouse, like, solutions are able to, I mean, parse them, compile them. But we need to understand the query in a better way. And to be able to do some cool stuff like materialized views, we need to be able to, like, push down some of the parts of the query or apply the projections that you have in your select query. To be able to do that, we are now using Trino's SQL parser. Luckily, I had the, like, privilege to use PrestoDB a couple of years ago for Rakam initially.
So I knew about the engine, how it works, and the SQL dialect. So we hacked Trino, basically. That's what I call it. We use its engine to be able to do that. We are not using Trino to be able to process and compute data. Instead, we use it as a proxy layer. So we try to minimize the work as much as possible. That's why we don't have this, like, transformation engine; we push all the work into dbt rather than having our own scheduler or transformation layer. And the hardest part for us in this solution was to find a way for us to talk this SQL language in metric terms. So, partially, we are able to push some of the data modeling into the YAML files. For example, if you have a join relationship between 2 different datasets, you define them in the data model. But our SQL layer, which we call MQL, is exposed to the BI tools. And this MQL is actually not able to do joins, for example. We push all these joins, all the sophisticated stuff, into the YAML and try to keep the SQL as simple as possible.
That's how we are able to solve this, like, problem when it comes to, like, advanced use cases. We don't need to define these join relationships inside Tableau, inside the BI tool. Because if we supported joins in the SQL, we would be, like, having some complex queries, and then parsing them would be harder. 2 different problems.
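As an illustration of pushing joins into the YAML data model rather than into the SQL that BI tools generate, a relation between two datasets might be declared roughly like this; the field names are assumptions for the sake of the example, not the exact Metriql schema.

```yaml
# Hypothetical join relation declared in the YAML data model, so MQL queries
# coming from BI tools never need to contain a JOIN themselves.
# Key names are illustrative assumptions.
models:
  - name: orders
    meta:
      metriql:
        relations:
          customer:
            to: customers          # target dataset
            source: customer_id    # column on orders
            target: id             # column on customers
            type: left_join        # how the two datasets relate
```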
[00:21:36] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. DataFold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its DATA DIFF feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Visit dataengineeringpodcast.com/datafold today to book a demo with DataFold. So when you first started working on the project, I'm wondering what were some of the ideas about how it would be used or how you would approach the actual implementation, or some of the assumptions that you had about that which have been either proven incorrect or had to be changed as you iterated on it and started using it internally and sharing it with end users?
[00:22:52] Unknown:
Actually, I first heard this headless BI term from investors, like Base Case VC. Later, they became investors in Supergrain, which is also a metrics store, but they don't have, like, a public solution at the moment. And then I read this famous article from Benn, founder of Mode, about this missing layer in the data space. So thinking about that problem, I started thinking, the most powerful feature of Rakam is actually this metric layer, so how can I use Rakam's engine to solve this problem? And then I started to try out different BI tools like Tableau, like Metabase, to try to understand how they work, if we can just build native integrations for each BI tool or not. And it turned out that they all have different plugin mechanisms, and it's not gonna be easy for us to write Clojure for Metabase and build up native drivers for each BI tool. I ran into this, like, specific project called dbt-metabase.
It is an open source project, like, on GitHub, and it talks to dbt. You define the YAML inside your dbt YAML files, and then they synchronize them, or you write the column definitions, etcetera. And they talk to the Metabase API and update these definitions. So it was interesting. I got in touch with the maintainers. We had a few chats together. And looking at the, like, tools, I decided to, like, act like this database, Trino. And I had a few tries to build up native integrations with Tableau and other BI tools. It didn't work. After a few iterations, we came up with this Trino approach.
But I didn't know about the use cases. What I see in the industry is that there are different, like, semantic layers, unified semantic layers, like AtScale, Arcadia, that are actually trying to do something similar without the metrics. They speed up your BI queries. And Power BI also has this MDX, which is the metrics layer. Looking into each BI tool and the history of this industry, it looked like most of them are actually trying to solve similar problems, but they call it differently. If you look at the unified semantic layer, AtScale's product is similar to what we are actually doing. But the difference here is that we are only focusing on this metrics layer, and then we are trying to build up integrations with most of the tools rather than trying to be an enterprise tool that only integrates with a couple of products in a well-managed way.
[00:25:40] Unknown:
And so talking through the actual usage of Metriql, I'm wondering if you can describe the workflow and syntax of actually building a set of metrics and then exposing that to a business intelligence system or a, you know, command line client that wants to be able to interact with these metrics and use them for their own analytics workflow?
[00:26:02] Unknown:
So building a developer experience product is not that easy because the developer experience should be frictionless. So we thought about, I mean, building our own, like, way for people to connect to the data warehouses. And then we quickly realized that actually, dbt has most of the stuff. They have these profiles where you just write the database credentials in the YAML files, and they connect to your data warehouse and transform your data. So rather than building our own one, we decided to use dbt's experience. So you initially create a dbt project, and you write your dbt models. If you don't want to transform the data, you can just create a YAML file and then define a data model that points to a table inside your database.
Once you create a dbt project and your first YAML files, they have something called meta. Under the metriql property, you define your metrics. If you have an events table, you can just create a YAML file and then create measures, something like total users or total events, and run dbt. When you run dbt, it creates a file called manifest.json, which is a metadata file. And we have a command line application which takes this manifest.json file as an argument. So once you model your data and define your metrics, you use our, basically, API server to start an HTTP server reading your manifest file. We talk to dbt's profiles.yml file. So you don't need to define the credentials separately for Metriql. Instead, we just read dbt's credentials.
And talking to dbt's metadata, we expose everything as, like, these database tables to your BI tools. But for each BI tool, we have different integrations. If you are using Tableau and use Metriql with it, you use the Presto interface. You connect to Metriql using a Presto connection. And then we have a simple dashboard where you just click the tool that you want to use. For Tableau, for example, they have something called TDS files, Tableau data source files. You just select the dataset that you want to analyze in Tableau, and then we create a TDS file. When you double click the TDS file, Tableau opens up and asks for the credentials. When you enter the credentials, you'll see that Tableau works with your metrics defined in your YAML files.
But if you are using an open source BI tool like Superset, in the dashboard, we ask for your Superset credentials. You enter it and click sync. We talk to the Superset API and synchronize all these metrics. So you can just go to Superset and connect to the database, which is basically Metriql. And when you drag and drop the values in the query builder, Metriql gets the query, parses the query, understands the metrics. And then since we know the definition of these metrics, we just compile the query for your underlying database and return the results back to you. If you are using Google Sheets, we have a connector as well. You can just bring your data from your data warehouse into your Google Sheets using our Google Sheets plug-in. So we are not just building this tool for the BI tools, but for all the data tools. But the analytics engineers are the ones who are writing and building these data models, defining the metrics, and maintaining the metrics as a service.
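Putting that workflow together, a first metric definition for an events table might look roughly like the following inside a dbt schema file. The measure names and keys are illustrative assumptions; the workflow itself (define metrics under meta, run dbt to produce manifest.json, point the Metriql server at it, connect BI tools through the Trino/Presto interface) is as described above.

```yaml
# models/schema.yml -- hypothetical sketch of defining measures under dbt's `meta`.
# Key names are assumptions for illustration.
models:
  - name: events                         # the events table in your warehouse
    meta:
      metriql:
        measures:
          total_events:
            aggregation: count           # count of event rows
          total_users:
            column: user_id
            aggregation: count_unique    # distinct users
        dimensions:
          event_type:
            column: event_type
            type: string
# After running dbt, the generated target/manifest.json is what the Metriql
# server reads; warehouse credentials come from dbt's profiles.yml.
```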
[00:29:53] Unknown:
And as far as the actual sort of collaboration opportunities for being able to work with the business users and people who don't necessarily want to dig into the engineering aspects of the metrics layer, how can teams support that type of contribution where the business user is able to identify or define their own set of metrics and then be able to actually expose that back to Metriql so that other people can take advantage of those definitions?
[00:30:26] Unknown:
We are not designing this experience, like, the modeling experience, for the business people. The data people, the analytics engineers, are the ones who are modeling the data. The business stakeholders are usually consuming the data. The data that they want to analyze should be consistent. If you have a metric, it should be the same in all the data tools. But if you want to, like, just experiment and understand it, we also have the SQL layer, where you can write Jinja expressions inside the native SQL query for your data warehouse. We also have different reporting types, like SQL, MQL, segmentation, funnel, etcetera.
But the data modeling part is, like, defining the configuration in a configuration-as-code manner. So the data people write these YAML files and then commit them into their git repository and create a pull request and collaborate with their team members, which is not possible in a Tableau workflow or Superset workflow. Everything is committed, like, in git. So the business people are not the ones who are defining the metrics here. But they are requesting the data people to get a metric and the dimensions. The business people are able to drill down a metric into different dimensions, into different time frames, without needing the data people to model the data or create these dimensional tables. That's how we are helping the business people. But if they want to have a new metric, it should be in the YAML file. As far as being able to
[00:32:08] Unknown:
decide what is the developer experience, what is the interface for being able to actually define and propagate these metrics, what is the process of sort of working through in sort of like a CICD approach of testing these metrics and working through them? And what were some of the challenges that you encountered in figuring out how you wanted to actually expose these definitions in a way that was maintainable and understandable to the broadest set of users?
[00:32:35] Unknown:
So we use dbt mainly for everything. dbt has tests. dbt has documentation where you can just click your dbt model sources and see your metrics. So we are just focusing on the metrics side; for the testing side, you are expected to use dbt. But the challenge that we have is that since we are generating these SQLs in an ad hoc manner, we are not able to use dbt to, I mean, define, like, set up some alarms on your metrics or define the tests inside it. This is something we are, like, still brainstorming. But for example, for testing, there is a project called re_data that works on top of dbt. So if we can integrate Metriql with re_data, we can just, like, tell people that if you want to create an alarm for your metric and use Metabase, you can just, like, run this query, define the boundaries, and then you can set up an alarm. Or if you want to test the metrics in your continuous integration environment, you can add these lines into your YAML file for re_data and make re_data operate on top of Metriql metrics. So this is how we are planning to implement it in the long run.
Rather than building it ourselves, we are trying to integrate these open source tools.
[00:34:03] Unknown:
For people who want to get set up and running on their own infrastructure, what's the process for actually deploying it and maintaining an installation?
[00:34:12] Unknown:
So it is basically a Docker container, like, a Docker image. We use Docker Hub to push the new versions. So, essentially, what you need to do is that you just pass this manifest.json file as a URL, as an argument to Metriql, as an environment variable. And it starts an HTTP server, like, reading these environment variables and getting this manifest file from your servers or from your dbt documentation. We have 1-click installers as well. If you are using Heroku, if you are using Google Cloud, you can just click the deploy with Heroku button. And then we use Docker under the hood to be able to deploy it to your Heroku account. But if you are using Kubernetes, you can use our Docker image to deploy it into your own infrastructure. It's all Docker.
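Since the whole deployment is a Docker image driven by a manifest URL, a docker-compose sketch could look like the following. The image name, port, and environment variable names are assumptions inferred from the description above, not verified configuration; check the project documentation for the exact values.

```yaml
# docker-compose.yml -- hypothetical sketch of running the Metriql server.
# Image tag, port, and environment variable names are assumptions.
version: "3.8"
services:
  metriql:
    image: metriql/metriql:latest        # assumed image name on Docker Hub
    ports:
      - "5656:5656"                      # port exposed to BI tools (assumed value)
    environment:
      METRIQL_MANIFEST_JSON: "https://example.com/dbt/target/manifest.json"  # URL to the dbt manifest (assumed variable name)
```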
[00:35:05] Unknown:
It's all Docker. In terms of the current state of the ecosystem, I'm wondering what are some of the potential improvements to how the metric layer can be leveraged. What are some of the missing elements that that you think will be filled in over the coming years to be able to take full advantage of this metrics layer? And what are some of the sort of untapped opportunities for this aspect of the sort of data production and data usage?
[00:35:30] Unknown:
So right now, this is, like, a new project. The way we approach this problem is that we should be the ones who integrate, like, these third-party BI tools, or the third-party data tools. So for now, we are focusing more on the integrations. One thing that is interesting for me is, like, this metadata space. There are different metadata tools like Amundsen, Atlas. And in the short term, we should be able to integrate with them, like, seamlessly. There is one specific project called OpenMetadata. I'm still trying to understand how we can integrate it, how we can leverage these metadata layers, and try to make it useful and have the metadata layers in place. But the challenge is that most of the tools are talking in terms of the database tables or database columns. But we are talking in terms of the metrics. We have this bridge with MQL. But in the long run, I expect the tools to be talking in terms of metrics more often, like the reverse ETL tools or spreadsheets or, like, the notebooks, etcetera. This is something we will probably see in 5 to 10 years. People will be talking more about the metrics.
[00:36:51] Unknown:
And so in terms of the applications of Metriql, either in the work that you're doing at Rakam or in the community as people have started to adopt the tool, what are some of the most interesting or innovative or unexpected ways that you've seen it used? So for now, like, it will probably be the same one, since we have the contract.
[00:37:11] Unknown:
So previously, our customer was using, like, a transformation tool to ingest the data into BigQuery to be able to use Data Studio. But with Metriql, you can just connect to ClickHouse from Google Data Studio. You can just get the data from ClickHouse or Snowflake directly into your Google Sheets. You can build up some Slack application if you want to analyze this data. One interesting use case for me was this, like, ClickHouse use case specifically. But, like, we probably have a couple of production deployments at the moment, and that's it. Over time, we will probably see more interesting use cases, but that's it for now. So for people who are interested in being able to take advantage of the opportunities that a metrics layer provides,
[00:38:08] Unknown:
what are some of the cases where Metriql is the wrong choice?
[00:38:12] Unknown:
Metriql can be, like, an overhead for small companies. If you have a single BI tool and if you don't have analytics engineers or developers working on it, or if you don't have a data department, you probably don't have the resources to write these definitions in the YAML files and deploy Metriql internally. So if you are using, for example, Metabase and it is working for you, it will be much easier for you to define the metrics inside Metabase or inside Tableau. But if you are a large organization with a data team, where you have different data tools to consume the data, it's a good choice. Otherwise, if you are consuming the data already in a single way, like using a single BI tool, then Metriql can be overhead for you.
[00:39:09] Unknown:
And as you continue to use Metriql and iterate on it, I'm wondering what are some of the things that you have planned for the near to medium term, or if there are any particular areas of help or contribution that you're looking for with the open source project.
[00:39:24] Unknown:
So for now, like, we are trying to build up tutorials, write better documentation, come up with some success stories. But, essentially, like, if you are familiar with Looker, I tend to think of Metriql as LookML for all the BI tools, LookML for Superset or Metabase. And I expect Metriql, like, to fill this gap of metric definition and complete this picture of configuration as code, like the data configuration as code, and then try to build up a consistent way, provide a consistent, like, view on top of the data.
[00:40:06] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think that this configuration
[00:40:24] Unknown:
as code will be the new hype. People will be defining, like, their data as code. We see this in DevOps. Right now, people are writing YAML files and managing their infrastructure as code. And the software engineers are invading the data space at the moment. So right now, like, there is this new role called analytics engineering, like, analytics engineers. I think that, like, the experience will be more focused on the developer side, which was on the, like, BI side in the past. So the data world is unbundling right now. Like, we see different tools like metrics or metadata tools. Previously, it was just BI, and Tableau was handling everything internally in a single product. Right now, we are unbundling everything. We have the data warehouse. We have the ELT tools to load the data into our data warehouse.
And even in the data warehouse, we see that there are different technologies emerging, like Iceberg, Apache Iceberg, because people want to separate their compute from the storage. Right now, in the storage, we have these, like, standardized table formats, like Iceberg, so that people can pick their table format and layout and use it. In our world, we see this metrics layer, and I expect to see more metadata-related things for business people to interact with the data and understand the data and test it and also
[00:41:59] Unknown:
document it to be able to access it. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Rakam and on Metriql. It's definitely a very interesting project, and I'm excited to see an open source offering for this headless BI and metrics layer. So definitely appreciate all the time that you and your team and the community have put into it, and I look forward to seeing where it goes in the future. And I hope you enjoy the rest of your day. Thanks a lot. We are hiring, by the way. Like, you can get in touch with me from my email,
[00:42:29] Unknown:
emre@rakam.io. We are especially hiring people who are gonna be working on the engine itself, and also building some tutorials. So if it is interesting for you, please do get in touch.
[00:42:51] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Burak Emre Kabakcı
Building Metriql and Its Purpose
Headless BI and Metric Stores
Open Source and Community Contributions
Comparing Metriql with Other Tools
Technical Challenges and Solutions
Implementation and Architecture
Use Cases and Customer Stories
Collaboration and Business User Integration
Deployment and Maintenance
Future of Metrics Layer and Ecosystem
When Metriql is the Wrong Choice
Future Plans and Contributions
Biggest Gaps in Data Management Tools
Closing Remarks and Contact Information