Summary
Business Intelligence software is often cumbersome and requires specialized knowledge of both the tools and the data in order to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organization's data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the features planned for future releases, and how to get it set up to start using it in your environment.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence
Interview
- Introduction
- How did you get involved in the area of data management?
- The current goal for most companies is to be “data driven”. How would you define that concept?
- How does Metabase assist in that endeavor?
- What is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL?
- What level of complexity is possible with the query builder?
- What have you found to be the typical use cases for Metabase in the context of an organization?
- How do you manage scaling for large or complex queries?
- What was the motivation for using Clojure as the language for implementing Metabase?
- What is involved in adding support for a new data source?
- What are the differentiating features of Metabase that would lead someone to choose it for their organization?
- What have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective?
- What do you have planned for the future of Metabase?
Contact Info
- Sameer
- salsakran on GitHub
- @sameer_alsakran on Twitter
- Metabase
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Expa
- Metabase
- Blackjet
- Hadoop
- Imeem
- Maslow’s Hierarchy of Data Needs
- 2 Sided Marketplace
- Honeycomb Interview
- Excel
- Tableau
- Go-JEK
- Clojure
- React
- Python
- Scala
- JVM
- Redash
- How To Lie With Data
- Stripe
- Braintree Payments
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And for complete visibility into the health of your pipeline, including deployment tracking and powerful alerting driven by machine learning, Datadog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you'll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-shirt.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey. And today, I'm interviewing Sameer Al-Sakran about Metabase, the free and open source tool for self-service business intelligence. So Sameer, could you start by introducing yourself?
[00:01:19] Unknown:
Yeah. Hi. My name is Sameer Al-Sakran. I guess, by way of background, I've been working in machine learning and data science for most of my adult career. Before Metabase, I was CTO of Expa, which is a startup incubator in San Francisco, and I ran engineering for Blackjet, which is a private aviation company. And before that, I had stints doing machine learning research as well as working on big data and Hadoop-style shenanigans.
[00:01:45] Unknown:
And do you remember how you first got involved in the area of data management?
[00:01:49] Unknown:
Well, I'm not sure I'm in the area of data management per se. I think it's one of those things that is foisted on many people. And I think, like many other budding machine learning or, you know, data infrastructure folks, data management or analytics or reporting was just kind of part of the things I was suckered into. I actually remember very clearly that I joined Imeem, which was a music startup back in the day, and I was really brought on board to do music recommendation and machine learning and algorithms. At some point, I realized that about 95% of my time was being devoted to the care and feeding of our analytics and reporting infrastructure. And so I'd say that is kinda how I got into it, where, as the things that we were doing for the recommendation side of the product were built out, it turns out that the far more important thing we were doing was generating reports for labels, for making decisions within the company, and just for getting data wired through the entire organization. And that was kind of the start of a long downward spiral. And before we start digging into Metabase itself,
[00:02:52] Unknown:
one of the goals that it facilitates is to help make companies data driven and give high-level access to the information that's stored in the various, you know, storage systems that the business might be using. So I'm wondering if you can just start by explaining your conception and view of what it means for a company to be data driven.
[00:03:15] Unknown:
Yeah. I'd almost characterize it in terms of a sort of Maslow's hierarchy of data needs. Depending on the level at which a company operates, there's either some access to information, or pervasive access to information where everyone in the company gets whatever they want, or something in the middle. In large part, it's about having data about the actual, objective state of reality of the company accessible to people who are making decisions. It's often discussed in the sense of large, heavy, big strategic decisions, but I think there's a lot of value to wiring up a company such that people who are doing normal jobs in their everyday kinds of things, like customer service, like running growth campaigns, like dealing with support tickets, like, you know, QA, have access to data that is relevant to them. And so you kinda think about it as: the lowest common denominator is, is there a dashboard with the company KPIs?
As you get up the stack, there's data that permeates the general operations with the known questions, the things that people realize have value. So, for example, it's useful to pull up the record for a customer, so you can see all their transactions, you can see their support tickets, you can see what chargebacks they've had. And then there's kind of the third layer where you get the ability to ask the questions that you don't know ahead of time. And this gets into self-service data discovery and exploration. Most companies have a very, very limited game there, and that's where we have largely positioned ourselves.
And, you know, maybe a little bit further up the stack is also the ability to do increasingly complicated, sophisticated things given these foundations of both data accessibility and data literacy in the company.
[00:04:57] Unknown:
And as you were talking about being able to ask questions of the data that's stored in these different systems that a business might be using, it reminded me of a conversation that I had a little while ago with one of the folks from Honeycomb, where their main objective is to facilitate observability into systems infrastructure. As you're talking about the idea of a company being data driven, it has a lot of the same fundamentals to it of being able to have a wealth of data that you can ask questions of. So it sounds like observability of the state of the business at a sort of high level.
[00:05:35] Unknown:
Yeah. Exactly. And, much like debugging systems, it's useful to know what's happening and when it happened, and it's useful for everyone who might touch that problem to have access to that kind of question-answering ability.
[00:05:52] Unknown:
And so how does Metabase facilitate that goal, and what was your motivation for starting it in the first place?
[00:06:00] Unknown:
I'll start with the second question first because I think it feeds into the first one. So originally, we at Expa were starting a sequence of companies in batches and cohorts. One of the things that we were interested in super early on were two-sided marketplaces, SaaS companies, and just a wide variety of other, like, operationally intensive companies relative to their size. And so one of the goals of the precursors to Metabase was to provide something super easy that someone who was a little intimidated by Excel could use day to day. Our initial overall objective was to put as much power as we could into an interface that was a little simpler than Excel. And while Excel can be really complicated, it's also, in some ways, delightfully straightforward.
And so we generally try to walk that line of having there be enough power that it's useful, but not overwhelming the average person doing an average job in a company.
[00:06:55] Unknown:
The first question was how Metabase facilitates a business or an organization
[00:07:00] Unknown:
in its goal of becoming data driven. Yeah. So I think the biggest thing we do is reduce the barrier to deploying analytics. One of the big trends over the last couple of decades has been that providing analytics support, or just basic reporting or dashboards or visualizations, whatever you wanna call it, has gotten cheaper and easier. It used to be something where every single query you would run was a big deal. It was written in code, you talked to DB2 or some other archaic database in COBOL or such, and SQL made it way easier. Excel made it even easier. Tableau did a great job in just pushing the ability to do exploration further down into the org chart. And I think, in general, the primary thing we really add to a company is that it lets you wire up all the little things that you don't know are valuable yet. It makes it very cheap and easy for small analytics to be done by anyone, and this shows up across the board. Where it really shines is, again, that ability to spin up a little mini analytics setup for the nooks and crannies where, you know, the person who benefits doesn't have a VP title.
There isn't necessarily a lot of budget for it, but it's those little things that make the company overall perform much better.
[00:08:20] Unknown:
And in a lot of ways, a system like Metabase serves as the first foray into gaining the ability to analyze and access the data outside of the context of the application and the ways that the code is manipulating it, and it can feed into the further desire and motivation to build more elaborate systems once you start to see the types of questions that can be answered by bringing together multiple data sources in a new context.
[00:08:54] Unknown:
Exactly. I think one way to think about us is that we're where analytics starts out. And then as you realize that you need a customer profile tool and you know exactly what's on that page, you build it out in your app or in your internal tools, and we're kind of the place where you learn what should be on that page. And so people use us for things like one-off reports for given customers, for company-wide dashboards before they're elevated to something that's hand coded, as well as just using us all the way up and down that stack. But the ability to do things, again, before you're precisely certain of what they should look like, the ability to do them quickly and easily, and, if you make mistakes, to correct them on the fly as opposed to in code,
[00:09:41] Unknown:
has been where we've really shined. And given the fact that a strong focus of Metabase is for it to be self-service and accessible by people who aren't necessarily technical or comfortable with writing SQL, what have you found to be the ratio of users that take advantage of the graphical query builder as opposed to dropping down and writing raw SQL, or the raw query language for the data source that they're running the query against?
[00:10:11] Unknown:
So it's actually interesting because it spans the gamut. I'd say that across our user base, we have folks that use us strictly in SQL and folks that almost entirely use just the GUI. So it's one of those situations where we're used in drastically different places. I'm not sure if I can name names, but we're used in, like, extremely engineering-heavy organizations where everyone, top to bottom, writes SQL. And we're also used in companies where there's just not a lot of engineering talent, and so people just hunt and peck for results in our interface. I'd say it's roughly, I'm just gonna throw out, like, arbitrarily, roughly half and half. And I'd say that there's also a lot of folks that mix and match. So one of the common patterns we're seeing is that someone will write a very complicated query in SQL, and then they'll chop off that final select and let folks use our graphical query interface to hit that intermediate sub-select.
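To make that pattern concrete, here is a minimal sketch, in Python for illustration only, of what "chopping off the final select" amounts to: the analyst's complicated SQL becomes an inner query, and the GUI layer only ever generates simple slicing on top of it. The table and column names are invented for the example and are not from the conversation.

```python
# Hypothetical illustration of the sub-select pattern described above.
complex_sql = """
SELECT o.customer_id,
       DATE_TRUNC('month', o.created_at) AS month,
       SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.status = 'active'
GROUP BY 1, 2
"""

def wrap_as_subselect(inner_sql: str, where: str = "") -> str:
    """Build the simple outer SELECT a GUI layer might generate."""
    sql = f"SELECT * FROM ({inner_sql.strip()}) AS source"
    if where:
        sql += f" WHERE {where}"
    return sql

# An end user slicing the saved question down to a single month:
print(wrap_as_subselect(complex_sql, where="month = DATE '2018-01-01'"))
```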
[00:11:02] Unknown:
And so from what you're saying there, it sounds like it's possible to use the results of one query as the input to another?
[00:11:09] Unknown:
Precisely. And so the combination of a very complicated SQL query as that underlying starting point, plus a really easy way for end users to slice and dice on top of that, is kinda where we start to shine as well. And when you are using the graphical query builder, is there a certain
[00:11:27] Unknown:
breaking point where the level of complexity of the analysis is just too involved to be able to do it purely with the graphical builder and you need to drop down to SQL, or have you found that people are generally able to do fairly complicated analysis and aggregations just using the graphical interface?
[00:11:46] Unknown:
Let me answer a different question; I apologize for that in advance. But, fair enough. I think in general, we don't view the graphical querying tool as a place where people do analysis per se. It's more exploration, slicing, dicing. So, you know, on one level, the GUI doesn't have a ton of power. You can't, you know, do cohort analysis in it. You can't do predictive statistics. You can't really do a whole lot of statistics as a whole. But what you can do is expose a dataset to pretty much anyone and let them, again, just hunt and peck and find their way through it. The way that we've seen companies find the most success with Metabase is when they view it more as a data publishing tool as opposed to a place where complex, important analytics get computed. And so SQL is there for a reason, and the reason is that as you start doing things that your customer service reps can't understand, it stays out of the tool. The kind of GUI query builder, as we call it, or just the interface, is really meant to be that final layer that is hit by anyone in your company.
And we try to have as much of the complexity as possible be buried behind that layer, either in the data prep or in writing these intermediate SQL queries or SQL views, materialized or otherwise. But we view the graphical interface not so much as a place where people construct complicated things and more as where that final display, interactivity, and discovery occurs.
[00:13:23] Unknown:
And have you found that there is a general class of use cases that people are typically using Metabase for? Or is there anything that's particularly interesting that you have found people leveraging it for? So, I mean, there's the standard: company-wide dashboards, nightly stats emails,
[00:13:41] Unknown:
per-version launch pages to see how well certain versions are doing. And I feel that's fairly standard for most tools. I think some of the places where, I'd say, people are doing it right, one example of folks using us is Go-JEK, which uses us to introspect microservices. They'll take all the databases that are backing their microservices, add read-only connections through Metabase, and then let folks quickly look into a microservice's underlying database and tables. And it lets there be, like, a single place where you go to pull up this information. So, for example, you can create a dashboard that has canary cards for whether a migration is successful or not. Often, as you're doing things like system-wide migrations, there are a bunch of intermediate steps, and you get the ability to have really quickly built canary cards for, you know, how many rows still have time as a string and how many rows have time as an integer. And as you perform the migration, you kind of keep tabs on how it's going, whether there are errors or not. Now you could have done this in a hand-coded script, but that takes longer, it's less controlled, and it's kind of out of date the second it happens, whereas you can keep these canary dashboards around for a while. And so we tend to be useful even in situations that are not classical, quote, unquote, BI. Another thing people use us for is just delivering datasets.
So, for example, they'll do a big run of either a statistical or machine learning program, they'll get this large dataset, they'll stuff it into MySQL or Postgres, and then point their end users at it via Metabase. And so rather than building a custom interface, or rather than asking them to stuff it in Excel, people can do that final layer of, again, just slicing and dicing and pecking and rummaging around in the data. So those are the two big ones. The other one that I think is underappreciated is just as a row browser. So, you know, before you have BI, before you have analytics requirements, as you're just building out your product initially, it tends to be useful to be able to, like, dig into the users table or, you know, the photos table, whatever else you have lying around, and just do sanity checks. And, again, liberally sprinkle those same kinds of canary cards where, you know, is everything as you expected? And if so, awesome. If not, hey, here's the number of rows that have malformed JSON, for example. Yeah.
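As a rough sketch of what such a canary check boils down to, the Python below counts rows in an unexpected state, using an in-memory SQLite table as a stand-in for a real warehouse; the table, columns, and sample data are invented for illustration and are not from the episode.

```python
# Minimal "canary card" sketch: each card is just a count of rows in an
# unexpected state, re-run on a schedule during a migration.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, '{"time": 1514764800}'), (2, '{"time": "2018-01-01"}'), (3, "{oops")],
)

def canary_malformed_json(c: sqlite3.Connection) -> int:
    """Count rows whose payload fails to parse -- one canary card."""
    bad = 0
    for (payload,) in c.execute("SELECT payload FROM events"):
        try:
            json.loads(payload)
        except json.JSONDecodeError:
            bad += 1
    return bad

def canary_time_as_string(c: sqlite3.Connection) -> int:
    """Count rows not yet migrated from string timestamps to integers."""
    n = 0
    for (payload,) in c.execute("SELECT payload FROM events"):
        try:
            if isinstance(json.loads(payload).get("time"), str):
                n += 1
        except json.JSONDecodeError:
            pass
    return n

print(canary_malformed_json(conn), canary_time_as_string(conn))  # 1 1
```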
[00:16:03] Unknown:
A lot of times, the application will have an admin panel embedded into it for being able to do that level of exploration as you're building it, and sometimes just being able to build and maintain those pages can be as complex as the application itself when you're first starting out. So I can definitely see how having Metabase as a ready-to-go solution for just pointing at a data source and getting that visibility would save a lot of time and effort. And now, digging deeper into the project, can you discuss a bit how it's implemented and some of the ways that the architecture has evolved over the course of the time that you've been working on it?
[00:16:40] Unknown:
Yeah. So I'd say the heart and soul of Metabase is a query language, which gets transpiled down to SQL dialects as well as to Mongo or Druid or whatever else we support. We also have a metadata layer where we build a model of the underlying database. It'll essentially be a skeleton with the schema, and we'll attach both human-curated information as well as classifier guesses, just our own guesses at what things mean. And so that is what everything's built around. Again, that language gets transpiled down to the SQL dialects and the other databases we support.
And that's all done in Clojure. The front end is a React application, and it's generally, you know, a fairly standard-issue BI and graphing layer. I might need some help drilling into the various parts that are interesting to your audience as we talk this through. And as one of the original authors on the project, was there a particular reason that you chose Clojure as the language for implementing it? And if you were to start over, do you think you would make that same choice? So I think we got to Clojure in a very roundabout way. The first version of Metabase was written in Python.
And as we were hitting scaling issues initially, we decided we wanted to move off of Python, and fairly soon it was clear it was gonna be onto the JVM. The fundamental reasons we wanted to do that were that we wanted lighter weight threads as opposed to processes, we wanted the ability to package up a single jar for deployment and just have a single binary we could ship around, and, in general, the performance characteristics of the JVM felt like where we wanted to be. We actually wrote our first port in Scala. I think we spent about 2 weeks trying to port things to Scala before realizing that the overall zen of how the Scala world does database connections and database querying was not really aligned with what we needed to do. Specifically, a lot of effort is spent in the Scala ecosystem around type checking, and specifically statically type checking, SQL queries and the various DSLs that generate SQL queries. And in our situation, most of the heavy lifting was taking this intermediate language that we had and then, you know, spitting out SQL.
It was largely just taking an AST, essentially a parsed SQL statement in a language that we had designed to look like a parse tree, and having that be transpiled down to SQL or Mongo. And trying to do that in Scala was proving to be more and more annoying; it consistently felt like we were coloring outside the lines. And I think we ended up trying to port it to Clojure and having so much of it get done so fast that we just kept going down that road. As to whether we'd do it again in the same language, I think quite likely, to be honest. I'm not sure that any other platform has quite the stability of database drivers, the overall performance characteristics, and just the general developer productivity that Clojure does.
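To give a flavor of the kind of transformation being described, here is a toy Python sketch, not Metabase's actual query language or transpiler, that turns a parse-tree-shaped query into a SQL string; the query shape and every name in it are invented for illustration.

```python
# Toy transpiler: a nested, parse-tree-like query structure in, SQL out.
def to_sql(q: dict) -> str:
    cols = ", ".join(q.get("select", ["*"]))
    sql = f"SELECT {cols} FROM {q['from']}"
    if "where" in q:
        op, field, value = q["where"]  # e.g. ["=", "state", "KS"]
        sql += f" WHERE {field} {op} {value!r}"
    if "group_by" in q:
        sql += " GROUP BY " + ", ".join(q["group_by"])
    return sql

query = {
    "select": ["state", "count(*)"],
    "from": "orders",
    "where": ["=", "state", "KS"],
    "group_by": ["state"],
}
print(to_sql(query))
# SELECT state, count(*) FROM orders WHERE state = 'KS' GROUP BY state
```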
[00:19:45] Unknown:
Yeah. I can definitely see that Clojure being a Lisp dialect would be very useful for dynamically creating those queries, and for dealing with lots of IO and multiple different data sources, having the JVM to handle that would definitely be beneficial. And speaking of the scaling requirements, I'm wondering if you can talk a bit about the ways that you handle large or complex queries, or large responses from those queries?
[00:20:22] Unknown:
We generally push computation down to the data warehouses themselves. So on one hand, we're fairly size agnostic, and if you're running a petabyte-scale, you know, BigQuery dataset, so long as the result sizes aren't too huge, we generally handle that gracefully. I think for situations where the result set is very large, there's just a functional limit to how much we can stuff in memory on the app server. So I'd say, overall, we view most queries as being something that is a distillation of data, and so we haven't done a lot of work around streaming, around, you know, billion-row result sets.
We actually tend to cap things at 2,000,000 rows for anything that gets returned by an API call. And while we are thinking about ways of relaxing that, for the most part we intend for the application to be used in a kind of interactive querying mode as opposed to running really big jobs and downloading the results. And so for things where you're generating a 100-terabyte result set off a petabyte, you should probably do that as a scheduled query in some other tool or just a cron job. Did I answer your question?
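A minimal sketch of that kind of guardrail, assuming a generic DB-API connection; the 2,000,000 default mirrors the cap mentioned here, while the function name and demo table are invented:

```python
# Sketch: run the query in the warehouse, but refuse to buffer more than
# `cap` rows back into the app server's memory.
import sqlite3

MAX_ROWS = 2_000_000  # the cap discussed in the conversation

def fetch_capped(cursor, sql, cap=MAX_ROWS):
    """Execute `sql` and return (rows, truncated_flag)."""
    cursor.execute(sql)
    rows = cursor.fetchmany(cap + 1)  # one extra row to detect truncation
    return rows[:cap], len(rows) > cap

# Tiny demonstration against an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])
rows, truncated = fetch_capped(conn.cursor(), "SELECT x FROM t", cap=5)
print(len(rows), truncated)  # 5 True
```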
[00:21:36] Unknown:
Yeah. No. That's definitely a very reasonable approach to that problem, because as you said, once you get past the 2,000,000 rows that you're trying to return, at that point you're just abusing the tool because it happens to be there, and you should probably move to some sort of ETL tool or just a cron job that dumps to another destination. I've dealt with that situation a few times, and I'm actually handling it recently, so it's a very familiar problem.
[00:22:04] Unknown:
There's something else that that brought to mind. Okay, hopefully it'll come back to me, but we'll see.
[00:22:09] Unknown:
And when you're working on adding support for a new data source, what does that workflow look like? And is that something that is done fairly commonly by the community, or is it largely driven by Metabase, the company?
[00:22:24] Unknown:
Yeah. So what goes into supporting a new data source is writing that mapping from our query language to whatever needs to execute against that database. For SQL, it's fairly straightforward: there is a generic SQL driver that you would subclass and essentially just fill in the missing pieces. And for the most part, you know, our query language is fairly straightforward, and it is about 3 to 5 days of work for someone to support a new data warehouse or database. Where it gets a little tricky is that jumping from that to a production-hardened database connector that can actually be used, you know, for prime time generally takes somewhere between 1,000 and 10,000 hours of production query load. And so we're really good at QA testing database drivers, like JDBC drivers, because we generate all kinds of crazy SQL under the hood. Between the various nested queries, the various date and time things we do, and just, you know, our time zone handling, we tend to look in all the various dark corners of a JDBC driver, and so there is a fair amount of work in finding all those little gotchas.
So things around type systems, things around how things are cast, implicitly or explicitly, indexing strategies and such; there tends to be a fair amount of really boring, gritty work on the tail end of things. I think for the most part, the team has written the majority of our drivers. We've had a couple, you know, stellar commits from the community. In general, we wanna encourage folks to write drivers rather than wait for us to write them ourselves. Partially, it's just because we're fairly bandwidth constrained, but also because we're not really experts in a lot of these other databases. We run a few databases internally, and we've had production experience over the years, you know, working with Postgres, MySQL, BigQuery, Redshift, and to some degree, I think, Mongo. But, you know, for example, there's a DB2 connector, there's a Vertica connector, and I think someone's working on a Teradata connector, and I'd freely say that no one on the team has used any of those in anger. And so there's a limited ability for us to actually understand the nuances of each of those SQL dialects and actually create a production-grade driver there.
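Metabase's actual drivers are written in Clojure, so the Python below is only an analogy for the shape of the work being described: a generic SQL driver with dialect-sensitive hooks (quoting, literals, limits) that a new driver subclasses and fills in. All class and method names here are hypothetical.

```python
# Hypothetical sketch of the "subclass the generic SQL driver" pattern.
import datetime

class GenericSQLDriver:
    """Shared SQL generation; dialects override only what differs."""

    def quote_identifier(self, name):
        return f'"{name}"'

    def literal(self, value):
        # Casting and date/time handling are exactly the dark corners
        # where, per the conversation, new drivers need the most QA.
        if isinstance(value, datetime.date):
            return f"DATE '{value.isoformat()}'"
        if isinstance(value, str):
            return "'" + value.replace("'", "''") + "'"
        return str(value)

    def limit_clause(self, n):
        return f"LIMIT {n}"

class HypotheticalNewDriver(GenericSQLDriver):
    """A new database only fills in the pieces that differ."""

    def quote_identifier(self, name):
        return f"`{name}`"  # e.g. a MySQL-style dialect

    def limit_clause(self, n):
        return f"FETCH FIRST {n} ROWS ONLY"  # e.g. a DB2-style dialect

d = HypotheticalNewDriver()
print(d.quote_identifier("orders"),
      d.literal(datetime.date(2018, 1, 1)),
      d.limit_clause(10))
```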
[00:24:41] Unknown:
And as you mentioned, when you first started working on this project you were at Expa, and you were building it because of the need of multiple different startups that required access to their data. You've since built a company around it, so I don't know if you can talk a bit about your reasoning for building that company and the ways that you support yourself and the project going forward to make it sustainable.
[00:25:08] Unknown:
Yeah. I think the decision to release the code into the open and then try to essentially make it self-sustaining, and more than just an internal project, was driven just by the sense that we had something interesting. There were a lot of examples of other companies having written something similar. So HiPal at Facebook was an early inspiration, Avocado at Yammer; you know, there's, like, EasyData and a bunch of other things out of LinkedIn, and I've lost track of all the others. But, until recently, no one had really open sourced a compelling version of this. And so for us, the first decision was that we wanted to release it in the open, and the second decision was that we wanted to essentially make it self-sustaining and have it be a long-term going concern that wasn't just at the mercy of a larger company or fund.
In terms of what we're trying to do, I'd say, you know, Metabase, the company, first and foremost sustains the Metabase project, and we're trying very hard to offer up a first-class, free, open source analytics application, and I think there's not much out there like it. And, you know, our hope there is to create something that lasts and is durable and will be used by everyone as a default. We are sustaining ourselves in a number of ways. We do offer support for folks who are trying to run us at scale. We are offering up the ability to power your application's analytics through us, either by embedding us directly into your application or by white labeling Metabase itself.
And, this is not quite yet announced, but we'll be introducing, essentially, versions of Metabase that are tailored to things like HIPAA compliance and PCI compliance, where it's not really BI 101 anymore, but we will be offering up a paid way to check off a lot of the boxes there.
[00:27:06] Unknown:
And so for that, I imagine it would have some capabilities for automatically identifying PII or personally identifiable information and preventing users from being able to query it directly unless they have some sort of admin status within the platform?
[00:27:22] Unknown:
Yeah. It will do relatively little automatically, but it will let folks set up sandboxes for users. So folks in customer service get access to anonymized, user-based information, but they don't see individual records, for example; or there are situations where, if a given aggregation has less than n folks in that demographic bucket, we redact the entire row. There'll be things that allow, again, the creation of a sandbox where, within that set of data, someone can slice and dice and explore and just generally take care of their own business by themselves, but they would stay within the confines of what they're allowed to see from a legal or compliance perspective.
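As a concrete reading of that redaction rule, here is a minimal Python sketch with an invented threshold and invented field names: drop any aggregate row backed by fewer than n individuals.

```python
# Small-cell suppression: hide aggregate rows whose demographic bucket
# is small enough to risk identifying individuals.
MIN_BUCKET_SIZE = 5  # illustrative value of "n"

def redact_small_buckets(rows, count_key="n_people", n=MIN_BUCKET_SIZE):
    """Keep only aggregate rows with at least `n` individuals behind them."""
    return [row for row in rows if row[count_key] >= n]

aggregates = [
    {"region": "Kansas", "n_people": 42, "avg_spend": 17.50},
    {"region": "Guam", "n_people": 2, "avg_spend": 99.00},  # gets redacted
]
print(redact_small_buckets(aggregates))
```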
[00:28:06] Unknown:
Yeah. And I was impressed by the capabilities that you have built in in terms of the user and group management and the permissions available within the platform, rather than pushing that responsibility to the underlying database, because that can add a lot of additional complexity in making sure that all of your different databases have appropriate roles defined, rather than being able to control all of that from within the platform itself.
[00:28:33] Unknown:
And for, you know, our more paranoid users, we do recommend also using database-level permissions as well. But we have tried to make the care and feeding of Metabase as easy as possible, and where we've been able to, we try to remove knobs and just generally take care of things for the end user. We do hope to keep doing that over the long term, and even as we, you know, provide these paid alternatives to really annoying fiddling with permissions, the core open source application is gonna stay what it is and get better with time.
[00:29:08] Unknown:
And from the security perspective, as you mentioned, hardening the database connectors so that they're appropriately secure for shuttling data back and forth. One of the other things that I appreciated while I was digging through the documentation was the fact that you support encryption of the connection information within the configuration store for Metabase, because I've definitely worked with other tools where that information is just stored in plain text within the database, which is obviously not ideal.
[00:29:38] Unknown:
Yeah. We're not quite, I mean, we definitely have some folks that are using us in really high-security regimes, and we are trying to generally up our game to a place where we're as good as any commercial tool in that regard. And so it's definitely something we're actively taking seriously, and, you know, we'll definitely keep improving going forward.
[00:29:57] Unknown:
And from the perspective of other options that are available on the market, whether open source or commercial, what do you see as being the main differentiating features of Metabase that would lead someone to choose you over any other tooling?
[00:30:13] Unknown:
I think the best way to describe where we're most compelling is that we're the laziest possible option. A lot of other tools let you do analytics work, and in some cases, they're much better at being a tool for analysts to do analytics. For the most part, we're trying to automate the vast majority of common requests and let people do their own queries and their own explorations. And so the best way to think about it is that we're a way for folks that don't wanna be analysts to automate the job away. And we generally provide 80% of the power of a dedicated tool with about 5% of the work.
[00:30:51] Unknown:
And what have you found to be the most challenging aspects of building and growing the project and the business, both from the technical and the nontechnical perspectives?
[00:31:03] Unknown:
From the technical side, it's honestly just been quality across all the surface area we have. So we support, you know, 15 databases, each with their own quirks. We're being used by thousands or tens of thousands of companies with vastly different use cases, different query loads, different latencies on the data warehouses. And designing and verifying a quality experience across all that surface area has been really hard. We're a pretty small team, and we've managed to just build up a lot of tooling around testing and around validation that has helped us out a lot. But I'd say that compared to other things I've worked on, just the sheer footprint, the sheer surface area, both on the user side and on the underlying connectivity side, has been staggering.
It may not be the most complicated thing anyone's ever written, but in terms of surface area per engineer, it's pretty up there with anything I've touched.
[00:31:58] Unknown:
And you've mentioned some of the plans that you have as far as offering white labeling
[00:32:04] Unknown:
and these compliance options in future releases of Metabase. Are there any other plans that you have for the project going forward that you'd like to share? Yeah. I think the biggest thing I'd like to share is actually coming up in about a week or two. We're about to release a new version, and it's going to include a feature that I've been waiting to release for almost 2 years. Essentially, the best way to describe it is that we're gonna automatically generate reports and analyses based on things that you interact with in the application. That sounds very vague, but it specifically means that if you're looking at a chart of revenue by region and you're curious about what's interesting about sales of your widget in Kansas, you can essentially click on the Kansas bar and drill into it, and with zero config or zero work, you know, get essentially a dashboard report on Kansas for Q3. These sorts of automatically generated dashboards are interactive, so you can further drill through, and you can also jump to adjacent linked tables and look at connected metrics and segments. And in general, what it offers up is, I don't know if you or your listeners are familiar with getting analyst reports from internal company analysts, where it generally comes back as a PowerPoint or Keynote, and there's usually 10 or 15 slides.
And, like, the first slide's always "here's an overview of how things are going, here's things by region, here's this and here's that." We're essentially trying to automate as many slides of that deck as we can. I'd say today, we do a reasonable job of automating the first 4 to 6 slides, and our eventual goal is to be able to fully automate that kind of in-depth exploration where, for someone who is working on a given account, let's say it's a very big account in the Northeast, they can do things like look at how accounts in the Northeast across their company look, and whether that specific account is anomalous in some way or another. And all this happens, again, with zero config. You just click on a given cell in a result, and you can then explore and re-slice and re-dice all the connected information you have about that, let's say revenue in Kansas. So you can see sales in Kansas, you can see returns in Kansas, you can see chargebacks; basically, anything you have for that cell is accessible to anyone in your company on the fly with zero config. So I'm very excited about that. Yeah. That definitely sounds very interesting and useful. And as you were discussing
[00:34:31] Unknown:
the user interaction with the front end of the application, it reminded me of how it's very easy to get data visualization wrong. And, you know, there are any number of jokes and discussions about ways to lie with data because of how it's being presented. So do you have any capacity built into the platform to simplify or guide users in the proper way to visualize and interpret the data that they're interacting with as they're asking these questions of the underlying sources?
[00:35:01] Unknown:
So I'd say we try to prevent nonsensical things from being done as much as possible. We aren't really doing things like warning you against using an average on a multimodal dataset, so there are lots of ways you can still get yourself in trouble. For the most part, we try to keep things happening at a very descriptive level. We don't do a lot of things like, here's an A/B test, what's the correlation, or sorry, what's the significance, and should you do A or B? For the most part, we're trying to help companies move up Maslow's hierarchy of data needs, and I find that most companies aren't even at the stage where those kinds of questions emerge. So it's not that people are lying with data; it's that people don't have any data whatsoever, and we're trying to get some amount of data into as many hands as possible. We definitely do have, you know, a stated hope that these automatic dashboards and the x-rays that are powering them will, over time, start to look for anomalies. And we do have a couple of alpha features in our current release, if you wanna check that out, where we do a bit of stats and try to, you know, nudge you towards things like, you know, your data looks noisy, do you wanna maybe try it on a larger time scale?
But I'd say that at the current stage of both, you know, our perception of what our users need and what they're asking for, we're largely concerned with just getting data into as many hands as possible in a shape that they can understand.
[00:36:26] Unknown:
I was just realizing that one of the things we haven't talked about yet is what is involved in actually getting Metabase up and running and connected to a user's systems. So I don't know if you can walk through the workflow for somebody who wants to get started using Metabase and promote it within their organization.
[00:36:45] Unknown:
Yeah. So the first thing to figure out is where you wanna run it. You can run us on a laptop with our Mac edition if you have a Mac, and you can run us as a Docker image, so if your company, or you, have a preferred way that you run Docker images, that works. We do need an attached database or some sort of persistent file system to store our application data on, and you'd wanna decide how you wanna do that. We work pretty well on Heroku. We work pretty well on Elastic Beanstalk. And so just depending on how you and your company prefer to run this stuff, you can either run the bare jar or the Docker image, or just put it on Heroku. At that point, you'll need a connection string for the databases you want to use, so it's helpful to have that prepped ahead of time; you enter that information at the end of the setup process.
And from there, you should be ready to go. We'll generate a few dashboards for you, if you want, in our new version, and otherwise, you just go and, like, start making dashboards, drilling into your data, and just looking through what you have. For the most part, you can get something useful on the order of 2 to 5 minutes. Yeah. That's definitely one of the other things that I was impressed by as I was looking through the documentation,
[00:37:55] Unknown:
is that the deployment story and the operations workflow for being able to get the application running and usable are very straightforward and well polished, whereas some of the other tools I have wrestled with have been less than ideal because they're primarily focused on more of a hosted platform, and the self-hosted option is more of an afterthought than a primary goal. Yeah. I think, from day zero, we kinda wanted to keep
[00:38:23] Unknown:
the platform easy to install and maintain and administer. I think there is a very large tension in commercial open source projects between creating something that's so easy to host that no one ever pays you to host it, and making a business around hosting. We very intentionally, from the very beginning, wanted to be super easy to install and set up. And, you know, while we do offer hosting for a small number of people, we're generally not trying to build a business there; where possible, we wanna make it dirt simple for you to do it yourself. That's been a pretty important pillar of the company, and we hope to make it even easier going forward. Yeah. And particularly for this type of system, where you're connecting to people's application data stores where there might be sensitive information,
[00:39:08] Unknown:
having a hosted platform just becomes that much more complicated, because then you become responsible for the compliance of your application in the context of all this customer data, versus pushing that to the customer to handle and make sure that they're properly securing their systems.
[00:39:24] Unknown:
Yeah. And without going into GDPR too much, I mean, it is becoming pretty clear that anytime data leaves your walls, it's gonna be a pain for you. I think one of the interesting things that I believe we're an example of is that there are a lot of products that don't need to be hosted anymore. Back in the day, running a Ruby, sorry, Rails program or a Django app or, you know, PHP was a pain, and you had to worry about the databases, you had to worry about all kinds of other configuration. I think for the most part today, most SaaS applications could be delivered as a Docker image, and on a lot of levels it doesn't really matter whether they're running in your AWS group or the SaaS application provider's AWS group.
I do think that more and more, you'll see applications that are, I don't really have a good turn of phrase for this, but, like, cloud-prem, essentially meant to be run, again, as a Docker image or otherwise, or on Kubernetes, just in your own security group. That both lets you control where data goes and lets you control who sees what when. It makes compliance under various regimes way easier to verify, and I think it'll also end up changing the shape of what people build and sell in very good ways. And so while, you know, the bad old days of on-prem software that showed up on CDs that you took over to your data center and flashed onto servers are, you know, thankfully dead, I think the idea of running your own software is coming back into vogue for a large class of applications. And I think that we're a great example of that, where, you know, it's not that much harder to run us than it is to run your own blog, and most people just spin up WordPress.
[00:41:12] Unknown:
Alright. So with that, I'll have you add your preferred contact information to the show notes for anyone who wants to follow the work that you're up to, get in touch, and see how Metabase is progressing. And as a final question, what do you see as being the biggest gap in the available tooling or technology for data management today?
[00:41:33] Unknown:
It's hard to pick just one. I think the biggest gap is, let me think about how best to phrase this, the loss of semantic information through the ETL chain. I know that sounds really, really dry, but at some point, the sheer amount of things that you have to specify in most analytics applications is kind of nuts. So, you know, I think about something like, I'm gonna do a dump from Stripe or from Braintree or wherever into a database. You lose all the semantics of what each column means, and then you've got a bunch of reconstruction that happens down the chain. There are things that get reconstructed in the ETL layer, and then when you're building out the dashboards, you have analysts that have to relearn what the various columns mean, and it just seems like all of this should be held on to through that entire chain.
And some of this is just a function of there being lots of little tools that are cobbled together. But I do think that, you know, in my vision of what analytics could or should be, just not having to repeat myself at every level of the stack would be great. And so, I don't know, I hesitate to say this, but some sort of data dictionary that cuts across all these things would be lovely.
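A rough sketch of what such a cross-tool data dictionary might look like in Python; the Stripe-like field names and the validation rule are invented to illustrate carrying column semantics through the chain instead of re-specifying them at each layer.

```python
# Hypothetical column-level dictionary that travels with the data.
CHARGE_DICTIONARY = {
    "amount":   {"type": "integer", "unit": "cents", "doc": "Gross charge amount"},
    "currency": {"type": "string", "doc": "ISO 4217 currency code"},
    "status":   {"type": "string", "enum": ["succeeded", "failed", "refunded"]},
}

def check_row(row, dictionary):
    """Flag semantic drift instead of silently losing meaning in ETL."""
    problems = []
    for column, spec in dictionary.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif "enum" in spec and row[column] not in spec["enum"]:
            problems.append(f"unexpected {column}: {row[column]!r}")
    return problems

print(check_row({"amount": 1999, "currency": "USD", "status": "voided"},
                CHARGE_DICTIONARY))
# ["unexpected status: 'voided'"]
```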
[00:42:51] Unknown:
Yeah. The provenance of data as it flows through systems is definitely a perennial problem, and I think that a lot of the effort that's going on with metadata management systems is helping a little bit, and some of the additional capabilities of maintaining a schema along with data as it gets serialized are useful. But I think that the widespread adoption of JSON is definitely something that factors into some of the problems associated with maintaining that context around the information, because it's so easy to just spit out some free-form data and ingest it into another system, and at that point, you've lost the context that you're looking to maintain. So things like the JSON Schema specification and the built-in schemas of things like Avro and Parquet are definitely useful to help combat that, but I agree that it is something that should be more of a first-class consideration as these systems are being built and evolved.
[00:43:48] Unknown:
And the hope is, I mean, we're kinda working on that, and I know there are some other related projects that are kind of nibbling away at it. I think, eventually, it'll just start to happen as some of these products become more vertically integrated. At some point, it's much easier to specify a given metadata format within a group of applications that talk to each other, as opposed to it being shared across 30 applications that are all used by different people and all built by different people.
[00:44:22] Unknown:
Alright. Well, thank you for taking the time out of your day to join me and discuss the work that you're doing with Metabase. It's definitely a very interesting project and one that I plan on evaluating for my work. So thank you for that, and I hope you enjoy the rest of your day.
[00:44:36] Unknown:
You as well. Thank you so much.
Introduction to Sameer Al-Sakran and His Background
Journey into Data Management
Concept of Data-Driven Companies
Motivation and Goals Behind Metabase
Metabase as a Starting Point for Analytics
User Interaction with Metabase
Unique Use Cases of Metabase
Technical Implementation and Evolution
Handling Large and Complex Queries
Supporting New Data Sources
Building a Sustainable Company Around Metabase
Security and Compliance Features
Differentiating Features of Metabase
Challenges in Building and Growing Metabase
Future Plans for Metabase
Getting Started with Metabase
Biggest Gaps in Data Management Technology