Self Service Data Exploration And Dashboarding With Superset

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With our managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Packaderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to data engineering podcast.com/linode

today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.

Your host is Tobias Macy. And today I'm interviewing Max Boeschmann about Superset, an open source platform for data exploration, dashboards, and business intelligence. So, Max, can you start by introducing yourself? For sure. Well, first, thank you for having me on the show again. So that's 2 back to back in a short period of time. But, yeah, quick intros for people who missed the previous episode.

My name is Max. I'm the original creator of Apache Airflow and Apache Superset projects that I started

in 2014,

15

while at Airbnb.

Since then, I've worked at, well, Airbnb, Lyft,

Facebook before

doing data engineering. So I've been really involved in the world of data engineering over time and since before data engineering was data engineering.

Since then, I started a company called Preset that basically serves Superset as a service and, you know, with all sorts of bells and whistle, and that's, like, completely hassle free. So we're

very strong contributors in the Apache Superset community,

and then we're strong operators too because we we operate the the software at scale on the cloud as a cloud service.

Yeah. And as you mentioned, you've been on the show before, both this 1 and my other 1 about Python. So you are actually on, I think, episode 3 of this show back in 2017.

So 1 of the first guests to get this show off the ground talking about what even is data engineering.

And then before that even, you were on podcast dot in it to talk about airflow.

And then recently, you were on there to talk about superset. So for anybody who wants to get more about the backstory of the project and how it works under the hood, I'll put the link in the show notes for that. And so I was gonna say too, like, it's been a little while. It'd be really interesting to revisit some of these questions too. Like, what is data engineering? How has it changed over the past 5 years? There would be a lot to talk about.

Save that for yet another episode.

Absolutely.

We'll have to have you back on to say what is data engineering now

in 2021.

Exactly. Even before the show, we're just chatting here, and we're talking about how we're doing data engineering as a startup now. So Preset,

the company I started is about 45, 50 people now, and we don't really have a data engineering team. And, like, it wouldn't be cool to talk about, like, how

really small data driven company,

how they do data engineering, and what's the emerging stack there too. So I'm signing myself up for many more shows here.

Alright. We'll have you on every week for the rest of the year.

And so for people who haven't listened to any of the other episodes that you've been on, can you give a bit of a refresher about how it is that you got involved in the area of data management?

So I started

my career very early. I did 1 year of web development

super early on with, like, you know, tools

that people would not even recognize the name of anymore. So I think I was doing ASP, active server page, before, you know, dot net was even a thing. And there's a time where we'd write, like, VB script on the client side, not even JavaScript.

That was possible to do that back then. So long so did a tiny bit of web development and then jumped into an early data career when,

you know, at that time, I think, like, data warehousing

existed. There was resources like the Data Warehouse Institute was a big thing.

Ralph Kimball

and Inman, like, were were kind of the the grandfather now of data warehousing.

So there was, like, some literature, some best practices,

but it was a very different world. Back then, the term data engineering did not really exist. Tools were mostly drag and drop.

And that's when I got involved into it as for a company working company called Ubisoft, a video game company

doing all sorts

of

data management for them. More in a, you know, financials, so we were doing a lot of supply chain retail

and then a lot of financial type data too. So general ledger, payables, accounts receivable. So very different than your kind of growth analytics.

A little bit later on, I joined Yahoo where it was really the rise of

big data and kind of the birth of Hadoop at the time. That was, like, you know, heavily kind of financed or

sponsored, is probably the right term,

by Yahoo at the time, and went went on to Facebook

and the rest of the history. And I went on to, Airbnb where I started working on open source. For the people who haven't listened to the interview that we did on the Python podcast about Superset, Can you give a bit more of an overview about what Superset is

and the main use cases that it's being used for, particularly as a data engineer? For sure. Yeah. So Superset is an open source data exploration visualization

platform.

We're very much in the business intelligence and data visualization space. So if you're familiar with tools like Tableau and Looker

and Data Studio, right, like, we're very much in that space, but as a strong open source kinda

counterpart in that area. So if, you know, data sovereignty

is important to you, if having a more modern type set of tools or platform that's a little bit more modern and more probably extensible

and flexible,

you know, I think that Superset is a great choice now for all the reasons related to the fact that open source is great. Right? It's a better way to write software. It's a better way to distribute software. That makes for software that is more malleable, that people can can turn into exactly what they need it to be. So really happy, and it's it's kinda life goal for me to get a very strong open source solution

in the space that's a strong contender that can really compete with the more proprietary tools in the space. Digging a bit more into sort of the broader ecosystem around Superset, as you mentioned, there are a number of tools, both proprietary and open source. And I've been seeing

increased references to Superset as being kind of the default for people who are building out a new stack, particularly around the open source data ecosystem

where, you know, it might be s 3 with Trino and Superset or, you know, s 3 with BigQuery and Superset.

And I'm wondering

what you see as being the motivating factors for Superset gaining so much popularity

in this space

when there are other players such as Redash or Metabase on the open source side and Looker and Tableau and Data Studio, and I don't even know how many other options in the proprietary side. The very competitive space, you know, that has a lot of history. So I think, like, people have had business intelligence budget for the past, like, 2 decades. So it's a well known

space that's fairly mature

at this time. So the first case is right to talk about I think there's kinda 2 path to explore the and advantages of something like Superset in the space. 1 is

open source and everything that comes

from open source. So it's really clear to me that over time, we've made a case for

and we've had, like, kind of global acceptance around, like,

why open source matters.

A matter of fact, a blog post on the topic that we published a few weeks back that's called the Future of Business Intelligence is open source, where I explore,

you know, why is

open source winning in general, and how does that translate into

the space that we're in.

Right? So I tell a bit of the story of Superset and how it came to be, and then talk about things like

extendability

and integratability,

like just being able to integrate Superset with whatever you use

because it's open source, right, and being able to customize it to fit your need. On that topic, there was a really good blog post out of Airbnb that's called something like how Airbnb's turbocharged

superset to to to power their more specific use cases. Right? How then it's the story of how they customize superset to to make it really work for them internally.

Outside of that, you know, because we're open source, we have the power of the community.

I think other tools

tools like Tableau and Looker have some sort of, like, user community to where, you know, people meet and exchange. But I think with open source, that's such a

strength, right, and such a natural thing to have the community,

and that enables us to have, like, really good documentation,

really good example use cases, just like a very dynamic community that's

really supporting

the project and the rest of the people in the community that are using Superset.

1 big thing about open source that's clearly a theme is, like, avoiding lock in. Right? And I call this data sovereignty. So it's the idea of, like, really owning, be able to customize change,

get the help that you need, select a vendor if you want 1, and, you know, having the ability to take over as needed. Right? So if something happens

with a vendor that you might have in the space or, you know, you're you're never locked in. You can always go and decide to take over the software

and run it on your own. So I think that's a really important

guarantee. Companies like Tableau

and Looker

got acquired. You know, there's more example in this space. There's been a fair consolidation in this space.

Chartio recently got acquihired, so that leaves

people that use Chartio today, like, having to off ramp in the next year. So for these people, I think to tell them, like, hey, you've been burned once, like, maybe this time around, you might wanna select something that's open source. So that's covering the open source aspect. The other aspect is around

probably, like, the convenience of

cloud native and SaaS, and, you know, I would say on that front, there's a lot of vendor tools in the space that

have more kind of previous architecture,

that are more kind of desktop based or that don't work as well with the modern data stack. And I think that's a strong differentiator for Superset.

And then there's this idea of, like, you know, the convenience of a cloud based tool, and I think, like, we've seen that in other spaces that are not business intelligence, the data visualization. And you think about

things like on the data ware warehousing side, people love the convenience of BigQuery

and Snowflake because

they're just gonna server less, and it's really software as a service where you don't need to have a DBA or if that term even still exists. Right? Or you don't need like, anyone to be on call for the database because it's a service. So I think, like, open source now, we see the rise of a commercial open source vendor that can be really good partners in getting the best of both worlds in terms of, like, getting the guarantees that come from open source I was talking about, plus the convenience of having a service

without the lock in.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code.

Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours or days.

DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection.

DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.

DataFold integrates with all major data warehouses

as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.

Go to data engineering podcast.com/datafold

today to start a 30 day trial of DataFold.

Once you sign up and create an alert in DataFold for your company data, they'll they'll send you a cool water flask.

Another interesting evolution

of the overall space of data engineering and data management

is sort of where the responsibility

lies for tools like Superset. Because

on the 1 hand,

it is kind of the responsibility of data engineers to be able to provide data in a manner that's clean and accessible for other people to be able to do analytics.

And on the other hand, there are roles such as analytics engineers and data scientists are becoming more savvy for some of these infrastructure aspects

where they might be doing all the analysis in Jupyter or, you know, the analytics engineers might be handling kind of the transformations

using dbt or something.

And where have you typically seen the responsibility for getting something like Superset

deployed and managed?

And then

where is the dividing line between

the kind of data platform and data engineer

providing the tools? And then, you know, do you see them moving into actually building the dashboards as well, or does that fall

somewhere else within the kind of role of responsibilities in the data team? That's a good and complex question too. I think it relates to what I would call, like, the data maturity life cycle of a company too. So that again, depending where you're at on your data team or, like, data literacy journey and culture as a company,

this question would be answered probably in different ways. In terms of, like, from the infra standpoint,

who is in charge of operating the tool or even, like, selecting and then providing service and being on call for the tool?

I remember in the rise of the data engineer, which is a blog post I wrote a long time ago, that probably

was before our first interview that might have triggered my first interview. I don't know. I forgot. But I talk about how in smaller organization, data engineering is also

you know, supports the function of data infrastructure

management. So if you have, like, 2 data engineers in your organization,

they probably are in charge of either selecting the vendors or selecting the software that's being used and making sure that software runs and that they probably are on call to for when it gets down. In bigger

organizations,

when I was at Airbnb, for instance, right, there'd be a clear distinction between a data infra team and the data engineering team or function.

Right? And then, clearly, the data infra team is, like, very horizontal

and in terms, like, it supports the organization

as a whole, where data engineering, there's a little bit more attention to kinda pull them vertically, right, to be a little bit closer to maybe certain product verticals in some cases.

Then in terms of, like, who builds their reports and

dashboards, I threw the term

data literacy

a tiny bit before. So talking about this, I think

more and more people

going from the information workers,

right, so people who are just like people work with a computer every day and, you know, answer emails and whether they're, you know, PMs or

business users of different kind of anyone in the organization, really,

need to become more sophisticated with data, and we see these people using

superset and self serve there, like, whether it's consuming a dashboard

or using a no code explorer inside of Superset. Right? They might not know SQL, but in Superset, they're still able to do a lot of, like, slice and dicing using the drag and drop type interface.

So, generally, I would say that

more and more people, like, regardless of the function in modern organization,

have to become more data literate and familiar

with powerful tools like Superset.

The vision for Superset is really to be across the whole

consumption layer and to cater to the different level of sophistication

of users that might be more or less, like, data literate.

Right? So tip so you can imagine that within an organization on the data consumption layer, you're gonna have people who are just gonna consume dashboard that have been built

by others,

people in their team, or their, like, analyst partner, or data engineer, whoever has built a dashboard that's relevant for them.

Then, you know, I think we see a lot of people being able to do a lot of slice and dicing.

There's a SQL IDE in Superset. So if you do speak SQL and we see more and more people across different functions learning SQL

nowadays,

good example of this is Airbnb has a program internally called Data University

to do

to do better at their work. Right?

The last 1 I would mention that's complementary on the consumption layer that goes up the ladder of sophistication

is the notebook. Right? We see the notebook as

if you can't do it in SQL because, I don't know, maybe there's a machine learning component to it. Maybe there's some crazy

wrangling that's better done, you know, in a data frame than in SQL.

We see notebooks as that last layer where

it's less accessible,

it's harder to use, but it's much more flexible and powerful.

Another interesting aspect of something like superset,

because it's not just a static dashboard where here are some pretty charts and maybe there's a, you know, limited amount of being able to reorient the axes or something. It also has a full fledged data exploration layer.

And I'm curious

what you see as

the opportunities for collaboration and so do the capabilities that Superset has for being able to

version and fork and merge different queries.

There's a fair amount to unpack here.

The first 1 is, like, collaborative

workflow

and on the data consumption layer. I think that's good to have.

Collaborative tools are transformative in a lot of ways. Right? We've seen Google Docs versus,

you know, Microsoft Word, you know, is a game changer for a lot of us. So we're definitely bringing a lot of this

into Superset

over time. There's

direct collaboration on the document, and there's, like, social aspects

too that are similar and intertwined. I would say, you know, they're in a similar space conceptually.

But I think, like, the social aspect is something that the internal data tooling at Facebook was really great about. So in most of the data tools internally at Facebook, you would have kinda social aspect of, like, who's the owner, who's using this dataset, who has created this chart, who's consuming it, and then a way for people to kinda leave comment or annotations for other people. So that is an important asset that we have already in superset, to a certain extent, and that we're pushing

forward. Now there's a question of, like, do we wanna have 2 people working on the same query or the same dashboard at the same time and see their cursor

move around?

And I think, at least from the SQL IDE perspective, if I'm writing SQL, I'm like, don't touch my SQL. Don't modify my SQL. Like, this is my thing. I can share my query with you, but I don't wanna be collaborating on SQL per se, like, in real time.

But I do wanna share queries which we allow, and we're gonna add a lot more around, like, the commenting type interface,

social collaboration within the tool, which I think, like, is super important, and we're doing we're investing in that. I think, like, all tools nowadays really need to move in that direction. That's something that users expect.

Because of the fact that there is this data exploration capability

and you're handing that over to people who are trying to answer their own questions and maybe don't have the same level of training with distributed systems and query optimization as a data engineer or a data analyst.

What are some of the potential challenges that

somebody who's providing Superset as a service to the organization might run into and some of the ways that they might avoid allowing people to end up in a suboptimal user experience because of performance issues with the queries or trying to bring in too many different data sources together in 1 visualization or something like that? There's a fine balance there, you know, between, like, giving people power, but also preventing them to shoot themselves in the foot.

Generally,

in software, I think 1 thing that people don't appreciate

as much as they should is

what it takes to harden

a piece of software. It's pretty easy for anyone to write, you know, a competitor to something like Airflow to be like, I'm gonna write a small scheduler and something that writes DAG, but I think the real part of the real value

of software and of

open source type workflows and then type approach

to development is that we can massively harden a fast moving piece of software.

And you'll see over time that software,

you know, will build an immune system. If you think of software as an organism, that's like every time that there's a little cut or a little break or something that hurts,

we think of a solution to prevent that in the future. And sometimes it's like from an architecture standpoint or from a people request beauty standpoint, it's not always the most beautiful piece of code that might be adding a retry decorator in the right or wrong place. But there's a tremendous amount

of value that exists in that. Now,

what does that mean for something like Superset? So 1 thing is like, how do we prevent ad hoc user from destroying the underlying analytics database?

So luckily, I would say Superset is

architectured

in a way that

is doing, like, most of the heavy lifting

on the underlying database. So, like, not our problem. We just push all the heavy lifting to the underlying database. You can go and, you know, destroy that through our powerful

tools. So

we do offer some ways to prevent some of that, and that's part of the immune system we've built. I think that responsibility

of, like, harming the underlying database is probably best done.

So say if you're using something like Presto, and I forgot Trino or Presto, whatever 1 you picked in that space,

I think presto is probably receiving a mixed bag of queries, like, 1 on 1 and from Superset. But, Pablo, maybe I've looked at maybe you have collections of other tools. So I think it's important for the database engine

itself or for the database, you know, organism to defend

itself pretty well.

We do have some hooks in superset that enable

people to kinda hook in

and control a little bit more the flow of what's happening.

I know at Lyft, I think we had some sort of, like, proxy in front that would do some routing and some rate limiting and some queuing, you know, to help Presto as a Presto proxy.

In Superset itself,

we do have things like a query mutator and a

connection mutator where it's easy to add configuration

inside Superset that will

intercept certain pieces of SQL

in the database connection

on its way to the database. So you can add logic in there that could say something like,

if there's a lot of query running, don't send the query over and and wait a little bit. Or you could say,

run an estimation ahead

of time. And if the estimation is, like, less than a certain amount or is bigger than a certain amount, like, per like, you know, kill the query. So we have these configuration hooks

that can be used

to intercept

query and route things,

and that's probably

the extent of it. I think we have a feature behind a feature flag that is

a button in SQL Lab that enables people to estimate the cost of a query prior to running it.

So that's something to dig out there in the feature flag framework,

if you're curious about that. Yeah. Intercept the query that's trying to do a Cartesian join on every order for the past 30 years.

Don't do it. But the reality is, like, solve this problem in superset. It's still a problem inside the other tools. You know, whether people use the CLI

to query Presto and then people will be like, okay, well, Superset doesn't wanna run my query. I'm gonna, you know, find a different way to attack and destroy the system. Right? So

sometimes, like, it feels as an operator

of these tools, like, you have a barbaric

hoard of people that are trying to attack and destroy your software. I know, like when I was operating airflow at scale,

sitting really close to to the data and for a team in different organization,

like, sometimes we'd look at things, incident management, or like, you were trying to destroy our infrastructure.

And then

And then we dig a little bit deeper and then improve our immune system

and come out stronger on the other side.

And in terms of the different data sources, Superset supports a large and growing number of query engines and data storage systems.

And I'm wondering what you have found to be some of the most challenging

either specific systems or categories of systems to work with

for something like Superset where you're trying to enable interactive data exploration and visualization.

So I think, like,

on our front,

the way that we integrate through different database engine, and it's, like, the huge kinda integration point for Superset and other tooling is how we connect to different

analytics database engine. Right? There's a bunch of other integration points, like, for the metadata database, for the caching layers, and then multiple caching layers.

But, like, going on to the analytics database integration,

So we do that through a hyphen standard that's called DB API.

It's a little bit shallow. It's not very prescriptive. It's a little bit loose. You know, it defines things like connections and cursors, and it's pretty high level.

And then, on top of that, we use something called SQLAlchemy.

That's a popular library, you know, in the Python world. That's a SQL toolkit and an ORM build on top of that SQL toolkit, and we leverage the SQL toolkit quite a bit. And this knows how to speak different SQL dialects. So so you can do something like write a query in a more object type of form, as in, like, select dot group by, you know, dot filter dot this dot that. And then it knows how to generate SQL and different SQL dialects.

That's been challenging. You know, the different SQL dialects

SQLAlchemy is a little bit shallow in terms of, like, what it defines and what it does not define, so it might know how to issue kind of a limit or a row number limit define. So it might know how to issue kind of a limit or row number limit on different engines, but it doesn't know how to truncate a date, for instance. So the semantics around

using the function in different database engine are just not specified by SQL Alchemy, so we need to manage that

in our own Python package,

where we add more semantics around date truncation and around, like, how to handle the cursors and how to do things like, if I wanna get a percentage of completion of a query. So if you run a 2 hour long query on Hive or a 20 minute long query on Presto,

it's great to be able to show the user that you're at, you know, 39%

and that,

you know, the the query is still running.

So we have a layer there. If you look at the Suprasek code base, it's under a package called DB engine spec

for all the things that are specific to database engines.

So it's like this kind of compatibility

layer, and it's been a struggle to work with all this, like, vast array of

drivers with different level of maturity. And we've written our own, and we've contributed back to, like, the BigQuery,

you know,

client and this the SQL alchemy dialect for it. So we've contributed. We've written

the Elasticsearch 1, the Druid 1,

a Google Sheets 1 that's a little bit more kinda

fuzzy and complex,

which probably brings me to the second part of the answer,

which

is 1 of the challenge has been that some databases don't speak very good SQL at

all, or are learning SQL over time. So you look at our journey with Apache Druid,

which didn't speak SQL at first, and has been learning SQL, and has pretty good SQL support now.

So for instance, like, some engines don't support subqueries.

There's all sorts of things that they may or may not support. Right? It could be a window functions or it could be common table expressions, CTEs.

So an example

of

what we do is sometimes you wanna plot a certain number of time series, but if you group by, I don't know, customers, you might have, like, 10,000 customers. So you don't wanna bring, like, 10,000 trend lines. So we'll run a subquery to say, like, what are the top 10

customers based on this metric? And then do some sort of subquery join to do this.

And for databases that don't support

subqueries

by the way, it's great to have a technical audience because I can get into a little bit more details here. But so for the database engines that we cannot do a correlate join with a subquery,

what we do is we do a 2 phase query. So then the mechanics of Superset are, like, I'm gonna run a first query to retrieve

result. Based on this result, I'm gonna assemble

a second

SQL Okami query that will be translated to right dialect, and then get the top end time series with the time detail for those top end customers in that case. Right?

So fair amount of complexity

there. And if if you think about unit testing this stuff and guaranteeing, you know, that you're not gonna see too much regressions in this area as the drivers evolve

and as the dialects evolve. It's been challenging, but someone's gotta do it. So we're definitely at the forefront of that. We've talked about

refactoring this maybe as a contribution to SQL Alchemy or as its own package that would be a little bit more generic and that other people could use as

a kind of this, like, slightly more

involved

abstraction

to, you know, to be able to talk to multiple engines.

Yeah. That was gonna be 1 of my questions is if this is available for the broader Python community to use in isolation from Superset. So you already addressed that question. And another interesting aspect of what you're saying with having to kind of remap some of these subquery joins or common table expressions for engines that don't support it, you know, then you also get into some of the complications of transactional isolation and some of the different semantics about how databases manage that. And, you know, are you seeing the same state of the data between those 2 query executions, and do they support transactions across multiple queries? There's a lot. Like, I wish we had, like, a newer version of DB API that would be more prescriptive.

There's all sorts of challenges too around

exceptions that are returned from the different engines. They are not gonna and narrative from base class of exceptions too. Or I don't think that DB API is prescriptive on that. I think it has some specification, but it's implemented in different ways. Right? A lot of people go outside of the spec.

So it shows the importance

of a standard

and of a spec, but, you know, the challenge there too is that if you have a spec that's too prescriptive,

then people don't wanna implement

it too. So maybe from that perspective, DB API is a decent compromise.

If any indication of that is

how mature is the connectivity to database engine in in Python as opposed to other languages.

I think Python is 1 of the language by that does best in this area, not necessarily in terms of rigor, but in terms of, like, things

that generally work. Right? So maybe,

on the Java JVM side, you might have, like, more rigorous driver, but less implementations,

less diverse set of implementation.

In terms of

the use of dashboarding tools within an organization,

You know, traditionally, they've been used for just show me the data. I'll make my own decisions based on that. But as organizations grow in terms of their maturity of the use of the data stack and the data stack itself continues to grow in terms of its capabilities and the sophistication of the people who are operating

it, there have been growing trends in terms of,

you know, how the dashboard is used. So rather than just being present the data to me, there's been more of a trend towards

rather than just business analytics. I want decision analytics. I want you to show me what is the data and then what am I supposed to do with it. So, you know, maybe I see this trend line and then I click a button. Okay.

My stock is declining at a certain rate. I need to order more widgets right now and then embed that right into the dashboard. And I'm curious what you have seen

as the creator and user of Superset,

how people are able

to implement that style of analytics and that sort of evolution of the sophistication of how analytics are used

to present to end users.

The big underlying trend, right, is democratization of access to to data and then up leveling data literacy in the organization.

And

where the model in the past was

data, I think, was a specialty. Right? Like, fundamentally, like, you would have maybe a small data team, or traditionally, you might have a data warehouse architect

that, you know, is in charge or a group of data warehouse architects that are in charge of, like, being the librarian

of the data within the organization and structuring, organizing the data from different sources into a data warehouse. And then you would have the business intelligence engineers that would install things like business objects and micro strategy, and they would create abstraction layers in this, what we call back in the semantic layer. I know probably the audiences might not be familiar with that term, but it's like an abstraction layer that enables

people to self serve a little bit more on top of the data warehouse. So traditionally, we we'd see a lot of that. Right? The tooling was oriented that way. And then you need it for the things to make it in the warehouse,

in the semantic layer so that then people could kind of self serve. They could order things from the menu at their restaurant, to use, like, an imperfect analogy. And I think we've seen people move towards much more of a

buffet slash open kitchen

where

it's a big free for all, and people are able to, like, consume ingredients that are, like, more or less processed, and there's, like, a wider variety, and people are welcome to go in the back room

a lot more where, you know, maybe before

the chefs was like, Hey, I don't want any customers in the kitchen. Right. They keep the customers out of the kitchen. Now, like, I know it's a big party. Everybody's invited. Anyone who has the skills can write SQL or write Airflow jobs

or create a dashboard. So if I take that trend and kinda translate that into, like, what we see in terms of, like, superset usage pattern or dashboard usage pattern is

before we would have, like, specialists

creating dashboards for

executives. You know, it might take the specialist team, like, months to create a dashboard that would have a life cycle of many years. Right? Like, the CEO dashboard

that's been built by a group of specialists, and this dashboard will be pretty much,

you know, more or less the same over the next few years. So the trend that we're seeing is an acceleration

of how many dashboards are created,

and we see a shorter

life cycle of dashboards. If your team is working on a particular

product feature or set a feature or

an analyst team is focusing, say, on a topic like

the effect of COVID on, you know, listings at Airbnb or something like that. Right, like, they're probably gonna build a dashboard for that, accumulate a set of learning.

And as they tackle that area and build the knowledge and the understanding and make data informed decision to change the product,

this dashboard will probably

get, you know, just unused over time. And then they'll go to work on

whatever the next hot topic is for that team or that smaller group of people and then probably create a new dashboard. So definitely a trend of, like,

more people building things that are more short lived and aligned with

more operational use cases too. And by that, I mean, the dashboard that supports the daily work

or the weekly work of a particular person or team.

And that's when I think more value is typically delivered at that point. Right? Because you have something that's more targeted that supports the process more clearly or more deeply. It's just more fitted for a smaller group of people. So that's clearly a big a big trend in analytics.

RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery,

Amazon Redshift, and more.

Their SDKs and plug ins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming.

With RudderStack, you can use all of your

The other big trend that I've been seeing in analytics and dashboarding

is starting to integrate some of the data quality indicators

into the dashboard so that you can see as a consumer of the dashboard or somebody who's doing some exploration,

this is the last time that the data source was refreshed. This is some of the lineage information.

These are the data quality checks that were executed. You can see that they were all passed or, you know, maybe 1 of them failed, but it's not critical. So you can understand as you're doing these explorations or as you're viewing the dashboard,

you know, how many grains of salt you wanna take with it.

That's, you know, 1 underlying

trend with that. So there's a clear need for that context. Right? Like, so while I was at all 3 previous companies, I was sitting really close

to the teams that were working on,

call it, metadata discovery

or metadata management, data dictionary, but that's just a huge area. So at Lyft, I was

sitting right next to the team working on Munson, which is an open source project in this space of, like, data catalog and

metadata discovery.

While I was at Airbnb, I was sitting really close to the data portal team that have published so data portal is not open source, but they have published a lot of their experience and a lot of, like, very interesting and influential information around data portal. So I encourage people to go and dig in Google data portal, Airbnb. They're gonna find some interesting information in the space. And similarly, my team at Facebook owns something called IData, and we're really close to a team working on metadata graph. That was really a metadata graph.

So been really close to that space

and multiple environment. I think it's validated that we really need

more of this type of tooling and context.

Some pointers in this is, like, there are some emerging standards in the space. I'm really excited

that we see a few standards I wanna mention. So 1 is Open Lineage.

That's an open source project that's really interesting. That's, I believe, was started by, Julien Laddem, who is the original creator

of Harkey,

and I believe, like, very involved into Aero. And

they're just a prolific, amazing person. And I think around this project, we start to see a consensus

from the different kind of actors in the space. Right? Like, so for us, we raised our hand and be like, hey. If there's a standard, we wanna be part of it. We wanna implement it from the superset project standpoint. I think the Airflow and DAXTER,

I think and then projects like Data Hub and M1sen are definitely also raising their hand to say, like, if there is an emerging standard, we absolutely wanna be at the forefront of implementing it.

Another pointer is something to call Marquez, I believe, which is kind of a if open lineage is the standard and and, spec,

the Marquez is gonna reference implementation

for it, and it's a pretty sweet

project that I would encourage all data engineers to check out, especially as it takes shape.

What does that mean for Superset? Well, first thing is as these standards

emerge, we want to implement them. And even in the absence of standards,

we wanna integrate, like, with the actors in the space, really. Right? And Munson and Data Hub

and whoever else. So we're at the forefront of, like, wanting to implement that. We have some hooks already. So 1 thing more on the data quality side is,

super set, we have the ability

to semantically

certify

datasets

and or metrics within these datasets. So you can do that through a REST API or it can do that in the UI,

which adds a little metal or a little, like, stamp by the dataset that says this has been certified by this person in this context

so that people navigating the tool

can get a sense for

which sets of objects they can kind of proceed

with confidence with and maybe which set of objects maybe they need be a little bit careful. You know, this is not a certified dataset.

So

take that. And I think it's more like a positive in general just to say, this is a certified dataset, so you know you can play with it and that the metrics and dimensions that were gonna come out of that should be trusted.

So, you know, really looking forward to the emerging standard, implementing them and providing a lot more context within Superset so that when you you're in a SQL IDE,

you should be able to get the lineage context, and you should be able to

navigate to something like a Munson and get even more context

so you can really understand

everything you're doing. Who's the owner? When is this dataset, you know, typically lending every day?

Who else is using this dataset?

Where's the logic for that? So if I need that for reference, I can get access to it very easily.

So I think that will be a trend in the tooling where we're gonna see

the tools talk to 1 another and exchange metadata in a much better way than they do right now. Yeah. Definitely excited for a lot of the developments there and Open Lineage in particular.

And

another interesting element of the kind of data quality and data lineage aspects and the dashboard being, you know, probably the first place that people are going to be interacting with the data is how it factors into to the overall governance strategy.

How does Superset act as a facilitator of the data governance strategy?

I remember the term data governance is kinda overloaded too, so, like, we should try to define this. So I remember back in the days, we had the data stewardship and data governance as 2 different things.

I know,

you know, data stewardship

too, and governance, that's some aspect of 2 data access policy of, like, having people who can define who can access what.

And then I guess there's another aspect that's just, like, the certification

and validation.

What are all these data objects and and who owns them and, you know, are they reliable

or not? When you ask about data governance, curious, like, about what you think exactly. Yeah. I mean, data governance definitely has become overloaded, and I kind of tend to use it in the broadest form because that's kind of its original intent. It's been

narrowed in scope in a few contexts to just mean, you know, access control and privacy management and regulatory compliance. But I think you kind of answered broadly that superset

is useful in this context because it does have access to some of the lineage information and the data quality checks. And because the security layer is extensible, you can enforce some of the access control

and, you know, privacy management to do things like data masking before it gets shown in the UI, things like that. Yeah. So so Superset definitely, you know, enables people to have very complex data access policy.

If they are able to express it, you know, on a piece of paper, they should be able to implement it and enforce it inside Superset.

I think a lot of the challenges in this space are,

you know, they're organizational for in any given organization, if you ask who should have access to what data, you know, it sounds easy at first, but it often reveals itself as a infinitely complex question

that no 1 really owns

and ends up with kinda half assed solution pretty often. People are, like, okay. Well, we explored this. It's extremely complex. Let's go back and say, like, there's gonna be 3 level of access here.

So Superset definitely enables

a lot of this.

There's inherent challenge around, like, constraints and guarantees there too, so if you have a lot of process and requirements

and constraints

around governance,

that fundamentally kind of plays against

some of the data democratization,

you know, aspects of things. So there's different forces at play. Like, 1 is, like, you know, the more chaos of, like, you know, enabling a lot of people to do what they want, and then there's, like, making sure that, you know, they don't shoot themselves in the foot, and that, you know, we get to cohesive answers.

2010 to 15, 16 as a general trend, 2010 to 15, 16

as a general trend, and then, you know, with GDPR, I think that was, like, 2017.

I think

There there's a little bit of a counter force there too of, like, hey, we need to know who's accessing what, and we need to be able to audit this and control

certain things too. So

multiple forces at play, I think organizations should be able to define their policies and enforce them in tooling like superset 2. So we don't initially to be too opinionated

on that too. And we want to enable people to just define their policy and

implement them in our tooling. For anybody who wants to learn more about Supersaid and follow along with you and the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, what do you see as the biggest gap in the tooling or technology for data management today? I think with the metadata integration we talked about, so having a better metadata

exchange,

and I can really kind of weave in the different tool sets together and preserve context. So that means, like, it should be easy to navigate from things like M1SEN, Airflow, Superset,

and, you know, to to consume all of that

business metadata, operational metadata,

you know, lineage metadata in a way that it feels like a a set of tools that works very well together.

Well, thank you very much for taking the time today to join me and share the work that you've been doing at Superset and some of the ways that it can be used in data organizations to help satisfy their needs. Definitely very interesting tool and 1 that I plan to take advantage of myself for my own work. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day. Excellent. Thank you so much. It was a pleasure to be on the show again.

Listening. Don't forget to check out our other show, podcast.init@pythonpodcast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site at dataengineeringpodcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at data engineering podcast.com

with your story. And to help other people find the show, please leave a review on Itunes and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links