Summary
The reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates with your data stack, how you can extend it to fit your use case, and why open source systems are a good choice for your business intelligence. If you haven’t already tried out Superset then this conversation is well worth your time. Give it a listen and then take it for a test drive today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Max Beauchemin about Superset, an open source platform for data exploration, dashboards, and business intelligence
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Superset is?
- Superset is becoming part of the reference architecture for a modern data stack. What are the factors that have contributed to its popularity over other tools such as Redash, Metabase, Looker, etc.?
- Where do dashboarding and exploration tools like Superset fit in the responsibilities and workflow of a data engineer?
- What are some of the challenges that Superset faces in being performant when working with large data sources?
- Which data sources have you found to be the most challenging to work with?
- What are some anti-patterns that users of Superset might run into when building out a dashboard?
- What are some of the ways that users can surface data quality indicators (e.g. freshness, lineage, check results, etc.) in a Superset dashboard?
- Another trend in analytics and dashboard tools is providing actionable insights. How can Superset support those use cases where a business user or analyst wants to perform an action based on the data that they are being shown?
- How can Superset factor into a data governance strategy for the business?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Superset used?
- dogfooding
- What are the most interesting, unexpected, or challenging lessons that you have learned from working on Superset and founding Preset?
- When is Superset the wrong choice?
- What do you have planned for the future of Superset and Preset?
Contact Info
- @mistercrunch on Twitter
- mistercrunch on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Superset
- Preset
- ASP (Active Server Pages)
- VBScript
- Data Warehouse Institute
- Ralph Kimball
- Bill Inmon
- Ubisoft
- Hadoop
- Tableau
- Looker
- The Future of Business Intelligence Is Open Source
- Supercharging Apache Superset
- Redash
- Metabase
- The Rise Of The Data Engineer
- Airbnb Data University
- Python DBAPI
- SQLAlchemy
- Druid
- SQL Common Table Expressions
- SQL Window Functions
- Data Warehouse Semantic Layer
- Amundsen
- Open Lineage
- Datakin
- Marquez
- Apache Arrow
- Apache Parquet
- DataHub
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
[00:00:57] Unknown:
Your host is Tobias Macey. And today I'm interviewing Max Beauchemin about Superset, an open source platform for data exploration, dashboards, and business intelligence. So, Max, can you start by introducing yourself? For sure. Well, first, thank you for having me on the show again. So that's 2 back to back in a short period of time. But, yeah, quick intros for people who missed the previous episode. My name is Max. I'm the original creator of the Apache Airflow and Apache Superset projects that I started in 2014, '15 while at Airbnb.
Since then, I've worked at, well, Airbnb, Lyft, Facebook before doing data engineering. So I've been really involved in the world of data engineering over time, since before data engineering was data engineering. Since then, I started a company called Preset that basically serves Superset as a service, you know, with all sorts of bells and whistles, and that's, like, completely hassle free. So we're very strong contributors in the Apache Superset community, and then we're strong operators too because we operate the software at scale on the cloud as a cloud service.
[00:02:04] Unknown:
Yeah. And as you mentioned, you've been on the show before, both this 1 and my other 1 about Python. So you were actually on, I think, episode 3 of this show back in 2017. So 1 of the first guests to get this show off the ground, talking about what even is data engineering. And then before that even, you were on Podcast.__init__ to talk about Airflow. And then recently, you were on there to talk about Superset. So for anybody who wants to get more of the backstory of the project and how it works under the hood, I'll put the link in the show notes for that. And so I was gonna say too, like, it's been a little while. It'd be really interesting to revisit some of these questions too. Like, what is data engineering? How has it changed over the past 5 years? There would be a lot to talk about.
[00:02:47] Unknown:
Save that for yet another episode.
[00:02:50] Unknown:
Absolutely. We'll have to have you back on to say what is data engineering now in 2021.
[00:02:56] Unknown:
Exactly. Even before the show, we were just chatting here, and we were talking about how we're doing data engineering as a startup now. So Preset, the company I started, is about 45, 50 people now, and we don't really have a data engineering team. And, like, it would be cool to talk about how a really small data driven company does data engineering, and what's the emerging stack there too. So I'm signing myself up for many more shows here.
[00:03:21] Unknown:
Alright. We'll have you on every week for the rest of the year. And so for people who haven't listened to any of the other episodes that you've been on, can you give a bit of a refresher about how it is that you got involved in the area of data management?
[00:03:34] Unknown:
So I started my career very early. I did 1 year of web development super early on with, like, you know, tools that people would not even recognize the name of anymore. So I think I was doing ASP, Active Server Pages, before, you know, .NET was even a thing. And there was a time where we'd write, like, VBScript on the client side, not even JavaScript. That was possible to do back then. So I did a tiny bit of web development and then jumped into an early data career at a time when, you know, data warehousing existed. There were resources like the Data Warehouse Institute, which was a big thing.
Ralph Kimball and Bill Inmon were kind of the grandfathers now of data warehousing. So there was, like, some literature, some best practices, but it was a very different world. Back then, the term data engineering did not really exist. Tools were mostly drag and drop. And that's when I got involved in it, working for a company called Ubisoft, a video game company, doing all sorts of data management for them. More on the, you know, financial side, so we were doing a lot of supply chain and retail and then a lot of financial type data too. So general ledger, payables, accounts receivable. So very different from your kind of growth analytics.
A little bit later on, I joined Yahoo, where it was really the rise of big data and kind of the birth of Hadoop at the time. That was, like, you know, heavily financed, or sponsored is probably the right term, by Yahoo at the time, and I went on to Facebook
[00:05:05] Unknown:
and the rest is history. And I went on to Airbnb, where I started working on open source. For the people who haven't listened to the interview that we did on the Python podcast about Superset, can you give a bit more of an overview of what Superset is and the main use cases that it's being used for, particularly as a data engineer? For sure. Yeah. So Superset is an open source data exploration and visualization
[00:05:28] Unknown:
platform. We're very much in the business intelligence and data visualization space. So if you're familiar with tools like Tableau and Looker and Data Studio, right, like, we're very much in that space, but as a strong open source kinda counterpart in that area. So if, you know, data sovereignty is important to you, or if having a set of tools or a platform that's a little bit more modern and probably more extensible and flexible matters to you, you know, I think that Superset is a great choice now for all the reasons related to the fact that open source is great. Right? It's a better way to write software. It's a better way to distribute software. That makes for software that is more malleable, that people can turn into exactly what they need it to be. So really happy, and it's kind of a life goal for me, to get a very strong open source solution
[00:06:21] Unknown:
in the space that's a strong contender that can really compete with the more proprietary tools in the space. Digging a bit more into sort of the broader ecosystem around Superset, as you mentioned, there are a number of tools, both proprietary and open source. And I've been seeing increased references to Superset as being kind of the default for people who are building out a new stack, particularly around the open source data ecosystem, where, you know, it might be S3 with Trino and Superset or, you know, S3 with BigQuery and Superset. And I'm wondering what you see as being the motivating factors for Superset gaining so much popularity in this space when there are other players such as Redash or Metabase on the open source side and Looker and Tableau and Data Studio, and I don't even know how many other options, on the proprietary side. It's a very competitive space, you know, that has a lot of history. So I think, like, people have had business intelligence budgets for the past, like, 2 decades. So it's a well known
[00:07:22] Unknown:
space that's fairly mature at this time. So I think there are kinda 2 paths to explore the advantages of something like Superset in the space. 1 is open source and everything that comes from open source. So it's really clear to me that over time, we've made a case for, and we've had, like, kind of global acceptance around, why open source matters. As a matter of fact, there's a blog post on the topic that we published a few weeks back that's called The Future of Business Intelligence Is Open Source, where I explore, you know, why is open source winning in general, and how does that translate into the space that we're in.
Right? So I tell a bit of the story of Superset and how it came to be, and then talk about things like extendability and integratability, like just being able to integrate Superset with whatever you use because it's open source, right, and being able to customize it to fit your need. On that topic, there was a really good blog post out of Airbnb that's called something like Supercharging Apache Superset, about powering their more specific use cases. Right? It's the story of how they customized Superset to make it really work for them internally. Outside of that, you know, because we're open source, we have the power of the community.
I think other tools like Tableau and Looker have some sort of, like, user community too, where, you know, people meet and exchange. But I think with open source, that's such a strength, right, and such a natural thing to have the community, and that enables us to have, like, really good documentation, really good example use cases, just a very dynamic community that's really supporting the project and the rest of the people in the community that are using Superset. 1 big thing about open source that's clearly a theme is, like, avoiding lock in. Right? And I call this data sovereignty. So it's the idea of, like, really owning, being able to customize, change, get the help that you need, select a vendor if you want 1, and, you know, having the ability to take over as needed. Right? So if something happens with a vendor that you might have in the space, you're never locked in. You can always go and decide to take over the software and run it on your own. So I think that's a really important guarantee. Companies like Tableau and Looker got acquired. You know, there are more examples in this space. There's been a fair amount of consolidation in this space.
Chartio recently got acquihired, so that leaves the people that use Chartio today, like, having to off ramp in the next year. So for these people, I think, to tell them, like, hey, you've been burned once, like, maybe this time around, you might wanna select something that's open source. So that's covering the open source aspect. The other aspect is probably, like, the convenience of cloud native and SaaS, and, you know, I would say on that front, there's a lot of vendor tools in the space that have more kind of previous generation architectures, that are more kind of desktop based, or that don't work as well with the modern data stack. And I think that's a strong differentiator for Superset.
And then there's this idea of, like, you know, the convenience of a cloud based tool, and I think, like, we've seen that in other spaces than business intelligence and data visualization. You think about things like, on the data warehousing side, people love the convenience of BigQuery and Snowflake because they're serverless, and it's really software as a service where you don't need to have a DBA, if that term even still exists. Right? Or you don't need, like, anyone to be on call for the database because it's a service. So I think, like, in open source now, we see the rise of commercial open source vendors that can be really good partners in getting the best of both worlds, in terms of, like, getting the guarantees that come from open source I was talking about, plus the convenience of having a service without the lock in.
[00:11:27] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting. It often takes hours or days. DataFold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. DataFold also helps automate regression testing of ETL code with its data diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. DataFold integrates with all major data warehouses as well as frameworks such as Airflow and DBT and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of DataFold. Once you sign up and create an alert in DataFold for your company data, they'll send you a cool water flask. Another interesting evolution of the overall space of data engineering and data management is sort of where the responsibility lies for tools like Superset. Because on the 1 hand, it is kind of the responsibility of data engineers to be able to provide data in a manner that's clean and accessible for other people to be able to do analytics. And on the other hand, there are roles such as analytics engineers and data scientists that are becoming more savvy about some of these infrastructure aspects, where they might be doing all the analysis in Jupyter or, you know, the analytics engineers might be handling kind of the transformations using dbt or something.
And where have you typically seen the responsibility for getting something like Superset deployed and managed? And then where is the dividing line between the kind of data platform and data engineer providing the tools? And then, you know, do you see them moving into actually building the dashboards as well, or does that fall
[00:13:26] Unknown:
somewhere else within the kind of roles of responsibilities in the data team? That's a good and complex question too. I think it relates to what I would call, like, the data maturity life cycle of a company. So again, depending where you're at on your data team or, like, data literacy journey and culture as a company, this question would probably be answered in different ways. In terms of, like, from the infra standpoint, who is in charge of operating the tool, or even, like, selecting and then providing service and being on call for the tool? I remember in The Rise of the Data Engineer, which is a blog post I wrote a long time ago, that probably was before our first interview, that might have triggered my first interview, I don't know, I forgot, I talk about how in smaller organizations, data engineering also, you know, supports the function of data infrastructure management. So if you have, like, 2 data engineers in your organization, they probably are in charge of either selecting the vendors or selecting the software that's being used and making sure that software runs, and they probably are on call for when it goes down. In bigger organizations, when I was at Airbnb, for instance, right, there'd be a clear distinction between a data infra team and the data engineering team or function.
Right? And then, clearly, the data infra team is, like, very horizontal, in terms that it supports the organization as a whole, where with data engineering, there's a little bit more intention to kinda pull them vertically, right, to be a little bit closer to maybe certain product verticals in some cases. Then in terms of, like, who builds the reports and dashboards, I threw out the term data literacy a tiny bit before. So talking about this, I think more and more people coming from the information workers, right, so people who just work with a computer every day and, you know, answer emails, whether they're, you know, PMs or business users of different kinds, anyone in the organization, really, need to become more sophisticated with data, and we see these people using Superset and self serving there, whether it's consuming a dashboard or using a no code explorer inside of Superset. Right? They might not know SQL, but in Superset, they're still able to do a lot of, like, slicing and dicing using the drag and drop type interface.
So, generally, I would say that more and more people, like, regardless of their function in a modern organization, have to become more data literate and familiar with powerful tools like Superset. The vision for Superset is really to be across the whole consumption layer and to cater to the different levels of sophistication of users that might be more or less, like, data literate. Right? So you can imagine that within an organization, on the data consumption layer, you're gonna have people who are just gonna consume dashboards that have been built by others, people in their team, or their, like, analyst partner, or data engineer, whoever has built a dashboard that's relevant for them.
Then, you know, I think we see a lot of people being able to do a lot of slicing and dicing. There's a SQL IDE in Superset. So if you do speak SQL, and we see more and more people across different functions learning SQL nowadays, a good example of this is Airbnb has a program internally called Data University, to help people do better at their work. Right? The last 1 I would mention that's complementary on the consumption layer, and that goes up the ladder of sophistication, is the notebook. Right? We see the notebook as, if you can't do it in SQL because, I don't know, maybe there's a machine learning component to it, maybe there's some crazy wrangling that's better done, you know, in a data frame than in SQL,
we see notebooks as that last layer where it's less accessible, it's harder to use, but it's much more flexible and powerful.
[00:17:24] Unknown:
Another interesting aspect of something like Superset, because it's not just a static dashboard where here are some pretty charts and maybe there's a, you know, limited amount of being able to reorient the axes or something. It also has a full fledged data exploration layer. And I'm curious what you see as the opportunities for collaboration, and also the capabilities that Superset has for being able to version and fork and merge different queries.
[00:17:55] Unknown:
There's a fair amount to unpack here. The first 1 is, like, collaborative workflows on the data consumption layer. I think that's good to have. Collaborative tools are transformative in a lot of ways. Right? We've seen Google Docs versus, you know, Microsoft Word, you know, being a game changer for a lot of us. So we're definitely bringing a lot of this into Superset over time. There's direct collaboration on the document, and there's, like, social aspects too that are similar and intertwined. I would say, you know, they're in a similar space conceptually. But I think, like, the social aspect is something that the internal data tooling at Facebook was really great about. So in most of the data tools internally at Facebook, you would have kinda social aspects of, like, who's the owner, who's using this dataset, who has created this chart, who's consuming it, and then a way for people to kinda leave comments or annotations for other people. So that is an important asset that we have already in Superset, to a certain extent, and that we're pushing forward. Now there's a question of, like, do we wanna have 2 people working on the same query or the same dashboard at the same time and see their cursors move around?
And I think, at least from the SQL IDE perspective, if I'm writing SQL, I'm like, don't touch my SQL. Don't modify my SQL. Like, this is my thing. I can share my query with you, but I don't wanna be collaborating on SQL per se, like, in real time. But I do wanna share queries, which we allow, and we're gonna add a lot more around, like, the commenting type interface, social collaboration within the tool, which I think, like, is super important, and we're investing in that. I think, like, all tools nowadays really need to move in that direction. That's something that users expect.
[00:19:41] Unknown:
Because of the fact that there is this data exploration capability and you're handing that over to people who are trying to answer their own questions and maybe don't have the same level of training with distributed systems and query optimization as a data engineer or a data analyst. What are some of the potential challenges that somebody who's providing Superset as a service to the organization might run into, and some of the ways that they might avoid letting people end up in a suboptimal user experience because of performance issues with the queries, or trying to bring in too many different data sources together in 1 visualization, or something like that? There's a fine balance there, you know, between, like, giving people power, but also preventing them from shooting themselves in the foot.
[00:20:28] Unknown:
Generally, in software, I think 1 thing that people don't appreciate as much as they should is what it takes to harden a piece of software. It's pretty easy for anyone to write, you know, a competitor to something like Airflow, to be like, I'm gonna write a small scheduler and something that runs DAGs. But I think the real value of software, and of open source type workflows and that type of approach to development, is that we can massively harden a fast moving piece of software. And you'll see over time that software, you know, will build an immune system. If you think of software as an organism, it's like every time that there's a little cut or a little break or something that hurts, we think of a solution to prevent that in the future. And sometimes, from an architecture standpoint or from a, you know, pull request beauty standpoint, it's not always the most beautiful piece of code, it might be adding a retry decorator in the right or wrong place, but there's a tremendous amount of value that exists in that. Now, what does that mean for something like Superset? So 1 thing is, like, how do we prevent ad hoc users from destroying the underlying analytics database?
So luckily, I would say Superset is architected in a way that pushes, like, most of the heavy lifting to the underlying database. So, like, not our problem. We just push all the heavy lifting to the underlying database. You can go and, you know, destroy that through our powerful tools. So we do offer some ways to prevent some of that, and that's part of the immune system we've built. But I think the responsibility of, like, preventing harm to the underlying database is probably best handled there. So say you're using something like Presto, or, I forget, Trino or Presto, whichever 1 you picked in that space, I think Presto is probably receiving a mixed bag of queries, like, ad hoc ones and ones from Superset, but maybe also from a whole collection of other tools. So I think it's important for the database engine itself, or for the database, you know, organism, to defend itself pretty well.
We do have some hooks in Superset that enable people to kinda hook in and control a little bit more the flow of what's happening. I know at Lyft, I think we had some sort of, like, proxy in front that would do some routing and some rate limiting and some queuing, you know, to help Presto, as a Presto proxy. In Superset itself, we do have things like a query mutator and a connection mutator, where it's easy to add configuration inside Superset that will intercept certain pieces of SQL on the database connection on their way to the database. So you can add logic in there that could say something like, if there are a lot of queries running, don't send the query over and wait a little bit. Or you could say, run an estimation ahead of time, and if the estimation is bigger than a certain amount, like, you know, kill the query. So we have these configuration hooks that can be used to intercept queries and route things, and that's probably the extent of it. I think we have a feature behind a feature flag that is a button in SQL Lab that enables people to estimate the cost of a query prior to running it.
So that's something to dig out there in the feature flag framework,
[00:23:59] Unknown:
if you're curious about that. Yeah. Intercept the query that's trying to do a Cartesian join on every order for the past 30 years.
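The query mutator Max describes is wired up through Superset's Python configuration file. Here's a minimal sketch of the idea, intercepting each query to tag it for attribution so a warehouse-side log or proxy can rate limit per user. The hook name `SQL_QUERY_MUTATOR` is real, but its exact signature has varied across Superset versions, so treat the keyword arguments below as assumptions:

```python
# Sketch for superset_config.py: intercept SQL on its way to the database.
# NOTE: the SQL_QUERY_MUTATOR hook exists in Superset, but its exact
# signature has changed between versions -- the kwargs here are assumptions.

def sql_query_mutator(sql: str, **kwargs) -> str:
    """Prepend an attribution comment that the database's query log
    (or a proxy in front of Presto/Trino) can key on to rate-limit
    or kill runaway queries per user."""
    username = kwargs.get("username") or "unknown"
    return f"-- superset user: {username}\n{sql}"

# Point Superset at the function in superset_config.py:
SQL_QUERY_MUTATOR = sql_query_mutator
```

The same shape extends naturally to the queueing and cost-estimation ideas Max mentions: instead of only tagging the query, the mutator could consult queue depth or a cost estimate and raise an exception to refuse to forward the query at all.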
[00:24:06] Unknown:
Don't do it. But the reality is, like, even if we solve this problem in Superset, it's still a problem inside the other tools. You know, people use the CLI to query Presto, and then people will be like, okay, well, Superset doesn't wanna run my query. I'm gonna, you know, find a different way to attack and destroy the system. Right? So sometimes, as an operator of these tools, it feels like you have a barbarian horde of people that are trying to attack and destroy your software. I know, like, when I was operating Airflow at scale, sitting really close to the data infra team in different organizations, sometimes we'd look at things in incident management and be like, you were trying to destroy our infrastructure.
And then we dig a little bit deeper and then improve our immune system and come out stronger on the other side.
[00:25:01] Unknown:
And in terms of the different data sources, Superset supports a large and growing number of query engines and data storage systems. And I'm wondering what you have found to be some of the most challenging either specific systems or categories of systems to work with for something like Superset where you're trying to enable interactive data exploration and visualization.
[00:25:25] Unknown:
So I think, like, on our front, the way that we integrate with different database engines, which is, like, the huge kinda integration point for Superset and other tooling, is how we connect to different analytics database engines. Right? There's a bunch of other integration points, like, for the metadata database, for the caching layers, and there are multiple caching layers. But, like, going on to the analytics database integration, we do that through a Python standard that's called DB-API. It's a little bit shallow. It's not very prescriptive. It's a little bit loose. You know, it defines things like connections and cursors, and it's pretty high level. And then, on top of that, we use something called SQLAlchemy.
That's a popular library, you know, in the Python world. It's a SQL toolkit and an ORM built on top of that SQL toolkit, and we leverage the SQL toolkit quite a bit. And it knows how to speak different SQL dialects. So you can do something like write a query in a more object-oriented form, as in, like, select dot group by, you know, dot filter dot this dot that, and then it knows how to generate SQL in different SQL dialects. That's been challenging. SQLAlchemy is a little bit shallow in terms of what it defines and what it does not define. So it might know how to issue a limit or a row number limit on different engines, but it doesn't know how to truncate a date, for instance. The semantics around using that function in different database engines are just not specified by SQLAlchemy, so we need to manage that in our own Python package, where we add more semantics around date truncation and around, like, how to handle the cursors, and how to do things like getting the percentage of completion of a query. So if you run a 2 hour long query on Hive or a 20 minute long query on Presto, it's great to be able to show the user that you're at, you know, 39% and that the query is still running.
So we have a layer there. If you look at the Superset code base, it's under a package called db_engine_specs, for all the things that are specific to database engines. So it's this kind of compatibility layer, and it's been a struggle to work with this vast array of drivers with different levels of maturity. And we've written our own, and we've contributed back to, like, the BigQuery client and the SQLAlchemy dialect for it. We've written the Elasticsearch one, the Druid one, a Google Sheets one that's a little bit more fuzzy and complex, which probably brings me to the second part of the answer, which is that one of the challenges has been that some databases don't speak very good SQL at all, or are learning SQL over time. Look at our journey with Apache Druid, which didn't speak SQL at first, and has been learning SQL, and has pretty good SQL support now.
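To make the gap concrete, here is a toy sketch in the spirit of that per-engine compatibility layer, not Superset's actual code: SQLAlchemy's dialects stop short of date truncation, so a tool keeps its own template per engine and fills in the column and time grain. The engine names are real, but the template map and function are illustrative only.

```python
# SQLAlchemy doesn't specify date truncation, so each engine gets its
# own SQL template (illustrative; Superset's real logic lives in its
# db_engine_specs package and covers many more grains and engines).
DATE_TRUNC_TEMPLATES = {
    "postgresql": "DATE_TRUNC('{grain}', {col})",
    "bigquery": "TIMESTAMP_TRUNC({col}, {grain_upper})",
    "mysql": "DATE_FORMAT({col}, '%Y-%m-01')",  # month grain only
}

def truncate_expr(engine: str, col: str, grain: str = "month") -> str:
    """Render an engine-specific date-truncation SQL fragment."""
    template = DATE_TRUNC_TEMPLATES[engine]
    return template.format(col=col, grain=grain, grain_upper=grain.upper())

print(truncate_expr("postgresql", "order_ts"))  # DATE_TRUNC('month', order_ts)
print(truncate_expr("bigquery", "order_ts"))    # TIMESTAMP_TRUNC(order_ts, MONTH)
```

The same pattern extends to the other per-engine concerns mentioned here, like cursor handling and progress polling: a dict or class per engine, keyed by dialect name.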
So for instance, some engines don't support subqueries. There's all sorts of things that they may or may not support. Right? It could be window functions, or it could be common table expressions, CTEs. An example of what we do: sometimes you wanna plot a certain number of time series, but if you group by, I don't know, customers, you might have, like, 10,000 customers. You don't wanna bring back 10,000 trend lines. So we'll run a subquery to say, what are the top 10 customers based on this metric, and then do some sort of subquery join to do this. And for databases where we cannot do a correlated join with a subquery, by the way, it's great to have a technical audience because I can get into a little bit more detail here, what we do is a two-phase query. So the mechanics of Superset are: I'm gonna run a first query to retrieve results. Based on those results, I'm gonna assemble a second SQLAlchemy query that will be translated to the right dialect, and then get the top N time series, with the time detail, for those top N customers in that case. Right?
So a fair amount of complexity there. And think about unit testing this stuff and guaranteeing, you know, that you're not gonna see too many regressions in this area as the drivers evolve and as the dialects evolve. It's been challenging, but someone's gotta do it. So we're definitely at the forefront of that. We've talked about refactoring this, maybe as a contribution to SQLAlchemy or as its own package that would be a little bit more generic, that other people could use as this slightly more involved abstraction to be able to talk to multiple engines.
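The two-phase approach can be shown with a small runnable sketch. It uses stdlib sqlite3 as a stand-in for an engine without correlated-subquery support; the table, columns, and data are made up, and Superset's real implementation assembles the second query with SQLAlchemy in the target dialect rather than with string formatting.

```python
# Phase 1: find the top-N customers by a metric.
# Phase 2: build a second query restricted to just those customers.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (customer TEXT, day TEXT, revenue REAL);
    INSERT INTO sales VALUES
        ('acme', '2021-01-01', 100), ('acme', '2021-01-02', 150),
        ('zeta', '2021-01-01', 10),  ('zeta', '2021-01-02', 20),
        ('bolt', '2021-01-01', 80),  ('bolt', '2021-01-02', 90);
""")

# Phase 1: top-N customers by total revenue.
top_n = 2
top = [row[0] for row in conn.execute(
    "SELECT customer FROM sales GROUP BY customer "
    "ORDER BY SUM(revenue) DESC LIMIT ?", (top_n,))]

# Phase 2: fetch the daily time series only for those customers,
# using bound parameters rather than interpolating values.
placeholders = ", ".join("?" for _ in top)
series = conn.execute(
    f"SELECT customer, day, SUM(revenue) FROM sales "
    f"WHERE customer IN ({placeholders}) "
    f"GROUP BY customer, day ORDER BY customer, day", top).fetchall()

print(top)  # ['acme', 'bolt']
```

On an engine that does support subqueries, the two phases collapse into one query with a `WHERE customer IN (SELECT ...)` clause, which is why the capability matters for how the SQL gets generated.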
[00:30:10] Unknown:
Yeah, that was gonna be one of my questions, whether this is available for the broader Python community to use in isolation from Superset, so you already addressed that. Another interesting aspect of what you're saying, with having to remap some of these subquery joins or common table expressions for engines that don't support them, is that you also get into some of the complications of transactional isolation and the different semantics around how databases manage that. You know, are you seeing the same state of the data between those two query executions, and do they support transactions across multiple queries? There's a lot. Like, I wish we had a newer version of DB-API that would be more prescriptive.
[00:30:47] Unknown:
There's all sorts of challenges too around exceptions that are returned from the different engines. They're not all gonna inherit from the same base class of exceptions either. I don't think that DB-API is prescriptive on that, or it has some specification, but it's implemented in different ways. Right? A lot of people go outside of the spec. So it shows the importance of a standard and of a spec, but, you know, the challenge there too is that if you have a spec that's too prescriptive, then people don't wanna implement it. So maybe from that perspective, DB-API is a decent compromise.
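A minimal sketch of the exception problem, assuming nothing beyond the stdlib: because drivers don't reliably share a common exception base class, a tool has to be told which exception types each driver raises and normalize them itself. The `QueryError` type and `run_query` helper below are hypothetical, not Superset's API; sqlite3 just plays the role of one driver among many.

```python
import sqlite3

class QueryError(Exception):
    """One normalized error type the rest of the tool can rely on."""

def run_query(cursor, sql, driver_errors=(Exception,)):
    """Execute sql, translating driver-specific errors into QueryError."""
    try:
        cursor.execute(sql)
        return cursor.fetchall()
    except driver_errors as exc:
        # Re-raise as the one type upper layers know how to handle.
        raise QueryError(str(exc)) from exc

conn = sqlite3.connect(":memory:")
try:
    # Each driver needs its own tuple of error classes passed in,
    # because there is no shared base class across drivers.
    run_query(conn.cursor(), "SELECT * FROM missing_table", (sqlite3.Error,))
except QueryError as exc:
    print("normalized:", exc)  # normalized: no such table: missing_table
```

Multiply that per-driver error tuple across a dozen engines and it becomes clear why a stricter DB-API exception hierarchy would help.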
An indication of that is how mature the connectivity to database engines is in Python as opposed to other languages. I think Python is one of the languages that does best in this area, not necessarily in terms of rigor, but in terms of, like, things that generally work. Right? On the Java or JVM side, you might have more rigorous drivers, but fewer implementations, a less diverse set of implementations.
[00:31:52] Unknown:
In terms of the use of dashboarding tools within an organization, traditionally they've been used for "just show me the data, I'll make my own decisions based on that." But as organizations grow in the maturity of their use of the data stack, and the data stack itself continues to grow in its capabilities and in the sophistication of the people operating it, there have been growing trends in how the dashboard is used. Rather than just "present the data to me," there's been a trend towards decision analytics rather than just business analytics: show me what the data is and then what I'm supposed to do with it. So, you know, maybe I see this trend line and then I click a button. Okay.
My stock is declining at a certain rate, I need to order more widgets right now, and that action is embedded right into the dashboard. And I'm curious what you have seen, as the creator and user of Superset, in how people are able to implement that style of analytics and that evolution in the sophistication of how analytics are presented to end users.
[00:33:00] Unknown:
The big underlying trend, right, is democratization of access to data and then up-leveling data literacy in the organization. The model in the past was that data was a specialty. Right? Traditionally, you might have a small data team, or a data warehouse architect, or a group of data warehouse architects, in charge of being the librarians of the data within the organization and structuring and organizing the data from different sources into a data warehouse. And then you would have the business intelligence engineers that would install things like BusinessObjects and MicroStrategy, and they would create abstraction layers in what we called back then the semantic layer. I know the audience might not be familiar with that term, but it's an abstraction layer that enables people to self-serve a little bit more on top of the data warehouse. So traditionally, we'd see a lot of that. Right? The tooling was oriented that way, and you needed things to make it into the warehouse and into the semantic layer so that people could kind of self-serve. They could order things from the menu at the restaurant, to use an imperfect analogy. And I think we've seen people move towards much more of a buffet slash open kitchen, where it's a big free-for-all, people are able to consume ingredients that are more or less processed, there's a wider variety, and people are welcome to go into the back room a lot more. Where maybe before the chef was like, hey, I don't want any customers in the kitchen, right, they'd keep the customers out of the kitchen, now it's a big party, everybody's invited. Anyone who has the skills can write SQL or write Airflow jobs or create a dashboard.
So if I take that trend and translate it into what we see in terms of Superset usage patterns, or dashboard usage patterns: before, we would have specialists creating dashboards for executives. You know, it might take the specialist team months to create a dashboard that would have a life cycle of many years. Right? Like the CEO dashboard that's been built by a group of specialists, and this dashboard will be pretty much, you know, more or less the same over the next few years. The trend that we're seeing is an acceleration of how many dashboards are created, and a shorter life cycle for dashboards. If your team is working on a particular product feature or set of features, or an analyst team is focusing, say, on a topic like the effect of COVID on, you know, listings at Airbnb or something like that, they're probably gonna build a dashboard for that and accumulate a set of learnings.
And as they tackle that area and build the knowledge and the understanding and make data-informed decisions to change the product, this dashboard will probably go unused over time. And then they'll go to work on whatever the next hot topic is for that team or that smaller group of people, and probably create a new dashboard. So definitely a trend of more people building things that are more short-lived and aligned with more operational use cases too. And by that, I mean dashboards that support the daily work or the weekly work of a particular person or team.
And that's where I think more value is typically delivered, right, because you have something that's more targeted, that supports the process more clearly or more deeply. It's just more fitted to a smaller group of people. So that's clearly a big trend in analytics.
[00:36:34] Unknown:
The other big trend that I've been seeing in analytics and dashboarding is starting to integrate some of the data quality indicators into the dashboard, so that you can see as a consumer of the dashboard, or somebody who's doing some exploration, this is the last time that the data source was refreshed. This is some of the lineage information.
These are the data quality checks that were executed. You can see that they all passed or, you know, maybe one of them failed, but it's not critical. So you can understand, as you're doing these explorations or viewing the dashboard, how many grains of salt you wanna take it with.
[00:37:42] Unknown:
That's, you know, one underlying trend there. So there's a clear need for that context. Right? At all three previous companies, I was sitting really close to the teams that were working on, call it, metadata discovery or metadata management, data dictionaries, and that's just a huge area. At Lyft, I was sitting right next to the team working on Amundsen, which is an open source project in this space of data catalogs and metadata discovery. While I was at Airbnb, I was sitting really close to the Dataportal team. Dataportal is not open source, but they have published a lot of their experience and a lot of very interesting and influential information about it. So I encourage people to go and dig in, Google "Dataportal Airbnb," and they're gonna find some interesting information in the space. And similarly, my team at Facebook owned something called iData, and we were really close to a team working on a metadata graph. So I've been really close to that space in multiple environments, and I think it's validated that we really need more of this type of tooling and context.
Some pointers here: there are some emerging standards in the space, and I'm really excited that we see a few standards I wanna mention. So one is OpenLineage. That's an open source project that's really interesting. It was started, I believe, by Julien Le Dem, who is the original creator of Parquet and, I believe, very involved in Arrow. He's just a prolific, amazing person. And I think around this project we start to see a consensus from the different actors in the space. Right? So for us, we raised our hand and said, hey, if there's a standard, we wanna be part of it, we wanna implement it from the Superset project standpoint. Airflow and Dagster, and projects like DataHub and Amundsen, are definitely also raising their hands to say, if there is an emerging standard, we absolutely wanna be at the forefront of implementing it.
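To make the idea of a lineage standard concrete, here is a hedged sketch of the general shape of an OpenLineage-style run event: a run of a job, reading some datasets and writing others. Field names follow the published spec only at a high level, and the namespaces, job name, and dataset names are invented, so check the current OpenLineage schema before emitting real events.

```python
# A minimal lineage event: which run of which job read and wrote what.
from datetime import datetime, timezone
from uuid import uuid4

def lineage_event(job_name, inputs, outputs, state="COMPLETE"):
    """Build a simplified OpenLineage-style run event as a dict."""
    return {
        "eventType": state,  # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid4())},
        "job": {"namespace": "analytics", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": n} for n in inputs],
        "outputs": [{"namespace": "warehouse", "name": n} for n in outputs],
    }

event = lineage_event("daily_revenue", ["raw.orders"], ["mart.revenue"])
print(event["job"]["name"])  # daily_revenue
```

The value of a shared schema like this is exactly what's described above: Airflow can emit it, Marquez can store it, and tools like Superset or Amundsen can consume it without pairwise integrations.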
Another pointer is something called Marquez. If OpenLineage is the standard and the spec, then Marquez is going to be the reference implementation for it, and it's a pretty sweet project that I would encourage all data engineers to check out, especially as it takes shape. What does that mean for Superset? Well, first thing is, as these standards emerge, we want to implement them. And even in the absence of standards, we wanna integrate with the actors in the space, really. Right? Amundsen and DataHub and whoever else. So we're at the forefront of wanting to implement that, and we have some hooks already. One thing more on the data quality side: in Superset, we have the ability to semantically certify datasets, and/or metrics within those datasets. You can do that through a REST API or you can do it in the UI, which adds a little medal, a little stamp, by the dataset that says this has been certified by this person in this context. So people navigating the tool can get a sense for which sets of objects they can proceed with confidently, and with which objects maybe they need to be a little bit careful: you know, this is not a certified dataset.
So take that. And I think it's more of a positive signal in general, just to say this is a certified dataset, so you know you can play with it, and the metrics and dimensions that are gonna come out of it should be trusted. So, you know, really looking forward to the emerging standards, implementing them, and providing a lot more context within Superset. So that when you're in a SQL IDE, you should be able to get the lineage context, and you should be able to navigate to something like an Amundsen and get even more context, so you can really understand everything you're doing. Who's the owner? When is this dataset, you know, typically landing every day?
Who else is using this dataset? Where's the logic for it? So if I need that for reference, I can get access to it very easily. I think that will be a trend in the tooling, where we're gonna see the tools talk to one another and exchange metadata in a much better way than they do right now. Yeah. Definitely excited for a lot of the developments there, and OpenLineage in particular.
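As a rough sketch of the REST path for certification mentioned above: Superset stores certification details in a dataset's `extra` JSON blob, so a client can build a payload like the one below and send it with a dataset update request. The exact endpoint path, auth flow, and field names vary by Superset version and should be checked against its API docs; the team name and details here are invented.

```python
import json

def certification_payload(certified_by: str, details: str) -> dict:
    """Build a body for a dataset update (e.g. PUT /api/v1/dataset/{id})
    marking the dataset as certified via its 'extra' JSON blob."""
    return {
        "extra": json.dumps({
            "certification": {
                "certified_by": certified_by,
                "details": details,
            }
        })
    }

payload = certification_payload("Data Platform Team", "Core revenue metrics")
print(payload["extra"])
```

In the UI, that metadata is what drives the little certification stamp next to the dataset, so the API route is mostly useful for certifying datasets programmatically, say from a data quality pipeline.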
[00:41:57] Unknown:
And another interesting element of the data quality and data lineage aspects, with the dashboard being probably the first place that people are going to be interacting with the data, is how it factors into the overall governance strategy. How does Superset act as a facilitator of the data governance strategy?
[00:42:16] Unknown:
The term data governance is kinda overloaded too, so we should try to define this. I remember back in the day, we had data stewardship and data governance as two different things. There's the aspect of data access policy, of having people who can define who can access what. And then I guess there's another aspect that's just, like, the certification and validation: what are all these data objects, who owns them, and, you know, are they reliable
[00:42:48] Unknown:
or not? So when you ask about data governance, I'm curious what you mean exactly. Yeah, data governance has definitely become overloaded, and I tend to use it in the broadest form because that's its original intent. It's been narrowed in scope in a few contexts to just mean, you know, access control and privacy management and regulatory compliance. But I think you kind of answered broadly that Superset is useful in this context because it does have access to some of the lineage information and the data quality checks. And because the security layer is extensible, you can enforce some of the access control and, you know, privacy management, to do things like data masking before the data gets shown in the UI, things like that. Yeah. So Superset definitely, you know, enables people to have very complex data access policies.
[00:43:34] Unknown:
If they are able to express it, you know, on a piece of paper, they should be able to implement it and enforce it inside Superset. I think a lot of the challenges in this space are organizational. In any given organization, if you ask who should have access to what data, it sounds easy at first, but it often reveals itself to be an infinitely complex question that no one really owns, and it ends up with kinda half-assed solutions pretty often. People are like, okay, well, we explored this, it's extremely complex, let's step back and say there's gonna be three levels of access here. So Superset definitely enables a lot of this.
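One toy way to picture access policy enforcement in a BI layer, assuming made-up roles and a deliberately naive SQL rewrite: before a user's query runs, the tool appends a predicate derived from their role. Superset's actual row-level security mechanism is richer and operates on its query model rather than on raw SQL strings; this only illustrates the shape of the idea.

```python
# Map each role to an extra predicate enforced on queries (illustrative).
POLICIES = {
    "emea_analyst": "region = 'EMEA'",
    "admin": None,  # no restriction
}

def apply_policy(sql: str, role: str) -> str:
    """Naively append the role's predicate to a SELECT statement."""
    predicate = POLICIES.get(role)
    if predicate is None:
        return sql
    # If the query already has a WHERE clause, AND the predicate on.
    joiner = " AND " if " where " in sql.lower() else " WHERE "
    return sql + joiner + predicate

print(apply_policy("SELECT * FROM sales", "emea_analyst"))
# SELECT * FROM sales WHERE region = 'EMEA'
```

A real implementation has to work on a parsed query rather than string concatenation (subqueries, aliases, and comments all break this naive version), which is part of why expressing a policy on paper is easier than enforcing it.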
There's an inherent challenge around constraints and guarantees there too. If you have a lot of process and requirements and constraints around governance, that fundamentally plays against some of the data democratization aspects of things. So there are different forces at play. One is the chaos of enabling a lot of people to do what they want, and then there's making sure that, you know, they don't shoot themselves in the foot and that we get to cohesive answers. Democratization was the general trend from 2010 to '15, '16, and then, you know, with GDPR, I think that was, like, 2017.
There's a little bit of a counterforce there too, of, like, hey, we need to know who's accessing what, and we need to be able to audit this and control certain things too. So multiple forces at play. I think organizations should be able to define their policies and enforce them in tooling like Superset too. We don't want to be too opinionated on that, and we want to enable people to just define their policies and
[00:45:26] Unknown:
implement them in our tooling. For anybody who wants to learn more about Superset and follow along with you and the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, what do you see as the biggest gap in the tooling or technology for data management today? I think it's the metadata integration we talked about, so having a better metadata
[00:45:46] Unknown:
exchange, where you can really weave the different tool sets together and preserve context. So that means it should be easy to navigate across things like Amundsen, Airflow, and Superset, and, you know, to consume all of that business metadata, operational metadata, and lineage metadata in a way that it feels like a set of tools that works very well together.
[00:46:12] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing on Superset and some of the ways that it can be used in data organizations to help satisfy their needs. Definitely a very interesting tool, and one that I plan to take advantage of myself for my own work. So I appreciate all the time and effort you've put into that, and I hope you enjoy the rest of your day. Excellent. Thank you so much. It was a pleasure to be on the show again. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Maxime Beauchemin and Superset
Max's Journey in Data Management
Overview of Superset
Superset's Popularity and Open Source Advantages
Responsibilities in Data Management and Superset Deployment
Collaboration and Data Exploration in Superset
Challenges in Providing Superset as a Service
Integrating Various Data Sources with Superset
Evolution of Dashboard Usage and Decision Analytics
Data Quality Indicators and Governance in Superset