Summary
SQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky and inefficient. Frustrated with the lack of a modern IDE and collaborative workflow for managing the SQL queries and analysis of their big data environments, the team at Pinterest created Querybook. In this episode Justin Mejorada-Pier and Charlie Gu share the story of how the initial prototype for a data catalog ended up as one of their most widely used interfaces to their analytical data. They also discuss the unique combination of features that it offers, how it is implemented, and the path to releasing it as open source. Querybook is an impressive and unique piece of technology that is well worth exploring, so listen and try it out today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Justin Mejorada-Pier and Charlie Gu about Querybook, an open source IDE for your big data projects
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Querybook is and the story behind it?
- What are the main use cases or workflows that Querybook is designed for?
- What are the shortcomings of dashboarding/BI tools that make something like Querybook necessary?
- The tag line calls out the fact that Querybook is an IDE for "big data". What are the manifestations of that focus in the feature set and user experience?
- Who are the target users of Querybook and how does that inform the feature priorities and user experience?
- Can you describe how Querybook is architected?
- How have the goals and design changed or evolved since you first began working on it?
- What were some of the assumptions or design choices that you had to unwind in the process of open sourcing it?
- What is the workflow for someone building a DataDoc with Querybook?
- What is the experience of working as a collaborator on an analysis?
- How do you handle lifecycle management of query results?
- What are your thoughts on the potential for extending Querybook beyond SQL-oriented analysis and integrating something like Jupyter kernels?
- What are the most interesting, innovative, or unexpected ways that you have seen Querybook used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Querybook?
- When is Querybook the wrong choice?
- What do you have planned for the future of Querybook?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Querybook
- University of Waterloo
- Superset
- Sequel Pro
- Presto
- Trino
- Flask
- uWSGI
- Celery
- Redis
- SocketIO
- Elasticsearch
- Amundsen
- Apache Atlas
- DataHub
- Okta
- LDAP (Lightweight Directory Access Protocol)
- Grand Rounds
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because the number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan, that's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Justin Mejorada-Pier and Charlie Gu about Querybook, an open source IDE for your big data projects. So, Justin, can you start by introducing yourself? Sure, Tobias.
[00:02:11] Unknown:
So I'm Justin, the engineering manager for the analytics platform team at Pinterest. I've been at Pinterest for six and a half years now, most of this time working in data, building internal data tools for the rest of the Pinterest employees. And we have several tools like this, and one of them is Querybook, which we're talking about today. That's the most used one. And, Charlie, how about yourself? Hi. I'm Charlie. I have been at Pinterest for
[00:02:37] Unknown:
four years now. And after I graduated from the University of Waterloo, I have stayed on the analytics platform team for the past four years, mostly focused on building internal data tools for Pinterest employees.
[00:02:50] Unknown:
And going back to you, Justin, do you remember how you first got involved in data management? I moved to the data team shortly after joining Pinterest.
[00:02:57] Unknown:
I did some Q&A demos that were relevant to the team. In the data team, there were several areas where we basically supported our end customers. For example, A/B experiment management, like lifecycle management for our A/B experiments, or charting and dashboarding tools. And then one of these areas was data management. The first time that I worked on that as a product was with Querybook. So maybe I can talk more about it once we talk about how Querybook got started. That was mostly it. Like, it was one of the charters the team supported. And, Charlie, how did you first get involved in data management?
[00:03:37] Unknown:
So after I joined the team, I was working mostly on the experimentation and dashboarding features for our internal tools. And then Justin mentioned this opportunity to build, like, this querying tool, which is now Querybook. And that's how I got started.
[00:03:53] Unknown:
And so in terms of the actual project itself, can you give a bit of an overview about what it is that you've built with Querybook and some of the story behind what motivated you to create this tool internally versus picking something up off the shelf?
[00:04:05] Unknown:
So we started Querybook around 2017. At that point in time, it started as an intern project. We had an external vendor solution; however, it had some things that we would have liked done differently, or that we wanted to improve, or that we wanted to connect better with the Pinterest ecosystem. So just as an exploration project, we basically asked our intern to try to build a small data cataloging application that could better integrate with our systems at Pinterest. By the end of the internship, the intern, in this case Lauren, was able to finish not just a first version of this data cataloging feature, but also to add data querying functionality to it. So it ended up going, like, faster than what we expected.
That gave us a bit of, like, pause to say, hey, maybe we should explore this area a bit more. Charlie joined this project. He rebuilt most of the architecture, because now it was no longer, like, an MVP, but rather more like a full-blown exploration. We started thinking, like, hey, should we continue down this path? Should we keep using the solution from the external vendor, or should we go with an open source solution? So at that point in time, we explored several solutions. Back in the day, Superset was, I think, at that point in time, called Caravel. Along with other options, both open source and commercial.
We found very good ones, but we did not find exactly what we were looking for. So for example, we found there were several data cataloging solutions, which handled, like, the metadata of these tables and all of that. And we found several charting and dashboarding solutions for slicing and dicing, like Superset. And all of them were very good at what they do. However, we did not find a solution that combined both things, basically, that combined data cataloging with data querying and dashboarding. Also, not a solution that allowed for easy narratives to be created around data, because that was our main focus of what we wanted to enable. Basically, we wanted to enable people to be able to ask a question like, hey, why is this incident happening? Or why is, for example, the number of users going up or down in this given country? Run an analysis easily without having to know Python to run it, for example, if they were using Jupyter. Document this and easily share it with other people. That's why we went with the narrative or, kinda like, notebook-like UI.
So based on this initial evaluation, we thought it was worth exploring more, basically giving ourselves six more months to build a more fleshed-out prototype. And then based on the initial customer response, I mean, of course, there were a lot of rough edges back then, but we decided to actually pursue it further. The user response was that they preferred it much more than the alternative. Then we just, like, replaced our existing solution with the new one. From that point, it was mostly just taking user feedback, iterating a lot, fixing a lot of things, a lot of processes, and just trying to make the product a bit better over time.
[00:07:09] Unknown:
And in terms of the main use cases that you're using it for, and how it compares to the sort of traditional business intelligence dashboard, I'm wondering if you can just talk through some of the overlap between Querybook and some of the adjacent types of tooling that are available. So you mentioned Superset, but also, because it's a notebook-style interface, it brings to mind things like Jupyter. And just the use cases that Querybook is uniquely well suited for, and sort of where it fits in the overall ecosystem and use cases at Pinterest?
[00:07:45] Unknown:
The main gap that we wanna focus on is being able to allow engineers, data analysts, product scientists, even PMs, to be able to quickly answer the questions that they may have through data, and then share this. Basically, what that means is, normally, in many of these other tools, queries are handled, like, as independent entities, or maybe grouped in a folder. It's just, like, the place where you write a SQL query, like a Hive or Presto query and so on. You get results, and that's it. And then people go on into Google Docs and then create the narrative around those, and then they link to many of these different queries and entities or charts or dashboards.
However, this is, like, a tedious process, because you have to then copy-paste a lot or provide links here and there, which is done through different applications. So what we wanted to enable is to kinda, like, decrease the friction. Basically, most people will just end their analysis in Querybook, just adding a couple paragraphs here and there of explanation, then sharing it very easily with their teammates, even collaborating, now with built-in real-time collaboration, directly on the DataDoc, which is, like, a very different experience from any of the other tools that may exist right now. Obviously, we wanna enable, like, fast analysis for people. And then they can just, like, share it, and that's it, and they can move on to their daily work or the other things that they may have to be doing. If they want more in-depth slicing and dicing of the data, maybe using Python, and they also happen to know Python, and they also happen to know Pandas, like, all of the data science toolkit, then they can definitely click on, like, say, export to Jupyter, because we also have, like, a Jupyter installation at Pinterest where they can follow up on that process. And then they can just, like, do the real, like, heavy-duty analysis through Python if they prefer that over doing it directly in Hive or directly in Presto.
That's something that people do, definitely. But we want to enable that, hey, if you can just answer your question within Querybook, then you should just go ahead,
[00:09:51] Unknown:
not have to spend that much time on the other things. So that's kinda, like, how it interacts with our other tools that we have or that exist out there. And when you do export it to Jupyter, I'm curious, what's actually contained in that export? Is it just a link to the sort of rendered datasets so that you can import those into the Jupyter notebook, or sort of what does the interchange format look like? And is there the possibility of sort of going back and forth between Jupyter and Querybook, where maybe you go to Jupyter, do some additional analysis, and then export a new dataset that you then bring back into the original DataDoc?
[00:10:24] Unknown:
Sure. So, basically, like, how it currently works is that once you click on export, it will open the Jupyter, kinda like, internal application website, which will have a new notebook with your query in it, so that it's ready for you to just press run, like, in the Python Jupyter format, for example. Then you can just, like, put it into a data frame and do the normal, traditional Jupyter slicing and dicing of the data. Once you're done with that, if you wanted to continue then in Querybook, you could save it as a new table. And then once it's saved into a new table, you can use it again in Querybook if you wanted to.
[00:11:03] Unknown:
And in the tagline for Querybook, it calls out the fact that it's an IDE for big data. And big data, sort of as a term, is something that gets thrown around a lot, with different meanings and different intentions behind it. And I'm wondering if you can just talk through sort of what you're trying to aim for with that tagline, and some of the ways that that focus manifests in the ways that Querybook is designed, and the feature sets that it has, and the overall user experience that it provides.
[00:11:29] Unknown:
Let's compare the big data IDE, which is Querybook, to, like, a small data IDE, which is something like Sequel Pro for MySQL. So what are the differences? Normally, like, a MySQL database has a lot fewer tables. So for example, a MySQL database may only contain, like, hundreds of tables, whereas for Querybook, like, currently at Pinterest, we handle tables at, like, a scale of, like, tens of thousands, like, maybe close to, like, a quarter million tables. So that's, like, one thing we have to optimize. So what we did is that we actually pushed our tables' information and search index onto Elasticsearch, to also provide suggestions.
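As a rough illustration of the suggestion lookup described above, here is how a completion-suggest request body against an Elasticsearch table index might be built. The index layout (a `name_suggest` completion field, the `table_names` suggester name) is an assumption for the sketch, not Querybook's actual schema:

```python
def build_table_suggest_query(prefix, limit=10):
    """Build an Elasticsearch request body that suggests table names
    for a user's partial input. Field and suggester names here are
    illustrative, not Querybook's real index mapping."""
    return {
        "suggest": {
            "table_names": {
                "prefix": prefix,
                "completion": {
                    "field": "name_suggest",
                    "size": limit,
                    # tolerate a one-character typo in the prefix
                    "fuzzy": {"fuzziness": 1},
                },
            }
        }
    }

# The body would be passed to the search API, e.g. es.search(index="tables", body=body)
body = build_table_suggest_query("user_eve")
print(body["suggest"]["table_names"]["prefix"])  # → user_eve
```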
So that's one of the, like, design choices we made to support this kind of larger scale. Another thing we had to focus on is that the query results and the query runtimes for big data are much longer and larger. So for example, like, a typical MySQL query would time out after 10 seconds, whereas a Querybook query can run for, like, days, maybe for, like, close to a week. So to make sure that will work, we have to, like, create async workers to make sure the query can, like, run much longer. And then they have to also send out progress updates or logs back to the user, so then they can know, like, how the query is running currently.
So that's also another optimization Querybook has to focus on. Another thing is the query results. So a MySQL table is typically, like, gigabytes in size, whereas Querybook tables can be in the size of, like, terabytes or even larger. So a typical query result for MySQL is very likely to be much smaller. It could be, like, in the size of megabytes. But, like, users on Querybook would query, like, gigabytes of data. So what we did is, at Pinterest, internally, we will upload the query results to S3 and then format them as CSV, such that we can, like, only read chunks of it to the user. And if they want to see the full result, they can download the data directly from S3. And I'm curious about your choice of CSV as the interchange format, and why you didn't go with something like Parquet or ORC, for being able to chunk the data and use it more easily for other downstream analysis, which would then preserve some of the type information and schema data? Oh, it's more like a design choice very early in the tool that we didn't really bother to change. I think we chose CSV just because, like, when people want to download the data, it's very easy for them to read it on their laptop. They can, like, put it into Google Sheets very easily.
The advantage of that is that, like, with the CSV format, we're also, like, able to read it line by line and then just read a certain chunk of it. So for example, on Querybook, we set a maximum of 5 megabytes for the previews, so then we don't show any more data than that, which would, like, lag the front end website.
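The chunked-preview behavior described here can be approximated in a few lines of Python. The 5 MB cap matches the number mentioned in the conversation, but the reading logic is a sketch, not Querybook's implementation:

```python
import csv
import io

PREVIEW_LIMIT_BYTES = 5 * 1024 * 1024  # 5 MB preview cap, as mentioned above

def preview_csv(fileobj, limit=PREVIEW_LIMIT_BYTES):
    """Read complete CSV rows from `fileobj` until roughly `limit` bytes have
    been consumed, so the frontend never renders an unbounded result set."""
    rows, consumed = [], 0
    for line in fileobj:
        consumed += len(line.encode("utf-8"))
        if consumed > limit and rows:
            break  # stop on a row boundary once the cap is exceeded
        rows.append(next(csv.reader([line])))
    return rows

sample = io.StringIO("id,count\n1,10\n2,20\n")
print(preview_csv(sample))  # → [['id', 'count'], ['1', '10'], ['2', '20']]
```

The same line-by-line approach works when streaming the object from S3, since CSV has no footer or index that would require reading the whole file first, which is one reason a columnar format like Parquet is harder to preview incrementally.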
[00:14:23] Unknown:
I think, like, moving forward, we could think of, like, changing the format from CSV into another format. But, like, right now, it's more like an early design decision. Yeah. It's definitely useful context to have, because it's always interesting to look at what are the design constraints and limitations that you're working around, and the things that you're trying to optimize for as you're building out these different tools. And I can definitely understand the desire to optimize for ease of use as you're downloading it on your laptop, for being able to shorten that iteration cycle, versus, okay, I've got a Parquet file, now I need to break out another set of tools to be able to understand what's even in here.
[00:14:57] Unknown:
Yeah. Yeah. That's totally true. And when we started, I don't think our company enforced Parquet. It was something that we started doing in 2018. So that also fed into our decision to use CSV.
[00:15:10] Unknown:
In terms of the actual target users of the Querybook platform, you talked about how it's kind of aimed at these different roles within the company. And I'm wondering if you can just talk through how the kind of personas that you're building around have influenced the feature set of Querybook, and the primary capabilities and the interaction patterns that you've built into it, and how that plays into how you think about the overall role of the tool within the company.
[00:15:40] Unknown:
So when we were designing Querybook, we did talk with a lot of users. We tried to see how they were using the existing product or existing solution that we had for them in the company. We found that, like, a lot of users were not necessarily using it as intended, and they actually had, for example, their queries stored in their local file system, even though there was, like, a feature to store queries in the product itself. And many people even used, like, notepads or, like, Evernote, those kinds of things, in order to store their queries, to provide, like, context and, like, a bit more information on them. So that definitely helped guide, or kinda, like, informed, a lot of the design decisions of why we went with, like, a notebook kind of interface. Even though, of course, it is, like, very, very similar to, like, I mean, like, how Jupyter would have, like, a notebook interface, our main, like, inspiration was more that people were using, like, things like the macOS note applications in order to store their queries. So we wanted to replicate what they were already doing. So that is an example of how users definitely helped guide the direction of these decisions on the product.
Beyond that example, there have been many, many cases, because, to go back to your question, like, our users are kinda, like, data scientists, product analysts, engineers, and product managers. So it's a very big combination of functions, and they all have very different needs. The needs of a data scientist compared to, like, a product manager are very different. So we try to provide as much as possible what each function requires, however, without complicating the product, because one of our main things is that we don't want it to feel like a technical product, or, like, something that requires you, like, to know a lot in order to properly use it. I mean, of course, we do have some advanced features, but those are more hidden away, so, like, the more casual users are not, like, intimidated by them. In terms of, like, the general direction, we're mostly clear in the sense that, like, we wanted Querybook to function just as the first step where people start their analysis, before jumping, if necessary, to Jupyter for the subset, maybe the 20%, of the advanced analysis they wanna do.
And maybe before jumping to our other internal solution for dashboarding and charting. Because we started, like, in 2014, we do have, like, an internal solution like Superset that is meant for interactive slicing and dicing and charting and dashboarding. It has different characteristics from Superset. I mean, I think Superset is a very great solution for that. But then the idea would be, again, like, people start their analysis first on Querybook. And then if they need to do charting and dashboarding, for slicing and dicing and presenting this on a, kinda like, weekly cadence for a meeting, they can do that, like, directly on our equivalent of Superset. If they wanna do, like, in-depth slicing and dicing with Python, they can do that with Jupyter.
Therefore, basically, we have focused on this core need, which is, like, common between all three functions we mentioned: engineers, PMs, and data scientists.
[00:18:44] Unknown:
And in terms of the actual implementation of Querybook, can you talk through the overall architecture of how it runs, and some of the specific design choices that you made to be able to handle the scale and sort of the big data focus of it? Our front end, the web server that, like, hosts web pages and APIs, that's
[00:19:04] Unknown:
through Flask, and we use uWSGI for scaling. It talks to a Celery worker via a Redis connection. So we also run a set of Celery workers, and those Celery workers handle running the actual queries, and then pulling the results from these query engines, and then uploading them. Other than that, the front end Flask server also can push updates, such as, like, the query running logs, like, query status updates, or, like, if you're doing collaborative editing, it will also push, like, live updates, such as, like, collaborative updates, through a WebSocket connection to the front end. So that's via Socket.IO.
This also goes through another Redis connection. Other than that, the worker connects to AWS S3 to, like, push the query results there. And then it will also connect to Elasticsearch. So, like, for example, if you edited a DataDoc, then the Flask front end server will push an async job through Redis to the Celery worker to update the Elasticsearch search index, so that when people search for the DataDoc, they will find it in Elasticsearch. Lastly, there's a scheduler. So the scheduler's job is for people to run, like, a scheduled DataDoc. Or there are some scheduled jobs in Querybook that can be used to index, like, metastore information, like, the table and database information, or it can be used to, like, clean up, like, older query results.
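As a toy illustration of that hand-off, here is the same dispatch pattern with stdlib pieces standing in for the real components: a `queue.Queue` plays the role of Redis as the broker, and a thread plays the role of a Celery worker emitting progress updates. This is a sketch of the pattern only, not Querybook's implementation:

```python
import queue
import threading

# The web tier enqueues a job (Redis plays this role in Querybook) and a
# background worker (Celery in Querybook) runs the long query and records
# progress updates that would be pushed back to the browser over Socket.IO.
job_queue = queue.Queue()
progress_log = []

def submit_query(statement):
    """Web-tier side: enqueue the query and return to the user immediately."""
    job_queue.put(statement)

def worker():
    """Worker side: pull jobs and emit progress, as the Celery workers do."""
    while True:
        statement = job_queue.get()
        if statement is None:  # shutdown sentinel
            break
        progress_log.append(f"running: {statement}")
        # ...the real worker would execute against Presto/Hive here...
        progress_log.append(f"finished: {statement}")
        job_queue.task_done()

t = threading.Thread(target=worker)
t.start()
submit_query("SELECT 1")
job_queue.put(None)  # tell the worker to exit once the job is done
t.join()
print(progress_log)  # → ['running: SELECT 1', 'finished: SELECT 1']
```

The key property, which the sketch preserves, is that the web request returns as soon as the job is enqueued, so a query that runs for days never ties up a web server process.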
[00:20:34] Unknown:
And then in terms of the actual design of the system, I'm curious how it has changed or evolved since you first began working on it, and some of the assumptions that you made early on that you had to update or work around as you began introducing it to more of the company, and it began to be used more, and you saw sort of the specific usage patterns that people were employing?
[00:20:56] Unknown:
So in terms of the product, how it evolved is that we focused first on just having, kinda like, an interface for the big data at Pinterest, especially, like, the data cataloging problem space, something similar to, say, LinkedIn's DataHub or Lyft's Amundsen or many other external vendors. So that is where it started. But then we evolved our thinking into supporting the colocation of the querying interface in the same product, and that's when we added the querying capabilities into it. Lastly, the final, like, main part that we added is the ability to chart and dashboard quickly. I mean, it's not meant to be, like, a full-blown solution, again, like Superset.
It's meant to be just, like, the starting point. So it's similar to how in Google Docs, you can add maybe Google Drawings or you can add, like, tables, but if you want the full-blown solution for that, you can go to, basically, the Google Drawings application, or you can go to Google Sheets. It's a similar organization that we're aiming for with DataDocs and Querybook. And in terms of technical aspects,
[00:22:03] Unknown:
Charlie, if you wanna talk about whether there is anything interesting that you found in the evolution of it. I think the biggest change is the collaborative editing. So when we first designed this tool, we did not think about collaboration at all. It was basically one user and one DataDoc, and we just treated it kind of like a local file. After, like, we started to, like, give everyone in the company access to Querybook, they started to ask, like, oh, can I edit someone else's DataDoc? Can I share this with someone else, so that I can kind of hand it off to someone else? So we really had to think about collaborative editing. This is when we kind of, like, had to rearchitect a lot of the code so that it kind of pushes changes through WebSockets to other people who are also viewing this DataDoc. So that's one of the bigger technical changes here.
And another thing that changed was we added scheduling. So before, we were thinking that Airflow, which is used internally at Pinterest, would be good enough for, like, people's scheduling needs. But then it turned out to be a very common requirement for people to run some queries and then export them to Google Sheets and then view them. And they found that the whole workflow of, like, creating some Airflow jobs is, like, a lot of burden. So we actually added, like, a scheduling feature, and you can actually, like, create a DataDoc schedule in, like, a matter of minutes. And then you can auto-export data to the same Google Sheet every day. Because of the fact that Querybook does have this scheduling capability, and you also have Airflow for running your pipelines, I'm curious how that factors into any sort of
[00:23:41] Unknown:
data lineage tracking or sort of metadata flows that you have within Pinterest, for being able to track execution or tables that are created using Querybook, as compared to other workflows that you might have, where you might use Airflow or maybe dbt for building different data views, and just how you keep track of the work that's going on where?
[00:24:00] Unknown:
So that's actually something the data team at Pinterest is working on right now, to come up with, like, a data lineage solution for every data team at Pinterest. It doesn't really matter if the query is executed from Airflow or is executed from Querybook. They all go into the same engine. And then the engine will log those queries and then push them back into Hive for analysis. So they will just come from different sources, but all the queries will be recorded in the same table.
[00:24:27] Unknown:
And then another thing that's worth digging into is the data cataloging aspects of Querybook, where you mentioned that one of the original purposes of the project was to be able to build out this catalog, to understand what tables you have access to, and maybe see how they're being used, and track that. And I'm curious if you can talk through some of the capabilities that Querybook has, and the way it's being used from a cataloging perspective, and maybe how it compares to other things that have come out in the ecosystem since you first began working on it, such as Amundsen or Apache Atlas?
[00:24:58] Unknown:
As you mentioned, like, when we first started working on this, there was not, like, the proliferation of, like, data cataloging solutions in the open source world that there is now, which is great to have. So I think that, as far as I know, Amundsen was not there, and, like, LinkedIn's DataHub was not there, among other things. So right now, it's very similar to some of these other products. I would say the intersection is very big in terms of the features that are available in each platform. In our case, what we currently have is basic things like the table information, like table descriptions, details, table metadata, information on the metastore, column descriptions, being able to add comments to the columns.
Column type information, of course, and data lineage information. So we support ingestion of this data through an API, but we also generate part of this data from the queries that get run on Querybook itself. We also support showing the source query for a given table, who is using it, and the top queries against a given table, example queries of such a table. Basically, the queries that have been run on Querybook, we surface them and show them in the UI. Warnings and alerts. So for example, if a given table is, like, having a downstream issue or upstream issue and the data is incorrect, we can just flag it. And then the nice thing is that, for example, in your editor or in the DataDoc view, you would get those warnings, or everyone will get those warnings, so that they don't use a table if it's maybe bad for the day. Those are the main features that we have focused on so far.
The main differentiator would be that instead of having to go into a different UI or a different standalone system to see all of this information, you see it integrated into the editor itself. The reason is that if you have this collocated with your current interface, then you're able to do very powerful or interesting stuff like autocompletion and warnings, all of these things that would maybe not be there if it was a different product.
[00:27:01] Unknown:
The difference between Querybook and other analytics tools out there is that most of the ad hoc queries people run at Pinterest run on Querybook. So we have a lot of data internally for analysis, and we can use that analysis to power all those suggestions. For example, if people query this table commonly, then we know that this table is popular, so we can actually boost it up in the search index itself. We can also use people's query information to, for example, auto-create lineages, so then we know which table comes from another. So there's a lot of potential there for automatic query analysis in Querybook.
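The "boost popular tables in the search index" idea can be sketched very simply. This is a hedged illustration, not Querybook's real ranking code: it assumes a query log represented as lists of tables referenced per query, and computes a score multiplier from each table's share of usage.

```python
from collections import Counter

def popularity_boosts(query_log, base_score=1.0):
    """Hypothetical sketch: boost a table's search score by how often it is queried.

    query_log: iterable of per-query table lists, e.g. [["core.pins"], ...]
    """
    counts = Counter(table for tables in query_log for table in tables)
    total = sum(counts.values()) or 1
    # linear boost proportional to the table's share of all table references
    return {table: base_score * (1 + n / total) for table, n in counts.items()}

boosts = popularity_boosts([
    ["core.pins"], ["core.pins"], ["core.users"], ["core.pins", "core.users"],
])
# "core.pins" is referenced more often, so it ends up with the higher boost
```

A real search index would feed a score like this into its ranking function so frequently queried tables surface first.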
[00:27:45] Unknown:
Patrick is a diligent data engineer, probably the best on his team. Yesterday, when trying to optimize the performance of a query running over 20 billion rows, he was so eager to succeed that he read the entire database documentation. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly Slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday. Want to become the liberator of data? Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in subseconds with minimum resources.
No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt. In terms of the actual workflow of getting started with building a DataDoc in Querybook, discovering the data assets, working with them, and the collaboration capabilities you mentioned: can you talk through the overall flow of saying, I have this question that I'm trying to answer, I'm going to use Querybook to answer it, the end-to-end flow from question to solution, and then the life cycle of the DataDoc after I've come to a particular conclusion from the question that I've asked?
[00:29:16] Unknown:
Sure. I mean, the overall workflow is very small; it doesn't have a lot of steps. You can go into the website, the UI for it, then you can click on create a new DataDoc. Once you have your DataDoc, you can just start typing. That will be your first query. You may already know the table that you are going to use, maybe because it's a table, or basically a problem space, that your team often works on, so maybe there are, say, 5 or 10 tables that your team often uses. So then you can just start typing, and it will get autocompleted. If you want a reminder, for example, of which columns are in the schema, you can just double-click the table, and you will get its information in a sidebar, similar to the experience of using a real IDE for programming. And then you continue writing your query. You run it once. You get results.
You can modify your query, because maybe you want another dimension or flavor of it, then you run it again. You get the historical results if you want to go back to them. Maybe you're satisfied with that first query. You then create a new query cell below and add your second query. Maybe now you slice and dice by a very different aspect. For example, maybe you first slice by country, then you slice by gender to see if there's a difference among different populations. You may slice it in a third query by a different category, maybe language.
Then once you find the reason or the answer to your question, you can add new cells between those results, maybe text cells that can explain the rationale, like: hey, in this first query we're slicing by country, but we did not find anything meaningful in the behavior of this drop in usage, for example. Then you add a little bit more clarification in the second and third queries, and you can add kind of a final thought: once we sliced it by language, we found that Korean users dropped dramatically, and we then checked with the team, and it's actually because we had a bug in the translation area, for example.
Then once that's done, you can, of course, give a title to your DataDoc if you want to. And then you can just click on share, just like a Google Doc. You'll get a very similar modal to any modern application: mark the DataDoc as private, or, if you want, make it public. If it's public, everyone in the company will see it. If it's private, you can invite some collaborators to it, and you can give them either view or edit access. If you give them edit access, then they'll be able to collaboratively edit, as Charlie mentioned. And if you give them view access, they'll just be able to see it. That's mostly the end of the workflow.
And if you did not know exactly which tables to use, you could also start your journey in the sidebar by searching for tables. You can type something in, for example, engagement or pictures, and then you'll find relevant tables. If you click on them, you'll find more information on each of them in order to more easily narrow down your search to the right tables. It's something that we want to improve on. Basically, right now you can find the tables that you're looking for in Querybook, but there is still a lot of opportunity to find the right table as fast as possible, so it's something that we want to double down on. As we've been discussing, this is a tool that started off internally at Pinterest because you're trying to solve your own needs. And, you know, since then, there's been a proliferation
[00:32:43] Unknown:
of tools that are in adjacent spaces to what you're building with Querybook. Wondering if you can just talk through your reasoning and motivations for open sourcing the project and the technical aspects and organizational aspects that you had to unwind in the process of making this an open source tool?
[00:33:01] Unknown:
In terms of the rationale for it, there are several reasons. Just like any company, we would like to be able to contribute to the open source community. I mean, Pinterest, as any other company, uses a lot of open source tools, and if we can contribute back to the community, that would be a great outcome. The second thing would be, of course, being able to benefit from the input of this community in terms of the direction of the product. That will in turn help Pinterest, because it will end up being a better product.
Lastly, of course, we would love it if the community is able to contribute to the product, maybe features or integrations or even bug fixes. That would be super beneficial to the internal usage at Pinterest. So these three driving factors led us to decide to pursue this route. And of course, we also did an exploration of what was available in open source at that point. We didn't see a clear I mean, there were products that were doing parts of what Querybook does: a lot of products that do data cataloging with querying capabilities, a lot of products that do full-blown slicing and dicing with charting and dashboarding.
And then full-blown heavy-duty Python analysis with Jupyter. But we didn't see exactly the intersection that we're aiming for with Querybook. So that led us to think that it was something useful to open source. That's mostly it from the rationale perspective. Charlie, do you want to talk more about the changes that needed to be done to make it technically possible to open source? So there are a few areas we had to change. One is that
[00:34:44] Unknown:
we hardcoded a lot of things, because we assumed there are only several query engines we use, and they all use the same metastore. That's no longer the case with the open source version of Querybook, where you can add arbitrary query engines and then associate them with an arbitrary metastore or database information. So that's one very large architecture change we had to do. Another area of hardcoding is that we often get requests such as: hey, if you run a Presto query, or if you run this query, then show this message to the user. Or if you want to schedule a DataDoc that has a Presto query or a certain query in it, show this warning. We have a lot of internal requests such as those, and they become much, much harder to do with open source, because we can't put something so specific in there. That's one area we really had to change. So what we did is that we took out all the internal Pinterest code and put it behind the plugin system. Then it can be used internally, but it won't be exposed in the open source code.
The last change we had to do was to make the setup easier. Before, the installation instructions for Querybook internally were all over the place. There were, like, ten steps: you had to install this library, then you had to do this, do that, and then you could start the Querybook process. Then you had to start the worker in a separate process, and you had to do this all manually. We figured that would be really complicated if we open sourced this thing. So what we did is make sure there's only one Docker command that can start the whole thing. You can just run make in Querybook, and it will start the whole demo service. We added a demo setup, so it auto-injects all the demo data so that you have something to try out when you've just cloned it from GitHub.
We also added admin tools so that you can try it out very easily.
[00:36:46] Unknown:
In terms of the extension points that you've had to build in the process of open sourcing it, I'm wondering if you can talk through the opportunities that that opens up and the available interfaces that you've built into Querybook for people to be able to customize it to fit their own needs. So when we built the plugin system, I kind of took the
[00:37:06] Unknown:
Airflow plugin system as a reference, because I saw that Airflow supports a lot of different customizations. So when we built the plugin system, we also wanted something similar. Right now, it's not the perfect model. We want it so that people can actually share their plugins in the future, so then you can add different functionality into Querybook without putting too much code into the main repo. In terms of plugins, we support custom authentication. By default, we provide Okta and Google OAuth and then LDAP auth. But different companies have different authentication systems, and they want to fetch different information from different sources. So that's why we have a custom authentication plugin.
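The shape of a custom authentication plugin can be illustrated with a minimal sketch. The class and method names here are hypothetical, not Querybook's real plugin interface; the point is only that an auth plugin boils down to turning an incoming request into a user identity, so each company can plug in OAuth, LDAP, or an internal SSO proxy.

```python
class AuthPlugin:
    """Hypothetical base class: a plugin maps a request to a username or None."""
    def authenticate(self, request):
        raise NotImplementedError

class HeaderAuth(AuthPlugin):
    """Example plugin that trusts an SSO proxy which sets a header upstream."""
    def authenticate(self, request):
        # request is modeled as a dict of headers for this sketch
        return request.get("X-Forwarded-User")

# a request that came through the (hypothetical) SSO proxy
user = HeaderAuth().authenticate({"X-Forwarded-User": "jmp"})
# an unauthenticated request yields None, and the app would redirect to login
anon = HeaderAuth().authenticate({})
```

Swapping in Okta, Google OAuth, or LDAP then just means writing another subclass with the same `authenticate` contract.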
Another plugin we have is for the query engine and metastore. This is a very heavily used one internally at Pinterest. We want it so that we can fetch analysis from different sources while you're running the query on a query engine. For example, internally, if you run a query, we can actually show you your current memory usage information directly in Querybook, and that's something that's not included in the open source version. Other than that, there are also plugins for exporting. Internally, this is used to export to different internal tools that can further enhance the query results.
We also have frontend plugins. There's a frontend plugin called result transform. What this does is that when you query in Querybook for example, at Pinterest, we have a lot of pins, right? So when you query something that has a column pin ID or image signature, it can translate that pin ID into a link. Then you can just click on that pin ID, and it will redirect you to the pin at Pinterest. For image signature, we can actually find out which image it is by reverse mapping it, so you can actually view the image directly inside Querybook. So those are the different functionalities of plugins.
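The result transform idea, mapping raw cell values in certain columns to richer renderings, can be sketched in a few lines. This is a hedged illustration only: the real plugin is frontend code, and the column name and URL pattern below are assumptions for the example.

```python
# Hypothetical result transform table: column name -> rendering function.
# Columns without a transform pass through unchanged.
TRANSFORMS = {
    "pin_id": lambda v: f"https://www.pinterest.com/pin/{v}/",  # assumed URL shape
}

def transform_row(columns, row):
    """Apply the per-column transform to each cell in a result row."""
    return [TRANSFORMS.get(col, lambda v: v)(val) for col, val in zip(columns, row)]

rendered = transform_row(["pin_id", "saves"], ["12345", 9000])
# the pin_id cell becomes a clickable link; the saves count is untouched
```

The same mechanism extends to the image-signature case: register another transform that resolves the signature to an image URL and renders it inline.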
[00:39:10] Unknown:
And in terms of the life cycle management of the query results, I'm also curious about that, where you mentioned that a lot of them get exported to S3. I'm curious how you manage to avoid too much propagation and data duplication in the stored results, what your thoughts are on expiry of results as a DataDoc ages out or as the query gets executed in different locations, and just the overall question of reducing the amount of extra data that you have to keep around for the purposes of the DataDocs versus the amount of time it takes to regenerate that dataset after it's been created?
[00:39:46] Unknown:
Well, there are several things. Right now, the data in Querybook, at least internally, does not last forever. We have a 180-day expiration policy, so any data on S3 will be deleted after 180 days. Internally, we also have a cleanup job that removes all the queries after 365 days, so that query will be deleted permanently, forever. We do want to have an additional feature that can let users mark some of their query results so that they can keep them for longer. In terms of just cleaning up data on S3 to make sure there's no duplicated data, we actually have a working group right now to detect S3 access to that data and also check the Hive metastore access for the tables, and then we'll remove those tables or data if there's no access.
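The two-tier retention policy described here (180 days for result data, 365 days for query records) can be sketched as a simple sweep. This is an illustrative sketch under those stated numbers, not Querybook's actual cleanup job; the record representation and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

RESULT_TTL = timedelta(days=180)  # S3 result data expires after 180 days
QUERY_TTL = timedelta(days=365)   # query records are removed after 365 days

def sweep(records, now):
    """Partition (kind, created_at) records into keep/delete by age and kind."""
    ttl = {"result": RESULT_TTL, "query": QUERY_TTL}
    keep, delete = [], []
    for kind, created in records:
        (delete if now - created > ttl[kind] else keep).append((kind, created))
    return keep, delete

now = datetime(2021, 6, 1)
keep, delete = sweep(
    [("result", datetime(2021, 1, 1)),   # ~151 days old: within the result TTL
     ("query", datetime(2020, 1, 1)),    # ~517 days old: past the query TTL
     ("query", datetime(2021, 1, 1))],   # well within the query TTL
    now,
)
```

The "keep for longer" feature mentioned above would just add a pinned flag that exempts a record from this sweep.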
[00:40:36] Unknown:
In terms of the ways that Querybook is being used, or your own experience of working with it for building different DataDocs and analyses, I'm curious what you've seen as some of the most interesting or innovative or unexpected ways that it's been employed, either inside Pinterest or elsewhere since you've open sourced it. I can give 2 examples. One is that while we were doing the open source work, we added MySQL functionality
[00:40:57] Unknown:
just so we could see what kinds of engines can actually be supported in Querybook. And then I actually connected a MySQL database that's used by a lot of internal tools for our team. Some of our teammates actually started to use Querybook for MySQL because they figured out that you can share your queries very easily with other people. Also, you can save your alter queries and your create queries, so then you can always go back and see what you have done. So to my surprise, we actually started to use Querybook more for MySQL querying for our internal tools compared to traditional tools like Sequel Pro. In terms of externally, I think Grand Rounds was one of the early adopters of Querybook, and they have added a lot more interesting features around creating tables. What they added was that after you run a query, you can actually create a view or a table on top of the query result, and they added a lot more UI for the table creation. And then it actually has, I think, a GitHub integration that lets you view your changes on GitHub directly afterwards.
[00:42:12] Unknown:
And in terms of your own experience of building the Querybook platform, sharing it within Pinterest, then releasing it as open source and working with the community as it starts to gain adoption, I'm curious: what have been some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:42:30] Unknown:
I think the biggest learning for me was the importance of talking to our users, basically having a very, very tight feedback loop. We interact with them directly, and Charlie and the other engineers working on it also interact directly with our users, so we don't have an intermediary. This has helped us iterate very fast and get a feeling for what users like and what they don't like over time. And we have learned a lot and fixed our own mistakes all the time. The biggest learning was the importance of that, compared to some of our other products where we may not have been that directly involved with our users.
And I think that is what has allowed Querybook to grow very fast within Pinterest in terms of user adoption and in terms of user satisfaction.
[00:43:21] Unknown:
That has been the big learning for me. I also have 2 to share. One is that when we interact with users, I noticed that they will often give very specific instructions on how to solve their pain points, such as: oh, you can add a button here, or you can add this feature here. But we shouldn't really do exactly what we're told; we should think about what their pain points are. It's actually much more useful to gather a lot of users' pain points and then come up with our own design, so then it can actually solve a larger problem and not make the product go in the direction of a specific user. Another lesson I learned is kind of like the law of diminishing returns. We actually wanted to add a lot of granular features into Querybook. For example, I kind of wanted to add a full Presto parser that would give users exact syntax errors as they're typing out the query. I also kind of wanted to work out a much better collaborative editing system for Querybook that can do what Google Docs does, because right now Querybook's limitation is that 2 users cannot edit in the same cell, or else they will overwrite each other's queries. But I noticed that these 2 features, while they would be really great to add, would actually cost a lot of work, and the benefit of having them for users is pretty low compared to the amount of work we would have to do, whereas we could dedicate this time to other features that can actually greatly help users' productivity.
[00:44:54] Unknown:
For people who are looking to be able to do collaborative analysis, or who want to be able to build a notebook-style workflow for integrating queries and prose, I'm wondering what you've seen as the cases where Querybook might be the wrong choice and somebody might be better served with some of the adjacent tools in the space.
[00:45:15] Unknown:
One example of people who would be better served by other tools is if they have a specific thing in mind and they want to create something specific. For example: I want to create a dashboard that I can use on a weekly or daily basis with my team in a given team meeting, and I want to be able to interactively slice and dice those results. Then they would probably be better served by using Superset for that, or any other interactive dashboarding solution. Because Querybook currently is meant for noninteractive use, in the sense that even though you can have fast engines like Presto or Spark SQL or similar, the interface and the usage model are different, so it's not as well suited for dashboarding. That would be one example where you would rather start in another product.
Another example, of course, is if you want to build an analysis and you know that you prefer to use Python to slice and dice your data, and your dataset is actually able to fit in memory in your Jupyter instance, in your kernel; then, of course, you can just start in Jupyter itself. I mean, what some people do is create their queries in Querybook, where they have all the metadata enrichment capabilities, so they draft their queries in Querybook and then move on to Jupyter to do the Python analysis. But still, there may be cases where you want to start right there in a Jupyter notebook or a similar solution.
Another situation where it could be better for you to start in another tool is workflows. Even though Querybook allows for and provides quick and simple scheduling, it's meant to be ad hoc: it's for when you're going to run a DataDoc quickly, on a daily basis, and get some results in an email or a spreadsheet, something basic for people to use. But if you want full-blown scheduling, you should definitely use something like Airflow or anything similar. Those would be the main examples, I would say, where you probably want to use a different solution.
[00:47:16] Unknown:
And in terms of the future work for Querybook, I'm curious what you have planned for the near to medium term or any use cases or capabilities that you're excited to add or enable?
[00:47:27] Unknown:
Our current most focused one is to add more integrations for open source. There have been a lot of asks to support BigQuery, add more data catalog features for different query engines, add more query engines, or add Terraform and Helm charts for continuous deployment. So these things are top of mind right now, so that more users can start using Querybook. In terms of the near future, we're working on expanding the list feature, which is a way to organize your DataDocs and tables into lists and share them with others. That's something we want to work on in the near future. And another thing we're excited to work on is the table creation feature in Querybook, so as to close the loop. Right now, you can do insert, update, and delete in Querybook, but if you want to create a table, you can't really put the data into Querybook and let it create a table for you. And that's something we want to work
[00:48:30] Unknown:
on. Are there any other aspects of the Querybook project, or the overall space of data discovery and data cataloging and querying and notebook interfaces, or the use cases that you're targeting with Querybook, that we didn't discuss yet that you'd like to cover before we close out the show? Some of the areas that I find very interesting, and, like,
[00:48:51] Unknown:
I haven't seen a very good solution, either with what we currently have with Querybook or with other external vendors or open source projects, is finding the right table for people to use. Even though we have many features towards this, and there are many other projects, like Amundsen and so on, that try to tackle the same problem, I think that it's still not as efficient for people as it could be. So it's something where I expect to see a lot of improvements over the next years. That's kind of an area that I'm excited about. How about you, Charlie?
[00:49:22] Unknown:
Yeah. Actually, I get a lot of questions daily, like: how do I find this table, how do I find that table? It's more like tribal knowledge right now. It's very hard for people to find the table they want for some specific data.
[00:49:38] Unknown:
Yeah. It's definitely one of the perennial challenges, and everybody solves it differently. Alright. Well, for anybody who wants to get in touch with either or both of you and follow along with the work that you're doing or contribute to Querybook, I'll have you each add your preferred contact information to the show notes. And as a final question, I would like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:02] Unknown:
This is probably similar to my previous answer, but I would say in the immediate space, it would be finding the right table for people to use. Basically, there are so many duplicated tables, and so much time wasted by engineers and by everyone in the company trying to find the right information. There are a lot of approaches and a lot of attempts at this, including, of course, things that we do at Querybook, but also things that other products do, like Amundsen or DataHub from LinkedIn and so on. But I would expect this to be very different 3 or 4 years from now, once better solutions or improvements to these products happen. That would save a lot of time for engineers, data scientists, and any employee in the company. So it's something that I'm definitely excited about.
[00:50:46] Unknown:
Other than data discovery, I think another thing is that right now we kind of have support for Spark SQL, which can run on much larger data, and also Presto, which is fast but, since it runs in memory, will only work for a smaller amount of analysis. So I would like to see a query engine that can actually balance the two, where it's fast for smaller queries but doesn't have the memory issue of Presto.
[00:51:16] Unknown:
Alright. Well, thank you both for taking the time today to join me and share the work that you're doing on Querybook. It's definitely a very interesting project, and one that fills an interesting Venn diagram in the space that I have not seen filled before. So I definitely plan to experiment with it myself and keep a close eye on it as it progresses. I appreciate all the time and energy you've put into building it and releasing it to the community. So thank you for that, and thank you for taking the time, and I hope you enjoy the rest of your day. Thank you so much. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Interview with Justin Mejorada-Pier and Charlie Gu
Overview and Motivation Behind Querybook
Main Use Cases and Unique Features of Querybook
Big Data IDE and Design Choices
Target Users and Personas
Technical Architecture and Evolution
Data Cataloging and Lineage Tracking
Workflow and User Experience
Open Sourcing Querybook
Plugin System and Customization
Interesting Use Cases and Lessons Learned
When Querybook Might Not Be the Right Choice
Future Plans and Enhancements
Final Thoughts and Closing Remarks