Summary
SQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky and inefficient. Frustrated with the lack of a modern IDE and collaborative workflow for managing the SQL queries and analysis of their big data environments, the team at Pinterest created Querybook. In this episode Justin Mejorada-Pier and Charlie Gu share the story of how the initial prototype for a data catalog ended up as one of their most widely used interfaces to their analytical data. They also discuss the unique combination of features that it offers, how it is implemented, and the path to releasing it as open source. Querybook is an impressive and unique piece of technology that is well worth exploring, so listen and try it out today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
- Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Justin Mejorada-Pier and Charlie Gu about Querybook, an open source IDE for your big data projects
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Querybook is and the story behind it?
- What are the main use cases or workflows that Querybook is designed for?
- What are the shortcomings of dashboarding/BI tools that make something like Querybook necessary?
- The tag line calls out the fact that Querybook is an IDE for "big data". What are the manifestations of that focus in the feature set and user experience?
- Who are the target users of Querybook and how does that inform the feature priorities and user experience?
- Can you describe how Querybook is architected?
- How have the goals and design changed or evolved since you first began working on it?
- What were some of the assumptions or design choices that you had to unwind in the process of open sourcing it?
- What is the workflow for someone building a DataDoc with Querybook?
- What is the experience of working as a collaborator on an analysis?
- How do you handle lifecycle management of query results?
- What are your thoughts on the potential for extending Querybook beyond SQL-oriented analysis and integrating something like Jupyter kernels?
- What are the most interesting, innovative, or unexpected ways that you have seen Querybook used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Querybook?
- When is Querybook the wrong choice?
- What do you have planned for the future of Querybook?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Querybook
- University of Waterloo
- Superset
- Sequel Pro
- Presto
- Trino
- Flask
- uWSGI
- Celery
- Redis
- SocketIO
- Elasticsearch
- Amundsen
- Apache Atlas
- DataHub
- Okta
- LDAP (Lightweight Directory Access Protocol)
- Grand Rounds
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because the number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all of this collaboration chaos firsthand, and they started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan, that's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Justin Mejorada-Pier and Charlie Gu about Querybook, an open source IDE for your big data projects. So, Justin, can you start by introducing yourself? Sure, Tobias.
[00:02:11] Unknown:
So I'm Justin, the engineering manager for the analytics platform team at Pinterest. I've been at Pinterest for six and a half years now, most of this time working in data, building internal data tools for the rest of the Pinterest employees. And we have several tools like this, and one of them is Querybook, which we're talking about today. That's the most used one. And, Charlie, how about yourself? Hi. I'm Charlie. I have been at Pinterest for
[00:02:37] Unknown:
four years now. And after I graduated from the University of Waterloo, I have stayed on the analytics platform team for the past four years, mostly focused on building internal data tools for Pinterest employees.
[00:02:50] Unknown:
And going back to you, Justin, do you remember how you first got involved in data management? I moved to the data team shortly after joining Pinterest.
[00:02:57] Unknown:
I did some Q&A demos that were relevant to the team. In the data team, there were several areas where we basically supported our end customers. For example, A/B experiment management, like lifecycle management for our A/B experiments, or charting and dashboarding tools. And then one of these areas was data management. The first time that I worked on that as a product was with Querybook. So maybe I can talk more about it once we talk about how Querybook got started. That was mostly it. Like, it was one of the charters the team supported. And, Charlie, how did you first get involved in data management?
[00:03:37] Unknown:
So after I joined the team, I was working mostly on the experimentation and dashboarding features for our internal tools. And then Justin mentioned this opportunity to build, like, this querying tool, which is now Querybook. And that's how I got started.
[00:03:53] Unknown:
And so in terms of the actual project itself, can you give a bit of an overview about what it is that you've built with Querybook and some of the story behind what motivated you to create this tool internally versus picking something up off the shelf?
[00:04:05] Unknown:
So we started Querybook around 2017. At that point in time, it started as an intern project. We had an external vendor solution; however, it had some things that we would have liked done differently, or that we wanted to improve, or that we wanted to connect better with the Pinterest ecosystem. So just as an exploration project, we basically asked our intern to try to build a small data cataloging application that could better integrate with our systems at Pinterest. By the end of the internship, the intern, in this case Lauren, was able to finish not just a first version of this data cataloging feature, but also to add data querying functionality to it. So it ended up going, like, faster than what we expected.
That gave us a bit of, like, pause to say, hey, maybe we should explore this area a bit more. Charlie joined this project. He rebuilt most of the architecture, because now it was no longer, like, an MVP, but rather more like a full-blown exploration. We started thinking, like, hey, should we continue down this path? Should we keep using the solution from the external vendor, or should we go with an open source solution? So at that point in time, we explored several solutions. Back in the day, Superset was, I think, at that point in time, called Caravel. Along with other options, both open source and commercial.
We found very good ones, but we did not find exactly what we were looking for. So for example, we found there were several data cataloging solutions, which handled, like, the metadata of these tables and all of that. And we found several charting and dashboarding solutions for slicing and dicing, like Superset. And all of them were very good at what they do. However, we did not find a solution that combined both things, basically, that combined data cataloging with data querying and dashboarding. Also, not a solution that allowed for easy narratives to be created around data, because that was our main focus of what we wanted to enable. Basically, we wanted to enable people to be able to ask a question like, hey, why is this incident happening? Or why is, for example, the number of users going up or down in this given country? Run an analysis easily without having to know Python to run it, for example, if they were using Jupyter. Document this and easily share it with other people. That's why we went with the narrative or, kinda like, notebook-like UI.
So based on this initial evaluation, we thought it was worth exploring more, basically giving ourselves six more months to build a more fleshed-out prototype. And then based on the initial customer response, I mean, of course, there were a lot of rough edges back then, but we decided to actually pursue it further. The user response was that they preferred it much more than the alternative. Then we just, like, replaced our existing solution with the new one. From that point, it was mostly just taking user feedback, iterating a lot, fixing a lot of things, a lot of processes, and just trying to make the product a bit better over time.
[00:07:09] Unknown:
And in terms of the main use cases that you're using it for, and how it compares to the sort of traditional business intelligence dashboard, I'm wondering if you can just talk through some of the overlap between Querybook and some of the adjacent types of tooling that are available. So you mentioned Superset, but also, because it's a notebook-style interface, it brings to mind things like Jupyter. And just the use cases that Querybook is uniquely well suited for, and sort of where it fits in the overall ecosystem and use cases at Pinterest?
[00:07:45] Unknown:
The main gap that we wanna focus on is being able to allow engineers, data analysts, product scientists, even PMs, to be able to quickly answer the questions that they may have through data, and then share this. Basically, what that means is, normally, in many of these other tools, queries are handled, like, as independent entities, or maybe grouped in a folder. It's just, like, the place where you write a SQL query, like a Hive or Presto query and so on. You get results, and that's it. And then people go on into Google Docs and then create the narrative around those, and then they link to many of these different queries and entities or charts or dashboards.
However, this is, like, a tedious process, because you have to then copy-paste a lot or provide links here and there, which is done through different applications. So what we wanted to enable is to kinda, like, decrease the friction. Basically, most people will just end their analysis in Querybook, just adding a couple paragraphs here and there of explanation, then sharing it very easily with their teammates, even collaborating, now with built-in real-time collaboration, directly on the DataDoc, which is, like, a very different experience from any of the other tools that may exist right now. Obviously, we wanna enable, like, fast analysis for people. And then they can just, like, share it, and that's it, and they can move on to their daily work or the other things that they may have to be doing. If they want more in-depth slicing and dicing of the data, maybe using Python, and they also happen to know Python, and they also happen to know Pandas, like, all of the data science toolkit, then they can definitely click on, like, say, export to Jupyter, because we also have, like, a Jupyter installation at Pinterest where they can follow up on that process. And then they can just, like, do the real, like, heavy-duty analysis through Python if they prefer that over doing it directly in Hive or directly in Presto.
That's something that people do, definitely. But we want to enable that, hey, if you can just answer your question within Querybook, then you should just go ahead,
[00:09:51] Unknown:
not have to spend that much time on the other things. So that's kinda, like, how it interacts with our other tools that we have or that exist out there. And when you do export it to Jupyter, I'm curious, what's actually contained in that export? Is it just a link to the sort of rendered datasets so that you can import those into the Jupyter notebook, or sort of what does the interchange format look like? And is there the possibility of sort of going back and forth between Jupyter and Querybook, where maybe you go to Jupyter, do some additional analysis, and then export a new dataset that you then bring back into the original DataDoc?
[00:10:24] Unknown:
Sure. So, basically, like, how it currently works is that once you click on export, it will open the Jupyter, kinda like, internal application website, which will have a new notebook with your query in it, so that it's ready for you to just press run, like, in the Python Jupyter format, for example. Then you can just, like, put it into a data frame and do the normal, traditional Jupyter slicing and dicing of the data. Once you're done with that, if you wanted to continue then in Querybook, you could save it as a new table. And then once it's saved into a new table, you can use it again in Querybook if you wanted to.
[00:11:03] Unknown:
And in the tagline for Querybook, it calls out the fact that it's an IDE for big data. And big data, sort of as a term, is something that gets thrown around a lot, with different meanings and different intentions behind it. And I'm wondering if you can just talk through sort of what you're trying to aim for with that tagline, and some of the ways that that focus manifests in the ways that Querybook is designed, and the feature sets that it has, and the overall user experience that it provides.
[00:11:29] Unknown:
Let's compare the big data IDE, which is Querybook, to, like, a small data IDE, which is something like Sequel Pro for MySQL. So what are the differences? Normally, like, a MySQL database has a lot fewer tables. So for example, a MySQL database may only contain, like, hundreds of tables, whereas for Querybook, like, currently at Pinterest, we handle tables at, like, a scale of, like, tens of thousands, like, maybe close to, like, a quarter million tables. So that's, like, one thing we have to optimize. So what we did is that we actually pushed our tables' information and search index onto Elasticsearch, to also provide suggestions.
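As a rough illustration of the suggestion lookup described above, here is how a completion-suggest request body against an Elasticsearch table index might be built. The index layout (a `name_suggest` completion field, the `table_names` suggester name) is an assumption for the sketch, not Querybook's actual schema:

```python
def build_table_suggest_query(prefix, limit=10):
    """Build an Elasticsearch request body that suggests table names
    for a user's partial input. Field and suggester names here are
    illustrative, not Querybook's real index mapping."""
    return {
        "suggest": {
            "table_names": {
                "prefix": prefix,
                "completion": {
                    "field": "name_suggest",
                    "size": limit,
                    # tolerate a one-character typo in the prefix
                    "fuzzy": {"fuzziness": 1},
                },
            }
        }
    }

# The body would be passed to the search API, e.g. es.search(index="tables", body=body)
body = build_table_suggest_query("user_eve")
print(body["suggest"]["table_names"]["prefix"])  # → user_eve
```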
So that's one of the, like, design choices we made to support this kind of larger scale. Another thing we had to focus on is that the query results and the query runtimes for big data are much longer and larger. So for example, like, a typical MySQL query would time out after 10 seconds, whereas a Querybook query can run for, like, days, maybe for, like, close to a week. So to make sure that will work, we have to, like, create async workers to make sure the query can, like, run much longer. And then they have to also send out progress updates or logs back to the user, so then they can know, like, how the query is running currently.
So that's also another optimization Querybook has to focus on. Another thing is the query results. So a MySQL table is typically, like, gigabytes in size, whereas Querybook tables can be in the size of, like, terabytes or even larger. So a typical query result for MySQL is very likely to be much smaller. It could be, like, in the size of megabytes. But, like, users on Querybook would query, like, gigabytes of data. So what we did is, at Pinterest, internally, we will upload the query results to S3 and then format them as CSV, such that we can, like, only read chunks of it to the user. And if they want to see the full result, they can download the data directly from S3. And I'm curious about your choice of CSV as the interchange format, and why you didn't go with something like Parquet or ORC, for being able to chunk the data and use it more easily for other downstream analysis, which would then preserve some of the type information and schema data? Oh, it's more like a design choice very early in the tool that we didn't really bother to change. I think we chose CSV just because, like, when people want to download the data, it's very easy for them to read it on their laptop. They can, like, put it into Google Sheets very easily.
The advantage of that is that, like, with the CSV format, we're also, like, able to read it line by line and then just read a certain chunk of it. So for example, on Querybook, we set a maximum of 5 megabytes for the previews, so then we don't show any more data than that, which would, like, lag the front end website.
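The chunked-preview behavior described here can be approximated in a few lines of Python. The 5 MB cap matches the number mentioned in the conversation, but the reading logic is a sketch, not Querybook's implementation:

```python
import csv
import io

PREVIEW_LIMIT_BYTES = 5 * 1024 * 1024  # 5 MB preview cap, as mentioned above

def preview_csv(fileobj, limit=PREVIEW_LIMIT_BYTES):
    """Read complete CSV rows from `fileobj` until roughly `limit` bytes have
    been consumed, so the frontend never renders an unbounded result set."""
    rows, consumed = [], 0
    for line in fileobj:
        consumed += len(line.encode("utf-8"))
        if consumed > limit and rows:
            break  # stop on a row boundary once the cap is exceeded
        rows.append(next(csv.reader([line])))
    return rows

sample = io.StringIO("id,count\n1,10\n2,20\n")
print(preview_csv(sample))  # → [['id', 'count'], ['1', '10'], ['2', '20']]
```

The same line-by-line approach works when streaming the object from S3, since CSV has no footer or index that would require reading the whole file first, which is one reason a columnar format like Parquet is harder to preview incrementally.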
[00:14:23] Unknown:
I think, like, moving forward, we could think of, like, changing the format from CSV into another format. But, like, right now, it's more like an early design decision. Yeah. It's definitely useful context to have, because it's always interesting to look at what are the design constraints and limitations that you're working around, and the things that you're trying to optimize for as you're building out these different tools. And I can definitely understand the desire to optimize for ease of use as you're downloading it on your laptop, for being able to shorten that iteration cycle, versus, okay, I've got a Parquet file, now I need to break out another set of tools to be able to understand what's even in here.
[00:14:57] Unknown:
Yeah. Yeah. That's totally true. And when we started, I don't think our company enforced Parquet. It was something that we started doing in 2018. So that also fed into our decision to use CSV.
[00:15:10] Unknown:
In terms of the actual target users of the Querybook platform, you talked about how it's kind of aimed at these different roles within the company. And I'm wondering if you can just talk through how the kind of personas that you're building around have influenced the feature set of Querybook, and the primary capabilities and the interaction patterns that you've built into it, and how that plays into how you think about the overall role of the tool within the company.
[00:15:40] Unknown:
So when we were designing Querybook, we did talk with a lot of users. We tried to see how they were using the existing product or existing solution that we had for them in the company. We found that, like, a lot of users were not necessarily using it as intended, and they actually had, for example, their queries stored in their local file system, even though there was, like, a feature to store queries in the product itself. And many people even used, like, notepads or, like, Evernote, those kinds of things, in order to store their queries, to provide, like, context and, like, a bit more information on them. So that definitely helped guide, or kinda, like, informed, a lot of the design decisions of why we went with, like, a notebook kind of interface. Even though, of course, it is, like, very, very similar to, like, I mean, like, how Jupyter would have, like, a notebook interface, our main, like, inspiration was more that people were using, like, things like the macOS note applications in order to store their queries. So we wanted to replicate what they were already doing. So that is an example of how users definitely helped guide the direction of these decisions on the product.
Beyond that example, there have been many, many cases, because, to go back to your question, like, our users are kinda, like, data scientists, product analysts, engineers, and product managers. So it's a very big combination of functions, and they all have very different needs. The needs of a data scientist compared to, like, a product manager are very different. So we try to provide as much as possible what each function requires, however, without complicating the product, because one of our main things is that we don't want it to feel like a technical product, or, like, something that requires you, like, to know a lot in order to properly use it. I mean, of course, we do have some advanced features, but those are more hidden away, so, like, the more casual users are not, like, intimidated by them. In terms of, like, the general direction, we're mostly clear in the sense that, like, we wanted Querybook to function just as the first step where people start their analysis, before jumping, if necessary, to Jupyter for the subset, maybe the 20%, of the advanced analysis they wanna do.
And maybe before jumping to our other internal solution for dashboarding and charting. Because we started, like, in 2014, we do have, like, an internal solution like Superset that is meant for interactive slicing and dicing and charting and dashboarding. It has different characteristics from Superset. I mean, I think Superset is a very great solution for that. But then the idea would be, again, like, people start their analysis first on Querybook. And then if they need to do charting and dashboarding, for slicing and dicing and presenting this on a, kinda like, weekly cadence for a meeting, they can do that, like, directly on our equivalent of Superset. If they wanna do, like, in-depth slicing and dicing with Python, they can do that with Jupyter.
Therefore, basically, we have focused on this core need, which is, like, common between all three functions we mentioned: engineers, PMs, and data scientists.
[00:18:44] Unknown:
And in terms of the actual implementation of Querybook, can you talk through the overall architecture of how it runs, and some of the specific design choices that you made to be able to handle the scale and sort of the big data focus of it? Our front end, the web server that, like, hosts web pages and APIs, that's
[00:19:04] Unknown:
through Flask, and we use uWSGI for scaling. It talks to a Celery worker via a Redis connection. So we also run a set of Celery workers, and those Celery workers handle running the actual queries, and then pulling the results from these query engines, and then uploading them. Other than that, the front end Flask server also can push updates, such as, like, the query running logs, like, query status updates, or, like, if you're doing collaborative editing, it will also push, like, live updates, such as, like, collaborative updates, through a WebSocket connection to the front end. So that's via Socket.IO.
This also goes through another Redis connection. Other than that, the worker connects to AWS S3 to, like, push the query results there. And then it will also connect to Elasticsearch. So, like, for example, if you edited a DataDoc, then the Flask front end server will push an async job through Redis to the Celery worker to update the Elasticsearch search index, so that when people search for the DataDoc, they will find it in Elasticsearch. Lastly, there's a scheduler. So the scheduler's job is for people to run, like, a scheduled DataDoc. Or there are some scheduled jobs in Querybook that can be used to index, like, metastore information, like, the table and database information, or it can be used to, like, clean up, like, older query results.
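As a toy illustration of that hand-off, here is the same dispatch pattern with stdlib pieces standing in for the real components: a `queue.Queue` plays the role of Redis as the broker, and a thread plays the role of a Celery worker emitting progress updates. This is a sketch of the pattern only, not Querybook's implementation:

```python
import queue
import threading

# The web tier enqueues a job (Redis plays this role in Querybook) and a
# background worker (Celery in Querybook) runs the long query and records
# progress updates that would be pushed back to the browser over Socket.IO.
job_queue = queue.Queue()
progress_log = []

def submit_query(statement):
    """Web-tier side: enqueue the query and return to the user immediately."""
    job_queue.put(statement)

def worker():
    """Worker side: pull jobs and emit progress, as the Celery workers do."""
    while True:
        statement = job_queue.get()
        if statement is None:  # shutdown sentinel
            break
        progress_log.append(f"running: {statement}")
        # ...the real worker would execute against Presto/Hive here...
        progress_log.append(f"finished: {statement}")
        job_queue.task_done()

t = threading.Thread(target=worker)
t.start()
submit_query("SELECT 1")
job_queue.put(None)  # tell the worker to exit once the job is done
t.join()
print(progress_log)  # → ['running: SELECT 1', 'finished: SELECT 1']
```

The key property, which the sketch preserves, is that the web request returns as soon as the job is enqueued, so a query that runs for days never ties up a web server process.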
[00:20:34] Unknown:
And then in terms of the actual design of the system, I'm curious how it has changed or evolved since you first began working on it, and some of the assumptions that you made early on that you had to update or work around as you began introducing it to more of the company, and it began to be used more, and you saw sort of the specific usage patterns that people were employing?
[00:20:56] Unknown:
So in terms of the product, how it evolved is that we focused first on just having, kinda like, an interface for the big data at Pinterest, especially, like, the data cataloging problem space, something similar to, say, LinkedIn's DataHub or Lyft's Amundsen or many other external vendors. So that is where it started. But then we evolved our thinking into supporting the colocation of the querying interface in the same product, and that's when we added the querying capabilities into it. Lastly, the final, like, main part that we added is the ability to chart and dashboard quickly. I mean, it's not meant to be, like, a full-blown solution, again, like Superset.
It's meant to be just, like, the starting point. So it's similar to how in Google Docs, you can add maybe Google Drawings or you can add, like, tables, but if you want the full-blown solution for that, you can go to, basically, the Google Drawings application, or you can go to Google Sheets. It's a similar organization that we're aiming for with DataDocs and Querybook. And in terms of technical aspects,
[00:22:03] Unknown:
Charlie, if you wanna talk about whether there is anything interesting that you found in the evolution of it. I think the biggest change is the collaborative editing. So when we first designed this tool, we did not think about collaboration at all. It was basically one user and one DataDoc, and we just treated it kind of like a local file. After, like, we started to, like, give everyone in the company access to Querybook, they started to ask, like, oh, can I edit someone else's DataDoc? Can I share this with someone else, so that I can kind of hand it off to someone else? So we really had to think about collaborative editing. This is when we kind of, like, had to rearchitect a lot of the code so that it kind of pushes changes through WebSockets to other people who are also viewing this DataDoc. So that's one of the bigger technical changes here.
And another thing that changed was we added scheduling. So before, we were thinking that Airflow, which is used internally at Pinterest, would be good enough for, like, people's scheduling needs. But then it turned out to be a very common requirement for people to run some queries and then export them to Google Sheets and then view them. And they found that the whole workflow of, like, creating some Airflow jobs is, like, a lot of burden. So we actually added, like, a scheduling feature, and you can actually, like, create a DataDoc schedule in, like, a matter of minutes. And then you can auto-export data to the same Google Sheet every day. Because of the fact that Querybook does have this scheduling capability, and you also have Airflow for running your pipelines, I'm curious how that factors into any sort of
[00:23:41] Unknown:
data lineage tracking or sort of metadata flows that you have within Pinterest, for being able to track execution or tables that are created using Querybook, as compared to other workflows that you might have, where you might use Airflow or maybe dbt for building different data views, and just how you keep track of the work that's going on where?
[00:24:00] Unknown:
So that's actually something the data team at Pinterest is working on right now, to come up with, like, a data lineage solution for every data team at Pinterest. It doesn't really matter if the query is executed from Airflow or is executed from Querybook. They all go into the same engine. And then the engine will log those queries and then push them back into Hive for analysis. So they will just come from different sources, but all the queries will be recorded in the same table.
[00:24:27] Unknown:
And then another thing that's worth digging into is the data cataloging aspects of Querybook, where you mentioned that one of the original purposes of the project was to be able to build out this catalog, to understand what tables you have access to, and maybe see how they're being used, and track that. And I'm curious if you can talk through some of the capabilities that Querybook has, and the way it's being used from a cataloging perspective, and maybe how it compares to other things that have come out in the ecosystem since you first began working on it, such as Amundsen or Apache Atlas?
[00:24:58] Unknown:
As you mentioned, like, when we first started working on this, there was not, like, the proliferation of, like, data cataloging solutions in the open source world that there is now, which is great to have. So I think that, as far as I know, Amundsen was not there, and, like, LinkedIn's DataHub was not there, among other things. So right now, it's very similar to some of these other products. I would say the intersection is very big in terms of the features that are available in each platform. In our case, what we currently have is basic things like the table information, like table descriptions, details, table metadata, information on the metastore, column descriptions, being able to add comments to the columns.
Column type information, of course, and data lineage information. So we support ingestion of this data through an API, but we also generate part of this data from the queries that get run on Querybook itself. We also support showing the source query for a given table, who is using it, and the top queries against a given table, example queries of such a table. Basically, the queries that have been run on Querybook, we surface them and show them in the UI. Warnings and alerts. So for example, if a given table is, like, having a downstream issue or upstream issue and the data is incorrect, we can just flag it. And then the nice thing is that, for example, in your editor or in the DataDoc view, you would get those warnings, or everyone will get those warnings, so that they don't use a table if it's maybe bad for the day. Those are the main features that we have focused on so far.
The main differentiator would be that instead of having to go into a different UI or a different standalone system to see all of this information, you see it integrated into the editor itself. The reason is that if you have this collocated with your current interface, then you're able to do very powerful or interesting stuff like autocompletion and warnings, all of these things that would maybe not be there if it was a different product.
[00:27:01] Unknown:
The difference between Querybook and other analytics tools out there is that most of the ad hoc queries people run at Pinterest run on Querybook. So we have a lot of data internally for analysis, and we can use that analysis to power all those suggestions. For example, if people query this table commonly, then we know that this table is popular, so we can actually boost it up in the search index itself. We can also use people's query information to, for example, auto-create lineages, so then we know which table comes from another. So there's a lot of potential there for automatic query analysis in Querybook.
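The "boost popular tables in the search index" idea can be sketched very simply. This is a hedged illustration, not Querybook's real ranking code: it assumes a query log represented as lists of tables referenced per query, and computes a score multiplier from each table's share of usage.

```python
from collections import Counter

def popularity_boosts(query_log, base_score=1.0):
    """Hypothetical sketch: boost a table's search score by how often it is queried.

    query_log: iterable of per-query table lists, e.g. [["core.pins"], ...]
    """
    counts = Counter(table for tables in query_log for table in tables)
    total = sum(counts.values()) or 1
    # linear boost proportional to the table's share of all table references
    return {table: base_score * (1 + n / total) for table, n in counts.items()}

boosts = popularity_boosts([
    ["core.pins"], ["core.pins"], ["core.users"], ["core.pins", "core.users"],
])
# "core.pins" is referenced more often, so it ends up with the higher boost
```

A real search index would feed a score like this into its ranking function so frequently queried tables surface first.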
[00:27:45] Unknown:
Patrick is a diligent data engineer, probably the best on his team. Yesterday, when trying to optimize the performance of a query running over 20 billion rows, he was so eager to succeed that he read the entire database documentation. He changed the syntax. He changed the schema. He gave it his everything and reduced the response time from 20 minutes down to 5. Today is not a good day. Sarah from business intelligence says 5 minutes is way too long. John, the CFO, is constantly Slacking every living being trying to figure out what caused the business intelligence expenses to grow so high yesterday. Want to become the liberator of data? Firebolt's cloud data warehouse can run complex queries over terabytes and petabytes of data in subseconds with minimum resources.
No more waiting, no more huge expenses, and your hard work finally pays off. Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt today to get started, and the first 25 visitors will receive a free Firebolt t-shirt. In terms of the actual workflow of getting started with building a DataDoc in Querybook, discovering the data assets, working with them, and the collaboration capabilities you mentioned: can you talk through the overall flow of saying, I have this question that I'm trying to answer, I'm going to use Querybook to answer it, the end-to-end flow from question to solution, and then the life cycle of the DataDoc after I've come to a particular conclusion from the question that I've asked?
[00:29:16] Unknown:
Sure. I mean, the overall workflow is very small; it doesn't have a lot of steps. You can go into the website, the UI for it, then you can click on create a new DataDoc. Once you have your DataDoc, you can just start typing. That will be your first query. You may already know the table that you are going to use, maybe because it's a table, or basically a problem space, that your team often works on, so maybe there are, say, 5 or 10 tables that your team often uses. So then you can just start typing, and it will get autocompleted. If you want a reminder, for example, of which columns are in the schema, you can just double-click the table, and you will get its information in a sidebar, similar to the experience of using a real IDE for programming. And then you continue writing your query. You run it once. You get results.
You can modify your query, because maybe you want another dimension or flavor of it, then you run it again. You get the historical results if you want to go back to them. Maybe you're satisfied with that first query. You then create a new query cell below and add your second query. Maybe now you slice and dice by a very different aspect. For example, maybe you first slice by country, then you slice by gender to see if there's a difference among different populations. You may slice it in a third query by a different category, maybe language.
Then once you find the reason or the answer to your question, you can add new cells between those results, maybe text cells that can explain the rationale, like: hey, in this first query we're slicing by country, but we did not find anything meaningful in the behavior of this drop in usage, for example. Then you add a little bit more clarification in the second and third queries, and you can add kind of a final thought: once we sliced it by language, we found that Korean users dropped dramatically, and we then checked with the team, and it's actually because we had a bug in the translation area, for example.
Then once that's done, you can, of course, give a title to your DataDoc if you want to. And then you can just click on share, just like a Google Doc. You'll get a very similar modal to any modern application: mark the DataDoc as private, or, if you want, make it public. If it's public, everyone in the company will see it. If it's private, you can invite some collaborators to it, and you can give them either view or edit access. If you give them edit access, then they'll be able to collaboratively edit, as Charlie mentioned. And if you give them view access, they'll just be able to see it. That's mostly the end of the workflow.
And if you did not know exactly which tables to use, you could also start your journey in the sidebar by searching for tables. You can type something in, for example, engagement or pictures, and then you'll find relevant tables. If you click on them, you'll find more information on each of them in order to more easily narrow down your search to the right tables. It's something that we want to improve on. Basically, right now you can find the tables that you're looking for in Querybook, but there is still a lot of opportunity to find the right table as fast as possible, so it's something that we want to double down on. As we've been discussing, this is a tool that started off internally at Pinterest because you're trying to solve your own needs. And, you know, since then, there's been a proliferation
[00:32:43] Unknown:
of tools that are in adjacent spaces to what you're building with Querybook. Wondering if you can just talk through your reasoning and motivations for open sourcing the project and the technical aspects and organizational aspects that you had to unwind in the process of making this an open source tool?
[00:33:01] Unknown:
In terms of the rationale for it, there are several reasons. Just like any company, we would like to be able to contribute to the open source community. I mean, Pinterest, as any other company, uses a lot of open source tools, and if we can contribute back to the community, that would be a great outcome. The second thing would be, of course, being able to benefit from the input of this community in terms of the direction of the product. That will in turn help Pinterest, because it will end up being a better product.
Lastly, of course, we would love it if the community is able to contribute to the product, maybe features or integrations or even bug fixes. That would be super beneficial to the internal usage at Pinterest. So these three driving factors led us to decide to pursue this route. And of course, we also did an exploration of what was available in open source at that point. We didn't see a clear I mean, there were products that were doing parts of what Querybook does: a lot of products that do data cataloging with querying capabilities, a lot of products that do full-blown slicing and dicing with charting and dashboarding.
And then full-blown heavy-duty Python analysis with Jupyter. But we didn't see exactly the intersection that we're aiming for with Querybook. So that led us to think that it was something useful to open source. That's mostly it from the rationale perspective. Charlie, do you want to talk more about the changes that needed to be done to make it technically possible to open source? So there are a few areas we had to change. One is that
[00:34:44] Unknown:
we hardcoded a lot of things, because we assumed there are only several query engines we use, and they all use the same metastore. That's no longer the case with the open source version of Querybook, where you can add arbitrary query engines and then associate them with an arbitrary metastore or database information. So that's one very large architecture change we had to do. Another area of hardcoding is that we often get requests such as: hey, if you run a Presto query, or if you run this query, then show this message to the user. Or if you want to schedule a DataDoc that has a Presto query or a certain query in it, show this warning. We have a lot of internal requests such as those, and they become much, much harder to do with open source, because we can't put something so specific in there. That's one area we really had to change. So what we did is that we took out all the internal Pinterest code and put it behind the plugin system. Then it can be used internally, but it won't be exposed in the open source code.
The last change we had to do was to make the setup easier. Before, the installation instructions for Querybook internally were all over the place. There were, like, ten steps: you had to install this library, then you had to do this, do that, and then you could start the Querybook process. Then you had to start the worker in a separate process, and you had to do this all manually. We figured that would be really complicated if we open sourced this thing. So what we did is make sure there's only one Docker command that can start the whole thing. You can just run make in Querybook, and it will start the whole demo service. We added a demo setup, so it auto-injects all the demo data so that you have something to try out when you've just cloned it from GitHub.
We also added admin tools so that you can try it out very easily.
[00:36:46] Unknown:
In terms of the extension points that you've had to build in the process of open sourcing it, I'm wondering if you can talk through the opportunities that that opens up and the available interfaces that you've built into Querybook for people to be able to customize it to fit their own needs. So when we built the plugin system, I kind of took the
[00:37:06] Unknown:
Airflow plugin system as a reference, because I saw that Airflow supports a lot of different customizations. So when we built the plugin system, we also wanted something similar. Right now, it's not the perfect model. We want it so that people can actually share their plugins in the future, so then you can add different functionality into Querybook without putting too much code into the main repo. In terms of plugins, we support custom authentication. By default, we provide Okta and Google OAuth and then LDAP auth. But different companies have different authentication systems, and they want to fetch different information from different sources. So that's why we have a custom authentication plugin.
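The shape of a custom authentication plugin can be illustrated with a minimal sketch. The class and method names here are hypothetical, not Querybook's real plugin interface; the point is only that an auth plugin boils down to turning an incoming request into a user identity, so each company can plug in OAuth, LDAP, or an internal SSO proxy.

```python
class AuthPlugin:
    """Hypothetical base class: a plugin maps a request to a username or None."""
    def authenticate(self, request):
        raise NotImplementedError

class HeaderAuth(AuthPlugin):
    """Example plugin that trusts an SSO proxy which sets a header upstream."""
    def authenticate(self, request):
        # request is modeled as a dict of headers for this sketch
        return request.get("X-Forwarded-User")

# a request that came through the (hypothetical) SSO proxy
user = HeaderAuth().authenticate({"X-Forwarded-User": "jmp"})
# an unauthenticated request yields None, and the app would redirect to login
anon = HeaderAuth().authenticate({})
```

Swapping in Okta, Google OAuth, or LDAP then just means writing another subclass with the same `authenticate` contract.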
Another plugin we have is for the query engine and metastore. This is a very heavily used one internally at Pinterest. We want it so that we can fetch analysis from different sources while you're running the query on a query engine. For example, internally, if you run a query, we can actually show you your current memory usage information directly in Querybook, and that's something that's not included in the open source version. Other than that, there are also plugins for exporting. Internally, this is used to export to different internal tools that can further enhance the query results.
We also have frontend plugins. There's a frontend plugin called result transform. What this does is that when you query in Querybook for example, at Pinterest, we have a lot of pins, right? So when you query something that has a column pin ID or image signature, it can translate that pin ID into a link. Then you can just click on that pin ID, and it will redirect you to the pin at Pinterest. For image signature, we can actually find out which image it is by reverse mapping it, so you can actually view the image directly inside Querybook. So those are the different functionalities of plugins.
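The result transform idea, mapping raw cell values in certain columns to richer renderings, can be sketched in a few lines. This is a hedged illustration only: the real plugin is frontend code, and the column name and URL pattern below are assumptions for the example.

```python
# Hypothetical result transform table: column name -> rendering function.
# Columns without a transform pass through unchanged.
TRANSFORMS = {
    "pin_id": lambda v: f"https://www.pinterest.com/pin/{v}/",  # assumed URL shape
}

def transform_row(columns, row):
    """Apply the per-column transform to each cell in a result row."""
    return [TRANSFORMS.get(col, lambda v: v)(val) for col, val in zip(columns, row)]

rendered = transform_row(["pin_id", "saves"], ["12345", 9000])
# the pin_id cell becomes a clickable link; the saves count is untouched
```

The same mechanism extends to the image-signature case: register another transform that resolves the signature to an image URL and renders it inline.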
[00:39:10] Unknown:
And in terms of the life cycle management of the query results, I'm also curious about that, where you mentioned that a lot of them get exported to S3. I'm curious how you manage to avoid too much propagation and data duplication in the stored results, what your thoughts are on expiry of results as a DataDoc ages out or as the query gets executed in different locations, and just the overall question of reducing the amount of extra data that you have to keep around for the purposes of the DataDocs versus the amount of time it takes to regenerate that dataset after it's been created?
[00:39:46] Unknown:
Well, there are several things. Right now, the data in Querybook, at least internally, does not last forever. We have a 180-day expiration policy, so any data on S3 will be deleted after 180 days. Internally, we also have a cleanup job that removes all the queries after 365 days, so that query will be deleted permanently, forever. We do want to have an additional feature that can let users mark some of their query results so that they can keep them for longer. In terms of just cleaning up data on S3 to make sure there's no duplicated data, we actually have a working group right now to detect S3 access to that data and also check the Hive metastore access for the tables, and then we'll remove those tables or data if there's no access.
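The two-tier retention policy described here (180 days for result data, 365 days for query records) can be sketched as a simple sweep. This is an illustrative sketch under those stated numbers, not Querybook's actual cleanup job; the record representation and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

RESULT_TTL = timedelta(days=180)  # S3 result data expires after 180 days
QUERY_TTL = timedelta(days=365)   # query records are removed after 365 days

def sweep(records, now):
    """Partition (kind, created_at) records into keep/delete by age and kind."""
    ttl = {"result": RESULT_TTL, "query": QUERY_TTL}
    keep, delete = [], []
    for kind, created in records:
        (delete if now - created > ttl[kind] else keep).append((kind, created))
    return keep, delete

now = datetime(2021, 6, 1)
keep, delete = sweep(
    [("result", datetime(2021, 1, 1)),   # ~151 days old: within the result TTL
     ("query", datetime(2020, 1, 1)),    # ~517 days old: past the query TTL
     ("query", datetime(2021, 1, 1))],   # well within the query TTL
    now,
)
```

The "keep for longer" feature mentioned above would just add a pinned flag that exempts a record from this sweep.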
[00:40:36] Unknown:
In terms of the ways that Querybook is being used, or your own experience of working with it for building different DataDocs and analyses, I'm curious what you've seen as some of the most interesting or innovative or unexpected ways that it's been employed, either inside Pinterest or elsewhere since you've open sourced it. I can give 2 examples. One is that while we were doing the open source work, we added MySQL functionality
[00:40:57] Unknown:
just so we could see what kinds of engines can actually be supported in Querybook. And then I actually connected a MySQL database that's used by a lot of internal tools for our team. Some of our teammates actually started to use Querybook for MySQL because they figured out that you can share your queries very easily with other people. Also, you can save your alter queries and your create queries, so then you can always go back and see what you have done. So to my surprise, we actually started to use Querybook more for MySQL querying for our internal tools compared to traditional tools like Sequel Pro. In terms of externally, I think Grand Rounds was one of the early adopters of Querybook, and they have added a lot more interesting features around creating tables. What they added was that after you run a query, you can actually create a view or a table on top of the query result, and they added a lot more UI for the table creation. And then it actually has, I think, a GitHub integration that lets you view your changes on GitHub directly afterwards.
[00:42:12] Unknown:
And in terms of your own experience of building the Querybook platform, sharing it within Pinterest, then releasing it as open source and working with the community as it starts to gain adoption, I'm curious: what have been some of the most interesting or unexpected or challenging lessons that you've each learned in the process?
[00:42:30] Unknown:
I think the biggest learning for me was the importance of talking to our users, basically having a very, very tight feedback loop. We interact with them directly, and Charlie and the other engineers working on it also interact directly with our users, so we don't have an intermediary. This has helped us iterate very fast and get a feeling for what users like and what they don't like over time. And we have learned a lot and fixed our own mistakes all the time. The biggest learning was the importance of that, compared to some of our other products where we may not have been that directly involved with our users.
And I think that is what has allowed Querybook to grow very fast within Pinterest in terms of user adoption and in terms of user satisfaction.
[00:43:21] Unknown:
That has been the big learning for me. I also have 2 to share. One is that when we interact with users, I noticed that they will often give very specific instructions on how to solve their pain points, such as: oh, you can add a button here, or you can add this feature here. But we shouldn't really do exactly what we're told; we should think about what their pain points are. It's actually much more useful to gather a lot of users' pain points and then come up with our own design, so then it can actually solve a larger problem and not make the product go in the direction of a specific user. Another lesson I learned is kind of like the law of diminishing returns. We actually wanted to add a lot of granular features into Querybook. For example, I kind of wanted to add a full Presto parser that would give users exact syntax errors as they're typing out the query. I also kind of wanted to work out a much better collaborative editing system for Querybook that can do what Google Docs does, because right now Querybook's limitation is that 2 users cannot edit in the same cell, or else they will overwrite each other's queries. But I noticed that these 2 features, while they would be really great to add, would actually cost a lot of work, and the benefit of having them for users is pretty low compared to the amount of work we would have to do, whereas we could dedicate this time to other features that can actually greatly help users' productivity.
[00:44:54] Unknown:
For people who are looking to be able to do collaborative analysis, or who want to be able to build a notebook-style workflow for integrating queries and prose, I'm wondering what you've seen as the cases where Querybook might be the wrong choice and somebody might be better served with some of the adjacent tools in the space.
[00:45:15] Unknown:
One example of people who would be better served by other tools is if they have a specific thing in mind and they want to create something specific. For example: I want to create a dashboard that I can use on a weekly or daily basis with my team in a given team meeting, and I want to be able to interactively slice and dice those results. Then they would probably be better served by using Superset for that, or any other interactive dashboarding solution. Because Querybook currently is meant for noninteractive use, in the sense that even though you can have fast engines like Presto or Spark SQL or similar, the interface and the usage model are different, so it's not as well suited for dashboarding. That would be one example where you would rather start in another product.
Another example, of course, is if you want to build an analysis and you know that you prefer to use Python to slice and dice your data, and your dataset is actually able to fit in memory in your Jupyter instance, in your kernel; then, of course, you can just start in Jupyter itself. I mean, what some people do is create their queries in Querybook, where they have all the metadata enrichment capabilities, so they draft their queries in Querybook and then move on to Jupyter to do the Python analysis. But still, there may be cases where you want to start right there in a Jupyter notebook or a similar solution.
Another situation where it could be better for you to start in another tool is workflows. Even though Querybook allows for and provides quick and simple scheduling, it's meant to be ad hoc: it's for when you're going to run a DataDoc quickly, on a daily basis, and get some results in an email or a spreadsheet, something basic for people to use. But if you want full-blown scheduling, you should definitely use something like Airflow or anything similar. Those would be the main examples, I would say, where you probably want to use a different solution.
[00:47:16] Unknown:
And in terms of the future work for Querybook, I'm curious what you have planned for the near to medium term or any use cases or capabilities that you're excited to add or enable?
[00:47:27] Unknown:
Our current most focused one is to add more integrations for open source. There have been a lot of asks to support BigQuery, add more data catalog features for different query engines, add more query engines, or add Terraform and Helm charts for continuous deployment. So these things are top of mind right now, so that more users can start using Querybook. In terms of the near future, we're working on expanding the list feature, which is a way to organize your DataDocs and tables into lists and share them with others. That's something we want to work on in the near future. And another thing we're excited to work on is the table creation feature in Querybook, so as to close the loop. Right now, you can do insert, update, and delete in Querybook, but if you want to create a table, you can't really put the data into Querybook and let it create a table for you. And that's something we want to work
[00:48:30] Unknown:
on. Are there any other aspects of the Querybook project, or the overall space of data discovery and data cataloging and querying and notebook interfaces, or the use cases that you're targeting with Querybook, that we didn't discuss yet that you'd like to cover before we close out the show? Some of the areas that I find very interesting, and, like,
[00:48:51] Unknown:
I haven't seen a very good solution, either with what we currently have with Querybook or with other external vendors or open source projects, is finding the right table for people to use. Even though we have many features towards this, and there are many other projects, like Amundsen and so on, that try to tackle the same problem, I think that it's still not as efficient for people as it could be. So it's something where I expect to see a lot of improvements over the next years. That's kind of an area that I'm excited about. How about you, Charlie?
[00:49:22] Unknown:
Yeah. Actually, I get a lot of questions daily, like: how do I find this table, how do I find that table? It's more like tribal knowledge right now. It's very hard for people to find the table they want for some specific data.
[00:49:38] Unknown:
Yeah. It's definitely one of the perennial challenges, and everybody solves it differently. Alright. Well, for anybody who wants to get in touch with either or both of you and follow along with the work that you're doing or contribute to Querybook, I'll have you each add your preferred contact information to the show notes. And as a final question, I would like to get each of your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:50:02] Unknown:
This is probably similar to my previous answer, but I would say in the immediate space, it would be finding the right table for people to use. Basically, there are so many duplicated tables, and so much time wasted by engineers and by everyone in the company trying to find the right information. There are a lot of approaches and a lot of attempts at this, including, of course, things that we do at Querybook, but also things that other products do, like Amundsen or DataHub from LinkedIn and so on. But I would expect this to be very different 3 or 4 years from now, once better solutions or improvements to these products happen. That would save a lot of time for engineers, data scientists, and any employee in the company. So it's something that I'm definitely excited about.
[00:50:46] Unknown:
Other than data discovery, I think another thing is that right now we kind of have support for Spark SQL, which can run on much larger data, and also Presto, which is fast but, since it runs in memory, will only work for a smaller amount of analysis. So I would like to see a query engine that can actually balance the two, where it's fast for smaller queries but doesn't have the memory issue of Presto.
[00:51:16] Unknown:
Alright. Well, thank you both for taking the time today to join me and share the work that you're doing on Querybook. It's definitely a very interesting project, and one that fills an interesting Venn diagram in the space that I have not seen filled before. So I definitely plan to experiment with it myself and keep a close eye on it as it progresses. I appreciate all the time and energy you've put into building it and releasing it to the community. So thank you for that, and thank you for taking the time, and I hope you enjoy the rest of your day. Thank you so much. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Host Welcome
Interview with Justin Mejorada-Pier and Charlie Gu
Overview and Motivation Behind Querybook
Main Use Cases and Unique Features of Querybook
Big Data IDE and Design Choices
Target Users and Personas
Technical Architecture and Evolution
Data Cataloging and Lineage Tracking
Workflow and User Experience
Open Sourcing Querybook
Plugin System and Customization
Interesting Use Cases and Lessons Learned
When Querybook Might Not Be the Right Choice
Future Plans and Enhancements
Final Thoughts and Closing Remarks