Summary
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, moving that work into a standard production environment is difficult because of the translation effort involved. At Netflix they had the crazy idea that perhaps that last step isn't necessary, and that production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers tasked with building the tools and practices that allow the various data-oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges it has posed, the development that has been done to make it work, and the benefits it provides to the Netflix data platform teams.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
- Where are you using notebooks and where are you not?
- What is the technical infrastructure that you have built to support that design choice?
- Which team was driving the effort?
- Was it difficult to get buy in across teams?
- How much shared code have you been able to consolidate or reuse across teams/roles?
- Have you investigated the use of any of the other notebook platforms for similar workflows?
- What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
- What are some of the limitations of the notebook environment for the work that you are doing?
- What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
- What are some of the projects that are ongoing or planned for the future that you are most excited by?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Netflix Notebook Blog Posts
- Nteract Tooling
- OpenGov
- Project Jupyter
- Zeppelin Notebooks
- Papermill
- Titus
- Commuter
- Scala
- Python
- R
- Emacs
- NBDime
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat.
Your host is Tobias Macey. And today, I'm interviewing Matthew Seal about the ways that Netflix is using Jupyter Notebooks to bridge the gap between data roles. So, Matthew, could you start by introducing yourself? Yeah. So, I work here at Netflix.
[00:01:00] Unknown:
I'm on the data platform team. In particular, I've been working on orchestration problems and orchestrating with notebooks. Before Netflix, I was at OpenGov for a long time, and I actually started out my career in data science. Then I gradually ended up doing all these data things to make that work, transitioning into building the data pipelines and then architecting the systems around them. That's where I've been for the past 5 years.
[00:01:28] Unknown:
And did you have any particular training before you got thrown into the deep end of managing the different aspects of the data pipeline, aside from your data science work? Or is it something that you ended up having to learn by doing as you went? Yeah. I kind of ended up learning by doing.
[00:01:44] Unknown:
I was in the startup space for a while after school. In school, I was going to do robotics, and back then I thought it was going to be a big booming field; it's still not quite where I expected at the time. So I did a mix of electrical engineering and computer science, and I was focusing a lot on AI and machine learning. Then when I got into working at startups, it ended up being a lot of "there's no one building a good pipeline for you, so you gotta go figure it out." I just slowly ended up more and more in the data realm, and got into big data in particular on Spark and some of the systems there. And so as I mentioned at the open, you've recently been working on consolidating a lot of your workflows for the different data roles at Netflix around Jupyter Notebooks. So can you give a bit of an outline of the motivation for choosing Jupyter Notebooks in particular as the core interface for your data teams? Yeah. Actually, when I first joined Netflix, it was just sort of tossed at me, and I was definitely like, are we crazy? And the answer was, we might be a little crazy. At the onset of hearing that, it sounds like, why would you do that, right? These things are highly iterative, so why would you rely on them for a production pipeline? But actually, if you dig in and add some more abstractions in that space, it starts to make a lot of sense. Where we're pulling it in and consolidating is that we're trying to converge many different scheduling systems at Netflix into some common systems.
And in the process, we want to give a more white-glove experience for less technical users, especially users who know enough scripting to write some SQL or read a little bit of code, but not so much that they could design their own pipeline or dig through logs to find references to other code bases. So a lot of the step into notebooks is that it was looked at as a strategic bet on what might be a good future shared interface between analysts, data engineers, and professional coders, as well as machine learning developers, that could all represent the same executed code with closely coupled logs and good visualizations in the same place. It becomes this artifact that you can share with a lot of different user groups, common and easy to transport.
[00:03:59] Unknown:
And what are some of the places that you have not yet deployed notebooks to, or that you are consciously deciding not to implement using that notebook interface? And what are some of the other areas where you are pushing notebooks into common use, and what are the differences that denote when a notebook is a good fit versus not?
[00:04:28] Unknown:
Yeah, for sure. I'll maybe break that into some pieces, because I think there are a lot of interesting topics in there. One thing we've done at Netflix is that we haven't taken a really heavy-handed approach, in the sense of forcing all development to be in notebooks, notebooks or nothing else. It's more an emphasis that, with the way we're doing things around scheduling notebooks in particular, a lot of users don't know they're using notebooks yet, so they're getting gradually introduced to them. We're not forcing them to rewrite scripts that have been sitting around for 4 years, or SQL that's been there since the table existed. What ends up happening is that we're migrating them onto new systems.
The thing that's actually executing the workload they're running is a notebook; you can think of it almost like a job template. For users that haven't been familiar with notebooks, it gets them to the point where the outcome of the work that ran is a notebook. So if you want to debug it, if you want to see what ran, if you want to rerun it, you basically always end up at the same notebook interface. The place we're introducing this a lot, and kind of pushing it, is really as an integration tool. It's a great place to express how to connect a few pieces and run something in a linear fashion. And from the other side, in a lot of model development, machine learning developers work in notebooks, and in particular Jupyter or Zeppelin notebooks are probably the most common, at least here. They'll do some iterating cycles, and when they're done, it's hard for them to productionize that work: to rip the code out of a notebook and put it someplace else. So the places where we emphasize it are, first, if you have a notebook and it runs well and it's something that's easy for you to understand, why rewrite the whole thing and risk failure in transferring that knowledge? And on the other side, introducing people to the idea that this is a way to express how to execute a controlled context of code that you want to fulfill, without needing to write all that code in a notebook initially. In the middle, where we're not pushing this really hard, is your everyday professional: a data engineer who knows exactly what they're doing, or a backend developer that's maintaining systems. They may not touch the notebooks as much directly, and we don't necessarily emphasize it. But in a few places where they're doing one-off tasks, say some janitorial cleanup of old data where they're trying to keep track of what they deleted, it's a really good form factor that we try to introduce and emphasize: the ability to see what every run did and have nice visualizations of the actions you took.
[00:06:57] Unknown:
And having that embeddable prose as rich metadata that can live along with the code can be very useful as well, particularly since that's one of the holy grails of data infrastructure in general: being able to have that extra information about the whys and wherefores of the different processes and the different schemas, etcetera. At the same time, it can be difficult to surface that in any sort of automated fashion, so I imagine it adds a bit of cognitive overhead and manual process to be able to take advantage of that extra information.
[00:07:32] Unknown:
To a degree. I think in a really controlled context, where we're scheduling work on behalf of a user or we have some sort of operational task, it's not much overhead to introduce a notebook into that equation. As a matter of fact, the project I've been working on a lot this past year has been using a library called Papermill to execute notebooks on demand. What that has enabled is a whole new scheduler where everything is built around executing notebooks. We can still execute other types of work, and we do a few times, but at this point about 95% of the work executing on that scheduler is notebooks. And it's actually been pretty easy to adopt that way. Where the notebook really shines in that case is that it almost becomes a report of what you did, and it's got high reproducibility.
We don't rely on environment variables very much in these notebooks. We rely very heavily on the parameterization capabilities that Papermill provides, so all of the context about what you're running gets injected as code into the notebook. Then when you're running it, you could run it anywhere and it would run exactly the same way. What this means is that we actually enable a lot more self-service for users to debug their own issues, because they always know how to run the thing. They don't have to set up some special environment or find out where it ran and what it pulled in. You just take that notebook and run it in one of the notebook servers, and you're good to go. Another follow-up to that is that in some cases, users are starting to use these notebooks as reports as well. You could imagine, with some of the nice visualizations that nteract or other notebook environments provide, you can pull together a bunch of data and render a data frame or some other type of object in a very clean fashion, including iteratively. We're building a lot of tools to make that more of the picture, because we're betting that interactive documents that can re-execute code are where reporting is headed, more so than even, say, a Tableau system.
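For readers who haven't used Papermill, here is a minimal sketch of the execution model he's describing. The notebook paths and parameter names are hypothetical, but `execute_notebook` is Papermill's documented entry point.

```python
# A minimal Papermill run: read an input notebook, inject the parameters as a
# code cell, execute it top to bottom, and write the executed copy to a new
# output path that serves as a record of the run. Paths and names are hypothetical.
import papermill as pm

pm.execute_notebook(
    "templates/report.ipynb",                      # reviewed input template
    "s3://notebook-runs/report/2018-11-27.ipynb",  # immutable record of this run
    parameters={"target_date": "2018-11-27", "region": "us-east-1"},
)
```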
[00:09:35] Unknown:
And Jupyter Notebooks have the capacity for a lot of different plugins that provide additional UI elements or display capabilities. The code being executed in them will also have different requirements for library or application dependencies. So I'm wondering how you're managing those extra requirements as part of the overall system for deploying and executing these notebooks in your different environments.
[00:10:05] Unknown:
Yeah. So in our case, the open source community has a really good spec on what a Jupyter notebook is, or a Zeppelin notebook if you're using that ecosystem, in the sense that the format of the object being shared is, at the end of the day, just a JSON document with a well-defined schema. And how it actually executes, and the assumptions made therein, are built in a way that makes it easy to build lots of tools on top of those specifications. For the case of running notebooks in different environments, there's definitely a gap there today. But for us, we have our own isolated, kind of shared container that has 95% of the dependencies that we need, and we keep it updated constantly. The average user just goes and uses this container; it generally has everything they would want to run in there, and it's a couple of gigs. We launch all of our big data platform services on top of the same image, so you get a very consistent execution experience. When you're not in one of those cases, it gets a little trickier. There, we'll do things like specify the requirements into an image, or into a baked AMI that you're going to run on top of instead. But that requires a lot more forethought from the users, so it's generally more of an advanced case. Ideally, we want to keep iterating on how you'd express these things and, in the long term, really bake into the Jupyter ecosystem a better way of expressing your dependencies or requirements in a uniform manner.
[00:11:46] Unknown:
And so that brings me to the question of what you have built in terms of the broader technical infrastructure to be able to support this workflow of using Jupyter Notebooks for scheduling and executing various tasks across the different roles at Netflix?
[00:12:03] Unknown:
Yeah. We've actually had blog posts and some other talks about the infrastructure here, but the reality of that picture is that we're really reusing a lot of the infrastructure that's already in our ecosystem. For the scheduling tool, we're reusing an existing scheduler and just extending it with this concept of translating all the parameters and environment context into Papermill arguments so that everything passes through properly. And on the visual representation side of what actually executed and when, a lot of that ends up being baked into our internal portal tools that are already there for displaying other types of work.
So really, a lot of the infrastructure work on our platform has just been extending each of those little platform pieces so that they're aware of how to link over to a rendering of a notebook, and how to link that rendered notebook back into the system. The new infrastructure we've had to deploy is really around the ability to run dedicated and ad hoc Jupyter servers, so that we can have servers up that are live and iterative. And then we've also put up another nteract open source tool, Commuter, which is a good read-only interface for notebooks. A bunch of our assets all live in S3 and are served through the Commuter interface in a kind of "what happened" mode: you can go look at any notebook without accidentally editing or manipulating the source of truth. I'm trying to think what other infrastructure we really had to bring up. Oh, the other infrastructure is actually more on the library side, in the sense that we want to support multiple types of kernels.
We've been contributing more and more in the Scala space; we're trying to get the community to unite around some Scala kernel improvements and make that world better. But most of our stuff going down the notebook path is running on Python or R, and there the kernels are pretty well defined already, though we have some extensions for connecting to specific platform capabilities. Those were the types of things we had to actually focus on building, and the rest of the diagram kind of fits like any other tool you would use. We're using S3 as our store for a lot of things, and a lot of our other tools already understand the data that's there and how it's represented. So at the end of the day, the infrastructure we had to add mattered less than the abstraction concept that we're communicating and expressing to users.
[00:14:33] Unknown:
And you said that when you first started at Netflix, there was already some discussion and motivation for going down this path of focusing a lot of your use cases and workflows around Jupyter Notebooks. So I'm wondering what was the driving force behind that effort? Was there a particular team that had a specific pain that they were trying to address? And once you did have some momentum behind the implementation, was there any difficulty in getting other teams to buy into this new technical platform?
[00:15:07] Unknown:
Yeah. That's a really good question. When I joined, the big data platform team had sort of already bought into this. There's this great memo that went around, labeled "notebooks everywhere," and it asked the question of why not use a notebook for some of the work we're doing. Some of the answers there were, well, maybe the editor isn't as nice as my local editor; that's something you could solve or improve, and it's usually very minor things like linting additions to your interface. Or there might be something like, hey, it's hard to test a notebook. At the time, that was pretty true, and I think the story has gotten better there. But in general, at the end of the day, we have a lot of machine learning work done here, and a lot of the iterative development is in notebooks already. So it was about looking at what would be a shared interface that different teams could all use, and could all agree was a reasonable document to look at and interact with. The big motivating factor was actually looking forward to what the employee base is going to look like in 2 or 3 years, and what the distribution will be. The distribution is not going to be what we have today or had last year, which was more focused on the data engineering side. Netflix has hired a lot more analysts and content-focused developers and people, and there the technical skill set is different and the expectation around interfaces is different. We wanted to emphasize a contract format that was much more shareable, but still familiar to players already in the space that Netflix occupies. So, stepping back, what motivated this was that the tools, and the way we showed how things are run and what was happening, were completely alien to a lot of our new, fastest-growing teams.
[00:16:57] Unknown:
And once you had that unification of the Jupyter Notebook as a common means for different teams to build and execute their different projects, did you find that there was an opportunity for consolidating or sharing a lot of existing work between those different roles?
[00:17:17] Unknown:
Actually, I think this leads into the part I didn't answer in the last question, which was whether there was any resistance to adopting notebooks, and I think that plays into how well it has been adopted. We're still actually decently early in that story at Netflix, in the sense that we have infrastructure that's now making this usable for everyone, but I wouldn't say the majority of people go to a notebook in their everyday development. I would say that a large percentage of users are now familiar with notebooks as the way to understand what happened in the platform, or at least are becoming familiar. And the initial resistance was not to the idea of the notebook; it was to the idea of how you develop with a notebook, or iterate on notebooks.
Where this has hit a lot of confusion and pain for users who haven't gotten deeper into the space is the reaction of: well, I use the notebook as a scratch space; why would I ever want to put that anywhere near a production system? And that's still the first thing people think about, notebooks as just a scratch space tool. But the reality is that because it has this shared visualization, with logs of what happened alongside the code that executed, it actually makes a very compelling integration format where you can have a consolidated view of what happened, when, and how, and visualize the outcomes all in one place.
As long as you keep your notebook very concise, it helps in this problem space. So for teams adopting it and people coming on board with consolidating to notebooks, I would say that, operationally, a lot of the big data platform's new operational concerns have all been notebook-based. We have a lot of tasks that look something like: poll for anything that's reached a certain state, pull those out of some system, and then send an email to someone saying these are the things that have violated some SLA, or these are the tables we're going to clean up in 30 days because no one's using them or they're past their TTL.
These types of operations are really handy. They tend to be a page of code or less, and they fit very nicely in a notebook where, at the end, you can print out or visualize a data frame of all the objects you're going to notify against and all the people you're going to notify, consolidated all together. That's been pretty successful for operational usage. The other place where there's been good consolidation is machine learning and exploration. While those teams have used notebooks a lot, the ability to schedule a notebook directly has been a light bulb moment of, oh, maybe we should rethink how we use notebooks. And there's been a lot of consolidation in the tooling here around either directly using notebooks or indirectly using notebooks through the scheduling component.
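To make the shape of those operational notebooks concrete, here is a rough sketch of the pattern he describes. `query_table_metadata` and `send_email` are hypothetical stand-ins for internal helpers, not real Netflix APIs.

```python
# A one-page operational notebook, sketched as cells: find tables past their
# TTL, render them as a data frame (which stays in the executed notebook as a
# visual record), then notify the owners.
import pandas as pd

# Hypothetical helper that returns rows from a metadata catalog.
expired = pd.DataFrame(
    query_table_metadata("SELECT table_name, owner FROM catalog WHERE ttl < now()")
)

# As the last expression of a cell, this renders the frame into the output
# notebook, leaving an audit trail of exactly what was flagged on this run.
expired

# Hypothetical notification helper.
for owner, tables in expired.groupby("owner"):
    send_email(
        to=owner,
        subject="Tables scheduled for cleanup in 30 days",
        body=tables.to_string(),
    )
```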
[00:20:06] Unknown:
And you've mentioned the Zeppelin notebooks a couple of times in addition to Jupyter, and I know that there are also one or two other notebook projects that have been gaining some measure of popularity. So I'm wondering if there has been much effort to incorporate some of the workflows that you've built around Jupyter Notebooks into those other notebooks, or if there are any features present in them that aren't present in Jupyter that you would like to see brought into the Jupyter ecosystem.
[00:20:42] Unknown:
Yeah. There are definitely a lot of really nice things in the Zeppelin space, but it's also more constraining from a development point of view in the sense that it's kind of one bundled package. Your notebook is more inclusive of everything that's happening, which is a good and a bad thing, and it's more of a design preference to choose one thing and do it really well. We have Zeppelin exposed because we give users options in what they want to do. We're heading more into the Jupyter space because we're building tools around it, and we have expertise and people here that are part of that community, so it made sense to lean into what we already know and what we're good at. In terms of things that could come over that are really nice, a lot of the widgets and extensions that live in the Zeppelin space are definitely something people want in the Jupyter Notebooks that we haven't made fully available, or that are only available under certain execution contexts. Maybe there's a JupyterLab extension, but there's not something for nteract, or Jupyter classic can't run something that another extension tool could. A lot of those are quality-of-life features of the notebook editing experience.
[00:21:51] Unknown:
And as you mentioned before, Jupyter Notebooks are often seen as temporary or throwaway, and potentially conducive to development anti-patterns. So I'm curious, what are some of the anti-patterns that you have encountered in notebooks that people have been building, and some of the conventions or tooling that you've established to help discourage those practices and encourage more reusable and robust methods of leveraging these notebooks for your production contexts?
[00:22:25] Unknown:
Yeah. This is actually the aspect of the notebook space I'm most excited about, because I feel like this is where the world is turning a bit, where things that we didn't think of as possible are coming to light. I do think the use case of a notebook as a scratch space, to try out something iteratively with partial execution of your whole workflow, is a really awesome tool, and I don't think that's something you should turn around and schedule. But it actually follows a lot of the same anti-patterns and arguments you'd have about any snippet of code. I may have some script that does one task, and does it once, but I may not want to schedule it immediately. It's the same with a Jupyter notebook or any other notebook interface: productionization of your code still has to happen, and it's easy to skip that productionization when your notebook feels like a reproducible artifact that you don't need to touch again.
So there are several tasks you should definitely do in notebooks around productionization that you would do for scripts or other contexts: move shared code into libraries, keep what you're executing simple and straightforward, and remove things like hard-coded attributes that you used just to get one run through, or local files that aren't necessarily going to be available elsewhere. It's the same kind of process you go through when you're putting code into production and doing code review of it. I will say, the anti-pattern we emphasize avoiding at a high level comes from the fact that unit testing notebooks is relatively difficult, but integration testing notebooks is pretty easy. You want to treat the notebook like any integrated component and keep its branching factor very low, so that a pass through your notebook is consistent and reliably tells you that this notebook is going to execute successfully.
One thing we've started doing, and want to make more efforts on, is that if you look at a notebook you want to schedule, you could easily do some quick static analysis to say, hey, how good of a notebook is this, much like you have linting rules in other systems. That's somewhere we want to improve. Generally, from a code review point of view, if a notebook comes in with many, many conditionals that branch in different ways, that's a good sign you should take the code in that cell, put it in a library that's unit tested, and call that library from your notebook. It's a little hard to enforce these rules today, but instead of putting hard blocks on what you can do, which is counter to Netflix's development philosophy anyway, we want to encourage quality: encourage things that are good, give good examples of what should be done, and maybe give suggestions on how to improve what's already there. To that end, we have a kind of portal of all the notebooks in the ecosystem, which has a surprising number of notebooks in it. I don't remember the number off the top of my head, but it's tens of thousands of custom notebooks people have made. There, we want to surface the good ones: the ones that are shareable or reusable, or the ones that are scheduled with good inputs. But in terms of anti-patterns, keeping the branching factor low and keeping it simple is probably the most important thing to do.
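One way to picture the integration-testing property he mentions: execute the whole notebook inside a test and let any failing cell fail the test. This is only a sketch; the notebook path and parameters are hypothetical, though the Papermill call itself is its standard API.

```python
# Integration-testing a notebook end to end with Papermill under pytest.
# If any cell raises, Papermill surfaces it as an execution error and the
# test fails. Notebook path and parameters are hypothetical.
import papermill as pm

def test_cleanup_notebook_runs_end_to_end(tmp_path):
    pm.execute_notebook(
        "templates/cleanup_job.ipynb",
        str(tmp_path / "out.ipynb"),
        parameters={"target_date": "2018-11-27", "dry_run": True},
    )
```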
[00:25:40] Unknown:
And I know that it's possible to import one notebook into another. So I'm curious if there has been any discussion about the pros or cons of that approach in terms of making the individual notebooks modular components for building up larger workflows.
[00:25:59] Unknown:
Yeah. That's definitely a space where, when I was using notebooks maybe 5 years ago, when I first got introduced to the notebook space (I think it might have been Jupyter notebooks back then, but I was also using a few other notebook platforms), there was an emphasis on the idea that notebooks importing notebooks would be the direction things would go: notebooks as a library. And that hasn't played out all that well, because you end up with snippets that you can't fully trust to do exactly what you expect over time. I think there could be a story where that becomes better. But generally, here, we haven't been encouraging that type of pattern. It's been more that if you want to import something that's shared, it's really easy to make a library within your organization that can house that code in a more consistent manner. This is really the question of whether notebooks should be treated as libraries, or as integrations and scratch spaces, and here we've definitely leaned toward it probably not being good to treat a notebook as a library. That may not be true in all contexts, and for certain teams that might be a good way to go if they want to build tooling around supporting it. But the tooling today doesn't facilitate a lot of happy outcomes from having notebooks as libraries without putting a lot of extra effort in on your side to make sure you can trust that the code is going to do what you want.
[00:27:24] Unknown:
And are there any particular limitations of the notebook environment that you have encountered that you would like to see addressed?
[00:27:32] Unknown:
And conversely, what are some of the biggest strengths of the notebook environment that you have enjoyed working with? I would say, especially in a lot of the popular notebook frameworks (it's not limited to one of them), there's definitely a lot of room for improvement in the abstraction of the storage mechanism and the syncing of your data. In the sense that with Jupyter Notebooks, when you save, you're saving the whole JSON document each time, and that can be really problematic when that document gets big, or if you have multiple concurrent edits that you want to happen. It leads to blocking conditions that can be annoying to extend into other use cases. That's a known trade-off, but there are definitely things that could be done to improve it. The other space where notebooks could do better is around editing tools. Today, if I load up my favorite IDE, or any IDE really, I get a lot of integration with default linters, default code searchers, and all sorts of other handy things, down to the point of Emacs extensions that can do these things.
In notebooks, I feel like a lot of those haven't developed as much. I think Zeppelin is probably a little further along than Jupyter in that space, and there are some niceties there. So I'd say quality-of-life improvements are definitely where a little bit of iteration by the open source community is going to go a long way, and I think they are heading in that direction. The other space where notebooks could be a little better is integration with source control systems, because the notebook at the end of the day is a big JSON document, and JSON saved into Git, for example, isn't the prettiest thing to look at in most diff tools. There have been some really neat things like nbdime, an open source tool for diffing notebooks that works really well on GitHub and works well for local diffing. But there's still a lot of room for improvement on source control with notebooks, and I think it's very doable with a few more improvements in the tooling space.
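For those who haven't seen it, nbdime exposes both command line tools (`nbdiff`, `nbdiff-web`, plus git integration via `nbdime config-git --enable`) and a Python API. A minimal sketch of the latter, with hypothetical file names:

```python
# Content-aware diffing of two notebooks with nbdime's Python API, instead of
# treating them as opaque JSON blobs. File names are hypothetical.
import nbformat
from nbdime import diff_notebooks

before = nbformat.read("analysis_before.ipynb", as_version=4)
after = nbformat.read("analysis_after.ipynb", as_version=4)

for change in diff_notebooks(before, after):
    print(change)  # structured, cell-level diff entries rather than raw JSON noise
```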
[00:29:30] Unknown:
Yeah. I was going to specifically ask about that, because I know from my own experience, and from talking to other people, that's been one of the biggest pain points when working with notebooks as a collaborative environment, given how common it is to use code review as a means of preventing obvious mistakes and sharing the discussion about approaches, design choices, and whether a given piece of code meets the intended requirements. So having that additional difficulty in the common workflows, by virtue of using a notebook, seems like it would be a potential pain point as you try to scale out to larger groups of people and make it more of a broad use case and collaborative effort.
[00:30:22] Unknown:
There are definitely places where we don't use all the tools available out there today to help in that space, and we're starting to pull some of those in, because we've been keeping it contained to the smaller group that's been doing maybe the most shared notebooks. There, we're okay; we know how to get around the ugliness. But as we expand to users who want to make their own notebooks and check them into version control outside of the big data platform group, it's definitely something we're going to have to focus on, helping pull in or improve other open source tools in that space. I will say there have been some really neat things even since the last time I looked around, like some git hooks that help clean up your notebook before you commit.
There are some tools that do nice things there, and some of the diffing tools that can integrate with git are getting better and neater. So I think we're around the corner from having that be a really nice experience.
[00:31:13] Unknown:
And in terms of having notebooks used in production contexts for different data workflows, what have you found to be some of the most challenging aspects of building those workflows and platforms, and what are some of the unexpected lessons that you have learned in the process of building these different capabilities?
[00:31:38] Unknown:
Yeah. I'd say that where some of the challenge comes up is making the decision about when my notebook is becoming too much like library source. How many code cells with custom functions does it take before it justifies putting this in a library, especially when you're in a new context where you don't necessarily have a backing library to support it yet? When there's already a library you're importing to handle some of the complexity you're working with, say you're sending an alert and you have an alerting library that handles a lot of the cruft for you, it's easy to justify when something should go in that library. When I have a new type of data query that has to do some complex check against two systems to come up with an answer, where you choose to put that into a library is definitely a hard choice, and it's very project dependent. And it's much like other systems: when do I promote the script that's just sitting in S3, that I'm running, to an actual library?
It's kind of scary how many very professional companies rely on some scripts sitting someplace that are maybe backed by source control, and it's the same type of problem: even if my notebook was checked into source control, when do I make that trade-off of spending more time to break it into smaller pieces? Of the things we've learned that are really valuable, I would say the abstraction we introduced with Papermill, where we isolate input notebooks from output notebooks, really changes the paradigm of risk and worry that came initially with notebooks.
It helps a lot because, in that paradigm shift, your output path is independent of your input path and you get a record of each execution, so it's very easy to go back and roll back to what happened on day X. You just go to an S3 path, pull down that notebook, and see what it did. That's been really, really useful as a tool. As a matter of fact, we even have people who weren't familiar with notebooks at all, even managers who were on call because their team was on vacation, saying: something went wrong, I clicked into what went wrong, it sent me to this notebook link, I saw immediately what it was, and I just started editing it and fixed it right away. That's been a rewarding turnaround on the bet we made that notebooks would make those types of operations easier.
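That rollback story follows directly from the input/output isolation: every run's executed notebook, parameters included, sits at its own output path. A sketch of what re-running a past execution for debugging could look like, with hypothetical paths:

```python
# Because each scheduled run writes its executed notebook (with parameters
# already injected as code) to its own output path, reproducing "what happened
# on day X" is just re-executing that output notebook. Paths are hypothetical.
import papermill as pm

pm.execute_notebook(
    "s3://notebook-runs/cleanup_job/2018-11-20.ipynb",  # record of the past run
    "/tmp/debug_rerun.ipynb",                           # local scratch output
)
```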
[00:34:10] Unknown:
And looking forward, what are some of the projects that you're either currently working on or have planned for the future that you're most excited by, and that you think will be most beneficial to your work and the data teams at Netflix?
[00:34:13] Unknown:
Yeah. So I've been living in the scheduler world now for about a year, and I think some of the exciting things are actually adjacent to that scheduler, in the sense of a lot of the tooling around it. I've been working a lot on the Papermill project in particular, and there are a lot of really fun conversations that are starting to bubble into actual new things. Like, how do you record data from a notebook in a consistent and efficient way? That's not a story that's really been solved; oftentimes you use some other tool to save data someplace. But what if that data frame you built in your notebook was really easy to pass along as a first-class object? To explore that space, we started a new little repo called scrapbook, inside nteract, hence a theme of nteract side projects. And I think that space, along with changing how notebook stores work and making intermediate layers that help with collaboration and consistent identification of what's happening in the notebook, down to a cell level, will enable a lot of use cases that weren't really possible before. In particular, you're going to see more things like a notebook as an outcome report that can have collaboration and real-time updated visualizations without heavy load on the system. Those are going to be really interesting times, when the thing that generates your report is the same thing that is your report, just a different view of the same object. In my world, that's going to reduce a lot of the operational overhead of transferring information to multiple other systems that need to interpret it in the same way. To clarify how to think about that: imagine I'm pulling data from a big data system. I've just got rows of data that are really hard to look at, but I have users that need to answer some very fundamental business questions. The traditional path is you take that data, you have a data engineer make the query, the engineer outputs that outcome to some other place, maybe aggregates it, and then either they or some other analyst takes that and tries to gain some insights.
They gain those insights by making a Tableau report, or exporting to Google Sheets and then showing that to someone. There are so many different forms and mediums for exposing the information that's being produced by literally thousands of people. Simplifying and unifying a lot of those components, so that there's less overhead in translating it successfully and trusting it, is going to let those 1,000 or 2,000 people do a lot more work with less effort.
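The scrapbook project he mentions targets exactly that hand-off: letting a value recorded in one notebook be read back by downstream code as a first-class object. A sketch of the usage pattern the project documents follows; the paths are hypothetical, and the API may have evolved since this episode.

```python
# scrapbook's core pattern: a notebook "glues" a value into its own output,
# and another process reads it back from the executed copy. Paths are
# hypothetical.
import scrapbook as sb

# Inside the executed notebook:
sb.glue("row_count", 42)

# Later, from another process, against the executed output notebook:
nb = sb.read_notebook("s3://notebook-runs/report/2018-11-27.ipynb")
print(nb.scraps["row_count"].data)  # -> 42
```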
[00:36:52] Unknown:
And are there any other aspects of the work that you've been doing with the notebook interfaces and the productionizing of these workflows that we didn't discuss yet that you think we should cover before we close out the show? Let's see.
[00:37:07] Unknown:
I do think that, in general, there are a few more key points on how the scheduler and Papermill pass parameters into a notebook that are kind of interesting, so maybe we could spend a minute on that. In the scheduler realm, when we're using Papermill to execute a notebook, we basically take all these different parameters; you can think of them as anything that's JSON- or YAML-expressible, really simple configuration values. Those actually become code in the notebook, and that's a unique approach compared to how parameterizing notebooks had been done in the past. When I joined, that project had started, and it was doing that and recording basic data on the notebook. It was really interesting how that translation into code, code you can see in whatever language your kernel is running, makes it so much easier for people to understand how something got configured. In particular, because it just becomes code, you don't have to know that it came from some parent system that saved a bunch of environment variables or something. Instead, you just tell it where to inject all the information, and then provide any defaults you want. You can have a local run that you're iterating on with the defaults, and then independently, when it actually gets scheduled, it's very easy to see how the incoming attributes overrode them. That's been very handy for debugging and understanding how the system actually works, and I think it has been a kind of game changer for making notebooks usable in this setting.
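Inside the notebook, that mechanism looks roughly like the following: the template carries a cell tagged `parameters` with defaults, and Papermill injects a new cell after it holding the run's actual values, so the configuration record is plain, readable code. The variable names and values here are hypothetical.

```python
# --- cell tagged "parameters" in the template: defaults for local iteration ---
target_date = "2018-01-01"
dry_run = True

# --- cell injected by Papermill at execution time, overriding the defaults ---
# This cell appears in the output notebook as ordinary code, so anyone reading
# the run can see exactly how it was configured.
target_date = "2018-11-27"
dry_run = False
```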
Another spot I haven't focused on as much, maybe because I'm more focused on the backend and scheduling concepts around notebooks: there's been a lot of really interesting work being done on the nteract project, and in others like JupyterLab and other notebook interfaces, on making better and better front ends to enable really clean and nice reporting and visualization tools. I definitely encourage people to go check out some of these open source communities for what's happened in the past year or two, because I think the world is changing a lot in the visual representation space of notebooks. Alright. Well, thank you very much for that.
[00:39:08] Unknown:
For anybody who wants to follow the work you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:39:24] Unknown:
Yeah. For data management, I would say one of the biggest gaps is the inability for non-experts to get involved and self-service themselves on problems. I see this a lot: the bigger the big data platform is, the more you lean on specialists to understand why something isn't working and how to remediate it. That puts a lot of pressure on having very highly skilled data management individuals, a lot of them, doing their jobs well, for you to be successful as a company. I think a lot of the data experience world is trying to, or should be trying to, lean more into making it easier for these upstream users to be successful without needing as much expertise. Some of the tooling has made that better, with easier understanding of failure modes, but I think there's a lot of room for improvement there in the big data world. Maybe notebooks will be part of that story, but I think there are a lot of other aspects to it too. Alright. Well, thank you very much for taking the time today to discuss the work that you've been doing with notebooks at Netflix.
[00:40:34] Unknown:
It's definitely a very interesting project, and I appreciate the detail that you've all put into the blog posts around it. We'll definitely have links to those in the show notes. So thank you again for that, and I hope you enjoy the rest of your day. Thank you for hosting me.
Introduction and Guest Background
Motivation for Using Jupyter Notebooks at Netflix
Deployment and Integration of Notebooks
Operational Use and Benefits of Notebooks
Technical Infrastructure Supporting Notebooks
Adoption and Resistance to Notebooks
Comparing Jupyter and Zeppelin Notebooks
Anti-Patterns and Best Practices
Limitations and Strengths of Notebooks
Challenges and Lessons Learned
Future Projects and Innovations
Parameterization and Scheduling with Papermill
Final Thoughts and Contact Information