Summary
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, moving that work into a standard production environment is difficult because of the translation effort involved. At Netflix they had the crazy idea that perhaps that last step isn't necessary, and that production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers tasked with building the tools and practices that allow the various data-oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges it has posed, the development that has been done to make it work, and the benefits it provides to the Netflix data platform teams.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
- Where are you using notebooks and where are you not?
- What is the technical infrastructure that you have built to support that design choice?
- Which team was driving the effort?
- Was it difficult to get buy in across teams?
- How much shared code have you been able to consolidate or reuse across teams/roles?
- Have you investigated the use of any of the other notebook platforms for similar workflows?
- What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
- What are some of the limitations of the notebook environment for the work that you are doing?
- What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
- What are some of the projects that are ongoing or planned for the future that you are most excited by?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Netflix Notebook Blog Posts
- Nteract Tooling
- OpenGov
- Project Jupyter
- Zeppelin Notebooks
- Papermill
- Titus
- Commuter
- Scala
- Python
- R
- Emacs
- NBDime
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch, and join the discussion at dataengineeringpodcast.com/chat.
Your host is Tobias Macey. And today, I'm interviewing Matthew Seal about the ways that Netflix is using Jupyter Notebooks to bridge the gap between data roles. So, Matthew, could you start by introducing yourself? Yeah. So, I work here at Netflix.
[00:01:00] Unknown:
I'm on the data platform team. In particular, I've been working on orchestration problems and orchestrating with notebooks. Before Netflix, I was at OpenGov for a long time, and I actually started out my career in data science. Then I gradually ended up doing all these data things to make that work, transitioning into building the data pipelines and then architecting the systems around them. That's where I've been for the past 5 years.
[00:01:28] Unknown:
And did you have any particular training before you got thrown into the deep end of managing the different aspects of the data pipeline, aside from your data science work? Or is it something that you ended up having to learn by doing as you went? Yeah. I kind of ended up learning by doing.
[00:01:44] Unknown:
I was in the startup space for a while after school. In school, I was going to do robotics, and back then I thought it was going to be a big booming field; it's still not quite where I expected at the time. So I did a mix of electrical engineering and computer science, and I was focusing a lot on AI and machine learning. Then when I got into working at startups, it ended up being a lot of "there's no one building a good pipeline for you, so you gotta go figure it out." I just slowly ended up more and more in the data realm, and got into big data in particular on Spark and some of the systems there. And so as I mentioned at the open, you've recently been working on consolidating a lot of your workflows for the different data roles at Netflix around Jupyter Notebooks. So can you give a bit of an outline of the motivation for choosing Jupyter Notebooks in particular as the core interface for your data teams? Yeah. Actually, when I first joined Netflix, it was just sort of tossed at me, and I was definitely like, are we crazy? And the answer was, we might be a little crazy. At the onset of hearing that, it sounds like, why would you do that, right? These things are highly iterative, so why would you rely on them for a production pipeline? But actually, if you dig in and add some more abstractions in that space, it starts to make a lot of sense. Where we're pulling it in and consolidating is that we're trying to converge many different scheduling systems at Netflix into some common systems.
And in the process, we want to give a more white-glove experience for less technical users, especially users who know enough scripting to write some SQL or read a little bit of code, but not so much that they could design their own pipeline or dig through logs to find references to other code bases. So a lot of the step into notebooks is that it was looked at as a strategic bet on what might be a good future shared interface between analysts, data engineers, and professional coders, as well as machine learning developers, that could all represent the same executed code with closely coupled logs and good visualizations in the same place. It becomes this artifact that you can share with a lot of different user groups, common and easy to transport.
[00:03:59] Unknown:
And what are some of the places that you have not yet deployed notebooks to, or that you are consciously deciding not to implement using that notebook interface? And what are some of the other areas where you are pushing notebooks into common use, and what are the differences that denote when a notebook is a good fit versus not?
[00:04:28] Unknown:
Yeah, for sure. I'll maybe break that into some pieces, because I think there are a lot of interesting topics in there. One thing we've done at Netflix is that we haven't taken a really heavy-handed approach, in the sense of forcing all development to be in notebooks, notebooks or nothing else. It's more an emphasis that, with the way we're doing things around scheduling notebooks in particular, a lot of users don't know they're using notebooks yet, so they're getting gradually introduced to them. We're not forcing them to rewrite scripts that have been sitting around for 4 years, or SQL that's been there since the table existed. What ends up happening is that we're migrating them onto new systems.
The thing that's actually executing the workload they're running is a notebook; you can think of it almost like a job template. For users that haven't been familiar with notebooks, it gets them to the point where the outcome of the work that ran is a notebook. So if you want to debug it, if you want to see what ran, if you want to rerun it, you basically always end up at the same notebook interface. The place we're introducing this a lot, and kind of pushing it, is really as an integration tool. It's a great place to express how to connect a few pieces and run something in a linear fashion. And from the other side, in a lot of model development, machine learning developers work in notebooks, and in particular Jupyter or Zeppelin notebooks are probably the most common, at least here. They'll do some iterating cycles, and when they're done, it's hard for them to productionize that work: to rip the code out of a notebook and put it someplace else. So the places where we emphasize it are, first, if you have a notebook and it runs well and it's something that's easy for you to understand, why rewrite the whole thing and risk failure in transferring that knowledge? And on the other side, introducing people to the idea that this is a way to express how to execute a controlled context of code that you want to fulfill, without needing to write all that code in a notebook initially. In the middle, where we're not pushing this really hard, is your everyday professional: a data engineer who knows exactly what they're doing, or a backend developer that's maintaining systems. They may not touch the notebooks as much directly, and we don't necessarily emphasize it. But in a few places where they're doing one-off tasks, say some janitorial cleanup of old data where they're trying to keep track of what they deleted, it's a really good form factor that we try to introduce and emphasize: the ability to see what every run did and have nice visualizations of the actions you took.
[00:06:57] Unknown:
And having that embeddable prose as rich metadata that can live along with the code can be very useful as well, particularly since that's one of the holy grails of data infrastructure in general: being able to have that extra information about the whys and wherefores of the different processes and the different schemas, etcetera. At the same time, it can be difficult to surface that in any sort of automated fashion, so I imagine it adds a bit of cognitive overhead and manual process to be able to take advantage of that extra information.
[00:07:32] Unknown:
To a degree. I think in a really controlled context, where we're scheduling work on behalf of a user or we have some sort of operational task, it's not much overhead to introduce a notebook into that equation. As a matter of fact, the project I've been working on a lot this past year has been using a library called Papermill to execute notebooks on demand. What that has enabled is a whole new scheduler where everything is built around executing notebooks. We can still execute other types of work, and we do a few times, but at this point about 95% of the work executing on that scheduler is notebooks. And it's actually been pretty easy to adopt that way. Where the notebook really shines in that case is that it almost becomes a report of what you did, and it's got high reproducibility.
We don't rely on environment variables very much in these notebooks. We rely very heavily on the parameterization capabilities that Papermill provides, so all of the context about what you're running gets injected as code into the notebook. Then when you're running it, you could run it anywhere and it would run exactly the same way. What this means is that we actually enable a lot more self-service for users to debug their own issues, because they always know how to run the thing. They don't have to set up some special environment or find out where it ran and what it pulled in. You just take that notebook and run it in one of the notebook servers, and you're good to go. Another follow-up to that is that in some cases, users are starting to use these notebooks as reports as well. You could imagine, with some of the nice visualizations that nteract or other notebook environments provide, you can pull together a bunch of data and render a data frame or some other type of object in a very clean fashion, including iteratively. We're building a lot of tools to make that more of the picture, because we're betting that interactive documents that can re-execute code are where reporting is headed, more so than even, say, a Tableau system.
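For readers who haven't used Papermill, here is a minimal sketch of the execution model he's describing. The notebook paths and parameter names are hypothetical, but `execute_notebook` is Papermill's documented entry point.

```python
# A minimal Papermill run: read an input notebook, inject the parameters as a
# code cell, execute it top to bottom, and write the executed copy to a new
# output path that serves as a record of the run. Paths and names are hypothetical.
import papermill as pm

pm.execute_notebook(
    "templates/report.ipynb",                      # reviewed input template
    "s3://notebook-runs/report/2018-11-27.ipynb",  # immutable record of this run
    parameters={"target_date": "2018-11-27", "region": "us-east-1"},
)
```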
[00:09:35] Unknown:
And Jupyter Notebooks have the capacity for a lot of different plugins that provide additional UI elements or display capabilities. The code being executed in them will also have different requirements for library or application dependencies. So I'm wondering how you're managing those extra requirements as part of the overall system for deploying and executing these notebooks in your different environments.
[00:10:05] Unknown:
Yeah. So in our case, the open source community has a really good spec on what a Jupyter notebook is, or a Zeppelin notebook if you're using that ecosystem, in the sense that the format of the object being shared is, at the end of the day, just a JSON document with a well-defined schema. And how it actually executes, and the assumptions made therein, are built in a way that makes it easy to build lots of tools on top of those specifications. For the case of running notebooks in different environments, there's definitely a gap there today. But for us, we have our own isolated, kind of shared container that has 95% of the dependencies that we need, and we keep it updated constantly. The average user just goes and uses this container; it generally has everything they would want to run in there, and it's a couple of gigs. We launch all of our big data platform services on top of the same image, so you get a very consistent execution experience. When you're not in one of those cases, it gets a little trickier. There, we'll do things like specify the requirements into an image, or into a baked AMI that you're going to run on top of instead. But that requires a lot more forethought from the users, so it's generally more of an advanced case. Ideally, we want to keep iterating on how you'd express these things and, in the long term, really bake into the Jupyter ecosystem a better way of expressing your dependencies or requirements in a uniform manner.
[00:11:46] Unknown:
And so that brings me to the question of what you have built in terms of the broader technical infrastructure to be able to support this workflow of using Jupyter Notebooks for scheduling and executing various tasks across the different roles at Netflix?
[00:12:03] Unknown:
Yeah. We've actually had blog posts and some other talks about the infrastructure here, but the reality of that picture is that we're really reusing a lot of the infrastructure that's already in our ecosystem. For the scheduling tool, we're reusing an existing scheduler and just extending it with this concept of translating all the parameters and environment context into Papermill arguments so that everything passes through properly. And on the visual representation side of what actually executed and when, a lot of that ends up being baked into our internal portal tools that are already there for displaying other types of work.
So really, a lot of the infrastructure work on our platform has just been extending each of those little platform pieces so that they're aware of how to link over to a rendering of a notebook, and how to link that rendered notebook back into the system. The new infrastructure we've had to deploy is really around the ability to run dedicated and ad hoc Jupyter servers, so that we can have servers up that are live and iterative. And then we've also put up another nteract open source tool, Commuter, which is a good read-only interface for notebooks. A bunch of our assets all live in S3 and are served through the Commuter interface in a kind of "what happened" mode: you can go look at any notebook without accidentally editing or manipulating the source of truth. I'm trying to think what other infrastructure we really had to bring up. Oh, the other infrastructure is actually more on the library side, in the sense that we want to support multiple types of kernels.
We've been contributing more and more in the Scala space; we're trying to get the community to unite around some Scala kernel improvements and make that world better. But most of our stuff going down the notebook path is running on Python or R, and there the kernels are pretty well defined already, though we have some extensions for connecting to specific platform capabilities. Those were the types of things we had to actually focus on building, and the rest of the diagram kind of fits like any other tool you would use. We're using S3 as our store for a lot of things, and a lot of our other tools already understand the data that's there and how it's represented. So at the end of the day, the infrastructure we had to add mattered less than the abstraction concept that we're communicating and expressing to users.
[00:14:33] Unknown:
And you said that when you first started at Netflix, there was already some discussion and motivation for going down this path of focusing a lot of your use cases and workflows around Jupyter Notebooks. So I'm wondering what was the driving force behind that effort? Was there a particular team that had a specific pain that they were trying to address? And once you did have some momentum behind the implementation, was there any difficulty in getting other teams to buy into this new technical platform?
[00:15:07] Unknown:
Yeah. That's a really good question. When I joined, the big data platform team had sort of already bought into this. There's this great memo that went around, labeled "notebooks everywhere," and it asked the question of why not use a notebook for some of the work we're doing. Some of the answers there were, well, maybe the editor isn't as nice as my local editor; that's something you could solve or improve, and it's usually very minor things like linting additions to your interface. Or there might be something like, hey, it's hard to test a notebook. At the time, that was pretty true, and I think the story has gotten better there. But in general, at the end of the day, we have a lot of machine learning work done here, and a lot of the iterative development is in notebooks already. So it was about looking at what would be a shared interface that different teams could all use, and could all agree was a reasonable document to look at and interact with. The big motivating factor was actually looking forward to what the employee base is going to look like in 2 or 3 years, and what the distribution will be. The distribution is not going to be what we have today or had last year, which was more focused on the data engineering side. Netflix has hired a lot more analysts and content-focused developers and people, and there the technical skill set is different and the expectation around interfaces is different. We wanted to emphasize a contract format that was much more shareable, but still familiar to players already in the space that Netflix occupies. So, stepping back, what motivated this was that the tools, and the way we showed how things are run and what was happening, were completely alien to a lot of our new, fastest-growing teams.
[00:16:57] Unknown:
And once you had that unification of the Jupyter Notebook as a common means for different teams to build and execute their different projects, did you find that there was an opportunity for consolidating or sharing a lot of existing work between those different roles?
[00:17:17] Unknown:
Actually, I think this leads into the part I didn't answer in the last question, which was whether there was any resistance to adopting notebooks, and I think that plays into how well it has been adopted. We're still actually decently early in that story at Netflix, in the sense that we have infrastructure that's now making this usable for everyone, but I wouldn't say the majority of people go to a notebook in their everyday development. I would say that a large percentage of users are now familiar with notebooks as the way to understand what happened in the platform, or at least are becoming familiar. And the initial resistance was not to the idea of the notebook; it was to the idea of how you develop with a notebook, or iterate on notebooks.
Where this has hit a lot of confusion and pain for users who haven't gotten deeper into the space is the reaction of: well, I use the notebook as a scratch space; why would I ever want to put that anywhere near a production system? And that's still the first thing people think about, notebooks as just a scratch space tool. But the reality is that because it has this shared visualization, with logs of what happened alongside the code that executed, it actually makes a very compelling integration format where you can have a consolidated view of what happened, when, and how, and visualize the outcomes all in one place.
As long as you keep your notebook very concise, it helps in this problem space. So for teams adopting it and people coming on board with consolidating to notebooks, I would say that, operationally, a lot of the big data platform's new operational concerns have all been notebook-based. We have a lot of tasks that look something like: poll for anything that's reached a certain state, pull those out of some system, and then send an email to someone saying these are the things that have violated some SLA, or these are the tables we're going to clean up in 30 days because no one's using them or they're past their TTL.
These types of operations are really handy. They tend to be a page of code or less, and they fit very nicely in a notebook where, at the end, you can print out or visualize a data frame of all the objects you're going to notify against and all the people you're going to notify, consolidated all together. That's been pretty successful for operational usage. The other place where there's been good consolidation is machine learning and exploration. While those teams have used notebooks a lot, the ability to schedule a notebook directly has been a light bulb moment of, oh, maybe we should rethink how we use notebooks. And there's been a lot of consolidation in the tooling here around either directly using notebooks or indirectly using notebooks through the scheduling component.
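To make the shape of those operational notebooks concrete, here is a rough sketch of the pattern he describes. `query_table_metadata` and `send_email` are hypothetical stand-ins for internal helpers, not real Netflix APIs.

```python
# A one-page operational notebook, sketched as cells: find tables past their
# TTL, render them as a data frame (which stays in the executed notebook as a
# visual record), then notify the owners.
import pandas as pd

# Hypothetical helper that returns rows from a metadata catalog.
expired = pd.DataFrame(
    query_table_metadata("SELECT table_name, owner FROM catalog WHERE ttl < now()")
)

# As the last expression of a cell, this renders the frame into the output
# notebook, leaving an audit trail of exactly what was flagged on this run.
expired

# Hypothetical notification helper.
for owner, tables in expired.groupby("owner"):
    send_email(
        to=owner,
        subject="Tables scheduled for cleanup in 30 days",
        body=tables.to_string(),
    )
```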
[00:20:06] Unknown:
And you've mentioned the Zeppelin notebooks a couple of times in addition to Jupyter, and I know that there are also one or two other notebook projects that have been gaining some measure of popularity. So I'm wondering if there has been much effort to incorporate some of the workflows that you've built around Jupyter Notebooks into those other notebooks, or if there are any features present in them that aren't present in Jupyter that you would like to see brought into the Jupyter ecosystem.
[00:20:42] Unknown:
Yeah. There are definitely a lot of really nice things in the Zeppelin space, but it's also more constraining from a development point of view in the sense that it's kind of one bundled package. Your notebook is more inclusive of everything that's happening, which is a good and a bad thing, and it's more of a design preference to choose one thing and do it really well. We have Zeppelin exposed because we give users options in what they want to do. We're heading more into the Jupyter space because we're building tools around it, and we have expertise and people here that are part of that community, so it made sense to lean into what we already know and what we're good at. In terms of things that could come over that are really nice, a lot of the widgets and extensions that live in the Zeppelin space are definitely something people want in the Jupyter Notebooks that we haven't made fully available, or that are only available under certain execution contexts. Maybe there's a JupyterLab extension, but there's not something for nteract, or Jupyter classic can't run something that another extension tool could. A lot of those are quality-of-life features of the notebook editing experience.
[00:21:51] Unknown:
And as you mentioned before, Jupyter Notebooks are often seen as temporary or throwaway, and potentially conducive to development anti-patterns. So I'm curious, what are some of the anti-patterns that you have encountered in notebooks that people have been building, and some of the conventions or tooling that you've established to help discourage those practices and encourage more reusable and robust methods of leveraging these notebooks for your production contexts?
[00:22:25] Unknown:
Yeah. This is actually the aspect of the notebook space I'm most excited about, because I feel like this is where the world is turning a bit, where things that we didn't think of as possible are coming to light. I do think the use case of a notebook as a scratch space, to try out something iteratively with partial execution of your whole workflow, is a really awesome tool, and I don't think that's something you should turn around and schedule. But it actually follows a lot of the same anti-patterns and arguments you'd have about any snippet of code. I may have some script that does one task, and does it once, but I may not want to schedule it immediately. It's the same with a Jupyter notebook or any other notebook interface: productionization of your code still has to happen, and it's easy to skip that productionization when your notebook feels like a reproducible artifact that you don't need to touch again.
So there are several tasks you should definitely do in notebooks around productionization that you would do for scripts or other contexts: move shared code into libraries, keep what you're executing simple and straightforward, and remove things like hard-coded attributes that you used just to get one run through, or local files that aren't necessarily going to be available elsewhere. It's the same kind of process you go through when you're putting code into production and doing code review of it. I will say, the anti-pattern we emphasize avoiding at a high level comes from the fact that unit testing notebooks is relatively difficult, but integration testing notebooks is pretty easy. You want to treat the notebook like any integrated component and keep its branching factor very low, so that a pass through your notebook is consistent and reliably tells you that this notebook is going to execute successfully.
One thing we've started doing, and want to make more efforts on, is that if you look at a notebook you want to schedule, you could easily do some quick static analysis to say, hey, how good of a notebook is this, much like you have linting rules in other systems. That's somewhere we want to improve. Generally, from a code review point of view, if a notebook comes in with many, many conditionals that branch in different ways, that's a good sign you should take the code in that cell, put it in a library that's unit tested, and call that library from your notebook. It's a little hard to enforce these rules today, but instead of putting hard blocks on what you can do, which is counter to Netflix's development philosophy anyway, we want to encourage quality: encourage things that are good, give good examples of what should be done, and maybe give suggestions on how to improve what's already there. To that end, we have a kind of portal of all the notebooks in the ecosystem, which has a surprising number of notebooks in it. I don't remember the number off the top of my head, but it's tens of thousands of custom notebooks people have made. There, we want to surface the good ones: the ones that are shareable or reusable, or the ones that are scheduled with good inputs. But in terms of anti-patterns, keeping the branching factor low and keeping it simple is probably the most important thing to do.
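One way to picture the integration-testing property he mentions: execute the whole notebook inside a test and let any failing cell fail the test. This is only a sketch; the notebook path and parameters are hypothetical, though the Papermill call itself is its standard API.

```python
# Integration-testing a notebook end to end with Papermill under pytest.
# If any cell raises, Papermill surfaces it as an execution error and the
# test fails. Notebook path and parameters are hypothetical.
import papermill as pm

def test_cleanup_notebook_runs_end_to_end(tmp_path):
    pm.execute_notebook(
        "templates/cleanup_job.ipynb",
        str(tmp_path / "out.ipynb"),
        parameters={"target_date": "2018-11-27", "dry_run": True},
    )
```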
[00:25:40] Unknown:
And I know that it's possible to import one notebook into another. So I'm curious if there has been any discussion about the pros or cons of that approach in terms of making the individual notebooks modular components for building up larger workflows.
[00:25:59] Unknown:
Yeah. That's definitely a space where, when I was using notebooks maybe 5 years ago, when I first got introduced to the notebook space (I think it might have been Jupyter notebooks back then, but I was also using a few other notebook platforms), there was an emphasis on the idea that notebooks importing notebooks would be the direction things would go: notebooks as a library. And that hasn't played out all that well, because you end up with snippets that you can't fully trust to do exactly what you expect over time. I think there could be a story where that becomes better. But generally, here, we haven't been encouraging that type of pattern. It's been more that if you want to import something that's shared, it's really easy to make a library within your organization that can house that code in a more consistent manner. This is really the question of whether notebooks should be treated as libraries, or as integrations and scratch spaces, and here we've definitely leaned toward it probably not being good to treat a notebook as a library. That may not be true in all contexts, and for certain teams that might be a good way to go if they want to build tooling around supporting it. But the tooling today doesn't facilitate a lot of happy outcomes from having notebooks as libraries without putting a lot of extra effort in on your side to make sure you can trust that the code is going to do what you want.
[00:27:24] Unknown:
And are there any particular limitations of the notebook environment that you have encountered that you would like to see addressed?
[00:27:32] Unknown:
And conversely, what are some of the biggest strengths of the notebook environment that you have enjoyed working with? I would say, especially in a lot of the popular notebook frameworks (it's not limited to one of them), there's definitely a lot of room for improvement in the abstraction of the storage mechanism and the syncing of your data. In the sense that with Jupyter Notebooks, when you save, you're saving the whole JSON document each time, and that can be really problematic when that document gets big, or if you have multiple concurrent edits that you want to happen. It leads to blocking conditions that can be annoying to extend into other use cases. That's a known trade-off, but there are definitely things that could be done to improve it. The other space where notebooks could do better is around editing tools. Today, if I load up my favorite IDE, or any IDE really, I get a lot of integration with default linters, default code searchers, and all sorts of other handy things, down to the point of Emacs extensions that can do these things.
In notebooks, I feel like a lot of those haven't developed as much. I think Zeppelin is probably a little further along than Jupyter in that space, and there are some niceties there. So I'd say quality-of-life improvements are definitely where a little bit of iteration by the open source community is going to go a long way, and I think they are heading in that direction. The other space where notebooks could be a little better is integration with source control systems, because the notebook at the end of the day is a big JSON document, and JSON saved into Git, for example, isn't the prettiest thing to look at in most diff tools. There have been some really neat things like nbdime, an open source tool for diffing notebooks that works really well on GitHub and works well for local diffing. But there's still a lot of room for improvement on source control with notebooks, and I think it's very doable with a few more improvements in the tooling space.
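For those who haven't seen it, nbdime exposes both command line tools (`nbdiff`, `nbdiff-web`, plus git integration via `nbdime config-git --enable`) and a Python API. A minimal sketch of the latter, with hypothetical file names:

```python
# Content-aware diffing of two notebooks with nbdime's Python API, instead of
# treating them as opaque JSON blobs. File names are hypothetical.
import nbformat
from nbdime import diff_notebooks

before = nbformat.read("analysis_before.ipynb", as_version=4)
after = nbformat.read("analysis_after.ipynb", as_version=4)

for change in diff_notebooks(before, after):
    print(change)  # structured, cell-level diff entries rather than raw JSON noise
```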
[00:29:30] Unknown:
Yeah. I was going to specifically ask about that, because I know from my own experience, and from talking to other people, that's been one of the biggest pain points when working with notebooks as a collaborative environment, given how common it is to use code review as a means of preventing obvious mistakes and sharing the discussion about approaches, design choices, and whether a given piece of code meets the intended requirements. So having that additional difficulty in the common workflows, by virtue of using a notebook, seems like it would be a potential pain point as you try to scale out to larger groups of people and make it more of a broad use case and collaborative effort.
[00:30:22] Unknown:
There are definitely places where we don't use all the tools available out there today to help in that space, and we're starting to pull some of those in, because we've been keeping it contained to the smaller group that's been doing maybe the most shared notebooks. There, we're okay; we know how to get around the ugliness. But as we expand to users who want to make their own notebooks and check them into version control outside of the big data platform group, it's definitely something we're going to have to focus on, helping pull in or improve other open source tools in that space. I will say there have been some really neat things even since the last time I looked around, like some git hooks that help clean up your notebook before you commit.
There are some tools that do nice things there, and some of the diffing tools that can integrate with git are getting better and neater. So I think we're around the corner from having that be a really nice experience.
[00:31:13] Unknown:
And in terms of having notebooks used in production contexts for different data workflows, what have you found to be some of the most challenging aspects of building those workflows and platforms, and what are some of the unexpected lessons that you have learned in the process of building these different capabilities?
[00:31:38] Unknown:
Yeah. I'd say that where some of the challenge comes up is making the decision about when my notebook is becoming too much like library source. How many code cells with custom functions does it take before it justifies putting this in a library, especially when you're in a new context where you don't necessarily have a backing library to support it yet? When there's already a library you're importing to handle some of the complexity you're working with, say you're sending an alert and you have an alerting library that handles a lot of the cruft for you, it's easy to justify when something should go in that library. When I have a new type of data query that has to do some complex check against two systems to come up with an answer, where you choose to put that into a library is definitely a hard choice, and it's very project dependent. And it's much like other systems: when do I promote the script that's just sitting in S3, that I'm running, to an actual library?
It's kind of scary how many very professional companies rely on some scripts sitting someplace that are maybe backed by source control, and it's the same type of problem: even if my notebook was checked into source control, when do I make that trade-off of spending more time to break it into smaller pieces? Of the things we've learned that are really valuable, I would say the abstraction we introduced with Papermill, where we isolate input notebooks from output notebooks, really changes the paradigm of risk and worry that came initially with notebooks.
It helps a lot because, in that paradigm shift, your output path is independent of your input path and you get a record of each execution, so it's very easy to go back and roll back to what happened on day X. You just go to an S3 path, pull down that notebook, and see what it did. That's been really, really useful as a tool. As a matter of fact, we even have people who weren't familiar with notebooks at all, even managers who were on call because their team was on vacation, saying: something went wrong, I clicked into what went wrong, it sent me to this notebook link, I saw immediately what it was, and I just started editing it and fixed it right away. That's been a rewarding turnaround on the bet we made that notebooks would make those types of operations easier.
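That rollback story follows directly from the input/output isolation: every run's executed notebook, parameters included, sits at its own output path. A sketch of what re-running a past execution for debugging could look like, with hypothetical paths:

```python
# Because each scheduled run writes its executed notebook (with parameters
# already injected as code) to its own output path, reproducing "what happened
# on day X" is just re-executing that output notebook. Paths are hypothetical.
import papermill as pm

pm.execute_notebook(
    "s3://notebook-runs/cleanup_job/2018-11-20.ipynb",  # record of the past run
    "/tmp/debug_rerun.ipynb",                           # local scratch output
)
```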
[00:34:10] Unknown:
And looking forward, what are some of the projects that you're either currently working on or have planned for the future that you're most excited by, and that you think will be most beneficial to your work and the data teams at Netflix?
[00:34:13] Unknown:
Yeah. So I've been living in the scheduler world now for about a year, and I think some of the exciting things are actually adjacent to that scheduler, in the sense of a lot of the tooling around it. I've been working a lot on the Papermill project in particular, and there are a lot of really fun conversations that are starting to bubble into actual new things. Like, how do you record data from a notebook in a consistent and efficient way? That's not a story that's really been solved; oftentimes you use some other tool to save data someplace. But what if that data frame you built in your notebook was really easy to pass along as a first-class object? To explore that space, we started a new little repo called scrapbook, inside nteract, hence a theme of nteract side projects. And I think that space, along with changing how notebook stores work and making intermediate layers that help with collaboration and consistent identification of what's happening in the notebook, down to a cell level, will enable a lot of use cases that weren't really possible before. In particular, you're going to see more things like a notebook as an outcome report that can have collaboration and real-time updated visualizations without heavy load on the system. Those are going to be really interesting times, when the thing that generates your report is the same thing that is your report, just a different view of the same object. In my world, that's going to reduce a lot of the operational overhead of transferring information to multiple other systems that need to interpret it in the same way. To clarify how to think about that: imagine I'm pulling data from a big data system. I've just got rows of data that are really hard to look at, but I have users that need to answer some very fundamental business questions. The traditional path is you take that data, you have a data engineer make the query, the engineer outputs that outcome to some other place, maybe aggregates it, and then either they or some other analyst takes that and tries to gain some insights.
They gain those insights by making a Tableau report, or exporting to Google Sheets and then showing that to someone. There are so many different forms and mediums for exposing the information that's being produced by literally thousands of people. Simplifying and unifying a lot of those components, so that there's less overhead in translating it successfully and trusting it, is going to let those 1,000 or 2,000 people do a lot more work with less effort.
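The scrapbook project he mentions targets exactly that hand-off: letting a value recorded in one notebook be read back by downstream code as a first-class object. A sketch of the usage pattern the project documents follows; the paths are hypothetical, and the API may have evolved since this episode.

```python
# scrapbook's core pattern: a notebook "glues" a value into its own output,
# and another process reads it back from the executed copy. Paths are
# hypothetical.
import scrapbook as sb

# Inside the executed notebook:
sb.glue("row_count", 42)

# Later, from another process, against the executed output notebook:
nb = sb.read_notebook("s3://notebook-runs/report/2018-11-27.ipynb")
print(nb.scraps["row_count"].data)  # -> 42
```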
[00:36:52] Unknown:
And are there any other aspects of the work that you've been doing with the notebook interfaces and the productionizing of these workflows that we didn't discuss yet that you think we should cover before we close out the show? Let's see.
[00:37:07] Unknown:
I do think that, in general, there are a few more key points on how the scheduler and Papermill pass parameters into a notebook that are kind of interesting, so maybe we could spend a minute on that. In the scheduler realm, when we're using Papermill to execute a notebook, we basically take all these different parameters; you can think of them as anything that's JSON- or YAML-expressible, really simple configuration values. Those actually become code in the notebook, and that's a unique approach compared to how parameterizing notebooks had been done in the past. When I joined, that project had started, and it was doing that and recording basic data on the notebook. It was really interesting how that translation into code, code you can see in whatever language your kernel is running, makes it so much easier for people to understand how something got configured. In particular, because it just becomes code, you don't have to know that it came from some parent system that saved a bunch of environment variables or something. Instead, you just tell it where to inject all the information, and then provide any defaults you want. You can have a local run that you're iterating on with the defaults, and then independently, when it actually gets scheduled, it's very easy to see how the incoming attributes overrode them. That's been very handy for debugging and understanding how the system actually works, and I think it has been a kind of game changer for making notebooks usable in this setting.
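Inside the notebook, that mechanism looks roughly like the following: the template carries a cell tagged `parameters` with defaults, and Papermill injects a new cell after it holding the run's actual values, so the configuration record is plain, readable code. The variable names and values here are hypothetical.

```python
# --- cell tagged "parameters" in the template: defaults for local iteration ---
target_date = "2018-01-01"
dry_run = True

# --- cell injected by Papermill at execution time, overriding the defaults ---
# This cell appears in the output notebook as ordinary code, so anyone reading
# the run can see exactly how it was configured.
target_date = "2018-11-27"
dry_run = False
```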
Another spot I haven't focused on as much, maybe because I'm more focused on the backend and scheduling concepts around notebooks: there's been a lot of really interesting work being done on the nteract project, and in others like JupyterLab and other notebook interfaces, on making better and better front ends to enable really clean and nice reporting and visualization tools. I definitely encourage people to go check out some of these open source communities for what's happened in the past year or two, because I think the world is changing a lot in the visual representation space of notebooks. Alright. Well, thank you very much for that.
[00:39:08] Unknown:
For anybody who wants to follow the work you're up to or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:39:24] Unknown:
Yeah. For data management, I would say one of the biggest gaps is the inability for non-experts to get involved and self-service themselves on problems. I see this a lot: the bigger the big data platform is, the more you lean on specialists to understand why something isn't working and how to remediate it. That puts a lot of pressure on having very highly skilled data management individuals, a lot of them, doing their jobs well, for you to be successful as a company. I think a lot of the data experience world is trying to, or should be trying to, lean more into making it easier for these upstream users to be successful without needing as much expertise. Some of the tooling has made that better, with easier understanding of failure modes, but I think there's a lot of room for improvement there in the big data world. Maybe notebooks will be part of that story, but I think there are a lot of other aspects to it too. Alright. Well, thank you very much for taking the time today to discuss the work that you've been doing with notebooks at Netflix.
[00:40:34] Unknown:
It's definitely a very interesting project, and I appreciate the detail that you've all put into the blog posts around it. We'll definitely have links to those in the show notes. So thank you again for that, and I hope you enjoy the rest of your day. Thank you for hosting me.
Introduction and Guest Background
Motivation for Using Jupyter Notebooks at Netflix
Deployment and Integration of Notebooks
Operational Use and Benefits of Notebooks
Technical Infrastructure Supporting Notebooks
Adoption and Resistance to Notebooks
Comparing Jupyter and Zeppelin Notebooks
Anti-Patterns and Best Practices
Limitations and Strengths of Notebooks
Challenges and Lessons Learned
Future Projects and Innovations
Parameterization and Scheduling with Papermill
Final Thoughts and Contact Information