Summary
Building an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that matter, so that you don’t waste time gluing the steps together. In this episode Tom Goldenberg explains how it works, how it is being used at QuantumBlack for customer projects, and how it can help you structure your own. Definitely worth a listen to gain more understanding of the benefits that a standardized process can provide.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, Data Council in Barcelona, and the Data Orchestration Summit. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scalable, deployable, robust, and versioned data pipelines.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Kedro is and its origin story?
- Who are the primary users of Kedro, and how does it fit into and impact the workflow of data engineers and data scientists?
- Can you talk through a typical lifecycle for a project that is built using Kedro?
- What are the overall features of Kedro and how do they compound to encourage best practices for data projects?
- How does the culture and background of QuantumBlack influence the design and capabilities of Kedro?
- What was the motivation for releasing it publicly as an open source framework?
- What are some examples of ways that Kedro is being used within QuantumBlack and how has that experience informed the design and direction of the project?
- Can you describe how Kedro itself is implemented and how it has evolved since you first started working on it?
- There has been a recent trend away from end-to-end ETL frameworks and toward a decoupled model that focuses on a programming target with pluggable execution. What are the industry pressures that are driving that shift and what are your thoughts on how that will manifest in the long term?
- How do the capabilities and focus of Kedro compare to similar projects such as Prefect and Dagster?
- It has not yet reached a stable release. What are the aspects of Kedro that are still in flux and where are the changes most concentrated?
- What is still missing for a stable 1.x release?
- What are some of the most interesting/innovative/unexpected ways that you have seen Kedro used?
- When is Kedro the wrong choice?
- What do you have in store for the future of Kedro?
Contact Info
- @tomgoldenberg on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Kedro
- QuantumBlack Labs
- Agolo
- McKinsey
- Airflow
- Docker
- Kubernetes
- Databricks
- Formula 1
- Kedro-Viz
- Dask
- pytest
- Azure Data Factory
- Prefect
- Dagster
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's linode, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council.
Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scalable, deployable, robust, and versioned data pipelines.
[00:01:51] Unknown:
So, Tom, can you start by introducing yourself? Hi, Tobias. First of all, thanks for inviting me here. It's a pleasure to be here. A little bit about myself: I'm a junior principal data engineer at QuantumBlack, which is a McKinsey company. I'm based out of the Cambridge, Massachusetts office. And I'm happy to share a little bit about myself or Kedro. Where would you like to start first, Tobias?
[00:02:15] Unknown:
Let's start, if you can remember, with how you first got involved in the area of data management. Tell us a bit about that story.
[00:02:21] Unknown:
Sure. So it's interesting. I've been in the tech scene for quite a while. I started out working with a couple of startups in New York City; that's kind of where I got my start. One of the first startups I worked with was a company called Agolo, and they are a Microsoft-backed AI company. At the time, I wasn't heavily involved in what I would call the data science aspect of it, but I was always very intrigued by it. So I moved on from there. I co-founded a startup and was CTO of a fintech asset management startup. And it wasn't really until I got into consulting that I had the opportunity to play a leading role on a large-scale analytics transformation.
And this was for a client in the retail industry. It was then that I discovered that I really enjoy this, and I definitely saw the huge impact it had on the client at the time. And so that's why I eventually joined QuantumBlack here at McKinsey: to work with Fortune 500 clients and to help them transform and adapt and use these frameworks
[00:03:50] Unknown:
in the form of Kedro. So I'm wondering if you can describe a bit about what it is, some of the use cases that it enables, and some of the origin story of how it got started.
[00:04:01] Unknown:
Yeah, sure. Happy to answer that. So first of all, I did not create Kedro. I'm an avid user of Kedro and a big fan. But there are lots of people involved, out of QuantumBlack's London office primarily, but also here in the United States. The word Kedro is Greek for center; the idea is that it's essentially the center of your data pipeline. And what it offers, to really boil it down and simplify, is a template, kind of a boilerplate, for how to start a data science or data engineering project in Python. That's the core of what it offers. And from there, it offers an organization of the code, and the concept of data layers to organize the data that you're using.
And then there's a whole abstraction layer that it offers for the datasets that you use, as well as tools for both visualizing and organizing your data pipeline. I hope that clarifies. Happy to dig into any of those points a little bit more.
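For readers following along, the template Tom describes is what the `kedro new` command scaffolds. A rough sketch of the resulting layout, assuming a 0.15-era release and a hypothetical project name (the exact folders vary between Kedro versions):

```
my-kedro-project/
├── conf/
│   ├── base/            # shared configuration: catalog.yml, parameters.yml
│   └── local/           # credentials and per-user overrides, kept out of Git
├── data/
│   ├── 01_raw/          # immutable source data
│   ├── 02_intermediate/ # cleaned and typed data
│   └── ...              # further numbered layers (primary, features, models, ...)
├── notebooks/           # exploratory work
└── src/
    └── my_kedro_project/
        ├── nodes/       # pure Python functions
        └── pipeline.py  # wires the nodes into a pipeline
```

The numbered `data/` folders are the data-layer convention mentioned above: each stage of refinement gets its own layer, so it is always clear where a dataset sits in the flow.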
[00:05:18] Unknown:
Yeah. It's definitely a useful overview, and it seems that it was designed to span the disciplines between data engineers and data scientists, to serve as the focal point for collaboration and enable the overall life cycle of a data project. I'm wondering if you can dig into some of the ways that each of those different roles factors into the overall Kedro workflow?
[00:05:46] Unknown:
Yeah, absolutely. It's great that you call that out. In big data projects, you have data engineering and you have data science. And, you know, by the name of this podcast, we all know that data engineering is a huge chunk of what goes into driving the insights of machine learning and data science. The statistic is quoted pretty often that data engineering is 80% of it. And so what Kedro really emphasizes is that collaboration between data engineering and data science. So first of all, it starts with a shared code base.
So all of your pull requests in your Git workflow are made really simple, and you're able to share code. It's all in Python as well. That's a very deliberate decision: we think it's better for data engineering and data science to share the same programming language and the same code base. And so that's where Kedro starts. And on top of that, it provides a whole bunch of tools and functionality to basically streamline a lot of the stuff that would take
[00:06:54] Unknown:
a lot of the time to get started on a project. And what are the main feature sets that are present within Kedro, and what are some of the abstractions that are built in to encourage best practices for a data project and reduce some of the friction that exists at the outset, where you're left trying to decide on all the different tools and components and what the overall requirements are? I'm just curious what the feature set is and how it plays into some of that analysis paralysis.
[00:07:27] Unknown:
Yeah, absolutely. So in terms of which libraries you use, we are agnostic; it's a very open-ended platform. Whichever modeling libraries you use, whether it's TensorFlow or scikit-learn or whatever, there's not an opinion in that sense. Where there is an opinion is on certain best practices, what we consider best practice for software development, and this includes stuff like documentation. So documentation is built in.
There's a Kedro command-line tool to produce the documentation for the code base. But it also goes a step further in the sense that we have a visualization tool called Kedro-Viz, which takes a pipeline and provides a visual representation of the code, which is very helpful as well. I would also point to reusability and testability. I've mentioned before the data abstraction that it offers, and I think this is really an amazing feature: we can have different environments, and in each environment we'll have datasets that we refer to by a certain name. But we can change the parameters of those datasets so that in our test environment they're pointing to one location, and then our QA or prod environments are pointing somewhere else.
So those are the kinds of configuration options it gives you to run the same pipeline on multiple environments. It makes it possible to test before deploying, etcetera, and to reuse parts of the code for multiple purposes. That's what I would say is the main feature set. There's lots of stuff in development now as well, such as more robust deployment options: we have tools that incorporate Airflow and Docker, we're working on Kubernetes, and we're also covering deployment to Databricks, with clear guides on how to do that. There's also a data and code versioning feature that's actively being worked on to make that more robust. And that's where we are, what features we're prioritizing, and what we have currently.
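To make that environment mechanism concrete, here is a hedged sketch of the same catalog entry defined in two environments. The dataset name and bucket are hypothetical, and dataset type names have changed across Kedro releases (modern versions use spellings like `pandas.CSVDataSet`, while 0.15-era releases used names like `CSVLocalDataSet` and `CSVS3DataSet`):

```yaml
# conf/base/catalog.yml -- the default environment
customer_orders:
  type: pandas.CSVDataSet        # 0.15-era name: CSVLocalDataSet
  filepath: data/01_raw/customer_orders.csv

# conf/prod/catalog.yml -- overrides the same entry for production runs
customer_orders:
  type: pandas.CSVDataSet        # 0.15-era name: CSVS3DataSet
  filepath: s3://acme-prod-data/raw/customer_orders.csv
```

Pipeline code refers to `customer_orders` only by name, so a command such as `kedro run --env=prod` swaps the physical location without touching any Python.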
[00:09:49] Unknown:
And as I mentioned earlier, this grew out of the patterns that you identified through a number of different client engagements at QuantumBlack. I'm curious how much of the culture and technical acumen of the people working there factored into the design and priorities of Kedro, and some of the ways that that manifests?
[00:10:11] Unknown:
Yeah, absolutely. So a couple of things come to mind with that. The first thing is QuantumBlack itself: for those that don't know, we are a consulting firm that provides advanced analytics solutions to clients. And QuantumBlack got its start in Formula 1 racing. So there's always been this aspect of solving really tough problems, but also coming up with a big impact in a short time and improving the performance of organizations. In that respect, the culture really comes out in the fact that we have a short time to deliver this huge impact, and we don't wanna sacrifice code quality, documentation, all these kinds of things. How can we build that into what we're developing so that we're able to streamline those processes?
So that's one way I would say that the culture or the history of QuantumBlack has led to Kedro. The other thing I would say is that QuantumBlack has a very cross-collaborative culture. We alluded to this before with DE and DS, data engineering and data science, working closely together in the same teams, working with the same code base, reviewing each other's code, and really just providing more eyes on code reviews and the quality of the code. Another interesting thing about QuantumBlack is that there is a strong design aesthetic as well.
And so you'll see this in, as I mentioned, Kedro-Viz, which is the visualization tool to see your pipeline, and it provides a really interesting view of that. But I think it goes beyond just the visual part of the design. I think it comes into play in the actual design of the library and of the templates, where we've really stressed simplicity, whether it's the data abstraction or the folder organization. We use this term MECE, which is mutually exclusive and collectively exhaustive; it's boiling things down to their ultimate parts. And I feel like that simplicity really comes through in Kedro projects.
[00:12:31] Unknown:
And the fact that this is oriented largely toward client engagements is a benefit as well, because having this standardized approach with a lot of built-in boilerplate means that you're not wasting time trying to come up with new documentation and training regimens for handoff to the customer once the project is done and they need to be able to maintain it going forward.
[00:12:55] Unknown:
Yeah. Or, you know, feeling rushed to do those things at the end of a project. It really kind of forces you onto the right path, where those things are considered from day one, and you know that you're getting quality at the end. Exactly.
[00:13:15] Unknown:
And can you describe a bit about how Kedro itself is implemented, some of the evolution that it's undergone since it was first started, and some of the main libraries and language features that have been incorporated into the framework?
[00:13:30] Unknown:
Sure, I can speak to that. So Kedro came about, as I think I alluded to before, about two years ago in a client engagement. We had a group of machine learning engineers that were just kind of frustrated doing these things that we're talking about, all the boilerplate and standard activity that goes into a data pipeline. And that's when they had the idea of creating this. As for the main components: there is what we call an IO layer, and that's what I mentioned about the data abstraction. What that means, as I mentioned, is that you can create multiple environments.
For each environment, you have a YAML file where you can specify that my dataset points to a certain folder and a certain file type, whether CSV, Parquet, etcetera, or it could be a cloud-based dataset as well. That was part of the original vision, this IO layer. That, the pipeline, and the templating of the code were the three original features, and that has remained the core of Kedro. I think what has happened since then is that we've used it with more and more clients, dozens of clients.
And it's just gotten more mature, and we've built some functionality on top of that. So, for example, those datasets now have versioning; the ability to version datasets was added on as we went along. And the feature requests continue to come. We're working on, for example, Dask support, which is coming next, a Kedro API for extensibility, and reusable pipelines. But I think the core of the functionality is what has been there from the beginning, and we've built all these layers on top of that to make things easier. Does that make sense? Yeah. Definitely.
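The pipeline piece Tom mentions is a small Python API. A minimal sketch, using the `node` and `Pipeline` constructors from `kedro.pipeline` (real, though their exact signatures have shifted between releases) and hypothetical function and dataset names:

```python
import pandas as pd
from kedro.pipeline import Pipeline, node


def preprocess_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # A node is just a pure Python function: data in, data out.
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    return orders.dropna(subset=["customer_id"])


def count_orders(orders: pd.DataFrame) -> pd.DataFrame:
    return orders.groupby("customer_id").size().reset_index(name="n_orders")


# Inputs and outputs are catalog entry names; the IO layer resolves them
# to concrete files, tables, or cloud storage at run time.
pipeline = Pipeline(
    [
        node(preprocess_orders, inputs="customer_orders", outputs="clean_orders"),
        node(count_orders, inputs="clean_orders", outputs="orders_per_customer"),
    ]
)
```

The dataset versioning he describes is switched on per catalog entry: adding `versioned: true` to an entry makes each run save to, and later load from, a timestamped location.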
[00:15:30] Unknown:
And I find it interesting that there has been a recent movement toward building an abstraction layer for the data programming piece itself and separating that from the actual execution context, which is the opposite of how projects such as Airflow and Luigi and a lot of the more traditional ETL tools have been built, where everything is all in one framework. But now we have projects such as Kedro, which you're building, and the Prefect library, and Dagster, which are focusing more on just the programming primitives and the overall logical flow, and then having a pluggable execution layer and scheduling layer so that you can mix and match what's already present in your environment. I'm just curious what your experience has been on that front and what your thoughts are on the benefits of that approach versus how we have been dealing with things up until now.
[00:16:27] Unknown:
Yeah. I mean, you bring up a lot of points. But just to pull the thread a little bit, I think that there are a lot of approaches to the problem that companies are facing. And, essentially, the problem is that data pipelines are hard. They're hard to build, they require a lot of resources, and they're hard to deploy. All of the approaches are aimed at making that simpler. One approach that we've seen is the rise of drag-and-drop or GUI tools to create more simplicity in, you know, ETL or feature engineering, etcetera.
And I think the idea behind that has been: it's hard to get resources for data engineering and data science, so let's create drag-and-drop tools to make it easier. At least my past experience working with clients has been that that actually tends to add further complication. So with Kedro, we've taken a step back. We said: hey, we're not building a drag-and-drop tool. We're not making it simpler in that sense. We're not locking you into a tool that you have to use.
We're just creating a way to organize your Python code. And we're still solving the same problem. We're making things simpler. We're taking a complex problem and boiling it down to: what do you really need to do to achieve that? And we're trying to take all the stuff that just takes time and energy and make it a lot simpler. So I think there has been a trend away from proprietary frameworks to open source. I think there is still this push to simplify the data engineering and data science process, and we've seen that play out whether it's with drag-and-drop GUI tools, open source libraries, or proprietary tools.
I mean, the problem is very real. The approaches are gonna be different, and we've taken a stand on what we think is a really effective way to solve it. So all these things are trends that are happening. I think the trend towards open source is also important. One thing that we're seeing among clients is a concern about lock-in. They don't wanna be locked into a particular vendor, and they don't wanna be locked into a certain cloud provider. So the ability to use these open source tools, like you mentioned, is a big movement, where they are not locked into any particular company. All these things are playing out, and we think that we're offering a good approach to simplify that process as well.
[00:19:19] Unknown:
And the other benefit of having the programming layer be one piece and having everything else be pluggable is that you can isolate the logic from the actual execution context, which I'm sure makes testing and validation easier. I'm curious what the overall approach looks like for running an integration test or unit test on a Kedro project, and any tooling that you have built into the framework to simplify that operation?
[00:19:52] Unknown:
Yeah, that's a good point. We are integrated with pytest. I don't have specific examples I can give you, but one approach that you can use, for example with the data abstraction layer, is having test datasets that you can run the pipeline on, and then running tests on the results as well. That's one approach we've seen. I think that surfacing the different testing approaches and bringing them to light more is something that we're working to improve. But I think that the abstraction of the data layer makes it actually very easy to do, like you say, an integration test on a different environment, with maybe a sample dataset, to test the results, etcetera.
I think the data and code versioning will help with that as well, and that's something that's active on the future roadmap. So, yeah, these are all things that we've heard from clients and users and are looking to enhance and improve.
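Because Kedro nodes are plain functions, the unit-testing story is ordinary pytest. A minimal, self-contained sketch reusing the hypothetical `preprocess_orders` node from the earlier pipeline example:

```python
import pandas as pd


def preprocess_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # In a real project this would be imported from the package's nodes
    # module; it is repeated here so the test file stands alone.
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    return orders.dropna(subset=["customer_id"])


def test_preprocess_orders_drops_rows_without_customer():
    raw = pd.DataFrame(
        {
            "customer_id": [1, None, 2],
            "order_date": ["2019-01-01", "2019-01-02", "2019-01-03"],
        }
    )
    clean = preprocess_orders(raw)
    assert len(clean) == 2  # the row with a missing customer_id is dropped
    assert pd.api.types.is_datetime64_any_dtype(clean["order_date"])
```

The integration-style approach Tom describes works the same way at a larger grain: point a test environment's catalog at small fixture datasets and run the pipeline end to end.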
[00:20:59] Unknown:
And another thing that you mentioned is the built-in support for integrating with different data sources to pull data into your project. I'm wondering if you can talk a bit more about some of the sources that you have built in and the overall approach that you've taken to abstract out the core principles while still allowing for specific features of different storage backends?
[00:21:24] Unknown:
Yeah, happy to explain that. So, yes, we do have support for specific data connectors. For local files, of course, you have Parquet, CSV, pickle, etcetera, and we have connectors for S3 and PySpark as well. But when you look at Kedro, the real benefit is maybe not the number of connectors per se, but that the ability to extend and create custom connectors is really, really easy. If you go to the Kedro docs, you'll see a section that talks about the abstract dataset, and that's the core class that all of the data connectors are built on. Extending it is really simple: it's just a load and a save method on a class. We've worked with clients, for example, that are using Azure Data Lake Storage Gen2, where some aspects might be different from Gen1, or Blob storage as well.
And so it's really easy to extend the data connectors, the abstract dataset, to all these different backends, whether it's different cloud providers or different types of databases, etcetera. So those are some of the ways that we've seen people leverage the data abstraction.
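A minimal sketch of that extension point, assuming the 0.15-era base class `kedro.io.AbstractDataSet`, whose subclasses implement `_load`, `_save`, and `_describe` (the newline-delimited-JSON format and class name here are purely illustrative):

```python
import json
from pathlib import Path
from typing import Any, Dict, List

from kedro.io import AbstractDataSet


class JSONLinesDataSet(AbstractDataSet):
    """Hypothetical custom connector for newline-delimited JSON files."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> List[Dict[str, Any]]:
        # Each line of the file is one JSON record.
        with self._filepath.open() as f:
            return [json.loads(line) for line in f]

    def _save(self, data: List[Dict[str, Any]]) -> None:
        with self._filepath.open("w") as f:
            for record in data:
                f.write(json.dumps(record) + "\n")

    def _describe(self) -> Dict[str, Any]:
        # Used by Kedro in logging and error messages.
        return {"filepath": str(self._filepath)}
```

Once the class is importable, a catalog entry can reference it by import path (e.g. `type: my_kedro_project.io.JSONLinesDataSet`) and it behaves like any built-in dataset.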
[00:22:53] Unknown:
And I'm wondering if you can talk a bit more about some of the ways that Kedro is differentiated from projects such as Prefect and Dagster, which I think are most analogous to what you're doing with Kedro, but also from projects such as Airflow and other workflow management tools, particularly when it comes to things like handling failures and retries and building in some of the common error modes and recovery capabilities?
[00:23:23] Unknown:
Yeah, sure. So first of all, in terms of Airflow, which you mentioned: Kedro integrates with Airflow. Kedro is just an organization of the pipeline; it's not a scheduler. So you could integrate it with Airflow, or you could schedule it via different options, for example Data Factory, etcetera, or via a Databricks notebook. So it does integrate nicely with Airflow. In terms of the other projects you mentioned, I've never actually used those in any projects. I have looked at some of the documentation. I would say, for me, where Kedro stands out is its simplicity, in terms of the way that the code is structured when you create a new Kedro project.
The syntax that you use in constructing a pipeline, constructing nodes, etcetera, is gonna be very simple and elegant. A lot of thought has gone into making it as simple as possible, and that's where we have really focused, along with the data abstraction layer. From what I've seen, there are definitely a lot of shared elements as well, like visualization of the pipeline, etcetera, and certain principles are definitely shared across the other two that you mentioned. Of the two, I've gotten to look at Dagster; I haven't seen the other one.
But that's where I would say that Kedro has put most of its effort: in creating a really rich developer experience, so very easy to use and very simplified, as well as the richness of its data abstraction layer. That's where I'd say it stands out.
[00:25:13] Unknown:
And when I was looking through, I noticed that Kedro is currently at, I think, the 0.14 release. I'm curious what you think is still missing for a 1.x release, which aspects are in the heaviest state of flux and change, and what effort you have planned for pushing toward that release?
[00:25:40] Unknown:
Yeah, I think that the 1.x release is imminent. The bar that we've set is that we want to continue to improve and enhance our versioning feature set: not just dataset versioning, but data and code versioning, so that you can truly reproduce any run. The other aspect is getting stability and feedback from users on the project templates, and we've introduced a context API. So I think we're still in the process of that iterative loop, getting feedback and stabilizing those features. What I mentioned before was at the core of Kedro: the main project template, the IO layer, and the pipeline.
Those have remained very stable, and we do feel that they are very stable. But in terms of getting to a 1.0 release, where we have set the bar is getting full stability in the context API and really getting to a very mature state with our versioning capabilities, data and code, etcetera.
[00:26:58] Unknown:
And you mentioned that there is a plug-in capability in Kedro and that you're working toward a more standardized approach to that. I'm curious what types of plug-ins are currently present, both ones that you've built within QuantumBlack and some of the community-contributed ones, and just how much engagement you've seen from users outside of QuantumBlack?
[00:27:22] Unknown:
Yeah, I can speak a little bit to that. So in terms of plug-ins, I believe I mentioned that there is a plug-in for Docker, Kedro-Docker. Obviously, there's Kedro-Viz, which provides the visualization of the pipeline. There's also Kedro-Airflow. We have seen great contributions from the community. One of the contributions has been looking at creating reusable pipelines; in other words, maybe not even from a data science perspective, but using the pipelining tool to create workflows that can be reproduced across multiple datasets. So we've seen Kedro being used in some interesting ways. We've seen it being adapted and used in academia at several universities.
I just learned recently that it's been adopted at Imperial College London, Oxford University, and the University of Cape Town in South Africa, and they're using it to reproduce work for some machine learning papers. We've started to see more streaming examples, leveraging Kafka. So I think people are using it in ways that we might not have anticipated. We're still in the process of working with the community, extending it in ways that people are requesting, and learning from the way that people are using it. And what are some of the most interesting or innovative or unexpected ways that you have seen it used? So, like I said, the reusable pipelines: that was something that we hadn't considered, being used as basically a workflow template to run across different datasets. So that's something that we started to see and to incorporate. And like I said, some of the adoption in academia was also a pleasant surprise.
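For reference, the plug-ins mentioned above ship as separate packages that hook extra sub-commands into the Kedro CLI. Roughly, per each plug-in's own documentation (command names may change between releases):

```
pip install kedro-docker kedro-airflow

kedro docker build     # package the project as a Docker image
kedro docker run       # run the pipeline inside the container
kedro airflow create   # generate an Airflow DAG from the Kedro pipeline
```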
[00:29:17] Unknown:
And in your own experience of both working on Kedro and using it for projects, what have you found to be some of the most interesting, unexpected, or challenging lessons that you've learned in the process?
[00:29:29] Unknown:
In terms of challenges, I really have had the opposite reaction, honestly. I came from a world without Kedro, for example being on projects using drag-and-drop tools, etcetera, and the developer experience was horrendous. Honestly, it's more like a breath of fresh air: I'm working directly with data scientists, we're sharing code, and the code is organized. I know, when I go into a new code repository, how things are structured and how I can explore the code. So for me, it's kind of resolved several challenges that I had before, if anything. I would say it does take some time to get up to speed on some of the more advanced features. But once you have done that and are comfortable extending, for example, the abstract dataset and creating custom data connectors, it can be really powerful and liberating, I would say. What are the cases where Kedro is the wrong choice and you might need to use some other approach or some other framework? Yeah, I think there are two situations where that applies.
One is if an organization is not interested in using Python. Right? We believe there's a very strong case for why Python is an excellent programming language to use for advanced analytics, but some organizations may have invested elsewhere in different languages. In that situation, it might not be right, especially if they're not interested in adopting Python. So that's one. And the other, I think I alluded to before: if an organization has gone 100% in on a drag-and-drop framework. I mean, that's not what Kedro is. Right? And, you know, I have seen certain organizations go in thinking that this is going to either increase the efficiency of their data engineering or resolve the need to hire data engineering talent, and I'm skeptical.
But if an organization has that mindset, they're gonna use those drag-and-drop tools. So, I mean, those are two situations
[00:31:48] Unknown:
where Kedro just wouldn't apply. And are there any other aspects of Kedro or the overall life cycle of data projects that we didn't discuss yet that you'd like to cover before we start to close out the show? I think we covered a lot. I mean, we talked about the trend towards open source,
[00:32:06] Unknown:
the trend away from vendor lock-in, which is kind of producing that, and the move in some of the technology to bring data science and data engineering closer, especially in regards to Python and PySpark. We talked about the difference between the drag-and-drop approach and the clean, organized-code approach to data engineering. So I feel like we talked a lot about the industry, and I hope I was able to present what I think Kedro has to offer in that regard.
[00:32:36] Unknown:
Yeah. It definitely looks like a very well thought through and well put together framework. I'm excited to see how it continues to evolve and grow as you head towards a 1.x release and beyond, and to see what sorts of adoption it is able to accrue as you add more capabilities.
[00:32:53] Unknown:
Absolutely.
[00:32:54] Unknown:
Yeah. So for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I mean, I think that's the million-dollar question. Right? I think you have a lot of investment in startups and a lot of investment in projects
[00:33:20] Unknown:
to make it easier for companies to get insight out of their data. And, you know, there are multiple approaches to it. I think that it's still an unresolved problem, because we have a shortage of talent and there's just an incredible need across multiple industries to leverage AI and analytics and use them to improve performance.
[00:33:45] Unknown:
So I don't have the answer. I think it's still something that we're all working towards, and it's something that we're working on at QuantumBlack for sure. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on Kedro at QuantumBlack. It's definitely an interesting framework. And as I said, I'm excited to see how it continues, and I'll probably be giving it a try for my own work fairly soon. So thank you for all of your efforts on that, and I hope you enjoy the rest of your day. Wonderful. Thank you so much, Tobias. It's been a pleasure, and I'm looking forward to hearing from you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Tom Goldenberg and Kedro
Tom's Journey in Data Management
Overview of Kedro
Kedro's Features and Best Practices
Implementation and Evolution of Kedro
Testing and Validation in Kedro
Data Source Integration
Comparison with Other Tools
Roadmap to 1.0 Release
Community Engagement and Plugins
Challenges and Lessons Learned
When Kedro is Not the Right Choice
Final Thoughts and Future of Data Management