Summary
Building an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that matter, so that you don’t waste time gluing the steps together. In this episode Tom Goldenberg explains how it works, how it is being used at QuantumBlack for customer projects, and how it can help you structure your own. Definitely worth a listen to gain more understanding of the benefits that a standardized process can provide.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, Data Council in Barcelona, and the Data Orchestration Summit. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scalable, deployable, robust, and versioned data pipelines.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Kedro is and its origin story?
- Who are the primary users of Kedro, and how does it fit into and impact the workflow of data engineers and data scientists?
- Can you talk through a typical lifecycle for a project that is built using Kedro?
- What are the overall features of Kedro and how do they compound to encourage best practices for data projects?
- How does the culture and background of QuantumBlack influence the design and capabilities of Kedro?
- What was the motivation for releasing it publicly as an open source framework?
- What are some examples of ways that Kedro is being used within QuantumBlack and how has that experience informed the design and direction of the project?
- Can you describe how Kedro itself is implemented and how it has evolved since you first started working on it?
- There has been a recent trend away from end-to-end ETL frameworks and toward a decoupled model that focuses on a programming target with pluggable execution. What are the industry pressures that are driving that shift and what are your thoughts on how that will manifest in the long term?
- How do the capabilities and focus of Kedro compare to similar projects such as Prefect and Dagster?
- It has not yet reached a stable release. What are the aspects of Kedro that are still in flux and where are the changes most concentrated?
- What is still missing for a stable 1.x release?
- What are some of the most interesting/innovative/unexpected ways that you have seen Kedro used?
- When is Kedro the wrong choice?
- What do you have in store for the future of Kedro?
Contact Info
- @tomgoldenberg on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Kedro
- QuantumBlack Labs
- Agolo
- McKinsey
- Airflow
- Docker
- Kubernetes
- Databricks
- Formula 1
- Kedro-Viz
- Dask
- pytest
- Azure Data Factory
- Prefect
- Dagster
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With 200 gigabit private networking, scalable shared block storage and a 40 gigabit public network, you've got everything you need to run a fast, reliable and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's linode, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as O'Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council.
Upcoming events include the O'Reilly AI Conference, the Strata Data Conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today. Your host is Tobias Macey, and today I'm interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scalable, deployable, robust, and versioned data pipelines.
[00:01:51] Unknown:
So, Tom, can you start by introducing yourself? Hi, Tobias. First of all, thanks for inviting me here. It's a pleasure to be here. A little bit about myself: I'm a junior principal data engineer at QuantumBlack, which is a McKinsey company. I'm based out of the Cambridge, Massachusetts office. And I'm happy to share a little bit about myself or Kedro. Where would you like to start first, Tobias?
[00:02:15] Unknown:
Let's start, if you can remember, with how you first got involved in the area of data management. Tell us a bit about that story.
[00:02:21] Unknown:
Sure. So it's interesting. I've been in the tech scene for quite a while. I started out working with a couple of startups in New York City; that's kind of where I got my start. One of the first startups I worked with was a company called Agolo, and they are a Microsoft-backed AI company. At the time, I wasn't heavily involved in what I would call the data science aspect of it, but I was always very intrigued by it. So I moved on from there. I co-founded a startup and was CTO of a fintech asset management startup. And it wasn't really until I got into consulting that I had the opportunity to play a leading role on a large-scale analytics transformation.
And this was for a client in the retail industry. It was then that I discovered that I really enjoy this, and I definitely saw the huge impact it had on the client at the time. And so that's why I eventually joined QuantumBlack here at McKinsey: to work with Fortune 500 clients and to help them transform and adapt and use these frameworks
[00:03:50] Unknown:
in the form of Kedro. So I'm wondering if you can describe a bit about what it is, some of the use cases that it enables, and some of the origin story of how it got started.
[00:04:01] Unknown:
Yeah, sure. Happy to answer that. So first of all, I did not create Kedro. I'm an avid user of Kedro and a big fan. But there are lots of people involved, out of QuantumBlack's London office primarily, but also here in the United States. The word Kedro is Greek for center; the idea is that it's essentially the center of your data pipeline. And what it offers, to really boil it down and simplify, is a template, kind of a boilerplate, for how to start a data science or data engineering project in Python. That's the core of what it offers. And from there, it offers an organization of the code, and the concept of data layers to organize the data that you're using.
And then there's a whole abstraction layer that it offers for the datasets that you use, as well as tools for both visualizing and organizing your data pipeline. I hope that clarifies. Happy to dig into any of those points a little bit more.
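For readers following along, the template Tom describes is what the `kedro new` command scaffolds. A rough sketch of the resulting layout, assuming a 0.15-era release and a hypothetical project name (the exact folders vary between Kedro versions):

```
my-kedro-project/
├── conf/
│   ├── base/            # shared configuration: catalog.yml, parameters.yml
│   └── local/           # credentials and per-user overrides, kept out of Git
├── data/
│   ├── 01_raw/          # immutable source data
│   ├── 02_intermediate/ # cleaned and typed data
│   └── ...              # further numbered layers (primary, features, models, ...)
├── notebooks/           # exploratory work
└── src/
    └── my_kedro_project/
        ├── nodes/       # pure Python functions
        └── pipeline.py  # wires the nodes into a pipeline
```

The numbered `data/` folders are the data-layer convention mentioned above: each stage of refinement gets its own layer, so it is always clear where a dataset sits in the flow.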
[00:05:18] Unknown:
Yeah. It's definitely a useful overview, and it seems that it was designed to span the disciplines between data engineers and data scientists, to serve as the focal point for collaboration and enable the overall life cycle of a data project. I'm wondering if you can dig into some of the ways that each of those different roles factors into the overall Kedro workflow?
[00:05:46] Unknown:
Yeah, absolutely. It's great that you call that out. In big data projects, you have data engineering and you have data science. And, you know, by the name of this podcast, we all know that data engineering is a huge chunk of what goes into driving the insights of machine learning and data science. The statistic is quoted pretty often that data engineering is 80% of it. And so what Kedro really emphasizes is that collaboration between data engineering and data science. So first of all, it starts with a shared code base.
So all of your pull requests in your Git workflow are made really simple, and you're able to share code. It's all in Python as well. That's a very deliberate decision: we think it's better for data engineering and data science to share the same programming language and the same code base. And so that's where Kedro starts. And on top of that, it provides a whole bunch of tools and functionality to basically streamline a lot of the stuff that would take
[00:06:54] Unknown:
a lot of the time to get started on a project. And what are the main feature sets that are present within Kedro, and what are some of the abstractions that are built in to encourage best practices for a data project and reduce some of the friction that exists at the outset, where you're left trying to decide on all the different tools and components and what the overall requirements are? I'm just curious what the feature set is and how it plays into some of that analysis paralysis.
[00:07:27] Unknown:
Yeah, absolutely. So in terms of which libraries you use, we are agnostic; it's a very open-ended platform. Whichever modeling libraries you use, whether it's TensorFlow or scikit-learn or whatever, there's not an opinion in that sense. Where there is an opinion is on certain best practices, what we consider best practice for software development, and this includes stuff like documentation. So documentation is built in.
There's a Kedro command-line tool to produce the documentation for the code base. But it also goes a step further in the sense that we have a visualization tool called Kedro-Viz, which takes a pipeline and provides a visual representation of the code, which is very helpful as well. I would also point to reusability and testability. I've mentioned before the data abstraction that it offers, and I think this is really an amazing feature: we can have different environments, and in each environment we'll have datasets that we refer to by a certain name. But we can change the parameters of those datasets so that in our test environment they're pointing to one location, and then our QA or prod environments are pointing somewhere else.
So those are the kinds of configuration options it gives you to run the same pipeline on multiple environments. It makes it possible to test before deploying, etcetera, and to reuse parts of the code for multiple purposes. That's what I would say is the main feature set. There's lots of stuff in development now as well, such as more robust deployment options: we have tools that incorporate Airflow and Docker, we're working on Kubernetes, and we're also covering deployment to Databricks, with clear guides on how to do that. There's also a data and code versioning feature that's actively being worked on to make that more robust. And that's where we are, what features we're prioritizing, and what we have currently.
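To make that environment mechanism concrete, here is a hedged sketch of the same catalog entry defined in two environments. The dataset name and bucket are hypothetical, and dataset type names have changed across Kedro releases (modern versions use spellings like `pandas.CSVDataSet`, while 0.15-era releases used names like `CSVLocalDataSet` and `CSVS3DataSet`):

```yaml
# conf/base/catalog.yml -- the default environment
customer_orders:
  type: pandas.CSVDataSet        # 0.15-era name: CSVLocalDataSet
  filepath: data/01_raw/customer_orders.csv

# conf/prod/catalog.yml -- overrides the same entry for production runs
customer_orders:
  type: pandas.CSVDataSet        # 0.15-era name: CSVS3DataSet
  filepath: s3://acme-prod-data/raw/customer_orders.csv
```

Pipeline code refers to `customer_orders` only by name, so a command such as `kedro run --env=prod` swaps the physical location without touching any Python.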
[00:09:49] Unknown:
And as I mentioned earlier, this grew out of the patterns that you identified through a number of different client engagements at QuantumBlack. I'm curious how much of the culture and technical acumen of the people working there factored into the design and priorities of Kedro, and some of the ways that that manifests?
[00:10:11] Unknown:
Yeah, absolutely. So a couple of things come to mind with that. The first thing is QuantumBlack itself: for those that don't know, we are a consulting firm that provides advanced analytics solutions to clients. And QuantumBlack got its start in Formula 1 racing. So there's always been this aspect of solving really tough problems, but also coming up with a big impact in a short time and improving the performance of organizations. In that respect, the culture really comes out in the fact that we have a short time to deliver this huge impact, and we don't wanna sacrifice code quality, documentation, all these kinds of things. How can we build that into what we're developing so that we're able to streamline those processes?
So that's one way I would say that the culture or the history of QuantumBlack has led to Kedro. The other thing I would say is that QuantumBlack has a very cross-collaborative culture. We alluded to this before with DE and DS, data engineering and data science, working closely together in the same teams, working with the same code base, reviewing each other's code, and really just providing more eyes on code reviews and the quality of the code. Another interesting thing about QuantumBlack is that there is a strong design aesthetic as well.
And so you'll see this in, as I mentioned, Kedro-Viz, which is the visualization tool to see your pipeline, and it provides a really interesting view of that. But I think it goes beyond just the visual part of the design. I think it comes into play in the actual design of the library and of the templates, where we've really stressed simplicity, whether it's the data abstraction or the folder organization. We use this term MECE, which is mutually exclusive and collectively exhaustive; it's boiling things down to their ultimate parts. And I feel like that simplicity really comes through in Kedro projects.
[00:12:31] Unknown:
And the fact that this is oriented largely toward client engagements is a benefit as well, because having this standardized approach with a lot of built-in boilerplate means that you're not wasting time trying to come up with new documentation and training regimens for handoff to the customer once the project is done and they need to be able to maintain it going forward.
[00:12:55] Unknown:
Yeah. Or, you know, feeling rushed to do those things at the end of a project. It really kind of forces you onto the right path, where those things are considered from day one, and you know that you're getting quality at the end. Exactly.
[00:13:15] Unknown:
And can you describe a bit about how Kedro itself is implemented, some of the evolution that it's undergone since it was first started, and some of the main libraries and language features that have been incorporated into the framework?
[00:13:30] Unknown:
Sure, I can speak to that. So Kedro came about, as I think I alluded to before, about two years ago in a client engagement. We had a group of machine learning engineers that were just kind of frustrated doing these things that we're talking about, all the boilerplate and standard activity that goes into a data pipeline. And that's when they had the idea of creating this. As for the main components: there is what we call an IO layer, and that's what I mentioned about the data abstraction. What that means, as I mentioned, is that you can create multiple environments.
For each environment, you have a YAML file where you can specify that my dataset points to a certain folder and a certain file type, whether CSV, Parquet, etcetera, or it could be a cloud-based dataset as well. That was part of the original vision, this IO layer. That, the pipeline, and the templating of the code were the three original features, and that has remained the core of Kedro. I think what has happened since then is that we've used it with more and more clients, dozens of clients.
And it's just gotten more mature, and we've built some functionality on top of that. So, for example, those datasets now have versioning; the ability to version datasets was added on as we went along. And the feature requests continue to come. We're working on, for example, Dask support, which is coming next, a Kedro API for extensibility, and reusable pipelines. But I think the core of the functionality is what has been there from the beginning, and we've built all these layers on top of that to make things easier. Does that make sense? Yeah. Definitely.
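The pipeline piece Tom mentions is a small Python API. A minimal sketch, using the `node` and `Pipeline` constructors from `kedro.pipeline` (real, though their exact signatures have shifted between releases) and hypothetical function and dataset names:

```python
import pandas as pd
from kedro.pipeline import Pipeline, node


def preprocess_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # A node is just a pure Python function: data in, data out.
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    return orders.dropna(subset=["customer_id"])


def count_orders(orders: pd.DataFrame) -> pd.DataFrame:
    return orders.groupby("customer_id").size().reset_index(name="n_orders")


# Inputs and outputs are catalog entry names; the IO layer resolves them
# to concrete files, tables, or cloud storage at run time.
pipeline = Pipeline(
    [
        node(preprocess_orders, inputs="customer_orders", outputs="clean_orders"),
        node(count_orders, inputs="clean_orders", outputs="orders_per_customer"),
    ]
)
```

The dataset versioning he describes is switched on per catalog entry: adding `versioned: true` to an entry makes each run save to, and later load from, a timestamped location.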
[00:15:30] Unknown:
And I find it interesting that there has been a recent movement toward building an abstraction layer for the data programming piece itself and separating that from the actual execution context, which is the opposite of how projects such as Airflow and Luigi and a lot of the more traditional ETL tools have been built, where everything is all in one framework. But now we have projects such as Kedro, which you're building, and the Prefect library, and Dagster, which are focusing more on just the programming primitives and the overall logical flow, and then having a pluggable execution layer and scheduling layer so that you can mix and match what's already present in your environment. I'm just curious what your experience has been on that front and what your thoughts are on the benefits of that approach versus how we have been dealing with things up until now.
[00:16:27] Unknown:
Yeah. I mean, you bring up a lot of points. But just to pull the thread a little bit, I think that there are a lot of approaches to the problem that companies are facing. And, essentially, the problem is that data pipelines are hard. They're hard to build, they require a lot of resources, and they're hard to deploy. All of the approaches are aimed at making that simpler. One approach that we've seen is the rise of drag-and-drop or GUI tools to create more simplicity in, you know, ETL or feature engineering, etcetera.
And I think the idea behind that has been: it's hard to get resources for data engineering and data science, so let's create drag-and-drop tools to make it easier. At least my past experience working with clients has been that that actually tends to add further complication. So with Kedro, we've taken a step back. We said: hey, we're not building a drag-and-drop tool. We're not making it simpler in that sense. We're not locking you into a tool that you have to use.
We're just creating a way to organize your Python code. And we're still solving the same problem. We're making things simpler. We're taking a complex problem and boiling it down to: what do you really need to do to achieve that? And we're trying to take all the stuff that just takes time and energy and make it a lot simpler. So I think there has been a trend away from proprietary frameworks to open source. I think there is still this push to simplify the data engineering and data science process, and we've seen that play out whether it's with drag-and-drop GUI tools, open source libraries, or proprietary tools.
I mean, the problem is very real. The approaches are gonna be different, and we've taken a stand on what we think is a really effective way to solve it. So all these things are trends that are happening. I think the trend towards open source is also important. One thing that we're seeing among clients is a concern about lock-in. They don't wanna be locked into a particular vendor, and they don't wanna be locked into a certain cloud provider. So the ability to use these open source tools, like you mentioned, is a big movement, where they are not locked into any particular company. All these things are playing out, and we think that we're offering a good approach to simplify that process as well.
[00:19:19] Unknown:
And the other benefit of having the programming layer be one piece and having everything else be pluggable is that you can isolate the logic from the actual execution context, which I'm sure makes testing and validation easier. I'm curious what the overall approach looks like for running an integration test or unit test on a Kedro project, and any tooling that you have built into the framework to simplify that operation?
[00:19:52] Unknown:
Yeah, that's a good point. We are integrated with pytest. I don't have specific examples I can give you, but one approach that you can use, for example with the data abstraction layer, is having test datasets that you can run the pipeline on, and then running tests on the results as well. That's one approach we've seen. I think that surfacing the different testing approaches and bringing them to light more is something that we're working to improve. But I think that the abstraction of the data layer makes it actually very easy to do, like you say, an integration test on a different environment, with maybe a sample dataset, to test the results, etcetera.
I think the data and code versioning will help with that as well, and that's something that's active on the future roadmap. So, yeah, these are all things that we've heard from clients and users and are looking to enhance and improve.
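Because Kedro nodes are plain functions, the unit-testing story is ordinary pytest. A minimal, self-contained sketch reusing the hypothetical `preprocess_orders` node from the earlier pipeline example:

```python
import pandas as pd


def preprocess_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # In a real project this would be imported from the package's nodes
    # module; it is repeated here so the test file stands alone.
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    return orders.dropna(subset=["customer_id"])


def test_preprocess_orders_drops_rows_without_customer():
    raw = pd.DataFrame(
        {
            "customer_id": [1, None, 2],
            "order_date": ["2019-01-01", "2019-01-02", "2019-01-03"],
        }
    )
    clean = preprocess_orders(raw)
    assert len(clean) == 2  # the row with a missing customer_id is dropped
    assert pd.api.types.is_datetime64_any_dtype(clean["order_date"])
```

The integration-style approach Tom describes works the same way at a larger grain: point a test environment's catalog at small fixture datasets and run the pipeline end to end.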
[00:20:59] Unknown:
And another thing that you mentioned is the built-in support for integrating with different data sources to pull data into your project. I'm wondering if you can talk a bit more about some of the sources that you have built in and the overall approach that you've taken to abstract out the core principles while still allowing for specific features of different storage backends?
[00:21:24] Unknown:
Yeah, happy to explain that. So, yes, we do have support for specific data connectors. For local files, of course, you have Parquet, CSV, pickle, etcetera, and we have connectors for S3 and PySpark as well. But when you look at Kedro, the real benefit is maybe not the number of connectors per se, but that the ability to extend and create custom connectors is really, really easy. If you go to the Kedro docs, you'll see a section that talks about the abstract dataset, and that's the core class that all of the data connectors are built on. Extending it is really simple: it's just a load and a save method on a class. We've worked with clients, for example, that are using Azure Data Lake Storage Gen2, where some aspects might be different from Gen1, or Blob storage as well.
And so it's really easy to extend the data connectors, the abstract dataset, to all these different backends, whether it's different cloud providers or different types of databases, etcetera. So those are some of the ways that we've seen people leverage the data abstraction.
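A minimal sketch of that extension point, assuming the 0.15-era base class `kedro.io.AbstractDataSet`, whose subclasses implement `_load`, `_save`, and `_describe` (the newline-delimited-JSON format and class name here are purely illustrative):

```python
import json
from pathlib import Path
from typing import Any, Dict, List

from kedro.io import AbstractDataSet


class JSONLinesDataSet(AbstractDataSet):
    """Hypothetical custom connector for newline-delimited JSON files."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self) -> List[Dict[str, Any]]:
        # Each line of the file is one JSON record.
        with self._filepath.open() as f:
            return [json.loads(line) for line in f]

    def _save(self, data: List[Dict[str, Any]]) -> None:
        with self._filepath.open("w") as f:
            for record in data:
                f.write(json.dumps(record) + "\n")

    def _describe(self) -> Dict[str, Any]:
        # Used by Kedro in logging and error messages.
        return {"filepath": str(self._filepath)}
```

Once the class is importable, a catalog entry can reference it by import path (e.g. `type: my_kedro_project.io.JSONLinesDataSet`) and it behaves like any built-in dataset.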
[00:22:53] Unknown:
And I'm wondering if you can talk a bit more about some of the ways that Kedro is differentiated from projects such as Prefect and Dagster, which I think are most analogous to what you're doing with Kedro, but also from projects such as Airflow and other workflow management tools, particularly when it comes to things like handling failures and retries and building in some of the common error modes and recovery capabilities?
[00:23:23] Unknown:
Yeah, sure. So first of all, in terms of Airflow, which you mentioned: Kedro integrates with Airflow. Kedro is just an organization of the pipeline; it's not a scheduler. So you could integrate it with Airflow, or you could schedule it via different options, for example Data Factory, etcetera, or via a Databricks notebook. So it does integrate nicely with Airflow. In terms of the other projects you mentioned, I've never actually used those in any projects. I have looked at some of the documentation. I would say, for me, where Kedro stands out is its simplicity, in terms of the way that the code is structured when you create a new Kedro project.
The syntax that you use in constructing a pipeline, constructing nodes, etcetera, is gonna be very simple and elegant. A lot of thought has gone into making it as simple as possible, and that's where we have really focused, along with the data abstraction layer. From what I've seen, there are definitely a lot of shared elements as well, like visualization of the pipeline, etcetera, and certain principles are definitely shared across the other two that you mentioned. Of the two, I've gotten to look at Dagster; I haven't seen the other one.
But that's where I would say that Kedro has put most of its effort: in creating a really rich developer experience, so very easy to use and very simplified, as well as the richness of its data abstraction layer. That's where I'd say it stands out.
[00:25:13] Unknown:
And when I was looking through, I noticed that Kedro is currently at, I think, the 0.14 release. I'm curious what you think is still missing for a 1.x release, which aspects are in the heaviest state of flux and change, and what effort you have planned for pushing toward that release?
[00:25:40] Unknown:
Yeah, I think that the 1.x release is imminent. The bar that we've set is that we want to continue to improve and enhance our versioning feature set: not just dataset versioning, but data and code versioning, so that you can truly reproduce any run. The other aspect is getting stability and feedback from users on the project templates, and we've introduced a context API. So I think we're still in the process of that iterative loop, getting feedback and stabilizing those features. What I mentioned before was at the core of Kedro: the main project template, the IO layer, and the pipeline.
Those have remained very stable, and we do feel that they are very stable. But in terms of getting to a 1.0 release, where we have set the bar is getting full stability in the context API and really getting to a very mature state with our versioning capabilities, data and code, etcetera.
[00:26:58] Unknown:
And you mentioned that there is a plug-in capability in Kedro and that you're working toward a more standardized approach to that. I'm curious what types of plug-ins are currently present, both ones that you've built within QuantumBlack and some of the community-contributed ones, and just how much engagement you've seen from users outside of QuantumBlack?
[00:27:22] Unknown:
Yeah, I can speak a little bit to that. So in terms of plug-ins, I believe I mentioned that there is a plug-in for Docker, Kedro-Docker. Obviously, there's Kedro-Viz, which provides the visualization of the pipeline. There's also Kedro-Airflow. We have seen great contributions from the community. One of the contributions has been looking at creating reusable pipelines; in other words, maybe not even from a data science perspective, but using the pipelining tool to create workflows that can be reproduced across multiple datasets. So we've seen Kedro being used in some interesting ways. We've seen it being adapted and used in academia at several universities.
I just learned recently that it's been adopted at Imperial College London, Oxford University, and the University of Cape Town in South Africa, and they're using it to reproduce work for some machine learning papers. We've started to see more streaming examples, leveraging Kafka. So I think people are using it in ways that we might not have anticipated. We're still in the process of working with the community, extending it in ways that people are requesting, and learning from the way that people are using it. And what are some of the most interesting or innovative or unexpected ways that you have seen it used? So, like I said, the reusable pipelines: that was something that we hadn't considered, being used as basically a workflow template to run across different datasets. So that's something that we started to see and to incorporate. And like I said, some of the adoption in academia was also a pleasant surprise.
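For reference, the plug-ins mentioned above ship as separate packages that hook extra sub-commands into the Kedro CLI. Roughly, per each plug-in's own documentation (command names may change between releases):

```
pip install kedro-docker kedro-airflow

kedro docker build     # package the project as a Docker image
kedro docker run       # run the pipeline inside the container
kedro airflow create   # generate an Airflow DAG from the Kedro pipeline
```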
[00:29:17] Unknown:
And in your own experience of both working on Kedro and using it for projects, what have you found to be some of the most interesting, unexpected, or challenging lessons that you've learned in the process?
[00:29:29] Unknown:
In terms of challenges, I really have had the opposite reaction, honestly. I came from a world without Kedro, for example being on projects using drag-and-drop tools, etcetera, and the developer experience was horrendous. Honestly, it's more like a breath of fresh air: I'm working directly with data scientists, we're sharing code, and the code is organized. I know, when I go into a new code repository, how things are structured and how I can explore the code. So for me, it's kind of resolved several challenges that I had before, if anything. I would say it does take some time to get up to speed on some of the more advanced features. But once you have done that and are comfortable extending, for example, the abstract dataset and creating custom data connectors, it can be really powerful and liberating, I would say. What are the cases where Kedro is the wrong choice and you might need to use some other approach or some other framework? Yeah, I think there are two situations where that applies.
One is if an organization is not interested in using Python. Right? We believe there's a very strong case for why Python is an excellent programming language to use for advanced analytics, but some organizations may have invested elsewhere in different languages. In that situation, it might not be right, especially if they're not interested in adopting Python. So that's one. And the other, I think I alluded to before: if an organization has gone 100% in on a drag-and-drop framework. I mean, that's not what Kedro is. Right? And, you know, I have seen certain organizations go in thinking that this is going to either increase the efficiency of their data engineering or resolve the need to hire data engineering talent, and I'm skeptical.
But if an organization has that mindset, they're gonna use those drag-and-drop tools. So, I mean, those are two situations
[00:31:48] Unknown:
where Kedro just wouldn't apply. And are there any other aspects of Kedro or the overall life cycle of data projects that we didn't discuss yet that you'd like to cover before we start to close out the show? I think we covered a lot. I mean, we talked about the trend towards open source,
[00:32:06] Unknown:
the trend away from vendor lock-in, which is kind of producing that, and the move in some of the technology to bring data science and data engineering closer, especially in regards to Python and PySpark. We talked about the difference between the drag-and-drop approach and the clean, organized-code approach to data engineering. So I feel like we talked a lot about the industry, and I hope I was able to present what I think Kedro has to offer in that regard.
[00:32:36] Unknown:
Yeah. It definitely looks like a very well thought through and well put together framework. I'm excited to see how it continues to evolve and grow as you head towards a 1.x release and beyond, and to see what sorts of adoption it is able to accrue as you add more capabilities.
[00:32:53] Unknown:
Absolutely.
[00:32:54] Unknown:
Yeah. So for anybody who wants to follow along with the work that you're doing and get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. I mean, I think that's the million-dollar question. Right? I think you have a lot of investment in startups and a lot of investment in projects
[00:33:20] Unknown:
to make it easier for companies to get insight out of their data. And, you know, there are multiple approaches to it. I think that it's still an unresolved problem, because we have a shortage of talent and there's just an incredible need across multiple industries to leverage AI and analytics and use them to improve performance.
[00:33:45] Unknown:
So I don't have the answer. I think it's still something that we're all working towards, and it's something that we're working on at QuantumBlack for sure. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing on Kedro at QuantumBlack. It's definitely an interesting framework. And as I said, I'm excited to see how it continues, and I'll probably be giving it a try for my own work fairly soon. So thank you for all of your efforts on that, and I hope you enjoy the rest of your day. Wonderful. Thank you so much, Tobias. It's been a pleasure, and I'm looking forward to hearing from you. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Tom Goldenberg and Kedro
Tom's Journey in Data Management
Overview of Kedro
Kedro's Features and Best Practices
Implementation and Evolution of Kedro
Testing and Validation in Kedro
Data Source Integration
Comparison with Other Tools
Roadmap to 1.0 Release
Community Engagement and Plugins
Challenges and Lessons Learned
When Kedro is Not the Right Choice
Final Thoughts and Future of Data Management