Summary
The core task of data engineering is managing the flows of data through an organization. Ensuring that those flows execute on schedule and without error is the role of the data orchestrator. Which orchestration engine you choose impacts how you architect the rest of your data platform. In this episode Hugo Lu shares his thoughts, as the founder of an orchestration company, on how to think about data orchestration and data platform design as we navigate the current era of data engineering.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- It’s 2024, why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity.
- As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us don't miss Data Citizens® Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens® Dialogues, industry leaders unpack data’s impact on the world, from big picture questions like AI governance and data sharing to more nuanced questions like, how do we balance offense and defense in data management? In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now! Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
- Your host is Tobias Macey and today I'm interviewing Hugo Lu about the data platform and orchestration ecosystem and how to navigate the available options
- Introduction
- How did you get involved in building data platforms?
- Can you describe what an orchestrator is in the context of data platforms?
- There are many other contexts in which orchestration is necessary. What are some examples of how orchestrators have adapted (or failed to adapt) to the times?
- What are the core features that are necessary for an orchestrator to have when dealing with data-oriented workflows?
- Beyond the bare necessities, what are some of the other features and design considerations that go into building a first-class data platform or orchestration system?
- There have been several generations of orchestration engines over the past several years. How would you characterize the different coarse groupings of orchestration engines across those generational boundaries?
- How do the characteristics of a data orchestrator influence the overarching architecture of an organization's data platform/data operations?
- What about the reverse?
- How have the cycles of ML and AI workflow requirements impacted the design requirements for data orchestrators?
- What are the most interesting, innovative, or unexpected ways that you have seen data orchestrators used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data orchestration?
- When is an orchestrator the wrong choice?
- What are your predictions and/or hopes for the future of data orchestration?
Parting Question
- From your perspective, what is the biggest thing data teams are missing in the technology today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Orchestra
- Previous Episode: Overview Of The State Of Data Orchestration
- Cron
- ArgoCD
- DAG
- Kubernetes
- Data Mesh
- Airflow
- SSIS (SQL Server Integration Services)
- Pentaho
- Kettle
- DataVolo
- NiFi
- Dagster
- gRPC
- Coalesce
- dbt
- DataHub
- Palantir
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Hugo Lu about the data platform and orchestration ecosystem and how to navigate the available options. So, Hugo, can you start by introducing yourself?
[00:00:28] Hugo Lu:
Of course. Great to be here, Tobias. So I'm Hugo Lu. I'm the CEO and cofounder of Orchestra, which is a unified control plane for data. Prior to this, you know, I'm sort of like all those people that fell into data by chance. My first stint was in investment banking, and then I moved into strategy at a company called Jewel, picked up data, and it's kind of been history ever since. So, yeah, thanks for having me. Looking forward to this.
[00:00:54] Tobias Macey:
And you mentioned that you founded Orchestra, which is a company focused on orchestration. We're not going to spend a lot of time on what you're building specifically, but rather on orchestration generally and how it impacts the rest of the choices that you make about how you work with data. And I'm wondering if you can start by giving your definition of what constitutes an orchestrator and orchestration in that data context.
[00:01:19] Hugo Lu:
Sure. I think it's really interesting when you try to build a data platform. Right? Because you think about where you wanna put your data, what you wanna use to, you know, change stuff in it. So, like, a compute engine. But fundamentally, if you don't have something triggering something, then nothing is ever gonna happen. So that's sort of where I see an orchestration tool coming in. I would just define it as a way to schedule, trigger, and monitor things. Nice and short.
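That three-verb definition, schedule, trigger, and monitor, can be sketched in a few lines of plain Python. This is only an illustrative toy, not any particular tool's API; all of the names here are invented:

```python
import time

def run_scheduled(job, every_seconds, iterations):
    """Schedule: run `job` at a fixed interval; trigger it; monitor the outcome."""
    history = []  # monitoring: record each run's status
    for _ in range(iterations):
        started = time.time()
        try:
            job()                      # trigger the unit of work
            history.append("success")  # monitor: record the result
        except Exception:
            history.append("failed")
        # schedule: sleep off whatever remains of the interval
        time.sleep(max(0.0, every_seconds - (time.time() - started)))
    return history

statuses = run_scheduled(lambda: None, every_seconds=0.01, iterations=3)
print(statuses)
```

Everything an orchestration engine adds, retries, dependency graphs, distributed workers, alerting, is elaboration on this loop.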
[00:01:51] Tobias Macey:
And orchestration as a practice and as a principle is something that has existed since well before computing, but it has been translated into the computing environment in various forms. Maybe the most notable and most long-lived one is cron: I want this thing to happen at this interval. Many people have outgrown it, and many people still use it for various use cases, but other aspects of orchestration are things like CI/CD pipelines, where you wanna make sure that your software builds get through and get tested, etcetera. Orchestration and scheduling are also generally linked, and then you start getting into things like Kubernetes and its internal scheduling system, which orchestrates all of the different moving pieces, which has then led to outgrowths of things like Argo CD, which has also made forays into the data space.
And I'm wondering if you could just talk to some of the ways that that idea of scheduling and orchestration has been kind of conflated and jammed into various shapes and places and how the specifics of the ways that the orchestration is managed and executed and scheduled influences the ways that it actually works within a given use case.
[00:03:11] Hugo Lu:
Yes. Absolutely. A lot to unpack there, but I think you kind of hit the nail on the head. Right? The process of having, you know, I wanna complete this task, and then I've got multiple dependencies, and then I wanna do those things, and then there are multiple dependencies after that, is a practice that is as old as computing. And, you know, if you speak to anyone on the software side, like, orchestration is not a thing. Right? If you need to execute, like, a series of tasks in, like, a directed acyclic graph or a DAG, that sort of functionality is built into a lot of things that have names. So, you know, you mentioned Kubernetes just as an example.
You know, it's a great example. Right? There are a lot of dependencies and processes that need to happen within a Kubernetes cluster, and, obviously, it's got a scheduler too. I think the reason orchestration has got its own sort of area in data is probably because a lot of the, like, processes we have are split into different areas. Right? So if anyone's ever built a data ingestion system, that has to have an orchestration component too, because maybe you need to, you know, trigger parallel fetches of data, put it into a staging area, run quality checks, you know, move it somewhere, change the format, and then push it to a final destination. Right? That's not just gonna be handled in one big script.
But the fact that we have these, you know, things that do data ingestion, things that transform data, maybe things that do, you know, transformation and then ingestion and maybe a little bit more, means that there's a need to, like, monitor lots of different things in different places. So as a result, a lot of the engineering time that, you know, data teams spend is saying, okay, well, I've got all these components. How do I stitch this system together? And, you know, the word that prevails here is orchestration. Right? You stitch it together with an orchestration tool. So, yeah, I think that's more or less where it fits in.
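The ingestion flow described above, parallel fetches feeding a staging area, quality checks, then a final load, is exactly what a DAG expresses. A minimal, library-free sketch using Python's standard-library topological sorter, with task names invented for illustration:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "fetch_orders":    set(),
    "fetch_customers": set(),  # the two fetches have no dependencies, so they could run in parallel
    "stage":           {"fetch_orders", "fetch_customers"},
    "quality_checks":  {"stage"},
    "load_warehouse":  {"quality_checks"},
}

def run(dag):
    """Execute tasks in a valid dependency order (serially here, for simplicity)."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        pass  # a real orchestrator would dispatch the task's actual work here
    return order

order = run(dag)
print(order)
```

A real orchestrator would dispatch each task to a worker, track retries, and record run status; the point here is just that the execution order falls out of the declared dependencies.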
[00:05:21] Tobias Macey:
As you noted, orchestration is something that finds its way into almost every piece of software in some fashion, which leads to a lot of complexity and confusion as you're selecting which piece of the stack is going to own, which pieces of sequencing and the overall control. And if you do allow all of those different pieces to delegate a certain layer of orchestration, then you end up in the situation of having to stitch back together the view of what are all those pieces, how are they happening, and when versus having a centralized orchestration engine that says, I'm going to take control over all of these things. You don't do anything by yourself unless I tell you to. And, obviously, those two extremes have a big impact on the overall architecture of the data platform.
And I'm wondering if you can talk to some of the ways that you've seen those gradations take shape as people build their data systems and their data workflows and how they try to make sense of how data is moving through their organization.
[00:06:26] Hugo Lu:
Yeah. Definitely. I think a helpful lens here is attacking it from, like, a maturity standpoint. Right? So, you know, many people that are trying to build a data platform have started from day 1. Right? And, you know, on day 1, you might not have loads of people relying on loads of reports. So you maybe have a couple of scripts that are cleaning some data. They're storing it somewhere. Maybe you do, you know, a little bit of cleaning, and then, you know, you're kinda done. Right? People will have a dashboard that's directly querying it, or maybe people will just go in and get that data and do some fun stuff with it, download it to Excel. But the orchestration is not complicated here. Right? You can sort of move stuff and then have something else triggered when it needs to be.
Obviously, as you grow, that gets more complicated. Right? What happens if you have a big dataset and you're using something like a Power BI or a Tableau? You need to trigger an extract refresh. What happens if you have a lot of data and you need a complicated data model? Right? You might have hundreds or thousands of tables. What happens if you have 30 different sources of data that people are relying on? You can't just have one ingestion tool. Maybe you have multiple ingestion services. Maybe some of that's streaming. So the question then becomes, how do you stitch all of that together and get visibility, while leveraging all of those components you've already got to their fullest extent?
And I think at that point, it becomes really, really difficult to have all of those different systems talking to each other. Right? It's like, in the sort of software world, you might have, you know, different services that speak to each other. Right? They send each other events. It's all choreographed. Right? You don't orchestrate many software systems. The difference here is that we're dealing with data. So, you know, if every service doesn't have access to the same data, it becomes very expensive and very slow to make that work. And as a result, it can be helpful to have a sort of control layer on top of all these different services, because you don't have this huge data dependency in software like you do in data.
[00:08:34] Tobias Macey:
One of the approaches to gaining that visibility, which is largely an artifact of how you think about where that control lies and what the motivating force for the propagation of data is, is the idea of an overarching metadata catalog that all of your different tools integrate with. It either pulls data from them or they push data to it, so that you can see across all of the different pieces of software and technology: this is all the data that I have, this is how it moved, etcetera, etcetera. Whereas different orchestration engines have also tried to pull that into the core of their functionality: I am going to own everything, so I will be the repository of metadata and give you visibility across these different layers.
And I'm curious how you've seen those philosophies play out in your experience of working in this space and working with customers.
[00:09:28] Hugo Lu:
Yeah. No. Look. I hear it. Again, like, a lot to unpack. And I think we should start with the problem people are trying to solve. A lot of the time, there's a data team that is scaling or at scale. The consumers of data, particularly, like, if you're doing BI, really struggle to get trust in it. Right? It's like, you're leading a data platform. You've got 15 hardcore engineers. But at the end of the day, some of the datasets that you're building are for people in product, they're for people in marketing, they're for people in finance. Right? And they've got to come to you and say, hey, like, is this data fresh? Like, something looks a bit funky. I don't really know what's going on. And, you know, you then have this pattern, right, where on the one hand, you have this central team, or many central teams. And then on the other hand, you have the consumer.
And the consumer basically has no idea what's going on. So the solution is to say, ah, well, you know, we as the central team can give you a catalog. The catalog will show you what's going on. I will train you to use the catalog. You know, we'll pay lots of money for the catalog. We'll maintain the catalog. But this is the way that you understand what's going on. This is how you can get trust in the data. And, you know, this is, like, a really tricky pattern to make work, because fundamentally, you have, like, a bottleneck, or many bottlenecks, who actually know what the hell is going on. So I think this is the first thing we see playing out. Right? At scale, even with a catalog, people struggle to work out what's going on, which is bad, because as a data team, your goal is to help them know what's going on so they can use data to make decisions. So that's the first thing. Second thing is that as a data team, it's a lot of effort to make that pattern work. Like, I was speaking to a fast-growing technology company. They have about 1,500 employees. They're doing data mesh. Right? So they're saying, hey, we're gonna give everybody the tools they need to build their own pipelines.
And they're super highly technical. The end users are backend engineers. And even then, it's taken them almost 2 years to stand up Airflow and, like, parameterize it, in the sense of having, like, a sort of YAML-based domain-specific language on top that anybody can use. And on top of that, right, even after they've written all those pipelines, they have to write yet more code to keep their catalog up to date. And it's taken them, you know, 6 or 7 platform engineers a year and a half, and only backend engineers can use what they've built. Right? They haven't even started on marketing or finance yet. I asked the lead engineer, how many Airflow instances have you got? He said, oh, I've lost count at this point. You know? Like, you can do this pattern, and it just takes an enormous amount of effort and resource. And, you know, if you've not done it before, I would say there's quite a high chance of failure. Right? So, you know, I think that's the second component. It takes a lot of investment to, you know, not only stitch everything together, but also surface it in a useful way. So this is part of the reason that there are some, quote, unquote, orchestration tools that are trying to be the catalog, because, you know, the orchestration tool triggers and monitors everything. So it has all the context. It has all the metadata. Right? It's got all those juicy run IDs, which you wanna monitor over time.
So from an architectural perspective, it would make sense to kind of put a catalog there.
[00:12:48] Tobias Macey:
The challenge there, though, is that by having that be the nexus of metadata, it then forces you to use that for situations where it's maybe not the appropriate fit for owning a certain data flow just so that you can get the metadata into it versus if you have 90% of your metadata in your orchestrator and only 10% of your workflows live outside of it, you then have to add a whole other software layer just to be able to track those disparate pieces.
[00:13:19] Hugo Lu:
Right. Yeah. You hit the nail on the head. And this is the issue people find with Airflow. Right? It's completely agnostic. So you can sort of trigger and monitor any Python processes. But a single task can be, like, a print statement, or a function that prints, like, hello world. It does nothing. You have to write everything yourself. And what we see is data teams spending time building pipelines to fetch metadata that is generated by their pipelines, and then building dbt models to, like, clean that metadata, and then building dashboards themselves to monitor the metadata, and then building alerting systems on the metadata, all themselves. You know, I think in some cases, it's probably, like, genuinely, like, a doubling of work just to know what's going on, which is insane.
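To make that concrete: a bare task can be nothing more than a Python callable, so the orchestrator learns nothing from it, and teams end up wrapping every task to emit run metadata themselves. A library-free sketch of that extra work, with the wrapper and record fields invented for illustration (not Airflow's actual API):

```python
import time
import uuid

def hello():
    print("hello world")  # a "task" in the loosest sense: it emits no metadata at all

run_log = []  # the kind of metadata store teams end up building pipelines around

def tracked(task):
    """Wrap a bare callable so each run emits a record worth monitoring."""
    def wrapper():
        record = {"run_id": str(uuid.uuid4()), "task": task.__name__,
                  "started": time.time(), "status": "running"}
        try:
            task()
            record["status"] = "success"
        except Exception as exc:
            record["status"] = "failed"
            record["error"] = repr(exc)
        record["finished"] = time.time()
        run_log.append(record)
    return wrapper

tracked(hello)()
print(run_log[0]["task"], run_log[0]["status"])
```

Every task in the platform needs this treatment before a dashboard or alerting system has anything to read, which is the "doubling of work" being described.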
[00:14:08] Tobias Macey:
It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources, and crushing morale. Datafold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity.
You're calling out Airflow, and its Python orientation is also another angle to the impact that orchestration systems have on the overall architectural choices of your system, because some of these orchestration systems are very much oriented to a specific language or a specific mode of interaction. And that influences the ways that you think about hiring, who works on all of these different data flows, who is able to interact with it and control it, versus other orchestration systems that are going to the other extreme of low code: take whatever language runtime you want, we're just going to let you click and drag things together, and it'll all be amazing. What have you got in mind? Nothing specific, but I think the ones that come to mind most readily are, like, the Kettle and Pentaho
[00:15:42] Hugo Lu:
of, I don't know, 10 or 15 years ago, and the Microsoft SQL Server Integration Services and things like that. Yeah. But, like, it's really interesting, right, because it's almost like we're coming full circle. You mentioned SSIS, SQL Server Integration Services. That's another good example of a product that does what it's meant to do, but also has orchestration within it. You know, people will have seen that Snowflake recently acquired a company called DataVolo, and that's based on NiFi. Again, the same thing. It's fundamentally a low-code, well, a no-code tool for moving, ingesting, and transforming data. But within it, you can do orchestration. But the point is, like, the problem it solves is, okay, how do I take data from these places and put it in that place in the format I want it? And it does all of that in one go. And with the advent of, like, the modern data stack and things getting more complicated and, you know, all the things that are driving us to make more complex systems, you lose out on orchestration, because you have these different components that are very good at doing one thing.
Whereas before, you just had packages that had it all. Right? It's like you didn't think about orchestration. It's like, well, of course, I can trigger things in this software. Like, how else would it work?
[00:16:57] Tobias Macey:
I think that's an interesting point too, as far as the generational shift in the ways that we're using these tools and the ways that these tools are implemented, where the early stages of ETL, orchestration, and data movement were these monolithic packages largely bolted onto some database software, and they were the place where everything got done. So it was very much a centralized monolith. And now, as we have increased the sources of data, types of data, who is consuming the data, how the data is being used, whether it's batch or streaming, etcetera, etcetera, it pushes us more into this federated approach of we have lots of little things happening all over the place. And the orchestration systems that are designed for the current era are generally built with that in mind: being able to have some sort of central nexus of control or visibility, but allowing for federating across multiple different execution contexts.
My experience is largely with Dagster, where, for instance, it has the Dagster web node that you can then point to multiple different running gRPC services that correspond to the actual pipeline code for different use cases. So you can have that central visibility with federated execution. And I'm just wondering how you're seeing those generational divides of orchestration and platform architectures being able to bridge that gap or manage that dichotomy of central visibility and control versus federated execution.
[00:18:41] Hugo Lu:
Yeah. I mean, it's hard. Right? And I think a lot of the reason for it is the movement to the cloud. So, you know, we're speaking to one of the largest hospital chains in the US. Right? And they're all on Oracle. And, you know, they're doing all their data integration, all their transformation there. Oracle works super well. But the last 10 years have been different, because there's a lot of data they need that's not on premise or in Oracle anymore. So now they're saying, okay, how can we push that data into Oracle? How can we then get it out of Oracle and put it where we need to? Right? They need something that can integrate those different layers. And that's why, you know, we as Orchestra are talking to them, because we facilitate that. And the cool thing about the cloud is you can, you know, connect and build integrations to different things in the cloud, be it AWS or Snowflake or Databricks or whatever.
And, you know, you can do this in Orchestra too, obviously, but, like the example you mentioned with Dagster, you can still connect and monitor processes which are remote. So, like, on a server. Right? As long as there is some kind of Internet access, you can get visibility into that. And, you know, I think something we do which is quite unique is we take things one step further. So you know how I mentioned that in Airflow, a task can be very simple. Right? It can be a function that you write yourself. In Orchestra, a task is much larger.
You give us a few lines of YAML, so it's declarative. And not only will we handle that task, but we'll also fetch all of the metadata relating to that task. So, a little bit like how Airflow will, you know, give you logs when you use, like, an SSH operator. Right? It goes into where you've got it and pulls the logs out. Like, we'll get logs, but if the underlying tool also has an orchestration engine, we'll also surface that sub-DAG and then do things like calculate lineage, which is really, really cool, because a lot of what we see is, you know, people running highly complex processes in different places. Right? You might have an analytics engineering team that just uses, like, a Coalesce or a dbt.
That's all they do. But then you might also have engineering, you know, data movement processes that depend on it, machine learning models, reverse ETL that depend on that on the other side. So then the question becomes, how do you get the full end-to-end visibility such that it's not just, like, a box A, box B, box C type thing? And that's what we're trying to do.
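Calculating lineage from that kind of metadata reduces to walking the dependency graph upstream from an asset. A hedged sketch with invented asset names (not Orchestra's actual implementation): given which assets each asset reads from, the full upstream lineage is a graph traversal.

```python
def upstream_lineage(asset, deps):
    """Return every asset that `asset` transitively depends on."""
    seen = set()
    stack = [asset]
    while stack:
        node = stack.pop()
        for parent in deps.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical end-to-end flow: raw ingestion -> dbt-style models -> reverse ETL
deps = {
    "stg_orders":  ["raw_orders"],
    "fct_revenue": ["stg_orders"],
    "reverse_etl": ["fct_revenue"],
}

print(sorted(upstream_lineage("reverse_etl", deps)))
```

The hard part in practice isn't the traversal, it's assembling `deps` across tools that each hold only their own piece of the graph, which is why sitting at the orchestration layer helps.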
[00:21:12] Tobias Macey:
Another reason for that generational shift too, I think, is the ownership of the process, where in the early days of data warehousing, all of the ETL, all of the business intelligence was largely owned by the IT department. So it was very much a cost center. It was something that was done because it was necessary, not because it necessarily drove its own inherent value. Data has now been moved more into the core of the product workflow. Ownership of all of those systems has largely been moved into a separate team that is generally distinct from IT, and they're more of a software product focused team, at least for people who are doing it in the, quote, unquote, modern way.
And so I think that also shifts the ways that the systems are designed and packaged and sold where when it's an IT asset, you sell it to the IT team, and they just want something big, predictable, manageable. They don't want to have to do a lot of customization to it. Whereas with data teams, they're generally working in more of the agile workflow of iterative development, iterative improvement. We want things that we can customize and tweak to suit our specific needs. And I think that that's another way that the overall architecture and platform approach to data has grown out of what it originally started from.
[00:22:43] Hugo Lu:
Yeah. Definitely. And, how to put this? The use case for data is really important here. So, you know, we work with, like, many large manufacturing and logistics companies. Right? And they have sensor data for their operations. So having this sort of move through the system in a timely way is, like, of critical importance. Because if they don't do it, they can't respond to, you know, just changes in stuff that's happened that's fundamentally gonna impact their bottom-line P&L. Right? It's like, if something is gonna be delayed and they have an SLA with a customer and they don't let them know, then, you know, they're gonna take a hit. Right? So in this case, data's playing a really, really key and important, like, operational function.
And in that case, right, the person who is sort of owning that product is probably someone on the operations side. They're probably not gonna be able to build out, like, you know, a relatively low-latency, stable orchestration system. Right? It's like, they've got suppliers, they've got projects, they've got factories to manage. You can't expect them to build data infrastructure as well. But in those cases, you know, it kinda makes sense that you would have someone that says, hey, look, I'm gonna make sure that this thing is delivered to you every 15 minutes, every 5 minutes, and you're gonna get alerted if it's broken, and I'm gonna be your point of contact. Right? That's when I think the sort of platform team on the one side, stakeholder on the other, that pattern works really well. The newer use case is, like, BI, right, and just, like, cloud stuff. So if I'm, like, you know, if I'm working in marketing or I'm in finance and I wanna get a real-time, you know, look at my transactions. Right?
Just because I need to do reporting and just keep a hold on stuff. Right? Has this customer paid today? They're a big customer. Like, it would be good if I could work that out, and if I had the data updated every 15 minutes, then I can email them at 5 PM at the right time so that they actually convert instead of falling out. The engineering for these use cases is often, like, a little bit easier. And I think here, where we're really moving to is empowering people to do this end to end themselves. So, you know, increasingly, you'll see finance teams talking about how they've adopted Snowflake and it's, like, revolutionized their ability to drive insight. Right?
And that's because they will have a power user that can write SQL, that's like, yeah, I know what I'm doing. Like, I'm gonna be the guy that helps my VP of finance work out everything and automates all these processes, so we can actually start, you know, driving the business of finance instead of just, like, keeping the lights on. That's why it's really interesting for me from the orchestration side, because that's, like, the final technical bit that would be really hard for them to do, that, you know, we're sort of trying to help people be able to do now.
[00:25:30] Tobias Macey:
We've been talking about the ways that your selection of orchestration tool influences the ways that you think about your overall platform architecture, but there are also many cases where you have to approach it from the reverse angle of you've already started building out your data systems. You are hitting growing pains of not being able to have that visibility, that sequencing that we've been discussing, and I'm wondering how that influences the ways that you think about what type of orchestration tool or what types of orchestration you need if you already have the data flows and you're just trying to get them under better management.
[00:26:06] Hugo Lu:
Yeah. I mean, like, let's dig into that a bit more. What type of scenarios are you thinking of? Like, what did the data team have? What growing pains are they running into? Yeah. I mean, it all depends.
[00:26:20] Tobias Macey:
Well, that's generally what any question in engineering boils down to. I think that, typically, what you would run into is the initial promise of the modern data stack of, you just throw a credit card at the problem, and you'll have all of your data in your warehouse, and your BI will be amazing because you're using Fivetran, Snowflake, dbt, and whatever the business intelligence tool of the day is. And so you say, okay. Great. All of this stuff is working, but now I don't actually know when the data flows are failing or what the quality issues are or whether the data is up to date or if my dbt compiled properly.
[00:26:57] Hugo Lu:
Yeah. Okay. That's a good one. So I guess the pain is, well, we threw a credit card at the modern data stack, and it's very expensive, and we're no better at making decisions with data than we were before. Yeah. I mean, look. The sort of phrase du jour is data quality, and I think, you know, that setup has its issues. So, obviously, without some sort of end to end orchestration and observability, it's gonna be really hard for you to just, you know, let people know who depend on a specific data asset or, like, dashboard when stuff is breaking. Right? Stuff always breaks. So you need to have some kind of orchestrator in there. Right? If you don't have that, it's gonna be tricky.
And, you know, I think the key here is to get a little bit more flexibility. So it's important to basically build out the stack in a way where you can use the tools for what they're really, really good at. So running everything through dbt might not be the best idea. Right? If you've got stuff that needs to go quickly, you might wanna use, like, Delta tables in Databricks or Snowflake tasks and dynamic tables. Right? You might have some people that wanna self serve in, like, a notebook environment instead of, like, a dashboard. You might not wanna have all of your connectors going through one vendor. You might wanna start doing some streaming. Right? And then in this case, you're like, well, I'm making my stack more complex so that I can save cost, right, and get data to where it needs to go faster.
I'm splitting up my data pipelines in more and more granular ways. But now you have 6 things that you have to connect instead of 3. And before, you know, it just about worked with no Airflow and stuff running at 4 AM and then 6 AM and 8 AM. It was okay. Now that doesn't work. So then you're like, now I need a platform engineer to put in Airflow. And then you have this whole bottleneck problem, because then anytime anyone says, hey. I'm not sure what's going on or, hey. Can I change the schedule for this? It results in a big old long ticket, and then you've got a data platform manager talking to a head of marketing and, you know, they're butting heads.
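The staggered-cron failure mode described here has a standard fix: declare the dependencies and let the orchestrator derive the order. As a minimal sketch (the pipeline names are invented for illustration, not from any real deployment), Python's standard library can express the idea directly:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline names. The point: run order is derived from
# declared dependencies, not from hoping the 4 AM job finished by 6 AM.
deps = {
    "ingest_fivetran": set(),
    "ingest_streaming": set(),
    "dbt_staging": {"ingest_fivetran", "ingest_streaming"},
    "dbt_marts": {"dbt_staging"},
    "notebook_refresh": {"dbt_marts"},
    "dashboard_refresh": {"dbt_marts"},
}

# static_order() yields every task only after all of its upstreams.
order = list(TopologicalSorter(deps).static_order())
```

This is essentially what any orchestrator does under the hood; the operational win is that adding a seventh thing to connect is one more entry in the graph, not another hand-tuned cron offset.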
So, you know, I think, like, in this case, right, Orchestra is a pretty good solution, or indeed, like, any orchestration platform that is easy to use that also gives people good visibility of what's going on. Like, clearly prioritizing and, like, defining the different data products you have, so essentially just, like, grouping pipelines and grouping things, is also very helpful. Because then instead of saying, oh, like, for me to work out what's going on, go ahead and inspect this 1,000-box DAG, you're just saying, yeah. Sure. Here's the pipeline for your invoices data product. Here's how it's doing. Here's the data quality. You can make decisions on this. It's okay.
But I think something else people find, right, as a sort of scenario 2, is, we have flexibility. We have a really good platform team. We have an orchestration framework in place that we manage ourselves. We have Airflow, say, but it's a big monorepo. There's loads of stuff going on, and we're just spending way too much time managing it. Right? Like, stuff takes too long. Stuff that should take an hour takes 2 hours. Like, the cluster keeps going down. And to boot, we also probably have quite a lot of data quality issues that we don't control. So, you know, we spoke to a health tech company over here in the UK earlier, and what they're doing is really cool actually. They're shifting some of that left.
So they're taking the staging models that their software teams give them in dbt and asking the different teams to manage those themselves. So the central data team is actually, like, it's kinda like cheating, but, like, they're basically just doing less stuff. Right? But then you have this other problem. Right? You've got 70 repos of dbt code or whatever, and then, you know, they're building, like, central data models or, like, the clean data or whatever. And then you've got the central data models and the marts happening afterwards. How do you keep visibility of all of that?
And, you know, you can still do it with Airflow. Right? Just have 8 different Airflow instances and stitch all the Airflows up to each other, but then you probably have to get something on top of that to monitor them. You know, it's like a who will guard the guardians type thing. And you have that with, like, pretty much any orchestrator. So that's why we pitch ourselves as, like, a control plane. That's why dbt Cloud has this concept of, like, dbt Mesh. Because, you know, you realize having everything in one place is a lot, so you need to move stuff to other teams. But then, again, you have the complexity issue of how do you monitor things in different places. But, yeah, there are a couple of scenarios where we see people running into problems, and that's how we see them solving it.
[00:31:49] Tobias Macey:
Another aspect of orchestrating data flows, particularly when you're not dealing specifically with streaming data, is the idea of, do you do time based triggering, or do you do everything as event based, where you're reacting to state changes in the system, and where sometimes that state change is the wall clock ticking over to a certain point? And I'm wondering how you see those trends moving in the overall data ecosystem of people's appetite for, I want things to happen on a predictable schedule, or I want things to happen as soon as possible whenever a given event takes place.
[00:32:29] Hugo Lu:
Yeah. I mean, I think definitely a trend towards the latter. Right? Like, people want more data. They want it faster. So the more you can stitch things together, the better. That's obviously why you have things like sensors in orchestration tools. But, you know, I think it always becomes complicated when you have different things in different places. Right? It's like, not everything can have a sensor. And if you don't have the concept of, like, a run, and maybe, you know, maybe you've got, like, 2 data sources. Right? They're landing in S3, and then you've got a dbt model that, like, builds off both of them as an external table in Snowflake. I don't know. When should that run? Should it run when 1 S3 bucket has a file in it? Should it run when the other one has a file in it? Like, if every time files land, they always land in pairs, like, what's the right window to assess them both landing in there at the same time? Right? What happens if one lands in its window and then the next one lands later?
You know? Like, this is the problem. Like, we wanna do event based scheduling across the entire data stack, but that only works where the chain of dependencies is basically linear. Or you have, like, a metadata framework. So where you say, you know, the process writing the file to S3 is gonna put all the metadata you need to work out what to do at that moment, and then it's gonna send the webhook to the next thing. And the next thing needs to trigger a process that can read that metadata and then work out what to do. And, you know, that metadata framework, we also see being very robust, especially in enterprise settings.
But, you know, it's not a question of, like, putting all the logic in the orchestrator, because this is not how data works.
[00:34:17] Tobias Macey:
The other major split in the data platform and usage of data that has been growing in recent years is the divide between the analytical and product focused use cases of batch or streaming data, and the use cases of data to power, train, fine tune, and guide different ML and AI systems. And I'm wondering how you're seeing that strain the current or previous generation of orchestration systems, and how you're thinking about how that fits into the orchestration systems that are going to be coming out over the next few years.
[00:35:49] Hugo Lu:
Yeah. What do you mean by product versus analytical use cases? What are some examples of that?
[00:35:59] Tobias Macey:
So, for example, analytical use cases being typical business intelligence or even reverse ETL; product use cases being, I have some piece of data that gets fed into a table that is either an embedded analytics dashboard for a customer or data that gets fed into a recommendation engine, things like that.
[00:36:15] Hugo Lu:
Yeah. No. I'm with you. I mean, look, man. It's a real spectrum. Like, I think the embedded dashboard for a customer thing, like, I was gonna say typically the use case isn't real time, but often it is. You know, we see people leveraging, like, modern analytical warehouses fairly well there, but having a really tough time if they don't have an orchestrator, because the data often fails and then it's out of date, and then their customers come to them and they say, well, you know, this is terrible. I don't know what's going on.
So there's definitely an issue there. And I think, you know, the product need drives a lot of the requirements for how robust a system needs to be. And, ideally, you will, you know, centralize that data so that you can have an event based system that is essentially managed by software engineers. Right? So, you know, think about, like, you know, maybe you've got, like, an app. Right? And you need to show usage to the customer because they need to know when they're gonna hit their limits. Like, you're not gonna send events for usage onto Kafka, drop it into S3, put it into Snowflake, like, aggregate it on a daily level, rolling average at 7 days, like, put it in a Power BI dashboard, and embed that into your app.
Like, you're gonna take an event. You're gonna insert a row into another Postgres table, or maybe you're gonna have a function that, like, cleans it first, and then you're gonna have your dashboard that looks at that Postgres table. Right? But the point is, it's, like, an event based system. It's not really in the data stack. Right? And I think in machine learning, this is even more the case, because for you to get it out of that software domain into something the data team is using, assuming it's something different, which it normally is, it's a lot of data when you're doing machine learning at scale. And, indeed, most ML engineers seem to want to do stuff on data that's in object storage, probably because of size. Right? It's like, you might wanna use some Spark. You might have to do Spark Streaming. Right? It's like, can you do that in a warehouse? No. So it's in object storage.
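The event-based path Hugo sketches, clean the event, insert a row, let the dashboard query the table, fits in a few lines. This is an illustrative sketch only (the event shape, function names, and table are invented; SQLite stands in for Postgres so the example is self-contained):

```python
import sqlite3

# In-memory SQLite standing in for the operational Postgres table
# that an embedded usage dashboard would query directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (customer_id TEXT, units INTEGER)")

def clean(event):
    """Validate and normalize a raw usage event before insert."""
    try:
        cid = str(event["customer_id"])
        units = int(event["units"])
    except (KeyError, ValueError, TypeError):
        return None
    return (cid, units) if units >= 0 else None

def on_usage_event(event):
    """The whole 'pipeline': clean, then insert straight into the serving table."""
    row = clean(event)
    if row is not None:
        conn.execute("INSERT INTO usage VALUES (?, ?)", row)

def usage_for(customer_id):
    """The dashboard's query: how much has this customer used?"""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(units), 0) FROM usage WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return total
```

Contrast this with the Kafka-to-S3-to-Snowflake-to-Power BI route: same information, but here the latency is one insert, and nothing needs orchestrating.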
And, you know, again, there are a lot of other requirements around machine learning pipelines specifically, because some of that metadata related to, like, training, fine tuning models, like, monitoring their outputs, is so specific. And that's why there are sort of, like, machine learning specific orchestrators. Same with, like, AI. Right? There are a load of AI orchestrators. I don't even know the names of them. But, like, it just goes to show how sort of specialized it is. We're probably doing data orchestration, I guess. But I think, yeah, things becoming more and more hyper specialized is the trend.
The other trend worth mentioning, sorry, I know I'm throwing a lot out here, is, you know, the centralization of data. So you can't do everything in S3. You need an analytical warehouse to do analytical queries. That statement is less true these days because of something to do with Iceberg. So we'll see where that goes.
[00:39:22] Tobias Macey:
Yeah. On that note, I just saw today that Amazon announced S3 Tables buckets, specifically designed for improving Iceberg performance.
[00:39:34] Hugo Lu:
Yeah. And this is the cool thing. Right? It's like, say you've got a ton of product data, and it all lands in S3, and then your ML team picks it up, does some cool stuff, sends some recommendations back to the customer. But, you know, they build out some feature tables. Right? And then the data team picks it up from S3, puts it somewhere else, creates some reports. It's like, you just spent twice the amount of money you probably needed to, and now the data's in different places, and people don't know what the source of truth is. Now that's all in one place. That's potentially big. So I think that's pretty cool.
[00:40:05] Tobias Macey:
Absolutely. And another pressure that I predict, I haven't seen a lot of movement there yet, but I think that one of the ways that we're going to trend, with the pressures of AI applications, where that is getting folded more explicitly into the product arena, is that by virtue of those AI models' inputs being a core dependency of the product experience, it brings the application engineering team back around full circle to being involved in the product that is the exhaust of the data that they initiated. Where you have to have that more full circle workflow of the application engineers through to the data teams, the ML teams, the AI teams, back around to the application teams, all working in tandem with the same visibility. And I think that that's going to force more of these orchestration systems to grow beyond their current boundaries and incorporate that end to end life cycle and visibility and touch points for each of those different personas.
[00:41:17] Hugo Lu:
Yeah. No. I hadn't even thought about that. Are you sort of saying that, like, in order to effectively incorporate AI into your product, you will probably need data that's not in the product? You'll need other types of data too.
[00:41:30] Tobias Macey:
Absolutely. I mean, just think about the RAG systems that are becoming the prevalent means of bringing AI into production for the current era of generative systems.
[00:41:40] Hugo Lu:
Right. Yeah. I mean, what do you think you would need an orchestration tool to do in addition to what they already do in that respect?
[00:41:53] Tobias Macey:
I don't think it's even necessarily a growth in terms of their core functional capacity so much as it is an evolution of the way that it's being presented and integrated into the workflows of those different personas. Where application teams' interface to orchestration is largely the CI/CD pipeline of, I wrote my software. The tests passed. It got deployed. It's on QA. I tested it. Now I push it to production. Maybe there's feature flagging that gets factored in there somewhere. Versus the data team of, I've got to take all of the data from the application database, pull it out of there, put it into my warehouse, clean it up, present it, turn it into a usable asset for other things. And then you've got the ML teams of, I've got my experimentation system, my feature store. I need to have my model training pipeline. I've got my model monitoring system.
And then with generative AI, you've got the, I've got to figure out which model I'm using, maybe apply some fine tuning, get that deployed, monitor for hallucinations, guardrail issues, people trying to jailbreak it. But I also need to have all of my data inputs to the vector database to populate the RAG context and make sure that that gets updated appropriately, and manage the different generations of embedding model that I'm using to update or improve the way that the AI model gets used. All of that is getting collapsed into a single end user experience, whereas before, they were largely disparate teams working on disparate projects.
[00:43:22] Hugo Lu:
Yeah. I mean, I still think we're quite a way from that, but that is the dream. Right? If you can monitor all of those workflows from a single place, and all of your data is in the same place, and, you know, the way you're monitoring it also takes things like data quality into account, and, you know, is really, really reliable and robust, and, you know, is really well integrated with production systems that aren't the orchestrator. Right? Which, you know, like, it needs to be, for example. Right? It's like, if you've got, like, a service which serves up an AI model, right, and then your front end is just sending events being like, hey. You know, a consumer asked this question. What's the answer? Right? It's like, that thing should be able to have some understanding of metadata. But, yeah, it'll be interesting to see where sort of orchestration lands in it. Yeah. It's a tough one.
[00:44:13] Tobias Macey:
AI is also changing the directionality of data flow, where it used to be that it started in the application and then eventually made its way out to ML, and then it would start the cycle back over again. But with the interaction patterns of generative AI, that data gets fed into the AI directly. And then, also, given the memory layers that are being built out, immediately incorporated into the AI context and used back out for the end user experience, but then also fed through the typical data flow of analysis and experimentation to figure out, okay, how are end users interacting with this? How can we improve that? How does that get factored into the product life cycle?
[00:44:51] Hugo Lu:
Yeah. And, you know, it's making a lot of changes on the data side as well. I don't think you see people talking about it as much, because it would be a sort of lagging indicator of how much AI stuff people are doing. But, you know, in the example you just gave, it's like, okay, let's say you've got an AI product and I'm having a conversation with it. Every single message I write is a data point. What's happening to that? Like, do I just have loads of event data that's landing in S3 that's just text? It's like, maybe. But then do you have something which is cleaning that data and structuring it before you put it into the analytical layer, right, before you write it to Iceberg, for example? You could do. Then maybe that's another service you build. Maybe that's a service you buy. But it's more complexity. Right? More things to integrate, which is why I think orchestration is so exciting. Because, you know, it's an area where we see a lot of data teams wanna move fast and not have to spend all this time building all these connections to all these things. So by sort of giving people those managed connections, in the same way that, like, a Fivetran means you don't have to learn the Salesforce API, we're trying to do the same thing for data teams so that they can, like, go a bit faster.
[00:45:58] Tobias Macey:
I think, too, that with the ability for AI to work across all of these boundaries, it's going to be increasingly incorporated into the data flow management arena, more so than it already is. And I think that there's going to be a certain amount of trust building that has to happen before people feel confident actually delegating any core capability to an AI model. But I think that, in that earlier point of collapsing the stack of personas and bringing it more full circle, I imagine that that conversational interface will probably be the unifying factor that brings all of those different teams into the same workflow and onto the same page.
[00:46:40] Hugo Lu:
Yeah. How do you mean?
[00:46:43] Tobias Macey:
Well, I imagine that because of the fact that they're all used to working with data in different ways, if you can layer on a conversational aspect to it that speaks to them in their own language, then it reduces the tooling complexity of, oh, I have to build 5 different UIs to suit these 5 different personas. Instead, I have my interface, and there's just the conversational aspect where you can ask questions and get insights about the data, how it's flowing, what you need to do next, type of a thing. Or direct the orchestration engine to do the things that you want it to do without having to learn all the intricacies of its peculiarities or the different functions that it wants.
[00:47:21] Hugo Lu:
You're talking about, like, an AI layer for data product managers all the way through to, like, machine learning engineers, that helps to build and monitor and recover data pipelines?
[00:47:35] Tobias Macey:
Yes. Yeah. I think we're probably 5 to 10 years away from that today, but yeah.
[00:47:44] Hugo Lu:
No. It's cool. And, you know, you see elements of this today. So when people spin up, like, you know, like, a new microservice, there are some pretty sophisticated data teams that will say, okay. Well, to spin this up, all you need to do is write a few lines of YAML. But then what the YAML does is it automatically creates the orchestration pipeline. It automatically creates the dbt model. So, also, you know, it basically just provisions all the resources automatically. So then if you say, well, you know, we can actually have a menu of things we can create, right, and then here's all the data on how we create it, and you feed that to a model, and then, yeah, put the AI on top, then there's no reason we can't do it. But put it this way. When I saw articles a year ago saying that AI was gonna automate away data engineers' jobs, I thought about what you just said, and then I realized how hard it was. And then I had confidence that trying to build a unified control plane that isn't powered by AI was not gonna be a colossal waste of time.
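The "few lines of YAML expand into a pipeline" pattern can be sketched roughly like this. Everything below is hypothetical (the spec fields, the `provision` function, and the resource names are invented for illustration; no real platform exposes this exact shape). The spec is shown as a dict, as if already parsed from YAML:

```python
# A tiny declarative spec, as if parsed from the YAML a service team writes.
spec = {
    "service": "payments",
    "source_table": "payments.events",
    "schedule": "*/15 * * * *",   # every 15 minutes
    "owners": ["finance-team"],
}

def provision(spec):
    """Expand the small spec into the resources the platform would create:
    an orchestration pipeline, a dbt staging model stub, and alerting."""
    name = spec["service"]
    return {
        "pipeline": {
            "name": f"{name}_elt",
            "schedule": spec["schedule"],
            "tasks": ["ingest", f"dbt run --select stg_{name}", "test"],
        },
        "dbt_model": f"models/staging/stg_{name}.sql",
        "alerts": {"notify": spec["owners"], "on": "failure"},
    }
```

The point Hugo is making follows directly: once provisioning is a pure function of a small spec like this, you have a "menu" that an AI (or anyone) could fill in, because all the hard decisions are encoded in `provision`, not in the caller.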
[00:48:40] Tobias Macey:
Oh, absolutely. I don't think we'll ever completely cede control to the AIs. I think it will largely be a discovery interface and not necessarily a tell the AI to do the thing and then trust that the thing got done right. Yeah.
[00:48:55] Hugo Lu:
Yeah. Although you do raise an interesting point, especially in the context of, like, metadata frameworks, where, you know, like, processes will write data that says, okay, I just ingested all these tables. Like, I put the IDs over there. Hey, thing that's gonna move the tables, like, go fetch the IDs and move the tables. Right? It's like, you could potentially tag messages with the services and their descriptions and, like, their endpoints, and then just hit an AI in the middle instead of a database. I mean, it would be an AI on top of a DB. Right? But, yeah. I don't know. You would probably want to define that logic explicitly. We'll see. We'll see.
[00:49:34] Tobias Macey:
So on that point, in your experience of working in this space, working with data teams of various sizes and compositions and areas of focus, what are some of the most interesting or innovative or unexpected ways that you've seen data orchestration implemented or the ways that it has impacted the overall platform architecture?
[00:49:54] Hugo Lu:
Oh, good question. I mean, at the opposite end of the spectrum, right, the BI use case is very standard. It's very boring, but boring works. There are some pretty interesting ways people use it in terms of, like, provisioning and incident handling. So because you can sort of run any scripts in any places, including, like, your data warehouse, you can sort of build event based flows that automatically help you do things like access control. Obviously, then you need your orchestration plane to itself have very good access control. Fortunately, Orchestra does. But, you know, that's one kind of rogue way people are doing it. Another way is just, like, using the orchestrator to get visibility. And, you know, this is something I feel like we're trying to pioneer as well. It's like, with a Datadog, you can send it data. Right? And then it shows you what's going on, and it sends you alerts. But Datadog doesn't do anything unless you send it the data. Otherwise, it knows nothing. Whereas we have this lovely data model for your metadata.
If you've got, like, event based pipelines or pipelines that are happening elsewhere, you can still send us the data. So similar to, like, a DataHub, if you like. That's pretty cool, because it's, like, an expansion of what engineers think the orchestrator should be doing. Right? You're turning it into genuinely a place where you can say, okay. Here, I can see everything that's going on, and I can control everything. I can rerun things. I can notify people. I can, like, trigger workflows that are operational. That's pretty cool. So, yeah, I guess stuff like governance, automating governance, access control, and also getting full visibility instead of relying on big clunky expensive things like Datadog.
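The "pipelines running elsewhere can still send us their metadata" pattern usually amounts to emitting small structured run events. Here's a hedged sketch of what such an emitter might build; the endpoint, payload shape, and field names are all invented for illustration, since each product defines its own ingestion API:

```python
import json
import time

def run_event(pipeline, status, rows=None):
    """Build the JSON body an external pipeline would POST to a
    control plane or observability tool to report a run."""
    assert status in {"running", "succeeded", "failed"}
    payload = {
        "pipeline": pipeline,
        "status": status,
        "emitted_at": int(time.time()),
        "metrics": {} if rows is None else {"rows_processed": rows},
    }
    # A real emitter would then POST this body (e.g. with urllib.request)
    # to whatever ingestion endpoint the control plane exposes.
    return json.dumps(payload)
```

The interesting design question is the one Hugo raises: the tool receiving these events can either just display them (the Datadog model) or also act on them, rerunning, alerting, triggering downstream work, which is what turns a metadata sink into a control plane.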
[00:51:42] Tobias Macey:
And in your experience of working in this space, building an orchestration engine, and trying to fathom the different ways that data is being relied upon and used, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:00] Hugo Lu:
Mate, there are too many. Like, it's just, you can cut everything in so many ways. Right? People have different tools. People have different teams. People have different latency requirements, and people just have different personas and experiences they wanna have depending on the organization. Right? You could be a tiny startup with, like, 2 developers and still need, like, multiple environments. You know, everything Git controlled, everything sort of, like, asset aware. You could be a sort of, you know, 10,000 person logistics company that is sort of foraying into building their first data products, right, which needs good orchestration, but also, like, a bit of visibility of what's happening on the event pipeline.
And then also a way to, like, you know, enable and monitor self serve, because you've got, you know, 10 global divisions. Right? Even though all you're fundamentally doing is building a relatively simple, you know, ELT pipeline. I think one area which I definitely didn't appreciate as much as I do now is the need for, like, security in where things are hosted. I've learned what colocation and, like, what an Azure Private Link and a self hosted instance actually mean. And it's, like, nuts, because for anyone that doesn't know, right, if you've built a software product, right, and you run it in the cloud, you run it on AWS in London, and then you have a company in California that says, hey. We're on Azure. We need Private Link to Azure. Can you support that? What you then have to basically do is write your app using Azure services and make sure it can be hosted and provisioned in basically the same building that all their stuff is in. That's really hard to do, mate.
[00:53:41] Tobias Macey:
And for people who are tasked with building a data platform, managing its health and longevity, what are the cases where a data specific orchestrator is the wrong choice?
[00:53:53] Hugo Lu:
Oh, the wrong choice. Good question. If you have fully streaming use cases, don't get an orchestration tool. You should be streaming that stuff. Apart from that, I mean, if you're gonna do batch stuff, you should probably have something. Like, if your flows are really simple and if they're linear, I would probably just monitor it, like, have really good logging, and have different services talk to each other. Orchestration is probably overkill. And, oh, here's a good one. So if you're a huge, huge company and you have very difficult SLA requirements, you might wanna choose something like a Palantir.
Right? In this case, you're buying the platform. It's like, don't build it. Buy the thing. Other than that, I think you're always gonna need one. Right? I mean, and final point. In terms of buying a data platform, right, historically, this was basically the same thing as, like, you know, maybe having a warehouse, but it's on premise. So it's like, do we get Oracle? Do we get SQL Server? I know, that's the age old question. Now the discussion is, well, do we get BigQuery? Do we get Snowflake? Do we get Databricks? Caveat is, none of those are a data platform. Databricks is getting very close to having everything, but not quite. And even with people on Databricks, most of those organizations still also use Snowflake. I think it's at something like 40 or 50%, like, they share customers. So you're still gonna have to get visibility of everything.
So how do you do that? So that's, yeah, that's another reason that I think building something which connects to different parts of the stack is a good bet. Because it's not getting any simpler. Well, it is getting a bit simpler, but there are still a lot of tools out there.
[00:55:35] Tobias Macey:
And as you continue to build and invest in this ecosystem, what are some of your predictions and/or hopes for the future of data orchestration?
[00:55:40] Hugo Lu:
I don't know about orchestration, but I think, generally, it'd be good to see data teams stop being viewed as a call center. I predict that data teams will realize that even for basic BI use cases, the level of, essentially, the SLA of the data needs to be a lot higher than we think it is. Much more similar to a software system, if anything, like, higher. Because at the end of the day, people are really fickle when they see data that they don't trust, and it's really easy for them to lose that trust. I don't think we generally make things to a sufficiently high standard. Like, definitely consolidation, right, in the orchestration plane. Like, you see this with, you know, lots of companies like dbt, Orchestra, Dagster. We're all sort of trying to grasp everything at the top. So, like, not being a warehouse, not being an ingestion tool, not being a dashboarding tool. And, yeah, the other one, of course, as you mentioned today, is, like, Iceberg. Right? It'd be cool to see if people can move things together. But at the end of the day, if you still see the data team as a cost center, and you're not driving value from it, then it's a bit of a defensive exercise to move stuff to Iceberg and slash your costs and reduce your security footprint. Right? It's like, that's not why we got hired. Like, we got hired to make companies grow.
So, yeah, they're my main ones. What are yours?
[00:56:58] Tobias Macey:
I think the main one is what we discussed earlier of AI being a motivating factor to push all of those teams closer together and into tighter collaboration and cooperation, with the orchestration engine being that focal point of interaction. Nice. 10 year plan. Are there any other aspects of data platforms, their architecture, or the role of the orchestrator that we didn't discuss yet that you'd like to cover before we close out the show? I think we're all good, to be honest. I think it'll be really interesting to see how people start automating
[00:57:33] Hugo Lu:
things and simplifying things even more and, like, what that does to both the users of the data and the users of the data platform. I've always thought that they should sort of kinda be the same people. Right? It's like, there's nothing better than a power user that can self serve, but it's not getting any easier to architect a data platform. So people that know how to do that are just getting more and more specialized and better and better paid. So we'll see what happens. As a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I'd say the answer is not orchestration, because you've got plenty of those. I think it is around effective governance and prioritization. I don't know if it's a tool. I don't know if it's a process. But at the end of the day, a lot of dashboards don't get used. A lot of the work that data engineers do, we feel like it goes down the drain. Anything we can do to say, well, I care about these 10 things, and then realize, well, actually, it's only 5, that would be game changing. Absolutely. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience in the arena of data platform design and orchestration
[00:58:38] Tobias Macey:
and the ways that that impacts what people are able to get done with their data. It's definitely a very interesting and important problem space. So I appreciate the time and energy that you're investing in that, and I hope you enjoy the rest of your day. You too, sir. It's really good to be here.
[00:58:57] Hugo Lu:
Cheers, mate.
[00:58:59] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Hugo Lu about the data platform and orchestration ecosystem and how to navigate the available options. So, Hugo, can you start by introducing yourself?
[00:00:28] Hugo Lu:
Of course. Great to be here, Tobias. So I'm Hugo Lu. I'm the CEO and cofounder of Orchestra, which is a unified control plane for data. Prior to this, you know, I'm sort of one of those people that fell into data by chance. My first stint was in investment banking, and then I moved into strategy at a company called JUUL, picked up data, and it's kind of been history ever since. So, yeah, thanks for having me. Looking forward to this.
[00:00:54] Tobias Macey:
And you mentioned that you founded Orchestra, which is a company focused on orchestration. We're not going to spend a lot of time focused on what you're building specifically, but generally on orchestration and how it impacts the rest of the choices that you make about how you work with data. And I'm wondering if you can just start by giving your definition of what constitutes an orchestrator and orchestration in that data context.
[00:01:19] Hugo Lu:
Sure. I think it's really interesting when you try to build a data platform. Right? Because you think about where you wanna put your data and what you wanna use to, you know, change stuff in it, so like a compute engine. But fundamentally, if you don't have something triggering something, then nothing is ever gonna happen. So that's sort of where I see an orchestration tool coming in. I would just define it as a way to schedule, trigger, and monitor things. So, nice and short.
[00:01:51] Tobias Macey:
And orchestration as a practice and as a principle is something that has existed since well before computing, but has been translated into the computing environment in various forms. Maybe the most notable and most long lived one being cron: I want this thing to happen at this interval. And everybody, well, many people have outgrown it, and many people still use it for various use cases. But other aspects of orchestration are things like CI/CD pipelines, where you wanna make sure that your software builds get through and are tested, etcetera. Orchestration and scheduling are also generally linked, and then you start getting into things like Kubernetes and its internal scheduling system, which orchestrates all of the different moving pieces, which has then led to outgrowths of things like Argo CD, which has also made forays into the data space.
And I'm wondering if you could just talk to some of the ways that that idea of scheduling and orchestration has been kind of conflated and jammed into various shapes and places and how the specifics of the ways that the orchestration is managed and executed and scheduled influences the ways that it actually works within a given use case.
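The cron model mentioned above, "I want this thing to happen at this interval," reduces to computing the next fire times from a fixed interval. A minimal illustrative sketch in Python (the function name and interface are invented for illustration, not any real scheduler's API):

```python
from datetime import datetime, timedelta

def next_runs(start: datetime, interval: timedelta, count: int) -> list[datetime]:
    """Return the next `count` fire times after `start` for a
    fixed-interval schedule -- the simplest case cron handles."""
    return [start + interval * i for i in range(1, count + 1)]

# An hourly job anchored at midnight fires at 01:00, 02:00, 03:00.
runs = next_runs(datetime(2024, 1, 1), timedelta(hours=1), 3)
```

Real cron is richer (its five-field expressions encode calendars, not just intervals), but this captures the core contract that data orchestrators later layered dependencies on top of.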
[00:03:11] Hugo Lu:
Yes. Absolutely. A lot to unpack there, but I think you kind of hit the nail on the head. Right? The process of having, you know, I wanna complete this task, and then I've got multiple dependencies, and then I wanna do those things, and then there are multiple dependencies after that, is a practice that is as old as computing. And I think, you know, if you speak to anyone on the software side, orchestration is not a thing. Right? If you need to execute, like, a series of tasks in, like, a directed acyclic graph or a DAG, that sort of functionality is built into a lot of things that have names. So, you know, you mentioned Kubernetes just as an example.
You know, it's a great example. Right? There are a lot of dependencies and processes that need to happen within a Kubernetes cluster, and, obviously, it's got a scheduler too. I think the reason it's got its own sort of area in data is probably because a lot of the, like, processes we have are split into different areas. Right? So if anyone's ever built a data ingestion system, that has to have an orchestration component too, because maybe you need to, you know, trigger parallel fetches of data, put it into a staging area, run quality checks, you know, move it somewhere, change the format, and then push it to a final destination. Right? That's not just gonna be handled in one big script.
But the fact that we have these, you know, things that do data ingestion, things that transform data, maybe things that do, you know, transformation and then ingestion and maybe a little bit more, means that there's a need to, like, monitor lots of different things at different places. So as a result, a lot of the engineering time that, you know, data teams spend is saying, okay. Well, I've got all these components. How do I stitch this system together? And, you know, the word that prevails here is orchestration. Right? You stitch it together with an orchestration tool. So, yeah, I think that's more or less where it fits in.
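The pattern described here, a task with multiple dependencies followed by further dependent tasks, is exactly a DAG traversal, which Python's standard library can express directly. A toy sketch of what any orchestrator does at its core (the ingestion task names are made up for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# An illustrative ingestion DAG: two parallel fetches, then staging,
# a quality check, and a final load. Each key lists its dependencies.
dag = {
    "fetch_orders": [],
    "fetch_customers": [],
    "stage": ["fetch_orders", "fetch_customers"],
    "quality_check": ["stage"],
    "load_warehouse": ["quality_check"],
}

# static_order yields tasks so every dependency runs before its dependents.
order = list(TopologicalSorter(dag).static_order())
```

A real engine would execute the two fetches concurrently and retry failures, but the dependency-resolution step is the same topological sort.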
[00:05:21] Tobias Macey:
As you noted, orchestration is something that finds its way into almost every piece of software in some fashion, which leads to a lot of complexity and confusion as you're selecting which piece of the stack is going to own which pieces of sequencing and the overall control. And if you do allow all of those different pieces to delegate a certain layer of orchestration, then you end up in the situation of having to stitch back together the view of what all those pieces are, how they are happening, and when, versus having a centralized orchestration engine that says, I'm going to take control over all of these things; you don't do anything by yourself unless I tell you to. And, obviously, those two extremes have a big impact on the overall architecture of the data platform.
And I'm wondering if you can talk to some of the ways that you've seen those gradations take shape as people build their data systems and their data workflows and how they try to make sense of how data is moving through their organization.
[00:06:26] Hugo Lu:
Yeah. Definitely. I think a helpful lens here is attacking it from, like, a maturity standpoint. Right? So, you know, many people that are trying to build a data platform have started from day 1. Right? And, you know, on day 1, you might not have loads of people relying on loads of reports. So you maybe have a couple of scripts that are getting and cleaning some data. They're storing it somewhere. Maybe you do, you know, a little bit of cleaning, and then, you know, you're kinda done. Right? People will have a dashboard that's directly querying it, or maybe people will just go and get that data and do some fun stuff with it, download it to Excel. But the orchestration is not complicated here. Right? You can sort of move stuff and then have something else triggered when it needs to be.
Obviously, as you grow, that gets more complicated. Right? What happens if you have a big dataset and you're using something like a Power BI or a Tableau? You need to trigger an extract refresh. What happens if you have a lot of data and you need a complicated data model? Right? You might have hundreds or thousands of tables. What happens if you have 30 different sources of data that people are relying on? You can't just have 1 ingestion tool. Maybe you have multiple ingestion services. Maybe some of that's streaming. So the question then becomes, how do you stitch all of that together and get visibility while leveraging all of those components you've already got to their fullest extent?
And I think at that point, it becomes really, really difficult to have all of those different systems talking to each other. Right? It's like, in the sort of software world, you might have, you know, different services that speak to each other. Right? They send each other events. It's all choreographed. Right? You don't orchestrate most, like, many software systems. The difference here is that we're dealing with data. So, you know, if every service doesn't have access to the same data, it becomes very expensive and very slow to make that work. And as a result, it can be helpful to have a sort of control layer on top of all these different services, because you don't have this huge data dependency in software like you do in data.
[00:08:34] Tobias Macey:
One of the approaches to gaining that visibility, which is largely an artifact of how you think about where that control lies and what the motivating force for the propagation of data is, is the idea of an overarching metadata catalog that all of your different tools integrate with. It either pulls data from them or they push data to it, so that you can see, across all of the different pieces of software and technology: this is all the data that I have, this is how it moved, etcetera, etcetera. Whereas different orchestration engines have also tried to pull that into the core of their functionality: I am going to own everything, so I will be the repository of metadata and give you visibility across these different layers.
And I'm curious how you've seen those philosophies play out in your experience of working in this space and working with customers.
[00:09:28] Hugo Lu:
Yeah. No. Look. I hear it. Again, like, a lot to unpack. And I think if we start with the problem people are trying to solve: a lot of the time, there's a data team that is scaling or at scale. The consumers of data, particularly, like, if you're doing BI, really struggle to get trust in it. Right? It's like, you're leading a data platform. You've got 15 hardcore engineers. But at the end of the day, some of the datasets that you're building are for people in product, they're for people in marketing, they're for people in finance. Right? And they've got to come to you and say, hey, like, is this data fresh? Like, something looks a bit funky. I don't really know what's going on. And, you know, you then have this pattern, right, where on the one hand, you have this central team, or many central teams. And then on the other hand, you have the consumer.
And the consumer basically has no idea what's going on. So the solution is to say, ah, well, you know, we as the central team can give you a catalog. The catalog will show you what's going on. I will train you to use the catalog. You know, we'll pay lots of money for the catalog. We'll maintain the catalog. But this is the way that you understand what's going on. This is how you can get trust in the data. And, you know, this is a really tricky pattern to make work, because fundamentally, you have, like, a bottleneck, or many bottlenecks, who actually know what the hell is going on. So I think this is the first thing we see playing out. Right? At scale, even with a catalog, people struggle to work out what's going on, which is bad, because as a data team, your goal is to help them know what's going on so they can use data to make decisions. So that's the first thing. The second thing is that as a data team, it's a lot of effort to make that pattern work. Like, I was speaking to a fast growing technology company. They have about 1,500 employees. They're doing data mesh. Right? So they're saying, hey, we're gonna give everybody the tools they need to build their own pipelines.
And they're super highly technical. The end users are back end engineers. And even then, it's taken them almost 2 years to stand up Airflow and, like, parameterize it, in the sense of, you know, having, like, a sort of YAML based domain specific language on top that anybody can use. And on top of that, right, even after they've written all those pipelines, they have to write yet more code to keep their catalog up to date. And it's taken them, you know, 6 or 7 platform engineers a year and a half, and only back end engineers can use what they've built. Right? They haven't even started on marketing or finance yet. I asked the lead engineer, I was like, how many Airflow instances have you got? He said, oh, I've lost count at this point. You know? Like, you can do this pattern, and it just takes an enormous amount of effort and resource. And, you know, if you've not done it before, I would say there's quite a high chance of failure. Right? So, you know, I think that's the second component. It takes a lot of investment to, you know, not only stitch everything together, but also surface it in a usable way. So this is part of the reason that there are some quote unquote orchestration tools that are trying to be the catalog, because, you know, the orchestration tool triggers and monitors everything. So it has all the context. It has all the metadata. Right? It's got all those juicy run IDs, which you wanna monitor over time.
So from an architectural perspective, it would make sense to kind of put a catalog there.
[00:12:48] Tobias Macey:
The challenge there, though, is that by having that be the nexus of metadata, it then forces you to use that for situations where it's maybe not the appropriate fit for owning a certain data flow just so that you can get the metadata into it versus if you have 90% of your metadata in your orchestrator and only 10% of your workflows live outside of it, you then have to add a whole other software layer just to be able to track those disparate pieces.
[00:13:19] Hugo Lu:
Right. Yeah. You hit the nail on the head. And this is the issue people find with Airflow. Right? It's completely agnostic, so you can sort of trigger and monitor any Python processes. But a single task can be, like, a print statement or a function that prints, like, hello world. It does nothing. You have to write everything yourself. And what we see is data teams spending time building pipelines to fetch metadata that is generated by their pipelines, and then building dbt models to, like, clean that metadata, and then building dashboards themselves to monitor the metadata, and then building alerting systems on the metadata, all themselves. You know, I think in some cases, it's probably, like, genuinely a doubling of work just to know what's going on, which is insane.
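The "pipelines about pipelines" work described here usually boils down to collecting task run records and computing health metrics over them. A toy sketch of the kind of metadata rollup teams end up hand-building (the record shape and the alerting threshold are invented for illustration, not any orchestrator's real schema):

```python
# Each record mimics the metadata an orchestrator emits per task run.
runs = [
    {"run_id": "r1", "task": "ingest",    "status": "success", "duration_s": 42},
    {"run_id": "r2", "task": "ingest",    "status": "failed",  "duration_s": 7},
    {"run_id": "r3", "task": "transform", "status": "success", "duration_s": 120},
]

def success_rate(records: list[dict]) -> float:
    """Fraction of runs that succeeded -- the metric behind most alerts."""
    ok = sum(1 for r in records if r["status"] == "success")
    return ok / len(records)

rate = success_rate(runs)
should_alert = rate < 0.9  # hypothetical alerting threshold
```

The point of the passage is that none of this is hard individually; it is that every team rebuilds the fetch, clean, dashboard, and alert layers from scratch.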
[00:14:08] Tobias Macey:
It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources, and crushing morale. Datafold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity.
You're calling out Airflow, and its Python orientation is also another angle to the impact that orchestration systems have on the overall architectural choices of your system, because some of these orchestration systems are very much oriented to a specific language or a specific mode of interaction. And that influences the ways that you think about hiring, who works on all of these different data flows, who is able to interact with it and control it, versus other orchestration systems that are going to the other extreme of low code: take whatever language runtime you want, we're just going to let you click and drag things together, and it'll all be amazing. What have you got in mind? Nothing specific, but I think the ones that come to mind most readily are, like, the Kettle and Pentaho
[00:15:42] Hugo Lu:
of, I don't know, 10 or 15 years ago, and the Microsoft SQL Server Integration Services and things like that. Yeah. But, like, it's really interesting, right, because it's almost like we're coming full circle. Because if you take, you know, you mentioned SSIS, SQL Server Integration Services. That's another good example of a product that does what it's meant to do, but also has orchestration within it. You know, people will have seen that Snowflake recently acquired a company called Datavolo, and that's based on NiFi. Again, the same thing. It's fundamentally a low code tool, well, a no code tool, for moving, ingesting, and transforming data. But within it, you can do orchestration. But the point is, like, the problem it solves is, okay, how do I take data from these places and put it in that place in the format I want it? And it does all of that in one go. And with the advent of, like, the modern data stack and things getting more complicated and, you know, all the things that are driving us to make more complex systems, you lose out on orchestration, because you have these different components that are very good at doing one thing.
Whereas before, you just had packages that had it all. Right? It's like, you didn't think about orchestration. It's like, well, of course I can trigger things in this software. Like, how else would it work?
[00:16:57] Tobias Macey:
I think that's an interesting point too, as far as the generational shift in the ways that we're using these tools and the ways that these tools are implemented, where the early stages of ETL, orchestration, and data movement were these monolithic packages largely bolted onto some database software, and they were the place where everything got done. So it was very much a centralized monolith. And now, as we have increased the sources of data, types of data, who is consuming the data, how the data is being used, whether it's batch or streaming, etcetera, etcetera, it pushes us more into this federated approach of, we have lots of little things happening all over the place. And the orchestration systems that are designed for the current era are generally built with that in mind, of being able to have some sort of central nexus of control or visibility, but allowing for federating across multiple different execution contexts. Yes.
My experience is largely with Dagster, where, for instance, it has the Dagster web node that you can then point to multiple different running gRPC services that correspond to the actual pipeline code for different use cases. So you can have that central visibility with federated execution. And I'm just wondering how you're seeing those generational divides of orchestration and platform architectures being able to bridge that gap, or manage that dichotomy of central visibility and control versus federated execution.
[00:18:41] Hugo Lu:
Yeah. I mean, it's hard. Right? And I think a lot of the reason for it is the movement to the cloud. So, you know, we're speaking to one of the largest hospital chains in the US. Right? And they're all on Oracle. And, you know, they're doing all their data integration, all their transformation there. Oracle works super well. But the last 10 years have been different, because there's a lot of data they need that's not on premise or in Oracle anymore. So now they're saying, okay, how can we push that data into Oracle? How can we then get it out of Oracle and put it where we need to? Right? They need something that can integrate those different layers. And that's why, you know, we as Orchestra are talking to them, because we facilitate that. And the cool thing about the cloud is you can, you know, connect and build integrations to different things in the cloud, be it AWS or Snowflake or Databricks or whatever.
But, like, you know, and you can do this in Orchestra too, obviously, but, like, the example you mentioned with Dagster, you can still connect and monitor processes which are remote, so, like, on a server. Right? As long as there is some kind of Internet access, you can get visibility into that. And, you know, I think something we do which is quite unique is we take things one step further. So you know how I mentioned that in Airflow, a task can be very simple. Right? It can be a function that you write yourself. In Orchestra, a task is much larger.
You give us a few lines of YAML, so it's declarative. And not only will we handle that task, but we'll also fetch all of the metadata relating to that task. So a little bit like, you know, similarly to how Airflow will, you know, give you logs when you use, like, an SSH operator. Right? It goes into where you've got it and pulls the logs out. Like, we'll get logs, but if the underlying tool also has an orchestration engine, we'll also surface that sub-DAG and then do things like calculate lineage, which is really, really cool. Because a lot of the things we see are, you know, people running highly complex processes in different places. Right? You might have an analytics engineering team that just uses, like, a Coalesce or a dbt.
That's all they do. But then you might also have engineering, you know, data movement processes that depend on it, machine learning models, reverse ETL that depend on that on the other side. So then the question becomes, how do you get the full end to end visibility such that it's not just, like, a box A, box B, box C type thing? And that's what we're trying to do.
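The "few lines of YAML" idea points at declarative task definitions, where the platform, not the user, handles execution and metadata collection. The snippet below is purely illustrative and is NOT Orchestra's actual syntax; it only sketches what a declarative task that delegates metadata collection might look like:

```yaml
# Hypothetical declarative task definition -- not any real product's schema.
tasks:
  run_dbt_models:
    integration: dbt_cloud    # the tool that actually does the work
    job_id: 12345             # placeholder identifier
    collect_metadata: true    # platform fetches logs, sub-DAG, lineage
    depends_on: []
```

The contrast with an imperative Airflow task is that here the user declares what to trigger, and the surrounding platform owns how it runs and what telemetry comes back.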
[00:21:12] Tobias Macey:
Another reason for that generational shift too, I think, is the ownership of the process, where in the early days of data warehousing, all of the ETL, all of the business intelligence was largely owned by the IT department. So it was very much a cost center. It was something that was done because it was necessary, not because it necessarily drove its own inherent value. Data has now been moved more into the core of the product workflow. Ownership of all of those systems has largely been moved into a separate team that is generally distinct from IT, and they're more of a software product focused team, at least for people who are doing it in the, quote, unquote, modern way.
And so I think that also shifts the ways that the systems are designed and packaged and sold where when it's an IT asset, you sell it to the IT team, and they just want something big, predictable, manageable. They don't want to have to do a lot of customization to it. Whereas with data teams, they're generally working in more of the agile workflow of iterative development, iterative improvement. We want things that we can customize and tweak to suit our specific needs. And I think that that's another way that the overall architecture and platform approach to data has grown out of what it originally started from.
[00:22:43] Hugo Lu:
Yeah. Definitely. And I think, how do I put this? The use case for data is really important here. So, you know, we work with, like, many large manufacturing and logistics companies. Right? And they have sensor data for their operations. So having this sort of move through the system in a timely way is of, like, critical importance. Because if they don't do it, they can't respond to, you know, just changes in stuff that's happened that's fundamentally gonna impact their bottom line P&L. Right? It's like, if something is gonna be delayed and they have an SLA with a customer and they don't let them know, then, you know, they're gonna take a hit. Right? So in this case, data's playing a really, really key and important, like, operational function.
And in that case, right, the person who is sort of owning that product is probably someone on the operations side. They're probably not gonna be able to build out, like, you know, a relatively low latency, stable orchestration system. Right? It's like, they've got suppliers, they've got projects, they've got factories to manage. You can't expect them to build out data infrastructure as well. But in those cases, you know, it kinda makes sense that you would have someone that says, hey, look, I'm gonna make sure that this thing is delivered to you every 15 minutes or every 5 minutes, and you're gonna get alerted if it's broken, and I'm gonna be your point of contact. Right? That's where I think the sort of platform team on the one side, stakeholder on the other, that pattern works really well. The newer use case is, like, BI, right, and just, like, cloud stuff. So it's like, you know, if I'm working in marketing or I'm in finance and I wanna get a real time, you know, look at my transactions. Right?
Just because I need to do reporting and just keep a hold on stuff. Right? Has this customer paid today? They're a big customer. Like, it would be good if I could work that out, and if I had the data updated every 15 minutes, because then I can email them at 5 PM at the right time so that they actually convert instead of falling out. The engineering for these use cases is often, like, a little bit easier. And I think here, where we're really moving to is empowering people to do this end to end themselves. So, you know, increasingly, you'll see finance teams talking about how they've adopted Snowflake and it's, like, revolutionized their ability to drive insight. Right?
And that's because they will have a power user that can write SQL, that's like, yeah, I know what I'm doing. Like, I'm gonna be the guy that helps my VP of finance work out everything and automates all these processes, so we can actually start, you know, driving the business forward instead of just, like, keeping the lights on. That's why it's really interesting for me from the orchestration side, because that's, like, the final technical bit that would be really hard for them to do, that, you know, we're sort of trying to help people be able to do now.
[00:25:30] Tobias Macey:
We've been talking about the ways that your selection of orchestration tool influences the ways that you think about your overall platform architecture, but there are also many cases where you have to approach it from the reverse angle of you've already started building out your data systems. You are hitting growing pains of not being able to have that visibility, that sequencing that we've been discussing, and I'm wondering how that influences the ways that you think about what type of orchestration tool or what types of orchestration you need if you already have the data flows and you're just trying to get them under better management.
[00:26:06] Hugo Lu:
Yeah. I mean, like, let's dig into that a bit more. What type of scenarios are you thinking of? Like, what did the data team have? What growing pains are they running into? Yeah. I mean, it all depends.
[00:26:20] Tobias Macey:
Well, that's generally what any question in engineering boils down to. I think that, typically, what you would run into is the initial promise of the modern data stack: you just throw a credit card at the problem, and you'll have all of your data in your warehouse, and your BI will be amazing, because you're using Fivetran, Snowflake, dbt, and whatever the business intelligence tool of the day is. And so you say, okay, great, all of this stuff is working. But now I don't actually know when the data flows are failing, or what the quality issues are, or whether the data is up to date, or if my dbt compiled properly.
[00:26:57] Hugo Lu:
Yeah. Okay. That's a good one. So I guess the pain is, well, we threw a credit card at the modern data stack, and it's very expensive, and we're no better at making decisions with data than we were before. Yeah. I mean, look. The sort of phrase du jour is data quality, and I think, you know, that setup has its issues. So, obviously, without some sort of end to end orchestration and observability, it's gonna be really hard for you to just, you know, let the people who depend on a specific data asset or, like, dashboard know when stuff is breaking. Right? Stuff always breaks, so you need to have some kind of orchestrator in there. Right? If you don't have that, it's gonna be tricky.
And, you know, I think the key here is to get a little bit more flexibility. So it's important to basically build out the stack in a way where you can use the tools for what they're really, really good at. So running everything through dbt might not be the best idea. Right? If you've got stuff that needs to go quickly, you might wanna use, like, Delta tables in Databricks, or Snowflake tasks and dynamic tables. Right? You might have some people that wanna self serve in, like, a notebook environment instead of, like, a dashboard. You might not wanna have all of your connectors going through one tool. You might wanna start doing some streaming. Right? And then in this case, you're like, well, I'm making my stack more complex so that I can save cost, right, and get data to where it needs to go faster.
I'm splitting up my data pipelines in more and more granular ways. But now you have 6 things that you have to connect instead of 3. And before, you know, with no Airflow and stuff just running at 4 AM and then 6 AM and 8 AM, it was just about okay. Now that doesn't work. So then you're like, now I need a platform engineer to put in Airflow. And then you have this whole bottleneck problem, because then anytime anyone says, hey, I'm not sure what's going on, or, hey, can I change the schedule for this? It results in a big old long ticket, and then you've got a data platform manager talking to a head of marketing and, you know, they're butting heads.
So, you know, I think, like, in this case, right, Orchestra is a pretty good solution, or indeed, like, any orchestration platform that is easy to use and that also gives people good visibility of what's going on. Like, clearly prioritizing and, like, defining the different data products you have, so essentially just, like, grouping pipelines and grouping things, is also very helpful. Because then instead of saying, oh, like, for me to work out what's going on, go ahead and inspect this 1,000-node DAG, you're just saying, yeah, sure. Here's the pipeline for your invoices data product. Here's how it's doing. Here's the data quality. You can make decisions on this. It's okay.
But I think something else people find, right, as a sort of scenario 2, is: we have flexibility. We have a really good platform team. We have an orchestration framework in place that we manage ourselves. We have Airflow, say, but it's a big monorepo. There's loads of stuff going on, and we're just spending way too much time managing it. Right? Like, stuff takes too long. Stuff that should take an hour takes 2 hours. Like, the cluster keeps going down. And to boot, we also probably have quite a lot of data quality issues that we don't control. So, you know, we spoke to a health tech company over here in the UK earlier, and what they're doing is really cool, actually. They're shifting some of that left.
So they're taking the staging models that their software teams give them in dbt and asking the different teams to manage those themselves. So the central data team, it's kinda like cheating, but, like, they're basically just doing less stuff. Right? But then you have this other problem. Right? You've got 70 repos of dbt code all whirring away, and then, you know, they're building, like, central data models, or, like, the clean data or whatever. And then you've got the central data models and the marts happening afterwards. How do you keep visibility of all of that?
And, you know, you can still do it with Airflow. Right? Just have 8 different Airflow instances and stitch all the Airflows up to each other, but then you probably have to get something on top of that to monitor them. You know, it's like a who will guard the guardians type thing. And you have that with, like, pretty much any orchestrator. So that's why we pitch ourselves as, like, a control plane. That's why dbt Cloud has this concept of, like, dbt mesh. Because, you know, you realize having everything in one place is a lot, so you need to move stuff to other teams. But then, again, you have the complexity issue of how do you monitor things in different places. But, yeah, there are a couple of scenarios where we see people running into problems, and that's how we see them solving it.
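The "who will guard the guardians" problem of watching several orchestrator instances from one place can be sketched as a small status roll-up. This is a toy illustration, not Orchestra's or Airflow's actual API; all names here are invented:

```python
from dataclasses import dataclass

@dataclass
class Run:
    instance: str  # which orchestrator instance reported the run
    product: str   # the data product the pipeline belongs to
    status: str    # "success", "running", or "failed"

def rollup(runs):
    """Collapse runs reported by many orchestrator instances into a
    single worst-case status per data product (failed > running > success)."""
    severity = {"success": 0, "running": 1, "failed": 2}
    statuses = {}
    for run in runs:
        current = statuses.get(run.product, "success")
        if severity[run.status] >= severity[current]:
            statuses[run.product] = run.status
        else:
            statuses[run.product] = current
    return statuses
```

Grouping by product rather than by DAG is what turns a thousand-node graph into a handful of statuses a stakeholder can actually act on.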
[00:31:49] Tobias Macey:
Another aspect of orchestrating data flows, particularly when you're not dealing specifically with streaming data, is the idea of: do you do time based triggering, or do you do everything as event based, where you're reacting to state changes in the system, and where sometimes that state change is the wall clock ticking over to a certain point? And I'm wondering how you see those trends moving in the overall data ecosystem, of people's appetite for, I want things to happen on a predictable schedule, versus, I want things to happen as soon as possible whenever a given event takes place.
[00:32:29] Hugo Lu:
Yeah. I mean, I think there's definitely a trend towards the latter. Right? Like, people want more data. They want it faster. So the more you can stitch things together, the better. That's obviously why you have things like sensors in orchestration tools. But, you know, I think it always becomes complicated when you have different things in different places. Right? It's like, not everything can have a sensor. And if you don't have the concept of, like, a run... maybe you've got, like, 2 data sources. Right? They're landing in S3, and then you've got a dbt model that, like, builds off both of them as external tables in Snowflake. I don't know. When should that run? Should it run when one S3 bucket has a file in it? Should it run when the other one has a file in it? Like, if every time files land, they always land in pairs, what's the right window to assess them both landing in there at the same time? Right? What happens if one lands in its window and then the next one lands later?
You know? Like, this is the problem. Like, we wanna do event based scheduling across the entire data stack, but that only works where the chain of dependencies is basically linear. Or you have, like, a metadata framework, where you say, you know, the process writing the file to S3 is gonna put all the metadata you need to work out what to do at that moment, and then it's gonna send the webhook to the next thing. The next thing then needs to trigger a process that can read that metadata and work out what to do. And, you know, that metadata framework, we also see being very robust, especially in enterprise settings.
But, you know, it's not a question of, like, putting all the logic in the orchestrator, because that's not how data works.
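The paired-files question above can be made concrete with a small trigger check: fire the downstream dbt run only once every expected file has landed, and only if the arrivals fall inside one tolerance window. A hypothetical helper, not any orchestrator's built-in sensor:

```python
from datetime import datetime, timedelta

def should_trigger(arrivals, required, window=timedelta(minutes=30)):
    """arrivals maps an expected S3 key to the time it landed.
    Return True only when every required key has arrived and the
    earliest and latest arrivals are within the tolerance window."""
    if not all(key in arrivals for key in required):
        return False  # still waiting on at least one file
    times = [arrivals[key] for key in required]
    return max(times) - min(times) <= window
```

The awkward case Hugo raises, one file landing in its window and its partner landing much later, shows up here as a result that stays False unless you add a retry or late-arrival policy on top.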
[00:34:17] Tobias Macey:
The other major split in the data platform and usage of data that has been growing in recent years is the divide between the analytical and product focused use cases of batch or streaming data, and the use cases of data to power, train, fine tune, and guide different ML and AI systems. And I'm wondering how you're seeing that strain the current or previous generation of orchestration systems, and how you're thinking about how that fits into the orchestration systems that are going to be coming out over the next few years.
[00:35:49] Hugo Lu:
Yeah. What do you mean by product versus analytical use cases? What are some examples of that?
[00:35:59] Tobias Macey:
So, for example, analytical use cases being typical business intelligence or even reverse ETL; product use cases being, I have some piece of data that gets fed into a table that is either an embedded analytics dashboard for a customer or data that gets fed into a recommendation engine, things like that.
[00:36:15] Hugo Lu:
Yeah. No. I'm with you. I mean, look, man. It's a real spectrum. Like, with the embedded dashboard for a customer thing, I was gonna say typically the use case isn't real time, but often it is. You know, we see people leveraging, like, modern analytical warehouses fairly well there, but having a really tough time if they don't have an orchestrator, because the data often fails and then it's out of date, and then their customers come to them and say, well, you know, this is terrible. I don't know what's going on.
So there's definitely an issue there. And I think, you know, the product need drives a lot of the requirements for how robust a system needs to be. And, ideally, you will, you know, centralize that data so that you can have an event based system that is essentially managed by software engineers. Right? So, you know, think about, like, maybe you've got an app. Right? And you need to show usage to the customer because they need to know when they're gonna hit their limits. Like, you're not gonna send events for usage onto Kafka, drop them into S3, put them into Snowflake, like, aggregate them on a daily level with a rolling 7 day average, like, put that in a Power BI dashboard and embed it into your app.
Like, you're gonna take an event. You're gonna insert a row into another Postgres table, or maybe you're gonna have a function that, like, cleans it first, and then you're gonna have your dashboard that looks at that Postgres table. Right? But the point is, it's, like, an event based system. It's not really in the data stack. Right? And I think in machine learning, this is even more the case, because for you to get it out of that software domain into something the data team is using, assuming it's something different, which it normally is, it's a lot of data when you're doing machine learning at scale. And, indeed, most ML engineers seem to want to do stuff on data that's in object storage, probably because of size. Right? It's like, you might wanna use some Spark. You might have to do Spark streaming. Right? It's like, can you do that in a warehouse? No. So it's in object storage.
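The insert-a-row pattern Hugo contrasts with the full Kafka-to-warehouse hop can be sketched in a few lines, with sqlite3 standing in for Postgres (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (user_id TEXT, units INTEGER)")

def record_event(user_id, units):
    # The optional cleaning function Hugo mentions: reject
    # obviously bad events before they ever reach the table.
    if units <= 0:
        return
    conn.execute("INSERT INTO usage_events VALUES (?, ?)", (user_id, units))

def usage_so_far(user_id):
    # What the embedded dashboard queries directly, with no
    # Kafka, S3, or warehouse hop in between.
    row = conn.execute(
        "SELECT COALESCE(SUM(units), 0) FROM usage_events WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]
```

The whole path stays inside the application's own event based system, which is exactly why it never shows up in the data stack.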
And, you know, again, there are a lot of other requirements around machine learning pipelines specifically, because some of that metadata related to, like, training and fine tuning models, like, monitoring their outputs, is so specific. And that's why there are sort of, like, machine learning specific orchestrators, same with, like, AI. Right? There are a load of AI orchestrators. I don't even know the names of them. But, like, it just goes to show how sort of specialized it is. We're probably doing data orchestration, I guess. But I think, yeah, things becoming more and more hyper specialized is the trend.
The other trend worth mentioning, sorry, I know I've been waffling on a lot here, is, you know, the centralization of data. So you can't do everything in S3. You need an analytical warehouse to do analytical queries. That statement is less true these days because of, well, Apache Iceberg. So we'll see where that goes.
[00:39:22] Tobias Macey:
Yeah. On that note, I just saw today that Amazon announced S3 Tables, buckets specifically designed for improving their Iceberg performance.
[00:39:34] Hugo Lu:
Yeah. And this is the cool thing. Right? It's like, say you've got a ton of product data, and it all lands in S3, and then your ML team picks it up, does some cool stuff, sends some recommendations back to the customer. But, you know, they build out some feature tables. Right? And then the data team picks it up from S3, puts it somewhere else, creates some reports. It's like, you just spent twice the amount of money you probably needed to, and now the data's in different places, and people don't know what the source of truth is. Now that can all be in one place. That's potentially big. So I think that's pretty cool.
[00:40:05] Tobias Macey:
Absolutely. And another pressure that I predict, I haven't seen a lot of movement there yet, but I think one of the ways that we're going to trend with the pressures of AI applications, where that is getting folded more explicitly into the product arena, is that by virtue of those AI models' inputs being a core dependency of the product experience, it brings the application engineering team back around full circle to being involved in the product that is the exhaust of the data that they initiated. You have to have that more full circle workflow of the application engineers through to the data teams, the ML teams, the AI teams, and back around to the application teams, all working in tandem with the same visibility. And I think that that's going to force more of these orchestration systems to grow beyond their current boundaries and incorporate that end to end life cycle, with visibility and touch points for each of those different personas.
[00:41:17] Hugo Lu:
Yeah. No. I hadn't even thought about that. Are you sort of saying that, like, in order to effectively incorporate AI into your product, you will probably need data that's not in the product? You'll need other types of data too.
[00:41:30] Tobias Macey:
Absolutely. I mean, just think about the RAG systems that are becoming the prevalent means of bringing AI into production for the current era of generative systems.
[00:41:40] Hugo Lu:
Right. Yeah. I mean, like, what would you need an orchestration tool to do in addition to what they already do, in respect of that?
[00:41:53] Tobias Macey:
I don't think it's even necessarily a growth in terms of their core functional capacity so much as it is an evolution of the way that it's being presented and integrated into the workflows of those different personas. Application teams' interface to orchestration is largely the CI/CD pipeline: I wrote my software. The tests passed. It got deployed. It's on QA. I tested it. Now I push it to production. Maybe there's feature flagging that gets factored in there somewhere. Versus the data team: I've gotta take all of the data from the application database, pull it out of there, put it into my warehouse, clean it up, present it, turn it into a usable asset for other things. And then you've got the ML teams: I've got my experimentation system, my feature store. I need to have my model training pipeline. I've got my model monitoring system.
And then with generative AI, you've got: I've got to figure out which model I'm using, maybe apply some fine tuning, get that deployed, monitor for hallucinations, guardrail issues, people trying to jailbreak it. But I also need to have all of my data inputs to the vector database to populate the RAG context and make sure that that gets updated appropriately, and manage the different generations of embedding model that I'm using to update or improve the way that the AI model gets used. All of that is getting collapsed into a single end user experience, whereas before, they were largely disparate teams working on disparate projects.
[00:43:22] Hugo Lu:
Yeah. I mean, I still think we're quite a way from that, but that is the dream. Right? If you can monitor all of those workflows from a single place, and all of your data is in the same place, and, you know, the way you're monitoring it also takes things like data quality into account, and, you know, is really, really reliable and robust, and is really well integrated with production systems that aren't the orchestrator. Right? Which, you know, like, it needs to be, for example. Right? It's like, if you've got, like, a service which serves up an AI model, right, and then your front end is just sending events being like, hey, a consumer asked this question, what's the answer? Right? It's like, that thing should be able to have some understanding of metadata. But, yeah, it'll be interesting to see where sort of orchestration lands in it. Yeah. It's a tough one.
[00:44:13] Tobias Macey:
It's also changing the directionality of data flow, where it used to be that it started in the application and then eventually made its way out to ML, and then it would start the cycle back over again. But with the interaction patterns of generative AI, that data gets fed into the AI directly. And then, given the memory layers that are being built out, it's immediately incorporated into the AI context and used back out for the end user experience, but then also fed through the typical data flow of analysis and experimentation to figure out, okay, how are end users interacting with this? How can we improve that? How does that get factored into the product life cycle?
[00:44:51] Hugo Lu:
Yeah. And, you know, it's making a lot of changes on the data side as well. I don't think you see people talking about it as much, 'cause it would be sort of a lagging indicator of how much AI stuff people are doing. But, you know, in the example you just gave, it's like, okay, let's say you've got an AI product and I'm having a conversation with it. Every single message I write is data. What's happening to that? Like, do I just have loads of event data that's landing in S3 that's just text? It's like, maybe. But then do you have something which is cleaning that data and structuring it before you put it into the analytical layer, right, before you write it to Iceberg, for example? You could do. Then maybe that's another service you build. Maybe that's a service you buy. But it's more complexity. Right? More things to integrate, which is why I think orchestration is so exciting. Because, you know, it's an area where we see a lot of data teams wanna move fast and not have to spend all this time building all these connections to all these things. So by sort of giving people those managed connections, in the same way that, like, a Fivetran means you don't have to learn the Salesforce API, we're trying to do the same thing for data teams so that they can, like, go a bit faster.
[00:45:58] Tobias Macey:
I think, too, that with the ability for AI to work across all of these boundaries, it's going to be increasingly incorporated into the data flow management arena, more so than it already is. And I think that there's going to be a certain amount of trust building that has to happen before people feel confident actually delegating any core capability to an AI model. But I think that, on that earlier point of collapsing the stack of personas and bringing it more full circle, I imagine that that conversational interface will probably be the unifying factor that brings all of those different teams into the same workflow and onto the same page.
[00:46:40] Hugo Lu:
Yeah. How do you mean?
[00:46:43] Tobias Macey:
Well, I imagine that, because of the fact that they're all used to working with data in different ways, if you can layer on a conversational aspect to it that speaks to them in their own language, then it reduces the tooling complexity of, oh, I have to build 5 different UIs to suit these 5 different personas. Instead, I have my interface, and there's just the conversational aspect where you can ask questions and get insights about the data, how it's flowing, what you need to do next, type of a thing. Or direct the orchestration engine to do the things that you want it to do without having to learn all the intricacies and peculiarities of the different functions that it wants.
[00:47:21] Hugo Lu:
You're talking about, like, an AI layer for data product managers all the way through to, like, machine learning engineers, that helps to build and monitor and recover data pipelines?
[00:47:35] Tobias Macey:
Yes. Yeah. I think we're probably 5 to 10 years away from that today, but yeah.
[00:47:44] Hugo Lu:
No. It's cool. And, you know, you see elements of this today. So when people spin up, like, you know, a new microservice, there are some pretty sophisticated data teams that will say, okay, well, to spin this up, all you need to do is write a few lines of YAML. But then what the YAML does is it automatically creates the orchestration pipeline. It automatically creates the dbt model. So, you know, it basically just provisions all the resources automatically. So then if you say, well, you know, we can actually have a menu of things we can create, right, and here's all the data on how we create it, and you feed that to a model, and then, yeah, put the AI on top, then there's no reason we can't do it. But put it this way. When I saw articles a year ago saying that AI was gonna automate away data engineers' jobs, I thought about what you just said, and then I realized how hard it was. And then I had confidence that trying to build a unified control plane that isn't powered by AI was not gonna be a colossal waste of time.
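The few-lines-of-YAML idea Hugo describes boils down to spec-driven resource generation. A minimal sketch, with the spec shown as an already-parsed dict; the field names and resource kinds are invented, not any real platform's schema:

```python
def provision(spec):
    """Expand a small service spec into the resources a platform team
    might create for it: an ingestion job, a staging dbt model, and an
    orchestration pipeline wiring the two together."""
    name = spec["service"]
    ingest, staging = f"ingest_{name}", f"stg_{name}"
    return [
        {"kind": "ingestion_job", "name": ingest, "source": spec["source"]},
        {"kind": "dbt_model", "name": staging},
        {"kind": "pipeline", "name": f"{name}_pipeline",
         "steps": [ingest, staging],
         "schedule": spec.get("schedule", "@daily")},
    ]
```

The "menu of things we can create" is then just the set of specs this function accepts, which is what would make it feedable to a model.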
[00:48:40] Tobias Macey:
Oh, absolutely. I don't think we'll ever completely cede control to the AIs. I think it will largely be a discovery interface and not necessarily a tell the AI to do the thing and then trust that the thing got done right. Yeah.
[00:48:55] Hugo Lu:
Yeah. Although you do raise an interesting point, especially in the context of, like, metadata frameworks, where, you know, like, processes will write data that says, okay, I just ingested all these tables. Like, I put the IDs over there. Hey, thing that's gonna move the tables, like, go fetch the IDs and move the tables. Right? It's like, you could potentially tag messages with the services and their descriptions and, like, their endpoints, and then just hit an AI in the middle instead of a database. I mean, it would be an AI on top of a DB. Right? But, yeah, I don't know. You would probably want to define that logic explicitly. We'll see. We'll see.
[00:49:34] Tobias Macey:
So on that point, in your experience of working in this space, working with data teams of various sizes and compositions and areas of focus, what are some of the most interesting or innovative or unexpected ways that you've seen data orchestration implemented or the ways that it has impacted the overall platform architecture?
[00:49:54] Hugo Lu:
Oh, good question. I mean, at the opposite end of the spectrum, right, the bulk of use cases are very standard. It's very boring, but boring works. There are some pretty interesting ways people use it in terms of, like, provisioning and incident handling. So because you can sort of run any scripts in any places, including, like, your data warehouse, you can sort of build event based flows that automatically help you do things like access control. Obviously, then you need your orchestration plane to itself have very good access control. Fortunately, Orchestra does. But, you know, that's one kind of rogue way people are doing it. Another way is just, like, using the orchestrator to get visibility. So, you know, this is something I feel like we're trying to pioneer as well. It's like, with a Datadog, you can send it data. Right? And then it shows you what's going on, and it sends you alerts. But Datadog doesn't do anything. You have to send it the data. Otherwise, it knows nothing. Whereas we have this lovely data model for your metadata.
If you've got, like, event based pipelines or pipelines that are happening elsewhere, you can still send us the data. So similar to, like, a DataHub, if you like. That's pretty cool, because it's, like, an expansion of what engineers think the orchestrator should be doing. Right? You're turning it into genuinely a place where you can say, okay, here I can see everything that's going on, and I can control everything. I can rerun things. I can notify people. I can, like, trigger workflows that are operational. That's pretty cool. So, yeah, I guess stuff like governance, automating governance, access control, as well as getting full visibility instead of relying on big clunky expensive things like Datadog.
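The send-us-your-metadata idea is essentially a small ingestion contract. A toy sketch; the payload fields are invented and not Orchestra's or DataHub's actual schema:

```python
import time

class ControlPlane:
    """Toy stand-in for a control plane that accepts run metadata
    from pipelines it does not itself execute."""

    def __init__(self):
        self.events = []

    def ingest(self, pipeline, status, source, ts=None):
        # External systems (a Lambda, another Airflow instance, a
        # stream consumer) push their run outcomes here.
        event = {
            "pipeline": pipeline,
            "status": status,
            "source": source,
            "ts": ts if ts is not None else time.time(),
        }
        self.events.append(event)
        return event

    def failing_pipelines(self):
        # The visibility layer: one query across every reporting
        # system, regardless of who actually ran the work.
        return sorted({e["pipeline"] for e in self.events
                       if e["status"] == "failed"})
```

The difference from a Datadog-style receiver is what sits behind this interface: an orchestrator-aware plane can rerun or notify, not just alert.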
[00:51:42] Tobias Macey:
And in your experience of working in this space, building an orchestration engine and trying to fathom the different ways that data is being relied upon and used, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:00] Hugo Lu:
Mate, there are too many. Like, it's just, you can cut everything in so many ways. Right? People have different tools. People have different teams. People have different latency requirements, and people just have different personas and experiences they wanna have, depending on the organization. Right? You could be a tiny startup with, like, 2 developers and still need, like, multiple environments, everything Git controlled, everything sort of, like, asset aware. You could be a sort of, you know, 10,000 person logistics company that is sort of foraying into building their first data products, right, which needs good orchestration, but also, like, a bit of visibility of what's happening on the event pipeline.
And then also a way to, like, you know, enable and monitor self serve, because you've got, you know, 10 global divisions. Right? Even though all you're fundamentally doing is building a relatively simple, you know, ELT pipeline. I think one area which I definitely didn't appreciate as much as I do now is the need for, like, security in where things are hosted. I've learned what colocation and, like, what an Azure Private Link and a self hosted instance actually mean. And it's, like, nuts. Because for anyone that doesn't know, right, if you've built a software product and you run it in the cloud, you run it on AWS in London, and then you have a company in California that says, hey, we're on Azure, we need Private Link to Azure, can you support that? What you then have to basically do is write your app using Azure services and make sure it can be hosted and provisioned in basically the same building that all that stuff is in. That's really hard to do, mate.
[00:53:41] Tobias Macey:
And for people who are tasked with building a data platform, managing its health and longevity, what are the cases where a data specific orchestrator is the wrong choice?
[00:53:53] Hugo Lu:
Oh, the wrong choice. Good question. If you have purely streaming use cases, don't get an orchestration tool. You should be streaming that stuff. Apart from that, I mean, if you're gonna do batch stuff, you should probably have something. Like, if your flows are really simple and if they're linear, I would probably just monitor it: like, have really good logging and have the different services talk to each other. Orchestration is probably overkill. And, oh, here's a good one. So if you're a huge, huge company and you have very, very difficult SLA requirements, you might wanna choose something like a Palantir.
Right? In this case, you're buying the platform. It's like, don't build it. Buy the thing. Other than that, I think you're always gonna need one. Right? I mean, and final point. In terms of buying a data platform, right, historically, this was basically the same thing as, like, you know, maybe having a warehouse, but it's on premise. So it's like, do we get Oracle? Do we get SQL Server? You know, that kind of question. Now the discussion is, well, do we get BigQuery? Do we get Snowflake? Do we get Databricks? Caveat is, none of those are a data platform. Databricks is getting very close to having everything, but not quite. And even with people on Databricks, most of those organizations still also use Snowflake. I think it's at something like 40 or 50% that they share customers. So you're still gonna have to get visibility of everything.
So how do you do that? So that's, yeah, that's another reason that I think building something which connects to different parts of the stack is a good bet. Because it's not getting any, well, it is getting a bit simpler, but there's still a lot of tools out there.
[00:55:35] Tobias Macey:
And as you continue to build and invest in this ecosystem, what are some of your predictions and/or hopes for the future of data orchestration?
[00:55:40] Hugo Lu:
I don't know about orchestration, but I think, generally, it'd be good to see data teams stop being viewed as a call center. I predict that data teams will realize that even for basic BI use cases, the level of, essentially, the SLA of the data needs to be a lot higher than we think it is, much more similar to a software system, if anything, like, higher. Because at the end of the day, people are really fickle when they see data that they don't trust, and it's really easy for them to lose that trust. I don't think we generally make things to a sufficiently high standard. Then, like, definitely consolidation, right, in the orchestration plane. Like, you see this with, you know, lots of companies like dbt, Orchestra, Dagster. We're all sort of trying to grasp everything at the top. So, like, not being a warehouse, not being an ingestion tool, not being a dashboarding tool. And, yeah, the other one, of course, as you mentioned today, is, like, Iceberg. Right? It'd be cool to see if people can move things together. But at the end of the day, if you still see the data team as a cost center, if you're not driving value from it, then it's a bit of a defensive exercise to move stuff to Iceberg and slash your costs and reduce your security footprint. Right? It's like, that's not why we got hired. Like, we got hired to make companies grow.
So, yeah, those are my main ones. What are yours?
[00:56:58] Tobias Macey:
I think the main one is what we discussed earlier, of AI being a motivating factor to push all of those teams closer together into tighter collaboration and cooperation, with the orchestration engine being that focal point of interaction.

Nice. 10 year plan.

Are there any other aspects of data platforms, their architecture, and the role of the orchestrator that we didn't discuss yet that you'd like to cover before we close out the show?
[00:57:33] Hugo Lu:
I think we're all good, to be honest. I think it'll be really interesting to see how people start automating things, and simplifying things even more, and, like, what that does to both the users of the data and the users of the data platform. I've always thought that they should sort of kinda be the same people. Right? It's like, there's nothing better than a power user that can self serve. But it's not getting any easier to architect a data platform, so the people that know how to do that are just getting more and more specialized and better and better paid. So we'll see what happens.

As a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

The answer is not orchestration, because you've got plenty of those. I think it is around effective governance and prioritization. I don't know if it's a tool. I don't know if it's a process. But at the end of the day, a lot of dashboards don't get used. A lot of the work that data engineers do, we feel like it goes down the drain. Anything we can do to say, well, I care about these 10 things, and then say, well, actually, it's only 5, that would be game changing.

Absolutely. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience in the arena of data platform design and orchestration
[00:58:38] Tobias Macey:
and the ways that that impacts what people are able to get done with their data. It's definitely a very interesting and important problem space, so I appreciate the time and energy that you're investing in it, and I hope you enjoy the rest of your day.
[00:58:57] Hugo Lu:
You too, sir. It's been really good to be here. Cheers, mate.
[00:58:59] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Introduction to Data Engineering Podcast
Understanding Data Orchestration
Complexity in Data Systems
Challenges in Data Trust and Visibility
Generational Shifts in Data Orchestration
Growing Pains in Data Systems
Event-Based vs Time-Based Scheduling
Analytical vs Product Use Cases
AI's Impact on Data Orchestration
Innovative Uses of Data Orchestration