Summary
The core task of data engineering is managing the flows of data through an organization. Ensuring that those flows execute on schedule and without error is the role of the data orchestrator. Which orchestration engine you choose impacts how you architect the rest of your data platform. In this episode Hugo Lu shares his thoughts, as the founder of an orchestration company, on how to think about data orchestration and data platform design as we navigate the current era of data engineering.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- It’s 2024, why are we still doing data migrations by hand? Teams spend months—sometimes years—manually converting queries and validating data, burning resources and crushing morale. Datafold's AI-powered Migration Agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity.
- As a listener of the Data Engineering Podcast you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us don't miss Data Citizens® Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders, innovators, and executives in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens® Dialogues, industry leaders unpack data’s impact on the world, from big picture questions like AI governance and data sharing to more nuanced questions like, how do we balance offense and defense in data management? In particular I appreciate the ability to hear about the challenges that enterprise scale businesses are tackling in this fast-moving field. The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now! Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts.
- Your host is Tobias Macey and today I'm interviewing Hugo Lu about the data platform and orchestration ecosystem and how to navigate the available options
- Introduction
- How did you get involved in building data platforms?
- Can you describe what an orchestrator is in the context of data platforms?
- There are many other contexts in which orchestration is necessary. What are some examples of how orchestrators have adapted (or failed to adapt) to the times?
- What are the core features that are necessary for an orchestrator to have when dealing with data-oriented workflows?
- Beyond the bare necessities, what are some of the other features and design considerations that go into building a first-class data platform or orchestration system?
- There have been several generations of orchestration engines over the past several years. How would you characterize the different coarse groupings of orchestration engines across those generational boundaries?
- How do the characteristics of a data orchestrator influence the overarching architecture of an organization's data platform/data operations?
- What about the reverse?
- How have the cycles of ML and AI workflow requirements impacted the design requirements for data orchestrators?
- What are the most interesting, innovative, or unexpected ways that you have seen data orchestrators used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data orchestration?
- When is an orchestrator the wrong choice?
- What are your predictions and/or hopes for the future of data orchestration?
Parting Question
- From your perspective, what is the biggest thing data teams are missing in the technology today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Orchestra
- Previous Episode: Overview Of The State Of Data Orchestration
- Cron
- ArgoCD
- DAG
- Kubernetes
- Data Mesh
- Airflow
- SSIS (SQL Server Integration Services)
- Pentaho
- Kettle
- DataVolo
- NiFi
- Dagster
- gRPC
- Coalesce
- dbt
- DataHub
- Palantir
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Hugo Lu about the data platform and orchestration ecosystem and how to navigate the available options. So, Hugo, can you start by introducing yourself?
[00:00:28] Hugo Lu:
Of course. Great to be here, Tobias. So I'm Hugo Lu. I'm the CEO and cofounder of Orchestra, which is a unified control plane for data. Prior to this, you know, I'm sort of like all those people that fell into data by chance. My first stint was in investment banking, and then I moved into strategy at a company called Jewel, picked up data, and it's kind of been history ever since. So, yeah, thanks for having me. Looking forward to this.
[00:00:54] Tobias Macey:
And you mentioned that you founded Orchestra, which is a company focused on orchestration. We're not going to spend a lot of time on what you're building specifically, but rather on orchestration generally and how it impacts the rest of the choices that you make about how you work with data. And I'm wondering if you can start by giving your definition of what constitutes an orchestrator and orchestration in that data context.
[00:01:19] Hugo Lu:
Sure. I think it's really interesting when you try to build a data platform. Right? Because you think about where you wanna put your data, what you wanna use to, you know, change stuff in it. So, like, a compute engine. But fundamentally, if you don't have something triggering something, then nothing is ever gonna happen. So that's sort of where I see an orchestration tool coming in. I would just define it as a way to schedule, trigger, and monitor things. Nice and short.
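That three-verb definition, schedule, trigger, and monitor, can be sketched in a few lines of plain Python. This is only an illustrative toy, not any particular tool's API; all of the names here are invented:

```python
import time

def run_scheduled(job, every_seconds, iterations):
    """Schedule: run `job` at a fixed interval; trigger it; monitor the outcome."""
    history = []  # monitoring: record each run's status
    for _ in range(iterations):
        started = time.time()
        try:
            job()                      # trigger the unit of work
            history.append("success")  # monitor: record the result
        except Exception:
            history.append("failed")
        # schedule: sleep off whatever remains of the interval
        time.sleep(max(0.0, every_seconds - (time.time() - started)))
    return history

statuses = run_scheduled(lambda: None, every_seconds=0.01, iterations=3)
print(statuses)
```

Everything an orchestration engine adds, retries, dependency graphs, distributed workers, alerting, is elaboration on this loop.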
[00:01:51] Tobias Macey:
And orchestration as a practice and as a principle is something that has existed since well before computing, but it has been translated into the computing environment in various forms. Maybe the most notable and most long-lived one is cron: I want this thing to happen at this interval. Many people have outgrown it, and many people still use it for various use cases, but other aspects of orchestration are things like CI/CD pipelines, where you wanna make sure that your software builds get through and get tested, etcetera. Orchestration and scheduling are also generally linked, and then you start getting into things like Kubernetes and its internal scheduling system, which orchestrates all of the different moving pieces, which has then led to outgrowths of things like Argo CD, which has also made forays into the data space.
And I'm wondering if you could just talk to some of the ways that that idea of scheduling and orchestration has been kind of conflated and jammed into various shapes and places and how the specifics of the ways that the orchestration is managed and executed and scheduled influences the ways that it actually works within a given use case.
[00:03:11] Hugo Lu:
Yes. Absolutely. A lot to unpack there, but I think you kind of hit the nail on the head. Right? The process of having, you know, I wanna complete this task, and then I've got multiple dependencies, and then I wanna do those things, and then there are multiple dependencies after that, is a practice that is as old as computing. And, you know, if you speak to anyone on the software side, like, orchestration is not a thing. Right? If you need to execute, like, a series of tasks in, like, a directed acyclic graph or a DAG, that sort of functionality is built into a lot of things that have names. So, you know, you mentioned Kubernetes just as an example.
You know, it's a great example. Right? There are a lot of dependencies and processes that need to happen within a Kubernetes cluster, and, obviously, it's got a scheduler too. I think the reason orchestration has got its own sort of area in data is probably because a lot of the, like, processes we have are split into different areas. Right? So if anyone's ever built a data ingestion system, that has to have an orchestration component too, because maybe you need to, you know, trigger parallel fetches of data, put it into a staging area, run quality checks, you know, move it somewhere, change the format, and then push it to a final destination. Right? That's not just gonna be handled in one big script.
But the fact that we have these, you know, things that do data ingestion, things that transform data, maybe things that do, you know, transformation and then ingestion and maybe a little bit more, means that there's a need to, like, monitor lots of different things in different places. So as a result, a lot of the engineering time that, you know, data teams spend is saying, okay, well, I've got all these components. How do I stitch this system together? And, you know, the word that prevails here is orchestration. Right? You stitch it together with an orchestration tool. So, yeah, I think that's more or less where it fits in.
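The ingestion flow described above, parallel fetches feeding a staging area, quality checks, then a final load, is exactly what a DAG expresses. A minimal, library-free sketch using Python's standard-library topological sorter, with task names invented for illustration:

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "fetch_orders":    set(),
    "fetch_customers": set(),  # the two fetches have no dependencies, so they could run in parallel
    "stage":           {"fetch_orders", "fetch_customers"},
    "quality_checks":  {"stage"},
    "load_warehouse":  {"quality_checks"},
}

def run(dag):
    """Execute tasks in a valid dependency order (serially here, for simplicity)."""
    order = list(TopologicalSorter(dag).static_order())
    for task in order:
        pass  # a real orchestrator would dispatch the task's actual work here
    return order

order = run(dag)
print(order)
```

A real orchestrator would dispatch each task to a worker, track retries, and record run status; the point here is just that the execution order falls out of the declared dependencies.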
[00:05:21] Tobias Macey:
As you noted, orchestration is something that finds its way into almost every piece of software in some fashion, which leads to a lot of complexity and confusion as you're selecting which piece of the stack is going to own, which pieces of sequencing and the overall control. And if you do allow all of those different pieces to delegate a certain layer of orchestration, then you end up in the situation of having to stitch back together the view of what are all those pieces, how are they happening, and when versus having a centralized orchestration engine that says, I'm going to take control over all of these things. You don't do anything by yourself unless I tell you to. And, obviously, those two extremes have a big impact on the overall architecture of the data platform.
And I'm wondering if you can talk to some of the ways that you've seen those gradations take shape as people build their data systems and their data workflows and how they try to make sense of how data is moving through their organization.
[00:06:26] Hugo Lu:
Yeah. Definitely. I think a helpful lens here is attacking it from, like, a maturity standpoint. Right? So, you know, many people that are trying to build a data platform have started from day 1. Right? And, you know, on day 1, you might not have loads of people relying on loads of reports. So you maybe have a couple of scripts that are cleaning some data. They're storing it somewhere. Maybe you do, you know, a little bit of cleaning, and then, you know, you're kinda done. Right? People will have a dashboard that's directly querying it, or maybe people will just go in and get that data and do some fun stuff with it, download it to Excel. But the orchestration is not complicated here. Right? You can sort of move stuff and then have something else triggered when it needs to be.
Obviously, as you grow, that gets more complicated. Right? What happens if you have a big dataset and you're using something like a Power BI or a Tableau? You need to trigger an extract refresh. What happens if you have a lot of data and you need a complicated data model? Right? You might have hundreds or thousands of tables. What happens if you have 30 different sources of data that people are relying on? You can't just have one ingestion tool. Maybe you have multiple ingestion services. Maybe some of that's streaming. So the question then becomes, how do you stitch all of that together and get visibility, while leveraging all of those components you've already got to their fullest extent?
And I think at that point, it becomes really, really difficult to have all of those different systems talking to each other. Right? It's like, in the sort of software world, you might have, you know, different services that speak to each other. Right? They send each other events. It's all choreographed. Right? You don't orchestrate many software systems. The difference here is that we're dealing with data. So, you know, if every service doesn't have access to the same data, it becomes very expensive and very slow to make that work. And as a result, it can be helpful to have a sort of control layer on top of all these different services, because you don't have this huge data dependency in software like you do in data.
[00:08:34] Tobias Macey:
One of the approaches to gaining that visibility, which is largely an artifact of how you think about where that control lies and what the motivating force for the propagation of data is, is the idea of an overarching metadata catalog that all of your different tools integrate with. It either pulls data from them or they push data to it, so that you can see across all of the different pieces of software and technology: this is all the data that I have, this is how it moved, etcetera, etcetera. Whereas different orchestration engines have also tried to pull that into the core of their functionality: I am going to own everything, so I will be the repository of metadata and give you visibility across these different layers.
And I'm curious how you've seen those philosophies play out in your experience of working in this space and working with customers.
[00:09:28] Hugo Lu:
Yeah. No. Look. I hear it. Again, like, a lot to unpack. And I think we should start with the problem people are trying to solve. A lot of the time, there's a data team that is scaling or at scale. The consumers of data, particularly, like, if you're doing BI, really struggle to get trust in it. Right? It's like, you're leading a data platform. You've got 15 hardcore engineers. But at the end of the day, some of the datasets that you're building are for people in product, they're for people in marketing, they're for people in finance. Right? And they've got to come to you and say, hey, like, is this data fresh? Like, something looks a bit funky. I don't really know what's going on. And, you know, you then have this pattern, right, where on the one hand, you have this central team, or many central teams. And then on the other hand, you have the consumer.
And the consumer basically has no idea what's going on. So the solution is to say, ah, well, you know, we as the central team can give you a catalog. The catalog will show you what's going on. I will train you to use the catalog. You know, we'll pay lots of money for the catalog. We'll maintain the catalog. But this is the way that you understand what's going on. This is how you can get trust in the data. And, you know, this is, like, a really tricky pattern to make work, because fundamentally, you have, like, a bottleneck, or many bottlenecks, who actually know what the hell is going on. So I think this is the first thing we see playing out. Right? At scale, even with a catalog, people struggle to work out what's going on, which is bad, because as a data team, your goal is to help them know what's going on so they can use data to make decisions. So that's the first thing. Second thing is that as a data team, it's a lot of effort to make that pattern work. Like, I was speaking to a fast-growing technology company. They have about 1,500 employees. They're doing data mesh. Right? So they're saying, hey, we're gonna give everybody the tools they need to build their own pipelines.
And they're super highly technical. The end users are backend engineers. And even then, it's taken them almost 2 years to stand up Airflow and, like, parameterize it, in the sense of having, like, a sort of YAML-based domain-specific language on top that anybody can use. And on top of that, right, even after they've written all those pipelines, they have to write yet more code to keep their catalog up to date. And it's taken them, you know, 6 or 7 platform engineers a year and a half, and only backend engineers can use what they've built. Right? They haven't even started on marketing or finance yet. I asked the lead engineer, how many Airflow instances have you got? He said, oh, I've lost count at this point. You know? Like, you can do this pattern, and it just takes an enormous amount of effort and resource. And, you know, if you've not done it before, I would say there's quite a high chance of failure. Right? So, you know, I think that's the second component. It takes a lot of investment to, you know, not only stitch everything together, but also surface it in a useful way. So this is part of the reason that there are some, quote, unquote, orchestration tools that are trying to be the catalog, because, you know, the orchestration tool triggers and monitors everything. So it has all the context. It has all the metadata. Right? It's got all those juicy run IDs, which you wanna monitor over time.
So from an architectural perspective, it would make sense to kind of put a catalog there.
[00:12:48] Tobias Macey:
The challenge there, though, is that by having that be the nexus of metadata, it then forces you to use that for situations where it's maybe not the appropriate fit for owning a certain data flow just so that you can get the metadata into it versus if you have 90% of your metadata in your orchestrator and only 10% of your workflows live outside of it, you then have to add a whole other software layer just to be able to track those disparate pieces.
[00:13:19] Hugo Lu:
Right. Yeah. You hit the nail on the head. And this is the issue people find with Airflow. Right? It's completely agnostic. So you can sort of trigger and monitor any Python processes. But a single task can be, like, a print statement, or a function that prints, like, hello world. It does nothing. You have to write everything yourself. And what we see is data teams spending time building pipelines to fetch metadata that is generated by their pipelines, and then building dbt models to, like, clean that metadata, and then building dashboards themselves to monitor the metadata, and then building alerting systems on the metadata, all themselves. You know, I think in some cases, it's probably, like, genuinely, like, a doubling of work just to know what's going on, which is insane.
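To make that concrete: a bare task can be nothing more than a Python callable, so the orchestrator learns nothing from it, and teams end up wrapping every task to emit run metadata themselves. A library-free sketch of that extra work, with the wrapper and record fields invented for illustration (not Airflow's actual API):

```python
import time
import uuid

def hello():
    print("hello world")  # a "task" in the loosest sense: it emits no metadata at all

run_log = []  # the kind of metadata store teams end up building pipelines around

def tracked(task):
    """Wrap a bare callable so each run emits a record worth monitoring."""
    def wrapper():
        record = {"run_id": str(uuid.uuid4()), "task": task.__name__,
                  "started": time.time(), "status": "running"}
        try:
            task()
            record["status"] = "success"
        except Exception as exc:
            record["status"] = "failed"
            record["error"] = repr(exc)
        record["finished"] = time.time()
        run_log.append(record)
    return wrapper

tracked(hello)()
print(run_log[0]["task"], run_log[0]["status"])
```

Every task in the platform needs this treatment before a dashboard or alerting system has anything to read, which is the "doubling of work" being described.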
[00:14:08] Tobias Macey:
It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources, and crushing morale. Datafold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity.
You're calling out Airflow, and its Python orientation is also another angle to the impact that orchestration systems have on the overall architectural choices of your system, because some of these orchestration systems are very much oriented to a specific language or a specific mode of interaction. And that influences the ways that you think about hiring, who works on all of these different data flows, who is able to interact with it and control it, versus other orchestration systems that are going to the other extreme of low code: take whatever language runtime you want, we're just going to let you click and drag things together, and it'll all be amazing. What have you got in mind? Nothing specific, but I think the ones that come to mind most readily are, like, the Kettle and Pentaho
[00:15:42] Hugo Lu:
of, I don't know, 10 or 15 years ago, and the Microsoft SQL Server Integration Services and things like that. Yeah. But, like, it's really interesting, right, because it's almost like we're coming full circle. You mentioned SSIS, SQL Server Integration Services. That's another good example of a product that does what it's meant to do, but also has orchestration within it. You know, people will have seen that Snowflake recently acquired a company called DataVolo, and that's based on NiFi. Again, the same thing. It's fundamentally a low-code, well, a no-code tool for moving, ingesting, and transforming data. But within it, you can do orchestration. But the point is, like, the problem it solves is, okay, how do I take data from these places and put it in that place in the format I want it? And it does all of that in one go. And with the advent of, like, the modern data stack and things getting more complicated and, you know, all the things that are driving us to make more complex systems, you lose out on orchestration, because you have these different components that are very good at doing one thing.
Whereas before, you just had packages that had it all. Right? It's like you didn't think about orchestration. It's like, well, of course, I can trigger things in this software. Like, how else would it work?
[00:16:57] Tobias Macey:
I think that's an interesting point too, as far as the generational shift in the ways that we're using these tools and the ways that these tools are implemented, where the early stages of ETL, orchestration, and data movement were these monolithic packages largely bolted onto some database software, and they were the place where everything got done. So it was very much a centralized monolith. And now, as we have increased the sources of data, types of data, who is consuming the data, how the data is being used, whether it's batch or streaming, etcetera, etcetera, it pushes us more into this federated approach of we have lots of little things happening all over the place. And the orchestration systems that are designed for the current era are generally built with that in mind: being able to have some sort of central nexus of control or visibility, but allowing for federating across multiple different execution contexts.
My experience is largely with Dagster, where, for instance, it has the Dagster web node that you can then point to multiple different running gRPC services that correspond to the actual pipeline code for different use cases. So you can have that central visibility with federated execution. And I'm just wondering how you're seeing those generational divides of orchestration and platform architectures being able to bridge that gap or manage that dichotomy of central visibility and control versus federated execution.
[00:18:41] Hugo Lu:
Yeah. I mean, it's hard. Right? And I think a lot of the reason for it is the movement to the cloud. So, you know, we're speaking to one of the largest hospital chains in the US. Right? And they're all on Oracle. And, you know, they're doing all their data integration, all their transformation there. Oracle works super well. But the last 10 years have been different, because there's a lot of data they need that's not on premise or in Oracle anymore. So now they're saying, okay, how can we push that data into Oracle? How can we then get it out of Oracle and put it where we need to? Right? They need something that can integrate those different layers. And that's why, you know, we as Orchestra are talking to them, because we facilitate that. And the cool thing about the cloud is you can, you know, connect and build integrations to different things in the cloud, be it AWS or Snowflake or Databricks or whatever.
And, you know, you can do this in Orchestra too, obviously, but, like the example you mentioned with Dagster, you can still connect and monitor processes which are remote. So, like, on a server. Right? As long as there is some kind of Internet access, you can get visibility into that. And, you know, I think something we do which is quite unique is we take things one step further. So you know how I mentioned that in Airflow, a task can be very simple. Right? It can be a function that you write yourself. In Orchestra, a task is much larger.
You give us a few lines of YAML, so it's declarative. And not only will we handle that task, but we'll also fetch all of the metadata relating to that task. So, a little bit like how Airflow will, you know, give you logs when you use, like, an SSH operator. Right? It goes into where you've got it and pulls the logs out. Like, we'll get logs, but if the underlying tool also has an orchestration engine, we'll also surface that sub-DAG and then do things like calculate lineage, which is really, really cool, because a lot of what we see is, you know, people running highly complex processes in different places. Right? You might have an analytics engineering team that just uses, like, a Coalesce or a dbt.
That's all they do. But then you might also have engineering, you know, data movement processes that depend on it, machine learning models, reverse ETL that depend on that on the other side. So then the question becomes, how do you get the full end-to-end visibility such that it's not just, like, a box A, box B, box C type thing? And that's what we're trying to do.
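Calculating lineage from that kind of metadata reduces to walking the dependency graph upstream from an asset. A hedged sketch with invented asset names (not Orchestra's actual implementation): given which assets each asset reads from, the full upstream lineage is a graph traversal.

```python
def upstream_lineage(asset, deps):
    """Return every asset that `asset` transitively depends on."""
    seen = set()
    stack = [asset]
    while stack:
        node = stack.pop()
        for parent in deps.get(node, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical end-to-end flow: raw ingestion -> dbt-style models -> reverse ETL
deps = {
    "stg_orders":  ["raw_orders"],
    "fct_revenue": ["stg_orders"],
    "reverse_etl": ["fct_revenue"],
}

print(sorted(upstream_lineage("reverse_etl", deps)))
```

The hard part in practice isn't the traversal, it's assembling `deps` across tools that each hold only their own piece of the graph, which is why sitting at the orchestration layer helps.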
[00:21:12] Tobias Macey:
Another reason for that generational shift too, I think, is the ownership of the process, where in the early days of data warehousing, all of the ETL, all of the business intelligence was largely owned by the IT department. So it was very much a cost center. It was something that was done because it was necessary, not because it necessarily drove its own inherent value. Data has now been moved more into the core of the product workflow. Ownership of all of those systems has largely been moved into a separate team that is generally distinct from IT, and they're more of a software product focused team, at least for people who are doing it in the, quote, unquote, modern way.
And so I think that also shifts the ways that the systems are designed and packaged and sold where when it's an IT asset, you sell it to the IT team, and they just want something big, predictable, manageable. They don't want to have to do a lot of customization to it. Whereas with data teams, they're generally working in more of the agile workflow of iterative development, iterative improvement. We want things that we can customize and tweak to suit our specific needs. And I think that that's another way that the overall architecture and platform approach to data has grown out of what it originally started from.
[00:22:43] Hugo Lu:
Yeah. Definitely. And, how to put this? The use case for data is really important here. So, you know, we work with, like, many large manufacturing and logistics companies. Right? And they have sensor data for their operations. So having this sort of move through the system in a timely way is, like, of critical importance. Because if they don't do it, they can't respond to, you know, just changes in stuff that's happened that's fundamentally gonna impact their bottom-line P&L. Right? It's like, if something is gonna be delayed and they have an SLA with a customer and they don't let them know, then, you know, they're gonna take a hit. Right? So in this case, data's playing a really, really key and important, like, operational function.
And in that case, right, the person who is sort of owning that product is probably someone on the operations side. They're probably not gonna be able to build out, like, you know, a relatively low-latency, stable orchestration system. Right? It's like, they've got suppliers, they've got projects, they've got factories to manage. You can't expect them to build data infrastructure as well. But in those cases, you know, it kinda makes sense that you would have someone that says, hey, look, I'm gonna make sure that this thing is delivered to you every 15 minutes, every 5 minutes, and you're gonna get alerted if it's broken, and I'm gonna be your point of contact. Right? That's when I think the sort of platform team on the one side, stakeholder on the other, that pattern works really well. The newer use case is, like, BI, right, and just, like, cloud stuff. So if I'm, like, you know, if I'm working in marketing or I'm in finance and I wanna get a real-time, you know, look at my transactions. Right?
Just because I need to do reporting and just keep a hold on stuff. Right? Has this customer paid today? They're a big customer. Like, it would be good if I could work that out, and if I had the data updated every 15 minutes, then I can email them at 5 PM at the right time so that they actually convert instead of falling out. The engineering for these use cases is often, like, a little bit easier. And I think here, where we're really moving to is empowering people to do this end to end themselves. So, you know, increasingly, you'll see finance teams talking about how they've adopted Snowflake and it's, like, revolutionized their ability to drive insight. Right?
And that's because they will have a power user that can write SQL, that's like, yeah, I know what I'm doing. Like, I'm gonna be the guy that helps my VP of finance work out everything and automates all these processes, so we can actually start, you know, driving the business of finance instead of just, like, keeping the lights on. That's why it's really interesting for me from the orchestration side, because that's, like, the final technical bit that would be really hard for them to do, that, you know, we're sort of trying to help people be able to do now.
[00:25:30] Tobias Macey:
We've been talking about the ways that your selection of orchestration tool influences the ways that you think about your overall platform architecture, but there are also many cases where you have to approach it from the reverse angle of you've already started building out your data systems. You are hitting growing pains of not being able to have that visibility, that sequencing that we've been discussing, and I'm wondering how that influences the ways that you think about what type of orchestration tool or what types of orchestration you need if you already have the data flows and you're just trying to get them under better management.
[00:26:06] Hugo Lu:
Yeah. I mean, like, let's dig into that a bit more. What type of scenarios are you thinking of? Like, what did the data team have? What growing pains are they running into? Yeah. I mean, it all depends.
[00:26:20] Tobias Macey:
Well, that's generally what any question in engineering boils down to. I think that, typically, what you would run into is the initial promise of the modern data stack of, you just throw a credit card at the problem, and you'll have all of your data in your warehouse, and your BI will be amazing because you're using Fivetran, Snowflake, dbt, and whatever the business intelligence tool of the day is. And so you say, okay. Great. All of this stuff is working, but now I don't actually know when the data flows are failing or what the quality issues are or whether the data is up to date or if my dbt compiled properly.
[00:26:57] Hugo Lu:
Yeah. Okay. That's a good one. So I guess the pain is, well, we threw a credit card at the modern data stack, and it's very expensive, and we're no better at making decisions with data than we were before. Yeah. I mean, look. The sort of phrase du jour is data quality, and I think, you know, that setup has its issues. So, obviously, without some sort of end to end orchestration and observability, it's gonna be really hard for you to just, you know, let people know who depend on a specific data asset or, like, dashboard when stuff is breaking. Right? Stuff always breaks. So you need to have some kind of orchestrator in there. Right? If you don't have that, it's gonna be tricky.
And, you know, I think the key here is to get a little bit more flexibility. So it's important to basically build out the stack in a way where you can use the tools for what they're really, really good at. So running everything through dbt might not be the best idea. Right? If you've got stuff that needs to go quickly, you might wanna use, like, Delta tables in Databricks or Snowflake tasks and dynamic tables. Right? You might have some people that wanna self serve in, like, a notebook environment instead of, like, a dashboard. You might not wanna have all of your connectors going through one vendor. You might wanna start doing some streaming. Right? And then in this case, you're like, well, I'm making my stack more complex so that I can save cost, right, and get data to where it needs to go faster.
I'm splitting up my data pipelines in more and more granular ways. But now you have 6 things that you have to connect instead of 3. And before, you know, it just about worked with no Airflow and stuff running at 4 AM and then 6 AM and 8 AM. It was okay. Now that doesn't work. So then you're like, now I need a platform engineer to put in Airflow. And then you have this whole bottleneck problem, because then anytime anyone says, hey. I'm not sure what's going on or, hey. Can I change the schedule for this? It results in a big old long ticket, and then you've got a data platform manager talking to a head of marketing and, you know, they're butting heads.
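The staggered-cron failure mode described here has a standard fix: declare the dependencies and let the orchestrator derive the order. As a minimal sketch (the pipeline names are invented for illustration, not from any real deployment), Python's standard library can express the idea directly:

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline names. The point: run order is derived from
# declared dependencies, not from hoping the 4 AM job finished by 6 AM.
deps = {
    "ingest_fivetran": set(),
    "ingest_streaming": set(),
    "dbt_staging": {"ingest_fivetran", "ingest_streaming"},
    "dbt_marts": {"dbt_staging"},
    "notebook_refresh": {"dbt_marts"},
    "dashboard_refresh": {"dbt_marts"},
}

# static_order() yields every task only after all of its upstreams.
order = list(TopologicalSorter(deps).static_order())
```

This is essentially what any orchestrator does under the hood; the operational win is that adding a seventh thing to connect is one more entry in the graph, not another hand-tuned cron offset.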
So, you know, I think, like, in this case, right, Orchestra is a pretty good solution, or indeed, like, any orchestration platform that is easy to use that also gives people good visibility of what's going on. Like, clearly prioritizing and, like, defining the different data products you have, so essentially just, like, grouping pipelines and grouping things, is also very helpful. Because then instead of saying, oh, like, for me to work out what's going on, go ahead and inspect this 1,000-box DAG, you're just saying, yeah. Sure. Here's the pipeline for your invoices data product. Here's how it's doing. Here's the data quality. You can make decisions on this. It's okay.
But I think something else people find, right, as a sort of scenario 2, is, we have flexibility. We have a really good platform team. We have an orchestration framework in place that we manage ourselves. We have Airflow, say, but it's a big monorepo. There's loads of stuff going on, and we're just spending way too much time managing it. Right? Like, stuff takes too long. Stuff that should take an hour takes 2 hours. Like, the cluster keeps going down. And to boot, we also probably have quite a lot of data quality issues that we don't control. So, you know, we spoke to a health tech company over here in the UK earlier, and what they're doing is really cool actually. They're shifting some of that left.
So they're taking the staging models that their software teams give them in dbt and asking the different teams to manage those themselves. So the central data team is actually, like, it's kinda like cheating, but, like, they're basically just doing less stuff. Right? But then you have this other problem. Right? You've got 70 repos of dbt code or whatever, and then, you know, they're building, like, central data models or, like, the clean data or whatever. And then you've got the central data models and the marts happening afterwards. How do you keep visibility of all of that?
And, you know, you can still do it with Airflow. Right? Just have 8 different Airflow instances and stitch all the Airflows up to each other, but then you probably have to get something on top of that to monitor them. You know, it's like a who will guard the guardians type thing. And you have that with, like, pretty much any orchestrator. So that's why we pitch ourselves as, like, a control plane. That's why dbt Cloud has this concept of, like, dbt Mesh. Because, you know, you realize having everything in one place is a lot, so you need to move stuff to other teams. But then, again, you have the complexity issue of how do you monitor things in different places. But, yeah, there are a couple of scenarios where we see people running into problems, and that's how we see them solving it.
[00:31:49] Tobias Macey:
Another aspect of orchestrating data flows, particularly when you're not dealing specifically with streaming data, is the idea of, do you do time based triggering, or do you do everything as event based, where you're reacting to state changes in the system, and where sometimes that state change is the wall clock ticking over to a certain point? And I'm wondering how you see those trends moving in the overall data ecosystem of people's appetite for, I want things to happen on a predictable schedule, or I want things to happen as soon as possible whenever a given event takes place.
[00:32:29] Hugo Lu:
Yeah. I mean, I think definitely a trend towards the latter. Right? Like, people want more data. They want it faster. So the more you can stitch things together, the better. That's obviously why you have things like sensors in orchestration tools. But, you know, I think it always becomes complicated when you have different things in different places. Right? It's like, not everything can have a sensor. And if you don't have the concept of, like, a run, and maybe, you know, maybe you've got, like, 2 data sources. Right? They're landing in S3, and then you've got a dbt model that, like, builds off both of them as an external table in Snowflake. I don't know. When should that run? Should it run when 1 S3 bucket has a file in it? Should it run when the other one has a file in it? Like, if every time files land, they always land in pairs, like, what's the right window to assess them both landing in there at the same time? Right? What happens if one lands in its window and then the next one lands later?
You know? Like, this is the problem. Like, we wanna do event based scheduling across the entire data stack, but that only works where the chain of dependencies is basically linear. Or you have, like, a metadata framework. So where you say, you know, the process writing the file to S3 is gonna put all the metadata you need to work out what to do at that moment, and then it's gonna send the webhook to the next thing. And the next thing needs to trigger a process that can read that metadata and then work out what to do. And, you know, that metadata framework, we also see being very robust, especially in enterprise settings.
But, you know, it's not a question of, like, putting all the logic in the orchestrator, because this is not how data works.
[00:34:17] Tobias Macey:
The other major split in the data platform and usage of data that has been growing in recent years is the divide between the analytical and product focused use cases of batch or streaming data, and the use cases of data to power, train, fine tune, and guide different ML and AI systems. And I'm wondering how you're seeing that strain the current or previous generation of orchestration systems, and how you're thinking about how that fits into the orchestration systems that are going to be coming out over the next few years.
[00:35:49] Hugo Lu:
Yeah. What do you mean by product versus analytical use cases? What are some examples of that?
[00:35:59] Tobias Macey:
So, for example, analytical use cases being typical business intelligence or even reverse ETL; product use cases being, I have some piece of data that gets fed into a table that is either an embedded analytics dashboard for a customer or data that gets fed into a recommendation engine, things like that.
[00:36:15] Hugo Lu:
Yeah. No. I'm with you. I mean, look, man. It's a real spectrum. Like, I think the embedded dashboard for a customer thing, like, I was gonna say typically the use case isn't real time, but often it is. You know, we see people leveraging, like, modern analytical warehouses fairly well there, but having a really tough time if they don't have an orchestrator, because the data often fails and then it's out of date, and then their customers come to them and they say, well, you know, this is terrible. I don't know what's going on.
So there's definitely an issue there. And I think, you know, the product need drives a lot of the requirements for how robust a system needs to be. And, ideally, you will, you know, centralize that data so that you can have an event based system that is essentially managed by software engineers. Right? So, you know, think about, like, you know, maybe you've got, like, an app. Right? And you need to show usage to the customer because they need to know when they're gonna hit their limits. Like, you're not gonna send events for usage onto Kafka, drop it into S3, put it into Snowflake, like, aggregate it on a daily level, rolling average at 7 days, like, put it in a Power BI dashboard, and embed that into your app.
Like, you're gonna take an event. You're gonna insert a row into another Postgres table, or maybe you're gonna have a function that, like, cleans it first, and then you're gonna have your dashboard that looks at that Postgres table. Right? But the point is, it's, like, an event based system. It's not really in the data stack. Right? And I think in machine learning, this is even more the case, because for you to get it out of that software domain into something the data team is using, assuming it's something different, which it normally is, it's a lot of data when you're doing machine learning at scale. And, indeed, most ML engineers seem to want to do stuff on data that's in object storage, probably because of size. Right? It's like, you might wanna use some Spark. You might have to do Spark Streaming. Right? It's like, can you do that in a warehouse? No. So it's in object storage.
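The event-based path Hugo sketches, clean the event, insert a row, let the dashboard query the table, fits in a few lines. This is an illustrative sketch only (the event shape, function names, and table are invented; SQLite stands in for Postgres so the example is self-contained):

```python
import sqlite3

# In-memory SQLite standing in for the operational Postgres table
# that an embedded usage dashboard would query directly.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage (customer_id TEXT, units INTEGER)")

def clean(event):
    """Validate and normalize a raw usage event before insert."""
    try:
        cid = str(event["customer_id"])
        units = int(event["units"])
    except (KeyError, ValueError, TypeError):
        return None
    return (cid, units) if units >= 0 else None

def on_usage_event(event):
    """The whole 'pipeline': clean, then insert straight into the serving table."""
    row = clean(event)
    if row is not None:
        conn.execute("INSERT INTO usage VALUES (?, ?)", row)

def usage_for(customer_id):
    """The dashboard's query: how much has this customer used?"""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(units), 0) FROM usage WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return total
```

Contrast this with the Kafka-to-S3-to-Snowflake-to-Power BI route: same information, but here the latency is one insert, and nothing needs orchestrating.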
And, you know, again, there are a lot of other requirements around machine learning pipelines specifically, because some of that metadata related to, like, training, fine tuning models, like, monitoring their outputs, is so specific. And that's why there are sort of, like, machine learning specific orchestrators. Same with, like, AI. Right? There are a load of AI orchestrators. I don't even know the names of them. But, like, it just goes to show how sort of specialized it is. We're probably doing data orchestration, I guess. But I think, yeah, things becoming more and more hyper specialized is the trend.
The other trend worth mentioning, sorry, I know I'm throwing a lot out here, is, you know, the centralization of data. So you can't do everything in S3. You need an analytical warehouse to do analytical queries. That statement is less true these days because of something to do with Iceberg. So we'll see where that goes.
[00:39:22] Tobias Macey:
Yeah. On that note, I just saw today that Amazon announced S3 Tables buckets, specifically designed for improving Iceberg performance.
[00:39:34] Hugo Lu:
Yeah. And this is the cool thing. Right? It's like, say you've got a ton of product data, and it all lands in S3, and then your ML team picks it up, does some cool stuff, sends some recommendations back to the customer. But, you know, they build out some feature tables. Right? And then the data team picks it up from S3, puts it somewhere else, creates some reports. It's like, you just spent twice the amount of money you probably needed to, and now the data's in different places, and people don't know what the source of truth is. Now that's all in one place. That's potentially big. So I think that's pretty cool.
[00:40:05] Tobias Macey:
Absolutely. And another pressure that I predict, I haven't seen a lot of movement there yet, but I think that one of the ways that we're going to trend, with the pressures of AI applications, where that is getting folded more explicitly into the product arena, is that by virtue of those AI models' inputs being a core dependency of the product experience, it brings the application engineering team back around full circle to being involved in the product that is the exhaust of the data that they initiated. Where you have to have that more full circle workflow of the application engineers through to the data teams, the ML teams, the AI teams, back around to the application teams, all working in tandem with the same visibility. And I think that that's going to force more of these orchestration systems to grow beyond their current boundaries and incorporate that end to end life cycle and visibility and touch points for each of those different personas.
[00:41:17] Hugo Lu:
Yeah. No. I hadn't even thought about that. Are you sort of saying that, like, in order to effectively incorporate AI into your product, you will probably need data that's not in the product? You'll need other types of data too.
[00:41:30] Tobias Macey:
Absolutely. I mean, just think about the RAG systems that are becoming the prevalent means of bringing AI into production for the current era of generative systems.
[00:41:40] Hugo Lu:
Right. Yeah. I mean, what do you think you would need an orchestration tool to do in addition to what they already do in that respect?
[00:41:53] Tobias Macey:
I don't think it's even necessarily a growth in terms of their core functional capacity so much as it is an evolution of the way that it's being presented and integrated into the workflows of those different personas. Where application teams' interface to orchestration is largely the CI/CD pipeline of, I wrote my software. The tests passed. It got deployed. It's on QA. I tested it. Now I push it to production. Maybe there's feature flagging that gets factored in there somewhere. Versus the data team of, I've got to take all of the data from the application database, pull it out of there, put it into my warehouse, clean it up, present it, turn it into a usable asset for other things. And then you've got the ML teams of, I've got my experimentation system, my feature store. I need to have my model training pipeline. I've got my model monitoring system.
And then with generative AI, you've got the, I've got to figure out which model I'm using, maybe apply some fine tuning, get that deployed, monitor for hallucinations, guardrail issues, people trying to jailbreak it. But I also need to have all of my data inputs to the vector database to populate the RAG context and make sure that that gets updated appropriately, and manage the different generations of embedding model that I'm using to update or improve the way that the AI model gets used. All of that is getting collapsed into a single end user experience, whereas before, they were largely disparate teams working on disparate projects.
[00:43:22] Hugo Lu:
Yeah. I mean, I still think we're quite a way from that, but that is the dream. Right? If you can monitor all of those workflows from a single place, and all of your data is in the same place, and, you know, the way you're monitoring it also takes things like data quality into account, and, you know, is really, really reliable and robust, and, you know, is really well integrated with production systems that aren't the orchestrator. Right? Which, you know, like, it needs to be, for example. Right? It's like, if you've got, like, a service which serves up an AI model, right, and then your front end is just sending events being like, hey. You know, a consumer asked this question. What's the answer? Right? It's like, that thing should be able to have some understanding of metadata. But, yeah, it'll be interesting to see where sort of orchestration lands in it. Yeah. It's a tough one.
[00:44:13] Tobias Macey:
AI is also changing the directionality of data flow, where it used to be that it started in the application and then eventually made its way out to ML, and then it would start the cycle back over again. But with the interaction patterns of generative AI, that data gets fed into the AI directly. And then, also, given the memory layers that are being built out, immediately incorporated into the AI context and used back out for the end user experience, but then also fed through the typical data flow of analysis and experimentation to figure out, okay, how are end users interacting with this? How can we improve that? How does that get factored into the product life cycle?
[00:44:51] Hugo Lu:
Yeah. And, you know, it's making a lot of changes on the data side as well. I don't think you see people talking about it as much, because it would be a sort of lagging indicator of how much AI stuff people are doing. But, you know, in the example you just gave, it's like, okay, let's say you've got an AI product and I'm having a conversation with it. Every single message I write is a data point. What's happening to that? Like, do I just have loads of event data that's landing in S3 that's just text? It's like, maybe. But then do you have something which is cleaning that data and structuring it before you put it into the analytical layer, right, before you write it to Iceberg, for example? You could do. Then maybe that's another service you build. Maybe that's a service you buy. But it's more complexity. Right? More things to integrate, which is why I think orchestration is so exciting. Because, you know, it's an area where we see a lot of data teams wanna move fast and not have to spend all this time building all these connections to all these things. So by sort of giving people those managed connections, in the same way that, like, a Fivetran means you don't have to learn the Salesforce API, we're trying to do the same thing for data teams so that they can, like, go a bit faster.
[00:45:58] Tobias Macey:
I think, too, that with the ability for AI to work across all of these boundaries, it's going to be increasingly incorporated into the data flow management arena, more so than it already is. And I think that there's going to be a certain amount of trust building that has to happen before people feel confident actually delegating any core capability to an AI model. But I think that, in that earlier point of collapsing the stack of personas and bringing it more full circle, I imagine that that conversational interface will probably be the unifying factor that brings all of those different teams into the same workflow and onto the same page.
[00:46:40] Hugo Lu:
Yeah. How do you mean?
[00:46:43] Tobias Macey:
Well, I imagine that because of the fact that they're all used to working with data in different ways, if you can layer on a conversational aspect to it that speaks to them in their own language, then it reduces the tooling complexity of, oh, I have to build 5 different UIs to suit these 5 different personas. Instead, I have my interface, and there's just the conversational aspect where you can ask questions and get insights about the data, how it's flowing, what you need to do next, type of a thing. Or direct the orchestration engine to do the things that you want it to do without having to learn all the intricacies of its peculiarities or the different functions that it wants.
[00:47:21] Hugo Lu:
You're talking about, like, an AI layer for data product managers all the way through to, like, machine learning engineers, that helps to build and monitor and recover data pipelines?
[00:47:35] Tobias Macey:
Yes. Yeah. I think we're probably 5 to 10 years away from that today, but yeah.
[00:47:44] Hugo Lu:
No. It's cool. And, you know, you see elements of this today. So when people spin up, like, you know, like, a new microservice, there are some pretty sophisticated data teams that will say, okay. Well, to spin this up, all you need to do is write a few lines of YAML. But then what the YAML does is it automatically creates the orchestration pipeline. It automatically creates the dbt model. So, also, you know, it basically just provisions all the resources automatically. So then if you say, well, you know, we can actually have a menu of things we can create, right, and then here's all the data on how we create it, and you feed that to a model, and then, yeah, put the AI on top, then there's no reason we can't do it. But put it this way. When I saw articles a year ago saying that AI was gonna automate away data engineers' jobs, I thought about what you just said, and then I realized how hard it was. And then I had confidence that trying to build a unified control plane that isn't powered by AI was not gonna be a colossal waste of time.
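The "few lines of YAML expand into a pipeline" pattern can be sketched roughly like this. Everything below is hypothetical (the spec fields, the `provision` function, and the resource names are invented for illustration; no real platform exposes this exact shape). The spec is shown as a dict, as if already parsed from YAML:

```python
# A tiny declarative spec, as if parsed from the YAML a service team writes.
spec = {
    "service": "payments",
    "source_table": "payments.events",
    "schedule": "*/15 * * * *",   # every 15 minutes
    "owners": ["finance-team"],
}

def provision(spec):
    """Expand the small spec into the resources the platform would create:
    an orchestration pipeline, a dbt staging model stub, and alerting."""
    name = spec["service"]
    return {
        "pipeline": {
            "name": f"{name}_elt",
            "schedule": spec["schedule"],
            "tasks": ["ingest", f"dbt run --select stg_{name}", "test"],
        },
        "dbt_model": f"models/staging/stg_{name}.sql",
        "alerts": {"notify": spec["owners"], "on": "failure"},
    }
```

The point Hugo is making follows directly: once provisioning is a pure function of a small spec like this, you have a "menu" that an AI (or anyone) could fill in, because all the hard decisions are encoded in `provision`, not in the caller.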
[00:48:40] Tobias Macey:
Oh, absolutely. I don't think we'll ever completely cede control to the AIs. I think it will largely be a discovery interface and not necessarily a tell the AI to do the thing and then trust that the thing got done right. Yeah.
[00:48:55] Hugo Lu:
Yeah. Although you do raise an interesting point, especially in the context of, like, metadata frameworks, where, you know, like, processes will write data that says, okay, I just ingested all these tables. Like, I put the IDs over there. Hey, thing that's gonna move the tables, like, go fetch the IDs and move the tables. Right? It's like, you could potentially tag messages with the services and their descriptions and, like, their endpoints, and then just hit an AI in the middle instead of a database. I mean, it would be an AI on top of a DB. Right? But, yeah. I don't know. You would probably want to define that logic explicitly. We'll see. We'll see.
[00:49:34] Tobias Macey:
So on that point, in your experience of working in this space, working with data teams of various sizes and compositions and areas of focus, what are some of the most interesting or innovative or unexpected ways that you've seen data orchestration implemented or the ways that it has impacted the overall platform architecture?
[00:49:54] Hugo Lu:
Oh, good question. I mean, at the opposite end of the spectrum, right, the BI use case is very standard. It's very boring, but boring works. There are some pretty interesting ways people use it in terms of, like, provisioning and incident handling. So because you can sort of run any scripts in any places, including, like, your data warehouse, you can sort of build event based flows that automatically help you do things like access control. Obviously, then you need your orchestration plane to itself have very good access control. Fortunately, Orchestra does. But, you know, that's one kind of rogue way people are doing it. Another way is just, like, using the orchestrator to get visibility. And, you know, this is something I feel like we're trying to pioneer as well. It's like, with a Datadog, you can send it data. Right? And then it shows you what's going on, and it sends you alerts. But Datadog doesn't do anything unless you send it the data. Otherwise, it knows nothing. Whereas we have this lovely data model for your metadata.
If you've got, like, event based pipelines or pipelines that are happening elsewhere, you can still send us the data. So similar to, like, a DataHub, if you like. That's pretty cool, because it's, like, an expansion of what engineers think the orchestrator should be doing. Right? You're turning it into genuinely a place where you can say, okay. Here, I can see everything that's going on, and I can control everything. I can rerun things. I can notify people. I can, like, trigger workflows that are operational. That's pretty cool. So, yeah, I guess stuff like governance, automating governance, access control, and also getting full visibility instead of relying on big clunky expensive things like Datadog.
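The "pipelines running elsewhere can still send us their metadata" pattern usually amounts to emitting small structured run events. Here's a hedged sketch of what such an emitter might build; the endpoint, payload shape, and field names are all invented for illustration, since each product defines its own ingestion API:

```python
import json
import time

def run_event(pipeline, status, rows=None):
    """Build the JSON body an external pipeline would POST to a
    control plane or observability tool to report a run."""
    assert status in {"running", "succeeded", "failed"}
    payload = {
        "pipeline": pipeline,
        "status": status,
        "emitted_at": int(time.time()),
        "metrics": {} if rows is None else {"rows_processed": rows},
    }
    # A real emitter would then POST this body (e.g. with urllib.request)
    # to whatever ingestion endpoint the control plane exposes.
    return json.dumps(payload)
```

The interesting design question is the one Hugo raises: the tool receiving these events can either just display them (the Datadog model) or also act on them, rerunning, alerting, triggering downstream work, which is what turns a metadata sink into a control plane.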
[00:51:42] Tobias Macey:
And in your experience of working in this space, building an orchestration engine, and trying to fathom the different ways that data is being relied upon and used, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:00] Hugo Lu:
Mate, there are too many. Like, it's just, you can cut everything in so many ways. Right? People have different tools. People have different teams. People have different latency requirements, and people just have different personas and experiences they wanna have depending on the organization. Right? You could be a tiny startup with, like, 2 developers and still need, like, multiple environments. You know, everything Git controlled, everything sort of, like, asset aware. You could be a sort of, you know, 10,000 person logistics company that is sort of foraying into building their first data products, right, which needs good orchestration, but also, like, a bit of visibility of what's happening on the event pipeline.
And then also a way to, like, you know, enable and monitor self serve, because you've got, you know, 10 global divisions. Right? Even though all you're fundamentally doing is building a relatively simple, you know, ELT pipeline. I think one area which I definitely didn't appreciate as much as I do now is the need for, like, security in where things are hosted. I've learned what colocation and, like, what an Azure Private Link and a self hosted instance actually mean. And it's, like, nuts, because for anyone that doesn't know, right, if you've built a software product, right, and you run it in the cloud, you run it on AWS in London, and then you have a company in California that says, hey. We're on Azure. We need Private Link to Azure. Can you support that? What you then have to basically do is write your app using Azure services and make sure it can be hosted and provisioned in basically the same building that all their stuff is in. That's really hard to do, mate.
[00:53:41] Tobias Macey:
And for people who are tasked with building a data platform, managing its health and longevity, what are the cases where a data specific orchestrator is the wrong choice?
[00:53:53] Hugo Lu:
Oh, the wrong choice. Good question. If you have fully streaming use cases, don't get an orchestration tool. You should be streaming that stuff. Apart from that, I mean, if you're gonna do batch stuff, you should probably have something. Like, if your flows are really simple and if they're linear, I would probably just monitor it, like, have really good logging, and have different services talk to each other. Orchestration is probably overkill. And, oh, here's a good one. So if you're a huge, huge company and you have very difficult SLA requirements, you might wanna choose something like a Palantir.
Right? In this case, you're buying the platform. It's like, don't build it. Buy the thing. Other than that, I think you're always gonna need one. Right? I mean, and final point. In terms of buying a data platform, right, historically, this was basically the same thing as, like, you know, maybe having a warehouse, but it's on premise. So it's like, do we get Oracle? Do we get SQL Server? I know, that's the age old question. Now the discussion is, well, do we get BigQuery? Do we get Snowflake? Do we get Databricks? Caveat is, none of those are a data platform. Databricks is getting very close to having everything, but not quite. And even with people on Databricks, most of those organizations still also use Snowflake. I think it's at something like 40 or 50%, like, they share customers. So you're still gonna have to get visibility of everything.
So how do you do that? So that's, yeah, that's another reason that I think building something which connects to different parts of the stack is a good bet. Because it's not getting any simpler. Well, it is getting a bit simpler, but there are still a lot of tools out there.
[00:55:35] Tobias Macey:
And as you continue to build and invest in this ecosystem, what are some of your predictions and/or hopes for the future of data orchestration?
[00:55:40] Hugo Lu:
I don't know about orchestration, but I think, generally, it'd be good to see data teams stop being viewed as a call center. I predict that data teams will realize that even for basic BI use cases, the level of, essentially, the SLA of the data needs to be a lot higher than we think it is. Much more similar to a software system, if anything, like, higher. Because at the end of the day, people are really fickle when they see data that they don't trust, and it's really easy for them to lose that trust. I don't think we generally make things to a sufficiently high standard. Like, definitely consolidation, right, in the orchestration plane. Like, you see this with, you know, lots of companies like dbt, Orchestra, Dagster. We're all sort of trying to grasp everything at the top. So, like, not being a warehouse, not being an ingestion tool, not being a dashboarding tool. And, yeah, the other one, of course, as you mentioned today, is, like, Iceberg. Right? It'd be cool to see if people can move things together. But at the end of the day, if you still see the data team as a cost center, and you're not driving value from it, then it's a bit of a defensive exercise to move stuff to Iceberg and slash your costs and reduce your security footprint. Right? It's like, that's not why we got hired. Like, we got hired to make companies grow.
So, yeah, they're my main ones. What are yours?
[00:56:58] Tobias Macey:
I think the main one is what we discussed earlier of AI being a motivating factor to push all of those teams closer together and into tighter collaboration and cooperation, with the orchestration engine being that focal point of interaction. Nice. 10 year plan. Are there any other aspects of data platforms, their architecture, or the role of the orchestrator that we didn't discuss yet that you'd like to cover before we close out the show? I think we're all good, to be honest. I think it'll be really interesting to see how people start automating
[00:57:33] Hugo Lu:
things and simplifying things even more and, like, what that does to both the users of the data and the users of the data platform. I've always thought that they should sort of kinda be the same people. Right? It's like, there's nothing better than a power user that can self serve, but it's not getting any easier to architect a data platform. So people that know how to do that are just getting more and more specialized and better and better paid. So we'll see what happens. As a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I'd say the answer is not orchestration, because you've got plenty of those. I think it is around effective governance and prioritization. I don't know if it's a tool. I don't know if it's a process. But at the end of the day, a lot of dashboards don't get used. A lot of the work that data engineers do, we feel like it goes down the drain. Anything we can do to say, well, I care about these 10 things, and then realize, well, actually, it's only 5, that would be game changing. Absolutely. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience in the arena of data platform design and orchestration
[00:58:38] Tobias Macey:
and the ways that that impacts what people are able to get done with their data. It's definitely a very interesting and important problem space. So I appreciate the time and energy that you're investing in that, and I hope you enjoy the rest of your day. You too, sir. It's really good to be here.
[00:58:57] Hugo Lu:
Cheers, mate.
[00:58:59] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Your host is Tobias Macey, and today I'm interviewing Hugo Lu about the data platform and orchestration ecosystem and how to navigate the available options. So, Hugo, can you start by introducing yourself?
[00:00:28] Hugo Lu:
Of course. Great to be here, Tobias. So I'm Hugo Lu. I'm the CEO and cofounder of Orchestra, which is a unified control plane for data. Prior to this, you know, I'm sort of one of those people that fell into data by chance. My first stint was in investment banking, and then I moved into strategy at a company called JUUL, picked up data, and it's kind of been history ever since. So, yeah, thanks for having me. Looking forward to this.
[00:00:54] Tobias Macey:
And you mentioned that you founded Orchestra, which is a company focused on orchestration. We're not going to spend a lot of time focused on what you're building specifically, but generally on orchestration and how it impacts the rest of the choices that you make about how you work with data. And I'm wondering if you can just start by giving your definition of what constitutes an orchestrator and orchestration in that data context.
[00:01:19] Hugo Lu:
Sure. I think it's really interesting when you try to build a data platform. Right? Because you think about where you wanna put your data and what you wanna use to, you know, change stuff in it, so like a compute engine. But fundamentally, if you don't have something triggering something, then nothing is ever gonna happen. So that's sort of where I see an orchestration tool coming in. I would just define it as a way to schedule, trigger, and monitor things. So, nice and short.
[00:01:51] Tobias Macey:
And orchestration as a practice and as a principle is something that has existed since well before computing, but has been translated into the computing environment in various forms. Maybe the most notable and most long lived one being cron: I want this thing to happen at this interval. And everybody, well, many people have outgrown it, and many people still use it for various use cases. But other aspects of orchestration are things like CI/CD pipelines, where you wanna make sure that your software builds get through and are tested, etcetera. Orchestration and scheduling are also generally linked, and then you start getting into things like Kubernetes and its internal scheduling system, which orchestrates all of the different moving pieces, which has then led to outgrowths of things like Argo CD, which has also made forays into the data space.
And I'm wondering if you could just talk to some of the ways that that idea of scheduling and orchestration has been kind of conflated and jammed into various shapes and places and how the specifics of the ways that the orchestration is managed and executed and scheduled influences the ways that it actually works within a given use case.
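The cron model mentioned above, "I want this thing to happen at this interval," reduces to computing the next fire times from a fixed interval. A minimal illustrative sketch in Python (the function name and interface are invented for illustration, not any real scheduler's API):

```python
from datetime import datetime, timedelta

def next_runs(start: datetime, interval: timedelta, count: int) -> list[datetime]:
    """Return the next `count` fire times after `start` for a
    fixed-interval schedule -- the simplest case cron handles."""
    return [start + interval * i for i in range(1, count + 1)]

# An hourly job anchored at midnight fires at 01:00, 02:00, 03:00.
runs = next_runs(datetime(2024, 1, 1), timedelta(hours=1), 3)
```

Real cron is richer (its five-field expressions encode calendars, not just intervals), but this captures the core contract that data orchestrators later layered dependencies on top of.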
[00:03:11] Hugo Lu:
Yes. Absolutely. A lot to unpack there, but I think you kind of hit the nail on the head. Right? The process of having, you know, I wanna complete this task, and then I've got multiple dependencies, and then I wanna do those things, and then there are multiple dependencies after that, is a practice that is as old as computing. And I think, you know, if you speak to anyone on the software side, orchestration is not a thing. Right? If you need to execute, like, a series of tasks in, like, a directed acyclic graph or a DAG, that sort of functionality is built into a lot of things that have names. So, you know, you mentioned Kubernetes just as an example.
You know, it's a great example. Right? There are a lot of dependencies and processes that need to happen within a Kubernetes cluster, and, obviously, it's got a scheduler too. I think the reason it's got its own sort of area in data is probably because a lot of the, like, processes we have are split into different areas. Right? So if anyone's ever built a data ingestion system, that has to have an orchestration component too, because maybe you need to, you know, trigger parallel fetches of data, put it into a staging area, run quality checks, you know, move it somewhere, change the format, and then push it to a final destination. Right? That's not just gonna be handled in one big script.
But the fact that we have these, you know, things that do data ingestion, things that transform data, maybe things that do, you know, transformation and then ingestion and maybe a little bit more, means that there's a need to, like, monitor lots of different things at different places. So as a result, a lot of the engineering time that, you know, data teams spend is saying, okay. Well, I've got all these components. How do I stitch this system together? And, you know, the word that prevails here is orchestration. Right? You stitch it together with an orchestration tool. So, yeah, I think that's more or less where it fits in.
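The pattern described here, a task with multiple dependencies followed by further dependent tasks, is exactly a DAG traversal, which Python's standard library can express directly. A toy sketch of what any orchestrator does at its core (the ingestion task names are made up for illustration):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# An illustrative ingestion DAG: two parallel fetches, then staging,
# a quality check, and a final load. Each key lists its dependencies.
dag = {
    "fetch_orders": [],
    "fetch_customers": [],
    "stage": ["fetch_orders", "fetch_customers"],
    "quality_check": ["stage"],
    "load_warehouse": ["quality_check"],
}

# static_order yields tasks so every dependency runs before its dependents.
order = list(TopologicalSorter(dag).static_order())
```

A real engine would execute the two fetches concurrently and retry failures, but the dependency-resolution step is the same topological sort.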
[00:05:21] Tobias Macey:
As you noted, orchestration is something that finds its way into almost every piece of software in some fashion, which leads to a lot of complexity and confusion as you're selecting which piece of the stack is going to own which pieces of sequencing and the overall control. And if you do allow all of those different pieces to delegate a certain layer of orchestration, then you end up in the situation of having to stitch back together the view of what all those pieces are, how they are happening, and when, versus having a centralized orchestration engine that says, I'm going to take control over all of these things; you don't do anything by yourself unless I tell you to. And, obviously, those two extremes have a big impact on the overall architecture of the data platform.
And I'm wondering if you can talk to some of the ways that you've seen those gradations take shape as people build their data systems and their data workflows and how they try to make sense of how data is moving through their organization.
[00:06:26] Hugo Lu:
Yeah. Definitely. I think a helpful lens here is attacking it from, like, a maturity standpoint. Right? So, you know, many people that are trying to build a data platform have started from day 1. Right? And, you know, on day 1, you might not have loads of people relying on loads of reports. So you maybe have a couple of scripts that are getting and cleaning some data. They're storing it somewhere. Maybe you do, you know, a little bit of cleaning, and then, you know, you're kinda done. Right? People will have a dashboard that's directly querying it, or maybe people will just go and get that data and do some fun stuff with it, download it to Excel. But the orchestration is not complicated here. Right? You can sort of move stuff and then have something else triggered when it needs to be.
Obviously, as you grow, that gets more complicated. Right? What happens if you have a big dataset and you're using something like a Power BI or a Tableau? You need to trigger an extract refresh. What happens if you have a lot of data and you need a complicated data model? Right? You might have hundreds or thousands of tables. What happens if you have 30 different sources of data that people are relying on? You can't just have 1 ingestion tool. Maybe you have multiple ingestion services. Maybe some of that's streaming. So the question then becomes, how do you stitch all of that together and get visibility while leveraging all of those components you've already got to their fullest extent?
And I think at that point, it becomes really, really difficult to have all of those different systems talking to each other. Right? It's like, in the sort of software world, you might have, you know, different services that speak to each other. Right? They send each other events. It's all choreographed. Right? You don't orchestrate most, like, many software systems. The difference here is that we're dealing with data. So, you know, if every service doesn't have access to the same data, it becomes very expensive and very slow to make that work. And as a result, it can be helpful to have a sort of control layer on top of all these different services, because you don't have this huge data dependency in software like you do in data.
[00:08:34] Tobias Macey:
One of the approaches to gaining that visibility, which is largely an artifact of how you think about where that control lies and what the motivating force for the propagation of data is, is the idea of an overarching metadata catalog that all of your different tools integrate with. It either pulls data from them or they push data to it, so that you can see, across all of the different pieces of software and technology: this is all the data that I have, this is how it moved, etcetera, etcetera. Whereas different orchestration engines have also tried to pull that into the core of their functionality: I am going to own everything, so I will be the repository of metadata and give you visibility across these different layers.
And I'm curious how you've seen those philosophies play out in your experience of working in this space and working with customers.
[00:09:28] Hugo Lu:
Yeah. No. Look. I hear it. Again, like, a lot to unpack. And I think if we start with the problem people are trying to solve: a lot of the time, there's a data team that is scaling or at scale. The consumers of data, particularly, like, if you're doing BI, really struggle to get trust in it. Right? It's like, you're leading a data platform. You've got 15 hardcore engineers. But at the end of the day, some of the datasets that you're building are for people in product, they're for people in marketing, they're for people in finance. Right? And they've got to come to you and say, hey, like, is this data fresh? Like, something looks a bit funky. I don't really know what's going on. And, you know, you then have this pattern, right, where on the one hand, you have this central team, or many central teams. And then on the other hand, you have the consumer.
And the consumer basically has no idea what's going on. So the solution is to say, ah, well, you know, we as the central team can give you a catalog. The catalog will show you what's going on. I will train you to use the catalog. You know, we'll pay lots of money for the catalog. We'll maintain the catalog. But this is the way that you understand what's going on. This is how you can get trust in the data. And, you know, this is a really tricky pattern to make work, because fundamentally, you have, like, a bottleneck, or many bottlenecks, who actually know what the hell is going on. So I think this is the first thing we see playing out. Right? At scale, even with a catalog, people struggle to work out what's going on, which is bad, because as a data team, your goal is to help them know what's going on so they can use data to make decisions. So that's the first thing. The second thing is that as a data team, it's a lot of effort to make that pattern work. Like, I was speaking to a fast growing technology company. They have about 1,500 employees. They're doing data mesh. Right? So they're saying, hey, we're gonna give everybody the tools they need to build their own pipelines.
And they're super highly technical. The end users are back end engineers. And even then, it's taken them almost 2 years to stand up Airflow and, like, parameterize it, in the sense of, you know, having, like, a sort of YAML based domain specific language on top that anybody can use. And on top of that, right, even after they've written all those pipelines, they have to write yet more code to keep their catalog up to date. And it's taken them, you know, 6 or 7 platform engineers a year and a half, and only back end engineers can use what they've built. Right? They haven't even started on marketing or finance yet. I asked the lead engineer, I was like, how many Airflow instances have you got? He said, oh, I've lost count at this point. You know? Like, you can do this pattern, and it just takes an enormous amount of effort and resource. And, you know, if you've not done it before, I would say there's quite a high chance of failure. Right? So, you know, I think that's the second component. It takes a lot of investment to, you know, not only stitch everything together, but also surface it in a usable way. So this is part of the reason that there are some quote unquote orchestration tools that are trying to be the catalog, because, you know, the orchestration tool triggers and monitors everything. So it has all the context. It has all the metadata. Right? It's got all those juicy run IDs, which you wanna monitor over time.
So from an architectural perspective, it would make sense to kind of put a catalog there.
[00:12:48] Tobias Macey:
The challenge there, though, is that by having that be the nexus of metadata, it then forces you to use that for situations where it's maybe not the appropriate fit for owning a certain data flow just so that you can get the metadata into it versus if you have 90% of your metadata in your orchestrator and only 10% of your workflows live outside of it, you then have to add a whole other software layer just to be able to track those disparate pieces.
[00:13:19] Hugo Lu:
Right. Yeah. You hit the nail on the head. And this is the issue people find with Airflow. Right? It's completely agnostic, so you can sort of trigger and monitor any Python processes. But a single task can be, like, a print statement or a function that prints, like, hello world. It does nothing. You have to write everything yourself. And what we see is data teams spending time building pipelines to fetch metadata that is generated by their pipelines, and then building dbt models to, like, clean that metadata, and then building dashboards themselves to monitor the metadata, and then building alerting systems on the metadata, all themselves. You know, I think in some cases, it's probably, like, genuinely a doubling of work just to know what's going on, which is insane.
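The "pipelines about pipelines" work described here usually boils down to collecting task run records and computing health metrics over them. A toy sketch of the kind of metadata rollup teams end up hand-building (the record shape and the alerting threshold are invented for illustration, not any orchestrator's real schema):

```python
# Each record mimics the metadata an orchestrator emits per task run.
runs = [
    {"run_id": "r1", "task": "ingest",    "status": "success", "duration_s": 42},
    {"run_id": "r2", "task": "ingest",    "status": "failed",  "duration_s": 7},
    {"run_id": "r3", "task": "transform", "status": "success", "duration_s": 120},
]

def success_rate(records: list[dict]) -> float:
    """Fraction of runs that succeeded -- the metric behind most alerts."""
    ok = sum(1 for r in records if r["status"] == "success")
    return ok / len(records)

rate = success_rate(runs)
should_alert = rate < 0.9  # hypothetical alerting threshold
```

The point of the passage is that none of this is hard individually; it is that every team rebuilds the fetch, clean, dashboard, and alert layers from scratch.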
[00:14:08] Tobias Macey:
It's 2024. Why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and validating data, burning resources, and crushing morale. Datafold's AI powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source to target parity.
You're calling out Airflow, and its Python orientation is also another angle to the impact that orchestration systems have on the overall architectural choices of your system, because some of these orchestration systems are very much oriented to a specific language or a specific mode of interaction. And that influences the ways that you think about hiring, who works on all of these different data flows, who is able to interact with it and control it, versus other orchestration systems that are going to the other extreme of low code: take whatever language runtime you want, we're just going to let you click and drag things together, and it'll all be amazing. What have you got in mind? Nothing specific, but I think the ones that come to mind most readily are, like, the Kettle and Pentaho
[00:15:42] Hugo Lu:
of, I don't know, 10 or 15 years ago, and the Microsoft SQL Server Integration Services and things like that. Yeah. But, like, it's really interesting, right, because it's almost like we're coming full circle. Because if you take, you know, you mentioned SSIS, SQL Server Integration Services. That's another good example of a product that does what it's meant to do, but also has orchestration within it. You know, people will have seen that Snowflake recently acquired a company called Datavolo, and that's based on NiFi. Again, the same thing. It's fundamentally a low code tool, well, a no code tool, for moving, ingesting, and transforming data. But within it, you can do orchestration. But the point is, like, the problem it solves is, okay, how do I take data from these places and put it in that place in the format I want it? And it does all of that in one go. And with the advent of, like, the modern data stack and things getting more complicated and, you know, all the things that are driving us to make more complex systems, you lose out on orchestration, because you have these different components that are very good at doing one thing.
Whereas before, you just had packages that had it all. Right? It's like, you didn't think about orchestration. It's like, well, of course I can trigger things in this software. Like, how else would it work?
[00:16:57] Tobias Macey:
I think that's an interesting point too, as far as the generational shift in the ways that we're using these tools and the ways that these tools are implemented, where the early stages of ETL, orchestration, and data movement were these monolithic packages largely bolted onto some database software, and they were the place where everything got done. So it was very much a centralized monolith. And now, as we have increased the sources of data, types of data, who is consuming the data, how the data is being used, whether it's batch or streaming, etcetera, etcetera, it pushes us more into this federated approach of, we have lots of little things happening all over the place. And the orchestration systems that are designed for the current era are generally built with that in mind, of being able to have some sort of central nexus of control or visibility, but allowing for federating across multiple different execution contexts. Yes.
My experience is largely with Dagster, where, for instance, it has the Dagster web node that you can then point to multiple different running gRPC services that correspond to the actual pipeline code for different use cases. So you can have that central visibility with federated execution. And I'm just wondering how you're seeing those generational divides of orchestration and platform architectures being able to bridge that gap, or manage that dichotomy of central visibility and control versus federated execution.
[00:18:41] Hugo Lu:
Yeah. I mean, it's hard. Right? And I think a lot of the reason for it is the movement to the cloud. So, you know, we're speaking to one of the largest hospital chains in the US. Right? And they're all on Oracle. And, you know, they're doing all their data integration, all their transformation there. Oracle works super well. But the last 10 years have been different, because there's a lot of data they need that's not on premise or in Oracle anymore. So now they're saying, okay, how can we push that data into Oracle? How can we then get it out of Oracle and put it where we need to? Right? They need something that can integrate those different layers. And that's why, you know, we as Orchestra are talking to them, because we facilitate that. And the cool thing about the cloud is you can, you know, connect and build integrations to different things in the cloud, be it AWS or Snowflake or Databricks or whatever.
But, like, you know, and you can do this in Orchestra too, obviously, but, like, the example you mentioned with Dagster, you can still connect and monitor processes which are remote, so, like, on a server. Right? As long as there is some kind of Internet access, you can get visibility into that. And, you know, I think something we do which is quite unique is we take things one step further. So you know how I mentioned that in Airflow, a task can be very simple. Right? It can be a function that you write yourself. In Orchestra, a task is much larger.
You give us a few lines of YAML, so it's declarative. And not only will we handle that task, but we'll also fetch all of the metadata relating to that task. So a little bit like, you know, similarly to how Airflow will, you know, give you logs when you use, like, an SSH operator. Right? It goes into where you've got it and pulls the logs out. Like, we'll get logs, but if the underlying tool also has an orchestration engine, we'll also surface that sub-DAG and then do things like calculate lineage, which is really, really cool. Because a lot of the things we see are, you know, people running highly complex processes in different places. Right? You might have an analytics engineering team that just uses, like, a Coalesce or a dbt.
That's all they do. But then you might also have engineering, you know, data movement processes that depend on it, machine learning models, reverse ETL that depend on that on the other side. So then the question becomes, how do you get the full end to end visibility such that it's not just, like, a box A, box B, box C type thing? And that's what we're trying to do.
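The "few lines of YAML" idea points at declarative task definitions, where the platform, not the user, handles execution and metadata collection. The snippet below is purely illustrative and is NOT Orchestra's actual syntax; it only sketches what a declarative task that delegates metadata collection might look like:

```yaml
# Hypothetical declarative task definition -- not any real product's schema.
tasks:
  run_dbt_models:
    integration: dbt_cloud    # the tool that actually does the work
    job_id: 12345             # placeholder identifier
    collect_metadata: true    # platform fetches logs, sub-DAG, lineage
    depends_on: []
```

The contrast with an imperative Airflow task is that here the user declares what to trigger, and the surrounding platform owns how it runs and what telemetry comes back.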
[00:21:12] Tobias Macey:
Another reason for that generational shift too, I think, is the ownership of the process, where in the early days of data warehousing, all of the ETL, all of the business intelligence was largely owned by the IT department. So it was very much a cost center. It was something that was done because it was necessary, not because it necessarily drove its own inherent value. Data has now been moved more into the core of the product workflow. Ownership of all of those systems has largely been moved into a separate team that is generally distinct from IT, and they're more of a software product focused team, at least for people who are doing it in the, quote, unquote, modern way.
And so I think that also shifts the ways that the systems are designed and packaged and sold where when it's an IT asset, you sell it to the IT team, and they just want something big, predictable, manageable. They don't want to have to do a lot of customization to it. Whereas with data teams, they're generally working in more of the agile workflow of iterative development, iterative improvement. We want things that we can customize and tweak to suit our specific needs. And I think that that's another way that the overall architecture and platform approach to data has grown out of what it originally started from.
[00:22:43] Hugo Lu:
Yeah. Definitely. And I think, how do I put this? The use case for data is really important here. So, you know, we work with, like, many large manufacturing and logistics companies. Right? And they have sensor data for their operations. So having this sort of move through the system in a timely way is of, like, critical importance. Because if they don't do it, they can't respond to, you know, just changes in stuff that's happened that's fundamentally gonna impact their bottom line P&L. Right? It's like, if something is gonna be delayed and they have an SLA with a customer and they don't let them know, then, you know, they're gonna take a hit. Right? So in this case, data's playing a really, really key and important, like, operational function.
And in that case, right, the person who is sort of owning that product is probably someone on the operations side. They're probably not gonna be able to build out, like, you know, a relatively low latency, stable orchestration system. Right? It's like, they've got suppliers, they've got projects, they've got factories to manage. You can't expect them to build out data infrastructure as well. But in those cases, you know, it kinda makes sense that you would have someone that says, hey, look, I'm gonna make sure that this thing is delivered to you every 15 minutes or every 5 minutes, and you're gonna get alerted if it's broken, and I'm gonna be your point of contact. Right? That's where I think the sort of platform team on the one side, stakeholder on the other, that pattern works really well. The newer use case is, like, BI, right, and just, like, cloud stuff. So it's like, you know, if I'm working in marketing or I'm in finance and I wanna get a real time, you know, look at my transactions. Right?
Just because I need to do reporting and just keep a hold on stuff. Right? Has this customer paid today? They're a big customer. Like, it would be good if I could work that out, and if I had the data updated every 15 minutes, because then I can email them at 5 PM at the right time so that they actually convert instead of falling out. The engineering for these use cases is often, like, a little bit easier. And I think here, where we're really moving to is empowering people to do this end to end themselves. So, you know, increasingly, you'll see finance teams talking about how they've adopted Snowflake and it's, like, revolutionized their ability to drive insight. Right?
And that's because they will have a power user that can write SQL, that's like, yeah, I know what I'm doing. Like, I'm gonna be the guy that helps my VP of finance work out everything and automates all these processes, so we can actually start, you know, driving the business forward instead of just, like, keeping the lights on. That's why it's really interesting for me from the orchestration side, because that's, like, the final technical bit that would be really hard for them to do, that, you know, we're sort of trying to help people be able to do now.
[00:25:30] Tobias Macey:
We've been talking about the ways that your selection of orchestration tool influences the ways that you think about your overall platform architecture, but there are also many cases where you have to approach it from the reverse angle of you've already started building out your data systems. You are hitting growing pains of not being able to have that visibility, that sequencing that we've been discussing, and I'm wondering how that influences the ways that you think about what type of orchestration tool or what types of orchestration you need if you already have the data flows and you're just trying to get them under better management.
[00:26:06] Hugo Lu:
Yeah. I mean, like, let's dig into that a bit more. What type of scenarios are you thinking of? Like, what did the data team have? What growing pains are they running into? Yeah. I mean, it all depends.
[00:26:20] Tobias Macey:
Well, that's generally what any question in engineering boils down to. I think that, typically, what you would run into is the initial promise of the modern data stack: you just throw a credit card at the problem, and you'll have all of your data in your warehouse, and your BI will be amazing, because you're using Fivetran, Snowflake, dbt, and whatever the business intelligence tool of the day is. And so you say, okay, great, all of this stuff is working. But now I don't actually know when the data flows are failing, or what the quality issues are, or whether the data is up to date, or if my dbt compiled properly.
[00:26:57] Hugo Lu:
Yeah. Okay. That's a good one. So I guess the pain is, well, we threw a credit card at the modern data stack, and it's very expensive, and we're no better at making decisions with data than we were before. Yeah. I mean, look. The sort of phrase du jour is data quality, and I think, you know, that setup has its issues. So, obviously, without some sort of end to end orchestration and observability, it's gonna be really hard for you to just, you know, let the people who depend on a specific data asset or, like, dashboard know when stuff is breaking. Right? Stuff always breaks, so you need to have some kind of orchestrator in there. Right? If you don't have that, it's gonna be tricky.
And, you know, I think the key here is to get a little bit more flexibility. So it's important to basically build out the stack in a way where you can use the tools for what they're really, really good at. So running everything through dbt might not be the best idea. Right? If you've got stuff that needs to go quickly, you might wanna use, like, Delta tables in Databricks, or Snowflake tasks and dynamic tables. Right? You might have some people that wanna self serve in, like, a notebook environment instead of, like, a dashboard. You might not wanna have all of your connectors going through one tool. You might wanna start doing some streaming. Right? And then in this case, you're like, well, I'm making my stack more complex so that I can save cost, right, and get data to where it needs to go faster.
I'm splitting up my data pipelines in more and more granular ways. But now you have 6 things that you have to connect instead of 3. And before, you know, with no Airflow and stuff just running at 4 AM and then 6 AM and 8 AM, it was just about okay. Now that doesn't work. So then you're like, now I need a platform engineer to put in Airflow. And then you have this whole bottleneck problem, because then anytime anyone says, hey, I'm not sure what's going on, or, hey, can I change the schedule for this? It results in a big old long ticket, and then you've got a data platform manager talking to a head of marketing and, you know, they're butting heads.
So, you know, I think, like, in this case, right, Orchestra is a pretty good solution, or indeed, like, any orchestration platform that is easy to use and that also gives people good visibility of what's going on. Like, clearly prioritizing and, like, defining the different data products you have, so essentially just, like, grouping pipelines and grouping things, is also very helpful. Because then instead of saying, oh, like, for me to work out what's going on, go ahead and inspect this 1,000-node DAG, you're just saying, yeah, sure. Here's the pipeline for your invoices data product. Here's how it's doing. Here's the data quality. You can make decisions on this. It's okay.
But I think something else people find, right, as a sort of scenario 2, is: we have flexibility. We have a really good platform team. We have an orchestration framework in place that we manage ourselves. We have Airflow, say, but it's a big monorepo. There's loads of stuff going on, and we're just spending way too much time managing it. Right? Like, stuff takes too long. Stuff that should take an hour takes 2 hours. Like, the cluster keeps going down. And to boot, we also probably have quite a lot of data quality issues that we don't control. So, you know, we spoke to a health tech company over here in the UK earlier, and what they're doing is really cool, actually. They're shifting some of that left.
So they're taking the staging models that their software teams give them in dbt and asking the different teams to manage those themselves. So the central data team, it's kinda like cheating, but, like, they're basically just doing less stuff. Right? But then you have this other problem. Right? You've got 70 repos of dbt code all whirring away, and then, you know, they're building, like, central data models, or, like, the clean data or whatever. And then you've got the central data models and the marts happening afterwards. How do you keep visibility of all of that?
And, you know, you can still do it with Airflow. Right? Just have 8 different Airflow instances and stitch all the Airflows up to each other, but then you probably have to get something on top of that to monitor them. You know, it's like a who will guard the guardians type thing. And you have that with, like, pretty much any orchestrator. So that's why we pitch ourselves as, like, a control plane. That's why dbt Cloud has this concept of, like, dbt mesh. Because, you know, you realize having everything in one place is a lot, so you need to move stuff to other teams. But then, again, you have the complexity issue of how do you monitor things in different places. But, yeah, there are a couple of scenarios where we see people running into problems, and that's how we see them solving it.
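The "who will guard the guardians" problem of watching several orchestrator instances from one place can be sketched as a small status roll-up. This is a toy illustration, not Orchestra's or Airflow's actual API; all names here are invented:

```python
from dataclasses import dataclass

@dataclass
class Run:
    instance: str  # which orchestrator instance reported the run
    product: str   # the data product the pipeline belongs to
    status: str    # "success", "running", or "failed"

def rollup(runs):
    """Collapse runs reported by many orchestrator instances into a
    single worst-case status per data product (failed > running > success)."""
    severity = {"success": 0, "running": 1, "failed": 2}
    statuses = {}
    for run in runs:
        current = statuses.get(run.product, "success")
        if severity[run.status] >= severity[current]:
            statuses[run.product] = run.status
        else:
            statuses[run.product] = current
    return statuses
```

Grouping by product rather than by DAG is what turns a thousand-node graph into a handful of statuses a stakeholder can actually act on.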
[00:31:49] Tobias Macey:
Another aspect of orchestrating data flows, particularly when you're not dealing specifically with streaming data, is the idea of: do you do time based triggering, or do you do everything as event based, where you're reacting to state changes in the system, and where sometimes that state change is the wall clock ticking over to a certain point? And I'm wondering how you see those trends moving in the overall data ecosystem, of people's appetite for, I want things to happen on a predictable schedule, versus, I want things to happen as soon as possible whenever a given event takes place.
[00:32:29] Hugo Lu:
Yeah. I mean, I think there's definitely a trend towards the latter. Right? Like, people want more data. They want it faster. So the more you can stitch things together, the better. That's obviously why you have things like sensors in orchestration tools. But, you know, I think it always becomes complicated when you have different things in different places. Right? It's like, not everything can have a sensor. And if you don't have the concept of, like, a run... maybe you've got, like, 2 data sources. Right? They're landing in S3, and then you've got a dbt model that, like, builds off both of them as external tables in Snowflake. I don't know. When should that run? Should it run when one S3 bucket has a file in it? Should it run when the other one has a file in it? Like, if every time files land, they always land in pairs, what's the right window to assess them both landing in there at the same time? Right? What happens if one lands in its window and then the next one lands later?
You know? Like, this is the problem. Like, we wanna do event based scheduling across the entire data stack, but that only works where the chain of dependencies is basically linear. Or you have, like, a metadata framework, where you say, you know, the process writing the file to S3 is gonna put all the metadata you need to work out what to do at that moment, and then it's gonna send the webhook to the next thing. The next thing then needs to trigger a process that can read that metadata and work out what to do. And, you know, that metadata framework, we also see being very robust, especially in enterprise settings.
But, you know, it's not a question of, like, putting all the logic in the orchestrator, because that's not how data works.
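The paired-files question above can be made concrete with a small trigger check: fire the downstream dbt run only once every expected file has landed, and only if the arrivals fall inside one tolerance window. A hypothetical helper, not any orchestrator's built-in sensor:

```python
from datetime import datetime, timedelta

def should_trigger(arrivals, required, window=timedelta(minutes=30)):
    """arrivals maps an expected S3 key to the time it landed.
    Return True only when every required key has arrived and the
    earliest and latest arrivals are within the tolerance window."""
    if not all(key in arrivals for key in required):
        return False  # still waiting on at least one file
    times = [arrivals[key] for key in required]
    return max(times) - min(times) <= window
```

The awkward case Hugo raises, one file landing in its window and its partner landing much later, shows up here as a result that stays False unless you add a retry or late-arrival policy on top.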
[00:34:17] Tobias Macey:
The other major split in the data platform and usage of data that has been growing in recent years is the divide between the analytical and product focused use cases of batch or streaming data, and the use cases of data to power, train, fine tune, and guide different ML and AI systems. And I'm wondering how you're seeing that strain the current or previous generation of orchestration systems, and how you're thinking about how that fits into the orchestration systems that are going to be coming out over the next few years.
[00:35:49] Hugo Lu:
Yeah. What do you mean by product versus analytical use cases? What are some examples of that?
[00:35:59] Tobias Macey:
So, for example, analytical use cases being typical business intelligence or even reverse ETL; product use cases being, I have some piece of data that gets fed into a table that is either an embedded analytics dashboard for a customer or data that gets fed into a recommendation engine, things like that.
[00:36:15] Hugo Lu:
Yeah. No. I'm with you. I mean, look, man. It's a real spectrum. Like, with the embedded dashboard for a customer thing, I was gonna say typically the use case isn't real time, but often it is. You know, we see people leveraging, like, modern analytical warehouses fairly well there, but having a really tough time if they don't have an orchestrator, because the data often fails and then it's out of date, and then their customers come to them and say, well, you know, this is terrible. I don't know what's going on.
So there's definitely an issue there. And I think, you know, the product need drives a lot of the requirements for how robust a system needs to be. And, ideally, you will, you know, centralize that data so that you can have an event based system that is essentially managed by software engineers. Right? So, you know, think about, like, maybe you've got an app. Right? And you need to show usage to the customer because they need to know when they're gonna hit their limits. Like, you're not gonna send events for usage onto Kafka, drop them into S3, put them into Snowflake, like, aggregate them on a daily level with a rolling 7 day average, like, put that in a Power BI dashboard and embed it into your app.
Like, you're gonna take an event. You're gonna insert a row into another Postgres table, or maybe you're gonna have a function that, like, cleans it first, and then you're gonna have your dashboard that looks at that Postgres table. Right? But the point is, it's, like, an event based system. It's not really in the data stack. Right? And I think in machine learning, this is even more the case, because for you to get it out of that software domain into something the data team is using, assuming it's something different, which it normally is, it's a lot of data when you're doing machine learning at scale. And, indeed, most ML engineers seem to want to do stuff on data that's in object storage, probably because of size. Right? It's like, you might wanna use some Spark. You might have to do Spark streaming. Right? It's like, can you do that in a warehouse? No. So it's in object storage.
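The insert-a-row pattern Hugo contrasts with the full Kafka-to-warehouse hop can be sketched in a few lines, with sqlite3 standing in for Postgres (the table and column names are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE usage_events (user_id TEXT, units INTEGER)")

def record_event(user_id, units):
    # The optional cleaning function Hugo mentions: reject
    # obviously bad events before they ever reach the table.
    if units <= 0:
        return
    conn.execute("INSERT INTO usage_events VALUES (?, ?)", (user_id, units))

def usage_so_far(user_id):
    # What the embedded dashboard queries directly, with no
    # Kafka, S3, or warehouse hop in between.
    row = conn.execute(
        "SELECT COALESCE(SUM(units), 0) FROM usage_events WHERE user_id = ?",
        (user_id,),
    ).fetchone()
    return row[0]
```

The whole path stays inside the application's own event based system, which is exactly why it never shows up in the data stack.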
And, you know, again, there are a lot of other requirements around machine learning pipelines specifically, because some of that metadata related to, like, training and fine tuning models, like, monitoring their outputs, is so specific. And that's why there are sort of, like, machine learning specific orchestrators, same with, like, AI. Right? There are a load of AI orchestrators. I don't even know the names of them. But, like, it just goes to show how sort of specialized it is. We're probably doing data orchestration, I guess. But I think, yeah, things becoming more and more hyper specialized is the trend.
The other trend worth mentioning, sorry, I know I've been waffling on a lot here, is, you know, the centralization of data. So you can't do everything in S3. You need an analytical warehouse to do analytical queries. That statement is less true these days because of, well, Apache Iceberg. So we'll see where that goes.
[00:39:22] Tobias Macey:
Yeah. On that note, I just saw today that Amazon announced S3 Tables, buckets specifically designed for improving their Iceberg performance.
[00:39:34] Hugo Lu:
Yeah. And this is the cool thing. Right? It's like, say you've got a ton of product data, and it all lands in S3, and then your ML team picks it up, does some cool stuff, sends some recommendations back to the customer. But, you know, they build out some feature tables. Right? And then the data team picks it up from S3, puts it somewhere else, creates some reports. It's like, you just spent twice the amount of money you probably needed to, and now the data's in different places, and people don't know what the source of truth is. Now that can all be in one place. That's potentially big. So I think that's pretty cool.
[00:40:05] Tobias Macey:
Absolutely. And another pressure that I predict, I haven't seen a lot of movement there yet, but I think one of the ways that we're going to trend with the pressures of AI applications, where that is getting folded more explicitly into the product arena, is that by virtue of those AI models' inputs being a core dependency of the product experience, it brings the application engineering team back around full circle to being involved in the product that is the exhaust of the data that they initiated. You have to have that more full circle workflow of the application engineers through to the data teams, the ML teams, the AI teams, and back around to the application teams, all working in tandem with the same visibility. And I think that that's going to force more of these orchestration systems to grow beyond their current boundaries and incorporate that end to end life cycle, with visibility and touch points for each of those different personas.
[00:41:17] Hugo Lu:
Yeah. No. I hadn't even thought about that. Are you sort of saying that, like, in order to effectively incorporate AI into your product, you will probably need data that's not in the product? You'll need other types of data too.
[00:41:30] Tobias Macey:
Absolutely. I mean, just think about the RAG systems that are becoming the prevalent means of bringing AI into production for the current era of generative systems.
[00:41:40] Hugo Lu:
Right. Yeah. I mean, like, what would you need an orchestration tool to do in addition to what they already do, in respect of that?
[00:41:53] Tobias Macey:
I don't think it's even necessarily a growth in terms of their core functional capacity so much as it is an evolution of the way that it's being presented and integrated into the workflows of those different personas. Application teams' interface to orchestration is largely the CI/CD pipeline: I wrote my software. The tests passed. It got deployed. It's on QA. I tested it. Now I push it to production. Maybe there's feature flagging that gets factored in there somewhere. Versus the data team: I've gotta take all of the data from the application database, pull it out of there, put it into my warehouse, clean it up, present it, turn it into a usable asset for other things. And then you've got the ML teams: I've got my experimentation system, my feature store. I need to have my model training pipeline. I've got my model monitoring system.
And then with generative AI, you've got: I've got to figure out which model I'm using, maybe apply some fine tuning, get that deployed, monitor for hallucinations, guardrail issues, people trying to jailbreak it. But I also need to have all of my data inputs to the vector database to populate the RAG context and make sure that that gets updated appropriately, and manage the different generations of embedding model that I'm using to update or improve the way that the AI model gets used. All of that is getting collapsed into a single end user experience, whereas before, they were largely disparate teams working on disparate projects.
[00:43:22] Hugo Lu:
Yeah. I mean, I still think we're quite a way from that, but that is the dream. Right? If you can monitor all of those workflows from a single place, and all of your data is in the same place, and, you know, the way you're monitoring it also takes things like data quality into account, and, you know, is really, really reliable and robust, and is really well integrated with production systems that aren't the orchestrator. Right? Which, you know, like, it needs to be, for example. Right? It's like, if you've got, like, a service which serves up an AI model, right, and then your front end is just sending events being like, hey, a consumer asked this question, what's the answer? Right? It's like, that thing should be able to have some understanding of metadata. But, yeah, it'll be interesting to see where sort of orchestration lands in it. Yeah. It's a tough one.
[00:44:13] Tobias Macey:
It's also changing the directionality of data flow, where it used to be that it started in the application and then eventually made its way out to ML, and then it would start the cycle back over again. But with the interaction patterns of generative AI, that data gets fed into the AI directly. And then, given the memory layers that are being built out, it's immediately incorporated into the AI context and used back out for the end user experience, but then also fed through the typical data flow of analysis and experimentation to figure out, okay, how are end users interacting with this? How can we improve that? How does that get factored into the product life cycle?
[00:44:51] Hugo Lu:
Yeah. And, you know, it's making a lot of changes on the data side as well. I don't think you see people talking about it as much, 'cause it would be sort of a lagging indicator of how much AI stuff people are doing. But, you know, in the example you just gave, it's like, okay, let's say you've got an AI product and I'm having a conversation with it. Every single message I write is data. What's happening to that? Like, do I just have loads of event data that's landing in S3 that's just text? It's like, maybe. But then do you have something which is cleaning that data and structuring it before you put it into the analytical layer, right, before you write it to Iceberg, for example? You could do. Then maybe that's another service you build. Maybe that's a service you buy. But it's more complexity. Right? More things to integrate, which is why I think orchestration is so exciting. Because, you know, it's an area where we see a lot of data teams wanna move fast and not have to spend all this time building all these connections to all these things. So by sort of giving people those managed connections, in the same way that, like, a Fivetran means you don't have to learn the Salesforce API, we're trying to do the same thing for data teams so that they can, like, go a bit faster.
[00:45:58] Tobias Macey:
I think, too, that with the ability for AI to work across all of these boundaries, it's going to be increasingly incorporated into the data flow management arena, more so than it already is. And I think that there's going to be a certain amount of trust building that has to happen before people feel confident actually delegating any core capability to an AI model. But I think that, on that earlier point of collapsing the stack of personas and bringing it more full circle, I imagine that that conversational interface will probably be the unifying factor that brings all of those different teams into the same workflow and onto the same page.
[00:46:40] Hugo Lu:
Yeah. How do you mean?
[00:46:43] Tobias Macey:
Well, I imagine that, because of the fact that they're all used to working with data in different ways, if you can layer on a conversational aspect to it that speaks to them in their own language, then it reduces the tooling complexity of, oh, I have to build 5 different UIs to suit these 5 different personas. Instead, I have my interface, and there's just the conversational aspect where you can ask questions and get insights about the data, how it's flowing, what you need to do next, type of a thing. Or direct the orchestration engine to do the things that you want it to do without having to learn all the intricacies and peculiarities of the different functions that it wants.
[00:47:21] Hugo Lu:
You're talking about, like, an AI layer for data product managers all the way through to, like, machine learning engineers, that helps to build and monitor and recover data pipelines?
[00:47:35] Tobias Macey:
Yes. Yeah. I think we're probably 5 to 10 years away from that today, but yeah.
[00:47:44] Hugo Lu:
No. It's cool. And, you know, you see elements of this today. So when people spin up, like, you know, a new microservice, there are some pretty sophisticated data teams that will say, okay, well, to spin this up, all you need to do is write a few lines of YAML. But then what the YAML does is it automatically creates the orchestration pipeline. It automatically creates the dbt model. So, you know, it basically just provisions all the resources automatically. So then if you say, well, you know, we can actually have a menu of things we can create, right, and here's all the data on how we create it, and you feed that to a model, and then, yeah, put the AI on top, then there's no reason we can't do it. But put it this way. When I saw articles a year ago saying that AI was gonna automate away data engineers' jobs, I thought about what you just said, and then I realized how hard it was. And then I had confidence that trying to build a unified control plane that isn't powered by AI was not gonna be a colossal waste of time.
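The few-lines-of-YAML idea Hugo describes boils down to spec-driven resource generation. A minimal sketch, with the spec shown as an already-parsed dict; the field names and resource kinds are invented, not any real platform's schema:

```python
def provision(spec):
    """Expand a small service spec into the resources a platform team
    might create for it: an ingestion job, a staging dbt model, and an
    orchestration pipeline wiring the two together."""
    name = spec["service"]
    ingest, staging = f"ingest_{name}", f"stg_{name}"
    return [
        {"kind": "ingestion_job", "name": ingest, "source": spec["source"]},
        {"kind": "dbt_model", "name": staging},
        {"kind": "pipeline", "name": f"{name}_pipeline",
         "steps": [ingest, staging],
         "schedule": spec.get("schedule", "@daily")},
    ]
```

The "menu of things we can create" is then just the set of specs this function accepts, which is what would make it feedable to a model.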
[00:48:40] Tobias Macey:
Oh, absolutely. I don't think we'll ever completely cede control to the AIs. I think it will largely be a discovery interface and not necessarily a tell the AI to do the thing and then trust that the thing got done right. Yeah.
[00:48:55] Hugo Lu:
Yeah. Although you do raise an interesting point, especially in the context of, like, metadata frameworks, where, you know, like, processes will write data that says, okay, I just ingested all these tables. Like, I put the IDs over there. Hey, thing that's gonna move the tables, like, go fetch the IDs and move the tables. Right? It's like, you could potentially tag messages with the services and their descriptions and, like, their endpoints, and then just hit an AI in the middle instead of a database. I mean, it would be an AI on top of a DB. Right? But, yeah, I don't know. You would probably want to define that logic explicitly. We'll see. We'll see.
[00:49:34] Tobias Macey:
So on that point, in your experience of working in this space, working with data teams of various sizes and compositions and areas of focus, what are some of the most interesting or innovative or unexpected ways that you've seen data orchestration implemented or the ways that it has impacted the overall platform architecture?
[00:49:54] Hugo Lu:
Oh, good question. I mean, at the opposite end of the spectrum, right, the bulk of use cases are very standard. It's very boring, but boring works. There are some pretty interesting ways people use it in terms of, like, provisioning and incident handling. So because you can sort of run any scripts in any places, including, like, your data warehouse, you can sort of build event based flows that automatically help you do things like access control. Obviously, then you need your orchestration plane to itself have very good access control. Fortunately, Orchestra does. But, you know, that's one kind of rogue way people are doing it. Another way is just, like, using the orchestrator to get visibility. So, you know, this is something I feel like we're trying to pioneer as well. It's like, with a Datadog, you can send it data. Right? And then it shows you what's going on, and it sends you alerts. But Datadog doesn't do anything. You have to send it the data. Otherwise, it knows nothing. Whereas we have this lovely data model for your metadata.
If you've got, like, event based pipelines or pipelines that are happening elsewhere, you can still send us the data. So similar to, like, a DataHub, if you like. That's pretty cool, because it's, like, an expansion of what engineers think the orchestrator should be doing. Right? You're turning it into genuinely a place where you can say, okay, here I can see everything that's going on, and I can control everything. I can rerun things. I can notify people. I can, like, trigger workflows that are operational. That's pretty cool. So, yeah, I guess stuff like governance, automating governance, access control, as well as getting full visibility instead of relying on big clunky expensive things like Datadog.
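The send-us-your-metadata idea is essentially a small ingestion contract. A toy sketch; the payload fields are invented and not Orchestra's or DataHub's actual schema:

```python
import time

class ControlPlane:
    """Toy stand-in for a control plane that accepts run metadata
    from pipelines it does not itself execute."""

    def __init__(self):
        self.events = []

    def ingest(self, pipeline, status, source, ts=None):
        # External systems (a Lambda, another Airflow instance, a
        # stream consumer) push their run outcomes here.
        event = {
            "pipeline": pipeline,
            "status": status,
            "source": source,
            "ts": ts if ts is not None else time.time(),
        }
        self.events.append(event)
        return event

    def failing_pipelines(self):
        # The visibility layer: one query across every reporting
        # system, regardless of who actually ran the work.
        return sorted({e["pipeline"] for e in self.events
                       if e["status"] == "failed"})
```

The difference from a Datadog-style receiver is what sits behind this interface: an orchestrator-aware plane can rerun or notify, not just alert.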
[00:51:42] Tobias Macey:
And in your experience of working in this space, building an orchestration engine and trying to fathom the different ways that data is being relied upon and used, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:52:00] Hugo Lu:
Mate, there are too many. Like, it's just, you can cut everything in so many ways. Right? People have different tools. People have different teams. People have different latency requirements, and people just have different personas and experiences they wanna have, depending on the organization. Right? You could be a tiny startup with, like, 2 developers and still need, like, multiple environments, everything Git controlled, everything sort of, like, asset aware. You could be a sort of, you know, 10,000 person logistics company that is sort of foraying into building their first data products, right, which needs good orchestration, but also, like, a bit of visibility of what's happening on the event pipeline.
And then also a way to, like, you know, enable and monitor self serve, because you've got, you know, 10 global divisions. Right? Even though all you're fundamentally doing is building a relatively simple, you know, ELT pipeline. I think one area which I definitely didn't appreciate as much as I do now is the need for, like, security in where things are hosted. I've learned what colocation and, like, what an Azure Private Link and a self hosted instance actually mean. And it's, like, nuts. Because for anyone that doesn't know, right, if you've built a software product and you run it in the cloud, you run it on AWS in London, and then you have a company in California that says, hey, we're on Azure, we need Private Link to Azure, can you support that? What you then have to basically do is write your app using Azure services and make sure it can be hosted and provisioned in basically the same building that all that stuff is in. That's really hard to do, mate.
[00:53:41] Tobias Macey:
And for people who are tasked with building a data platform, managing its health and longevity, what are the cases where a data specific orchestrator is the wrong choice?
[00:53:53] Hugo Lu:
Oh, the wrong choice. Good question. If you have purely streaming use cases, don't get an orchestration tool. You should be streaming that stuff. Apart from that, I mean, if you're gonna do batch stuff, you should probably have something. Like, if your flows are really simple and if they're linear, I would probably just monitor it: like, have really good logging and have the different services talk to each other. Orchestration is probably overkill. And, oh, here's a good one. So if you're a huge, huge company and you have very, very difficult SLA requirements, you might wanna choose something like a Palantir.
Right? In this case, you're buying the platform. It's like, don't build it. Buy the thing. Other than that, I think you're always gonna need one. Right? I mean, and final point. In terms of buying a data platform, right, historically, this was basically the same thing as, like, you know, maybe having a warehouse, but it's on premise. So it's like, do we get Oracle? Do we get SQL Server? You know, that kind of question. Now the discussion is, well, do we get BigQuery? Do we get Snowflake? Do we get Databricks? Caveat is, none of those are a data platform. Databricks is getting very close to having everything, but not quite. And even with people on Databricks, most of those organizations still also use Snowflake. I think it's at something like 40 or 50% that they share customers. So you're still gonna have to get visibility of everything.
So how do you do that? So that's, yeah, that's another reason that I think building something which connects to different parts of the stack is a good bet. Because it's not getting any, well, it is getting a bit simpler, but there's still a lot of tools out there.
[00:55:35] Tobias Macey:
And as you continue to build and invest in this ecosystem, what are some of your predictions and/or hopes for the future of data orchestration?
[00:55:40] Hugo Lu:
I don't know about orchestration, but I think, generally, it'd be good to see data teams stop being viewed as a call center. I predict that data teams will realize that even for basic BI use cases, the level of, essentially, the SLA of the data needs to be a lot higher than we think it is, much more similar to a software system, if anything, like, higher. Because at the end of the day, people are really fickle when they see data that they don't trust, and it's really easy for them to lose that trust. I don't think we generally make things to a sufficiently high standard. Then, like, definitely consolidation, right, in the orchestration plane. Like, you see this with, you know, lots of companies like dbt, Orchestra, Dagster. We're all sort of trying to grasp everything at the top. So, like, not being a warehouse, not being an ingestion tool, not being a dashboarding tool. And, yeah, the other one, of course, as you mentioned today, is, like, Iceberg. Right? It'd be cool to see if people can move things together. But at the end of the day, if you still see the data team as a cost center, if you're not driving value from it, then it's a bit of a defensive exercise to move stuff to Iceberg and slash your costs and reduce your security footprint. Right? It's like, that's not why we got hired. Like, we got hired to make companies grow.
So, yeah, those are my main ones. What are yours?
[00:56:58] Tobias Macey:
I think the main one is what we discussed earlier, of AI being a motivating factor to push all of those teams closer together into tighter collaboration and cooperation, with the orchestration engine being that focal point of interaction.

Nice. 10 year plan.

Are there any other aspects of data platforms, their architecture, and the role of the orchestrator that we didn't discuss yet that you'd like to cover before we close out the show?
[00:57:33] Hugo Lu:
I think we're all good, to be honest. I think it'll be really interesting to see how people start automating things, and simplifying things even more, and, like, what that does to both the users of the data and the users of the data platform. I've always thought that they should sort of kinda be the same people. Right? It's like, there's nothing better than a power user that can self serve. But it's not getting any easier to architect a data platform, so the people that know how to do that are just getting more and more specialized and better and better paid. So we'll see what happens.

As a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.

The answer is not orchestration, because you've got plenty of those. I think it is around effective governance and prioritization. I don't know if it's a tool. I don't know if it's a process. But at the end of the day, a lot of dashboards don't get used. A lot of the work that data engineers do, we feel like it goes down the drain. Anything we can do to say, well, I care about these 10 things, and then say, well, actually, it's only 5, that would be game changing.

Absolutely. Alright. Well, thank you very much for taking the time today to join me and share your thoughts and experience in the arena of data platform design and orchestration
[00:58:38] Tobias Macey:
and the ways that that impacts what people are able to get done with their data. It's definitely a very interesting and important problem space, so I appreciate the time and energy that you're investing in it, and I hope you enjoy the rest of your day.
[00:58:57] Hugo Lu:
You too, sir. It's been really good to be here. Cheers, mate.
[00:58:59] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers.
Introduction to Data Engineering Podcast
Understanding Data Orchestration
Complexity in Data Systems
Challenges in Data Trust and Visibility
Generational Shifts in Data Orchestration
Growing Pains in Data Systems
Event-Based vs Time-Based Scheduling
Analytical vs Product Use Cases
AI's Impact on Data Orchestration
Innovative Uses of Data Orchestration