Summary
"Business as usual" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and introducing more specialized roles. In this episode Josh Benamram, CEO and co-founder of Databand, describes the motivations for these emerging roles, how these positions affect the team dynamics, and the types of visibility that they need into the data platform to do their jobs effectively. He also talks about how his experience working with these teams informs his work at Databand. If you are wondering how to apply your talents and interests to working with data then this episode is a must listen.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Josh Benamram about the continued evolution of roles and responsibilities in data teams and their varied requirements for visibility into the data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by discussing the set of roles that you see in a majority of data teams?
- What new roles do you see emerging, and what are the motivating factors?
- Which of the more established positions are fracturing or merging to create these new responsibilities?
- What are the contexts in which you are seeing these role definitions used? (e.g. small teams, large orgs, etc.)
- How do the increased granularity/specialization of responsibilities across data teams change the ways that data and platform architects need to think about technology investment?
- What are the organizational impacts of these new types of data work?
- How do these shifts in role definition change the ways that the individuals in the position interact with the data platform?
- What are the types of questions that practitioners in different roles are asking of the data that they are working with? (e.g. what is the lineage of this asset vs. what is the distribution of values in this column, etc.)
- How can metrics and observability data about pipelines and data systems help to support these various roles?
- What are the different ways of measuring data quality for the needs of these roles?
- How is the work you are doing at Databand informed by these changing needs?
- One of the big challenges caused by data systems is the varying modes of access and interaction across the different stakeholders and activities. How can data platform teams and vendors help to surface useful metrics and information across these various interfaces without forcing users into a new or unfamiliar workflow?
- What are some of the long-term impacts that you foresee in the data ecosystem and ways of interacting with data as a result of the current trend toward more specialized tasks?
- As a vendor working to provide useful context to these practitioners what are some of the most interesting, unexpected, or challenging lessons that you have learned?
- What do you have planned for the future of Databand?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder.
[00:01:21] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Josh Benamram about the continued evolution of roles and responsibilities in data teams and their varied requirements for visibility into the data stack. So, Josh, can you start by introducing yourself? Well, it's great to be speaking with you today. Tobias, I'm a huge fan of your work and, honestly, just really grateful for the conversations you've led in the community. So thank you. I'm Josh Benamram. I'm CEO at Databand.ai. We are a data pipeline observability company, and we help organizations get good data products to market by providing visibility and easier management of their pipelines and data assets.
[00:02:00] Unknown:
And do you remember how you first got involved in the area of data management? Yeah. I do. So I come from a pretty
[00:02:06] Unknown:
varied background. I actually started in the finance world working in the investment arm of a quant trading firm. It was there that I got first introduced to data science and data engineering, and I've been obsessed ever since. I later worked in a venture capital firm where I focused on data investments. And just prior to Databand, I was a product manager at a data analytics company. So while I've worked in different kinds of teams, the common denominator for me has always been being really close to the data organization, if not working on data products directly within my different organizations.
[00:02:48] Unknown:
Given your varied roles across these different organizations and different ways of interacting with data, I imagine you have a decent perspective on some of the different ways that the platform and the team orientation are structured in different contexts. And also given the fact that you're building a company that hooks into data platforms, I imagine that also gives you some interesting perspectives into how people are interacting with their various data pipelines.
[00:03:18] Unknown:
Yeah. That's exactly right. This is something that I've thought a lot about over the different experiences that I've had in my career, jumping between different kinds of teams. Definitely in my last company before Databand, as a PM building data analytics products, I worked with many, many data teams who were integrating our analytics stack into their organizations. And at Databand, since we're plugging into data platforms, as you mentioned, this is something that we focus on quite a lot and a question that we like wrapping our heads around.
Before jumping into some of those different team distinctions and role distinctions that we see across organizations that we're working with, I'll preface that our company now, Databand, works with a particular kind of organization, and I think that that definitely colors the way that we see the world. We work mainly with companies that are building data products. And by data products, I mean analytics or machine learning that's customer facing. Like, I like using the example of a stock analysis dataset that's sold to investors or a recommendation engine that goes into a customer facing software product.
But even before talking about the different kinds of roles that we often see, having that context is important because there are two things that are really unique about companies that are building data products, we feel. The first one is their pipelines tend to be more complex. I think by virtue of building whatever their unique product is, these data teams typically have a lot of their IP in how they process, aggregate, transform, and work with their data. So, therefore, they're usually starting with more raw data. There's a lot of data that they're working with, like a lot of sources, and they do a lot of processing on this data to make it useful. Otherwise, there would not really be much of a product for them to sell. You know, their customers would go directly to their data sources. So for all those reasons, the kinds of companies that we're working with, their pipelines just tend to be more complex.
Secondly, the companies that we work with, they have really high standards. They treat their data processes really seriously. It's a very bad thing if data gets delivered late or if there are data quality issues in their final outputs. And that's because it's ultimately going in front of a customer that's paying you, so they often have higher standards for what they produce. So when you have a data team like this where there's more complexity in the pipelines, there's higher standards, you tend to get this bigger division of responsibilities to make sure that things run smoothly.
And that creates more defined activities and roles within the data team. So just setting that context for the kinds of roles that I'll talk about, because if you're working with a data product team, it's gonna be quite different than if you're working with a smaller scale analytics unit within a company that's a little more targeted on the kind of data analytics that they're doing. So going back to the actual roles that we might see, I'd separate this first into data producers and data consumers. So the role of producers is to create ready to use data. And among that group, you have the data engineers and the data platform engineers.
The data engineers are usually responsible for what's happening in specific pipelines. So they work generally more closely with analysts and scientists to prepare the actual logic that takes raw data in, cleans it, aggregates it, makes it queryable, etcetera. That might start as JSON coming into S3 and end as a table in a database like Snowflake. The data platform engineers are more responsible for the services that make pipelines run. So they care about the issues that will cut across pipelines, like an Airflow environment going down or system wide resource bottlenecks or meaningful disturbances in upstream data sources that are going to affect a lot of things.
On the other end of this, you have consumers. This might be the analysts who take ready data and prepare analytics and reports from it, or it might be the data scientists who take the data to build their models. So just at the starting point here, we have four typical categories of roles that we're dealing with. You have data platform engineers, data engineers, data analysts, and data scientists. Maybe walking through an example of a typical kind of troubleshooting process there if an issue comes up. So let's say an analyst raises a flag that a table looks out of date. That might be really bad because you have a customer on the other end of that dashboard that expects it to be timely. Maybe that's even stipulated in some SLA that you have with the customer. A data engineer might then jump in to check out the pipeline that delivers the data to that table. They might find that the pipeline never ran, and then maybe they just do a backfill and that resolves the issue there. But maybe there's a deeper problem, a service level problem, like something going on in the broader Airflow environment or an issue in an underlying Spark cluster, or a pipeline that ran completely fine, but there's some fundamental problem in the upstream data source. In that case, platform creates an alert about a service level issue, and then alerts the engineers and the analysts based on the pipelines and datasets that are gonna be impacted. But that'd be a typical kind of division of activities in the troubleshooting scenario that describes some of these different roles.
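To make that triage flow concrete, here is a minimal sketch of the first check a data engineer might run, assuming Airflow 2's stable REST API with basic auth; the host, credentials, and `orders_daily` DAG are hypothetical, and this is an illustration rather than Databand's implementation.

```python
# Minimal triage sketch: did the pipeline ever run, and if not, trigger a new run.
# Assumes Airflow 2's stable REST API; host, credentials, and DAG id are hypothetical.
from typing import Optional

import requests

AIRFLOW_API = "http://airflow.example.com/api/v1"
AUTH = ("viewer", "password")  # assumption: basic auth is enabled on the webserver


def latest_run(dag_id: str) -> Optional[dict]:
    """Return the most recent DAG run, or None if the pipeline never ran."""
    resp = requests.get(
        f"{AIRFLOW_API}/dags/{dag_id}/dagRuns",
        params={"order_by": "-execution_date", "limit": 1},
        auth=AUTH,
    )
    resp.raise_for_status()
    runs = resp.json()["dag_runs"]
    return runs[0] if runs else None


def trigger_run(dag_id: str) -> dict:
    """Kick off a fresh run so the stale table gets rebuilt."""
    resp = requests.post(f"{AIRFLOW_API}/dags/{dag_id}/dagRuns", json={}, auth=AUTH)
    resp.raise_for_status()
    return resp.json()


run = latest_run("orders_daily")
if run is None or run["state"] != "success":
    print("Pipeline missing or failed; triggering a new run")
    trigger_run("orders_daily")
```

For a range of missed dates, the equivalent step would be an `airflow dags backfill` over that window rather than a single triggered run.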
[00:08:52] Unknown:
Given that you are working with these organizations that are more at the forefront of how to work with data and the types of processing and systems that they're using to perform that processing, I'm wondering if you see that as being somewhat of a predictor of where the broader industry is going, where more organizations are going to start to experience this sort of segmentation of roles where I feel like in the sort of early stages of big data, it was all about data science, and everybody just wanted a data scientist because they thought that they were going to be the solution to everything, and they would do everything. And then most everyone realized, oh, wait. We actually need data engineers to work with the data scientists so that the data scientists have clean and manageable data to be able to use as a starting point.
And business intelligence as a practice used to be more of sort of an IT concern and now has become a concern of analytics engineers, where I see there have been a number of evolutions of data roles just in the past decade or so. And I'm wondering if you think that some of the further segmentation that you're seeing is going to remain sort of constrained to these organizations that are working on building data products, or do you think that it is a predictor of where the larger industry is going to head in the next sort of 5 to 10 years? First of all, I think more and more companies
[00:10:12] Unknown:
will be building data products. I think more organizations will follow this trend of monetizing their data assets. There's a couple reasons why I really believe that's the case. First of all, it's really lucrative. Companies can make a lot of money on data products. It's a highly scalable resource. If you have a good data asset and you find a product market fit for it and there's a wide audience for that, it might be the case, and often is, that you don't really have to do too much to that data to sell it to more and more end customers. So in the same way that software scales really effectively, a data product can scale really effectively and can be really lucrative.
The other reason why I think more companies will be moving into data product development is because there are a lot of organizations out there that just have really valuable data that's kept in some internal data warehouse or being created by a business unit somewhere that is able to be monetized in some way. And I think this is particularly true of technology companies that are already selling software, and somewhere in that software, some really interesting data is being created. And I saw a recent headline about Atlassian, for example, purchasing a business intelligence company. I think it was Chartio.
But to me, this was a great crystallization of that trend in the industry. When you think about Atlassian and the reach that their product has across organizations, and the kind of data that they're producing within their software about how companies are completing their tasks and development and the productivity of their engineers, there's so much interesting information to be used within data products. It's not super surprising to see them invest further in bringing that kind of analytics within the system so that they can better monetize it. I think that's gonna be a consistent parallel across many companies in the technology space and also beyond. So more companies are gonna be building data products. Once you're in that data product world, then you get into the complexity. You get into the higher standards. And when those kinds of needs are present, then you want better division of responsibilities to handle things in a more effective way and give different stakeholders on the team a better idea of who to go to when there are problems.
Outside of even companies that are working on data products, even if you just have an internal analytical system that you're relying on, at the point that you really start relying on that system, at the point that you start depending on it to drive mission critical business processes, you better have assurance that the data that's ending up in those dashboards or in those models is accurate. So you may not have an end customer that's calling you and saying, what's going on within this dataset or what's going on within this dashboard? But that can be just as scary for a team if it's coming from an executive that said, hey, I just, you know, plowed a big investment into some business decision because of a KPI that I saw. And I'm looking again. It looks like it was totally wrong. So I think different factors are gonna catalyze the complexity that we see in pipelines and the level of standards that we see being put on data teams, and that's what's going to create this new division of responsibilities and roles.
[00:13:35] Unknown:
And in terms of the roles that you see emerging, you set the baseline of most organizations have at least some distribution of data scientist, data analyst, data engineer, and platform engineer. Those are fairly well established roles. Most people at least understand what the responsibilities are across those different divisions. And as you are working with these more advanced data organizations, what are some of the additional specializations that you see emerging, and what are the established positions that you're seeing those roles kind of being broken out of or merged with to be able to support these more complex data products?
[00:14:18] Unknown:
So we see further specialization and outgrowth from these roles happening in two ways. The first way would be through the formation of umbrella groups. The second way would be through the formation of hybrid roles. In umbrella groups, you have different organizations that have repeating roles between them, but focus on different parts of the end to end business process or sit at different components of the organization. So, for example, you might have an upstream platform team as one umbrella group and a downstream analytics team as another umbrella group. And in each group, you have somewhat repeating roles, but they focus on different levels of the data process end to end. So in the platform team, you might have all the roles that we just discussed. You might have data engineers, data scientists, data analysts that produce the data that different downstream business units will use. So you might have a data scientist in this group, but, really, they're a data platform scientist. And that's a different kind of set of responsibilities and a different kind of day to day work than a data scientist sitting in another umbrella unit. So an example of that is the data platform scientists, they might be aware of different downstream consumers. They may be responsible for bringing the raw data that is coming into platform closer to the form that those downstream business unit data scientists and analysts are going to use or are able to use. For example, scientists in the platform team might build a predictive KPI into a table that a downstream team uses as one of their data sources. I mentioned before, I like using this example of a trading firm that's looking at analyzing stock market information. So let's say your company is pulling in information about the stock market, and the data product that you're building is about whether stocks are gonna go up or down, you know, trying to predict where GameStop is headed.
So this trading firm may have a platform team that's pulling in data from a bunch of different exchanges and different brokerages. And then in the platform team, you have a data scientist that's creating a predicted GameStop price. And that predicted GameStop price is gonna get dropped into a table in Snowflake or your data lake that some other business units downstream are gonna pick up and then use as part of the analytics that they're building or models that they're developing. So going down into that next umbrella group, the data analytics team, another group downstream, this team might likewise contain some assortment of platform people. They might have data engineers. They will have actual data scientists or analysts in most cases. But they'll be working on more discrete projects. They'll be closer to the end customer, and they might pull in several data sources beyond what platform provides them. So in this kind of umbrella group organization, where you have different repeating business units at different levels of the organization, different repeating roles, this is a nice way of organizing things for a lot of companies because it provides units a level of autonomy that allows them to innovate quickly and get products to market faster, and it centralizes shared requirements, which allows for better focus on what makes these units different in their main mission.
The other version of this, outside of the umbrella groups that we see, would be having hybrid roles. So you might have a data scientist role that opens up into additional roles that help close the gap with data engineering. So a data scientist might open up to two roles, one being a core data scientist who focuses on running experiments and producing models. And then you might have another set of responsibilities for the machine learning engineer who manages the automated ML pipelines for training, deployment, retraining, and really helps bridge the gap between the data scientists and the engineering. So the data scientists might open up to core data scientists and then an ML engineer who focuses mostly on automation.
Same thing for data analysts. You might have a single data analyst, and as things get more complex, that opens up to a core data analyst who focuses mostly on building analytics and defining KPIs, and then an analytics engineer who manages automation around analytics pipelines and prebuilt aggregations that need to get done. Platform opens up as well. So we'll often see data platform open up to platform engineers, who are responsible usually for setting the structure and design principles and templating for how people build their pipelines, and a DataOps engineer, who manages the services and covers the needs that people usually keep poking DevOps to help with. So that's how we often see data platform opening up, too.
[00:19:21] Unknown:
As you are working with these organizations that have these varying types of specialization, whether it's these hybrid roles that are specializations within the broad category of analytics or data engineering or data science or what have you versus these umbrella organizations, what do you see as being the broad business impact in terms of how that affects the capabilities of the organization to build and release high quality data products? And just some of the additional considerations that they need to think about from an organizational and product perspective of delivering data as a product versus just delivering software or physical widgets as a product?
[00:20:02] Unknown:
In general, having this kind of division of responsibility as the company or the data organization scales is going to help these teams be more productive because people will be able to focus more on their core set of responsibilities, and they'll be able to pull in different stakeholders in the team to solve problems based on responsibilities that they know those stakeholders have. So if I'm working in a more amorphous team where there isn't such a good distinction between what a data engineer does and what a data scientist does, and I'm working with a dozen folks on the team, when a process breaks down or an environment breaks down, knowing exactly who to pull in can be a really tricky thing. So when you have a better division of responsibilities, just like in software organizations where you have a good sense of who does full stack, who does front end, who does back end, who's focused on DevOps.
If there's a DevOps issue, you know who to go to. Same thing in data. If there's a DataOps issue on the service level, you'll know better who to go to. A couple of big factors that we think about on the organizational impact of this kind of separation, when it happens, are the level of autonomy for different teams or different units versus the level of cohesion in the end to end team and the overall organization. So as your different positions become more specialized, the question is how you ensure that stakeholders have the freedom to work independently and move quickly while at the same time being connected enough that people are working together in a clear direction.
I think this is one of the main challenges that a lot of the companies that we see really face. And I think it relates to the kinds of investments that people make in their technology systems and their technology platforms. But here's where it's really important to have good levels of interoperability and the ability to create sources of truth at the highest level of the tech stack. So with Databand, we'll work with teams where, before using us, platform just has no idea how the pipelines of any engineers work. And we had a case where an engineer left the team of one of our client organizations, and the client basically had to recreate a mostly working process from scratch, which took months for them to do. And they had to do it because they didn't wanna go to production without having good visibility or control over their pipeline.
So that would be an example where you had clear division of roles, a clear specialization of responsibilities, but not enough visibility or observability between the different activities in the team so that platform could introspect those processes and get good checkpoints and logging from the pipeline covering the entire journey of the data, from the data source to the data lake, to the warehouse, to the consumer. That ability to observe is really crucial and why we're working on Databand.
[00:23:07] Unknown:
In terms of the team dynamics and the ways to structure the organizational aspects of these more specialized roles, what are some of the strategies that you've seen be particularly effective in terms of being able to maintain effective communications across these different boundaries of roles and responsibilities, and how they can all work together to be able to deliver the end product to the customer?
[00:23:34] Unknown:
I think this, for us, has a lot to do with the mixture of having the right processes set up within the organization as well as having the right tools and technologies in place to help build that cohesion and maintain the right level of autonomy. So in terms of processes, an example of this would be having a system in place where there's clear ownership over different pieces of the technology stack and different pipelines that the team is building. So having a clear owner in the data engineering category, which engineers own which specific pipelines, which data targets or data assets those pipelines write to, having that built up in the organizational knowledge of the data engineering group and using that knowledge to be able to quickly isolate where failures might be cascading down to the consumer level. So if a service goes down, if platform knows that there's something failing in an environment, in Airflow, or at the Spark level, platform knowing who owns the pipelines that are going to be affected and being able to get the information and the news out to those stakeholders, who can then distribute the news down to whoever might be consuming the data. So having a really clear set of ownership across these different processes is paramount to being able to run the organization smoothly.
In terms of tools and technologies, having a set of capabilities that allow you to distill that kind of knowledge into a product layer that people across the organization are able to access and, without really even asking anybody, get information about different services and get information about different pipelines, this is really critical. So a lot of the tension that we see in these groups, which we think is unique to data organizations, is this desire for using best of breed tools versus building more sources of truth or centers of gravity within their teams. And finding that right balance between being able to use those different pieces of technology to run your pipelines and get data delivered versus having the sources of truth where you can see all the information that's being produced from them and you can quickly go in and isolate issues, that's gonna be
[00:26:06] Unknown:
really important. Because of the need to be able to have this visibility and have these tools that establish clear communication across the different boundaries and across the different layers of the data stack and stages of delivery, how does the sort of specialization of these roles and the sophistication of the operations that they're performing impact the way that the organization and the teams think about investment in the data platform and in the technologies that they're using to be able to deliver these systems, both in terms of just making sure that they're operational, but also in terms of maybe preventing tool sprawl so that you don't have everybody speaking a different language and end up in sort of a Tower of Babel situation?
[00:26:47] Unknown:
Yeah. So this balance between best of breed tools and building those sources of truth is something that we think about a lot. And on one level, you wanna be able to leverage the right technologies for different stakeholders so that you can give people the tools that they need to work efficiently. An example of that would be allowing data platform teams to build their data lake in S3, where engineers are mostly working with Spark, and using those tools because they need to optimize for storage space. On the platform side, you really wanna optimize towards getting as much information into your lake as possible, as cheaply as possible, and being able to support a lot of different sources, so being able to take data files of any type. So allowing platform to work at the S3 level, in a really flexible and open and essentially free data lake, and then leveraging a tool like Spark so that they can easily process a lot of that data.
But forcing an analyst team or even an analytics engineer to operate at that level might be pretty challenging because you're not gonna typically find a lot of analysts or analytics engineers that are super, super familiar with Spark. They'll tend to be more comfortable in tools like SQL and working more at the database layer. So allowing the analyst to use tools like Snowflake as a more aggregated warehouse and an easy query engine on top of the lake is just a good example of being able to separate those best of breed technologies. On top of that, on another level, these processes are so interdependent. Right? That data that's accumulating in S3 is going to eventually make it into Snowflake, where analysts are gonna begin querying it. You want the ability to control and observe what's happening across these systems.
So your orchestrator and your observability tool are good examples of that. If you have engineers working on S3 with Spark and analysts working in Snowflake, your pipeline orchestrator needs to be able to really easily run jobs across both of these systems. And Airflow we see a lot as a common tool for achieving that bridge across different layers. We can also get meta here, where you have multiple orchestrators in your orchestrator, for best of breed within that layer even. We see a lot of Airflow running Spark on S3 and then Airflow even kicking off dbt jobs on Snowflake, because dbt as an orchestrator might be a lot easier for your analytics engineers and your data analysts to work with on the Snowflake side, whereas Airflow might feel a lot more comfortable for your platform engineers and your data engineers that are working upstream, and that's pretty common.
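As a rough illustration of that bridge, here is a minimal Airflow DAG sketch that runs a Spark job over raw data in S3 and then kicks off a dbt build against Snowflake; the application path, dbt project location, and connection IDs are assumptions, and a real pipeline would add retries, sensors, and alerting.

```python
# Sketch of one orchestrator spanning both layers: Spark on the lake, dbt on the warehouse.
# Paths, connection IDs, and the dbt target are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="raw_to_warehouse",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Heavy lifting on the lake: a Spark job that cleans raw JSON landed in S3.
    clean_raw_events = SparkSubmitOperator(
        task_id="clean_raw_events",
        application="/opt/jobs/clean_events.py",  # hypothetical Spark application
        conn_id="spark_default",
    )

    # Analyst-friendly layer: dbt rebuilds the Snowflake models downstream.
    build_snowflake_models = BashOperator(
        task_id="build_snowflake_models",
        bash_command="dbt run --project-dir /opt/dbt --target prod",
    )

    clean_raw_events >> build_snowflake_models
```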
Likewise, you may want your observability tool to be able to capture that end to end process so that you can catch issues as soon as they occur upstream and can broadcast those issues to those impacted downstream. So if you have a missed data delivery in an S3 bucket that data engineers own and platform manages, what pipeline is that going to affect? What's gonna be delayed by that? And how's that going to affect what ultimately becomes a table within Snowflake that's gonna get fed into an analytics report or another data product that's being prepared by analysts? That becomes really important. If you have umbrella organizations, it's gonna be a lot easier to build these kinds of best of breed technologies. It's usually gonna be harder to create single sources of truth. So if you have that big distinction, where you have data platform, engineers, analysts, scientists working at one higher end of the organization and then similar groups working in different business units, it's gonna be a lot easier to have those different teams use whatever tools that they feel they're most comfortable with. It's gonna be harder to create cohesion and a clear remediation path when an issue comes up that cascades across those units.
On the other hand, if you have a single highly specialized team, where in a single group you've got all those roles, you have the hybrid roles: the DataOps engineer, the data platform engineer, the ML engineer, an ETL engineer, a data scientist, a data analyst, an analytics engineer, all those different roles. Then, generally, it's gonna be easier to build sources of truth, but it'll be harder to achieve the best of breed technologies at different layers because everybody's so interdependent.
[00:31:24] Unknown:
Particularly in this juxtaposition of umbrella orgs versus hybrid roles, what do you see as the impact of the current trend of data mesh as sort of a growing way of thinking about the way to build out different data products internally within an organization and how to combine them into maybe downstream data products that other people are going to consume, and just how to build useful interfaces and compartmentalization of these different stages of responsibility?
[00:31:56] Unknown:
There's a lot of different definitions
[00:31:59] Unknown:
for terms like data mesh that we see, so I'd love to hear how you define that. And then I can help target how I would look at it and maybe how we would look at it. Yeah. It's definitely starting to become one of these buzzwords that people throw around to try and make their product or their team or whatever they're doing sound cool and interesting. But the way that I think about it is sort of back to the original article that I read by Zhamak Dehghani, who I had on the show a little while ago to talk about sort of her perspective on it. But, basically, instead of having one centralized team that does all of the data work for the entire organization, you have maybe an enabling platform team that provides all of the systems necessary to do self-service access for each of the different business units to then treat all of the data internal to their concern, whether it's their application or their business responsibility, and then provide that, you know, via an interface to the rest of the organization to be able to consume as a packaged product so that it's already been cleaned and easy to work with. You don't have to try and, you know, perform your own analysis to understand, you know, what are the standardized metrics, because it's already delivered via this API or via this, you know, prepackaged product that you can consume just as a consumable data asset.
[00:33:16] Unknown:
So this is very much in line with the umbrella organizations that we see forming in the companies that we're working with. So it's similar to the mesh or the hub and spoke model that we see in big companies. I think the important thing is the interface between that, like, ready to consume data product that is being produced by the more centralized teams, the data platform organizations, and making sure that the products they produce have the right level of, essentially, certification around them so that the end business units and these different consuming teams understand what that data consists of, how it's changing over time, what the critical failure points may be if they begin taking that data and working it into a product that they're producing.
Having that kind of certification lineage, that level of tagging on the data asset, becomes really, really critical because then they begin depending on that as another data source that gets fed into the products that they are producing. So they may be relying on that input source from platform as well as several other sources, and having a good governing stamp that says it's clear how to leverage this and when it might fail and who to talk to if the data is delayed, that becomes really critical. What we aim to deliver, what we aim to support through our application, is the ability to cascade and understand the lineage of not just data that's flowing from one business unit to the next, but the cascade of notifications and alerting that needs to be promoted if a failure does happen or if a bad data event transpires.
So if the platform team is regularly producing this asset that an end business unit is using, and for whatever reason the pipeline that is producing that dataset fails, being able to distribute and announce and broadcast those kinds of notifications across the organization so that different consumers can subscribe to them, essentially, that becomes really critical. Going back to our investment example, if the central team is producing that KPI that says GameStop shares are gonna go up or gonna go down, and for whatever reason Nasdaq never uploaded the recent trading information that platform is generally leveraging, and that causes a pipeline to stall or delay and miss its SLA, being able to distribute that notification through the organization to the end business unit that's leveraging that as one of their critical data inputs, that's something that becomes really important and something that we definitely wanna support with our observability system.
[00:36:11] Unknown:
Digging more into the visibility aspect of these different layers of the data systems and the ways that different users across the life cycle of this information are utilizing the data platform. I'm wondering if we can start with just talking about what are the types of questions that each of these different roles and responsibilities are asking of the data platform, and what are some of the ways of being able to reliably surface and derive the answers to the questions that they're asking?
[00:36:43] Unknown:
Great question. So, typically, the folks working on data platform are going to care more about SLAs and catastrophic events that are going to affect multiple consuming teams or multiple data consumers. And examples of this would be issues that will create problems across several pipelines. Let's say you're pulling information from one of the exchanges or you're pulling Facebook information, and Facebook changes their API, which is not uncommon. That might cause a cascade of problems that will domino down into several different business units if you're creating a data platform asset that multiple different teams are leveraging.
So focusing in on those key points of failure that are going to create that web of issues across multiple processes, that's really critical for the platform teams that we work with. On top of that, they care about the performance issues that slow down the delivery of data. So SLAs are generally more important for the platform end of things. We typically see that they're less concerned with what's happening on the actual inside of a dataset, like, is this data accurate in a true sense? And typically they're more concerned with the SLA they have for delivering the data. So is the data arriving in the target location on time? Is it more or less intact and complete, as opposed to what's actually being said within the data itself in that table or that file that's being made available? So SLAs, catastrophic events, performance information, this typically is more of the concern of data platform.
Data analysts or the analytics engineers, the folks that are on the other end of this spectrum, tend to care more about the end results. So this would be like, is the right data in the expected table at the right time? Is it an accurate data source? Does it tell us what we need to know about the product we're creating? Are we getting the best data for that question? And, also, really importantly, how have significant KPIs changed? Like, tell me immediately as soon as you detect in the pipeline that there's, you know, a big share movement in GameStop, if that's where we're building our analytics products. So, typically, we'll see platform a lot more concerned about the service level issues, the delays of data that's coming into the organization, performance problems that might slow things down, how Spark jobs are doing that might cause delays.
Analysts on the consuming end will be a lot more concerned about what's happening in the actual datasets themselves and that data quality information, at maybe even a record by record level. Everyone across this entire scope cares about lineage in the sense that, as a broader organization, you wanna understand the source of problems upstream. You wanna know their impact downstream. Analysts tend to care less about the internals of pipelines, and engineers generally are less interested in the inner workings of a Tableau or a Looker dashboard. But both of these groups wanna be able to trace the impact of issues down and the cause of issues up. There is an interesting overlap in some metrics like data distributions, where we see possibly two teams that want the same exact information, but for two totally different reasons. So an example of this would be, like, data skew. If you're a data scientist, you might be obviously concerned about that because you want a model that's trained on the data that you expect to see in production. And if your data is really skewed, that model is not going to be as performant as you want it to be. So if we're, you know, building that model to predict where GameStop price is going, if that model's trained on data from 2020, it's gonna be pretty off at this point.
Generally, we see data analysts or scientists more concerned about skew for the accuracy of their products. On platform, you might take that same metric like skew, the same metadata like skew, and they may be much more concerned about it just to the extent that it impacts the performance of their jobs, which causes late deliveries in data and goes back to that SLA being missed. So an example there would be a skew in a dataset that causes a slowdown in a Spark job because it's not properly picking up the data and partitioning it across the cluster. So in those two cases, you may have the same exact piece of metadata that people are looking for, the distribution of a dataset, but it's gonna be used in quite different ways and for quite different means.
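For illustration, that kind of skew check can be a few lines of PySpark; this is a minimal sketch assuming a hypothetical `trades` dataset keyed by ticker symbol, with an arbitrary threshold.

```python
# Sketch: measure how lopsided the key distribution is before it stalls the Spark job.
# The dataset path, key column, and threshold are all hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()
trades = spark.read.parquet("s3://example-lake/trades/")

# Count records per key; a handful of very heavy keys is what slows the partitions down.
counts = trades.groupBy("ticker").count()
stats = counts.agg(
    F.max("count").alias("max_per_key"),
    F.avg("count").alias("avg_per_key"),
).first()

skew_ratio = stats["max_per_key"] / stats["avg_per_key"]
if skew_ratio > 10:  # arbitrary threshold for illustration
    print(f"Heavy skew detected (ratio {skew_ratio:.1f}); consider salting the join key")
```

The same per-key counts serve the analyst as a distribution metric and the platform engineer as a performance early warning, which is the overlap described above.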
Generally, also in platform, we're dealing with much more of this data, so it becomes natural to do more sampling on top of it and not try to get a 100% record by record snapshot of everything that exists within a file or a partition. And on the analytics end of things closer to the data product, as you get closer and closer to that dashboard or that model, the record by record information becomes that much more important. And having good expectations built up at that point becomes more critical.
[00:42:01] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. Another interesting aspect of this visibility factor in building data products and managing data systems is similar to the same problem with managing system metrics and performance metrics in software applications, where the system that is responsible for being able to store and alert on the data that is used to manage these systems has to have a higher uptime than the systems that you're alerting on. And so there's sort of this interplay between how much you invest in building the observability stack versus how much you invest in actually building the product that you care more about, you know, what you're trying to get the visibility on. And I'm wondering if you have any thoughts on sort of some of the impacts or some of the ways that you're approaching the reliability of the systems that you're building at Databand to be able to ensure that you're always going to catch these critical issues and alert in a timely fashion the end users of your product who are trying to build their own products based on the data that they're sending to you. That's a funny point. So first of all,
[00:43:58] Unknown:
we feel really strongly that observability is crucial. And once you have these pipelines that are moving data from point a to point b, once that's running, you immediately fall into the trap of not knowing what the data actually looks like that's getting moved over. And we do encourage teams to take orchestration and observability hand in hand. It's like, as soon as you're building out your pipelines, start building in the logging, start building in the metrics and the metadata tracking that helps you identify when failures come up and trace the root cause really quickly.
We come at this with a design principle of isolate the failures and make sure that there's clear fault tolerance and good separation between what's doing your observing and what's doing your running, what's doing your orchestration. So Airflow itself, I'll pick on a little bit because we're big fans of Airflow, and we're quite invested in integrations there. So I'll feel free to pick on that a little. We see a lot of teams that rely heavily on Airflow, obviously, for running their pipelines, but also for monitoring their pipelines. And Airflow does a good job of picking up just state information and status information and durations and kind of the basics of performance metrics.
But, first of all, it'll not go super deep into what's happening in the actual data flows, and that really requires a separate kind of solution to be brought in. But on top of that, it's a good design principle to have a separation between what is alerting on or monitoring the system that's running your processes versus the system that's running your process itself. So if Airflow goes down and you're using Airflow to monitor Airflow, then all of your monitoring has just gone down. Having some external system, whether it's Databand or even just a Grafana dashboard, that's monitoring that, alerting on it, and has some outside perspective looking in is a really good practice that we wanna help drive within the companies that we're working with and the broader ecosystem.
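As a simple illustration of that principle, here is a minimal sketch of an external check against Airflow's built-in `/health` endpoint; the host, the webhook URL, and the idea of running it from an outside scheduler (cron, a separate monitor, or similar) are assumptions for the example.

```python
# Sketch: monitor Airflow from the outside so the monitor doesn't die with the thing it watches.
# Host and webhook URL are hypothetical; run this from somewhere other than Airflow itself.
import requests

AIRFLOW_HEALTH_URL = "http://airflow.example.com/health"
ALERT_WEBHOOK = "https://hooks.example.com/oncall"  # hypothetical paging webhook


def check_airflow_health() -> None:
    try:
        status = requests.get(AIRFLOW_HEALTH_URL, timeout=10).json()
        scheduler_ok = status["scheduler"]["status"] == "healthy"
        metadb_ok = status["metadatabase"]["status"] == "healthy"
        if scheduler_ok and metadb_ok:
            return
        message = f"Airflow degraded: {status}"
    except requests.RequestException as exc:
        # If the webserver itself is down, the external monitor still fires.
        message = f"Airflow unreachable: {exc}"
    requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=10)


check_airflow_health()
```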
In terms of how we try to make it as easy as possible to make sure that we're accurately capturing the information from these different systems, what we aim to do is make sure that the integrations that we focus on are really comprehensive and gather up a lot of data from the systems that we're tying into. So when we integrate with a company that's using Airflow, we aim to look at a lot of the data that's being captured in Airflow and create some redundancy of that in our system so that you have a backup of the key performance metrics that may be important.
And on top of that, we'll pull into Databand, we'll pull into our product, additional information from your actual tasks themselves, from the data quality information, from the Spark jobs that you're running, from the Snowflake queries. And we can provide that meta layer on top which helps you understand, if you have a slowdown in Airflow, is that coming from some issue happening in the Spark cluster or some issue happening in a Snowflake query or something that's coming from the data itself? And we wanna help draw that web between these different ends of the process. Last thing I'll add there is within our application, we also aim to do higher level analysis and higher level trending and comparisons of this metadata that we're pulling out. So even for companies that have really simple pipelines that are just running everything within a single Airflow environment, it can be tricky to know where to focus your attention when, just in the normal course of doing business, your pipelines are gonna fail. Pipelines are prone to failure. They're really noisy, and it's usually not the case that you need to know every time that a pipeline goes down. And if you have one-to-one alerting set up that gets triggered every time a DAG fails, your engineering team is really quickly gonna get inundated with alerts.
And we wanna help these organizations really filter through that noise and separate out the signal and find the needle in the haystack of where you need to focus your attention. Some of the ways that we do that is by looking at deeper factors within the actual pipeline that help us identify if there's a problem which goes beyond just a simple restart or goes beyond just a simple backfill and may require, actually, more attention. Generally, that's gonna be stuff that happens on the data layer. That's gonna be stuff that, like, relates to completeness of your dataset, but it can also be factors from your orchestrator that we help collect up by looking at the deltas between restarts and failures. You know? For example, if a pipeline fails for an extended period of time, restarts a few times, continues to fail, goes through that back and forth a lot, we wanna really draw your attention to that kind of process compared to one that fails once, snaps back on, and is now running okay. And another interesting
[00:49:05] Unknown:
aspect of visibility and observability and alerting in data pipeline contexts is, one, being able to source the information from all of the different points where it's useful to pull from. So there are, you know, some products and some companies that might focus just on the data warehouse as the focal point, where all the alerting is going to happen because this is where all the data ends up and where it's all being pulled from, versus, you know, I wanna be able to get a cross-cutting view of all of the data life cycle from the very first point of contact where I pull it in from a source all the way through to where I'm sending it out as a machine learning model. And, you know, there are varying degrees in between. And so I'm curious to understand sort of what your perspective is on how to effectively source the useful information for building these alerts and visibility into the data life cycle. And then also, given the different ways that roles in data teams and across data organizations interact with the data platform, how do you surface that information to them at an appropriate time and location without forcing everybody to use the same interface or buy into one way of working with the data system?
[00:50:21] Unknown:
Each stakeholder is definitely going to be interested in a different resolution and ways of monitoring their process. So platform and engineering, we find that they need more gates throughout their end to end process: how data is coming in from the source, how it's being processed within the lake, how it's then being moved to the warehouse. They tend to be interested in higher level metrics, less concerned with record by record changes, and they tend to be more interested in information on the boundaries, like, are records coming in within the expected thresholds?
Is data more or less complete? And these teams we find are generally easier to cover with more kinds of generic profiling metrics, so, like, sizes, schemas, type changes, and the problems that will lead to slowdowns or failures in the pipelines, more obvious corruptions in the data. So what we aim to do within the platform and engineering side of things is provide that signal as soon as possible in the process. So as soon as we detect some issues, some significant change in the data, we wanna capture that at the point where the data is actually coming in. So if our Facebook API is changing and that API is dropping data into an S3 location as the first part of our logical pipeline, we want to be watching that S3 location. We wanna be watching that key there, and we wanna flag when we see a change in schema or a type change coming in. From there, we wanna triage the alerts that are going to be most impactful in helping you focus on the datasets that are most important. So if our Facebook pipeline there that's pulling from that marketing information is leading to a dataset that we know not too many consumers are using, which we can see by looking at different aspects of lineage in the system or looking at how many reads are happening within that dataset, then that should be an alert that's on a lower severity level, that's more quiet. Maybe that just gets sent out to some shared Slack channel that people use as, like, an event stream, relative to a high severity critical level alert that gets blasted on all channels and, you know, is waking people up through PagerDuty in the middle of the night. That kind of alert is gonna get triggered when you have a pipeline that's pulling in mission critical data that you know has a clear SLA around it. It needs to be delivered every morning at 6 AM before business wakes up, and it's being leveraged across many different consuming teams in your organization.
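As a rough sketch of that first checkpoint, here is one way a schema or type change could be flagged at the point where a file lands in S3; the bucket, key, expected schema, and JSON-lines format are all assumptions for the example.

```python
# Sketch: flag a schema or type change as soon as a new file lands in the lake.
# Bucket, key, and expected schema are hypothetical; assumes JSON-lines files.
import json

import boto3

s3 = boto3.client("s3")

EXPECTED_SCHEMA = {"campaign_id": int, "spend": float, "clicks": int}  # assumption


def landed_schema(bucket: str, key: str) -> dict:
    """Read the first record of a newly landed file and infer field types."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    first_record = json.loads(next(body.iter_lines()))
    return {field: type(value) for field, value in first_record.items()}


def schema_drift(bucket: str, key: str) -> list:
    """Return fields whose presence or type no longer matches expectations."""
    actual = landed_schema(bucket, key)
    return [
        field
        for field in set(EXPECTED_SCHEMA) | set(actual)
        if EXPECTED_SCHEMA.get(field) is not actual.get(field)
    ]


drift = schema_drift("example-lake", "raw/facebook/2021-04-01.json")
if drift:
    print(f"Schema change detected in landed file: {drift}")
```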
That kind of alert, we wanna help triage up, surface to the front of the stack, and blast to more channels. So the kinds of techniques that we're using there would be leveraging different alert targets, different alert destinations based on the kind of notification that you wanna put out or based on the severity of the issue. One might go to Slack, one might go to PagerDuty. Another technique that we're using is looking at the patterns of behavior around the alerts that you're surfacing. Are you quieting these alerts a lot of the time? Are you resolving them really quickly? Are you letting them sit there? Are you acting on the pipelines after an alert has been fired? We wanna be able to pull in that kind of feedback loop from our users and feed it back into the alert definitions themselves. So if we see that there's an alert that gets fired every day because a pipeline just tends to fail consistently before it eventually succeeds, and we see someone coming in and immediately acknowledging that alert every time it gets surfaced in the system, that might be one that we throttle down to a lower severity level and stop sending to PagerDuty, sending it to Slack instead. So being able to throttle the severity of alerts and the noise that we're creating based on the impact that it's gonna have is really critical.
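Below is a small, hypothetical sketch of what severity-based routing and acknowledgment-driven throttling could look like in Python. The routing table, severity names, and the "several quick acknowledgments in a row" rule are invented for illustration and do not describe Databand's actual logic.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Illustrative routing table: which destinations each severity fans out to.
ROUTES = {
    "critical": ["pagerduty", "slack#data-alerts"],
    "warning": ["slack#data-alerts"],
    "info": ["slack#data-events"],
}


@dataclass
class AlertRouter:
    # Assumed threshold: consecutive "immediately acknowledged" firings before downgrading.
    quick_ack_threshold: int = 3
    quick_acks: dict = field(default_factory=lambda: defaultdict(int))
    severity: dict = field(default_factory=dict)

    def fire(self, alert_id: str, default_severity: str) -> list[str]:
        """Send the alert to the destinations for its current severity."""
        level = self.severity.get(alert_id, default_severity)
        destinations = ROUTES[level]
        print(f"{alert_id} [{level}] -> {destinations}")
        return destinations

    def acknowledge(self, alert_id: str, seconds_to_ack: float) -> None:
        # If users keep silencing an alert within a minute, treat it as noise
        # and throttle it down one severity level.
        if seconds_to_ack < 60:
            self.quick_acks[alert_id] += 1
        else:
            self.quick_acks[alert_id] = 0

        if self.quick_acks[alert_id] >= self.quick_ack_threshold:
            current = self.severity.get(alert_id, "critical")
            self.severity[alert_id] = {"critical": "warning", "warning": "info"}.get(current, "info")


if __name__ == "__main__":
    router = AlertRouter()
    for _ in range(4):
        router.fire("facebook_ingest_failed", "critical")
        router.acknowledge("facebook_ingest_failed", seconds_to_ack=10)
```

In this toy run the fourth firing drops from PagerDuty to Slack only, mirroring the throttling behavior described above.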
From the consumer perspective, it works just the other way. A lot of our consumers are more interested in subscribing to information about events. They have their particular data assets that they really care about. They have their table in Snowflake, which they always wanna read from, and they wanna subscribe to it. They wanna know what events are happening upstream of it that might be impacting that table. You know, if there's a pipeline that's producing the data every day, tell me when there looks to be a big failure coming in that pipeline. Again, they usually care less about the discrete internals, the different tasks that are happening across a process, but they do wanna know if it looks like things are trending in a scary direction for datasets that they really depend on. As somebody who is building a product to help
[00:55:16] Unknown:
provide visibility and confidence to people who are building these different data products, and for this emerging variety of roles, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:55:29] Unknown:
I mentioned a couple of ways that Databand is trying to help surface the most meaningful alerts. I think one of the big challenges that we see is that there are teams that are just working with so much information, so many pipelines. They may have some alerting set up already using more conventional monitoring tools. The challenge is being able to separate the signal from the noise in these processes and do it in a way that respects the nuances of how data pipelines operate: not triggering notifications every time there's a failure, understanding what a restart is, understanding that an alert associated with some pipeline upstream is going to be relevant to a consumer downstream.
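To make that last point concrete, here is a rough sketch of propagating an upstream pipeline failure to downstream consumers over a small lineage graph. The graph structure, dataset names, and subscription mapping are made-up examples of the general idea, not a description of how any particular tool stores lineage.

```python
from collections import deque

# Hypothetical lineage: each node maps to the datasets that depend on it.
DOWNSTREAM = {
    "facebook_ingest_pipeline": ["raw.marketing_events"],
    "raw.marketing_events": ["analytics.campaign_spend"],
    "analytics.campaign_spend": [],
}

# Hypothetical subscriptions: consumers who asked to be notified about a dataset.
SUBSCRIBERS = {"analytics.campaign_spend": ["analytics-team@example.com"]}


def notify_downstream(failed_node: str) -> dict[str, list[str]]:
    """Walk the lineage graph from the failed node and collect who to notify."""
    notifications: dict[str, list[str]] = {}
    queue = deque([failed_node])
    seen = {failed_node}
    while queue:
        node = queue.popleft()
        for child in DOWNSTREAM.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            for subscriber in SUBSCRIBERS.get(child, []):
                notifications.setdefault(subscriber, []).append(child)
    return notifications


if __name__ == "__main__":
    # A failure in the ingest pipeline reaches the subscriber of the downstream table.
    print(notify_downstream("facebook_ingest_pipeline"))
```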
Having some system in place that feels more purpose-built for these kinds of use cases, or rather the lack of that, is a big challenge. A lot of teams' instinct is to pull in the standard stack from the software engineering world and just place that on top of data engineering, and I think that's one of the big traps we see teams getting into, because there are so many nuances in how things are monitored and how alerts are emitted that it's not gonna be a nice fit for the unique challenges that these teams face. Another challenge that we often see is with teams that are just starting out in data quality monitoring or in data observability: helping them get going with some initial analytics about their pipelines or monitoring screens or alerts, helping them get the ball rolling on what KPIs they should actually be looking at. I think that's probably where a lot of the market stands, they don't have a good sense of what data quality measurements they should even be taking. So when you come to them with, like, a totally open library where you say, define your KPIs, we're gonna pull them into the system.
We're gonna help you alert on them. That can be a huge blocker, because a lot of these teams don't even really know which KPIs to begin with. And that's through no fault of their own. It's because a lot of these data organizations are quite new. They're building data products for the first time, and they don't yet have a lot of that knowledge built up or awareness of what data they're actually working with to know where the failures are coming from. What we aim to do as users get started is provide much more out of the box alert definitions, more out of the box anomaly detection, and out of the box data quality logging through templates and other techniques that you can embed directly within your pipelines or run against your data lake. So as soon as you spin up the platform, you'll get some observability.
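As an illustration of what an out-of-the-box check over a logged metric might look like, here is a minimal anomaly-detection sketch in Python that flags a run whose row count deviates far from its recent history. The z-score threshold, the minimum history length, and the metric values are assumed for the example and don't reflect any specific product's defaults.

```python
import statistics


def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest metric value if it is far outside its recent history."""
    if len(history) < 5:
        # Not enough history to judge; treat as normal while the baseline builds up.
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold


if __name__ == "__main__":
    # Row counts logged from the last several pipeline runs (illustrative numbers).
    row_counts = [10_120.0, 9_980.0, 10_050.0, 10_200.0, 9_900.0]
    print(is_anomalous(row_counts, latest=10_100.0))  # False: within normal range
    print(is_anomalous(row_counts, latest=2_300.0))   # True: likely incomplete load
```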
You'll get some data quality monitoring in the first, you know, 10 minutes of using the system. And from there, as you get to know your data and your pipelines better, that'll open the gates for you to bring in more customized logging of the KPIs that are important and the data rules that are more critical for a given use case. But getting the ball rolling on better monitoring, we find, is a challenge for a lot of teams. As you continue to build out the capabilities of Databand and continue to work with these forward looking organizations,
[00:58:45] Unknown:
what are some of the long term trends or impacts that you foresee in the data ecosystem and in how people think about working with data, and how does that factor into the plans that you have for the future of Databand?
[00:58:57] Unknown:
On a macro level, something that I'm really excited about is just the level of tooling that's becoming available and the really strong communities out there that undergird products like Spark and Airflow and dbt and more that we see on the rise. I think this area has some of the most passionate engineers in the world, and it's really exciting to see a lot of that energy. I think that relates to the specialization of roles that we see and the growth of roles like machine learning engineers and analytics engineers and data ops engineers, because strong communities are forming around people with shared interests and activities. So as the role specialization increases, I think those open source communities are going to become stronger, they're gonna become more active, and we're gonna see more of them. It may not be obvious at this point, but we're big fans of open source and open core companies, and we're excited to see more of that permeate the length of the data value chain, even going down to BI applications like Looker and Tableau, which we see now are giving ground to tools like Metabase and Preset. So for us, that trend of moving more towards open, having those open core areas of the product, relates directly to how we're building Databand, because we wanna make sure that the interfaces, the integration points between our system and what engineers are working with, are really open for them to use or to customize. You can understand exactly how metadata is getting reported from your system.
Moving forward, as the level of tooling in this space increases, there are a few development areas that we're really obsessed with. We wanna make it easier and easier to integrate our system, so engineers don't really have to lift a finger. We want you to be able to pull in Databand, integrate it into a pipeline, run it on a data lake or a database that you have, or connect it into your orchestrator, and immediately get metadata and monitoring that's useful for you. Our open source, the open core behind the product, is a big part of that, and our new cloud offering that we're now rolling out is another big aspect of that, so that our users don't have to worry about the DevOps overhead of running another monitoring system if they don't want to. On top of that, as more open source tools are out there and more companies are using new kinds of products, we want more coverage and more integrations with various new services so that we can collect metadata across more parts of the process: streaming systems, more analytic systems, additional orchestrators. A lot of that's gonna be guided by the new clients that we bring onto the system, but we have strong plans to scale up the number of tools that we're connecting into.
And finally, as we see more activity in this space, as more teams are starting to invest in their data monitoring and pipeline observability, we wanna help with more insights into your tech stack and your pipelines. So we're beginning to pack into our system more derived metrics and measurements about your pipeline and data health, because our users are looking for more help in how they should actually be monitoring their pipelines. They really want guidance on the important KPIs they should be tracking, the key indicators of whether pipelines are healthy or not. And there's a lot of thought that we're putting into that that's gonna be instantiated
[01:02:22] Unknown:
in the alerts that we send and the visualizations that we show within our system. Are there any other aspects of the work that you're doing at Databand, the overall space of visibility and data quality for data teams, and how that helps to support this growing level of specialization across data roles that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think something that we
[01:02:45] Unknown:
did see over the time that we've been in the market, and I think it's going in a great direction as of the last year or so, is this balance of research versus production work that's happening within data organizations. Whether you're pursuing more of a mesh style of working, building out a data mesh, whether you have umbrella groups built up that operate at different ends of the business or levels of the stack, or whether you have more specialized roles, figuring out the right balance between the research investments that you're putting into new products and the production investments that you're making to get those products into the field, get them into the hands of customers, get feedback on them, and iterate is something that we really wanna encourage teams to work on. I think a few years back, we saw a huge flood of growth in research teams and big build-outs of data scientists and engineers that were working on, like, building new AI products to get into market. And I think a lot of those teams hit a wall because they either didn't have the data ops roles available, or they didn't have the data engineering capabilities or capacity to support the teams, or maybe they didn't have an actual market fit for the product that they were building. And we saw a little bit of a downsizing of those teams. So I think not treating data like a magic property, but thinking about it more like any other product that you sell, and building an organization along with the normal course of market validation, product market fit, shipping data products into the market in an agile manner, getting feedback, iterating.
That's something that we really wanna encourage organizations to pursue. But definitely over the last year or so, it seems we're seeing just a huge uptick in the amount of productivity
[01:04:44] Unknown:
within these organizations. So I'm hopeful it's moving in the right direction. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. No question. Observability.
[01:05:04] Unknown:
We're working on it for a reason. The first step for a lot of the organizations that we're working with is just getting data from one location, doing some transformations, and getting it to another location. That's what they needed to figure out first. And I think the new category of tools on the orchestration layer, which have become a lot easier to work with and a lot more widespread, have eaten away a lot of that challenge. So we don't see orchestration being a critical investment point for a lot of the teams that we're working with anymore. They have some system in place that they're using which allows them to get the job done. Once you have that motion happening, once data is traveling and the automation is in place, you then need to make sure that the factory is churning out data properly, that those pipes are working as expected. And that's where observability comes in. And I think for the teams that are hitting a wall there, a lot of the time it's more about the observability question, even if they don't have, like, fully massive pipelines built up. So an example would be, you have a data replication project that's pulling data from an on-prem source location and delivering it to a cloud database.
And because there's so much uncertainty over whether the data is being properly moved over into the cloud, the project gets put on hold or drawn out, momentum falls off, and you end up in this awkward, you know, one foot in, one foot out situation. And the core issue there is not that there isn't a tool available to help you process and move the data over. The core issue is the lack of confidence that the data from point A is being properly moved to point B, and that the data at point B is then being used properly in the data products it's supposed to support. So having the observability layer that comes in and gives you that confidence, that's what we feel is the biggest gap today and what we're aiming to solve with Databand.
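A simple way to picture the kind of confidence check described here is a source-to-target reconciliation that compares row counts and a few aggregates between the on-prem source and the cloud copy. The sketch below is hypothetical, using SQLite stand-ins for both databases and made-up table and column names, just to show the shape of the check.

```python
import sqlite3

# Stand-ins for the on-prem source and the cloud target; in practice these
# would be connections to two different databases.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

for conn, rows in [(source, [(1, 10.0), (2, 20.0), (3, 30.0)]),
                   (target, [(1, 10.0), (2, 20.0)])]:  # target is missing a row
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)


def summarize(conn: sqlite3.Connection) -> tuple:
    """Cheap fingerprint of the table: row count, sum, and id range."""
    return conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0), MIN(id), MAX(id) FROM orders"
    ).fetchone()


src_summary, tgt_summary = summarize(source), summarize(target)
if src_summary == tgt_summary:
    print("replication looks consistent:", src_summary)
else:
    print("replication mismatch:", {"source": src_summary, "target": tgt_summary})
```

Even a coarse check like this, run on a schedule, is what turns "we think the data made it over" into the kind of confidence the answer above describes.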
[01:07:06] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Databand and the perspective that you've been able to gain into some of these emerging roles and responsibilities in the data ecosystem. It's definitely a very interesting aspect of the outgrowth of more companies using more data for more things, so it's useful to be able to get a bit of a peek into what's to come for more organizations. I appreciate all of the time and energy that you've put into working with these teams and helping them to be successful.
So thank you again for your time and energy, and I hope you enjoy the rest of your day. Absolutely. Thanks so much, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Josh Benamram: Introduction and Background
Complexity and Standards in Data Pipelines
Roles in Data Teams: Producers and Consumers
Future of Data Roles and Industry Trends
Specialization in Data Roles: Umbrella Groups and Hybrid Roles
Business Impact of Specialized Data Roles
Investment in Data Platforms and Tooling
Impact of Data Mesh on Data Roles
Visibility and Observability in Data Pipelines
Sourcing Information for Data Lifecycle Visibility
Challenges and Lessons in Building Data Observability Tools
Future Trends and Plans for Databand
Balancing Research and Production in Data Organizations
Biggest Gaps in Data Management Tooling