Summary
"Business as usual" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and introducing more specialized roles. In this episode Josh Benamram, CEO and co-founder of Databand, describes the motivations for these emerging roles, how these positions affect the team dynamics, and the types of visibility that they need into the data platform to do their jobs effectively. He also talks about how his experience working with these teams informs his work at Databand. If you are wondering how to apply your talents and interests to working with data then this episode is a must listen.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Josh Benamram about the continued evolution of roles and responsibilities in data teams and their varied requirements for visibility into the data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by discussing the set of roles that you see in a majority of data teams?
- What new roles do you see emerging, and what are the motivating factors?
- Which of the more established positions are fracturing or merging to create these new responsibilities?
- What are the contexts in which you are seeing these role definitions used? (e.g. small teams, large orgs, etc.)
- How do the increased granularity/specialization of responsibilities across data teams change the ways that data and platform architects need to think about technology investment?
- What are the organizational impacts of these new types of data work?
- How do these shifts in role definition change the ways that the individuals in the position interact with the data platform?
- What are the types of questions that practitioners in different roles are asking of the data that they are working with? (e.g. what is the lineage of this asset vs. what is the distribution of values in this column, etc.)
- How can metrics and observability data about pipelines and data systems help to support these various roles?
- What are the different ways of measuring data quality for the needs of these roles?
- How is the work you are doing at Databand informed by these changing needs?
- One of the big challenges caused by data systems is the varying modes of access and interaction across the different stakeholders and activities. How can data platform teams and vendors help to surface useful metrics and information across these various interfaces without forcing users into a new or unfamiliar workflow?
- What are some of the long-term impacts that you foresee in the data ecosystem and ways of interacting with data as a result of the current trend toward more specialized tasks?
- As a vendor working to provide useful context to these practitioners what are some of the most interesting, unexpected, or challenging lessons that you have learned?
- What do you have planned for the future of Databand?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in-app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder.
[00:01:21] Unknown:
Your host is Tobias Macey. And today, I'm interviewing Josh Benamram about the continued evolution of roles and responsibilities in data teams and their varied requirements for visibility into the data stack. So, Josh, can you start by introducing yourself? Well, it's great to be speaking with you today. Tobias, I'm a huge fan of your work and, honestly, just really grateful for the conversations you've led in the community. So thank you. I'm Josh Benamram. I'm CEO at Databand.ai. We are a data pipeline observability company, and we help organizations get good data products to market by providing visibility and easier management of their pipelines and data assets.
[00:02:00] Unknown:
And do you remember how you first got involved in the area of data management? Yeah. I do. So I come from a pretty
[00:02:06] Unknown:
varied background. I actually started in the finance world working in the investment arm of a quant trading firm. It was there that I got first introduced to data science and data engineering, and I've been obsessed ever since. I later worked in a venture capital firm where I focused on data investments. And just prior to Databand, I was a product manager at a data analytics company. So while I've worked in different kinds of teams, the common denominator for me has always been being really close to the data organization, if not working on data products directly within my different organizations.
[00:02:48] Unknown:
Given your varied roles across these different organizations and different ways of interacting with data, I imagine you have a decent perspective on some of the different ways that the platform and the team orientation are structured in different contexts. And also given the fact that you're building a company that hooks into data platforms, I imagine that also gives you some interesting perspectives into how people are interacting with their various data pipelines.
[00:03:18] Unknown:
Yeah. That's exactly right. This is something that I've thought a lot about over the different experiences that I've had in my career, jumping between different kinds of teams. Definitely in my last company before Databand, as a PM building data analytics products, I worked with many, many data teams who were integrating our analytics stack into their organizations. And at Databand, since we're plugging into data platforms, as you mentioned, this is something that we focus on quite a lot and a question that we like wrapping our heads around.
Before jumping into some of those different team distinctions and role distinctions that we see across organizations that we're working with, I'll preface that our company now, Databand, works with a particular kind of organization, and I think that that definitely colors the way that we see the world. We work mainly with companies that are building data products. And by data products, I mean analytics or machine learning that's customer facing. Like, I like using the example of a stock analysis dataset that's sold to investors or a recommendation engine that goes into a customer facing software product.
But even before talking about the different kinds of roles that we often see, having that context is important because there are two things that are really unique about companies that are building data products, we feel. The first one is their pipelines tend to be more complex. I think by virtue of building whatever their unique product is, these data teams typically have a lot of their IP in how they process, aggregate, transform, and work with their data. So, therefore, they're usually starting with more raw data. There's a lot of data that they're working with, like a lot of sources, and they do a lot of processing on this data to make it useful. Otherwise, there would not really be much of a product for them to sell. You know, their customers would go directly to their data sources. So for all those reasons, the kinds of companies that we're working with, their pipelines just tend to be more complex.
Secondly, the companies that we work with, they have really high standards. They treat their data processes really seriously. It's a very bad thing if data gets delivered late or if there are data quality issues in their final outputs. And that's because it's ultimately going in front of a customer that's paying you, so they often have higher standards for what they produce. So when you have a data team like this where there's more complexity in the pipelines, there's higher standards, you tend to get this bigger division of responsibilities to make sure that things run smoothly.
And that creates more defined activities and roles within the data team. So just setting that context for the kinds of roles that I'll talk about, because if you're working with a data product team, it's gonna be quite different than if you're working with a smaller scale analytics unit within a company that's a little more targeted on the kind of data analytics that they're doing. So going back to the actual roles that we might see, I'd separate this first into data producers and data consumers. So the role of producers is to create ready to use data. And among that group, you have the data engineers and the data platform engineers.
The data engineers are usually responsible for what's happening in specific pipelines. So they work generally more closely with analysts and scientists to prepare the actual logic that takes raw data in, cleans it, aggregates it, makes it queryable, etcetera. That might start as JSON coming into S3 and end as a table in a database like Snowflake. The data platform engineers are more responsible for the services that make pipelines run. So they care about the issues that will cut across pipelines, like an Airflow environment going down or system wide resource bottlenecks or meaningful disturbances in upstream data sources that are going to affect a lot of things.
On the other end of this, you have consumers. This might be the analysts who take ready data and prepare analytics and reports from it, or it might be the data scientists who take the data to build their models. So just at the starting point here, we have four typical categories of roles that we're dealing with. You have data platform engineers, data engineers, data analysts, and data scientists. Maybe walking through an example of a typical kind of troubleshooting process there if an issue comes up. So let's say an analyst raises a flag that a table looks out of date. That might be really bad because you have a customer on the other end of that dashboard that expects it to be timely. Maybe that's even stipulated in some SLA that you have with the customer. A data engineer might then jump in to check out the pipeline that delivers the data to that table. They might find that the pipeline never ran, and then maybe they just do a backfill and that resolves the issue there. But maybe there's a deeper problem, a service level problem, like something going on in the broader Airflow environment or an issue in an underlying Spark cluster, or a pipeline that ran completely fine, but there's some fundamental problem in the upstream data source. In that case, platform creates an alert about a service level issue, and then alerts the engineers and the analysts based on the pipelines and datasets that are gonna be impacted. But that'd be a typical kind of division of activities in the troubleshooting scenario that describes some of these different roles.
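To make that triage flow concrete, here is a minimal sketch of the first check a data engineer might run, assuming Airflow 2's stable REST API with basic auth; the host, credentials, and `orders_daily` DAG are hypothetical, and this is an illustration rather than Databand's implementation.

```python
# Minimal triage sketch: did the pipeline ever run, and if not, trigger a new run.
# Assumes Airflow 2's stable REST API; host, credentials, and DAG id are hypothetical.
from typing import Optional

import requests

AIRFLOW_API = "http://airflow.example.com/api/v1"
AUTH = ("viewer", "password")  # assumption: basic auth is enabled on the webserver


def latest_run(dag_id: str) -> Optional[dict]:
    """Return the most recent DAG run, or None if the pipeline never ran."""
    resp = requests.get(
        f"{AIRFLOW_API}/dags/{dag_id}/dagRuns",
        params={"order_by": "-execution_date", "limit": 1},
        auth=AUTH,
    )
    resp.raise_for_status()
    runs = resp.json()["dag_runs"]
    return runs[0] if runs else None


def trigger_run(dag_id: str) -> dict:
    """Kick off a fresh run so the stale table gets rebuilt."""
    resp = requests.post(f"{AIRFLOW_API}/dags/{dag_id}/dagRuns", json={}, auth=AUTH)
    resp.raise_for_status()
    return resp.json()


run = latest_run("orders_daily")
if run is None or run["state"] != "success":
    print("Pipeline missing or failed; triggering a new run")
    trigger_run("orders_daily")
```

For a range of missed dates, the equivalent step would be an `airflow dags backfill` over that window rather than a single triggered run.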
[00:08:52] Unknown:
Given that you are working with these organizations that are more at the forefront of how to work with data and the types of processing and systems that they're using to perform that processing, I'm wondering if you see that as being somewhat of a predictor of where the broader industry is going, where more organizations are going to start to experience this sort of segmentation of roles where I feel like in the sort of early stages of big data, it was all about data science, and everybody just wanted a data scientist because they thought that they were going to be the solution to everything, and they would do everything. And then most everyone realized, oh, wait. We actually need data engineers to work with the data scientists so that the data scientists have clean and manageable data to be able to use as a starting point.
And business intelligence as a practice used to be more of sort of an IT concern and now has become a concern of analytics engineers, where I see there have been a number of evolutions of data roles just in the past decade or so. And I'm wondering if you think that some of the further segmentation that you're seeing is going to remain sort of constrained to these organizations that are working on building data products, or do you think that it is a predictor of where the larger industry is going to head in the next sort of 5 to 10 years? First of all, I think more and more companies
[00:10:12] Unknown:
will be building data products. I think more organizations will follow this trend of monetizing their data assets. There's a couple reasons why I really believe that's the case. First of all, it's really lucrative. Companies can make a lot of money on data products. It's a highly scalable resource. If you have a good data asset and you find a product market fit for it and there's a wide audience for that, it might be the case, and often is, that you don't really have to do too much to that data to sell it to more and more end customers. So in the same way that software scales really effectively, a data product can scale really effectively and can be really lucrative.
The other reason why I think more companies will be moving into data product development is because there are a lot of organizations out there that just have really valuable data that's kept in some internal data warehouse or being created by a business unit somewhere that is able to be monetized in some way. And I think this is particularly true of technology companies that are already selling software, and somewhere in that software, some really interesting data is being created. And I saw a recent headline about Atlassian, for example, purchasing a business intelligence company. I think it was Chartio.
But to me, this was a great crystallization of that trend in the industry. When you think about Atlassian and the reach that their product has across organizations, and the kind of data that they're producing within their software about how companies are completing their tasks and development and the productivity of their engineers, there's so much interesting information to be used within data products. It's not super surprising to see them invest further in bringing that kind of analytics within the system so that they can better monetize it. I think that's gonna be a consistent parallel across many companies in the technology space and also beyond. So more companies are gonna be building data products. Once you're in that data product world, then you get into the complexity. You get into the higher standards. And when those kinds of needs are present, then you want better division of responsibilities to handle things in a more effective way and give different stakeholders on the team a better idea of who to go to when there are problems.
Outside of even companies that are working on data products, even if you just have an internal analytical system that you're relying on, at the point that you really start relying on that system, at the point that you start depending on it to drive mission critical business processes, you better have assurance that the data that's ending up in those dashboards or in those models is accurate. So you may not have an end customer that's calling you and saying, what's going on within this dataset or what's going on within this dashboard? But that can be just as scary for a team if it's coming from an executive that said, hey, I just, you know, plowed a big investment into some business decision because of a KPI that I saw. And I'm looking again. It looks like it was totally wrong. So I think different factors are gonna catalyze the complexity that we see in pipelines and the level of standards that we see being put on data teams, and that's what's going to create this new division of responsibilities and roles.
[00:13:35] Unknown:
And in terms of the roles that you see emerging, you set the baseline of most organizations have at least some distribution of data scientist, data analyst, data engineer, and platform engineer. Those are fairly well established roles. Most people at least understand what the responsibilities are across those different divisions. And as you are working with these more advanced data organizations, what are some of the additional specializations that you see emerging, and what are the established positions that you're seeing those roles kind of being broken out of or merged with to be able to support these more complex data products?
[00:14:18] Unknown:
So we see further specialization and outgrowth from these roles happening in two ways. The first way would be through the formation of umbrella groups. The second way would be through the formation of hybrid roles. In umbrella groups, you have different organizations that have repeating roles between them, but focus on different parts of the end to end business process or sit at different components of the organization. So, for example, you might have an upstream platform team as one umbrella group and a downstream analytics team as another umbrella group. And in each group, you have somewhat repeating roles, but they focus on different levels of the data process end to end. So in the platform team, you might have all the roles that we just discussed. You might have data engineers, data scientists, data analysts that produce the data that different downstream business units will use. So you might have a data scientist in this group, but, really, they're a data platform scientist. And that's a different kind of set of responsibilities and a different kind of day to day work than a data scientist sitting in another umbrella unit. So an example of that is the data platform scientists, they might be aware of different downstream consumers. They may be responsible for bringing the raw data that is coming into platform closer to the form that those downstream business unit data scientists and analysts are going to use or are able to use. For example, scientists in the platform team might build a predictive KPI into a table that a downstream team uses as one of their data sources. I mentioned before, I like using this example of a trading firm that's looking at analyzing stock market information. So let's say your company is pulling in information about the stock market, and the data product that you're building is about whether stocks are gonna go up or down, you know, trying to predict where GameStop is headed.
So this trading firm may have a platform team that's pulling in data from a bunch of different exchanges and different brokerages. And then in the platform team, you have a data scientist that's creating a predicted GameStop price. And that predicted GameStop price is gonna get dropped into a table in Snowflake or your data lake that some other business units downstream are gonna pick up and then use as part of the analytics that they're building or models that they're developing. So going down into that next umbrella group, the data analytics team, another group downstream, this team might likewise contain some assortment of platform people. They might have data engineers. They will have actual data scientists or analysts in most cases. But they'll be working on more discrete projects. They'll be closer to the end customer, and they might pull in several data sources beyond what platform provides them. So in this kind of umbrella group organization, where you have different repeating business units at different levels of the organization, different repeating roles, this is a nice way of organizing things for a lot of companies because it provides units a level of autonomy that allows them to innovate quickly and get products to market faster, and it centralizes shared requirements, which allows for better focus on what makes these units different in their main mission.
The other version of this, outside of the umbrella groups that we see, would be having hybrid roles. So you might have a data scientist role that opens up into additional roles that help close the gap with data engineering. So a data scientist might open up to two roles, one being a core data scientist who focuses on running experiments and producing models. And then you might have another set of responsibilities for the machine learning engineer who manages the automated ML pipelines for training, deployment, retraining, and really helps bridge the gap between the data scientists and the engineering. So the data scientists might open up to core data scientists and then an ML engineer who focuses mostly on automation.
Same thing for data analysts. You might have a single data analyst, and as things get more complex, that opens up to a core data analyst who focuses mostly on building analytics and defining KPIs, and then an analytics engineer who manages automation around analytics pipelines and prebuilt aggregations that need to get done. Platform opens up as well. So we'll often see data platform open up to platform engineers, who are responsible usually for setting the structure and design principles and templating for how people build their pipelines, and a DataOps engineer, who manages the services and covers the needs that people usually keep poking DevOps to help with. So that's how we often see data platform opening up, too.
[00:19:21] Unknown:
As you are working with these organizations that have these varying types of specialization, whether it's these hybrid roles that are specializations within the broad category of analytics or data engineering or data science or what have you versus these umbrella organizations, what do you see as being the broad business impact in terms of how that affects the capabilities of the organization to build and release high quality data products? And just some of the additional considerations that they need to think about from an organizational and product perspective of delivering data as a product versus just delivering software or physical widgets as a product?
[00:20:02] Unknown:
In general, having this kind of division of responsibility as the company or the data organization scales is going to help these teams be more productive because people will be able to focus more on their core set of responsibilities, and they'll be able to pull in different stakeholders in the team to solve problems based on responsibilities that they know those stakeholders have. So if I'm working in a more amorphous team where there isn't such a good distinction between what a data engineer does and what a data scientist does, and I'm working with a dozen folks on the team, when a process breaks down or an environment breaks down, knowing exactly who to pull in can be a really tricky thing. So when you have a better division of responsibilities, just like in software organizations where you have a good sense of who does full stack, who does front end, who does back end, who's focused on DevOps.
If there's a DevOps issue, you know who to go to. Same thing in data. If there's a DataOps issue on the service level, you'll know better who to go to. A couple of big factors that we think about on the organizational impact of this kind of separation, when it happens, are the level of autonomy for different teams or different units versus the level of cohesion in the end to end team and the overall organization. So as your different positions become more specialized, the question is how you ensure that stakeholders have the freedom to work independently and move quickly while at the same time being connected enough that people are working together in a clear direction.
I think this is one of the main challenges that a lot of the companies that we see really face. And I think it relates to the kinds of investments that people make in their technology systems and their technology platforms. But here's where it's really important to have good levels of interoperability and the ability to create sources of truth at the highest level of the tech stack. So with Databand, we'll work with teams where, before using us, platform just has no idea how the pipelines of any engineers work. And we had a case where an engineer left the team of one of our client organizations, and the client basically had to recreate a mostly working process from scratch, which took months for them to do. And they had to do it because they didn't wanna go to production without having good visibility or control over their pipeline.
So that would be an example where you had clear division of roles, a clear specialization of responsibilities, but not enough visibility or observability between the different activities in the team so that platform could introspect those processes and get good checkpoints and logging from the pipeline covering the entire journey of the data, from the data source to the data lake, to the warehouse, to the consumer. That ability to observe is really crucial and why we're working on Databand.
[00:23:07] Unknown:
In terms of the team dynamics and the ways to structure the organizational aspects of these more specialized roles, what are some of the strategies that you've seen be particularly effective in terms of being able to maintain effective communications across these different boundaries of roles and responsibilities, and how they can all work together to be able to deliver the end product to the customer?
[00:23:34] Unknown:
I think this, for us, has a lot to do with the mixture of having the right processes set up within the organization as well as having the right tools and technologies in place to help build that cohesion and maintain the right level of autonomy. So in terms of processes, an example of this would be having a system in place where there's clear ownership over different pieces of the technology stack and different pipelines that the team is building. So having a clear owner in the data engineering category, which engineers own which specific pipelines, which data targets or data assets those pipelines write to, having that built up in the organizational knowledge of the data engineering group and using that knowledge to be able to quickly isolate where failures might be cascading down to the consumer level. So if a service goes down, if platform knows that there's something failing in an environment, in Airflow, or at the Spark level, platform knowing who owns the pipelines that are going to be affected and being able to get the information and the news out to those stakeholders, who can then distribute the news down to whoever might be consuming the data. So having a really clear set of ownership across these different processes is paramount to being able to run the organization smoothly.
In terms of tools and technologies, having a set of capabilities that allow you to distill that kind of knowledge into a product layer that people across the organization are able to access and, without really even asking anybody, get information about different services and get information about different pipelines, this is really critical. So a lot of the tension that we see in these groups, which we think is unique to data organizations, is this desire for using best of breed tools versus building more sources of truth or centers of gravity within their teams. And finding that right balance between being able to use those different pieces of technology to run your pipelines and get data delivered versus having the sources of truth where you can see all the information that's being produced from them and you can quickly go in and isolate issues, that's gonna be
[00:26:06] Unknown:
really important. Because of the need to be able to have this visibility and have these tools that establish clear communication across the different boundaries and across the different layers of the data stack and stages of delivery, how does the sort of specialization of these roles and the sophistication of the operations that they're performing impact the way that the organization and the teams think about investment in the data platform and in the technologies that they're using to be able to deliver these systems, both in terms of just making sure that they're operational, but also in terms of maybe preventing tool sprawl so that you don't have everybody speaking a different language and end up in sort of a Tower of Babel situation?
[00:26:47] Unknown:
Yeah. So this balance between best of breed tools and building those sources of truth is something that we think about a lot. And on one level, you wanna be able to leverage the right technologies for different stakeholders so that you can give people the tools that they need to work efficiently. An example of that would be allowing data platform teams to build their data lake in S3, where engineers are mostly working with Spark, and using those tools because they need to optimize for storage space. On the platform side, you really wanna optimize towards getting as much information into your lake as possible, as cheaply as possible, and being able to support a lot of different sources, so being able to take data files of any type. So allowing platform to work at the S3 level, in a really flexible and open and essentially free data lake, and then leveraging a tool like Spark so that they can easily process a lot of that data.
But forcing an analyst team or even an analytics engineer to operate at that level might be pretty challenging because you're not gonna typically find a lot of analysts or analytics engineers that are super, super familiar with Spark. They'll tend to be more comfortable in tools like SQL and working more at the database layer. So allowing the analyst to use tools like Snowflake as a more aggregated warehouse and an easy query engine on top of the lake is just a good example of being able to separate those best of breed technologies. On top of that, on another level, these processes are so interdependent. Right? That data that's accumulating in S3 is going to eventually make it into Snowflake, where analysts are gonna begin querying it. You want the ability to control and observe what's happening across these systems.
So your orchestrator and your observability tool are good examples of that. If you have engineers working on S3 with Spark and analysts working in Snowflake, your pipeline orchestrator needs to be able to really easily run jobs across both of these systems. And Airflow we see a lot as a common tool for achieving that bridge across different layers. We can also get meta here, where you have multiple orchestrators in your orchestrator, for best of breed within that layer even. We see a lot of Airflow running Spark on S3 and then Airflow even kicking off dbt jobs on Snowflake, because dbt as an orchestrator might be a lot easier for your analytics engineers and your data analysts to work with on the Snowflake side, whereas Airflow might feel a lot more comfortable for your platform engineers and your data engineers that are working upstream, and that's pretty common.
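As a rough illustration of that bridge, here is a minimal Airflow DAG sketch that runs a Spark job over raw data in S3 and then kicks off a dbt build against Snowflake; the application path, dbt project location, and connection IDs are assumptions, and a real pipeline would add retries, sensors, and alerting.

```python
# Sketch of one orchestrator spanning both layers: Spark on the lake, dbt on the warehouse.
# Paths, connection IDs, and the dbt target are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="raw_to_warehouse",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Heavy lifting on the lake: a Spark job that cleans raw JSON landed in S3.
    clean_raw_events = SparkSubmitOperator(
        task_id="clean_raw_events",
        application="/opt/jobs/clean_events.py",  # hypothetical Spark application
        conn_id="spark_default",
    )

    # Analyst-friendly layer: dbt rebuilds the Snowflake models downstream.
    build_snowflake_models = BashOperator(
        task_id="build_snowflake_models",
        bash_command="dbt run --project-dir /opt/dbt --target prod",
    )

    clean_raw_events >> build_snowflake_models
```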
Likewise, you may want your observability tool to be able to capture that end to end process so that you can catch issues as soon as they occur upstream and can broadcast those issues to those impacted downstream. So if you have a missed data delivery in an S3 bucket that data engineers own and platform manages, what pipeline is that going to affect? What's gonna be delayed by that? And how's that going to affect what ultimately becomes a table within Snowflake that's gonna get fed into an analytics report or another data product that's being prepared by analysts? That becomes really important. If you have umbrella organizations, it's gonna be a lot easier to build these kinds of best of breed technologies. It's usually gonna be harder to create single sources of truth. So if you have that big distinction, where you have data platform, engineers, analysts, scientists working at one higher end of the organization and then similar groups working in different business units, it's gonna be a lot easier to have those different teams use whatever tools that they feel they're most comfortable with. It's gonna be harder to create cohesion and a clear remediation path when an issue comes up that cascades across those units.
On the other hand, if you have a single highly specialized team, where in a single group you've got all those roles, you have the hybrid roles: the DataOps engineer, the data platform engineer, the ML engineer, an ETL engineer, a data scientist, a data analyst, an analytics engineer, all those different roles. Then, generally, it's gonna be easier to build sources of truth, but it'll be harder to achieve the best of breed technologies at different layers because everybody's so interdependent.
[00:31:24] Unknown:
Particularly in this juxtaposition of umbrella orgs versus hybrid roles, what do you see as the impact of the current trend of data mesh as sort of a growing way of thinking about the way to build out different data products internally within an organization and how to combine them into maybe downstream data products that other people are going to consume, and just how to build useful interfaces and compartmentalization of these different stages of responsibility?
[00:31:56] Unknown:
There's a lot of different definitions
[00:31:59] Unknown:
for terms like data mesh that we see, so I'd love to hear how you define that. And then I can help target how I would look at it and maybe how we would look at it. Yeah. It's definitely starting to become one of these buzzwords that people throw around to try and make their product or their team or whatever they're doing sound cool and interesting. But the way that I think about it is sort of back to the original article that I read by Zhamak Dehghani, who I had on the show a little while ago to talk about sort of her perspective on it. But, basically, instead of having one centralized team that does all of the data work for the entire organization, you have maybe an enabling platform team that provides all of the systems necessary to do self-service access for each of the different business units to then treat all of the data internal to their concern, whether it's their application or their business responsibility, and then provide that, you know, via an interface to the rest of the organization to be able to consume as a packaged product so that it's already been cleaned and easy to work with. You don't have to try and, you know, perform your own analysis to understand, you know, what are the standardized metrics, because it's already delivered via this API or via this, you know, prepackaged product that you can consume just as a consumable data asset.
[00:33:16] Unknown:
So this is very much in line with the umbrella organizations that we see forming in the companies that we're working with. So it's similar to the mesh or the hub and spoke model that we see in big companies. I think the important thing is the interface between that, like, ready to consume data product that is being produced by the more centralized teams, the data platform organizations, and making sure that the products they produce have the right level of, essentially, certification around them so that the end business units and these different consuming teams understand what that data consists of, how it's changing over time, what the critical failure points may be if they begin taking that data and working it into a product that they're producing.
Having that kind of certification lineage, that level of tagging on the data asset, becomes really, really critical because then they begin depending on that as another data source that gets fed into the products that they are producing. So they may be relying on that input source from platform as well as several other sources, and having a good governing stamp that says it's clear how to leverage this and when it might fail and who to talk to if the data is delayed, that becomes really critical. What we aim to deliver, what we aim to support through our application, is the ability to cascade and understand the lineage of not just data that's flowing from one business unit to the next, but the cascade of notifications and alerting that needs to be promoted if a failure does happen or if a bad data event transpires.
So if the platform team is regularly producing this asset that an end business unit is using, and for whatever reason the pipeline that is producing that dataset fails, being able to distribute and announce and broadcast those kinds of notifications across the organization so that different consumers can subscribe to them, essentially, that becomes really critical. Going back to our investment example, if the central team is producing that KPI that says GameStop shares are gonna go up or gonna go down, and for whatever reason Nasdaq never uploaded the recent trading information that platform is generally leveraging, and that causes a pipeline to stall or delay and miss its SLA, being able to distribute that notification through the organization to the end business unit that's leveraging that as one of their critical data inputs, that's something that becomes really important and something that we definitely wanna support with our observability system.
[00:36:11] Unknown:
Digging more into the visibility aspect of these different layers of the data systems and the ways that different users across the life cycle of this information are utilizing the data platform. I'm wondering if we can start with just talking about what are the types of questions that each of these different roles and responsibilities are asking of the data platform, and what are some of the ways of being able to reliably surface and derive the answers to the questions that they're asking?
[00:36:43] Unknown:
Great question. So, typically, the folks working on data platform are going to care more about SLAs and catastrophic events that are going to affect multiple consuming teams or multiple data consumers. And examples of this would be issues that will create problems across several pipelines. Let's say you're pulling information from one of the exchanges or you're pulling Facebook information, and Facebook changes their API, which is not uncommon. That might cause a cascade of problems that will domino down into several different business units if you're creating a data platform asset that multiple different teams are leveraging.
So focusing in on those key points of failure that are going to create that web of issues across multiple processes, that's really critical for the platform teams that we work with. On top of that, they care about the performance issues that slow down the delivery of data. So SLAs are generally more important for the platform end of things. We typically see that they're less concerned with what's happening on the actual inside of a dataset, like, is this data accurate in a true sense? And typically they're more concerned with the SLA they have for delivering the data. So is the data arriving in the target location on time? Is it more or less intact and complete, as opposed to what's actually being said within the data itself in that table or that file that's being made available? So SLAs, catastrophic events, performance information, this typically is more of the concern of data platform.
Data analysts or the analytics engineers, the folks that are on the other end of this spectrum, tend to care more about the end results. So this would be like, is the right data in the expected table at the right time? Is it an accurate data source? Does it tell us what we need to know about the product we're creating? Are we getting the best data for that question? And, also, really importantly, how have significant KPIs changed? Like, tell me immediately as soon as you detect in the pipeline that there's, you know, a big share movement in GameStop, if that's where we're building our analytics products. So, typically, we'll see platform a lot more concerned about the service level issues, the delays of data that's coming into the organization, performance problems that might slow things down, how Spark jobs are doing that might cause delays.
Analysts on the consuming end will be a lot more concerned about what's happening in the actual datasets themselves and that data quality information, at maybe even a record by record level. Everyone across this entire scope cares about lineage in the sense that, as a broader organization, you wanna understand the source of problems upstream. You wanna know their impact downstream. Analysts tend to care less about the internals of pipelines, and engineers generally are less interested in the inner workings of a Tableau or a Looker dashboard. But both of these groups wanna be able to trace the impact of issues down and the cause of issues up. There is an interesting overlap in some metrics like data distributions, where we see possibly two teams that want the same exact information, but for two totally different reasons. So an example of this would be, like, data skew. If you're a data scientist, you might be obviously concerned about that because you want a model that's trained on the data that you expect to see in production. And if your data is really skewed, that model is not going to be as performant as you want it to be. So if we're, you know, building that model to predict where GameStop price is going, if that model's trained on data from 2020, it's gonna be pretty off at this point.
Generally, we see data analysts or scientists more concerned about skew for the accuracy of their products. On platform, you might take that same metric like skew, the same metadata like skew, and they may be much more concerned about it just to the extent that it impacts the performance of their jobs, which causes late deliveries in data and goes back to that SLA being missed. So an example there would be a skew in a dataset that causes a slowdown in a Spark job because it's not properly picking up the data and partitioning it across the cluster. So in those two cases, you may have the same exact piece of metadata that people are looking for, the distribution of a dataset, but it's gonna be used in quite different ways and for quite different means.
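For illustration, that kind of skew check can be a few lines of PySpark; this is a minimal sketch assuming a hypothetical `trades` dataset keyed by ticker symbol, with an arbitrary threshold.

```python
# Sketch: measure how lopsided the key distribution is before it stalls the Spark job.
# The dataset path, key column, and threshold are all hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()
trades = spark.read.parquet("s3://example-lake/trades/")

# Count records per key; a handful of very heavy keys is what slows the partitions down.
counts = trades.groupBy("ticker").count()
stats = counts.agg(
    F.max("count").alias("max_per_key"),
    F.avg("count").alias("avg_per_key"),
).first()

skew_ratio = stats["max_per_key"] / stats["avg_per_key"]
if skew_ratio > 10:  # arbitrary threshold for illustration
    print(f"Heavy skew detected (ratio {skew_ratio:.1f}); consider salting the join key")
```

The same per-key counts serve the analyst as a distribution metric and the platform engineer as a performance early warning, which is the overlap described above.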
Generally, also in platform, we're dealing with much more of this data, so it becomes natural to do more sampling on top of it and not try to get a 100% record by record snapshot of everything that exists within a file or a partition. And on the analytics end of things closer to the data product, as you get closer and closer to that dashboard or that model, the record by record information becomes that much more important. And having good expectations built up at that point becomes more critical.
[00:42:01] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30 day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. Another interesting aspect of this visibility factor in building data products and managing data systems is similar to the same problem with managing system metrics and performance metrics in software applications, where the system that is responsible for being able to store and alert on the data that is used to manage these systems has to have a higher uptime than the systems that you're alerting on. And so there's sort of this interplay between how much you invest in building the observability stack versus how much you invest in actually building the product that you care more about, you know, what you're trying to get the visibility on. And I'm wondering if you have any thoughts on sort of some of the impacts or some of the ways that you're approaching the reliability of the systems that you're building at Databand to be able to ensure that you're always going to catch these critical issues and alert in a timely fashion the end users of your product who are trying to build their own products based on the data that they're sending to you. That's a funny point. So first of all,
[00:43:58] Unknown:
we feel really strongly that observability is crucial. And once you have these pipelines that are moving data from point a to point b, once that's running, you immediately fall into the trap of not knowing what the data actually looks like that's getting moved over. And we do encourage teams to take orchestration and observability hand in hand. It's like, as soon as you're building out your pipelines, start building in the logging, start building in the metrics and the metadata tracking that helps you identify when failures come up and trace the root cause really quickly.
We come at this with a design principle of isolate the failures and make sure that there's clear fault tolerance and good separation between what's doing your observing and what's doing your running, what's doing your orchestration. So Airflow itself, I'll pick on a little bit because we're big fans of Airflow, and we're quite invested in integrations there. So I'll feel free to pick on that a little. We see a lot of teams that rely heavily on Airflow, obviously, for running their pipelines, but also for monitoring their pipelines. And Airflow does a good job of picking up just state information and status information and durations and kind of the basics of performance metrics.
But, first of all, it'll not go super deep into what's happening in the actual data flows, and that really requires a separate kind of solution to be brought in. But on top of that, it's a good design principle to have a separation between what is alerting on or monitoring the system that's running your processes versus the system that's running your process itself. So if Airflow goes down and you're using Airflow to monitor Airflow, then all of your monitoring has just gone down. Having some external system, whether it's Databand or even just a Grafana dashboard, that's monitoring that, alerting on it, and has some outside perspective looking in is a really good practice that we wanna help drive within the companies that we're working with and the broader ecosystem.
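As a simple illustration of that principle, here is a minimal sketch of an external check against Airflow's built-in `/health` endpoint; the host, the webhook URL, and the idea of running it from an outside scheduler (cron, a separate monitor, or similar) are assumptions for the example.

```python
# Sketch: monitor Airflow from the outside so the monitor doesn't die with the thing it watches.
# Host and webhook URL are hypothetical; run this from somewhere other than Airflow itself.
import requests

AIRFLOW_HEALTH_URL = "http://airflow.example.com/health"
ALERT_WEBHOOK = "https://hooks.example.com/oncall"  # hypothetical paging webhook


def check_airflow_health() -> None:
    try:
        status = requests.get(AIRFLOW_HEALTH_URL, timeout=10).json()
        scheduler_ok = status["scheduler"]["status"] == "healthy"
        metadb_ok = status["metadatabase"]["status"] == "healthy"
        if scheduler_ok and metadb_ok:
            return
        message = f"Airflow degraded: {status}"
    except requests.RequestException as exc:
        # If the webserver itself is down, the external monitor still fires.
        message = f"Airflow unreachable: {exc}"
    requests.post(ALERT_WEBHOOK, json={"text": message}, timeout=10)


check_airflow_health()
```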
In terms of how we try to make it as easy as possible to make sure that we're accurately capturing the information from these different systems, what we aim to do is make sure that the integrations that we focus on are really comprehensive and gather up a lot of data from the systems that we're tying into. So when we integrate with a company that's using Airflow, we aim to look at a lot of the data that's being captured in Airflow and create some redundancy of that in our system so that you have a backup of the key performance metrics that may be important.
And on top of that, we'll pull into Databand, we'll pull into our product, additional information from your actual tasks themselves, from the data quality information, from the Spark jobs that you're running, from the Snowflake queries. And we can provide that meta layer on top which helps you understand, if you have a slowdown in Airflow, is that coming from some issue happening in the Spark cluster or some issue happening in a Snowflake query or something that's coming from the data itself? And we wanna help draw that web between these different ends of the process. Last thing I'll add there is within our application, we also aim to do higher level analysis and higher level trending and comparisons of this metadata that we're pulling out. So even for companies that have really simple pipelines that are just running everything within a single Airflow environment, it can be tricky to know where to focus your attention when, just in the normal course of doing business, your pipelines are gonna fail. Pipelines are prone to failure. They're really noisy, and it's usually not the case that you need to know every time that a pipeline goes down. And if you have one-to-one alerting set up that gets triggered every time a DAG fails, your engineering team is really quickly gonna get inundated with alerts.
And we wanna help these organizations really filter through that noise and separate out the signal and find the needle in the haystack of where you need to focus your attention. Some of the ways that we do that is by looking at deeper factors within the actual pipeline that help us identify if there's a problem which goes beyond just a simple restart or goes beyond just a simple backfill and may require, actually, more attention. Generally, that's gonna be stuff that happens on the data layer. That's gonna be stuff that, like, relates to completeness of your dataset, but it can also be factors from your orchestrator that we help collect up by looking at the deltas between restarts and failures. You know? For example, if a pipeline fails for an extended period of time, restarts a few times, continues to fail, goes through that back and forth a lot, we wanna really draw your attention to that kind of process compared to one that fails once, snaps back on, and is now running okay. And another interesting
[00:49:05] Unknown:
aspect of visibility and observability and alerting in data pipeline contexts is, one, being able to source the information from all of the different points where it's useful to pull from. So there are, you know, some products and some companies that might focus just on the data warehouse as the focal point, where all the alerting is going to happen because this is where all the data ends up and where it's all being pulled from, versus, you know, I wanna be able to get a cross-cutting view of all of the data life cycle from the very first point of contact where I pull it in from a source all the way through to where I'm sending it out as a machine learning model. And, you know, there are varying degrees in between. And so I'm curious to understand sort of what your perspective is on how to effectively source the useful information for building these alerts and visibility into the data life cycle. And then also, given the different ways that roles in data teams and across data organizations interact with the data platform, how do you surface that information to them at an appropriate time and location without forcing everybody to use the same interface or buy into one way of working with the data system?
[00:50:21] Unknown:
Each stakeholder is definitely going to be interested in a different resolution and ways of monitoring their process. So platform and engineering, we find that they need more gates throughout their end to end process: how data is coming in from the source, how it's being processed within the lake, how it's then being moved to the warehouse. They tend to be interested in higher level metrics, less concerned with record by record changes, and they tend to be more interested in information on the boundaries, like, are records coming in within the expected thresholds?
Is data more or less complete? And these teams we find are generally easier to cover with more kinds of generic profiling metrics, so, like, sizes, schemas, type changes, and the problems that will lead to slowdowns or failures in the pipelines, more obvious corruptions in the data. So what we aim to do within the platform and engineering side of things is provide that signal as soon as possible in the process. So as soon as we detect some issues, some significant change in the data, we wanna capture that at the point where the data is actually coming in. So if our Facebook API is changing and that API is dropping data into an S3 location as the first part of our logical pipeline, we want to be watching that S3 location. We wanna be watching that key there, and we wanna flag when we see a change in schema or a type change coming in. From there, we wanna triage the alerts that are going to be most impactful in helping you focus on the datasets that are most important. So if our Facebook pipeline there that's pulling from that marketing information is leading to a dataset that we know not too many consumers are using, which we can see by looking at different aspects of lineage in the system or looking at how many reads are happening within that dataset, then that should be an alert that's on a lower severity level, that's more quiet. Maybe that just gets sent out to some shared Slack channel that people use as, like, an event stream, relative to a high severity critical level alert that gets blasted on all channels and, you know, is waking people up through PagerDuty in the middle of the night. That kind of alert is gonna get triggered when you have a pipeline that's pulling in mission critical data that you know has a clear SLA around it. It needs to be delivered every morning at 6 AM before business wakes up, and it's being leveraged across many different consuming teams in your organization.
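As a rough sketch of that first checkpoint, here is one way a schema or type change could be flagged at the point where a file lands in S3; the bucket, key, expected schema, and JSON-lines format are all assumptions for the example.

```python
# Sketch: flag a schema or type change as soon as a new file lands in the lake.
# Bucket, key, and expected schema are hypothetical; assumes JSON-lines files.
import json

import boto3

s3 = boto3.client("s3")

EXPECTED_SCHEMA = {"campaign_id": int, "spend": float, "clicks": int}  # assumption


def landed_schema(bucket: str, key: str) -> dict:
    """Read the first record of a newly landed file and infer field types."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    first_record = json.loads(next(body.iter_lines()))
    return {field: type(value) for field, value in first_record.items()}


def schema_drift(bucket: str, key: str) -> list:
    """Return fields whose presence or type no longer matches expectations."""
    actual = landed_schema(bucket, key)
    return [
        field
        for field in set(EXPECTED_SCHEMA) | set(actual)
        if EXPECTED_SCHEMA.get(field) is not actual.get(field)
    ]


drift = schema_drift("example-lake", "raw/facebook/2021-04-01.json")
if drift:
    print(f"Schema change detected in landed file: {drift}")
```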
That kind of alert, we wanna help triage up, surface to the front of the stack, and blast to more channels. So the kinds of techniques that we're using there would be leveraging different alert targets, different alert destinations based on the kind of notification that you wanna put out or based on the severity of the issue. One might go to Slack, one might go to PagerDuty. Another technique that we're using is looking at the patterns of behavior around the alerts that you're surfacing. Are you quieting these alerts a lot of the time? Are you resolving them really quickly? Are you letting them sit there? Are you acting on the pipelines after an alert has been fired? We wanna be able to pull in that kind of feedback loop from our users and feed it back into the alert definitions themselves. So if we see that there's an alert that gets fired every day because a pipeline just tends to fail consistently before it eventually succeeds, and we see someone coming in and immediately acknowledging that alert every time it gets surfaced in the system, that might be one that we throttle down to a lower severity level and stop sending to PagerDuty, sending it to Slack instead. So being able to throttle the severity of alerts and the noise that we're creating based on the impact that it's gonna have is really critical.
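Below is a small, hypothetical sketch of what severity-based routing and acknowledgment-driven throttling could look like in Python. The routing table, severity names, and the "several quick acknowledgments in a row" rule are invented for illustration and do not describe Databand's actual logic.

```python
from collections import defaultdict
from dataclasses import dataclass, field

# Illustrative routing table: which destinations each severity fans out to.
ROUTES = {
    "critical": ["pagerduty", "slack#data-alerts"],
    "warning": ["slack#data-alerts"],
    "info": ["slack#data-events"],
}


@dataclass
class AlertRouter:
    # Assumed threshold: consecutive "immediately acknowledged" firings before downgrading.
    quick_ack_threshold: int = 3
    quick_acks: dict = field(default_factory=lambda: defaultdict(int))
    severity: dict = field(default_factory=dict)

    def fire(self, alert_id: str, default_severity: str) -> list[str]:
        """Send the alert to the destinations for its current severity."""
        level = self.severity.get(alert_id, default_severity)
        destinations = ROUTES[level]
        print(f"{alert_id} [{level}] -> {destinations}")
        return destinations

    def acknowledge(self, alert_id: str, seconds_to_ack: float) -> None:
        # If users keep silencing an alert within a minute, treat it as noise
        # and throttle it down one severity level.
        if seconds_to_ack < 60:
            self.quick_acks[alert_id] += 1
        else:
            self.quick_acks[alert_id] = 0

        if self.quick_acks[alert_id] >= self.quick_ack_threshold:
            current = self.severity.get(alert_id, "critical")
            self.severity[alert_id] = {"critical": "warning", "warning": "info"}.get(current, "info")


if __name__ == "__main__":
    router = AlertRouter()
    for _ in range(4):
        router.fire("facebook_ingest_failed", "critical")
        router.acknowledge("facebook_ingest_failed", seconds_to_ack=10)
```

In this toy run the fourth firing drops from PagerDuty to Slack only, mirroring the throttling behavior described above.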
From the consumer perspective, it works just the other way. A lot of our consumers are more interested in subscribing to information about events. They have their particular data assets that they really care about. They have their table in Snowflake, which they always wanna read from, and they wanna subscribe to it. They wanna know what events are happening upstream of it that might be impacting that table. You know, if there's a pipeline that's producing the data every day, tell me when there looks to be a big failure coming in that pipeline. Again, they usually care less about the discrete internals, the different tasks that are happening across a process, but they do wanna know if it looks like things are trending in a scary direction for datasets that they really depend on. As somebody who is building a product to help
[00:55:16] Unknown:
provide visibility and confidence to people who are building these different data products, and for this emerging variety of roles, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:55:29] Unknown:
I mentioned a couple of ways that Databand is trying to help surface the most meaningful alerts. I think one of the big challenges that we see is that there are teams that are just working with so much information, so many pipelines. They may have some alerting set up already using more conventional monitoring tools. The challenge is being able to separate the signal from the noise in these processes and do it in a way that respects the nuances of how data pipelines operate: not triggering notifications every time there's a failure, understanding what a restart is, understanding that an alert associated with some pipeline upstream is going to be relevant to a consumer downstream.
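To make that last point concrete, here is a rough sketch of propagating an upstream pipeline failure to downstream consumers over a small lineage graph. The graph structure, dataset names, and subscription mapping are made-up examples of the general idea, not a description of how any particular tool stores lineage.

```python
from collections import deque

# Hypothetical lineage: each node maps to the datasets that depend on it.
DOWNSTREAM = {
    "facebook_ingest_pipeline": ["raw.marketing_events"],
    "raw.marketing_events": ["analytics.campaign_spend"],
    "analytics.campaign_spend": [],
}

# Hypothetical subscriptions: consumers who asked to be notified about a dataset.
SUBSCRIBERS = {"analytics.campaign_spend": ["analytics-team@example.com"]}


def notify_downstream(failed_node: str) -> dict[str, list[str]]:
    """Walk the lineage graph from the failed node and collect who to notify."""
    notifications: dict[str, list[str]] = {}
    queue = deque([failed_node])
    seen = {failed_node}
    while queue:
        node = queue.popleft()
        for child in DOWNSTREAM.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            queue.append(child)
            for subscriber in SUBSCRIBERS.get(child, []):
                notifications.setdefault(subscriber, []).append(child)
    return notifications


if __name__ == "__main__":
    # A failure in the ingest pipeline reaches the subscriber of the downstream table.
    print(notify_downstream("facebook_ingest_pipeline"))
```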
Having some system in place that feels more purpose-built for these kinds of use cases, or rather the lack of that, is a big challenge. A lot of teams' instinct is to pull in the standard stack from the software engineering world and just place that on top of data engineering, and I think that's one of the big traps we see teams getting into, because there are so many nuances in how things are monitored and how alerts are emitted that it's not gonna be a nice fit for the unique challenges that these teams face. Another challenge that we often see is with teams that are just starting out in data quality monitoring or in data observability: helping them get going with some initial analytics about their pipelines or monitoring screens or alerts, helping them get the ball rolling on what KPIs they should actually be looking at. I think that's probably where a lot of the market stands, they don't have a good sense of what data quality measurements they should even be taking. So when you come to them with, like, a totally open library where you say, define your KPIs, we're gonna pull them into the system.
We're gonna help you alert on them. That can be a huge blocker, because a lot of these teams don't even really know which KPIs to begin with. And that's through no fault of their own. It's because a lot of these data organizations are quite new. They're building data products for the first time, and they don't yet have a lot of that knowledge built up or awareness of what data they're actually working with to know where the failures are coming from. What we aim to do as users get started is provide much more out of the box alert definitions, more out of the box anomaly detection, and out of the box data quality logging through templates and other techniques that you can embed directly within your pipelines or run against your data lake. So as soon as you spin up the platform, you'll get some observability.
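As an illustration of what an out-of-the-box check over a logged metric might look like, here is a minimal anomaly-detection sketch in Python that flags a run whose row count deviates far from its recent history. The z-score threshold, the minimum history length, and the metric values are assumed for the example and don't reflect any specific product's defaults.

```python
import statistics


def is_anomalous(history: list[float], latest: float, z_threshold: float = 3.0) -> bool:
    """Flag the latest metric value if it is far outside its recent history."""
    if len(history) < 5:
        # Not enough history to judge; treat as normal while the baseline builds up.
        return False
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return latest != mean
    return abs(latest - mean) / stdev > z_threshold


if __name__ == "__main__":
    # Row counts logged from the last several pipeline runs (illustrative numbers).
    row_counts = [10_120.0, 9_980.0, 10_050.0, 10_200.0, 9_900.0]
    print(is_anomalous(row_counts, latest=10_100.0))  # False: within normal range
    print(is_anomalous(row_counts, latest=2_300.0))   # True: likely incomplete load
```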
You'll get some data quality monitoring in the first, you know, 10 minutes of using the system. And from there, as you get to know your data and your pipelines better, that'll open the gates for you to bring in more customized logging of the KPIs that are important and the data rules that are more critical for a given use case. But getting the ball rolling on better monitoring, we find, is a challenge for a lot of teams. As you continue to build out the capabilities of Databand and continue to work with these forward looking organizations,
[00:58:45] Unknown:
what are some of the long term trends or impacts that you foresee in the data ecosystem and in how people think about working with data, and how does that factor into the plans that you have for the future of Databand?
[00:58:57] Unknown:
On a macro level, something that I'm really excited about is just the level of tooling that's becoming available and the really strong communities out there that undergird products like Spark and Airflow and dbt and more that we see on the rise. I think this area has some of the most passionate engineers in the world, and it's really exciting to see a lot of that energy. I think that relates to the specialization of roles that we see and the growth of roles like machine learning engineers and analytics engineers and data ops engineers, because strong communities are forming around people with shared interests and activities. So as the role specialization increases, I think those open source communities are going to become stronger, they're gonna become more active, and we're gonna see more of them. It may not be obvious at this point, but we're big fans of open source and open core companies, and we're excited to see more of that permeate the length of the data value chain, even going down to BI applications like Looker and Tableau, which we see now are giving ground to tools like Metabase and Preset. So for us, that trend of moving more towards open, having those open core areas of the product, relates directly to how we're building Databand, because we wanna make sure that the interfaces, the integration points between our system and what engineers are working with, are really open for them to use or to customize. You can understand exactly how metadata is getting reported from your system.
Moving forward, as the level of tooling in this space increases, there are a few development areas that we're really obsessed with. We wanna make it easier and easier to integrate our system, so engineers don't really have to lift a finger. We want you to be able to pull in Databand, integrate it into a pipeline, run it on a data lake or a database that you have, or connect it into your orchestrator, and immediately get metadata and monitoring that's useful for you. Our open source, the open core behind the product, is a big part of that, and our new cloud offering that we're now rolling out is another big aspect of that, so that our users don't have to worry about the DevOps overhead of running another monitoring system if they don't want to. On top of that, as more open source tools are out there and more companies are using new kinds of products, we want more coverage and more integrations with various new services so that we can collect metadata across more parts of the process: streaming systems, more analytic systems, additional orchestrators. A lot of that's gonna be guided by the new clients that we bring onto the system, but we have strong plans to scale up the number of tools that we're connecting into.
And finally, as we see more activity in this space, as more teams are starting to invest in their data monitoring and pipeline observability, we wanna help with more insights into your tech stack and your pipelines. So we're beginning to pack into our system more derived metrics and measurements about your pipeline and data health, because our users are looking for more help in how they should actually be monitoring their pipelines. They really want guidance on the important KPIs they should be tracking, the key indicators of whether pipelines are healthy or not. And there's a lot of thought that we're putting into that that's gonna be instantiated
[01:02:22] Unknown:
in the alerts that we send and the visualizations that we show within our system. Are there any other aspects of the work that you're doing at Databand, the overall space of visibility and data quality for data teams, and how that helps to support this growing level of specialization across data roles that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think something that we
[01:02:45] Unknown:
did see over the time that we've been in the market, and I think it's going in a great direction as of the last year or so, is this balance of research versus production work that's happening within data organizations. Whether you're pursuing more of a mesh style of working, building out a data mesh, whether you have umbrella groups built up that operate at different ends of the business or levels of the stack, or whether you have more specialized roles, figuring out the right balance between the research investments that you're putting into new products and the production investments that you're making to get those products into the field, get them into the hands of customers, get feedback on them, and iterate is something that we really wanna encourage teams to work on. I think a few years back, we saw a huge flood of growth in research teams and big build-outs of data scientists and engineers that were working on, like, building new AI products to get into market. And I think a lot of those teams hit a wall because they either didn't have the data ops roles available, or they didn't have the data engineering capabilities or capacity to support the teams, or maybe they didn't have an actual market fit for the product that they were building. And we saw a little bit of a downsizing of those teams. So I think not treating data like a magic property, but thinking about it more like any other product that you sell, and building an organization along with the normal course of market validation, product market fit, shipping data products into the market in an agile manner, getting feedback, iterating.
That's something that we really wanna encourage organizations to pursue. But definitely over the last year or so, it seems we're seeing just a huge uptick in the amount of productivity
[01:04:44] Unknown:
within these organizations. So I'm hopeful it's moving in the right direction. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. No question. Observability.
[01:05:04] Unknown:
We're working on it for a reason. The first step for a lot of the organizations that we're working with is just getting data from one location, doing some transformations, and getting it to another location. That's what they needed to figure out first. And I think the new category of tools on the orchestration layer, which have become a lot easier to work with and a lot more widespread, have eaten away a lot of that challenge. So we don't see orchestration being a critical investment point for a lot of the teams that we're working with anymore. They have some system in place that they're using which allows them to get the job done. Once you have that motion happening, once data is traveling and the automation is in place, you then need to make sure that the factory is churning out data properly, that those pipes are working as expected. And that's where observability comes in. And I think for the teams that are hitting a wall there, a lot of the time it's more about the observability question, even if they don't have, like, fully massive pipelines built up. So an example would be, you have a data replication project that's pulling data from an on-prem source location and delivering it to a cloud database.
And because there's so much uncertainty over whether the data is being properly moved over into the cloud, the project gets put on hold or drawn out, momentum falls off, and you end up in this awkward, you know, one foot in, one foot out situation. And the core issue there is not that there isn't a tool available to help you process and move the data over. The core issue is the lack of confidence that the data from point A is being properly moved to point B, and that the data at point B is then being used properly in the data products it's supposed to support. So having the observability layer that comes in and gives you that confidence, that's what we feel is the biggest gap today and what we're aiming to solve with Databand.
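A simple way to picture the kind of confidence check described here is a source-to-target reconciliation that compares row counts and a few aggregates between the on-prem source and the cloud copy. The sketch below is hypothetical, using SQLite stand-ins for both databases and made-up table and column names, just to show the shape of the check.

```python
import sqlite3

# Stand-ins for the on-prem source and the cloud target; in practice these
# would be connections to two different databases.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

for conn, rows in [(source, [(1, 10.0), (2, 20.0), (3, 30.0)]),
                   (target, [(1, 10.0), (2, 20.0)])]:  # target is missing a row
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)


def summarize(conn: sqlite3.Connection) -> tuple:
    """Cheap fingerprint of the table: row count, sum, and id range."""
    return conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(amount), 0), MIN(id), MAX(id) FROM orders"
    ).fetchone()


src_summary, tgt_summary = summarize(source), summarize(target)
if src_summary == tgt_summary:
    print("replication looks consistent:", src_summary)
else:
    print("replication mismatch:", {"source": src_summary, "target": tgt_summary})
```

Even a coarse check like this, run on a schedule, is what turns "we think the data made it over" into the kind of confidence the answer above describes.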
[01:07:06] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Databand and the perspective that you've been able to gain into some of these emerging roles and responsibilities in the data ecosystem. It's definitely a very interesting aspect of the outgrowth of more companies using more data for more things, so it's useful to be able to get a bit of a peek into what's to come for more organizations. I appreciate all of the time and energy that you've put into working with these teams and helping them to be successful.
So thank you again for your time and energy, and I hope you enjoy the rest of your day. Absolutely. Thanks so much, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Josh Benamram: Introduction and Background
Complexity and Standards in Data Pipelines
Roles in Data Teams: Producers and Consumers
Future of Data Roles and Industry Trends
Specialization in Data Roles: Umbrella Groups and Hybrid Roles
Business Impact of Specialized Data Roles
Investment in Data Platforms and Tooling
Impact of Data Mesh on Data Roles
Visibility and Observability in Data Pipelines
Sourcing Information for Data Lifecycle Visibility
Challenges and Lessons in Building Data Observability Tools
Future Trends and Plans for Databand
Balancing Research and Production in Data Organizations
Biggest Gaps in Data Management Tooling