Summary
We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the "Modern Data Stack". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and the benefits of bringing engineers and business users together with data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan about their experiences with managed services in the modern data stack in their work as consultants at GoDataDriven
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving your definition of the modern data stack?
- What are the key characteristics of a tool or platform that make it a candidate for the "modern" stack?
- How does the modern data stack shift the responsibilities and capabilities of data professionals and consumers?
- What are some difficulties that you face when working with customers to migrate to these new architectures?
- What are some of the limitations of the components or paradigms of the modern stack?
- What are some strategies that you have devised for addressing those limitations?
- What are some edge cases that you have run up against with specific vendors that you have had to work around?
- What are the "gotchas" that you don’t run up against until you’ve deployed a service and started using it at scale and over time?
- How does data governance get applied across the various services and systems of the modern stack?
- One of the core promises of cloud-based and managed services for data is the ability for data analysts and consumers to self-serve. What kinds of training have you found to be necessary/useful for those end-users?
- What is the role of data engineers in the context of the "modern" stack?
- What are the most interesting, innovative, or unexpected manifestations of the modern data stack that you have seen?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to implement a modern data stack?
- When is the modern data stack the wrong choice?
- What new architectures or tools are you keeping an eye on for future client work?
Contact Info
- Guillermo
- Bram
- bramochsendorf on GitHub
- Juan
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- GoDataDriven
- Deloitte
- RPA == Robotic Process Automation
- Analytics Engineer
- James Webb Space Telescope
- Fivetran
- dbt
- Data Governance
- Azure Cloud Platform
- Stitch Data
- Airflow
- Prefect
- Argo Project
- Looker
- Azure Purview
- Soda Data
- Datafold
- Materialize
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain.
Now there's a book that captures the foundational lessons and principles that underlie everything that you hear about here. I'm happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O'Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy. Your host is Tobias Macey. And today, I'm interviewing Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan about their experiences with managed services in the modern data stack in their work as consultants at GoDataDriven. So Guillermo, can you start by introducing yourself? Hello. My name is Guillermo Sanchez. I'm originally from Madrid, Spain.
[00:02:20] Unknown:
And currently, I work as an analytics engineer for GoDataDriven
[00:02:24] Unknown:
here. It's a consultancy company based in Amsterdam. And Bram, how about you? I'm Bram. I'm also a consultant at GoDataDriven, in a role as a lead data scientist. And Juan, how about you? My name is Juan. I was born and raised in Colombia, currently located in the Netherlands, and I'm also a colleague of Bram and Guillermo. My specialty is dashboarding.
[00:02:45] Unknown:
And going back to you, Guillermo, do you remember how you first got involved in the area of data management?
[00:02:50] Unknown:
Yeah. So that was, I think, back in the day, 4 years ago, when I started working as a consultant at Deloitte in Madrid, Spain. There, I used to work building RPA robots. So, basically, this is robotic process automation. And I remember that we were picking up unstructured documents, we were processing them with some machine learning algorithms to extract structured data from them, and then we were basically merging those into some sort of structured database. And back in the day, I didn't know that this was a machine learning pipeline, which I learned afterwards.
And I started to be really interested in the machine learning world. I started studying machine learning, and then I realized afterwards, in some of the jobs that I encountered, that data availability was quite a problem. And then I started to learn some data engineering to self-serve and be able to deploy my use cases. Basically, I discovered that it was quite frustrating to maintain some data pipelines. There was just some stuff there that was quite difficult to maintain. The scale was also a problem. So then I started to look into modern data tools to be able to deploy my use cases in a more stable way. Yeah. I ended up being an analytics engineer, which is quite a recent role, I'd say, but 1 that is quite interesting.
And I think that GDD, GoDataDriven, is 1 of the first companies, at least here in the Netherlands, that is actually pushing towards using this role and really democratizing this role in companies.
[00:04:19] Unknown:
And, Bram, do you remember how you got involved in data? For me, my journey started actually in astronomy, through my studies in physics and astronomy. And especially during my PhD, I worked with a large amount of data from telescopes, either ground-based or space telescopes, to study the evolution of stars and how they interact with the interstellar medium. Working with this amount of data, I really practiced my skills in data wrangling, analysis, machine learning, software engineering, and also Python. But also, specifically with regard to data management, it's really very important within the field of astronomy.
I spent a few years working as an assistant research scientist at the Johns Hopkins University and the Space Telescope Science Institute, which is the science operations center of Hubble and the upcoming James Webb Space Telescope. And 1 of the strategic goals of the Space Telescope Science Institute is to exploit the scientific capabilities of these telescopes, but also to create a legacy and to document and make available all this astronomical data for future reference as well. And this is really how I got involved specifically in the field of data management.
After that, my career returned to the Netherlands, where I worked as a lead data scientist at an Amsterdam-based tech scale-up in the area of online advertisements. And I joined GoDataDriven 1 and a half years ago. We try to help our clients take the next step with data, for example by training them. But also at GDD, we build platforms using the modern data stack and execute use cases, all with the end goal of realizing the value of data assets.
[00:06:00] Unknown:
And, Juan, do you remember how you got involved with data? Yes. During my first job,
[00:06:05] Unknown:
I worked as a marketing analyst. And very quickly, I learned that just because I was able to build data visualizations, I got a lot of leverage with the CEO and the management team, and they would invite me to all sorts of meetings. They would approve any type of budget just because they loved the graphs that I just kept on showing them. Another thing that was very distinctive of that job is they had a DBA, who wasn't a really big fan of me, and he kept on restricting my access and making it difficult for me to work. So I just had to learn SQL.
And from that job, essentially, what I got is that visualizations, and bringing data to people in general, no matter how, is very powerful. But secondly, the better you get at the technical parts, the better your Python and your SQL are, the less you have to depend on Jeffrey. So those are kind of the 2 things that have shaped my career. And I've been doing consultancy for quite a while. I joined GoDataDriven back in December, and indeed, together with Guillermo, Bram, and some more individuals, we are really starting this analytics engineering journey.
[00:07:18] Unknown:
A couple of times in your introductions and when I introduced the topic, we used the term modern data stack. And I'm wondering if you can just start by giving your definition of what that means because there are a lot of different interpretations and combinations of tools that people think about when they use that term.
[00:07:35] Unknown:
Even between us, we have difficulties agreeing on what exactly the modern data stack is. Right? So we've discussed this quite a lot within GDD. The way we see it in this room is that the modern data stack is a set of tools, normally fully managed SaaS offerings, software as a service, that cover different pieces of the data platform, such as, to name a few: ingestion, warehousing or lakehousing, transformations, self-service, and also data quality and observability. Maybe you can include there also scheduling or infrastructure monitoring as well. This is basically a set of tools that cover certain parts of these data platforms. I'd say that normally they are also quite intuitive. So either they are UI based, meaning they have a simple user interface that most users can use to, for example, in the case of Fivetran, build an end-to-end extract and load pipeline. Or, if it requires some data transformation, then mostly they are based on SQL, which is probably the simplest language to do data transformations.
The configurations are quite straightforward. They also scale up and down without the need for any further configuration. So I see that basically this set of tools has these characteristics that make them simple in terms of usage and also maintainability. It's quite tricky, because when we were preparing for this, I also asked the team, okay, guys, let's create another definition. And
[00:08:58] Unknown:
it's super difficult to come up with some sort of checklist of these sorts of things. But we do notice some sort of ideological trend: if we make tools easier to use, then a qualified analyst can do a job that in the past was maybe associated with something only a very hardcore data engineer could do.
[00:09:20] Unknown:
Yeah. And 1 of the sort of common elements that I see in a lot of these articles and sort of wrap-ups of the modern data stack is that they're largely managed services, so they do have the easy scale up and scale down, and somebody who doesn't necessarily have a lot of infrastructure knowledge can get it set up and integrated together and deployed without necessarily having to bring a data engineer or a DevOps engineer in to get it set up and managed and connected all together. That's sort of the core element of self-serve from start to finish, versus having to hire a platform engineer to build out the platform and then it's self-serve.
[00:09:58] Unknown:
Exactly. That is also what we see at our clients. If we look at more traditional data platforms, they would normally have at least 2 to 3 platform engineers, mostly with really strong knowledge of infrastructure as code, really strong knowledge in DevOps as well as CI/CD. There's probably also going to be quite a big chance that there's going to be a lot of custom integrations involved. Yeah. Just a lot of maintainability issues. Well, with the modern data stack, what we've seen is that you can probably deploy a platform in a much shorter amount of time, and your dependence on an actual data engineer is much lower for sure. And in terms of the responsibilities
[00:10:40] Unknown:
of people who are using and building on top of this modern stack, I'm wondering how you see it shifting the experience and capabilities that are necessary and the roles that are actually interacting with the data stack as this new set of tooling is adopted and integrated?
[00:10:58] Unknown:
So I think the keyword that we're looking for here is the 1 you touched upon just now. It's self-service. Right? So we're essentially sort of democratizing certain elements in, you can call it, the data supply chain, from raw to transformed and prepared data. You're democratizing some of these elements for people that do not necessarily have that extremely technical background, people that are also really, really hard to find in the job market. And what we also see a lot of times at our clients is that the data engineering team or the platform team really gets overstretched, over-asked.
Backlogs are getting bigger and bigger, and eventually it's becoming a bottleneck for, for example, the time to market of products. And moving towards a modern data stack, by allowing analysts to self-serve on the platform, for example, and democratizing some of these capabilities, we've seen at some of our clients that it really helps in relieving some of these bottlenecks and improving the time to market of some of these products.
[00:12:05] Unknown:
Maybe 1 little thing to add: I used to know somebody whose specialty was to upgrade databases. And with a lot of these newer tools, they just upgrade automatically. So it's not just that aspect, but it seems that the focus is shifting towards activities that are associated with value, over things that you simply have to do. That's also part of what is going to change in the roles.
[00:12:33] Unknown:
Exactly. And actually a really good 1, Juan, because I think precisely what the modern data stack does is it abstracts away some of the technical, let's say, difficulties that you'd face on a traditional data platform. And what this does is basically it empowers users to actually build use cases that are closely related to the business value. And for us at GoDataDriven, we see that this is 1 of the biggest impacts by far.
[00:13:00] Unknown:
Yeah. And, you know, thinking about it a little bit, how it shifts the responsibilities and capabilities of data professionals and consumers: we've talked about technical analysts taking over some of the role of a classical data engineer. But we can also think about the consumers; that's really at the heart of self-service. Right? So we have consumers that may merely want to interact with a dashboard, for example. They can take a next step by knowing how to use self-service tools and get insights and information out of data.
And then you have people working within these BI tools and working on data models, for example. And then you have the tech-savvy analysts that work with data and transform it. So all in all, there's a sort of a shift. I'd say the modern data stack enables a shift between the business and the data domain. The shift is towards democratizing and enabling self-service, and it shifts the whole set of responsibilities, allowing the business to get information out of data themselves.
[00:14:07] Unknown:
I hope that makes sense. No. Exactly. So in that sense, for example, to give a practical example that we've seen quite often: a tool like dbt enables a lot of analysts to build end-to-end pipelines in SQL. Basically, when we come in, we deploy dbt and help them adopt dbt. What we see at the beginning is that they usually have, like, their own SQL views that they refresh, say, daily. Maybe they know something about stored procedures. But what dbt allows them to do is basically to start building and version controlling pipelines end to end, which is something that
we've never seen before. And it's definitely enabled by a modern data stack tool.
we've never seen out of this tool. And it's definitely enabled by modern data stack tool. To your point about the skills that this brings into analysts, but also to business users of being able to self serve on data and be able to ask and answer their own questions. It also brings in the challenge of the level of context and understanding that they might need to have to be able to effectively answer those questions, particularly when it comes to agreed upon metrics calculations where the business user might see a dashboard and they wanna dig a little deeper, so they maybe, you know, use 1 of these self serve platforms. They ask and answer a question, but the answer that they get isn't actually correct in the way that they think it is because maybe they don't understand the way that the underlying metric was calculated. And I'm wondering what you see is some of the challenges that come up by adding this self serve capability to more people in the business and some of the educational requirements that come about as you start to introduce these tools to more people throughout the organization?
[00:15:48] Unknown:
I think introducing self-service, or maybe the broader term of it, democratizing your data assets, has a few really key enablers, maybe prerequisites, as you can call them. From GDD, working with our clients, there are 2 main pillars there. 1 is the data literacy within a company, and second is a certain level of data governance or data management as well. First of all, the literacy: of course, when you get exposed to data and information, the way that you actually create and distill that information for yourself means you need to have some sort of a background in how to work with data, how to get that information out of data. This is how we always think about taking the business users along in their journey of becoming more data savvy by working with data.
So we think of learning journeys. What kind of training do we envision for people merely consuming dashboards? What kind of training do we envision for people that are maybe working within these BI tools and creating models underneath those dashboards? And what kind of training do we use for the more advanced people that are actually creating data pipelines and transformations in dbt, for example, transforming the data and serving and sharing it within the organization as well? You're completely right that there needs to be a matched level of literacy with the responsibilities when you get access and are exposed to data as well.
The risk is that, of course, you get diverging KPIs. You get people reinventing the wheel all of the time. And, yeah, eventually, just a lot of, well, what I can call chaos, maybe. Yeah. Chaos. Yeah.
[00:17:31] Unknown:
Lots of data assets all over the place. It's also good to mention that the challenges don't necessarily end with training. There are some factors that a lot of people don't take into account. The first 1 is getting executive support. So you need, from very high up, some sort of support: from now on, everybody's gonna create dashboards and look at them, or from now on, everybody has to analyze things. And that executive support from high up comes a lot of times in the form of priorities, the roles that get priority in the budget. So that's definitely an important facet.
And the other 1 is having an internal adoption of the tool. And what I noticed is that in order for that to happen organically, you need, firstly, a group of champions. And these are the people that, just out of passion or out of whatever it is, learn the ins and outs of the tool: those people inside of the organization that become the go-to person whenever there is a SQL question or a dbt question or whatever type of thing. But you also need to give all of these people, including the very casual users, a place where they can collaborate: some sort of Slack channel where they can ask those questions, some sort of Confluence or wiki.
So the whole adoption is hard, and the limitations tend to have a lot of ramifications. We have to remember that just giving self-service analytics to people is also throwing more responsibilities at them. It's saying: from now on, this is something that you also have to do on top of your job. So you might hear some sort of aversion or defensiveness towards
[00:19:22] Unknown:
I don't want to have to do the data job myself. We were talking about diverging KPIs before, and this is actually something really interesting. Right? Because this is something that we've seen happen precisely not because of a lack of data literacy, but actually just because the business requirements are never clear or not well understood by the data engineers, traditionally. So in the case of diverging KPIs, it's actually quite interesting to look at what the modern data stack offers, which is, by means of empowering these analysts to calculate their own KPIs: it's most likely that their definition and their implementation of a particular KPI is going to be more closely related to the business logic than, let's say, a data engineer like myself doing it for them without any business context at all. So this is something that democratizing data and the modern data stack definitely, like, push forward.
[00:20:15] Unknown:
And in terms of the technical and any other organizational complexities that come about from introducing
[00:20:22] Unknown:
these modern data stack tools to an organization, I'm wondering what are some of the sort of interesting or useful lessons that you've learned as you start to add these systems to organizations in your consulting work? We could divide, let's say, this question in 2. So the first 1 indeed is the technical challenges, so implementing something new. But to be honest, in that regard, I think the technical challenges are actually diminishing versus the previous generation of data platforms. Right? So compare a classic Hadoop on-premise platform versus a modern data stack that works by ingesting with tools like Fivetran, transforming with tools like dbt, and scaling up and down automatically with tools like BigQuery or Snowflake.
It's really not a technical problem that we see. And I think maybe this goes back to Juan, who's actually a specialist in this part, but I think it's mostly a people challenge that we see. And Juan, please chime in, of course, but basically, it's how people struggle, I think, to change the way they work.
[00:21:25] Unknown:
Correct. I see that on a daily basis. A lot of my clients purchase, let's say, Looker or Tableau or Power BI or whatever it is, their new fancy BI tool, to replace whatever they have. What I notice in almost every client is there are some people that kind of nudge me to replicate exactly what they had, but in a newer, flashier version. It is like they already accepted that they wanted to move to a new system, but they're still trying to use it the old way. They may even get disappointed if a certain cosmetic feature, or something that doesn't bring too much value, is not replicated.
And it also happens not only to dashboarding. It could be like, hey. How come dbt doesn't support this thing that my database before used to support? That's 1 people challenge that I see quite often.
[00:22:20] Unknown:
There has to be, as we discussed in the previous question, I think, a community around adopting this new set of tools, because you really need to empower people to change the way they work. And that's probably the way to go in terms of migrating to a new set of tools. As I said before, I think the technical challenge is obviously there, because it's a new way of working and you need to set up new tools. But at the end of the day, maintaining this sort of stack is way easier than maintaining a traditional data platform.
[00:22:51] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
And the major appeal of these managed services is that, as you said, there is a much reduced maintenance burden, and they're easier to get started with, where, if you're on their paved path, trying to use, for instance, the supported plugins for Fivetran and pull some prebuilt packages out of the dbt repository, you can get up and running in a matter of minutes or hours versus weeks or months. But as you start to get onboarded into these systems and you start to tie yourself to their ways of working, what are some of the edge cases or limitations that you start to run into that aren't immediately obvious, or some of the
[00:24:06] Unknown:
organizational, you know, in terms of technical organization and code organization challenges that you run into that you have to figure out just by sort of running into the pain and working through it? We could say that at GDD we've experienced some of those. The thing, of course, is that we are partners with most of these technologies, which basically enables us to work with them on how to solve these issues and patch the tool to cover these edge cases, as you mentioned. Right? So, to be a bit more practical, Fivetran and dbt are 2 good examples of that. Fivetran, for example, tries to offer as many connectors out of the box as they can. And, of course, it's not always the case that they have a connector out of the box for 1 of our clients' applications, either because it's an in-house built application or because it's just not so common, which means that we need to build, like, a custom integration for this sort of application. And I know that Fivetran, for example, offers this sort of SDK style approach where you can build that connector on top. So this is the way they try to extend functionality to avoid the sort of limitation that you could encounter.
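One way such a custom integration can look, as a hedged sketch loosely modeled on function-style connectors: a small handler that pulls new records from an in-house API and returns them together with a bookmark for the next run. The endpoint, field names, and exact response shape are hypothetical; the vendor's SDK documentation defines the real contract.

```python
# Sketch of a custom extract-and-load function for a source without an
# out-of-the-box connector. All names and the response shape are illustrative.
import requests

API_URL = "https://internal.example.com/api/orders"  # hypothetical in-house API

def handler(state):
    """Fetch records updated since the last saved cursor."""
    cursor = state.get("cursor", "1970-01-01T00:00:00Z")
    resp = requests.get(API_URL, params={"updated_since": cursor}, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    rows = payload["orders"]
    new_cursor = max((r["updated_at"] for r in rows), default=cursor)
    return {
        "state": {"cursor": new_cursor},           # bookmark for the next run
        "insert": {"orders": rows},                # rows to upsert into the warehouse
        "hasMore": payload.get("has_more", False)  # page through on the next call
    }
```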
With dbt, for example, what we've seen is that they only support 5 adapters out of the box, adapters meaning they support a certain number of warehouses as first-class citizens. So what you see in the case of dbt, because it's open source and it has a huge community, is that the community starts developing their own adapters and maintaining those adapters. And actually, a good case that we found at GDD was that we were using Databricks at quite a lot of clients. In the Netherlands, it's quite common to use the Azure stack, and a lot of people are using Azure Databricks, which meant that if we wanted to use dbt, we needed some sort of adapter for dbt on top of Spark.
And what we did at GDD, and this is something that happens quite often for us, is that we decided to chip in and help in the productization of the dbt-spark adapter. So we did a big push in-house to get this up and running, together with the guys at dbt Labs, the former Fishtown Analytics. So definitely, we find these edge cases, and we try to collaborate with these tools to, yeah, let's say, introduce this functionality in their tool. And if not, obviously, you need to find some sort of workaround, which sometimes is obviously not desirable because it's less maintainable.
[00:26:31] Unknown:
1 thing we were talking about is: as an IT professional, as an IT consultant or data engineer, whatever you are, sometimes you learn to overcome any type of roadblock and find ways to create workarounds. And sometimes they can get quite extreme. But what is important to mention, and kind of 1 of my lessons here, is there is a point where you have to stop, because then you're defeating the original purpose, which is to make the whole modern stack easy and accessible for everybody. So sometimes you really have to think to yourself: is it worth it to build the whole SDK, to build the whole thing, or to just build this really long calculation that compensates for this behavior that doesn't come out of the box?
[00:27:18] Unknown:
In terms of some of the specific vendors, you mentioned a little bit about dbt only having out-of-the-box support for a limited set of cloud data warehouses. But are there any other sort of managed platforms that you've worked with that at first seemed like they were going to solve all of your problems, and then you ended up running up against some of their limitations and had to abandon them because of specific requirements for the organization, or sort of technical or compliance issues?
[00:27:47] Unknown:
There's a couple of cases. For example, we do a lot of work for banks as well. And in the case of banks, when you're trying to ingest data using tools like Fivetran or Stitch Data, basically extract and load, there's always this concern of where the data is going to live, and whether it leaves your environment, during this extract and load process. Obviously, this is quite a painful thing to explain as well. So sometimes the tool is abandoned just because of these concerns, as you mentioned. And then, in terms of finding edge cases in tools: recently, for example, I think the data observability and data quality landscape is quite a new 1. There's a lot of tools coming into place, like Soda SQL and Datafold, to name a few. And I've definitely seen, in my experience testing some of these tools out, that the maturity is maybe still not there for some of the use cases that you particularly want in place. And this means that, at least for the time being, you maybe do not adopt the tool, but, of course, consider that maybe in the future, like, I don't know, 3 or 6 months from now, it will fulfill your use case.
[00:28:53] Unknown:
And, yeah, when you talk about managed services, you always have sort of a trade-off of performance versus customizability. Right? For the highly specialized use cases that we have at some of our clients, sometimes these managed services may not offer the performance or the out-of-the-box solutions, especially when putting them at scale. And some of our clients and some of the people that we work with also worry about a potential vendor lock-in eventually, in the end. So these are some of the examples that we have while working with the modern data stack and trying to implement it at a client.
[00:29:30] Unknown:
Definitely. And, actually, we do find limitations sometimes. That is the case, but I think the benefits usually outweigh the drawbacks,
[00:29:38] Unknown:
let's say. Correct. 1 thing that we also mentioned is, for example, Fivetran: even if not all of your pipelines are suitable for a certain use case, you can still migrate some of them, and then that's work that you don't have to maintain. So sometimes a partial move could be a possibility.
[00:29:58] Unknown:
Another element that we've touched on briefly a couple of times is the question of data governance and how that gets applied across the various components of the modern data stack and how you manage consistency of enforcement and policy definition as you traverse these different layers of the tool chain. I'm just wondering what you have seen as some of the useful practices or systems for being able to actually define and manage these policies, particularly as you have people who are coming from the business or different analysts and people self serving and still being able to have a cohesive and effective governance strategy.
[00:30:40] Unknown:
When we talk about data governance, we're really talking about different elements here. As Juan was saying, there's a people side and there's a technical side. The main goal that we see for data governance is really promoting the usage of your data assets, which is really a balance between promoting access to your data and protecting your data in 1 way or the other. And this balance is really defined by the business strategy of the organization. But I think we, from GDD, have the vision that data governance should really be focused on letting people use data in a safe and controlled way.
Some of the tooling in the modern data stack really enables you to enforce some data governance practices by increasing the observability of the data assets as they flow through, as you say, different layers of your data supply chain. Observability, meaning that we see tools coming up that have data catalogs, that show you data lineage, that give you automated data documentation and profiling of data. You know, when you transform your data or your data moves through a pipeline, you're not only interested in knowing if the pipeline has run, but also what happens inside your pipeline. Automated alerts when stuff goes wrong.
Making your data observable and transparent as it moves from a to b. Who uses it? When is it used? Where is my sensitive data? Who is getting access to that sensitive data? And we see that the modern data stack has some of these tools available to enforce some of your data governance policies.
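As a toy illustration of the kind of check these observability tools automate, here is a sketch of a freshness and row-count test. The connection stands in for any DB-API warehouse driver, and the table, column, and thresholds are placeholders.

```python
# Toy data-quality check: is the table fresh, and is the row count sane?
from datetime import datetime, timedelta, timezone

FRESHNESS_SLA = timedelta(hours=24)  # placeholder threshold
MIN_ROWS = 1000                      # placeholder threshold

def check_table(conn, table):
    """Return a list of failure messages for one table (empty list = healthy)."""
    failures = []
    cur = conn.cursor()
    # loaded_at is a placeholder column recording when each row was ingested
    cur.execute(f"SELECT MAX(loaded_at), COUNT(*) FROM {table}")
    loaded_at, row_count = cur.fetchone()

    if loaded_at is None or datetime.now(timezone.utc) - loaded_at > FRESHNESS_SLA:
        failures.append(f"{table}: no successful load within {FRESHNESS_SLA}")
    if row_count < MIN_ROWS:
        failures.append(f"{table}: only {row_count} rows, expected at least {MIN_ROWS}")
    return failures  # a real setup would page or post to Slack on failures
```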
[00:32:23] Unknown:
1 cautionary tale that is also interesting to mention is in the case of access management. Some tools, and very specifically, I mean, Snowflake, have actually come up with a structure in which you can only give people roles; they have essentially a role-based access control system. And this was originally, in their mind, as far as I know, a way to simplify their permission system. But that's where things got interesting for some clients that had already worked out, somewhere in their Kerberos, IAM, or their Active Directory, a very complex set of ways in which people authenticate.
Moving to something like Snowflake might then not be that appealing. So even though this doesn't seem to be the case in every tool, it's also interesting to see that sometimes tools, by trying to simplify things like governance, might come into conflict with some companies that already have well-set-up structures.
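For reference, the role-based model just described looks roughly like this when driven from Snowflake's Python connector. The GRANT statements are standard Snowflake DDL; the account, credentials, and object names are hypothetical placeholders.

```python
# Sketch of Snowflake-style role-based access control: access flows
# through roles rather than directly to users. Names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="admin_user", password="***",  # placeholders
    role="SECURITYADMIN",
)
cur = conn.cursor()
for stmt in [
    "CREATE ROLE IF NOT EXISTS ANALYST",
    "GRANT USAGE ON DATABASE ANALYTICS TO ROLE ANALYST",
    "GRANT USAGE ON SCHEMA ANALYTICS.MARTS TO ROLE ANALYST",
    "GRANT SELECT ON ALL TABLES IN SCHEMA ANALYTICS.MARTS TO ROLE ANALYST",
    "GRANT ROLE ANALYST TO USER SOME_USER",  # users get access via the role
]:
    cur.execute(stmt)
```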
[00:33:25] Unknown:
And then, if I may add something to what Bram mentioned before: the key here to data governance is data observability, and I agree with that. You can see this in big clients, where they have a scattered map of data assets. And for them, the scariest part of adopting something like the modern data stack is that they are giving access to data to a lot of people, and they cannot control the amount of data and the number of people that are going to be accessing everything or creating new data assets. What data observability brings to the table is basically the ability to actually understand who has access to what, and to basically map out all these data assets: what meaning they have, who has access to them, who can remove or create new pipelines.
And this is something that is lacking in bigger clients. And that's why data observability, I think, is going to play such a big role in the coming years. Yeah. It's really also in large part about trust, right, on the side of IT and management.
[00:34:24] Unknown:
You know, when you enable self-service and democratize data, they need to let go of some of the control. These tools are empowering analysts to own a bigger chunk of the data value chain and unlock the value of data. And data observability is about finding the data, understanding the data, and getting value out of the data as well. And I think this is really a very important field that is really moving forward in recent times.
[00:34:51] Unknown:
Another interesting element of the sort of modular nature of the stack is that it's fairly easy to be able to pick and choose the pieces that fit best with your cloud provider or your technology stack or level of experience of the people who are managing the systems and sort of largely the common interface across all the different layers seems to be the data warehouse. But particularly as you dig into things like data observability and performance and governance, there's a need to be able to have a more nuanced integration across those different layers to be able to maintain that visibility. And I'm wondering if you've seen any common structures or interfaces being defined or standardized to be able to simplify the work of composing together different tool chains to be able to achieve the overall objective?
[00:35:42] Unknown:
I've seen, for example, scheduling tools. So, for example, if I look at scheduling tools like Airflow or Prefect or Argo, or several of them, I think these are basically, for me, the glue that puts together all of these different services. And even in a modern data stack, I think you still need some sort of scheduling that allows you to interconnect all of these services and make sure that everything is orchestrated in a logical way. Because basically, as you said, there are several integrations out of the box between these tools. Like, for example, Fivetran can execute a pipeline, and then afterwards, a dbt model will be executed, and this will probably trigger some sort of automated refresh in something like Looker or Tableau.
But at the end of the day, this is some sort of chain of events. Having a tool to orchestrate everything like Airflow, for example, will help you have an overview of all the steps along the way.
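A minimal sketch of that glue: an Airflow DAG that chains an ingestion trigger, a dbt run, and a BI refresh. The ingestion and BI steps are stubbed with placeholder callables (you would call the Fivetran/Stitch and Looker/Tableau APIs there), and the schedule and names are illustrative.

```python
# Illustrative Airflow DAG: ingestion -> dbt -> dashboard refresh.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def trigger_ingestion():
    ...  # placeholder: kick off the Fivetran/Stitch sync via its API

def refresh_dashboards():
    ...  # placeholder: tell Looker/Tableau to refresh via its API

with DAG(
    dag_id="daily_elt",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = PythonOperator(task_id="ingest", python_callable=trigger_ingestion)
    transform = BashOperator(task_id="dbt_run", bash_command="dbt run --target prod")
    refresh = PythonOperator(task_id="refresh_bi", python_callable=refresh_dashboards)

    ingest >> transform >> refresh  # the chain of events, made explicit
```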
[00:36:39] Unknown:
There's also tools popping up, for example, Azure Purview, which offers you, like, an end-to-end solution for data governance, which really shows you the different tools and how they're connected together, where your data is coming from, where your data is transformed, and where your data is served as well. So I think it's a good point, and this is also something that we see popping up in the field as well. And then the other question
[00:37:03] Unknown:
is, as you start to push more control and sort of product decisions and management of the systems to the analytics engineers in the company and add self-service capabilities for the different business owners, what do you see as the ongoing role of the data engineer in this new landscape of the modern data stack, where a lot of the different systems are managed services and platform oriented?
[00:37:29] Unknown:
That's a really good question. So there's 2 main things. The first 1 is that the data engineer is moving more towards the platform engineering side. Eventually, you're seeing that a lot of data engineers are focused on the infrastructure side of things, and even in the modern data stack, there's still always some sort of infrastructure management going on. And the other side of things, so the other potential role that the data engineer can have, is on what we mentioned before: the edge cases. There's always still going to be particular use cases that are not supported by these tools. And nowadays, for example, we see a big case for streaming analytics, or any sort of streaming pipeline.
And those end-to-end use cases still probably need a data engineer role to make sure that you can manage that sort of complexity end to end. I would not expect an analytics engineer to be versed in Kafka and know how to write Scala or Java code. I don't think that's the core skill that is expected from an analytics engineer.
[00:38:31] Unknown:
In terms of your experience of building and managing these pipelines and training organizations to be able to take advantage of these different components of the stack, what are some of the most interesting or innovative or unexpected manifestations of the stack that you've seen as far as different combinations of tools or ways of use or lessons that you've learned in the process of helping organizations level up to these new systems?
[00:38:56] Unknown:
Yeah, I think what stands out the most for us is the speed and flexibility of the development cycle. So, for example, I was at a client recently that I think is using a modern data stack quite successfully, including tools like Stitch Data, dbt, and BigQuery. What I saw personally is that they could make releases almost twice a week, building use cases on top of the platform in a way that I've never seen before. And these use cases were also built by data analysts. So basically, you have an ongoing conversation between the business and the data analysts, both of whom are business savvy. And these data analysts can, themselves, build these use cases on top of the data platform. You can make a release in less than 3 or 4 days, and you have a new report in the system or a new KPI that was in high demand from the business.
And for me, this is, like, the biggest showcase of the success of a modern data platform. We've also seen some curious edge cases of how people work around these platforms when they need to do certain things. Like, for example, we have a colleague who is currently using dbt on top of Databricks, and he wanted to write back to an Azure SQL database. So they built some sort of an extension to do this from dbt models, so that they can write from a dbt model back to a database, and this will basically be replicated to an Azure SQL database as well. So these are quite interesting things that you see, in which you see basically the value of the modern data stack, and also how you need to extend and work around the modern data stack to be able to fulfill your use cases.
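A standalone sketch of what that kind of write-back workaround can look like; this is not the actual extension described, and the connection strings, dialects, and table names are all hypothetical placeholders.

```python
# Hypothetical write-back: copy a dbt model's output from the lakehouse
# into an Azure SQL database for downstream consumers.
import pandas as pd
from sqlalchemy import create_engine

lake_engine = create_engine("databricks://...")     # placeholder DSN/dialect
azure_engine = create_engine("mssql+pyodbc://...")  # placeholder DSN/dialect

df = pd.read_sql("SELECT * FROM marts.customer_scores", lake_engine)
df.to_sql("customer_scores", azure_engine, schema="dbo",
          if_exists="replace", index=False)  # full refresh on every run
```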
[00:40:34] Unknown:
And in terms of your experience of working with these systems and helping organizations adopt these new practices, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:40:46] Unknown:
For me, it's, I guess, coming back to the people side as well, and really the process of the change of responsibilities and roles within an organization: how you democratize these tools and enable people within the organization to start using the data, and that balance between trust in using data and, at the same time, protecting the data. At the end of the day, there's really more to win in empowering people to use the data and build their own solutions and analyses. Because, for the reasons Guillermo just gave, you really can shorten the time to market. You can relieve the pressure on existing BI and IT teams and empower people within the organization to really drive business value by themselves.
For me, this is the most important lesson and should also sort of be the main driver of why you would want to implement a modern data stack, these kinds of modern tools, in order to drive a data driven organization further.
[00:41:44] Unknown:
As you are working with customers and organizations, and they're evaluating these new architectural paradigms and capabilities that the modern data stack can provide, what are the cases where this new structure is the wrong choice, and somebody might be better served with a more traditional ETL flow or a code-oriented approach with self-managed systems?
[00:42:09] Unknown:
For us, at least in what we've seen in several clients, these are usually what we consider the edge cases, right? So a particular client needs to fulfill some sort of particular use case in which performance really matters. And in that case, you're probably better off building something custom yourself. There's a good example recently in dbt: they are trying to do some performance improvements in the compiler, because when you reach a certain amount of models, the compile time of those models can be quite long. If performance is something that really matters for you, then you should probably build something on your own that fulfills your particular use case.
Maybe it doesn't tick off all the boxes that dbt does. And of course, there are other examples of this that are probably better, but I think performance is a big 1. Another 1, maybe, is if you're trying to build something that is relatively new and relatively innovative as well; then probably these tools are not going to fulfill that use case. But also, in our experience, we see a lot of clients that are not even performing batch analytics at the level that they want to. For them, good and simple to start with is actually good enough, and for that, the modern data stack obviously covers their needs by far.
[00:43:25] Unknown:
In terms of your work, and what you are keeping an eye on as you continue to work with these systems and with clients, what are some of the new practices or tools or architectural patterns that you're keeping an eye on for future engagements?
[00:43:41] Unknown:
I'm really looking forward to some of these new data observability tools. I think this space is maturing really fast now, and this is also something that we see a lot of our clients really struggling with: what happens with our data, who gets access to what, how to handle sensitive data, empowering business users to control data assets. Tools like Soda and Datafold: you see these tools popping up right now and maturing. And because we also help a lot of our clients with building data platforms, this is like a natural solution to build on top of a data platform, so that you can start to enforce some of your data governance practices as well.
So for me, I'm keeping an eye out on how to implement data observability
[00:44:28] Unknown:
with these kinds of tools. Yeah. I agree with Bram. So data observability, a 100%, mostly also because we build these platforms for our clients, and then the next step is, you know: I have all these data pipelines, but am I actually reporting the right metrics? And this is also something that data observability brings to the table. I also see a lot of potential improvement in streaming analytics. I think that, traditionally, streaming analytics is still a practice that is linked to complex tools, like we've discussed before, like Kafka, and basically being able to program in Java or Scala.
And I think, for example, tools like Materialize, which is also relatively new, are trying to bring this SQL approach to real-time analytics. And I think this is something that is quite interesting, and that is definitely going to be something to look out for in the next few years.
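A sketch of that SQL-on-streams idea: because Materialize speaks the PostgreSQL wire protocol, a plain Postgres driver can define and query incrementally maintained views. This assumes a streaming source named orders has already been created per the Materialize docs; the connection details are illustrative defaults from around the time of this episode.

```python
# Querying an incrementally maintained view over a stream via Materialize.
import psycopg2

conn = psycopg2.connect(host="localhost", port=6875,
                        user="materialize", dbname="materialize")
conn.autocommit = True
cur = conn.cursor()

# Plain SQL; Materialize keeps the result updated as events stream in.
cur.execute("""
    CREATE MATERIALIZED VIEW revenue_by_country AS
    SELECT country, SUM(amount) AS revenue
    FROM orders
    GROUP BY country
""")

cur.execute("SELECT country, revenue FROM revenue_by_country ORDER BY revenue DESC")
for country, revenue in cur.fetchall():
    print(country, revenue)
```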
[00:45:20] Unknown:
Are there any other aspects of the modern data stack, in terms of the tooling or the capabilities or the organizational patterns that it brings about, or your experience of building out these systems and working with customers at GoDataDriven, that we didn't discuss yet that you'd like to cover before we close out the show? I think, for example, something that is quite interesting that you don't talk about much in the modern data stack is that you still need a centralized place in which you can,
[00:45:47] Unknown:
basically, monitor your infrastructure. Right? So, for example, tools like Datadog or Azure Log Analytics, stuff like that. Yeah, let's say you don't have to monitor the infrastructure itself; at the end of the day, you're trying to aim for a serverless kind of architecture. But there's still a lot of logs to be collected, to understand basically whether your end-to-end platform is actually working or not. So that's an important part. And, of course, something that we still see a gap for is, as I mentioned before, streaming analytics. I think that is still something that many of our clients want to aim for, but they don't have the capabilities to do so. And I still think that the tools and the skills required for that are quite complex.
So, yeah, they're eager to move into that space, but they still don't feel the certainty of whether they have the tools and capabilities to go into that space.
[00:46:39] Unknown:
Alright. Well, for anybody who wants to follow along with the work that you're all doing or get in touch, I'll have you each add your preferred contact information to the show notes. And so, as a final question, I'd like to get your perspectives on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:46:56] Unknown:
I think, as Bram mentioned, the most interesting gap in the tooling is the data observability and monitoring landscape. I think there's a lot of tools that are moving into this space. Bram mentioned, I think, Soda and Datafold. I think there's also Monte Carlo, and recently I came across a tool that is called Albion. So this is a space that is building up, because most companies are realizing that they need to monitor their data in order to do proper reporting and make sure that their data-driven decisions are actually properly driven.
So we see that this market is growing quite a lot, but the tools are still, I'd say, in an immature phase. For me, this is the most interesting thing to look out for in the next few years, where I actually see some sort of gap in the modern data stack.
[00:47:44] Unknown:
Well, thank you all very much for taking the time today to join me and share the work that you've been doing at GoDataDriven and your experiences of working with the modern data stack and the capabilities that it unlocks for organizations. It's definitely a very interesting, relevant, and ongoing topic that I continue to keep an eye on. So I appreciate all of your time and insights on that, and I hope you enjoy the rest of your day. Thanks, Tobias. Thanks for having us, Tobias. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Guest Introductions
Defining the Modern Data Stack
Shifting Responsibilities in Data Roles
Challenges of Self-Service Data Tools
Technical and Organizational Complexities
Managed Services and Their Limitations
Data Governance in the Modern Data Stack
Integration and Standardization Across Tools
The Evolving Role of Data Engineers
Innovative Uses and Lessons Learned
When the Modern Data Stack is Not the Right Choice
Future Trends and Tools to Watch
Final Thoughts and Closing