Summary
In this episode of the Data Engineering Podcast the inimitable Max Beauchemin talks about reusability in data pipelines. The conversation explores the "write everything twice" problem, where similar pipelines are built without code reuse, and discusses the challenges of managing different SQL dialects and relational databases. Max also touches on the evolving role of data engineers, drawing parallels with front-end engineering, and suggests that generative AI could facilitate knowledge capture and distribution in data engineering. He encourages the community to share reference implementations and templates to foster collaboration and innovation, and expresses hopes for a future where code reuse becomes more prevalent.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Data migrations are brutal. They drag on for months—sometimes years—burning through resources and crushing team morale. Datafold's AI-powered Migration Agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
- Your host is Tobias Macey and today I'm joined again by Max Beauchemin to talk about the challenges of reusability in data pipelines
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing your current thesis on the opportunities and shortcomings of code and component reusability in the data context?
- What are some ways that you think about what constitutes a "component" in this context?
- The data ecosystem has arguably grown more varied and nuanced in recent years. At the same time, the number and maturity of tools have grown. What is your view on the current trend in productivity for data teams and practitioners?
- What do you see as the core impediments to building more reusable and general-purpose solutions in data engineering?
- How can we balance the actual needs of data consumers against their requests (whether well- or un-informed) to help increase our ability to better design our workflows for reuse?
- In data engineering there are two broad approaches: code-focused or SQL-focused pipelines. In principle one would think that code-focused environments would have better composability. What are you seeing as the realities in your personal experience and what you hear from other teams?
- When it comes to SQL dialects, dbt offers the option of Jinja macros, whereas SDF and SQLMesh offer automatic translation. There are also tools like PRQL and Malloy that aim to abstract away the underlying SQL. What are the tradeoffs across those options that help or hinder the portability of transformation logic?
- Which layers of the data stack/steps in the data journey do you see the greatest opportunity for improving the creation of more broadly usable abstractions/reusable elements?
- low/no code systems for code reuse
- impact of LLMs on reusability/composition
- impact of background on industry practices (e.g. DBAs, sysadmins, analysts vs. SWE, etc.)
- polymorphic data models (e.g. activity schema)
- What are the most interesting, innovative, or unexpected ways that you have seen teams address composability and reusability of data components?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on data-oriented tools and utilities?
- What are your hopes and predictions for sharing of code and logic in the future of data engineering?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The AI Engineering Podcast is your guide to the fast-moving world of building AI systems.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- Max's Blog Post
- Airflow
- Superset
- Tableau
- Looker
- PowerBI
- Cohort Analysis
- NextJS
- Airbyte
- Fivetran
- Segment
- dbt
- SQLMesh
- Spark
- LAMP Stack
- PHP
- Relational Algebra
- Knowledge Graph
- Python Marshmallow
- Data Warehouse Lifecycle Toolkit (affiliate link)
- Entity Centric Data Modeling Blog Post
- Amplitude
- OSACon presentation
- ol-data-platform Tobias' team's data platform code
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.
Your host is Tobias Macey, and today, I'm joined again by Max Beauchemin to talk about the challenges of reusability in data pipelines. So, Max, for anybody who hasn't heard your numerous past appearances, can you just give a quick refresher about who you are and how you got into data?
[00:01:14] Maxime Beauchemin:
For sure. Well, first, thank you for having me on the show again. So I think it's been, like, 3 or 4 appearances, and it's been a little while since we connected. So excited to be here again. My quick self intro is, so I've been doing data engineering and business intelligence for, you know, the best part of 2 decades now. And then I got really involved in open source while I was at Airbnb. I started Apache Airflow and Apache Superset, which became extremely popular in both cases. So for those not familiar with the tools, Airflow is an orchestrator, so it's really about chaining up some Python workloads and running some data pipelines defined as code. And Apache Superset is an open source challenger in the Tableau, the business intelligence space. So we're competing with Tableau and Looker and Power BI, with the awesome power of open source. So that's what we do. If you haven't checked out Apache Superset in a while or you don't know about it, I urge people to go and check it out. The news is that open source, at this point, is super, super competitive. Now there's no reason why you should be using proprietary software in that area. So if you're fed up with Looker or Tableau or whatever it is that you use, you wanna move over to open source. There's a really, really good set of tools out there. So check out Apache Superset. You can check it out on Preset. Preset is a commercial offering. I'm the CEO and founder of Preset, which is commercial open source on top of Apache Superset. So feel free to use the open source version.
If you just wanna try it, take it for a test drive. You can try Superset on preset.io. So that's the end of the commercial portion of this podcast. Let's jump into it.
[00:02:49] Tobias Macey:
And so the conversation today is around the premise that data pipelines are stuck in this phase of write everything twice, where everywhere you go, you have to keep doing the same piece of work. You can't just copy and paste from one business to another. And I'm wondering if you could just start by sharing your current thesis on the opportunities for code reuse and the shortcomings on either the tooling, ecosystem, or education that limit that code and component reusability across organizations?
[00:03:22] Maxime Beauchemin:
Right. So I've been thinking about this for, you know, 10 years, maybe longer. The question, you know, since I open sourced Airflow, I really thought people were gonna go and use Airflow, or build a higher level construct on top of Airflow, to share code and logic and pipelines across organizations. And we really haven't seen that, for a whole set of reasons that are intricate and complicated and that we're gonna talk about today. Something that could be good for the listeners would be to check out the blog post I wrote on the topic. The blog post is gonna be more structured probably than this conversation, but this conversation is gonna be much more interesting in exploring, you know, the intricacies of things and getting to the next layer of talking about this challenge. But if you're curious, maybe pause the podcast and take a look at or scan the blog post; it's on the Preset blog. That's at preset.io/blog.
On there, you'll find the blog post that's titled, why data teams keep reinventing the wheel, the struggle for code reuse in the data transformation layer. That's probably a good place to start our foundation, at least for this conversation now. Assuming that some of the listeners maybe paused, took a look, came back. But the summary at a very high level is that, you know, as a data engineer, you'll go from organization to organization and kinda rewrite essentially very similar pipelines. I think it's even more so the case for analytics engineers. So if we're talking more about the part of the workload of a data engineer that's about essentially writing a bunch of SQL or doing data transformation, like creating datasets from datasets.
And typically, you're gonna go and take data from your CRM, your SaaS product, data around engagement and growth, you know, on your product. You're gonna be computing things like MAU, DAU, weekly active users, and which users are new, churned, resurrected, cohort analysis. Maybe you're gonna go and build an AB testing kinda compute pipeline to look at your AB tests and your different treatments and what the p values are and how, you know, your different experiments are converging. And, typically, there's, like, just 0 code reuse across organizations at that layer of the data stack, which is kind of a shame, you know, and there's probably some really good reasons for this.
And there's probably some parallels to be drawn with, you know, web frameworks. Right? People will use something like Next.js or they'll use React, and the reality is that, you know, maybe there's not that much code reuse on top of that, or CRM logic and things like that. But I think there's still the intuition that we should be able to at least, you know, reuse a little bit of code there, create some frameworks, have some reference implementations. If not, like, pure framework reuse, can we have more reference implementations in the wild so that at least you can take inspiration from previous work?
So that's the general kind of prompt and idea, you know, why doesn't that exist today? Or maybe it exists and I just don't know, you know. So give me some pointers if you know about some of these things, people, and, you know, share; maybe there's a place for us to have a conversation around that. But there's much less that seems to exist than there should be.
[00:06:59] Tobias Macey:
There has definitely been an overall growth in terms of the maturity and the mindset around data engineering, data integration, and movement with tools like Airbyte, your work on Airflow, the overall investment in orchestration, DBT has had a significant uptake in terms of usage. And to some approximation, that increases the productivity and the ability for smaller teams to get more work done. So to some extent, maybe there is code reuse at that tooling layer. Mhmm. But I do agree that there is a lot of rework and individual exploration that goes into okay. Now that I can pull all of my data in, where do I put it? How do I store it? How do I transform it? How do I figure out what type of dimensional modeling technique I want to use? How do I appropriately model the business?
And there has been a lot of work on the software as a service or data as a service side, particularly if you're thinking in terms of the work like segment where you say, oh, let's just take all of your customer data, put it into our tool. We will generate the dimensions that we know how to generate, and then you'll just use that and everything will be amazing. And then maybe that works for some people, but you inevitably have to evolve it and transform it and bring it under your own control. And I'm wondering if you can talk to some of the ways that you're seeing the overall trend as far as individual and team productivity versus the amount of rework that people have to do at that more detail oriented level.
[00:08:33] Maxime Beauchemin:
Yeah. I love that take on, like, the progress maybe so far over the past 5 years. I think you're right on the data integration layer, you know; talk about Airbyte, talk about Fivetran. So one is open source, the other is more proprietary or a SaaS service. And that layer of data integration, I think, has been solved in some ways or is getting actively solved. So that's very much, if you think about it in terms of ELT, that's the EL. Right? So you just extract what's in your CRM, what's in your HR system, what's inside, like, whatever tool you use as a SaaS service. You can get that in your warehouse synchronized very reliably for fairly cheap, though depending on data volume, you know, I know that Fivetran can get pretty expensive. I think Airbyte, similarly, probably in terms of compute and overhead. But that's been largely solved. And maybe that's a foundation block we didn't have before that is required.
Like, we need some sort of universal data model, at least for the the different data sources as a foundation to reuse, for the t of the ELT. Right? So having consistent EL is a is a prerequisite in some ways to having, you know, a t. We could talk about other prerequisite. You mentioned DBT, and and one thing we we haven't seen enough of, I think, is, like, the the advent of reusable DBT projects or or just like a reference implementation of, like, hey. I built a an AB testing framework compute engine, you know, using using DBT, and here's the project. You can use it. Maybe it's parameterized.
So one question around, like, why haven't we seen more of these reusable dbt projects? I try to tackle some of that in the blog post. One is, I think, the complexity in managing different SQL dialects. Right? So maybe I have a BigQuery implementation, and I'm using the long tail of UDFs or just the functions that are exposed to do some JSON processing, some more intricate, you know, math or stats functions, and these are intricately different across dialects. And now, also, as a data engineer, I have to care about other, you know, SQL dialects, and it's not super sexy or fun or interesting.
And it might not even be easy to test that code that wants to be multi dialect is actually gonna work on Snowflake or on Redshift or something else. So I think that's a blocker in some ways. I think, you know, SQLMesh is kinda interesting. I don't know how good their support is for reusable parameterized projects too. Also, there's the issue around parameterized pipelines, and in my blog post, I talk about 2 things that are foundational. One's the idea of, like, some sort of universal data model that you have to dump your data into, and that might be, you know, provided by Fivetran or Airbyte as a foundation. So that's one part, but also to have, say, a unified data model for healthcare data or for CRM data or for, you know, engagement on various products. We'd call it a user-action-in-time type universal model. Some are very intricate, some might be simpler.
But I think that's a foundation we don't necessarily have or have adopted. I think there's been some different projects in the past to try to create these unified data models. So they may kinda exist, and they're kinda intricate. So that's one thing. And then we need to have some sort of, like, parameterized pipelines. I can say, deploy this AB testing framework, but here's, you know, my set of dimensions and here's how I wanna cube the data. Or here's some intricacies that are specific to my business. Right? So it's trying to figure out what is kinda universal and reusable. You know, what is a constant? What is a variable? And then you would have to presumably write these, what I call, parametric pipelines. Right? So it's, like, dynamically generated pipelines based on configuration around your business.
I could get into some of the examples. I've got some in the blog post. But, you know, it's not one size fits all. We know that much. So then for the things that are not one size fits all, you need to be able to parameterize your pipeline and instantiate it. And currently, to do that in dbt, you have to write a bunch of Jinja. And that's just a pile of mess. Right? You're writing something highly dynamic, like just writing code that writes SQL, which is messy by nature. Right? You have a declarative language, and you're trying to bend it into being more dynamic, and it's not what it was intended to do. So maybe the tool set is not really appropriate to make it easy for people to be like, oh, it's really easy for me to make this available for other people to use. I'm just gonna parameterize it. I'm just gonna write something more dynamic and share it as a project that people can reuse. And like, okay, there's a dialect problem. There's the messy parameterized pipeline issue on top of other problems we could get into, but it's probably a good place to start. We're already, like, kinda hitting some pretty major blockers here.
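To make the parametric-pipeline idea concrete, here is a minimal, hypothetical dbt-style sketch: the set of dimensions to cube by comes in as a project variable rather than being hard-coded, so the same model could in principle be instantiated by different organizations. The model and variable names (stg_user_actions, cube_dimensions) are illustrative, not from Max's post, and it assumes an engine that supports GROUPING SETS.

```sql
-- Hypothetical dbt model sketching a "parametric pipeline": the dimensions
-- to cube by come from a project-level var, so each org instantiates the
-- same model with its own columns. Names are illustrative.
{% set cube_dims = var('cube_dimensions', ['country', 'plan_type']) %}

select
    event_date,
    {% for dim in cube_dims %}
    {{ dim }},
    {% endfor %}
    count(distinct user_id) as distinct_users,
    count(*)                as event_count
from {{ ref('stg_user_actions') }}   -- assumed upstream staging model
group by grouping sets (
    (event_date),
    {% for dim in cube_dims %}
    (event_date, {{ dim }}){{ "," if not loop.last else "" }}
    {% endfor %}
)
```

This is exactly the "code that writes SQL" pattern Max describes: the Jinja is short here, but once the configuration grows (metrics, grains, filters), the template quickly becomes hard to read, test, and debug.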
[00:14:00] Tobias Macey:
Yeah. There there are definitely any number of different areas that you could focus in on and spend your entire lifetime trying to improve and people do. But I also think that it's interesting before I get too far along this thought, there's also the case of code focused versus SQL focused where right now we've been focused very much on the SQL oriented data engineering approach of I load all of my data into my warehouse, and then everything after that is SQL. Mhmm. There is also a significant cohort of I load all of my data into my data lake, and then I write a bunch of code to process it and do different things.
And to some extent, those are converging where there's overlap, but particularly in the data lake house architecture or the unification of batch and streaming for different use cases. But I think, also, some of the challenge maybe comes in from the fact that we are typically working at least 1 or 2 layers removed from the problem that we're actually trying to solve. We're we're trying to solve a problem for the business, but instead, we're trying to fine tune the block size of the objects that we upload to s 3 to reduce the overall latency. So there's definitely something there that I I think keeps us stuck and away from being able to do a lot of reuse of the the logic that is actually solving the problem we care about.
And to some extent too, depending on who you talk to and where they're coming from, there's an argument to be made that the way that we're modeling the data is cumbersome and problematic, where we've been using these relational databases and relational theory to be able to try to model the business. There's a whole camp of people who say, actually, everything is a graph. So model it all as a graph, and your life will be amazing. But then there's the graph scalability problem. So I'm just wondering what your thoughts are on that juxtaposition of code versus SQL focused data engineering, and then also some of the ways that the specifics of how we think about the structure of the data are maybe hindering our ability to reuse more of these capabilities across organizational and team boundaries.
[00:16:16] Maxime Beauchemin:
Yeah. That makes a lot of sense. I've got a lot to say here and unpack. But, yeah, one thing is on the duality between, say, SQL and more proper coding, you know, call it declarative versus more dynamic. And in reality, I think today that means mostly, like, dbt versus maybe Spark, you know, or a data frame type API, say, like, dataframe dot group by and dataframe dot filter. But it does feel like writing something like a parametric pipeline in SQL feels like, you know, trying to do some odd things, like the equivalent of the LAMP, like, PHP days, where, like, you're doing something that works. But I think in the PHP days, we were writing logic inside HTML files. You know? So you'd put, like, an if condition within an HTML block.
And then, you know, at some point, we kinda flipped to React, where it's more, like, driven by JavaScript that generates HTML. I'm not sure if the parallel is that good or interesting, but back to the SQL take. I think SQL is extremely convenient and declarative, and it's not dynamic, which makes it a lot more accessible to more people, but it's a bit of a dead end. Maybe we don't realize at which point it just kinda prevents us from thinking like software engineers. Right? So it's like, you know, these declarative languages are good for fairly static things. You know, think Terraform for infrastructure as code. You don't want something too dynamic there.
But I think SQL, in some ways, in the complexity around dialects, it's just close enough, similar enough, that you feel like it's familiar when you switch from one dialect to another. But it's different enough, truly different enough, that you can't really practically share code across engines. It feels like a very convenient kinda abstraction that becomes a bit of a dead end for, you know, analytics engineers to turn more software engineer or embrace even more of the software engineering practices. And there's been a lot of, like, narrative around, I think, bringing kind of software engineering practices to data engineering first, but even to analytics engineering, but it's kinda limited to Git, right, in some ways. And there's these macros. You know, in dbt, you can write these, like, Jinja macros, but they're kinda nasty to write and, you know, test and debug.
So that's a bit of a limitation there. You know? It doesn't feel like dbt is the best place for code reuse. The package management is just a little awkward, you know, as a Jinja-first, you know, type thing. So there is a parallel there with PHP, you know, probably. So there's that. Let's see. What else were you touching on? I think there was much more to unpack here.
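As an illustration of the dialect problem Max describes, here is a hedged sketch of the kind of Jinja macro a multi-dialect dbt project ends up carrying: the same logical operation (pulling a scalar out of a JSON column) branches on the active target type. The macro and column names are made up, and the Snowflake branch assumes the column is a VARIANT.

```sql
-- Hypothetical dbt macro: one logical operation, one branch per engine.
{% macro json_get(column, key) %}
    {%- if target.type == 'bigquery' -%}
        json_extract_scalar({{ column }}, '$.{{ key }}')
    {%- elif target.type == 'snowflake' -%}
        {{ column }}:{{ key }}::string
    {%- elif target.type == 'postgres' -%}
        ({{ column }} ->> '{{ key }}')
    {%- else -%}
        {{ exceptions.raise_compiler_error("json_get: no implementation for " ~ target.type) }}
    {%- endif -%}
{% endmacro %}

-- Used in a model, e.g.:
--   select {{ json_get('extra_attributes', 'plan_tier') }} as plan_tier
--   from {{ ref('stg_customers') }}
```

dbt's adapter.dispatch mechanism, which packages like dbt-utils use for their cross-database macros, is the more idiomatic way to structure this, but the maintenance burden Max points to is the same: someone has to write and test every branch.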
[00:19:32] Tobias Macey:
I think some of the other interesting avenues to go through are the underlying model of the data, how we think about it. Is it all just 2 dimensional arrays? Is that really all we care about? And then we say, well, actually, 2 dimensions isn't enough to sufficiently capture the nuance of the problem that I'm trying to solve. So let me explode this into multiple tables joined across these different attributes, which then become cumbersome to evolve and maintain, and it becomes difficult for end users to be able to figure out. So you have to pepper it with documentation and other helper functions and maybe just hand them the spreadsheet that they actually care about. And just wondering how you see that data model that we have gotten locked into for the past, what is it, 40, 50 years now?
And some of the ways that alternate storage methods, particularly in the graph space, can either supplement or potentially at least partially replace some of the ways that we do that modeling to improve on reuse.
[00:20:33] Maxime Beauchemin:
Yeah. I mean, let's take it from an angle of, like, dynamicity and data models. So if we wanna do code reuse, like, clearly, we need part of the model to be dynamic. And I get into that in the blog post in some ways, but your customer attributes at your company are not exactly the same as my customer attributes at my company, and we care about different sets of attributes. And how can we model this in traditional relational databases? You know, it's inconvenient to have to do dynamic modeling, to be like, I'm gonna alter table, add column dynamically.
It's not practical to do so, and it's not reasonable to do so. So one thing I'm talking about in the blog post, based on that premise of being in this traditional relational database world, is could we identify these dynamic parts of the model and put them into this emerging class of, like, new fields that are more dynamic in modern databases. Right? So, as the extreme example, you could have a JSON blob in your customer table, you would have some sort of, like, customer extra attributes. And in there, you could have as much complexity as, you know, you wanna add to enrich your customer entity with whatever is unique to your business. And maybe the framework needs something like a customer type that's, like, really structural and used by the framework, with some assumption around, like, this is a paid customer, this is a free customer.
Maybe there's some logic that's attached to the customer type, and that's hard and something you gotta fit into when you wanna use this parametric pipeline. But then if you wanna store all sorts of other customer attributes, we would have a place for that. That's all within a column. So you wouldn't have to have this, you know, dynamic model, and I think the dynamic model is just not viable. Now, as to whether a graph database or document store or key value store is a better place, well, ultimately, everything is kind of keyed, you know, everything is a document even in a relational database. You know? So it's all data and metadata.
So whether you say create table, you know, and define columns, or you put that in, you know, a Marshmallow schema, for people familiar with the schema library in Python. There's all sorts of different ways that you could define a schema for your different entities, but, clearly, you're gonna have some sort of notion of entity type and a schema for it and an entity ID in all cases. And to me, whether it's modeled, you know, in the traditional relational ways as a different physical table in a physical database, or whether it's stored, you know, in an object store, with metadata living in Git instead of, you know, in the database engine itself, is not necessarily, I mean, it's an interesting thing to think about, but it's kind of the same problem.
Except for maybe the properties of the different database engines and what they can do. Right? So, of course, in a graph database, you're gonna be able to run certain types of queries in a different way. So different workloads, like, give me all the descendants for this node, you know, and then find all the objects attached, it's gonna be really good at doing that. It might not be as good at doing, you know, a full scan or a join. So, as to where to store it, to me that is less of a challenge. But I think for these parametric pipelines, what's really important is to define what is structural to the parametric pipeline, something you have to fit into, and which parts of the pipeline are more dynamic, or which of the different attributes or facts are more dynamic.
So that would be, like, you know, my extra customer attributes or my extra, you know, transactional attributes or the hierarchies that are specific to my business. And this would have to be parameterized and stored alongside, you know, the instantiation of the parametric pipeline so that we know what to do. So say you want the pipeline to cube the data in a certain way and generate some count distincts grouped by different combinations of dimensions. That would be something that could be unique to your pipeline, and it would sit on top of that metadata of the extra fields or the extra metadata that's unique to you. So, you know, there's quite a bit of abstraction and complexity there. Maybe the operator of the pipeline is like, what is this thing? What does it do? You know, thinking about these abstractions is pretty complex. It's kind of an issue.
Right? So maybe if you're like, okay, I'm gonna do my first AB testing data compute pipeline. I'm gonna use this toolkit or framework or reusable, you know, Airflow DAG or SQLMesh, you know, abstraction. Okay. Where do I start? Then maybe you pip install this, or, I don't even know, like, you install the dbt package, and then you have to start looking at the YAML configuration of that pipeline and how you're gonna do your own AB testing. And you're gonna have to load your data into a certain place and provide a bunch of parameters and run this thing, and maybe that interface is too complicated of a template for people. You know, maybe they need to go through the grind of writing the pipeline to be able to operate it down the line. I don't know. But it's quite a complex abstraction for any data engineer to kinda inherit and try to work with.
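Here is a minimal sketch of the structural-versus-dynamic split Max describes, assuming Postgres-style JSON types and operators; the table and column names are illustrative. The framework only ever depends on the structural columns, while anything unique to a given business rides along in the semi-structured column.

```sql
-- Structural columns the parametric framework relies on, plus one
-- semi-structured column for whatever is unique to a given business.
create table dim_customer (
    customer_id      bigint,
    customer_type    text,        -- structural: e.g. 'free' or 'paid', framework logic hangs off this
    first_seen_at    timestamp,
    extra_attributes jsonb        -- dynamic: org-specific attributes the framework never inspects
);

-- The shared pipeline stays generic; each org reaches into its own
-- extras only at the edges, e.g.:
select
    customer_type,
    extra_attributes ->> 'industry' as industry,
    count(*)                        as customers
from dim_customer
group by 1, 2;
```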
[00:26:28] Tobias Macey:
I think another interesting aspect of the state that we find ourselves in is the fact that data engineering as a discipline, for 1, is relatively new. I mean, people have been doing data work for a long time, but data engineering as a specific title has really only been used actively for about the past decade, give or take. And the backgrounds of people who come into data is fairly wide. There are a lot of people who come from the systems side where they start as a sysadmin. They just have to keep things up and running, and so then, automatically, they have to deal with the data. There are a lot of people coming from DBAs and business intelligence backgrounds. There are a lot of people coming in from data analytics backgrounds who then go further down the stack more into the pipeline management.
[00:27:21] Maxime Beauchemin:
There's many other folks also. The spreadsheet people too, like, the spreadsheet wizards, you know, that realize they can have so much more impact if they work with a database. Those can be pretty impressive too as they become, you know, data engineers. There's, like, so many different paths and backgrounds and ways to get there.
[00:27:42] Tobias Macey:
And there is also a substantial number of people with a software background, but I don't think that it's as ubiquitous as the web application development ecosystem has been, where that is purely software, has been all the way along. And I'm just wondering how you see that variance in backgrounds in the data space impacting the ways that people think about investing in reuse and those software patterns.
[00:28:03] Maxime Beauchemin:
Yeah. Maybe I'll start with an interesting parallel with front end engineering. You know? So I think we've seen front end engineering over the past decade go from this, like, not so respected discipline to becoming, like, a true software engineering, kind of legit profession almost. You know, the perception in the past was like, you're a webmaster, a web engineer. You write HTML and JavaScript, and you're not a true software engineer. Like, the back end guys, you know, the balding guys with a ponytail writing JVM code all day, they're like, okay, this guy just writes websites. Right? He's sitting closer to the graphic designers than he or she is to us. But we've seen this transformation happen. It took a moment. I'm not sure exactly what are the key ingredients, you know, the important milestones that allowed for that discipline to get better.
If you push, and I'm deviating from the topic a little bit, but if you push the analogy and you look at code reuse on the front end, you won't see that many templates. I think what you're gonna see is, like, frameworks and toolkits and things to abstract these abstractions of CSS and HTML, but you won't see that much, like, here's a marketing website. Right? Like, I'm just gonna npm install, like, marketing-base and parameterize it to fit my marketing organization. So maybe when you get into the things that are unique to your business, like the way you treat your data, or branding, you know, on the front end, you know, it becomes really hard to reuse code. And maybe the template approach, right, you might take a marketing website template and modify the crap out of it, but you're not gonna install a marketing website package and parameterize it. Right? So maybe then that points to, like, building a better ecosystem of templates. By that, I mean, like, reference implementations of these different pipelines, for organizations or data engineers to just be like, oh, well, I'm working on an engagement computation framework, or I'm looking to bring my HubSpot data into my warehouse and do something with it. Well, maybe you'd use some sort of template and fork it as opposed to, you know, take a package and parameterize it and extend it.
[00:30:48] Tobias Macey:
I think that's a useful analogy to build on as well where in the front end ecosystem, you have these templates, you have all this tooling, but all of the work that you're doing is inherently visible because that was why you made it. You made it to show to your end users, and the way that the web is structured is that when you load it I mean, it's getting less so now because of some of the protocol changes and and the frameworks. But when you load a page, you automatically have access to all of the code that went into it to to some approximation. And I think that in the data ecosystem, we're a little bit hampered by the fact that data inherently has this layer of security and shroud of secrecy applied to it, so you don't have it as visible for other people to be able to learn from. So all you have is your own past experience at other organizations to draw from or what you're able to glean from people that you come in contact with through your own professional network and conferences, etcetera.
And so I think that that also has acted as a limiting factor in terms of our ability to have these reusable templates, have these more broad reaching reference implementations to be able to draw from, because it's rare that you actually have a fully spec'd out data warehouse using star or snowflake schemas with highly maintained slowly changing dimensions, etcetera, etcetera. And so it's this more diffuse knowledge transfer than an easily referenceable way of doing anything in a concrete fashion. All we have are these abstract references of the Kimball book or the data vault book of, this is how you do it in an abstract sense. But if you wanna do it in your own business, figure it out and good luck. Pay me a bunch of money and maybe I'll help you.
[00:32:48] Maxime Beauchemin:
Yeah. I liked, you know, one of the Kimball books that was more project centered. I think it was called The Data Warehouse Lifecycle Toolkit. And that one was more like, okay, you're starting a data warehouse implementation in your organization, and how do you navigate the whole business and conduct interviews and identify the key stakeholders and get to, like, a bus matrix. So there's a whole, like, recipe that probably does not work anymore. It assumes, like, a very waterfall type thing. It's not agile by any means, I think. It predates agile, but the cookbook was kinda interesting. I like the parallel; the thing that's cool about the front end, I think you identified, is it's a lot more transparent, the code is in your face, so maybe that creates some unique dynamics for front end in some ways. Yeah. On the, you know, bringing more transparency to what's happening or sharing more. So I wanted to bring up data gravity to describe an issue where data is not very modular. It wants to be together and mesh together, so you wanna bring everything you know about a customer inside your customer table. You know, actually I have a blog post that talks about entity centric data modeling, and it says basically your dimension tables, you should think of them as, you know, not just dimensional attributes, but bring in metrics. Right? So if you have a customer dimension, you have a product dimension, don't use it only for attributes of the thing. You can totally bring in, you know, metrics like what is my number of seats sold and number of seats filled over 7 days, 28 days.
Some are arguing, you know, can I bring all of the properties of the entity, you know, as attributes in your dimension, take snapshots, and, you know, there's a lot of good ideas there, but it points in the direction that you're gonna bring all the attributes that come from all your different systems. Right? So that means your CRM, your product usage, you know, anything that's related to the customer, whichever system it comes from, you wanna bring it all together. So that data gravity, I think, makes it such that data pipelines are a big tangled mess.
You know, everything's connected with everything, and that makes it harder to say, like, I'm actually gonna share something that processes, you know, event data, and that really doesn't touch all the other dimensions or attributes of my customer or my user. So that's one thing. And then, you know, there's something about sharing templates or reference implementations where, like, I wish I could just put out my dbt. Like, we added dbt at Preset. Like, over the years, we've grown a very complex transformation layer with, like, maybe 300, 400 data models. And I wish I could, you know, maybe I could just open source it, not so that people would use it, but as a reference implementation.
But it's always a little shameful, you know, and a little bit of a mess, because these things grow like a big, big ball of duct tape. I don't know what exactly the analogy is, but it just kinda piles up on itself, becomes extremely complex. It feels like it's tech debt from day 1. You know, so it's not something that you're proud of. It's really hard to make something you're proud of using, you know, templated SQL and share it in the open. Plus, there's the intricacies of the privacy, or not even privacy, the security issues, like, you know, there's probably a thing or 10 in that dbt pipeline that I wouldn't wanna share, and it would be a little dangerous.
You know? And I don't even know what these things are. So there's that: the shareability of these constructs is not good because of the dialects, because of the nature of it being kinda messy and tangled up and complicated. And that's one I don't think I identified in the blog post, you know. So it's like the self perception of the value of the code. You're like, I don't wanna put that in the open. You know, this is dirty laundry.
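A minimal sketch of the entity-centric idea Max mentions above: the customer dimension carries rolling 7-day and 28-day activity metrics next to its attributes. The table names, keys, and the Postgres-style FILTER clause and date arithmetic are assumptions, not Max's actual implementation.

```sql
-- Entity-centric dimension sketch: attributes plus pre-computed rolling metrics.
select
    c.customer_id,
    c.customer_type,
    count(distinct a.action_id) filter (
        where a.action_ts >= current_date - interval '7 days')  as actions_7d,
    count(distinct a.action_id) filter (
        where a.action_ts >= current_date - interval '28 days') as actions_28d,
    max(a.action_ts)                                            as last_active_at
from dim_customer c
left join fct_user_actions a
    on a.user_id = c.customer_id      -- assumes events are keyed by the same id as the dimension
group by 1, 2;
```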
[00:36:54] Tobias Macey:
Keep the curtain closed, everybody. Yeah. Absolutely. For what it's worth, the code repository that I'm building for my day job, we actually do have it open sourced. It's not necessarily a useful reference implementation, but people can look at it. But I do think that that is probably one of the biggest hindrances in our ecosystem of data, to your point, that gravity of the data, where moving it to a 1000 different places to make it more componentized and modularized goes against the physics of the problem. And the fact that there is so much security layered on top of it, and the issues of privacy, and just the, I don't wanna risk it, so just keep it under the covers. So it hinders the distribution of that knowledge.
So it is much less diffuse beyond just the books that people invest in that are, by nature, abstract and not directly applicable without doing some of your own translation work, which is where all the complexity comes in.
[00:37:55] Maxime Beauchemin:
Yeah. That'd be viable work to be done by someone, to kinda create templates and share them as, you know, incomplete ideals, but, like, reusable, and not necessarily as something you can go and run with and get value instantly from, but more, maybe it's the author of a book about, you know, practical data engineering. That would be like, okay, here's an example of a dbt pipeline that computes, you know, all the engagement, growth accounting metrics that you want for your product, and here's how you would extend it. That stuff doesn't seem to exist. Curious about the repo that you say you opened up with your own dirty laundry, or maybe not so dirty laundry. But is that in the form of a dbt project or a Spark pipeline? Or what does that look like?
[00:38:43] Tobias Macey:
So the repository right now primarily consists of some Dagster pipelines and dbt models.
[00:38:49] Maxime Beauchemin:
Gotcha. And and then you're processing what kind of, is it product information? Or
[00:38:56] Tobias Macey:
So it's primarily information about educational materials and how people are engaging with it. So, my day job, I work at MIT in the Open Learning department. So it's all nondegree material for learners outside of the bounds of MIT, so not MIT students. So MOOCs, OpenCourseWare material, professional education, and so it's figuring out what are the enrollments, what are the learner engagement patterns, what are the courses that we have on offer, etcetera, etcetera, and then being able to report that into the business.
[00:39:30] Maxime Beauchemin:
That makes sense. Yeah. So that pattern, if you think of it in terms of verticals, points to an opportunity for reuse. You know? So I talk about that a little bit in the blog post too, to say, what are some of the business verticals that are highly common, that data engineers do over and over, where the schema and the model and maybe the customizability is not as important or as needed? Like, what are some data models that are very simple in terms of, like, the input, but can be complex in terms of the output? And to me, it seems like product, what I would call, like, a user action in time event pipeline, would be extremely useful. Right? Because everyone can model some of their data as, you know, timestamp, user ID, you know, event type, and then extra attributes. Right? And if you have something like that, you could compute all sorts of, you know, actions in time, actions per user, frequency of action, clickstream, like, you know, all these, you know, funnel analyses. All from a very simple model. So maybe there's an opportunity there in that area. And you've seen that a little bit productized; when I was saying, like, there's no real reusable, you know, things that have been used and shared, if you look at tools like Amplitude, there's a lot of these, like, you know, tools out there that are mostly user action events in time.
So I could see, you know, there's an opportunity for a pipeline there. I've been meaning to share one that I call EGAF, the engagement growth accounting framework. I have a dbt project. I don't think I've made it public, but, you know, I never felt like it was good enough to share. But I've been meaning to do this. I wanted to give a talk, I submitted a talk for the Coalesce conference, like, 2 years ago, and it wasn't accepted for whatever reason. So it hurt my feelings a tiny bit. And then I was like, okay, I'll just do that some other time. Never got to it. But the premise with this engagement growth accounting framework was, you know, you load your data into a well defined, you know, structure that's just timestamp, user ID, action or kinda event type, event subtype maybe, and some sort of extra field with, you know, other JSON you might wanna use down the line. And then based on that, you parameterize it by saying, I want DAU, MAU, weekly active users, you know, these cohort tables.
And, like, maybe there's a sub module for experiment exposure. Like, who was part of which experiment at what point in time. So we could do some AB testing, kinda bolt that on top of it. And it seems like it would be immediately useful to people, but it's kinda hard to get to sharing it because, you know, I wrote it maybe in BigQuery in dbt and, you know, people couldn't use it out of the box. It's kinda hard in dbt to put a YAML file to parameterize that pipeline. I mean, it's not that hard, but it's, you know, a little bit intricate to figure out exactly what the parameters of the pipeline should be. But maybe it's an experiment to look forward to. You know? Like, I'm curious to try to put it out there, see if I can get, you know, 10 or a 1000 stars or 10,000 stars on GitHub. It seems like 10,000 stars material. You know?
Every company is doing that in some capacity.
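To ground the EGAF idea, here is a hedged sketch of a growth-accounting query over the universal user-action-in-time model (timestamp, user ID, event type, extras). It labels each active user per day as new, retained, or resurrected; churned users would need a calendar spine joined against last-active dates and are omitted. Table names, the daily grain, and the Postgres-style date arithmetic are assumptions, not Max's actual project.

```sql
-- Growth accounting over a universal event model, at a daily grain.
with daily_active as (
    select
        cast(action_ts as date) as activity_date,
        user_id
    from fct_user_actions
    group by 1, 2
),

labeled as (
    select
        activity_date,
        user_id,
        min(activity_date) over (partition by user_id)       as first_active_date,
        lag(activity_date) over (partition by user_id
                                 order by activity_date)      as prev_active_date
    from daily_active
)

select
    activity_date,
    case
        when activity_date = first_active_date    then 'new'
        when prev_active_date = activity_date - 1 then 'retained'
        else 'resurrected'
    end as growth_state,
    count(distinct user_id) as users
from labeled
group by 1, 2;
```

Cohort tables, DAU/MAU rollups, and the experiment-exposure sub-module Max mentions would all layer on top of the same daily_active base.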
[00:43:04] Tobias Macey:
Yeah. It's definitely great to hear you suggesting to put that out there. I encourage you to do so. I'll keep an eye out for it. And I think that this conversation won't be complete without some mention of generative AI and LLMs. And I'm wondering what your thoughts are on the ability of this current cycle of innovation, experimentation, whatever you wanna call it, how that is maybe a potential means of doing that knowledge capture and distribution that is currently done behind closed doors and federating that a bit more, making it easier for people to say, this is the business problem I'm trying to solve.
You generate the first pass at that, and then I'll fine tune it as we go.
[00:43:49] Maxime Beauchemin:
If we wanna, like, enable intelligence in general to do great things with reusable code and templates and frameworks, I think it's kind of the same whether you're trying to enable, like, intelligent humans or artificial intelligence. Like, you need to have good docs, good references, good naming, you know, things that are intelligible that humans or AI can work with. One question is, say, if you put these templates out there, these frameworks out there, and they're publicly available for machines to train on as opposed to RAG on, you know, that's one thing; I really think that open source is gonna get more help from AI than private source, because it's just out there. Right? The models are training against Apache Airflow, and they don't have access to the Power BI code base to come and help out or to answer questions or to write PRs or to answer issues. So we've seen the rise of these AI bots that are effectively trained on the Superset code base, you know, and can help us.
So, in saying that, maybe the AI could work very well with these reference implementations and templates. Because you could say, okay, this EGAF thing I was talking about, the engagement growth accounting framework, it's out there. All the bots, you know, Claude knows about it, GPT knows about it. And when I ask it, can you help me, you know, instantiate that in my repo? Maybe it can take you, like, 80, 90% of the way there. Would the AI help us get to that abstraction? Probably. Would it help us, like, bring this abstraction into useful or, like, production territory?
Probably. But the more that is shared, the better, you know, for the advancement of intelligence, regardless of whether it's artificial or not.
[00:45:45] Tobias Macey:
Yeah. It's it's definitely interesting to see how generative AI has already started consuming software engineering in different capacities and how it's influencing the way people think about how they get their work done. So I'm excited to see what potential positive impact it can have on the data engineering space. And I'm just wondering, as you continue to invest in this space, try to build and promote this idea of code reuse, I'm just wondering what are some of your hopes and predictions around that potential future?
[00:46:19] Maxime Beauchemin:
Yeah. So a lot of hopes. I don't know how much prediction, but for me, what I do is I try to have an AI-first reflex for just about everything I do. So new task, new workload, new workflows, something creative, something repetitive. I always try to get the AI to help me out, and I've been having, like, various levels of success. I gave a talk at the Airflow Summit and at OSACon, it's the same talk. It's around, like, using AI as a practitioner over the past 2 years. As a founder, as a data practitioner, over the past 2 years, I've had this AI-first reflex and then some reflection as to, like, what's working, what's not working, and what I have good hopes for in the future. So there's an entire talk on that that could be interesting.
One thing I tell people is I encourage them to have that AI-first reflex for just about everything. And then, you know, give it more than just one shot. Like, if you ask, can you do this for me, and the AI is not helpful without the context, try to provide more context, you know, give it what it needs to help you out. So I think it's a critical thing to do to remain relevant. You know? If you skip a beat or 2 on that, I think you can realize you're far behind pretty quickly. It's quite easy to catch up if you're behind, but, you know, really figure out how you can team up with this kinda infinite resource, you know, this thing that can really help out with a lot of things. Figure out what works, what doesn't for you. I think everyone should be doing it. We shouldn't wait for the products to do it for us. You know, individually, I think you gotta figure out how it can help you out, you know, every day.
[00:48:12] Tobias Macey:
Are there any other aspects of this problem space that we didn't discuss yet that you wanna touch on before we close out the show?
[00:48:19] Maxime Beauchemin:
No. I think it's maybe a call out for people interested in the space. So it turns out, you know, people have been thinking about this for about 20 years too. Like, when I started in business intelligence and, you know, data warehouse architecture, people were already thinking about these things. So I'd be curious if there's a way for people, maybe on Twitter, LinkedIn, you know, to start a thread. If you've seen things that you think are interesting around, like, code reuse and how to go about it, and then maybe encouraging people to put their reference implementations out there, you know, to be like, you know, I've written an AB testing framework or pipeline as a dbt package, as a SQLMesh project. And maybe I forked it for my own use, but here's how I got started, and here's the reference implementation.
Maybe there's an opportunity to become kind of a a small community around all the people that are building that specific vertical, and they can share experiences. So it's so it's more of a sharing, you know, sharing what you did as opposed to, like, writing a framework that everyone should just, you know, pip install and use. But maybe that's another aspect of open source, and it might be a required step towards the the more reusable framework. Because, like, first, like, we gotta look at each other's work to to understand what's common and what's different.
And maybe only then, you know, we we we can start building a framework that works for, you know, 70, 80 percent of the of the people building the same thing.
[00:49:55] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And usually, I ask about what you see as being the biggest gap in the tooling or technology for data management, but I think that's what we just spent the past 5 minutes or so talking about. So, I just wanna say thank you for taking the time today and for your thoughtfulness in this space and all the contributions that you've made, and I hope you enjoy the rest of your day.
[00:50:14] Maxime Beauchemin:
Same here. Thank you for, you know, everything you do here. I think having this medium of exchange is much better than, say, just a blog post. The blog post is more structured, but this is a lot more interesting to really go and explore and talk about it. So I love this long form, you know, chatty, like, way to approach and explore, you know, a problem space. So it's been great.
[00:50:52] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. Just to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. DataFold's AI powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches. And they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year long migration into weeks? Visitdataengineeringpodcast.com/datafolds today for the details.
Your host is Tobias Macy, and today, I'm joined again by Max Beauchemin to talk about the challenges of reusability in data pipelines. So, Max, for anybody who hasn't heard your numerous past experiences, can you just give a quick refresher about who you are and how you got into data? For sure. Well, first, thank you for for having me on the show again. So I think it's been, like, 3 or 4 appearances, and it's been a little while since we connected. So excited to be here again.
[00:01:14] Maxime Beauchemin:
My quick self intro: I've been doing data engineering and business intelligence for the best part of two decades now, and I got really involved in open source while I was at Airbnb. I started Apache Airflow and Apache Superset, which became extremely popular in both cases. For those not familiar with the tools, Airflow is an orchestrator, so it's really about orchestrating Python workloads and running data pipelines defined as code. And Apache Superset is an open source challenger in the Tableau, business intelligence space. So we're competing with Tableau and Looker and Power BI with the awesome power of open source. That's what we do. If you haven't checked out Apache Superset in a while, or you don't know about it, I urge people to go and check it out. The news is that the open source option, these days, is super competitive. There's no reason why you should be using proprietary software in that area. So if you're fed up with Looker or Tableau or whatever it is that you use, and you want to move over to open source, there's a really good set of tools out there. Check out Apache Superset. Preset is a commercial offering; I'm the CEO and founder of Preset, which is commercial open source on top of Apache Superset. So feel free to use the open source version.
If you just want to try it and take it for a test drive, you can try Superset on preset.io. That's the end of the commercial portion of this podcast. Let's jump into it.
[00:02:49] Tobias Macey:
And so the conversation today is around the premise that data pipelines are stuck in this phase of write everything twice, where everywhere you go, you have to keep doing the same piece of work. You can't just copy and paste from one business to another. And I'm wondering if you could start by sharing your current thesis on the opportunities for code reuse and the shortcomings, whether in the tooling, the ecosystem, or education, that limit code and component reusability across organizations.
[00:03:22] Maxime Beauchemin:
Right. So I've been thinking about this question for 10 years, maybe longer. Since I open sourced Airflow, I really thought people were going to use Airflow, or build a higher level construct on top of Airflow, to share code and logic and pipelines across organizations. And we really haven't seen that, for a whole set of reasons that are intricate and complicated and that we're going to talk about today. Something that could be useful for the listeners would be to check out the blog post I wrote on the topic. The blog post is probably more structured than this conversation, but this conversation is going to be much more interesting in exploring the intricacies and getting to the next layer of talking about this challenge. So if you're curious, maybe pause the podcast and scan the blog post; it's on the Preset blog at preset.io/blog.
On there, you'll find the post titled "Why Data Teams Keep Reinventing the Wheel: The Struggle for Code Reuse in the Data Transformation Layer." That's probably a good place to start and a foundation, at least for this conversation, assuming some of the listeners maybe paused, took a look, and came back. The summary, at a very high level, is that as a data engineer you'll go from organization to organization and essentially rewrite very similar pipelines. I think it's even more so the case for analytics engineers, so the part of the data engineering workload that's about essentially writing a bunch of SQL or doing data transformation, creating datasets from datasets.
And typically, you're going to take data from your CRM and your SaaS products, data around engagement and growth on your product. You're going to compute things like MAU, DAU, weekly active users, new, churned, and resurrected users, cohort analysis. Maybe you're going to build an AB testing compute pipeline to look at your AB tests, your different treatments, what the p-values are, and how your different experiments are converging. And, typically, there's just zero code reuse across organizations at that layer of the data stack, which is kind of a shame, and there are probably some really good reasons for this.
And there are probably some parallels to be drawn with web frameworks. People will use something like Next.js or they'll use React, and the reality is that maybe there's not that much code reuse on top of that, for CRM logic and things like that. But I think there's still the intuition that we should be able to at least reuse a little bit of code there, create some frameworks, have some reference implementations. If not pure framework reuse, can we have more reference implementations in the wild so that at least you can take inspiration from previous work?
So that's the general prompt and the idea: why doesn't that exist today? Or maybe it does exist and I just don't know, so give me some pointers. If you know about some of these things, share them; maybe there's a place for us to have a conversation around that. But there seems to be much less of it out there than there should be.
[00:06:59] Tobias Macey:
There has definitely been overall growth in the maturity and the mindset around data engineering, data integration, and movement, with tools like Airbyte, your work on Airflow, the overall investment in orchestration, and dbt has seen significant uptake in terms of usage. And to some approximation, that increases the productivity and the ability for smaller teams to get more work done. So to some extent, maybe there is code reuse at that tooling layer. Mhmm. But I do agree that there is a lot of rework and individual exploration that goes into: okay, now that I can pull all of my data in, where do I put it? How do I store it? How do I transform it? How do I figure out what type of dimensional modeling technique I want to use? How do I appropriately model the business?
And there has been a lot of work on the software as a service or data as a service side, particularly if you're thinking of tools like Segment, where you say: let's just take all of your customer data and put it into our tool. We will generate the dimensions that we know how to generate, and then you'll just use that and everything will be amazing. Maybe that works for some people, but you inevitably have to evolve it and transform it and bring it under your own control. And I'm wondering if you can talk through how you're seeing the overall trend in individual and team productivity versus the amount of rework that people have to do at that more detail oriented level.
[00:08:33] Maxime Beauchemin:
Yeah. I love that take on the progress so far over the past 5 years. I think you're right on the data integration layer. Talk about Airbyte, talk about Fivetran: one is open source, the other is more proprietary, a SaaS service. And that layer of data integration, I think, has been solved in some ways or is getting actively solved. If you think about it in terms of ELT, that's the EL. You just extract what's in your CRM, what's in your HR system, what's inside whatever tool you use as a SaaS service. You can get that into your warehouse, synchronized very reliably, for fairly cheap, though depending on data volume; I know that Fivetran can get pretty expensive, and I think Airbyte is similar in terms of compute and overhead. But that's been largely solved. And maybe that's a foundation block we didn't have before that is required.
We need some sort of universal data model, at least for the different data sources, as a foundation for reuse in the T of the ELT. So having consistent EL is a prerequisite in some ways to having a reusable T. We could talk about other prerequisites. You mentioned dbt, and one thing we haven't seen enough of, I think, is the advent of reusable dbt projects, or even just a reference implementation of: hey, I built an AB testing compute engine using dbt, and here's the project. You can use it. Maybe it's parameterized.
So one question is, why haven't we seen more of these reusable dbt projects? I try to tackle some of that in the blog post. One reason is the complexity of managing different SQL dialects. Maybe I have a BigQuery implementation using the long tail of UDFs, or just the functions that are exposed for JSON processing or more intricate math and stats work, and these are subtly different across dialects. So now, as a data engineer, I have to care about other SQL dialects, and it's not super sexy or fun or interesting.
And it might not even be easy to test that your multi-dialect code is actually going to work on Snowflake or Redshift or something else. So I think that's a blocker in some ways. SQLMesh is kind of interesting; I don't know how good their support is for reusable parameterized projects either. Then there's the issue of parameterized pipelines. In my blog post, I talk about two things that are foundational. One is the idea of some sort of universal data model that you have to dump your data into, and that might be provided by Fivetran or Airbyte as a foundation. So that's one part, but also to have, say, a unified data model for healthcare data, or for CRM data, or for engagement on various products, call it a user-action-in-time type of universal model. Some are very intricate, some might be simpler.
But I think that's a foundation we don't necessarily have or haven't adopted. There have been different projects in the past to try to create these unified data models, so they may kind of exist, and they can be intricate. So that's one thing. And then we need some sort of parameterized pipelines, where I can say: deploy this AB testing framework, but here's my set of dimensions and here's how I want to cube the data, or here are some intricacies that are specific to my business. So it's trying to figure out what is universal and reusable: what is a constant and what is a variable? And then you would presumably have to write what I call parametric pipelines, that is, dynamically generated pipelines based on configuration around your business.
I could get into some of the examples; I've got some in the blog post. But it can't be one size fits all, we know that much. So for the things that are not one size fits all, you need to be able to parameterize your pipeline and instantiate it. And currently, to do that in dbt, you have to write a bunch of Jinja, and that's just a pile of mess. You're writing something highly dynamic, code that writes SQL, which is messy by nature. You have a declarative language and you're trying to bend it into being more dynamic, which is not what it was intended to do. So maybe the tool set is not really appropriate to make it easy for people to say: it's really easy for me to make this available for other people to use; I'm just going to parameterize it, write something more dynamic, and share it as a project that people can reuse. So, okay, there's a dialect problem, and there's the messy parameterized pipeline issue, on top of other problems we could get into, but it's probably a good place to start. We're already hitting some pretty major blockers here.
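To make the "code that writes SQL" idea concrete, here is a minimal sketch of a parametric pipeline written in plain Python rather than Jinja: the business-specific pieces (source table, dimensions, metrics) live in a configuration object, and a small function renders a daily rollup query from it. All table and column names here are hypothetical placeholders, not part of any existing framework.

```python
# A minimal sketch of a "parametric pipeline": the business-specific parts
# (source table, dimensions, metrics) live in configuration, and plain Python
# renders the SQL instead of layering Jinja inside a dbt model.
# Every name below is a hypothetical placeholder.

config = {
    "source_table": "analytics.events",
    "time_column": "event_ts",
    "dimensions": ["country", "platform"],
    "metrics": {
        "daily_active_users": "COUNT(DISTINCT user_id)",
        "events": "COUNT(*)",
    },
}


def render_daily_rollup(cfg: dict) -> str:
    """Render a daily rollup query from a pipeline configuration."""
    dims = ", ".join(cfg["dimensions"])
    metrics = ",\n    ".join(
        f"{expr} AS {name}" for name, expr in cfg["metrics"].items()
    )
    return (
        "SELECT\n"
        f"    DATE({cfg['time_column']}) AS ds,\n"
        f"    {dims},\n"
        f"    {metrics}\n"
        f"FROM {cfg['source_table']}\n"
        f"GROUP BY 1, {dims}"
    )


if __name__ == "__main__":
    print(render_daily_rollup(config))
```

The point is not the specific query, but the separation Max describes: the constants live in the rendering code, and the variables live in a configuration that each organization supplies.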
[00:14:00] Tobias Macey:
Yeah. There are definitely any number of different areas that you could focus in on and spend your entire lifetime trying to improve, and people do. But before I get too far along this thought, I also think it's interesting that there's the case of code focused versus SQL focused, where right now we've been focused very much on the SQL oriented data engineering approach of: I load all of my data into my warehouse, and then everything after that is SQL. Mhmm. There is also a significant cohort of: I load all of my data into my data lake, and then I write a bunch of code to process it and do different things.
And to some extent, those are converging where there's overlap, particularly in the data lakehouse architecture or the unification of batch and streaming for different use cases. But I think some of the challenge maybe comes from the fact that we are typically working at least one or two layers removed from the problem that we're actually trying to solve. We're trying to solve a problem for the business, but instead, we're trying to fine tune the block size of the objects that we upload to S3 to reduce the overall latency. So there's definitely something there that I think keeps us stuck and away from being able to do a lot of reuse of the logic that is actually solving the problem we care about.
And to some extent too, depending on who you talk to and where they're coming from, there's an argument to be made that the way that we're modeling the data is cumbersome and problematic, where we've been using these relational databases and relational theory to try to model the business. There's a whole camp of people who say, actually, everything is a graph, so model it all as a graph and your life will be amazing. But then there's the graph scalability problem. So I'm just wondering what your thoughts are on that juxtaposition of code versus SQL focused data engineering, and also some of the ways that the specifics of how we think about the structure of the data are maybe hindering our ability to reuse more of these capabilities across organizational and team boundaries.
[00:16:16] Maxime Beauchemin:
Yeah. That makes a lot of sense. I've got a lot to say here and unpack. One thing is the duality between, say, SQL and more conventional coding, call it declarative versus more dynamic. In reality, I think today that mostly means dbt versus maybe Spark or a DataFrame-type API, say dataframe.group_by and dataframe.filter. And it does feel like writing something like a parametric pipeline in SQL is the equivalent of the LAMP, PHP days, where you're doing something that works, but in the PHP days we were writing logic inside HTML files; you'd put an if condition within an HTML block.
And then at some point we kind of flipped to React, where it's more driven by JavaScript that generates HTML. I'm not sure if the parallel is that good or interesting, but back to the SQL take: I think SQL is extremely convenient and declarative, and it's not dynamic, which makes it a lot more accessible to more people, but it's a bit of a dead end. Maybe we don't realize at which point it just kind of prevents us from thinking like software engineers. These declarative languages are good for fairly static things; think Terraform for infrastructure as code. You don't want something too dynamic there.
But with SQL, given the complexity around dialects, the dialects are just close enough, similar enough, that it feels familiar when you switch from one dialect to another, but truly different enough that you can't practically share code across engines. It feels like a very convenient abstraction that becomes a bit of a dead end for analytics engineers looking to turn more software engineer or embrace even more of the software engineering practices. And there's been a lot of narrative around bringing software engineering practices to data engineering first, and then even to analytics engineering, but in some ways it's kind of limited to Git. And there are these macros: in dbt, you can write these Jinja macros, but they're nasty to write, test, and debug.
So that's a bit of a limitation there. It doesn't feel like dbt is the best place for code reuse. The package management is just a little awkward as a Jinja-first type of thing. So there is probably a parallel there with PHP. Let's see, what else were you touching on? I think there was much more to unpack here.
[00:19:32] Tobias Macey:
I think some of the other interesting avenues to go through are the underlying model of the data and how we think about it. Is it all just two-dimensional arrays? Is that really all we care about? And then we say, well, actually, two dimensions isn't enough to sufficiently capture the nuance of the problem that I'm trying to solve, so let me explode this into multiple tables joined across these different attributes, which then become cumbersome to evolve and maintain, and it becomes difficult for end users to figure out. So you have to pepper it with documentation and other helper functions, and maybe just hand them the spreadsheet that they actually care about. And I'm just wondering how you see that data model that we have gotten locked into for the past, what is it, 40, 50 years now?
And some of the ways that alternate storage methods, particularly in the graph space, can either supplement or potentially at least partially replace some of the ways that we do that modeling to improve on reuse.
[00:20:33] Maxime Beauchemin:
Yeah. Let's take it from the angle of dynamicity and data models. If we want to do code reuse, clearly we need part of the model to be dynamic. I get into that in the blog post in some ways, but your customer attributes at your company are not exactly the same as my customer attributes at my company, and we care about different sets of attributes. So how can we model this in traditional relational databases? It's inconvenient to do dynamic modeling, to be like, I'm going to ALTER TABLE and add columns dynamically.
It's not practical to do so, and it's not reasonable to do so. So one thing I talk about in the blog post, based on that premise of being in the traditional relational database world, is: could we identify the dynamic parts of the model and put them into this emerging class of fields that are more dynamic in modern databases? The extreme example is a JSON blob: in your customer table, you would have some sort of customer extra attributes field, and in there you could have as much complexity as you want to enrich your customer entity with whatever is unique to your business. And maybe the framework needs something like a customer type that's really structural and used by the framework, with some assumptions around: this is a paid customer, this is a free customer.
Maybe there's some logic attached to the customer type, and that's hard, something you've got to fit into when you want to use this parametric pipeline. But if you want to store all sorts of other customer attributes, there would be a place for that, all within a column, so you wouldn't have to have this dynamic model; I think the dynamic model is just not viable. Now, is a graph database or a document store or a key value store a better place? Ultimately, everything is kind of keyed; everything is a document, even in a relational database. It's all data and metadata.
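As a rough illustration of that split between structural and dynamic attributes, here is a small sketch of a customer entity with a framework-owned customer_type field plus a single extras blob that mirrors a JSON column in the warehouse. Nothing here comes from an actual framework; the field names are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Customer:
    # Structural fields a (hypothetical) shared framework would depend on:
    # every instantiation of the parametric pipeline must provide these.
    customer_id: str
    customer_type: str  # e.g. "free" or "paid"; framework logic keys off this

    # Dynamic, business-specific attributes live in a single extras blob,
    # mirroring a JSON column in the warehouse, so the shared model never
    # needs an ALTER TABLE when one company tracks different attributes.
    extras: dict[str, Any] = field(default_factory=dict)


acme_customer = Customer(
    customer_id="c-123",
    customer_type="paid",
    extras={"industry": "healthcare", "seats_purchased": 40},
)
```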
So whether you say CREATE TABLE and define columns, or you put that in Marshmallow, for people familiar with the schema library in Python, there are all sorts of different ways you could define a schema for your different entities, but clearly you're going to have some notion of an entity type, a schema for it, and an entity ID in all cases. And to me, whether it's modeled in the traditional relational way, as a physical table in a physical database, or whether it's stored in an object store with the metadata living in Git instead of in the database engine itself, is an interesting thing to think about, but it's kind of the same problem.
Except for maybe the properties of the different database engines and what they can do. Of course, in a graph database, you're going to be able to run certain types of queries in a different way. Workloads like give me all the descendants of this node and then find all the related objects, it's going to be really good at that. It might not be as good at doing a full scan or a join. So where to store it, to me, is less of a challenge. But in thinking about these parametric pipelines, what's really important is to define what is structural to the parametric pipeline, something you have to fit into, and which parts of the pipeline, or which of the different attributes or facts, are more dynamic.
That would be my extra customer attributes, or my extra transactional attributes, or the hierarchies that are specific to my business. And this would have to be parameterized and stored alongside the instantiation of the parametric pipeline so that we know what to do. Say you want the pipeline to cube the data in a certain way and generate some count distincts grouped by different combinations of dimensions. That would be something unique to your pipeline, built on top of the metadata about the extra fields that are unique to you. So there's quite a bit of abstraction and complexity there. Maybe the operator of the pipeline is like, what is this thing? What does it do? Thinking about these abstractions is pretty complex, and that's kind of an issue.
So maybe you're like, okay, I'm going to do my first AB testing compute pipeline. I'm going to use this toolkit or framework or reusable Airflow DAG or SQLMesh abstraction. Okay, where do I start? Maybe you pip install this, or you install the dbt package, and then you have to start looking at the YAML configuration of that pipeline and how you're going to do your own AB testing. You're going to have to load your data into a certain place, provide a bunch of parameters, and run this thing, and maybe that interface is too complicated a template for people. Maybe they need to go through the grind of writing the pipeline to be able to operate it down the line. I don't know. But it's quite a complex abstraction for any data engineer to internalize and try to work with.
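One of the more mechanical parts of such a parametric pipeline, the cubing Max mentions, is easy to sketch: given a list of dimensions supplied in configuration, generate a GROUP BY GROUPING SETS query covering every combination. The dimension names, metric expression, and source table below are invented for the example, not taken from any real project.

```python
from itertools import combinations

# Hypothetical configuration: which dimensions to cube and which metric to compute.
DIMENSIONS = ["country", "platform", "plan"]
METRIC = "COUNT(DISTINCT user_id) AS users"
SOURCE = "analytics.daily_events"


def grouping_sets_clause(dimensions: list[str]) -> str:
    """Build a GROUPING SETS clause covering every combination of the
    dimensions, including the empty set for the grand total."""
    sets = []
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            sets.append("(" + ", ".join(combo) + ")")
    return "GROUPING SETS (\n    " + ",\n    ".join(sets) + "\n)"


query = (
    f"SELECT {', '.join(DIMENSIONS)}, {METRIC}\n"
    f"FROM {SOURCE}\n"
    f"GROUP BY {grouping_sets_clause(DIMENSIONS)}"
)
print(query)
```

The dimensions are the variable part each business supplies; the generation logic is the constant that a shared framework could own.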
[00:26:28] Tobias Macey:
I think another interesting aspect of the state that we find ourselves in is the fact that data engineering as a discipline, for one, is relatively new. People have been doing data work for a long time, but data engineering as a specific title has really only been used actively for about the past decade, give or take. And the backgrounds of people who come into data are fairly wide. There are a lot of people who come from the systems side, where they start as a sysadmin who just has to keep things up and running, and so then, automatically, they have to deal with the data. There are a lot of people coming from DBA and business intelligence backgrounds. There are a lot of people coming in from data analytics backgrounds who then go further down the stack, more into the pipeline management.
[00:27:21] Maxime Beauchemin:
There are many other folks, too. The spreadsheet people, the spreadsheet wizards who realize they can have so much more impact if they work with a database; those can be pretty impressive too as they become data engineers. There are so many different paths and backgrounds and ways to get there.
[00:27:42] Tobias Macey:
And there is also a substantial number of people with a software background, but I don't think that it's as ubiquitous as in the web application development ecosystem, which has been purely software all the way along. And I'm just wondering how you see that variance in backgrounds in the data space impacting the ways that people think about investing in reuse and those software patterns.
[00:28:03] Maxime Beauchemin:
Yeah. Maybe I'll start with an interesting parallel with front end engineering. I think we've seen front end engineering over the past decade go from this not so respected discipline to becoming a true, legit software engineering profession. The perception in the past was: you're a webmaster, a web engineer, you write HTML and JavaScript, and you're not a true software engineer. The back end guys, the balding guys with ponytails writing JVM code all day, were like, okay, this guy just writes websites; he or she is sitting closer to the graphic designers than to us. But we've seen this transformation happen. It took a moment, and I'm not sure exactly what the key ingredients were, or the important milestones that allowed that discipline to get better.
If you push the analogy, and I'm deviating from the topic a little bit, and look at code reuse on the front end, you won't see that many templates. What you're going to see is frameworks and toolkits and abstractions over CSS and HTML, but you won't see much of: here's a marketing website, I'm just going to npm install marketing-base and parameterize it to fit my marketing organization. So maybe when you get into the things that are unique to your business, like the way you treat your data, or branding on the front end, it becomes really hard to reuse code. And with the template approach, you might take a marketing website template and modify the crap out of it, but you're not going to install a marketing website package and parameterize it. So maybe that points to building a better ecosystem of templates, and by that I mean reference implementations of these different pipelines, so that organizations or data engineers can say: I'm working on an engagement computation framework, or I'm looking to bring my HubSpot data into my warehouse and do something with it, and maybe I'll use some sort of template and fork it, as opposed to taking a package, parameterizing it, and extending it.
[00:30:48] Tobias Macey:
I think that's a useful analogy to build on as well, where in the front end ecosystem you have these templates, you have all this tooling, but all of the work that you're doing is inherently visible, because that was why you made it. You made it to show to your end users, and the way the web is structured is that when you load a page, you automatically have access to all of the code that went into it, to some approximation; it's getting less so now because of some of the protocol changes and the frameworks. And I think that in the data ecosystem, we're a little bit hampered by the fact that data inherently has this layer of security and shroud of secrecy applied to it, so you don't have it as visible for other people to be able to learn from. So all you have is your own past experience at other organizations to draw from, or what you're able to glean from people that you come in contact with through your own professional network and conferences, etcetera.
And so I think that has also acted as a limiting factor in terms of our ability to have these reusable templates and more broad reaching reference implementations to draw from, because it's rare that you actually have a fully spec'd out data warehouse using a star or snowflake schema with highly maintained, slowly changing dimensions, etcetera, etcetera. And so it's this more diffuse knowledge transfer rather than an easily referenceable way of doing anything in a concrete fashion. All we have are these abstract references, the Kimball book or the Data Vault book, of this is how you do it in an abstract sense. But if you want to do it in your own business, figure it out and good luck; pay me a bunch of money and maybe I'll help you.
[00:32:48] Maxime Beauchemin:
Yeah, one of the Kimball books was more project centered; I think it was called The Data Warehouse Lifecycle Toolkit. And that one was more like, okay, you're starting a data warehouse implementation in your organization, and how do you navigate the whole business, conduct interviews, identify the key stakeholders, and get to a bus matrix? So there's a whole recipe there that probably does not work anymore. It assumes a very waterfall type approach; it's not agile by any means, I think. It predates agile. But the cookbook aspect was kind of interesting. I like the parallel; the thing that's cool about the front end, as you identified, is that it's a lot more transparent, the code is in your face, so maybe that creates some unique dynamics for the front end in some ways. On bringing more transparency to what's happening, or sharing more: I wanted to bring up data gravity to describe an issue where data is not very modular. It wants to be together and mesh together, so you want to bring everything you know about a customer inside your customer table. I actually have a blog post that talks about entity centric data modeling, and it says basically you should think of your dimension tables as not just dimensional attributes, but bring in metrics too. So if you have a customer dimension or a product dimension, don't use it only for attributes of the thing. You can totally bring in metrics like the number of seats sold and the number of seats filled over 7 days or 28 days.
Some argue you can bring in all of the properties of the entity as attributes in your dimension, take snapshots, and so on. There are a lot of good ideas there, but it all points in the direction of bringing in all the attributes that come from all your different systems. So that means your CRM, your product usage, anything that's related to the customer, whichever system it comes from, you want to bring it all together. That data gravity, I think, makes it such that data pipelines are a big tangled mess.
Everything's connected with everything, and that makes it harder to say: I'm actually going to share something that processes event data and that really doesn't touch all the other dimensions or attributes of my customer or my user. So that's one thing. And then there's something about sharing templates or reference implementations. We added dbt at Preset, and over the years we've grown a very complex transformation layer with maybe 300 or 400 data models. And I wish I could just open source it, not so that people would use it, but as a reference implementation.
But it's always a little shameful, and a little bit of a mess, because these things grow like a big ball of duct tape. I don't know exactly what the analogy is, but it just kind of piles up on itself and becomes extremely complex. It feels like tech debt from day one, so it's not something that you're proud of. It's really hard to make something you're proud of using templated SQL and share it in the open. Plus, there are the intricacies of privacy, or not even privacy, security issues: there's probably a thing or ten in that dbt pipeline that I wouldn't want to share, and it would be a little dangerous.
And I don't even know what those things are. So the shareability of these constructs is not good, because of the dialects and because of the nature of it being messy and tangled up and complicated. I don't think I identified that in the blog post: the self perception of the value of the code. You're like, I don't want to put that in the open; this is dirty laundry.
[00:36:54] Tobias Macey:
Keep the curtain closed, everybody. Yeah. Absolutely. For what it's worth, the code repository that I'm building for my day job we actually do have open source. It's not necessarily a useful reference implementation, but people can look at it. But I do think that is probably one of the biggest hindrances in our data ecosystem: to your point, the gravity of the data, where moving it to a thousand different places to make it more componentized and modularized goes against the physics of the problem. And the fact that there is so much security layered on top of it, and the issues of privacy, and just the attitude of I don't want to risk it, so keep it under the covers. It hinders the distribution of that knowledge.
So that knowledge doesn't diffuse much beyond the books that people invest in, which are, by nature, abstract and not directly applicable without doing some of your own translation work, which is where all the complexity comes in.
[00:37:55] Maxime Beauchemin:
Yeah. That'd be valuable work for someone to do, to create templates and share them as incomplete ideas, but reusable ones, not necessarily as something you can go and run with and get value from instantly, but more like, maybe it's the author of a book about practical data engineering saying: okay, here's an example of a dbt pipeline that computes all the engagement and growth accounting metrics that you want for your product, and here's how you would extend it. That stuff doesn't seem to exist. I'm curious about the repo that you say you opened up with your own dirty laundry, or maybe not so dirty laundry. Is that in the form of a dbt project, or a Spark pipeline, or what does that look like?
[00:38:43] Tobias Macey:
So the repository right now primarily consists of some Dagster pipelines and dbt models.
[00:38:49] Maxime Beauchemin:
Gotcha. And then what kind of data are you processing? Is it product information, or...
[00:38:56] Tobias Macey:
It's primarily information about educational materials and how people are engaging with them. For my day job, I work at MIT in the Open Learning department, so it's all non-degree material for learners outside of the bounds of MIT, so not MIT students: MOOCs, OpenCourseWare material, professional education. So it's figuring out what the enrollments are, what the learner engagement patterns are, what courses we have on offer, etcetera, etcetera, and then being able to report that into the business.
[00:39:30] Maxime Beauchemin:
That makes sense. Yeah. That pattern, if you think of it in terms of verticals, points to an opportunity for reuse. I talk about that a little bit in the blog post too: what are some of the business verticals that are highly common, that data engineers build over and over, where the schema and the model are shared and maybe not that much customizability is needed? What are some data models that are very simple in terms of the input but can be complex in terms of the output? And to me, it seemed like product, what I would call a user-action-in-time or event pipeline, would be extremely useful, because everyone can model some of their data as timestamp, user ID, event type, and then extra attributes. And if you have something like that, you can compute all sorts of things: actions in time, actions per user, frequency of action, clickstream, funnel analysis, all from a very simple model. So maybe there's an opportunity in that area. And you've seen that productized a little bit; when I was saying there's nothing really reusable that has been shared, if you look at tools like Amplitude, there are a lot of tools out there that are mostly about user actions and events in time.
So I could see an opportunity for a pipeline there. I've been meaning to share one that I call EGAP, the engagement growth accounting framework. I have a dbt project; I don't think I've made it public, because I never felt like it was good enough to share. But I've been meaning to do this. I submitted a talk for the Coalesce conference, like, 2 years ago and it wasn't accepted for whatever reason, so it hurt my feelings a tiny bit, and then I was like, okay, I'll just do that some other time, and never got to it. But the premise with this engagement growth accounting framework was: you load your data into a well defined structure that's just timestamp, user ID, action or event type, maybe an event subtype, and some sort of extra field with other JSON you might want to use down the line. And then, based on that, you parameterize it by saying: I want DAU, MAU, weekly active users, these cohort tables.
Maybe there's a sub-module for experiment exposure, like who was part of which experiment at what point in time, so we could do some AB testing, kind of bolted on top of it. And it seems like it would be immediately useful to people, but it's kind of hard to get to sharing it, because I wrote it in BigQuery in dbt and people couldn't use it out of the box. It's kind of hard in dbt to use a YAML file to parameterize that pipeline. I mean, it's not that hard, but it's a little bit intricate to figure out exactly what the parameters of the pipeline should be. But maybe it's an experiment to look forward to. I'm curious to try to put it out there and see if I can get 10 or a 1,000 or 10,000 stars on GitHub. It seems like 10,000 stars material, you know?
Every company is doing that in some capacity.
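For readers who want a feel for what a growth accounting computation involves, here is a small pandas sketch over the simple user-action-in-time shape described above (timestamp, user ID, event type). It classifies each month's active users as new, retained, or resurrected and counts churn from the prior month. This is a toy illustration on invented data, not the EGAP dbt project Max describes.

```python
import pandas as pd

# Toy event log in the "user action in time" shape: timestamp, user ID,
# event type (extra attributes omitted here).
events = pd.DataFrame(
    {
        "event_ts": pd.to_datetime(
            ["2024-01-03", "2024-01-20", "2024-02-10", "2024-02-11", "2024-03-05"]
        ),
        "user_id": ["a", "b", "a", "c", "b"],
        "event_type": ["click", "click", "view", "click", "view"],
    }
)

# Month-level growth accounting: classify each active user as new, retained,
# or resurrected, and surface churned users from the prior month.
events["month"] = events["event_ts"].dt.to_period("M")
active = events.groupby("month")["user_id"].apply(set).sort_index()

seen: set[str] = set()   # every user ever active so far
prev: set[str] = set()   # users active in the previous month
for month, users in active.items():
    new = users - seen
    retained = users & prev
    resurrected = (users & seen) - prev
    churned = prev - users
    print(
        f"{month}: active={len(users)} new={len(new)} retained={len(retained)} "
        f"resurrected={len(resurrected)} churned={len(churned)}"
    )
    seen |= users
    prev = users
```

The warehouse version of the same logic would typically be expressed as SQL over cohort tables, but the state transitions (new, retained, resurrected, churned) are the same.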
[00:43:04] Tobias Macey:
Yeah, it's definitely great to hear you suggesting to put that out there. I encourage you to do so, and I'll keep an eye out for it. And I think that this conversation wouldn't be complete without some mention of generative AI and LLMs. I'm wondering what your thoughts are on the ability of this current cycle of innovation, experimentation, whatever you want to call it, to be a potential means of doing the knowledge capture and distribution that is currently done behind closed doors, federating that a bit more and making it easier for people to say: this is the business problem I'm trying to solve.
You generate the first pass at that, and then I'll fine-tune it as we go.
[00:43:49] Maxime Beauchemin:
If we want to enable intelligence in general to do great things with reusable code and templates and frameworks, I think it's kind of the same whether you're trying to enable intelligent humans or artificial intelligence: you need good docs, good references, good naming, things that are intelligible and that humans or AI can work with. One thought is, if you put these templates and frameworks out there and they're publicly available for machines to train on, as opposed to RAG on, that's one reason I really think open source is going to get more help from AI than closed source, because it's just out there. The models are training against Apache Airflow, and they don't have access to the Power BI code base to come and help out, or answer questions, or write PRs, or answer issues. So we've seen the rise of these AI bots that are effectively trained on the Superset code base and can help us.
So, in saying that, maybe the AI could work very well with these reference implementations and templates. You could say, okay, this EGAP thing I was talking about, the engagement growth accounting framework, it's out there; all the bots know about it, Claude knows about it, GPT knows about it. And when I ask, can you help me instantiate that in my repo, maybe it can take you 80 or 90 percent of the way there. Would the AI help us get to that abstraction? Probably. Would it help us bring this abstraction into useful, production territory?
Probably. But the more that is shared, the better for the advancement of intelligence, regardless of whether it's artificial or not.
[00:45:45] Tobias Macey:
Yeah, it's definitely interesting to see how generative AI has already started consuming software engineering in different capacities and how it's influencing the way people think about how they get their work done. So I'm excited to see what potential positive impact it can have on the data engineering space. And as you continue to invest in this space and try to build and promote this idea of code reuse, I'm wondering what some of your hopes and predictions are around that potential future.
[00:46:19] Maxime Beauchemin:
Yeah. A lot of hopes; I don't know how much prediction. For me, what I do is try an AI-first reflex for just about everything I do: new task, new workload, new workflow, something creative, something repetitive. I always try to get the AI to help me out, and I've been having various levels of success. I gave a talk at the Airflow Summit and at OSACON, the same talk, around using AI as a practitioner over the past 2 years. As a founder and as a data practitioner, I've had this AI-first reflex and then some reflection as to what's working, what's not working, and what I have good hopes for in the future. So there's an entire talk on that that could be interesting.
One thing I tell people is that I encourage them to have that AI-first reflex for just about everything, and then give it more than just one shot. If you just ask, can you do this for me, the AI is not helpful without the context; you have to provide more context and give it what it needs to help you out. I think it's a critical thing to do to remain relevant. If you skip a beat or two on that, you can find yourself far behind pretty quickly. It's quite easy to catch up if you're behind, but really figuring out how you can team up with this kind of infinite resource, this thing that can really help out with a lot of things, and figuring out what works and what doesn't for you, I think everyone should be doing it. We shouldn't wait for the products to do it for us. Individually, I think you've got to figure out how it can help you out every day.
[00:48:12] Tobias Macey:
Are there any other aspects of this problem space that we didn't discuss yet that you wanna touch on before we close out the show?
[00:48:19] Maxime Beauchemin:
No, I think it's maybe a call out for people interested in the space. It turns out people have been thinking about this for about 20 years; when I started in business intelligence and data warehouse architecture, people were already thinking about these things. I'd be curious if there's a way for people, maybe on Twitter or LinkedIn, to start a thread if you've seen things that you think are interesting around code reuse and how to go about it, and then maybe encourage people to put their reference implementations out there, to say: I've written an AB testing framework or pipeline as a dbt package or a SQLMesh project, and maybe I forked it for my own use, but here's how I got started, and here's the reference implementation.
Maybe there's an opportunity for a small community to form around all the people that are building that specific vertical, and they can share experiences. So it's more about sharing what you did, as opposed to writing a framework that everyone should just pip install and use. But maybe that's another aspect of open source, and it might be a required step towards the more reusable framework, because first we've got to look at each other's work to understand what's common and what's different.
And maybe only then can we start building a framework that works for 70 or 80 percent of the people building the same thing.
[00:49:55] Tobias Macey:
Absolutely. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And usually I ask about what you see as being the biggest gap in the tooling or technology for data management, but I think that's what we just spent the past 5 minutes or so talking about. So I just want to say thank you for taking the time today and for your thoughtfulness in this space and all the contributions that you've made, and I hope you enjoy the rest of your day.
[00:50:14] Maxime Beauchemin:
Same here. Thank you for everything you do here. I think having this medium of exchange is much better than, say, just a blog post. The blog post is more structured, but this is a lot more interesting, to really go and explore and talk about it. So I love this long form, chatty way to approach and explore the problem space. It's been great.
[00:50:52] Tobias Macey:
Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Challenges of Reusability in Data Pipelines
Opportunities for Code Reuse
Trends in Data Engineering
SQL vs Code Focused Data Engineering
Data Models and Reusability
Backgrounds in Data Engineering
Challenges in Sharing Data Knowledge
Potential of Generative AI in Data Engineering
Future of Code Reuse in Data Engineering