Summary
In this episode of the Data Engineering Podcast, Lukas Schulte, co-founder and CEO of SDF, explores the development and capabilities of this fast and expressive SQL transformation tool. From its origins as a solution for addressing data privacy, governance, and quality concerns in modern data management, to its unique features like static analysis and type correctness, Lukas dives into what sets SDF apart from other tools like dbt and SQLMesh. Tune in for insights on building a business around a developer tool, the importance of community and user experience in the data engineering ecosystem, and plans for future development, including supporting Python models and enhancing execution capabilities.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
- Your host is Tobias Macey and today I'm interviewing Lukas Schulte about SDF, a fast and expressive SQL transformation tool that understands your schema
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SDF is and the story behind it?
- What's the story behind the name?
- What problem are you solving with SDF?
- dbt has been the dominant player for SQL-based transformations for several years, with other notable competition in the form of SQLMesh. Can you give an overview of the venn diagram for features and functionality across SDF, dbt and SQLMesh?
- Can you describe the design and implementation of SDF?
- How have the scope and goals of the project changed since you first started working on it?
- What does the development experience look like for a team working with SDF?
- How does that differ between the open and paid versions of the product?
- What are the features and functionality that SDF offers to address intra- and inter-team collaboration?
- One of the challenges for any second-mover technology with an established competitor is the adoption/migration path for teams who have already invested in the incumbent (dbt in this case). How are you addressing that barrier for SDF?
- Beyond the core migration path of the direct functionality of the incumbent product is the amount of tooling and communal knowledge that grows up around that product. How are you thinking about that aspect of the current landscape?
- What is your governing principle for what capabilities are in the open core and which go in the paid product?
- What are the most interesting, innovative, or unexpected ways that you have seen SDF used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SDF?
- When is SDF the wrong choice?
- What do you have planned for the future of SDF?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- SDF
- Semantic Data Warehouse
- asdf-vm
- dbt
- Software Linting
- SQLMesh
- Coalesce
- Apache Iceberg
- DuckDB
- SDF Classifiers
- dbt Semantic Layer
- dbt expectations
- Apache DataFusion
- Ibis
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time right at the source. Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production?
Learn more at dataengineeringpodcast.com/datafold today!
[00:00:48] Lukas Schulte:
Your host is Tobias Macey. And today, I'm welcoming Lukas Schulte to talk about SDF, a fast and expressive SQL transformation tool that understands your schemas. So, Lukas, can you start by introducing yourself? Hey. Hey, Tobias. I'm happy to be here. I'm Lukas. I'm one of the co founders and the CEO of SDF. And today, I'll be telling you a little bit about what we've been building over the last couple of years, and why we're very, very excited about it. And do you remember how you first got started working in data? Yeah. Vaguely. So I originally got interested in data, maybe through a peripheral pathway. I studied electrical and computer engineering, and I was primarily interested in sensor systems, and sensors collect a lot of data.
And so my interest in sort of data systems and analysis and understanding how data can be used to express real world systems kind of came from there. So after college, I joined a sensors team at Microsoft.
[00:01:47] Tobias Macey:
I was primarily working on sensor systems there. That was all sort of traditional analytics that was less ML oriented than it might be today, but that was sort of my start. And now bringing us to SDF, can you give a bit of an overview about what it is and some of the story behind how it got started and why you decided that this was worth your time and energy? Yeah, for sure. So SDF
[00:02:09] Lukas Schulte:
came about, maybe to go back a little bit further, those sensor systems that I was talking about, at some point, they all became very ML oriented. Machine learning kind of took over what traditional algorithms have been doing for a long time. And I found myself in that world, building out data infrastructure for a company that was building creative tools on top of various computer vision algorithms. And so we built out this whole like data collection paradigm there, built out a team for labeling ground truth data, created multiple ML models, tried to make them small so they fit on devices. Anyways, the data loads were growing there. And I also live in Los Angeles, which means that, of course, this turns into a social media related enterprise.
And so, of course, we all of a sudden had user data, and I think as any company that grows is wont to do, the folks on the ML side want to start using user data. The company starts having quarterly board meetings, so we need a modern data stack on top of the user data as well. And then the company grows even more, and there's GDPR and CCPA. So this stack that started relatively small and contained grew very, very quickly, and I think it quickly became clear that we kind of lost the plot. And we realized that we needed some system to understand data from ingestion to consumption. And that system didn't really exist. Dbt was, you know, and I think in large part still is, the best system that exists for this kind of stuff today. So we started using that. But at the same time, I was fortunate enough to talk with my now 2 cofounders, Michael and Wolfram, about what they were building at Meta. Wolfram was one of the chief architects of Meta's data warehouse. And at the time, he was building a system for static analysis of SQL at scale, to understand exactly why and how data moves from ingestion to consumption and what impact that has on data privacy concerns, data governance concerns, data quality concerns, how that can improve the developer experience, and so on and so forth. And I think, you know, SDF was born out of the realization that if the Series A, Series B companies that are starting to build out their data stacks have some of these challenges and, you know, the largest data consumers in the world, like Meta, have some of these same concerns, that means there's probably a cross cutting reason to start working on this. And that was sort of the genesis of SDF. It's a very enigmatic name. Wondering if you could unpack it a little bit. I don't know, everybody loves 3 part names. Right? I think.
So Wolfram at Meta coined the term, the semantic data warehouse, because that was sort of the goal, I think, also after some of the data privacy issues that Meta had in the late 2010s. The goal was to build a semantic data warehouse, where you had a true understanding of what transformations acted on what types of data throughout everything. And this is what Wolfram was working on, and we needed SDF to be more pluggable into other different systems, hence fabric. And then we saw that those were also the perfect three letters on the keyboard if you're working on a command line utility, and we thought, you know what? This is absolutely perfect. So semantic data warehouse turned into semantic data fabric. And now, hopefully, you know, those are our three favorite words on the keyboard.
[00:05:07] Tobias Macey:
And I guess it's a natural progression that you would be able to manage your versions of SDF using the ASDF VM. That would be that would be
[00:05:17] Lukas Schulte:
we should do that, honestly. We should start looking into that.
[00:05:21] Tobias Macey:
And so you started touching on this a little bit, but what is the core problem that you're solving with SDF and in particular as it's juxtaposed
[00:05:31] Lukas Schulte:
with the most notable I don't know if competitor is the right term, but alternative in the ecosystem in the form of DBT. So, yeah, maybe rather than answering the question super directly, I'll answer with a little bit of a story, which mirrors, you know, how data engineers work in teams today. So let's say you're a data engineer at a midsize company. There's probably a few more technical folks and then a larger number of slightly less technical data analysts. Likely you're working in dbt. Your workspace is now, you know, hundreds, if not thousands, of models. Compile times take a long time to complete. Dependencies are harder to manage, right? It's difficult to, you know, delete a column in a model here or change a schema and understand exactly what changes downstream, pipelines break, and, you know, maybe worst of all, debugging is pretty painful because, at least if you're operating on dbt, if you run a command, Snowflake only sees the resulting query after Jinja expansion and macro expansion and all the configuration takes place. So the error you get is actually the Snowflake error on that resulting query, not an error on the SQL file that you're working on on your laptop. And this is pretty strange, especially in comparison to software engineering, where a lot of these hard parts are solved, right? There's a lot of talk about column level lineage. Compilers have had linkers for like 30 years, right? And a linker just manages dependencies between files, which is exactly what you want for SQL and data as well. There's really great static analysis tools, really great linters that are pretty highly performant. And I think what we want to do is bring a lot of that tooling, right, those really, really top notch software development experiences, into the realm of data engineering as well. So we want, you know, more static analysis, more guarantees in CICD, more verification through automated systems rather than, you know, manually running a query, changing a column name, changing an aggregation, and kind of seeing what happens. And we think this is possible. It's really hard to do because SQL dialects are so differentiated between data warehouses, but this is really our goal. Right? This is, like, SQL validation and transformation, especially at scale. It was a very long winded answer, but there you go.
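To make that debugging gap concrete, here is a minimal sketch with invented model and column names; the templating shown is ordinary dbt-style Jinja, and the expanded statement is roughly what the warehouse actually receives.

```sql
-- The file the analyst edits (dbt-style model; names invented for illustration):
--
--   -- models/daily_orders.sql
--   {{ config(materialized='table') }}
--   SELECT
--       order_date,
--       COUNT(*) AS order_count
--   FROM {{ ref('stg_orders') }}
--   GROUP BY order_date
--
-- After Jinja and macro expansion, the warehouse sees something like this,
-- and any error it raises points at the expanded text below, not at the
-- line numbers in the file on your laptop:
CREATE OR REPLACE TABLE analytics.daily_orders AS
SELECT
    order_date,
    COUNT(*) AS order_count
FROM analytics.stg_orders
GROUP BY order_date;
```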
[00:07:39] Tobias Macey:
No. It's it's definitely helpful to get that context. And you've already mentioned DBT. They have been the dominant player in this SQL oriented transformation market for a while. They helped to actually create that market and the idea of the analytics engineer. Another notable entrant in this landscape in the past couple of years has been SQL mesh that is taking aim at dbt and focusing on that ability to actually parse and understand the SQL AST. And I'm wondering if you can give a bit of an overview about how you view that Venn diagram
[00:08:15] Lukas Schulte:
of features and functionality across SDF, DBT, and SQL mesh, and any other tools that you feel like throwing in the mix? So, yeah, maybe maybe the first thing to call out. Definitely, I want to be very, very complimentary of all the the tools and companies in the space. I think, ultimately, we're all circling the same drain and trying to make the world of data engineering better for everyone as a long time user of dbt. And I think what they what they've done especially has been really transformational in in the world of data. So that's maybe that's maybe step 1. And I think to that end, a lot of these tools have, just slightly different approaches, right? There's another tool there that's, I think been gaining a lot of popularity amongst Snowflake users is Coalesce. And, they have some tooling, but for them, it's very much oriented towards, visualization and making the drag and drop experience as good as it gets. And that works for some folks. I think the the folks at SQL mesh have really taken dbt to, you know, the the next level.
Some of the things that they've done around virtual environments are pretty cool. But again, it's a different authoring framework, and I think the the approach there is very opinionated in terms of, hey, you know, this is this is sort of, you know, the ideal way to write a data warehouse. We've been trying to be a little bit less opinionated and work more with SQL dialects where they are and work with DBT projects where they are. And I think our goal here has been less to to work on the authoring surface and more on the engine, really sort of showcase, the capabilities that we have on the SQL understanding side and what we can do once we have executable semantics for some of these proprietary dialects and actually, you know, slowly inch our way into the realm of native compute as well. So in short, I think different different approaches, different things will work for for for different companies. We're obviously really excited about the static analysis capabilities that we've developed and and what those enable us us to do. Now digging into
[00:10:13] Tobias Macey:
the implementation and design of SDF, I'm wondering if you can talk to some of the ways that you're thinking about the core functionality, the ways that you are tackling the complexities of transformation, and in particular being able to scale to those hundreds or thousands of models and some of the engineering challenges that you've had to address as part of that effort? Yeah. So I think there we've taken maybe a fundamentally different approach than I think anybody else. So we built from the ground up in Rust and
[00:10:47] Lukas Schulte:
tried to build full grammar descriptions for the SQL dialects that we support. And that's been quite, quite an undertaking, but it allows us to have very, very precise, not just SQL analysis, but SQL validation. So I think maybe my favorite example here is at the moment, I think the only entity that can tell you whether you're missing a comma in a Snowflake SQL query is Snowflake itself, right? Like you have to send that query to Snowflake and then Snowflake compiles that query and says, hey, yes, you're missing a comma here. Or if a function takes a VARCHAR instead of an integer, only Snowflake can tell you that. And this is so counterintuitive if you're used to software engineering where, you know, you write whatever, you have a TypeScript compiler, or you have Cargo, right? You have all these local compilation tools that tell you everything that you need to know at compile time, that give you compile time guarantees on your laptop, that we, I think, have finally unlocked at scale. And the reason I say at scale, Rust has allowed us to do a lot of really cool performance optimizations.
It's allowed us to go really, really deep, build in type binding and other static analysis toolsets, and still have the result be relatively performant. So I think at this point, one of our benchmarks, we have an SDF workspace with 16,000 SQL files that are on average 500 lines long or have on average 500 columns. And compiling that, getting all the column level lineage, takes, like, less than 45 seconds on my laptop, and I'm pretty happy with that performance.
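As an illustration of the class of error being described, here is a small sketch; the table and column names are invented. Nothing warehouse-specific is needed to spot the problem, only the schema and the function signatures, which is exactly what a local, schema-aware compiler can check.

```sql
-- Invented schema, for illustration only.
CREATE TABLE raw.users (
    user_id    INTEGER,
    email      VARCHAR,
    created_at TIMESTAMP
);

-- Today this is typically only rejected by the warehouse at run time;
-- a schema-aware local compiler can flag it before the query is ever submitted.
SELECT
    SUM(email)              AS broken_metric,  -- type error: SUM over a VARCHAR column
    COUNT(DISTINCT user_id) AS users
FROM raw.users;
```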
[00:12:20] Tobias Macey:
To your point of having to send the query off to the cloud warehouse and wait for it to tell you what you're doing wrong, there's a little bit of a parallel happening as more and more of software architecture and infrastructure is cloud native or requires the cloud, particularly if you're using some of the core cloud provider services versus the open source alternatives. So definitely a similar struggle happening in that regard for kind of the architecture and infrastructure piece. But from the pure software layer, it's definitely true that we have gotten used to being able to lean on our tooling to be able to tell us early and often what we're doing wrong.
[00:13:01] Lukas Schulte:
Yeah. Absolutely. It does seem like the reason that the cloud vendors are so excited about being able to require cloud connections for more features is that it adds to their vendor lock in. Right? Like, it adds to their ecosystem moat in a pretty big way. And I think that's the reason why things like Iceberg have been so exciting in the last couple of years, because they, you know, for the first time, give companies a little bit more of an out where they can say, you know what? Hey. There's, like, another opportunity. I can use another query engine to run a query against this data.
[00:13:28] Tobias Macey:
Yes. Data gravity is a very real point of leverage.
[00:13:32] Lukas Schulte:
Yeah. It's enormous, I think. We've talked to companies that have multiple cloud vendors just so that they have negotiating leverage when their contracts come up for renewal. Right? Like, that's how enormous data gravity is to an enterprise.
[00:13:48] Tobias Macey:
And as you have been working through the development and early stages of onboarding and working with some of the early adopters of SDF, what are some of the ways that the scope and goals of the project and the business have changed in that time?
[00:14:05] Lukas Schulte:
They've changed in almost every way. I will say one thing that I'm happy about: at all of our all-hands, I get to sort of show some of the mission statements that we put together over 2 years ago at this point. The nice thing is the mission hasn't changed, but I think the implementation has changed quite a bit. So one thing that's changed pretty dramatically is initially, we were entirely focused on static analysis, and we had this pipe dream that we could maybe think about execution someday. But for us to be able to think about execution, we have to have, you know, these fully resolved logical plans that you can actually send to a query engine. And we didn't know if we were actually going to be able to get there. And I think in the last sort of 3 months, we've turned the corner. I think we now have pretty strong executable semantics for the SQL dialects that we support. And that means that we're really, really excited about our ability to support some of that, you know, we call it, like, query engine emulation, so you can actually run, you know, Snowflake queries locally on your laptop. And I think that's a really cool capability. And if it scales, I think it could be quite powerful.
[00:15:06] Tobias Macey:
And to that point too, I know that in looking through the documentation, you're also using the Arrow DataFusion capability for being able to actually do some fully local execution, and that's definitely a very different set of functionality than something like dbt that is wholly reliant on whatever the query engine is that you're targeting, where if you wanna do something locally, then you're probably using DuckDB. Otherwise, you're relying on some of these other engines. I'm wondering if you could maybe just break out some of the features that help to differentiate SDF and some of the ways that those fit into the overarching workflow for people who are using that tool to accomplish their, you know, engineering and business goals?
[00:15:50] Lukas Schulte:
That's a great question. So so, yeah, maybe maybe to the point about dbt. Dbt operates on on strings. Folks, like SQL Mesh, they'll take it one step further and look at look at ASTs. And our goal is to have sort of a a fully, you know, type safe implementation, which means that at every sort of node in the transformation layer graph, we know all of the types that are incoming and expected as output from every transformation. So if you run an SDF compile on a single model, you'll see that SDF knows not just, you know, the column names, but also exactly their data types. And once you have that, you can actually start adding additional type information on top. So you can start working with classifiers and higher level type objects rather than just VARCHARs and integers. And from a developer driven data governance standpoint, this is super exciting because you can finally tell your data warehouse sort of at ingestion, hey, here's where my PII is. Here's where my canonical definition of a daily active user is. Here's, you know, all columns that have social security numbers have this type of retention policy, and you can start automating a large part of that system that was otherwise manual, as it relates to, higher level data types, because VARCHARs are incredibly expressive, but not all VARCHARs are the same, right? And it's hard to treat them differently in the current transformation layer ecosystem.
And we're really, really excited at sort of the extra level of capabilities that type correctness gives you. So adding data classification tools is definitely one of the really, really exciting things that you can do when you have type correctness.
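As a sketch of the idea, with hypothetical classifier labels shown as comments rather than SDF's actual configuration syntax: physically these columns are all VARCHARs, and it is the higher-level types layered on top that make automated policy checks possible.

```sql
-- Invented table; the classifier annotations in the comments are hypothetical.
CREATE TABLE raw.signups (
    user_id    VARCHAR,  -- classifier: COMPANY_A.USER_ID
    email      VARCHAR,  -- classifier: PII.EMAIL  (mask in downstream models)
    ssn        VARCHAR,  -- classifier: PII.SSN    (retention policy applies)
    utm_source VARCHAR   -- unclassified: a plain VARCHAR is fine here
);
```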
[00:17:26] Tobias Macey:
Yeah. The extra classifier capability is something that I found particularly compelling because, as you said, there is a lot of nuanced detail that is very easy to lose track of or very difficult to surface, where maybe you can write it into a docstring somewhere, but then you're relying on human operators to actually read all of that, parse it, understand the real impact of that, versus being able to actually put business rules around that type information, around those higher order details of things like PII or some of the business semantics. And in particular, I'm wondering how you see the classifier functionality in comparison to what happened a couple of years ago with the idea of the, what do they call it, the metrics layer, and some of the ways that that has actually been
[00:18:24] Lukas Schulte:
realized in terms of the technical implementations, and in particular maybe the dbt semantic layer, since that's what most folks are probably working with. Yeah. I think metrics and semantic layers, I see them more at larger scale companies, but I think the need for something in that space is real. So I'm curious, maybe for your projects in the data warehouse that you work on most, do you know approximately, like, what percentage of columns are just VARCHARs in that data warehouse? Not offhand, but I would presume the majority. Okay. Yeah. That sounds about right. Yeah. What we see is it's something like 50% of all columns that we encounter are just VARCHARs, which is extraordinary for the amount of, you know, different and very nuanced information that those columns actually hold. And I think there's an example here where there's a very large enterprise that bought another company. And one of the data analysts started joining user IDs on user IDs. Seems like the most mundane and normal, you know, join operation in the world if you're working a lot with users and user IDs. But it turns out that one set of user IDs was from the original company, and the other set of user IDs was from the company that that company had just bought. So they were actually completely disjoint sets of user IDs that had some overlap because, you know, they're all, like, integers or something like that. And the company didn't catch this for months on end, and it was very, very expensive to rectify because they had to do a whole bunch of backfilling. But my point here is even just calling a column user ID doesn't make it the same user ID. Right? Like, there's a need, especially when you're working at scale, to work with higher level types. And, you know, our goal with our type system is to sort of allow you to say, hey. This is, you know, company 1 user ID. This is company 2 user ID. And you can create a little rule. Right? We call it business logic as code that says, hey. You can never join those 2. And if you do join them, you know, flag a warning. So that capability is, I think, super critical. I know it's important for everything from, you know, user IDs to privacy concerns and data governance concerns, but also metrics. We've seen these classifiers used in a few different ways that were very not obvious to me in the beginning, and we're really, really excited to see what it is that people actually use them for. We'll see if it turns out to be a replacement or an addition to, you know, a traditional metrics layer.
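A purely illustrative reconstruction of that incident, with invented schema and table names: both columns look identical to the engine, so the join runs without complaint even though the two ID spaces are unrelated.

```sql
SELECT
    a.user_id,
    a.signup_date,
    b.last_login
FROM company_one.users AS a
JOIN company_two.users AS b
    ON a.user_id = b.user_id;  -- a "business logic as code" rule over higher-level
                               -- types (COMPANY_ONE.USER_ID vs COMPANY_TWO.USER_ID)
                               -- could flag this join instead of letting it
                               -- silently succeed
```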
But it's been exciting to see, yeah, what people use these things for so far. Digging a bit further into the developer experience, the workflow,
[00:21:01] Tobias Macey:
particularly as you start to scale the usage and in particular for some of these enterprise class use cases that you've touched on where you're spanning multiple different teams, possibly even completely different business units or organizations or and some of the ways that SDF helps to support some of those very kind of fractal use cases where you maybe have some various handoffs and maybe those handoffs aren't always very clearly defined. And, also, given the fact that SDF as a product is open core and has paid features, some of the ways that those team oriented capabilities differ between the open and the commercial offerings.
[00:21:45] Lukas Schulte:
From a philosophical standpoint, our goal is that, you know, essentially, what you run locally on your machine is free to use, and what happens in the cloud is typically a service that we have to manage and therefore is paid. We also haven't really tried to change the authoring surface, so there's no GUI; it's a command line engine. But that engine, because it's written in Rust, is just a binary that you download. So the idea was if you are working in a remote VS Code workspace or a GitHub Codespace, you can install SDF into that environment, and it just works. There's no other dependencies. There's no, you know, Python or virtual environment that you have to manage. You just install the binary and go. And the other goal from sort of a static analysis viewpoint was if you have, you know, some analyst who decides to write a new model, that they can take an SDF workspace, add a model to it. And if SDF compiles that model correctly and there are no errors spit out by the compiler, that model should be good to go, and you should be able to integrate it into your production pipelines without breaking anything. That's sort of the guarantees that SDF tries to make to the workspace owners. Additionally, I think from, like, a using the tool standpoint, a lot of functionality actually mirrors dbt. Right? There's dbt compile. SDF compile does most of the same stuff that dbt compile does, but also gives you some of the static analysis capabilities and does type checking and, you know, builds out column level lineage and so on and so forth. There's, you know, dbt run and SDF run. Again, similar functionality. Dbt test, SDF test, again, similar functionality.
So so the goal here really is to, again, meet people where they are in their,
[00:23:25] Tobias Macey:
authoring workflow. Another interesting parallel that I'm curious to hear your answer to as far as the development workflow and the interfaces is that with dbt, maybe it was a year or 2 ago that they started offering the capability of Python models, which aren't available across all execution engines, but give you the ability to write arbitrary Python as long as it generates a table structure as the output. I know SQLMesh has a similar capability of being able to do Python models as long as it returns a data frame. I'm curious how SDF thinks about the computational capabilities that go beyond SQL and some of the ways that teams can address those maybe more complicated or complex computational requirements in the event that there isn't a built in function in their target engine to address that capability?
[00:24:21] Lukas Schulte:
Yeah. Great. I'm actually really glad that you asked this question. So we've been thinking about it a lot. And there's there's like a like a there's like an agony and an ecstasy to like allowing arbitrary Python execution in your code. So the great parts are obviously that you can create, you know, table functions, data frames, and that is incredibly powerful if you want to build, especially like reusable code. So to date, SDF does not support data frame operators or Python in SDF's world. And the reason is fairly simple. We want to make sure that we can provide all of these guarantees, and until we find a way to provide all those same guarantees in Python, we're not gonna provide Python models. So this is maybe the most opinionated stance that we've taken to date, but the good news is we have a plan, and we're very, very excited about, sort of the the plan here to offer a Python interface, in the future. It won't happen in the near term. It's not something that's actively being worked on. It's on our backlog.
We still have connectors and other things that we wanna get through first, but, there's there will be some some really exciting stuff coming there, hopefully in the you know, hopefully in less than a year. The other interesting aspect of your positioning
[00:25:33] Tobias Macey:
and the particular time that we are in the ecosystem is that as we've already mentioned several times, DBT is the 1st mover, not the 1st mover, but in in terms of modern history or recent history, one of the first movers in the SQL as software engineering practice and being able to build a set of transformations that have dependency chaining as part of that. And I'm curious how you're thinking about the adoption and migration path for teams who have already invested in dbt, have substantial code bases, are they're very interested in the capabilities that SDF offers, but there is that barrier of, well, I've already got all the sunk cost into dbt. So I'm just gonna keep going in that direction because I don't wanna have to figure out a whole other tool. I'm just wondering if you can talk to what is your answer for those people?
[00:26:27] Lukas Schulte:
Yeah. That is most people, I think. At this point, a wild percentage of Snowflake, BigQuery, etcetera, compute comes from, like, dbt models, and it's something like 85 or 90% of the data teams that we see at this point use dbt. And there's one thing which is sort of moving code, and there is another piece which I actually think is even more challenging, which is actually sort of reeducating or getting all of the team on the same page about whatever a new authoring system should be. So, our approach here is to try and meet folks where they are. I already mentioned that we have parity between, you know, dbt compile and SDF compile, dbt build, SDF build, dbt test, SDF test, and so on and so forth. There are, you know, some other commands that maybe showcase some of the unique features that SDF has. But the goal here is the user experience there should not change dramatically.
The second part is sort of all the dbt configuration, ref statements, etcetera, some of which SDF does not need, but we are moving to a world, and probably by the time this podcast goes live, we'll actually already be in that world, where SDF will natively run and interpret dbt models and dbt configuration and dbt profiles and ingest them and just use SDF as the engine. So I think the way to think about that is there's dbt as the authoring layer, and then there's the dbt engine that actually takes that configuration and executes it and does all the Jinja expansion. The capability that we're launching in the next couple of weeks allows you to keep that same dbt configuration and just exchange the engine with the SDF engine and get all the same sort of static analysis capabilities, speed improvements, all the wonderful things you get from Rust, directly, on top of your dbt project. And once you have that, you can start to delete your ref statements because you no longer need them, and, we're happy about that, obviously.
[00:28:29] Tobias Macey:
And the other piece of dbt investment is the set of packages that they have been working to try and build up. There is a, I think, an unequal distribution of people who are using them versus not, and I'm wondering how you're addressing that aspect, or is it just a matter of it's all DBT, so it doesn't really matter, it just works.
[00:28:50] Lukas Schulte:
The, the goal is it's all dbt, it just works. We will see how far down the list we get. There are a lot of dbt packages at this point, and Python Jinja is leaky. So you can actually do Python subroutines, like, directly from Jinja and have those be executed. And managing things like this is incredibly difficult, especially if you want, like, a closed, compiled, well defined system like SDF really tries to be. And dbt sometimes, like, tries to break out of that cage a little bit. We'll see. I think a lot of the core things like dbt expectations and so on and so forth, we already support. I'm sure there's libraries where we'll have to figure out if there's additional Jinja complexity that we need to take into consideration. But the goal is you just download SDF and it works. And then the other piece
[00:29:33] Tobias Macey:
of complexity around trying to break into an established market is just the mindshare is one piece of it, and then there's also all of the technical investment. But beyond that, there's also just the amount of communal knowledge that gets built up around these tools in terms of blog posts, presentations, you know, just intra team communication. And just I'm curious how you think about that aspect of the problem as well, just being able to kind of work your way into that mindshare and reduce those points of friction so that there isn't as much requirement
[00:30:09] Lukas Schulte:
for that communal knowledge for people to be able to figure out how they work around those edge cases as they try to move down that adoption path. I mean, I will say, I feel like a lot of the help requests that I see around dbt have to do with, like, Python environments and packages and versioning issues. And in SDF, you don't have any of those problems. So, hopefully, we can at least take a large subset of the help that's required to make dbt run at scale and maybe forget about it. But, yeah, I mean, the reality is, of course, one of dbt's greatest assets, and what they've done an incredible job at over the last almost decade at this point, is building out a really stellar community of engaged and helpful people with, at this point, a very large knowledge base. And it would be great if you could, you know, translate some of that knowledge. Right? So whether it is, you know, linting configuration or what the right model structure is for an efficient, you know, warehouse.
Like, ideally, SDF, you know, is just an addition into that community and an additional tool for people to use rather than something that tries to, you know, redefine the wheel. Like, the last thing I wanna do is, you know, boil the ocean and try and build an entirely orthogonal, but similar, authoring experience from scratch. The dbt folks got a lot of things right. Right? And, like, there's, you know, tens, if not hundreds of thousands, of developers using dbt every day, hundreds of thousands of dbt projects. And what we want to do is elevate what the capabilities are there. Right? That's really the goal. In terms of the community investment around SDF, I'm wondering what are the
[00:31:42] Tobias Macey:
interfaces and extension points for people to be able to augment or extend the core functionality of SDF. And then on the the kind of outer shell, the additional tooling or plug ins that people might be interested in building to extend their experience of working with SDF?
[00:32:02] Lukas Schulte:
On the open source side of things, I think we invest heavily in, and love investment in, DataFusion. The core of our execution engine and our executable semantics, like, comes from DataFusion. So anybody that wants to spend the time and, you know, build a really great Rust based query engine, go check out DataFusion and, you know, see if you can pull a ticket or 2. Separately, on our side, we have found maybe some really cool optimizations. Tests is one of them. I think dbt tests are not super efficient the way the testing library is written. Those macros mean that you're doing a lot of scans for every individual test. We've written a testing library that I think is a little bit more efficient, that batches things a little bit more elegantly, and we are constantly trying to put out more of these additional packages and get folks to contribute to those as well. So I think from an open source standpoint, the package ecosystem is what I think we're most excited for community investment in at this point. As you have been building SDF, growing the community, growing the business, what are some of the most interesting or innovative or unexpected ways that you've seen the tool used? There's been a few really fun ones. I think, one, we talked a little bit about metrics and classifications a while ago. We initially developed the data classification system more for privacy and governance use cases. And then we started getting questions about, hey. Like, you know, we have 5 different definitions of what a daily active user is. People just keep copying and pasting the SQL query and, like, putting it in new places, and we wanna have one canonical definition of what a daily active user is. Can we use SDF classifiers to do this? And we said, maybe this will work. And it turns out it works, like, super, super well. And then we had a similar question around data retention and using classifiers to map out which tables needed which types of retention policies. So that was really, really exciting to see, because that was a true use case where I think people were spending a lot of time trying to map and manage retention policies.
And now there's a simple, you know, SDF report that they run in their CI pipeline or in their orchestrator once a day, and they get a report of all the partitions that need to be deleted. That's fantastic. That was completely unexpected. So that's, yeah, one example of using classifiers in a completely unexpected way. And as people start to investigate SDF, they want to
[00:34:32] Tobias Macey:
start incorporating it into their stack, what are the cases where SDF is just the wrong choice and you would advocate against using it?
[00:34:40] Lukas Schulte:
Yeah. If you're writing Scala, or RDDs, probably not the this is probably not the tool for you. I think a lot of the the spark ecosystem is really rich and has a lot of capabilities. There's a lot of notebooks, especially from from Databricks. SDF does not work well in that universe, at the moment, for some of the reasons I've outlined earlier. So I think that that is probably the main area at this point where SDF is probably not the right tool.
[00:35:05] Tobias Macey:
And in your experience of building SDF, growing the business, growing the project, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:35:16] Lukas Schulte:
Yeah. Building a framework that people like or building a cool developer tool is not the same as building a business. I think that, like, if you're, you know, an excited engineer and you're like, yes, I'm gonna build a really cool tool, it's very easy to focus on the tool and the functionality and the capabilities and where it's differentiated and novel. But growing a business has different requirements, and understanding what should be free to use, what should be paid, what features make sense when, you know, they're offered as a paid feature for a company versus what doesn't make sense, whether to, you know, price on seats or models. Those are really, really difficult questions, and the only way to tackle them is to just, you know, talk to a lot of companies and talk to a lot of engineers. That's been a fascinating lesson over the last 2 years, but I'd especially say over the last 6 months.
[00:36:09] Tobias Macey:
And as you continue to build and invest in SDF, what are some of the things you have planned for the near to medium term or any particular projects you're excited to explore?
[00:36:19] Lukas Schulte:
Yeah. I think on the dbt side, seeing more use cases of SDF as a dbt accelerator is super exciting. And on the compute side, we have some awesome, I think, you know, demos and hopefully an alpha very soon that we'll be able to show, where you can actually start, you know, using SDF as a query engine for model transformations, either directly from your orchestrator or as a separate service. And then lastly, on the cloud product side, just today, we launched a really cool impact analysis feature that looks at diffs between 2 warehouses. Maybe it's the state that's in main, maybe the state that's in a pull request, and we'll show you the impact of those changes, what columns changed, if there's anything that breaks downstream, etcetera. And I'm really, really excited to invest a little bit more on the cloud side in building out differentiated tool sets for enterprises that uniquely need services and that you can't just run on the laptop by yourself. Are there any other aspects of the SDF product, the SDF tooling,
[00:37:21] Tobias Macey:
the ecosystem around it, and the ecosystem that you are working within that we didn't discuss yet that you'd like to cover before we close out the show? There's, there's there's always too many things to talk through. No. I I really appreciate you having me on. This was, absolutely fantastic. Feel like I learned a lot. Questions were great. Appreciate your time. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And so as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. There's, there's a consensus
[00:37:57] Lukas Schulte:
among software engineers, something that we've talked about a lot today, which is that the tooling in data engineering is not up to par with the tooling in software engineering. And if you look at the reasons for this, I think selfishly, I, as someone who's building a compiler for SQL, will say, hey, there's not really a good SQL compiler that exists. Right? Like, compilers are the framework for everything from, you know, IntelliSense to, you know, CICD checks. There's no really good compiler for SQL. So, like, that's the problem in data development. But maybe digging one step deeper, like, why is there no good compiler for SQL? I think if you take a look at the history of SQL, there's no module system that was ever introduced in that language. Right? Like, the reason that you have data frames is data frames are the only way to do table level functions, essentially. Otherwise, you have some user defined functions, maybe, but there's no, like, import statements. There's no libraries. There's no package repository for anything in the SQL world, which means that every single company, all over the world, every single time, is rebuilding the same wheel from scratch. And I think that is maybe the most exciting problem and question in the data space. Right? It's like, how do you actually build a data development framework that provides some level of modularity, some ability for libraries to exist, some ability for code reuse in any sort of way. That I think is the most interesting
[00:39:20] Tobias Macey:
gap in data development today. And also the fact that despite being referenced as a language, it is actually
[00:39:29] Lukas Schulte:
a fractal amount of dialects. And so even if you can have some way of saying this is SQL, this is SQL, they're actually 2 different SQLs. And if you try to run them both on the same engine, then you're gonna have a bad day. You're gonna have a really bad day. Yeah. But if you think about, like, why is this the case? Like, why did Snowflake evolve its own grammar? Why are they still evolving? Like, that grammar is still evolving. DuckDB at this point, right? Now DuckDB is also creating its own grammar. The reason for this is there's no extensibility. So the only thing that's left is to change the language. If you can't add a library for, you know, how you want to work with some ML model, like, of course, if you're Snowflake and you want people to start running ML models as well, of course, you're gonna build, you know, a little, like, dialect addendum that lets you run, you know, GPT directly from, like, a Snowflake SQL query. It's complete madness, but it's the only recourse that's available to these vendors as well. Right? It's expanding and extending the language. So, yeah, I think if you could figure out if there's a way to create modularity in that universe, I think that would be incredibly powerful.
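For a small taste of that divergence, here is the same "orders in the last seven days" filter written for three engines; the table and column names are invented, but the dialect differences are representative of why "it's all SQL" rarely means "it runs everywhere."

```sql
-- Snowflake
SELECT COUNT(*) FROM orders
WHERE created_at >= DATEADD(day, -7, CURRENT_TIMESTAMP());

-- BigQuery
SELECT COUNT(*) FROM orders
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);

-- DuckDB / Postgres-style
SELECT COUNT(*) FROM orders
WHERE created_at >= now() - INTERVAL '7 days';
```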
[00:40:29] Tobias Macey:
Yeah. That's where you also start getting all of these superset languages that compile down to SQL, similar to how we've had all of these different languages that compile to JavaScript.
[00:40:39] Lukas Schulte:
Yes. This is yeah. This is exactly it. Alright. I have, you know, sometimes tried to describe a little bit of what we're doing as, you know, sort of the TypeScript to JavaScript transition, but for SQL, where we, you know, add a little bit of type information, some static analysis capabilities. But ultimately, like, the thing that we send, you know, to Snowflake is just the same Snowflake SQL as everybody else. But, yeah, it's a really interesting challenge. And I think the reason that you've seen this, like, absolute explosion in data tooling also has to do with this. Right? Like, there's
[00:41:08] Tobias Macey:
also really cool stuff happening in places like Ibis, right, where, like, people are really trying to figure out how to translate and/or map well from one dialect to another. But mapping is always fuzzy, and it doesn't really work at scale. It's a challenging problem. Alright. Well, thank you very much for taking the time today to join me. I appreciate all of the time and effort that you and the rest of the SDF team are putting into this problem space. It's definitely a very important and constantly evolving target. So I definitely look forward to seeing the continued growth of SDF and starting to experiment with it for my own work. So thank you again, and I hope you enjoy the rest of your day. Thanks a lot. You as well. Appreciate you having me on the show. Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time right at the source. Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production?
Learn more at data engineering podcast.com/datafold
[00:00:48] Lukas Schulte:
today. Your host is Tobias Macey. And today, I'm welcoming Lucas Schulte to talk about SDF, a fast and expressive SQL transformation tool that understands your schemas. So, Lukas, can you start by introducing yourself? Hey. Hey, Tobias. I'm happy to be here. I'm Lukas. I'm one of the co founders and the CEO of STF. And today, I'll be telling you a little bit about what we've been building over the last couple of years, and why we're very, very excited about it. And do you remember how you first got started working in data? Yeah. Vaguely. So I originally got interested in data, maybe through a peripheral pathway. I studied electrical and computer engineering, and I was primarily interested in sensor systems and, sensors collect a lot of data.
And so my my interest in sort of data systems and analysis and understanding, how, data can be used to express real world systems, kind of came from there. So after after college, I joined, sensors team at Microsoft.
[00:01:47] Tobias Macey:
I was primarily working on sensor systems there. That was all sort of traditional analytics that was less ML oriented than it might be today, but that was sort of my start. And now bringing us to SDF, can you give a bit of an overview about what it is and some of the story behind how it got started and why you decided that this was worth your time and energy? Yeah, for sure. So SDF
[00:02:09] Lukas Schulte:
came about, maybe to go back a little bit further, those sensor systems that I was talking about, at some point, they all became very ML oriented. Machine learning kind of took over what traditional algorithms have been doing for a long time. And I found myself in that world, building out data infrastructure for a company that was building creative tools on top of various computer vision algorithms. And so we built out this whole like data collection paradigm there, built out a team for labeling ground truth data, created multiple ML models, tried to make them small so they fit on devices. Anyways, the data loads were growing there. And I also live in Los Angeles, which means that, of course, this turns into a social media related enterprise.
And so, of course, we all of a sudden had user data, and I think as any company that grows is want to do, the folks on the ML side want to start using user data. The company starts having quarterly board meetings, so we need a modern data stack on top of the user data as well. And then the company grows even more, and there's GDPR and CCPA. So this stack that started relatively small and contained grew very, very quickly, and I think quickly became clear that we kind of lost the plot. And we realized that we needed some system to understand data from ingestion to consumption. And that system didn't really exist. Dbt was the it, you know, was, and I think in large cases still is the best system that exists with this kind of stuff today. So we started using that. But at the same time, I was fortunate enough to talk with my now 2 cofounders, Michael and Wolfram, about what they were building at Meta. Wolfram was one of the chief architects of Meta's data warehouse. And at the time, he was building a system for static analysis of SQL at scale, to understand exactly why and how data moves from ingestion to consumption and what impact that has on data privacy concerns, data governance concerns, data quality concerns, how that can improve the developer experience and so on and so forth. And I think, you know, SDF was born out of the realization that if the series a series b companies that are starting to build out their data stacks have some of these challenges and, you know, the largest data consumers in the world, like Meta, have some of these same concerns. That means there's probably a cross cutting reason to start working on this. And that was sort of the genesis of SDF. It's very enigmatic name. Wondering if you could unpack it a little bit. I don't everybody loves 3 part names. Right? I think.
So Wolfram at Meta coined the term, the semantic data warehouse, because that was sort of the goal, I think, also after some of the data privacy, issues that Meta had in the late 2010s. The goal was to build a semantic data warehouse, where you had a true understanding of what transformations acted on, what types of data throughout everything. And this is what Wolfram was working on, and we needed SDF to be more pluggable into other different systems, hence fabric. And, then we saw that those were also the perfect three letters on the keyboard. If you're working on a command line utility, and we thought, you know what? This is absolutely perfect. So semantic data warehouse turned into semantic data fabric. And now, hopefully, you know, those are our three favorite words on the keyboard.
[00:05:07] Tobias Macey:
And I guess it's a natural progression that you would be able to manage your versions of SDF using the ASDF VM. That would be that would be
[00:05:17] Lukas Schulte:
we should do that, honestly. We should start looking into that.
[00:05:21] Tobias Macey:
And so you started touching on this a little bit, but what is the core problem that you're solving with SDF and in particular as it's juxtaposed
[00:05:31] Lukas Schulte:
with the most notable I don't know if competitor is the right term, but alternative in the ecosystem in the form of DBT. So, yeah, maybe rather than answering the the question super directly, I'll I'll answer a little bit of a story, which mirrors, you know, how data engineers work in in teams today. So let's say you're a data engineer at a midsize company. There's probably a few more technical folks and then a larger number of slightly less technical data analysts. Likely you're working in dbt. Your workspace is now, you know, 100, if not thousands of models. Compile times, take a long time to complete. Dependencies are harder to manage, right? It's difficult to, you know, delete a column in a model here or change a schema and understand exactly what changes downstream, pipelines break, and, you know, maybe worst of all, debugging is pretty painful because if at least if you're, operating on DBT, if you run a command, Snowflake only sees the resulting query after Jinja expansion and macro expansion and all the configuration takes place. So the error is actually related to the Snowflake error, not the error on the SQL file that you're working on on your laptop. And this is, this is pretty strange, especially in comparison to software engineering, where a lot of these hard parts are solved, right? There's a lot of talk about column love lineage. Compilers have had linkers for like 30 years, right? And a linker just manages dependencies between files, which is exactly what you want for SQL and data as well. There's really great static analysis tools, really great linters that are pretty highly performant. And, I think what we want to do is bring a lot of that tooling, right, that those really, really top notch software development experiences into the realm of data engineering as well. So we want, you know, more static analysis, more guarantees in CICD, more verification through automated systems rather than, you know, manually running a query, changing a column name, changing an aggregation, and kind of seeing what happens. And I think we think this is possible. It's really hard to do because SQL dialects are so differentiated between data warehouses, but this is really our goal. Right? This is, like, SQL validation and transformation, especially at scale. It was a very long winded answer, but, there you go.
[00:07:39] Tobias Macey:
No, it's definitely helpful to get that context. And you've already mentioned dbt. They have been the dominant player in this SQL-oriented transformation market for a while; they helped to actually create that market and the idea of the analytics engineer. Another notable entrant in this landscape in the past couple of years has been SQLMesh, which is taking aim at dbt and focusing on that ability to actually parse and understand the SQL AST. And I'm wondering if you can give a bit of an overview of how you view that Venn diagram of features and functionality across SDF, dbt, and SQLMesh, and any other tools that you feel like throwing in the mix?
[00:08:15] Lukas Schulte:
So, yeah, maybe the first thing to call out: I definitely want to be very complimentary of all the tools and companies in the space. Ultimately, we're all circling the same problem and trying to make the world of data engineering better for everyone, and I say that as a long-time user of dbt. What they've done especially has been really transformational in the world of data. So that's maybe step one. To that end, a lot of these tools just have slightly different approaches, right? Another tool that I think has been gaining a lot of popularity amongst Snowflake users is Coalesce. They have some tooling, but for them it's very much oriented towards visualization and making the drag-and-drop experience as good as it gets, and that works for some folks. I think the folks at SQLMesh have really taken dbt to the next level.
Some of the things that they've done around virtual environments are pretty cool. But again, it's a different authoring framework, and I think the approach there is very opinionated in terms of, hey, this is sort of the ideal way to write a data warehouse. We've been trying to be a little bit less opinionated and work with SQL dialects where they are and with dbt projects where they are. Our goal here has been less to work on the authoring surface and more on the engine: really showcase the capabilities that we have on the SQL understanding side, what we can do once we have executable semantics for some of these proprietary dialects, and actually slowly inch our way into the realm of native compute as well. So in short, different approaches, and different things will work for different companies. We're obviously really excited about the static analysis capabilities that we've developed and what those enable us to do.
[00:10:13] Tobias Macey:
Now digging into the implementation and design of SDF, I'm wondering if you can talk to some of the ways that you're thinking about the core functionality, the ways that you are tackling the complexities of transformation, and in particular being able to scale to those hundreds or thousands of models, and some of the engineering challenges that you've had to address as part of that effort?
[00:10:47] Lukas Schulte:
Yeah. So I think we've taken maybe a fundamentally different approach than anybody else. We built from the ground up in Rust and tried to build full grammar descriptions for the SQL dialects that we support. That's been quite an undertaking, but it allows us to have very precise, not just SQL analysis, but SQL validation. Maybe my favorite example here: at the moment, I think the only entity that can tell you whether you're missing a comma in a Snowflake SQL query is Snowflake itself, right? You have to send that query to Snowflake, and then Snowflake compiles that query and says, hey, yes, you're missing a comma here. Or if a function takes a VARCHAR instead of an integer, only Snowflake can tell you that. And this is so counterintuitive if you're used to software engineering, where you have a TypeScript compiler, or you have Cargo, right? You have all these local compilation tools that tell you everything you need to know at compile time, that give you compile-time guarantees on your laptop, which I think we have finally unlocked at scale. And the reason I say at scale is that Rust has allowed us to do a lot of really cool performance optimizations.
It's allowed us to go really deep, build in type binding and other static analysis toolsets, and still have the result be relatively performant. So at this point, as one of our benchmarks, we have an SDF workspace with 16,000 SQL files that are on average 500 lines long, or have on average 500 columns. Compiling that and getting all the column-level lineage takes less than 45 seconds on my laptop, and I'm pretty happy with that performance.
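As an illustration of the class of error Lukas is describing, errors that only the warehouse can currently surface, consider a hypothetical Snowflake-flavored query:

    select
        date_trunc('day', created_at)   as day,
        split_part(user_agent, 2, '/')  as browser   -- arguments swapped:
            -- split_part expects (string, delimiter, part_number)
    from raw.events;
    -- Today you only learn about the mistake once the warehouse compiles the query;
    -- a local compiler with the full grammar and function signatures can flag it
    -- before anything is sent.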
[00:12:20] Tobias Macey:
To your point of having to send the query off to the cloud warehouse and wait for it to tell you what you're doing wrong, there's a bit of a parallel happening as more and more of software architecture and infrastructure is cloud native or requires the cloud, particularly if you're using some of the core cloud provider services versus the open source alternatives. So there's definitely a similar struggle happening in that regard on the architecture and infrastructure side. But at the pure software layer, it's definitely true that we have gotten used to being able to lean on our tooling to tell us early and often what we're doing wrong.
[00:13:01] Lukas Schulte:
Yeah, absolutely. It does seem like part of the reason the cloud vendors are so excited about requiring cloud connections for more features is that it adds to their vendor lock-in, right? It adds to their ecosystem moat in a pretty big way. And I think that's the reason things like Iceberg have been so exciting in the last couple of years, because they, for the first time, give companies a little bit more of an out, where they can say, you know what? There's another option: I can use another query engine to run a query against this data.
[00:13:28] Tobias Macey:
Yes. Data gravity is a very real point of leverage.
[00:13:32] Lukas Schulte:
Yeah, it's enormous, I think. We've talked to companies that have multiple cloud vendors just so that they have negotiating leverage when their contracts come up for renewal. That's how enormous data gravity is to an enterprise.
[00:13:48] Tobias Macey:
And as you have been working through the development and early stages of onboarding and working with some of the early adopters of SDF, what are some of the ways that the scope and goals of the project and the business have changed in that time?
[00:14:05] Lukas Schulte:
They've changed in almost every way. I will say one thing that I'm happy about: at all of our all-hands, I get to show some of the mission statements that we put together over 2 years ago at this point. The nice thing is the mission hasn't changed, but the implementation has changed quite a bit. One thing that's changed pretty dramatically is that initially we were entirely focused on static analysis, and we had this pipe dream that we could maybe think about execution someday. But for us to be able to think about execution, we have to have these fully resolved logical plans that you can actually send to a query engine, and we didn't know if we were going to be able to get there. In the last 3 months or so, we've turned the corner. I think we now have pretty strong executable semantics for the SQL dialects that we support. That means we're really excited about our ability to support what we call query engine emulation, so you can actually run Snowflake queries locally on your laptop. I think that's a really cool capability, and if it scales, I think it could be quite powerful.
[00:15:06] Tobias Macey:
And to that point too, I know from looking through the documentation that you're also using Arrow DataFusion to be able to do some fully local execution, and that's definitely a very different set of functionality than something like dbt, which is wholly reliant on whatever query engine you're targeting, where if you want to do something locally, then you're probably using DuckDB; otherwise, you're relying on one of these other engines. I'm wondering if you could break out some of the features that help to differentiate SDF and some of the ways that those fit into the overarching workflow for people who are using the tool to accomplish their engineering and business goals?
[00:15:50] Lukas Schulte:
That's a great question. So, yeah, maybe to the point about dbt: dbt operates on strings. Folks like SQLMesh take it one step further and look at ASTs. Our goal is to have a fully type-safe implementation, which means that at every node in the transformation graph, we know all of the types that are incoming and expected as output from every transformation. So if you run an SDF compile on a single model, you'll see that SDF knows not just the column names, but also exactly their data types. And once you have that, you can start adding additional type information on top, so you can work with classifiers and higher-level type objects rather than just VARCHARs and integers. From a developer-driven data governance standpoint, this is super exciting, because you can finally tell your data warehouse at ingestion: hey, here's where my PII is, here's my canonical definition of a daily active user, here's the retention policy that applies to all columns that hold social security numbers. You can start automating a large part of a system that was otherwise manual, as it relates to higher-level data types, because VARCHARs are incredibly expressive, but not all VARCHARs are the same, right? And it's hard to treat them differently in the current transformation layer ecosystem.
We're really excited about the extra level of capability that type correctness gives you. Data classification tooling is definitely one of the really exciting things you can build once you have type correctness.
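As a loose illustration of the idea, not SDF's actual configuration syntax, the point is that two physically identical VARCHAR columns can carry very different higher-level meanings, and rules can attach to those meanings rather than to individual columns:

    -- Both columns are plain VARCHARs to the warehouse.
    create table raw.signups (
        email           varchar,  -- higher-level type: PII email -> mask in non-prod, 30-day retention
        referral_source varchar   -- higher-level type: plain metadata -> no special handling
    );
    -- A classifier-aware compiler can then enforce rules such as
    -- "PII never flows into publicly exposed models" at compile time,
    -- instead of relying on someone reading a docstring.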
[00:17:26] Tobias Macey:
Yeah, the classifier capability is something that I found particularly compelling, because, as you said, there is a lot of nuanced detail that is very easy to lose track of or very difficult to surface. Maybe you can write it into a docstring somewhere, but then you're relying on human operators to actually read all of that, parse it, and understand its real impact, versus being able to put business rules around that type information, around those higher-order details like PII or some of the business semantics. In particular, I'm wondering how you see the classifier functionality in comparison to what happened a couple of years ago with the idea of, what do they call it, the metrics layer, and some of the ways that that has actually been realized in terms of the technical implementations, in particular the dbt semantic layer, since that's what most folks are probably working with.
[00:18:24] Lukas Schulte:
Yeah. I think metrics and semantic layers I see more at larger-scale companies, but I think the need for something in that space is real. So I'm curious: for the data warehouse projects that you work on most, do you know approximately what percentage of columns are just VARCHARs? Not offhand, but I would presume the majority. Okay, yeah, that sounds about right. What we see is that something like 50% of all columns that we encounter are just VARCHARs, which is extraordinary for the amount of different and very nuanced information that those columns actually hold. There's an example here: a very large enterprise bought another company, and one of the data analysts started joining user IDs on user IDs. It seems like the most mundane and normal join operation in the world if you're working a lot with users and user IDs. But it turns out that one set of user IDs was from the original company, and the other set was from the company they had just bought. So they were actually two completely disjoint sets of user IDs that happened to overlap, because they're all just integers or something like that. The company didn't catch this for months on end, and it was very expensive to rectify, because they had to do a whole bunch of backfilling. My point here is that just calling a column user ID doesn't make it the same user ID, right? There's a need, especially when you're working at scale, to work with higher-level types. Our goal with our type system is to allow you to say, hey, this is company 1's user ID, this is company 2's user ID, and to create a little rule, we call it business logic as code, that says you can never join those two, and if you do join them, flag a warning. That capability is, I think, super critical. I know it's important for everything from user IDs to privacy and data governance concerns, but also metrics. We've seen these classifiers used in a few different ways that were very not obvious to me in the beginning, and we're really excited to see what people actually use them for. We'll see if it turns out to be a replacement for or an addition to a traditional metrics layer.
But it's been exciting to see what people use these things for so far.
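The user ID story above is easy to reconstruct; a minimal sketch with hypothetical tables:

    -- Two tables that both carry a user_id column, but from different ID spaces
    -- (the acquiring company and the acquired company).
    select
        a.user_id,
        a.lifetime_value,
        b.last_login
    from company_one.users    as a
    join company_two.activity as b
      on a.user_id = b.user_id;  -- runs cleanly, returns rows, and is silently wrong:
                                 -- the IDs overlap numerically but identify different people.
    -- With higher-level types attached to each column (say, CompanyOneUserId vs CompanyTwoUserId),
    -- a "never join these two" rule can fail or warn at compile time.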
[00:21:01] Tobias Macey:
Digging a bit further into the developer experience and the workflow, particularly as you start to scale the usage, and in particular for some of these enterprise-class use cases that you've touched on, where you're spanning multiple teams, possibly even completely different business units or organizations: what are some of the ways that SDF helps to support those very fractal use cases, where you may have various handoffs that aren't always very clearly defined? And also, given that SDF as a product is open core and has paid features, what are some of the ways that those team-oriented capabilities differ between the open and the commercial offerings?
[00:21:45] Lukas Schulte:
From a philosophical standpoint, our goal is that, essentially, what you run locally on your machine is free to use, and what happens in the cloud is typically a service that we have to manage and therefore is paid. We also haven't really tried to change the authoring surface, so there's no GUI; it's a command line engine. But that engine, because it's written in Rust, is just a binary that you download. So the idea was, if you are working in a remote VS Code workspace or a GitHub Codespace, you can install SDF into that environment and it just works. There are no other dependencies, no Python or virtual environment that you have to manage. You just install the binary and go. The other goal, from a static analysis viewpoint, was that if some analyst decides to write a new model, they can take an SDF workspace and add a model to it, and if SDF compiles that model correctly and there are no errors spit out by the compiler, that model should be good to go, and you should be able to integrate it into your production pipelines without breaking anything. That's the kind of guarantee that SDF tries to make to workspace owners. Additionally, from a using-the-tool standpoint, a lot of the functionality actually mirrors dbt, right? There's dbt compile; sdf compile does most of the same stuff that dbt compile does, but also gives you some of the static analysis capabilities, does type checking, builds out column-level lineage, and so on. There's dbt run and sdf run: again, similar functionality. dbt test and sdf test: again, similar functionality.
So the goal here really is to, again, meet people where they are in their authoring workflow.
[00:23:25] Tobias Macey:
Another interesting parallel that I'm curious to hear your answer to, as far as the development workflow and the interfaces, is that with dbt, maybe a year or 2 ago, they started offering the capability of Python models, which aren't available across all execution engines, but which give you the ability to write arbitrary Python as long as it generates a table structure as the output. I know SQLMesh has a similar capability of doing Python models as long as they return a data frame. I'm curious how SDF thinks about computational capabilities that go beyond SQL, and some of the ways that teams can address more complicated or complex computational requirements when there isn't a built-in function in their target engine to address that capability?
[00:24:21] Lukas Schulte:
Yeah, I'm actually really glad that you asked this question, because we've been thinking about it a lot. There's an agony and an ecstasy to allowing arbitrary Python execution in your code. The great parts are obviously that you can create table functions and data frames, and that is incredibly powerful if you want to build reusable code. To date, SDF does not support data frame operators or Python. The reason is fairly simple: we want to make sure that we can provide all of these guarantees, and until we find a way to provide those same guarantees in Python, we're not going to provide Python models. This is maybe the most opinionated stance that we've taken to date, but the good news is we have a plan, and we're very excited about offering a Python interface in the future. It won't happen in the near term; it's not something that's actively being worked on, it's on our backlog.
We still have connectors and other things that we want to get through first, but there will be some really exciting stuff coming there, hopefully in less than a year.
[00:25:33] Tobias Macey:
The other interesting aspect of your positioning, and the particular moment that we are at in the ecosystem, is that, as we've already mentioned several times, dbt is, not the first mover exactly, but in terms of recent history one of the first movers in the SQL-as-software-engineering practice, being able to build a set of transformations with dependency chaining as part of that. I'm curious how you're thinking about the adoption and migration path for teams who have already invested in dbt, have substantial code bases, and are very interested in the capabilities that SDF offers, but face that barrier of, well, I've already got all this sunk cost in dbt, so I'm just going to keep going in that direction because I don't want to have to figure out a whole other tool. I'm wondering if you can talk to what your answer is for those people?
[00:26:27] Lukas Schulte:
Yeah, that is most people, I think. At this point, a wild percentage of Snowflake, BigQuery, etcetera compute comes from dbt models, and something like 85 or 90% of the data teams that we see use dbt. And there's one piece which is moving code, and there is another piece, which I actually think is even more challenging, which is re-educating or getting the whole team on the same page about whatever the new authoring system should be. So our approach here is to try and meet folks where they are. I already mentioned that we have parity between dbt compile and sdf compile, dbt build and sdf build, dbt test and sdf test, and so on, plus some other commands that showcase some of the unique features that SDF has. But the goal here is that the user experience should not change dramatically.
The second part is all the dbt configuration, ref statements, et cetera, some of which SDF does not need. But we are moving to a world, and probably by the time this podcast goes live we'll already be in that world, where SDF will natively run and interpret dbt models, dbt configuration, and dbt profiles, ingest them, and just use SDF as the engine. So the way to think about it is that there's dbt as the authoring layer, and then there's the dbt engine that actually takes that configuration, executes it, and does all the Jinja expansion. The capability that we're launching in the next couple of weeks lets you keep that same dbt configuration and just swap in the SDF engine, and get all the static analysis capabilities, speed improvements, and all the wonderful things you get from Rust directly on top of your dbt project. And once you have that, you can start to delete your ref statements because you no longer need them, and we're happy about that, obviously.
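In concrete terms, the promise is that an existing dbt-style model keeps working unchanged; the model and table names here are illustrative:

    -- models/daily_orders.sql, exactly as it lives in a dbt project today
    select
        order_date,
        count(*) as orders
    from {{ ref('stg_orders') }}
    group by order_date;

    -- Instead of dbt expanding the Jinja and handing a string to the warehouse,
    -- the engine described above resolves ref(), type-checks the model, and builds
    -- column-level lineage before anything is executed.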
[00:28:29] Tobias Macey:
And the other piece of dbt investment is the set of packages that they have been working to build up. There is, I think, an unequal distribution of people who are using them versus not, and I'm wondering how you're addressing that aspect, or is it just a matter of: it's all dbt, so it doesn't really matter, it just works?
[00:28:50] Lukas Schulte:
The goal is: it's all dbt, it just works. We will see how far we get. There are a lot of dbt packages at this point, and Python's Jinja is leaky, so you can actually call Python subroutines directly from Jinja and have those be executed. Managing things like that is incredibly difficult, especially if you want a closed, compiled, well-defined system like SDF really tries to be, and dbt sometimes tries to break out of that cage a little bit. We'll see. A lot of the core things like dbt-expectations and so on we already support. I'm sure there are libraries where we'll have to figure out if there's additional Jinja complexity that we need to take into consideration. But the goal is you just download SDF and it works.
[00:29:33] Tobias Macey:
And then the other piece of complexity around trying to break into an established market is that mindshare is one piece of it, and then there's also all of the technical investment. But beyond that, there's just the amount of communal knowledge that gets built up around these tools in terms of blog posts, presentations, and intra-team communication. I'm curious how you think about that aspect of the problem as well: being able to work your way into that mindshare and reduce those points of friction, so that there isn't as much requirement for that communal knowledge for people to be able to figure out how to work around edge cases as they try to move down that adoption path.
[00:30:09] Lukas Schulte:
I will say, I feel like a lot of the help requests that I see around dbt have to do with Python environments, packages, and versioning issues, and in SDF you don't have any of those problems. So hopefully we can at least take a large subset of the help that's required to make dbt run at scale and make it unnecessary. But, yeah, the reality is, of course, that one of dbt's greatest assets, and what they've done an incredible job of over the last almost decade at this point, is building out a really stellar community of engaged and helpful people with, at this point, a very large knowledge base. And it would be great if you could translate some of that knowledge, whether it's linting configuration or what the right model structure is for an efficient warehouse.
Ideally, SDF is just an addition to that community and an additional tool for people to use, rather than something that tries to reinvent the wheel. The last thing I want to do is boil the ocean and try to build an entirely orthogonal but similar authoring experience from scratch. The dbt folks got a lot of things right, and there are tens, if not hundreds of thousands, of developers using dbt every day, and hundreds of thousands of dbt projects. What we want to do is elevate the capabilities that are there. That's really the goal.
[00:31:42] Tobias Macey:
In terms of the community investment around SDF, I'm wondering what the interfaces and extension points are for people to augment or extend the core functionality of SDF, and then, on the outer shell, the additional tooling or plugins that people might be interested in building to extend their experience of working with SDF?
[00:32:02] Lukas Schulte:
On the open source side of things, we invest heavily in, and love investment in, DataFusion. The core of our execution engine and our executable semantics comes from DataFusion. So anybody who wants to spend the time and help build a really great Rust-based query engine, go check out DataFusion and see if you can pull a ticket or two. Separately, on our side, we have found some really cool optimizations. Tests are one of them. I think dbt tests are not super efficient the way the testing library is written; those macros mean that you're doing a lot of scans for every individual test. We've written a testing library that I think is a little bit more efficient and batches things more elegantly, and we are constantly trying to put out more of these additional packages and get folks to contribute to those as well. So from an open source standpoint, the package ecosystem is where I think we're most excited for community investment at this point. As you have been building SDF, growing the community, growing the business, what are some of the most interesting or innovative or unexpected ways that you've seen the tool used? There have been a few really fun ones. One: we talked a little bit about metrics and classifications a while ago. We initially developed the data classification system more for privacy and governance use cases. And then we started getting questions like, hey, we have 5 different definitions of what a daily active user is; people just keep copying and pasting the SQL query and putting it in new places, and we want one canonical definition of what a daily active user is. Can we use SDF classifiers to do this? And we said, maybe this will work. And it turns out it works super well. Then we had a similar question around data retention, using classifiers to map out which tables needed which types of retention policies. That was really exciting to see, because that was a true use case where people were spending a lot of time trying to map and manage retention policies.
And now there's a simple SDF report that they run in their CI pipeline or in their orchestrator once a day, and they get a report of all the partitions that need to be deleted. That's fantastic. That was completely unexpected, so that's one example of using classifiers in a completely unexpected way.
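On the testing point from earlier in this answer, per-test macros versus batching, a rough sketch of the difference (table and column names are made up):

    -- dbt-style: each test compiles to its own query, so the same table is scanned once per test.
    select count(*) from analytics.users where user_id is null;                -- not_null test
    select user_id from analytics.users group by user_id having count(*) > 1;  -- unique test

    -- Batched: one scan of the table can answer several assertions at once.
    select
        sum(case when user_id is null then 1 else 0 end) as null_user_ids,
        count(user_id) - count(distinct user_id)         as duplicate_user_ids
    from analytics.users;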
[00:34:32] Tobias Macey:
And as people start to investigate SDF and want to start incorporating it into their stack, what are the cases where SDF is just the wrong choice and you would advocate against using it?
[00:34:40] Lukas Schulte:
Yeah. If you're writing Scala or working with RDDs, this is probably not the tool for you. The Spark ecosystem is really rich and has a lot of capabilities, and there are a lot of notebooks, especially from Databricks. SDF does not work well in that universe at the moment, for some of the reasons I've outlined earlier. So that is probably the main area at this point where SDF is not the right tool.
[00:35:05] Tobias Macey:
And in your experience of building SDF, growing the business, growing the project, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:35:16] Lukas Schulte:
Yeah. Building a framework that people like, or building a cool developer tool, is not the same as building a business. If you're an excited engineer and you think, yes, I'm going to build a really cool tool, it's very easy to focus on the tool and the functionality and the capabilities and where it's differentiated and novel. But growing a business has different requirements: understanding what should be free to use and what should be paid, which features make sense as paid features for a company and which don't, whether to price on seats or on models. Those are really difficult questions, and the only way to tackle them is to talk to a lot of companies and a lot of engineers. That's been a fascinating lesson over the last 2 years, and especially over the last 6 months.
[00:36:09] Tobias Macey:
And as you continue to build and invest in SDF, what are some of the things you have planned for the near to medium term or any particular projects you're excited to explore?
[00:36:19] Lukas Schulte:
Yeah. On the dbt side, seeing more use cases of SDF as a dbt accelerator is super exciting. On the compute side, we have some awesome demos, and hopefully an alpha very soon that we'll be able to show, where you can actually start using SDF as a query engine for model transformations, either directly from your orchestrator or as a separate service. And then lastly, on the cloud product side, just today we launched a really cool impact analysis feature that looks at diffs between 2 warehouses, maybe the state that's in main and the state that's in a pull request, and shows you the impact of those changes: what columns changed, whether anything breaks downstream, et cetera. I'm really excited to invest a little bit more on the cloud side in building out differentiated tool sets for enterprises that uniquely need services you can't just run on a laptop by yourself.
[00:37:21] Tobias Macey:
Are there any other aspects of the SDF product, the SDF tooling, the ecosystem around it, or the ecosystem that you are working within that we didn't discuss yet that you'd like to cover before we close out the show? There are always too many things to talk through. No, I really appreciate you having me on. This was absolutely fantastic. I feel like I learned a lot. The questions were great. Appreciate your time. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:37:57] Lukas Schulte:
There's a consensus among software engineers, something that we've talked about a lot today, which is that the tooling in data engineering is not up to par with the tooling in software engineering. And if you look at the reasons for this, selfishly, as someone who's building a compiler for SQL, I will say: there's no really good SQL compiler that exists, right? Compilers are the framework for everything from IntelliSense to CI/CD checks, and there's no really good compiler for SQL. So that's the problem in data development. But maybe digging one step deeper: why is there no good compiler for SQL? If you take a look at the history of SQL, no module system was ever introduced in that language. The reason that you have data frames is that data frames are essentially the only way to do table-level functions. Otherwise, you have some user-defined functions, maybe, but there are no import statements, no libraries, no package repository for anything in the SQL world, which means that every single company, all over the world, every single time, is rebuilding the same wheel from scratch. I think that is maybe the most exciting problem and question in the data space: how do you actually build a data development framework that provides some level of modularity, some ability for libraries to exist, some ability for code reuse in any sort of way? That I think is the most interesting gap in data development today.
[00:39:20] Tobias Macey:
And there's also the fact that, despite being referenced as a single language, it is actually a fractal of dialects. Even if you can say this is SQL and that is SQL, they're actually 2 different SQLs, and if you try to run them both on the same engine, you're going to have a bad day.
[00:39:29] Lukas Schulte:
You're going to have a really bad day, yeah. But if you think about why this is the case: why did Snowflake evolve its own grammar, and why is that grammar still evolving? DuckDB, at this point, is also creating its own grammar. The reason for this is there's no extensibility, so the only thing that's left is to change the language. If you can't add a library for, say, how you want to work with some ML model, then of course, if you're Snowflake and you want people to start running ML models as well, you're going to build a little dialect addendum that lets you run GPT directly from a Snowflake SQL query. It's complete madness, but it's the only recourse available to these vendors: expanding and extending the language. So, yeah, I think if you could figure out a way to create modularity in that universe, that would be incredibly powerful.
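For a concrete flavor of the dialect extension Lukas mentions, Snowflake exposes LLM inference as SQL functions; the exact function name and model label here are from memory and should be treated as approximate:

    -- Calling a hosted model directly from a Snowflake SQL query
    select
        review_id,
        snowflake.cortex.complete(
            'llama3-8b',
            'Summarize this product review: ' || review_text
        ) as summary
    from raw.product_reviews;
    -- There is no import or library mechanism for adding this kind of capability,
    -- so the vendor extends the dialect itself.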
[00:40:29] Tobias Macey:
Yeah, that's where you also start getting all of these superset languages that compile down to SQL, similar to how we've had all of these different languages that compile to JavaScript.
[00:40:39] Lukas Schulte:
Yes, exactly. I have sometimes tried to describe a little bit of what we're doing as sort of the TypeScript-to-JavaScript transition, but for SQL, where we add some type information and some static analysis capabilities, but ultimately the thing that we send to Snowflake is just the same Snowflake SQL as everybody else. But, yeah, it's a really interesting challenge. And I think the reason that you've seen this absolute explosion in data tooling also has to do with this.
[00:41:08] Tobias Macey:
There's also really cool stuff happening in places like Ibis, where people are really trying to figure out how to translate or map well from one dialect to another. But mapping is always fuzzy, and it doesn't really work at scale. It's a challenging problem. Alright. Well, thank you very much for taking the time today to join me. I appreciate all of the time and effort that you and the rest of the SDF team are putting into this problem space. It's definitely a very important and constantly evolving target, so I definitely look forward to seeing the continued growth of SDF and to starting to experiment with it for my own work. So thank you again, and I hope you enjoy the rest of your day. Thanks a lot. You as well. Appreciate you having me on the show. Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to SDF with Lukas Schulte
The Genesis of SDF
Core Problem SDF Solves
Implementation and Design of SDF
Evolution of SDF and Business Goals
Type Correctness and Data Classification
Computational Capabilities Beyond SQL
Adoption and Migration from DBT
Community and Ecosystem of SDF
Biggest Gap in Data Management Tooling