Summary
In this episode of the Data Engineering Podcast, Lukas Schulte, co-founder and CEO of SDF, explores the development and capabilities of this fast and expressive SQL transformation tool. From its origins as a solution for addressing data privacy, governance, and quality concerns in modern data management, to its unique features like static analysis and type correctness, Lukas dives into what sets SDF apart from other tools like dbt and SQLMesh. Tune in for insights on building a business around a developer tool, the importance of community and user experience in the data engineering ecosystem, and plans for future development, including supporting Python models and enhancing execution capabilities.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Imagine catching data issues before they snowball into bigger problems. That’s what Datafold’s new Monitors do. With automatic monitoring for cross-database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time, right at the source. Whether it’s maintaining data integrity or preventing costly mistakes, Datafold Monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production? Learn more at dataengineeringpodcast.com/datafold today!
- Your host is Tobias Macey and today I'm interviewing Lukas Schulte about SDF, a fast and expressive SQL transformation tool that understands your schema
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SDF is and the story behind it?
- What's the story behind the name?
- What problem are you solving with SDF?
- dbt has been the dominant player for SQL-based transformations for several years, with other notable competition in the form of SQLMesh. Can you give an overview of the venn diagram for features and functionality across SDF, dbt and SQLMesh?
- Can you describe the design and implementation of SDF?
- How have the scope and goals of the project changed since you first started working on it?
- What does the development experience look like for a team working with SDF?
- How does that differ between the open and paid versions of the product?
- What are the features and functionality that SDF offers to address intra- and inter-team collaboration?
- One of the challenges for any second-mover technology with an established competitor is the adoption/migration path for teams who have already invested in the incumbent (dbt in this case). How are you addressing that barrier for SDF?
- Beyond the core migration path of the direct functionality of the incumbent product is the amount of tooling and communal knowledge that grows up around that product. How are you thinking about that aspect of the current landscape?
- What is your governing principle for what capabilities are in the open core and which go in the paid product?
- What are the most interesting, innovative, or unexpected ways that you have seen SDF used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SDF?
- When is SDF the wrong choice?
- What do you have planned for the future of SDF?
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
- SDF
- Semantic Data Warehouse
- asdf-vm
- dbt
- Software Linting
- SQLMesh
- Coalesce
- Apache Iceberg
- DuckDB
- SDF Classifiers
- dbt Semantic Layer
- dbt expectations
- Apache DataFusion
- Ibis
[00:00:11]
Tobias Macey:
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time right at the source. Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production?
Learn more at dataengineeringpodcast.com/datafold today!
[00:00:48] Lukas Schulte:
Your host is Tobias Macey. And today, I'm welcoming Lukas Schulte to talk about SDF, a fast and expressive SQL transformation tool that understands your schemas. So, Lukas, can you start by introducing yourself? Hey. Hey, Tobias. I'm happy to be here. I'm Lukas. I'm one of the co founders and the CEO of SDF. And today, I'll be telling you a little bit about what we've been building over the last couple of years, and why we're very, very excited about it. And do you remember how you first got started working in data? Yeah. Vaguely. So I originally got interested in data, maybe through a peripheral pathway. I studied electrical and computer engineering, and I was primarily interested in sensor systems, and sensors collect a lot of data.
And so my interest in sort of data systems and analysis and understanding how data can be used to express real world systems kind of came from there. So after college, I joined a sensors team at Microsoft.
[00:01:47] Tobias Macey:
I was primarily working on sensor systems there. That was all sort of traditional analytics that was less ML oriented than it might be today, but that was sort of my start. And now bringing us to SDF, can you give a bit of an overview about what it is and some of the story behind how it got started and why you decided that this was worth your time and energy? Yeah, for sure. So SDF
[00:02:09] Lukas Schulte:
came about, maybe to go back a little bit further, those sensor systems that I was talking about, at some point, they all became very ML oriented. Machine learning kind of took over what traditional algorithms have been doing for a long time. And I found myself in that world, building out data infrastructure for a company that was building creative tools on top of various computer vision algorithms. And so we built out this whole like data collection paradigm there, built out a team for labeling ground truth data, created multiple ML models, tried to make them small so they fit on devices. Anyways, the data loads were growing there. And I also live in Los Angeles, which means that, of course, this turns into a social media related enterprise.
And so, of course, we all of a sudden had user data, and I think as any company that grows is wont to do, the folks on the ML side want to start using user data. The company starts having quarterly board meetings, so we need a modern data stack on top of the user data as well. And then the company grows even more, and there's GDPR and CCPA. So this stack that started relatively small and contained grew very, very quickly, and I think it quickly became clear that we kind of lost the plot. And we realized that we needed some system to understand data from ingestion to consumption. And that system didn't really exist. Dbt was, you know, and I think in large part still is, the best system that exists for this kind of stuff today. So we started using that. But at the same time, I was fortunate enough to talk with my now 2 cofounders, Michael and Wolfram, about what they were building at Meta. Wolfram was one of the chief architects of Meta's data warehouse. And at the time, he was building a system for static analysis of SQL at scale, to understand exactly why and how data moves from ingestion to consumption and what impact that has on data privacy concerns, data governance concerns, data quality concerns, how that can improve the developer experience, and so on and so forth. And I think, you know, SDF was born out of the realization that if the Series A, Series B companies that are starting to build out their data stacks have some of these challenges and, you know, the largest data consumers in the world, like Meta, have some of these same concerns, that means there's probably a cross cutting reason to start working on this. And that was sort of the genesis of SDF. It's a very enigmatic name. Wondering if you could unpack it a little bit. I don't know, everybody loves 3 part names. Right? I think.
So Wolfram at Meta coined the term, the semantic data warehouse, because that was sort of the goal, I think, also after some of the data privacy issues that Meta had in the late 2010s. The goal was to build a semantic data warehouse, where you had a true understanding of what transformations acted on what types of data throughout everything. And this is what Wolfram was working on, and we needed SDF to be more pluggable into other different systems, hence fabric. And then we saw that those were also the perfect three letters on the keyboard if you're working on a command line utility, and we thought, you know what? This is absolutely perfect. So semantic data warehouse turned into semantic data fabric. And now, hopefully, you know, those are our three favorite words on the keyboard.
[00:05:07] Tobias Macey:
And I guess it's a natural progression that you would be able to manage your versions of SDF using the ASDF VM. That would be that would be
[00:05:17] Lukas Schulte:
we should do that, honestly. We should start looking into that.
[00:05:21] Tobias Macey:
And so you started touching on this a little bit, but what is the core problem that you're solving with SDF and in particular as it's juxtaposed
[00:05:31] Lukas Schulte:
with the most notable I don't know if competitor is the right term, but alternative in the ecosystem in the form of DBT. So, yeah, maybe rather than answering the question super directly, I'll answer with a little bit of a story, which mirrors, you know, how data engineers work in teams today. So let's say you're a data engineer at a midsize company. There's probably a few more technical folks and then a larger number of slightly less technical data analysts. Likely you're working in dbt. Your workspace is now, you know, hundreds, if not thousands, of models. Compile times take a long time to complete. Dependencies are harder to manage, right? It's difficult to, you know, delete a column in a model here or change a schema and understand exactly what changes downstream, pipelines break, and, you know, maybe worst of all, debugging is pretty painful because, at least if you're operating on dbt, if you run a command, Snowflake only sees the resulting query after Jinja expansion and macro expansion and all the configuration takes place. So the error you get is actually the Snowflake error on that resulting query, not an error on the SQL file that you're working on on your laptop. And this is pretty strange, especially in comparison to software engineering, where a lot of these hard parts are solved, right? There's a lot of talk about column level lineage. Compilers have had linkers for like 30 years, right? And a linker just manages dependencies between files, which is exactly what you want for SQL and data as well. There's really great static analysis tools, really great linters that are pretty highly performant. And I think what we want to do is bring a lot of that tooling, right, those really, really top notch software development experiences, into the realm of data engineering as well. So we want, you know, more static analysis, more guarantees in CICD, more verification through automated systems rather than, you know, manually running a query, changing a column name, changing an aggregation, and kind of seeing what happens. And we think this is possible. It's really hard to do because SQL dialects are so differentiated between data warehouses, but this is really our goal. Right? This is, like, SQL validation and transformation, especially at scale. It was a very long winded answer, but there you go.
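To make that debugging gap concrete, here is a minimal sketch with invented model and column names; the templating shown is ordinary dbt-style Jinja, and the expanded statement is roughly what the warehouse actually receives.

```sql
-- The file the analyst edits (dbt-style model; names invented for illustration):
--
--   -- models/daily_orders.sql
--   {{ config(materialized='table') }}
--   SELECT
--       order_date,
--       COUNT(*) AS order_count
--   FROM {{ ref('stg_orders') }}
--   GROUP BY order_date
--
-- After Jinja and macro expansion, the warehouse sees something like this,
-- and any error it raises points at the expanded text below, not at the
-- line numbers in the file on your laptop:
CREATE OR REPLACE TABLE analytics.daily_orders AS
SELECT
    order_date,
    COUNT(*) AS order_count
FROM analytics.stg_orders
GROUP BY order_date;
```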
[00:07:39] Tobias Macey:
No. It's it's definitely helpful to get that context. And you've already mentioned DBT. They have been the dominant player in this SQL oriented transformation market for a while. They helped to actually create that market and the idea of the analytics engineer. Another notable entrant in this landscape in the past couple of years has been SQL mesh that is taking aim at dbt and focusing on that ability to actually parse and understand the SQL AST. And I'm wondering if you can give a bit of an overview about how you view that Venn diagram
[00:08:15] Lukas Schulte:
of features and functionality across SDF, DBT, and SQL mesh, and any other tools that you feel like throwing in the mix? So, yeah, maybe maybe the first thing to call out. Definitely, I want to be very, very complimentary of all the the tools and companies in the space. I think, ultimately, we're all circling the same drain and trying to make the world of data engineering better for everyone as a long time user of dbt. And I think what they what they've done especially has been really transformational in in the world of data. So that's maybe that's maybe step 1. And I think to that end, a lot of these tools have, just slightly different approaches, right? There's another tool there that's, I think been gaining a lot of popularity amongst Snowflake users is Coalesce. And, they have some tooling, but for them, it's very much oriented towards, visualization and making the drag and drop experience as good as it gets. And that works for some folks. I think the the folks at SQL mesh have really taken dbt to, you know, the the next level.
Some of the things that they've done around virtual environments are pretty cool. But again, it's a different authoring framework, and I think the the approach there is very opinionated in terms of, hey, you know, this is this is sort of, you know, the ideal way to write a data warehouse. We've been trying to be a little bit less opinionated and work more with SQL dialects where they are and work with DBT projects where they are. And I think our goal here has been less to to work on the authoring surface and more on the engine, really sort of showcase, the capabilities that we have on the SQL understanding side and what we can do once we have executable semantics for some of these proprietary dialects and actually, you know, slowly inch our way into the realm of native compute as well. So in short, I think different different approaches, different things will work for for for different companies. We're obviously really excited about the static analysis capabilities that we've developed and and what those enable us us to do. Now digging into
[00:10:13] Tobias Macey:
the implementation and design of SDF, I'm wondering if you can talk to some of the ways that you're thinking about the core functionality, the ways that you are tackling the complexities of transformation, and in particular being able to scale to those hundreds or thousands of models and some of the engineering challenges that you've had to address as part of that effort? Yeah. So I think there we've taken maybe a fundamentally different approach than I think anybody else. So we built from the ground up in Rust and
[00:10:47] Lukas Schulte:
tried to build full grammar descriptions for the SQL dialects that we support. And that's been quite, quite an undertaking, but it allows us to have very, very precise, not just SQL analysis, but SQL validation. So I think maybe my favorite example here is at the moment, I think the only entity that can tell you whether you're missing a comma in a Snowflake SQL query is Snowflake itself, right? Like you have to send that query to Snowflake and then Snowflake compiles that query and says, hey, yes, you're missing a comma here. Or if a function takes a VARCHAR instead of an integer, only Snowflake can tell you that. And this is so counterintuitive if you're used to software engineering where, you know, you write whatever, you have a TypeScript compiler, or you have Cargo, right? You have all these local compilation tools that tell you everything that you need to know at compile time, that give you compile time guarantees on your laptop, that we, I think, have finally unlocked at scale. And the reason I say at scale, Rust has allowed us to do a lot of really cool performance optimizations.
It's allowed us to go really, really deep, build in type binding and other static analysis toolsets, and still have the result be relatively performant. So I think at this point, one of our benchmarks, we have an SDF workspace with 16,000 SQL files that are on average 500 lines long or have on average 500 columns. And compiling that, getting all the column level lineage, takes, like, less than 45 seconds on my laptop, and I'm pretty happy with that performance.
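As an illustration of the class of error being described, here is a small sketch; the table and column names are invented. Nothing warehouse-specific is needed to spot the problem, only the schema and the function signatures, which is exactly what a local, schema-aware compiler can check.

```sql
-- Invented schema, for illustration only.
CREATE TABLE raw.users (
    user_id    INTEGER,
    email      VARCHAR,
    created_at TIMESTAMP
);

-- Today this is typically only rejected by the warehouse at run time;
-- a schema-aware local compiler can flag it before the query is ever submitted.
SELECT
    SUM(email)              AS broken_metric,  -- type error: SUM over a VARCHAR column
    COUNT(DISTINCT user_id) AS users
FROM raw.users;
```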
[00:12:20] Tobias Macey:
To your point of having to send the query off to the cloud warehouse and wait for it to tell you what you're doing wrong, there's a little bit of a parallel happening as more and more of software architecture and infrastructure is cloud native or requires the cloud, particularly if you're using some of the core cloud provider services versus the open source alternatives. So definitely a similar struggle happening in that regard for kind of the architecture and infrastructure piece. But from the pure software layer, it's definitely true that we have gotten used to being able to lean on our tooling to be able to tell us early and often what we're doing wrong.
[00:13:01] Lukas Schulte:
Yeah. Absolutely. It does seem like the reason that the cloud vendors are so excited about being able to require cloud connections for more features is that it adds to their vendor lock in. Right? Like, it adds to their ecosystem moat in a pretty big way. And I think that's the reason why things like Iceberg have been so exciting in the last couple of years, because they, you know, for the first time, give companies a little bit more of an out where they can say, you know what? Hey. There's, like, another opportunity. I can use another query engine to run a query against this data.
[00:13:28] Tobias Macey:
Yes. Data gravity is a very real point of leverage.
[00:13:32] Lukas Schulte:
Yeah. It's enormous, I think. We've talked to companies that have multiple cloud vendors just so that they have negotiating leverage when their contracts come up for renewal. Right? Like, that's how enormous data gravity is to an enterprise.
[00:13:48] Tobias Macey:
And as you have been working through the development and early stages of onboarding and working with some of the early adopters of SDF, what are some of the ways that the scope and goals of the project and the business have changed in that time?
[00:14:05] Lukas Schulte:
They've changed in almost every way. I will say one thing that I'm happy about: at all of our all-hands, I get to sort of show some of the mission statements that we put together over 2 years ago at this point. The nice thing is the mission hasn't changed, but I think the implementation has changed quite a bit. So one thing that's changed pretty dramatically is initially, we were entirely focused on static analysis, and we had this pipe dream that we could maybe think about execution someday. But for us to be able to think about execution, we have to have, you know, these fully resolved logical plans that you can actually send to a query engine. And we didn't know if we were actually going to be able to get there. And I think in the last sort of 3 months, we've turned the corner. I think we now have pretty strong executable semantics for the SQL dialects that we support. And that means that we're really, really excited about our ability to support some of that, you know, we call it, like, query engine emulation, so you can actually run, you know, Snowflake queries locally on your laptop. And I think that's a really cool capability. And if it scales, I think it could be quite powerful.
[00:15:06] Tobias Macey:
And to that point too, I know that in looking through the documentation, you're also using the Arrow DataFusion capability for being able to actually do some fully local execution, and that's definitely a very different set of functionality than something like dbt that is wholly reliant on whatever the query engine is that you're targeting, where if you wanna do something locally, then you're probably using DuckDB. Otherwise, you're relying on some of these other engines. I'm wondering if you could maybe just break out some of the features that help to differentiate SDF and some of the ways that those fit into the overarching workflow for people who are using that tool to accomplish their, you know, engineering and business goals?
[00:15:50] Lukas Schulte:
That's a great question. So so, yeah, maybe maybe to the point about dbt. Dbt operates on on strings. Folks, like SQL Mesh, they'll take it one step further and look at look at ASTs. And our goal is to have sort of a a fully, you know, type safe implementation, which means that at every sort of node in the transformation layer graph, we know all of the types that are incoming and expected as output from every transformation. So if you run an SDF compile on a single model, you'll see that SDF knows not just, you know, the column names, but also exactly their data types. And once you have that, you can actually start adding additional type information on top. So you can start working with classifiers and higher level type objects rather than just VARCHARs and integers. And from a developer driven data governance standpoint, this is super exciting because you can finally tell your data warehouse sort of at ingestion, hey, here's where my PII is. Here's where my canonical definition of a daily active user is. Here's, you know, all columns that have social security numbers have this type of retention policy, and you can start automating a large part of that system that was otherwise manual, as it relates to, higher level data types, because VARCHARs are incredibly expressive, but not all VARCHARs are the same, right? And it's hard to treat them differently in the current transformation layer ecosystem.
And we're really, really excited at sort of the extra level of capabilities that type correctness gives you. So adding data classification tools is definitely one of the really, really exciting things that you can do when you have type correctness.
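As a sketch of the idea, with hypothetical classifier labels shown as comments rather than SDF's actual configuration syntax: physically these columns are all VARCHARs, and it is the higher-level types layered on top that make automated policy checks possible.

```sql
-- Invented table; the classifier annotations in the comments are hypothetical.
CREATE TABLE raw.signups (
    user_id    VARCHAR,  -- classifier: COMPANY_A.USER_ID
    email      VARCHAR,  -- classifier: PII.EMAIL  (mask in downstream models)
    ssn        VARCHAR,  -- classifier: PII.SSN    (retention policy applies)
    utm_source VARCHAR   -- unclassified: a plain VARCHAR is fine here
);
```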
[00:17:26] Tobias Macey:
Yeah. The extra classifier capability is something that I found particularly compelling because, as you said, there is a lot of nuanced detail that is very easy to lose track of or very difficult to surface, where maybe you can write it into a docstring somewhere, but then you're relying on human operators to actually read all of that, parse it, understand the real impact of that, versus being able to actually put business rules around that type information, around those higher order details of things like PII or some of the business semantics. And in particular, I'm wondering how you see the classifier functionality in comparison to what happened a couple of years ago with the idea of the, what do they call it, the metrics layer, and some of the ways that that has actually been
[00:18:24] Lukas Schulte:
realized in terms of the technical implementations, and in particular maybe the dbt semantic layer, since that's what most folks are probably working with. Yeah. I think metrics and semantic layers, I see them more at larger scale companies, but I think the need for something in that space is real. So I'm curious, maybe for your projects in the data warehouse that you work on most, do you know approximately, like, what percentage of columns are just VARCHARs in that data warehouse? Not offhand, but I would presume the majority. Okay. Yeah. That sounds about right. Yeah. What we see is it's something like 50% of all columns that we encounter are just VARCHARs, which is extraordinary for the amount of, you know, different and very nuanced information that those columns actually hold. And I think there's an example here where there's a very large enterprise that bought another company. And one of the data analysts started joining user IDs on user IDs. Seems like the most mundane and normal, you know, join operation in the world if you're working a lot with users and user IDs. But it turns out that one set of user IDs was from the original company, and the other set of user IDs was from the company that that company had just bought. So they were actually completely disjoint sets of user IDs that had some overlap because, you know, they're all, like, integers or something like that. And the company didn't catch this for months on end, and it was very, very expensive to rectify because they had to do a whole bunch of backfilling. But my point here is even just calling a column user ID doesn't make it the same user ID. Right? Like, there's a need, especially when you're working at scale, to work with higher level types. And, you know, our goal with our type system is to sort of allow you to say, hey. This is, you know, company 1 user ID. This is company 2 user ID. And you can create a little rule. Right? We call it business logic as code that says, hey. You can never join those 2. And if you do join them, you know, flag a warning. So that capability is, I think, super critical. I know it's important for everything from, you know, user IDs to privacy concerns and data governance concerns, but also metrics. We've seen these classifiers used in a few different ways that were very not obvious to me in the beginning, and we're really, really excited to see what it is that people actually use them for. We'll see if it turns out to be a replacement or an addition to, you know, a traditional metrics layer.
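A purely illustrative reconstruction of that incident, with invented schema and table names: both columns look identical to the engine, so the join runs without complaint even though the two ID spaces are unrelated.

```sql
SELECT
    a.user_id,
    a.signup_date,
    b.last_login
FROM company_one.users AS a
JOIN company_two.users AS b
    ON a.user_id = b.user_id;  -- a "business logic as code" rule over higher-level
                               -- types (COMPANY_ONE.USER_ID vs COMPANY_TWO.USER_ID)
                               -- could flag this join instead of letting it
                               -- silently succeed
```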
But it's been exciting to see, yeah, what people use these things for so far. Digging a bit further into the developer experience, the workflow,
[00:21:01] Tobias Macey:
particularly as you start to scale the usage and in particular for some of these enterprise class use cases that you've touched on where you're spanning multiple different teams, possibly even completely different business units or organizations or and some of the ways that SDF helps to support some of those very kind of fractal use cases where you maybe have some various handoffs and maybe those handoffs aren't always very clearly defined. And, also, given the fact that SDF as a product is open core and has paid features, some of the ways that those team oriented capabilities differ between the open and the commercial offerings.
[00:21:45] Lukas Schulte:
From a philosophical standpoint, our goal is that, you know, essentially, what you run locally on your machine is free to use, and what happens in the cloud is typically a service that we have to manage and therefore is paid. We also haven't really tried to change the authoring surface, so there's no GUI; it's a command line engine. But that engine, because it's written in Rust, is just a binary that you download. So the idea was if you are working in a remote VS Code workspace or a GitHub Codespace, you can install SDF into that environment, and it just works. There's no other dependencies. There's no, you know, Python or virtual environment that you have to manage. You just install the binary and go. And the other goal from sort of a static analysis viewpoint was if you have, you know, some analyst who decides to write a new model, that they can take an SDF workspace, add a model to it. And if SDF compiles that model correctly and there are no errors spit out by the compiler, that model should be good to go, and you should be able to integrate it into your production pipelines without breaking anything. That's sort of the guarantees that SDF tries to make to the workspace owners. Additionally, I think from, like, a using the tool standpoint, a lot of functionality actually mirrors dbt. Right? There's dbt compile. SDF compile does most of the same stuff that dbt compile does, but also gives you some of the static analysis capabilities and does type checking and, you know, builds out column level lineage and so on and so forth. There's, you know, dbt run and SDF run. Again, similar functionality. Dbt test, SDF test, again, similar functionality.
So so the goal here really is to, again, meet people where they are in their,
[00:23:25] Tobias Macey:
authoring workflow. Another interesting parallel that I'm curious to hear your answer to as far as the development workflow and the interfaces is that with dbt, maybe it was a year or 2 ago that they started offering the capability of Python models, which aren't available across all execution engines, but give you the ability to write arbitrary Python as long as it generates a table structure as the output. I know SQLMesh has a similar capability of being able to do Python models as long as it returns a data frame. I'm curious how SDF thinks about the computational capabilities that go beyond SQL and some of the ways that teams can address those maybe more complicated or complex computational requirements in the event that there isn't a built in function in their target engine to address that capability?
[00:24:21] Lukas Schulte:
Yeah. Great. I'm actually really glad that you asked this question. So we've been thinking about it a lot. And there's there's like a like a there's like an agony and an ecstasy to like allowing arbitrary Python execution in your code. So the great parts are obviously that you can create, you know, table functions, data frames, and that is incredibly powerful if you want to build, especially like reusable code. So to date, SDF does not support data frame operators or Python in SDF's world. And the reason is fairly simple. We want to make sure that we can provide all of these guarantees, and until we find a way to provide all those same guarantees in Python, we're not gonna provide Python models. So this is maybe the most opinionated stance that we've taken to date, but the good news is we have a plan, and we're very, very excited about, sort of the the plan here to offer a Python interface, in the future. It won't happen in the near term. It's not something that's actively being worked on. It's on our backlog.
We still have connectors and other things that we wanna get through first, but, there's there will be some some really exciting stuff coming there, hopefully in the you know, hopefully in less than a year. The other interesting aspect of your positioning
[00:25:33] Tobias Macey:
and the particular time that we are in the ecosystem is that as we've already mentioned several times, DBT is the 1st mover, not the 1st mover, but in in terms of modern history or recent history, one of the first movers in the SQL as software engineering practice and being able to build a set of transformations that have dependency chaining as part of that. And I'm curious how you're thinking about the adoption and migration path for teams who have already invested in dbt, have substantial code bases, are they're very interested in the capabilities that SDF offers, but there is that barrier of, well, I've already got all the sunk cost into dbt. So I'm just gonna keep going in that direction because I don't wanna have to figure out a whole other tool. I'm just wondering if you can talk to what is your answer for those people?
[00:26:27] Lukas Schulte:
Yeah. That is most people, I think. At this point, a wild percentage of Snowflake, BigQuery, etcetera, compute comes from, like, dbt models, and it's something like 85 or 90% of the data teams that we see at this point use dbt. And there's one thing which is sort of moving code, and there is another piece which I actually think is even more challenging, which is actually sort of reeducating or getting all of the team on the same page about whatever a new authoring system should be. So, our approach here is to try and meet folks where they are. I already mentioned that we have parity between, you know, dbt compile and SDF compile, dbt build, SDF build, dbt test, SDF test, and so on and so forth. There are, you know, some other commands that maybe showcase some of the unique features that SDF has. But the goal here is the user experience there should not change dramatically.
The second part is sort of all the dbt configuration, ref statements, etcetera, some of which SDF does not need, but we are moving to a world, and probably by the time this podcast goes live, we'll actually already be in that world, where SDF will natively run and interpret dbt models and dbt configuration and dbt profiles and ingest them and just use SDF as the engine. So I think the way to think about that is there's dbt as the authoring layer, and then there's the dbt engine that actually takes that configuration and executes it and does all the Jinja expansion. The capability that we're launching in the next couple of weeks allows you to keep that same dbt configuration and just exchange the engine with the SDF engine and get all the same sort of static analysis capabilities, speed improvements, all the wonderful things you get from Rust, directly, on top of your dbt project. And once you have that, you can start to delete your ref statements because you no longer need them, and, we're happy about that, obviously.
[00:28:29] Tobias Macey:
And the other piece of dbt investment is the set of packages that they have been working to try and build up. There is a, I think, an unequal distribution of people who are using them versus not, and I'm wondering how you're addressing that aspect, or is it just a matter of it's all DBT, so it doesn't really matter, it just works.
[00:28:50] Lukas Schulte:
The, the goal is it's all dbt, it just works. We will see how far down the list we get. There are a lot of dbt packages at this point, and Python Jinja is leaky. So you can actually do Python subroutines, like, directly from Jinja and have those be executed. And managing things like this is incredibly difficult, especially if you want, like, a closed, compiled, well defined system like SDF really tries to be. And dbt sometimes, like, tries to break out of that cage a little bit. We'll see. I think a lot of the core things like dbt expectations and so on and so forth, we already support. I'm sure there's libraries where we'll have to figure out if there's additional Jinja complexity that we need to take into consideration. But the goal is you just download SDF and it works. And then the other piece
[00:29:33] Tobias Macey:
of complexity around trying to break into an established market is just the mindshare is one piece of it, and then there's also all of the technical investment. But beyond that, there's also just the amount of communal knowledge that gets built up around these tools in terms of blog posts, presentations, you know, just intra team communication. And just I'm curious how you think about that aspect of the problem as well, just being able to kind of work your way into that mindshare and reduce those points of friction so that there isn't as much requirement
[00:30:09] Lukas Schulte:
for that communal knowledge for people to be able to figure out how they work around those edge cases as they try to move down that adoption path. I mean, I will say, I feel like a lot of the help requests that I see around dbt have to do with, like, Python environments and packages and versioning issues. And in SDF, you don't have any of those problems. So, hopefully, we can at least take a large subset of the help that's required to make dbt run at scale and maybe forget about it. But, yeah, I mean, the reality is, of course, one of dbt's greatest assets, and what they've done an incredible job at over the last almost decade at this point, is building out a really stellar community of engaged and helpful people with, at this point, a very large knowledge base. And it would be great if you could, you know, translate some of that knowledge. Right? So whether it is, you know, linting configuration or what the right model structure is for an efficient, you know, warehouse.
Like, ideally, SDF, you know, is just an addition into that community and an additional tool for people to use rather than something that tries to, you know, redefine the wheel. Like, the last thing I wanna do is, you know, boil the ocean and try and build an entirely orthogonal, but similar, authoring experience from scratch. The dbt folks got a lot of things right. Right? And, like, there's, you know, tens, if not hundreds of thousands, of developers using dbt every day, hundreds of thousands of dbt projects. And what we want to do is elevate what the capabilities are there. Right? That's really the goal. In terms of the community investment around SDF, I'm wondering what are the
[00:31:42] Tobias Macey:
interfaces and extension points for people to be able to augment or extend the core functionality of SDF. And then on the the kind of outer shell, the additional tooling or plug ins that people might be interested in building to extend their experience of working with SDF?
[00:32:02] Lukas Schulte:
On the open source side of things, I think we invest heavily in, and love investment in, DataFusion. The core of our execution engine and our executable semantics, like, comes from DataFusion. So anybody that wants to spend the time and, you know, build a really great Rust based query engine, go check out DataFusion and, you know, see if you can pull a ticket or 2. Separately, on our side, we have found maybe some really cool optimizations. Tests is one of them. I think dbt tests are not super efficient the way the testing library is written. Those macros mean that you're doing a lot of scans for every individual test. We've written a testing library that I think is a little bit more efficient, that batches things a little bit more elegantly, and we are constantly trying to put out more of these additional packages and get folks to contribute to those as well. So I think from an open source standpoint, the package ecosystem is what I think we're most excited for community investment in at this point. As you have been building SDF, growing the community, growing the business, what are some of the most interesting or innovative or unexpected ways that you've seen the tool used? There's been a few really fun ones. I think, one, we talked a little bit about metrics and classifications a while ago. We initially developed the data classification system more for privacy and governance use cases. And then we started getting questions about, hey. Like, you know, we have 5 different definitions of what a daily active user is. People just keep copying and pasting the SQL query and, like, putting it in new places, and we wanna have one canonical definition of what a daily active user is. Can we use SDF classifiers to do this? And we said, maybe this will work. And it turns out it works, like, super, super well. And then we had a similar question around data retention and using classifiers to map out which tables needed which types of retention policies. So that was really, really exciting to see, because that was a true use case where I think people were spending a lot of time trying to map and manage retention policies.
And now there's a simple, you know, SDF report that they run in their CI pipeline or in their orchestrator once a day, and they get a report of all the partitions that need to be deleted. That's fantastic. That was completely unexpected. So that's, yeah, one example of using classifiers in a completely unexpected way. And as people start to investigate SDF, they want to
[00:34:32] Tobias Macey:
start incorporating it into their stack, what are the cases where SDF is just the wrong choice and you would advocate against using it?
[00:34:40] Lukas Schulte:
Yeah. If you're writing Scala, or RDDs, probably not the this is probably not the tool for you. I think a lot of the the spark ecosystem is really rich and has a lot of capabilities. There's a lot of notebooks, especially from from Databricks. SDF does not work well in that universe, at the moment, for some of the reasons I've outlined earlier. So I think that that is probably the main area at this point where SDF is probably not the right tool.
[00:35:05] Tobias Macey:
And in your experience of building SDF, growing the business, growing the project, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:35:16] Lukas Schulte:
Yeah. Building a framework that people like or building a cool developer tool is not the same as building a business. I think that, like, if you're, you know, an excited engineer and you're like, yes, I'm gonna build a really cool tool, it's very easy to focus on the tool and the functionality and the capabilities and where it's differentiated and novel. But growing a business has different requirements, and understanding what should be free to use, what should be paid, what features make sense when, you know, they're offered as a paid feature for a company versus what doesn't make sense, whether to, you know, price on seats or models. Those are really, really difficult questions, and the only way to tackle them is to just, you know, talk to a lot of companies and talk to a lot of engineers. That's been a fascinating lesson over the last 2 years, but I'd especially say over the last 6 months.
[00:36:09] Tobias Macey:
And as you continue to build and invest in SDF, what are some of the things you have planned for the near to medium term or any particular projects you're excited to explore?
[00:36:19] Lukas Schulte:
Yeah. I think on the dbt side, seeing more use cases of SDF as a dbt accelerator is super exciting. And on the compute side, we have some awesome, I think, you know, demos and hopefully an alpha very soon that we'll be able to show, where you can actually start, you know, using SDF as a query engine for model transformations, either directly from your orchestrator or as a separate service. And then lastly, on the cloud product side, just today, we launched a really cool impact analysis feature that looks at diffs between 2 warehouses. Maybe it's the state that's in main, maybe the state that's in a pull request, and we'll show you the impact of those changes, what columns changed, if there's anything that breaks downstream, etcetera. And I'm really, really excited to invest a little bit more on the cloud side in building out differentiated tool sets for enterprises that uniquely need services and that you can't just run on the laptop by yourself. Are there any other aspects of the SDF product, the SDF tooling,
[00:37:21] Tobias Macey:
the ecosystem around it, and the ecosystem that you are working within that we didn't discuss yet that you'd like to cover before we close out the show? There's, there's there's always too many things to talk through. No. I I really appreciate you having me on. This was, absolutely fantastic. Feel like I learned a lot. Questions were great. Appreciate your time. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And so as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. There's, there's a consensus
[00:37:57] Lukas Schulte:
among software engineers, something that we've talked about a lot today, which is that the tooling in data engineering is not up to par with the tooling in software engineering. And if you look at the reasons for this, I think selfishly, I, as someone who's building a compiler for SQL, will say, hey, there's not really a good SQL compiler that exists. Right? Like, compilers are the framework for everything from, you know, IntelliSense to, you know, CICD checks. There's no really good compiler for SQL. So, like, that's the problem in data development. But maybe digging one step deeper, like, why is there no good compiler for SQL? I think if you take a look at the history of SQL, there's no module system that was ever introduced in that language. Right? Like, the reason that you have data frames is data frames are the only way to do table level functions, essentially. Otherwise, you have some user defined functions, maybe, but there's no, like, import statements. There's no libraries. There's no package repository for anything in the SQL world, which means that every single company, all over the world, every single time, is rebuilding the same wheel from scratch. And I think that is maybe the most exciting problem and question in the data space. Right? It's like, how do you actually build a data development framework that provides some level of modularity, some ability for libraries to exist, some ability for code reuse in any sort of way. That I think is the most interesting
[00:39:20] Tobias Macey:
gap in data development today. And also the fact that despite being referenced as a language, it is actually
[00:39:29] Lukas Schulte:
a fractal amount of dialects. And so even if you can have some way of saying this is SQL, this is SQL, they're actually 2 different SQLs. And if you try to run them both on the same engine, then you're gonna have a bad day. You're gonna have a really bad day. Yeah. But if you think about, like, why is this the case? Like, why did Snowflake evolve its own grammar? Why are they still evolving? Like, that grammar is still evolving. DuckDB at this point, right? Now DuckDB is also creating its own grammar. The reason for this is there's no extensibility. So the only thing that's left is to change the language. If you can't add a library for, you know, how you want to work with some ML model, like, of course, if you're Snowflake and you want people to start running ML models as well, of course, you're gonna build, you know, a little, like, dialect addendum that lets you run, you know, GPT directly from, like, a Snowflake SQL query. It's complete madness, but it's the only recourse that's available to these vendors as well. Right? It's expanding and extending the language. So, yeah, I think if you could figure out if there's a way to create modularity in that universe, I think that would be incredibly powerful.
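For a small taste of that divergence, here is the same "orders in the last seven days" filter written for three engines; the table and column names are invented, but the dialect differences are representative of why "it's all SQL" rarely means "it runs everywhere."

```sql
-- Snowflake
SELECT COUNT(*) FROM orders
WHERE created_at >= DATEADD(day, -7, CURRENT_TIMESTAMP());

-- BigQuery
SELECT COUNT(*) FROM orders
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY);

-- DuckDB / Postgres-style
SELECT COUNT(*) FROM orders
WHERE created_at >= now() - INTERVAL '7 days';
```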
[00:40:29] Tobias Macey:
Yeah. That's where you also start getting all of these superset languages that compile down to SQL, similar to how we've had all of these different languages that compile to JavaScript.
[00:40:39] Lukas Schulte:
Yes. This is yeah. This is exactly it. Alright. I have, you know, sometimes tried to describe a little bit of what we're doing as, you know, sort of the TypeScript to JavaScript transition, but for SQL, where we, you know, add a little bit of type information, some static analysis capabilities. But ultimately, like, the thing that we send, you know, to Snowflake is just the same Snowflake SQL as everybody else. But, yeah, it's a really interesting challenge. And I think the reason that you've seen this, like, absolute explosion in data tooling also has to do with this. Right? Like, there's
[00:41:08] Tobias Macey:
also really cool stuff happening in places like Ibis, right, where, like, people are really trying to figure out how to translate and/or map well from one dialect to another. But mapping is always fuzzy, and it doesn't really work at scale. It's a challenging problem. Alright. Well, thank you very much for taking the time today to join me. I appreciate all of the time and effort that you and the rest of the SDF team are putting into this problem space. It's definitely a very important and constantly evolving target. So I definitely look forward to seeing the continued growth of SDF and starting to experiment with it for my own work. So thank you again, and I hope you enjoy the rest of your day. Thanks a lot. You as well. Appreciate you having me on the show. Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Imagine catching data issues before they snowball into bigger problems. That's what DataFold's new monitors do. With automatic monitoring for cross database data diffs, schema changes, key metrics, and custom data tests, you can catch discrepancies and anomalies in real time right at the source. Whether it's maintaining data integrity or preventing costly mistakes, DataFold monitors give you the visibility and control you need to keep your entire data stack running smoothly. Want to stop issues before they hit production?
Learn more at data engineering podcast.com/datafold
[00:00:48] Lukas Schulte:
today. Your host is Tobias Macey. And today, I'm welcoming Lucas Schulte to talk about SDF, a fast and expressive SQL transformation tool that understands your schemas. So, Lukas, can you start by introducing yourself? Hey. Hey, Tobias. I'm happy to be here. I'm Lukas. I'm one of the co founders and the CEO of STF. And today, I'll be telling you a little bit about what we've been building over the last couple of years, and why we're very, very excited about it. And do you remember how you first got started working in data? Yeah. Vaguely. So I originally got interested in data, maybe through a peripheral pathway. I studied electrical and computer engineering, and I was primarily interested in sensor systems and, sensors collect a lot of data.
And so my my interest in sort of data systems and analysis and understanding, how, data can be used to express real world systems, kind of came from there. So after after college, I joined, sensors team at Microsoft.
[00:01:47] Tobias Macey:
I was primarily working on sensor systems there. That was all sort of traditional analytics that was less ML oriented than it might be today, but that was sort of my start. And now bringing us to SDF, can you give a bit of an overview about what it is and some of the story behind how it got started and why you decided that this was worth your time and energy? Yeah, for sure. So SDF
[00:02:09] Lukas Schulte:
came about, maybe to go back a little bit further, those sensor systems that I was talking about, at some point, they all became very ML oriented. Machine learning kind of took over what traditional algorithms have been doing for a long time. And I found myself in that world, building out data infrastructure for a company that was building creative tools on top of various computer vision algorithms. And so we built out this whole like data collection paradigm there, built out a team for labeling ground truth data, created multiple ML models, tried to make them small so they fit on devices. Anyways, the data loads were growing there. And I also live in Los Angeles, which means that, of course, this turns into a social media related enterprise.
And so, of course, we all of a sudden had user data, and I think as any company that grows is want to do, the folks on the ML side want to start using user data. The company starts having quarterly board meetings, so we need a modern data stack on top of the user data as well. And then the company grows even more, and there's GDPR and CCPA. So this stack that started relatively small and contained grew very, very quickly, and I think quickly became clear that we kind of lost the plot. And we realized that we needed some system to understand data from ingestion to consumption. And that system didn't really exist. Dbt was the it, you know, was, and I think in large cases still is the best system that exists with this kind of stuff today. So we started using that. But at the same time, I was fortunate enough to talk with my now 2 cofounders, Michael and Wolfram, about what they were building at Meta. Wolfram was one of the chief architects of Meta's data warehouse. And at the time, he was building a system for static analysis of SQL at scale, to understand exactly why and how data moves from ingestion to consumption and what impact that has on data privacy concerns, data governance concerns, data quality concerns, how that can improve the developer experience and so on and so forth. And I think, you know, SDF was born out of the realization that if the series a series b companies that are starting to build out their data stacks have some of these challenges and, you know, the largest data consumers in the world, like Meta, have some of these same concerns. That means there's probably a cross cutting reason to start working on this. And that was sort of the genesis of SDF. It's very enigmatic name. Wondering if you could unpack it a little bit. I don't everybody loves 3 part names. Right? I think.
So Wolfram at Meta coined the term, the semantic data warehouse, because that was sort of the goal, I think, also after some of the data privacy, issues that Meta had in the late 2010s. The goal was to build a semantic data warehouse, where you had a true understanding of what transformations acted on, what types of data throughout everything. And this is what Wolfram was working on, and we needed SDF to be more pluggable into other different systems, hence fabric. And, then we saw that those were also the perfect three letters on the keyboard. If you're working on a command line utility, and we thought, you know what? This is absolutely perfect. So semantic data warehouse turned into semantic data fabric. And now, hopefully, you know, those are our three favorite words on the keyboard.
[00:05:07] Tobias Macey:
And I guess it's a natural progression that you would be able to manage your versions of SDF using the ASDF VM. That would be that would be
[00:05:17] Lukas Schulte:
we should do that, honestly. We should start looking into that.
[00:05:21] Tobias Macey:
And so you started touching on this a little bit, but what is the core problem that you're solving with SDF and in particular as it's juxtaposed
[00:05:31] Lukas Schulte:
with the most notable I don't know if competitor is the right term, but alternative in the ecosystem in the form of DBT. So, yeah, maybe rather than answering the the question super directly, I'll I'll answer a little bit of a story, which mirrors, you know, how data engineers work in in teams today. So let's say you're a data engineer at a midsize company. There's probably a few more technical folks and then a larger number of slightly less technical data analysts. Likely you're working in dbt. Your workspace is now, you know, 100, if not thousands of models. Compile times, take a long time to complete. Dependencies are harder to manage, right? It's difficult to, you know, delete a column in a model here or change a schema and understand exactly what changes downstream, pipelines break, and, you know, maybe worst of all, debugging is pretty painful because if at least if you're, operating on DBT, if you run a command, Snowflake only sees the resulting query after Jinja expansion and macro expansion and all the configuration takes place. So the error is actually related to the Snowflake error, not the error on the SQL file that you're working on on your laptop. And this is, this is pretty strange, especially in comparison to software engineering, where a lot of these hard parts are solved, right? There's a lot of talk about column love lineage. Compilers have had linkers for like 30 years, right? And a linker just manages dependencies between files, which is exactly what you want for SQL and data as well. There's really great static analysis tools, really great linters that are pretty highly performant. And, I think what we want to do is bring a lot of that tooling, right, that those really, really top notch software development experiences into the realm of data engineering as well. So we want, you know, more static analysis, more guarantees in CICD, more verification through automated systems rather than, you know, manually running a query, changing a column name, changing an aggregation, and kind of seeing what happens. And I think we think this is possible. It's really hard to do because SQL dialects are so differentiated between data warehouses, but this is really our goal. Right? This is, like, SQL validation and transformation, especially at scale. It was a very long winded answer, but, there you go.
[00:07:39] Tobias Macey:
No, it's definitely helpful to get that context. And you've already mentioned dbt. They have been the dominant player in this SQL-oriented transformation market for a while; they helped to actually create that market and the idea of the analytics engineer. Another notable entrant in this landscape in the past couple of years has been SQLMesh, which is taking aim at dbt and focusing on that ability to actually parse and understand the SQL AST. And I'm wondering if you can give a bit of an overview of how you view that Venn diagram of features and functionality across SDF, dbt, and SQLMesh, and any other tools that you feel like throwing in the mix?
[00:08:15] Lukas Schulte:
So, yeah, maybe the first thing to call out: I definitely want to be very complimentary of all the tools and companies in the space. Ultimately, we're all circling the same problem and trying to make the world of data engineering better for everyone, and I say that as a long-time user of dbt. What they've done especially has been really transformational in the world of data. So that's maybe step one. To that end, a lot of these tools just have slightly different approaches, right? Another tool that I think has been gaining a lot of popularity amongst Snowflake users is Coalesce. They have some tooling, but for them it's very much oriented towards visualization and making the drag-and-drop experience as good as it gets, and that works for some folks. I think the folks at SQLMesh have really taken dbt to the next level.
Some of the things that they've done around virtual environments are pretty cool. But again, it's a different authoring framework, and I think the approach there is very opinionated in terms of, hey, this is sort of the ideal way to write a data warehouse. We've been trying to be a little bit less opinionated and work with SQL dialects where they are and with dbt projects where they are. Our goal here has been less to work on the authoring surface and more on the engine: really showcase the capabilities that we have on the SQL understanding side, what we can do once we have executable semantics for some of these proprietary dialects, and actually slowly inch our way into the realm of native compute as well. So in short, different approaches, and different things will work for different companies. We're obviously really excited about the static analysis capabilities that we've developed and what those enable us to do.
[00:10:13] Tobias Macey:
Now digging into the implementation and design of SDF, I'm wondering if you can talk to some of the ways that you're thinking about the core functionality, the ways that you are tackling the complexities of transformation, and in particular being able to scale to those hundreds or thousands of models, and some of the engineering challenges that you've had to address as part of that effort?
[00:10:47] Lukas Schulte:
Yeah. So I think we've taken maybe a fundamentally different approach than anybody else. We built from the ground up in Rust and tried to build full grammar descriptions for the SQL dialects that we support. That's been quite an undertaking, but it allows us to have very precise, not just SQL analysis, but SQL validation. Maybe my favorite example here: at the moment, I think the only entity that can tell you whether you're missing a comma in a Snowflake SQL query is Snowflake itself, right? You have to send that query to Snowflake, and then Snowflake compiles that query and says, hey, yes, you're missing a comma here. Or if a function takes a VARCHAR instead of an integer, only Snowflake can tell you that. And this is so counterintuitive if you're used to software engineering, where you have a TypeScript compiler, or you have Cargo, right? You have all these local compilation tools that tell you everything you need to know at compile time, that give you compile-time guarantees on your laptop, which I think we have finally unlocked at scale. And the reason I say at scale is that Rust has allowed us to do a lot of really cool performance optimizations.
It's allowed us to go really deep, build in type binding and other static analysis toolsets, and still have the result be relatively performant. So at this point, as one of our benchmarks, we have an SDF workspace with 16,000 SQL files that are on average 500 lines long, or have on average 500 columns. Compiling that and getting all the column-level lineage takes less than 45 seconds on my laptop, and I'm pretty happy with that performance.
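As an illustration of the class of error Lukas is describing, errors that only the warehouse can currently surface, consider a hypothetical Snowflake-flavored query:

    select
        date_trunc('day', created_at)   as day,
        split_part(user_agent, 2, '/')  as browser   -- arguments swapped:
            -- split_part expects (string, delimiter, part_number)
    from raw.events;
    -- Today you only learn about the mistake once the warehouse compiles the query;
    -- a local compiler with the full grammar and function signatures can flag it
    -- before anything is sent.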
[00:12:20] Tobias Macey:
To your point of having to send the query off to the cloud warehouse and wait for it to tell you what you're doing wrong, there's a bit of a parallel happening as more and more of software architecture and infrastructure is cloud native or requires the cloud, particularly if you're using some of the core cloud provider services versus the open source alternatives. So there's definitely a similar struggle happening in that regard on the architecture and infrastructure side. But at the pure software layer, it's definitely true that we have gotten used to being able to lean on our tooling to tell us early and often what we're doing wrong.
[00:13:01] Lukas Schulte:
Yeah, absolutely. It does seem like part of the reason the cloud vendors are so excited about requiring cloud connections for more features is that it adds to their vendor lock-in, right? It adds to their ecosystem moat in a pretty big way. And I think that's the reason things like Iceberg have been so exciting in the last couple of years, because they, for the first time, give companies a little bit more of an out, where they can say, you know what? There's another option: I can use another query engine to run a query against this data.
[00:13:28] Tobias Macey:
Yes. Data gravity is a very real point of leverage.
[00:13:32] Lukas Schulte:
Yeah, it's enormous, I think. We've talked to companies that have multiple cloud vendors just so that they have negotiating leverage when their contracts come up for renewal. That's how enormous data gravity is to an enterprise.
[00:13:48] Tobias Macey:
And as you have been working through the development and early stages of onboarding and working with some of the early adopters of SDF, what are some of the ways that the scope and goals of the project and the business have changed in that time?
[00:14:05] Lukas Schulte:
They've changed in almost every way. I will say one thing that I'm happy about: at all of our all-hands, I get to show some of the mission statements that we put together over 2 years ago at this point. The nice thing is the mission hasn't changed, but the implementation has changed quite a bit. One thing that's changed pretty dramatically is that initially we were entirely focused on static analysis, and we had this pipe dream that we could maybe think about execution someday. But for us to be able to think about execution, we have to have these fully resolved logical plans that you can actually send to a query engine, and we didn't know if we were going to be able to get there. In the last 3 months or so, we've turned the corner. I think we now have pretty strong executable semantics for the SQL dialects that we support. That means we're really excited about our ability to support what we call query engine emulation, so you can actually run Snowflake queries locally on your laptop. I think that's a really cool capability, and if it scales, I think it could be quite powerful.
[00:15:06] Tobias Macey:
And to that point too, I know from looking through the documentation that you're also using Arrow DataFusion to be able to do some fully local execution, and that's definitely a very different set of functionality than something like dbt, which is wholly reliant on whatever query engine you're targeting, where if you want to do something locally, then you're probably using DuckDB; otherwise, you're relying on one of these other engines. I'm wondering if you could break out some of the features that help to differentiate SDF and some of the ways that those fit into the overarching workflow for people who are using the tool to accomplish their engineering and business goals?
[00:15:50] Lukas Schulte:
That's a great question. So, yeah, maybe to the point about dbt: dbt operates on strings. Folks like SQLMesh take it one step further and look at ASTs. Our goal is to have a fully type-safe implementation, which means that at every node in the transformation graph, we know all of the types that are incoming and expected as output from every transformation. So if you run an SDF compile on a single model, you'll see that SDF knows not just the column names, but also exactly their data types. And once you have that, you can start adding additional type information on top, so you can work with classifiers and higher-level type objects rather than just VARCHARs and integers. From a developer-driven data governance standpoint, this is super exciting, because you can finally tell your data warehouse at ingestion: hey, here's where my PII is, here's my canonical definition of a daily active user, here's the retention policy that applies to all columns that hold social security numbers. You can start automating a large part of a system that was otherwise manual, as it relates to higher-level data types, because VARCHARs are incredibly expressive, but not all VARCHARs are the same, right? And it's hard to treat them differently in the current transformation layer ecosystem.
We're really excited about the extra level of capability that type correctness gives you. Data classification tooling is definitely one of the really exciting things you can build once you have type correctness.
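As a loose illustration of the idea, not SDF's actual configuration syntax, the point is that two physically identical VARCHAR columns can carry very different higher-level meanings, and rules can attach to those meanings rather than to individual columns:

    -- Both columns are plain VARCHARs to the warehouse.
    create table raw.signups (
        email           varchar,  -- higher-level type: PII email -> mask in non-prod, 30-day retention
        referral_source varchar   -- higher-level type: plain metadata -> no special handling
    );
    -- A classifier-aware compiler can then enforce rules such as
    -- "PII never flows into publicly exposed models" at compile time,
    -- instead of relying on someone reading a docstring.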
[00:17:26] Tobias Macey:
Yeah, the classifier capability is something that I found particularly compelling, because, as you said, there is a lot of nuanced detail that is very easy to lose track of or very difficult to surface. Maybe you can write it into a docstring somewhere, but then you're relying on human operators to actually read all of that, parse it, and understand its real impact, versus being able to put business rules around that type information, around those higher-order details like PII or some of the business semantics. In particular, I'm wondering how you see the classifier functionality in comparison to what happened a couple of years ago with the idea of, what do they call it, the metrics layer, and some of the ways that that has actually been realized in terms of the technical implementations, in particular the dbt semantic layer, since that's what most folks are probably working with.
[00:18:24] Lukas Schulte:
Yeah. I think metrics and semantic layers I see more at larger-scale companies, but I think the need for something in that space is real. So I'm curious: for the data warehouse projects that you work on most, do you know approximately what percentage of columns are just VARCHARs? Not offhand, but I would presume the majority. Okay, yeah, that sounds about right. What we see is that something like 50% of all columns that we encounter are just VARCHARs, which is extraordinary for the amount of different and very nuanced information that those columns actually hold. There's an example here: a very large enterprise bought another company, and one of the data analysts started joining user IDs on user IDs. It seems like the most mundane and normal join operation in the world if you're working a lot with users and user IDs. But it turns out that one set of user IDs was from the original company, and the other set was from the company they had just bought. So they were actually two completely disjoint sets of user IDs that happened to overlap, because they're all just integers or something like that. The company didn't catch this for months on end, and it was very expensive to rectify, because they had to do a whole bunch of backfilling. My point here is that just calling a column user ID doesn't make it the same user ID, right? There's a need, especially when you're working at scale, to work with higher-level types. Our goal with our type system is to allow you to say, hey, this is company 1's user ID, this is company 2's user ID, and to create a little rule, we call it business logic as code, that says you can never join those two, and if you do join them, flag a warning. That capability is, I think, super critical. I know it's important for everything from user IDs to privacy and data governance concerns, but also metrics. We've seen these classifiers used in a few different ways that were very not obvious to me in the beginning, and we're really excited to see what people actually use them for. We'll see if it turns out to be a replacement for or an addition to a traditional metrics layer.
But it's been exciting to see what people use these things for so far.
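The user ID story above is easy to reconstruct; a minimal sketch with hypothetical tables:

    -- Two tables that both carry a user_id column, but from different ID spaces
    -- (the acquiring company and the acquired company).
    select
        a.user_id,
        a.lifetime_value,
        b.last_login
    from company_one.users    as a
    join company_two.activity as b
      on a.user_id = b.user_id;  -- runs cleanly, returns rows, and is silently wrong:
                                 -- the IDs overlap numerically but identify different people.
    -- With higher-level types attached to each column (say, CompanyOneUserId vs CompanyTwoUserId),
    -- a "never join these two" rule can fail or warn at compile time.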
[00:21:01] Tobias Macey:
Digging a bit further into the developer experience and the workflow, particularly as you start to scale the usage, and in particular for some of these enterprise-class use cases that you've touched on, where you're spanning multiple teams, possibly even completely different business units or organizations: what are some of the ways that SDF helps to support those very fractal use cases, where you may have various handoffs that aren't always very clearly defined? And also, given that SDF as a product is open core and has paid features, what are some of the ways that those team-oriented capabilities differ between the open and the commercial offerings?
[00:21:45] Lukas Schulte:
From a philosophical standpoint, our goal is that, essentially, what you run locally on your machine is free to use, and what happens in the cloud is typically a service that we have to manage and therefore is paid. We also haven't really tried to change the authoring surface, so there's no GUI; it's a command line engine. But that engine, because it's written in Rust, is just a binary that you download. So the idea was, if you are working in a remote VS Code workspace or a GitHub Codespace, you can install SDF into that environment and it just works. There are no other dependencies, no Python or virtual environment that you have to manage. You just install the binary and go. The other goal, from a static analysis viewpoint, was that if some analyst decides to write a new model, they can take an SDF workspace and add a model to it, and if SDF compiles that model correctly and there are no errors spit out by the compiler, that model should be good to go, and you should be able to integrate it into your production pipelines without breaking anything. That's the kind of guarantee that SDF tries to make to workspace owners. Additionally, from a using-the-tool standpoint, a lot of the functionality actually mirrors dbt, right? There's dbt compile; sdf compile does most of the same stuff that dbt compile does, but also gives you some of the static analysis capabilities, does type checking, builds out column-level lineage, and so on. There's dbt run and sdf run: again, similar functionality. dbt test and sdf test: again, similar functionality.
So the goal here really is to, again, meet people where they are in their authoring workflow.
[00:23:25] Tobias Macey:
Another interesting parallel that I'm curious to hear your answer to, as far as the development workflow and the interfaces, is that with dbt, maybe a year or 2 ago, they started offering the capability of Python models, which aren't available across all execution engines, but which give you the ability to write arbitrary Python as long as it generates a table structure as the output. I know SQLMesh has a similar capability of doing Python models as long as they return a data frame. I'm curious how SDF thinks about computational capabilities that go beyond SQL, and some of the ways that teams can address more complicated or complex computational requirements when there isn't a built-in function in their target engine to address that capability?
[00:24:21] Lukas Schulte:
Yeah, I'm actually really glad that you asked this question, because we've been thinking about it a lot. There's an agony and an ecstasy to allowing arbitrary Python execution in your code. The great parts are obviously that you can create table functions and data frames, and that is incredibly powerful if you want to build reusable code. To date, SDF does not support data frame operators or Python. The reason is fairly simple: we want to make sure that we can provide all of these guarantees, and until we find a way to provide those same guarantees in Python, we're not going to provide Python models. This is maybe the most opinionated stance that we've taken to date, but the good news is we have a plan, and we're very excited about offering a Python interface in the future. It won't happen in the near term; it's not something that's actively being worked on, it's on our backlog.
We still have connectors and other things that we want to get through first, but there will be some really exciting stuff coming there, hopefully in less than a year.
[00:25:33] Tobias Macey:
The other interesting aspect of your positioning, and the particular moment that we are at in the ecosystem, is that, as we've already mentioned several times, dbt is, not the first mover exactly, but in terms of recent history one of the first movers in the SQL-as-software-engineering practice, being able to build a set of transformations with dependency chaining as part of that. I'm curious how you're thinking about the adoption and migration path for teams who have already invested in dbt, have substantial code bases, and are very interested in the capabilities that SDF offers, but face that barrier of, well, I've already got all this sunk cost in dbt, so I'm just going to keep going in that direction because I don't want to have to figure out a whole other tool. I'm wondering if you can talk to what your answer is for those people?
[00:26:27] Lukas Schulte:
Yeah, that is most people, I think. At this point, a wild percentage of Snowflake, BigQuery, etcetera compute comes from dbt models, and something like 85 or 90% of the data teams that we see use dbt. And there's one piece which is moving code, and there is another piece, which I actually think is even more challenging, which is re-educating or getting the whole team on the same page about whatever the new authoring system should be. So our approach here is to try and meet folks where they are. I already mentioned that we have parity between dbt compile and sdf compile, dbt build and sdf build, dbt test and sdf test, and so on, plus some other commands that showcase some of the unique features that SDF has. But the goal here is that the user experience should not change dramatically.
The second part is all the dbt configuration, ref statements, et cetera, some of which SDF does not need. But we are moving to a world, and probably by the time this podcast goes live we'll already be in that world, where SDF will natively run and interpret dbt models, dbt configuration, and dbt profiles, ingest them, and just use SDF as the engine. So the way to think about it is that there's dbt as the authoring layer, and then there's the dbt engine that actually takes that configuration, executes it, and does all the Jinja expansion. The capability that we're launching in the next couple of weeks lets you keep that same dbt configuration and just swap in the SDF engine, and get all the static analysis capabilities, speed improvements, and all the wonderful things you get from Rust directly on top of your dbt project. And once you have that, you can start to delete your ref statements because you no longer need them, and we're happy about that, obviously.
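In concrete terms, the promise is that an existing dbt-style model keeps working unchanged; the model and table names here are illustrative:

    -- models/daily_orders.sql, exactly as it lives in a dbt project today
    select
        order_date,
        count(*) as orders
    from {{ ref('stg_orders') }}
    group by order_date;

    -- Instead of dbt expanding the Jinja and handing a string to the warehouse,
    -- the engine described above resolves ref(), type-checks the model, and builds
    -- column-level lineage before anything is executed.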
[00:28:29] Tobias Macey:
And the other piece of dbt investment is the set of packages that they have been working to build up. There is, I think, an unequal distribution of people who are using them versus not, and I'm wondering how you're addressing that aspect, or is it just a matter of: it's all dbt, so it doesn't really matter, it just works?
[00:28:50] Lukas Schulte:
The goal is: it's all dbt, it just works. We will see how far we get. There are a lot of dbt packages at this point, and Python's Jinja is leaky, so you can actually call Python subroutines directly from Jinja and have those be executed. Managing things like that is incredibly difficult, especially if you want a closed, compiled, well-defined system like SDF really tries to be, and dbt sometimes tries to break out of that cage a little bit. We'll see. A lot of the core things like dbt-expectations and so on we already support. I'm sure there are libraries where we'll have to figure out if there's additional Jinja complexity that we need to take into consideration. But the goal is you just download SDF and it works.
[00:29:33] Tobias Macey:
And then the other piece of complexity around trying to break into an established market is that mindshare is one piece of it, and then there's also all of the technical investment. But beyond that, there's just the amount of communal knowledge that gets built up around these tools in terms of blog posts, presentations, and intra-team communication. I'm curious how you think about that aspect of the problem as well: being able to work your way into that mindshare and reduce those points of friction, so that there isn't as much requirement for that communal knowledge for people to be able to figure out how to work around edge cases as they try to move down that adoption path.
[00:30:09] Lukas Schulte:
I will say, I feel like a lot of the help requests that I see around dbt have to do with Python environments, packages, and versioning issues, and in SDF you don't have any of those problems. So hopefully we can at least take a large subset of the help that's required to make dbt run at scale and make it unnecessary. But, yeah, the reality is, of course, that one of dbt's greatest assets, and what they've done an incredible job of over the last almost decade at this point, is building out a really stellar community of engaged and helpful people with, at this point, a very large knowledge base. And it would be great if you could translate some of that knowledge, whether it's linting configuration or what the right model structure is for an efficient warehouse.
Ideally, SDF is just an addition to that community and an additional tool for people to use, rather than something that tries to reinvent the wheel. The last thing I want to do is boil the ocean and try to build an entirely orthogonal but similar authoring experience from scratch. The dbt folks got a lot of things right, and there are tens, if not hundreds of thousands, of developers using dbt every day, and hundreds of thousands of dbt projects. What we want to do is elevate the capabilities that are there. That's really the goal.
[00:31:42] Tobias Macey:
In terms of the community investment around SDF, I'm wondering what the interfaces and extension points are for people to augment or extend the core functionality of SDF, and then, on the outer shell, the additional tooling or plugins that people might be interested in building to extend their experience of working with SDF?
[00:32:02] Lukas Schulte:
On the open source side of things, we invest heavily in, and love investment in, DataFusion. The core of our execution engine and our executable semantics comes from DataFusion. So anybody who wants to spend the time and help build a really great Rust-based query engine, go check out DataFusion and see if you can pull a ticket or two. Separately, on our side, we have found some really cool optimizations. Tests are one of them. I think dbt tests are not super efficient the way the testing library is written; those macros mean that you're doing a lot of scans for every individual test. We've written a testing library that I think is a little bit more efficient and batches things more elegantly, and we are constantly trying to put out more of these additional packages and get folks to contribute to those as well. So from an open source standpoint, the package ecosystem is where I think we're most excited for community investment at this point. As you have been building SDF, growing the community, growing the business, what are some of the most interesting or innovative or unexpected ways that you've seen the tool used? There have been a few really fun ones. One: we talked a little bit about metrics and classifications a while ago. We initially developed the data classification system more for privacy and governance use cases. And then we started getting questions like, hey, we have 5 different definitions of what a daily active user is; people just keep copying and pasting the SQL query and putting it in new places, and we want one canonical definition of what a daily active user is. Can we use SDF classifiers to do this? And we said, maybe this will work. And it turns out it works super well. Then we had a similar question around data retention, using classifiers to map out which tables needed which types of retention policies. That was really exciting to see, because that was a true use case where people were spending a lot of time trying to map and manage retention policies.
And now there's a simple SDF report that they run in their CI pipeline or in their orchestrator once a day, and they get a report of all the partitions that need to be deleted. That's fantastic. That was completely unexpected, so that's one example of using classifiers in a completely unexpected way.
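On the testing point from earlier in this answer, per-test macros versus batching, a rough sketch of the difference (table and column names are made up):

    -- dbt-style: each test compiles to its own query, so the same table is scanned once per test.
    select count(*) from analytics.users where user_id is null;                -- not_null test
    select user_id from analytics.users group by user_id having count(*) > 1;  -- unique test

    -- Batched: one scan of the table can answer several assertions at once.
    select
        sum(case when user_id is null then 1 else 0 end) as null_user_ids,
        count(user_id) - count(distinct user_id)         as duplicate_user_ids
    from analytics.users;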
[00:34:32] Tobias Macey:
And as people start to investigate SDF and want to start incorporating it into their stack, what are the cases where SDF is just the wrong choice and you would advocate against using it?
[00:34:40] Lukas Schulte:
Yeah. If you're writing Scala or working with RDDs, this is probably not the tool for you. The Spark ecosystem is really rich and has a lot of capabilities, and there are a lot of notebooks, especially from Databricks. SDF does not work well in that universe at the moment, for some of the reasons I've outlined earlier. So that is probably the main area at this point where SDF is not the right tool.
[00:35:05] Tobias Macey:
And in your experience of building SDF, growing the business, growing the project, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:35:16] Lukas Schulte:
Yeah. Building a framework that people like, or building a cool developer tool, is not the same as building a business. If you're an excited engineer and you think, yes, I'm going to build a really cool tool, it's very easy to focus on the tool and the functionality and the capabilities and where it's differentiated and novel. But growing a business has different requirements: understanding what should be free to use and what should be paid, which features make sense as paid features for a company and which don't, whether to price on seats or on models. Those are really difficult questions, and the only way to tackle them is to talk to a lot of companies and a lot of engineers. That's been a fascinating lesson over the last 2 years, and especially over the last 6 months.
[00:36:09] Tobias Macey:
And as you continue to build and invest in SDF, what are some of the things you have planned for the near to medium term or any particular projects you're excited to explore?
[00:36:19] Lukas Schulte:
Yeah. On the dbt side, seeing more use cases of SDF as a dbt accelerator is super exciting. On the compute side, we have some awesome demos, and hopefully an alpha very soon that we'll be able to show, where you can actually start using SDF as a query engine for model transformations, either directly from your orchestrator or as a separate service. And then lastly, on the cloud product side, just today we launched a really cool impact analysis feature that looks at diffs between 2 warehouses, maybe the state that's in main and the state that's in a pull request, and shows you the impact of those changes: what columns changed, whether anything breaks downstream, et cetera. I'm really excited to invest a little bit more on the cloud side in building out differentiated tool sets for enterprises that uniquely need services you can't just run on a laptop by yourself.
[00:37:21] Tobias Macey:
Are there any other aspects of the SDF product, the SDF tooling, the ecosystem around it, or the ecosystem that you are working within that we didn't discuss yet that you'd like to cover before we close out the show? There are always too many things to talk through. No, I really appreciate you having me on. This was absolutely fantastic. I feel like I learned a lot. The questions were great. Appreciate your time. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:37:57] Lukas Schulte:
There's a consensus among software engineers, something that we've talked about a lot today, which is that the tooling in data engineering is not up to par with the tooling in software engineering. And if you look at the reasons for this, selfishly, as someone who's building a compiler for SQL, I will say: there's no really good SQL compiler that exists, right? Compilers are the framework for everything from IntelliSense to CI/CD checks, and there's no really good compiler for SQL. So that's the problem in data development. But maybe digging one step deeper: why is there no good compiler for SQL? If you take a look at the history of SQL, no module system was ever introduced in that language. The reason that you have data frames is that data frames are essentially the only way to do table-level functions. Otherwise, you have some user-defined functions, maybe, but there are no import statements, no libraries, no package repository for anything in the SQL world, which means that every single company, all over the world, every single time, is rebuilding the same wheel from scratch. I think that is maybe the most exciting problem and question in the data space: how do you actually build a data development framework that provides some level of modularity, some ability for libraries to exist, some ability for code reuse in any sort of way? That I think is the most interesting gap in data development today.
[00:39:20] Tobias Macey:
And there's also the fact that, despite being referenced as a single language, it is actually a fractal of dialects. Even if you can say this is SQL and that is SQL, they're actually 2 different SQLs, and if you try to run them both on the same engine, you're going to have a bad day.
[00:39:29] Lukas Schulte:
You're going to have a really bad day, yeah. But if you think about why this is the case: why did Snowflake evolve its own grammar, and why is that grammar still evolving? DuckDB, at this point, is also creating its own grammar. The reason for this is there's no extensibility, so the only thing that's left is to change the language. If you can't add a library for, say, how you want to work with some ML model, then of course, if you're Snowflake and you want people to start running ML models as well, you're going to build a little dialect addendum that lets you run GPT directly from a Snowflake SQL query. It's complete madness, but it's the only recourse available to these vendors: expanding and extending the language. So, yeah, I think if you could figure out a way to create modularity in that universe, that would be incredibly powerful.
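For a concrete flavor of the dialect extension Lukas mentions, Snowflake exposes LLM inference as SQL functions; the exact function name and model label here are from memory and should be treated as approximate:

    -- Calling a hosted model directly from a Snowflake SQL query
    select
        review_id,
        snowflake.cortex.complete(
            'llama3-8b',
            'Summarize this product review: ' || review_text
        ) as summary
    from raw.product_reviews;
    -- There is no import or library mechanism for adding this kind of capability,
    -- so the vendor extends the dialect itself.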
[00:40:29] Tobias Macey:
Yeah, that's where you also start getting all of these superset languages that compile down to SQL, similar to how we've had all of these different languages that compile to JavaScript.
[00:40:39] Lukas Schulte:
Yes, exactly. I have sometimes tried to describe a little bit of what we're doing as sort of the TypeScript-to-JavaScript transition, but for SQL, where we add some type information and some static analysis capabilities, but ultimately the thing that we send to Snowflake is just the same Snowflake SQL as everybody else. But, yeah, it's a really interesting challenge. And I think the reason that you've seen this absolute explosion in data tooling also has to do with this.
[00:41:08] Tobias Macey:
There's also really cool stuff happening in places like Ibis, where people are really trying to figure out how to translate or map well from one dialect to another. But mapping is always fuzzy, and it doesn't really work at scale. It's a challenging problem. Alright. Well, thank you very much for taking the time today to join me. I appreciate all of the time and effort that you and the rest of the SDF team are putting into this problem space. It's definitely a very important and constantly evolving target, so I definitely look forward to seeing the continued growth of SDF and to starting to experiment with it for my own work. So thank you again, and I hope you enjoy the rest of your day. Thanks a lot. You as well. Appreciate you having me on the show. Thank you for listening, and don't forget to check out our other shows.
Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to SDF with Lukas Schulte
The Genesis of SDF
Core Problem SDF Solves
Implementation and Design of SDF
Evolution of SDF and Business Goals
Type Correctness and Data Classification
Computational Capabilities Beyond SQL
Adoption and Migration from DBT
Community and Ecosystem of SDF
Biggest Gap in Data Management Tooling