Summary
Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers, it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SQLMesh is and the story behind it?
- DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?
- What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?
- How do those rough edges impact the productivity and effectiveness of teams using those tools?
- Can you describe how SQLMesh is implemented?
- How have the design and goals evolved since you first started working on it?
- What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
- For teams who have already invested in dbt, what is the migration path from or integration with dbt?
- You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
- What do you see as the potential benefits of integration with e.g. data-diff?
- What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workflows and the associated dependency chains?
- What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
- When is SQLMesh the wrong choice?
- What do you have planned for the future of SQLMesh?
Contact Info
- tobymao on GitHub
- @captaintobs on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- SQLMesh
- Tobiko Data
- SAS
- Airbnb Minerva
- SQLGlot
- Cron
- AST == Abstract Syntax Tree
- Pandas
- Terraform
- dbt
- SQLFluff
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today, I'm interviewing Toby Mao about SQL Mesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in. So, Toby, can you start by introducing yourself? Hey. How's it going? I'm Toby. I'm the CTO of Tobiko Data, and, yeah, I'm working on SQL mesh. Thanks for having me here today. Of course. And do you remember how you first got started working in data?
[00:01:07] Unknown:
So my first job out of college was at a pharmaceutical consulting company, and I was working as an analyst where I would process data using SAS, and create decks and stuff with Excel, VBA, etcetera. So kind of got into data engineering through the analyst path.
[00:01:27] Unknown:
And in terms of SQL mesh, can you give a bit of an overview about what it is and some of the story behind how it came to be and also why you decided to release it as open source?
[00:01:38] Unknown:
Sure. So at its core, SQL mesh is a transformation framework that's looking to enable DataOps. So it allows you to write transformations in either Python or SQL, and it can orchestrate that and do a lot of stuff for you so that it's done scalably and correctly. I guess how it started: I led analytics at Airbnb, where I worked on some projects like Minerva, their metrics platform, and experimentation. And at first, I kind of wanted to do a semantic layer metrics startup. But as I did some research there, I realized that a standalone kind of semantic layer doesn't really make too much sense because the hardest part of getting good metrics is having good data. Right? And that means the transformation aspect.
And so at Airbnb, I actually saw a demo of DBT, and, immediately, I was like, this would never work at any of the companies I've worked at because there's kind of a built-in assumption of being able to fully refresh everything, a lot of inefficiencies, etcetera. And so that's kind of where I got the idea. And why open source? Well, I really do believe in open source. I've been a big fan of open source for many years. I have SQLGlot, which is an open source SQL parser. I even have an open source board gaming website called 18xx.games. So I love open source.
[00:03:09] Unknown:
Another kind of keyword that was put into the opening description is this idea of DataOps, and that is one of those phrases that has been co-opted and overloaded by everyone in industry for their own particular purposes. And I'm wondering if, for the purpose of this conversation, you can nail down the concepts that you're trying to convey through the use of that term.
[00:03:31] Unknown:
DataOps is just a set of practices that's looking to make data practitioners more productive and have their output be more reliable and more accurate. Right? And so SQL mesh is looking to automate that all the way. We're trying to make sure that you can just focus on defining your business logic, and we'll handle the rest. We'll make sure it's correct. It scales. It's reliable. And so that's really all we're trying to do.
[00:04:03] Unknown:
In terms of the existing tool chains, not to pick on DBT too much, but it is kind of the elephant in the room, I'm curious what are some of the challenges that are posed that prevent this ideal of DataOps given the existing state of the ecosystem?
[00:04:23] Unknown:
So I'll kinda make an analogy here. Right? For example, cron is a very simple tool. Right? In the past, a lot of people used to use cron. And what cron can do is schedule some particular task and do it for you at some cadence. Right? And that works great if you have a very simple project or simple requirements. Over time, as you're using cron and your use case gets more complex, you add in more custom things, more custom scripts. Maybe you need to orchestrate some stuff. Maybe you need if/else magic. Right? And so over time, different tools started to come out in the space, for example, like Airflow. Right?
And Airflow really specializes in these tasks of orchestrating complex things, etcetera. Right? And similar to cron, DBT is very simple. Right? And it's great to use if you have a simple use case. But over time, a lot of people who get deeper into DBT, deeper into the uses of DBT, start to build a lot of custom scripts in order to handle their use cases. And so I would say that SQL Mesh is trying to be to DBT what Airflow is to cron. Right? It's trying to have a fundamental understanding of the transformation space and build from that to make a seamless experience, as opposed to just being kind of a simple and basic orchestrator.
Does that make sense?
[00:06:02] Unknown:
Yeah. It definitely does. And I think in particular, some of the rough edges that have come up with the current state of data engineering, with the understanding of that workflow being things like the, quote, unquote, modern data platform, where you're using a Fivetran or an Airbyte for data integration. You're using probably DBT or some custom chain of ETL scripts to manage that transformation flow, and then maybe some sort of BI dashboard or reverse ETL framework for making some use of that data. I think a lot of the challenge is the disjointed flows, where there are tools that do help to stitch back together those disparate components into some semblance of a cohesive experience, but that then adds another layer of complexity of making sure that that tool can understand the different semantics of the underlying processes.
And also the fact that a lot of these utilities are working at a very coarse granularity. And in order to really scale and get to a sophisticated point with your data workflows, you need to have a more nuanced view and more detailed access to some of the underlying mechanics to be able to build out some of the customizations that are necessary to ensure that all of your data is correct and properly represented.
[00:07:30] Unknown:
Exactly. And so these nuances matter. They really affect the kind of UX of development. Right? The experience of developing with data. And so I guess two more rough edges that I kinda wanna talk about are, one, existing tool chains don't understand SQL at all. Even though you might be writing SQL, your tools just treat it as raw strings, and everything is stitched together through templating. Right? SQL mesh is kind of built on top of SQLGlot, and so it has a first class understanding of SQL. And so, simple things like: why do you need to define ref? Can't your tool just tell you what your SQL is doing? Why do you have to manually define things? Right?
Also, if you're using multiple engines, why do you have to manually specify macros to transpile between those two? Like, you're writing SQL. You should just be able to kinda switch between engines. Right? That's kind of what SQL mesh can do for you. Another part of it is the whole development environment experience. I would say that with existing tools, managing development environments is very manual, and it's very cost ineffective. Right? So let's say you wanted to make a particular change in your existing workflow, and you want to make a development environment to test your change or look at your change in a representative environment. The default flow would just be to kind of copy your entire warehouse, your entire pipeline, into this dev environment. Right?
But this workflow is very inefficient, right, because you only need maybe a subset of what you've done. And so either it's inefficient, or you then have to manually say, hey, look, I know this is what I'm gonna be changing, and manually specify that. Right? One of SQL mesh's core concepts is the virtual environment layer. So because we understand SQL, we know exactly what has changed in your particular branch. And so from that, we can take the most efficient path automatically towards creating a development environment that is fully representative and has your changes. And so we've taken away a lot of the manual work necessary to kinda do that data work, and we make that easy and correct and automatic.
[00:09:55] Unknown:
In terms of the mapping between different SQL dialects, at the very high level of just the core semantics of SQL, I can definitely see where that's an achievable goal. But where it gets complicated is in dealing with some of the different database-specific functions that you might call. So, using JSON parsing in Trino versus whatever other JSON functions are in BigQuery versus Snowflake, etcetera, and some of the semantics of how you access some of the nested attributes of that JSON being one likely very widely used example of those custom functions. And I'm wondering how that has manifested in your work on SQL mesh as far as being able to map between those more engine-specific concepts.
[00:10:38] Unknown:
Yeah. So all of this is powered through SQLGlot, and we do handle JSON extract, etcetera, and all those differences between different dialects. And so the way we do that is, when we parse the SQL, we first turn it into an intermediate representation, or an AST. Right? So the AST is kind of a generic representation of SQL. And then from that, we can then, for each dialect, specify, okay, they want to extract out this field from this JSON thing. And so in Snowflake, it's this way. In BigQuery, it's this way. In Spark, it's this way. And so the kind of architecture looks like many dialects coming in, and they go to one representation, and then many dialects coming out. Now, obviously, it's not perfect. SQL is huge. There's a million functions.
But we have a lot of them covered. And the important thing is that our architecture is there to support it. So if, for example, we don't have a particular mapping in place, we can easily add that, and now everybody can benefit. As opposed to in the current workflow, you'd probably just write your own custom Jinja macro for your specific use case, and everybody in the world who has this particular case can't benefit from open source. They're gonna have to do it themselves in a custom script. Right? And so yeah.
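To make the many-dialects-in, one-AST, many-dialects-out flow concrete, here is a minimal sketch using the open source SQLGlot library discussed above. The query, table, and column names are invented, and the exact SQL that comes out depends on SQLGlot's mappings for each dialect.

```python
import sqlglot

# Snowflake-style JSON field access: parse once into a dialect-agnostic AST...
ast = sqlglot.parse_one("SELECT payload:user_id AS user_id FROM events", read="snowflake")

# ...then generate SQL for other engines from that single representation.
print(ast.sql(dialect="bigquery"))
print(ast.sql(dialect="spark"))
print(ast.sql(dialect="duckdb"))
```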
[00:12:01] Unknown:
And another aspect of what you mentioned in the introduction of SQL mesh is the idea of being able to use SQL or Python for these transformations. And Python capabilities are being added to tools like DBT. They're being added to Snowflake through these things like Snowpark. I'm wondering if you can just talk through what that looks like in terms of the types of transformations and how those transformations are written and integrated into the rest of the SQL environment within a given SQL mesh code base?
[00:12:32] Unknown:
So in SQL mesh, you can do the standard SQL, but you can also write Python. And so kind of the only thing you need to do when writing a Python model is return either a Pandas data frame or a PySpark data frame. And so it's incredibly flexible. You can do anything you want in there. You can even make API calls. You can hit the database, etcetera. And we found many of our clients really enjoy the ease and power of our Python models. And, I guess, one big architectural difference between SQL Mesh's Python models and DBT's Python models is that SQL Mesh Python models can run anywhere. Right? They kind of run where SQL mesh is running. We don't have to ship off the Python code to, for example, Snowflake. And I think that DBT's Python execution mode causes some problems because whatever you're sending has to be able to run on Snowflake.
And so the benefit of being able to just run your Python models wherever SQL mesh is running is that it's much easier to debug. You can see the Python code running. You have all the dependencies installed, etcetera.
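As a rough sketch of what a Python model looks like, based on SQLMesh's documented Python-model interface; exact decorator arguments may differ by version, and the model, table, and column names here are invented:

```python
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model


@model(
    "my_schema.enriched_users",  # hypothetical model name
    columns={"user_id": "int", "signup_date": "date", "plan": "text"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Pull upstream data through the context, then do arbitrary Python work:
    # call APIs, apply an ML model, hit another database, etc.
    df = context.fetchdf("SELECT user_id, signup_date FROM my_schema.raw_users")
    df["plan"] = "free"
    return df
```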
[00:13:46] Unknown:
And digging more into SQL mesh itself, can you talk through some more of the implementation and some of the user experience around how to actually build a project
[00:14:00] Unknown:
with it. Sure. I mean, it's quite similar to DBT in the sense that you can just define your SQL, and then you can just run plan. So SQL mesh's plan is kind of like DBT's run, except that the plan gives you an overview of what might happen first before you actually apply it. And so that kind of takes inspiration from Terraform, if you're familiar with that. Right? Because we understand SQL, we can understand a lot of what's gonna happen before we actually run it. And so when you define a SQL mesh model, it's very familiar. You're simply writing SQL. There's not a bunch of YAML that you have to define. You don't have to do ref tags. You just write your SQL in files.
And then from there, we can automatically detect all the dependencies. We can do some compile-time validations, like if you have any syntax errors, and then we can tell you exactly what has changed and what hasn't changed. So the development experience is similar to DBT if you've seen it, but it's a little bit more powerful in that we can do things at compile time that are very powerful, and also you don't have to do a lot of extra manual work with YAML files and refs, etcetera.
[00:15:21] Unknown:
And in terms of the development of SQL mesh itself, can you talk to some of the ways that the design and goals have evolved since you first started working on it and maybe give some sense of what the current level of maturity is for people who are starting to think about adopting SQL mesh for their own work. So the most important goal, which has never changed for us, is that SQL mesh should always be correct. Right? Correctness is nonnegotiable.
[00:15:49] Unknown:
So we wanna make sure that whatever results SQL mesh is pumping out are what you expect and consistent. And so over time, as we've been developing SQL mesh, what we found challenging as we delve into different use cases is that it's quite difficult to do this. Right? It's quite difficult to make sure everything is always correct. So we're starting from a foundation of being able to understand SQL. Right? There's so much we can do with that to help aid in correctness and efficiency. For example, because we can understand SQL, we understand also how changes to your SQL can or cannot affect downstream dependencies.
For example, say you have a parent model that has many downstream children and you add a column to it. Right? What does that mean? If you're just naive, you think, okay, I have to backfill everything, right, the parent and all the downstream, because I've changed the model. But because we understand SQL, for example, and we have full column-level lineage of who uses that table, when you add a column that's brand new, no one else is using it. Right? And you haven't affected the cardinality of the parent. And so we know that that is a non-breaking change. We can automatically classify that. And so in order to make your environment into that correct state representing a data model with this one new column, we don't have to backfill the whole world. We only have to backfill that parent. Right?
And so you can see now the power, yet complexity, of being able to understand SQL and how that translates to the system. And so we've been doing our best to kind of take advantage of all of this power without exposing any of it to the user.
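To illustrate the underlying idea, here is a toy sketch using SQLGlot directly; this is not SQLMesh's actual change-classification logic, and the queries and column names are invented. Because the downstream SQL is parsed, a tool can check whether a newly added upstream column is referenced anywhere before deciding whether a change is breaking.

```python
import sqlglot
from sqlglot import exp

# A downstream child model's query.
child_sql = "SELECT user_id, SUM(amount) AS revenue FROM orders GROUP BY user_id"

# Columns the child actually references.
referenced = {col.name for col in sqlglot.parse_one(child_sql).find_all(exp.Column)}

# Hypothetical column just added to the parent `orders` model.
new_column = "discount_code"

print("breaking for this child" if new_column in referenced else "non-breaking for this child")
```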
[00:17:45] Unknown:
And in terms of what you're building at SQL Mesh, the obvious incumbent is DBT. It has helped to launch the overall concept of analytics engineering. I'm just wondering, what are some of the lessons that you've learned from DBT in terms of how you've approached the design and functionality of SQL mesh, and the ways that you're thinking about how people can manage the adoption path, particularly if they have already invested heavily into DBT?
[00:18:18] Unknown:
So, yeah, DBT is a fantastic tool. Right? It's really changed the space. It's created this whole branch of analytics engineering. And so the biggest inspiration for us from DBT in terms of our design is definitely the simplicity of the tool. Right? DBT is extremely simple. You just do dbt run, right, and everything kind of just works. And so with SQL mesh, we really wanted to make sure that the experience was just as simple. And so we have basically one command as well, which is SQL mesh plan. And so the plan, not only does it run your DAG, etcetera, but before it does anything, it tells you exactly what's gonna change, what is the potential impact, etcetera. And so we wanted to make sure that SQL mesh was just as easy and simple to use as DBT.
Now if you already have a DBT project, that's great. Right? You're thinking about transformation. You're trying to make things more scalable and more reusable. Right? And so SQL mesh is DBT syntax compatible. So if you have an existing DBT project, you can just have SQL Mesh read in your existing project, and you can take advantage of many of the foundational and architectural improvements that we have. Now there are some things that, you know, DBT makes assumptions for, and so you might have to do some tweaks here and there. But we've been able to read in large, you know, kind of thousand-model DBT projects with our existing clients.
It's definitely a work in progress, and, we're hoping to make that better. But our goal is definitely to be dbt syntax compatible.
[00:20:04] Unknown:
Given that goal of being syntax compatible, what are some of the constraints or limitations that that imposes as far as the overall, scope of possibilities for what you have built for the foundation?
[00:20:18] Unknown:
Right. So SQL mesh has its own syntax, and there's no limitations if you use SQL mesh syntax. But if you use DBT syntax, for example, if you have a DBT incremental model, right, you probably have in your incremental model some if/else block, which is like, if is incremental, do this; else, do that. Right? SQL mesh's incremental model is very different. We actually keep track of what intervals have been computed. Right? And so there is no if/else logic in SQL mesh models. You can just write select all from my table where my date is between the start and the end of the run. Right? Because SQL mesh tracks the runs, and it knows, okay, well, you've processed this date, and so now we need to process this date.
In your existing Jinja DBT code, you're trying to calculate the previous run dynamically with your Jinja block, and you're querying yourself to find the latest date. Right? And so if you have something like that, you're going to need to add, for example, another else-if block and check if it's SQL mesh. And if it's SQL mesh, then you replace your entire subquery block with just a date is between start and end. So because DBT doesn't have a concept of kind of intervals and partitions, etcetera, if you're using an existing DBT project and you wanna take full advantage of SQL Mesh's incremental models, you'll have to add another block in your code.
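As a generic illustration of the difference being described, here is plain Python string-building, not dbt's or SQL Mesh's actual APIs, with invented table and column names: in the first style the model discovers its own high-water mark, while in the second the framework passes the missing interval in.

```python
from datetime import date


def self_discovering_incremental(table: str) -> str:
    # dbt-style: the model queries its own target table to find what was already
    # loaded, typically wrapped in an {% if is_incremental() %} Jinja block.
    return (
        f"SELECT * FROM {table} "
        f"WHERE event_date > (SELECT MAX(event_date) FROM my_target_table)"
    )


def interval_based_incremental(table: str, start: date, end: date) -> str:
    # SQL Mesh-style: the framework tracks which date intervals have been computed
    # and hands the model the window it still needs; the model is just a filter.
    return (
        f"SELECT * FROM {table} "
        f"WHERE event_date BETWEEN '{start}' AND '{end}'"
    )
```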
[00:21:57] Unknown:
Does that make sense? Yes. In terms of somebody who's using SQL mesh, I'm wondering if you can just talk through what that overall process looks like, from either "I have nothing, I just wanna get started and start building it out" through to "I'm using it in a production context, managing it through a data orchestration engine." What does that overall, kind of, day 0 through day n workflow look like?
[00:22:22] Unknown:
Sure. So for day 0, a user first starting out with nothing, they'd probably go to our documentation. They'd go to the quick start, and they'd bootstrap a project using sqlmesh init. That's gonna create kind of a skeleton with two models, and they're gonna get started. So with that, they would set up a particular config. So they point their SQL mesh to some engine like Snowflake or BigQuery or whatever, just putting in their credentials, etcetera, or whatever they need to connect. Then they would start adding models. So you just start adding different models for the transformations you're looking for. And then when you're ready, you can just run SQL mesh plan, and it's going to basically create your environment and create all your models. Right? And so you can use dev. You can use prod. You can have all these environments, etcetera.
Now once you start getting more mature and you wanna start running these, for example, daily, you could hook that up to a cron. You just have your cron job run SQL mesh run every day, and it's gonna compute only what is needed every day, right, because we keep track of state, etcetera. Now for a more mature company, something that's running thousands of models, etcetera, you could use something like Airflow. So SQL mesh has first class integration with Airflow. That means that SQL Mesh can convert your models into individual DAGs within Airflow. So SQL mesh doesn't run itself as just one black box within Airflow. Right?
That's kind of what DBT does. SQL mesh actually creates first class objects within Airflow for every single kind of node that you have. And so the benefit of something like that is that you can actually integrate more closely with Airflow. So you can have external signals that trigger certain nodes, or you can have certain nodes within SQL mesh in Airflow triggering other things. Right? And so those are two ends of the spectrum. Right? You can have the beginner with just a basic SQL mesh project using DuckDB locally with no scheduling, or you could have an enterprise kind of Airflow deployment.
[00:24:36] Unknown:
Another aspect of the DataOps workflow is the idea of being able to do testing and validation, particularly where you don't want to have to use production as your validation environment. And I'm wondering if you could talk to some of the capabilities that SQL mesh exposes for that CICD and staging workflow?
[00:24:57] Unknown:
Right. So correctness is kind of our core principle, and so we wanted to really make sure that users of SQL mesh had the ability to fully test and validate their data. And so right now, as of today, SQL mesh has two modes of testing. We have unit testing, real data unit tests. Right? This is something that doesn't really exist today in the industry, but we thought it was very important because we came from software engineering backgrounds, and unit tests are a big part of data development. So first of all, what is a data unit test? So in SQL mesh, you can define a YAML file with fake data. So you can just say, like, this table has these rows, and these are the expected results after you run your transformation. Right?
So that means that in SQL mesh, you could define this YAML file with the input expectations and the output expectations. And then now whenever you wanna do anything, like CI or whatever, your project will automatically run these unit tests to make sure that none of your business logic changes. Right? These are very quick to run because we can, for example, translate your model into DuckDB, and then we can run all of this in CI without having to hit your database. Right? We also have what we call audits. So DBT calls these tests, but we call them audits because they're basically data quality checks. And so on your production data, after a run, you can check for things like count star, whether there are nulls, etcetera. Right? These are very good as well, but they serve a very different use case from unit tests. So we have both unit tests and data quality checks, or audits.
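For a sense of what an audit-style check amounts to, here is plain DuckDB and Python for illustration, not SQL Mesh's audit syntax; the file and table names are invented.

```python
import duckdb

# Connect to a hypothetical local warehouse file.
con = duckdb.connect("warehouse.duckdb")

# Data quality check after a run: the key column should never be NULL.
null_count = con.execute(
    "SELECT COUNT(*) FROM analytics.orders WHERE order_id IS NULL"
).fetchone()[0]

assert null_count == 0, f"audit failed: {null_count} rows with NULL order_id"
```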
Now one more area that we're looking into adding is table diffs, or data diffing. Right? And so given you have two tables in different environments with slightly different logic, you may wanna understand or see a summary of what exactly is different. And so we're looking into either developing our own version of that or perhaps integrating with Datafold so that you can get this.
[00:27:06] Unknown:
To the point of data diffs and using that for validation, I'm wondering if you can just talk to some of the types of errors or bugs in your logic that you're able to catch through the use of data diff before you actually apply the SQL changes.
[00:27:23] Unknown:
Right. So maybe you've changed some logic. Right? And with something like data diff, you can see, okay, for this particular primary key, the value used to be red; now it's blue. Right? And then you can say, okay, that's correct, that's exactly what I intended, or there's a problem there. Other things you could check are, like, row count differences. Like, this particular set has a million rows, and this one had 15 million rows. That doesn't seem right. And so it's kind of a quick way for you to check on real data what the differences are.
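A rough pandas sketch of the kind of comparison a data diff surfaces, namely row-count drift and value changes keyed by primary key; this is a generic illustration, not data-diff's or SQL Mesh's implementation, and the column names are invented.

```python
import pandas as pd


def diff_tables(prod: pd.DataFrame, dev: pd.DataFrame, key: str) -> None:
    # Assumes both frames share the same schema. Row-count differences jump out first.
    print(f"row counts: prod={len(prod):,} dev={len(dev):,}")

    # Then, joined on the primary key, show values that changed per column.
    joined = prod.merge(dev, on=key, suffixes=("_prod", "_dev"))
    for col in (c for c in prod.columns if c != key):
        changed = joined[joined[f"{col}_prod"] != joined[f"{col}_dev"]]
        if not changed.empty:
            print(f"{len(changed)} rows differ in column '{col}'")
            print(changed[[key, f"{col}_prod", f"{col}_dev"]].head())
```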
[00:27:57] Unknown:
And digging more into the orchestration experience, in the documentation, it points out that SQL mesh is aware of the fact that it's operating on an orchestrator. And I'm curious if you can talk to some of the benefits that you're able to gain by having the transformation tool understand the semantics of that orchestration engine, as well as some of the complexities that that adds because of the need to have another layer of integration within the tool chain.
[00:28:27] Unknown:
Right. So we really wanted SQL Mesh to be a part of your existing data infrastructure, as opposed to being like some black box that kinda runs as a part of it. Right? So, for example, I think the basic usage of DBT within Airflow is just like, okay, within Airflow, there's a node, and it does dbt run, which runs your whole DAG. Right? But SQL mesh integrates directly with Airflow and creates individual DAGs for every model that you have, and so the benefits of that are pretty clear. Right? You can use the existing UI for Airflow to see exactly how your DAG flows.
You can take into consideration external signals and also trigger external signals at the individual model level. Right? And so, yes, there is some complexity here in that now SQL mesh integrates with Airflow, but that complexity is mostly abstracted away from the user. The user doesn't have to think about anything. They just write their models in SQL mesh, and then we do all the hard work of hitting airflow, creating all the DAGs, and orchestrating all of that.
[00:29:36] Unknown:
I'm a bit biased because of the fact that I am using Dagster for my own day to day work. And one of the concepts that they are investing heavily in is the idea of software-defined assets and being aware of what the individual assets are for each of the different processes that are executing. So in the DBT case, they will actually parse out the entire DAG of the project so that you can view each of the individual models, understand when it was last refreshed, etcetera. And I'm wondering if you can talk through what that integration might look like for SQL mesh and some of the mechanical elements that would need to be put in place for enabling that integration?
[00:30:18] Unknown:
We do plan to eventually integrate with Prefect and Dagster, but we chose to do Airflow first because it's kinda like the industry leader. And, yeah, we're big fans of the way Dagster kinda treats its assets, but I would say that tools like Dagster and Airflow are much lower level. Right? They are very good at orchestrating tasks, but they don't go too much deeper than that. So although something like Dagster might know when a particular task was executed, it wouldn't have the nuance of, this task has these intervals that have been run, as well as what that task actually means. Right? Because SQL mesh understands SQL and understands your models.
It fully understands, for this version of a model, what it means, right, and what all the downstream dependencies are. And if you change your model, how that would affect your downstream jobs. Right? And so I would say that, like, SQL mesh and Dagster are somewhat similar, but SQL mesh is, I guess, higher level in that it really understands tables. It really understands SQL and column-level lineage. So I'm not exactly sure how SQL mesh is going to take advantage of what Dagster has done, but there are definitely some similarities here.
[00:31:39] Unknown:
Given the fact that SQL mesh does handle in an automated fashion some of these semantic concerns and some of the cognitive complexity that goes with building out, in particular, larger projects of these transformation flows. So in being able to understand the column dependencies, being able to automatically map that column lineage at the project level without having to manually add specific signals for the tool to be able to do that job for you, I'm curious what you either have seen or predict as the second-order benefits of using something like SQL mesh for teams that are building out some of these complex data estates.
[00:32:26] Unknown:
Yeah. So I think it's just gonna really help productivity. Right? So let's say you're modifying a model downstream, and you accidentally refer to the wrong column name. Right? If you were to use, for example, DBT, you have to wait for your entire DAG to execute until you get to that node, and then there would be some runtime error. Right? Because SQL mesh has a full understanding of column-level lineage, right, as you plan, it's going to be like, hey, are you using this column? Are you sure that was correct?
Because we understand the whole flow of the data. That's one aspect of it. The other aspect of it, which I think is really important, is that SQL mesh was really designed to unify the different data stacks that data engineers and data analysts have. Right? At all the places that I've worked at, data engineers kind of have their own stack. Right? Maybe Scala, Airflow, etcetera, something a little bit more complex because it has to deal with scale and robustness. And the analysts have their own stack. Right? Something that's more geared towards usability, pure SQL, something that they're familiar with.
And so SQL mesh is intended to be easy for the analyst to use, but robust and scalable enough that a data engineer would use it in their production systems. And so the real benefit of something like SQL Mesh, as I see it, is that it would enable a company at any scale to really leverage the same stack, because not only is it easy enough for someone less technical to use, but it's also robust enough that it can handle any of the kind of data flowing in that a data engineer used to have to kind of do custom stuff for.
[00:34:21] Unknown:
Some other aspects of building a tool like this and working with it at the team and organizational level are the communication elements and documentation. I'm curious if you can talk to what SQL mesh has in that avenue as far as something analogous to the DBT docs capability, etcetera, and also some of the collaboration aspects between data engineers and data analysts and some of the ways that SQL mesh thinks about those challenges?
[00:34:52] Unknown:
Sure. So SQL mesh includes a free UI. Right? The UI includes not only an IDE, but it includes docs. And so the way SQL mesh does docs is we have a kind of very SQL-first format. Right? So you just write comments in your models, and then those comments get automatically put into a nice docs interface. And so we find that to be aesthetically pleasing because, you know, you don't need another YAML file or anything like that. We can automatically parse out the comments. We can parse out the column names. We can parse out the types and automatically create a nice UI for you, which is free and open source.
So in terms of collaboration, SQL mesh has a couple different avenues here as well. Right? Because our dev environments are so cheap and easy to make, anyone can kind of make a dev environment with their changes and point people to it. People can see exactly what your changes would entail. We also have an open source CI/CD bot. And so whenever you make a PR using that bot, the bot will automatically create a dev environment that has your particular changes, which means, furthermore, people can see exactly, okay, if we were to merge in their PR, this is the state of the warehouse with that change. So that further facilitates collaboration.
[00:36:21] Unknown:
And for people who, in particular, are moving from DBT into SQL mesh, but also people who maybe have just been managing a pile of SQL scripts and are moving into SQL mesh, I'm curious how that automatic dependency management at the column level and the collaborative elements influence the ways that they think about the structure of the project and the granularity of the transformations that they want to build?
[00:36:50] Unknown:
Well, if you already have kind of scripts or are using DBT, you've already kind of thought of these things. We have an interesting use case. We're working with this company with an existing multi-thousand-model DBT project. And what was really surprising to me was that, although we do have the DBT adapter, they really loved the simplicity and syntax of SQL mesh. And so they decided to just rewrite their entire DBT project into SQL mesh. And they said that it was actually quite easy to do. You know, it's mostly just copy and paste, deleting the refs. And from that, they were able to also take the time to refactor things a little bit and make things nicer. And so I was surprised that they went through that, but at the end of the day, they really thought it was worth it, and they enjoyed doing it. And now their project is much simpler. Right? There's no more refs.
Their project is simpler with our macro system, etcetera. It's checked at compile time. And so I was quite surprised to see something like that. I've also talked to one of our partners. They're a smaller company, and they already had a 40-model DBT project. And they also decided to rewrite the project instead of just using the adapter. And, you know, their reason was just like, okay, well, SQL mesh's simple syntax is just SQL. Right? There's no Jinja. And so what they really liked was that they were able to copy and paste, like, you know, just their SQL and their models into the BigQuery UI without having to remove or render their existing Jinja. So, I don't know, those are just kind of some of the interesting things that I've seen.
[00:38:37] Unknown:
Yeah. And I can definitely see that also simplifying the tooling around things like linting, where if you're using something like SQLFluff, it has to go through those extra hops of figuring out what is the Jinja, how much of that factors into the semantic structures and the code style that we want to enforce. So
[00:38:56] Unknown:
Right. Yeah. So on the topic of, like, SQLFluff, we just use SQLGlot for linting. So you can just run, like, SQL mesh format, and, I guess, our kind of format is a little bit different. We're a little bit more like Black, where we're very opinionated, and there's kind of one way of doing things. There's, like, some things you can tweak, like indentation, etcetera, but for the most part, it's just a very opinionated formatter.
[00:39:22] Unknown:
Another aspect of the tool that you're building because of it being open source as well as the fact that you are trying to build a business, I'm wondering if you can talk to the governance considerations and how you're thinking about the overall strategy of managing the open source project along with your business goals.
[00:39:40] Unknown:
So we're very early. We just started in September, and we're looking to really focus the next couple of months on making sure our existing clients and partners are successful. Right? And so we're really focused on making SQL Mesh more robust. And so a lot of our early efforts are going to be around making sure open source is a great product that's robust and production ready. Now in terms of our kind of long term goals of monetization, etcetera, we're also looking to have a cloud or on-prem offering. Right? Because SQL mesh manages a lot of state for you. For example, it knows every single version of every model you have. It knows, for every one of those, which intervals have been run. Right? We have to manage state. And so for the clients that don't particularly wanna manage that themselves, we're gonna have an offering where we have hosted or on-prem solutions where we'll manage that for you.
Additionally, we'll have other enterprise kind of features like cost estimation, right, knowing, for example, which models are expensive and how to optimize that, as well as our query proxy layer, which I can talk about more if you're interested in that.
[00:40:49] Unknown:
Yeah. Definitely be curious to hear what the query proxy is aimed at and some of the capabilities that that adds.
[00:40:56] Unknown:
Right. So SQL mesh, as we discussed, has this whole virtual environment layer. Right? It easily and efficiently allows you to create development environments that are full replicas of your warehouse without any of the cost. Right? It does it all very cheaply. And so the vision here is that, end to end, we want to be able to have any tool, even outside of SQL mesh, be able to query those virtual environments. The real world use case would be, like, an executive says, hey, I wanna look at this new metric. Like, what happens when I change the definition of daily active users? How does that affect all these downstream things? How does that affect my Tableau dashboards?
Right? In traditional systems, like, you're kinda out of luck. Either you have to test it in production, or you're gonna have to change all your Tableau dashboards, which is gonna take forever. Right? And so one of the features we're thinking about for enterprise is a query proxy layer. And so we would just have Tableau point to this proxy layer, which acts like a SQL server. And with a simple flag on that side, you can easily route all your existing Tableau queries into the right dev environment so that you can now, without touching Tableau, see what this new metric would look like without affecting production. So that's one idea that we have.
[00:42:15] Unknown:
And to the point of metrics and your earlier comment that this all started with wondering about building out a metrics layer solution, and the fact that DBT has acquired Transform and they are heavily investing in the concept of the metrics layer, I'm wondering how that manifests in the context of SQL mesh and some of the design and user experience considerations that you've had along that line.
[00:42:38] Unknown:
Our team has a lot of experience in metrics and experimentation. Right? We come from Airbnb, Apple, and Netflix, where we worked on some of these really large scale, advanced metric systems. And so it definitely is one of our ambitions to take our core foundational transformation layer and add a very powerful semantics layer. And I don't have a timeline on that; it's something that we're planning to do. But one thing I will say is that we really wanna do it in a way that is quite flexible. And so that doesn't mean just business metrics. Right? When we're doing metrics, we're also planning to do metrics that are suitable for experimentation, right, and suitable for machine learning features.
[00:43:28] Unknown:
In your work of building SQL mesh and working with some of the early adopters, what are the most interesting or innovative or unexpected ways that you've seen it applied?
[00:43:39] Unknown:
Yeah. So I think the most interesting thing is that there's kind of two ends of the spectrum of users that we've seen. Right? I kinda mentioned this before, but one of our clients is a huge power user of DBT. They have, you know, thousands of models. Right? And they were able to not only translate all of that to SQL mesh, but they also then hooked it up to a really advanced CI/CD system. Right? So they were really taking SQL mesh to the fullest. We weren't anticipating that there would be such a passionate user base already, and they're building so much really cool stuff with it, taking advantage of all the stuff that we have.
And on the other side of the spectrum, right, we're seeing small shops. Right? You would think that these companies are kind of DBT's bread and butter, but they love SQL mesh because of just how simple it is. Right? One of the things they told us was that SQL mesh is simply SQL. They love that. Right? They also loved how SQL mesh doesn't have layers and layers of config. Right? One of our customers, or our clients, said that reading the DBT docs was so confusing, and there's just so many nested layers of inheritance with the YAML files, they didn't wanna deal with that. They just wanted to write SQL.
And so we've kind of been surprised at these two ends of the spectrum of users.
[00:45:01] Unknown:
And in your own work on this project and operating in this space, what are the most interesting or unexpected or challenging lessons that you've learned in the process of building SQL
[00:45:11] Unknown:
Mesh? Yeah. So I think that I used to think that DBT was quite simple, and it is simple to use. But because we support DBT projects, right, we had to dig into the nitty gritty of how DBT is actually implemented. And, actually, under the hood, there's a lot of complexity in dealing with and managing nested layers of config and YAML files and a bunch of implicit rules that we needed to get right in order for your existing DBT project to work with SQL mesh. And so although, at face value, DBT seems quite easy to use, once you dig into the nuances of how, like, configurations and the YAMLs will interact with each other, it's actually quite complex. So that was a little surprising to me as we were developing SQL Mesh.
[00:46:02] Unknown:
And so for people who are interested in building out their data transformations in a primarily SQL environment, what are the cases where SQL mesh is the wrong choice?
[00:46:13] Unknown:
So SQL mesh is not the right choice if you have, like, a real time pipeline. So if you have CDC or you want something really low latency, right, SQL mesh is not the right choice. We have seen some users using SQL mesh Python models for ingest, but it's really done in a micro-batch fashion. So something like ingest every 5 or 10 minutes. But you shouldn't really be doing that at any high scale or where you're looking to have sub-second latency. It's not the right tool. You should use Flink or Fivetran for that.
[00:46:47] Unknown:
And as you continue to build out and iterate on SQL mesh and some of the other business goals for Tobiko Data, what are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to dig into?
[00:47:02] Unknown:
Yeah. I touched on this briefly already, but our our immediate goal is just to really work with our existing clients and partners, making sure they're successful, making sure that SQL mesh is robust and enterprise ready. But we're also really excited about, changing the metric space. Right? Making a really robust and powerful semantics layer, as well as this kind of query proxy layer. So yeah.
[00:47:33] Unknown:
Are there any other aspects of SQL mesh in particular or the data ops space or the overall workflow of, transformations in a SQL context that we didn't discuss yet that you'd like to cover before we close out the show? I think we covered most of it. Okay. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:05] Unknown:
From my experience, the biggest gap in this whole data ecosystem is how fragmented it is, even within one company. I think different types of users, for example, data engineers, data analysts, data scientists, all have very different workflows, partially because they have very different skills. And I think that this heterogeneous environment makes it difficult for the business to be productive, and it makes it difficult to maintain. Right? A data analyst might be using some kind of SQL-based stack like DBT in their own black box. A data engineer might be writing some kind of Scala Airflow custom thing, and the data scientists might be using a Jupyter notebook.
And so these folks are all working towards the same goals, but they're all talking different languages. Right? It's very difficult to have a productive development environment with these three different stacks. And so that's kind of one of the motivating factors of what we're trying to do: trying to make data development more homogeneous across these three very different personas.
[00:49:14] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on SQL Mesh. It's definitely a very interesting tool and one that I'm excited to start experimenting with and trying out for my own use cases. So I appreciate all the time and energy that you're putting into that. And I hope you enjoy the rest of your
[00:49:31] Unknown:
day. Thanks for having
[00:49:36] Unknown:
me.
[00:49:38] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview of SQL Mesh
Guest Introduction: Toby Mao
What is SQL Mesh?
Understanding DataOps
Challenges with Existing Tool Chains
Nuances in Data Workflows
Mapping SQL Dialects
Python Capabilities in SQL Mesh
Building a Project with SQL Mesh
Lessons from DBT
Day 0 to Day N Workflow
Testing and Validation
Orchestration Experience
Second Order Benefits of SQL Mesh
Communication and Collaboration
Transitioning from DBT or SQL Scripts
Governance and Business Strategy
Query Proxy Layer
Metrics Layer and User Experience
Interesting Use Cases
Lessons Learned
When SQL Mesh is the Wrong Choice
Future Plans
Biggest Gap in Data Management Tools
Closing Remarks