Summary
Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers, it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack
- Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what SQLMesh is and the story behind it?
- DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?
- What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?
- How do those rough edges impact the productivity and effectiveness of teams using those tools?
- Can you describe how SQLMesh is implemented?
- How have the design and goals evolved since you first started working on it?
- What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
- For teams who have already invested in dbt, what is the migration path from or integration with dbt?
- You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
- What do you see as the potential benefits of integration with e.g. data-diff?
- What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workflows and the associated dependency chains?
- What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
- When is SQLMesh the wrong choice?
- What do you have planned for the future of SQLMesh?
Contact Info
- tobymao on GitHub
- @captaintobs on Twitter
- Website
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- SQLMesh
- Tobiko Data
- SAS
- Airbnb Minerva
- SQLGlot
- Cron
- AST == Abstract Syntax Tree
- Pandas
- Terraform
- dbt
- SQLFluff
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Rudderstack: ![Rudderstack](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/CKNV8HZ6.png) RudderStack provides all your customer data pipelines in one platform. You can collect, transform, and route data across your entire stack with its event streaming, ETL, and reverse ETL pipelines. RudderStack’s warehouse-first approach means it does not store sensitive information, and it allows you to leverage your existing data warehouse/data lake infrastructure to build a single source of truth for every team. RudderStack also supports real-time use cases. You can Implement RudderStack SDKs once, then automatically send events to your warehouse and 150+ business tools, and you’ll never have to worry about API changes again. Visit [dataengineeringpodcast.com/rudderstack](https://www.dataengineeringpodcast.com/rudderstack) to sign up for free today, and snag a free T-Shirt just for being a Data Engineering Podcast listener.
Hello, and welcome to the Data Engineering podcast, the show about modern data management. RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudderstack.
Your host is Tobias Macey, and today, I'm interviewing Toby Mao about SQL Mesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in. So, Toby, can you start by introducing yourself? Hey. How's it going? I'm Toby. I'm the CTO of Tobiko Data, and, yeah, I'm working on SQL mesh. Thanks for having me here today. Of course. And do you remember how you first got started working in data?
[00:01:07] Unknown:
So my first job out of college was at a pharmaceutical consulting company, and I was working as an analyst where I would process data using SAS, and create decks and stuff with Excel, VBA, etcetera. So kind of got into data engineering through the analyst path.
[00:01:27] Unknown:
And in terms of SQL mesh, can you give a bit of an overview about what it is and some of the story behind how it came to be and also why you decided to release it as open source?
[00:01:38] Unknown:
Sure. So at its core, SQL mesh is a transformation framework that's looking to enable DataOps. So it allows you to write transformations in either Python or SQL, and it can orchestrate that and do a lot of stuff for you so that it's done scalably and correctly. I guess how it started: I led analytics at Airbnb, where I worked on some projects like Minerva, their metrics platform, and experimentation. And at first, I kind of wanted to do a semantic layer metrics startup. But as I did some research there, I realized that a standalone kind of semantic layer doesn't really make too much sense because the hardest part of getting good metrics is having good data. Right? And that means the transformation aspect.
And so at Airbnb, I actually saw a demo of DBT, and, immediately, I was like, this would never work at any of the companies I've worked at because there's kind of a built-in assumption of being able to fully refresh everything, a lot of inefficiencies, etcetera. And so that's kind of where I got the idea. And why open source? Well, I really do believe in open source. I've been a big fan of open source for many years. I have SQLGlot, which is an open source SQL parser. I even have an open source board gaming website called 18xx.games. So I love open source.
[00:03:09] Unknown:
Another kind of keyword that was put into the opening description is this idea of DataOps, and that is one of those phrases that has been co-opted and overloaded by everyone in industry for their own particular purposes. And I'm wondering if, for the purpose of this conversation, you can nail down the concepts that you're trying to convey through the use of that term.
[00:03:31] Unknown:
DataOps is just a set of practices that's looking to make data practitioners more productive and have their output be more reliable and more accurate. Right? And so SQL mesh is looking to automate that all the way. We're trying to make sure that you can just focus on defining your business logic, and we'll handle the rest. We'll make sure it's correct. It scales. It's reliable. And so that's really all we're trying to do.
[00:04:03] Unknown:
In terms of the existing tool chains, not to pick on DBT too much, but it is kind of the elephant in the room, I'm curious what are some of the challenges that are posed that prevent this ideal of DataOps given the existing state of the ecosystem?
[00:04:23] Unknown:
So I'll kinda make an analogy here. Right? For example, cron is a very simple tool. Right? In the past, a lot of people used to use cron. And what cron can do is schedule some particular task and do it for you at some cadence. Right? And that works great if you have a very simple project or simple requirements. Over time, as you're using cron and your use case gets more complex, you add in more custom things, more custom scripts. Maybe you need to orchestrate some stuff. Maybe you need if/else magic. Right? And so over time, different tools started to come out in the space, for example, like Airflow. Right?
And Airflow really specializes in these tasks of orchestrating complex things, etcetera. Right? And similar to cron, DBT is very simple. Right? And it's great to use if you have a simple use case. But over time, a lot of people who get deeper into DBT, deeper into the uses of DBT, start to build a lot of custom scripts in order to handle their use cases. And so I would say that SQL Mesh is trying to be to DBT what Airflow is to cron. Right? It's trying to have a fundamental understanding of the transformation space and build from that to make a seamless experience, as opposed to just being kind of a simple and basic orchestrator.
Does that make sense?
[00:06:02] Unknown:
Yeah. It definitely does. And I think in particular, some of the rough edges that have come up with the current state of data engineering, with the understanding of that workflow being things like the, quote, unquote, modern data platform, where you're using a Fivetran or an Airbyte for data integration. You're using probably DBT or some custom chain of ETL scripts to manage that transformation flow, and then maybe some sort of BI dashboard or reverse ETL framework for making some use of that data. I think a lot of the challenge is the disjointed flows, where there are tools that do help to stitch back together those disparate components into some semblance of a cohesive experience, but that then adds another layer of complexity of making sure that that tool can understand the different semantics of the underlying processes.
And also the fact that a lot of these utilities are working at a very coarse granularity. And in order to really scale and get to a sophisticated point with your data workflows, you need to have a more nuanced view and more detailed access to some of the underlying mechanics to be able to build out some of the customizations that are necessary to ensure that all of your data is correct and properly represented.
[00:07:30] Unknown:
Exactly. And so these nuances matter. They really affect the kind of UX of development. Right? The experience of developing with data. And so I guess two more rough edges that I kinda wanna talk about are, one, existing tool chains don't understand SQL at all. Even though you might be writing SQL, your tools just treat it as raw strings, and everything is stitched together through templating. Right? SQL mesh is kind of built on top of SQLGlot, and so it has a first class understanding of SQL. And so, simple things like: why do you need to define ref? Can't your tool just tell you what your SQL is doing? Why do you have to manually define things? Right?
Also, if you're using multiple engines, why do you have to manually specify macros to transpile between those two? Like, you're writing SQL. You should just be able to kinda switch between engines. Right? That's kind of what SQL mesh can do for you. Another part of it is the whole development environment experience. I would say that with existing tools, managing development environments is very manual, and it's very cost ineffective. Right? So let's say you wanted to make a particular change in your existing workflow, and you want to make a development environment to test your change or look at your change in a representative environment. The default flow would just be to kind of copy your entire warehouse, your entire pipeline, into this dev environment. Right?
But this workflow is very inefficient, right, because you only need maybe a subset of what you've done. And so either it's inefficient, or you then have to manually say, hey, look, I know this is what I'm gonna be changing, and manually specify that. Right? One of SQL mesh's core concepts is the virtual environment layer. So because we understand SQL, we know exactly what has changed in your particular branch. And so from that, we can take the most efficient path automatically towards creating a development environment that is fully representative and has your changes. And so we've taken away a lot of the manual work necessary to kinda do that data work, and we make that easy and correct and automatic.
[00:09:55] Unknown:
In terms of the mapping between different SQL dialects, at the very high level of just the core semantics of SQL, I can definitely see where that's an achievable goal. But where it gets complicated is in dealing with some of the different database-specific functions that you might call. So, using JSON parsing in Trino versus whatever other JSON functions are in BigQuery versus Snowflake, etcetera, and some of the semantics of how you access some of the nested attributes of that JSON being one likely very widely used example of those custom functions. And I'm wondering how that has manifested in your work on SQL mesh as far as being able to map between those more engine-specific concepts.
[00:10:38] Unknown:
Yeah. So all of this is powered through SQLGlot, and we do handle JSON extract, etcetera, and all those differences between different dialects. And so the way we do that is, when we parse the SQL, we first turn it into an intermediate representation, or an AST. Right? So the AST is kind of a generic representation of SQL. And then from that, we can then, for each dialect, specify, okay, they want to extract out this field from this JSON thing. And so in Snowflake, it's this way. In BigQuery, it's this way. In Spark, it's this way. And so the kind of architecture looks like many dialects coming in, and they go to one representation, and then many dialects coming out. Now, obviously, it's not perfect. SQL is huge. There's a million functions.
But we have a lot of them covered. And the important thing is that our architecture is there to support it. So if, for example, we don't have a particular mapping in place, we can easily add that, and now everybody can benefit. As opposed to in the current workflow, you'd probably just write your own custom Jinja macro for your specific use case, and everybody in the world who has this particular case can't benefit from open source. They're gonna have to do it themselves in a custom script. Right? And so yeah.
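To make the many-dialects-in, one-AST, many-dialects-out flow concrete, here is a minimal sketch using the open source SQLGlot library discussed above. The query, table, and column names are invented, and the exact SQL that comes out depends on SQLGlot's mappings for each dialect.

```python
import sqlglot

# Snowflake-style JSON field access: parse once into a dialect-agnostic AST...
ast = sqlglot.parse_one("SELECT payload:user_id AS user_id FROM events", read="snowflake")

# ...then generate SQL for other engines from that single representation.
print(ast.sql(dialect="bigquery"))
print(ast.sql(dialect="spark"))
print(ast.sql(dialect="duckdb"))
```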
[00:12:01] Unknown:
And another aspect of what you mentioned in the introduction of SQL mesh is the idea of being able to use SQL or Python for these transformations. And Python capabilities are being added to tools like DBT. They're being added to Snowflake through these things like Snowpark. I'm wondering if you can just talk through what that looks like in terms of the types of transformations and how those transformations are written and integrated into the rest of the SQL environment within a given SQL mesh code base?
[00:12:32] Unknown:
So in SQL mesh, you can do the standard SQL, but you can also write Python. And so kind of the only thing you need to do when writing a Python model is return either a Pandas data frame or a PySpark data frame. And so it's incredibly flexible. You can do anything you want in there. You can even make API calls. You can hit the database, etcetera. And we found many of our clients really enjoy the ease and power of our Python models. And, I guess, one big architectural difference between SQL Mesh's Python models and DBT's Python models is that SQL Mesh Python models can run anywhere. Right? They kind of run where SQL mesh is running. We don't have to ship off the Python code to, for example, Snowflake. And I think that DBT's Python execution mode causes some problems because whatever you're sending has to be able to run on Snowflake.
And so the benefit of being able to just run your Python models wherever SQL mesh is running is that it's much easier to debug. You can see the Python code running. You have all the dependencies installed, etcetera.
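As a rough sketch of what a Python model looks like, based on SQLMesh's documented Python-model interface; exact decorator arguments may differ by version, and the model, table, and column names here are invented:

```python
import typing as t
from datetime import datetime

import pandas as pd
from sqlmesh import ExecutionContext, model


@model(
    "my_schema.enriched_users",  # hypothetical model name
    columns={"user_id": "int", "signup_date": "date", "plan": "text"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
) -> pd.DataFrame:
    # Pull upstream data through the context, then do arbitrary Python work:
    # call APIs, apply an ML model, hit another database, etc.
    df = context.fetchdf("SELECT user_id, signup_date FROM my_schema.raw_users")
    df["plan"] = "free"
    return df
```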
[00:13:46] Unknown:
And digging more into SQL mesh itself, can you talk through some more of the implementation and some of the user experience around how to actually build a project
[00:14:00] Unknown:
with it. Sure. I mean, it's quite similar to DBT in the sense that you can just define your SQL, and then you can just run plan. So SQL mesh's plan is kind of like DBT's run, except that the plan gives you an overview of what might happen first before you actually apply it. And so that kind of takes inspiration from Terraform, if you're familiar with that. Right? Because we understand SQL, we can understand a lot of what's gonna happen before we actually run it. And so when you define a SQL mesh model, it's very familiar. You're simply writing SQL. There's not a bunch of YAML that you have to define. You don't have to do ref tags. You just write your SQL in files.
And then from there, we can automatically detect all the dependencies. We can do some compile-time validations, like if you have any syntax errors, and then we can tell you exactly what has changed and what hasn't changed. So the development experience is similar to DBT if you've seen it, but it's a little bit more powerful in that we can do things at compile time that are very powerful, and also you don't have to do a lot of extra manual work with YAML files and refs, etcetera.
[00:15:21] Unknown:
And in terms of the development of SQL mesh itself, can you talk to some of the ways that the design and goals have evolved since you first started working on it and maybe give some sense of what the current level of maturity is for people who are starting to think about adopting SQL mesh for their own work. So the most important goal, which has never changed for us, is that SQL mesh should always be correct. Right? Correctness is nonnegotiable.
[00:15:49] Unknown:
So we wanna make sure that whatever results SQL mesh is pumping out are what you expect and consistent. And so over time, as we've been developing SQL mesh, what we found challenging as we delve into different use cases is that it's quite difficult to do this. Right? It's quite difficult to make sure everything is always correct. So we're starting from a foundation of being able to understand SQL. Right? There's so much we can do with that to help aid in correctness and efficiency. For example, because we can understand SQL, we understand also how changes to your SQL can or cannot affect downstream dependencies.
For example, say you have a parent model that has many downstream children and you add a column to it. Right? What does that mean? If you're just naive, you think, okay, I have to backfill everything, right, the parent and all the downstream, because I've changed the model. But because we understand SQL, for example, and we have full column-level lineage of who uses that table, when you add a column that's brand new, no one else is using it. Right? And you haven't affected the cardinality of the parent. And so we know that that is a non-breaking change. We can automatically classify that. And so in order to make your environment into that correct state representing a data model with this one new column, we don't have to backfill the whole world. We only have to backfill that parent. Right?
And so you can see now the power, yet complexity, of being able to understand SQL and how that translates to the system. And so we've been doing our best to kind of take advantage of all of this power without exposing any of it to the user.
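To illustrate the underlying idea, here is a toy sketch using SQLGlot directly; this is not SQLMesh's actual change-classification logic, and the queries and column names are invented. Because the downstream SQL is parsed, a tool can check whether a newly added upstream column is referenced anywhere before deciding whether a change is breaking.

```python
import sqlglot
from sqlglot import exp

# A downstream child model's query.
child_sql = "SELECT user_id, SUM(amount) AS revenue FROM orders GROUP BY user_id"

# Columns the child actually references.
referenced = {col.name for col in sqlglot.parse_one(child_sql).find_all(exp.Column)}

# Hypothetical column just added to the parent `orders` model.
new_column = "discount_code"

print("breaking for this child" if new_column in referenced else "non-breaking for this child")
```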
[00:17:45] Unknown:
And in terms of what you're building at SQL Mesh, the obvious incumbent is DBT. It has helped to launch the overall concept of analytics engineering. I'm just wondering, what are some of the lessons that you've learned from DBT in terms of how you've approached the design and functionality of SQL mesh, and the ways that you're thinking about how people can manage the adoption path, particularly if they have already invested heavily into DBT?
[00:18:18] Unknown:
So, yeah, DBT is a fantastic tool. Right? It's really changed the space. It's created this whole branch of analytics engineering. And so the biggest inspiration for us from DBT in terms of our design is definitely the simplicity of the tool. Right? DBT is extremely simple. You just do dbt run, right, and everything kind of just works. And so with SQL mesh, we really wanted to make sure that the experience was just as simple. And so we have basically one command as well, which is SQL mesh plan. And so the plan, not only does it run your DAG, etcetera, but before it does anything, it tells you exactly what's gonna change, what is the potential impact, etcetera. And so we wanted to make sure that SQL mesh was just as easy and simple to use as DBT.
Now if you already have a DBT project, that's great. Right? You're thinking about transformation. You're trying to make things more scalable and more reusable. Right? And so SQL mesh is DBT syntax compatible. So if you have an existing DBT project, you can just have SQL Mesh read in your existing project, and you can take advantage of many of the foundational and architectural improvements that we have. Now there are some things that, you know, DBT makes assumptions for, and so you might have to do some tweaks here and there. But we've been able to read in large, you know, kind of thousand-model DBT projects with our existing clients.
It's definitely a work in progress, and, we're hoping to make that better. But our goal is definitely to be dbt syntax compatible.
[00:20:04] Unknown:
Given that goal of being syntax compatible, what are some of the constraints or limitations that that imposes as far as the overall, scope of possibilities for what you have built for the foundation?
[00:20:18] Unknown:
Right. So SQL mesh has its own syntax, and there's no limitations if you use SQL mesh syntax. But if you use DBT syntax, for example, if you have a DBT incremental model, right, you probably have in your incremental model some if/else block, which is like, if is incremental, do this; else, do that. Right? SQL mesh's incremental model is very different. We actually keep track of what intervals have been computed. Right? And so there is no if/else logic in SQL mesh models. You can just write select all from my table where my date is between the start and the end of the run. Right? Because SQL mesh tracks the runs, and it knows, okay, well, you've processed this date, and so now we need to process this date.
In your existing Jinja DBT code, you're trying to calculate the previous run dynamically with your Jinja block, and you're querying yourself to find the latest date. Right? And so if you have something like that, you're going to need to add, for example, another else-if block and check if it's SQL mesh. And if it's SQL mesh, then you replace your entire subquery block with just a date is between start and end. So because DBT doesn't have a concept of kind of intervals and partitions, etcetera, if you're using an existing DBT project and you wanna take full advantage of SQL Mesh's incremental models, you'll have to add another block in your code.
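As a generic illustration of the difference being described, here is plain Python string-building, not dbt's or SQL Mesh's actual APIs, with invented table and column names: in the first style the model discovers its own high-water mark, while in the second the framework passes the missing interval in.

```python
from datetime import date


def self_discovering_incremental(table: str) -> str:
    # dbt-style: the model queries its own target table to find what was already
    # loaded, typically wrapped in an {% if is_incremental() %} Jinja block.
    return (
        f"SELECT * FROM {table} "
        f"WHERE event_date > (SELECT MAX(event_date) FROM my_target_table)"
    )


def interval_based_incremental(table: str, start: date, end: date) -> str:
    # SQL Mesh-style: the framework tracks which date intervals have been computed
    # and hands the model the window it still needs; the model is just a filter.
    return (
        f"SELECT * FROM {table} "
        f"WHERE event_date BETWEEN '{start}' AND '{end}'"
    )
```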
[00:21:57] Unknown:
Does that make sense? Yes. In terms of somebody who's using SQL mesh, I'm wondering if you can just talk through what that overall process looks like, from either "I have nothing, I just wanna get started and start building it out" through to "I'm using it in a production context, managing it through a data orchestration engine." What does that overall, kind of, day 0 through day n workflow look like?
[00:22:22] Unknown:
Sure. So for day 0, a user first starting out with nothing, they'd probably go to our documentation. They'd go to the quick start, and they'd bootstrap a project using sqlmesh init. That's gonna create kind of a skeleton with two models, and they're gonna get started. So with that, they would set up a particular config. So they point their SQL mesh to some engine like Snowflake or BigQuery or whatever, just putting in their credentials, etcetera, or whatever they need to connect. Then they would start adding models. So you just start adding different models for the transformations you're looking for. And then when you're ready, you can just run SQL mesh plan, and it's going to basically create your environment and create all your models. Right? And so you can use dev. You can use prod. You can have all these environments, etcetera.
Now once you start getting more mature and you wanna start running these, for example, daily, you could hook that up to a cron. You just have your cron job run SQL mesh run every day, and it's gonna compute only what is needed every day, right, because we keep track of state, etcetera. Now for a more mature company, something that's running thousands of models, etcetera, you could use something like Airflow. So SQL mesh has first class integration with Airflow. That means that SQL Mesh can convert your models into individual DAGs within Airflow. So SQL mesh doesn't run itself as just one black box within Airflow. Right?
That's kind of what DBT does. SQL mesh actually creates first class objects within Airflow for every single kind of node that you have. And so the benefit of something like that is that you can actually integrate more closely with Airflow. So you can have external signals that trigger certain nodes, or you can have certain nodes within SQL mesh in Airflow triggering other things. Right? And so those are two ends of the spectrum. Right? You can have the beginner with just a basic SQL mesh project using DuckDB locally with no scheduling, or you could have an enterprise kind of Airflow deployment.
[00:24:36] Unknown:
Another aspect of the DataOps workflow is the idea of being able to do testing and validation, particularly where you don't want to have to use production as your validation environment. And I'm wondering if you could talk to some of the capabilities that SQL mesh exposes for that CICD and staging workflow?
[00:24:57] Unknown:
Right. So correctness is kind of our core principle, and so we wanted to really make sure that users of SQL mesh had the ability to fully test and validate their data. And so right now, as of today, SQL mesh has two modes of testing. We have unit testing, real data unit tests. Right? This is something that doesn't really exist today in the industry, but we thought it was very important because we came from software engineering backgrounds, and unit tests are a big part of data development. So first of all, what is a data unit test? So in SQL mesh, you can define a YAML file with fake data. So you can just say, like, this table has these rows, and these are the expected results after you run your transformation. Right?
So that means that in SQL mesh, you could define this YAML file with the input expectations and the output expectations. And then now whenever you wanna do anything, like CI or whatever, your project will automatically run these unit tests to make sure that none of your business logic changes. Right? These are very quick to run because we can, for example, translate your model into DuckDB, and then we can run all of this in CI without having to hit your database. Right? We also have what we call audits. So DBT calls these tests, but we call them audits because they're basically data quality checks. And so on your production data, after a run, you can check for things like count star, whether there are nulls, etcetera. Right? These are very good as well, but they serve a very different use case from unit tests. So we have both unit tests and data quality checks, or audits.
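For a sense of what an audit-style check amounts to, here is plain DuckDB and Python for illustration, not SQL Mesh's audit syntax; the file and table names are invented.

```python
import duckdb

# Connect to a hypothetical local warehouse file.
con = duckdb.connect("warehouse.duckdb")

# Data quality check after a run: the key column should never be NULL.
null_count = con.execute(
    "SELECT COUNT(*) FROM analytics.orders WHERE order_id IS NULL"
).fetchone()[0]

assert null_count == 0, f"audit failed: {null_count} rows with NULL order_id"
```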
Now one more area that we're looking into adding is table diffs, or data diffing. Right? And so given you have two tables in different environments with slightly different logic, you may wanna understand or see a summary of what exactly is different. And so we're looking into either developing our own version of that or perhaps integrating with Datafold so that you can get this.
[00:27:06] Unknown:
To the point of data diffs and using that for validation, I'm wondering if you can just talk to some of the types of errors or bugs in your logic that you're able to catch through the use of data diff before you actually apply the SQL changes.
[00:27:23] Unknown:
Right. So maybe you've changed some logic. Right? And with something like data diff, you can see, okay, for this particular primary key, the value used to be red; now it's blue. Right? And then you can say, okay, that's correct, that's exactly what I intended, or there's a problem there. Other things you could check are, like, row count differences. Like, this particular set has a million rows, and this one had 15 million rows. That doesn't seem right. And so it's kind of a quick way for you to check on real data what the differences are.
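A rough pandas sketch of the kind of comparison a data diff surfaces, namely row-count drift and value changes keyed by primary key; this is a generic illustration, not data-diff's or SQL Mesh's implementation, and the column names are invented.

```python
import pandas as pd


def diff_tables(prod: pd.DataFrame, dev: pd.DataFrame, key: str) -> None:
    # Assumes both frames share the same schema. Row-count differences jump out first.
    print(f"row counts: prod={len(prod):,} dev={len(dev):,}")

    # Then, joined on the primary key, show values that changed per column.
    joined = prod.merge(dev, on=key, suffixes=("_prod", "_dev"))
    for col in (c for c in prod.columns if c != key):
        changed = joined[joined[f"{col}_prod"] != joined[f"{col}_dev"]]
        if not changed.empty:
            print(f"{len(changed)} rows differ in column '{col}'")
            print(changed[[key, f"{col}_prod", f"{col}_dev"]].head())
```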
[00:27:57] Unknown:
And digging more into the orchestration experience, in the documentation, it points out that SQL mesh is aware of the fact that it's operating on an orchestrator. And I'm curious if you can talk to some of the benefits that you're able to gain by having the transformation tool understand the semantics of that orchestration engine, as well as some of the complexities that that adds because of the need to have another layer of integration within the tool chain.
[00:28:27] Unknown:
Right. So we really wanted SQL Mesh to be a part of your existing data infrastructure, as opposed to being like some black box that kinda runs as a part of it. Right? So, for example, I think the basic usage of DBT within Airflow is just like, okay, within Airflow, there's a node, and it does dbt run, which runs your whole DAG. Right? But SQL mesh integrates directly with Airflow and creates individual DAGs for every model that you have, and so the benefits of that are pretty clear. Right? You can use the existing UI for Airflow to see exactly how your DAG flows.
You can take into consideration external signals and also trigger external signals at the individual model level. Right? And so, yes, there is some complexity here in that now SQL mesh integrates with Airflow, but that complexity is mostly abstracted away from the user. The user doesn't have to think about anything. They just write their models in SQL mesh, and then we do all the hard work of hitting airflow, creating all the DAGs, and orchestrating all of that.
[00:29:36] Unknown:
I'm a bit biased because of the fact that I am using Dagster for my own day to day work. And one of the concepts that they are investing heavily in is the idea of software-defined assets and being aware of what the individual assets are for each of the different processes that are executing. So in the DBT case, they will actually parse out the entire DAG of the project so that you can view each of the individual models, understand when it was last refreshed, etcetera. And I'm wondering if you can talk through what that integration might look like for SQL mesh and some of the mechanical elements that would need to be put in place for enabling that integration?
[00:30:18] Unknown:
We do plan to eventually integrate with Prefect and Dagster, but we chose to do Airflow first because it's kinda like the industry leader. And, yeah, we're big fans of the way Dagster kinda treats its assets, but I would say that tools like Dagster and Airflow are much lower level. Right? They are very good at orchestrating tasks, but they don't go too much deeper than that. So although something like Dagster might know when a particular task was executed, it wouldn't have the nuance of, this task has these intervals that have been run, as well as what that task actually means. Right? Because SQL mesh understands SQL and understands your models.
It fully understands, for this version of a model, what it means, right, and what all the downstream dependencies are. And if you change your model, how that would affect your downstream jobs. Right? And so I would say that, like, SQL mesh and Dagster are somewhat similar, but SQL mesh is, I guess, higher level in that it really understands tables. It really understands SQL and column-level lineage. So I'm not exactly sure how SQL mesh is going to take advantage of what Dagster has done, but there are definitely some similarities here.
[00:31:39] Unknown:
Given the fact that SQL mesh does handle in an automated fashion some of these semantic concerns and some of the cognitive complexity that goes with building out, in particular, larger projects of these transformation flows. So in being able to understand the column dependencies, being able to automatically map that column lineage at the project level without having to manually add specific signals for the tool to be able to do that job for you, I'm curious what you either have seen or predict as the second-order benefits of using something like SQL mesh for teams that are building out some of these complex data estates.
[00:32:26] Unknown:
Yeah. So I think it's just gonna really help productivity. Right? So let's say you're modifying a model downstream, and you accidentally refer to the wrong column name. Right? If you were to use, for example, DBT, you have to wait for your entire DAG to execute until you get to that node, and then there would be some runtime error. Right? Because SQL mesh has a full understanding of column-level lineage, right, as you plan, it's going to be like, hey, are you using this column? Are you sure that was correct?
Because we understand the whole flow of the data. That's one aspect of it. The other aspect of it, which I think is really important, is that SQL mesh was really designed to unify the different data stacks that data engineers and data analysts have. Right? At all the places that I've worked at, data engineers kind of have their own stack. Right? Maybe Scala, Airflow, etcetera, something a little bit more complex because it has to deal with scale and robustness. And the analysts have their own stack. Right? Something that's more geared towards usability, pure SQL, something that they're familiar with.
And so SQL mesh is intended to be easy for the analyst to use, but robust and scalable enough that a data engineer would use it in their production systems. And so the real benefit of something like SQL Mesh, as I see it, is that it would enable a company at any scale to really leverage the same stack, because not only is it easy enough for someone less technical to use, but it's also robust enough that it can handle any of the kind of data flowing in that a data engineer used to have to kind of do custom stuff for.
[00:34:21] Unknown:
Some other aspects of building a tool like this and working with it at the team and organizational level are the communication elements and documentation. I'm curious if you can talk to what SQL mesh has in that avenue as far as something analogous to the DBT docs capability, etcetera, and also some of the collaboration aspects between data engineers and data analysts and some of the ways that SQL mesh thinks about those challenges?
[00:34:52] Unknown:
Sure. So SQL mesh includes a free UI. Right? The UI includes not only an IDE, but it includes docs. And so the way SQL mesh does docs is we have a kind of very SQL-first format. Right? So you just write comments in your models, and then those comments get automatically put into a nice docs interface. And so we find that to be aesthetically pleasing because, you know, you don't need another YAML file or anything like that. We can automatically parse out the comments. We can parse out the column names. We can parse out the types and automatically create a nice UI for you, which is free and open source.
So in terms of collaboration, SQL mesh has a couple different avenues here as well. Right? Because our dev environments are so cheap and easy to make, anyone can kind of make a dev environment with their changes and point people to it. People can see exactly what your changes would entail. We also have an open source CI/CD bot. And so whenever you make a PR using that bot, the bot will automatically create a dev environment that has your particular changes, which means, furthermore, people can see exactly, okay, if we were to merge in their PR, this is the state of the warehouse with that change. So that further facilitates collaboration.
[00:36:21] Unknown:
And for people who, in particular, are moving from DBT into SQL mesh, but also people who maybe have just been managing a pile of SQL scripts and are moving into SQL mesh, I'm curious how that automatic dependency management at the column level and the collaborative elements influence the ways that they think about the structure of the project and the granularity of the transformations that they want to build?
[00:36:50] Unknown:
Well, if you already have kind of scripts or are using DBT, you've already kind of thought of these things. We have an interesting use case. We're working with this company with an existing multi-thousand-model DBT project. And what was really surprising to me was that, although we do have the DBT adapter, they really loved the simplicity and syntax of SQL mesh. And so they decided to just rewrite their entire DBT project into SQL mesh. And they said that it was actually quite easy to do. You know, it's mostly just copy and paste, deleting the refs. And from that, they were able to also take the time to refactor things a little bit and make things nicer. And so I was surprised that they went through that, but at the end of the day, they really thought it was worth it, and they enjoyed doing it. And now their project is much simpler. Right? There's no more refs.
Their project is simpler with our macro system, etcetera. It's checked at compile time. And so I was quite surprised to see something like that. I've also talked to one of our partners. They're a smaller company, and they already had a 40-model DBT project. And they also decided to rewrite the project instead of just using the adapter. And, you know, their reason was just like, okay, well, SQL mesh's simple syntax is just SQL. Right? There's no Jinja. And so what they really liked was that they were able to copy and paste, like, you know, just their SQL and their models into the BigQuery UI without having to remove or render their existing Jinja. So, I don't know, those are just kind of some of the interesting things that I've seen.
[00:38:37] Unknown:
Yeah. And I can definitely see that also simplifying the tooling around things like linting, where if you're using something like SQLFluff, it has to go through those extra hops of figuring out what is the Jinja, how much of that factors into the semantic structures and the code style that we want to enforce. So
[00:38:56] Unknown:
Right. Yeah. So on the topic of, like, SQLFluff, we just use SQLGlot for linting. So you can just run, like, SQL mesh format, and, I guess, our kind of format is a little bit different. We're a little bit more like Black, where we're very opinionated, and there's kind of one way of doing things. There's, like, some things you can tweak, like indentation, etcetera, but for the most part, it's just a very opinionated formatter.
[00:39:22] Unknown:
Another aspect of the tool that you're building because of it being open source as well as the fact that you are trying to build a business, I'm wondering if you can talk to the governance considerations and how you're thinking about the overall strategy of managing the open source project along with your business goals.
[00:39:40] Unknown:
So we're very early. We just started in September, and we're looking to really focus the next couple of months on making sure our existing clients and partners are successful. Right? And so we're really focused on making SQL Mesh more robust. And so a lot of our early efforts are going to be around making sure open source is a great product that's robust and production ready. Now in terms of our kind of long term goals of monetization, etcetera, we're also looking to have a cloud or on-prem offering. Right? Because SQL mesh manages a lot of state for you. For example, it knows every single version of every model you have. It knows, for every one of those, which intervals have been run. Right? We have to manage state. And so for the clients that don't particularly wanna manage that themselves, we're gonna have an offering where we have hosted or on-prem solutions where we'll manage that for you.
Additionally, we'll have other enterprise kind of features like cost estimation, right, knowing, for example, which models are expensive and how to optimize that, as well as our query proxy layer, which I can talk about more if you're interested in that.
[00:40:49] Unknown:
Yeah. Definitely be curious to hear what the query proxy is aimed at and some of the capabilities that that adds.
[00:40:56] Unknown:
Right. So SQL mesh, as we discussed, has this whole virtual environment layer. Right? It easily and efficiently allows you to create development environments that are full replicas of your warehouse without any of the cost. Right? It does it all very cheaply. And so the vision here is that, end to end, we want to be able to have any tool, even outside of SQL mesh, be able to query those virtual environments. The real world use case would be, like, an executive says, hey, I wanna look at this new metric. Like, what happens when I change the definition of daily active users? How does that affect all these downstream things? How does that affect my Tableau dashboards?
Right? In traditional systems, like, you're kinda out of luck. Either you have to test it in production, or you're gonna have to change all your Tableau dashboards, which is gonna take forever. Right? And so one of the features we're thinking about for enterprise is a query proxy layer. And so we would just have Tableau point to this proxy layer, which acts like a SQL server. And with a simple flag on that side, you can easily route all your existing Tableau queries into the right dev environment so that you can now, without touching Tableau, see what this new metric would look like without affecting production. So that's one idea that we have.
[00:42:15] Unknown:
And to the point of metrics and your earlier comment that this all started with wondering about building out a metrics layer solution, and the fact that DBT has acquired Transform and they are heavily investing in the concept of the metrics layer, I'm wondering how that manifests in the context of SQL mesh and some of the design and user experience considerations that you've had along that line.
[00:42:38] Unknown:
Our team has a lot of experience in metrics and experimentation. Right? We come from Airbnb, Apple, and Netflix, where we worked on some of these really large scale, advanced metric systems. And so it definitely is one of our ambitions to take our core foundational transformation layer and add a very powerful semantics layer. And I don't have a timeline on that; it's something that we're planning to do. But one thing I will say is that we really wanna do it in a way that is quite flexible. And so that doesn't mean just business metrics. Right? When we're doing metrics, we're also planning to do metrics that are suitable for experimentation, right, and suitable for machine learning features.
[00:43:28] Unknown:
In your work of building SQL mesh and working with some of the early adopters, what are the most interesting or innovative or unexpected ways that you've seen it applied?
[00:43:39] Unknown:
Yeah. So I think the most interesting thing is that there's kind of two ends of the spectrum of users that we've seen. Right? I kinda mentioned this before, but one of our clients is a huge power user of DBT. They have, you know, thousands of models. Right? And they were able to not only translate all of that to SQL mesh, but they also then hooked it up to a really advanced CI/CD system. Right? So they were really taking SQL mesh to the fullest. We weren't anticipating that there would be such a passionate user base already, and they're building so much really cool stuff with it, taking advantage of all the stuff that we have.
And on the other side of the spectrum, right, we're seeing small shops. Right? You would think that these companies are kind of DBT's bread and butter, but they love SQL mesh because of just how simple it is. Right? One of the things they told us was that SQL mesh is simply SQL. They love that. Right? They also loved how SQL mesh doesn't have layers and layers of config. Right? One of our customers, or our clients, said that reading the DBT docs was so confusing, and there's just so many nested layers of inheritance with the YAML files, they didn't wanna deal with that. They just wanted to write SQL.
And so we've kind of been surprised at these two ends of the spectrum of users.
[00:45:01] Unknown:
And in your own work on this project and operating in this space, what are the most interesting or unexpected or challenging lessons that you've learned in the process of building SQL
[00:45:11] Unknown:
Mesh? Yeah. So I think that I used to think that DBT was quite simple, and it is simple to use. But because we support DBT projects, right, we had to dig into the nitty gritty of how DBT is actually implemented. And, actually, under the hood, there's a lot of complexity in dealing with and managing nested layers of config and YAML files and a bunch of implicit rules that we needed to get right in order for your existing DBT project to work with SQL mesh. And so although, at face value, DBT seems quite easy to use, once you dig into the nuances of how, like, configurations and the YAMLs will interact with each other, it's actually quite complex. So that was a little surprising to me as we were developing SQL Mesh.
[00:46:02] Unknown:
And so for people who are interested in building out their data transformations in a primarily SQL environment, what are the cases where SQL mesh is the wrong choice?
[00:46:13] Unknown:
So SQL mesh is not the right choice if you have, like, a real time pipeline. So if you have CDC or you want something really low latency, right, SQL mesh is not the right choice. We have seen some users using SQL mesh Python models for ingest, but it's really done in a micro-batch fashion. So something like ingest every 5 or 10 minutes. But you shouldn't really be doing that at any high scale or where you're looking to have sub-second latency. It's not the right tool. You should use Flink or Fivetran for that.
[00:46:47] Unknown:
And as you continue to build out and iterate on SQL mesh and some of the other business goals for Tobiko Data, what are some of the things you have planned for the near to medium term or any particular projects or problem areas you're excited to dig into?
[00:47:02] Unknown:
Yeah. I touched on this briefly already, but our our immediate goal is just to really work with our existing clients and partners, making sure they're successful, making sure that SQL mesh is robust and enterprise ready. But we're also really excited about, changing the metric space. Right? Making a really robust and powerful semantics layer, as well as this kind of query proxy layer. So yeah.
[00:47:33] Unknown:
Are there any other aspects of SQL mesh in particular or the data ops space or the overall workflow of, transformations in a SQL context that we didn't discuss yet that you'd like to cover before we close out the show? I think we covered most of it. Okay. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:48:05] Unknown:
From my experience, the biggest gap in this whole data ecosystem is how fragmented it is, even within one company. I think different types of users, for example, data engineers, data analysts, data scientists, all have very different workflows, partially because they have very different skills. And I think that this heterogeneous environment makes it difficult for the business to be productive, and it makes it difficult to maintain. Right? A data analyst might be using some kind of SQL-based stack like DBT in their own black box. A data engineer might be writing some kind of Scala Airflow custom thing, and the data scientists might be using a Jupyter notebook.
And so these folks are all working towards the same goals, but they're all talking different languages. Right? It's very difficult to have a productive development environment with these three different stacks. And so that's kind of one of the motivating factors of what we're trying to do: trying to make data development more homogeneous across these three very different personas.
[00:49:14] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing on SQL Mesh. It's definitely a very interesting tool and one that I'm excited to start experimenting with and trying out for my own use cases. So I appreciate all the time and energy that you're putting into that. And I hope you enjoy the rest of your
[00:49:31] Unknown:
day. Thanks for having
[00:49:36] Unknown:
me.
[00:49:38] Unknown:
Thank you for listening. Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Overview of SQL Mesh
Guest Introduction: Toby Mao
What is SQL Mesh?
Understanding DataOps
Challenges with Existing Tool Chains
Nuances in Data Workflows
Mapping SQL Dialects
Python Capabilities in SQL Mesh
Building a Project with SQL Mesh
Lessons from DBT
Day 0 to Day N Workflow
Testing and Validation
Orchestration Experience
Second Order Benefits of SQL Mesh
Communication and Collaboration
Transitioning from DBT or SQL Scripts
Governance and Business Strategy
Query Proxy Layer
Metrics Layer and User Experience
Interesting Use Cases
Lessons Learned
When SQL Mesh is the Wrong Choice
Future Plans
Biggest Gap in Data Management Tools
Closing Remarks