Summary
Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
- Are you working on data, analytics, or AI using platforms such as Presto, Spark, or TensorFlow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and TensorFlow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now as early bird tickets are ending this week! Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Lewis Hemens about Dataform, a platform that helps analysts manage all data processes in your cloud data warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Dataform is and the origin story for the platform and company?
- What are the main benefits of using a tool like Dataform and who are the primary users?
- Can you talk through the workflow for someone using Dataform and highlight the main features that it provides?
- What are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?
- How does CI/CD and change management manifest in the context of data warehouse management?
- How is the Dataform SDK itself implemented and how has it evolved since you first began working on it?
- Can you differentiate the capabilities between the open source CLI and the hosted web platform, and when you might need to use one over the other?
- What was your selection process for an embedded runtime and how did you decide on JavaScript?
- Can you talk through some of the use cases that having an embedded runtime enables?
- What are the limitations of SQL when working in a collaborative environment?
- Which database engines do you support and how do you reduce the maintenance burden for supporting different dialects and capabilities?
- What is involved in adding support for a new backend?
- When is Dataform the wrong choice?
- What do you have planned for the future of Dataform?
Contact Info
- @lewishemens on Twitter
- lewish on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Dataform
- YCombinator
- DBT == Data Build Tool
- Fishtown Analytics
- Typescript
- Continuous Integration
- Continuous Delivery
- BigQuery
- Snowflake DB
- UDF == User Defined Function
- RedShift
- PostgreSQL
- AWS Athena
- Presto
- Apache Beam
- Apache Kafka
- Segment
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or you want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's l I n o d e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral. They provide an AWS native serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data.
He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. Are you working on data, analytics, or AI using platforms such as Presto, Spark, or TensorFlow? Check out the Data Orchestration Summit on November 7th at the Computer History Museum in Mountain View, California. This 1 day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and TensorFlow. And you will also hear from creators of open source projects, including Alluxio, Presto, Airflow, and Iceberg, many of whom you've heard on this show.
Use discount code podcast for 25% off of your ticket, and the first 5 people to register get free tickets. Register now as early bird tickets are ending this week. Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as Corinium Global Intelligence, Alluxio, and Data Council.
Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events and to take advantage of our partner discounts to save money when you register. Your host is Tobias Macey. And today, I'm interviewing Lewis Hemens about Dataform, a platform that helps analysts manage all data processes in your cloud data warehouse. So, Lewis, can you start by introducing yourself?
[00:03:24] Unknown:
Hi. I'm Lewis. I'm CTO and cofounder of Dataform. I previously worked as a software engineer at Google, where everything we did was about data. I was there for about 6 and a half years and left 2 years ago to start my own company, and we actually worked on a few different ideas before we came to Dataform. We took part in Y Combinator back in 2018. And we got funding for Dataform last year, and we're now a team of 7 or so. So we didn't actually start with data management when I left Google. We actually worked on some different products in the data space around automated data mining.
We quickly realized 1 of the biggest challenges our users were facing was around data management. Of course, we couldn't do any meaningful data mining if the input to that process was not clean, well prepared data. So this is kind of how we ended up where we are. And at Google, we were very spoiled with many great tools to spin up new pipelines, to do data modeling, and build scalable SQL data processes. We felt that this was a far more interesting problem to solve, so we made the move. And do you remember how you first got involved in the area of data management? Yes. So when we were working on our first idea after leaving Google, we found that the first thing we did with any customer we worked with was to go and write some SQL for them. We would start by modeling their data. We would pull a number of tables together, do joins, do data cleaning. And then we ended up stuck, because we needed a way to give this back to them so that they could deploy it, so to speak.
And this is something that most companies didn't really have set up, or particularly the start ups we were working with didn't really have any way to do this. So we came across DBT. We saw what they were working on, and we loved it. We felt that 1 of the biggest challenges to picking this kind of solution up was to package it in a way that it could be used by data analysts who are less comfortable with command line interfaces, writing code, Git, and so on. So, actually, our initial version of the product was a web based IDE for DBT projects, and we'd spoken with Tristan and Drew from Fishtown Analytics while we developed this.
A few months down the line, we decided to move onto our own framework instead of DBT, which we can get into a little bit, but that's kind of the origin story of Dataform.
[00:05:55] Unknown:
I'm curious what the motivation was for starting your own command line framework to replace dbt and what the sort of main differences are in terms of capabilities or workflow that you are building into your tool and how it compares to the overall workflow of DBT for people who might be familiar with that?
[00:06:15] Unknown:
Yeah. Absolutely. So Dataform is 2 things. We think about it as 2 things. So there is a data modeling framework, which is comparable to DBT. But I think the bigger part of what we've been developing, and where we think most of the value comes from, is that we have a web based development environment for Dataform projects where you can develop SQL queries. You can validate, preview, and test new models and changes, and keep everything tracked in a Git repository. So this was what our initial IDE was on top of DBT. The migration away from the DBT framework was not for any particularly major reason. And, you know, if I'm honest, our framework is actually very similar to DBT conceptually.
It just allowed us to move a little quicker and do a few things that we were struggling to do. So Dataform is designed for analysts and engineers, and this web based environment kind of gives us the best of both of those worlds, and allows teams to adopt those engineering best practices while maybe not having everybody on the team super comfortable with doing those things manually via the command line or via code.
[00:07:30] Unknown:
And I also noticed that the command line agent that you wrote is largely written in TypeScript, whereas DBT is primarily Python. I'm curious if the motivating factor for that is because you're using a decent amount of that logic in the web environment as well, to reduce the overall burden of development.
[00:07:50] Unknown:
Yes. So that is definitely a factor. There are a few factors in us going that way. Personally, I'm a big fan of typed languages, and that was something that drew us towards something like TypeScript. You're absolutely right. Being able to develop the framework and the back end and the front end all in the same language, using the same DTOs, is a huge benefit to us and helps us move quite quickly. Given that, I think we knew from day 1 that the most valuable part of this was really packaging this up into a web based environment.
[00:08:25] Unknown:
So can you talk through a bit of the workflow and life cycle of an analytic project that is using Dataform and how it allows for collaboration between data engineers and data analysts and what the sort of primary users are at each of the different stages?
[00:08:43] Unknown:
Yep. So we see the shift happening at the moment. I mean, I think, you know, dbt has driven this definitely, and engineers have been doing this for a long time. But to have everything version controlled and tested is extremely valuable, and that is where a lot of the core value from these software engineering best practices comes from. So looking at a Dataform project in the web UI, you typically start off by coming into a project and wanting to make a change, which might be publishing a new dataset or a table or a view or some other operation. You create a new branch, which can be done directly from the UI, and then you can start writing SQL in what looks like your standard code text editor.
The query that you write will be compiled and validated as you write it, so we'll actually test your query against your warehouse. And we provide a number of built in functions for building dependencies between different actions in your larger project. As you develop, you can test your query and preview the results. We allow you to run portions of your pipeline in isolated development schemas so that you're not overwriting production as you develop new changes. And then there's a few things you can do on top of that. So 1 of the core features, which seems to be a big selling point for data modeling frameworks in general, is the ability to write tests and assertions about your data.
These make sure that the query you've written actually does what you expect it to do. Maybe previously you would have just run the query, had a look at the results, and checked. But if you actually wanna put in a test that will fail in the future if somebody comes along and breaks it, then you can write an assertion. You can write documentation for datasets to describe the table as a whole or give column level descriptions, or you can override other settings like the schema or the name of the dataset, set tags, and set up schedules. So finally, when you're finished, you can take all of these changes that you've made, commit them directly from the web UI or from the command line interface if you're doing it manually, and push them to production and to your GitHub repo, and Dataform will pull those latest changes, and those will be deployed to production.
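To make that workflow a little more concrete, here is a minimal sketch of what a single dataset definition might look like in Dataform's SQLX format. The table and column names are hypothetical and the SQL is BigQuery flavored; the config and ref conventions follow Dataform's documentation as I understand it.

```
config {
  type: "table",          // materialize the result as a table
  schema: "analytics",    // output schema for this dataset
  description: "One row per order, cleaned and lightly typed."
}

SELECT
  order_id,
  customer_id,
  SAFE_CAST(order_total AS NUMERIC) AS order_total
-- ref() resolves the fully qualified name and also declares the dependency,
-- so Dataform knows raw_orders must exist (or be built) before this runs.
FROM ${ref("raw_orders")}
WHERE order_id IS NOT NULL
```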
[00:11:08] Unknown:
And you mentioned that a decent amount of the utility of a tool such as Dataform is that it helps to encourage and enforce best practices such as versioning and testing. And 1 of the things that I found interesting as I was going through the documentation is that the form of the tests is actually that you build a query that is intended to not return any results. So essentially the inverse of the intended query is how you actually generate the test, and if there are any results returned, that is when it fails. So I'm just curious if you can talk through some of the challenges, or how users have found that, and how the continuous integration and continuous delivery workflow manifests in the context of data warehouse management using tools like Dataform.
[00:12:05] Unknown:
Yeah. So the concept of an assertion is a bit jarring at first. In practice, it's simply a lot easier to write these kinds of checks to look for broken rows in your data than to look for the presence of something. This is a concept DBT had, and, actually, it was also a concept we had in similar platforms at Google. So I think that kind of learning hump can be gotten over quite quickly. When it comes to CI/CD, I guess we should break these out. Continuous integration is obviously part of that process that we talked about: when you're making a change to your data pipeline, I don't think we always think about it as software, but it is software. You wanna test that it actually works, and you wanna make sure that those assertions pass.
Doing CI in data environments can be tricky because data pipelines can take a very long time to run. It could be several hours before you get a result back for a particularly large project. So that makes CI tricky, and we can do some clever things to try and speed that up. For example, running your entire pipeline in a staging environment where only 1% of the data is processed, or where we don't materialize any of the results and simply create views. So there are a few things we can do there. When you actually build your Dataform project as well, there is a validation step going on as you write SQL and develop the project. So we will check for obvious errors such as circular dependencies, which we can catch before you commit and push those changes to production.
So this all kind of helps with CI in general. Continuous delivery is a lot easier in a data environment. We simply deploy these things to production when we're ready.
[00:13:58] Unknown:
And, yeah, that's pretty much it. And in terms of the speed aspect of that, I'm sure that having the tests structured in a way that they're intended to not match anything will help to speed up the overall execution. Whereas if you're trying to, as you said, assert the presence of certain records or certain data types, that means you're actually going to be doing a lot more processing in the database, potentially returning something quite large in the result set, and then having to do validation against that?
[00:14:29] Unknown:
So I think that's probably not true for most warehouses, although it's gonna depend on the warehouse. So the way we actually execute an assertion is we take the select query that you've given us, and we wrap it up in a select over your user generated code. So I think, actually, whether rows exist or not, because the data's never being put anywhere, the execution cost is the same. 1 of the goals that we had in designing the system is we don't want data to ever leave your warehouse. And I think that's something important to understand with Dataform. It's not processing your data itself. It's an orchestrator. It executes those queries for you. So when you write an assertion, we actually create a view in your data warehouse, which contains the query that you've provided us with, and then we query that.
1 of the benefits of this is if your assertion fails in production and you want to know why, you can go to that table and have a look, query it yourself, and try and work out what's gone on, without data ever actually leaking from your production data warehouse.
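As an illustration of that, an assertion is just a query written to return the rows that violate your expectations, so a passing check returns zero rows. A minimal sketch in SQLX, with hypothetical table and column names:

```
config { type: "assertion" }

-- This query should return no rows; any row it does return is a broken
-- record. Dataform materializes the query as a view, so a failing
-- assertion can be inspected directly in the warehouse.
SELECT order_id
FROM ${ref("orders")}
WHERE customer_id IS NULL
   OR order_total < 0
```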
[00:15:41] Unknown:
And then in terms of the engineering best practices that you're trying to enforce beyond just testing, I'm curious what you have found to be common challenges and mistakes among engineers and analysts in terms of versioning and evolving schemas and the accompanying data, as well as just general collaboration best practices to ensure that all the engineers and analysts are on the same page and able to work effectively?
[00:16:09] Unknown:
Yeah. So we see a number of issues come up often, typically when people have moved from a non data warehousing world, maybe working with a more traditional database like MySQL or SQL Server. I think a lot has changed about how to think about this stuff. This was definitely something that we were very used to at Google. Source data should generally be immutable. When you're reading data from logs, transaction logs or event logs from, you know, your website, that data should ideally never change.
And, similarly, when you're reading data from a database, you ideally want to read it in a way that you have a full history of everything that has changed. So when we come to transforming the data and producing derived datasets, which is what you're primarily doing with a product like Dataform, you wanna be able to build those derived datasets from scratch. If you lost them all, if you accidentally deleted all of those datasets, you want to be able to recreate that entire warehouse state again. And we often see that the first thing people do when they start using our product is to actually write something more like SQL scripts, which perform some sequence of operations.
So creating tables, inserting into those tables, deleting from those tables. This is kind of stateful, and it becomes very hard to reason about. And I believe Drew mentioned this on the DBT podcast where, ideally, you wanna be a bit more declarative. Every data search should ideally be a select. There are some situations where this doesn't work, GDPR probably being the main issue there. But the closer you can get to having your datasets being declared in that way and to not mutate, stay, or insert new rows, then the easier is to reason about them and make sure that your entire data pipeline is reproducible. So coming up to, yeah, some other things that we see, I think SQL adoption is growing, and I think the power of modern warehouses is enabling that.
We definitely see some people still kind of falling back to processing data with Python, NumPy, R. There's a problem with scalability here. They're simply never gonna be as fast as doing a comparable thing in SQL. They're usually limited to a single machine or by RAM, whereas you can comfortably process terabyte datasets with BigQuery or Snowflake. So that's a habit that I think we perhaps see. I know myself, as a software engineer, I wanna fall back to what I know and what I understand. But moving this stuff into SQL, using UDFs where you can, actually opens up a lot of possibilities.
When it comes to schema evolution, I think a lot of these challenges go away if you get these first things right. Having immutable source datasets, having reproducible transformations of those datasets, and having all of your code in 1 place makes it much easier to make mass changes across those systems, because you remove the state. It's much easier to change a dataset and all its dependencies in a single pass, and schema evolution becomes a bit less of an issue. At some point, scale does become an issue, and you can start looking at approaching those schema changes in a backwards compatible way. And this is similar to how you might approach this problem in software engineering.
For example, we have many different microservices. They might get released on different schedules, and you wanna make sure that they can continue to talk to each other as that software rolls out. So, for example, when removing a field from a table because it's no longer valid, you typically do that in a software engineering way by marking that field as deprecated, giving those consumers some amount of time to clean up their, well, in this case, SQL queries that use that data. And then when you know all consumers are no longer reading it, you can remove it safely. Tools like Dataform make this kind of action a bit more feasible to accomplish because you can see all of that SQL in 1 place. You can actually analyze and work out who's reading this field, who do I need to notify, and you know when all references to it have been removed.
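To illustrate the declarative style described here, with hypothetical table names: rather than a stateful script that creates, inserts into, and deletes from a table, the derived dataset is defined entirely as a select over immutable source data, so it can be rebuilt from scratch at any time.

```
-- Stateful style (hard to reason about; depends on what already exists):
--   CREATE TABLE analytics.daily_revenue (...);
--   INSERT INTO analytics.daily_revenue SELECT ...;
--   DELETE FROM analytics.daily_revenue WHERE ...;

-- Declarative style: the dataset is nothing more than this query over an
-- immutable event log, so it can always be rebuilt from scratch.
SELECT
  DATE(event_timestamp) AS day,
  SUM(amount) AS revenue
FROM analytics.raw_payment_events
GROUP BY 1
```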
[00:20:30] Unknown:
And digging into Dataform itself more, can you talk through a bit in terms of how it's implemented and some of the evolution that it's gone through since you first began working on that as a replacement for DBT?
[00:20:44] Unknown:
At a high level, the Dataform framework takes a Dataform project folder containing SQLX files, which is a format that we've created, as well as JavaScript files containing custom user code. And it outputs a big JSON object. That JSON object describes every SQL action, its dependencies, and other metadata. So unit tests, assertions, column descriptions, tags, those are all contained in this 1 data object, which is no longer user code. It is declarative. So we actually break the Dataform process down into a few different stages. The first stage is compilation, where we produce that first JSON graph.
The second step we call build, where we take that graph and we combine it with information from the data warehouse, which would be state such as what tables currently exist, whether they are tables or views, and what columns are in those tables. And we use that to generate an execution graph, which is a kind of final, ready to run list of every single SQL operation that we're gonna actually execute as part of the graph. So the difference between those 2 graphs is that 1 contains select statements, and the second 1 actually contains the create, delete, and alter table statements, the boilerplate that tools like Dataform avoid you having to write and think about as a user.
So, finally, when you have that graph, Dataform can execute it, running nodes in parallel and warning you if anything fails. It will run your assertions and tests and finally tell you what's happened. Via the command line, this is all pretty printed out. And via the UI, we keep track of those logs, and we send you emails and notifications and all that stuff when something goes wrong. In terms of the evolution, as I've mentioned, we're, of course, inspired very much by dbt. Some of our design goals were primarily to make it fast.
In order to make the IDE powerful and usable, we wanted real time compilation. So we wanted to make sure that Dataform projects could compile within a second, so that you could actually see what SQL you're generating before it got validated or run. That's something that we try to do, and possibly 1 of the reasons why we moved towards JavaScript. We also wanted to make sure that the output of that compilation process was machine interpretable and as close to pure SQL as possible. So the output of the Dataform compilation process actually is just a JSON object containing really nothing but actual SQL statements that can be run directly, either through Dataform or manually. There are a few step changes, I'd say, in the framework itself.
1 particularly interesting and important part was sandboxing. So when Dataform projects compile, we don't allow external calls to third party services. The code can't look at the data warehouse, and this is necessary if you wanna make it fast. If we allowed user code to make external API calls, then this could slow things down quite a lot. Beyond that, I'd say the 2 big changes we've added are support for unit tests for SQL queries, which is, again, 1 of those software engineering best practices that just hasn't really existed in the SQL world, and support for documenting datasets, which allows you to annotate fields and columns within your queries with descriptions or other metadata.
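As a rough sketch of the difference between those two graphs, with hypothetical names and BigQuery flavored statements (the exact boilerplate depends on the warehouse, the dataset type, and what already exists):

```
-- Compiled graph node: just the declarative select the user wrote.
SELECT customer_id, COUNT(*) AS order_count
FROM analytics.orders
GROUP BY 1;

-- Execution graph node: the generated statement Dataform actually runs.
CREATE OR REPLACE TABLE analytics.customer_order_counts AS
SELECT customer_id, COUNT(*) AS order_count
FROM analytics.orders
GROUP BY 1;
```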
[00:24:36] Unknown:
So it sounds like you've actually ended up building a SQL parser into the command line agent for being able to extract the data types and the dependency graph. I'm curious, what are some of the challenges that you encountered in the process of that? And were there any prebuilt libraries that you were able to lean on to handle some of that overall capability?
[00:25:03] Unknown:
So we don't currently parse SQL itself. 1 of the challenges there is the different dialects of SQL across different warehouse types. However, we do still have our own custom parser for these SQLX files. So an SQLX file is raw SQL, but it allows that SQL to be annotated with JavaScript or JSON-like blocks, which set configuration settings or allow the use of templates within that SQL. So you can reuse code and make your project more modular. We did use some libraries. I think we used new. We're trying to keep the footprint of the Dataform core library as small as possible.
1 of our engineers has experience doing this stuff and managed to whip something up in a few days, but it's certainly been challenging. And I think, as I mentioned before, the performance is very important, and trying to keep that lexing and parsing fast and under a second is something we've worked quite hard to do.
[00:26:07] Unknown:
And you mentioned too that in the SQLX files, you can embed some JavaScript logic for being able to build out macros or conditional templating. And I'm curious if you can talk through some of the use cases that that enables that would otherwise be fairly difficult if you were just relying on the pure SQL and some of the limitations of SQL itself when you're working in an environment where you're collaborating with other engineers and you might need to be able to compile multiple different fragments of SQL into an overarching query or table definition?
[00:26:45] Unknown:
Yeah. So SQL is not a programming language exactly. Some variants like T-SQL maybe are, but we can't really make that assumption. So as you mentioned, those conditional behaviors are hard to recreate, and it really limits what you can do. So to walk through an example: unit testing. In order to be able to unit test a SQL query, you need to have an understanding of its inputs, and you need to be able to declare its expected output. This isn't something that SQL can do on its own, and this is where I think a product like Dataform actually enables these use cases.
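To make that concrete, a hedged sketch of the shape a Dataform unit test takes: a test file names the dataset under test, mocks its inputs with fixed rows, and declares the expected output. The names here are hypothetical, and the exact block syntax should be checked against the Dataform documentation.

```
config {
  type: "test",
  dataset: "customer_order_counts"  // hypothetical dataset under test
}

input "orders" {
  -- Fixed rows that stand in for ref("orders") while the test runs.
  SELECT 1 AS customer_id UNION ALL
  SELECT 1 AS customer_id UNION ALL
  SELECT 2 AS customer_id
}

-- The body is the expected output given the mocked input above.
SELECT 1 AS customer_id, 2 AS order_count UNION ALL
SELECT 2 AS customer_id, 1 AS order_count
```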
Similarly, parametrization of queries for different environments. I mentioned this earlier. Running staging pipelines on 1% of your data, you can't do this with SQL directly. It doesn't understand what environment it's in. But with a framework like Dataform, we can wrap up that information and make it accessible to the user code. So as we create the SQL code, we can put those variables in. The reusable aspect is obviously important. Don't repeat yourself. If you are writing some actual code and you're copy pasting more than a handful of lines, you're probably gonna think twice about that and package it up.
SQL, again, doesn't really have many ways to do this. Some data warehouses do support things like stored SQL procedures, which can help, but these things are rarely version controlled. So being able to do that inside a framework like Dataform massively reduces the amount of duplicate code and makes larger refactorings easier. I can give you an example of 1 thing which 1 of our first customers did early on, which I think really highlights the power of this kind of thing. So I mentioned earlier, GDPR is annoying but important, because you need to remove data from your warehouse. You need to do something which is kind of stateful in a sense: execute delete operations and forget everything about a user. You might have 30 plus tables which contain user PII or identifiable information.
You're gonna have to go into those tables and execute a delete statement removing any rows which match user IDs from some deletion request table. So with Dataform, this template can be packaged up with a for loop and a list of the tables, and possibly the user ID field names. You can automatically generate those 30 queries that will go and run at the end of your daily schedule to remove all of those users' rows from those tables. So none of this stuff would be doable directly with SQL, and neither would any of the other things we've talked about: CI/CD, unit testing, code modularity.
These are all limitations of SQL itself.
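A hedged sketch of what that GDPR example might look like as a Dataform operations file, using an embedded JavaScript block and a loop to generate one delete statement per table. The table and column names are hypothetical, and the exact SQLX block syntax should be checked against the Dataform documentation.

```
config { type: "operations", tags: ["gdpr"] }

js {
  // Hypothetical list of datasets containing user identifiers; in a real
  // project this could live in an includes/ file shared across scripts.
  const userTables = ["orders", "sessions", "support_tickets"];
}

${userTables.map(table => `
DELETE FROM ${ref(table)}
WHERE user_id IN (SELECT user_id FROM ${ref("deletion_requests")});`).join("\n")}
```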
[00:29:59] Unknown:
And then another thing that you mentioned earlier is the different dialects for the different data warehouses. And I'm curious how you have approached encapsulating the common core capabilities and then also allowing for taking advantage of the differentiating factors between those different warehouses and what's involved in adding support for a new back end?
[00:30:27] Unknown:
Yep. So, currently, we have support for BigQuery, Redshift, Snowflake, Postgres, and Azure data warehouses. It certainly is a cost to us to maintain all of these. Thankfully, they do share a lot of similarities. And, you know, if you look at Redshift and Postgres particularly, they are very similar in terms of how they behave. Snowflake is fairly similar. BigQuery is probably the odd 1 out here in that it has its own way of doing things and definitely causes us a bit of a headache. I think 1 of our decisions in how we interface with these warehouses is to treat them in the most simple way we possibly can, which is that they are something we send SQL strings to and get responses back from. And internally, that is pretty much what our database adapter looks like.
If you want to support a new warehouse, you need to tell it a little bit about the SQL dialect, and most of these derive from the same kind of core standard SQL dialects that we have inside our repository. And you need to tell it how to connect to that warehouse and how to execute statements. And that's really it. That's kind of the fundamental piece. It's quite low level. It doesn't know much more. We do have some APIs that need to be implemented, such as listing tables, validating statements if you want validation in the actual Dataform web UI, listing schemas, and fetching table metadata such as the schema of a table.
There are some headaches there dealing with different data types in different databases, and some often tricky issues to debug around connection pooling. But thanks to keeping that interface quite simple, this hasn't been a huge burden for us so far. I suspect some of the other warehouses, such as Athena and Presto, will make this a little trickier, where dealing with data on S3, for example, you maybe lose this concept of a schema and table relation identifier.
[00:32:45] Unknown:
And in terms of the SQL dialect support, do you have any validation in terms of the compilation step to verify that a particular statement is going to comply with the dialect of the engine that it's going to be running against?
[00:33:01] Unknown:
Yes. So we've thought a little bit about this. There are some libraries you can use to kinda do this. At the moment, we don't do that automatically. Compilation, as I mentioned, we wanna keep really fast. We will do validation in the UI after we successfully compile, and we can cache some of that stuff where it makes sense. But we generally rely on asking the warehouse itself to validate a query for us. And this is particularly important for something like BigQuery where, you know, they've perhaps moved a bit away from your standard dialect, and the source of truth is ultimately asking the BigQuery API itself whether this is a valid query.
1 of the things that also makes this tricky is understanding if the tables that you're querying from even exist, and it's useful for us to be able to surface that kind of information.
[00:33:54] Unknown:
And then as you've mentioned, there is the web UI, and there's also the command line interface that has the core functionality. I'm wondering if you can talk a bit about the differences in capabilities and when somebody might want to reach for the web UI versus the command line.
[00:34:11] Unknown:
So the Dataform web platform is really geared towards enabling collaboration across an entire data team. It is an IDE for Dataform projects. You can develop new models. You can develop SQL queries, get real time compilation, and preview results. It integrates directly with Git so you can create branches, make commits, and develop and test changes in an isolated environment, so you're not overwriting your production data every time somebody makes a change. It also wraps up some of the deployment pieces, so continuous delivery, scheduling, and logs, email notifications, all the kind of stuff you'd expect with a scheduler type system.
So I think the command line interface is useful for a number of things still. Because every project in Dataform is usually backed by a GitHub or GitLab repository, it's kind of up to individuals how much they want to use the IDE or not. And we see teams where maybe the software engineers or the data engineers are checking out that code themselves, making edits in their preferred IDE, and using the command line interface to test those changes. Whereas, perhaps, analysts are less comfortable doing that and would rather work in the web IDE itself, where they get a bit more real time feedback as they develop queries.
So I think you can kind of use both together, but the parts of the web platform which we think are particularly useful are the collaborative pieces. And that's probably what we've spent the majority of our time on, to make sure that analysts have a place to come and contribute to these projects without having to do things that they maybe feel less comfortable with.
[00:36:05] Unknown:
And what are the cases where Dataform is the wrong choice for a given project?
[00:36:12] Unknown:
Yes. So I guess we talked a bit about how Dataform compiles projects. The only time user code gets executed in a Dataform project is during compilation and the generation of SQL queries, which means at runtime, we can't actually execute user code. So you can't pull from a third party API, for example. For something like that, you still need to fall back to a system perhaps like Airflow, where you can actually run user code at runtime. I'd say the other situation where Dataform really doesn't make sense is real time and streaming. I think that there are a few groups working on making streaming SQL dialects a real thing.
And I know these are already being developed for Apache Beam and for Kafka, and this is something that I think we want to think about 1 day. But right now, if you need extremely low latency, under, say, 5 or 10 minutes, which is probably the maximum you'd feasibly want to run a batch system on, then you'd want to look at doing something a little bit more custom. I think in practice, for the kind of use cases that we see people applying Dataform to, this isn't a problem. It's rare that you need sub 5 minute latency. There are features like incremental table builds, which allow you to kind of do micro batching in a sense.
But for anything faster than that, like real time dashboards, Dataform's probably not the right choice.
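For reference, the incremental table builds mentioned here look roughly like the following in SQLX: the first run materializes the whole select, and later runs only insert rows newer than what the table already contains. The names are hypothetical, and the when/incremental/self helpers follow Dataform's documented API as best I recall.

```
config { type: "incremental" }

SELECT
  event_id,
  user_id,
  event_timestamp
FROM ${ref("raw_events")}
${when(incremental(),
  `WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM ${self()})`)}
```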
[00:37:51] Unknown:
And then in terms of your experience of building and running the Dataform project and business, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned?
[00:38:03] Unknown:
Yeah. So Dataform is a complex product, and it's a complex idea. And I'd say we have a significant challenge trying to communicate what it is to our users, in terms of the value proposition, but also in terms of getting them started. So something that we've been investing a lot in, and are continuing to invest in, is that initial onboarding experience when you come and sign up. Perhaps originally, you were just dropped in an IDE with an empty SQL editor and kind of expected to know what to do. And in practice, that just doesn't work. So we're having to invest a lot more than I think we ever imagined in making that process smooth and clear to new users, and we still do a lot of manual handholding in that process.
As a company, I'm not sure if there were many surprising things. Starting a start up is hard. Fundraising is hard. Building a team is hard, but also rewarding. And I think the team is thoroughly enjoying the problem that we're solving right now.
[00:39:19] Unknown:
And 1 thing that I'm not sure if we mentioned explicitly is that the command line piece of Dataform is open source. And so I'm wondering what your governance policy is around that, and how you're approaching the sustainability of that core framework as it relates to the business and the web application that you're building on top of it?
[00:39:40] Unknown:
Yeah. We think a lot about, you know, where the line to draw there is, and I think it's a tough balance to get right. Obviously, we do have a lot of features of the product which are closed source. This has happened already, where we've had people not from Dataform come in and make a change to the open source repo. We wanna make sure that they can add the features that they want, or support for their data warehouse, without us having to support a number of changes in our proprietary systems too. So that gives flexibility for people to come and change the very core part of the product, while we can still maintain control of the web platform, and they can still use that for the new features that they develop.
[00:40:31] Unknown:
And looking forward, what do you have planned for the near to medium term of Dataform, both on the technical and business side as far as new capabilities or improvements?
[00:40:42] Unknown:
Yeah. So a lot of users spend a lot of time in our app, as it is their primary place for development of queries and SQL. So we're doubling down on that. I think in terms of the features we have, you know, we're really supporting that core use case right now. And I'd say over the last 6 to 9 months, we've managed to get out most of the key blocking features for new users. I think right now, we wanna make sure that developing SQL and developing Dataform projects in the web UI is a really pleasant experience that is fast and snappy, getting rid of those edge case bugs, of which there is a never ending stream.
I'd say beyond that, supporting massive data teams, this is really challenging. We have some big clients using our software with, you know, 200, 300 data models, but we are also speaking to clients with many, many more. And when you start looking at data processes in teams of analysts which may be a 100 or more, managing thousands of SQL queries and extremely complex data pipelines, I think there are a lot of challenges that still need to be solved there. And we're trying to think a lot about how we deal with that. Some of these are around maybe more enterprise things: fine grained access control, data ownership, data governance, column level lineage.
This is all extremely important at that level. In terms of more immediate features, we are currently building out support for release processes and multiple deployment environments, so you can have staging and canary environments. We're looking at 1st class support for UDFs. I mentioned UDFs before, and I think this enables a lot of really interesting use cases in modern data warehouses, as well as support for Presto and Athena and integration with some other data workflow tools such as Airflow. In terms of the maybe slightly longer term, I think we'd like to be able to support multiple data warehouses with a single Dataform project.
I mentioned the streaming thing too. That's something that's always on the back of our mind. And we're also investing in a number of packages to kind of make common SQL processes a little easier to do without having to reinvent the wheel every time. So for example, analyzing AB experiments, managing access control, or utilities to simplify writing data tests. We think we can save a lot of people a lot of repeat work by packaging some of these up and making them available to the community.
[00:43:24] Unknown:
Yeah. Having the prepackaged data tests, I'm sure, would be quite interesting and useful, especially for cases where you're providing validation for some common data sources where somebody might be pulling from Google Analytics or maybe Twitter or LinkedIn for being able to ensure that there's an appropriate schema or that you have some sort of common needs that are able to be encapsulated in those prebaked tests?
[00:43:52] Unknown:
Yeah. So a good example of this is Segment. You know, I think a lot of our users use Segment, and they often have the same set of data models that they are developing to build on top of the core Segment datasets, like sessionized tracks and page views. And this is quite hairy code, and if we can provide a kind of ready to go, tested package so that others can benefit from that, then I think that really builds up the ecosystem and the framework, and gives you a lot more value without having to write a lot of SQL yourself.
[00:44:28] Unknown:
Are there any other aspects of Dataform and data warehouse management in the context of data engineering and analytics that we didn't discuss yet that you'd like to cover before we close out the show? No. I think that's great. There's nothing else. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:44:57] Unknown:
So I'd say, I think we're bringing a lot of best practices and deployment processes to SQL for those core data models. I think 1 of the areas where a lot of what we're doing with Dataform doesn't apply is in machine learning, where you still require slightly more custom approaches to building and deploying models. And we've definitely spoken to a lot of people who are having to invest heavily in how they develop and deploy machine learning models in a kind of reproducible way, again, following those software engineering best practices. So this is definitely an area that I kind of expect or hope to see a lot of innovation in over the next few
[00:45:42] Unknown:
years. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Dataform. It's definitely interesting to see that there's more innovation going on in this space of ensuring repeatable processes and best practices for engineering in the context of data warehouses and leaning on SQL capabilities. So I'm excited to see where you take the business and where the overall space ends up. So thank you for all of your efforts on that, and I hope you enjoy the rest of your day. Great. Thank you
[00:46:15] Unknown:
very much.
[00:46:19] Unknown:
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Lewis Hemens and Dataform
Motivation and Differences from DBT
Workflow and Collaboration in Dataform
Assertions and Continuous Integration
Implementation and Evolution of Dataform
SQL Dialects and Warehouse Support
When Dataform is Not the Right Choice
Challenges and Lessons Learned
Future Plans for Dataform
Biggest Gap in Data Management Tooling