Summary
Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
- Are you working on data, analytics, or AI using platforms such as Presto, Spark, or TensorFlow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and TensorFlow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now as early bird tickets are ending this week! Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
- Your host is Tobias Macey and today I’m interviewing Lewis Hemens about Dataform, a platform that helps analysts manage all data processes in your cloud data warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Dataform is and the origin story for the platform and company?
- What are the main benefits of using a tool like Dataform and who are the primary users?
- Can you talk through the workflow for someone using Dataform and highlight the main features that it provides?
- What are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?
- How does CI/CD and change management manifest in the context of data warehouse management?
- How is the Dataform SDK itself implemented and how has it evolved since you first began working on it?
- Can you differentiate the capabilities between the open source CLI and the hosted web platform, and when you might need to use one over the other?
- What was your selection process for an embedded runtime and how did you decide on JavaScript?
- Can you talk through some of the use cases that having an embedded runtime enables?
- What are the limitations of SQL when working in a collaborative environment?
- Which database engines do you support and how do you reduce the maintenance burden for supporting different dialects and capabilities?
- What is involved in adding support for a new backend?
- When is Dataform the wrong choice?
- What do you have planned for the future of Dataform?
Contact Info
- @lewishemens on Twitter
- lewish on GitHub
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Dataform
- YCombinator
- DBT == Data Build Tool
- Fishtown Analytics
- Typescript
- Continuous Integration
- Continuous Delivery
- BigQuery
- Snowflake DB
- UDF == User Defined Function
- RedShift
- PostgreSQL
- AWS Athena
- Presto
- Apache Beam
- Apache Kafka
- Segment
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or you want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances.
Go to dataengineeringpodcast.com/linode, that's l I n o d e, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of this show. This week's episode is also sponsored by Datacoral. They provide an AWS native serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any of their own infrastructure. Datacoral's customers report that their data engineers are able to spend 80% of their work time invested in data transformations rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo and Facebook, scaling from mere terabytes to petabytes of analytic data.
He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more. Are you working on data, analytics, or AI using platforms such as Presto, Spark, or TensorFlow? Check out the Data Orchestration Summit on November 7th at the Computer History Museum in Mountain View, California. This 1 day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and TensorFlow. And you will also hear from creators of open source projects, including Alluxio, Presto, Airflow, and Iceberg, many of whom you've heard on this show.
Use discount code podcast for 25% off of your ticket, and the first 5 people to register get free tickets. Register now as early bird tickets are ending this week. Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro. You listen to this show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss out on this year's conference season. We have partnered with organizations such as Corinium Global Intelligence, Alluxio, and Data Council.
Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in New York City. Go to dataengineeringpodcast.com/conferences today to learn more about these and other events and to take advantage of our partner discounts to save money when you register. Your host is Tobias Macey. And today, I'm interviewing Lewis Hemens about Dataform, a platform that helps analysts manage all data processes in your cloud data warehouse. So, Lewis, can you start by introducing yourself?
[00:03:24] Unknown:
Hi. I'm Lewis. I'm CTO and cofounder of Dataform. I previously worked as a software engineer at Google, where everything we did was about data. I was there for about 6 and a half years and left 2 years ago to start my own company, and we actually worked on a few different ideas before we came to Dataform. We took part in Y Combinator back in 2018. And we got funding for Dataform last year, and we're now a team of 7 or so. So we didn't actually start with data management when I left Google. We actually worked on some different products in the data space around automated data mining.
We quickly realized 1 of the biggest challenges our users were facing was around data management. Of course, we couldn't do any meaningful data mining if the input to that process was not clean, well prepared data. So this is kind of how we ended up where we are. And at Google, we were very spoiled with many great tools to spin up new pipelines, to do data modeling, and build scalable SQL data processes. We felt that this was a far more interesting problem to solve, so we made the move. And do you remember how you first got involved in the area of data management? Yes. So when we were working on our first idea after leaving Google, we found that the first thing we did with any customer we worked with was to go and write some SQL for them. We would start by modeling their data. We would pull a number of tables together, do joins, do data cleaning. And then we ended up stuck, because we needed a way to give this back to them so that they could deploy it, so to speak.
And this is something that most companies didn't really have set up, or particularly the start ups we were working with didn't really have any way to do this. So we came across DBT. We saw what they were working on, and we loved it. We felt that 1 of the biggest challenges to picking this kind of solution up was to package it in a way that it could be used by data analysts who are less comfortable with command line interfaces, writing code, Git, and so on. So, actually, our initial version of the product was a web based IDE for DBT projects, and we'd spoken with Tristan and Drew from Fishtown Analytics while we developed this.
A few months down the line, we decided to move onto our own framework instead of DBT, which we can get into a little bit, but that's kind of the origin story of Dataform.
[00:05:55] Unknown:
I'm curious what the motivation was for starting your own command line framework to replace dbt and what the sort of main differences are in terms of capabilities or workflow that you are building into your tool and how it compares to the overall workflow of DBT for people who might be familiar with that?
[00:06:15] Unknown:
Yeah. Absolutely. So Dataform is 2 things. We think about it as 2 things. So there is a data modeling framework, which is comparable to DBT. But I think the bigger part of what we've been developing, and where we think most of the value comes from, is that we have a web based development environment for Dataform projects where you can develop SQL queries. You can validate, preview, and test new models and changes, and keep everything tracked in a Git repository. So this was what our initial IDE was on top of DBT. The migration away from the DBT framework was not for any particularly major reason. And, you know, if I'm honest, our framework is actually very similar to DBT conceptually.
It just allowed us to move a little quicker and do a few things that we were struggling to do. So Dataform is designed for analysts and engineers, and this web based environment kind of gives us the best of both of those worlds, and allows teams to adopt those engineering best practices while maybe not having everybody on the team super comfortable with doing those things manually via the command line or via code.
[00:07:30] Unknown:
And I also noticed that the command line agent that you wrote is largely written in TypeScript, whereas DBT is primarily Python. I'm curious if the motivating factor for that is because you're using a decent amount of that logic in the web environment as well, to reduce the overall burden of development.
[00:07:50] Unknown:
Yes. So that is definitely a factor. There are a few factors in us going that way. Personally, I'm a big fan of typed languages, and that was something that drew us towards something like TypeScript. You're absolutely right. Being able to develop the framework and the back end and the front end all in the same language, using the same DTOs, is a huge benefit to us and helps us move quite quickly. Given that, I think we knew from day 1 that the most valuable part of this was really packaging this up into a web based environment.
[00:08:25] Unknown:
So can you talk through a bit of the workflow and life cycle of an analytic project that is using Dataform and how it allows for collaboration between data engineers and data analysts and what the sort of primary users are at each of the different stages?
[00:08:43] Unknown:
Yep. So we see the shift happening at the moment. I mean, I think, you know, dbt has driven this definitely, and engineers have been doing this for a long time. But to have everything version controlled and tested is extremely valuable, and that is where a lot of the core value from these software engineering best practices comes from. So looking at a Dataform project in the web UI, you typically start off by coming into a project and wanting to make a change, which might be publishing a new dataset or a table or a view or some other operation. You create a new branch, which can be done directly from the UI, and then you can start writing SQL in what looks like your standard code text editor.
The query that you write will be compiled and validated as you write it, so we'll actually test your query against your warehouse. And we provide a number of built in functions for building dependencies between different actions in your larger project. As you develop, you can test your query and preview the results. We allow you to run portions of your pipeline in isolated development schemas so that you're not overwriting production as you develop new changes. And then there's a few things you can do on top of that. So 1 of the core features, which seems to be a big selling point for data modeling frameworks in general, is the ability to write tests and assertions about your data.
These make sure that the query you've written actually does what you expect it to do. Maybe previously you would have just run the query, had a look at the results, and checked. But if you actually wanna put in a test that will fail in the future if somebody comes along and breaks it, then you can write an assertion. You can write documentation for datasets to describe the table as a whole or give column level descriptions, or you can override other settings like the schema or the name of the dataset, set tags, and set up schedules. So finally, when you're finished, you can take all of these changes that you've made, commit them directly from the web UI or from the command line interface if you're doing it manually, and push them to production and to your GitHub repo, and Dataform will pull those latest changes, and those will be deployed to production.
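To make that workflow a little more concrete, here is a minimal sketch of what a single dataset definition might look like in Dataform's SQLX format. The table and column names are hypothetical and the SQL is BigQuery flavored; the config and ref conventions follow Dataform's documentation as I understand it.

```
config {
  type: "table",          // materialize the result as a table
  schema: "analytics",    // output schema for this dataset
  description: "One row per order, cleaned and lightly typed."
}

SELECT
  order_id,
  customer_id,
  SAFE_CAST(order_total AS NUMERIC) AS order_total
-- ref() resolves the fully qualified name and also declares the dependency,
-- so Dataform knows raw_orders must exist (or be built) before this runs.
FROM ${ref("raw_orders")}
WHERE order_id IS NOT NULL
```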
[00:11:08] Unknown:
And you mentioned that a decent amount of the utility of a tool such as Dataform is that it helps to encourage and enforce best practices such as versioning and testing. And 1 of the things that I found interesting as I was going through the documentation is that the form of the tests is actually that you build a query that is intended to not return any results. So essentially the inverse of the intended query is how you actually generate the test, and if there are any results returned, that is when it fails. So I'm just curious if you can talk through some of the challenges, or how users have found that, and how the continuous integration and continuous delivery workflow manifests in the context of data warehouse management using tools like Dataform.
[00:12:05] Unknown:
Yeah. So the concept of an assertion is a bit jarring at first. In practice, it's simply a lot easier to write these kinds of checks to look for broken rows in your data than to look for the presence of something. This is a concept DBT had, and, actually, it was also a concept we had in similar platforms at Google. So I think that kind of learning hump can be gotten over quite quickly. When it comes to CI/CD, I guess we should break these out. Continuous integration is obviously part of that process that we talked about: when you're making a change to your data pipeline, I don't think we always think about it as software, but it is software. You wanna test that it actually works, and you wanna make sure that those assertions pass.
Doing CI in data environments can be tricky because data pipelines can take a very long time to run. It could be several hours before you get a result back for a particularly large project. So that makes CI tricky, and we can do some clever things to try and speed that up. For example, running your entire pipeline in a staging environment where only 1% of the data is processed, or where we don't materialize any of the results and simply create views. So there are a few things we can do there. When you actually build your Dataform project as well, there is a validation step going on as you write SQL and develop the project. So we will check for obvious errors such as circular dependencies, which we can catch before you commit and push those changes to production.
So this all kind of helps with CI in general. Continuous delivery is a lot easier in a data environment. We simply deploy these things to production when we're ready.
[00:13:58] Unknown:
And, yeah, that's pretty much it. And in terms of the speed aspect of that, I'm sure that having the tests structured in a way that they're intended to not match anything will help to speed up the overall execution. Whereas if you're trying to, as you said, assert the presence of certain records or certain data types, that means you're actually going to be doing a lot more processing in the database, potentially returning something quite large in the result set, and then having to do validation against that?
[00:14:29] Unknown:
So I think that's probably not true for most warehouses, although it's gonna depend on the warehouse. So the way we actually execute an assertion is we take the select query that you've given us, and we wrap it up in a select over your user generated code. So I think, actually, whether rows exist or not, because the data's never being put anywhere, the execution cost is the same. 1 of the goals that we had in designing the system is we don't want data to ever leave your warehouse. And I think that's something important to understand with Dataform. It's not processing your data itself. It's an orchestrator. It executes those queries for you. So when you write an assertion, we actually create a view in your data warehouse, which contains the query that you've provided us with, and then we query that.
1 of the benefits of this is if your assertion fails in production and you want to know why, you can go to that table and have a look, query it yourself, and try and work out what's gone on, without data ever actually leaking from your production data warehouse.
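As an illustration of that, an assertion is just a query written to return the rows that violate your expectations, so a passing check returns zero rows. A minimal sketch in SQLX, with hypothetical table and column names:

```
config { type: "assertion" }

-- This query should return no rows; any row it does return is a broken
-- record. Dataform materializes the query as a view, so a failing
-- assertion can be inspected directly in the warehouse.
SELECT order_id
FROM ${ref("orders")}
WHERE customer_id IS NULL
   OR order_total < 0
```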
[00:15:41] Unknown:
And then in terms of the engineering best practices that you're trying to enforce beyond just testing, I'm curious what you have found to be common challenges and mistakes among engineers and analysts in terms of versioning and evolving schemas and the accompanying data, as well as just general collaboration best practices to ensure that all the engineers and analysts are on the same page and able to work effectively?
[00:16:09] Unknown:
Yeah. So we see a number of issues come up often, typically when people have moved from a non data warehousing world, maybe working with a more traditional database like MySQL or SQL Server. I think a lot has changed about how to think about this stuff. This was definitely something that we were very used to at Google. Source data should generally be immutable. When you're reading data from logs, transaction logs or event logs from, you know, your website, that data should ideally never change.
And, similarly, when you're reading data from a database, you ideally want to read it in a way that you have a full history of everything that has changed. So when we come to transforming the data and producing derived datasets, which is what you're primarily doing with a product like Dataform, you wanna be able to build those derived datasets from scratch. If you lost them all, if you accidentally deleted all of those datasets, you want to be able to recreate that entire warehouse state again. And we often see that the first thing people do when they start using our product is to actually write something more like SQL scripts, which perform some sequence of operations.
So creating tables, inserting into those tables, deleting from those tables. This is kind of stateful, and it becomes very hard to reason about. And I believe Drew mentioned this on the DBT podcast where, ideally, you wanna be a bit more declarative. Every data search should ideally be a select. There are some situations where this doesn't work, GDPR probably being the main issue there. But the closer you can get to having your datasets being declared in that way and to not mutate, stay, or insert new rows, then the easier is to reason about them and make sure that your entire data pipeline is reproducible. So coming up to, yeah, some other things that we see, I think SQL adoption is growing, and I think the power of modern warehouses is enabling that.
We definitely see some people still kind of falling back to processing data with Python, NumPy, R. There's a problem with scalability here. They're simply never gonna be as fast as doing a comparable thing in SQL. They're usually limited to a single machine or by RAM, whereas you can comfortably process terabyte datasets with BigQuery or Snowflake. So that's a habit that I think we perhaps see. I know myself, as a software engineer, I wanna fall back to what I know and what I understand. But moving this stuff into SQL, using UDFs where you can, actually opens up a lot of possibilities.
When it comes to schema evolution, I think a lot of these challenges go away if you get these first things right. Having immutable source datasets, having reproducible transformations of those datasets, and having all of your code in 1 place makes it much easier to make mass changes across those systems, because you remove the state. It's much easier to change a dataset and all its dependencies in a single pass, and schema evolution becomes a bit less of an issue. At some point, scale does become an issue, and you can start looking at approaching those schema changes in a backwards compatible way. And this is similar to how you might approach this problem in software engineering.
For example, we have many different microservices. They might get released on different schedules, and you wanna make sure that they can continue to talk to each other as that software rolls out. So, for example, when removing a field from a table because it's no longer valid, you typically do that in a software engineering way by marking that field as deprecated, giving those consumers some amount of time to clean up their, well, in this case, SQL queries that use that data. And then when you know all consumers are no longer reading it, you can remove it safely. Tools like Dataform make this kind of action a bit more feasible to accomplish because you can see all of that SQL in 1 place. You can actually analyze and work out who's reading this field, who do I need to notify, and you know when all references to it have been removed.
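To illustrate the declarative style described here, with hypothetical table names: rather than a stateful script that creates, inserts into, and deletes from a table, the derived dataset is defined entirely as a select over immutable source data, so it can be rebuilt from scratch at any time.

```
-- Stateful style (hard to reason about; depends on what already exists):
--   CREATE TABLE analytics.daily_revenue (...);
--   INSERT INTO analytics.daily_revenue SELECT ...;
--   DELETE FROM analytics.daily_revenue WHERE ...;

-- Declarative style: the dataset is nothing more than this query over an
-- immutable event log, so it can always be rebuilt from scratch.
SELECT
  DATE(event_timestamp) AS day,
  SUM(amount) AS revenue
FROM analytics.raw_payment_events
GROUP BY 1
```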
[00:20:30] Unknown:
And digging into Dataform itself more, can you talk through a bit in terms of how it's implemented and some of the evolution that it's gone through since you first began working on that as a replacement for DBT?
[00:20:44] Unknown:
At a high level, the Dataform framework takes a Dataform project folder containing SQLX files, which is a format that we've created, as well as JavaScript files containing custom user code. And it outputs a big JSON object. That JSON object describes every SQL action, its dependencies, and other metadata. So unit tests, assertions, column descriptions, tags, those are all contained in this 1 data object, which is no longer user code. It is declarative. So we actually break the Dataform process down into a few different stages. The first stage is compilation, where we produce that first JSON graph.
The second step we call build, where we take that graph and we combine it with information from the data warehouse, which would be state such as what tables currently exist, whether they are tables or views, and what columns are in those tables. And we use that to generate an execution graph, which is a kind of final, ready to run list of every single SQL operation that we're gonna actually execute as part of the graph. So the difference between those 2 graphs is that 1 contains select statements, and the second 1 actually contains the create, delete, and alter table statements, the boilerplate that tools like Dataform avoid you having to write and think about as a user.
So, finally, when you have that graph, Dataform can execute it, running nodes in parallel and warning you if anything fails. It will run your assertions and tests and finally tell you what's happened. Via the command line, this is all pretty printed out. And via the UI, we keep track of those logs, and we send you emails and notifications and all that stuff when something goes wrong. In terms of the evolution, as I've mentioned, we're, of course, inspired very much by dbt. Some of our design goals were primarily to make it fast.
In order to make the IDE powerful and usable, we wanted real time compilation. So we wanted to make sure that Dataform projects could compile within a second, so that you could actually see what SQL you're generating before it got validated or run. That's something that we try to do, and possibly 1 of the reasons why we moved towards JavaScript. We also wanted to make sure that the output of that compilation process was machine interpretable and as close to pure SQL as possible. So the output of the Dataform compilation process actually is just a JSON object containing really nothing but actual SQL statements that can be run directly, either through Dataform or manually. There are a few step changes, I'd say, in the framework itself.
1 particularly interesting and important part was sandboxing. So when Dataform projects compile, we don't allow external calls to third party services. The code can't look at the data warehouse, and this is necessary if you wanna make it fast. If we allowed user code to make external API calls, then this could slow things down quite a lot. Beyond that, I'd say the 2 big changes we've added are support for unit tests for SQL queries, which is, again, 1 of those software engineering best practices that just hasn't really existed in the SQL world, and support for documenting datasets, which allows you to annotate fields and columns within your queries with descriptions or other metadata.
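As a rough sketch of the difference between those two graphs, with hypothetical names and BigQuery flavored statements (the exact boilerplate depends on the warehouse, the dataset type, and what already exists):

```
-- Compiled graph node: just the declarative select the user wrote.
SELECT customer_id, COUNT(*) AS order_count
FROM analytics.orders
GROUP BY 1;

-- Execution graph node: the generated statement Dataform actually runs.
CREATE OR REPLACE TABLE analytics.customer_order_counts AS
SELECT customer_id, COUNT(*) AS order_count
FROM analytics.orders
GROUP BY 1;
```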
[00:24:36] Unknown:
So it sounds like you've actually ended up building a SQL parser into the command line agent for being able to extract the data types and the dependency graph. I'm curious, what are some of the challenges that you encountered in the process of that? And were there any prebuilt libraries that you were able to lean on to handle some of that overall capability?
[00:25:03] Unknown:
So we don't currently parse SQL itself. 1 of the challenges there is the different dialects of SQL across different warehouse types. However, we do still have our own custom parser for these SQLX files. So an SQLX file is raw SQL, but it allows that SQL to be annotated with JavaScript or JSON-like blocks, which set configuration settings or allow the use of templates within that SQL. So you can reuse code and make your project more modular. We did use some libraries. I think we used new. We're trying to keep the footprint of the Dataform core library as small as possible.
1 of our engineers has experience doing this stuff and managed to whip something up in a few days, but it's certainly been challenging. And I think, as I mentioned before, the performance is very important, and trying to keep that lexing and parsing fast and under a second is something we've worked quite hard to do.
[00:26:07] Unknown:
And you mentioned too that in the SQLX files, you can embed some JavaScript logic for being able to build out macros or conditional templating. And I'm curious if you can talk through some of the use cases that that enables that would otherwise be fairly difficult if you were just relying on the pure SQL and some of the limitations of SQL itself when you're working in an environment where you're collaborating with other engineers and you might need to be able to compile multiple different fragments of SQL into an overarching query or table definition?
[00:26:45] Unknown:
Yeah. So SQL is not a programming language exactly. Some variants like T-SQL maybe are, but we can't really make that assumption. So as you mentioned, those conditional behaviors are hard to recreate, and it really limits what you can do. So to walk through an example: unit testing. In order to be able to unit test a SQL query, you need to have an understanding of its inputs, and you need to be able to declare its expected output. This isn't something that SQL can do on its own, and this is where I think a product like Dataform actually enables these use cases.
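To make that concrete, a hedged sketch of the shape a Dataform unit test takes: a test file names the dataset under test, mocks its inputs with fixed rows, and declares the expected output. The names here are hypothetical, and the exact block syntax should be checked against the Dataform documentation.

```
config {
  type: "test",
  dataset: "customer_order_counts"  // hypothetical dataset under test
}

input "orders" {
  -- Fixed rows that stand in for ref("orders") while the test runs.
  SELECT 1 AS customer_id UNION ALL
  SELECT 1 AS customer_id UNION ALL
  SELECT 2 AS customer_id
}

-- The body is the expected output given the mocked input above.
SELECT 1 AS customer_id, 2 AS order_count UNION ALL
SELECT 2 AS customer_id, 1 AS order_count
```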
Similarly, parametrization of queries for different environments. I mentioned this earlier. Running staging pipelines on 1% of your data, you can't do this with SQL directly. It doesn't understand what environment it's in. But with a framework like Dataform, we can wrap up that information and make it accessible to the user code. So as we create the SQL code, we can put those variables in. The reusable aspect is obviously important. Don't repeat yourself. If you are writing some actual code and you're copy pasting more than a handful of lines, you're probably gonna think twice about that and package it up.
SQL, again, doesn't really have many ways to do this. Some data warehouses do support things like stored SQL procedures, which can help, but these things are rarely version controlled. So being able to do that inside a framework like Dataform massively reduces the amount of duplicate code and makes larger refactorings easier. I can give you an example of 1 thing which 1 of our first customers did early on, which I think really highlights the power of this kind of thing. So I mentioned earlier, GDPR is annoying but important, because you need to remove data from your warehouse. You need to do something which is kind of stateful in a sense: execute delete operations and forget everything about a user. You might have 30 plus tables which contain user PII or identifiable information.
You're gonna have to go into those tables and execute a delete statement removing any rows which match user IDs from some deletion request table. So with Dataform, this template can be packaged up with a for loop and a list of the tables, and possibly the user ID field names. You can automatically generate those 30 queries that will go and run at the end of your daily schedule to remove all of those users' rows from those tables. So none of this stuff would be doable directly with SQL, and neither would any of the other things we've talked about: CI/CD, unit testing, code modularity.
These are all limitations of SQL itself.
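A hedged sketch of what that GDPR example might look like as a Dataform operations file, using an embedded JavaScript block and a loop to generate one delete statement per table. The table and column names are hypothetical, and the exact SQLX block syntax should be checked against the Dataform documentation.

```
config { type: "operations", tags: ["gdpr"] }

js {
  // Hypothetical list of datasets containing user identifiers; in a real
  // project this could live in an includes/ file shared across scripts.
  const userTables = ["orders", "sessions", "support_tickets"];
}

${userTables.map(table => `
DELETE FROM ${ref(table)}
WHERE user_id IN (SELECT user_id FROM ${ref("deletion_requests")});`).join("\n")}
```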
[00:29:59] Unknown:
And then another thing that you mentioned earlier is the different dialects for the different data warehouses. And I'm curious how you have approached encapsulating the common core capabilities and then also allowing for taking advantage of the differentiating factors between those different warehouses and what's involved in adding support for a new back end?
[00:30:27] Unknown:
Yep. So, currently, we have support for BigQuery, Redshift, Snowflake, Postgres, and Azure data warehouses. It certainly is a cost to us to maintain all of these. Thankfully, they do share a lot of similarities. And, you know, if you look at Redshift and Postgres particularly, they are very similar in terms of how they behave. Snowflake is fairly similar. BigQuery is probably the odd 1 out here in that it has its own way of doing things and definitely causes us a bit of a headache. I think 1 of our decisions in how we interface with these warehouses is to treat them in the most simple way we possibly can, which is that they are something we send SQL strings to and get responses back from. And internally, that is pretty much what our database adapter looks like.
If you want to support a new warehouse, you need to tell it a little bit about the SQL dialect, and most of these derive from the same kind of core standard SQL dialects that we have inside our repository. And you need to tell it how to connect to that warehouse and how to execute statements. And that's really it. That's kind of the fundamental piece. It's quite low level. It doesn't know much more. We do have some APIs that need to be implemented, such as listing tables, validating statements if you want validation in the actual Dataform web UI, listing schemas, and fetching table metadata such as the schema of a table.
There are some headaches there dealing with different data types in different databases, and some often tricky issues to debug around connection pooling. But thanks to keeping that interface quite simple, this hasn't been a huge burden for us so far. I suspect some of the other warehouses, such as Athena and Presto, will make this a little trickier, where dealing with data on S3, for example, you maybe lose this concept of a schema and table relation identifier.
[00:32:45] Unknown:
And in terms of the SQL dialect support, do you have any validation in terms of the compilation step to verify that a particular statement is going to comply with the dialect of the engine that it's going to be running against?
[00:33:01] Unknown:
Yes. So we've thought a little bit about this. There are some libraries you can use to kinda do this. At the moment, we don't do that automatically. Compilation, as I mentioned, we wanna keep really fast. We will do validation in the UI after we successfully compile, and we can cache some of that stuff where it makes sense. But we generally rely on asking the warehouse itself to validate a query for us. And this is particularly important for something like BigQuery where, you know, they've perhaps moved a bit away from your standard dialect, and the source of truth is ultimately asking the BigQuery API itself whether this is a valid query.
1 of the things that also makes this tricky is understanding if the tables that you're querying from even exist, and it's useful for us to be able to surface that kind of information.
[00:33:54] Unknown:
And then as you've mentioned, there is the web UI, and there's also the command line interface that has the core functionality. I'm wondering if you can talk a bit about the differences in capabilities and when somebody might want to reach for the web UI versus the command line.
[00:34:11] Unknown:
So the Dataform web platform is really geared towards enabling collaboration across an entire data team. It is an IDE for Dataform projects. You can develop new models. You can develop SQL queries, get real time compilation, and preview results. It integrates directly with Git so you can create branches, make commits, and develop and test changes in an isolated environment, so you're not overwriting your production data every time somebody makes a change. It also wraps up some of the deployment pieces, so continuous delivery, scheduling, and logs, email notifications, all the kind of stuff you'd expect with a scheduler type system.
So I think the command line interface is useful for a number of things still. Because every project in Dataform is usually backed by a GitHub or GitLab repository, it's kind of up to individuals how much they want to use the IDE or not. And we see teams where maybe the software engineers or the data engineers are checking out that code themselves, making edits in their preferred IDE, and using the command line interface to test those changes. Whereas, perhaps, analysts are less comfortable doing that and would rather work in the web IDE itself, where they get a bit more real time feedback as they develop queries.
So I think you can kind of use both together, but the parts of the web platform which we think are particularly useful are the collaborative pieces. And that's probably what we've spent the majority of our time on, to make sure that analysts have a place to come and contribute to these projects without having to do things that they maybe feel less comfortable with.
[00:36:05] Unknown:
And what are the cases where Dataform is the wrong choice for a given project?
[00:36:12] Unknown:
Yes. So I guess we talked a bit about how Dataform compiles projects. The only time user code gets executed in a Dataform project is during compilation and the generation of SQL queries, which means at runtime, we can't actually execute user code. So you can't pull from a third party API, for example. For something like that, you still need to fall back to a system perhaps like Airflow, where you can actually run user code at runtime. I'd say the other situation where Dataform really doesn't make sense is real time and streaming. I think that there are a few groups working on making streaming SQL dialects a real thing.
And I know these are already being developed for Apache Beam and for Kafka, and this is something that I think we want to think about 1 day. But right now, if you need extremely low latency, under, say, 5 or 10 minutes, which is probably the maximum you'd feasibly want to run a batch system on, then you'd want to look at doing something a little bit more custom. I think in practice, for the kind of use cases that we see people applying Dataform to, this isn't a problem. It's rare that you need sub 5 minute latency. There are features like incremental table builds, which allow you to kind of do micro batching in a sense.
But for anything faster than that, like real time dashboards, Dataform's probably not the right choice.
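For reference, the incremental table builds mentioned here look roughly like the following in SQLX: the first run materializes the whole select, and later runs only insert rows newer than what the table already contains. The names are hypothetical, and the when/incremental/self helpers follow Dataform's documented API as best I recall.

```
config { type: "incremental" }

SELECT
  event_id,
  user_id,
  event_timestamp
FROM ${ref("raw_events")}
${when(incremental(),
  `WHERE event_timestamp > (SELECT MAX(event_timestamp) FROM ${self()})`)}
```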
[00:37:51] Unknown:
And then in terms of your experience of building and running the Dataform project and business, what have you found to be some of the most interesting or unexpected or challenging lessons that you've learned?
[00:38:03] Unknown:
Yeah. So Dataform is a complex product, and it's a complex idea. And I'd say we have a significant challenge trying to communicate what it is to our users, in terms of the value proposition, but also in terms of getting them started. So something that we've been investing a lot in, and are continuing to invest in, is that initial onboarding experience when you come and sign up. Perhaps originally, you were just dropped in an IDE with an empty SQL editor and kind of expected to know what to do. And in practice, that just doesn't work. So we're having to invest a lot more than I think we ever imagined in making that process smooth and clear to new users, and we still do a lot of manual handholding in that process.
As a company, I'm not sure if there were many surprising things. Starting a start up is hard. Fundraising is hard. Building a team is hard, but also rewarding. And I think the team is thoroughly enjoying the problem that we're solving right now.
[00:39:19] Unknown:
And 1 thing that I'm not sure if we mentioned explicitly is that the command line piece of Dataform is open source. And so I'm wondering what your governance policy is around that, and how you're approaching the sustainability of that core framework as it relates to the business and the web application that you're building on top of it?
[00:39:40] Unknown:
Yeah. We think a lot about, you know, where the line to draw there is, and I think it's a tough balance to get right. Obviously, we do have a lot of features of the product which are closed source. This has happened already, where we've had people not from Dataform come in and make a change to the open source repo. We wanna make sure that they can add the features that they want, or support for their data warehouse, without us having to support a number of changes in our proprietary systems too. So that gives flexibility for people to come and change the very core part of the product, while we can still maintain control of the web platform, and they can still use that for the new features that they develop.
[00:40:31] Unknown:
And looking forward, what do you have planned for the near to medium term of Dataform, both on the technical and business side as far as new capabilities or improvements?
[00:40:42] Unknown:
Yeah. So a lot of users spend a lot of time in our app, as it is their primary place for development of queries and SQL. So we're doubling down on that. I think in terms of the features we have, you know, we're really supporting that core use case right now. And I'd say over the last 6 to 9 months, we've managed to get out most of the key blocking features for new users. I think right now, we wanna make sure that developing SQL and developing Dataform projects in the web UI is a really pleasant experience that is fast and snappy, getting rid of those edge case bugs, of which there is a never ending stream.
I'd say beyond that, supporting massive data teams, this is really challenging. We have some big clients using our software with, you know, 200, 300 data models, but we are also speaking to clients with many, many more. And when you start looking at data processes in teams of analysts which may be a 100 or more, managing thousands of SQL queries and extremely complex data pipelines, I think there are a lot of challenges that still need to be solved there. And we're trying to think a lot about how we deal with that. Some of these are around maybe more enterprise things: fine grained access control, data ownership, data governance, column level lineage.
This is all extremely important at that level. In terms of more immediate features, we are currently building out support for release processes and multiple deployment environments, so you can have staging and canary environments. We're looking at 1st class support for UDFs. I mentioned UDFs before, and I think this enables a lot of really interesting use cases in modern data warehouses, as well as support for Presto and Athena and integration with some other data workflow tools such as Airflow. In terms of the maybe slightly longer term, I think we'd like to be able to support multiple data warehouses with a single Dataform project.
I mentioned the streaming thing too. That's something that's always on the back of our mind. And we're also investing in a number of packages to kind of make common SQL processes a little easier to do without having to reinvent the wheel every time. So for example, analyzing AB experiments, managing access control, or utilities to simplify writing data tests. We think we can save a lot of people a lot of repeat work by packaging some of these up and making them available to the community.
[00:43:24] Unknown:
Yeah. Having the prepackaged data tests, I'm sure, would be quite interesting and useful, especially for cases where you're providing validation for some common data sources where somebody might be pulling from Google Analytics or maybe Twitter or LinkedIn for being able to ensure that there's an appropriate schema or that you have some sort of common needs that are able to be encapsulated in those prebaked tests?
[00:43:52] Unknown:
Yeah. So a good example of this is Segment. You know, I think a lot of our users use Segment, and they often have the same set of data models that they are developing to build on top of the core Segment datasets, like sessionized tracks and page views. And this is quite hairy code, and if we can provide a kind of ready to go, tested package so that others can benefit from that, then I think that really builds up the ecosystem and the framework, and gives you a lot more value without having to write a lot of SQL yourself.
[00:44:28] Unknown:
Are there any other aspects of Dataform and data warehouse management in the context of data engineering and analytics that we didn't discuss yet that you'd like to cover before we close out the show? No. I think that's great. There's nothing else. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:44:57] Unknown:
So I'd say, I think we're bringing a lot of best practices and deployment processes to SQL for those core data models. I think 1 of the areas where a lot of what we're doing with Dataform doesn't apply is in machine learning, where you still require slightly more custom approaches to building and deploying models. And we've definitely spoken to a lot of people who are having to invest heavily in how they develop and deploy machine learning models in a kind of reproducible way, again, following those software engineering best practices. So this is definitely an area that I kind of expect or hope to see a lot of innovation in over the next few
[00:45:42] Unknown:
years. Well, thank you very much for taking the time today to join me and discuss the work that you've been doing with Dataform. It's definitely interesting to see that there's more innovation going on in this space of ensuring repeatable processes and best practices for engineering in the context of data warehouses and leaning on SQL capabilities. So I'm excited to see where you take the business and where the overall space ends up. So thank you for all of your efforts on that, and I hope you enjoy the rest of your day. Great. Thank you
[00:46:15] Unknown:
very much.
[00:46:19] Unknown:
Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Lewis Hemens and Dataform
Motivation and Differences from DBT
Workflow and Collaboration in Dataform
Assertions and Continuous Integration
Implementation and Evolution of Dataform
SQL Dialects and Warehouse Support
When Dataform is Not the Right Choice
Challenges and Lessons Learned
Future Plans for Dataform
Biggest Gap in Data Management Tooling