Summary
The modern data stack has been gaining a lot of attention recently, with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to run a scalable, production-grade data platform with a small team, but there are still sharp edges and integration challenges to work through. Peter Fishman and Dan Silberman experienced these difficulties firsthand and created Mozart Data to provide a single, easy-to-use option for getting started with the modern data stack. In this episode they explain how they designed a user experience that makes working with data more accessible to organizations without a data team, while allowing more advanced users to build out more complex workflows. They also share their thoughts on the modern data ecosystem and how it improves the availability of analytics for companies of all sizes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Peter Fishman and Dan Silberman about Mozart Data and how they are building a unified experience for the modern data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Mozart Data is and the story behind it?
- The promise of the "modern data stack" is that it’s all delivered as a service to make it easier to set up. What are the missing pieces that make something like Mozart necessary?
- What are the main workflows or industries that you are focusing on?
- Who are the main personas that you are building Mozart for?
- How has that combination of user persona and industry focus informed your decisions around feature priorities and user experience?
- Can you describe how you have architected the Mozart platform?
- How have you approached the build vs. buy decision internally?
- What are some of the most interesting or challenging engineering projects that you have had to work on while building Mozart?
- What are the stages of the data lifecycle that you work the hardest to automate, and which do you focus on exposing to customers?
- What are the edge cases in what customers might try to do in the bounds of Mozart, or areas where you have explicitly decided not to include in your features?
- What are the options for extensibility, or custom engineering when customers encounter those situations?
- What do you see as the next phase in the evolution of the data stack?
- What are the most interesting, innovative, or unexpected ways that you have seen Mozart used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Mozart?
- When is Mozart the wrong choice?
- What do you have planned for the future of Mozart?
Contact Info
- Peter
- @peterfishman on Twitter
- Dan
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a-t-l-a-n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Peter Fishman and Dan Silberman about Mozart Data and how they're building a unified experience for the modern data stack. So, Peter, can you start by introducing yourself? I'm Pete Fishman. I go by Fish, and I am the cofounder and CEO of Mozart Data. And, Dan, how about you? Hi. I'm Dan Silberman, cofounder and CTO of Mozart Data. And going back to you, Pete, do you remember how you first got involved in data? Like many people in the data space, I am a failed academic.
[00:02:26] Unknown:
So after actually finishing the PhD and realizing I really loved the Bay Area and wanted to stay in tech, I sort of fell accidentally backwards into applying those empirical skills that were ground out through my many, many years in college and grad school, and turning that into a career in technology.
[00:02:47] Unknown:
And, Dan, do you remember how you first got involved in data? I first got involved in data when I wasn't officially a data engineer. I was more of an application engineer, but I've kind of always worked at smaller companies that didn't have dedicated data engineering teams, so I've just kind of been picking it up as I've gone along, working with analysts throughout pretty much my whole career. Yeah. It's definitely how I think the majority of people who call themselves data engineers now ended up in the
[00:03:13] Unknown:
role. And so in terms of the Mozart Data project, I'm wondering if you can just give a bit of an overview about what it is that you're building there, some of the story behind how it came to be, and why you decided that this was the problem that you each wanted to focus your time and energy on. Well, a couple of things. It starts with the fact that I think the best projects tend to be scratching your own itch. Dan and I were both kind of wanting
[00:03:36] Unknown:
to start something together. We've been good friends for 20 years. And, you know, we sort of thought, what is the ultimate combination of our two skill sets? You know, Dan's being sort of engineering and mine being sort of more data science. That ends up looking like something in the data space, where we've both been working for the last, you know, 15 years apiece. And, really, it started with: what are the tools that we most loved building, or thought were really critical or maybe overlooked more broadly, at our last few jobs? So we really thought about the tools that we consumed on a day-in, day-out basis that companies were essentially spending lots and lots of money on people like myself and Dan to build and provide internally.
[00:04:26] Unknown:
So in terms of the specifics of what you're building there, I know it's focused around the so called modern data stack, which has become the latest entry in buzzword bingo. So I'm wondering if you can just start by giving a bit of a sense about what it is that you're layering on top of that and why you think that that is where the particular area of focus can and should be, at least for the efforts that you're deciding to build this company around?
[00:04:50] Unknown:
We're basically building, you know, the all-in-one data platform. So we handle pulling data into a data warehouse. We manage a data warehouse for you. We're using Snowflake under the hood, and then various tools for scheduling data transformations once data is pulled out of your tools, you know, observing how data is flowing through your pipelines, getting notifications of failures, some tools for cataloging and managing data governance, that sort of thing. We kinda try to handle all of the core data engineering that teams are building the world over and let you focus on, you know, the specifics of your data and how you wanna organize it. And then you can use tools like, you know, charting tools, BI tools, or go into machine learning tools. Whatever you wanna do, kind of getting all of your data into one centralized data warehouse and then organizing it, we wanna make that as easy as possible.
[00:05:48] Unknown:
The whole idea around the modern data stack is that it's intended to make all of those infrastructure and engineering aspects easier, or sort of obviate them in certain ways, where the idea is that you can just throw a credit card at the problem and set up Fivetran and Snowflake and dbt and Looker, and you've got your data engineering. You're done, and you don't need to worry about all the specifics. But as we all know, there are all the integration aspects that go along with it. I'm wondering what you identified as the missing pieces in this modern data stack ecosystem, and what people are deciding to try and build around, that make something like Mozart useful or necessary for engineering teams or organizations that maybe don't have a dedicated data team. I think the first thing that I wanna call out, where I'm in deep, deep agreement with you, is that it's really magical to think about: you know, the buzzword, the modern data stack. What it really is talking about is this evolution from needing a small or not-so-small team of data engineers,
[00:06:48] Unknown:
a sort of big budget, for a data warehouse just to get started. And today, that doesn't look like it at all. Today, like you said, through a variety of tools, you can get started with pretty much, like, swiping a credit card, you know, not just with tools like Mozart Data, but with all of the components of the modern data stack. For the most part, there's a self-service offering or even a free offering that you can just get started with. So it's really incredible: the bar a decade ago used to be you'd hire a few data engineers and you'd sort of toil and test some different technologies for a few months. Today, that looks like, in an afternoon, you can basically be spun up with world-class data infrastructure.
So when we think about kind of the opportunity in technology, it's to be opinionated and bet on what you think are the core winners of sort of a landscape. So that's really what our product does. It's trying to make some of these core winners even more accessible. So on the one hand, we talk about, like, how incredibly trivial it is to spin up a modern data stack. But in practice, that's not the case. Now if you're an experienced data engineer that's been through the rough and tumble before the days when this was an easy, simple credit card swipe and lots of simple button clicks just to get started, you might say, well, no. Actually, it is crazy easy. And, yes, it is much easier than it was, you know, even 5 years ago. It's still not easy enough. Right? We think of it a little bit like, you know, some of the workout apps that you can find on your phone. It's really easy to get started with those. They try to make it really easy for you. But, actually, still the hardest part is just getting started, especially if you're a novice that is unclear about where to begin.
So I challenge one premise, which is to say that the modern data stack is very easy to get going with. There's still a lot of debate or disagreement about what's the right tool for a given type of company. So there ends up being a lot of tool exploration, in addition to, you know, making all the pieces work together, which is true in theory but not always true in practice. So as a practitioner, diagnosing where issues come up is not always trivial, even though these pieces, in practice, are often used by many, many different companies together as one. Just having it be a singular experience is definitely not the current state of the world.
[00:09:26] Unknown:
And as far as the particular use cases and workflows that you're focusing on, or maybe any industry verticals or horizontal layers, and the target end users that you're focusing on, I'm wondering if you can give a sense about how you thought about that as you started to design and build out the Mozart platform, and some of the ways that those focuses in terms of persona or industry have informed and helped you with prioritizing the features and user experience of the system?
[00:09:55] Unknown:
I would say we're very industry agnostic, intentionally. The questions that a finance company or a healthcare company or a gaming company has are different, and the data is different, but the tooling that you need to answer those questions is pretty much the same regardless of industry. So I would say that the thing that's very consistent with our tool is that we're a tool for data analysts. You generally do need to know SQL to get a lot of value out of our tool. But I actually shouldn't say data analysts. Like, a lot of our customers would not call themselves data engineers and would not call themselves data analysts. They're people that have titles like marketing ops or sales ops, who've maybe picked up some SQL along the way. And I think this is a pretty big, growing class of person that has learned some SQL and some analysis skills over the course of their career so that they could do their job better. You know, they often work at companies that would never have a large team of data engineers supporting them, but the tooling is getting to the point where they don't need a team of data engineers to kind of have access to the tooling that they need to answer their questions.
[00:11:03] Unknown:
Going back to the sort of sharp edges or problems with the ways that the modern data stack has been manifesting, and some of the ways that you're thinking about Mozart Data, I'm curious what you have seen as the necessary level of experience or background knowledge to be able to actually effectively build and integrate the various components of the modern data stack. How are you thinking about smoothing the on-ramp for people who maybe don't have all of that necessary background expertise, and building in some sort of gradual exposure of complexity, you know, making the easy things easy and the hard things possible? And just how do you think about that aspect of the overall problem of being able to put data to work and actually gain value from it without necessarily having to hire on a full team of data engineers and data scientists to, you know, spend months on the problem?
[00:11:59] Unknown:
Well, I think it, you know, starts back with something that Dan said, which is we really wanted the bar for being able to contribute to be the ability to write basic SQL statements. So there's sort of a theme that you'll see in some data tooling, which is there are business users that understand the business definitions. They understand, not necessarily what columns mean, but what they want to track and the nuances of their data in order to track them. So, you know, Dan and I, before we started this data company together, had actually, a decade ago, started a hot sauce company together. And it was actually a hot sauce company on Shopify. And, you know, we wanted to, you know, do things like report on our customers.
And the count of customers in Shopify was right. But really, we wanted to get a count of customers that didn't have the last name Fishman or Silberman, which was obviously, in the early days, disproportionately way too many. And when I think about that, that's a specific example of business knowledge that the business user would understand: they wanna actually know what their real customer traffic looks like. But it's not necessarily obvious to somebody that's just simply collecting the data. So we wanna put the power in the business users' hands. Now typically, that looks like a set of data engineers that are playing sort of telephone, with, you know, ambiguous requirements and a lot of frustration on both sides. So, you know, you sort of have this movement to empower business folks in the BI tool. Right? So that the BI tool is interacting with a bunch of very clean tables that have sort of sources of truth hooked up to them.
That doesn't really happen magically. It doesn't happen, like, automatically. It happens through a lot of pain. And the more you can empower, like, higher up in the funnel, the less sort of shared pain in that organization there is. Getting back to your question, this to me is sort of the evolution of where, not so much the failure of the modern data stack, but the problems of the modern data stack begin, which is: just because it's easy doesn't mean that it's producing more sort of magic. And even though the tooling, aside from being more powerful, is also, like, better integrated, there are still so many problems. You can think of them as downstream problems that you wanna work back from, and we think that there's incredible opportunity there, especially as that population Dan alluded to, you know, the folks that are in marketing ops, rev ops, biz ops, sales ops, you name it, blank ops, are really becoming SQL proficient, becoming sort of data savvy, data proficient.
As that population grows, you're gonna see more of those sort of downstream problems. As amazing as all the tooling upstream of that is, it's almost irrelevant if, you know, we're creating more problems downstream. So that's kind of where we saw the modern data stack underserving organizations, especially smaller businesses, especially smaller businesses that don't have, you know, a giant data team to play many, many, many, many hours of telephone with. So we see that falling down kind of just in the actual real practitioner using data, getting value out of data, running into roadblocks in the setting up of the data, and we wanted to make that as easy as possible. I can give one specific example of something we're thinking about now: data governance.
[00:15:47] Unknown:
So, like, the, you know, the sales ops, marketing ops, the operations people, they might think in terms like, hey, we want the marketing team to have access to this data, and we don't want, you know, this other team to have access to this data. And, you know, the sales team needs access to some other data, and they know that they can't have access to the HR data. And ideally, you know, all of the data is in the same data warehouse, so the data engineer has to kind of translate those requirements into database roles and users and all the permissioning that goes along with that. So we try to find kind of, you know, simple interfaces that the operations sort of thinking can be translated into, and obfuscate, you know, the actual roles and permissions that exist in the database.
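For a concrete flavor of that translation, here is a minimal sketch assuming Snowflake-style GRANT semantics; the role and schema names are entirely hypothetical, and this is an illustration rather than Mozart's actual implementation:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical mapping from "ops-speak" access requests onto warehouse roles.
# Note that HR data is deliberately granted to neither team.
TEAM_GRANTS = {
    "MARKETING_ROLE": ["ANALYTICS.MARKETING", "ANALYTICS.WEB_TRAFFIC"],
    "SALES_ROLE": ["ANALYTICS.CRM"],
}

def apply_grants(conn) -> None:
    cur = conn.cursor()
    for role, schemas in TEAM_GRANTS.items():
        cur.execute(f"CREATE ROLE IF NOT EXISTS {role}")
        for schema in schemas:
            # Let the team see the schema and query everything in it today;
            # a real system would also handle future tables and revocations.
            cur.execute(f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role}")
            cur.execute(
                f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO ROLE {role}"
            )
```

The point of the interface is that the operations user only ever sees the team-to-data mapping at the top, never the GRANT statements underneath.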
[00:16:38] Unknown:
In terms of the actual platform that you've built and some of the engineering work that you've done to be able to paper over some of these complexities of the data stack, I'm wondering if you can talk about the overall system that you've built and maybe where you actually started the engineering effort to be able to iterate from and understand how to best explore this space of being able to present the modern data stack as this unified experience that people who don't necessarily think about themselves as data engineers or analysts are able to use effectively.
[00:17:09] Unknown:
I'll start by saying we had a healthy set of design partners. So first off, you know, Dan and I have both built this product a number of times at different companies. I spent the biggest chunk of my career at Yammer, and at Yammer, we built a tool called Avocado. And Avocado was sort of an inspiration for some combination of Mozart and Mode Analytics, which are the two companies that came out of the brains of a lot of people on that team. And what that looks like is: what are the tools that basically get data to a central place, and then the tools that can visualize that and share that within an organization?
You know, I think it started with what have we built kind of in the past and what has kinda served us. And Dan, of course, has done something similar at Clover, his most recent job. So we had sort of some experience building tooling like this. On top of it, we work with some great people that have lots of opinions about how they want their data. So we started sort of not with a bunch of paid customers, but with a bunch of people that were willing to give us feedback on what we were building. So we found folks that were willing to sort of, you know, trust our data experience and ask us for advice in a variety of data domains. And we asked those people to tell us what was important to them. What is it that kind of they most wanted from a data tool, or where were they running into roadblocks as they tried to spin up their data infrastructure?
So we had a lot of great inspiration before we got started.
[00:18:45] Unknown:
And so as far as the actual platform itself, can you talk to some of the pieces that you've engineered, and how you have done the work to actually tie together these various elements of the data stack? Maybe some of the pieces that you have been able to take off the shelf and just add a facade over, so that people don't need to know that they're using whatever the managed service happens to be? And how do you think about the build versus buy decision, about which pieces to use off the shelf versus which pieces you need to actually engineer, to be able to provide that overall experience that you're aiming for? So I would say our system is, you know, ETLT.
[00:19:23] Unknown:
We don't really care to have the argument of should it be ETL or ELT. You should extract data from your various systems, transform it a bit, load it into a centralized data warehouse, and then do more transforming on top of that. We use a company called Fivetran for a lot of our initial connectors. That's definitely something I would highly recommend that you buy. You know, a lot of people have built ways to pull data out of, you know, other databases and put them into another database or pull data out of, you know, Salesforce or Google Ads, Facebook Ads, you know, hundreds of different tools.
Those connections have already been built, whether it's Singer taps or Stitch or Fivetran, etcetera. So we use Fivetran for a lot of that. We've built some of our own, and we generally use the Singer framework when we're building our own. And then we obviously did not build a data warehouse. We're using Snowflake. So, I mean, for some context, v1 of our product was: you sign up for an account, and we'll create another Snowflake account, create various users and roles and permissions for those, automatically create a Fivetran group, and load the Snowflake account that we just created in as your Fivetran destination, give it its own user and roles, etcetera.
And then we basically provide, you know, a very simple interface where you can write SQL to query your data warehouse, and then you can say, actually, I want this query to be scheduled to run every hour and to create this other table. So you can kind of start building data pipelines. That was v1, and that is basically, you know, the core of the product: the ability to connect your different sources of data, replicate it into a data warehouse, and then start transforming it to basically organize it better for any downstream purposes.
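As a rough illustration of that "schedule this query to materialize a table" loop (a sketch under assumed names, not Mozart's actual code):

```python
import time

import schedule  # pip install schedule
import snowflake.connector  # pip install snowflake-connector-python

# The user's transform: an ordinary SELECT materialized as a table.
# Schema and table names here are hypothetical.
TRANSFORM_SQL = """
CREATE OR REPLACE TABLE analytics.orders_clean AS
SELECT id, customer_id, amount, created_at
FROM raw_shopify.orders
WHERE NOT test_order
"""

def run_transform() -> None:
    conn = snowflake.connector.connect(
        account="...", user="...", password="...", warehouse="TRANSFORM_WH"
    )
    try:
        conn.cursor().execute(TRANSFORM_SQL)
    finally:
        conn.close()

# "Run every hour and create this other table," as described above.
schedule.every().hour.do(run_transform)
while True:
    schedule.run_pending()
    time.sleep(60)
```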
And then on top of that, since then, we've just kind of been layering on, you know, notifications when there are failures, ways to observe how data is flowing through your different pipelines, ways to catalog your data, and, like I was talking about a minute ago, you know, ways to layer on data governance, things like that.
[00:21:30] Unknown:
And as far as the engineering projects that you have focused on as you're building out this platform and trying to tie together the experience, I'm wondering what have been some of the most interesting or challenging aspects of putting together this platform and thinking about how to design it in a way that's approachable for people who don't necessarily want to spend all of their time on the engineering aspects, but still flexible enough to support people who do.
[00:21:58] Unknown:
So I think you actually hit on something earlier. We wanna have a low floor and a high ceiling. And that was one of the key design principles, the low floor being: anybody that can write SQL can be a data engineer. And the idea being you don't need to hire a data engineer until you're hundreds of employees. And this is sort of a radical perspective, or a radical opportunity, that I think many in the modern data stack share, which is to say, kind of like Dan was mentioning, now that extracting data from a set of tools that are very standard across tech companies has become sort of a solved problem, the challenge becomes, like, how do you, you know, get access to that data, understand that data, give the right permissions on top of that data.
So I would say that when we think about kind of diving into this problem,
[00:22:56] Unknown:
you know, we started there. One thing that we do a little bit differently from most of these platforms is how we approach building pipelines. With tools like Airflow or dbt, you need to tell those systems, you know, basically: do this task, and then when that's done, do this task, and then when that's done, do these two tasks, and when those two are done, do this task, things like that. We've built a system for parsing through the SQL that our customers write to determine how their tables are actually connected, without them having to tell us, and then we can present that to them as: here is the data lineage in reality, regardless of kind of how you imagine it should be, which takes a big step away from the user. And then, sort of related to that, we try to focus on features that we can add, being the entire platform, that are a lot harder or impossible to do if you're connecting a bunch of individual tools. So one good example is, since we're handling both, like, the ETL and the next transform layer, we can kick off pipelines as soon as the data lands from your different tools.
Whereas, if you're combining an ETL tool and a separate transform tool, it's either hard or impossible, depending on the tools, to kind of have those play really nicely together.
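The SQL-parsing idea Dan mentions could be sketched, under assumptions, with an off-the-shelf SQL parser such as sqlglot; the transform names and queries below are made up, and Mozart's actual parser may work quite differently:

```python
import sqlglot
from sqlglot import exp

# Hypothetical transforms: output table name -> the SQL that builds it.
transforms = {
    "orders_clean": "SELECT * FROM raw_orders WHERE NOT test_order",
    "daily_revenue": """
        SELECT o.created_at::DATE AS day, SUM(o.amount) AS revenue
        FROM orders_clean AS o
        LEFT JOIN refunds AS r ON r.order_id = o.id
        GROUP BY 1
    """,
}

def infer_lineage(transforms: dict[str, str]) -> dict[str, list[str]]:
    """Derive each output table's upstream tables by parsing its SQL."""
    graph = {}
    for out_table, sql in transforms.items():
        tree = sqlglot.parse_one(sql, read="snowflake")
        sources = {t.name for t in tree.find_all(exp.Table)}
        graph[out_table] = sorted(sources)
    return graph

print(infer_lineage(transforms))
# {'orders_clean': ['raw_orders'], 'daily_revenue': ['orders_clean', 'refunds']}
```

From a graph like that, a platform knows orders_clean must run before daily_revenue without the user ever declaring the dependency.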
[00:24:22] Unknown:
In the sort of integration aspect, I'm wondering, as you've been building out, tying together Fivetran and Snowflake and the transformation layers, and being able to, you know, build out the presentation layer and maybe integrate with some of the reverse ETL frameworks, what have been some of the layers in the stack or the points of integration that have had the most friction that you've had to deal with?
[00:24:45] Unknown:
This kind of is getting into a bit of the details. When we started with Snowflake, their top-level abstraction was the account. So we had a Snowflake account, and then within an account, you can have databases and warehouses. Snowflake is a little bit different than some of the other data warehousing platforms in that they separate compute from storage. So we had our Snowflake account, and then for each customer, we would create a new database and then a new warehouse. That led to some trouble, where a lot of tools connect to Snowflake and assume that, for the account they're connecting to, they can have, like, full account admin access. They can have access to the various system databases that Snowflake has. And so that gave us some trouble, where some tools wanted, like, basically full access to the account, and we couldn't do that without exposing, you know, our other customers' data. Since then, Snowflake has added another layer of abstraction on top of account: the organization. So now an organization can have many accounts. So now we're migrating our architecture to be, you know, Mozart is the organization, and then each of our customers gets their own Snowflake account, which solves a lot of these problems. But along the way, that has definitely been something that we've struggled with. That's true for many tools. You hit on a few, but, you know,
[00:26:05] Unknown:
reverse ETL, obviously data warehousing. You know, a lot of the initial kind of use cases for these tools weren't to sort of put them all together
[00:26:15] Unknown:
in an all-in-one. So digging more into what you were saying about having a low floor and a high ceiling, where you want to be able to give people the ability to dig deeper into the stack once they gain a certain level of familiarity or comfort with it, I'm wondering what have been some of the areas of kind of progressive exposure that you have engineered into the product, some of the stages in the lifecycle that you work hardest to automate, and the pieces that you feel need to be in the control of the end user because they're very specific to how they run their business or what they want to be able to ask and answer questions about.
[00:26:49] Unknown:
Dan just touched on, you know, something, which is: this is your data. So you have access to your Snowflake warehouse, so anything that you want to connect, you can, and we have, you know, customers that connect many, many data tools to their data warehouse. I think the high ceiling is the part that's interesting, which is, you know, how, as somebody picks up additional data proficiencies and has additional sort of data interests or data needs, and this is a challenge for any company building any tool, how can you both serve a novice but yet expose these sort of escape hatches? So that, you know, an expert that walks in can go deeper, or maybe, you know, you start using data where a bunch of folks that are more novice with it start consuming the data, start getting value out of the data, and then you as an organization say, actually, you know, we're getting so much value out of the data. What we wanna do is hire a high-powered data engineer or a data scientist that can really take that data to the next set of challenges and the next level and extract even more value out of it, because we can see, you know, the business user getting a lot of value out of it.
And, you know, when they arrive in an organization, how do you enable, like you said, a power data user that is already coming in with a variety of experiences and expectations, and yet at the same time enable a novice? And I think it's not just unique to data tools. This is basically a dilemma that almost every company faces, a growing company that can sort of, you know, sell beyond a niche. Like, how do you essentially serve a wide spectrum of users? And I think it's all about little, like, tricks in terms of making it very clear where the sort of, I'll call them escape hatches, to the advanced levels are.
[00:28:43] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. As far as the sort of automation aspects, one of the real benefits of using a system such as Mozart is that a lot of the opinions come baked in. The customer doesn't have to form their own. They can just say, whatever you say is fine. I'll just take what you've given me and work with it from there. And one of the probably longest-running debates in the data ecosystem comes down to modeling in the warehouse, and different generations of warehouse technology have led to different conclusions or sort of trends within that. And I'm wondering how you have approached that aspect of landing people's data into the warehouse and then maybe setting up the initial set of models for them to be able to work from, and how you think about how much control to provide early on. What are the knobs for people to be able to say, don't do anything, I just want you to land the data, I'll do all the transformation, versus, I just want you to hook this up into my data warehouse with the semantic layer prepopulated?
I don't wanna have to think about all of the minutiae having to do with data modeling and, you know, building out the business metrics.
[00:30:48] Unknown:
To me, honestly, the debate has been settled in my mind just for practical purposes. There are so many good tools that have built these connectors. They don't know your specific needs. So if you're going to use a tool like Fivetran or Stitch, you have to accept that they're gonna land the data in your database in, you know, a fairly generically useful schema. And I think it's so useful to be able to use tools like that, and it saves your team so much time, that rather than writing some ETL yourself, you know, maybe in Python, to reorganize it specifically how your company needs the data, you should just use a tool like that and let it land in your data warehouse.
Also, you know, for practical purposes, data warehouses are so powerful and relatively cheap now that you might as well do the transforming specific to your company's needs in the warehouse. And, you know, if there are actually, you know, 5 different needs for some data, then build, you know, 5 pipelines and end up with 5 different tables that different teams are building their dashboards on, or combine it with other sources.
[00:31:53] Unknown:
So I think for practical purposes, ETLT is just the way to go, and accept that it's gonna land in your warehouse in possibly not the ideal state, but that's not really a problem. It's also interesting to talk about the sort of semantic layer that has been gaining a lot of attention lately and its role in the modern data stack, and how you think about it at Mozart, because that is the layer closest to the business use case: you need to have these specific metrics defined to be able to aggregate across these different dimensions and be able to ask and answer questions rapidly.
I'm curious how you're thinking about integrating that and simplifying the experience for people, as this is an area that is seeing so much, I don't wanna say volatility necessarily, but so much activity in terms of what the experience is actually going to look like, where there isn't really any consensus around it yet? Yeah. It's a good question. I would say we don't solve this super well right now. We have some
[00:32:49] Unknown:
cataloging functionality. Right now, we have a couple customers that are using various tools on top of Mozart for that. I might have a better answer in a few months. Kind of in general, how we look at this is, like, we try to look at the tools that are developing around any part of the stack and try to understand, from talking to our customers and talking to other people, you know, what is the majority of the value that people are getting out of these tools, and can we replicate some of that in Mozart? And we leave it as: if you really, really wanna go deep on this, you're probably gonna wanna use a tool that's dedicated to just that. If you're okay with getting kind of the majority of the value in a way that we can bring it, then we'll try to do that. I don't think a lot of
[00:33:33] Unknown:
these types of problems get surfaced in, like, an early data evolution. So, you know, this might be sort of debt that you're creating in terms of, you know, maybe multiple definitions, or maybe, you know, computing multiple columns with similar meanings. But, you know, I think of those as more enterprise problems, for larger sort of companies with a variety of legacy reporting and legacy data and legacy systems. So we haven't, like, jumped into solving this problem, mostly because we think of it as not quite aligned to our initial customer. Now I've worked at a number of sort of large organizations in my career, and these are real problems that these companies face at late stage, or now you start to see them earlier and earlier. So these are real problems that these companies are solving, and we see kind of our biggest customers starting to face a lot of these challenges. But I would say that for early companies, I almost exclusively see the problem as, you know, how to get started, how to bring in a new dataset. A lot of the value comes from just the ability to consume the data in any way, shape, or form, which is, you know, why sort of the most popular BI tool in the world is still Excel, because it's really good at getting really anyone, you know, using data in some way. So I still see the hairiest problems to tackle as really being on the getting-started side. We do find that as customers mature, they start running into more problems in that domain.
[00:35:17] Unknown:
For somebody who is using Mozart data, I'm wondering if you can talk through a typical workflow of getting set up with it and then being able to build out a set of analyses and maybe how it might coexist with an existing data infrastructure where maybe you're working at a larger organization, but you want to have an easier experience for some of the business users so that they can, you know, fulfill maybe 80% of their use cases and let the core data team focus on the gnarliest problems.
[00:35:46] Unknown:
In large organizations, sometimes specialized teams are queued up behind, quote, more pressing data engineering requirements or needs of other teams. So when we think of getting started with Mozart, it's important to us that it's incredibly easy, like shockingly, jaw-droppingly easy. So what that typically looks like is it just sort of follows the logical flow of data. So it starts by connecting sources. So if you're using standard tooling, maybe you're in the B2B space, a very common CRM is going to be Salesforce. Maybe you're using HubSpot. Maybe you're doing some ads on Google. You know? You've got some databases.
A lot of the SaaS tools, through tools like Fivetran, are able to be extracted and loaded just with credentials. So it really just starts by getting data in. There's making that initial connection, there's testing that connection, and then that data starts syncing. If you've got a lot of data, you do a Julia Child, where you show up, like, an hour later or a day later, and then it's magically there. Otherwise, you know, you can be connecting your data and then doing simple select statements or data cleaning, and then hooking up your favorite BI tool to your data warehouse, all in under an hour. So you can be writing that first, you know, report, that query. We like to get customers saying, what is the one report that they really want, that they really want to get done, and then let's go through the data that we need and the steps that we need in order to make that happen. That can typically be done in an hour. And then, you know, now it's a one-click refresh that's good forever.
And that's incredible. I see customers' faces light up in those moments. So for us, we really focus on making that incredibly
[00:37:39] Unknown:
fast and easy. There are generally two types of initial goals that our customers have. Either, you know, they have these different tools, maybe Facebook ads and Google ads, and, you know, each of those tools has good reporting capabilities, but they don't have a way to combine that data and, you know, look at it alongside the place that actually has the revenue, whether that's, you know, Stripe or Shopify or something like that. So they wanna be able to combine these different data sources and build a report that needs multiple different sources. That can be kind of the initial goal. Another one is, you know, similar to that, except they already do have that report. They just spend, you know, 3 days a month putting it together, downloading CSVs, putting them into Excel, pivot tabling, and then they have their monthly report. And it's just reproducing that, but in a way that, you know, doesn't have a month of data lag. You can have that fresh every hour, and it doesn't take, you know, 3 human days to do it. It takes maybe 1 human day to set it up once, and then it's just automated every time.
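For a concrete flavor of that first goal, a hedged sketch of the kind of blended report Dan describes, written as a warehouse transform; the Fivetran-style schemas and column names are assumptions, not Mozart specifics:

```python
# Snowflake SQL held in a Python string, as it might be scheduled as a
# Mozart-style transform. All schema, table, and column names are made up.
BLENDED_MARKETING_SQL = """
CREATE OR REPLACE TABLE analytics.marketing_daily AS
WITH spend AS (
    SELECT date, 'facebook' AS channel, SUM(spend) AS spend
    FROM facebook_ads.ad_insights GROUP BY 1, 2
    UNION ALL
    SELECT date, 'google' AS channel, SUM(cost) AS spend
    FROM google_ads.campaign_stats GROUP BY 1, 2
),
revenue AS (
    SELECT created_at::DATE AS date, SUM(amount) / 100 AS revenue  -- cents
    FROM stripe.charges
    WHERE status = 'succeeded'
    GROUP BY 1
)
SELECT s.date, s.channel, s.spend, r.revenue
FROM spend AS s
LEFT JOIN revenue AS r USING (date)
"""
```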
[00:38:45] Unknown:
As far as the stage of the modern data stack and its role in the modern data ecosystem, you know, it's definitely a very powerful set of technologies and capabilities, and there are still these edge cases that exist as far as being able to integrate it all together, which is, you know, definitely a solvable problem. But what are some of the underlying challenges or problems that you see in terms of how people are thinking about the modern data stack, whether it be the consumers or the people who are building it? And what are some of the areas of opportunity, or maybe some of the next stages of evolution, that you are anticipating as you are getting more involved in this particular area?
[00:39:20] Unknown:
I would say it's never been stronger, but it's not fundamentally different than it's been for 10, 15 years. I would say, you know, the data stack has always been about making what is currently possible but difficult a little bit easier, and then that evolution just happens constantly. I mean, that's just how software works, basically. The stuff that's getting the most action right now, where there are a lot of, you know, startup companies and a lot of different ways to do things, that's maybe, you know, observability and testing.
BI will probably always be in that state. Some of that is a lot of different ways of doing things like metrics and the semantic layer as well. There's a lot of experimentation going on, with different startups and different people doing things in different ways, and a lot of companies, you know, building their own systems. And I think over time those different pieces will coalesce a little bit, and there'll become more standard ways to do this. And then the next thing that is, you know, currently possible but very hard will be on the forefront of what's about to be made easier. That might be, you know, machine learning. I think in 5 years it will probably be noticeably easier to do, maybe to the point where you don't actually have to be much of an engineer at all. To pick up on that
[00:40:38] Unknown:
a little bit: first of all, yes, the modern data stack has become a little bit of a cliche, or, like, a bingo word. You know, you could just be walking around the streets of San Francisco and overhear, you know, 4 conversations in a row about the modern data stack. So I think, while the term may have jumped the shark, its application and its uses have, just like Dan said, never been stronger. What I see the modern data stack as is actually a lot of services and companies that came out of larger companies. So if you think about a lot of these tools, they were built by expensive in-house teams at, you know, companies that were doing cutting-edge data work. And they said, you know, in order to do this cutting-edge data work, this is how I want my data to flow. And they would invest in data scientists, who were incredibly expensive, to make them just a little bit more efficient, because their insights would be hugely valuable to these now giant, hugely successful tech companies.
These teams invested in incredibly useful but incredibly expensive data infrastructure. That data infrastructure now finds itself as many services in the modern data stack. The modern data stack is essentially ripped off of the data stack of 10 years ago, like Dan mentioned, plus all of the advances that were happening within some of these great web 2.0 companies, and it's now available not just to, you know, the latest stage or wealthiest or public companies, but actually available far more ubiquitously.
[00:42:21] Unknown:
As far as the work that you've been doing with Mozart and some of the ways that you're seeing your customers and early design partners working with it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:42:33] Unknown:
One great example: we fairly recently added what we thought of as kind of a niche feature, where you can take any table and hook up a Google Sheet to it, and whenever that table changes, we'll update it as a tab in the Google Sheet. One thing we haven't mentioned: we don't do BI. We generally say, you know, you then hook up your favorite BI tool. That might be changing, but this was sort of a way to bootstrap people who hadn't yet chosen a BI tool, so you could keep using Google Sheets as your BI tool. One of our customers is now basically using one massive Google Sheet as their customer support CRM. They had, you know, over 100 people on their customer support team using a BI tool, you know, 100-plus seats in a BI tool. Basically, what they're doing is pulling data from their application databases and various vendors, so they could see where they shipped large things to their customers.
So they were pulling in data from a bunch of their different tools and organizing it so that they could have a simple lookup tool. They would have everything about a given customer, so their customer support team could look up shipment details and everything they need to know about a customer
[00:43:44] Unknown:
in one tool. They were basically able to save 100-plus seats in their BI tool by just using Google Sheets. I would go in a little bit of a different direction. One of the companies that I think is using, you know, our platform the best is Mozart Data. So we consume our product internally quite a bit. We have a series that we call Mozarting Mozart. We try to be a very data-driven, you know, B2B company ourselves. So there are real things that happen to the internal, you know, data analysts or business folks in our organization where, you know, basically, the tool, Mozart, is trying to solve a problem. Or when it doesn't, it ends up being a hack day project that typically ends up winning a sort of silly hack day prize. So we have a series that is really just about real practitioner problems. Dan and I actually, you know, in our Mozart account, we have, like, a date table. Right? So just something as simple as: whenever you want to do a cohort analysis, you need basically a left table to join to that's just essentially a set of dates, because it'll always be the case that there's, you know, some cohort that's effectively missing in one of the date periods that you need.
So this is something that, you know, comes about when you're actually doing that work. So one of our sort of core principles, obviously, much like many, many companies, is dogfooding, but at Mozart, it's Mozarting Mozart.
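For illustration, a date table like the one Pete describes can be generated directly in Snowflake; here is a hedged sketch, with made-up schema names and date range, of both the spine and the kind of cohort query that left-joins onto it:

```python
# Snowflake SQL held in Python strings, as each might run as a scheduled
# transform. All schema, table, and column names here are hypothetical.
DATE_SPINE_SQL = """
CREATE OR REPLACE TABLE analytics.dates AS
SELECT DATEADD(day, SEQ4(), '2018-01-01'::DATE) AS date_day
FROM TABLE(GENERATOR(ROWCOUNT => 3650))  -- roughly ten years of daily rows
"""

# Left-joining activity onto the spine keeps quiet periods visible as
# zero-count rows instead of silently disappearing from the cohort chart.
SIGNUPS_BY_DAY_SQL = """
SELECT d.date_day, COUNT(c.id) AS signups
FROM analytics.dates AS d
LEFT JOIN analytics.customers AS c
  ON c.created_at::DATE = d.date_day
GROUP BY d.date_day
ORDER BY d.date_day
"""
```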
[00:45:12] Unknown:
As you have been using Mozart to run the Mozart business, what are some of the interesting aspects of the product that you have identified through the work of actually being your own consumer, and maybe some of the empathy that you've been able to build up with your customers, to be able to understand areas that are, you know, ripe for improvement or refactoring?
[00:45:37] Unknown:
Some big ones are basically around onboarding new employees. I think this is a common issue with our customers and ourselves. Somebody joins a company that already has a very mature data platform, and they come in and they see, you know, 200 transforms and a data warehouse with 5,000 tables. Most of the information about, you know, what is important, what is up to date, where can I find information about this or that, is generally held in people's heads, and it's hard to see, from looking at the code base or looking at the database, what actually matters.
So that's definitely something we've tried to make easier in our platform as we've experienced it while onboarding new employees to Mozart. I would also add data validation as a core feature. This is something we recently launched, and as soon as we did, when we were testing it on our own data, just the ability to put tests on different tables, like, this column should never have nulls, or there should never be duplicates in this column on this table. We kind of didn't realize some of our tables were in a bad state, and we didn't realize that until we actually started using our own testing framework.
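A minimal sketch of that kind of check, assuming a Snowflake-style cursor and hypothetical table names (Mozart's actual validation feature presumably does much more):

```python
# Each check is a name plus a SQL query that counts violating rows.
CHECKS = [
    (
        "orders.customer_id is never null",
        "SELECT COUNT(*) FROM analytics.orders WHERE customer_id IS NULL",
    ),
    (
        "customers.email has no duplicates",
        """SELECT COUNT(*) FROM (
               SELECT email FROM analytics.customers
               GROUP BY email HAVING COUNT(*) > 1
           )""",
    ),
]

def run_checks(cursor) -> list[tuple[str, int]]:
    """Run every check and return the ones that found violating rows."""
    failures = []
    for name, sql in CHECKS:
        cursor.execute(sql)
        bad_rows = cursor.fetchone()[0]
        if bad_rows:
            failures.append((name, bad_rows))  # e.g., send a notification
    return failures
```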
[00:46:52] Unknown:
On a much, much lighter note, there are some small UI touches that end up really being important to the usage. In fact, one of our most early-on requested features: we have a lot of, like, sort of cheesy puns at Mozart. I mean, the whole company name is based on the idea of, like, data orchestration and, you know, composing your tables. There are lots of puns. But when you do create a table, it ends up playing a little snippet of Mozart. And one of our, you know, early customer requests was to actually, like, shorten that snippet. Now, we loved it. It always sort of led to a small moment of delight. But as internal customers, we started to really empathize with our external customers that were starting to get annoyed by the little joke and pun that we had baked into the product. So there are, you know, some serious answers like Dan shared, but I would also say that that's the whole point of essentially using your product, which is to understand its day-in, day-out usage.
[00:47:47] Unknown:
As you have been building out the product, working with your customers, and trying to improve the overall accessibility of these data infrastructure improvements, what are some of the most interesting or unexpected or challenging lessons that you've learned, whether technical or business or product?
[00:48:04] Unknown:
So I think of some of the most interesting problems as very business related as well. It sort of gets back to: how do you get value out of a data team at a company? And it's a little bit different than, like, a sales team. A sales team is easily measured, or typically a marketing team is measured by, you know, the cost per lead or the cost per win or whatever it is, and then, you know, you can compensate people accordingly. Demonstrating value in the data space is a lot more ambiguous. It's kind of: you know it when you see it. But what we found is that, you know, companies sometimes come to the table saying, oh, well, we know that we wanna use our data. Right? Like, data-driven is such a cliched term, and we know that a lot of folks, you know, get assigned by their boards of directors, or whomever,
like, well, we really need to leverage our data and use our data. The original insight, which has continued to actually surprise me, is that there is this new set of data consumers that are incredibly technical and savvy, in roles that don't have the title data. And two, that, you know, for organizations to get a lot of the value out of data, they do have to be intentional about it. So going about it as, someone assigned me to think about this, tends to maybe yield one or two wins in the short run. Another thing that I would say has been surprising is the way data consumption massively takes off. So, you know, it's funny. Like, humans are exceptionally bad at understanding exponential growth. And, you know, data consumption often looks like that: you start out by saying, okay, like, maybe it's this one data source producing, you know, a certain amount of data. And then, you know, as you start to essentially combine data sources, or get hungry for more data sources as the company scales, data starts providing value, and data volume scales, which is actually why you see a lot of usage-based pricing in data tools broadly.
So I would say these are all sort of novel things that we found out about our customers and our customers' usage. But again, still the hardest thing is to put value on the data you consume. Like, how do you measure it? Okay, like, our ads are x percent more effective. Well, that actually, you know, does translate into a cost per acquisition that has some meaning. But how can you measure whether the product is x percent more effective at, you know, driving a customer to stay on the platform or to use the platform? It all becomes very hard to measure the broader value of data, and that's, like, a challenge for not just our company, but for many companies in the data space: how do you translate the value that you're adding to an organization, and how does that surface and become apparent, you know, to that organization, and often to executives in the organization?
[00:51:00] Unknown:
For people who are interested in being able to take advantage of the modern data stack and simplify the overall process of getting it set up and integrated, what are the cases where Mozart is the wrong choice and they might be better suited doing it themselves or building out their own abstraction layer on top of it all?
[00:51:20] Unknown:
It sort of looks like a barbell, which is to say: on one end, if you have nobody in your organization who will be spending a lot of time cleaning data, or who is committed to the cleaning and consumption of data, you'll struggle. Dan tends to put the bar at having a SQL writer within your organization. Organizations without one, and without a plan to hire somebody who could be a data analyst, struggle to get a lot of value out of our tool, and they certainly struggle to get a lot of value out of the modern data stack. Part of the modern data stack is the ability to join data across many tools, so typically it's about an intentional investment.
And on the flip side, if you already have a lot of existing infrastructure, you've built out pipelines and made investments in your existing stack, and it's working for you, then tearing it down to put in the modern data stack is often not the solution with the greatest ROI in the short or medium run. So I would say it's kind of a barbell in that respect.
[00:52:31] Unknown:
As you continue to iterate on the Mozart platform and product and keep an eye on the ways that the modern data stack is evolving, what are some of the things that you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[00:52:52] Unknown:
I'd say we're always building more connectors, so always more ways to get data in easily, more ways to manipulate data once it's in there. And then, like I said, we're always looking at the cutting edge of tools that people are building that you can hook up to your data warehouse to get extra value. We try to figure out what the core value is, what the core feature sets are that people are using these tools for, and whether we can, and should, bring some of that into Mozart. Currently, the way that Mozart is set up, we create a Snowflake account for you and we manage that for you. By Q1 of next year, so very imminent, if you already have a Snowflake account, you can just hook Mozart up on top of that rather than having to use our Snowflake management and a separate database. That'll solve some of the problem Pete was talking about on the right side of the barbell: if you already have a Snowflake warehouse and you've got a bunch of stuff built on it, we would no longer ask you to throw that away and use us. Instead, you can use us in addition.
[00:53:49] Unknown:
Our strategy is to erode the parts of the world that answer is true for. You asked where Mozart is not a solution, and we wanna be honest about where the product is today. There are two ways in which we focus on improving and expanding the product. One is to shrink the set of people for whom it is not an ideal solution, and the other is to make it an even more ideal solution for the people it already fits. Dan just touched on one of those two ends, but we're also trying to make it easier for the lower tech user as well. So we wanna expand in both directions from where the product is today.
[00:54:30] Unknown:
Are there any other aspects of the work that you're doing at Mozart or the overall modern data ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:39] Unknown:
I think the only thing I'd add, and we have actually touched on it throughout the show, is that a lot of what's new, a lot of what's modern, is old. The problems being tackled by the modern data stack, and that I see great teams, both data teams and tool building teams, trying to tackle, are kind of the problems of old. People think of them as largely new and novel, but I still see the problems as getting started or using more data, that being the typical challenge, as opposed to somehow managing to find and extract even more value from a dataset you're already working with.
[00:55:20] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:43] Unknown:
So I think the biggest gap in tooling today is actually just widening the set of tools that data collection is easy from. It's the two ends of the data pipeline spectrum. One is having the ability to apply that data: the reverse ETL tooling, where we see so many great new companies making real progress, finding ways to do things in the data warehouse and then take that output and essentially apply it or operationalize it. And then, on the other end of the spectrum, a great tool like Fivetran has hundreds of connectors, but there are what feels like an almost infinite number of SaaS tools. In practice, it's probably millions.
So I think that there is an incredibly long tail of tooling that adds immense value across industries, and we've only scratched the surface of what's automated and easy. And then on the contact side, I'm pete@mozartdata.com.
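As an aside for readers: the reverse ETL pattern Pete mentions above, doing work in the data warehouse and then pushing the output into operational tools, can be sketched in a few lines of Python. Everything here is a hypothetical placeholder: the sqlite3 connection stands in for a real warehouse driver, and the customer_scores table and CRM endpoint are invented for illustration.

```python
# A minimal sketch of the reverse ETL pattern: read a modeled table out of
# the warehouse and push each row to a downstream SaaS tool.
import json
import sqlite3              # stand-in for a real warehouse client library
import urllib.request

conn = sqlite3.connect("warehouse.db")  # placeholder warehouse connection
rows = conn.execute(
    "SELECT email, lifetime_value FROM customer_scores"  # hypothetical modeled table
).fetchall()

for email, ltv in rows:
    # Operationalize the warehouse output: send it to a (hypothetical) CRM API.
    req = urllib.request.Request(
        "https://api.example-crm.com/contacts",  # placeholder endpoint
        data=json.dumps({"email": email, "ltv": ltv}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```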
[00:56:55] Unknown:
And I'm dan@mozartdata.com, or you can go to www.mozartdata.com, book a demo, or start a free trial.
[00:57:00] Unknown:
We'd be happy to talk about your data problems, whether or not that involves Mozart Data. Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Mozart, the ways that you're working to make it easier to actually take advantage of the technological capabilities that the modern data stack has brought about, and to smooth the on-ramp for people who don't necessarily want to deal with all of the integration challenges that come along with that. I appreciate all the time and energy that you're each putting into that and helping to make data more accessible to more people. So thank you again for taking the time today, and I hope you each enjoy the rest of your day. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
Introduction and Sponsor Messages
Interview with Peter Fishman and Dan Silberman
Peter Fishman's Background in Data
Dan Silberman's Background in Data
Overview of Mozart Data
Building the All-in-One Data Platform
Challenges in the Modern Data Stack
Target Users and Industry Agnosticism
Experience and Background Knowledge Required
Engineering the Mozart Platform
Low Floor, High Ceiling Design Principle
Progressive Exposure and User Control
ETLT and Data Modeling
Typical Workflow with Mozart Data
Challenges and Opportunities in the Modern Data Stack
Interesting Use Cases and Applications
Using Mozart Data Internally
Lessons Learned
When Mozart Data is Not the Right Choice
Future Plans and Exciting Projects
Final Thoughts and Contact Information