Summary
The modern data stack has been gaining a lot of attention recently, with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to run a scalable, production-grade data platform with a small team, but there are still sharp edges and integration challenges to work through. Peter Fishman and Dan Silberman experienced these difficulties firsthand and created Mozart Data to provide a single, easy-to-use option for getting started with the modern data stack. In this episode they explain how they designed a user experience that makes working with data more accessible to organizations without a data team, while allowing more advanced users to build out more complex workflows. They also share their thoughts on the modern data ecosystem and how it improves the availability of analytics for companies of all sizes.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription.
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
- Your host is Tobias Macey and today I’m interviewing Peter Fishman and Dan Silberman about Mozart Data and how they are building a unified experience for the modern data stack
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Mozart Data is and the story behind it?
- The promise of the "modern data stack" is that it’s all delivered as a service to make it easier to set up. What are the missing pieces that make something like Mozart necessary?
- What are the main workflows or industries that you are focusing on?
- Who are the main personas that you are building Mozart for?
- How has that combination of user persona and industry focus informed your decisions around feature priorities and user experience?
- Can you describe how you have architected the Mozart platform?
- How have you approached the build vs. buy decision internally?
- What are some of the most interesting or challenging engineering projects that you have had to work on while building Mozart?
- What are the stages of the data lifecycle that you work the hardest to automate, and which do you focus on exposing to customers?
- What are the edge cases in what customers might try to do in the bounds of Mozart, or areas where you have explicitly decided not to include in your features?
- What are the options for extensibility, or custom engineering when customers encounter those situations?
- What do you see as the next phase in the evolution of the data stack?
- What are the most interesting, innovative, or unexpected ways that you have seen Mozart used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Mozart?
- When is Mozart the wrong choice?
- What do you have planned for the future of Mozart?
Contact Info
- Peter
- @peterfishman on Twitter
- Dan
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's a-t-l-a-n, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's l-i-n-o-d-e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Peter Fishman and Dan Silberman about Mozart Data and how they're building a unified experience for the modern data stack. So, Peter, can you start by introducing yourself? I'm Pete Fishman. I go by Fish, and I am the cofounder and CEO of Mozart Data. And, Dan, how about you? Hi. I'm Dan Silberman, cofounder and CTO of Mozart Data. And going back to you, Pete, do you remember how you first got involved in data? Like many people in the data space, I am a failed academic.
[00:02:26] Unknown:
So after actually finishing the PhD and realizing I really loved the Bay Area and wanted to stay in tech, I sort of fell accidentally backwards into applying those empirical skills that were ground out through my many, many years in college and grad school, and turning that into a career in technology.
[00:02:47] Unknown:
And, Dan, do you remember how you first got involved in data? I first got involved in data when I wasn't officially a data engineer. I was more of an application engineer, but I've kind of always worked at smaller companies that didn't have dedicated data engineering teams, so I've just kind of been picking it up as I've gone along, working with analysts throughout pretty much my whole career. Yeah. It's definitely how I think the majority of people who call themselves data engineers now ended up in the
[00:03:13] Unknown:
role. And so in terms of the Mozart Data project, I'm wondering if you can just give a bit of an overview about what it is that you're building there, some of the story behind how it came to be, and why you decided that this was the problem that you each wanted to focus your time and energy on. Well, a couple of things. It starts with the fact that I think the best projects tend to be scratching your own itch. Dan and I were both kind of wanting
[00:03:36] Unknown:
to start something together. We've been good friends for 20 years. And, you know, we sort of thought, what is the ultimate combination of our two skill sets? You know, Dan's being sort of engineering and mine being sort of more data science. That ends up looking like something in the data space, where we've both been working for the last, you know, 15 years apiece. And, really, it started with: what are the tools that we most loved building, or thought were really critical or maybe overlooked more broadly, at our last few jobs? So we really thought about the tools that we consumed on a day-in, day-out basis that companies were essentially spending lots and lots of money on people like myself and Dan to build and provide internally.
[00:04:26] Unknown:
So in terms of the specifics of what you're building there, I know it's focused around the so called modern data stack, which has become the latest entry in buzzword bingo. So I'm wondering if you can just start by giving a bit of a sense about what it is that you're layering on top of that and why you think that that is where the particular area of focus can and should be, at least for the efforts that you're deciding to build this company around?
[00:04:50] Unknown:
We're basically building, you know, the all-in-one data platform. So we handle pulling data into a data warehouse. We manage a data warehouse for you. We're using Snowflake under the hood, and then various tools for scheduling data transformations once data is pulled out of your tools, you know, observing how data is flowing through your pipelines, getting notifications of failures, some tools for cataloging and managing data governance, that sort of thing. We kinda try to handle all of the core data engineering that teams are building the world over and let you focus on, you know, the specifics of your data and how you wanna organize it. And then you can use tools like, you know, charting tools, BI tools, or go into machine learning tools. Whatever you wanna do, kind of getting all of your data into one centralized data warehouse and then organizing it, we wanna make that as easy as possible.
[00:05:48] Unknown:
The whole idea around the modern data stack is that it's intended to make all of those infrastructure and engineering aspects easier, or sort of obviate them in certain ways, where the idea is that you can just throw a credit card at the problem and set up Fivetran and Snowflake and dbt and Looker, and you've got your data engineering. You're done, and you don't need to worry about all the specifics. But as we all know, there are all the integration aspects that go along with it. I'm wondering what you identified as the missing pieces in this modern data stack ecosystem, and what people are deciding to try and build around, that make something like Mozart useful or necessary for engineering teams or organizations that maybe don't have a dedicated data team. I think the first thing that I wanna call out, where I'm in deep, deep agreement with you, is that it's really magical to think about: you know, the buzzword, the modern data stack. What it really is talking about is this evolution from needing a small or not-so-small team of data engineers,
[00:06:48] Unknown:
a sort of big budget, for a data warehouse just to get started. And today, that doesn't look like it at all. Today, like you said, through a variety of tools, you can get started with pretty much, like, swiping a credit card, you know, not just with tools like Mozart Data, but with all of the components of the modern data stack. For the most part, there's a self-service offering or even a free offering that you can just get started with. So it's really incredible: the bar a decade ago used to be you'd hire a few data engineers and you'd sort of toil and test some different technologies for a few months. Today, that looks like, in an afternoon, you can basically be spun up with world-class data infrastructure.
So when we think about kind of the opportunity in technology, it's to be opinionated and bet on what you think are the core winners of sort of a landscape. So that's really what our product does. It's trying to make some of these core winners even more accessible. So on the one hand, we talk about, like, how incredibly trivial it is to spin up a modern data stack. But in practice, that's not the case. Now if you're an experienced data engineer that's been through the rough and tumble before the days when this was an easy, simple credit card swipe and lots of simple button clicks just to get started, you might say, well, no. Actually, it is crazy easy. And, yes, it is much easier than it was, you know, even 5 years ago. It's still not easy enough. Right? We think of it a little bit like, you know, some of the workout apps that you can find on your phone. It's really easy to get started with those. They try to make it really easy for you. But, actually, still the hardest part is just getting started, especially if you're a novice that is unclear about where to begin.
So I challenge one premise, which is to say that the modern data stack is very easy to get going with. There's still a lot of debate or disagreement about what's the right tool for a given type of company. So there ends up being a lot of tool exploration, in addition to, you know, making all the pieces work together, which is true in theory but not always true in practice. So as a practitioner, diagnosing where issues come up is not always trivial, even though these pieces, in practice, are often used by many, many different companies together as one. Just having it be a singular experience is definitely not the current state of the world.
[00:09:26] Unknown:
And as far as the particular use cases and workflows that you're focusing on, or maybe any industry verticals or horizontal layers, and the target end users that you're focusing on, I'm wondering if you can give a sense about how you thought about that as you started to design and build out the Mozart platform, and some of the ways that those focuses in terms of persona or industry have informed and helped you with prioritizing the features and user experience of the system?
[00:09:55] Unknown:
I would say we're very industry agnostic, intentionally. The questions that a finance company or a healthcare company or a gaming company has are different, and the data is different, but the tooling that you need to answer those questions is pretty much the same regardless of industry. So I would say that the thing that's very consistent with our tool is that we're a tool for data analysts. You generally do need to know SQL to get a lot of value out of our tool. But I actually shouldn't say data analysts. Like, a lot of our customers would not call themselves data engineers and would not call themselves data analysts. They're people that have titles like marketing ops or sales ops, who've maybe picked up some SQL along the way. And I think this is a pretty big, growing class of person that has learned some SQL and some analysis skills over the course of their career so that they could do their job better. You know, they often work at companies that would never have a large team of data engineers supporting them, but the tooling is getting to the point where they don't need a team of data engineers to kind of have access to the tooling that they need to answer their questions.
[00:11:03] Unknown:
Going back to the sort of sharp edges or problems with the ways that the modern data stack has been manifesting, and some of the ways that you're thinking about Mozart Data, I'm curious what you have seen as the necessary level of experience or background knowledge to be able to actually effectively build and integrate the various components of the modern data stack. How are you thinking about smoothing the on-ramp for people who maybe don't have all of that necessary background expertise, and building in some sort of gradual exposure of complexity, you know, making the easy things easy and the hard things possible? And just how do you think about that aspect of the overall problem of being able to put data to work and actually gain value from it without necessarily having to hire on a full team of data engineers and data scientists to, you know, spend months on the problem?
[00:11:59] Unknown:
Well, I think it, you know, starts back with something that Dan said, which is we really wanted the bar for being able to contribute to be the ability to write basic SQL statements. So there's sort of a theme that you'll see in some data tooling, which is there are business users that understand the business definitions. They understand, not necessarily what columns mean, but what they want to track and the nuances of their data in order to track them. So, you know, Dan and I, before we started this data company together, had actually, a decade ago, started a hot sauce company together. And it was actually a hot sauce company on Shopify. And, you know, we wanted to, you know, do things like report on our customers.
And the count of customers in Shopify was right. But really, we wanted to get a count of customers that didn't have the last name Fishman or Silberman, which was obviously, in the early days, disproportionately way too many. And when I think about that, that's a specific example of business knowledge that the business user would understand: they wanna actually know what their real customer traffic looks like. But it's not necessarily obvious to somebody that's just simply collecting the data. So we wanna put the power in the business users' hands. Now typically, that looks like a set of data engineers that are playing sort of telephone, with, you know, ambiguous requirements and a lot of frustration on both sides. So, you know, you sort of have this movement to empower business folks in the BI tool. Right? So that the BI tool is interacting with a bunch of very clean tables that have sort of sources of truth hooked up to them.
That doesn't really happen magically. It doesn't happen, like, automatically. It happens through a lot of pain. And the more you can empower, like, higher up in the funnel, the less sort of shared pain in that organization there is. Getting back to your question, this to me is sort of the evolution of where, not so much the failure of the modern data stack, but the problems of the modern data stack begin, which is: just because it's easy doesn't mean that it's producing more sort of magic. And even though the tooling, aside from being more powerful, is also, like, better integrated, there are still so many problems. You can think of them as downstream problems that you wanna work back from, and we think that there's incredible opportunity there, especially as that population Dan alluded to, you know, the folks that are in marketing ops, rev ops, biz ops, sales ops, you name it, blank ops, are really becoming SQL proficient, becoming sort of data savvy, data proficient.
As that population grows, you're gonna see more of those sort of downstream problems. As amazing as all the tooling upstream of that is, it's almost irrelevant if, you know, we're creating more problems downstream. So that's kind of where we saw the modern data stack underserving organizations, especially smaller businesses, especially smaller businesses that don't have, you know, a giant data team to play many, many, many, many hours of telephone with. So we see that falling down kind of just in the actual real practitioner using data, getting value out of data, running into roadblocks in the setting up of the data, and we wanted to make that as easy as possible. I can give one specific example of something we're thinking about now: data governance.
[00:15:47] Unknown:
So, like, the, you know, the sales ops, marketing ops, the operations people, they might think in terms like, hey, we want the marketing team to have access to this data, and we don't want, you know, this other team to have access to this data. And, you know, the sales team needs access to some other data, and they know that they can't have access to the HR data. And ideally, you know, all of the data is in the same data warehouse, so the data engineer has to kind of translate those requirements into database roles and users and all the permissioning that goes along with that. So we try to find kind of, you know, simple interfaces that the operations sort of thinking can be translated into, and obfuscate, you know, the actual roles and permissions that exist in the database.
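For a concrete flavor of that translation, here is a minimal sketch assuming Snowflake-style GRANT semantics; the role and schema names are entirely hypothetical, and this is an illustration rather than Mozart's actual implementation:

```python
import snowflake.connector  # pip install snowflake-connector-python

# Hypothetical mapping from "ops-speak" access requests onto warehouse roles.
# Note that HR data is deliberately granted to neither team.
TEAM_GRANTS = {
    "MARKETING_ROLE": ["ANALYTICS.MARKETING", "ANALYTICS.WEB_TRAFFIC"],
    "SALES_ROLE": ["ANALYTICS.CRM"],
}

def apply_grants(conn) -> None:
    cur = conn.cursor()
    for role, schemas in TEAM_GRANTS.items():
        cur.execute(f"CREATE ROLE IF NOT EXISTS {role}")
        for schema in schemas:
            # Let the team see the schema and query everything in it today;
            # a real system would also handle future tables and revocations.
            cur.execute(f"GRANT USAGE ON SCHEMA {schema} TO ROLE {role}")
            cur.execute(
                f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO ROLE {role}"
            )
```

The point of the interface is that the operations user only ever sees the team-to-data mapping at the top, never the GRANT statements underneath.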
[00:16:38] Unknown:
In terms of the actual platform that you've built and some of the engineering work that you've done to be able to paper over some of these complexities of the data stack, I'm wondering if you can talk about the overall system that you've built and maybe where you actually started the engineering effort to be able to iterate from and understand how to best explore this space of being able to present the modern data stack as this unified experience that people who don't necessarily think about themselves as data engineers or analysts are able to use effectively.
[00:17:09] Unknown:
I'll start by saying we had a healthy set of design partners. So first off, you know, Dan and I have both built this product a number of times at different companies. I spent the biggest chunk of my career at Yammer, and at Yammer, we built a tool called Avocado. And Avocado was sort of an inspiration for some combination of Mozart and Mode Analytics, which are the two companies that came out of the brains of a lot of people on that team. And what that looks like is: what are the tools that basically get data to a central place, and then the tools that can visualize that and share that within an organization?
You know, I think it started with what have we built kind of in the past and what has kinda served us. And Dan, of course, has done something similar at Clover, his most recent job. So we had sort of some experience building tooling like this. On top of it, we work with some great people that have lots of opinions about how they want their data. So we started sort of not with a bunch of paid customers, but with a bunch of people that were willing to give us feedback on what we were building. So we found folks that were willing to sort of, you know, trust our data experience and ask us for advice in a variety of data domains. And we asked those people to tell us what was important to them. What is it that kind of they most wanted from a data tool, or where were they running into roadblocks as they tried to spin up their data infrastructure?
So we had a lot of great inspiration before we got started.
[00:18:45] Unknown:
And so as far as the actual platform itself, can you talk to some of the pieces that you've engineered, and how you have done the work to actually tie together these various elements of the data stack? Maybe some of the pieces that you have been able to take off the shelf and just add a facade over, so that people don't need to know that they're using whatever the managed service happens to be? And how do you think about the build versus buy decision, about which pieces to use off the shelf versus which pieces you need to actually engineer, to be able to provide that overall experience that you're aiming for? So I would say our system is, you know, ETLT.
[00:19:23] Unknown:
We don't really care to have the argument of should it be ETL or ELT. You should extract data from your various systems, transform it a bit, load it into a centralized data warehouse, and then do more transforming on top of that. We use a company called Fivetran for a lot of our initial connectors. That's definitely something I would highly recommend that you buy. You know, a lot of people have built ways to pull data out of, you know, other databases and put them into another database or pull data out of, you know, Salesforce or Google Ads, Facebook Ads, you know, hundreds of different tools.
Those connections have already been built, whether it's Singer taps or Stitch or Fivetran, etcetera. So we use Fivetran for a lot of that. We've built some of our own, and we generally use the Singer framework when we're building our own. And then we obviously did not build a data warehouse. We're using Snowflake. So, I mean, for some context, v1 of our product was: you sign up for an account, and we'll create another Snowflake account, create various users and roles and permissions for those, automatically create a Fivetran group, and load the Snowflake account that we just created in as your Fivetran destination, give it its own user and roles, etcetera.
And then we basically provide, you know, a very simple interface where you can write SQL to query your data warehouse, and then you can say, actually, I want this query to be scheduled to run every hour and to create this other table. So you can kind of start building data pipelines. That was v1, and that is basically, you know, the core of the product: the ability to connect your different sources of data, replicate it into a data warehouse, and then start transforming it to basically organize it better for any downstream purposes.
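As a rough illustration of that "schedule this query to materialize a table" loop (a sketch under assumed names, not Mozart's actual code):

```python
import time

import schedule  # pip install schedule
import snowflake.connector  # pip install snowflake-connector-python

# The user's transform: an ordinary SELECT materialized as a table.
# Schema and table names here are hypothetical.
TRANSFORM_SQL = """
CREATE OR REPLACE TABLE analytics.orders_clean AS
SELECT id, customer_id, amount, created_at
FROM raw_shopify.orders
WHERE NOT test_order
"""

def run_transform() -> None:
    conn = snowflake.connector.connect(
        account="...", user="...", password="...", warehouse="TRANSFORM_WH"
    )
    try:
        conn.cursor().execute(TRANSFORM_SQL)
    finally:
        conn.close()

# "Run every hour and create this other table," as described above.
schedule.every().hour.do(run_transform)
while True:
    schedule.run_pending()
    time.sleep(60)
```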
And then on top of that, since then, we've just kind of been layering on, you know, notifications when there are failures, ways to observe how data is flowing through your different pipelines, ways to catalog your data, and, like I was talking about a minute ago, you know, ways to layer on data governance, things like that.
[00:21:30] Unknown:
And as far as the engineering projects that you have focused on as you're building out this platform and trying to tie together the experience, I'm wondering what have been some of the most interesting or challenging aspects of putting together this platform and thinking about how to design it in a way that's approachable for people who don't necessarily want to spend all of their time on the engineering aspects, but still flexible enough to support people who do.
[00:21:58] Unknown:
So I think you actually hit on something earlier. We wanna have a low floor and a high ceiling. And that was one of the key design principles, the low floor being: anybody that can write SQL can be a data engineer. And the idea being you don't need to hire a data engineer until you're hundreds of employees. And this is sort of a radical perspective, or a radical opportunity, that I think many in the modern data stack share, which is to say, kind of like Dan was mentioning, now that extracting data from a set of tools that are very standard across tech companies has become sort of a solved problem, the challenge becomes, like, how do you, you know, get access to that data, understand that data, give the right permissions on top of that data.
So I would say that when we think about kind of diving into this problem,
[00:22:56] Unknown:
you know, we started there. One thing that we do a little bit differently from most of these platforms is how we approach building pipelines. With tools like Airflow or dbt, you need to tell those systems, you know, basically: do this task, and then when that's done, do this task, and then when that's done, do these two tasks, and when those two are done, do this task, things like that. We've built a system for parsing through the SQL that our customers write to determine how their tables are actually connected, without them having to tell us, and then we can present that to them as: here is the data lineage in reality, regardless of kind of how you imagine it should be, which takes a big step away from the user. And then, sort of related to that, we try to focus on features that we can add, being the entire platform, that are a lot harder or impossible to do if you're connecting a bunch of individual tools. So one good example is, since we're handling both, like, the ETL and the next transform layer, we can kick off pipelines as soon as the data lands from your different tools.
Whereas, if you're combining an ETL tool and a separate transform tool, it's either hard or impossible, depending on the tools, to kind of have those play really nicely together.
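The SQL-parsing idea Dan mentions could be sketched, under assumptions, with an off-the-shelf SQL parser such as sqlglot; the transform names and queries below are made up, and Mozart's actual parser may work quite differently:

```python
import sqlglot
from sqlglot import exp

# Hypothetical transforms: output table name -> the SQL that builds it.
transforms = {
    "orders_clean": "SELECT * FROM raw_orders WHERE NOT test_order",
    "daily_revenue": """
        SELECT o.created_at::DATE AS day, SUM(o.amount) AS revenue
        FROM orders_clean AS o
        LEFT JOIN refunds AS r ON r.order_id = o.id
        GROUP BY 1
    """,
}

def infer_lineage(transforms: dict[str, str]) -> dict[str, list[str]]:
    """Derive each output table's upstream tables by parsing its SQL."""
    graph = {}
    for out_table, sql in transforms.items():
        tree = sqlglot.parse_one(sql, read="snowflake")
        sources = {t.name for t in tree.find_all(exp.Table)}
        graph[out_table] = sorted(sources)
    return graph

print(infer_lineage(transforms))
# {'orders_clean': ['raw_orders'], 'daily_revenue': ['orders_clean', 'refunds']}
```

From a graph like that, a platform knows orders_clean must run before daily_revenue without the user ever declaring the dependency.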
[00:24:22] Unknown:
In the sort of integration aspect, I'm wondering, as you've been building out, tying together Fivetran and Snowflake and the transformation layers, and being able to, you know, build out the presentation layer and maybe integrate with some of the reverse ETL frameworks, what have been some of the layers in the stack or the points of integration that have had the most friction that you've had to deal with?
[00:24:45] Unknown:
This kind of is getting into a bit of the details. When we started with Snowflake, their top-level abstraction was the account. So we had a Snowflake account, and then within an account, you can have databases and warehouses. Snowflake is a little bit different than some of the other data warehousing platforms in that they separate compute from storage. So we had our Snowflake account, and then for each customer, we would create a new database and then a new warehouse. That led to some trouble, where a lot of tools connect to Snowflake and assume that, for the account they're connecting to, they can have, like, full account admin access. They can have access to the various system databases that Snowflake has. And so that gave us some trouble, where some tools wanted, like, basically full access to the account, and we couldn't do that without exposing, you know, our other customers' data. Since then, Snowflake has added another layer of abstraction on top of account: the organization. So now an organization can have many accounts. So now we're migrating our architecture to be, you know, Mozart is the organization, and then each of our customers gets their own Snowflake account, which solves a lot of these problems. But along the way, that has definitely been something that we've struggled with. That's true for many tools. You hit on a few, but, you know,
[00:26:05] Unknown:
reverse ETL, obviously data warehousing. You know, a lot of the initial kind of use cases for these tools weren't to sort of put them all together
[00:26:15] Unknown:
in an all-in-one. So digging more into what you were saying about having a low floor and a high ceiling, where you want to be able to give people the ability to dig deeper into the stack once they gain a certain level of familiarity or comfort with it, I'm wondering what have been some of the areas of kind of progressive exposure that you have engineered into the product, some of the stages in the lifecycle that you work hardest to automate, and the pieces that you feel need to be in the control of the end user because they're very specific to how they run their business or what they want to be able to ask and answer questions about.
[00:26:49] Unknown:
Dan just touched on, you know, something, which is: this is your data. So you have access to your Snowflake warehouse, so anything that you want to connect, you can, and we have, you know, customers that connect many, many data tools to their data warehouse. I think the high ceiling is the part that's interesting, which is, you know, how, as somebody picks up additional data proficiencies and has additional sort of data interests or data needs, and this is a challenge for any company building any tool, how can you both serve a novice but yet expose these sort of escape hatches? So that, you know, an expert that walks in can go deeper, or maybe, you know, you start using data where a bunch of folks that are more novice with it start consuming the data, start getting value out of the data, and then you as an organization say, actually, you know, we're getting so much value out of the data. What we wanna do is hire a high-powered data engineer or a data scientist that can really take that data to the next set of challenges and the next level and extract even more value out of it, because we can see, you know, the business user getting a lot of value out of it.
And, you know, when they arrive in an organization, how do you enable, like you said, a power data user that is already coming in with a variety of experiences and expectations, and yet at the same time enable a novice? And I think it's not just unique to data tools. This is basically a dilemma that almost every company faces, a growing company that can sort of, you know, sell beyond a niche. Like, how do you essentially serve a wide spectrum of users? And I think it's all about little, like, tricks in terms of making it very clear where the sort of, I'll call them escape hatches, to the advanced levels are.
[00:28:43] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold's proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. As far as the sort of automation aspects, one of the real benefits of using a system such as Mozart is that a lot of the opinions come baked in. The customer doesn't have to form their own. They can just say, whatever you say is fine. I'll just take what you've given me and work with it from there. And one of the probably longest-running debates in the data ecosystem comes down to modeling in the warehouse, and different generations of warehouse technology have led to different conclusions or sort of trends within that. And I'm wondering how you have approached that aspect of landing people's data into the warehouse and then maybe setting up the initial set of models for them to be able to work from, and how you think about how much control to provide early on. What are the knobs for people to be able to say, don't do anything, I just want you to land the data, I'll do all the transformation, versus, I just want you to hook this up into my data warehouse with the semantic layer prepopulated?
I don't wanna have to think about all of the minutiae having to do with data modeling and, you know, building out the business metrics.
[00:30:48] Unknown:
To me, honestly, the debate has been settled in my mind just for practical purposes. There are so many good tools that have built these connectors. They don't know your specific needs. So if you're going to use a tool like Fivetran or Stitch, you have to accept that they're gonna land the data in your database in, you know, a fairly generically useful schema. And I think it's so useful to be able to use tools like that, and it saves your team so much time, that rather than writing some ETL yourself, you know, maybe in Python, to reorganize it specifically how your company needs the data, you should just use a tool like that and let it land in your data warehouse.
Also, you know, for practical purposes, data warehouses are so powerful and relatively cheap now that you might as well do the transforming specific to your company's needs in the warehouse. And, you know, if there are actually, you know, 5 different needs for some data, then build, you know, 5 pipelines and end up with 5 different tables that different teams are building their dashboards on, or combine it with other sources.
[00:31:53] Unknown:
So I think for practical purposes, ETLT is just the way to go, and accept that it's gonna land in your warehouse in possibly not the ideal state, but that's not really a problem. It's also interesting to talk about the sort of semantic layer that has been gaining a lot of attention lately and its role in the modern data stack, and how you think about it at Mozart, because that is the layer closest to the business use case: you need to have these specific metrics defined to be able to aggregate across these different dimensions and be able to ask and answer questions rapidly.
I'm curious how you're thinking about integrating that and simplifying the experience for people, as this is an area that is seeing so much, I don't wanna say volatility necessarily, but so much activity in terms of what the experience is actually going to look like, where there isn't really any consensus around it yet? Yeah. It's a good question. I would say we don't solve this super well right now. We have some
[00:32:49] Unknown:
cataloging functionality. Right now, we have a couple customers that are using various tools on top of Mozart for that. I might have a better answer in a few months. Kind of in general, how we look at this is, like, we try to look at the tools that are developing around any part of the stack and try to understand, from talking to our customers and talking to other people, you know, what is the majority of the value that people are getting out of these tools, and can we replicate some of that in Mozart? And we leave it as: if you really, really wanna go deep on this, you're probably gonna wanna use a tool that's dedicated to just that. If you're okay with getting kind of the majority of the value in a way that we can bring it, then we'll try to do that. I don't think a lot of
[00:33:33] Unknown:
these types of problems get surfaced in, like, an early data evolution. So, you know, this might be sort of debt that you're creating in terms of, you know, maybe multiple definitions, or maybe, you know, computing multiple columns with similar meanings. But, you know, I think of those as more enterprise problems, for larger sort of companies with a variety of legacy reporting and legacy data and legacy systems. So we haven't, like, jumped into solving this problem, mostly because we think of it as not quite aligned to our initial customer. Now I've worked at a number of sort of large organizations in my career, and these are real problems that these companies face at late stage, or now you start to see them earlier and earlier. So these are real problems that these companies are solving, and we see kind of our biggest customers starting to face a lot of these challenges. But I would say that for early companies, I almost exclusively see the problem as, you know, how to get started, how to bring in a new dataset. A lot of the value comes from just the ability to consume the data in any way, shape, or form, which is, you know, why sort of the most popular BI tool in the world is still Excel, because it's really good at getting really anyone, you know, using data in some way. So I still see the hairiest problems to tackle as really being on the getting-started side. We do find that as customers mature, they start running into more problems in that domain.
[00:35:17] Unknown:
For somebody who is using Mozart data, I'm wondering if you can talk through a typical workflow of getting set up with it and then being able to build out a set of analyses and maybe how it might coexist with an existing data infrastructure where maybe you're working at a larger organization, but you want to have an easier experience for some of the business users so that they can, you know, fulfill maybe 80% of their use cases and let the core data team focus on the gnarliest problems.
[00:35:46] Unknown:
In large organizations, sometimes specialized teams are queued up behind, quote, more pressing data engineering requirements or needs of other teams. So when we think of getting started with Mozart, it's important to us that it's incredibly easy, like shockingly, jaw-droppingly easy. So what that typically looks like is it just sort of follows the logical flow of data. So it starts by connecting sources. So if you're using standard tooling, maybe you're in the B2B space, a very common CRM is going to be Salesforce. Maybe you're using HubSpot. Maybe you're doing some ads on Google. You know? You've got some databases.
A lot of the SaaS tools, through tools like Fivetran, are able to be extracted and loaded just with credentials. So it really just starts by getting data in. There's making that initial connection, there's testing that connection, and then that data starts syncing. If you've got a lot of data, you do a Julia Child, where you show up, like, an hour later or a day later, and then it's magically there. Otherwise, you know, you can be connecting your data and then doing simple select statements or data cleaning, and then hooking up your favorite BI tool to your data warehouse, all in under an hour. So you can be writing that first, you know, report, that query. We like to get customers saying, what is the one report that they really want, that they really want to get done, and then let's go through the data that we need and the steps that we need in order to make that happen. That can typically be done in an hour. And then, you know, now it's a one-click refresh that's good forever.
And that's incredible. I see customers' faces light up in those moments. So for us, we really focus on making that incredibly
[00:37:39] Unknown:
fast and easy. There are generally two types of initial goals that our customers have. Either, you know, they have these different tools, maybe Facebook ads and Google ads, and, you know, each of those tools has good reporting capabilities, but they don't have a way to combine that data and, you know, look at it alongside the place that actually has the revenue, whether that's, you know, Stripe or Shopify or something like that. So they wanna be able to combine these different data sources and build a report that needs multiple different sources. That can be kind of the initial goal. Another one is, you know, similar to that, except they already do have that report. They just spend, you know, 3 days a month putting it together, downloading CSVs, putting them into Excel, pivot tabling, and then they have their monthly report. And it's just reproducing that, but in a way that, you know, doesn't have a month of data lag. You can have that fresh every hour, and it doesn't take, you know, 3 human days to do it. It takes maybe 1 human day to set it up once, and then it's just automated every time.
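For a concrete flavor of that first goal, a hedged sketch of the kind of blended report Dan describes, written as a warehouse transform; the Fivetran-style schemas and column names are assumptions, not Mozart specifics:

```python
# Snowflake SQL held in a Python string, as it might be scheduled as a
# Mozart-style transform. All schema, table, and column names are made up.
BLENDED_MARKETING_SQL = """
CREATE OR REPLACE TABLE analytics.marketing_daily AS
WITH spend AS (
    SELECT date, 'facebook' AS channel, SUM(spend) AS spend
    FROM facebook_ads.ad_insights GROUP BY 1, 2
    UNION ALL
    SELECT date, 'google' AS channel, SUM(cost) AS spend
    FROM google_ads.campaign_stats GROUP BY 1, 2
),
revenue AS (
    SELECT created_at::DATE AS date, SUM(amount) / 100 AS revenue  -- cents
    FROM stripe.charges
    WHERE status = 'succeeded'
    GROUP BY 1
)
SELECT s.date, s.channel, s.spend, r.revenue
FROM spend AS s
LEFT JOIN revenue AS r USING (date)
"""
```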
[00:38:45] Unknown:
As far as the stage of the modern data stack and its role in the modern data ecosystem, you know, it's definitely a very powerful set of technologies and capabilities, and there are still these edge cases that exist as far as being able to integrate it all together, which is, you know, definitely a solvable problem. But what are some of the underlying challenges or problems that you see in terms of how people are thinking about the modern data stack, whether it be the consumers or the people who are building it? And what are some of the areas of opportunity, or maybe some of the next stages of evolution, that you are anticipating as you are getting more involved in this particular area?
[00:39:20] Unknown:
I would say it's never been stronger, but it's not fundamentally different than it's been for 10, 15 years. I would say, you know, the data stack has always been about making what is currently possible but difficult a little bit easier, and then that evolution just happens constantly. I mean, that's just how software works, basically. The stuff that's getting the most action right now, where there are a lot of, you know, startup companies and a lot of different ways to do things, that's maybe, you know, observability and testing.
BI will probably always be in that state. Some of that is a lot of different ways of doing things like metrics and the semantic layer as well. There's a lot of experimentation going on, with different startups and different people doing things in different ways, and a lot of companies, you know, building their own systems. And I think over time those different pieces will coalesce a little bit, and there'll become more standard ways to do this. And then the next thing that is, you know, currently possible but very hard will be on the forefront of what's about to be made easier. That might be, you know, machine learning. I think in 5 years it will probably be noticeably easier to do, maybe to the point where you don't actually have to be much of an engineer at all. To pick up on that
[00:40:38] Unknown:
a little bit: first of all, yes, the modern data stack has become a little bit of a cliche, or, like, a bingo word. You know, you could just be walking around the streets of San Francisco and overhear, you know, 4 conversations in a row about the modern data stack. So I think, while the term may have jumped the shark, its application and its uses have, just like Dan said, never been stronger. What I see the modern data stack as is actually a lot of services and companies that came out of larger companies. So if you think about a lot of these tools, they were built by expensive in-house teams at, you know, companies that were doing cutting-edge data work. And they said, you know, in order to do this cutting-edge data work, this is how I want my data to flow. And they would invest in data scientists, who were incredibly expensive, to make them just a little bit more efficient, because their insights would be hugely valuable to these now giant, hugely successful tech companies.
These teams invested in incredibly useful but incredibly expensive data infrastructure. That data infrastructure now finds itself as many services in the modern data stack. The modern data stack is essentially ripped off of the data stack of 10 years ago, like Dan mentioned, plus all of the advances that were happening within some of these great web 2.0 companies, and it's now available not just to, you know, the latest stage or wealthiest or public companies, but actually available far more ubiquitously.
[00:42:21] Unknown:
As far as the work that you've been doing with Mozart and some of the ways that you're seeing your customers and early design partners working with it, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:42:33] Unknown:
One great example: we fairly recently added what we thought of as kind of a niche feature, where you can take any table and hook up a Google Sheet to it, and whenever that table changes, we'll update it as a tab in the Google Sheet. One thing we haven't mentioned: we don't do BI. We generally say, you know, you then hook up your favorite BI tool. That might be changing, but this was sort of a way to bootstrap people who hadn't yet chosen a BI tool, so you could keep using Google Sheets as your BI tool. One of our customers is now basically using one massive Google Sheet as their customer support CRM. They had, you know, over 100 people on their customer support team using a BI tool, you know, 100-plus seats in a BI tool. Basically, what they're doing is pulling data from their application databases and various vendors, so they could see where they shipped large things to their customers.
So they were pulling in data from a bunch of their different tools and organizing it so that they could have a simple lookup tool. They would have everything about a given customer, so their customer support team could look up shipment details and everything they need to know about a customer
[00:43:44] Unknown:
in one tool. They were basically able to save 100-plus seats in their BI tool by just using Google Sheets. I would go in a little bit of a different direction. One of the companies that I think is using, you know, our platform the best is Mozart Data. So we consume our product internally quite a bit. We have a series that we call Mozarting Mozart. We try to be a very data-driven, you know, B2B company ourselves. So there are real things that happen to the internal, you know, data analysts or business folks in our organization where, you know, basically, the tool, Mozart, is trying to solve a problem. Or when it doesn't, it ends up being a hack day project that typically ends up winning a sort of silly hack day prize. So we have a series that is really just about real practitioner problems. Dan and I actually, you know, in our Mozart account, we have, like, a date table. Right? So just something as simple as: whenever you want to do a cohort analysis, you need basically a left table to join to that's just essentially a set of dates, because it'll always be the case that there's, you know, some cohort that's effectively missing in one of the date periods that you need.
So this is something that, you know, comes about when you're actually doing that work. So one of our sort of core principles, obviously, much like many, many companies, is dogfooding, but at Mozart, it's Mozarting Mozart.
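For illustration, a date table like the one Pete describes can be generated directly in Snowflake; here is a hedged sketch, with made-up schema names and date range, of both the spine and the kind of cohort query that left-joins onto it:

```python
# Snowflake SQL held in Python strings, as each might run as a scheduled
# transform. All schema, table, and column names here are hypothetical.
DATE_SPINE_SQL = """
CREATE OR REPLACE TABLE analytics.dates AS
SELECT DATEADD(day, SEQ4(), '2018-01-01'::DATE) AS date_day
FROM TABLE(GENERATOR(ROWCOUNT => 3650))  -- roughly ten years of daily rows
"""

# Left-joining activity onto the spine keeps quiet periods visible as
# zero-count rows instead of silently disappearing from the cohort chart.
SIGNUPS_BY_DAY_SQL = """
SELECT d.date_day, COUNT(c.id) AS signups
FROM analytics.dates AS d
LEFT JOIN analytics.customers AS c
  ON c.created_at::DATE = d.date_day
GROUP BY d.date_day
ORDER BY d.date_day
"""
```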
[00:45:12] Unknown:
As you have been using Mozart to run the Mozart business, what are some of the interesting aspects of the product that you have identified through the work of actually being your own consumer, and maybe some of the empathy that you've been able to build up with your customers, to be able to understand areas that are, you know, ripe for improvement or refactoring?
[00:45:37] Unknown:
Some big ones are basically around onboarding new employees. I think this is a common issue with our customers and ourselves. Somebody joins a company that already has a very mature data platform, and they come in and they see, you know, 200 transforms and a data warehouse with 5,000 tables. Most of the information about, you know, what is important, what is up to date, where can I find information about this or that, is generally held in people's heads, and it's hard to see, from looking at the code base or looking at the database, what actually matters.
So that's definitely something we've tried to make easier in our platform as we've experienced it while onboarding new employees to Mozart. I would also add data validation as a core feature. This is something we recently launched, and as soon as we did, when we were testing it on our own data, just the ability to put tests on different tables, like, this column should never have nulls, or there should never be duplicates in this column on this table. We kind of didn't realize some of our tables were in a bad state, and we didn't realize that until we actually started using our own testing framework.
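A minimal sketch of that kind of check, assuming a Snowflake-style cursor and hypothetical table names (Mozart's actual validation feature presumably does much more):

```python
# Each check is a name plus a SQL query that counts violating rows.
CHECKS = [
    (
        "orders.customer_id is never null",
        "SELECT COUNT(*) FROM analytics.orders WHERE customer_id IS NULL",
    ),
    (
        "customers.email has no duplicates",
        """SELECT COUNT(*) FROM (
               SELECT email FROM analytics.customers
               GROUP BY email HAVING COUNT(*) > 1
           )""",
    ),
]

def run_checks(cursor) -> list[tuple[str, int]]:
    """Run every check and return the ones that found violating rows."""
    failures = []
    for name, sql in CHECKS:
        cursor.execute(sql)
        bad_rows = cursor.fetchone()[0]
        if bad_rows:
            failures.append((name, bad_rows))  # e.g., send a notification
    return failures
```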
[00:46:52] Unknown:
On a much, much lighter note, there are some small UI touches that end up really being important to the usage. In fact, one of our most early-on requested features: we have a lot of, like, sort of cheesy puns at Mozart. I mean, the whole company name is based on the idea of, like, data orchestration and, you know, composing your tables. There are lots of puns. But when you do create a table, it ends up playing a little snippet of Mozart. And one of our, you know, early customer requests was to actually, like, shorten that snippet. Now, we loved it. It always sort of led to a small moment of delight. But as internal customers, we started to really empathize with our external customers that were starting to get annoyed by the little joke and pun that we had baked into the product. So there are, you know, some serious answers like Dan shared, but I would also say that that's the whole point of essentially using your product, which is to understand its day-in, day-out usage.
[00:47:47] Unknown:
As you have been building out the product, working with your customers, and trying to improve the overall accessibility of these data infrastructure improvements, what are some of the most interesting or unexpected or challenging lessons that you've learned, whether technical or business or product?
[00:48:04] Unknown:
So I think of some of the most interesting problems as very business related as well. It sort of gets back to: how do you get value out of a data team at a company? And it's a little bit different than, like, a sales team. A sales team is easily measured, or typically a marketing team is measured by, you know, the cost per lead or the cost per win or whatever it is, and then, you know, you can compensate people accordingly. Demonstrating value in the data space is a lot more ambiguous. It's kind of: you know it when you see it. But what we found is that, you know, companies sometimes come to the table saying, oh, well, we know that we wanna use our data. Right? Like, data-driven is such a cliched term, and we know that a lot of folks, you know, get assigned by their boards of directors, or whomever,
like, well, we really need to leverage our data and use our data. The original insight, which has continued to actually surprise me, is that there is this new set of data consumers that are incredibly technical and savvy, in roles that don't have the title data. And two, that, you know, for organizations to get a lot of the value out of data, they do have to be intentional about it. So going about it as, someone assigned me to think about this, tends to maybe yield one or two wins in the short run. Another thing that I would say has been surprising is the way data consumption massively takes off. So, you know, it's funny. Like, humans are exceptionally bad at understanding exponential growth. And, you know, data consumption often looks like that: you start out by saying, okay, like, maybe it's this one data source producing, you know, a certain amount of data. And then, you know, as you start to essentially combine data sources, or get hungry for more data sources as the company scales, data starts providing value, and data volume scales, which is actually why you see a lot of usage-based pricing in data tools broadly.
So I would say these are all sort of novel things that we found out about our customers and our customers' usage. But again, still the hardest thing is to put value on the data you consume. Like, how do you measure it? Okay, like, our ads are x percent more effective. Well, that actually, you know, does translate into a cost per acquisition that has some meaning. But how can you measure whether the product is x percent more effective at, you know, driving a customer to stay on the platform or to use the platform? It all becomes very hard to measure the broader value of data, and that's, like, a challenge for not just our company, but for many companies in the data space: how do you translate the value that you're adding to an organization, and how does that surface and become apparent, you know, to that organization, and often to executives in the organization?
[00:51:00] Unknown:
For people who are interested in being able to take advantage of the modern data stack and simplify the overall process of getting it set up and integrated, what are the cases where Mozart is the wrong choice and they might be better suited doing it themselves or building out their own abstraction layer on top of it all?
[00:51:20] Unknown:
It sort of looks like a barbell, which is to say: on one end, if you have nobody in your organization who will be spending a lot of time cleaning data, or who is committed to the cleaning and consumption of data, you'll struggle. Dan tends to put the bar at having a SQL writer within your organization. Organizations without one, and without a plan to hire somebody who could be a data analyst, struggle to get a lot of value out of our tool, and they certainly struggle to get a lot of value out of the modern data stack. Part of the modern data stack is the ability to join data across many tools, so typically it's about an intentional investment.
And on the flip side, if you already have a lot of existing infrastructure, you've built out pipelines and made investments in your existing stack, and it's working for you, then tearing it down to put in the modern data stack is often not the solution with the greatest ROI in the short or medium run. So I would say it's kind of a barbell in that respect.
[00:52:31] Unknown:
As you continue to iterate on the Mozart platform and product and keep an eye on the ways that the modern data stack is evolving, what are some of the things that you have planned for the near to medium term, or any projects that you're particularly excited to dig into?
[00:52:52] Unknown:
I'd say we're always building more connectors, so always more ways to get data in easily, more ways to manipulate data once it's in there. And then, like I said, we're always looking at the cutting edge of tools that people are building that you can hook up to your data warehouse to get extra value. We try to figure out what the core value is, what the core feature sets are that people are using these tools for, and whether we can, and should, bring some of that into Mozart. Currently, the way that Mozart is set up, we create a Snowflake account for you and we manage that for you. By Q1 of next year, so very imminent, if you already have a Snowflake account, you can just hook Mozart up on top of that rather than having to use our Snowflake management and a separate database. That'll solve some of the problem Pete was talking about on the right side of the barbell: if you already have a Snowflake warehouse and you've got a bunch of stuff built on it, we would no longer ask you to throw that away and use us. Instead, you can use us in addition.
[00:53:49] Unknown:
Our strategy is to erode the parts of the world that answer is true for. You asked where Mozart is not a solution, and we wanna be honest about where the product is today. There are two ways in which we focus on improving and expanding the product. One is to shrink the set of people for whom it is not an ideal solution, and the other is to make it an even more ideal solution for the people it already fits. Dan just touched on one of those two ends, but we're also trying to make it easier for the lower tech user as well. So we wanna expand in both directions from where the product is today.
[00:54:30] Unknown:
Are there any other aspects of the work that you're doing at Mozart or the overall modern data ecosystem that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:39] Unknown:
I think the only thing I'd add, and we have actually touched on it throughout the show, is that a lot of what's new, a lot of what's modern, is old. The problems being tackled by the modern data stack, and that I see great teams, both data teams and tool building teams, trying to tackle, are kind of the problems of old. People think of them as largely new and novel, but I still see the problems as getting started or using more data, that being the typical challenge, as opposed to somehow managing to find and extract even more value from a dataset you're already working with.
[00:55:20] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:55:43] Unknown:
So I think the biggest gap in tooling today is actually just widening the set of tools that data collection is easy from. It's the two ends of the data pipeline spectrum. One is having the ability to apply that data: the reverse ETL tooling, where we see so many great new companies making real progress, finding ways to do things in the data warehouse and then take that output and essentially apply it or operationalize it. And then, on the other end of the spectrum, a great tool like Fivetran has hundreds of connectors, but there are what feels like an almost infinite number of SaaS tools. In practice, it's probably millions.
So I think that there is an incredibly long tail of tooling that adds immense value across industries, and we've only scratched the surface of what's automated and easy. And then on the contact side, I'm pete@mozartdata.com.
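As an aside for readers: the reverse ETL pattern Pete mentions above, doing work in the data warehouse and then pushing the output into operational tools, can be sketched in a few lines of Python. Everything here is a hypothetical placeholder: the sqlite3 connection stands in for a real warehouse driver, and the customer_scores table and CRM endpoint are invented for illustration.

```python
# A minimal sketch of the reverse ETL pattern: read a modeled table out of
# the warehouse and push each row to a downstream SaaS tool.
import json
import sqlite3              # stand-in for a real warehouse client library
import urllib.request

conn = sqlite3.connect("warehouse.db")  # placeholder warehouse connection
rows = conn.execute(
    "SELECT email, lifetime_value FROM customer_scores"  # hypothetical modeled table
).fetchall()

for email, ltv in rows:
    # Operationalize the warehouse output: send it to a (hypothetical) CRM API.
    req = urllib.request.Request(
        "https://api.example-crm.com/contacts",  # placeholder endpoint
        data=json.dumps({"email": email, "ltv": ltv}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(req)
```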
[00:56:55] Unknown:
And I'm dan@mozartdata.com, or you can go to www.mozartdata.com, book a demo, or start a free trial.
[00:57:00] Unknown:
We'd be happy to talk about your data problems, whether or not that involves Mozart Data. Alright. Well, thank you both very much for taking the time today to join me and share the work that you're doing at Mozart, the ways that you're working to make it easier to actually take advantage of the technological capabilities that the modern data stack has brought about, and to smooth the on-ramp for people who don't necessarily want to deal with all of the integration challenges that come along with that. I appreciate all the time and energy that you're each putting into that and helping to make data more accessible to more people. So thank you again for taking the time today, and I hope you each enjoy the rest of your day. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Chapters
Introduction and Sponsor Messages
Interview with Peter Fishman and Dan Silberman
Peter Fishman's Background in Data
Dan Silberman's Background in Data
Overview of Mozart Data
Building the All-in-One Data Platform
Challenges in the Modern Data Stack
Target Users and Industry Agnosticism
Experience and Background Knowledge Required
Engineering the Mozart Platform
Low Floor, High Ceiling Design Principle
Progressive Exposure and User Control
ETLT and Data Modeling
Typical Workflow with Mozart Data
Challenges and Opportunities in the Modern Data Stack
Interesting Use Cases and Applications
Using Mozart Data Internally
Lessons Learned
When Mozart Data is Not the Right Choice
Future Plans and Exciting Projects
Final Thoughts and Contact Information