Summary
Data engineering systems are complex and interconnected, with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. One approach to making this tractable is to define and enforce contracts between producers and consumers of data. Ananth Packkildurai created Schemata as a way to make the creation of schema contracts a lightweight process, allowing the dependency chains to be constructed and evolved iteratively and integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.
Announcements
-
Hello and welcome to the Data Engineering Podcast, the show about modern data management
-
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
-
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
-
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
-
Your host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modelling framework for decentralised domain-driven ownership of data.
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Schemata is and the story behind it?
- How does the garbage in/garbage out problem manifest in data warehouse/data lake environments?
- What are the different places in a data system that schema definitions need to be established?
- What are the different ways that schema management gets complicated across those various points of interaction?
- Can you walk me through the end-to-end flow of how Schemata integrates with engineering practices across an organization’s data lifecycle?
- How does the use of Schemata help with capturing and propagating context that would otherwise be lost or siloed?
- How is the Schemata utility implemented?
- What are some of the design and scope questions that you had to work through while developing Schemata?
- What is the broad vision that you have for Schemata and its impact on data practices?
- How are you balancing the need for flexibility/adaptability with the desire for ease of adoption and quick wins?
- The core of the utility is the generation of structured messages. How are those messages propagated, stored, and analyzed?
- What are the pieces of Schemata and its usage that are still undefined?
- What are the most interesting, innovative, or unexpected ways that you have seen Schemata used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Schemata?
- When is Schemata the wrong choice?
- What do you have planned for the future of Schemata?
Contact Info
- ananthdurai on GitHub
- @ananthdurai on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Schemata
- Data Engineering Weekly
- Zendesk
- Ralph Kimball
- Data Warehouse Toolkit
- Iteratively
- Protocol Buffers (protobuf)
- Application Tracing
- OpenTelemetry
- Django
- Spring Framework
- Dependency Injection
- JSON Schema
- dbt
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's a t l a n, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Ananth Packkildurai about Schemata, a modeling framework for decentralized domain driven ownership of data. So, Ananth, can you start by introducing yourself?
[00:01:38] Unknown:
Hi, everyone. First of all, thank you so much for inviting me. This podcast is one of my favorite podcasts, so I'm really glad to be here. I'm working in data engineering at Zendesk now, and my major focus is on customer facing analytics: how do we integrate the various customer interaction points into the support ticket system and give a coherent 360 degree view to our customers. That's been predominantly my focus for the last two years. Previously, I worked for Slack, mainly on the data engineering side, building out data infrastructure for the messaging platform, the orchestration engine, and scaling the query engines. Toward the end of my time at Slack, I was working in the monitoring and observability space, scaling metrics systems, logging infrastructure, and things like that.
So I've been broadly on the data side for a long time. I'm one of the victims of jumping into this whole data world from backend engineering at the height of the Hadoop hype. But no regrets, it's going great so far.
[00:02:37] Unknown:
And do you remember how you first got started working in data management?
[00:02:40] Unknown:
I mean, as I mentioned, I kind of jumped onto the Hadoop wagon, and there was a lot of opportunity at the time. If you recall, the Hadoop world bought into the philosophy that you don't need to model anything: you just throw everything into this big chunk of storage, and because MapReduce can massively parallelize your transformations, you don't need to do any upfront modeling. Before Hadoop came in, even changing one column in your data warehouse required multiple bureaucratic steps. Hadoop moved away from that: you're free to do whatever you want, and you have a massively parallel system to do it on. That resonated with the backend engineering side of me when I got into it, but as I started working on those systems, I realized that's not actually the case: you still need to do proper data management in order to optimize your data transformations.
As I worked more and more in the Hadoop and MapReduce world, I started to realize there are a lot of things we should borrow from the data management side. So that was the beginning. I actually took a course from Kimball on data warehousing; if I recall correctly, that was the last class he taught. That introduced me to the whole data modeling world,
[00:04:10] Unknown:
and I started to see both worlds, which was really amazing. That's how I ended up on this whole data engineering, data management side of things. So that brings us now to the Schemata project, which you recently released. And I'm wondering if you can describe a bit about what it is and some of the story behind how you came to the conclusion that it was
[00:04:31] Unknown:
a project that you wanted to invest in and build, and some of the specifics of the problem that you're aiming to solve with it. Schemata was an idea that grew out of my experience of how this whole data function works and what the dynamics of data engineering are. It's multifold: it's a technology problem, and it also leads to an organizational execution problem. This is my view of how all businesses run: if you take any business, it ends up being a business process application. Take Uber, it's a very simple business process: someone requests a ride, an acknowledgement happens, and you have a ride share. Every application we build, in Zendesk's case the support ticket system, somebody raises a ticket, you address it, and then you close it. Every business follows some kind of workflow. The workflow is developed and maintained by the application developers, the feature developers, and the feature developers have very rich knowledge about the business and how those applications function.
Whereas the data engineers and data analysts, on the other side, are listeners and observers, trying to observe how the business works, get more insight from the data, and produce insights for the business operations side: how can we effectively run the business, where are we lacking, what is our monthly recurring revenue, and all our other business metrics. So there are two groups of people, one with the motivation to quickly build analytics and solutions for the business and understand how it is operating. The way the analysts, data scientists, or data engineers view the business is completely different from how the application developers view it. There are two different viewpoints and two different competing priorities.
But in order for this function to work together, you need a coherent event tracking system. Everyone is working to understand how users are interacting with the application, and the application developers have rich knowledge about that. Without a system like Schemata, what happens is we just hand over a list: these are the events that I want to track. And either the application developer blindly implements that, or the application developers implement some kind of event tracking system on their own.
The two sides don't match, and that leads to confusion. Every company is trying to gather more and more information. It's all flowing into Snowflake or a data lake and all those systems. At some point in time the context goes missing, people leave or join, the whole team dynamics change, and it becomes complete garbage. Nobody knows what is inside your data warehouse. Nobody knows what is inside your data lake. Everything runs on assumptions, and that completely reduces trust in the data. So Schemata is an idea that started from seeing that pattern repeatedly.
There needs to be a system that provides a feedback loop so the developers and the analysts can coherently develop the data model together. That's the whole idea behind Schemata as a decentralized data management platform.
[00:07:45] Unknown:
To your point of, you know, there being a lot of garbage that ends up in the system, what are some of the ways that that garbage in garbage out problem can manifest in data warehouse and data lake environments, particularly as they go from initial implementation to a core business critical system of record?
[00:08:03] Unknown:
That's a good question. So one scenario: let's imagine a business that is measuring active users, on an imaginary communication platform or chat system. Active users means not just viewers; they take at least one action. The action could be adding a reaction, doing a search, or adding a new message, for example. On the development side, the teams might be organized such that search is handled by one team while messaging is handled by another. There are separate domains of expertise within those business processes, building those applications coherently for a unified product experience. In many cases, the search team may not even talk to the messaging team.
When you want to measure active users, you need a very strong standard for how each team instruments its data, so that a search user can be identified as the same person as a messaging user, and there has to be connectivity between the activities across the different domains you want to track. In some cases, a product manager says, we want to instrument some activity in our search product, so they instrument it, but the data is missing information, incomplete relative to what the active user count needs, while messaging might have some other incomplete information. Because of that, it requires a lot of collaboration and communication across these teams. And the search team has its own priorities: they want to improve performance or the user experience.
Event tracking always ends up on the back burner, so they try to implement it as quickly as possible, a mismatch happens, and the data that comes in may not be usable in many cases. The second thing is: what does this field mean? The developers just add, for example, a string called email. What is that string? What email is it? The search team might call it email, while the messaging team might call it messenger email, or messenger ID, or something like that. So there is inconsistent naming, the completeness of the schema is inconsistent between teams, and that doesn't present the complete picture back to the analysts.
And this back and forth happens, and the domain knowledge always stays with the owning team. It takes the data engineering team a lot of back and forth conversation to understand what this domain means, what this search functionality means, how people actually navigate through the system. People change, but as long as the same people keep working at the same company for a long time, they hold this context in their heads. The moment the team switches around, whatever events were generated, that history is lost.
And then we don't know what the search email means versus the message email, or how the pipeline actually works. A lot of hidden complexity and hidden logic is embedded in the data pipeline, and over a period of time it becomes very complicated and turns the whole system into garbage.
[00:11:19] Unknown:
In terms of the problem of generating the garbage, obviously the first issue is that it happens in the first place, and I know that a big part of where you're focusing with Schemata is trying to prevent that garbage from being generated. The concept of garbage is occasionally subjective, where to some degree you say, I've got an event that has value because I'm sending information, but if you can't do anything with it, then it's effectively useless. And that's where the question of schema comes in. A lot of times schema is overloaded with the idea of the database schema, and so people don't necessarily think about the fact that they need to apply schema and constraints to other structured pieces of information, or even semi-structured or unstructured data.
So I'm wondering if you can talk through some of the different locations in a data system, even pushing into what some people may not consider to be part of the data system, like the source applications. What are some of the different places, from the creation of a piece of information through to analyzing it or embedding it into a machine learning model, etcetera, where that schema information needs to be established, and what types of context and information need to be maintained with it?
[00:12:36] Unknown:
I would imagine the data originates in multiple places, right? In the modern ecosystem, I broadly classify environments as controlled versus uncontrolled. A controlled environment is, let's say, you are running an app, or you have your own website, or you're a B2B company that holds all the business transactions within your control, so you have the luxury to go and instrument certain things or not instrument certain things. Zendesk can go and instrument its support ticket system and how users are behaving. An uncontrolled system is something like Salesforce, or any third-party tool that data comes in from.
Those two kinds of systems have a different nature and a different way of bringing data in. Schema understanding starts at the originating system, and the data goes through multiple hops; that's the complication of a data pipeline. Typically what happens is you originate the data and put it into some kind of data lake, or maybe a data warehouse. The data comes in from different streams as raw data, and you apply some kind of transformation to normalize the data structure into something more usable and queryable, so your analysts or data scientists can consume it easily.
And then another hop happens. Once analysts or data scientists are using that, the data science team might start doing feature engineering, trying to build models based on all the data they're getting, whereas analysts might start building their own data marts: a data mart for finance, a data mart for marketing, and so on. So data goes through hop after hop, and every time there is a hop, there is a producer-consumer relationship.
And the producer sometimes has a different understanding of the asset they are producing, whereas the consumer sees it from their business requirement perspective, so there is often a mismatch. Schema modeling applies to, and is critical at, every hop that happens across those systems. Many companies approach this in different ways, knowingly or unknowingly. A common pattern I've seen is companies adopting a red, blue, green model.
The red model is largely for experimental data pipelines: no data modeling required, you just want to quickly test a hypothesis. The blue model is where you actually start shipping the data, but you are still verifying it; you are not yet saying this is a final contract that you're exposing to other consumers. And the green area is any severity-1 pipeline, with very strict guidelines for how the whole pipeline needs to work. So to answer your question: every hop makes schema modeling much more critical because of the producer-consumer problem, and if you're not doing it, you will again end up with garbage in, garbage out, because the context always changes.
[00:16:08] Unknown:
In terms of the Schemata project, what are the ways that it is designed to minimize some of those lost pieces of information? Where you go from 'I've created this record, I have all of the code and all the context related to who created it, why it was created, and what the actual purpose of it was' to 'it's in a data warehouse, and I can see that there's a user ID that is apparently a UUID, a created at that has some unknown time zone associated with it, and an event that is just a string where I have no idea what the possible values are.' What are some of the ways that Schemata is designed to help reduce the pain that you experience in that situation of trying to rebuild that context, where you have to trace things back manually and dig through the code and interview people?
[00:17:02] Unknown:
Yeah, totally. I think Schemata is heavily influenced by DevOps principles. When I worked in the monitoring and observability area at Slack, I was bringing data engineering expertise into the monitoring domain. One of the systems we built was an Apdex-style reliability index score. Essentially, it listens to all the service logs that the services are generating and simply computes a score: nothing but the number of successful (HTTP 200) requests versus the number of failed (HTTP 500) requests for each and every service. You assign a score that says, this is how many happy requests you have, and you have a threshold assigned to it, so it gives a nice feedback loop back to the application developers to improve the reliability of the system.
And then I started to think about how to solve this whole data modeling problem the same way. When you are adopting certain data modeling principles, or some kind of process, the developers and the people next to the data, the product managers and data analysts, may not be aware of all the nitty gritty of how the data model works. It takes a lot of time to read through a big chunk of the Data Warehouse Toolkit just to understand all of those things. The amount of literature is large, so the barrier to entry to apply any data modeling concept is pretty high. That's why Schemata is influenced by that simple Apdex-style score calculation: generating a feedback loop back to the developers.
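To make the analogy concrete, here is a minimal Python sketch of the kind of Apdex-style reliability score described here: the ratio of successful to failed requests per service, compared against a threshold to produce a feedback message. The function names and the threshold value are illustrative assumptions, not code from Slack or from Schemata.

```python
from collections import Counter
from typing import Iterable, Tuple

def reliability_score(status_codes: Iterable[int]) -> float:
    """Fraction of 'happy' requests (2xx) out of all 2xx/5xx requests."""
    counts = Counter(code // 100 for code in status_codes)
    happy, errors = counts[2], counts[5]
    total = happy + errors
    return 1.0 if total == 0 else happy / total

def score_feedback(service: str, status_codes: Iterable[int],
                   threshold: float = 0.99) -> Tuple[float, str]:
    """Return the score and a human-readable feedback message for one service."""
    score = reliability_score(status_codes)
    verdict = "OK" if score >= threshold else f"below threshold {threshold:.2f}"
    return score, f"{service}: score={score:.3f} ({verdict})"

# Example: a service with 990 successful and 10 failed requests
print(score_feedback("search-api", [200] * 990 + [500] * 10)[1])
```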
So what Schemata does is let product managers or data analysts apply rules and say: if you are going to add this schema, your schema has to be well connected with all the other domains. Schemata's core systematically creates a connected graph and does a graph computation, similar to graph algorithms in machine learning: it computes a score for how well a vertex is connected, how strong its edge connections to the other entities are.
Then it applies an opinionated view: if you are emitting something, it has to be either an entity or an event. Entities are mutable in nature, and events are immutable. Entities represent a business object: take a user, the user is an entity with its own lifecycle. An event, on the other hand, represents a business transaction, and that business transaction should have as much connectivity as possible to the entities, the business objects. The more connections you have, the stronger your data ecosystem is going to look. You can navigate the graph, and even if the data is incomplete, you can still enrich it in the downstream pipeline.
And on top of that, you can add all the completeness standards for the data that you mentioned. So Schemata builds this Schemata score, similar to an Apdex score, and it's embedded as part of the CI/CD pipeline. When a developer commits the code, commits the schema changes, it immediately computes the score and gives feedback to the developer: this is how much completeness you have. It introduces some gamification to standardize and increase the completeness of the schema.
So the Schemata score is about creating that well connected data ecosystem in the data lake or any of these systems.
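As a rough illustration of the scoring idea, here is a small Python sketch that scores each event by how many known entities it references. This is not Schemata's actual algorithm (the project does a weighted-graph computation over parsed schemas); the dictionary shapes, field names, and scoring formula are assumptions for illustration only.

```python
from typing import Dict, List

# Toy schema registry: events declare which entities they reference.
# The structure here is illustrative, not Schemata's internal format.
schemas: Dict[str, dict] = {
    "User":            {"type": "entity"},
    "Channel":         {"type": "entity"},
    "MessageSent":     {"type": "event", "references": ["User", "Channel"]},
    "SearchPerformed": {"type": "event", "references": []},  # orphan event
}

def connectivity_score(name: str, registry: Dict[str, dict]) -> float:
    """Score an event by the fraction of known entities it is connected to."""
    entities: List[str] = [n for n, s in registry.items() if s["type"] == "entity"]
    if registry[name]["type"] != "event" or not entities:
        return 1.0
    linked = [e for e in registry[name].get("references", []) if e in entities]
    return len(linked) / len(entities)

for event in ("MessageSent", "SearchPerformed"):
    print(f"{event}: schema score = {connectivity_score(event, schemas):.2f}")
```

A well-connected event like MessageSent scores high, while an orphan event with no entity references scores zero, which is the kind of signal a CI check could surface back to the developer.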
[00:20:58] Unknown:
In terms of the developers being able to hook into Schemata and get that feedback loop of, okay, I've made this change to this structure, now I want to understand what impact this will have on the downstream systems: I'm wondering how that gets factored into their code and their developer environment. This is also sounding a little bit similar to what the, shoot, I'm blanking on the name of the company, there's a company that was doing something similar that recently got acquired: Iteratively. And I'm also interested in understanding a bit about how this compares to what the folks at Iteratively were doing with being able to hook into that event schema and then add that as a sort of linting rule in the developer experience.
[00:21:45] Unknown:
Yeah. So I'm not sure exactly how Iteratively works. I'm aware of it, but I have not really looked into it much. How this works, essentially, is that developers are largely using Protobuf, JSON, Thrift, or Avro; there are four very popular data exchange formats available. Schemata is not inventing any new DSL or anything in order to provide this. For example, with Protobuf you can add annotations on top of each schema to enrich the information. In the current open source version of Schemata, the way it works is that if a developer is going to make certain Protobuf schema changes to exchange information with the data analysts, or even to exchange information with another application, they can add an annotation on top of it. Schemata then parses those annotations, parses the Protobuf, JSON, or Avro schema, pulls the annotations out of it, builds the score, and gives the feedback.
With Schemata, we are constantly conscious of the developer experience: we don't want to build another tool that developers aren't aware of or don't want to work with. Any new DSL we introduce is an adoption pain, and I personally, as a developer, wouldn't adopt it. So it doesn't give you a new DSL or a new way of working; it's your familiar tools, like Protobuf and Avro, you add a little annotation on top, and you get this whole benefit out of it.
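A hedged sketch of the kind of annotation check described here: given a parsed schema, represented as a plain Python dictionary rather than a real Protobuf descriptor, flag any required annotations that are missing. The required keys mirror the rules mentioned later in the conversation (description, owner, domain); the data structures and function names are assumptions, not Schemata's internal API.

```python
from typing import Dict, List

REQUIRED_ANNOTATIONS = ("description", "owner", "domain")

def missing_annotations(schema: Dict[str, object]) -> List[str]:
    """Return the required annotations that are absent or empty on a parsed schema."""
    annotations = schema.get("annotations", {})
    return [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]

# Hypothetical result of parsing an annotated .proto message.
parsed = {
    "name": "SearchPerformed",
    "annotations": {"description": "User ran a search query", "owner": ""},
}

problems = missing_annotations(parsed)
print("OK" if not problems else f"SearchPerformed is missing: {', '.join(problems)}")
```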
[00:23:26] Unknown:
Data engineers don't enjoy writing, maintaining and modifying ETL pipelines all day, every day, especially once they realize that 90% of all major data sources like Google Analytics, Salesforce, AdWords, Facebook, and spreadsheets are already available as plug and play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from over 40 countries to set up and run low latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get real time data flow visibility with fail safe mechanisms and alerts if anything breaks, preload transformations and auto schema mapping to precisely control how data lands in your destination, models and workflows to transform data for analytics, and reverse ETL capability to move the transformed data back to your business software to inspire timely action.
All of this plus its transparent pricing and 24/7 live support makes it consistently voted by users as the leader in the data pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata today and sign up for a free 14 day trial that also comes with 24/7 support. One of the themes that's been coming up a lot in recent episodes and conversations that I've been having on this show is the question of how do we get the application engineers more involved and engaged with the overall lifecycle of data, and this is definitely a step in that direction. And one of the things that I've been wishing for is that things like the ORM frameworks for different applications, such as Django, Spring, things like that, will automatically have some facility built into them for doing things like managing event schemas and being able to generate analytic APIs for data extraction from the system without having to dig into the guts of the database and understand how that schema is mutating.
And I'm wondering what you see as some of the potential for something like Schemata being embedded into some of these application frameworks so that the schema modeling is just a native piece of working in the code and doesn't have to be an extra task layered on top of something that's already being done.
[00:25:40] Unknown:
That's a very interesting point; I have to think about that. Where I've seen this successfully adopted is in the classic tracing world. You have Zipkin tracing and other tools: essentially, you add a new span, and the whole framework is embedded into whatever the open tracing format is, so it's much easier to trace back through the system. Can event sourcing, from a data engineering perspective, be adopted the same way? This is a conversation we also had back and forth at Slack: is the tracing model an appropriate model for capturing information for the analytics side? I think the challenge is that these frameworks are an abstraction over a request-response model. They can capture well how a request comes in and how it propagates.
They can capture that metadata automatically from the network layer point of view, but not from the application layer point of view. The application part the developer still has to build, and I think that is where the complication comes in. Can we automatically do that? Certain context propagation we can definitely do, but I think it still largely depends on the application logic: what you want to gather and what you don't want to gather. One possible way to explore is via dependency injection, for example. If you are doing some transaction and you are injecting certain entities, you could potentially try to automatically embed the event objects that have already been defined somewhere, and then propagate them from there. So there is a potential possibility to add it as part of dependency injection. But, yeah, it's a great thing to explore and see how it goes.
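Since the dependency injection idea is only a possibility being floated here, the following is a speculative Python sketch of what it might look like: an event emitter injected into an application service so that a predefined event object is published alongside the business transaction. All class and field names are hypothetical.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class MessageSent:
    """A predefined analytics event, injected rather than hand-rolled per feature."""
    user_id: str
    channel_id: str
    sent_at: str

class StdoutEmitter:
    def emit(self, event) -> None:
        # In a real system this would publish to a queue or event collector.
        print(f"emit {type(event).__name__}: {asdict(event)}")

class MessageService:
    def __init__(self, emitter: StdoutEmitter) -> None:
        self.emitter = emitter  # the emitter is injected, so tracking stays uniform

    def send(self, user_id: str, channel_id: str, body: str) -> None:
        # ... business logic to persist and deliver the message would go here ...
        self.emitter.emit(MessageSent(user_id, channel_id,
                                      datetime.now(timezone.utc).isoformat()))

MessageService(StdoutEmitter()).send("u_123", "c_456", "hello")
```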
[00:27:24] Unknown:
In terms of the way that Schemata gets introduced into an application and a data system environment, who do you see as being the people who are likely to be the initial adopters, and the people who are promoting it and working with the rest of the teams to make sure that everybody is using Schemata at their different layers, so that they can have that full connected graph of schema information and know whether a change in one of the stages is going to have either a downstream or an upstream impact?
[00:27:59] Unknown:
The way I see Schemata, it is a collaboration platform, and schema modeling is one part of it. It's a collaboration platform because, say, the product managers want to do some event tracking: the product managers have some understanding, they talk to the analysts, and it becomes a collaborative effort between the analysts and the product managers. If you are launching a new feature, what is the information that I want to capture? It becomes very complicated when certain event tracking requires more than one team to implement. So Schemata, I see, is a platform, a communication or collaboration layer in the middle, where the product managers and the analysts initiate some kind of need for change management and so on.
And Schemata has the whole context of what is really happening in the underlying schema: how things are connected, the historical versioning behind it, how things have changed, who owns it, and who the people to talk to are. Once the developers implement that particular tracking, it gives them immediate feedback: okay, if you add this field, you are making a change that is not backward compatible, so it's going to break the downstream systems. It safeguards your system, and it also gives you the score that says how well you have modeled the system. So it sits as a collaborative environment for all of those data practices.
Even for data scientists, you often see that you want to build some data model and only then realize: oh, I'm missing something, this particular field is missing from this particular event, and I want this event back. Schemata provides the whole context behind that, and then they can raise a request and say: can you add this field here? So it is a facilitator of a conversation among all the data practitioners, and it also maintains the integrity of the system.
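The backward compatibility feedback described above can be illustrated with a small sketch: compare two versions of a schema's fields and flag removals or type changes as breaking. This is a generic check of the idea, not Schemata's implementation; the field maps and types are assumptions.

```python
from typing import Dict, List

def breaking_changes(old: Dict[str, str], new: Dict[str, str]) -> List[str]:
    """List changes that would break downstream consumers of this schema."""
    problems = []
    for field, old_type in old.items():
        if field not in new:
            problems.append(f"field '{field}' was removed")
        elif new[field] != old_type:
            problems.append(f"field '{field}' changed type {old_type} -> {new[field]}")
    # Newly added fields are treated as backward compatible here.
    return problems

v1 = {"user_id": "string", "created_at": "timestamp", "email": "string"}
v2 = {"user_id": "string", "created_at": "string"}  # email dropped, type changed

for issue in breaking_changes(v1, v2):
    print("BREAKING:", issue)
```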
[00:30:00] Unknown:
Digging into the Schemata project itself, can you talk through some of the implementation details of how you've built it, some of the engineering issues that you ran into, and some of the questions that you had to address around the scope of the project and the design of how to make it usable: able to provide the necessary structure, but flexible enough to fit into a variety of different developer environments and workflows?
[00:30:27] Unknown:
For the very first version of Schemata, I essentially wrote my own DSL. I'm a big fan of type-safe systems, and I always have a fear about JSON: there is no type safety in the schema. Even though the JSON Schema specification has certain elements, it doesn't really represent real-world data types. So I started building my own little compiler and language that you could use to define your schemas and so on. When I approached developers and other people to get feedback, the reaction was: I don't understand what this means, I can't use it, is this another tool I have to learn? That's when I realized it, and I just threw it away: you have to meet the developers where they are, and you have to meet the analysts where they are. Let's not introduce any new tools or systems that add more pain to the collaboration problems that already exist. So that's when we started to build on top of Protobuf and Avro.
The way the system is structured, there is a set of protocols that we support. It could be Protobuf, it could be Avro, it could be JSON, or it could even be dbt, for example. We can potentially take all the dbt models and see how well connected the dbt schema model is, and correlate that with the event schema coming from Protobuf, all the way down to Snowflake and so on. Each of these systems acts as a protocol provider, and we have an adapter to support the different protocols. Internally, the way Schemata works is that you take this protocol and convert it into a schema definition that we have: Schemata has its own internal schema representation.
For example, if you have a schema, it should have a description, it should have an owner, it should have a domain, and there are certain rules for how a schema is supposed to be structured, in an opinionated way. And if it is an event, it should have some kind of entity object reference; so there is an opinionated view of how the schemas are going to look. The way the code works is: you bring your own protocol, it parses it and converts it into the internal schema representation, and from there you can extend it and run your own Schemata score. Schemata has a concept of applications, or apps, so you can build multiple apps on top of it. Once Schemata builds this whole connected graph (internally it's a connected, weighted graph), you can build your own app on top of it.
The Schemata score is one such app: it computes the score and gives it to you. There is a validator app, where you can add whatever validation you expect from the schema definitions. And you can also add a data validator once the data gets into Snowflake or anywhere else, if you want to run certain data validations there. So it's a very connected ecosystem. The flow goes: a protocol provider parses the protocol and converts it into an internal schema representation, then the schema representation is passed on to multiple apps, a typical chain pattern, and then you can build your own app or customize an existing one.
[00:33:46] Unknown:
In terms of the broader impact that you're hoping Schemata will have: obviously, the immediate case is being able to have this validation, ensuring that this event record is going to be compatible with a downstream dbt model, so that if I add a new field to the record, it's not going to break systems that are already running. But in terms of the behavior of the engineers and the teams and the stakeholders who are interacting with these systems, what do you see as the broader impact in terms of the ways that they approach the problem, the ways that they think about building and maintaining these systems, and the actual effort that's involved in the care and feeding? What are some of the more grandiose visions that you have for Schemata's impact on those people?
[00:34:37] Unknown:
There are multiple aspects to it. One of the major things I would highlight is that data is now a very strategic asset for any company. As the business functions, as new customers come in, as the company grows, you see more and more data generated that essentially encapsulates your business knowledge: the behavior of your users, the performance of your business. That is how the knowledge grows. What is happening right now in the whole ecosystem is that there are very few companies in the industry able to successfully harness the power of their data to empower the business and run it much more efficiently.
And often, how well you understand your data is the differentiator between a successful business and an unsuccessful one. To an extent there's always human gut feeling, and some visionary founder can lead a business in a different direction; those cases exist. But in most cases, the lack of understanding of their own data is a huge problem. And the current way of resolving it is the data catalog approach: you've already generated garbage into the data lake, and now you're just trying to index that garbage into a data catalog, and practically nobody uses it. We used certain data catalogs in the past, and hardly five or six people were using them every week; basically, they're there for the sake of it.
That's partly because there is no incentive in the data catalog system, as of now, for those data practitioners to come together and systematically share and grow the schema, and grow the knowledge about the company. So in the grand scheme of things, what I'm hoping is that Schemata will move a company toward the adoption of a semantic data warehouse, where the data warehouse has the full context behind what each field means, what each data model represents, where the data is coming from, and what the business context behind it is. As the data practitioners collaboratively work on this on top of Schemata, it produces a lot of benefit and a clear understanding that can accelerate innovation a lot.
[00:36:59] Unknown:
As to the adoption path of Schemata, you mentioned that your first attempt was: I'm going to build the DSL, it's going to be great, it's going to do everything I want it to do. And then everybody said, what is this thing? What are you trying to make me do? And so, given that experience in your initial attempt at this overall goal, what are some of the design aspects and some of the positioning aspects that you considered to be able to balance the need for flexibility and adaptability of the framework with the ease of adoption and being able to demonstrate quick wins for people who just wanna
[00:37:39] Unknown:
pick it up and play with it and understand what this thing is supposed to do for them. I think this is still ongoing work on the Schemata side. The first lesson we learned is that a separate DSL is not going to work out. The second thing we learned is that there is real appeal for developers if you are working on top of Protobuf or JSON within the GitHub workflow that is very familiar to them: then I can instrument it quickly. It closely resembles how you instrument a trace, or instrument a metric, or instrument logs; you instrument the event in very familiar territory within your GitHub workflow. What I'm working on right now is to further simplify adoption, because Schemata, as of now, assumes certain things. It assumes a certain way a schema should be modeled.
And that is a very opinionated stand that it takes. So if you want to adopt Schemata right now, honestly, you need to buy into that opinionated view, and you have to incorporate that view to get the full benefit out of it. Not many companies will be at the same vantage point where they can apply it; companies like Slack or Zendesk had that kind of growth and that kind of complexity. So what I'm really working on is further simplifying that model. If somebody doesn't share any of that opinionated view, how can I introduce Schemata with an onboarding process, and how can we guide them toward the schema modeling and scoring strategy that gives them the feedback loop? This is still ongoing work. I've been talking to tons of developers over the last two or three weeks since announcing it, and I've gotten a lot of interesting feedback; it's very exciting.
[00:39:39] Unknown:
Prefect is the data flow automation platform for the modern data stack, empowering data practitioners to build, run, and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn't get in your way, Prefect is the only tool of its kind to offer the flexibility to write workflows as code. Prefect specializes in gluing together the disparate pieces of a pipeline and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100,000,000 business critical tasks a month.
For more information on Prefect, go to dataengineeringpodcast.com/prefect today. That's p r e f e c t. As you are working with Schemata, both in your own work and now that you've announced it and you're working to try and popularize it and get other people on board, what are some of the pieces of the project, its usage, and its broader impact that are still undefined, that you're still trying to think through and understand and get feedback on?
[00:40:48] Unknown:
There are many aspects that I'm still talking through and getting feedback on. One thing is that many companies are moving toward SaaS products. Let's say I'm starting a retail business: I go to Shopify and create my own retail shop, for example, and I get my digital presence. If I have an offline business, I'll probably buy some POS system. I get all my accounting and everything in Zoho or another system, or Stripe, for example, and if I want a support ticket system, I buy Zendesk. So if you want to start a business right now, you don't need to build any applications; it's all already available as SaaS products. But if you want to connect them and build a data warehouse, then again I need to hire data analysts, I need to have data engineers and all sorts of things, and I have to go onboard a data warehouse.
But these businesses run on a very standard operating model, like retail. How you measure them is going to be 90% standard. That's the hypothesis I'm trying to validate: can Schemata be the common model for those businesses across an industry, so they can go from zero to one? For example, if I'm starting a new retail shop and I'm using Schemata, then for this domain, the way schema.org is very popular, it can say: this is the standard kind of schema that we see across other companies; here is a schema definition for you, and this is what your initial data model is going to look like in your data warehouse or your data lake. It's a kind of zero to one.
People don't have the knowledge, and they don't have the luxury of starting with data modeling or hiring top-notch data engineers who understand all of those things. This gives you a Ruby on Rails kind of moment: you just start, your data model is ready to go, connected to these apps, and it gives you immediate insight. That's the grand scale of what I'm hoping Schemata will let you do at some point in time.
[00:42:53] Unknown:
Yeah. And to your point of having these kinds of off-the-shelf models available for people to use, that brings up the other aspect of the conversation that we haven't touched on yet. Largely, we've been discussing the use of Schemata within the bounds of a single organization, or within an enterprise where everybody is collaborating within the bounds of an employment relationship. But there's definitely a big potential for this to be used across organizational boundaries as well, where, for instance, I'm a SaaS provider, I have an API, and I publish the Schemata model for the output of the API that you'll retrieve when you're getting a JSON object, for instance, or maybe you can hook Schemata into the Swagger API and the docs there. And so now, as a consumer of that, maybe it's going through something like an Airbyte or Fivetran or a Meltano,
I can understand: okay, this model is now being connected into this raw data system, and then it's getting fed into this dbt model to turn it into staging, etcetera, and I can actually have that context propagated without having to do all of that reconstruction and everybody having to rebuild the wheel for the umpteenth time, because they're all communicating with the same API, but the API is just: here's some documentation, here's a bucket of JSON, good luck. And so I'm wondering what your thoughts are about the potential for Schemata to provide some of those stronger contracts at those organizational boundaries as well.
[00:44:21] Unknown:
Across organizational boundaries, for example, if you are adopting some kind of data sharing: many cloud providers are now trying to do data sharing so that you can share across different accounts. So yes, that is definitely one of the themes for Schemata. Let's say I'm importing data from Snowflake, and I belong to the retail domain: here is a common set of schema annotations for you that you can adopt. In a way, I think Fivetran and dbt are trying to do a similar model, where you publish a dbt model publicly for Snowflake or for other systems so that people can adopt it. So this is happening in different ways in different ecosystems.
But there is no platform-agnostic way, like Schemata, that gives you that abstraction, vendor-free: this is the model I'm going to implement, I want a Snowflake connector, I don't want to buy a whole large integration provider, I just want one connector, and you can buy that. That's what Schemata can enable: very platform-agnostic, industry-specific models, for example.
[00:45:28] Unknown:
Recognizing that Schemata is still very early days, I think you just announced it maybe a couple of weeks before the time of recording, in that time, and in your own experience of using it, I'm wondering what are some of the most interesting or innovative or unexpected ways you've seen it
[00:45:43] Unknown:
applied? So I'm hearing a lot of interesting feedback about how people are trying to adopt it. One thing: the whole schema suggestion model came from one company trying to adopt Schemata. They said, okay, we know this schema; can you add some kind of macro on top of it? Essentially, if you are going to add a new entity and you want certain fields to be mandatory, add a macro functionality, so that the moment somebody proposes the entity, those additional fields are automatically added to the schema definition.
It's the typical auditing fields, like created, updated, or deleted timestamps, or the owner name of whoever is adding it; can you add those automatically, things like that. So I think adding that kind of macro functionality is a very interesting thing; it's the first time I've seen somebody try to do that. It will be exciting to see how we can add that to the platform.
[00:46:43] Unknown:
In your work of building this project and releasing it to the world, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:46:52] Unknown:
I think the first challenge, the one you mentioned about the DSL, is a very interesting scenario. As a developer, you always go and write your own abstraction, and then it's like, oh no, you should not write your own abstraction here; it has to meet what people actually want to do. That's a very interesting thing. The interesting lesson I learned is that once we introduced that kind of annotation, people were able to resonate with it. People resonate with it if it is embedded within their workflow. The developers were able to say: oh, I need to add this, and this is much easier for me because this is my familiar territory.
Versus going to a UI and adding something there; developers usually don't go there and see what is happening. That was my hypothesis, and I'm glad it was validated in many cases: developers are still much more comfortable adding annotations in the Protobuf file than going to a UI or some other workflow that they are not familiar with. Yes.
[00:47:51] Unknown:
For people who are interested in being able to have this view across the different stages of their data lifecycle, of the schema and the modeling and the way that it is transformed and the compatibility issues, what are the cases where Schemata is the wrong choice and there might be a better utility for them? Schemata will be the wrong choice
[00:48:11] Unknown:
if you are the only person who building your company. So everything can be nicely fit in your system. You don't need to much worry about it. I feel like the moment that you started to have some kind of a 2 persons communication happen, like, before in order to finalize what are you going to track, you have to write it somewhere. Maybe schema data will be little heavyweight for that, but I would highly recommend to kind of have some kind of either via Kida Pipo or even from a simple Excel sheet. Having that contract right is is very, very important to have it. But as an organization started with you, and more collaborator comes in, and that is why schema are coming to the picture. So if you're starting new, you're a 1 person team or, like, even a 3 person organization, and you don't really have a separated engineering team, everything happening in the same team, I don't think schema is, like, kind of a heavy lift for that. It's still potentially useful, but it's still a little more work and additional component to run for your system, versus you are starting to hire your data engineering team now, and you started to hire your analysis team now, and there is a separate product development team is there. The moment you started to touch point that stage, I think that's the right moment for you to start looking at schemata, like, system to kind of because the more early that you start schemata, the more and more knowledge that you are starting to gather out of it, and you're tracking your organization growth at some point of a time. And implementing systems like schemata is not a very schemata is a very lightweight system. Right? It's not a very heavy light systems, like a data catalog or any other system that you need to have, like, whole bunch of procurement requirements, stuff like that. It's like a running into a CICD pipeline, listening to your data report, and just observe what is happening there and then making the sanity for you. If you instrument that kind of collaborative nature very early on with Schemata, a, it started to enforce a data driven culture in your organization, because the both the developers and analysts know what is important to maintain the certain contract, and certain compatibility in your data warehouse.
If you miss that, you will quickly start losing faith in the data: oh, this data system is always breaking, or this team is not doing a good job. All because there is a lack of tooling to make these two groups collaborate with each other. That's one of the principles behind why I built Schemata with that feedback loop: everyone comes to work with the best intentions. Nobody comes in and says, I'm going to break my data pipeline so that nobody can access the system and I'm going to bring down the trust in my data. But there is no strong feedback loop to connect the data analysis team or the data engineering team with the feature team. Because that tooling is lacking, mistrust creeps in, and that further amplifies the loss of faith in data across your organization, until eventually you start making anecdotal decisions; sometimes they work, sometimes they don't.
So instrumenting a system like Schemata, with that feedback loop, very early on will help you build a strong data culture across your organization. That's the right place to start with Schemata.
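To make the CI/CD feedback loop concrete, here is a minimal sketch of the kind of check that could run on every pull request, comparing the proposed schema against the version on the main branch and failing the build on a backward-incompatible change. The function name, the field-to-type maps, and the comparison rules are assumptions for illustration, not Schemata's actual implementation.

```python
# A toy contract check of the kind a CI/CD pipeline could run on each schema change.
# It flags removed fields and changed types as backward-incompatible.

def breaking_changes(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Compare field-name -> type maps and report breaking differences."""
    problems = []
    for name, old_type in old.items():
        if name not in new:
            problems.append(f"field removed: {name}")
        elif new[name] != old_type:
            problems.append(f"type changed: {name} {old_type} -> {new[name]}")
    return problems


if __name__ == "__main__":
    # Hypothetical schemas: 'old' is what main carries, 'new' is the proposed change.
    old = {"user_id": "string", "plan": "string", "signup_ts": "timestamp"}
    new = {"user_id": "string", "plan": "int64", "signup_ts": "timestamp"}

    issues = breaking_changes(old, new)
    if issues:
        raise SystemExit("Contract check failed:\n" + "\n".join(issues))
    print("Schema change is backward compatible.")
```

Wired into the pull-request pipeline, a check like this gives producers the immediate feedback the conversation describes, before an incompatible change ever reaches downstream consumers.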
[00:51:27] Unknown:
One of the things that you just mentioned in there is probably worth a bit more exploration: the question of using a data catalog as a way to build this visibility and context and understand the lineage of different attributes within a record. I see that as a reactive approach to this modeling and schema management piece, where Schemata is the proactive way. I'm wondering what you see as the balance of the relative utility of those two approaches, and some of the ways that Schemata coexists with data catalogs and metadata platforms?
[00:52:06] Unknown:
It depends on the size of the organization and what they want to do. Generally, I'm not a big fan of passive systems, because they are pull-based: you search for something and then try to figure it out, and unless and until you have a strong need, you will not go and access the system. Versus an active system, where a strong feedback loop is a much better way to maintain the health of the system. How can Schemata coexist with a data catalog? There's a lot of potential there, where the output of Schemata can also flow into the data catalog.
The catalog can be the one place for you to go, search, and try to understand things at some point in time; it can be a knowledge repository. Schemata acts as an active data management system that manages the collaboration around it. So yes, there is definitely the possibility of the two coexisting.
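As a sketch of that coexistence, contract metadata produced by an active system could be exported in a form a catalog's ingestion job can pick up. The payload shape and the file-based handoff below are assumptions for illustration only, not Schemata's real output format or any particular catalog's API.

```python
# Hypothetical export: serialize a contract so a catalog's batch-ingestion job
# could consume it as one more metadata source. Field names are invented.
import json
from datetime import datetime, timezone

contract = {
    "entity": "UserSignedUp",
    "owner_team": "growth",
    "fields": [
        {"name": "user_id", "type": "string", "description": "Stable identifier for the user."},
        {"name": "plan", "type": "string", "description": "Subscription plan chosen at sign-up."},
    ],
    "exported_at": datetime.now(timezone.utc).isoformat(),
}

# A catalog ingestion job could watch a location like this and index the contract.
with open("catalog_ingest_user_signed_up.json", "w") as fh:
    json.dump(contract, fh, indent=2)
```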
[00:52:59] Unknown:
As you continue to iterate on the Schemata project, what are some of the things you have planned for the near to medium term, either in terms of technical implementation, social engagement, or help that you're looking for, etcetera?
[00:53:13] Unknown:
Yeah, totally. As I mentioned, we want to further simplify this whole model by building a UI platform on top of Schemata, so that it helps people who are not familiar with the GitHub workflow. Some data analysts or product managers don't want to create a PR on their own; they could potentially create a pull request from the UI itself, which integrates with GitHub seamlessly, and that bridges the gap between the developers and the data analysts. This is essentially about closing the loop and meeting each and every data practitioner in whatever tool they are comfortable with, integrating them into this workflow so that they can use it without needing to learn anything new. So, ideally, what we are going for is further simplifying the ease of adoption.
An organization should not have to spend much effort to integrate Schemata into their workflow, and they should start to feel the value immediately, whether they are adopting Schemata from day one or not; they should be able to implement it and integrate it into the workflow within a week. So that is the next phase that we're looking at.
[00:54:26] Unknown:
Are there any other aspects of the Schemata project and the problem space that it's addressing that we didn't discuss yet that you'd like to cover before we close out the show?
[00:54:34] Unknown:
Yeah, definitely. I think Schemata right now is largely focused on one specific type of contract: schema standards, data modeling, data contracts, and to an extent data validation, like testing of your data. There is a large ecosystem of data observability, how I can monitor data across different systems, and how that world can be integrated into the developer workflow. One of the things that I'm still thinking about, and have not found an answer to, is that whole area of data observability. A data observability tool is essentially listening to all the datasets that are produced and trying to detect anomalies and things like that.
And the moment you detect an anomaly, maybe it took 30 minutes for your machine learning model to run and determine that it really is an anomaly, the data pipeline has already run three or four hops further by that time. Going back and fixing that is sometimes very costly for certain data quality issues: this has already flowed through four pipelines, so I need to apply some kind of cyclic barrier to my pipeline, I have to broadcast to all my consumers, and I have to retract those three, four, five, ten stages, rerun them, and fix it. What is the best way for something like Schemata to enable people to quickly fix that?
How can that workflow be seamlessly integrated into the system? That I have not figured out yet, but it is something that I feel we, as a data industry, need to be talking about actively. The cost of backfilling something is also pretty high, and sometimes it is not worth doing at all.
[00:56:27] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. Along with that, I don't believe we mentioned it at the beginning, but for anybody who isn't aware, you're also the author and maintainer of the Data Engineering Weekly newsletter, and I appreciate all the effort you put into it, because I use it as a source of inspiration for these interviews. So thank you for your work there. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:56:58] Unknown:
I think the tooling, something like Schemata, an active data management system, is what is missing, in my opinion. That's a huge gap. I hear a lot of arguments today like, we want to deliver certain things fast, and it's okay to deliver certain things fast. But when it comes to building your annual recurring revenue and the standard metrics that you want to publish, how do we streamline that process? In the microservices world we have production readiness checklists: if you are pushing a service to production, you should have monitoring and reliability pipelines enabled, you should have a deployment dashboard enabled, and so on. There are certain checklists to make sure that whatever service you are deploying is running and reliable.
That is one area where I feel we have not built enough tooling. The second area is cost, which is obviously a major factor. I just ran a survey that says almost one third of data engineering time is spent on optimizing, or thinking about optimizing, cost. So certainly we are missing something pretty important here if we are just throwing money at the problem; at some point it comes back to optimization. There needs to be tooling around giving an immediate feedback loop to balance flexibility versus optimization.
I guess we may never fully find that balance, but tooling that at least attempts to find it is one of the gaps I see.
[00:58:34] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing on Schemata. It's definitely a very interesting and important project, so I appreciate the time and energy that you've been putting into it. I look forward to experimenting with it for my own work. So thank you again for your work, and I hope you enjoy the rest of your day. Awesome. Great. Thanks so much.
[00:58:57] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Ananth's Background and Journey in Data Engineering
Introduction to Schemata Project
Challenges in Data Management and Event Tracking
Importance of Schema in Data Systems
How Schemata Helps in Data Management
Engaging Application Engineers in Data Lifecycle
Adoption and Collaboration with Schemata
Technical Implementation of Schemata
Broader Impact of Schemata on Teams and Stakeholders
Design and Positioning of Schemata
Current Challenges and Feedback on Schemata
Potential for Schemata Across Organizational Boundaries
When Schemata is the Wrong Choice
Schemata vs. Data Catalogs
Future Plans for Schemata
Closing Remarks and Final Thoughts