Summary
The modern data stack has made it more economical to use enterprise-grade technologies to power analytics at organizations of every scale. Unfortunately, it has also introduced new overhead to manage the full experience as a single workflow. The Modern Data Company created the DataOS platform as a means of driving the full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode, Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring across all functions!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
- Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda today to find out more.
- Your host is Tobias Macey and today I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what your mission at The Modern Data Company is and the story behind it?
- Your flagship (only?) product is a platform that you're calling DataOS. What is the scope and goal of that platform?
- Who is the target audience?
- On your site you refer to the idea of "data as software". What are the principles and ways of thinking that are encompassed by that concept?
- What are the platform capabilities that are required to make it possible?
- There are 11 "Key Features" listed on your site for the DataOS. What was your process for identifying the "must have" vs "nice to have" features for launching the platform?
- Can you describe the technical architecture that powers your DataOS product?
- What are the core principles that you are optimizing for in the design of your platform?
- How have the design and goals of the system changed or evolved since you started working on DataOS?
- Can you describe the workflow for the different practitioners and stakeholders working on an installation of DataOS?
- What are the interfaces and escape hatches that are available for integrating with and extending the operation of the DataOS?
- What are the features or capabilities that you are expressly choosing not to implement? (e.g. ML pipelines, data sharing, etc.)
- What are the design elements that you are focused on to make DataOS approachable and understandable by different members of an organization?
- What are the most interesting, innovative, or unexpected ways that you have seen DataOS used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on DataOS?
- When is DataOS the wrong choice?
- What do you have planned for the future of DataOS?
Contact Info
- @srujanakula on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Modern Data Company
- Alation
- Airbyte
- Fivetran
- Airflow
- Dremio
- PrestoDB
- GraphQL
- Cypher graph query language
- Gremlin graph query language
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Gartner: ![Gartner](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/4ODnKDqa.jpg) The evolving business landscape continues to create challenges and opportunities for data and analytics (D&A) leaders — shifting away from focusing solely on tools and technology to decision making as a business competency. D&A teams are now in a better position than ever to help lead this change within the organization. Harnessing the full power of D&A today requires D&A leaders to guide their teams with purpose and scale their scope beyond organizational silos as companies push to transform and accelerate their data-driven strategies. Gartner Data & Analytics Summit 2023 addresses the most significant challenges D&A leaders face while navigating disruption and building the adaptable, innovative organizations this shifting environment demands. Go to [dataengineeringpodcast.com/gartnerda](https://www.dataengineeringpodcast.com/gartnerda) to find out more. Listeners can save $375 off standard rates with promo code GARTNERDA.
- MonteCarlo: ![Monte Carlo](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/Qy25USZ9.png) Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit [dataengineeringpodcast.com/montecarlo](https://www.dataengineeringpodcast.com/montecarlo) to learn more.
- Materialize: ![Materialize](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/NuMEahiy.png) Looking for the simplest way to get the freshest data possible to your teams? Because let's face it: if real-time were easy, everyone would be using it. Look no further than Materialize, the streaming database you already know how to use. Materialize’s PostgreSQL-compatible interface lets users leverage the tools they already use, with unsurpassed simplicity enabled by full ANSI SQL support. Delivered as a single platform with the separation of storage and compute, strict-serializability, active replication, horizontal scalability and workload isolation — Materialize is now the fastest way to build products with streaming data, drastically reducing the time, expertise, cost and maintenance traditionally associated with implementation of real-time features. Sign up now for early access to Materialize and get started with the power of streaming data with the same simplicity and low implementation cost as batch cloud data warehouses. Go to [materialize.com](https://materialize.com/register/?utm_source=depodcast&utm_medium=paid&utm_campaign=early-access)
Truly leveraging and benefiting from streaming data is hard. The data stack is costly, difficult to use, and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database, not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL, including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring.
Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies, and lead with purpose. Join your peers at Gartner Data and Analytics Summit from March 20th to 22nd in Orlando, Florida for 3 days of expert guidance, peer networking, and collaboration. Listeners can save $375 off of standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda to find out more. Your host is Tobias Macey. And today, I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company. So, Srujan, can you start by introducing yourself?
[00:01:33] Unknown:
Hi, Tobias. Thanks for having me on the podcast. Nice to be here. Just by way of background, I'm the cofounder and CEO of The Modern Data Company. We started this company about 4 years ago and built a product called the data operating system, DataOS. My personal background comes from a bachelor's and master's in computer science. I started my career at Motorola building network infrastructure, then moved to product management and worked in multiple consumer, location-based services startups, building navigation products and maps. I did my own startup called Doot, which was a location-based messaging product, and led that to an exit. And over the last 8 years, you know, I've kind of transitioned to working more on data platforms, large scale data platforms, as head of product at companies like Persona Graph and, more recently, Appsila, where our team built these massive scale platforms used by advertising and marketing teams to drive efficiencies and become the intelligence hub for marketing. So that's the background I come from. My cofounder and I, he was the head of engineering and I was the head of product in these data companies that I just mentioned, and we came together to start Modern to solve what we feel is a pretty big gap in how simple it is to work with data today. That's the mission with which we started the company: to solve the data complexity in the modern data stack.
[00:02:54] Unknown:
And so in terms of the modern data company itself and what you're building, I'm wondering if you can give a bit of an overview about what it is that your actual mission is and some of the story behind how it got started and why this is the problem that you want to spend your time and energy on?
[00:03:09] Unknown:
So when we kind of started, we had built a data management platform at Appsila. And when we sold those data platforms, the attribution and the analytics on raw data, into these massive organizations, what we realized is that organizations that had full control over their data stack, and were able to use that entire ecosystem efficiently, got 10x the value from the same product versus organizations that didn't have that level of control over their data stack and a unified approach to data. That is what got us really thinking about, you know, why is that happening? Such a big gap between certain companies on the same product getting significantly higher ROI.
They were able to kind of solve the complexity of data to a certain extent, to bring in new data sources and use them a lot more effectively for things like fraud detection and such. And that is where we realized there is a gap. Even though organizations are investing a lot of dollars into data modernization, into the modern data stack, there is still a gap that is not allowing business users on the enterprise side to work with data. They're not thinking outcome first. They're still very dependent on IT to drive and solve their business problems. So we realized there is a need for an operating system, sort of an architecture, that can be layered on top of a customer's existing data infrastructure.
And allow them to use that data in much simpler ways, both from a data developer perspective and from a business perspective. In a sense, that is what we did with DataOS. For our customers, we go in, we deploy DataOS as a platform as a service in their environments. And usually, in a matter of 4 to 6 weeks, we make their existing data stack work in a modern way and deliver that simplicity that I was talking about.
[00:04:58] Unknown:
In terms of the product that you're building, you mentioned that DataOS is kind of the core offering. And I'm wondering if you can talk to what you view as the overarching scope and the goal that you're hoping to achieve by building this and providing it as a service.
[00:05:14] Unknown:
So to start with, one of the core goals for us is to provide a unified approach to data in an open standards manner, where the customer has full control to extend and bring in the tools of their choice, both from the data management perspective and tools to actually use your data, like the Tableaus, Power BIs, AI/ML platforms, and such. So that is sort of the core essence of DataOS. If you think about, you know, what we actually deliver, I put it in 3 layers. At the most basic level, we have a data layer, and the value we bring is by deploying DataOS on top of your existing infrastructure. There are 3 things that happen in a fairly automated manner. One, you get a consistent way to access any underlying data. So through a standard SQL interface, our customers can now start accessing everything from CSV files to text files to databases to lakehouses, you know, all types of data systems including Kafka topics and Pulsar topics.
And more recently, we're introducing a way to do a select on an API endpoint. So being able to give you one consistent way to access all of this data through a common SQL interface reduces the need to replicate that data, reduces the need to create new pipelines. That's one side, at the most basic level. The second thing we do: the way we've built an attribute-based access control governance engine from the ground up allows our customers to start governing a traditionally non-secure data storage system, like an Excel file or an Access DB or an Oracle DB, with the most modern ways of governing it. For example, you can apply row-level filtering on Excel data. You can apply encryption, you know, on an Oracle database.
Once you elevate the governance to a modern standard on top of the existing systems, that gives you a lot of flexibility in how you want to use the data. And the third thing we do in the data layer is give you a consistent way to discover this data. So we have a fairly automated data catalog that allows you to do deep searches, advanced searches, Google-like searches across all data sitting in a multi-cloud, heterogeneous environment. These, if you think about it, are some of the core principles of a data fabric: providing that level of consistency with respect to discoverability, access, and governance.
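To make the "one SQL interface over everything" idea concrete, here is a minimal sketch of what a federated query across a flat-file source and a Kafka topic could look like from a client's perspective, assuming a hypothetical ODBC DSN exposed by the platform. The DSN, depot names, and column names are illustrative assumptions, not actual DataOS identifiers.

```python
# Hypothetical illustration: querying heterogeneous sources (a CSV-backed depot
# and a Kafka topic) through one ANSI SQL interface over ODBC.
import pyodbc

# Assumed DSN configured by the platform operator; credentials are placeholders.
conn = pyodbc.connect("DSN=dataos_gateway;UID=analyst;PWD=changeme")
cursor = conn.cursor()

# One SQL statement joining a streaming source with a flat-file source.
cursor.execute("""
    SELECT o.order_id, o.amount, c.segment
    FROM   kafka_depot.orders_topic AS o      -- streaming source
    JOIN   csv_depot.customer_master AS c     -- flat-file source
      ON   o.customer_id = c.customer_id
    WHERE  o.amount > 1000
""")
for row in cursor.fetchall():
    print(row)
```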
On top of it, we have a knowledge layer, active metadata management that creates this knowledge graph on top of the existing customer infrastructure. So we spend a lot of R&D time in being able to introspect your existing pipelines, all of the existing tooling that you have, capture all of that knowledge into this metadata layer, and we provide an open API, which allows you to leverage all of this metadata that we are collecting across the enterprise and use it to enrich the value of the existing catalogs and tools that you already have. And all of this is a repeatable, reusable architecture. So that's the core essence on the data side. And then on top of it, we introduced the concept of a data lens, which allows business users to define the outcomes that they want, the metrics, the measures that they would like to see, working from the right-hand side. So define the ideal state of how you want to look at your customer data, how you want to look at orders, even though the orders are coming from 10 different order management systems, for example.
And once you do that, we provide a very simple way for business users to ask questions in 2, 3 lines without having to write complex SQL or stored procedures, and the entire pipeline gets automated. And with that, you can now use our tools on the other side to deliver the data as a time series database if you want to do AI/ML on top of it. On the same data, we can deliver it as a data cube to run your advanced analytics use cases, reporting use cases. We provide APIs and GraphQL interfaces on top of these data products so that you can build applications on top. And DataOS, of course, takes care of all the management, deployment, and access of those apps. So it's a lot more agile, simple way for businesses to take advantage of all of this productization that we do on the data discovery and the knowledge graph side, and use all of that information to start using data themselves without having to constantly depend on IT. So those are the 3 layers: the data layer, the knowledge layer, and the lens layer on top.
And then this is surrounded by standard ODBC, JDBC connectors, APIs on the data products, and GraphQL interfaces, which enable the business teams to bring in any tools of their choice. So we don't limit you to say everybody in the organization has to use Tableau or Power BI. You can bring in tools of your choice. You can bring in query engines of your choice. You can develop your AI/ML on top of the platforms that you want to, and they all can connect to our standard interfaces. So we want to be that layer that provides that openness and gives customers the freedom of choice to use the tools of their choice.
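As a rough illustration of the GraphQL interfaces described above, the sketch below posts a query to a hypothetical data product endpoint. The URL, token, and field names are assumptions made for illustration, not the actual DataOS API.

```python
# Hypothetical sketch: reading a "data product" through a GraphQL interface.
import requests

query = """
{
  customer360(segment: "enterprise", limit: 5) {
    customerId
    lifetimeValue
    lastOrderDate
  }
}
"""

resp = requests.post(
    "https://dataos.example.com/graphql",        # assumed endpoint
    json={"query": query},
    headers={"Authorization": "Bearer <token>"},  # placeholder token
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["customer360"])
```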
[00:10:13] Unknown:
In terms of the kind of composability of what you're building and the integration with existing infrastructure, I'm wondering what are some of the assumptions that you have about kind of the underlying use cases, underlying capabilities for an organization where you're getting engaged and some of the ways that customers should be thinking about whether DataOS is an appropriate solution given their existing operations and their technical expertise.
[00:10:41] Unknown:
DataOS is being used by our current customers, and we have some large enterprise customers using DataOS today to drive their data modernization. We never rip and replace anything that they already have. So this delivers the value that I mentioned in terms of simplicity, without any disruption to their current stack. We don't want customers to constantly worry about, you know, doing integrations and modernization. So we can connect into your existing pipelining tools and extract all of that metadata and graph it out. We provide that layer that allows you to govern all of this underlying data in a unified manner, and the access that I spoke about. And we've built this platform with such a composable architecture, with abstractions at each capability within the unified ecosystem we deliver, which allows us to, for example, if the customer has Alation as a catalog, easily extract the metadata from Alation and push it into our discovery, or vice versa. The knowledge that we build out is available.
Every aspect of DataOS is API driven, and these APIs are available to our customers. So you can start integrating these open APIs into your existing stacks and raise the value without having to replace anything. So that is one of the big value-adds we deliver: you get this modernization without having to disrupt what you're doing today.
[00:11:59] Unknown:
On your site, 1 of the things that you're discussing in the context of this data OS is the idea of data as software. And I'm wondering if you can talk through some of the principles that make that possible and some of the ways that that manifests in terms of how people are engaging with your platform.
[00:12:16] Unknown:
As a general design principle, we are always code first and UI second. Right? So every data artifact is versioned on the DataOS side, including the data itself, which is automatically versioned. We provide a lot of declarative tools for our customers to, you know, start delivering value without having to get into all the complexity and the boilerplating you'd have to do just to do basic functions. A lot of those things are abstracted out. As an example, take the way we do the consistent access, discovery, and governance through our data depot construct on your existing systems. Instead of you having to share credentials with your developers and figure out the key rotations and all of that, as soon as, you know, we operationalize a SQL Server, for example, each table becomes a data product with a unique address that we deliver. So when you are creating jobs, when you're trying to manipulate the data, you just call the UDL, like a URL for data, within the DataOS context.
And the entire security and all of that is automatically taken care of. So across a lot of these aspects that we provide, we are very DevOps friendly in how we approach our infrastructure. It is delivered as infrastructure as code, so you can very easily bring up environments. You don't need to create separate dev and prod, you know, instances. You can have different workspaces and namespaces to manage your dev and test environments within the same install. So a lot of these typical software engineering practices, which allow the repeatability and reusability of what you do, is what we are trying to drive. That way of thinking about data.
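For illustration, here is a minimal sketch of what a declarative, versioned job spec that refers to data by a UDL-style address might look like. The YAML keys and the dataos:// addresses are hypothetical assumptions, not the actual DataOS resource schema.

```python
# Hypothetical sketch of a declarative job definition that refers to data by a
# UDL-style address instead of embedding credentials, and targets a workspace.
import yaml  # pip install pyyaml

job_spec = yaml.safe_load("""
version: v1
name: customer-cleanup
workspace: dev            # same install, separate namespace for dev/test
inputs:
  - address: dataos://sqlserver_depot/sales/customers   # UDL-like address
outputs:
  - address: dataos://lakehouse_depot/curated/customers
steps:
  - transform: drop_duplicates
  - transform: mask_pii
""")

print(job_spec["inputs"][0]["address"])
```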
[00:13:52] Unknown:
In terms of being able to make that code first approach possible, what are some of the interfaces that you are exposing, some of the programming languages that you're treating as first class citizens, and how you think about some of the kind of tool selection and the capabilities that you expose versus the things that you try to kind of automate and own internal to the platform?
[00:14:18] Unknown:
So we do have, like I said earlier, a very, you know, microservices-driven architecture where each capability, every aspect of DataOS, is API driven, and our customers have access to those APIs. If you don't like the experiences you see on DataOS, our customers can create completely new data experiences by leveraging that API layer. We provide a command line interface, very similar to a kubectl sort of approach to that whole CLI. And there's a lot of dev tooling, sample code, you know, templates written so that you can easily declare, you know, what you want instead of having to write everything from the ground up. So there's a lot of that dev tooling we built. And when it comes to different stacks, we don't want to impose, you know, that you have to use a specific stack. So we, on the DataOS side, have integrated multiple stacks. So for example, for Spark, we create an abstraction around Spark, so you don't have to worry about all the boilerplating and the resource management. All of that is cleanly automated by DataOS, and the developer can declare what needs to get done. We have separate stacks for streaming data. We have a different stack for complex event processing.
We are also integrating the dbt stack. So depending on, you know, what the customers would like, DataOS, with the openness of our platform, allows you to bring in multiple stacks. And then when you're actually creating your DAGs, you know, the one advantage of DataOS, the developer experience, is you can work with multiple stacks and create the DAG. So you can, you know, use our Spark-based stack to do some of your basic Spark programming. We have a streaming stack called Benthos. We have a stack for your data app deployment and running. And all of these different types of stacks can be connected together, you know, into a DAG. So there is that as well. And then the other aspect is you can now, as a developer, start accessing data that is stored in any of the different storage systems using a common SQL interface.
So all of these capabilities are what facilitate that kind of simplicity and the data-as-software construct.
[00:16:23] Unknown:
You also list that there are 11 key features for the data OS. And I'm curious what your process was for identifying what are those kind of core capabilities that are a must have, and what are the pieces that are a nice to have in figuring out what the order of priorities and operations was as you were going from idea to execution and the initial launch of the platform?
[00:16:45] Unknown:
So one thing that we constantly encourage our customers to do is start thinking right to left. Start thinking from the outcomes and let technology deliver data against those outcomes, versus limiting yourself with "what data do I have?" Right? So define your ideal customer 360. Let DataOS tell you the quality of the data, what data is available today, what is missing, and how to facilitate the right, you know, outcome in the ideal manner. The second thing we wanted to do is give a consistent interface to business teams so nothing breaks on their side. You could add, you know, new data sources or change your data, modernize your infrastructure to go to a cloud warehouse, or move from Snowflake to Redshift.
The business team should always have a consistent way of working with data, and the rate of change that is happening on your data infrastructure should not break anything on the business side. These were some of the core principles, you know, that we wanted to enable. So now if you work backwards from that, one thing we noticed is, when we looked at all the existing data governance stacks, for example, we realized we wanted to build something from the ground up to provide the kind of control we wanted to provide. So we built an ABAC-based governance engine from the ground up. We built a workbench capability to provide SQL access to, you know, do a select on Kafka topics, API endpoints, and data sources.
And that is something that we didn't see exist in the market, so we developed it. More recently, we started deploying this concept of a data contract, which essentially is an enforceable SLA between the producer and the consumer of the data. So when somebody's producing data, they know exactly what it is being used for, what sort of trust and quality dimensions are needed to make sure that the business can actually leverage the data that they're producing. So that is something that we are building, you know, from the ground up. So wherever we saw a gap in the current data ecosystem to provide the kind of simplicity that I spoke about is where we have invested. The one thing we are very clear about is we do not play in the analytics space. We do not play in the AI/ML space. Our goal is to facilitate the simplicity, provide an operating system that can deliver high quality, trusted data in any format the customer needs, so that they can bring in the tools of their choice to drive outcomes. So that is sort of how we thought about it.
Another core capability we built is observability through the entire data life cycle. That facilitates a lot of efficiencies from a data developer perspective, but at the same time, even business users start leveraging our observability to deliver insights during their actual workflows. So one of our customers uses our observability to understand anomalous orders being placed, to understand that their customers' shopping cart construction has changed, that the patterns have changed, and then send the sales team push notifications, versus them having to come to a Tableau or Power BI dashboard to look at that a few weeks later. So those kinds of modern experiences are, you know, what we are driving, and for those we built some of the capabilities you see on our website. And then there are also, for example, data connections.
The ingestion piece, for us, is more commoditized. So, you know, we provide a few basic connectors, but we also integrated an Airbyte cluster. You know, we can work with the Fivetrans and Aloomas of the world, or the cloud-native data connectors. So that is where we did not invest. But the core capabilities which we thought are lacking today to deliver the simplicity, we built ourselves. And that's something I wanted to dig into as well, with kind of the modern data stack being the latest meme that has some staying power.
[00:20:25] Unknown:
And there are some companies who are approaching this challenge of kind of tool sprawl within the modern data stack by packaging all the other existing tools under a single facade, but under the covers, it's still those separate tools. And I'm wondering what your thoughts were on kind of how to approach the kind of repackaging of existing tech versus building your own capabilities and what layer of the stack you were trying to operate, whether it was just kind of this integration piece or if you wanted to be kind of a key element of the stack or kind of how you think about playing in this broader ecosystem?
[00:21:02] Unknown:
From our experience with a lot of our current customers, they have invested, like, hundreds of millions in this modern data stack and data modernization. But you're not solving the complexity, you know, even with the modern data stack. The aspects that I started the conversation with, being able to provide that consistent discovery, governance, and access, giving business teams full control and empowering them to, you know, define outcomes and make sure the data gets delivered against those outcomes, are still lacking. There is a missing layer in the modern data stack that can make all of these, you know, heterogeneous tools that you're bringing together to stitch your modern data stack become interoperable through a metadata layer, which is open. So that gives you a way to, you know, look at all of that through a single pane.
And then provide the value and simplicity from a business user perspective. That is still missing even after organizations have invested in this modernization. Given the heterogeneity of their existing stack, you know, you can't really bring in a third party governance tool and provide the kind of native, centralized governance that we can provide, which gives you a lot of flexibility to even provide 100 different views into an Excel file without making copies, for example. Right? So that is where we see a layer missing in the current modern stack to facilitate this kind of simplicity. And the reason we chose to be that open in our architecture is because even while you are investing in this, there are still some missing pieces. So we can kind of fill those gaps that the customer has with respect to governance or access or discoverability to deliver the value.
Some of our customers, you know, they want to use our catalog instead of the existing cataloging tools that they have. So we don't limit that. But we see us as that missing layer that'll deliver the simplicity from a business user perspective.
[00:22:59] Unknown:
Are you struggling with broken pipelines, stale dashboards, missing data? If this resonates with you, you're not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end to end data observability platform. Trusted by the teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes.
Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today. Go to dataengineeringpodcast.com/montecarlo to learn more. Digging into the implementation of DataOS, I'm wondering if you can talk through some of the technical architecture that you're building and how you think about the kind of composability. What are the pieces that you are confident in, and what are some of the areas that you're continuing to iterate on and evolve as the rest of the ecosystem grows?
[00:24:13] Unknown:
To start with, we currently deploy DataOS as a PaaS. So we deploy within the customer's subscription. Following an operating system construct, we built a cloud kernel tier at the most basic level, which abstracts out network, storage, and compute so that we can deploy on any cloud. So we are cloud agnostic. We run on all major clouds in our commercial deployments today. It is delivered as infrastructure as code; we Terraformed the entire installation, with a Kubernetes-driven architecture. And that is how we deliver DataOS and install it within the customer's environment. Typically, it takes us half an hour, you know, once the customer provisions the VMs that we've requested of them, to then deploy DataOS and have it up and running. When we started off, our first iteration of DataOS was a lot more tightly coupled, data discovery to governance to, you know, how some of these components, you know, work together. And what we have done now is, you know, kind of made them a lot more loosely coupled so that we can bring in other tools of choice: if the customer has an existing catalog, be able to leverage that; if the customer is using an existing pipelining tool, how do we extract all of that metadata? So that is where we went from a tightly coupled architecture to a more loosely coupled architecture.
And like I said, there are abstractions built at every layer of the data stack. We start the DataOS architecture with a set of primitives like data depots, your policies, some of the governance aspects of how you govern the data. A lot of these are primitives, you know, repeatable and reusable. And the modularity of the architecture allows our customers to, for example, you know, automate the entire profiling of the data through a basic workflow, or you could call our profiling service, point it to an existing dataset, and just get the profile report on that specific dataset. So that is the nature of the services-driven architecture and the composability of taking the different tools that we unified and giving you different experiences. So for example, we can realize a data fabric design pattern very, very quickly, in weeks instead of the months or years that it takes our customers, because we have all the building blocks of a data fabric, and the composability of our architecture allows us to package these things to create data fabric design patterns. We've implemented almost like a CDP design pattern to build and operationalize customer 360s for some of our customers.
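As a sketch of the kind of service-driven call described here, the snippet below invokes a hypothetical profiling endpoint against an existing dataset and reads back column-level statistics. The endpoint path, payload keys, and response shape are assumptions for illustration, not the actual DataOS API.

```python
# Hypothetical sketch: calling a standalone profiling service on an existing
# dataset and printing per-column stats such as completeness and cardinality.
import requests

payload = {"dataset": "dataos://crm_depot/sales/customers"}  # UDL-style address
resp = requests.post(
    "https://dataos.example.com/api/v1/profile",  # assumed profiling endpoint
    json=payload,
    headers={"Authorization": "Bearer <token>"},   # placeholder token
    timeout=60,
)
resp.raise_for_status()

report = resp.json()
for column in report.get("columns", []):
    # e.g. completeness of the email field, cardinality of an ID field
    print(column["name"], column["completeness"], column["cardinality"])
```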
We are now starting to work with some customers who are leveraging the data OS to roll out the data mesh architecture in their organization. So all of these different patterns can be realized using the components of the data OS, which are composable.
[00:26:52] Unknown:
For the workflow for people who are adopting DataOS, I'm wondering if you can talk through some of the process of getting it integrated into their systems, getting it integrated into their workflows, and some of the kind of collaboration aspects about how the workflow transitions between different roles and stakeholders in an organization?
[00:27:12] Unknown:
To start off with, you know, once DataOS is deployed on a customer's cloud environment, we have this concept of a data depot. Think of it as slightly more functionality than the typical ingress and egress that you do with a data integration tool. We elevate the governance. We have the capability to do query pushdowns through our depot construct. We provide you universal data links to each data table or file through the data depot construct so you don't have to constantly share security credentials. So that kind of gives you that consistent discovery, access, and governance on top of our depot construct. So when we install, the first thing we do is provision these data depots, which essentially are the access, you know, into these traditional source and sink systems.
Once that gets done, there's a fair amount of automation to convert your Excel files, your tables, etcetera, into data products on top of DataOS and catalog them. We do the schema evolution. We do the dictionary of the data. We can fingerprint and help you understand your classification of data. We profile the data. And a lot of this happens while we keep the data in place. This now gives you a way to understand, let's say, your customer database. Now I can tell you, you know, what is the completeness of your emails? What is the cardinality of, you know, the SKU field? So on and so forth. So that level of profiling, at the most granular column level, is automated for you. We extract the lineage information from the source system to where we are, you know, within DataOS. And all of that lineage is available to you in an automated manner. Once you get that level of understanding of the data, the lineage, the provenance, then all these different stacks we make available allow data developers to quickly declare, you know, what they're trying to do. And we do a fair bit of automation of actually creating the job and running it. We provide the entire job orchestration on top of DataOS. So that is where the data developers work on the modeling piece. And then the lens construct essentially is where the business users are almost defining, you can call them, you know, digital duplicates, digital twins, where you're defining the metrics, the measures, the attributes that you want to see in an ideal state. We leverage all of this knowledge within DataOS, through the knowledge graph, through the catalog, through, you know, multiple aspects, and intelligently map data into those lenses.
And once you start creating these lenses, we push down. So you can build a customer 360 lens where you have your clickstream data coming from BigQuery, you have your customer data coming from your Salesforce CRM, and you have some other data coming from your SAP Hybris stack or third parties. And once you construct a lens, which takes all of these capabilities and creates a customer 360, the business users start asking questions in 2, 3 lines, almost like NLP-ish, simple English sorts of queries. And then we have the intelligence to figure out that this query needs to be fanned out to these 4 systems, what joins need to be applied, what transformations need to be applied, and give them the result while we keep the data in place. And when you like it, you can operationalize this into any data format, into any type of data system, and start using it. So that is a typical flow of, you know, how DataOS gets used. The initial discovery, cataloging, and making sure the governance policies are set up is your, you know, data governance folks, data stewards, and data engineers working at that level. And then we have an entire business layer, which the business users work on. So those are the 2 key personas for DataOS.
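To illustrate the lens idea, here is a hedged sketch of how a customer 360 lens and a short business question might be expressed. The structure and names are hypothetical, not the actual DataOS lens syntax; the platform, not the user, would be responsible for mapping the question onto the underlying systems and joins.

```python
# Hypothetical sketch of a "lens": business users declare the metrics and
# dimensions they want (the ideal state), and a short question is mapped onto
# whichever source systems hold the data.
customer_360_lens = {
    "name": "customer_360",
    "dimensions": ["customer_id", "region", "acquisition_channel"],
    "measures": {
        "lifetime_value": "sum(order_amount)",
        "order_count": "count(order_id)",
    },
    # Sources may live in different systems; joins/transforms are handled for you.
    "sources": ["bigquery_depot.clickstream", "salesforce_depot.accounts"],
}

# A two-line question against the lens, instead of hand-written SQL and pipelines.
question = (
    "lifetime_value and order_count by region "
    "where acquisition_channel = 'paid_search'"
)
print(customer_360_lens["name"], "->", question)
```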
[00:30:47] Unknown:
In terms of the escape hatches or some of the ways that developers and platform operators can address some of the maybe sharp edges or limitations of DataOS, what are some of the ways that you think about being able to kind of add escape hatches to the workflow to say, okay, well, we don't quite have this integration implemented. So if you want to be able to call out to, you know, an API within your Airflow cluster to figure out what the latest task execution time is, then, you know, here's where you drop that snippet of code or something like that.
[00:31:18] Unknown:
So as I mentioned earlier, you know, if we don't have data depots for every system, we provide an Airbyte integration to say, hey, we give you a lot more options that way. From a query engine perspective, we provide our own query engine, but you have our standard interfaces where you can plug in Dremio or Presto or Databricks Spark SQL, you name it. So we provide a lot of flexibility for the developers to bring those kinds of tools of their choice on top of DataOS. Every data product within DataOS has an API layer so that you can start leveraging that data. If you want to build more visual applications, we'll give you GraphQL interfaces on top of it. And then we also have standard ODBC, JDBC, or data connectors so that you can plug in any tools of your choice. And if you connect, for example, your Tableau through our standard ODBC, JDBC connectors, then the governance policies, the granular governance that you're setting up, apply there as well. So one tag that you drop on a specific field to change the classification from non-PII to PII, and if you say encrypt it with SHA-256, that essentially gets applied everywhere, you know, from the query engine layer to the reporting to the Jupyter notebook layer. And that is how we, you know, provide those kinds of simple interfaces.
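As a sketch of the kind of attribute-based policy described here, the snippet below tags a column as PII and attaches a hashing rule intended to apply the same way whether the data is reached from a query engine, a BI connector, or a notebook. The schema shown is an illustrative assumption, not the actual DataOS policy format.

```python
# Hypothetical sketch of an attribute-based access policy: a tag dropped on a
# field in the catalog plus a masking rule that should follow the data
# everywhere it is queried.
import yaml  # pip install pyyaml

policy = yaml.safe_load("""
name: mask-customer-email
type: data_policy
selector:
  column_tags: ["PII.email"]        # tag applied to the field in the catalog
rule:
  action: mask
  method: hash_sha256               # hash instead of exposing the raw value
applies_to:
  - query_engine
  - bi_connectors                   # e.g. Tableau over ODBC/JDBC
  - notebooks
""")

print(policy["rule"]["method"])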
And then we also allow customers to build applications on top of DataOS. So we package DataOS with a few applications on top of it, but our customers are now starting to build applications by leveraging that API layer. And once you build the application, DataOS takes care of the entire management, access controls, deployment, and discoverability across the enterprise. So we have an entire, almost like a small app store construct on top. So that gives them the extensibility.
[00:33:00] Unknown:
And in terms of that kind of application, you know, the app store or the application idiom, I'm wondering how you think about that as far as being able to do kind of sharing of business logic or of workflows, both across an organization and then even between organizations.
[00:33:15] Unknown:
A lot of that also is driven by the kind of understanding of the context of the data, the semantics, you know, the knowledge around the data, both from a business and an operational perspective. So all of this is captured in our knowledge graph, and we provide Cypher and Gremlin interfaces so that you can start extracting that knowledge and leveraging it with other systems, you know, as you'd like to. So both the data and the semantics: any DataOS customer owns the entire data and the semantics around the data completely, and they then have open interfaces to leverage that across, you know, value-driving tools and data management capabilities.
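For illustration, here is a minimal sketch of pulling lineage and semantics out of a knowledge graph with a Cypher query sent over a hypothetical HTTP endpoint. The endpoint, node labels, and relationship types are assumptions made for this example, not the actual DataOS graph model.

```python
# Hypothetical sketch: extracting downstream lineage from the knowledge graph
# via a Cypher query over an assumed HTTP interface.
import requests

cypher = """
MATCH (t:Table {name: 'customers'})<-[:READS_FROM]-(j:Job)-[:WRITES_TO]->(d:Table)
RETURN j.name AS job, d.name AS downstream_table
"""

resp = requests.post(
    "https://dataos.example.com/knowledge/cypher",  # assumed endpoint
    json={"query": cypher},
    headers={"Authorization": "Bearer <token>"},     # placeholder token
    timeout=30,
)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row["job"], "->", row["downstream_table"])
```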
[00:33:53] Unknown:
You mentioned earlier some of the kind of specific elements that you decided, you know, we're not an ML platform, but I'm curious. What are some of the ways that you think about limiting the scope of what you're building so that you can focus on the core value and some of the specific features and capabilities that you're expressly choosing not to try to implement or adopt?
[00:34:13] Unknown:
So we always want to see ourselves as an enablement layer. Right? We want customers to bring in the right tools of their choice to run better machine learning. If I want to enable my entire data sharing through Snowflake, we can do that. So we believe, you know, there are different products that have very good strengths for the specific outcomes that they drive. So instead of us trying to do everything, we give customers that freedom of choice, because what is limiting them in a lot of ways is where, you know, you get locked into a specific vendor or a specific compute or activation cloud, which oftentimes limits those enterprises. So we see ourselves as a layer that'll sit on top of whatever infrastructure you have and provide the simplicity needed for customers to bring in the right tools of their choice to essentially get value from data, and use the right cloud environments and the right activation clouds to run different workloads. So you might want to run your ML workloads on GCP versus some of your supply chain workloads on Azure. That kind of control is what DataOS provides to our customers. So we always want to stay within the enablement layer and provide open interfaces for the value delivery piece, which is analytics, machine learning, and so on and so forth.
[00:35:34] Unknown:
As far as the design elements of the platform, curious, what are some of the ways that you think about making it approachable and easy to use and some of the kind of user experience aspects that have been useful and kind of beneficial to you as you're working with some of your customers?
[00:35:52] Unknown:
So far, we've been very focused on the data developer persona: making sure we provide multiple stacks, abstracting out a lot of the boilerplating, the complexity, the resource management, so that developers can just declare what they want to do and start using data a lot more efficiently. So that persona is where we've focused so far. And as we are starting to work with some large enterprise clients, they want that kind of simplicity even from a business persona perspective. So we are now starting to add a lot more of, you know, like I said, we are always code first, UI second. So a lot of things today are done through YAML, done through a lot of your SQL, etcetera.
We are starting to add simple interfaces for customers to, for example, create their own lenses. You know, right-click to say prepare this data for machine learning, without having to go to IT, and the pipeline is automated for you. So that kind of UI-driven way to empower business teams to do analytics, machine learning, data sharing, and data app building is where we are pushing next, to provide that business-layer simplicity.
[00:36:59] Unknown:
As you have been working with your customers and building the product, what are some of the most interesting or innovative or unexpected ways that you've seen it applied?
[00:37:07] Unknown:
So we've seen, like, very interesting use cases. We work with, you know, a large architecture firm. We started off initially with an IoT use case where, you know, they wanted to understand the space flow of the different, you know, architectures that they roll out. Right? So understand space and optimize space flow for retail, for their commercial clients, etcetera. That is where we started off with them, to now we are looking at their HR data and being able to provide a resource 360, because it's a very services-driven business when you're doing architecture.
So for them to understand what is the right team composition, which locations should work with each other to deliver a major project, what sort of team construction would make sense based on all the historical data they have. So from an IoT use case around spatial analytics and space optimization, to now moving toward resource 360s and HR analytics. And then in between, you know, most of their data analytics team is using DataOS to deliver enterprise reporting on top. So it's very varied in terms of what we are seeing. I can't talk a lot about the details of it, but we just started working with one of the big DOD agencies within the US government, and they have some really interesting use cases around predictive maintenance and supply chain optimizations, you know, within the DOD space as well. So that we did not expect to happen this quickly.
But given the capability, how mature the platform is, and also the governance, you know, that we've set up, it allowed us to get into those kinds of, you know, organizations as well. So that's something that we've seen. But the one thing where I saw massive impact, and we didn't think this small feature would have that kind of a business impact, is the observability on top of DataOS that is driving sales operations for one of our customers. Going from sales teams which had no idea about what orders were being placed, to now getting a level of insight pushed to them, you know, automatically, has really streamlined the sales operations from a customer perspective.
[00:39:13] Unknown:
Yeah. In your experience of building this platform and working with customers and validating it in the market, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:39:24] Unknown:
I would say good question there. I think the biggest challenge for us has been, because there's so much of, I'd say, I don't know if you agree with me or not, but a lot of overselling that has happened, you know, in the data space. You know, every vendor talks about every capability. It looks like everybody has the exact same, you know, product capability when you look at the marketing aspects. So for us to figure out the right way to position this and show, you know, where we fit as a layer on top, that is something that was quite challenging, you know, especially last year. Now we are able to, you know, clearly show this as that missing layer that is delivering very crisp business value, both with respect to time to market and with respect to the agility that the business teams are getting, but that took us some time. So that is where we, you know, at least last year, struggled: to figure out what is the right way to position this, given the amount of noise that's there in data marketing today.
[00:40:19] Unknown:
For people who are interested in being able to kind of simplify their workflows around data on top of some of this modern tooling, what are the cases where DataOS is the wrong choice?
[00:40:31] Unknown:
I don't see DataOS, you know, as a wrong choice, because we are working with customers that have a fairly legacy infrastructure, with, you know, AS/400s and DB2s to Excel files to, you know, PDF scans, to companies that have invested in fairly modern tech stacks with the Databricks, Snowflakes, and dbts of the world. And we are able to deliver similar value in both constructs. And we're talking about, you know, Fortune 500 clients in this context. And we have customers across a wide range of verticals also, from government to logistics to architecture to distribution and retail.
And we are seeing, you know, that kind of value across multiple organizations. Just to give you an example, we have a basic framework to create an entity 360 very, very quickly and deploy and operationalize it. So now I'm seeing that being used by retail to operationalize customer 360s. I'm seeing that used by some services organizations to do resource 360s. Some of our healthcare clients are doing member and patient 360s. You know, we built it in such a way that it's a bunch of primitives, and the core constructs are the same. It's essentially the data and some of the business logic that changes, and that is why we can make that happen, you know, fairly quickly and efficiently.
[00:41:48] Unknown:
As you continue to build and iterate on DataOS and grow your business, what are some of the things you have planned for the near to medium term or any particular projects or problem areas that you're excited to explore?
[00:41:59] Unknown:
Absolutely. So, a couple of things. One, all this metadata that we're able to collect, and kind of the nature of the platform as a unified, you know, data ecosystem capability, allows us to bring in a true 360 around your data; I mean, the metadata around it, the knowledge, the intelligence around your data. And because we're able to collect that amount of information and we see what is happening on the other side, we are now starting to invest a lot in AI for data so that we can start facilitating a much simpler way to work with data. So we are now working on, you know, if the customer declares an outcome, we'll apply AI on top of the metadata that we collect and be able to prescribe what datasets will drive the outcome and what the quality of the outcome would be.
Similarly, you know, we're doing a lot more work on clustering of data to show similar datasets. You know, when customers have 20 different ways of naming a customer ID, how are you creating a common, you know, ontology on top of it in a fairly automated manner? We are working on technologies to provide column vectors, table vectors, data vectors, so that we can understand similar datasets and discover them in a much simpler manner. So there's a lot of investment being made to leverage AI to provide simplicity on the other side. So we are working on more powerful NLP interfaces for business users to ask questions on top of DataOS when it comes to data. So that's something that we're doing. But the thing I'm most excited about, and what our CTO has been blogging a lot about over the last 3 to 4 months, where we're seeing a lot of good dialogue with data leaders across multiple verticals, is this construct of an open data contract.
The business teams can define the contract terms against which they want to see the data. So if you are using your data to do financial reporting, you don't want the data to come from more than 1 hop from the source system. You might want the data to be fresh as of that hour. The quality has to be about 98%. You know, the governance has to be to a certain standard. These are some of the things that you can now start defining as these data contracts. And when your producers are essentially producing the data, the contracts will enforce the quality and the governance standards and the shaping of the data, so that when the business starts using it, they can trust, with full audit compliance logging, that the data is at the quality and the trust level that they want. This, we believe, will create that handshake in a much more efficient and scalable manner between the data producers and consumers.
That handshake, I feel, is lacking today in a lot of organizations, which leads to multiple copies and multiple shadow platforms being created, and we hope this data contract will address that. And we're seeing some really good uptake, and efficiencies that this contract construct is driving, for our current customers.
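To make the data contract idea concrete, here is a hedged sketch of contract terms such as freshness, quality, hops from source, and governance tags expressed in code, with a simple check against an observed pipeline run. The field names and structure are illustrative assumptions, not the open data contract format discussed in the episode.

```python
# Hypothetical sketch of an "open data contract" between a data producer and a
# consumer, with enforceable terms checked against what a pipeline run produced.
from dataclasses import dataclass, field

@dataclass
class DataContract:
    dataset: str
    max_hops_from_source: int          # e.g. financial reporting: at most 1 hop
    freshness_minutes: int             # data no older than this
    min_quality_score: float           # e.g. 0.98 completeness/validity
    governance_tags: list = field(default_factory=list)

finance_contract = DataContract(
    dataset="dataos://finance_depot/reporting/revenue",  # UDL-style address
    max_hops_from_source=1,
    freshness_minutes=60,
    min_quality_score=0.98,
    governance_tags=["PII.masked", "audit.logged"],
)

def violates(contract: DataContract, observed_quality: float, observed_hops: int) -> bool:
    """Return True if the observed pipeline run breaks the contract terms."""
    return (
        observed_quality < contract.min_quality_score
        or observed_hops > contract.max_hops_from_source
    )

print(violates(finance_contract, observed_quality=0.95, observed_hops=1))  # True
```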
[00:44:55] Unknown:
Are there any other aspects of the work that you're doing on DataOS and at the Modern Data Company that we didn't discuss yet that you'd like to cover before we close out the show?
[00:45:03] Unknown:
No. I think we covered a fair bit of, you know, what I would have wanted to talk about, and some of the ways that we are trying to simplify and help organizations get a handle on their data complexity. The thing that we only touched upon a little bit is that our vision is, instead of customers constantly thinking about technology, to start thinking capabilities. Elevate the conversation in the data space so customers start thinking about capabilities that they want to deploy quickly, versus thinking, you know, do I need Snowflake? Do I need Redshift? Do I need Azure? Do I need GCP? Things are always of a very technical nature when it comes to data conversations, and we are hoping the app store construct and some of the simplicity that we want to drive will allow our customers to productize their expertise and deliver it as capabilities which are repeatable and reusable, instead of building everything from scratch. So that's the vision.
[00:46:01] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:46:17] Unknown:
I'd say that handshake that I spoke about earlier. I mean, where, you know, the producer of the data, you know, somebody who's working on your Salesforce application, has no idea what that data will be used for. A developer that takes it and creates a data model and now adds a new column to that model has no idea what is breaking on the business side. I believe that is a driver for a lot of complexity today, where you have to repeat pipelines and create multiple copies of data. So that facilitation, I think, will really simplify things and give business users the trust and confidence to start using data to drive their outcomes. Of course, I mean, it's a very simple statement to make, but it needs a lot of tech behind the scenes with governance, access, you know, the things that we spoke about. So that, I believe, is, you know, where the gap is that I see. And if we can figure out that gap and provide a consistent contract, enforceable, auditable, I think we'll see a lot more, I'd say, agile innovation from a business perspective in using the data.
[00:47:20] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you and your team are doing on DataOS and the ways that you're working to simplify some of the complexity of being able to actually build and share and maintain these data applications. It is definitely a very large problem, so it's great to see people like you taking a stab at providing a tractable solution. So thank you again for your time, and I hope you enjoy the rest of your day. Thank you, Tobias, for inviting me. Great, great conversation.
[00:47:47] Unknown:
Thank you. Have a good rest of the day as well.
[00:47:55] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Background
Mission and Origin of the Modern Data Company
Challenges in Data Management and the Need for DataOS
Core Goals and Structure of DataOS
Integration and Composability of DataOS
Key Features and Principles of DataOS
Technical Architecture and Deployment
Workflow and Integration Process
Application and Sharing of Business Logic
Customer Use Cases and Unexpected Applications
Future Plans and AI Integration
Closing Thoughts and Contact Information