Summary
So much time and energy is spent on data integration because of how our applications are designed. By making the software the owner of the data that it generates, we have to go through the trouble of extracting the information before it can be used anywhere else. The team at Cinchy is working to bring about a new paradigm of software architecture that makes the data the central element. In this episode Dan DeMers, Cinchy’s CEO, explains how their concept of a "Dataware" platform eliminates the need for costly and error-prone integration processes and the benefits that it can provide for transactional and analytical application design. This is a fascinating and unconventional approach to working with data, so definitely give this a listen to expand your thinking about how to build your systems.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Dan DeMers about Cinchy, a dataware platform aiming to simplify the work of data integration by eliminating ETL/ELT
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Cinchy is and the story behind it?
- In your experience working in data and building complex enterprise-grade systems, what are the shortcomings and negative externalities of an ETL/ELT approach to data integration?
- How does a Dataware platform differ from a data lake or a data warehouse? What is it used for?
- What is Zero-Copy Integration? How does that work?
- Can you describe how customers start their Cinchy journey?
- What are the main use case patterns that you’re seeing with Dataware?
- Your platform offers unlimited users, including business users. What are some of the challenges that you face in building a user experience that doesn’t become overwhelming as an organization scales the number of data sources and processing flows?
- What are the most interesting, innovative, or unexpected ways that you have seen Cinchy used?
- When is Cinchy the wrong choice for a customer?
- Can you describe the technical architecture of the Cinchy platform?
- How do you establish connections/relationships among data from disparate sources?
- How do you manage schema evolution in source systems?
- What are some of the edge cases that users need to consider as they are designing and building those connections?
- What are some of the features or capabilities of Cinchy that you think are overlooked or under-utilized?
- How has your understanding of the problem space changed since you started working on Cinchy?
- How has the architecture and design of the system evolved to reflect that updated understanding?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cinchy?
- What do you have planned for the future of Cinchy?
Contact Info
- @dandemers on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you're looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Your host is Tobias Macey. And today, I'm interviewing Dan DeMers about Cinchy, a dataware platform aiming to simplify the work of data integration by eliminating ETL and ELT. So Dan, can you start by introducing yourself? Well, I'm Dan. I'm the CEO and cofounder at Cinchy, and happy to be here. And do you remember how you first got involved in data management? When I was fresh out of school, I started working at a consulting company delivering IT solutions for
[00:01:56] Unknown:
organizations all over the world and started building applications. And guess what? Every application generally manages data. So that was the beginning of it, right from the early days of my career. And I spent probably 15 years following that, working in financial services at some of the biggest banks in the world. I was at Citigroup for 11 years as an example. As you can imagine, big global banks tend to have a lot of data and build a lot of application functionality that needs to interact with that data. So I'd say, you know, anyone in the business of building applications is in the business of managing data. That's kind of the experience that led my cofounder and me to see the opportunity to create the Cinchy platform. And so can you describe a bit about what Cinchy is and some of the story behind how it got started, and what made you decide that this was an area worth focusing your time and energy on? Well, my cofounder and I have known each other for quite a while. We worked together at a bunch of big banks. And one of the realizations that we had as we were building different systems is that there's this integration tax that is this huge burden that slows down every technology project. And so every time you're building an application, even if you're buying an application, more than half of the cost and complexity is integrating data and enabling the reuse and sharing of data across applications. And everyone has been so hyper focused, especially recently, on AI and advanced analytics, with the strategy being, you know, let's get a centralized copy of data and try and put Humpty Dumpty back together again after he's already been broken. And we saw the opportunity to make it so that Humpty Dumpty doesn't get broken in the first place. Why is data even fragmented? And it all comes down to how we design our applications.
So we started to do some experimentation. This is probably, like, 7 years ago, and decided that there was a viable alternative, and started to create the platform. We left our full-time roles as senior technologists in some big banks and decided to do it for real. We've been building the platform ever since, and we launched in market just over 2 years ago. And since then, we've closed 2 funding rounds, so we've raised a good chunk of change thus far. We're 50-plus people and growing strong. We have over a 100 organizations that are benefiting from our platform, including some of our former employers in financial services, which includes some of the biggest banks in the world. We have no churn, so we're very proud of our progress to date, but we're really just getting started. It's a radical reimagining of how data is managed in applications and, quite frankly, the separation of data from code
[00:04:15] Unknown:
as they've been intertwined far too closely for close to 50 years now. And so in terms of that architectural aspect of data and its relation to code, I'm wondering if you can dig a bit more into some of the historical motivations for that linkage and some of the ways that you are working to
[00:04:33] Unknown:
decouple them and allow data to exist as a separate artifact from the software that produces and manages it? It's an interesting question, and there's a reason why it is the way it is. It's not because people are dumb and made a bunch of mistakes. This is the evolution of technology, and in many ways, it kinda mimics the evolution of species. But if you think of why digital data even exists in the first place, like, what were its origins, well, let's go back to the early beginnings of, you know, the digitization of business processes, the ability to build technology solutions that allow you to move off of paper-based processes.
Well, you had this thing called a program. This program was essentially the execution of instructions from the programmer, but often, the program would need to remember some information such that another component of the program or another program could access it, like a shared memory. So thus came the need to persist data and have it be stored across invocations, across programs, and the digitization of data really started as the memory, the persistent store for applications, meaning it was really built out in response to the application's evolution. It was a servant to the code. Those were its origins. And if you think of the early days of building, you know, the first applications, well, you're not even thinking of data silos. It's transformational. You're moving off of paper and turning it into a digital solution. It's amazing. Only later did we realize that, you know, the memory of the code can, in fact, be harvested, and I can get insights out of that beyond its use case of serving as the memory of code. And that's where the realization came of, well, wait a minute. We have the memory for this application over here and this other application over there. How do we connect it together? And that's where we started, you know, building datamarts and data warehouses and eventually evolving into data lakes and data lakehouses and data fabric and data mesh and all these attempts to try and create a clean, curated, and universally accessible interface to access that information.
But it's all been coming from the angle of, well, the data's already fragmented, so how do I pull it together either virtually or physically, without necessarily thinking about how you change the way that the applications work. Guess what? The apps aren't done. We're continuously building more and more apps. So that's the opportunity. That's really what we're doing: we're changing the applications such that they no longer need to stand up an application-specific data store. They don't create an application-specific data silo that essentially separates them from information management. And we're not talking about just information access, not just read access. We're talking about read, write, transactional capabilities, because your customer data isn't owned by a single application.
It is interacted with by many, in fact most, of your applications. Same with your employee data, right? Like, data cuts across. It doesn't wanna live in these artificial boundaries that we call applications or services. There's a ridiculous amount of applicability of any particular piece of data to any particular catalog of business capabilities.
[00:07:33] Unknown:
Given the idea that you're sharing here, that the data shouldn't be tied to an application, it should just be a shared pool of data that the applications interact with, there are a few different paradigms that that sounds similar to. One is the data lake that you mentioned, where I just put all of my data into one place so I'm able to do analysis and manipulation there, but it's largely going to be a, you know, write once, read many situation. Or maybe I have a large application database that I have, you know, multiple different schemas in, where the applications all have a, you know, a shared resource there, and we've moved more into the shared-nothing architecture of each application having its own database because of issues with scalability or, you know, concerns of different write and read patterns. And so I'm wondering if you can give a bit of an overview about some of the ways that what you're building at Cinchy is either distinct from or analogous to either of those situations?
[00:08:30] Unknown:
I guess the first thing is let's separate Cinchy from the pattern, because Cinchy is just a product. It's a platform that implements a pattern. The pattern is what I would describe as dataware, and it's not something that was a recent idea. It actually goes back to, like, the mid-eighties. The first reference we could find to it was a book by a professor in the US, Gordon Everest, that introduced this idea of dataware, where it was, at the time, a prediction of the future where data rises to become independent of applications, allowing regular humans to interact with the data without being constrained and impeded by application code. Today, you know, there's an app for that. There's an app for everything. And if there isn't, there will be. But those apps are code. The code needs a code path to enable you with a capability. Right? So every distinct requirement needs some type of coding behind it. So it ultimately limits. It's a huge barrier that separates you from your data. So that's really the idea of dataware: the humanization of data such that there's a universal interface. And it's not that dissimilar to if you just think of the power grid. So the power grid makes it so that you can build a device, you can build a building, and you don't have to worry about the complexities of generating and distributing power. You know, nuclear versus solar versus hydro, it's all very complicated.
How do you make it resilient? How do you make it scale? Well, prior to the power grid, you could put a power plant inside of every individual building, and that's, quite frankly, how it worked in the early days. Just imagine how inefficient and unscalable that ultimately is. So the power grid really separated those who consume and utilize the power from those who generate it, and provided that as a unified experience so that you're abstracted away from all those complexities, but it just works. Right? I can use it to power my phone. I can use it to power my air conditioner. I can use it to power anything. It's a standardized interface. Whereas if you think of data management today and all its complexities, the equivalent of, you know, geothermal versus solar versus nuclear is: is it a graph database? Is it a relational database? Is it a columnar database? Is it a document database? Is it optimized for read versus read-write? Like, can't this be unified through a common experience such that I can interact with the utility, and the utility may internally use those different technologies and capabilities, but abstract away that complexity for me? I just wanna plug in. That's the idea of dataware. So now coming back to your question of contrasting that to a data lake or other technologies. One of the things that always complicates this is everyone has different understandings of what these terms mean. And as far as I know, there's no single authoritative source where I can point you to the, you know, single source of truth for the definition of a data lake. But our interpretation is that a data lake is more of a strategy than it is a particular technology: taking data from your operational systems and putting it in a central location in raw form so that you can curate and organize it later. You don't have to have the curation be a bottleneck for the acquisition and storage of data, which is very different than a warehouse, where it's organized.
So a data lake, though, isn't organized. It isn't curated, unless you're turning that lake into a warehouse or using a lakehouse, which is kind of a hybrid of the two concepts. But regardless of that, whether it's raw data or curated, organized, and normalized data, you're not using it to build applications. That's the short story. So if you just picture for a second building an entirely new company that's starting from scratch, you don't get started by installing your data lake and/or your data warehouse and/or your data lakehouse and then run your business. You get started with applications that you either buy or build, and you integrate. The lake or lakehouse or warehouse or datamarts come in really to enable your reporting and analytic use cases, because they're working around the fact that the data is fragmented. The difference at absolute scale for a net-new company is that those constructs, you know, technology that puts Humpty Dumpty back together again after he's broken, are no longer required, because he was never broken in the first place. Which is different than companies that exist today that already have a broken Humpty Dumpty. They need to actually unwind that.
[00:12:15] Unknown:
In terms of the actual technical implementation of this strategy and concept, I'm wondering if you can give a bit of an overview about how you actually manage that at Cinchy and some of the architectural complexities that come into providing a unified interface to all of these different abstractions and approaches to data, whether it be, as you mentioned, a graph database or a relational database or a data lake or, you know, semi-structured versus unstructured, and just all of these different areas of complexity that have cropped up that people in the data space in particular have been trying to deal with and coalesce around for the past few decades?
[00:12:53] Unknown:
I think we have to split that into two topics. One is, how does the code that developers write interact with the data? Because, really, they're using this service to replace the need for an application-specific data store. So it needs to create a universal way of interfacing with the data, where the other side of it, the implementation, handles all those complexities, you know, optimizing the data storage and whatnot. So the idea here is that the applications that are already created work the way that they work, but the ones that you're creating no longer create an application-specific database; they interface with dataware via the dataware protocols. And in the case of Cinchy, as an example, we have a universal REST layer that allows you to interact with the data, whether you're requesting data or changing data or performing a transaction or creating a user-defined function. It will mimic and feel like elements of a relational database, but at the same time, while you can do a select statement using an old-fashioned, you know, ANSI-syntax join, you can also use dot notation to do what we call a joinless join. There's a bit of a unification of the language that uses ANSI SQL as its roots. But, again, that's just how Cinchy has implemented the universal access layer. That doesn't have to be how every dataware platform does it; you know, if you're creating your own, you could create your own standard interface. But the key is that the interface is complete, such that applications don't have to compromise. They don't have to sacrifice. They can do everything that they otherwise need to do, minus all of the complexity. So that's on the application build side. But then there's how you implement that on the other side, which is: I'm the dataware platform, and I'm receiving these requests. How do I actually architect and implement it? And that's where the implementation really needs to account for the different usage patterns. So as an example, relational databases are great because you can have referential integrity. You can have transactions.
But one of the things that is maybe both a strength and a weakness is that the schema tends to become very rigid and not very adaptive and flexible and changeable. What if you could implement it such that you have the benefits of doing entity resolution and referential integrity without the dependence upon the schema? So can the relationships between different records in different datasets be linked through pointers, similar to what a graph database would do, versus, you know, foreign keys, where there's a schema dependence and a data dependence? And the answer is yes. It absolutely can. And the internal implementation may use a bunch of different technologies, but, again, that's all transparent and abstracted away from the application developers who are writing code through this unified interface. So there's a bunch of different techniques for implementing it on the server side. Obviously, that's very complicated in terms of how that ultimately works, but I gave a bit of a clue in terms of the elements of graph and elements of relational, but there's obviously much more within that. So as an example, one of the capabilities of a dataware platform is that the data changes are all tracked. Think of it as like a git, not for your code, but for your data, such that you can run queries that are temporal in nature that continue to be workable even with schema evolution. Right? If I change my data model, I still wanna be able to run a query in the past. And, heck, maybe I wanna use the schema from the past and get the present data in the past schema. Right? So there's a whole bunch of other capabilities that are added on top of that that are really required to bring the dataware vision to life.
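To make that interface a little more concrete, here is a minimal sketch of what a query against a dataware-style REST layer might look like. The endpoint path, payload field names, and response shape are assumptions invented for this example, not Cinchy's documented API; the query just illustrates the dot-notation "joinless join" and the temporal, as-of style of read described above.

```python
# Hypothetical client for a dataware-style REST query layer. The URL,
# payload fields ("sql", "asOf"), and response shape are all assumptions
# made for illustration, not a documented Cinchy API.
import requests

BASE_URL = "https://dataware.example.com/api"  # placeholder host

# Dot notation traverses a link between datasets without an explicit JOIN,
# the "joinless join" idea described above.
query = """
SELECT [Name], [Orders].[Total]
FROM [Sales].[Customers]
WHERE [Orders].[Status] = 'Open'
"""

resp = requests.post(
    f"{BASE_URL}/query",
    json={
        "sql": query,
        # Temporal read: ask for the data as it existed at a point in time,
        # per the git-for-data idea above.
        "asOf": "2021-01-15T00:00:00Z",
    },
    headers={"Authorization": "Bearer <token>"},
)
resp.raise_for_status()
for row in resp.json()["rows"]:
    print(row)
```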
[00:16:15] Unknown:
Digging more into the actual schema elements and the interaction with the code, I'm wondering if you can talk a bit about some of the data modeling considerations that are introduced by this approach to interacting with the information that you're storing and accessing and mutating with the applications that you're writing?
[00:16:36] Unknown:
It's an interesting question, and that's where plasticity comes into play. One of the reasons why we separate application data stores, so that there's not a shared database, is: how do you coordinate on the schema? Right? Like, if you had 5 different development teams all building 5 different applications, but they all needed to interact with customer, well, who owns the definition of the customer? How do you govern that, and how do you protect that? And they all have their own different perspective. They're really running in different contexts. That being said, the customer is still the customer. Right? One context says it needs first name, last name, and middle name all as a single field, and the next needs them separated. And maybe another context needs to know the historical name changes, if your last name changed because you got married or something. Like, the context is always different, but that doesn't mean there needs to be separate copies of the data. So the traditional approach is every application has its own application-specific data model, and they send copies back and forth, and you basically have to translate it from your model into my model. And when I send you back my data, you have to translate from my model to your model, and it's all point to point. And don't forget, then you have to add in your warehouse and lake and all these other additional models above and beyond the application-specific models.
Whereas if you look at some ideas around data centricity, which is in many ways very similar to dataware in that it's separating data from applications, one nuance is the idea of creating a single vocabulary for the enterprise, or a single ontology, so that there's a single standard representation of customer and all the applications adhere to that. Well, the challenge with that is, in theory, that's great if you could get everyone to agree on the terms of things and the descriptions of things, but practically speaking, that's impossible. Just look at humanity. We haven't figured out a universal language. We still have multiple languages. And what have we done? Well, we built the ability to translate languages. And, hey, guess what? That's actually good enough. That works. Let languages not only be different, but let them even evolve independently from each other. And, in fact, that's actually important for the advancement of our broader species.
So it's the same idea with dataware: how do you make it so that the individual applications can have their own context, their own perspective, their own model, without the need to force everyone to align on a universal model? That's really where the idea of plasticity comes in, which is that the linkage between your code and data model can be application-specific, but in the realization of that, when you deploy it into an environment, you're not installing a data store with a data model and then moving copies of it with transformation. You're actually mapping the application-specific representation of information to the representation that's within the data product that is owned by the data owner. And that mapping is such that the linkage is established once. So, for example, application 1 has this concept of an employee, and it has first name, middle name, and last name, and application 2 has the concept of a worker, which includes full-time employees and contractors, but it just has a single field called full name. At the end of that, they can map to the same physical pieces of data without any duplication, so the first name never needs to be stored twice. It allows application 1 to model it in a way that makes sense to that development team in the context of that business problem, and application 2 to have a model that makes sense, again, to that application development team in that business context. But the data product, the rightful owner of that data, the custodian of that data for the organization, again, will model it in a way that makes sense to them. By creating the linkages between them, it enables each to evolve independently from the others. So as application 1 evolves its model, it's insulated from breaking changes, as are the other two dependencies.
And I don't think I explained that very well, but does that make sense, in terms of how the separation via models is what enables independence of schema evolution?
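That employee/worker example lends itself to a toy sketch. The following is a rough illustration of the mapping idea, with every name invented for the occasion; it is not Cinchy's internals. One record is stored once, and each application reads it through its own model, so the shared representation can evolve behind the mappings without breaking either application's code.

```python
# Toy illustration of "plasticity": one stored record, two application-
# specific models mapped onto it. All names here are invented.
from dataclasses import dataclass

# The data product: the single stored copy, owned by the data owner.
person = {
    "first_name": "Ada",
    "middle_name": "King",
    "last_name": "Lovelace",
    "worker_type": "employee",
}

@dataclass
class App1Employee:  # application 1's context: split name fields
    first_name: str
    middle_name: str
    last_name: str

@dataclass
class App2Worker:    # application 2's context: a single full-name field
    full_name: str

# Mappings from the shared record to each application's shape. If the data
# product's schema changes, only these mappings are re-pointed; neither
# application's code has to change.
def to_app1(p: dict) -> App1Employee:
    return App1Employee(p["first_name"], p["middle_name"], p["last_name"])

def to_app2(p: dict) -> App2Worker:
    return App2Worker(f"{p['first_name']} {p['last_name']}")

print(to_app1(person))  # App1Employee(first_name='Ada', ...)
print(to_app2(person))  # App2Worker(full_name='Ada Lovelace')
```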
[00:20:05] Unknown:
Yeah. That makes sense. And that also brings me to another question I have about things like access control across these different applications, where I've got another application that wants to have information about employees, but shouldn't be allowed access to some of those sensitive details. And so being able to manage those access control guards across applications, given that all of the data is shared, and some of the potential ramifications of mutation of data by these different applications, where, as everybody knows who has ever written software, the biggest problems happen in production because of weirdness in the data. And so being able to establish any sort of mutation guards, or understanding the ramifications of changing this piece of data in this application and how that's going to impact this other application that's expecting something else?
[00:21:01] Unknown:
So first of all, let's cover the control side of it. The traditional approach is one where your applications are using an application-specific data store, and your code has basically a service account that it's using to interface with that data, with, you know, unconstrained access to it. It's basically able to see all the data and change all the data, and it's the application code. The ultimate end user is then interfacing with the application, and the code is what constrains what I can see and what I can change. Right? But as soon as you share data across applications, well, now you have separate opportunities for inconsistency in those controls. Right? If application 1 and application 2 both need to interact with employees' salary data, as an example, who's to say that the controls are implemented in a consistent way across these applications? They could be developed by different vendors.
So the problem with application-specific controls is that they just don't scale, and that's how we build systems, which is why, in many ways, security and controls today in large complex organizations are more of an illusion than a reality. So the evolution of the separation of the data from the code actually now makes the proper solution here possible, where you can apply those protections not inside of individual apps, but inside of the data, such that it's almost context agnostic. Whether I can see the salary of my boss, can I see my own salary, can I change my own salary, can I change the salary of my employees, can I see the salary of my employees: if you think of those as rules, those are universal rules. It doesn't matter if I'm running a query in one reporting tool or if I'm using an application to do performance reviews or year-end compensation. Like, it doesn't matter. It's agnostic to the context.
That's the idea: the access controls should be centrally defined yet universally enforced, which is only possible by separating your data from the code and eliminating this whole idea of an application-specific data store. So not only is it controlling access to limit what data you can see, but what data you can change, what data requires approvals, you know, ensuring that all data changes are auditable and trackable and version controlled. Then, you know, all those controls can now be guaranteed and universally applied to all information across the enterprise, rather than, you know, only as strong as your weakest link. That's the control side of it. But the other topic that you touched on is the dependencies. Right? If I'm building an application and I'm interacting with a shared data utility, you know, what stops me from making changes that break other applications that are interacting with that very same data? And that's really where plasticity comes in, because the way I'm interfacing with that is via my application-specific model, which is different than how other applications are interfacing with that. And the nature of a dataware platform as really being the broker for all of this is what makes it possible that even something as complex as schema evolution will no longer break application code. Like, for example, if I'm building an application interacting with customer data and I'm creating a REST endpoint, that's obviously implemented through an application-specific data model that represents what the customer is. If someone goes in and changes the structure of that, maybe it's the data product owner or the business owner, whether they're changing data values or changing the metadata, like the actual structure itself, in the traditional approach, well, they're not gonna do that, because it's gonna break my application code. Whereas by having dataware be the broker in the middle, with the capabilities of plasticity, suddenly they can make those changes, and my code doesn't break. So what it enables is true federated development, where you don't have to worry about all these contracts, like file specifications and API specifications. These are all workarounds that exist because there was historically no plasticity. And just to put that into perspective, when we say plasticity, think of how your brain works. Right? The process of learning is your brain rewiring itself, often when you're sleeping, to break apart connections, to relink connections, to basically absorb your new life experiences into knowledge.
So it's kinda like it's evolving the schema in your mind. But while your body is able to evolve your schema, you don't wake up in the morning and your arm doesn't work, you have a massive headache, your toes don't wiggle anymore, and you can't see. Right? Meaning, the code doesn't break because the schema has evolved. It's a beautiful design. Imagine if that wasn't the case: you may actually be forced to choose between, you know, having random body parts start to fail because you're learning, or stopping learning, meaning stopping plasticity, making your brain rigid and non-adaptive, and then we'd all have the intellect of a newborn child. That's kind of like enterprise applications, if you think of how limiting it is. They don't have schema evolution at scale. You have these constraints. So how can they truly get to a level of real intelligence? It's not possible without plasticity.
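The salary rules in that answer make a good minimal example of data-layer controls. Here is a sketch of what "centrally defined, universally enforced" could look like in miniature; the function names and reporting structure are invented, and this says nothing about how Cinchy actually expresses its entitlements.

```python
# Sketch of access rules defined once at the data layer and reused by every
# context (BI tool, compensation app, ad-hoc query). Names are invented.

# Reporting line used by the rules: employee -> manager.
managers = {"dan": "alice", "bob": "alice"}

def can_read_salary(requester: str, subject: str) -> bool:
    # You may see your own salary, or the salary of someone you manage.
    return requester == subject or managers.get(subject) == requester

def can_write_salary(requester: str, subject: str) -> bool:
    # Only a manager may change a salary, and never their own.
    return managers.get(subject) == requester and requester != subject

# The same answers come back regardless of which application is asking,
# because the rule lives with the data rather than inside each app.
assert can_read_salary("dan", "dan")           # my own salary: yes
assert can_read_salary("alice", "dan")         # my report's salary: yes
assert not can_read_salary("bob", "dan")       # a peer's salary: no
assert not can_write_salary("alice", "alice")  # granting my own raise: no
```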
[00:25:37] Unknown:
And continuing on the subject of scale, I'm wondering how you handle the sort of exposure of complexity as more and more applications are being onboarded onto Cinchy and you're managing more data in that dataware platform: being able to expose those controls to the people who are knowledgeable about what controls are necessary and how they're applied as you're adding new applications that are introducing new sources of data, new models, and new formats, being able to identify at creation time whether or not a piece of information is sensitive or what the controls should be in those different business contexts, and just being able to manage the sort of growing complexity of the data independent of the application requirements?
[00:26:21] Unknown:
It's a very good question, actually, and it's even complicated to think through. Imagine a world where there are no controls inside of the data itself, and it's up to the individual applications and the individual contexts and developers and business units that are creating business capabilities on top of what is ultimately shared data. That is the epitome of extreme complexity. Now shift your thought process to: what if the controls were at the data layer, in a way that is context agnostic? Who can see this data, what is the classification of this data, and, by the way, this is my data. I'm the owner of this data. You know, this data is in my business domain and within a data product, and I am the data product owner. It's for me to decide, as the rightful owner of this data, where I'm granting permissions, not to applications, but to, basically, people. What can employees see? And maybe I need a rule that says you can only see the historical transactions of a customer if you're in the customer support team, and you can only see them if the customer has a live phone conversation open and they're connected to you. And as soon as that connection is ended, I want that access to be immediately revoked, meaning it's a data-driven entitlement. And regardless of whether I'm interacting through my ticketing system or I'm running a Tableau report or something, I want that to be just universally enforced.
You can actually get your head around that if you start to look at it from the data's perspective. Where your head explodes is when you think of all the different contexts, and if you had to reimplement the controls within each context. It takes a little while to realize that, but it's dramatically simpler at scale to think about controls in a universal way than it is in every individual context, because that's a large set of contexts in a complex organization.
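The live-call rule above is also worth sketching, since it shows an entitlement derived from data rather than from a role. Again, a toy model with invented names: access exists only while a matching row exists, so revocation is just a data change.

```python
# Sketch of a data-driven entitlement: a support agent can view a customer's
# transactions only while an open session row links them. Invented names.

open_sessions: set[tuple[str, str]] = set()  # (agent, customer) pairs

def can_view_transactions(agent: str, customer: str) -> bool:
    # Evaluated against current data on every request, in every context,
    # whether the caller is a ticketing system or a reporting tool.
    return (agent, customer) in open_sessions

open_sessions.add(("dana", "cust-42"))       # the customer calls in
assert can_view_transactions("dana", "cust-42")

open_sessions.discard(("dana", "cust-42"))   # the call ends
assert not can_view_transactions("dana", "cust-42")  # access is gone
```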
[00:28:05] Unknown:
Have you ever had to develop ad hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps platform that streamlines data access and security. Satori's DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift, and SQL Server, and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori, that's S-A-T-O-R-I, today and get a $5,000 credit for your next Satori subscription.
And then the other aspect of scale that a lot of people who are working in the data space are thinking about is scale of volume of data, or its overall sort of variety, and being able to process, you know, large chunks of data for analytical purposes or for building machine learning models on top of. And I'm wondering if you can discuss some of the technical aspects of how this sort of dataware approach, and Cinchy specifically, is able to manage these variances in terms of scale: the small to medium scale of the application data being able to interact with the information that resides in the platform, and then this, you know, large, continuously trained machine learning system that needs to be able to process data in bulk?
[00:29:30] Unknown:
There's an interesting realization that I had. It's from my experience actually working in some of these organizations. I was at Citigroup, as an example, for 11 years, and they had been in business for over 200 years, and they grew through mergers and acquisitions. They had over 10,000 applications. Any technology that has ever existed was deployed. And very smart people, but they hadn't been able to really rationalize everything, so lots of redundancy, lots of opportunities. And so, kind of thinking in that context as, like, an extreme scenario of complexity, what I realized is there's not as much data in the world as people actually think, just a heck of a lot of copies of it. Because if you think of 10,000 applications in a single global bank, well, how many of those applications need to know about a customer? Most of them. Not all of them, but most of them. How many of them need to know something about an employee?
Probably all, or at least the vast majority of them. Right? But I'm Dan. I'm an employee. I'm only one person. I have a bunch of attributes about me. You could normalize that and make it accessible across these 10,000 applications, where they have the ability to see or change data or perform, you know, behaviors on Dan: to hire me, to fire me, to give me a performance review, to give me the raise I deserve, or to fire me if I deserve that. Compare that with the amount of data that would need to be stored for each of these applications to work in an almost isolated way, where each is storing its own representation of Dan, but a slice of Dan, and it's going to do integration to move data back and forth. There's gonna be 10,000 copies of Dan.
Like, if you're an individual citizen and you deal with a big old bank, trust me, that bank has hundreds, thousands of copies of your name, of your identifying information, of your financial transactions. So if you're worried about scale, you know, based on the volume of data, then you definitely want to decouple your data from your applications so that you're not having to re-represent data over and over again in these, you know, transformed, materialized views that are specific to each and every individual application. You can dramatically reduce the amount of data that you're interacting with by really normalizing it. And, again, not in a way that forces a single schema, but a way that forces the data to be stored once and modeled many ways and used many ways, with the permissions at the data layer. So that's the first key to scaling: this actually results in less data, not more data. So if you think of today, technology is able to support the data that we have today; well, can that technology be expanded to support less data?
Yes. I think that's the first part of it. And then the other is, well, there's different usage patterns. Right? In some cases, I'm doing simple reads. In other cases, I'm doing complex aggregations. In other cases, I'm invoking, you know, complex models. And the separation of the data from the code and the implementation of data as a service via dataware now allows that dataware to understand the context within which it's being operated, such that if it needs to materialize a view dynamically, it can do so. If it's invoking a model, whether it's in a training context or in an operationalization context, it can do so synchronously or asynchronously. And it has a very unique perspective on the data, such that it can optimize those behaviors and follow its own code paths based on the usage patterns, to enable it to be adapted to that evolution.
[00:32:46] Unknown:
And in terms of your overall understanding and approach to the problem, and some of the ideas that you had going into your initial explorations of this problem space, I'm wondering what were some of the assumptions that you had at the outset, and some of the ideas that you had about its applicability and its usage, that have been challenged or updated as you have built out the technology and worked with customers and understood more about how all of this is being used and applied?
[00:33:15] Unknown:
I think for me, one of the biggest challenges, even just mentally, was to get my head around... you know, if you're building a new company, you could build it in a new way using dataware, where there are no silos, there is no fragmentation, there is no need for all these, you know, technologies to try and put Humpty Dumpty back together again. That's, in a nutshell, how dataware is different from any other technology pattern that has ever been conceived till now. But what does that mean to a large complex organization? Right? Because if you take a big global bank, just as an example, they have all these systems. They're not gonna rebuild them. So how do you actually enable that technological paradigm to add benefit to organizations that have all this existing complexity? Right? It's one thing to stop complexity from ever being created, but what if I already have it? How do I unwind it? So that's something that I've had to get my head around, and what I've come to realize is it's not a technology problem. It's more of a methodology problem and an approach. Right? So take a big enterprise organization and realize that, you know, while they may have 10,000 applications, over the next year, they're gonna add another 10 or another 100. And, yeah, they may shut down a bunch of others, but they're adding, meaning it's not like the apps are finite and done. You know? They clap their hands together, give each other high fives. We're done building all of our applications. We never need to build them again. All we need to do is create a copy of data in a central location. Mission accomplished. We've conquered the world. No. It's continuous change. Right? They're always changing. They're always shutting things down. They're consolidating. They're writing new capabilities, entering new markets, launching new products. It's just build, build, build, build, build. That's what they're doing. They have to outbuild each other. So with that being said, you have your legacy, but then you're building. So applying dataware to what I'm building allows me to make it so that, you know, everything that I build is such that I'm not introducing new silos. I'm not introducing new redundancies.
And, well, what about my existing data? Because my new systems need to interact with data from my existing systems. So that's where I had the realization that while our vision, and the vision of dataware, is the elimination of integration, practically speaking, that's gonna be a phased approach. That's not an instant fix. You don't install dataware and suddenly have no integration instantly. You stop introducing new integrations. And as you connect data in and out of your existing systems, you're doing it now in a way that enables you to stop having to do it the next time you would otherwise have needed to. So it's what we call last-copy integration. So the difference between zero-copy integration for new information, for new applications, versus last-copy integration for existing information from existing systems is one of the biggest realizations that I've had. And, again, that's not a technology thing. The technology doesn't change. It's just how you think about it and how you actually utilize it. But, you know, once we figured that out, that's where it became a little bit more clear.
[00:35:55] Unknown:
And so for organizations that already have all of these existing systems and data sources and analytical workloads, what's the process for actually integrating Cinchy and this dataware approach into the organization, starting with being able to load data into the platform and then share it back out?
[00:36:08] Unknown:
The key is that you don't wanna try and eat the entire elephant in a single bite. So you wanna do it really project by project. And the projects are the projects that you would have done without dataware, right? So you have to look at every project as a little slice of your overall fabric of knowledge that you build out iteratively as you go. So if your first project needs a little bit of customer data, maybe it's creating some transactional data, you're gonna create the model. You're gonna do the last-copy integrations. You're going to essentially apply dataware to the realization of that business capability, knowing that your next project can then pick up where that one left off.
And in some cases, it can use information, whether it's a model or actual data or capabilities, exactly as is. In other cases, it may need to refactor it. In other cases, it may need to adapt it to a slightly different context. The idea is that every project continues to contribute to your central knowledge repository, versus having its own copy of the knowledge repository that kind of selfishly reflects the application's own context, if that makes sense. So it is a project-by-project approach, with the realization that you're never finished; you're continuously changing. Businesses are launching new products. They're entering new markets. There's continuous change, so your fabric just doesn't stop evolving. It needs to continue to evolve. In fact, the faster it can evolve, the smarter you are. The fundamental thing is to align to your projects, versus changing your projects to align to the technology. You should never do that. Just look at every data lake project.
[00:37:48] Unknown:
Another interesting element of this is the idea of data collaboration, where what we've been talking about is collaboration within the organization, but what are the opportunities for being able to collaborate between organizations, given that you have all of your data in this unified platform, and just some of the opportunities there as far as being able to use data across organizational boundaries? But then also, in terms of the kind of governance and control aspect for people who are using Cinchy and this dataware platform, just understanding what the sort of residency of the data is, where some people don't want to have all of their data living in a software-as-a-service platform. They need to have it living on their own hardware, and just some of those sort of tensions that exist in both directions.
[00:38:36] Unknown:
Well, I think a good role model for that is just to imagine how the World Wide Web works. I can create an HTML page. I can put it on a web server. That web server can be under my desk. It can be on Amazon. It can be anywhere. And my web page can link to a document that you own, and you can choose where to put your document. You could put it on GCP. You could put it on a computer under your desk. You can run a little mini web server on your Android device. It's your choice, but I can simply create a hyperlink. I don't have to store a copy of your document inside of mine. Right? So that's just for the interconnected web of documents that we call the World Wide Web, but now just apply that to data, and it's the very same idea. Right? So that's what the concept of dataware is doing, initially within an organization: enabling this model where you're basically linking. You're not copying. It's pointers.
It's access, it's not copies. And quite frankly, if that works between two different business units within a single enterprise organization, or two different technology teams, or two different applications, to enable this cross-unit data collaboration, it could easily be extended to go beyond the boundaries of a single organization, with the very same benefits. So imagine a corporation that owns particular data, that has its own preference in terms of what data to store and physically where to store it and how to model it and how to represent it, who is now able to grant access to individuals and/or components that are outside of their application without having to share copies of that data, where that access can then be revoked in the future, allowing you to retain control. Because you have to picture a world where data is kinda like money, and there's a good reason why you can't copy money. There are technological constraints, and it's even illegal. That's the inevitable future of data: it's access, it's not copies. That's one of the secret sauces of the idea of dataware.
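As a rough analogy to that hyperlink model, here is a toy sketch of "access, not copies": the owner keeps the only copy and hands out a revocable reference, so revoking the grant takes effect everywhere at once. Everything here is invented for illustration; it is the concept in miniature, not a real protocol.

```python
# Toy model of access-by-reference across organizational boundaries.
# All names are invented; this is the concept, not a real protocol.

class DataProduct:
    def __init__(self, record: dict):
        self._record = record            # the single stored copy
        self._grants: set[str] = set()   # parties allowed to dereference

    def grant(self, party: str) -> str:
        self._grants.add(party)
        return "ref://customer/42"       # a pointer, never the data itself

    def revoke(self, party: str) -> None:
        self._grants.discard(party)      # the link goes dark everywhere

    def resolve(self, party: str) -> dict:
        if party not in self._grants:
            raise PermissionError(f"{party} no longer has access")
        return self._record

owner = DataProduct({"name": "Acme Ltd", "tier": "gold"})
ref = owner.grant("partner-org")         # share a reference, not a copy
print(owner.resolve("partner-org"))      # the partner reads through the grant
owner.revoke("partner-org")              # the owner retains control
# owner.resolve("partner-org")           # would now raise PermissionError
```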
[00:40:18] Unknown:
And so in terms of your experience of building the platform and seeing the ways that people are adopting this approach and the technological underpinnings, what are some of the most interesting or unexpected or innovative ways that you've seen it used?
[00:40:33] Unknown:
There's lots of examples, and it's actually been a fascinating journey just watching customers, you know, begin by applying it to their traditional projects, and later realizing that suddenly things become possible that were previously impossible, in many ways thanks to the power of inference and intelligence, because a connected fabric of knowledge means that individual applications are now suddenly intelligent, right, because they have access to the collaborative knowledge base. So the ability to build really, really smart applications, in addition to the fact that it's, you know, a lot less time and a lot less technical complexity, is amazing. So I've seen some really, really innovative things that our customers have been able to do.
And, like, I'll just give one example from the early days of COVID, if we go back to about a year ago. One of our customers was already using dataware to transform their organization, and they provide services to many credit unions. And the, you know, federal governments in most, if not all, countries were making emergency funding available to small businesses to keep them afloat in the early days of the lockdown. But how do you make those funds available? Well, you leverage the financial system, organizations like financial institutions, including credit unions. But credit unions are interesting in particular because they have smaller IT teams, and they don't have, you know, multibillion dollar technology budgets. They can't just spin up new digital experiences to launch new products in a couple of days. But this one organization was using dataware, which actually gave them that capability, where they realized that cross-company collaboration could also be applied.
So long story short, they literally had an idea on a Monday, and on that same Friday, they were live with a white-labeled digital loan solution that was provided as a service to, I think, an initial batch of around 10 credit unions, who were now making available to their end members, small businesses that were in desperate need of these funds, the ability to basically manage the whole end-to-end journey of applying for and receiving and eventually paying back those funds. It started with a small batch, but then it grew to north of, like, a 100 credit unions in a very short amount of time. And that was an example of cross-company collaboration, because these credit unions were all connected via this information network.
And the application being decoupled from the data is what, you know, individual credit union members were interacting with. Little did they know that the data wasn't copied all over the place, that it wasn't integrated. That's what enabled the acceleration. It wouldn't have been possible with conventional technology. So that's just one example where the elimination of integration, the avoidance of duplication, the implementation of data-layer controls, like, all these things enable outcomes that are, quite frankly, magical. They're hard to really anticipate and imagine, but I feel like we're just getting started with a whole new wave of innovations based on this enabling technology.
[00:43:19] Unknown:
In your own experience of building this company and the technical platform and helping companies realize the potential for it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:43:31] Unknown:
For me, selling technology in a new paradigm as kind of an early adopter, and appealing to other early adopters, has been both challenging and exciting. I actually like it. I like to now pretend I'm some type of psychologist, even though I have no idea what I'm talking about. But you have to really spend a lot of time thinking about how people think about the words that you're saying and not saying, and what you're showing and not showing, and so on and so forth. And I think that's not true just uniquely of Cinchy, or, I mean, of dataware. It's really anytime there's something that's actually new, something that is transformational. Right? And not every new technology is truly transformational. Right? Sadly, most of it is, you know, iterative. It's a red mousetrap instead of a blue one, but the first mousetrap had to come to be. And it's that first mousetrap that's often the hardest one to get to market. So for me, that's been the biggest challenge, but at the same time, it's also what makes it the most exciting.
[00:44:20] Unknown:
For people who are interested in the possibility and promise of this dataware approach, and being able to have all of their data live in a collective pool for use by these different applications, what are some of the cases where Cinchy might be the wrong choice?
[00:44:38] Unknown:
culture. So if you're trying to transform your organization, and you're actually trying versus just saying those words, then you're going to love the introduction of a new paradigm, and you're gonna embrace the fact that it's an organizational change, because it's bigger than just a technology shift. Right? It changes how you run your business and how you operate your company, and there's a much wider impact. But if you're more in a position where you're trying to maintain the status quo and not introduce, you know, any change risk, then that's where you're gonna want to stay clear of transformational technologies. And what I've found from my personal experience is, you know, an organization will have a way of thinking about that, but so will individuals within that organization.
You may have change agents in a company and then those who are pushing back against change. I think the main factor is the desire and willingness to transform and change. Beyond that, you know, if you look at individual business problems and use cases, like any technology, Dataware doesn't solve for every single thing that you could ever possibly imagine. Right? If you need to scan PDF documents and turn those PDF documents into structured data, Dataware can be used on the receiving end of that, but it's not gonna be the one that scans the documents and extracts the structured data out of them, meaning there's still lots of opportunities for apps to add value, whether it's recording a video or parsing an unstructured data file. So it's really anytime you're managing information that is structured in nature, that's where Dataware comes in. And if what you're doing is inherently unstructured,
[00:46:04] Unknown:
then it's probably not a good fit. Yeah. That was gonna be one of my other questions: what are the formats and structures of data that you're working with? Whether it's primarily this sort of semi-structured or structured data, or if you're also able to deal with things like binary formats, video, or, you know, scientific formats, things like that? It is more on the structured side. That being said, you can always take unstructured data and store it in a structure,
[00:46:32] Unknown:
meaning I could take a video file or an image, and I can put it in a dataset where I can track changes so that I can version control it. I can add calculated columns. I can derive intelligence out of that unstructured or semi-structured file. For the most part, if you're using Dataware to store or manage unstructured data, it's because you then intend on extracting the structure out of it and turning it into intelligence that's actionable.
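To make that pattern concrete, here's a minimal Python sketch of holding an unstructured payload inside a structured, version-controlled record with calculated columns. This is only an illustration of the idea described above, not Cinchy's actual API; the class names and derived fields are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class FileVersion:
    """One immutable version of an unstructured payload (e.g., a video file)."""
    content: bytes
    saved_at: datetime
    # "Calculated columns": structure derived from the unstructured bytes.
    sha256: str = field(init=False)
    size_bytes: int = field(init=False)

    def __post_init__(self) -> None:
        self.sha256 = hashlib.sha256(self.content).hexdigest()
        self.size_bytes = len(self.content)


class VersionedDataset:
    """Keeps the full change history of each named file for audit or rollback."""

    def __init__(self) -> None:
        self._history: dict[str, list[FileVersion]] = {}

    def put(self, name: str, content: bytes) -> FileVersion:
        version = FileVersion(content, datetime.now(timezone.utc))
        self._history.setdefault(name, []).append(version)
        return version

    def latest(self, name: str) -> FileVersion:
        return self._history[name][-1]


ds = VersionedDataset()
ds.put("intro.mp4", b"\x00fake video bytes")
ds.put("intro.mp4", b"\x00fake video bytes, re-encoded")
print(ds.latest("intro.mp4").sha256[:12])  # derived column on the newest version
```

Every write appends a new version rather than overwriting, which is what makes the change tracking and rollback described above possible.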
[00:46:57] Unknown:
So as you continue to build and iterate on the platform and the business of Cinchy, what are some of the things that you have planned for the near to medium term, or any projects that you're particularly excited for? Yeah. One of the things that I'm really excited about is the next evolution of our plasticity engine. So today,
[00:47:14] Unknown:
we focus largely on structural plasticity, which is enabling the evolution of schema without breaking code, both at the data layer as well as the metadata layer. If you think of the neuroplasticity of the brain, that's kind of mimicking the structural plasticity elements of it, which, again, is a key requirement for learning. But your brain has a whole other side to it, which is functional plasticity: for example, if you suffer brain damage, your brain will reorganize itself to recover as best as possible and minimize the loss of information. And so that's something that we're actually in the early stages of iterating on, the addition of functional plasticity. So imagine data being resilient in new ways, such as, you know, physically moving data based on evolving usage patterns and really just implementing anticipatory measures to ensure kind of an unprecedented level of availability, reliability, and performance in how you interface and interact with data. That's something that I'm pretty excited about.
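As a rough illustration of the structural plasticity idea, evolving a schema underneath code that keeps working, here's a hedged Python sketch using plain SQLite. Cinchy's actual engine isn't described at this level in the conversation; this only shows one common indirection pattern (consumers read a stable view while the physical tables change), with made-up table and column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE member_v1 (id INTEGER PRIMARY KEY, full_name TEXT);
INSERT INTO member_v1 VALUES (1, 'Ada Lovelace');
-- Consumers query the stable 'member' view, never the physical table.
CREATE VIEW member AS SELECT id, full_name FROM member_v1;
""")
print(conn.execute("SELECT full_name FROM member").fetchall())

# The schema evolves: full_name is split into first/last name columns.
conn.executescript("""
CREATE TABLE member_v2 (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
INSERT INTO member_v2
SELECT id,
       substr(full_name, 1, instr(full_name, ' ') - 1),
       substr(full_name, instr(full_name, ' ') + 1)
FROM member_v1;
DROP VIEW member;
-- The view is rebuilt on the new structure, so old queries keep working.
CREATE VIEW member AS
SELECT id, first_name || ' ' || last_name AS full_name FROM member_v2;
""")
print(conn.execute("SELECT full_name FROM member").fetchall())  # same result
```

The same query runs before and after the change, which is the "schema evolves without breaking code" property in miniature.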
That is pretty cool. Are there any other aspects of this Dataware paradigm or what you're building at Cinchy that we didn't discuss yet that you'd like to cover before we close out the show?

The only thing that I'd wanna add is that there's a whole other side to Cinchy, which is what we bootstrapped in what we call the Data Collaboration Alliance. And while we bootstrapped it, it's not about Cinchy. We're trying to create a collaborative of like-minded individuals and organizations around the need to move the world away from a copy-based integration approach to an access-based collaboration approach, which, quite frankly, is the only way that we'll ever get back to restoring data autonomy, both for people and for companies. That's actually the underlying thesis of the company: data should be treated like money and intellectual property and humans. It shouldn't be copyable. So we have that alliance, and through it, we're working on standards that you'll start to hear more about over the months and years to come around zero-copy integration. And if that's something that is of interest to anyone listening to this, definitely check it out at datacollaboration.org. We are building this in a very collaborative way. So if you're building technology that kind of supports this vision or are interested in participating in the establishment of these standards, check out the site. There are ways that you can join in, and we can work together. But, again, that's the underlying thesis of the Cinchy company that we sell kind of into the enterprise.
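To give a feel for the distinction between copy-based integration and access-based collaboration, here's a tiny, hypothetical Python sketch: several participants share one governed dataset, and each sees only the slice it's entitled to, so no extracts ever leave the data layer. This is just the concept in miniature, not anything from the zero-copy standards the alliance is drafting; all names are invented.

```python
import sqlite3

# One shared, governed dataset instead of a copy per application.
network = sqlite3.connect(":memory:")
network.executescript("""
CREATE TABLE loans (id INTEGER PRIMARY KEY, credit_union TEXT, amount REAL);
INSERT INTO loans VALUES (1, 'cu_alpha', 25000), (2, 'cu_beta', 40000);
""")


def grant_view(participant: str) -> str:
    """Access-based control: the participant gets a scoped view, not an extract."""
    view = f"loans_{participant}"
    network.execute(
        f"CREATE VIEW {view} AS SELECT id, amount FROM loans "
        f"WHERE credit_union = '{participant}'"
    )
    return view


alpha = grant_view("cu_alpha")
# cu_alpha queries the shared data in place; nothing was copied or synced.
print(network.execute(f"SELECT * FROM {alpha}").fetchall())  # [(1, 25000.0)]
```

Revoking access then becomes a metadata change (drop the view) rather than a hunt for every downstream copy of the data.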
[00:49:32] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And with that, I'll move us into the final question, where I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think the biggest gap, to be honest with you, is the lack of Dataware, where every time you buy and build applications, you're standing up an application-specific data store or falling back to the app trap and doing this wacky thing called integration.
[00:50:00] Unknown:
And everyone's been, I think, blindsided a little bit and very excited about analytics and AI, and it's all cool, but we have to fix it so that Humpty Dumpty doesn't get broken in the first place. So we've gotta fix the cause, not the symptoms.
[00:50:12] Unknown:
So I think that's the biggest gap, and that's the biggest opportunity that we're excited about. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Cinchy and the overall underpinnings of the conceptual structure of how to approach data management and data integration, as it were. I appreciate all of the time and energy you're putting into that, and I hope you enjoy the rest of your day. Thank you. It was a pleasure.

Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: Dan DeMers
The Origin and Vision of Cinchy
Decoupling Data from Applications
Dataware vs. Data Lake and Data Warehouse
Technical Implementation of Dataware
Data Modeling and Plasticity
Access Control and Mutation Guards
Managing Complexity at Scale
Handling Large Data Volumes and Machine Learning
Adapting Dataware to Existing Complex Systems
Integrating Cinchy into Organizations
Cross-Organizational Data Collaboration
Real-World Applications and Success Stories
Future Developments and Exciting Projects
Closing Remarks and Contact Information