Summary
The flexibility of software-oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world's first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
- Your host is Tobias Macey and today I’m interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today’s data-rich environment
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Coalesce and the story behind it?
- What are the core problems that you are focused on solving with Coalesce?
- The platform appears to be fairly opinionated in the workflow. What are the design principles and philosophies that you have embedded into the user experience?
- Can you describe how Coalesce is implemented?
- What are the pitfalls in data architecture patterns that you commonly see organizations fall prey to?
- How do the pre-built transformation templates in Coalesce help to guide users in a more maintainable direction?
- The platform is currently tied to Snowflake as the underlying engine. How much effort will it be to expand your integrations and the scope of Coalesce?
- What are the most interesting, innovative, or unexpected ways that you have seen Coalesce used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Coalesce?
- When is Coalesce the wrong choice?
- What do you have planned for the future of Coalesce?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Coalesce
- Data Warehouse Toolkit
- Wherescape
- dbt
- Type 2 Dimensions
- Firebase
- Kubernetes
- Star Schema
- Data Vault
- Data Mesh
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today's data-rich environment. So, Satish, can you start by introducing yourself? Thank you for having me on this. My name is Satish Jayanthi. I'm one of the cofounders of Coalesce,
[00:02:15] Unknown:
and I currently play the chief technology officer role in the company.
[00:02:19] Unknown:
And do you remember how you first got started working in the area of data?
[00:02:23] Unknown:
Yes. Absolutely. I started my career as, you know, an application programmer, dabbled with that, and soon became a DBA, a database administrator, by accident. I was responsible for running the, you know, database servers on a regular basis to make sure the business was running smoothly. This was for an online e-learning platform startup in Los Angeles. And because it was a startup, I was kind of playing many, many roles, as you can expect in a startup. And one of the things that I was doing was writing and providing insights, like writing queries and generating reports for the business, as part of my DBA role as well.
And it got to a point where it was just not sustainable, the amount of requests that I was getting and the amount of work that I had to do to put something together and give it to the business. These were, like, some basic questions like, hey, how many people are, you know, using this particular course? Or what are the top 10 courses? Things like that. And I questioned myself. Like, there must be a better way to do this. And that's when I had my first encounter with the concept of data warehousing. So I picked up Ralph Kimball's Data Warehouse Toolkit and read it many, many times.
It was very interesting. And then I implemented my first data mart. That was, like, a big light bulb for me at that time. And that's how I got into it, and I continued to build a lot of data warehouses and data marts, and eventually also managed some groups of, you know, data professionals and so on for several companies.
[00:04:08] Unknown:
And so that brings us now to where you are today at Coalesce. I'm wondering if you can share a bit about what it is that you're building there and some of the story behind how it came to be and why this is the problem space that you wanted to spend your time and energy on. Yeah. Absolutely. So
[00:04:24] Unknown:
in my, you know, several years of data warehousing, data mart, and data analytics experience, the main challenge for me was always data transformations. It was pretty clear that we were spending a lot of time to take the raw data and change it to a form that is useful and that can be consumed for decision making. So when I was leading a group in a financial firm, we were building data warehouses. We had all the tools; we were acquiring companies, so it was growing really fast. And we had a big ETL team and pretty much any tool that you can think of.
But we were still unable to keep up with the demand. And then at that time, I came across this concept from a company called Wherescape. That was my first encounter with data warehouse automation. The whole idea is there are so many patterns in data warehousing and so many mundane tasks. You know, how can you automate those things in a way that makes the engineers very productive? And it's not just one thing, but it's those, you know, opportunities wherever you can automate, from one end to the other, the entire data warehouse life cycle. And the aggregation of those automations collectively will make you more productive.
So that was the concept. And I was hooked on that, and, you know, I implemented it and saw really, like, a lot of benefit from that. When that company got acquired, I moved on, but that concept was what stuck in my mind. That was a legacy product, so we took that concept and built it for the modern data stack. That's how we got here. My cofounder and I, we worked for that company implementing large data warehouses for large companies with great results. There were a lot of drawbacks in their solution, and we saw that. And we found an opportunity to modernize and build it for the modern stack. But the core idea was still automation, automating data transformation, which we think is still not automated. There are a lot of other areas, like the database platforms now; Snowflake has automated that. And data acquisition,
like, if you look at Fivetran, it's doing a great job there. However, when it comes to data transformation, it's still ripe for automation, I would say. And
[00:06:56] Unknown:
the most direct comparison that comes to mind right now is obviously the work that the folks at dbt are doing. And I'm curious if you can speak to some of the overlap, and maybe the potential coexistence, of what you're building at Coalesce as compared to where the focus of dbt is.
[00:07:13] Unknown:
So what we have seen in the last, you know, few years: if you go back several years, you'll see that at the beginning, people were just hand coding. And that's a lot of work. And then they said, okay, let's do something graphical. Then the ETL tools were born. The ETL tools were graphical, GUI-based tools. They would give you a lot of efficiency. Pretty much anyone with some training could use them. So it's all widget-based, drag-and-drop data pipeline development. However, the problem was, you know, when you go out of its boundaries and you have this special use case, then you have to resort to leaving the tool, going out, and kind of doing something like a stored procedure in the database itself.
So that was a limitation. They were pretty inflexible. So what happened is, you know, the whole industry kind of took a 180-degree turn and went to everything as code. And that's what dbt is. You know, everything is code. Now it gives you a lot of flexibility. Of course, code is the most flexible thing. You can write anything you want. However, the cost of that is you lose the efficiency. That's how we see it. Now with the everything-as-code paradigm, you need, you know, highly skilled people in the organization, especially in large organizations. It's gonna be hard to have that many, you know, highly skilled data engineers given that it's so hard to find engineers these days. And on top of that, because you don't have efficiency, you're gonna be coding a lot and still not be able to keep up with the demands.
So what we think is needed is a solution that has the best of both worlds, and that's what we are. We are the solution where, you know, you can do 80% of the work in a GUI, because it does give you a lot of productivity. There are a lot of patterns that can be automated. There is no reason why I should be coding the same thing over and over.
[00:09:25] Unknown:
And when it comes to corner cases, that's when I'll focus on the coding aspects of it. So that's how you get the results that you need on time. To your point about the first generation of ETL tools and the drag-and-drop workflow builders: when you do hit the edges, you're kind of left to your own devices, and you have to figure out, how do I build some additional component that I can somehow jam into this GUI builder and get them to work together? And I'm curious if you can talk to some of the escape hatches that you've built into Coalesce for being able to move from that initial process of, here's the rough workflow, this is 80% of what I need, but now I actually need to dig in and customize this to fit my specific use case, and being able to have that be an affordance in the system rather than something that you have to fight against the system to achieve?
[00:10:17] Unknown:
One of the things, again, you know, we wanted to build it in a way that we can provide 80% of the solution out of the box, easy to use by anybody. What that means is we kind of guide, you know, the user in a certain direction in building a pipeline. There is a certain flow to it. Basically, we call it the graph; everybody calls whatever you're building as a pipeline a graph. Each one is a node. And these nodes have certain configurations and certain behavior. Right? How to create or how to materialize a particular object on Snowflake, or how do you load that object? If it's a table, how do you load it? The logic to load it, the DML, basically.
What we have done is we have built these components as Lego blocks, at a very granular level. So you can assemble these things on your own to build a different kind of node type that fits a certain pattern. And as an architect, you can build these user-defined nodes and kind of, you know, meet or address those edge cases. So when people start off, they start off with a whole bunch of nodes that are available. For example, Type 2 dimensions. And out of the box, you don't have to think about it. You just go use it. But if you say, I don't want this to behave this way, I wanna make some changes. Like, maybe I don't want to use surrogate keys. I wanna use hash keys.
Then you go behind the scenes. You go into that node type. You make some minor adjustments. You have a new node type. Everything is like a Lego block that you can control and configure and work with. So that's the idea here.
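As a rough illustration of the pattern being described, here is the kind of Snowflake DML a Type 2 dimension node might emit when configured to compare hash keys instead of individual columns. This is a hypothetical sketch, not Coalesce's actual generated code; the table, columns, and hashing scheme are all invented.

```sql
-- Hypothetical sketch of Type 2 (slowly changing dimension) DML using a
-- hash key for change detection. Object names are invented for illustration.

-- Close out the current version of any row whose tracked attributes changed.
UPDATE dim_customer
SET    effective_to = CURRENT_TIMESTAMP(), is_current = FALSE
FROM   stg_customer s
WHERE  dim_customer.customer_id = s.customer_id
  AND  dim_customer.is_current = TRUE
  AND  dim_customer.attr_hash <> MD5(s.name || '|' || s.segment);  -- one hash compare instead of column-by-column

-- Insert a new current version for changed or brand-new customers.
INSERT INTO dim_customer (customer_id, name, segment, attr_hash,
                          effective_from, effective_to, is_current)
SELECT s.customer_id, s.name, s.segment,
       MD5(s.name || '|' || s.segment),
       CURRENT_TIMESTAMP(), NULL, TRUE
FROM   stg_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE  d.customer_id IS NULL
   OR  d.attr_hash <> MD5(s.name || '|' || s.segment);
```

The hash comparison is what makes the node easy to reconfigure: swapping the change-detection strategy only means regenerating the `attr_hash` expression, not rewriting the merge logic.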
[00:11:58] Unknown:
One of the challenges with warehouses has often been SQL and the fact that it is very declarative and flexible, but not always very composable. And I'm curious how you have approached that challenge of being able to encapsulate these nodes so that the handoff between them is as pluggable as you want it to be, so that you can combine them into these workflows without having to worry about how the underlying SQL is actually going to mesh together and what the sort of contract is between these different stages of the workflows.
[00:12:31] Unknown:
The way that it works right now is, I mean, you can build as many stages as you want in the pipeline. You can have a raw layer, which is basically the raw data that's coming in. And then you can build a CDC layer, for example, to capture the deltas of what is being, you know, loaded by a data ingestion system. You can have a staging layer, which we can materialize as views or tables. But it's all happening in Snowflake. The data is moving from one layer to the other in Snowflake, whether it's views or a set of tables. But today, it's all SQL. Now you can change that, because we are giving you templates.
There is no reason why you can't generate a different type of code other than SQL in the tool. You know, you have full control to override the template and generate, for example, some other language, as long as Snowflake has the native capabilities to do so. And we're seeing more and more of that, where Snowflake is supporting all these other paradigms in the platform. So the handshake or the flow can be pretty much customizable in the way that you want. Today, we have only SQL support. So it has to go table to table, or table to view, however SQL functions. But I can see that the handshake could change down the road depending on what Snowflake supports.
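To make the layering concrete, here is a minimal sketch of one hop in such a pipeline in plain Snowflake SQL, with a staging view materialized over a raw table. The object names are invented for illustration.

```sql
-- Hypothetical sketch of the layered flow described above, all inside
-- Snowflake. Object names are invented.

-- Raw layer: data landed by the ingestion tool (e.g., RAW.ORDERS).

-- Staging layer materialized as a view over the raw table.
CREATE OR REPLACE VIEW staging.stg_orders AS
SELECT order_id,
       customer_id,
       TRY_TO_DATE(order_date)  AS order_date,  -- light cleanup, assuming dates land as strings
       amount::NUMBER(12, 2)    AS amount
FROM   raw.orders;

-- Downstream layers read from this view, so data moves
-- table-to-view-to-table entirely within Snowflake.
```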
[00:13:58] Unknown:
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all of your questions? Join Pipeline Academy, the world's first data engineering boot camp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software topics that enable you to start your journey as a professional data engineer or analytics engineer.
Plus, they have ask-me-anythings with world-class guest speakers every week. The next cohort starts in April of 2022. Visit dataengineeringpodcast.com/academy today and apply now. In terms of the overall workflow, looking at the site and through the documentation, it seems to be fairly opinionated. And I'm curious if you can talk to the design principles and philosophies that you have embedded into the user experience, and how you make decisions about where to prioritize features and how to present the different capabilities of the system in a manner that's internally cohesive?
[00:15:08] Unknown:
Again, it goes back to our philosophy of, hey, 80% of this can be automated. And the design principle is it's got to be very easy to use. That's number one, you know, since we are bringing the best of both worlds here. We are saying, hey, you have to have the flexibility, but at the same time, you want to have the efficiency. In order for it to be efficient, you need to kind of interact with the tool pretty easily. And also, the personas that are going to be working with this tool, it varies depending on their experience. Right? If I'm an architect, my experience should be that I'm able to go and set standards so that junior engineers can just consume those standards without even thinking about them. If there's, you know, extensibility that needs to happen, I can do that as well. I can, like, extend the product behavior because there is a new feature, something that came out in Snowflake, that we want to support. Now you go create a new node, and then you make that available. That can be consumed by data engineers.
Now as a data engineer, the experience could be different. If you're a junior engineer, you may just want to go build pipelines based on the standards set by the architect. So we wanna make sure that that is possible and that they are getting the productivity that they are expecting out of this tool. But on the other hand, if you're a data analyst, like a business analyst building dashboards and reports, for them it's all about understanding what was built and why it was built the way it was built. Like, what does this dimension mean? What does this column mean? How do I understand or how do I know that this data is correct, and where is it coming from? So for them, the experience is all about documentation, lineage, understanding what was built. Because there is no data project where you can just kind of remove these data professionals or personas from it. At the end of the day, in the real world, all of these people have to come together to make a data project successful.
So we are making sure that these people have the right experience
[00:17:14] Unknown:
for what they're doing in the in the tool. And so in terms of the actual Coalesce platform, I'm wondering if you can speak to the technical architecture and how you've approached the implementation of the system.
[00:17:27] Unknown:
Yeah. So this, again, we have implemented on the cloud. We are on Google Cloud. You know, we have a Kubernetes cluster that serves the application to our, you know, clients. It's a multi-tenant environment. And we have a metadata database that is in Google Firebase. So, you know, we get all that scalability from Google's systems. And as far as scalability of processing goes, the data processing, of course, we rely on Snowflake. We have a template renderer that takes all the metadata as input and generates the code according to whatever template logic was written. Those are submitted to Snowflake, and Snowflake does the heavy lifting and returns the result sets, you know, whatever it has done, how many rows were affected, or whether there was an error. Those things come back to the system from Snowflake.
Yeah. So essentially, it is a cloud-based system where we have a cluster working as a multi-tenant platform.
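Coalesce's actual template language isn't shown in this conversation, so as a generic illustration, here is a Jinja-style sketch of how a renderer can turn node metadata into executable SQL. The metadata fields (`node.name`, `col.transform`, and so on) are invented for the example.

```sql
-- Generic Jinja-style sketch of metadata-driven SQL generation; this is
-- NOT Coalesce's actual template syntax. The renderer substitutes node
-- metadata into the template, and the finished SQL is sent to Snowflake.
CREATE OR REPLACE TABLE {{ target.database }}.{{ target.schema }}.{{ node.name }} AS
SELECT
    {% for col in node.columns -%}
    {{ col.transform | default(col.name) }} AS {{ col.name }}{{ "," if not loop.last }}
    {% endfor -%}
FROM {{ node.source }};
```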
[00:18:31] Unknown:
In terms of the data architectural patterns, you mentioned that Coalesce is designed to enable a junior or intermediate level data engineer to be productive while staying within the guardrails that are set by a more senior engineer or a data architect. And I'm curious if you can talk to some of the patterns that you have seen organizations fall prey to, where they end up wasting cycles or they start to design themselves into a situation where they're gradually losing productivity rather than gaining it? We have seen some amazing things happening with our
[00:19:08] Unknown:
this whole user-defined node concept that we have provided to our customers. But at the same time, you know, people can make mistakes with that. So far, I would say it's been more positive than negative. I can talk about some pitfalls, if that's what you're looking for, that people can, you know, get into trouble with. I myself have fallen into those kinds of pitfalls in the past. You know, sometimes under pressure, I would take some band-aid approach and do something that, you know, doesn't address the foundational aspect of the data analytics solution; it's just a band-aid. And then you end up with that band-aid forever. You think you can get rid of it, but you don't. That's a pitfall.
And also, you know, when you plan these data projects, if you quickly create something and you give it to the business, the business might take that as the solution. You already built the solution, so you're done. You know? And I got what I want. So it's over. Right? But in your mind, you're thinking, hey, that was just a band-aid. I still haven't built it the right way. I need more budget. I need more people. There is a part two for this project. So that is another pitfall that I myself encountered in the past, and it's nothing to do with the technology itself. It's just more about how you approach this whole thing. You know, if I'm building a foundation, you gotta say part one, part two, part three, or phase one, phase two, phase three. Phase one is probably a quick and dirty solution. Phase two is the improvement on that. Phase three is the real output. And you gotta plan for that three years or whatever number of years and get the budget for the whole thing, not just for one phase. So that's the lesson that I learned myself when I was doing it. So, again, I know you're looking for the more technical side of these things, but I think sometimes the nontechnical is more important than the technical, I would say. As far as the technical aspects of this, it's pretty straightforward, you know, because you know what Snowflake does.
If you have somebody who has built data warehouses, we are providing you a platform that can automate those patterns. So if you do that, you're gonna be pretty good, pretty satisfied.
[00:21:25] Unknown:
In terms of those prebuilt templates, you mentioned that it comes out of the box with a certain set of them, and end users are able to add and customize their own templates. I'm curious what your approach has been to figuring out what is the minimum base set of templates that you want to provide, the specific data modeling styles that you want to work with, maybe providing templates to be able to work with specific data sources and use cases. How did you think about what you wanted to have available at the start, whether it was, like, the Snowflake approach to data modeling in terms of star schemas and slowly changing dimensions, or Data Vault? And how do you think about that overall process of providing data modeling out of the box to get people started moving faster and helping them to discover what are the actual problems that they care about as a business, the rest of the 20%?
[00:22:18] Unknown:
So what we're seeing is, as far as the data warehousing solutions go, you know, people have certain methodologies that they want to adopt. Right? I mean, Kimball has been the standard one for a long time. There is a lot of momentum around Data Vault, and there is everything in between. Right? I mean, you know, variations of Data Vault, variations of other methodologies. So what we are doing is we are giving a set of these nodes out of the box, you know, for dimensions, for facts, for persistent staging, stage nodes, hubs, links, satellites, you name it. So we provide those things out of the box.
For the most part, that will satisfy a lot of use cases right out of the box. And anything that is beyond that, they can change. They can branch off of an existing one, and they can enhance it to meet their needs. But what we are seeing also is, like, people coming up with stuff that we didn't expect. For example, you know, Snowflake has streams and tasks as functionality that can identify deltas and things like that. One of our customers just went ahead and built a CDC node. And now that node is available in the graph, and you can just take a bunch of raw tables and say, add a stream, add a task, run every 10 minutes or whatever, and dump the delta into another table. It has become such an easy task for everybody else to consume that type of functionality and build it into their pipelines.
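Streams and tasks are standard Snowflake features, so a customer-built CDC node like the one described here plausibly wraps something like the following; the warehouse, schedule, and table names are invented for the sketch.

```sql
-- Hypothetical sketch of the stream-and-task CDC pattern described above,
-- using standard Snowflake syntax. Object names are invented.

-- A stream tracks the deltas (inserts, updates, deletes) on the raw table.
CREATE OR REPLACE STREAM raw.orders_stream ON TABLE raw.orders;

-- A task wakes up every 10 minutes and, only when the stream has data,
-- copies the deltas into a target table.
CREATE OR REPLACE TASK raw.orders_cdc_task
  WAREHOUSE = transform_wh
  SCHEDULE  = '10 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW.ORDERS_STREAM')
AS
  INSERT INTO cdc.orders_deltas
  SELECT order_id, customer_id, amount,
         METADATA$ACTION, METADATA$ISUPDATE   -- stream metadata: what kind of change it was
  FROM   raw.orders_stream;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK raw.orders_cdc_task RESUME;
```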
That's what we're seeing. You know? And we also see that the set of nodes that we are giving out of the box is going to grow, because we wanna create a marketplace where people can actually share these things, nodes or packages, made up of a bunch of nodes together that perform a certain function. So that's where we're going with that.
[00:24:12] Unknown:
And as far as the workflow for adopting Coalesce and starting to integrate it into the data platform and the sort of organizational analytics capabilities, I'm wondering if you can just talk through that process and some of the background knowledge that's useful to have as you're figuring out what the overall workflow and the node structures are going to look like. Again, there are several personas
[00:24:38] Unknown:
that are going to be using the tool. It's not just built for one type, because we want to address the entire problem as much as we can, not just one piece. Although we're focused on transformations, there are other things on the edge that are also important. So transformation is, how do I change the data from, you know, one form to another as quickly as possible? But what if I don't have column lineage? Right? Then you don't have a way to really kind of see, you know, what's going on. So to answer your question, it depends on the organizational structure. If they have data engineers, that would be the ideal kind of persona to deal with the tool, because they understand SQL.
They understand the methodology to some degree. And, you know, if you have architects, you know, for them, they're gonna work with the customization aspect of the tool. So I think that's what is expected. To work with the tool, we're seeing you have to be an architect or an engineer, or a power user who would get some help from IT but can also build the pipelines, whether they are proficient in SQL or not.
[00:25:46] Unknown:
And as I was looking at the Coalesce product, I noticed that it's very closely tied to Snowflake as the underlying storage and warehouse layer, and I'm curious if you can speak to the thinking that went into that decision, some of the ways that you have implemented Coalesce to potentially allow for additional storage and query engines in the future, and some of the other directions that you see as potential expansions to Coalesce?
[00:26:15] Unknown:
When we started this, you know, Snowflake was an obvious choice. Pretty much every prospect that we talked to was moving to Snowflake. It was pretty clear for us to focus on Snowflake. However, the tool is built in a way that the communication with the query engine has been abstracted. It's just a best practice. Right? I mean, to build software in a way that the front end is agnostic to what it's talking to. So that's how the tool is built. The middle layer is the template. And we are not thinking about this at this time, because we are hyper focused on Snowflake, and we wanna expand the platform even more and be in lockstep with Snowflake's features and things like that. However, because we built it in a way that that part is abstracted, if we really want to support another platform, all we have to do is build the templates that would generate the flavor of SQL that can run on a particular target platform.
[00:27:17] Unknown:
In terms of the ways that you have been working with some of your early design partners, I'm wondering what are some of the most interesting or innovative or unexpected ways that they're using Coalesce?
[00:27:29] Unknown:
One of the things that surprised me is people are building these nodes that I was talking about that we never thought of. You know, one example I gave you is the streams and tasks, the CDC type of functionality. But there are also people building, like, data profiling functionality into this. So people can create a profiling node and capture their profiling metrics to monitor data quality. And that is something that we haven't, you know, built or given to anybody. But because the platform enables them to do that, they are rapidly building these nodes that we never thought of.
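As a hypothetical example of what such a profiling node might run under the hood, the query below captures a few common data quality metrics for a single column; the metrics table and column names are invented.

```sql
-- Hypothetical sketch of a profiling query a user-built node might run to
-- capture data quality metrics. Object names are invented.
INSERT INTO profiling.metrics (table_name, column_name, profiled_at,
                               row_count, distinct_count, null_count)
SELECT 'STG_ORDERS', 'CUSTOMER_ID', CURRENT_TIMESTAMP(),
       COUNT(*),
       COUNT(DISTINCT customer_id),
       COUNT(*) - COUNT(customer_id)   -- COUNT(col) ignores NULLs, so this yields the null count
FROM   staging.stg_orders;
```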
And the other aspect that is very interesting from my standpoint is how quickly people are able to build, you know, complex solutions. I'll tell you a recent thing that happened. So we have an alliance director who is, you know, responsible for working with large firms and their, you know, system integrators and partners. You know, he's been handing out trial accounts and things like that. And there was this SI, you know, a big firm. They got some trial accounts, and two days later, we were on a call. And before the call started, they were saying that, hey, they want to talk to another firm and show them this tool. And we were like, we just sent you a recording. You know, we need to talk first and make sure that you understand the tool. And they said, but I think we figured it out. Let me show you. And then they started sharing their screen.
They had built this gigantic graph that is basically an implementation of an SAP, you know, SAP module or whatever, an SAP thing that they built. I'm not an expert in SAP, but whatever they built is called a calculation view or something. And they built that in a matter of hours to see how it performs on Snowflake. On SAP, it takes, like, a long time because of whatever the SAP architecture is and how it works. But when they moved that to Snowflake and they built it using Coalesce, you know, they got the performance boost. But what I was surprised by was how quickly they picked up the tool based on a recording. Yeah. It's definitely very cool.
[00:29:52] Unknown:
And so in terms of your own experience of building Coalesce and iterating on the technical aspects, working with your design partners to figure out the product direction and grow the business, I'm curious, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:30:10] Unknown:
First of all, there's a lot of learning. As soon as I started, I was constantly looking at a lot of other tools and what they're doing and what are the gaps that they're trying to fill, you know, starting with, you know, schedulers, orchestration tools, you know, data observability tools and whatnot. So there's a lot of learning in that regard. It was enjoyable for me. I enjoyed that part. I wouldn't call that a challenge. But I think one of the challenging aspects for us right now that I see is, you know, we show the tool to people and they get very excited, but they always have something to add to it in terms of what they need.
And it's almost like, hey, I love your tool. I wanna use it, you know, but can you add this functionality? Can you make sure that you have this by this time? Or when are you planning to have it? And we get that from all directions. So managing all of that and prioritizing, which is an obvious thing, and a common problem for any vendor, I guess, is very challenging, in my opinion. To be able to focus on what we're doing, but also prioritize and pivot if necessary and address those in a timely manner, is definitely challenging. And I knew that, but I'm experiencing it now. That's different, just knowing it versus actually experiencing it. Yeah. Absolutely.
[00:31:35] Unknown:
In terms of the sort of management of these graphs and the execution plans that you have, I'm wondering if you can talk to the, I guess, change management process there: how you're able to maybe automate the construction of some of these graphs, or being able to say, I've built this graph, I'm going to test it in either a test account for Snowflake or, you know, on a test subset of the data, and then being able to manage that rollout to the full production environment.
[00:32:08] Unknown:
Absolutely. And change management is a very, very important thing that we have made sure to focus on right from the beginning. So, you know, first of all, of course, we integrate with Git. We save the state of what was built and what was deployed in our metadata and in Git. So whatever you build, it goes to Git. From there, it goes to the different environments that you want to deploy to. So we have this concept of, you know, creating an environment that has credentials, that has Snowflake account details. It also has something called storage mappings, which is basically saying, hey, what databases and schemas do I need to work with when you push this code to this particular environment?
So an environment kind of encapsulates all of those things. So when we take this Git state and we push that to that environment, you know, you can do this with the command line or you can do this via the front end. But it goes through a certain process where it compares against what's on the target, and it will show you the differences, the delta that it's going to execute. This is what we call plan and deploy, where you always have a plan that you can see before you actually deploy. So you get to approve that. So that's how the change management is done. That's how you promote from one environment to the other. It goes from dev to Git to any number of environments that you want to push to.
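As an invented illustration of the plan-and-deploy idea (not Coalesce's actual output), a plan might surface a delta like the following, which you approve before it is executed against the target environment.

```sql
-- Hypothetical examples of the kind of delta a deployment plan might show:
-- the target environment is compared with the new Git state, and only the
-- difference is executed. All statements and names below are invented.

-- Plan: STG_ORDERS already exists in PROD but is missing a newly added column.
ALTER TABLE prod_db.staging.stg_orders
  ADD COLUMN discount_amount NUMBER(12, 2);

-- Plan: DIM_PRODUCT is new in this release, so it is created from scratch.
CREATE TABLE prod_db.marts.dim_product (
  product_key  NUMBER AUTOINCREMENT,
  product_id   VARCHAR,
  product_name VARCHAR
);
```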
[00:33:35] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. For people who are looking to accelerate their rate of development and the speed at which they're able to go from idea to analysis, what are the cases where Coalesce is the wrong choice?
[00:34:43] Unknown:
It depends on the use case, for sure. Coalesce is definitely built more for preparing data for analytics. Let's say that. And there are certain proven methodologies that people adopt. You know, for example, Kimball or Data Vault or things like that. You know, if you're building something central, that's core, you probably follow one of these methodologies to build that. Now people also build something in between. As I said, you know, they can just build flat tables. That's fine too. But where it doesn't fit is if you're just moving data from point A to point B, for application-to-application integration, for example. That would not be Coalesce, I would say. I mean, you can bend the tool to do that, but that's not the purpose of the tool. It's definitely in the data analytics domain.
So, you know, if you want to do application-to-application integration, that would be something else that you should look at. As you continue to iterate on the product, and keeping in mind these competing priorities and feature requests, I'm wondering if you can speak to some of the things you have planned for the near to medium term. Yeah. Definitely. I mean, you know, this space is vast. There's a lot of need out there from clients. One of the things that we are focusing on is, you know, you go into Coalesce and start building the pipelines. You know, you build these nodes and you just build the pipeline right from the beginning. But we wanna add some kind of modeling to this down the road, a way to kind of look at the source data and see if you can, you know, kind of connect the dots and say, hey, this field here is related to this field in this table. You know, and the tables could be coming from different sources.
But with that kind of information and input from the user, now we can take the automation to the next level, because, you know, you don't even have to specify the joins anymore; we already captured that from that interface by looking at the data at the beginning. Therefore, we can automate those things. We call it the discovery of datasets. So that's another piece that we're very focused on. But also, Snowflake is adding so much functionality as we speak. I mean, we want to be in lockstep with that, you know, whether it's a data science use case, you know, like, for example, Snowpark, I think that's what it's called. We wanna make sure what we can do there. That's definitely on our minds as well, to support those data science use cases and other languages that you can generate code for on Snowflake.
[00:37:08] Unknown:
Are there any other aspects of the work that you're doing at Coalesce or the overall problems of how to approach data architectural patterns and stay sort of ahead of the game in terms of being able to build out analysis and drive the business that we didn't discuss yet that you'd like to cover before we close out the show?
[00:37:26] Unknown:
One thing I am, you know, passionate about, and this is something that I got introduced to recently, is the decentralization paradigm, which, in other words, they call data mesh. I'm really very passionate about that idea because, going back over all my, you know, years of experience with this, if I look at this particular paradigm, it makes a lot of sense. Just to give you a very high level overview of what that means, you know, basically, there are four underlying principles. One is, make the domain responsible for building the pipelines and producing high quality data. So in other words, rather than having a central team that builds this big, gigantic data warehouse and has a team of data engineers building these large pipelines, you know, instead of doing that, how about we kind of decentralize this? Take that same idea, but do it at a domain level. Do it at a line-of-business level. That's the first principle.
But, obviously, once you say that, now aren't you creating silos is the next question. Right? If you do that, now you're creating silos. But then the answer to that is, you know, they have to create it with a product-based mentality. It's data as a product, is what you call it. When you go to the supermarket, you buy a product. You expect certain quality. You expect certain documentation. You expect it to be safe. That's the same idea. So if my domain, you know, produces some data and publishes some data, the other domains, the other people who want to consume that data, expect certain quality. That's the second principle. And the third principle is the self-serving aspect to it. Like, people don't have to rely on IT or some kind of specialist that they need to talk to to use this dataset. Instead, they can just kind of do self-service.
And finally, some kind of governance on all of this. Governance at the local level, that is, at the domain level. At the same time, governance at the broader level, especially from IT, to make sure that there's no duplication and things are being shared correctly and things have some consistency, some standards. So this whole data mesh paradigm, I'm very excited about. And the good news that we have from the Coalesce side is, I think Coalesce, just right out of the box, checks all these boxes pretty much right away. And I'm very, very curious to see, you know, if an organization is going in that direction, I want to see Coalesce play an important role in their organization.
[00:39:57] Unknown:
Yeah. Data mesh is definitely an interesting approach that has been gaining a lot of attention, so I definitely appreciate your enthusiasm for it. I've spoken to Zhamak a couple of times, and it's definitely a subject that comes up repeatedly on this show.
[00:40:12] Unknown:
Cool. I'm glad it is. Because in my opinion, I think that is a way to scale for an organization moving forward. However, there's a lot more in there to learn and make sure you do it right. Absolutely.
[00:40:25] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I talked about the modeling
[00:40:41] Unknown:
piece that, you know, a lot of people are asking about, because, you know, it's one thing to understand the raw data and how that data is linked, especially coming from different sources. On the other hand, once you build something, you also want to see what you built. For example, if you built a Data Vault, you want to be able to visualize that and see, as you're building it, hey, is this what I want? You cannot comprehend everything just by looking at code or even a pipeline. There is another perspective to this, which is the final kind of model view that people will see, and they can use that as a communication tool to understand and also communicate with other business, you know, users.
We think that is very, very critical and important. And I know that's tied to our roadmap, and it's gonna be coming very soon. That's one thing I can say. I mean, there's a whole lot of other things as well, but I would just say that one, since it is the near term. Absolutely.
[00:41:37] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Coalesce. It's definitely a very interesting product, tackling a real problem that people are experiencing. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much. It was a pleasure. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Satish Jayanthi Begins
Satish Jayanthi's Career Journey
Challenges in Data Transformation
Founding Coalesce and Its Mission
Comparison with DBT and ETL Tools
Technical Architecture of Coalesce
Data Architectural Patterns and Pitfalls
Prebuilt Templates and Customization
Focus on Snowflake and Future Expansions
Lessons Learned and User Experiences
Change Management and Deployment
Use Cases Where Coalesce Might Not Fit
Future Plans and Roadmap
Data Mesh and Decentralization
Biggest Gap in Data Management Tooling
Closing Remarks