Summary
The flexibility of software-oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitious use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world's first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
- Your host is Tobias Macey and today I’m interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today’s data-rich environment
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what you are building at Coalesce and the story behind it?
- What are the core problems that you are focused on solving with Coalesce?
- The platform appears to be fairly opinionated in the workflow. What are the design principles and philosophies that you have embedded into the user experience?
- Can you describe how Coalesce is implemented?
- What are the pitfalls in data architecture patterns that you commonly see organizations fall prey to?
- How do the pre-built transformation templates in Coalesce help to guide users in a more maintainable direction?
- The platform is currently tied to Snowflake as the underlying engine. How much effort will it be to expand your integrations and the scope of Coalesce?
- What are the most interesting, innovative, or unexpected ways that you have seen Coalesce used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Coalesce?
- When is Coalesce the wrong choice?
- What do you have planned for the future of Coalesce?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
- Coalesce
- Data Warehouse Toolkit
- Wherescape
- dbt
- Type 2 Dimensions
- Firebase
- Kubernetes
- Star Schema
- Data Vault
- Data Mesh
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today's data-rich environment. So, Satish, can you start by introducing yourself? Thank you for having me on this. My name is Satish Jayanthi. I'm one of the cofounders of Coalesce,
[00:02:15] Unknown:
and I currently play the chief technology officer role in the company.
[00:02:19] Unknown:
And do you remember how you first got started working in the area of data?
[00:02:23] Unknown:
Yes. Absolutely. I started my career as, you know, an application programmer, dabbled with that, and soon became a DBA, a database administrator, by accident. I was responsible for running the, you know, database servers on a regular basis to make sure the business was running smoothly. This was for an online e-learning platform startup in Los Angeles. And because it was a startup, I was kind of playing many, many roles, as you can expect in a startup. And one of the things that I was doing was writing and providing insights, like writing queries and generating reports for the business, as part of my DBA role as well.
And it got to a point where it was just not sustainable, the amount of requests that I was getting and the amount of work that I had to do to put something together and give it to the business. These were, like, some basic questions like, hey, how many people are, you know, using this particular course? Or what are the top 10 courses? Things like that. And I questioned myself. Like, there must be a better way to do this. And that's when I had my first encounter with the concept of data warehousing. So I picked up Ralph Kimball's Data Warehouse Toolkit and read it many, many times.
It was very interesting. And then I implemented my first data mart. That was, like, a big light bulb for me at that time. And that's how I got into it, and I continued to build a lot of data warehouses and data marts, and eventually also managed some groups of, you know, data professionals and so on for several companies.
[00:04:08] Unknown:
And so that brings us now to where you are today at Coalesce. I'm wondering if you can share a bit about what it is that you're building there and some of the story behind how it came to be and why this is the problem space that you wanted to spend your time and energy on. Yeah. Absolutely. So
[00:04:24] Unknown:
in my, you know, several years of data warehousing, data mart, and data analytics experience, the main challenge for me was always data transformations. It was pretty clear that we were spending a lot of time to take the raw data and change it to a form that is useful and that can be consumed for decision making. So when I was leading a group in a financial firm, we were building data warehouses. We had all the tools; we were acquiring companies, so it was growing really fast. And we had a big ETL team and pretty much any tool that you can think of.
But we were still unable to keep up with the demand. And then at that time, I came across this concept from a company called Wherescape. That was my first encounter with data warehouse automation. The whole idea is there are so many patterns in data warehousing and so many mundane tasks. You know, how can you automate those things in a way that makes the engineers very productive? And it's not just one thing, but it's those, you know, opportunities wherever you can automate, from one end to the other, the entire data warehouse life cycle. And the aggregation of those automations collectively will make you more productive.
So that was the concept. And I was hooked on that, and, you know, I implemented it and saw really, like, a lot of benefit from that. When that company got acquired, I moved on, but that concept was what stuck in my mind. That was a legacy product, so we took that concept and built it for the modern data stack. That's how we got here. My cofounder and I, we worked for that company implementing large data warehouses for large companies with great results. There were a lot of drawbacks in their solution, and we saw that. And we found an opportunity to modernize and build it for the modern stack. But the core idea was still automation, automating data transformation, which we think is still not automated. There are a lot of other areas, like the database platforms now; Snowflake has automated that. And data acquisition,
like, if you look at Fivetran, it's doing a great job there. However, when it comes to data transformation, it's still ripe for automation, I would say. And
[00:06:56] Unknown:
the most direct comparison that comes to mind right now is obviously the work that the folks at dbt are doing. And I'm curious if you can speak to some of the overlap, and maybe the potential coexistence, of what you're building at Coalesce as compared to where the focus of dbt is.
[00:07:13] Unknown:
So what we have seen in the last, you know, few years: if you go back several years, you'll see that at the beginning, people were just hand coding. And that's a lot of work. And then they said, okay, let's do something graphical. Then the ETL tools were born. The ETL tools were graphical, GUI-based tools. They would give you a lot of efficiency. Pretty much anyone with some training could use them. So it's all widget-based, drag-and-drop data pipeline development. However, the problem was, you know, when you go out of its boundaries and you have this special use case, then you have to resort to leaving the tool, going out, and kind of doing something like a stored procedure in the database itself.
So that was a limitation. They were pretty inflexible. So what happened is, you know, the whole industry kind of took a 180-degree turn and went to everything as code. And that's what dbt is. You know, everything is code. Now it gives you a lot of flexibility. Of course, code is the most flexible thing. You can write anything you want. However, the cost of that is you lose the efficiency. That's how we see it. Now with the everything-as-code paradigm, you need, you know, highly skilled people in the organization, especially in large organizations. It's gonna be hard to have that many, you know, highly skilled data engineers given that it's so hard to find engineers these days. And on top of that, because you don't have efficiency, you're gonna be coding a lot and still not be able to keep up with the demands.
So what we think is needed is a solution that has the best of both worlds, and that's what we are. We are the solution where, you know, you can do 80% of the work in a GUI, because it does give you a lot of productivity. There are a lot of patterns that can be automated. There is no reason why I should be coding the same thing over and over.
[00:09:25] Unknown:
And when it comes to corner cases, that's when I'll focus on the coding aspects of it. So that's how you get the results that you need on time. To your point about the first generation of ETL tools and the drag-and-drop workflow builders: when you do hit the edges, you're kind of left to your own devices, and you have to figure out, how do I build some additional component that I can somehow jam into this GUI builder and get them to work together? And I'm curious if you can talk to some of the escape hatches that you've built into Coalesce for being able to move from that initial process of, here's the rough workflow, this is 80% of what I need, but now I actually need to dig in and customize this to fit my specific use case, and being able to have that be an affordance in the system rather than something that you have to fight against the system to achieve?
[00:10:17] Unknown:
One of the things, again, you know, we wanted to build it in a way that we can provide 80% of the solution out of the box, easy to use by anybody. What that means is we kind of guide, you know, the user in a certain direction in building a pipeline. There is a certain flow to it. Basically, we call it the graph; everybody calls whatever you're building as a pipeline a graph. Each one is a node. And these nodes have certain configurations and certain behavior. Right? How to create or how to materialize a particular object on Snowflake, or how do you load that object? If it's a table, how do you load it? The logic to load it, the DML, basically.
What we have done is we have built these components as Lego blocks, at a very granular level. So you can assemble these things on your own to build a different kind of node type that fits a certain pattern. And as an architect, you can build these user-defined nodes and kind of, you know, meet or address those edge cases. So when people start off, they start off with a whole bunch of nodes that are available. For example, Type 2 dimensions. And out of the box, you don't have to think about it. You just go use it. But if you say, I don't want this to behave this way, I wanna make some changes. Like, maybe I don't want to use surrogate keys. I wanna use hash keys.
Then you go behind the scenes. You go into that node type. You make some minor adjustments. You have a new node type. Everything is like a Lego block that you can control and configure and work with. So that's the idea here.
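As a rough illustration of the pattern being described, here is the kind of Snowflake DML a Type 2 dimension node might emit when configured to compare hash keys instead of individual columns. This is a hypothetical sketch, not Coalesce's actual generated code; the table, columns, and hashing scheme are all invented.

```sql
-- Hypothetical sketch of Type 2 (slowly changing dimension) DML using a
-- hash key for change detection. Object names are invented for illustration.

-- Close out the current version of any row whose tracked attributes changed.
UPDATE dim_customer
SET    effective_to = CURRENT_TIMESTAMP(), is_current = FALSE
FROM   stg_customer s
WHERE  dim_customer.customer_id = s.customer_id
  AND  dim_customer.is_current = TRUE
  AND  dim_customer.attr_hash <> MD5(s.name || '|' || s.segment);  -- one hash compare instead of column-by-column

-- Insert a new current version for changed or brand-new customers.
INSERT INTO dim_customer (customer_id, name, segment, attr_hash,
                          effective_from, effective_to, is_current)
SELECT s.customer_id, s.name, s.segment,
       MD5(s.name || '|' || s.segment),
       CURRENT_TIMESTAMP(), NULL, TRUE
FROM   stg_customer s
LEFT JOIN dim_customer d
       ON d.customer_id = s.customer_id AND d.is_current = TRUE
WHERE  d.customer_id IS NULL
   OR  d.attr_hash <> MD5(s.name || '|' || s.segment);
```

The hash comparison is what makes the node easy to reconfigure: swapping the change-detection strategy only means regenerating the `attr_hash` expression, not rewriting the merge logic.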
[00:11:58] Unknown:
One of the challenges with warehouses has often been SQL and the fact that it is very declarative and flexible, but not always very composable. And I'm curious how you have approached that challenge of being able to encapsulate these nodes so that the handoff between them is as pluggable as you want it to be, so that you can combine them into these workflows without having to worry about how the underlying SQL is actually going to mesh together and what the sort of contract is between these different stages of the workflows.
[00:12:31] Unknown:
The way that it works right now is, I mean, you can build as many stages as you want in the pipeline. You can have a raw layer, which is basically the raw data that's coming in. And then you can build a CDC layer, for example, to capture the deltas of what is being, you know, loaded by a data ingestion system. You can have a staging layer, which we can materialize as views or tables. But it's all happening in Snowflake. The data is moving from one layer to the other in Snowflake, whether it's views or a set of tables. But today, it's all SQL. Now you can change that, because we are giving you templates.
There is no reason why you can't generate a different type of code other than SQL in the tool. You know, you have full control to override the template and generate, for example, some other language, as long as Snowflake has the native capabilities to do so. And we're seeing more and more of that, where Snowflake is supporting all these other paradigms in the platform. So the handshake or the flow can be pretty much customizable in the way that you want. Today, we have only SQL support. So it has to go table to table, or table to view, however SQL functions. But I can see that the handshake could change down the road depending on what Snowflake supports.
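To make the layering concrete, here is a minimal sketch of one hop in such a pipeline in plain Snowflake SQL, with a staging view materialized over a raw table. The object names are invented for illustration.

```sql
-- Hypothetical sketch of the layered flow described above, all inside
-- Snowflake. Object names are invented.

-- Raw layer: data landed by the ingestion tool (e.g., RAW.ORDERS).

-- Staging layer materialized as a view over the raw table.
CREATE OR REPLACE VIEW staging.stg_orders AS
SELECT order_id,
       customer_id,
       TRY_TO_DATE(order_date)  AS order_date,  -- light cleanup, assuming dates land as strings
       amount::NUMBER(12, 2)    AS amount
FROM   raw.orders;

-- Downstream layers read from this view, so data moves
-- table-to-view-to-table entirely within Snowflake.
```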
[00:13:58] Unknown:
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all of your questions? Join Pipeline Academy, the world's first data engineering boot camp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software topics that enable you to start your journey as a professional data engineer or analytics engineer.
Plus, they have ask-me-anythings with world-class guest speakers every week. The next cohort starts in April of 2022. Visit dataengineeringpodcast.com/academy today and apply now. In terms of the overall workflow, looking at the site and through the documentation, it seems to be fairly opinionated. And I'm curious if you can talk to the design principles and philosophies that you have embedded into the user experience, and how you make decisions about where to prioritize features and how to present the different capabilities of the system in a manner that's internally cohesive?
[00:15:08] Unknown:
Again, it goes back to our philosophy of, hey, 80% of this can be automated. And the design principle is it's got to be very easy to use. That's number one, you know, since we are bringing the best of both worlds here. We are saying, hey, you have to have the flexibility, but at the same time, you want to have the efficiency. In order for it to be efficient, you need to kind of interact with the tool pretty easily. And also, the personas that are going to be working with this tool, it varies depending on their experience. Right? If I'm an architect, my experience should be that I'm able to go and set standards so that junior engineers can just consume those standards without even thinking about them. If there's, you know, extensibility that needs to happen, I can do that as well. I can, like, extend the product behavior because there is a new feature, something that came out in Snowflake, that we want to support. Now you go create a new node, and then you make that available. That can be consumed by data engineers.
Now as a data engineer, the experience could be different. If you're a junior engineer, you may just want to go build pipelines based on the standards set by the architect. So we wanna make sure that that is possible and that they are getting the productivity that they are expecting out of this tool. But on the other hand, if you're a data analyst, like a business analyst building dashboards and reports, for them it's all about understanding what was built and why it was built the way it was built. Like, what does this dimension mean? What does this column mean? How do I understand or how do I know that this data is correct, and where is it coming from? So for them, the experience is all about documentation, lineage, understanding what was built. Because there is no data project where you can just kind of remove these data professionals or personas from it. At the end of the day, in the real world, all of these people have to come together to make a data project successful.
So we are making sure that these people have the right experience
[00:17:14] Unknown:
for what they're doing in the in the tool. And so in terms of the actual Coalesce platform, I'm wondering if you can speak to the technical architecture and how you've approached the implementation of the system.
[00:17:27] Unknown:
Yeah. So this, again, we have implemented on the cloud. We are on Google Cloud. You know, we have a Kubernetes cluster that serves the application to our, you know, clients. It's a multi-tenant environment. And we have a metadata database that is in Google Firebase. So, you know, we get all that scalability from Google's systems. And as far as scalability of processing goes, the data processing, of course, we rely on Snowflake. We have a template renderer that takes all the metadata as input and generates the code according to whatever template logic was written. Those are submitted to Snowflake, and Snowflake does the heavy lifting and returns the result sets, you know, whatever it has done, how many rows were affected, or whether there was an error. Those things come back to the system from Snowflake.
Yeah. So essentially, it is a cloud-based system where we have a cluster working as a multi-tenant platform.
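Coalesce's actual template language isn't shown in this conversation, so as a generic illustration, here is a Jinja-style sketch of how a renderer can turn node metadata into executable SQL. The metadata fields (`node.name`, `col.transform`, and so on) are invented for the example.

```sql
-- Generic Jinja-style sketch of metadata-driven SQL generation; this is
-- NOT Coalesce's actual template syntax. The renderer substitutes node
-- metadata into the template, and the finished SQL is sent to Snowflake.
CREATE OR REPLACE TABLE {{ target.database }}.{{ target.schema }}.{{ node.name }} AS
SELECT
    {% for col in node.columns -%}
    {{ col.transform | default(col.name) }} AS {{ col.name }}{{ "," if not loop.last }}
    {% endfor -%}
FROM {{ node.source }};
```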
[00:18:31] Unknown:
In terms of the data architectural patterns, you mentioned that Coalesce is designed to enable a junior or intermediate level data engineer to be productive while staying within the guardrails that are set by a more senior engineer or a data architect. And I'm curious if you can talk to some of the patterns that you have seen organizations fall prey to, where they end up wasting cycles or they start to design themselves into a situation where they're gradually losing productivity rather than gaining it? We have seen some amazing things happening with our
[00:19:08] Unknown:
this whole user-defined node concept that we have provided to our customers. But at the same time, you know, people can make mistakes with that. So far, I would say it's been more positive than negative. I can talk about some pitfalls, if that's what you're looking for, that people can, you know, get into trouble with. I myself have fallen into those kinds of pitfalls in the past. You know, sometimes under pressure, I would take some band-aid approach and do something that, you know, doesn't address the foundational aspect of the data analytics solution; it's just a band-aid. And then you end up with that band-aid forever. You think you can get rid of it, but you don't. That's a pitfall.
And also, you know, when you plan these data projects, if you quickly create something and you give it to the business, the business might take that as the solution. You already built the solution, so you're done. You know? And I got what I want. So it's over. Right? But in your mind, you're thinking, hey, that was just a band-aid. I still haven't built it the right way. I need more budget. I need more people. There is a part two for this project. So that is another pitfall that I myself encountered in the past, and it's nothing to do with the technology itself. It's just more about how you approach this whole thing. You know, if I'm building a foundation, you gotta say part one, part two, part three, or phase one, phase two, phase three. Phase one is probably a quick and dirty solution. Phase two is the improvement on that. Phase three is the real output. And you gotta plan for that three years or whatever number of years and get the budget for the whole thing, not just for one phase. So that's the lesson that I learned myself when I was doing it. So, again, I know you're looking for the more technical side of these things, but I think sometimes the nontechnical is more important than the technical, I would say. As far as the technical aspects of this, it's pretty straightforward, you know, because you know what Snowflake does.
If you have somebody who has built data warehouses, we are providing you a platform that can automate those patterns. So if you do that, you're gonna be pretty good, pretty satisfied.
[00:21:25] Unknown:
In terms of those prebuilt templates, you mentioned that it comes out of the box with a certain set of them, and end users are able to add and customize their own templates. I'm curious what your approach has been to figuring out what is the minimum base set of templates that you want to provide, the specific data modeling styles that you want to work with, maybe providing templates to be able to work with specific data sources and use cases. How did you think about what you wanted to have available at the start, whether it was, like, the Snowflake approach to data modeling in terms of star schemas and slowly changing dimensions, or Data Vault? And how do you think about that overall process of providing data modeling out of the box to get people started moving faster and helping them to discover what are the actual problems that they care about as a business, the rest of the 20%?
[00:22:18] Unknown:
So what we're seeing is, as far as the data warehousing solutions go, you know, people have certain methodologies that they want to adopt. Right? I mean, Kimball has been the standard one for a long time. There is a lot of momentum around Data Vault, and there is everything in between. Right? I mean, you know, variations of Data Vault, variations of other methodologies. So what we are doing is we are giving a set of these nodes out of the box, you know, for dimensions, for facts, for persistent staging, stage nodes, hubs, links, satellites, you name it. So we provide those things out of the box.
For the most part, that will satisfy a lot of use cases right out of the box. And anything that is beyond that, they can change. They can branch off of an existing one, and they can enhance it to meet their needs. But what we are seeing also is, like, people coming up with stuff that we didn't expect. For example, you know, Snowflake has streams and tasks as functionality that can identify deltas and things like that. One of our customers just went ahead and built a CDC node. And now that node is available in the graph, and you can just take a bunch of raw tables and say, add a stream, add a task, run every 10 minutes or whatever, and dump the delta into another table. It has become such an easy task for everybody else to consume that type of functionality and build it into their pipelines.
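Streams and tasks are standard Snowflake features, so a customer-built CDC node like the one described here plausibly wraps something like the following; the warehouse, schedule, and table names are invented for the sketch.

```sql
-- Hypothetical sketch of the stream-and-task CDC pattern described above,
-- using standard Snowflake syntax. Object names are invented.

-- A stream tracks the deltas (inserts, updates, deletes) on the raw table.
CREATE OR REPLACE STREAM raw.orders_stream ON TABLE raw.orders;

-- A task wakes up every 10 minutes and, only when the stream has data,
-- copies the deltas into a target table.
CREATE OR REPLACE TASK raw.orders_cdc_task
  WAREHOUSE = transform_wh
  SCHEDULE  = '10 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('RAW.ORDERS_STREAM')
AS
  INSERT INTO cdc.orders_deltas
  SELECT order_id, customer_id, amount,
         METADATA$ACTION, METADATA$ISUPDATE   -- stream metadata: what kind of change it was
  FROM   raw.orders_stream;

-- Tasks are created suspended; resume to start the schedule.
ALTER TASK raw.orders_cdc_task RESUME;
```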
That's what we're seeing. You know? And we also see that the set of nodes that we are giving out of the box is going to grow, because we wanna create a marketplace where people can actually share these things, nodes or packages, made up of a bunch of nodes together that perform a certain function. So that's where we're going with that.
[00:24:12] Unknown:
And as far as the workflow for adopting Coalesce and starting to integrate it into the data platform and the sort of organizational analytics capabilities, I'm wondering if you can just talk through that process and some of the background knowledge that's useful to have as you're figuring out what the overall workflow and the node structures are going to look like. Again, there are several personas
[00:24:38] Unknown:
that are going to be using the tool. It's not just built for one type, because we want to address the entire problem as much as we can, not just one piece. Although we're focused on transformations, there are other things on the edge that are also important. So transformation is, how do I change the data from, you know, one form to another as quickly as possible? But what if I don't have column lineage? Right? Then you don't have a way to really kind of see, you know, what's going on. So to answer your question, it depends on the organizational structure. If they have data engineers, that would be the ideal kind of persona to deal with the tool, because they understand SQL.
They understand the methodology to some degree. And, you know, if you have architects, you know, for them, they're gonna work with the customization aspect of the tool. So I think that's what is expected. To work with the tool, we're seeing you have to be an architect or an engineer, or a power user who would get some help from IT but can also build the pipelines, whether they are proficient in SQL or not.
[00:25:46] Unknown:
And as I was looking at the Coalesce product, I noticed that it's very closely tied to Snowflake as the underlying storage and warehouse layer, and I'm curious if you can speak to the thinking that went into that decision, some of the ways that you have implemented Coalesce to potentially allow for additional storage and query engines in the future, and some of the other directions that you see as potential expansions to Coalesce?
[00:26:15] Unknown:
When we started this, you know, Snowflake was an obvious choice. Pretty much every prospect that we talked to was moving to Snowflake. It was pretty clear for us to focus on Snowflake. However, the tool is built in a way that the communication with the query engine has been abstracted. It's just a best practice. Right? I mean, to build software in a way that the front end is agnostic to what it's talking to. So that's how the tool is built. The middle layer is the template. And we are not thinking about this at this time, because we are hyper focused on Snowflake, and we wanna expand the platform even more and be in lockstep with Snowflake's features and things like that. However, because we built it in a way that that part is abstracted, if we really want to support another platform, all we have to do is build the templates that would generate the flavor of SQL that can run on a particular target platform.
[00:27:17] Unknown:
In terms of the ways that you have been working with some of your early design partners, I'm wondering what are some of the most interesting or innovative or unexpected ways that they're using Coalesce?
[00:27:29] Unknown:
One of the things that surprised me is people are building these nodes that I was talking about that we never thought of. You know, one example I gave you is the streams and tasks, the CDC type of functionality. But there are also people building, like, data profiling functionality into this. So people can create a profiling node and capture their profiling metrics to monitor data quality. And that is something that we haven't, you know, built or given to anybody. But because the platform enables them to do that, they are rapidly building these nodes that we never thought of.
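As a hypothetical example of what such a profiling node might run under the hood, the query below captures a few common data quality metrics for a single column; the metrics table and column names are invented.

```sql
-- Hypothetical sketch of a profiling query a user-built node might run to
-- capture data quality metrics. Object names are invented.
INSERT INTO profiling.metrics (table_name, column_name, profiled_at,
                               row_count, distinct_count, null_count)
SELECT 'STG_ORDERS', 'CUSTOMER_ID', CURRENT_TIMESTAMP(),
       COUNT(*),
       COUNT(DISTINCT customer_id),
       COUNT(*) - COUNT(customer_id)   -- COUNT(col) ignores NULLs, so this yields the null count
FROM   staging.stg_orders;
```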
And the other aspect that is very interesting from my standpoint is how quickly people are able to build, you know, complex solutions. I'll tell you a recent thing that happened. So we have an alliance director who is, you know, responsible for working with large firms and their, you know, system integrators and partners. You know, he's been handing out trial accounts and things like that. And there was this SI, you know, a big firm. They got some trial accounts, and two days later, we were on a call. And before the call started, they were saying that, hey, they want to talk to another firm and show them this tool. And we were like, we just sent you a recording. You know, we need to talk first and make sure that you understand the tool. And they said, but I think we figured it out. Let me show you. And then they started sharing their screen.
They had built this gigantic graph that is basically an implementation of an SAP, you know, SAP module or whatever, an SAP thing that they built. I'm not an expert in SAP, but whatever they built is called a calculation view or something. And they built that in a matter of hours to see how it performs on Snowflake. On SAP, it takes, like, a long time because of whatever the SAP architecture is and how it works. But when they moved that to Snowflake and they built it using Coalesce, you know, they got the performance boost. But what I was surprised by was how quickly they picked up the tool based on a recording. Yeah. It's definitely very cool.
[00:29:52] Unknown:
And so in terms of your own experience of building Coalesce and iterating on the technical aspects, working with your design partners to figure out the product direction and grow the business, I'm curious, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:30:10] Unknown:
First of all, there's a lot of learning. As soon as I started, I was constantly looking at a lot of other tools and what they're doing and what are the gaps that they're trying to fill, you know, starting with, you know, schedulers, orchestration tools, you know, data observability tools and whatnot. So there's a lot of learning in that regard. It was enjoyable for me. I enjoyed that part. I wouldn't call that a challenge. But I think one of the challenging aspects for us right now that I see is, you know, we show the tool to people and they get very excited, but they always have something to add to it in terms of what they need.
And it's almost like, hey, I love your tool. I wanna use it, you know, but can you add this functionality? Can you make sure that you have this by this time? Or when are you planning to have it? And we get that from all directions. So managing all of that and prioritizing, which is an obvious thing, and a common problem for any vendor, I guess, is very challenging, in my opinion. To be able to focus on what we're doing, but also prioritize and pivot if necessary and address those in a timely manner, is definitely challenging. And I knew that, but I'm experiencing it now. That's different, just knowing it versus actually experiencing it. Yeah. Absolutely.
[00:31:35] Unknown:
In terms of the sort of management of these graphs and the execution plans that you have, I'm wondering if you can talk to the, I guess, change management process there: how you're able to maybe automate the construction of some of these graphs, or being able to say, I've built this graph, I'm going to test it in either a test account for Snowflake or, you know, on a test subset of the data, and then being able to manage that rollout to the full production environment.
[00:32:08] Unknown:
Absolutely. And change management is a very, very important thing that we have made sure to focus on right from the beginning. So, you know, first of all, of course, we integrate with Git. We save the state of what was built and what was deployed in our metadata and in Git. So whatever you build, it goes to Git. From there, it goes to the different environments that you want to deploy to. So we have this concept of, you know, creating an environment that has credentials, that has Snowflake account details. It also has something called storage mappings, which is basically saying, hey, what databases and schemas do I need to work with when you push this code to this particular environment?
So an environment kind of encapsulates all of those things. So when we take this Git state and we push that to that environment, you know, you can do this with the command line or you can do this via the front end. But it goes through a certain process where it compares against what's on the target, and it will show you the differences, the delta that it's going to execute. This is what we call plan and deploy, where you always have a plan that you can see before you actually deploy. So you get to approve that. So that's how the change management is done. That's how you promote from one environment to the other. It goes from dev to Git to any number of environments that you want to push to.
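As an invented illustration of the plan-and-deploy idea (not Coalesce's actual output), a plan might surface a delta like the following, which you approve before it is executed against the target environment.

```sql
-- Hypothetical examples of the kind of delta a deployment plan might show:
-- the target environment is compared with the new Git state, and only the
-- difference is executed. All statements and names below are invented.

-- Plan: STG_ORDERS already exists in PROD but is missing a newly added column.
ALTER TABLE prod_db.staging.stg_orders
  ADD COLUMN discount_amount NUMBER(12, 2);

-- Plan: DIM_PRODUCT is new in this release, so it is created from scratch.
CREATE TABLE prod_db.marts.dim_product (
  product_key  NUMBER AUTOINCREMENT,
  product_id   VARCHAR,
  product_name VARCHAR
);
```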
[00:33:35] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. For people who are looking to accelerate their rate of development and the speed at which they're able to go from idea to analysis, what are the cases where Coalesce is the wrong choice?
[00:34:43] Unknown:
It depends on the use case, for sure. Coalesce is definitely built more for preparing data for analytics. Let's say that. And there are certain proven methodologies that people adopt. You know, for example, Kimball or Data Vault or things like that. You know, if you're building something central, that's core, you probably follow one of these methodologies to build that. Now people also build something in between. As I said, you know, they can just build flat tables. That's fine too. But where it doesn't fit is if you're just moving data from point A to point B, for application-to-application integration, for example. That would not be Coalesce, I would say. I mean, you can bend the tool to do that, but that's not the purpose of the tool. It's definitely in the data analytics domain.
So, you know, if you want to do application-to-application integration, that would be something else that you should look at. As you continue to iterate on the product, and keeping in mind these competing priorities and feature requests, I'm wondering if you can speak to some of the things you have planned for the near to medium term. Yeah. Definitely. I mean, you know, this space is vast. There's a lot of need out there from clients. One of the things that we are focusing on is, you know, you go into Coalesce and start building the pipelines. You know, you build these nodes and you just build the pipeline right from the beginning. But we wanna add some kind of modeling to this down the road, a way to kind of look at the source data and see if you can, you know, kind of connect the dots and say, hey, this field here is related to this field in this table. You know, and the tables could be coming from different sources.
But with that kind of information and input from the user, now we can take the automation to the next level, because, you know, you don't even have to specify the joins anymore; we already captured that from that interface by looking at the data at the beginning. Therefore, we can automate those things. We call it the discovery of datasets. So that's another piece that we're very focused on. But also, Snowflake is adding so much functionality as we speak. I mean, we want to be in lockstep with that, you know, whether it's a data science use case, you know, like, for example, Snowpark, I think that's what it's called. We wanna make sure what we can do there. That's definitely on our minds as well, to support those data science use cases and other languages that you can generate code for on Snowflake.
[00:37:08] Unknown:
Are there any other aspects of the work that you're doing at Coalesce or the overall problems of how to approach data architectural patterns and stay sort of ahead of the game in terms of being able to build out analysis and drive the business that we didn't discuss yet that you'd like to cover before we close out the show?
[00:37:26] Unknown:
One thing I am, you know, passionate about, and this is something that I got introduced to recently, is the decentralization paradigm, which, in other words, they call data mesh. I'm really very passionate about that idea because, going back over all my, you know, years of experience with this, if I look at this particular paradigm, it makes a lot of sense. Just to give you a very high level overview of what that means, you know, basically, there are four underlying principles. One is, make the domain responsible for building the pipelines and producing high quality data. So in other words, rather than having a central team that builds this big, gigantic data warehouse and has a team of data engineers building these large pipelines, you know, instead of doing that, how about we kind of decentralize this? Take that same idea, but do it at a domain level. Do it at a line-of-business level. That's the first principle.
But, obviously, once you say that, now aren't you creating silos is the next question. Right? If you do that, now you're creating silos. But then the answer to that is, you know, they have to create it with a product-based mentality. It's data as a product, is what you call it. When you go to the supermarket, you buy a product. You expect certain quality. You expect certain documentation. You expect it to be safe. That's the same idea. So if my domain, you know, produces some data and publishes some data, the other domains, the other people who want to consume that data, expect certain quality. That's the second principle. And the third principle is the self-serving aspect to it. Like, people don't have to rely on IT or some kind of specialist that they need to talk to to use this dataset. Instead, they can just kind of do self-service.
And finally, some kind of governance on all of this. Governance at the local level, that is, at the domain level. At the same time, governance at the broader level, especially from IT, to make sure that there's no duplication and things are being shared correctly and things have some consistency, some standards. So this whole data mesh paradigm, I'm very excited about. And the good news that we have from the Coalesce side is, I think Coalesce, just right out of the box, checks all these boxes pretty much right away. And I'm very, very curious to see, you know, if an organization is going in that direction, I want to see Coalesce play an important role in their organization.
[00:39:57] Unknown:
Yeah. Data mesh is definitely an interesting approach that has been gaining a lot of attention, so I definitely appreciate your enthusiasm for it. I've spoken to Zhamak a couple of times, and it's definitely a subject that comes up repeatedly on this show.
[00:40:12] Unknown:
Cool. I'm glad it is. Because in my opinion, I think that is a way to scale for an organization moving forward. However, there's a lot more in there to learn and make sure you do it right. Absolutely.
[00:40:25] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I talked about the modeling
[00:40:41] Unknown:
piece that, you know, a lot of people are asking about, because, you know, it's one thing to understand the raw data and how that data is linked, especially coming from different sources. On the other hand, once you build something, you also want to see what you built. For example, if you built a Data Vault, you want to be able to visualize that and see, as you're building it, hey, is this what I want? You cannot comprehend everything just by looking at code or even a pipeline. There is another perspective to this, which is the final kind of model view that people will see, and they can use that as a communication tool to understand and also communicate with other business, you know, users.
We think that is very, very critical and important. And I know that's tied to our roadmap, and it's gonna be coming very soon. That's one thing I can say. I mean, there's a whole lot of other things as well, but I would just say that one, since it is the near term. Absolutely.
[00:41:37] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Coalesce. It's definitely a very interesting product, tackling a real problem that people are experiencing. So I appreciate all of the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Thank you so much. It was a pleasure. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Satish Jayanthi Begins
Satish Jayanthi's Career Journey
Challenges in Data Transformation
Founding Coalesce and Its Mission
Comparison with DBT and ETL Tools
Technical Architecture of Coalesce
Data Architectural Patterns and Pitfalls
Prebuilt Templates and Customization
Focus on Snowflake and Future Expansions
Lessons Learned and User Experiences
Change Management and Deployment
Use Cases Where Coalesce Might Not Fit
Future Plans and Roadmap
Data Mesh and Decentralization
Biggest Gap in Data Management Tooling
Closing Remarks