Open Source Reverse ETL For Everyone With Grouparoo

Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Have you ever woken up to a crisis because a number on a dashboard is broken and no 1 knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means?

Our friends at Outland started out as a data team themselves and faced all this collaboration chaos.

They started building Outland as an internal tool for themselves.

Outland is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams.

By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create single source of truth for all of their data assets

and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.

Go to dataengineeringpodcast.com/outland

today. That's a t l a n, and sign up for a free trial.

If you're a data engineering podcast listener, you get credits worth

$3, 000 on an annual subscription.

When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode.

With our managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pacaderm, and Dagster.

With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.

Go to data engineering podcast.com/linode

today. That's l I n o d e, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macy. And today, I'm interviewing Brian Leonard about Gruberoo, an open source framework for managing your reverse ETL pipelines. So, Brian, can you start by introducing yourself?

Hi. Great to be here. Thanks for having me, Tobias.

Brian Leonard,

previously

to doing data engineering reverse ETL open source work. I was the technical cofounder at TaskRabbit,

service to get things done in your house. Very different, though we did, of course,

have data engineering needs that that gave us the inspiration for this project.

And so as you mentioned,

working in data isn't your sort of long term endeavors. I'm wondering if you can talk to how you got involved in the space.

Yeah. I was the

CTO and the head of the product and, you know, data team and the engineering team at TaskRabbit. And, you know, we had lots of people using the service and wanted to learn about them and create

data products and analytics and

all of that stuff. And so, you know, we work on all of that infrastructure in the end using

Snowflake

and Looker and a variety of machine learning tools,

in particular, to

create a recommendation system on who is the best person for the job at hand. And somewhat inevitably,

you know, after you get all of that going and you you learn something really interesting, maybe the cohorts

of who's likely to churn from the system or something like that, the marketing team asks

to get that into

Marketo or whatever email system we're using. And so spent a good deal of time syncing our

data to

marketing,

NPS tools, some of those kinds of things,

things to send push messages,

Zendesk support systems, and things like that. And

generally found that engineers

didn't really like working on that very much and they didn't really know what success looked like in particular. And marketing didn't really know about the engineering

and so we got inspired to find a way to make this so much easier and bring that organizations,

together so they can, you know, be more effective.

And so that brings us to what you're doing at Grupo. I'm wondering if you can talk more about what it is that you're building there and some of the motivation behind deciding that this was the problem space that you wanted to spend your time and energy on. It's just more of what I was saying. Crazy things would happen. Like,

on a Monday,

I'd approve a $1, 000, 000 for our marketing team to move some important metric for us, maybe retention.

How can we get people to do more stuff on our platform?

And then on Thursday, they'd come and say, yeah. We're super excited.

Let's do this. I'm like, yeah. Let's do this. And then I'd be like, okay. Great. Now I need sync all of this stuff from the product database. The last time somebody

did a cleaning or whatever

into the marketing system

And I'd be like,

I don't know. Like, I got my own stuff to do. Sorry. And then I show it the engineer and then they didn't quite prioritize it. And then the system was kind of janky

when we did do it. And at the end of the quarter, I'd be like, hey, what happened? We didn't hit the goal. I gave you a $1, 000, 000. And they'd be like, what do you mean? Like, I didn't have any day to do

anything really. All I did was send the newsletter out and run more ads. How could I target those ads better and how can I personalize those emails?

And this just happened several times, took me a while to learn. And then

as I was thinking about where I could make the most impact to organizations

when I was thinking about a new company, this idea just kept coming up and it it came up with my cofounders who solved simple things. It came up with

hundreds of other companies that we we talked to when we did market research.

Yeah. We decided to do go after it. Interesting thing was

we talked to all these marketing people,

and they said things like, this is a huge problem.

And I'm like, I see you and I see your problems. They're like they felt really gratified and those were very empathetic conversations.

And then I was like, you want to use this? And they're like, yeah, we want to use this. I said, okay, great.

You know, go get the password to your data warehouse.

And then the conversation ended because they're not the gatekeepers to that system. But when we started talking to the engineers,

instead of asking the marketers like, aren't engineers annoying because they never sync your data? We said,

Aren't marketers annoying because they always want more of your data?

Then they said, Yes. And guess what, they had the password

and they were inclined and they can make a read only user

and

they were inclined to keep the data in their environment and we started going down this open source path to be able

to sync your data from your data warehouse into your sales support and marketing tools.

1 of the

things about the data ecosystem is it seems that either everybody's talking to each other behind the scenes or everybody just happens on the same sets of problems at the same time where we get these explosions of different product categories. And in this case, it's reverse ETL where we have some of the commercial offerings such as Hightouch and Sensus, and there's Gruberu, which is the open source core aspect of it. And I'm wondering what the

landscape of available tooling looked like at the time when you first started building the Grupo system and some of the ways that the emergence of these commercial competitors has informed your product direction or the capabilities of the system that you wanted to build in? For sure.

Yeah. I mean, I'm not for solving

unsolved problems. There's always nuances. Right? And so at the time,

this wasn't a thing.

And I really think it wasn't a thing. Sometimes,

you think something doesn't exist. But once you get into that space, there's a whole bunch of it. I think in this case,

High Touch and Census

didn't exist, and certainly the term reverse ETL didn't exist. The thing that was closest to it at the time was

what marketing people would call a CDP, customer data platform.

And so you would send stuff to tools like Segment

and others and usually events,

and they would

relay those to others. That's the closest that I knew of the system when when we got started.

But

these things just kind of their moment has come, I think. I equate it to, like, a

hierarchy of needs, like, in psychology.

Like, people have been investing

in their foundations,

Snowflake and BigQuery and stuff for the last couple years, and then their analytics and then their machine learning, and then this and this. And then at the very top, it's just like, okay. Great.

We've spent 1, 000, 000 of dollars in 5 years,

and all I have to show for to these reports?

Like, what's next? We're not done. We're never done. And so what's next? And it's operationalizing

that data. And, you know, I think

just the modern data stack and the

time involved and all that investment has

the people that are on the forefront of that looking to put that data back into use, which is a whole new fascinating set of problems because

the worst outcome before was a bad report.

Now the worst outcome,

I don't even know. Like, sending it to the wrong place, data breach, or, like,

a 100000000

wrong emails, like, like, some companies do every now and then, things like that. Anyway, so it's time has come, and there's some interesting new problems.

How do those inform

how we're thinking about it?

The concept of the problem and solution is fairly straightforward. There's nuances.

Sometimes I look to see what integrations they have just to see if if there's 1 I haven't heard of and things like that. But in general, we're having customers

and open source users drive what we're working on and that's where most of our information is is coming from. Yeah. It's definitely interesting to occasionally look at some of these integration platforms to see what are the different sources and destinations because, yeah, they're inevitably

tools or platforms or services that I have never even heard of. It's like, oh, what is that thing?

Right. And so, like, I'm like, oh, look. They have

32, and we have 28 or whatever it is right now. Like, what are those 4? I've never heard of that, and no customers ever asked me for it. So I guess I'm just not gonna worry about it right now or something like that. And in terms of the

sort of core concepts of reverse ETL, I'm wondering if you can talk to some of the baseline capabilities that are necessary to be able to build and run 1 of these systems.

Additional features and utilities that you have been adding into the system as your users have started to become more advanced in their usage?

Yeah. I think at its very core,

you basically have

a table

in your,

you know, product database,

sometimes more commonly depending on the organization, their data warehouse.

And it's

a kind of like

a fact table. I know people call these different things. Roll up table, like, customer ID,

first name, last name, email, whole bunch of stuff, lifetime value,

I don't know, likelihood to churn percentage, all kinds of weird things you might come up with number of

actions they've taken.

I don't know. Number of things they favored case by case in that business.

And, like, basically, we really wish this was in system a, b, and c,

commonly

marketing, sales, and support. And then

sometimes

there's 8, 000 different marketing tools, for example, you know, various nuances thereof. And so, like, great.

Make sure those are always in sync. If something new happens in this table,

make sure it's reflected in the remote system. We call it a destination

source and destination.

And so what's required just to make that happen? You know, generally, the ability to

talk to that database

and the ability to talk to the destination,

baseline

some sort of data massaging in there, you know, around how that destination works. Dates are always represented in a different way, for example.

And

probably in the baseline,

some notion of, like, rate limiting and things like that super just they all have it, and it's

the first thing that someone that builds this themselves figures out. They send 3 users, and they think they've finished a good sync system, then they hook it up to production and turns out you can only do 7 a second or

there's some that are by day, which is even crazier central time, like, you know, 10000

central midnight to midnight, like, all kinds of weird things and then they turn off and you have to be able to retry and, you know, in the end get to

what's the right word? Like, synced status, I guess. There's some physics term that I'm thinking, equilibrium or or something like that. On top of that, you can start adding the ability to do more advanced queries. Of course, it's not just the 1 table.

Something that we tend to do that the other ones don't do driven by customer requirements

is not everybody has their act together where there's just this 1 magical table and everything is clear. And it's not even always just 1 query you can do. And so

we have the ability to kind

of stitch together that record

from

many different tables, pulling your

lifetime value from here and your user record from somewhere else and even a whole another source, we can kind of put those together and then sync that somewhere else. We've added segmentation on top of that, something that we heard that people commonly wanted to sync

with sort of these

groups, so to speak, kangaroo themed group building tool that we have. So maybe you wanna tag your users in Zendesk if they're high value so they get better services. Now what does high value mean? That's anyone that spent more than

$200 or whatever. We add that on top of that. And then

UIs to browse that is something we added, like, how is my data shaped? And we did that to

try to fill that rift I was talking about between engineering and marketing so they could actually agree on what the right data looks like and who's in those groups and the numbers numbers look right. You mentioned that when you were first

trying to

make people aware of the tool, get people to test it out, that your initial target was to talk to the marketing teams, and then you ended up talking to the engineers. And I'm wondering if you can speak to who the actual

target users are for the Grupo platform and some of the ways that those personas have helped to

inform the priorities in terms of feature development and the user experience design of the system.

Yeah. Maybe all that's code word for it doesn't have to look pretty if it's engineers. I don't know. Yeah. So we started with the marketers and then

eventually,

you know, got on this

rising trend of data engineers, like, adding this capability into their organization. And so

the open source product that we have is for

engineers to solve data problems and do the syncing. And so

you use the UI. It's very close to sort of DBT and its workflow,

which has informed our same target audiences. We have some

analytics engineers we're working with, for example, that are very comfortable in that area. You locally kind of come up with your

configuration that defines your pipeline. We have a UI to do that because it's super helpful to

browse the,

you know,

fields available in Zendesk, for example, and you click, click, click, that creates a Git configuration that you check-in and you deploy that and then, you know, autopilot everything is is syncing.

That's targeted at engineers.

We have a enterprise product and a hosted product, so you can run it yourself or in your own cloud or we'll host it for you

that adds on top of that solving organizational

problems to that. So, for example,

maybe you wanna hand off

exactly what gets synced in a no code kind of fashion to those marketers that we were talking about before. You've defined

in a in a way that maybe you did your LookML and Looker, something I've done in the past, but then people can make any dashboards they want from that. You can define your data schema

and but leave it up to the marketer what lifetime value means, $300, $200, etcetera,

or any other groupings they can do. They're pretty comfortable with these segmentation tools.

And what actually gets synced to Mailchimp, we have people using that and that's sort of in the no code

point and click kind of thing on top of those configurations.

When you're discussing the kind of visual element of being able to select which attributes in the target system you wanted to populate. I'm also interested in understanding

what the

other existing or potential capabilities are for being able to hook into

a data catalog or a metadata system to be able to enumerate the available

fields and tables in the source systems to be able to understand what are the either preexisting

models that fulfill these destination systems or being able to say, okay, these are the fields that I need from, you know, x, y, and z table possibly across multiple source systems that I can then stitch into these destination records?

We're finding,

you know, across the whole landscape

of, you know, possible users

that the more common scenario is that people don't know what the heck is in these columns,

and, you know, they don't have their catalog

situation together. And they're actually

using Grupo

to, like,

define the single source of truth of what, say, lifetime value means, which is, like, for example, does it include

returns or not? It's something that is, like, up for grabs when you're querying various databases and or data

tables in in different ways. Now I find it super interesting, and and certainly our infrastructure allows it

to

like, our Snowflake source, for example. You ask it, like, what tables do you have and what columns do you have in each of those tables? And there's actually space for in our thing for more meta information on that.

I find it super interesting to think about how we could point that to a sort of a metadata management system and, like,

filter that out on what the users see, especially the marketing users, but even the data engineers, just so they use the right 1. And, you know, sort of it's not just a column name

called,

you know,

recent

behavior score.

It's like, what does that actually mean? But not something we've done yet, but super good idea. Yeah. It also brings in a lot of the complications

of then you start ending up in the space of saying, oh, well, I see all these source tables. Now I need to have a visual query builder and understand when I'm coming from multiple systems, what the intermediate representation is going to look like so I can stitch that together. And then then you're in a whole different product category that you probably don't even wanna think about.

We do have a visual query builder of sorts. You can write your SQL and certainly

plenty of dorks that like to do that, myself included.

But especially in the most common cases that we've seen, like, querying a table

and then

either, you know, using exactly those values, so just 1 of those fact table, or summing it up and filtering it out a little bit. We have a query builder for that, but it's nice to have the fallback,

for sure to do anything you want.

And so in terms of the actual

architecture and implementation

of the project, I'm wondering if you can talk to some of the ways that that has manifested

and some of the technologies

and systems that you rely on.

Yeah. So, you know, both our hosted offering and the 1 people deploy in their own clouds,

basically the same. We run on top of a to store data. For example,

you know, another thing that we talked about before baseline and and what we do, like, we can sync incrementally. If you've got a 1, 000, 000 records,

we need to know

what's changed since the last time we've synced, you know, our general storage and not to send the same thing that we sent again to these systems, especially when they're being rate limited. We store that in a Postgres database,

and we use Redis

for caching basically and some

sort of background processing so we can run multiple threads and sort of keep everything, I don't know, dork words, mutex and things like

that, parallel

so when you deploy gruparoo,

you're running

the code, which is in Node on top of Postgres

and Redis.

Locally,

it doesn't have those requirements at all, like the development environment that falls back to, like, memory and SQLite. So there's even less dependencies when you're running it locally to get your config going. But, you know, once you have lots and lots of

parallelism and records, we use those stores to run up on the production system.

And then for the managed platform that you offer, what are some of the additional systems and architectural components that you've added in to be able to add in

some of those organizational requirements and some of the data governance capabilities?

Yeah. So the

stack is the same.

But, you know, if you get the enterprise edition and our hosted offering, like, there's just more tables in Postgres around

teams and, you know,

things like that such that you can you know, the marketing team's allowed to change this, but not this, and this person is on that team and such. In our cloud offering, there's a whole another layer on top of that, which is, like, how do we get many users

with these instances and things like that? And for that, we use Terraform.

As far as the evolution of the project, you mentioned that you first started iterating on this space when you were working at Task Rabbit and then decided to turn it into your own sort of dedicated endeavor. And I'm wondering how have the design and goals of the project changed or evolved in that time? So we we encountered the problem at TaskRabbit and solved the problem.

I think Grupo as a project was fresh from that problem set and, you know, using our learnings, but we didn't do any of the code there. For the same reason, it'd be impossible to prioritize

a generic

flexible,

multi destination, multi source syncing system if you didn't

have all of those needs at that moment. And, like, you're just basically trying to make marketing happening as soon as possible. This is what happens. 1st, what happens if you had a really engaged engineering team

that spent

year plus,

you know, sort

of bike shedding, I guess, really solving all of those exact problems

and doing them

as good as they could possibly be. What we got informed from that was really just this organizational

gap and, like, really thinking about how we could fill it and,

in general, how could we take

an integration that often took in

the month plus time frame and make it a day and, you know, what could we take from that? And the biggest thing is probably

if we wanna get really dorky about it, like this concept,

computer science term, idempotency,

which I read about in your book a little bit too, so I know it's in there, which is really

how can we make it so that it's fault tolerant such that if 1 or the other systems are down or we can always basically

recalculate the source of truth at any

time and get things back synced.

It's kind of this idea that's

more and more popular. You know, Terraform has it, React has it,

dbt has it, which is, like, the way that that's enabled is through being declarative. And so, you know, we we we took that approach.

StreamSets DataOps platform is the world's first single platform for building smart data pipelines across hybrid and multi cloud architectures.

Build, run, monitor, and manage data pipelines confidently with an end to end data integration platform that's built for constant change.

Amp up your productivity with an easy to navigate interface and hundreds of prebuilt connectors, and get pipelines into new hires up and running quickly with powerful, reusable components that work across batch and streaming.

Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure.

Finally, 1 single pane of glass for operating and monitoring all of your data pipelines, the full transparency and control you desire for your data operations.

Get started building pipelines in minutes for free at data engineering podcast.com/streamsets.

The first 10 listeners that subscribe to stream sets professional tier will receive 2 months free after their 1st month.

And so in order to go from, I have this problem, I need to be able to start getting data into these various marketing and sales and support systems to,

I found Gruberu. I want to get it set up and start writing pipelines. I'm wondering if you can talk to that overall workflow of saying, I've got this idea to I've got this running in production.

Yeah. We've got some good quotes. People say that's super fast. It's certainly

well within a day's work to do that, if not, you know, within the hours work. Basically, you

do a install of our tool.

You say new project,

then you launch this UI config. You install Gruberu. You say Gruberu init, Gruberu config,

then you put in your source,

you know, credentials essentially, often a read only user

that'll end up being like a m variable by the time we deploy this to production.

You pick the tables that you wanna sync or write the query

and then

do the same thing for your destination

and all of that's to generate the declarative config of what the pipeline is. 1 of the things we found that was super important in this space

was, like, this notion of sample records. So we have that built into our developer tool. So, like, the first time you know if this is working or not, is it with all

million people in your pipeline and you've accidentally deleted your Wholesales team's whatever?

No, hopefully.

In that you add, you say, I know about user 32,

what does that look like?

Essentially? Click and say, import the data, like, okay, this looks right, this is what I want to sync. Click export,

it goes over to

Mailchimp and Zendesk or whatever you've configured. You can look at it over there to make sure it looks right. You can, you know, sit with your counterpoint in the marketing or support organization.

See, does this look right? In general, build up the confidence.

And then with that same configuration,

when you say, like, group root start, like, it runs that in multiple threads. And you could do that locally,

which most people do the first time just to make sure. And then when that's deployed, it's, you know, running forever and just always keeping things in sync. The biggest

hurdle isn't usually that development thing. It's like, what's your server situation?

How do you deploy things and things like that? That's what had us inspired to say, okay. Great.

We'll do that for you if you want as well. And you just give us your configuration, and we'll do all of the AWS bits, so to speak.

As far as the

changes in source and destination schemas and being able to enumerate the available fields in these various downstream systems,

I'm wondering what your process is for being able to

manage the discovery in these destination systems and then also being able to manage any potential

mismatches or schema drift between source and destination.

We haven't fully tackled, like I don't know. You changed your column from f name to first underscore name. Right? Like, anytime you do something like that in your data warehouse,

there's a list of

ramifications, you know, in your BI system and your this and your that, and Gruber is certainly on that list.

It's pretty aggressive to decide

autonomously

to change what data is getting synced. Maybe it's not an renamed. It's hard to even know it was a rename, frankly.

But what we do have is

tools to get it right the first time and,

you know, tools make it easy to make that make that change. They probably exist in parallel in my experience for a little bit. And so when you're using that developer tool,

like, you just have a drop down of the columns and, like, you pick 1. Right? And so, like, in general, that gets rid of typos and things like that, and it gets written to

your config. And then it it can even be PR reviewed if you wanna do that, of course, when it gets checked in. So there's some eyes on that, and it's unlikely

that we got it wrong to begin with. We want to manage the migration,

of course, but, you know, tend to get it right

all the time when we're using that. On the destination, it's the same, but more complicated. There's no you know, all of these databases have, like, an information schema or something that's super reliable.

Every destination is different

and weird around understanding the fields that are built in and the custom fields, but we end up with that same experience where

you see if you're using Mailchimp, for example, you see the ones that are built in, email, f name,

l name, address, for example,

and 1 of them is required email, for example.

But then if you've added your custom ones, you see those in the drop down too. There's other destinations that just kinda willy nilly and, like, you just send whatever you want, and we have ways to do that as well.

And for people who are

running Gruberu and they've identified some new downstream system that they wanna be able to send data to that isn't already part of the available list of integrations? What's the process of actually

starting a new plug in project and going through the development cycle to

build against the Grupo Root internals and API

and be able to test against this new downstream system. And what are the requirements for

getting that new plugin integrated into the list of available

integrations on the Gruberu site?

Yeah. Great question. I think this is 1 of the examples of where I'm most excited about open source. Right? Because, like,

2 things. A,

you see these list of companies that especially US based SaaS companies integrate with. And, like, they seem to top out at the 100 to 200 range. Like, there's just only so many that you can

deal with

and that's worth it to your business. We have people all over the world using all kinds of systems. There's a Mailchimp of Vietnam and a Mailchimp of Brazil that, you know, US companies tend to never get to. And then the larger the organization,

the more that you've got this crazy internal system that obviously,

you know, we would never be able to integrate with. And so we're just, super excited to facilitate

custom integrations whether they get checked into

our code monorepo or, you know, they're your internal thing. All kinds of

internal support systems out there, for example,

that we've met. And so the process for that is basically,

you know, the goal was to you do the 1% of work for all these incremental things, and the system does the 90%. And so

you basically

implement

the things we've talked about, which is what fields does this thing have and what are their data types. Each of those are are different. And the other is the primary 1, which is, like, okay.

Here's

a record set. Like, here's what it was before, and here's what it is now just in case sometimes those changes are really interesting.

Like, make it so in the destination system.

We do that in Node. We had to pick a language that did well

to asynchronous

communication

and that a lot of people knew. So we went with, TypeScript,

JavaScript and TypeScript.

We've had

even analytics engineers. 1 recently built a big gap we had actually that got filled by 1 of our customers because we just didn't get around to it was Airtable.

Analytics engineer writes an Airtable plugin, which is now done, and now everyone can use it, which is great. The big part of that

is basically, in my experience having done 30 of these,

is

the testing system that we have. And so you've got your credentials on that,

and, like, we've got patterns that make it easy. Like, alright. Here. Add somebody.

Change their data.

Remove them, you know, sort of

patterns in place that once you run this script and everything works, we found it to be fairly reliable in production.

To your point of the fact that a large number of the platforms people are typically going to interact with are very US and Anglo centric. I'm wondering,

because of the fact that Gruberu is open source and does have this global audience, how that factors into any

localization and internationalization

work that you've had to do on the core capabilities of Grupo.

Yeah. That's 1 of those things that's like,

you know, you're supposed to do it, but it's hard to prioritize in the beginning. And that's still the phase that we're at. So, you know, I went through that as we launched TaskRabbit in many, many countries.

And, you know, it's a deep,

real investment to get right. And so

all the the air messages are in English and, you know, there's interpolation and, like, all of the things that people tend to do at the beginning of these projects. And then we know it, but we decide to clean it up later. And super happy to talk to anybody

that wants to help in that effort, but not currently a priority for us. In terms of your strategy or philosophy

around

the dividing line between open source and enterprise, I'm wondering

what the

kind of guidelines are for what features to put in which destination and how you have worked to

sort of understand what best practices are in other sort of similar product categories and your own evolution of working with the community to understand

sort of what are the expectations that the open source users have and that paying users have as to which features are available and which distribution.

The best practice that I've seen and that we come up with ourself and saw and others kind of like, HashiCorp was 1 that I know we looked at, is that you have to have a fairly

succinct way

of saying what's in the,

say, open core and in the enterprise edition. And so you have to be able to have that philosophy and apply it. We came up with

the thing I hinted at earlier, which is

the open source version is for engineers solving data problems, and the

enterprise edition is for,

you know, companies and organizations solving organizational problems. And so all of the sources and destinations

and that core syncing engine and all of that is part of the open core.

And then on top of that, user rights management.

You know?

Some of the things that sound like enterprise, we still put in core just because, like, why not? But, like, single sign on with Okta or something like that, for example, is in the enterprise edition. I think it'd be really discussion when we add a data dictionary or something like that. Is that for engineers solving data problems or is that for organizations solving organizational problems, things like that? The point and click, definitely an enterprise edition, things like that. And so that philosophy,

super important because in the early times

before we had that, it was, like, kind of a exhausting experience, frankly, trying to draw these lines.

And once you draw them,

especially

on the open source side, you really don't wanna change that decision.

In your work of building the Gruberu

technology

and organization around it and working with the end users of the system? What are some of the most interesting or innovative or unexpected ways that you've seen it applied? Couple. The thing I really didn't know that much about

some of these salespeople workflows,

which are very complicated,

and the transitioning

between

leads and contacts and opportunities and all these sorts of things. And I think that was just something I didn't have a lot of visibility into that's super nuanced

in some organizations and super

high value if you can get it right because you're focusing on talking to the right people, which leads to the revenue. And so the our Salesforce plugin, for example, is definitely the 1 that's evolving the most based on

interesting requirements in that space.

We thought a lot about the interplay between the marketing and the data teams, but the interplay between the data and the DevOps team has been more interesting,

especially as I mean, I think we're probably I saw this at TestGrip as well as as we're creating

data products, like that recommendation service and others that I was talking about, like, how are we standing these up and things? This is something a lot of organizations are going through. And so in general, we've spent

a lot more time helping organizations

with their

Terraform and their Helm charts and their this and their that and all these other things than I think we probably expected.

In your own experience

of building the platform

and working with end users and working with the open source community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?

I

think that

you should never underestimate

the assumptions of what someone else's data warehouse is gonna look like. It's just always

bespoke.

And they don't all have primary keys, for example. It's something I always assumed. And when we ask for a primary key and then you pick 1, it's a column called ID.

And then you're trying to troubleshoot with someone like what's going on here.

And

the data, it doesn't look exactly right. And we're on, like, a pairing session kind of thing. And it's just like, what do you mean there's 2 people with that IT? Like, I just didn't I just didn't see it coming,

frankly.

And, oh, this is more like

a table of customers, but there's a row for each time they bought something.

Okay. Well, I would have called that table

purchase events or something like that, for example.

And just in general,

that's super interesting. And so

especially when we're working with people contributing to the project,

something that I've experienced recently is, like, okay. Well, they solved their problem. But guess what? They didn't have any date fields

or something like that. And so, like, what does it take to get that over the line? Are they willing to do that after they've solved their own problem because they're a good open source citizen?

Or do they have stuff to do? Like, probably have stuff to do. And so how do we work with them to get that over over the line? Another example of that same sort of root

situation.

And so for people who are looking for a solution to manage synchronizing their various data sources into their sales, marketing, and support systems, what are the cases where Gruberu is the wrong choice and they're either systems.

What are

the

cases where Gruberu is the wrong choice and they're either better suited writing their own internal tooling or looking to 1 of the other commercial vendors or some other solution? I think the main 1 I would say at this point, some other

solution?

I think the main 1 I would say at this point

for Gruberu, you know, we're evolving as the use cases come up, is that

there's

other systems,

CDPs and even other reverse ETL tools

that have focused more on

the sort of event driven architectures.

We'll certainly happily sync anything to anywhere. But in general,

if you don't have anything like that, and

you really just wanna get events to

a few different systems,

like,

segment's probably the right call. Like, they're gonna handle the intake of the events better, and then they spent 10 years, like, optimizing that

workflow. And even if you have a table

of events right now

somehow in your data warehouse because you're using system that writes it there or you're writing them in from Segment or whatever.

In general, we haven't prioritized

syncing those events to

same expandel

as high as we have

sort

of account driven, company driven,

human user driven,

like, those kind of data models as much as other ones. So for example, right now we sync

profiles

to Mixpanel

because people are using that for marketing

and other things, but we're not currently syncing events. So events might be might be better than 1 of those other ones. And as you continue to build out and evolve the Gruberoo platform, what are some of the things you have planned for the near to medium term? Yeah. So 1 great thing about open source is our roadmap is public, and we're, you know, sort of requesting comments and helping our user base

drive those.

The thing we're working on now is

being,

efficient in the

syncing situation

and adding in destinations,

sort of the near term

road map. I think the really exciting things that we might see in the midterm

is stuff that we can do

on top of the data once we have it. And so there's a whole bunch of interesting use cases that we've heard from users on

compliance sort of things, sort of data quality sort of things,

organizational use cases like how could we make

attribution better, for example, is 1 of the organizational ones. So we're starting to look into things now that we have

a normalized and well defined dataset

across many tools. Like, how can we

start solving more of those organizational

nuances and GDPR compliance and things like that? Are there any other aspects of the Grupo project itself or the overall space of reverse ETL or some of the community elements involved in your business that we didn't discuss yet that you'd like to cover before we close out the show? I think there's just a few trends that are super interesting in the whole space as you bring it up. Definitely 1 of the trends is so what people are calling the modern data stack, which in general is a

unbundling and best of breed tools used with the data warehouse at the at the center. Grouperoo

fulfilling 1 of those buckets and newly evolving.

I think that's super interesting. The other

1 in that same space and especially

as more of these things are impacting,

like, users directly, customers and end users, so to speak, is this, like,

I don't know, software development practices

being applied to this space more and more,

you know, pull requests and get things are checked in to get and, you know, checked in

notebooks and configurations

and deployments and all of that sort of stuff. The best practices that the product groups have been using for a while being applied to this, you know, I think that's something that we're leaning a lot into and we're seeing a lot of

that makes all of this more

reliable as it becomes a key product that the business

is using. And so just super excited to be a part of that, and we threw this conference called the Open Source Data Stack in September, and we're gonna be doing more on that. Actively looking for people to get involved with that and if they wanna speak and present and attend, that's at opensurcedatastack.com

where we're showcasing

how all of these things fit together. We had several partners in that conference.

Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today? I think there's a lot

of technological

gaps

that

that are in the process of being filled by a lot of great companies. Reverse CTL, obviously.

Data Quality, obviously.

A lot of companies that are helping the DevOps gaps that I'm seeing

and being able to productionize

all of these things. And so those are all

gaps that I see being

filled. I think

if there's a gap right now,

it's embedded as well as we talked about. And if there's a gap right now, I think it's close to that metadata space, but it's really more of the organizational gap that I've been talking about and just how can we get

data teams

and their stakeholders on the same page.

And it becomes even more

relevant as we look to operationalize

the data for them, like like we're seeing in reverse CTO.

Well, thank you very much for taking the time today to join me and share the work that you're doing on Gruberu. It's definitely a very interesting project, and it's great to have an open source offering in the reverse ETL space. So I appreciate all the time and energy that you and your team have put into that, and I hope you enjoy the rest of your day. Yeah. Thank you. You too.

Listening. Don't forget to check out our other show, podcast.init@pythonpod

cast.com

to learn about the Python language, its community, and the innovative ways it is being used.

And visit the site of data engineering podcast.com

to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts at dataengineeringpodcast.com

with your story. And to help other people find the show, please leave a review on Itunes and tell your friends and coworkers.

Data Engineering Podcast

Summary

Announcements

Interview

Contact Info

Parting Question

Closing Announcements

Links