Summary
Reverse ETL is a product category that evolved from the landscape of customer data platforms, with a number of companies offering their own implementations. While struggling with the work of automating data integration workflows with marketing, sales, and support tools, Brian Leonard discovered this need himself and turned it into the open source framework Grouparoo. In this episode he explains why he decided to turn these efforts into an open core business, how the platform is implemented, and the benefits of having an open source contender in the landscape of operational analytics products.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines, with the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Brian Leonard about Grouparoo, an open source framework for managing your reverse ETL pipelines
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Grouparoo is and the story behind it?
- What are the core requirements for building a reverse ETL system?
- What are the additional capabilities that users of the system ask for as they get more advanced in their usage?
- Who is your target user for Grouparoo and how does that influence your priorities on feature development and UX design?
- What are the benefits of building an open source core for a reverse ETL platform as compared to the other commercial options?
- Can you describe the architecture and implementation of the Grouparoo project?
- What are the additional systems that you have built to support the hosted offering?
- How have the design and goals of the project changed since you first started working on it?
- What is the workflow for getting Grouparoo deployed and set up with an initial pipeline?
- How does Grouparoo handle model and schema evolution and potential mismatch in the data warehouse and destination systems?
- What is the process for building a new integration and getting it included in the official list of plugins?
- What is your strategy/philosophy around which features are included in the open source vs. hosted/enterprise offerings?
- What are the most interesting, innovative, or unexpected ways that you have seen Grouparoo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grouparoo?
- When is Grouparoo the wrong choice?
- What do you have planned for the future of Grouparoo?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:19] Unknown:
Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Brian Leonard about Grouparoo, an open source framework for managing your reverse ETL pipelines. So, Brian, can you start by introducing yourself?
[00:02:07] Unknown:
Hi. Great to be here. Thanks for having me, Tobias. Brian Leonard. Prior to doing this data engineering, reverse ETL, open source work, I was the technical cofounder at TaskRabbit, a service to get things done in your house. Very different, though we did, of course, have data engineering needs there that gave us the inspiration for this project.
[00:02:28] Unknown:
And so, as you mentioned, working in data isn't your sort of long-term endeavor. I'm wondering if you can talk to how you got involved in the space.
[00:02:37] Unknown:
Yeah. I was the CTO and the head of the product and, you know, data team and the engineering team at TaskRabbit. And, you know, we had lots of people using the service, and we wanted to learn about them and create data products and analytics and all of that stuff. And so, you know, we worked on all of that infrastructure, in the end using Snowflake and Looker and a variety of machine learning tools, in particular to create a recommendation system for who is the best person for the job at hand. And somewhat inevitably, you know, after you get all of that going and you learn something really interesting, maybe the cohorts of who's likely to churn from the system or something like that, the marketing team asks to get that into Marketo or whatever email system we're using. And so I spent a good deal of time syncing our data to marketing, NPS tools, some of those kinds of things, things to send push messages, Zendesk support systems, and things like that. And I generally found that engineers didn't really like working on that very much, and they didn't really know what success looked like in particular. And marketing didn't really know about the engineering. And so we got inspired to find a way to make this so much easier and bring those organizations together so they can, you know, be more effective.
[00:04:04] Unknown:
And so that brings us to what you're doing at Grouparoo. I'm wondering if you can talk more about what it is that you're building there and some of the motivation behind deciding that this was the problem space that you wanted to spend your time and energy on. It's just more of what I was saying. Crazy things would happen. Like,
[00:04:21] Unknown:
on a Monday, I'd approve a $1,000,000 budget for our marketing team to move some important metric for us, maybe retention. How can we get people to do more stuff on our platform? And then on Thursday, they'd come and say, yeah, we're super excited. Let's do this. I'm like, yeah, let's do this. And then I'd be like, okay, great, now I need to sync all of this stuff from the product database, the last time somebody did a cleaning or whatever, into the marketing system. And I'd be like, I don't know. Like, I got my own stuff to do. Sorry. And then I'd show it to the engineer, and then they didn't quite prioritize it. And then the system was kind of janky when we did do it. And at the end of the quarter, I'd be like, hey, what happened? We didn't hit the goal. I gave you a $1,000,000. And they'd be like, what do you mean? Like, I didn't have any data to do anything, really. All I did was send the newsletter out and run more ads. How could I target those ads better, and how can I personalize those emails?
And this just happened several times; it took me a while to learn. And then, as I was thinking about where I could make the most impact on organizations when I was thinking about a new company, this idea just kept coming up. And it came up with my cofounders, who'd solved similar things. It came up with hundreds of other companies that we talked to when we did market research. Yeah, we decided to go after it. The interesting thing was, we talked to all these marketing people, and they said things like, this is a huge problem. And I'm like, I see you and I see your problems. They felt really gratified, and those were very empathetic conversations.
And then I was like, you want to use this? And they're like, yeah, we want to use this. I said, okay, great. You know, go get the password to your data warehouse. And then the conversation ended, because they're not the gatekeepers to that system. But when we started talking to the engineers, instead of asking, like the marketers, aren't engineers annoying because they never sync your data?, we said, aren't marketers annoying because they always want more of your data? And they said, yes. And guess what: they had the password, they could make a read-only user, and they were inclined to keep the data in their environment. And we started going down this open source path to be able to sync your data from your data warehouse into your sales, support, and marketing tools.
[00:06:36] Unknown:
One of the things about the data ecosystem is it seems that either everybody's talking to each other behind the scenes, or everybody just happens on the same sets of problems at the same time, where we get these explosions of different product categories. And in this case, it's reverse ETL, where we have some of the commercial offerings such as Hightouch and Census, and there's Grouparoo, which is the open source core aspect of it. And I'm wondering what the landscape of available tooling looked like at the time when you first started building the Grouparoo system, and some of the ways that the emergence of these commercial competitors has informed your product direction or the capabilities of the system that you wanted to build in. For sure.
[00:07:18] Unknown:
Yeah. I mean, I'm not one for solving unsolved problems. There's always nuances. Right? And so at the time, this wasn't a thing. And I really think it wasn't a thing. Sometimes you think something doesn't exist, but once you get into that space, there's a whole bunch of it. I think in this case, Hightouch and Census didn't exist, and certainly the term reverse ETL didn't exist. The thing that was closest to it at the time was what marketing people would call a CDP, customer data platform. And so you would send stuff, usually events, to tools like Segment and others, and they would relay those to others. That's the closest thing to the system that I knew of when we got started.
But these things just kind of... their moment has come, I think. I equate it to, like, a hierarchy of needs, like in psychology. Like, people have been investing in their foundations, Snowflake and BigQuery and stuff, for the last couple years, and then their analytics, and then their machine learning, and then this and this. And then at the very top, it's just like, okay, great, we've spent millions of dollars and 5 years, and all I have to show for it is these reports? Like, what's next? We're not done. We're never done. And so what's next? And it's operationalizing that data. And, you know, I think just the modern data stack and the time involved and all that investment has the people that are on the forefront of that looking to put that data back into use, which is a whole new fascinating set of problems, because the worst outcome before was a bad report.
Now the worst outcome, I don't even know. Like, sending it to the wrong place, a data breach, or, like, a hundred million wrong emails, like some companies do every now and then, things like that. Anyway, so its time has come, and there are some interesting new problems. How do those inform how we're thinking about it? The concept of the problem and solution is fairly straightforward. There's nuances. Sometimes I look to see what integrations they have, just to see if there's one I haven't heard of, and things like that. But in general, we're having customers
[00:09:30] Unknown:
and open source users drive what we're working on, and that's where most of our information is coming from. Yeah. It's definitely interesting to occasionally look at some of these integration platforms to see what the different sources and destinations are, because, yeah, there are inevitably tools or platforms or services that I have never even heard of. It's like, oh, what is that thing?
[00:09:50] Unknown:
Right. And so I'm like, oh, look, they have 32, and we have 28, or whatever it is right now. Like, what are those four? I've never heard of that, and no customer's ever asked me for it. So I guess I'm just not gonna worry about it right now, or something like that. And in terms of the
[00:10:07] Unknown:
sort of core concepts of reverse ETL, I'm wondering if you can talk to some of the baseline capabilities that are necessary to be able to build and run one of these systems, and some of the additional features and utilities that you have been adding into the system as your users have started to become more advanced in their usage.
[00:10:32] Unknown:
Yeah. I think at its very core, you basically have a table in your, you know, product database, or, sometimes more commonly depending on the organization, their data warehouse. And it's kind of like a fact table. I know people call these different things: a rollup table, like, customer ID, first name, last name, email, a whole bunch of stuff, lifetime value, I don't know, likelihood-to-churn percentage, all kinds of weird things you might come up with, number of actions they've taken, number of things they've favorited, case by case in that business. And, like, basically, we really wish this was in systems A, B, and C, commonly marketing, sales, and support. And then sometimes there's 8,000 different marketing tools, for example, you know, various nuances thereof. And so, like, great.
Make sure those are always in sync. If something new happens in this table, make sure it's reflected in the remote system. We call those the source and the destination. And so what's required just to make that happen? You know, generally, the ability to talk to that database and the ability to talk to the destination, and, as a baseline, some sort of data massaging in there, you know, around how that destination works. Dates are always represented in a different way, for example. And probably in the baseline, some notion of, like, rate limiting and things like that. They all have it, and it's the first thing that someone who builds this themselves figures out. They send three users, and they think they've finished a good sync system; then they hook it up to production, and it turns out you can only do 7 a second. Or there are some that are by day, which is even crazier: 10,000 a day, Central Time, midnight to midnight, like, all kinds of weird things. And then they turn off, and you have to be able to retry and, you know, in the end get to, what's the right word, like, synced status, I guess. There's some physics term that I'm thinking of, equilibrium or something like that.
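As a rough illustration of that retry-until-equilibrium loop, here is a minimal TypeScript sketch. The names are illustrative, not Grouparoo's internals, and the 7-requests-per-second limit is just the example from above:

```typescript
// Illustrative sketch only: export records to a destination that rate-limits,
// retrying whenever the API signals a limit, until everything is synced.

type ProfileRecord = { id: string; [key: string]: unknown };

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function exportAll(
  records: ProfileRecord[],
  send: (record: ProfileRecord) => Promise<void>, // destination API call
  requestsPerSecond = 7 // hypothetical limit; every destination differs
): Promise<void> {
  const minInterval = 1000 / requestsPerSecond;
  for (const record of records) {
    while (true) {
      try {
        await send(record);
        break; // synced; move on to the next record
      } catch (err: unknown) {
        const status = (err as { status?: number }).status;
        if (status === 429) {
          await sleep(60_000); // back off, then retry the same record
        } else {
          throw err; // real failure; surface it
        }
      }
    }
    await sleep(minInterval); // stay under the per-second limit
  }
}
```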
On top of that, you can start adding the ability to do more advanced queries. Of course, it's not just the one table. Something that we tend to do that the other ones don't do, driven by customer requirements, is handle the fact that not everybody has their act together, where there's just this one magical table and everything is clear. And it's not even always just one query you can do. And so we have the ability to kind of stitch together that record from many different tables, pulling your lifetime value from here and your user record from somewhere else, even a whole other source; we can kind of put those together and then sync that somewhere else. We've added segmentation on top of that, something that we heard people commonly wanted: to sync with sort of these groups, so to speak, the kangaroo-themed group-building tool that we have. So maybe you wanna tag your users in Zendesk if they're high value, so they get better service. Now what does high value mean? That's anyone that spent more than $200, or whatever. We add that on top of that. And then UIs to browse that is something we added, like, how is my data shaped? And we did that to try to fill that rift I was talking about between engineering and marketing, so they could actually agree on what the right data looks like and who's in those groups and that the numbers look right. You mentioned that when you were first
[00:14:00] Unknown:
trying to make people aware of the tool and get people to test it out, that your initial target was to talk to the marketing teams, and then you ended up talking to the engineers. And I'm wondering if you can speak to who the actual target users are for the Grouparoo platform and some of the ways that those personas have helped to inform the priorities in terms of feature development and the user experience design of the system.
[00:14:25] Unknown:
Yeah. Maybe all that's code for: it doesn't have to look pretty if it's engineers. I don't know. Yeah. So we started with the marketers, and then eventually, you know, got on this rising trend of data engineers, like, adding this capability into their organization. And so the open source product that we have is for engineers to solve data problems and do the syncing. And so you use the UI. It's very close to sort of dbt and its workflow, which has informed our same target audiences. We have some analytics engineers we're working with, for example, that are very comfortable in that area. You locally kind of come up with your configuration that defines your pipeline. We have a UI to do that, because it's super helpful to browse the, you know, fields available in Zendesk, for example, and you click, click, click; that creates a Git configuration that you check in, and you deploy that, and then, you know, on autopilot everything is syncing.
That's targeted at engineers. We have an enterprise product and a hosted product, so you can run it yourself or in your own cloud, or we'll host it for you, and that adds, on top of that, solutions to organizational problems. So, for example, maybe you wanna hand off exactly what gets synced, in a no-code kind of fashion, to those marketers that we were talking about before. You've defined it in a way, like maybe you did your LookML in Looker, something I've done in the past, but then people can make any dashboards they want from that. You can define your data schema but leave it up to the marketer what lifetime value means, $300, $200, etcetera, or any other groupings they can do. They're pretty comfortable with these segmentation tools.
And what actually gets synced to Mailchimp, we have people using that, and that's sort of the no-code, point-and-click kind of thing on top of those configurations.
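For a sense of what that checked-in, declarative configuration could look like, here is a hypothetical sketch; the shape and key names are illustrative, not Grouparoo's exact config schema:

```typescript
// Hypothetical pipeline definition of the kind you'd generate with the config
// UI and check into Git. Field names are illustrative only.

export const pipeline = {
  source: {
    type: "postgres",            // warehouse or product database
    options: { table: "users" }, // read-only credentials come from the env
    primaryKey: "id",
  },
  properties: [
    { key: "email", column: "email" },
    { key: "lifetime_value", column: "ltv" },
  ],
  group: {
    // e.g. the "high value" segment discussed earlier
    name: "high_value",
    rules: [{ property: "lifetime_value", op: "gt", value: 200 }],
  },
  destination: {
    type: "mailchimp",
    mapping: { EMAIL: "email", LTV: "lifetime_value" },
  },
};
```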
[00:16:21] Unknown:
When you're discussing the kind of visual element of being able to select which attributes in the target system you want to populate, I'm also interested in understanding what the other existing or potential capabilities are for being able to hook into a data catalog or a metadata system to be able to enumerate the available fields and tables in the source systems, to understand what the preexisting models are that fulfill these destination systems, or to be able to say, okay, these are the fields that I need from, you know, X, Y, and Z tables, possibly across multiple source systems, that I can then stitch into these destination records.
[00:17:03] Unknown:
We're finding, you know, across the whole landscape of possible users, that the more common scenario is that people don't know what the heck is in these columns, and, you know, they don't have their catalog situation together. And they're actually using Grouparoo to, like, define the single source of truth of what, say, lifetime value means, which is, like, for example, does it include returns or not? That's something that is, like, up for grabs when you're querying various databases and/or data tables in different ways. Now, I find it super interesting, and certainly our infrastructure allows it. Like, our Snowflake source, for example: you ask it, what tables do you have, and what columns do you have in each of those tables? And there's actually space in our thing for more meta information on that.
I find it super interesting to think about how we could point that at a sort of metadata management system and, like, filter what the users see, especially the marketing users, but even the data engineers, just so they use the right one. And, you know, then it's not just a column name called, you know, recent behavior score; it's, like, what does that actually mean? Not something we've done yet, but a super good idea. Yeah. It also brings in a lot of the complications
[00:18:28] Unknown:
of... then you start ending up in the space of saying, oh, well, I see all these source tables. Now I need to have a visual query builder and understand, when I'm coming from multiple systems, what the intermediate representation is going to look like so I can stitch that together. And then you're in a whole different product category that you probably don't even wanna think about.
[00:18:44] Unknown:
We do have a visual query builder of sorts. You can write your SQL, and there are certainly plenty of dorks that like to do that, myself included. But especially in the most common cases that we've seen, like querying a table and then either, you know, using exactly those values, so just one of those fact tables, or summing it up and filtering it a little bit, we have a query builder for that. But it's nice to have the fallback, for sure, to do anything you want.
[00:19:11] Unknown:
And so in terms of the actual architecture and implementation of the project, I'm wondering if you can talk to some of the ways that that has manifested and some of the technologies and systems that you rely on.
[00:19:24] Unknown:
Yeah. So, you know, both our hosted offering and the one people deploy in their own clouds are basically the same. We run on top of a database to store data. For example, you know, another thing that we talked about before as baseline, what we do, like, we can sync incrementally. If you've got a million records, we need to know what's changed since the last time we synced, you know, in our general storage, so as not to send the same thing that we already sent to these systems, especially when they're being rate limited. We store that in a Postgres database, and we use Redis for caching, basically, and some sort of background processing so we can run multiple threads and sort of keep everything, I don't know, dork words, mutexes and things like that, parallel. So when you deploy Grouparoo, you're running the code, which is in Node, on top of Postgres and Redis.
Locally, it doesn't have those requirements at all; the development environment falls back to, like, memory and SQLite. So there are even fewer dependencies when you're running it locally to get your config going. But, you know, once you have lots and lots of parallelism and records, we use those stores to run the production system.
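For a concrete picture of that local fallback, here is a hypothetical sketch of the environment-driven choice described above; the variable names and URL formats are illustrative, not Grouparoo's actual settings:

```typescript
// Illustrative only: production runs on Postgres and Redis, while local
// development drops back to SQLite and in-process background work.

const isProduction = process.env.NODE_ENV === "production";

export const backingServices = isProduction
  ? {
      database: process.env.DATABASE_URL, // Postgres: sync state, "what changed"
      queue: process.env.REDIS_URL,       // Redis: caching + background workers
    }
  : {
      database: "sqlite://./dev.db", // zero-dependency local default
      queue: "memory",               // in-process background tasks
    };
```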
[00:20:38] Unknown:
And then for the managed platform that you offer, what are some of the additional systems and architectural components that you've added to support those organizational requirements and some of the data governance capabilities?
[00:20:52] Unknown:
Yeah. So the stack is the same. But, you know, if you get the enterprise edition and our hosted offering, like, there are just more tables in Postgres around teams and, you know, things like that, such that, you know, the marketing team's allowed to change this, but not that, and this person is on that team, and such. In our cloud offering, there's a whole other layer on top of that, which is, like, how do we get many users with these instances and things like that? And for that, we use Terraform.
[00:21:22] Unknown:
As far as the evolution of the project, you mentioned that you first started iterating on this space when you were working at TaskRabbit and then decided to turn it into your own sort of dedicated endeavor. And I'm wondering, how have the design and goals of the project changed or evolved in that time? So we encountered the problem at TaskRabbit and solved the problem.
[00:21:43] Unknown:
I think Grouparoo as a project was fresh from that problem set and, you know, using our learnings, but we didn't do any of the code there. For the same reason, it'd be impossible to prioritize a generic, flexible, multi-destination, multi-source syncing system if you didn't have all of those needs at that moment. And, like, you're just basically trying to make marketing happy as soon as possible. This is sort of what happens if you had a really engaged engineering team that spent a year plus, you know, sort of bikeshedding, I guess, really solving all of those exact problems and doing them as well as they could possibly be done. What we got informed from that was really just this organizational gap and, like, really thinking about how we could fill it, and, in general, how could we take an integration that often took a month-plus time frame and make it a day, and, you know, what could we take from that? And the biggest thing is probably, if we wanna get really dorky about it, this concept, computer science term, idempotency, which I read about in your book a little bit too, so I know it's in there, which is really: how can we make it so that it's fault tolerant, such that if one or the other system is down, we can always basically recalculate the source of truth at any time and get things back in sync.
It's kind of this idea that's more and more popular. You know, Terraform has it, React has it, dbt has it. The way that that's enabled is through being declarative. And so, you know, we took that approach.
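A minimal sketch of what that declarative, idempotent approach implies, with illustrative names (not Grouparoo's API): recompute the desired state from the source of truth, diff it against the destination, and apply only the difference, so re-running a sync is harmless:

```typescript
// Sketch of idempotent reconciliation: running it twice in a row is a no-op,
// so a crashed or interrupted sync can simply be re-run.

type SyncRecord = { id: string; [key: string]: unknown };

async function reconcile(
  desired: Map<string, SyncRecord>, // recomputed from the warehouse query
  actual: Map<string, SyncRecord>,  // what the destination currently holds
  upsert: (r: SyncRecord) => Promise<void>,
  remove: (id: string) => Promise<void>
): Promise<void> {
  for (const [id, record] of desired) {
    const current = actual.get(id);
    if (!current || JSON.stringify(current) !== JSON.stringify(record)) {
      await upsert(record); // missing or stale in the destination
    }
  }
  for (const id of actual.keys()) {
    if (!desired.has(id)) await remove(id); // no longer in the source of truth
  }
}
```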
[00:23:24] Unknown:
StreamSets DataOps Platform is the world's first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor, and manage data pipelines confidently with an end-to-end data integration platform that's built for constant change. Amp up your productivity with an easy-to-navigate interface and hundreds of prebuilt connectors, and get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you're up and running, your smart data pipelines are resilient to data drift, those ongoing and unexpected changes in schema, semantics, and infrastructure.
Finally, one single pane of glass for operating and monitoring all of your data pipelines, the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners that subscribe to StreamSets' Professional Tier will receive 2 months free after their first month. And so, in order to go from, I have this problem, I need to be able to start getting data into these various marketing and sales and support systems, to, I found Grouparoo, I want to get it set up and start writing pipelines, I'm wondering if you can talk to that overall workflow of going from, I've got this idea, to, I've got this running in production.
[00:24:42] Unknown:
Yeah. We've got some good quotes. People say it's super fast. It's certainly well within a day's work to do that, if not within an hour's work. Basically, you do an install of our tool and say new project, then you launch this UI config. You install Grouparoo, you say grouparoo init, grouparoo config, then you put in your source credentials, essentially, often a read-only user that'll end up being, like, an env variable by the time we deploy this to production. You pick the tables that you wanna sync or write the query, and then do the same thing for your destination, and all of that's to generate the declarative config of what the pipeline is. One of the things we found that was super important in this space was, like, this notion of sample records. So we have that built into our developer tool. So, like, the first time you know if this is working or not, is it with all million people in your pipeline and you've accidentally deleted your whole sales team's whatever?
No, hopefully. Instead, you say: I know about user 32; what does that look like, essentially? Click and say import the data; like, okay, this looks right, this is what I want to sync. Click export, and it goes over to Mailchimp and Zendesk or whatever you've configured. You can look at it over there to make sure it looks right. You can, you know, sit with your counterpart in the marketing or support organization. Does this look right? In general, build up the confidence. And then with that same configuration, when you say, like, grouparoo start, it runs that in multiple threads. And you could do that locally, which most people do the first time just to make sure. And then when that's deployed, it's, you know, running forever and just always keeping things in sync. The biggest hurdle isn't usually that development thing. It's, like, what's your server situation?
How do you deploy things and things like that? That's what inspired us to say, okay, great, we'll do that for you if you want as well. And you just give us your configuration, and we'll do all of the AWS bits, so to speak.
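Put together, the local workflow described above looks roughly like this on the command line (command names as mentioned in the conversation; consult the Grouparoo docs for the exact syntax and flags):

```bash
grouparoo init my-project   # scaffold a new project
grouparoo config            # launch the local config UI, pick tables and fields
grouparoo start             # run the sync locally before deploying
```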
[00:26:49] Unknown:
As far as the changes in source and destination schemas and being able to enumerate the available fields in these various downstream systems, I'm wondering what your process is for being able to manage the discovery in these destination systems and then also being able to manage any potential mismatches or schema drift between source and destination.
[00:27:14] Unknown:
We haven't fully tackled, like, I don't know, you changed your column from fname to first_name. Right? Like, anytime you do something like that in your data warehouse, there's a list of ramifications, you know, in your BI system and your this and your that, and Grouparoo is certainly on that list. It's pretty aggressive to decide autonomously to change what data is getting synced. Maybe it's not a rename; it's hard to even know it was a rename, frankly. But what we do have is tools to get it right the first time and, you know, tools to make it easy to make that change. They'll probably exist in parallel, in my experience, for a little bit. And so when you're using that developer tool, like, you just have a drop-down of the columns and, like, you pick one. Right? And so, like, in general, that gets rid of typos and things like that, and it gets written to your config. And then it can even be PR reviewed, if you wanna do that, of course, when it gets checked in. So there are some eyes on that, and it's unlikely that we got it wrong to begin with. We want to manage the migration, of course, but, you know, we tend to get it right all the time when we're using that. On the destination, it's the same, but more complicated. There's no... you know, all of these databases have, like, an information schema or something that's super reliable.
Every destination is different and weird around understanding the fields that are built in and the custom fields, but we end up with that same experience, where, if you're using Mailchimp, for example, you see the ones that are built in, email, fname, lname, address, for example, and one of them is required: email, for example. But then if you've added your custom ones, you see those in the drop-down too. There are other destinations that are just kind of willy-nilly, where, like, you just send whatever you want, and we have ways to do that as well.
[00:29:07] Unknown:
And for people who are running Grouparoo and have identified some new downstream system that they wanna be able to send data to that isn't already part of the available list of integrations, what's the process of actually starting a new plugin project and going through the development cycle to build against the Grouparoo internals and API and be able to test against this new downstream system? And what are the requirements for getting that new plugin integrated into the list of available integrations on the Grouparoo site?
[00:29:42] Unknown:
Yeah. Great question. I think this is one of the examples of where I'm most excited about open source. Right? Because, like, two things. A, you see these lists of companies that, especially US-based, SaaS companies integrate with. And, like, they seem to top out at the 100 to 200 range. Like, there's just only so many that you can deal with and that's worth it to your business. We have people all over the world using all kinds of systems. There's a Mailchimp of Vietnam and a Mailchimp of Brazil that, you know, US companies tend to never get to. And then the larger the organization, the more likely that you've got this crazy internal system that obviously, you know, we would never be able to integrate with. And so we're just super excited to facilitate custom integrations, whether they get checked into our code monorepo or, you know, they're your internal thing. There are all kinds of internal support systems out there, for example, that we've met. And so the process for that is basically, you know, the goal was that you do the 1% of work for all these incremental things, and the system does the 90%. And so you basically implement the things we've talked about, which is: what fields does this thing have, and what are their data types? Each of those are different. And the other is the primary one, which is, like, okay.
Here's a record set; here's what it was before, and here's what it is now, just in case those changes are really interesting. Like, make it so in the destination system. We do that in Node. We had to pick a language that did well with asynchronous communication and that a lot of people knew. So we went with JavaScript and TypeScript. We've even had analytics engineers contribute. A big gap we had, actually, that got filled by one of our customers, because we just didn't get around to it, was Airtable. An analytics engineer writes an Airtable plugin, which is now done, and now everyone can use it, which is great. The big part of that, basically, in my experience having done 30 of these, is the testing system that we have. And so you've got your credentials on that, and, like, we've got patterns that make it easy. Like, alright, here: add somebody.
Change their data. Remove them. You know, sort of patterns in place such that once you run this script and everything works, we've found it to be fairly reliable in production.
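In sketch form, the plugin surface he describes amounts to two responsibilities; the interface and method names below are illustrative, not Grouparoo's actual plugin API:

```typescript
// A rough sketch of what a destination plugin implements, per the description
// above. All names here are hypothetical.

interface DestinationField {
  key: string;                                  // e.g. "EMAIL" in Mailchimp
  type: "string" | "email" | "date" | "number"; // each destination differs
  required: boolean;
}

interface DestinationPlugin {
  // 1. What fields (built-in and custom) does this destination have,
  //    and what are their data types?
  describeFields(credentials: Record<string, string>): Promise<DestinationField[]>;

  // 2. "Here's what the record was before, and here's what it is now;
  //    make it so in the destination system." The old/new pair matters when,
  //    say, a group membership was removed and needs to be untagged remotely.
  exportRecord(args: {
    oldRecord: Record<string, unknown> | null; // null on first export
    newRecord: Record<string, unknown>;
    credentials: Record<string, string>;
  }): Promise<void>;
}
```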
[00:32:08] Unknown:
To your point about the fact that a large number of the platforms people typically interact with are very US- and Anglo-centric, I'm wondering, because Grouparoo is open source and does have this global audience, how that factors into any localization and internationalization work that you've had to do on the core capabilities of Grouparoo.
[00:32:31] Unknown:
Yeah. That's one of those things where, you know, you're supposed to do it, but it's hard to prioritize in the beginning. And that's still the phase that we're at. So, you know, I went through that as we launched TaskRabbit in many, many countries. And, you know, it's a deep, real investment to get right. And so all the error messages are in English, and, you know, there's interpolation and, like, all of the things that people tend to do at the beginning of these projects. And then we know it, but we decide to clean it up later. I'm super happy to talk to anybody that wants to help in that effort, but it's not currently a priority for us. In terms of your strategy or philosophy
[00:33:10] Unknown:
around the dividing line between open source and enterprise, I'm wondering what the guidelines are for which features to put in which edition, and how you have worked to understand what best practices are in other similar product categories, and your own evolution of working with the community to understand what expectations the open source users and the paying users have as to which features are available in which distribution.
[00:33:43] Unknown:
The best practice that I've seen, and that we came up with ourselves and saw in others, HashiCorp was one that I know we looked at, is that you have to have a fairly succinct way of saying what's in the, say, open core and in the enterprise edition. And so you have to be able to have that philosophy and apply it. We came up with the thing I hinted at earlier, which is: the open source version is for engineers solving data problems, and the enterprise edition is for, you know, companies and organizations solving organizational problems. And so all of the sources and destinations and that core syncing engine and all of that is part of the open core.
And then on top of that, user rights management. You know? Some of the things that sound like enterprise, we still put in core, just because, like, why not? But, like, single sign-on with Okta or something like that, for example, is in the enterprise edition. I think it'd be a real discussion when we add a data dictionary or something like that: is that for engineers solving data problems, or is that for organizations solving organizational problems, things like that? The point and click, definitely in the enterprise edition, things like that. And so that philosophy is super important, because in the early times before we had that, it was, like, kind of an exhausting experience, frankly, trying to draw these lines.
And once you draw them, especially on the open source side, you really don't wanna change that decision.
[00:35:10] Unknown:
In your work of building the Grouparoo technology and organization around it and working with the end users of the system, what are some of the most interesting or innovative or unexpected ways that you've seen it applied? A couple. The thing I really didn't know that much about was
[00:35:26] Unknown:
some of these salespeople workflows, which are very complicated, and the transitioning between leads and contacts and opportunities and all these sorts of things. And I think that was just something I didn't have a lot of visibility into that's super nuanced in some organizations and super high value if you can get it right, because you're focusing on talking to the right people, which leads to the revenue. And so our Salesforce plugin, for example, is definitely the one that's evolving the most, based on interesting requirements in that space.
We thought a lot about the interplay between the marketing and the data teams, but the interplay between the data and the DevOps teams has been more interesting, especially as, I mean, I think I saw this at TaskRabbit as well, as we're creating data products, like that recommendation service and others that I was talking about, like, how are we standing these up and things? This is something a lot of organizations are going through. And so in general, we've spent a lot more time helping organizations with their Terraform and their Helm charts and their this and their that and all these other things than I think we probably expected.
[00:36:33] Unknown:
In your own experience of building the platform and working with end users and working with the open source community, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:36:45] Unknown:
I think that you should never underestimate the assumptions of what someone else's data warehouse is gonna look like. It's just always bespoke. And they don't all have primary keys, for example, which is something I always assumed. And when we ask for a primary key and then you pick one, it's a column called ID. And then you're trying to troubleshoot with someone, like, what's going on here? And the data, it doesn't look exactly right. And we're on, like, a pairing session kind of thing. And it's just like, what do you mean there's two people with that ID? Like, I just didn't see it coming, frankly.
And, oh, this is more like a table of customers, but there's a row for each time they bought something. Okay. Well, I would have called that table purchase events or something like that, for example. And just in general, that's super interesting. And so, especially when we're working with people contributing to the project, something that I've experienced recently is, like, okay, well, they solved their problem. But guess what? They didn't have any date fields or something like that. And so, like, what does it take to get that over the line? Are they willing to do that after they've solved their own problem, because they're a good open source citizen? Or do they have stuff to do? They probably have stuff to do. And so how do we work with them to get that over the line? Another example of that same root situation.
[00:38:10] Unknown:
And so for people who are looking for a solution to manage synchronizing their various data sources into their sales, marketing, and support systems, what are the cases where Grouparoo is the wrong choice, and they're either better suited writing their own internal tooling or looking to one of the other commercial vendors or some other solution?
[00:38:30] Unknown:
I think the main one I would say at this point for Grouparoo, you know, we're evolving as the use cases come up, is that there are other systems, CDPs and even other reverse ETL tools, that have focused more on the sort of event-driven architectures. We'll certainly happily sync anything to anywhere. But in general, if you don't have anything like that, and you really just wanna get events to a few different systems, like, Segment's probably the right call. Like, they're gonna handle the intake of the events better, and they've spent 10 years, like, optimizing that workflow. And even if you have a table of events right now somehow in your data warehouse, because you're using a system that writes it there or you're writing them in from Segment or whatever.
In general, we haven't prioritized syncing those events to, say, Mixpanel as highly as we have sort of account-driven, company-driven, human-user-driven, like, those kinds of data models, as much as other ones have. So, for example, right now we sync profiles to Mixpanel, because people are using that for marketing and other things, but we're not currently syncing events. So for events, one of those other ones might be better. And as you continue to build out and evolve the Grouparoo platform, what are some of the things you have planned for the near to medium term? Yeah. So one great thing about open source is our roadmap is public, and we're, you know, sort of requesting comments and helping our user base drive those.
The thing we're working on now is being efficient in the syncing situation and adding destinations, sort of the near-term roadmap. I think the really exciting things that we might see in the midterm are stuff that we can do on top of the data once we have it. And so there's a whole bunch of interesting use cases that we've heard from users: compliance sorts of things, data quality sorts of things, organizational use cases, like, how could we make attribution better, for example, is one of the organizational ones. So we're starting to look into, now that we have a normalized and well-defined dataset across many tools, like, how can we start solving more of those organizational nuances and GDPR compliance and things like that? Are there any other aspects of the Grouparoo project itself or the overall space of reverse ETL or some of the community elements involved in your business that we didn't discuss yet that you'd like to cover before we close out the show? I think there are just a few trends that are super interesting in the whole space, as you bring it up. Definitely one of the trends is what people are calling the modern data stack, which in general is an unbundling into best-of-breed tools used with the data warehouse at the center, Grouparoo fulfilling one of those buckets and newly evolving.
I think that's super interesting. The other one in that same space, especially as more of these things are impacting users directly, customers and end users, so to speak, is this, I don't know, software development practices being applied to this space more and more: you know, pull requests, and things are checked into Git, and, you know, checked-in notebooks and configurations and deployments and all of that sort of stuff. The best practices that the product groups have been using for a while being applied to this, you know, I think that's something that we're leaning a lot into, and we're seeing a lot that makes all of this more reliable as it becomes a key product that the business is using. And so we're just super excited to be a part of that, and we threw this conference called the Open Source Data Stack in September, and we're gonna be doing more on that. We're actively looking for people to get involved with that, if they wanna speak and present and attend; that's at opensourcedatastack.com, where we're showcasing how all of these things fit together. We had several partners in that conference.
[00:42:41] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I think there's a lot
[00:42:58] Unknown:
of technological gaps that are in the process of being filled by a lot of great companies. Reverse ETL, obviously. Data quality, obviously. A lot of companies that are helping with the DevOps gaps that I'm seeing, being able to productionize all of these things. And so those are all gaps that I see being filled. If there's a gap right now, I think it's close to that metadata space, as we talked about, but it's really more of the organizational gap that I've been talking about, and just how can we get data teams and their stakeholders on the same page.
And it becomes even more relevant as we look to operationalize the data for them, like we're seeing in reverse ETL.
[00:43:47] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing on Grouparoo. It's definitely a very interesting project, and it's great to have an open source offering in the reverse ETL space. So I appreciate all the time and energy that you and your team have put into that, and I hope you enjoy the rest of your day. Yeah. Thank you. You too.
[00:44:10] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Brian Leonard and Grouparoo
Brian's Journey into Data Engineering
Challenges in Data Syncing and Marketing Collaboration
Emergence of Reverse ETL and Grouparoo's Position
Core Concepts and Features of Reverse ETL
Target Users and Use Cases for Grouparoo
Architecture and Implementation of Grouparoo
Managing Schema Changes and Integrations
Localization and Internationalization Challenges
Open Source vs Enterprise Features
Unexpected Applications and Lessons Learned
Future Plans for Grouparoo
Trends in the Data Management Space
Biggest Gaps in Data Management Tooling