Summary
Analytical workloads require a well-engineered and well-maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading. In this episode, founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self-service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing DeVaris Brown and Ali Hamidi about Meroxa, a new platform as a service for data integration
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Meroxa and what motivated you to turn it into a business?
- What are the lessons that you learned from your time at Heroku which you are applying to your work on Meroxa?
- Who are your target users and what are your guiding principles for designing the platform interface?
- What are the common difficulties that engineers face in building and maintaining data infrastructure?
- There are a variety of platforms that offer solutions for managing data integration, or powering end-to-end analytics, or building machine learning pipelines. What are the shortcomings of those existing options that might lead someone to choose Meroxa?
- How is the Meroxa platform architected?
- What are some of the initial assumptions that you had which have been challenged as you proceed with implementation?
- What new capabilities does Meroxa bring to someone who uses it for integrating their application data?
- What are the growth options for organizations that get started with Meroxa?
- What are the core principles that you are focused on to allow for evolving your platform over the long run as the surrounding ecosystem continues to mature?
- When is Meroxa the wrong choice?
- What do you have planned for the future?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Meroxa
- Heroku
- Heroku Kafka
- Ascend
- StreamSets
- Nexus
- Kafka Connect
- Airflow
- Spark
- Change Data Capture
- Segment
- Rudderstack
- mParticle
- Debezium
- DBT
- Materialize
- Stitch Data
- Fivetran
- Elasticsearch
- gRPC
- GraphQL
- REST == REpresentational State Transfer
- Dagster/Elementl
- Prefect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Their comprehensive data-level security, auditing, and de-identification features eliminate the need for time-consuming manual processes, and their focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how they streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
That's I-M-M-U-T-A, and get a 14 day free trial. And when you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey. And today, I'm interviewing DeVaris Brown and Ali Hamidi about Meroxa, a new platform as a service for data integration. So, DeVaris, can you start by introducing yourself?
[00:01:54] Unknown:
Yeah. I'm DeVaris Brown. I am the CEO of Meroxa. I don't know what else to say, but, yeah, that's the quick intro right now.
[00:02:02] Unknown:
And, Ali, how about yourself?
[00:02:04] Unknown:
Yep. Hi. Ali Hamidi, CTO and cofounder of Meroxa.
[00:02:08] Unknown:
And, DeVaris, do you remember how you first got involved in the area of data management?
[00:02:12] Unknown:
So Ali and I, essentially, we worked at Heroku together, and we kept going on these, like, customer on-sites. And, you know, at the end of the day, we kept hearing the same information over and over again, where our customers were telling us, it's great that we can provision our web app with the click of a button. It's also great that we can provision the database or a Kafka cluster with the click of a button, but we also need you all to help orchestrate that. And so Ali and I kept hearing this probably 10, 15 times from some of the largest, most popular customers in the world, and we looked at each other, saw a twinkle in each other's eye. For me, it's more about solving problems for developers, and data has increasingly become more important to an organization.
So that is how I got my start. Ali?
[00:03:07] Unknown:
Yeah. So, I mean, most of my career, I've been much more focused on sort of the back end engineering and sort of in and around data. And so before joining Heroku, I worked at a relatively small targeted advertising startup. And there, it was all about managing sort of billions of social profiles and making use of that in real time. And so that was kind of the first intro into large scale data management. And then, obviously, once I joined Heroku, I was on the Heroku data team, mainly focused on Heroku Kafka and some of the other data products. And as DeVaris mentioned, you know, we attended all these customer meetings and kept hearing the same things. And that was kind of the motivation to start Meroxa.
[00:03:44] Unknown:
And so digging more into that, can you give a bit of a description about what it is that you're building at Meroxa, and what motivated you to turn that into a business, and what it is that's keeping you engaged?
[00:04:04] Unknown:
We took, like, a three-month journey to understand more of the landscape. I mean, I'm a product manager by title. So for me, it was one of those things where, you know, I knew we had the semblance of a great idea. I just wanted to know how pervasive this problem was. And so I talked to over a hundred or so data engineers, data scientists, data analysts to understand what tools they were using, what they liked, what they didn't like, and where they were spending their time during their day. And, essentially, what we saw was that most of the people were spending their time on just manual grunt work, getting random data, disparate data components, to integrate with each other. And so for me, if we could build a Heroku-like platform for real-time data, I mean, we could solve a huge problem for the masses. Right? Like, let's take a look at the data ecosystem now, and this is kinda like, alright, well, I've got a few million dollars and six months; I can, you know, contract somebody out to build this thing for me. And we just thought, hey, this should be something that should be approachable for anybody regardless of expertise and available resources. So for me, that's really the reason why we started to build this company. Ali can talk about kind of the underpinnings of it. But once we started with that premise, the architecture and the decisions that we had to make became more apparent and more clear.
[00:05:27] Unknown:
Yeah. So, essentially, what Meroxa is, is a managed data platform. And so it's a tool, or a suite of tools, that allows engineering teams to easily transport data in real time between various data sources. And then we add sort of additional functionality on top of that, which, you know, enables you to do transformations and expose sort of various endpoints and better leverage that data that you have. And then kind of sort of tear down the silos and enable flexibility in the data formatting.
[00:05:55] Unknown:
You mentioned that you both came from Heroku, which is very well known for being an easy on-ramp for developers to be able to get something from idea into production without necessarily having to understand all of the pieces of launching and managing servers and keeping them up to date and well maintained. And I'm wondering what the lessons are that you learned from your time there that you are applying to what you're building at Meroxa, and the way that you are thinking about what the user interfaces are and what the developer and end user experience should be?
[00:06:26] Unknown:
Developer experience is something that, you know, we've still got purple blood running through our veins. Right? And we wanna make sure that that same type of experience applies to the data services ecosystem. And so some of the lessons that we learned: developers first. Right? And so make sure that your developers are happy and it's easy for them to onboard. It's easy for them to grok the documents. It's easy for them to play around and quickly see the value of your platform. So internally, we say we measure our effectiveness in minutes, not months. Right? And that's just more of a reflection of what the current state is, right, where you have to take three to six months to get anything up on anybody else's platform.
But the other thing, you know, on the, excuse me, experience side is that we wanna be prescriptive but give you control. And that's something that we learned at Heroku, right, where a lot of people complained that Heroku was a black box and they didn't understand how certain things are done. But we've literally architected Meroxa to be open and adaptable and interoperable and transparent. And so that's something that, you know, we believe that, you know, for 80% of the people, Meroxa will work, you know, out of the box for you. Great. Right? Just point us at the data source. You'll get data flowing in real time. There's another 10% of the people that need to customize, you know, for whatever their data usage scenarios are. So we provide you levers to tune and configure whatever it is that you need for your environment.
And then lastly, we provide a very expressive CLI and API. Right? And so anything that we do internally, save for, like, billing and some of the proprietary stuff, like, we'll make available via the CLI or API. And so those are some of the things that I think we took away from Heroku that we definitely wanna replicate. Because, obviously, with Heroku, you know, regardless of who you are, I think, like, everybody at some point, probably 9 out of 10 engineers, have done a project on Heroku at some point, and we wanna do that for the data services ecosystem.
[00:08:34] Unknown:
From my point of view, sort of the main learning we're bringing over, apart from sort of the value of DevEx, which I think is, you know, super significant, is from being on the Heroku data team in particular. The Heroku data team managed millions of databases for tens of thousands of customers. At one point, the team was only eight engineers. And so the key learning there is really the focus on solid automation. Obviously, at that scale, you can't manually do anything on anybody's databases. And so there was a definite focus on really sophisticated and resilient automation. And so that's sort of something that I'm taking, you know, with me, bringing it into Meroxa, and being able to build out these sort of very effective automation tools for dealing with operations related to, you know, various data pipeline components and that kind of thing.
[00:09:20] Unknown:
And in terms of the product that you're building right now, what are your initial target users, and how are you using that persona and those particular needs as guiding principles for designing the overall interface to the platform and the capabilities that you want to be able to provide at launch?
[00:09:41] Unknown:
Yeah. I mean, right now, we're squarely focused on data engineers and data-aware engineers. Our persona is that person where, you know, you might be at a startup and you're just about to hire a data engineer. You're thinking about it. We wanna basically position ourselves as, like, hey, you don't need to go pay this person ungodly sums of money to do this job where they're gonna basically be stitching together data components for the next, like, three to six months. And so that's really who we're going after. We're targeting bottoms-up, because there are a ton of people, a ton of companies in the enterprise realm, you know, the Confluents, StreamSets, Nexlas, Ascends, all that type of stuff. But nobody really has the ability to be truly self-serve. And so that's really where our wedge is: you know, it doesn't take a lot of time for people to get connected and to get data flowing through our systems. And so that's why we're decidedly going smaller, and we wanna basically, you know, enable, like I said before, everybody, regardless of expertise and resources, to be able to have this, like, high-grade, enterprise-level, production data infrastructure at their disposal. And I think that's really, you know, our competitive advantage and really what is gonna differentiate us from any of the aforementioned companies that I talked about.
[00:11:02] Unknown:
In terms of, like, guiding principles, I think there are a few that DeVaris and I sort of settled on pretty early on in the creation of Meroxa. And so being able to focus on providing this sort of incredible developer experience, but not necessarily taking away control. That's kind of a key part of it. So providing that flexibility, allowing customers to really configure and fine tune the components for their use. No vendor lock-in. This is sort of a key part of gaining sort of credibility and trust from, you know, users. You know, we use open components in the sense of the APIs and standards that we adhere to. We sort of believe that customer data is the customer's data. And so always allowing customers to access it, no proprietary formats or anything like that. And then essentially committing to support and contribute to open source. And so I'm sure we'll get into architecture a little bit more. But, essentially, we have the product split into two parts. And the part that exists in the customer's sort of network boundary is entirely open source. So every component is either already open source or will be open sourced by us. And so those are sort of things that we committed to.
[00:12:09] Unknown:
And by committing to those open standards and open APIs, are there any constraints that that imposes in terms of how you're approaching the product, where it might have been easier or faster to get something functional if you had either built something in house from scratch or been able to customize the interface to optimize for your particular deployment environment, or any other limitations that that puts on you that would maybe make your life easier, but in the end, would not necessarily foster long-term adoption?
[00:12:44] Unknown:
Yeah. There are quite a few areas, you know, like that. In particular, one of the areas where we sort of had this issue is, you know, underneath the hood, we leverage Kafka and Kafka Connect and the Kafka ecosystem. So we leverage Kafka Connect quite heavily. And there are certain limitations related to Kafka that, you know, impose some additional burden on us. And, you know, if we're committed to supporting open source, that's something we want to continue to use. So our solution was, I guess, ultimately the obvious one, which was build something that addresses the need better, but also open source it. And so we've kind of skirted around having to make that decision, because we just built the thing that we needed to build and then, ultimately, you know, plan to open source it as well.
[00:13:28] Unknown:
As far as the challenges that engineers typically face in being able to build and maintain data infrastructure, what are some of the common challenges that you came across in your conversations with the folks that you were doing the on-sites with, and with the initial people that you were talking to as you build out and iterate on the Meroxa product?
[00:13:50] Unknown:
One thing that we found, which was pretty common, is there's sort of a clear divide between the teams that are responsible for data infrastructure and the teams that are building applications. You know, that's kind of the integration point where you want to take data that's being generated by a particular application, transform it in some way, and expose it to other applications. And so you're kind of crossing these team boundaries. Then you have, you know, a difference in expertise. If you're using Kafka, for example, many software engineers that are building back ends and APIs are very familiar with relational databases, but not so much with sort of streaming data paradigms and, you know, different semantics related to streaming and events. And then similarly, people who are building data pipelines aren't necessarily, you know, that concerned with sort of common practices and limitations imposed by, you know, people building APIs and applications. And so that was kind of a clear divide where there's often some friction. And that's kind of the area where we try to smooth things over and provide tooling to kind of address that particular obstacle. Another area is just complexity in general. If you're building something using, you know, a suite of open source tools, you have Airflow, you maybe have some Python scripts, you might have some Spark jobs running to do some transformation.
So the complexity of the infrastructure in the pipeline grows massively as you introduce more data sources, and, you know, inevitably, companies as they grow, you know, start introducing new data sources, new types of data. You have databases that are optimized for particular use cases, you know, whether it's graph or time series. And so you have this sort of combinatorial explosion of complexity where you're trying to get data from, you know, n types of databases into n types of targets. And you need to be experts in everything in order to do it effectively.
[00:15:35] Unknown:
That just becomes a difficult task to kind of rein in. To the point of data sources and data targets, are there any particular sort of common systems that you are targeting initially to be able to help scope your work that you think will be most widely beneficial? And how are you thinking about the types of data sources and data targets that you want to be able to support out of the box, versus just providing a means of somebody being able to pipe into a standard interface and then have the Meroxa platform take over the rest?
[00:16:12] Unknown:
The thing that we realized was, like, you know, your data is already in your database. Right? And it's sitting in a format where it's not being heavily used or leveraged because, you know, if I write a, you know, select star from orders or select star from users command, I'm essentially just getting the end result of many different actions. Right? And so what we realized is that by doing change data capture and streaming, a lot of the data that, you know, gets piped into other, you know, systems, like if I use a Segment or RudderStack or, you know, mParticle or something like that, you know, essentially, like, I'm instrumenting those events. But, you know, I have to take a huge amount of time to go do that. So we basically realized that, look, if we can do change data capture from your relational or NoSQL data stores, we can essentially get all of that granularity that you're getting with the event-based systems without you having to go re-instrument your app. And so out of the box, you know, we support the typical kind of relational and NoSQL data stores as sources of data and then API endpoints as well. And then for the sinks or the destinations, you know, we support, you know, the major cloud platforms and things like that. And so that's where we're at now. Right? Like, we wanna basically be able to get your data from your production data store in real time, and then be able to give you the flexibility to multiplex that data stream into, you know, whatever destinations that you may need. But that's where we're at today. And then in the future, we, you know, we'll look at different SaaS platforms and things like that, like Salesforce and Zendesk and, you know, Stripe and Shopify and some of the more popular ones. But we'll never be at a point where we're gonna, you know, have, like Ali mentioned, that combinatorial explosion of, like, you know, we have the Facebook Ads connector, the Salesforce connector. Right? Like, you know, we don't wanna get into that playground. Because I mean, I swear, man. It's like, you know, we have an internal, like, competition Slack channel. It's like literally every other day, there's some new company that gets funded for, oh, we can take data from your thousands of SaaS applications into thousands of destinations. Right? And it's like, you know, that's not really where we see the competitive advantage or how people actually really want and need to work.
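To make the contrast above concrete, here is a minimal sketch of what a change-data-capture event looks like compared with the end state a select would return. The field layout follows Debezium's usual before/after envelope, but the table, values, and timestamps are made up for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ChangeEvent mirrors the shape of a Debezium-style change record: instead
// of the end state you get from "select star from orders", each event
// carries the row before and after a single change, plus the operation type
// ("c" create, "u" update, "d" delete).
type ChangeEvent struct {
	Before map[string]interface{} `json:"before"`
	After  map[string]interface{} `json:"after"`
	Op     string                 `json:"op"`
	TsMs   int64                  `json:"ts_ms"`
}

func main() {
	// A hypothetical update to an "orders" row captured via CDC.
	raw := []byte(`{
		"before": {"id": 42, "status": "pending"},
		"after":  {"id": 42, "status": "shipped"},
		"op": "u",
		"ts_ms": 1604000000000
	}`)

	var ev ChangeEvent
	if err := json.Unmarshal(raw, &ev); err != nil {
		panic(err)
	}
	fmt.Printf("order %v went from %q to %q\n",
		ev.After["id"], ev.Before["status"], ev.After["status"])
}
```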
[00:18:31] Unknown:
Yeah. Our focus has really been to pick the connectors where we can add the most value. So the ones that are the hardest to implement well, the ones that are the most complex or, you know, least reliable or whatever the biggest obstacle is, the biggest challenge. And that's where we kind of focused our energy in order to make it very, very easy and very efficient. And so if you look at the first few connectors that we've launched, you know, CDC for Postgres, MySQL, and MongoDB, we do it in a way that's fairly transparent to the end user. And so as a customer, they can point the Meroxa platform at Postgres, and we figure out the best way to get data out of it, whether it's through logical replication and Debezium or, you know, some other mechanism, or degrading all the way down into, like, a polling interface. That's where we can add a ton of value because we make it so easy and seamless. And so like DeVaris said, we're not sort of interested in getting into that sort of connector arms race where you're just churning out a massive library of connectors. I think for the APIs that are relatively easy, if you're just hitting a REST API and pulling, you know, strings or JSON values out, then we don't really add that much value anyway. In which case, you know, you're welcome to use one of our standard API sort of endpoints. But in the areas where we can really add a ton of value, that's where we wanna focus our energy.
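As a rough sketch of how a CDC connector like the ones described here gets provisioned, this is what registering a Debezium Postgres connector against Kafka Connect's REST API can look like. The hostnames, credentials, and table names are placeholders, and the exact configuration keys vary between Debezium versions; Meroxa automates this step, so the sketch only illustrates what happens under the hood:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Connector definition for the Kafka Connect REST API. The class and most
	// keys are standard Debezium Postgres settings; hostnames, credentials,
	// and the server name are placeholders, and exact keys differ between
	// Debezium versions.
	payload := map[string]interface{}{
		"name": "orders-cdc",
		"config": map[string]string{
			"connector.class":      "io.debezium.connector.postgresql.PostgresConnector",
			"plugin.name":          "pgoutput",
			"database.hostname":    "db.example.internal",
			"database.port":        "5432",
			"database.user":        "replicator",
			"database.password":    "secret",
			"database.dbname":      "app",
			"database.server.name": "appdb",
			"table.include.list":   "public.orders",
		},
	}

	body, _ := json.Marshal(payload)
	// Kafka Connect exposes connector management as a REST API, typically on port 8083.
	resp, err := http.Post("http://connect.example.internal:8083/connectors",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("create connector:", resp.Status)
}
```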
[00:19:44] Unknown:
Digging a bit more into the Meroxa platform itself, you've mentioned the capability of being able to pull data from sources into destinations using things like Kafka Connect and change data capture. But I understand too that you also help with being able to manage things like administrative issues, such as upgrading a particular destination store from version A to version B, and being able to just make sure that the operational aspects of managing the data platform don't necessarily have to bubble up to the end user, and they don't have to be an expert in all of the systems that they're interacting with in order to be able to reap the benefit. So wondering if you can dig a bit more into what capabilities you're building into Meroxa in its entirety.
[00:20:33] Unknown:
Sure. So as I kind of alluded to earlier, Meroxa exists as two separate sort of key parts, one being the control plane, which is proprietary and will remain proprietary. That's our core IP. And essentially, that's the orchestration component. And so that's the part that reaches out into, you know, the customer's infrastructure, into the data plane, and, you know, pulls all the strings and hits all the APIs and sort of steers the infrastructure. So part of that is, you know, monitoring the different components and managing them from sort of an operations point of view. And so it could be things like updating versions. It could be things like addressing and remediating common problems.
You know, perhaps your disk volumes run out of space, so it could potentially automatically expand the volumes. It could restart services if they get into a state where they aren't responding, could restart underlying instances. It could do a number of things. And so essentially, the control plane has a framework for automating resolutions and remediations. And so over time, as we, you know, gain expertise and gain experience with more common problems related to different sort of data components, then we build out a library of remediations, and the control plane will kick in and fix things there. Beyond that, the data plane itself, which consists of Kafka, Kafka Connect, and a number of other supporting components, the control plane also manages that. And so we look at the state of the health of each connector, the state that it's in, throughput, metrics, all those sort of aspects, and figure out what to do in certain cases, whether it's, you know, we need to scale up, whether we need to change the configuration to optimize performance, and other parts of it. And so, yeah, essentially, the product is split into two parts: the control plane, which is the managed aspect, and then the data plane, which is a collection of open source tools. Either they're existing ones like Kafka and Kafka Connect, or they're tools that we're writing that we will open source, in terms of a Kafka proxy and some other things that we also include.
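A toy version of the kind of automated remediation described here, using only Kafka Connect's public REST endpoints. The Connect URL and connector name are placeholders, and Meroxa's actual control plane is proprietary, so this only illustrates the general pattern of polling status and restarting failed tasks:

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// connectorStatus matches the relevant parts of Kafka Connect's
// GET /connectors/{name}/status response.
type connectorStatus struct {
	Connector struct {
		State string `json:"state"`
	} `json:"connector"`
	Tasks []struct {
		ID    int    `json:"id"`
		State string `json:"state"`
	} `json:"tasks"`
}

func main() {
	base := "http://connect.example.internal:8083" // placeholder Connect endpoint
	name := "orders-cdc"                           // placeholder connector name

	for {
		resp, err := http.Get(base + "/connectors/" + name + "/status")
		if err != nil {
			fmt.Println("status check failed:", err)
			time.Sleep(30 * time.Second)
			continue
		}
		var st connectorStatus
		json.NewDecoder(resp.Body).Decode(&st)
		resp.Body.Close()

		// Restart any task that has failed; a fuller control plane would also
		// expand volumes, bump instance sizes, tune configuration, and so on.
		for _, t := range st.Tasks {
			if t.State == "FAILED" {
				url := fmt.Sprintf("%s/connectors/%s/tasks/%d/restart", base, name, t.ID)
				if _, err := http.Post(url, "application/json", nil); err != nil {
					fmt.Println("restart failed:", err)
				}
			}
		}
		time.Sleep(30 * time.Second)
	}
}
```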
[00:22:30] Unknown:
Today's episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning-based algorithms to detect errors and anomalies across your entire stack, which reduces the time it takes to detect and address outages and helps promote collaboration between data engineering, operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. And if you start a trial and install Datadog's agent, they'll send you a free t-shirt. And as far as the architecture of the Meroxa platform itself, how are you approaching the design to keep it maintainable for you and your team, as well as accessible to end users who don't necessarily have all of the expertise of how the different systems interact and how to integrate them effectively together, and they just want to be able to pull their data out of their Postgres database and land it into a data warehouse and write some SQL against it?
[00:23:35] Unknown:
So currently, we have two different deployment modes targeting slightly different customer sort of personas and use cases. The first one is sort of the shared model, where a customer comes to the platform, they swipe a credit card, and all of the data pipelining infrastructure runs within our account, within our infrastructure. And so at that point, they introduce their Postgres database. They tell us where it is. Our control plane reaches out and analyzes it. It figures out, you know, what we can do with it. And then at that point, through the APIs or the CLI, they can instruct the platform to, you know, retrieve the data and put it into a different data source. And we automatically provision all the data pipelines. The second deployment mode is more targeted towards customers with more sophisticated requirements, whether they're related to compliance or security or maybe performance and scale.
Essentially, our platform can drill into a customer's account and deploy the data plane within their network boundary. And that's super appealing for a number of reasons. In terms of security, the data never leaves the customer's network boundary. All of the transformations, the transport between the sources and the sinks, all of it occurs within the network boundary. And so that helps them, you know, maintain whatever regulatory compliance they had, or potentially achieve regulatory compliance based on their needs. And it also means that scaling is sort of within their control.
So if they're a larger customer that may have negotiated discounts with, you know, AWS, then they can continue to leverage those. Because as their usage scales, we spin up more resources within their account, and they benefit from those volume discounts and whatever they've negotiated.
[00:25:10] Unknown:
And so we're much more aligned with what the customer needs in terms of scaling at that point. And our control plane sort of reaches in from the outside and sort of controls and manages the data plane there. I'll be a little more controversial than Ali. Right? Like, this is really what the difference between us and pretty much everybody else on the market is. Right? Like, take a look at any of the ETL tools and all that type of stuff. Like, everybody stores their data in S3, and especially during this time of increased privacy and data protections, right, like, you know, that becomes a huge, huge issue, especially if somebody has a misconfigured bucket. Right? Like, that's a huge potential for a liability there. And so for us, you know, we have the advantage, if you're cloud native, of being able to deploy inside of your network boundary and not using any of the data that flows through our system. We can operate off of the shape of the data, but we don't necessarily need to see the bits and bytes. And I think that's something that, you know, for us, it's a big advantage. I mean, and that's what a lot of people are seeing. The interesting thing is, like, a lot of our nearest competitors or people that are kinda playing in this space, they always use Kafka as, like, a rate limiter or a buffer. Right? But they're really missing out on the full power of what Kafka can do outside of that. And they just basically dump everything into S3, and then, you know, I can point a data warehouse like Redshift or BigQuery or Snowflake towards those things, and then basically, like, you get data warehousing for free at that point. Alright? And so for us, it's just kind of like, well, you know, what if you wanted to do more with that data?
What if I wanted to create an API? Or what if I wanted to do, you know, real-time search indexing or any of that type of stuff? Like, I wouldn't be able to do that basically using S3 as a launching point. I would have to write some jobs that do some pre-aggregation and then put that into a data mart and then point all of these services at the data mart, and you're basically just wasting weeks or months of time. The stream has all of that information available. And so that's really where we see the advantage, and that's how we're making a bet on the architecture that we have.
[00:27:19] Unknown:
Going into this particular problem space and looking to build Meroxa, what were some of the initial assumptions that you had based on your existing experience and the conversations that you had had with your customers at Heroku? And how have they been challenged or updated as you have continued down the road of building this product and working with some of your initial customers?
[00:27:42] Unknown:
I think the main thing for us is that we see that our assumptions were right. I mean, Ali, has there been any scenario where we've just been like, yo, we need to do something completely different? I can't think of one.
[00:27:57] Unknown:
Yeah. I think we came into it fairly well prepared for what we were getting ourselves into. There's one area where I think our assumptions were not entirely wrong, but maybe we had higher expectations than what the reality was. And that was in the area of the quality of Kafka connectors. The perception I think the community has, and we definitely fell for this too, is the belief that there is this massive collection of really high quality, you know, expertly tested connectors, you know, used in production by large companies. And the reality of it is very different. There's a small number of really, really excellent open source projects, Debezium being one of them, a super solid Kafka Connect connector. And then there is a very long tail of assorted other connectors for other data sources and data sinks that vary drastically in quality, which I think is kind of what you'd expect from open source tooling in general. I think the perception of Kafka Connect and Kafka Connect connectors is that the open source ecosystem is filled with these amazing connectors, and they all work great. And, you know, that's what you should be using. I think perhaps that may have been the case before Confluent changed their licensing and sort of changed the licensing on some of their own built connectors. And so maybe those are rock solid and, you know, do totally solve everybody's problems easily. But, unfortunately, the licensing sort of prevents us from providing that as a product. And it also prevents other customers from using them for similar reasons.
[00:29:24] Unknown:
And so as far as the next steps for somebody who gets started with Meroxa, what are the capabilities as far as being able to expand outside of the Meroxa platform and bring some of those enhancements or new capabilities with them?
[00:29:40] Unknown:
So right now, we're all CLI and API, and that's mostly focused on the data engineer and the data-aware engineer. But as we move up the value chain, we'll enable, you know, data analysts and data scientists to create pipelines and manage pipelines and do transformations and write functions and all that type of stuff from a UI, from a visual perspective. In addition to more connectors and, you know, those types of things, like, we really see that as the kinda next stage of our journey that's gonna be able to provide the most value. And the real reason, like, we kinda came up with this is that, you know, we kinda look at this as a journey of a few acts. Right? And so the first part of it is building the real-time data catalog. So we enable the data engineers to build this catalog, you know, with all the permissions and, you know, all the stuff that they need. So now they can integrate components super easily, right, and manage and scale them extremely easily. Once you have that catalog, now from a UI perspective, if I'm a data analyst or data scientist, what we've heard in the past was, like, it would take anywhere from, you know, a couple hours to a couple months for them, you know, data analysts or data scientists, to get a pipeline built. Alright?
And so because we can operate on the shape of the data, we can provide that mocking and staging aspect to the data scientists or data analysts before they actually push this up to production. So you can drag and drop, create your pipelines, and then select the tables and fields that you need. And then, you know, you can click, you know, a button, essentially, that says go to production, and it basically just sends a request out to the data engineer that's managing the infrastructure. You know, Ali is selecting these tables, he wants access to this, yay or nay. And then they can go to production. So the data engineer technically doesn't have to do any work anymore. And so even in our, like, research, we saw that if we can provide the ability for data engineers or data-aware engineers to quickly integrate things and then, you know, self-serve, like, we can solve 80 to 90 percent of their workday. So now they can actually get down to, you know, doing things that add value, you know, feature support and all that type of stuff. So that's really kind of the, you know, near future as to what we're gonna be tackling.
[00:32:00] Unknown:
I think when it comes to integration with other sort of tooling, because we strongly adhere to the idea that, you know, we should use open standards and existing APIs, we're sort of well suited for integration with other tools. And so if you already use, you know, DBT on your data warehouse, then our goal is not to displace that. We can do some transformation on our side, but it's kind of a different focus. And so, yeah, absolutely use that. And, you know, I think that's very complementary. Similarly, if you use Materialize and you wanna generate, like, real-time materialized views, that's something that you can totally use, and you can integrate directly with the data pipelines that have been provisioned via Meroxa.
And so our goal is really to create these partnerships and add value by integrating with these other tools rather than necessarily displacing everyone. That's not really our goal. And, you know, our commitment to open source also means that, you know, our components can be used in other products. And similarly, that opens us up to pull in other open source tools to improve the platform as well. And so we're very much for the idea of, you know, ephemeralization. And if there is a tool that works better than something that we've built, then by all means, we should stop, you know, wasting our resources and our energy on something and use the better thing. And that sort of frees up our resources and our energy and our time to focus on areas where we can add more value.
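For the Materialize integration mentioned above, the relevant detail is that Materialize speaks the Postgres wire protocol, so a real-time view can be defined over a Kafka topic that a CDC pipeline populates. A minimal sketch, assuming a source named orders_src has already been defined over that topic; the connection details are placeholders, and the source DDL is omitted because it differs between Materialize versions:

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // Materialize speaks the Postgres wire protocol
)

func main() {
	// Placeholder connection string for a Materialize instance.
	db, err := sql.Open("postgres",
		"postgres://materialize@materialize.example.internal:6875/materialize?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Assumes a source named orders_src has already been defined over the
	// Kafka topic that the CDC pipeline populates; the DDL for defining
	// Kafka sources varies between Materialize versions, so it is omitted.
	_, err = db.Exec(`CREATE MATERIALIZED VIEW order_counts AS
		SELECT status, count(*) AS orders FROM orders_src GROUP BY status`)
	if err != nil {
		panic(err)
	}
	fmt.Println("created real-time view order_counts")
}
```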
[00:33:20] Unknown:
Yeah. I always say we're systems engineers first, software engineers second. Right? And that just comes with maturity. Right? Like, I think the big thing that Ali and I realized, and, you know, maybe this was seared into us from Heroku, is that, look, we just wanna build the right things and not have to build everything. Right? And so the reason why, you know, it took us, what, two months to get something of this sophistication going is that, number one, you know, not to discredit the absolute sheer skill and expertise of Ali, but, like, you know, with all those years of experience, he knows what to grab in his tool bag to get the job done, versus, like, oh, now we need to go write an event streaming platform or event streaming framework and all this other stuff. Right? Like, you know, those are some of the mistakes that I see people that are less mature making, because now they have resources and time to go do this thing. Our main thing is, you know, you asked about, like, what are the things that we learned from Heroku: it's always put the customer first and make sure that they're the most productive. Because if they're productive and they can clearly see value in your platform very quickly, then they're gonna be customers of yours for years to come. Even if, you know, there are outages or price changes or any of that type of stuff, they're still gonna be able to see the value. And I think us having that kind of shared DNA between us, you know, gives us the ability to know, like, hey, this is where we need to absolutely go build something from scratch, versus, look, we can just, you know, kinda augment whatever it is that's out there.
[00:34:51] Unknown:
I think that that core focus on open source and open standards and the extensibility of the platform is definitely something that is worth calling out again as a viable business strategy, to make sure that you are capable of taking the long-term view and being able to adapt as the surrounding ecosystem continues to evolve and mature. Because, you know, if you had built Meroxa 10 years ago, maybe your core focus would have been integrating with Hadoop and its ecosystem. Whereas now, in most conversations, that's an afterthought or not even a consideration, where it had previously been the be-all and end-all of being able to do large-scale data integration. So it's definitely worthwhile to take a look at what are the foundational principles, what are the core elements that are necessary,
[00:35:44] Unknown:
and then being able to swap everything out using shared interfaces. I think you totally called it. That's really, you know, a key consideration for us: rather than the shiny new, you know, hot sort of product, it's really, where can we add value? Where's the challenge right now? And that's where we wanna focus.
[00:36:02] Unknown:
From the people who you have worked with so far, are there any particularly interesting or unexpected or notable ways that they've been able to use the Meroxa platform, or ways that they've been able to spend the time that they've regained from not having to manage the underlying infrastructure?
[00:36:18] Unknown:
I would say the main thing is that everybody initially uses us for the same use case. Right? Which is, you know, we usually get people saying, like, oh, we wanna do change data capture to our data warehouse. Right? And they're basically looking to us to be the real-time replacement for, you know, Stitch or Fivetran or something like that. Right? And then once they realize, like, holy crap, that was so easy a caveman can do it, then they're like, wait. Can this data stream be pointed somewhere else? And we're like, yeah. What do you got? And so, you know, now for some of the people, like, we're doing real-time search indexing. Right? Because we can point that stream to their Algolia or Elastic endpoint that does indexing. Right? Or we can basically take the same stream and then point it to an API endpoint. So we auto-generate RESTful or gRPC or GraphQL API endpoints based off of that stream of data. Right? And they're like, oh, but now I'm basically using this to do, you know, kinda internal dashboards or basically, like, an app, whatever, you know, real-time analytics or something like that. And so, undoubtedly, it basically becomes, like, a moment of discovery or moment of clarity that all of our, you know, design partners have had, which is, man, that same stream can be multiplexed into multiple destinations, and that's really what the power of Meroxa is. I mean, it kinda goes back to an earlier point that I think was kind of made but kinda glossed over, which is that what we realized, and what Ali always says, is, like, you know, we're the, like, unified data backbone of the organization. Traditionally, you know, analytics and engineering operations have been separate. Right? Like, I have my analytics stack. It's usually some, like, commercial SaaS product that allows me to do, you know, kinda some consumer data platform, right, or a customer data platform. Right? There's a bunch of connectors, and that, you know, that's usually what my, you know, sales folks and marketing folks and those kind of folks use to do their job.
But then on the engineering side, you have another tool set, which is usually, you know, if you're in the enterprise, some off-the-shelf enterprise solution that's super clunky and the experience is bad, or, you know, you have an army of data engineers that goes out and kinda stitches things together from open source. And, you know, the two stacks, analytics and operations, are largely separate, to be honest. And it's really, you know, painstakingly difficult to integrate the two. And so, you know, that's really what's been our advantage: to do engineering or analytical operations or beyond analytics, it's literally the same toolset. That's
[00:38:56] Unknown:
the moment of discovery that pretty much every one of our design partners has had. Yeah. Our big bet has been that if we take data from all of these various silos and put them into a flexible sort of event stream, then that will enable a broader range of use cases and enable a lot more value, or enable customers to get a lot more value out of the data. And so typically, as a side effect of them trying to solve this particular use case, a relational database into a data warehouse, we introduce this concept of the unified data backbone where we kind of expose the stream of data. And now they can leverage it for other things. And that's been, you know, the case with almost everyone. There's this organic growth of, well, now I have it in this flexible format, I can do all these other things with it. And that's essentially the big bet that we've made: you know, let's get the data into this event stream, and then use that as a launching point for building new applications, enabling new use cases, and addressing the existing range of use cases as well. Yeah. I mean, from the same stream, you can do API generation.
[00:39:58] Unknown:
You can do disaster recovery backups. Right? And then you can ship that off to a data warehouse. Right? And so, like, your IT organization, your engineering organization, your analysts can all be made productive from the same exact stream of information without any changes to their app. And they're getting all this information in real time. And so, like, that's the power of this unified platform that we have.
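As an illustration of multiplexing the same change stream into several destinations, here is a small sketch of the search-indexing case: take one record off the stream and write it into an Elasticsearch index over its document API. The index name, endpoint, and record are made up, and a real pipeline would consume from Kafka rather than hard-coding an event; a warehouse loader or a generated API would simply be additional consumers of the same stream:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// indexDocument writes one record into Elasticsearch using its document API;
// "users" is a hypothetical index and the endpoint is a placeholder.
func indexDocument(doc []byte, id string) error {
	url := "http://search.example.internal:9200/users/_doc/" + id
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(doc))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	// One record off the change stream; in practice this would come from the
	// CDC pipeline rather than being hard-coded.
	record := []byte(`{"id": 7, "email": "ada@example.com", "plan": "pro"}`)

	// The same event can be fanned out to several destinations: a search
	// index here, and (not shown) a warehouse loader or generated API.
	if err := indexDocument(record, "7"); err != nil {
		fmt.Println("indexing failed:", err)
	}
}
```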
[00:40:25] Unknown:
For somebody who is considering Meroxa, what are the cases where it's the wrong choice? Never.
[00:40:33] Unknown:
Come on, man. Our internal joke is, like, yo, just sprinkle some Meroxa on it and everything will be much better. But I would just say, like, from a technical standpoint, there is no real reason why somebody shouldn't be able to use Meroxa. Alright? Where it gets nuanced, or the devil's in the details, is, like, you know, because we don't have our security compliance certifications like ISO and SOC 2 Type 2 and those types of things, right, we're not a great fit for heavily regulated industries. But if somebody's willing to, you know, deploy Meroxa's data plane inside of their network boundary, that's a way that we can get around that. Ali might have, you know, different opinions, but I don't think, you know, as a salesman, I don't think that there's a scenario that we can't tackle.
[00:41:19] Unknown:
Yeah. I think, like DeVaris mentioned, anything that's regulated, that's something we don't address very well right now. But that's obviously something that we're working on. Another area where we currently don't really have a great fit is if you have very, very sophisticated sort of data pipelines or, like, a very elaborate DAG where your data is being transformed, you know, in 50 different steps before it lands somewhere. The platform alone isn't really well suited for that. Similarly, if you have very elaborate sort of stream processing requirements where you want to do some very sophisticated, like, windowed aggregation with some transformation and that kind of stuff, currently we don't address that very well. We have future projects to address the stream processing use case with functions and be able to do more sophisticated stream processing. But right now, we don't natively support anything like that. That being said, because we use open standards, you can integrate, like, KSQL on your side, or Kafka Streams or Spark or whatever you want, and tap directly into that Kafka stream, if that's what you want. But on the platform alone, that's not something that we do right now. And so maybe if your requirements are very much focused on that, then we're probably not the best choice right now.
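For context on the windowed aggregation use case mentioned here, this is a toy tumbling-window counter fed by a channel; it stands in for the kind of stateful stream processing that KSQL or Kafka Streams would do against the same Kafka topics. No Kafka client is shown, and the event keys are made up:

```go
package main

import (
	"fmt"
	"time"
)

// countPerWindow is a toy tumbling-window aggregation: it counts events per
// key over fixed windows, emitting a snapshot of the counts when each window
// closes or when the input stream ends.
func countPerWindow(events <-chan string, window time.Duration, out chan<- map[string]int) {
	ticker := time.NewTicker(window)
	defer ticker.Stop()
	counts := map[string]int{}
	for {
		select {
		case key, ok := <-events:
			if !ok { // input finished: flush the last window and stop
				out <- counts
				close(out)
				return
			}
			counts[key]++
		case <-ticker.C: // window closed: emit and reset
			out <- counts
			counts = map[string]int{}
		}
	}
}

func main() {
	events := make(chan string)
	out := make(chan map[string]int, 1)
	go countPerWindow(events, time.Second, out)

	// A few made-up event keys standing in for records off a topic.
	for _, k := range []string{"orders", "orders", "users"} {
		events <- k
	}
	close(events)
	fmt.Println(<-out) // map[orders:2 users:1]
}
```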
[00:42:31] Unknown:
Yeah. But the other part of that too is, just like we said, we play nice in the sandbox. And so it's one of those things where, you know, if you've got an Airflow cluster or something like that, like, we can point our stream towards that. Right? And, like, you don't necessarily have to rip and replace any of your existing infrastructure. Right? And, like, that's another one of our product tenets, right, to be interoperable. And we wanna make sure that, you know, regardless of what data infrastructure you have in place, or what data components you have in place, we wanna make sure that Meroxa is present for at least some of that. We don't have to be everything, but there are places where we provide more value than others. And if you see that, hey, I need, you know, this complex DAG, you know, you can drop in Elementl or Airflow or Prefect or something like that in between, and, you know, then output your stream to something else. Right? Like, that's something that we're more than able and equipped to do.
[00:43:32] Unknown:
And are there any particularly interesting or unexpected or challenging lessons that you've learned in the process of building Meroxa, whether from the technical or the business and sales perspective?
[00:43:44] Unknown:
Oh, man. I'll talk about the business and sales side. Alright? So even though everybody has this, like, fallacy that product managers are the CEO of the organization, I'm gonna tell you right now, being a product manager does not prepare you for being a CEO of a startup. Right? And, you know, it's just all the little things. I mean, it gives you a good framework for understanding, for having empathy, and going and tackling problems, which is good. But at the end of the day, right, like, there isn't a playbook for going out and knowing how to go sell and knowing how to, like, close people and, like, all that type of stuff as a product manager, because it's just not something that we have to do. And so, you know, all of the assumptions that we had initially, you know, we've just had to iterate on those over time as far as, like, our ideal customer profile and, you know, what our go-to-market strategy should be and, like, all that type of stuff. And just being a product manager has helped me understand, like, what I have at my disposal. But at the end of the day, there's no substitute for just getting out there and trying and failing, or getting out there and trying to succeed, and all that type of stuff. So from a business and sales side, I mean, there's just a lot that, you know, we just didn't know.
And even from a company formation side, right, like, there's just so much that you just take for granted, even if you've been early at an organization or have basically been the Ali for a bunch of startups, right, been the first CTO or first technical hire. And it's just like things that you just don't get privy to when you're not having to deal with it day to day. So if anybody is gonna listen to this, the first piece of advice I could tell you is find you a great lawyer. That is the cheat code amongst all cheat codes. And we were fortunate to get one after some things happened, but lawyers make all of the difference. After that, I mean, everything is, you know, pretty much just trial and error.
[00:45:32] Unknown:
Yeah. Working at Heroku and having exposure to millions of databases and, you know, a huge number of customers meant that I was exposed to lots of, you know, bizarre usage and weird edge cases, but that doesn't ever stop. So even, you know, with Meroxa, I think as we start getting customer traffic, there are always new and sort of unique problems with data and data quality and data corruption, and things that, you know, you believe to be entirely impossible happen. And as you scale and you get more data, more customers, more usage, those small, very, very unlikely events happen all the time.
And so that's continually surprising, and I think that's something that will never really stop. The other thing which I wanted to kind of talk about, which is much more on the business side, is that these are highly unusual times, I think, in general, and it has made every aspect different. So hiring is really bizarre because, you know, there's, like, no in-person meetings. There's no in-person interviews. Even, you know, meeting with VCs has been weirder. I think every aspect of the business and business formation and just getting set up has been very odd because of the current times.
And the way people are thinking, when we talk about hiring, the opinions of people and sort of the comfort levels of moving to a new company, potentially a startup, is definitely influenced by, you know, what's going on right now. And I think people are mostly sort of hunkering down and trying not to make any drastic life changes. And so that's made hiring a little bit more challenging. But, you know, the flip side of that is that, you know, everyone is remote. And so us being a remote-first company, it means that, you know, everyone is a potential new hire, because, you know, everyone is accepting of remote. And so it's been a very unusual experience, obviously, with many downsides and many upsides as well.
[00:47:33] Unknown:
For people who are interested in Meroxa and want to get started with it, what are the options for being able to experiment and test it out? And what do you have planned for the future of Meroxa?
[00:47:52] Unknown:
To start using Meroxa, just email us at support@meroxa.com and somebody will help you out. We're basically white gloving people onto the platform right now. The platform will be available for self-service mid to late November, just in time for the holidays, and I think that's really when our general availability will be. Essentially, it'll be a public beta at that point. It'll have a significant amount of polish, and it'll have the UI experience. And then in the future, like Ali said, we're gonna be open sourcing some of our data plane components. Right now, we are creating a Kafka Connect replacement in Golang.
I'll let Ali talk about the motivations behind that, because it's really just an outcry from our customers and what we've seen as far as usage. In the future, we're going to be moving up the value chain and creating experiences for people who are not pro coders. How can we enable the masses to develop departmental apps, or those types of things that leverage real time data and that real time data catalog, but do it in a low code, no code way? That's down the line, but that's really what the playbook looks like for us. Yeah. Aside from that functionality and moving up the stack in terms of technical abilities,
[00:49:29] Unknown:
stream processing is another super interesting area that we plan to tackle at some point, as well as potentially creating a Terraform provider so that we can fit nicely within existing infrastructure as code pipelines. Going back to the Kafka Connect replacement, we've seen customers struggle with deploying Kafka Connect and managing those connectors, as well as building connectors. Kafka Connect is written in Java and runs on the JVM, and that comes with a certain amount of resource usage. So there is a good argument for writing an equivalent product in a much more resource efficient language, which allows you to pack connectors much more tightly. That makes sense for us as a managed service provider because we deal with a long tail of customers, and the tighter we can pack the connectors onto an instance, the better the financials work for us. Even as an open source user of the product, less resource usage means lower cost. We see Go as a heavily adopted and well supported language, so if we can make it very easy to write connectors in Go, I think that's a net win for the community in general.
The idea is to relax some of the constraints that Kafka and Kafka Connect impose and potentially build a better user experience: integrate a nice, feature rich UI, expose better ergonomics around transformations, those kinds of things. It gives us an opportunity to build a better experience in general and provide something to the community that we can use ourselves but that also adds immediate value.
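To make the shape of that more concrete, here is a minimal sketch of what a Go-native connector interface could look like. The Record struct, the Source interface, and its Open/Read/Teardown methods are hypothetical illustrations, not Meroxa's actual open source API; the point is only that a connector can be a small Go interface rather than a JVM-based Kafka Connect plugin.

```go
package main

import (
	"context"
	"fmt"
)

// Record is a single change event flowing through the pipeline.
// The field names here are hypothetical, not Meroxa's actual types.
type Record struct {
	Key     string
	Payload []byte
}

// Source is what a Go-native connector might implement instead of a
// Kafka Connect SourceTask: open a connection, emit records, close.
type Source interface {
	Open(ctx context.Context, config map[string]string) error
	Read(ctx context.Context) (Record, error)
	Teardown(ctx context.Context) error
}

// staticSource is a toy Source that emits a fixed set of records,
// standing in for a real CDC connector against a database.
type staticSource struct {
	records []Record
	pos     int
}

func (s *staticSource) Open(ctx context.Context, config map[string]string) error {
	s.records = []Record{
		{Key: "1", Payload: []byte(`{"op":"insert","id":1}`)},
		{Key: "2", Payload: []byte(`{"op":"update","id":1}`)},
	}
	return nil
}

func (s *staticSource) Read(ctx context.Context) (Record, error) {
	if s.pos >= len(s.records) {
		return Record{}, fmt.Errorf("no more records")
	}
	r := s.records[s.pos]
	s.pos++
	return r, nil
}

func (s *staticSource) Teardown(ctx context.Context) error { return nil }

func main() {
	ctx := context.Background()
	var src Source = &staticSource{}
	if err := src.Open(ctx, nil); err != nil {
		panic(err)
	}
	defer src.Teardown(ctx)

	// Drain the source, printing each record; a real runtime would hand
	// these to a destination connector or an intermediate log like Kafka.
	for {
		rec, err := src.Read(ctx)
		if err != nil {
			break
		}
		fmt.Printf("key=%s payload=%s\n", rec.Key, rec.Payload)
	}
}
```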
[00:51:08] Unknown:
Yeah. I definitely look forward to seeing that because I have heard a number of conversations about some of the challenges with Kafka Connect. For the most important question, where does the name come from?
[00:51:19] Unknown:
That's a very good question. Most people think it's just something made up, but everything with Meroxa has a purpose. Ali and I came from Salesforce slash Heroku, and at Salesforce they used to say all the time that data is the new oil. One night I was up watching Nat Geo or something like that, and the next show that came on was about the Dangote pipeline being built in Nigeria. It's going to be 1 of the largest refineries in the world, and 1 of its byproducts is kerosene.
The way that you remove impurities or sulfates from kerosene, which is jet fuel, is the Merox process. So our unofficial tagline is: if data is the new oil, we want to power the refinery. That's where the name Meroxa comes from. I tried to get merox.com, .io, .ai, and all of that was taken, so I just added an a to the end of it. But that's the origin of the name and what it means.
[00:52:25] Unknown:
It's a good backstory. Are there any other aspects of the work that you're doing at Meroxa or the overall space of data integration and data platforms that we didn't discuss that you'd like to cover before we close out the show?
[00:52:37] Unknown:
1 thing I wanted to highlight is that our vision for Meroxa is really to grow into a complete data platform, in the sense that we hope someday we'll be able to address all of the data needs of a business. Right now, we're starting with the area where we think there's the biggest opportunity, or the weakest existing product, and building something to address that space. But our vision is really to become a complete data platform where you can provision, manage, operate, create pipelines, address real time use cases, and pretty much do anything you'd want to do related to data.
[00:53:15] Unknown:
We gotta be careful because we have founders on our cap table, and we're not trying to get into the data visualization game by any means. I love those guys, so I'm not trying to go crashing into their territory. But the interesting thing is, they'll tell you themselves that the reason they were so successful in the early stages of Looker is that they had a huge forward deployed engineering organization. Basically, they would put an engineer on-site at their largest customers to do anything and everything to get data into a format that Looker could consume. So they'll tell you themselves: hey, if we had to start Looker over from the beginning, it would look a lot more like Meroxa than Looker does today.
That's really a feather in our cap, because we realized that to build any type of house, you have to have a good foundation. We've taken a pragmatic approach, but we feel like once we have this foundation in a stable state, it enables us to go and play in many different verticals and many different use cases, and we will undoubtedly be extremely successful.
[00:54:23] Unknown:
For anybody who does want to get in touch and follow along with the work that you're doing, I'll have you each add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Oh, man. There's so much.
[00:54:41] Unknown:
I would say the biggest gap that I see is... I mean, I don't know if I should say it, because that's really our secret sauce. Ali, how do you feel about that? Say whatever you want. For us, we realized the biggest gap isn't a component gap, it's an experience gap. That's the biggest thing people don't understand: especially in the data realm, people are less likely to be all in on 1 vendor, so interoperability and adaptability are at a premium.
Unfortunately, folks just really haven't focused on developer experience. For us, it's the combination of real time capabilities: change data capture in real time, real time stream processing, and managed real time infrastructure. You can take any 1 of those components and there are probably 20 well funded companies doing that single thing. And I haven't even talked about data governance and security and all the things that are already built into our platform. There are companies that do single point things very well, but they don't do all of those things together very well. And I think that's really the advantage that we have. It's like
[00:55:58] Unknown:
people think that we're probably boiling the ocean, but we see this as a platform play, and that's really where the biggest advantage is for us. The overall experience is clearly what we're doing and why we exist. But the other area, which I think is a pretty big gap right now, is data hygiene, or data quality. Data pipelines tend to be super brittle, and a big part of that is the data quality itself. It's hard to trace back the quality of the data, or the source of the data corruption, and unthread that and address it. So there's definitely room for someone to come in and build a really good tool that makes it easy to track and detect data corruption and provide data uniformity and data hygiene. That's an area we're super interested in and that I'd love for us to tackle in a good way. But right now, I see data quality as the biggest weakness in data engineering in general.
[00:57:03] Unknown:
Inside of our data plane, what are some of the things we've done, even in the early stages, to mitigate against bad data quality? Can you talk about the schema changes and the schema registry and all that? The platform includes a schema registry by default, and we update the schema registry based on the events that we see,
[00:57:24] Unknown:
and we keep track of that. But there's an even deeper problem. The schema registry and that kind of tooling only addresses schema changes. What I think is the weakness right now is where your schema hasn't changed, at least not intentionally. We've actually seen this: a particular field is an integer, and for some reason there's a string, because the upstream database is MongoDB and it doesn't enforce data types per record. That breaks the pipeline, and it's hard to get around. Essentially, you're expecting 1 thing and you get something else. It's not really a schema change, because it wasn't intentional, and it just happens to be 1 record out of 10,000,000,000 that has the problem. But there's real value in tooling that can detect that, put it to the side, and provide a UI that shows the customer: hey, we've found something unexpected.
What do you want to do with this? Do you want to correct it, carry out surgery on the record? Do you want to drop it? Do you want to flag it? Basically, having tooling around picking out that needle in a haystack and saying, oh, this is not what I expected, let me fix it and put it back in. I think there's definitely an opportunity there to build something better.
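As a rough illustration of the kind of check being described, here is a small Go sketch that compares each incoming record against an expected field-type map and sets mismatches aside instead of failing the pipeline. The schema map, the field names, and the quarantine handling are all hypothetical stand-ins, not Meroxa's implementation.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// expectedTypes is a toy "schema": field name -> expected JSON type.
// In a real pipeline this would come from a schema registry entry.
var expectedTypes = map[string]string{
	"id":    "number",
	"email": "string",
}

// jsonType reports the JSON type of a value decoded by encoding/json.
func jsonType(v interface{}) string {
	switch v.(type) {
	case float64:
		return "number"
	case string:
		return "string"
	case bool:
		return "bool"
	case nil:
		return "null"
	default:
		return "other"
	}
}

// validate returns a description of the first field whose runtime type
// disagrees with the expected schema, or "" if the record conforms.
func validate(rec map[string]interface{}) string {
	for field, want := range expectedTypes {
		if got, ok := rec[field]; ok && jsonType(got) != want {
			return fmt.Sprintf("field %q: want %s, got %s", field, want, jsonType(got))
		}
	}
	return ""
}

func main() {
	// One good record and one where "id" arrived as a string, the kind of
	// thing a schemaless upstream like MongoDB can produce.
	raw := [][]byte{
		[]byte(`{"id": 42, "email": "a@example.com"}`),
		[]byte(`{"id": "42", "email": "b@example.com"}`),
	}

	var quarantine []map[string]interface{}
	for _, b := range raw {
		var rec map[string]interface{}
		if err := json.Unmarshal(b, &rec); err != nil {
			continue
		}
		if problem := validate(rec); problem != "" {
			// Set the record aside instead of breaking the pipeline; a UI
			// could then offer to fix, drop, or flag it.
			fmt.Println("quarantined:", problem)
			quarantine = append(quarantine, rec)
			continue
		}
		fmt.Println("delivered:", rec)
	}
	fmt.Printf("%d record(s) held for review\n", len(quarantine))
}
```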
[00:58:30] Unknown:
Yeah. And that's the thing: we've talked about this self feeding system. I don't want to go around throwing out buzzwords like AI and ML, but as you can see, if these things continuously happen, we can feed that into a model, and that model can be widely used to say, hey, we saw this thing, we can immediately take some remediation steps and notify you as to why it happened. And nothing looks different to you other than your data arriving at the destination in the format that you needed and at the speed that you needed. Those are the types of things we can do because we control the ingest and we control where the data is going to be placed. And using an intermediate format like Kafka gives us the ability to do things like replays once we do find anomalous events like that. Everything that we've done has been extremely intentional, because we're providing, like I said, the foundation for us to do some really cool things in the future.
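For a sense of what a replay against Kafka can look like in practice, here is a minimal sketch using the segmentio/kafka-go client to rewind a single partition to an earlier point in time and re-consume from there. The broker address, topic name, and one-hour window are placeholders, and this is an assumed illustration of replaying from a log, not a description of Meroxa's internals.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Reader pinned to a single partition so the offset can be controlled
	// directly; the broker and topic names here are placeholders.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers:   []string{"localhost:9092"},
		Topic:     "cdc.orders",
		Partition: 0,
		MinBytes:  1,
		MaxBytes:  10e6,
	})
	defer r.Close()

	// Rewind to where the log was an hour ago and re-consume from there,
	// e.g. to re-deliver events after an anomaly was detected downstream.
	if err := r.SetOffsetAt(ctx, time.Now().Add(-1*time.Hour)); err != nil {
		panic(err)
	}

	for i := 0; i < 100; i++ {
		m, err := r.ReadMessage(ctx)
		if err != nil {
			break
		}
		fmt.Printf("replayed offset=%d key=%s\n", m.Offset, string(m.Key))
	}
}
```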
[00:59:35] Unknown:
Well, I appreciate you both taking the time today to join me and discuss the work that you've been doing so far and the visions that you have for it. It's definitely a very interesting platform and 1 that serves a very big need in the overall data ecosystem. So I appreciate all of the time and effort you've put into that and the work that you're doing to contribute back to the community. Thank you both for all of that, and I hope you enjoy the rest of your day.
[00:59:57] Unknown:
Yeah. Thank you very much, Tobias, for giving us the opportunity to be on your podcast.
[01:00:02] Unknown:
Yeah. Thanks for having us.
[01:00:09] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Meroxa and Guests
Backgrounds of DeVaris Brown and Ali Hamidi
Genesis of Meroxa
Understanding the Data Landscape
What is Meroxa?
Lessons from Heroku
Target Users and Personas
Guiding Principles and Open Standards
Challenges in Data Infrastructure
Data Sources and Targets
Capabilities of Meroxa
Deployment Modes
Initial Assumptions and Learnings
Expanding Meroxa's Capabilities
Unified Data Backbone
Use Cases and Benefits
When Meroxa is Not the Right Choice
Lessons Learned
Getting Started with Meroxa
Origin of the Name Meroxa
Future Vision and Data Platform Goals
Biggest Gaps in Data Management
Closing Remarks