Summary
Any business that wants to understand its operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. In order to reduce the time and effort required to build pipelines that power critical insights, Manish Jethani co-founded Hevo Data. In this episode he shares his journey from building a consumer product to launching a data pipeline service and how his frustrations as a product owner have informed his work at Hevo Data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Manish Jethani about Hevo Data’s experiences navigating the modern data stack and the role of ELT in data workflows
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Hevo Data is and the story behind it?
- What is the core problem that you are trying to solve with the Hevo platform?
- What are the target personas of who will bring Hevo into a company and who will be using/interacting with it for their day-to-day?
- What are some of the lessons that you learned building a product that relied on data to function which you have carried into your work at Hevo, providing the utilities that enable other businesses and products?
- There are numerous commercial and open source options for collecting, transforming, and integrating data. What are the differentiating features of Hevo?
- What are your views on the benefits of a vertically integrated platform for data flows in the world of the disaggregated "modern data stack"?
- Can you describe how the Hevo platform is implemented?
- What are some of the optimizations that you have invested in to support the aggregate load from your customers?
- The predominant pattern in recent years for collecting and processing data is ELT. In your work at Hevo, what are some of the nuances and exceptions to that "best practice" that you have encountered?
- How have you factored those learnings back into the product?
- Mechanics of schema mapping
- Edge cases that require human intervention
- How to surface those in a timely fashion
- What is the process for onboarding onto the Hevo platform?
- Once an organization has adopted Hevo, can you describe the workflow of building/maintaining/evolving data pipelines?
- What are the most interesting, innovative, or unexpected ways that you have seen Hevo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hevo?
- When is Hevo the wrong choice?
- What do you have planned for the future of Hevo?
Contact Info
- @ManishJethani on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Sifflet: ![Sifflet](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/z-fy2Hbs.png) Sifflet is a Full Data Stack Observability platform acting as an overseeing layer to the Data Stack, ensuring that data is reliable from ingestion to consumption. Whether the data is in transit or at rest, Sifflet is able to detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. We also offer a 2-week free trial. Go to [dataengineeringpodcast.com/sifflet](https://www.dataengineeringpodcast.com/sifflet) to find out more.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health statuses. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet.
Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today. That's S-I-F-F-L-E-T. Your host is Tobias Macey, and today I'm interviewing Manish Jethani about Hevo Data's experiences navigating the modern data stack and the role of ELT in data workflows. So Manish, can you start by introducing yourself?
[00:02:01] Unknown:
Thanks, Tobias, for having me on your show. It's a pleasure being part of this discussion today. So my name is Manish Jethani, and I'm cofounder and CEO of Hevo Data. We deal with data pipeline as a service, and we are close to 4 and a half years old as a company.
[00:02:19] Unknown:
And do you remember how you first got started working in data? Oh, yes. Unconventional
[00:02:23] Unknown:
background coming into the data space. I was running a consumer Internet startup as a founder and CEO, and it was a venture backed startup. And I personally experienced a lot of problems, rather, having to make the decisions and the data not being accessible. And I found it very, very frustrating because, as an individual, I am someone who values scientific decision making a lot. And you cannot have scientific decision making without having the facts and the information. And it so turned out that data was, kind of, the building block for you to actually be able to make scientific decisions. And I explored the solutions around the space. I couldn't find anything interesting.
And before I could actually internally try and solve that problem, we got acquired. And at the company who acquired us, I was heading product over there. I saw a very similar nature of problem over there itself. And it so turned out that I got so close to, or rather I was so frustrated with, the problem that as a second time entrepreneur, I picked this as a problem to deep dive into and solve.
[00:03:31] Unknown:
So in terms of what you're building at Hevo Data, can you give a bit more detail on what the product is, what specific problem area you're focusing on, and some of the story behind how you decided that that was a business that you wanted to spend your time and energy on?
[00:03:45] Unknown:
If you see, most organizations have multiple departments, and each department has its own business software that they are using. In some cases, it's built internally by the engineering team, while in other cases, it's a third party software that they are using. And as an executive, you want to understand what's happening in your business, be it at the company level or at a department level. So it's super hard for you to get a complete picture of what's happening unless you can get all the data at 1 place. And it so turns out that with cloud data warehouses, storing and analyzing the data on the cloud has become much easier.
You don't have to worry about all the infrastructure pieces. But getting the data into the warehouse was a problem that I personally experienced as a big problem. And the number of systems through which you had to pull in the data was also very large. So we ended up building a fully automated data pipeline as a solution. And the whole focus has been on how do we really simplify to a point where the technical barrier to use the data comes down.
[00:04:53] Unknown:
In terms of the types of users that you're focused on, I'm interested in understanding kind of the personas that you are trying to address and how that informs the way that you design the platform, the capabilities that you build into it, and some of the prioritization that you have as far as understanding what direction to take your product?
[00:05:14] Unknown:
So we've seen 2 groups of users from the companies being the entry point for us. User group 1 is the central data teams, which comprise engineers, business analysts, folks who are working with the data. And they are centrally responsible for making sure that the data is available to everyone who wants to consume it within the organization. That's 1 group. The second group that we've started to see very recently has been the data ops teams integrated within the line of business. So sales teams or marketing teams do have their own set of analysts who are trying to solve specific problems for that department, which are not being centrally solved. Or sometimes some of these companies don't even have a central data team. So that kind of becomes the entry point for departments to go and solve their own problems. So these are the 2 user groups that we see as an entry point within the organization who will start to use Hevo.
[00:06:18] Unknown:
In terms of the experiences that you had working in your previous role and the challenges that you had as far as being able to get all the data that you needed into the warehouse to be able to understand what was happening in the business and where you wanted to focus your efforts, what are some of the lessons that you learned from going through that process and building a product where you were trying to be data driven, and some of the ways that you think about the requirements of Hevo Data to be able to solve those problems that you were experiencing?
[00:06:51] Unknown:
So 1 of the key learnings that we had was that if the effort required, or the complexity that needs to be handled, for people to get access to data is going to be large, you will see people defaulting to intuition based decisions. So 1 of the key factors for us when we got started on the journey to build Hevo was making it super simple. I think that was just the whole premise of it, that it should be so simple, so intuitive that it does not require someone to really have deep technical expertise to be able to operate this system. And that has been kind of the guiding principle because, like, this whole concept of data pipeline or ETL, as it was called earlier, is not a new concept.
There have been, like, multiple companies over the last 2 decades that have been trying to solve the same problem. Our differentiated point of view on this was that the number of companies who really need to leverage data to make decisions is much, much larger than the number of data engineers that are going to be available. So in order to bridge that gap, the form factor of the technology had to be in a form which allowed people with just the basic understanding of what the data is and where it is to be able to operate the system. So that is something that was the initiating point, and that has always remained to be the core of everything that we do.
[00:08:17] Unknown:
As far as the approach of being a single solution for being able to get data from the sources into the warehouse and then move it on into destinations, I'm curious what you see as being the benefits and some of the challenging aspects of being able to own that entire flow, particularly given the current landscape of more focused kind of point solutions for different stages of the data life cycle?
[00:08:44] Unknown:
I think it varies depending on the type of customers that you're trying to serve. If you go deep into enterprises, like the Fortune 500, their requirements on each part of this entire value chain of the data are very specific. Whereas if you look at companies with, let's say, anywhere less than 2,000 employees, they typically want a more unified and integrated solution, because with all these point solutions comes the complexity of having to manage all these different solutions. And when something goes wrong, you don't know where to look. So the majority of the market really doesn't have the kind of technical expertise to be able to do that. Now, I mean, in the entire modern data stack, we are at a super early stage of the adoption curve. So if you see, Snowflake has got some 7,000 customers now. Just from a benchmarking perspective, there are, like, a few 100,000 companies who typically need to use data to make decisions. Right? But whereas Snowflake, which is 1 of the largest players in this space, has got just 7,000 customers. So there is, like, a whole range of companies who are there. And our focus is on how do we really simplify to a point where all those companies are able to access the technology and solve their business problems.
[00:10:08] Unknown:
And to that point of Snowflake only having 7,000 customers and there being, you know, a long tail of other customers and requirements and use cases, there's definitely still a large installation of what some might call legacy data warehouse systems where you actually have physical appliances in either a data center that an organization owns or in a colocation facility or even using a virtual appliance that's deployed to a cloud environment. And I'm wondering how that influences the ways that you think about the interfaces and the integrations that you're focused on supporting in Hevo.
[00:10:47] Unknown:
Yeah. So for us, the segment of the market that we've decided not to go after is all the companies who have legacy systems, where they have an on prem setup and a part of their data sources or destinations are in a data center. That's the segment of the market that we've decided not to pursue. The set of companies that we go after are digitally native businesses who have their cloud warehouse set up or are trying to set up a cloud data warehouse, and their data is fragmented across different systems for us to bring together.
[00:11:24] Unknown:
Given the number of different companies that are trying to compete in the space of data integration and data movement and the, you know, large and growing data ecosystem, particularly in the cloud native space, what are some of the ways that you're thinking about that competitive landscape and the differentiating factors that you're focused on with Hevo?
[00:11:47] Unknown:
So the first thing is that we are fully automated and the most simple solution to use in the market. So the customers who end up signing up with us, they have evaluated a bunch of other solutions available in the market. Right? So any customer who decides to go with Hevo, they typically evaluate 2 or 3 different competing solutions in the market. 1 very clear feedback that we get from the customers is that the amount of simplicity that we focused on in terms of getting them up and running very quickly is better than anyone else. The second aspect is the way we think about owning the scope of a problem between us and the customers.
The way we look at ourselves is as software as a service which will own, or take the accountability for, getting your data from all different sources into your warehouse. And if anything goes wrong, we are there to take care of it. You don't even have to worry about it. So the entire observability instrumentation that we built at our end to be able to proactively detect the problems is something that users really, really love. Whereas the majority of the solutions available in the market are, hey, here is a tool, you go and figure out how you are going to connect your sources and integrate different systems into your warehouse. And if something goes wrong, you've got to figure it out on your own. So the level of assurance that a customer gets when they work with us is another level altogether. And consequently, you would see that we are rated highest on G2 Crowd reviews.
And if you go through the reviews, you'll find it out that this is the aspect that customers really love about us.
[00:13:32] Unknown:
In terms of the actual platform itself, can you describe a bit about the user experience around it and some of the ways that you have built and implemented the platform?
[00:13:42] Unknown:
We are architected on a real time streaming architecture. We use Kafka because we understand that for various types of different use cases, sometimes customers are okay getting data in an hour, but there are times when they do need data within minutes of that data getting generated. So the architecture is designed in a way that it supports streaming, wherein we can deliver near real time data into the warehouse. Now today, not all the data warehouses can actually support that streaming insertion, but we are starting to see a new set of storage layers coming in which are designed for the real time use cases.
And, also, we are horizontally scalable, in the sense that the volume of the data that will come or flow through the pipeline is not very predictable. You may have certain times of the day when there is a certain spike in the data that is getting generated at the customer end. But the latency that the customer expects needs to be nearly constant. Right? So you will have certain businesses who will have peak order volumes in a certain hour of the day, and there'll be huge amounts of data that'll be generated. But they want constant time, in which case the infrastructure has to be elastic enough so that it auto scales and allows the higher throughput. And when the volume of the data comes down, it should automatically scale down to take care of the cost aspect of it. So these are some of the core principles around which the entire product is architected. And the third aspect is that we are completely cloud agnostic. So we are available on AWS. If your sources and destinations are in AWS, you can use our AWS instance.
If you are on a Google Cloud, you could use the Google instance as well.
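To make the elasticity point concrete, here is a minimal Python sketch of the kind of lag-driven scaling rule being described, using a streaming backlog (for example, Kafka consumer-group lag) as the signal. The threshold, worker bounds, and function name are illustrative assumptions rather than Hevo's actual scheduler.

```python
import math

def desired_workers(consumer_lag, lag_per_worker=50_000,
                    min_workers=1, max_workers=20):
    """Choose how many ingestion workers to run so that the backlog
    (e.g. Kafka consumer-group lag) stays bounded per worker."""
    needed = math.ceil(consumer_lag / lag_per_worker) if consumer_lag else min_workers
    return max(min_workers, min(max_workers, needed))

# A burst at peak hours raises the lag, so the pipeline scales out...
print(desired_workers(consumer_lag=600_000))   # -> 12
# ...and scales back in once the backlog drains, to control cost.
print(desired_workers(consumer_lag=20_000))    # -> 1
```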
[00:15:32] Unknown:
Because of the fact that you are managing the infrastructure for your customers and depending on the number of customers that you have, you are going to have highly variable traffic patterns where some customers might have, you know, a constant steady flow of data. Others are going to be very bursty, where maybe at noon they, you know, go from a few hundred messages an hour to a million messages an hour. And just wondering what are some of the optimizations that you've had to build into the platform to be able to support that aggregate load and be able to maybe anticipate some of the heterogeneity in the traffic patterns?
[00:16:10] Unknown:
Yeah. I think this is 1 of the very, very key aspects where customers, when they've evaluated us against some of the other solutions, they found that we tend to perform better when it comes to stress testing. So typically, when customers evaluate, they try to do the stress testing, where they will just go and update, like, all of their tables and see what the time is it takes for the data to land into the warehouse. Now keeping the factor into account that this is a real business scenario, what we've done is build a lot of instrumentation around it so that we proactively detect the throughput for each customer and each of their pipelines.
And the moment we see that there is a lag starting to appear, we auto scale the resources for that particular customer depending on the pricing plan that they are on. So the enterprise customers really want the SLA guarantees, in which case they get higher priority with the available resources. And if even then the latency is starting to increase, then the system auto scales and adds more resources to that particular customer's environment, and then it scales up. And even within their own sources, customers can prioritize at a pipeline level or even at a particular table level that this is higher priority for me. And so even if there is going to be some throttling, those particular segments of the data are never going to slow down. So we've provided all kinds of controls to the customer. And on our side, we make sure that we are automatically able to scale infrastructure to be able to meet the business SLA from the customer side. Another interesting element of
[00:17:51] Unknown:
the way that you've designed the platform is you mentioned focusing on automation as a way to reduce the burden on the operators of the system. And automation is 1 of those terms that is very overloaded and can mean very different things to different people. So wondering if we can dig into that a bit and talk to some of the ways that that automation manifests in the experience of the person using the platform.
[00:18:14] Unknown:
Let me just explain, like, in the context of ETL and the pipeline, what we are automating at a very, very fundamental level. Right? Because automation, as you rightly said, can mean various different things in various different contexts. So if you look at data pipelines or ETL as a category, it's almost like kind of a living organism. Right? So your pipelines are not something where you have a fixed set of data that will never change, and you can define your configurations and it will automatically work. The general nature of this entire system is that your input is going to constantly evolve, but your output has to be nearly constant. Which means that as an application developer, I might have made certain changes to my tables. I may have added a few columns. I may have added new tables. I may have changed certain data types. And now all these changes have to have an impact on the downstream systems.
Earlier, what used to happen is that the tools used to send notifications or alerts to the customer saying that, hey, this has changed. Now you go and change the entire mapping logic that you've configured in your system for it to work. So when we talk about automation, it is about a few things. The first thing is the schema management. So if there is going to be any deviation from the schema, we automatically infer what that deviation is and what's the right set of actions that needs to happen. Right? So if you are going to add new tables or new columns, we can very easily have those added to your destinations.
Or you can also, as a user, configure that for this particular source, whenever there are new tables, automatically create corresponding tables in the destination. Or you could say, notify me, and then I will make a decision whether I want to ingest that data or not. So from that perspective, you can set the configuration once and not have to worry about all those schema changes breaking your pipeline. Because pipelines breaking is a real, real problem, and that kind of keeps data engineers anxious that at any point in time something can go wrong. The second aspect is around the auto scaling, that suddenly you have huge volumes of data getting generated. And you don't want to be, like, monitoring it manually and then going and scaling and figuring out how you make sure that the data that is available in the warehouse is the most recent data. So that is the second aspect of the automation: how do you monitor the throughput and make sure that the throughput is actually in line with the business SLA.
The third aspect is when something is going wrong. Right? So, typically, on a given day, if you're ingesting a certain volume of data and on 1 specific day the volume is low, it's kind of an anomaly. And then if the system can detect it and identify whether it is an anomaly or not and alert the user, then the user can decide whether something has gone wrong at their end or it's just a regular business deviation.
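As a minimal sketch of how such a volume anomaly check might work, here is an illustrative rolling-baseline heuristic in Python. The function name, window length, and z-score threshold are assumptions for illustration, not Hevo's actual detection logic.

```python
from statistics import mean, stdev

def is_volume_anomaly(daily_counts, todays_count, z_threshold=3.0):
    """Flag today's ingested row count as anomalous if it deviates
    sharply from the trailing baseline (hypothetical heuristic)."""
    if len(daily_counts) < 7:
        return False  # not enough history to judge
    baseline = mean(daily_counts)
    spread = stdev(daily_counts)
    if spread == 0:
        return todays_count != baseline
    z = abs(todays_count - baseline) / spread
    return z > z_threshold

# Example: a week of roughly 1M rows/day, then a sudden drop to 120k.
history = [1_020_000, 980_000, 1_050_000, 990_000, 1_010_000, 1_000_000, 970_000]
if is_volume_anomaly(history, 120_000):
    print("Alert the user: ingestion volume looks anomalous")
```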
[00:21:27] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder. As far as the actual work of doing the transformation and data integration, there are, again, a number of different patterns that have arisen over different generations of technology with ELT being the most prevalent 1 now.
And another interesting element of this space is the question of the sources and destinations that you're able to work with. And I'm wondering if you can talk to how you've approached the design and implementation to be able to handle both the kind of common majority of systems that people are trying to integrate with, but also being able to support the long tail of custom or bespoke platforms and applications that people are trying to pull data from. And the method that you're able to expose for being able to manage the transformations and do it in a way that you're reducing the burden on the engineer, but again, still being able to provide that escape hatch for a very customized logic and processing for those situations where it's needed.
[00:23:03] Unknown:
So actually, 2 or 3 points over here. The first 1 is the practice around ETL versus ELT. That whole concept of a best practice, about ELT being the right solution, I think it takes things further than what the reality is. What we've seen is that roughly about 2 thirds of our customers actually do some level of transformation before the data loads into the warehouse. And that's a big number. Like, out of every 3 users, 2 of them use some lightweight transformation before the data lands into the warehouse. Like, in technology, at a very principled level, there are no perfect solutions. There are always trade offs.
And it depends on the context, on the use case that you're trying to solve. So we've seen that whenever the data is very unstructured, right, so let's say if you are moving data from MongoDB or S3 files or FTP files, there you need to restructure the data in a certain way so that it is easily consumable for the analytics teams. So you may want to flatten it out before it lands into the warehouse. The reason users want to do this is so that they can get data which is easily consumable. That is point number 1. Point number 2 is that you also want good performance with respect to the query and the cost. So if you are going to do a lot of transformations at query time within the warehouse, it's a problem. So the better alternative in certain situations is to apply that set of transformations and then load the data into the warehouse, so that it is easily understandable by the end user who is going to consume that data, and you get better performance in terms of the time to get the results and also the cost.
The second aspect is around the control that you want in terms of what data should land into the warehouse and what data should not, because you don't necessarily need all types of data in the warehouse. For example, let's say you want to mask certain information which should not be available to everyone within the organization. So you might want to apply some lightweight transformation, which is like an in flight transformation as we call it, before the data loads into the warehouse. And we've seen that when customers start to use a solution, they may not instantly realize that these are the scenarios that they will come across.
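To make the in-flight transformation idea concrete, here is a minimal Python sketch of the kind of flatten-and-mask step being described: flatten a nested document (as you might get from MongoDB) and hash a sensitive field before it reaches the warehouse. The function names and the masking scheme are illustrative assumptions, not Hevo's actual transformation API.

```python
import hashlib

def flatten(record, parent_key="", sep="_"):
    """Flatten a nested dict (e.g. a MongoDB document) into flat columns."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

def mask(value):
    """Irreversibly mask a sensitive value before it reaches the warehouse."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:16]

def transform(record, masked_fields=("email",)):
    flat = flatten(record)
    for field in masked_fields:
        if field in flat:
            flat[field] = mask(flat[field])
    return flat

event = {"user": {"id": 42, "address": {"city": "Pune"}}, "email": "jane@example.com"}
print(transform(event))
# -> {'user_id': 42, 'user_address_city': 'Pune', 'email': '<16-char hash>'}
```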
But as they go along in that journey and as they evolve their use cases, they come across these problems, in which case we are flexible. The platform is flexible in order to be able to cater to those use cases. And it is not left to the user to go and figure out an alternative path for loading that set of data into the warehouse, where they have to go and build their own custom stuff. The second aspect, as you mentioned, is around the sources, because there are only so many sources any player can build, and there could be, like, a whole long tail of sources that customers may want to bring data from. On the destination side, you have a very finite number of destinations that the customers are using. But on the sources side, we see a huge long tail as well. So our principle around that has been that all the popular sources we want to build, own, control, and manage.
Whereas there may be certain sources where it may not make logical sense for us to put them on our road map because they are long tail. In those cases, we very recently released an SDK, which is in a private preview, where the users can configure, on their own, any specific sources that they have. They can configure it to bring the data and push it into Hevo. And for this, they really don't have to know a lot of details about what goes on in the ETL. It's more of a configuration based input. So you would put in certain inputs around the API. The system asks, on a UI, what the token is and what the parameters are.
And you can configure it on your UI, and then what you get is a connector at the end of it all. So it kind of handles both situations. Your key important sources, we take care of, and we have a team which is going to constantly monitor, improve, and build. But if there is a long tail, you don't have to worry about having to build everything custom. You can use the framework that we provide to build your own integrations. And we are also working towards onboarding certain partners. So if your team does not have the bandwidth or does not have the capability to be able to configure it, we will have partners who will configure it for you in a span of 1 to 2 days.
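As a rough illustration of what such a configuration-driven, long-tail connector could look like, here is a hypothetical Python sketch: a declarative description of a REST source plus a small incremental fetch helper. The field names, URL, and placeholder token are purely illustrative assumptions and do not reflect Hevo's actual SDK.

```python
import requests

# Hypothetical, configuration-style description of a long-tail REST source.
source_config = {
    "name": "internal_billing_api",
    "base_url": "https://billing.example.com/api/v1/invoices",
    "auth_header": {"Authorization": "Bearer <token>"},  # placeholder token
    "params": {"page_size": 100},
    "cursor_field": "updated_at",   # used for incremental pulls
}

def fetch_records(config, since):
    """Pull records newer than `since` from the configured source."""
    params = dict(config["params"], **{config["cursor_field"] + "_gt": since})
    response = requests.get(config["base_url"],
                            headers=config["auth_header"],
                            params=params,
                            timeout=30)
    response.raise_for_status()
    return response.json()
```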
[00:27:40] Unknown:
Another interesting element of the automation and the platform capabilities that you're focused on is being able to manage the schema mapping and schema evolution, which is 1 of the perennial challenges of dealing with data integration and ETL. And I'm wondering if you can talk to some of the mechanics of how you manage that in the platform and some of the edge cases that you are still having to deal with where it requires a human getting involved?
[00:28:10] Unknown:
I think we've done extensive work in terms of handling the schema. So now, thankfully, we are at a stage where in nearly, you could say, 98, 99% of the scenarios, human intervention is not required, because now we've dealt with thousands of customers and tens of thousands of their pipelines. So we've come a long way in terms of encountering all those edge cases and being able to take care of them. So the way it works is that we have a schema registry where we keep a record of the last set of data that we loaded: what the data looked like, what the data types were, how the structure of the data that came from different sources was. And every time there is a new set of data, there is a comparison against the last known schema, and then we identify what the delta is. And then we have a certain set of rules which tells us, for this set of changes, what is the desired action that needs to happen. Because we also take a configuration input from the user: if, for example, you have new tables coming in, what needs to be done? We can either ignore it or we can ingest it. It depends on the preference of the user. So that's how this whole auto schema mapping works in our scenario.
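Here is a minimal Python sketch of the schema-registry comparison being described: keep the last known schema per source table, diff it against the schema observed in the newest batch, and apply a user-configured policy. The function names and policies are assumptions for illustration, not Hevo's internals.

```python
def diff_schemas(previous: dict, current: dict) -> dict:
    """Return added columns, removed columns, and type changes."""
    added = {c: t for c, t in current.items() if c not in previous}
    removed = [c for c in previous if c not in current]
    changed = {c: (previous[c], t) for c, t in current.items()
               if c in previous and previous[c] != t}
    return {"added": added, "removed": removed, "type_changed": changed}

def apply_policy(delta: dict, policy: str = "auto_create"):
    """Act on the delta according to the user's configured preference."""
    if delta["added"]:
        if policy == "auto_create":
            print(f"ALTER destination: add columns {list(delta['added'])}")
        else:  # "notify"
            print(f"Notify user: new columns detected {list(delta['added'])}")
    if delta["type_changed"]:
        print(f"Notify user: type changes need review {delta['type_changed']}")

last_known = {"id": "int", "email": "string"}
observed   = {"id": "int", "email": "string", "plan": "string"}
apply_policy(diff_schemas(last_known, observed))
```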
[00:29:25] Unknown:
In terms of the conversation about ETL versus ELT and the question of what constitutes best practice, what are some of the aspects of customer education and customer feedback that you need to work with to help them understand what approach you've taken, why you've taken that approach, and some of the ways that they can best take advantage of the capabilities that you're building into, you know, applying some of those transformations before it lands in the data warehouse and how that factors into some of their approach to data modeling and those other questions of, you know, how to make sure that they're able to build and maintain a healthy data platform?
[00:30:05] Unknown:
I think there are 2 sets of users who come onto the platform. The first set is just getting started on that journey. The first thing they want is to get the basics right, which is just getting the data into the warehouse. And as they go along, their analysts will come back and say, you know what? The query is running very slow, or it is costing too much. What can we do about it? In which case, they will go and figure out the structure of the data and how it needs to be modeled so that they can solve those bottlenecks. That's the stage where they realize: what if we can really have the right structure before the data lands into the warehouse itself? So that's kind of an evolution journey as they go along and get to a point where they are trying to optimize the performance beyond just the basics of getting the data into the warehouse. So that's 1 scenario. The second scenario is customers who were using something else, some other solution in the market, and then they faced the same limitation, but those solutions didn't lend themselves to that level of flexibility where they could get an in flight transformation before the data loads into the destination. And that too, not just at a source level, but at an extreme granularity of: for this particular table, for this particular type of records, I want to apply this transformation. And for others, I just want to simply load.
So those users naturally migrate or are looking for a platform that allows these intrinsic capabilities that we offer. So it's a lot about users discovering this on their own journey to optimizing their flow and performance, as opposed to us trying to educate on ELT versus ETL. In my view, as a product, we need to support both scenarios and leave it to the user to determine what's best for them, because there is no right answer. It's a trade off. And in some scenarios, 1 solution is better, while in other cases, the other solution is better. As far as
[00:32:09] Unknown:
the overall process of getting set up with Hevo and how it integrates into the workflow, I'm wondering if you can talk to some of those considerations and also, in particular, the collaboration aspect of how the different personas and roles in the organization will interact with Hevo as data traverses the various stages of the life cycle?
[00:32:30] Unknown:
So we are a completely self serve platform. And in the 1st couple of years, we did not have any sales team, or anyone other than engineers. So the general way of thinking about this whole problem was that we wanted it to be super simple, and we did not want someone to require a demo of the product or someone to walk them through. So that really helped us in terms of simplifying our entire onboarding and the user experience within the product just after people sign up. So today, if someone signs up on Hevo, it takes them only a few minutes to get their pipeline set up and the data to start loading into the warehouse.
And all the complex things in between are either automated or there is, like, a completely guided set of steps for them to really know what needs to be done. So, for example, if it is required for them to whitelist an IP, then within the context of the product they will get, depending on where they are within the product, the steps that they need to follow in order to achieve that. Because we assume that the user may not have a lot of background, and then we need to really help them and guide them so that they are able to achieve the goals that they have.
So that's 1 scenario in terms of how people onboard themselves in a very self serve way. The second aspect is the collaboration once someone signs up on Hevo. Naturally, they start with 1 or 2 sources of data, which are the predominant consideration for them. But as they go along, and as their own maturity in terms of the analytics needs within the organization grows, we've seen more different types of sources getting connected. So someone may start with just some marketing sources of data, but then eventually they want to combine the marketing with the sales data as well. Or they may want to combine this with their purchase data as well, in which case they might start with, say, the advertising channels like Google Ads, Facebook, LinkedIn Ads. And then they may want to add the sales CRM, say, Salesforce.
And later on, they may want to combine that with the Stripe data as well, just to complete the funnel. And later on, they may want to get some product usage data, which might be in MongoDB. So then they will connect to MongoDB and get all these things together. So as the complexity of the questions that the business users are trying to ask increases, more and more sources of data need to be connected and brought into the warehouse. So that's the kind of journey we've seen.
[00:35:07] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. So we've been talking a lot about the journey of data from source systems into the data warehouse. And my understanding is that you are also providing capabilities for doing what some people are calling data activation or reverse ETL or, you know, operational analytics. And I'm wondering if you can talk to also that aspect of, you know, you've got the data loaded, you've done your transformations and modeling, and then being able to feed that back out into those operational systems where maybe somebody wants their customer data pushed back into Salesforce or HubSpot or whatever kind of system of record the rest of the business is using. And I'm wondering if you can talk to some of the interesting challenges that you've experienced there and some of the ways that you've been able to lean on the kind of data ingestion and integration capabilities to then be able to also do double duty on that kind of egress path?
[00:37:04] Unknown:
I think we started building the 2nd product before this category was even called reverse ETL or anything. What we fundamentally wanted to do was... so there was this whole notion that engineers don't want to talk to a sales team. Right? This is partly true, but not completely true. People want to talk to a sales team when they are really ready to talk to a sales team. If the objective of the sales team is to help the customers navigate their journey of decision making, they want to talk to them. But if the objective of the sales team is to push them to buy, I think that's when, like, the engineers or business folks, no 1 wants to talk. Right? So we wanted a system where we could see where the user is in their journey and decide at what stage someone should reach out to them and who should reach out to them. So if they've achieved certain success within the product, then the discussion could be more around the lines of what's the overall problem that they are trying to solve and what else they would need to make the decision, versus when they are struggling to get some of their data from some specific sources or they are encountering certain challenges.
So rather than waiting for the user to reach out to support, what we wanted to do was know where they are stuck and, depending on that, proactively reach out with a certain solution for how they could change certain configuration and get to the results that they are after. Now in order for us to be able to do that, we wanted to get the product usage data on Intercom, which is a medium of chat support, and also on our CRM. So we were anyway getting all the data into our warehouse, and we wanted to get this back out. So we built this internally. And I ended up having discussions with a few of the other folks, and they said that this looks great. Why don't we just convert this into a product? Like, I mean, some of the friends I was speaking with, they wanted to use the same thing. And I said, no, it's not in a state where you could actually use it, because we just did a bunch of scripts to make that happen. I figured that this could be a product in itself.
And we started on that journey very organically, saying that people want product usage data to come to either their sales CRM, or they want this data to go into their help desk so that someone can really have complete context while they are interacting with the customer. And later on, we figured out by talking to more customers that, ultimately, you don't want your warehouse to become yet another silo. So if you get all the data into your warehouse and the only thing that you can do with that data is just build dashboards, which are kind of reactive, then towards the end of the month you figure out that you did not hit your targets.
It's of no use because it's already happened. Whereas if you can, almost in near real time, trigger certain actions to certain individuals who can actually influence the outcome, it can have a huge impact, because all the heavy lifting of consolidating the data into the warehouse and building certain insights out of it is already done. Now all that you need to do is make this insight accessible in the right context to the right person at the right time. And that's how this whole concept of reverse ETL came into the picture. And as we were building it, we figured that, like, this is coming up as a new category in itself, which I felt logically did make a lot of sense to me, because it's just a very natural extension where a solution which is bringing the data into the warehouse should naturally lead the data back from the warehouse into different systems. It's almost like an Internet line which can upload the data and download the data. You typically don't have 2 Internet connections, 1 for uploading the data and another for downloading the data. The same Internet line does that for you. So it's almost very similar to that.
[00:40:59] Unknown:
The other interesting element of building and supporting a platform that handles the kind of traversal of data across customer systems is the problem of ongoing maintenance and evolution of those workflows and being able to monitor and alert on failures and be able to warn of potential breakages when you make changes, which is often where things like metadata and lineage come in. And I'm wondering if you can talk to some of that aspect of how you help to support the kind of deliberate evolution and maintenance of those pipelines and workflows so that users don't have to kind of be surprised when all of a sudden they go to load a dashboard or go to go into their Salesforce where they've been replicating data and things don't look right or, you know, the data is stale and some of those overall aspects of ongoing maintenance.
[00:41:49] Unknown:
I think this is, like, the most difficult part of this entire business, making sure that nothing ever breaks, irrespective of what happens. And it's not just about how many smart engineers you put in. There is a natural product evolution in terms of how many edge case scenarios you have encountered and can really anticipate, because the total number of combinations that you can think of is very large. On 1 side, you've got 150 different connectors, and each may have their own version. So someone may have 1 version of MongoDB versus the other version of MongoDB.
The second variable comes in the environment in which you're operating. Someone has set up MongoDB on AWS, whereas someone else has set it up on GCP or Azure. And the third element is your entire configuration, because you may have set up the user privileges in a certain different way compared to someone else. And it could lead to 150 into 4 into some very large number. Maybe let's take a number for the sake of simplicity: there are 25 different types of privileges that people typically assign to a user. So the total number of combinations could be very huge. Right? And suddenly, you figure out that there was a certain problem in a certain specific version of the database that needs to be handled differently. So the code complexity continues to go up over a period of time, because now you are trying to have not just 1 integration for 1 particular type of source. There are various different types of integrations for different versions of it. So that kind of is a very complex problem in this particular scenario.
So by default, like, when someone is building a product for data integration, I think 1 should assume that there are no happy cases. So you should, by default, assume that anything that can possibly go wrong will definitely go wrong. Right? And then build your systems according to that. So if you are connecting to a source, don't assume that it will get connected. Assume that it will time out. Assume that something else will go wrong. And then how do you build and design a system that makes sure that if it can be resolved, it automatically gets resolved. If not, then determine, based on certain conditions, what action needs to be taken. Does the system need to inform the user, or does it need to inform someone in the control tower? So we have a control tower team which kind of monitors all these things on behalf of the customers.
So those are the things, how you kind of get to that level of robustness and maturity. So in the early days, it was super hard. Like, every now and then we would see some customers facing this problem. But now that we've seen all types of different edge cases, working with thousands of companies across the last 2 to 3 years, now we understand how it needs to be managed to make sure that customers can rely on the platform to deliver accurate data for them.
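A minimal sketch of the "assume it will fail" posture described above: wrap the source connection in timeouts and bounded retries with backoff, and escalate when retries are exhausted. The function names, exception types, and thresholds are illustrative assumptions, not Hevo's actual implementation.

```python
import random
import time

def connect_with_retries(connect, max_attempts=5, base_delay=1.0):
    """Assume the connection can time out or fail; retry with exponential
    backoff and jitter, then escalate instead of silently giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except (TimeoutError, ConnectionError) as exc:
            if attempt == max_attempts:
                # Exhausted retries: hand off to an operator / control tower.
                raise RuntimeError(
                    f"Source unreachable after {max_attempts} attempts") from exc
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)

# Usage: pass in whatever actually opens the source connection, e.g.
# connection = connect_with_retries(lambda: open_source_connection(config))
# where open_source_connection is a hypothetical helper for your source.
```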
[00:44:53] Unknown:
In your work of building the platform and building Hevo Data and given your experience of trying to make sure that you had insights and information about your previous business, what are some of the ways that you're using Hevo Data to help build and gain insight into Hevo Data and understand where to take it going into the future?
[00:45:12] Unknown:
I think an interesting aspect of this is that I did not come from an enterprise SaaS background. So I had, like, no background, which has certain advantages and certain disadvantages. I come from a consumer Internet background, where the user doesn't talk to anyone. If you like it, you sign up and you use the product. If you like it, you pay for it and you continue to use it. The moment you stop seeing value, you cancel it. Right? It's almost like a Netflix kind of frame of reference. So the way we ended up building the product was of a very consumer grade, which was very unlike how enterprise software typically gets built. Right? So that fundamental philosophy of trying to build a very lovable product is something that comes very inherently naturally to us, because by default we assume that we don't have a salesperson who will go and talk to the customer. The product has to just work. If it doesn't work, we have no shot at it. Now it may be a large customer who otherwise would have paid, say, $100,000, but it just has to work in the first go. So our obsession about optimizing the entire user journey and funnel is very unlike how enterprise software is typically built. That's 1 learning. The second aspect is around how we thought about our entire go to market.
Because if we were traditionally coming from a SaaS background, we would think, like, hey, we've got to have a sales team or a marketing team. Whereas because we came from a consumer Internet background, we thought: when you have a problem, what is the first thing that you do? You go and search over the Internet. We said that when someone faces a problem where the solution could be data integration, we should be found by them. And today, we dominate nearly all the search terms so that users discover us. So we don't have to spend a lot of money on sales and marketing. Instead, we channel that capital to building a great product, which over a long period of time compounds and delivers better value for the customers.
[00:47:16] Unknown:
In your work of building the platform and working with your customers, what are some of the most interesting or innovative or unexpected ways that you have seen Hevo used?
[00:47:32] Unknown:
Reaching out to customers proactively, based on their product usage data, is something that we've seen really generate disproportionate value for the customer. So just getting data and looking at the dashboard has a certain impact on how you operate. But, for example, when your sales team reaches out to the customer, or your support team reaches out to the customer, because you gathered all the information about the customers and their interaction with the product, then for someone in the support team there is a proactive ticket saying that this user signed up 1 hour back, looked at the pricing page, did x y z steps, and after 4 hours they have not taken this step, which 90% of people take.
And then someone from the support team directly reaches out saying, hey, I saw that you did x y z, but you didn't complete step z. Here is the documentation. Here is a video link. And in case you need any help, here is a link to schedule a call with us. That kind of generates a wow moment for the customer, and that is where we've seen a lot of impact getting created with Hevo.
[00:48:33] Unknown:
In your experience of building the business and working in this space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:48:42] Unknown:
I think the technical complexity of what it takes to build a truly reliable, robust data integration platform is something I didn't fully appreciate. You feel that there is this third-party system, we are smart engineers, we'll go and build an integration with them, they have a public API. But the reality is that what the API docs describe is often far from the response you actually get when you call those APIs, and you can never be sure. For example, the docs may say that you will get time in PST, but what you actually get is GMT, and things like that, which you only discover while figuring out what went wrong. Or someone says, hey, we had 5 billion records and we are getting one record less in our destination, and you have to go and figure out where exactly that record went. At times your best engineers have to spend sleepless days figuring out what happened to that one record, only to discover that the user applied a certain transformation and it's waiting in a queue, or something like that. So the technical complexity that needs to be handled to build a truly reliable solution, one on which customers can confidently make decisions based on that data, is very, very large. That came as a big surprise to us in the early days, but thankfully we learned what it takes, and over the last two to three years we've really nailed it down.
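As one hedged illustration of the defensive handling this requires, the sketch below normalizes API timestamps to UTC instead of trusting the documented time zone. The zone constants and function name are hypothetical examples, not part of any specific connector.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The docs may claim timestamps are in one zone while the API actually returns
# another, so the assumed zone is an explicit, configurable input.
DOCUMENTED_ZONE = ZoneInfo("America/Los_Angeles")  # what the docs claim (PST/PDT)
OBSERVED_ZONE = timezone.utc                       # what responses were observed to carry

def normalize_timestamp(raw: str, assume_zone=OBSERVED_ZONE) -> datetime:
    """Parse an ISO-8601 timestamp and return it in UTC.
    Naive timestamps are interpreted in `assume_zone` rather than being trusted
    to match whatever the API documentation says."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=assume_zone)
    return parsed.astimezone(timezone.utc)

# The same wall-clock string means very different instants depending on the
# assumed zone, which is exactly the class of bug described above.
print(normalize_timestamp("2022-09-01T10:00:00"))                   # interpreted as UTC
print(normalize_timestamp("2022-09-01T10:00:00", DOCUMENTED_ZONE))  # interpreted as Pacific time
```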
[00:50:17] Unknown:
And for people who are evaluating the platform, what are the cases where Hevo might be the wrong choice?
[00:50:26] Unknown:
So in cases where customers have the vast majority of their data in an on-prem setup and are just starting to graduate to the cloud, so they need to manage both in a hybrid structure, I don't think we are really designed to cater to that segment of customers. For anyone who has their data in the cloud and a warehouse in the cloud and wants to solve this problem of fragmented data, I think Hevo is the right solution for those scenarios.
[00:50:55] Unknown:
As you continue to build and grow and maintain the Hevo platform, what are some of the things you have planned for the near to medium term, or any particular problem areas or feature capabilities that you're excited to dig into?
[00:51:09] Unknown:
I think a lot of what we're trying to build in, let's say, the next 6 to 12 months is around making the platform more and more robust, so that after customers have set everything up, they never have to come back to the platform to check whether things are working. That's one aspect. The second aspect is ROI. Early adopters don't care as much about efficiency or ROI, in terms of how much things really cost on the pipeline side or on the warehouse side. But the natural evolution is that, as a business, at some stage you start to question: I'm investing $100,000 in the data infrastructure I've set up. What's my ROI?
Right? And you realize that a whole bunch of cost goes into doing things that are not really adding value, because you might be bringing in tons of data from different sources that most people are not using. So how can we proactively identify that and suggest to the user: this is a dataset you are bringing in, and in the last 30 days no one has really used it. Do you still want to replicate it continuously, or do you want to replicate it once a month, in which case it will lead to a cost saving of x dollars? I think those things can be built into the product and will lead to a higher ROI for the customer. Of course, at times that may not be a good thing for us as a business, but the general belief is that ultimately the market wins. If you don't optimize for your customers, someone else will, and customers will choose the solution that is designed for them to win. So it is better that we proactively work in that direction and make sure that customers are not paying for things they are not going to use.
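A minimal sketch of that heuristic, assuming warehouse query history is available and using made-up table names and costs, might look like the following. It only illustrates the suggestion being described; it is not a feature of the product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class SourceTable:
    name: str
    last_queried_at: datetime     # assumed to come from warehouse query history
    monthly_sync_cost_usd: float  # pipeline + warehouse cost attributed to this table

STALE_AFTER = timedelta(days=30)  # assumed threshold for "unused" data

def suggest_savings(tables: list[SourceTable], now: datetime) -> list[str]:
    """Flag tables that nobody has queried in the last 30 days and estimate the
    saving from pausing them or syncing monthly instead of continuously."""
    suggestions = []
    for t in tables:
        if now - t.last_queried_at >= STALE_AFTER:
            suggestions.append(
                f"{t.name}: unused for 30+ days; pausing or monthly sync "
                f"could save roughly ${t.monthly_sync_cost_usd:,.0f}/month"
            )
    return suggestions

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    tables = [
        SourceTable("crm_contacts", now - timedelta(days=2), 120.0),
        SourceTable("legacy_clickstream", now - timedelta(days=45), 800.0),
    ]
    for line in suggest_savings(tables, now):
        print(line)
```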
[00:52:59] Unknown:
Are there any other aspects of the work that you're doing at Hevo and the overall space of data integration and data pipelines that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think the whole part around
[00:53:12] Unknown:
One aspect is the fragmentation and defragmentation, or the bundling and unbundling of the stack. That's one aspect that gets discussed a lot, both internally within Hevo and outside. The second aspect, which I personally care about a lot, is the UX of all the products that exist today. I think the current UX is good for the early set of adopters, but as we cross the chasm and this unfolds into the mainstream market, we will see a lot of these products take on a very different form factor aimed at simplifying things, so that nearly everyone who wants to use data to make decisions can do so irrespective of their technical competency.
[00:53:57] Unknown:
Yeah. Definitely agreed on the polish aspect of the platforms that we have available. As you said, a lot of early adopters or strong engineering teams are able to take these systems and build things that power their businesses. But as you get into the more mainstream market, the broader set of adopters and engineering teams who don't necessarily want to become experts in absolutely everything, there's a significant need to add more assistance and improved user experience to these platforms, another layer of abstraction and understanding, so that you don't have to know everything about distributed systems, or all the different ways your pipeline might fail, just to get something that works, runs, and that you can trust.
[00:54:47] Unknown:
Yeah.
[00:54:48] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap on the tooling or technology that's available for data management today.
[00:55:03] Unknown:
I think, again, I'll double click on the simplicity part of it. One thing that is a little unrelated, but serves as a North Star vision for me: imagine that if smartphones and the iPhone did not exist, 99% of the world's Internet population would disappear. Rather than waiting for people to learn new technology and then solve their problems, the core role of technology is to get to a form factor where it can truly unlock the market. I think that iPhone moment for the data space is yet to happen, where we are no longer confined to some 10,000 companies knowing how to really leverage data, because the total market is much, much bigger and we've just scratched the surface of it. So if we really want to penetrate and become mainstream, then we need to think hard about how we simplify things.
[00:56:00] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Hevo Data and your experiences building this company and platform. It's definitely a very interesting product, and it's great to have people out there who are focused on user experience and making it a simpler process to get data to where it needs to be. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Great. Thank you, Tobias. It was a pleasure speaking with you today. Thank you for listening.
Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Manish Jethani's Background in Data
Building Hevo Data and Solving Data Integration Problems
User Personas and Platform Design
Lessons Learned from Previous Roles
Benefits and Challenges of Owning the Data Flow
Competitive Landscape and Differentiation
Platform Architecture and Real-Time Data Processing
Optimizations for Variable Traffic Patterns
Automation in ETL and Data Pipelines
Handling Data Transformations and Integrations
Managing Schema Mapping and Evolution
Customer Education and Feedback
Onboarding and Collaboration in Hevo
Reverse ETL and Data Activation
Ongoing Maintenance and Evolution of Workflows
Using Hevo Data to Build Hevo Data
Future Plans and Enhancements
UX and Simplification in Data Products
Biggest Gaps in Data Management Tools