Summary
Any business that wants to understand its operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. In order to reduce the time and effort required to build pipelines that power critical insights, Manish Jethani co-founded Hevo Data. In this episode he shares his journey from building a consumer product to launching a data pipeline service and how his frustrations as a product owner have informed his work at Hevo Data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Manish Jethani about Hevo Data’s experiences navigating the modern data stack and the role of ELT in data workflows
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Hevo Data is and the story behind it?
- What is the core problem that you are trying to solve with the Hevo platform?
- What are the target personas of who will bring Hevo into a company and who will be using/interacting with it for their day-to-day?
- What are some of the lessons that you learned building a product that relied on data to function which you have carried into your work at Hevo, providing the utilities that enable other businesses and products?
- There are numerous commercial and open source options for collecting, transforming, and integrating data. What are the differentiating features of Hevo?
- What are your views on the benefits of a vertically integrated platform for data flows in the world of the disaggregated "modern data stack"?
- Can you describe how the Hevo platform is implemented?
- What are some of the optimizations that you have invested in to support the aggregate load from your customers?
- The predominant pattern in recent years for collecting and processing data is ELT. In your work at Hevo, what are some of the nuances and exceptions to that "best practice" that you have encountered?
- How have you factored those learnings back into the product?
- Mechanics of schema mapping
- Edge cases that require human intervention
- How to surface those in a timely fashion
- What is the process for onboarding onto the Hevo platform?
- Once an organization has adopted Hevo, can you describe the workflow of building/maintaining/evolving data pipelines?
- What are the most interesting, innovative, or unexpected ways that you have seen Hevo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hevo?
- When is Hevo the wrong choice?
- What do you have planned for the future of Hevo?
Contact Info
- @ManishJethani on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Sponsored By:
- Sifflet: ![Sifflet](https://files.fireside.fm/file/fireside-uploads/images/c/c6161a3f-a67b-48ef-b087-52f1f1573292/z-fy2Hbs.png) Sifflet is a Full Data Stack Observability platform acting as an overseeing layer to the Data Stack, ensuring that data is reliable from ingestion to consumption. Whether the data is in transit or at rest, Sifflet is able to detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. We also offer a 2-week free trial. Go to [dataengineeringpodcast.com/sifflet](https://www.dataengineeringpodcast.com/sifflet) to find out more.
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack, observing data and ensuring it's reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to over 50 quality checks, extensive column level lineage, and over 20 connectors across the data stack. In addition, data discovery is made easy through Sifflet's information rich data catalog with a powerful search engine and real time health statuses. Listeners of the podcast will get $2,000 to use as platform credits when signing up to use Sifflet.
Sifflet also offers a 2 week free trial. Find out more at dataengineeringpodcast.com/sifflet today. That's S-I-F-F-L-E-T. Your host is Tobias Macey, and today I'm interviewing Manish Jethani about Hevo Data's experiences navigating the modern data stack and the role of ELT in data workflows. So Manish, can you start by introducing yourself?
[00:02:01] Unknown:
Thanks, Tobias, for having me on your show. It's a pleasure being part of this discussion today. So my name is Manish Jethani, and I'm cofounder and CEO of Hevo Data. We deal with data pipeline as a service, and we are close to 4 and a half years old as a company.
[00:02:19] Unknown:
And do you remember how you first got started working in data? Oh, yes. Unconventional
[00:02:23] Unknown:
background coming into the data space. I was running a consumer Internet startup as a founder and CEO, and it was a venture backed startup. And I personally experienced a lot of problems, rather, having to make the decisions and the data not being accessible. And I found it very, very frustrating because, as an individual, I am someone who values scientific decision making a lot. And you cannot have scientific decision making without having the facts and the information. And it so turned out that data was, kind of, the building block for you to actually be able to make scientific decisions. And I explored the solutions around the space. I couldn't find anything interesting.
And before I could actually internally try and solve that problem, we got acquired. And at the company who acquired us, I was heading product over there. I saw a very similar nature of problem over there itself. And it so turned out that I got so close to, or rather I was so frustrated with, the problem that as a second time entrepreneur, I picked this as a problem to deep dive into and solve.
[00:03:31] Unknown:
So in terms of what you're building at Hevo Data, can you give a bit more detail on what the product is, what specific problem area you're focusing on, and some of the story behind how you decided that that was a business that you wanted to spend your time and energy on?
[00:03:45] Unknown:
If you see, most organizations have multiple departments, and each department has its own business software that they are using. In some cases, it's built internally by the engineering team, while in other cases, it's a third party software that they are using. And as an executive, you want to understand what's happening in your business, be it at the company level or at a department level. So it's super hard for you to get a complete picture of what's happening unless you can get all the data at 1 place. And it so turns out that with cloud data warehouses, storing and analyzing the data on the cloud has become much easier.
You don't have to worry about all the infrastructure pieces. But getting the data into the warehouse was a problem that I personally experienced as a big problem. And the number of systems through which you had to pull in the data was also very large. So we ended up building a fully automated data pipeline as a solution. And the whole focus has been on how do we really simplify to a point where the technical barrier to use the data comes down.
[00:04:53] Unknown:
In terms of the types of users that you're focused on, I'm interested in understanding kind of the personas that you are trying to address and how that informs the way that you design the platform, the capabilities that you build into it, and some of the prioritization that you have as far as understanding what direction to take your product?
[00:05:14] Unknown:
So we've seen 2 groups of users from the companies being the entry point for us. User group 1 is the central data teams, which comprise engineers, business analysts, folks who are working with the data. And they are centrally responsible for making sure that the data is available to everyone who wants to consume it within the organization. That's 1 group. The second group that we've started to see very recently has been the data ops teams integrated within the line of business. So sales teams or marketing teams do have their own set of analysts who are trying to solve specific problems for that department, which are not being centrally solved. Or sometimes some of these companies don't even have a central data team. So that kind of becomes the entry point for departments to go and solve their own problems. So these are the 2 user groups that we see as an entry point within the organization who will start to use Hevo.
[00:06:18] Unknown:
In terms of the experiences that you had working in your previous role and the challenges that you had as far as being able to get all the data that you needed into the warehouse to be able to understand what was happening in the business and where you wanted to focus your efforts, what are some of the lessons that you learned from going through that process and building a product where you were trying to be data driven, and some of the ways that you think about the requirements of Hevo Data to be able to solve those problems that you were experiencing?
[00:06:51] Unknown:
So 1 of the key learnings that we had was that if the effort required, or the complexity that needs to be handled, for people to get access to data is going to be large, you will see people defaulting to intuition based decisions. So 1 of the key factors for us when we got started on the journey to build Hevo was making it super simple. I think that was just the whole premise of it, that it should be so simple, so intuitive that it does not require someone to really have deep technical expertise to be able to operate this system. And that has been kind of the guiding principle because, like, this whole concept of data pipeline or ETL, as it was called earlier, is not a new concept.
There have been, like, multiple companies over the last 2 decades that have been trying to solve the same problem. Our differentiated point of view on this was that the number of companies who really need to leverage data to make decisions is much, much larger than the number of data engineers that are going to be available. So in order to bridge that gap, the form factor of the technology had to be in a form which allowed people with just the basic understanding of what the data is and where it is to be able to operate the system. So that is something that was the initiating point, and that has always remained to be the core of everything that we do.
[00:08:17] Unknown:
As far as the approach of being a single solution for being able to get data from the sources into the warehouse and then move it on into destinations, I'm curious what you see as being the benefits and some of the challenging aspects of being able to own that entire flow, particularly given the current landscape of more focused kind of point solutions for different stages of the data life cycle?
[00:08:44] Unknown:
I think it varies depending on the type of customers that you're trying to serve. If you go deep into enterprises, like the Fortune 500, their requirements on each part of this entire value chain of the data are very specific. Whereas if you look at companies with, let's say, anywhere less than 2,000 employees, they typically want a more unified and integrated solution, because with all these point solutions comes the complexity of having to manage all these different solutions. And when something goes wrong, you don't know where to look. So the majority of the market really doesn't have the kind of technical expertise to be able to do that. Now, I mean, in the entire modern data stack, we are at a super early stage of the adoption curve. So if you see, Snowflake has got some 7,000 customers now. Just from a benchmarking perspective, there are, like, a few 100,000 companies who typically need to use data to make decisions. Right? But whereas Snowflake, which is 1 of the largest players in this space, has got just 7,000 customers. So there is, like, a whole range of companies who are there. And our focus is on how do we really simplify to a point where all those companies are able to access the technology and solve their business problems.
[00:10:08] Unknown:
And to that point of Snowflake only having 7,000 customers and there being, you know, a long tail of other customers and requirements and use cases, there's definitely still a large installation of what some might call legacy data warehouse systems where you actually have physical appliances in either a data center that an organization owns or in a colocation facility or even using a virtual appliance that's deployed to a cloud environment. And I'm wondering how that influences the ways that you think about the interfaces and the integrations that you're focused on supporting in Hevo.
[00:10:47] Unknown:
Yeah. So for us, the segment of the market that we've decided not to go after is all the companies who have legacy systems, where they have an on prem setup and a part of their data sources or destinations are in a data center. That's the segment of the market that we've decided not to pursue. The set of companies that we go after are digitally native businesses who have their cloud warehouse set up or are trying to set up a cloud data warehouse, and their data is fragmented across different systems for us to bring together.
[00:11:24] Unknown:
Given the number of different companies that are trying to compete in the space of data integration and data movement and the, you know, large and growing data ecosystem, particularly in the cloud native space, what are some of the ways that you're thinking about that competitive landscape and the differentiating factors that you're focused on with Hevo?
[00:11:47] Unknown:
So the first thing is that we are fully automated and the most simple solution to use in the market. So the customers who end up signing up with us, they have evaluated a bunch of other solutions available in the market. Right? So any customer who decides to go with Hevo, they typically evaluate 2 or 3 different competing solutions in the market. 1 very clear feedback that we get from the customers is that the amount of simplicity that we focused on in terms of getting them up and running very quickly is better than anyone else. The second aspect is the way we think about owning the scope of a problem between us and the customers.
The way we look at ourselves is as software as a service which will own, or take the accountability for, getting your data from all different sources into your warehouse. And if anything goes wrong, we are there to take care of it. You don't even have to worry about it. So the entire observability instrumentation that we built at our end to be able to proactively detect the problems is something that users really, really love. Whereas the majority of the solutions available in the market are, hey, here is a tool, you go and figure out how you are going to connect your sources and integrate different systems into your warehouse. And if something goes wrong, you've got to figure it out on your own. So the level of assurance that a customer gets when they work with us is another level altogether. And consequently, you would see that we are rated highest on G2 Crowd reviews.
And if you go through the reviews, you'll find it out that this is the aspect that customers really love about us.
[00:13:32] Unknown:
In terms of the actual platform itself, can you describe a bit about the user experience around it and some of the ways that you have built and implemented the platform?
[00:13:42] Unknown:
We are architected on a real time streaming architecture. We use Kafka because we understand that for various types of different use cases, sometimes customers are okay getting data in an hour, but there are times when they do need data within minutes of that data getting generated. So the architecture is designed in a way that it supports streaming, wherein we can deliver near real time data into the warehouse. Now today, not all the data warehouses can actually support that streaming insertion, but we are starting to see a new set of storage layers coming in which are designed for the real time use cases.
And, also, we are horizontally scalable, in the sense that the volume of the data that will come or flow through the pipeline is not very predictable. You may have certain times of the day when there is a certain spike in the data that is getting generated at the customer end. But the latency that the customer expects needs to be nearly constant. Right? So you will have certain businesses who will have peak order volumes in a certain hour of the day, and there'll be huge amounts of data that'll be generated. But they want constant time, in which case the infrastructure has to be elastic enough so that it auto scales and allows the higher throughput. And when the volume of the data comes down, it should automatically scale down to take care of the cost aspect of it. So these are some of the core principles around which the entire product is architected. And the third aspect is that we are completely cloud agnostic. So we are available on AWS. If your sources and destinations are in AWS, you can use our AWS instance.
If you are on a Google Cloud, you could use the Google instance as well.
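To make the elasticity point concrete, here is a minimal Python sketch of the kind of lag-driven scaling rule being described, using a streaming backlog (for example, Kafka consumer-group lag) as the signal. The threshold, worker bounds, and function name are illustrative assumptions rather than Hevo's actual scheduler.

```python
import math

def desired_workers(consumer_lag, lag_per_worker=50_000,
                    min_workers=1, max_workers=20):
    """Choose how many ingestion workers to run so that the backlog
    (e.g. Kafka consumer-group lag) stays bounded per worker."""
    needed = math.ceil(consumer_lag / lag_per_worker) if consumer_lag else min_workers
    return max(min_workers, min(max_workers, needed))

# A burst at peak hours raises the lag, so the pipeline scales out...
print(desired_workers(consumer_lag=600_000))   # -> 12
# ...and scales back in once the backlog drains, to control cost.
print(desired_workers(consumer_lag=20_000))    # -> 1
```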
[00:15:32] Unknown:
Because of the fact that you are managing the infrastructure for your customers and depending on the number of customers that you have, you are going to have highly variable traffic patterns where some customers might have, you know, a constant steady flow of data. Others are going to be very bursty, where maybe at noon they, you know, go from a few hundred messages an hour to a million messages an hour. And just wondering what are some of the optimizations that you've had to build into the platform to be able to support that aggregate load and be able to maybe anticipate some of the heterogeneity in the traffic patterns?
[00:16:10] Unknown:
Yeah. I think this is 1 of the very, very key aspects where customers, when they've evaluated us against some of the other solutions, they found that we tend to perform better when it comes to stress testing. So typically, when customers evaluate, they try to do the stress testing, where they will just go and update, like, all of their tables and see what the time is it takes for the data to land into the warehouse. Now keeping the factor into account that this is a real business scenario, what we've done is build a lot of instrumentation around it so that we proactively detect the throughput for each customer and each of their pipelines.
And the moment we see that there is a lag starting to appear, we auto scale the resources for that particular customer depending on the pricing plan that they are on. So the enterprise customers really want the SLA guarantees, in which case they get higher priority with the available resources. And if even then the latency is starting to increase, then the system auto scales and adds more resources to that particular customer's environment, and then it scales up. And even within their own sources, customers can prioritize at a pipeline level or even at a particular table level that this is higher priority for me. And so even if there is going to be some throttling, those particular segments of the data are never going to slow down. So we've provided all kinds of controls to the customer. And on our side, we make sure that we are automatically able to scale infrastructure to be able to meet the business SLA from the customer side. Another interesting element of
[00:17:51] Unknown:
the way that you've designed the platform is you mentioned focusing on automation as a way to reduce the burden on the operators of the system. And automation is 1 of those terms that is very overloaded and can mean very different things to different people. So wondering if we can dig into that a bit and talk to some of the ways that that automation manifests in the experience of the person using the platform.
[00:18:14] Unknown:
Let me just explain, like, in the context of ETL and the pipeline, what we are automating at a very, very fundamental level. Right? Because automation, as you rightly said, can mean various different things in various different contexts. So if you look at data pipelines or ETL as a category, it's almost like kind of a living organism. Right? So your pipelines are not something where you have a fixed set of data that will never change, and you can define your configurations and it will automatically work. The general nature of this entire system is that your input is going to constantly evolve, but your output has to be nearly constant. Which means that as an application developer, I might have made certain changes to my tables. I may have added a few columns. I may have added new tables. I may have changed certain data types. And now all these changes have to have an impact on the downstream systems.
Earlier, what used to happen is that the tools used to send notifications or alerts to the customer saying that, hey, this has changed. Now you go and change the entire mapping logic that you've configured in your system for it to work. So when we talk about automation, it is about a few things. The first thing is the schema management. So if there is going to be any deviation from the schema, we automatically infer what that deviation is and what's the right set of actions that needs to happen. Right? So if you are going to add new tables or new columns, we can very easily have those added to your destinations.
Or you can also, as a user, configure that for this particular source, whenever there are new tables, automatically create corresponding tables in the destination. Or you could say, notify me, and then I will make a decision whether I want to ingest that data or not. So from that perspective, you can set the configuration once and not have to worry about all those schema changes breaking your pipeline. Because pipelines breaking is a real, real problem, and that kind of keeps data engineers anxious that at any point in time something can go wrong. The second aspect is around the auto scaling, that suddenly you have huge volumes of data getting generated. And you don't want to be, like, monitoring it manually and then going and scaling and figuring out how you make sure that the data that is available in the warehouse is the most recent data. So that is the second aspect of the automation: how do you monitor the throughput and make sure that the throughput is actually in line with the business SLA.
The third aspect is when something is going wrong. Right? So, typically, on a given day, if you're ingesting a certain volume of data and on 1 specific day the volume is low, it's kind of an anomaly. And then if the system can detect it and identify whether it is an anomaly or not and alert the user, then the user can decide whether something has gone wrong at their end or it's just a regular business deviation.
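As a minimal sketch of how such a volume anomaly check might work, here is an illustrative rolling-baseline heuristic in Python. The function name, window length, and z-score threshold are assumptions for illustration, not Hevo's actual detection logic.

```python
from statistics import mean, stdev

def is_volume_anomaly(daily_counts, todays_count, z_threshold=3.0):
    """Flag today's ingested row count as anomalous if it deviates
    sharply from the trailing baseline (hypothetical heuristic)."""
    if len(daily_counts) < 7:
        return False  # not enough history to judge
    baseline = mean(daily_counts)
    spread = stdev(daily_counts)
    if spread == 0:
        return todays_count != baseline
    z = abs(todays_count - baseline) / spread
    return z > z_threshold

# Example: a week of roughly 1M rows/day, then a sudden drop to 120k.
history = [1_020_000, 980_000, 1_050_000, 990_000, 1_010_000, 1_000_000, 970_000]
if is_volume_anomaly(history, 120_000):
    print("Alert the user: ingestion volume looks anomalous")
```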
[00:21:27] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder. As far as the actual work of doing the transformation and data integration, there are, again, a number of different patterns that have arisen over different generations of technology with ELT being the most prevalent 1 now.
And another interesting element of this space is the question of the sources and destinations that you're able to work with. And I'm wondering if you can talk to how you've approached the design and implementation to be able to handle both the kind of common majority of systems that people are trying to integrate with, but also being able to support the long tail of custom or bespoke platforms and applications that people are trying to pull data from. And the method that you're able to expose for being able to manage the transformations and do it in a way that you're reducing the burden on the engineer, but again, still being able to provide that escape hatch for a very customized logic and processing for those situations where it's needed.
[00:23:03] Unknown:
So actually, 2 or 3 points over here. The first 1 is the practice around ETL versus ELT. That whole concept of a best practice, about ELT being the right solution, I think it takes things further than what the reality is. What we've seen is that roughly about 2 thirds of our customers actually do some level of transformation before the data loads into the warehouse. And that's a big number. Like, out of every 3 users, 2 of them use some lightweight transformation before the data lands into the warehouse. Like, in technology, at a very principled level, there are no perfect solutions. There are always trade offs.
And it depends on the context, on the use case that you're trying to solve. So we've seen that whenever the data is very unstructured, right, so let's say if you are moving data from MongoDB or S3 files or FTP files, there you need to restructure the data in a certain way so that it is easily consumable for the analytics teams. So you may want to flatten it out before it lands into the warehouse. The reason users want to do this is so that they can get data which is easily consumable. That is point number 1. Point number 2 is that you also want good performance with respect to the query and the cost. So if you are going to do a lot of transformations at query time within the warehouse, it's a problem. So the better alternative in certain situations is to apply that set of transformations and then load the data into the warehouse, so that it is easily understandable by the end user who is going to consume that data, and you get better performance in terms of the time to get the results and also the cost.
The second aspect is around the control that you want in terms of what data should land into the warehouse and what data should not, because you don't necessarily need all types of data in the warehouse. For example, let's say you want to mask certain information which should not be available to everyone within the organization. So you might want to apply some lightweight transformation, which is like an in flight transformation as we call it, before the data loads into the warehouse. And we've seen that when customers start to use a solution, they may not instantly realize that these are the scenarios that they will come across.
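To make the in-flight transformation idea concrete, here is a minimal Python sketch of the kind of flatten-and-mask step being described: flatten a nested document (as you might get from MongoDB) and hash a sensitive field before it reaches the warehouse. The function names and the masking scheme are illustrative assumptions, not Hevo's actual transformation API.

```python
import hashlib

def flatten(record, parent_key="", sep="_"):
    """Flatten a nested dict (e.g. a MongoDB document) into flat columns."""
    flat = {}
    for key, value in record.items():
        new_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, new_key, sep))
        else:
            flat[new_key] = value
    return flat

def mask(value):
    """Irreversibly mask a sensitive value before it reaches the warehouse."""
    return hashlib.sha256(str(value).encode()).hexdigest()[:16]

def transform(record, masked_fields=("email",)):
    flat = flatten(record)
    for field in masked_fields:
        if field in flat:
            flat[field] = mask(flat[field])
    return flat

event = {"user": {"id": 42, "address": {"city": "Pune"}}, "email": "jane@example.com"}
print(transform(event))
# -> {'user_id': 42, 'user_address_city': 'Pune', 'email': '<16-char hash>'}
```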
But as they go along in that journey and as they evolve their use cases, they come across these problems, in which case we are flexible. The platform is flexible in order to be able to cater to those use cases. And it is not left to the user to go and figure out an alternative path for loading that set of data into the warehouse, where they have to go and build their own custom stuff. The second aspect, as you mentioned, is around the sources, because there are only so many sources any player can build, and there could be, like, a whole long tail of sources that customers may want to bring data from. On the destination side, you have a very finite number of destinations that the customers are using. But on the sources side, we see a huge long tail as well. So our principle around that has been that all the popular sources we want to build, own, control, and manage.
Whereas there may be certain sources where it may not make logical sense for us to put them on our road map because they are long tail. In those cases, we very recently released an SDK, which is in a private preview, where the users can configure, on their own, any specific sources that they have. They can configure it to bring the data and push it into Hevo. And for this, they really don't have to know a lot of details about what goes on in the ETL. It's more of a configuration based input. So you would put in certain inputs around the API. The system asks, on a UI, what the token is and what the parameters are.
And you can configure it on your UI, and then what you get is a connector at the end of it all. So it kind of handles both situations. Your key important sources, we take care of, and we have a team which is going to constantly monitor, improve, and build. But if there is a long tail, you don't have to worry about having to build everything custom. You can use the framework that we provide to build your own integrations. And we are also working towards onboarding certain partners. So if your team does not have the bandwidth or does not have the capability to be able to configure it, we will have partners who will configure it for you in a span of 1 to 2 days.
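As a rough illustration of what such a configuration-driven, long-tail connector could look like, here is a hypothetical Python sketch: a declarative description of a REST source plus a small incremental fetch helper. The field names, URL, and placeholder token are purely illustrative assumptions and do not reflect Hevo's actual SDK.

```python
import requests

# Hypothetical, configuration-style description of a long-tail REST source.
source_config = {
    "name": "internal_billing_api",
    "base_url": "https://billing.example.com/api/v1/invoices",
    "auth_header": {"Authorization": "Bearer <token>"},  # placeholder token
    "params": {"page_size": 100},
    "cursor_field": "updated_at",   # used for incremental pulls
}

def fetch_records(config, since):
    """Pull records newer than `since` from the configured source."""
    params = dict(config["params"], **{config["cursor_field"] + "_gt": since})
    response = requests.get(config["base_url"],
                            headers=config["auth_header"],
                            params=params,
                            timeout=30)
    response.raise_for_status()
    return response.json()
```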
[00:27:40] Unknown:
Another interesting element of the automation and the platform capabilities that you're focused on is being able to manage the schema mapping and schema evolution, which is 1 of the perennial challenges of dealing with data integration and ETL. And I'm wondering if you can talk to some of the mechanics of how you manage that in the platform and some of the edge cases that you are still having to deal with where it requires a human getting involved?
[00:28:10] Unknown:
I think we've done extensive work in terms of handling the schema. So now, thankfully, we are at a stage where in nearly, you could say, 98, 99% of the scenarios, human intervention is not required, because now we've dealt with thousands of customers and tens of thousands of their pipelines. So we've come a long way in terms of encountering all those edge cases and being able to take care of them. So the way it works is that we have a schema registry where we keep a record of the last set of data that we loaded: what the data looked like, what the data types were, how the structure of the data that came from different sources was. And every time there is a new set of data, there is a comparison against the last known schema, and then we identify what the delta is. And then we have a certain set of rules which tells us, for this set of changes, what is the desired action that needs to happen. Because we also take a configuration input from the user: if, for example, you have new tables coming in, what needs to be done? We can either ignore it or we can ingest it. It depends on the preference of the user. So that's how this whole auto schema mapping works in our scenario.
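Here is a minimal Python sketch of the schema-registry comparison being described: keep the last known schema per source table, diff it against the schema observed in the newest batch, and apply a user-configured policy. The function names and policies are assumptions for illustration, not Hevo's internals.

```python
def diff_schemas(previous: dict, current: dict) -> dict:
    """Return added columns, removed columns, and type changes."""
    added = {c: t for c, t in current.items() if c not in previous}
    removed = [c for c in previous if c not in current]
    changed = {c: (previous[c], t) for c, t in current.items()
               if c in previous and previous[c] != t}
    return {"added": added, "removed": removed, "type_changed": changed}

def apply_policy(delta: dict, policy: str = "auto_create"):
    """Act on the delta according to the user's configured preference."""
    if delta["added"]:
        if policy == "auto_create":
            print(f"ALTER destination: add columns {list(delta['added'])}")
        else:  # "notify"
            print(f"Notify user: new columns detected {list(delta['added'])}")
    if delta["type_changed"]:
        print(f"Notify user: type changes need review {delta['type_changed']}")

last_known = {"id": "int", "email": "string"}
observed   = {"id": "int", "email": "string", "plan": "string"}
apply_policy(diff_schemas(last_known, observed))
```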
[00:29:25] Unknown:
In terms of the conversation about ETL versus ELT and the question of what constitutes best practice, what are some of the aspects of customer education and customer feedback that you need to work with to help them understand what approach you've taken, why you've taken that approach, and some of the ways that they can best take advantage of the capabilities that you're building into, you know, applying some of those transformations before it lands in the data warehouse and how that factors into some of their approach to data modeling and those other questions of, you know, how to make sure that they're able to build and maintain a healthy data platform?
[00:30:05] Unknown:
I think there are 2 sets of users who come onto the platform. The first set is just getting started on that journey. The first thing they want is to get the basics right, which is just getting the data into the warehouse. And as they go along, their analysts will come back and say, you know what? The query is running very slow, or it is costing too much. What can we do about it? In which case, they will go and figure out the structure of the data and how it needs to be modeled so that they can solve those bottlenecks. That's the stage where they realize: what if we can really have the right structure before the data lands into the warehouse itself? So that's kind of an evolution journey as they go along and get to a point where they are trying to optimize the performance beyond just the basics of getting the data into the warehouse. So that's 1 scenario. The second scenario is customers who were using something else, some other solution in the market, and then they faced the same limitation, but those solutions didn't lend themselves to that level of flexibility where they could get an in flight transformation before the data loads into the destination. And that too, not just at a source level, but at an extreme granularity of: for this particular table, for this particular type of records, I want to apply this transformation. And for others, I just want to simply load.
So those users naturally migrate or are looking for a platform that allows these intrinsic capabilities that we offer. So it's a lot about users discovering this on their own journey to optimizing their flow and performance, as opposed to us trying to educate on ELT versus ETL. In my view, as a product, we need to support both scenarios and leave it to the user to determine what's best for them, because there is no right answer. It's a trade off. And in some scenarios, 1 solution is better, while in other cases, the other solution is better. As far as
[00:32:09] Unknown:
the overall process of getting set up with Hevo and how it integrates into the workflow, I'm wondering if you can talk to some of those considerations and also, in particular, the collaboration aspect of how the different personas and roles in the organization will interact with Hevo as data traverses the various stages of the life cycle?
[00:32:30] Unknown:
So we are a completely self serve platform. And in the 1st couple of years, we did not have any sales team, or anyone other than engineers. So the general way of thinking about this whole problem was that we wanted it to be super simple, and we did not want someone to require a demo of the product or someone to walk them through. So that really helped us in terms of simplifying our entire onboarding and the user experience within the product just after people sign up. So today, if someone signs up on Hevo, it takes them only a few minutes to get their pipeline set up and the data to start loading into the warehouse.
And all the complex things in between are either automated or there is, like, a completely guided set of steps for them to really know what needs to be done. So, for example, if it is required for them to whitelist an IP, then within the context of the product they will get, depending on where they are within the product, the steps that they need to follow in order to achieve that. Because we assume that the user may not have a lot of background, and then we need to really help them and guide them so that they are able to achieve the goals that they have.
So that's 1 scenario in terms of how people onboard themselves in a very self serve way. The second aspect is the collaboration once someone signs up on Hevo. Naturally, they start with 1 or 2 sources of data, which are the predominant consideration for them. But as they go along, and as their own maturity in terms of the analytics needs within the organization grows, we've seen more different types of sources getting connected. So someone may start with just some marketing sources of data, but then eventually they want to combine the marketing with the sales data as well. Or they may want to combine this with their purchase data as well, in which case they might start with, say, the advertising channels like Google Ads, Facebook, LinkedIn Ads. And then they may want to add the sales CRM, say, Salesforce.
And later on, they may want to combine that with the Stripe data as well, just to complete the funnel. And later on, they may want to get some product usage data, which might be in MongoDB. So then they will connect to MongoDB and get all these things together. So as the complexity of the questions that the business users are trying to ask increases, more and more sources of data need to be connected and brought into the warehouse. So that's the kind of journey we've seen.
[00:35:07] Unknown:
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a data engineering podcast listener, you get credits worth $5,000 when you become a customer. So we've been talking a lot about the journey of data from source systems into the data warehouse. And my understanding is that you are also providing capabilities for doing what some people are calling data activation or reverse ETL or, you know, operational analytics. And I'm wondering if you can talk to also that aspect of, you know, you've got the data loaded, you've done your transformations and modeling, and then being able to feed that back out into those operational systems where maybe somebody wants their customer data pushed back into Salesforce or HubSpot or whatever kind of system of record the rest of the business is using. And I'm wondering if you can talk to some of the interesting challenges that you've experienced there and some of the ways that you've been able to lean on the kind of data ingestion and integration capabilities to then be able to also do double duty on that kind of egress path?
[00:37:04] Unknown:
I think we started building the 2nd product before this category was even called reverse ETL or anything. What we fundamentally wanted to do was... so there was this whole notion that engineers don't want to talk to a sales team. Right? This is partly true, but not completely true. People want to talk to a sales team when they are really ready to talk to a sales team. If the objective of the sales team is to help the customers navigate their journey of decision making, they want to talk to them. But if the objective of the sales team is to push them to buy, I think that's when, like, the engineers or business folks, no 1 wants to talk. Right? So we wanted a system where we could see where the user is in their journey and decide at what stage someone should reach out to them and who should reach out to them. So if they've achieved certain success within the product, then the discussion could be more around the lines of what's the overall problem that they are trying to solve and what else they would need to make the decision, versus when they are struggling to get some of their data from some specific sources or they are encountering certain challenges.
So rather than waiting for the user to reach out to support, what we wanted to do was know where they are stuck and, depending on that, proactively reach out with a certain solution for how they could change certain configuration and get to the results that they are after. Now in order for us to be able to do that, we wanted to get the product usage data on Intercom, which is a medium of chat support, and also on our CRM. So we were anyway getting all the data into our warehouse, and we wanted to get this back out. So we built this internally. And I ended up having discussions with a few of the other folks, and they said that this looks great. Why don't we just convert this into a product? Like, I mean, some of the friends I was speaking with, they wanted to use the same thing. And I said, no, it's not in a state where you could actually use it, because we just did a bunch of scripts to make that happen. I figured that this could be a product in itself.
And we started on that journey very organically, saying that people want product usage data to come to either their sales CRM, or they want this data to go into their help desk so that someone can really have complete context while they are interacting with the customer. And later on, we figured out by talking to more customers that, ultimately, you don't want your warehouse to become yet another silo. So if you get all the data into your warehouse and the only thing that you can do with that data is just build dashboards, which are kind of reactive, then towards the end of the month you figure out that you did not hit your targets.
It's of no use because it's already happened. Whereas if you can, almost in near real time, trigger certain actions to certain individuals who can actually influence the outcome, it can have a huge impact, because all the heavy lifting of consolidating the data into the warehouse and building certain insights out of it is already done. Now all that you need to do is make this insight accessible in the right context to the right person at the right time. And that's how this whole concept of reverse ETL came into the picture. And as we were building it, we figured that, like, this is coming up as a new category in itself, which I felt logically did make a lot of sense to me, because it's just a very natural extension where a solution which is bringing the data into the warehouse should naturally lead the data back from the warehouse into different systems. It's almost like an Internet line which can upload the data and download the data. You typically don't have 2 Internet connections, 1 for uploading the data and another for downloading the data. The same Internet line does that for you. So it's almost very similar to that.
[00:40:59] Unknown:
The other interesting element of building and supporting a platform that handles the kind of traversal of data across customer systems is the problem of ongoing maintenance and evolution of those workflows and being able to monitor and alert on failures and be able to warn of potential breakages when you make changes, which is often where things like metadata and lineage come in. And I'm wondering if you can talk to some of that aspect of how you help to support the kind of deliberate evolution and maintenance of those pipelines and workflows so that users don't have to kind of be surprised when all of a sudden they go to load a dashboard or go to go into their Salesforce where they've been replicating data and things don't look right or, you know, the data is stale and some of those overall aspects of ongoing maintenance.
[00:41:49] Unknown:
I think this is, like, the most difficult part of this entire business, making sure that nothing ever breaks, irrespective of what happens. And it's not just about how many smart engineers you put in. There is a natural product evolution in terms of how many edge case scenarios you have encountered and can really anticipate, because the total number of combinations that you can think of is very large. On 1 side, you've got 150 different connectors, and each may have their own version. So someone may have 1 version of MongoDB versus the other version of MongoDB.
The second variable comes in the environment in which you're operating. Someone has set up MongoDB on AWS, whereas someone else has set it up on GCP or Azure. And the third element is your entire configuration, because you may have set up the user privileges in a certain different way compared to someone else. And it could lead to 150 into 4 into some very large number. Maybe let's take a number for the sake of simplicity: there are 25 different types of privileges that people typically assign to a user. So the total number of combinations could be very huge. Right? And suddenly, you figure out that there was a certain problem in a certain specific version of the database that needs to be handled differently. So the code complexity continues to go up over a period of time, because now you are trying to have not just 1 integration for 1 particular type of source. There are various different types of integrations for different versions of it. So that kind of is a very complex problem in this particular scenario.
So by default, like, when someone is building a product for data integration, I think 1 should assume that there are no happy cases. So you should, by default, assume that anything that can possibly go wrong will definitely go wrong. Right? And then build your systems according to that. So if you are connecting to a source, don't assume that it will get connected. Assume that it will time out. Assume that something else will go wrong. And then how do you build and design a system that makes sure that if it can be resolved, it automatically gets resolved. If not, then determine, based on certain conditions, what action needs to be taken. Does the system need to inform the user, or does it need to inform someone in the control tower? So we have a control tower team which kind of monitors all these things on behalf of the customers.
So those are the things, how you kind of get to that level of robustness and maturity. So in the early days, it was super hard. Like, every now and then we would see some customers facing this problem. But now that we've seen all types of different edge cases, working with thousands of companies across the last 2 to 3 years, now we understand how it needs to be managed to make sure that customers can rely on the platform to deliver accurate data for them.
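A minimal sketch of the "assume it will fail" posture described above: wrap the source connection in timeouts and bounded retries with backoff, and escalate when retries are exhausted. The function names, exception types, and thresholds are illustrative assumptions, not Hevo's actual implementation.

```python
import random
import time

def connect_with_retries(connect, max_attempts=5, base_delay=1.0):
    """Assume the connection can time out or fail; retry with exponential
    backoff and jitter, then escalate instead of silently giving up."""
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except (TimeoutError, ConnectionError) as exc:
            if attempt == max_attempts:
                # Exhausted retries: hand off to an operator / control tower.
                raise RuntimeError(
                    f"Source unreachable after {max_attempts} attempts") from exc
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            time.sleep(delay)

# Usage: pass in whatever actually opens the source connection, e.g.
# connection = connect_with_retries(lambda: open_source_connection(config))
# where open_source_connection is a hypothetical helper for your source.
```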
[00:44:53] Unknown:
In your work of building the platform and building Hevo Data and given your experience of trying to make sure that you had insights and information about your previous business, what are some of the ways that you're using Hevo Data to help build and gain insight into Hevo Data and understand where to take it going into the future?
[00:45:12] Unknown:
I think an interesting aspect of this is that I did not come from an enterprise SaaS background. So I had, like, no background, which has certain advantages and certain disadvantages. I come from a consumer Internet background, where the user doesn't talk to anyone. If you like it, you sign up and you use the product. If you like it, you pay for it and you continue to use it. The moment you stop seeing value, you cancel it. Right? It's almost like a Netflix kind of frame of reference. So the way we ended up building the product was of a very consumer grade, which was very unlike how enterprise software typically gets built. Right? So that fundamental philosophy of trying to build a very lovable product is something that comes very inherently naturally to us, because by default we assume that we don't have a salesperson who will go and talk to the customer. The product has to just work. If it doesn't work, we have no shot at it. Now it may be a large customer who otherwise would have paid, say, $100,000, but it just has to work in the first go. So our obsession about optimizing the entire user journey and funnel is very unlike how enterprise software is typically built. That's 1 learning. The second aspect is around how we thought about our entire go to market.
Because if we were traditionally coming from a SaaS background, we would think, like, hey, we've got to have a sales team or a marketing team. Whereas because we came from a consumer Internet background, we thought: when you have a problem, what is the first thing that you do? You go and search over the Internet. We said that when someone faces a problem where the solution could be data integration, we should be found by them. And today, we dominate nearly all the search terms so that users discover us. So we don't have to spend a lot of money on sales and marketing. Instead, we channel that capital to building a great product, which over a long period of time compounds and delivers better value for the customers.
[00:47:16] Unknown:
In your work of building the platform and working with your customers, what are some of the most interesting or innovative or unexpected ways that you have seen Hevo used?
[00:47:32] Unknown:
Reaching out to customers proactively, based on their product usage data, is something that we've seen really generate disproportionate value for the customer. So just getting data and looking at the dashboard has a certain impact on how you operate. But, for example, when your sales team reaches out to the customer, or your support team reaches out to the customer, because you gathered all the information about the customers and their interaction with the product, then for someone in the support team there is a proactive ticket saying that this user signed up 1 hour back, looked at the pricing page, did x y z steps, and after 4 hours they have not taken this step, which 90% of people take.
And then someone from the support team directly reaches out saying, hey, I saw that you did x y z, but you didn't complete step z. Here is the documentation. Here is a video link. And in case you need any help, here is a link to schedule a call with us. That kind of generates a wow moment for the customer, and that is where we've seen a lot of impact getting created with Hevo.
[00:48:33] Unknown:
In your experience of building the business and working in this space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:48:42] Unknown:
I think the technical complexity of what it takes to build a truly reliable, robust data integration platform is something I didn't fully appreciate. You feel that there is this third-party system, we are smart engineers, we'll go and build an integration with them, they have a public API. But the reality is that what the API docs describe is often far from the response you actually get when you call those APIs, and you can never be sure. For example, the docs may say that you will get time in PST, but what you actually get is GMT, and things like that, which you only discover while figuring out what went wrong. Or someone says, hey, we had 5 billion records and we are getting one record less in our destination, and you have to go and figure out where exactly that record went. At times your best engineers have to spend sleepless days figuring out what happened to that one record, only to discover that the user applied a certain transformation and it's waiting in a queue, or something like that. So the technical complexity that needs to be handled to build a truly reliable solution, one on which customers can confidently make decisions based on that data, is very, very large. That came as a big surprise to us in the early days, but thankfully we learned what it takes, and over the last two to three years we've really nailed it down.
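As one hedged illustration of the defensive handling this requires, the sketch below normalizes API timestamps to UTC instead of trusting the documented time zone. The zone constants and function name are hypothetical examples, not part of any specific connector.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# The docs may claim timestamps are in one zone while the API actually returns
# another, so the assumed zone is an explicit, configurable input.
DOCUMENTED_ZONE = ZoneInfo("America/Los_Angeles")  # what the docs claim (PST/PDT)
OBSERVED_ZONE = timezone.utc                       # what responses were observed to carry

def normalize_timestamp(raw: str, assume_zone=OBSERVED_ZONE) -> datetime:
    """Parse an ISO-8601 timestamp and return it in UTC.
    Naive timestamps are interpreted in `assume_zone` rather than being trusted
    to match whatever the API documentation says."""
    parsed = datetime.fromisoformat(raw)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=assume_zone)
    return parsed.astimezone(timezone.utc)

# The same wall-clock string means very different instants depending on the
# assumed zone, which is exactly the class of bug described above.
print(normalize_timestamp("2022-09-01T10:00:00"))                   # interpreted as UTC
print(normalize_timestamp("2022-09-01T10:00:00", DOCUMENTED_ZONE))  # interpreted as Pacific time
```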
[00:50:17] Unknown:
And for people who are evaluating the platform, what are the cases where Hevo might be the wrong choice?
[00:50:26] Unknown:
So in cases where customers have the vast majority of their data in an on-prem setup and are just starting to graduate to the cloud, so they need to manage both in a hybrid structure, I don't think we are really designed to cater to that segment of customers. For anyone who has their data in the cloud and a warehouse in the cloud and wants to solve this problem of fragmented data, I think Hevo is the right solution for those scenarios.
[00:50:55] Unknown:
As you continue to build and grow and maintain the Hevo platform, what are some of the things you have planned for the near to medium term, or any particular problem areas or feature capabilities that you're excited to dig into?
[00:51:09] Unknown:
I think a lot of what we're trying to build in, let's say, the next 6 to 12 months is around making the platform more and more robust, so that after customers have set everything up, they never have to come back to the platform to check whether things are working. That's one aspect. The second aspect is ROI. Early adopters don't care as much about efficiency or ROI, in terms of how much things really cost on the pipeline side or on the warehouse side. But the natural evolution is that, as a business, at some stage you start to question: I'm investing $100,000 in the data infrastructure I've set up. What's my ROI?
Right? And you realize that a whole bunch of cost goes into doing things that are not really adding value, because you might be bringing in tons of data from different sources that most people are not using. So how can we proactively identify that and suggest to the user: this is a dataset you are bringing in, and in the last 30 days no one has really used it. Do you still want to replicate it continuously, or do you want to replicate it once a month, in which case it will lead to a cost saving of x dollars? I think those things can be built into the product and will lead to a higher ROI for the customer. Of course, at times that may not be a good thing for us as a business, but the general belief is that ultimately the market wins. If you don't optimize for your customers, someone else will, and customers will choose the solution that is designed for them to win. So it is better that we proactively work in that direction and make sure that customers are not paying for things they are not going to use.
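A minimal sketch of that heuristic, assuming warehouse query history is available and using made-up table names and costs, might look like the following. It only illustrates the suggestion being described; it is not a feature of the product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class SourceTable:
    name: str
    last_queried_at: datetime     # assumed to come from warehouse query history
    monthly_sync_cost_usd: float  # pipeline + warehouse cost attributed to this table

STALE_AFTER = timedelta(days=30)  # assumed threshold for "unused" data

def suggest_savings(tables: list[SourceTable], now: datetime) -> list[str]:
    """Flag tables that nobody has queried in the last 30 days and estimate the
    saving from pausing them or syncing monthly instead of continuously."""
    suggestions = []
    for t in tables:
        if now - t.last_queried_at >= STALE_AFTER:
            suggestions.append(
                f"{t.name}: unused for 30+ days; pausing or monthly sync "
                f"could save roughly ${t.monthly_sync_cost_usd:,.0f}/month"
            )
    return suggestions

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    tables = [
        SourceTable("crm_contacts", now - timedelta(days=2), 120.0),
        SourceTable("legacy_clickstream", now - timedelta(days=45), 800.0),
    ]
    for line in suggest_savings(tables, now):
        print(line)
```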
[00:52:59] Unknown:
Are there any other aspects of the work that you're doing at Hevo and the overall space of data integration and data pipelines that we didn't discuss yet that you'd like to cover before we close out the show? Yeah. I think the whole part around
[00:53:12] Unknown:
One aspect is the fragmentation and defragmentation, or the bundling and unbundling of the stack. That's one aspect that gets discussed a lot, both internally within Hevo and outside. The second aspect, which I personally care about a lot, is the UX of all the products that exist today. I think the current UX is good for the early set of adopters, but as we cross the chasm and this unfolds into the mainstream market, we will see a lot of these products take on a very different form factor aimed at simplifying things, so that nearly everyone who wants to use data to make decisions can do so irrespective of their technical competency.
[00:53:57] Unknown:
Yeah. Definitely agreed on the polish aspect of the platforms that we have available. As you said, a lot of early adopters or strong engineering teams are able to take these systems and build things that power their businesses. But as you get into the more mainstream market, the broader set of adopters and engineering teams who don't necessarily want to become experts in absolutely everything, there's a significant need to add more assistance and improved user experience to these platforms, another layer of abstraction and understanding, so that you don't have to know everything about distributed systems, or all the different ways your pipeline might fail, just to get something that works, runs, and that you can trust.
[00:54:47] Unknown:
Yeah.
[00:54:48] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap on the tooling or technology that's available for data management today.
[00:55:03] Unknown:
I think, again, I'll double click on the simplicity part of it. One thing that is a little unrelated, but serves as a North Star vision for me: imagine that if smartphones and the iPhone did not exist, 99% of the world's Internet population would disappear. Rather than waiting for people to learn new technology and then solve their problems, the core role of technology is to get to a form factor where it can truly unlock the market. I think that iPhone moment for the data space is yet to happen, where we are no longer confined to some 10,000 companies knowing how to really leverage data, because the total market is much, much bigger and we've just scratched the surface of it. So if we really want to penetrate and become mainstream, then we need to think hard about how we simplify things.
[00:56:00] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing at Hevo Data and your experiences building this company and platform. It's definitely a very interesting product, and it's great to have people out there who are focused on user experience and making it a simpler process to get data to where it needs to be. So I appreciate all the time and energy that you and your team are putting into that, and I hope you enjoy the rest of your day. Great. Thank you, Tobias. It was a pleasure speaking with you today. Thank you for listening.
Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Guest Introduction
Manish Jethani's Background in Data
Building Hevo Data and Solving Data Integration Problems
User Personas and Platform Design
Lessons Learned from Previous Roles
Benefits and Challenges of Owning the Data Flow
Competitive Landscape and Differentiation
Platform Architecture and Real-Time Data Processing
Optimizations for Variable Traffic Patterns
Automation in ETL and Data Pipelines
Handling Data Transformations and Integrations
Managing Schema Mapping and Evolution
Customer Education and Feedback
Onboarding and Collaboration in Hevo
Reverse ETL and Data Activation
Ongoing Maintenance and Evolution of Workflows
Using Hevo Data to Build Hevo Data
Future Plans and Enhancements
UX and Simplification in Data Products
Biggest Gaps in Data Management Tools