Summary
Every business aims to be data driven, but not all of them succeed in that effort. In order to be able to truly derive insights from the data that an organization collects, there are certain foundational capabilities that they need to have capacity for. In order to help more businesses build those foundations, Tarush Aggarwal created 5xData, offering collaborative workshops to assist in setting up the technical and organizational systems that are necessary to succeed. In this episode he shares his thoughts on the core elements that are necessary for every business to be data driven, how he is helping companies incorporate those capabilities into their structure, and the ongoing support that he is providing through a network of mastermind groups. This is a great conversation about the initial steps that every group should be thinking of as they start down the road to making data informed decisions.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing Tarush Aggarwal about his mission at 5xData to teach companies how to build solid foundations for their data capabilities
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at 5xData and the story behind it?
- impact of industry on challenges in becoming data driven
- profile of companies that you are trying to work with
- common mistakes when designing data platform
- misconceptions that the business has around how to invest in data
- challenges in attracting/interviewing/hiring data talent
- What are the core components that you have standardized on for building the foundational layers of the data platform?
- providing context and training to business users in order to allow them to self-serve the answers to their questions
- tooling/interfaces needed to allow them to ask and investigate questions
- most high impact areas for data engineers to focus on in the initial stages of implementing the data platform
- how to identify and prioritize areas of effort
- useful structure of data team at different stages of maturity
- What are the most interesting, unexpected, or challenging lessons that you have learned while building out the business and team of 5xData?
- What do you have planned for the future of the business?
- What are the industry trends or specific technologies that you are keeping a close watch on?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours or days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage, and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt and seamlessly plugs into CI workflows.
Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they'll send you a cool water flask. Your host is Tobias Macey. And today, I'm interviewing Tarush Aggarwal about his mission at 5xData to teach companies how to build solid foundations for their data capabilities. So, Tarush, can you start by introducing yourself?
[00:02:04] Unknown:
Yeah. Absolutely. First of all, thank you, Tobias, for having me on the show. Super excited to be here. Just a little bit about myself. I've been in the data space for the last 10 years. I started off in Silicon Valley at salesforce.com. Back then, 10 years ago, Salesforce didn't really have a data team, so I got to be the first data engineer on the analytics team over there. That team is probably now massive. Yeah. Most recently, I was at WeWork, which is a super colorful company, but we did some really cool stuff around data. And I got to scale the data team up from 2 to 100 people. So I really spent the last 10 years in data.
Yes. I'm super passionate about what's happening, and I think this is a really, really exciting time for the data space in general.
[00:02:55] Unknown:
Yeah. It's definitely pretty remarkable how much the overall ecosystem has grown just in the past year or two, let alone the past decade.
[00:03:03] Unknown:
Yeah. You know, one of the things right now is just seeing what remote work is doing to, you know, the data space. The fact that, you know, people are not in the office. It's putting a highlight on what visibility the company needs around its systems and around its people. So I think we're one of the few industries which has been sort of positively affected.
[00:03:29] Unknown:
Yeah. It's definitely an interesting time for everybody and an interesting time to be in data. And you mentioned how you first got involved in data management, but can you give a bit of an overview about what it is that you're building now at 5xData and some of the story behind your decision to go that direction?
[00:03:44] Unknown:
Just a little bit of context, I left WeWork around 6 months ago. And, you know, I was still advising a few companies on their data strategies. And what we really realized is it doesn't really matter what you're trying to do. You know, it doesn't matter if you're a fintech company or real estate or marketing or ecommerce or a traditional SaaS company. Yeah. If you're serious about scaling your business, at some point, you are going to need to build a data foundation. You're gonna want to have visibility into your go-to-market strategy, and more importantly, you are going to want to be able to leverage data to build products that your customers love, to discover hidden insights in your data. So, you know, what we've seen is the only difference is when you invest in it. Now if you're an ecommerce business, you might go do this at 7 figures. If you're a traditional SaaS company, you might do this even pre-revenue.
But what we realized is all companies need to have this foundation. And the second thing is, it's not very easy to go invest in data. Right? Data hires are expensive. And more often than not, what we find is that companies rush to, you know, gain insights from the data. So they might go hire data scientists or data analysts, and these folks are going to go focus on the insights layer. And to start off with, that works, though what we see is that, you know, at some point without a data foundation, everything comes crashing down. So think of it like a skyscraper. Right? If you wanna build an iconic skyscraper, you need to spend some time building a foundation.
Otherwise, it doesn't matter how much steel and cement you have. Without the foundation, it just doesn't scale. You know, what we focus on at 5xData is how do we teach companies to build a foundation that they can build on top of.
[00:05:53] Unknown:
Yeah. And I think that another interesting element in this overall equation is the availability of a number of different hosted and managed systems for being able to do all kinds of data operations that were previously either very complex or very expensive to do in house. And so a lot of companies, particularly, you know, startups who are early stage, you know, out of the box, go with things like Fivetran and Snowflake and Looker for being able to get their full end to end visibility solution in place, but they don't necessarily have the expertise in house to understand some of the complexity that accrues around the data ecosystem where they'll just start pushing data into the data warehouse, and then they might do some transformations or modeling on it, but maybe not in ways that are scalable or maintainable in the long term.
And then they might hit a point of complexity where they spend a lot of their time paying down the technical debt rather than being able to move forward. And I'm wondering what your experience has been in terms of working with or seeing some of the approaches that companies take who might go down this path of using these managed services and some of the incidental complexity that grows up around it.
[00:07:12] Unknown:
And, you know, I really like that you sort of mentioned Looker, Fivetran, Snowflake. Obviously, these are really, really good tools, and putting them together and working with them is, you know, a lot of the secret sauce. I like going back to really the fundamentals. Right? Like, why are we doing this? You know, what's the goal from all of this stuff? And what we're really convinced by is this idea that if you can answer 80% of your questions in a self-service manner, then you're far more likely to succeed. So what does that really mean? Right? Like, if an intern at your company, you know, someone who's just joined you, can answer really complicated questions. How effective was this campaign versus the previous campaign we ran last Christmas season? What is the LTV of those users?
You know, if anyone in your company can start to answer these questions in a purely self-service manner, number 1 is you're giving your employees autonomy to answer questions for themselves. So now this whole idea of fail fast at a startup is built on this model that employees can answer questions. They can come up with their own hypotheses, test them, and iterate on them. So, you know, the first goal is really to give employees as much autonomy as possible. And what this really does is it frees up your data team so that they can focus on, you know, needle-moving work. I think data teams are really well positioned to be able to really find gold in your data and to focus on, you know, what products should you be building next, or sort of what are the interesting areas?
And if your teams are really bogged down by, as you said, building models and, you know, answering questions and backlogs and all of the typical stuff which teams are spending time on right now, it doesn't quite work. So the goal has always been around autonomy for every employee in your company, as well as focus for the data team to focus on the high value stuff. And the way we see it is we break it down into, you know, 3 different pillars, which we teach in this fundamental program. And, you know, the first pillar is, you know, automated data ingestion.
So, you know, traditionally, what we find is that most companies spend a lot of their time building pipelines and managing these pipelines. Whereas, you know, with Fivetran, what we're seeing now is really the rise of ELT. So the first thing we focus on is moving away from ETL into ELT. And what that allows you to do is, you know, completely automate the EL steps, which is a big advantage. The second thing we focus on is the data modeling layer. Now a lot of your source systems have really been structured for the different applications. Right? So your databases are structured for your application. Your front end tracking is structured in a way which makes sense for those tools.
We help you, you know, go back to what are the questions you're trying to answer, which again goes back to how are your employees gonna be using this, and how do you design a data modeling layer which can answer all of those questions in a pretty robust way? So instead of having thousands or hundreds of source tables, how do you do this with a much smaller set, which becomes a lot easier to manage. And the third area is really the self-service part. Right? So what are the BI tools? Again, there's no need to reinvent the wheel. You know, there is a playbook which has made sense. The big companies have used it to scale to tens of thousands of employees.
So, you know, how do you set up your BI tools in a way such that anyone can answer questions? So just to recap, you know, 3 pillars, ingestion, data modeling, as well as self-service. And what we're really good at doing now is, you know, making this into a playbook, which we can teach in 12 weeks. So we help companies build this foundation from scratch in 12 weeks.
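To make the ETL versus ELT distinction above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the 5xData playbook or how Fivetran works internally: SQLite stands in for the warehouse, and the table names (`raw_orders`, `customer_revenue`), the columns, and the sample rows are all hypothetical. The point it shows is the ordering: source records are landed untouched first (the EL steps), and the modeling happens afterwards as SQL inside the warehouse (the T step).

```python
import sqlite3


def extract_and_load(conn, raw_rows):
    """EL steps: land source records as-is, with no cleaning or reshaping."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_orders "
        "(order_id TEXT, customer TEXT, amount_cents TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", raw_rows)


def transform(conn):
    """T step: build a modeled table from the raw table, inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS customer_revenue")
    conn.execute(
        """
        CREATE TABLE customer_revenue AS
        SELECT customer,
               SUM(CAST(amount_cents AS INTEGER)) / 100.0 AS revenue
        FROM raw_orders
        WHERE status = 'paid'
        GROUP BY customer
        """
    )


# Hypothetical source rows; amounts arrive as strings, as raw exports often do.
raw = [
    ("o1", "acme", "1250", "paid"),
    ("o2", "acme", "750", "paid"),
    ("o3", "globex", "4000", "refunded"),
    ("o4", "globex", "999", "paid"),
]

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
extract_and_load(conn, raw)
transform(conn)

# The modeled table is what a BI tool would be pointed at, not raw_orders.
revenue = dict(conn.execute("SELECT customer, revenue FROM customer_revenue"))
print(revenue)
```

Because the raw table is kept unchanged, the transform can be rewritten and rerun later without re-extracting anything from the source, which is one reason the load step is comparatively easy to automate.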
[00:11:25] Unknown:
That covers one of the challenges in the space, which is that because there are so many changes happening so fast, it can be difficult to keep up and to understand what choices to make, how to structure your systems because there's the fear of missing out, and then there's also the fear of picking the wrong tool. So you're kind of torn between, I wanna move fast and just get something working, but then I also wanna make sure that I don't move too fast and make a mistake that's going to hamstring me, you know, 6 months or 6 years down the line.
[00:11:54] Unknown:
Yeah. Absolutely. Very often when companies are hiring or assigned to build out a data team, they're solving for immediate problems. Right? We wanna get value from our data. And what they're also doing is, you know, if you're just getting started, you know, you're probably not gonna go hire a VP of data or someone with, you know, 15 years of experience. You're gonna start with someone with a few years of experience who probably understands one part of the stack, but is solving for, you know, a local maximum instead of, you know, the big picture. So we work on this idea that there's no need to reinvent the wheel, you know. There are certain best practices needed and, you know, we wanna make it super super easy, almost like a no brainer that here's the program, this is how you do it, and you pick a project. Right? So if your goal is to add visibility into your go-to-market strategy or to, you know, find parts of your product which customers are using, great. Pick that as a project.
And while you implement the foundations which we teach you, you work towards that project. So at the end of 12 weeks, you've solved that problem, and now you really know how to go do this in future. So instead of, you know, selling you a fish, which is what a lot of traditional consulting companies do, our goal is really to sort of teach you how to fish so that you can go do this yourself. And, you know, once you get to this sort of foundational layer, that's where really all the fun stuff starts to happen. That's when you can really go a lot deeper and, you know, machine learning or, you know, more in-depth analysis start to sort of really come alive for you. But you can't really do that unless you're at a certain level of sort of maturity, which is what our goal is.
[00:13:46] Unknown:
And so in working with these different companies, one question is, is there a particular profile of company that you're specifically looking to work with? And if so, sort of what are the characteristics? And then also in terms of the companies that you've worked with either at 5xData or in previous roles, what do you see as being the impact of the particular industry or vertical that they're in on the types of challenges that they're facing in being able to reach this goal of being data driven?
[00:14:19] Unknown:
So, you know, I'll start with the first question. As I mentioned earlier, you know, it doesn't really matter what industry you're in. You know, at some point, you are going to need to have a data foundation in order to take things to the next level. You know, we go after Series A or B companies. These companies are at a point where they have proven out they have a business, and now they've raised money to basically go scale out the business. So at this point, they start to go hire a data team, which is the perfect time to do a program like ours, which just gives you the best practices and, you know, sort of sets you up for success. So a lot of the companies we work with are just getting started with data.
They have a data team of, you know, under 10 people, and they're either just getting started or they have basically somewhere down the stack of the data train. We also work with a lot of companies that are traditionally not tech companies, but they could be in other industries like real estate, education, coaching has been a big one recently for us. Ecommerce is another one where they want to leverage data in order to take things to the next level. So we find that, you know, at some level, a lot of companies start to plateau. You know, hustle from entrepreneurs gets you from 0 to 1, but at some point, it starts to plateau and, you know, you lose visibility into what's working for your business and what's not. And without these systems, it's really difficult to give your employees clear directions and measure their success.
And vague input leads to vague output. So we work with a lot of companies which have hit a plateau and now need to go take things to the next level, and this is very often the first investment in data. You know, one of the things we've seen is that data hires are expensive. Right? Especially in America where, you know, average cost to company is $100,000 to $150,000. It's an expensive sport. Obviously, our programs are much cheaper than that. So for a lot of our companies, we're the first investment in data. So I think that's the first part of the question. And the second part was around what value are these companies trying to get to and sort of what are the obstacles they are facing.
You know, I think I did speak about both of these a little bit earlier, but, you know, number 1 is very often, the obstacle these companies are facing is that they are making these hires. These hires start to go build out some really interesting stuff. But what happens is without these foundations, at some point, everything starts to get slower. So, you know, how many people have found themselves in situations where, you know, easy analyses start to take longer and decision making starts to get bottlenecked by data instead of being enabled by data. And stakeholders start losing trust in numbers because you have multiple sources of truth. And often, you know, small mistakes sort of start to enter our stack, and we enter a world where we have prioritization based on who screams the loudest.
So, you know, all of these are really telltale signs that what you are building and the way you're building it is not scalable, and you really need to go invest in foundations. And hiring new engineers or, you know, a quick rearchitecture of the stack is fixing the symptom, not the fundamental problem that you haven't thought through this holistically, and you haven't implemented a system which is really scalable. So, you know, this is a big problem which companies trying to invest in data are facing. The other one is around those companies, you know, who just don't know any better. Right? Entrepreneurs build companies.
You know, the entrepreneurs were experts at what they did. But, again, you don't know what you don't know. And at some point, if you don't know that you need to invest in this, you hit a plateau. And, you know, we help go in there and add all this visibility, you know, give people autonomy. And at that point, they can take things to the next level.
[00:18:28] Unknown:
The first step that most of these companies need to take is actually bringing on some capacity for people who are able to actually build and manage these systems, either by bringing someone who's already internal up to speed with the technologies and with the needs of the business or by hiring externally. And there's sort of the catch-22 there of you need someone who has expertise to be able to evaluate the potential for somebody who you're looking to fill this position, but then you need to fill this position because you don't have the expertise. And I'm curious what you see as some of the useful strategies for businesses to be able to attract talent or identify talent internally, and then how best to evaluate their potential for being able to help the organization succeed in their data projects?
[00:19:19] Unknown:
We are big believers in this concept of doing data in house. I think data is a core competitive advantage to your business. So, you know, out of anything, this is one of the things that you do wanna keep in house. Obviously, working externally is great because that helps you accelerate your timelines towards projects, which makes sense. But we really push for teams investing in internal data resources. Now, yeah, this has been pretty tricky for companies just given the fact that data is relatively new. Right? Well, I think probably 9 or 10 years ago is when, you know, data engineers and data scientists started being recognized as real professions. So because of this, we're still figuring out what are some of these best practices and, you know, what is the best way to go organize this stuff. So I think that makes hiring for data particularly challenging.
Where we help is, you know, our program really gives you a lot of the foundational stuff. Right? So no need to reinvent the wheel. You sort of probably get started with someone who has experience, but sort of hasn't put this together end to end. So, you know, what we say is you're looking for, again, sort of depending on what you can hire and, you know, where you are in this journey. We're looking for, like, a mid level data hire as a minimum to come do our program. So, you know, very typically, you will look for sort of a full stack data hire who can do some data engineering stuff, but also some analysis stuff. And our program will really get into the specifics of exactly what to do, what are the step by step instructions, what are the best practices, and also access to our faculty. So as these companies are implementing them and they're stuck, we can go in there and help them.
So if you do wanna start cheaper, you know, our minimum requirements would be a junior analyst sort of who understands some basic Python and SQL. But what we really recommend is a mid level full stack data hire, which would give you, you know, the best ROI in a program like ours.
[00:21:34] Unknown:
Once the organization has one or multiple people on staff who are able to manage the data systems and maintain them, what are the core components or the foundational layers of the data platform that you have found to be most useful or most broadly applicable and that you recommend for these different organizations who are just starting on the journey of being data driven?
[00:21:59] Unknown:
I think there's sort of 2 parts to it. Right? Number 1 is, what is the, you know, infrastructure stack? And as you mentioned earlier, you know, Looker, Fivetran, Snowflake, dbt, you know, all of these are what I call best in class vendors. You know, we are partnered with all of these guys. If you're on the Google Cloud or, you know, you have a slightly different BI setup, that's fine. A lot of our stuff is built on first principles. So while there are certain tools which we recommend, there are sort of multiple ways to go do this. So a lot of the time historically has been spent on just operating, maintaining these tools, and building analysis on the tools. You know, we find that, sort of, typically, companies are spending 80% of the time on just the ad hoc stuff, maintenance, backlog, answering questions for the business, and, you know, 15 to 20%, if that, on the needle-moving work.
We really wanna flip that around. If your data team can now spend about 20% of its time keeping these tools alive and maintaining them. And, you know, every time an ad hoc question comes in from the business, this is an opportunity to add it on to the data modeling layer and expose it as self-service for the business, so that more and more people can start to answer questions off this. And instead of your data scientists, as you get more advanced, going all the way back to the raw data, if they can focus on this model layer, which we call the business layer, everyone in the company can start to use this. It's gonna make maintenance a lot easier. Instead of having these massive fan out problems and combining application logic inside your transforms, you start to build a very clean layer where both your stakeholders as well as your data team can go to start consuming data. So, you know, we wanna shift that. We wanna shift the 80/20 into the 20/80 so that now the data teams can actually spend 80% of their time on going deeper into insights, on working with the product teams and figuring out what is the market research, what are our customers using, and what are some of the features which we should build next.
So a lot of the data science here, the analytics work, which just frankly, a lot of companies want to do, but in reality, they never end up sort of doing that. And that's where we want them to be focusing their time.
[00:24:31] Unknown:
And then on the self serve aspect of things, what have you found to be some of the context or training that's needed for different users within the business to be able to effectively ask and answer questions of the data that they have and understand how to apply the information that they receive from that, particularly given the potential for things like conflicting concepts of how to distill a given metric, or different data sources might represent different information separately where they might have different scales or different contexts or representations of the data and just how best to handle the modeling in the warehouse layer to ensure that the self serve layer is actually able to be effective and not cause any confusion.
[00:25:22] Unknown:
So just to recap, you know, how do you ensure success in the self-service layer, and sort of what are the tactics over there? And what we find is that self-service isn't a new concept. Right? We've been using this all the time now. Right? Every time you go to a kiosk at an airport and print your boarding pass, or every time you pay for your own groceries at Whole Foods or, you know, at any of these stores, you are using self-service. What's happening is that this has become so much more complicated because the data modeling layer hasn't been set up in a way to answer business questions. It's still modeled in a way for the different applications, and it's just stitched together.
So when you combine Mixpanel with your CRM, with your application databases, and just expose it inside Looker, it becomes really, really complicated to answer basic questions because you need to have the context as an engineer on how this is all stitched together. What we focus on is don't expose the raw data inside your BI tools. Actually work backwards and try and figure out, hey, what is 80% of the questions we're trying to answer? What does the marketing team wanna answer? What does the sales team wanna do? And what does the engineering team do? And work backwards and design a model which can answer these questions.
And then, you know, what the data team focuses on is building out the transforms from the raw data into this layer. And at that point, when you expose this clean layer inside your self-service tools, it actually becomes pretty easy and intuitive to go use this, you know. We see a lot of companies and even Looker, for instance, talking about models like train the trainer and have these people in every department who know how to use these dashboards, and they become the role models and, you know, pushers for using self-service and, you know, some of that kinda makes sense, but I believe that success in the data modeling layer is a much, much better indicator of how easily self-service will be adopted than any training strategies which you can go do later.
[00:27:48] Unknown:
RudderStack is the smart customer data pipeline. Easily build pipelines connecting your whole customer data stack, then make them smarter by ingesting and activating enriched data from your warehouse, enabling identity stitching and advanced use cases like lead scoring and in app personalization. Start building a smarter customer data pipeline today. Sign up for free at dataengineeringpodcast.com/rudder.
As you mentioned, specifically, not just dumping a bunch of raw data into the business intelligence tool because then there are just too many areas for misinterpreting that. And I think that, too, working backwards from the questions that you're trying to answer is helpful because if you're starting from the raw data and then trying to figure out how to model everything, then it can be confusing to try to figure out what are the high priority items. How can I structure this data in a way that is going to be useful, and just understanding, like, what are the questions that are actually already being asked and then answering those rather than trying to design things to be what you might need rather than what you concretely need right now is a useful way to frame the problem?
[00:28:57] Unknown:
Absolutely. You know, I think that overcomplicating your modeling layer is probably the worst thing you can do. It makes it so much harder for the business to just answer the basic stuff. Going back to WeWork, for a long time we had 3 core activity streams, which helped answer the majority of the questions. We had one which focused on all the activities someone can do before they become a member. The second one was all the activities they do after they become a member. And the third one was information on our buildings: capacity, opening up, occupancy, all of that fun stuff. Through these 3 tables, we could answer most questions, even super complicated stuff like how effective one marketing campaign was in driving users, how long those users stayed once they signed up, and how many posts they made on our internal social network within the first month. Right?
That is 50 different data sources. You know, if you go back to raw data, that's probably a few thousand lines of code. We could answer that in, like, 50 lines or less. So it all boils down to this: for most businesses, you shouldn't need more than 5 tables in your data modeling layer to answer the majority of the stuff. And, you know, obviously, use these tables as the core tables, and join them with some of your fact tables to get more details. But from just a raw capability standpoint, you can do a lot of cool stuff with activity streams. And so that's how you should be thinking about modeling your data.
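To make the activity stream idea concrete, here is a minimal Python sketch of the campaign question from this answer, asked against two small streams. It is an assumption-laden illustration, not WeWork's actual model: the row shape (`person`, `activity`, `on`, `detail`), the activity names, the 30-day definition of "first month", and the `campaign_report` helper are all hypothetical. The third stream (buildings) is left out for brevity.

```python
from datetime import date, timedelta

# Hypothetical activity streams: one row per (who, what, when) plus a detail.
pre_member = [
    {"person": "p1", "activity": "clicked_campaign", "on": date(2020, 1, 2), "detail": "spring_promo"},
    {"person": "p2", "activity": "clicked_campaign", "on": date(2020, 1, 3), "detail": "spring_promo"},
    {"person": "p3", "activity": "clicked_campaign", "on": date(2020, 1, 3), "detail": "other_promo"},
]
post_member = [
    {"person": "p1", "activity": "became_member", "on": date(2020, 1, 10), "detail": None},
    {"person": "p1", "activity": "posted", "on": date(2020, 1, 15), "detail": None},
    {"person": "p1", "activity": "posted", "on": date(2020, 2, 1), "detail": None},
    {"person": "p1", "activity": "posted", "on": date(2020, 3, 20), "detail": None},
    {"person": "p1", "activity": "cancelled", "on": date(2020, 7, 10), "detail": None},
    {"person": "p3", "activity": "became_member", "on": date(2020, 1, 20), "detail": None},
]


def campaign_report(campaign):
    """Summarize conversion, tenure, and early posting for one campaign."""
    clicked = {
        r["person"] for r in pre_member
        if r["activity"] == "clicked_campaign" and r["detail"] == campaign
    }
    members = {}
    for person in sorted(clicked):
        events = [r for r in post_member if r["person"] == person]
        joined = next((r["on"] for r in events if r["activity"] == "became_member"), None)
        if joined is None:
            continue  # clicked the campaign but never became a member
        left = next((r["on"] for r in events if r["activity"] == "cancelled"), None)
        first_month_end = joined + timedelta(days=30)  # assumed definition of "first month"
        posts = sum(
            1 for r in events
            if r["activity"] == "posted" and joined <= r["on"] < first_month_end
        )
        members[person] = {
            "days_as_member": (left - joined).days if left else None,
            "posts_first_month": posts,
        }
    return {"clicked": len(clicked), "converted": len(members), "members": members}


report = campaign_report("spring_promo")
print(report)
```

Because every stream shares the same row shape, the whole question reduces to filtering on `activity` and comparing dates, which is roughly why a small number of well-designed tables can cover so many questions.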
[00:30:40] Unknown:
Once you have these foundational layers in place (internal staff to handle the data projects, somewhere to store the data and a way to load source data into the data warehouse, and a self-service layer for business users to ask and answer their own questions), how do you go about identifying and prioritizing areas of work for data engineers in particular, but other data professionals as well, so they understand how to have the most impact on the business, their time is well spent, and they're not chasing down a project that's interesting but isn't necessarily going to add any real value to the business?
[00:31:21] Unknown:
That links in really well to the other part of our program. So, you know, we offer this fundamentals training, which is a 12 week program that helps you build foundations from scratch. Once you have these foundations, that's a really cool place to be, because that's when you can really move fast and execute quickly. So, you know, the second thing which we offer is this mastermind, which we really sort of bundle together. And are you familiar with the concept of a mastermind?
[00:31:56] Unknown:
Yeah. It's definitely a useful concept. And for people who aren't familiar, it's the idea that you have a group of people who are experts in their respective fields, but not necessarily within the same field, who gather together on a periodic basis to ask questions of each other and get answers from their peer group, so that they're able to teach and learn from each other and accelerate their ability to execute on their vision. That's in contrast to a more mentor-oriented approach, where you have somebody who is a few steps ahead of you in a given area, and so you're learning from them, but it's not as much of a reciprocal exchange.
[00:32:39] Unknown:
And the only reason I asked is I wasn't familiar with the concept of masterminds when I lived in America. I just hadn't heard of them. I've been in Bali, which is another story, but when COVID hit, I was living in Shanghai. I came to Bali, and I got stuck over here. Very quickly, China got locked down, so I was a COVID. So, you know, I was stuck in Bali, and there are a lot worse places to get stuck in. But what I did is I joined this mastermind, and in 2020, that was the best decision that I made. It just accelerated my professional as well as my personal life. And, you know, this mastermind had nothing to do with data. It was just a general purpose mastermind.
So much so that I really started paying attention to masterminds, what they are, and how people use them. It turns out that this is not a new concept. This has been done for the last 75 years. Most famously, Tony Robbins runs one. He charges half a million dollars for access to his, and he's got Fortune 50 CEOs in there. Masterminds are amazing tools for accelerating your timelines towards your business goals or your personal goals. And as you said, it's this idea of bringing in a diverse group of people. Our mastermind is a business data mastermind. So, you know, we bring data leaders, data engineers, and data scientists into a potent container with the idea that if we bring in the right people, the group collective is way smarter, way more balanced, and super beneficial to everyone inside the mastermind.
And going back to your question of, you know, at a strategic level, now that you have the fundamentals, what do you focus on? How do you prioritize? This is where the mastermind really helps. Being surrounded by this peer group, which is super diverse and well balanced, allows you to strategize, brainstorm, find new perspectives, and very often learn what not to do. If you wanna get to a certain level, this is what worked for us. This is what didn't work for us. That's really where the medicine lies. So our fundamentals program combined with this mastermind is super potent: once these companies have these foundations, the question becomes what they focus on, what they prioritize, what to do, and what not to do. And that is just, you know, the fastest way of hitting your business goals and even accelerating your personal life. What I found is that the people in my mastermind are now very, very dear friends of mine. A lot of us started to work together, so it's really helped 5x Data.
You know, what I find really interesting is that, especially in New York and San Francisco and some of the biggest cities in America, community is a word which is thrown around a lot. At WeWork we called ourselves a community company, and we acquired Meetup, so I have a lot of context around that. There are 2 areas where I feel community, meetups, and open source could be more potent, and which masterminds really do well at. Number one is consistency. Right? In a mastermind group, in our group, you're meeting weekly for a period of 12 weeks. Anything you do consistently is when you start to see exponential results, and that's something which a lot of communities and meetups don't do a very good job at. The second area is accountability, which is a huge piece in any sort of personal or professional development, and which is again lacking in our traditional definitions of community, but which masterminds do a really good job at. So, you know, the consistency and accountability combined with the brainstorming and new perspectives and this peer group which you're introduced to is just an ultra important combination, and it really takes things to the next level. I think we are probably the first company to run a mastermind which is purely focused on the data space.
And just from our customers so far and what we're seeing in the market, this is something people are super, super, super excited about. So we are gonna be doing a lot more of that along with our fundamentals program.
[00:37:14] Unknown:
It's definitely an interesting approach. And in a lot of ways, this podcast has become my own sort of personal mastermind group, where every week I get to speak with professionals and leaders in the industry who are, you know, building the tools that I'm using or who have been using different combinations of technologies that I can learn from, so I can understand how best to apply them to my own work. And so I can definitely see the value in just being able to ask a question of somebody and get immediate feedback, rather than having to resort to some of these longer cycles of asking questions on Stack Overflow or hunting down the Slack group for something and then hoping that somebody with enough context can answer your question in a satisfactory manner.
[00:38:01] Unknown:
I think that makes a lot of sense. Right? And what really comes to life for me over here is that as you get immersed in these conversations, many of them might not be directly relevant to you at this point. Right? A lot of this stuff could be a few people talking about concepts which aren't super relevant to where you are right now. But listening in on these conversations and on different people's perspectives really strengthens your overall understanding of a topic.
So when these conversations do become relevant, or when you start getting into pros and cons and really the more tactical stuff ("there are a hundred different ways to go do this; how should we approach it?"), that's when this overall understanding of topics and these new perspectives really shine. And I bet just with your experience and, you know, all these amazing people, data scientists and engineers and leaders that you have been speaking with, you're probably a very, very good person to go to if someone wants to get started with data. And I think a lot of your success comes from the fact that you now probably have, number one, obviously, an extremely potent network, but also so much contextual knowledge of what the problems out there are and what the pros and cons of the different approaches to solving them are, which allows you to pick the best tools for the problem in the moment.
[00:39:26] Unknown:
Yeah. It's definitely been a great experience, and I definitely wouldn't have anywhere near the same understanding of what tools to apply when if it had just been acquired through trial and error and working on the projects that come up in my day to day, because I've got exposure to a much broader range of problem domains than I would in any single occupation or job role, unless I was maybe switching jobs every day of the week and cycling back to a few of them periodically.
[00:39:56] Unknown:
Yeah. For sure. I 100% agree. That makes sense.
[00:40:00] Unknown:
For any companies that you're working with who are maybe further along in the journey of building out data capacity, and who have hit the point where they're stalling and not able to make meaningful forward progress because they're spending so much time trying to pay down technical debt from decisions they made early on, what are some of the common mistakes that you see them having made that are avoidable for other businesses who are starting out? And what are some of the lessons you've learned from the organizations that have hit that wall that you've applied to the way you structured your lessons in 5x Data?
[00:40:38] Unknown:
So I think with companies that already have some traction or a lot of traction, you know, they already have data teams and they're answering questions and all of this fun stuff, what we find is that they are probably doing it by trial and error. That's how most people learn. To be very honest, that's how I learned what to do and what not to do. Very often, it's what not to do. And what 5x Data's program is really based on is my experience over the last 10 years trying to get this right. So, you know, very often if companies have already got started on this journey, and they already have maybe BI tools and ingestion and are doing things in a certain way, it's still super valuable to go do a program like ours for a few reasons. Right? Number one is we're structured very roughly into 12 modules, and each of those modules focuses on a core competency.
So even if you're doing BI really well and you have a handle on ingestion, optimizing any of these areas is valuable. Even if only 2 or 3 of these modules are relevant to you, improving your efficiency by 15 or 20% is extremely valuable. Right? You have a data team of 10 people. Improving your efficiency by 15% is an extra hire and a half. That's super, super valuable, especially when you start to think about what an average data hire costs, you know, $150,000, versus the cost of a program like ours, which is $15,000. Even if you get 10 or 15% efficiencies from doing something like this, and you learn how to model your data, or you learn about some of the successful data models which other companies have used and you implement one of those to structure your data (again, there's no need to reinvent the wheel), then there's a lot of value which comes just from optimizing a few things. And what we see is that even with teams which are super advanced, the industry is changing so quickly.
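The efficiency arithmetic in this answer can be written out as a short Python calculation. The figures (team of 10, 15% gain, $150,000 per hire, $15,000 program cost) are the ones quoted in the conversation; treating the cost of a hire as a proxy for the value of the recovered capacity is a simplifying assumption of this sketch, and the variable names are made up for illustration.

```python
team_size = 10            # people on the data team (figure from the conversation)
gain_percent = 15         # efficiency improvement, in percent
cost_per_hire = 150_000   # quoted average cost of a data hire, in dollars
program_cost = 15_000     # quoted cost of the program, in dollars

# Capacity recovered, expressed as equivalent full-time hires.
equivalent_hires = team_size * gain_percent / 100

# Assumption: value recovered capacity at the cost of hiring it instead.
equivalent_value = team_size * gain_percent * cost_per_hire // 100

# How many times over the recovered capacity covers the program cost.
value_to_cost_ratio = equivalent_value / program_cost

print(equivalent_hires, equivalent_value, value_to_cost_ratio)
```

Under these assumptions the 15% gain is worth an extra hire and a half, as stated, which is well above the program cost. A smaller team or a smaller gain scales the result down proportionally.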
And, you know, tools which were relevant 2 years ago very quickly become irrelevant. So I spend a lot of my time constantly in conversation with, you know, senior data leaders or data engineers or data scientists figuring out, hey, what's working at this company? What are some of the pros of this approach? And with these pros, what are some of the cons? Just to make sure that we are a step ahead of what's happening out there, and we can change our programs to better reflect the current state of the industry. And I think the fundamental training and the mastermind are really, you know, the first few steps in the puzzle. What we would love to get deeper in is the next layer of programs.
You know, building data teams is something I'm super passionate about, especially with my experience at WeWork. How do you organize data teams? What part of the business should they report into? How do they work with software engineers? How do they work with stakeholders who are consumers of data? You know, that's a whole program which we're super excited about building. And then there are other programs, like how do you build data products. All these companies are starting to collect some really, really cool information, but how do you take this information and then go build products around it while respecting privacy, which is where we're heading? So it's this concept of privacy by design, while still being able to leverage data to go build data products that your customers love and absolutely adore. These are other programs which we'll start getting into later on this year, and which will become the natural follow-ups to our fundamentals and mastermind programs.
[00:44:20] Unknown:
And as you have started down this journey of building out the business and the team at 5x Data and starting to work with companies to help them level up their internal data capacity, what are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:44:36] Unknown:
What we've learned with companies across a few industries, and it's actually kind of a beautiful insight, is that ultimately every company wants to leverage data. Every company has seen the value in it. I think "data is the new oil," which is a famous article The Economist wrote a few years ago, is something people really relate to, and everyone wants to be leveraging data to take things to the next level. So what we've seen, which is really beautiful, is that companies want to do this, and very often they have no idea how to go do it. It doesn't really matter what industry they're in. What's been really cool to see is that a lot of companies want to go do this.
What's been challenging is the reaction: "This is really complicated stuff. Am I doing it right? How should I do this?" Or, "We don't have the resources to go do this." And a lot of that has been true. Right? Data hires are expensive, and getting it right is not as trivial as it seems. Just like in marketing, you know, it's no longer enough to have a few posts on Instagram. There's a whole iceberg of things which happen underneath the surface. The same thing really applies in data. So I think the challenging thing is really the mindset: "We love what you say you're gonna do.
Are we actually going to be able to get there? We're less technical than you think we are." I think that's been the biggest obstacle so far. It just pushes us to focus on our program and, you know, on making it accessible for as many people as possible. Our goal this year is to help 500 companies, either through our mastermind or our foundations. So, you know, my purpose and the purpose of 5x Data is to serve as many companies as possible. And I'm a big believer that we're living right now in this golden opportunity where, if you leverage data correctly, you can use it for exponential growth. And I believe that at some point, and I'm not sure if it's 2 years or 3 years or 5 years, with advancements in the tools and the ecosystem, companies will start to get more and more insights and more analytics for free.
So it's gonna start to level the playing field. But right now, there exists a sort of golden opportunity: if you do this, then you can leverage data as a competitive advantage and you can grow faster. So, you know, our purpose is to help 500 companies this year. Being on shows like this really allows us to talk about how we have built this program to make this as consumable and as simple as possible. And if you're on the fence about investing in data and you have no idea how to go do it, then we can help educate you around that mindset and then give you all the resources you need to go make it happen.
[00:47:42] Unknown:
As you continue to work with these businesses and try to stay up to date with what's happening in the industry, what are some of the particular trends or specific technologies or groupings of technologies that you're keeping a close watch on for your own uses or for being able to leverage in these programs that you're offering to businesses?
[00:48:02] Unknown:
I keep going back to this. In my opinion, the biggest thing which happened to data was this concept of warehouses, and what Snowflake did recently with the idea of separating storage from compute. With storage becoming so cheap, you can put a lot of data inside your storage layer and run compute jobs separately. And for those of you who are not familiar with Snowflake, I think why they found so much success is that, instead of the dedicated hardware which tools like Redshift, Vertica, and all of these other data warehouses had, Snowflake allows you to spin up resources on the fly, which makes it really, really affordable to use a warehouse and leverage data, and then the BI tools are a layer on top of that. So I think with Snowflake now IPOing and getting all the traction that it has got, it's gonna make the data warehouse architectural layer even more appealing.
So there's gonna be a lot of advancement in the layer on top of that, which is your BI, your analytical tools, and the data science-y stuff, and that's an area which we're following very closely. I think Looker does a really good job on data discovery and reporting. We're super interested, and I'm personally looking out for stuff around data modeling and around really surfacing these insights back into the business and making that process a lot easier.
[00:49:39] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:55] Unknown:
You know, there's a lot of great tooling out there. Obviously, there is room for improving what we have in all areas of it: ingestion, modeling, analytics, machine learning, all of this stuff. But at this point, there's no need to go reinvent the wheel. Right? A lot of the tools which exist out there are doing an extremely good job. We really help stitch all of that together to make it as easy and as consumable as possible. So, you know, I think the one thing I would focus on is that there's no need to reinvent the wheel yourself. Continue using the awesome tools which work really, really well together.
Focusing on that gives a much better ROI than trying to optimize for the few things these tools might not be doing as well.
[00:50:43] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you're doing with 5x Data. It's definitely a useful pursuit to help more businesses understand how to implement their data stacks and find the technical talent they need to be successful, become data driven, and improve their efficiency and their ability to serve their customers. So thank you for all the time and energy you're putting into that, and I hope you have a good rest of your day.
[00:51:14] Unknown:
Awesome, man. Thank you so much for having me on the show. I really appreciate your time, and I look forward to, you know, helping our businesses and entrepreneurs really take things to the next level with data.
[00:51:29] Unknown:
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Tarush Aggarwal: Introduction and Background
Building 5x Data: Mission and Vision
Challenges in Data Management and Solutions
Three Pillars of Data Foundation
Hiring and Training Data Teams
Ensuring Effective Self-Service Data
Prioritizing Data Projects
Mastermind Groups and Their Benefits
Lessons Learned and Future Directions
Trends and Technologies in Data Management
Biggest Gaps in Data Management Tools
Closing Remarks and Contact Information