Summary
Credit Karma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Vishnu Venkataraman about building the data platform at Credit Karma and the forces that shaped the design
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Credit Karma is and the role of data in the business?
- What is the current team topology that you are using to support data needs in the organization?
- How has that evolved from when you first started with the company?
- What are some of the characteristics of the data that you work with? (e.g. volume/variety/velocity, source of the data, format of the data)
- What are the aspects of data management and architecture that have posed the greatest challenge?
- What are the data applications that are providing the greatest ROI and/or seeing the most usage?
- How have you approached the design and growth of your data platform?
- Credit Karma was one of the first FinTech companies to migrate to the cloud, specifically GCP. Why migrate? What were some of your early challenges taking the company to the cloud?
- What are the main components of your data platform?
- What are the most notable evolutions that it has gone through?
- Given your strong focus on applications of data science and ML, how has that influenced the architectural foundations of your data capabilities?
- What is your process for evaluating build vs. buy decisions?
- What are your triggers for deciding when to re-evaluate components of your platform?
- Given your work with financial institutions how do you address testing and validation of your derived data? How does your team solve for data reliability and quality more broadly?
- What are the most interesting, innovative, or unexpected aspects of your growth as a data-led organization?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building up your data platform and teams?
- What are the most informative mistakes that you have made?
- What do you have planned for the future of your data platform?
Contact Info
- @vishnuvram on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it's no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. That's where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP.
Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer. Your host is Tobias Macey, and today I'm interviewing Vishnu Venkataraman about building the data platform at Credit Karma and the forces that shaped the design. So, Vishnu, can you start by introducing yourself?
[00:02:09] Unknown:
Hi. I'm Vishnu. I lead data engineering at Credit Karma. I've been here at Credit Karma for 8 years now. Prior to Credit Karma, I held early-stage CTO roles at Games24x7 and Nykaa back in India, before I moved over to the Bay Area and started working with Credit Karma. I'm excited to be here.
[00:02:30] Unknown:
And do you remember how you first got started working in data?
[00:02:33] Unknown:
You're taking me back, Tobias. When I did my undergrad, I did a little bit of fuzzy logic and its application to managing and controlling the dynamics of systems. I think that's really where I started out a little bit with AI. I had an opportunity to get into neural networks right after that, but I decided not to. Maybe that was a mistake in hindsight. And then when I started working, I was more of an early-stage startup engineer. I just went the way the company wanted me to go. Later on, when I started my CTO role at Games24x7, we built out the initial product, and gaming being gaming, there's a lot of data that gets used for a lot of decision making. At that point in time, there was a strong need to build a BI framework that allowed us to keep making the right business decisions. That's probably where I got started with managing data products and understanding the value of data products. After that, I did product recommendations at Nykaa.
And then later on, at Credit Karma, I have touched all aspects of data: building out data pipelines for data collection, building out an experimentation platform, building out our recommendation systems and machine learning teams, helping data science teams in recommendations and fraud; the whole gamut.
[00:03:59] Unknown:
And so in terms of your work at Credit Karma, can you describe a bit about what it is that the business does and the role that data plays in the organization?
[00:04:07] Unknown:
I think when our founders started, one of the problems that they were really looking to solve was that the credit score was one of the single most important data points being used by banks and fintechs to decide whether someone should get a debt product or not. And users needed to pay money to the bureaus to get access to their credit score, and they felt that the system was really not working in favor of the users being able to keep their financial health in a good place. So that's where it started. One way I visualize it to myself is, like, I use Google Maps every day. I just did a Disney trip, and without Google Maps, I couldn't have made it to Disney. I had trouble even with Google Maps going into Disney.
And when I start with Google Maps, if I don't have GPS tracking on to detect my current location, I have to tell Google my current location. Even with respect to a financial journey, I think of it the same way: we all need to know where we are today. If we do not know where we are today, we can't really do much to improve our situation, or even understand whether we need to improve our situation. So that was the big unlock for Credit Karma, where we bought credit scores and reports from the bureaus on behalf of our members and gave them back to the members to access for free.
That's where it started out, and we realized that there were a lot of other challenges that our 130 million plus members have in using the data that we were putting in front of them to make the decisions that matter to them in their everyday lives. And we started building a lot of applications on top of the data, which allow our members to really understand how to move forward with their lives in terms of financial progress.
[00:05:57] Unknown:
In order to be able to support this data led organization, I'm curious if you can talk to the team topology and some of the ways that you think about the breakdown of roles and responsibilities along the continuum of data professions and data stakeholders and users, and how that has grown from when you first started to where you are today.
[00:06:20] Unknown:
Yeah. I think when I first started out, we had all our data in Vertica, and we were managing the Vertica clusters ourselves, so there was a database administration team for that. And at that point in time, we had a few data scientists spread across the organization. We didn't really have a full fledged data engineering team or a full fledged data science team in any way, shape, or form. At the same time, the company was already using data science to help our members get more certainty in terms of which products they could get approved for, because we had a lot of traction in the market in terms of the number of credit card applications, personal loan applications, and approvals that were flowing through our funnel.
So it was very application oriented, and we were focused on building out the right member experiences with the data. We were focused on availability for the member experiences. Over the years, we have built out fully fledged BI teams and fully fledged data science teams. Recommendations has become a big use case, which really allows us to grow our business. And to support all of that, over the years, what we have really done is split ourselves up into horizontal and vertical teams. One clear demarcation is that there is one set of teams which are focused on partnering very closely with our analytics organization and business intelligence organization. And there is another group of teams which are focused on partnering very closely with our data science teams. And data science is spread across recommendations and fraud.
Maybe I'll start with the horizontal teams that we have to support all of these use cases. One is an ETL team which helps application teams, who are all on microservices, make sure that all the data getting collected from the operational databases is transported over into our data warehouse, which is BigQuery. We made the change to BigQuery a few years back, and after that, it's been really beneficial for the entire organization. Then, once the data lands in the data warehouse, we have a lot of other teams which are picking up that data. The BI teams are picking up their data and converting it into high quality datasets which are usable by the rest of the business. And then there is a machine learning infrastructure team, which has some subteams, focused on building machine learning datasets, which then become the starting point for our data scientists to do their exploration and start building their models.
Our data scientists also end up exploring new data that is getting ingested, directly from the warehouse, but more often than not, for their regular day to day work, they can just use the generally cleaned and well maintained datasets being made available by the machine learning infrastructure team. The machine learning infrastructure team also supports a framework internally, which we call Vega, which is a way in which data scientists can easily design, develop, test, iterate, and deploy models into production. When we started building out Vega, we were looking at TensorFlow closely, and we started using some aspects of TensorFlow. TensorFlow ends up getting used for model training as well as for the model scoring part. And then once these models get deployed into production, whether it is for recommendation use cases or fraud use cases, we need a model serving team that is able to operate at scale. Today, I think we do something like 58 billion model scores every day. So there's a team which is focused on the infrastructure for model serving, and then there is also a corresponding team which is focused on making sure that there is current, fresh data available for all the models to get scored on top of.
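To make the train-then-serve split concrete, here is a minimal sketch of the kind of workflow a framework like Vega might wrap, using plain TensorFlow/Keras. The model, features, and file path are illustrative assumptions, not Credit Karma's actual setup:

```python
# A minimal, hypothetical train-and-export loop of the kind a framework
# like Vega might wrap; the model, features, and paths are illustrative.
import numpy as np
import tensorflow as tf

# Synthetic tabular training data standing in for a curated ML dataset.
rng = np.random.default_rng(42)
X = rng.normal(size=(10_000, 20)).astype("float32")
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
model.fit(X, y, batch_size=256, epochs=3, validation_split=0.1)

# Export the trained model so a separate serving fleet can load and score it.
model.save("approval_odds.keras")

# Online scoring path: reload the artifact and score a fresh feature vector.
serving_model = tf.keras.models.load_model("approval_odds.keras")
print(serving_model.predict(X[:1]))
```

The point of the split is that training and scoring share one exported artifact, so the serving infrastructure can scale independently of how models are built.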
That's a lot, so let me just touch on the data science teams as well. We have a data science team focused on certainty. This is about helping our members get an understanding of what the chances are that they're going to get approved for a particular product, whether it's a good chance, a very good chance, or an excellent chance. Then we have a data science team focused on machine learning driven notifications: what email campaigns are most relevant to be sent to a particular member on a daily basis? Every day, we are running through all our campaigns to check whether we should even send an email or a push notification or not. And then there is a third team which is focused on ranking, which gets used in our recommendation systems, which is where all the use cases come together to power our member experiences.
[00:11:08] Unknown:
As far as the specifics of the data that you're working with, obviously, you're working in the financial space, so there are a number of regulatory aspects to it. But I'm wondering if you can talk to some of the, I guess, operational characteristics of it in terms of the three V's of volume, variety, and velocity, and some of the types of integrations that you need to maintain for being able to source the data, and maybe some of the structure of the data. So whether it is largely textual or whether you're dealing with binary data, or some of those kinds of operational and analytical characteristics of it. I think in terms of the sources of data, like I mentioned earlier, we buy,
[00:11:52] Unknown:
members' data from the bureaus on behalf of our members and make it available to the members and to our internal systems. Now, depending on the bureau and the data that we are getting from them, some of it is daily, some of it is weekly, some of it is a couple of times a week, and some comes in real time if there is an immediate change happening. So whenever there is a hard pull on my credit report, the bureau immediately responds, lets Credit Karma know, and then lets our member know that there is a hard pull on your report. This is actually the core data on which a lot of business functions are built, and it has a lot of critical information, not just for our business, but also for our members.
This is what our members are using to understand whether they have disputes on their report, which is pulling down their credit score. What are the recent pulls? When are these hard pulls getting off my credit report so that my credit score can get better? When I'm making payments, am I making payments on time? This is what all of our user facing applications are built on top of. Like I mentioned earlier, we have 130 million members, so as you can understand, there's a big volume of data coming through, and this is bureau data, which is used by all the financial institutions.
So the schema aspects of the bureau data, and how and when it changes, are pretty static, I would say. It changes a little bit every few years, and when it changes, there is a massive amount of work that all the institutions, as well as we, end up doing. We get a lot of support from the bureaus to do it, but it's still an expensive thing to go through. And then, while this is a lot of volume, we also have more than 8 million visitors who are using our app on a daily basis. More than 8 million, which means that there's a lot of behavioral data that is also getting collected and made available to our systems to understand what the users are looking at, which items they are more interested in, which articles they are more interested in.
And then the last bit of data is actually what I think of as content data, which is what we end up recommending: content, or items. Internally, we generally call them items. So items are what we end up recommending to our members. Now, these items could be a credit card or a personal loan, which means the changes to the items are controlled by the banks who have these financial products in our marketplaces, or it could also be content that we have internally authored, which can also be recommended to our members. Some of that authoring is essentially a big part of our member experience. And maybe I can give you an example here. The example is that I just had a recent hard pull on my report.
And as soon as I have a recent hard pull on my report, we want to highlight to the user that this is an area you want to go check on, to make sure it's a hard pull that you requested, because nobody is allowed to pull your report without the member's permission. So if the member has not given permission and the hard pull has happened, that means there is a chance that their identity might have been stolen. It's important for the member to know immediately, and important to put it in front of the member as much as possible. So we recommend that content and put it in front of our member.
So that's content data. The content data pulls together various attributes from the bureau data, pulls together various attributes from the behavioral data, and then we stitch it together and put it in front of the members. This data, I would say, is not a large amount. It doesn't change often, because generally it's changed by humans, either humans within Credit Karma or humans within the banks, and it doesn't really change a lot. But it's very powerful in helping us understand what's most relevant to our members, and we end up doing a lot of fun things with the content data. It's actually an area that we are trying to go deep into going forward.
And in terms of the format of the data, I would say a lot of it is tabular data. With the content data, we are getting into some amount of textual content, because when you see a credit card offer, you see a bunch of marketing bullets there. And similarly, I talked about our notifications data science team. So when we are sending out an email or a push notification, there's a lot of text content in there that we want to start using to improve the relevance of the content that we put in front of our members.
[00:16:41] Unknown:
As far as the different categories of data that you're working with, I'm wondering how that influences the complexities and challenges of how to work with that data and some of the cycles that it introduces. Like you were mentioning with the bureau data, there are periodic updates where you then have to go and adapt your transformations to be able to account for the shifts in those data sources. And I'm wondering how the different data categories and data sources combine to generate some of that complexity and impose constraints or add requirements on the architectural elements to be able to support those flows?
[00:17:26] Unknown:
I think pretty much from very early on, we wanted to make sure that when a member comes to Credit Karma, they have their most recent data in front of them. We did not want to show them stale data and say, hey, come back tomorrow and then you'll see a refreshed report. We wanted them to have the most updated report and data in front of them. That is a very, very key design decision that got made a long time back. And it's kind of natural, right? If you're going to check a flight price, you want to know the current flight price that you're going to get, rather than yesterday's flight prices, as well as availability. So what that means is, when our members come to the app, we are getting in touch with the bureau. We have built out the bureau integrations, we are working with the bureaus to get the data as quickly as possible, and then running it through some transformations and processing and putting it in front of the member.
Now, what this means is that, correspondingly, we needed to make sure that all the models we are using to serve recommendations to our members also use the latest data. Which means that we had to make investments in real time model serving. That was one of the first things that my team and I did as soon as I joined. The first year, we built out our real time model serving infrastructure to make sure that users can see the latest data, and can see recommendations based on the latest data, immediately. We made this investment when model serving infrastructure was not really a thing out there. I'm talking 2014, 2015.
A lot of people were still doing batch model serving: take batch data, build your models, run your data quality checks on the batch data, then run your models on top of it, cache those model outputs, of course, and make them available. We couldn't afford any of that because of some initial design decisions that we made. The second big thing, not really a challenge, just an important, critical design decision that we have, is that we do not use any PII in our models. The credit report data has PII in it.
We make sure that the PII we get is used purely for our members' consumption, and it's not used anywhere else. So from a security perspective, a lot of these things are locked down, and the areas of recommendations, data science, and machine learning for all of these use cases do not get access to any of the PII data. So we had to build out separate zones, make sure that we have tight control over all of the data that is moving across from one zone to the other, and make sure that all of the monitoring is in place. These were all very early initial design decisions, which allowed us to make the right investments at the right time. I would say one of the bigger challenges, really, was that when I started out, we were not yet in the cloud. We were in our own data center.
Like I mentioned earlier, we were running our own Vertica cluster. And I remember a point in time when our data scientists needed to pull together a training dataset, and they would run data processing jobs over weekends, over a month, and then they would say, okay, I'm done with this month of training dataset collection; now I can go build my model. Once we started moving into the cloud, we obviously had elastic compute resources, and we were able to get all our data into BigQuery, which meant that the management aspects of our data warehouse were something that we outsourced to Google.
It was a big, big lift for us, because data scientists could then get on with their job of exploring the data, getting their training datasets together very conveniently, and building their models quickly. So I would say those are probably some of the challenges from the initial days that are important to bring up here.
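As an illustration of how much the warehouse move shortens that loop, here is a hedged sketch of pulling a training dataset straight out of BigQuery with the google-cloud-bigquery client; the project, table, and column names are hypothetical:

```python
# Hypothetical example of assembling a training dataset from BigQuery;
# the project, dataset, table, and column names are illustrative only.
from google.cloud import bigquery

client = bigquery.Client(project="example-project")

query = """
    SELECT
        member_id,
        credit_score,
        utilization,
        approved
    FROM `example-project.curated.approval_training`
    WHERE snapshot_date BETWEEN '2022-01-01' AND '2022-06-30'
"""

# What used to take weekends of batch jobs becomes one interactive query.
training_df = client.query(query).to_dataframe()
print(training_df.shape)
```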
[00:21:46] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and to build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up for free today at dataengineeringpodcast.com/rudder. And in terms of the different data products or derived data assets that you are generating, I'm wondering how you approach the question of determining the return on investment and identifying what are the most crucial aspects that require high uptime and reliability, and what are the pieces that maybe can have a looser SLA or something, and how that influences the way that you structure and orient the work around how those assets are maintained.
[00:22:49] Unknown:
Maybe I'll start with the data applications. In terms of the applications providing the greatest ROI for us: we monetize by recommending financial products to our members, and when our members get approved and take the financial product is when we get paid. So that's the basic business model, which means that anything related to financial product recommendations is top of mind for all of us, and all the data and all the models that get used to drive our financial product recommendations directly add to our top line. We have a bunch of other use cases, but I'll probably touch on just one more big one. The other big use case is on the notification side.
Notifications are when we send out an email or a push notification to a member, telling them, hey, something's changed on your credit report, or here is an offer that you're eligible for. So some of it is related to their financial health, and some of it is related to financial products that they are now eligible to get that they previously were not. These are probably the two biggest data applications that provide the biggest ROI for us. And in terms of the SLAs related to these datasets, like I mentioned earlier, when you're getting real time credit data, when the user is on the app, you want to use the freshest data. You want to be able to make sure that the user has a snappy user experience.
There are definitely constraints in terms of the models that we can use there. There are constraints in terms of the amount of data that we are able to push through. And so far, we have done a great job of managing that, I would say. But at the same time, there are a lot more use cases and derived data coming up that are becoming more complex for us to handle online all the time. On the notification side, it's not like someone is sitting and waiting for an email or a push notification from Credit Karma every minute of the day.
So we have an opportunity to do things in a way where the SLAs are a little looser. We want to make sure that we are able to get in touch with members in some situations quickly. Like I talked about with the hard pull, you want to be able to immediately let the user know about it. But if you're telling them, hey, this is what your monthly spending looks like, or this is what your monthly credit utilization looks like, you don't need to really worry about a minute by minute SLA. So we end up doing more on the batch side for the notifications, and the SLA is a little looser. We can take our time to make sure that our data is in a good place, we can use some of the machinery that Google allows us to use, like Dataflow, to be able to process things in batch, and do other fun stuff in that area as well.
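Since Dataflow runs Apache Beam pipelines, a looser-SLA notification pass might look something like the following sketch; the eligibility rule, field names, and campaign logic are purely hypothetical:

```python
# A hypothetical batch notification pass in Apache Beam (the SDK that
# Google Cloud Dataflow executes); the rule and fields are illustrative.
import apache_beam as beam

def eligible(member):
    # Assumed rule: only notify when credit utilization moved meaningfully.
    return abs(member["utilization_delta"]) >= 0.05

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "ReadMembers" >> beam.Create([
            {"member_id": "a1", "utilization_delta": 0.08},
            {"member_id": "b2", "utilization_delta": 0.01},
        ])
        | "FilterEligible" >> beam.Filter(eligible)
        | "BuildNotification" >> beam.Map(
            lambda m: f"push:{m['member_id']}:utilization_update")
        | "Emit" >> beam.Map(print)
    )
```

In a real deployment the `Create` step would read from the warehouse and the final step would hand off to a delivery service, but the shape of the batch pass is the same.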
[00:25:43] Unknown:
In terms of the architecture of your data platform, you mentioned that in the early days when you first started, you were largely running your own Vertica clusters, and since then, you have migrated into the cloud. I'm curious if you can talk to the different, I guess, epochs of the data platform architecture, what were some of the motivating factors that led to some of those transitions or transformations, and how you thought about the build versus buy process along each of those stages?
[00:26:13] Unknown:
I think some of this is about the evolution of our use cases and the evolution of what we can do with the data. Some of this is about, as the company grew, being able to build more teams to focus on specialized problems. In the initial couple of years, like I mentioned, it was more about moving from our own data centers to the cloud, primarily from an offline perspective, where we were able to get all of our data into BigQuery and start processing all the data within BigQuery. Then, one of the other early investments we made, which I talked about, was our real time model serving system, which continues to operate at a large scale today. The other big investment we made was to make it easy for people to collect data across the app, irrespective of what each product engineering team was doing. We made it cheap for our product engineering teams to make sure that the data they are collecting is good data, and for them to get immediate feedback if there's something wrong with the data collection process. So that is one of the really critical investments that we made very early on.
Then we started getting into the world of: hey, we are getting a lot of this data, and not all of this data is being used everywhere. There was some amount of data that people had explored and some that they had not. So we wanted to make it easy for our data scientists to use all the data in our models. That is where some of the training dataset creation that I mentioned earlier came up, where we said we need to make sure we're making it easy for our data scientists to use all the data rather than making it hard for them. So the investment in custom created training datasets, we call it a modeling data source, was an important thing that allowed our data scientists to continue to iterate.
I would say the next investment was in our machine learning framework. As the number of data scientists in our teams grew and the number of use cases grew: how can they easily deal with large amounts of data while at the same time getting into building deep learning models? Initially, we were using a lot of random forests and trees; I think we moved into using a lot of deep learning models after that. So when we moved into deep learning, how could we allow our data scientists access to powerful frameworks like TensorFlow without having to worry about whether their model training is going to take a week or finish quickly? So we allowed them a self-service way of building their own models and having control over their own destiny.
And then we started applying all of the machinery that we talked about here to other use cases, like notifications, something that I brought up earlier. We also started getting into a lot of internally authored content that we put in front of our members, which meant we also had to collect a lot of behavioral data about it, and our initial investments in consistent tracking paid off here as well. Along the way, one thing I missed was our investment in our experimentation platform. It is actually one of the early investments that we made, which allowed data scientists to run model experiments very, very cheaply, and set up our experimentation practice in a really good way. Everything today gets determined by experiments, and when you start out an experiment, you have to document the metrics that you're going to move. If you do not have that documented, you're probably not even going to be allowed to get that experiment out in production.
Even the experimentation system is a self-service system that we built out, and pretty much one third, or close to one third, of the company has access to the experimentation system, either to view, launch, or ramp experiments. I would say those are probably some of the big investments that we made along the way.
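As a rough illustration of what such a self-service experimentation system enforces, here is a hedged sketch of deterministic variant assignment with metrics declared up front; the class and field names are invented for this example:

```python
# A hypothetical sketch of an experimentation platform's two core rules:
# assignment must be deterministic, and metrics must be declared up front.
import hashlib
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    variants: tuple  # e.g. ("control", "treatment")
    metrics: tuple   # metrics you expect to move, declared before launch

    def __post_init__(self):
        if not self.metrics:
            raise ValueError("declare the metrics you expect to move before launching")

    def assign(self, member_id: str) -> str:
        # Hashing experiment name + member id gives a stable, uniform bucket.
        digest = hashlib.sha256(f"{self.name}:{member_id}".encode()).hexdigest()
        return self.variants[int(digest, 16) % len(self.variants)]

exp = Experiment(
    name="ranker_v2_rollout",
    variants=("control", "treatment"),
    metrics=("approval_rate", "click_through_rate"),
)
print(exp.assign("member-123"))  # the same member always lands in the same variant
```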
[00:30:42] Unknown:
Another interesting aspect of the work that you're doing with your data is that in addition to being a data led company, you're investing very heavily in the data science and machine learning capabilities. And I'm wondering how that focus on machine learning and data science use cases influenced some of the architectural decisions and the ways that you thought about the flow of data through the company as opposed to a company that is primarily focused on analytical, like, report based uses of data?
[00:31:14] Unknown:
I think one key point that I'll bring up here is that when we built our first generation recommendation system, one of the things we had a very clear focus on was that when someone makes a request to the recommender system, the recommender system has two jobs to do. One is to serve the most relevant recommendations to the member. The second is to collect data about the current context: all the data that was used by the recommender system as part of the current context in making that recommendation. Because we knew that when you start getting into data science and machine learning, you need access to that system level data, which allows our data scientists to look at the data at that point in time.
So we made a deliberate design decision that we wanted high data consistency with respect to the data that is getting collected for our data scientists to use. Because in normal situations, this is point in time data that is probably available in different datasets that you can stitch together. It might be cheaper to do that, but then you get into a lot of data consistency challenges trying to stitch all these data sources together. So we made a very key design decision where our recommender system was collecting, and continues to collect, point in time context data: what user data we had access to, the behavioral aspects, where the user currently is in the app, and the state of the content at that point in time, because content can also change. We were collecting all the data in one place. That is a big decision and a big investment that we made, and continue to make, in making sure that our data science teams don't have to worry too much about consistency issues when pulling things together. They still have to pull together some additional data points from other datasets, but it's really set up to be minimal. I think that's probably the one big design decision that I would call out here.
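To illustrate the idea of capturing the serving context atomically with each recommendation, here is a hedged sketch; the function, fields, and scoring hook are all hypothetical:

```python
# A hypothetical sketch of a recommender that logs the exact point-in-time
# context used for each response; all names and fields are illustrative.
import json
import time
import uuid

def recommend_and_log(member_id, member_features, candidate_items,
                      score_fn, context_log):
    """Rank candidates and record, in one record, everything used to do so."""
    scored = sorted(
        candidate_items,
        key=lambda item: score_fn(member_features, item),
        reverse=True,
    )
    context_log.append(json.dumps({
        "request_id": str(uuid.uuid4()),
        "member_id": member_id,
        "served_at": time.time(),
        "member_features": member_features,  # user state at request time
        "candidate_items": candidate_items,  # content state at request time
        "ranking": [item["id"] for item in scored],
    }))
    return scored

# Toy usage: score by a single hypothetical relevance field.
log = []
ranked = recommend_and_log(
    "member-123",
    {"credit_band": "good"},
    [{"id": "card_a", "relevance": 0.7}, {"id": "loan_b", "relevance": 0.9}],
    lambda feats, item: item["relevance"],
    log,
)
print([item["id"] for item in ranked], len(log))
```

Because the log record is written in the same call that produced the ranking, there is no later join across datasets and hence no consistency gap, which is the design decision described above.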
[00:33:28] Unknown:
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it's often too late and the damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values, before it gets merged to production. No more shipping and praying. You can now know exactly what will change in your database.
Datafold integrates with all major data warehouses as well as frameworks such as Airflow and dbt, and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold. Another interesting aspect of the work that you're doing is that because you are dealing with financial information, and you're dealing with data products that can influence the ways that your customers think about their finances and approach their spending and investments, obviously, there is a lot of need for ensuring that the information that you're providing is accurate and understandable.
So there are two forks to this question: one is understanding how you approach the data quality and validation of the assets that you're building, and the other is how you approach the user facing design and presentation of those data products and data assets to ensure that they are being understood and interpreted correctly by those end users.
[00:35:12] Unknown:
Yeah. That's a difficult and hard problem for us, because credit data is pretty complex, and being able to put it in front of members in an understandable form is always a hard challenge that our designers are helping us with. And I think it's an evolving thing that we are always going to be working on. That said, on the data reliability and quality side, I would say it starts with each of our teams involved along the way, right from the point of data ingestion, and even earlier, working with our bureaus to make sure that they're able to help us on data quality and that we are partnering with them on data quality.
But internally, within Credit Karma, right from the point of ingestion, like I talked about with the horizontals and verticals, it's more about making sure that people have access to the right tools, and to the right mechanisms and processes, to be able to monitor and take the right actions when something goes wrong. They can detect and quickly act when something goes wrong with respect to our data. I talked a little bit about the content recommendations earlier, and one of the interesting things one of our teams did was, whenever there was a challenge with respect to a particular content recommendation, they would create a site incident, which would actually tell the people responding to the incident an estimate of the value of that content if the recommendation was not working properly.
So what they really did was look at historical data over the last few weeks in terms of how much revenue was attributed to that particular content recommendation, and then put that in front of the people who are dealing with the site incident, so that they can deal with it with the right level of urgency. This is an area that I think we are always going to be paranoid about, because we understand that when we put data in front of our members, we are asking them to put their trust in us and in the company, and members' trust in us is paramount. It's really what has allowed us to have all the success. So this is an area that we are constantly looking to evolve and constantly looking to invest in. I think there is so much more that we can do here.
One other aspect I can bring up is that we invested in anomaly detection. We were one of the early users of Anodot. A lot of our data and business metrics we link up to Anodot, and we have made it really self-service, so that a lot of internal teams can use it for anomaly detection, which allows them to also catch a little bit of the unknown unknowns. Anodot can surface anomalous patterns in the data, and then the teams can react quickly and figure out what the problem is.
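The underlying idea, which a managed service like Anodot generalizes far beyond this, can be sketched as a simple z-score check on a daily aggregate; the window, threshold, and metric here are illustrative assumptions:

```python
# A toy aggregate-level anomaly check of the kind a managed service like
# Anodot generalizes; the window and threshold are arbitrary choices.
import statistics

def is_anomalous(trailing_window, today, z_threshold=3.0):
    """Flag today's aggregate metric if it sits more than z_threshold
    standard deviations away from the trailing window's mean."""
    mean = statistics.fmean(trailing_window)
    stdev = statistics.stdev(trailing_window)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# e.g. daily count of ingested bureau records (synthetic numbers).
history = [1_020_000, 995_000, 1_010_000, 1_005_000, 990_000, 1_015_000]
print(is_anomalous(history, 640_000))    # True: a sudden drop worth paging on
print(is_anomalous(history, 1_002_000))  # False: within normal variation
```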
[00:38:05] Unknown:
On that point of anomaly detection, and going back also to your earlier comment about the periodic updates from the bureau sources, I'm wondering how the, I guess, cyclical aspects of the data updates in some of your source information influence the pieces of information that you need to feed into the anomaly detection, to say we are expecting this update, or to allow the anomaly detection algorithm to account for those cycles, while still also saying we want to understand within those updates if there is something that appears anomalous. Because I'm sure that there are also cases where the source data isn't 100% accurate, and you actually need to go back and work with some of those bureau sources to correct those assets.
[00:38:51] Unknown:
Yeah. As far as anomaly detection is concerned, it's actually expensive to run it at a user level. We end up doing it more at an aggregate level to get a sense of the overall dataset that we are ingesting, trying to understand: are there any big changes at the overall aggregate level? And when we see some challenges, like I said, we have really tight relationships with the bureaus. We have runbooks that we have shared with each other, and our engineers work directly with their engineers to bring things back to normal. There is a lot of data that we are getting in real time from the bureaus, and there is data that we are getting in batch. So if we find an issue with the batch, then we would much rather work with the bureaus to fix it before putting it in front of the users and before using it as part of our recommendations and other use cases. That way we are protecting our business and members from data quality issues.
But as we get into more interesting things, like streaming of the bureau data, where we want to focus on speed, some of these things again become more challenging. If you're no longer able to do things in aggregate, how do you detect issues on a user by user basis? There are some interesting problems there for us to solve.
[00:40:10] Unknown:
In terms of your experience of building this platform, helping to guide a lot of the evolution and architectural updates, and, in the case of moving to the cloud, the replatforming, I'm curious if there have been any interesting and informative mistakes, or maybe dead ends that you went down, and some of the useful lessons that you pulled out of that experience?
[00:40:37] Unknown:
As we were moving into the cloud, we were also correspondingly decomposing our monolith into microservices. So there were definitely some challenges in how all of this played together, where we wanted to make sure that we were getting the most benefits out of the cloud while also setting up each of our teams to work at high velocity across the organization. I would say one decision we made that we'd probably want to relook at is that, along the way, we made separate data quality monitoring investments in different organizations. There was definitely an opportunity for us to make a unified data quality monitoring investment together. The analytics side made their investment, and then on the machine learning, data science, and recommendation side, we made our own.
We could have worked together better in some of those situations. Some of it is also each team wanting to make sure that they are able to take care of their stakeholders in the best way possible. These are all lessons that you learn looking back in hindsight, and it's always something that you can go back and correct. The other mistake, I would say, is that when we invested in some of our self-service systems initially, we provided a lot of flexibility to the people using them. There was an opportunity for us to start out with some strict guardrails and then slowly open things up as we got more users onto the system, rather than opening it up right up front. That's probably a lesson that I'll take. Internally, we were calling this idea constrained flexibility: how can we introduce constrained flexibility into our self-service systems so that we are setting up our stakeholders and the users of those systems for success? But just last week I heard a term attributed to Barr called controlled freedom, or something like that, which is also an interesting way to think about it. We were calling it constrained flexibility; I think Barr calls it controlled freedom.
Definitely a lesson that I have for my life in terms of how to really build and set these platforms up for success, and set up our stakeholders for success as well.
[00:42:58] Unknown:
In your experience of building out these capabilities and working with the organization to understand the needs and potential products that you can build from the data that you have available, what are some of the most interesting or innovative or unexpected aspects of that growth that you have experienced?
[00:43:19] Unknown:
There are definitely lessons that have been learned out in the world that we are relearning ourselves as well. But I would say one of the main challenges is to make the right decisions along the way, where you are balancing cost, performance, and scale. And when I say performance, there is system performance, model performance, and business performance. It's about making sure that you are able to make the right investments at the right time in each of these areas. Some of these investments, if you take care of them a little later, can be really hard. Cost is definitely an area where I think we could have invested earlier than we did. But once we made those investments, we've been able to operate better, and we've also been able to get help from GCP to manage our costs better.
Apart from that, one other area that I would touch on is complexity. There is always a tendency for us to introduce a lot of complexity when it comes to our use cases. Making sure that we balance our investments in reliability, both on the data side and on the system side, against that complexity is again a balancing act that, as a leader, you need to be making all the time. In your own experience
[00:44:39] Unknown:
of working in this space and being a leader of the data teams at Credit Karma, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:44:50] Unknown:
Some of our best wins along the way came when there were artificial constraints introduced into the system. The constraints could be in terms of time, or in terms of the number of people available, and we've had our best wins when we had that. On the flip side, to make things sustainable and lasting and do things well, you need to start removing some of those constraints. So that is a lesson that I've learned. I used to hear from one of my bosses that when you introduce these constraints, whether artificial or natural, that's when you really allow your engineers, your product managers, and your teams to unlock their creativity.
You have to get them into the right mindset so that they stop looking beyond the constraints and instead operate within them to unlock their creativity. Some of the best wins that we've got: I talked about our experimentation system, and I talked about our integrations for anomaly detection. I think both of those were cases where we introduced an artificial constraint, motivated a small team to go do a POC and get it kick started, and some of that paid a lot of dividends for the company along the way.
[00:46:14] Unknown:
And as you continue to build and iterate on the platform architecture and the uses of data within Credit Karma, and to adapt to and predict some of the needs of your users, what are some of the things you have planned for the near to medium term, or any projects that you're particularly interested in digging into?
[00:46:36] Unknown:
Yeah. I think one of the projects, obviously, is that we just started working closely with Monte Carlo. We have a lot of users of the data within the company now, so what we really want to do is give all these users better visibility and control over the quality of the data that they're using, which allows them to make their own decisions about whether they want to use a particular data attribute or not, and allows them to monitor the data that they are building and depending on for their day to day. So that's definitely an exciting area for me. The second exciting area is: let's say we have all these systems and all this high quality data. What more can we do for our members?
There's just so much more that we can do for our members. One of the areas that I always think about is that our members solve a lot of problems. Every member has their own constraints, but they are solving their own problems. Can we learn from how one member is solving a specific problem, and then take some of those aspects and use them to help other members solve their own problems? It could be at a different point in time; it could be in a completely different geography within the country. That's the use case that I am really passionate about, and I'm hoping the rich data that we have and the system investments that we've made along the way will get us there.
[00:48:07] Unknown:
Are there any other aspects of the work that you're doing at Credit Karma or the role that data plays in the organization or the platform architecture that you have developed to support those use cases that we didn't discuss yet that you'd like to cover before we close out the show?
[00:48:21] Unknown:
Yeah. I think maybe one more area that I can bring up is making the lives of our data science teams easier. This is an ongoing thing, and building the right partnership between data science, machine learning engineers, and data engineers is something that we constantly invest in. These are all systems and tools that we've mostly been building in house, because we also want to be able to use these systems for auditability and governance purposes with respect to models. But some of these investments are, again, I would say, nascent, and we are building them ourselves. We talked a little bit about build versus buy earlier, and this is an area where I am really excited about the work that we're doing to make data scientists' lives easier in terms of what kind of models they can build, how they can monitor the models, how they can deploy their models conveniently, and how they can make sure that the models are achieving their intended purposes. I think that's probably an area that,
[00:49:21] Unknown:
we have made some investments in that I didn't talk about earlier, and we are going to continue to make investments in. Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:49:43] Unknown:
I think there are a lot of tools available today, and it's more about making sure that we, as leaders, are able to work through all the tools that are coming out there and use them. Some of these tools are also evolving as we speak. I think the opportunity for leaders is to understand how to put together the right tools for your own situation, and make sure that you are finding the best in class that you need for your organization at that point in time. I can't really point to a big gap. We have been building out a lot of these tools ourselves, and then when I go out there and look at the market, there are so many awesome tools that have been built. I would have loved to have had access to them 3 or 4 years back, and more are coming out, and I need to sit down and spend time with them. We started a recent relationship with Monte Carlo, we had an older relationship with Anodot, and, obviously, we use a lot of Google managed services.
So the opportunity is really making sure that you're able to pick the right tools at the right time for the business needs.
[00:50:46] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing at Credit Karma and your journey through building out the data capabilities, architecture, and use cases. It's definitely an interesting application of data, and it's always great to be able to understand more about how people have charted their own course through this constantly expanding ecosystem. So thank you again for taking the time today to join me and share your path, and I hope you enjoy the rest of your day. Thanks a lot, Tobias. Have a good one.
[00:51:22] Unknown:
Thank you for listening. Don't forget to check out our other shows: Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story.
And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Vishnu Venkataraman Begins
Vishnu's Background and Journey into Data
Role of Data at Credit Karma
Team Topology and Responsibilities
Operational Characteristics of Data
Complexities and Challenges of Data Management
Determining ROI and SLAs for Data Products
Evolution of Data Platform Architecture
Influence of Data Science and Machine Learning
Ensuring Data Quality and User Understanding
Anomaly Detection and Bureau Data Updates
Lessons Learned and Challenges Faced
Balancing Cost, Performance, and Scale
Interesting Lessons and Wins
Future Projects and Investments
Making Data Scientists' Lives Easier
Biggest Gaps in Data Management Tools
Closing Remarks