Summary
The landscape of data management and processing is rapidly evolving. Certain foundational elements have remained steady, but as the industry matures, new trends emerge and gain prominence. In this episode Astasia Myers of Redpoint Ventures shares her perspective as an investor on which categories she is paying particular attention to for the near to medium term. She discusses the work being done to address challenges in the areas of data quality, observability, discovery, and streaming. This is a useful conversation for gaining a macro perspective on where businesses are looking to improve their capabilities to work with data.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar to get you up and running in no time. With simple pricing, fast networking, S3 compatible object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
- Your host is Tobias Macey and today I’m interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of Redpoint Ventures and your role there?
- From an investor perspective, what is most appealing about the category of data-oriented businesses?
- What are the main sources of information that you rely on to keep up to date with what is happening in the data industry?
- What is your personal heuristic for determining the relevance of any given piece of information to decide whether it is worthy of further investigation?
- As someone who works closely with a variety of companies across different industry verticals and different areas of focus, what are some of the common trends that you have identified in the data ecosystem?
- In your article that covers the trends you are keeping an eye on for 2020 you call out 4 in particular: data quality, data catalogs, observability of what influences critical business indicators, and streaming data. Taking those in turn:
- What are the driving factors that influence data quality, and what elements of that problem space are being addressed by the companies you are watching?
- What are the unsolved areas that you see as being viable for newcomers?
- What are the challenges faced by businesses in establishing and maintaining data catalogs?
- What approaches are being taken by the companies who are trying to solve this problem?
- What shortcomings do you see in the available products?
- For gaining visibility into the forces that impact the key performance indicators (KPI) of businesses, what is lacking in the current approaches?
- What additional information needs to be tracked to provide the needed context for making informed decisions about what actions to take to improve KPIs?
- What challenges do businesses in this observability space face to provide useful access and analysis to this collected data?
- Streaming is an area that has been growing rapidly over the past few years, with many open source and commercial options. What are the major business opportunities that you see to make streaming more accessible and effective?
- What are the main factors that you see as driving this growth in the need for access to streaming data?
- With your focus on these trends, how does that influence your investment decisions and where you spend your time?
- What are the unaddressed markets or product categories that you see which would be lucrative for new businesses?
- In most areas of technology now there is a mix of open source and commercial solutions to any given problem, with varying levels of maturity and polish between them. What are your views on the balance of this relationship in the data ecosystem?
- For data in particular, there is a strong potential for vendor lock-in which can cause potential customers to avoid adoption of commercial solutions. What has been your experience in that regard with the companies that you work with?
Contact Info
- @AstasiaMyers on Twitter
- @astasia on Medium
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Redpoint Ventures
- 4 Data Trends To Watch in 2020
- Seagate
- Western Digital
- Pure Storage
- Cisco
- Cohesity
- Looker
- DGraph
- Dremio
- SnowflakeDB
- ThoughtSpot
- Tibco
- Elastic
- Splunk
- Informatica
- Data Council
- DataCoral
- Mattermost
- Bitwarden
- Snowplow
- CHAOSSEARCH
- Kafka Streams
- Pulsar
- Soda
- Toro
- Great Expectations
- Alation
- Collibra
- Amundsen
- DataHub
- Netflix Metacat
- Marquez
- LDAP == Lightweight Directory Access Protocol
- Anodot
- Databricks
- Flink
- Zookeeper
- Pravega
- Airtable
- Alteryx
- CockroachDB
- Superset
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. What advice do you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I'm working with O'Reilly Media on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard earned expertise. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar to get you up and running in no time.
With simple pricing, fast networking, S3 compatible object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today, and get a $60 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. And Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their machine learning engineering career track program.
In this online, project based course, every student is paired with a machine learning expert who provides unlimited 1 to 1 mentorship support throughout the program via video conferences. You'll build up your portfolio of machine learning projects and gain hands on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don't have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants.
It only takes 10 minutes, and there's no obligation. Go to dataengineeringpodcast.com/springboard today and apply. Make sure to use the code AISPRINGBOARD when you enroll. Your host is Tobias Macey. And today, I'm interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures. Astasia, can you start by introducing yourself?
[00:02:34] Unknown:
Hi. Yeah. I'm Astasia Myers. I'm part of Redpoint Ventures' early stage team focusing on enterprise. Thanks so much for having me today, Tobias.
[00:02:42] Unknown:
Yeah. Do you remember how you first got involved in the area of data management or working with data companies?
[00:02:47] Unknown:
Yeah. It's actually pretty cool. I started my career in sell side equity research covering publicly traded enterprise companies. So I actually covered Seagate, WD, NetApp, EMC, kind of all the big players. This was an exciting time for storage. It was the era when all flash arrays were starting to come on the scene and software defined storage was all the rage. You know, Pure Storage at that time was still private. And, you know, digging in as an equity researcher really got me familiar with the world of storage and data management. Interestingly, I then transitioned to Cisco, where I was on the M&A and venture investing team supporting the core business units of servers and networking. And we were spending a lot of time analyzing the storage market and did quite a few investments in that space that I led. So I'm very proud of leading the Series C in Cohesity, which is now a unicorn in the backup and disaster recovery space. And we also invested in Datos IO, which Rubrik acquired, SpringPath, which Cisco bought, and then also Elastifile, which Google more recently acquired. And since my time at Cisco, I transitioned to Redpoint's early stage team, and I continue to look at data and ML focused startups, everything from new databases to ETL to ML tooling.
And I also share a lot of research on the subject on my Medium blog and Twitter account for others to learn more about the category.
[00:04:12] Unknown:
And before we get too much further into
[00:04:15] Unknown:
more of your background and the ways that you keep up to date with the industry, can you give a bit of an overview of what Redpoint Ventures works on and your role there? Yeah. Of course. So Redpoint is a Silicon Valley based VC firm that's been around for about 20 years. We currently have 2 funds, a venture fund and an early growth fund. I sit on the venture team. It's a $400 million vintage. We are quite enterprise leaning, so B2B investments represent about 80% of deployed capital, and we invest seed, Series A, and Series B out of that fund.
And we've had a long history of investing in data companies. We've deployed over $215 million in capital over the last few years in those businesses and have been honored to partner with startups like Snowflake, Looker, CockroachDB, Dgraph, Pure Storage, Cyral, Dremio, and SpringPath. So I've been doing it for quite a few years and continue to think there's opportunities in this space. It's definitely not slowing down. My role on the team is enterprise focused investment professional. I work with a handful of enterprise businesses today. 3 are in the data space. 1, which we publicly announced, is called Cyral, which is in data layer access control and visibility,
[00:05:29] Unknown:
and then 2 others in the data infrastructure world. So I love all things data and am looking to speak with startups in that space. From your perspective as an investor and somebody who's working with these companies, what is it about the overall category of data oriented businesses that you find appealing or attractive?
[00:05:47] Unknown:
Yeah. There's a few different factors. You know, 1 thing that we like is these markets are absolutely enormous. So according to IDC, the big data and business analytics market is around $190 billion in annual spend, continuing to grow at a double digit clip, and should be close to $275 billion in just 2 years. So absolutely enormous. To put that in perspective, IDC for the same year says information security spend is only $120 billion. So big data is huge, you know, $65 billion larger in terms of scale. And it's not just the overall market. Look at the individual subcategories: you have BI and analytics, that's about $20 billion; database management, that's $50 billion; data warehouses, that are about $20 billion. They're just huge categories that we rarely see in enterprise.
The second thing is because they're big categories, there's a precedent of large outcomes in all the subcategories. If you look at BI, you have Tableau and Looker and Qlik and ThoughtSpot and Domo. Databases, of course, Oracle, SAP. We've seen Mongo and Cloudera. Data management, TIBCO and Informatica. Log management, Elastic and Splunk. So it's awesome. As an investor, you can point to businesses that have had really successful exits and become enduring companies, which is what we look for. There's a few other things that we like in terms of the market dynamics. These subcategories are large, but there's also a winner take a lot or winner take most dynamic, which is excellent. So when I was covering publicly traded companies like EMC, they had the largest market share. It was around 35%, which is pretty impressive for any category. So it's really an oligopoly style of market here.
And then the tech, you know, it's really hard to build this type of technology. Differentiation matters and is felt by the user. And this technical differentiation can be a defensible moat for the business over time. And because of that, finally, you know, these products are really sticky. It's core infrastructure. Once the technology is adopted, there's a moat, since it's full of friction to rip it out. And this means
[00:08:00] Unknown:
longer contracts that are potentially larger. All things we like to see. And in terms of the information that you rely on for being able to keep up to date on what's happening in the data industry and understand what's relevant and which businesses are going to perform well given the overall economy and the environment that they're working within and building on, what are the types of information that you look at, and how do you gain that understanding?
[00:08:27] Unknown:
Yeah. So as you can imagine, there's really no 1 source we go to. We take a mosaic approach to our research. So, you know, everything from podcasts, like the Data Engineering Podcast. You know, I've been a long time listener and am so honored to be here today. But also newsletters like O'Reilly's and Data Council's, or even from the public cloud vendors that talk about their new product releases. Social media is a great outlet, so Reddit and Twitter. Before, we'd be going to events like Strata and meetups around open source technologies. And, you know, my personal favorite, because I came from academia and sell side research, is speaking to operators and buyers. You know, these calls are often the ones that I have the most fun in during my day. They're just my favorite because, you know, it cuts out the noise and the hype. It goes straight to the people who are working with these technologies and thinking about their architectures.
And, you know, we actively engage operators and have a network. For anyone who's listening who operates in the data space, please email me, and we'll be adding you to our community events and engagement. We think that the insights from the operators, the people on the ground, are fundamental in how we get information and make smart decisions.
[00:09:37] Unknown:
And then within those different information sources, it can often be difficult to pull out the useful signal from the noise because of the variety of ways that it's being represented and potential biases in terms of how things are presented. So what is your personal heuristic for determining the relevance of any given piece of information and factoring that into your overall decision as to whether or not a given company or category is worthy of investment or further investigation?
[00:10:04] Unknown:
Yeah. It's funny. We're on a show that talks about data, and data's information. And like everyone, we're inundated with information all the time. We're really looking for a needle in the haystack. Any tidbit of information that we read can actually change how we're thinking about the world or what we're doing in a single day. Right? If I come across something fascinating, I may spend a few hours trying to investigate further. So you're totally right. You know, I look for indicators and information that suggest 3 things: novelty, game changing, and this is the future. You know, novelty is pretty clear. Is this new, original, or unusual? Have I heard about this before?
We see so much information every day that just being novel can be a big deal. Game changing: is there sentiment around the information? Is this considered groundbreaking? Does this change someone's process or daily activities or their role? And then the future: we kind of think about, is this something that most people will adopt and implement? Can it become ubiquitous in a few years? And so if the information that I'm reading suggests those 3 things,
[00:11:08] Unknown:
it's pretty exciting for us and we dig in further. And so as somebody who works closely with a variety of different companies across different verticals and areas of focus or problem domains, what are some of the common trends that you have identified in the overall data ecosystem that you're keeping an eye on and that you're spending your focus on in terms of identifying potential up and comers that are worthy of investment? We are seeing 3 trends today.
[00:11:32] Unknown:
1 is kind of flexibility around delivery models of the technology. 2 is that data security matters more now than ever before, and this could be data security solutions or data security in products. 3 is increased adoption of open source. So in terms of the flexible delivery models, you know, previously everything was on premise. Now it can be in private cloud, public cloud, or a system that sits across environments. Additionally, a lot of the solutions can be in a fully hosted SaaS, which we all know about, but we're starting to see the emergence of what we call Cloud Prem.
And the idea behind Cloud Prem, which is sometimes called VPC for virtual private cloud, is that it's a new architecture that splits the SaaS application into code and data. What's interesting about this is that the SaaS company writes, updates, and maintains the code, and the customer manages the data. So an example of this would be the control plane is in the cloud, but the data operators are in the private or public cloud VPC. And this is 1 of the biggest changes we've seen recently. Customers are adopting this approach for a few different reasons. The first is cost. So the cloud is more expensive for large enterprises. Many are considering moving back to their own infrastructure and using cloud only to manage burst capacity. This is kind of what we saw originally, almost 10 years ago. And so this architecture can be more cost effective for these buyers. The second is control over data and access management.
You know, this is often the crown jewels of the business. And the third is compliance. You know, we're seeing more regulation than ever before with CCPA and GDPR. A lot of businesses wanna be better prepared and structure their data and how they engage with software applications more centrally for these reasons. And what's interesting about this new Cloud Prem approach is that it really changes the customer's control of the data, and therefore the power dynamics with the vendor. So when the customer controls the data, they can reshape it, move it, share it with a competing vendor, and the customer can create their own applications on top of the data. And the integration between applications no longer has to be at the application layer with Salesforce or Marketo APIs. It can now be done at the database.
So it's really interesting, the unlocking of potential with this new delivery model. The second trend is data security. People wanna control the data themselves for compliance. SOC 2 is becoming a requirement for very early stage startups to sell into the enterprise. It is becoming the standard. And simply, no 1 wants to be in the news because of a data breach. You know, they've been very high profile, from Capital One to Marriott. No 1 wants that, so data security is key in these platforms. And then the third is increased adoption of open source. You know, open source has existed in the data layers for a long time, Postgres, MySQL, but now we're seeing it touch nearly every type of database, from graph to analytical.
We see it moving up the stack to data catalogs, processing engines, ETL, and data quality. While you don't need to deliver data software as open source, you know, Snowflake, which we're investors in, is a clear example of this, there is movement in the space towards open source technology. People want to see the code, check out the tech for themselves, see if it works in their environment, and adopt it if it does without dealing with a sales team. And open source allows them to do that. It gives them flexibility to be less reliant on the vendor,
[00:15:19] Unknown:
potentially decrease costs, and increase control. Going back to your point about the private SaaS model of deploying the actual service into the customer's environment, there have been a few different people that I've spoken with as well that are taking advantage of that. And it's definitely an interesting delivery model, notable examples being DataCoral and Snowplow. And there's also another company called ChaosSearch, all of which allow you to store the data in S3 or wherever your own applications are, and then they'll deploy the software to your environment and manage it. But you still, as you said, have control over the overall life cycle of the information, and you don't have to worry about handing it off to a third party and then requesting access to it to get it back, as with things like Google Analytics. And so for data in particular, where it does have that gravity and it does have the increased cost of moving it to and from different environments, then as you said, it's a cost savings as well as a control element.
[00:16:15] Unknown:
Totally. It's really exciting to see this new architecture. We actually started seeing this emerge, let's say, about 2 years ago. Not in the data space, but businesses that exemplify this are Mattermost, which is an open source alternative to Slack, and Bitwarden, which is an open source password manager. They have these models. We have an entire thesis around investing in companies in this space. Both of the 2 that I just mentioned are in our portfolio. And we think it's really exciting. You know, with increased regulation and higher cost for the public cloud, this gives buyers an opportunity to safely adopt technology that meets the regulatory requirements.
[00:16:55] Unknown:
And so you also wrote an article recently that was highlighting the 4 main trends that you're keeping an eye on for 2020 in the data space. And those call out in particular the elements of data quality, data catalogs, observability of the influences for critical business indicators or KPIs, and streaming data. So taking those in turn, starting with the data quality aspect, what are some of the driving factors that influence that quality, and what elements of that problem space are being addressed by the companies that you're watching? Yeah. So to make sure everyone's on the same page,
[00:17:32] Unknown:
data quality management ensures that data is fit for consumption and meets the needs of data consumers. To be high quality, data must be consistent and unambiguous. You can measure data quality through a few different dimensions: accuracy, completeness, integrity, and validity, among others. And there really isn't 1 source of data quality issues. Capturing the wrong data or excessively collecting it can lead to shortcuts in reporting, so there could be bad data quality. Data quality issues are often a result of database mergers or systems and cloud integration processes in which data fields that should be compatible are not, due to schema or format inconsistencies or unclear field definitions.
You know, at the basic level, manual steps of data entry and manipulation can cause problems. And finally, now there's a fragmentation of information systems leading to bad migrations or data duplication, so that data can become stale and out of date, which is also a form of bad data quality. And then fundamentally, you know, data can be corrupted, or there could be changes in source systems that can lead to bad results. The companies that we're tracking can be categorized across 2 vectors. 1 is internal versus externally generated data. And 2 is in motion or at rest data quality evaluation.
So with regards to the first vector, you know, we can think about third party data that businesses in verticals like finance, real estate, and health care adopt. They're ingesting this third party data to inform their systems. And often the data has not been cleaned and prepped by the vendor, and so they need to adopt technology to make sure the data fits well. And this is very different than internally generated data, like customer data that you would get in tech and e commerce and CPG, where they need to look at the data their systems are organically generating. And then the second vector is around evaluating data quality for data in motion versus at rest. So in motion is tying into Kafka Streams or Pulsar to augment bad data in real time before it reaches the sink. Or for data at rest, there's solutions out there that can scan databases for null values, inconsistent formatting, or changes in the distribution of data, so you can make sure that the current data you're seeing at rest mirrors historical distributions and that there aren't any issues.
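To make the at-rest style of check concrete, here is a minimal sketch in Python using the open source Great Expectations library from the links below, assuming its classic pandas-flavored API; the columns, values, and thresholds are hypothetical:

```python
# A minimal sketch of "data at rest" quality checks using the classic
# pandas-flavored Great Expectations API; the columns, values, and
# thresholds here are hypothetical.
import pandas as pd
import great_expectations as ge

# A snapshot of the table to validate (in practice, a warehouse query result).
df = pd.DataFrame({
    "user_id": [1, 2, 3, None],
    "signup_date": ["2020-01-01", "2020-01-02", "bad-value", "2020-01-04"],
    "order_total": [20.0, 35.5, 18.25, 400.0],
})
dataset = ge.from_pandas(df)

# Null values: every row should have a user_id.
dataset.expect_column_values_to_not_be_null("user_id")

# Inconsistent formatting: dates must be ISO formatted.
dataset.expect_column_values_to_match_regex("signup_date", r"^\d{4}-\d{2}-\d{2}$")

# Distribution drift: the mean should stay within its historical range.
dataset.expect_column_mean_to_be_between("order_total", min_value=10, max_value=100)

# Run all expectations; any failure flags a data quality issue.
results = dataset.validate()
print(results.success)
```

The same suite of expectations can be run on a schedule against fresh warehouse snapshots, which is the pattern the at-rest tools described here automate.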
[00:20:06] Unknown:
And so in terms of the overall data quality landscape, what are some of the unsolved areas that you see as being viable options for newcomers or new businesses to try and tackle, and that businesses would be able to gain value from and are actively looking for? Yeah. What excites me about data quality is that it's foundational
[00:20:27] Unknown:
to businesses' human and machine decision making. You know, dirty data can result in incorrect values in dashboards and executive briefings. It's kinda crazy. We've heard about bad data leading to product development decisions that cost corporations millions of dollars in engineering effort. And then, you know, with machine made decisions based on bad data, it could lead to biased or incorrect actions that create a bad user experience. We've come across a few startups and open source projects operating in this space: Soda, Toro, Monte Carlo, Great Expectations, dbt, and Nexus Data. It's kind of the Wild West in terms of data quality. It's an idea that's been top of mind for senior leaders for the past few years, but there really haven't been great tools out there to solve it.
Some teams have built systems internally to identify data quality issues, but there hasn't been a platform that's emerged just yet. Most of the startups I mentioned have only been around for 2 to 3 years and are early in their journey. So we think there's a lot of opportunity in the space, as this is a top priority for senior leaders.
[00:21:36] Unknown:
And another element that ties into the overall data quality question is ensuring that the processes that are being run on the data aren't eliding important information or introducing inaccuracies or stale or biased data. And that is a big portion of what's covered in the overall concept of data catalogs or metadata management. And I'm wondering what you're seeing as being the main challenges that businesses face in establishing and maintaining those data catalogs and being able to have robust mechanisms for managing all that metadata.
[00:22:15] Unknown:
Yeah. Data catalogs are super interesting because they capture rich information about data, including the application context, behavior, and change, and the lineage, as you noted. It's pretty neat technology because it supports self serve data access, empowering individuals and teams so that they don't actually have to work with IT to receive data, and they can discover what's relevant themselves. And this actually helps improve the productivity of ML and data science teams. The other thing we like is, as you noted, they can address PII. They can discover it, and so you can put controls on who can access PII data.
Some of the challenges faced by businesses establishing data catalogs are in the implementation. As you can imagine, there's fragmentation of data across different silos, databases, storage layers, sometimes in Excel. There are many resources you need to tie into, and this can make it hard to implement a solution. And second is really around user education and adoption. When we talk to buyers, people often say that, theoretically, they understand the value of a data catalog, because the team no longer needs to work with IT, which can be a bottleneck for data access, and they can actually get fresher data by having a self serve model.
But often we hear that these individuals, you know, have to experience a data catalog themselves to fully appreciate the value. And I think that's why we're starting to see now the emergence of many different players. It's taken a few years to frame the value and gain visibility.
[00:23:52] Unknown:
And now teams are starting to adopt it. And there are, as you mentioned, a few different entrants to this market, some of which are fairly well established. The 1 that comes to mind most readily is Alation, but there are also a number of open source options. The ones that come to mind are Amundsen from Lyft or the DataHub project from LinkedIn. And I'm wondering, in terms of the available options that do exist, what you see as being the overall shortcomings in those products that might inhibit their adoption?
[00:24:29] Unknown:
Yeah. You're right. It's interesting to see that Alation and Collibra have been around for a while. They're closed source, enterprise oriented products, and there's been this new emergence of open source projects from Lyft, LinkedIn, and Netflix, and then other large businesses like Airbnb and Uber have built their own and publicly talked about it, but not open sourced it just yet. The way we look at the different types of technologies is, first, you know, closed source versus open source. You know, we are agnostic to that approach. But then also what data sources do they ingest from, the stack they use, and the functionality that they support.
You know, there's a broad range of functionality, everything from looking at sample rows, data profiling, freshness metrics, ownership, top users and queries, and lineage, in addition to the fundamentals of understanding schemas and metadata. And then in terms of stacks, you know, you have Amundsen, which is Python and Node based and uses databases like Neo4j as well as Elastic, while Metacat is Java and Elastic only based. And then finally, in terms of data sources, you know, this is how it could get implemented in an environment. For those that are using Airflow for DAG orchestration, Amundsen has a Python library to integrate at that point.
Other solutions like LinkedIn's DataHub tie directly into Presto and MySQL and Oracle via API calls or Kafka events. So it really depends on a few factors, as I noted: the breadth of the functionality that you're hoping to get from your data catalog, the stack that you are familiar with and comfortable with adopting, and finally, what your data sources are and your perspective on how best to integrate a data catalog. If you're not using Airflow, Amundsen may not be the best choice for you. A lot of businesses are using Airflow, so it's a great option.
It just really depends on your local environment.
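As a rough illustration of the implementation point, here is a sketch (not any particular catalog's actual API) of the kind of extraction job that tools like Amundsen or DataHub run against a source database, using SQLAlchemy's inspector to crawl schemas and emit table metadata records:

```python
# An illustrative sketch (not any specific catalog's real API) of the kind
# of extraction job a data catalog runs: crawl a database's schemas and emit
# table metadata records for the catalog to index.
from dataclasses import dataclass, field
from typing import List

from sqlalchemy import create_engine, inspect


@dataclass
class TableMetadata:
    schema: str
    name: str
    columns: List[str] = field(default_factory=list)


def extract_tables(connection_url: str) -> List[TableMetadata]:
    """Walk every schema and table, capturing column-level metadata."""
    inspector = inspect(create_engine(connection_url))
    records = []
    for schema in inspector.get_schema_names():
        for table in inspector.get_table_names(schema=schema):
            columns = [c["name"] for c in inspector.get_columns(table, schema=schema)]
            records.append(TableMetadata(schema=schema, name=table, columns=columns))
    return records


if __name__ == "__main__":
    # These records would then be loaded into the catalog's search index
    # (e.g. Elasticsearch) and graph store (e.g. Neo4j) for discovery.
    for record in extract_tables("postgresql://user:pass@localhost/analytics"):
        print(record)
```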
[00:26:27] Unknown:
And I think that's why we see so many different offerings in the space today. Another piece that isn't specifically a data catalog, but that I was impressed by when I spoke with the folks behind it a while ago, is the Marquez project out of WeWork, as a means of tracking the datasets and the, you know, processes that produce the end results. Yeah. That's a really interesting project
[00:26:56] Unknown:
that came out of WeWork. I think, foundationally, for all of these data catalogs, it needs to be a seamless implementation into the environment, as I said, based on the data sources or orchestration layer that you're using. All of the data should be prepopulated. It should have freshness. It should be tied to a data steward, which could be auto generated based on who is collecting the data, or by tying into your LDAP systems, so you can know who should be accessing the data. 1 of the things that I think is most powerful is the search functionality in some of these platforms, where you can type in a keyword and they auto discover tables based on relevancy,
[00:27:35] Unknown:
people within your network, and the owner. So you can have the freshest data available. And moving to the overall idea of observability of KPIs and the different factors that influence them and being able to dig into those different elements, what are the overall capabilities that are necessary in those types of systems? And what is lacking in any of the current approaches that businesses are using for being able to track those KPIs and gain insight into what is making
[00:28:06] Unknown:
different indicators move in different directions? In terms of what's necessary, it's obviously access to the data itself. So we've seen examples where they tie into Kafka; others tie into the data warehouse. The next layer is making sure you have a metric that is defined consistently for the product. And the third aspect is the rules engine or machine learning that is applied to identify what the appropriate bounds for this type of data are, alert on whether the data's outside those bounds, and also help with root cause analysis if there are challenges. The way you phrase the question, you make it sound like there have been a number of incumbents offering KPI observability for a while. You know, that's just simply not the case. Anodot is 1 of the most established vendors, and it was founded in 2014.
Most of the companies that we're tracking were founded in 2018 to present. Many of them are still building the solution, so it's hard to say that the current offerings are lacking. What's been great is that 1 of the challenges historically for this category was the integration piece around where you tie into the data pipeline. Data typically was fragmented for customers. Now there's actually clear design patterns that have emerged in the data pipeline. 5 years ago, data warehouses weren't as mature. Now Snowflake, Redshift, and BigQuery are kind of standards. People usually have a data warehouse. People are moving to ELT versus ETL.
And then you think about enriching the data in the warehouse. These solutions can tie themselves to the data warehouse versus multiple data sources, so it's easier to implement and extract value when data is aggregated in 1 location. The second challenge that we've seen with these products is kind of the need to make it a self serve model, where customers can leverage pre-existing metric definitions if they exist in something like LookML and apply them to KPI observability, or easily define their own metrics. So this can be a process. Sometimes there's a cultural element of defining metrics in businesses that can make it a little harder.
And then the third challenge is really just the accuracy of identifying these anomalies and providing insights into the root cause. You know, the rules engines have to be tailored to not just the vertical that the company is operating in, but the company itself in order to extract value. So it is a complicated solution to build and make sure that customers are extracting value from, but we really do like the fact that implementation has gotten easier. And 2, that when we talk to buyers, this is incredibly top of mind. The value prop is clear. Today, people want to look at information and have dashboards.
There are hundreds of dashboards often in large enterprises. People aren't looking at them all the time, but they need to know if a KPI is out of whack, and these technologies are fundamental in allowing them to do so.
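For a sense of what the rules-engine layer described here boils down to, the following is a minimal hypothetical sketch in Python: pull a daily KPI series, as it might come out of a warehouse query, learn its normal bounds from history, and alert when the newest value falls outside them. The metric name and data are invented for illustration:

```python
# A minimal hypothetical sketch of the rules-engine layer: learn normal
# bounds for a daily KPI from its history and alert on the newest value.
import pandas as pd

# In practice this series would come from a warehouse query, e.g. daily
# signups aggregated in Snowflake or BigQuery.
kpi = pd.Series(
    [102, 98, 105, 110, 97, 101, 104, 99, 103, 55],  # last value is anomalous
    index=pd.date_range("2020-06-01", periods=10, freq="D"),
    name="daily_signups",
)

# Learn "normal" from the trailing history, excluding the newest point.
history = kpi.iloc[:-1]
mean, std = history.mean(), history.std()
lower, upper = mean - 3 * std, mean + 3 * std

latest = kpi.iloc[-1]
if not lower <= latest <= upper:
    print(
        f"ALERT: {kpi.name} = {latest} on {kpi.index[-1].date()} "
        f"is outside the expected range [{lower:.1f}, {upper:.1f}]"
    )
```

Production systems replace the static threshold with seasonality-aware models and add root cause hints, but the alerting contract is the same.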
[00:31:22] Unknown:
And then in order for any of these solutions to be useful and effective, you need to be able to collect and track the information that actually feeds into what is causing some of those different indicators to move. I'm wondering what the challenges are in identifying what those data sources are and then being able to collect and associate them effectively. And then once that data is collected, what are the challenges for businesses in this observability space in terms of being able to display and analyze the data so that it is easy to interpret for people who might not necessarily have all of the training to do their own analysis or form their own understanding of what all that data means and how it factors into the overall top level indicator of what they're trying to understand? Well, most data scientists typically have an
[00:32:16] Unknown:
operational and analytical database. So I would argue that the databases that you should be tying into are relatively clear in the customer environment. It's just how you implement your solution. As I said, I think most of the solutions that we've come across tie to the data warehouse, like Snowflake, because that is where information is being aggregated on the analytical side. In terms of how a solution appropriately and clearly shows value, and the challenges behind that: 1 is that the volume of data that these databases are collecting is incredibly vast. And so when you're trying to do this analysis and run your rules engine or machine learning over it, you actually need to load some of this metadata into memory so you can run your analysis. So the movement of the metadata into their architectures has to be fast, and they have to be able to fit the volume of data into memory. So you have to be thoughtful about it. And then the second layer is the visualization of this. People are operating in their BI solution and often are looking at it a few times a day, if not all day long. And with these KPI observability solutions, ideally, you'd be tying into BI so that you're alerting in the current visualization layer of the product. So it is imaginable that the BI solutions today will be offering this eventually. But if you're a third party vendor, you need to make sure that your UI is beautiful and very clearly identifies what is healthy and what is not healthy, and then points to what could be the cause of the data change in the present and over time. And then the last point that you called out as a trend that you're keeping an eye on for this year is streaming, which is obviously an area that's been growing rapidly over the past few years with a number of different open source and commercial options available, the most notable being Kafka and then Pulsar as 1 of its close competitors.
[00:34:18] Unknown:
And then in terms of the corporate space, there's Databricks, which is focused on their streaming capabilities. There's Flink, which is being used for a lot of stream processing, and Pravega from the folks at EMC. Wondering what you see as being the major business opportunities for being able to make the streaming capability more accessible and easier to implement and more effective for businesses that are relying on it? Yeah. So there's a few different ways we're thinking about improved streaming technologies.
[00:34:47] Unknown:
You're right. There's both the processing layer, like Flink, and then you have the more storage pub sub layer, like Kafka. Speaking about the Kafka pub sub layer, which is the category that we're spending the most time on right now, not to say we're not interested in processing, but we just haven't seen as many novel offerings in that space. There's definitely a call to action to anyone listening, if you are compelled by that category and have the background to do so. But in terms of the streaming platform world, we think about improvements across 4 different buckets: speed, volume, management, and cost.
Regarding speed, everything is moving to real time, like dashboards and workflows and actions. If data can flow faster, actions and decisions can be faster. And when it comes to technologies, we've actually seen a few open source projects and other commercial offerings that can be 10x faster than Kafka in production, a slight overhead over what the hardware can do itself. In terms of volume, more data is being created faster than ever before. We've known this for decades now. It's hard to keep up with the data volume, and so new solutions need to be able to deal with high data volume and more topics.
In terms of management, we've been told ZooKeeper, which is core to Kafka, is very hard to manage. You know, often people staff someone to manage the Kafka cluster. I appreciate that the team is replacing this component, but we believe the user experience from a management perspective can be even better. And we've heard that maintenance can be challenging because the number of topics can grow quickly, so teams are constantly balancing and upgrading instances, which can be hard. And then finally, on cost, you can think about it from 2 different lenses. The first is the number of people you have to staff to keep the service up and running. I've heard of teams that have, you know, 3 plus people trying to manage their clusters, which can be very expensive given the rate for a great engineer. And then the second is the servers themselves.
You know, Pulsar is interesting. It has a 2 tier architecture where serving and storage can be scaled separately, which can decrease costs. And this is also really important for use cases with potentially infinite data retention, like logging, where events can live forever. If you can move this to lower cost environments like S3 as compared to high performance disks, this can help with cost management as well. So it's really 4 things: speed, volume, management, and cost. And on the cost aspect too,
[00:37:27] Unknown:
Pulsar and, soon, Kafka have the option of the tiered storage capability, where you can keep the most recent data on that fast disk for access to recent topics and then have older data automatically life cycle off into S3, while still being accessible using the same API if you need to be able to run processing against historical information?
[00:37:49] Unknown:
Yeah. Exactly. I think that was 1 of the biggest improvements of Pulsar over Kafka. And it's great to see that Kafka will also be introducing this, but we've heard from buyers this is incredibly useful. It really cuts down the cost for them and supports this long term storage, which is great in certain regulated industries.
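As an illustration of the "same API" point, here is a small sketch using the pulsar-client Python package, assuming a local broker and a hypothetical topic whose older ledgers have been offloaded to S3 via tiered storage; the read path is identical whether an event lives on the brokers' fast disks or in object storage:

```python
# A sketch of reading historical events from Pulsar with the pulsar-client
# package; assumes a local broker, and the topic name is hypothetical. Once
# tiered storage offloads older ledgers to S3, this read path is unchanged.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Start from the earliest retained message, which may now live in S3
# rather than on the brokers' fast disks.
reader = client.create_reader(
    "persistent://public/default/audit-log",
    pulsar.MessageId.earliest,
)

while reader.has_message_available():
    msg = reader.read_next()
    print(msg.message_id(), msg.data())

client.close()
```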
[00:38:09] Unknown:
And then in terms of the factors that are driving this overall growth in the need for access to streaming data and real time information, what are some of those driving elements, whether from the business landscape or the technical landscape, that are pushing companies to try to adopt these capabilities?
[00:38:28] Unknown:
Yeah. It's interesting. I think it's a little bit of the consumer world flowing into the enterprise world. Consumers have short attention spans, want data immediately, and want insights and answers as soon as possible. And we're starting to see this in the enterprise as well. You know, dashboards are moving to being real time. You know, we're refreshing at 10 minutes or less. Answers need to be as quick as possible, and back end processes are all automated. And so the fresher the data, the better. We believe this is a huge catalyst because everything is moving to real time feedback. Streaming apps produce and rely on a constant flow of this data. You know, common examples include predictive maintenance and fraud detection, recommendation engines, and IoT.
So all of this is increasing in terms of the volume, but also the frequency with which it's being collected. Data scientists typically use streaming data rather than batch to provide rapid insights. Similarly, AI and machine learning models leverage streaming data to constantly train and infer. In short, these 3 things make using streaming data across the board more popular. It's pretty incredible. If you look at the market, it's growing significantly, from around $698 million in 2018 to close to $2 billion in the next few years, a 22% CAGR over the period, which is really fast for most enterprise segments.
I think this is actually an underestimate of how big the market is. I think we'll slowly see the degradation of batch, and most things will go to streaming unless it's exorbitantly expensive. But as we can see, a lot of these new projects are open source and cost effective. Yeah. And the interesting thing about the stream batch dichotomy is that a lot of the major proponents of stream processing and streaming data
[00:40:22] Unknown:
attest that batch is just a special case of streaming and that you can and should just handle everything in the streaming context?
[00:40:29] Unknown:
Yeah. It's really interesting to see that. It's very smart positioning, I would say. When you think about some of the batch processing engines, and I'm thinking about Spark's, you know, stream processing layer, it's actually micro-batches. So they take the reverse opinion of that. But yes, it is very smart positioning to say that batch is a subcomponent of streaming. I personally think that there's value in both systems, but there's gonna be a significant migration to streaming over time. Once again, I think that's the consumer appetite percolating into enterprise businesses and wanting data and answers faster than ever before.
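To ground the micro-batch point, here is a minimal Spark Structured Streaming sketch: the select/groupBy logic is ordinary DataFrame code, and Spark executes it as a series of micro-batches over the unbounded source. The socket source and word count aggregation are illustrative:

```python
# A sketch of Spark Structured Streaming's micro-batch model: the same
# DataFrame logic that would run on a static table runs continuously here.
# The socket source and word count aggregation are illustrative.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("wordcount").getOrCreate()

# Swapping read for readStream turns the batch query into a continuous one.
lines = (
    spark.readStream.format("socket")
    .option("host", "localhost")
    .option("port", 9999)
    .load()
)
counts = (
    lines.select(explode(split(lines.value, " ")).alias("word"))
    .groupBy("word")
    .count()
)

# Each trigger processes the newly arrived data as a micro-batch.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```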
[00:41:13] Unknown:
And for businesses who are trying to adopt streaming, what are some of the barriers to entry that you're seeing and some of the missteps or mistakes that you see being made that could easily be addressed by having a vendor that works to paper over those challenges?
[00:41:30] Unknown:
I think the main challenge that we see is the fact that there is a deficit of great data engineers and data platform teams, and often these businesses can't access those wonderful individuals. There's a concentration of talent at tech companies, and especially on both coasts, as compared to broadly distributed across North America and Europe. And so sometimes they just don't have the data teams in place that they need to be able to adopt these technologies. You know, a traditional DBA is quite different than what a current data engineering professional is responsible for.
And so vendors that provide hosted services or customer support, which was a driver early on in a lot of these businesses, really helped get customers up and running with these systems. The second thing, which is, I think, lighter, because I think most people appreciate the value of streaming, is identifying a use case where they can see a clear ROI and the cost is not exorbitantly expensive. Right? And so finding those use cases can be challenging at times for some businesses. I think there's enough proof points in fintech around trading and fraud detection, and in traditional enterprise around product user
[00:42:54] Unknown:
analytics, but it is still early days. I think there will be more use cases that are unlocked and that we discover in the future. And then with your focus on these 4 major trends, how does that influence your overall investment of time and attention, and where you make decisions as to where to actually put capital in play? Yeah. It's interesting. Investors sit on a spectrum of being opportunistic
[00:43:17] Unknown:
and thematic. As you can kind of tell, I do a lot of first person research. I like to publish that out to the community to share those insights and what we're hearing. So I lean thematic, because I think it's helpful to deeply know the landscape and the technical differentiation in order to make literally data driven decisions about where we invest our capital and who we partner with. That's also really helpful. Right? When we have seen a lot of different vendors and startups and have a deep understanding of the category, when an entrepreneur or founder comes to us with a piece of technology, we can truly appreciate the challenge in building that and the value of what they have done.
So, you know, I'm super excited about data and ML focused startups overall. We believe those categories are massive, foundational, and can result in large exits. Since I'm more thematic, the 4 data themes that I discussed, data quality, data catalogs, KPI observability, and streaming, are particular areas we've been digging into further based on my research, speaking with operators, and kind of our hypothesis of where the world is moving to. So when I do research and share out my themes, those are particular areas where I'd love to talk to founders and find a partnership opportunity.
[00:44:39] Unknown:
And then outside of those particular areas of focus, what are some of the other unaddressed markets or product categories that you see which would be lucrative for new businesses, particularly in the data space? So it's interesting. We think about
[00:44:54] Unknown:
kind of data infrastructure in terms of a Maslow hierarchy of needs. And so kind of foundational and at the base of the pyramid, you have data warehouses, and beyond that you have ETL, and then you have BI. And the newer technologies we discussed today are more at the top of the pyramid. You know, more fast movers and early adopters are considering these technologies today, like data quality. But we believe the industry is moving in this direction, and these pieces will eventually become a crucial component of the future data stack. And that's why we're evaluating them. Beyond the topics we talked about, I think there continues to be an opportunity to improve data access and usage for non technical users.
You know, I appreciate this show is really targeted to people that are technologists, engineers, data platform leads, or PMs in this space, but we're excited about new technologies that help the everyday business user access, munge, and leverage data. I think great examples of that are Airtable, which completely changed how people think about spreadsheets, Alteryx for data munging and cleaning, and then, finally, solutions that allow people to adopt ML in their workflows, like forecasting, inventory management, and financial projections.
This is really neat. Previously, all of that was confined to, you know, data scientists and machine learning engineers that have such technical depth. And with the commoditization of machine learning algorithms and new platforms that are making it easily accessible to the everyday business user, it's incredible to see how, in the future, business oriented employees will be able to access, clean, and create models themselves, creating huge business impact. So I'm really excited about all data tools that can facilitate the work of non technical users.
[00:46:54] Unknown:
And so in most areas of technology and in data in particular these days, there's a strong mix of open source and commercial solutions that are available for solving any given problem with varying levels of maturity and polish between them, where a lot of the commercial options might just be an optimization of an open source platform that adds in some ease of use capabilities or additional security measures. And I'm wondering what your views are on the overall balance of this relationship in the data ecosystem.
[00:47:25] Unknown:
Well, Tobias, I'm so happy you asked this question, because it's 1 of the questions that we get asked the most as investors by founders operating in the data ML space. They always think, do I have to be open source? Should it be closed source? How do I demonstrate the value in a commercial offering when I have an open source project? Overall, we wholeheartedly believe that a great solution can be open or closed source. Right? You have great open source projects like Kafka and Spark and Elastic and CockroachDB, which are wonderful examples of a venture backed startup supporting an open source project and then adding commercial value on top through security, compliance, and integrations that help with customers. But there are also examples of closed source offerings that are absolutely killing it, like Snowflake and Dremio. So there really is no 1 right answer. What we see is that the mix of open source to closed source also depends on where in the stack you are operating.
Core infrastructure like databases, we are seeing even more of a movement to open source. Higher in the stack, like BI, traditionally it's been closed source, but people are starting to adopt open source options like Superset if they're a more technical user. What we find is the closer the solution is to the business person, the less likely the solution is going to be open source, because these people are unlikely to know how to get it up and running. So a fully packaged solution is a better fit for them. What we often find is that open source is used for more of customer acquisition, to generate pipeline. So technologists can pick up the solution, check it out themselves, see if it works, as I said, and then the company that supports the project can call on the user and try to convert them into a paying customer for support or an enterprise instance or a fully hosted offering. This same dynamic on the go to market side that open source creates can be created by closed source technology. This can be through trials with self serve models or sandbox environments.
So it really can be approached with both form factors. I would think about 2 things: 1, the technical capability of your buyer, and 2, how low in the stack you are operating,
[00:49:46] Unknown:
because there's more of a precedent lower in the stack for open source than higher. And as you mentioned, the solutions, as they get closer to the business user, the more likely they are to be commercial. But another element of that is that the lower down elements in the stack that are being increasingly open source, in terms of databases or streaming platforms, are also the elements that are closest to the core data and how it's being stored and represented, which is where a lot of the potential lock in occurs. So business intelligence platforms, for instance, are fairly easy to swap out, because you just need to connect them to the data source. But if your data is owned by a proprietary solution and it's stored in a proprietary format, it creates a much stronger form of lock in and potentially adds an extra bit of resistance for technical implementers in terms of adopting that technology, because they don't want to be locked in and have their data held hostage by a platform that may not exist in 10 or 15 years. And so having that migration capability built in is a strong concern there. And I'm wondering what your experience has been in that regard, in terms of the companies that you work with and their views on which solutions they want to have open source, or at least supporting open standards, versus being comfortable with buying a fully commercial solution? What we see around adoption of open source at different layers in the stack,
[00:51:08] Unknown:
it depends on 3 things: 1, the type of business and the vertical that they're operating in; 2, the maturity of the business; and 3, the technical depth of their staff. In terms of verticals, we often see that laggard industries, like manufacturing and energy, are still okay with closed source offerings. Some of this is because they don't have the technical staff in place to be able to support open source and host it themselves. In terms of maturity of the business, often we see that early stage startups don't have the capital to spend on expensive commercial offerings, so they have to go to open source to build their product and support their customers.
In terms of technical depth of the team, we discussed this earlier. There's a concentration of talent, of people that simply know how to set up, manage, and remediate challenges with Kafka or Spark or CockroachDB. And sometimes teams can't hire these people. They don't have access to them, and so they need a commercial vendor to come in and help with this process. So it depends on a few different factors in terms of the business itself. In terms of layers of the stack, we often see that because the user is technical at the infrastructure layer, they can manage open source and get it up and running. Once you touch a business analyst, and sometimes even data scientists, they don't have the technical capacity to get these solutions up and running themselves. And that's why a packaged,
[00:52:43] Unknown:
hosted offering is the best fit for them. Alright. Well, for anybody who wants to follow along with you or get in touch and keep up to date with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. We talked about it earlier, but I'm incredibly pumped about data quality.
[00:53:07] Unknown:
It is a top 3 priority for essentially every data executive, or even C suite executive. 2, the solutions in the space are early but promising. And 3, we think it's fundamental to any good data stack, because bad data quality results in bad decision making, which can be incredibly detrimental, and people presenting bad data lose face in organizations. So we think this is going to be ubiquitous across all enterprises. So data quality is top of mind. If you're working on a data quality solution,
[00:53:38] Unknown:
please reach out to me. I'd love to chat with you. Alright. Well, thank you very much for taking the time today to join me and share your expertise and experience of working with all these different companies across the different industries to tackle the data challenges that exist. It's definitely a very important problem domain, and 1 where it is important to have the necessary funding for these companies to be able to build the solutions that they're trying to provide. So thank you for all of your effort on that front, and I hope you enjoy the rest of your day. Yeah. Thanks so much for having me. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Astasia Myers and Redpoint Ventures
Appeal of Data-Oriented Businesses
Research and Information Sources for Data Trends
Key Trends in the Data Ecosystem
Focus on Data Quality
Challenges in Data Catalogs and Metadata Management
Observability of KPIs and Influencing Factors
Growth and Opportunities in Streaming Data
Investment Strategies and Emerging Opportunities
Open Source vs. Commercial Solutions in Data Management