Summary
Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible to build more advanced search functionality, and how Pinecone is architected. This is an interesting conversation about how reconsidering the architecture of your systems can unlock impressive new capabilities.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
- Your host is Tobias Macey and today I’m interviewing Edo Liberty about Pinecone, a vector database for powering machine learning and similarity search
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Pinecone is and the story behind it?
- What are some of the contexts where someone would want to perform a similarity search?
- What are the considerations that someone should be aware of when deciding between Pinecone and Solr/Lucene for a search oriented use case?
- What are some of the other use cases that Pinecone enables?
- In the absence of Pinecone, what kinds of systems and solutions are people building to address those use cases?
- Where does Pinecone sit in the lifecycle of data and how does it integrate with the broader data management ecosystem?
- What are some of the systems, tools, or frameworks that Pinecone might replace?
- How is Pinecone implemented?
- How has the architecture evolved since you first began working on it?
- What are the most complex or difficult aspects of building Pinecone?
- Who is your target user and how does that inform the user experience design and product development priorities?
- For someone who wants to start using Pinecone, what is involved in populating it with data and building an analysis or service with it?
- What are some of the data modeling considerations when building a set of vectors in Pinecone?
- What are some of the most interesting, unexpected, or innovative ways that you have seen Pinecone used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Pinecone technology and business?
- When is Pinecone the wrong choice?
- What do you have planned for the future of Pinecone?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Pinecone
- Theoretical Physics
- High Dimensional Geometry
- AWS Sagemaker
- Visual Cortex
- Temporal Lobe
- Inverted Index
- Elasticsearch
- Solr
- Lucene
- NMSLib
- Johnson-Lindenstrauss Lemma
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you're flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance, whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn't disrupt legacy systems.
High-growth startups use Molecula's feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to preprocess data. If you need to deliver unprecedented speed, cost savings, and simplified access to large-scale, real-time data, visit dataengineeringpodcast.com/molecula. That's M-O-L-E-C-U-L-A, and request a demo. Mention that you're a Data Engineering Podcast listener, and they'll send you a free t-shirt. Your host is Tobias Macey, and today I'm interviewing Edo Liberty about Pinecone, a vector database for powering machine learning and similarity search. So, Edo, can you start by introducing yourself?
[00:02:17] Unknown:
Hi. I'm Edo. I'm the founder and CEO of Pinecone. So I did my undergraduate in physics. I thought I wanted to be a physicist, and I took computer science as a minor because I didn't know anything about computers, and I felt I'd be a pretty bad physicist if I didn't know how to code. And, long story short, I fell in love with the computer science aspect of it and kind of fell out of love with physics. I did my PhD in computer science, with a focus on machine learning and, really, high dimensional geometry and functional analysis, which ends up being a very good foundation for machine learning. After my postdoc in applied math, I also started a company that did real-time video search and then joined Yahoo. You know, Yahoo was actually the first time I was exposed to huge amounts of data. Like, really, Yahoo Mail at the time was already pretty massive.
So doing it at that scale was the first moment I realized, you know, there is a set of tech challenges out there that is beyond what our systems are able to do today. And it's been like that ever since. I spent the latter part of my time at Yahoo managing Yahoo's research lab in New York, and then moved to AWS to build an organization called Amazon AI, which is actually a part of AWS, building solutions like SageMaker and others.
[00:03:47] Unknown:
Yeah. And two years ago, I moved on and started Pinecone. So I've been kind of in and around machine learning and systems my entire life. Yeah. It's funny how physics has become sort of the on-ramp for a lot of people in the computer science industry. As I was coming out of high school myself, I thought, oh, I wanna get my PhD in theoretical physics. And then after a couple of years, I realized that wasn't really the best career path. So
[00:04:09] Unknown:
I ended up leaving that and went and got my degree in computer engineering instead. So here we are. Yeah. But I think, especially in machine learning and AI, you see a lot of physicists because there's a lot of the continuous and kind of more fluid thought, just continuous math. A lot of the logical intuition in machine learning is very physical. Yeah. It's very physics-like. So you see a lot of physicists in machine learning and AI, definitely. And I think there's also still the shared sort of underlying motivation of wanting to understand how things work and be able to, you know, do experiments to figure it out? A hundred percent. A hundred percent. Yeah. It's a very experimental branch of computer science, which is also very exciting. You kinda run stuff and you have your own, like, lab on your laptop, and some stuff comes up like, oh, crap, look at that. You know?
[00:05:03] Unknown:
You don't have to get a multimillion-dollar grant to build a particle accelerator to figure things out. Exactly.
[00:05:10] Unknown:
Exactly.
[00:05:11] Unknown:
And so in terms of Pinecone itself, you mentioned that you've been working on that for the past couple of years now. So I'm wondering if you could just give a bit of an overview about what it is that you're building and some of the story behind how you ended up going in that direction and building out this system.
[00:05:23] Unknown:
I will start by saying that Pinecone, I think, is the right technology at the right time. And maybe I'll try to answer that in a semi-philosophical way by analogizing what's happening in our brain when we look at stuff with what's happening with deep learning today. Okay? And so when you look at something in the world and you recognize it, there are two very separate processes that happen. The first one is, like, mechanical almost. So think about it as, like, hardware. You have light hitting your eyes, you know, going in, hitting your retina, kinda like the CCD in your camera. That gets channeled backwards to the back of your brain, just above the back of your neck.
And that's where the visual cortex is, which would be the analog to, like, a convolutional neural net, one of those deep learning models for computer vision. That processes the images that hit your eyes and creates a very rich semantic representation as neural activations. Actually, the output of the visual cortex, that representation of the image, is very different. In fact, it has very little resemblance to the original image that your eye saw. Right? But, and here's the second phase of this thing, everything that you remember, recognize, understand, infer, and do visually is based off of that deep representation of the image. Okay? The output of the visual system. And that goes from your visual cortex in the back of your brain mostly to the temporal lobe, which is just above your ear, where things like identifying your family members, or remembering, or understanding a physical scene happen. All of that doesn't happen in the visual cortex. It happens elsewhere in the brain.
It's important to say that as kind of a prerequisite, just to say all our memories and all of our understanding in the brain actually happen on, like, a vector representation, on some activation of neurons, that isn't the, quote, unquote, image itself. It's some other semantic representation of that image. So with deep learning, we're now able, in some sense, to have the analog of the visual cortex. We can take an image and pass it through some pretrained computer vision model, be it a convolutional neural net or something else, and get a very rich vector embedding, a representation of that image, such that we can start doing things like identifying objects and remembering and inferring stuff.
But the second phase doesn't exist. Like, we have those deep learning models, and we can create, you know, billions of these very rich semantic vectors, but we don't have anything to do with them. And Pinecone is kind of the analog of the rest of the brain, of the higher cognitive functions. Right? So you would need some form of database to store billions of these high dimensional vectors and do complex operations on them. One of the basic ones is retrieval and similarity. So, basically, just retrieve similar objects or similar images that I've seen. And doing that at scale is already very difficult. And so I think as a society, we're kind of graduating from the hardware level. We're just transforming things to high dimensional vectors, and we're getting to the point where we need to perform higher cognitive functions on them. And that's what Pinecone is doing. So it stores and manipulates and searches through and lets you transact with, you know, billions of high dimensional vectors in very efficient ways, in the same way that your brain would, you know, look at a family member you haven't seen for a long time and immediately recognize them.
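To make the retrieval-and-similarity operation concrete, here is a minimal brute-force sketch in Python: score every stored embedding against a query vector by cosine similarity and return the closest ones. This is purely illustrative toy code with invented names and vectors, nothing like Pinecone's actual implementation, which has to serve billions of vectors efficiently.

```python
import math

def cosine_similarity(a, b):
    # How "semantically close" two embedding vectors are (1.0 = same direction).
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query, vectors, k=2):
    # Exhaustive scan over all vectors: fine for toy data, hopeless at
    # billions of items, which is why approximate nearest-neighbor
    # indexes (and services built around them) exist.
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine_similarity(query, kv[1]),
                    reverse=True)
    return [vec_id for vec_id, _ in scored[:k]]

# Toy embeddings: semantically similar items point in similar directions.
embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.85, 0.2, 0.05],
    "truck": [0.0, 0.1, 0.95],
}
print(top_k([0.9, 0.15, 0.0], embeddings))  # the cat-like vectors rank first
```

The point of the sketch is only the shape of the operation: a query arrives as a vector, and "recognition" is nothing more than finding the nearest stored vectors.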
[00:09:19] Unknown:
In terms of the actual use cases for that, you mentioned, you know, these vectors and what they represent. I'm just wondering if you can extrapolate into some of the ways that Pinecone is being used, or the use cases that it's being built for, at least the things that are driving your immediate product direction, and how people might wanna think about Pinecone in terms of where and when it will solve their problems?
[00:09:50] Unknown:
The technology behind Pinecone, a lot of what Pinecone already does, is in fact a well beaten path. I mean, if you search on Google or Bing, then your query is actually transformed with an NLP model to a high dimensional vector. And you're searching with that high dimensional vector for documents, maybe for documents that would contain answers to the question that you asked rather than containing the actual words in your query. Right? If you're shopping on eBay or on Amazon, then the items recommended to you in your shopping session have to do with how those companies embed you as a shopper, with all your searches and activities and your history, as a high dimensional vector. Right? And so they recommend based on their understanding of a complex object, which in this case is you as a person.
Right? Or your preferences when shopping, you know. You can say the same about feed ranking in Facebook and LinkedIn and other social networks, or, you know, online dating and all that stuff. Right? And so those are immediate use cases, and those tech giants are already using these technologies in house to drive revenue, drive products, and improve user engagement and satisfaction. The challenge for the rest of us is that we are not Google, and we don't have, you know, 50 engineers building that infrastructure for us. For the rest of us, we have Pinecone.
And so you can build those applications on top of Pinecone.
[00:11:24] Unknown:
For the search use case where, you know, a lot of people, if they think search and they're going to build something themselves, they're probably gonna go to Elasticsearch or Solr and Lucene. What are the contextual differences between what you're able to do with those sort of inverted-index style search queries versus this vector-native approach that you're building with Pinecone, and, you know, when somebody might want to choose one versus the other? Inverted indices are
[00:11:52] Unknown:
an incredibly powerful data structure. In fact, the oldest reference to an inverted index I could find was more than 800 years ago. It was actually a lookup index for the New Testament. The Bible was in Greek at the time. So it's actually like an inverted index. It's actually called an index. In the back of the book, you know, the lookup table is an index. Right? That is an inverted index. You look up specific terms and you intersect the page lists. Right? Now, I'm not trying to put it down. It's an incredibly powerful idea and an incredibly useful tool that can be taken very far. Okay? And so you can get things like Google search, which we all know and love and understand how powerful it is.
But, and I think this is where it reaches its limit, in the end, it looks for terms and fields inside documents and filters based on that. If you look at retail, okay, and your shoppers are shopping for, you know, women's dress shoes or whatever, the item that you're looking for might not contain any of those words. Right? It might be, whatever, nine-inch Prada, you know, red leather heel something. And I'm probably exposing my ignorance about women's shoes, so my apologies to the audience. I know very little about fashion, and women's fashion specifically.
But that page might actually be the exact thing that you're looking for. And a person looking at the query and the result page might actually say, oh, that's a good match, because clearly this person is asking about something like this. But you can't map it to, like, this term appeared in this document and that term appeared in that document. So you have to do this, like, semantic matching. Right? And that is done with vector representations. In the same way that when you look for similar images, right, you can go to Pinecone's website and run a bunch of examples. One of them is image search, for example. If you're looking for an image of, like, a bird in a fairly large collection of images, you can't really specify, oh, this pixel is gonna be blue and this pixel is gonna be black. You kinda have to say, oh, the only thing I know about it is it's a bird.
You know, that's really the only thing I'm looking for, and I can't specify anything that looks like a term or a document and so on. And so, yeah, these are very different kinds of technologies, even though both of them inherently care about retrieving relevant items.
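The term-matching limitation Edo describes is easy to see in a toy inverted index. In the sketch below (documents and code invented for illustration), a relevant document that shares no literal terms with the query simply cannot be found, which is exactly the gap semantic vector search fills.

```python
from collections import defaultdict

documents = {
    1: "red leather heel by prada",
    2: "running shoes for men",
    3: "womens dress shoes in black leather",
}

# Build the inverted index: term -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    # Classic term intersection: every query term must literally appear.
    results = set(documents)
    for term in query.split():
        results &= index.get(term, set())
    return sorted(results)

print(search("leather shoes"))  # doc 3 contains both terms
print(search("womens heels"))   # empty: doc 1 is relevant but shares no terms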
[00:14:36] Unknown:
From my understanding too about the overall capabilities of vector embeddings and being able to convert arbitrary information into a high dimensional vector, the other constraint is that with something like Elasticsearch or Lucene, where you're building this full text index, it's in the name that you're constrained to working with textual data. Whereas with a vector representation, you could maybe do similarity search across images, where you say, I have this image of Abraham Lincoln, I wanna look for other photographs that have this same kind of style, or other people who look like Abraham Lincoln, and then be able to pull up some results like that. Whereas if you're trying to do that with Elasticsearch, you would have to have, you know, a rich set of metadata to be able to then try and understand, like, it's grayscale or sepia images, it's this kind of dimensions, or some descriptive text as to what somebody looks like. And so I can see another completely different set of categorical capabilities of vector embeddings versus an inverted text index.
[00:15:34] Unknown:
A hundred percent. You nailed it. I mean, it's exactly right.
[00:15:38] Unknown:
In terms of the actual tools and infrastructure that we have, you mentioned that, you know, we already have systems that are able to generate these vector representations of information. But before Pinecone, or in the absence of Pinecone, what are some of the types of systems and solutions that people are building to address this sort of next step of actually working with this vector data? Where I've generated a vector representation of an image with a neural net, say, and I wanna be able to do some similarity searching, or some sort of analysis in vector space, how are people handling that if they don't have Pinecone?
[00:16:15] Unknown:
It really depends on what you're trying to do. But if you're just trying to do something like similarity search, then you have open source libraries like Faiss, for example, or NMSLIB, that do a good job of allowing you to take a modest amount of data, something that would fit in memory on your machine, and search through it relatively quickly. And those are great libraries. They work well. The main challenge that we see in the companies and individuals that actually end up using Pinecone is that that approach has significant limitations, especially when you're talking about taking something like this to production. And so when you wanna use similarity search at scale or in production, you know, you start having to build a lot of scaffolding around those libraries.
I'll start with saying you have to figure out first which one you wanna use, which one has the functions that you care about, the performance, and so on. You have to benchmark it on your data. There are, like, tens of algorithms and hundreds of parameters that you have to tune that most people don't know how to do, and they spend months on it. And even when you know what you wanna do, building all the scaffolding around it is very difficult. Then you have to do sharding and replication and scaling out. Those libraries usually just index raw vectors. Like, they take a big matrix as an input. That's not enough. You need to have IDs, and you have to have metadata, and you have to filter, and you have to put it all into a proper system, and you have to monitor it and log it, and, like, you know, there's a lot to it. And so, yeah, I mean, these are great pieces of software that we also use internally and develop and improve on as well.
And so you kinda get all the goodness of the open source solutions, plus our own improvements on them, plus all the management. And that's, I think, the main kind of difference between using something like Faiss and using Pinecone.
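A rough sketch of the scaffolding Edo describes — IDs, metadata, and filtering layered on top of a raw vector library that only understands a matrix of vectors. The brute-force scan below stands in for a library like Faiss, and the `VectorIndex` class and its methods are hypothetical names invented for illustration, not any real API:

```python
import math

class VectorIndex:
    """Toy stand-in for the scaffolding teams build around a raw ANN
    library: the library itself only sees vectors, so IDs and metadata
    filtering have to be layered on top."""

    def __init__(self):
        self._vectors = {}   # id -> vector
        self._metadata = {}  # id -> dict of attributes

    def upsert(self, vec_id, vector, metadata=None):
        self._vectors[vec_id] = vector
        self._metadata[vec_id] = metadata or {}

    def query(self, vector, k=3, metadata_filter=None):
        # Apply the metadata filter first, then rank the survivors
        # by Euclidean distance to the query vector.
        candidates = [
            (vec_id, math.dist(vector, v))
            for vec_id, v in self._vectors.items()
            if metadata_filter is None
            or all(self._metadata[vec_id].get(key) == val
                   for key, val in metadata_filter.items())
        ]
        candidates.sort(key=lambda pair: pair[1])
        return [vec_id for vec_id, _ in candidates[:k]]

index = VectorIndex()
index.upsert("a", [0.0, 0.0], {"category": "shoes"})
index.upsert("b", [0.1, 0.0], {"category": "bags"})
index.upsert("c", [0.2, 0.1], {"category": "shoes"})
print(index.query([0.0, 0.0], k=2, metadata_filter={"category": "shoes"}))
```

Even this toy version shows why the scaffolding is non-trivial: filtering, ID bookkeeping, and ranking all interact, and none of it is provided by the underlying vector library itself.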
[00:18:15] Unknown:
Before we go too much further, let's just take a brief aside and go into the history of the name. Where did you come up with that? Well, we thought about it. Like, we kind of brainstormed names. And at some point, it just came up and everybody was like,
[00:18:27] Unknown:
that's cool. Like, everybody was kind of just happy about it. It's hard to explain. I mean, it has a lot of connotations for me. I like growth. I like the fact that it's, like, really geometric and beautiful on the one hand, and on the other, like, very simple and natural and abundant. You know? And from one pine cone, you can create, like, a whole forest of pines. There's a lot of it that I really like. Yeah. But in the end, I have to be honest. Part of it is just that it clicked for us and we just liked it. It just sounded fun.
[00:19:03] Unknown:
Yes. One of the perennial hard problems in computer science is naming things, so it's always fun to get some of the background on where these names come from. Yep. And so in terms of people who are actually using Pinecone, they wanna integrate it into their environment or into their workflow. Can you just talk through how it fits into the overall life cycle of data, or where it sits in the overall ecosystem of data management? Sort of a hypothetical: you know, I wanna be able to perform similarity search on images. What are some of the other pieces of infrastructure or processes that I need to go through before I get to Pinecone? And then on the other side of Pinecone, what are some of the systems that I might be interacting with? As a whole, Pinecone
[00:19:42] Unknown:
is mostly used, almost solely used, in the operational, real-time, production level of your application. Okay? You know, you might have staging and so on set up, but the goal is to be a part of the deployment phase. Okay? That's where you need the operational readiness. That's where you need scale. That's where you need high availability and so on. Your journey probably begins with understanding the input to your system, understanding the images, say, if you're doing image search, or maybe recommendation on some shopping site, or maybe anomaly detection, or whatever it may be.
You have to understand what the input is and how you convert it to high dimensional vectors, what models you're using, you know, how you create that. Those models are what create the input for Pinecone. You can give Pinecone the models themselves and say, hey, when I send this data, convert it this way, or you can convert it somewhere else in your MLOps stack. Right? You will have a Pinecone service running as a microservice inside your environment. It's a cloud managed service, so you will spin it up, you know, adjacent to your region, and you'll hit it with an API when you need to run a query, or to update a data point, or to start a new index, and all that stuff.
And you will connect to it usually from the back end of your application and get, say, the 10 most similar items or similar images, or the 10 most relevant shopping items to show to the next shopper, or the 10 most anomalous, you know, the most similar patterns that I've seen in my system, maybe if you're doing anomaly detection or fraud detection or something like that. You'll get those, again, probably from the back end of your application, and usually do some further processing based on that. And I didn't say how you update it. Of course, in real time, if you get, you know, a new image and you wanna add it to the data, you convert the image to a vector and you upsert it into Pinecone in real time, and it becomes immediately searchable. So it's a real-time, high performance kind of microservice inside your architecture.
You can hit it from wherever you want.
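From the application's back end, the upsert-then-query flow Edo walks through might look roughly like this. `VectorServiceClient` is a hypothetical in-process stand-in invented for this sketch, not Pinecone's actual client API; the point is only the shape of the interaction, including a freshly upserted vector being immediately searchable.

```python
class VectorServiceClient:
    # Hypothetical in-process stand-in for a remote vector service.
    def __init__(self):
        self._items = {}

    def upsert(self, item_id, vector):
        # In the real-time flow, a new vector becomes searchable
        # as soon as the upsert completes.
        self._items[item_id] = vector

    def query(self, vector, top_k=1):
        def sqdist(v):
            return sum((a - b) ** 2 for a, b in zip(vector, v))
        ranked = sorted(self._items, key=lambda i: sqdist(self._items[i]))
        return ranked[:top_k]

client = VectorServiceClient()
client.upsert("img-1", [0.1, 0.9])
client.upsert("img-2", [0.8, 0.2])
# A freshly upserted image is immediately part of the search results:
client.upsert("img-3", [0.82, 0.18])
print(client.query([0.8, 0.2], top_k=2))
```

In a real deployment the embedding model sits upstream of this client: the back end converts each incoming image to a vector, upserts it, and later queries with a vector produced by the same model.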
[00:22:09] Unknown:
In terms of Pinecone itself, let's dig into some of the actual underlying architecture and how it's implemented and just some of the ways that the system has evolved since you first began working on it.
[00:22:19] Unknown:
There are three separate major components to Pinecone that need to be very tightly integrated, but they are conceptually different. The first of them is the index itself, a piece of code, or the set of algorithms, that actually searches through and interacts with the high dimensional vectors. This is a very deep algorithmic, numerical library that uses open source solutions, so that we can again enjoy all the benefits of open source, but also builds in our internal algorithms and improved solutions.
The second layer, think about that as kind of a generalized database built on containers. So we have a Kubernetes orchestration of, you know, sometimes hundreds of containers for a single service, that contains all the shards and the replication, the communication, all the gateways, the transformations along the way. Everything that has to do with high availability, with recovery, with write-ahead logs. Like, everything that you would expect from a database that is outside of the core index is built there. Okay? And the third layer is another, conceptually disconnected, component, which is all the management.
So everything that has to do with spinning up new services, creating accounts, billing and metering and servicing and monitoring. We have a very, very rich and very deep monitoring stack, so that we can figure out, you know, even the slightest degradation in SLA for some customer. We're usually able to mitigate that before they even know something's happening or there's an impact on their service. And so those are, you know, conceptually separate, but, you know, obviously need to be very tightly integrated for everything to work.
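One small piece of the second layer — a gateway merging results from many shards into a single answer — can be sketched like this. This is purely illustrative and not a description of Pinecone's actual architecture; shard contents and names are invented.

```python
import heapq

# Each shard searches its own slice of the data and returns a local
# top-k as (distance, id) pairs; the gateway merges them into a
# global top-k. Smaller distance = more similar.
def merge_topk(shard_results, k):
    return heapq.nsmallest(
        k, (pair for shard in shard_results for pair in shard))

shard_a = [(0.12, "doc-17"), (0.40, "doc-3")]
shard_b = [(0.05, "doc-88"), (0.33, "doc-9")]
shard_c = [(0.21, "doc-52")]

print(merge_topk([shard_a, shard_b, shard_c], k=3))
```

The nice property of this pattern is that each shard only ever needs to return k candidates, so the gateway's merge work stays small no matter how many vectors the shards hold between them.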
[00:24:12] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
In terms of the actual data modeling aspect of it, you know, for people who are coming from a relational world, they understand there are tables with rows and columns, and there are different data types. But if you're working with high dimensional vectors, I imagine there are a number of conceptual elements that are foreign to people who aren't native to that world. So I'm wondering, what are some of the modeling considerations that people should be thinking about as they're populating these vectors in Pinecone, and how does that impact the way they interact with it, or the types of queries they're able to perform, or just the overall interaction patterns around working with these vector representations?
[00:25:23] Unknown:
You know, it depends on who you are. And I think that when we interact with, you know, our customers and people who are interested in Pinecone in general, they divide into 2 groups. 1 group already knows the answer to the question that you asked. Like, they have models. They've trained them. They know exactly what they're trying to do. And they mostly care about speed and scale and stability. They just wanna get unblocked. Right? So they don't have any question there. The other set of people really ask, okay, I have images. What do I do?
Okay. Well, how do I get vectors out of images? Like, where do I even start with text or audio, any of that stuff? To them, we say, A, there are examples on our website. B, there's an infinite amount of material out there on how to do those things. And C, you know, if this is very important for you, we can work with you. Okay? What I'm trying to say is that there is no cookie-cutter answer to this question, you know, and how we represent complex objects really highly depends on the application. And for images and audio, maybe there are some off-the-shelf things that you can just kind of forklift into your application, and it'll work fine.
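For listeners wondering what "getting vectors out of images" looks like mechanically, here is a deliberately crude, pure-NumPy sketch. A real application would use a pretrained model (say, a CNN with its classification head removed) rather than this toy pooling scheme; the function name and dimensions are illustrative only. The point is the shape of the operation: an image goes in, a fixed-length normalized vector comes out.

```python
import numpy as np

def toy_image_embedding(image: np.ndarray, grid: int = 4) -> np.ndarray:
    """Crude stand-in for a learned image encoder: average-pool an
    H x W x 3 image over a grid x grid layout, then L2-normalize so
    the result can be compared with cosine similarity."""
    h, w, _ = image.shape
    cells = []
    for i in range(grid):
        for j in range(grid):
            patch = image[i * h // grid:(i + 1) * h // grid,
                          j * w // grid:(j + 1) * w // grid]
            cells.append(patch.mean(axis=(0, 1)))  # mean color per cell
    vec = np.concatenate(cells)            # length = grid * grid * 3
    return vec / np.linalg.norm(vec)       # unit length

image = np.random.rand(64, 64, 3)          # stand-in for a decoded image
embedding = toy_image_embedding(image)
print(embedding.shape)                     # (48,)
```

A real encoder produces vectors in the hundreds or thousands of dimensions, but the downstream mechanics are identical: store the vector, query by similarity.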
But for others, it might not be. And so if you don't feel like you know what you're doing, maybe you don't. And so it's great to ask either us or anybody
[00:26:52] Unknown:
else. Digging a little bit more into that, I know that, for instance, you know, when NoSQL databases came out, particularly document oriented databases, people said, oh, it's great, I can just put whatever I want in there. And then they get everything in there and realize, oh, wait, this doesn't actually work for what I'm trying to do with it now. Or, you know, they put it in to make it work 1 way, and then they decide, oh, I wanna also be able to build this other thing, and now I need to build this whole transformation step to convert the document structures, or build another layer on top of it. And I'm curious if that's something that people run into in the vector space, where they say, okay, these matrices are built in such a way that I'm able to easily do similarity search across images, but now I wanna be able to do some sort of feature extraction across those images. Do they then need to rebuild new vectors for it? Or, you know, are there any sort of footguns that people need to watch out for as they're building out their initial representations, in terms of how they might use it downstream?
[00:27:45] Unknown:
Yeah. I don't think AI, in general, is at a place where you just have 1 representation that's good for everything. You know, you're very likely to create 1 vector embedding for your, you know, shopping catalog that's good for shopping recommendation, and maybe a very different kind of embedding, a completely different embedding, for, you know, an application that helps you, like, deduplicate, whether, you know, a vendor is trying to sell you an item that you already have. Right? And those might be different embeddings. Right? They probably are. I don't think we're at the stage where there's just 1 set of embeddings that works for everything, unfortunately.
And so maybe we'll get there. You know, I just don't think it's there yet. I mean, this goes kind of back to what I was saying, that you kinda have to know what you're doing. Though I don't wanna scare people away. I mean, there is plenty of material and a lot of examples, and anybody with some aspiration to do machine learning and some willingness to kinda cut their teeth on some blog posts and some example code
[00:28:54] Unknown:
can already get pretty far. And so as you're building out Pinecone, who are your target users? How does that inform sort of your feature road map and the user experience and the interfaces that you build out?
[00:29:07] Unknown:
We have 2 kinds of people who are most interested in Pinecone, and they map to the 2 kinds of personas that I actually already alluded to before. You have the kind of more scrappy, startup-y set of customers and users who really enjoy being hands-off on hosting, management, you know, high availability, and so on. The use cases are not huge, usually, maybe a handful of millions of vectors. It's actually not an insane systems effort to actually scale up and be able to support the load. But also, you know, because Pinecone is paid for by consumption, they just do the math and they figure out, oh, it's gonna be, like, $100 a month. Who cares? So I'll just have Pinecone take care of this for me, kind of have a guaranteed SLA and have everything just set up correctly.
That's 1 side of it. But the second side of it, which I personally find a lot more exciting as a scientist and engineer, are the large companies who say, wait a second. We did this with, you know, 10 million items. You know, with a million items, it was easy. With, like, 10 million, it was already kind of tricky. And now we have a 100 million items, and we have, like, you know, 500 queries per second. And suddenly, we understand that we start having to, like, build some infrastructure here that we're not into building.
Those problems end up getting harder with time, you know. There's, like, you know, very few systems that gracefully go 10 x from where they were designed to operate. And so we work with some customers with some truly obscene amount of ops. Like, we're talking about hundreds of millions of items, like, tens of thousands of QPS, simultaneous updates in the tens of thousands per second, you know, something completely bonkers. Right? Then, yeah, for them, the appeal is very different. Like, for them, the effort is gonna be years' worth of development, if they're even able to do it at all.
[00:31:15] Unknown:
Going back to your earlier analogy of the sort of visual representation in our brain, and how, as an industry, we're kind of at this level of being able to get to the point where it's stored in the visual cortex, and now Pinecone and these vector representations and these engines that enable us to build these higher level systems on top of it are kind of allowing us to, you know, move more into the temporal lobe. I'm curious what are some of the long term downstream impacts that you are anticipating, or starting to see and are excited about, for Pinecone and any other similar systems that are coming out, as to, you know, what kind of capabilities this will unlock for us as data practitioners, as people who are working in engineering, and the sort of products and systems that this will enable more broadly as a society?
[00:32:05] Unknown:
I think we are going to be able to transact with objects that are, in some sense, opaque to most applications today. Right? Things like images, like audio, like long pieces of text, like user behavior. These are going to be processed and systematized and put in an infrastructure that can reason about it and retrieve from it and filter on it and do something cognitive with it. And so our mission, and a part of what we do, is to make that accessible and teach people what they need to know and give them the infrastructure so they can do that in their own application without investing many years of infrastructure building or, like, you know, many, many months or headcount on, like, the machine learning side of that. And I think we're getting closer. We're not there yet. There's still a lot to develop.
We, as a community, need to grow and understand and develop those muscles, you know. And there's a whole community around this that needs to start, you know, the same way that there's a community around Elastic and, you know, Redis and Mongo and so on, where people say, oh, if you do this, that works. And if you do that, you know, that doesn't work so well. And so, you know, we're gonna see exactly the same thing with vector databases and Pinecone specifically. But we're getting there. And we're getting to those cognitive applications that learn and recognize and transact with complex objects in a meaningful way.
I don't think it's very far. I think it's actually already happening in the big companies. Everybody else needs to just kinda see the writing on the wall.
[00:33:49] Unknown:
In terms of the people who are already using Pinecone, or some of the systems that you've built with it yourself, either internally or in terms of the actual capabilities that Pinecone enables, maybe things that you've done in prior roles, what are some of the most interesting or unexpected or innovative ways that you've seen Pinecone and these vector representations used?
[00:34:07] Unknown:
I looked at some point at, like, personality questionnaires out there. You know, it's like yes/no questions or multiple choice questions on how you, you know, like to spend Friday evening, and whether you prefer, you know, this food to that, or whatever. And at some point, I was able to get my hands on a very large collection of those responses. And I was wondering, you know, whether those somehow correlate with behavior in some way that was interesting. And at the time, a friend of mine was really into astrology. Okay?
Which pissed me off because he was a mathematician, and somehow I thought mathematicians shouldn't be into astrology. If you're listening and you're into astrology, I apologize. I don't necessarily subscribe. But I told him, why don't I try to correlate and see? You know, if astrology is right and there are some personality traits that correlate with astrological sign, we should be able to see that in the data. And so we actually took these responses and created vector embeddings for them. And we obviously saw that there's absolutely no correlation, which didn't dissuade him from, you know, still thinking astrology was real. But the interesting thing is the patterns that came out of it showed that, in fact, most of the variability in your personality traits comes down to very few axes, like 3 or 4 or 5 axes of behavior.
And probably if you pin down those numbers, you pretty much capture, like, 95% of the variability, which you could think is depressing, because, see, that makes us very similar. But maybe the flip side of that, maybe the more interesting side of that, is that the remaining variability was completely irreducible. There was absolutely no small number of axes, no amount of data, that you could use to represent it compactly. So the variability is also, like, inherently very individual. Like, there's a good chunk of us that's very alike each other, and there's a good chunk that's, like, completely different, and each 1 of us in those dimensions is completely unique. Yeah. And so these kinds of analyses and these kinds of statistics, I think, are becoming really interesting when you think about data in high dimensional space. It kind of becomes fascinating that way.
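That finding is easy to reproduce qualitatively with principal component analysis. The sketch below uses synthetic data, since the real questionnaire responses aren't available: answers are generated from only a handful of latent axes plus noise, and the top few principal components end up carrying nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "questionnaire": 1,000 respondents answering 50 questions,
# generated from only 4 latent personality axes plus a little noise.
latent = rng.normal(size=(1000, 4))
loadings = rng.normal(size=(4, 50))
answers = latent @ loadings + 0.3 * rng.normal(size=(1000, 50))

# PCA via SVD on the centered data.
centered = answers - answers.mean(axis=0)
_, s, _ = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / (s**2).sum()

# The top 4 components carry nearly all of the variance (well above 90%).
print(explained[:4].sum())
```

With real survey data the picture is noisier, but the shape is the same: a steep drop in explained variance after the first few components.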
[00:36:40] Unknown:
And you're keying off of the high dimensionality of the data. I'm just wondering if you can take a moment to briefly discuss, at the storage level, what that kind of looks like, and any sort of challenges or complexities that you run into in terms of being able to efficiently store that information, and, you know, any sort of compression that you're able to perform to be able to scale out the capacity for storing and interacting with these high dimensional vectors.
[00:37:07] Unknown:
And so I guess I'm not sure how deeply you want me to answer that question, because you could easily go into a rabbit hole with that. You know, I spent my entire PhD on what's called the Johnson-Lindenstrauss lemma, which talks about how you can reduce high dimensional vectors to lower dimensional vectors and still preserve the metric between them almost exactly. There are literally thousands of data structures and mathematical transformations that preserve metrics, that embed data in other spaces, and that help you search through points in high dimensional space. So, yeah, I can talk about it for 10 hours. You know, I actually taught at Tel Aviv University for 3 years, and I think, like, half of my academic course was dedicated to that. So I could literally talk about it for hours.
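The Johnson-Lindenstrauss lemma sounds abstract, but its most common construction is nearly a one-liner: multiply by a random Gaussian matrix scaled by 1/sqrt(k). A minimal sketch of the distance-preservation claim (dimensions here are arbitrary, chosen just to run quickly):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 2_000, 200          # n points, ambient dim d, target dim k
x = rng.normal(size=(n, d))

# Gaussian random projection, scaled by 1/sqrt(k) so that Euclidean
# distances are preserved in expectation -- the construction behind
# the Johnson-Lindenstrauss lemma.
proj = rng.normal(size=(d, k)) / np.sqrt(k)
y = x @ proj

# Relative distortion of a few pairwise distances after projecting
# from 2,000 dimensions down to 200.
for j in range(1, 6):
    before = np.linalg.norm(x[0] - x[j])
    after = np.linalg.norm(y[0] - y[j])
    print(round(after / before, 3))   # all close to 1.0
```

The lemma makes this quantitative: k on the order of log(n) / epsilon^2 dimensions suffice to preserve all pairwise distances within a factor of 1 plus or minus epsilon, independent of the original dimension d.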
[00:38:02] Unknown:
Alright. Well, I guess if we're ever at the same conference or something, then I'll have to ask for a more detailed exploration of that rabbit hole. I guess we'll leave it at that for now. Yeah. I love that. And so in terms of your own experience of building out the Pinecone technology and the business around it, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:38:25] Unknown:
I think 1 of the things that we keep learning, which, it sounds like it was something we should have known a long time ago, but I feel like we keep learning it again and again, is how, somehow, you can show customers a promised land, but they still have to walk the desert themselves. You know, you can't teleport them there. And so we work with customers that, you know, managed to get, like, a 20% increase in their revenue or their engagement on the website, or were able to, like, literally make tens of millions of dollars or more by optimizing some part of their stack or some part of their offering.
But it wasn't magic. It wasn't like Pinecone dropped in and, like, everything was solved. Like, they still had to figure out what they were trying to do. They had to model it. They had to deploy it. They had to A/B test it. They had to walk the desert. Right? For a lot of companies out there, I would still say most, launching an application based on Pinecone or embeddings in production is still, in some sense, a big promise, but also a huge effort. We're trying to narrow that gap and make that a lot more accessible. We learn again and again that it's not magic. You know? There's still effort. There's still knowledge you need to inject to make everything work.
And for me as a scientist and as a data scientist, that's the exciting part. I love speaking with customers and understanding what they're trying to do and saying, oh, yeah, you can do this and that. And, you know, that's exciting for me, but somehow, you know, it can also be daunting. I see that it can be daunting for companies who don't necessarily have the muscles to flex in that direction. So, yeah, we keep learning that. It's a humbling lesson sometimes.
[00:40:13] Unknown:
But I guess that's a part of the kind of AI maturity that we all are developing over time. For people who are exploring the space of working with vectors and being able to do similarity search and, you know, recommendation systems in vector space, what are some of the cases where Pinecone is the wrong choice?
[00:40:32] Unknown:
I think if you have very tiny workloads, maybe, you know, like, thousands of items, and you can just do something brute force, and it would be fine. Right? I think you shouldn't worry about bringing in some big infrastructure. In the end, Pinecone is designed for scale and stability and so on. If you're running a small app and you just don't have such a big workload, you really shouldn't bother, in the same way that you wouldn't use Redis where you can use a hash table. So, yeah, I would discourage people with very small workloads, or where operations isn't, like, a big thing, you know, where, if your process crashes or whatever, you just get in the office the next morning and spin it back up.
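To make that concrete, exact nearest-neighbor search over a few thousand vectors is a handful of NumPy lines with no infrastructure at all (the function name and dimensions here are illustrative):

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5):
    """Exact nearest-neighbor search by cosine similarity.
    Perfectly adequate for a few thousand vectors."""
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity to every item
    top = np.argsort(-scores)[:k]       # indices of the k best matches
    return top, scores[top]

rng = np.random.default_rng(2)
corpus = rng.normal(size=(5_000, 128))  # 5k items, 128-dim embeddings
ids, scores = brute_force_top_k(corpus[42], corpus, k=3)
print(ids[0])  # the query is its own best match: 42
```

Once the corpus grows into the millions, or the workload needs concurrent updates and uptime guarantees, this linear scan is what a vector database replaces with approximate indexes and an operational layer.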
If that's your situation, then, you know, Pinecone is overkill. If you're running something in production and you have a few million points and it's already, like, becoming something that you need to care about,
[00:41:34] Unknown:
then, yeah, you should take a look. As you look to the near to medium term of what you're building at Pinecone, what are some of the things that you have planned?
[00:41:43] Unknown:
Oh, wow. I don't know where to begin, man. Yeah. Like, we're pushing farther on every part of the 3 sections, the kind of 3 investment areas that I told you about. The engine itself, we're accelerating it. We're making it more efficient, more performant. With the cloud section of it, we're making it more versatile, more cost effective, more stable, more highly available, adding functionality like filtering and namespacing and so on, dynamic sharding and auto scaling. There's no end. And within the management part as well, I mean, making it even easier and simpler to work with and manage your infrastructure on Pinecone. So there's no end to this, and we're only scratching the surface, by the way. I mean, we talked about all the higher cognitive functions.
But we're focusing predominantly on similarity search and nearest neighbor search and, kinda, cosine similarities and so on. So, in the end, retrieval. Our brain does a lot more than retrieval. It does some of the learning. It does some of the understanding. It does a lot of things that we don't know how to do yet programmatically, and that definitely are not automated inside Pinecone. So there, we're not even scratching the surface yet. We're really just, like, these are
[00:43:06] Unknown:
just still to be developed and to be even figured out. We'll work on that as well. Are there any other aspects of the work that you're doing at Pinecone, or the overall topic of vector spaces and vector embeddings and being able to operate on those at a higher level, that we didn't discuss yet that you'd like to cover before we close out the show? I think the main thing is that I would encourage people to play with it. You know, it's 1 thing to hear about it and a very different thing to actually
[00:43:32] Unknown:
touch it and play with it and see what happens. And we talked in the beginning of the conversation about how sometimes you just see something and it's just like, oh, you kinda get it. It's just in front of your eyes. Right? Yeah. There are examples on our website on how to put vector embeddings in Pinecone. There are thousands of examples online of how to deal with embeddings, how to get something out of them, what to build with them. Just go play with it. Dedicate, like, half a day and go, you know, kick the tires on some of those ideas. Maybe you'd be able to build something that you care about, maybe not. You know, most likely you'll learn something. It's not gonna be a huge waste of time.
And, yeah, if you get excited about it and build something cool, we'd love to hear about it. That's it, really.
[00:44:19] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[00:44:34] Unknown:
I was kind of brought up in a different age of machine learning and AI, where we just kind of built everything from scratch all the time because nothing had a name and nothing had a platform and nothing had a framework. And there was no definition. Oh, this is training and this is ops and this is evaluation, and this is that. Like, everything is just, like, messy. And as we, as a community, mature, things start to crystallize and figure, oh, you know, here's how you do deployments. And here's how you do training. And here's how data looks like. And here's, you know, So I think what's missing is that maturity, that when a developer, an ML engineer, needs to do something, they immediately know kind of the 5 components that they need to string together and what the 2, 3 top vendors or open source solutions are for those components. And, you know, they just go and do that. I don't know if there's anything we can do to accelerate that process. It's just kind of, I think, crystallized over time for lack of a better word.
[00:45:34] Unknown:
Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Pinecone. It's definitely a very interesting project and something that I'm excited to see some of the downstream impacts that it will enable. I'm sure that I'll be able to find some time at some point to experiment with it. So I appreciate all of the time and energy you've put into the work that you're doing there, and I hope you enjoy the rest of your day. Thank you as well. See you. Listening. Don't forget to check out our other show, podcast.init@pythonpodcast.com to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Sponsor Messages
Interview with Edo Liberty: Introduction and Background
Overview of Pinecone and Its Technology
Use Cases and Applications of Pinecone
Technical Details and Architecture of Pinecone
Target Users and Customer Success Stories
Future of Pinecone and Vector Databases
Interesting Use Cases and High-Dimensional Data
Challenges and Lessons Learned
Future Plans and Closing Remarks