Summary
A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges, but at a massive scale, leading him to question whether there is a better way. After tasking some of his top engineers to consider the problem in a new light, they created the Pilosa engine. In this episode H.O. explains how, using Pilosa as the core, he built the Molecula platform to eliminate the need to copy data between systems in order to make it accessible for analytical and machine learning purposes. He also discusses the challenges that he faces in helping potential users and customers understand the shift in thinking that this creates, and how the system is architected to make it possible. This is a fascinating conversation about what the future looks like when you revisit your assumptions about how systems are designed.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- Your host is Tobias Macey and today I’m interviewing H.O. Maycotte about Molecula, a cloud based feature store based on the open source Pilosa project
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you are building at Molecula and the story behind it?
- What are the additional capabilities that Molecula offers on top of the open source Pilosa project?
- What are the problems/use cases that Molecula solves for?
- What are some of the technologies or architectural patterns that Molecula might replace in a company's data platform?
- One of the use cases that is mentioned on the Molecula site is as a feature store for ML and AI. This is a category that has been seeing a lot of growth recently. Can you provide some context on how Molecula fits into that market and how it compares to options such as Tecton, Iguazio, Feast, etc.?
- What are the benefits of using a bitmap index for identifying and computing features?
- Can you describe how the Molecula platform is architected?
- How has the design and goal of Molecula changed or evolved since you first began working on it?
- For someone who is using Molecula, can you describe the process of integrating it with their existing data sources?
- Can you describe the internal data model of Pilosa/Molecula?
- How should users think about data modeling and architecture as they are loading information into the platform?
- Once a user has data in Pilosa, what are the available mechanisms for performing analyses or feature engineering?
- What are some of the most underutilized or misunderstood capabilities of Molecula?
- What are some of the most interesting, unexpected, or innovative ways that you have seen the Molecula platform used?
- What are the most interesting, unexpected, or challenging lessons that you have learned from building and scaling Molecula?
- When is Molecula the wrong choice?
- What do you have planned for the future of the platform and business?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Molecula
- Pilosa
- The Social Dilemma
- Feature Store
- Cassandra
- Elasticsearch
- Druid
- MongoDB
- SwimOS
- Kafka
- Kafka Schema Registry
- Homomorphic Encryption
- Lucene
- Solr
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show.
Your host is Tobias Macey. And today, I'm interviewing H.O. Maycotte about Molecula, a cloud based feature store based on the open source Pilosa project. So, H.O., can you start by introducing yourself? Hello, everyone. And thanks for having me on here. My name is H.O. Maycotte. I'm originally from Mexico.
[00:01:13] Unknown:
I came to the United States to study electrical and computer engineering, and I've been, you know, sort of a crazy, addicted entrepreneur ever since.
[00:01:20] Unknown:
Do you remember how you first got involved in the area of data management?
[00:01:23] Unknown:
Yeah. Ever since I was a little kid, I was always obsessed with the idea that we as humans needed to evolve faster. And, you know, it was pretty clear that data and AI were gonna sort of be the path to get us there. The question for me was, like, how do we beat Darwin? And so all of my companies have really been focused on trying to figure out how to consume and process and use more data along the way. I'm 8 companies in, so like I said, addicted entrepreneur. And I think we're starting to see some really interesting things happening out there today in terms of the super evolution that I dream of. You know, you've probably watched the documentary The Social Dilemma on Netflix, but, you know, data and AI are starting to do some really interesting things, you know, that we can't totally explain at very large scale. We just have to figure out how to make that technology available to everyone. And so that has really been my life mission and why I focus on the problems that I do. Can you give a bit of an overview about what it is that you're building now at Molecula and some of the story behind how you arrived at that particular problem and your particular approach to it? Today, Molecula is positioned as an enterprise feature store. Like you said, we're built on the open source project, Pilosa.
Ultimately, our vision here is to help humans and machines make better decisions. It's kind of crazy to think about, but today, the average company is only using about 1% of their data for analytics and machine learning. And it's not because we don't wanna use more, but, like, physics is in the way for us being able to use more of that information. And, you know, we've been taught for decades to store it, but making it compute ready has been historically very difficult. And I lived this exact same story at our last company. We were a customer data platform for sports, media, and entertainment companies. And every time we signed a new customer, we'd sign hundreds of millions of fans with hundreds of millions of attributes, think like social graphs, behavioral graphs, from hundreds of different data sources. So think like preseason social and arena Wi Fi. I mean, our clients were like the NBA and Real Madrid and these huge, you know, groups of fans that would come on every single time we signed these customers, and our system just could not keep up. We could not do real time on all of this data, and making it compute ready was incredibly difficult. Right? We had huge clusters of Cassandra, of Elasticsearch, of Druid.
And even our most important queries were starting to take longer and longer the more data we put into the system, and we were starting to have to face preprocessing and pre aggregating and all the things that you do to sort of make data fast. No longer could we do the ad hoc queries that we wanted to do on those systems. And we had 2 engineers on our team that came up with this crazy idea, and they said, hey, H.O. You know, we've been trading stocks using computers our entire career in these black box trading systems. In order to train models, we've always done this thing called feature extraction, which is the first process that you do when you do feature engineering.
And at the time, I had no idea what they were saying, but I went with it. They further explained, like, hey. If we could automatically extract features from data at the source, and we could store features instead of data in a purpose built feature store, maybe we could solve some of these analytical problems. And I really had no other solutions at the time, and I told them, go for it. So they took 6 months. They came back with a feature store, and we put it up against some of these massive clusters of Elasticsearch servers. And we thought it was broken the first time we ran it because it was so fast. Our queries were suddenly down to sub millisecond. And when we realized that, you know, on 2 Pilosa servers, and Pilosa is what we call that original technology in the open source version, when we realized that these 2 Pilosa servers were doing the same query that, you know, 40 or 50 Elasticsearch servers were doing, you know, in submillisecond versus, you know, the 10, 15, 20 seconds that it was now taking Elasticsearch to do that same query, we realized that we had really sort of stumbled on something really unique. And we started pushing these bitmaps that we were using underneath the hood further and further, and so we taught it how to store integers and other data types. And we just, to our surprise, found that it could keep going as far as we pushed it. And so we ended up with 9 patents almost immediately, and it was about 4 years ago that that company was merged with another. And we went to our board and said, hey. We really think that we've gotta teach the world how to operate on features and not data. Would you be okay if we spun these patents out? And that's what gave birth to the company that you're hearing about today, Molecula, and our efforts to go unlock data inside of these silos so that companies can put it to work.
[00:05:50] Unknown:
As part of rolling out Molecula or before then, as you mentioned, Pilosa is the open source sort of engine behind all of it. I'm wondering if you can give some of the story about what was your motivating factor for releasing that as open source and just any of
[00:06:07] Unknown:
the additional effort that was needed to be able to decouple the Pilosa code from the company's internals to be able to release it as an open source project? Yeah. Great questions. I think, 1, our engineers had the foresight early on to completely decouple the code base. So we built Pilosa as a completely independent project from day 1, not knowing that we were gonna open source a technology ultimately. And then when we did spin that out about 4 years ago, it was clear that all we knew was that Pilosa was good at speed. And we knew that there had to be use cases, industries, verticals that could benefit from that speed, but we really didn't know what they were. So we knew that the right answer was to take that technology, put it out into the world to see what people would do with it. And sure enough, for about the first 3 years, you know, we had, you know, several thousand companies come through and use Pilosa and tell us, you know, the types of use cases and workloads that they were interested in using Pilosa for. It allowed us to really extend Pilosa further so that we could support more use cases. We could, you know, support dense, sparse, mixed density, ultra sparse datasets. We could support, you know, integers and floating points. And so that time really gave us the ammo that we needed to really figure out our commercial strategy. And it was last year that we decided to start commercializing the technology under the brand, Molecula.
It's been really exciting to get to engage with the community. I will say, though, over the last year, you know, we only had a finite amount of resources. It wasn't until late December that we raised a very large Series A. So I would say for about the last year, we've really shifted focus on figuring out how to commercialize the technology. So most of our efforts have been on the Molecula side. And it's gonna be later this year that we've got some really, really exciting plans with open source, again, to reinvigorate the open source community and continue to reinvest heavily in that community as we commercialize and make the best of the Molecula side from all of our learnings.
And, you know, it's probably worth touching on the Pilosa side. The core systems are all written in Go. That's a distributed system. A lot of the cluster management is something that falls to the user who's operating these clusters. So a lot of what we've been doing on the Molecula side has been focused on ingest, on scaling and cluster management. You know, we have some very, very, very large use cases, so making sure that the technology is stable and adoptable. And it's not just about getting data in, which is a lot of what we focused on, but it's about being able to get data out. So being able to bring traditional query interfaces like SQL so that it's just as easy to get data out. So Pilosa still requires quite a bit of effort to sort of set up your data models, to manage your clusters.
On the Molecula side, a lot of that is taken care of. Right? Adaptability was a huge focus for us on the Molecula side, and we're excited to push a lot of that innovation, code base, and technology back down into Pilosa.
[00:09:18] Unknown:
You mentioned that the Pilosa project itself is creating this bitmap index for being able to automatically extract features from the underlying source data, and that that allows you to execute a lot faster on being able to perform these aggregate analyses on it. And I'm wondering if you can just give a bit of an overview about the types of problems and use cases that that allows you to solve for and some of the purpose built
[00:09:45] Unknown:
solutions that you're providing through Molecula based on that. Yeah. Definitely. And I think it is super important to state how important that automatic extraction of features is. Like, we like to think of the data sitting in a Molecula cluster as compute ready. So having compute ready data to me is an incredibly important part of the puzzle. But once the data is in Molecula, the types of workloads that we're very, very, very good at are your typical analytical workloads. So any type of OLAP workload is something that we would knock out of the park, especially when you start to get into really complicated combinations of query types, so top Ns with filters, or top Ks, group bys, you know, things that on their own might be difficult in a typical system. But I think maybe most importantly is to be able to do all of this without pre joining or pre aggregating data. Right? You know, today, a lot of the data warehouses that exist, in fact, only 1 out of 7 copies of data is original. The other 6 are copies that are being made to pre join and preprocess data and to get the decisions and insights from the original data. And so just the fact that you no longer have to pre join data and you can do those joins at query time is really just a game changer in terms of your ability to have, again, compute ready data.
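To make the bitmap idea concrete, here is a toy sketch in Python. This is not Pilosa's actual implementation or API, just an illustration of the principle: each attribute/value pair (a "feature") becomes a bitmap over record IDs, so a filtered count is a bitwise AND plus a popcount, with no row scans and no joins.

```python
from collections import defaultdict

class ToyBitmapIndex:
    """Toy bitmap index: one bitmap per (attribute, value) 'feature'.

    Python ints serve as arbitrary-width bitmaps; bit i set means
    record i has that feature.
    """

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def set(self, record_id, attribute, value):
        self.bitmaps[(attribute, value)] |= 1 << record_id

    def count(self, *features):
        """Records matching ALL features: AND the bitmaps, then popcount."""
        bm = self.bitmaps[features[0]]
        for f in features[1:]:
            bm &= self.bitmaps[f]
        return bin(bm).count("1")

# Three records with two attributes each.
idx = ToyBitmapIndex()
idx.set(0, "cuisine", "thai");  idx.set(0, "city", "austin")
idx.set(1, "cuisine", "thai");  idx.set(1, "city", "boston")
idx.set(2, "cuisine", "pizza"); idx.set(2, "city", "austin")

print(idx.count(("cuisine", "thai")))                      # 2
print(idx.count(("cuisine", "thai"), ("city", "austin")))  # 1
```

The point of the sketch is that the filter ("cuisine = thai AND city = austin") never touches the records themselves, only two pre-built bitmaps, which is why these queries avoid the latency of scans and joins.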
[00:11:02] Unknown:
1 of the types of patterns that you'll see a lot in data warehouses is that you have these large dimensional tables that are, I imagine, some of the cases where you're extracting the features from the source data. But then you also will need to be able to join in much smaller datasets that have concrete facts about the information that you're trying to derive. And in the Molecula and Pilosa approach where you're generating this bitmap from the source data where you're reducing its dimensionality, how does that play out when you're also trying to join in some of those concrete facts about the source data? Is that something where you would leave the source data in situ and then Molecula would query against that at the analysis time? Or would you also reflect that source data into an accompanying system for being able to join it together? So when the data comes into Molecula, we put it into what are called feature
[00:11:57] Unknown:
tables. And as you suggested, you know, the entire process of feature extracting is really the dimensionality reduction of data. And dimensional data starts with star schemas and relational constructs and sometimes some very complex structures. And at the end of the day, the ultimate goal is to get it into a 2 dimensional space, right, like objects and attributes. Like, the data model for our customers are usually pretty boring. It's like clients and all their attributes. But a lot of that data that you described, the smaller datasets, just gets dimensionally reduced into that 2 dimensional space, or it can also exist as a feature table alongside, you know, the other data, and that join can happen at query time. And back to sort of the original problem, like everyone in the world today, sometimes scale out does not make a difference in terms of query speeds. Right? You can add as many servers as you want, and you hit a latency floor. And so being able to shatter that latency floor means you can now do transformations. You can do those joins at query time. And so if you can't dimensionally reduce it into the same feature table, you can put it into an adjacent feature table and still reap the benefits of that sort of extreme query speed that you're getting out the other end. And you mentioned that some of the genesis for this particular solution was a
[00:13:13] Unknown:
collection of Elasticsearch and Cassandra and various other storage technologies that you were trying to use because of their specific capabilities for doing different types of analyses or aggregates. I'm wondering what are some of the other technologies or architectural patterns that you typically see some of your customers moving from once they start to adopt Molecula and some of the ways that Molecula can replace some of these heavyweight workflows?
[00:13:40] Unknown:
Definitely, we end up replacing large swaths of what I like to call information era technology. We don't typically go in with that as a value proposition when we're selling the technology. We try to focus more on the value that's getting unlocked. But, certainly, our clients very quickly catch on to the fact that a lot of this can be eliminated. So we see a lot of Druid, a lot of Mongo, a lot of Elastic being removed, and it's really all around, you know, the ad hoc queries, the real time queries, and the systems that are being used to sort of pre join and pre process the data. Those can all be eliminated when you can do all of those things in real time. And, you know, we've probably all seen these modern architecture drawings of what the Andreessen Horowitz AI stack looks like. And they all have component after component for ad hoc and for real time. And all of these things can go away with a feature store. So those are the very things that we're replacing once we get implemented with our clients.
[00:14:44] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
You mentioned a few times feature extraction, and I know that you've also moved into the space of Molecula being offered as a feature store. Can you describe some of the nuance there about the difference between extracting features and a feature store and just some of the additional operations that are necessary to go from 1 to the other? I'm excited as a founder and as an entrepreneur to be at the table
[00:15:40] Unknown:
while the definition of feature store is being established. I think it's a fairly new construct. And really, over the last 6 months, we've seen it explode in terms of its popularity. We see platforms repositioning as feature stores or emerging as feature stores. We also end up with Fortune 100 clients all the time who have been building feature stores inside of their stacks for years to try to solve for this problem. What we find pretty much across the board is that they're all really reference architectures for common information era systems used to store features. Right? So they are definitely extracting features, although most of the time, that's a manual process.
And what these modern feature stores are allowing you to do is to store those features and reuse them. But they're storing features in a data store, not in the feature store. So I'd say what is so radically different about our approach is that we see ourselves as the evolution to the data store, a feature first file format that is designed specifically to store features. And if you go back to the history of file formats, I assert that the very first file format was the file allocation tables used to store data on hard drives, although most would argue that it's really the transactional databases that were invented in the seventies with row wise formats, and then we very quickly had the OLAP cubes and what eventually has become sort of the columnar systems emerge to solve the analytical problems.
We see this feature first format as the evolution and the next step from columnar. Right? Columnar is still scanning the entire column. This feature first format that we built from the ground up, we wrote it in Go, literally reads and writes at the value, at the feature, without needing to scan rows, without needing to scan columns. And it's unimaginably fast because it's, you know, essentially deconstructed data into its most elemental form. These massive bitmaps that we shard and homomorphically compress, you know, get loaded into memory. And when you think you're running a SQL query, you're really just pulling bitmaps out of memory and pushing them through the processor.
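The "homomorphically compress" remark refers to operating on the compressed representation directly, without decompressing first. The actual format isn't described here, but the flavor of the idea can be sketched with run-length-encoded bitmaps, in the spirit of Roaring-style run containers; this is an illustrative assumption, not Molecula's real on-disk layout:

```python
def intersect_runs(a, b):
    """AND two run-length-encoded bitmaps without decompressing.

    Each bitmap is a sorted list of (start, length) runs of set bits;
    the intersection is computed run-by-run, never materializing
    individual bits.
    """
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][0] + a[i][1], b[j][0] + b[j][1])
        if start < end:
            out.append((start, end - start))
        # Advance whichever run ends first.
        if a[i][0] + a[i][1] <= b[j][0] + b[j][1]:
            i += 1
        else:
            j += 1
    return out

# Records 0-4 and 10-14, intersected with records 3-6: only 3-4 are in both.
print(intersect_runs([(0, 5), (10, 5)], [(3, 4)]))  # [(3, 2)]
```

Because the work is proportional to the number of runs rather than the number of records, a sparse bitmap over hundreds of millions of rows can be intersected in a handful of operations, which is consistent with the speed claims in the conversation.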
And often in 1 clock cycle, you know, you're doing what might have taken seconds, minutes, hours. We still see long running queries that take half days and days at our customers, and, you know, those now happen in milliseconds. And it's been really, really fascinating to see the evolution of that from when we first invented it to now. But long story short, we see even our competitor feature stores really needing a feature store underneath, not a data store that's storing features. You know, a lot of the problems we're in today are because data stores don't scale to the type of workloads that we're going to need to make real time decisions. So we really wanna solve the feature store problem for even the feature stores.
But, certainly, we believe that the entirety of the machine learning life cycle should be operating on features. You know, whether you're doing simple evolutions of analytics. Right? You're going from descriptive and diagnostic to predictive and prescriptive and proactive, or, you know, an AI infrastructure component, or 1 of these new fields in AI, you know, around explainability and ethics and bias. We see all of that ultimately sitting on top of features, in a real feature store, not on features being stored in the data store. So that that is really the differentiator.
[00:18:57] Unknown:
The way that I've come to understand features in the context of machine learning is sort of a derived fact based on underlying individual elements of data. So 1 of the examples that came up in a previous podcast is for maybe an Uber Eats customer. You have the information about the last order that they placed, the restaurant that it was, you know, the timing of it. And from that, you know, you have some of that historical context, and so you might derive a feature based on that about this particular user's preferred type of cuisine. And so in that context where the extracted or the derived feature is this piece of aggregate information, how does that compare to the feature extraction that you're doing with the bitmap index versus the feature that you're storing in the feature store for the machine learning context?
[00:19:46] Unknown:
Really, at the end of the day, they're the same thing. And I think what we've pioneered is the coarse grain automated feature extraction directly from the source. So we typically attach ourselves to things like Kafka topics, you know, change data capture feeds. And much like, you know, a Fivetran, we don't make any decisions about what data gets put into the feature store or not. There is no quote, unquote feature selection in our automated process. We just move your entire data model into the feature store. We look for schema changes, inserts, deletes, updates, and those automatically make it back into the feature store. If you need to further refine that data, combine it, do transformations on it, calculations on it, those are happening as a further layer of the feature store, and that's where you really start to see feature selection and a lot of what you're talking about. But we like to think that the data's already in a computable state. Right? You don't have to go back to IT and ask them to do another major drop of a file into S3. Like, all of the data is there. And not only is it there, but it's getting updated within milliseconds of the change happening in the source systems.
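As a rough sketch of what a change-feed-driven extraction like this involves, each insert, update, or delete in the source flips the corresponding feature memberships. The event shape and field names below are invented for illustration and are not Molecula's actual tap protocol:

```python
# (column, value) -> set of record ids, a stand-in for a bitmap.
index = {}
# record id -> {column: value}, so updates can clear stale feature bits.
current = {}

def apply_event(event):
    """Apply one hypothetical CDC event to the feature index."""
    rid = event["id"]
    if event["op"] == "delete":
        for col, val in current.pop(rid, {}).items():
            index.get((col, val), set()).discard(rid)
        return
    row = current.setdefault(rid, {})
    for col, val in event["row"].items():
        if col in row and row[col] != val:
            index[(col, row[col])].discard(rid)  # clear the old feature bit
        index.setdefault((col, val), set()).add(rid)
        row[col] = val

apply_event({"op": "insert", "id": 7, "row": {"tier": "silver"}})
apply_event({"op": "update", "id": 7, "row": {"tier": "gold"}})
# index[("tier", "gold")] now contains record 7; the stale "silver" bit is cleared
```

The key property this illustrates is that the index stays consistent with the source on every event, so queries see changes within one event's latency rather than waiting on a batch reload.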
And a lot of times, we see our customers have hundreds. We've never seen thousands of databases yet connected into a single feature store, but we have 1 customer that has 400 different databases inside of their enterprise that contain different aspects of their customers. So they've wired those up to feed the changes into the feature store, and now they have 1 place where all those 400 databases are consolidated into 1 view of that customer.
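The multi-database consolidation described here amounts to merging partial attribute sets from many sources under one customer key. A minimal illustration (the source names and fields are invented, and a real system would do this incrementally rather than in one pass):

```python
def consolidate(*sources):
    """Merge per-customer attributes from many source systems into one view."""
    merged = {}
    for source in sources:
        for customer_id, attrs in source.items():
            merged.setdefault(customer_id, {}).update(attrs)
    return merged

# Two hypothetical source databases holding different aspects of a customer.
crm = {"c1": {"name": "Ada", "tier": "gold"}}
support = {"c1": {"open_tickets": 2}, "c2": {"open_tickets": 0}}

view = consolidate(crm, support)
# view["c1"] combines attributes from both systems into one record
```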
[00:21:14] Unknown:
In that case, the auto generated feature might be the frequency at which a particular user orders from a particular restaurant, and then the higher order feature would be the derivation of their preferred type of cuisine based on that, you know, automatically
[00:21:30] Unknown:
extracted feature. Exactly. Or scoring or ranking. And, again, a lot of that stuff, we prefer you to do in a calculation and not have to pre store. Right? With information era systems, a lot of times, even once you do the manual feature extraction and get your feature sets the way you want them, you're then running a batch process to score and rank them, and then you're running your behavioral based personalization or whatever it is on batch scores. You know, because this underlying feature store is so unbelievably fast, we can do that in the UX. So a lot of the use cases that we're powering are real time personalization, real time risk underwriting that are updated to the millisecond based on the totality of the data available to make that decision without the batch processes in the middle. Digging a bit more into the Molecula platform itself, can you describe some of the overall architecture of the system and some of the ways that the design of that platform and its overall goals have changed or evolved since you first began working on it and building on top of the underlying Pilosa engine? So there's really 3 major components to Molecula. 1, and obviously most importantly, right, is the feature first file format that we invented 7 years ago. Then we have the Molecula platform wrapped around it. You know, Molecula is really doing the orchestration of Pilosa and all of the services required to sort of deliver an end to end experience. And then we have these things called TAPs. TAPs are what we use to get data into the platform. So a tap gets attached to a downstream database, you know, Cassandra, MySQL, Mongo, S3. Wherever the source data is, the tap is attached, and that tap is where the automated feature extraction is happening.
Then we're only routing feature changes to the feature store. So we end up getting, I'd say, anywhere between 10 and 20x compression on the underlying data in terms of the features that are getting sent across the wire and then the features that are getting stored. That's 1 of the magical qualities of features. They're a fraction of the size of the data they're representing without any loss, without any impact to processing time to have to compress or decompress. It's pretty amazing. So it starts at the tap, that moves into the feature store. It gets stored in the file system that we've invented. And then on the other end, it's consumed using client libraries. It's consumed using SQL. We have a proprietary bitwise query language that we call PQL, although that is slowly getting retired. And so those are the various ways that you communicate directly with the feature store. And I actually had 1 of your engineers
[00:24:02] Unknown:
on, I believe about a year ago, to talk about Pilosa. So for anybody who wants some more context, I'll add a link to the show notes. But I'm interested in digging a bit more into that evolution from the proprietary or specific query language for your bitmap index to the more generally available and more widely understood SQL language and just some of the motivation for going that direction and some of the complexities that have arisen from trying to be able to map the concepts for this very specialized storage engine to the generalized declarative SQL language?
[00:24:39] Unknown:
Yeah. No question. I think, you know, obviously, we are very familiar with Pilosa and our own technology, and we know how fast it can be in terms of these analytical workloads. But it has been quite a journey being able to explain it to and convince others that it can defy physics and defy gravity. Right? We're so used to these information era systems that can't solve these problems. I mean, we regularly have rooms full of people doubting our ability to do these things in real time. You know, they're sure that we've pre-joined or pre-processed the data, and we haven't. But it's not historically easy to get data into Pilosa. It's not easy to get data out of Pilosa. And so adoptability was our major compass on the Molecula side, making it easy to use, adoptable, relatable so that it was more familiar in terms of a traditional data system. And so we are going to invest very heavily on the open source side over the next year. So starting in Q2 of this year, we'll be putting a ton of that functionality back into the open source.
And for lack of a better description, Pilosa and the open source are gonna become the offline feature store, and Molecula is in the process of becoming the online feature store. And we've all learned a lot about licensing and open source, so we're still working through some of those details in terms of exactly how we're gonna do that. But we are gonna take some really big risks, and we think this is bigger than us, as we did originally. And we really do want even our competitors collaborating on the source format that we think is gonna power the future of machine learning. So, yeah, we have a lot of work to do still. Digging a bit more into the actual
[00:26:11] Unknown:
internal storage layout and the data model behind Pilosa, I'm wondering if you can just describe some of the specifics there and some of the ways that somebody who's unfamiliar with this technology and the concept of the bitmap indexes that you're creating and the availability for being able to run compute across these sparse matrices and just how that might map into the way that they think about the data modeling aspects of it? So today, we're trying to eliminate the need to have to think about it. With Pilosa, you still have to, but that's gonna go away shortly.
[00:26:43] Unknown:
And at the end of the day, it really is mind bogglingly simple. Like, you were saying earlier, like, the process of feature extraction, we've taken to the absolute most fundamental level. Like, literally, an integer is a yes or no question. Everything is a yes or no question. Right? So you've got objects on 1 side and attributes across the top, and the attribute is literally, you know, is H.O. 44? Yes or no? Is H.O. 45? Yes or no? Now we don't create extra overhead if we don't have to, so there's some really fascinating tricks that you can teach bitmaps. Right? Like, roaring bitmaps and our modification of roaring bitmaps has been a huge part of our ability to compress and operate on compressed bitmaps. Then sharding strategies and how you distribute this in a cluster is yet another whole set of considerations.
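To make the "everything is a yes or no question" model concrete, here is a minimal Python sketch. The record IDs and attribute names are invented for illustration, and plain integers stand in for the compressed roaring bitmaps Pilosa actually uses:

```python
class BitmapIndex:
    """Each attribute is one bitmap; bit i answers 'does record i have it?'."""

    def __init__(self):
        self.bitmaps = {}  # attribute name -> integer used as a bitmap

    def set(self, record_id, attribute):
        # Answer "yes" to the question "does record_id have attribute?"
        self.bitmaps[attribute] = self.bitmaps.get(attribute, 0) | (1 << record_id)

    def intersect(self, *attributes):
        # Records matching ALL attributes: one bitwise AND per attribute,
        # no joins and no row scans.
        result = -1  # all ones
        for attr in attributes:
            result &= self.bitmaps.get(attr, 0)
        return result

    @staticmethod
    def to_ids(bitmap):
        # Expand a bitmap back into the record ids whose bits are set.
        ids = []
        i = 0
        while bitmap:
            if bitmap & 1:
                ids.append(i)
            bitmap >>= 1
            i += 1
        return ids


idx = BitmapIndex()
idx.set(0, "age:44")
idx.set(0, "state:TX")
idx.set(1, "age:44")
idx.set(2, "state:TX")

# "Who is 44 AND in Texas?" is a single AND of two bitmaps.
print(BitmapIndex.to_ids(idx.intersect("age:44", "state:TX")))  # [0]
```

Because every attribute is just a bitmap, a multi-attribute filter that would be a join elsewhere collapses into a handful of bitwise ANDs, which is where the speed being described comes from.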
We use tricks like bit-slice indexing to be able to store 64 bit numbers in a much more compressed representation. And what's been just fascinating about these bitmaps is how far we've been able to push them to solve both workloads and data types that nobody had ever pushed them to before. Bitmaps are not new, and a lot of databases use bitmaps for particular index types. We just flipped it and said, no. Forget it. The bitmap is the database, and we call them features. So it really is unbelievably simple. And so with Molecula, you don't even need to think about that. You change your schema in your source system. It changes the schema down underneath, and you run queries on the other end as if you were querying the original system. So we've really made that transparent for the end user because otherwise, it really is a major
[00:28:15] Unknown:
leap in terms of conception, in terms of how to do that. So that's where a lot of our IP has been focused on over the last couple of years. Digging further into this. So for somebody who's familiar with a more, sort of, what you've been calling an information age system. So in particular, I'm thinking about it for somebody who wants to be able to keep all their source data in their source systems, but still be able to query across it and join and aggregate. The canonical example of that right now is the Presto/Trino project, where you can have the SQL engine that connects up to all these various different data sources, and then you might run these queries to perform aggregates or analyses across these different systems or use it to extract information from those systems to load it into a different data store, into object storage, and then be able to do downstream queries against that.
How does that concept of querying against these different source systems and possibly migrating it into other systems for being able to do more refined analytics compare to the workflow for somebody who's using Molecula
[00:29:18] Unknown:
and building these feature extractions and derived features on top of those source data systems and running these aggregate queries across it? I think it really depends on your use case. Right? I like to say that we solve for the machine scale use cases, and I think a lot of those tools are really good at the human scale. You know, if millisecond latencies on the data, regardless of the complexity of the query, are not important, then, sure, fan out your query, wait for the responses to come back, stitch them together, and then look at that response. Right? That might be good for an analyst trying to get an answer out of a complex set of databases inside of a very large company, but you're never gonna power a real time decision off of that. Right? You're never going to make a recommendation to a physician at the point of care about what the diagnosis may be based on all of the historical data about every patient that's come through that hospital room. And weirdly enough, 1 of the things that I've learned that's been very difficult to sort of accept is that real time is still a really ephemeral term. You know, what people define as real time today is still not.
You know, sometimes real time can be minutes. Sometimes it can be same day. But if you need to make a decision right at that moment based on all of the available data without having to go through batch and preprocessing and precomputation, there really is no other way to do it. So, again, if you're okay with the latency and you can wait minutes, hours, or days for the query to come back, then things like federation are perfect. If you don't care that your BI reports might be up to 2 weeks late, which we see very often, then, you know, batch is fine and aggregation is fine. You know, and this is why things like IoT have been so very difficult. You know, I can't talk specifically about names, but we're working with a very large car manufacturer right now that generates, like, a trillion events off their plants and supply chains every day. Right? This is all the IoT data that's getting created. Well, IoT data is worthless on its own. Right? It's got a serial number. It's got a timestamp and a measurement. You can't make a decision on that. So all of these trillions of events have to get moved and piped back to a data center where they have to be pre-joined with the business data and the inventory data and the customer data. And only then can they start to make calculations about how, you know, a missing part might impact that supply chain in a different region. And by the time that happens, you're talking about hours or days or sometimes even weeks later, when that decision really needed to happen instantaneously.
And so those are the types of use cases that we're very excited about, security, health care,
[00:31:45] Unknown:
IoT, you know, where you really need real time. Your mention of the IoT use case put me in mind of another interview I did in the past about the Swim framework, or SwimOS, which is a different model for being able to compute across real time streaming data rather than having it go all the way back to the data warehouse and then performing your analyses. You do it in flight so that you can get that real time feedback rather than having to figure out, like, okay, well, I've joined it across all this other information, just keeping it, you know, stateful and resident in memory and then having the different nodes within the overall network be able to cooperate for being able to build out these computations. And 1 of the examples that they provided was a manufacturing facility for aircraft being able to do real time, up to date tracking of the various components of inventory because of the complex number of pieces that go into building an aircraft, where it had been a problem that was causing them, you know, significant pain, and then they implemented this type of system to be able to compute it in real time.
And it was just a complete revolution in terms of their ability to be able to keep up to date with the state of their business.
[00:32:51] Unknown:
Absolutely. And I don't think we can even fathom what that revolution is gonna look like when we really unlock real time. On the Molecula side, we're in the process of building out our cloud version right now, and, you know, we're thinking about it as an underlay to all of the major clouds. It's completely hybrid, you know, but it's a network of compute ready data where you tap the source wherever it happens to sit, and you plug a model in and make a decision. You know, it's that easy and that fast. You don't have to worry about all of the architecting and deploying and securing and monitoring that typically goes into
[00:33:22] Unknown:
preparing data for analysis. And so I don't exactly know what the world looks like when this problem is solved, but I do think it is the biggest problem in the world right now. So the technology that we're discussing here, and as you've alluded to, it's something that can be difficult for people to be able to wrap their heads around because it's such a paradigm shift from the way that people are used to thinking about the problem. And so as you are building out the platform and educating your customers and educating potential customers about it or onboarding people, what are some of the most underutilized capabilities of the system or most misunderstood concepts or some of the biggest challenges in helping people to arrive at an understanding of what the capabilities are for this platform?
[00:34:03] Unknown:
I think at the end of the day, because latency has been such a problem, we've created databases for everything. Right? Like, we have time series databases. We now have feature stores that are storing features in databases. The world does not need yet another, like, database type. But disabusing people of the notion that they have to go use a whole combination of databases to solve a single problem, I think, is 1 of the most misunderstood things. Like, once you shatter that latency floor, time, joins, transformations, those all just happen at the query. Right? You shift the power to the data scientist and to the end user. I think it's also very important for us to say, and this is very appropriate for your series, that we sell to data engineers. Right? Our internal mission this year is to turn data engineers into heroes.
We don't sell to data scientists. We don't sell to the end app builders and users. They're the beneficiaries. But you no longer have to go build a special database and dataset for every single data type and then figure out how to stitch them all together. So once the data is loaded into the feature store, it's there to solve time series problems, and it's there to solve complex joins. It's there to solve analytical queries for feature reuse. You can do anything you could do with a conventional system with it, except it's just very, very fast. Just digging a bit more into
[00:35:19] Unknown:
that aspect of things. So you mentioned that you have these different taps for being able to migrate information into the Molecula platform and that there's the SQL interface for being able to query across it and perform these analyses and feature engineering on top of it. But can you just talk through some of the available interfaces and extension points for Molecula and Pilosa for being able to onboard new datasets or some of the considerations that end users need to think about as far as making sure that the information in their source systems stays up to date in Molecula and just the overall
[00:35:57] Unknown:
process of keeping everything in sync along those axes? Yeah. I think if my implementation team were here, they'd probably kick me under the table, but we've really solved for a lot of that. Right? So once you implement a tap, we pick up on those changes. I mean, no question, our absolute favorite way to ingest data is through a data pipeline like Kafka. You know, you attach yourself to a topic, you watch the schema registry. And as changes happen and as updates happen in that topic, the feature store updates itself. Right? So really very little intervention on the end users' part to have to change anything in the actual feature store. And then on the query side, same thing. Right? Once you've built out your queries, you know, a lot of them are portable. I think maybe an area for innovation and investment in the future is building out APIs that mirror existing APIs, right, like the Mongo API or the Cassandra API so that you can bring already written queries for other platforms to the feature store.
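As a rough illustration of what a tap does, the toy Python below watches a stream of source-row changes and forwards only the feature deltas, which is where the compression discussed earlier would come from. The event shapes and field names here are invented for the sketch; a real tap would consume a Kafka topic and track the schema registry:

```python
def extract_features(row):
    # Turn a source row into yes/no features ("field:value" strings).
    return {f"{field}:{value}" for field, value in row.items() if field != "id"}

class FeatureStore:
    def __init__(self):
        self.current = {}  # record id -> set of features currently true

    def apply(self, row):
        """Ingest one changed source row; return only the feature delta."""
        rid = row["id"]
        new = extract_features(row)
        old = self.current.get(rid, set())
        # Only the delta crosses the wire -- unchanged features are never
        # re-sent, which is the source of the 10 to 20x reduction mentioned.
        delta = {"set": new - old, "clear": old - new}
        self.current[rid] = new
        return delta

store = FeatureStore()
store.apply({"id": 7, "state": "TX", "plan": "gold"})
# A later update to the same row ships only what changed:
print(store.apply({"id": 7, "state": "TX", "plan": "platinum"}))
# {'set': {'plan:platinum'}, 'clear': {'plan:gold'}}
```

The same pattern works whether the change events arrive from a database change log or a message bus; the tap's job is just to keep the feature bitmaps in step with the source without the user re-loading anything.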
But, otherwise, the benefits are so overwhelming that sometimes you do have to rewrite some queries or rethink the way that you consume the data. But, often, we're solving use cases that the information era systems couldn't solve. You know, like, we have 1 customer that just brought us a use case, and they do 300 billion transactions a day. No, there was no information era system that solved that. So it's not like they had, you know, 10 years of queries that they have to go rewrite. So we are trying to make it as transparent and easy as possible. It is not always the case, but it's not a huge lift to get there.
[00:37:25] Unknown:
1 of the other things that you mentioned earlier is the fact that you have Pilosa out in the open, available for other people to use, and you're working on making it easier to adopt and easier to integrate, and you're hoping that other companies adopt it for different use cases. I'm curious what you see as some of the downstream potential for
[00:38:42] Unknown:
other types of systems to be built on top of this or other ways that this technology can be used to benefit different companies or different data use cases? I think at the end of the day, just true utility style instant access to data is going to be super, super important. Like, the sharing economy that can develop, I think, is incredibly exciting. I mean, today, we're still focused on solving just accessibility issues inside of companies. A lot of CIOs and CEOs that we're working with from the top down are just trying to break down silos between business units. When we can break down, you know, silos between and amongst companies, like, some pretty exciting things are gonna happen, right, where it's permissions and business decisions that govern the flow of data and not years of architecting and deploying infrastructure for that data to flow. We like to call it software defined data infrastructure. So a lot of our patent portfolio, we have 9 issued patents and 24 along the way that we're doing now, is really around that idea of software defined data infrastructure. Right? Like that intersection between data engineering and data science where, assuming, you know, you have permissions, code and configuration allow the data to flow, not, again, architecting and deploying infrastructure. The permissions aspect is another thing
[00:39:54] Unknown:
worth digging into because along with data, there's always the question of security and access control and compliance issues. And how does that play into the Molecula platform and the capabilities that it provides? And how much of that is specific to Molecula, and how much of it is available in Pilosa itself?
[00:40:12] Unknown:
A lot of the work that we've done on that in Molecula, we will be pushing to Pilosa. If I can pull it off, I'd like for Pilosa to be identical to Molecula. Just one's online, one's offline. Still working through those details. But I think the number 1 benefit of using this representation of data is that you're only storing intersections. Right? So we use homomorphic compression to get the maximum benefits out of the data as it sits in memory. But we're working now on a homomorphic encryption technique that works in a very similar fashion. So the data, you know, in memory is actually also encrypted. But what's beautiful is there are actually no values. Right? That's just a giant matrix of ones and zeros. It's just bitmaps.
We maintain what's called a feature map, which is what allows us to translate data in and out of the bitmap. I like calling the ingest process dehydration and the egress process rehydration. Right? So we use that feature map to dehydrate and rehydrate data. As long as you keep that feature map under lock and key, there is no data in the system. And then if the data is encrypted, then it's even more secure. So with a traditional information era system, you're literally copying the data, copying the values up to 6 different times from the original data source. Here, you just leave that in the original data source. So, inherently, it's a much, much, much more secure format. That said, that's not gonna solve everything. You know, most security breaches happen when the end users are accessing data.
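The feature map idea can be sketched in a few lines of Python. The class and method names here are illustrative, not Molecula's actual API; the point is that the store holds only anonymous bit positions, and without the map there is nothing to read:

```python
class FeatureMap:
    """Translates values to anonymous bit positions and back."""

    def __init__(self):
        self.to_bit = {}    # value -> bit position
        self.to_value = {}  # bit position -> value

    def dehydrate(self, value):
        # Ingest path: assign the value a stable bit position.
        if value not in self.to_bit:
            bit = len(self.to_bit)
            self.to_bit[value] = bit
            self.to_value[bit] = value
        return self.to_bit[value]

    def rehydrate(self, bit):
        # Egress path: turn a bit position back into the original value.
        return self.to_value[bit]

fmap = FeatureMap()
bits = [fmap.dehydrate(v) for v in ["blood_type:O+", "age:44"]]
print(bits)               # [0, 1] -- the store itself sees only these
print(fmap.rehydrate(0))  # 'blood_type:O+'
```

Keep `fmap` under lock and key, as the speaker puts it, and the stored bitmaps alone reveal nothing about the underlying values.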
So we're doing a lot of work on that today. We don't have a lot of good answers yet. I think our basic philosophy on that side is that we don't need to reinvent user access. Today, we've just integrated with all of the security systems that our customers ultimately already have. That said, there's some really huge opportunities that we have that conventional data systems don't do very well. Like, you know, field level security is very easy for us. Right? Because that's just a mask, you know, a bit mask that maintains your security and can, very easily and in real time, control access down to the cell. So how we surface that up and integrate with traditional security systems has yet to be seen. But in general, we wanna integrate with things that are already out there. You've mentioned a few different
[00:42:22] Unknown:
interesting and exciting use cases of people who are already onboarded onto Molecula. But are there any other interesting or unexpected or innovative ways that you've seen the platform used that you'd like to call out? I think the most exciting, and this is obviously exciting for business too,
[00:42:37] Unknown:
is that typically in a company, when you're doing an application or doing a project, you'll go to IT and you'll ask them for data. You'll train models. You'll build them. You'll go through development. And then when you put it in production, you know, you might go through months or sometimes years of architecting and deploying all of that infrastructure. Well, so much of that happens in a very myopic way. Right? That data lake gets built for your project, and then the materialized views get optimized for that application and that query pattern. And so imagine in a Fortune 500 where you've got business units doing this for every single project, you end up in what I call a data death spiral. Right? So you end up with just a proliferation of copies and infrastructure that's kinda hard to even imagine, and getting out of that is very, very difficult. And so 1 of the most exciting things that we found about the feature store is we'll typically go in with 1 or 2 use cases when we make the sale or a customer adopts us. But that same feature store can serve another use case and another use case, and it can serve the adjacent business unit, all without having to go deploy another feature store. You know, you no longer have to build more data warehouses and data lakes and all the things you would typically have to do. And that's been an unexpected surprise for us and for our customers. Our oldest commercial customer, a company called Q2 Banking, they're, like, on their 6th or 7th use case now, all hitting the exact same feature store. They call it their trade store. And those range from, like, security to personalization to underwriting to internal analytics to customer facing analytics. It can all be powered by the same feature store. So that's been something really unexpected and exciting that's happened. In your own experience of building the Molecula platform
[00:44:17] Unknown:
and growing it into this new business in this new area that you're exploring at the same time as building the company around it and the technology that underlies it. What are some of the most interesting or unexpected or challenging lessons that you've learned in that process?
[00:44:31] Unknown:
I think just building something so fundamentally different has been difficult. I like to call it, like, the particle physics of data. Like, when we first saw it, we thought it was broken because it was so fast. Once we figured out it really was that fast, we just thought instantly everybody would wanna adopt it. But it's been very difficult to communicate. Right? And so building, you know, the infrastructure to make it adoptable and building the messaging to make it relatable has been very, very difficult. You know? And today, we still live in a world where we have the analytical use cases, but you still have the transactional use cases. And so it's definitely solving most of the analytical problems. We have some pretty exciting developments where we think we can ultimately solve the transactional ones as well, but that has yet to be seen. So, you know, those are still some of the really big existential
[00:45:17] Unknown:
challenges that we and the world are trying to figure out. Digging a bit further into that, what are the cases where Molecula is the wrong choice and somebody might need to use a different approach or architectural pattern or technological stack for being able to solve for some aspect of analytics or machine learning or whatever other problem domains you're trying to solve for? I think there are really 2 big areas for us. The first is just free text search. Like, we're not, you know, a Lucene or Solr replacement.
[00:45:43] Unknown:
Those platforms do free text search very, very well. Now if you're trying to do some sort of mathematical analysis or pattern analysis, you know, you can take that unstructured data and hash it and move it into the platform and do some really amazing stuff on it in that way. So that's the first piece. Right? Like, we're not a search index. And second, we're not an unstructured data system. Like, you can't upload images into Pilosa or Molecula. You can't upload just raw text. You have to do some level of feature extraction ahead of time. Now once you put that into Molecula, then it unlocks all of the other use cases. Like, we've done some fascinating computational pathology use cases where radiology is getting processed. Those features are extracted and put in Molecula. But now in Molecula is also the EMR record and the lab data and everything else. And so in that moment, they can do computations, you know, at speeds that might have taken them, you know, weeks or months before to get to. So the totally unstructured data is not something that we support yet. And as you continue to build out the Molecula platform and continue to iterate on the underlying technology and capabilities and discover new use cases for it, what do you have planned for the near to medium term future of both the platform and the business? Yeah. I think it really falls into 2 camps, like the core platform and core data format and then the cloud version. You know, we're putting a lot of effort into both, and we're kinda keeping those separate right now from an engineering perspective. So on the core data platform, we're working on some really exciting new versions of the data format. So we have a roaring B-tree format that we're starting to roll out now that allows us to bring ACID compliance and transactions to the core data format. I think equally exciting is the ability to do tiered storage in the core data format.
Like, to date, everything has been in memory, but the ability to offload some of that to SSD drives, to disk, you know, to slower forms of storage means you can maintain, let's say, the last month of data hot, but, you know, the queries might be a little bit slower if you've moved it into a colder tier of storage.
So that's been exciting. I think on the cloud side, you know, really building this global network and trying to think across clouds is exciting. It's an entirely new world, but I do think, you know, Snowflake has shown that you can build an overlay. We're trying to build an underlay, you know, that sort of sits between the network and the cloud with decision ready data. And so that, we are very excited about. And I think as we're thinking through our interfaces too, I think this is 1 of the ones that gets me the most excited, is, you know, as we're really trying to differentiate from all the other feature stores, who are really data stores storing features, we're really that feature store.
I think data stores have queries and feature stores have models. So we're building a model registry. So independently of how you build and develop your models, which is what most of the other feature stores are doing, really, they're just model lifecycle tools, machine learning lifecycle tools. You know, you can register them in our feature store, and they can be as simple as a query, or they can be as complex as a neural network. And they will run both in development and production against the feature store. And so making our primary method of interface a model is something that sounds really simple, but is a really, really big idea
[00:48:57] Unknown:
and something that we're working on right now. Are there any other aspects of the work that you're doing at Molecula or the underlying technology or the use cases that you enable that we didn't discuss yet that you'd like to cover before we close out the show? I think that it'll be interesting to see what future stacks and architectures look like. I think from my very biased perspective,
[00:49:15] Unknown:
I'd love to see a world where we all had data pipelines. We all had a data warehouse or some sort of persistent storage for all of our data to keep it safe and secure, and we have something like a feature store that has compute ready data to make decisions on. Right? And it's as simple as bringing a model and a decision, and it happens. And if we can do that, you know, we'll eliminate a tremendous amount of infrastructure, a tremendous amount of time, a tremendous amount of latency. So it'll be interesting to see how all of that happens. And, hopefully, if we can figure out the right recipe, we can open this up for the world to help us figure it out together.
[00:49:50] Unknown:
For anybody who wants to follow along with you or get in touch, I'll have you add your preferred contact information to the show notes. And as a final question, I would just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. I keep harping on all of this: I think it's the fact that we have to make copies of data to make decisions.
[00:50:08] Unknown:
Right? I think data engineers are firefighters today, and, you know, they're rushing out to just build copy upon copy. If we can help, as a world, focus them on adding value and being able to use the data and manage permissions
[00:50:24] Unknown:
versus making copies, I think we'll live in a much better world. Well, thank you very much for taking the time today to join me and discuss the work that you're doing at Molecula. It's definitely a very interesting platform, solving an important set of use cases. So I definitely appreciate the time and energy that you and your team are putting into this. I look forward to seeing what you're able to produce going forward. So thank you again for your time, and I hope you have a good rest of your day. Great. Thank you for having me on the show. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Episode Overview
Guest Introduction: H.O. Maycotte
Molekula: Vision and Mission
Open Source and Pilosa
Use Cases and Workloads
Feature Extraction and Feature Stores
Molekula Platform Architecture
Comparing Molekula to Other Systems
Challenges and Misunderstandings
Future Potential and Use Cases
Future Plans for Molekula
Closing Remarks