Summary
Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add in the work of creating the model itself, and it’s not surprising that a majority of companies that could benefit greatly from machine learning have yet to put it into production or see value from it. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response. In this episode he shares his perspective on the benefits of declarative machine learning workflows as a means of accelerating adoption in businesses that don’t have the time, money, or ambition to build everything from scratch. He also discusses the technical underpinnings of what he is building and how using the data warehouse as a shared resource drastically shortens the time required to see value. This is a fascinating episode, and Tristan’s work at Continual may well be a catalyst for a new stage in the machine learning community.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $3000 on an annual subscription.
- Your host is Tobias Macey and today I’m interviewing Tristan Zajonc about Continual, a platform for automating the creation and application of operational AI on top of your data warehouse
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Continual is and the story behind it?
- What is your definition for "operational AI" and how does it differ from other applications of ML/AI?
- What are some example use cases for AI in an operational capacity?
- What are the barriers to adoption for organizations that want to take advantage of predictive analytics?
- Who are the target users of Continual?
- Can you describe how the Continual platform is implemented?
- How has the design and infrastructure changed or evolved since you first began working on it?
- What is the workflow for someone building a model and putting it into production?
- Once a model has been deployed, what are the mechanisms that you expose for interacting with it?
- How does this differ from in-database ML capabilities such as what is offered by Vertica and BigQuery?
- How much understanding of ML/AI principles is necessary for someone to create a model with Continual?
- What is your estimation of the impact that Continual can have on the overall productivity of a data team/data scientist?
- What are the most interesting, innovative, or unexpected ways that you have seen Continual used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Continual?
- When is Continual the wrong choice?
- What do you have planned for the future of Continual?
Contact Info
- @tristanzajonc on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Continual
- World Bank
- SAS
- SPSS
- Stata
- Feature Store
- DataRobot
- Transfer Learning
- dbt
- Ludwig
- Overton (Apple)
- Hightouch
- Census
- Galaxy Schema
- In-Database ML Podcast Episode
- scikit-learn
- Snorkel
- Materialize
- Flink SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Have you ever woken up to a crisis because a number on a dashboard is broken and no one knows why? Or sent out frustrating Slack messages trying to find the right dataset? Or tried to understand what a column name means? Our friends at Atlan started out as a data team themselves and faced all this collaboration chaos. They started building Atlan as an internal tool for themselves. Atlan is a collaborative workspace for data driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets and code, Atlan enables teams to create a single source of truth for all of their data assets and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker, and more.
Go to dataengineeringpodcast.com/atlan today. That's A-T-L-A-N, and sign up for a free trial. If you're a Data Engineering Podcast listener, you get credits worth $3,000 on an annual subscription. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E, and get a $100 credit to try out a Kubernetes cluster of your own. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Tristan Zajonc about Continual, a platform for automating the creation and application of operational AI on top of your data warehouse. So Tristan, can you start by introducing yourself?
[00:02:07] Unknown:
Hey. So my name is Tristan. I'm one of the cofounders and the CEO of Continual, and I'm happy to chat all about it and my journey to what we're doing at Continual. Very excited to be here. Thanks for having me. And do you remember how you first got involved in data management? I mean, I've been involved in data management, data science, the broad use of data for honestly the past 20 years. I got started in my twenties, really in the academic context, all the way down to actually going into the field and collecting data. I was working with the World Bank on research projects in Pakistan and was actually in these villages trying to get data onto Palm Pilots. And then once the data came back to the central office, trying to clean it up using all the standard tools of that era through the early 2000s.
A lot of them were proprietary tools like SAS, Stata, and SPSS. That led to graduate school, doing a lot of research, a lot of applied statistics, Bayesian statistics, and increasingly shifting over to the open source data science ecosystem. So that was really the early part of my story, when I was actually getting my hands very, very dirty. As part of that experience, I quickly became frustrated with the state of the art for tooling in data science and felt like there was tremendous opportunity to reimagine tooling for data scientists and machine learning engineers. Well, I guess what you would now call machine learning engineers; in the past you might have called them statisticians.
When I was doing this in the 2013 era, data scientist was a term that was really coming to the forefront. I came out of graduate school and, rather than going on to become a professor, got that entrepreneurial itch, which I'd had for a long time, and founded a company called Sense, which was my first startup. That was one of the early pioneering, if you want to say that, enterprise data science startups. You can think of us like Databricks without Spark, or Domino Data Labs is another one that was our peer at that time. We were really trying to build a platform for code-first, open source centric data science teams working within the enterprise that need the collaboration, security, compute, access to data, automation, and productionization aspects that you maybe don't get in just the open source ecosystem.
So I got into that world, the startup world around data tooling. That company was founded in 2013 and ended up getting acquired in 2016 by Cloudera, the major provider of the Hadoop-based big data platform, and I spent 3 really fun years there. The product they acquired became their Data Science Workbench product. And at Cloudera I got to see really the largest companies, organizations, and governments in the world use data at tremendous scale and across a tremendous breadth of use cases, from telecom to health care to security and government. That was my real education in terms of the breadth and the scale of data management
[00:05:05] Unknown:
writ large. And so that brings us to where you are now with the Continual product. I'm wondering if you can give a bit of an overview of what it is that you're building there, the problem that you're trying to solve, and how you ended up settling on that as the particular area where you wanted to spend your time and energy. I know there are so many things to get excited about, so that last part of your question is always very apt. So at Continual,
[00:05:28] Unknown:
we are building an operational AI platform that integrates natively with cloud data warehouses. The goal is really to enable any data or analytics team or professional to build continually improving predictive models about any aspect of their business, from customer churn to inventory forecasts, without any complicated engineering or infrastructure. The secret sauce to the product, and the idea behind it, is really our declarative, or data-first, approach to AI. Traditional solutions, including ones that I built previously, are typically centered around building pipelines: centered on the model and the model architecture, and on the infrastructure management tasks you need to do to bring machine learning models to production. We basically think that AI and ML, particularly production AI and ML, only needs to be focused on the data, and the process can be made highly declarative. Declarative, you can think of it like the way SQL makes analytics declarative, or the way Terraform makes cloud infrastructure management declarative, and the simplification that comes from that.
We think machine learning, predictive modeling, predictive analytics can be made similarly declarative, where the end user, the analytics professional or the data professional, can really just focus on: what are you trying to predict? What is the target of your prediction, what is your definition of churn? That's your output. And then on the other hand, what are the inputs, the predictive signals that might be predictive of that output? So, how active has the customer been in the last 7 days? That's a feature. We provide a workflow that is centered on those two tasks: building up your features, your feature store if you will, that can feed into your downstream models, and building up the model targets. Plus a little bit of policy: how often do you want these models to be refreshed? How often do you need these predictions to be updated?
But that whole process can be made declarative. That's really what we're trying to do at Continual. The problem we're trying to solve is the sheer complexity around operational AI and machine learning operations. This really comes out of my experience over the last 20 years, but especially my time at Cloudera, where I was in the CTO for machine learning role, which gave me access to a lot of analytics and data science leaders at major companies. What I found was, first of all, they had all actually bought into the AI and ML hype, if you will, or rather the actual transformative potential of machine learning and AI for their business. They all come in with dozens of potential use cases spanning sales, marketing, operations and logistics, product and engineering.
They're all incredibly excited about that. But this was 2 or 3 years ago, and everybody at that point had moved away from just saying, hey, we need to hire data scientists and build up a basic ML capability and infrastructure, which was maybe the story 5 years ago. They all were saying: we want to impact the business, so we've got to productionize and operationalize ML. And that's not a unique insight of mine. The whole rise of machine learning operations, or MLOps, as a category is really people saying, we need to get to production, we need to put models into production to have an impact. The problem that I saw at Cloudera, and that I honestly think is a problem with the whole industry if I can offer a diagnosis, is that the solutions are incredibly complicated, with 7 different distributed systems involved, these incredible pipeline jungles where you're trying to bring sanity to all these different systems.
Very, very code centric, very imperative, and our belief is that's really not necessary. At the core, ML and predictive models are really just transformations of inputs to outputs. There are input signals, which you can think of as features, and there are outputs, which are your predictions or your ground truth targets. And the operational aspect is often very simple: how often should things be refreshed? How often should these models be updated? What's your promotion policy, is it manual or automated? What's the test set you're going to base that decision on? That really can be done declaratively, and when you do, you end up with a radical simplification of operational ML. The other bit of Continual is really our focus on the data warehouse as the center, and we just think that's the best place to make this vision a reality. If you believe in data-first, declarative approaches to AI, well, where do you plug in? The data warehouse is one of the ingredients that makes this vision possible.
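To make the declarative idea concrete, here is a minimal sketch of what such a spec might look like. This is purely illustrative: the field names and structure are invented for the example and are not Continual's actual configuration format.

```python
# A hypothetical declarative model spec in the spirit described above:
# an output (the target), inputs (the features), and a little bit of
# policy. All names here are invented for illustration.
churn_model = {
    "entity": "customer",                  # the business entity being predicted
    "target": "churned_within_30_days",    # the output: your definition of churn
    "features": [
        "days_active_last_7",              # input signals believed to be predictive
        "support_tickets_last_30",
        "contract_days_remaining",
    ],
    "policy": {
        "retrain": "weekly",               # how often the model is refreshed
        "predict": "daily",                # how often predictions are updated
        "promotion": "automatic",          # promote a retrained model if it wins on the test set
    },
}
```

The point of a spec like this is that everything operational, training, batch prediction, and monitoring alike, can be derived from it, much as Terraform derives infrastructure actions from configuration.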
[00:10:08] Unknown:
And as far as the distinction between operationalizing the AI and machine learning process and making it declarative, versus some of the AutoML capabilities that different companies are working on, DataRobot being the biggest name that comes to mind, how would you distinguish the use cases and capabilities of automating the model generation process in an AutoML context from
[00:10:38] Unknown:
automating the machine learning life cycle process and what you're doing at Continual? I was originally always a great skeptic of AutoML in my earlier life. But I do think that in terms of the model training process, in terms of finding a highly performant predictive model, there's more you can do nowadays with automated approaches to machine learning and with next generation approaches like using pretrained capabilities, pretrained models, and transfer learning. That is a very exciting area. But if you look at what the core of DataRobot's product is, it's basically: upload a CSV file, train a bunch of models. And that's pretty exciting. It's amazing; you can actually do better than most data scientists can, and certainly you can get a very good performing model in a very short amount of time. Now, what DataRobot has realized, to their credit, maybe later in the game, and what we realized, is that that's actually not solving any critical problem that's going to impact the business. Every AI or ML use case that has a real business impact, or not every, but the vast, vast majority, has a continual life cycle to it. Data is continually flowing in. Even if your model is static, the data that's coming in, the signals, what you know about your customers, is continually being updated. And even if the model is static, you don't necessarily need to refresh it every night, but you almost certainly need to refresh it every month or every 6 months. You probably also need to refresh it because you're getting new data sources, new potential signals, so you need to incorporate those and improve the model that way. In order to productionize or operationalize ML, which is required to have a business impact, you really need to think about the end to end life cycle and automating the end to end life cycle. And then it's not really automated ML anymore. That's why we think of it as declarative AI, or declarative operational AI. Yes, there's maybe a potential for AutoML at that little core bit where you train the model, and there's potential to leverage AutoML and similar technologies there. But really it's about the workflow around the entire life cycle, from training to inference to monitoring to iteration, from development to production. How do you experiment with a new model or a new feature and see its impact, then put it into production, then automate its life cycle? If you build with that workflow and end goal in mind, you just end up with a much different product.
You can simplify the whole experience, versus, I think, if you just start with the core AutoML piece and say: let's put a container deployment platform on there, let's put a data processing platform on there. That doesn't allow you to think from first principles and realize, wait, this whole thing can be made essentially a data management task, where you're managing features and you're managing targets, your ground truth and your predictions, plus a little bit of declarative policy around how to orchestrate things, like how often things should be refreshed. You end up with a very different product as a result. In terms of the
[00:13:42] Unknown:
definition of operational AI, you've mentioned that a couple of times, and there are a couple of different ways it seems like it could be taken: either the operational aspects of putting machine learning into production, or the application of AI to operational concerns of the business. I'm wondering if you can discuss how you think about operational AI in the context of the technology and the business, and some of the ways that might differ from how machine learning or AI might be used in a data science team or in more of a research or exploratory context?
[00:14:20] Unknown:
Yeah. The term definitely can be interpreted in those 2 different ways, for better or worse. When I think about operational AI, I am in many ways thinking about it holistically, in both of those ways. In my view, operational AI has to be in production, so it definitely is not research and development. It has to be driving some business impact: it has to actually be useful to some business process, whether you're automating a business process, building a better experience, or streamlining some operational aspect of your business. And then, finally, my view is that it's got a continual life cycle. That last bit, I would say, is the real overlooked differentiator. For instance, if you look at the machine learning operations platforms and solutions out there, and the excitement around machine learning operations, this is changing a little bit recently, but traditionally they were very focused on building model deployment tools: here's a way to host a model and serve it as a REST endpoint or something like that. That was gen 1. Gen 2, maybe, is: let's put better monitoring and explainability on it. Okay, machine learning operations means we deploy a model in a container and we monitor it. I think what more and more people are realizing now is that pretty much every model that's operational and in production has a continual life cycle, and you really have to think about that continual life cycle. If you don't, you're missing some of the hardest, most challenging, and most impactful aspects, and you're going to impose tremendous costs on your team, both from an infrastructure and from a time perspective.
That's how I think of operational ML. As for whether it's targeting your product and engineering, like personalization within your product, or targeting your finance and marketing and sales, I think operational AI and ML can be either. We at Continual are focused on use cases that are best driven off of your data warehouse, which very commonly are use cases that have this business operations aspect to them. So there's a nice little double meaning there for us in what operational ML means. And then as far as the challenges and barriers that organizations run into as they're trying to adopt predictive analytics and AI capabilities in their business,
[00:16:36] Unknown:
what are some of the issues they typically run into, and some of the ways the declarative approach you're building with
[00:16:43] Unknown:
Continual is able to overcome those barriers and challenges for them? The first problem we see when we talk to customers is data: honestly, do you have the data to actually leverage ML? For some customers, it's clearly the case that they have the data. If you're a large retailer, you have a tremendous amount of data on your sales and your inventory levels, at a product level and at a store level, and you unquestionably need to leverage machine learning to do forecasting. You're probably in a low margin business where a lot of the margin comes from your operational efficiency: making sure you don't waste huge amounts of food, and you don't run out of critical items where you lose revenue opportunities. That's something you obviously need to solve. There are other types of companies; we deal with a lot of customers who are interested in the applications of ML to understanding their customers.
And you need a certain amount of customer interactions to be able to do that. If you're a SaaS startup in the B2B space, selling to other businesses, often you maybe have a couple hundred leads or a couple hundred customers. You might not be able to fit a great churn model with a couple hundred customers. On the other hand, you might be a B2C case where you have thousands, tens of thousands, hundreds of thousands, millions of customers. So the first challenge to leveraging ML at all is you've got to have the data: both the feature data, the signals that are predictive of whatever you're trying to predict, and the ground truth, some sort of historical evidence of what you're trying to predict. Conditional on having that, which is what you absolutely need, you then run into, I would say, 3 main challenges. The first is infrastructure. Do you have the actual infrastructure to do ML? Traditionally that means: do you have Kubernetes and containers and notebook infrastructure, ways to run Python, maybe access to GPUs if you need GPUs, secure access to your data, a way to automate that whole process, write results back, and monitor it, all the infrastructure behind it. Some of that you may have from your data infrastructure, but some of it might be unique.
The second thing you need is people. Certainly, if you're going down the traditional path to ML, that's a big challenge, because you often need cloud engineers and infrastructure engineers to get you going. You need data engineers to get your data in the right place and set up the right data pipelines. And then you need data scientists or machine learning engineers, people who actually understand the science: how do you do forecasting when you have covariates, irregular time intervals, missing data, and all of these real world challenges? You need people with that scientific understanding, and you need to form a whole team that works well together. So there's this people challenge. And then the final one, which I actually hear about a tremendous amount, is just time and ROI.
If you can pull this off, you might be able to get the infrastructure in place and recruit the team with all of these characteristics. But if you've done that, you'd better hope the ROI of your use cases is high enough relative to the cost of maintaining them. Typically, when you talk to the business stakeholders within a company, they actually have dozens of use cases, and a lot of times the ROI of any particular use case is a little bit uncertain. You want to take a portfolio approach where you try a whole bunch of things: some pay off, and some don't work because maybe you can't predict as well as you want. But if it takes you a ton of time to implement every use case, let alone maintain it so you can see the impact over a year or so, you end up just not having the time to do all the use cases and see the impact that would then justify the people and the infrastructure. So those are the 4 problems: data, infrastructure, people, and time. The last 3 are really the ones Continual is focused on. We want to eliminate your need for infrastructure by being a fully managed cloud product that connects directly to your existing data infrastructure, namely your data warehouse, which is where everybody currently is, or soon will be, storing and bringing together the vast majority of their data.
Then we try to radically broaden the number of people that can do machine learning, to solve the people problem, by enabling any data or analytics professional to do predictive analytics as easily as they can do historical analytics. And finally, we radically reduce the time to initially build a use case, dramatically drive down the time to iterate on a use case (there's always an aspect of, hey, there are more signals we can add here, or I'm slightly changing my definition of what I'm trying to predict), and, what people often forget, the time and cost of maintenance. We want maintenance to be essentially free once you've implemented a use case, which hopefully doesn't take too long. With traditional solutions, you'll do an okay job maintaining things while you have the whole team there who set them up. But with these highly bespoke approaches, what I've seen is that it's very exciting work for a data scientist or a machine learning engineer to set these use cases up, but then they leave the company in 2 or 3 years, somebody else comes in, and almost always it ends up being perceived as incredible tech debt. You have these bespoke pipeline jungles, technology from a particular era or that one engineer was familiar with and the next is not. You end up with systems that are very difficult to maintain, which becomes very costly. I've even talked to customers who are decommissioning use cases because they're just too costly to maintain that way. And as far as the
[00:22:31] Unknown:
target users for the Continual product, who are the members of the business that are going to be interacting with the platform, defining the models, and writing the declarations of what they're trying to get out of it? Who are the various stakeholders you anticipate interacting with the platform as a whole, and what are some of the ways that has driven the product focus and feature priorities?
[00:22:58] Unknown:
Yeah. So we think of our target user as really any data professional. You could think of that as a data engineer or an analytics engineer. You can also think of it as a data scientist, a pragmatic data scientist. We're not as great a fit for the machine learning engineer who's really pushing the absolute frontier of the architecture of some TensorFlow model. We're not trying to build an R&D platform where they're managing experiments with platforms like Weights & Biases; we think those tools are great. We're really trying to say: you're a mainstream business. I like just walking around the world, looking at companies and thinking, okay, here's Gap: how should Gap do their machine learning? There's the FedEx truck: how should FedEx do it? A plane is in the sky, that's JetBlue: how should JetBlue do their arrival time forecasts, their fuel forecasts, their call center staffing for customer support? If I think about all those use cases and then hang out with my ML engineering friends, often the answer is: well, they should set up Kubeflow and MLflow and containers. I don't know, that doesn't seem like the right solution for these mainstream businesses if you want to make AI and ML pervasive. And if you want to make AI and ML pervasive, you've got to democratize AI at least to any data professional. Anybody who's working with Snowflake or Redshift or BigQuery or Databricks Delta or something like that should be empowered to very quickly build these predictive models.
The end user persona we're going after is really anybody who knows data, understands their company's data, and understands how data and predictions might be able to impact the business. That's these data and analytics teams, or pragmatic data science teams. Of course, there's a business stakeholder who's often driving the use case; there's demand from the business, but they're more saying, hey, here's what I want. There's still an end user who needs to do it. I guess to go a little bit further on this question, maybe further than you asked, another way I think about this is the different kinds of tools that are out there. There's one set of tools, which is quite mature, and this was my previous life, targeting the machine learning engineer. These are the Kubernetes based machine learning platforms. They're great: that's AWS SageMaker giving you all the building blocks. If you're a machine learning engineer, you can put them all together and you get a lot of control. Those are the gen 1 ML platforms. They're increasingly adding operational capabilities, which is good, but they're still very low level, very code centric.
They don't have a high level of abstraction, but they give a lot of control. I think there's another class of tools out there, where people recognize, hey, we obviously need to democratize AI, and they go: let's build no code AI. No code AI where a business user is going to come in, click around in something, and somehow have some tremendous business impact as a result. I think there are some opportunities for that, but what I've seen within the data sphere is that while you do want to democratize things, if you're dealing with operational AI, or data, or analytics, somebody needs to be on the hook for it, somebody who takes it really seriously. If you want to do even business intelligence seriously, look at the tremendous success of a tool like dbt, the data build tool.
On one hand, it's democratized: it puts SQL at the center. But on the other hand, it actually embraces and brings some of the best practices of engineering culture to the SQL persona, the analytics persona; they're introducing this term, analytics engineer. What I mean by that is the ability to do version control of your models, the ability to do development and experimentation, then open a pull request and actually merge it in, to have governance over the way your models change. And I think declarative AI, not just Continual, but if this becomes a trend and a category of tools and workflows and experiences, is going to be a similar middle ground that says: look, we do believe we can dramatically democratize and simplify ML, including the operational aspects of it, but we are still going to maintain some engineering best practices. Things like the ability, if you want, to version control things. That means there should be a configuration driven approach: you can have a UI and do it in the UI, but you should also have some way to follow engineering best practices. Because if you're doing a really mission critical use case, if you're JetBlue forecasting your arrival times and your fuel, you'd better have governance over that whole process. It still can be easy. It can be like managing your cloud infrastructure with Terraform, where you don't have to have all these scripts bringing up and maintaining and monitoring all of your cloud instances. You can have a high level abstraction, but you still should have that middle ground.
That is definitely one of the unique aspects of what we're doing at Continual in the AI sphere, but I really think it's going to spread. I think it's inevitable that people are going to see the value of that kind of idea: simplicity, but still
[00:27:57] Unknown:
robustness. I hope it spreads. I definitely agree that the declarative aspect is a necessary next step, and that we're reaching a level of maturity in the industry where more people are recognizing the capabilities of machine learning and its potential to be transformative for their business, but putting in all of the engineering groundwork to get it into production and maintain it has been a barrier. Being able to just say, here's what I want, here's the data, and now I don't have to worry about what's actually happening under the covers, it's just going to do what I want it to do, is going to be a massive step forward for businesses in general. In terms of how that actually manifests, I'm wondering if you can talk through some of the ways the Continual platform is architected and implemented, and some of the interesting engineering challenges that have come about as you've gone through that process? So, I mean, the first part is really
[00:28:56] Unknown:
what is the declarative abstraction? That is a challenge to figure out. SQL did it for analytics, but there were competitors to SQL back in the day; now SQL has become the lingua franca of analytics. So there's a question: okay, I like the idea, I understand that ML and prediction is really inputs and outputs, it can be data first, and you've got some policies around retraining, but what does that abstraction layer look like? Does it look like SQL? Does it look like Terraform? Does it look like something else? The first part of Continual is really trying to define that declarative abstraction.
The way we've done that: we bet on the fact that data within most enterprises is relational and temporal. You have business entities like your customers, your leads, your products, your sales at a particular store. That data typically has a tabular structure to it, and it almost always has a temporal nature as well, or many use cases do. The temporality of data is critical for ML use cases: you have to make it a first class concept if you don't want data leakage problems. If you're trying to predict churn, asking whether a person is going to churn at the end of their contract or in the next 30 days, you need historical data where you can go back in time, this time machine characteristic, where a feature like how active they were on your platform, your tool, your website, or how much they bought, a predictive signal of churning, is managed in a way that respects the time dimension, so you can use it to predict into the future. So the first thing is getting that abstraction layer right. We have this temporal relational model that is very well suited to layering on top of a data warehouse. The second thing is: what do you connect to?
This is actually a lesson we learned at Continual: if you go too broad, if you say, well, you should be able to connect to a Kafka topic and do streaming processing on Kafka, and maybe you should be able to hit APIs and get real time features from them (if you look at the feature store literature or ML platforms, there are a lot of connectors; you should be able to pull data out of Salesforce yourself, or whatever it is), you end up in a mess. One of the things we lucked out on, which coincided with the creation of Continual, or wanting to build a company like Continual, was the rise of the modern data stack. By that I mean the rise of the cloud data warehouse: Snowflake, Redshift, BigQuery, Azure Synapse, Databricks Delta Lake, and many more, but those are the leading ones. It's the rise of an architecture that puts the data warehouse at the center: you bring all your data in there from all your different touch points, and that's where you can process it, merge it, clean it, etcetera.
That provides a really nice place for Continual to connect to, and it helps inform the abstraction we build on top of it. When you register features, those features can be defined using SQL. When you define your targets, like what is churn, you can define that using SQL. Continual then becomes a data management layer, a semantic layer for ML, that sits on top of your data warehouse, and that drives the model maintenance and prediction maintenance. What does the implementation look like? Continual really has 3 layers. We have the data plane, which is responsible for translating our declarative configuration into the actions that need to be taken: refreshing models, refreshing predictions, profiling data, monitoring data, all the things you would typically do if you're really serious about operational ML. We have a control plane, which you can think of as the metadata and collaboration layer. That's where all the metadata is stored: the performance of models, the profiles of your data so you can detect data drift and things like that, plus the collaborative aspects, like collaboration and isolation within your projects, bringing people on, and visibility. And then, of course, we're a cloud application, so we have the front end, which is a standard SaaS application but has total parity with the CLI. If you're a more advanced user and want to drive things off of a CLI, with version control and even pull requests, you can do that. If you want to be in the UI, either to monitor from a read only perspective or to do some authoring, you can do that as well. So there are these 3 layers to how we translate that declarative vision into reality.
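To illustrate the SQL-defined semantic layer described here, the following is a hedged sketch of how a feature set and a target might each be expressed as plain SQL over warehouse tables. The table and column names are invented for the example, and this shows the idea rather than Continual's actual syntax.

```python
# Hypothetical SQL definitions for the semantic layer described above: a
# feature set and a target are each just a query over warehouse tables.
# Every table, column, and name here is invented for illustration.

# A feature set: features on the "customer" entity, keyed by ID + timestamp.
customer_features_sql = """
SELECT
    customer_id,                 -- primary key of the entity
    event_date AS ts,            -- when these feature values were true
    days_active_last_7,
    support_tickets_last_30
FROM analytics.customer_daily_rollup
"""

# A target: the ground truth label, also keyed by ID + timestamp.
churn_target_sql = """
SELECT
    customer_id,
    snapshot_date AS ts,
    churned_within_30_days       -- your definition of churn
FROM analytics.customer_churn_labels
"""
```

Because both sides share an entity key and a timestamp, a data plane like the one described can join them and keep models and predictions refreshed from the declarative configuration alone.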
[00:33:28] Unknown:
And you mentioned that when you were first iterating on the problem, you were considering being able to hook into things like Kafka to feed off of streaming data as it moves through those systems, and then decided that the data warehouse was the natural fit for the product you were building. I'm wondering what were some of the other lessons you learned in the process, or assumptions you had at the outset that were challenged or updated as you dug deeper into the problem and started working with customers, about the ways they were thinking about how to interact with the platform you were building?
[00:34:02] Unknown:
Unquestionably, putting the data warehouse at the center is probably the biggest and most important strategic thing we've done, and it was a bit of an evolution over the first 6 to 12 months of Continual. We always knew we wanted to combine the ideas of data-centric AI and ML, the idea of a feature store, with the idea of automation: continual automation of predictive models, declarative AI. We were very inspired by some of the research around declarative approaches to AI, like the open source project Ludwig, and papers that have come out of the research world like Overton, a declarative AI platform that Apple has talked about publicly. We thought there was tremendous power there; it's actually amazing how much you can do with this declarative approach. So we coupled it with the idea of data being central, and operationally that's how we could build something unique: a unique take on ML infrastructure.
But the big question when we talked to customers and demoed even the early versions of the platform was: how exactly do I fit in? Our whole goal is simplicity, and by putting the data warehouse at the core, all these simplifying assumptions start to unlock for us and for the customer. All of a sudden it's: hey, I know how to get data into the platform, I already have Fivetran pulling data into my data warehouse. I know how to get my data out of the platform, I have my BI tool already hooked up, my Mode or MicroStrategy or Tableau or Looker or whatever it is. And increasingly, a recent trend that we're seeing, which is incredibly exciting, is the reverse ETL movement, the Hightouches and Censuses of the world, which let you pull data out of your data warehouse and put it back into your applications where it can be used and impact the business. The beautiful thing for us is that we write predictions back into the data warehouse. So if you have a churn forecast, yes, it should definitely show up in Salesforce. It should show up in HubSpot. It should show up in Braze, a marketing automation tool, if you want to send custom personalized emails.
So all of a sudden the whole stack starts to play together; the whole ecosystem plays together. Really, the biggest lesson we've learned is that there are tremendous benefits to putting the data warehouse at the center in terms of simplification; that's the biggest one. Slightly smaller ones are: what is the right data model for feature management and data management? What is the right semantic, declarative interface? I wouldn't say it's one change we've made, but we've continued to balance simplicity versus flexibility. We have big debates on snowflake schemas versus star schemas versus galaxy schemas, what's the right mental model for things like that, and we've continued to refine the exact way we express a declarative setup for machine learning.
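As a hedged illustration of the write-back pattern described above, here is what consuming predictions from the warehouse might look like. sqlite3 stands in for a real warehouse connector, and the table and column names are invented for the example.

```python
# Sketch of the write-back pattern: predictions land in an ordinary
# warehouse table, so a BI dashboard or a reverse ETL tool (Hightouch,
# Census) can read them with plain SQL. sqlite3 is a stand-in for a real
# warehouse connector; all names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_churn_predictions "
    "(customer_id INTEGER, predicted_at TEXT, churn_probability REAL)"
)
conn.execute(
    "INSERT INTO customer_churn_predictions VALUES (42, '2021-09-01', 0.87)"
)

# What a reverse ETL sync might select to push high-risk accounts into a CRM:
rows = conn.execute(
    "SELECT customer_id, churn_probability "
    "FROM customer_churn_predictions "
    "WHERE churn_probability > 0.8"
).fetchall()
print(rows)  # [(42, 0.87)]
```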
[00:36:48] Unknown:
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data in motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end to end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you'll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand today to sign up for a free 30 day trial and to take control of your data quality.
For somebody who is using Continual, they've got their data in a data warehouse and they want to onboard onto your platform and start taking advantage of the predictive analytics capabilities, understanding forecasts based on the data they have. What is the actual workflow for building a model, putting it into production, and seeing the predictions and outcomes of it, and what data modeling considerations go into the onboarding process?
[00:38:05] Unknown:
There are really 3 steps. Step 1 is you connect your data warehouse, just like you would connect your BI tool, or Fivetran, or an ETL tool. You give us some scoped down credentials that give us access to data, plus rights to a schema inside your data warehouse where we can manage your feature definitions, which we call feature sets (these are the features on your customers, for instance), and manage the predictions: both your historical predictions, an audit log of predictions, and the latest predictions. So we take over a schema inside your data warehouse and continually maintain it. Step 1 is just that connection, and it takes 3 minutes.
The next step is setting up the problem, and in our system there are really 2 tasks. One, you can focus on the features, the inputs that are going to go into your models. We call those features or feature sets. A feature set in our world is a collection of features on an entity, like a customer or a product. It has a primary key, like a customer ID, and it also has a timestamp: at what time did that feature exist? That gets at the time machine capability that feature stores have. We're a virtual feature store, a semantic layer on top of your data warehouse. So one thing you can do is organize your feature data and then use all of those features across all of your different models as you build them; you don't really need to think about features, you can just pull them in from your feature store automatically. The second thing you can do is build a predictive model. That process starts with the model dataset: what is the spine of the model? Essentially, what is the entity it's on, like your customer? What is the timestamp: when are you trying to make a prediction as of? And what is the ground truth label: what is your definition of churn? You might have some historical data on churn for your customers, and a whole bunch of unlabeled data you want to predict on, and you just need to set that up. If you're using dbt, it might honestly just be a table that you register with us: hey, this is my feature table for customer features, and this is my churn-over-time table for my customers. We then join them together in a way that is point in time correct. That's the term of art; it just means you get the features for every point in time, as needed, as in the sketch below.
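A simplified sketch of that point-in-time correct join, using pandas for illustration. The column names are invented; a platform like the one described would generate equivalent SQL against the warehouse rather than using pandas.

```python
import pandas as pd

# Point-in-time correct join: for each row of the model "spine"
# (entity + timestamp + label), attach the most recent feature values
# observed at or before that timestamp, so no future information leaks
# into training. All column names are invented for the example.
spine = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "ts": pd.to_datetime(["2021-03-01", "2021-06-01", "2021-06-01"]),
    "churned_in_30d": [0, 1, 0],       # ground truth label
}).sort_values("ts")

features = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "ts": pd.to_datetime(["2021-02-15", "2021-05-20", "2021-05-28"]),
    "days_active_last_7": [6, 1, 4],   # feature value as of that time
}).sort_values("ts")

training_set = pd.merge_asof(
    spine,
    features,
    on="ts",
    by="customer_id",
    direction="backward",  # only look backward in time: no leakage
)
print(training_set)
```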
So we generate all of those queries for you automatically. You set that process up once, and once you understand the system, it's a handful of steps to say: hey, here's a new target. Not churn in 30 days, but churn in a quarter, 90 days, or churn after the contract expires: multiple definitions you might want to use. You just define it, register it with our system, and submit it. The last step is a little bit of policy. By policy I mean: how often do you want the model refreshed? What exactly are you optimizing for, if you're a more advanced user: do you care about accuracy, or precision, or recall, or AUC, or something like that? That's where the data scientist says, oh, I want to control that, and another person says, just give me something reasonable.
Then you submit that into the system, and at that point the system is on autopilot. We do a first run where we train the model, profile the data, and run a batch prediction job that writes the predictions back into your data warehouse. From then on we maintain all of it: refreshing the model, refreshing the predictions, so the whole lifecycle becomes a living process, with a lot of visibility into every part of it. You get data profiles from training; during inference, the distribution of your predictions and whether they're drifting; which features are important; partial dependence plots; and so on. The beautiful thing about having the abstraction is that you can build all of this on top for free. Work that would typically take a data scientist weeks, you just get. The end result is that if your data is in reasonable shape, you can put a model into production, literally in production, in minutes. Even internally we sometimes get shocked: that's actually in production for this customer. We just did a proof of concept where somebody gave us access to their data, and Jordan, who leads our field engineering team, implemented a production model for them in about an hour and a half. It was: hey, check this out, let's jump on a call, let me onboard you so you can do this yourself.
Coming from my background, where I was writing Python scripts and Python pipelines, even just building a proof of concept in a notebook took me a week, let alone figuring out how to deploy it. The experience of putting a model into production in minutes is pretty cool. It's a nice feeling.

In terms of the model life cycle aspects, obviously you're managing the retraining and understanding when model drift is happening. But for teams that want to
[00:43:06] Unknown:
have more visibility into what the model is doing, or want to dig into its explainability, what options do they have for interacting with the model as it runs? How can they get useful metrics out of it, or poke at it to understand the reasoning behind a given prediction, for cases where they need auditability under GDPR, CCPA, or other regulatory regimes?
[00:43:36] Unknown:
Yeah, that's a huge topic, because there are so many aspects to it. Of course you want to know the performance of the model over time: the training-time performance on your test set, and then the performance on the true out-of-sample data that arrives later. For many of these applications, like churn, the ground truth comes in well after the fact, and in some ways that's the truest signal of performance. You also want to know things about the model itself: which features are important, which had the biggest impact and in which direction, whether the impact is linear or nonlinear, and whether influential or outlier data points are driving your results. For an individual prediction, you want the individual features that drove it, et cetera. We do a lot of this today, we can do a lot more, and we're always adding things and gathering customer feedback, so there's plenty left to do. What's amazing is how standard these needs are. People often think, wow, this is really bespoke to my model.
But what we found is that's really not true. For regression or classification, the vast majority of what you want is something you can automate away. You might need to ask for a little bit of information. Local, per-prediction feature importance, for instance, is time-consuming to compute, so you want to know whether somebody actually cares about it; that becomes a configuration option. Another one that comes up is performance on a particular subsample of your population, that is, slicing: not just overall model performance, but performance by state, or by gender, or whatever you care about. The nice thing about a declarative approach is that it gives you a structured way to think about these things: we want slice-based performance; we want feature importance and feature understanding, what is driving the prediction; so what is the set of diagnostics for each? This is one reason why just putting ML in SQL isn't really enough; you do need to think about all of those other things. We're doing a tremendous amount of this, and we're always adding more; we're still a relatively early-stage startup. The second part of your question, which we're also working on, is that high-level abstractions are fantastic, but it's often a hierarchy of abstractions.
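As a sketch of what slice-based evaluation amounts to, here is a small example using pandas and scikit-learn; the data and the slicing column are made up for illustration:

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

# Hypothetical evaluation frame: ground truth, model scores, and a
# slicing column (for example, state) pulled back out of the warehouse.
eval_df = pd.DataFrame({
    "y_true": [1, 0, 1, 0, 1, 0],
    "y_score": [0.9, 0.2, 0.6, 0.4, 0.8, 0.3],
    "state": ["CA", "CA", "CA", "NY", "NY", "NY"],
})

# Slice-based performance: compute the metric within each subgroup
# rather than only globally, to catch slices where the model lags.
by_slice = eval_df.groupby("state").apply(
    lambda g: roc_auc_score(g["y_true"], g["y_score"])
)
print(by_slice)
```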
So how do you have revealed complexity, where you keep the top very simple, but the data scientist who comes in has the ability to get deeper? We're trying to figure out the right extensibility, the hooks for somebody who wants to get in further, either to understand things or to tweak things. That's a challenge with any high-level abstraction. SQL faced it with UDFs; an analogous question for us is whether we should have custom loss functions, which came up in a conversation with a customer just yesterday. How should we best expose that? It's going to be a bit of an evolution. I think once you implement something and see how fast it is and how much visibility you get, for a lot of use cases you say: this is good enough. And for the remaining 5% of use cases, hopefully we, and platforms with this declarative nature, have the right hooks for you to get in deeper. Or there's no shame in pulling one model out, putting it into your Airflow DAG, and doing it over there. All of our feature data, all of our tables, are sitting in your data warehouse; you can access everything, and there's no lock-in. It's going to take you 20 times as long, unquestionably, but it will give you every bit of control you need.

And in terms of the declarative model that you're building for people to interact with Continual, I'm wondering if you can talk through the design and development process you went through to figure out what the useful abstractions are, what knobs people need to be able to tune,
[00:47:39] Unknown:
and the syntax for writing this declaratively in a way that is accessible to somebody who isn't an expert research data scientist but has general domain and data knowledge, without requiring a full programming language. What balance did you decide to strike, and what considerations did you build into the way you present these abstractions?
[00:48:07] Unknown:
Yes. Number one, unquestionably, is the fact that we embrace SQL and the data warehouse as the way you manipulate data. If you want to define a feature, you organize your data the way you always do, with SQL; that's the primary escape hatch, and it's a tool familiar to essentially every data professional. It's still a challenge, though: the aggregations you write to create features are real work, and we're doing some things to automate even some of that. Is there potentially an abstraction layer one level above SQL for very common patterns, like the rollup queries that come up constantly in feature engineering? There are things we're planning there. But fundamentally, the choice was that we did not have to invent another language. You're just writing SQL, and not our SQL: Snowflake SQL or whatever your underlying warehouse speaks. Or you're already using dbt, the data build tool, and we're just extending that, providing a little more structure on your dbt models and registering them with our system. That's where you could, I think, go down a crazy wrong rabbit hole if you tried to invent your own YAML-based feature engineering something-or-other. So that's one big design choice.
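As one example of the rollup queries mentioned above, here is a sketch of a trailing-window feature computed in pandas. The names are illustrative; in a warehouse this would typically be window-function SQL that tooling could generate:

```python
import pandas as pd

# Hypothetical raw event stream: one row per customer action.
events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "ts": pd.to_datetime([
        "2021-01-03", "2021-01-20", "2021-02-02", "2021-01-10", "2021-02-05",
    ]),
    "amount": [20.0, 35.0, 10.0, 50.0, 5.0],
})

# A typical rollup feature: trailing 30-day order count and spend per
# customer, evaluated at each event timestamp.
events = events.sort_values("ts").set_index("ts")
rollup = (
    events.groupby("customer_id")["amount"]
    .rolling("30D")
    .agg(["count", "sum"])
    .rename(columns={"count": "orders_30d", "sum": "spend_30d"})
)
print(rollup)
```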
The other big ones, as I mentioned earlier, are the core abstractions: you have entities, such as your customers, and you have feature sets, which are collections of features. You could think of them as feature tables or feature views. These things have primary keys and timestamps. Honestly, there's a lot of convergence happening within the feature store space, not total convergence, but broad agreement on this mental model for organizing features. It's all pretty straightforward once you've interacted with one of these systems and read the docs once; you get the picture. We're trying to stay very close to the way people already understand data warehouses and data in general, so we're not inventing too many new terms.
The remaining challenge with respect to ML is really framing the problem. When you're talking to analytics and data professionals who aren't living and breathing ML, somebody still needs to translate the business task they're trying to do, like churn, into the right formulation. Churn is literally the canonical example people reach for when they think about basic ML, and a lot of times they have a flat dataset: a bunch of customers, with churn as a column. That's the Kaggle version of churn. But that's not how the real world operates. Where's time in that? Everybody is going to churn at some point; eventually they die, and they churn. So you really need to frame it as when are you going to churn, and then you have to make some decisions. Am I predicting churn within the next 30 days? That's a reasonable choice: take your features, the signals you know about the customer at the start of the 30 days, and look at whether they churned over the following 30 days. That's your historical data; train a model on it and forecast going forward.
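A minimal sketch of that framing, assuming a simple subscriptions table with a churn date; all names here are hypothetical:

```python
import pandas as pd

# Hypothetical subscription data: when each customer was observed and
# when (if ever) they churned.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "as_of": pd.to_datetime(["2021-03-01", "2021-03-01", "2021-03-01"]),
    "churn_date": pd.to_datetime(["2021-03-15", pd.NaT, "2021-06-01"]),
})

# Frame churn as a question about time, not a static column: did this
# customer churn within the 30 days after the prediction timestamp?
horizon = pd.Timedelta(days=30)
customers["churned_in_30d"] = (
    customers["churn_date"].notna()
    & (customers["churn_date"] > customers["as_of"])
    & (customers["churn_date"] <= customers["as_of"] + horizon)
)
print(customers)
```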
But a data scientist will rightfully point out that you have customers on different contract durations: one-year contracts, three-year contracts, month-to-month contracts. Then you have to make a decision, which is partly a business decision. Do you still model monthly churn and add a feature saying the contract is yearly, or expiring in x months? A contract expiring next month is obviously going to be highly predictive of churn, and not being in your last month basically guarantees you don't churn; you just need to set that up. Or do you instead predict the probability of churning at the end of the contract duration? That's the prediction problem you have to choose. Our view is that data scientists and people who understand ML are critically important; we are absolutely not trying to replace them. But we think they should focus on two things. First, the business problem: what are you trying to do, how are you trying to have an impact, what does the business need in order to be effective? Second, the data: how do you define the churn outcome, and, honestly, the thing that really drives predictive performance, what are the signals? They're very bespoke. We were just talking to somebody, I won't say the exact use case, about predicting churn, where you need to understand your business and say: if a customer is this active on the platform, or takes these specific actions, that's a highly predictive signal. If they invite a peer onto the platform, maybe that's highly predictive, especially if you're more of a product-led-growth company. Data scientists, in our view, should be focused on those two tasks: conceptualizing and setting up the problem, and the data. The rest is not so important.
Even loss functions, typically, are not so important. And then as far as the
[00:53:03] Unknown:
usage of the Continual platform and this overall declarative approach to AI: you mentioned that you're sticking close to SQL as the interface for interacting with the platform and describing models. I'm wondering if you can take a brief minute to compare and contrast what you're building at Continual with the work happening in database engines to push machine learning into the engine itself, things like what's offered by Vertica or BigQuery,
[00:53:37] Unknown:
where you have a function defined in the database engine that you can call to create a prediction based on some set of columns. We love SQL at Continual, but I'm actually not a big fan of putting train and predict functions inside SQL. It's not that I'm against it; I just don't think it solves a very important problem. In my view, train-and-predict inside SQL, as a UDF or whatever it is, is an alternative interface to scikit-learn's fit: the vanilla one-liner regression or classification interface you'd use in Python, fit to train a model and predict to produce predictions. BigQuery ML and its equivalents are essentially different syntaxes for that. Granted, you don't have to move the data, you don't have to read it out and write it back, so you do get a little bit of benefit. What we think at Continual is that every real-world use case is much, much more involved than that, mainly because of the continual operational life cycle you need to manage. You need to think about all your features, manage them, collaborate around them.
You need to manage your models and have the features feed into them. You need to retrain your models, look at their performance over time, and deeply understand them. You have data drift and prediction drift; you need after-the-fact performance, test and validation performance, and feature importance. The challenge you're trying to solve is not: here's a different calling syntax for train and predict. It's: what is the workflow to operationalize ML?
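For concreteness, this is roughly the scikit-learn "one-liner" being contrasted here; the point of the answer is that everything listed above lives outside these few lines (the dataset is synthetic, purely for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# The "one-liner": training and predicting is the easy part, whether it
# is scikit-learn in Python or a train/predict function inside SQL.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)        # train
predictions = model.predict_proba(X)[:, 1]    # predict scores
# Feature management, retraining, drift monitoring, and auditability
# all sit outside these lines, and that is the harder problem.
```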
And I think the in-database ML folks aren't really taking that on. It's genuinely cool that you can put a train and predict function inside the data warehouse, but it doesn't solve the problem. If you're a major retailer working on forecasting use cases, the way I'd think about it is: I have my data layer, I have my BI layer, and I really want a prediction layer, one that takes the problem seriously. Give me an environment where I can run experiments and add features, monitor performance over time, share features about my stores and products across different use cases, monitor the drift between training and inference so I get alerted, and be alerted when things fail or aren't updated.
And that's a much bigger task than, say, BigQuery ML. So we are a fan of SQL as the language to manipulate data, but we don't think SQL is the end; it's just the beginning. We use SQL as the data language, for manipulating and organizing data. We're not using SQL as the ML language, and particularly not as the ML workflow and experience that we think you need for operational AI. I hope that makes it clear. Yeah, definitely.
[00:56:34] Unknown:
And in terms of the ways you've seen the Continual product used, and the overall concept of declarative machine learning and automating the operational aspects, what are some of the most interesting, innovative, or unexpected ways you've seen it used? We were just chatting about this in our team meeting yesterday, actually.
[00:56:55] Unknown:
The most surprising, and in some ways most encouraging, aspect is honestly not any one use case. It's the breadth of the use cases, the breadth of the types of companies, and the geographic reach of the companies we interact with. To give you real examples: there's a massive makeup company, the L'Oreal of South America. That's not how people picture ML, but they're a huge business with all sorts of ML use cases around their customers, sales, stores, and sellers. In the morning we'll talk to a company like that. A couple of hours later we'll talk to a genomics company doing genomic transcription, which needs data-quality predictions to flag certain tests to be run again, because the tests aren't 100% reliable and they need some intelligence around that. The next day we'll talk to a freight forwarder, a major logistics company with trucks running everywhere: what are the arrival times, how much demand will there be, how many drivers do they need during different peak seasons, and in which geographies? Then we'll talk to a major car company, one of the top 10, looking at everything related to car inventory, sales, and service revenue, all the use cases around selling, maintaining, and servicing cars.
In the financial space, we'll talk to a buy-now-pay-later platform doing more traditional financial scorecard work. We'll talk to a SaaS company building educational products that wants to do personalization, both within the product itself and in the emails people receive: hey, you might be interested in these courses. And of course there's abuse and fraud prevention across many industries: financial transaction fraud, bad actors inside an application or platform, fraudulent accounts, fraudulent reviews, abusive language in the gaming sector. We're a small startup, but those are all real conversations I've had over the last month, and that's just the industries. The other aspect is geography: Europe, South America, Australia, the US, from startups to mega-corps. And we launched only two months ago, without even a press announcement; we're gearing up for additional things. So the most surprising part is the sheer breadth. Investors will often say: focus on one industry, focus on one use case. It's very tempting, because a vertical focus makes your value proposition much clearer to describe. But then I think: who am I actually talking to? All these different people, all these different use cases. It's remarkable how broadly data is impacting the world. ML
[00:59:52] Unknown:
is part of that; any company with data is now starting to think about these predictive use cases. In your own experience of building the Continual product, creating the business around it, and talking to customers, what are some of the most interesting or challenging lessons that you've learned in the process?
[01:00:08] Unknown:
The most challenging bit is probably hype versus reality. That applies on the customer side, their expectations versus their own reality, which has nothing to do with us as a company, and equally on our side. Obviously there's tremendous excitement around AI and ML, and it's hard for customers to see through the hype. In this conversation I've been talking a lot about declarative AI and trying to explain what I think it is. It's a somewhat new concept, with analogies to familiar things, but it's a see-it-to-believe-it situation, and there can be real vendor fatigue around AI and ML. A big challenge is balancing this: I do think AI and ML will be among the most transformational technologies of the next decade, certainly among the most exciting, which is why I work in this space. But you have to balance that with the reality of actually delivering something of value. And then this gets down to personas: the ML engineers who ask, can you actually automate all this or can you not? There are very valid critiques there. Navigating all of that, trying to build something that delivers value while staying excited about the future, has probably been the most challenging part. It's more of a sociological observation, I would say.

And so for people who are excited about the opportunity to apply machine learning and AI to their business data, what are some of the cases where Continual is the wrong choice, and they're better suited by building out the stack themselves or using an AutoML platform?

That's a great question. I read a blog post once suggesting that every startup should pitch what they are not, and it's actually very illuminating. In terms of use cases where we're not currently a good solution: we're not focused on low-latency, real-time use cases. If you're making real-time fraud decisions on credit card transactions, we're currently not the right solution. If you're doing real-time personalization or search re-ranking, we're not currently the best solution. If you're doing on-device or edge-based inference, training image models that deploy onto phones to do text recognition, like the license validation you see when signing up for a scooter service, we're not the platform for that. There are really exciting, transformational use cases around factory automation, autonomous driving, genomics, and protein folding; we're not the platform for those either. They're all great, exciting use cases for ML, and some of them I think we will eventually be able to handle.
I think declarative approaches to AI will solve them. For real time, for instance, there's really nothing preventing us from extending in that direction; we've actually built some capability around real-time use cases, but when we get there, we want to do it right. For now, we think there's a tremendous opportunity around these continually maintained, in-database use cases. Many, many use cases simply don't want an API. You want to ask, which customers are most likely to churn, not, is this one customer going to churn, via an API call. Those use cases belong in the data warehouse. Our sweet spot is: your data is in your data warehouse, you want predictions in your data warehouse, and you want those predictions continually maintained. That's bread and butter for us. Some of the rest is on the horizon; others, like autonomous driving and robotics, are better served by a different platform.

And you've touched on some potential plans for the business, but what are some of the things you have planned for the near to medium term of what you're building at Continual?
The most immediate thing we're working on right now, and we're very excited about it, is the development-to-production life cycle and workflow for end users. It may seem esoteric, but I think it's completely critical for developer tools and developer experience. You need a principled way to think about development and staging: how to keep them isolated, how to see the impact of your changes before you deploy them to production, and then how to actually get them into production. You can't just be a deployment platform without also thinking about the workflow from exploration to production.
So we're introducing isolated environments, very similar to the way dbt works, which give you the ability to keep everything isolated, compare between environments, and see the impact of your changes, all while maintaining a declarative workflow. It's a very high-velocity, high-productivity workflow that still gives you the core control and the confidence to say: I'm not going to step on somebody else's toes, and I'm certainly not going to step on production. It's a little bit in the weeds, but it's one of those foundational bits of semantics and workflow that so many platforms fail to address. They build really slick UIs, but they fail to provide the core developer experience and workflow that can actually support production.
And because production and operationalization are everything at Continual, in the end everyone is trying to get to production, that's super critical, and we're really excited about it. The next thing, which isn't a technical element: we're currently in private early access. You can sign up, and if you have a great use case we'll absolutely let you onto the platform; we'll get back to you within a day. But we're gearing up for much broader public availability: talking much more about what we're doing, more open documentation, tours of the product, and a lot more. If you go to continual.ai right now, you'll see a lot about the product, but not everything. We're gearing up for that in the next two months or so, which is tremendously exciting after having spent a year and a half to two years developing this product, and I'm very excited by how it's coming together. There are other things we're working on that are more related to the ML aspects; I've hinted at some of them in my previous comments, and we'll wait to announce them until they're ready. But I do believe it's amazing how far you can push declarative approaches to AI into broader domains: multimodal domains, images, text, real time. There's really nothing stopping us from going after that. Right now, though, we want to absolutely nail the data-warehouse-centric use cases with traditional tabular and temporal data, a robust workflow that people really buy into, and tight connectivity to existing tools like dbt, where we really want a close integration. So we're focused on that, gearing up for broader availability, announcing some funding news that we haven't announced yet, and steps like that.

Are there any other aspects of the work you're doing at Continual, or the space of declarative AI, that we didn't discuss yet that you'd like to cover before we close out the show?

No, I think that's it. This has been great. I'd definitely encourage people to check out what we're doing at Continual. If you're interested in the idea of declarative AI in general, maybe less the operational aspects, look at tools like Overton: there are research papers published by Apple about how they use declarative approaches to AI to radically accelerate development of the natural language processing models that power Siri. It really shows the potential. Similarly, Ludwig is a quite exciting project out of Uber, and the Snorkel project out of Stanford is another example with these declarative aspects. There's an exciting trend here, and over the next couple of years I think we're going to see declarative AI enter the mainstream. If we couple that with making it operational, it has the power to dramatically impact the way we do production machine learning. I'm very excited by that. Yeah, definitely
[01:07:58] Unknown:
excited to see some of the capabilities that will unlock, because there has been a lot of hype about the potential of AI but not a lot of realization of it, because of the complexities you mentioned. So for anybody who wants to get in touch with you and follow along with the work you're doing to drive that forward, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as the biggest gap in the tooling or technology available for data management today. What an interesting question. I mean, god, there's so much. I love the data ecosystem.
[01:08:29] Unknown:
Staying away from the AI and ML space, which is obviously what I think about most, I do think there's still a tremendous opportunity around real-time data infrastructure: the actual streaming and storage of data, including things like replay. There are lots of frictions in this space, from performance to scale to cost to the ability to store massive volumes of data. And then, probably more importantly, the workflow around processing that data. How do we actually operate on real-time streaming data? There are streaming approaches to SQL: tools like Materialize and Flink SQL are giving hints of what's possible, Spark Streaming obviously, and I know Snowflake is putting a huge effort into their streaming SQL story and building a big team there. There's also programmatic access; real-time data integration is increasingly how companies are going to think about stitching data together. I don't think Kafka is the final answer; it's not as if you adopt Kafka and the game is done. There's a tremendous opportunity to rethink the whole streaming landscape, ideally with a developer-centric mindset: how do we make the developer experience amazing? Development to production, isolation, replayability, incrementality, all the things that are actually critical to a delightful developer experience. That has come to the data warehouse with tools like dbt, and I think it's coming to ML and AI.
[01:10:01] Unknown:
I don't think it's really come to streaming yet, so I'm looking forward to the startups, the open source projects, and maybe one of the major clouds trying to solve that. Definitely excited about it. Yeah, it's definitely interesting to see how the overall streaming ecosystem has built up over the past 5 or 6 years, and the potential that still hasn't been realized. I agree there's a lot of interesting activity there, and I look forward to seeing how it plays out. So thank you again for taking the time today to join me and share the work you're doing at Continual. It's a very interesting platform with a lot of promise for helping businesses gain the advantages of machine learning and AI to improve their overall efficiency and capabilities, without necessarily having to invest months or years in building up the foundational infrastructure. I'm excited to see where you take the business, I appreciate all the time and effort you're putting into it, and I hope you enjoy the rest of your day. Thank you so much. It was a real pleasure being on.
Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used. And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction to Tristan Zajonc and Continual
Tristan's Journey in Data Management
Overview of Continual and Its Mission
Operational AI vs. AutoML
Defining Operational AI
Challenges in Adopting Predictive Analytics
Target Users and Product Focus
Architecture and Implementation of Continual
Lessons Learned and Customer Feedback
Onboarding and Workflow in Continual
Model Lifecycle and Explainability
Declarative Model Design
Comparing Continual with In-Database ML
Use Cases and Customer Stories
Lessons from Building Continual
When Continual is Not the Right Choice
Future Plans for Continual
Closing Thoughts on Declarative AI