Summary
Data engineers have typically left data labeling to data scientists and other roles because it is a manual, process-heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform that makes labeling a repeatable and scalable process by codifying domain expertise. In this episode founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation, and how that allows data engineers to be involved in the process.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
- Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
- The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
- Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
- Your host is Tobias Macey and today I’m interviewing Shayan Mohanty about Watchful, a data-centric platform for labeling your machine learning inputs
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Watchful is and the story behind it?
- What are your core goals at Watchful?
- What problem are you solving and who are the people most impacted by that problem?
- What is the role of the data engineer in the process of getting data labeled for machine learning projects?
- Data labeling is a large and competitive market. How do you characterize the different approaches offered by the various platforms and services?
- What are the main points of friction involved in getting data labeled?
- How do the types of data and its applications factor into how those challenges manifest?
- What does Watchful provide that allows it to address those obstacles?
- Can you describe how Watchful is implemented?
- What are some of the initial ideas/assumptions that you have had to re-evaluate?
- What are some of the ways that you have had to adjust the design of your user experience flows since you first started?
- What is the workflow for teams who are adopting Watchful?
- What are the types of collaboration that need to happen in the data labeling process?
- What are some of the elements of shared vocabulary that different stakeholders in the process need to establish to be successful?
- What are the most interesting, innovative, or unexpected ways that you have seen Watchful used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Watchful?
- When is Watchful the wrong choice?
- What do you have planned for the future of Watchful?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
- Watchful
- Entity Resolution
- Supervised Machine Learning
- BERT
- CLIP
- LabelBox
- Label Studio
- Snorkel AI
- RegEx == Regular Expression
- REPL == Read Evaluate Print Loop
- IDE == Integrated Development Environment
- Turing Completeness
- Clojure
- Rust
- Named Entity Recognition
- The Halting Problem
- NP Hard
- Lidar
- Shayan: Arguments Against Hand Labeling
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today, that's A-T-L-A-N, to learn more about how Atlan's active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork, and Unilever achieve extraordinary things with metadata.
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends at Linode. With their new managed database service, you can launch a production ready MySQL, Postgres or MongoDB cluster in minutes with automated backups, 40 gigabit connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show. Your host is Tobias Macey, and today I'm interviewing Shayan Mohanty about Watchful, a data centric platform for labeling your machine learning inputs. So, Shayan, can you start by introducing yourself? Yeah. For sure. Thanks so much, Tobias. My name is Shayan. I'm the CEO and cofounder of Watchful.
[00:01:42] Unknown:
Prior to this, I used to work at Facebook where I was a tech lead for the stream processing team that owned all of the streaming platforms across all Facebook infrastructure. And then after that, I led a few machine learning teams at Facebook where I actually ran into the problem of not having enough labeled data for the types of things that we wanna build. So that's sort of what inspired building Watchful.
[00:02:04] Unknown:
And do you remember how you first got started working in data management?
[00:02:08] Unknown:
Actually, in college, I used to work as a software engineer at a small start up dealing with, like, basically lending and that sort of thing. And we realized that the way we were going about doing it, it was very much a data problem. It was what signals could we possibly triage about this particular borrower that would indicate that they are likely to pay back a loan? And this is, you know, a classical underwriting process, but we were trying to do this for folks who wouldn't otherwise qualify for typical banking loans and that sort of thing. And we had to get really creative about how we access the data, what data we even got, how we thought about attribution, how we kind of like merged these things, how we did entity resolution. So that's sort of when I realized how deep the rabbit hole actually went, and I kind of meandered through the data world and distributed systems from there. The point of entity resolution is an interesting 1 too because on the 1 hand, it is a data engineering problem because you want to make sure that the data that you have is accurate and representative of the thing you're trying to represent. But on the other hand, it's also a machine learning problem because you have to lean on some of these machine learning techniques to be able to do the resolution on those entities. So it's always fun how there's such a fuzzy boundary between those worlds. I have a sneaking suspicion that this is gonna be a theme throughout the rest of this conversation where oftentimes, depending on the organization, data scientists will end up moving more and more into the data engineering work stream just by nature of the role and what they need to do. Other times, data engineers will move more and more into the data science work stream. There's sort of this, like, fuzzy boundary between the 2 roles and how complementary they are, and sometimes there's, like, overlapping requirements.
And this is just, like, a very interesting example of it. So, yeah, totally
[00:03:50] Unknown:
agree. And so in terms of the watchful product that you're building, can you give a bit of an overview about what you're aiming for there and some of the story behind how it got started and why this was the problem that you wanted to spend your time and energy on? So to put it plainly,
[00:04:04] Unknown:
at Facebook, I ran into this particular problem a couple times where we wanted to build a model to do a particular thing, but we lacked the labeled data to be able to do it. And just for, like, a very broad hand wavy overview, like, in general, when you talk about machine learning in the enterprise, you're typically talking about supervised machine learning, which for all intents and purposes is, like, models or algorithms that learn from examples. So you give it enough examples of how to do a thing, it will eventually learn how to do that thing. This is fine when the things you wanna do are like fairly simple or fairly low hanging. So an example is like, I might wanna be able to identify photos of dogs versus cats.
It's super easy for me to source a whole bunch of images of dogs and then source a whole bunch of images of cats or get an army of humans somewhere to label my data and be like, that's a dog, that's a cat, that's a dog, that's a cat. But the moment I wanna do something more specific, maybe there's something specific about my business. Like, I want these specific corporate documents to be labeled in a particular way, or it requires some deep amount of domain expertise, like, I wanna identify certain types of cancer in, like, MRI or CAT scans.
All of a sudden, you can't just find an army of humans to label your data. And that's actually the problem we ran into when I was at Facebook. It was like, we had all the money in the world and we could throw it at people and we could get an army of humans, no problem. In fact, we had several where we would just give them data and they would hand us back labeled data. But oftentimes for the problems that we were trying to solve, where we needed very specific types of labeled data, that army of humans approach was insufficient. So after working on a couple of these projects, I realized, okay, Facebook hasn't solved the labeling problem yet, and if Facebook hadn't, chances are no 1 else had, so I left to go build a company around it. And that's kind of our focus point. What we do is we largely automate the process of labeling data for machine learning. And kind of our thesis is that machine learning as a whole hasn't penetrated as far into the enterprise as you would think at this point. You know, it's been many, many years since AI has been like the new hotness, and things have gotten easier, certainly, you know, like, algorithms are easy to come by these days, you can kinda, like, pull models off the shelf and they'll work reasonably well, But now the hard part is no longer in the algorithmic implementation. It's actually in marrying that algorithm to your data. And that's actually where the disconnect is. When you think about like a large potentially legacy organization trying to think about, okay, how do we bring AI into the fold? Generally speaking, they wanna focus on the things that actually bring ROI, the things that are actually needle moving.
But usually those are not things like, I don't know, Twitter sentiment analysis and stuff like that. You know, the Hello World of the AI world. It's usually around augmenting their most expensive experts. That's really what moves the needle, you know, taking their already preexisting bottlenecks and just alleviating that bottleneck. But that's very quickly a catch 22 because typically, if you want to build a model that automates some portion of a domain experts workflow, you probably want that domain expert to be creating the data to then train that model. And that's like already kind of a nonstarter if that person is like already a huge bottleneck in the organization. So that's kind of the problem that we focus on, or at least that's like our wedge.
We find great success selling to organizations where they have specific problems that they know machine learning or AI would be a good solution to, but it's gated by domain experts or human time being spent on creating even like prototypes. So what we do is we just give them our software, and instead of requiring, like, 30 of those domain experts to spend, like, 2 months labeling data, you know, full time, instead, 1 or 2 of those domain experts could use Watchful. And in about 4 to 6 hours, they could output the same quantity and the same quality of data, but several orders of magnitude faster. That's sort of the idea there.
[00:08:02] Unknown:
Before we get too much more into the labeling specific aspects, another interesting question to dig into, which we already started to touch on, is the question of whose responsibility is it to make sure that that data is labeled and to manage that aspect of doing the data labeling for these machine learning use cases? Is it the responsibility of the data engineer? Does it belong to the domain experts? Is it the ML engineers or data scientists' responsibility? Like, who is typically the person that is the 1 managing this overall process of saying, okay, I need to collect this data, bring it into this environment where we can do the labeling, make sure that these labels are being generated, that the labels are accurate, and that they're being labeled in a way that my data scientists and machine learning engineers can actually build useful features off of them? Yeah. That's such a good question.
[00:08:51] Unknown:
And the unfortunate answer is it really depends on what your organization looks like today. I will say that historically, data engineers have not been tasked with actually owning the labeling process. If you think about, like, traditionally, data engineering workflows are, like, fundamentally ones of automation. Like, their role and responsibility is to provision and scale data access for, like, any organization or any team that requires it. And, like, oftentimes, this means some combination of, like, sourcing raw data from where it lives and then, like, transforming it based on some domain knowledge and then, like, loading it to a place that's accessible by those teams. So that's like standard data engineering kind of like 101 type stuff. And interestingly, when you look at kind of like the labeling process for machine learning, it looks very similar to what would historically be considered like a traditional data engineering pipeline. You have to source raw data from a place, you have to transform it. But in this case, the transformation process is manual.
The transformation process in this case is like, I need to farm this data out to my army of humans, or I need to farm this data out to my domain experts, or I need to farm this data out to someone else, or blah, blah, blah, blah. But that transformation process that would ordinarily be, like, a step in a pipeline is now gated by, like, humans looking at data and evaluating it. That's like today's world. That's just like the world we live in right now. So as a result, it's not really on data engineers to go through the labeling process. I would also say that, like, data scientists are potentially, like, involved in some capacity depending on the organization. But other organizations, data scientists are, like, the consumers of data entirely. Like, that's it. That's all they do. So they don't have any input on how the data is labeled or why it's labeled and so on. There might be specific annotation teams that are spun up just to label that data.
But to us, this fragmentation is actually a little dangerous. You know, 1 of the issues here is, like, kind of what we're hinting at here. There's no, like, clear articulation of who owns what. There's no, like, very clear definition of, okay, here's where the data engineering process starts and ends. Here's where, like, an annotation process starts and ends. Here's where the data science process starts and ends. It's all kind of, like, very wishy washy. The second kind of more egregious issue here is that the whole point of a pipeline is that it's repeatable.
It's that you have this thing that is defined, like, programmatically, and you can run it any number of times and get well defined output on the other side. And it's debuggable, like you can go in, you can interpret it, you can interpret like why it's doing certain things. And if you take that lens and apply it to labeling, you lose that process the moment you actually hit the transform step of that pipeline. All of a sudden, you don't have introspective ability. Like, why did this person label this as x instead of y? It's like a fundamentally hard problem unless you're actually, like, going in and kind of interrogating those people and figuring out like, okay, why did you label it as X or Y? And if something is wrong, if the way they're labeling is wrong, instead of in, like, the data engineering world going back and, like, editing your pipeline, so it's correcting the mistake and doing that in a programmatic way, Here, you have to coach your humans and be, like, here is the way I want you to label, and this could be varying degrees of difficult, depending on how systemic that issue is. Like, if it's some sort of bias that's being introduced due to someone's experiences, like, that's very, very difficult to, like, get them to extract from their brain and not consider.
So there's sort of like a fundamentally philosophically hard problem that's baked into that. So kind of like the way we see it is really we want to move towards a world where labeling is considered just any other data pipeline, where there is a programmatic, automatic, like interpretable way to go from raw data to labeled data, And if there are issues there, you should be able to go back and edit that sort of programmatic interface so that you get the right answers. And that should be well documented, well defined, like, if someone goes and leaves the organization, they shouldn't be taking that like tacit domain knowledge with them. It should be encapsulated in this like sort of programmatic functional way of achieving your labels. So that's sort of the way we think about it. And part of this is actually like, it's almost like an education for our users, like, we wanna teach them that, look, this process is fundamentally no different to like any other pipeline you currently run. So we want you to be thinking in terms of, like, that sort of vernacular and apply it to this type of problem. So yeah.
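To make the pipeline framing above concrete, here is a minimal sketch of labeling expressed as a versionable transform step. This is illustrative only, not Watchful's actual interface; the categories and patterns are hypothetical. The point is that a systematic labeling mistake gets fixed by editing the function and re-running the pipeline, rather than by re-coaching annotators.

```python
import re

def label(record: dict) -> dict:
    """Labeling as a pure, versionable pipeline step (hypothetical rules)."""
    text = record["text"].lower()
    if re.search(r"\b(refund|invoice|charge|charged)\b", text):
        record["label"] = "payment"
    elif re.search(r"\b(password|login|2fa)\b", text):
        record["label"] = "account"
    else:
        record["label"] = None  # unresolved: route to a human for review
    return record

# Extract -> transform (label) -> load, like any other pipeline run.
raw = [
    {"text": "I was charged twice, please refund me"},
    {"text": "My password reset email never arrives"},
]
labeled = [label(r) for r in raw]
print(labeled)
```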
[00:13:28] Unknown:
Given that supervised machine learning approaches require so much time and energy invested in making sure that you have all of these accurate labels, why doesn't everybody just say, let's just use unsupervised learning, let's just throw a deep neural network at it and hope that everything works out for the best? Yeah. It's, again, very good question. So
[00:13:46] Unknown:
there are 2, like, let's tease that apart, right? So there's like unsupervised learning where you just kind of throw an algorithm at your data and you kind of hope for the best. Clustering, like, historically, like lots of clustering mechanisms are unsupervised by nature. There's no guarantee that it's actually going to model your problem correctly. And what's more is that, like, it'll tell you that this piece of data is very similar to this piece of data, but it's not gonna be able to say, like, this is a cat and this is a dog. That tacit knowledge of the actual task comes in through labels. Like the whole point of labels really is to take domain expertise and knowledge like from the heads of humans and provide it to models. Then we start talking about like deep learning models and things like that. And frankly, like most deep learning models are supervised in some capacity. Now you could obviously get like something pre trained off the shelf, like you could use BERT or CLIP or some other, you know, well defined pre trained set of layers.
But oftentimes, if you want it to work really well for your task, you need to still fine tune it on your data, which requires some amount of training data. And the other hard part is that, like, you also don't know how much data you need ahead of time. You know, like, it's sort of another philosophically hard problem. Like before you go train your model, you have no idea how many labels you actually need before that model performs well on your task. And it varies not just by algorithm or like model architecture, but also by task. So you could have the same exact model architecture for 2 different tasks require 2 wildly different amounts of training data, simply because each of those tasks just have different properties, and you have to, like, really understand your data.
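For context on the "fine-tune an off-the-shelf model" point, here is a minimal sketch using the Hugging Face transformers library. The model name, example texts, and labels are placeholders; the only point is where the labeled examples enter the picture, and how much of the overall effort goes into producing them rather than into the code itself.

```python
# Minimal fine-tuning sketch (hypothetical data; assumes transformers + torch).
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["payment failed again", "great product, love it"]  # placeholder rows
labels = [1, 0]                                              # 1 = complaint

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
encodings = tokenizer(texts, truncation=True, padding=True)

class LabeledDataset(torch.utils.data.Dataset):
    """Wraps the hand-labeled examples the pre-trained model is tuned on."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=LabeledDataset(encodings, labels),
)
trainer.train()  # quality depends entirely on the labeled data you feed in
```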
So, like, that's, like, 1 side of it. It's, like, those types of techniques oftentimes just, like, don't work well, or it's this catch 22 where you need lots and lots of labeled training data to then create more training data. It's like this sort of hard problem. The other issue is that oftentimes as you trade off into the world of more sophistication, as you move more and more towards like these deep neural nets, oftentimes you lose explainability as a side effect of that. And I just want to highlight this as like 1 of our core tenets, like, we believe that your labeled training data especially must be explainable. That's like a hard requirement because AI today is basically just like code plus data. That's fundamentally what it is.
And we're coming off the back of, like, several decades of, like, high growth software engineering and good software engineering practices and so on. So, like, we've got tons and tons of tools and techniques for managing the code part of that, like, particular equation. But we have very, very few techniques for dealing with the data. Now, the code is inherently interpretable. Someone can go look at it, like that's the whole point of all these different techniques. That's why, like, collaboration techniques exist, that's why things like version control exists, so that several engineers, several data scientists, several whoever can go and collaborate on some code for some time and all understand and have sort of the same shape of the product in their head. But that same idea doesn't really exist for data in machine learning, and that's kind of problematic. Oftentimes, as data split up across several different annotators, you have to sort of like take at face value inter annotator disagreement or agreement, and this sort of like goes into several different layers of just like possible things that can go wrong. Like an example is you have 5 people who are all labeling, Let's say they all happen to see a particular row at the same time.
3 of them say it's a cat. 2 of them say it's a dog. It happens to be like a tiger or something like that. Who's right? You know, like maybe the problem itself was ill defined. Maybe, like, the 3 people who chimed in actually do have like strong backgrounds in this particular slice of the data set. Maybe the 2 who said dog were just tired and like weren't really thinking. Maybe like 98% of the time their labels are actually really good. Did this happen to fall into like the 2% of the time where the labels are bad? Like, now you get into these like sort of hard problems of what is a good label, and that matters if you want to trust your downstream systems. This might sound incendiary, but I hope it makes sense: we don't believe that there's any such thing called ground truth. And this is a lot of the time like a term that's thrown around by data scientists all the time. It's like, they'll call it gold data, they'll call it ground truth.
But we just don't believe it exists in any case where it's not inherent in the data itself. So for instance, if I'm building a model on like a predictive model for maybe a recommendation engine on like what a user would likely want to see next, I could go and look at everything the user has interacted with in the past, and that's ground truth. Like, there's nothing I need to interpret there. It's just what the user did. And so I can predict the user's next set of actions theoretically. But here, where we actually do need an interpretation step, where we actually do need to label the data in some way, all of a sudden you introduce the possibility of human bias or incorrectness in some capacity.
And as a side effect, you can't just blanket say all this data is ground truth just because humans looked at it. So the moment you start embracing that idea that your labeled data is not ground truth, then it starts begging the question, okay, what parts of it are good and what parts of it are not so good? And then that leads you down the rabbit hole of, like, okay, how do I actually interpret my data? How do I explain it? And so we feel that that part of it is actually a very, very important principle, where your data must be able to, like, be introspected, you must be able to explain it, Because otherwise, how can you fix problems as they come up? And how can you even trust systems downstream of it?
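One way to act on the "no ground truth" argument above is to keep annotator disagreement around instead of collapsing it to a single answer. A rough sketch, with hypothetical reliability weights (this is an illustration of the idea, not Watchful's method):

```python
from collections import defaultdict

# Five annotators saw the same row; three said cat, two said dog.
votes = ["cat", "cat", "cat", "dog", "dog"]
# Hypothetical per-annotator reliability estimates (e.g. from spot checks).
reliability = [0.9, 0.95, 0.6, 0.7, 0.98]

def soft_label(votes, weights):
    """Turn disagreeing votes into a probability distribution over classes."""
    scores = defaultdict(float)
    for vote, weight in zip(votes, weights):
        scores[vote] += weight
    total = sum(scores.values())
    return {label: round(score / total, 3) for label, score in scores.items()}

print(soft_label(votes, reliability))
# {'cat': 0.593, 'dog': 0.407} -- a distribution to inspect, not "gold" data
```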
[00:19:28] Unknown:
In this question of how to manage the labeling of the data, obviously, that's something that you're very invested in, but I also know that there are a lot of continuums in this space of how best to label the data, what types of data you're working with, what types of labeling you need to do, what the downstream use of that data is, and how that factors into the level of accuracy or detail that needs to be present in these labels. And I'm wondering if you can just talk to the kind of overall market space of data labeling as a whole, which is obviously very large and growing, and how you think about your position in that space and the kind of unique capabilities or functionality that you're offering to make labeling of training data for machine learning models a tractable problem?
[00:20:20] Unknown:
Yeah. So let me first start out by kind of describing the market, and then I'll also describe kind of the key players from, like an implementation standpoint. And I'll talk about like who cares about what and that sort of thing, and then hopefully that sort of paints a fairly clear picture. But we sort of see this market in ostensibly, like, what we call 3 generations. So generation 1 are basically, like, crowd sourced labeling services. So think, like, Amazon Mechanical Turk, those types of services where it's like you just give them a task and you have some properties on it, and then they go and label it. So Amazon manages like an army of humans somewhere else. You give them data, you give them a task, they hand you back label data, and you kind of hope it's good. What people quickly realize there is that, like, okay, the data that comes back is really at the mercy of, like, how good the instructions are and that sort of thing and, like, most people are not very good at writing these instructions, so oftentimes you have to go through, like, several iterations.
Now that's sort of, like, what spawned the 2nd generation of labeling companies, which are companies like Appen, for instance, where they manage that whole process. So it's not just access to a crowd, it's also like you get kind of like a project manager who can help coach you through, like, the creation of good instructions, and they kind of curate and help qualify these labels in various ways. But the 1st and second generation of labeling services can only really be touched by folks who have, like, non sensitive data, or their data doesn't require any real subject matter expertise to be labeled and that sort of thing, you know, where the volume of data that needs to be labeled isn't huge. Like, there are all these, like, various restrictions that you have to kind of, like, place on it. So that's where the 3rd generation of labeling companies kind of come in. And those are like primarily software vendors where it's companies like Labelbox or even like Label Studio by Heartex where they'll give you software and the expectation is that you've probably if your task is like working on particularly sensitive data, you've probably employed your own army of humans in some capacity. You've brought in some contractors or something within your 4 walls, and they're now authorized to access your data. So great. Okay, so you give them this software, and they sit in front of it, and it's like a nice glossy way to draw boxes and highlight stuff and basically apply labels, but they're still going through the data largely 1 by 1. You know, there might be some automation techniques in there, but it's still largely just like a mechanical, like calorically bound process.
So that's like 3rd generation. And then 4th generation is where we sit. We call this kind of like programmatic labeling, where we err really hard towards the side of automation, where we don't want you to have to hire an army of humans to label data, that we feel that that's kind of beside the point. Or if you do happen to have an army of humans, they should be able to do a 1000 times more than what they're currently doing. So it's sort of like efficient use of resources you already have. This works really well when your data is super sensitive, or it requires a deep amount of like subject matter expertise where you don't have like 30 doctors who can just sit in a room using something like Labelbox or Label Studio, you know, for several months.
Instead, they only have like an hour or 2, maybe 4 days in a row, where they can just like sit down with some software and help label. That's where we really shine. That's kind of like our entry point. So a lot of the time we work with organizations where they just have a huge amount of subject matter expertise that's necessary in the labeling process or their data is so large that they have to go through through like programmatic means. A good example of this are like content moderation on major platforms where maybe the case of things that need to be content moderated are really, really, like, rare. You know, maybe say sub 0.5% of the data set is actually things that need to be content moderated.
And let's say your data set has like several billion entries, several billion rows. Imagine trying to like find those needles in that particular haystack manually. It's hard. It's really, really difficult. So instead, if you're able to do it programmatically, then ideally the vast majority of those things are caught automatically. And then, you know, the things that kind of slip through are the things that you can focus your time on manually. That's sort of like another area where we really shine. It's where data volumes are so large or the data is wildly imbalanced.
And the 3rd area that we really shine where these sort of like manual services don't really work too well is when your data's changing very frequently. This is sort of another example of content moderation, or even like fraud detection or things like that, where you're kind of like in an adversarial situation where you're constantly kind of like chasing the tail of some sort of adversary who's always trying to outwit you or outsmart you. So, in fraud, it's like the moment you detect a new way that people are, you know, committing fraud, they're gonna come up with a new way, and they're gonna keep coming up with these new ways, and they're gonna do it very, very quickly. They're incentivized to do that. Now, if you are stuck manually labeling to then feed your models this data and then, like, go from there, it's just way too long of a feedback loop before you actually get to the heart of it, which is, look, there's a very mechanical change that they did, you know, instead of these fraud, like, charges coming from x location, they're coming from y location, or this, like, other particular pattern, and so on, like, if you can encapsulate those programmatically, then obviously it makes it way easier to change as your data changes.
So that's sort of like another reason why programmatic labeling is kind of really interesting these days, but just to kind of like sum things up, those are the 4 generations. Then talking a little bit about the stakeholders, because I think it's important to think a little bit about who cares about what. I mentioned earlier that oftentimes data scientists or data science teams are not necessarily the ones who own the labeling process. So I would say that there are like a couple key stakeholders here. There are, let's just call them annotators, or oftentimes they're domain experts, you know, doctors, lawyers, financial experts, insert person here, who needs to, like, lend their expertise to a labeling process.
The obvious win for them in the world of programmatic labeling is that they get to save a huge amount of time. They're not stuck doing this, like, tedious task where they see little to no ROI themselves. Then you move over to data scientists. And with data scientists, the win here is that now their data is explainable. So if any of it is wrong, you can always go back and be able to say actionably, hey, this is incorrect. Here's how I'd like it to be changed. The other big win here is that we actually output probabilistically labeled data.
Now, historically, labeled data has been kind of like a 1 or a 0 problem. Like, it's either fraud or it's not fraud or it's something that is, you know, bad or malicious content, or it's not bad or malicious content, things like that. It's usually like very black and white. What we do instead is we model the likelihood of a particular thing being part of a class. So in our world, there might be a 78% probability that this particular thing is fraud related or not. And this sounds like a relatively minute change, but when you couple that with the fact that your labels are explainable and they're programmatic, your downstream models actually now have a lot richer information to learn from. So all of a sudden, instead of just learning from a 1 or a 0 and just calling it a day, now you can focus your model on the things that are very likely to be fraud related, or very likely not, and you can treat the things in the middle a little bit differently.
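To illustrate why a probabilistic label carries more information than a bare 1 or 0, here is a small PyTorch sketch of training against soft targets. The model and data are toy placeholders; the only point is the loss, which uses the full label distribution (for example 78% fraud, 22% not) instead of a one-hot target.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(4, 2)      # toy classifier: 4 features, 2 classes
features = torch.randn(3, 4)       # placeholder feature vectors

# Hard labels collapse everything to a class index ...
hard_targets = torch.tensor([0, 1, 1])
# ... probabilistic labels keep the confidence, e.g. "78% likely fraud".
soft_targets = torch.tensor([[0.95, 0.05],
                             [0.22, 0.78],
                             [0.40, 0.60]])

logits = model(features)
loss_hard = F.cross_entropy(logits, hard_targets)
# Cross-entropy against the full distribution: uncertain rows pull less hard.
loss_soft = -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
print(loss_hard.item(), loss_soft.item())
```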
So there's a lot of richness that you can get out of this, which is quite valuable. And we found that in practice and in sort of experimental workflows, that models trained with this probabilistically labeled data tend to perform better, simply because there's 101 degrees of freedom now in the target value as opposed to just 2. So that's sort of another kind of win there. And the 3rd most obvious 1 is, like, the primary business stakeholder. It depends on who we're talking about. It could be like a line of business owner, it could be a product manager. There are several folks that we might end up working with. But the big win for them is that with programmatic labeling, you're able to pursue use cases that you couldn't before. That's kind of like plain and simple what it is. Oftentimes, there are a set of use cases that are frankly the most valuable to an organization that are cost prohibitive to even get started with. And that's what programmatic labeling enables them to do versus going to, like, an Appen or someone like that where you can't even give them sensitive data, or they can't do things that require deep subject matter expertise, or they can't process millions of rows of data for you in like a timely fashion. That's sort of the areas where we shine.
[00:29:13] Unknown:
In my understanding of the market, the company that I've seen that is most analogous to what you're doing is the folks at Snorkel AI, which is also taking this very data centric approach to labeling the data and building training sets, and offering these programmatic approaches to being able to generate these labels and collaborate with domain experts. And I'm wondering if you can give your sense of how you think about the differentiation between the way you think about problems and the way that they're approaching them, and specifically which problems you're each trying to solve.
[00:29:48] Unknown:
So you're absolutely right. In sort of like the programmatic, like, 4th generation of labeling companies, it's basically us and Snorkel. And Snorkel is a great company. I think we just trade off in different directions. So Snorkel seems to be very focused on an end to end workflow, where you take raw data, you shove it into this thing, you put in, like, human effort, and out pops, like, trained models, and they have, like, infrastructure for deploying them and monitoring them and that sort of thing. We traded off in the exact opposite direction.
Instead, our belief is that this problem, like most others in data science, data engineering and software engineering, is best solved with a modular approach, where you design a single best in class solution for a specific slice of a larger problem. But it plugs in nicely with everything else, where you can plug it in like a Lego to literally anything else you're using upstream or downstream of it. That way, data engineers, data scientists, software engineers can choose their own stack. They can choose what is best in breed for each individual slice, what specifically you're trying to solve, as opposed to us pigeonholing them into using a single end to end workflow.
So we focus a lot less on the parts of the problem that we feel are actually reasonably solved by other organizations, by other software. MLOps has come a long way already. We have interesting things to say about like distributed training and serving and so on, but we're not, like, uniquely positioned to solving those problems. We are uniquely positioned to solving things dealing with data specifically, And so that's where we focus our time, and we want to be seen as like the best in breed labeling software, you know, that plugs into literally anything else you want to use downstream. As a side effect, the types of organizations that prefer us over Snorkel and vice versa, they just look different. You know, we tend to sell to organizations that, you know, might have opinions about the things that are in their stack and why they're there, and they want things that plug into them.
Snorkel, you know, perhaps might work really well for organizations that haven't yet started their AI journey, maybe wanna sort of get started in a fairly low risk way. So they want kind of like a whole stack all in 1. But, you know, frankly, you can't really go wrong either way. Like, Snorkel's product seems really, really good. So I will say that.
[00:32:14] Unknown:
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state of the art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up for free or just get the free t shirt for being a listener of the data engineering podcast at dataengineeringpodcast.com/rudder. Going back again to this question of data engineer versus data scientist, who's responsible for the labeling question, and your characterization of the data engineer as the person who wants to be able to build these repeatable, scalable, automatable systems, introducing that capacity into the labeling process through this programmatic labeling workflow, how does that then bias the decision about who is responsible for managing the data labels? And then the other aspect of this question is that, particularly recently, data engineers have been very focused on this question of data quality or data observability or validation, however you wanna frame it. And I'm wondering how that factors into this data labeling aspect as well of being able to automate this process of adding these labels to the raw data and then being able to validate the accuracy or the quality of that resulting dataset that is fed into these downstream models?
[00:33:47] Unknown:
Yeah. So I see in almost like every data engineering process today, there are like a couple key stakeholders. Right? So the data engineer themselves are specifically tasked with, like, the automation of this process. Now, typically they're not the ones who actually own the quote unquote business logic, right? They're not necessarily the ones who came up with like, okay, we have to take these like 4 numbers and like calculate this particular number out of it for these reasons downstream. Like typically they are told that, you know, by a domain expert on the other side who actually owns the problem in some capacity. And similarly, there are like other stakeholders down stream that will care about various aspects. But if you sort of, like, take that mental model and apply it here, like, let's talk about who actually owns how data gets labeled. It's largely, in our opinion, like a data scientist or data science plus, like, domain expert question. Really, it's like the domain experts or the annotators are the ones who have the most insight into what a particular classification means, what a label means, and so how it should be applied to the data is sort of like within their domain.
Now, the data scientists own how the labels, like, relate to the architectures that they're trying to train for, or the types of tasks that they're trying to solve downstream. So they naturally are stakeholders in that process. And you can imagine that those 2 groups could iterate themselves, like prototype a dataset that seems to work reasonably well. And they'll go train their 1st model, and they'll go deploy it, and everything's great, but now, like any model that you put into production, you have to assume you're gonna have to retrain it at some point in the future, you know, due to data drift or what have you. Like, you come back a quarter later, your model's not doing so hot, chances are you just haven't retrained it yet. So that's where data engineers can come in because now they can take that prototypical work that the domain experts and the data scientists work together on, and they can put it into production and they can make it so that every day, every week, every 2 weeks, however often they want, raw data is sourced from where it lives, it runs through a labeling process that's already been crystallized by the domain experts and the data scientists together, and outputs labeled data in a well known place where the data science, like, machine learning pipelines can just pick it up and use it automatically.
So then you can kind of, like, draw this very well defined picture end to end, where raw data starts and trained models are served, and it becomes entirely programmatic at some point. You know, the amount of human effort necessary in the middle should be as close to 0 as humanly possible. So I wouldn't say that the goal here is for data engineers to own the labeling proper, like the labeling process, but rather for them to continue on and help with, like, the productionization of these processes. And instead, the onus is really on, like, the subject matter experts and the data scientists to get, like, the criteria for labeling down, and then hand that off to data engineers to actually be able to productionize it in a meaningful way. Now, you asked an interesting follow-up, which is like, okay, how do we reason about the quality of data and things like that? And that's actually like where we see this as part of all the same continuum. Like, again, that's why I kept harping at the very beginning about explainability, because you need to be able to explain your data before you can, like, really interpret if it's good or bad, and by what degrees, and so on. So we put a lot of time and effort into like providing those calculations, providing that insight to like not just data engineering users, but also data scientists and even domain experts. Like, everyone should be able to reason about their data in a pretty hands on meaningful way. Now, obviously, a lot of that happens in kind of like development mode, where you're iterating, like, maybe your domain experts are iterating with data scientists and kinda coming up with a system that seems to work well. So in our world, you'd be creating what we call hinters, which are heuristics that noisily indicate possible labels. So you might say, if these certain text patterns show up, then it's very indicative that this particular customer support ticket is payment related versus account related versus something else.
You create a whole bunch of these, and then we'll show you how they interact and, like, in what ways your labels might be right, in what ways it might be wrong. Here are the particular parts of the data where your labels might not be super correct and where you might want to focus your time. So we provide all of those metrics kind of out of the box, both in development mode and in production mode, where we can run in sort of like what we call headless mode, where as a data engineer, you could just pull data from like your sources of truth. Maybe it's like data warehouses or what have you, your data lake, run it through this headless application, which just accepts raw data on standard in, outputs labeled data on standard out, and tie that into, like, a pipeline. So you just take that labeled data and shove it someplace where your data science team can reason about it. And all the while, we'll be reporting metrics to you. So we'll tell you like, hey, here's like the overall quality of this particular pipeline.
And then you can act on it, and it becomes very clear who should be doing what.
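As a rough illustration of the headless mode described above (hypothetical patterns, field names, and scoring; not Watchful's actual CLI or hinter format), a labeler that reads raw JSON lines on standard in and writes probabilistically labeled JSON lines on standard out can slot into whatever orchestration a data engineer already runs, for example `cat raw.jsonl | python headless_labeler.py > labeled.jsonl`.

```python
#!/usr/bin/env python3
"""Hypothetical headless labeler: raw JSON lines in, labeled JSON lines out."""
import json
import re
import sys

# "Hinters": noisy heuristics that each vote for a class with a weight.
HINTERS = [
    (re.compile(r"\b(pay|invoice|credit card|transaction)\b", re.I), "payment", 0.8),
    (re.compile(r"\b(password|login|locked out)\b", re.I), "account", 0.7),
]

def label(text: str):
    """Combine hinter votes into a best guess plus a rough confidence."""
    scores = {}
    for pattern, cls, weight in HINTERS:
        if pattern.search(text):
            scores[cls] = scores.get(cls, 0.0) + weight
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, round(scores[best] / sum(scores.values()), 3)

for line in sys.stdin:
    record = json.loads(line)
    cls, confidence = label(record.get("text", ""))
    record.update({"label": cls, "confidence": confidence})
    sys.stdout.write(json.dumps(record) + "\n")
```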
[00:38:49] Unknown:
And so now digging into the Watchful platform itself, can you talk to how it's implemented and some of the ideas or assumptions you had about this problem domain and how to address it that you've had to revisit in the process of building out the platform to where you are today?
[00:39:06] Unknown:
Yeah. So I'll go maybe slightly into the weeds, so feel free to pull me out if I'm going a little too deep. But I'll start with our stack. It's a little esoteric. Our back end is written in Rust, and our front end's written in ClojureScript. We chose those 2 languages despite the fact that they don't historically go together for very specific reasons. Rust gave us the ability to kind of guarantee correctness in specific ways that we knew would be important to us, as well as guarantee performance on certain types of target architectures. And we could actually distribute a binary on practically any platform without any real dependencies on the system. And you obviously can't do that with like a JVM based language and so on and so forth. So we needed both speed and quality and kind of like distribution, so to speak. And ClojureScript was really interesting because it allows our front end team to iterate very, very quickly. And part of our experience in the product is like, if you kind of looked at it, it feels very REPL-y, if that makes sense.
A REPL is simply a read-evaluate-print loop. And so if you type like Python in your terminal, it pulls up just like the Python interpreter and you can just like type things and you can see the immediate evaluation and kind of iterate from there. That's the type of experience that our product gives people, but in a way where you don't actually have to code a whole lot. So we wanted to kind of mimic that same experience, not only in the product, but also in the development experience. So Clojure and ClojureScript, by extension, are very like REPL driven development stacks. And that was like encapsulating that from a cultural perspective, both in the engineering team and in our product was actually kind of a big win. Now, in terms of things that we assumed that we ended up like changing, we have quite a few. So I think like the first 1 was, I'd mentioned earlier, you could build like text, like heuristics, right? So if I see these words, then it likely indicates something as payment related or account related or what have you. You can imagine that a lot of the time those take the form of RegExes.
So great. Now we're talking about like, okay, how does a domain expert sit down and write a regex? You know, even engineers oftentimes like find it difficult to write good regexes. So we thought like, okay, why don't we just have a regex builder? So we'll have this, like, UI thing where you can, like, click and drag and, you know, drop things in, and it'll build a regex for you. And we tried that. We tried prototyping that with users, and it turns out they hated it. And the reason why they hated it is because they went from a thing that required almost no clicking, you know, just keyboard, you know, just typing, to something that required all clicking, and there's no other way to do it. So we ended up like trading off in a different direction there. We actually designed our own query language, which people historically tell you not to do, but, I promise, we did the due diligence. We looked at every other query language that was like kind of in the space, and we realized that they weren't quite right for what we wanted to do for labeling. So we built our own, and that's actually paid dividends, which I'll talk about in a second. The second thing that we made an assumption about which we changed was we sort of assumed that people want to use this product in the same way that they use IDEs, where oftentimes people are, like, running their IDE locally. You know, a lot of the time people are not using cloud based IDEs. So we assumed that people would want kind of like a data IDE, where you could spin up your data, you could interact with it, like, in a fast local environment, you could be on a plane and iterate on your data no problem. The moment you're on the ground and you have Wi Fi, you can just ship your changes to everyone else like you would in any other development process.
That was a noble thought. However, as it turns out, most people don't want that. We actually shifted to like more containerized and like hosted deployments, and that's actually also paid dividends. The 3rd area, which I think is more around like, not necessarily an assumption that we had, but something that we learned along the way is that like building good heuristics is really hard. Like, how do you know that a heuristic is actually good or not? It's meant to be noisy. Right? So it's gonna be good some portion of the time, bad some other portion of the time, it's not going to cover all of your data. So how do you reason about it? And how do you know that you're actually making good incremental progress on like labeling your data? So this was like a hard problem that we really, really focused on for a long time and now we have like this incredibly strong focus on a suggestion based workflow where you might get Watchful started. You might say, okay, if I see the word pay somewhere in the text, it's likely indicating that this is payment related. Then Watchful will take that and run and be like, okay, you said pay is payment related? What about credit card? What about invoice? What about bank? What about transactions? It'll keep like coming up with these new patterns, and it doesn't just stop at keywords.
Like, because we designed this query language, we own the entire grammar that can possibly be run within our product, which means that we don't have to do something as sophisticated as trying to generate, like, arbitrarily complex Python. Like, there's a finite space in our grammar that we can generate, and it does all the things that you might want, but it's much smaller. And as a side effect, we can simulate it very, very quickly. We've come up with very clever ways to do that. So as a side effect, you can even enrich your dataset with things like parts of speech tags or like sentiment, and Watchful will pick that up and use them as part of suggestions. So it might say, you're trying to look for, like, company names or something like that.
Watchful will automatically discover that, okay, if I see a sequence of proper nouns that ends in the word Inc, it's very likely a company name, and it'll suggest that to you. You don't have to come up with it. And Watchful had no concept of proper noun. It had no concept of any of these things. It's just data at the end of the day, and it's able to suggest them for you automatically. That was something that was like kind of a revelation for a lot of our customers, because now it took what was otherwise like, yes, a time saving workflow, but it shifted the pain from, like, something very manual to something very thought heavy, where instead of, like, clicking yes, no, yes, no, yes, no, and that being tedious, now you've had to, like, think really hard about what heuristics you have to write. Then we went a step further with the suggestion engine, and it's like, okay, now you don't have to think as hard. Now let Watchful just kinda come up with stuff, and you tell Watchful whether that's right or wrong. So it's shifted back to, like, a yes no, yes no type workflow, but now you're saying yes or no to concepts in your data. Like, does the concept transaction have something to do with payment? That's a yes or no question. But now it applies to hundreds or thousands of rows of your data all at the same time. And now, most recently, we've gone even a step further than this. And now we realize that our suggestion engine is actually good enough to kind of be on autopilot sometimes in, like, carefully crafted and very specific times. But the cool thing here is that as an annotator using Watchful, as a domain expert, as someone who has never seen a query language ever and has no interest in learning 1, you can sit there and you could just hand label some data, a very small amount, maybe on the order of, like, tens or hundreds of rows.
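A rough sketch of the keyword-suggestion idea described above (entirely hypothetical scoring, not Watchful's engine): start from a pattern the user has accepted for a class, then rank other tokens by how concentrated they are in the rows that pattern already covers.

```python
import re
from collections import Counter

docs = [
    "my payment failed, please check the invoice",
    "I want to pay with a different credit card",
    "cannot log in after the password reset",
    "the transaction was declined, refund my payment",
]

seed = re.compile(r"\bpay(ment)?\b", re.I)  # the hinter the user accepted
covered = [d for d in docs if seed.search(d)]
uncovered = [d for d in docs if not seed.search(d)]

def token_counts(texts):
    return Counter(t for d in texts for t in re.findall(r"[a-z]+", d.lower()))

inside, outside = token_counts(covered), token_counts(uncovered)
stopwords = {"the", "a", "my", "i", "to", "with", "was", "please", "want"}

# Score tokens by how concentrated they are in the seed's coverage.
suggestions = sorted(
    ((tok, cnt / (outside.get(tok, 0) + 1))
     for tok, cnt in inside.items() if tok not in stopwords),
    key=lambda s: s[1], reverse=True,
)
print(suggestions[:5])  # candidate terms to suggest back to the user
```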
And the suggestion engine is now good enough to automatically create heuristics that will explain your hand labels and will actually scale them out across the rest of the dataset. It'll be intelligent enough to show you certain rows that it wants additional validation on. It'll be intelligent enough to figure out whether those heuristics are in fact good or not and adjust weights over time, and so on. The side effect here is that for an organization that has already invested in a, quote, unquote, army of humans, you don't have to retrain those people. You don't have to retrain them on using a new tool. They can just sit there and hand label the way they normally have, and you get all the benefits of programmatic labeling without any of the headache. That has been our most recent breakthrough, and it seems to be working really, really well so far, so we're really excited about that. Couple of interesting things to dig into here. One being, you mentioned that you built your own query language and that that gives you a
[00:47:16] Unknown:
finite space that you need to worry about as far as how to be intelligent and autogenerate particular semantic elements that you wanna suggest to people. I'm wondering how you've been addressing the problem of feature and scope creep, and how you avoid it accidentally turning into another Turing complete language, exploding that potential state space and removing the benefit of having a finite boundary of what you need to maintain and explore automatically?
[00:47:51] Unknown:
This is actually the discussion we had when we initially made the decision to create a query language. What we didn't want was to go down the rabbit hole of, oh my god, we have to develop a whole brand new, fully fledged language. So what we did was start out with a set of primitives that we felt would exist in pretty much any labeling problem, and that could be combined in ways more powerful than the individual components themselves, the sum of the parts, so to speak. It's a very Clojure-like mentality, where there's a very, very small number of atomic components that you can stitch together in all sorts of ways to manipulate data. It's just that in our case, we don't care about manipulating all data. We care about manipulating specific types of data in specific ways.
So far, interestingly, our customers have not asked for additional features in our query language. We add features as we increase Watchful's capabilities, as we get into new types of tasks. For instance, we initially started with full text classification. That was the first task that we supported. You can imagine that supporting regexes was important, supporting booleans was important, so I could say this regex and not this other regex, and supporting column based queries, greater than, less than, all of that. And then we got into the world of named entity recognition, or NER, or more broadly speaking entity extraction, where it's like, okay, I don't really want to just label this entire email as having contained a name.
I want to be able to highlight the actual name in it. So now I need to not only be able to say this row or this piece of data has the thing I'm looking for, but here's the thing I'm actually looking for; I need to be able to point to it. So now we're talking about capture groups and tokenizations and various other ways to slice through your data. That's how we've expanded the query language. We come up with the base set of primitives that are necessary to solve a particular task, we think about one or two tasks ahead of that, and we think about, okay, here are the things that are likely gonna be involved in the next iteration of this. We kinda soft circle those, and we validate with users to make sure that if we were to give you these language capabilities, do you feel you'd be able to solve this problem effectively? And if the answer is yes, we'll build it incrementally.
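Since the actual query syntax isn't spelled out in this conversation, here is a purely illustrative sketch, in plain Python, of the kinds of primitives being described: a regex combined with a negation, a column comparison, and a capture group that points at a span for entity extraction. None of these names or patterns come from Watchful itself.

```python
# Purely illustrative heuristics mirroring the primitives described above:
# regex matches, boolean combinations, column comparisons, and capture groups.
import re

def is_payment_related(row):
    # Regex plus a negation, like "this regex and not this other regex".
    text = row["body"].lower()
    return bool(re.search(r"\b(pay|invoice|credit card)\b", text)) \
        and not re.search(r"\bpayroll\b", text)

def is_large_order(row):
    # Column-based comparison, like "amount greater than ...".
    return row["amount"] > 1000

def extract_company(row):
    # Capture group for entity extraction: point to the span, not just the row.
    m = re.search(r"\b([A-Z][a-zA-Z]+(?: [A-Z][a-zA-Z]+)* Inc\.?)", row["body"])
    return m.span(1) if m else None

row = {"body": "Please pay the invoice from Acme Widgets Inc. today", "amount": 2500}
print(is_payment_related(row), is_large_order(row), extract_company(row))
```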
So far, the language is really small. It's very easy to learn and understand, and actually, the vast majority of the time, our users don't even have to really learn the language. In Watchful, you can click on stuff. We'll have charts and things like that where you can zero in on specific slices of your data, like the false negatives, the things you had labeled manually as part of the class where Watchful predicts they're not part of the class; there's a bar you can click on, and it'll auto populate a query for you. So your query can do a lot more than just run regexes and so on. You can actually slice into the data based on various aspects of metadata, but you as the user didn't have to write that query. You interacted with a chart, and everything bubbles back to queries ultimately. As a side effect, you don't have to be an expert on our query language. You just use the product like you normally would, rely on the parts of the UI that you want to rely on, and use the query language for the parts you feel comfortable with. And we have this cheat sheet that auto populates based on your dataset, and it's very dynamic, so based on the text you have in your dataset, or based on the classes you have, it'll even throw suggestions out there.
So short answer here is like, it's something that we're kind of wary of. We always think critically before we actually add new features to the query language. We always think like, okay, is there a better way to solve this problem without adding bloat? Our goal is not to create a brand new Turing complete language. If we did, that would make our suggestion engine very, very complicated.
[00:51:48] Unknown:
So for now, we're trying to keep things simple. And then you have to solve the whole big
[00:51:52] Unknown:
Yeah, exactly. All of a sudden we're getting into like NP hard problems, and I'm like, I'm not into that.
[00:51:58] Unknown:
And another element that's worth exploring is the question of which types of data you're able to work with and which types of labeling you're able to apply to it, because each data type has its own aspects of annotation, and that influences the scope of what types of machine learning problems you're able to take on. Some of the most common examples are things like bounding boxes on images, or, in the named entity recognition piece, being able to label and annotate specific segments of text. And I'm wondering how you've been approaching that challenge of figuring out what are the data types that we want to support out of the box as we first explore this problem space, and then what are the juncture points where we say, okay, we're able to accept
[00:52:46] Unknown:
a new data type. So maybe we went from text to images, and now we're going from images to 3D point clouds from a LiDAR scan, things like that? Let me first start out with our mission, then I'll talk about where we are, and then I'll talk about some of the cool things that we've been working on. Our mission, for this particular product, is to be able to label all data, all the time, any way you'd like. Whether that's text data, with classic full text classification or NER or relationship mapping, or images, where you should be able to do object detection, bounding box type stuff, segmentation, all sorts of things. LiDAR, same deal with point cloud data; audio, video, time series, you name it. The other underpinning of that mission is to be able to solve all those problems with essentially the same workflow.
We also believe that our users shouldn't have to learn a brand new workflow each time they switch between data modalities. And oftentimes these problems have multimodal aspects to them: you might have text and images, for instance, or audio and text if you're talking about transcriptions, and so on. So ideally, you're not having to learn a brand new skill set each time you go to a different data modality. We started out with text, so that's the world where we have the strongest pitch today. That's where most of our customers find the most value: they're doing full text classification type problems or some sort of entity extraction or information extraction use cases, and we're able to largely automate that process for them. But we recently ran some experiments on images just to make sure that our workflow does actually work across modalities.
And we got very, very good results in a prototype. This is still experimental at this point, but it gives credence to the idea that our workflow can be used for essentially any modality. The next modality we're likely to attack is images. We'll probably keep going from there, but the foundational pieces that we're going to be building for images will be cross compatible with almost any other data modality. So things will get more powerful for text based users, we'll be supporting images with bounding box and segmentation based things for our users, and we'll also start our initial forays into other modalities like time series, audio, video, and so on. Another
[00:55:13] Unknown:
challenging area that's oftentimes the hardest problem to solve is the question of people and how to support them in using your tool, particularly in the data labeling space, where there's the challenge of collaboration across different stakeholders with different contexts and different requirements of the tool. I'm curious how you've approached that aspect of it and, in particular, how you have seen these different stakeholders work through the problem of building a shared vocabulary for discussing and referring to the different contextual elements they're trying to establish in this labeling process, and how to think about what shared vocabularies or vocabulary constraints they want to establish to make sure that everybody is using the same semantic elements in labeling these assets, so that they can be used in a consistent fashion for these model development processes.
[00:56:09] Unknown:
So this is, again, one of those interesting, philosophically hard problems. Let's break it down into some concrete pieces. One is, how do we support people onboarding onto Watchful? How do we actually support them and give them a good experience? Right now, in today's world, we run workshops. Our success team will help you get set up, and we'll invest the time and energy to turn you and whoever else wants to use Watchful into power users. Now, with the advent of that automatic labeling thing I was talking about, some of that has actually gone by the wayside for certain users, because now users don't really have to be experts on anything Watchful specific. They just have to come in and hand label, and they're already good at that. We'll do the rest for them. That leads into the second half of your question, which is, how do we make sure that labeling is consistent across users?
And this is, again, a hard problem. What we do is err on the side of providing our users as much information as we possibly can, and this manifests in a couple different ways. For instance, you might do a whole bunch of work on your instance, and then you decide to push it, and now it's available to all the other people who are working on that project. Someone else does some work, they push it, and then you pull it. And you suddenly see your precision has taken a nosedive, or your recall has taken a nosedive, or your error rate has shot up. We'll show you all of those things, and we'll show you exactly which pull, which hinters or heuristics, led to that particular issue.
That's a clear indicator that chances are you and someone else are not aligned on what you're labeling or how you're labeling it; there's some sort of fundamental miscommunication there. And that's a place where you can go in and look at someone's heuristics and ask, do I agree with this, yes or no? It takes this philosophically hard problem of trying to describe a class space in a paragraph, hoping that everyone in the room has exactly the same definition in their head and that they're gonna label in exactly the same way. We don't believe that's ever gonna happen. So instead, we give you tools to pick apart the parts of the class that folks do agree on and the parts that they don't. You can point to specific heuristics that are problematic and ask, do we agree with this, yes or no? Do we agree that the word "invoice" is payment related? Or do we believe that it's account related, or is it on some weird fuzzy boundary?
Maybe for 98% of the class everyone is in agreement, and it's a relatively straightforward thing, but there are certain aspects where they're not, and this helps surface those. So now you're not talking philosophically about a class space; now you're talking about a very specific heuristic that leads to a particular set of classifications that you can reason about. It's not a perfect answer, in the sense that there's no real way to just say, here's a set of instructions, everyone is now gonna label perfectly. But what we try to do is tighten the feedback loop: the time between knowing that there's a problem and actually figuring out a solution for it should be as small as possible. And point number two is that explainability aspect. You should be able to see that the quality of your labels tanked because of these three introduced heuristics, and you should go fix those. This also goes into some other things. I talked about this in a previous talk I'd given at another conference about the trials and tribulations of hand labeling, but the CliffsNotes are that oftentimes people use annotator metrics to evaluate how much to trust specific annotators' labels. So you'll basically have a score for individual annotators and say, okay, generally speaking, this annotator seems to be more correct than others, or at least have more convergence, and as a side effect we're going to trust them more.
When, in actuality, different people just have different experiences and different expertise. Just because someone might not have a ton of context on most of the things you're labeling doesn't mean they can't do a good job on the 5% where they might be the world's foremost expert. So you can't just discount the person overall. Our argument is that you shouldn't even worry about the people behind this. Focus more on the work output, which is these heuristics. The heuristics are the ones actually doing the labeling, not the humans at this point. The humans are helping in the creation of the heuristics, and you can talk about a heuristic in particular and have a very concrete discussion about whether it's a good heuristic or a bad heuristic.
But that level of scoring is actually much more helpful than trying to evaluate your entire, like, labeling team as a whole and trying to get them all on exactly the same page. It's like very unlikely to happen, but now this at least spurs, like, very concrete discussion and very concrete action items.
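As a rough illustration of scoring the heuristics rather than the annotators, the sketch below measures each heuristic's precision and coverage against a small set of hand labels and then uses those scores to weight its votes. This is a stand-in for the idea being described, not Watchful's actual label model, and every name in it is hypothetical.

```python
# Illustrative only: score each heuristic against a small set of hand labels,
# then weight its votes when combining labels on the rest of the data.

def heuristic_stats(heuristic, rows, hand_labels):
    """Precision and coverage of one heuristic on the hand-labeled subset."""
    votes = [(i, heuristic(row)) for i, row in enumerate(rows) if i in hand_labels]
    fired = [(i, v) for i, v in votes if v is not None]
    if not fired:
        return 0.0, 0.0
    correct = sum(1 for i, v in fired if v == hand_labels[i])
    return correct / len(fired), len(fired) / len(votes)

def weighted_label(row, heuristics, weights):
    """Combine heuristic votes on an unlabeled row using precision weights."""
    tally = {}
    for h, w in zip(heuristics, weights):
        vote = h(row)
        if vote is not None:
            tally[vote] = tally.get(vote, 0.0) + w
    return max(tally, key=tally.get) if tally else None

# Tiny usage example with hypothetical heuristics.
rows = [{"text": "pay the invoice"}, {"text": "reset my password"}, {"text": "card declined"}]
hand_labels = {0: "payment", 1: "account"}
h1 = lambda r: "payment" if "invoice" in r["text"] or "card" in r["text"] else None
h2 = lambda r: "account" if "password" in r["text"] else None
weights = [heuristic_stats(h, rows, hand_labels)[0] for h in (h1, h2)]
print(weighted_label(rows[2], [h1, h2], weights))  # -> "payment"
```

A low precision score on a newly pulled heuristic is exactly the kind of concrete signal described above: something specific to disagree with, rather than an argument about the class definition in the abstract.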
[01:00:57] Unknown:
Bigeye is an industry leading data observability platform that gives data engineering and data science teams the tools they need to ensure their data is always fresh, accurate, and reliable. Companies like Instacart, Clubhouse, and Udacity use Bigeye's automated data quality monitoring, ML powered anomaly detection, and granular root cause analysis to proactively detect and resolve issues before they impact the business. Go to dataengineeringpodcast.com/bigeye today to learn more and keep an eye on your data. One of the things that we haven't dug into yet is the question of how does all of this data even make its way into Watchful?
And as a corollary to that, when you mentioned some of your background, you talked about your experience of working in the streaming data space, and I'm curious how that increase in the need for real time and streaming data access plays into this question of supervised learning and how to manage labeling of that data, whether this programmatic approach to data labeling allows for the introduction of streaming data sources as a labeled input to continuous model retraining cycles.
[01:02:08] Unknown:
It's really interesting because when you talk about the machine learning process, you kind of think about it in terms of development and production. And in the development world, it's very batch-like. You take a sample of data, and you don't touch that sample of data. That's important because this is fundamentally data science, right? You can't really do science on things that are changing underneath you. So you take the snapshot, you work on that snapshot, you build a model, that model is built on that snapshot, and then you go deploy that model. And when you deploy that model in production, now you can treat the input as a stream, because you don't know when the next input is going to come in. It's usually not inference in batch; it's typically inference at whatever cadence users are interacting with your system.
So there's interesting opportunity there. I've always viewed batch and streaming systems as kind of analogous. If you have a very nice streaming system built from a well designed set of atomic primitives, where you give it a thing and it gives you back that thing modified, thing prime, then there's no reason you can't also treat it like a batch system: you give it a thousand things, and it gives you back a thousand things prime. So we treat Watchful roughly the same way. You give it a batch, say some CSV data. That's a lot of the time what our customers are doing: they grab a snapshot, they get a file, maybe put it in S3 or something if they're working on it as a team, or they might even have it on their desktop. They just load it into Watchful, and they have a development set that they can work on. Great. Everything's working.
Then after they've built their heuristics and they're happy with the results they're getting out of Watchful in terms of labeled data, the next step is productionizing that system. And that just means running Watchful in what we call a headless mode, which basically means we don't show you a GUI anymore. You just take the project file, the thing that describes your heuristics and so on, and you load it into Watchful, and it runs as a process. That process expects new raw data on standard in, and it outputs labeled data on standard out, and it'll do this all day, every day. So you could give it a stream, you could give it batches, it doesn't really matter. You could run it against a cron job, for instance, if you wanted to load new data every day or every couple of hours, or you could, in real time, send raw data into Watchful and get labeled data back, and have that shoved into some table somewhere the data science team knows about, or have it output to S3. It doesn't really matter. So we see both as part of the same continuum. And our goal is exactly what you mentioned: we feel that the process of labeling data, or frankly the process of getting training data, is an intrinsically linked problem with modeling.
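Purely as an illustration of that headless, standard-in to standard-out pattern (a generic sketch, not Watchful's actual interface or project-file format), a labeling filter can be written so the same process serves a nightly cron job or a live stream:

```python
# headless_labeler.py -- illustrative only. A generic stdin -> stdout labeling
# filter in the spirit of the headless mode described above: raw records in,
# labeled records out, whether they arrive from a file, a cron job, or a stream.
import json
import sys

def label(record):
    # Placeholder for loading a project's heuristics and applying them;
    # here, one hypothetical keyword rule stands in for the real thing.
    text = record.get("text", "").lower()
    record["label"] = "payment" if "invoice" in text or "pay" in text else "other"
    return record

def main():
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        sys.stdout.write(json.dumps(label(json.loads(line))) + "\n")
        sys.stdout.flush()  # flush per record so a streaming consumer sees it immediately

if __name__ == "__main__":
    main()
```

From the process's point of view, batch and streaming then look identical: `cat snapshot.jsonl | python headless_labeler.py > labeled.jsonl` behind a cron job, or the same command fed by a long-running producer.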
What I mean by that link is that your choice of model, or the way your model performs on your data, is a function of the data itself. So there's this connection there, and mechanically, what we want to see happen is, instead of this waterfall where you get raw data, you label it after some period of time, say two weeks, then you hand that data over to a data science team, they build their model, they train it, they realize that the data is not quite right, and they have to go back to the labeling process and wait another two weeks. Instead of doing all of that, just constantly be modeling while you're labeling data.
Again, earlier I mentioned that you don't know ahead of time how much labeled data you need before your model performs well. But if you're constantly training your model while you're labeling, then you can actually get a pretty good sense. You can see, okay, if I spend another hour labeling, how much does that improve my model's metrics? Am I making a big dent if I focus on this part of the data or this other part? We think that feedback loop between the model and your labeling process is very, very important, and tightening the feedback loop, not just in the labeling process but also with downstream systems, is a big area of our focus.
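As a sketch of that label-while-you-model loop (assuming scikit-learn purely for illustration; no framework is named in the conversation, and the helper below also assumes each batch already contains every class), you can retrain after each batch of new labels and watch how much a held-out metric moves:

```python
# Minimal sketch of tracking "did the last batch of labels move the needle?"
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def incremental_curve(texts, labels, holdout_texts, holdout_labels, batch_size=50):
    scores = []
    for n in range(batch_size, len(texts) + 1, batch_size):
        # Retrain on everything labeled so far, then score on a fixed holdout.
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts[:n], labels[:n])
        preds = model.predict(holdout_texts)
        scores.append((n, f1_score(holdout_labels, preds, average="macro")))
    return scores
```

If the curve has flattened, another hour of labeling in that region probably isn't worth much; if it's still climbing, keep labeling.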
[01:06:07] Unknown:
And in your experience of building Watchful and working with your customers and seeing how they interact with the platform, what are some of the most interesting or innovative or unexpected ways that you've seen it applied? I would say one of the most interesting things is that some of our customers don't even train machine learning models. The labeling problem is,
[01:06:25] Unknown:
interestingly, not specific to machine learning, which is kind of obvious: you have data, you wanna categorize it in some way. You could call that a classification problem, or you could call it a labeling problem. It's probably best solved using a machine learning model, but oftentimes, depending on what your data looks like, what your use case looks like, and what sort of subject matter expertise you can bring to the table, you might not even need to train a deep learning model or anything like that. You might have a fairly well defined taxonomy that you just want to be able to apply to your data in a fast way. And it turns out that's exactly what we've built: a very, very fast engine for applying extremely complex taxonomies in a way that doesn't break your brain or break your systems.
So as a side effect, people just use it for that oftentimes. And obviously their goal is to eventually build machine learning models, do x, y, and z, but sometimes they're like, look, this is good enough already, we're just gonna pause. We've worked with a law firm who did exactly this. That was another interesting thing: we didn't think we'd ever sell to a law firm, but we sold to Wilson Sonsini, and they've been one of our biggest customers. They've really been running data centric AI processes without realizing that they're running data centric AI processes. So we help educate them on what exactly this looks like and how they could train their first couple of models. And then on the exact opposite end of the spectrum, we have tiny startups using us, and there's this one startup in particular called Proper. They're actually in the cannabis retail space.
And so, one, they use Watchful directly, so they don't train a model at all, and that's been really, really good. And the second bit is that, apparently, we enabled their business to even exist, because if you were trying to do this manually, it would just require way too many man hours. It wouldn't be viable for their startup. So as a side effect of just using this taxonomy mapping capability, they're able to very quickly bootstrap a very sophisticated system that now, I think, serves around 30% of the cannabis retail space. So it's sort of interesting to me that we started this whole thing targeting data science users who work on machine learning for a very specific task.
And what we realized is that this task is really the same thing as classifying data itself, and it depends on what tool is best for your particular problem. Do you need a deep learning model, or another model, to actually do the classification? If not, your taxonomy mapped onto your data might be good enough. And that was probably the most surprising thing to me. In your experience of building the platform, building the business, and
[01:09:04] Unknown:
working through some of the vagaries and challenges of this space, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[01:09:12] Unknown:
Let me start with personal, professional challenges. As CEO, I think I've had to go through a sequence of level ups. As your organization gets larger and larger, your focus has to be different. It's not better or worse, it's just shifted. For instance, in the very, very early days of building this company, I leaned much more on my engineering skill set and was very hands on. It was kind of the classic story of four people in a room just coding all day. That very quickly evolved into becoming more of a people manager, where now it's not really me writing a lot of the code; it's other people, and I'm helping manage the team and making sure people are unblocked and things like that. And now that I have several levels under me, my focus is a little bit different again. Now it's on strategic vision and communicating strategy to the rest of the team. What I realized is that as you have levels removed between yourself and the end recipient of your message, the number of times you have to repeat that message increases significantly. It used to be that I could just have a one on one conversation with someone and, boom, that's it, message received. I could even get four people in a room and we could quickly talk it through. But as I start focusing more on strategy, the number of times you have to repeat yourself increases, and that was an interesting lesson to learn. It's a lesson in clarity, and a lesson in good communication style and that sort of thing. Now, in terms of the market, I think there are some interesting things that we've learned there as well. We thought that our opinions would be, as I mentioned earlier, incendiary. If I came out and said, hey, there's no such thing as ground truth, so what you've been saying for the last several years is just straight up wrong, we thought people would have a much more adverse reaction to that. But in fact, by and large, data scientists would just nod their heads pretty aggressively in meetings when we say that, because it's a fact they realized at some point, and they also realized there was nothing else they could do about it. So I think one of the things that I've learned about this market is that people are looking for better ways to model innate truths that they've already come to realize.
And data centric AI is an interesting approach to that. That's the other nice part: we're not the only ones talking about this. It seems that several different organizations have come to roughly the same set of conclusions at roughly the same time, and that's valuable because it gives credence to the messages that we're bringing. So lots of people have heard about data centric AI by now, and a lot of people are following that doctrine and seeing really good results. And it's not as big of a mental change as we thought it would be. So that was another interesting learning about people's willingness to adopt new mentalities and bring them to bear on their everyday work. I thought people would be much more averse to that, but as it turns out, I was wrong.
[01:12:09] Unknown:
So for people who are interested in being able to manage this labeling process and try to scale their throughput, what are the cases where Watchful is the wrong choice?
[01:12:19] Unknown:
I think it's really just about knowing the makeup of your data. I'm going to couch this by saying that, frankly, there are very few wrong choices in this particular domain. There's an entire spectrum of different ways you could go about addressing the labeling problem, and Watchful is just one of several different solutions. What we do is combine several techniques under the hood: things like active learning and weak supervision, plus Monte Carlo simulations and several other things running in the background. We take the approach that we're not precious about any one particular machine learning technique to solve this problem; we combine several. But even if you look past those techniques, you've got things like high supervision techniques, such as synthetic data generation, which is perfectly reasonable in certain cases where you might have a ton of data that you could use to train the synthetic data generation model, which will then generate as much data as you need. Or you could do some unsupervised stuff, as you mentioned, basic clustering and things like that, which could work just as well. So I would say the general mental framework here is to think about the head and the tail of your dataset. If your data is all long tail, and what I mean by that is everything looks very different from everything else, right? I can't think of a specific dataset that's all long tail, but you can think of very long tail datasets in computer vision, where you're trying to label people's faces.
Now, people's faces look quite different depending on who you're looking at, what angle, and so on. There are all these different externalities that affect the way you might want to label those faces. If you were to cluster that data, then depending on the angle of the face, the lighting conditions, the angle of the camera, and so on, you might get wildly different results. So you might consider parts of that dataset to be fairly long tailed. On the other hand, if your data has a very big head, meaning a lot of it looks very similar, then it might be easier to use things like unsupervised techniques to help boost some of that signal.
So the general mental model here is that if your data is extremely long tailed, there's not really a whole lot you can do other than hand label it, quite frankly, because no machine learning technique is going to give you a free lunch. You still have to train your active learning model or your transfer learning model on this incredibly long tailed problem, and it'll give you some incremental wins, but it's not gonna be enough to make a huge difference. If your dataset has a huge head, then you might not need something super sophisticated.
You might just need taxonomy mapping, or you might just need unsupervised learning, like simple clustering. Most datasets have some combination of these two things: there's a head and there's a tail. Depending on the size of the tail, you might want to lean more towards techniques that are more sophisticated, or have the ability to be more sophisticated, so think deep learning type techniques. Transfer learning is a good example there; active learning can be very, very good, and that sort of thing. For the parts of the dataset that have a reasonable head, things that are less supervision heavy might be useful, so unsupervised learning, clustering, and even weak supervision can be reasonable.
So that's why we combine several different approaches under the hood. We try to be unopinionated about the type of data you bring to us, and we make it so that no matter what your dataset looks like, we'll help you label it in the most optimal way for that dataset. If it's very head heavy, we'll push you more towards the weak supervision side, for instance. If it's very tail heavy, we might push you more towards the active learning side and do some other clever things that combine that with weak supervision and so on. That's part of the reason why it's hard to say where Watchful is strictly not a good fit. If your dataset sits at one of those extremes, then it's probably not a great fit for Watchful, but if it's like most other datasets and it's somewhere in between, then great, you could probably use Watchful, and we'll help you label it in the most optimal way possible.
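One rough, illustrative way to gauge how head-heavy or tail-heavy a text dataset is (again assuming scikit-learn; this is not something prescribed in the conversation) is to cluster it and check how much of the data the biggest clusters cover:

```python
# Sketch of a head-vs-tail diagnostic: fraction of rows covered by the
# largest clusters. The cluster count and vectorizer are arbitrary choices.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

def head_mass(texts, n_clusters=20, top_k=3):
    X = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    sizes = np.sort(np.bincount(labels))[::-1]
    # High fraction -> big head of similar rows; low -> long tail of small groups.
    return sizes[:top_k].sum() / sizes.sum()
```

A high fraction suggests a big head where taxonomy mapping or unsupervised techniques may carry you a long way; a low fraction suggests a long tail where hand labels and more supervision-heavy techniques become unavoidable.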
[01:16:26] Unknown:
I think we've talked about it a little bit, but as you continue to build and iterate on Watchful, what are some of the things you have planned for the near to medium term, or any particular problem areas that you're excited to explore?
[01:16:37] Unknown:
So I think the obvious one is expansion into new data modalities. Our customers have been foaming at the mouth for us to release image support, so we wanna give them that. That's the obvious one. More broadly, I'll talk about our goal as a company, which is to build best in class software to help organizations solve their hardest problems using machine learning. We started with what we saw as the biggest bottleneck in today's process, but over time we're gonna expand to other parts of the stack that have historically had little investment of time or energy to do, quote, unquote, the right way. We want these pieces to fit together like LEGO blocks, as I mentioned. So it's not an end to end solution; it's more like a sequence of best in class tools that solve your machine learning problems.
We really wanna focus our efforts on the areas where we have unique insight, which is anything touching data. Because, again, we believe that, at least in the MLOps world, most of the things touching code have already had some good thought put into them. What we wanna focus on are the areas that historically have not had nearly as much investment, which happens to be data. So you can expect more products, more solutions, that sort of thing from us. Anything that touches data in the machine learning space, we're gonna be all over.
[01:17:55] Unknown:
Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:18:10] Unknown:
It touches on explainability and data. We fundamentally believe that we're addressing a part of that problem right now, where raw data comes in, labeled data pops out, and we can explain that process. But there's also this lineage of data that flows through the process, where, again, AI is code plus data. Knowing what model was trained with what data, why that model is performing in a particular way, what explanations we have for that performance, and what data backs up that performance: that entire connection should be there, but it requires an integration between downstream monitoring systems and upstream labeling processes, and that has historically never existed. So it's not one single component that we feel is missing. We feel that a connection is missing between the ways we're interpreting how our models are performing downstream and the way those models even come to be. We think that by drawing that line and making that feedback loop really tight, we will accelerate the process of machine learning and AI uniformly across every organization.
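As a minimal sketch of the kind of connection being described, a training-lineage record might tie a deployed model to the exact data snapshot and heuristic set that produced its training labels. Everything here, including the field names, is hypothetical and not drawn from any particular tool.

```python
# Hypothetical lineage record linking a model to its training inputs.
from dataclasses import dataclass, field
from datetime import datetime, timezone
import hashlib

def fingerprint(path):
    """Content hash of a data snapshot so training inputs are verifiable later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

@dataclass
class TrainingLineage:
    model_id: str
    dataset_fingerprint: str
    heuristics_version: str  # e.g. a git commit or project-file hash
    metrics: dict = field(default_factory=dict)
    created_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# lineage = TrainingLineage("intent-clf-v7", fingerprint("snapshot.jsonl"), "a1b2c3d", {"f1": 0.91})
```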
[01:19:13] Unknown:
So, yeah, that's at least my take. Alright. Well, thank you very much for taking the time today to join me and share the work that you're doing at Watchful. It's definitely a very interesting product and an interesting problem domain, and it's always great to catch up on the developments in that space. So thank you for taking the time today to join me and for all the energy that you and your team are putting into addressing a significant problem in the ecosystem. I appreciate your time, and I hope you enjoy the rest of your day. Thank you so much, Tobias, for having me. This was a lot of fun.
[01:19:48] Unknown:
Thank you for listening. Don't forget to check out our other shows, Podcast.__init__, which covers the Python language, its community, and the innovative ways it is being used, and the Machine Learning Podcast, which helps you go from idea to production with machine learning. Visit the site at dataengineeringpodcast.com. Subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a product from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers.
Introduction to Watchful and Guest Shayan Mohanty
Challenges in Data Labeling and Machine Learning
Overview of Watchful's Product and Mission
Market Landscape and Watchful's Position
Comparison with Snorkel AI
Data Engineer vs Data Scientist Responsibilities
Implementation and Technical Stack of Watchful
Types of Data and Labeling Supported by Watchful
Collaboration and Consistency in Data Labeling
Handling Streaming Data and Real-Time Labeling
Interesting Use Cases and Customer Stories
Lessons Learned and Market Insights
Future Plans and Expansion