Summary
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, and the rest of an organization's sources of knowledge are housed in so-called "dark data" sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value from that data by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can help democratize machine learning by making it feasible for organizations whose data sets are smaller than those required by most tooling.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
- When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- You can help support the show by checking out the Patreon page which is linked from the site.
- To help other people find the show you can leave a review on iTunes or Google Play Music, and tell your friends and co-workers
- Your host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel and Dark Data
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by sharing your definition of dark data and how Snorkel helps to extract value from it?
- What are some of the most challenging aspects of building labeling functions, and what tools or techniques are available to verify their validity and effectiveness in producing accurate outcomes?
- Can you provide some examples of how Snorkel can be used to build useful models in production contexts for companies or problem domains where data collection is difficult to do at large scale?
- For someone who wants to use Snorkel, what are the steps involved in processing the source data, and what tooling or systems are necessary to analyze the outputs for generating usable insights?
- How is Snorkel architected and how has the design evolved over its lifetime?
- What are some situations where Snorkel would be poorly suited for use?
- What are some of the most interesting applications of Snorkel that you are aware of?
- What are some of the other projects that you and your group are working on that interact with Snorkel?
- What are some of the features or improvements that you have planned for future releases of Snorkel?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Stanford
- DAWN
- HazyResearch
- Snorkel
- Christopher Ré
- Dark Data
- DARPA
- Memex
- Training Data
- FDA
- ImageNet
- National Library of Medicine
- Empirical Studies of Conflict
- Data Augmentation
- PyTorch
- TensorFlow
- Generative Model
- Discriminative Model
- Weak Supervision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to launch your next project, you'll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page, which is linked from the site. To help other people find the show, you can leave a review on iTunes or Google Play Music, tell your friends and coworkers, and share it on social media.
Your host is Tobias Macey, and today I'm interviewing Alex Ratner about Snorkel and dark data. So, Alex, could you start by introducing yourself?
[00:01:00] Unknown:
Sure. And again, Tobias, thanks for having me on the show. I'm a fourth-year PhD student at Stanford. I work with Chris Ré and the DAWN project at Stanford, where our goal is, in a sense, to democratize ML. The particular project that I work on with others in the lab, which we call Snorkel, is about making it easier to generate training data for today's complex machine learning models. One of the big sea changes that we've seen in the field is the rise of approaches like deep learning, which are very complex models that do a great job empirically of learning how to represent data in a way that leads to very good performance on tasks that are traditionally really tough, like understanding text or images. But the trade-off, under the no-free-lunch principle, is that they require much more labeled training data to learn from. So with the Snorkel project, our big high-level technical question is: can we solicit information from domain experts to solve problems like extracting facts from dark data, which we'll get into, using higher-level information that might be noisier but that's cheaper for the domain experts to provide?
[00:02:06] Unknown:
And how did you first get interested and involved in the area of data management and working with machine learning?
[00:02:16] Unknown:
Yeah. Well, personally, I had a job after college that was actually more in finance, but I was working a bit with the patent corpus. I stumbled upon it, and I thought it was incredibly cool that in a couple of terabytes of text data that I could download onto a thumb drive, you had everything that anyone had thought was worth patenting. The problem was that while it was incredibly accessible, you couldn't do anything with it. Extracting an actually usable piece of knowledge from it was beyond the current limits of the field, and in some ways it still is, although we're trying to push that boundary. So I looked into how people try to pull useful information out of text data, or other, as we call it, dark data that's traditionally difficult for computers to process. That led me to the fields of machine learning and natural language processing. From there, the big question became: how do you get the training data to teach these machine learning models in a practical way? That led to this project. For the lab more broadly, my adviser, Chris, and others started working on a DARPA-funded project called Memex, which was primarily focused on pulling information off of the dark web to help fight human trafficking. A predecessor system to Snorkel was one of the big tools used to process that data, and I think that led to a lot of the lab's current interest.
[00:03:35] Unknown:
Just as a side note, it's interesting that this came out of Memex, because it's the second or third time I've heard that project mentioned. One of those times was from a group of people working at Continuum Analytics.
[00:03:48] Unknown:
Yeah, Memex is really an exciting project. I don't know whether this is an aside or an answer, but the work that Chris led, which started before I joined the lab and that I later helped out with a bit, was one piece of the pipeline. There were parts upstream of us where they pulled in the data from the dark web, and then our job was to take a mess of text data and turn it into something that looks more like a spreadsheet, which could then be used by downstream components to visualize, make predictions, and that sort of thing. So it's a really cool tech stack that was started by that DARPA program.
[00:04:28] Unknown:
Definitely. And so you briefly touched on the idea of dark data and gave a cursory definition of what it is. But I'm wondering if you can define in a bit more detail what the dark data problem actually is and how Snorkel helps to alleviate some of the issues associated with dealing with that type of data source.
[00:04:51] Unknown:
Great. Yeah. So dark data is also known simply as unstructured data, and that's a nice term for making a simple distinction. A lot of the data that we use is structured, so you could think of a database, an Excel spreadsheet, or a graph as all roughly equivalent: something that has a known schema. This is the kind of data that's really easy for modern computing technologies to leverage, whether you want to visualize it or make predictions based on it. We have a very extensive tech stack built on top of taking in nicely structured data. But if you actually look by volume, most of the data in the world is unstructured.
This includes text data, all the semi-structured data stored as HTML on the web, images, video, time series, all the kinds of data that are traditionally easy for humans to process slowly. Hypothetically, if a machine could process it, it could do so much more broadly and at scale, but it's very tough for a machine to do. Generally, machine learning is the technique that currently does best at getting access to dark data. So, for example, if we have a bunch of text documents from the scientific literature and we want to pull out specific facts about certain chemical interactions, the best solution today would be to use a machine learning model, in particular a deep learning model, to do this information extraction.
However, the problem is that this kind of machine learning, for all the strides it has made lately, requires a ton of hand-labeled examples, what we call training data. Training data is just a bunch of examples labeled with the ground-truth answer that a machine learning model learns from. So, for example, if you're doing that chemical-chemical interaction extraction, you'd have a human scientist go over all of these documents and highlight where two chemicals were mentioned as interacting, or whatever other thing you're trying to pull out, and then you train the model to be able to do similar extractions. This is obviously a very expensive process, and as the models that people use have gotten more and more complex, the amount of labeled training data that's needed has grown exponentially. So the idea with Snorkel is that, instead of asking for thousands and thousands of nicely labeled examples, we try to get higher-level but noisier inputs, like rules or patterns or noisy data sources, and see if we can use those to train these machine learning models. It turns out that in a lot of instances we can. So instead of having months go into hand-labeling a dataset to train a model, you can have a domain expert sit down and, in hours or days, write a bunch of labeling functions and get the same performance.
[00:07:31] Unknown:
And as you mentioned, one of the main requirements for Snorkel to be able to operate on these datasets is to build labeling functions that take advantage of that domain knowledge. So I'm wondering if you can talk about some of the more challenging aspects of building those types of functions in such a way that they are verifiably accurate and effective for producing usable outcomes, and whether there are any tools or techniques to assist in that verification process.
[00:08:01] Unknown:
Yeah, that's a great question, and in some ways that's the whole name of the game. It's been a long-standing goal in the field of AI to make it easier for domain experts to inject their knowledge into a system. The goal in Snorkel is to make it very easy to inject this knowledge, use it to train a machine learning model, and thus leverage all the power of those techniques. The main technical component of Snorkel has to do with automatically estimating the accuracy of a bunch of labeling functions without ground-truth data. Intuitively, if you provide a set of labeling functions, say a couple dozen of them, Snorkel applies them, ideally to a large amount of unlabeled data, and then looks at where the labeling functions agree and disagree.
Based on that, and with some other very loose statistical conditions, it can estimate the accuracies of the labeling functions and then weight them accordingly. This isn't perfect, obviously. Writing a labeling function involves encoding domain expertise into a rule or a pattern or something similar; concretely, a labeling function is just a Python function. So it still takes time to write them, and it's an iterative process, but then Snorkel automatically models their noisiness and outputs a denoised set of labels, which can then be used to train some end machine learning model.
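To make the mechanics concrete, here is a minimal, hypothetical sketch of the core loop: apply a set of labeling functions to unlabeled examples to build a label matrix, then combine their votes. The function names and label conventions are illustrative rather than Snorkel's actual API, and the naive majority vote at the end is only a baseline standing in for the generative model Snorkel uses to estimate each function's accuracy from its agreements and disagreements.

```python
from collections import Counter

# Label conventions are an assumption for this sketch.
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1

def apply_lfs(labeling_functions, examples):
    """Build the label matrix: one row per example, one column per labeling function."""
    return [[lf(x) for lf in labeling_functions] for x in examples]

def majority_vote(label_matrix):
    """Naive baseline: take the most common non-abstain vote per example."""
    labels = []
    for row in label_matrix:
        votes = Counter(v for v in row if v != ABSTAIN)
        labels.append(votes.most_common(1)[0][0] if votes else ABSTAIN)
    return labels
```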
[00:09:25] Unknown:
And when I was reading through the documentation, it looked like the primary interface for working with Snorkel was via a Jupyter notebook. I'm wondering what the typical workflow looks like for somebody who wants to take advantage of Snorkel, build those labeling functions, and run them against a dataset.
[00:09:48] Unknown:
Yeah, that's a great question. We have a bunch of tutorials online with examples of labeling functions. But suppose you were approaching the chemical interaction extraction task I mentioned, which is actually a good proxy for a current project of ours with some researchers at the Food and Drug Administration and others at Stanford. To start, a lot of the tooling is built around this Jupyter notebook interface, which we find works well, especially for our collaborators who are using Snorkel. You might begin by looking at a small set of data, perhaps labeling some of it, just to get a sense of the data. Then the goal is to write a set of these labeling functions, which are just Python functions that take in a data point and say yes, no, or abstain, for example.
So for this chemical-chemical extraction task, suppose we have all of the chemicals already annotated using an upstream tool. You might write a simple labeling function that takes in a pair of chemical mentions in a document and just looks to see whether the word "causes" appears between them. If the word "causes" doesn't appear, it abstains. If the word "causes" appears, it says yes, this is a valid chemical-chemical interaction mention. And maybe if the phrase "not causes" appears, it says no, this is not a chemical interaction. That's just a couple of lines of Python, and that's really all there is to it. The user would write a couple of these labeling functions and then run Snorkel on them. Snorkel would learn to weight the labeling functions based on its estimate of how accurate they all were, and then use the denoised outputs to train an end machine learning model that learns to generalize beyond the rules you wrote down as labeling functions.
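A rough sketch of the labeling function just described might look like the following, assuming a candidate object that holds two pre-annotated chemical mentions and a helper that returns the text between them; these names are illustrative, not Snorkel's exact API.

```python
# Label conventions assumed for this sketch.
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0

def lf_causes(candidate):
    """Label a pair of chemical mentions based on the text between them."""
    between = candidate.text_between().lower()  # hypothetical helper
    if "not causes" in between:   # check the negated phrase first
        return NEGATIVE
    if "causes" in between:
        return POSITIVE
    return ABSTAIN                # no signal either way
```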
And then, of course, it's an iterative process in that you can always tweak the labeling functions and develop them further, and Snorkel's estimates of how accurate the labeling functions are give you some feedback on which ones to work on. One other interesting advantage of this kind of approach to training a machine learning model, which I'll mention because it comes up a lot with our collaborators, is that you are not locked into a particular extraction schema. In general, if you're doing a supervised learning problem, you need to decide ahead of time how to specify the problem. Say, having made that decision, you decide that these things are valid chemical interactions and those things are not, so you'll discount them. Then you go down the traditional machine learning route that people follow a lot in practice, where you pay some scientists to spend a couple of months labeling training data. And when you actually get the extracted results, you realize that your problem specification wasn't quite right: you actually don't want chemical reaction type A because it's not useful for whatever you're using this data for, and you do want to include chemical reaction type B, which you excluded.
The only answer in a traditional machine learning pipeline would be to throw out your training data and start all over again, which is a hugely expensive and impractical proposition. The idea with Snorkel is that you can just go back to your labeling functions and tweak them until you've corrected the specification. So this allows much more flexible iteration on actually using a machine learning model in practice.
[00:13:13] Unknown:
And I can imagine that being useful particularly for industries where it's difficult to collect large volumes of data, or where the data you are collecting is difficult to work with in some of the traditional machine learning contexts, such as the large raster files you might have in meteorology, or satellite imagery that you're using in an agricultural context. I'm wondering if Snorkel would be useful for data engineers and data scientists to then take that source data and make use of it within a machine learning context where it might otherwise be difficult or impossible to gain valuable insights from it.
[00:13:56] Unknown:
Yeah, that's a great question. At a high level, the ingredients needed to actually use Snorkel are, first, ideally a large amount of unlabeled data. So say you are trying to classify some satellite images: you need a large quantity of those images, even if they're not labeled. And then you need some way of writing the labeling functions. For text, as I was giving some examples of, you might write some regular expression patterns, or you might use some external databases of chemical interactions that you found online, which may or may not be reliable; you just throw them into Snorkel. For images, one of my lab mates is currently doing a lot of cool work showing that if you have some features of the image that are pulled out by other algorithms, you can write labeling functions over those features and provide supervision that way. Another interesting route that we've been taking with some of our collaborators in the radiology department at Stanford, which has actually worked quite well, is that if you have some metadata that you can write labeling functions over, this can work really well too. For example, in one of our recent projects, we had a bunch of chest X-ray images that we wanted to classify. Specifically, we wanted to train a machine learning model to classify them as, say, benign or malignant. And we also had a bunch of the unstructured text reports that the radiologists had dictated.
Writing labeling functions over the images is kind of hard, but we could very easily write a couple dozen labeling functions over the text reports, looking for certain phrases and patterns and so on. All of the labeling functions were noisy, meaning none of them was higher in accuracy than, say, 70 or 80 percent. But by feeding them into Snorkel, we could generate much higher quality training data, which we then used to train a machine learning model that did only a few points worse than a similar machine learning model trained with 100,000 labeled data points that had been collected over a period of several years. So it hopefully offers a much more practical and efficient way of using machine learning than the standard one of labeling a massive training set and then applying the model.
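An illustrative version of that cross-modal trick might look like the sketch below: the labeling function reads only the dictated text report, but the label it produces is attached to the paired X-ray image, so the image model never needs hand labels. The phrases and the pair structure here are hypothetical stand-ins, not the actual functions from the Stanford project.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0  # assumed conventions

# Hypothetical phrases a radiologist might dictate.
MALIGNANT_PHRASES = ("suspicious for malignancy", "spiculated mass")
BENIGN_PHRASES = ("no acute findings", "benign appearance")

def lf_report_phrases(pair):
    """Label an (image, report) pair by reading only the report text.

    The resulting noisy label is used to train an image-only model,
    which at prediction time never sees the report.
    """
    text = pair["report"].lower()
    if any(p in text for p in MALIGNANT_PHRASES):
        return POSITIVE
    if any(p in text for p in BENIGN_PHRASES):
        return NEGATIVE
    return ABSTAIN
```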
[00:16:20] Unknown:
And I imagine that because you can more rapidly generate labeled training data, whether or not it's as accurate as what you might get from some of these larger datasets, and can then get usable models from a smaller data sample, it would also allow many more types of organizations and companies to actually leverage machine learning capabilities. They can obtain data sources in a much more affordable fashion than is typically required for some of the machine learning contexts that are largely only accessible to larger tech companies or larger institutions.
[00:17:04] Unknown:
Yeah, that's a great point, and it really is the foundation of our motivation here. Going back to the DAWN lab at Stanford, as mentioned, the goal is democratizing ML. As far as Snorkel goes, assembling training data by hand is a hugely slow and expensive proposition. Even if you don't factor in what I was talking about before, getting locked into a particular schema or problem specification, just doing one round of training data annotation is hugely impractical. And if you look at the datasets on which machine learning has lately made some of its most impressive gains, they took years to assemble.
For example, there's a labeled training dataset called ImageNet, which has been one of the big success cases of modern deep learning: recognizing cats versus dogs versus hot dogs and hundreds of other image categories. But again, it took years to plan, assemble, and annotate this labeled dataset, and it didn't even require experts; it didn't require doctors or anyone with domain expertise to label. So if you're talking about a problem that requires, say, trained radiologists to label, this is an even more expensive and massive proposition. Again, the hope with Snorkel is that by taking in training signal at a higher level and in a noisier form, and still being able to use it to train high-quality machine learning models, we can make it much easier for smaller organizations, research labs, or individuals to use state-of-the-art machine learning.
[00:18:47] Unknown:
And how is Snorkel built and architected in order to be able to take advantage of those noisier datasets and produce usable outcomes? And how has that design evolved over the course of its lifetime?
[00:19:02] Unknown:
Yeah. So in the core architecture of Snorkel, there's a central database that manages all the bits of input and the labels that get attached to them, and we've done a lot of work on making it usable with Jupyter notebooks, along with some of the preprocessing tools for text and other data. Then on the machine learning and algorithm side, Snorkel basically has two stages. First there's what's called a generative model, which learns, as I mentioned before, to estimate how accurate the labeling functions are. So if you put in 30 labeling functions, and some of them are really precise and accurate and some of them are barely better than random (in general we assume they're non-adversarial, meaning better than random), then Snorkel can learn, without labeled data, just by looking at how they agree and disagree with each other, to come up with a model of their accuracies. It can then weight their labels accordingly. The second stage is any standard machine learning model that you plug in. The difference now is that this end machine learning model is trained with probabilistic labels output by Snorkel rather than, say, hard yes/no or multi-category labels, as would be standard. As this has evolved, we've tried to identify certain aspects of this new paradigm for training machine learning models that could be automated or improved to make it easier for the user. One example of a recent success is that my lab mate Stephen Bach led a project on learning the structure of correlations between labeling functions.
You can imagine intuitively that if you wrote two labeling functions that were exact duplicates of each other, this could cause problems. If you had two copies of the same labeling function but Snorkel didn't know that, you'd get a double-counting problem: because they would always agree with each other, Snorkel would think they were more accurate than they might be. Or, as has actually happened, you have a couple of different engineers writing labeling functions, and they end up writing functions that overlap heavily and are essentially expressing the same pattern or the same data source. This again can cause issues with Snorkel's estimation of the accuracies of those labeling functions. So we added a module to Snorkel, based on a cool bit of machine learning work, that automatically learns to detect correlations and account for them in Snorkel's model. We've been doing those kinds of additions. That's been a lot of our work on the research side, but it's a call and response: we put Snorkel out into the open source, we work with collaborators, we see what causes them the most headaches, and then we see if we can come up with a solution on the research side and ideally port that back into Snorkel.
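To illustrate what training on probabilistic labels means in the second stage, here is a minimal sketch in plain NumPy: instead of one-hot targets, the end model minimizes a cross-entropy against the soft label distributions the generative model emits. Any discriminative model and framework could play this role; the function below is illustrative, not part of Snorkel's API.

```python
import numpy as np

def soft_cross_entropy(pred_probs, soft_labels, eps=1e-9):
    """Cross-entropy against probabilistic (soft) targets.

    pred_probs:  (n, k) array of model output probabilities
    soft_labels: (n, k) array of probabilistic labels; each row sums to 1
    """
    return -np.mean(np.sum(soft_labels * np.log(pred_probs + eps), axis=1))

# With hard labels, each row of soft_labels would be one-hot; probabilistic
# labels simply spread that mass, e.g. [0.8, 0.2] for an uncertain example.
```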
[00:21:44] Unknown:
And have you seen a lot of adoption in industry for snorkel and the type of problem domain that it's trying to address?
[00:21:52] Unknown:
Yeah. So we we've had a lot of, great industry collaborators who have come by. We have a weekly office hours, so we have a lot of, interaction with some companies, that are up on the, Snorkel website at snorkel.stanford.edu, but including, accenture and Toshiba, Alibaba, a bunch of others. And so that interest has been been very exciting, and we've gotten, you know, to see at a high level, you know, how they use Snorkel and and, and how it went for them. And then a lot of our interaction that we've been sort of fully hands on with has been with academic collaborators at Stanford. So a lot of people in the, in the biomedical informatic space, working with, people at Stanford Hospital and at the, VA here.
So, yeah, it's been really exciting, and we've learned a lot from those collaborations.
[00:22:41] Unknown:
And what are some of the situations where Snorkel would be a poor fit for the problem case that somebody might be investigating or evaluating it for?
[00:22:43] Unknown:
Yeah, it's a great question. I guess there are three obvious answers, and then plenty of more nuanced technical ones. First of all, the motivation for Snorkel is training really complex models. For example, deep learning models, which are the exciting and most empirically popular set of machine learning models for a lot of use cases, in particular text and images, have up to hundreds of millions of trainable parameters, so they're hugely complex on any metric of complexity, and therefore they need a commensurate amount of labeled training data. Now, hypothetically, if you had a really simple model, like logistic regression over 20 features, you wouldn't need that much labeled training data, and therefore you might just want to label it by hand and not go down the Snorkel route at all. So that's one boundary on where Snorkel is intended to be useful.
Another thing is that Snorkel learns how to weight the accuracies of the labeling functions using no labeled data, but it needs a lot of unlabeled data to apply the labeling functions to. Snorkel is mainly motivated by situations, which from our experience are very common, where it's easy and cheap to get access to unlabeled data, and it's just difficult and expensive to label it for your particular classification task. And then finally, using Snorkel requires you to write these labeling functions; it requires you to have some kind of noisy signal. Again, this noisy signal could be labels from crowd workers.
That case is nicely subsumed by our system. It could be patterns over text. It could be other databases or knowledge bases that you found somewhere that might be really noisy or inaccurate for your task but provide some signal. But there needs to be something there. The biomedical domain is a great example of this, because the National Library of Medicine, for example, has built all these ontologies and data resources. For a given task that we might approach, say you want to pull out a certain type of chemical-chemical interaction from the scientific literature, chances are there's not going to be a perfect fit with any of those datasets. But using Snorkel, you can kind of use them all to give signal to train the model, if that makes sense.
[00:25:06] Unknown:
Yeah, those are great answers. And what are some of the most interesting applications of Snorkel that you're aware of or have been involved in?
[00:25:18] Unknown:
Well, there are multiple dimensions of interestingness. In terms of novelty, at least from where I sit, we're doing some work with a project at Stanford called the Empirical Studies of Conflict. It's actually run by a guy named Joe Felter, who's now an assistant secretary of defense at the Pentagon, but he did his PhD studying insurgency conflict in the Philippines. So we have this very unique dataset of combat reports, and we're trying to use Snorkel to pull out specific attributes that international policy and conflict researchers can then study. That's a standard template: we have a messy dataset, and we have a bunch of collaborators who have all these exciting research questions they want to answer, which could even involve downstream machine learning problems, but they need the data in a usable format. If we can just do that, which we often can with Snorkel, then they're off to the races doing all kinds of exciting stuff.
Another cool instance is the radiology project I mentioned, where we can take those text reports and actually use them to provide training signal that is nearly as good, just a few points away in terms of end performance, as these massive labeled datasets that hundreds of thousands of dollars or more were spent to annotate. So that's really exciting to us in terms of, again, democratizing machine learning and making it easier and cheaper to apply.
[00:26:55] Unknown:
And when I was reading through the documentation for Snorkel and some of the other information you have available on your website, I noticed that you appear to have built some additional tools that either use Snorkel as part of their functionality or feed into or off of Snorkel. So I'm wondering if you can talk a bit about some of the other work you're doing where Snorkel has been of benefit within your group.
[00:27:22] Unknown:
Yeah, that's a great question. I'll do some shameless advertising for my lab mates. There are a number of cool projects that are either built on top of Snorkel or built proximate to Snorkel, and for a lot of them we have blog posts up at snorkel.stanford.edu; we've dumped a lot of the lab's material onto that page. I can name a couple. There's a project called Fonduer that my lab mate Sam works on, which is built on top of Snorkel and is aimed at handling what they call richly formatted data. You can think of text or even images as purely unstructured data, and of, say, an Excel spreadsheet where you know what the columns are as a nice format of structured data that is really easy to use, which is the end goal of a lot of our Snorkel deployments. There's a lot of data that falls into a middle category, where there's some structure, but it has all these different variations and you don't really know what the structure is ahead of time.
Examples of this include HTML, where there's clearly structure but there's a long tail of all different types of formats, and PDF documents that have some structure but where there are a million and one ways to lay out any given form or document, and stuff like that. Fonduer is aimed at that. There's also a really cool project my lab mate Paroma works on called Coral, which is trying to extend writing labeling functions to images, which I mentioned before. It's a lot harder to write labeling functions over images than over text. But if you have some features that were extracted using upstream algorithms, say, bounding boxes around objects, then you can write labeling functions over those. Say you want to train a machine learning model to detect whether an image shows someone riding a bike, and you have bounding boxes around people and bikes. You could then write a labeling function over that saying: if the person is vertically positioned above the bike, then say yes.
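A toy version of that "person above bike" labeling function might look like the sketch below, written over bounding boxes from a hypothetical upstream object detector; the box format and names are assumptions for illustration, not Coral's actual interface.

```python
POSITIVE, ABSTAIN = 1, 0  # assumed conventions

def lf_person_above_bike(person_box, bike_box):
    """Boxes are (x_min, y_min, x_max, y_max) in image coordinates,
    where smaller y means higher in the frame."""
    person_center_y = (person_box[1] + person_box[3]) / 2
    bike_center_y = (bike_box[1] + bike_box[3]) / 2
    if person_center_y < bike_center_y:  # person sits higher than the bike
        return POSITIVE
    return ABSTAIN
```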
This adds some unique complexities to the problem, which Coral takes care of. There's also a project called Babble Labble, by my lab mate Braden, which asks whether, instead of writing labeling functions as little snippets of Python code, you could just describe them in natural language and then parse that into a bunch of labeling functions. It's very cool, and it works surprisingly well. I think the trick is that labeling functions, the way we expect them to come into Snorkel, are expected to be noisy, so even though parsing the natural language that you speak or type in produces a larger set of even noisier labeling functions, Snorkel can still handle them.
And then finally, we've done some work on another trick that people use a lot in practice, heuristically, to make up for a lack of labeled training data, called data augmentation, and we view this as another way of having domain experts inject their knowledge. For example, you might augment your training set by making copies of the images but rotating them all slightly or blurring them all slightly, because you know that rotations or blurs shouldn't really change anything for your end classification problem. We have some software for automating that process as well, which can be used in concert with Snorkel to generate even more training data.
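As a minimal sketch of that augmentation idea, using the Pillow imaging library for illustration (the ranges here are arbitrary assumptions), each transformation encodes the domain knowledge that small rotations and blurs should not change an image's label.

```python
import random
from PIL import Image, ImageFilter

def augment(image: Image.Image, copies: int = 4):
    """Yield label-preserving variants of a training image."""
    for _ in range(copies):
        angle = random.uniform(-10, 10)    # slight rotation
        radius = random.uniform(0.0, 1.5)  # slight blur
        yield image.rotate(angle).filter(ImageFilter.GaussianBlur(radius))
```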
[00:30:41] Unknown:
And in your experience, what have been some of the most challenging aspects of building and working with Snorkel and the problem sets that are associated with it?
[00:30:58] Unknown:
You know, there are some domain-specific problems. Unstructured data is very messy to deal with, and there's a high bar to assembling the upstream tooling before you can even get to the point where you apply the core technical bits of Snorkel or apply a machine learning model. Say you want to do that chemical-chemical extraction I mentioned: you need to do all the parsing and preprocessing before you even get to the point where you can write a labeling function looking for the word "causes". So that was certainly one interesting aspect. And then a lot of the things that have been challenges are now either completed or ongoing research projects. I mentioned the problem we noticed of having correlated labeling functions and how that messed up the model in Snorkel; we at least have a pretty decent solution to that now, and ditto with problems like how you write labeling functions over images, or how you write labeling functions if you don't know how to code.
[00:31:53] Unknown:
Are there any particular features or improvements that you have planned for future releases of Snorkel?
[00:31:57] Unknown:
Well, there are a number of things on the software engineering side that I'm a little hesitant to say on record, because then my adviser is going to take a lot of joy in brandishing them; he most certainly will. For example, both my adviser, Chris, and I are very excited about a specific deep learning framework called PyTorch these days. Right now Snorkel works with TensorFlow, and we want to extend it to PyTorch as well. There's a lot of engineering work like that on the docket. Also, I mentioned there are two modeling stages of Snorkel: the generative model, which learns the accuracy of the labeling functions, and the discriminative model, which makes the end predictions you're interested in. There's a way to train them jointly that would greatly simplify the architecture, and we want to put that in. One other interesting thing I'm working on now is extending Snorkel to better handle the setting of what's called multitask learning, which is an exciting area that a lot of people in machine learning are exploring these days. The intuition behind it is that even though it might seem that training more models at the same time would just be harder, if these models actually share information about how they're learning to represent the world, it can actually be easier, and you can do better training a machine learning model to do multiple things at the same time rather than training it to do one narrow thing. So we're very interested in exploring this in the context of weak supervision and the kinds of inputs that Snorkel uses.
[00:33:25] Unknown:
Are there any other topics or areas of discussion that we didn't cover yet which you think we should before we close out the show?
[00:33:33] Unknown:
I think that's a good covering of topics. I guess I can add that, from our perspective, we coined the cheesy phrase that if data is the new oil, labeled training data is the new new oil. If you go out and talk with people trying to use machine learning models, especially the newest and fanciest, say, deep learning models, as we've done with a lot of our collaborators while developing Snorkel, then especially in the last couple of years, you don't see a lot of people who are concerned about some particular property of the algorithm, or even, in the last five years or so, who are struggling to engineer features for their model. Really, the bottleneck people run into is that it's tough to get training data: it's expensive, it requires a lot of time, and it locks you into a particular problem specification or schema, as I mentioned. So again, that's our highest-level motivation for Snorkel: to make it easier to overcome this bottleneck when you don't have the resources to pay an entire building of people to label your training data for you. Alright. Well, for anybody who wants to follow the work that you're up to, get in touch, ask any follow-up questions, or get involved in the work you're doing, I'll have you add your preferred contact information to the show notes. Oh, I was just going to pitch that Snorkel's all open source. It's on GitHub, and we love getting issues. We actually like getting complaints more than positive feedback, because it gives us stuff to work on and improve. So definitely feel free to check that out.
[00:35:07] Unknown:
And as a final parting question to give people something else to think about, I'm wondering if you can share your perspective on what the biggest gap is in the available tooling or technology for data management today.
[00:35:22] Unknown:
Yeah, that's a great question. I'm very biased, obviously, but I think that machine learning is quickly becoming an essential tool in the toolkit of a lot of people in data management, whether it's for doing dark data extraction from text or images, like I mentioned, or for using machine learning to optimize, say, traditional data management stacks. So I think there's a lot of exciting work, and we hope that Snorkel falls into that category, but I think there's a lot more to come in terms of systems built specifically for machine learning.
And this is everything from how you get training data for your models, as we're trying to do with Snorkel, to how you efficiently serve the models and actually run them, say, even on the edge on smaller devices; how you debug them; how you make them more interpretable; and how you, in general, program them like you would any other part of your data management tool stack. So I think that intersection of machine learning and systems development is going to be, at a really high level, one of the most exciting areas in the next couple of years.
[00:36:40] Unknown:
Well, I appreciate you taking the time out of your day to join me and talk about the work you're doing with Snorkel, weak supervision, and dark data extraction. It's definitely a fascinating area of research and one where I'm happy to see progress being made. So I appreciate you taking the time, and I hope you enjoy the rest of your day.
[00:37:01] Unknown:
Thanks so much for having me on the show. It was a pleasure to chat.
Introduction and Host Welcome
Guest Introduction: Alex Ratner
Overview of Snorkel and Dark Data
Alex's Journey into Data Management
Defining Dark Data
Challenges and Solutions with Dark Data
Building and Verifying Labeling Functions
Workflow with Snorkel
Applications in Various Industries
Democratizing Machine Learning
Snorkel's Architecture and Evolution
Adoption and Collaborations
When Snorkel is Not a Good Fit
Interesting Applications of Snorkel
Additional Tools and Projects
Challenges in Building Snorkel
Future Improvements
Closing Thoughts and Contact Information