Summary
Successful machine learning and artificial intelligence projects require large volumes of properly labelled data. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides a valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customers' existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for machine learning and more
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what CloudFactory is and the story behind it?
- What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
- What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
- Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
- What protocols do you have in place to ensure data quality and identify potential sources of bias?
- What role do humans play in the lifecycle for AI and ML projects?
- I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
- How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
- Can you share some stories of cloud workers who have benefited from their experience working with your company?
- What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
- What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
- What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
- What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
- How does that tie into your plans for CloudFactory in the medium to long term?
Contact Info
- @marktsears on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- CloudFactory
- Reading, UK
- Nepal
- Kenya
- Ruby on Rails
- Kathmandu
- Natural Language Processing (NLP)
- Computer Vision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or you want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI pipelines, they just launched dedicated CPU instances. In addition to that, they just launched a new data center in Toronto, and they've got one opening in Mumbai at the end of 2019.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show. And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system that can keep up with you, one that's designed by software engineers for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page.
Podcast.__init__ listeners get two months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a free trial. And you can visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And if you have any questions, comments, or suggestions, I'd love to hear them. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:56] Unknown:
Your host is Tobias Macey, and today I'm interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for machine learning and more. So, Mark, can you start by introducing yourself? Absolutely, Tobias. Yes, I'm Mark Sears, founder and CEO at CloudFactory,
[00:02:10] Unknown:
and I'm talking to you today from Reading, UK, our global headquarters at CloudFactory. And do you remember how you first got involved in the area of data management? I definitely do. I'm a software guy, a computer scientist geek, like a lot of people who have found their way into the world of data, ML, and AI. My entry point was really mostly through CloudFactory, being passionate about software, passionate about people. We ended up building CloudFactory as an important piece. We're the picks and shovels, working on people's data in order for them to both train up machine learning algorithms as well as to augment and kind of fill the gap and insert humans in the loop at scale, where technology can't quite do it. So that's kind of how I found my way into the world of being passionate about data: first through being passionate about software and, actually, in people and creating opportunities for people that, as I'm sure we'll get into, are in places like Nepal and Kenya, where we've now got 5,000 people that we are trying to create meaningful work for, by caring for data and taking care and structuring and adding value to data for our clients.
[00:03:32] Unknown:
And so all of the interest in sort of data management, data quality issues, as well as what you're referring to in terms of enhancing quality of life for people in developing countries, has culminated in your work at CloudFactory. So can you start off by explaining a bit about what it is that you do at CloudFactory and some of the story behind it? Sure. Yeah. I'll come at it from maybe a bit of the origin, because like I said, we didn't really get in
[00:03:58] Unknown:
knowing that we would end up where we are today and kind of how we plug in to the ecosystem. Really what it was: 10 years ago, my wife and I took a two-week vacation to Nepal. It was on our bucket list, and we went. We had a fantastic time. And towards the end, we ended up meeting three other software developers, three young, smart Nepali developers, at a cafe, eating pizza and talking about software. And next thing I know, the trip got extended for three weeks, and I bought an iMac and started training them on Ruby on Rails, which I'd been working with way back when. We kind of went from vacation to three weeks of training, to then we got a project, so I extended it for three months. We got a flat in Kathmandu.
And next thing I know, that vacation turned into living in Nepal for six years. My wife and I had our two beautiful kids there, and kind of a third kid in CloudFactory, this startup that kind of emerged. And the thesis really was around that discovery of talent, and so talent being equally distributed around the world, but opportunity is not. And so that's really where CloudFactory started: with kind of identifying talent, training, and then building a technology platform to really give opportunities on both sides of that platform, to create meaningful work and connect people in the global economy in kind of the emerging economies, like I mentioned, Nepal and Kenya, where we operate today.
But then on the other side, for fast growing tech companies that need access to talent at scale to really create value from their data. And again, that's both to train up their models and then also to augment them and insert humans in the loop to fill the gaps. And so that's what we've been doing at CloudFactory for really probably the last eight years. We've been deep into the R&D programs of many, many fast growing small and large tech companies, over 200 of those companies now, really helping them to provide access to a scalable workforce for the purpose of data work. And so that's what we get to do. Right now, we are working on so many interesting data projects, AI projects, and we get to operate in some pretty fun locations, again, working with really talented people. And we do all of it in the middle with our own 60 engineers and data scientists who are building that technology platform to really make the whole thing work at scale, so that we can continue to kind of be the workforce of the future, to make everyone's data super fast, super easy, secure, fast turnaround times, fast times to market, all the things that everyone's looking for nowadays when they think about: how do I get the data I need to train my models and run my business? And
[00:07:12] Unknown:
in terms of feature extraction and data labeling, there are a lot of different dimensions that can occur in that workflow based on what type of data it is, whether it's for natural language processing or image or video, or if it's trying to add some contextual information to some of the raw source data. And I'm wondering what you have found to be some of the common requirements or needs in terms of that overall space from your customers
[00:07:42] Unknown:
and some of the different challenges that come along with those various categories of request. Well, yeah, you're right in the sense that it's very wide, and yet there are a lot of things that are in common. Most of our work is computer vision and NLP related. But, of course, within that, there's such a broad range of use cases. I mean, I am surprised every day at the innovative ways that we are tagging and labeling different things. Right? You know, for agritech and drone and satellite applications.
There's some fantastic biotech work tagging different cells, and, you know, I could go on and on. I'm shocked. Obviously, there are the very obvious ones like autonomous vehicles and video annotation for self-driving cars. There's a lot of work that we do kind of related to helping with intelligent extraction from documents. But I think what I'm most excited about is just how we see machine learning and deep learning being applied to so many different use cases in so many different industries in so many different ways, and getting to partner with those programs. Yeah, it's very wide, but it does fit into a pretty common taxonomy. At the very top, the way we think about it is people are coming to CloudFactory saying, again, I need someone to train the machine, right, to train up a model. So I need training data. They may also need help in kind of validating their model as well.
But essentially, it's on that side, it's on the AI side. And then we have a lot of people that also come to us that have technology in play. They've trained up models. They've got technology. But they're needing people to do anywhere from kind of a 1% gap to an 80 or 90% gap that the technology can't do. And so we're literally becoming that same workforce doing very similar data work to actually insert humans in that loop and kind of fill that gap. And so that could be, again, things like: we've got some intelligent AI powered data extraction for receipts or invoices, but we need you to review the 5% that get kicked out because they don't have high confidence.
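The receipt-review pattern described here, where predictions below a confidence threshold get kicked out to human reviewers, can be sketched roughly as follows. This is an illustrative Python sketch, not CloudFactory's actual system; all names and the 0.95 cutoff are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's confidence in [0.0, 1.0]

@dataclass
class Router:
    """Send high-confidence predictions through; queue the rest for humans."""
    threshold: float = 0.95
    auto_accepted: List[Prediction] = field(default_factory=list)
    human_review_queue: List[Prediction] = field(default_factory=list)

    def route(self, pred: Prediction) -> str:
        if pred.confidence >= self.threshold:
            self.auto_accepted.append(pred)
            return "auto"
        self.human_review_queue.append(pred)
        return "human"

router = Router(threshold=0.95)
for p in [Prediction("inv-001", "total=42.10", 0.99),
          Prediction("inv-002", "total=18.75", 0.62)]:
    router.route(p)
# inv-001 passes straight through; inv-002 is queued for human review
```

The same routing logic covers both extremes Mark describes: a tight threshold sends only a few percent of items to humans, while a loose model effectively sends 80 or 90 percent.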
Or it could be: we have 80% that needs to be actually done or reviewed by humans. And so kind of at that highest level, the common thing is: I need help with training data, or I need help with inserting humans in the loop to augment the technology. And then, of course, under AI, it's computer vision, it's natural language processing, and you can really continue to break that down. It really comes down to a lot of the same primitives of what we're doing to data. So we are labeling data, annotating data, categorizing data, scrubbing data. We are collecting data from the Internet or off PDFs or different sources, etcetera, etcetera. It's really a lot of different things, but it does come down to some pretty common primitives, no matter what the industry or the use case is, whether it's being done as part of building out a training dataset, or whether it's even real time, where we're working on data that's going back to their customers in a matter of minutes or seconds,
[00:11:15] Unknown:
turnaround. Some of the other challenges associated with providing this labeling and categorization are maintaining consistent taxonomies, particularly given that certain customers might have their own schemas or taxonomies that they're trying to adhere to, and ensuring compatibility with the different tools or libraries that they might be using. And so I'm wondering what your experience has been as far as how to provide useful integration points with their existing systems and ensure that the labeling techniques and the formats of those labels are consistent
[00:11:52] Unknown:
and usable by the systems that they're trying to use it within? Yeah. That is probably one of the biggest learnings that we've had over the last four years. The first few years, and really the first two versions of our technology platform, we tried to build the jet engine, where we just said, hey, we've got this amazing API where you can send in your data and we're gonna fire back to you perfect data, exactly what you want. And so we were a black box, kind of a full stack approach. And what we quickly realized is even though we had built a jet engine, we had built amazing tools, we built workflows, we had built everything that was needed.
We started to find out that the market was really having this strong desire to actually own. They wanted to own the data. They wanted to own the process. They wanted to own the workflow and the tooling, more and more. And they did not want to send their data to us. They did not want the work to be done necessarily in our tooling, in our workflow, on our platform, for a few different reasons. One is that a lot of companies consider the tooling and the process and workflow to be a part of their competitive advantage. If they're training up their models, doing feature engineering, they sometimes believe that some of the things they do and how they annotate is part of their competitive advantage that's gonna cause them to win in their particular market.
So for some it's competitive advantage; sometimes it's just plain visibility and control. They wanna have control over the tools and the workflows themselves, because they don't wanna be locked into one company like us. They wanna have maybe two or three vendors. They may want to have a small in-house team that's doing some of the work. And I think part of it, too, is that there are so many open source tools, and just the general ability to create this stuff faster than ever before, that there's a big preference to try and do most of that stuff in house. And so, kind of keep your data close. Yes, for GDPR and other data infosec and compliance reasons, but also just kinda keep it close because people do believe there are a lot of competitive advantages in how they treat their data. And so that's what we've done: we've built a solution that allows companies to get access and kind of bring this contingent, fluid workforce onto their tools.
And so kinda bring them onto their cloud, where their data is hosted, and to work the way that they work. That was kind of the big decision we made: you know what? We can try and build the best way to work and try and get people to work the way that we want them to work, but in the end, that wasn't what people were asking for. So that's what we do now: we work the way that our customers work, and that involves their tooling, their workflow, their process, and their very iterative instructions and business rules on how they want their data to be handled and prepared.
And that's sometimes changing daily, sometimes changing weekly. And so that agility and flexibility to work the way that they work has certainly become something that we've seen, the last three years especially,
[00:15:20] Unknown:
be absolutely the way the market's going. Yeah, that's interesting that you're having your workers actually accessing the source systems for your customers, as opposed to the other way around where your customers are sending you the data and then you're sending it back, as you described initially. And that's sort of the model that I've heard from other people who I've talked to in this space. And so I'm wondering what your experience has been in terms of managing ongoing training for your workers as far as being able to understand and interact with the different tooling and systems that your customers might be using,
[00:15:53] Unknown:
as well as just the overall onboarding process as new people are coming up, if you need to, say, scale the number of people who are working on a given task, and just ensuring that they're able to be productive in a short period of time? Yeah. So some of the things that we had to solve three or four years ago when we made this pretty big pivot in our business. Right? Like I said, we used to say, hey, send it to us via API and we'll black-box handle it; instead, it's: hey, we're gonna give you access to the world's workforce to come fluidly onto your tools and work the way that you wanna work. The first thing we had to do is we needed to create a work application that all of our now 5,000 cloud workers would use to actually access the clients' tools.
We call them cloud workers. So our cloud workers, every day, 24/7, they're all logging into our CloudWorker app, which is essentially, you can think of, an app that has an embedded browser in it that then actually directs them onto the client's tools. And so that can be a video annotation tool that the client has built themselves. It could be a custom data categorization tool that they built themselves. It could be an NLP text tagging tool that they bought off the shelf. It could be an open source tool that they've instantiated their own version of that we're using. It could be Google Spreadsheets, right? So essentially, our cloud workers are logging in to kind of our own browser, and they're being directed to the appropriate tool to work. And it's all, of course, based on them passing the qualifications and gaining the skills that allow them to get access to these different, what we call, work streams. So every client comes to us and they're essentially spinning up a work stream, which is a capacity of hours every month that they get access to this workforce to do work for them. And so for each of these work streams, we are collecting data through that browser that allows us to pair it with some of the data that we get from our clients' tools, and gives us the analytics and everything we need to guarantee that we are matching the right worker to the right task at the right time, and getting the results and managing the workforce towards the results that everyone's looking for in terms of things like throughput and, of course, quality and accuracy.
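The qualification gating described here, where a worker only gains access to a work stream after passing its required qualifications, could look something like the following. This is a minimal sketch under assumed names and structures, not CloudFactory's actual schema.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class WorkStream:
    name: str
    required_quals: Set[str]  # qualifications a worker must have passed

@dataclass
class Worker:
    name: str
    quals: Set[str]  # qualifications this worker has earned

def eligible_streams(worker: Worker, streams: List[WorkStream]) -> List[str]:
    """Names of the work streams this worker is qualified to join."""
    # A worker qualifies when the stream's requirements are a subset of
    # the qualifications they hold.
    return [s.name for s in streams if s.required_quals <= worker.quals]

streams = [
    WorkStream("video-annotation", {"bounding-boxes", "client-a-rules"}),
    WorkStream("receipt-extraction", {"ocr-review"}),
]
asha = Worker("Asha", {"bounding-boxes", "client-a-rules", "ocr-review"})
print(eligible_streams(asha, streams))  # qualified for both streams
```

Cross-training a worker on several streams, as Mark describes next, simply means their qualification set satisfies more than one stream's requirements.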
And so that fundamental change of our tech stack and how we do quality control and workforce management was one part of it. But certainly, from a training and an onboarding perspective, that was where we had a lot of improvement and innovation in the methodology. So for us, we have a kind of seed stage when we kick off a work stream with a new client, where it looks like daily sprints, where we're going back and forth, and it's a race to usable data. So how quickly can we get usable data to our clients? That is super important, and we're very, very aggressive about how we do that.
And those tight iterations and the feedback loops, and some of the tools and methodology that we've created around that, are really, really important, because we know that's what clients are looking for. They can't afford to wait for months before they know if they're gonna be able to get the quality of data and the throughput that they need in order to power their product development. So on the onboarding side, there's certainly a lot of art and science to that. And similarly on the training side, typically a cloud worker needs somewhere between three hours and probably two weeks of training in order to be productive on a new work stream. So there is training and onboarding on our kind of general process and tooling.
But then for a specific work stream, it's somewhere between three hours and, like I said, two weeks. And that's a blended learning model, where the majority of our cloud workers are working distributed, but many of them are also working in one of our offices, what we call delivery hubs, in Nepal and Kenya. And they all live within an hour's radius of one of those. So they're coming in. They might come in for a three-hour training session, or they may come in for an intensive two weeks of training, but everything's kind of in that window. And so typically, they'll be trained on anywhere from maybe two to five different work streams, and they'll be working on different work streams. And so it gives us and our clients a lot of elasticity when they're cross-trained on multiple work streams, but it's also great for them, because it keeps things more interesting and reduces some risk for them, too, versus if they were only working on one project that came to an end. You know, that wouldn't be good in terms of the meaningful work we're trying to create. So, yeah, onboarding, training, and then the tech stack, all of that had to be completely revised in order to work in this world of: hey, clients,
we have the world's best workforce for data, but you guys have your own tooling, and we can help you get that tooling. We've got some great partnerships with the top tool providers for different types of data work. And that's been a big part of, I think, a lot of momentum recently for CloudFactory.
[00:21:39] Unknown:
And so you've talked a bit about the onboarding process of getting your workers up to speed on your clients' tech stacks and some of the common challenges and requirements that your customers are facing. And I'm wondering if you can talk a bit more about the overall workflow for a given request from a customer, in terms of what the experience is for them as far as submitting the request and receiving the output, and then for your cloud workers as far as their experience of actually doing the work and doing the labeling, and just some of the ongoing effort that's necessary to keep that flow consistent and
[00:22:20] Unknown:
repeatable and adaptable, as well as making sure that you're able to make it scale. Yeah, that's good. I think the experience is a great way to frame some of this up. So from a client's perspective, they have that initial kickoff call. They get a dedicated client success manager and a team lead who is boots on the ground in Nepal or Kenya, who's actually the first person who gets trained up on exactly how to do the tasks. At that initial seed stage, where they're doing daily sprints, the first thing is getting the team lead trained, such that they can then begin to build the training infrastructure and the programs, and then actually do the training to build the team and get them ramped as fast as possible.
And the platform that the client logs into, a lot of it is really around collaboration. So you can think of kind of a Slack-like chat and the ability to chat with both the team lead and the client success manager, but actually with the entire work stream. So all of the workers that are on that particular work stream have the direct ability to see, in real time, if the client makes any sort of change, saying, hey, if you see an image of a car that has a bike attached to the back of the car, we actually don't want to tag that bike the same way as we would if it was a bike that was on the road or on the sidewalk. But they might change that the next week and say, actually, we want you to tag all bikes the same, no matter if they're on the back of a car or not. And so that kind of real time collaboration and chat is happening on a constant, 24/7, daily basis.
Then, again, it starts off with daily sprints and then moves to weekly sprints. And so there's usually weekly calls, biweekly calls; it depends, certainly, on the different types of clients and volume and use cases. But a lot of the day to day experience is them being able to log in, get visibility via dashboards and other analytics tools, and then chatting and collaborating through our platform, but then also getting on the phone, or, of course, Zoom is what we typically use to get face to face oftentimes with our clients. And then many of our clients will actually come on-site as well and get to spend time with our teams. So for us, it's always been about getting the experience to be scaled as much through technology as possible while still maintaining that important human touch. And so that's really how we treat things on both sides: we are a tech first company that is trying to make sure that wherever we can automate and streamline things, we do. But we always recognize the importance of maintaining some of that human element, the human touch, both with our cloud workers and also with our clients. Finding that right balance, we call it the radical middle, between those is really, really important.
And I think that's a key part of, again, some of the momentum.
[00:25:41] Unknown:
And having that collaboration and feedback loop built into the overall engagement for your customers and your workers is definitely useful to ensure that everybody is staying on target. And I'm also curious what sorts of protocols or practices you have in place to ensure that data quality is being maintained, and for identifying what those quality metrics are so that you can have some sort of measurable outcome, as well as any training or education that you have on either side to try and address some of the potential sources of bias that might occur in the data itself and in some of the labeling techniques that you use. Yeah, it's certainly
[00:26:24] Unknown:
not a silver bullet when it comes to trying to get the best quality with the least amount of bias. There's no question it's a layered approach. And I think the first thing we think about is who's actually doing this work. Making sure that you have a highly curated, vetted workforce who's really engaged and dedicated enough to your particular project, enough that they're gonna be able to stay up to date on the instructions and business rules, to be able to get the quality that you need. That's kind of a basic starting point, but obviously, what I'm talking about there is, you know, kind of in contrast to more of the crowd platform approach to trying to get your dataset built. And we think that approach definitely does bring some challenges. So for us, it starts with: who's actually doing the work? Are they engaged? Do they care about my work? Do they understand the context and the why? It's actually a weird thing, but one of the biggest things we see on data quality is when people on the front end actually understand what this data is being used for and why it's important to do a good job. Sounds hilarious, but, again, as a technology person, I probably spent the first six to eight years thinking that technology, and kind of enforcing quality through technology, was the best and most important way. And while I will always be a tech first kind of approach, we've seen the reality of literally just telling someone why it's important and how this data is being used. And that has a huge impact on the actual end quality. We've run those tests and seen those results. Similarly, one of these weird kind of psychology things is having people say thank you. And so, again, facilitating a little bit of interaction, where we've got some of our clients.
It's as simple as just sending a message out saying thank you, that was a great job on that dataset you worked on yesterday or last week, or recording a short 2 minute video from the CEO of the customer that they then send over to our cloud workers. Or they ship some t-shirts or swag to Nepal or Kenya, or even hand deliver them. I know it sounds crazy, but it's about, on the front end, vetting and making sure that you have people who care about your work and can actually do a good job, and then on the back end, actually thanking them.
It's not usually what I would have said quite a few years ago. I would have gone into our gold standard, ground truth, and reputation algorithms, and all the fancy things that we built. I can talk about that as well, but I think 1 of our learnings is just amazing: when you need people to be involved in your technology projects, you need to remember that they're people. And that's a thing that we forget. We're building technology, we're dealing with data, but when you need people, especially at scale, if they're the critical success factor for you to get the dataset, or to be the glue in your operations to scale some of the features within your business and your platform, you need to think about these things. So that's something we spend a lot of time on. Culture is as important as technology, if not even more important, when you think about things like quality control.
But that said, yes, we do put a lot of tech into it as well. We're able to monitor all of the different activity and have a clickstream, right, that comes from that browser, that cloud worker app, where we can really look at the patterns: what clickstream is associated with good actors and good performers, and maybe those that are a little bit less so. We've got a lot of proprietary stuff in that area that we continue to invest in. That's fascinating. Again, it's a little bit of trust but verify, having those tools to identify the best workers from those that might need additional coaching and training. And then we also provide an API to our customers, because we know that they are doing a lot of their own quality control. Some of them are doing it manually, where they're literally reviewing a sample of, say, 1% or 5%, what have you, or they've got their own automated algorithms that are trying to do some grading of the work.
For all of that, we give them an API to integrate and send that information back to us in real time, to add to what we know on our side. Together, that gives us really good visibility that helps us, again, to constantly be optimizing to get better and better quality data back to our clients.
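The sampling-and-feedback loop described above could be sketched as follows. This is an illustrative guess at what a client's side of that workflow might look like, not CloudFactory's actual API: the function names and payload fields are all hypothetical, and a real integration would POST the payload to the vendor's endpoint rather than just build it.

```python
import random

def sample_for_review(task_ids, rate=0.05, seed=42):
    """Pick a random subset of completed tasks for manual QC review,
    e.g. the 1% to 5% sample sizes mentioned in the conversation."""
    rng = random.Random(seed)
    k = max(1, int(len(task_ids) * rate))
    return rng.sample(task_ids, k)

def build_feedback_payload(reviews):
    """Aggregate per-task review results into the kind of payload a
    client might send back through the vendor's QC API in real time."""
    accepted = [r for r in reviews if r["ok"]]
    return {
        "sampled": len(reviews),
        "accept_rate": len(accepted) / len(reviews),
        "rejected_task_ids": [r["task_id"] for r in reviews if not r["ok"]],
    }

# Sample 5% of 1,000 completed tasks for review.
tasks = [f"task-{i}" for i in range(1000)]
sample = sample_for_review(tasks, rate=0.05)
print(len(sample))  # 50
```

The point of the payload is simply that both sides see the same quality signal: the vendor can fold the client's accept rate and rejected task IDs into whatever it already knows about the workers who produced them.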
[00:31:25] Unknown:
Yeah. I really like what you're saying about having that feedback loop of understanding which metrics correlate with high quality work, and then being able to use that to identify potential options for retraining, or to assist some of the other workers in meeting that level of quality and improving their own work. As well as the need for aligning everyone along the business goals and the value of the outcome, which is a lot of what was going on with the DevOps movement: trying to make sure that everybody was working toward the same goals, instead of having internal business units in conflict and everybody fighting each other to meet their own performance metrics without necessarily understanding how that impacts the larger organization.
[00:32:15] Unknown:
Exactly. It is amazing: the deeper we techies get into things, the more we realize how much of this depends on people and communication and a lot of these soft things, these intangibles. It continues to shock me. But I agree, on the tech side it's fascinating. Obviously I won't get into details, but you can think of some of those signals that we can capture from the activity data of our cloud workers. The basic version, right, is that you can sit and watch somebody use a keyboard. You don't even have to look at the work that they're doing. You don't have to look at the screen and say, oh, they did that wrong, they didn't place that bounding box appropriately.
They put a 6 in instead of a 9. You don't even have to look at the screen. You can look at the cadence. I always think of it like playing a piano, as someone uses the keyboard and switches back and forth to the mouse. So there are the signals of how much they use their mouse versus the tab key, the rhythms and pacing with which they are pressing the keys, the pauses and delays. There are all these signals in that, and it's just fascinating. But obviously there are statistical things too: performing this particular task, or even focusing on this field, typically takes, say, 13 seconds, and this one only took 2 seconds, so something is suspicious when that happens too many times. So there are lots of signals and lots of data and statistics that we have access to that allow us to help coach and manage this very quickly growing global workforce, to ensure that we can help our customers get their own high quality data. Sometimes it feels a little bit meta, right, that we're using data and software to try and manage people who are then helping to create data, which is creating software to help create value in the world.
Yeah, it does. And so, continuing on the
[00:34:32] Unknown:
topic of human impact on AI and machine learning, I know we've talked a bit about some of the types of work and inputs that the cloud workers are doing, but I'm also interested in getting your broader view of the role that humans play in the overall life cycle for artificial intelligence and machine learning projects. Yeah. We talked obviously about
[00:34:55] Unknown:
the training side, kind of on the front end: how do you make sure that you get lots of high quality, unbiased data to feed into your model. I think the next obvious place is, okay, great, we've got a model, now how do we validate it? How do we actually make sure that we've got something that's performing, and how well is it performing? So there's that side of it. The more interesting part, though, is that I think we're all learning, and certainly over the next 5 to 7 years are gonna be learning a lot more about, not the training of AI, but the sustaining of AI, and the role that people have as more of these algorithms make their way into our lives and specifically into the enterprise.
And that's something we think a lot about: the more that AI is deployed into production, the more it changes, essentially, how we work. That's what we're seeing with that whole idea of humans in the loop. Yes, they need to be there to help train and improve and increase the automation, but it also changes, once those models are deployed, how we have people inserted into the loop to interact with that technology to get the overall result that we want, whether that's data, or a customer experience that's being created, or what have you. It's almost always a human plus machine world. Obviously we're very biased, but that's what we see. We've got a lot of companies where people think it's 100% technology.
Some people are like, oh wow, it's probably 90 or 95%, maybe there's 5% humans inserted to review. The reality is there's a lot that we do where it's maybe 20% AI and technology and 80% humans. As some people say, that's the dirty little secret of Silicon Valley right now: we talk a lot about AI, automation, and technology, but in the real world, not the academic world, where you're solving problems with all the corner cases and exceptions and just hard things to solve, people are still really good. Our reasoning and our creativity and our judgment are so much more generalized and so far ahead. So again, the usual idea: AI is amazing and completely destroys humans in many areas, but humans still have a big advantage and role to play. The power is when you can design systems and processes that include both human and machine. That's what we're seeing in different industries, and some of our clients who have really caught on to this are the ones who are dominating, because they've found that sweet spot of both. And
[00:38:00] Unknown:
on the side of the cloud workers that you employ, I understand that you do a fair amount of skills development and help with gaining leadership skills and community building. So I'm wondering if you can just talk a bit about the relationship that you have as a business with your employees and the cloud workers and how that relates to your overall goals. Yeah. So, obviously,
[00:38:24] Unknown:
when my wife and I had a 2 week vacation turn into 6 years in Nepal, we definitely did not stay there and start CloudFactory with any of this in mind. It was just an amazing time and opportunity to discover a different culture, to discover people who are super talented and smart, and yet there's something like 40 to 60% unemployment, if not even higher, among people in their twenties. So for us, it wasn't just a matter of getting them a job. We have a supply side strategy. We believe that building the best workforce in the world is necessary to train up AI and to augment it. Similar to AWS and cloud computing and cloud storage, we believe this idea of cloud labor, having the world's workforce available on tap to power and scale parts of your business, is inevitable. We believe it's gonna happen.
But we really were doing it from this idea of just trying to create meaningful work for people. And we knew right away that that was more than just a paycheck. People come to work to, as we say, earn, learn, and belong. So we designed a model that would, yes, have people work distributed, but also maintain things like relationship. People get together every week or every 2 weeks in these teams of 4 to 8 people, and it's an opportunity for them, like you said, to do leadership development, but then they also go out every few weeks and do a community service project. Our average age is about 23 years old, and typically, I think, 95% of our cloud workers are 18 to 30 years old. They are this growing army of millennials that are passionate and smart, and they are earning. They log in to CloudFactory day and night, and they're earning money.
They're learning, both through the work and through a lot of the other sessions, leadership development, and other programs that we put on, personally and professionally, to continue to grow. And then they're also going out and serving, and they're doing it alongside a group of people who become pretty important to them. So there's that team based model, alongside the CloudFactory Leadership Academy, alongside the opportunity to plug in and join the digital economy with some amazing companies and be an inside part of their R&D programs.
All that comes together for people. They typically work anywhere from 5 to 48 hours, probably averaging out somewhere right in the middle of that. So there are a lot of college students working maybe 5 to 15 hours, a lot of recent graduates piecing together multiple work streams, logging into the platform and maybe doing 40 to 48 hours on the high end, and then a lot of people doing it as a side gig, somewhere in the middle. That experience is 1 that we focus on. We believe that if we can be the best gig in town in places like Nepal and Kenya, we can attract really smart, talented people, give them an opportunity to start their career on a really cool trajectory, make an investment in them, and then celebrate when they go on to do amazing, better things. Sometimes that's coming into the company in a bigger role. A huge percentage, more than a third, of the people who manage our clients and projects and supervise these teams of cloud workers come from this workforce. They get promoted into that. But many of them go on: they will go study abroad, they will become teachers, start their own company, start their own NGO, or go on to another professional job from CloudFactory. All of that we consider to be success.
So that's a fun part for us. That social mission aspect, we did it because that's just the why of our business. What's cool, though, is to see that because we make that investment in that environment and that culture, we end up attracting and retaining people, and having them engaged and motivated, in a way that actually causes all of our clients to get way better results. Because in the end, when you sit down to do this data work, you're doing tasks that may be 10 seconds or 10 minutes or an hour. But when you're actually part of something bigger, and you care about what you're doing, and you're doing it alongside people that you care about, and you have flexibility at the same time, all of those things add up to zooming in, maybe a couple extra times, to see if it's a 6 or a 9, or to get that 1 pixel accuracy on a bounding box.
And that's, I think, what our clients really appreciate. Yes, the technology platform and a lot of the things we have in place are really interesting, but when you pair that with the culture that we're trying to create on the supply side, it leads to just better results for everyone.
[00:44:02] Unknown:
And 1 of the other challenges is elasticity in demand on your customer side, as far as being able to scale up and scale down in terms of having the people
[00:44:22] Unknown:
available to do the work. So I'm curious what your strategies are on that front. Yeah. Resource pooling and that multi-tenant access, right? CloudFactory, cloud labor, the tenets of cloud, that definitely includes those, and it's important. I think there's a little bit of a fallacy, though, to the idea that you need hundreds of thousands or millions of workers in order to get the elasticity. We certainly don't run into any clients that need that kind of imaginary elasticity. What it typically looks like is that, on any given day or week, they may need to add 25% or 50%, or maybe double their capacity.
What we do is give our clients kind of a dial, where they can upgrade or downgrade the number of hours they subscribe to on a monthly basis. Someone could be at 1,000 hours a month or 10,000 hours a month, and they can go ahead and just upgrade to add, say, 50% capacity. That can happen sometimes in as little as 2 days. And again, the way that we do that is through technology and through cross training. By having our cloud workers work on a number of projects, we can quickly have them gain the skills and training necessary to come in.
We also maintain a general queue, where people are coming into CloudFactory, taking general assessments, and going through onboarding, so they're ready to join a new work stream within 2 days. And then we have people within our existing 5,000 who are always looking to pick up extra work streams to get more hours. So it comes from how we've set up the resource pooling, by not having people dedicated 40 hours a week.
The way we usually think about it is that someone needs to be on a project at least 10 hours a week to really be dedicated enough to stay in. So 10 hours, 20 hours. In some cases they're 40 hours on 1 project, on 1 work stream, because that's required and that's what works best for that work stream. But typically, that ability to work on more gives us the elasticity such that if 1 week or 1 day someone needs extra capacity, we can dial that in the platform, and all of a sudden someone who's working 15 hours a week as a college student can pick up an extra 10 hours that week by accepting additional, we call them open shifts, to work on a work stream they already have the skills and training for. So there are lots of different ways that we're able to get that elasticity, and that's something that our clients love. Again, it's not this imaginary thing where you just create a form with a couple of bullet point instructions, send it out to an anonymous crowd, and instantly get access to thousands of people.
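The open-shifts mechanism described here could be sketched as a greedy assignment: when a client dials up capacity, offer the extra hours to workers already trained on that work stream who have spare time under a weekly cap. Everything in this sketch is a hypothetical simplification; the function name, the 10-hour per-worker bump, and the 48-hour weekly ceiling are assumptions drawn loosely from the numbers in the conversation.

```python
def fill_open_shifts(extra_hours, workers, stream, max_extra=10):
    """Greedy sketch of covering a capacity bump: offer extra 'open
    shift' hours on a work stream to workers already trained on it,
    capped per worker and by a 48-hour weekly platform ceiling."""
    assignments = {}
    remaining = extra_hours
    for w in workers:
        if remaining <= 0:
            break
        if stream not in w["trained_on"]:
            continue  # only offer shifts on streams the worker knows
        spare = min(max_extra, 48 - w["weekly_hours"])
        take = min(spare, remaining)
        if take > 0:
            assignments[w["name"]] = take
            remaining -= take
    return assignments, remaining  # remaining > 0 means unmet demand

pool = [
    {"name": "anita", "weekly_hours": 15, "trained_on": {"bounding-boxes"}},
    {"name": "raj",   "weekly_hours": 45, "trained_on": {"bounding-boxes"}},
    {"name": "mei",   "weekly_hours": 20, "trained_on": {"transcription"}},
]
assigned, unmet = fill_open_shifts(18, pool, "bounding-boxes")
print(assigned, unmet)  # {'anita': 10, 'raj': 3} 5
```

Any unmet hours would presumably then be routed to the general onboarding queue mentioned above, where newly assessed workers can be trained onto the stream within a couple of days.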
That's cool if, as an academic, you need to do a survey, or if it's a very simple true or false binary judgment and you've taken the time to set up all the quality control. That world sounds very cool, but for those people who are really doing data at scale, the type of elasticity we see is what we've designed CloudFactory for, and it seems to be working actually really well. And so
[00:47:53] Unknown:
1 of the other things that I'm curious about, given your position working with all these different clients, embedded in their R&D for different AI and machine learning projects, and your focus on having this positive social impact: what are your thoughts on the future of work, as AI and other digital technologies continue to disrupt existing industries and modify or replace certain jobs that have traditionally been held by humans? And how do your views and considerations
[00:48:29] Unknown:
on those shifts in the nature of work tie into your plans for CloudFactory in the medium to long term? It's a big question everyone's asking, isn't it? I definitely fall in the camp of the techno optimists. While there's going to be a disruption, probably bigger and faster than what we've seen historically, and I do think it's going to be a disruption, I am optimistic. We are seeing a hollowing out of the middle of work. Like I said, I continue to be surprised every day by the applications of AI that we're seeing. So all of a sudden we are tagging cells, right, that are going to be helping and speeding up and replacing portions of people's jobs that operate in that middle sector. There are data scientists on the high end and data labelers on the low end that are essentially teaming up to hollow out some of that middle work.
Sometimes it's not full jobs, it's only partial responsibilities of those jobs in the middle. But there's no question that's happening, it's accelerating, and it's going to happen faster than we're probably gonna be able to adjust as a society. But to get back to the techno optimist side, I do think that things are probably happening slower than what the news is talking about. We do have kind of an under the hood view, and we've seen even with some of the autonomous vehicle predictions that, in the real world, it might be 10 years later than we thought. A similar thing: OCR has been around for over 30 years, and yet I can't tell you the millions of invoices and receipts that we process every month, because there's so much that the technology still can't do.
Obviously, we are committed to helping companies build the tech to increase that, but also to fill in the gap. So again, I continue to believe that, yes, there are jobs that are going to become completely obsolete, and there are some that are gonna be reduced. But what we see is that a lot of jobs are being created as we develop AI and get it into production. We have to create new jobs just to sustain it. It's a new world where we're going to have to continue to build humans around the technology.
So, yeah, I think it all ends up in a place that looks very different, happens slower than what everyone is talking about, and creates more jobs than what everyone's talking about as well. It's an interesting time for us, isn't it, to be alive and to see what's about to happen over the next 5 to 10 years?
[00:51:39] Unknown:
And are there any other aspects of the work that you're doing at CloudFactory
[00:51:43] Unknown:
or the challenges of managing data labeling and machine learning projects and artificial intelligence that we didn't cover yet that you'd like to discuss before we close out the show? Yeah, I think the thing that I would add,
[00:51:56] Unknown:
Tobias, is that we see some of the companies that we work with struggling in a few areas. 1 of the bottlenecks we see is just getting access to the data. How do you collect and capture the data you need to build whatever it is you're dreaming up? And that's hard. If, for facial recognition, you need to capture 8 demographics, and you need 40,000 images, and you need people to look certain ways or do certain things, or if you need to get all sorts of visual data or what have you, the programs to actually capture that data are intense. There are people setting up booths in Las Vegas to get international demographics, handing out swag, and getting people to sign consent forms by giving them some sort of swag and having salespeople convince them as they walk by.
It sounds ridiculous, but companies right now are scraping the web, sending people around the world to take photos, renting out Airbnbs to stage them and do VR capture and lidar inside them. It's amazing. Obviously, they're also sending out cars around the world, sending out drones and satellites. So the whole data capture and data collection thing is just a wild west that everyone is trying to figure out: how to do it, how to scale it, and how to do it in a way that doesn't introduce biases. And then the next challenge is, okay, I've got the data, now I need to annotate it, I need to label it. Obviously, that's where we come in.
But learning how to do that well, and finding a partner that has the experience and the ability to help you get it done, considering ease and speed to market, is a huge bottleneck. I'm a little bit biased there, but I think most people would agree that that's a huge part of developing AI. And then obviously there's getting into production and all the other challenges, but I would just point out those first 2: actually collecting and capturing the data, and then actually annotating and labeling it and getting it into a training dataset that's going to be of high quality, and large enough for you to actually get a high performing algorithm. Those are really hard things that companies are figuring out how to do.
And people are realizing that, within their company, they've got different teams doing it different ways, and they're trying to figure out how to create centers of excellence to bring that all together. Then they can have 1 team focused on data collection, 1 team focused on data labeling, and build out the pipelines, the tooling, and the partnerships to do that. So that's what we spend a lot of our time on right now: helping companies learn how to do that well and begin to set themselves up for repeatable success as they develop more and more AI
[00:55:06] Unknown:
for their own business, and obviously for their clients and for the market. Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Biggest gap in tooling.
[00:55:28] Unknown:
That's a really good question. I wouldn't say it's necessarily a gap, but I think the biggest mistake that people make is just not going ahead and building something from scratch. What I've been most impressed by are the companies that very quickly pull something together, even if it means starting with a Google spreadsheet, or building off of something open source. It's a gap in the sense that getting that tool, optimizing that tool, and improving that tool is so important as part of your program that having ownership over it really matters. There's a lot of wisdom in that, and we see a lot of success in the people that make some investment there. So, yeah, I think it's easier than ever to do, and more and more companies are doing it, but I think that's probably 1 of the biggest gaps. Sometimes people think it's too big of a deal, or they rely on someone else to do it, and that can get them in a place where they might be locked in. They might not be able to make the changes they want, or get the data or the quality they want.
They might have bias in it because of how the tooling workflow is set up. There are all those sorts of things. So I would just say, when people jump in and actually make some investment in that tooling, even if it starts really small before they scale it, that's something we see people doing
[00:57:12] Unknown:
and they end up winning and being very thankful for the investment they made. Well, thank you very much for taking the time today to join me and discuss the work that you're doing at CloudFactory. It's definitely a very interesting business space, and I appreciate the social impact that you're focusing on as part of your business. It definitely seems to be doing well for you. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day. Same to you, Tobias. Thanks very much for the opportunity to share
[00:57:40] Unknown:
and the conversation. Thanks.
Introduction and Sponsor Messages
Interview with Mark Sears: Introduction
The Origin of CloudFactory
Common Requirements and Challenges in Data Labeling
Integration and Consistency in Data Labeling
Training and Onboarding Cloud Workers
Workflow and Client Experience
Ensuring Data Quality and Managing Bias
The Role of Humans in AI Lifecycle
Skills Development and Community Building for Cloud Workers
Elasticity and Demand Management
Future of Work and AI
Challenges in Data Labeling and AI Projects
Final Thoughts and Closing Remarks