Summary
Successful machine learning and artificial intelligence projects require large volumes of properly labelled data. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides a valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customers' existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for machine learning and more
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what CloudFactory is and the story behind it?
- What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
- What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
- Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
- What protocols do you have in place to ensure data quality and identify potential sources of bias?
- What role do humans play in the lifecycle for AI and ML projects?
- I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
- How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
- Can you share some stories of cloud workers who have benefited from their experience working with your company?
- What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
- What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
- What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
- What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
- How does that tie into your plans for CloudFactory in the medium to long term?
Contact Info
- @marktsears on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- CloudFactory
- Reading, UK
- Nepal
- Kenya
- Ruby on Rails
- Kathmandu
- Natural Language Processing (NLP)
- Computer Vision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to Podcast.__init__, the podcast about Python and the people who make it great. When you're ready to launch your next app or you want to try a project you hear about on the show, you'll need somewhere to deploy it, so take a look at our friends over at Linode. With 200 gigabit private networking, scalable shared block storage, node balancers, and a 40 gigabit public network, all controlled by a brand new API, you've got everything you need to scale up. And for your tasks that need fast computation, such as training machine learning models or running your CI pipelines, they just launched dedicated CPU instances. In addition to that, they just launched a new data center in Toronto, and they've got one opening in Mumbai at the end of 2019.
Go to pythonpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show. And to keep track of how your team is progressing on building new features and squashing bugs, you need a project management system that can keep up with you, one that's designed by software engineers for software engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of prebuilt integrations, and a simple API for crafting your own. With such an intuitive tool, it's easy to make sure that everyone in the business is on the same page.
Podcast.__init__ listeners get two months free on any plan by going to pythonpodcast.com/clubhouse today and signing up for a free trial. And you can visit the site at pythonpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And if you have any questions, comments, or suggestions, I'd love to hear them. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
[00:01:56] Unknown:
Your host is Tobias Macey, and today I'm interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for machine learning and more. So, Mark, can you start by introducing yourself? Absolutely, Tobias. Yes, I'm Mark Sears, founder and CEO at CloudFactory,
[00:02:10] Unknown:
and I'm talking to you today from Reading, UK, our global headquarters at CloudFactory. And do you remember how you first got involved in the area of data management? I definitely do. I'm a software guy, a computer scientist geek, like a lot of people who have found their way into the world of data, ML, and AI. My entry point was really mostly through CloudFactory, being passionate about software, passionate about people. We ended up building CloudFactory as an important piece. We're the picks and shovels, working on people's data in order for them to both train up machine learning algorithms as well as to augment and kind of fill the gap and insert humans in the loop at scale, where technology can't quite do it. So that's kind of how I found my way into the world of being passionate about data: first through being passionate about software and, actually, in people and creating opportunities for people that, as I'm sure we'll get into, are in places like Nepal and Kenya, where we've now got 5,000 people that we are trying to create meaningful work for, by caring for data and taking care and structuring and adding value to data for our clients.
[00:03:32] Unknown:
And so all of the interest in sort of data management, data quality issues, as well as what you're referring to in terms of enhancing quality of life for people in developing countries, has culminated in your work at CloudFactory. So can you start off by explaining a bit about what it is that you do at CloudFactory and some of the story behind it? Sure. Yeah. I'll come at it from maybe a bit of the origin, because like I said, we didn't really get in
[00:03:58] Unknown:
knowing that we would end up where we are today and kind of how we plug in to the ecosystem. Really what it was: 10 years ago, my wife and I took a two-week vacation to Nepal. It was on our bucket list, and we went. We had a fantastic time. And towards the end, we ended up meeting three other software developers, three young, smart Nepali developers, at a cafe, eating pizza and talking about software. And next thing I know, the trip got extended for three weeks, and I bought an iMac and started training them on Ruby on Rails, which I'd been working with way back when. We kind of went from vacation to three weeks of training, to then we got a project, so I extended it for three months. We got a flat in Kathmandu.
And next thing I know, that vacation turned into living in Nepal for six years. My wife and I had our two beautiful kids there, and kind of a third kid in CloudFactory, this startup that kind of emerged. And the thesis really was around that discovery of talent, and so talent being equally distributed around the world, but opportunity is not. And so that's really where CloudFactory started: with kind of identifying talent, training, and then building a technology platform to really give opportunities on both sides of that platform, to create meaningful work and connect people in the global economy in kind of the emerging economies, like I mentioned, Nepal and Kenya, where we operate today.
But then on the other side, for fast growing tech companies that need access to talent at scale to really create value from their data. And again, that's both to train up their models and then also to augment them and insert humans in the loop to fill the gaps. And so that's what we've been doing at CloudFactory for really probably the last eight years. We've been deep into the R&D programs of many, many fast growing small and large tech companies, over 200 of those companies now, really helping them to provide access to a scalable workforce for the purpose of data work. And so that's what we get to do. Right now, we are working on so many interesting data projects, AI projects, and we get to operate in some pretty fun locations, again, working with really talented people. And we do all of it in the middle with our own 60 engineers and data scientists who are building that technology platform to really make the whole thing work at scale, so that we can continue to kind of be the workforce of the future, to make everyone's data super fast, super easy, secure, fast turnaround times, fast times to market, all the things that everyone's looking for nowadays when they think about: how do I get the data I need to train my models and run my business? And
[00:07:12] Unknown:
in terms of feature extraction and data labeling, there are a lot of different dimensions that can occur in that workflow based on what type of data it is, whether it's for natural language processing or image or video, or if it's trying to add some contextual information to some of the raw source data. And I'm wondering what you have found to be some of the common requirements or needs in terms of that overall space from your customers
[00:07:42] Unknown:
and some of the different challenges that come along with those various categories of request. Well, yeah, you're right in the sense that it's very wide, and yet there are a lot of things that are in common. Most of our work is computer vision and NLP related. But, of course, within that, there's such a broad range of use cases. I mean, I am surprised every day at the innovative ways that we are tagging and labeling different things. Right? You know, for agritech and drone and satellite applications.
There's some fantastic biotech work tagging different cells, and, you know, I could go on and on. I'm shocked. Obviously, there are the very obvious ones like autonomous vehicles and video annotation for self-driving cars. There's a lot of work that we do kind of related to helping with intelligent extraction from documents. But I think what I'm most excited about is just how we see machine learning and deep learning being applied to so many different use cases in so many different industries in so many different ways, and getting to partner with those programs. Yeah, it's very wide, but it does fit into a pretty common taxonomy. At the very top, the way we think about it is people are coming to CloudFactory saying, again, I need someone to train the machine, right, to train up a model. So I need training data. They may also need help in kind of validating their model as well.
But essentially, it's on that side, it's on the AI side. And then we have a lot of people that also come to us that have technology in play. They've trained up models. They've got technology. But they're needing people to do anywhere from kind of a 1% gap to an 80 or 90% gap that the technology can't do. And so we're literally becoming that same workforce doing very similar data work to actually insert humans in that loop and kind of fill that gap. And so that could be, again, things like: we've got some intelligent AI powered data extraction for receipts or invoices, but we need you to review the 5% that get kicked out because they don't have high confidence.
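The receipt-review pattern described here, where predictions below a confidence threshold get kicked out to human reviewers, can be sketched roughly as follows. This is an illustrative Python sketch, not CloudFactory's actual system; all names and the 0.95 cutoff are assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Prediction:
    item_id: str
    label: str
    confidence: float  # model's confidence in [0.0, 1.0]

@dataclass
class Router:
    """Send high-confidence predictions through; queue the rest for humans."""
    threshold: float = 0.95
    auto_accepted: List[Prediction] = field(default_factory=list)
    human_review_queue: List[Prediction] = field(default_factory=list)

    def route(self, pred: Prediction) -> str:
        if pred.confidence >= self.threshold:
            self.auto_accepted.append(pred)
            return "auto"
        self.human_review_queue.append(pred)
        return "human"

router = Router(threshold=0.95)
for p in [Prediction("inv-001", "total=42.10", 0.99),
          Prediction("inv-002", "total=18.75", 0.62)]:
    router.route(p)
# inv-001 passes straight through; inv-002 is queued for human review
```

The same routing logic covers both extremes Mark describes: a tight threshold sends only a few percent of items to humans, while a loose model effectively sends 80 or 90 percent.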
Or it could be: we have 80% that needs to be actually done or reviewed by humans. And so kind of at that highest level, the common thing is: I need help with training data, or I need help with inserting humans in the loop to augment the technology. And then, of course, under AI, it's computer vision, it's natural language processing, and you can really continue to break that down. It really comes down to a lot of the same primitives of what we're doing to data. So we are labeling data, annotating data, categorizing data, scrubbing data. We are collecting data from the Internet or off PDFs or different sources, etcetera, etcetera. It's really a lot of different things, but it does come down to some pretty common primitives, no matter what the industry or the use case is, whether it's being done as part of building out a training dataset, or whether it's even real time, where we're working on data that's going back to their customers in a matter of minutes or seconds,
[00:11:15] Unknown:
turnaround. Some of the other challenges associated with providing this labeling and categorization are maintaining consistent taxonomies, particularly given that certain customers might have their own schemas or taxonomies that they're trying to adhere to, and ensuring compatibility with the different tools or libraries that they might be using. And so I'm wondering what your experience has been as far as how to provide useful integration points with their existing systems and ensure that the labeling techniques and the formats of those labels are consistent
[00:11:52] Unknown:
and usable by the systems that they're trying to use it within? Yeah. That is probably one of the biggest learnings that we've had over the last four years. The first few years, and really the first two versions of our technology platform, we tried to build the jet engine, where we just said, hey, we've got this amazing API where you can send in your data and we're gonna fire back to you perfect data, exactly what you want. And so we were a black box, kind of a full stack approach. And what we quickly realized is even though we had built a jet engine, we had built amazing tools, we built workflows, we had built everything that was needed.
We started to find out that the market was really having this strong desire to actually own. They wanted to own the data. They wanted to own the process. They wanted to own the workflow and the tooling, more and more. And they did not want to send their data to us. They did not want the work to be done necessarily in our tooling, in our workflow, on our platform, for a few different reasons. One is that a lot of companies consider the tooling and the process and workflow to be a part of their competitive advantage. If they're training up their models, doing feature engineering, they sometimes believe that some of the things they do and how they annotate is part of their competitive advantage that's gonna cause them to win in their particular market.
So for some it's competitive advantage; sometimes it's just plain visibility and control. They wanna have control over the tools and the workflows themselves, because they don't wanna be locked into one company like us. They wanna have maybe two or three vendors. They may want to have a small in-house team that's doing some of the work. And I think part of it, too, is that there are so many open source tools, and just the general ability to create this stuff faster than ever before, that there's a big preference to try and do most of that stuff in house. And so, kind of keep your data close. Yes, for GDPR and other data infosec and compliance reasons, but also just kinda keep it close because people do believe there are a lot of competitive advantages in how they treat their data. And so that's what we've done: we've built a solution that allows companies to get access and kind of bring this contingent, fluid workforce onto their tools.
And so kinda bring them onto their cloud, where their data is hosted, and to work the way that they work. That was kind of the big decision we made: you know what? We can try and build the best way to work and try and get people to work the way that we want them to work, but in the end, that wasn't what people were asking for. So that's what we do now: we work the way that our customers work, and that involves their tooling, their workflow, their process, and their very iterative instructions and business rules on how they want their data to be handled and prepared.
And that's sometimes changing daily, sometimes changing weekly. And so that agility and flexibility to work the way that they work has certainly become something that we've seen, the last three years especially,
[00:15:20] Unknown:
be absolutely the way the market's going. Yeah, that's interesting that you're having your workers actually accessing the source systems for your customers, as opposed to the other way around where your customers are sending you the data and then you're sending it back, as you described initially. And that's sort of the model that I've heard from other people who I've talked to in this space. And so I'm wondering what your experience has been in terms of managing ongoing training for your workers as far as being able to understand and interact with the different tooling and systems that your customers might be using,
[00:15:53] Unknown:
as well as just the overall onboarding process as new people are coming up, if you need to, say, scale the number of people who are working on a given task, and just ensuring that they're able to be productive in a short period of time? Yeah. So some of the things that we had to solve three or four years ago when we made this pretty big pivot in our business. Right? Like I said, we used to say, hey, send it to us via API and we'll black-box handle it; instead, it's: hey, we're gonna give you access to the world's workforce to come fluidly onto your tools and work the way that you wanna work. The first thing we had to do is we needed to create a work application that all of our now 5,000 cloud workers would use to actually access the clients' tools.
We call them cloud workers. So our cloud workers, every day, 24/7, they're all logging into our CloudWorker app, which is essentially, you can think of, an app that has an embedded browser in it that then actually directs them onto the client's tools. And so that can be a video annotation tool that the client has built themselves. It could be a custom data categorization tool that they built themselves. It could be an NLP text tagging tool that they bought off the shelf. It could be an open source tool that they've instantiated their own version of that we're using. It could be Google Spreadsheets, right? So essentially, our cloud workers are logging in to kind of our own browser, and they're being directed to the appropriate tool to work. And it's all, of course, based on them passing the qualifications and gaining the skills that allow them to get access to these different, what we call, work streams. So every client comes to us and they're essentially spinning up a work stream, which is a capacity of hours every month that they get access to this workforce to do work for them. And so for each of these work streams, we are collecting data through that browser that allows us to pair it with some of the data that we get from our clients' tools, and gives us the analytics and everything we need to guarantee that we are matching the right worker to the right task at the right time, and getting the results and managing the workforce towards the results that everyone's looking for in terms of things like throughput and, of course, quality and accuracy.
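The qualification gating described here, where a worker only gains access to a work stream after passing its required qualifications, could look something like the following. This is a minimal sketch under assumed names and structures, not CloudFactory's actual schema.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class WorkStream:
    name: str
    required_quals: Set[str]  # qualifications a worker must have passed

@dataclass
class Worker:
    name: str
    quals: Set[str]  # qualifications this worker has earned

def eligible_streams(worker: Worker, streams: List[WorkStream]) -> List[str]:
    """Names of the work streams this worker is qualified to join."""
    # A worker qualifies when the stream's requirements are a subset of
    # the qualifications they hold.
    return [s.name for s in streams if s.required_quals <= worker.quals]

streams = [
    WorkStream("video-annotation", {"bounding-boxes", "client-a-rules"}),
    WorkStream("receipt-extraction", {"ocr-review"}),
]
asha = Worker("Asha", {"bounding-boxes", "client-a-rules", "ocr-review"})
print(eligible_streams(asha, streams))  # qualified for both streams
```

Cross-training a worker on several streams, as Mark describes next, simply means their qualification set satisfies more than one stream's requirements.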
And so that fundamental change of our tech stack and how we do quality control and workforce management was one part of it. But certainly, from a training and an onboarding perspective, that was where we had a lot of improvement and innovation in the methodology. So for us, we have a kind of seed stage when we kick off a work stream with a new client, where it looks like daily sprints, where we're going back and forth, and it's a race to usable data. So how quickly can we get usable data to our clients? That is super important, and we're very, very aggressive about how we do that.
And those tight iterations and the feedback loops, and some of the tools and methodology that we've created around that, are really, really important, because we know that's what clients are looking for. They can't afford to wait for months before they know if they're gonna be able to get the quality of data and the throughput that they need in order to power their product development. So on the onboarding side, there's certainly a lot of art and science to that. And similarly on the training side, typically a cloud worker needs somewhere between three hours and probably two weeks of training in order to be productive on a new work stream. So there is training and onboarding on our kind of general process and tooling.
But then for a specific work stream, it's somewhere between three hours and, like I said, two weeks. And that's a blended learning model, where the majority of our cloud workers are working distributed, but many of them are also working in one of our offices, what we call delivery hubs, in Nepal and Kenya. And they all live within an hour's radius of one of those. So they're coming in. They might come in for a three-hour training session, or they may come in for an intensive two weeks of training, but everything's kind of in that window. And so typically, they'll be trained on anywhere from maybe two to five different work streams, and they'll be working on different work streams. And so it gives us and our clients a lot of elasticity when they're cross-trained on multiple work streams, but it's also great for them, because it keeps things more interesting and reduces some risk for them, too, versus if they were only working on one project that came to an end. You know, that wouldn't be good in terms of the meaningful work we're trying to create. So, yeah, onboarding, training, and then the tech stack, all of that had to be completely revised in order to work in this world of: hey, clients,
we have the world's best workforce for data, but you guys have your own tooling, and we can help you get that tooling. We've got some great partnerships with the top tool providers for different types of data work. And that's been a big part of, I think, a lot of momentum recently for CloudFactory.
[00:21:39] Unknown:
And so you've talked a bit about the onboarding process of getting your workers up to speed on your clients' tech stacks and some of the common challenges and requirements that your customers are facing. And I'm wondering if you can talk a bit more about the overall workflow for a given request from a customer, in terms of what the experience is for them as far as submitting the request and receiving the output, and then for your cloud workers as far as their experience of actually doing the work and doing the labeling, and just some of the ongoing effort that's necessary to keep that flow consistent and
[00:22:20] Unknown:
repeatable and adaptable, as well as making sure that you're able to make it scale. Yeah, that's good. I think the experience is a great way to frame some of this up. So from a client's perspective, they have that initial kickoff call. They get a dedicated client success manager and a team lead who is boots on the ground in Nepal or Kenya, who's actually the first person who gets trained up on exactly how to do the tasks. At that initial seed stage, where they're doing daily sprints, the first thing is getting the team lead trained, such that they can then begin to build the training infrastructure and the programs, and then actually do the training to build the team and get them ramped as fast as possible.
And the platform that the client logs into, a lot of it is really around collaboration. So you can think of kind of a Slack-like chat and the ability to chat with both the team lead and the client success manager, but actually with the entire work stream. So all of the workers that are on that particular work stream have the direct ability to see, in real time, if the client makes any sort of change, saying, hey, if you see an image of a car that has a bike attached to the back of the car, we actually don't want to tag that bike the same way as we would if it was a bike that was on the road or on the sidewalk. But they might change that the next week and say, actually, we want you to tag all bikes the same, no matter if they're on the back of a car or not. And so that kind of real time collaboration and chat is happening on a constant, 24/7, daily basis.
Then, again, it starts off with daily sprints and then moves to weekly sprints. And so there's usually weekly calls, biweekly calls; it depends, certainly, on the different types of clients and volume and use cases. But a lot of the day to day experience is them being able to log in, get visibility via dashboards and other analytics tools, and then chatting and collaborating through our platform, but then also getting on the phone, or, of course, Zoom is what we typically use to get face to face oftentimes with our clients. And then many of our clients will actually come on-site as well and get to spend time with our teams. So for us, it's always been about getting the experience to be scaled as much through technology as possible while still maintaining that important human touch. And so that's really how we treat things on both sides: we are a tech first company that is trying to make sure that wherever we can automate and streamline things, we do. But we always recognize the importance of maintaining some of that human element, the human touch, both with our cloud workers and also with our clients. Finding that right balance, we call it the radical middle, between those is really, really important.
And I think that's a key part of, again, some of the momentum.
[00:25:41] Unknown:
And having that collaboration and feedback loop built into the overall engagement for your customers and your workers is definitely useful to ensure that everybody is staying on target. And I'm also curious what sorts of protocols or practices you have in place to ensure that data quality is being maintained, and for identifying what those quality metrics are so that you can have some sort of measurable outcome, as well as any training or education that you have on either side to try and address some of the potential sources of bias that might occur in the data itself and in some of the labeling techniques that you use. Yeah, it's certainly
[00:26:24] Unknown:
not a silver bullet when it comes to trying to get the best quality with the least amount of bias. There's no question it's a layered approach. And I think the first thing we think about is who's actually doing this work. Making sure that you have a highly curated, vetted workforce who's really engaged and dedicated enough to your particular project, enough that they're gonna be able to stay up to date on the instructions and business rules, to be able to get the quality that you need. That's kind of a basic starting point, but obviously, what I'm talking about there is, you know, kind of in contrast to more of the crowd platform approach to trying to get your dataset built. And we think that approach definitely does bring some challenges. So for us, it starts with: who's actually doing the work? Are they engaged? Do they care about my work? Do they understand the context and the why? It's actually a weird thing, but one of the biggest things we see on data quality is when people on the front end actually understand what this data is being used for and why it's important to do a good job. Sounds hilarious, but, again, as a technology person, I probably spent the first six to eight years thinking that technology, and kind of enforcing quality through technology, was the best and most important way. And while I will always be a tech first kind of approach, we've seen the reality of literally just telling someone why it's important and how this data is being used. And that has a huge impact on the actual end quality. We've run those tests and seen those results. Similarly, one of these weird kind of psychology things is having people say thank you. And so, again, facilitating a little bit of interaction, where we've got some of our clients.
It's as simple as just sending a message out saying thank you, that was a great job on that dataset you worked on yesterday or last week, or recording a short 2 minute video from the CEO of the customer that they then send over to our cloud workers. Or they ship some t-shirts or swag to Nepal or Kenya, or even hand deliver them. I know it sounds crazy, but it's about, on the front end, vetting and making sure that you have people who care about your work and can actually do a good job, and then on the back end, actually thanking them.
It's not usually what I would have said quite a few years ago. I would have gone into our gold standard, ground truth, and reputation algorithms, and all the fancy things that we built. I can talk about that as well, but I think 1 of our learnings is just amazing: when you need people to be involved in your technology projects, you need to remember that they're people. And that's a thing that we forget. We're building technology, we're dealing with data, but when you need people, especially at scale, if they're the critical success factor for you to get the dataset, or to be the glue in your operations to scale some of the features within your business and your platform, you need to think about these things. So that's something we spend a lot of time on. Culture is as important as technology, if not even more important, when you think about things like quality control.
But that said, yes, we do put a lot of tech into it as well. We're able to monitor all of the different activity and have a clickstream, right, that comes from that browser, that cloud worker app, where we can really look at the patterns: what clickstream is associated with good actors and good performers, and maybe those that are a little bit less so. We've got a lot of proprietary stuff in that area that we continue to invest in. That's fascinating. Again, it's a little bit of trust but verify, having those tools to identify the best workers from those that might need additional coaching and training. And then we also provide an API to our customers, because we know that they are doing a lot of their own quality control. Some of them are doing it manually, where they're literally reviewing a sample of, say, 1% or 5%, what have you, or they've got their own automated algorithms that are trying to do some grading of the work.
For all of that, we give them an API to integrate and send that information back to us in real time, to add to what we know on our side. Together, that gives us really good visibility that helps us, again, to constantly be optimizing to get better and better quality data back to our clients.
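The sampling-and-feedback loop described above could be sketched as follows. This is an illustrative guess at what a client's side of that workflow might look like, not CloudFactory's actual API: the function names and payload fields are all hypothetical, and a real integration would POST the payload to the vendor's endpoint rather than just build it.

```python
import random

def sample_for_review(task_ids, rate=0.05, seed=42):
    """Pick a random subset of completed tasks for manual QC review,
    e.g. the 1% to 5% sample sizes mentioned in the conversation."""
    rng = random.Random(seed)
    k = max(1, int(len(task_ids) * rate))
    return rng.sample(task_ids, k)

def build_feedback_payload(reviews):
    """Aggregate per-task review results into the kind of payload a
    client might send back through the vendor's QC API in real time."""
    accepted = [r for r in reviews if r["ok"]]
    return {
        "sampled": len(reviews),
        "accept_rate": len(accepted) / len(reviews),
        "rejected_task_ids": [r["task_id"] for r in reviews if not r["ok"]],
    }

# Sample 5% of 1,000 completed tasks for review.
tasks = [f"task-{i}" for i in range(1000)]
sample = sample_for_review(tasks, rate=0.05)
print(len(sample))  # 50
```

The point of the payload is simply that both sides see the same quality signal: the vendor can fold the client's accept rate and rejected task IDs into whatever it already knows about the workers who produced them.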
[00:31:25] Unknown:
Yeah. I really like what you're saying about having that feedback loop of understanding which metrics correlate with high quality work, and then being able to use that to identify potential options for retraining, or to assist some of the other workers in meeting that level of quality and improving their own work. As well as the need for aligning everyone along the business goals and the value of the outcome, which is a lot of what was going on with the DevOps movement: trying to make sure that everybody was working toward the same goals, instead of having internal business units in conflict and everybody fighting each other to meet their own performance metrics without necessarily understanding how that impacts the larger organization.
[00:32:15] Unknown:
Exactly. It is amazing: the deeper we techies get into things, the more we realize how much of this depends on people and communication and a lot of these soft things, these intangibles. It continues to shock me. But I agree, on the tech side it's fascinating. Obviously I won't get into details, but you can think of some of those signals that we can capture from the activity data of our cloud workers. The basic version, right, is that you can sit and watch somebody use a keyboard. You don't even have to look at the work that they're doing. You don't have to look at the screen and say, oh, they did that wrong, they didn't place that bounding box appropriately.
They put a 6 in instead of a 9. You don't even have to look at the screen. You can look at the cadence. I always think of it like playing a piano, as someone uses the keyboard and switches back and forth to the mouse. So there are the signals of how much they use their mouse versus the tab key, the rhythms and pacing with which they are pressing the keys, the pauses and delays. There are all these signals in that, and it's just fascinating. But obviously there are statistical things too: performing this particular task, or even focusing on this field, typically takes, say, 13 seconds, and this one only took 2 seconds, so something is suspicious when that happens too many times. So there are lots of signals and lots of data and statistics that we have access to that allow us to help coach and manage this very quickly growing global workforce, to ensure that we can help our customers get their own high quality data. Sometimes it feels a little bit meta, right, that we're using data and software to try and manage people who are then helping to create data, which is creating software to help create value in the world.
Yeah, it does. And so, continuing on the
[00:34:32] Unknown:
topic of human impact on AI and machine learning, I know we've talked a bit about some of the types of work and inputs that the cloud workers are doing, but I'm also interested in getting your broader view of the role that humans play in the overall life cycle for artificial intelligence and machine learning projects. Yeah. We talked obviously about
[00:34:55] Unknown:
the training side, kind of on the front end: how do you make sure that you get lots of high quality, unbiased data to feed into your model. I think the next obvious place is, okay, great, we've got a model, now how do we validate it? How do we actually make sure that we've got something that's performing, and how well is it performing? So there's that side of it. The more interesting part, though, is that I think we're all learning, and certainly over the next 5 to 7 years are gonna be learning a lot more about, not the training of AI, but the sustaining of AI, and the role that people have as more of these algorithms make their way into our lives and specifically into the enterprise.
And that's something we think a lot about: the more that AI is deployed into production, the more it changes, essentially, how we work. That's what we're seeing with that whole idea of humans in the loop. Yes, they need to be there to help train and improve and increase the automation, but it also changes, once those models are deployed, how we have people inserted into the loop to interact with that technology to get the overall result that we want, whether that's data, or a customer experience that's being created, or what have you. It's almost always a human plus machine world. Obviously we're very biased, but that's what we see. We've got a lot of companies where people think it's 100% technology.
Some people are like, oh wow, it's probably 90 or 95%, maybe there's 5% humans inserted to review. The reality is there's a lot that we do where it's maybe 20% AI and technology and 80% humans. As some people say, that's the dirty little secret of Silicon Valley right now: we talk a lot about AI, automation, and technology, but in the real world, not the academic world, where you're solving problems with all the corner cases and exceptions and just hard things to solve, people are still really good. Our reasoning and our creativity and our judgment are so much more generalized and so far ahead. So again, the usual idea: AI is amazing and completely destroys humans in many areas, but humans still have a big advantage and role to play. The power is when you can design systems and processes that include both human and machine. That's what we're seeing in different industries, and some of our clients who have really caught on to this are the ones who are dominating, because they've found that sweet spot of both. And
[00:38:00] Unknown:
on the side of the cloud workers that you employ, I understand that you do a fair amount of skills development and help with gaining leadership skills and community building. So I'm wondering if you can just talk a bit about the relationship that you have as a business with your employees and the cloud workers and how that relates to your overall goals. Yeah. So, obviously,
[00:38:24] Unknown:
when my wife and I had a 2 week vacation turn into 6 years in Nepal, we definitely did not stay there and start CloudFactory with any of this in mind. It was just an amazing time and opportunity to discover a different culture, to discover people who are super talented and smart, and yet there's something like 40 to 60% unemployment, if not even higher, among people in their twenties. So for us, it wasn't just a matter of getting them a job. We have a supply side strategy. We believe that building the best workforce in the world is necessary to train up AI and to augment it. Similar to AWS and cloud computing and cloud storage, we believe this idea of cloud labor, having the world's workforce available on tap to power and scale parts of your business, is inevitable. We believe it's gonna happen.
But we really were doing it from this idea of just trying to create meaningful work for people. And we knew right away that that was more than just a paycheck. People come to work to, as we say, earn, learn, and belong. So we designed a model that would, yes, have people work distributed, but also maintain things like relationship. People get together every week or every 2 weeks in these teams of 4 to 8 people, and it's an opportunity for them, like you said, to do leadership development, but then they also go out every few weeks and do a community service project. Our average age is about 23 years old, and typically, I think, 95% of our cloud workers are 18 to 30 years old. They are this growing army of millennials that are passionate and smart, and they are earning. They log in to CloudFactory day and night, and they're earning money.
They're learning, both through the work and through a lot of the other sessions, leadership development, and other programs that we put on, personally and professionally, to continue to grow. And then they're also going out and serving, and they're doing it alongside a group of people who become pretty important to them. So there's that team based model, alongside the CloudFactory Leadership Academy, alongside the opportunity to plug in and join the digital economy with some amazing companies and be an inside part of their R&D programs.
All that comes together for people. They typically work anywhere from 5 to 48 hours, probably averaging out somewhere right in the middle of that. So there are a lot of college students working maybe 5 to 15 hours, a lot of recent graduates piecing together multiple work streams, logging into the platform and maybe doing 40 to 48 hours on the high end, and then a lot of people doing it as a side gig, somewhere in the middle. That experience is 1 that we focus on. We believe that if we can be the best gig in town in places like Nepal and Kenya, we can attract really smart, talented people, give them an opportunity to start their career on a really cool trajectory, make an investment in them, and then celebrate when they go on to do amazing, better things. Sometimes that's coming into the company in a bigger role. A huge percentage, more than a third, of the people who manage our clients and projects and supervise these teams of cloud workers come from this workforce. They get promoted into that. But many of them go on: they will go study abroad, they will become teachers, start their own company, start their own NGO, or go on to another professional job from CloudFactory. All of that we consider to be success.
So that's a fun part for us. That social mission aspect, we did it because that's just the why of our business. What's cool, though, is to see that because we make that investment in that environment and that culture, we end up attracting and retaining people, and having them engaged and motivated, in a way that actually causes all of our clients to get way better results. Because in the end, when you sit down to do this data work, you're doing tasks that may be 10 seconds or 10 minutes or an hour. But when you're actually part of something bigger, and you care about what you're doing, and you're doing it alongside people that you care about, and you have flexibility at the same time, all of those things add up to zooming in, maybe a couple extra times, to see if it's a 6 or a 9, or to get that 1 pixel accuracy on a bounding box.
And that's, I think, what our clients really appreciate. Yes, the technology platform and a lot of the things we have in place are really interesting, but when you pair that with the culture that we're trying to create on the supply side, it leads to just better results for everyone.
[00:44:02] Unknown:
And 1 of the other challenges is elasticity in demand on your customer side, as far as being able to scale up and scale down in terms of having the people
[00:44:22] Unknown:
available to do the work. So I'm curious what your strategies are on that front. Yeah. Resource pooling and that multi-tenant access, right? CloudFactory, cloud labor, the tenets of cloud, that definitely includes those, and it's important. I think there's a little bit of a fallacy, though, to the idea that you need hundreds of thousands or millions of workers in order to get the elasticity. We certainly don't run into any clients that need that kind of imaginary elasticity. What it typically looks like is that, on any given day or week, they may need to add 25% or 50%, or maybe double their capacity.
What we do is give our clients kind of a dial, where they can upgrade or downgrade the number of hours they subscribe to on a monthly basis. Someone could be at 1,000 hours a month or 10,000 hours a month, and they can go ahead and just upgrade to add, say, 50% capacity. That can happen sometimes in as little as 2 days. And again, the way that we do that is through technology and through cross training. By having our cloud workers work on a number of projects, we can quickly have them gain the skills and training necessary to come in.
We also maintain a general queue, where people are coming into CloudFactory, taking general assessments, and going through onboarding, so they're ready to join a new work stream within 2 days. And then we have people within our existing 5,000 who are always looking to pick up extra work streams to get more hours. So it comes from how we've set up the resource pooling, by not having people dedicated 40 hours a week.
The way we usually think about it is that someone needs to be on a project at least 10 hours a week to really be dedicated enough to stay in. So 10 hours, 20 hours. In some cases they're 40 hours on 1 project, on 1 work stream, because that's required and that's what works best for that work stream. But typically, that ability to work on more gives us the elasticity such that if 1 week or 1 day someone needs extra capacity, we can dial that in the platform, and all of a sudden someone who's working 15 hours a week as a college student can pick up an extra 10 hours that week by accepting additional, we call them open shifts, to work on a work stream they already have the skills and training for. So there are lots of different ways that we're able to get that elasticity, and that's something that our clients love. Again, it's not this imaginary thing where you just create a form with a couple of bullet point instructions, send it out to an anonymous crowd, and instantly get access to thousands of people.
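The open-shifts mechanism described here could be sketched as a greedy assignment: when a client dials up capacity, offer the extra hours to workers already trained on that work stream who have spare time under a weekly cap. Everything in this sketch is a hypothetical simplification; the function name, the 10-hour per-worker bump, and the 48-hour weekly ceiling are assumptions drawn loosely from the numbers in the conversation.

```python
def fill_open_shifts(extra_hours, workers, stream, max_extra=10):
    """Greedy sketch of covering a capacity bump: offer extra 'open
    shift' hours on a work stream to workers already trained on it,
    capped per worker and by a 48-hour weekly platform ceiling."""
    assignments = {}
    remaining = extra_hours
    for w in workers:
        if remaining <= 0:
            break
        if stream not in w["trained_on"]:
            continue  # only offer shifts on streams the worker knows
        spare = min(max_extra, 48 - w["weekly_hours"])
        take = min(spare, remaining)
        if take > 0:
            assignments[w["name"]] = take
            remaining -= take
    return assignments, remaining  # remaining > 0 means unmet demand

pool = [
    {"name": "anita", "weekly_hours": 15, "trained_on": {"bounding-boxes"}},
    {"name": "raj",   "weekly_hours": 45, "trained_on": {"bounding-boxes"}},
    {"name": "mei",   "weekly_hours": 20, "trained_on": {"transcription"}},
]
assigned, unmet = fill_open_shifts(18, pool, "bounding-boxes")
print(assigned, unmet)  # {'anita': 10, 'raj': 3} 5
```

Any unmet hours would presumably then be routed to the general onboarding queue mentioned above, where newly assessed workers can be trained onto the stream within a couple of days.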
That's cool if, as an academic, you need to do a survey, or if it's a very simple true or false binary judgment and you've taken the time to set up all the quality control. That world sounds very cool, but for those people who are really doing data at scale, the type of elasticity we see is what we've designed CloudFactory for, and it seems to be working actually really well. And so
[00:47:53] Unknown:
1 of the other things that I'm curious about, given your position working with all these different clients, embedded in their R&D for different AI and machine learning projects, and your focus on having this positive social impact: what are your thoughts on the future of work, as AI and other digital technologies continue to disrupt existing industries and modify or replace certain jobs that have traditionally been held by humans? And how do your views and considerations
[00:48:29] Unknown:
on those shifts in the nature of work tie into your plans for CloudFactory in the medium to long term? It's a big question everyone's asking, isn't it? I definitely fall in the camp of the techno optimists. While there's going to be a disruption, probably bigger and faster than what we've seen historically, and I do think it's going to be a disruption, I am optimistic. We are seeing a hollowing out of the middle of work. Like I said, I continue to be surprised every day by the applications of AI that we're seeing. So all of a sudden we are tagging cells, right, that are going to be helping and speeding up and replacing portions of people's jobs that operate in that middle sector. There are data scientists on the high end and data labelers on the low end that are essentially teaming up to hollow out some of that middle work.
Sometimes it's not full jobs, it's only partial responsibilities of those jobs in the middle. But there's no question that's happening, it's accelerating, and it's going to happen faster than we're probably gonna be able to adjust as a society. But to get back to the techno optimist side, I do think that things are probably happening slower than what the news is talking about. We do have kind of an under the hood view, and we've seen even with some of the autonomous vehicle predictions that, in the real world, it might be 10 years later than we thought. A similar thing: OCR has been around for over 30 years, and yet I can't tell you the millions of invoices and receipts that we process every month, because there's so much that the technology still can't do.
Obviously, we are committed to helping companies build the tech to increase that, but also to fill in the gap. So again, I continue to believe that, yes, there are jobs that are going to become completely obsolete, and there are some that are gonna be reduced. But what we see is that a lot of jobs are being created as we develop AI and get it into production. We have to create new jobs just to sustain it. It's a new world where we're going to have to continue to build humans around the technology.
So, yeah, I think it all ends up in a place that looks very different, happens slower than what everyone is talking about, and creates more jobs than what everyone's talking about as well. It's an interesting time for us, isn't it, to be alive and to see what's about to happen over the next 5 to 10 years?
[00:51:39] Unknown:
And are there any other aspects of the work that you're doing at CloudFactory
[00:51:43] Unknown:
or the challenges of managing data labeling and machine learning projects and artificial intelligence that we didn't cover yet that you'd like to discuss before we close out the show? Yeah, I think the thing that I would add,
[00:51:56] Unknown:
Tobias, is that we see some of the companies that we work with struggling in a few areas. 1 of the bottlenecks we see is just getting access to the data. How do you collect and capture the data you need to build whatever it is you're dreaming up? And that's hard. If, for facial recognition, you need to capture 8 demographics, and you need 40,000 images, and you need people to look certain ways or do certain things, or if you need to get all sorts of visual data or what have you, the programs to actually capture that data are intense. There are people setting up booths in Las Vegas to get international demographics, handing out swag, and getting people to sign consent forms by giving them some sort of swag and having salespeople convince them as they walk by.
It sounds ridiculous, but companies right now are scraping the web, sending people around the world to take photos, renting out Airbnbs to stage them and do VR capture and lidar inside them. It's amazing. Obviously, they're also sending out cars around the world, sending out drones and satellites. So the whole data capture and data collection thing is just a wild west that everyone is trying to figure out: how to do it, how to scale it, and how to do it in a way that doesn't introduce biases. And then the next challenge is, okay, I've got the data, now I need to annotate it, I need to label it. Obviously, that's where we come in.
But learning how to do that well, and finding a partner that has the experience and the ability to help you get it done, considering ease and speed to market, is a huge bottleneck. I'm a little bit biased there, but I think most people would agree that that's a huge part of developing AI. And then obviously there's getting into production and all the other challenges, but I would just point out those first 2: actually collecting and capturing the data, and then actually annotating and labeling it and getting it into a training dataset that's going to be of high quality, and large enough for you to actually get a high performing algorithm. Those are really hard things that companies are figuring out how to do.
And people are realizing that, within their company, they've got different teams doing it different ways, and they're trying to figure out how to create centers of excellence to bring that all together. Then they can have 1 team focused on data collection, 1 team focused on data labeling, and build out the pipelines, the tooling, and the partnerships to do that. So that's what we spend a lot of our time on right now: helping companies learn how to do that well and begin to set themselves up for repeatable success as they develop more and more AI
[00:55:06] Unknown:
for their own business, and obviously for their clients and for the market. Alright. Well, for anybody who wants to get in touch with you or follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd just like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Biggest gap in tooling.
[00:55:28] Unknown:
That's a really good question. I wouldn't say it's necessarily a gap, but I think the biggest mistake that people make is just not going ahead and building something from scratch. What I've been most impressed by are the companies that very quickly pull something together, even if it means starting with a Google spreadsheet, or building off of something open source. It's a gap in the sense that getting that tool, optimizing that tool, and improving that tool is so important as part of your program that having ownership over it really matters. There's a lot of wisdom in that, and we see a lot of success in the people that make some investment there. So, yeah, I think it's easier than ever to do, and more and more companies are doing it, but I think that's probably 1 of the biggest gaps. Sometimes people think it's too big of a deal, or they rely on someone else to do it, and that can get them in a place where they might be locked in. They might not be able to make the changes they want, or get the data or the quality they want.
They might have bias in it because of how the tooling workflow is set up. There are all those sorts of things. So I would just say, when people jump in and actually make some investment in that tooling, even if it starts really small before they scale it, that's something we see people doing
[00:57:12] Unknown:
and they end up winning and being very thankful for the investment they made. Well, thank you very much for taking the time today to join me and discuss the work that you're doing at CloudFactory. It's definitely a very interesting business space, and I appreciate the social impact that you're focusing on as part of your business. It definitely seems to be doing well for you. So thank you for all of your efforts on that front, and I hope you enjoy the rest of your day. Same to you, Tobias. Thanks very much for the opportunity to share
[00:57:40] Unknown:
and the conversation. Thanks.
Introduction and Sponsor Messages
Interview with Mark Sears: Introduction
The Origin of CloudFactory
Common Requirements and Challenges in Data Labeling
Integration and Consistency in Data Labeling
Training and Onboarding Cloud Workers
Workflow and Client Experience
Ensuring Data Quality and Managing Bias
The Role of Humans in AI Lifecycle
Skills Development and Community Building for Cloud Workers
Elasticity and Demand Management
Future of Work and AI
Challenges in Data Labeling and AI Projects
Final Thoughts and Closing Remarks