Summary
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy
Interview
- Introduction
- How did you get involved in the area of data management?
- What is Ona and how did the company get started?
- What are some examples of the types of customers that you work with?
- What types of data do you support in your collection platform?
- What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
- Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?
- What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
- Can you describe the flow of the data from collection through to analysis?
- To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?
- What are the architectural considerations that you factored in when designing it?
- What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?
- What are your plans for the future of Ona and Canopy?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- OpenSRP
- Ona
- Canopy
- Open Data Kit
- Earth Institute at Columbia University
- Sustainable Engineering Lab
- WHO
- Bill and Melinda Gates Foundation
- XLSForms
- PostGIS
- Kafka
- Druid
- Superset
- Postgres
- Ansible
- Docker
- Terraform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don't have time? DataKitchen's DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and datasets while improving quality.
Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement today and sign up for the newsletter at datakitchen.io/de. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy. So, Peter, could you start by introducing yourself? Hi, Tobias. Thanks.
[00:01:38] Unknown:
So my name is Peter. I'm one of the cofounders and the current CTO of Ona. We work in the data management space. I have an office in New York, but we're sort of spread around globally, with a larger office in Nairobi, Kenya as well. And so do you remember how you first got involved and interested in the area of data management? Yeah. So a key motivation behind my current company is to bring what are standard tools in the commercial sector to the challenges in global health and humanitarian work. This intersected with my personal interests starting around 2003, with a Stanford undergraduate research grant and finding meaningful applications of machine learning research.
When we started our work at Ona, it became clear that the first step to building ML systems was to get data into a robust data management platform. Sometimes this meant making existing data systems accessible, but more often it was getting data digitized in the first place. And just as often as that, it was getting data collected in any form at all. So you ended up starting Ona to try and help bring some of these technologies
[00:02:46] Unknown:
to a broader scale and potentially smaller organizations. So can you discuss a bit about what it is that you do at Ona and how you got the company moving? Yeah. So our mission at Ona
[00:02:59] Unknown:
is to improve access to vital services through better data. A lot of the organizations we work with in the international development, humanitarian, and global health space are using tools that are, you know, not what we're used to in the commercial sector. They're often building customized systems and are sometimes not aware of the latest advances in data engineering and data management platforms. So our products and our services are offered both as SaaS products and as configurable solutions, where we might work more in depth with our partners and our clients. The work actually came out of research that all of the cofounders were doing together at Columbia University, at the Sustainable Engineering Lab there. That's where we first started the platform that you now find on our site at ona.io, and also a health systems platform called the Open Smart Register Platform, or OpenSRP.
All of this work sort of began in this research lab, and we saw that within that setting, we couldn't reach our customers. We couldn't take on the types of contracts and build the types of relationships we needed to be successful. So we sort of spun out of that and
[00:04:15] Unknown:
formed a company. That was about 5 years ago; it'll be 5 years in October. Can you give some examples of the types of customers and organizations that you work with at Ona, and some of the types of problems that they're using your platform to address?
[00:04:32] Unknown:
Sure. With larger organizations like the World Health Organization and the Gates Foundation, the problems they have are on a larger scale. They want something that they can promote as a comprehensive solution for tracking digital health and managing a large set of ongoing data collection problems. We also work with customers on more specific projects, where we might apply our platform to a specific use case. For example, we have a partnership working on malaria elimination in Southern Africa, where they use products built on top of our tools to track the vectors of malaria infection and spray houses to improve coverage and eliminate the disease. Another specific project we work on is in supply chain management, tracking vaccine supplies throughout the supply chain in low resource environments.
[00:05:28] Unknown:
And for people who are using your platform, what does the workflow look like for being able to collect data, and what are some of the types of environments that they're working in and the types of information that they're able to collect using your platform?
[00:05:45] Unknown:
So we use a generic standard called the XForm standard, and on top of that, an extension we came up with at the Columbia lab called XLSForms. Excel is such a common tool among our users that this has opened up a lot of opportunities for them to both create and share forms amongst each other. The standard supports text, numeric, image, and GPS point data, including dynamic selects, where you might have one form that embeds the results of collection that's happened in another form. So, for example, you could use one form to collect a set of villages, and then in the other form, have a field worker load up that set of villages and enter additional information about a specific village.
We also support generic programming concepts like conditional branching logic and repeat statements. We even had one group of partners that embedded a linear regression system into their forms so that they could do rapid diagnostics for TB based on patient symptoms at the time of collection.
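For readers who haven't worked with the standard, here is a minimal sketch of what an XLSForm might contain, written out as Python rows for readability and exported to a spreadsheet at the end. The column and sheet names follow the XLSForm spec; the village form itself, echoing the example above, is purely illustrative:

```python
# A minimal, illustrative XLSForm, written as Python rows for readability.
# Column names (type, name, label, relevant) and sheet names (survey,
# choices) come from the XLSForm spec; the specific fields are hypothetical.
import pandas as pd

survey = [
    {"type": "select_one villages", "name": "village",    "label": "Which village?"},
    {"type": "integer",             "name": "households", "label": "How many households?"},
    {"type": "geopoint",            "name": "location",   "label": "Record the GPS location"},
    # Conditional branching: only shown when a positive count was entered.
    {"type": "text",                "name": "notes",      "label": "Notes on the count",
     "relevant": "${households} > 0"},
]

# The choices sheet backing the select_one question. In the dynamic-select
# workflow described above, this list would instead be populated from
# submissions to an earlier village-registration form.
choices = [
    {"list_name": "villages", "name": "village_a", "label": "Village A"},
    {"list_name": "villages", "name": "village_b", "label": "Village B"},
]

with pd.ExcelWriter("village_form.xlsx") as xlsx:
    pd.DataFrame(survey).to_excel(xlsx, sheet_name="survey", index=False)
    pd.DataFrame(choices).to_excel(xlsx, sheet_name="choices", index=False)
```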
[00:06:50] Unknown:
And so when people are collecting data using the mobile forms, there's always the potential for inaccurate data entry, using the wrong data types, or skipping fields. So what are some of the mechanisms that you use to ensure accuracy and fidelity of the data that's being collected at the point of capture, to prevent integration or accuracy issues further down the analysis pipeline?
[00:07:19] Unknown:
Yeah. This is a really important and complex issue in the work we do. I think there are problems here on two sides. One is the technical problem of just ensuring that the data is transmitted correctly, and then there's the social problem of minimizing the incentives to enter inaccurate or fake data, whether that's intentional or by mistake. The technical piece is rather straightforward. You know, we make sure all of our tools can operate offline, and we synchronize when we get an Internet connection. The more challenging part is working with our partners, who usually manage their own teams of data collectors, and giving them the tools to define and visualize what accurate means in the context of their work.
Part of that is having filtered views, so an organization can define what an outlier is and use those to remove or highlight outliers and then follow up with their field staff to make corrections. One of the complexities here is that the incentive to report data quickly, and potentially incorrectly, so that they can hit a deadline or a work milestone is a sort of ever-present danger. To do more complex analysis on the data, maybe breaking it down by the field worker who collected it, looking for surprising values in it, performing outlier analysis, and other anomaly detection methods, we enable that through our API and some dashboards that we built, or that we'll build in partnership with our clients, so that they can pull data directly out of the collection platform as it's being collected, pull it into R, and perform whatever statistical analysis they need to. A related feature that we've had selectively available within the system for a long time, but are now rolling out to the general cloud platform, is a review-or-approve option, so that users can mark what is accurate and attach a comment thread to individual data submissions.
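As a concrete illustration of that workflow, here is a sketch of pulling submissions out of a collection API and flagging per-worker outliers. The transcript mentions analysts doing this in R; this is the same idea in Python, and the endpoint URL, auth scheme, field names, and z-score threshold are all assumptions rather than Ona's actual API contract:

```python
# Sketch of the analysis loop described above: pull submissions out of the
# collection platform's REST API and flag outliers per field worker.
import requests
import pandas as pd

resp = requests.get(
    "https://api.example.org/v1/data/12345",   # hypothetical form-data endpoint
    headers={"Authorization": "Token ..."},    # auth scheme is an assumption
)
resp.raise_for_status()
df = pd.DataFrame(resp.json())

# Flag submissions whose numeric reading is far from that field worker's mean.
grouped = df.groupby("field_worker")["households"]
z = (df["households"] - grouped.transform("mean")) / grouped.transform("std")
outliers = df[z.abs() > 3]
print(outliers[["field_worker", "households"]])
```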
[00:09:27] Unknown:
What are some of the potential difficulties in terms of ensuring that everyone is using the same version of a form, or at least being able to capture the version of the form as metadata, so that when you're doing analysis further on you don't accidentally mismatch field types or have gaps in the data that you're processing? And what about potential issues with data corruption at the storage level, before the data gets transmitted, or during transmission to the analytics platform?
[00:10:01] Unknown:
So on the side of data corruption, we rely on the open source platform Open Data Kit, which usually stores the raw data on the SD card on the device, so that if there is a problem in transmission, we still have a record on the device and they can retransmit it, or they can pull it off of the card onto a machine and then upload the raw files from the machine. So that handles most of the corruption-on-device issues. With regards to form versioning, we track the version of the form on our server side. So when you are submitting from the mobile application, it's against the specific version of the form that is stored on our server. And then in the management web application, you can choose to export from a specific version of the form.
So, for example, you might have removed some columns in a newer version; you can export from an older version to see those columns. A lot of our groups might export from a couple of versions and then merge those downstream. There's definitely, you know, a use case for doing that merging across form versions in the application, but that's not something we've addressed within Ona. We sort of see Ona's responsibilities as being around data collection. A data analysis tool, or our data warehousing tool Canopy, is more suited towards that type of merging and integration between different versions or multiple datasets.
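A sketch of what that downstream merge across form versions might look like, assuming CSV exports; the file names and the versioning column are invented for illustration. Columns present in only one version survive the concatenation as gaps:

```python
# Sketch of merging exports from two versions of the same form:
# export each version separately, then align columns and concatenate.
import pandas as pd

v1 = pd.read_csv("survey_export_v1.csv")   # older version, includes removed columns
v2 = pd.read_csv("survey_export_v2.csv")   # newer version

# Outer alignment keeps columns that exist in only one version, filling
# the gaps with NaN, and records which version each row came from.
merged = pd.concat(
    [v1.assign(form_version=1), v2.assign(form_version=2)],
    ignore_index=True, sort=False,
)
merged.to_csv("survey_export_merged.csv", index=False)
```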
[00:11:35] Unknown:
And when you're collecting data or receiving data submissions, is it primarily field workers with a given organization, who might have some form of training in how to collect or represent the data in whatever form is necessary? Or do you also have cases where any general user is able to submit information, either using the form or into the data platform, to scale out data collection when you don't necessarily have as many people available to go around and perform surveys or interviews or anything along those lines?
[00:12:13] Unknown:
Yeah. So both of those are supported in the platform. Definitely, most of our users, at least the majority, probably closer to 70% or 80%, operate with dedicated field workers who have user accounts and have some sense of what they are collecting on. But as a form manager, you can accept submissions from anywhere. We have had other groups that have done sort of broader population studies; they might distribute a form link through social media, have anybody submit to it, and do the analysis later, after the fact.
[00:12:55] Unknown:
And once the data is submitted, what are some of the integration challenges that you have found to be unique or particular to the types of information that is getting collected in these types of environments and these types of use cases?
[00:13:10] Unknown:
I think the biggest difference is that we have to be able to do everything offline. Often, these field workers are operating in an environment where they don't have access to Internet or cell networks, and power might be intermittent. In some cases, the lifetime of an entire project will happen completely off grid. So for a data collection project, you need to be able to do everything offline and have multiple options for syncing data back, plus a conflict resolution strategy, which we touched on a little bit. In the case of OpenSRP, our health systems application, this not only means making the forms available offline, but it also means syncing down to the device all the patient records relevant to a location, as well as the vaccination schedules and any related business logic around the workflows.
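The offline-first requirement usually reduces to a store-and-forward pattern: persist every submission locally first, then drain the queue when connectivity returns. This is a minimal sketch of that pattern, not ODK's or OpenSRP's actual sync protocol; the queue schema and endpoint are invented for illustration:

```python
# Store-and-forward sketch: submissions are persisted locally before any
# network is attempted, and only drained to the server when a connection
# is available.
import json
import sqlite3
import requests

db = sqlite3.connect("submissions.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

def record(submission: dict) -> None:
    """Persist a submission locally, whether or not we are online."""
    db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(submission),))
    db.commit()

def sync(endpoint: str) -> None:
    """Drain the local queue when connectivity returns; keep rows on failure."""
    for row_id, payload in db.execute("SELECT id, payload FROM outbox").fetchall():
        try:
            requests.post(endpoint, json=json.loads(payload), timeout=10).raise_for_status()
        except requests.RequestException:
            return  # still offline; retry on the next sync attempt
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        db.commit()
```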
[00:14:03] Unknown:
And can you walk us through the overall life cycle of data, from the point of collection through to storing and analyzing it in your data warehouse, and the various systems that it traverses in the process? Yeah, definitely.
[00:14:17] Unknown:
So I think, maybe starting at a higher level, most of these projects start with a program. There's some set of goals that our partners or our customers have, and those get coalesced into a data model. Usually it's defined through a form, that is, the data collection form, but it also might involve pieces of external data, demographic data, or metadata on the length of a visit or an interview that results in data collection. This form is created, along with links to metadata, on our platform, and then gets synced down to mobile apps.
It could be web forms in a call center. It might be embedded in another mobile or web app. And then once it's on device, field workers will start entering data and we'll start receiving submissions. They'll come in through our backend API's load balancer and get routed to an API instance. These incoming submissions get matched to a form ID and a version ID on our side. The raw submission data gets stored in a flat file store like an S3 bucket, while a parsed version goes into a PostGIS database. And then once we get any data back at all, it becomes available for analysis.
So as a manager, you can see it in our web platform. You can see geospatial maps of where the data is coming from; you can overlay that with hex bins or choropleths; you can group the data by field worker or by the answers to a specific question. A lot of the problems in data collection happen while the data is being collected. As programmers, we're used to being able to enter something into a REPL, get the output immediately, and adjust the way we're thinking about our problem. In the programs we work with, that's often not the case. It's as if you've entered something into a REPL and then a week later you get the result. This is obviously not a good thing, so we want to focus on getting that analysis back to the program managers as quickly as possible so they can adjust their programs and improve the impact they're having.
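Piecing the described flow together, an ingestion step might look roughly like the following sketch: the raw payload is archived untouched in a flat file store while a parsed version, GPS point included, lands in PostGIS. The bucket, table, and column names are assumptions, not Ona's actual schema:

```python
# Sketch of the ingestion path described above: keep the raw submission in
# a flat file store and write a parsed row to PostGIS for querying/mapping.
import json
import boto3
import psycopg2

s3 = boto3.client("s3")
pg = psycopg2.connect("dbname=collection")

def ingest(submission_id: str, raw: bytes) -> None:
    # 1. The raw payload goes to the flat file store untouched, so it can
    #    always be re-parsed later.
    s3.put_object(Bucket="raw-submissions", Key=f"{submission_id}.json", Body=raw)

    # 2. A parsed version goes into PostGIS; note lon/lat order in ST_MakePoint.
    doc = json.loads(raw)
    lat, lon = doc["location"]["lat"], doc["location"]["lon"]
    with pg, pg.cursor() as cur:
        cur.execute(
            """INSERT INTO submissions (id, form_id, form_version, geom, fields)
               VALUES (%s, %s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326), %s)""",
            (submission_id, doc["form_id"], doc["version"], lon, lat, json.dumps(doc)),
        )
```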
[00:16:22] Unknown:
And when I was doing some of the initial investigation about the work that you're doing, it looked like you're using NiFi as sort of the central routing and integration component for collecting the data and then distributing it to various destination points. I'm curious what your experience has been using that, and some of the decision making that went into choosing it as the backbone of your platform versus any of the other available tooling?
[00:16:50] Unknown:
So over the years, as we've worked on our data collection system, we saw this common theme where users want some further analysis on top of their data. It goes into one platform, and then there's a custom web app that's built to process it in some minor way, or a custom visualization that's built to display the customized version. The decision to go with NiFi came out of needing a common integration layer for the various data sources that we would connect to. For example, the health data that we're collecting in our OpenSRP system, we often want to view that in line with data that's coming in through Ona.
So in past projects, a couple years ago, we might have built a web service that pulls in these two data streams, but using NiFi lets us have that general data integration layer, which is accessible not just to engineers but also to data analysts. You know, we can turn around our projects quicker, develop templates, and have a sort of organization-wide standard. In terms of the processing layer, we actually do use Kafka for routing after data comes into NiFi. The decision for Kafka versus Spark or Flink or one of those other tools is that we were more interested in the storage in a topic queue than in doing complex event processing at this point, and Kafka fit that use case best.
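The property Peter is pointing at is that a Kafka topic is durable storage you can replay, not just a pipe. Here is a small sketch of that usage with the kafka-python client; the broker address and topic name are invented for illustration:

```python
# Sketch of Kafka as a durable topic queue between the integration layer
# and downstream stores, rather than as a stream processor.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# NiFi (or any ingester) would publish each submission onto a topic...
producer.send("submissions", {"form_id": "12345", "village": "village_a"})
producer.flush()

# ...and any number of downstream consumers can read, or replay, the topic
# independently -- the storage property that motivated Kafka over Spark/Flink.
consumer = KafkaConsumer(
    "submissions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # replay from the start of the topic
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value)
```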
[00:18:27] Unknown:
I'm not sure of the timeline, but somewhat recently you began work on the Canopy project as, you mentioned, a destination point for the information that's collected and a way for your customers to analyze the data. So I'm curious: what was the tipping point where you decided it was worth the time and effort to go through designing and building Canopy versus using some of the off-the-shelf platforms?
[00:18:57] Unknown:
Yeah. So what came together here was the maturity of the tooling and a really good partner that we could work with, who had a very clear use case. And Canopy, to be clear, is not a custom-built application. It's actually a collection of various open source tools: a wrapper around NiFi, Kafka, Druid, Postgres, Superset, and a custom tool we built called Jocita. Usually an implementation isn't using all of these pieces; it's using each for a specific step in the process. So a common implementation that we have might pull in data from multiple sources using NiFi, push that into Kafka for persistence and replay, load that into Postgres as a data store, and then visualize it in Superset.
So as a user, you might see your configuration of a form in Ona, and then you'd see an exploratory visualization in Superset. The rest of it would be opaque to you.
[00:19:58] Unknown:
And what have you found to be some of the most challenging or unexpected aspects of trying to integrate those various components, some of the difficulties that you're facing either currently or recently, and a bit about the evolution of the architecture
[00:20:19] Unknown:
from the initial concept to the current state? Yeah. I think when we started out, our use case was more built around high availability and larger scale data systems. As we worked more with our customers, we realized that that wasn't their immediate need. It was positive in the sense that we got to experiment a lot with Druid and actually roll it out in a couple of production environments, so we know it's available as our clients want to scale. But for the immediate term, you know, a simple data store like Postgres is fine. It also gave us a bit of, I'd say, discipline in building strict abstraction layers between the ingestor and the data store, so that if we wanted to swap out our data store in the future, that would be built into the architecture.
Another sort of challenge we've come up against with the clients that we work with is wanting to host everything on-site, on their own infrastructure. Working in global health systems, countries often aren't comfortable having health data about their citizens stored in data centers that they don't control, which is entirely understandable. So from day one, part of the architecture was to make sure everything would be deployable on-site and would not rely on any external services. That meant detailed Docker Compose files, Ansible playbooks, and Terraform plans, so we could easily stand up the full stack with a single command.
We're big advocates of automation and infrastructure as code.
[00:22:02] Unknown:
Yeah. I definitely agree with that point. That's sort of the holy grail of any infrastructure project: push-button deployment, where you start with nothing and then 15 minutes later you have everything up and running nicely. That's a big part of what I do in my day to day, so it's nice to hear that type of concern reflected elsewhere.
[00:22:22] Unknown:
Yeah. Definitely. And I think we see sort of the next step here. We can now do this where we might customize configuration files, but, you know, obviously there's a meta-tool that can be built around this, where your deployments turn into configuration files and those configuration files get translated into a country-specific Dockerfile, Ansible playbook, and Terraform plan. So we're sort of looking towards the future and building that out. For people who might be considering using Canopy in their own environments,
[00:22:53] Unknown:
what are some of the pieces of advice that you would provide, or notes of caution that they should be thinking about, that might dissuade them from using Canopy or encourage them to use it, given a particular set of use cases?
[00:23:06] Unknown:
Definitely, the better you understand your problems, the easier it is to get the full benefit out of the solutions that are out there. Right? A lot of that might involve, you know, coming up with a draft of what your problems and use cases are, reevaluating those problems as you learn about the existing use cases, and then trying to break down your problems to see whether there is a single platform that can solve the challenges you're facing or whether it's an amalgamation of a couple of solutions. I think one thing to also keep in mind is that the more you can break down your problems into subproblems, the easier it will be to find solutions that have alternatives, so that you can either grow as a subproblem shifts and there's another alternative that solves it in a better way, or switch if a competitor becomes available that ends up being a better solution for that subproblem. So the modularity, the flexibility, the extensibility within the challenges you're facing is important to keep in mind and clarify as much as possible.
[00:24:10] Unknown:
What are your plans for the future of Ona and Canopy, and some of the goals that you have going forward?
[00:24:19] Unknown:
Yeah. So both the technical vision and the mission behind what we are doing has always been to get to the point where we're using the information that's received through our platforms to automatically improve and optimize programs while they are in process. One example of this: when working with a nutrition program in Somalia during last year's drought, we saw that food distribution sites were placed based on where the security situation was safest, and not where there was the greatest demand for food. So, you know, currently, the way things are set up, that's found by visual inspection. You can see on some geospatial maps, pulling in from our data collection platform, that these circles don't line up.
But in the future, we'd like that type of insight to be generated automatically so that it can create notifications or alerts, and a program manager can use that to refine their programs and have greater impact. So closing that circle between data collection, analysis, and
[00:25:26] Unknown:
improvement. And are there any other aspects of the work that you're doing with Ona and Canopy, or the area of data collection and humanitarian
[00:25:38] Unknown:
data efforts, that we didn't discuss that you think we should cover before we close out the show? Not particularly. I guess I'd just like to add that there is a lot of interesting work going on in the humanitarian and global health sector that a lot of us in the tech world don't get exposed to that often. It's a very interesting place to see the limits of tools that we built for an environment that's always connected and always powered. Some of those limits really expose new problems that can lead to insights and improve the state of technology globally.
So it's been a very interesting experience for me to get involved in this area and have my conceptions about how the technology that we use every day works be challenged.
[00:26:27] Unknown:
Yeah. It's definitely easy to overlook some of the different ways that data is generated, collected, and used, because there's so much attention being paid to big data and fast data, things moving at high velocity and high scale, that it's easy to forget that there are a lot of interesting challenges with small, granular data: issues with data integration and data cleanliness for distributed data collection, and some of the various problems that go into that. So, yeah, I definitely second your point that it's easy to get stuck in the tunnel vision of what a certain segment of the tech population is discussing and become blind to some of the other areas where our skills and technologies can be put to use.
[00:27:19] Unknown:
Yeah. We often joke that we focus on fat and short data. Compared to big data, where you have a couple of columns and millions of records, a lot of our datasets are hundreds of columns and thousands of records. So it's a different use case, and there are a lot of interesting challenges there. And for anybody who wants to follow the work that you're up to or get in touch about anything we talked about here, I'll have you add your preferred contact information to the show notes.
[00:27:47] Unknown:
And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:27:57] Unknown:
The biggest gap that I see is automated systems to merge and integrate different schemas. A lot of the problems we have are similar datasets with different schemas that we want to see in a unified way, and right now those are combined on an ad hoc basis. Sure, we have the ontologies, the ideal data dictionaries; those all exist out there. But on the ground, in the computer systems that are distributed throughout the world, that's not what the data looks like. To really scale impact in data integration, automated schema merging tools are going to have to be built.
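As a toy illustration of the gap being described, here is a naive attempt at automated schema alignment: given two similar datasets with different column names, propose correspondences by string similarity. Real tooling would lean on ontologies, data profiling, and human review; the schemas and the cutoff here are invented for illustration:

```python
# Toy schema-alignment sketch: propose a target column for each source
# column using only string similarity from the standard library.
import difflib

schema_a = ["household_id", "village_name", "num_children", "visit_date"]
schema_b = ["hh_id", "village", "children_count", "date_of_visit"]

def propose_mapping(source, target, cutoff=0.4):
    """Suggest a target column for each source column, or None if no match."""
    mapping = {}
    for col in source:
        matches = difflib.get_close_matches(col, target, n=1, cutoff=cutoff)
        mapping[col] = matches[0] if matches else None
    return mapping

print(propose_mapping(schema_a, schema_b))
# e.g. {'household_id': 'hh_id', 'village_name': 'village', ...}; a human
# would still need to review and correct the proposal.
```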
[00:28:41] Unknown:
We're looking forward to integrating that and helping build that in the future. Alright. Well, thank you very much for your time and for telling me about the work that you're doing with Ona and Canopy. It's definitely an interesting problem space, and it's good to see that people are focusing on it. So thank you for your time today, and I hope you enjoy the rest of your evening. Thanks, Tobias.
[00:29:02] Unknown:
I was glad to be on here. Thank you for having me.
Introduction to Peter Lubell-Doughtie and Ona
Motivation and Mission of Ona
Ona's Clients and Use Cases
Data Collection Workflow and Standards
Ensuring Data Accuracy and Integrity
Handling Data Corruption and Form Versioning
User Types and Data Submission
Integration Challenges and Offline Capabilities
Lifecycle of Data from Collection to Analysis
Using NiFi and Kafka for Data Integration
Introduction to Canopy Project
Challenges and Evolution of Canopy Architecture
Advice for Using Canopy
Future Plans for Ona and Canopy
Impact of Data Collection in Humanitarian Efforts
Biggest Gap in Data Management Technology
Closing Remarks