Summary
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy
Interview
- Introduction
- How did you get involved in the area of data management?
- What is Ona and how did the company get started?
- What are some examples of the types of customers that you work with?
- What types of data do you support in your collection platform?
- What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
- Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?
- What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
- Can you describe the flow of the data from collection through to analysis?
- To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?
- What are the architectural considerations that you factored in when designing it?
- What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?
- What are your plans for the future of Ona and Canopy?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- OpenSRP
- Ona
- Canopy
- Open Data Kit
- Earth Institute at Columbia University
- Sustainable Engineering Lab
- WHO
- Bill and Melinda Gates Foundation
- XLSForms
- PostGIS
- Kafka
- Druid
- Superset
- Postgres
- Ansible
- Docker
- Terraform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network, all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. And are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don't have time? DataKitchen's DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and datasets while improving quality.
Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement today and sign up for the newsletter at datakitchen.io/de. After that, learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Your host is Tobias Macey, and today I'm interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy. So, Peter, could you start by introducing yourself? Hi, Tobias. Thanks.
[00:01:38] Unknown:
So my name is Peter. I'm one of the cofounders and the current CTO of Ona. We work in the data management space. I have an office in New York, but we're sort of spread around globally, with a larger office in Nairobi, Kenya as well. And so do you remember how you first got involved and interested in the area of data management? Yeah. So a key motivation behind my current company is to bring what are standard tools in the commercial sector to the challenges in global health and humanitarian work. This intersected with my personal interests starting around 2003, with a Stanford undergraduate research grant and finding meaningful applications of machine learning research.
When we started our work at Ona, it became clear that the first step to building ML systems was to get data into a robust data management platform. Sometimes this meant making existing data systems accessible, but more often it was getting data digitized in the first place. And just as often as that, it was getting data collected in any form at all. So you ended up starting Ona to try and help bring some of these technologies
[00:02:46] Unknown:
to a broader scale and potentially smaller organizations. So can you discuss a bit about what it is that you do at Ona and how you got the company moving? Yeah. So our mission at Ona
[00:02:59] Unknown:
is to improve access to vital services through better data. A lot of the organizations we work with in the international development, humanitarian, and global health space are using tools that are, you know, not what we're used to in the commercial sector. They're often building customized systems and are sometimes not aware of the latest advances in data engineering and data management platforms. So our products and our services are offered both as SaaS products and as configurable solutions, where we might work more in depth with our partners and our clients. The work actually came out of research that all of the cofounders were doing together at Columbia University, at the Sustainable Engineering Lab there. That's where we first started the platform that you now find on our site at ona.io, and also a health systems platform called the Open Smart Register Platform, or OpenSRP.
All of this work sort of began in this research lab, and we saw that within that setting, we couldn't reach our customers. We couldn't take on the types of contracts and build the types of relationships we needed to be successful. So we sort of spun out of that and
[00:04:15] Unknown:
formed a company. That was about 5 years ago; it'll be 5 years in October. Can you give some examples of the types of customers and organizations that you work with at Ona, and some of the types of problems that they're using your platform to address?
[00:04:32] Unknown:
Sure. With larger organizations like the World Health Organization and the Gates Foundation, the problems they have are on a larger scale. They want something that they can promote as a comprehensive solution for tracking digital health and managing a large set of ongoing data collection problems. We also work with customers on more specific projects, where we might apply our platform to a specific use case. For example, we have a partnership working on malaria elimination in Southern Africa, where they use products built on top of our tools to track the vectors of malaria infection and spray houses to improve coverage and eliminate the disease. Another specific project we work on is in supply chain management, tracking vaccine supplies throughout the supply chain in low resource environments.
[00:05:28] Unknown:
And for people who are using your platform, what does the workflow look like for being able to collect data, and what are some of the types of environments that they're working in and the types of information that they're able to collect using your platform?
[00:05:45] Unknown:
So we use a generic standard called the XForm standard, and on top of that, an extension we came up with at the Columbia lab called XLSForms. Excel is such a common tool among our users that this has opened up a lot of opportunities for them to both create and share forms amongst each other. The standard supports text, numeric, image, and GPS point data, including dynamic selects, where you might have one form that embeds the results of collection that's happened in another form. So, for example, you could use one form to collect a set of villages, and then in the other form, have a field worker load up that set of villages and enter additional information about a specific village.
We also support generic programming concepts like conditional branching logic and repeat statements. We even had one group of partners that embedded a linear regression system into their forms so that they could do rapid diagnostics for TB based on patient symptoms at the time of collection.
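For readers who haven't worked with the standard, here is a minimal sketch of what an XLSForm might contain, written out as Python rows for readability and exported to a spreadsheet at the end. The column and sheet names follow the XLSForm spec; the village form itself, echoing the example above, is purely illustrative:

```python
# A minimal, illustrative XLSForm, written as Python rows for readability.
# Column names (type, name, label, relevant) and sheet names (survey,
# choices) come from the XLSForm spec; the specific fields are hypothetical.
import pandas as pd

survey = [
    {"type": "select_one villages", "name": "village",    "label": "Which village?"},
    {"type": "integer",             "name": "households", "label": "How many households?"},
    {"type": "geopoint",            "name": "location",   "label": "Record the GPS location"},
    # Conditional branching: only shown when a positive count was entered.
    {"type": "text",                "name": "notes",      "label": "Notes on the count",
     "relevant": "${households} > 0"},
]

# The choices sheet backing the select_one question. In the dynamic-select
# workflow described above, this list would instead be populated from
# submissions to an earlier village-registration form.
choices = [
    {"list_name": "villages", "name": "village_a", "label": "Village A"},
    {"list_name": "villages", "name": "village_b", "label": "Village B"},
]

with pd.ExcelWriter("village_form.xlsx") as xlsx:
    pd.DataFrame(survey).to_excel(xlsx, sheet_name="survey", index=False)
    pd.DataFrame(choices).to_excel(xlsx, sheet_name="choices", index=False)
```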
[00:06:50] Unknown:
And so when people are collecting data using the mobile forms, there's always the potential for inaccurate data entry, using the wrong data types, or skipping fields. So what are some of the mechanisms that you use to ensure accuracy and fidelity of the data that's being collected at the point of capture, to prevent integration or accuracy issues further down the analysis pipeline?
[00:07:19] Unknown:
Yeah. This is a really important and complex issue in the work we do. I think there are problems here on two sides. One is the technical problem of just ensuring that the data is transmitted correctly, and then there's the social problem of minimizing the incentives to enter inaccurate or fake data, whether that's intentional or by mistake. The technical piece is rather straightforward. You know, we make sure all of our tools can operate offline, and we synchronize when we get an Internet connection. The more challenging part is working with our partners, who usually manage their own teams of data collectors, and giving them the tools to define and visualize what accurate means in the context of their work.
Part of that is having filtered views, so an organization can define what an outlier is and use those to remove or highlight outliers and then follow up with their field staff to make corrections. One of the complexities here is that the incentive to report data quickly, and potentially incorrectly, so that they can hit a deadline or a work milestone is a sort of ever-present danger. To do more complex analysis on the data, maybe breaking it down by the field worker who collected it, looking for surprising values in it, performing outlier analysis, and other anomaly detection methods, we enable that through our API and some dashboards that we built, or that we'll build in partnership with our clients, so that they can pull data directly out of the collection platform as it's being collected, pull it into R, and perform whatever statistical analysis they need to. A related feature that we've had selectively available within the system for a long time, but are now rolling out to the general cloud platform, is a review-or-approve option, so that users can mark what is accurate and attach a comment thread to individual data submissions.
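As a concrete illustration of that workflow, here is a sketch of pulling submissions out of a collection API and flagging per-worker outliers. The transcript mentions analysts doing this in R; this is the same idea in Python, and the endpoint URL, auth scheme, field names, and z-score threshold are all assumptions rather than Ona's actual API contract:

```python
# Sketch of the analysis loop described above: pull submissions out of the
# collection platform's REST API and flag outliers per field worker.
import requests
import pandas as pd

resp = requests.get(
    "https://api.example.org/v1/data/12345",   # hypothetical form-data endpoint
    headers={"Authorization": "Token ..."},    # auth scheme is an assumption
)
resp.raise_for_status()
df = pd.DataFrame(resp.json())

# Flag submissions whose numeric reading is far from that field worker's mean.
grouped = df.groupby("field_worker")["households"]
z = (df["households"] - grouped.transform("mean")) / grouped.transform("std")
outliers = df[z.abs() > 3]
print(outliers[["field_worker", "households"]])
```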
[00:09:27] Unknown:
What are some of the potential difficulties in terms of ensuring that everyone is using the same version of a form, or at least being able to capture the version of the form as metadata, so that when you're doing analysis further on you don't accidentally mismatch field types or have gaps in the data that you're processing? And what about potential issues with data corruption at the storage level, before the data gets transmitted, or during transmission to the analytics platform?
[00:10:01] Unknown:
So on the side of data corruption, we rely on the open source platform Open Data Kit, which usually stores the raw data on the SD card on the device, so that if there is a problem in transmission, we still have a record on the device and they can retransmit it, or they can pull it off of the card onto a machine and then upload the raw files from the machine. So that handles most of the corruption-on-device issues. With regards to form versioning, we track the version of the form on our server side. So when you are submitting from the mobile application, it's against the specific version of the form that is stored on our server. And then in the management web application, you can choose to export from a specific version of the form.
So, for example, you might have removed some columns in a newer version; you can export from an older version to see those columns. A lot of our groups might export from a couple of versions and then merge those downstream. There's definitely, you know, a use case for doing that merging across form versions in the application, but that's not something we've addressed within Ona. We sort of see Ona's responsibilities as being around data collection. A data analysis tool, or our data warehousing tool Canopy, is more suited towards that type of merging and integration between different versions or multiple datasets.
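A sketch of what that downstream merge across form versions might look like, assuming CSV exports; the file names and the versioning column are invented for illustration. Columns present in only one version survive the concatenation as gaps:

```python
# Sketch of merging exports from two versions of the same form:
# export each version separately, then align columns and concatenate.
import pandas as pd

v1 = pd.read_csv("survey_export_v1.csv")   # older version, includes removed columns
v2 = pd.read_csv("survey_export_v2.csv")   # newer version

# Outer alignment keeps columns that exist in only one version, filling
# the gaps with NaN, and records which version each row came from.
merged = pd.concat(
    [v1.assign(form_version=1), v2.assign(form_version=2)],
    ignore_index=True, sort=False,
)
merged.to_csv("survey_export_merged.csv", index=False)
```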
[00:11:35] Unknown:
And when you're collecting data or receiving data submissions, is it primarily field workers with a given organization, who might have some form of training in how to collect or represent the data in whatever form is necessary? Or do you also have cases where any general user is able to submit information, either using the form or into the data platform, to scale out data collection when you don't necessarily have as many people available to go around and perform surveys or interviews or anything along those lines?
[00:12:13] Unknown:
Yeah. So both of those are supported in the platform. Definitely, most of our users, at least the majority, probably closer to 70% or 80%, operate with dedicated field workers who have user accounts and have some sense of what they are collecting on. But as a form manager, you can accept submissions from anywhere. We have had other groups that have done sort of broader population studies; they might distribute a form link through social media, have anybody submit to it, and do the analysis later, after the fact.
[00:12:55] Unknown:
And once the data is submitted, what are some of the integration challenges that you have found to be unique or particular to the types of information that is getting collected in these types of environments and these types of use cases?
[00:13:10] Unknown:
I think the biggest difference is that we have to be able to do everything offline. Often, these field workers are operating in an environment where they don't have access to Internet or cell networks, and power might be intermittent. In some cases, the lifetime of an entire project will happen completely off grid. So for a data collection project, you need to be able to do everything offline and have multiple options for syncing data back, plus a conflict resolution strategy, which we touched on a little bit. In the case of OpenSRP, our health systems application, this not only means making the forms available offline, but it also means syncing down to the device all the patient records relevant to a location, as well as the vaccination schedules and any related business logic around the workflows.
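The offline-first requirement usually reduces to a store-and-forward pattern: persist every submission locally first, then drain the queue when connectivity returns. This is a minimal sketch of that pattern, not ODK's or OpenSRP's actual sync protocol; the queue schema and endpoint are invented for illustration:

```python
# Store-and-forward sketch: submissions are persisted locally before any
# network is attempted, and only drained to the server when a connection
# is available.
import json
import sqlite3
import requests

db = sqlite3.connect("submissions.db")
db.execute("CREATE TABLE IF NOT EXISTS outbox (id INTEGER PRIMARY KEY, payload TEXT)")

def record(submission: dict) -> None:
    """Persist a submission locally, whether or not we are online."""
    db.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(submission),))
    db.commit()

def sync(endpoint: str) -> None:
    """Drain the local queue when connectivity returns; keep rows on failure."""
    for row_id, payload in db.execute("SELECT id, payload FROM outbox").fetchall():
        try:
            requests.post(endpoint, json=json.loads(payload), timeout=10).raise_for_status()
        except requests.RequestException:
            return  # still offline; retry on the next sync attempt
        db.execute("DELETE FROM outbox WHERE id = ?", (row_id,))
        db.commit()
```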
[00:14:03] Unknown:
And can you walk us through the overall life cycle of data, from the point of collection through to storing and analyzing it in your data warehouse, and the various systems that it traverses in the process? Yeah, definitely.
[00:14:17] Unknown:
So I think, maybe starting at a higher level, most of these projects start with a program. There's some set of goals that our partners or our customers have, and those get coalesced into a data model. Usually it's defined through a form, that is, the data collection form, but it also might involve pieces of external data, demographic data, or metadata on the length of a visit or an interview that results in data collection. This form is created, along with links to metadata, on our platform, and then gets synced down to mobile apps.
It could be web forms in a call center. It might be embedded in another mobile or web app. And then once it's on device, field workers will start entering data and we'll start receiving submissions. They'll come in through our backend API's load balancer and get routed to an API instance. These incoming submissions get matched to a form ID and a version ID on our side. The raw submission data gets stored in a flat file store like an S3 bucket, while a parsed version goes into a PostGIS database. And then once we get any data back at all, it becomes available for analysis.
So as a manager, you can see it in our web platform. You can see geospatial maps of where the data is coming from; you can overlay that with hex bins or choropleths; you can group the data by field worker or by the answers to a specific question. A lot of the problems in data collection happen while the data is being collected. As programmers, we're used to being able to enter something into a REPL, get the output immediately, and adjust the way we're thinking about our problem. In the programs we work with, that's often not the case. It's as if you've entered something into a REPL and then a week later you get the result. This is obviously not a good thing, so we want to focus on getting that analysis back to the program managers as quickly as possible so they can adjust their programs and improve the impact they're having.
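Piecing the described flow together, an ingestion step might look roughly like the following sketch: the raw payload is archived untouched in a flat file store while a parsed version, GPS point included, lands in PostGIS. The bucket, table, and column names are assumptions, not Ona's actual schema:

```python
# Sketch of the ingestion path described above: keep the raw submission in
# a flat file store and write a parsed row to PostGIS for querying/mapping.
import json
import boto3
import psycopg2

s3 = boto3.client("s3")
pg = psycopg2.connect("dbname=collection")

def ingest(submission_id: str, raw: bytes) -> None:
    # 1. The raw payload goes to the flat file store untouched, so it can
    #    always be re-parsed later.
    s3.put_object(Bucket="raw-submissions", Key=f"{submission_id}.json", Body=raw)

    # 2. A parsed version goes into PostGIS; note lon/lat order in ST_MakePoint.
    doc = json.loads(raw)
    lat, lon = doc["location"]["lat"], doc["location"]["lon"]
    with pg, pg.cursor() as cur:
        cur.execute(
            """INSERT INTO submissions (id, form_id, form_version, geom, fields)
               VALUES (%s, %s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326), %s)""",
            (submission_id, doc["form_id"], doc["version"], lon, lat, json.dumps(doc)),
        )
```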
[00:16:22] Unknown:
And when I was doing some of the initial investigation about the work that you're doing, it looked like you're using NiFi as sort of the central routing and integration component for collecting the data and then distributing it to various destination points. I'm curious what your experience has been using that, and some of the decision making that went into choosing it as the backbone of your platform versus any of the other available tooling?
[00:16:50] Unknown:
So over the years, as we've worked on our data collection system, we saw this common theme where users want some further analysis on top of their data. It goes into one platform, and then there's a custom web app that's built to process it in some minor way, or a custom visualization that's built to display the customized version. The decision to go with NiFi came out of needing a common integration layer for the various data sources that we would connect to. For example, the health data that we're collecting in our OpenSRP system, we often want to view that in line with data that's coming in through Ona.
So in past projects, a couple years ago, we might have built a web service that pulls in these two data streams, but using NiFi lets us have that general data integration layer, which is accessible not just to engineers but also to data analysts. You know, we can turn around our projects quicker, develop templates, and have a sort of organization-wide standard. In terms of the processing layer, we actually do use Kafka for routing after data comes into NiFi. The decision for Kafka versus Spark or Flink or one of those other tools is that we were more interested in the storage in a topic queue than in doing complex event processing at this point, and Kafka fit that use case best.
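The property Peter is pointing at is that a Kafka topic is durable storage you can replay, not just a pipe. Here is a small sketch of that usage with the kafka-python client; the broker address and topic name are invented for illustration:

```python
# Sketch of Kafka as a durable topic queue between the integration layer
# and downstream stores, rather than as a stream processor.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# NiFi (or any ingester) would publish each submission onto a topic...
producer.send("submissions", {"form_id": "12345", "village": "village_a"})
producer.flush()

# ...and any number of downstream consumers can read, or replay, the topic
# independently -- the storage property that motivated Kafka over Spark/Flink.
consumer = KafkaConsumer(
    "submissions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",   # replay from the start of the topic
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value)
```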
[00:18:27] Unknown:
I'm not sure of the timeline, but somewhat recently you began work on the Canopy project as, you mentioned, a destination point for the information that's collected and a way for your customers to analyze the data. So I'm curious: what was the tipping point where you decided it was worth the time and effort to go through designing and building Canopy versus using some of the off-the-shelf platforms?
[00:18:57] Unknown:
Yeah. So what came together here was the maturity of the tooling and a really good partner that we could work with, who had a very clear use case. And Canopy, to be clear, is not a custom-built application. It's actually a collection of various open source tools: a wrapper around NiFi, Kafka, Druid, Postgres, Superset, and a custom tool we built called Jocita. Usually an implementation isn't using all of these pieces; it's using each for a specific step in the process. So a common implementation that we have might pull in data from multiple sources using NiFi, push that into Kafka for persistence and replay, load that into Postgres as a data store, and then visualize it in Superset.
So as a user, you might see your configuration of a form in Ona, and then you'd see an exploratory visualization in Superset. The rest of it would be opaque to you.
[00:19:58] Unknown:
And what have you found to be some of the most challenging or unexpected aspects of trying to integrate those various components, some of the difficulties that you're facing either currently or recently, and a bit about the evolution of the architecture
[00:20:19] Unknown:
from the initial concept to the current state? Yeah. I think when we started out, our use case was more built around high availability and larger scale data systems. As we worked more with our customers, we realized that that wasn't their immediate need. It was positive in the sense that we got to experiment a lot with Druid and actually roll it out in a couple of production environments, so we know it's available as our clients want to scale. But for the immediate term, you know, a simple data store like Postgres is fine. It also gave us a bit of, I'd say, discipline in building strict abstraction layers between the ingestor and the data store, so that if we wanted to swap out our data store in the future, that would be built into the architecture.
Another sort of challenge we've come up against with the clients that we work with is wanting to host everything on-site, on their own infrastructure. Working in global health systems, countries often aren't comfortable having health data about their citizens stored in data centers that they don't control, which is entirely understandable. So from day one, part of the architecture was to make sure everything would be deployable on-site and would not rely on any external services. That meant detailed Docker Compose files, Ansible playbooks, and Terraform plans, so we could easily stand up the full stack with a single command.
We're big advocates of automation and infrastructure as code.
[00:22:02] Unknown:
Yeah. I definitely agree with that point. That's sort of the holy grail of any infrastructure project: push-button deployment, where you start with nothing and then 15 minutes later you have everything up and running nicely. That's a big part of what I do in my day to day, so it's nice to hear that type of concern reflected elsewhere.
[00:22:22] Unknown:
Yeah. Definitely. And I think we see sort of the next step here. We can now do this where we might customize configuration files, but, you know, obviously there's a meta-tool that can be built around this, where your deployments turn into configuration files and those configuration files get translated into a country-specific Dockerfile, Ansible playbook, and Terraform plan. So we're sort of looking towards the future and building that out. For people who might be considering using Canopy in their own environments,
[00:22:53] Unknown:
what are some of the pieces of advice that you would provide, or notes of caution that they should be thinking about, that might dissuade them from using Canopy or encourage them to use it, given a particular set of use cases?
[00:23:06] Unknown:
Definitely, the better you understand your problems, the easier it is to get the full benefit out of the solutions that are out there. Right? A lot of that might involve, you know, coming up with a draft of what your problems and use cases are, reevaluating those problems as you learn about the existing use cases, and then trying to break down your problems to see whether there is a single platform that can solve the challenges you're facing or whether it's an amalgamation of a couple of solutions. I think one thing to also keep in mind is that the more you can break down your problems into subproblems, the easier it will be to find solutions that have alternatives, so that you can either grow as a subproblem shifts and there's another alternative that solves it in a better way, or switch if a competitor becomes available that ends up being a better solution for that subproblem. So the modularity, the flexibility, the extensibility within the challenges you're facing is important to keep in mind and clarify as much as possible.
[00:24:10] Unknown:
What are your plans for the future of Ona and Canopy, and some of the goals that you have going forward?
[00:24:19] Unknown:
Yeah. So both the technical vision and the mission behind what we are doing has always been to get to the point where we're using the information that's received through our platforms to automatically improve and optimize programs while they are in process. One example of this: when working with a nutrition program in Somalia during last year's drought, we saw that food distribution sites were placed based on where the security situation was safest, and not where there was the greatest demand for food. So, you know, currently, the way things are set up, that's found by visual inspection. You can see on some geospatial maps, pulling in from our data collection platform, that these circles don't line up.
But in the future, we'd like that type of insight to be generated automatically so that it can create notifications or alerts, and a program manager can use that to refine their programs and have greater impact. So closing that circle between data collection, analysis, and
[00:25:26] Unknown:
improvement. And are there any other aspects of the work that you're doing with Ona and Canopy, or the area of data collection and humanitarian
[00:25:38] Unknown:
data efforts, that we didn't discuss that you think we should cover before we close out the show? Not particularly. I guess I'd just like to add that there is a lot of interesting work going on in the humanitarian and global health sector that a lot of us in the tech world don't get exposed to that often. It's a very interesting place to see the limits of tools that we built for an environment that's always connected and always powered. Some of those limits really expose new problems that can lead to insights and improve the state of technology globally.
So it's been a very interesting experience for me to get involved in this area and have my conceptions about how the technology that we use every day works be challenged.
[00:26:27] Unknown:
Yeah. It's definitely easy to overlook some of the different ways that data is generated, collected, and used, because there's so much attention being paid to big data and fast data, things moving at high velocity and high scale, that it's easy to forget that there are a lot of interesting challenges with small, granular data: issues with data integration and data cleanliness for distributed data collection, and some of the various problems that go into that. So, yeah, I definitely second your point that it's easy to get stuck in the tunnel vision of what a certain segment of the tech population is discussing and become blind to some of the other areas where our skills and technologies can be put to use.
[00:27:19] Unknown:
Yeah. We often joke that we focus on fat and short data. Compared to big data, where you have a couple of columns and millions of records, a lot of our datasets are hundreds of columns and thousands of records. So it's a different use case, and there are a lot of interesting challenges there. And for anybody who wants to follow the work that you're up to or get in touch about anything we talked about here, I'll have you add your preferred contact information to the show notes.
[00:27:47] Unknown:
And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today?
[00:27:57] Unknown:
The biggest gap that I see is automated systems to merge and integrate different schemas. A lot of the problems we have are similar datasets with different schemas that we want to see in a unified way, and right now those are combined on an ad hoc basis. Sure, we have the ontologies, the ideal data dictionaries; those all exist out there. But on the ground, in the computer systems that are distributed throughout the world, that's not what the data looks like. To really scale impact in data integration, automated schema merging tools are going to have to be built.
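As a toy illustration of the gap being described, here is a naive attempt at automated schema alignment: given two similar datasets with different column names, propose correspondences by string similarity. Real tooling would lean on ontologies, data profiling, and human review; the schemas and the cutoff here are invented for illustration:

```python
# Toy schema-alignment sketch: propose a target column for each source
# column using only string similarity from the standard library.
import difflib

schema_a = ["household_id", "village_name", "num_children", "visit_date"]
schema_b = ["hh_id", "village", "children_count", "date_of_visit"]

def propose_mapping(source, target, cutoff=0.4):
    """Suggest a target column for each source column, or None if no match."""
    mapping = {}
    for col in source:
        matches = difflib.get_close_matches(col, target, n=1, cutoff=cutoff)
        mapping[col] = matches[0] if matches else None
    return mapping

print(propose_mapping(schema_a, schema_b))
# e.g. {'household_id': 'hh_id', 'village_name': 'village', ...}; a human
# would still need to review and correct the proposal.
```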
[00:28:41] Unknown:
We're looking forward to integrating that and helping build that in the future. Alright. Well, thank you very much for your time and for telling me about the work that you're doing with Ona and Canopy. It's definitely an interesting problem space, and it's good to see that people are focusing on it. So thank you for your time today, and I hope you enjoy the rest of your evening. Thanks, Tobias.
[00:29:02] Unknown:
I was glad to be on here. Thank you for having me.
Introduction to Peter Lubell-Doughtie and Ona
Motivation and Mission of Ona
Ona's Clients and Use Cases
Data Collection Workflow and Standards
Ensuring Data Accuracy and Integrity
Handling Data Corruption and Form Versioning
User Types and Data Submission
Integration Challenges and Offline Capabilities
Lifecycle of Data from Collection to Analysis
Using NiFi and Kafka for Data Integration
Introduction to Canopy Project
Challenges and Evolution of Canopy Architecture
Advice for Using Canopy
Future Plans for Ona and Canopy
Impact of Data Collection in Humanitarian Efforts
Biggest Gap in Data Management Technology
Closing Remarks