Summary
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io, about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
- Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io, about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
Interview
Alan Anders from Applecart
- What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?
- What are the biggest technical hurdles at Applecart?
Contact Info
- @alanjanders on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Stepan Pushkarev from Hydrosphere.io
- What is Hydrosphere.io?
- What metrics do you track to determine when a machine learning model is not producing an appropriate output?
- How do you determine which data points to sample for retraining the model?
- How does the role of a machine learning engineer differ from data engineers and data scientists?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering podcast, the show about modern data management. When you're ready to build your next pipeline, you'll need somewhere to deploy it, so you should check out Linode. With private networking, shared block storage, node balancers, and a 40 gigabit network all controlled by a brand new API, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com/datadog today to start your free 14-day trial and get a sweet new t-shirt.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. Your host is Tobias Macey, and this week, I attended the Open Data Science Conference in Boston, and I recorded a few brief interviews while I was there. First up, you'll hear from Alan Anders, the CTO of Applecart, about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next, I spoke with Stepan Pushkarev, the CEO, CTO, and cofounder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to retrain and redeploy those models for better accuracy and more robust operation.
So I'm here with Alan Anders, the CTO of Applecart. So could you start by introducing yourself?
[00:01:30] Unknown:
I'm Alan Anders, and I'm the CTO of Applecart.
[00:01:34] Unknown:
Fair enough. So, we were just talking a little bit, and you mentioned that your primary data engineering concern is being able to process and create knowledge graphs at large scale by sourcing data from multiple different layers. And so I'm wondering if you can talk a bit about some of the ways that you're sourcing that data and some of the challenges that you're working on overcoming.
[00:01:59] Unknown:
Sure. I mean, I come from an ad tech world where we used many, many servers and distributed computing in different ways. And when I came to Applecart, that's how it sort of was. I found that we did a lot of batch computing in strange ways, and I actually moved the entire company to Spark. We're big users of something called Databricks, which I'm sure you're familiar with. We have all of our data engineers and data scientists utilizing Databricks and Spark with shared libraries through GitHub. And it's actually been very effective at building these batch ETLs while still being able to do extensible object oriented programming with many, many different moving parts,
[00:02:51] Unknown:
if that makes sense. And as you're trying to create these entities from multiple different sources, are you having challenges with creating a unified representation of that data so it's easier to merge things together?
[00:03:04] Unknown:
Absolutely. The entity resolution problems are really, really tough. A lot of our datasets, the profile datasets, have attributes of many different flavors. And figuring out, you know, maybe you have n profiles and you want to do n choose 2 matchings, it's a nightmare to try to handle that at scale. You're simultaneously trying to use machine learning techniques to say profile A is the same thing as profile B across two different data sources, and at the same time, within each dataset, you have to dedupe them, which is also an entity resolution problem. The bookkeeping can be a real nightmare in addition to the machine learning aspects and the at-scale aspect. So if you're dealing with two datasets, 200,000,000 profiles, a hundred columns in each, how do you say these two are the same? And how do you process that data pretty quickly? Those are the sorts of challenges that we come up against.
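To make that pairwise-matching problem concrete, here is a minimal, hypothetical PySpark sketch of the standard blocking approach that avoids comparing all n choose 2 profile pairs; the column names, paths, similarity measure, and threshold are illustrative assumptions, not Applecart's actual pipeline.

```python
# Hypothetical sketch of blocking for entity resolution; names and paths are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("entity-resolution-sketch").getOrCreate()

# Assume a profiles table with an id, name fields, and an address-derived zip code.
profiles = spark.read.parquet("s3://example-bucket/profiles/")

# Block on a cheap key (phonetic last name + zip) so only profiles that share a
# block are ever compared, instead of all n choose 2 pairs.
blocked = profiles.withColumn(
    "block_key",
    F.concat_ws("|", F.soundex("last_name"), F.col("zip_code")),
)

a, b = blocked.alias("a"), blocked.alias("b")
candidates = a.join(
    b,
    (F.col("a.block_key") == F.col("b.block_key"))
    & (F.col("a.profile_id") < F.col("b.profile_id")),  # id ordering avoids duplicate pairs
)

# Score candidate pairs with a cheap string similarity; a real pipeline would use a
# learned matching model over many attributes.
scored = candidates.withColumn(
    "name_sim",
    1 - F.levenshtein(F.col("a.first_name"), F.col("b.first_name"))
    / F.greatest(F.length(F.col("a.first_name")), F.length(F.col("b.first_name"))),
)
matches = scored.filter(F.col("name_sim") > 0.85)
```

The point of the blocking key is that only profiles sharing a block are ever compared, which turns an intractable cross join into something Spark can shuffle and score.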
[00:04:02] Unknown:
And what are some of the ways that you identify the places to source data from and then actually build the connectors to consume that data at appropriate levels of scale? And also, particularly thinking about things like rate limiting from various sources, how do you make sure that things are flowing along at the proper rate from multiple sources so they can be merged together?
[00:04:27] Unknown:
That's a very interesting question. There is a lot of data analytics, data science analysis, and business analysis that goes into that. If you look at profile data, you have to ask yourself, you know, how often does that data change and how often do I want to refresh it? For things like demographic data, how often are people changing their ethnicity, their gender, their religion? Not very often, right? So if I have data sources that are fueling that, maybe I don't need to update that data very often. Now, last names change, addresses change.
You figure about 20% of the nation moves in a year, and you have to ask yourself, how often do I want to pick up that change, and how do I detect it online? Those things can be kind of tricky. For other datasets, like the more streaming ones with messaging, like Twitter, a lot of these problems have actually been solved. I would say scraping is like CSE 101, and you have to be very choosy about where you invest in that. We try to invest in the kind of scraping that requires, like, an extra barrier.
Right? Or it's very specific to what we're interested in, in terms of building real-world relationships. So I would say the challenges are definitely around scale, but more often, when you really ask the question of what you need, the scale comes down a lot.
[00:05:58] Unknown:
And a lot of the classical streaming technologies that we'd use really fit the bill for us. And do you find that you have any issues with the sources of data not being available when you're trying to consume them? And have you had to build in any extra engineering effort to manage fault tolerance when your sources are not able to produce data at the rates that you're trying to receive it?
[00:06:22] Unknown:
Yeah, that definitely comes up. We definitely have pauses in our data streams. You know, we're kind of a different beast, right? We're trying to build machine learning probabilities of how people might behave in certain situations. So trying to understand who's going to get off the couch to go vote, we want to build that probability in a campaign in real time. It's sort of strange because, you know, with these machine learning models and how they operate, a lot of those streaming data sources that we're talking about, we're not necessarily going to incorporate into those models. And so a stoppage, or if one of our collectors breaks, it ends up not being disastrous for our clients.
So that ends up being okay, and we can invest those engineering efforts to fix that over longer periods. But it's something that we do think about a lot, you know, as we scale: how are we going to fix that?
[00:07:23] Unknown:
And what have been some of the most challenging aspects of building and maintaining your data systems for creating these entities?
[00:07:33] Unknown:
Spark. The documentation for Spark is all over the place. I think you definitely have to have an intuition for distributed computing. How does a join really work? What is skew? Things like this. And we have people at different points of knowledge in their growth with Spark, and distributing that information is very hard. I think they're getting better at doing knowledge sharing at the Spark Summits and things like this, and we interact with the Spark contributors as much as we can, but it's an ongoing fight.
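As one concrete example of the distributed-computing intuition he mentions, a skewed join key is often handled by salting; the following is a hypothetical PySpark sketch with made-up table and column names, not Applecart's code.

```python
# Hypothetical sketch of "salting" a skewed join key in PySpark; names are illustrative.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-salting-sketch").getOrCreate()
SALT_BUCKETS = 16

facts = spark.read.parquet("s3://example-bucket/events/")  # large side, skewed on user_id
dims = spark.read.parquet("s3://example-bucket/users/")    # smaller dimension table

# Spread hot keys across several partitions by appending a random salt on the big side...
salted_facts = facts.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("long"))

# ...and replicating each dimension row once per salt value on the small side,
# which keeps the join correct while avoiding one giant partition per hot key.
salts = spark.range(SALT_BUCKETS).withColumnRenamed("id", "salt")
salted_dims = dims.crossJoin(salts)

joined = salted_facts.join(salted_dims, on=["user_id", "salt"])
```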
[00:08:16] Unknown:
Before I go to the last question, are there any other aspects of your data platform or some of the challenges you're facing that you think we should cover? Yeah. So
[00:08:25] Unknown:
what's very unusual about us is that we deploy a lot of our services from sort of a consulting perspective. When we work with campaigns or commercial clients, they're often not sophisticated enough to utilize the machine learning probabilities that we produce. So that gives us a lot of leeway in terms of not necessarily delivering immediately. On the other hand, we do have this spike infrastructure where we need to spin up 100 nodes, 150 nodes, 250 nodes, and then shut them down. And a lot of the different vendors we work with aren't used to this. They want to charge us for so many nodes over a month, and then, you know, you want to upgrade to 10 nodes? Fine, this will cost this much. That doesn't work for us. We need spike infrastructure, spin up and spin down. A lot of databases don't work for us. We generally use a lot of cloud storage with Spark so that we can have that spin up and spin down. We've had to be very choosy about the kinds of partnerships we enter into. There are companies like DataRobot that do machine learning, but their business isn't ready for that spin up, spin down. And we're in talks with companies like Datadog for instrumentation, and this is sort of new for them too. So we're trying to find databases that can handle that. Right now, we're looking at something like Databricks Delta, which actually does allow for that, and we're also trying to explore other technologies.
[00:09:52] Unknown:
And as one last question, what do you see as being the biggest gap in the available tooling or technology for data management as it stands right now?
[00:10:01] Unknown:
That is a really good question. Yeah. I mean, there are these little things that we find here and there that just aren't made for startups. And I kind of was mentioning this to you before we started talking: there are certain Spark operations that are just not performant, especially certain kinds of machine learning computations. The research is getting there in terms of, like, how do we scale graph computations? How do we scale machine learning, or, like, principal component analysis? And people want to integrate that with Spark because everyone loves using Spark, whether you're small or large, for doing these computations, but it just currently doesn't work. So I'm really happy to see the open source community trying to solve this, but it's still way behind. And, you know, companies like Google and Facebook, I believe, have these problems solved, but won't share it with the rest of us. So we don't have hundreds of data scientists or data engineers, you know, solving these problems.
[00:11:08] Unknown:
And for somebody who wants to find out what you guys are up to or follow along, what would be the best way for them to do that?
[00:11:14] Unknown:
Oh, yeah. Great. Definitely visit applecart.co. Reach out. We're very much hiring. We're very interested in strong data engineers, strong data scientists. We have some product roles opening up. And, you know, if you don't have those backgrounds, we would still love smart people: political scientists, mathematicians, physicists. We have very many applications that we're looking for right now. Well, thank you for your time. Enjoy the rest of the conference. Thank you so much, Tobias.
[00:11:45] Unknown:
So I'm here with Stepan from Hydrosphere.io. So could you just start by introducing yourself?
[00:11:49] Unknown:
Hey. I'm Stepan. I'm from Hydrosphere.io. I'm the founder and CTO. I worked as a software engineer for many years, and then I built and architected large-scale stream processing systems based on the Apache Spark and Apache Kafka ecosystem. And I also worked a lot with machine learning engineers and data scientists to make them happy, to make them successful in delivering their proofs of concept from notebook environments to real end-to-end, real-time production applications.
[00:12:20] Unknown:
And at Hydrosphere, you're working on sort of closing the loop of AI and machine learning models in production to make sure that they're doing what they're supposed to be doing. So, can you just talk a bit about how you do that and some of the types of metrics that you use to understand how well these things are running? Yeah. Well, so
[00:12:40] Unknown:
everybody is focused right now on training. It's very sexy. We are taking one more step ahead, and we are automating the serving and productionizing part of machine learning. So we do model deployment, serving, and monitoring. Monitoring is the most crucial part, because deployment is a boring thing; monitoring of the model performance, the input features to the model, and the output predictions is the most crucial part. So how do we do that? We monitor distributions of the input features. We apply different statistics like Kolmogorov-Smirnov, correlations, and clustering-based algorithms so that we basically know as much as possible about your production traffic, to be able to detect concept drift and model degradation.
And, also, we use these statistics to do resampling, to generate a clean and diverse dataset for your retraining pipeline. So we help your team maintain and retrain your models in production as you go along. Obviously, your model is only as good as your data. If your data changes in production, your model is not as good as it's supposed to be at training time. And this is a very, very challenging part of machine learning: to maintain the model quality over time and to scale that. The question of scale is probably the next most important question in the industry. If you, as a data scientist, own and build, I don't know, 10 different models, and every model has 10 different versions, how would you watch them in production?
It's not your day job to watch models and identify model drift, concept drift, and model degradation. So obviously, you need more tooling to augment your day-to-day operations, and this is where we come in. We apply AI and machine learning to monitor your machine learning models in production. This is basically the same concept that has been used throughout history: machines help build better machines, computers help build better computer programs, and now AI can help build more reliable and fault-tolerant machine learning models. So this is the high-level overview.
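As a rough, hypothetical illustration of that distribution-monitoring idea (not Hydrosphere's actual implementation), a two-sample Kolmogorov-Smirnov test can flag when a production feature's distribution no longer matches its training-time baseline; the feature names, sample sizes, and significance threshold below are assumptions.

```python
# Hypothetical sketch: flag input-feature drift with a two-sample Kolmogorov-Smirnov test.
import numpy as np
from scipy.stats import ks_2samp

def drifted_features(train: dict, production: dict, alpha: float = 0.01) -> list:
    """Return the names of features whose production distribution differs from training."""
    flagged = []
    for name, train_values in train.items():
        statistic, p_value = ks_2samp(train_values, production[name])
        if p_value < alpha:  # small p-value -> distributions likely differ
            flagged.append(name)
    return flagged

# Usage with synthetic data: the 'age' feature drifts, the 'salary' feature does not.
rng = np.random.default_rng(0)
train = {"age": rng.normal(40, 10, 5000), "salary": rng.normal(60000, 15000, 5000)}
production = {"age": rng.normal(48, 12, 1000), "salary": rng.normal(60000, 15000, 1000)}
print(drifted_features(train, production))  # expected: ['age']
```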
[00:15:21] Unknown:
And when you're monitoring these various metrics for the machine learning models, what are some of the things that you specifically look at to identify that the output of a model isn't within the desired bounds? Is that something that you automatically detect once the model gets put into production, or is that something that needs to be predefined when the model gets pushed into your system? So we can automatically profile your training data and build a basic data profile of what you have. It might be statistical metrics. It might be deep autoencoders.
[00:15:54] Unknown:
We're actually actively using GANs right now. Originally, GANs were used for fooling the predictor into outputting the wrong prediction, and we are probably the first to use them to generate not noise but drift, and we train our discriminator to identify model drift. So it's one of the techniques we use, and we actually combine different methods and different metrics. It might be simple statistics, it might be more advanced statistics, it might be some ad hoc rules, in addition to deep learning anomaly detection methods. If they all coincide, that's a great sign of model degradation. So there is no silver bullet for this. Every model is unique, and this is kind of a new discipline for the machine learning engineer to look at: how would I make my models more reliable?
What metrics should I use to monitor my model? As I mentioned, there is no silver bullet. There are different methods and different metrics. There are some unified approaches, like just watching the distribution of input features. We can do that for almost any input except images and maybe words, the NLP use cases. But for classical machine learning, you can monitor the age, the wage, the salary, or whatever features you have. The tooling for that has been in place for many years, so it's just a matter of applying it and automating it. The crucial thing is why traditional software monitoring methods do not work here. There is a lot of system monitoring of CPU and GPU, but these metrics are much more complicated than just a simple counter or histogram.
For example, Kolmogorov-Smirnov is a stateful metric. You need two samples to compare, you need to do that in a window, and you need to do that in real time. Internally, everything is based on Kafka Streams, so we do stateful aggregation and stateful calculation of these metrics in real time. So, yeah, that's how we work.
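A simplified, assumed stand-in for that windowed, stateful calculation might look like the following in plain Python; the real system computes these metrics on Kafka Streams, and the window size and threshold here are illustrative.

```python
# Hypothetical stand-in for a windowed, stateful drift metric (the real system runs on Kafka Streams).
from collections import deque
from scipy.stats import ks_2samp

class WindowedKSMonitor:
    """Compare a rolling window of production values against a fixed training reference."""

    def __init__(self, reference, window_size: int = 1000, alpha: float = 0.01):
        self.reference = list(reference)          # baseline sample captured at training time
        self.window = deque(maxlen=window_size)   # bounded state over the production stream
        self.alpha = alpha

    def observe(self, value: float):
        """Record one production value; return a drift verdict once the window is full."""
        self.window.append(value)
        if len(self.window) < self.window.maxlen:
            return None  # not enough accumulated state yet
        _, p_value = ks_2samp(list(self.window), self.reference)
        return p_value < self.alpha
```

Each call to `observe` carries the accumulated window state forward, mirroring the stateful aggregation he describes.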
[00:18:32] Unknown:
You preempted the next question I was gonna ask. So thanks for that. And in terms of the data that you're sampling for being able to retrain the models, are you keeping track of the data that's coming in and the output that's going out to determine which inputs are producing valid outputs and which ones aren't, so that you can then determine which data points to sample for retraining the model and bringing it back in line with the expected outputs?
[00:18:56] Unknown:
Yeah. For resampling, we do not watch the outputs; it's mostly the inputs. If you do have outliers in production, that's fine. You need to include these outliers in your training pipeline to make it more reliable, because at the end of the day, you will get these outliers in production, and your model needs to be ready for that. So this is mostly about monitoring the inputs of the model.
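A hypothetical sketch of that input-driven resampling, using an off-the-shelf anomaly detector as a stand-in for whatever Hydrosphere uses internally: fit on the training inputs, score the production inputs, and keep the most unusual rows for the next retraining set.

```python
# Hypothetical sketch: pick the production inputs least like the training data so the
# next retraining set covers them (labels would still need to be obtained separately).
import numpy as np
from sklearn.ensemble import IsolationForest

def select_for_retraining(train_X: np.ndarray, prod_X: np.ndarray, keep_fraction: float = 0.05):
    """Return the indices of the most anomalous production rows."""
    detector = IsolationForest(random_state=0).fit(train_X)
    scores = detector.score_samples(prod_X)         # lower score = more anomalous
    n_keep = max(1, int(len(prod_X) * keep_fraction))
    return np.argsort(scores)[:n_keep]

# Usage: append prod_X[selected] (with fresh labels) to the training data before retraining.
```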
[00:19:25] Unknown:
And what have been some of the most challenging aspects of building Hydrosphere and determining how best to scale it and architect it to serve your clients appropriately?
[00:19:34] Unknown:
Yeah, it's a wide question, an open question. Besides just the technical challenges, which are, of course, all things Docker, microservices, and Kafka, the challenging part is probably socializing the idea and the evangelism around it. So a big part is education of the clients and education of the community. And another, kind of, not a showstopper, but very challenging part is that we are a little bit ahead of the progress compared to the average in the industry. Everybody is still building the models, trying to figure out how to build their training pipelines, figure out how to build their data pipelines.
And when we offer something advanced, we need to help clients do the basics. That's why we are trying to focus just on our product, without spreading our focus across the whole machine learning business and, yeah, machine learning development.
[00:20:49] Unknown:
So you're trying to make people aware of the pain that they're going to be suffering before they actually get to that point, and because they haven't experienced it, they're having a hard time seeing where they might be able to use
[00:20:59] Unknown:
you. Yeah. So, before even building a sophisticated model, what we try to emphasize is: can you build a very naive model, but build it overnight, deploy it into an end-to-end application, and get feedback from the business and from the users, whatever? And then improve it continuously. So you can be in production after one week of research and development. Everybody kind of wants to do that, but what we see is companies just hire data scientists.
They spend a year prototyping in machine learning notebooks, and they don't have a single model in production. Sometimes it requires changing the organizational structure to make data scientists more aware of the final goal and the final production. So it's a very, very tough topic.
[00:22:05] Unknown:
And do you think that that's why the whole idea of a machine learning engineer as a distinct role from data engineers and data scientists is starting to gain more
[00:22:14] Unknown:
prevalence? Yeah. So, we target machine learning engineers rather than data scientists. If a machine learning engineer is responsible for delivering the final value to production, to the end users, it's in his interest to drive that, to craft that, and not to throw it over the wall to IT people. That's another cultural aspect of it. The traditional IT operations people who do monitoring and support, especially in big companies, have no idea about machine learning. If we expose a Kolmogorov-Smirnov test metric to them, it means nothing to them, and they don't even know how to react to those alarms.
So that's why we need a machine learning engineer to watch these metrics, to watch these alarms, and probably at least to be in the loop. Because in the serving pipeline there are, like, a thousand reasons why machine learning may not work in production. The data pipeline might even be broken. Your upstream application might just start producing corrupted input features. So there is no single tool to monitor that and to fix that. That's why the whole team should be aware of any incidents in production. It's actually another interesting use case that we are working on right now: differentiating between expected and unexpected failure, expected concept drift and unexpected. So, for example, if your upstream pipeline failed to produce the right features, this is unexpected drift. You should not retrain your models on these bad features, because they will propagate through the training and serving pipelines to the end users. So you have to classify between that kind of expected drift and an unexpected system failure where you just started receiving something different.
So, that's another interesting aspect of using AI or machine learning to help operationalize
[00:24:50] Unknown:
machine learning in production. Yeah. I've definitely been seeing a lot of trends towards adopting some of the principles that the DevOps movement brought between developers and IT staff, moving into the realm of data engineering and data science and some of the more statistically oriented roles within a company. Yeah.
[00:25:09] Unknown:
I hope so. I hope so. This is where we are, and, like, following that idea, everybody wants to be an AI-first company right now. And one of the major steps towards being an AI-first company is to improve the organizational processes and educate people about production, about end value, about iterations, about all that DevOps stuff that is already in the past for most companies. But, yeah, I still see engineers who just make things work on their laptops, and that's it.
[00:25:47] Unknown:
And are there any other aspects of machine learning engineering or model serving or Hydrosphere in general that you think we should talk about? I don't know. Probably I covered almost everything that I have in my mind. Probably,
[00:26:00] Unknown:
yeah. The one thing I wanted to mention: as I said, I don't have a word, I didn't pin down any buzzword for the thing that we are focusing on. And probably one of the thoughts I have is that it's an extension of AutoML, which is already a little bit understood by the community. People at least understand it and use it heavily: automation around the training process, hyperparameter tuning, model selection, and all that stuff. And if we can extend that to production, so you have serving, model monitoring, resampling, and retraining in the same loop, that would be really beneficial for everybody.
[00:26:48] Unknown:
Maybe we should just start having everybody call it prod ML, and then you can leverage that.
[00:26:53] Unknown:
Prod AutoML as a service.
[00:26:57] Unknown:
So, as one last question, what do you see as being the biggest gap in the tooling or technology that's available for people working with data management today?
[00:27:06] Unknown:
Actually, there are a lot of gaps. Even when you start just experimenting, just having a Jupyter notebook up and running, just making your Spark application work perfectly with S3, it looks like a very basic thing, but when you start from scratch, you can find a lot of gaps. And if you're a data scientist or machine learning engineer, you just spend tons of time on that. So that's one gap, and there are amazing companies that are trying to build more tools for training. There is, like, the Weights & Biases company that just popped up, a new startup. Very, very cool. They do some tooling around training as well. So even with training, which is very popular and everybody's doing, with playbooks and notebooks about it, you still fight with reproducibility of your experiments and with versioning of your experiments. Even if you have, like, a dozen model versions, you forget about them; you can't properly track the versioning and their performance characteristics.
Yep. So, you name it. A lot of companies do data science collaboration tools, but I don't think it's perfect at this moment. And I don't think there are tools in open source to address all the issues in collaboration and training. And then the slogans of the companies and the tools: everybody says they're, like, doing DataOps. I have no idea what DataOps is. There are, like, ten different interpretations and definitions of DataOps. And everybody is saying they are production ready. But when you take a closer look, for some companies, production ready means having the ability to schedule a cron job.
Okay, is that production? It's just an offline batch job. It's not real production. So this terminology and marketing hype is a little bit challenging to break through. That's why I believe the podcasts that are really hardcore and go straight to the point are very helpful to the community. Great. Yeah. It
[00:29:55] Unknown:
seems like data management, data engineering, and productionizing of some of these more advanced workflows are going through the same growing pains that the DevOps movement did, as far as not having a cohesive definition of what it even means to be doing that. And so it's open to interpretation and opportunism. So I think we're in those same growing pain stages, but I think that we'll come out the other side the better for it. Yeah. Yeah. Definitely. Definitely. We'll see more
[00:30:20] Unknown:
from the public use cases you see. We've been talking about some theoretical stuff right now, but you see, for example, the Tay bot from Microsoft. You remember they launched that. This is a public lesson for everybody about what machine learning and AI in production might look like. Even at Microsoft, in their pretty cool research environment, I believe they tested the Tay bot on some Wikipedia-like datasets, and it worked pretty fine, pretty nice.
But, of course, the real world is much tougher than the lab environment at Microsoft, so this was the result: it turned racist, fascist, and all that. It's not fun, really. And, actually, the question that is a very open question for everybody: was Microsoft able to monitor and adjust their models as they went along and prevent that unexpected shutdown of their bot? Did they have the right tooling around that or not? It was just a research experiment, but it might cost the reputation of the company. It might cost, like, huge efforts from the PR team to minimize the impact on the business. So I believe everybody can make their own assumption about the impact of ML failures on their project or their business, and think through how we can improve the
[00:32:07] Unknown:
tooling and the situation around that? Well, thank you very much for taking the time, and for anybody who wants to see the work that you're up to, what would be the best way for them to find you?
[00:32:17] Unknown:
So hydrosphere.io, and follow the links, follow the social media. I'm pretty open to any comments, questions, and, yeah, thoughts and contributions. We have an open source version. Great. Thank you very much.
Introduction and Overview
Open Data Science Conference Highlights
Interview with Alan Anders: Scaling Spark for Entity Graphs
Challenges in Data Source Integration and Fault Tolerance
Gaps in Data Management Tooling
Interview with Stepan Pushkarev: Monitoring Machine Learning Models
Challenges in Building HydroSphere
Adopting DevOps Principles in Data Science
Gaps in Data Management Tooling and Technology
Lessons from AI Failures: The Tay Bot Incident