Summary
When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Paige Roberts about machine learning workflows inside the database
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the current state of the market for databases that support in-process machine learning?
- What are the motivating factors for running a machine learning workflow inside the database?
- What styles of ML are feasible to do inside the database? (e.g. bayesian inference, deep learning, etc.)
- What are the performance implications of running a model training pipeline within the database runtime? (both in terms of training performance boosts, and database performance impacts)
- Can you describe the architecture of how the machine learning process is managed by the database engine?
- How do you manage interacting with Python/R/Jupyter/etc. when working within the database?
- What is the impact on data pipeline and MLOps architectures when using the database to manage the machine learning workflow?
- What are the most interesting, innovative, or unexpected ways that you have seen in-database ML used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on machine learning inside the database?
- When is in-database ML the wrong choice?
- What are the recent trends/changes in machine learning for the database that you are excited for?
Contact Info
- Blog
- @RobertsPaige on Twitter
- @PaigeEwing on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Vertica
- SyncSort
- Hortonworks
- Infoworld – 8 databases supporting in-database machine learning
- Power BI
- Grafana
- Tableau
- K-Means Clustering
- MPP == Massively Parallel Processing
- AutoML
- Random Forest
- PMML == Predictive Model Markup Language
- SVM == Support Vector Machine
- Naive Bayes
- XGBoost
- Pytorch
- Tensorflow
- Neural Magic
- Tensorflow Frozen Graph
- Parquet
- ORC
- Avro
- CNCF == Cloud Native Computing Foundation
- Hotel California
- VerticaPy
- Pandas
- Jupyter Notebook
- UDX
- Unifying Analytics Presentation
- Hadoop
- Yarn
- Holden Karau
- Spark
- Vertica Academy
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E. Get a $100 credit to try out a Kubernetes cluster of your own, and don't forget to thank them for their continued support of this show.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey. And today, I'm interviewing Paige Roberts about machine learning workflows inside the database. So Paige, can you start by introducing yourself?
[00:01:44] Unknown:
Hi. I'm Paige Roberts, open source relations manager at Vertica.
[00:01:50] Unknown:
I have been doing this for a long time. And do you remember how you first got involved in data management?
[00:01:55] Unknown:
I have been doing this a long time, a really long time. I started back in the nineties, doing tech support and documentation for a little data integration startup called Data Junction. I just was a sponge. I went into QA, and then I ran the website for a while, and then I ended up doing software engineering for about 5 years. So I was actually fixing bugs that I had reported as a tech support technician, so that was funny. And fortunately, I'm good at telling myself what the reproduction steps are, so that worked. Then I went into consulting. So I was going around and building data pipelines for folks, and I was a data integration specialist for quite a while, for a couple of different companies and independently, and then I got called back by my old company, which by then was Pervasive.
They asked me to do marketing and I was very confused. That was a bit of a culture shock, going from writing code and building pipelines to trying to explain to people why they needed this cool software to help them. And then I went off and did a lot of different things. I did more consulting, more engineering. I did product management for Syncsort for a while, for their data quality and data integration lines, and I was an analyst for a very short time, writing white papers and stuff like that for an independent analyst from the Bloor Group.
And I did some consulting for Hortonworks. So I've been all over the place. I joke that I've done every job in this industry except admin at this point. But yeah. So I kinda have a really broad view of analytics and data engineering and data architecture.
[00:03:50] Unknown:
And so we're here today talking about the opportunities and the implementation of machine learning inside the database. I'm wondering if you can start by giving a bit of an overview of the current state of the market for the support for machine learning in the database and maybe some of the motivating factors of why that's something that you should even be trying for?
[00:04:14] Unknown:
Well, I think it's become table stakes to a large extent. I think I saw a survey recently that said 70%, something in there. I don't remember the exact number, but something in that neighborhood, more than two-thirds of the analytics databases on the market now have machine learning built into them in some way or as an add-on. And that is just becoming the way that you do machine learning, and there's a lot of really good reasons for doing it in the database. And I think the market hasn't caught up to that change. It's usually the users that lead the way. They're like, I really want this, and then the vendors build something to make that work. In this case, I think the vendors saw it coming before maybe a lot of the data scientists and such saw it. The need to get data science, get machine learning into production is so huge, and the skills to make that happen are so limited, that the software vendors dove in and said, okay, there's gotta be a way that we can make this happen. And a lot of that has been in-database machine learning. There's a nice InfoWorld article somebody wrote recently about, you know, 8 databases that have machine learning in them or as an add-on.
As for the reasons that people are doing machine learning in the database, I think there's just a lot of them, and it kinda depends on who you are. The line-of-business people, they want the advantages of machine learning, but they wanna be able to use a BI tool. Like, they wanna do it point and click. They don't wanna write code. And if there's something built in that they can take advantage of, they can do it from, you know, Grafana or Power BI or Tableau or whatever they have that they're comfortable with, and they can do it that way. That broadens the number of people who can do predictive analytics, who can accomplish things like k-means clustering for targeted marketing, that kind of thing.
For the data scientists and the data analysts, there's a different advantage. The data analysts are used to using SQL, and it's very comfortable for them. Even data scientists, I don't know anybody that works with data that doesn't know how to use SQL. So that, by itself, makes life a lot easier for a lot of the data manipulation and things like that. The database, especially a modern scale-out, you know, cluster database, is designed to manipulate data very efficiently and very quickly. And as much as I hate to admit it, way more efficiently than any of the Hadoop or data lake kind of concepts. The database has got a good 10, 15 years of development and optimization.
Databases usually used to talk about how awesome their query optimization was. Well, what that means is they spent years and years and years honing that performance, and it's just ahead. So you get that advantage of being able to do your data prep really fast and do your training really fast, that speed. For the DBAs and architects and stuff, it just simplifies things. You don't end up with this giant, you know, sprawling set of 500 different components that you have to get to all work together. You end up with 1 sort of central database where you can do 80% of your work, and then the thing that you need to add onto it is the thing that feeds it. And you don't have to do 2 different feeds. That's the other aspect: instead of having to do 1 massive feed that feeds into the data lake, for data scientists to then pull out extracts and go do their thing somewhere else, because, you know, they can't handle 100 terabytes worth of stuff in R, it's just not made for it, they can just use the whole dataset that's already in the database. It makes life a lot easier, and a lot of the databases now will even reach out and grab data off of HDFS or off of an S3 bucket, that kind of thing.
So you can pull in all your datasets without having to go find them. And from a data engineering perspective, that means I don't have to build multiple pipelines to accomplish analytics, whether that analytics feeds some report somewhere or it's for ad hoc queries or it's for machine learning. I can build 1 pipeline and accomplish all of those goals. And this is 1 other thing, I've gotta stand on my soapbox a little bit. I don't think people realize the difference in concurrency between a data lake and a database. An analytics database is designed to have multiple concurrent users or workloads. So if I have 100 people in my company and they all need to access data to do their jobs, the database doesn't have any problem with that.
The data lake? Gartner even did something on that, and it was funny. There was an article Gartner recently did about how, you know, if you have more than 10 users, the data lake chokes. And there was something about, you know, Databricks saying they're working on that. I'm like, databases solved that problem years ago, and the fact that the data lake folks are still trying to catch up with that is huge. You just can't spread your analytics across your company and get everybody to make their decisions based on data if only 5 people can access the data at a time. That's huge.
[00:10:14] Unknown:
Yeah. Absolutely. It's definitely remarkable the disparity in where the engineering focus has gone between databases, particularly MPP style and data lake engines, where data lake engines are looking for volume of data over concurrency, and the MPP databases have optimized concurrency to be able to tackle the volume.
[00:10:38] Unknown:
Yes. And it makes a huge difference. There's actually 1 other piece, I think, that goes with that and this is from working at Vertica. We put the machine learning capabilities into the database and did not get the adoption, at first, that we expected from our customers and these are the customers that were pushing us. They were like, we really want machine learning in the database. Please make it happen. And we're like, alright then. Here you go. And they were like, yeah. We're not adopting that. It's like, what? Wait a minute.
And we asked them: why? Why is this not getting adopted when you were the 1 that was really pushing for it? And the answer was, we have business critical BI and SLAs that we have to meet on our analytics that are already in place. And if I train a machine learning model, it's liable to eat up all the resources and slow down everything else and maybe even, you know, bring the whole database to a stop. So being able to run concurrent workloads means more than just being able to handle, you know, a lot of users at 1 time. It means that I can train a machine learning model and I can do streaming data ingestion and data prep, and I can do ad hoc queries.
I can do targeted marketing campaigns, and all of these things can run at the same time and be isolated from each other. Workload isolation is huge. You have to be able to separate the resources so that nothing that 1 team does screws up something the other team does. ETL is a really good example. I mean, the classic concept was, well, let's wait until 3 AM or something when no one is using the database, and that's when we'll do ETL. And sure, the data will be a day old, but we won't bog down anything important. Well, if you're taking in streaming data all day, all the time, in parallel, massive amounts of it, I mean, we have ad targeting folks that are pulling in a million events a second, and they need an SLA of 200 microseconds.
It's like, you can't wait until 3 AM, and there is no time when the database is not busy. So in order to do that ingestion, that ETL, that data transformation that makes the data ready to accomplish something, it has to run all the time, and it has to not interfere with other workloads. So that's another good example: it has to be isolated. Machine learning has to be isolated. You have to be able to say, you have these resources, those are all yours, you can use as many of those as you want, but you're never gonna touch these resources over here that are doing a different job. That workload isolation was a big thing. And so now we have huge adoption because we built in workload isolation, and that made a big difference.
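To make the resource isolation idea concrete, here is a minimal, hypothetical sketch of carving out a pool for model training. It uses Vertica's resource pool SQL via the vertica_python driver, but the pool name, sizes, and connection details are invented for illustration, and the parameters you would actually choose depend on your cluster.

```python
# Hypothetical sketch: give ML training its own bounded resource pool so it
# can't starve BI dashboards or streaming ingest. Names and sizes are invented.
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # A bounded slice of memory and concurrency for model training.
    cur.execute("CREATE RESOURCE POOL ml_pool MEMORYSIZE '25%' MAXCONCURRENCY 4")
    # Route the training user to that pool; dashboards and ETL keep their own.
    cur.execute("ALTER USER ml_trainer RESOURCE POOL ml_pool")
```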
[00:13:57] Unknown:
That's a good segue into the performance implications of running your machine learning inside the database. And I'm wondering if you can just talk through from both directions the performance improvements that you're able to realize by doing the machine learning in the database where the data already lives, but also the potential negative impact that it can have on the database and the database users
[00:14:19] Unknown:
for adding that additional workload and the training overhead of building those models and serving them? Well, I think the workload isolation takes care of a lot of those negative impacts. If you have really good, solid workload isolation built into your database, you don't have any negative impact. You can train your machine learning model at will, use up all of the resources assigned to you, and never worry about it bogging down the executive dashboard and your CEO getting mad at you. That doesn't happen. That is wonderful in itself. But the other aspect is what I talked about earlier: databases spent years and years and years perfecting performance.
So your MPP analytics database is designed to use all the power of the cluster and those really smart, super-intelligent AI query optimizers that are built into it to make the queries go as fast as possible, and that includes machine learning. And, again, I'm gonna use Vertica as the example because that's the 1 I know. When Vertica decided, okay, we're gonna take a logistic regression model and we're gonna build it into our database, they built it in C++. They built it distributed, so it's automatically parallelized, and it's got unbelievable performance, and you don't care. I mean, you don't have to care. That is just not something that the person using it has to worry about. They just have to say, okay, I wanna use logistic regression, and here's the dataset, and here's my parameters, and go.
You can do a single SQL call and say, split this into a training and a test set, train it on that, verify, and tell me what the accuracy is. You've even got AutoML. This is exciting. This is actually something I really like. You can give it a dataset and it can run 10 different machine learning algorithms against that dataset and give you a chart that shows how accurate each 1 was, and then you can go, well, it looks like random forest is the way to go for this particular dataset, for this particular use case and this problem. It's the most accurate. Talk about saving time: that means I don't have to try 6 other models that might work, because I already know random forest ran and gave me the most accurate answer.
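As a rough illustration of that single-SQL-call style, here is a hedged sketch of training and scoring a model with in-database functions from Python. LOGISTIC_REG and PREDICT_LOGISTIC_REG follow the pattern Vertica documents for its SQL ML API, but the table and column names are made up and exact parameters can vary by version.

```python
# Hedged sketch: train and score a logistic regression entirely in the database.
# Table and column names (churn_train, churn_test, churned, ...) are invented.
import vertica_python

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="analytics") as conn:
    cur = conn.cursor()
    # One SQL call trains the model on a table that never leaves the database.
    cur.execute("""
        SELECT LOGISTIC_REG('churn_model', 'churn_train',
                            'churned', 'tenure, monthly_charges, num_tickets')
    """)
    # Score a held-out table with the stored model and compute accuracy.
    cur.execute("""
        SELECT AVG(CASE WHEN predicted = churned THEN 1 ELSE 0 END) AS accuracy
        FROM (
            SELECT churned,
                   PREDICT_LOGISTIC_REG(tenure, monthly_charges, num_tickets
                       USING PARAMETERS model_name='churn_model', type='response')
                   AS predicted
            FROM churn_test
        ) scored
    """)
    print("holdout accuracy:", cur.fetchone()[0])
```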
And the other aspect of that is, before, if you were using Python or R or something like that, you generally had to extract a sample. You had to statistically try to figure out what a good sample was. And if you had a really large dataset, you maybe had to do that 3, 4, or 5 times to try and get a statistically good picture of what your data was like and not miss any, you know, unusual events or outliers or that kind of thing. And then you had to do it on your laptop or your desktop or whatever, and then you had to figure out how to put that into production at large scale. So maybe somebody else, like a data engineer, looks at the 25 data preparation steps you did and has to reproduce them in parallel at scale with Spark and has to sit down and write Scala code to make that all work.
And then at the end, you know, maybe the accuracy isn't as good or maybe, you know, your data has drifted in the 3 months it took them to write that application or whatever. The performance is great, and it's a huge help, but a lot of the power of it is that you don't have to move data around, and you don't have to rebuild everything. You can do everything in place. You can use the power of the database engine to accomplish the goal without having to worry about writing parallel code. It's like, I don't have to worry about if I have a 10 node cluster or a 100 node cluster. I just have to tell it I wanna do this model on this dataset and I get to use the whole dataset. I can use all of it and get the full accuracy, and it doesn't take me any longer.
I actually did a presentation recently with Anjal Singh. He's 1 of our data scientists, and he was showing a churn reduction model. And when he first was talking to me about how this worked, he got to a certain point and showed, here are the features that have the strongest influence on accuracy, and then, you know, here's a graph from the most influential to the least. And I'm like, okay, so you're gonna knock off those bottom 10 features, right, to get better performance. He's like, why would I do that? In Vertica, it runs in, you know, microseconds anyway. Why would I need to take out features if they add maybe a percentage point of accuracy to my model?
I can leave those features in and get a boost in accuracy. Even if it's a tiny boost, that could mean a lot of dollars to your company. 1% smarter is, you know, a million more dollars in your pocket. It's just amazingly powerful to get that performance and that productivity gain.
[00:19:58] Unknown:
Machine learning has been used in a lot of different contexts and a lot of different ways to mean many different things. And so I'm wondering if you can just take a moment to talk through the particular styles of machine learning that are feasible to do within the database. You know, some people might think machine learning means Bayesian inference. Other people might think it means building recurrent neural networks or convolutional neural networks and doing deep learning, or it might be, you know, doing, you know, gradient descent or, you know, Monte Carlo simulation. I'm wondering if you can just talk about sort of the the styles of machine learning that are most applicable to being run within the database and any kinds that wouldn't really fit very well in that environment.
[00:20:42] Unknown:
Well, I mean, we have an extensive set of algorithms built in. And data preparation is 1 of those things that people forget, you know, you can't do this algorithm until after you do the one-hot encoding to change your categorical variables to binary or whatever. You gotta find your outliers. You gotta check your correlations. You gotta do all that kind of stuff. Having all that built into your database shortens your work time quite a bit, and that's for any kind of machine learning, so that's not specific to any particular type of machine learning, I think. Just speeding up your data prep is huge. Vertica imports and exports PMML. So, you know, you can hand a model off, you can bring it back in, things like that, or you can import or export datasets, and that makes life a lot easier if you're just doing the data prep and then you're gonna do the machine learning somewhere else.
On the other hand, if you wanna do most forms of machine learning, you know, standard algorithms, old ones like SVM and Naive Bayes and regression and k-means clustering, all the kind of standard things that you use for the most part. I think we just added XGBoost. The machine learning inside the database is constantly expanding. So normal machine learning is the word that I would use: you can do it in the database. The main thing that I think you need to do outside the database is what you mentioned: neural networks and deep learning and that sort of thing. So if I'm trying to do something with PyTorch or TensorFlow and I'm trying to train a neural network model, a lot of times what you need there is GPUs, because GPUs are really good at that linear algebra kind of computation and get through the math faster than a CPU can.
And there's some cool things out there. Neural Magic is a piece of software that can simulate GPUs on a CPU machine, which is kinda cool. But, for the most part, if you're gonna train neural networks, you do that on a GPU machine. On the other hand, GPU machines are expensive, and the last thing you wanna do is deployment, putting your model to work, on your GPU machine, because that's expensive. You just wanna train your model there. Well, okay. So if you could take, say, your TensorFlow frozen graph and import it into your database and manage it like a table and deploy it and use it for prediction and put it to work on a standard, you know, CPU machine or a normal virtual machine in a cloud, that's powerful.
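A loose sketch of that train-on-GPU, serve-on-CPU flow is below. The IMPORT_MODELS and PREDICT_TENSORFLOW calls mirror what Vertica documents for its TensorFlow support, but the model path, table, and column names here are hypothetical.

```python
# Illustrative only: import a frozen TensorFlow model trained elsewhere on GPUs,
# then serve predictions from ordinary CPU nodes with plain SQL. Paths, table,
# and column names are hypothetical.
import vertica_python

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Bring the exported frozen graph into the database and manage it like a table.
    cur.execute("""
        SELECT IMPORT_MODELS('/models/churn_tf_frozen'
                             USING PARAMETERS category='TENSORFLOW')
    """)
    # Prediction runs in-database, no GPU required.
    cur.execute("""
        SELECT customer_id,
               PREDICT_TENSORFLOW(f1, f2, f3
                   USING PARAMETERS model_name='churn_tf_frozen') AS score
        FROM scoring_batch
    """)
```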
That makes a big difference, and I think that's 1 of the things that Vertica has going for it. I'm the open source relations manager. We integrate with open source. So, you know, if you wanna train a model in Vertica and hand off the PMML to somebody else, you can do that. If you wanna train a model in Spark, or in whatever you feel like training your model in, and then hand it off to the database to manage, evaluate, and deploy in production, train it in TensorFlow, train it in PyTorch, import it, put it to work. That's a powerful concept, and that cooperation is always better than the opposite. A friend of mine was just talking about this. They used to work with SAP software, and it worked great as long as you only used SAP software with it. As soon as you tried to integrate it with something else, it didn't work so well. So I think that's 1 of the things to watch for: does it work and play well with others?
Because your machine learning is powerful, but it's never gonna work in isolation. It's always gonna have to integrate with a visualization layer, integrate with all your data sources. Take something simple: if I wanna do analytics on the structured data in the database and I also need to look at maybe my 5-year historical data, which is stored in Parquet on an S3 bucket, that should be possible. There's no reason why I shouldn't be able to say to my database, hey, do analytics on this table and that sort-of table that's sitting over there, join them together, give me information on all of it, and train this model on that resulting dataset.
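Purely as an illustration of that query-it-where-it-sits idea, an external table over Parquet in S3, joined to a regular in-database table, might be defined along these lines. The AS COPY FROM ... PARQUET pattern follows Vertica's external table syntax; the bucket, schema, and names are invented.

```python
# Sketch of analyzing data where it sits: an external table over 5 years of
# Parquet history in S3, joined to a regular table inside the database.
# Bucket path, columns, and names are all invented for illustration.
import vertica_python

EXTERNAL_DDL = """
CREATE EXTERNAL TABLE sales_history (
    sale_date   DATE,
    customer_id INT,
    amount      FLOAT
) AS COPY FROM 's3://my-bucket/sales/*.parquet' PARQUET
"""

TRAINING_VIEW = """
CREATE VIEW training_set AS
SELECT c.customer_id, c.segment, SUM(h.amount) AS lifetime_spend
FROM customers c
JOIN sales_history h ON h.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment
"""

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="analytics") as conn:
    cur = conn.cursor()
    cur.execute(EXTERNAL_DDL)    # nothing is copied; the Parquet is read in place
    cur.execute(TRAINING_VIEW)   # in-database table joined with the external data
```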
No reason why that shouldn't work, and that's 1 of the things that I think some of the databases are still catching up to: the concept that not all of the data that you need to use is in your database. You should be able to do analytics on data outside the database. And I think data lakes have had that concept for quite a while. They've had the concept of, dump everything in here, and we don't care if it's in Parquet and ORC and JSON and Avro and, you know, all these different formats, log files and sensor data and all this other crazy semi-structured stuff. We should be able to analyze it anyway. And I think databases are just now catching up to the idea, yeah, we need to just analyze all that.
And I think the other thing to watch for is the guy who says, oh, sure, we'll analyze that, just put it all in my database first. That's a vendor play for, I want all your data in my database. And then there are egress fees. Have you heard of egress fees? When I heard of egress fees, I just about had a heart attack. I was like, that is the most idiotic thing I have ever heard in my life. So for anybody who doesn't know, an egress fee is when you pay money to your database vendor to move your data somewhere else. Seriously, your data, if you wanna move it, they want you to pay them. I was like, what?
I thought that was the craziest concept I'd ever heard.
[00:27:16] Unknown:
That's a big thing in the cloud vendors as well. Once you get it inside their network, they're happy to let you send it everywhere as long as it's within their network. But if you try to go from AWS to GCP or vice versa, then they're definitely gonna get you on the cost.
[00:27:30] Unknown:
Well, and hybrid cloud is huge now. The idea of having a data center and also, you know, having some of your workload on the cloud and being able to pass your data back and forth, that's powerful. I just saw something that said, like, 70% of the folks are doing that, I've got the graph around here somewhere. The CNCF did this survey recently about how many folks are doing cloud, how many folks are doing on premises or private cloud. Hybrid was the top of the list. The number 1 way that people want to deploy their database and their data analytics is hybrid. They wanna use both.
And if I'm letting people put stuff in my database but I'm not letting anybody take it out, that's a huge barrier. That's like a speed bump in the middle of your nice data flow. My background is data engineering, so the idea of telling somebody they can't move their data is just, like, that's what you do with data. It flows. It changes. It shifts form. That's how you make it work. You know? It's like a dam. You've got this wonderfully flowing stream, and then here's this bare rock in the middle of the stream. It's like, nope, sorry, you can't do that. Or a toll troll. It's like, no, you can't go past unless you give me some money. It's like, that's just a 1-way street.
[00:29:00] Unknown:
You can come in, but you can't get out. It's the Hotel California for data.
[00:29:04] Unknown:
It is. The Hotel California for data. You can come in, but you cannot leave.
[00:29:10] Unknown:
Yeah. And so for people who are looking to use the database for building and deploying their machine learning models, I'm wondering if you can just talk through the overall workflow of going from, I have this data. It's already in my database to I have this machine learning model, and now I've put it into production and just the overall workflow of getting from point a to point
[00:29:33] Unknown:
b? Well, if your data is already in the database, that makes life really easy. And then it's just kind of a choice of what do I wanna work with. Most likely, I'm a data scientist, I like notebooks, I'm gonna pull up my Jupyter Notebook. We have a nice open source project called VerticaPy. So you can use your Jupyter Notebook, and it has a Python interface. So you write Pandas-style code and you do some data exploration. You do some, like, you know, what's my feature correlation? Where are my outliers? Maybe I wanna balance my dataset, do some one-hot encoding. I do all that kind of stuff.
You know, maybe I discover, oh, this dataset isn't in my database. It's over here in this S3 bucket in ORC, you know, or it's JSON that's been streaming in for the last 5 years and now we've got a big pile of it sitting over here in zip files or something. If you're using Vertica anyway, you just define that as, like, an external table, tell it where it is, and it'll just go, and you can just keep going. You could just say, okay, join this data with that data over there, and let me see. Okay, these features are cool, and these only add 2% to my accuracy, but you know what? It doesn't cost me that much. I'll just keep them. And then train your model, evaluate it.
Look at your ROC curve. Look at your lift. Look at your confusion matrix. Whatever you wanna do, you know, maybe you wanna import matplotlib and have a look at that. How does that look? Okay, I'm pretty happy with that. Save it. You save it like a table, and then you use a SQL command that says, use this to predict, and you tell it where the dataset is that it's gonna predict on, and, you know, you check it. Okay, this is solid. Well, since your database is gonna be the same in development and QA and production, there is a line of code that you write to say, okay, push this to production and make it work. That's it. Instead of months.
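For readers who want to see the shape of that notebook flow, here is a hedged VerticaPy-style sketch. VerticaPy's module paths and method names have shifted between releases, so treat the exact imports and calls as assumptions; the point is that exploration, encoding, training, evaluation, and prediction all push down into the database.

```python
# Hedged VerticaPy-style sketch of the notebook workflow described above.
# Module paths and method names vary between VerticaPy releases, so treat the
# exact calls as assumptions; everything here executes inside Vertica.
import verticapy as vp
from verticapy.learn.linear_model import LogisticRegression  # path differs by version

vp.new_connection({"host": "localhost", "port": 5433, "user": "dbadmin",
                   "password": "", "database": "analytics"}, name="vtx")
vp.connect("vtx")

churn = vp.vDataFrame("public.churn")   # lazy handle; the data stays in the database
churn.corr()                            # feature correlations, computed in-database
churn["plan_type"].get_dummies()        # one-hot encode a categorical column

model = LogisticRegression("churn_model")
model.fit("public.churn", ["tenure", "monthly_charges"], "churned")
model.roc_curve()                       # evaluation also pushes down to the database
model.confusion_matrix()

# Scoring writes a prediction column; it's just another in-database operation.
model.predict(churn, name="churn_score")
```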
That's kind of huge, I think. For a lot of folks, the biggest jump is like, okay, I'm happy, my model is awesome, how do I get it into production? 1 of the things Vertica does that I really like is how it sells you your production database. You can either say, I'm gonna have this many terabytes of data, up to this many terabytes, and buy capacity like that and then have unlimited compute. And then, you know, say if you're on Amazon, you'll have to pay Amazon for however much compute you use, but you won't have to pay Vertica. Or you can have unlimited storage and say, well, this is how much compute I think I'm gonna use, so that's the capacity that I've bought.
And then, you know, if you're on prem or on the cloud, either way, you're gonna have to pay for your infrastructure because we're just software. I mean, that's pretty much it. You literally pay for your production, and dev and QA just come with it. High availability comes with it. It's not extra. You don't have to go pay some more to get that, as if that were an extra thing. You know, you wouldn't have a dev cluster, would you? You don't do testing, do you? It's like, everybody does that. And I heard a great joke that's like, everybody does QA. Some people are fortunate enough to not do it in production.
So, yeah, everybody has that stuff. So you gotta have the dev and the test and the production and the high availability. So that's just included in your license. And, I think that in itself just makes your life a lot easier. And it means that your environments are all gonna be the same and you don't have to make a big change to get to production. It's just take it from here and push it over there.
[00:33:34] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
And so in terms of the actual implementation of the machine learning capabilities, specifically with Vertica, since that's what you're familiar with, I'm wondering if you could just talk through some of the architecture and implementation and some of the ways that it has changed or evolved in the time that you've been working with it. I think the architecture is very much an MPP architecture
[00:34:28] Unknown:
in the database already, and it's pretty much designed from the ground up to be really optimized for high-scale analytics. They learned a little bit maybe from some of the mistakes. I mean, they did it beforehand, so maybe they just were smarter. I don't know. But they don't have a master node. There's no leader node. There's no choke point. There's no single point of failure. If I send a query to Vertica and it has 100 nodes, it could go to any 1 of those nodes and initiate the query, which means, of course, automatically, you have 100 people that could hit the database at the same time and their queries aren't even initiating from the same node. It's very parallel, built in. What we did with the machine learning, as we added to that, was simply take the same principles and take a machine learning algorithm.
Say I have k-means clustering and I'm working on it with R. Well, it's gonna be linear. It's gonna be sequential. It's gonna be kind of intended to do 1 thing and then do another thing and then do another thing. Well, we have pretty smart engineers, and they basically went, yeah, no, that's not gonna work, and rebuilt it using something else, using C and things like that. And we also, at some point, I think, acquired Distributed R and built that in as well. So we have the capability to distribute your R code. We have this UDx framework, which is wonderful.
If I have a special Python algorithm that I wrote myself, that nobody else has, it's my secret sauce, I can write that, wrap it as if it were a SQL function, and put it in my distributed database, and it will automatically distribute the workload and treat it as if it were part of Vertica from the beginning. And then all the rest of my workflow, of course, is already in there, and this new thing that I added that's special for me just goes right into the workflow as if it was there all along. So I think that's pretty cool, and we have that for Java, C, Python, R.
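Here is a rough sketch of what that wrapping can look like for a Python scalar UDx. The class and method structure follows the pattern in Vertica's Python SDK documentation, but the details, and the placeholder scoring logic, are illustrative rather than exact.

```python
# Rough shape of a Python scalar UDx, following the structure in Vertica's
# Python SDK examples. The placeholder scoring logic and names are illustrative.
import vertica_sdk

class MyScore(vertica_sdk.ScalarFunction):
    """Secret-sauce scoring function, applied row by row, in parallel."""
    def processBlock(self, server_interface, arg_reader, res_writer):
        while True:
            x = arg_reader.getFloat(0)
            res_writer.setFloat(2.0 * x + 1.0)   # stand-in for the real logic
            res_writer.next()
            if not arg_reader.next():
                break

class my_score_factory(vertica_sdk.ScalarFunctionFactory):
    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addFloat()
        return_type.addFloat()
    def getReturnType(self, srv_interface, arg_types, return_type):
        return_type.addFloat()
    def createScalarFunction(self, srv_interface):
        return MyScore()

# Registered with SQL along these (hypothetical) lines, after which it can be
# called like any built-in function and is distributed across the cluster:
#   CREATE LIBRARY my_lib AS '/path/to/this_file.py' LANGUAGE 'Python';
#   CREATE FUNCTION my_score AS LANGUAGE 'Python' NAME 'my_score_factory' LIBRARY my_lib;
```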
I'm not sure if I've hit everything. But, you know, the idea is you can use whatever you're comfortable with and you can put it in there. I think we added a Golang interface at some point, but I'm not sure if it's in the UDx or not. It's very flexible, and the capability to work with whatever you have or whatever you're comfortable with makes a big difference. 1 of the big differences about Vertica that makes the machine learning faster is not really a machine learning specific thing, and that is, I hear a lot about in-memory databases.
Okay. The concept being, of course, that RAM is much faster than reading from disk. Well, it is. But if you optimize the way you store things on disk, you can speed up the way you read from disk to the point where it's faster, because of the columnar nature and because there's an AI built in there that'll pick the right compression algorithm for that particular data type and things like that. You can compress up to 90%. Well, if I can compress the data down to 1 tenth of its previous size and then work on that 1 tenth of the size without ever uncompressing it, that's a lot faster than if I had to work on a dataset at its full size, or if I had to compress it, uncompress it, work with it, and recompress it. That, obviously, is gonna be slower too. So what Vertica did was they not only came up with some smart compression algorithms, they came up with new ways to store data so that it is already optimized for analytics.
That is different. We don't store data in tables. We don't store data in relational datasets. We don't store data in a snowflake or a star schema. That's not how the data is physically stored. What we have is an AI analyzer that looks at your query, the kind of analytics that you do, and then says, okay, in order to do that, this is how you should store your data to make it optimally fast. So if I'm doing, say, machine learning training on a dataset that would normally include 5 different tables, and it's a big flat thing with 700 columns, and that's what I need to train my machine learning, that's how we store it. It's already stored that way.
So when I go to do my training, that's way faster than if I have to go and find all of those tables and join them and get that all ready before I can do anything with it. The other thing is, like, if I'm getting sensor data coming in, you know, 10 readings per second, and all I need to look at is every minute, and what I'm gonna train my model on is the sensor readings per minute, I can do some pretty massive compression. So what we do is we'll store the original data, or we'll let you store it somewhere else, Parquet or something like that if you want to, and we'll store the aggregated version, either 1 or both.
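A small sketch of that raw-plus-rollup pattern, with invented table names; TIME_SLICE follows Vertica's documented time-series SQL, but check the exact syntax for your version.

```python
# Sketch of the raw-plus-rollup idea: keep the 10-per-second raw readings and
# maintain a per-minute aggregate to train on. Table names are invented.
import vertica_python

ROLLUP = """
CREATE TABLE sensor_by_minute AS
SELECT sensor_id,
       TIME_SLICE(reading_ts, 1, 'MINUTE') AS minute_bucket,
       AVG(reading_value) AS avg_value
FROM sensor_raw
GROUP BY sensor_id, TIME_SLICE(reading_ts, 1, 'MINUTE')
"""

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Retraining hits the small rollup; dropping back to 10-second granularity
    # later just means re-aggregating sensor_raw, which is still there.
    cur.execute(ROLLUP)
```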
And that way, when you go to do your machine learning and you need that aggregated version, you know, to retrain your model on the latest readings, boom. Fast. It's already there. On the other hand, if you go, oh, you know what? I think I might increase my accuracy if I looked at every 10 seconds instead of every minute. You still got your original data, and you can go back and you can do that. So that's a long walk for a short drink of water, but I think that makes the difference. The things that we did to optimize analytics
[00:40:49] Unknown:
end up also optimizing machine learning. We've touched on this a little bit, but in terms of the architectural patterns and the way that you think about building your data infrastructure, how does the use of the database for your machine learning building and deployment impact the way that you think about building the rest of your platform?
[00:41:10] Unknown:
I do an entire presentation on this. If you Google unifying analytics and Paige Roberts, you'll probably find a video of me doing this talk. But, essentially, back in the day, we had this data warehouse architecture with that, you know, slow, once-a-day batch ETL and a big database in the middle, and it was pulling from, oh, I don't know, all 6 of your transactional sources, you know, maybe. And then it had a visualization layer on the front, and that's how you did that data warehouse concept. And then we're like, well, I would really like to pull data from 25 places, and a lot of it is semi-structured. And I'd like to get some streaming data in there, and I'd like to do some more advanced analytics and stuff. And did we improve the existing data warehouse? No. We threw it out the window and started over with the whole data lake concept.
And so now what I see the most often is what I call a combination architecture, where someone has taken their data warehouse and tried to replace it with a data lake, and that didn't work. So what they ended up with is this sort of cooperative thing where either the data lake feeds the data warehouse or the data warehouse feeds the data lake, and they're working together, and the data engineer is going bonkers because he's got to build every pipeline twice and he's got to move things from 1 place to the other constantly. And, you know, his data scientists are working over here, except they still can't work with all this data. They've gotta extract samples, and then they're like, oh, what about this data? Oh, that's in the data warehouse. And you end up with a lot of duplication and a lot of frustration and trying to find things.
This isn't really the greatest way to do it. But because you have the data warehouse, you know, the analytics database, doing what it does best and the data lake doing what it does best, it does work. And there's a lot of people still using that. That is, I think, the most common architecture right now. What I see over time is more and more people moving to this concept where the data warehouse and the data lake sort of merge and become 1. And you end up, say, storing all of your data in an S3 bucket or on some specialized shared storage hardware that has S3 as a supported type. So object storage, Pure Storage FlashBlade, or Scality RING, or Dell EMC, yeah. So you've either got some specialized storage on prem, or you've got an S3 bucket or Google Cloud Storage or something like that in the cloud. But either way, that's pretty much where all of your data ends up.
It just ends up in, like, 10 different varieties. So you've got, you know, your JSON messages streaming in, or you've got your sensor data streaming in, or you've got your Parquet data getting bigger and bigger and, you know, absorbing long-term data. You've got CSV files and Excel files and, you know, simple things like that. All of them in this sort of large space, and that includes your database format. So with Vertica, anyway, we store our database format, which we call read optimized storage. It's a file format just like Parquet, just like ORC. It's a highly optimized analytic columnar format, and you can store it on S3. So it sits there right next to JSON and Parquet and ORC and Avro and all the other ones.
And then, you know, you do analytics on top of that. And you don't have to move it anywhere. You don't have to change format unless your goal is to improve your performance. You might move it from 1 format to another to get faster performance. Read optimized storage is ours, and, obviously, we're gonna be fastest on our own format, but you still get pretty decent performance from, say, something else columnar and designed for analytics like Parquet or ORC. Whereas you would get maybe less fast performance if you were looking at a whole bunch of JSON files sitting around.
Everything is gonna be analyzable in 1 place. So I don't have to build multiple pipelines. I can build 1 pipeline, store all the kinds of data that I wanna store, and maybe move a few around if I need to. Store it where it makes sense and then analyze all of it in place without picking it up and moving it, and we call that unified analytics. I'm seeing more and more of the independent analysts picking that up, EMA, GigaOm, guys like that. I think GigaOm just did a radar report for unified analytics platforms: that concept of pulling the data lake and the data warehouse together and making them 1 thing instead of having 2 separate things. And I think you hear that from the analytics database vendor side, which is us. We call it unified analytics.
The data lake side, they're mostly calling it a lakehouse, which I think is pushing your metaphor a little far, but okay, if that's what you wanna call it. Alright. Go you. And the main thing is that I think the database vendors have a little bit of a head start, and I think that is true on a lot of things. I was there, you know, building Hadoop clusters back when Hadoop got YARN, and I had to rebuild my cluster because I'd built it with the previous version. It's grown, but it's only, what, 10, 12 years old at this point. And even a relatively new, relatively young database like Vertica is only 15, 16 years old. It's still got a serious head start.
And the ones that have been around even longer have got even more of a head start. It's just gonna take a while for the data lake vendors to catch up, I think. And in the meantime, you know, the database vendors aren't sitting still. We're adding more and more machine learning capabilities and more and more advanced analytics, geospatial, time series, all that stuff.
[00:47:46] Unknown:
As you have been working with machine learning in the database and helping to sort of spread the gospel, if you will, to the broader community. What are some of the most interesting or innovative or unexpected ways that you've seen those capabilities used?
[00:48:01] Unknown:
I think I was really impressed by 1 of the telecom use cases that I've seen. It kind of brings everything together. We have use cases all over the map, from fraud prevention to targeted marketing to ad targeting. Yes, we are responsible for a lot of those ads that follow you around the web. That's us. We do some really cool things, I mean, we can map genomes. But 1 of the things that I loved was this: AT&T, I think, is 1 of the guys doing it. They take the data from all of their networks, all of the machine data, all the device data from every phone, every network repeater, every tower, every device they have.
And they're doing geospatial analytics so that they know where everything is, and they're doing all this other analytics, but they're doing it in real time. Which means, say AT&T is my provider, and I make a call during the Super Bowl, when everybody in the world is trying to film this touchdown that just happened and send it to their buddies. That is gonna be a seriously overloaded geography; that particular part of the network is gonna be way overloaded. If I'm trying to make a call that would normally cross that, they will, in real time, reroute my call around that overloaded section so that I never even know it. All I know is my call went through.
I had the conversation. There were no problems. So I get a better customer experience. They get reduced customer churn. Everybody's happy, and that requires that they be able to do time series analytics, geospatial analytics, every kind of analytics you can think of, machine learning, predictive, all of this stuff, and respond in microseconds. That to me was pretty mind blowing. I just learned about that a couple of weeks ago, and I mean, we've been doing telecom for ages. We were in some of the top 10 telecoms, but I always was thinking about things like churn reduction and some of the other use cases, you know, network optimization, like, where do I put my next tower, that kind of thing. I hadn't thought about that particular use case until, you know, someone described it to me: yeah, we do this. And I was like, wow, that's a pretty cool use case.
That 1 is probably the 1 I thought was the coolest.
[00:50:42] Unknown:
And in your own experience of working with the technology and helping to spread awareness of how it can be used, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:56] Unknown:
Well, unexpected, I have to say. I'm a Spark fan. I have always been a Spark fan. And like I said, way back in the day, before clusters got to the point where they had YARN, you know, MapReduce was the only thing you had. And then when Spark came out, I was just like, oh, this is so much better. I'm pretty decent friends with Holden Karau. We were probably better friends back when we had conferences and we could see each other on a regular basis. But she wrote things like High Performance Spark and a lot of those cool O'Reilly books. If I wanted to learn Spark, I'd go, you know, get 1 of her books. I already have a couple, but that's the thing. And she said to me, years back, it's a big challenge getting machine learning into production. And I was like, isn't that what Spark is for?
And I was very surprised about that, but that was years ago. So what happened recently that brought that to mind again was we did a POC, and someone was trying to do everything at their business using Spark. They had 278 nodes in a Spark cluster, and I was just like, wow, that's a lot. Are you trying to, you know, deal with petabytes of data or something? Why is it so big? And it's like, no, it wasn't that huge. They were just trying to do everything in the universe with Spark. Everything. And I was like, there are better tools for a lot of these things.
Spark is probably the best thing you can use for doing high-scale data transformation and pulling in data from 25 different sources. That's what you need your Spark cluster for. That's what it's good at. Once you have that data, if you wanna do your analytics and things like that, databases are way better at that. And we did a POC against those 278 Spark nodes on the cluster, and we ended up reproducing the use cases that they were having trouble with, with 9 nodes of Vertica, and getting better performance. So a friend of mine at work coined it as Spark spread.
It's this concept that because Spark can do everything, you should do everything with Spark. And it's like, no. Just because it can do everything doesn't mean it's necessarily the best tool for the job. That was a little bit of a surprise when I saw that 278 to 9. I was just like, wow, that is a big difference. And I think if more people were aware that that level of change can happen if you pick the right tool for the job, then they probably would.
[00:53:45] Unknown:
And on the note of picking the right tool, what are the cases where building or running your machine learning inside the database is not the right choice, and you're better served either going outside of the database and using something like Spark, or using something like PyTorch and TensorFlow and deploying the model natively? Putting the machine learning in the database sounds great, it sounds amazing, but what are the cases where it's the wrong choice and you're better suited going elsewhere?
[00:54:11] Unknown:
Well, I think I already mentioned 1, which is anything you gotta train on a GPU machine is not gonna make sense to do in the database. I mean, you could, maybe. I don't know. But it would be slow. And the whole point of using the database is to make it go faster. So if I was gonna train a neural net, I would use a GPU machine. I would use TensorFlow or PyTorch or something like that, and I would do that. And chances are, I would use Spark to feed it data. And maybe, if it was really straightforward, I might then use Spark to just put that right into production. On the other hand, if most of my production data was in a database or in a data lake, and I, you know, naturally was using that on CPU machines, on object storage, or something like that, I would think that putting it to work in the database would make sense.
I do see people sometimes, like, prepping their data in the database and then handing off the PMML to Spark and then putting it into production. I think the folks that do that mostly are the ones that already have certain aspects of their production already in Spark. It doesn't make sense to pick up and move stuff. If it ain't broke, don't fix it. If it works, if you're happy with it, then, you know, maybe shortcut your data prep a little bit, but there's no reason to lift and shift. I think that's the big thing. Everybody thinks, well, I gotta pick everything up and go do it somewhere else. It's like, no, you can use bits and pieces that work. If you already got this chunk that's working for you, then move this other chunk into something new that'll do it faster. That's 1 of the things I see. The other is the incremental thing. If I wanna do a new machine learning use case, I might do it in the database.
But if I've already got 10 in production, it's like I'm not gonna take them out of production and try and redo them in a new tech. I'm gonna leave them where they're at. I'm gonna let those function until maybe they're, you know, not that accurate anymore, maybe they need to be retrained, you know, that kind of thing. It's like, then maybe I'll move them or maybe I won't. Maybe it makes more sense to leave them where it is. I think all or nothing is the wrong way to go about any kind of analytics shift. If you're shifting in tech, I think the biggest mistake a lot of people have made over time has been Hadoop is the big thing. I'm gonna put everything in Hadoop.
The cloud is the big thing, I'm gonna put everything in the cloud. And then, you know, GDPR says I can't do that, I gotta pull everything back on prem. Things change. That is the number 1 thing. I remember Colin Faye is a data engineer that I know through Twitter. He's in Europe. I don't think I've ever physically met the guy, but we chat. He gave me this great acronym, which I usually change to DSOFU to be a little more polite: don't screw over future you. Your future self needs the freedom to be able to change, so don't lock yourself in. Those egress fees suck.
The ability to shift and change over time is huge in this business. That is always gonna happen. There's always gonna be something new, always gonna be a change as you go along. Go with the flow, but keep your options open. I talked to Catch Media, a customer, recently. They had gone, you know, all in on the cloud. That was the thing: we should go all in on the cloud. Everybody said that. Everybody said, oh, it'll save you costs, it's wonderful, you should do that. Now there's this thing called cloud repatriation, a term people have coined because it's happening a lot. And Catch Media was one of them. They ran the numbers after they moved to the cloud, and they went, ouch.
I'm not doing that. So they moved some of their workloads back, and now they're in hybrid, which is where most people are. The reason was they were spending, I wanna say, $200,000 a month for something that cost them, like, $10,000 a year if they did it on prem. The quote I got from the CEO was, I can rebuy the hardware every quarter, every 3 months. And then after he ran the numbers again, he changed it: make that every 2 months. Depending on your workload, depending on your situation, you may wanna change.
If Amazon gets crazy and charges too much and you wanna move to Google, you should be able to do that. Everything changes. Keep your options open.
[00:59:07] Unknown:
On that note of change being the only constant, what are some of the upcoming changes and trends in machine learning for the database market that you're keeping an eye on and paying particular attention to?
[00:59:19] Unknown:
I think AutoML is exciting. We just added XGBoost recently, which I think is a cool addition that makes everything more accurate, or makes decision trees more accurate anyway. But AutoML is fascinating: that ability to have the machine crunch through a lot of the grunt work for you shortens your job to the point where you only have to do the part that requires a human being's expertise, thought, and judgment, and the machine can do all the plain number crunching that machines are good at. That's a really exciting thing. AutoML, I think, is really making a big difference.
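As a rough illustration of what the in-database XGBoost support looks like from a client, here is a hedged sketch. It assumes Vertica's XGB_CLASSIFIER and PREDICT_XGB_CLASSIFIER functions as I understand them from the documentation; the table, columns, and parameter values are invented for the example, and exact syntax may vary by version.

```python
# Hypothetical sketch: train and apply a gradient-boosted tree model entirely
# inside the database, with no data movement out to a separate ML cluster.
# SQL function names follow Vertica's docs as I recall them; verify locally.
import vertica_python

with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="dbadmin", password="...",
                            database="analytics", autocommit=True) as conn:
    cur = conn.cursor()

    # Train on a table that already lives in the database.
    cur.execute(
        "SELECT XGB_CLASSIFIER('fraud_xgb', 'transactions', 'is_fraud', "
        "                      'amount, merchant_risk, hour_of_day' "
        "USING PARAMETERS max_ntree=50, max_depth=6)"
    )

    # Apply the trained model with a plain SQL query.
    cur.execute(
        "SELECT txn_id, "
        "       PREDICT_XGB_CLASSIFIER(amount, merchant_risk, hour_of_day "
        "                              USING PARAMETERS model_name='fraud_xgb') AS flagged "
        "FROM transactions_today"
    )
    print(cur.fetchall()[:10])   # first few predictions
```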
MLOps is also huge. The ability to get your machine learning models into production is the biggest barrier that so many people have, and everybody thinks that once it's in deployment, I'm done, and that's just not the way it works. It's in deployment, okay, that's great for a little while, and then I gotta go back and retrain it, compare the accuracy of the models, version it, switch it out. That's important. And so is the ability to keep track of who's doing what, to manage your teams and your models, to get things where they need to be, and to realize when a model is losing its accuracy and needs to be retrained.
All of that is powerful and important, whether you do it in the database or with something else. That MLOps capability is huge. I think the database helps you shortcut some of it, but you still have to do some of that management. You need to know the steps, and you need to be able to get them working, hopefully without a human in the loop the whole time; automate as much of it as possible. In some of the cybersecurity use cases we run into, they have to retrain their models, like, every hour or two.
And I'm serious, they're retraining their models. This is not just redeploying or anything like that. They're going back and training again because the data changes that fast. The bad operators are smart, and they're finding new ways to break in, and if you don't go back and figure that out as you go along, they'll get ahead of you. That's part of our world now, I guess, unfortunately.
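For that retrain-every-hour cadence, a scheduler would typically retrain under a staging name and then swap it over the serving name, so scoring queries never hit a half-trained model. Below is a minimal, hypothetical sketch of such a loop; the statements (XGB_CLASSIFIER, DROP MODEL, ALTER MODEL ... RENAME) follow Vertica's documentation as I recall it, and the table, columns, and cadence are invented for illustration.

```python
# Hypothetical sketch: hourly in-database retraining with a rename-based swap.
# Train into a staging model, then rename it over the serving model, so
# PREDICT_XGB_CLASSIFIER queries always reference a fully trained model.
# Verify the SQL against your Vertica version before using anything like this.
import time
import vertica_python

CONN = {"host": "vertica.example.com", "port": 5433,
        "user": "dbadmin", "password": "...",
        "database": "analytics", "autocommit": True}

def retrain_once(cur):
    # Train a fresh candidate on recent data (illustrative view name).
    cur.execute("DROP MODEL IF EXISTS intrusion_xgb_staging")
    cur.execute(
        "SELECT XGB_CLASSIFIER('intrusion_xgb_staging', 'recent_traffic', "
        "'is_attack', 'pkt_rate, failed_logins, geo_risk' "
        "USING PARAMETERS max_ntree=50)"
    )
    # Swap: retire the old serving model and promote the candidate.
    cur.execute("DROP MODEL IF EXISTS intrusion_xgb")
    cur.execute("ALTER MODEL intrusion_xgb_staging RENAME TO intrusion_xgb")

if __name__ == "__main__":
    while True:
        with vertica_python.connect(**CONN) as conn:
            retrain_once(conn.cursor())
        time.sleep(60 * 60)   # the "every hour or two" cadence from the episode
```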
[01:01:55] Unknown:
Are there any other aspects of machine learning in the database, or of your work at Vertica to support it, that we didn't discuss yet which you'd like to cover before we close out the show?
[01:02:05] Unknown:
There is one thing. I know skills are really hard to come by. I am self-taught from day one. I've been doing this for, like, 25 years, and I did everything from teaching myself how to program in multiple languages to teaching myself about marketing messages, of all things. You have to be able to get the information. The fact that you can now do machine learning in a database is great, but if you can't learn how to do that, it doesn't do you any good. One of the things Vertica has done to help with that is what we call the Vertica Academy. It's academy.vertica.com, and there's free training in how to use Vertica and how to do a lot of the cool things with it. You can get certified for the essentials and for the advanced stuff. We run boot camps periodically, and it's all out there, available on demand, without you spending hard-earned cash.
But I think that's the big thing: being able to upskill.
[01:03:03] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:19] Unknown:
I think data quality is the one. I do some product management for data quality software, and there's a lot of really good software out there with really good capabilities. But we're just starting to get data governance and cataloging and things like that working on the huge datasets we're dealing with now; I think that's behind. Getting quality data is still the biggest challenge. It's been the biggest challenge all the way along, for, like, 25 years, and it's still the biggest challenge. Our data quality and data governance tools keep getting more and more advanced, but our data types and our data volumes are both growing faster than they can keep up with. That's the biggest challenge right now: continuing to try to get better and better data.
No matter how you do your analytics, you've gotta have good data to feed it.
[01:04:21] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing with Vertica and on machine learning in the database, and for helping to promote that and educate people on its capabilities. It's definitely a very interesting and important topic, so I appreciate all of your efforts on that. And I hope you enjoy the rest of your day. Well, thank you for having me. I hope you enjoy your day too. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Guest Introduction: Paige Roberts
Machine Learning Inside the Database
Performance Implications and Workload Isolation
Types of Machine Learning Feasible in Databases
Workflow for Building and Deploying ML Models
Vertica's Architecture and Implementation
Impact on Data Infrastructure
Lessons Learned and Use Cases
When Not to Use ML in the Database
Upcoming Trends in ML for Databases
Skills and Training Resources
Biggest Gaps in Data Management Tooling
Closing Remarks