Summary
When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
- We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
- Your host is Tobias Macey and today I’m interviewing Paige Roberts about machine learning workflows inside the database
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of the current state of the market for databases that support in-process machine learning?
- What are the motivating factors for running a machine learning workflow inside the database?
- What styles of ML are feasible to do inside the database? (e.g. bayesian inference, deep learning, etc.)
- What are the performance implications of running a model training pipeline within the database runtime? (both in terms of training performance boosts, and database performance impacts)
- Can you describe the architecture of how the machine learning process is managed by the database engine?
- How do you manage interacting with Python/R/Jupyter/etc. when working within the database?
- What is the impact on data pipeline and MLOps architectures when using the database to manage the machine learning workflow?
- What are the most interesting, innovative, or unexpected ways that you have seen in-database ML used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on machine learning inside the database?
- When is in-database ML the wrong choice?
- What are the recent trends/changes in machine learning for the database that you are excited for?
Contact Info
- Blog
- @RobertsPaige on Twitter
- @PaigeEwing on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
- Vertica
- SyncSort
- Hortonworks
- Infoworld – 8 databases supporting in-database machine learning
- Power BI
- Grafana
- Tableau
- K-Means Clustering
- MPP == Massively Parallel Processing
- AutoML
- Random Forest
- PMML == Predictive Model Markup Language
- SVM == Support Vector Machine
- Naive Bayes
- XGBoost
- Pytorch
- Tensorflow
- Neural Magic
- Tensorflow Frozen Graph
- Parquet
- ORC
- Avro
- CNCF == Cloud Native Computing Foundation
- Hotel California
- VerticaPy
- Pandas
- Jupyter Notebook
- UDX
- Unifying Analytics Presentation
- Hadoop
- Yarn
- Holden Karau
- Spark
- Vertica Academy
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline and want to test out the projects you hear about on the show, you'll need somewhere to deploy it. So check out our friends over at Linode. With their managed Kubernetes platform, it's now even easier to deploy and scale your workflows or try out the latest Helm charts from tools like Pulsar, Pachyderm, and Dagster. With simple pricing, fast networking, object storage, and worldwide data centers, you've got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today. That's L-I-N-O-D-E. Get a $100 credit to try out a Kubernetes cluster of your own, and don't forget to thank them for their continued support of this show.
We've all been asked to help with an ad hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey. And today, I'm interviewing Paige Roberts about machine learning workflows inside the database. So Paige, can you start by introducing yourself?
[00:01:44] Unknown:
Hi. I'm Paige Roberts, open source relations manager at Vertica.
[00:01:50] Unknown:
I have been doing this for a long time. And do you remember how you first got involved in data management?
[00:01:55] Unknown:
I have been doing this a long time, a really long time. I started back in the nineties, doing tech support and documentation for a little data integration startup called Data Junction. I just was a sponge. I went into QA, and then I ran the website for a while, and then I ended up doing software engineering for about 5 years. So I was actually fixing bugs that I had reported as a tech support technician, so that was funny. And fortunately, I'm good at telling myself what the reproduction steps are, so that worked. Then I went into consulting. So I was going around and building data pipelines for folks, and I was a data integration specialist for quite a while, for a couple of different companies and independently, and then I got called back by my old company, which by then was Pervasive.
They asked me to do marketing and I was very confused. That was a bit of a culture shock, going from writing code and building pipelines to trying to explain to people why they needed this cool software to help them. And then I went off and did a lot of different things. I did more consulting, more engineering. I did product management for Syncsort for a while, for their data quality and data integration lines, and I was an analyst for a very short time, writing white papers and stuff like that for an independent analyst from the Bloor Group.
And I did some consulting for Hortonworks. So I've been all over the place. I joke that I've done every job in this industry except admin at this point. But yeah. So I kinda have a really broad view of analytics and data engineering and data architecture.
[00:03:50] Unknown:
And so we're here today talking about the opportunities and the implementation of machine learning inside the database. I'm wondering if you can start by giving a bit of an overview of the current state of the market for the support for machine learning in the database and maybe some of the motivating factors of why that's something that you should even be trying for?
[00:04:14] Unknown:
Well, I think it's become table stakes to a large extent. I think I saw a survey recently that said 70%, something in there. I don't remember the exact number, but something in that neighborhood, more than two-thirds of the analytics databases on the market now have machine learning built into them in some way or as an add-on. And that is just becoming the way that you do machine learning, and there's a lot of really good reasons for doing it in the database. And I think the market hasn't caught up to that change. It's usually the users that lead the way. They're like, I really want this, and then the vendors build something to make that work. In this case, I think the vendors saw it coming before maybe a lot of the data scientists and such saw it. The need to get data science, get machine learning into production is so huge, and the skills to make that happen are so limited, that the software vendors dove in and said, okay, there's gotta be a way that we can make this happen. And a lot of that has been in-database machine learning. There's a nice InfoWorld article somebody wrote recently about, you know, 8 databases that have machine learning in them or as an add-on.
As for the reasons that people are doing machine learning in the database, I think there's just a lot of them, and it kinda depends on who you are. The line-of-business people, they want the advantages of machine learning, but they wanna be able to use a BI tool. Like, they wanna do it point and click. They don't wanna write code. And if there's something built in that they can take advantage of, they can do it from, you know, Grafana or Power BI or Tableau or whatever they have that they're comfortable with, and they can do it that way. That broadens the number of people who can do predictive analytics, who can accomplish things like k-means clustering for targeted marketing, that kind of thing.
For the data scientists and the data analysts, there's a different advantage. The data analysts are used to using SQL, and it's very comfortable for them. Even data scientists, I don't know anybody that works with data that doesn't know how to use SQL. So that, by itself, makes life a lot easier for a lot of the data manipulation and things like that. The database, especially a modern scale-out, you know, cluster database, is designed to manipulate data very efficiently and very quickly. And as much as I hate to admit it, way more efficiently than any of the Hadoop or data lake kind of concepts. The database has got a good 10, 15 years of development and optimization.
Databases usually used to talk about how awesome their query optimization was. Well, what that means is they spent years and years and years honing that performance, and it's just ahead. So you get that advantage of being able to do your data prep really fast and do your training really fast, that speed. For the DBAs and architects and stuff, it just simplifies things. You don't end up with this giant, you know, sprawling set of 500 different components that you have to get to all work together. You end up with 1 sort of central database where you can do 80% of your work, and then the thing that you need to add onto it is the thing that feeds it. And you don't have to do 2 different feeds. That's the other aspect: instead of having to do 1 massive feed that feeds into the data lake, for data scientists to then pull out extracts and go do their thing somewhere else, because, you know, they can't handle 100 terabytes worth of stuff in R, it's just not made for it, they can just use the whole dataset that's already in the database. It makes life a lot easier, and a lot of the databases now will even reach out and grab data off of HDFS or off of an S3 bucket, that kind of thing.
So you can pull in all your datasets without having to go find them. And from a data engineering perspective, that means I don't have to build multiple pipelines to accomplish analytics, whether that analytics feeds some report somewhere or it's for ad hoc queries or it's for machine learning. I can build 1 pipeline and accomplish all of those goals. And this is 1 other thing, I've gotta stand on my soapbox a little bit. I don't think people realize the difference in concurrency between a data lake and a database. An analytics database is designed to have multiple concurrent users or workloads. So if I have 100 people in my company and they all need to access data to do their jobs, the database doesn't have any problem with that.
The data lake? Gartner even did something on that, and it was funny. There was an article Gartner recently did about how, you know, if you have more than 10 users, the data lake chokes. And there was something about, you know, Databricks saying they're working on that. I'm like, databases solved that problem years ago, and the fact that the data lake folks are still trying to catch up with that is huge. You just can't spread your analytics across your company and get everybody to make their decisions based on data if only 5 people can access the data at a time. That's huge.
[00:10:14] Unknown:
Yeah. Absolutely. It's definitely remarkable the disparity in where the engineering focus has gone between databases, particularly MPP style and data lake engines, where data lake engines are looking for volume of data over concurrency, and the MPP databases have optimized concurrency to be able to tackle the volume.
[00:10:38] Unknown:
Yes. And it makes a huge difference. There's actually 1 other piece, I think, that goes with that and this is from working at Vertica. We put the machine learning capabilities into the database and did not get the adoption, at first, that we expected from our customers and these are the customers that were pushing us. They were like, we really want machine learning in the database. Please make it happen. And we're like, alright then. Here you go. And they were like, yeah. We're not adopting that. It's like, what? Wait a minute.
And we asked them: why? Why is this not getting adopted when you were the 1 that was really pushing for it? And the answer was, we have business critical BI and SLAs that we have to meet on our analytics that are already in place. And if I train a machine learning model, it's liable to eat up all the resources and slow down everything else and maybe even, you know, bring the whole database to a stop. So being able to run concurrent workloads means more than just being able to handle, you know, a lot of users at 1 time. It means that I can train a machine learning model and I can do streaming data ingestion and data prep, and I can do ad hoc queries.
I can do targeted marketing campaigns, and all of these things can run at the same time and be isolated from each other. Workload isolation is huge. You have to be able to separate the resources so that nothing that 1 team does screws up something the other team does. ETL is a really good example. I mean, the classic concept was, well, let's wait until 3 AM or something when no one is using the database, and that's when we'll do ETL. And sure, the data will be a day old, but we won't bog down anything important. Well, if you're taking in streaming data all day, all the time, in parallel, massive amounts of it, I mean, we have ad targeting folks that are pulling in a million events a second, and they need an SLA of 200 microseconds.
It's like, you can't wait until 3 AM, and there is no time when the database is not busy. So in order to do that ingestion, that ETL, that data transformation that makes the data ready to accomplish something, it has to run all the time, and it has to not interfere with other workloads. So that's another good example: it has to be isolated. Machine learning has to be isolated. You have to be able to say, you have these resources, those are all yours, you can use as many of those as you want, but you're never gonna touch these resources over here that are doing a different job. That workload isolation was a big thing. And so now we have huge adoption because we built in workload isolation, and that made a big difference.
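To make the resource isolation idea concrete, here is a minimal, hypothetical sketch of carving out a pool for model training. It uses Vertica's resource pool SQL via the vertica_python driver, but the pool name, sizes, and connection details are invented for illustration, and the parameters you would actually choose depend on your cluster.

```python
# Hypothetical sketch: give ML training its own bounded resource pool so it
# can't starve BI dashboards or streaming ingest. Names and sizes are invented.
import vertica_python

conn_info = {"host": "localhost", "port": 5433, "user": "dbadmin",
             "password": "", "database": "analytics"}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    # A bounded slice of memory and concurrency for model training.
    cur.execute("CREATE RESOURCE POOL ml_pool MEMORYSIZE '25%' MAXCONCURRENCY 4")
    # Route the training user to that pool; dashboards and ETL keep their own.
    cur.execute("ALTER USER ml_trainer RESOURCE POOL ml_pool")
```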
[00:13:57] Unknown:
That's a good segue into the performance implications of running your machine learning inside the database. And I'm wondering if you can just talk through from both directions the performance improvements that you're able to realize by doing the machine learning in the database where the data already lives, but also the potential negative impact that it can have on the database and the database users
[00:14:19] Unknown:
for adding that additional workload and the training overhead of building those models and serving them? Well, I think the workload isolation takes care of a lot of those negative impacts. If you have really good, solid workload isolation built into your database, you don't have any negative impact. You can train your machine learning model at will, use up all of the resources assigned to you, and never worry about it bogging down the executive dashboard and your CEO getting mad at you. That doesn't happen. That is wonderful in itself. But the other aspect is what I talked about earlier: databases spent years and years and years perfecting performance.
So your MPP analytics database is designed to use all the power of the cluster and those really smart, super-intelligent AI query optimizers that are built into it to make the queries go as fast as possible, and that includes machine learning. And, again, I'm gonna use Vertica as the example because that's the 1 I know. When Vertica decided, okay, we're gonna take a logistic regression model and we're gonna build it into our database, they built it in C++. They built it distributed, so it's automatically parallelized, and it's got unbelievable performance, and you don't care. I mean, you don't have to care. That is just not something that the person using it has to worry about. They just have to say, okay, I wanna use logistic regression, and here's the dataset, and here's my parameters, and go.
You can do a single SQL call and say, split this into a training and a test set, train it on that, verify, and tell me what the accuracy is. You've even got AutoML. This is exciting. This is actually something I really like. You can give it a dataset and it can run 10 different machine learning algorithms against that dataset and give you a chart that shows how accurate each 1 was, and then you can go, well, it looks like random forest is the way to go for this particular dataset, for this particular use case and this problem. It's the most accurate. Talk about saving time: that means I don't have to try 6 other models that might work, because I already know random forest ran and gave me the most accurate answer.
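As a rough illustration of that single-SQL-call style, here is a hedged sketch of training and scoring a model with in-database functions from Python. LOGISTIC_REG and PREDICT_LOGISTIC_REG follow the pattern Vertica documents for its SQL ML API, but the table and column names are made up and exact parameters can vary by version.

```python
# Hedged sketch: train and score a logistic regression entirely in the database.
# Table and column names (churn_train, churn_test, churned, ...) are invented.
import vertica_python

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="analytics") as conn:
    cur = conn.cursor()
    # One SQL call trains the model on a table that never leaves the database.
    cur.execute("""
        SELECT LOGISTIC_REG('churn_model', 'churn_train',
                            'churned', 'tenure, monthly_charges, num_tickets')
    """)
    # Score a held-out table with the stored model and compute accuracy.
    cur.execute("""
        SELECT AVG(CASE WHEN predicted = churned THEN 1 ELSE 0 END) AS accuracy
        FROM (
            SELECT churned,
                   PREDICT_LOGISTIC_REG(tenure, monthly_charges, num_tickets
                       USING PARAMETERS model_name='churn_model', type='response')
                   AS predicted
            FROM churn_test
        ) scored
    """)
    print("holdout accuracy:", cur.fetchone()[0])
```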
And the other aspect of that is, before, if you were using Python or R or something like that, you generally had to extract a sample. You had to statistically try to figure out what a good sample was. And if you had a really large dataset, you maybe had to do that 3, 4, or 5 times to try and get a statistically good picture of what your data was like and not miss any, you know, unusual events or outliers or that kind of thing. And then you had to do it on your laptop or your desktop or whatever, and then you had to figure out how to put that into production at large scale. So maybe somebody else, like a data engineer, looks at the 25 data preparation steps you did and has to reproduce them in parallel at scale with Spark and has to sit down and write Scala code to make that all work.
And then at the end, you know, maybe the accuracy isn't as good or maybe, you know, your data has drifted in the 3 months it took them to write that application or whatever. The performance is great, and it's a huge help, but a lot of the power of it is that you don't have to move data around, and you don't have to rebuild everything. You can do everything in place. You can use the power of the database engine to accomplish the goal without having to worry about writing parallel code. It's like, I don't have to worry about if I have a 10 node cluster or a 100 node cluster. I just have to tell it I wanna do this model on this dataset and I get to use the whole dataset. I can use all of it and get the full accuracy, and it doesn't take me any longer.
I actually did a presentation recently with Anjal Singh. He's 1 of our data scientists, and he was showing a churn reduction model. And when he first was talking to me about how this worked, he got to a certain point and showed, here are the features that have the strongest influence on accuracy, and then, you know, here's a graph from the most influential to the least. And I'm like, okay, so you're gonna knock off those bottom 10 features, right, to get better performance. He's like, why would I do that? In Vertica, it runs in, you know, microseconds anyway. Why would I need to take out features if they add maybe a percentage point of accuracy to my model?
I can leave those features in and get a boost in accuracy. Even if it's a tiny boost, that could mean a lot of dollars to your company. 1% smarter is, you know, a million more dollars in your pocket. It's just amazingly powerful to get that performance and that productivity gain.
[00:19:58] Unknown:
Machine learning has been used in a lot of different contexts and a lot of different ways to mean many different things. And so I'm wondering if you can just take a moment to talk through the particular styles of machine learning that are feasible to do within the database. You know, some people might think machine learning means Bayesian inference. Other people might think it means building recurrent neural networks or convolutional neural networks and doing deep learning, or it might be, you know, doing, you know, gradient descent or, you know, Monte Carlo simulation. I'm wondering if you can just talk about sort of the the styles of machine learning that are most applicable to being run within the database and any kinds that wouldn't really fit very well in that environment.
[00:20:42] Unknown:
Well, I mean, we have an extensive set of algorithms built in. And data preparation is 1 of those things that people forget, you know, you can't do this algorithm until after you do the one-hot encoding to change your categorical variables to binary or whatever. You gotta find your outliers. You gotta check your correlations. You gotta do all that kind of stuff. Having all that built into your database shortens your work time quite a bit, and that's for any kind of machine learning, so that's not specific to any particular type of machine learning, I think. Just speeding up your data prep is huge. Vertica imports and exports PMML. So, you know, you can hand a model off, you can bring it back in, things like that, or you can import or export datasets, and that makes life a lot easier if you're just doing the data prep and then you're gonna do the machine learning somewhere else.
On the other hand, if you wanna do most forms of machine learning, you know, standard algorithms, old ones like SVM and Naive Bayes and regression and k-means clustering, all the kind of standard things that you use for the most part. I think we just added XGBoost. The machine learning inside the database is constantly expanding. So normal machine learning is the word that I would use: you can do it in the database. The main thing that I think you need to do outside the database is what you mentioned: neural networks and deep learning and that sort of thing. So if I'm trying to do something with PyTorch or TensorFlow and I'm trying to train a neural network model, a lot of times what you need there is GPUs, because GPUs are really good at that linear algebra kind of computation and get through the math faster than a CPU can.
And there's some cool things out there. Neural Magic is a piece of software that can simulate GPUs on a CPU machine, which is kinda cool. But, for the most part, if you're gonna train neural networks, you do that on a GPU machine. On the other hand, GPU machines are expensive, and the last thing you wanna do is deployment, putting your model to work, on your GPU machine, because that's expensive. You just wanna train your model there. Well, okay. So if you could take, say, your TensorFlow frozen graph and import it into your database and manage it like a table and deploy it and use it for prediction and put it to work on a standard, you know, CPU machine or a normal virtual machine in a cloud, that's powerful.
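A loose sketch of that train-on-GPU, serve-on-CPU flow is below. The IMPORT_MODELS and PREDICT_TENSORFLOW calls mirror what Vertica documents for its TensorFlow support, but the model path, table, and column names here are hypothetical.

```python
# Illustrative only: import a frozen TensorFlow model trained elsewhere on GPUs,
# then serve predictions from ordinary CPU nodes with plain SQL. Paths, table,
# and column names are hypothetical.
import vertica_python

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Bring the exported frozen graph into the database and manage it like a table.
    cur.execute("""
        SELECT IMPORT_MODELS('/models/churn_tf_frozen'
                             USING PARAMETERS category='TENSORFLOW')
    """)
    # Prediction runs in-database, no GPU required.
    cur.execute("""
        SELECT customer_id,
               PREDICT_TENSORFLOW(f1, f2, f3
                   USING PARAMETERS model_name='churn_tf_frozen') AS score
        FROM scoring_batch
    """)
```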
That makes a big difference, and I think that's 1 of the things that Vertica has going for it. I'm the open source relations manager. We integrate with open source. So, you know, if you wanna train a model in Vertica and hand off the PMML to somebody else, you can do that. If you wanna train a model in Spark, or in whatever you feel like training your model in, and then hand it off to the database to manage, evaluate, and deploy in production, train it in TensorFlow, train it in PyTorch, import it, put it to work. That's a powerful concept, and that cooperation is always better than the opposite. A friend of mine was just talking about this. They used to work with SAP software, and it worked great as long as you only used SAP software with it. As soon as you tried to integrate it with something else, it didn't work so well. So I think that's 1 of the things to watch for: does it work and play well with others?
Because your machine learning is powerful, but it's never gonna work in isolation. It's always gonna have to integrate with a visualization layer, integrate with all your data sources. Take something simple: if I wanna do analytics on the structured data in the database and I also need to look at maybe my 5-year historical data, which is stored in Parquet on an S3 bucket, that should be possible. There's no reason why I shouldn't be able to say to my database, hey, do analytics on this table and that sort-of table that's sitting over there, join them together, give me information on all of it, and train this model on that resulting dataset.
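Purely as an illustration of that query-it-where-it-sits idea, an external table over Parquet in S3, joined to a regular in-database table, might be defined along these lines. The AS COPY FROM ... PARQUET pattern follows Vertica's external table syntax; the bucket, schema, and names are invented.

```python
# Sketch of analyzing data where it sits: an external table over 5 years of
# Parquet history in S3, joined to a regular table inside the database.
# Bucket path, columns, and names are all invented for illustration.
import vertica_python

EXTERNAL_DDL = """
CREATE EXTERNAL TABLE sales_history (
    sale_date   DATE,
    customer_id INT,
    amount      FLOAT
) AS COPY FROM 's3://my-bucket/sales/*.parquet' PARQUET
"""

TRAINING_VIEW = """
CREATE VIEW training_set AS
SELECT c.customer_id, c.segment, SUM(h.amount) AS lifetime_spend
FROM customers c
JOIN sales_history h ON h.customer_id = c.customer_id
GROUP BY c.customer_id, c.segment
"""

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="analytics") as conn:
    cur = conn.cursor()
    cur.execute(EXTERNAL_DDL)    # nothing is copied; the Parquet is read in place
    cur.execute(TRAINING_VIEW)   # in-database table joined with the external data
```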
No reason why that shouldn't work, and that's 1 of the things that I think some of the databases are still catching up to: the concept that not all of the data that you need to use is in your database. You should be able to do analytics on data outside the database. And I think data lakes have had that concept for quite a while. They've had the concept of, dump everything in here, and we don't care if it's in Parquet and ORC and JSON and Avro and, you know, all these different formats, log files and sensor data and all this other crazy semi-structured stuff. We should be able to analyze it anyway. And I think databases are just now catching up to the idea, yeah, we need to just analyze all that.
And I think the other thing to watch for is the guy who says, oh, sure, we'll analyze that, just put it all in my database first. That's a vendor play for, I want all your data in my database. And then there are egress fees. Have you heard of egress fees? When I heard of egress fees, I just about had a heart attack. I was like, that is the most idiotic thing I have ever heard in my life. So for anybody who doesn't know, an egress fee is when you pay money to your database vendor to move your data somewhere else. Seriously, your data, if you wanna move it, they want you to pay them. I was like, what?
I thought that was the craziest concept I'd ever heard.
[00:27:16] Unknown:
That's a big thing in the cloud vendors as well. Once you get it inside their network, they're happy to let you send it everywhere as long as it's within their network. But if you try to go from AWS to GCP or vice versa, then they're definitely gonna get you on the cost.
[00:27:30] Unknown:
Well, and hybrid cloud is huge now. The idea of having a data center and also, you know, having some of your workload on the cloud and being able to pass your data back and forth, that's powerful. I just saw something that said, like, 70% of the folks are doing that, I've got the graph around here somewhere. The CNCF did this survey recently about how many folks are doing cloud, how many folks are doing on premises or private cloud. Hybrid was the top of the list. The number 1 way that people want to deploy their database and their data analytics is hybrid. They wanna use both.
And if I'm letting people put stuff in my database but I'm not letting anybody take it out, that's a huge barrier. That's like a speed bump in the middle of your nice data flow. My background is data engineering, so the idea of telling somebody they can't move their data is just, like, that's what you do with data. It flows. It changes. It shifts form. That's how you make it work. You know? It's like a dam. You've got this wonderfully flowing stream, and then here's this bare rock in the middle of the stream. It's like, nope, sorry, you can't do that. Or a toll troll. It's like, no, you can't go past unless you give me some money. It's like, that's just a 1-way street.
[00:29:00] Unknown:
You can come in, but you can't get out. It's the Hotel California for data.
[00:29:04] Unknown:
It is. The Hotel California for data. You can come in, but you cannot leave.
[00:29:10] Unknown:
Yeah. And so for people who are looking to use the database for building and deploying their machine learning models, I'm wondering if you can just talk through the overall workflow of going from, I have this data. It's already in my database to I have this machine learning model, and now I've put it into production and just the overall workflow of getting from point a to point
[00:29:33] Unknown:
b? Well, if your data is already in the database, that makes life really easy. And then it's just kind of a choice of what do I wanna work with. Most likely, I'm a data scientist, I like notebooks, I'm gonna pull up my Jupyter Notebook. We have a nice open source project called VerticaPy. So you can use your Jupyter Notebook, and it has a Python interface. So you write Pandas-style code and you do some data exploration. You do some, like, you know, what's my feature correlation? Where are my outliers? Maybe I wanna balance my dataset, do some one-hot encoding. I do all that kind of stuff.
You know, maybe I discover, oh, this dataset isn't in my database. It's over here in this S3 bucket in ORC, you know, or it's JSON that's been streaming in for the last 5 years and now we've got a big pile of it sitting over here in zip files or something. If you're using Vertica anyway, you just define that as, like, an external table, tell it where it is, and it'll just go, and you can just keep going. You could just say, okay, join this data with that data over there, and let me see. Okay, these features are cool, and these only add 2% to my accuracy, but you know what? It doesn't cost me that much. I'll just keep them. And then train your model, evaluate it.
Look at your ROC curve. Look at your lift. Look at your confusion matrix. Whatever you wanna do, you know, maybe you wanna import matplotlib and have a look at that. How does that look? Okay, I'm pretty happy with that. Save it. You save it like a table, and then you use a SQL command that says, use this to predict, and you tell it where the dataset is that it's gonna predict on, and, you know, you check it. Okay, this is solid. Well, since your database is gonna be the same in development and QA and production, there is a line of code that you write to say, okay, push this to production and make it work. That's it. Instead of months.
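For readers who want to see the shape of that notebook flow, here is a hedged VerticaPy-style sketch. VerticaPy's module paths and method names have shifted between releases, so treat the exact imports and calls as assumptions; the point is that exploration, encoding, training, evaluation, and prediction all push down into the database.

```python
# Hedged VerticaPy-style sketch of the notebook workflow described above.
# Module paths and method names vary between VerticaPy releases, so treat the
# exact calls as assumptions; everything here executes inside Vertica.
import verticapy as vp
from verticapy.learn.linear_model import LogisticRegression  # path differs by version

vp.new_connection({"host": "localhost", "port": 5433, "user": "dbadmin",
                   "password": "", "database": "analytics"}, name="vtx")
vp.connect("vtx")

churn = vp.vDataFrame("public.churn")   # lazy handle; the data stays in the database
churn.corr()                            # feature correlations, computed in-database
churn["plan_type"].get_dummies()        # one-hot encode a categorical column

model = LogisticRegression("churn_model")
model.fit("public.churn", ["tenure", "monthly_charges"], "churned")
model.roc_curve()                       # evaluation also pushes down to the database
model.confusion_matrix()

# Scoring writes a prediction column; it's just another in-database operation.
model.predict(churn, name="churn_score")
```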
That's kind of huge, I think. For a lot of folks, the biggest jump is like, okay, I'm happy, my model is awesome, how do I get it into production? 1 of the things Vertica does that I really like is how it sells you your production database. You can either say, I'm gonna have this many terabytes of data, up to this many terabytes, and buy capacity like that and then have unlimited compute. And then, you know, say if you're on Amazon, you'll have to pay Amazon for however much compute you use, but you won't have to pay Vertica. Or you can have unlimited storage and say, well, this is how much compute I think I'm gonna use, so that's the capacity that I've bought.
And then, you know, if you're on prem or on the cloud, either way, you're gonna have to pay for your infrastructure because we're just software. I mean, that's pretty much it. You literally pay for your production, and dev and QA just come with it. High availability comes with it. It's not extra. You don't have to go pay some more to get that, as if that were an extra thing. You know, you wouldn't have a dev cluster, would you? You don't do testing, do you? It's like, everybody does that. And I heard a great joke that's like, everybody does QA. Some people are fortunate enough to not do it in production.
So, yeah, everybody has that stuff. So you gotta have the dev and the test and the production and the high availability. So that's just included in your license. And, I think that in itself just makes your life a lot easier. And it means that your environments are all gonna be the same and you don't have to make a big change to get to production. It's just take it from here and push it over there.
[00:33:34] Unknown:
RudderStack's smart customer data pipeline is warehouse first. It builds your customer data warehouse and your identity graph on your data warehouse with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and Zendesk help you go beyond event streaming. With RudderStack, you can use all of your customer data to answer more difficult questions and send those insights to your whole customer data stack. Sign up for free at dataengineeringpodcast.com/rudder today.
And so in terms of the actual implementation of the machine learning capabilities, specifically with Vertica, since that's what you're familiar with, I'm wondering if you could just talk through some of the architecture and implementation and some of the ways that it has changed or evolved in the time that you've been working with it. I think the architecture is very much an MPP architecture
[00:34:28] Unknown:
in the database already, and it's pretty much designed from the ground up to be really optimized for high-scale analytics. They learned a little bit maybe from some of the mistakes. I mean, they did it beforehand, so maybe they just were smarter. I don't know. But they don't have a master node. There's no leader node. There's no choke point. There's no single point of failure. If I send a query to Vertica and it has 100 nodes, it could go to any 1 of those nodes and initiate the query, which means, of course, automatically, you have 100 people that could hit the database at the same time and their queries aren't even initiating from the same node. It's very parallel, built in. What we did with the machine learning, as we added to that, was simply take the same principles and take a machine learning algorithm.
Say I have k-means clustering and I'm working on it with R. Well, it's gonna be linear. It's gonna be sequential. It's gonna be kind of intended to do 1 thing and then do another thing and then do another thing. Well, we have pretty smart engineers, and they basically went, yeah, no, that's not gonna work, and rebuilt it using something else, using C and things like that. And we also, at some point, I think, acquired Distributed R and built that in as well. So we have the capability to distribute your R code. We have this UDx framework, which is wonderful.
If I have a special Python algorithm that I wrote myself, that nobody else has, it's my secret sauce, I can write that, wrap it as if it were a SQL function, and put it in my distributed database, and it will automatically distribute the workload and treat it as if it were part of Vertica from the beginning. And then all the rest of my workflow, of course, is already in there, and this new thing that I added that's special for me just goes right into the workflow as if it was there all along. So I think that's pretty cool, and we have that for Java, C, Python, R.
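Here is a rough sketch of what that wrapping can look like for a Python scalar UDx. The class and method structure follows the pattern in Vertica's Python SDK documentation, but the details, and the placeholder scoring logic, are illustrative rather than exact.

```python
# Rough shape of a Python scalar UDx, following the structure in Vertica's
# Python SDK examples. The placeholder scoring logic and names are illustrative.
import vertica_sdk

class MyScore(vertica_sdk.ScalarFunction):
    """Secret-sauce scoring function, applied row by row, in parallel."""
    def processBlock(self, server_interface, arg_reader, res_writer):
        while True:
            x = arg_reader.getFloat(0)
            res_writer.setFloat(2.0 * x + 1.0)   # stand-in for the real logic
            res_writer.next()
            if not arg_reader.next():
                break

class my_score_factory(vertica_sdk.ScalarFunctionFactory):
    def getPrototype(self, srv_interface, arg_types, return_type):
        arg_types.addFloat()
        return_type.addFloat()
    def getReturnType(self, srv_interface, arg_types, return_type):
        return_type.addFloat()
    def createScalarFunction(self, srv_interface):
        return MyScore()

# Registered with SQL along these (hypothetical) lines, after which it can be
# called like any built-in function and is distributed across the cluster:
#   CREATE LIBRARY my_lib AS '/path/to/this_file.py' LANGUAGE 'Python';
#   CREATE FUNCTION my_score AS LANGUAGE 'Python' NAME 'my_score_factory' LIBRARY my_lib;
```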
I'm not sure if I've hit everything. But, you know, the idea is you can use whatever you're comfortable with and you can put it in there. I think we added a Golang interface at some point, but I'm not sure if it's in the UDx or not. It's very flexible, and the capability to work with whatever you have or whatever you're comfortable with makes a big difference. 1 of the big differences about Vertica that makes the machine learning faster is not really a machine learning specific thing, and that is, I hear a lot about in-memory databases.
Okay. The concept being, of course, that RAM is much faster than reading from disk. Well, it is. But if you optimize the way you store things on disk, you can speed up the way you read from disk to the point where it's faster, because of the columnar nature and because there's an AI built in there that'll pick the right compression algorithm for that particular data type and things like that. You can compress up to 90%. Well, if I can compress the data down to 1 tenth of its previous size and then work on that 1 tenth of the size without ever uncompressing it, that's a lot faster than if I had to work on a dataset at its full size, or if I had to compress it, uncompress it, work with it, and recompress it. That, obviously, is gonna be slower too. So what Vertica did was they not only came up with some smart compression algorithms, they came up with new ways to store data so that it is already optimized for analytics.
That is different. We don't store data in tables. We don't store data in relational datasets. We don't store data in a snowflake or a star schema. That's not how the data is physically stored. What we have is an AI analyzer that looks at your query, the kind of analytics that you do, and then says, okay, in order to do that, this is how you should store your data to make it optimally fast. So if I'm doing, say, machine learning training on a dataset that would normally include 5 different tables, and it's a big flat thing with 700 columns, and that's what I need to train my machine learning, that's how we store it. It's already stored that way.
So when I go to do my training, that's way faster than if I have to go and find all of those tables and join them and get that all ready before I can do anything with it. The other thing is, like, if I'm getting sensor data coming in, you know, 10 readings per second, and all I need to look at is every minute, and what I'm gonna train my model on is the sensor readings per minute, I can do some pretty massive compression. So what we do is we'll store the original data, or we'll let you store it somewhere else, Parquet or something like that if you want to, and we'll store the aggregated version, either 1 or both.
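A small sketch of that raw-plus-rollup pattern, with invented table names; TIME_SLICE follows Vertica's documented time-series SQL, but check the exact syntax for your version.

```python
# Sketch of the raw-plus-rollup idea: keep the 10-per-second raw readings and
# maintain a per-minute aggregate to train on. Table names are invented.
import vertica_python

ROLLUP = """
CREATE TABLE sensor_by_minute AS
SELECT sensor_id,
       TIME_SLICE(reading_ts, 1, 'MINUTE') AS minute_bucket,
       AVG(reading_value) AS avg_value
FROM sensor_raw
GROUP BY sensor_id, TIME_SLICE(reading_ts, 1, 'MINUTE')
"""

with vertica_python.connect(host="localhost", port=5433, user="dbadmin",
                            database="analytics") as conn:
    cur = conn.cursor()
    # Retraining hits the small rollup; dropping back to 10-second granularity
    # later just means re-aggregating sensor_raw, which is still there.
    cur.execute(ROLLUP)
```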
And that way, when you go to do your machine learning and you need that aggregated version, you know, to retrain your model on the latest readings, boom. Fast. It's already there. On the other hand, if you go, oh, you know what? I think I might increase my accuracy if I looked at every 10 seconds instead of every minute. You still got your original data, and you can go back and you can do that. So that's a long walk for a short drink of water, but I think that makes the difference. The things that we did to optimize analytics
[00:40:49] Unknown:
end up also optimizing machine learning. We've touched on this a little bit, but in terms of the architectural patterns and the way that you think about building your data infrastructure, how does the use of the database for your machine learning building and deployment impact the way that you think about building the rest of your platform?
[00:41:10] Unknown:
I do an entire presentation on this. If you Google unifying analytics and Paige Roberts, you'll probably find a video of me doing this talk. But, essentially, back in the day, we had this data warehouse architecture with that, you know, slow, once-a-day batch ETL and a big database in the middle, and it was pulling from, oh, I don't know, all 6 of your transactional sources, you know, maybe. And then it had a visualization layer on the front, and that's how you did that data warehouse concept. And then we're like, well, I would really like to pull data from 25 places, and a lot of it is semi-structured. And I'd like to get some streaming data in there, and I'd like to do some more advanced analytics and stuff. And did we improve the existing data warehouse? No. We threw it out the window and started over with the whole data lake concept.
And so now what I see the most often is what I call a combination architecture, where someone has taken their data warehouse and tried to replace it with a data lake, and that didn't work. So what they ended up with is this sort of cooperative thing where either the data lake feeds the data warehouse or the data warehouse feeds the data lake, and they're working together, and the data engineer is going bonkers because he's got to build every pipeline twice and he's got to move things from 1 place to the other constantly. And, you know, his data scientists are working over here, except they still can't work with all this data. They've gotta extract samples, and then they're like, oh, what about this data? Oh, that's in the data warehouse. And you end up with a lot of duplication and a lot of frustration and trying to find things.
This isn't really the greatest way to do it. But because you have the data warehouse, you know, the analytics database, doing what it does best and the data lake doing what it does best, it does work. And there's a lot of people still using that. That is, I think, the most common architecture right now. What I see over time is more and more people moving to this concept where the data warehouse and the data lake sort of merge and become 1. And you end up, say, storing all of your data in an S3 bucket or on some specialized shared storage hardware that has S3 as a supported type. So object storage, Pure Storage FlashBlade, or Scality RING, or Dell EMC, yeah. So you've either got some specialized storage on prem, or you've got an S3 bucket or Google Cloud Storage or something like that in the cloud. But either way, that's pretty much where all of your data ends up.
It just ends up in, like, 10 different varieties. So you've got, you know, your JSON messages streaming in, or you've got your sensor data streaming in, or you've got your Parquet data getting bigger and bigger and, you know, absorbing long-term data. You've got CSV files and Excel files and, you know, simple things like that. All of them in this sort of large space, and that includes your database format. So with Vertica, anyway, we store our database format, which we call read optimized storage. It's a file format just like Parquet, just like ORC. It's a highly optimized analytic columnar format, and you can store it on S3. So it sits there right next to JSON and Parquet and ORC and Avro and all the other ones.
And then, you know, you do analytics on top of that. And you don't have to move it anywhere. You don't have to change format unless your goal is to improve your performance. You might move it from 1 format to another to get faster performance. Read optimized storage is ours, and, obviously, we're gonna be fastest on our own format, but you still get pretty decent performance from, say, something else columnar and designed for analytics like Parquet or ORC. Whereas you would get maybe less fast performance if you were looking at a whole bunch of JSON files sitting around.
Everything is gonna be analyzable in 1 place. So I don't have to build multiple pipelines. I can build 1 pipeline, store all the kinds of data that I wanna store, and maybe move a few around if I need to. Store it where it makes sense and then analyze all of it in place without picking it up and moving it, and we call that unified analytics. I'm seeing more and more of the independent analysts picking that up, EMA, GigaOm, guys like that. I think GigaOm just did a radar report for unified analytics platforms: that concept of pulling the data lake and the data warehouse together and making them 1 thing instead of having 2 separate things. And I think you hear that from the analytics database vendor side, which is us. We call it unified analytics.
The data lake side, they're mostly calling it a lakehouse, which I think is pushing your metaphor a little far, but okay, if that's what you wanna call it. Alright. Go you. And the main thing is that I think the database vendors have a little bit of a head start, and I think that is true on a lot of things. I was there, you know, building Hadoop clusters back when Hadoop got YARN, and I had to rebuild my cluster because I'd built it with the previous version. It's grown, but it's only, what, 10, 12 years old at this point. And even a relatively new, relatively young database like Vertica is only 15, 16 years old. It's still got a serious head start.
And the ones that have been around even longer have got even more of a head start. It's just gonna take a while for the data lake vendors to catch up, I think. And in the meantime, you know, the database vendors aren't sitting still. We're adding more and more machine learning capabilities and more and more advanced analytics, geospatial, time series, all that stuff.
[00:47:46] Unknown:
As you have been working with machine learning in the database and helping to sort of spread the gospel, if you will, to the broader community. What are some of the most interesting or innovative or unexpected ways that you've seen those capabilities used?
[00:48:01] Unknown:
I think I was really impressed by 1 of the telecom use cases that I've seen. It kind of brings everything together. We have use cases all over the map, from fraud prevention to targeted marketing to ad targeting. Yes, we are responsible for a lot of those ads that follow you around the web. That's us. We do some really cool things, I mean, we can map genomes. But 1 of the things that I loved was this: AT&T, I think, is 1 of the guys doing it. They take the data from all of their networks, all of the machine data, all the device data from every phone, every network repeater, every tower, every device they have.
And they're doing geospatial analytics so that they know where everything is, and they're doing all this other analytics, but they're doing it in real time. Which means, say AT&T is my provider, and I make a call during the Super Bowl, when everybody in the world is trying to film this touchdown that just happened and send it to their buddies. That is gonna be a seriously overloaded geography; that particular part of the network is gonna be way overloaded. If I'm trying to make a call that would normally cross that, they will, in real time, reroute my call around that overloaded section so that I never even know it. All I know is my call went through.
I had the conversation. There were no problems. So I get a better customer experience. They get reduced customer churn. Everybody's happy, and that requires that they be able to do time series analytics, geospatial analytics, every kind of analytics you can think of, machine learning, predictive, all of this stuff, and respond in microseconds. That to me was pretty mind blowing. I just learned about that a couple of weeks ago, and I mean, we've been doing telecom for ages. We were in some of the top 10 telecoms, but I always was thinking about things like churn reduction and some of the other use cases, you know, network optimization, like, where do I put my next tower, that kind of thing. I hadn't thought about that particular use case until, you know, someone described it to me: yeah, we do this. And I was like, wow, that's a pretty cool use case.
That 1 is probably the 1 I thought was the coolest.
[00:50:42] Unknown:
And in your own experience of working with the technology and helping to spread awareness of how it can be used, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process?
[00:50:56] Unknown:
Well, unexpected, I have to say. I'm a Spark fan. I have always been a Spark fan. And like I said, way back in the day, before clusters got to the point where they had YARN, you know, MapReduce was the only thing you had. And then when Spark came out, I was just like, oh, this is so much better. I'm pretty decent friends with Holden Karau. We were probably better friends back when we had conferences and we could see each other on a regular basis. But she wrote things like High Performance Spark and a lot of those cool O'Reilly books. If I wanted to learn Spark, I'd go, you know, get 1 of her books. I already have a couple, but that's the thing. And she said to me, years back, it's a big challenge getting machine learning into production. And I was like, isn't that what Spark is for?
And I was very surprised about that, but that was years ago. So what happened recently that brought that to mind again was we did a POC, and someone was trying to do everything at their business using Spark. They had 278 nodes in a Spark cluster, and I was just like, wow, that's a lot. Are you trying to, you know, deal with petabytes of data or something? Why is it so big? And it's like, no, it wasn't that huge. They were just trying to do everything in the universe with Spark. Everything. And I was like, there are better tools for a lot of these things.
Spark is probably the best thing you can use for doing high-scale data transformation and pulling in data from 25 different sources. That's what you need your Spark cluster for. That's what it's good at. Once you have that data, if you wanna do your analytics and things like that, databases are way better at that. And we did a POC against those 278 Spark nodes on the cluster, and we ended up reproducing the use cases that they were having trouble with, with 9 nodes of Vertica, and getting better performance. So a friend of mine at work coined it as Spark spread.
It's this concept that because Spark can do everything, you should do everything with Spark. And it's like, no. Just because it can do everything doesn't mean it's necessarily the best tool for the job. That was a little bit of a surprise when I saw that 278 to 9. I was just like, wow, that is a big difference. And I think if more people were aware that that level of change can happen if you pick the right tool for the job, then they probably would.
[00:53:45] Unknown:
And on the note of picking the right tool, what are the cases where building or running your machine learning inside the database is not the right choice, and you're better served either going outside of the database and using something like Spark, or using something like PyTorch and TensorFlow and deploying the model natively? Putting the machine learning in the database sounds great, it sounds amazing, but what are the cases where it's the wrong choice and you're better suited going elsewhere?
[00:54:11] Unknown:
Well, I think I already mentioned 1, which is anything you gotta train on a GPU machine is not gonna make sense to do in the database. I mean, you could, maybe. I don't know. But it would be slow. And the whole point of using the database is to make it go faster. So if I was gonna train a neural net, I would use a GPU machine. I would use TensorFlow or PyTorch or something like that, and I would do that. And chances are, I would use Spark to feed it data. And maybe, if it was really straightforward, I might then use Spark to just put that right into production. On the other hand, if most of my production data was in a database or in a data lake, and I, you know, naturally was using that on CPU machines, on object storage, or something like that, I would think that putting it to work in the database would make sense.
I do see people sometimes, like, prepping their data in the database and then handing off the PMML to Spark and then putting it into production. I think the folks that do that mostly are the ones that already have certain aspects of their production already in Spark. It doesn't make sense to pick up and move stuff. If it ain't broke, don't fix it. If it works, if you're happy with it, then, you know, maybe shortcut your data prep a little bit, but there's no reason to lift and shift. I think that's the big thing. Everybody thinks, well, I gotta pick everything up and go do it somewhere else. It's like, no, you can use bits and pieces that work. If you already got this chunk that's working for you, then move this other chunk into something new that'll do it faster. That's 1 of the things I see. The other is the incremental thing. If I wanna do a new machine learning use case, I might do it in the database.
But if I've already got 10 in production, it's like I'm not gonna take them out of production and try and redo them in a new tech. I'm gonna leave them where they're at. I'm gonna let those function until maybe they're, you know, not that accurate anymore, maybe they need to be retrained, you know, that kind of thing. It's like, then maybe I'll move them or maybe I won't. Maybe it makes more sense to leave them where it is. I think all or nothing is the wrong way to go about any kind of analytics shift. If you're shifting in tech, I think the biggest mistake a lot of people have made over time has been Hadoop is the big thing. I'm gonna put everything in Hadoop.
The cloud is the big thing, I'm gonna put everything in the cloud. And then, you know, GDPR says I can't do that, I gotta pull everything back on prem. Things change. That is the number 1 thing. I remember Colin Faye is a data engineer that I know through Twitter. He's in Europe. I don't think I've ever physically met the guy, but we chat. He gave me this great acronym, which I usually change to DSOFU to be a little more polite: don't screw over future you. Your future self needs the freedom to be able to change, so don't lock yourself in. Those egress fees suck.
The ability to shift and change over time is huge in this business. That is always gonna happen. There's always gonna be something new, always gonna be a change as you go along. Go with the flow, but keep your options open. I talked to Catch Media, a customer, recently. They had gone, you know, all in on the cloud. That was the thing: we should go all in on the cloud. Everybody said that. Everybody said, oh, it'll save you costs, it's wonderful, you should do that. Now there's this thing called cloud repatriation, a term people have coined because it's happening a lot. And Catch Media was one of them. They ran the numbers after they moved to the cloud, and they went, ouch.
I'm not doing that. So they moved some of their workloads back, and now they're in hybrid, which is where most people are. The reason was they were spending, I wanna say, $200,000 a month for something that cost them, like, $10,000 a year if they did it on prem. The quote I got from the CEO was, I can rebuy the hardware every quarter, every 3 months. And then after he ran the numbers again, he changed it: make that every 2 months. Depending on your workload, depending on your situation, you may wanna change.
If Amazon gets crazy and charges too much and you wanna move to Google, you should be able to do that. Everything changes. Keep your options open.
[00:59:07] Unknown:
On that note of change being the only constant, what are some of the upcoming changes and trends in machine learning for the database market that you're keeping an eye on and paying particular attention to?
[00:59:19] Unknown:
I think AutoML is exciting. We just added XGBoost recently, which I think is a cool addition that makes everything more accurate, or makes decision trees more accurate anyway. But AutoML is fascinating: that ability to have the machine crunch through a lot of the grunt work for you shortens your job to the point where you only have to do the part that requires a human being's expertise, thought, and judgment, and the machine can do all the plain number crunching that machines are good at. That's a really exciting thing. AutoML, I think, is really making a big difference.
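As a rough illustration of what the in-database XGBoost support looks like from a client, here is a hedged sketch. It assumes Vertica's XGB_CLASSIFIER and PREDICT_XGB_CLASSIFIER functions as I understand them from the documentation; the table, columns, and parameter values are invented for the example, and exact syntax may vary by version.

```python
# Hypothetical sketch: train and apply a gradient-boosted tree model entirely
# inside the database, with no data movement out to a separate ML cluster.
# SQL function names follow Vertica's docs as I recall them; verify locally.
import vertica_python

with vertica_python.connect(host="vertica.example.com", port=5433,
                            user="dbadmin", password="...",
                            database="analytics", autocommit=True) as conn:
    cur = conn.cursor()

    # Train on a table that already lives in the database.
    cur.execute(
        "SELECT XGB_CLASSIFIER('fraud_xgb', 'transactions', 'is_fraud', "
        "                      'amount, merchant_risk, hour_of_day' "
        "USING PARAMETERS max_ntree=50, max_depth=6)"
    )

    # Apply the trained model with a plain SQL query.
    cur.execute(
        "SELECT txn_id, "
        "       PREDICT_XGB_CLASSIFIER(amount, merchant_risk, hour_of_day "
        "                              USING PARAMETERS model_name='fraud_xgb') AS flagged "
        "FROM transactions_today"
    )
    print(cur.fetchall()[:10])   # first few predictions
```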
MLOps is also huge. The ability to get your machine learning models into production is the biggest barrier that so many people have, and everybody thinks that once it's in deployment, I'm done, and that's just not the way it works. It's in deployment, okay, that's great for a little while, and then I gotta go back and retrain it, compare the accuracy of the models, version it, switch it out. That's important. And so is the ability to keep track of who's doing what, to manage your teams and your models, to get things where they need to be, and to realize when a model is losing its accuracy and needs to be retrained.
All of that is powerful and important, whether you do it in the database or with something else. That MLOps capability is huge. I think the database helps you shortcut some of it, but you still have to do some of that management. You need to know the steps, and you need to be able to get them working, hopefully without a human in the loop the whole time; automate as much of it as possible. In some of the cybersecurity use cases we run into, they have to retrain their models, like, every hour or two.
And I'm serious, they're retraining their models. This is not just redeploying or anything like that. They're going back and training again because the data changes that fast. The bad operators are smart, and they're finding new ways to break in, and if you don't go back and figure that out as you go along, they'll get ahead of you. That's part of our world now, I guess, unfortunately.
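For that retrain-every-hour cadence, a scheduler would typically retrain under a staging name and then swap it over the serving name, so scoring queries never hit a half-trained model. Below is a minimal, hypothetical sketch of such a loop; the statements (XGB_CLASSIFIER, DROP MODEL, ALTER MODEL ... RENAME) follow Vertica's documentation as I recall it, and the table, columns, and cadence are invented for illustration.

```python
# Hypothetical sketch: hourly in-database retraining with a rename-based swap.
# Train into a staging model, then rename it over the serving model, so
# PREDICT_XGB_CLASSIFIER queries always reference a fully trained model.
# Verify the SQL against your Vertica version before using anything like this.
import time
import vertica_python

CONN = {"host": "vertica.example.com", "port": 5433,
        "user": "dbadmin", "password": "...",
        "database": "analytics", "autocommit": True}

def retrain_once(cur):
    # Train a fresh candidate on recent data (illustrative view name).
    cur.execute("DROP MODEL IF EXISTS intrusion_xgb_staging")
    cur.execute(
        "SELECT XGB_CLASSIFIER('intrusion_xgb_staging', 'recent_traffic', "
        "'is_attack', 'pkt_rate, failed_logins, geo_risk' "
        "USING PARAMETERS max_ntree=50)"
    )
    # Swap: retire the old serving model and promote the candidate.
    cur.execute("DROP MODEL IF EXISTS intrusion_xgb")
    cur.execute("ALTER MODEL intrusion_xgb_staging RENAME TO intrusion_xgb")

if __name__ == "__main__":
    while True:
        with vertica_python.connect(**CONN) as conn:
            retrain_once(conn.cursor())
        time.sleep(60 * 60)   # the "every hour or two" cadence from the episode
```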
[01:01:55] Unknown:
Are there any other aspects of machine learning in the database, or of your work at Vertica to support it, that we didn't discuss yet which you'd like to cover before we close out the show?
[01:02:05] Unknown:
There is one thing. I know skills are really hard to come by. I am self-taught from day one. I've been doing this for, like, 25 years, and I did everything from teaching myself how to program in multiple languages to teaching myself about marketing messages, of all things. You have to be able to get the information. The fact that you can now do machine learning in a database is great, but if you can't learn how to do that, it doesn't do you any good. One of the things Vertica has done to help with that is what we call the Vertica Academy. It's academy.vertica.com, and there's free training in how to use Vertica and how to do a lot of the cool things with it. You can get certified for the essentials and for the advanced stuff. We run boot camps periodically, and it's all out there, available on demand, without you spending hard-earned cash.
But I think that's the big thing: being able to upskill.
[01:03:03] Unknown:
Alright. Well, for anybody who wants to get in touch with you and follow along with the work that you're doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today.
[01:03:19] Unknown:
I think data quality is the one. I do some product management for data quality software, and there's a lot of really good software out there with really good capabilities. But we're just starting to get data governance and cataloging and things like that working on the huge datasets we're dealing with now; I think that's behind. Getting quality data is still the biggest challenge. It's been the biggest challenge all the way along, for, like, 25 years, and it's still the biggest challenge. Our data quality and data governance tools keep getting more and more advanced, but our data types and our data volumes are both growing faster than they can keep up with. That's the biggest challenge right now: continuing to try to get better and better data.
No matter how you do your analytics, you've gotta have good data to feed it.
[01:04:21] Unknown:
Well, thank you very much for taking the time today to join me and share the work that you've been doing with Vertica and on machine learning in the database, and for helping to promote that and educate people on its capabilities. It's definitely a very interesting and important topic, so I appreciate all of your efforts on that. And I hope you enjoy the rest of your day. Well, thank you for having me. I hope you enjoy your day too. Thank you for listening. Don't forget to check out our other show, Podcast.__init__ at pythonpodcast.com, to learn about the Python language, its community, and the innovative ways it is being used.
And visit the site at dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show, then tell us about it. Email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers.
Introduction and Welcome
Guest Introduction: Paige Roberts
Machine Learning Inside the Database
Performance Implications and Workload Isolation
Types of Machine Learning Feasible in Databases
Workflow for Building and Deploying ML Models
Vertica's Architecture and Implementation
Impact on Data Infrastructure
Lessons Learned and Use Cases
When Not to Use ML in the Database
Upcoming Trends in ML for Databases
Skills and Training Resources
Biggest Gaps in Data Management Tooling
Closing Remarks