Summary
Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.
Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
- Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th, both run by our friends at O’Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off
- Your host is Tobias Macey and today I’m interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what deep learning is for anyone who isn’t familiar with it?
- What has been your personal experience with deep learning and what set you down that path?
- What is involved in building a data pipeline and production infrastructure for a deep learning product?
- How does that differ from other types of analytics projects such as data warehousing or traditional ML?
- For anyone who is in the early stages of a deep learning project, what are some of the edge cases or gotchas that they should be aware of?
- What are your opinions on the level of involvement/understanding that data engineers should have with the analytical products that are being built with the information we collect and curate?
- What are some ways that we can use deep learning as part of the data management process?
- How does that shift the infrastructure requirements for our platforms?
- Cloud providers have been releasing numerous products to provide deep learning and/or GPUs as a managed platform. What are your thoughts on that layer of the build vs buy decision?
- What is your litmus test for whether to use deep learning vs explicit ML algorithms or a basic decision tree?
- Deep learning algorithms are often a black box in terms of how decisions are made, however regulations such as GDPR are introducing requirements to explain how a given decision gets made. How does that factor into determining what approach to take for a given project?
- For anyone who wants to learn more about deep learning, what are some resources that you recommend?
Contact Info
- Website
- Pluralsight
- @henson_tm on Twitter
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
- Pluralsight
- Dell EMC
- Hadoop
- DBA (Database Administrator)
- Elasticsearch
- Spark
- MapReduce
- Deep Learning
- Machine Learning
- Neural Networks
- Feature Engineering
- SVD (Singular Value Decomposition)
- Andrew Ng
- Unstructured Data Solutions Team of Dell EMC
- Tensorflow
- PyTorch
- GPU (Graphics Processing Unit)
- Nvidia RAPIDS
- Project Hydrogen
- Submarine
- ETL (Extract, Transform, Load)
- Supervised Learning
- Unsupervised Learning
- CUDA
- CNN (Convolutional Neural Network)
- Sentiment Analysis
- DataRobot
- GDPR
- Weapons Of Math Destruction by Cathy O’Neil
- Backpropagation
- Deep Learning Bootcamps
- Thomas Henson Tensorflow Course on Pluralsight
- TFLearn
- Google ML Bootcamp
- Caffe deep learning framework
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello, and welcome to the Data Engineering Podcast, the show about modern data management.
[00:00:18] Unknown:
When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so you should check out our friends at Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode, that's L-I-N-O-D-E, today to get a $20 credit and launch a new server in under a minute. And don't forget to thank them for their continued support of the show.
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you're tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need, then it's time to talk to our friends at StrongDM. They've built an easy to use platform that lets you leverage your company's single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And to help other people find the show, please leave a review on iTunes and tell your friends and coworkers. You listen to the show to learn and stay up to date with what's happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers, you don't want to miss the Strata Conference in San Francisco on March 25th and the Artificial Intelligence Conference in New York City on April 15th, both run by our friends at O'Reilly Media.
Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon
[00:02:08] Unknown:
to register today and get 20% off. Your host is Tobias Macey. And today I'm interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects. So, Thomas, could you start by introducing yourself? Hi. So I'm Thomas Henson. I'm a Pluralsight author and involved in the data engineering community, and I also work for Dell EMC,
[00:02:29] Unknown:
in our unstructured data team. So I've been around data engineering, and really around the Hadoop ecosystem, probably for the last six years, since before Hadoop 2.0. And I've just been a part of that community and love it. And do you remember how you first got involved in the area of data management? Oh, yeah. 100%. So, you know, going through college, I thought for a long time that I was gonna be a DBA. That's kind of what I was targeting. And so when I graduated, you know, the job market being what it is, and it doesn't matter what era you're in, especially when you're getting out of college, you're having to apply for a lot of different positions. And I actually got my first role as a web developer. So totally different, right, than being a DBA. But I always kinda had that passion, and I guess I would be considered a full stack developer. So I did do some database management to some extent for our applications, but nothing too ingrained like a traditional DBA. And then, you know, lo and behold, a few years later, there was a research project that came up. I didn't know at the time that it was gonna be a big data project, but I knew it was gonna require a lot of information and just really take me outside of my comfort zone. So I volunteered to get on that project, and it turned out we were using Elasticsearch at the time, and then we rotated into using Hadoop. So, you know, downloaded the Hortonworks sandbox, and I think Cloudera had one at the time too. And that was kinda my path. And, you know, went to my first, I think, Hadoop Summit.
And, you know, from there, I just started looking and just really saw this community
[00:03:59] Unknown:
and really saw my opportunity to get into, data from, you know, what I'd looked at, you know, in my college days. So haven't looked back since. And recently, you've started getting into the area of deep learning and experimenting with that. So can you start by giving a bit of an overview of what deep learning is for anybody who isn't familiar with that terminology?
[00:04:18] Unknown:
Yeah. So from a data engineer's approach, you know, I didn't really spend much time on the algorithms, just kind of focusing on some of the machine learning pieces and, you know, that portion. And I'm not saying it was like kind of a black box for data engineers, but it's not something that I really spent a lot of time on. Like, you know, I was worried about being able to stand up our Hadoop cluster or stand up our environment, or, you know, writing, at the time, MapReduce jobs or Spark jobs, and, you know, those kinds of pieces, and kinda left the data science to the data scientists. But, you know, I slowly started looking into, okay, well, I know which algorithms we're using, let me find out a little bit more, you know, kinda underneath the covers, about what those are. And so I started kinda having that approach to looking at it. But, specifically, you know, if you're looking at it from a data engineer's perspective or even a data science perspective, the real difference and key between deep learning and machine learning is gonna be the use of neural networks. So you're using neural networks to be able to, you know, go through and analyze your data, versus with machine learning it's more of an approach where, hey, we're taking all these different feature sets. Like, one of the famous examples is to identify cats on the Internet. And I don't know why you wanna be able to identify cats from YouTube videos. Maybe it just makes for amazing YouTube videos. I don't know. But that seemed to be like the first use case. And so if you think about, you know, the machine learning approach to how you're gonna identify a cat from a video, you're gonna program in the different features. So, like, features as in, hey, what do the ears look like? Does it have hair? Even though there are hairless cats. But, you know, you're gonna assign those, right, the whisker length and some of those other pieces. And you're gonna run those through your different algorithms. So if you're using SVD or if you're using some kind of decision tree, you're gonna pick out the algorithm and you're gonna test and run that through. And that's the machine learning approach. But, you know, from a deep learning approach, what you're gonna do is you're gonna have this labeled dataset, or you can have an unlabeled dataset, but let's just keep it with labeled datasets here. And you're gonna feed those images of those cats through, and you're gonna be able to let the neural network decide, okay, these features, the hair or the whiskers, you know, what makes the biggest difference there? And you can kind of evaluate how that looks. So it's a different approach from what we've done for machine learning. But just as a data engineer, it was fascinating to me. I wanted to step back and take some time to really learn what the data scientists on our team kinda go through, to hopefully, you know, make me a better data engineer, so I can understand algorithms and kinda go through that process. And as far as your own experience with deep learning, I'm curious what set you down that particular path and what your experience has been with that so far.
So at the end of 2017, there were a couple of us. So I do a podcast with some other people in the data engineering and, you know, data analytics world, and there were a couple of us, Aaron Banks and Brett Roberts, who were looking at doing a Coursera course and just being able to kinda go through it. And we wanted to do the most famous one. So, like, you know, the most famous machine learning course, with, you know, Andrew Ng taking everybody through. I think he's taught more people on the planet about that than probably anybody else. And so we were like, okay, this is the most popular course on the planet, let's kinda go through it. And it was very hard. So, you know, we kinda looked at it as like, oh, this is an online course, this is something we can do together, and we'd record videos after going through it. But it really took me down more of a math path than I expected. I mean, I guess I should've known that. But so, you know, after kinda going through that and really understanding more about machine learning, and just doing some work in my job at Dell EMC, I'm part of a group called the unstructured data solutions team.
And so, you know, being a part of that group, there's a lot of things going on in the deep learning world that I was kinda challenged, by some of my coworkers and the other business units to understand more about that. And so, you know, I took what I learned in the machine learning area and kind of really applied that to what's going on from a deep learning. And so I started learning, you know, more about TensorFlow and PyTorch and what's going on from a GPU specific basis and just kinda going down that path. So, you know, it wasn't that I was targeting at first the deep learning. I just kinda thought it would be good for me to understand because I got you know, I I would continually get questions is, you know, somebody who's advocating out in the data engineering community, questions around data science. And so I just thought for me to be more well rounded that it would be good for me to be able to answer some of those questions or have a better understanding for it. And it just kind of evolved into, hey. We need to check out what's going on from a TensorFlow perspective. And I just kinda hadn't looked back for the last year or so. And particularly from the perspective
[00:08:46] Unknown:
of a data engineer who's working on building out the infrastructure and the data pipelines that are necessary for feeding into these different machine learning algorithms or deep learning projects. What is involved in building out that set of infrastructure and requirements to support a project that is going to be using deep learning, particularly as it compares to something that would be using a more traditional machine learning approach that requires more of the feature engineering up upfront as opposed to just feeding in the labeled datasets for those deep learning algorithms?
[00:09:19] Unknown:
So that's a good question. I as you start looking at it and, you know, kind of the way that I approach it, 1, with my learning and just kind of the way that I like I like to describe it is just you think about, you know, my experience from the Hadoop ecosystem. Right? And, like, how how does it differ, you know, from what we're doing in deep learning to what what's going on from a Hadoop ecosystem perspective. And you think about, you know, in the Hadoop, you know, in the Hadoop world, you know, your data is in HDFS or what we're trying to analyze. It's still you know, it's it's somewhat structured, you know, or semi structured, we call it, or, you know, we'd call it unstructured data, but it was really, you know, it was really, like, still a lot of text data and other portions like that versus what we're doing from a deep learning approach is, you know, we're talking about, you know, mostly, you know, image data or voice recognition, but just rich media. Right? Like, even video data. And that's really kind of 1 of 1 of the key portions. So with that, you know, when we talk about big data and Hadoop, we were talking about large datasets. But now, you know, on a deep learning side, we're talking about massive datasets. Right? Because I mean, how how much how much video data does it take to, you know, create the next driverless car. Right? We're still we're we're still going through that and figuring that out. But, I mean, you can just imagine, you know, if you're doing any kind of simulations or anything like that. I mean, we're talking about lots of lots of sensors and lots of lots of data points. And so there's some challenges there. And then 1 of the big keys too that's really kind of push forward, deep learning and why you're seeing other projects from the traditional ecosystem. Like so there's projects like project hydrogen, submarine, and even what NVIDIA is doing with RAPIDS. They're trying to get more into the GPU. And so the GPU is giving you the ability to analyze data faster or even do ETL faster.
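To give a concrete flavor of that GPU-accelerated ETL idea, NVIDIA's RAPIDS cuDF library exposes a pandas-style DataFrame API whose transforms run on the GPU. The sketch below is illustrative only: the file and column names are made up, and it assumes a machine with an NVIDIA GPU, CUDA, and the cudf package installed.

```python
import cudf  # RAPIDS GPU DataFrame library; requires an NVIDIA GPU and CUDA

# Hypothetical clickstream extract; cuDF mirrors much of the pandas API,
# but the parsing and transforms below execute on the GPU.
events = cudf.read_csv("events.csv")              # hypothetical file name
events["ts"] = cudf.to_datetime(events["ts"])     # hypothetical column names
cleaned = events.dropna(subset=["user_id"])

# A typical ETL-style aggregation, again executed on the GPU.
per_user = cleaned.groupby("user_id").agg({"duration_ms": "mean", "ts": "max"})
per_user.to_parquet("per_user_summary.parquet")
```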
And, you know, that's really kind of accelerating it. So, it it does bring up challenges whenever we're talking about building out that data pipeline and how you wanna how you wanna kinda progress to it. And, you know, there's there's not really any answers just yet to how it's all gonna kinda go because it's still somewhat fluid. Right? Because, like, we know, you know, if we look at what we're doing, you know, let's just take TensorFlow for example. Right? So, like, what you're doing when you're setting up a TensorFlow environment, you know, it might be something as simple as you're just setting up, you know, different shares, you know, so you have some, you know, you have some NFS mount, right, where you can just analyze, you know, all this data. And, you know, you're still orchestrating it and you're still going through that portion. But to to build out those data pipelines, you know, you you might just have 1 1 dataset, right, or, you know, 1 large set of that data. And so I think really what the key and what we're in maybe in 2019 and beyond is we're we're starting to look and say, hey. How can we bridge that data with what we, you know, what we have in our Hadoop ecosystem, right, or what we have in other datasets. And not that I'm saying Hadoop's gonna be the key to that or, you know, even, you know, what we call in the Hadoop ecosystem, but it's it's still trying to kind of interesting to see how that plays out. Right? Like, you know, we're we're at this point now. We're taking advantage of what's going on from a GPU perspective. And, you know, now we wanna now we wanna do like we do, you know, with other projects throughout the years, right, you know, that we've seen in the past is, can we marry this with other data that we have or other decisions that we've already made? So it's it's real interesting. And, you know, there's, like I said, a lot of different a lot of different approaches to it, and we're still kind of going down that path. And I think your point too about the fact that deep learning is particularly applicable to these projects that are focused
[00:12:24] Unknown:
on rich media, as you put it, video or images or audio. It starts to look more like a content delivery pipeline than necessarily the traditional data pipeline that we're used to where we might be working more with discrete records or, you know, flat files on disk or things like that that have a lot of structured aspects to it where there might be similarities between records that are conducive to different levels of compression or aggregation. Whereas with video in particular and even audio, there is a lot less of that similarity from second to second within the content, but also between files because there are so many different orientations that are possible for an image frame or anything like that. So just conceptually, it requires
[00:13:09] Unknown:
a much different tack as to how you're managing the information and how you're providing it to the algorithms that are actually processing it. Oh, 100%. I mean, we're talking massive amounts of storage, right, to be able to, you know, if you like I said, thinking about video data coming in and most of, you know, most of that in its format, it might be compressed to some extent, but it's there's not gonna be any dedupe or some kind of compression that we can do, you know, for the most part. Right? Like, you know, 1 video file of a car driving down the road versus, you know, a different view of that same 1. It's it's, you know, it's not gonna dedupe well. It's not it's not gonna have that. So there are some challenges there. But 1 of the things is, you know, I I did say, hey. You know, when we look at it from a rich media type, you know, traditionally, what we did when we're talking about Spark and Hadoop and, you know, anything in that Hadoop ecosystem is, you know, still kind of text based data. Well, it's still the same thing here. So I I I just want people in the audience to understand we're still breaking the data down. We're just breaking it down into, you know let's just say that we're doing grayscale. Right? We're still we're still breaking it down into matrices of, you know, zeros and ones, but it's a lot of zeros and ones, right, for for for 1 video or 1 image or, you know, anything from an audio perspective.
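To make that point concrete, here is a minimal sketch of how a single image becomes the kind of matrix he is describing once it is loaded for a framework like TensorFlow or PyTorch. It assumes Pillow and NumPy are installed and uses a made-up file name.

```python
import numpy as np
from PIL import Image

# Load an image and convert it to grayscale, so there is one number per pixel.
img = Image.open("cat.jpg").convert("L")      # hypothetical file name
pixels = np.asarray(img, dtype=np.float32)    # shape: (height, width)

# Scale the 0-255 intensities into the 0-1 range most models expect.
pixels = pixels / 255.0

print(pixels.shape)     # e.g. (1080, 1920): a matrix with one value per pixel
print(pixels[:3, :3])   # the top-left corner is just a small block of floats
```

An RGB image adds a third dimension for the color channels, and video stacks many such frames, which is why the storage and compute footprints grow so quickly.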
[00:14:16] Unknown:
And particularly for formats like video or audio where the information in relation to the other attributes of, you know, the frame to frame is important and contextual, it makes it much more difficult to identify what are the logical points where we can split it versus not necessarily possible to do that with video or audio without compromising
[00:14:45] Unknown:
the value that you're getting out of it. Yeah. I mean, so that's for sure. Like, you're looking at it from that perspective of, you know, how you compress it or break it up, even though we're talking about massive amounts of data or large datasets, being able to break those into chunks. But I mean, even think about it from a compute perspective when we're just talking about RAM. Right? Like, a lot of times, whenever we're talking about being able to run a job, maybe it's a Spark job or a traditional MapReduce job in your cluster, you might have a ton of RAM. Right? But, you know, think about it, I mean, at this scale we're talking terabytes to petabytes. You know, I was just reading an article where they were talking about the predictions that we're at 33 zettabytes of data worldwide today. And by 2025, so less than six years away, we're gonna be at 170, I think 175, zettabytes. And so, like, I mean, it's just massive. You know, it's just crazy to think about how big the data is and how much data we're creating from this. And, I mean, it's also fun because we're changing the way that we interact with society out there. And we can get into, you know, where we think AI is and where we think the boundaries are and how much of it is maybe hype or not. But I'll say that, like, my favorite thing to kind of talk about whenever we're talking about just AI as a concept is, really, it's just an extension of automation at this point. But it's automation that we couldn't do before. And in terms of the actual responsibilities
[00:16:08] Unknown:
of the data engineer for the data as it's being delivered to these algorithms, particularly as it compares to machine learning, where you might need to do upfront feature extraction and feature identification to be able to get the most value out of the algorithm. My understanding is that with deep learning, you're more likely to just provide coarse grained labeling of the information and then rely on the deep learning neural networks to extract the useful features. So I'm wondering if you can talk a bit about how the responsibilities of the data engineer shift as you're going from machine learning into deep learning, particularly from the standpoint of feature extraction and labeling?
[00:16:50] Unknown:
Yeah. So ETL is not going away. So, you know, there's still gonna be ETL involved, and there's still gonna be, you know, whether we call it data wrangling or data munging. Right? A lot of what I'm seeing and a lot of what we're talking about, and, you know, I've talked to the chief data science officer at SAS, and one of the things that he was saying is, you know, we're still mostly doing supervised learning. So we're on the path of supervised learning, where we have to have these trained, labeled datasets. Right? And so, you know, data is still king, and labeled data is still king as well, just because of that fact. You know, we do think, in the next five years or so, we might start seeing more advances from an unsupervised learning perspective. But there's still a lot of time, and I think there was a stat that was out there, and I wish I could credit who it was from, but I think 79% of a data scientist's or a data engineer's job is still things outside of data engineering and data science. And part of that goes back to, you know, there's a big portion of that that's part of the data wrangling and part of the ETL that's involved. But then also, one of the things, and, you know, this is something as data engineers, and as we shift into, you know, they created a new role called the machine learning engineer, but it's around the same type of concepts, one of the things that we'll never get out of, and probably the reason we like being a data engineer versus a data scientist, is there's still a lot of importing, making sure we have the right software packages, making sure that, you know, if we're using an NVIDIA card, that this version of CUDA is gonna work with the version of TensorFlow that we're trying to line up. So there's still a lot of that too, outside of just making sure that we have the right data and making sure that we have the right datasets. And, hey, you know, if we're using some kind of storage, like, make sure we've allocated enough. Right? Like, if we're taking data that's off of a simulation, hey, do we have a big enough footprint to hold that 100 terabytes that's gonna be written, and that needs to be read as soon as it's written too. So there's still a lot of fun things for us as data engineers to stay technical on that side, but there's new challenges with this too. And for anybody who's in the early stages of a deep learning project, I'm curious what some of the edge cases or gotchas that they should be aware of are, and particularly ones that you've experienced yourself as you're working on these types of projects. Yeah. So, you know, some of the edge cases or some of the things to start kind of looking at is, you know, it's a little bit of a different approach. Like I said, if you're coming from the Hadoop ecosystem and looking at that, at this point it's a little bit more simple. Right? Like I was saying, it's easy to get started, you know, from the perspective of, hey, you can set up an NFS mount and just be able to point your jobs at it. You know? And, you know, from a TensorFlow perspective or PyTorch, making sure, you know, GPUs are gonna be the big piece. Right?
So, you know, it's recommended you install it and you use the different packages with the different GPU cards that you have. You can do it with CPU, like, if you're just trying to do a POC and you're just trying to do some testing to validate, hey, I know the steps to kinda go through it. There are some of those different libraries that you can use. Like I said, you know, GPU for the most part there. And then, you know, there's a lot of installing and going back and forth. So making sure that you're checking your card with the latest version of CUDA, using TensorFlow or using PyTorch. And so I would look to that. And another thing to do is, you know, start thinking about how this is gonna grow. So just like we kinda joked about in the Hadoop ecosystem, once people start understanding that you can do big data, you know, some of the reasons that these projects get funded are because, like we were just talking about, AI is a hot topic right now. And four or five years ago, Hadoop was the same thing. So those projects get greenlit.
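Circling back to that point about lining up GPU drivers, CUDA, and the framework: a short sanity check like the sketch below, written against TensorFlow 2.x, can confirm whether the installed build actually sees a GPU and report the CUDA and cuDNN versions it was compiled against. The exact keys returned by the build-info call vary between releases, so treat it as illustrative.

```python
import tensorflow as tf

# Which accelerators does this TensorFlow build actually see?
gpus = tf.config.list_physical_devices("GPU")
print(f"TensorFlow {tf.version.VERSION} sees {len(gpus)} GPU(s): {gpus}")

# Which CUDA / cuDNN versions was this wheel compiled against?
# (Key names can differ slightly between TensorFlow releases.)
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))

if not gpus:
    print("No GPU visible; fine for a CPU-only proof of concept, but check "
          "that the driver and CUDA toolkit match this TensorFlow build.")
```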
And you get funding to stand up those projects, but there's a lot of attention on you too. So there's gonna be a lot of asks. Right? Like, oh, hey, you know, the analytics group down the hall, they're involved in an AI project. Oh, wow, I'd love to get them to put some AI online. So, you know, you wanna understand that and understand how that's gonna grow. And kind of another thing, and you're seeing this go on just from a data engineer's perspective too, but this is, you know, on the forefront just because of where we are from a deep learning perspective.
Containerization is huge. So, you know, if you're a data engineer, you know, I know when we first started out in the Hadoop ecosystem, it was like, hey, man, you know, it has to be on bare metal, we can't even virtualize. And, you know, now we're going cloud native, with Cloudera releasing, you know, different versions. So, you know, from a deep learning perspective, orchestration and management and just understanding containerization, like, if that's not something that you have, and that's something that I've been trying to catch up on in the last year or so, I'd definitely make sure that you're up to speed on that, because it's gonna play an important part. And so, like I said, there's tools out there that'll help you manage that orchestration layer. But on the back end, I mean, it's essentially containerizing. Right? To be able to do your scheduling and doing everything kinda like what we've seen in the YARN Hadoop ecosystem
[00:21:45] Unknown:
piece. And in terms of the level of familiarity and understanding that's necessary for being able to build out the underlying infrastructure and work effectively with the data scientists on these deep learning projects. How much knowledge of deep learning and machine learning and some of the mathematics and fundamental principles behind it should we, as data engineers, be aware of in order to be able to continue to progress in our careers and work effectively
[00:22:14] Unknown:
as these types of projects become more prevalent? So that's a it's a super good question. I get that question a lot. It's like, hey. Even just from the basics of, you know, should I understand the algorithm? Should I know the algorithms? If we're talking about machine learning, if we're talking about deep learning, like, how much should I be able to recommend and look at, you know, TensorFlow? And it's it's such the software engineering answer, right, to say it depends. But really, it does it's it's gonna depend. Right? Like, you know, if you're in a small organization and, you know, you guys are just going down the don't going down the path, you probably, you know, maybe maybe you have a data scientist. Maybe, you know, maybe it's more of a, you know, data analyst in your organization. Then you're gonna wanna be able to handle and be able to, you know, carry some of that. Now I'm not gonna say that you're gonna wanna recommend, oh, you know, we should use, you know, CNN here or if we're doing machine learning, like, hey. We should use, you know, PCI or, you know, decision trees or or from that perspective. But you definitely wanna have a little more understanding around it so that when it, you know, when it comes to your part and your your role in the organization, you can understand some of the tweaking and some of the kind of thought process around it and, you know, add, you know, add something to the table. Now in a large organization, right, that's maybe more mature in their analytics journey or the deep learning journey, then you're gonna be you're gonna be able to focus. You know, you're not gonna have to focus as much on understanding the underlying, hey. You know, how does the how does this math, you know, work? And, you know, what's the you know, what are the weights and biases? And, you know, why are there so many different layers there? So you you wouldn't have to focus as much in a larger organization.
But I will tell you 1 thing that I've found. And I said, I came from the data engineering side and, you know, understood a little bit about the algorithms, but didn't really didn't really focus on them. 1 of the things that I like more about deep learning than machine learning is it's it's really a little bit different math. Like, it's a little more basic too. I think I think I've heard people say that, you know, whenever you're talking about it, you know, you can get away with, like, you know, the 1st semester of calculus from a deep learning perspective versus with machine learning, it's it's it's a lot more complex, right, with the algorithms and kind of what you're doing. And a lot of it goes back to what we were talking about how about how the data's broken up from a deep learning perspective is it's really just, you know, it's really just matrix math. Right? It's, you know, it's matrix algebra to be able to, you know, stack all these ones and zeros or, you know, for using RGB, you know, ones and zeros and threes and, you know, all these different pieces sit together. So we're using a lot of easy basic math. It's just really big math. So, it's a long answer to say that it depends, but it it it really is gonna it's gonna depend on your organization and kind of where where your role is. So I would encourage you from a career perspective to be a little bit little bit familiar with it. Just, you know, have a have a natural curiosity to it, but don't go I I wouldn't say that you have to go deep. Right? You're not you're not gonna have to go back and get a degree or, you know, you're not gonna have to know the intricacies of, you know, everything about it. But especially with the algorithms that you're using or the different neural networks that you're that you're implementing in your organization, I'd be pretty much I'd be pretty familiar with there. But I wouldn't I wouldn't stand up and put myself as I'm the 1 that's gonna recommend which, you know, which approach we take. And you mentioned earlier too about the possibilities
[00:25:12] Unknown:
of leveraging some of these deep learning capabilities
[00:25:15] Unknown:
in the data preparation and ETL processes. So can you talk a bit more about the different ways that we can leverage the capabilities that are promised by deep learning as part of our own work in the data management process? Yeah. That's a good point. You know, earlier I kinda keyed in on the point that, you know, we use supervised learning a lot. And just kinda to recap, if you think about supervised learning, that's where we have these, you know, going back to our cat photos, right, we have a lot of images. Hey, this is an image that contains a cat, this one doesn't contain a cat. Right? And so we know the end outcome we're looking for when we're doing that. And then I talked about, you know, how unsupervised learning is kind of on the forefront and something that we're seeing. But you can use unsupervised learning to help out with some of your ETL and some of your data wrangling. Right? So unsupervised learning is where we have a, hey, I have a million images, and I just want you to be able to classify them, right, and you're just gonna feed them in. So this can help you to group. So I talked about, you know, I don't think we're gonna get out of ETL, and I don't think we're gonna get out of data wrangling for a while. But you can use, you know, unsupervised learning to be able to pull and generate and put some kind of order to all the unstructured data we have too. And so think about it, you know, one of the famous examples that we've done before is, like, you know, sentiment analysis. Right? So think about when you were doing sentiment analysis, if you've ever walked through a tutorial on Twitter. But now you can think about that from the same perspective of, hey, we can train a neural network to kind of just look at a whole bunch of images and put all those into some kind of structure. And so if you think about it, if your job was to find these training datasets, right, you could use unsupervised learning to be able to categorize those and put them in clusters, so that, hey, instead of looking at a million pictures, maybe I'm only looking at a hundred thousand. Right?
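As one hedged sketch of that idea, plain unsupervised clustering can pre-group a pile of unlabeled images before anyone labels them. It assumes scikit-learn is available and that each image has already been turned into a feature vector (flattened pixels work, though embeddings from a pretrained network usually cluster better); the data here is random stand-in data, purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in feature vectors for 10,000 unlabeled images (e.g. flattened pixels
# or embeddings from a pretrained network); random data purely for illustration.
features = np.random.rand(10_000, 512).astype(np.float32)

# Group the images into 20 clusters without using any labels at all.
kmeans = KMeans(n_clusters=20, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(features)

# A labeler can now review a handful of examples per cluster instead of
# eyeballing every single image in the collection.
for cluster in range(20):
    members = np.where(cluster_ids == cluster)[0]
    print(f"cluster {cluster}: {len(members)} images, sample ids {members[:5]}")
```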
[00:26:58] Unknown:
And as far as how that plays into the infrastructure requirements
[00:27:04] Unknown:
and the processing requirements for actually being able to execute these ETL jobs as we're incorporating deep learning, what kind of an impact does that have? Yeah. So from a storage perspective, there's a big footprint, especially when we're talking about, you know, let's talk a little bit about the different environments. So just when you think of your training environment, this is where we're building out those algorithms, we're training, and we're hoping that, hey, what we're training our neural network to do gives us the outcome we want. So back to our cat identifier, you know, we're sending millions of images through to be able to train those models. And so, to be able to do that, right, you have to have the storage to be able to do that, and then you also have to have the throughput. Right? We're talking, you know, from a perspective of, these are some of the most powerful chips on the planet, right, when you think about your GPUs. And so, specifically if you're doing things on prem, there's some requirements there, just from the, how do you get enough power on a floor tile to power these GPUs. Right? Like, you know, you're limited to how much power they're gonna pull. Right? And then there's heating and some of the other requirements. So there's a lot of processing that goes on in that part of the workload. And then we flip to, okay, we've got our model. Now let's try it out, you know, in the real world and see if it's working. That's more in the inference. Right? So that's where we talk about an inference environment. So the easiest way I like to think about it, if you're coming from an application development background, is you look at your training environment, think of that as kinda test/dev, and think of your inference environment as kind of more your production environment, and that's where it really gets put to work. Hey, we're gonna feed you a whole bunch of images in there, and we're not doing any more training. And we're seeing, hey, can it identify a cat or not a cat? Or, you know, can it drive down this practice road? Right? Or do we need to go back and train it some more? But with that being said, you know, in an application development environment, when you think about it, like, oh, your test/dev, normally that's not your biggest footprint. Right? But in deep learning, this is where the majority of your data lives. Right? Because like I said, you're trying to get the best data and the most data into training these algorithms, so that whenever you go into production, or you go out to test it in inference, you've made the best decisions.
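To loosely illustrate that training-versus-inference split, the sketch below trains a tiny Keras model in the "training environment", saves the artifact, and then reloads it for prediction only, the way an inference environment would. The model, data, and file name are stand-ins, and the saved-model format depends on your TensorFlow version.

```python
import numpy as np
import tensorflow as tf

# --- training environment: where the bulk of the data (and the GPUs) live ---
x_train = np.random.rand(1000, 32).astype(np.float32)   # stand-in features
y_train = np.random.randint(0, 2, size=(1000,))          # stand-in labels

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu", input_shape=(32,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(x_train, y_train, epochs=3, verbose=0)

# Export the trained model so the inference side never has to retrain.
model.save("cat_or_not.keras")   # extension/format varies by TF version

# --- inference environment: load the frozen model and only predict ---
serving_model = tf.keras.models.load_model("cat_or_not.keras")
new_samples = np.random.rand(5, 32).astype(np.float32)
print(serving_model.predict(new_samples, verbose=0))
```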
[00:29:10] Unknown:
And particularly in terms of the infrastructure layer, there have been a lot of new offerings coming out from the various cloud providers that are aiming to provide access to pre trained neural networks or managed services around being able to execute these different deep learning algorithms and be able to pipeline data into them and out of them. So I'm wondering what your thoughts are in terms of the build versus buy decision around deep learning with the availability of these managed services, and then also particularly at the layer of ETL of does it make sense to build out your own additional capacity for being able to run these algorithms or just start consuming these managed services, at least as an initial step of simplifying and enhancing your capability to provide meaningful data processing and ETL
[00:30:06] Unknown:
for feeding into the end product that you're actually aiming for? Yeah. No. I mean, it's a good way to kind of look at it. So with the cloud providers, you know, you have this, I mean, think of it as like a catalog sometimes of, hey, there's these different approaches and different algorithms that I can use and turn toward my data. Right? And really, think of it as a service. Right? Like, it's gonna almost be your data scientist in the cloud, or data scientist behind your browser, to say, hey, here's some data, why don't you test this out and see if you can use this for your algorithm? With that, you know, it's a good way to start looking at things and seeing, hey, are things viable for what we wanna do? But at the same time, we talk about data gravity. Like, where does the majority of your data live? Right? Like, if the majority of your data exists in the cloud and you're wanting to take part in these managed services, you just have to understand, from a business perspective, like, hey, you're offloading some of your data scientists or your data analysts. Right? Like, you're not having to take as much of a research approach of, hey, we're trying to build out and trying to see which algorithm's gonna be best for our datasets. So it does give you kind of a guideline to be able to test out and be able to look at that. But then at the same time, if you have a lot of your data on prem and you built your own systems there and you have your own research and your own talent in house, then that's probably gonna be the best approach for you. Right? Like, you're gonna wanna build your own systems, and you're gonna wanna build your own algorithms, because your data is unique. And so it's not an approach that you wanna take to say, hey, we're gonna transfer and move our data up, or send it up into some of these managed services. So, you know, it kinda goes back to, through the years, we go through different debates in different areas. It's like, hey, are we gonna offload? Are we gonna outsource to consultants or anything like that? So it's really about how you wanna approach it from that perspective. So for small teams that maybe don't have a data scientist, there are tools that are both on prem and off prem. It's the same kind of decision. Right? Like, how are we gonna approach how we're going to build out our models? And, especially if you're just starting out, there's products like DataRobot and other pieces that give you the ability to look and see what you're doing, and give your data a test and a sample to say, hey, these algorithms might work for you. Right? Like, this might give you the answer that you're looking for. But from a deep learning approach right now, I mean, I think we're starting to see those products and those tools come out as well. We've had them, like I said, from a machine learning perspective for some time, but you're starting to see those being integrated into a lot of tools.
So I think we'll continue to see this debate continue, and we'll continue to see, you know, product offerings in other maybe non even analytic tools that are starting to take advantage of AI and deep learning. And what is your personal litmus test for determining
[00:32:51] Unknown:
when it is useful and practical to use deep learning as opposed to traditional machine learning algorithms, or even just a basic decision tree for providing a given prediction or decision on whatever the, input data might happen to be? Yeah. I mean,
[00:33:11] Unknown:
so whenever we look at that, traditionally, what I've seen right now, like, kinda going back to what we're talking about with, you know, it's really good when we're from a deep learning perspective, when we're talking about image data and we're talking about, you know, video data or, audio files or, you know, those kinda rich, media types. I still see, you know, for the most part, whenever we're talking about, you know, like, if we're looking for, you know, just classic example with, like, hey, can you predict housing, you know, housing rates and, you know, mortgage rates, you know, or can you predict housing prices in a certain, you know, in a certain geo? A lot of those approaches, since you already kinda have the feature sets, they're really good to work in machine learning. It's not that you can't use them. There's plenty of examples out there to use deep learning for it. But, traditionally, you know, I still see that as the use case. All that being said, though, you know, like I said, I still come from the data engineering side. So, you know, if you're listening to the podcast and you're a data engineer and there's a data scientist in your organization too, you know, maybe you should maybe you should rely on them, you know, but be curious about it too. You know, maybe ask them, hey. You know, why you know, if it's if it's different than than what you think, you know, kinda take that approach. But, you know, that's kind of what I've seen, and it's that that's kind of my rule of thumb, but I'm I'm not gonna argue with a data scientist if they want to. They wanna kinda test some of that out. And like I said, there's plenty of examples out there, but I still see, you know, if we're if we're talking about, like, some of the tradition, you know, traditional, you know, semi structure, unstructured data, or data where we just we have all the all all the points for us that's not, you know, video files or audio files or, videos, then, you know, we're we're we can still stick with the, machine learning approach. And
[00:34:38] Unknown:
with deep learning algorithms, they're often a black box in terms of identifying what features contributed to the given decision that it outputs and with regulations such as GDPR in particular, but others that are either active in different locations or in process of being formulated. They are introducing requirements to be able to identify what were the different factors that played into whatever the decision might be, especially when it impacts a an individual. So how does that factor into
[00:35:15] Unknown:
your determination of what approach to take for a given project as far as whether to use deep learning or machine learning or just a standard, you know, Boolean logic based approach? Yeah. I mean, that's definitely an interesting question, you know, and we think about it from any kind of regulation. We're talking about GDPR in this specific example. But, you know, think about what's going on with, you know, autonomous driving cars. Right? Like, hey, at some point, you've gotta figure out, like, where do we make a decision for the car to drive into a brick wall or to, you know, hit somebody on a bicycle. Right? How can we go back and kinda prove that? And then it's really the regulation, maybe not even the technology, that's gonna, you know, hold that back. So same thing here that we're talking about. Can you go back and prove, you know, which weight and which bias we had? From a deep learning perspective, there are ways, whenever you're looking at the neural networks, to be able to do, you know, backpropagation and kind of look back and see, okay, where did we weight one feature, or how does all that kind of work? I don't know from a legal perspective how that would kinda play out with GDPR. Right? Like, what's the level of proof that you would need? But, yeah, those are definitely different challenges that we're looking at, not even from a data engineer's perspective or from a data scientist's, but, like, these are our projects. Right? We are all involved in this conversation as well. So those are all, you know, interesting points, but I don't think it's something that we're gonna be able to solve here. But, you know, with GDPR, there's so many different layers to that. We think that, you know, from a regulation perspective, being able to go back and say, hey, okay, we have different data elements that aren't gonna leave, you know, this specific border, these specific GPS coordinates. Right? Like, you know, data's gonna stay in the country in which it originated.
Well, how do you take data that originated in a country, but you've trained models on it and you've deployed the model elsewhere? Right? Like, you're not moving the data, but how does all that kinda tie in? So those are huge points, you know, huge things that we'll see play out for years and years.
[00:37:21] Unknown:
It's a whole big ball of yarn to try and untangle, and then there are the aspects of bias in terms of how the different features are weighted and what the training data has in terms of inherent bias because of how it's collected or how it's represented. And that's something that is, an ongoing conversation and 1 that I don't know that we'll ever find a complete solution to, but something that is definitely useful to keep in mind as we build these different products, because it's important to be thinking about it even if we don't have a perfect answer, because don't let the perfect be the enemy of the good, especially as it pertains to people's privacy or rights or inherent biases and how they're represented in the software and projects that we build. You know, I think that could be like a multipart,
[00:38:04] Unknown:
maybe even ongoing podcast for for your episode to have. Right? Like, you know, just just peeling back the layers of of GDPR and what's our, you know, what's our responsibility
[00:38:14] Unknown:
and, you know, just just things to think about. Right? And for anybody who wants to learn more about deep learning from the perspective of a data engineer or who might be interested in deploying it for their own projects or building projects based on it. What are some of the resources that you have found useful and that you recommend other people take a look at? Yeah. So like I said, started out with the machine learning course from Coursera.
[00:38:39] Unknown:
I'm actually going through right now the deep learning boot camp, I think it's the deeplearning.ai one. So it's Andrew Ng's course, around Python development. Really cool hands-on with TensorFlow. So, you know, that's been very interesting. And then from my own perspective, like I said, I'm a Pluralsight author. And, you know, one of the things that I went through and did last year, and released this year, was a data engineer's course on, you know, kind of TensorFlow, and it used something called TFLearn, which is an abstraction layer for TensorFlow. And so it just gives you the ability to, think of it from a data engineering perspective, like, think of how we went with Pig Latin. Right? Like, Pig Latin could take you from 140 lines of Java down to, you know, 8 or 10 lines of code. Same thing from a TFLearn perspective. So I went through, you know, I've created a course specific to data engineers and how to kinda get started with, hey, you know, build your first neural network. So a lot of resources around that. Like I said, there's a lot out there on Coursera. There's some free courses as well on Google. Google has a machine learning boot camp, it kind of goes through, I think they say it's, like, 30 days or something like that, but I think it's something I was able to knock out in, like, two weeks. You know, it offers labs and everything. It gives you the explanations, and, you know, there's little quizzes in it too. So there's a lot of resources out there. It's very popular. Huge documentation out there for TensorFlow. So just get out there and start looking at it, and, you know, if you don't understand it, it's okay. Right? Like, that's the thing. Just get in, start learning it. After you keep going through repetition, you'll start to understand it. And are there any other aspects of deep learning from the perspective of a data engineer that we didn't discuss yet that you think we should cover before we close out the show? No. Like I said, I think the three biggest things I would look to, from a data engineer's perspective, is just kinda watching what's going on with projects like Submarine, Spark's Project Hydrogen, and then looking into what NVIDIA's doing; they've got some documentation and some blog posts out there on what they're doing with NVIDIA RAPIDS.
I think that's gonna have a huge impact on our day to day jobs as data engineers just from the aspect of, hey, speeding up some of these ETL pipelines and then also being able to access what's going on, whether it be TensorFlow or PyTorch or Caffe.
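For anyone curious what the TFLearn abstraction he mentions looks like, here is a minimal "first neural network" sketch in that style. It assumes the tflearn package (which targets the older TensorFlow 1.x line) is installed, and X and Y are random stand-ins for whatever labeled arrays you already have.

```python
import numpy as np
import tflearn

# Stand-in labeled data: 500 flattened 28x28 grayscale images (784 values each)
# with one-hot labels across 10 classes. Swap in your real arrays here.
X = np.random.rand(500, 784).astype(np.float32)
Y = np.eye(10)[np.random.randint(0, 10, size=500)]

# A few lines of TFLearn in place of hand-written TensorFlow graph code.
net = tflearn.input_data(shape=[None, 784])
net = tflearn.fully_connected(net, 128, activation="relu")
net = tflearn.fully_connected(net, 10, activation="softmax")
net = tflearn.regression(net, optimizer="adam", loss="categorical_crossentropy")

model = tflearn.DNN(net)
model.fit(X, Y, n_epoch=5, validation_set=0.1, show_metric=True)
```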
[00:40:49] Unknown:
Alright. And for anybody who wants to follow along with you or get in touch or see the work that you've been doing, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today. Yeah. So the biggest tooling and the biggest gap probably for data management, it's it's gonna be in the ETL arena. I mean, we I mean, it's it's something I
[00:41:15] Unknown:
it's how I started out. Right? Like like I said, I volunteered for a job, and I volunteered for the job. I didn't have experience in data engineering, and I didn't have it in the Hadoop ecosystem. So my first job was doing ETL. Right? And I don't and we keep going through the years saying that, hey. This is something that we're gonna fix. Right? Like, hey. We have, you know, this tool or that tool. But and I'm not saying it's not getting better. But I mean, I think it's 1 of those things, you know, until we can train the machines to do it for us, I think it's always gonna be something we do. Alright. Well, I appreciate you taking the time to join me and share your experiences
[00:41:49] Unknown:
working with deep learning and how it plays into your work as a data engineer. It's definitely useful and interesting to get that background and keep an eye on different areas of concern for people working in the industry. So I appreciate that, and I hope you enjoy the rest of your day. Oh, you too. Thanks.
Introduction to Deep Learning with Thomas Henson
Thomas Henson's Journey into Data Management
Understanding Deep Learning for Data Engineers
Building Infrastructure for Deep Learning Projects
Challenges and Edge Cases in Deep Learning
Infrastructure and Processing Requirements for Deep Learning
When to Use Deep Learning vs. Traditional Machine Learning
Addressing Bias and Regulatory Concerns in Deep Learning
Resources and Recommendations for Learning Deep Learning