Summary
Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With its large array of capabilities and the complexity of the underlying system, it can be difficult to understand how to get started. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book to help data engineers hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment, and how it fits into the overall data ecosystem.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Spark is?
- What are some of the main use cases for Spark?
- What are some of the problems that Spark is uniquely suited to address?
- Who uses Spark?
- What are the tools offered to Spark users?
- How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
- For someone building on top of Spark what are the main software design paradigms?
- How does the design of an application change as you go from a local development environment to a production cluster?
- Once your application is written, what is involved in deploying it to a production environment?
- What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
- What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
- What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
- What are the limitations of the Spark programming model?
- What are the cases where Spark is the wrong choice?
- What was your motivation for writing a book about Spark?
- Who is the target audience?
- What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
- What advice do you have for anyone who is considering or currently using Spark?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Book Discount
- Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com
Links
- Apache Spark
- Spark In Action
- Book code examples in GitHub
- Informix
- International Informix Users Group
- MySQL
- Microsoft SQL Server
- ETL (Extract, Transform, Load)
- Spark SQL and Spark In Action's chapter 11
- Spark ML and Spark In Action's chapter 18
- Spark Streaming (structured) and Spark In Action's chapter 10
- Spark GraphX
- Hadoop
- Jupyter
- Zeppelin
- Databricks
- IBM Watson Studio
- Kafka
- Flink
- AWS Kinesis
- Yarn
- HDFS
- Hive
- Scala
- PySpark
- DAG
- Spark Catalyst
- Spark Tungsten
- Spark UDF
- AWS EMR
- Mesos
- DC/OS
- Kubernetes
- Dataframes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today, I'm interviewing Jean Georges Perrin, author of the upcoming Manning book, Spark in Action, second edition, about the ways that Spark is used and how it fits into the data landscape. So, Jean, could you start by introducing yourself?
[00:01:13] Unknown:
Sure. Good evening, Tobias. I've been working in data engineering and around software for about 25 years. I would say I've been kind of constrained in what we now call the small data space, relational databases and just small data, for a long time. And I wanted to discover, and be useful using, more of these bigger datasets. I never found satisfaction with Hadoop, and when Spark started to be out and more popular, I said, well, maybe that's the time to actually skip that step and go a little bit further. And that's where I am now. I've been lucky enough to be able to implement Spark in a few projects, and it gave me the motivation to start this book. I went to Manning, and they said, wow, we love it, let's do a book together. And that's what we're doing. And you said that you spent a good part of your career
[00:02:20] Unknown:
working in the small data space with relational databases. So do you remember how you first got involved in that area of data management?
[00:02:28] Unknown:
Yeah. When I was living in France, back in the good old days and almost fresh out of college, I started working for a company which was building compilers for a 4GL language, and that was specifically working with Informix. Then they extended from Informix to Oracle, MySQL, SQL Server, and others. Back at that time, just a little bit fresh out of college and probably a little bit stupid, what I was saying is: I don't like databases, that's why I'm using Informix. And it really stuck with me for a long time, and I still love relational databases. But as data grows and datasets are growing, relational databases are not enough anymore. But that's how I started, and I'm still pretty fond of those good old days. Not nostalgic, just, you know, as it was. And so can you give a bit of a background and explain what the Spark project is and some of the main use cases that people are leveraging it for? So the way I see Spark, and I'm not the only one thinking this, is a definition I actually borrowed from a friend of mine at IBM, Rob Thomas.
Basically, what Rob was saying is that Spark is an operating system: it's an analytics operating system. It's designed to handle pretty complex workloads of data processing in a distributed way. And when you look at the definition of what an operating system is and what it provides to your applications, I really think that that's what Spark is. It provides so many of the base layers and the base functionality that you would need in any kind of highly distributed application, and it provides you an abstraction layer as well. So when you are thinking data engineering, and when you're thinking scale, that's where Spark has one very good role to play. And a lot of times
[00:04:29] Unknown:
in terms of the workflows that I see people using it for, I see some cases where it's analogous to a lot of the ETL frameworks that people might be familiar with, as far as moving data from one place to another and performing some transformations on it. But I know that it also has things such as the machine learning layer and the SQL capabilities, and I believe that it has some graph analysis capabilities. So what are some of the main tools and components
[00:04:57] Unknown:
that people might use and take advantage of within that Spark framework, or operating system as you put it? So you're totally right. There are a lot of people, and I've done quite a few projects in that space, where it's basically about creating a very high velocity data pipeline. Even if that's not the role Spark was conceived for, it excels in that world. And, yeah, it can be a little bit of a competitor to some of the ETL tools. But also, the fact that it brings this platform means that you can actually do the transformation at the same time. What I like about it is, as you said, they kind of divide Spark into four pillars.
One being Spark SQL, the other one is Spark Streaming, then ML and deep learning, and GraphX. I think one of the great things is that by offering a very unified API on top of these features, you empower the developer and the development team with very powerful features in a very abstract way. A lot of people compare Spark to Hadoop, and when you look at Hadoop, especially the very first versions where Hadoop was only MapReduce and only disk, well, you're creating a set of capabilities that can actually be used only by niche players, niche developers. And you don't have that with Spark.
Spark is actually offering all these APIs, basically the dataframe API, to the developers so they can do data ingestion, so basically taking data into Spark, applying some data quality rules, then using, for example, this data in a machine learning pipeline, and exporting this data to another database or another kind of report in a very seamless, programmatic way.
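To make this concrete, here is a minimal sketch in Java (the language the book targets) of the ingest, transform, export pattern Jean describes. The file paths, column names, and output location are hypothetical and only illustrate the shape of the dataframe API.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class IngestTransformExport {
  public static void main(String[] args) {
    // Local session for experimentation; point the master at a cluster for production.
    SparkSession spark = SparkSession.builder()
        .appName("Ingest, clean, and export")
        .master("local[*]")
        .getOrCreate();

    // Ingest: read a CSV file into a dataframe (Dataset<Row>).
    Dataset<Row> raw = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("data/orders.csv");                         // hypothetical input file

    // Transform: a simple data quality rule plus a derived column.
    Dataset<Row> cleaned = raw
        .filter(col("amount").isNotNull())
        .withColumn("amount_with_tax", col("amount").multiply(1.2));

    // Export: write the result somewhere else, here as Parquet files.
    cleaned.write().mode("overwrite").parquet("output/orders_clean"); // hypothetical output

    spark.stop();
  }
}
```

The same session could just as easily write to a JDBC table or feed a Spark ML pipeline; only the last step changes.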
[00:07:05] Unknown:
And who are the main roles and types of users who might be taking advantage of Spark, both in terms of doing some local development and also in the more production-scale use cases?
[00:07:22] Unknown:
So I see two big families of Spark users. One being the data engineers, and the other family is the data scientists. They have a bit of a different set of tools, and of course different use cases. When I was just speaking about data pipelines, that is typically a use case for data engineers. But Spark can be easily interfaced with products like Jupyter or Zeppelin, and those notebooks are really the tools of choice for data scientists. They can leverage the power of a cluster very easily, just as they would be using their notebooks locally. And now you've also got companies like Databricks, or IBM with Watson Studio, which offer more collaborative capabilities on top of it, and that's targeted at the data scientist. That is not the main scope of my book.
Data science is a great profession, and there's a lot of coverage on that. But I think that data engineers are a bit, you know, left on the side of the road sometimes. So that is the population I was trying to address.
[00:08:39] Unknown:
Yeah. It's a big part of why I've been running this podcast as well to try and give more attention to all of the work and effort required to gather and collect and process the data that's necessary to power some of these data science and analytics applications.
[00:08:55] Unknown:
Exactly. You're totally right. I mean, I think we sometimes forget that if you get garbage in, you get garbage out. And that's kind of the role of the data engineers, to make sure that we actually don't do that. So yep. And are there any particular
[00:09:12] Unknown:
use cases that you think Spark is uniquely positioned to handle, as compared to some of the other streaming or big data frameworks that might be available, such as Flink, Kafka, or Storm?
[00:09:29] Unknown:
So streaming is one of the components of Spark. And that's also why I go back to the definition of Spark as an operating system: it does not just do streaming. But it also plays well with Kafka. I've got quite a few friends, and actually customers and partners, using Kafka intensively with Spark, or, if you're in the AWS world, Kinesis. So that's where Spark is nice as well: it plays well with others. And it also has its own streaming capabilities, which are more and more used, and which, since version 2.2, and we have version 2.4 now, are ready for production. So it's pretty mature on that part as well. And in terms of the broader
[00:10:18] Unknown:
ecosystem, there's Hadoop and HDFS, and there are the streaming frameworks that are available. And then there are people who will likely be landing some of that processed data into their data lakes or data warehouses. And you mentioned the ML capabilities. So I'm just wondering if you can give some perspective on how it fits in relation to some of those other tools and some of the other types of
[00:10:49] Unknown:
deployments and applications that it might be used for. Hadoop has become, at some point, a synonym for big data. And I think that with Spark coming, it's almost like there's another kid in town, and that's actually bringing in a new generation of big data developers. You can see it when you actually talk to users: you've got really two kinds, in the data engineering space, or even in both spaces. You see people that are now using Spark and who are coming from the Hadoop world, and people, a bit like me, who skipped Hadoop and went to Spark directly. And you see really two different usages there. The people that came from Hadoop are really familiar with resource managers like Yarn, or HDFS, or Hive, and they see Spark as a natural evolution of that, because it leverages more: it offers more algorithms, it offers ML, it offers SQL. And you've got the other population, which is coming more from a data engineering background, or from the small data or RDBMS world, and is discovering Spark and the familiarity of doing things the same way you would do them with a normal database. Of course, it's a lot more than a database, because you can have all this processing being done as well with Spark, but it's a very natural transition. So what's interesting is to see these two populations being kind of reconciled on big data through Spark.
[00:12:34] Unknown:
And for somebody who's building an application on top of Spark or composing a data pipeline, what are some of the main design paradigms, in terms of the way that the software is written and the interfaces that are available, both in terms of the way that the data is represented and in terms of some of the language runtimes that developers are able to target?
[00:12:56] Unknown:
Surprisingly, you're not thinking big data when you use Spark, especially when you're at the level of designing and building your application. My experience with many developers is that they understand the APIs, they learn the APIs; that's kind of the key thing, once you've learned the dataframe API. And then comes deployment, because one of the big advantages of Spark is that you can run everything on your laptop. You don't need virtual machines, you don't need a cluster. Spark comes out of the box with what they call the local mode, and you just spin up Spark out of Eclipse or out of any environment very quickly, run your application right away, and start developing there. So then comes the deployment part, which is not that difficult.
You've got to package things in a certain way. And, of course, you've got to think about some optimization. So if you're making massive joins of huge datasets, you know that you're going to create a lot of traffic, so it might be better to partition by the keys you're going to join on. But that's the same thing you would do in a traditional RDBMS; nothing new there. So in that way, it seems like a very natural progression once more.
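As a rough illustration of that partitioning advice, here is a hedged Java sketch; the dataset paths and the customer_id join key are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class JoinWithRepartition {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Partition by the join key before a large join")
        .master("local[*]")            // swap for a cluster master in production
        .getOrCreate();

    // Hypothetical datasets; in practice these would be large tables.
    Dataset<Row> orders = spark.read().parquet("data/orders");
    Dataset<Row> customers = spark.read().parquet("data/customers");

    // Repartitioning both sides on the join key keeps matching rows together
    // and reduces the shuffle traffic generated by the join itself.
    Dataset<Row> joined = orders.repartition(col("customer_id"))
        .join(customers.repartition(col("customer_id")), "customer_id");

    joined.show(10);
    spark.stop();
  }
}
```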
[00:14:23] Unknown:
Yeah. And are there any major edge cases, shifts in design thinking, or deployment considerations when moving from that local mode to a more full-blown production-scale infrastructure, particularly as you start to scale that deployment in terms of the number of nodes and the overall volume of data processed?
[00:14:43] Unknown:
There are, but it's not a major shift in how you do it. Even when you drill down a bit into the architecture of Spark, where you've got your master and your workers that actually do the work, even deploying the application, you only deploy it once, and then Spark distributes your code to the different workers, which makes it really, really neat. And you don't have to think too hard about what's going on. That was really surprising for me. To be completely honest, I've not manipulated petabytes of data, but for all the projects we've been doing, whether it was in health care or with log data, everything was pretty smooth when we went to production. Sometimes when you take an application that works in an almost lab-like way on your laptop, you start wondering what could go wrong and you kind of freak out about the deployment part. I never had that. I think I was more freaked out in the good old days of deploying things to Tomcat than deploying things on Spark.
[00:16:00] Unknown:
And in terms of the way that you package and deploy the application, is there anything special in terms of how you need to create an artifact for that deployment? And does that change based on the language that you're implementing in, whether you're writing JVM-based code such as Java or Scala, or using something like PySpark?
[00:16:23] Unknown:
Yeah. So that's a good point to add. To use Spark, you can use Java, Scala, Python, or R. And I think it's very smart that they decided to cover these different languages, because you've got languages which are more data science oriented, like typically R, and languages that are more data engineering oriented, like, I would say, Java, though I don't want to create a big debate here. So that's also offering a wider range of potential users a way to embrace Spark. When you're deploying your app, you can do it in several ways, depending on the goal of your application.
One of the traditional ways is to package your app as a single JAR, and you just submit your JAR. Your JAR could be written in Scala or in Java; it doesn't make any difference. You can also use Spark via the shell, and then you work in a very similar way as you would with a notebook: you connect to Spark, you type your commands, and you get the results in a very interactive way. The shell is only available in Scala, Python, and R, because Java doesn't have this REPL mechanism right now. So that's one way to access it. One other way is, as I said, you can submit your application, or you can connect directly to the cluster.
So in some cases, for some projects we did, we needed to have a permanent connection directly to Spark, basically because we wanted to have a session always open. We had big datasets that we wanted constantly loaded and constantly being updated. And we opened a kind of REST endpoint in the application that would give a status update of what the application was actually doing. It's not even a very complex use case. So these are the different scenarios in which you can use Spark, and which one you pick really depends on the workload you're planning on running. Once more, it's giving you flexibility.
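As a rough sketch of that last scenario, here is what a long-lived session connecting directly to a cluster might look like in Java; the master URL and dataset path are hypothetical, and the same code packaged as a JAR could instead be launched with spark-submit.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LongRunningSession {
  public static void main(String[] args) {
    // Connect straight to a standalone cluster master (hypothetical host).
    SparkSession spark = SparkSession.builder()
        .appName("Long-lived Spark session")
        .master("spark://spark-master.example.com:7077")
        .getOrCreate();

    // Keep a big dataset loaded in memory so that later requests
    // (for example, triggered through a REST endpoint) reuse it without re-reading.
    Dataset<Row> reference = spark.read().parquet("hdfs:///data/reference");
    reference.cache();
    System.out.println("Cached rows: " + reference.count()); // materializes the cache

    // ... keep the session open and serve queries against `reference` ...
    spark.stop();
  }
}
```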
[00:18:45] Unknown:
And what have you found to be some of the most useful strategies for improving the efficiency and performance of a processing pipeline? I know that when I was looking through the book, you mentioned that one of the key aspects of the programming model of Spark is its laziness, so I'm curious if there are any special approaches for being able to employ that effectively.
[00:19:07] Unknown:
So I love the laziness part. And I compare it quite a few times to teenagers, of which I've got a lot in my family right now. The way I see it is, you can ask for something, like, go clean your room, go put your laundry in the washer, and things like that. But as long as you're saying things in a kind of normal way, they're not going to happen. At some point, you've got to apply an action to have the things done. And that's exactly how Spark works. You tell Spark to do a few things, basically data transformations.
And at some point, you say, hey, Spark, now it's time to get your shit together and work. That's called an action. Basically, when you're building this recipe, it's being transformed into a directed acyclic graph, which is basically a graph that never comes back to itself. And before it actually starts working, it's given to an optimizer within Spark called Catalyst. Catalyst, like the query planner in a relational database, goes through it and says, okay, I'm going to do that and that and that in this order, and this is the result of the optimization. And that's going to make it even faster. So first, you've got Spark using memory in a smart way.
Just by using memory, you gain a lot. And then you've got this DAG, which also completely optimizes the way your workloads are processed. A good example is, let's say you've got a dataset coming in, you add a column to your dataset, and then, for some reason, you want to delete that column from your dataset, because it was a temporary column or whatever. If you did that in a relational database, the column would be physically put in place, and that's going to be a very heavy operation. In Spark, it's never going to be created, because the optimizer can actually see that. When you look at the chapter about laziness, I've timed the different operations a bit, and it's actually almost funny to see how Spark does this. So I really have a strong feeling about making a parallel between teenagers and Spark.
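Here is a small Java sketch of that behavior: the transformations, including the temporary column that is added and then dropped, only describe a plan, and nothing runs until the action at the end. The explain() call prints the plan Catalyst produces; the column names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class LazyEvaluation {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Transformations are lazy, actions trigger work")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = spark.range(1_000_000).toDF("id");

    // Transformations: nothing executes yet, Spark only records the recipe.
    Dataset<Row> transformed = df
        .withColumn("tmp", col("id").multiply(2))   // temporary column...
        .withColumn("even", col("id").mod(2))
        .drop("tmp");                                // ...dropped before any action

    // Print the plan produced by the Catalyst optimizer.
    transformed.explain();

    // The action: this is the point where the work actually happens.
    long count = transformed.filter(col("even").equalTo(0)).count();
    System.out.println("Even ids: " + count);

    spark.stop();
  }
}
```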
[00:21:38] Unknown:
And in terms of the architectural design of the applications that developers are creating and the ways that data engineers are leveraging Spark, what are some of the edge cases and quirks of the framework that they should be considering as they begin to scale their deployments, again in terms of the number of nodes and the volumes and varieties of data that they're processing?
[00:22:08] Unknown:
I can't think of critical cases. Spark can ingest a lot of different types of data, so you don't have to preprocess the data before you ingest it. Natively it handles text, CSV, and JSON; with plugins you can bring in XML, and if you want to, you can also create your own plugin. Of course, all relational databases are easily integrated as well. So that part is pretty straightforward. The next part is all the data transformation, whether you're starting by doing data quality or directly doing some kind of heavy transformation on your data. You can combine SQL with the dataframe API, and use the machine learning API as well, in this transformation part. That's also pretty straightforward, and I don't see any specific cases where you need to be really aware of something. Once more, it's because at the end it's going through Catalyst to optimize your workload, so it gives you a little bit of tolerance on what you're doing. It's not like implementing a MapReduce algorithm, where you've got to be very careful about what you're doing as the map and the reduce. Spark gives you a little bit more freedom there. And finally, when you want to share your data, whether it's going to be in a file or in a database, it's the same thing. One thing to consider is that if you're saving to a database, for example, each task will try to open a connection to the database. So if you're on, let's say, 20,000 nodes, and you've got something like 5 tasks per node, or 20 tasks per node, and they all try to communicate with your relational database at roughly the same time, that might be a bit tricky. So you can repartition the data, coalesce the data, and then save it to your database. That would probably be an edge case where you've got to be careful.
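A hedged Java sketch of that last point: coalescing to a handful of partitions before a JDBC write caps the number of tasks opening database connections at the same time. The JDBC URL, credentials, and table name are hypothetical.

```java
import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CoalesceBeforeJdbcWrite {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Limit concurrent writers when saving to an RDBMS")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> result = spark.read().parquet("output/orders_clean"); // hypothetical data

    Properties props = new Properties();
    props.setProperty("user", "etl_user");      // hypothetical credentials
    props.setProperty("password", "secret");

    // Without coalescing, every task opens its own connection to the database.
    // Bringing the data back to a few partitions caps the parallel writers.
    result.coalesce(8)
        .write()
        .mode("append")
        .jdbc("jdbc:mysql://db.example.com:3306/analytics", "orders_clean", props);

    spark.stop();
  }
}
```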
[00:24:33] Unknown:
And in terms of optimizing some of the processing, I know that Spark has support for user-defined functions. What are some of the ways that they can be beneficial, and what is the approach to designing and deploying those user-defined functions to the Spark cluster?
[00:24:42] Unknown:
Yeah. So, just as a side note before we go to UDFs (and you can edit this out if you want), I think a component which is interesting to understand about Spark is Tungsten. Spark does not rely on Java storage when it comes to objects. Let's say you're getting integers and strings as part of your data. Spark is going to serialize that and store it in a specific storage manager called Tungsten, which can actually compress the data. So this is also somewhere your data will be optimized on your behalf. The ratio I got from the engineering teams on Spark is that you save about 20% on the data, and that was back in Spark 2.0, so they've probably made some progress on top of that. Basically, that's good to know, because if you've got, let's say, a gig of data to ingest, then you know that it's going to take only about 800 megs of memory on your cluster. So that's one side note that could also be interesting to the listeners.
And what was your question?
[00:26:01] Unknown:
And that's definitely very useful to know, particularly when you're doing capacity planning in terms of the node types and number of nodes when you're figuring out how much data you're going to be processing. So definitely valuable for resource allocation. But I was also curious about the use cases and design and deployment patterns around user defined functions and some of the benefits that they can provide in terms of efficiency
[00:26:28] Unknown:
and ease of use when deploying an application on Spark? Yeah, sure. So this is actually something I really like, and I'm making another plug for my favorite RDBMS, Informix: Informix brought user-defined functions in Java back in the nineties. What's interesting is that when you're using Spark, you've got a lot of built-in functions, like splitting the value of a column or cell into two or three pieces based on a regex, or date conversion, from a string to a real date. These are almost the SQL functions you find everywhere, and nobody should be surprised about that; they are the libraries that the community provides as part of the package. On top of that, Spark offers these user-defined functions, UDFs, where you can just build your own. I like this concept a lot because it extends the richness of the product. One of the reasons I really like it is that you can leverage your existing libraries and your existing code. In the use case of adding data quality, for example, you're ingesting data into your system, and you already have all these data quality libraries that exist and that you know are reliable. So you can just do a little bit of plumbing, use the UDF as a plumbing mechanism between Spark and your existing library, and leverage your assets like crazy in a very short time. That's something Spark is also very good for. Deploying those is not that difficult, because they end up being JARs, and JARs are just deployed. One constraint is that a lot of the libraries and the code you're going to add to Spark need to be serializable when you're doing it in Java or Scala. But that's not that big of a constraint, I think; at least it wasn't for me or my teams.
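A minimal Java sketch of that plumbing idea, assuming a hypothetical isValidEmail check standing in for an existing data quality library: the UDF is registered once and then used like any built-in function.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class DataQualityUdf {

  // Stand-in for a call into an existing, trusted data quality library.
  public static Boolean isValidEmail(String value) {
    return value != null && value.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Wrapping existing code in a UDF")
        .master("local[*]")
        .getOrCreate();

    // The UDF is just the plumbing between Spark and the existing method.
    spark.udf().register("is_valid_email",
        (UDF1<String, Boolean>) DataQualityUdf::isValidEmail,
        DataTypes.BooleanType);

    Dataset<Row> contacts = spark.read()
        .option("header", "true")
        .csv("data/contacts.csv");                       // hypothetical input

    // Flag rows that fail the check, exactly as a built-in function would be used.
    Dataset<Row> flagged = contacts
        .withColumn("email_ok", callUDF("is_valid_email", col("email")));

    flagged.filter(col("email_ok").equalTo(false)).show();
    spark.stop();
  }
}
```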
[00:28:45] Unknown:
And another area that I'd be interested in hearing about is any experience that you have in terms of writing tests and fitness functions around the applications that are being deployed to Spark, so that you can do some of those quality checks that you mentioned and do some validation of the various stages of the processing pipeline?
[00:29:10] Unknown:
So, yeah, that's a good one. When you're using UDFs and bringing in your own UDFs, you can test them easily outside of Spark. I don't know if I was fortunate or not, but we relied on external testing when we built pipelines with Spark, so we did not test each step within Spark. But there is a good framework for testing Spark, which is sponsored by IBM, and it's also interesting because it can be automated to change some values automatically. So you can say, I want to allocate that much memory to my executors, to my driver, to my master, and it varies; and that's just a few of the parameters.
And the tool will actually monitor and change that and give you feedback on the results, for performance testing. So there's a growing ecosystem, really, of these kinds of tools. Unfortunately, I did not have the chance to use them as much as I would have liked to.
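This is not the framework Jean mentions, but as a rough illustration of testing a single pipeline step, a plain JUnit test against a small local session is often enough; the transformation under test here is hypothetical.

```java
import static org.junit.Assert.assertEquals;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class PipelineStepTest {

  private static SparkSession spark;

  @BeforeClass
  public static void setUp() {
    // A small local session is enough to exercise a single transformation step.
    spark = SparkSession.builder()
        .appName("pipeline-step-test")
        .master("local[2]")
        .getOrCreate();
  }

  @AfterClass
  public static void tearDown() {
    spark.stop();
  }

  @Test
  public void rowsWithNullAmountsAreFilteredOut() {
    // Inline test data built with Spark SQL's VALUES syntax.
    Dataset<Row> input = spark.sql(
        "SELECT * FROM VALUES ('a', 10), ('b', CAST(NULL AS INT)) AS t(id, amount)");

    // Hypothetical transformation under test: drop rows with a null amount.
    Dataset<Row> cleaned = input.filter("amount IS NOT NULL");

    assertEquals(1, cleaned.count());
  }
}
```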
[00:30:18] Unknown:
And in terms of the actual deployment of the cluster itself, what are the necessary system requirements and any supporting technologies that should be considered for getting a Spark application up and running and getting the cluster deployed in a reliable and highly available fashion?
[00:30:38] Unknown:
So there are a lot of different ways of doing it, and we could probably spend a few hours on it. The first thing is that Spark allows you to build your own cluster with only the Spark components, which means you're not required to use anything else. The first cluster I built was a bunch of Ubuntu machines: you start Spark as a master on one machine, on the same machine you can start a worker, and on the other machines you just start workers and point them to the master. And that's your cluster; your cluster is ready. It's very fast to deploy a cluster and add nodes that way, and the resource manager is basically provided by Spark directly. Is that enough for production? Yes, it's been enough for some projects in production.
Now, because your physical cluster might be used by more than just Spark, you will also find resource managers that work with Spark. If you're coming from the Hadoop world, you will be able to use Yarn directly with Spark, and that's what, for example, Amazon is doing with AWS EMR, or you can use Mesos and DC/OS. We had a lot of fun doing it, and when I'm saying a lot of fun, it's positive fun, not nightmare fun. Allocating more resources and distributing tasks was really nice. And more recently, with Spark 2.3, Kubernetes is also completely supported, so if you're a Kubernetes fan, you can use Spark as well. So building your cluster is, I'm not saying fairly simple, but fairly flexible.
You're not stuck with one resource manager, you're not stuck with one technology, you're really open to a lot of different things. There are also projects I've seen with other, more targeted resource managers, but I've not had the chance to use them. But, yeah, it really comes with a standalone mode for the cluster, you can have Yarn, you can have Mesos, you can have Kubernetes. That gives you a pretty nice choice already.
[00:32:58] Unknown:
And with all these capabilities, it can be easy to start thinking that Spark is the right answer for every situation where you might need to do something with data. But what are some of the cases where you think that Spark is the wrong choice and some of the limitations that are imposed by the Spark programming model?
[00:33:17] Unknown:
So it's not a transactional system. You're not going to plug the cash registers in your Walmart directly into Spark; it's still an analytics system. All your transactional workloads will still rely on another system. That's one point. The other point is that your data within Spark resides in memory. There are nice things about that, and there are potential dangers of having it in memory: if it's gone, it's gone. Of course, there's reliability built in, but it's not designed to save to disk in the middle of the process. So this is something you may want to think about when you design your application. And the data in Spark itself is immutable. That was a weird one for me to understand when I started working with Spark: you've got this analytics system and the data inside is immutable, it doesn't change, so what's the point of that? It actually comes down to the low-level architecture of Spark. If you distribute your data and there are changes in your data, like in a cluster of databases, you have network latency,
you have network issues, and your data can be out of sync. But if you know that your data is not changing, you only have to share the recipe, the DAG, between your different nodes, and that's one great thing. Because it's immutable, you're not going to, for example, take your relational database habits and update fields. That's not how you're going to do it. You're going to apply transformations and get a new dataset, which is optimized and which is not going to be a full copy of your old data, etcetera. But it's not directly an update like you would do in MySQL, for example.
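A tiny Java sketch of what that immutability looks like in practice: the transformation returns a new dataset described by a new plan, and the original is untouched. The values are hypothetical.

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class ImmutableDatasets {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Datasets are immutable")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> original = spark
        .createDataset(Arrays.asList(1, 2, 3), Encoders.INT())
        .toDF("value");

    // There is no in-place UPDATE: the transformation returns a new dataset,
    // and the one it was derived from is left untouched.
    Dataset<Row> doubled = original.withColumn("value", col("value").multiply(2));

    original.show();  // still 1, 2, 3
    doubled.show();   // 2, 4, 6

    spark.stop();
  }
}
```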
[00:35:27] Unknown:
And in terms of the book that you're writing, it's currently still in progress. And you mentioned that you approached Manning with the idea of writing it because you were so interested and excited by the capabilities of Spark and the ways that it's used. So I'm wondering if you can talk a bit more about your motivation
[00:35:46] Unknown:
for writing it and the target audience that you're trying to serve with it. So, when I started learning about Spark, it was with the idea, even if it was not directly said, of data scientists, whom I have a lot of respect for; I think they're really smart people. But I didn't feel like it was really connected to my needs as a data engineer or data architect. So I started to look at resources, and that brought me to a lot of examples, and I love to learn by examples. Most of the things were written in Scala. That's adding a burden: you really want to use the tool, but before you can use it, you've got to learn this new language. And that didn't please me that much. But that was back in the early versions of Spark as well. Later, in projects, I was confronted with that again. I had teams of engineers really efficient in Java who didn't need to learn anything new except the dataframe API from Spark. So I said, well, if we want to develop things around Spark, are we going to train all these good engineers in Scala, or are we just going to leverage the Java they already know, and focus their efforts on just learning the essentials about Spark? That was the main motivation for my book. I found it damaging for companies I've been consulting with, or working for, that they started doing projects in Scala and trained people in Scala, and as soon as those people were trained in Scala, they found another job. Then you've got code written in Scala and you don't have anyone left to maintain it. Sticking with Java takes a little bit more effort, and at first your productivity might be a little lower, but you don't go through the phase of training your whole team in Scala. That might be a big debate we could start, and I don't want to start any troll wars or anything, but I think that Java is more readable than Scala, I think the maintenance is easier, and you've got more tools.
So when I went to see Manning, I said, this is what I believe in, and I would love to write a book about it. And basically they said, let's do it. That's what happened, about a year ago.
[00:38:28] Unknown:
And what have been some of the most interesting or useful lessons that you've learned in the process of writing the book and anything
[00:38:36] Unknown:
new that you've learned in terms of the capabilities of Spark that you might not otherwise have been exposed to? There are two folds to your question. First, writing a book: it's my first published book, but I've self-published a few times. I started self-publishing back in the early nineties when I was in college. A friend of mine and I decided, because we wanted to help our friends, to write a book on C and C++ and the transition from C to C++. That was a lot of fun, almost 30 years ago, and it was fun because I was a college student and it was a summer project. Then fast forward to writing a book with Manning: oh my god, it's difficult.
It's really difficult. I've got a great editor, Marina. I've also written a lot of articles and tutorials before, for IBM developerWorks, for example, and there you've got an editor who is basically just babysitting you. Then you go work with Manning, who are really great professionals, really demanding, and it's a lot of work. It's really forcing you to be a better self. Writing a blog, or something for a magazine with an editor behind you, is like writing in middle school; writing for Manning is like jumping directly into college after middle school. I love it. It's awesome. It's art.
So that's one thing. And I think I personally made a lot of progress in the way I explain things, and in the way of doing graphs and illustrations; there's a lot of illustration in the book. Also in trying to find the right examples, using real examples: most of the examples in the book are based on real datasets. They're not big datasets, because I don't want people to download a few terabytes of data to play with, but they're real datasets, so they have a real meaning. They're not funky datasets with just integers and things like that. So that was for the book part. For the Spark part, I love the research you've got to do. I just finished chapter 10, on streaming.
Streaming is not a very complicated paradigm, I think. But if you want to explain it, you've got to find the useful little tips and useful entry points to get your reader involved. For example, on this specific topic of streaming, a lot of people think about the network. And I didn't want to do network streaming to explain streaming, because I didn't want people to have to deal with netcat, or with ports on the machine, and all those things. So I said, let's do directory streaming: basically, you drop files in a directory. And for doing that, I built a small record generator, because I wanted the reader to be able to run the examples but just focus on the examples.
So there's this small ecosystem around the examples that I worked on, to provide the tools so that the student can really focus on and learn the key elements of the technologies the chapter is going to cover. When you combine these two, I think this is really both a terrible and a terrific learning experience for both the writer and the reader. Then you've got to pick whether it's a terrible experience for the reader and a terrific one for the writer, or the opposite.
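As a hedged illustration of that directory-streaming idea (not the book's actual example), here is a small Java structured streaming job that treats files dropped into a folder as the stream; the directory and schema are hypothetical.

```java
import java.util.concurrent.TimeoutException;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.StructType;

public class DirectoryStreaming {
  public static void main(String[] args)
      throws TimeoutException, StreamingQueryException {
    SparkSession spark = SparkSession.builder()
        .appName("Structured streaming from a directory")
        .master("local[*]")
        .getOrCreate();

    // Streaming file sources need an explicit schema.
    StructType schema = new StructType()
        .add("id", "string")
        .add("amount", "double");

    // Every file dropped into this (hypothetical) directory becomes new input;
    // no network plumbing such as netcat or open ports is needed.
    Dataset<Row> stream = spark.readStream()
        .schema(schema)
        .csv("data/stream-in");

    // Running aggregation printed to the console as new files arrive.
    StreamingQuery query = stream
        .groupBy("id").sum("amount")
        .writeStream()
        .outputMode("complete")
        .format("console")
        .start();

    query.awaitTermination();
  }
}
```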
[00:42:29] Unknown:
And what advice do you have for anyone who is considering or currently using Spark? And what are some of the areas of interest that you're looking forward to as you continue on your journey of working with Spark? So, I love the idea of machine learning
[00:42:48] Unknown:
without the math. That's actually one of the chapters in the book, chapter 18 or so, so it's still down the road, but it's the chapter I'm really looking forward to writing. When I started doing machine learning, a lot of the education was very formal, college-like training, where first you've got to understand the math before you actually use the machine learning algorithm. And I think that's the wrong approach. If you're talking to data scientists, that's fine, because that's the way they deal with it afterwards; they look at the math. And don't get me wrong, I love math, it's one of my favorite subjects. But when you're dealing with engineers, and most engineers like math, they don't have to understand why a logistic regression or a linear regression works. They just need to understand the best use case it applies to. This is really my next challenge, and my next challenge for the book as well. It could be ten books on the same topic, but it's going to be only one chapter in this book at least. This is something I would definitely love to see more of: associating the use case you're trying to solve with the right algorithm, and not trying to learn first how the algorithm works before you can actually use it.
[00:44:24] Unknown:
And are there any other aspects of the Spark project or the book that you're writing or your experiences
[00:44:34] Unknown:
with either of those that you would like to discuss further before we close out the show? You know, Spark has been around for a few years now. Right now we're in version 2 of Spark, with 2.4 released a few weeks ago, and Spark is going through these different levels of maturity. One of the great things is that they are not afraid of growing. So when you're looking at learning Spark, one thing that you need to be very careful about is which version of Spark is being covered. And I'm not saying that because of my book, but just for potential readers and users.
Spark started by storing data in RDDs, which are still there and which are the basis for the dataframe. But if you get a book or a tutorial on Spark and it's about Spark 1, you will find the RDD everywhere. And some very old books which have Spark 2 in the title were actually written too close to Spark 1 to embrace the full potential of the dataframe versus the RDDs. It's the same thing for streaming: the first versions of Spark included Spark Streaming based on what they call discretized streams,
which were kind of tightly linked to the RDD, the resilient distributed dataset. The same thing is happening to streaming now with structured streaming, which is based on the dataframe as well. So once more, you've got the same API and the same evolution of the product. So if you embark on a project, or if you start a new project, you've got to be a bit careful about what you're going to pick there. Even then, the API is not fundamentally different; it's about the different concepts. But that's basically how I see things now. I'm looking forward to the future of Spark, its growing user base, the more and more tools you see around Spark, which are also very interesting, and the variety of the use cases, which are also pretty incredible.
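To make the RDD-versus-dataframe distinction concrete, here is a hedged Java sketch of the same filter expressed both ways; with the dataframe version, Catalyst gets a chance to optimize the plan.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class RddVersusDataFrame {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("RDD API versus dataframe API")
        .master("local[*]")
        .getOrCreate();

    // Spark 1 style: RDDs, where the developer spells out each operation.
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
    long rddCount = rdd.filter(n -> n % 2 == 0).count();

    // Spark 2 style: the dataframe API, where Catalyst optimizes the plan.
    Dataset<Row> df = spark
        .createDataset(Arrays.asList(1, 2, 3, 4, 5, 6), Encoders.INT())
        .toDF("n");
    long dfCount = df.filter(col("n").mod(2).equalTo(0)).count();

    System.out.println(rddCount + " == " + dfCount); // same answer, different APIs
    spark.stop();
  }
}
```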
[00:46:47] Unknown:
There's a bright future for Spark, definitely. Yeah. And I suppose it's also worth mentioning that this is the second edition of Spark in Action. So I'm assuming that some of the differences people can expect between the first edition and the edition that you're working on now are some of the new capabilities that have been introduced with subsequent versions of the Spark framework.
[00:47:08] Unknown:
As I said, I didn't participate in the first edition of Spark in Action. It was mainly targeted at Scala and Python, whereas my edition is more toward Java. And I think they're a good complement to one another; the concepts are explained a little bit differently. I'm not saying mine is better or theirs is better; I think it's interesting to have both perspectives. But definitely, Spark in Action, first edition, covers more of Spark 1, and I'm exclusively covering Spark 2, of course with links to explain the concepts as they were in Spark 1, to make sure that the reader understands where it's coming from if they've got that kind of background. I don't like to assume that you've got to learn other technologies to get to that point. I don't want you to have to learn Hadoop to learn Spark, and I don't want you to have to learn Scala to learn Spark. If you want to do that, you're probably going to understand a lot more things, and if you want to debug Spark, because it happens, learning Scala is useful, but it might not be what you're going to do anyway. So there's potential as well to grow, to keep learning,
[00:48:25] Unknown:
if you want to go there. Alright. Well, for anybody who wants to follow the work that you're up to or get in touch with you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today? I still think we're suffering too much in data quality and data governance. It's so difficult.
[00:48:53] Unknown:
It doesn't seem to be easy to demonstrate the importance of data quality and data governance, and that's really something I would like the data management side to make some progress on; data governance more than anything else, really. I think we are getting so poor at tracking those artifacts. It's easy to get some metadata from a relational database, but then your data lineage gets lost somewhere. And now we are dealing with more and more AI models, machine learning models. What is the life cycle of those models? Are you able to track down your hyperparameters?
I think that's really where this next generation of data governance needs to go; it needs to embrace these things, but in a more intrusive yet more
[00:49:45] Unknown:
stealthy way than it's been done so far. I hope that's something someone will tackle. Alright. Well, thank you very much for taking the time to talk today about your experiences with Spark and writing the book. And for the people listening, there will be a discount code in the show notes for 40% off of any of the Manning books, including Spark in Action, second edition. Also, check the show notes for details about a giveaway of a few copies of Jean's book. So thank you again, Jean, for all of your efforts, and I hope you enjoy the rest of your evening. Well, thank you. Thank you, Tobias. And let's talk soon.
[00:50:29] Unknown:
Bye.
Introduction to Jean Georges Perrin and Spark
Journey from Small Data to Big Data
Understanding Spark and Its Use Cases
Roles and Users of Spark
Spark's Compatibility with Other Frameworks
Design Paradigms and Language Support
Optimizing Spark Performance
Handling Different Data Types and Transformations
User Defined Functions in Spark
Testing and Quality Checks in Spark
Deploying Spark Clusters
When Not to Use Spark
Writing the Book: Motivation and Audience
Lessons Learned from Writing the Book
Future of Spark and Data Management