Summary
Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With its large array of capabilities and the complexity of the underlying system, it can be difficult to understand how to get started. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book to help data engineers hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment, and how it fits into the overall data ecosystem.
Preamble
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
- Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
- Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape
Interview
- Introduction
- How did you get involved in the area of data management?
- Can you start by explaining what Spark is?
- What are some of the main use cases for Spark?
- What are some of the problems that Spark is uniquely suited to address?
- Who uses Spark?
- What are the tools offered to Spark users?
- How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
- For someone building on top of Spark what are the main software design paradigms?
- How does the design of an application change as you go from a local development environment to a production cluster?
- Once your application is written, what is involved in deploying it to a production environment?
- What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
- What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
- What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
- What are the limitations of the Spark programming model?
- What are the cases where Spark is the wrong choice?
- What was your motivation for writing a book about Spark?
- Who is the target audience?
- What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
- What advice do you have for anyone who is considering or currently using Spark?
Contact Info
Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?
Book Discount
- Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com
Links
- Apache Spark
- Spark In Action
- Book code examples in GitHub
- Informix
- International Informix Users Group
- MySQL
- Microsoft SQL Server
- ETL (Extract, Transform, Load)
- Spark SQL and Spark In Action's chapter 11
- Spark ML and Spark In Action's chapter 18
- Spark Streaming (structured) and Spark In Action's chapter 10
- Spark GraphX
- Hadoop
- Jupyter
- Zeppelin
- Databricks
- IBM Watson Studio
- Kafka
- Flink
- AWS Kinesis
- Yarn
- HDFS
- Hive
- Scala
- PySpark
- DAG
- Spark Catalyst
- Spark Tungsten
- Spark UDF
- AWS EMR
- Mesos
- DC/OS
- Kubernetes
- Dataframes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Hello. Welcome to the Data Engineering Podcast, the show about modern data management. When you're ready to build your next pipeline or want to test out the projects you hear about on the show, you'll need somewhere to deploy them. So check out Linode. With 200 gigabit private networking, scalable shared block storage, and a 40 gigabit public network, you've got everything you need to run a fast, reliable, and bulletproof data platform. If you need global distribution, they've got that covered too with worldwide data centers, including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. And don't forget to go to dataengineeringpodcast.com/chat to join the community and keep the conversation going. Your host is Tobias Macey. And today, I'm interviewing Jean Georges Perrin, author of the upcoming Manning book, Spark in Action, second edition, about the ways that Spark is used and how it fits into the data landscape. So, Jean, could you start by introducing yourself?
[00:01:13] Unknown:
Sure. Good evening, Tobias. I've been working in data engineering and around software for about 25 years. I would say I've been kind of constrained in what we now call the small data space, relational databases and just small data, for a long time. And I wanted to discover, and be useful using, more of these bigger datasets. I never found satisfaction with Hadoop, and when Spark started to be out and more popular, I said, well, maybe that's the time to actually skip that step and go a little bit further. And that's where I am now. I've been lucky enough to be able to implement Spark in a few projects, and it gave me the motivation to start this book. I went to Manning, and they said, wow, we love it, let's do a book together. And that's what we're doing. And you said that you spent a good part of your career
[00:02:20] Unknown:
working in the small data space with relational databases. So do you remember how you first got involved in that area of data management?
[00:02:28] Unknown:
Yeah. When I was living in France, back in the good old days and almost fresh out of college, I started working for a company which was building compilers for a 4GL language, and that was specifically working with Informix. Then they extended from Informix to Oracle, MySQL, SQL Server, and others. Back at that time, just a little bit fresh out of college and probably a little bit stupid, what I was saying is: I don't like databases, that's why I'm using Informix. And it really stuck with me for a long time, and I still love relational databases. But as data grows and datasets are growing, relational databases are not enough anymore. But that's how I started, and I'm still pretty fond of those good old days. Not nostalgic, just, you know, as it was. And so can you give a bit of a background and explain what the Spark project is and some of the main use cases that people are leveraging it for? So the way I see Spark, and I'm not the only one thinking this, is a definition I actually borrowed from a friend of mine at IBM, Rob Thomas.
Basically, what Rob was saying is that Spark is an operating system: it's an analytics operating system. It's designed to handle pretty complex workloads of data processing in a distributed way. And when you look at the definition of what an operating system is and what it provides to your applications, I really think that that's what Spark is. It provides so many of the base layers and the base functionality that you would need in any kind of highly distributed application, and it provides you an abstraction layer as well. So when you are thinking data engineering, and when you're thinking scale, that's where Spark has one very good role to play. And a lot of times
[00:04:29] Unknown:
in terms of the workflows that I see people using it for, I see some cases where it's analogous to a lot of the ETL frameworks that people might be familiar with, as far as moving data from one place to another and performing some transformations on it. But I know that it also has things such as the machine learning layer and the SQL capabilities, and I believe that it has some graph analysis capabilities. So what are some of the main tools and components
[00:04:57] Unknown:
that people might use and take advantage of within that Spark framework, or operating system as you put it? So you're totally right. There are a lot of people, and I've done quite a few projects in that space, where it's basically about creating a very high velocity data pipeline. Even if that's not the role Spark was conceived for, it excels in that world. And, yeah, it can be a little bit of a competitor to some of the ETL tools. But also, the fact that it brings this platform means that you can actually do the transformation at the same time. What I like about it is, as you said, they kind of divide Spark into four pillars.
One being Spark SQL, the other one is Spark Streaming, then ML and deep learning, and GraphX. I think one of the great things is that by offering a very unified API on top of these features, you empower the developer and the development team with very powerful features in a very abstract way. A lot of people compare Spark to Hadoop, and when you look at Hadoop, especially the very first versions where Hadoop was only MapReduce and only disk, well, you're creating a set of capabilities that can actually be used only by niche players, niche developers. And you don't have that with Spark.
Spark is actually offering all these APIs, basically the dataframe API, to the developers so they can do data ingestion, so basically taking data into Spark, applying some data quality rules, then using, for example, this data in a machine learning pipeline, and exporting this data to another database or another kind of report in a very seamless, programmatic way.
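To make this concrete, here is a minimal sketch in Java (the language the book targets) of the ingest, transform, export pattern Jean describes. The file paths, column names, and output location are hypothetical and only illustrate the shape of the dataframe API.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class IngestTransformExport {
  public static void main(String[] args) {
    // Local session for experimentation; point the master at a cluster for production.
    SparkSession spark = SparkSession.builder()
        .appName("Ingest, clean, and export")
        .master("local[*]")
        .getOrCreate();

    // Ingest: read a CSV file into a dataframe (Dataset<Row>).
    Dataset<Row> raw = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("data/orders.csv");                         // hypothetical input file

    // Transform: a simple data quality rule plus a derived column.
    Dataset<Row> cleaned = raw
        .filter(col("amount").isNotNull())
        .withColumn("amount_with_tax", col("amount").multiply(1.2));

    // Export: write the result somewhere else, here as Parquet files.
    cleaned.write().mode("overwrite").parquet("output/orders_clean"); // hypothetical output

    spark.stop();
  }
}
```

The same session could just as easily write to a JDBC table or feed a Spark ML pipeline; only the last step changes.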
[00:07:05] Unknown:
And who are the main roles and types of users who might be taking advantage of Spark, both in terms of doing some local development and also in the more production-scale use cases?
[00:07:22] Unknown:
So I see two big families of Spark users. One being the data engineers, and the other family is the data scientists. They have a bit of a different set of tools, and of course different use cases. When I was just speaking about data pipelines, that is typically a use case for data engineers. But Spark can be easily interfaced with products like Jupyter or Zeppelin, and those notebooks are really the tools of choice for data scientists. They can leverage the power of a cluster very easily, just as they would be using their notebooks locally. And now you've also got companies like Databricks, or IBM with Watson Studio, which offer more collaborative capabilities on top of it, and that's targeted at the data scientist. That is not the main scope of my book.
Data science is a great profession, and there's a lot of coverage on that. But I think that data engineers are a bit, you know, left on the side of the road sometimes. So that is the population I was trying to address.
[00:08:39] Unknown:
Yeah. It's a big part of why I've been running this podcast as well to try and give more attention to all of the work and effort required to gather and collect and process the data that's necessary to power some of these data science and analytics applications.
[00:08:55] Unknown:
Exactly. You're totally right. I mean, I think we sometimes forget that if you get garbage in, you get garbage out. And that's kind of the role of the data engineers, to make sure that we actually don't do that. So yep. And are there any particular
[00:09:12] Unknown:
use cases that you think Spark is uniquely positioned to handle, as compared to some of the other streaming or big data frameworks that might be available, such as Flink, Kafka, or Storm?
[00:09:29] Unknown:
So streaming is one of the components of Spark. And that's also why I go back to the definition of Spark as an operating system: it does not just do streaming. But it also plays well with Kafka. I've got quite a few friends, and actually customers and partners, using Kafka intensively with Spark, or, if you're in the AWS world, Kinesis. So that's where Spark is nice as well: it plays well with others. And it also has its own streaming capabilities, which are more and more used, and which, since version 2.2, and we have version 2.4 now, are ready for production. So it's pretty mature on that part as well. And in terms of the broader
[00:10:18] Unknown:
ecosystem, there's Hadoop and HDFS, and there are the streaming frameworks that are available. And then there are people who will likely be landing some of that processed data into their data lakes or data warehouses. And you mentioned the ML capabilities. So I'm just wondering if you can give some perspective on how it fits in relation to some of those other tools and some of the other types of
[00:10:49] Unknown:
deployments and applications that it might be used for. Hadoop has become, at some point, a synonym for big data. And I think that with Spark coming, it's almost like there's another kid in town, and that's actually bringing in a new generation of big data developers. You can see it when you actually talk to users: you've got really two kinds, in the data engineering space, or even in both spaces. You see people that are now using Spark and who are coming from the Hadoop world, and people, a bit like me, who skipped Hadoop and went to Spark directly. And you see really two different usages there. The people that came from Hadoop are really familiar with resource managers like Yarn, or HDFS, or Hive, and they see Spark as a natural evolution of that, because it leverages more: it offers more algorithms, it offers ML, it offers SQL. And you've got the other population, which is coming more from a data engineering background, or from the small data or RDBMS world, and is discovering Spark and the familiarity of doing things the same way you would do them with a normal database. Of course, it's a lot more than a database, because you can have all this processing being done as well with Spark, but it's a very natural transition. So what's interesting is to see these two populations being kind of reconciled on big data through Spark.
[00:12:34] Unknown:
And for somebody who's building an application on top of Spark or composing a data pipeline, what are some of the main design paradigms, in terms of the way that the software is written and the interfaces that are available, both in terms of the way that the data is represented and in terms of some of the language runtimes that developers are able to target?
[00:12:56] Unknown:
Surprisingly, you're not thinking big data when you use Spark, especially when you're at the level of designing and building your application. My experience with many developers is that they understand the APIs, they learn the APIs; that's kind of the key thing, once you've learned the dataframe API. And then comes deployment, because one of the big advantages of Spark is that you can run everything on your laptop. You don't need virtual machines, you don't need a cluster. Spark comes out of the box with what they call the local mode, and you just spin up Spark out of Eclipse or out of any environment very quickly, run your application right away, and start developing there. So then comes the deployment part, which is not that difficult.
You've got to package things in a certain way. And, of course, you've got to think about some optimization. So if you're making massive joins of huge datasets, you know that you're going to create a lot of traffic, so it might be better to partition by the keys you're going to join on. But that's the same thing you would do in a traditional RDBMS; nothing new there. So in that way, it seems like a very natural progression once more.
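As a rough illustration of that partitioning advice, here is a hedged Java sketch; the dataset paths and the customer_id join key are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class JoinWithRepartition {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Partition by the join key before a large join")
        .master("local[*]")            // swap for a cluster master in production
        .getOrCreate();

    // Hypothetical datasets; in practice these would be large tables.
    Dataset<Row> orders = spark.read().parquet("data/orders");
    Dataset<Row> customers = spark.read().parquet("data/customers");

    // Repartitioning both sides on the join key keeps matching rows together
    // and reduces the shuffle traffic generated by the join itself.
    Dataset<Row> joined = orders.repartition(col("customer_id"))
        .join(customers.repartition(col("customer_id")), "customer_id");

    joined.show(10);
    spark.stop();
  }
}
```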
[00:14:23] Unknown:
Yeah. And are there any major edge cases, shifts in design thinking, or deployment considerations when moving from that local mode to a more full-blown production-scale infrastructure, particularly as you start to scale that deployment in terms of the number of nodes and the overall volume of data processed?
[00:14:43] Unknown:
There are, but it's not a major shift in how you do it. Even when you drill down a bit into the architecture of Spark, where you've got your master and your workers that actually do the work, even deploying the application, you only deploy it once, and then Spark distributes your code to the different workers, which makes it really, really neat. And you don't have to think too hard about what's going on. That was really surprising for me. To be completely honest, I've not manipulated petabytes of data, but for all the projects we've been doing, whether it was in health care or with log data, everything was pretty smooth when we went to production. Sometimes when you take an application that works in an almost lab-like way on your laptop, you start wondering what could go wrong and you kind of freak out about the deployment part. I never had that. I think I was more freaked out in the good old days of deploying things to Tomcat than deploying things on Spark.
[00:16:00] Unknown:
And in terms of the way that you package and deploy the application, is there anything special in terms of how you need to create an artifact for that deployment? And does that change based on the language that you're implementing in, whether you're writing JVM-based code such as Java or Scala, or using something like PySpark?
[00:16:23] Unknown:
Yeah. So that's a good point to add. To use Spark, you can use Java, Scala, Python, or R. And I think it's very smart that they decided to cover these different languages, because you've got languages which are more data science oriented, like typically R, and languages that are more data engineering oriented, like, I would say, Java, though I don't want to create a big debate here. So that's also offering a wider range of potential users a way to embrace Spark. When you're deploying your app, you can do it in several ways, depending on the goal of your application.
One of the traditional ways is to package your app as a single JAR, and you just submit your JAR. Your JAR could be written in Scala or in Java; it doesn't make any difference. You can also use Spark via the shell, and then you work in a very similar way as you would with a notebook: you connect to Spark, you type your commands, and you get the results in a very interactive way. The shell is only available in Scala, Python, and R, because Java doesn't have this REPL mechanism right now. So that's one way to access it. One other way is, as I said, you can submit your application, or you can connect directly to the cluster.
So in some cases, for some projects we did, we needed to have a permanent connection directly to Spark, basically because we wanted to have a session always open. We had big datasets that we wanted constantly loaded and constantly being updated. And we opened a kind of REST endpoint in the application that would give a status update of what the application was actually doing. It's not even a very complex use case. So these are the different scenarios in which you can use Spark, and which one you pick really depends on the workload you're planning on running. Once more, it's giving you flexibility.
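As a rough sketch of that last scenario, here is what a long-lived session connecting directly to a cluster might look like in Java; the master URL and dataset path are hypothetical, and the same code packaged as a JAR could instead be launched with spark-submit.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LongRunningSession {
  public static void main(String[] args) {
    // Connect straight to a standalone cluster master (hypothetical host).
    SparkSession spark = SparkSession.builder()
        .appName("Long-lived Spark session")
        .master("spark://spark-master.example.com:7077")
        .getOrCreate();

    // Keep a big dataset loaded in memory so that later requests
    // (for example, triggered through a REST endpoint) reuse it without re-reading.
    Dataset<Row> reference = spark.read().parquet("hdfs:///data/reference");
    reference.cache();
    System.out.println("Cached rows: " + reference.count()); // materializes the cache

    // ... keep the session open and serve queries against `reference` ...
    spark.stop();
  }
}
```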
[00:18:45] Unknown:
And what have you found to be some of the most useful strategies for improving the efficiency and performance of a processing pipeline? I know that when I was looking through the book, you mentioned that one of the key aspects of the programming model of Spark is its laziness, so I'm curious if there are any special approaches for being able to employ that effectively.
[00:19:07] Unknown:
So I love the laziness part. And I compare it quite a few times to teenagers, of which I've got a lot in my family right now. The way I see it is, you can ask for something, like, go clean your room, go put your laundry in the washer, and things like that. But as long as you're saying things in a kind of normal way, they're not going to happen. At some point, you've got to apply an action to have the things done. And that's exactly how Spark works. You tell Spark to do a few things, basically data transformations.
And at some point, you say, hey, Spark, now it's time to get your shit together and work. That's called an action. Basically, when you're building this recipe, it's being transformed into a directed acyclic graph, which is basically a graph that never comes back to itself. And before it actually starts working, it's given to an optimizer within Spark called Catalyst. Catalyst, like the query planner in a relational database, goes through it and says, okay, I'm going to do that and that and that in this order, and this is the result of the optimization. And that's going to make it even faster. So first, you've got Spark using memory in a smart way.
Just by using memory, you gain a lot. And then you've got this DAG, which also completely optimizes the way your workloads are processed. A good example is, let's say you've got a dataset coming in, you add a column to your dataset, and then, for some reason, you want to delete that column from your dataset, because it was a temporary column or whatever. If you did that in a relational database, the column would be physically put in place, and that's going to be a very heavy operation. In Spark, it's never going to be created, because the optimizer can actually see that. When you look at the chapter about laziness, I've timed the different operations a bit, and it's actually almost funny to see how Spark does this. So I really have a strong feeling about making a parallel between teenagers and Spark.
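Here is a small Java sketch of that behavior: the transformations, including the temporary column that is added and then dropped, only describe a plan, and nothing runs until the action at the end. The explain() call prints the plan Catalyst produces; the column names are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class LazyEvaluation {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Transformations are lazy, actions trigger work")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = spark.range(1_000_000).toDF("id");

    // Transformations: nothing executes yet, Spark only records the recipe.
    Dataset<Row> transformed = df
        .withColumn("tmp", col("id").multiply(2))   // temporary column...
        .withColumn("even", col("id").mod(2))
        .drop("tmp");                                // ...dropped before any action

    // Print the plan produced by the Catalyst optimizer.
    transformed.explain();

    // The action: this is the point where the work actually happens.
    long count = transformed.filter(col("even").equalTo(0)).count();
    System.out.println("Even ids: " + count);

    spark.stop();
  }
}
```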
[00:21:38] Unknown:
And in terms of the architectural design of the applications that developers are creating and the ways that data engineers are leveraging Spark, what are some of the edge cases and quirks of the framework that they should be considering as they begin to scale their deployments, again in terms of the number of nodes and the volumes and varieties of data that they're processing?
[00:22:08] Unknown:
I can't think of critical cases. Spark can ingest a lot of different types of data, so you don't have to preprocess the data before you ingest it. Natively it handles text, CSV, and JSON; with plugins you can bring in XML, and if you want to, you can also create your own plugin. Of course, all relational databases are easily integrated as well. So that part is pretty straightforward. The next part is all the data transformation, whether you're starting by doing data quality or directly doing some kind of heavy transformation on your data. You can combine SQL with the dataframe API, and use the machine learning API as well, in this transformation part. That's also pretty straightforward, and I don't see any specific cases where you need to be really aware of something. Once more, it's because at the end it's going through Catalyst to optimize your workload, so it gives you a little bit of tolerance on what you're doing. It's not like implementing a MapReduce algorithm, where you've got to be very careful about what you're doing as the map and the reduce. Spark gives you a little bit more freedom there. And finally, when you want to share your data, whether it's going to be in a file or in a database, it's the same thing. One thing to consider is that if you're saving to a database, for example, each task will try to open a connection to the database. So if you're on, let's say, 20,000 nodes, and you've got something like 5 tasks per node, or 20 tasks per node, and they all try to communicate with your relational database at roughly the same time, that might be a bit tricky. So you can repartition the data, coalesce the data, and then save it to your database. That would probably be an edge case where you've got to be careful.
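A hedged Java sketch of that last point: coalescing to a handful of partitions before a JDBC write caps the number of tasks opening database connections at the same time. The JDBC URL, credentials, and table name are hypothetical.

```java
import java.util.Properties;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CoalesceBeforeJdbcWrite {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Limit concurrent writers when saving to an RDBMS")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> result = spark.read().parquet("output/orders_clean"); // hypothetical data

    Properties props = new Properties();
    props.setProperty("user", "etl_user");      // hypothetical credentials
    props.setProperty("password", "secret");

    // Without coalescing, every task opens its own connection to the database.
    // Bringing the data back to a few partitions caps the parallel writers.
    result.coalesce(8)
        .write()
        .mode("append")
        .jdbc("jdbc:mysql://db.example.com:3306/analytics", "orders_clean", props);

    spark.stop();
  }
}
```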
[00:24:33] Unknown:
And in terms of optimizing some of the processing, I know that Spark has support for user-defined functions. What are some of the ways that they can be beneficial, and what is the approach to designing and deploying those user-defined functions to the Spark cluster?
[00:24:42] Unknown:
Yeah. So, just as a side note before we go to UDFs (and you can edit this out if you want), I think a component which is interesting to understand about Spark is Tungsten. Spark does not rely on Java storage when it comes to objects. Let's say you're getting integers and strings as part of your data. Spark is going to serialize that and store it in a specific storage manager called Tungsten, which can actually compress the data. So this is also somewhere your data will be optimized on your behalf. The ratio I got from the engineering teams on Spark is that you save about 20% on the data, and that was back in Spark 2.0, so they've probably made some progress on top of that. Basically, that's good to know, because if you've got, let's say, a gig of data to ingest, then you know that it's going to take only about 800 megs of memory on your cluster. So that's one side note that could also be interesting to the listeners.
And what was your question?
[00:26:01] Unknown:
And that's definitely very useful to know, particularly when you're doing capacity planning in terms of the node types and number of nodes when you're figuring out how much data you're going to be processing. So definitely valuable for resource allocation. But I was also curious about the use cases and design and deployment patterns around user defined functions and some of the benefits that they can provide in terms of efficiency
[00:26:28] Unknown:
and ease of use when deploying an application on Spark? Yeah, sure. So this is actually something I really like, and I'm making another plug for my favorite RDBMS, Informix: Informix brought user-defined functions in Java back in the nineties. What's interesting is that when you're using Spark, you've got a lot of built-in functions, like splitting the value of a column or cell into two or three pieces based on a regex, or date conversion, from a string to a real date. These are almost the SQL functions you find everywhere, and nobody should be surprised about that; they are the libraries that the community provides as part of the package. On top of that, Spark offers these user-defined functions, UDFs, where you can just build your own. I like this concept a lot because it extends the richness of the product. One of the reasons I really like it is that you can leverage your existing libraries and your existing code. In the use case of adding data quality, for example, you're ingesting data into your system, and you already have all these data quality libraries that exist and that you know are reliable. So you can just do a little bit of plumbing, use the UDF as a plumbing mechanism between Spark and your existing library, and leverage your assets like crazy in a very short time. That's something Spark is also very good for. Deploying those is not that difficult, because they end up being JARs, and JARs are just deployed. One constraint is that a lot of the libraries and the code you're going to add to Spark need to be serializable when you're doing it in Java or Scala. But that's not that big of a constraint, I think; at least it wasn't for me or my teams.
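A minimal Java sketch of that plumbing idea, assuming a hypothetical isValidEmail check standing in for an existing data quality library: the UDF is registered once and then used like any built-in function.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

public class DataQualityUdf {

  // Stand-in for a call into an existing, trusted data quality library.
  public static Boolean isValidEmail(String value) {
    return value != null && value.matches("[^@\\s]+@[^@\\s]+\\.[^@\\s]+");
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Wrapping existing code in a UDF")
        .master("local[*]")
        .getOrCreate();

    // The UDF is just the plumbing between Spark and the existing method.
    spark.udf().register("is_valid_email",
        (UDF1<String, Boolean>) DataQualityUdf::isValidEmail,
        DataTypes.BooleanType);

    Dataset<Row> contacts = spark.read()
        .option("header", "true")
        .csv("data/contacts.csv");                       // hypothetical input

    // Flag rows that fail the check, exactly as a built-in function would be used.
    Dataset<Row> flagged = contacts
        .withColumn("email_ok", callUDF("is_valid_email", col("email")));

    flagged.filter(col("email_ok").equalTo(false)).show();
    spark.stop();
  }
}
```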
[00:28:45] Unknown:
And another area that I'd be interested in hearing about is any experience that you have in terms of writing tests and fitness functions around the applications that are being deployed to Spark, so that you can do some of those quality checks that you mentioned and do some validation of the various stages of the processing pipeline?
[00:29:10] Unknown:
So, yeah, that's a good one. When you're using UDFs and bringing in your own UDFs, you can test them easily outside of Spark. I don't know if I was fortunate or not, but we relied on external testing when we built pipelines with Spark, so we did not test each step within Spark. But there is a good framework for testing Spark, which is sponsored by IBM, and it's also interesting because it can be automated to change some values automatically. So you can say, I want to allocate that much memory to my executors, to my driver, to my master, and it varies; and that's just a few of the parameters.
And the tool will actually monitor and change that and give you feedback on the results, for performance testing. So there's a growing ecosystem, really, of these kinds of tools. Unfortunately, I did not have the chance to use them as much as I would have liked to.
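This is not the framework Jean mentions, but as a rough illustration of testing a single pipeline step, a plain JUnit test against a small local session is often enough; the transformation under test here is hypothetical.

```java
import static org.junit.Assert.assertEquals;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class PipelineStepTest {

  private static SparkSession spark;

  @BeforeClass
  public static void setUp() {
    // A small local session is enough to exercise a single transformation step.
    spark = SparkSession.builder()
        .appName("pipeline-step-test")
        .master("local[2]")
        .getOrCreate();
  }

  @AfterClass
  public static void tearDown() {
    spark.stop();
  }

  @Test
  public void rowsWithNullAmountsAreFilteredOut() {
    // Inline test data built with Spark SQL's VALUES syntax.
    Dataset<Row> input = spark.sql(
        "SELECT * FROM VALUES ('a', 10), ('b', CAST(NULL AS INT)) AS t(id, amount)");

    // Hypothetical transformation under test: drop rows with a null amount.
    Dataset<Row> cleaned = input.filter("amount IS NOT NULL");

    assertEquals(1, cleaned.count());
  }
}
```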
[00:30:18] Unknown:
And in terms of the actual deployment of the cluster itself, what are the necessary system requirements and any supporting technologies that should be considered for getting a Spark application up and running and getting the cluster deployed in a reliable and highly available fashion?
[00:30:38] Unknown:
So there are a lot of different ways of doing it, and we could probably spend a few hours on it. The first thing is that Spark allows you to build your own cluster with only the Spark components, which means you're not required to use anything else. The first cluster I built was a bunch of Ubuntu machines: you start Spark as a master on one machine, on the same machine you can start a worker, and on the other machines you just start workers and point them to the master. And that's your cluster; your cluster is ready. It's very fast to deploy a cluster and add nodes that way, and the resource manager is basically provided by Spark directly. Is that enough for production? Yes, it's been enough for some projects in production.
Now, because your physical cluster might be used by more than just Spark, you will also find resource managers that work with Spark. If you're coming from the Hadoop world, you will be able to use Yarn directly with Spark, and that's what, for example, Amazon is doing with AWS EMR, or you can use Mesos and DC/OS. We had a lot of fun doing it, and when I'm saying a lot of fun, it's positive fun, not nightmare fun. Allocating more resources and distributing tasks was really nice. And more recently, with Spark 2.3, Kubernetes is also completely supported, so if you're a Kubernetes fan, you can use Spark as well. So building your cluster is, I'm not saying fairly simple, but fairly flexible.
You're not stuck with one resource manager, you're not stuck with one technology, you're really open to a lot of different things. There are also projects I've seen with other, more targeted resource managers, but I've not had the chance to use them. But, yeah, it really comes with a standalone mode for the cluster, you can have Yarn, you can have Mesos, you can have Kubernetes. That gives you a pretty nice choice already.
[00:32:58] Unknown:
And with all these capabilities, it can be easy to start thinking that Spark is the right answer for every situation where you might need to do something with data. But what are some of the cases where you think that Spark is the wrong choice and some of the limitations that are imposed by the Spark programming model?
[00:33:17] Unknown:
So it's not a transactional system. You're not going to plug the cash registers in your Walmart directly into Spark; it's still an analytics system. All your transactional workloads will still rely on another system. That's one point. The other point is that your data within Spark resides in memory. There are nice things about that, and there are potential dangers of having it in memory: if it's gone, it's gone. Of course, there's reliability built in, but it's not designed to save to disk in the middle of the process. So this is something you may want to think about when you design your application. And the data in Spark itself is immutable. That was a weird one for me to understand when I started working with Spark: you've got this analytics system and the data inside is immutable, it doesn't change, so what's the point of that? It actually comes down to the low-level architecture of Spark. If you distribute your data and there are changes in your data, like in a cluster of databases, you have network latency,
you have network issues, and your data can be out of sync. But if you know that your data is not changing, you only have to share the recipe, the DAG, between your different nodes, and that's one great thing. Because it's immutable, you're not going to, for example, take your relational database habits and update fields. That's not how you're going to do it. You're going to apply transformations and get a new dataset, which is optimized and which is not going to be a full copy of your old data, etcetera. But it's not directly an update like you would do in MySQL, for example.
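A tiny Java sketch of what that immutability looks like in practice: the transformation returns a new dataset described by a new plan, and the original is untouched. The values are hypothetical.

```java
import java.util.Arrays;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class ImmutableDatasets {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Datasets are immutable")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> original = spark
        .createDataset(Arrays.asList(1, 2, 3), Encoders.INT())
        .toDF("value");

    // There is no in-place UPDATE: the transformation returns a new dataset,
    // and the one it was derived from is left untouched.
    Dataset<Row> doubled = original.withColumn("value", col("value").multiply(2));

    original.show();  // still 1, 2, 3
    doubled.show();   // 2, 4, 6

    spark.stop();
  }
}
```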
[00:35:27] Unknown:
And in terms of the book that you're writing, it's currently still in progress. And you mentioned that you approached Manning with the idea of writing it because you were so interested and excited by the capabilities of Spark and the ways that it's used. So I'm wondering if you can talk a bit more about your motivation
[00:35:46] Unknown:
for writing it and the target audience that you're trying to serve with it. So, when I started learning about Spark, it was with the idea, even if it was not directly said, of data scientists, whom I have a lot of respect for; I think they're really smart people. But I didn't feel like it was really connected to my needs as a data engineer or data architect. So I started to look at resources, and that brought me to a lot of examples, and I love to learn by examples. Most of the things were written in Scala. That's adding a burden: you really want to use the tool, but before you can use it, you've got to learn this new language. And that didn't please me that much. But that was back in the early versions of Spark as well. Later, in projects, I was confronted with that again. I had teams of engineers really efficient in Java who didn't need to learn anything new except the dataframe API from Spark. So I said, well, if we want to develop things around Spark, are we going to train all these good engineers in Scala, or are we just going to leverage the Java they already know, and focus their efforts on just learning the essentials about Spark? That was the main motivation for my book. I found it damaging for companies I've been consulting with, or working for, that they started doing projects in Scala and trained people in Scala, and as soon as those people were trained in Scala, they found another job. Then you've got code written in Scala and you don't have anyone left to maintain it. Sticking with Java takes a little bit more effort, and at first your productivity might be a little lower, but you don't go through the phase of training your whole team in Scala. That might be a big debate we could start, and I don't want to start any troll wars or anything, but I think that Java is more readable than Scala, I think the maintenance is easier, and you've got more tools.
So when I went to see Manning, I said, this is what I believe in, and I would love to write a book about it. And basically they said, let's do it. That's what happened, about a year ago.
[00:38:28] Unknown:
And what have been some of the most interesting or useful lessons that you've learned in the process of writing the book and anything
[00:38:36] Unknown:
new that you've learned in terms of the capabilities of Spark that you might not otherwise have been exposed to? There are two folds to your question. First, writing a book: it's my first published book, but I've self-published a few times. I started self-publishing back in the early nineties when I was in college. A friend of mine and I decided, because we wanted to help our friends, to write a book on C and C++ and the transition from C to C++. That was a lot of fun, almost 30 years ago, and it was fun because I was a college student and it was a summer project. Then fast forward to writing a book with Manning: oh my god, it's difficult.
It's really difficult. I've got a great editor, Marina. I've also written a lot of articles and tutorials before, for IBM developerWorks, for example, and there you've got an editor who is basically just babysitting you. Then you go work with Manning, who are really great professionals, really demanding, and it's a lot of work. It's really forcing you to be a better self. Writing a blog, or something for a magazine with an editor behind you, is like writing in middle school; writing for Manning is like jumping directly into college after middle school. I love it. It's awesome. It's art.
So that's one thing. And I think I personally made a lot of progress in the way I explain things, and in the way of doing graphs and illustrations; there's a lot of illustration in the book. Also in trying to find the right examples, using real examples: most of the examples in the book are based on real datasets. They're not big datasets, because I don't want people to download a few terabytes of data to play with, but they're real datasets, so they have a real meaning. They're not funky datasets with just integers and things like that. So that was for the book part. For the Spark part, I love the research you've got to do. I just finished chapter 10, on streaming.
Streaming is not a very complicated paradigm, I think. But if you want to explain it, you've got to find the useful little tips and useful entry points to get your reader involved. For example, on this specific topic of streaming, a lot of people think about the network. And I didn't want to do network streaming to explain streaming, because I didn't want people to have to deal with netcat, or with ports on the machine, and all those things. So I said, let's do directory streaming: basically, you drop files in a directory. And for doing that, I built a small record generator, because I wanted the reader to be able to run the examples but just focus on the examples.
So there's this small ecosystem around the examples that I worked on, to provide the tools so that the student can really focus on and learn the key elements of the technologies the chapter is going to cover. When you combine these two, I think this is really both a terrible and a terrific learning experience for both the writer and the reader. Then you've got to pick whether it's a terrible experience for the reader and a terrific one for the writer, or the opposite.
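As a hedged illustration of that directory-streaming idea (not the book's actual example), here is a small Java structured streaming job that treats files dropped into a folder as the stream; the directory and schema are hypothetical.

```java
import java.util.concurrent.TimeoutException;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.StructType;

public class DirectoryStreaming {
  public static void main(String[] args)
      throws TimeoutException, StreamingQueryException {
    SparkSession spark = SparkSession.builder()
        .appName("Structured streaming from a directory")
        .master("local[*]")
        .getOrCreate();

    // Streaming file sources need an explicit schema.
    StructType schema = new StructType()
        .add("id", "string")
        .add("amount", "double");

    // Every file dropped into this (hypothetical) directory becomes new input;
    // no network plumbing such as netcat or open ports is needed.
    Dataset<Row> stream = spark.readStream()
        .schema(schema)
        .csv("data/stream-in");

    // Running aggregation printed to the console as new files arrive.
    StreamingQuery query = stream
        .groupBy("id").sum("amount")
        .writeStream()
        .outputMode("complete")
        .format("console")
        .start();

    query.awaitTermination();
  }
}
```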
[00:42:29] Unknown:
And what advice do you have for anyone who is considering or currently using Spark? And what are some of the areas of interest that you're looking forward to as you continue on your journey of working with Spark? So, I love the idea of machine learning
[00:42:48] Unknown:
without the math. That's actually one of the chapters in the book, chapter 18 or so, so it's still down the road, but it's the chapter I'm really looking forward to writing. When I started doing machine learning, a lot of the education was very formal, college-like training, where first you've got to understand the math before you actually use the machine learning algorithm. And I think that's the wrong approach. If you're talking to data scientists, that's fine, because that's the way they deal with it afterwards; they look at the math. And don't get me wrong, I love math, it's one of my favorite subjects. But when you're dealing with engineers, and most engineers like math, they don't have to understand why a logistic regression or a linear regression works. They just need to understand the best use case it applies to. This is really my next challenge, and my next challenge for the book as well. It could be ten books on the same topic, but it's going to be only one chapter in this book at least. This is something I would definitely love to see more of: associating the use case you're trying to solve with the right algorithm, and not trying to learn first how the algorithm works before you can actually use it.
[00:44:24] Unknown:
And are there any other aspects of the Spark project or the book that you're writing or your experiences
[00:44:34] Unknown:
with either of those that you would like to discuss further before we close out the show? You know, Spark has been around for a few years now. Right now we're in version 2 of Spark, with 2.4 released a few weeks ago, and Spark is going through these different levels of maturity. One of the great things is that they are not afraid of growing. So when you're looking at learning Spark, one thing that you need to be very careful about is which version of Spark is being covered. And I'm not saying that because of my book, but just for potential readers and users.
Spark started by storing data in RDDs, which are still there and which are the basis for the dataframe. But if you get a book or a tutorial on Spark and it's about Spark 1, you will find the RDD everywhere. And some very old books which have Spark 2 in the title were actually written too close to Spark 1 to embrace the full potential of the dataframe versus the RDDs. It's the same thing for streaming: the first versions of Spark included Spark Streaming based on what they call discretized streams,
which were kind of tightly linked to the RDD, the resilient distributed dataset. The same thing is happening to streaming now with structured streaming, which is based on the dataframe as well. So once more, you've got the same API and the same evolution of the product. So if you embark on a project, or if you start a new project, you've got to be a bit careful about what you're going to pick there. Even then, the API is not fundamentally different; it's about the different concepts. But that's basically how I see things now. I'm looking forward to the future of Spark, its growing user base, the more and more tools you see around Spark, which are also very interesting, and the variety of the use cases, which are also pretty incredible.
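To make the RDD-versus-dataframe distinction concrete, here is a hedged Java sketch of the same filter expressed both ways; with the dataframe version, Catalyst gets a chance to optimize the plan.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class RddVersusDataFrame {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("RDD API versus dataframe API")
        .master("local[*]")
        .getOrCreate();

    // Spark 1 style: RDDs, where the developer spells out each operation.
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());
    JavaRDD<Integer> rdd = jsc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
    long rddCount = rdd.filter(n -> n % 2 == 0).count();

    // Spark 2 style: the dataframe API, where Catalyst optimizes the plan.
    Dataset<Row> df = spark
        .createDataset(Arrays.asList(1, 2, 3, 4, 5, 6), Encoders.INT())
        .toDF("n");
    long dfCount = df.filter(col("n").mod(2).equalTo(0)).count();

    System.out.println(rddCount + " == " + dfCount); // same answer, different APIs
    spark.stop();
  }
}
```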
[00:46:47] Unknown:
There's a bright future for Spark, definitely. Yeah. And I suppose it's also worth mentioning that this is the second edition of Spark in Action. So I'm assuming that some of the differences people can expect between the first edition and the edition that you're working on now are some of the new capabilities that have been introduced with subsequent versions of the Spark framework.
[00:47:08] Unknown:
As I said, I didn't participate in the first edition of Spark in Action. It was mainly targeted at Scala and Python, whereas my edition is more toward Java. And I think they're a good complement to one another; the concepts are explained a little bit differently. I'm not saying mine is better or theirs is better; I think it's interesting to have both perspectives. But definitely, Spark in Action, first edition, covers more of Spark 1, and I'm exclusively covering Spark 2, of course with links to explain the concepts as they were in Spark 1, to make sure that the reader understands where it's coming from if they've got that kind of background. I don't like to assume that you've got to learn other technologies to get to that point. I don't want you to have to learn Hadoop to learn Spark, and I don't want you to have to learn Scala to learn Spark. If you want to do that, you're probably going to understand a lot more things, and if you want to debug Spark, because it happens, learning Scala is useful, but it might not be what you're going to do anyway. So there's potential as well to grow, to keep learning,
[00:48:25] Unknown:
if you want to go there. Alright. Well, for anybody who wants to follow the work that you're up to or get in touch with you, I'll have you add your preferred contact information to the show notes. And as a final question, I'd like to get your perspective on what you see as being the biggest gap in the tooling or technology that's available for data management today? I still think we're suffering too much in data quality and data governance. It's so difficult.
[00:48:53] Unknown:
It doesn't seem to be easy to demonstrate the importance of data quality and data governance, and that's really something I would like the data management side to make some progress on; data governance more than anything else, really. I think we are getting so poor at tracking those artifacts. It's easy to get some metadata from a relational database, but then your data lineage gets lost somewhere. And now we are dealing with more and more AI models, machine learning models. What is the life cycle of those models? Are you able to track down your hyperparameters?
I think that's really where this next generation of data governance needs to go; it needs to embrace these things, but in a more intrusive yet more
[00:49:45] Unknown:
stealthy way than it's been done so far. I hope that's something someone will tackle. Alright. Well, thank you very much for taking the time to talk today about your experiences with Spark and writing the book. And for the people listening, there will be a discount code in the show notes for 40% off of any of the Manning books, including Spark in Action, second edition. Also, check the show notes for details about a giveaway of a few copies of Jean's book. So thank you again, Jean, for all of your efforts, and I hope you enjoy the rest of your evening. Well, thank you. Thank you, Tobias. And let's talk soon.
[00:50:29] Unknown:
Bye.
Introduction to Jean Georges Perrin and Spark
Journey from Small Data to Big Data
Understanding Spark and Its Use Cases
Roles and Users of Spark
Spark's Compatibility with Other Frameworks
Design Paradigms and Language Support
Optimizing Spark Performance
Handling Different Data Types and Transformations
User Defined Functions in Spark
Testing and Quality Checks in Spark
Deploying Spark Clusters
When Not to Use Spark
Writing the Book: Motivation and Audience
Lessons Learned from Writing the Book
Future of Spark and Data Management