- We know how the structured APIs take a logical operation, break it up into a logical plan, and convert that to a physical plan that actually consists of Resilient Distributed Dataset (RDD) operations executing across the cluster of machines. Here we learn what happens when Spark goes about executing that code.
Architecture of a Spark Application:
- The Spark driver:
- The driver process is the controller of the execution of a Spark Application and maintains all of the state of the Spark cluster (the state and tasks of the executors).
- It must interface with the cluster manager in order to actually get physical resources and launch executors.
- Simply put, this is just a process on a physical machine that is responsible for maintaining the state of the application running on the cluster.
- The Spark executors:
- Spark executors are the processes that perform the tasks assigned by the Spark driver.
- Executors have one core responsibility: take the tasks assigned by the driver, run them, and report back their state (success or failure) and results.
- Each Spark Application has its own separate executor processes.
- The cluster manager:
- The cluster manager is responsible for maintaining a cluster of machines that will run your Spark Application(s). A cluster manager has its own “driver” (sometimes called master) and “worker” abstractions.
- V.Imp: The core difference is that these are tied to physical machines rather than processes.
- When it comes time to actually run a Spark Application, we request resources from the cluster manager to run it. Depending on how our application is configured, this can include a place to run the Spark driver or might be just resources for the executors of our Spark Application.
- Spark currently supports three cluster managers: a simple built-in standalone cluster manager, Apache Mesos, and Hadoop YARN.
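As a sketch, the cluster manager is chosen with the `--master` flag of `spark-submit`; the host names, ports, and application file below are placeholders, not values from this post:

```shell
# Built-in standalone cluster manager (placeholder host and port)
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos (placeholder master URL)
spark-submit --master mesos://mesos-host:5050 my_app.py

# Hadoop YARN (cluster location is read from the Hadoop config, e.g. HADOOP_CONF_DIR)
spark-submit --master yarn my_app.py
```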
Execution Modes:
- An execution mode gives you the power to determine where the resources are physically located when you go to run your application. There are three options: cluster mode, client mode, and local mode.
- Cluster mode:
- In cluster mode, when a user submits a pre-compiled JAR, Python script, or R script to a cluster manager, the cluster manager launches the driver process on a worker node inside the cluster, in addition to the executor processes.
- This means that the cluster manager is responsible for maintaining all Spark Application–related processes.
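A minimal sketch of a cluster-mode submission, assuming YARN as the cluster manager; the main class and JAR names are placeholders:

```shell
# --deploy-mode cluster: the driver is launched on a worker node
# inside the cluster, alongside the executors.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my_app.jar
```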
- Client mode:
- Client mode is nearly the same as cluster mode except that the Spark driver remains on the client machine that submitted the application. This means that the client machine is responsible for maintaining the Spark driver process, and the cluster manager maintains the executor processes.
- In this mode we are running the Spark Application from a machine that is not colocated on the cluster. These machines are commonly referred to as gateway machines or edge nodes.
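The equivalent client-mode submission, run from a gateway or edge node, would look roughly like this (script name is a placeholder; client is also spark-submit's default deploy mode):

```shell
# --deploy-mode client: the driver runs on this machine,
# while the executors run inside the cluster.
spark-submit \
  --master yarn \
  --deploy-mode client \
  my_app.py
```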
- Local mode:
- This mode runs the entire Spark Application on a single machine. It achieves parallelism through threads on that single machine.
- Using local mode for running production applications is not recommended.
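Local mode is selected with a `local[n]` master URL, where `n` is the number of threads; the script name below is a placeholder:

```shell
# Run the whole application in a single process,
# with one thread per logical core on this machine.
spark-submit --master "local[*]" my_app.py

# Or fix the number of threads explicitly (here, 4).
spark-submit --master "local[4]" my_app.py
```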