- We know how the structured APIs take a logical operation, break it up into a logical plan, and convert that to a physical plan that actually consists of Resilient Distributed Dataset (RDD) operations executing across the cluster of machines. Here we learn what happens when Spark goes about executing that code.
Architecture of a Spark Application:
- The Spark driver:
- The driver process is the controller of the execution of a Spark Application and maintains all of the state of the Spark cluster (the state and tasks of the executors).
- It must interface with the cluster manager in order to actually get physical resources and launch executors.
- Simply put, this is just a process on a physical machine that is responsible for maintaining the state of the application running on the cluster.
- The Spark executors:
- Spark executors are the processes that perform the tasks assigned by the Spark driver.
- Executors have one core responsibility: take the tasks assigned by the driver, run them, and report back their state (success or failure) and results.
- Each Spark Application has its own separate executor processes.
- The cluster manager:
- The cluster manager is responsible for maintaining a cluster of machines that will run your Spark Application(s). A cluster manager has its own “driver” (sometimes called master) and “worker” abstractions.
- V.Imp: The core difference is that these are tied to physical machines rather than processes.
- When it comes time to actually run a Spark Application, we request resources from the cluster manager to run it. Depending on how our application is configured, this can include a place to run the Spark driver or might be just resources for the executors of our Spark Application.
- Spark currently supports three cluster managers: a simple built-in standalone cluster manager, Apache Mesos, and Hadoop YARN.
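As a sketch, the cluster manager is chosen with the `--master` flag of `spark-submit`; the host names, ports, and application file below are placeholders, not values from this post:

```shell
# Built-in standalone cluster manager (placeholder host and port)
spark-submit --master spark://master-host:7077 my_app.py

# Apache Mesos (placeholder master URL)
spark-submit --master mesos://mesos-host:5050 my_app.py

# Hadoop YARN (cluster location is read from the Hadoop config, e.g. HADOOP_CONF_DIR)
spark-submit --master yarn my_app.py
```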
Execution Modes:
- An execution mode gives you the power to determine where the resources are physically located when you go to run your application. There are three options: cluster mode, client mode, and local mode.
- Cluster mode:
- In cluster mode, when a user submits a pre-compiled JAR, Python script, or R script to a cluster manager, the cluster manager launches the driver process on a worker node inside the cluster, in addition to the executor processes.
- This means that the cluster manager is responsible for maintaining all Spark Application–related processes.
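A minimal sketch of a cluster-mode submission, assuming YARN as the cluster manager; the main class and JAR names are placeholders:

```shell
# --deploy-mode cluster: the driver is launched on a worker node
# inside the cluster, alongside the executors.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  my_app.jar
```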
- Client mode:
- Client mode is nearly the same as cluster mode except that the Spark driver remains on the client machine that submitted the application. This means that the client machine is responsible for maintaining the Spark driver process, and the cluster manager maintains the executor processes.
- In this mode we are running the Spark Application from a machine that is not colocated on the cluster. These machines are commonly referred to as gateway machines or edge nodes.
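The equivalent client-mode submission, run from a gateway or edge node, would look roughly like this (script name is a placeholder; client is also spark-submit's default deploy mode):

```shell
# --deploy-mode client: the driver runs on this machine,
# while the executors run inside the cluster.
spark-submit \
  --master yarn \
  --deploy-mode client \
  my_app.py
```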
- Local mode:
- This mode runs the entire Spark Application on a single machine. It achieves parallelism through threads on that single machine.
- Using local mode for running production applications is not recommended.
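Local mode is selected with a `local[n]` master URL, where `n` is the number of threads; the script name below is a placeholder:

```shell
# Run the whole application in a single process,
# with one thread per logical core on this machine.
spark-submit --master "local[*]" my_app.py

# Or fix the number of threads explicitly (here, 4).
spark-submit --master "local[4]" my_app.py
```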