We know how the structured APIs take a logical operation, break it up into a logical plan, and convert that into a physical plan that consists of Resilient Distributed Dataset (RDD) operations executing across the cluster of machines. Here we look at what happens when Spark actually executes that code.
Architecture of a Spark Application: 
 
The Spark Driver: 
The driver process is the controller of the execution of a Spark Application and maintains all of the state of the Spark cluster (the state and tasks of the executors). It must interface with the cluster manager in order to get physical resources and launch executors. Simply put, the driver is just a process on a physical machine that is responsible for maintaining the state of the application running on the cluster. 
 
The Spark Executors: 
Spark executors are the processes that perform the tasks assigned by the Spark driver. Executors have one core responsibility: take the tasks assigned by the driver, run them, and report back their state (success or failure) and results. Each Spark Application has its own separate executor processes. 
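
The driver/executor split described above can be illustrated with a toy model. This is not Spark code: it is a minimal sketch using only the Python standard library, where a thread pool stands in for the executors, and the names `TaskReport`, `run_task`, and `drive` are made up for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class TaskReport:
    """What a toy executor sends back to the driver for one task."""
    task_id: int
    succeeded: bool
    result: Any = None
    error: str = ""


def run_task(task_id: int, func: Callable[[], Any]) -> TaskReport:
    """Executor side: run one assigned task, report its state and result."""
    try:
        return TaskReport(task_id, True, result=func())
    except Exception as exc:
        # A failure is reported back to the driver instead of being raised.
        return TaskReport(task_id, False, error=repr(exc))


def drive(tasks: List[Callable[[], Any]], num_executors: int = 2) -> List[TaskReport]:
    """Driver side: hand out tasks, then collect every report in task order."""
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        futures = [pool.submit(run_task, i, t) for i, t in enumerate(tasks)]
        return [f.result() for f in futures]


# Three toy tasks; the second one fails on purpose.
reports = drive([lambda: sum(range(10)), lambda: 1 / 0, lambda: "done"])
for r in reports:
    print(r.task_id, "success" if r.succeeded else "failure", r.result)
```

The point of the sketch is the division of labour: the driver side only tracks which tasks exist and what state they ended in, while the executor side does nothing but run a task and report back.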
 
The Cluster Manager: 
The cluster manager is responsible for maintaining a cluster of machines that will run your Spark Application(s). A cluster manager has its own “driver” (sometimes called master) and “worker” abstractions.  
Important: the core difference is that these are tied to physical machines rather than to processes (as they are in Spark).  
When it comes time to actually run a Spark Application, we request resources from the cluster manager to run it. Depending on how our application is configured, this can include a place to run the Spark driver, or it might be just resources for the executors of our Spark Application. Spark supports several cluster managers: a simple built-in standalone cluster manager, Apache Mesos, and Hadoop YARN (newer Spark releases also support Kubernetes).  
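
In practice, the choice of cluster manager shows up as the master URL passed to `spark-submit --master`. The helper below is a small sketch of the URL forms for the three managers named above; the function name `master_url` and the host names in the usage lines are hypothetical, and the default ports (7077 for standalone, 5050 for Mesos) are the commonly documented defaults, which a given cluster may override.

```python
def master_url(manager: str, host: str = "", port: int = 0) -> str:
    """Return the value to pass to spark-submit's --master option."""
    if manager == "yarn":
        # YARN's location is read from the Hadoop configuration,
        # so the master value is just the literal string "yarn".
        return "yarn"
    if manager == "standalone":
        return f"spark://{host}:{port or 7077}"
    if manager == "mesos":
        return f"mesos://{host}:{port or 5050}"
    raise ValueError(f"unknown cluster manager: {manager!r}")


# Hypothetical host names, for illustration only.
print(master_url("standalone", "master.example.com"))
print(master_url("mesos", "mesos.example.com"))
print(master_url("yarn"))
```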
Execution Modes: 
 
An execution mode determines where the application's resources are physically located when you run it. There are three options: cluster mode, client mode, and local mode. 
 
Cluster Mode: 
In cluster mode, a user submits a pre-compiled JAR, Python script, or R script to a cluster manager. The cluster manager then launches the driver process on a worker node inside the cluster, in addition to the executor processes. This means that the cluster manager is responsible for maintaining all processes related to the Spark Application.
 
Client Mode: 
Client mode is nearly the same as cluster mode, except that the Spark driver remains on the client machine that submitted the application. This means that the client machine is responsible for maintaining the Spark driver process, and the cluster manager maintains the executor processes. In this mode we are running the Spark Application from a machine that is not colocated on the cluster. These machines are commonly referred to as gateway machines or edge nodes.
 
Local Mode: 
This mode runs the entire Spark Application on a single machine. It achieves parallelism through threads on that single machine. Local mode is not recommended for running production applications.
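
As a rough illustration of how the three modes are selected, the sketch below builds `spark-submit` command lines using the `--master` and `--deploy-mode` options. The helper `spark_submit_args` and the script name `my_app.py` are hypothetical, YARN is assumed as the cluster manager for the client and cluster cases, and the commands are only printed, not executed.

```python
from typing import List


def spark_submit_args(mode: str, app: str, master: str = "yarn") -> List[str]:
    """Build a spark-submit command line for one execution mode."""
    if mode == "local":
        # local[*] runs everything in one JVM, using one worker thread
        # per logical core on the machine.
        return ["spark-submit", "--master", "local[*]", app]
    if mode in ("client", "cluster"):
        # client: the driver stays on the submitting machine.
        # cluster: the cluster manager launches the driver on a worker node.
        return ["spark-submit", "--master", master, "--deploy-mode", mode, app]
    raise ValueError(f"unknown execution mode: {mode!r}")


for mode in ("cluster", "client", "local"):
    print(" ".join(spark_submit_args(mode, "my_app.py")))
```

Only the `--deploy-mode` value differs between the client and cluster commands, which mirrors the description above: the two modes differ only in where the driver process lives.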
 