Thursday, 18 April 2019

CH17.1 Deploying Spark


    • Spark has three officially supported cluster managers:
      • Standalone mode
      • Hadoop YARN
      • Apache Mesos
    These cluster managers maintain a set of machines onto which you can deploy Spark applications. They all run Spark applications the same way.
Where to deploy your cluster to run Spark applications:

Two high-level options: deploy on an on-premises cluster or in the public cloud.

On-Premises Cluster:

  • Reasonable option for organizations that already manage their own datacenters.
  • Advantage: Gives you full control over the hardware used, meaning you can optimize performance for your specific workload. 
  • Disadvantage: An on-premises cluster is fixed in size, whereas the resource demands of data analytics workloads are often elastic.
    • If you make your cluster too small, it will be hard to launch the occasional very large analytics query or training job for a new machine learning model.
    • If you make it too large, you will have resources sitting idle. In public clouds, by contrast, it is easy to give each application its own cluster of exactly the required size for just the duration of that job.
    • Second, for on-premises clusters, you need to select and operate your own storage system, such as a Hadoop file system or scalable key-value store. This includes setting up georeplication and disaster recovery if required.

  • On an on-premises cluster, the best way to combat the resource utilization problem is to use a cluster manager that allows you to run many Spark applications and dynamically reassign resources between them, or even allows non-Spark applications on the same cluster. All of Spark’s supported cluster managers allow multiple concurrent applications, but YARN and Mesos have better support for dynamic sharing and additionally support non-Spark workloads.
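
A minimal sketch of that kind of dynamic sharing, assuming these standard Spark properties are set in conf/spark-defaults.conf (the executor counts are purely illustrative):

spark.dynamicAllocation.enabled        true
# dynamic allocation relies on the external shuffle service (see Miscellaneous Considerations below)
spark.shuffle.service.enabled          true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   50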


Spark in the Cloud:

  • Advantages:
    • Resources can be launched and shut down elastically, so you can run that occasional “monster” job that takes hundreds of machines for a few hours without having to pay for them all the time. Even for normal operation, you can choose a different type of machine and cluster size for each application to optimize its cost performance (e.g., launch machines with graphics processing units [GPUs] just for your deep learning jobs).
    • Public clouds include low-cost, georeplicated storage (e.g., Amazon S3) that makes it easier to manage large amounts of data.

  • All the major cloud providers (Amazon Web Services [AWS], Microsoft Azure, Google Cloud Platform [GCP], and IBM Bluemix) include managed Hadoop clusters for their customers, which provide HDFS for storage as well as Apache Spark. This is actually not a great way to run Spark in the cloud, however, because by using a fixed-size cluster and file system, you are not going to be able to take advantage of elasticity. Instead, it is generally a better idea to use global storage systems that are decoupled from a specific cluster, such as Amazon S3, Azure Blob Storage, or Google Cloud Storage and spin up machines dynamically for each Spark workload. With decoupled compute and storage, you will be able to pay for computing resources only when needed, scale them up dynamically, and mix different hardware types.
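
For illustration only (the bucket, path, and column name below are hypothetical, and the s3a connector and credentials must already be configured), reading decoupled cloud storage from Spark looks the same as reading any other file system:

// hypothetical bucket, path, and column; assumes the hadoop-aws (s3a) connector and credentials are set up
val events = spark.read.parquet("s3a://my-bucket/events/2019/")
events.groupBy("eventType").count().show()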

  • Basically, running Spark in the cloud need not mean migrating an on-premises installation to virtual machines: you can run Spark natively against cloud storage to take full advantage of the cloud’s elasticity, cost-saving benefits, and management tools without having to manage an on-premises computing stack within your cloud environment.

  • Databricks (The company started by the Spark team from UC Berkeley) is one example of a service provider built specifically for Spark in the cloud. Databricks provides a simple way to run Spark workloads without the heavy baggage of a Hadoop installation. The company provides a number of features for running Spark more efficiently in the cloud, such as auto-scaling, auto-termination of clusters, and optimized connectors to cloud storage, as well as a collaborative environment for working on notebooks and standalone jobs.

  • If you run Spark in the cloud, you are most likely to create a separate, short-lived Spark cluster for each job you execute. In this case, the standalone cluster manager is the easiest to use.


Cluster Managers:
 
Spark supports the three aforementioned cluster managers: standalone clusters, Hadoop YARN, and Mesos.

Standalone Mode:
  • A lightweight platform built specifically for Apache Spark workloads.
  • Disadvantage: More limited than the other cluster managers; it can only run Spark applications.
  • Best starting point if you just want to quickly get Spark running on a cluster (and you do not have experience using YARN or Mesos).
  • Starting Standalone cluster:
    • First, it requires provisioning the machines: starting them up, ensuring that they can talk to one another over the network, and getting the version of Spark you would like to run onto those machines.
    • After provisioning, there are two ways to start the cluster: by hand or using built-in launch scripts.
      • By Hand process:
        • 1st step: Start the master process on the machine you want it to run on, using the following command: $SPARK_HOME/sbin/start-master.sh
        • After you run this command, the cluster manager process will start up on that machine. Once started, the master prints out a spark://HOST:PORT URI. You use this when you start each of the worker nodes of the cluster, and you can use it as the master argument to your SparkSession on application initialization (see the sketch after these steps). You can also find this URI on the master’s web UI, which is http://master-ip-address:8080 by default.
        • 2nd step: With that URI, start the worker nodes by logging in to each machine and running the following script, using the URI you just received from the master node. The master machine must be available on the network of the worker nodes you are using, and the port must be open on the master node as well:
$SPARK_HOME/sbin/start-slave.sh <master-spark-URI>
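
As a small sketch of the SparkSession usage mentioned above (the host name is a placeholder; substitute the spark://HOST:PORT URI printed by your master, where 7077 is the default port):

import org.apache.spark.sql.SparkSession

// "master-host" is a placeholder for the host printed by start-master.sh
val spark = SparkSession.builder()
  .master("spark://master-host:7077")
  .appName("standalone-example")
  .getOrCreate()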

  • Cluster launch Scripts:
    • Launch scripts can be used to automate the launch of a standalone cluster.
    • First, create a file called conf/slaves in your Spark directory that contains the hostnames of all the machines on which you intend to start Spark workers, one per line (a hypothetical example appears after the script list below). If this file does not exist, everything will launch locally.
    • Note that when you actually start the cluster, the master machine will access each of the worker machines via Secure Shell (SSH). By default, SSH is run in parallel and requires that you configure password-less (using a private key) access. If you do not have a password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
    • Once the conf/slaves file is set up, you can launch or stop your cluster by using the following shell scripts, which are based on Hadoop’s deploy scripts and are available in $SPARK_HOME/sbin:

  • $SPARK_HOME/sbin/start-master.sh: Starts a master instance on the machine on which the script is executed.
  • $SPARK_HOME/sbin/start-slaves.sh: Starts a slave instance on each machine specified in the conf/slaves file.
  • $SPARK_HOME/sbin/start-slave.sh: Starts a slave instance on the machine on which the script is executed.
  • $SPARK_HOME/sbin/start-all.sh: Starts both a master and a number of slaves as described earlier.
  • $SPARK_HOME/sbin/stop-master.sh: Stops the master that was started via the sbin/start-master.sh script.
  • $SPARK_HOME/sbin/stop-slaves.sh: Stops all slave instances on the machines specified in the conf/slaves file.
  • $SPARK_HOME/sbin/stop-all.sh: Stops both the master and the slaves as described earlier.
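
As a hypothetical example, conf/slaves is just a list of worker hostnames (the names below are made up), after which start-all.sh brings up the master and all workers:

worker1.example.com
worker2.example.com
worker3.example.com

$SPARK_HOME/sbin/start-all.sh
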
  • After you create the cluster, you can submit applications to it using the spark:// URI of the master. You can do this either on the master node itself or from another machine, using spark-submit.
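
A minimal submission sketch (the application class, JAR name, and master host are hypothetical):

$SPARK_HOME/bin/spark-submit \
  --master spark://master-host:7077 \
  --class com.example.MyApp \
  my-application.jar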

YARN Mode:
  • Hadoop YARN is a framework for job scheduling and cluster resource management
  • Spark has little to do with Hadoop: Spark natively supports the Hadoop YARN cluster manager, but it requires nothing from Hadoop itself.
  • When submitting applications to YARN, the core difference from other deployments is that --master becomes yarn, as opposed to the master node IP, as it is in standalone mode.
  • Instead, Spark will find the YARN configuration files using the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR. Once you have set those environment variables to your Hadoop installation’s configuration directory, you can just run spark-submit.
  • Two deployment modes that you can use to launch Spark on YARN:
    • Cluster mode runs the Spark driver as a process managed by the YARN cluster, and the client can exit after creating the application. In cluster mode, the driver does not necessarily run on the machine from which you submit, so libraries and external JARs must be distributed manually or through the --jars command-line argument (a sample submission appears after this list).
    • In client mode, the driver runs in the client process, so YARN is responsible only for granting executor resources to the application, not for maintaining the master node.
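
As a sketch of a cluster-mode submission (the configuration path, extra dependency, application class, and JAR name are all hypothetical):

export HADOOP_CONF_DIR=/etc/hadoop/conf
$SPARK_HOME/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --jars extra-dependency.jar \
  --class com.example.MyApp \
  my-application.jar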

  • To read and write from HDFS using Spark, you need to include two Hadoop configuration files on Spark’s classpath
    • hdfs-site.xml, which provides default behaviors for the HDFS client; and 
    • core-site.xml, which sets the default file system name.
The location of these configuration files varies across Hadoop versions, but a common location is inside of /etc/hadoop/conf. To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files, or set it as an environment variable when you run spark-submit for your application.
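
For illustration (the path is hypothetical), once those configuration files are on Spark’s classpath, HDFS paths can be read like any other data source:

// assumes hdfs-site.xml and core-site.xml are visible via HADOOP_CONF_DIR
val logs = spark.read.text("hdfs:///user/me/logs/")
println(logs.count())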

Mesos:
  • Mesos intends to be a datacenter-scale cluster manager that manages not just short-lived applications like Spark, but long-running applications like web applications or other resource interfaces.
  • Mesos is the heaviest-weight cluster manager.
  • Choose this cluster manager only if your organization already has a large-scale deployment of Mesos; it makes for a good cluster manager nonetheless.
  • Submitting applications to a Mesos cluster is similar to doing so for Spark’s other cluster managers. For the most part you should favor cluster mode when using Mesos. Client mode requires some extra configuration on your part, especially with regard to distributing resources around the cluster.
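
As a sketch only (the host name, application class, and JAR location are hypothetical; this assumes a MesosClusterDispatcher is running for cluster mode and that the application JAR is reachable from the cluster, for example on HDFS):

$SPARK_HOME/bin/spark-submit \
  --master mesos://dispatcher-host:7077 \
  --deploy-mode cluster \
  --class com.example.MyApp \
  hdfs:///apps/my-application.jar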



Miscellaneous Considerations:


  • Number and type of applications: YARN is great for HDFS-based applications but is not commonly used for much else. Additionally, it’s not well designed to support the cloud, because it expects information to be available on HDFS. Also, compute and storage are largely coupled together, meaning that scaling your cluster involves scaling both storage and compute instead of just one or the other. Mesos does improve on this a bit conceptually, and it supports a wide range of application types, but it still requires pre-provisioning machines and, in some sense, requires buy-in at a much larger scale. For instance, it doesn’t really make sense to have a Mesos cluster for only running Spark applications. Spark standalone mode is the lightest-weight cluster manager and is relatively simple to understand and take advantage of, but then you’re going to be building more application management infrastructure that you could get much more easily by using YARN or Mesos.

  • Managing different Spark versions: Your hands are largely tied if you want to try to run a variety of different applications running different Spark versions, and unless you use a well-managed service, you’re going to need to spend a fair amount of time either managing different setup scripts for different Spark services or removing the ability for your users to use a variety of different Spark applications.

  • Logging: You need to consider how you’re going to set up logging, store logs for future reference, and allow end users to debug their applications. These are more “out of the box” for YARN or Mesos and might need some tweaking if you’re using standalone.

  •  Spark’s external shuffle service: Typically Spark stores shuffle blocks (shuffle output) on a local disk on that particular node. An external shuffle service allows for storing those shuffle blocks so that they are available to all executors, meaning that you can arbitrarily kill executors and still have their shuffle outputs available to other applications.
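
An illustrative configuration sketch using standard Spark properties (spark.shuffle.service.port defaults to 7337 and is shown here only for clarity), typically paired with dynamic allocation:

spark.shuffle.service.enabled   true
spark.shuffle.service.port      7337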