- Spark has three officially supported cluster managers:
    - Standalone mode
    - Hadoop YARN
    - Apache Mesos
- These cluster managers maintain a set of machines onto which you can deploy Spark applications. They all run Spark applications the same way.
Where to deploy clusters to run Spark Applications:
- There are two high-level options: deploy in an on-premises cluster or in the public cloud.
On-premises Cluster:
- A reasonable option for organizations that already manage their own datacenters.
- Advantage: gives you full control over the hardware used, meaning you can optimize performance for your specific workload.
- Disadvantage: an on-premises cluster is fixed in size, whereas the resource demands of data analytics workloads are often elastic.
- If you make your cluster too small, it will be hard to launch the occasional very large analytics query or training job for a new machine learning model, whereas if you make it too large, you will have resources sitting idle. In public clouds, it is easy to give each application its own cluster of exactly the required size for just the duration of that job.
- Second, for on-premises clusters, you need to select and operate your own storage system, such as a Hadoop file system or a scalable key-value store. This includes setting up georeplication and disaster recovery if required.
- On an on-premises cluster, the best way to combat the resource utilization problem is to use a cluster manager that allows you to run many Spark applications and dynamically reassign resources between them, or even allows non-Spark applications on the same cluster. All of Spark's supported cluster managers allow multiple concurrent applications, but YARN and Mesos have better support for dynamic sharing and additionally support non-Spark workloads.
Spark in the Cloud:
- Advantages:
    - Resources can be launched and shut down elastically, so you can run the occasional "monster" job that takes hundreds of machines for a few hours without having to pay for them all the time. Even for normal operation, you can choose a different type of machine and cluster size for each application to optimize its cost-performance (e.g., launch machines with graphics processing units (GPUs) just for your deep learning jobs).
    - Public clouds include low-cost, georeplicated storage (e.g., S3) that makes it easier to manage large amounts of data.
- All the major cloud providers (Amazon Web Services [AWS], Microsoft Azure, Google Cloud Platform [GCP], and IBM Bluemix) include managed Hadoop clusters for their customers, which provide HDFS for storage as well as Apache Spark. This is actually not a great way to run Spark in the cloud, however, because by using a fixed-size cluster and file system, you are not going to be able to take advantage of elasticity. Instead, it is generally a better idea to use global storage systems that are decoupled from a specific cluster, such as Amazon S3, Azure Blob Storage, or Google Cloud Storage, and spin up machines dynamically for each Spark workload. With decoupled compute and storage, you will be able to pay for computing resources only when needed, scale them up dynamically, and mix different hardware types.
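For instance, a hedged sketch of submitting a job against decoupled cloud storage via the s3a connector. The bucket name, application file, package version, and credential handling are all assumptions; match the hadoop-aws version to your Hadoop build:

    # Ship the S3 connector and pass credentials through Hadoop properties
    # (spark.hadoop.* entries are forwarded to the Hadoop configuration):
    $SPARK_HOME/bin/spark-submit \
      --packages org.apache.hadoop:hadoop-aws:3.3.4 \
      --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
      --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
      my_job.py s3a://my-bucket/input/ s3a://my-bucket/output/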
- Basically, running Spark in the cloud need not mean migrating an on-premises installation to virtual machines: you can run Spark natively against cloud storage to take full advantage of the cloud's elasticity, cost-saving benefits, and management tools without having to manage an on-premises computing stack within your cloud environment.
- Databricks (the company started by the Spark team from UC Berkeley) is one example of a service provider built specifically for Spark in the cloud. Databricks provides a simple way to run Spark workloads without the heavy baggage of a Hadoop installation. The company provides a number of features for running Spark more efficiently in the cloud, such as auto-scaling, auto-termination of clusters, and optimized connectors to cloud storage, as well as a collaborative environment for working on notebooks and standalone jobs.
- If you run Spark in the cloud, you are most likely to create a separate, short-lived Spark cluster for each job you execute. In this case, the standalone cluster manager is the easiest to use.
Cluster Managers:
Spark supports the three aforementioned cluster managers: standalone clusters, Hadoop YARN, and Mesos.
Standalone Mode:
- A lightweight platform built specifically for Apache Spark workloads.
- Disadvantage: more limited than the other cluster managers; it can run only Spark applications.
- The best starting point if you just want to quickly get Spark running on a cluster (and you do not have experience using YARN or Mesos).
- Starting a standalone cluster:
    - First, it requires provisioning the machines: starting them up, ensuring that they can talk to one another over the network, and getting the version of Spark you would like to run onto that set of machines.
    - After provisioning, there are two ways to start the cluster: by hand or using built-in launch scripts.
- By-hand process:
    - First step: start the master process on the machine you want it to run on, using the following command: $SPARK_HOME/sbin/start-master.sh
    - After you run this command, the cluster manager process will start up on that machine. Once started, the master prints out a spark://HOST:PORT URI. You use this when you start each of the worker nodes of the cluster, and you can use it as the master argument to your SparkSession on application initialization. You can also find this URI on the master's web UI, which is http://master-ip-address:8080 by default.
    - Second step: with that URI, start the worker nodes by logging in to each machine and running the following script with the URI you just received from the master node. The master machine must be reachable on the network of the worker nodes, and the master's port must be open as well: $SPARK_HOME/sbin/start-slave.sh <master-spark-URI>
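For illustration, a minimal sketch of the whole by-hand sequence; the host name is an assumption, and 7077 is the default master port:

    # On the master machine:
    $SPARK_HOME/sbin/start-master.sh             # prints spark://master-host:7077
    # On each worker machine, pass the URI the master printed:
    $SPARK_HOME/sbin/start-slave.sh spark://master-host:7077
    # From any machine that can reach the master, use the URI as the master
    # argument, e.g. in an interactive shell:
    $SPARK_HOME/bin/spark-shell --master spark://master-host:7077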
- Cluster launch scripts:
    - Launch scripts can be used to automate the launch of a standalone cluster.
    - First, create a file called conf/slaves in your Spark directory that contains the hostnames of all the machines on which you intend to start Spark workers, one per line. If this file does not exist, everything will launch locally.
    - Note that when you actually start the cluster, the master machine will access each of the worker machines via Secure Shell (SSH). By default, SSH is run in parallel and requires password-less access to be configured (using a private key). If you do not have a password-less setup, you can set the environment variable SPARK_SSH_FOREGROUND and serially provide a password for each worker.
    - Once the conf/slaves file is set up, you can launch or stop your cluster using the following shell scripts, based on Hadoop's deploy scripts and available in $SPARK_HOME/sbin:
- $SPARK_HOME/sbin/start-master.sh: Starts a master instance on the machine on which the script is executed.
- $SPARK_HOME/sbin/start-slaves.sh: Starts a slave instance on each machine specified in the conf/slaves file.
- $SPARK_HOME/sbin/start-slave.sh: Starts a slave instance on the machine on which the script is executed.
- $SPARK_HOME/sbin/start-all.sh: Starts both a master and a number of slaves as described earlier.
- $SPARK_HOME/sbin/stop-master.sh: Stops the master that was started via the sbin/start-master.sh script.
- $SPARK_HOME/sbin/stop-slaves.sh: Stops all slave instances on the machines specified in the conf/slaves file.
- $SPARK_HOME/sbin/stop-all.sh: Stops both the master and the slaves as described earlier.
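A hedged sketch of the launch-script flow; the worker hostnames are placeholders:

    # Write conf/slaves: one worker hostname per line (hypothetical hosts):
    printf '%s\n' worker1.example.com worker2.example.com > $SPARK_HOME/conf/slaves
    # Start the master plus every worker listed above:
    $SPARK_HOME/sbin/start-all.sh
    # Without password-less SSH, launch workers serially and type each password:
    SPARK_SSH_FOREGROUND=1 $SPARK_HOME/sbin/start-slaves.sh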
- After you create the cluster, you can submit applications to it using the spark:// URI of the master, either on the master node itself or from another machine, using spark-submit (see the sketch below).
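For example, a minimal submission against a standalone master; the host and the example jar path are assumptions, so adjust them to your installation:

    # Run the bundled SparkPi example on the standalone cluster:
    $SPARK_HOME/bin/spark-submit \
      --master spark://master-host:7077 \
      --class org.apache.spark.examples.SparkPi \
      $SPARK_HOME/examples/jars/spark-examples_*.jar 100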
YARN Mode:
- Hadoop YARN is a framework for job scheduling and cluster resource management.
- Despite the association, Spark has little to do with Hadoop: Spark does natively support the Hadoop YARN cluster manager, but it requires nothing from Hadoop itself.
- When submitting applications to YARN, the core difference from other deployments is that --master becomes yarn, as opposed to the master node IP, as in standalone mode.
- Instead, Spark will find the YARN configuration files using the environment variable HADOOP_CONF_DIR or YARN_CONF_DIR. Once you have set those environment variables to your Hadoop installation's configuration directory, you can just run spark-submit, as in the sketch below.
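A hedged example; the configuration path and application file are assumptions:

    # Point Spark at YARN's configuration, then submit as usual:
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    $SPARK_HOME/bin/spark-submit --master yarn my_app.py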
- There are two deployment modes you can use to launch Spark on YARN:
    - Cluster mode: the Spark driver runs as a process managed by the YARN cluster, and the client can exit after creating the application. In cluster mode, the driver doesn't necessarily run on the same machine from which you submit, so libraries and external JARs must be distributed manually or through the --jars command-line argument.
    - Client mode: the driver runs in the client process, so YARN is responsible only for granting executor resources to the application, not for maintaining the master node. Both modes are sketched after this list.
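A sketch of both modes; the application and dependency names are hypothetical:

    # Cluster mode: the driver runs inside YARN; ship dependencies with --jars.
    $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp --jars dep1.jar,dep2.jar my-app.jar
    # Client mode: the driver runs in the local client process.
    $SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client my-app.jar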
- To read from and write to HDFS using Spark, you need to include two Hadoop configuration files on Spark's classpath:
    - hdfs-site.xml, which provides default behaviors for the HDFS client; and
    - core-site.xml, which sets the default file system name.
- The location of these configuration files varies across Hadoop versions, but a common location is inside /etc/hadoop/conf. To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files, or export it as an environment variable when you run spark-submit for your application.
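For instance, a one-line sketch assuming the common /etc/hadoop/conf location:

    # Persist the setting for every Spark launch on this machine:
    echo 'export HADOOP_CONF_DIR=/etc/hadoop/conf' >> $SPARK_HOME/conf/spark-env.sh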
Mesos:
- Mesos intends to be a datacenter-scale cluster manager that manages not just short-lived applications like Spark, but also long-running applications like web applications or other resource interfaces.
- Mesos is the heaviest-weight cluster manager. Choose it only if your organization already has a large-scale deployment of Mesos; it makes for a good cluster manager nonetheless.
- Submitting applications to a Mesos cluster is similar to doing so for Spark's other cluster managers. For the most part, you should favor cluster mode when using Mesos; client mode requires some extra configuration on your part, especially with regard to distributing resources around the cluster.
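A hedged submission sketch; the master host is an assumption, and 5050 is the default Mesos master port:

    # Client mode submits straight to the Mesos master:
    $SPARK_HOME/bin/spark-submit --master mesos://mesos-master-host:5050 my_app.py
    # Cluster mode goes through Spark's MesosClusterDispatcher; start it first:
    $SPARK_HOME/sbin/start-mesos-dispatcher.sh --master mesos://mesos-master-host:5050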
Miscellaneous Considerations:
- Number and type of applications: YARN is great for HDFS-based applications but is not commonly used for much else. Additionally, it's not well designed to support the cloud, because it expects information to be available on HDFS. Compute and storage are also largely coupled together, meaning that scaling your cluster involves scaling both storage and compute instead of just one or the other. Mesos improves on this a bit conceptually, and it supports a wide range of application types, but it still requires pre-provisioning machines and, in some sense, requires buy-in at a much larger scale. For instance, it doesn't really make sense to have a Mesos cluster for only running Spark applications. Spark standalone mode is the lightest-weight cluster manager and is relatively simple to understand and take advantage of, but then you're going to be building more application management infrastructure that you could get much more easily by using YARN or Mesos.
- Managing different Spark versions: Your hands are largely tied if you want to run a variety of applications on different Spark versions; unless you use a well-managed service, you're going to need to spend a fair amount of time either managing different setup scripts for different Spark services or removing the ability for your users to use a variety of different Spark applications.
- Logging: You need to consider how you're going to set up logging, store logs for future reference, and allow end users to debug their applications. These come more "out of the box" with YARN or Mesos and might need some tweaking if you're using standalone mode.
- Spark's external shuffle service: Typically, Spark stores shuffle blocks (shuffle output) on a local disk on each node. An external shuffle service stores those shuffle blocks so that they remain available to all executors, meaning you can arbitrarily kill executors and still have their shuffle outputs available to other applications (see the sketch below).
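A minimal configuration sketch; the properties are standard Spark settings, and pairing them with dynamic allocation is a common but assumed choice:

    # In $SPARK_HOME/conf/spark-defaults.conf:
    spark.shuffle.service.enabled    true
    # Often paired with dynamic allocation so executors can be reclaimed safely:
    spark.dynamicAllocation.enabled  true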