- Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
- Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers.
- Components and libraries Spark offers to end users:
  - Low-level APIs - RDDs, distributed variables
  - Structured APIs - Datasets, DataFrames, SQL
  - Structured Streaming, Advanced Analytics, libraries and ecosystem
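- To make these layers concrete, here is a minimal sketch in Scala (runnable in spark-shell; the app name and local master are illustrative assumptions) contrasting the low-level RDD API with the structured DataFrame API:

    import org.apache.spark.sql.SparkSession

    // Build a local session for illustration; inside spark-shell a `spark` session already exists.
    val spark = SparkSession.builder()
      .appName("api-layers-sketch")
      .master("local[*]")
      .getOrCreate()

    // Low-level API: an RDD of plain Scala objects, manipulated with functional operations.
    val rdd = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))
    println(rdd.map(_ * 2).reduce(_ + _))

    // Structured API: a DataFrame with a schema the engine can inspect and optimize.
    val df = spark.range(4).toDF("number")
    df.selectExpr("sum(number) AS total").show()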
Apache Spark Philosophy:
Apache Spark: a unified computing engine and a set of libraries for big data.
A] Unified:
- Spark’s key driving goal is to offer a unified platform for writing big data applications.
- Spark is designed to support a wide range of data analytics tasks, ranging from simple data loading and SQL queries to machine learning and streaming computation, over the same computing engine and with a consistent set of APIs.
- Real-world data analytics tasks tend to combine many different processing types and libraries. Spark’s unified nature makes these tasks both easier and more efficient to write.
- Spark’s APIs are also designed to enable high performance by optimizing across the different libraries and functions composed together in a user program. For example, if you load data using a SQL query and then evaluate a machine learning model over it using Spark’s ML library, the engine can combine these steps into one scan over the data.
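- As a hedged illustration of that point, the sketch below (Scala, assuming the spark session provided by spark-shell; the events table and its columns x1, x2, and label are hypothetical) loads data with a SQL query and then fits an MLlib model over the result, so both steps are planned by the same engine:

    import org.apache.spark.ml.feature.VectorAssembler
    import org.apache.spark.ml.regression.LinearRegression

    // Hypothetical table, assumed to be registered in the session's catalog.
    val events = spark.sql("SELECT x1, x2, label FROM events WHERE label IS NOT NULL")

    // Assemble the feature columns into the single vector column MLlib expects.
    val assembled = new VectorAssembler()
      .setInputCols(Array("x1", "x2"))
      .setOutputCol("features")
      .transform(events)

    // Fit a simple regression model; the SQL scan and the training run on one engine.
    val model = new LinearRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .fit(assembled)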
B] Computing Engine:
- At the same time that Spark strives for unification, it carefully limits its scope to a computing engine.
- This means that Spark handles loading data from storage systems and performing computation on it, not permanent storage as the end itself. We can use Spark with a wide variety of persistent storage systems, including cloud storage systems such as Azure Storage and Amazon S3, distributed file systems such as the Hadoop Distributed File System (HDFS), key-value stores such as Apache Cassandra, and message buses such as Apache Kafka. However, Spark neither stores data long term itself nor favors one storage system over another.
- In its user-facing APIs, Spark works hard to make these storage systems look largely similar, so that applications do not need to worry about where their data lives (see the sketch at the end of this section).
- Spark's focus on computation makes it different from earlier big data software platforms such as Apache Hadoop. Hadoop includes both a storage system (HDFS) and a computing engine (MapReduce), which are closely integrated. This choice makes it difficult to run one of the systems without the other and, more importantly, makes it a challenge to write applications that access data stored anywhere else. Although Spark runs well on top of Hadoop storage, today it is also used broadly in environments for which the Hadoop architecture does not make sense, such as the public cloud (Spark is not tied to HDFS).
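- As a small sketch of that uniformity (all paths, bucket names, and broker addresses below are made-up placeholders), reading from very different storage systems looks largely the same; only the format and connection options change:

    // Local file system (or HDFS, by changing the path scheme).
    val localCsv = spark.read.option("header", "true").csv("/tmp/events.csv")

    // Cloud object storage such as Amazon S3 (bucket name is a placeholder).
    val s3Parquet = spark.read.parquet("s3a://example-bucket/events/")

    // A message bus such as Apache Kafka, read as a stream
    // (requires the spark-sql-kafka connector package; broker address is a placeholder).
    val kafkaStream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .load()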
C] Libraries:
- Spark supports both standard libraries that ship with the engine and a wide array of external libraries published as third-party packages by the open source community.
- Spark's core engine itself has changed little since it was first released, but the libraries have grown to provide more and more types of functionality. Spark includes libraries for SQL and structured data (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming and the newer Structured Streaming), and graph analytics (GraphX). Beyond these libraries, there are hundreds of open source external libraries, ranging from connectors for various storage systems to machine learning algorithms.
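- As one small example of a standard library in action, the sketch below uses Structured Streaming's built-in rate source and console sink (both ship with Spark), so it runs without any external system:

    // Built-in "rate" source: generates rows of (timestamp, value) at a fixed rate.
    val stream = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "5")
      .load()

    // A simple streaming aggregation: count rows per bucket of `value`.
    val counts = stream.selectExpr("value % 10 AS bucket").groupBy("bucket").count()

    // Built-in "console" sink prints each result update to stdout.
    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()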
History of Spark:
- Apache Spark began at UC Berkeley in 2009 as the Spark research project, which was first published the following year in a paper entitled “Spark: Cluster Computing with Working Sets” by Matei Zaharia and others from the UC Berkeley AMPlab.
- By 2013, the project had grown to widespread use, with more than 100 contributors from more than 30 organizations outside UC Berkeley, and the AMPlab contributed Spark to the Apache Software Foundation.
- The early AMPlab team also launched a company, Databricks, to harden the project, joining the community of other companies and organizations contributing to Spark. Since that time, the Apache Spark community released Spark 1.0 in 2014 and Spark 2.0 in 2016, and continues to make regular releases, bringing new features into the project.
- Early versions of Spark (before 1.0) largely defined their APIs in terms of functional operations: parallel operations such as maps and reduces over collections of Java objects.
- Beginning with 1.0, the project added Spark SQL, a new API for working with structured data (tables with a fixed data format that is not tied to Java’s in-memory representation). Spark SQL enabled powerful new optimizations across libraries and APIs by understanding both the data format and the user code that runs on it in more detail. Over time, the project added a plethora of new APIs that build on this more powerful structured foundation, including DataFrames, machine learning pipelines, and Structured Streaming, a high-level, automatically optimized streaming API.
- A new high-level streaming engine, Structured Streaming, was introduced in 2016.
Running Spark:
- We can use Spark from Python, Java, Scala, R, or SQL.
- Spark itself is written in Scala and runs on the Java Virtual Machine (JVM), so to run Spark either on your laptop or on a cluster, all you need is an installation of Java.
- If you want to use the Python API, you will also need a Python interpreter (version 2.7 or later).
- If you want to use R, you will need a version of R on your machine.
- There are two options we recommend for getting started with Spark:
- downloading and installing Apache Spark on your laptop, or
- running a web-based version in Databricks Community Edition, a free cloud environment for learning Spark
- If you want to download and run Spark locally, the first step is to make sure that you have Java installed on your machine (available as java), as well as a Python version if you would like to use Python.
- Next, visit the project’s official download page, select the package type “Pre-built for Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR file, or tarball, that you will then need to extract.
- Spark can run locally without any distributed storage system such as HDFS. However, if you would like to connect the Spark version on your laptop to a Hadoop cluster, make sure you download the right Spark version for that Hadoop version.
Launching Spark's interactive consoles:
- We can start an interactive shell in Spark for several different programming languages.
- Python:
  We need Python 2 or 3 installed in order to launch the Python console. From Spark’s home directory, run the following command: ./bin/pyspark
  After we have done that, type “spark” and press Enter. We will see the SparkSession object printed.
- Scala:
  To launch the Scala console, run the command: ./bin/spark-shell
  After we have done that, type “spark” and press Enter. We will see the SparkSession object printed.
- SQL Console:
  To launch the Spark SQL console, run the command: ./bin/spark-sql
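- Once any of these consoles is running, a quick sanity check confirms the session works. A minimal sketch for the Scala shell (./bin/spark-shell), using only built-in operations:

    // `spark` is the SparkSession the shell creates for you.
    spark.version                      // the running Spark version

    // Create a small DataFrame of numbers and run a simple aggregation on it.
    val myRange = spark.range(1000).toDF("number")
    myRange.selectExpr("count(number) AS cnt", "sum(number) AS total").show()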