Wednesday, 10 April 2019

CH1/0 What is Apache Spark?


  • Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
     
  • Spark supports multiple widely used programming languages (Python, Java, Scala, and R), includes libraries for diverse tasks ranging from SQL to streaming and machine learning, and runs anywhere from a laptop to a cluster of thousands of servers.
     
  • Components and libraries Spark offers to end users:
    • Low-level APIs - RDDs, distributed variables
    • Structured APIs - Datasets, DataFrames, SQL
    • Structured Streaming, Advanced Analytics, Libraries and Ecosystem




Apache Spark Philosophy:
Apache Spark—a unified computing engine and set of libraries for big data

A] Unified:
 
  • Spark’s key driving goal is to offer a unified platform for writing big data applications.
  • Spark is designed to support a wide range of data analytics tasks, ranging from simple data loading and SQL queries to machine learning and streaming computation, over the same computing engine and with a consistent set of APIs
  • Real-world data analytics tasks tend to combine many different processing types and libraries. Spark’s unified nature makes these tasks both easier and more efficient to write.
  • Spark’s APIs are also designed to enable high performance by optimizing across the different libraries and functions composed together in a user program. For example, if you load data using a SQL query and then evaluate a machine learning model over it using Spark’s ML library, the engine can combine these steps into one scan over the data. 
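
For example, here is a minimal PySpark sketch of that idea. The input path, column names, and choice of model are illustrative assumptions, not anything prescribed by Spark or by the text above:

    # Hypothetical input: a CSV file with numeric columns feature1, feature2 and label
    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.appName("unified-example").getOrCreate()

    # Step 1: load the data and express the selection as a SQL query
    spark.read.csv("/tmp/events.csv", header=True, inferSchema=True) \
         .createOrReplaceTempView("events")
    training = spark.sql("SELECT feature1, feature2, label FROM events WHERE label IS NOT NULL")

    # Step 2: feed the same DataFrame into Spark's ML library;
    # the engine can plan both steps together over a single scan of the data
    assembled = VectorAssembler(inputCols=["feature1", "feature2"], outputCol="features").transform(training)
    model = LogisticRegression(featuresCol="features", labelCol="label").fit(assembled)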

B] Computing Engine:

  • At the same time that Spark strives for unification, it carefully limits its scope to a computing engine.
     
  • This means Spark handles loading data from storage systems and performing computation on it, not permanent storage as the end itself. We can use Spark with a wide variety of persistent storage systems, including cloud storage such as Azure Storage and Amazon S3, distributed file systems such as the Hadoop Distributed File System (HDFS), key-value stores such as Apache Cassandra, and message buses such as Apache Kafka. However, Spark neither stores data long term itself, nor does it favor one storage system over another.
     
  • In user-facing APIs, Spark works hard to make these storage systems look largely similar, so that applications do not need to worry about where their data is (a short sketch after this list illustrates this).

  • Spark’s focus on computation makes it different from earlier big data software platforms such as Apache Hadoop. Hadoop included both a storage system (HDFS) and a computing engine (MapReduce), closely integrated together. That coupling makes it difficult to run one of the systems without the other and, more important, makes it a challenge to write applications that access data stored anywhere else. Although Spark runs well on Hadoop storage, today it is also used broadly in environments for which the Hadoop architecture does not make sense, such as the public cloud (Spark is not tied to HDFS).
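
To make concrete how similar these storage systems look from Spark's user-facing API, here is a hedged PySpark sketch. The bucket name, topic name, and broker address are made up, and the S3 and Kafka reads additionally require the corresponding connector packages and credentials to be configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("storage-example").getOrCreate()

    # Local file system
    local_df = spark.read.json("/tmp/people.json")

    # Amazon S3 (assumes the hadoop-aws connector and AWS credentials are set up)
    s3_df = spark.read.json("s3a://my-bucket/people.json")

    # Apache Kafka (assumes the spark-sql-kafka connector is on the classpath)
    kafka_df = (spark.read
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "people")
        .load())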

C] Libraries:
 
  • Spark supports both standard libraries that ship with the engine and a wide array of external libraries published as third-party packages by the open source community.
     
  • Spark’s core engine itself has changed little since it was first released, but the libraries have grown to provide more and more types of functionality. Spark includes libraries for SQL and structured data (Spark SQL), machine learning (MLlib), stream processing (Spark Streaming and the newer Structured Streaming), and graph analytics (GraphX). Beyond these libraries, there are hundreds of open source external libraries, ranging from connectors for various storage systems to machine learning algorithms (an example of pulling in such a package follows).
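
As a hedged example of pulling in an external library, the session below asks Spark to fetch the Kafka connector from Maven when it starts. The coordinate shown matches the Spark 2.4 / Scala 2.11 build; adjust the version to your own Spark build:

    from pyspark.sql import SparkSession

    # spark.jars.packages downloads the listed Maven coordinates and puts them
    # on the classpath when the session (and its JVM) is created
    spark = (SparkSession.builder
        .appName("packages-example")
        .config("spark.jars.packages",
                "org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.1")
        .getOrCreate())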







History of Spark:


  • In 2013, the project had grown to widespread use, with more than 100 contributors from more than 30 organizations outside UC Berkeley. The AMPlab contributed Spark to the Apache Software Foundation.

  • The early AMPlab team also launched a company, Databricks, to harden the project, joining the community of other companies and organizations contributing to Spark. Since that time, the Apache Spark community released Spark 1.0 in 2014 and Spark 2.0 in 2016, and continues to make regular releases, bringing new features into the project.

  • Early versions of Spark (before 1.0) largely defined their API in terms of functional operations—parallel operations such as maps and reduces over collections of Java objects.
    Beginning with 1.0, the project added Spark SQL, a new API for working with structured data—tables with a fixed data format that is not tied to Java’s in-memory representation. Spark SQL enabled powerful new optimizations across libraries and APIs by understanding both the data format and the user code that runs on it in more detail. Over time, the project added a plethora of new APIs that build on this more powerful structured foundation, including DataFrames, machine learning pipelines, and Structured Streaming, a high-level, automatically optimized streaming API. (A small contrast of the two API styles follows this list.)

  • A new high-level streaming engine, Structured Streaming, was introduced in 2016.
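
As a small, hedged contrast of those two API styles (the numbers and column name are made up), the same computation can be written against the low-level functional RDD API or the structured DataFrame API:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("api-styles").getOrCreate()
    sc = spark.sparkContext

    # Functional, RDD-style (the pre-1.0 flavor): explicit maps and reduces over objects
    total_rdd = sc.parallelize([1, 2, 3, 4, 5]).map(lambda x: x * 2).reduce(lambda a, b: a + b)

    # Structured, DataFrame-style: Spark knows the data format and can optimize the plan
    df = spark.range(1, 6).withColumnRenamed("id", "value")
    total_df = df.selectExpr("sum(value * 2) AS total").collect()[0]["total"]

Both expressions compute the same result; the structured form gives the engine enough information about the data to optimize the whole plan.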

Running Spark

  • We can use Spark from Python, Java, Scala, R, or SQL.
    Spark itself is written in Scala and runs on the Java Virtual Machine (JVM), so to run Spark, either on your laptop or on a cluster, all you need is an installation of Java.
    If you want to use the Python API, you will also need a Python interpreter (version 2.7 or later).
    If you want to use R, you will need a version of R on your machine.

  • There are two options we recommend for getting started with Spark:
    • downloading and installing Apache Spark on your laptop, or
    • running a web-based version in Databricks Community Edition, a free cloud environment for learning Spark

  • If you want to download and run Spark locally, the first step is to make sure that you have Java installed on your machine (available as java), as well as a Python version if you would like to use Python.
    Next, visit the project’s official download page, select the package type of “Pre-built for Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR file, or tarball, that you will then need to extract (example commands follow this list).

  • Spark can run locally without any distributed storage system, such as Apache Hadoop. However, if you would like to connect the Spark version on your laptop to a Hadoop cluster, make sure you download the right Spark version for that Hadoop version.
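
A hedged example of those steps on a Unix-like machine, in the same shell style as the commands used later in this post (the archive name depends on the exact Spark and Hadoop versions chosen on the download page):

    java -version          # confirm Java is installed and on the PATH
    python --version       # only needed if you plan to use the Python API
    tar -xzf spark-2.4.1-bin-hadoop2.7.tgz
    cd spark-2.4.1-bin-hadoop2.7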






Launching Spark's interactive consoles:

  • We can start an interactive shell in Spark for several different programming languages. 

  • Python :
We need Python 2 or 3 installed in order to launch the Python console. From Spark’s home directory, run the following code: ./bin/pyspark
After we have done that, type “spark” and press Enter. We will see the SparkSession object printed. (A short sanity check is shown after this list.)

  • Scala:
To launch the Scala console we run the command:  ./bin/spark-shell
After we have done that, type “spark” and press Enter. We will see the SparkSession object printed.

  • SQL Console:
To launch the SQL console, we run the command:  ./bin/spark-sql
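
As a quick sanity check once the Python console is up (a hedged example; the exact object address and output formatting vary by Spark version):

    >>> spark
    <pyspark.sql.session.SparkSession object at 0x7f...>
    >>> spark.range(5).count()
    5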
