
Thursday, 11 April 2019

CH4/0 Structured API Overview


    DataFrames vs Datasets:

    • Within the Structured APIs, there are two more APIs, the “untyped” DataFrames and the “typed” Datasets.

    • To say that DataFrames are untyped is slightly inaccurate; they have types, but Spark maintains them completely and only checks whether those types line up with those specified in the schema at runtime.

    • Datasets, on the other hand, check whether types conform to the specification at compile time. 

    • Datasets are only available to Java Virtual Machine (JVM)–based languages (Scala and Java), and we specify their types with case classes (Scala) or Java beans (see the sketch after this list).

    • To Spark (in Scala), DataFrames are simply Datasets of type Row. The Row type is Spark’s internal representation of its optimized in-memory format for computation. This format makes for highly specialized and efficient computation: rather than using JVM types, which can cause high garbage-collection and object-instantiation costs, Spark operates on its own internal format without incurring any of those costs.

    • To Spark (in Python or R), there is no such thing as a Dataset: everything is a DataFrame and therefore we always operate on that optimized format.
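A minimal sketch of the DataFrame/Dataset distinction in Scala, meant to be pasted into spark-shell (where the SparkSession is already available as spark); the Person case class and the sample values are made up for illustration:

import spark.implicits._

// A DataFrame is just a Dataset[Row]; column types are only checked against the schema at runtime.
val df: org.apache.spark.sql.DataFrame = Seq(("Ada", 36L), ("Linus", 49L)).toDF("name", "age")

// A Dataset is typed with a case class, so field names and types are checked at compile time.
case class Person(name: String, age: Long)
val ds: org.apache.spark.sql.Dataset[Person] = df.as[Person]

ds.filter(p => p.age > 40).show()   // typed lambda, verified by the compiler
df.filter("age > 40").show()        // untyped expression, verified only at runtime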
Columns and Rows:


  • Columns represent a simple type like an integer or string, a complex type like an array or map, or a null value. Spark tracks all of this type information for you and offers a variety of ways to transform columns. Think of Spark Column types as columns in a table.
     
  • A row is nothing more than a record of data. Each record in a DataFrame must be of type Row. We can create rows from SQL, from Resilient Distributed Datasets (RDDs), from data sources, or manually from scratch. The Scaladoc describes a Row as representing one row of output from a relational operator; to create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.

  • Both Column and Row expose a number of important methods; see their Scaladoc pages and the sketch below.
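A short spark-shell sketch of working with Columns and Rows; the column names and values are made up for illustration:

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, expr}

// Columns are references that get resolved against a DataFrame's schema.
val df = Seq(("Ada", 36L), ("Linus", 49L)).toDF("name", "age")
df.select(col("name"), expr("age + 1").as("age_next_year")).show()

// Rows are plain records; fields are accessed by position and cast explicitly.
val row = Row("Ada", 36L)
println(row.getString(0) + " is " + row.getLong(1))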

Spark Types:

  • Spark has a large number of internal type representations. 
  • Below is a reference table showing how Scala types line up with their corresponding Spark types.
     
Data type      | Value type in Scala        | API to access or create a data type
ByteType       | Byte                       | ByteType
ShortType      | Short                      | ShortType
IntegerType    | Int                        | IntegerType
LongType       | Long                       | LongType
FloatType      | Float                      | FloatType
DoubleType     | Double                     | DoubleType
DecimalType    | java.math.BigDecimal       | DecimalType
StringType     | String                     | StringType
BinaryType     | Array[Byte]                | BinaryType
BooleanType    | Boolean                    | BooleanType
TimestampType  | java.sql.Timestamp         | TimestampType
DateType       | java.sql.Date              | DateType
ArrayType      | scala.collection.Seq       | ArrayType(elementType, [containsNull]). Note: containsNull defaults to true.
MapType        | scala.collection.Map       | MapType(keyType, valueType, [valueContainsNull]). Note: valueContainsNull defaults to true.
StructType     | org.apache.spark.sql.Row   | StructType(fields). Note: fields is an Array of StructFields; fields with the same name are not allowed.
StructField    | The value type in Scala of the data type of this field (for example, Int for a StructField with data type IntegerType) | StructField(name, dataType, [nullable]). Note: nullable defaults to true.

For the Scala API, these data types live in the package org.apache.spark.sql.types.
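As a small sketch of using these types directly, the schema below is built by hand from StructType and StructField; the field names are invented for illustration:

import org.apache.spark.sql.types._

// Each field has a name, a Spark type, and a nullable flag (which defaults to true).
val schema = StructType(Array(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType),
  StructField("scores", ArrayType(DoubleType, containsNull = true)),
  StructField("attributes", MapType(StringType, StringType))
))

schema.printTreeString()   // prints the schema as a tree of fields and types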

These Spark types are defined as classes with companion objects, and in code you normally refer to the companion object; the Scaladoc lists both the ByteType class and its companion object.

The following spark-shell output shows that the bare name ByteType resolves to the companion object:

scala> ByteType
res2: org.apache.spark.sql.types.ByteType.type = ByteType


 
Overview of Structured API execution:

The following walks through the execution of a single structured API query, from user code to executed code.
An overview of the steps (illustrated in the sketch after this list):
  • Write DataFrame/Dataset/SQL Code.
  • If valid code, Spark converts this to a Logical Plan.
  • Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way.
  • Spark then executes this Physical Plan (RDD manipulations) on the cluster.
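A quick way to watch these steps from spark-shell is explain(true), which prints the parsed, analyzed, and optimized logical plans followed by the physical plan; the query itself is an arbitrary example:

val query = spark.range(100).toDF("num").filter("num > 10").groupBy("num").count()

// Prints: Parsed Logical Plan, Analyzed Logical Plan, Optimized Logical Plan, Physical Plan.
query.explain(true)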

Logical Planning:
The first phase of execution takes user code and converts it into a logical plan. This logical plan only represents a set of abstract transformations that do not refer to executors or drivers; it is purely about converting the user’s set of expressions into the most optimized version. Spark does this by first converting user code into an unresolved logical plan. The plan is unresolved because, although your code might be valid, the tables or columns that it refers to might or might not exist. Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer. The analyzer might reject the unresolved logical plan if a required table or column name does not exist in the catalog. If the analyzer can resolve it, the result is passed through the Catalyst Optimizer, a collection of rules that attempt to optimize the logical plan, for example by pushing down predicates or selections.
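A hedged spark-shell sketch of peeking at these stages through the Dataset's queryExecution field; this is a developer-facing internal, so the exact output differs between Spark versions:

val q = spark.range(100).toDF("num").filter("num > 10")

println(q.queryExecution.logical)        // the (possibly unresolved) logical plan from user code
println(q.queryExecution.analyzed)       // after the analyzer resolves names against the catalog
println(q.queryExecution.optimizedPlan)  // after the Catalyst Optimizer rules have run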

Physical Planning:
The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model.
Physical planning results in a series of RDDs and transformations. This result is why you might have heard Spark referred to as a compiler—it takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations for you.
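Continuing the same kind of spark-shell sketch, the physical plan and the RDD lineage it compiles down to can be inspected as well:

val q = spark.range(100).toDF("num").filter("num > 10")

q.explain()                              // prints only the chosen physical plan
println(q.queryExecution.executedPlan)   // the physical plan Spark will actually run
println(q.rdd.toDebugString)             // the RDD lineage backing the query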

Execution:
Upon selecting a physical plan, Spark runs all of this code over RDDs, the lower-level programming interface of Spark. Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution. Finally, the result is returned to the user.
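As a hedged peek at that runtime code generation, Spark ships a debugging helper that prints the Java source generated for a query (which is then compiled to bytecode); the output is verbose and varies by Spark version:

import org.apache.spark.sql.execution.debug._

val q = spark.range(100).toDF("num").filter("num > 10")
q.debugCodegen()   // prints the whole-stage generated code for each stage of the plan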

 
