DataFrames vs Datasets:
- Within the Structured APIs, there are two more APIs: the “untyped” DataFrames and the “typed” Datasets.
- To say that DataFrames are untyped is slightly inaccurate; they have types, but Spark maintains them completely and only checks whether those types line up with those specified in the schema at runtime.
- Datasets, on the other hand, check whether types conform to the specification at compile time.
- Datasets are only available to Java Virtual Machine (JVM)-based languages (Scala and Java), and we specify types with case classes or Java beans.
- To Spark (in Scala), DataFrames are simply Datasets of type Row (a short sketch follows below). The Row type is Spark’s internal representation of its optimized in-memory format for computation. This format makes for highly specialized and efficient computation because, rather than using JVM types, which can cause high garbage-collection and object-instantiation costs, Spark can operate on its own internal format without incurring any of those costs.
- To Spark (in Python or R), there is no such thing as a Dataset: everything is a DataFrame, and therefore we always operate on that optimized format.
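
To make the typed/untyped distinction concrete, here is a minimal Scala sketch; the case class Flight and its sample values are hypothetical:

import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical case class that gives the Dataset a compile-time type.
case class Flight(origin: String, destination: String, count: Long)

val spark = SparkSession.builder().appName("typed-vs-untyped").getOrCreate()
import spark.implicits._

// A typed Dataset: the compiler knows each record is a Flight,
// so an expression like flights.map(_.count) is checked at compile time.
val flights: Dataset[Flight] = Seq(
  Flight("SFO", "JFK", 100L),
  Flight("SEA", "ORD", 25L)
).toDS()

// A DataFrame is simply a Dataset[Row]: the schema is known to Spark,
// but a column reference such as col("count") is only checked at runtime.
val flightsDF: DataFrame = flights.toDF()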
Columns and Rows:
- Columns represent a simple type like an integer or string, a complex type like an array or map, or a null value. Spark tracks all of this type information for you and offers a variety of ways with which you can transform columns. Think of Spark Column types as columns in a table.
- A row is nothing more than a record of data. Each record in a DataFrame must be of type Row. We can create these rows from SQL, from Resilient Distributed Datasets (RDDs), from data sources, or manually from scratch. A Row represents one row of output from a relational operator. To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala (see the sketch after the method listing below).
- Important methods available on these types include (overloads listed once):

Column | alias, and, apply, as, asc_nulls_first, asc_nulls_last, asc, between, bitwiseAND, bitwiseOR, bitwiseXOR, cast, contains, desc_nulls_first, desc_nulls_last, desc, divide, endsWith, eqNullSafe, equals, equalTo, explain, expr, geq, getField, getItem, gt, hashCode, isin, isNaN, isNotNull, isNull, leq, like, lt, minus, mod, multiply, name, notEqual, or, otherwise, over, plus, rlike, startsWith, substr, toString, unapply, when

Row | anyNull, apply, copy, equals, fieldIndex, get, getAnyValAs, getAs, getBoolean, getByte, getDate, getDecimal, getDouble, getFloat, getInt, getJavaMap, getList, getLong, getMap, getSeq, getShort, getString, getStruct, getTimestamp, getValuesMap, hashCode, isNullAt, length, mkString, schema, size, toSeq, toString
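
As a small illustration of the Row and Column methods above, here is a minimal Scala sketch; the column names and values are hypothetical:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col

// Create a Row manually with Row.apply (in Java, RowFactory.create serves the same purpose).
val myRow = Row("Hello", null, 1, false)

// Access values positionally or with typed getters.
myRow(0)            // Any = Hello
myRow.getString(0)  // String = Hello
myRow.getInt(2)     // Int = 1
myRow.isNullAt(1)   // Boolean = true

// Column expressions only describe a transformation; they are resolved
// against a DataFrame's schema at runtime.
val castedCol  = col("count").cast("long").alias("count_long")
val filterCond = col("count").between(10, 100).and(col("origin").startsWith("S"))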
Spark Types:
- Spark has a large number of internal type representations.
- Below is a reference table of how Scala types line up with the types in Spark.
Data type | Value type in Scala | API to access or create a data type
ByteType | Byte | ByteType
ShortType | Short | ShortType
IntegerType | Int | IntegerType
LongType | Long | LongType
FloatType | Float | FloatType
DoubleType | Double | DoubleType
DecimalType | java.math.BigDecimal | DecimalType
StringType | String | StringType
BinaryType | Array[Byte] | BinaryType
BooleanType | Boolean | BooleanType
TimestampType | java.sql.Timestamp | TimestampType
DateType | java.sql.Date | DateType
ArrayType | scala.collection.Seq | ArrayType(elementType, [containsNull]). Note: The default value of containsNull is true.
MapType | scala.collection.Map | MapType(keyType, valueType, [valueContainsNull]). Note: The default value of valueContainsNull is true.
StructType | org.apache.spark.sql.Row | StructType(fields). Note: fields is an Array of StructFields. Also, fields with the same name are not allowed.
StructField | The value type in Scala of the data type of this field (for example, Int for a StructField with the data type IntegerType) | StructField(name, dataType, [nullable]). Note: The default value of nullable is true.
For the Scala API, these data types are under the package org.apache.spark.sql.types.
These Spark types are classes, and each also has a companion object.
The following REPL output shows that ByteType resolves to its companion object:
scala> ByteType
res2: org.apache.spark.sql.types.ByteType.type = ByteType
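
As a short example of putting these types to work, here is a minimal Scala sketch that builds a schema from StructType and StructField and applies it to a few rows; the field names and sample data are hypothetical:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("spark-types-demo").getOrCreate()

// Schema built explicitly from Spark types; nullable defaults to true
// when not specified.
val schema = StructType(Array(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = false)
))

// A few hypothetical rows that conform to the schema.
val rows = Seq(Row("Alice", 34), Row("Bob", 28))

// createDataFrame pairs an RDD of Rows with the explicit schema.
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
df.printSchema()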
Overview of Structured API execution:
The following walks us through the execution of a single structured API query from user code to executed code.
An overview of the steps:
- Write DataFrame/Dataset/SQL code.
- If the code is valid, Spark converts it to a Logical Plan.
- Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way.
- Spark then executes this Physical Plan (RDD manipulations) on the cluster.
Logical Planning:
The first phase of execution is meant to take user code and convert it into a logical plan. This logical plan only represents a set of abstract transformations that do not refer to executors or drivers; its purpose is purely to convert the user’s set of expressions into the most optimized version. It does this by first converting user code into an unresolved logical plan. This plan is unresolved because, although your code might be valid, the tables or columns that it refers to might or might not exist. Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer. The analyzer might reject the unresolved logical plan if the required table or column name does not exist in the catalog. If the analyzer can resolve it, the result is passed through the Catalyst Optimizer, a collection of rules that attempt to optimize the logical plan by pushing down predicates or selections.
Physical Planning:
The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model. Physical planning results in a series of RDDs and transformations. This result is why you might have heard Spark referred to as a compiler: it takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations for you.
Execution:
Upon selecting a physical plan, Spark runs all of this code over RDDs, the lower-level programming interface of Spark. Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution. Finally, the result is returned to the user.
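
One way to see the logical and physical plans that this process produces is the explain method on a DataFrame. The following is a minimal Scala sketch; the range-based query is hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("explain-demo").getOrCreate()

// Build a trivial query and ask Spark to show its plans.
val df = spark.range(1000).toDF("id").where("id > 500").selectExpr("sum(id)")

// explain(true) prints the parsed, analyzed, and optimized logical plans,
// followed by the physical plan that was selected.
df.explain(true)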