
Thursday, 11 April 2019

CH4/0 Structured API Overview


    DataFrames vs Datasets:

    • Within the Structured APIs, there are two more APIs, the “untyped” DataFrames and the “typed” Datasets.

    • To say that DataFrames are untyped is slightly inaccurate; they have types, but Spark maintains them completely and only checks whether those types line up with those specified in the schema at runtime.

    • Datasets, on the other hand, check whether types conform to the specification at compile time. 

    • Datasets are only available to Java Virtual Machine (JVM)–based languages (Scala and Java), and we specify their types with case classes (Scala) or Java beans (see the sketch after this list).

    • To Spark (in Scala), DataFrames are simply Datasets of type Row. The Row type is Spark’s internal representation of its optimized in-memory format for computation. This format makes for highly specialized and efficient computation: rather than using JVM types, which can cause high garbage-collection and object-instantiation costs, Spark operates on its own internal format without incurring any of those costs.

    • To Spark (in Python or R), there is no such thing as a Dataset: everything is a DataFrame and therefore we always operate on that optimized format.
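A minimal sketch of the DataFrame/Dataset distinction in Scala, meant to be pasted into spark-shell (where the SparkSession is already available as spark); the Person case class and the sample values are made up for illustration:

import spark.implicits._

// A DataFrame is just a Dataset[Row]; column types are only checked against the schema at runtime.
val df: org.apache.spark.sql.DataFrame = Seq(("Ada", 36L), ("Linus", 49L)).toDF("name", "age")

// A Dataset is typed with a case class, so field names and types are checked at compile time.
case class Person(name: String, age: Long)
val ds: org.apache.spark.sql.Dataset[Person] = df.as[Person]

ds.filter(p => p.age > 40).show()   // typed lambda, verified by the compiler
df.filter("age > 40").show()        // untyped expression, verified only at runtime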
Columns and Rows:


  • Columns represent a simple type like an integer or string, a complex type like an array or map, or a null value. Spark tracks all of this type information for you and offers a variety of ways to transform columns. Think of Spark Column types as columns in a table.
     
  • A row is nothing more than a record of data. Each record in a DataFrame must be of type Row. We can create rows from SQL, from Resilient Distributed Datasets (RDDs), from data sources, or manually from scratch. The Scaladoc describes a Row as representing one row of output from a relational operator; to create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.

  • Both Column and Row expose a number of important methods; see their Scaladoc pages and the sketch below.
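A short spark-shell sketch of working with Columns and Rows; the column names and values are made up for illustration:

import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, expr}

// Columns are references that get resolved against a DataFrame's schema.
val df = Seq(("Ada", 36L), ("Linus", 49L)).toDF("name", "age")
df.select(col("name"), expr("age + 1").as("age_next_year")).show()

// Rows are plain records; fields are accessed by position and cast explicitly.
val row = Row("Ada", 36L)
println(row.getString(0) + " is " + row.getLong(1))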

Spark Types:

  • Spark has a large number of internal type representations. 
  • Below is a reference table showing how Scala types line up with their corresponding Spark types.
     
Data type      | Value type in Scala        | API to access or create a data type
ByteType       | Byte                       | ByteType
ShortType      | Short                      | ShortType
IntegerType    | Int                        | IntegerType
LongType       | Long                       | LongType
FloatType      | Float                      | FloatType
DoubleType     | Double                     | DoubleType
DecimalType    | java.math.BigDecimal       | DecimalType
StringType     | String                     | StringType
BinaryType     | Array[Byte]                | BinaryType
BooleanType    | Boolean                    | BooleanType
TimestampType  | java.sql.Timestamp         | TimestampType
DateType       | java.sql.Date              | DateType
ArrayType      | scala.collection.Seq       | ArrayType(elementType, [containsNull]). Note: containsNull defaults to true.
MapType        | scala.collection.Map       | MapType(keyType, valueType, [valueContainsNull]). Note: valueContainsNull defaults to true.
StructType     | org.apache.spark.sql.Row   | StructType(fields). Note: fields is an Array of StructFields; fields with the same name are not allowed.
StructField    | The value type in Scala of the data type of this field (for example, Int for a StructField with data type IntegerType) | StructField(name, dataType, [nullable]). Note: nullable defaults to true.

For the Scala API, these data types live in the package org.apache.spark.sql.types.
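As a small sketch of using these types directly, the schema below is built by hand from StructType and StructField; the field names are invented for illustration:

import org.apache.spark.sql.types._

// Each field has a name, a Spark type, and a nullable flag (which defaults to true).
val schema = StructType(Array(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType),
  StructField("scores", ArrayType(DoubleType, containsNull = true)),
  StructField("attributes", MapType(StringType, StringType))
))

schema.printTreeString()   // prints the schema as a tree of fields and types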

These Spark types are defined as classes with companion objects, and in code you normally refer to the companion object; the Scaladoc lists both the ByteType class and its companion object.

The following spark-shell output shows that the bare name ByteType resolves to the companion object:

scala> ByteType
res2: org.apache.spark.sql.types.ByteType.type = ByteType


 
Overview of Structured API execution:

The following walks through the execution of a single structured API query, from user code to executed code.
An overview of the steps (illustrated in the sketch after this list):
  • Write DataFrame/Dataset/SQL Code.
  • If valid code, Spark converts this to a Logical Plan.
  • Spark transforms this Logical Plan to a Physical Plan, checking for optimizations along the way.
  • Spark then executes this Physical Plan (RDD manipulations) on the cluster.
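A quick way to watch these steps from spark-shell is explain(true), which prints the parsed, analyzed, and optimized logical plans followed by the physical plan; the query itself is an arbitrary example:

val query = spark.range(100).toDF("num").filter("num > 10").groupBy("num").count()

// Prints: Parsed Logical Plan, Analyzed Logical Plan, Optimized Logical Plan, Physical Plan.
query.explain(true)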

Logical Planning:
The first phase of execution takes user code and converts it into a logical plan. This logical plan only represents a set of abstract transformations that do not refer to executors or drivers; it is purely about converting the user’s set of expressions into the most optimized version. Spark does this by first converting user code into an unresolved logical plan. The plan is unresolved because, although your code might be valid, the tables or columns that it refers to might or might not exist. Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer. The analyzer might reject the unresolved logical plan if a required table or column name does not exist in the catalog. If the analyzer can resolve it, the result is passed through the Catalyst Optimizer, a collection of rules that attempt to optimize the logical plan, for example by pushing down predicates or selections.
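A hedged spark-shell sketch of peeking at these stages through the Dataset's queryExecution field; this is a developer-facing internal, so the exact output differs between Spark versions:

val q = spark.range(100).toDF("num").filter("num > 10")

println(q.queryExecution.logical)        // the (possibly unresolved) logical plan from user code
println(q.queryExecution.analyzed)       // after the analyzer resolves names against the catalog
println(q.queryExecution.optimizedPlan)  // after the Catalyst Optimizer rules have run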

Physical Planning:
The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model.
Physical planning results in a series of RDDs and transformations. This result is why you might have heard Spark referred to as a compiler—it takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations for you.
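Continuing the same kind of spark-shell sketch, the physical plan and the RDD lineage it compiles down to can be inspected as well:

val q = spark.range(100).toDF("num").filter("num > 10")

q.explain()                              // prints only the chosen physical plan
println(q.queryExecution.executedPlan)   // the physical plan Spark will actually run
println(q.rdd.toDebugString)             // the RDD lineage backing the query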

Execution:
Upon selecting a physical plan, Spark runs all of this code over RDDs, the lower-level programming interface of Spark. Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution. Finally, the result is returned to the user.
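As a hedged peek at that runtime code generation, Spark ships a debugging helper that prints the Java source generated for a query (which is then compiled to bytecode); the output is verbose and varies by Spark version:

import org.apache.spark.sql.execution.debug._

val q = spark.range(100).toDF("num").filter("num > 10")
q.debugCodegen()   // prints the whole-stage generated code for each stage of the plan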

 
