Records and Rows:
As shown below, commands that return individual rows to the driver always return one or more Row objects when we are working with DataFrames.
Note that the Row companion object has an apply method, which allows us to instantiate a Row using Row(...).
Remember that Row is a trait and also has a companion object.
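A minimal sketch of both points, assuming an existing DataFrame df (the names and values are just placeholders):

import org.apache.spark.sql.Row

// take(n) brings individual records back to the driver as an Array[Row]
// (df stands for any existing DataFrame here).
val rows: Array[Row] = df.take(2)

// Row's companion object defines apply(), so Row(...) builds a row by hand.
val myRow = Row("Hello", null, 1, false)

// Fields come back as Any; the typed getters let us read them by position.
myRow.getString(0)  // "Hello"
myRow.getInt(2)     // 1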
Thursday, 11 April 2019
CH5/3 Records and Row type
CH5/2 Columns and Expressions
Columns and Expressions:
Columns:
The two simplest ways to refer to a column are the col and column functions. To use either of these functions, you pass in a column name.
A col method is also available on Dataset (and therefore on DataFrame). It returns the same Column type, but it refers to a column of that specific DataFrame, which is what distinguishes it from the standalone col and column functions.
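A quick sketch of these three ways of referring to a column, assuming a DataFrame df with a column named someCol:

import org.apache.spark.sql.functions.{col, column}

// Both functions construct an unresolved Column from a name;
// the column is not checked against any particular DataFrame at this point.
col("someCol")
column("someCol")

// The col method on a Dataset/DataFrame also returns a Column,
// but one that refers explicitly to a column of that DataFrame.
df.col("someCol")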
Expressions:
expr("someCol - 5") is the same transformation as performing col("someCol") - 5, or even expr("someCol") - 5.
Think of expr as letting us write SQL expressions as strings; it parses the string and returns a Column type.
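A small sketch of this equivalence, with the same df and someCol assumptions as above:

import org.apache.spark.sql.functions.{col, expr}

// All three produce the same Column (and the same logical plan when used in a select):
expr("someCol - 5")
col("someCol") - 5
expr("someCol") - 5

df.select(expr("someCol - 5")).show()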
Column Type:
Accessing DataFrame Columns:
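As a quick sketch, assuming an existing DataFrame df, the columns property gives us the column names:

// columns returns the DataFrame's column names as an Array[String]
val names: Array[String] = df.columns
names.foreach(println)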
CH5/1 Recap of DF, Schemas
Quick Recap:
Finding Documentation for DataFrame methods:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset
DataFrames are nothing but Datasets of the Row type, so the documentation for DataFrame methods is found under Dataset itself.
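Since DataFrame is just a type alias in the Scala API, the equivalence can be seen directly in code. A minimal sketch (printSchemaOf is only an illustrative helper name):

import org.apache.spark.sql.{DataFrame, Dataset, Row}

// In the Scala API, DataFrame is defined as a type alias:
//   type DataFrame = Dataset[Row]
// so anything documented on Dataset applies to DataFrames as well.
def printSchemaOf(df: DataFrame): Unit = {
  val ds: Dataset[Row] = df  // compiles as-is: the two types are identical
  println(ds.schema.treeString)
}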
Schemas:
CH4/0 Structured API Overview
DataFrames vs Datasets:
Columns and Rows:
A Row represents one row of output from a relational operator. To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
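A minimal Scala sketch of the same idea, using placeholder values:

import org.apache.spark.sql.Row

// Row.apply builds a Row from the supplied values; Row(...) is sugar for it.
val row = Row("United States", "Romania", 15)

// Values are read back by position, either with typed getters or getAs.
val dest  = row.getString(0)
val count = row.getAs[Int](2)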
Spark Types:
For the Scala API, these data types are under the package org.apache.spark.sql.types.
Documentation link: https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.types.package
These Spark types are classes and they have their own companion objects.
Check the Scaladoc for both the ByteType class and its companion object.
The following shows that the ByteType we reference directly is the companion object:
scala> ByteType
res2: org.apache.spark.sql.types.ByteType.type = ByteType
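These type objects mostly show up when we declare a schema by hand. A minimal sketch using a few of the types (the field names are just placeholders):

import org.apache.spark.sql.types._

// Spark type objects are used to describe each field of a manual schema.
val schema = StructType(Seq(
  StructField("id",   LongType,   nullable = false),
  StructField("flag", ByteType,   nullable = true),
  StructField("name", StringType, nullable = true)
))

println(schema.treeString)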
Overview of Structured API execution:
The following walks us through the execution of a single structured API query, from user code to executed code. The steps, in overview, are logical planning, physical planning, and execution.
Logical Planning:
The first phase of execution takes user code and converts it into a logical plan.
This logical plan only represents a set of abstract transformations that do not refer to executors or drivers; it is purely about converting the user's set of expressions into the most optimized version. Spark does this by first converting user code into an unresolved logical plan. The plan is unresolved because, although your code might be valid, the tables or columns that it refers to might or might not exist. Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer. The analyzer might reject the unresolved logical plan if the required table or column name does not exist in the catalog. If the analyzer can resolve it, the result is passed through the Catalyst Optimizer, a collection of rules that attempt to optimize the logical plan, for example by pushing down predicates or selections.
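We can peek at these logical plans ourselves. A minimal sketch, assuming an existing DataFrame df with a column someCol:

// explain(true) prints the parsed (unresolved) logical plan, the analyzed plan,
// the optimized logical plan, and the chosen physical plan for a query.
df.groupBy("someCol").count().explain(true)

// The individual logical plans are also reachable through queryExecution:
val qe = df.groupBy("someCol").count().queryExecution
qe.logical        // unresolved logical plan, straight from the user code
qe.analyzed       // resolved against the catalog by the analyzer
qe.optimizedPlan  // after the Catalyst Optimizer has run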
Physical Planning:
The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model.
Physical planning results in a series of RDDs and transformations. This result is why you might have heard Spark referred to as a compiler: it takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations for you.
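A small sketch of how to look at the physical plan, with the same df and someCol assumptions as above:

// explain() with no argument prints only the selected physical (Spark) plan.
df.groupBy("someCol").count().explain()

// queryExecution also exposes the physical plans directly:
val qe = df.groupBy("someCol").count().queryExecution
qe.sparkPlan     // physical plan produced by the planner
qe.executedPlan  // physical plan after preparation, ready to run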
Execution:
Upon selecting a physical plan, Spark runs all of this code over RDDs, the lower-level programming interface of Spark. Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution. Finally, the result is returned to the user.
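To see the RDD side of this, queryExecution also exposes the internal RDD that the selected physical plan runs over (a sketch, same df assumption as above):

// toRdd exposes the underlying (internal) RDD that Spark actually executes.
val rdd = df.groupBy("someCol").count().queryExecution.toRdd
rdd.toDebugString  // shows the lineage of RDD transformations behind the query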