Quick Recap:
Finding documentation for DataFrame methods:
https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.Dataset
DataFrames are nothing but Datasets of Row type, so the documentation for DataFrame is found under Dataset itself.
Schemas:
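A schema in Spark is a StructType made up of StructFields. A minimal sketch of building one by hand (the column names here are assumptions for illustration, matching the retail example used later):

import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType, TimestampType}

// Hypothetical schema; column names are assumptions for illustration
val mySchema = StructType(Array(
  StructField("CustomerId", StringType, nullable = true),
  StructField("UnitPrice", DoubleType, nullable = true),
  StructField("InvoiceDate", TimestampType, nullable = true)
))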
Thursday, 11 April 2019
CH5/1 Recap of DF, Schemas
CH4/0 Structured API Overview
DataFrames vs Datasets:
Columns and Rows:
A Row represents one row of output from a relational operator. To create a new Row, use RowFactory.create() in Java or Row.apply() in Scala.
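A quick sketch of creating and reading a Row in Scala (the values are placeholders):

import org.apache.spark.sql.Row

// Row.apply() is what Row(...) calls under the hood; fields are positional
val myRow = Row("Hello", null, 1, false)
myRow.getString(0)  // "Hello"
myRow.getInt(2)     // 1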
Spark Types:
For the Scala API, these datatypes are under the package org.apache.spark.sql.types.
Documentation link: https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.types.package
These Spark types are classes and they have companion objects. Check the documentation for the ByteType class and its companion object.
The following shows that ByteType resolves to its companion object:

scala> ByteType
res2: org.apache.spark.sql.types.ByteType.type = ByteType
Overview of Structured API execution:
The following walks us through the execution of a single structured API query, from user code to executed code. An overview of the steps:
Logical Planning:
The first phase of execution is meant to take user code and convert it into a logical plan.
This logical plan only represents a set of abstract transformations that do not refer to executors or drivers; it's purely to convert the user's set of expressions into the most optimized version. It does this by converting user code into an unresolved logical plan. This plan is unresolved because, although your code might be valid, the tables or columns that it refers to might or might not exist. Spark uses the catalog, a repository of all table and DataFrame information, to resolve columns and tables in the analyzer. The analyzer might reject the unresolved logical plan if the required table or column name does not exist in the catalog. If the analyzer can resolve it, the result is passed through the Catalyst Optimizer, a collection of rules that attempt to optimize the logical plan by pushing down predicates or selections.
Physical Planning:
The physical plan, often called a Spark plan, specifies how the logical plan will execute on the cluster by generating different physical execution strategies and comparing them through a cost model.
Physical planning results in a series of RDDs and transformations. This is why you might have heard Spark referred to as a compiler: it takes queries in DataFrames, Datasets, and SQL and compiles them into RDD transformations for you.
Execution:
Upon selecting a physical plan, Spark runs all of this code over RDDs, the lower-level programming interface of Spark. Spark performs further optimizations at runtime, generating native Java bytecode that can remove entire tasks or stages during execution. Finally, the result is returned to the user.
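To see these plans yourself, you can call explain on a DataFrame. A small sketch (the DataFrame here is just an assumed toy example):

// explain(true) prints the parsed, analyzed and optimized logical plans,
// followed by the physical plan Spark selected for execution
val df = spark.range(500).toDF("number")
df.where("number % 2 = 0").explain(true)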
CH3/2 Structured Streaming, RDD intro
Structured Streaming:
The following example shows how we first analyze the data as a static dataset, creating a DataFrame to do so. Note that we save the schema in a variable, staticSchema (whose type is StructType).
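A sketch of that static read, assuming a CSV dataset under data/retail-data/by-day/ (the path and options are assumptions):

val staticDataFrame = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("data/retail-data/by-day/*.csv")  // path is an assumption

staticDataFrame.createOrReplaceTempView("retail_data")  // lets us query it with SQL later
val staticSchema = staticDataFrame.schema               // a StructType, reused for the stream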
The following window function will include all data from each day in the aggregation. It's simply a window over the time-series column in our data. This is a helpful tool for manipulating dates and timestamps because we can specify our requirements in a more human form (via intervals), and Spark will group all of them together for us.
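A sketch of that windowed aggregation (the column names CustomerId, UnitPrice, Quantity and InvoiceDate are assumptions about the dataset):

import org.apache.spark.sql.functions.{window, col}

staticDataFrame
  .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
  .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))  // one-day tumbling window
  .sum("total_cost")
  .show(5)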
We could have run this as SQL as well; a rough equivalent is sketched below.
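Using the temp view registered above (same assumed column names; window() is also available as a SQL function):

spark.sql("""
  SELECT CustomerId, window(InvoiceDate, '1 day') AS day,
         SUM(UnitPrice * Quantity) AS total_cost
  FROM retail_data
  GROUP BY CustomerId, window(InvoiceDate, '1 day')
""").show(5)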
Now that we've seen how that works, let's take a look at the streaming code! Very little actually changes about the code.
The biggest change is that we use readStream instead of read. Also notice the maxFilesPerTrigger option, which simply specifies the number of files we should read in at once. Notice too that the schema we created during our static data analysis can be applied directly when reading the stream of data.
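A sketch of the streaming read, reusing staticSchema from the static analysis (the path and the maxFilesPerTrigger value of 1 are assumptions):

val streamingDataFrame = spark.readStream
  .schema(staticSchema)                   // reuse the schema from the static read
  .option("maxFilesPerTrigger", 1)        // how many files to read per trigger
  .format("csv")
  .option("header", "true")
  .load("data/retail-data/by-day/*.csv")  // path is an assumption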
Note that streamingDataFrame is returned.
Now we can see whether our DataFrame is streaming:
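isStreaming is the property to check; for example:

streamingDataFrame.isStreaming  // returns true for a streaming DataFrame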
Next, we set up the same business logic as in the previous DataFrame manipulations.
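A sketch mirroring the static aggregation above (same assumed column names; it reuses the window/col imports from earlier):

val purchaseByCustomerPerHour = streamingDataFrame
  .selectExpr("CustomerId", "(UnitPrice * Quantity) as total_cost", "InvoiceDate")
  .groupBy(col("CustomerId"), window(col("InvoiceDate"), "1 day"))
  .sum("total_cost")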
Remember that streaming is still lazy, so we will need to call a streaming action to start the execution of this data flow. Streaming actions are a bit different from our conventional static actions because we're going to be populating data somewhere instead of just calling something like count (which doesn't make any sense on a stream anyway). The action we will use outputs to an in-memory table that we will update after each trigger. In this case, each trigger is based on an individual file (the read option that we set). Spark will mutate the data in the in-memory table so that we will always have the highest value as specified in our previous aggregation.
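A sketch of that streaming action, writing to an in-memory table (the query name customer_purchases is an arbitrary choice):

purchaseByCustomerPerHour.writeStream
  .format("memory")                 // sink: an in-memory table
  .queryName("customer_purchases")  // the name of the in-memory table
  .outputMode("complete")           // keep the full, updated aggregation in the table
  .start()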
When we start the stream, we can run queries against it to debug what our result will look like if we were to write this out to a production sink:
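For example, once the stream has started (the table name matches the queryName above; sum(total_cost) is the column name Spark generates for the aggregation):

spark.sql("""
  SELECT * FROM customer_purchases
  ORDER BY `sum(total_cost)` DESC
""").show(5)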
Another option you can use is to write the results out to the console:
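A similar sketch using the console sink (again, the query name is arbitrary):

purchaseByCustomerPerHour.writeStream
  .format("console")
  .queryName("customer_purchases_console")
  .outputMode("complete")
  .start()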
You shouldn't use either of these streaming methods in production, but they do make for a convenient demonstration of Structured Streaming's power.
Notice how this window is built on event time as well, not the time at which Spark processes the data. This was one of the shortcomings of Spark Streaming that Structured Streaming has resolved.
Low Level API:
import spark.implicits._  // toDF() on an RDD needs the session implicits (pre-imported in spark-shell)
spark.sparkContext.parallelize(Seq(1, 2, 3)).toDF()
CH3/1 Spark-Submit, DataSets intro
Running Production Applications (spark-submit):
The following shows the simplest example, running the application on the local machine:
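A sketch using the SparkPi example that ships with Spark (the jar name and paths depend on your installation and Spark version):

./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local \
  ./examples/jars/spark-examples_2.11-2.2.0.jar 10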
Python Version:
By changing the master argument of spark-submit, we can also submit the same application to a cluster running Spark's standalone cluster manager, Mesos, or YARN.
Datasets: Type-Safe Structured APIs
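A minimal sketch of the type-safe Dataset API, assuming a flight-summary Parquet file (the case class, path, and filter are assumptions for illustration):

import spark.implicits._  // provides the encoders needed by .as[...]

case class Flight(DEST_COUNTRY_NAME: String, ORIGIN_COUNTRY_NAME: String, count: BigInt)

val flightsDF = spark.read.parquet("data/flight-data/parquet/2010-summary.parquet")
val flights = flightsDF.as[Flight]                              // DataFrame -> Dataset[Flight]
flights.filter(f => f.ORIGIN_COUNTRY_NAME != "Canada").take(3)  // typed lambda, checked at compile time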