The method count is actually an action as opposed to a transformation, so it triggers computation and returns a result right away rather than building up a lazy plan. This method is a bit of an outlier because it exists as a method on the DataFrame (in this case) rather than as a function, and it is eagerly evaluated instead of being a lazy transformation. There is also a lazy count function in org.apache.spark.sql.functions for use inside aggregations.
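The eager-versus-lazy distinction can be illustrated with plain Scala collections (a sketch of the idea only, not Spark API): transformations on a lazy view compute nothing until something forces them, just as df.count() triggers work while lazy transformations do not.

```scala
// Plain-Scala analogy for lazy transformations vs. an eager action (not Spark API).
var evaluated = 0

// Like a transformation: building the lazy view computes nothing yet.
val doubled = (1 to 5).view.map { n => evaluated += 1; n * 2 }
assert(evaluated == 0) // no elements evaluated so far

// Like an action (e.g. df.count()): forcing the view triggers evaluation.
val total = doubled.sum
```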
Aggregation Functions:
We can do one of two things: count a specific column, or count all the rows by using count(*) or count(1), which counts every row as the literal value one. When performing a count(*), Spark will count null values (including rows containing all nulls). However, when counting an individual column, Spark will not count the null values.
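The null-handling difference can be sketched in plain Scala, modeling a nullable column as Option values (an illustration of the semantics only, not Spark API):

```scala
// None stands in for a null value in a hypothetical "color" column.
val color: Seq[Option[String]] = Seq(Some("red"), None, Some("blue"), None)

// count(*) / count(1): every row counts, nulls included.
val countStar = color.size

// count(color): null values in that column are skipped.
val countColumn = color.count(_.isDefined)
```

Here countStar is 4 while countColumn is 2, mirroring the behavior described above.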
The following are the functions in the org.apache.spark.sql.functions package. Note that these functions return a Column type. Also note that almost all functions have overloaded versions: one that takes String column names and another that takes a Column.
The sample and population versions are fundamentally different statistical formulae. By default, Spark computes the sample standard deviation or variance if you use the stddev or variance functions; use stddev_pop or var_pop for the population formulae.
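The difference between the two formulae is the divisor: n for the population version, n - 1 for the sample version. A plain-Scala sketch of the arithmetic (not Spark API):

```scala
val xs = Seq(2.0, 4.0, 6.0, 8.0)
val n = xs.size
val mean = xs.sum / n
val sumSq = xs.map(x => math.pow(x - mean, 2)).sum

val varPop  = sumSq / n        // population variance (what var_pop computes)
val varSamp = sumSq / (n - 1)  // sample variance (Spark's default for variance)
val stddevSamp = math.sqrt(varSamp)
```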
Aggregating to Complex Types:
In the above example we use the agg method on a DataFrame. Normally we have seen that method being used on a RelationalGroupedDataset. The agg method on a DataFrame aggregates over the entire dataset, treating it as a single group.
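Functions such as collect_list and collect_set (also in org.apache.spark.sql.functions) aggregate a whole column into a complex type. A plain-Scala sketch of what they produce (not the Spark API itself):

```scala
val country = Seq("USA", "India", "USA", "UK", "India")

val asList = country.toList // like collect_list: keeps duplicates and order
val asSet  = country.toSet  // like collect_set: duplicates removed
```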
Sunday, 14 April 2019
CH7.1 Aggregations, Aggregating to complex types
Saturday, 13 April 2019
CH6.1 Working with Booleans, Numbers
Where to look for APIs:
The following are the places where one should look for transformations and functions:
Converting to Spark types:
Working with Booleans:
The following are the methods available on the Column type to perform tests. Note that most of the methods have operator-equivalent versions; the operators are preferred in Scala, while the verbose method names are preferred in Java. Recall that === is a method that we invoke on the Column type.
In Spark, you can always chain together and filters as sequential filters. The reason for this is that even if Boolean statements are expressed serially (one after the other), Spark will flatten all of these filters into one statement and perform the filter at the same time, creating the and statement for us. Multiple or statements, however, need to be specified in the same statement.
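The flattening idea can be mimicked with plain Scala collections (a sketch of the idea, not Spark's optimizer): serial filters and one combined and predicate select exactly the same rows.

```scala
val rows = Seq(1, 5, 8, 12, 20)

// Filters expressed serially, one after the other...
val chained = rows.filter(_ > 3).filter(_ < 15)

// ...are equivalent to a single combined `and` predicate,
// which is what Spark flattens sequential filters into.
val combined = rows.filter(n => n > 3 && n < 15)
```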
Below, note how we define the condition as a Column type. We can create such Column instances wherever we want; however, they are analyzed only in the context of a DataFrame.
As shown above, it is often easier to express filters as SQL statements than to use the programmatic DataFrame interface, and Spark SQL allows us to do this without paying any performance penalty.
Working with Numbers:
We show multiple ways of performing the same computation: using col, using expr, and using selectExpr. By default, the round function rounds up if you are exactly in between two numbers. You can round down by using the bround function.
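The two behaviors correspond to the HALF_UP and HALF_EVEN rounding modes, which we can sketch with Scala's BigDecimal (an illustration of the semantics, not Spark API):

```scala
import scala.math.BigDecimal.RoundingMode

// round: halfway cases go up (HALF_UP).
def roundLike(x: Double): Int = BigDecimal(x).setScale(0, RoundingMode.HALF_UP).toInt

// bround: halfway cases go to the nearest even digit (HALF_EVEN).
def broundLike(x: Double): Int = BigDecimal(x).setScale(0, RoundingMode.HALF_EVEN).toInt
```

For 2.5, roundLike gives 3 while broundLike gives 2; note that bround is banker's rounding, so 3.5 still rounds up to 4.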
The following shows the definition of the stat method on DataFrame.
Documentation of DataFrameStatFunctions: https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.sql.DataFrameStatFunctions
Note how using the stat methods inside a select is simpler.