DataFrame Transformations (Manipulating DataFrames)
When working with individual DataFrames, there are some fundamental objectives. These break down into several core operations.
Creating DataFrames:
val df = spark.read.format("json").load("/data/flight-data/json/2015-summary.json")
Select and selectExpr:
scala> df.select("DEST_COUNTRY_NAME").show(2)
+-----------------+
|DEST_COUNTRY_NAME|
+-----------------+
|    United States|
|    United States|
+-----------------+
only showing top 2 rows
scala> df.select("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME").show(2)
+-----------------+-------------------+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|
+-----------------+-------------------+
|    United States|            Romania|
|    United States|            Croatia|
+-----------------+-------------------+
+-----------------+-------------------+
only showing top 2 rows
selectExpr lets us use SQL-like expressions directly in a select.
We can treat selectExpr as a simple way to build up complex expressions that create new DataFrames. In fact, we can add any valid non-aggregating SQL statement, and as long as the columns resolve, it will be valid!
For example, we can add a new column to our DataFrame while, as in SQL, using * to represent all the existing columns.
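A minimal sketch of adding a column with selectExpr is given here. The small inline DataFrame is a hypothetical stand-in for the flight data (the `count` column and the `withinCountry` name are assumptions for illustration), and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession

object SelectExprSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("selectExpr-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in rows; the real data would come from the JSON file.
    val df = Seq(
      ("United States", "Romania", 15L),
      ("United States", "Croatia", 1L),
      ("United States", "United States", 348113L)
    ).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

    // "*" keeps every existing column; the second expression appends
    // a boolean column named withinCountry.
    df.selectExpr(
      "*",
      "(DEST_COUNTRY_NAME = ORIGIN_COUNTRY_NAME) AS withinCountry"
    ).show()

    spark.stop()
  }
}
```

The result keeps the three original columns and appends withinCountry as a fourth.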
(Note that if we want to group by and then perform an aggregation, the technique is different. With select we can only perform aggregations over the entire DataFrame.)
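The difference between the two kinds of aggregation can be sketched as follows. The inline rows, the `count` column, and the `total` alias are assumptions for illustration; groupBy followed by agg is one common way to aggregate per group, not necessarily the approach the original post had in mind.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object AggregationSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("aggregation-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in rows for the flight data.
    val df = Seq(
      ("United States", "Romania", 15L),
      ("United States", "Croatia", 1L),
      ("Egypt", "United States", 12L)
    ).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

    // Aggregating inside selectExpr collapses the whole DataFrame to one row.
    df.selectExpr("avg(count)", "count(distinct(DEST_COUNTRY_NAME))").show()

    // A per-group aggregation goes through groupBy and agg instead.
    df.groupBy("DEST_COUNTRY_NAME")
      .agg(sum("count").as("total"))
      .show()

    spark.stop()
  }
}
```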
Thursday, 11 April 2019
CH5/4 Creating DF, select and selectExpr
CH5/3 Records and Row type
Records and Rows:
Commands that return individual rows to the driver always return one or more Row objects when we are working with DataFrames.
Note that the Row companion object has an apply method, which allows us to instantiate a Row object using Row(…).
Remember that Row is a trait and also has a companion object.
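As a rough illustration of both points, the sketch below builds a Row through the companion object and then fetches one from a DataFrame. The values, the column names, and the local SparkSession are assumptions made for the example.

```scala
import org.apache.spark.sql.{Row, SparkSession}

object RowSketch {
  def main(args: Array[String]): Unit = {
    // Row(...) calls the apply method on the companion object; no `new` is needed.
    val myRow: Row = Row("Hello", null, 1, false)

    // Positional access returns Any; the typed getters return a concrete type.
    val asAny: Any = myRow(0)
    val greeting: String = myRow.getString(0)
    val number: Int = myRow.getInt(2)
    println(s"$asAny, $greeting, $number, null at 1: ${myRow.isNullAt(1)}")

    // Assumption: a local SparkSession is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("row-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical two-row DataFrame.
    val df = Seq(("United States", 15L), ("Egypt", 12L))
      .toDF("DEST_COUNTRY_NAME", "count")

    // first() brings a single Row back to the driver; take(n) brings an Array[Row].
    val firstRow: Row = df.first()
    val someRows: Array[Row] = df.take(2)
    println(firstRow)
    println(someRows.length)

    spark.stop()
  }
}
```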
CH5/2 Columns and Expressions
Columns and Expressions:
Columns:
The two simplest ways to construct and refer to columns are the col and column functions. To use either of these functions, you pass in a column name.
A col method is also available on Dataset (DataFrame). It returns the same Column type, but is distinct from the col and column functions.
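The three ways of referring to a column might look like this in practice. The DataFrame contents and the aliases are hypothetical, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, column}

object ColumnSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("column-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical two-row DataFrame.
    val df = Seq(("United States", 15L), ("Egypt", 12L))
      .toDF("DEST_COUNTRY_NAME", "count")

    // The col and column functions take only a name; the reference is
    // resolved when it is used against a DataFrame.
    val viaCol: Column = col("count")
    val viaColumn: Column = column("count")

    // The col method on the DataFrame returns a Column tied to that DataFrame.
    val viaMethod: Column = df.col("count")

    df.select(
      viaCol.as("viaCol"),
      viaColumn.as("viaColumn"),
      viaMethod.as("viaMethod")
    ).show()

    spark.stop()
  }
}
```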
Expressions:
expr("someCol - 5") is the same transformation as performing col("someCol") - 5, or even expr("someCol") - 5.
Think of expr as a function that lets us write SQL expressions as strings and returns a Column.
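The equivalence can be checked with a small sketch like the one below, where all three output columns should hold the same values. The `someCol` data and the aliases are made up for the example, and a local SparkSession is assumed.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr}

object ExprSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("expr-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical single-column DataFrame.
    val df = Seq(10, 20).toDF("someCol")

    df.select(
      // The whole expression parsed from a SQL string.
      expr("someCol - 5").as("viaExprString"),
      // A column reference combined with the Column minus operator.
      (col("someCol") - 5).as("viaCol"),
      // expr used only to look up the column, then the Column operator.
      (expr("someCol") - 5).as("viaExprColumn")
    ).show()

    spark.stop()
  }
}
```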
Column Type:
Accessing DataFrame Columns:
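One plausible way to inspect a DataFrame's columns is sketched here, using the columns property and printSchema. The original post's own example is not available, so the data and the choice of calls are assumptions.

```scala
import org.apache.spark.sql.SparkSession

object AccessColumnsSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: a local SparkSession is acceptable for this illustration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("access-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the flight data.
    val df = Seq(("United States", "Romania", 15L))
      .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

    // columns returns the column names as an Array[String].
    val names: Array[String] = df.columns
    println(names.mkString(", "))

    // printSchema prints each column with its type and nullability.
    df.printSchema()

    spark.stop()
  }
}
```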