Columns and Expressions:
Columns:
- Columns in Spark are similar to columns in a spreadsheet, an R data frame, or a pandas DataFrame. You can select, manipulate, and remove columns from DataFrames, and these operations are represented as expressions.
- To Spark, columns are logical constructions that simply represent a value computed on a per-record basis by means of an expression. This means that to have a real value for a column, we need to have a row; and to have a row, we need to have a DataFrame. You cannot manipulate an individual column outside the context of a DataFrame.
|
def col(colName: String): Column
Returns a Column based on the given column name.
|
|
def column(colName: String): Column
Returns a Column based on the given column name. Alias of col.
|
To use either of these functions, you pass in a column name:
|
scala> import org.apache.spark.sql.functions.{col, column}
import org.apache.spark.sql.functions.{col, column}

scala> col("someColumnName")
res6: org.apache.spark.sql.Column = someColumnName

scala> column("someColumnName")
res7: org.apache.spark.sql.Column = someColumnName
|
- Imp: Note that these columns might or might not exist in our DataFrames. Columns are not resolved until we compare the column names with those we are maintaining in the catalog. Column and table resolution happens in the analyzer phase.
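A quick sketch of this lazy resolution (the DataFrame df and the column names here are illustrative assumptions):

```scala
import org.apache.spark.sql.functions.col

// Creating a Column for a name that does not exist is perfectly legal:
val ghost = col("noSuchColumn")   // no error yet; nothing is resolved

// The error surfaces only when the column is analyzed against a DataFrame.
// Assuming df has only DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, and count:
// df.select(ghost)               // throws AnalysisException at analysis time
```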
- Scala has some unique language features that allow for more shorthand ways of referring to columns. The following bits of syntactic sugar perform exactly the same thing, namely creating a column, but provide no performance improvement. The $ allows us to designate a string as a special string that should refer to an expression. The tick mark (') is a special construct called a symbol; this is a Scala-specific way of referring to some identifier. Both are shorthand ways of referring to columns by name. Note that the type returned by $ is ColumnName, which is a subtype of Column.
|
scala> $"someColumnName"
res8: org.apache.spark.sql.ColumnName = someColumnName

scala> 'someColumnName
res9: Symbol = 'someColumnName
|
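All of these shorthands can be used interchangeably wherever a Column is expected, for example in select. A sketch, assuming an existing SparkSession spark and DataFrame df (in an application, $ and the symbol syntax need the implicits import; the spark-shell imports it for you):

```scala
import org.apache.spark.sql.functions.{col, column}
import spark.implicits._   // brings $"..." and 'symbol column syntax into scope

// All four select the same column; none is faster than another
df.select(col("someColumnName"))
df.select(column("someColumnName"))
df.select($"someColumnName")
df.select('someColumnName)
```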
- Explicit Column References: If you need to refer to a specific DataFrame's column, you can use the col method on that DataFrame. This can be useful when you are performing a join and need to refer to a specific column in one DataFrame that might share a name with another column in the joined DataFrame. This col method is available on Dataset (a DataFrame is a Dataset[Row]). It returns the same Column type, but is different from the col and column functions.
|
def col(colName: String): Column
Selects a column based on the column name and returns it as a Column.
|
|
scala> df.col("DEST_COUNTRY_NAME")
res10: org.apache.spark.sql.Column = DEST_COUNTRY_NAME
|
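As a sketch of why the Dataset col method matters, consider joining two DataFrames that share column names (the DataFrames df and df2 here are assumptions):

```scala
// Both DataFrames have a DEST_COUNTRY_NAME column, so a bare
// col("DEST_COUNTRY_NAME") in the join condition would be ambiguous.
val joined = df.join(df2,
  df.col("DEST_COUNTRY_NAME") === df2.col("ORIGIN_COUNTRY_NAME"))

// After the join, disambiguate shared names the same way:
joined.select(df.col("DEST_COUNTRY_NAME"))
```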
Expressions:
- Columns are just expressions. An expression is a set of transformations on one or more values in a record in a DataFrame. Think of an expression as a function that takes as input one or more column names, resolves them, and then potentially applies more expressions to create a single value for each record in the dataset. Importantly, this "single value" can actually be a complex type like a Map or an Array.
- In the simplest case, an expression, created via the expr function, is just a DataFrame column reference: expr("someCol") is equivalent to col("someCol").
|
scala> expr("DEST_COUNTRY_NAME")
res13: org.apache.spark.sql.Column = DEST_COUNTRY_NAME

scala> col("DEST_COUNTRY_NAME")
res14: org.apache.spark.sql.Column = DEST_COUNTRY_NAME

scala> expr("(((someCol + 5) * 200) - 6) < otherCol")
res15: org.apache.spark.sql.Column = ((((someCol + 5) * 200) - 6) < otherCol)
|
- expr is a function in the org.apache.spark.sql.functions object.
|
def expr(expr: String): Column
Parses the expression string into the column that it represents, similar to DataFrame.selectExpr.
|
- Columns provide a subset of expression functionality. If you use col() and want to perform transformations on that column, you must perform those on that column reference. When using an expression, the expr function can actually parse transformations and column references from a string, and the result can subsequently be passed into further transformations. expr("someCol - 5") is the same transformation as performing col("someCol") - 5, or even expr("someCol") - 5. Think of expr as allowing us to use string SQL expressions that return a Column type.
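To make the equivalence concrete, the following three expressions describe the same transformation (someCol is an assumed column name; none of these resolve until used against a DataFrame):

```scala
import org.apache.spark.sql.functions.{col, expr}

// All three produce the same transformation on someCol:
val e1 = expr("someCol - 5")   // parsed entirely from a SQL string
val e2 = col("someCol") - 5    // Column arithmetic on a column reference
val e3 = expr("someCol") - 5   // expr as a column reference, then arithmetic
```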
Column Type:
- All the functions we saw earlier, i.e., col, column, df.col, and expr, return a Column type. This is a class in Scala, and several methods are available on it, e.g., providing an alias, casting, etc.
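For example, two commonly used Column methods, sketched against an assumed DataFrame df (the output column names are illustrative):

```scala
import org.apache.spark.sql.functions.col

// alias (or its synonym as) renames the column in the output;
// cast converts it to another data type.
df.select(
  col("count").cast("long").alias("count_as_long"),
  col("DEST_COUNTRY_NAME").as("destination")
)
```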
Accessing DataFrame Columns:
- To see a DataFrame's columns, we can use printSchema. To programmatically access columns, you can use the columns property to see all columns on a DataFrame.
|
def columns: Array[String]
Returns all column names as an array.
|
|
scala> df.columns
res16: Array[String] = Array(DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count)
|