Thursday, 11 April 2019

CH5/2 Columns and Expressions



    Columns and Expressions:
     
    Columns:

    • Columns in Spark are similar to columns in a spreadsheet, R dataframe, or pandas DataFrame. You can select, manipulate, and remove columns from DataFrames, and these operations are represented as expressions.

    • To Spark, columns are logical constructions that simply represent a value computed on a per-record basis by means of an expression. This means that to have a real value for a column, we need to have a row; and to have a row, we need to have a DataFrame. You cannot manipulate an individual column outside the context of a DataFrame.

    def col(colName: String): Column
    Returns a Column based on the given column name.
    def column(colName: String): Column
    Returns a Column based on the given column name. Alias of col.

    To use either of these functions, you pass in a column name:
     
    scala> import org.apache.spark.sql.functions.{col, column}
    import org.apache.spark.sql.functions.{col, column}

    scala> col("someColumnName")
    res6: org.apache.spark.sql.Column = someColumnName

    scala> column("someColumnName")
    res7: org.apache.spark.sql.Column = someColumnName


    • Important: note that these columns might or might not exist in our DataFrames. Columns are not resolved until we compare the column names with those we are maintaining in the catalog. Column and table resolution happens in the analyzer phase.
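    The following sketch illustrates this lazy resolution, assuming the same df used throughout this chapter (a Spark snippet; it needs an active SparkSession to run):

    ```scala
    import org.apache.spark.sql.functions.col

    // Creating a column reference always succeeds -- nothing is resolved yet,
    // even if no DataFrame has a column by this name.
    val madeUp = col("noSuchColumn")

    // Resolution happens in the analyzer phase, so it is only when the
    // unresolved column is used against a real DataFrame that Spark
    // raises an AnalysisException.
    // df.select(madeUp)                    // throws org.apache.spark.sql.AnalysisException
    df.select(col("DEST_COUNTRY_NAME"))     // resolves against the catalog and succeeds
    ```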


    • Scala has some unique language features that allow for more shorthand ways of referring to columns. The following bits of syntactic sugar perform exactly the same thing, namely creating a column, but provide no performance improvement. The $ designates a string as a special string that refers to an expression. The tick mark (') is a symbol, a Scala-specific construct for referring to some identifier. Both are shorthand ways of referring to columns by name. Note that the resulting type is ColumnName, which is a subtype of Column.

    scala> $"someColumnName"
    res8: org.apache.spark.sql.ColumnName = someColumnName

    scala> 'someColumnName
    res9: Symbol = 'someColumnName


    • Explicit Column References: If you need to refer to a specific DataFrame’s column, you can use the col method on the specific DataFrame. This can be useful when you are performing a join and need to refer to a specific column in one DataFrame that might share a name with another column in the joined DataFrame.

    This col method is available on Dataset (and therefore on DataFrame). It returns the same Column type, but is distinct from the col and column functions in org.apache.spark.sql.functions.

    def col(colName: String): Column
    Selects column based on the column name and return it as a Column.

    scala> df.col("DEST_COUNTRY_NAME")
    res10: org.apache.spark.sql.Column = DEST_COUNTRY_NAME
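    As a sketch of the join use case, assuming a second DataFrame df2 that shares the DEST_COUNTRY_NAME column name with df:

    ```scala
    // Both DataFrames have a DEST_COUNTRY_NAME column; df.col and df2.col
    // pin each reference to a specific side of the join, removing ambiguity.
    val joined = df.join(
      df2,
      df.col("DEST_COUNTRY_NAME") === df2.col("DEST_COUNTRY_NAME")
    )

    // A bare col("DEST_COUNTRY_NAME") in the join condition would be
    // ambiguous here, since the analyzer could resolve it to either side.
    ```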




    Expressions:

    • Columns are just expressions. An expression is a set of transformations on one or more values in a record in a DataFrame. Think of an expression as a function that takes as input one or more column names, resolves them, and then potentially applies more expressions to create a single value for each record in the dataset. Importantly, this “single value” can actually be a complex type like a Map or an Array.

    • In the simplest case, an expression, created via the expr function, is just a DataFrame column reference: expr("someCol") is equivalent to col("someCol").

    scala> expr("DEST_COUNTRY_NAME")
    res13: org.apache.spark.sql.Column = DEST_COUNTRY_NAME

    scala> col("DEST_COUNTRY_NAME")
    res14: org.apache.spark.sql.Column = DEST_COUNTRY_NAME
     
    scala> expr("(((someCol + 5) * 200) - 6) < otherCol")
    res15: org.apache.spark.sql.Column = ((((someCol + 5) * 200) - 6) < otherCol)

    • expr is a function in the org.apache.spark.sql.functions object.

    def expr(expr: String): Column
    Parses the expression string into the column that it represents, similar to DataFrame.selectExpr.

    • Columns provide a subset of expression functionality. If you use col() and want to perform transformations on that column, you must perform those on that column reference. When using an expression, the expr function can actually parse transformations and column references from a string and can subsequently be passed into further transformations.

    expr("someCol - 5") is the same transformation as performing col("someCol") - 5, or even expr("someCol") - 5.

    Think of expr as allowing us to use SQL expression strings while returning a Column type.
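    To illustrate, the three forms below build the same logical expression tree, so Spark compiles them to the same plan (a sketch using the hypothetical columns someCol and otherCol from the example above):

    ```scala
    import org.apache.spark.sql.functions.{col, expr}

    val e1 = expr("(((someCol + 5) * 200) - 6) < otherCol")        // pure SQL string
    val e2 = (((col("someCol") + 5) * 200) - 6) < col("otherCol")  // pure Column operators
    val e3 = (((expr("someCol") + 5) * 200) - 6) < col("otherCol") // mixed string and operators

    // All three are Column values representing the same expression.
    ```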


    Column Type:

    • All the functions we saw earlier, i.e., col, column, df.col, and expr, return a Column type. Column is a class in Scala, and several methods are available on it, e.g., providing an alias, casting to another type, etc.
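    A few of the commonly used Column methods, sketched against the flight-data df used earlier (the output names destination and count_long are illustrative):

    ```scala
    import org.apache.spark.sql.functions.col

    df.select(
      col("DEST_COUNTRY_NAME").alias("destination"), // rename the output column
      col("count").cast("long").as("count_long")     // cast to another type; as is a synonym of alias
    ).orderBy(col("count_long").desc)                // desc builds a descending sort expression
    ```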



    Accessing DataFrame Columns:

    • To see a DataFrame's columns, we can use printSchema. To access them programmatically, use the columns property, which lists all columns on a DataFrame.

    def columns: Array[String]
    Returns all column names as an array.

    scala> df.columns
    res16: Array[String] = Array(DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count)


