Experiments with Spark (based on Definitive Guide): CH7.5 Pivots

Pivots:

Pivots make it possible for you to convert a row into a column. The pivot method is available on RelationalGroupedDataset. Note that it returns another RelationalGroupedDataset on which we can invoke aggregate functions.

def pivot(pivotColumn: String, values: List[Any]): RelationalGroupedDataset

def pivot(pivotColumn: String, values: Seq[Any]): RelationalGroupedDataset

def pivot(pivotColumn: String): RelationalGroupedDataset

Pivots a column of the current DataFrame and perform the specified aggregation. There are two versions of pivot function: one that requires the caller to specify the list of distinct values to pivot on, and one that does not. The latter is more concise but less efficient, because Spark needs to first compute the list of distinct values internally.

// Compute the sum of earnings for each year by course with each course as a separate column

df.groupBy("year").pivot("course", Arrays.<Object>asList("dotNET", "Java")).sum("earnings");

// Or without specifying column values (less efficient)

df.groupBy("year").pivot("course").sum("earnings");

pivotColumn - Name of the column to pivot.

values - List of values that will be translated to columns in the output DataFrame.

scala> df.groupBy().pivot("Country")

res69:
    org.apache.spark.sql.RelationalGroupedDataset = RelationalGroupedDataset:
    [grouping expressions: [], value: [InvoiceNo: string, StockCode: string ...
    6 more fields], type: Pivot]

scala>
    df.groupBy().pivot("Country").count()

res70: org.apache.spark.sql.DataFrame =
    [Australia: bigint, Austria: bigint ... 36 more fields]

scala>
    df.groupBy().pivot("Country").count().printSchema

root

 |--
    Australia: long (nullable = true)

 |-- Austria: long (nullable = true)

 |-- Bahrain: long (nullable = true)

 |-- Belgium: long (nullable = true)

 |-- Brazil: long (nullable = true)

 |-- Canada:
    long (nullable = true)

 |-- Channel
    Islands: long (nullable = true)

 |-- Cyprus:
    long (nullable = true)

 |-- Czech
    Republic: long (nullable = true)

 |--
    Denmark: long (nullable = true)

 |-- EIRE:
    long (nullable = true)

 |--
    European Community: long (nullable = true)

 |--
    Finland: long (nullable = true)

 |-- France:
    long (nullable = true)

 |--
    Germany: long (nullable = true)

 |-- Greece:
    long (nullable = true)

 |-- Hong
    Kong: long (nullable = true)

 |--
    Iceland: long (nullable = true)

 |-- Israel:
    long (nullable = true)

 |-- Italy:
    long (nullable = true)

 |-- Japan:
    long (nullable = true)

 |--
    Lebanon: long (nullable = true)

 |--
    Lithuania: long (nullable = true)

 |-- Malta:
    long (nullable = true)

 |--
    Netherlands: long (nullable = true)

 |-- Norway:
    long (nullable = true)

 |-- Poland:
    long (nullable = true)

 |--
    Portugal: long (nullable = true)

 |-- RSA:
    long (nullable = true)

 |-- Saudi
    Arabia: long (nullable = true)

 |--
    Singapore: long (nullable = true)

 |-- Spain:
    long (nullable = true)

 |-- Sweden:
    long (nullable = true)

 |--
    Switzerland: long (nullable = true)

 |-- USA:
    long (nullable = true)

 |-- United
    Arab Emirates: long (nullable = true)

 |-- United
    Kingdom: long (nullable = true)

 |--
    Unspecified: long (nullable = true)

scala>
    df.groupBy().pivot("Country").count().show

+---------+-------+-------+-------+------+------+---------------+------+--------------+-------+----+------------------+-------+------+-------+------+---------+-------+------+-----+-----+-------+---------+-----+-----------+------+------+--------+---+------------+---------+-----+------+-----------+---+--------------------+--------------+-----------+

|Australia|Austria|Bahrain|Belgium|Brazil|Canada|Channel
    Islands|Cyprus|Czech Republic|Denmark|EIRE|European
    Community|Finland|France|Germany|Greece|Hong
    Kong|Iceland|Israel|Italy|Japan|Lebanon|Lithuania|Malta|Netherlands|Norway|Poland|Portugal|RSA|Saudi
    Arabia|Singapore|Spain|Sweden|Switzerland|USA|United Arab Emirates|United
    Kingdom|Unspecified|

+---------+-------+-------+-------+------+------+---------------+------+--------------+-------+----+------------------+-------+------+-------+------+---------+-------+------+-----+-----+-------+---------+-----+-----------+------+------+--------+---+------------+---------+-----+------+-----------+---+--------------------+--------------+-----------+

|    
    401|     19|  
    32|   151|            758|   622|            30|    389|8196|                61|    695| 
   9495|   146|     
    182|   297| 
  358|     45|       35| 
       2371|  1086|  
    1519| 58|          10|      229| 2533|   462|       2002|291|                  68|        495478|        446|

+---------+-------+-------+-------+------+------+---------------+------+--------------+-------+----+------------------+-------+------+-------+------+---------+-------+------+-----+-----+-------+---------+-----+-----------+------+------+--------+---+------------+---------+-----+------+-----------+---+--------------------+--------------+-----------+

scala>
    df.groupBy().pivot("Country",Seq("Canada","Germany")).count().show

+------+-------+

|Canada|Germany|

+------+-------+

|  
    151|   9495|

+------+-------+

Note that distinct values in the country columns become the column names.

Note: Simply using df.groupBy() also gives us entire dataset as a RelationalGroupedDataset on which we can invoke aggregate methods or pivot methods.

Experiments with Spark (based on Definitive Guide)

Search This Blog

Sunday, 14 April 2019

CH7.5 Pivots

No comments:

Post a Comment