Experiments with Spark (based on Definitive Guide): CH9.5 Reading and Writing ORC files

ORC Files:

ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. ORC actually has no options for reading in data because Spark understands the file format quite well.

The fundamental difference is that Parquet is further optimized for use with Spark, whereas ORC is further optimized for Hive.
Following shows an example of writing to orc files:

1    import org.apache.spark.sql._
2    import org.apache.spark.sql.types._
3
4    object SparkDefinitiveTesting {
5
6      def main(args: Array[String]): Unit = {
7        val spark = SparkSession.builder().master("local[*]").appName("Test App").getOrCreate()
8        spark.sparkContext.setLogLevel("FATAL")
9
10       val manualSchema = StructType(Seq(StructField("InvoiceNo", StringType, true),
11         StructField("StockCode", StringType, true),
12         StructField("Description", StringType, true),
13         StructField("Quantity", IntegerType, true),
14         StructField("InvoiceDate", TimestampType, true),
15         StructField("UnitPrice", DoubleType, true),
16         StructField("CustomerID", DoubleType, true),
17         StructField("Country", StringType, true)))
18
19       val df_csv_manual = spark.read.format("csv").option("header", "true").option("mode", "FAILFAST").option("inferSchema", "true").schema(manualSchema).load("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\by-day").repartition(5)
20       println("Count of Dataframe df_csv_manual:" + df_csv_manual.count())
21       df_csv_manual.write.format("orc").mode(SaveMode.Overwrite).save("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\orcop")
22
23
24      val df_orc =   spark.read.format("orc").load("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\orcop")
25      df_orc.printSchema()
26
27     }
28   }

C:\Users\sukulma\Downloads\Spark-Data\Spark-data\data\retail-data\orcop>dir

 Volume in
  drive C is OSDisk

 Volume Serial
  Number is C05C-4437

 Directory of
  C:\Users\sukulma\Downloads\Spark-Data\Spark-data\data\retail-data\orcop

04/15/2019 
  02:19 AM    <DIR>          .

04/15/2019 
  02:19 AM    <DIR>          ..

04/15/2019 
  02:19 AM             9,776
  .part-00000-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc

04/15/2019 
  02:19 AM             9,784
  .part-00001-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc

04/15/2019 
  02:19 AM             9,788
  .part-00002-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc

04/15/2019 
  02:19 AM             9,792
  .part-00003-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc

04/15/2019 
  02:19 AM             9,788
  .part-00004-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc

04/15/2019 
  02:19 AM                 8
  ._SUCCESS.crc

04/15/2019 
  02:19 AM         1,250,159
  part-00000-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc

04/15/2019 
  02:19 AM         1,251,208
  part-00001-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc

04/15/2019 
  02:19 AM         1,251,806
  part-00002-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc

04/15/2019 
  02:19 AM         1,252,097
  part-00003-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc

04/15/2019 
  02:19 AM         1,251,355
  part-00004-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc

04/15/2019 
  02:19 AM                 0
  _SUCCESS

             
  12 File(s)      6,305,561 bytes

              
  2 Dir(s)  77,677,350,912 bytes
  free

Result:

Count of Dataframe df_csv_manual:541909

root

 |--
  InvoiceNo: string (nullable = true)

 |--
  StockCode: string (nullable = true)

 |--
  Description: string (nullable = true)

 |-- Quantity:
  integer (nullable = true)

 |--
  InvoiceDate: timestamp (nullable = true)

 |--
  UnitPrice: double (nullable = true)

 |--
  CustomerID: double (nullable = true)

 |-- Country:
  string (nullable = true)

Experiments with Spark (based on Definitive Guide)

Search This Blog

Monday, 15 April 2019

CH9.5 Reading and Writing ORC files

No comments:

Post a Comment