Search This Blog

Monday, 15 April 2019

CH9.5 Reading and Writing ORC files



ORC Files:

 
  • ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. ORC actually has no options for reading in data because Spark understands the file format quite well.

  • The fundamental difference is that Parquet is further optimized for use with Spark, whereas ORC is further optimized for Hive.
     
  • Following shows an example of writing to orc files:

1    import org.apache.spark.sql._
2    import org.apache.spark.sql.types._
3   
4    object SparkDefinitiveTesting {
5   
6      def main(args: Array[String]): Unit = {
7        val spark = SparkSession.builder().master("local[*]").appName("Test App").getOrCreate()
8        spark.sparkContext.setLogLevel("FATAL")
9   
10       val manualSchema = StructType(Seq(StructField("InvoiceNo", StringType, true),
11         StructField("StockCode", StringType, true),
12         StructField("Description", StringType, true),
13         StructField("Quantity", IntegerType, true),
14         StructField("InvoiceDate", TimestampType, true),
15         StructField("UnitPrice", DoubleType, true),
16         StructField("CustomerID", DoubleType, true),
17         StructField("Country", StringType, true)))
18  
19       val df_csv_manual = spark.read.format("csv").option("header", "true").option("mode", "FAILFAST").option("inferSchema", "true").schema(manualSchema).load("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\by-day").repartition(5)
20       println("Count of Dataframe df_csv_manual:" + df_csv_manual.count())
21       df_csv_manual.write.format("orc").mode(SaveMode.Overwrite).save("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\orcop")
22  
23  
24      val df_orc =   spark.read.format("orc").load("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\orcop")
25      df_orc.printSchema()
26  
27     }
28   }
C:\Users\sukulma\Downloads\Spark-Data\Spark-data\data\retail-data\orcop>dir
 Volume in drive C is OSDisk
 Volume Serial Number is C05C-4437

 Directory of C:\Users\sukulma\Downloads\Spark-Data\Spark-data\data\retail-data\orcop

04/15/2019  02:19 AM    <DIR>          .
04/15/2019  02:19 AM    <DIR>          ..
04/15/2019  02:19 AM             9,776 .part-00000-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,784 .part-00001-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,788 .part-00002-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,792 .part-00003-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,788 .part-00004-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM                 8 ._SUCCESS.crc
04/15/2019  02:19 AM         1,250,159 part-00000-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,208 part-00001-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,806 part-00002-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,252,097 part-00003-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,355 part-00004-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM                 0 _SUCCESS
              12 File(s)      6,305,561 bytes
               2 Dir(s)  77,677,350,912 bytes free

Result:
Count of Dataframe df_csv_manual:541909
root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)


 


No comments:

Post a Comment