ORC Files:
- ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, with integrated support for finding required rows quickly. Reading ORC in Spark typically needs no format-specific options because Spark understands the file format well.
- ORC and Parquet are both columnar formats. The fundamental difference is that Parquet is further optimized for use with Spark, whereas ORC is further optimized for Hive.
- The following example writes a DataFrame to ORC files and reads it back:
|
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object SparkDefinitiveTesting {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("Test App").getOrCreate()
    spark.sparkContext.setLogLevel("FATAL")

    // Explicit schema for the retail CSV data.
    val manualSchema = StructType(Seq(
      StructField("InvoiceNo", StringType, true),
      StructField("StockCode", StringType, true),
      StructField("Description", StringType, true),
      StructField("Quantity", IntegerType, true),
      StructField("InvoiceDate", TimestampType, true),
      StructField("UnitPrice", DoubleType, true),
      StructField("CustomerID", DoubleType, true),
      StructField("Country", StringType, true)))

    // Read the CSV files with the explicit schema. inferSchema is omitted
    // because a user-supplied schema makes it redundant.
    val df_csv_manual = spark.read.format("csv")
      .option("header", "true")
      .option("mode", "FAILFAST")
      .schema(manualSchema)
      .load("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\by-day")
      .repartition(5)
    println("Count of Dataframe df_csv_manual:" + df_csv_manual.count())

    // Write the DataFrame as ORC. The 5 partitions produce 5 part files.
    df_csv_manual.write.format("orc").mode(SaveMode.Overwrite)
      .save("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\orcop")

    // Read the ORC files back. The schema comes from the files themselves.
    val df_orc = spark.read.format("orc")
      .load("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\orcop")
    df_orc.printSchema()
  }
}
|
|
C:\Users\sukulma\Downloads\Spark-Data\Spark-data\data\retail-data\orcop>dir
 Volume in drive C is OSDisk
 Volume Serial Number is C05C-4437

 Directory of C:\Users\sukulma\Downloads\Spark-Data\Spark-data\data\retail-data\orcop

04/15/2019  02:19 AM    <DIR>          .
04/15/2019  02:19 AM    <DIR>          ..
04/15/2019  02:19 AM             9,776 .part-00000-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,784 .part-00001-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,788 .part-00002-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,792 .part-00003-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,788 .part-00004-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM                 8 ._SUCCESS.crc
04/15/2019  02:19 AM         1,250,159 part-00000-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,208 part-00001-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,806 part-00002-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,252,097 part-00003-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,355 part-00004-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM                 0 _SUCCESS
              12 File(s)      6,305,561 bytes
               2 Dir(s)  77,677,350,912 bytes free

Result:
Count of Dataframe df_csv_manual:541909
root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)
|
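
The part files in the listing carry a `.snappy.orc` suffix, which reflects the compression codec Spark used for this write. As a minimal sketch of choosing a different codec and of reading back only the columns and rows you need, the snippet below writes a tiny in-memory DataFrame with the `compression` write option set to `zlib`, then selects two columns and filters on one of them. The object name `OrcRoundTripSketch`, the helper `roundTrip`, the sample rows, and the temporary output directory are all hypothetical and are not part of the example above. It assumes Spark SQL is on the classpath.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

object OrcRoundTripSketch {

  // Writes a small DataFrame as zlib-compressed ORC, then reads back
  // only two columns and the rows with Quantity greater than 10.
  def roundTrip(spark: SparkSession, outputDir: String): DataFrame = {
    import spark.implicits._

    // Hypothetical sample rows standing in for the retail data.
    val sample = Seq(
      ("536365", "United Kingdom", 6),
      ("536370", "France", 24),
      ("536389", "Australia", 12)
    ).toDF("InvoiceNo", "Country", "Quantity")

    sample.write
      .format("orc")
      .option("compression", "zlib") // overrides the codec for this write only
      .mode(SaveMode.Overwrite)
      .save(outputDir)

    // No read options are set; the schema is taken from the ORC files.
    spark.read
      .format("orc")
      .load(outputDir)
      .select("InvoiceNo", "Quantity")
      .where($"Quantity" > 10)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ORC sketch").getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")

    // Temporary directory used only for this sketch.
    val dir = Files.createTempDirectory("orc-sketch").toString
    roundTrip(spark, dir).show()

    spark.stop()
  }
}
```

Because ORC is columnar, selecting a subset of columns lets Spark skip the others on disk, and a simple comparison filter like the one above is the kind of predicate that ORC's built-in indexes are meant to help with.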