Advanced IO Concepts:

Partitioning:
The partitionBy method on the DataFrameWriter writes the data out into one subdirectory per distinct value of the partition columns. This is very similar to dynamic partitioning in Hive, and it works with target Hive tables as well.
Following is an example of writing Parquet files with the data partitioned by destination country.
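The example itself was embedded as an image in the original post. The following is a minimal sketch of the same idea; the sample rows, the DEST_COUNTRY_NAME column name (taken from the book's flight dataset), and the output path are assumptions:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

// Hypothetical local session and hand-built rows standing in for the flight data
val spark = SparkSession.builder().master("local[*]").appName("partitionByExample").getOrCreate()
import spark.implicits._

val flights = Seq(
  ("United States", "Romania", 15),
  ("United States", "Croatia", 1),
  ("Egypt", "United States", 24)
).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

// partitionBy creates one subdirectory per distinct destination country,
// e.g. DEST_COUNTRY_NAME=Egypt/part-....snappy.parquet
flights.write
  .format("parquet")
  .mode(SaveMode.Overwrite)
  .partitionBy("DEST_COUNTRY_NAME")
  .save("/tmp/partitioned-flights")
```

Readers of the partitioned output that filter on DEST_COUNTRY_NAME can then skip entire directories instead of scanning every file.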
Bucketing:
Following is the definition of the bucketBy method on the DataFrameWriter.
Following is an example of writing to 10 buckets. (This code should have worked, but didn't, as I was using an older version of Spark.)
We saw previously that the number of output files is a function of the number of partitions we had at write time. Alternatively, the maxRecordsPerFile option (introduced in Spark 2.2) lets us cap the number of records written to each file.
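A small sketch of the option in action; the row count, cap of 1000, and output path are chosen here for illustration:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("maxRecordsExample").getOrCreate()
import spark.implicits._

// One partition would normally mean one output file...
val df = (1 to 5000).toDF("id").repartition(1)

// ...but maxRecordsPerFile (Spark 2.2+) caps each file at 1000 rows,
// so this single partition is split across five part files
df.write
  .format("csv")
  .option("maxRecordsPerFile", 1000)
  .mode(SaveMode.Overwrite)
  .save("/tmp/capped-output")
```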
Monday, 15 April 2019
CH9.7 Advanced IO Concepts
CH9.6 Reading and Writing Text Files
Text Files:
Following are the method definitions available on the DataFrameReader:
Following are the definitions of the "text" and "partitionBy" methods available on DataFrameWriter objects:
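The method definitions were shown as screenshots in the original post. A brief sketch of how the text source behaves in practice (paths and sample lines are assumptions): reading produces a single string column named "value", and writing accepts exactly one string column (plus any partition columns):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("textExample").getOrCreate()
import spark.implicits._

// Writing: the text source accepts exactly one string column;
// partition columns, if any, are allowed on top of it
Seq("first line", "second line").toDF("value")
  .write.mode(SaveMode.Overwrite).text("/tmp/text-out")

// Reading: each line of the files becomes one row
// in a single string column named "value"
val lines = spark.read.text("/tmp/text-out")
lines.printSchema()
```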
CH9.5 Reading and Writing ORC files
ORC Files:
- ORC is a self-describing, type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. ORC has no options for reading in data because Spark understands the file format quite well.
- The fundamental difference is that Parquet is further optimized for use with Spark, whereas ORC is further optimized for Hive.
- Following is an example of writing ORC files:
import org.apache.spark.sql._
import org.apache.spark.sql.types._

object SparkDefinitiveTesting {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("Test App").getOrCreate()
    spark.sparkContext.setLogLevel("FATAL")

    val manualSchema = StructType(Seq(
      StructField("InvoiceNo", StringType, true),
      StructField("StockCode", StringType, true),
      StructField("Description", StringType, true),
      StructField("Quantity", IntegerType, true),
      StructField("InvoiceDate", TimestampType, true),
      StructField("UnitPrice", DoubleType, true),
      StructField("CustomerID", DoubleType, true),
      StructField("Country", StringType, true)))

    val df_csv_manual = spark.read.format("csv")
      .option("header", "true")
      .option("mode", "FAILFAST")
      .option("inferSchema", "true")
      .schema(manualSchema)
      .load("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\by-day")
      .repartition(5)
    println("Count of Dataframe df_csv_manual:" + df_csv_manual.count())
    df_csv_manual.write.format("orc").mode(SaveMode.Overwrite)
      .save("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\orcop")

    val df_orc = spark.read.format("orc")
      .load("C:\\Users\\sukulma\\Downloads\\Spark-Data\\Spark-data\\data\\retail-data\\orcop")
    df_orc.printSchema()
  }
}
C:\Users\sukulma\Downloads\Spark-Data\Spark-data\data\retail-data\orcop>dir
 Volume in drive C is OSDisk
 Volume Serial Number is C05C-4437

 Directory of C:\Users\sukulma\Downloads\Spark-Data\Spark-data\data\retail-data\orcop

04/15/2019  02:19 AM    <DIR>          .
04/15/2019  02:19 AM    <DIR>          ..
04/15/2019  02:19 AM             9,776 .part-00000-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,784 .part-00001-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,788 .part-00002-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,792 .part-00003-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM             9,788 .part-00004-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc.crc
04/15/2019  02:19 AM                 8 ._SUCCESS.crc
04/15/2019  02:19 AM         1,250,159 part-00000-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,208 part-00001-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,806 part-00002-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,252,097 part-00003-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM         1,251,355 part-00004-ae4f177b-fc07-42f3-a8c6-0ce6a859c281.snappy.orc
04/15/2019  02:19 AM                 0 _SUCCESS
              12 File(s)      6,305,561 bytes
               2 Dir(s)  77,677,350,912 bytes free
Result:

Count of Dataframe df_csv_manual:541909
root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: timestamp (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: double (nullable = true)
 |-- Country: string (nullable = true)