Sunday, 14 April 2019

CH9.4 Reading and Writing Parquet Files

Even though Parquet has only two options, you can still encounter problems if you're working with incompatible Parquet files. Be careful when you write out Parquet files with different versions of Spark, since the files they produce may not be compatible with one another.
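A minimal read/write sketch, assuming a SparkSession named spark; the paths are placeholders rather than this post's original example:

// Read an existing Parquet folder into a DataFrame (placeholder path).
val df = spark.read.format("parquet")
  .load("/data/flight-data/parquet/2010-summary.parquet")

// Write it back out as Parquet with the current Spark version.
df.write.format("parquet")
  .mode("overwrite")
  .save("/tmp/my-parquet-output")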
CH9.3 Reading and Writing json files
JSON files:
Note that one file per partition is written out, and the entire DataFrame is written out as a folder, with one JSON object per line.
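A short sketch of the JSON round trip, assuming a SparkSession named spark and a DataFrame df; the paths are placeholders:

// Write the DataFrame as line-delimited JSON: the output is a folder
// containing one file per partition, with one JSON object per line.
df.write.format("json")
  .mode("overwrite")
  .save("/tmp/my-json-output")

// Read the folder back into a DataFrame.
val jsonDF = spark.read.format("json")
  .load("/tmp/my-json-output")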
CH9.2 Reading and Writing csv files
CSV files:
Note that we can specify a compression codec when reading from or writing to compressed files.
In the example below, we specify that 1) the CSV files contain a header row, 2) the schema should be inferred from the files, and 3) the read mode is FAILFAST, which fails the job if the data is malformed.
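A sketch of such a read, assuming a SparkSession named spark; the input path is a placeholder:

val csvDF = spark.read.format("csv")
  .option("header", "true")        // first line is a header row
  .option("inferSchema", "true")   // infer column types from the data
  .option("mode", "FAILFAST")      // fail the job on malformed records
  .load("/data/flight-data/csv/2010-summary.csv")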
Next we also show how to specify the schema explicitly. If the data we read does not match the schema, the job fails when Spark actually reads the data, not when the code is written. As soon as we start a Spark job, it fails immediately because the data does not conform to the specified schema. In general, Spark fails only at job execution time rather than at DataFrame definition time, even if, for example, we point to a file that does not exist. This is due to lazy evaluation.
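A sketch of an explicit schema, again assuming a SparkSession named spark; the column names mirror the placeholder flight-data path used above and stand in for your own columns:

import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

// Hand-written schema instead of inference.
val manualSchema = new StructType(Array(
  StructField("DEST_COUNTRY_NAME", StringType, true),
  StructField("ORIGIN_COUNTRY_NAME", StringType, true),
  StructField("count", LongType, false)
))

val typedDF = spark.read.format("csv")
  .option("header", "true")
  .option("mode", "FAILFAST")
  .schema(manualSchema)            // use the explicit schema
  .load("/data/flight-data/csv/2010-summary.csv")

// Nothing has been checked yet; a schema mismatch surfaces only when a job runs.
typedDF.take(5)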
Note that 5 .gz files are created inside the directory. This reflects the number of partitions in our DataFrame at the time we write it out.
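A write along these lines would produce that kind of output; this is a sketch, and the repartition(5) call, the gzip codec, and the output path are assumptions rather than this post's original code:

// Repartitioning to 5 gives five part files; gzip compression makes them .gz files.
csvDF.repartition(5)
  .write.format("csv")
  .mode("overwrite")
  .option("compression", "gzip")
  .save("/tmp/my-csv-gzip-output")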