Tuesday, 23 April 2019

CH16.1 Developing Spark Applications


Spark Applications are the combination of two things: a Spark cluster and your code. 
The following walks through building and running a sample application, starting with Scala.
Scala-Based App:
Scala is Spark’s “native” language and naturally makes for a great way to write applications. Writing a Spark application in Scala is really no different from writing any other Scala application.

You can build applications using sbt or Apache Maven, two Java Virtual Machine (JVM)–based build tools. 
To configure an sbt build for our Scala application, we specify a build.sbt file to manage the package information. Inside the build.sbt file, there are a few key things to include:
  • Project metadata (package name, package versioning information, etc.)
  • Where to resolve dependencies
  • Dependencies needed for your library

The following is what a sample build.sbt file looks like. Notice that we must specify the Scala version as well as the Spark version:

name := "example"
organization := "com.databricks"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.8"
// Spark Information
val
sparkVersion = "2.2.0"
// allows us to include spark packages
resolvers += "bintray-spark-packages" at
  "
https://dl.bintray.com/spark-packages/maven/"
resolvers += "Typesafe Simple Repository" at
  "
http://repo.typesafe.com/typesafe/simple/maven-releases/"
resolvers += "MavenRepository" at
  "
https://mvnrepository.com/"
libraryDependencies ++= Seq(
  // spark core
  "org.apache.spark"
%% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
// the rest of the file is omitted for brevity
)
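One optional refinement when the JAR is destined for spark-submit (this is not in the excerpt above, so treat it as a sketch): the cluster already ships with Spark, so the Spark dependencies are commonly scoped as "provided" to keep them out of the packaged artifact:

libraryDependencies ++= Seq(
  // assumption: the cluster supplies the Spark JARs at runtime, so exclude them from packaging
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
)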

Once we have defined the build file, we can start adding code to our project.
sbt uses the same directory structure as Maven projects:

src/
  main/
    resources/
       <files to include in main jar here>
    scala/
       <main Scala sources>
    java/
       <main Java sources>
  test/
    resources/
       <files to include in test jar here>
    scala/
       <test Scala sources>
    java/
       <test Java sources>

We put the source code in the scala and java directories. The following is an example of Scala Spark code that initializes the SparkSession, runs the application, and then exits.

// package must match the fully qualified class name passed to spark-submit below
package com.databricks.example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object DataFrameExample extends Serializable {
  // placeholder UDF: the original excerpt references someUDF without defining it
  def someUDF(value: String): String = value

  def main(args: Array[String]) = {
    val pathToDataFolder = args(0)
    // start up the SparkSession
    // along with explicitly setting a given config
    val spark = SparkSession.builder().appName("Spark Example")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .getOrCreate()

    // udf registration
    spark.udf.register("myUDF", someUDF(_: String): String)
    val df = spark.read.json(pathToDataFolder + "data.json")
    val manipulated = df.groupBy(expr("myUDF(group)")).sum().collect()
      .foreach(x => println(x))
  }
}
This main class is what we specify to spark-submit when we submit the application to the cluster for execution.

For compilation we have the following options:
  • We can use sbt assembly (provided by the sbt-assembly plugin; see the sketch after this list) to build an “uber-jar” or “fat-jar” that contains all of the dependencies in one JAR. This can be simple for some deployments but cause complications (especially dependency conflicts) for others.
  • A lighter-weight approach is to run sbt package, which will gather all of your dependencies into the target folder but will not package all of them into one big JAR.
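As a rough sketch (the plugin coordinates and version below are assumptions, not part of the original post), sbt assembly comes from the sbt-assembly plugin, which is enabled in project/plugins.sbt:

// project/plugins.sbt
// assumption: choose a plugin version compatible with your sbt release
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

With the plugin in place, sbt assembly writes an uber-JAR under target/scala-2.11/ (by default named example-assembly-0.1-SNAPSHOT.jar for the build above), and that JAR can be handed to spark-submit just like the packaged one shown next.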

After compilation, the target folder contains a JAR that we can pass to spark-submit (the trailing "hello" argument is handed to the application as args(0)).
Example:
$SPARK_HOME/bin/spark-submit \
   --class com.databricks.example.DataFrameExample \
   --master local \
   target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar "hello"
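The invocation above runs the application locally. As a hedged variation (these are standard spark-submit options, not part of the original example), the same JAR could instead be submitted to a cluster manager such as YARN:

$SPARK_HOME/bin/spark-submit \
   --class com.databricks.example.DataFrameExample \
   --master yarn \
   --deploy-mode cluster \
   target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar "hello"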
