Tuesday, 23 April 2019

CH16.1 Developing Spark Applications


Spark Applications are the combination of two things: a Spark cluster and your code. 
The following walks through building and running a sample application, starting with Scala.
Scala-Based App:
Scala is Spark’s “native” language and naturally makes for a great way to write applications. Writing a Spark application in Scala is really no different from writing any other Scala application.

You can build applications using sbt or Apache Maven, two Java Virtual Machine (JVM)–based build tools. 
To configure an sbt build for our Scala application, we specify a build.sbt file to manage the package information. Inside the build.sbt file, there are a few key things to include:
  • Project metadata (package name, package versioning information, etc.)
  • Where to resolve dependencies
  • Dependencies needed for your library

The following is what a sample build.sbt file looks like. Notice that we must specify the Scala version as well as the Spark version:

name := "example"
organization := "com.databricks"
version := "0.1-SNAPSHOT"
scalaVersion := "2.11.8"
// Spark Information
val
sparkVersion = "2.2.0"
// allows us to include spark packages
resolvers += "bintray-spark-packages" at
  "
https://dl.bintray.com/spark-packages/maven/"
resolvers += "Typesafe Simple Repository" at
  "
http://repo.typesafe.com/typesafe/simple/maven-releases/"
resolvers += "MavenRepository" at
  "
https://mvnrepository.com/"
libraryDependencies ++= Seq(
  // spark core
  "org.apache.spark"
%% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
// the rest of the file is omitted for brevity
)
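One optional refinement when the JAR is destined for spark-submit (this is not in the excerpt above, so treat it as a sketch): the cluster already ships with Spark, so the Spark dependencies are commonly scoped as "provided" to keep them out of the packaged artifact:

libraryDependencies ++= Seq(
  // assumption: the cluster supplies the Spark JARs at runtime, so exclude them from packaging
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided"
)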

Once we have defined the build file, we can start adding code to our project.
sbt uses the same directory structure as Maven projects:

src/
  main/
    resources/
       <files to include in main jar here>
    scala/
       <main Scala sources>
    java/
       <main Java sources>
  test/
    resources/
       <files to include in test jar here>
    scala/
       <test Scala sources>
    java/
       <test Java sources>

We put the source code in the scala and java directories. The following is an example of Scala Spark code that initializes the SparkSession, runs the application, and then exits.

// package must match the fully qualified class name passed to spark-submit below
package com.databricks.example

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

object DataFrameExample extends Serializable {
  // placeholder UDF: the original excerpt references someUDF without defining it
  def someUDF(value: String): String = value

  def main(args: Array[String]) = {
    val pathToDataFolder = args(0)
    // start up the SparkSession
    // along with explicitly setting a given config
    val spark = SparkSession.builder().appName("Spark Example")
      .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
      .getOrCreate()

    // udf registration
    spark.udf.register("myUDF", someUDF(_: String): String)
    val df = spark.read.json(pathToDataFolder + "data.json")
    val manipulated = df.groupBy(expr("myUDF(group)")).sum().collect()
      .foreach(x => println(x))
  }
}
This main class is what we specify to spark-submit when we submit the application to the cluster for execution.

For compilation we have the following options:
  • We can use sbt assembly (provided by the sbt-assembly plugin; see the sketch after this list) to build an “uber-jar” or “fat-jar” that contains all of the dependencies in one JAR. This can be simple for some deployments but cause complications (especially dependency conflicts) for others.
  • A lighter-weight approach is to run sbt package, which will gather all of your dependencies into the target folder but will not package all of them into one big JAR.
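As a rough sketch (the plugin coordinates and version below are assumptions, not part of the original post), sbt assembly comes from the sbt-assembly plugin, which is enabled in project/plugins.sbt:

// project/plugins.sbt
// assumption: choose a plugin version compatible with your sbt release
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

With the plugin in place, sbt assembly writes an uber-JAR under target/scala-2.11/ (by default named example-assembly-0.1-SNAPSHOT.jar for the build above), and that JAR can be handed to spark-submit just like the packaged one shown next.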

After compilation, the target folder contains a JAR that we can pass to spark-submit (the trailing "hello" argument is handed to the application as args(0)).
Example:
$SPARK_HOME/bin/spark-submit \
   --class com.databricks.example.DataFrameExample \
   --master local \
   target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar "hello"
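The invocation above runs the application locally. As a hedged variation (these are standard spark-submit options, not part of the original example), the same JAR could instead be submitted to a cluster manager such as YARN:

$SPARK_HOME/bin/spark-submit \
   --class com.databricks.example.DataFrameExample \
   --master yarn \
   --deploy-mode cluster \
   target/scala-2.11/example_2.11-0.1-SNAPSHOT.jar "hello"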
