Submit and debug Spark jobs programmatically

This is my first post in the Data Engineering category. In this post, I discuss how to submit Spark 2 apps programmatically in yarn-cluster mode, and debug them remotely (e.g., from your favorite developer laptop using your favorite IDE :)).

What is Spark?

Spark, originally developed as part of a research project [1] at the University of California, Berkeley and now an Apache Software Foundation project, is a framework for distributed computation. It was originally designed for a specific type of application: iterative ones that reuse intermediate results (e.g., ML algorithms). The idea is to keep these intermediate results in memory (in contrast to MapReduce, which writes results to disk between jobs), significantly speeding up subsequent computations on that data. The primary abstraction in Spark is the Resilient Distributed Dataset (RDD) [2], a dataset that can reside in distributed memory across a computing cluster.
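To make the "reuse intermediate results" idea concrete, here is a minimal sketch that caches an RDD and runs two actions against it. It assumes Spark 2 with spark-sql on the classpath and a local master; the object and method names (CacheSketch, cachedMean) are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object CacheSketch {
  // Computes the mean of 1..n, reusing one cached RDD for two separate actions.
  def cachedMean(spark: SparkSession, n: Int): Double = {
    val numbers = spark.sparkContext
      .parallelize(1 to n)
      .map(_.toDouble)
      .cache() // mark the RDD to be kept in memory once it is first computed

    val total = numbers.sum()   // first action: computes the RDD and fills the cache
    val count = numbers.count() // second action: can be served from cached partitions
    numbers.unpersist()         // release the cached partitions when done
    total / count
  }

  def main(args: Array[String]): Unit = {
    // local[*] is an assumption for trying this out on a laptop
    val spark = SparkSession.builder
      .appName("cache-sketch")
      .master("local[*]")
      .getOrCreate()
    try println(cachedMean(spark, 1000000))
    finally spark.stop()
  }
}
```

Without the cache() call, the second action would recompute the map step from scratch, which is exactly the cost that iterative algorithms want to avoid.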

Submit Spark jobs programmatically

Spark applications are usually submitted to YARN using the spark-submit command. When the same capability is needed programmatically, Spark provides the SparkLauncher class, which launches a Spark app as a child process that can then be monitored through the launcher's monitoring API (the SparkAppHandle returned by startApplication()).

Submitting jobs programmatically is useful when jobs need to be created on the fly. SparkLauncher makes this easy:

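import org.apache.spark.launcher.SparkLauncher

var sparkLauncher = new SparkLauncher()
  .setSparkHome("/path/to/spark")           // placeholder: local Spark install
  .setAppResource("/path/to/your-app.jar")  // placeholder: application JAR
  .setMainClass("com.example.YourApp")      // placeholder: main class
  .addSparkArg("--master", "yarn")
  .addSparkArg("--deploy-mode", "cluster")  // run the driver on the cluster

The above code snippet configures a Spark app for submission in yarn-cluster mode; nothing is submitted until launch() or startApplication() is called on the launcher. Note that the Spark binaries must be present locally, and their path must be specified with the setSparkHome method (or through the SPARK_HOME environment variable). The paths and class name shown are placeholders.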
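As a rough sketch of the monitoring side, the following helper starts a configured launcher with startApplication() and blocks until the app reaches a final state. It assumes the spark-launcher library is on the classpath and that a reachable YARN cluster is configured; the object and method names (LaunchAndMonitor, launchAndWait) are hypothetical.

```scala
import java.util.concurrent.CountDownLatch
import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

object LaunchAndMonitor {
  // Starts the app described by `launcher` and waits for a final state
  // (FINISHED, FAILED, KILLED or LOST), printing each state change.
  def launchAndWait(launcher: SparkLauncher): SparkAppHandle.State = {
    val done = new CountDownLatch(1)

    val handle = launcher.startApplication(new SparkAppHandle.Listener {
      override def stateChanged(h: SparkAppHandle): Unit = {
        // getAppId may still be null before YARN has assigned an application ID
        println(s"state=${h.getState} appId=${h.getAppId}")
        if (h.getState.isFinal) done.countDown()
      }
      override def infoChanged(h: SparkAppHandle): Unit = ()
    })

    done.await() // block the calling thread until the app finishes
    handle.getState
  }
}
```

With the launcher from the snippet above, this could be called as LaunchAndMonitor.launchAndWait(sparkLauncher). If you only need the raw child process, launch() returns a java.lang.Process instead of a handle.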

Debug Spark jobs remotely

Being able to submit Spark apps programmatically is awesome, but being able to debug them remotely from your favorite IDE is even more awesome!

To do this, we need to set the spark.driver.extraJavaOptions property so that the Spark driver JVM starts a debug agent and waits until a debugger is attached. This can be done like so:

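import java.util.concurrent.ThreadLocalRandom

// Pick a port in [30000, 65536) for the driver's debug agent
val randomPort = ThreadLocalRandom.current.nextInt(30000, 65536)
sparkLauncher = sparkLauncher.setConf(
  "spark.driver.extraJavaOptions",
  "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=" + randomPort)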

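The above snippet picks a random number in the range 30000 (inclusive) to 65536 (exclusive), and uses it as the port on which the Spark driver listens and waits (suspend=y) for a debugger to attach. Once the Spark app is submitted, use your favorite IDE (e.g., IntelliJ IDEA) to attach a debugger to the remote host and port where the Spark driver is listening. In yarn-cluster mode the driver runs inside the YARN ApplicationMaster, so the host is the cluster node running that container, not the machine that submitted the app.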
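If you build this option string in more than one place, a small helper can keep it consistent. The sketch below uses only the standard library; the names (DebugOptions, jdwpOption, randomPort) and the optional host parameter are my own additions for illustration, not part of Spark.

```scala
import java.util.concurrent.ThreadLocalRandom

object DebugOptions {
  // Builds the JDWP agent option for a JVM that listens for a debugger.
  // `host` is optional; when given, it is prefixed to the port as host:port.
  def jdwpOption(port: Int, suspend: Boolean = true, host: Option[String] = None): String = {
    require(port > 0 && port < 65536, s"invalid port: $port")
    val suspendFlag = if (suspend) "y" else "n"
    val address = host.map(h => s"$h:$port").getOrElse(port.toString)
    s"-agentlib:jdwp=transport=dt_socket,server=y,suspend=$suspendFlag,address=$address"
  }

  // Picks a port in [low, high), mirroring the range used in this post.
  def randomPort(low: Int = 30000, high: Int = 65536): Int =
    ThreadLocalRandom.current.nextInt(low, high)
}
```

A call such as sparkLauncher.setConf("spark.driver.extraJavaOptions", DebugOptions.jdwpOption(DebugOptions.randomPort())) would then replace the string concatenation. One caveat worth checking for your setup: on Java 9 and later, a bare port makes the agent listen on localhost only, so remote attach may need host = Some("*"). Also note that the port is chosen on the submitting machine, so there is no guarantee it is free on the node where the driver ends up running.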

That’s it! You can now step through and inspect your code in action.

  1. Zaharia, M., Chowdhury, M., Franklin, M. J., Shenker, S., & Stoica, I. (2010). Spark: Cluster computing with working sets. HotCloud, 10(10-10), 95. 

  2. Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., … & Stoica, I. (2012, April). Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (pp. 2-2). USENIX Association.