Calculating the Mandelbrot Set using Spark on a Kubernetes cluster
An example application that calculates a Mandelbrot Set using Spark. The goal is to demonstrate the use of the Spark computing framework to perform calculations
while using Kubernetes as the underlying cluster management platform. The solution is not intended to be optimal in terms of choice of technologies or performance.
The calculation of the Mandelbrot Set is done in two steps:
First, an input data file needs to be created with the data points for the iteration process. The data has two types of coordinates combined within one data line.
The input data file is created by the mandelbrot.Main.generateInputData function, and the format of one line is as below:

image-x,image-y,position-x,position-y,number-of-iterations

Where:
- image-x, image-y: coordinates in image space; the top left corner is (0,0) and the bottom right corner is defined by the -sx and -sy parameters (horizontal and vertical image sizes respectively) - see the run-spark-k8s-mand-generate.sh batch below
- position-x, position-y: the corresponding point in Mandelbrot Set coordinate space, as defined by the -tl and -br parameters
- number-of-iterations: the number of iterations to perform for this data point

The application currently provides the same number of iterations for each data point in the input data file; carrying the number of iterations on each data point separately keeps the calculation logic simpler and is theoretically more flexible if necessary.
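The exact implementation is in mandelbrot.Main.generateInputData; the following is only a rough sketch of the idea, where the column names, the pixel-to-set coordinate mapping, and all literal values are illustrative assumptions rather than the repository's actual code:

import org.apache.spark.sql.SparkSession

object GenerateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("mandelbrot-generate").getOrCreate()
    import spark.implicits._

    val (sx, sy, iterations) = (800, 600, 1024)   // -sx, -sy, -i
    val (tlX, tlY) = (-2.2, 1.2)                  // -tl: top left corner in set coordinates
    val (brX, brY) = (1.0, -1.2)                  // -br: bottom right corner in set coordinates

    // One row per pixel: image space coordinates plus the corresponding
    // point in Mandelbrot Set coordinate space, plus the iteration limit.
    val input = spark.range(0L, sx.toLong * sy).map { i =>
      val imgX = (i % sx).toInt
      val imgY = (i / sx).toInt
      val posX = tlX + imgX * (brX - tlX) / (sx - 1)
      val posY = tlY + imgY * (brY - tlY) / (sy - 1)
      (imgX, imgY, posX, posY, iterations)
    }.toDF("img_x", "img_y", "pos_x", "pos_y", "iterations")

    input.write.mode("overwrite").parquet("hdfs://ip-172-31-36-93:4444/input-800-600")
    spark.stop()
  }
}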
Batch file to generate the input data file: run-spark-k8s-mand-generate.sh
hdfs dfs -put -f scala-spark-mandelbrot-assembly-0.3.jar /scala-spark-mandelbrot-assembly-0.3.jar
hdfs dfs -rm -r -f /input-800-600
$SPARK_HOME/bin/spark-submit \
  --class mandelbrot.Main \
  --master k8s://172.31.36.93:6443 \
  --deploy-mode cluster \
  --executor-memory 1G \
  --total-executor-cores 3 \
  --name mandelbrot \
  --conf spark.kubernetes.container.image=krzsam/spark:spark-docker \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  hdfs://ip-172-31-36-93:4444/scala-spark-mandelbrot-assembly-0.3.jar \
  -h hdfs://ip-172-31-36-93:4444/ \
  -c generate \
  -f hdfs://ip-172-31-36-93:4444/input-800-600 \
  -tl -2.2,1.2 \
  -br 1.0,-1.2 \
  -sx 800 \
  -sy 600 \
  -i 1024
Application parameters:
- -h: URL of the HDFS filesystem to use
- -c: command to run, here generate
- -f: name of the input data file to create; the file is managed by Spark and will be physically represented as a directory with files containing partitioned data
- -tl: top left corner of the area to calculate, in Mandelbrot Set coordinate space
- -br: bottom right corner of the area to calculate, in Mandelbrot Set coordinate space
- -sx: horizontal size of the image to generate, in pixels
- -sy: vertical size of the image to generate, in pixels
- -i: number of iterations to set for each data point

Currently the application writes the input file data in Parquet format, but this can be changed to any other format supported by Spark.
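Continuing the sketch above, switching the storage format is a matter of changing the writer and the corresponding reader calls, for example:

// Parquet, as currently used (path illustrative, as before):
input.write.mode("overwrite").parquet("hdfs://ip-172-31-36-93:4444/input-800-600")
val readBack = spark.read.parquet("hdfs://ip-172-31-36-93:4444/input-800-600")

// Any other format supported by Spark can be substituted, e.g. ORC:
input.write.mode("overwrite").format("orc").save("hdfs://ip-172-31-36-93:4444/input-800-600")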
In the second step, once the input data file is created on HDFS, the calculation process reads it line by line and iterates each data point defined by each input line
using the Mandelbrot Set formula. This step is run with the batch file run-spark-k8s-mand-calculate.sh, and the actual calculation
is done by the mandelbrot.Main.calculateImage function.
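The Mandelbrot Set formula here is the standard escape-time iteration z → z² + c. As a hedged sketch of the calculate step (the helper name, column positions and the result column name are assumptions, not the repository's actual code), the input could be read back and each point iterated as below, producing the calculated data set used in the aggregation further down:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("mandelbrot-calculate").getOrCreate()
import spark.implicits._

// Standard escape-time iteration: counts how many iterations the point
// (posX, posY) survives before |z| exceeds 2, capped at maxIterations.
def iteratePoint(posX: Double, posY: Double, maxIterations: Int): Int = {
  var zx = 0.0
  var zy = 0.0
  var n = 0
  while (n < maxIterations && zx * zx + zy * zy <= 4.0) {
    val tmp = zx * zx - zy * zy + posX
    zy = 2.0 * zx * zy + posY
    zx = tmp
    n += 1
  }
  n
}

// Columns as generated above: img_x, img_y, pos_x, pos_y, iterations.
val calculated = spark.read
  .parquet("hdfs://ip-172-31-36-93:4444/input-800-600")
  .map { row =>
    (row.getInt(0), row.getInt(1),
      iteratePoint(row.getDouble(2), row.getDouble(3), row.getInt(4)))
  }
  .toDF("img_x", "img_y", "result")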
The size of the image to generate is gathered from the input data using:
val dim_x = calculated.agg( sql.functions.max( "img_x" ) ).head().getInt( 0 ) + 1
val dim_y = calculated.agg( sql.functions.max( "img_y" ) ).head().getInt( 0 ) + 1
The above is not optimal from a performance perspective, as the input data needs to be re-analyzed,
but it is done this way as an example of using aggregation functions.
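As a design note, both maxima could also be gathered in a single aggregation job instead of two, for example:

import org.apache.spark.sql.functions.max

// One pass over the data for both dimensions:
val dims = calculated.agg(max("img_x"), max("img_y")).head()
val dimX = dims.getInt(0) + 1
val dimY = dims.getInt(1) + 1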
Batch file to run the calculation step: run-spark-k8s-mand-calculate.sh
hdfs dfs -put -f scala-spark-mandelbrot-assembly-0.3.jar /scala-spark-mandelbrot-assembly-0.3.jar
$SPARK_HOME/bin/spark-submit \
  --class mandelbrot.Main \
  --master k8s://172.31.36.93:6443 \
  --deploy-mode cluster \
  --executor-memory 1G \
  --total-executor-cores 3 \
  --name mandelbrot \
  --conf spark.kubernetes.container.image=krzsam/spark:spark-docker \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  hdfs://ip-172-31-36-93:4444/scala-spark-mandelbrot-assembly-0.3.jar \
  -h hdfs://ip-172-31-36-93:4444/ \
  -c calculate \
  -f hdfs://ip-172-31-36-93:4444/input-800-600
Application parameters:
- -h: URL of the HDFS filesystem to use
- -c: command to run, here calculate
- -f: name of the input data file to read; the image size and the number of iterations are taken from the data itself
The calculation step produces a single PNG file on HDFS, whose name corresponds to the input data file name, as below:

input-800-600.<timestamp>.png
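The repository's actual image-writing code may differ; as a hedged sketch (the result column name and the greyscale mapping are assumptions), the per-pixel results could be collected to the driver and the PNG written to HDFS through the Hadoop FileSystem API:

import java.awt.image.BufferedImage
import javax.imageio.ImageIO
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: render the collected results into a PNG on HDFS.
def writePng(spark: SparkSession, calculated: DataFrame,
             dimX: Int, dimY: Int, maxIterations: Int, fileName: String): Unit = {
  val image = new BufferedImage(dimX, dimY, BufferedImage.TYPE_INT_RGB)
  calculated.select("img_x", "img_y", "result").collect().foreach { row =>
    val shade = 255 * row.getInt(2) / maxIterations   // simple greyscale mapping
    val rgb = (shade << 16) | (shade << 8) | shade
    image.setRGB(row.getInt(0), row.getInt(1), rgb)
  }
  val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
  val out = fs.create(new Path(fileName))
  try ImageIO.write(image, "png", out) finally out.close()
}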
An example image produced by the calculation is shown below:
[example output image: rendered Mandelbrot Set, 800x600]
The application is built in the same way and uses the same Spark Docker image as created for
the scala-spark-example project (github.com/krzsam/scala-spark-example).
For specific details on running Spark on Kubernetes, including creating the Spark Docker image, please refer to
Running application on Spark via Kubernetes