Project author: krzsam

Project description: Calculating Mandelbrot Set using Spark via Kubernetes cluster

Language: Scala
Repository: git://github.com/krzsam/scala-spark-mandelbrot.git
Created: 2019-07-23T22:50:54Z
Project community: https://github.com/krzsam/scala-spark-mandelbrot

License: GNU General Public License v3.0

scala-spark-mandelbrot

An example application that calculates a Mandelbrot Set using Spark. The goal is to demonstrate the use of the Spark computing framework for performing calculations while using Kubernetes as the underlying cluster-management platform. The solution is not intended to be optimal in terms of either choice of technologies or performance.

Application structure

The calculation of the Mandelbrot Set is done in two steps:

  • Input data file generation
  • Iteration process and creation of the image

Input file generation

First, an input data file needs to be created with the data points for the iteration process. Each data line combines two types of coordinates:

  • image space coordinates
  • data point coordinates - the value of c in the Mandelbrot iteration formula z(n+1) = z(n)² + c

The application currently assigns the same number of iterations to every data point in the input data file; carrying the iteration count on each data point nevertheless keeps the calculation logic simpler, and is theoretically more flexible should the count ever need to vary per point.

The input data file is created by the mandelbrot.Main.generateInputData function. Each line has the format below. In image space coordinates, the top-left corner is (0,0) and the bottom-right corner is defined by the -sx and -sy parameters (the horizontal and vertical image sizes respectively) - see the run-spark-k8s-mand-generate.sh batch file below.

  image-x,image-y,position-x,position-y,number-of-iterations
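
For instance, with the -tl, -sx/-sy and -i values used in the batch file below, the top-left pixel would produce the line:

  0,0,-2.2,1.2,1024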

Where:

  • image-x : X position of the data point in image space
  • image-y : Y position of the data point in image space
  • position-x : X position of the data point in data space
  • position-y : Y position of the data point in data space
  • number-of-iterations : number of iterations for the given data point
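
A minimal sketch of this generation step is shown below. It is not the project's actual mandelbrot.Main.generateInputData code: the InputPoint case class, the pos_x/pos_y/iterations field names, and the linear pixel-to-plane mapping are assumptions (img_x/img_y are borrowed from the aggregation snippet further down).

  import org.apache.spark.sql.SparkSession

  // Hypothetical record type for one input line.
  case class InputPoint(img_x: Int, img_y: Int, pos_x: Double, pos_y: Double, iterations: Int)

  def generateInput(spark: SparkSession,
                    tlX: Double, tlY: Double,   // -tl : top-left corner in data space
                    brX: Double, brY: Double,   // -br : bottom-right corner in data space
                    sx: Int, sy: Int,           // -sx/-sy : image size in pixels
                    iterations: Int,            // -i : iteration budget per point
                    path: String): Unit = {
    import spark.implicits._
    val stepX = (brX - tlX) / sx                // data-space width of one pixel
    val stepY = (brY - tlY) / sy                // negative: image y grows downwards
    val points = for {
      y <- 0 until sy
      x <- 0 until sx
    } yield InputPoint(x, y, tlX + x * stepX, tlY + y * stepY, iterations)
    // Spark writes the dataset as a directory of Parquet part-files
    spark.createDataset(points).write.parquet(path)
  }

Writing through a Dataset is what makes the input "file" a partitioned directory, as noted for the -f parameter below.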

Batch file to generate the input data file: run-spark-k8s-mand-generate.sh

  hdfs dfs -put -f scala-spark-mandelbrot-assembly-0.3.jar /scala-spark-mandelbrot-assembly-0.3.jar
  hdfs dfs -rm -r -f /input-800-600
  $SPARK_HOME/bin/spark-submit \
      --class mandelbrot.Main \
      --master k8s://172.31.36.93:6443 \
      --deploy-mode cluster \
      --executor-memory 1G \
      --total-executor-cores 3 \
      --name mandelbrot \
      --conf spark.kubernetes.container.image=krzsam/spark:spark-docker \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      hdfs://ip-172-31-36-93:4444/scala-spark-mandelbrot-assembly-0.3.jar \
      -h hdfs://ip-172-31-36-93:4444/ \
      -c generate \
      -f hdfs://ip-172-31-36-93:4444/input-800-600 \
      -tl -2.2,1.2 \
      -br 1.0,-1.2 \
      -sx 800 \
      -sy 600 \
      -i 1024

Application parameters:

  • -h : specifies the HDFS URI
  • -c : specifies the step; for input file generation the value should be generate
  • -f : specifies the HDFS location for the input data. Note that the input data file will in fact be a directory: the data is partitioned by Spark and physically represented as a directory of files containing the partitioned data. Currently the application writes the input data in Parquet format, but this can be changed to any other supported format.
  • -tl : specifies the top-left corner in input data coordinates, provided as a comma-separated pair of decimal numbers (a worked example follows this list)
  • -br : specifies the bottom-right corner in input data coordinates, provided as a comma-separated pair of decimal numbers
  • -sx : specifies the horizontal image size
  • -sy : specifies the vertical image size
  • -i : the number of iterations each data point will be iterated through
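
As a worked example of the mapping (assuming the linear mapping from the sketch above): with -tl -2.2,1.2, -br 1.0,-1.2, -sx 800 and -sy 600, one pixel covers (1.0 - (-2.2)) / 800 = 0.004 horizontally and (-1.2 - 1.2) / 600 = -0.004 vertically (negative because image y grows downwards while data y grows upwards). Pixel (0,0) therefore maps to (-2.2, 1.2) and pixel (400,300) maps to (-0.6, 0.0).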

Iteration process and creation of the image

Once the input data file is created on HDFS, the calculation step reads it record by record and iterates each data point using the Mandelbrot Set formula; the step is launched with the batch file run-spark-k8s-mand-calculate.sh shown below. The actual calculation
is done by the mandelbrot.Main.calculateImage function.
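
The snippet below is a minimal sketch of the escape-time iteration each data point goes through; it is not the project's actual calculateImage code, and the helper name is an assumption (the bail-out radius of 2 is the standard escape radius for the Mandelbrot Set).

  // Escape-time iteration z(0) = 0, z(n+1) = z(n)^2 + c for one point (cx, cy).
  // Returns the number of iterations before |z| exceeds 2, capped at maxIter.
  def iterate(cx: Double, cy: Double, maxIter: Int): Int = {
    var zx = 0.0
    var zy = 0.0
    var n  = 0
    while (n < maxIter && zx * zx + zy * zy <= 4.0) {
      val tmp = zx * zx - zy * zy + cx   // real part of z^2 + c
      zy = 2.0 * zx * zy + cy            // imaginary part of z^2 + c
      zx = tmp
      n += 1
    }
    n
  }

A point is treated as belonging to the set (up to the -i iteration budget) exactly when the loop reaches maxIter without bailing out.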

The size of the image to generate is derived from the input data using:

  val dim_x = calculated.agg( sql.functions.max( "img_x" ) ).head().getInt( 0 ) + 1   // width: largest X index + 1
  val dim_y = calculated.agg( sql.functions.max( "img_y" ) ).head().getInt( 0 ) + 1   // height: largest Y index + 1

This is not optimal from a performance perspective, as it requires an extra pass over the input data, but it is done this way as an example of using aggregation functions.

Batch file to run the calculation and image generation: run-spark-k8s-mand-calculate.sh

  hdfs dfs -put -f scala-spark-mandelbrot-assembly-0.3.jar /scala-spark-mandelbrot-assembly-0.3.jar
  $SPARK_HOME/bin/spark-submit \
      --class mandelbrot.Main \
      --master k8s://172.31.36.93:6443 \
      --deploy-mode cluster \
      --executor-memory 1G \
      --total-executor-cores 3 \
      --name mandelbrot \
      --conf spark.kubernetes.container.image=krzsam/spark:spark-docker \
      --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
      hdfs://ip-172-31-36-93:4444/scala-spark-mandelbrot-assembly-0.3.jar \
      -h hdfs://ip-172-31-36-93:4444/ \
      -c calculate \
      -f hdfs://ip-172-31-36-93:4444/input-800-600

Application parameters:

  • -h : specifies the HDFS URI
  • -c : specifies the step; for calculation and image generation the value should be calculate
  • -f : specifies the HDFS location of the input data file as created in the generation step above

The calculation step produces a single PNG file on HDFS whose name corresponds to the input data file name:

  input-800-600.<timestamp>.png
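
The snippet below sketches how such a PNG could be assembled and written to HDFS; it is not the project's actual code, and the function name, the iterations column name, the greyscale palette, and the collect-to-driver approach are assumptions.

  import java.awt.image.BufferedImage
  import javax.imageio.ImageIO
  import org.apache.hadoop.conf.Configuration
  import org.apache.hadoop.fs.{FileSystem, Path}
  import org.apache.spark.sql.{DataFrame, Row}

  // Shade each pixel by its iteration count and write one PNG to HDFS.
  def writePng(calculated: DataFrame, dimX: Int, dimY: Int, maxIter: Int, target: String): Unit = {
    val img = new BufferedImage(dimX, dimY, BufferedImage.TYPE_INT_RGB)
    // For an 800x600 image, collecting to the driver is acceptable;
    // much larger images would need a different approach.
    calculated.select("img_x", "img_y", "iterations").collect().foreach {
      case Row(x: Int, y: Int, n: Int) =>
        val shade = (255 * n) / maxIter               // simple greyscale palette
        img.setRGB(x, y, (shade << 16) | (shade << 8) | shade)
    }
    val fs  = FileSystem.get(new Configuration())     // resolves the cluster's HDFS
    val out = fs.create(new Path(target))             // e.g. input-800-600.<timestamp>.png
    try ImageIO.write(img, "png", out) finally out.close()
  }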

An example image produced by the calculation:

[image: Example Mandelbrot Set]

Application build

The application is built in the same way, and uses the same Spark Docker image, as the scala-spark-example project (https://github.com/krzsam/scala-spark-example).

For specific details on running Spark on Kubernetes, including creating the Spark Docker image, please refer to
Running application on Spark via Kubernetes