Project author: ly3too

Project description: big-data cluster playground
Language: Shell
Project URL: git://github.com/ly3too/bdplay.git
Created: 2020-02-15T12:50:12Z
Project community: https://github.com/ly3too/bdplay

License:



bigdata playground - bdplay

Bdplay aims to provide a simple deployment of a big-data platform for beginners and for testing purposes.
It packages the big-data software in one docker image and provides the necessary configuration, which makes it easy
to deploy as a cluster with docker-compose or kubernetes.

Bdplay integrates kafka, hadoop, flink and spark, and plans to include more.

You can easily submit batch and streaming jobs to the cluster and take advantage of the co-operation of the big-data components.

flink and spark can take advantage of data locality with hdfs and kafka on the same node, which keeps the data
processing overhead low.

build

cd to the docker directory
run sudo bash build.sh
a docker image named bdplay will be ready
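
For example, a minimal build run could look like this (assuming git and docker are installed):

    # clone the repository and build the image
    git clone https://github.com/ly3too/bdplay.git
    cd bdplay/docker
    sudo bash build.sh

    # the bdplay image should now be listed
    sudo docker images | grep bdplay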

deployment

Every big-data component is installed under the /opt directory.

Each container has the role of either master or worker. Currently, only a single master is supported. You can add as many workers
as you want to scale up the computing capability.

Start the master simply with CMD: master.

Start a worker simply with CMD: worker.

All configuration is done via docker environment variables or directly via a config map.
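
As a rough standalone sketch (outside of docker-compose or kubernetes), a master and a worker could be started with plain docker roughly as below; this assumes a zookeeper container named zoo1 on the same docker network, and the same image, commands and environment variables are used in the compose file that follows:

    # illustrative only; the compose/kubernetes deployments below are the intended way
    docker network create bdplay-net
    docker run -d --name zoo1 --network bdplay-net zookeeper

    # master container: CMD "master", kept alive with tail
    docker run -d --name master --network bdplay-net \
        -e HOSTNAME_MASTER=master \
        -e KAFKA_ZOOKEEPER_CONNECT=zoo1:2181 \
        bdplay master tail -f /dev/null

    # worker container: CMD "worker", pointed at the master via HOSTNAME_MASTER
    docker run -d --name worker1 --network bdplay-net \
        -e HOSTNAME_MASTER=master \
        -e KAFKA_ZOOKEEPER_CONNECT=zoo1:2181 \
        bdplay worker tail -f /dev/null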

deploy by docker-compose

cd to an empty directory, create a new file docker-compose.yml and put in the following content:

    version: "3"
    services:
      zoo1:
        image: zookeeper
        restart: always
        hostname: zoo1
        ports:
          - 2181:2181
        environment:
          ZOO_MY_ID: 1
          ZOO_SERVERS: server.1=0.0.0.0:2888:3888;2181
      master:
        image: bdplay
        expose:
          - "6123"
        ports:
          - 6123:6123
          - 8081:8081
          - 9000:9000
          - 9870:9870
          - 8088:8088
          - 19888:19888
          - 10020:10020
          - 7077:7077
          - 8082:8082
          - 9092:9092
        command: master tail -f /dev/null
        links:
          - "zoo1:zoo1"
        environment:
          - HOSTNAME_MASTER=master
          - KAFKA_ZOOKEEPER_CONNECT=zoo1:2181
      worker:
        image: bdplay
        expose:
          - "6121"
          - "6122"
        depends_on:
          - master
        command: worker tail -f /dev/null
        links:
          - "master:master"
          - "zoo1:zoo1"
        environment:
          - HOSTNAME_MASTER=master
          - KAFKA_ZOOKEEPER_CONNECT=zoo1:2181

run docker-compose up --scale worker=2

flink web ui will be available at: http://localhost:8081

spark web ui at: http://localhost:8082

hadoop resource manager web ui at http://localhost:8088

hadoop hdfs webui at http://localhost:9870
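
To check that the containers came up, the usual docker-compose status and log commands can be used, for example:

    # list the services and their state
    docker-compose ps

    # follow the logs of the master container
    docker-compose logs -f master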

deploy by kubernetes

create a yaml file bdplay.yaml with the following content:

    apiVersion: v1
    kind: Service
    metadata:
      name: zk
      labels:
        app: zk
    spec:
      ports:
        - port: 2181
          name: client
      selector:
        app: zk
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: zk
    spec:
      selector:
        matchLabels:
          app: zk
      template:
        metadata:
          labels:
            app: zk
        spec:
          containers:
            - name: zk
              image: zookeeper
              imagePullPolicy: IfNotPresent
              env:
                - name: ZOO_MY_ID
                  value: "1"
                - name: ZOO_SERVERS
                  value: server.1=0.0.0.0:2888:3888;2181
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: bdplay-master
      labels:
        app: bdplay
    spec:
      ports:
        - port: 6123
          name: flink-rpc
        - port: 8081
          name: flink-web
        - port: 9000
          name: hdfs-server
        - port: 9870
          name: hdfs-web
        - port: 8088
          name: res-man-web
        - port: 19888
          name: job-his-web
        - port: 10020
          name: job-rpc
        - port: 7077
          name: spark-serv
        - port: 8082
          name: spark-web
        - port: 9092
          name: kafka
      selector:
        app: bdplay
        role: master
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: bdplay-master
      labels:
        app: bdplay
    spec:
      selector:
        matchLabels:
          app: bdplay
          role: master
      strategy:
        type: Recreate
      template:
        metadata:
          labels:
            app: bdplay
            role: master
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - bdplay
                  topologyKey: "kubernetes.io/hostname"
          containers:
            - image: bdplay
              imagePullPolicy: IfNotPresent
              name: bdplay-master
              command:
                - /opt/start.sh
                - master
                - tail
                - "-f"
                - /dev/null
              env:
                - name: HOSTNAME_MASTER
                  value: bdplay-master
                - name: KAFKA_ZOOKEEPER_CONNECT
                  value: zk:2181
                - name: SPARK_MASTER_HOST
                  value: 0.0.0.0
                - name: KAFKA_ADV_HOSTNAME
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: bdplay-worker
      labels:
        app: bdplay
    spec:
      replicas: 2
      selector:
        matchLabels:
          app: bdplay
          role: worker
      strategy:
        type: Recreate
      template:
        metadata:
          labels:
            app: bdplay
            role: worker
        spec:
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - bdplay
                  topologyKey: "kubernetes.io/hostname"
          containers:
            - image: bdplay
              imagePullPolicy: IfNotPresent
              name: bdplay-worker
              command:
                - /opt/start.sh
                - worker
                - tail
                - "-f"
                - /dev/null
              env:
                - name: HOSTNAME_MASTER
                  value: bdplay-master
                - name: KAFKA_ZOOKEEPER_CONNECT
                  value: zk:2181
                - name: KAFKA_ADV_HOSTNAME
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP

run kubectl apply -f bdplay.yaml to start the cluster.

The master and worker pods will be scheduled on different nodes (enforced by the pod anti-affinity rule).

    > kubectl get pods -o wide
    NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
    bdplay-master-59b599fd79-5z4lx   1/1     Running   0          37m   10.244.1.12   kind-worker3   <none>           <none>
    bdplay-worker-6b8dfc8897-8nhtc   1/1     Running   0          37m   10.244.2.10   kind-worker    <none>           <none>
    bdplay-worker-6b8dfc8897-wdlsn   1/1     Running   0          37m   10.244.3.12   kind-worker2   <none>           <none>
    zk-7449fd49cb-98vkn              1/1     Running   0          37m   10.244.2.9    kind-worker    <none>           <none>

start a bash shell inside a pod: kubectl exec -it bdplay-worker-6b8dfc8897-8nhtc -- gosu bdplay bash

start port forwarding: kubectl port-forward service/bdplay-master 8081:8081 9870:9870 8088:8088 19888:19888 8082:8082

Now the web UIs can be accessed from localhost just as with the docker-compose deployment.

start to play

play with hdfs

you need to start a shell inside a container to run the following commands.

put files to hdfs: hadoop fs -put /opt/spark/README.md hdfs://bdplay-master:9000/

list files:

    > hadoop fs -ls hdfs://bdplay-master:9000/
    Found 2 items
    -rw-r--r--   1 bdplay supergroup       3756 2020-02-15 12:25 hdfs://bdplay-master:9000/README.md
    drwxrwx---   - bdplay supergroup          0 2020-02-15 11:42 hdfs://bdplay-master:9000/tmp
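
To read the file back from hdfs, for example the README.md uploaded above:

    # print the file content stored in hdfs
    hadoop fs -cat hdfs://bdplay-master:9000/README.md

    # or copy it back to the local filesystem
    hadoop fs -get hdfs://bdplay-master:9000/README.md /tmp/README.md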

test kafka functionality

  1. create a topic:

       kafka-topics.sh --create --bootstrap-server localhost:9092 --replication-factor 1 --partitions 1 --topic test

  2. list topics:

       bin/kafka-topics.sh --list --bootstrap-server localhost:9092
       # or
       bin/kafka-topics.sh --list --zookeeper zk:2181

  3. send messages:

       > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
       > hello
       > world
       > ^C

  4. show all messages:

       > bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
       hello
       world
       ^C
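
To inspect the topic metadata (partitions, replication factor, leader), kafka also ships a describe command:

    # show partition count, replication factor and leader of the test topic
    bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test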

start a new flink wordcount application using maven:

    mvn archetype:generate -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-scala -DarchetypeVersion=1.9.2 \
        -DgroupId=flink.start -DartifactId=flink-start

the project structure looks like the following:

    tree quickstart
    quickstart/
    ├── pom.xml
    └── src
        └── main
            ├── resources
            │   └── log4j.properties
            └── scala
                └── org
                    └── myorg
                        └── quickstart
                            ├── BatchJob.scala
                            └── StreamingJob.scala

edit the source file as in the example:

    import org.apache.flink.api.java.utils.ParameterTool
    import org.apache.flink.streaming.api.scala._

    object StreamingJob {
      def main(args: Array[String]) {
        // set up the streaming execution environment
        val env = StreamExecutionEnvironment.getExecutionEnvironment
        // make the command-line parameters available to the job
        val params = ParameterTool.fromArgs(args)
        env.getConfig.setGlobalJobParameters(params)
        // read the input file, split it into words and count each word
        val text = env.readTextFile(params.get("input"))
        val counts = text.flatMap(_.toLowerCase.split("\\W+")).filter(_.nonEmpty).map((_, 1)).keyBy(0).sum(1)
        if (params.has("output")) {
          counts.writeAsText(params.get("output"))
        } else {
          counts.print()
        }
        // execute program
        env.execute("Flink Streaming Scala API Skeleton")
      }
    }

build it by mvn clean package

now you can submit the job by:

    flink run ./target/flink-start-1.0-SNAPSHOT.jar --input /path/to/some/text/data --output /path/to/result

you can watch the job execution details in the web ui.
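
For instance, reusing the README.md already uploaded to hdfs (the output path here is only an illustration, and this assumes the flink installation in the image is wired to the hadoop classpath):

    # word count over the file stored in hdfs; the result is written back to hdfs
    flink run ./target/flink-start-1.0-SNAPSHOT.jar \
        --input hdfs://bdplay-master:9000/README.md \
        --output hdfs://bdplay-master:9000/flink-wordcount-result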

spark example

you can use the flink maven quickstart template and change the dependencies accordingly:

    mvn archetype:generate -DarchetypeGroupId=org.apache.flink -DarchetypeArtifactId=flink-quickstart-scala -DarchetypeVersion=1.9.0 \
        -DgroupId=spark.wordcount -DartifactId=spark-wordcount

a spark wordcount example with an hdfs source and a kafka sink is available under examples/spark-wordcount

submit spark job by:

    spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0 --master spark://bdplay-master:7077 target/spark-wordcount-1.0-SNAPSHOT.jar hdfs://localhost:9000/README.md

now check out the job execution in the web ui at http://localhost:8082.

the result will be published to kafka topic: wordcount.
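
The result can then be inspected with the kafka console consumer used earlier, reading from the wordcount topic:

    # read the word counts published by the spark job
    bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic wordcount --from-beginning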

Currently, the flink deployment supports only a single jobmanager; multi-jobmanager support is being considered for the future.

configuration

flink

file: /opt/flink/conf/flink-conf.yaml

template: /opt/conf/flink/flink-conf.yaml.template

environment variables:

  1. FLINK_JOBMANAGER_HEAP_SIZE: flink jobmanager heap size, default 1024m
  2. FLINK_TASKMANAGER_HEAP_SIZE: flink taskmanager heap size, default 1024m
  3. FLINK_TASKMANAGER_NUM_SLOT: number of flink task slots, defaults to the number of cpu cores
  4. FLINK_CONF_BY_FILE: default empty; if set to true, environment variable substitution will not take effect
  5. ENABLE_BUILT_IN_PLUGINS: semicolon-separated plugin names
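
For example, to give each taskmanager a bigger heap and a fixed number of slots, these variables can be set on the worker containers. The values below are only illustrative; a plain docker run form is shown, but the same variables go into the environment section of docker-compose or the env list in kubernetes:

    # illustrative values only; tune them to the resources of your nodes
    docker run -d --name worker1 \
        -e FLINK_TASKMANAGER_HEAP_SIZE=2048m \
        -e FLINK_TASKMANAGER_NUM_SLOT=4 \
        -e HOSTNAME_MASTER=master \
        -e KAFKA_ZOOKEEPER_CONNECT=zoo1:2181 \
        bdplay worker tail -f /dev/null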

port map

port    description
6123    jobmanager.rpc.port, used to submit jobs
8081    web front end

hadoop

configuration

refer to the hadoop deployment documentation

  • etc/hadoop/core-site.xml
  • etc/hadoop/hdfs-site.xml
  • etc/hadoop/yarn-site.xml
  • etc/hadoop/mapred-site.xml

env:

key                    default                   description
HDFS_NAME_DIR          /data/hadoop/hdfs/name/   namenode dir, normally no need to change
HDFS_DATA_DIR          /data/hadoop/hdfs/data/   datanode dir, normally no need to change
HDFS_REPLICATION       1                         hdfs data replication factor
YARN_NODE_RES_MEM_MB   8192                      total memory on the NodeManager made available to running containers
HADOOP_CONFIG_BY_FILE  -                         set to true if you want to use a config map

port map

port    description
9000    hdfs server port
9870    hdfs web front end
8088    resource manager web ui
19888   job history server web ui
10020   job history rpc port

spark

env

for detailed env configuration, check out the spark documentation

useful docker env variables

key                default     description
SPARK_MASTER_HOST  localhost   spark master host

port map

port    description
7077    spark server port
8082    spark web ui port

kafka

env

key                      default   description
KAFKA_ENABLED            true      enable kafka on each worker
KAFKA_ZOOKEEPER_CONNECT  -         comma-separated zookeeper host:port list; required if kafka is enabled
KAFKA_CONF_BY_FILE       -         set to true to use direct file config

port map

port    description
9092    kafka listen port

Problems

Currently, submitting jobs from outside the cluster will fail. One possible solution is to create a
service for each worker.
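
A rough sketch of that workaround with kubectl, exposing one worker pod from the listing above as its own service (pod names will differ in your cluster, only one port is shown, and this is not yet a supported setup):

    # illustrative only: expose the kafka port of a single worker pod as a dedicated service
    kubectl expose pod bdplay-worker-6b8dfc8897-8nhtc \
        --name=bdplay-worker-0 --port=9092 --target-port=9092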

Todos

  1. config and data volume mapping.
  2. add more big-data software to the docker image.
  3. make cluster services available from outside the cluster.
  4. add more batch and streaming job examples.