Project author: hyjae

Project description:
PySpark ETL Pipeline
Primary language: Python
Repository: git://github.com/hyjae/spark-etl-pipeline.git
Created: 2019-09-11T15:59:49Z
Project page: https://github.com/hyjae/spark-etl-pipeline


Instructions

Installing Pipenv

Python 3.x must be installed beforehand.

  pip3 install pipenv

Installing Dependencies

Make sure you are in the project's root directory (the one that contains the Pipfile), then run:

  pipenv install --dev

Running the Program

Standalone

  $SPARK_HOME/bin/spark-submit \
    --master spark://192.168.210.147:7077 \
    --py-files packages.zip \
    --packages mysql:mysql-connector-java:8.0.15 \
    --files configs/etl_config.json \
    --executor-memory 25g \
    jobs/etl_job.py
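
The command above ships packages.zip to the cluster through --py-files, but this README does not say how that archive is produced. The following is a minimal sketch of one way to build it with the Python standard library. The function name build_packages_zip and the example directory name dependencies are hypothetical and are not taken from this project.

```python
import zipfile
from pathlib import Path
from typing import Iterable, List


def build_packages_zip(package_dirs: Iterable[str],
                       archive_path: str = "packages.zip") -> List[str]:
    """Zip the .py files of each package directory (hypothetical helper).

    Each entry keeps its package directory name as a prefix, e.g.
    "dependencies/spark.py", so the layout inside the archive mirrors
    the layout on disk.
    """
    added = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for package_dir in package_dirs:
            root = Path(package_dir)
            for path in sorted(root.rglob("*.py")):
                # Store the path relative to the package's parent directory.
                arcname = path.relative_to(root.parent).as_posix()
                archive.write(path, arcname)
                added.append(arcname)
    return added
```

Calling build_packages_zip(["dependencies"]) from the project root would write packages.zip next to the Pipfile, assuming a dependencies directory exists there. Third-party libraries with compiled extensions generally cannot be shipped this way and would need to be installed on the cluster nodes instead.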

YARN

See https://spark.apache.org/docs/latest/running-on-yarn.html for details on running Spark on YARN.

  $SPARK_HOME/bin/spark-submit \
    --master yarn \
    --driver-memory 4g \
    --executor-memory 20g \
    --py-files packages.zip \
    --packages mysql:mysql-connector-java:8.0.15 \
    --files configs/etl_config.json \
    jobs/etl_job.py
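
Both commands send configs/etl_config.json along with the job through --files. How jobs/etl_job.py actually reads that file is not shown in this README. The sketch below illustrates one plausible approach: look in a directory for a file whose name ends in config.json and parse it. The function name load_etl_config and its parameters are assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def load_etl_config(files_dir: str,
                    suffix: str = "config.json") -> Optional[dict]:
    """Parse the first file in files_dir whose name ends with suffix.

    Hypothetical helper: returns the parsed JSON document,
    or None when no matching file is found.
    """
    for path in sorted(Path(files_dir).iterdir()):
        if path.is_file() and path.name.endswith(suffix):
            with path.open(encoding="utf-8") as handle:
                return json.load(handle)
    return None
```

Inside a Spark job, files_dir could be the value returned by pyspark.SparkFiles.getRootDirectory(), which is the directory where Spark places files added to the job. Returning None instead of raising lets the job fall back to defaults when it is run without --files, for example during local testing.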