# PySpark ETL Pipeline
Python 3.x needs to be installed beforehand. This project uses `pipenv` to manage its dependencies:

```bash
pip3 install pipenv
```

Make sure that you're in the project's root directory (the same one in which the Pipfile resides), and then run:

```bash
pipenv install --dev
```
## Stand-alone

To submit the ETL job to a stand-alone Spark cluster:
```bash
$SPARK_HOME/bin/spark-submit \
    --master spark://192.168.210.147:7077 \
    --py-files packages.zip \
    --packages mysql:mysql-connector-java:8.0.15 \
    --files configs/etl_config.json \
    --executor-memory 25g \
    jobs/etl_job.py
```
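The `--files configs/etl_config.json` flag ships the job configuration to the cluster alongside the code. Below is a minimal sketch of how a job submitted this way could pick that file up via `SparkFiles`; the actual contents of `jobs/etl_job.py` may differ, and the config keys used here are hypothetical placeholders.

```python
# Minimal sketch of reading the job config shipped via --files.
# The config keys and paths below are hypothetical placeholders.
import json

from pyspark import SparkFiles
from pyspark.sql import SparkSession

# spark-submit supplies the master and resource settings; nothing is hard-coded here.
spark = SparkSession.builder.appName("etl_job").getOrCreate()

# Files passed with --files are distributed with the application and can be
# resolved through SparkFiles.
config_path = SparkFiles.get("etl_config.json")
with open(config_path, "r") as config_file:
    config = json.load(config_file)

# Hypothetical use of a config value to drive the extract step.
input_df = spark.read.parquet(config["input_path"])
input_df.show(5)
```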
## YARN

To submit the ETL job to a YARN cluster (see the [Running Spark on YARN documentation](https://spark.apache.org/docs/latest/running-on-yarn.html)):
```bash
$SPARK_HOME/bin/spark-submit \
    --master yarn \
    --driver-memory 4g \
    --executor-memory 20g \
    --py-files packages.zip \
    --packages mysql:mysql-connector-java:8.0.15 \
    --files configs/etl_config.json \
    jobs/etl_job.py
```
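The `--packages mysql:mysql-connector-java:8.0.15` flag pulls the MySQL JDBC driver onto the cluster's classpath at submit time. The sketch below shows how the job could then read a table over JDBC; the host, database, table, and credentials are placeholders rather than values from this project.

```python
# Hedged sketch of a JDBC read using the MySQL connector pulled in by
# --packages. Host, database, table, and credentials are placeholders only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl_job").getOrCreate()

source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db-host:3306/example_db")  # placeholder host/database
    .option("driver", "com.mysql.cj.jdbc.Driver")           # driver class shipped in mysql-connector-java 8.x
    .option("dbtable", "example_table")                     # placeholder table
    .option("user", "etl_user")                             # placeholder credentials
    .option("password", "etl_password")
    .load()
)

source_df.printSchema()
```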