BigData-HW-Spark

This repository contains solutions for five Spark exercises.

1. SparkSQL
2. Spark RDD
3. Spark DataFrame and Machine Learning Pipelines: Gradient Boosted Tree
4. Spark Application: Crime Analysis
5. Spark Application: Profit Prediction

Directory structure

├── README.md                               <- You are here
├── SparkSQL
│   ├── exercise1.py                        <- Python source code file
│   ├── exercise1.png                       <- Output of the Spark Job
│   ├── exercise1-findings.txt              <- Findings
│   ├── Problem_Statement.md                <- Problem Statement
├── SparkRDD
│   ├── exercise2.py                        <- Python source code file
│   ├── exercise2.txt                       <- Output of the Spark Job
│   ├── exercise2-findings.txt              <- Findings
│   ├── Problem_Statement.md                <- Problem Statement
├── Spark_Machine_Learning_Pipeline
│   ├── exercise3.py                        <- Python source code file
│   ├── exercise3.txt                       <- Output of the Spark Job: out-of-sample R-squared of the model
│   ├── Problem_Statement.md                <- Problem Statement
├── Spark_Application_Crime_Analysis
│   ├── exercise4.py                        <- Python source code file
│   ├── exercise4.txt                       <- Output of the Spark Job
│   ├── exercise4.png                       <- Output of the Spark Job
│   ├── exercise4-findings.txt              <- Findings
│   ├── Problem_Statement.md                <- Problem Statement
├── Spark_Application_Profit_Prediction
│   ├── exercise5.py                        <- Python source code file
│   ├── mape_all.txt                        <- Output of the Spark Job
│   ├── Problem_Statement.md                <- Problem Statement
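
The output files listed above report an out-of-sample R-squared (exercise 3) and, going by the file name `mape_all.txt`, a mean absolute percentage error (exercise 5). As a reminder of what those two numbers measure, here is a minimal plain-Python sketch. The function names and sample values are hypothetical and are not taken from the exercise scripts, which may compute the metrics with Spark's own evaluators instead.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Assumes no actual value is zero (the ratio is undefined there).
    """
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Hypothetical held-out values and model predictions (illustration only).
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

print("R-squared:", r_squared(actual, predicted))
print("MAPE (%):", mape(actual, predicted))
```

An R-squared close to 1 and a MAPE close to 0 both indicate predictions that track the held-out values closely; "out of sample" means the metric is computed on data the model was not trained on.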