Project author: ianjeffries

Project description:
Analyzing car accidents in the United Kingdom using PySpark and Python for big data processing.
Primary language: Jupyter Notebook
Project URL: git://github.com/ianjeffries/car-accident-analysis.git
Created: 2019-06-20T11:12:33Z
Project community: https://github.com/ianjeffries/car-accident-analysis

License: MIT License


car-accident-analysis


Map of Accidents in the UK

Index

  1. Summary
  2. File Directory
  3. Language and Packages Used
  4. Installing PySpark
  5. Credits
  6. License

Summary

This project uses Python and PySpark to simulate big data processing in an analysis of car crashes in the UK. The attached Jupyter Notebook could be used in conjunction with Databricks to process the data across a real cluster.

File Directory

  1. data - contains the four files used in analysis:

    a. [Acc.csv](https://github.com/ianjeffries/car-accident-analysis/blob/master/data/Acc.csv) - 2017 accident data reported by the UK police force.
    b. [Cas.csv](https://github.com/ianjeffries/car-accident-analysis/blob/master/data/Cas.csv) - 2017 casualty data reported by the UK police force.
    c. [Veh.csv](https://github.com/ianjeffries/car-accident-analysis/blob/master/data/Veh.csv) - 2017 vehicle data reported by the UK police force.
    d. [dictionary.xls](https://github.com/ianjeffries/car-accident-analysis/blob/master/data/dictionary.xls) - Data dictionary used to define coded categorical values within datasets.
  2. images - contains visualizations:

    a. [uk_accidents.png](https://github.com/ianjeffries/car-accident-analysis/blob/master/images/uk_accidents.png) - Heatmap showing accidents in the UK by accident severity.
  3. car_crash.ipynb - Jupyter Notebook containing all analysis performed on the datasets, along with visualizations.

Language and Packages Used

Python is used in conjunction with PySpark for all analysis performed.

The following commands will import all necessary packages:

    import pyspark, os, zipfile
    import pandas as pd
    import urllib.request
    import matplotlib.pyplot as plt
    import cartopy.crs as ccrs
    from pyspark.sql import SQLContext
    from pyspark import SparkContext
    from pyspark.sql.types import IntegerType
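
As a rough sketch of how some of these imports might fit together, the snippet below starts a local Spark context and reads the three CSV files from the data folder into Spark DataFrames. The helper names `data_paths` and `load_tables`, the `local[*]` master setting, and the app name are assumptions made for illustration and are not taken from the notebook. It also assumes each CSV has a header row.

```python
import os


def data_paths(data_dir="data"):
    """Map a short table name to its CSV path (file names from the File Directory)."""
    return {name: os.path.join(data_dir, name + ".csv")
            for name in ("Acc", "Cas", "Veh")}


def load_tables(data_dir="data"):
    """Read the three CSVs into Spark DataFrames using a local Spark context."""
    # Imported lazily so data_paths() still works where PySpark is not installed.
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SQLContext

    # Assumption: run on all local cores; a real cluster would supply its own master.
    conf = SparkConf().setMaster("local[*]").setAppName("car-accident-analysis")
    sc = SparkContext.getOrCreate(conf)
    sql_context = SQLContext(sc)

    # Assumption: the files have a header row; inferSchema asks Spark to guess column types.
    return {name: sql_context.read.csv(path, header=True, inferSchema=True)
            for name, path in data_paths(data_dir).items()}
```

Calling `load_tables()` from the repository root would return a dictionary keyed by `Acc`, `Cas`, and `Veh`; something like `load_tables()["Acc"].printSchema()` could then be used to inspect the inferred columns. `SQLContext` is kept here to match the imports above, although newer Spark releases favor `SparkSession`.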

Installing PySpark

PySpark takes special configuration to install and run within Jupyter Notebook:

  1. If you’re using Windows, Michael Galarnyk has an excellent tutorial on installing PySpark for Windows (@GalarnykMichael/install-spark-on-windows-pyspark-4498a5d8d66c).

  2. If you are installing on Linux or macOS, Charles Bochet’s article will get you started.
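
Once installation is done, a quick check from a notebook cell can confirm that the Jupyter kernel is able to see PySpark. The following is a minimal sketch using only the standard library; the helper name `pyspark_installed` is hypothetical. Finding the package does not prove that Java and Spark are configured correctly, only that the import should succeed.

```python
import importlib.util


def pyspark_installed():
    """Return True if the pyspark package is visible to the current interpreter."""
    return importlib.util.find_spec("pyspark") is not None


if pyspark_installed():
    import pyspark
    # Report which version the kernel picked up.
    print("PySpark version:", pyspark.__version__)
else:
    print("PySpark not found; follow one of the guides above.")
```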

Credits

  1. Would like to thank the UK government for posting the data on their website.
  2. Would like to thank the Stack Overflow user whose function I stole; because of you lot, I get to stand on the shoulders of giants.

License

MIT License
Copyright (c) 2019 Ian Jeffries