Project author: akshaytambe

Project description:
NYC Crime Dataset Analysis
Primary language: Python
Project URL: git://github.com/akshaytambe/BDA-Crime-Data-Analysis.git
Created: 2017-12-01T15:19:08Z
Project community: https://github.com/akshaytambe/BDA-Crime-Data-Analysis

License:


BDA Project - NYC Crime Data Analysis

Submission by

Requirements

  • Hadoop setup in Dumbo Cluster
  • Spark setup in Dumbo Cluster

Installation Instructions

  • Log into the main HPC node. To do this,
  • Enter your password when prompted.
  • From the HPC node, log into the Hadoop cluster by running ssh dumbo. Enter your password again if prompted.
  • Download Crime Dataset from here
  • Upload the downloaded file from your local system to dumbo using scp
  • If you have not already done so, put the data file on HDFS: hadoop fs -copyFromLocal NYPD_Complaint_Data_Historic.csv

Configuration

  • Set the following shell aliases and environment variables
    alias hfs='/usr/bin/hadoop fs '
    export HAS=/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib
    export HSJ=hadoop-mapreduce/hadoop-streaming.jar
    alias hjs='/usr/bin/hadoop jar $HAS/$HSJ'
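
To make the effect of these settings concrete, the following minimal sketch rebuilds the command that the hjs alias expands to. The two path values are copied from the exports above; the helper name streaming_command is hypothetical and exists only for illustration.

```python
import posixpath

# Values copied from the exports above.
HAS = "/opt/cloudera/parcels/CDH-5.9.0-1.cdh5.9.0.p0.23/lib"
HSJ = "hadoop-mapreduce/hadoop-streaming.jar"


def streaming_command(extra_args):
    """Return the argument list that `hjs <extra_args>` would run.

    Hypothetical helper: it only mirrors the alias expansion and does not
    start Hadoop.
    """
    return ["/usr/bin/hadoop", "jar", posixpath.join(HAS, HSJ)] + list(extra_args)


# Show the bare command that the alias stands for.
print(" ".join(streaming_command([])))
```

If the printed jar path does not exist on the cluster, the CDH parcel version in HAS probably needs to be adjusted.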

Data Cleaning Scripts

  • Run the Data Cleaner Python program using Spark: spark-submit cleandata_script.py NYPD_Complaint_Data_Historic.csv
  • The output is written to cleandata.csv on HDFS. Copy it to the local filesystem on dumbo using: hadoop fs -getmerge cleandata.csv cleandata.csv
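
The getmerge step is needed because Spark jobs typically write their output as a directory of part files rather than as a single file, so cleandata.csv on HDFS is presumably such a directory. The sketch below is a rough local analogue of the merge, not Hadoop's implementation: it concatenates the part-* files in a directory into one file. The function name merge_part_files is hypothetical.

```python
import glob
import os
import shutil


def merge_part_files(output_dir, merged_path):
    """Concatenate every part-* file in output_dir into merged_path.

    Rough local analogue of `hadoop fs -getmerge`; marker files such as
    _SUCCESS are skipped here. Returns the number of part files merged.
    """
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*")))
    with open(merged_path, "wb") as merged:
        for part in part_files:
            with open(part, "rb") as chunk:
                # Stream the bytes so large part files are not held in memory.
                shutil.copyfileobj(chunk, merged)
    return len(part_files)
```

Because each part file is written independently, a header row (if the job writes one) may appear once per part in the merged file.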

Data Analysis/Exploration Scripts

  • Run each Data Analysis/Exploration Python program using Spark: spark-submit 'name_of_the_file.py' NYPD_Complaint_Data_Historic.csv
  • The output is written to 'output_file_name.out' on HDFS. Copy it to the local filesystem on dumbo using: hadoop fs -getmerge 'output_file_name.out' 'output_file_name.out'
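
For orientation, here is a minimal sketch of what one such analysis script could look like. It is not taken from the repository: it assumes the script counts how often each value occurs in one CSV column, and the column index 0, the output name, and the helper extract_field are hypothetical placeholders. It uses the RDD API (SparkContext, textFile, reduceByKey, saveAsTextFile).

```python
import csv
import sys


def extract_field(line, index):
    """Parse one CSV line and return the field at `index`, or None if missing or blank."""
    try:
        fields = next(csv.reader([line]))
    except (csv.Error, StopIteration):
        return None
    if index >= len(fields) or not fields[index].strip():
        return None
    return fields[index].strip()


def main(input_path, output_path, index):
    # Imported here so the helper above can be exercised without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="field-count-sketch")
    lines = sc.textFile(input_path)
    header = lines.first()
    counts = (
        lines.filter(lambda line: line != header)       # drop the header row
        .map(lambda line: extract_field(line, index))
        .filter(lambda value: value is not None)        # skip malformed or blank rows
        .map(lambda value: (value, 1))
        .reduceByKey(lambda a, b: a + b)
        .map(lambda pair: "%s\t%d" % pair)              # one "value<TAB>count" line per key
    )
    counts.saveAsTextFile(output_path)
    sc.stop()


# Only run the Spark job when an input path is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    # Column index 0 and the output name are placeholders, not the project's values.
    main(sys.argv[1], "output_file_name.out", 0)
```

Reading the file line by line assumes that no quoted field contains a line break; records that do would be split and then dropped by the filter.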

Data Plotting Scripts

  • Run the Data Plotting Python programs: python 'name_of_the_file.py' 'output_file_name.png'
  • The plot is saved to 'output_file_name.png'
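
A plotting script along these lines might read a merged output file and draw a bar chart. The sketch below is an illustration under stated assumptions, not the project's code: it assumes the input holds one "label<TAB>count" pair per line, hard-codes the placeholder input name output_file_name.out, and takes the PNG name as its only argument as in the command above. The function names are hypothetical.

```python
import sys


def read_counts(path):
    """Read 'label<TAB>count' lines into two parallel lists (assumed file format)."""
    labels, values = [], []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last tab so labels containing tabs stay intact.
            label, _, count = line.rpartition("\t")
            labels.append(label)
            values.append(int(count))
    return labels, values


def plot_counts(labels, values, png_path):
    import matplotlib
    matplotlib.use("Agg")  # render without a display, e.g. on a cluster node
    import matplotlib.pyplot as plt

    positions = list(range(len(values)))
    fig, ax = plt.subplots()
    ax.bar(positions, values)
    ax.set_xticks(positions)
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_ylabel("Count")
    fig.tight_layout()
    fig.savefig(png_path)


# Only plot when a PNG name is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    # "output_file_name.out" is a placeholder for a merged analysis output.
    labels, values = read_counts("output_file_name.out")
    plot_counts(labels, values, sys.argv[1])
```

Selecting the Agg backend before importing pyplot lets the script save the PNG on a machine without a graphical display.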