ETL process which downloads, transforms, and loads Freddie Mac/Fannie Mae mortgage data.

1. `git clone https://github.com/kr900910/mortgage-data-analysis.git`
2. `mkdir temp_download`
3. `pip install requests==2.5.3`
4. `python download_freddie_mac.py`. Enter credentials and quarters to download when prompted. This downloads one zip file into the current folder for each quarter.
5. `python download_fannie_mae.py`. Enter credentials and quarters to download when prompted. This downloads one zip file into the current folder for each quarter.
6. `. create_hdfs_dir.sh`. This creates the necessary HDFS folders.
7. `. unzip_to_HDFS.sh`. This unzips the zipped files into mortgage-data-analysis/temp_download, removes the zipped files, loads the unzipped files into HDFS, and removes the unzipped files. Note that this step can take 15-30 minutes depending on the number of quarters being loaded.
8. `. create_hive_tables.sh`. This creates Hive metadata for the base Fannie Mae and Freddie Mac data in HDFS and for the combined data sets. Note that this script can take several hours to run, depending on how many quarters of data are present (for 15 quarters, acquisition data took 10 minutes and performance data took ~2 hours).
9. `hive --service hiveserver2 &`. This starts HiveServer2 in the background.
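The download steps map each prompted quarter to one zip file. A minimal sketch of that expansion, assuming a quarter-label format like `Q12005` and an illustrative file-name pattern (neither is taken from the repo's scripts):

```python
def quarter_zip_names(quarters):
    """Map quarter labels like 'Q12005' to per-quarter zip file names.

    The 'historical_data1_<quarter>.zip' pattern is a hypothetical
    placeholder for whatever naming the download scripts actually use.
    """
    return ["historical_data1_{}.zip".format(q) for q in quarters]

names = quarter_zip_names(["Q12005", "Q22005"])
print(names)
```

Each resulting name would then be fetched with `requests` (hence the pinned `requests==2.5.3` dependency) and written into the current folder.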
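The unzip-and-clean-up portion of the `unzip_to_HDFS.sh` step can be sketched as follows. This is a minimal illustration of extracting each archive and deleting it afterward; the actual script is shell, and the HDFS load itself (e.g. via `hdfs dfs -put`) and post-load removal of the unzipped files are omitted here:

```python
import os
import zipfile

def unzip_and_remove(src_dir, dest_dir):
    """Extract every .zip in src_dir into dest_dir, then delete each archive."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(".zip"):
            continue
        path = os.path.join(src_dir, name)
        with zipfile.ZipFile(path) as zf:
            zf.extractall(dest_dir)
            extracted.extend(zf.namelist())
        os.remove(path)  # remove the zipped file once its contents are extracted
    return extracted
```

Deleting each archive as soon as it is extracted keeps the disk footprint bounded, which matters when many quarters of data are staged at once.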