Project author: the1schwartz

Project description:
Udacity Data Engineer Nanodegree Capstone Project
Primary language: SAS
Repository: git://github.com/the1schwartz/DEND-capstone.git
Created: 2020-03-11T08:04:13Z
Project page: https://github.com/the1schwartz/DEND-capstone


City Immigration, Demographics and Temperatures

Data Engineering Capstone Project

By Martin Treacy-Schwartz

Project Summary

This project aims to answer questions about US immigration such as: which cities are the most popular immigration destinations, what is the gender distribution of the immigrants, what is the visa-type distribution, what is the average age of the immigrants, and what is the average temperature per month per city. We extract data from three sources: the 2016 I94 immigration dataset, city temperature data from Kaggle, and US city demographic data from OpenSoft. We design four dimension tables (cities, immigrants, monthly average city temperature, and time) and one fact table (immigration). We use Spark for the ETL jobs and store the results in Parquet for downstream analysis.

Data sources

  1. I94 Immigration Data: comes from the U.S. National Tourism and Trade Office and contains various statistics on international visitor arrivals in the USA. The dataset contains data from 2016. link
  2. World Temperature Data: comes from Kaggle and contains average weather temperatures by city. link
  3. U.S. City Demographic Data: comes from OpenSoft and contains information about the demographics of all US cities, such as average age and male and female population. link
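
As a rough illustration of how these three sources could be loaded into Spark (a sketch only: the file paths are placeholders, and the spark-sas7bdat package and CSV separators are assumptions based on a typical workspace setup):

    from pyspark.sql import SparkSession

    # The saurfang spark-sas7bdat package is assumed to be available for reading
    # the I94 SAS files; the coordinates below are for the Scala 2.11 build.
    spark = (
        SparkSession.builder
        .appName("dend-capstone-etl")
        .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
        .getOrCreate()
    )

    # 1. I94 immigration data (SAS7BDAT format); the path is a placeholder.
    i94_raw = (
        spark.read.format("com.github.saurfang.sas.spark")
        .load("../../data/18-83510-I94-Data-2016/i94_apr16_sub.sas7bdat")
    )

    # 2. World temperature data from Kaggle (CSV).
    temp_raw = spark.read.csv(
        "GlobalLandTemperaturesByCity.csv", header=True, inferSchema=True
    )

    # 3. U.S. city demographic data from OpenSoft (semicolon-separated CSV).
    demo_raw = spark.read.csv(
        "us-cities-demographics.csv", sep=";", header=True, inferSchema=True
    )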

Conceptual Data Model

Staging Tables

  1. staging_i94_df
     • id
     • date
     • city_code
     • state_code
     • age
     • gender
     • visa_type
     • count
  2. staging_temp_df
     • year
     • month
     • city_code
     • city_name
     • avg_temperature
     • lat
     • long
  3. staging_demo_df
     • city_code
     • state_code
     • city_name
     • median_age
     • pct_male_pop
     • pct_female_pop
     • pct_veterans
     • pct_foreign_born
     • pct_native_american
     • pct_asian
     • pct_black
     • pct_hispanic_or_latino
     • pct_white
     • total_pop

Dimension Tables

  1. immigrant_df
     • id
     • gender
     • age
     • visa_type
  2. city_df
     • city_code
     • state_code
     • city_name
     • median_age
     • pct_male_pop
     • pct_female_pop
     • pct_veterans
     • pct_foreign_born
     • pct_native_american
     • pct_asian
     • pct_black
     • pct_hispanic_or_latino
     • pct_white
     • total_pop
     • lat
     • long
  3. monthly_city_temp_df
     • city_code
     • year
     • month
     • avg_temperature
  4. time_df
     • date
     • dayofweek
     • weekofyear
     • month

Fact Table

  1. immigration_df
     • id
     • state_code
     • city_code
     • date
     • count
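
The PySpark sketch below shows, at a high level, how the dimension tables could be derived from the staging DataFrames. Column names follow the model above; the exact cleaning steps and the way the staging DataFrames were built are assumptions.

    from pyspark.sql import functions as F

    # Assumed: staging_i94_df, staging_temp_df and staging_demo_df already exist
    # as cleaned Spark DataFrames with the columns listed in the staging model.

    # immigrant_df: one row per immigrant id
    immigrant_df = (
        staging_i94_df
        .select("id", "gender", "age", "visa_type")
        .dropDuplicates(["id"])
    )

    # city_df: demographics enriched with coordinates taken from the temperature data
    city_df = staging_demo_df.join(
        staging_temp_df.select("city_code", "lat", "long").dropDuplicates(["city_code"]),
        on="city_code",
        how="left",
    )

    # monthly_city_temp_df: average temperature per city, year and month
    monthly_city_temp_df = (
        staging_temp_df
        .groupBy("city_code", "year", "month")
        .agg(F.avg("avg_temperature").alias("avg_temperature"))
    )

    # time_df: calendar attributes derived from the arrival date
    time_df = (
        staging_i94_df
        .select("date")
        .dropDuplicates(["date"])
        .withColumn("dayofweek", F.dayofweek("date"))
        .withColumn("weekofyear", F.weekofyear("date"))
        .withColumn("month", F.month("date"))
    )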

Steps necessary to pipeline the data into the chosen data model

  1. Clean the data: handle nulls, cast data types, remove duplicates, etc.
  2. Load the staging tables staging_i94_df, staging_temp_df and staging_demo_df
  3. Create the dimension tables immigrant_df, city_df, monthly_city_temp_df and time_df
  4. Create the fact table immigration_df with the immigration count, mapping id to immigrant_df, city_code to city_df and monthly_city_temp_df, and date to time_df, ensuring referential integrity
  5. Save the processed dimension and fact tables as Parquet for downstream queries (a sketch of steps 4 and 5 follows this list)
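
A minimal sketch of steps 4 and 5, assuming the staging and dimension DataFrames from the previous sections and a placeholder output_path; the inner joins against the dimension keys drop any rows whose city_code or date has no match, which is how referential integrity is enforced here.

    # Fact table: one row per arrival record, keeping only rows whose
    # city_code and date exist in the corresponding dimension tables.
    immigration_df = (
        staging_i94_df
        .select("id", "state_code", "city_code", "date", "count")
        .join(city_df.select("city_code"), on="city_code", how="inner")
        .join(time_df.select("date"), on="date", how="inner")
    )

    # Persist dimensions and fact as Parquet for downstream queries.
    # output_path is a placeholder, e.g. a local directory or an S3 prefix.
    output_path = "s3a://my-bucket/capstone/"

    immigrant_df.write.mode("overwrite").parquet(output_path + "immigrant")
    city_df.write.mode("overwrite").parquet(output_path + "city")
    monthly_city_temp_df.write.mode("overwrite").parquet(output_path + "monthly_city_temp")
    time_df.write.mode("overwrite").parquet(output_path + "time")
    (
        immigration_df.write.mode("overwrite")
        .partitionBy("state_code")
        .parquet(output_path + "immigration")
    )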

Discussions

Spark was chosen for this project because it processes large amounts of data quickly (with in-memory computation), scales easily by adding worker nodes, can ingest many different data formats (e.g. SAS, Parquet, CSV), and integrates nicely with cloud storage such as S3 and warehouses such as Redshift.

The data update cycle is typically chosen on two criteria: the reporting cycle and the availability of new data to be fed into the system. For example, if a new batch of average temperature data becomes available monthly, we might settle for a monthly refresh cycle.

There are also considerations for scaling the existing solution.

  • If the data was increased by 100x:
    We could spin up larger EC2 instances hosting Spark and/or add more Spark worker nodes. With the added capacity from either vertical or horizontal scaling, we should be able to keep processing times acceptable.

  • If the data populates a dashboard that must be updated by 7am every day:
    We could use Airflow to schedule and automate the data pipeline jobs. Its built-in retry and monitoring mechanisms would help us meet this requirement (a minimal DAG sketch follows this list).

  • If the database needed to be accessed by 100+ people:
    We could host the solution in a production-scale cloud data warehouse, with enough capacity to serve more users and workload management to ensure equitable use of resources across them.
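
A minimal Airflow sketch for the daily 7am dashboard scenario, assuming the Spark ETL above is wrapped in a callable named run_etl (a hypothetical name); the job is scheduled early enough each morning that the built-in retries can still finish before the 7am deadline.

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def run_etl(**context):
        # Placeholder for the Spark ETL described above, e.g. a spark-submit call.
        pass


    default_args = {
        "owner": "dend-capstone",
        "retries": 3,
        "retry_delay": timedelta(minutes=15),
    }

    with DAG(
        dag_id="immigration_etl",
        default_args=default_args,
        start_date=datetime(2020, 1, 1),
        schedule_interval="0 5 * * *",  # 05:00 daily, leaving headroom before 7am
        catchup=False,
    ) as dag:
        PythonOperator(
            task_id="run_spark_etl",
            python_callable=run_etl,
        )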