Udacity Data Engineer Nanodegree Capstone Project
This project aims to answer questions about US immigration, such as: which cities are the most popular destinations, what the gender and visa-type distributions of immigrants look like, what the average age of immigrants is, and what the average monthly temperature is in each city. We extract data from three sources: the 2016 I94 immigration dataset, city temperature data from Kaggle, and US city demographic data from OpenSoft. We design four dimension tables (cities, immigrants, monthly average city temperature, and time) and one fact table (immigration). We use Spark for the ETL jobs and store the results in Parquet format for downstream analysis.
Staging tables:
- staging_i94_df: id, date, city_code, state_code, age, gender, visa_type, count
- staging_temp_df: year, month, city_code, city_name, avg_temperature, lat, long
- staging_demo_df: city_code, state_code, city_name, median_age, pct_male_pop, pct_female_pop, pct_veterans, pct_foreign_born, pct_native_american, pct_asian, pct_black, pct_hispanic_or_latino, pct_white, total_pop
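A minimal sketch of how the three sources might be loaded into the staging DataFrames above. The file paths, package version, and CSV options here are assumptions for illustration, not the project's exact code; the saurfang spark-sas7bdat package is one common way to read the SAS-formatted I94 data.

```python
from pyspark.sql import SparkSession

# Build a Spark session; the spark-sas7bdat package is assumed for reading SAS files.
spark = (
    SparkSession.builder
    .appName("capstone-etl")
    .config("spark.jars.packages", "saurfang:spark-sas7bdat:2.0.0-s_2.11")
    .getOrCreate()
)

# I94 immigration data (SAS format) -> staging_i94_df
staging_i94_df = (
    spark.read.format("com.github.saurfang.sas.spark")
    .load("data/i94_apr16_sub.sas7bdat")  # hypothetical path
)

# Kaggle city temperature data (CSV) -> staging_temp_df
staging_temp_df = (
    spark.read.option("header", True)
    .csv("data/GlobalLandTemperaturesByCity.csv")  # hypothetical path
)

# OpenSoft US city demographics (semicolon-delimited CSV) -> staging_demo_df
staging_demo_df = (
    spark.read.option("header", True).option("sep", ";")
    .csv("data/us-cities-demographics.csv")  # hypothetical path
)
```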
Dimension tables:
- immigrant_df: id, gender, age, visa_type
- city_df: city_code, state_code, city_name, median_age, pct_male_pop, pct_female_pop, pct_veterans, pct_foreign_born, pct_native_american, pct_asian, pct_black, pct_hispanic_or_latino, pct_white, total_pop, lat, long
- monthly_city_temp_df: city_code, year, month, avg_temperature
- time_df: date, dayofweek, weekofyear, month

Fact table:
- immigration_df: id, state_code, city_code, date, count
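As a hedged illustration of the transformation step, continuing from the loading sketch above and assuming the staging columns have already been renamed and cast to the types implied by the schemas, some of the analytics tables might be derived roughly like this:

```python
from pyspark.sql import functions as F

# Immigrant dimension: one row per immigrant id.
immigrant_df = (
    staging_i94_df
    .select("id", "gender", "age", "visa_type")
    .dropDuplicates(["id"])
)

# Immigration fact table: arrivals by immigrant, location and date.
immigration_df = staging_i94_df.select("id", "state_code", "city_code", "date", "count")

# Time dimension derived from the distinct arrival dates.
time_df = (
    staging_i94_df
    .select("date").dropDuplicates()
    .withColumn("dayofweek", F.dayofweek("date"))
    .withColumn("weekofyear", F.weekofyear("date"))
    .withColumn("month", F.month("date"))
)

# Persist the results as Parquet for downstream analysis (output paths are hypothetical).
immigrant_df.write.mode("overwrite").parquet("output/immigrant")
immigration_df.write.mode("overwrite").parquet("output/immigration")
time_df.write.mode("overwrite").parquet("output/time")
```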
Spark was chosen for this project because it processes large amounts of data quickly (with in-memory computation), scales easily by adding worker nodes, can ingest a variety of data formats (e.g. SAS, Parquet, CSV), and integrates well with cloud storage such as S3 and data warehouses such as Redshift.
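For instance, the same Parquet results can be written straight to S3 and queried back with Spark SQL for downstream analysis. The bucket name below is a placeholder, and AWS credentials are assumed to come from the environment or an attached instance profile.

```python
# Persist a dimension to S3 as Parquet, then read it back and query it.
city_df.write.mode("overwrite").parquet("s3a://my-capstone-bucket/analytics/cities/")

cities = spark.read.parquet("s3a://my-capstone-bucket/analytics/cities/")
cities.createOrReplaceTempView("cities")
spark.sql("SELECT city_name, median_age FROM cities ORDER BY median_age DESC").show(5)
```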
The data update cycle is typically chosen based on two criteria: the reporting cycle and the availability of new data to feed into the system. For example, if a new batch of average temperature data becomes available every month, a monthly refresh cycle is a reasonable choice.
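If a monthly cycle is adopted, one hedged way to keep the refresh incremental is to partition the Parquet output by year and month and overwrite only the partitions present in the incoming batch. The output path and the dynamic-overwrite setting below are assumptions about how the pipeline could be configured, not what it currently does.

```python
# Overwrite only the partitions contained in the new batch instead of the whole table.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

(
    monthly_city_temp_df
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .parquet("output/monthly_city_temp")  # hypothetical output path
)
```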
There are also considerations for scaling the existing solution:
If the data was increased by 100x:
We can consider provisioning larger EC2 instances to host Spark and/or adding more Spark worker nodes. With the added capacity from vertical or horizontal scaling, we should be able to reduce processing time.
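For example, the extra capacity could be requested at session build time. The numbers below are placeholders meant to show where such settings live, not tuned values.

```python
from pyspark.sql import SparkSession

# Hypothetical sizing for a larger cluster: more executors, each with more memory and cores.
spark = (
    SparkSession.builder
    .appName("capstone-etl")
    .config("spark.executor.instances", "20")
    .config("spark.executor.memory", "8g")
    .config("spark.executor.cores", "4")
    .config("spark.sql.shuffle.partitions", "400")
    .getOrCreate()
)
```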
If the data populates a dashboard that must be updated by 7am every day:
We can consider using Airflow to schedule and automate the data pipeline jobs. Its built-in retry and monitoring mechanisms can help us meet this requirement.
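A minimal Airflow DAG sketch for this scenario, scheduled early enough to finish well before 7am and with retries enabled. The DAG id, script path, and timings are assumptions.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

default_args = {
    "owner": "capstone",
    "retries": 3,                          # retry failed runs automatically
    "retry_delay": timedelta(minutes=10),
}

with DAG(
    dag_id="immigration_etl",              # hypothetical DAG id
    default_args=default_args,
    start_date=datetime(2016, 1, 1),
    schedule_interval="0 4 * * *",         # run daily at 4am so the dashboard is fresh by 7am
    catchup=False,
) as dag:

    run_etl = BashOperator(
        task_id="run_spark_etl",
        bash_command="spark-submit etl.py",  # hypothetical ETL entry point
    )
```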
If the database needed to be accessed by 100+ people:
We can consider hosting our solution in a production-scale cloud data warehouse such as Redshift, which provides larger capacity to serve more users and workload management to ensure equitable use of resources across users.
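As a hedged sketch, the Parquet output could be loaded into such a warehouse with a Redshift COPY command issued from Python. The cluster endpoint, database, table name, bucket, and IAM role below are all placeholders.

```python
import os

import psycopg2

# Hypothetical Redshift connection details; real values would come from the deployment.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="etl_user",
    password=os.environ["REDSHIFT_PASSWORD"],
)

copy_sql = """
    COPY immigration
    FROM 's3://my-capstone-bucket/analytics/immigration/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
    FORMAT AS PARQUET;
"""

# psycopg2 commits on successful exit from the connection context manager.
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)
```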