New York City taxi commute data analysis with Hive and HDFS, to help the New York City Taxi & Limousine Commission (TLC) improve the in-city commute system
The New York City Taxi & Limousine Commission (TLC) has provided a dataset of trips made by taxis in New York City. The detailed trip-level data is much more than a vast list of taxi pick-up and drop-off coordinates.
The records include fields capturing pick-up and drop-off dates/times, pick-up and drop-off locations (location coordinates of the starting and ending points), trip distances, itemized fares, rate types, payment types, driver-reported passenger counts, and so on. The data was collected and provided to the TLC by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).
The data dictionary has been uploaded.
For a dictionary describing green taxi data, or a map of the TLC Taxi Zones, visit http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
This dataset was created by aggregating the aforementioned records. It provides precise location coordinates and timestamps for where and when each trip started and ended, along with a few other variables including fare amount, payment method, and distance travelled.
The purpose of this dataset is to get a better understanding of the taxi system so that the city of New York can improve the efficiency of in-city commutes. Several exploratory questions can be asked about the travelling experience for passengers.
In this assignment, we ONLY consider the data of yellow taxis for November and December of the year 2017.
The dataset has been placed in the HDFS storage of the lab. The path to the data files is as follows:
'/common_folder/nyc_taxi_data/'
While performing this analysis, we will inevitably make several assumptions.
A few pointers before we start the assignment, followed by the questions we need to answer:
Basic Data Quality Checks
Before answering the questions below, we need to create a clean, ORC-partitioned table for analysis and remove all the erroneous rows; a sketch of this step follows the SET commands below.
IMPORTANT: Before partitioning any table, make sure to run the commands below:
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
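
To make this step concrete, here is a minimal sketch of the table creation and cleaning, not a definitive solution. It assumes the standard 2017 yellow-taxi column names from the data dictionary, that the raw files are comma-separated with one header line, and the hypothetical table names nyc_taxi_raw and nyc_taxi_orc; the filter conditions are illustrative examples of erroneous rows.

-- External table over the raw files in the shared HDFS folder.
CREATE EXTERNAL TABLE IF NOT EXISTS nyc_taxi_raw (
  vendorid INT,
  tpep_pickup_datetime TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  passenger_count INT,
  trip_distance DOUBLE,
  ratecodeid INT,
  store_and_fwd_flag STRING,
  pulocationid INT,
  dolocationid INT,
  payment_type INT,
  fare_amount DOUBLE,
  extra DOUBLE,
  mta_tax DOUBLE,
  tip_amount DOUBLE,
  tolls_amount DOUBLE,
  improvement_surcharge DOUBLE,
  total_amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/common_folder/nyc_taxi_data/'
TBLPROPERTIES ("skip.header.line.count" = "1");

-- Clean table stored as ORC, partitioned by month (11 = November, 12 = December).
CREATE TABLE IF NOT EXISTS nyc_taxi_orc (
  vendorid INT,
  tpep_pickup_datetime TIMESTAMP,
  tpep_dropoff_datetime TIMESTAMP,
  passenger_count INT,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  tip_amount DOUBLE,
  total_amount DOUBLE
)
PARTITIONED BY (mnth INT)
STORED AS ORC;

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Keep only the study window and drop obviously erroneous rows.
INSERT OVERWRITE TABLE nyc_taxi_orc PARTITION (mnth)
SELECT vendorid, tpep_pickup_datetime, tpep_dropoff_datetime,
       passenger_count, trip_distance, fare_amount, tip_amount, total_amount,
       MONTH(tpep_pickup_datetime) AS mnth
FROM nyc_taxi_raw
WHERE YEAR(tpep_pickup_datetime) = 2017
  AND MONTH(tpep_pickup_datetime) IN (11, 12)
  AND passenger_count > 0
  AND trip_distance > 0
  AND fare_amount > 0
  AND unix_timestamp(tpep_dropoff_datetime) > unix_timestamp(tpep_pickup_datetime);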
Segregate the data into five segments of ‘tip paid’:
[0-5), [5-10), [10-15), [15-20) and >=20.
Calculate the percentage share of each bucket (i.e. the fraction of trips falling in each bucket).
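
A minimal sketch of the bucketing query, assuming the cleaned table nyc_taxi_orc from the step above; each percentage share is computed as the mean of a 0/1 bucket indicator, which avoids a separate total-count query.

-- Percentage share of each 'tip paid' bucket, as the mean of a 0/1 flag.
SELECT ROUND(100 * AVG(CASE WHEN tip_amount >= 0  AND tip_amount < 5  THEN 1 ELSE 0 END), 2) AS pct_0_5,
       ROUND(100 * AVG(CASE WHEN tip_amount >= 5  AND tip_amount < 10 THEN 1 ELSE 0 END), 2) AS pct_5_10,
       ROUND(100 * AVG(CASE WHEN tip_amount >= 10 AND tip_amount < 15 THEN 1 ELSE 0 END), 2) AS pct_10_15,
       ROUND(100 * AVG(CASE WHEN tip_amount >= 15 AND tip_amount < 20 THEN 1 ELSE 0 END), 2) AS pct_15_20,
       ROUND(100 * AVG(CASE WHEN tip_amount >= 20 THEN 1 ELSE 0 END), 2) AS pct_20_plus
FROM nyc_taxi_orc;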
Which month has a greater average ‘speed’ - November or December? Note that the variable ‘speed’ will have to be derived from other metrics. Hint: We have columns for distance and time.
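
A sketch of the derivation, again assuming the cleaned table: speed in miles per hour is trip_distance divided by the trip duration in hours, computed from the pick-up and drop-off timestamps.

-- Average speed (mph) per month; the cleaning step guarantees a positive duration.
SELECT mnth,
       ROUND(AVG(trip_distance /
                 ((unix_timestamp(tpep_dropoff_datetime) -
                   unix_timestamp(tpep_pickup_datetime)) / 3600.0)), 2) AS avg_speed_mph
FROM nyc_taxi_orc
GROUP BY mnth;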
Analyse the average speed of the most happening days of the year, i.e. 31st December (New Year’s eve) and 25th December (Christmas), and compare it with the overall average.
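
A sketch against the same assumed table, tagging each trip by day type and averaging the per-trip speed within each group; the overall average can be obtained by running the same AVG without the grouping.

-- Average speed on 25th and 31st December versus the remaining Nov-Dec trips.
SELECT day_type,
       ROUND(AVG(speed_mph), 2) AS avg_speed_mph
FROM (
  SELECT CASE
           WHEN mnth = 12 AND DAY(tpep_pickup_datetime) = 25 THEN 'christmas'
           WHEN mnth = 12 AND DAY(tpep_pickup_datetime) = 31 THEN 'new_years_eve'
           ELSE 'other_days'
         END AS day_type,
         trip_distance /
           ((unix_timestamp(tpep_dropoff_datetime) -
             unix_timestamp(tpep_pickup_datetime)) / 3600.0) AS speed_mph
  FROM nyc_taxi_orc
) trips
GROUP BY day_type;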