Udacity Data Engineering Nanodegree, Capstone Project
The project follows these steps:
This project models artist (cast) review sentiment and film review sentiment as facts, based on the data provided by IMDb and TMDb.
The goal of our Data Pipeline is to publish a PDF report to S3 with the following summarised information:
Our data sources are:
Please refer to the Movie Review Sentiment Analysis Notebook for end-to-end understanding of the project, as well as setup instructions.
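As a rough illustration of the pipeline's final step (publishing the summary PDF report), a sketch using boto3 is shown below; the bucket and key names are placeholders, not the project's actual configuration.

```python
import boto3

# Hypothetical bucket/key names, used only for illustration.
REPORT_BUCKET = "movie-review-sentiment-reports"
REPORT_KEY = "reports/sentiment_summary.pdf"

def publish_report(local_pdf_path: str) -> None:
    """Upload the generated PDF report to S3 so the Data Analysis Team can fetch it."""
    s3 = boto3.client("s3")
    s3.upload_file(local_pdf_path, REPORT_BUCKET, REPORT_KEY)

if __name__ == "__main__":
    publish_report("sentiment_summary.pdf")
```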
fact_films_review_sentiments
Attribute | Type | Nullable | Value |
---|---|---|---|
date_id | timestamp | not null | yyyy-mm-dd, foreign key to dim_dates |
film_id | int | not null | foreign key to dim_films |
review_id | int | not null | foreign key to dim_reviews |
review_sentiment_class | short | null | [-1, 1] value representing the sentiment, classified by our model |
fact_cast_review_sentiments
Attribute | Type | Nullable | Value |
---|---|---|---|
date_id | timestamp | not null | yyyy-mm-dd, foreign key to dim_dates |
cast_id | int | not null | foreign key to dim_cast |
review_id | int | not null | foreign key to dim_reviews |
review_sentiment_class | short | null | [-1, 1] value representing the sentiment, classified by our model |
dim_dates
Attribute | Type | Nullable | Value |
---|---|---|---|
date_id | timestamp | not null | primary key |
year | int | not null | year of timestamp in int format |
month | int | not null | month of timestamp in int format |
day | int | not null | day of timestamp in int format |
dim_films
Attribute | Type | Nullable | Value |
---|---|---|---|
film_id | varchar(32) | not null | imdb_id |
title | varchar(256) | not null | the original title of the film |
release_year | int | not null | year in which the film was released |
dim_cast
Attribute | Type | Nullable | Value |
---|---|---|---|
cast_id | varchar(32) | not null | cast_id in the IMDb database |
film_id | varchar(32) | not null | imdb_id |
full_name | varchar(256) | not null | the name of the actor or actress |
dim_reviews
Attribute | Type | Nullable | Value |
---|---|---|---|
review_id | varchar(32) | not null | review_id in the TMDb database |
film_id | varchar(32) | not null | imdb_id |
text | varchar(32) | not null | review text for sentiment classification |
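As a minimal sketch of how this star schema might be declared in Redshift, two of the tables are shown below; the `DISTKEY`/`SORTKEY` choices are assumptions, and the join keys are declared as `varchar(32)` to match the dimension tables even though the fact tables above list them as `int`.

```python
# sql_queries.py (sketch): DDL for one dimension and one fact table of the star schema.
# Distribution and sort keys are assumptions for illustration, not the project's settings.

CREATE_DIM_FILMS = """
CREATE TABLE IF NOT EXISTS dim_films (
    film_id      VARCHAR(32)  NOT NULL,   -- imdb_id
    title        VARCHAR(256) NOT NULL,   -- the original title of the film
    release_year INT          NOT NULL,   -- year in which the film was released
    PRIMARY KEY (film_id)
);
"""

CREATE_FACT_FILMS_REVIEW_SENTIMENTS = """
CREATE TABLE IF NOT EXISTS fact_films_review_sentiments (
    date_id                TIMESTAMP   NOT NULL,  -- foreign key to dim_dates
    film_id                VARCHAR(32) NOT NULL,  -- foreign key to dim_films
    review_id              VARCHAR(32) NOT NULL,  -- foreign key to dim_reviews
    review_sentiment_class SMALLINT    NULL       -- -1 or 1, assigned by the model
)
DISTKEY (film_id)
SORTKEY (date_id);
"""
```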
The present solution allows Data Analysts to perform roll-ups and drill-downs into film review sentiment facts linked to both films and actors.
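For example, a drill-down of average film review sentiment by year and month could be expressed as below (the query is illustrative, not taken from the project); rolling up is a matter of dropping `d.month` from the grouping.

```python
# Illustrative drill-down: average review sentiment per film, per year and month.
DRILLDOWN_SENTIMENT_BY_MONTH = """
SELECT f.title,
       d.year,
       d.month,
       AVG(s.review_sentiment_class::FLOAT) AS avg_sentiment,
       COUNT(*)                             AS n_reviews
FROM fact_films_review_sentiments s
JOIN dim_films f ON f.film_id = s.film_id
JOIN dim_dates d ON d.date_id = s.date_id
GROUP BY f.title, d.year, d.month
ORDER BY f.title, d.year, d.month;
"""
```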
Since the raw data is provided in S3 and reported on demand to the Data Analysis Team, who can then perform further analysis on the database, a single write step is required, whereas many reads and aggregations will be performed.
Given those requirements, the choice of technology was the following:
- AWS S3 to store the raw IMDb and TMDb files for films, reviews and casting.
- AWS Redshift to produce a Data Warehouse with the required dimension and fact tables.
- TensorFlow to train the sentiment model and run the classification.
- Apache Airflow to automate our Data Pipeline (a minimal DAG sketch follows this list).
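A minimal sketch of how the DAG could be wired in Airflow; the operator choices, task ids and callables are placeholders for the real tasks in the repository.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

# Placeholder callables standing in for the real pipeline steps.
def classify_reviews():
    """Stream unclassified reviews from Redshift through the TensorFlow model."""

def publish_report():
    """Render the summary PDF and upload it to S3."""

with DAG(
    dag_id="movie_review_sentiment_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",   # scheduled for 00:00 every day
    catchup=False,
) as dag:
    start = DummyOperator(task_id="begin_execution")
    copy_raw_to_staging = DummyOperator(task_id="copy_raw_to_staging")   # COPY from S3
    load_dims_and_facts = DummyOperator(task_id="load_dims_and_facts")   # INSERT INTO ... SELECT
    classify = PythonOperator(task_id="classify_reviews", python_callable=classify_reviews)
    report = PythonOperator(task_id="publish_report", python_callable=publish_report)

    start >> copy_raw_to_staging >> load_dims_and_facts >> classify >> report
```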
This solution is scalable under each of the following scenarios.
The technology employed is extremely elastic. With 100x the current data volume:
> Q: How much data can I store in Amazon S3?
>
> The total volume of data and number of objects you can store are unlimited. Individual Amazon S3 objects can range in size from a minimum of 0 bytes to a maximum of 5 terabytes. The largest object that can be uploaded in a single PUT is 5 gigabytes. For objects larger than 100 megabytes, customers should consider using the Multipart Upload capability.
Source: https://aws.amazon.com/s3/faqs/
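Accordingly, for the larger raw files, boto3's managed transfer can be configured to switch to multipart uploads above the 100 MB threshold; a sketch with placeholder bucket and key names:

```python
import boto3
from boto3.s3.transfer import TransferConfig

# Use multipart upload for objects above 100 MB, as recommended in the S3 FAQ.
config = TransferConfig(multipart_threshold=100 * 1024 * 1024)

s3 = boto3.client("s3")
# Bucket, key and local file names are placeholders.
s3.upload_file(
    "imdb_title_basics.json",
    "movie-review-raw-data",
    "raw/imdb/title_basics.json",
    Config=config,
)
```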
`COPY` and `INSERT` operations would perform perfectly well, since there is no upper limit for them, and the `INSERT` statements have been written using the `INSERT INTO ... SELECT` form, which is optimised in Redshift:
> We strongly encourage you to use the COPY command to load large amounts of data. Using individual INSERT statements to populate a table might be prohibitively slow. Alternatively, if your data already exists in other Amazon Redshift database tables, use INSERT INTO SELECT or CREATE TABLE AS to improve performance. For more information about using the COPY command to load tables, see Loading data.
Source: https://docs.aws.amazon.com/redshift/latest/dg/r_INSERT_30.html
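A sketch of the two statement shapes used, with placeholder staging table, S3 path, IAM role and staging column names:

```python
# COPY raw JSON-lines files from S3 into a staging table.
# The staging table, S3 path and IAM role below are placeholders.
COPY_STAGING_REVIEWS = """
COPY staging_reviews
FROM 's3://movie-review-raw-data/raw/tmdb/reviews/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
FORMAT AS JSON 'auto';
"""

# Populate the fact table with the optimised INSERT INTO ... SELECT form.
# Staging column names are assumptions for illustration.
INSERT_FACT_FILMS_REVIEW_SENTIMENTS = """
INSERT INTO fact_films_review_sentiments (date_id, film_id, review_id, review_sentiment_class)
SELECT r.created_at::TIMESTAMP,
       r.film_id,
       r.review_id,
       NULL
FROM staging_reviews r;
"""
```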
Table scanning operations like review classification are done by running a `SELECT` query in AWS Redshift and then streaming through the results to be classified by the model.
3.1. Compute resources will not be particularly affected, given the fact that the classification is done by streaming through a small part of the dataset at a time.
3.2. The time needed to classify reviews will increase linearly as the data increases. Some level of parallelisation of this procedure (currently based on a for loop) will be required.
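Below is a sketch of the streaming classification step, assuming a psycopg2 connection to Redshift and a trained Keras model that accepts raw text (for example via a `TextVectorization` layer); the batch size, connection parameters and row-by-row `UPDATE` are illustrative rather than the project's exact implementation.

```python
import psycopg2
import tensorflow as tf

BATCH_SIZE = 1000  # stream a small part of the dataset at a time

def classify_pending_reviews(conn_params: dict, model_path: str) -> None:
    """Stream unclassified reviews out of Redshift in batches and write back a class."""
    model = tf.keras.models.load_model(model_path)
    conn = psycopg2.connect(**conn_params)
    # Named (server-side) cursor, so results are streamed rather than loaded at once.
    read_cur = conn.cursor(name="pending_reviews")
    read_cur.itersize = BATCH_SIZE
    read_cur.execute("""
        SELECT s.review_id, r.text
        FROM fact_films_review_sentiments s
        JOIN dim_reviews r ON r.review_id = s.review_id
        WHERE s.review_sentiment_class IS NULL
    """)
    write_cur = conn.cursor()
    while True:
        rows = read_cur.fetchmany(BATCH_SIZE)
        if not rows:
            break
        review_ids = [row[0] for row in rows]
        texts = [row[1] for row in rows]
        # Model output is assumed to be a probability in [0, 1]; map it to {-1, 1}.
        probs = model.predict(tf.constant(texts))
        for review_id, p in zip(review_ids, probs.ravel()):
            sentiment = 1 if p >= 0.5 else -1
            write_cur.execute(
                "UPDATE fact_films_review_sentiments "
                "SET review_sentiment_class = %s WHERE review_id = %s",
                (sentiment, review_id),
            )
    read_cur.close()
    conn.commit()
    write_cur.close()
    conn.close()
```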
The Data Warehouse has been designed for intensive reading and aggregation on the fact tables, and works on JSON-lines files precompiled in S3 and processed in batches.
4.1. Should intensive writing be a requirement, instead of databases we could have a message/queue-based system that would (a) write streams of files to S3 and (b) use Apache Spark to compile our facts.
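A rough sketch of part (b), assuming the streamed files land on S3 as JSON lines and using placeholder paths and column names:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compile_review_sentiment_facts").getOrCreate()

# Read the JSON-lines review files streamed into S3 (path is a placeholder).
reviews = spark.read.json("s3a://movie-review-raw-data/raw/tmdb/reviews/")

# Compile a fact-like aggregate: average sentiment per film per day
# (column names are assumptions for illustration).
facts = (
    reviews
    .withColumn("date_id", F.to_date("created_at"))
    .groupBy("date_id", "film_id")
    .agg(F.avg("review_sentiment_class").alias("avg_sentiment"),
         F.count("*").alias("n_reviews"))
)

facts.write.mode("overwrite").parquet("s3a://movie-review-raw-data/compiled/facts/")
```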
The Data Warehouse Building DAG would have to be scheduled for 00:00 every day.
Landing times will need to be set up with alerts, so that we can keep track of late deliveries.
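In Airflow, such landing-time alerts can be expressed as task SLAs; a minimal, self-contained sketch is below (the DAG id, task and email address are placeholders).

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator

# Alert if the raw files have not landed and been copied within 1 hour of 00:00.
with DAG(
    dag_id="movie_review_sentiment_landing_checks",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    default_args={"email": ["data-engineering@example.com"],  # placeholder address
                  "email_on_failure": True},
    catchup=False,
) as dag:
    copy_raw_to_staging = DummyOperator(
        task_id="copy_raw_to_staging",
        sla=timedelta(hours=1),  # SLA misses are recorded and emailed to the owners
    )
```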
`COPY` and `INSERT` in the Data Warehouse Building DAG should be little affected, even if the data increases 100x, because they currently run within 20 seconds, which in the worst-case scenario still leaves us a 6h safety window.
However, should the data increase significantly, the Review Classification part of the DAG could take too long to finish. Parallelisation of that task will need to be taken into consideration.
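One possible way to parallelise that task, sketched below under the assumption that review ids are integers, is to split the pending reviews into disjoint id ranges and classify each range in its own worker process; `classify_review_range` is a hypothetical helper, not a function from the project.

```python
from concurrent.futures import ProcessPoolExecutor

def classify_review_range(lower: int, upper: int) -> int:
    """Hypothetical helper: classify only pending reviews whose id falls in [lower, upper)."""
    # ... stream this id range through the model, as in the earlier sketch ...
    return 0  # number of reviews classified in this range

def classify_in_parallel(max_review_id: int, n_workers: int = 4) -> int:
    """Split the id space into disjoint ranges and classify each range in its own process."""
    step = max_review_id // n_workers + 1
    ranges = [(i * step, (i + 1) * step) for i in range(n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(classify_review_range, lo, hi) for lo, hi in ranges]
        return sum(f.result() for f in futures)

if __name__ == "__main__":
    print(classify_in_parallel(max_review_id=1_000_000))
```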
AWS Redshift's maximum number of concurrent user-defined queries is 50, according to the AWS documentation.
However, the fact tables are optimised for reading, and the queuing time for each of the 50 concurrent queries should be trivial.
Therefore, the system would easily sustain a much larger number of users (100+) with a high level of fault tolerance.