Repartition & Parallel Data Ingestion; Cloud Services
This project helps you migrate data from object storage (AWS S3) to a Massively Parallel Processing data warehouse - Snowflake.
Learnings:
- Primary key: `sku`
- Clustered by: `sku`
- An aggregated table on the above rows, with `name` and `no. of products` as the columns
Creating the product table in Snowflake:

CREATE TABLE PRODUCTS (NAME VARCHAR, SKU VARCHAR NOT NULL, DESCRIPTION VARCHAR, PRIMARY KEY (SKU)) CLUSTER BY (SKU);
Used classes to build the solution and modularized it, so that the end consumer can simply run the query without any dependency issues (see the sketch below).
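A minimal, hypothetical sketch of how such a modular pipeline class could look (the class name, column names, and method names are illustrative placeholders, not the actual project code), assuming PySpark with the Spark-Snowflake connector that ships with Databricks:

```python
# Illustrative only: how the solution could be modularized so the end user
# just instantiates the class and calls run(). All names/options are placeholders.
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql import functions as F


class ProductPipeline:
    def __init__(self, spark: SparkSession, sf_options: dict, source_path: str):
        self.spark = spark
        self.sf_options = sf_options    # Snowflake connection options (sfURL, sfUser, ...)
        self.source_path = source_path  # e.g. "s3a://<bucket>/products.csv"

    def run(self) -> None:
        raw = self._extract()
        self._load(raw, "PRODUCTS")
        self._load(self._aggregate(raw), "PRODUCTS_AGG")

    def _extract(self) -> DataFrame:
        # Read the raw CSV from S3 into a Spark DataFrame
        return self.spark.read.csv(self.source_path, header=True)

    def _aggregate(self, df: DataFrame) -> DataFrame:
        # Number of products per name
        return df.groupBy("name").agg(F.count("sku").alias("no_of_products"))

    def _load(self, df: DataFrame, table: str) -> None:
        # Write to Snowflake; the connector creates the table if it does not exist
        (df.write.format("snowflake")
           .options(**self.sf_options)
           .option("dbtable", table)
           .mode("overwrite")
           .save())
```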
Snowflake - a Massively Parallel Processing, enterprise-level data warehouse that supports parallel data ingestion.
The data ingestion, from a Spark DataFrame to a table on Snowflake, happens in less than 7 seconds.
Note: writing the CSV into Snowflake from an S3 stage is even more efficient.
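As a rough sketch of that alternative path (the stage name, file path, and connection parameters below are placeholders, not the project's actual setup), the CSV could be loaded from an external S3 stage with COPY INTO, for example through the Snowflake Python connector:

```python
# Hypothetical example: load a staged CSV into PRODUCTS with COPY INTO.
# Connection parameters and the stage name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="<account>",
    user="<user>",
    password="<password>",
    database="PRODUCTS_DB",
    schema="PUBLIC",
    warehouse="COMPUTE_WH",
)
try:
    cur = conn.cursor()
    # Assumes an external stage pointing at the S3 bucket was created beforehand, e.g.
    # CREATE STAGE products_stage URL='s3://<bucket>/' CREDENTIALS=(...);
    cur.execute("""
        COPY INTO PRODUCTS (NAME, SKU, DESCRIPTION)
        FROM @products_stage/products.csv
        FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """)
finally:
    conn.close()
```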
Created the table on Snowflake keeping `sku` as both the primary key and the cluster-by key, so that any DML operation is supported and computationally efficient. The aggregated table has `name` and `no. of products` as its columns.
The source data is loaded into a Spark DataFrame and the aggregation is done in Spark. The result is written directly to Snowflake, which automatically creates a table with the specified name in the specified database and schema.
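Continuing the hypothetical `ProductPipeline` sketch above, the end consumer would only supply the connection options and the source path; with `mode("overwrite")` the connector creates the tables with the specified names in the given database and schema (all values below are placeholders):

```python
from pyspark.sql import SparkSession

# Placeholder connection options; no real credentials here.
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "PRODUCTS_DB",   # database in which the tables are created
    "sfSchema": "PUBLIC",          # schema in which the tables are created
    "sfWarehouse": "COMPUTE_WH",
}

spark = SparkSession.builder.getOrCreate()
# ProductPipeline is the illustrative class from the sketch earlier in this README.
ProductPipeline(spark, sf_options, "s3a://<bucket>/products.csv").run()
```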
Implemented the solution with this architecture due to time and resource constraints.
Given additional time, the enhancements would be:
1. AWS S3 - Data staging (home for the CSV file)
2. AWS EMR - Compute Engine
3. Airflow - Automating the process with this orchestration tool, hosted on AWS EC2
4. AWS Redshift - Target DW
5. Apache Spark - Distributed computing
6. Deployment - Package the tasks and their dependencies in Docker containers and manage them with Kubernetes.
This architecture would make it an enterprise-level, long-term solution.
Would have added AWS KMS to manage all the credentials, but Databricks Community Edition doesn't support this
(workaround used: creating an IAM role with limited access).