Project author: Renien

Project description:
:file_folder: Extract, Transform, Load (ETL) :construction_worker: refers to a process in database usage, especially in data warehousing. This repository contains a starter kit featuring ETL-related work.
Primary language: Scala
Project URL: git://github.com/Renien/ETL-Starter-Kit.git
Created: 2017-03-12T03:13:37Z
Project community: https://github.com/Renien/ETL-Starter-Kit

License: MIT License

[Logo: ETL - Extract - Transform - Load]

[Badges: Travis Build | License]
Summary

Extract, Transform, Load (ETL) refers to a process in database usage and especially in data warehousing. This repository contains a starter kit featuring ETL-related work.

Features and Limitations


[Diagram: lamda-etl (lambda architecture for ETL)]

Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
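To make the batch/stream split concrete, here is a minimal, purely illustrative Scala sketch of the serving-layer idea: a query merges a view precomputed by the batch layer with a fresh view from the stream-processing layer. It is not part of this kit, and all names in it are hypothetical.

```scala
// Illustrative lambda-architecture serving layer: merge a precomputed
// batch view with a fresh real-time view at query time.
object LambdaServingSketch {
  def mergeViews(batchView: Map[String, Long],
                 realtimeView: Map[String, Long]): Map[String, Long] =
    (batchView.keySet ++ realtimeView.keySet).map { key =>
      key -> (batchView.getOrElse(key, 0L) + realtimeView.getOrElse(key, 0L))
    }.toMap

  def main(args: Array[String]): Unit = {
    val batch    = Map("home" -> 1000L, "cart" -> 250L)  // recomputed nightly
    val realtime = Map("home" -> 42L, "checkout" -> 7L)  // since the last batch run
    println(mergeViews(batch, realtime)) // hits per page, combining both layers
  }
}
```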

This starter kit package focuses mainly on ETL-related work and can be expanded into an independent ETL framework for different client data sources. It contains a basic implementation and project structure as follows:

  • Common Module – This contains all the common jobs and helper classes for the ETL framework. Currently two Scalding helper classes are implemented (a Hadoop job runner and MapReduceConfig).

  • DataModel Module – This contains all the big-data schema related code, for example Avro, ORC, Thrift, etc. Currently a sample Avro clickstream raw-data schema is implemented.

  • SampleClient Module – This contains independent data-processing jobs, which depend on Common and DataModel (a sketch of such a job follows the next paragraph).

Since this repository provides only the structure, the various sample job types are not implemented. Feel free to modify it and implement whatever batch/streaming jobs (Spark, Hive, Pig, etc.) your requirements call for.
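As a hedged illustration of what a SampleClient job could look like, here is a minimal Scalding job in the spirit of ClickStreamAggregates.scala. The input layout, argument names, and class name are assumptions made for this sketch; the actual kit would more likely read the Avro-backed ClickStreamRecord from DataModel.

```scala
import com.twitter.scalding._

// Hypothetical sketch of a client-specific aggregation job.
// Assumes tab-separated raw clickstream lines whose first column is a page URL.
class ClickStreamPageCounts(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))
    .map(line => line.split("\t")(0)) // extract the page column
    .groupBy(identity)                // group identical pages
    .size                             // count hits per page
    .write(TypedTsv[(String, Long)](args("output")))
}
```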

Installation

Make sure you have installed:

  • JDK 1.8+
  • Scala 2.10.*
  • Gradle 2.2+

This starter kit package uses the latest version of the LinkedIn Gradle Hadoop plugin, which supports only the Gradle 2 series. If you would like to use an older Gradle version, you will have to downgrade the LinkedIn Gradle Hadoop plugin as well.

Directory Layout

.
├── Common                          --> common module, containing shared helper classes
│   ├── build.gradle                --> build script specific to the Common module
│   └── src                         --> source package directory for the Common module
│       └── main
│           ├── java
│           ├── resources
│           └── scala
│               └── com
│                   └── etl
│                       └── utils
│                           ├── HadoopRunner.scala
│                           └── MapReduceConfig.scala
├── DataModel                       --> schema-level module (e.g. Avro, Thrift, JSON)
│   ├── build.gradle                --> build script specific to the DataModel module
│   ├── schema                      --> data schema files
│   │   └── com
│   │       └── etl
│   │           └── datamodel
│   │               └── ClickStreamRecord.avsc --> clickstream record Avro schema
│   ├── src                         --> source package directory for the DataModel module
│   │   └── main
│   │       ├── java
│   │       ├── resources
│   │       └── scala
│   └── target                      --> auto-generated code (e.g. from Avro, Thrift)
│       └── generated-sources
│           └── main
│               └── java
│                   └── com
│                       └── etl
│                           └── datamodel
│                               └── ClickStreamRecord.java --> auto-generated from the clickstream Avro schema
├── SampleClient                    --> separate module for client-specific ETL jobs
│   ├── build.gradle                --> build script for the client-specific module
│   ├── src                         --> source package directory for the client-specific module
│   │   └── main
│   │       ├── java
│   │       ├── resources
│   │       └── scala
│   │           └── com
│   │               └── sampleclient
│   │                   └── jobs
│   │                       └── ClickStreamAggregates.scala --> clickstream aggregates job
│   └── workflow                    --> Hadoop job flow Groovy script folder
│       ├── flow.gradle             --> Gradle script to generate Hadoop job flows (e.g. Azkaban)
│       └── jobs.gradle             --> Gradle script for Hadoop jobs (e.g. Azkaban)
├── build.gradle                    --> build script for the root module
├── gradle                          --> folder containing all the shared build script files
│   ├── artifacts.gradle            --> artifact configuration for the ETL project
│   ├── buildscript.gradle          --> Groovy script exposing plugins, task classes, and other classes to the project
│   ├── dependencies.gradle         --> dependencies for the ETL project
│   ├── enviroments.groovy          --> configuration for the prod and dev environments
│   ├── repositories.gradle         --> repository locations for all dependencies
│   └── workflows.gradle            --> root workflow Gradle file containing configuration and custom build tasks
├── gradlew
├── gradlew.bat
└── settings.gradle                 --> configures the sub-modules

This starter kit is built around a few popular libraries, with sample code. Choose the technology that suits your requirements.
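For orientation on the Common module's helpers, here is a hedged sketch of how a runner like HadoopRunner.scala is typically wired: Scalding ships a Hadoop Tool, and the helper usually just hands it the job class name, a mode flag, and the job arguments via ToolRunner. The object name and argument handling below are assumptions, not the kit's actual code.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.util.ToolRunner

// Hypothetical sketch of a Hadoop job-runner helper.
// com.twitter.scalding.Tool expects the job class name first,
// then a mode flag (--hdfs or --local), then the job's own arguments.
object HadoopRunnerSketch {
  def run(jobClassName: String, jobArgs: Array[String]): Int =
    ToolRunner.run(new Configuration, new com.twitter.scalding.Tool,
                   jobClassName +: "--hdfs" +: jobArgs)

  def main(args: Array[String]): Unit =
    sys.exit(run("com.sampleclient.jobs.ClickStreamAggregates",
                 Array("--input", "raw/clicks", "--output", "out/page-counts")))
}
```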

Using The Project

Note:
This guide has only been tested on Mac OS X and may assume tools that are specific to it.
If you are working on another OS, substitutes may be needed, but they should be available.

Step 1 – Build the Project:

  • Run gradle clean build

Once you build the project you will find the following files:


[Screenshot: etl-build-files]

Step 2 – Upload Azkaban Flow

Upload ‘etl-starter-kit-sampleclient.zip’ to Azkaban. After deploying the fat Hadoop jar, you are ready to run the flow.


[Screenshot: sample-client-azkaban-flow]

Notable Frameworks for ETL work:

License

MIT © Renien