Project author: sukumarh

Project description: Distributed training using PyTorch DDP & Suggestive resource allocation

Primary language: Jupyter Notebook

Repository: git://github.com/sukumarh/distributed-training.git

Created: 2020-11-26T03:41:25Z

Project homepage: https://github.com/sukumarh/distributed-training

License: Apache License 2.0

Distributed training & Suggestive resource allocation

This project studies the impact of distributed training of deep learning models, with the goal of understanding whether a predictive model can be designed to predict epoch speed and time-to-accuracy. We selected image classification as the application and used CNN models in our experiments.

We collected training logs for around 75 configurations, varying the model type, batch size, GPU type, number of GPUs, and number of data loaders. Once the predictive model (also referred to as the recommender model) is trained, and if its test error is low, we aim to make it available to end users by hosting it on a Kubernetes cluster as a web application.
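A configuration grid of this kind can be enumerated with `itertools.product`. The values below are a hypothetical sketch for illustration; the actual models, batch sizes, and GPU types used in the experiments are not listed in this README.

```python
from itertools import product

# Hypothetical search space; not the project's actual values.
models = ["resnet18", "vgg16", "mobilenet_v2"]
batch_sizes = [32, 64, 128]
gpu_types = ["K80", "V100"]
num_gpus = [1, 2]
num_workers = [2, 4]

# Every combination of the five varied dimensions.
configurations = [
    {"model": m, "batch_size": b, "gpu_type": g,
     "num_gpus": n, "num_workers": w}
    for m, b, g, n, w in product(models, batch_sizes,
                                 gpu_types, num_gpus, num_workers)
]
print(len(configurations))  # 3 * 3 * 2 * 2 * 2 = 72
```

Each dictionary in `configurations` then corresponds to one training run whose logs feed the recommender model.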

Finally, this can become a prescriptive solution that suggests the least-cost training configuration to users before they invest in hardware.

Trainers

Single GPU Trainer

```
trainer_pytorch.py [-h] [-b BATCH_SIZE] [-c CONFIGURATIONS]
                   [--configuration-file CONFIGURATION_FILE] [-d DATA]
                   [--dataset DATASET] [-e EPOCHS]
                   [-lr LEARNING_RATE] [-m MODEL_NAME]
                   [-w NUM_WORKERS] [-s SAVE_LOCATION]
```
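For example, a single-GPU run might be launched as follows (the argument values here are illustrative, not defaults taken from the repository):

```
python trainer_pytorch.py -b 64 -e 10 -lr 0.01 -m resnet18 -w 4 -s ./checkpoints
```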

Distributed Trainer (Multi-GPU)

```
distributed_trainer.py [-h] [-b BATCH_SIZE] [-c CONFIGURATIONS]
                       [--configuration-file CONFIGURATION_FILE]
                       [-d DATA] [--dataset DATASET]
                       [--distribute-data] [-da DISTRIBUTED_ADDRESS]
                       [-dp DISTRIBUTED_PORT]
                       [--distributed-backend DISTRIBUTED_BACKEND]
                       [-e EPOCHS] [-g GLOO_FILE] [-lr LEARNING_RATE]
                       [-m MODEL_NAME] [--num-nodes NUM_NODES]
                       [--num-gpus NUM_GPUS] [-w NUM_WORKERS]
                       [-s SAVE_LOCATION]
```
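A minimal sketch of how a CLI like this can be parsed with `argparse`. The flag names mirror the usage string above, but the default values are assumptions, not the repository's actual defaults:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Subset of the distributed_trainer.py flags; defaults are
    # assumptions for illustration only.
    p = argparse.ArgumentParser(description="Distributed trainer (sketch)")
    p.add_argument("-b", "--batch-size", type=int, default=32)
    p.add_argument("-e", "--epochs", type=int, default=10)
    p.add_argument("-lr", "--learning-rate", type=float, default=0.01)
    p.add_argument("-m", "--model-name", default="resnet18")
    p.add_argument("-da", "--distributed-address", default="127.0.0.1")
    p.add_argument("-dp", "--distributed-port", default="29500")
    p.add_argument("--distributed-backend", default="nccl")
    p.add_argument("--num-nodes", type=int, default=1)
    p.add_argument("--num-gpus", type=int, default=1)
    p.add_argument("-w", "--num-workers", type=int, default=2)
    p.add_argument("-s", "--save-location", default="./checkpoints")
    return p

# Example: a 2-GPU run using the gloo backend.
args = build_parser().parse_args(
    ["--num-gpus", "2", "--distributed-backend", "gloo", "-b", "128"]
)
print(args.num_gpus, args.distributed_backend, args.batch_size)  # 2 gloo 128
```

The parsed namespace would then drive `torch.distributed` initialization (address, port, backend) and the per-process DataLoader setup (batch size, workers).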

Evaluations

The following graph shows the epoch timings for various configurations. In this experiment, each GPU trained on the entire dataset, which increased the per-epoch time but produced a larger decrease in the number of epochs needed to reach a given accuracy.

*(Figure: epoch timings for various configurations)*

Recommender model

| Target | MAE | RMSE |
| --- | --- | --- |
| Time per epoch (seconds) | 1.84 | 4.60 |
| Accuracy for an epoch | 0.047 | 0.10 |
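For reference, the MAE and RMSE reported above can be computed as follows. The numbers in this snippet are toy values, not the project's data:

```python
import math

def mae(y_true, y_pred):
    # Mean absolute error: average magnitude of the prediction errors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # Root mean squared error: penalises large errors more than MAE.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Toy example: actual vs. predicted epoch times in seconds.
actual = [30.0, 42.0, 55.0, 61.0]
predicted = [32.0, 40.0, 57.0, 58.0]
print(round(mae(actual, predicted), 2), round(rmse(actual, predicted), 2))  # 2.25 2.29
```

RMSE being noticeably larger than MAE (as in the epoch-time row of the table) indicates that a few configurations are predicted much worse than the typical one.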

Frameworks & Libraries

  1. PyTorch
  2. LightGBM

Environments

  1. Jupyter Notebooks
  2. PyCharm
  3. Spyder
  4. Visual Studio Code