Project author: vineeths96

Project description:
In this repository, we explore model compression for transformer architectures via quantization. We specifically explore quantization aware training of the linear layers and demonstrate the performance for 8 bits, 4 bits, 2 bits and 1 bit (binary) quantization.
Primary language: Python
Repository: git://github.com/vineeths96/Compressed-Transformers.git
Created: 2020-11-07T12:28:13Z
Project community: https://github.com/vineeths96/Compressed-Transformers

License: MIT License


Compressed Transformers



Transformer Model Compression for Edge Deployment


Explore the repository»


View Report

tags : model compression, transformers, edge learning, federated learning, iwslt translation, english, german, deep learning, pytorch

About The Project

Transformer architectures and their extensions, such as BERT and GPT, have revolutionized the worlds of Natural Language, Speech, and Image processing. However, their large number of parameters and high computation cost prevent transformer models from being deployed on edge devices such as smartphones. In this work, we explore model compression for transformer architectures via quantization. Quantization not only reduces the memory footprint but also improves energy efficiency: research has shown that an 8-bit quantized model uses 4x less memory and 18x less energy than its full-precision counterpart. Model compression for transformer architectures thus reduces storage, memory footprint, and compute requirements. We show that transformer models can be compressed with no loss of performance, or even improved performance, on the IWSLT English-German translation task. We specifically explore quantization-aware training of the linear layers and demonstrate the performance for 8-bit, 4-bit, 2-bit, and 1-bit (binary) quantization. We find the linear layers of the attention network to be highly resilient to quantization, allowing them to be compressed aggressively. A detailed description of the quantization algorithms and an analysis of the results are available in the Report.
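
As a concrete illustration of what quantization-aware training of a linear layer can look like, the sketch below fake-quantizes the weights to k bits in the forward pass and passes gradients straight through the non-differentiable rounding in the backward pass (a straight-through estimator). This is a minimal, generic sketch with assumed names (FakeQuantize, QuantizedLinear), not necessarily the exact scheme used in this repository; the Report describes the actual algorithms.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FakeQuantize(torch.autograd.Function):
    """Symmetric uniform k-bit fake quantization with a straight-through estimator."""

    @staticmethod
    def forward(ctx, w, bits):
        # Map weights onto 2^bits uniformly spaced levels over [-max|w|, max|w|].
        # Assumes bits >= 2; the 1-bit (binary) case uses sign() instead.
        qmax = 2 ** (bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.round(w / scale).clamp(-qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        # Straight-through estimator: treat the rounding as identity for gradients.
        return grad_output, None


class QuantizedLinear(nn.Linear):
    """nn.Linear whose weights are fake-quantized on every forward pass."""

    def __init__(self, in_features, out_features, bits=8, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.bits = bits

    def forward(self, x):
        w_q = FakeQuantize.apply(self.weight, self.bits)
        return F.linear(x, w_q, self.bias)
```

Because the optimizer keeps updating a full-precision master copy of the weights, the network learns weights that remain accurate after rounding, which is what makes the aggressive 4-bit and 2-bit settings viable.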

Built With

This project was built with

  • python v3.8.3
  • PyTorch v1.5
  • The environment used for developing this project is available at environment.yml.

Getting Started

Clone the repository to a local machine and enter the src directory using

  1. git clone https://github.com/vineeths96/Compressed-Transformers
  2. cd Compressed-Transformers/src

Prerequisites

Create a new conda environment and install all the libraries by running the following command

  1. conda env create -f environment.yml

The dataset used in this project (IWSLT English-German translation) will be automatically downloaded and set up in the src/data directory during execution.
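
With the (legacy) torchtext version contemporary to PyTorch 1.5, that automatic download might look roughly like the sketch below; the field and tokenizer setup here are illustrative assumptions, and the repository's src code is authoritative:

```python
from torchtext.data import Field
from torchtext.datasets import IWSLT

# Illustrative fields; the actual tokenization in this repo may differ.
SRC = Field(tokenize=str.split, lower=True, init_token='<sos>', eos_token='<eos>')
TRG = Field(tokenize=str.split, lower=True, init_token='<sos>', eos_token='<eos>')

# Downloads and extracts the IWSLT English-German data on first use.
train_data, valid_data, test_data = IWSLT.splits(
    exts=('.en', '.de'), fields=(SRC, TRG), root='data'
)
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)
```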

Instructions to run

The script training_script.py accepts the following arguments (check the code for default values); a hypothetical argparse sketch follows the list:

  • --batch_size - set to the maximum value that does not raise a CUDA out-of-memory error
  • --language_direction - pick between E2G and G2E
  • --binarize - binarize attention module linear layers during training
  • --binarize_all_linear - binarize all linear layers during training
  • --quantize - quantize attention module linear layers during training
  • --quantize_bits - number of bits used for quantization
  • --quantize_all_linear - quantize all linear layers during training
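
The sketch below is a hypothetical reconstruction of how these flags might be declared; training_script.py is authoritative for the actual names and default values:

```python
import argparse


def str2bool(v):
    # argparse's built-in bool() would treat any non-empty string (even
    # 'False') as True, so parse 'True'/'False' explicitly.
    return str(v).lower() in ('true', '1', 'yes')


parser = argparse.ArgumentParser(description='Transformer quantization-aware training')
parser.add_argument('--batch_size', type=int, default=1500)
parser.add_argument('--language_direction', choices=['E2G', 'G2E'], default='E2G')
parser.add_argument('--binarize', type=str2bool, default=False)
parser.add_argument('--binarize_all_linear', type=str2bool, default=False)
parser.add_argument('--quantize', type=str2bool, default=False)
parser.add_argument('--quantize_bits', type=int, default=8)
parser.add_argument('--quantize_all_linear', type=str2bool, default=False)
args = parser.parse_args()
```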

To train the transformer model without any compression,

  1. python training_script.py --batch_size 1500

To train the transformer model with binarization of attention linear layers,

  1. python training_script.py --batch_size 1500 --binarize True

To train the transformer model with binarization of all linear layers,

  1. python training_script.py --batch_size 1500 --binarize_all_linear True

To train the transformer model with quantization of attention linear layers,

  1. python training_script.py --batch_size 1500 --quantize True --quantize_bits 8

To train the transformer model with quantization of all linear layers,

  1. python training_script.py --batch_size 1500 --quantize_all_linear True --quantize_bits 8
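
For the --binarize paths, a common way to binarize a layer during training is sign binarization with an XNOR-Net-style scaling factor and a hard-tanh straight-through estimator. The sketch below illustrates that idea with assumed names (Binarize, BinaryLinear) and is not necessarily this repository's exact method:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Binarize(torch.autograd.Function):
    """Sign binarization with an XNOR-Net-style scale and hard-tanh STE."""

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # Binarize to {-alpha, +alpha}, where alpha is the mean absolute weight.
        return w.sign() * w.abs().mean()

    @staticmethod
    def backward(ctx, grad_output):
        (w,) = ctx.saved_tensors
        # Hard-tanh STE: pass gradients only where |w| <= 1, zero elsewhere.
        return grad_output * (w.abs() <= 1).float()


class BinaryLinear(nn.Linear):
    """nn.Linear whose weights are binarized on every forward pass."""

    def forward(self, x):
        return F.linear(x, Binarize.apply(self.weight), self.bias)
```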

Model overview

The transformer architecture implemented here follows the seminal paper Attention Is All You Need by Vaswani et al. The architecture of the model is shown below.

[Figure: Transformer model architecture]
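
The linear layers quantized in this work sit inside the attention blocks (for example, the query, key, value, and output projections), which compute scaled dot-product attention. A minimal sketch of that computation:

```python
import math
import torch


def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V   (Vaswani et al., 2017)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    return torch.softmax(scores, dim=-1) @ v
```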

Results

We evaluate the baseline model and the proposed quantization methods on the IWSLT dataset, using the Bilingual Evaluation Understudy (BLEU) score as the metric. More detailed results and inferences are available in the Report.

Model                                     | BLEU Score
------------------------------------------|-----------
Baseline (Uncompressed)                   | 27.9
Binary Quantization (All Linear)          | 13.2
Binary Quantization (Attention Linear)    | 26.87
Quantized - 8 Bit (Attention Linear)      | 29.83
Quantized - 4 Bit (Attention Linear)      | 29.76
Quantized - 2 Bit (Attention Linear)      | 28.72
Quantized - 1 Bit (Attention Linear)      | 24.32
Quantized - 8 Bit (Attention + Embedding) | 21.26
Quantized - 8 Bit (All Linear)            | 27.19
Quantized - 4 Bit (All Linear)            | 27.72
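
For reference, corpus-level BLEU scores like those above can be computed with the sacrebleu package; a toy example with made-up sentences (the repository's own evaluation code may differ):

```python
import sacrebleu

# Toy data: model outputs and one set of reference translations.
hypotheses = ['das haus ist klein', 'die katze schläft']
references = [['das haus ist klein', 'die katze schläft gerne']]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f'BLEU = {bleu.score:.2f}')
```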

The proposed method can also be used for post-training quantization with minimal performance loss (< 1%) on pretrained BERT models. (Results are not shown due to lack of space).
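
As a point of comparison for such post-training quantization, PyTorch itself provides dynamic quantization of linear layers; the snippet below applies it to a pretrained BERT from the Hugging Face transformers library. This illustrates post-training quantization in general, not the method proposed in this work:

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained('bert-base-uncased')

# Post-training dynamic quantization: nn.Linear weights are stored in int8
# and dequantized on the fly at inference time; activations stay in float.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```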

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Vineeth S - vs96codes@gmail.com

Project Link: https://github.com/vineeths96/Compressed-Transformers