Project author: manantomar

Project description:
Mirror Descent Policy Optimization
Language: Python
Project URL: git://github.com/manantomar/Mirror-Descent-Policy-Optimization.git


Mirror-Descent-Policy-Optimization

This repository contains the code for MDPO, a trust-region policy optimization algorithm based on principles of mirror descent. It includes two variants, on-policy MDPO and off-policy MDPO, as described in the paper Mirror Descent Policy Optimization.
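
To make the trust-region idea concrete, below is a minimal sketch (not the repository's code) of the KL-regularized surrogate loss that on-policy MDPO optimizes for several SGD steps per iteration; the tensor names (log_prob, old_log_prob, advantage, kl_div) and the coefficient kl_coeff are illustrative assumptions.

    # Minimal sketch of the on-policy MDPO surrogate (illustrative, not the repo's code).
    # Maximize the importance-weighted advantage under the previous policy pi_k while
    # penalizing the KL divergence from pi_k; returned as a loss to be minimized.
    import tensorflow as tf

    def mdpo_surrogate_loss(log_prob, old_log_prob, advantage, kl_div, kl_coeff):
        ratio = tf.exp(log_prob - old_log_prob)          # pi(a|s) / pi_k(a|s)
        policy_gain = tf.reduce_mean(ratio * advantage)  # E[ratio * A^{pi_k}]
        kl_penalty = kl_coeff * tf.reduce_mean(kl_div)   # ~ (1/t_k) * KL(pi || pi_k)
        return -(policy_gain - kl_penalty)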

This implementation uses TensorFlow and builds on the code provided by stable-baselines.

Getting Started

Prerequisites

All dependencies are listed in the provided requirements.txt file for a Python virtual environment. In particular, you will need stable-baselines, tensorflow, and mujoco_py.

Installation

  1. Install stable-baselines:

     pip install stable-baselines[mpi]==2.7.0

  2. Download and copy the MuJoCo library and license files into a .mujoco/ directory. We use mujoco200 for this project.

  3. Clone MDPO and copy the mdpo-on and mdpo-off directories inside this directory.

  4. Activate the virtual environment created from the provided requirements.txt file (a quick import check is sketched below):

     source <virtual env path>/bin/activate
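
After activating the environment, an import check such as the following sketch (an optional addition, not part of the repository) can confirm that TensorFlow, stable-baselines, and MuJoCo are set up correctly.

    # Optional sanity check: verify that the main dependencies import and
    # that a MuJoCo environment can be created (requires mujoco200 + license).
    import tensorflow as tf
    import stable_baselines
    import mujoco_py
    import gym

    print("TensorFlow:", tf.__version__)
    print("stable-baselines:", stable_baselines.__version__)
    env = gym.make("Walker2d-v2")
    print("Observation space:", env.observation_space)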

Example

Use the run_mujoco.py script for training MDPO.

On-policy MDPO

  python3 run_mujoco.py --env=Walker2d-v2 --sgd_steps=10

Off-policy MDPO

  python3 run_mujoco.py --env=Walker2d-v2 --num_timesteps=1e6 --sgd_steps=1000 --klcoeff=1.0 --lam=0.2 --tsallis_coeff=1.0
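
To launch several runs in sequence (for example, over multiple MuJoCo environments), a small driver script along the lines of the sketch below can wrap these commands; only the flags shown above are used, and the environment list is an illustrative assumption.

    # Sketch of a driver that launches on-policy MDPO runs one after another.
    # Flags are the ones shown in the examples above; the env list is assumed.
    import subprocess

    envs = ["Walker2d-v2", "HalfCheetah-v2", "Hopper-v2"]

    for env_id in envs:
        cmd = ["python3", "run_mujoco.py", f"--env={env_id}", "--sgd_steps=10"]
        print("Launching:", " ".join(cmd))
        subprocess.run(cmd, check=True)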

Reference

  @article{tomar2020mirror,
    title={Mirror Descent Policy Optimization},
    author={Tomar, Manan and Shani, Lior and Efroni, Yonathan and Ghavamzadeh, Mohammad},
    journal={arXiv preprint arXiv:2005.09814},
    year={2020}
  }