Project author: vcg-uvic

Project description:
Simple Python Scheduler
Language: Python
Project URL: git://github.com/vcg-uvic/simple-python-scheduler.git
Created: 2018-09-04T23:26:38Z
Project community: https://github.com/vcg-uvic/simple-python-scheduler

License: GNU General Public License v3.0



Simple Python Scheduler

This is a simple Python scheduler for a multi-user, multi-GPU server. For now,
the scheduler works purely on trust: it assumes that all users will use it to
run their scripts. It is NOT INTENDED for complex environments that need
security.

Right now, it can only allocate GPUs to users on demand and kill any intruding
processes. It will also kill processes that exceed their initial lifetime.
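The kill decision described above can be expressed as a pure function. This is a minimal sketch with several assumptions. It assumes the scheduler already knows which user each GPU is allocated to and how long each allocation may live. It treats a process as "intruding" when it runs on a GPU that is unallocated or allocated to a different user. `GpuProcess` and `pids_to_kill` are hypothetical names. A real scheduler would gather the process list with tools such as nvidia-ml-py3 and psutil rather than receiving it as an argument.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GpuProcess:
    """A process observed on a GPU (hypothetical record type)."""
    pid: int
    gpu: int
    user: str
    start_time: float  # seconds since the epoch


def pids_to_kill(allocations: Dict[int, str],
                 lifetimes: Dict[int, float],
                 processes: List[GpuProcess],
                 now: float) -> List[int]:
    """Return PIDs that are intruding or have outlived their lifetime.

    allocations maps a GPU index to the user it is allocated to (assumed).
    lifetimes maps a GPU index to the allowed run time in seconds (assumed).
    """
    victims = []
    for proc in processes:
        owner = allocations.get(proc.gpu)
        if owner != proc.user:
            # Assumed intruder rule: GPU is free or belongs to someone else.
            victims.append(proc.pid)
        elif now - proc.start_time > lifetimes.get(proc.gpu, float("inf")):
            # The owner's process has run longer than its initial lifetime.
            victims.append(proc.pid)
    return victims
```

Keeping the policy free of side effects lets it be tested without a GPU. The part that actually sends signals can stay a thin wrapper around the returned PIDs.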

Dependencies

  1. flufl.lock
  2. psutil
  3. nvidia-ml-py3

Supported Commands

salloc

Starts an interactive shell with the allocated GPU. Closing the shell releases
the GPU.
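One plausible way to tie a shell to a GPU is sketched below. It is a sketch under an assumption the README does not state: that the GPU is exposed to the shell through the CUDA `CUDA_VISIBLE_DEVICES` environment variable. `salloc_env` and `run_salloc_shell` are hypothetical helpers, not the project's code.

```python
import os
import subprocess
from typing import Dict, Mapping, Optional


def salloc_env(gpu: int, base_env: Mapping[str, str]) -> Dict[str, str]:
    """Copy base_env and restrict CUDA to the allocated GPU (assumed mechanism)."""
    env = dict(base_env)  # copy so the caller's environment is not modified
    env["CUDA_VISIBLE_DEVICES"] = str(gpu)
    return env


def run_salloc_shell(gpu: int, shell: Optional[str] = None) -> int:
    """Block in an interactive shell and return its exit code.

    The caller would release the GPU once this returns, i.e. when the
    user closes the shell.
    """
    shell = shell or os.environ.get("SHELL", "/bin/sh")
    return subprocess.call([shell], env=salloc_env(gpu, os.environ))
```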

srunsched

Run the scheduler.

Planned Commands

sbatch

Will queue the job for execution. Jobs will run in first-in, first-out order:
the first job queued runs first. There will be no priority settings or limits
for now. Each job must define its up-time.
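The planned sbatch policy (plain first-in, first-out with a mandatory up-time) could look like the following minimal in-memory sketch. `BatchJob` and `FifoQueue` are hypothetical names, and the up-time is assumed to be given in seconds. The real scheduler is meant to keep its queue on disk under /var/sps/queue.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class BatchJob:
    """Hypothetical in-memory record for an sbatch job."""
    submit_time: float
    user: str
    command: str
    uptime: float  # maximum run time in seconds (assumed unit); required


class FifoQueue:
    """First-in, first-out queue with no priorities or per-user limits."""

    def __init__(self) -> None:
        self._jobs: Deque[BatchJob] = deque()

    def submit(self, job: BatchJob) -> None:
        # Every batch job must declare a positive up-time before queueing.
        if job.uptime <= 0:
            raise ValueError("sbatch jobs must define a positive up-time")
        self._jobs.append(job)

    def next_job(self) -> Optional[BatchJob]:
        # The earliest queued job runs first; None if the queue is empty.
        return self._jobs.popleft() if self._jobs else None
```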

squeue

Read and report the current queue.
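A squeue report could be built directly from the queue directory, because job file names begin with their submission time. Below is a minimal sketch. `queued_jobs` and `squeue_report` are hypothetical names. The report format is an assumption, since the README does not specify one.

```python
import os
from typing import List


def queued_jobs(queue_dir: str) -> List[str]:
    """Return .job file names in queue_dir, oldest submission first."""
    names = [n for n in os.listdir(queue_dir) if n.endswith(".job")]
    # Names start with <time>, so sort numerically on the text before the
    # first '-' instead of lexicographically.
    return sorted(names, key=lambda n: float(n.split("-", 1)[0]))


def squeue_report(queue_dir: str) -> str:
    """Format the queue as one numbered line per job (assumed format)."""
    lines = [f"{i:3d}  {name}" for i, name in enumerate(queued_jobs(queue_dir), 1)]
    return "\n".join(lines) if lines else "(queue is empty)"
```

Numeric sorting matters because a plain string sort would place "20.5-..." before "3.0-...".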

susage

Will report the wall time of all users (planned for later).

Directories to be monitored

The queue will be located at /var/sps/queue.

Each user's add queue will be located at /var/sps/addqueue/<user>.

Each user's GPU quota will be located at /var/sps/addqueue/<user>.quota. The
quota only takes effect when /var/sps/addqueue/<user> exists.
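The rule that a quota only applies when the user's add queue exists can be sketched as below. `read_quota` is a hypothetical helper, and `base` would be /var/sps in the layout above. The sketch assumes the .quota file holds a single integer (a number of GPUs); the README does not specify the file's contents.

```python
import os
from typing import Optional


def read_quota(base: str, user: str) -> Optional[int]:
    """Return the user's GPU quota, or None if no quota applies.

    Assumes <base>/addqueue/<user>.quota contains a single integer.
    """
    user_queue = os.path.join(base, "addqueue", user)
    quota_file = user_queue + ".quota"
    # The quota is ignored unless addqueue/<user> itself exists.
    if not os.path.exists(user_queue) or not os.path.isfile(quota_file):
        return None
    with open(quota_file) as f:
        return int(f.read().strip())
```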

The job currently running on GPU X will be recorded at /var/sps/gpu/X.
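A scheduler pass could rebuild its view of GPU ownership from the /var/sps/gpu directory. This is a sketch with stated assumptions. It assumes each file X holds the name of the job currently on GPU X, and that an empty file means the GPU is free. `current_jobs` is a hypothetical name.

```python
import os
from typing import Dict, Optional


def current_jobs(gpu_dir: str) -> Dict[int, Optional[str]]:
    """Map each GPU index X to the job recorded in gpu_dir/X (None if empty)."""
    jobs: Dict[int, Optional[str]] = {}
    for name in os.listdir(gpu_dir):
        if not name.isdigit():
            continue  # only files named after a GPU index are tracked
        with open(os.path.join(gpu_dir, name)) as f:
            content = f.read().strip()
        jobs[int(name)] = content or None
    return jobs
```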

Job file

Job files will be named

<time>-<user>-<type>-<pid>.job

and the corresponding shell environment of a batch job will be saved as

<time>-<user>-<type>-<pid>.env

<time> is the value returned by Python's time.time().

<pid> is the PID of the job submitter.

<type> is either salloc or sbatch.
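The naming scheme above can be produced and parsed as in this minimal sketch. `JobName`, `make_job_name` and `parse_job_name` are hypothetical helpers. Parsing splits type and PID from the right, so a hyphen inside a user name does not break it. This relies on time.time() values never containing a hyphen.

```python
import os
import time
from typing import NamedTuple, Optional


class JobName(NamedTuple):
    submit_time: float  # from time.time()
    user: str
    job_type: str       # "salloc" or "sbatch"
    pid: int            # PID of the job submitter


def make_job_name(user: str, job_type: str, pid: Optional[int] = None,
                  submit_time: Optional[float] = None) -> str:
    """Build <time>-<user>-<type>-<pid> without the .job/.env suffix."""
    if job_type not in ("salloc", "sbatch"):
        raise ValueError(f"unknown job type: {job_type}")
    submit_time = time.time() if submit_time is None else submit_time
    pid = os.getpid() if pid is None else pid
    return f"{submit_time}-{user}-{job_type}-{pid}"


def parse_job_name(name: str) -> JobName:
    """Inverse of make_job_name; accepts names with or without a suffix."""
    stem = name
    for suffix in (".job", ".env"):
        if stem.endswith(suffix):
            stem = stem[: -len(suffix)]
    head, job_type, pid = stem.rsplit("-", 2)
    submit_time, user = head.split("-", 1)
    return JobName(float(submit_time), user, job_type, int(pid))
```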

All job and env files are stored in JSON format.
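Writing and reading a job/env pair with Python's standard json module might look like this sketch. `write_job_files` and `read_job_files` are hypothetical helpers, and the keys inside the job dictionary are illustrative only. The README does not define the job schema.

```python
import json
import os
from typing import Any, Dict, Mapping, Tuple


def write_job_files(queue_dir: str, name: str, job: Dict[str, Any],
                    env: Mapping[str, str]) -> Tuple[str, str]:
    """Write <name>.env and <name>.job as JSON; return (job_path, env_path)."""
    job_path = os.path.join(queue_dir, name + ".job")
    env_path = os.path.join(queue_dir, name + ".env")
    with open(env_path, "w") as f:
        json.dump(dict(env), f)
    # Write the .job file second, so it does not appear before its .env file.
    with open(job_path, "w") as f:
        json.dump(job, f)
    return job_path, env_path


def read_job_files(queue_dir: str, name: str) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Load the job description and its saved shell environment."""
    with open(os.path.join(queue_dir, name + ".job")) as f:
        job = json.load(f)
    with open(os.path.join(queue_dir, name + ".env")) as f:
        env = json.load(f)
    return job, env
```

For a batch job, `env` would typically be a snapshot of the submitter's `os.environ`, taken when the job is submitted.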

TODO

  • All variables and functions currently live in a single file for each
    instance. Restructure this more cleanly.

Known Vulnerabilities

  • The lock file can be deleted arbitrarily.