Project author: zlsh80826

Project description:
Machine Comprehension Train on MSMARCO with S-NET Extraction Modification
Language: Python
Repository: git://github.com/zlsh80826/MSMARCO.git
Created: 2018-05-29T11:07:07Z
Project page: https://github.com/zlsh80826/MSMARCO


MSMARCO with S-NET Extraction (Extraction-net)

Requirements

The following libraries are required for training and evaluation.

General

  • python3.6
  • cuda-9.0 (required by CNTK)
  • openmpi-1.10 (required by CNTK)
  • gcc >= 6 (required by CNTK)

Python

  • Please refer to requirements.txt

Evaluate with pretrained model

This repo provides a pretrained model and a pre-processed validation dataset for testing performance.

Please download the pretrained model and the pre-processed data, put them in the MSMARCO/data directory and the MSMARCO root directory respectively, then decompress them so the files end up in the layout shown below.

The directory structure should look like this:

  MSMARCO
  ├── data
  │   ├── elmo_embedding.bin
  │   ├── test.tsv
  │   ├── vocabs.pkl
  │   ├── data.tar.gz
  │   └── ... others
  ├── model
  │   ├── pm.model
  │   ├── pm.model.ckp
  │   └── pm.model_out.json
  └── ... others
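
Before running the evaluation, it can help to confirm that the decompressed files are where the tree above expects them. The following is a minimal sketch and not part of the repository: the helper name `missing_files` is hypothetical, and the file list is simply copied from the tree above.

```python
from pathlib import Path

# Files named in the directory tree above, relative to the MSMARCO root.
EXPECTED = [
    "data/elmo_embedding.bin",
    "data/test.tsv",
    "data/vocabs.pkl",
    "model/pm.model",
    "model/pm.model.ckp",
    "model/pm.model_out.json",
]


def missing_files(root):
    """Return the expected files that are not present under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]
```

Calling `missing_files(".")` from the MSMARCO root returns an empty list when everything is in place, and otherwise the relative paths that still need to be extracted.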

After decompressing,

  cd Evaluation
  sh eval.sh

then you should get the generated answers and the rouge-l score.
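
For readers unfamiliar with the metric, rouge-l scores a generated answer by its longest common subsequence (LCS) with a reference answer. The sketch below computes the LCS-based F-measure for a single pair. It is not the repository's evaluation code: the function names are illustrative, whitespace tokenization is a simplification, and the default `beta=1.2` is an assumption about the weighting the evaluation script uses.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.2):
    """LCS-based F-measure in [0, 1] for one candidate/reference pair."""
    cand = candidate.split()
    ref = reference.split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    # beta > 1 weights recall more heavily than precision.
    return ((1 + beta ** 2) * precision * recall) / (recall + beta ** 2 * precision)
```

The function returns a value between 0 and 1, so it would need to be multiplied by 100 to be comparable with percentage-style scores.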

Usage

Preprocess

MSMARCO V1

Download the MSMARCO v1 dataset and the GloVe embedding.

  cd data
  python3.6 download.py v1

Convert raw data to tsv format

  python3.6 convert_msmarco.py v1 --threads=`nproc`
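
The `--threads` option suggests that the conversion is split across a pool of workers. The sketch below illustrates that general pattern only and is not how convert_msmarco.py is implemented: the record fields (`query_id`, `query`, `passage`) and both function names are hypothetical. It uses a thread pool for simplicity; for CPU-bound parsing in Python, a process pool is the more common choice.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor


def to_tsv_row(line):
    """Turn one JSON record into a tab-separated row (hypothetical fields)."""
    record = json.loads(line)
    fields = (record["query_id"], record["query"], record["passage"])
    # Tabs and newlines inside a field would break the TSV layout.
    return "\t".join(str(f).replace("\t", " ").replace("\n", " ") for f in fields)


def convert(lines, workers=None):
    """Convert records with a pool of workers, preserving input order."""
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(to_tsv_row, lines))
```

Passing `workers=None` falls back to `os.cpu_count()`, which plays roughly the role of `nproc` in the command above.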

Convert tsv format to ctf (CNTK input) format and build the vocabulary dictionary

  python3.6 tsv2ctf.py
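
To give a sense of what this step produces: in the CNTK text format, each line starts with a sequence id followed by one or more `|name values` samples, and lines sharing a sequence id form one sequence. The sketch below builds a vocabulary and renders tokens as sparse `index:1` entries. It is an illustration under assumptions, not the logic of tsv2ctf.py: the stream names `q` and `c` and both function names are hypothetical.

```python
def build_vocab(token_lists):
    """Map each distinct token to an integer id, in first-seen order."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_ctf(seq_id, streams, vocab):
    """Render one example as CTF lines.

    `streams` maps a stream name to its token list. Every line of the
    sequence starts with the same sequence id; line i carries the i-th
    token of each stream that is long enough, written as `index:1`.
    """
    length = max(len(tokens) for tokens in streams.values())
    lines = []
    for i in range(length):
        parts = [str(seq_id)]
        for name, tokens in streams.items():
            if i < len(tokens):
                parts.append("|{} {}:1".format(name, vocab[tokens[i]]))
        lines.append(" ".join(parts))
    return lines
```

For a query "what is cntk" and a context "cntk is a toolkit", the first emitted line pairs the first token of each stream, and later lines contain only the longer stream once the shorter one is exhausted.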

Generate elmo embedding

  sh elmo.sh

MSMARCO V2

Download the MSMARCO v2 dataset and the GloVe embedding.

  cd data
  python3.6 download.py v2

Convert raw data to tsv format

  python3.6 convert_msmarco.py v2 --threads=`nproc`

Convert tsv format to ctf (CNTK input) format and build the vocabulary dictionary

  python3.6 tsv2ctf.py

Generate elmo embedding

  sh elmo.sh

Train (Same for V1 and V2)

  cd ../script
  mkdir log
  sh run.sh

Evaluate on the development dataset

MSMARCO V1

  cd Evaluation
  sh eval.sh v1

MSMARCO V2

  cd Evaluation
  sh eval.sh v2

Performance

Paper

                              rouge-l  bleu_1
S-Net (Extraction)            41.45    44.08
S-Net (Extraction, Ensemble)  42.92    44.97

This implementation

                     rouge-l  bleu_1
MSMARCO v1 w/o elmo  38.43    39.14
MSMARCO v1 w/ elmo   39.42    39.47
MSMARCO v2 w/ elmo   43.66    44.44
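
As a reminder of what the bleu_1 column measures, BLEU-1 is the clipped unigram precision of a generated answer against a reference, multiplied by a brevity penalty for short answers. The sketch below handles a single reference with whitespace tokenization. It is illustrative only and not the scoring code behind the numbers above; the function name is hypothetical.

```python
import math
from collections import Counter


def bleu_1(candidate, reference):
    """Single-reference BLEU-1 in [0, 1]."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clip each candidate token's count by its count in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        penalty = 1.0
    else:
        penalty = math.exp(1 - len(ref) / len(cand))
    return penalty * precision
```

The result lies between 0 and 1, whereas the tables above report scores on a 0 to 100 scale.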

TODO

  • [X] Multi-threads preprocessing
  • [X] Elmo-Embedding
  • [X] Evaluation script
  • [X] MSMARCO v2 support
  • [ ] Reasonable metrics