Project author: hitochan777

Project description:
Replacement-based unknown replacer
Language: Python
Project address: git://github.com/hitochan777/unk-replacer.git
Created: 2016-12-10T14:11:27Z
Project community: https://github.com/hitochan777/unk-replacer

License: MIT License

Replacer

Unknown word replacer in Neural Machine Translation (NMT)

Requirements

  • Python >=3.3

Install

Currently unk-replacer is not registered on PyPI, so you need to
install it from GitHub:

  pip install git+https://github.com/hitochan777/unk-replacer.git

If you plan on modifying the code, it is better to do an “editable” installation:

  git clone https://github.com/hitochan777/unk-replacer.git
  cd unk-replacer
  pip install -e .

Usage Overview

After installing unk-replacer, a script unk-rep should be
available globally.

This script has several sub-commands, which you can list with the following command:

  unk-rep -h

You can also get help for a specific sub-command (e.g. replace-parallel) as follows:

  unk-rep replace-parallel -h

Basic Usage

  1. Build the source and target vocabularies from the training data with the following command:

    unk-rep build-vocab \
      word \
      --source-file /path/to/source/training/data \
      --target-file /path/to/target/training/data \
      --src-vocab-size 50000 \
      --tgt-vocab-size 50000 \
      --output-file /path/to/json/vocab/file
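
    Conceptually, building a word vocabulary amounts to keeping the most
    frequent words of the corpus. A minimal sketch (the file paths and the
    JSON layout below are only illustrative; unk-rep's actual schema is
    not shown here):

      import json
      from collections import Counter

      def top_k_words(path, k):
          """Return the k most frequent words of a tokenized corpus."""
          counts = Counter()
          with open(path, encoding="utf-8") as f:
              for line in f:
                  counts.update(line.split())
          return [word for word, _ in counts.most_common(k)]

      vocab = {  # illustrative layout, not unk-rep's actual schema
          "source": top_k_words("train.src", 50000),
          "target": top_k_words("train.tgt", 50000),
      }
      with open("vocab.json", "w", encoding="utf-8") as f:
          json.dump(vocab, f, ensure_ascii=False)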
  2. Get word alignments for the parallel corpora and lexical translation tables for both directions.

    Typically you can obtain the lexical translation tables as byproducts
    of word alignment.
    You can use GIZA++ or mgiza because they are fast.
    However, we recommend Nile, a supervised alignment model, over
    GIZA++ because it produces much better alignments.
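
    Most word aligners emit alignments in the Pharaoh i-j format, one
    line per sentence pair, where i and j are 0-based source and target
    word indices; verify that this matches what your aligner produces
    and what unk-rep expects. For example:

      0-0 1-1 2-3 3-2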

  3. Train source and target Word2vec models

    For example, you can use the gensim module to train a word2vec model
    from TRAIN and save it to MODEL_NAME.

    python -m gensim.models.word2vec \
      -train TRAIN \
      -output MODEL_NAME

    There are many parameters you can change.
    For more information, type

    python -m gensim.models.word2vec -h
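
    Alternatively, you can call the gensim API directly from Python.
    A minimal sketch, assuming gensim is installed and TRAIN contains
    one tokenized sentence per line (default hyperparameters are used;
    tune them for your data):

      from gensim.models import Word2Vec
      from gensim.models.word2vec import LineSentence

      # Stream the corpus: one tokenized sentence per line.
      sentences = LineSentence("TRAIN")
      model = Word2Vec(sentences)  # default hyperparameters
      model.save("MODEL_NAME")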
  4. Replace unknown words in the training data with the
    following command.

    unk-rep replace-parallel \
      --root-dir /path/to/save/artifacts \
      --src-w2v-model /path/to/source/word2vec/model \
      --tgt-w2v-model /path/to/target/word2vec/model \
      --lex-e2f /path/to/target/to/source/lex/dict \
      --lex-f2e /path/to/source/to/target/lex/dict \
      --train-src /path/to/source/training/data \
      --train-tgt /path/to/target/training/data \
      --train-align /path/to/word/alignment/for/training/data \
      --vocab /path/to/json/vocab/file \
      --memory /path/to/save/replacement/memory \
      --replace-type multi

    Each line of the source-to-target dictionary file should be of the
    form target source probability (an illustrative fragment is shown at
    the end of this step).

    If you also want to replace unknown words in development data,
    specify the paths to the source development data (--dev-src), the
    target development data (--dev-tgt), and the word alignment (--dev-align).

    If you want to replace only unknown words with one-to-one alignments, set --replace-type to 1-to-1.

    You can set --handle-numbers if you want to apply special handling to numbers.
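
    For example, a hypothetical French-to-English fragment of the
    source-to-target table (words and probabilities are purely
    illustrative):

      maison house 0.71
      maison home 0.22
      chat cat 0.93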

  5. Train the NMT model with the replaced training data from step 4

  6. Replace unknown words in the test data with the following command.

    unk-rep replace-input \
      --root-dir /path/to/save/artifacts \
      --w2v-model /path/to/source/word2vec/model \
      --input /path/to/input/data \
      --vocab /path/to/json/vocab/file \
      --replace-log /path/to/save/replace/log/file

    A replace log keeps track of which parts of an original sentence
    map to which parts of the replaced input sentence.
    This log is necessary to restore the final translation.

    You can set --handle-numbers if you want to apply special handling to numbers.

  7. Translate the replaced test data with the trained NMT model.

    We recommend ensembling several models, because it normally
    leads to better attention.

  8. Restore the final translation with the following command.

    unk-rep restore \
      --translation /path/to/translation \
      --orig-input /path/to/original/input \
      --replaced-input /path/to/replaced/input \
      --output /path/to/save/final/translation \
      --lex-e2f /path/to/target/to/source/lex/dict \
      --lex-f2e /path/to/source/to/target/lex/dict \
      --replace-log /path/to/replace/log \
      --attention /path/to/attention \
      --lex-backoff

    --lex-backoff enables the use of the lexical translation tables
    when the replacement memory does not contain the queried entry.
    We recommend that you enable it (see the sketch below).
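
    Conceptually, the backoff behaves like the following sketch (the
    function and data structures are illustrative, not unk-rep
    internals):

      def restore_word(word, memory, lex_f2e):
          """Replacement-memory lookup with lexical-table backoff."""
          if word in memory:                  # replacement memory hit
              return memory[word]
          candidates = lex_f2e.get(word, {})  # {target word: probability}
          if candidates:                      # lexical-table backoff
              return max(candidates, key=candidates.get)
          return word                         # last resort: copy the word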

    A JSON file is supported for the attention.
    It should look like this:

    [
      [
        [0.2, 0.4, ..., 0.2],
        [0.5, 0.1, ..., 0.01],
        ...
        [0.04, 0.3, ..., 0.2]
      ],
      ...
      [
        ...
      ]
    ]

    That is, the file contains a list of attention matrices, one for each input sentence.
    Alternatively, you can specify the file obtained by --rich_output_filename in knmt.
    You can set --handle-numbers if you want to apply special handling to numbers.
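
    For instance, if your NMT toolkit hands you one attention matrix per
    sentence, you could dump them in this layout as follows (a minimal
    sketch; the file name and the dummy matrices are illustrative):

      import json
      import numpy as np

      # One matrix per input sentence, shape (target_len, source_len);
      # each row holds the attention weights of one target word.
      attentions = [
          np.random.dirichlet(np.ones(5), size=4),  # dummy data
          np.random.dirichlet(np.ones(7), size=3),
      ]

      with open("attention.json", "w") as f:
          json.dump([a.tolist() for a in attentions], f)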

Advanced Usage

Hybrid of BPE and Replacement Based Method

You can also choose to use BPE to segment unknown words
that are not handled by the replacement-based method.

Note: You cannot apply special handling of numbers if you use BPE as a backoff!
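
To illustrate what the BPE backoff does to an out-of-vocabulary word,
here is a minimal sketch of greedy BPE segmentation (the merge table is
a toy example; the real merges come from the learned BPE vocabulary):

  from typing import List, Tuple

  def bpe_segment(word: str, merges: List[Tuple[str, str]]) -> List[str]:
      """Apply learned BPE merges, in learned order, to one word."""
      symbols = list(word)
      for left, right in merges:
          merged = []
          i = 0
          while i < len(symbols):
              if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (left, right):
                  merged.append(left + right)  # apply this merge
                  i += 2
              else:
                  merged.append(symbols[i])
                  i += 1
          symbols = merged
      return symbols

  merges = [("l", "o"), ("lo", "w"), ("e", "r")]  # toy merge table
  print(bpe_segment("lower", merges))  # ['low', 'er']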

  1. Build word and BPE vocabularies
    Build the word and BPE vocabularies separately: the word vocabulary
    with the aforementioned command, and the BPE vocabulary with the
    following command.

    unk-rep build-vocab \
      bpe \
      --source-file /path/to/source/training/data \
      --target-file /path/to/target/training/data \
      --src-vocab-size 50000 \
      --tgt-vocab-size 50000 \
      --output-file /path/to/bpe/vocab/file
  2. Combine word and BPE vocabulary
    Assuming that you already have the word vocabulary saved in /path/to/word/vocab/file,
    combine the word and BPE vocabularies with the following command.

    unk-rep combine-word-and-bpe-vocab \
      --bpe-voc /path/to/bpe/vocab/file \
      --word-voc /path/to/word/vocab/file \
      --output /path/to/combined/vocab/file
  3. Replace unknown words in the training data with the
    following command.

    unk-rep replace-parallel \
      ...
      --vocab /path/to/combined/vocab/file \
      --memory /path/to/save/replacement/memory \
      --replace-type multi \
      --bpe-vocab /path/to/bpe/vocab/file \
      --back-off bpe

    You need to set --back-off to bpe.

  4. Train the NMT model with the replaced training data

  5. Replace unknown words in the test data with the following command.

    unk-rep replace-input \
      ...
      --bpe-vocab /path/to/bpe/vocab/file \
      --back-off bpe

    You need to set --back-off to bpe.

  6. Translate the replaced test data with the trained NMT model.

  7. Restore the final translation with unk-rep restore, as in the basic usage.