Project author: SeonbeomKim

Project description: Byte Pair Encoding (BPE)
Language: Python
Repository: git://github.com/SeonbeomKim/Python-Byte_Pair_Encoding.git
Created: 2018-10-24T02:14:27Z
Project page: https://github.com/SeonbeomKim/Python-Byte_Pair_Encoding


Python-Byte_Pair_Encoding

Byte Pair Encoding (BPE)

Env

  • Python 3
  • NumPy 1.15
  • tqdm
  • multiprocessing

Paper

Command

  • Learn BPE from documents

    python bpe_learn.py
        -train_path 1_document 2_document ... K_document
        -voca_out_path voca_path/voca_file_name
        -bpe_out_path 1_BPE_document 2_BPE_document ... K_BPE_document
        -train_voca_threshold 1
        -num_merges 30000
        -multi_proc=-1    (-1: use all processes, 1: do not use multiprocessing)
        -final_voca_size 30000    or    -final_voca_threshold 50
  • Apply BPE to documents

    python bpe_apply.py
        -data_path 1_document 2_document ... K_document
        -voca_path voca_path/voca_file_name
        -bpe_out_path 1_BPE_document 2_BPE_document ... K_BPE_document
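
For readers new to BPE, the sketch below illustrates what the learn and apply steps do in the common textbook formulation: repeatedly merge the most frequent adjacent symbol pair, then replay the learned merges on new words. It is not this repository's code. The function names `learn_bpe`, `merge_pair`, and `apply_bpe` and the `</w>` end-of-word marker are assumptions made for illustration, and `num_merges` is only assumed to play the same role as the `-num_merges` option.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one joined symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from an iterable of words (hypothetical helper)."""
    # Each word becomes a tuple of characters plus an assumed end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment one word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["ab", "ab", "ab", "cb"]
    rules = learn_bpe(corpus, num_merges=2)
    print(rules)
    print(apply_bpe("cb", rules))
```

Replaying merges in the order they were learned keeps the apply step simple; a production tool would typically also handle file I/O, vocabulary thresholds, and parallelism, which this sketch omits.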

Reference