Project author: dongjun-Lee

Project description:
TensorFlow seq2seq implementation of text summarization.
Language: Python
Repository: git://github.com/dongjun-Lee/text-summarization-tensorflow.git
Created: 2018-05-30T12:29:57Z
Project community: https://github.com/dongjun-Lee/text-summarization-tensorflow

License: MIT License


tensorflow-text-summarization

A simple TensorFlow implementation of text summarization using the seq2seq library.

Model

Encoder-Decoder model with attention mechanism.

Word Embedding

Pre-trained GloVe vectors can be used to initialize the word embeddings.
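
As a rough illustration of this kind of initialization, the sketch below parses GloVe's plain-text format (one token per line, followed by its vector components) and builds an embedding table, falling back to small random vectors for words that GloVe does not cover. The function names, the toy vocabulary, and the uniform fallback range are assumptions made for illustration; they are not taken from this repository.

```python
import io
import random


def load_glove(lines):
    """Parse GloVe text lines of the form 'word v1 v2 ... vd' into a dict."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_table(vocab, glove, dim, seed=0):
    """Return one vector per vocabulary word, in vocabulary order."""
    rng = random.Random(seed)
    table = []
    for word in vocab:
        if word in glove:
            table.append(list(glove[word]))
        else:
            # Assumption: words missing from GloVe get small random vectors.
            table.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return table


# Toy stand-in for a GloVe file; real files have many more dimensions.
toy_glove = io.StringIO("sales 0.1 0.2 0.3\nshares 0.4 0.5 0.6\n")
glove = load_glove(toy_glove)
table = build_embedding_table(["<unk>", "sales", "shares"], glove, dim=3)
print(table[1])
```

A table built this way would typically be handed to the model as the initial value of a trainable embedding variable.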

Encoder

Used LSTM cell with stack_bidirectional_dynamic_rnn.

Decoder

Used LSTM BasicDecoder for training, and BeamSearchDecoder for inference.
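
To make the inference side more concrete, here is a minimal, framework-free sketch of beam search: at each step it keeps only the `beam_width` highest-scoring partial sequences. It is not the BeamSearchDecoder code itself; the toy transition table and all names below are hypothetical.

```python
import math


def beam_search(next_log_probs, start, end, beam_width, max_len):
    """Keep the `beam_width` best partial sequences at every step."""
    beams = [([start], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == end:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda c: c[1])[0]


# Hypothetical toy model: next-token probabilities depend on the last token only.
TOY_MODEL = {
    "<s>": {"shares": 0.6, "tokyo": 0.4},
    "tokyo": {"shares": 0.9, "</s>": 0.1},
    "shares": {"close": 0.7, "</s>": 0.3},
    "close": {"</s>": 1.0},
}


def toy_next_log_probs(tokens):
    return {tok: math.log(p) for tok, p in TOY_MODEL[tokens[-1]].items()}


best = beam_search(toy_next_log_probs, "<s>", "</s>", beam_width=2, max_len=5)
print(" ".join(best))
```

With `beam_width=1` the same function reduces to greedy decoding.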

Attention Mechanism

Used BahdanauAttention with weight normalization.
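
For readers unfamiliar with the scoring function, the following pure-Python sketch shows additive (Bahdanau-style) attention with a weight-normalized scoring vector: `v` is rescaled to unit length and multiplied by a gain `g` before scoring. It assumes the query and keys are already projected to a common size; in the TensorFlow class the projections, `v`, `g`, and the bias are learned, whereas the values below are made up.

```python
import math


def additive_attention(query, keys, v, g, b):
    """Score each key against the query, then softmax the scores.

    query: projected decoder state, length d
    keys:  projected encoder states, each of length d
    v, b:  vectors of length d; g: scalar gain
    """
    # Weight normalization: rescale v to unit length, then apply the gain g.
    v_norm = math.sqrt(sum(x * x for x in v))
    normed_v = [g * x / v_norm for x in v]

    scores = []
    for key in keys:
        hidden = [math.tanh(k + q + bias) for k, q, bias in zip(key, query, b)]
        scores.append(sum(nv * h for nv, h in zip(normed_v, hidden)))

    # Numerically stable softmax over the encoder positions.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(weights, values):
    """Weighted sum of encoder states using the attention weights."""
    dim = len(values[0])
    return [sum(w * val[i] for w, val in zip(weights, values)) for i in range(dim)]


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = additive_attention([0.5, 0.5], keys, v=[1.0, 1.0], g=1.0, b=[0.0, 0.0])
context = context_vector(weights, keys)
print(weights, context)
```

Because `v` is normalized, rescaling it leaves the attention weights unchanged; only the gain `g` controls how sharp the scores are.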

Requirements

  • Python 3
  • TensorFlow (>=1.8.0)
  • pip install -r requirements.txt

Usage

Prepare data

The dataset is available at harvardnlp/sent-summary. Place the summary.tar.gz file in the project root directory, then run

  $ python prep_data.py

To use pre-trained GloVe embeddings, download them via

  $ python prep_data.py --glove

Train

Training uses sumdata/train/train.article.txt and sumdata/train/train.title.txt. To train the model, run

  $ python train.py

To use pre-trained GloVe vectors as the initial embedding, run

  $ python train.py --glove

Additional Hyperparameters

  $ python train.py -h
  usage: train.py [-h] [--num_hidden NUM_HIDDEN] [--num_layers NUM_LAYERS]
                  [--beam_width BEAM_WIDTH] [--glove]
                  [--embedding_size EMBEDDING_SIZE]
                  [--learning_rate LEARNING_RATE] [--batch_size BATCH_SIZE]
                  [--num_epochs NUM_EPOCHS] [--keep_prob KEEP_PROB] [--toy]

  optional arguments:
    -h, --help            show this help message and exit
    --num_hidden NUM_HIDDEN
                          Network size.
    --num_layers NUM_LAYERS
                          Network depth.
    --beam_width BEAM_WIDTH
                          Beam width for beam search decoder.
    --glove               Use glove as initial word embedding.
    --embedding_size EMBEDDING_SIZE
                          Word embedding size.
    --learning_rate LEARNING_RATE
                          Learning rate.
    --batch_size BATCH_SIZE
                          Batch size.
    --num_epochs NUM_EPOCHS
                          Number of epochs.
    --keep_prob KEEP_PROB
                          Dropout keep prob.
    --toy                 Use only 5K samples of data
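
The help output above has the shape that Python's argparse produces, so a parser along the following lines would accept the same flags. This is a reconstruction from the help text only, not the repository's code: the argument types are assumptions, and defaults are left unset because the help text does not show them.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="train.py")
    # Types are assumed; defaults are intentionally omitted (unknown from the help text).
    parser.add_argument("--num_hidden", type=int, help="Network size.")
    parser.add_argument("--num_layers", type=int, help="Network depth.")
    parser.add_argument("--beam_width", type=int, help="Beam width for beam search decoder.")
    parser.add_argument("--glove", action="store_true", help="Use glove as initial word embedding.")
    parser.add_argument("--embedding_size", type=int, help="Word embedding size.")
    parser.add_argument("--learning_rate", type=float, help="Learning rate.")
    parser.add_argument("--batch_size", type=int, help="Batch size.")
    parser.add_argument("--num_epochs", type=int, help="Number of epochs.")
    parser.add_argument("--keep_prob", type=float, help="Dropout keep prob.")
    parser.add_argument("--toy", action="store_true", help="Use only 5K samples of data")
    return parser


# Illustrative values only; they are not recommended settings.
args = build_parser().parse_args(["--glove", "--num_layers", "3", "--learning_rate", "0.001"])
print(args.glove, args.num_layers, args.learning_rate, args.toy)
```

Flags can be combined freely on the command line, and the boolean switches (--glove, --toy) take no value.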

Test

Generate a summary for each article in sumdata/train/valid.article.filter.txt with

  $ python test.py

This writes the generated summaries to result.txt. Compute ROUGE metrics between result.txt and sumdata/train/valid.title.filter.txt using pltrdy/files2rouge.
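
To give a feel for what a ROUGE score measures, the sketch below computes a simplified ROUGE-1 (unigram overlap) for one summary pair on whitespace tokens. It is an illustration only and not a replacement for files2rouge, whose scores may differ because of additional processing; the function name is hypothetical.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 on whitespace tokens."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# One model output / reference title pair in the style of the dataset.
model_output = "tokyo shares close # percent higher"
actual_title = "tokyo shares close up # percent"
p, r, f = rouge_1(model_output, actual_title)
print(round(p, 3), round(r, 3), round(f, 3))
```

Here five of the six tokens on each side match, so precision, recall, and F1 all come out to 5/6.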

Sample Summary Output

  "general motors corp. said wednesday its us sales fell ##.# percent in december and four percent in #### with the biggest losses coming from passenger car sales ."
  > Model output: gm us sales down # percent in december
  > Actual title: gm december sales fall # percent

  "japanese share prices rose #.## percent thursday to <unk> highest closing high for more than five years as fresh gains on wall street fanned upbeat investor sentiment , dealers said ."
  > Model output: tokyo shares close # percent higher
  > Actual title: tokyo shares close up # percent

  "hong kong share prices opened #.## percent higher thursday on follow-through interest in properties after wednesday 's sharp gains on abating interest rate worries , dealers said ."
  > Model output: hong kong shares open higher
  > Actual title: hong kong shares open higher as rate worries ease

  "the dollar regained some lost ground in asian trade thursday in what was seen as a largely technical rebound after weakness prompted by expectations of a shift in us interest rate policy , dealers said ."
  > Model output: dollar stable in asian trade
  > Actual title: dollar regains ground in asian trade

  "the final results of iraq 's december general elections are due within the next four days , a member of the iraqi electoral commission said on thursday ."
  > Model output: iraqi election results due in next four days
  > Actual title: iraqi election final results out within four days

  "microsoft chairman bill gates late wednesday unveiled his vision of the digital lifestyle , outlining the latest version of his windows operating system to be launched later this year ."
  > Model output: bill gates unveils new technology vision
  > Actual title: gates unveils microsoft 's vision of digital lifestyle

Pre-trained Model

To test with the pre-trained model, download pre_trained.zip and place it in the project root directory. Then run

  $ unzip pre_trained.zip
  $ python test.py