Project author: sagorbrur

Project description:
Bengali Language Model Tool using fastai's ULMFiT
Language: Python
Repository: git://github.com/sagorbrur/bnlm.git
Created: 2019-12-27T04:26:53Z
Project community: https://github.com/sagorbrur/bnlm

License: MIT License


Bengali Language Model

The Bengali language model is built with fastai's ULMFiT and is ready for prediction and classification tasks.

NB:

  • This tool mostly follows iNLTK.
  • The Bengali part is separated out here, with better evaluation results.

Installation

pip install bnlm

Dependencies

  • PyTorch >=1.0.0 and <=1.3.0
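
A quick way to confirm that an installed PyTorch falls inside this range is to compare its version string against the bounds. The sketch below is only illustrative: `is_supported_torch` and `parse_version` are hypothetical helpers, not part of bnlm, and the code assumes version strings of the form `major.minor.patch` with an optional suffix such as `+cpu`.

```python
import re

# Bounds taken from the dependency note above.
MIN_VERSION = (1, 0, 0)
MAX_VERSION = (1, 3, 0)


def parse_version(version):
    """Turn a string such as '1.2.0+cpu' into a tuple such as (1, 2, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError("unrecognised version string: %r" % version)
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def is_supported_torch(version):
    """Hypothetical helper: True if version lies within [1.0.0, 1.3.0]."""
    return MIN_VERSION <= parse_version(version) <= MAX_VERSION


print(is_supported_torch("1.2.0"))   # True
print(is_supported_torch("1.4.0"))   # False
```

With PyTorch installed, the check would be called as `is_supported_torch(torch.__version__)`.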

Evaluation Result

Language Model

  • Accuracy: 48.26% on the validation dataset
  • Perplexity: ~22.79
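
For context, perplexity is conventionally defined as the exponential of the mean per-token cross-entropy loss (measured in nats). The sketch below shows that relationship only. It assumes this standard definition and makes no claim about how the figure above was actually computed; the function names are illustrative.

```python
import math


def perplexity_from_loss(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity):
    """Inverse relation: the mean cross-entropy implied by a perplexity."""
    return math.log(perplexity)


# Under this definition, a perplexity of ~22.79 corresponds to a mean
# cross-entropy of roughly 3.13 nats per token.
implied_loss = loss_from_perplexity(22.79)
print(round(implied_loss, 2))
```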

Features and API

Download pretrained Model

To start, first download the pretrained language model and the SentencePiece model:

    from bnlm.bnlm import download_models
    download_models()
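
The later examples refer to a `model` directory and to `model/bn_spm.model`. A small guard like the one below can tell whether that file is already present before downloading again. This is a sketch under assumptions: `sentencepiece_model_present` is a hypothetical helper, and the idea that `download_models()` writes into `model` is inferred from the paths used in the examples, not confirmed here.

```python
import os


def sentencepiece_model_present(model_dir="model", sp_name="bn_spm.model"):
    """Hypothetical helper: True if the SentencePiece file exists in model_dir.

    The default names mirror the paths used in the examples in this README.
    """
    return os.path.isfile(os.path.join(model_dir, sp_name))


if not sentencepiece_model_present():
    # Assumption: download_models() places its files under ./model.
    # from bnlm.bnlm import download_models
    # download_models()
    print("SentencePiece model not found; download the models first.")
```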

Predict N Words

predict_n_words takes three parameters as input:

  • input_sen (your incomplete input text)
  • N (number of words to predict)
  • model_path (pretrained model path)

    from bnlm.bnlm import BengaliTokenizer
    from bnlm.bnlm import predict_n_words

    model_path = 'model'  # pretrained model path
    input_sen = "আমি বাজারে"
    # Predict the next 3 words for the incomplete input text
    output = predict_n_words(input_sen, 3, model_path)
    print("Word Prediction: ", output)

Get Sentence Encoding

    from bnlm.bnlm import BengaliTokenizer
    from bnlm.bnlm import get_sentence_encoding

    model_path = 'model'  # pretrained model path
    sp_model = "model/bn_spm.model"  # pretrained SentencePiece model
    input_sentence = "আমি ভাত খাই।"
    encoding = get_sentence_encoding(input_sentence, model_path, sp_model)
    print("sentence encoding is: ", encoding)

Get Embedding Vectors

    from bnlm.bnlm import BengaliTokenizer
    from bnlm.bnlm import get_embedding_vectors

    model_path = 'model'  # pretrained model path
    sp_model = "model/bn_spm.model"  # pretrained SentencePiece model
    input_sentence = "আমি ভাত খাই।"
    embed = get_embedding_vectors(input_sentence, model_path, sp_model)
    print("sentence embedding is : ", embed)

Sentence Similarity

    from bnlm.bnlm import BengaliTokenizer
    from bnlm.bnlm import get_sentence_similarity

    model_path = 'model'  # pretrained model path
    sp_model = "model/bn_spm.model"  # pretrained SentencePiece model
    sentence_1 = "সে খুব করে কথা বলে।"
    sentence_2 = "তার কথা খুবেই মিষ্টি।"
    sim = get_sentence_similarity(sentence_1, sentence_2, model_path, sp_model)
    print("Similarity is: %0.2f" % sim)
    # Output: 0.72
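
A common way to score the similarity of two sentences is the cosine of the angle between their encoding vectors. The sketch below illustrates that measure on plain Python lists. It is an assumption that `get_sentence_similarity` works this way, since the measure is not stated here, and `cosine_similarity` is an illustrative name, not a bnlm function.

```python
import math


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two sentence encodings.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```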

Get Similar Sentences

get_similar_sentences takes four parameters:

  • input sentence
  • N (number of sentences you want to predict)
  • model_path (pretrained model path)
  • sp_model (pretrained SentencePiece model)

    from bnlm.bnlm import BengaliTokenizer
    from bnlm.bnlm import get_similar_sentences

    model_path = 'model'  # pretrained model path
    sp_model = "model/bn_spm.model"  # pretrained SentencePiece model
    input_sentence = "আমি বাংলায় গান গাই।"
    # Request 3 sentences similar to the input
    sen_pred = get_similar_sentences(input_sentence, 3, model_path, sp_model)
    print(sen_pred)
    # output: ['আমি বাংলায় গান গাই ।', 'আমি ইংরেজিতে গান গাই।', 'আমি বাংলায় গানও গাই।']

Classification

Upcoming

Training

To train with your own corpus, follow this repository

Contributor