Author: najafmurtaza

Description:
Extract Sentence Embeddings from Hugging Face pre-trained models.
Language: Python
Repository: git://github.com/najafmurtaza/General_Sentence_Embeddings.git
Created: 2020-08-15T22:39:22Z
Project page: https://github.com/najafmurtaza/General_Sentence_Embeddings

License: BSD 3-Clause "New" or "Revised" License


General Sentence Embeddings

Extract Sentence Embeddings from Hugging Face pre-trained models.

This repo contains code for both TensorFlow and PyTorch. It extracts sentence embeddings for your dataset using any pre-trained Hugging Face model.
Out-of-the-box embeddings work well for some tasks and poorly for others.
If you want to train or fine-tune on your own dataset, check out sentence-transformers.

These embeddings can be used for semantic similarity search, clustering, and similar tasks.
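As an illustration of the similarity-search use case, here is a minimal sketch that ranks a few corpus embeddings against a query embedding by cosine similarity. It is not code from this repo: the helper name `cosine_similarity` and the toy 2-dimensional vectors are assumptions for the example, and real sentence embeddings would have hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(query, corpus):
    """Cosine similarity between one query vector and each row of corpus (hypothetical helper)."""
    query = np.asarray(query, dtype=np.float64)
    corpus = np.asarray(corpus, dtype=np.float64)
    # Normalize to unit length so the dot product equals the cosine of the angle
    query_unit = query / np.linalg.norm(query)
    corpus_unit = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return corpus_unit @ query_unit

# Toy stand-ins for sentence embeddings (assumed values, not model output)
corpus_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_embedding = [2.0, 0.0]

scores = cosine_similarity(query_embedding, corpus_embeddings)
# Indices of corpus sentences, most similar first
ranking = np.argsort(-scores)
print(scores, ranking)
```

Because the vectors are normalized first, the magnitude of an embedding does not affect its rank; only its direction does.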

Dependencies

  • tensorflow 2.0.0
  • pytorch 1.6.0
  • transformers 3.0.2

Working

The code works in the following way:
1) Load the model and its respective tokenizer.
2) Tokenize the input sentences.
3) Get the token embeddings from the model.
4) Convert the token embeddings into a single embedding per sentence[1].

[1]. There are many techniques for converting token embeddings to sentence embeddings; mean pooling (averaging the token embeddings of each sentence) is generally considered the strongest simple choice.
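Step 4 can be sketched as follows. This is a minimal NumPy illustration of mean pooling under stated assumptions, not the repo's actual TensorFlow or PyTorch code: the function name `mean_pool` and the toy arrays are hypothetical. In practice `token_embeddings` would be the model's last hidden state and `attention_mask` would come from the tokenizer, so that padding tokens are excluded from the average.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors of each sentence, ignoring padding positions."""
    token_embeddings = np.asarray(token_embeddings, dtype=np.float64)
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = np.asarray(attention_mask, dtype=np.float64)[..., np.newaxis]
    # Sum only the real (unmasked) token vectors
    summed = (token_embeddings * mask).sum(axis=1)
    # Number of real tokens per sentence; clipped to avoid division by zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch (assumed values): 2 sentences, 3 token slots, hidden size 2
token_embeddings = [
    [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]],  # third slot is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],
]
attention_mask = [[1, 1, 0], [1, 1, 1]]

sentence_embeddings = mean_pool(token_embeddings, attention_mask)
print(sentence_embeddings)
```

Weighting by the attention mask matters when sentences in a batch have different lengths: a plain average over all slots would mix padding vectors into the shorter sentences.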

Benchmarks

Benchmarks using SentEval are coming soon.

Other repos for Sentence Embeddings

Note

This repo is inspired by sentence-transformers. The PyTorch code is taken from their repo.