Extract Sentence Embeddings from Hugging Face pre-trained models.
This repo contains code for both TensorFlow and PyTorch. It extracts sentence embeddings for a dataset using any pre-trained Hugging Face model.
Out-of-the-box embeddings work well for some tasks and poorly for others.
If you want to train or fine-tune on your own dataset, check out sentence-transformers.
The extracted embeddings can be used for semantic similarity search, clustering, and similar tasks.
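As a rough illustration of the similarity-search use case, the sketch below ranks embeddings by cosine similarity. The function names and the toy vectors are hypothetical stand-ins, not part of this repo; in practice the rows would be sentence embeddings produced by a model.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return pairwise cosine similarities between the rows of `embeddings`."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Clip the norms to avoid dividing by zero for an all-zero row.
    normalized = embeddings / np.clip(norms, 1e-12, None)
    return normalized @ normalized.T

def most_similar(query_index, embeddings):
    """Return the index of the row closest to row `query_index`, excluding itself."""
    scores = cosine_similarity_matrix(embeddings)[query_index].copy()
    scores[query_index] = -np.inf  # never return the query itself
    return int(np.argmax(scores))

# Toy 3-dimensional vectors standing in for real sentence embeddings (assumption).
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
])
nearest = most_similar(0, toy_embeddings)
```

Cosine similarity is one common choice for comparing embeddings; other distance measures may suit a given task better.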
The code works in the following way:
1) Load the model and its respective tokenizer.
2) Tokenize the sentences.
3) Get the token embeddings.
4) Convert the token embeddings into a single sentence embedding[1].
[1]. There are many techniques to convert token embeddings into a sentence embedding; mean pooling (averaging the token embeddings while ignoring padding) is among the most widely used and effective.
Benchmarks using SentEval are coming soon.
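The four steps above can be sketched as follows. This is a minimal illustration, not the repo's actual code: the function names and the default model name `bert-base-uncased` are assumptions. The pooling step is written in NumPy so that it applies to token embeddings from either framework, while the hypothetical `embed_sentences` helper uses the PyTorch classes from `transformers`.

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings over real (non-padding) tokens.

    token_embeddings: array of shape (batch, seq_len, hidden)
    attention_mask:   array of shape (batch, seq_len), 1 = token, 0 = padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float32)
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]
    summed = (token_embeddings * mask).sum(axis=1)
    # Clip the token counts so an all-padding row does not divide by zero.
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def embed_sentences(sentences, model_name="bert-base-uncased"):
    """Hypothetical helper covering steps 1-4; the model name is an assumption."""
    # Imported lazily so mean_pooling can be used without these packages.
    import torch
    from transformers import AutoModel, AutoTokenizer

    # 1) Load the model and its tokenizer.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # 2) Tokenize the sentences, padding them to a common length.
    encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    # 3) Get the token embeddings from the last hidden layer.
    with torch.no_grad():
        output = model(**encoded)
    token_embeddings = output.last_hidden_state.numpy()

    # 4) Pool the token embeddings into one vector per sentence.
    return mean_pooling(token_embeddings, encoded["attention_mask"].numpy())
```

Calling `embed_sentences(["A first sentence.", "A second one."])` would return one row per sentence. Using the attention mask in the average matters because padded positions would otherwise pull the sentence vector toward the padding embedding.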
This repo is inspired by sentence-transformers. The PyTorch code is from their repo.