Project author: finalfusion
Project description: Finalfusion embeddings in Python
Language: Python
Repository: git://github.com/finalfusion/finalfusion-python.git
Created: 2018-09-19T20:42:00Z
Project community: https://github.com/finalfusion/finalfusion-python
License: Other


finalfusion-python

Introduction

finalfusion is a Python package for reading, writing, and using
finalfusion embeddings. It also
supports other commonly used embedding formats such as fastText, GloVe, and
word2vec.

The Python package supports the same types of embeddings as the
finalfusion-rust crate:

  • Vocabulary:
    • No subwords
    • Subwords
  • Embedding matrix:
    • Array
    • Memory-mapped
    • Quantized
  • Norms
  • Metadata

Installation

The finalfusion module is
available on PyPI for Linux,
macOS, and Windows. You can use pip to install the module:

  $ pip install --upgrade finalfusion

Installing from source

Building from source depends on Cython. If you install the package using
pip, you don’t need to install the dependency explicitly, since it is
specified in pyproject.toml.

  $ git clone https://github.com/finalfusion/finalfusion-python
  $ cd finalfusion-python
  $ pip install .

If you want to build wheels from source, wheel needs to be installed.
It’s then possible to build wheels through:

  $ python setup.py bdist_wheel

The wheels can be found in the dist directory.

Package Usage

Basic usage

  import finalfusion
  # loading from different formats
  w2v_embeds = finalfusion.load_word2vec("/path/to/w2v.bin")
  text_embeds = finalfusion.load_text("/path/to/embeds.txt")
  text_dims_embeds = finalfusion.load_text_dims("/path/to/embeds.dims.txt")
  fasttext_embeds = finalfusion.load_fasttext("/path/to/fasttext.bin")
  fifu_embeds = finalfusion.load_finalfusion("/path/to/embeddings.fifu")
  # serialization to other formats works similarly
  finalfusion.compat.write_word2vec("to_word2vec.bin", fifu_embeds)
  # embedding lookup
  embedding = fifu_embeds["Test"]
  # reading an embedding into a buffer
  import numpy as np
  buffer = np.zeros(fifu_embeds.storage.shape[1], dtype=np.float32)
  fifu_embeds.embedding("Test", out=buffer)
  # similarity and analogy queries
  sim_query = fifu_embeds.word_similarity("Test")
  analogy_query = fifu_embeds.analogy("A", "B", "C")
  # accessing the vocab and printing the first 10 words
  vocab = fifu_embeds.vocab
  print(vocab.words[:10])
  # SubwordVocabs give access to the subword indexer:
  subword_indexer = vocab.subword_indexer
  print(subword_indexer.subword_indices("Test", with_ngrams=True))
  # computing the dot product of the storage with an embedding
  res = fifu_embeds.storage.dot(embedding)
  # printing metadata
  print(fifu_embeds.metadata)
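The similarity and analogy queries in the snippet above return ranked result lists rather than raw arrays. The sketch below assumes, per the package documentation, that each result exposes the neighbour word and its similarity score; the path is a placeholder:

```python
import finalfusion

embeds = finalfusion.load_finalfusion("/path/to/embeddings.fifu")

# word_similarity returns the k nearest neighbours of the query word;
# each result carries the neighbour word and its similarity score
for result in embeds.word_similarity("Test", k=5):
    print(result.word, result.similarity)
```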

Beyond Embeddings

  # load only a vocab from a finalfusion file
  from finalfusion import load_vocab
  vocab = load_vocab("/path/to/finalfusion_file.fifu")
  # serialize vocab to single file
  vocab.write("/path/to/vocab_file.fifu.voc")
  # more specific loading functions exist
  from finalfusion.vocab import load_finalfusion_bucket_vocab
  fifu_bucket_vocab = load_finalfusion_bucket_vocab("/path/to/vocab_file.fifu.voc")

The package supports loading and writing all finalfusion chunks this way.
Such standalone chunk files are only supported by the Python package;
reading them will fail with other implementations, e.g. finalfusion-rust.
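By analogy with load_vocab, the other chunk types have their own module-level loaders. The function names below are taken from the package documentation but should be treated as assumptions and verified against the API reference; the paths are placeholders:

```python
# each chunk type has a dedicated loader in its own module
from finalfusion.storage import load_storage
from finalfusion.norms import load_norms
from finalfusion.metadata import load_metadata

storage = load_storage("/path/to/embeddings.fifu")
# these raise an error if the file contains no such chunk
norms = load_norms("/path/to/embeddings.fifu")
metadata = load_metadata("/path/to/embeddings.fifu")
```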

Scripts

finalfusion also includes a conversion script, ffp-convert, to convert
between the supported formats.

  # convert from fastText format to finalfusion
  $ ffp-convert -f fasttext fasttext.bin -t finalfusion embeddings.fifu

ffp-bucket-to-explicit can be used to convert bucket embeddings to embeddings
with an explicit ngram lookup.

  # convert finalfusion bucket embeddings to explicit
  $ ffp-bucket-to-explicit -f finalfusion embeddings.fifu explicit.fifu

ffp-select generates new embedding files based on some embeddings and a word
list. Using ffp-select with embeddings with a simple vocab results in a
subset of the original embeddings. With subword embeddings, vectors for unknown
words in the word list are computed and added to the new embeddings. The
resulting embeddings cannot provide representations for OOV words anymore.
The new vocabulary covers only the words in the word list.
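For a simple vocab, the effect of ffp-select can be approximated in Python by keeping only the rows for the listed words and rebuilding the embeddings. This is a rough, hypothetical sketch, not the script's actual implementation; it assumes embedding() accepts a default for unrepresentable words and that Embeddings objects have a write method, as described in the package docs:

```python
import numpy as np
import finalfusion
from finalfusion import Embeddings
from finalfusion.storage import NdArray
from finalfusion.vocab import SimpleVocab

embeds = finalfusion.load_finalfusion("/path/to/large-embeddings.fifu")
words = ["the", "quick", "fox"]  # stand-in for the contents of words.txt

# for subword embeddings, embedding() also computes vectors for OOV words;
# words that cannot be represented at all are skipped
kept = [(w, embeds.embedding(w, default=None)) for w in words]
kept = [(w, e) for w, e in kept if e is not None]

# the subset has a simple vocab, so it no longer covers OOV words
subset = Embeddings(
    storage=NdArray(np.stack([e for _, e in kept]).astype(np.float32)),
    vocab=SimpleVocab([w for w, _ in kept]),
)
subset.write("/path/to/subset-embeddings.fifu")
```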

  $ ffp-select large-embeddings.fifu subset-embeddings.fifu words.txt

Finally, the package comes with ffp-similar and ffp-analogy for
similarity and analogy queries.

  # get the 5 nearest neighbours of "Tübingen"
  $ echo Tübingen | ffp-similar embeddings.fifu
  # get the top 5 answers for "Tübingen" is to "Stuttgart" as "Heidelberg" is to ...
  $ echo Tübingen Stuttgart Heidelberg | ffp-analogy embeddings.fifu

Where to go from here