Natural Language EnCoder-Decoder: word, char, bpe etc
📕 Docs: https://isi-nlp.github.io/nlcodec
A set of low-level Natural Language Encoder-Decoders (codecs) that are useful in the preprocessing stage of an
NLP pipeline. These codecs encode sequences into units such as words, characters, or BPE subwords.
It provides a Python API (so you can embed it into your app) and a CLI (so you can use it as a standalone tool).
There are many BPE implementations available already, but this one differs:
less or cut. It includes information on which pieces were put together and with what frequencies.
Please run only one of these:
# Install from pypi (preferred)
pip install nlcodec --ignore-installed

# Clone repo for development mode
git clone https://github.com/isi-nlp/nlcodec
cd nlcodec
pip install --editable .
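To illustrate the merge bookkeeping mentioned earlier (which pieces were put together, and with what frequencies), here is a toy sketch of a BPE learner that records each merge alongside its pair count. It is not nlcodec's implementation or API: the names `learn_bpe` and `merge_pair`, the input format, and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple


def merge_pair(seq: Tuple[str, ...], left: str, right: str) -> Tuple[str, ...]:
    """Replace every adjacent (left, right) occurrence in seq with one merged piece."""
    out: List[str] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == left and seq[i + 1] == right:
            out.append(left + right)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs: Dict[str, int], num_merges: int) -> List[Tuple[str, str, int]]:
    """Toy BPE learner (hypothetical helper, not the nlcodec API).

    Returns a list of (left_piece, right_piece, pair_frequency) records,
    one per merge, in the order the merges were learned.
    """
    # Start with every word split into single characters.
    seqs: Dict[Tuple[str, ...], int] = {tuple(w): f for w, f in word_freqs.items()}
    merges: List[Tuple[str, str, int]] = []
    for _ in range(num_merges):
        # Count adjacent pairs, weighted by word frequency.
        pair_counts: Counter = Counter()
        for seq, freq in seqs.items():
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties go to the lexicographically
        # smallest pair (an assumption, chosen only for determinism).
        (left, right), count = min(pair_counts.items(), key=lambda kv: (-kv[1], kv[0]))
        merges.append((left, right, count))
        # Apply the merge to every word.
        new_seqs: Dict[Tuple[str, ...], int] = {}
        for seq, freq in seqs.items():
            merged = merge_pair(seq, left, right)
            new_seqs[merged] = new_seqs.get(merged, 0) + freq
        seqs = new_seqs
    return merges


if __name__ == "__main__":
    toy_corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
    for left, right, count in learn_bpe(toy_corpus, num_merges=3):
        print(f"{left} + {right} -> {left + right} (pair frequency {count})")
```

Keeping the two source pieces and the pair frequency next to each merged piece is what makes a vocabulary of this kind easy to inspect with ordinary text tools.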
The pip installer registers these CLI tools in your PATH:
nlcodec: CLI for learn, encode, decode. Same as python -m nlcodec
nlcodec-learn: CLI for learning BPE, with PySpark backend. Same as python -m nlcodec.learn
nlcodec-db: CLI for bitextdb. Same as python -m nlcodec.bitextdb
nlcodec-freq: CLI for extracting word and char frequencies using Spark backend.
Docs are available at
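After installation, a quick way to confirm that these tools were registered is to look them up on PATH. The sketch below uses only the Python standard library; the tool names come from the list above, while the helper name `find_cli_tools` is hypothetical and not part of nlcodec.

```python
import shutil
from typing import Dict, Iterable, Optional

# Tool names as listed in this README.
CLI_TOOLS = ("nlcodec", "nlcodec-learn", "nlcodec-db", "nlcodec-freq")


def find_cli_tools(names: Iterable[str] = CLI_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_cli_tools().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

If a tool is reported as missing, the module form mentioned above (for example python -m nlcodec) may still work, since it does not depend on the script directory being on PATH.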
Refer to https://arxiv.org/abs/2104.00290
To-appear: ACL 2021 Demos
@article{DBLP:journals/corr/abs-2104-00290,
  author        = {Thamme Gowda and Zhao Zhang and Chris A. Mattmann and Jonathan May},
  title         = {Many-to-English Machine Translation Tools, Data, and Pretrained Models},
  journal       = {CoRR},
  volume        = {abs/2104.00290},
  year          = {2021},
  url           = {https://arxiv.org/abs/2104.00290},
  archivePrefix = {arXiv},
  eprint        = {2104.00290},
  timestamp     = {Mon, 12 Apr 2021 16:14:56 +0200},
  biburl        = {https://dblp.org/rec/journals/corr/abs-2104-00290.bib},
  bibsource     = {dblp computer science bibliography, https://dblp.org}
}