Sense Tagged Instances For Finnish
This repository contains code to automatically create a tagged sense corpus
from OpenSubtitles2018. It also contains a lot of corpora wrangling code, most
notably code to convert (the CC-NC licensed) EuroSense into a format usable by
finn-wsd-eval.
You will need HFST and OMorFi installed globally before beginning, because
neither is currently installable from PyPI. You will also need poetry.
You can then run:
$ ./install.sh
There is a Makefile. Reading its source is a recommended next step after this
README. It has variables for most file paths, all of which have defaults and
can be overridden. Overriding them is convenient when supplying intermediate
steps or upstream corpora yourself, when you want outputs in a particular
place, and when running with Docker, where you may want to use bind mounts to
make these files appear on the host.
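As a rough illustration of overriding Makefile variables, the sketch below builds a make invocation with VAR=value overrides given on the command line, which is the standard way to override a variable that has a default. STIFF_RAW is a hypothetical variable name used only for illustration; the real names are in the Makefile.

```python
import shlex


def make_command(target, overrides=None):
    """Build an argv list of the form: make TARGET VAR=value ..."""
    argv = ["make", target]
    # Sort so that the generated command is stable between runs.
    for name, value in sorted((overrides or {}).items()):
        argv.append(f"{name}={value}")
    return argv


# STIFF_RAW is a hypothetical variable name, not taken from the Makefile.
cmd = make_command("wsd-eval", {"STIFF_RAW": "/data/stiff.raw.xml.zst"})
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run, or the printed line can be pasted into a shell.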
You can make the data needed for finn-wsd-eval by running::
make wsd-eval
which will make the STIFF and EuroSense WSD evaluation corpora, including
trying to fetch all dependencies. However,
stiff.raw.xml.zst
downloaded from here (TODO).
BABELWNMAP
You will next need to set the environment variable BABELWNMAP as the path to a
TSV mapping from BabelNet synsets to WordNet synsets. You can either:
Run::
make corpus-eval
Both of the following pipelines first create a corpus tagged in the unified
format, which consists of an XML file and a key file, and then create a
directory consisting of the files needed by finn-wsd-eval.
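To make the role of the key file more concrete, here is a minimal sketch of reading one. It assumes the usual convention of the unified WSD evaluation format: one instance per line, an instance ID followed by one or more whitespace-separated sense labels. That layout, the function name and the sample labels are assumptions for illustration, not something taken from this repository's code.

```python
def parse_key_lines(lines):
    """Map each instance ID to its list of sense labels.

    Assumes lines of the form: INSTANCE_ID LABEL [LABEL ...]
    """
    gold = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        instance_id, *labels = fields
        gold[instance_id] = labels
    return gold


# Made-up sample lines; real key files will use real sense identifiers.
sample = ["d000.s000.t000 sense-a", "", "d000.s000.t001 sense-b sense-c"]
print(parse_key_lines(sample))
```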
poetry run python scripts/fetch_opensubtitles2018.py cmn-fin
poetry run python scripts/pipeline.py mk-stiff cmn-fin stiff.raw.xml.zst
poetry run python scripts/variants.py proc bilingual-precision-4 stiff.raw.xml.zst stiff.bp4.xml.zst
./stiff2unified.sh stiff.bp4.xml.zst stiff.unified.bp4.xml stiff.unified.bp4.key
You will first need to obtain EuroSense. Since there are some language tagging
issues with the original, I currently recommend you use a version I have
attempted to fix.
You will next need to set the environment variable BABEL2WN_MAP as the path to a TSV
mapping from BabelNet synsets to WordNet synsets. You can either:
Then run:
poetry run python scripts/pipeline.py eurosense2unified \
/path/to/eurosense.v1.0.high-precision.xml eurosense.unified.sample.xml \
eurosense.unified.sample.key
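Since the conversion depends on BABEL2WN_MAP, it can be worth checking the variable before starting a long run. The following is a small sketch of such a check; the helper name is hypothetical, and it only verifies that the variable is set and points at an existing file, not that the TSV contents are valid.

```python
import os
from pathlib import Path


def babel2wn_map_path(environ=None):
    """Return the path in BABEL2WN_MAP, or raise if it is unusable."""
    environ = os.environ if environ is None else environ
    value = environ.get("BABEL2WN_MAP")
    if not value:
        raise RuntimeError("BABEL2WN_MAP is not set")
    path = Path(value)
    if not path.is_file():
        raise FileNotFoundError(f"BABEL2WN_MAP points to a missing file: {path}")
    return path
```

Calling babel2wn_map_path() with no arguments checks the real environment; passing a dict makes the check easy to test.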
First obtain finn-man-ann.
Then run:
poetry run python scripts/munge.py man-ann-select --source=europarl \
../finn-man-ann/ann.xml - \
| poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 \
../finn-man-ann/ann.xml man-ann-opensubs18.xml
This makes a directory usable by finn-wsd-eval.
poetry run python scripts/pipeline.py unified-to-eval \
/path/to/stiff-or-eurosense.unified.xml /path/to/stiff-or-eurosense.unified.key \
stiff-or-eurosense.eval/
TODO: STIFF
poetry run python scripts/filter.py tok-span-dom man-ann-europarl.xml \
man-ann-europarl.filtered.xml
poetry run python scripts/pipeline.py stiff2unified --eurosense \
man-ann-europarl.filtered.xml man-ann-europarl.uni.xml man-ann-europarl.uni.key
poetry run python scripts/pipeline.py stiff2unified man-ann-opensubs18.xml \
man-ann-opensubs18.uni.xml man-ann-opensubs18.key.xml
poetry run python scripts/pipeline.py unified-auto-man-to-evals \
eurosense.unified.sample.xml man-ann-europarl.uni.xml \
eurosense.unified.sample.key man-ann-europarl.uni.key eurosense.eval
First process finn-man-ann.
poetry run python scripts/variants.py eval /path/to/stiff.raw.zst stiff-eval-out
poetry run python scripts/eval.py pr-eval --score=tok <(poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 /path/to/finn-man-ann/ann.xml -) stiff-eval-out stiff-eval.csv
poetry run python scripts/munge.py man-ann-select --source=europarl /path/to/finn-man-ann/ann.xml - | poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
mkdir eurosense-pr
mv /path/to/eurosense/high-precision.xml eurosense-pr/EP.xml
mv /path/to/eurosense/high-coverage.xml eurosense-pr/EC.xml
poetry run python scripts/eval.py pr-eval --score=tok man-ann-europarl.xml eurosense-pr europarl.csv
Warning, plot may be misleading…
poetry run python scripts/eval.py pr-plot stiff-eval.csv europarl.csv
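The commands above evaluate precision and recall against the manual annotations. As a reminder of what these two measures mean at token level, here is a minimal sketch. It is not the scoring code from scripts/eval.py; representing annotations as (token ID, sense) pairs is an assumption made for illustration.

```python
def precision_recall(gold, predicted):
    """Return (precision, recall) for two collections of (token_id, sense) pairs."""
    gold = set(gold)
    predicted = set(predicted)
    true_positives = len(gold & predicted)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up example: three gold annotations, two predictions, one of them correct.
gold = {("t1", "s1"), ("t2", "s2"), ("t3", "s3")}
predicted = {("t1", "s1"), ("t2", "sX")}
print(precision_recall(gold, predicted))
```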
For help using the tools, try running them with --help. The main entry points
are in scripts.
scripts/tag.py: Produce an unfiltered STIFF
scripts/filter.py: Filter STIFF according to various criteria
scripts/munge.py: Convert between different corpus/stream formats
scripts/stiff2unified.sh: Convert from STIFF format to the unified format
scripts/pipeline.py: Various pipelines composing multiple layers of filtering/conversion
The Makefile and Makefile.manann.