Project author: frankier

Project description: Sense Tagged Instances For Finnish
Primary language: Python
Project address: git://github.com/frankier/STIFF.git
Created: 2018-05-16T11:51:40Z
Project community: https://github.com/frankier/STIFF

License: Apache License 2.0

STIFF - Sense Tagged Instances For Finnish



This repository contains code to automatically create a tagged sense corpus
from OpenSubtitles2018. It also contains a good deal of corpus-wrangling code,
most notably code to convert (the CC-NC licensed)
EuroSense into a format usable by
finn-wsd-eval.

Set up

You will need HFST and OMorFi installed globally before beginning, since
neither is currently installable from PyPI. You will also need poetry. You can
then run::

    $ ./install.sh
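
If you want to sanity-check the prerequisites before running the script,
something along these lines works (checking for the omorfi Python module is an
assumption about how OMorFi ends up installed on your system):

    # confirm the globally installed prerequisites are visible
    command -v hfst-lookup >/dev/null || echo "HFST tools not found on PATH"
    command -v poetry >/dev/null || echo "poetry not found on PATH"
    python3 -c "import omorfi" 2>/dev/null || echo "omorfi module not importable"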

(Only partially tested) Conversion pipelines and evaluation using the Makefile

There is a Makefile. Reading its source is a recommended next step after this
README. It has variables for most file paths, each with a default that you can
override. Overriding them is convenient when you supply intermediate
steps/upstream corpora yourself, when you want outputs in a particular place,
or when running under Docker, where you may want to use bind mounts to make
the aforementioned files appear on the host.
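
Variables can be overridden on the make command line in the usual way. For
example (the variable name here is hypothetical; check the Makefile for the
real ones):

    # WORK is a made-up variable name used only for illustration
    make wsd-eval WORK=/data/stiff-work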

Make STIFF or EuroSense into data for finn-wsd-eval

You can make the data needed for
finn-wsd-eval by running::

    make wsd-eval

which will make the STIFF and EuroSense WSD evaluation corpora, including
trying to fetch all dependencies. However,

  1. It will take a long time. The longest step is building STIFF from scratch,
    which can take around two weeks. To speed things up you can supply a premade
    stiff.raw.xml.zst downloaded from here (TODO).
  2. It will not fetch one dependency with restrictions upon it: BABELWNMAP.

Obtaining BABELWNMAP

You will next need to set the environment variable BABELWNMAP to the path of a
TSV file mapping BabelNet synsets to WordNet synsets. You can do one of the
following:

  1. Obtain the BabelNet indices by following these
    instructions
    and dump out the TSV by
    following the instructions at https://github.com/frankier/babelnet-lookup
  2. If you are affiliated with a research institution, I have permission to send
    you the TSV file, but you must send me a direct communication from your
    institutional email address. (Please briefly state your position/affiliation
    and non-commercial research use in the email so there is a record.)
  3. Alternatively (subject to the same conditions), if you prefer, I can just
    send you eurosense.unified.sample.xml and eurosense.unified.sample.key.
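
Once you have the TSV, export the variable before invoking make. The path below
is only a placeholder, and the exact column layout depends on how you dumped
the mapping with babelnet-lookup:

    export BABELWNMAP=/path/to/bn2wn.tsv
    make wsd-eval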

Make STIFF and EuroSense P/R plot

Run::

    make corpus-eval

(OLD) Example conversion pipelines and evaluation

Both of the following pipelines first create a corpus tagged in the unified
format, which consists of an XML file and a key file, and then create a
directory containing the files needed by finn-wsd-eval.

STIFF Pipeline

Fetch OpenSubtitles2018

    poetry run python scripts/fetch_opensubtitles2018.py cmn-fin

Make raw STIFF

    poetry run python scripts/pipeline.py mk-stiff cmn-fin stiff.raw.xml.zst
    poetry run python scripts/variants.py proc bilingual-precision-4 stiff.raw.xml.zst stiff.bp4.xml.zst
    ./stiff2unified.sh stiff.bp4.xml.zst stiff.unified.bp4.xml stiff.unified.bp4.key
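
The intermediate corpora are Zstandard-compressed XML, so if you want a quick
look at one of them (assuming the standard zstd command line tools are
installed), something like this works:

    zstdcat stiff.bp4.xml.zst | head -n 20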

EuroSense Pipeline

EuroSense ➡️ Unified

You will first need to obtain EuroSense. Since there are some language tagging
issues with the original, I currently recommend you use a version I have
attempted to fix.

You will next need to set the environment variable BABEL2WN_MAP to the path of
a TSV file mapping BabelNet synsets to WordNet synsets. You can do one of the
following:

  1. Obtain the BabelNet indices by following these
    instructions
    and dump out the TSV by
    following the instructions at https://github.com/frankier/babelnet-lookup
  2. If you are affiliated with a research institution, I have permission to send
    you the TSV file, but you must send me a direct communication from your
    institutional email address. (Please briefly state your position/affiliation
    and non-commercial research use in the email so there is a record.)
  3. Alternatively (subject to the same conditions), if you prefer, I can just
    send you eurosense.unified.sample.xml and eurosense.unified.sample.key.
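
Once you have the TSV (however obtained), point the environment variable at it
before running the command below; the path here is only a placeholder:

    export BABEL2WN_MAP=/path/to/bn2wn.tsv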

Then run:

    poetry run python scripts/pipeline.py eurosense2unified \
      /path/to/eurosense.v1.0.high-precision.xml eurosense.unified.sample.xml \
      eurosense.unified.sample.key

Process finn-man-ann

First obtain finn-man-ann.

Then run:

    poetry run python scripts/munge.py man-ann-select --source=europarl \
      ../finn-man-ann/ann.xml - \
      | poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
    poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 \
      ../finn-man-ann/ann.xml man-ann-opensubs18.xml

Make STIFF or EuroSense into data for finn-wsd-eval

This makes a directory usable by
finn-wsd-eval.

Old

    poetry run python scripts/pipeline.py unified-to-eval \
      /path/to/stiff-or-eurosense.unified.xml /path/to/stiff-or-eurosense.unified.key \
      stiff-or-eurosense.eval/

New

TODO: STIFF

    poetry run python scripts/filter.py tok-span-dom man-ann-europarl.xml \
      man-ann-europarl.filtered.xml
    poetry run python scripts/pipeline.py stiff2unified --eurosense \
      man-ann-europarl.filtered.xml man-ann-europarl.uni.xml man-ann-europarl.uni.key
    poetry run python scripts/pipeline.py stiff2unified man-ann-opensubs18.xml \
      man-ann-opensubs18.uni.xml man-ann-opensubs18.key.xml
    poetry run python scripts/pipeline.py unified-auto-man-to-evals \
      eurosense.unified.sample.xml man-ann-europarl.uni.xml \
      eurosense.unified.sample.key man-ann-europarl.uni.key eurosense.eval

Make STIFF and EuroSense P/R plot

First process finn-man-ann.

Gather STIFF eval data

    poetry run python scripts/variants.py eval /path/to/stiff.raw.zst stiff-eval-out
    poetry run python scripts/eval.py pr-eval --score=tok <(poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 /path/to/finn-man-ann/ann.xml -) stiff-eval-out stiff-eval.csv
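
The second command uses bash process substitution (<(...)). If your shell lacks
it, an equivalent using only commands shown elsewhere in this README is to
write the selection to a file first:

    poetry run python scripts/munge.py man-ann-select --source=OpenSubtitles2018 \
      /path/to/finn-man-ann/ann.xml man-ann-opensubs18.xml
    poetry run python scripts/eval.py pr-eval --score=tok man-ann-opensubs18.xml \
      stiff-eval-out stiff-eval.csv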

Gather EuroSense eval data

    poetry run python scripts/munge.py man-ann-select --source=europarl /path/to/finn-man-ann/ann.xml - | poetry run python scripts/munge.py lemma-to-synset - man-ann-europarl.xml
    mkdir eurosense-pr
    mv /path/to/eurosense/high-precision.xml eurosense-pr/EP.xml
    mv /path/to/eurosense/high-coverage.xml eurosense-pr/EC.xml
    poetry run python scripts/eval.py pr-eval --score=tok man-ann-europarl.xml eurosense-pr europarl.csv

Plot on common axis

Warning: the plot may be misleading…

    poetry run python scripts/eval.py pr-plot stiff-eval.csv europarl.csv

Organisation & usage

For help using the tools, try running them with --help. The main entry points
are in scripts/.
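
For example, to see the subcommands of a couple of the entry points:

    poetry run python scripts/munge.py --help
    poetry run python scripts/pipeline.py --help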

Innards

  • scripts/tag.py: Produce an unfiltered STIFF
  • scripts/filter.py: Filter STIFF according to various criteria
  • scripts/munge.py: Convert between different corpus/stream formats

Wrappers

  • scripts/stiff2unified.sh: Convert from STIFF format to the unified format
  • scripts/pipeline.py: Various pipelines composing multiple layers of filtering/conversion

Top level

The Makefile and Makefile.manann.