Permutation algorithms to test statistical significance of experimental results.
We provide software for statistical significance testing. It was originally designed for standard IR evaluation, where each of one or more methods is represented by a vector of real-valued performance scores. However, it can be used to compare any equal-length series of performance measurements.
This utility consumes matrix input. Each row represents a single evaluation event, and each row element is an event-specific value of an effectiveness or efficiency metric, such as classification accuracy or retrieval time. In IR, commonly used metrics include ERR, NDCG, and MAP. We also provide a Python3 wrapper for this utility.
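As an illustration of such input, the sketch below builds a small score matrix in Python: one row per evaluation event (e.g., a query) and, by assumption, one column per compared method. The column-per-method layout, the tab-separated text format, and the scores themselves are made up for illustration; see SampleData for the format the utility actually expects.

```python
import numpy as np

# NDCG scores for 5 queries under two retrieval methods (made-up numbers)
method_a = np.array([0.62, 0.48, 0.71, 0.55, 0.69])
method_b = np.array([0.58, 0.51, 0.66, 0.49, 0.72])

# One row per evaluation event, one column per method (illustrative layout)
scores = np.column_stack([method_a, method_b])
np.savetxt("scores.tsv", scores, fmt="%.4f", delimiter="\t")
```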
Our software employs permutation algorithms both for unadjusted pair-wise significance testing and for testing with adjustment for multiple comparisons. The advantage of permutation algorithms is that they make relatively mild assumptions about the statistical nature of the data. In particular, they do not assume that observations are i.i.d. normal variables.
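As a minimal sketch of the idea (not the exact algorithm implemented by the utility; names and numbers are illustrative), the paired two-sided permutation test below randomly flips the signs of per-event score differences to estimate a p-value:

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, num_perm=10000, seed=0):
    """Two-sided paired (sign-flip) permutation test for the mean score difference."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    observed = abs(diff.mean())
    # Under the null hypothesis the sign of each per-event difference is arbitrary,
    # so we flip signs at random and count how often the permuted statistic is
    # at least as extreme as the observed one.
    hits = 0
    for _ in range(num_perm):
        signs = rng.choice((-1.0, 1.0), size=diff.size)
        if abs((signs * diff).mean()) >= observed:
            hits += 1
    return (hits + 1) / (num_perm + 1)

# Example: compare two methods on the same 5 queries (made-up scores)
p = paired_permutation_test([0.62, 0.48, 0.71, 0.55, 0.69],
                            [0.58, 0.51, 0.66, 0.49, 0.72])
print(f"p-value ~ {p:.3f}")
```

When several such pair-wise comparisons are run at once, the resulting p-values additionally require an adjustment for multiple comparisons; that is the problem addressed by the utility's adjusted testing mode and by the paper cited below.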
The code is released under the Apache License, Version 2.0 (http://www.apache.org/licenses/).
For technical/theoretical details see:
Leonid Boytsov, Anna Belova, and Peter Westfall. 2013.
Deciding on an Adjustment for Multiplicity in IR Experiments.
In Proceedings of SIGIR 2013.
If you use our software, please consider citing this paper.
EvalUtil:
ConvScripts:
A working example (see the driver sketch after this list):
1) Compile the EvalUtil utility
2) Go to the directory SampleData
3) Run the shell script sample_run.sh
4) Read the comments inside the script
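The steps above can also be scripted. The Python sketch below mirrors them; the build command (make inside EvalUtil) and the directory layout are assumptions, so adjust them to match the repository before running.

```python
import subprocess

# 1) Compile EvalUtil (assumes a make-based build; adjust if the build differs)
subprocess.run(["make"], cwd="EvalUtil", check=True)

# 2) + 3) Run the sample script from the SampleData directory
subprocess.run(["sh", "sample_run.sh"], cwd="SampleData", check=True)

# 4) Open SampleData/sample_run.sh in an editor and read the comments
```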