C++ implementation of Generalised Brown clustering and python scripts for feature generation
version 1.0
© 2015 Sean Chester and Leon Derczynski
The generalised-brown software suite clusters word types by
distributional similarity in two phases. It first generates a list
of merges based on the well-known Brown clustering algorithm and
then recalls historical states to vary the granularity of the
clusters. For example, given the following corpus:
Alice likes dogs and Bob likes cats while Alice hates snakes and Bob hates spiders
Greedily clustering word types based on average mutual information
(i.e., running the C++ merge generator) produces the following
merge list (assuming a = |V| = 10):
snakes spiders 8
dogs cats 7
Alice Bob 6
and while 5
likes hates 4
dogs snakes 3
dogs and 2
dogs Alice 1
dogs likes 0
One can then recall any historical state of the computation in order to
produce a set of clusters (i.e., run the python cluster generator).
For example, with c = 5, we recall the state c - 1 = 4 to produce
the following clusters:
{snakes, spiders}
{dogs, cats}
{Alice, Bob}
{likes, hates}
{and, while}
This approach (setting separate values of a and c) we refer to as
Roll-up feature generation. By contrast, traditional Brown clustering
would produce the following five clusters (equivalent to running the
C++ merge generator with a = 5 and the python cluster generator
with c = 5):
{likes, hates}
{snakes, spiders, cats, dogs}
{and, while}
{Alice}
{Bob}
For details about the concepts implemented in this software, please
read our recent AAAI paper:
L. Derczynski and S. Chester. 2016. “Generalised Brown Clustering
and Roll-up Feature Generation.” In: Proceedings of the
Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16).
7 pages. To appear.
For details about traditional Brown clustering, consult the article
in which it was introduced:
PF Brown et al. 1992. “Class-based n-gram models of natural language.”
Computational Linguistics 18(4): 467—479.
or the implementation that our C++ merge generator forked:
generalised-brown relies on the following applications:
For compiling the C++ merge generator: A C++ compiler that
is compatible with C++ 11 and OpenMP (e.g., the newest
GNU compiler) and the make program
For running the python cluster generator: A python
interpreter
The python cluster generator does not need to be compiled.
To compile the C++ merge generator, navigate to the
merge_generator/ subdirectory of the project and type:
make
To produce a set of features for a corpus, you will first want to use
Generalised Brown (i.e., the C++ merge generator) to create a merge list.
Then, you can create c clusters by running the python cluster generator
on the merge list. This second step can be done for as many values of c
as you like, but we recommend that each value of c is not larger than the
value of a used to generate the merge list.
To run the C++ merge generator, type:
./merge_generator/wcluster —text [input_file] —a [active_set_size]
The resultant merges will be recorded in:
./[input_file]-c[active_set_size]-p1.out/merges
To run the python cluster generator, type:
python ./cluster_generator/cluster.py -in ./[input_file]-c[active_set_size]-p1.out/merges -c 3
Each word type will be printed to stdout with its cluster id.
The C++ merge generator runs in O(|V| a^2) time, where |V| is the number
of distinct word types in the corpus (i.e., the size of the vocabulary) and
a is a bound on the algorithm’s search space. The python cluster generator
runs in O(|V|) time.
This software consists of two sub-modules, each released under a
different license:
The python cluster generator is subject to the terms of
The MIT License
The C++ merge generator follows the original licensing terms
of wcluster.
See the relevant sub-directories of this repository for the
specific details of each license.
You can find aclusters induced from the RCV1 data, including a full
merge file, with a=2560, here:
http://derczynski.com/sheffield/resources/rcv.a2560.tar.bz2
This software suite will undergo a major revision; so, you are encouraged
to ensure that this is still the latest version. Please do not hesitate to
contact the authors if you have comments, questions, or bugs to report.