A Quick Take on anagram creation
A Quick Take on anagram finding in Rust.
The requirements for the specific task are discussed here.
In essense, make a program, that takes a dictionary and a word as arguments and find all the anagrams of the word in the dictionary
Compatiblity Note: the program will exit with an error if the dictionary is not in UTF-8 encoding and contains non-ASCII characters.
The project includes a linux binary (compiled under ubuntu 18.04) int the .bin dir
. It also includes the standard “lemmad.txt” file, specified in the requirements.
NB! Included binary only works under linux! Also, if you are using older CPU architectures and the binary fails, then use bin/anagrams_compatiblity
executable (newer than Intel 4. gen Core, e.g. “Haswell” should be fine) instead of the bin/anagrams
.
Searching for anagrams for a word is done with the following command:
bin/anagrams dicts/lemmad.txt asi
> 4587,ais,asi,isa,sai
where the first argument is the dictionary and the other one is the word to search for.
To use a word with spaces as args do the following:
bin/anagrams dicts/lemmad.txt "aGu isAEEtall"
> 4674,Augeiase tall
When working with huge dictionarys that do not fit to memory (e.g tens of gigabytes), use the streaming mode with the command line flag ‘-b’:
bin/anagrams dicts/megagigalarge.txt "aGu isAEEtall" -b
The output is the execution time in microseconds and the found anagrams as written in requirements.
install rust:
curl https://sh.rustup.rs -sSf | sh
build the project:
cargo build --release
run the executable: (built into the ./target/release
folder).
./target/release/anagrams dicts/lemmad.txt anagramm
(or with any other desired parameters)
The naive approach would be, to compare the length word to every string in the dictionary. If it matches sort both strings and validate that they are equal. That however is far from optimal as comparison based sorting has a minimal complexitiy of O(n log n)
.
Instead we could “abuse” the fact that the possible number of characters is limited and use something similar to counting sort O(n)
.
E.g. We could create an “alphabet vector” for the first string (containing all the possible characters one could encounter in the dictionary). Then:
return false
immediately as the current letter in the 2nd string has more occurrences than in the 1st onereturn true
Relevant Javascript code:
// Example code in javascript for simplicity
function isAnagram(word, candidate) {
const counts = new Array(letters.length).fill(0);
// add letters of the search word
for(let letter of word) {
counts[indexes[letter]] += 1;
}
// subtract the letters of the candidate word
for(let letter of candidate) {
const idx = indexes[letter];
const count = memo[idx];
if(count === 0) return false
counts[idx] -= 1;
}
return true;
}
Due to the specifics of the task, we can make some rather obvious “optimizations”.
That algorithm is closer to:
//Example code in javascript for simplicity
function isAnagram(letterCounts, candidate) {
const counts = [...letterCounts]; //clone
for(let letter of candidate) {
const idx = indexes[letter];
const count = memo[idx];
if(count === 0) return false
counts[idx] -= 1;
}
return true;
}
This has the algorithm settled. The rest is implementation details and I/O stuff (as this excercise is majorly I/O bound).
Rust is a systems programming language that makes fearless concurrency relatively easy. It’s lightweight, has no carbage collection and has modern build-, packaging- and unit-testing tools included out of the box.
On top of that RUST allows pretty simple Thread level Parallelism (via rayon) and Instruction Level Parallelism (via SIMD libraries, like numeric-array) without changing your code.
As the anagram problem is embarrassingly parallel, it probably makes sense to use at least thread-level concurrency.
For instance: the function for streaming the file and finding anagrams sequencially is the following
let (letter_indexes, letter_counts) = precalc_letter_data(word);
let mut results: Vec<String> = Vec::new();
let file = File::open(path)
.expect("Something went wrong reading the file");
for line in BufReader::new(file).lines() {
let candidate = &line?;
if candidate.len() == word.len() && is_anagram(candidate, &letter_counts, &letter_indexes) {
results.push(candidate.clone());
}
}
Ok(results)
}
The only change needed to make to make this run on all the logical processors available (e.g. 12 threads on a modern 6-core i7) is the following:
pub fn find_anagrams_parallel(word: &str, path: &str) -> Vec<String> {
let (letter_indexes, letter_counts) = precalc_letter_data(word);
let contents = read_to_string(path)
.expect("Something went wrong reading the file");
contents
.par_lines() // <- this is all the parallelization magic there is
.filter(|candidate| {
candidate.len() == word.len() && is_anagram(candidate, &letter_counts, &letter_indexes)
})
.map(|candidate| candidate.to_string().clone())
.collect()
}
The only real change is replacing .iter()
with .par_iter()
and Rayon does the rest! :)
Although one must note, that the parallel doesn’t stream. Rather it loads the entire dictionary to memory in the beginning. This is required for two reasons:
~37648 microseconds
~4471 microseconds