项目作者: wherefortravel

项目描述 :
MinHash and LSH index written in Rust for Node.js
高级语言: Rust
项目地址: git://github.com/wherefortravel/minhash-node-rs.git
创建时间: 2019-04-11T01:44:37Z
项目社区:https://github.com/wherefortravel/minhash-node-rs

开源协议:

下载


minhash-node-rs

minhash-node-rs is a library for Node.js designed to calculate the MinHash of inputs, as well as being able to index them and allow sub-linear query time in a LSH index.

It is written in Rust for high performance and (relatively) low memory overhead, as well as memory safety.

Why?

Why use MinHash? Sometimes you might want to look for documents that are almost the same, but not exactly, and you want to find them in sub-linear time (less than O(N) on average). LshIndex provides a way to look up near duplicates for a document in near constant time.

Why use Rust? Node.js & V8 can have very poor performance with the garbage collector when objects get large. Since objects such as the LSH index can become huge, we don’t want to tax the GC too much. Writing this natively allows for that.

How do I use it?

Here is an example:

  1. import { PermGen, MinHash, LshIndex } from 'minhash-node-rs'
  2. const permGen = new PermGen(128) // 128 is the number of permutations to use when hashing
  3. const hash = new MinHash(permGen) // we allocate MinHash using the permutations of previous PermGen instance
  4. hash.update(permGen, 'hello') // add the word hello
  5. hash.update(permGen, 'world') // add the word world
  6. const hash2 = new MinHash(permGen)
  7. hash2.update(permGen, 'world') // add words
  8. hash2.update(permGen, 'test') // add words
  9. console.log(hash.jaccard(hash2)) // print jaccard similarity
  10. const index = new LshIndex() // construct LSH index
  11. index.insert('hash', hash) // insert hash
  12. console.log(index.query(hash2)) // query for near duplicates

Who uses this?

WhereTo.com uses this in production to detect near duplicate data. If you use this in production, feel free to submit a Pull Request.