Author: mcaputto

Description:
Analyze string similarity using Levenshtein's distance.
Language: C
Repository: git://github.com/mcaputto/similitude.git
Created: 2017-07-17T01:44:04Z
Project page: https://github.com/mcaputto/similitude

similitude

Description

similitude compares strings by computing the edit distance between them,
using Levenshtein's distance.

Algorithmic complexity

Given a string of length m and a string of length n, similitude runs in
O(mn) time and O(min(m,n)) space.
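
The space bound comes from keeping only one row of the dynamic-programming
table, sized to the shorter string. The sketch below shows one common way to
do that. It is not taken from the similitude source: the names levenshtein and
min3 are hypothetical, and it compares bytes rather than Unicode characters.

```c
#include <stdlib.h>
#include <string.h>

/* Smallest of three values. */
static size_t min3(size_t a, size_t b, size_t c)
{
    size_t m = a < b ? a : b;
    return m < c ? m : c;
}

/*
 * Hypothetical helper (an assumption, not similitude's actual API).
 * Returns the Levenshtein distance between s and t, or (size_t)-1 if
 * the working row cannot be allocated.
 */
size_t levenshtein(const char *s, const char *t)
{
    size_t m = strlen(s);
    size_t n = strlen(t);

    /* Make t the shorter string so the row holds min(m, n) + 1 cells. */
    if (n > m) {
        const char *tmp_str = s;
        size_t tmp_len = m;
        s = t;
        t = tmp_str;
        m = n;
        n = tmp_len;
    }

    size_t *row = malloc((n + 1) * sizeof *row);
    if (row == NULL)
        return (size_t)-1;

    /* Distance from the empty prefix of s to every prefix of t. */
    for (size_t j = 0; j <= n; j++)
        row[j] = j;

    for (size_t i = 1; i <= m; i++) {
        size_t diag = row[0];            /* D[i-1][j-1] */
        row[0] = i;                      /* D[i][0] */
        for (size_t j = 1; j <= n; j++) {
            size_t up = row[j];          /* D[i-1][j] */
            size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            row[j] = min3(up + 1,          /* deletion */
                          row[j - 1] + 1,  /* insertion */
                          diag + cost);    /* substitution or match */
            diag = up;
        }
    }

    size_t result = row[n];
    free(row);
    return result;
}
```

The two nested loops visit every (i, j) pair once, which gives the O(mn)
running time, while the single row accounts for the O(min(m,n)) space.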

Installation

  make

Usage

similitude will compare lines in two files and print the edit distances to
stdout.

Example

  $ ./bin/similitude test/foo test/bar
  19958 S BAKERS FERRY RD, 2387 PIMLICO DR, 20
  19958 S BAKERS FERRY RD, 1706 22ND AVE, 20
  19958 S BAKERS FERRY RD, 512 SE BASELINE ST, 16
  etc.
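
Each output line in the example appears to follow the pattern
"source, target, distance". A minimal sketch of producing such a record,
assuming lines were read with fgets() and still carry their newline, might
look like this. The helpers strip_newline and format_record are hypothetical
names and are not part of similitude.

```c
#include <stdio.h>
#include <string.h>

/* Cut the line at its first '\r' or '\n', as left behind by fgets(). */
void strip_newline(char *line)
{
    line[strcspn(line, "\r\n")] = '\0';
}

/*
 * Hypothetical helper: writes "source, target, distance" into buf.
 * Returns the record length, or -1 if buf is too small or formatting fails.
 */
int format_record(char *buf, size_t size, const char *source,
                  const char *target, size_t distance)
{
    int n = snprintf(buf, size, "%s, %s, %zu", source, target, distance);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Returning -1 on truncation lets the caller skip or report an overlong record
instead of printing a partial one.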

Python extension

Requirements

Python version 3.6 or greater (due to f-strings).

Installation

  pip install -r requirements.txt

Usage

similitude.py compares lines in two files and loads the edit distances
into a pandas DataFrame.

Example

  $ python3 similitude.py test/foo test/bar
  Pivot table:
                                               distance
  source              target
  0104 SW LANE ST     1 CONDOLEA DR                  12
                      1 JEFFERSON PKWY APT 266       20
                      100 SW 195TH AVE SPC 13        14
  ...                                               ...
  9906 SE REEDWAY ST  9510 S WILDCAT RD              11
                      9517 SE 75TH AVE               12
                      9532 SW WHITFORD LN            14
  [2000000 rows x 1 columns]