项目作者: OtterOriented

项目描述 :
Using Levenshtein Distance for Fuzzy String Matching
高级语言: Jupyter Notebook
项目地址: git://github.com/OtterOriented/home-office-spend-2020.git
创建时间: 2021-03-26T20:49:44Z
项目社区:https://github.com/OtterOriented/home-office-spend-2020

开源协议:

下载


Using Levenshtein Distance for Fuzzy String Matching

There was a tweet on 26th March 2020 with a screenshot of some Home Office spending. I was curious about how easy it is to harvest, clean and summarise the spending data. The data were split across twelve CSV files, one for each month. Levenshtein distance was used to combine the files by fuzzy matching supplier names as inconsistent values were used. I’ve previously used other methods such as soundex and metaphone, but these data presented more of a challenge.

Examples of fuzzy supplier names used for matching:

Original Fuzzy value
A1 ENVIRO SCIENCE a1 enviro sciences
A1-ENVIRO SCIENCES. a1 enviro sciences
—- —-
WWW.EQUIP4WORK.CO.UK equip4work
EQUIP4WORK equip4work


The article originally appeared in Byline Times as ‘Home Office Fails to Explain Strange Expenses‘. The Home Office did respond 26-Mar-2021 to say that purchases were made for PPE and clothing for asylum seekers.

Byline

Source Data

  • Home Office data scraped from here on 26-Mar-2021
  • Data are split across twelve CSV files
  • Data covers the 2020 calendar year
  • Data published 19-Mar-2021

Report

Spending 2020 at the end of the Jupyter notebook contains:

  • Total spend for year: £2,097,376.09
  • March and April 2020 at the first lockdown have spike increase in spending and account for 58% of the 2020 spend
  • Chart and table of spending per month
  • Table of top 20 supplier spend for 2020
  • Table of top 20 spend from 01-Mar-2020 to 30-Apr-2020

Total Spending per Month 2020

Cleaned Data Files

The Supplier column had to be processed so that the data could be aggregated as supplier names aren’t always consistent. Levenshtein distance was used for fuzzy matching supplier names.

cleaned_agg_spend.csv contains data aggregated by supplier for 2020. Columns:

  • supplier_fuzzy - the fuzzy match string used to link suppliers
  • total_spend_supplier - total spend with supplier over year
  • percent_share_spend - percentage of total spend with supplier
  • number_purchases - number of purchases made with supplier
  • first_purchase - date first purchase made from supplier (yyyy-mm-dd)
  • last_purchase - date last purchase made from supplier (yyyy-mm-dd)

cleaned_spend.csv contains all rows from merging the original twelve CSV files. Columns:

  • Date
  • Amount
  • Supplier
  • supplier_clean - some preprocessing on Supplier names to help with the fuzzy matching
  • supplier_fuzzy - the fuzzy match string used to link suppliers