Using Levenshtein Distance for Fuzzy String Matching

There was a tweet on 26th March 2020 with a screenshot of some Home Office spending. I was curious about how easy it is to harvest, clean and summarise the spending data. The data were split across twelve CSV files, one for each month. Levenshtein distance was used to combine the files by fuzzy matching supplier names as inconsistent values were used. I’ve previously used other methods such as soundex and metaphone, but these data presented more of a challenge.

Examples of fuzzy supplier names used for matching:

Original	Fuzzy value
A1 ENVIRO SCIENCE	a1 enviro sciences
A1-ENVIRO SCIENCES.	a1 enviro sciences
—-	—-
WWW.EQUIP4WORK.CO.UK	equip4work
EQUIP4WORK	equip4work

The article originally appeared in Byline Times as ‘Home Office Fails to Explain Strange Expenses‘. The Home Office did respond 26-Mar-2021 to say that purchases were made for PPE and clothing for asylum seekers.

Byline

Source Data

Home Office data scraped from here on 26-Mar-2021
Data are split across twelve CSV files
Data covers the 2020 calendar year
Data published 19-Mar-2021

Report

Spending 2020 at the end of the Jupyter notebook contains:

Total spend for year: £2,097,376.09
March and April 2020 at the first lockdown have spike increase in spending and account for 58% of the 2020 spend
Chart and table of spending per month
Table of top 20 supplier spend for 2020
Table of top 20 spend from 01-Mar-2020 to 30-Apr-2020

Total Spending per Month 2020

Cleaned Data Files

The Supplier column had to be processed so that the data could be aggregated as supplier names aren’t always consistent. Levenshtein distance was used for fuzzy matching supplier names.

cleaned_agg_spend.csv contains data aggregated by supplier for 2020. Columns:

supplier_fuzzy - the fuzzy match string used to link suppliers
total_spend_supplier - total spend with supplier over year
percent_share_spend - percentage of total spend with supplier
number_purchases - number of purchases made with supplier
first_purchase - date first purchase made from supplier (yyyy-mm-dd)
last_purchase - date last purchase made from supplier (yyyy-mm-dd)

cleaned_spend.csv contains all rows from merging the original twelve CSV files. Columns:

Date
Amount
Supplier
supplier_clean - some preprocessing on Supplier names to help with the fuzzy matching
supplier_fuzzy - the fuzzy match string used to link suppliers