Using Levenshtein Distance for Fuzzy String Matching
There was a tweet on 26th March 2020 with a screenshot of some Home Office spending. I was curious about how easy it is to harvest, clean and summarise the spending data. The data were split across twelve CSV files, one for each month. Levenshtein distance was used to combine the files by fuzzy matching supplier names as inconsistent values were used. I’ve previously used other methods such as soundex and metaphone, but these data presented more of a challenge.
Examples of fuzzy supplier names used for matching:
Original | Fuzzy value |
---|---|
A1 ENVIRO SCIENCE | a1 enviro sciences |
A1-ENVIRO SCIENCES. | a1 enviro sciences |
—- | —- |
WWW.EQUIP4WORK.CO.UK | equip4work |
EQUIP4WORK | equip4work |
The article originally appeared in Byline Times as ‘Home Office Fails to Explain Strange Expenses‘. The Home Office did respond 26-Mar-2021 to say that purchases were made for PPE and clothing for asylum seekers.
Spending 2020 at the end of the Jupyter notebook contains:
The Supplier
column had to be processed so that the data could be aggregated as supplier names aren’t always consistent. Levenshtein distance was used for fuzzy matching supplier names.
cleaned_agg_spend.csv contains data aggregated by supplier for 2020. Columns:
cleaned_spend.csv contains all rows from merging the original twelve CSV files. Columns:
Supplier
names to help with the fuzzy matching