An ML+NLP solution for linking misspelled titles with the true titles
Finds the best match (in a database of titles) for a misspelled title,
using a combination of Machine Learning and NLP techniques.
make --always-make build
- to build and prepare the Docker container for running the project
make update-docker
- to update the project setup on the Docker container
make stage-example-data-set
- to copy the “example” data set files to the Docker container
make inspect
- inspect the code for PEP-8 issues
make test
- run the unit tests
Main Classes (please follow the docstrings):
- MatchMaker in match_maker.py
- FeatureEngineering in feature_engineering.py
- Prediction in predict.py
Run the following CLIs in order:
make train-model
- Alias of train_model in cli.py
- OneVsRest[*]Classifier - “rest” being the closest “n” titles (based on the Jaccard distance)
- train and evaluation data sets for the train-model cli
- construct_features (in feature_engineering.py)
- train-auc:0.999979 evaluation-auc:0.999964 train-custom-error:225 evaluation-custom-error:102
True Positives 7084
True Negatives 18673
False Positives 2
False Negatives 26
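As a quick sanity check, the four counts above can be turned into the usual summary metrics. This is only arithmetic on the reported numbers, not project code, and the variable names are illustrative.

```python
# Counts copied from the confusion matrix above.
true_positives = 7084
true_negatives = 18673
false_positives = 2
false_negatives = 26

total = true_positives + true_negatives + false_positives + false_negatives

# Share of predicted matches that were real matches.
precision = true_positives / (true_positives + false_positives)
# Share of real matches that were found.
recall = true_positives / (true_positives + false_negatives)
# Share of all decisions that were correct.
accuracy = (true_positives + true_negatives) / total

print(f"total={total} precision={precision:.4f} recall={recall:.4f} accuracy={accuracy:.4f}")
```

Precision is higher than recall here, which is consistent with a model tuned to avoid false positives.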
- custom_error in train.py
- weighted_log_loss
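The project's actual weighted_log_loss is not shown here. As a rough sketch of the general idea, a log loss can be weighted so that confident wrong "match" predictions on non-matching pairs cost more than missed matches. The function name, the false_positive_penalty parameter and its default value are assumptions made for illustration. The project's FALSE_POSITIVE_PENALTY_FACTOR setting may or may not be applied this way.

```python
import math


def weighted_log_loss_sketch(labels, probabilities, false_positive_penalty=5.0, eps=1e-15):
    """Mean log loss where the negative-class term is scaled by a penalty.

    labels: iterable of 0/1 values (1 = the pair is a true match).
    probabilities: predicted probability of a match for each pair.
    false_positive_penalty: hypothetical weight; 1.0 gives plain log loss.
    """
    labels = list(labels)
    probabilities = list(probabilities)
    total = 0.0
    for label, p in zip(labels, probabilities):
        # Clip to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        if label == 1:
            total -= math.log(p)
        else:
            # A high p on a negative example is a potential false positive.
            total -= false_positive_penalty * math.log(1.0 - p)
    return total / len(labels)


# A confident false positive costs more than an equally confident miss.
false_positive_cost = weighted_log_loss_sketch([0], [0.9])
missed_match_cost = weighted_log_loss_sketch([1], [0.1])
print(false_positive_cost > missed_match_cost)
```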
make generate-predictions
- Alias of generate_predictions in cli.py
(run make get-predictions-accuracy to calculate the following)
Correctly matched titles 5929
Incorrectly matched titles 114 *
Correctly marked as not-found 3894
Incorrectly marked as not-found 63
* The model is already biased against “false positives”. To have even fewer false positives, tweak the FALSE_POSITIVE_PENALTY_FACTOR or PREDICTION_PROBABILITY_THRESHOLD settings in settings.py.
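To illustrate why a probability threshold trades incorrect matches for "not-found" results, here is a minimal sketch of a threshold-based decision. The function pick_match, its arguments and the example titles are hypothetical. How the project actually applies PREDICTION_PROBABILITY_THRESHOLD may differ.

```python
def pick_match(scored_candidates, threshold):
    """Return the best-scoring title, or None ("not-found") below the threshold.

    scored_candidates: list of (title, probability) pairs; hypothetical input shape.
    threshold: minimum probability required to accept a match.
    """
    if not scored_candidates:
        return None
    best_title, best_probability = max(scored_candidates, key=lambda pair: pair[1])
    return best_title if best_probability >= threshold else None


# Made-up candidates and scores, for illustration only.
candidates = [("Acme Holdings plc", 0.62), ("Acme Hotels plc", 0.40)]

print(pick_match(candidates, threshold=0.5))  # accepts the best candidate
print(pick_match(candidates, threshold=0.9))  # stricter: reported as not-found
```

Raising the threshold can only turn matches into not-found results, so it reduces incorrect matches at the cost of more titles incorrectly marked as not-found.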
make closest-search-single-title title='PRO teome plc SCIs'
- Alias of closest_search_single_title in cli.py
- OneVsRestClassifier for the entire (not just the nearest matches) “truth” database
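For contrast with the full-database search, the "closest n titles by Jaccard distance" pre-selection mentioned for train-model could look roughly like the sketch below. It assumes character bigrams over lowercased, whitespace-stripped text. The project's own tokenization, function names and value of "n" are not given here, and the titles are a made-up stand-in for the “truth” database.

```python
def character_ngrams(text, n=2):
    """Set of character n-grams of the lowercased text with whitespace removed."""
    normalized = "".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard_distance(a, b, n=2):
    """1 - |A ∩ B| / |A ∪ B| over the character n-gram sets of a and b."""
    set_a, set_b = character_ngrams(a, n), character_ngrams(b, n)
    union = set_a | set_b
    if not union:
        return 0.0  # both strings too short to produce n-grams
    return 1.0 - len(set_a & set_b) / len(union)


def closest_titles(query, titles, n_closest=3):
    """The n_closest titles with the smallest Jaccard distance to the query."""
    return sorted(titles, key=lambda title: jaccard_distance(query, title))[:n_closest]


# Made-up titles, for illustration only.
truth_titles = ["Proteome Sciences plc", "Acme Holdings plc", "Northwind Traders"]
print(closest_titles("PRO teome plc SCIs", truth_titles, n_closest=2))
```

Scoring only these nearest candidates keeps the classifier's work small. Scoring the entire database, as this cli does, avoids missing a true title that the Jaccard pre-filter would have ranked too low.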