Implementation of KNN regression and classification models in Python. Example usage is demonstrated in the `example_usage.ipynb` notebook. The package is contained within the `KNN` directory.

## KNN_data_collection
- `__init__` constructor: Creates an instance of a KNN model. Requires input `model_type`, indicating whether the model is a classifier or a regressor. Optional input `k` sets the number of nearest neighbors to consider when generating predictions (default is 3).
- `load_csv`: Loads a CSV dataset using the `csv` module and splits predictor and response variables. Requires input `path` (string), the path of the CSV file to load, and `response` (string), the name of the column in the CSV to be set as the response variable (all other columns are set as predictors). Expects the first row of the CSV to contain column names.
- `train_test_split`: Performs a train/test split on the loaded dataset, storing the resulting arrays as instance attributes for use by the other modules. Test indices are selected randomly using `np.random.choice` to avoid potential bias. Optional input `test_size` determines the proportion of the dataset placed in the test set (default is 0.3).
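The splitting behavior described above can be sketched in plain numpy. This is an illustrative stand-alone version, not the package source: the function signature and the boolean-mask bookkeeping are assumptions; only the `np.random.choice` sampling and the 0.3 default come from the description.

```python
import numpy as np

def train_test_split(X, y, test_size=0.3):
    """Illustrative split: sample test indices without replacement
    via np.random.choice, as the package's method is described to do."""
    n_test = int(round(len(X) * test_size))
    test_idx = np.random.choice(len(X), size=n_test, replace=False)
    mask = np.zeros(len(X), dtype=bool)
    mask[test_idx] = True  # True marks rows assigned to the test set
    return X[~mask], X[mask], y[~mask], y[mask]

X = np.arange(20).reshape(10, 2)   # 10 observations, 2 predictors
y = np.arange(10)                  # response values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
print(len(X_train), len(X_test))   # 7 3
```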
## generate_predictions

- `euclidean_distance`: Computes the Euclidean distance between two points; used as the distance function for the KNN implementation.
- `generate_prediction`: Generates a single prediction using a KNN model. Requires inputs `knn_model`, an instance of a KNN model (with a pre-loaded CSV); `new_obs`, a numpy array or list containing the sample for which to generate a prediction; and `subset`, one of `'train'` or `'all'`, which determines whether the prediction is computed using only the training set or the entire dataset. Predictions are generated by computing the Euclidean distance between the input observation and all points in the chosen subset, sorting these distances, and selecting the `k` closest. For a regression model, the mean of these `k` closest points is returned; for a classification model, the most frequently observed class among these `k` closest points is returned. Note that in the case of a tie, the classifier prediction is determined by the order of the training set (a warning is displayed if this occurs).
- `generate_predictions`: Generates multiple predictions using a KNN model, returning the results as a numpy array. Requires inputs `knn_model`, an instance of a KNN model (with a pre-loaded CSV); `new_array`, a multi-dimensional numpy array containing the samples for which to generate predictions; and `subset`, one of `'train'` or `'all'`, which determines whether the predictions are computed using only the training set or the entire dataset. Predictions are generated by applying the `generate_prediction` function to `new_array` row-wise using `np.apply_along_axis`.
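The prediction logic (compute distances, sort, take the `k` closest, then mean or majority vote) can be sketched as a stand-alone function. The version below works on raw arrays rather than a `knn_model` instance, so it illustrates the algorithm rather than the package's actual API; the stable sort and the `Counter`-based vote (whose `most_common` breaks ties by order of appearance, matching the documented tie behavior) are assumptions.

```python
import numpy as np
from collections import Counter

def euclidean_distance(a, b):
    # Square root of the summed squared coordinate differences
    return np.sqrt(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

def generate_prediction(X, y, new_obs, k=3, model_type="regressor"):
    # Distance from the new observation to every stored point
    dists = np.array([euclidean_distance(row, new_obs) for row in X])
    nearest = np.argsort(dists, kind="stable")[:k]  # indices of the k closest
    if model_type == "regressor":
        return np.mean(y[nearest])                  # mean of the k responses
    # Classifier: most common class; ties resolve by order of appearance
    return Counter(y[nearest].tolist()).most_common(1)[0][0]

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
print(generate_prediction(X, y, [0.5], k=3))        # mean of 0, 1, 2 -> 1.0

# Row-wise application, as described for multiple predictions
new_array = np.array([[0.5], [9.0]])
preds = np.apply_along_axis(
    lambda row: generate_prediction(X, y, row, k=3), 1, new_array)
```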
## model_metrics

All functions in this module require two inputs: `actual`, a numpy array or list containing the "true" values, and `predicted`, a numpy array or list containing the predicted values. Error metrics are computed from these inputs using their respective mathematical formulas.

- `model_accuracy`: Computes the accuracy
- `model_misclassification`: Computes the misclassification rate
- `model_num_correct`: Computes the number of correct predictions
- `model_num_incorrect`: Computes the number of incorrect predictions
- `model_rmse`: Computes the root mean squared error
- `model_mse`: Computes the mean squared error
- `model_mae`: Computes the mean absolute error
- `model_mape`: Computes the mean absolute percent error
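Most of these metrics reduce to numpy one-liners. The function names below mirror the list above, but the bodies are my assumed implementations of the standard formulas (the text only names the metrics), shown for a few representatives.

```python
import numpy as np

def model_mse(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean((a - p) ** 2)                   # mean squared error

def model_rmse(actual, predicted):
    return np.sqrt(model_mse(actual, predicted))   # root of the MSE

def model_mae(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(a - p))                  # mean absolute error

def model_mape(actual, predicted):
    a, p = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs((a - p) / a)) * 100      # mean absolute percent error

def model_accuracy(actual, predicted):
    return np.mean(np.asarray(actual) == np.asarray(predicted))

def model_misclassification(actual, predicted):
    return 1 - model_accuracy(actual, predicted)

print(model_mse([1, 2, 3], [1, 2, 5]))             # (0 + 0 + 4) / 3 = 1.333...
print(model_accuracy(["a", "b"], ["a", "c"]))      # 0.5
```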
## cross_validation

The `CvKNN` class inherits from the `KNN` class (defined in the `modelling` module). The model must load a CSV and perform a train/test split using the functions documented in the `modelling` module before cross validation can be performed.

- `__init__` constructor: Creates an instance of a KNN model for performing cross validation. Requires input `model_type`, indicating whether the model is a classifier or a regressor. Optional input `num_folds` indicates the number of folds used in k-fold cross validation (default is 5).
- `perform_cv`: Performs k-fold cross validation on the training set. Requires argument `k_values`, a list or numpy array of `k` values on which to perform CV. This is done by first creating the indices needed to split the training set into `num_folds` folds. For every value of `k` in `k_values`, predictions are generated for 1/`num_folds` of the training set (the CV test set), using the remaining (`num_folds` - 1)/`num_folds` of the observations as the CV training set. The model's average performance across all folds (`mse` is used as the performance metric for regressors, while `misclassification_rate` is used for classifiers) is recorded for each value of `k` in `k_values`.
- `get_cv_results`: Displays the results from `perform_cv`. The average performance across all folds, using the appropriate error metric, is printed for each value of `k` provided to `perform_cv`. A line plot created with the `seaborn` library is also displayed, offering a visual representation of how the various `k` values influence the model's performance.
- `get_best_k`: Prints the "best" `k` value from `perform_cv`, defined as the value of `k` that produced the lowest cross validation loss (`mse` for regressors, misclassification rate for classifiers). Sets this value of `k` as an instance attribute, allowing the `CvKNN` instance to be passed into the `assessment_metrics` function in the `generate_prediction` module to assess the KNN model's performance with the tuned `k` hyperparameter.
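The `perform_cv` / `get_best_k` flow can be sketched for the regression case. This is a stand-alone illustration, not the package source: the signature, the shuffling, and the `np.array_split` fold construction are assumptions, while the per-fold MSE averaging and the lowest-loss choice of `k` follow the description above.

```python
import numpy as np

def perform_cv(X, y, k_values, num_folds=5, seed=0):
    """Sketch of k-fold CV for a KNN regressor: for each candidate k,
    record the MSE averaged over num_folds held-out folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), num_folds)
    results = {}
    for k in k_values:
        fold_mse = []
        for i in range(num_folds):
            test_idx = folds[i]                    # 1/num_folds of the data
            train_idx = np.concatenate(
                [folds[j] for j in range(num_folds) if j != i])
            preds = []
            for obs in X[test_idx]:
                d = np.linalg.norm(X[train_idx] - obs, axis=1)
                nearest = np.argsort(d, kind="stable")[:k]
                preds.append(y[train_idx][nearest].mean())
            fold_mse.append(np.mean((np.asarray(preds) - y[test_idx]) ** 2))
        results[k] = float(np.mean(fold_mse))      # average loss for this k
    return results

X = np.linspace(0.0, 1.0, 30).reshape(-1, 1)
y = 2.0 * X.ravel()                                # simple linear response
results = perform_cv(X, y, k_values=[1, 3, 5])
best_k = min(results, key=results.get)             # k with the lowest avg MSE
```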