Predicting the probability of default on home equity loans using both scikit-learn and PySpark
The Home Equity Loan (HMEQ) dataset reports characteristics and delinquency information for 5,960 home equity loans. A home equity loan is a loan in which the obligor uses the equity in their home as the underlying collateral.
In this project, we predict the probability of default on a home equity loan. The dataset contains two classes. The majority (negative) class comprises 80% of the observations and represents applicants who paid their loan on time; the minority (positive) class comprises the remaining 20% and represents applicants who defaulted on their loan.
The dataset also contains a few missing values in some variables, which were imputed before modeling. We built four supervised classification models: logistic regression, support vector machine, random forest, and XGBoost. The area under the ROC curve (AUC) was used as the performance metric for all models.
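The workflow described above (impute, fit, score by AUC) can be sketched in scikit-learn roughly as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in for HMEQ generated with `make_classification`, median imputation is an assumed strategy, the hyperparameters are arbitrary, and only two of the four models are shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for HMEQ (assumption): about 80% negatives, 20% positives.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.8, 0.2], random_state=0,
)

# Blank out roughly 5% of the entries to mimic missing values.
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.05] = np.nan

# Stratify so the train and test splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Imputation lives inside each pipeline, so it is fitted on training data only.
models = {
    "logistic_regression": make_pipeline(
        SimpleImputer(strategy="median"),
        StandardScaler(),
        LogisticRegression(max_iter=1000),
    ),
    "random_forest": make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestClassifier(n_estimators=100, random_state=0),
    ),
}

auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # AUC is computed from the predicted probability of the positive class.
    proba = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, proba)
    print(f"{name}: AUC = {auc_scores[name]:.3f}")
```

Keeping the imputer inside the pipeline avoids leaking test-set statistics into training, and AUC is a reasonable choice here because it does not depend on a classification threshold, which matters when one class makes up only about 20% of the data.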
Below are the feature importances of the variables from the random forest classifier.
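As a minimal sketch of how such importances can be extracted from a fitted scikit-learn random forest, the snippet below reads the `feature_importances_` attribute and ranks the features. The data is synthetic and the feature names are hypothetical placeholders, not the HMEQ variables, so the printed values say nothing about the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names; the real HMEQ column names are not used here.
feature_names = [f"feature_{i}" for i in range(8)]

# Synthetic, imbalanced data as a stand-in for the real dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.8, 0.2], random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = forest.feature_importances_

# Rank features from most to least important.
ranked = sorted(zip(feature_names, importances), key=lambda p: p[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

Impurity-based importances can overstate high-cardinality or continuous features, so they are best read as a rough ranking rather than a precise measure of effect.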
The results shown below are based on the optimized AUC-ROC. By this metric, XGBoost and SVM are the best-performing models.