Loans are a core need of the modern world, and they account for a major part of banks' total profit. They help students manage their education and living expenses, and help people buy houses, cars, and other large purchases.
But deciding whether an applicant's profile qualifies for a loan requires banks to look at many aspects.
So here we will use Machine Learning with Python to ease that work and predict whether a candidate's profile qualifies, using key features such as Marital Status, Education, Applicant Income, Credit History, etc.
Loan Approval Prediction using Machine Learning
You can download the dataset used here by visiting this link.
The dataset contains 13 features:
| No. | Feature | Description |
|---|---|---|
| 1 | Loan_ID | A unique ID |
| 2 | Gender | Gender of the applicant (Male/Female) |
| 3 | Married | Marital status of the applicant (Yes/No) |
| 4 | Dependents | Whether the applicant has any dependents |
| 5 | Education | Whether the applicant is a graduate |
| 6 | Self_Employed | Whether the applicant is self-employed (Yes/No) |
| 7 | ApplicantIncome | Applicant's income |
| 8 | CoapplicantIncome | Co-applicant's income |
| 9 | LoanAmount | Loan amount (in thousands) |
| 10 | Loan_Amount_Term | Term of the loan (in months) |
| 11 | Credit_History | Credit history of the individual's repayment of their debts |
| 12 | Property_Area | Area of the property (Rural/Urban/Semi-urban) |
| 13 | Loan_Status | Whether the loan was approved (Y = Yes, N = No) |
Importing Libraries and Dataset
First, we import the libraries:
- Pandas – To load the Dataframe
- Matplotlib – To visualize the data features i.e. barplot
- Seaborn – To see the correlation between features using heatmap
Once the dataset is loaded, let's view its first few rows with the DataFrame's head() method.
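A minimal sketch of the load and preview steps is shown here. The helper names `load_loans` and `preview` are hypothetical, chosen only for illustration, and the source does not name the CSV file, so the path is left as a parameter.

```python
import pandas as pd


def load_loans(source):
    """Load the loan dataset from a CSV path or an open file-like object."""
    return pd.read_csv(source)


def preview(data, n=5):
    """Return the first n rows, the same rows data.head(n) shows in a notebook."""
    return data.head(n)
```

With the downloaded file saved locally, `data = load_loans("loan_data.csv")` followed by `preview(data)` would show the first five rows; `loan_data.csv` is a placeholder name, not the actual file name.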
Data Preprocessing and Visualization
Get the number of columns of object datatype.
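One way to count these columns is sketched below. It treats every non-numeric column as categorical, which matches the object datatype in most pandas versions; the helper name `categorical_columns` is an assumption made for this example.

```python
import pandas as pd


def categorical_columns(data):
    """Return the names of all non-numeric (object/string) columns."""
    return [col for col in data.columns
            if not pd.api.types.is_numeric_dtype(data[col])]
```

Calling `print("Categorical variables:", len(categorical_columns(data)))` on the loaded DataFrame prints a count in the format shown next.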
Categorical variables: 7
Since Loan_ID is unique for every row and is not correlated with any other column, we will drop it using the .drop() function.
Next, visualize the counts of the unique values in each categorical column using bar plots. This shows which value dominates each column in our dataset.
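Dropping the identifier can be sketched as follows. The function name `drop_identifier` is hypothetical; the underlying call is the standard `DataFrame.drop`.

```python
import pandas as pd


def drop_identifier(data, id_col="Loan_ID"):
    """Return a copy of data without the identifier column."""
    return data.drop(columns=[id_col])
```

Because `drop` returns a new DataFrame by default, the result has to be assigned back, for example `data = drop_identifier(data)`.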
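The bar plots described above could be produced roughly like this, assuming plain Matplotlib and one subplot per column. The function name and figure sizes are illustrative choices, not taken from the source.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_category_counts(data, columns):
    """Draw one bar chart of value counts per column and return the figure."""
    fig, axes = plt.subplots(1, len(columns),
                             figsize=(4 * len(columns), 3), squeeze=False)
    for ax, col in zip(axes[0], columns):
        counts = data[col].value_counts()  # most frequent value first
        ax.bar([str(v) for v in counts.index], counts.to_numpy())
        ax.set_title(col)
        ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    return fig
```

Passing the list of categorical column names gives one panel per column, so the dominant value in each is visible at a glance.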
Since the categorical columns take only a few distinct values (most are binary, and Property_Area has three), we can use Label Encoder on all of them; the values are then converted to the int datatype.
Check the object datatype columns again to see whether any are left.
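A sketch of the encoding step with scikit-learn's `LabelEncoder` follows. It assumes the categorical columns contain no missing values, and the helper name `encode_categoricals` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def encode_categoricals(data):
    """Return a copy in which every non-numeric column is integer-encoded."""
    encoded = data.copy()
    for col in encoded.columns:
        if not pd.api.types.is_numeric_dtype(encoded[col]):
            # A fresh encoder per column; classes are sorted alphabetically,
            # so for a Yes/No column "No" becomes 0 and "Yes" becomes 1.
            values = encoded[col].astype(object)
            encoded[col] = LabelEncoder().fit_transform(values)
    return encoded
```

The integer codes carry no real ordering, which is harmless for binary columns but worth keeping in mind for a three-valued column such as Property_Area.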
Categorical variables: 0
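With every column numeric, a correlation heatmap can be drawn along these lines. This is a sketch using Seaborn's `heatmap`; the colour map, figure size, and function name are assumptions for illustration.

```python
import matplotlib.pyplot as plt
import seaborn as sns


def plot_correlation_heatmap(data):
    """Plot pairwise correlations of the numeric columns; return the figure."""
    corr = data.corr(numeric_only=True)
    fig, ax = plt.subplots(figsize=(12, 6))
    sns.heatmap(corr, cmap="BrBG", annot=True, fmt=".2f", linewidths=2, ax=ax)
    return fig
```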
The heatmap above shows a correlation between LoanAmount and ApplicantIncome. It also shows that Credit_History has a high impact on Loan_Status.
Now we will use Catplot to plot the Gender and Marital Status of the applicants.
Next we check whether the dataset has any missing values. The per-column counts are:
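A possible version of this plot with Seaborn's `catplot` is sketched below. It assumes the columns are already label-encoded, and splitting the bars by Loan_Status through `hue` is an assumption of this example rather than something the text specifies.

```python
import seaborn as sns


def plot_gender_married(data):
    """Bar catplot of Married against Gender, coloured by Loan_Status."""
    return sns.catplot(data=data, x="Gender", y="Married",
                       hue="Loan_Status", kind="bar")
```

Each bar shows the mean of the encoded Married column, that is, the share of married applicants, for one Gender and Loan_Status combination.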
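The check itself is a one-liner in pandas. The sketch below also includes a mean-imputation helper, which is an assumption about how a dataset might reach the all-zero state; the source only shows the final counts.

```python
import pandas as pd


def fill_numeric_gaps_with_mean(data):
    """Return a copy with NaNs in numeric columns replaced by the column mean."""
    filled = data.copy()
    numeric_cols = filled.select_dtypes(include="number").columns
    filled[numeric_cols] = filled[numeric_cols].fillna(filled[numeric_cols].mean())
    return filled


def missing_counts(data):
    """Return the number of missing values in each column."""
    return data.isna().sum()
```

Printing `missing_counts(data)` lists one count per column, in the format of the output that follows.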
Gender 0
Married 0
Dependents 0
Education 0
Self_Employed 0
ApplicantIncome 0
CoapplicantIncome 0
LoanAmount 0
Loan_Amount_Term 0
Credit_History 0
Property_Area 0
Loan_Status 0
Since there are no missing values, we can proceed to model training. First we separate the features from the target and split the data into training and test sets.
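Separating the target and splitting the data can be sketched with scikit-learn's `train_test_split`. The 0.4 test fraction is inferred from the shapes reported below; the `random_state` value and the function name are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_features_target(data, target="Loan_Status",
                          test_size=0.4, random_state=1):
    """Return X_train, X_test, Y_train, Y_test for the given target column."""
    X = data.drop(columns=[target])
    Y = data[target]
    return train_test_split(X, Y, test_size=test_size,
                            random_state=random_state)
```

With 598 rows, a 0.4 test fraction yields 240 test rows and 358 training rows, which agrees with the shapes printed next.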
((598, 11), (598,))
((358, 11), (240, 11), (358,), (240,))
Model Training and Evaluation
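As this is a classification problem, we will be using these models:
- RandomForestClassifier
- KNeighborsClassifier
- SVC (Support Vector Classifier)
- LogisticRegression
To measure accuracy we will use the accuracy_score function from the scikit-learn library. The scores below are computed on the training set: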
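Training and scoring the four classifiers might look like the sketch below. The hyperparameters (`n_neighbors=3`, `max_iter=1000`, `random_state=0`) and the helper names are assumptions for illustration, so the exact scores will differ from those reported here.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC


def make_models():
    """Build the four classifiers; all settings here are illustrative."""
    return [
        RandomForestClassifier(random_state=0),
        KNeighborsClassifier(n_neighbors=3),
        SVC(),
        LogisticRegression(max_iter=1000),
    ]


def fit_and_score(models, X_train, Y_train, X_eval, Y_eval):
    """Fit each model on the training split and return accuracy in percent."""
    scores = {}
    for clf in models:
        clf.fit(X_train, Y_train)
        predictions = clf.predict(X_eval)
        scores[type(clf).__name__] = 100 * accuracy_score(Y_eval, predictions)
    return scores
```

Passing the training split as the evaluation data gives training accuracy, while passing the test split gives the held-out accuracy.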
Accuracy score of RandomForestClassifier = 98.04469273743017
Accuracy score of KNeighborsClassifier = 78.49162011173185
Accuracy score of SVC = 68.71508379888269
Accuracy score of LogisticRegression = 80.44692737430168
Prediction on the test set:
Accuracy score of RandomForestClassifier = 82.5
Accuracy score of KNeighborsClassifier = 63.74999999999999
Accuracy score of SVC = 69.16666666666667
Accuracy score of LogisticRegression = 80.83333333333333
Random Forest Classifier gives the best accuracy on the test set, with a score of 82.5%, closely followed by Logistic Regression at about 80.8%. To improve results further, ensemble learning techniques such as Bagging and Boosting can also be tried.
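As a starting point for that follow-up, the sketch below scores one bagging and one boosting model from scikit-learn. The choice of `BaggingClassifier` and `GradientBoostingClassifier`, their settings, and the function name are assumptions, and no results are claimed for this dataset.

```python
from sklearn.ensemble import BaggingClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score


def ensemble_scores(X_train, Y_train, X_test, Y_test, random_state=0):
    """Fit a bagging and a boosting model; return test accuracy in percent."""
    models = {
        # Bagging over the default base learner (a decision tree).
        "Bagging": BaggingClassifier(n_estimators=50, random_state=random_state),
        "Boosting": GradientBoostingClassifier(random_state=random_state),
    }
    scores = {}
    for name, clf in models.items():
        clf.fit(X_train, Y_train)
        scores[name] = 100 * accuracy_score(Y_test, clf.predict(X_test))
    return scores
```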