Scikit-learn is an open-source machine learning library for the Python programming language. It implements many algorithms for tasks such as classification, clustering, and regression, and it provides utilities for the surrounding workflow: with Scikit-learn we can carry out data preprocessing, parameter selection, model evaluation, and more.

The objective here is to introduce how to implement Scikit-learn for machine learning using a specific dataset. Most of the models and algorithms in the Scikit-learn library follow approximately the same approach: basically, we

- call `fit` to train the intended model,
- check the accuracy of the model with a call to `score`, and
- forecast the model's outcome on new data by calling `predict`.
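For orientation, here is a minimal sketch of that pattern on a synthetic toy dataset (the data and variable names here are purely illustrative, not part of the loan example):

```
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative toy data (not the loan dataset)
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

clf = LogisticRegression().fit(X_tr, y_tr)  # 1. fit: train the model
print(clf.score(X_te, y_te))                # 2. score: check accuracy
print(clf.predict(X_te[:5]))                # 3. predict: classify new observations
```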

*Let's go*:

The dataset (in CSV format) is about past loans and includes details of 400 customers whose loans are either `paid off` or `defaulted`. In machine learning and statistics, **classification** is a supervised learning approach in which the computer program learns from some input data and then uses this learning to classify new observations.

We load the dataset using the `Pandas` library, apply 4 classification algorithms ('K Nearest Neighbour (KNN)', 'Decision Trees', 'Support Vector Machine' and 'Logistic Regression'), and then find the best algorithm using several accuracy evaluation methods.

Below is a quick **summary of the column data features**:

Category | Description |
---|---|
Loan_status | Loan is 'paid off' or in 'collection' |
Principal | Basic principal loan amount |
Terms | Origination terms: weekly (7 days), biweekly, or monthly payoff schedule |
Effective_date | Date the loan originated and took effect |
Due_date | Since it's a one-time payoff schedule, each loan has one single due date |
Age | Age of applicant |
Education | Education of applicant |
Gender | Gender of applicant |

In [40]:

```
import itertools #provides various functions that work on iterators
import numpy as np #for arrays and matrices
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib.ticker import NullFormatter
import sklearn
from sklearn import preprocessing
from sklearn import neighbors
from sklearn import metrics # to evaluate our model
from sklearn.model_selection import train_test_split
%matplotlib inline
```

In [41]:

```
loan_data = pd.read_csv('data_ML/loan_data.csv') # load the CSV file into a pandas DataFrame
print('Data downloaded and read into a dataframe!')
```

Data downloaded and read into a dataframe!

In [42]:

```
loan_data.columns # a quick look at the column names
```

Out[42]:

Index(['Loan_ID', 'loan_status', 'Principal', 'terms', 'effective_date', 'due_date', 'paid_off_time', 'past_due_days', 'age', 'education', 'Gender'], dtype='object')

A quick look at the data revealed a spelling error in the education column (`bechalor` should be `bachelor`), and that was just corrected with a 'find and replace' function in the CSV editor.

In [43]:

```
loan_data=loan_data[['Loan_ID', 'Gender', 'Principal', 'terms', 'effective_date',
'due_date', 'paid_off_time', 'age', 'education', 'loan_status']]
loan_data.head()
```

Out[43]:

| | Loan_ID | Gender | Principal | terms | effective_date | due_date | paid_off_time | age | education | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xqd20166231 | male | 1000 | 30 | 9/8/2016 | 10/7/2016 | 9/14/2016 19:31 | 45 | High School or Below | PAIDOFF |
| 1 | xqd20168902 | female | 1000 | 30 | 9/8/2016 | 10/7/2016 | 10/7/2016 9:00 | 50 | Bachelor | PAIDOFF |
| 2 | xqd20160003 | female | 1000 | 30 | 9/8/2016 | 10/7/2016 | 9/25/2016 16:58 | 33 | Bachelor | PAIDOFF |
| 3 | xqd20160004 | male | 1000 | 15 | 9/8/2016 | 9/22/2016 | 9/22/2016 20:00 | 27 | college | PAIDOFF |
| 4 | xqd20160005 | female | 1000 | 30 | 9/9/2016 | 10/8/2016 | 9/23/2016 21:36 | 28 | college | PAIDOFF |

In [44]:

```
loan_data.shape
```

Out[44]:

(400, 10)

**Convert to `datetime` object**: in many datasets, dates are represented as strings. Python has a data type specifically designed for dates and times called `datetime`. We do this conversion for the `effective_date` and `due_date` columns.
In [45]:

```
loan_data['due_date'] = pd.to_datetime(loan_data['due_date'])
loan_data['effective_date'] = pd.to_datetime(loan_data['effective_date'])
loan_data.head(1) # View just a row to see the change in date format
```

Out[45]:

| | Loan_ID | Gender | Principal | terms | effective_date | due_date | paid_off_time | age | education | loan_status |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xqd20166231 | male | 1000 | 30 | 2016-09-08 | 2016-10-07 | 9/14/2016 19:31 | 45 | High School or Below | PAIDOFF |

Check how many of each of the classes, `PAIDOFF` and `COLLECTION`, we have:

In [46]:

```
loan_data['loan_status'].value_counts()
```

Out[46]:

PAIDOFF       300
COLLECTION    100
Name: loan_status, dtype: int64

**Note that**
300 persons have paid off the loan on time while 100 have gone into collection status. To understand the dataset better, we can visualise some of the columns with seaborn, grouping the data by **Gender** and focusing on either Principal or Age in 2 displays.

In [47]:

```
bins = np.linspace(loan_data.Principal.min(), loan_data.Principal.max(), 7)
g = sns.FacetGrid(data = loan_data, col="Gender", hue="loan_status", palette="Set2", col_wrap=2, height=5)
# bins: 7 evenly spaced edges define the shape of the histogram
# hue determines which DataFrame column is used for colour encoding
g.map(plt.hist, 'Principal', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
```

In [48]:

```
bins=np.linspace(loan_data.age.min(), loan_data.age.max(), 7)
g = sns.FacetGrid(loan_data, col="Gender", hue="loan_status", palette="Set2", col_wrap=2, height=5)
g.map(plt.hist, 'age', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
```

**Feature selection** is the process of selecting a subset of relevant attributes of the data for use in machine learning model construction. Visualization can help a great deal here. We take a preview of the days of the week on which loans are given:

In [49]:

```
loan_data['dayofweek'] = loan_data['effective_date'].dt.dayofweek
bins=np.linspace(loan_data.dayofweek.min(), loan_data.dayofweek.max(), 7)
g = sns.FacetGrid(loan_data, col="Gender", hue="loan_status", palette="Set2", col_wrap=2, height=5)
g.map(plt.hist, 'dayofweek', bins=bins, ec="k")
g.axes[-1].legend()
plt.show()
```

**Feature binarization** means thresholding a feature to obtain boolean values, in the same way that we can categorize '**loan status**' as '1' for the `COLLECTION` category and '0' for the `PAIDOFF` category. Here we binarize the day of the week: loans whose effective date falls after day 3 (towards the weekend) get a new `weekend` flag of '1', the rest '0'.

In [50]:

```
loan_data['weekend']= loan_data['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
loan_data.head()
```

Out[50]:

| | Loan_ID | Gender | Principal | terms | effective_date | due_date | paid_off_time | age | education | loan_status | dayofweek | weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xqd20166231 | male | 1000 | 30 | 2016-09-08 | 2016-10-07 | 9/14/2016 19:31 | 45 | High School or Below | PAIDOFF | 3 | 0 |
| 1 | xqd20168902 | female | 1000 | 30 | 2016-09-08 | 2016-10-07 | 10/7/2016 9:00 | 50 | Bachelor | PAIDOFF | 3 | 0 |
| 2 | xqd20160003 | female | 1000 | 30 | 2016-09-08 | 2016-10-07 | 9/25/2016 16:58 | 33 | Bachelor | PAIDOFF | 3 | 0 |
| 3 | xqd20160004 | male | 1000 | 15 | 2016-09-08 | 2016-09-22 | 9/22/2016 20:00 | 27 | college | PAIDOFF | 3 | 0 |
| 4 | xqd20160005 | female | 1000 | 30 | 2016-09-09 | 2016-10-08 | 9/23/2016 21:36 | 28 | college | PAIDOFF | 4 | 1 |

**Convert categorical features to numerical values**: more data analysis. Let's group the loan status feature by gender and by education, first gender and then education.

`Gender`

In [51]:

```
loan_data.groupby(['Gender'])['loan_status'].value_counts(normalize=True)
```

Out[51]:

Gender  loan_status
female  PAIDOFF       0.841270
        COLLECTION    0.158730
male    PAIDOFF       0.732938
        COLLECTION    0.267062
Name: loan_status, dtype: float64

**Challenge**: roughly 27% of males versus 16% of females are in the debt `COLLECTION` situation! We can convert **male** to 0 and **female** to 1 to obtain a binary category.

In [52]:

```
loan_data['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
loan_data.head()
```

Out[52]:

| | Loan_ID | Gender | Principal | terms | effective_date | due_date | paid_off_time | age | education | loan_status | dayofweek | weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | xqd20166231 | 0 | 1000 | 30 | 2016-09-08 | 2016-10-07 | 9/14/2016 19:31 | 45 | High School or Below | PAIDOFF | 3 | 0 |
| 1 | xqd20168902 | 1 | 1000 | 30 | 2016-09-08 | 2016-10-07 | 10/7/2016 9:00 | 50 | Bachelor | PAIDOFF | 3 | 0 |
| 2 | xqd20160003 | 1 | 1000 | 30 | 2016-09-08 | 2016-10-07 | 9/25/2016 16:58 | 33 | Bachelor | PAIDOFF | 3 | 0 |
| 3 | xqd20160004 | 0 | 1000 | 15 | 2016-09-08 | 2016-09-22 | 9/22/2016 20:00 | 27 | college | PAIDOFF | 3 | 0 |
| 4 | xqd20160005 | 1 | 1000 | 30 | 2016-09-09 | 2016-10-08 | 9/23/2016 21:36 | 28 | college | PAIDOFF | 4 | 1 |

`Education`

In [53]:

```
loan_data.groupby(['education'])['loan_status'].value_counts(normalize=True)
```

Out[53]:

education             loan_status
Bachelor              PAIDOFF       0.788462
                      COLLECTION    0.211538
High School or Below  PAIDOFF       0.715116
                      COLLECTION    0.284884
Master or Above       PAIDOFF       0.750000
                      COLLECTION    0.250000
college               PAIDOFF       0.773256
                      COLLECTION    0.226744
Name: loan_status, dtype: float64

Machine learning (ML) is sometimes said to have another name: 'zeroes and ones'. Does ML really think only in `0` and `1`? **Not really!**
But many algorithms do need numerical inputs, and we can easily get there with the concept of 'one hot encoding'. We use the technique to convert categorical variables to binary variables and append them to the feature DataFrame; it is a process by which `categorical variables` are converted into a form that can be provided to ML algorithms to do a better job in prediction. We select our **features** of interest before applying the encoding.

In [54]:

```
loan_data[['Principal','terms','age','Gender', 'weekend','education']].head(10) #select just desired features
```

Out[54]:

| | Principal | terms | age | Gender | weekend | education |
|---|---|---|---|---|---|---|
| 0 | 1000 | 30 | 45 | 0 | 0 | High School or Below |
| 1 | 1000 | 30 | 50 | 1 | 0 | Bachelor |
| 2 | 1000 | 30 | 33 | 1 | 0 | Bachelor |
| 3 | 1000 | 15 | 27 | 0 | 0 | college |
| 4 | 1000 | 30 | 28 | 1 | 1 | college |
| 5 | 300 | 7 | 35 | 0 | 1 | Master or Above |
| 6 | 1000 | 30 | 29 | 0 | 1 | college |
| 7 | 1000 | 30 | 36 | 0 | 1 | college |
| 8 | 1000 | 30 | 28 | 0 | 1 | college |
| 9 | 800 | 15 | 26 | 0 | 1 | college |

The **Master or Above** category has far fewer members, so we can drop it. We can one-hot encode the `education` column and concatenate the resulting `college`, `Bachelor`, and `High School or Below` dummy columns to the feature set:

In [55]:

```
loan_data['education'].value_counts().to_frame()
```

Out[55]:

| | education |
|---|---|
| High School or Below | 172 |
| college | 172 |
| Bachelor | 52 |
| Master or Above | 4 |

In [56]:

```
Feature = loan_data[['Principal','terms','age','Gender','weekend']]
Feature = pd.concat([Feature,pd.get_dummies(loan_data['education'])], axis=1)
Feature.drop(['Master or Above'], axis = 1,inplace=True)
Feature.head()
```

Out[56]:

| | Principal | terms | age | Gender | weekend | Bachelor | High School or Below | college |
|---|---|---|---|---|---|---|---|---|
| 0 | 1000 | 30 | 45 | 0 | 0 | 0 | 1 | 0 |
| 1 | 1000 | 30 | 50 | 1 | 0 | 1 | 0 | 0 |
| 2 | 1000 | 30 | 33 | 1 | 0 | 1 | 0 | 0 |
| 3 | 1000 | 15 | 27 | 0 | 0 | 0 | 0 | 1 |
| 4 | 1000 | 30 | 28 | 1 | 1 | 0 | 0 | 1 |

In [57]:

```
#Feature.to_csv('Feature_loandataML.csv', sep=',', header=True, index=True) # export a csv file for reference
```

Before heading down the machine learning road, let's define our feature set as `X` and our labels as `y`. For a quick look, we can preview 8 rows of features and their labels (rows 298 to 305), which contain a mix of the loan statuses `PAIDOFF` and `COLLECTION`.

In [58]:

```
X = Feature
X[298:306] # display rows 298 to 305 for preview
```

Out[58]:

| | Principal | terms | age | Gender | weekend | Bachelor | High School or Below | college |
|---|---|---|---|---|---|---|---|---|
| 298 | 1000 | 30 | 40 | 0 | 0 | 0 | 1 | 0 |
| 299 | 1000 | 30 | 28 | 0 | 0 | 0 | 0 | 1 |
| 300 | 1000 | 15 | 29 | 0 | 1 | 0 | 0 | 1 |
| 301 | 1000 | 30 | 37 | 0 | 1 | 0 | 1 | 0 |
| 302 | 1000 | 30 | 33 | 0 | 1 | 0 | 1 | 0 |
| 303 | 800 | 15 | 27 | 0 | 1 | 0 | 0 | 1 |
| 304 | 800 | 15 | 24 | 0 | 1 | 1 | 0 | 0 |
| 305 | 1000 | 15 | 31 | 1 | 1 | 0 | 1 | 0 |

In [59]:

```
y = loan_data['loan_status'].values
y[298:306] # display loan status for rows 298 to 305 for preview
```

Out[59]:

array(['PAIDOFF', 'PAIDOFF', 'COLLECTION', 'COLLECTION', 'COLLECTION', 'COLLECTION', 'COLLECTION', 'COLLECTION'], dtype=object)

Standardization of a dataset is a common requirement in machine learning. Many algorithms work better when features are on a relatively similar scale and close to normally distributed. `MinMaxScaler`, `RobustScaler`, `StandardScaler`, and `Normalizer` are different methods used in **Scikit-Learn** to pre-process data. The choice of method depends on the model and/or the feature values.

Scikit-Learn's `StandardScaler` gives the data zero mean and unit variance, i.e. a distribution with a unit standard deviation and unit variance (remember that variance = standard deviation squared). Strictly, the scaler should be fitted after the `train test split`, on the training portion only, so that no test-set statistics leak into training.
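As a sketch of that stricter order (fitting the scaler on the training portion only, then applying the same transform to both splits), assuming the same split parameters used later in this notebook; the cells below instead scale the full `X` before splitting, for simplicity:

```
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Sketch: fit the scaler on the training portion only (avoids test-set leakage)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=4)
scaler = StandardScaler().fit(X_tr)   # learn mean and std from training data only
X_tr_scaled = scaler.transform(X_tr)  # standardize the training features
X_te_scaled = scaler.transform(X_te)  # apply the same transform to the test features
```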

In [60]:

```
X = preprocessing.StandardScaler().fit(X).transform(X)
X[0:5] # let's view 5 rows
```

Out[60]:

array([[ 0.50130175,  0.92089421,  2.31334964, -0.43236977, -1.21838912, -0.38655567,  1.15133896, -0.86855395],
       [ 0.50130175,  0.92089421,  3.14310202,  2.31283513, -1.21838912,  2.5869495 , -0.86855395, -0.86855395],
       [ 0.50130175,  0.92089421,  0.32194392,  2.31283513, -1.21838912,  2.5869495 , -0.86855395, -0.86855395],
       [ 0.50130175, -0.9332552 , -0.67375893, -0.43236977, -1.21838912, -0.38655567, -0.86855395,  1.15133896],
       [ 0.50130175,  0.92089421, -0.50780846,  2.31283513,  0.82075585, -0.38655567, -0.86855395,  1.15133896]])

**K Nearest Neighbour (KNN)** is a supervised classification algorithm we can use to assign a class to a new data point: KNN classifies cases based on their similarity to other cases (cases that are near each other are 'neighbours'). Interestingly, one popular use case for the KNN algorithm is credit risk analysis.

**K** is the number of nearest neighbours considered; it is generally an odd number when there are 2 classes, as we have here. We use the best value of `K` (see the sketch after the train/test split below). `KNN` works best when the data:

- has little noise
- is labelled (`PAIDOFF` and `COLLECTION` in our case)
- contains only relevant features
- is distinguishable into subgroups

In [61]:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)
print ('Train set:', X_train.shape, y_train.shape)
print ('Test set:', X_test.shape, y_test.shape)
#train with 80% of data, test with 20%
```

Train set: (320, 8) (320,)
Test set: (80, 8) (80,)
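Earlier we said we use the best value of `K`. One common way to find it, sketched here rather than taken from the original notebook, is to loop over candidate values and keep the most accurate:

```
from sklearn.neighbors import KNeighborsClassifier

# Sketch: try a range of K values and keep the most accurate on the test set
best_k, best_acc = None, 0.0
for k in range(1, 16):
    acc = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train).score(X_test, y_test)
    if acc > best_acc:
        best_k, best_acc = k, acc
print('best K:', best_k, 'accuracy:', best_acc)
```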

Train the `KNN` model:

In [62]:

```
from sklearn.neighbors import KNeighborsClassifier
k = 7  # number of neighbours
kNN_model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
kNN_model
```

Out[62]:

KNeighborsClassifier(n_neighbors=7)

Make predictions with the `KNN` model:

In [63]:

```
# Does the model work?
Lknn_status = kNN_model.predict(X_test)
Lknn_status[1:10]
```

Out[63]:

array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'COLLECTION', 'PAIDOFF', 'PAIDOFF', 'COLLECTION'], dtype=object)

The model has learnt a bit. We can use the score on the test data for evaluation

In [64]:

```
kNN_model.score(X_test, y_test)
```

Out[64]:

0.775

In [65]:

```
y_expect = y_test
y_pred = kNN_model.predict(X_test)
print(metrics.classification_report(y_expect, y_pred)) #*** Score the model ***
```

**Recall** is a measure of a model's completeness, while **precision** measures its exactness. What is the resulting table showing? For instance, of all the points that were labeled `PAIDOFF`:

- 89% of those results returned were truly relevant, and
- over the entire data set, 80% of the results were truly relevant.

High `precision` with low `recall` generally means that fewer results are returned, but many of the labels that are predicted are correct.
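To see where such numbers come from, here is a sketch computing precision and recall for the `PAIDOFF` class directly with scikit-learn's metric functions:

```
from sklearn.metrics import precision_score, recall_score

# Sketch: precision and recall for the PAIDOFF class, computed explicitly
print('precision:', precision_score(y_expect, y_pred, pos_label='PAIDOFF'))
print('recall:   ', recall_score(y_expect, y_pred, pos_label='PAIDOFF'))
```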

In [66]:

```
# kNN_model.score? Run this for help
```

Decision Trees are a non-parametric supervised learning method used for classification and regression (regression trees are used when the target variable is numerical or continuous). The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. **Note** that scikit-learn decision trees do not natively handle categorical variables.

For `classification`, the **advantages** of Decision Trees include that the method

- is simple to understand and visualize
- requires little effort for data preparation
- can handle both numerical and categorical data
- is not affected by non-linear relationships between parameters

The **disadvantages** of Decision Trees include

- a tendency to overfit (this occurs when the algorithm captures noise in the data)
- high variance (the model can become unstable due to small variations in the data)
- low bias (a highly complex tree fits the training data so closely that it generalizes poorly to new data)
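To illustrate the overfitting tendency mentioned above, a quick sketch (using the same train/test split) compares train versus test accuracy as `max_depth` grows:

```
from sklearn.tree import DecisionTreeClassifier

# Sketch: deeper trees fit the training set better but can overfit;
# max_depth=None lets the tree grow until the leaves are pure.
for depth in (2, 6, None):
    dt = DecisionTreeClassifier(criterion="entropy", max_depth=depth,
                                random_state=0).fit(X_train, y_train)
    print(depth, dt.score(X_train, y_train), dt.score(X_test, y_test))
```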

In [67]:

```
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
DT_model = DecisionTreeClassifier(criterion="entropy", max_depth = 6)
DT_model.fit(X_train,y_train)
DT_model
```

Out[67]:

DecisionTreeClassifier(criterion='entropy', max_depth=6)

In [68]:

```
Ldt_status = DT_model.predict(X_test)
Ldt_status[0:10] #view 1st 10 results
```

Out[68]:

array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)

In [69]:

```
DT_model.score(X_test, y_test)
```

Out[69]:

0.725

**Support Vector Machine (SVM)** is a supervised algorithm that classifies cases by finding a separator. It maps data into a high-dimensional feature space and then finds a separator, drawn as a hyperplane (in 3D, the separator can be a plane). The pros of **SVM** include accuracy in high-dimensional spaces and memory efficiency. However, the method is prone to overfitting and will likely not be efficient on a very large dataset (more than about 1000 rows).

Train the `SVM` model:

In [70]:

```
from sklearn import svm
SVM_model = svm.SVC()
SVM_model.fit(X_train, y_train)
```

Out[70]:

SVC()

In [71]:

```
lSVM = SVM_model.predict(X_test)
```

In [72]:

```
SVM_model.score(X_test, y_test)
```

Out[72]:

0.75

Logistic regression is a widely used binary classifier (i.e. the target vector can only take two values). A linear model (e.g. $z = \beta_0 + \beta_1x$) is passed through a logistic (also called sigmoid) function, $$f(z)={\frac {1}{1+e^{-z}}}$$ such that $$P(y_i = 1|X)={\frac {1}{1+e^{-(\beta_0 + \beta_1x)}}}$$

where $P(y_i = 1|X)$ is the probability of the $i^{th}$ observation's target value $y_i$ being class 1, $X$ is the training data, $\beta_0$ and $\beta_1$ are the parameters to be learned, and $e$ is Euler's number. In other words, instead of predicting exactly 0 or 1, logistic regression generates a probability: a value between 0 and 1, exclusive. It predicts whether something is true or false rather than predicting a continuous quantity, and despite producing a probability it is used for classification, e.g. whether a mouse is obese or not, predicted from multiple features (weight, age, etc.). Logistic regression can work with continuous or discrete input data.

It has been used productively in 'employee attrition modeling' and 'hazardous event prediction'. Note that logistic regression has fewer assumptions than linear regression; these include the data being free from missing values and the predicted variable being binary (taking only two values) or ordinal (a categorical variable with ordered values).

In [73]:

```
from sklearn.linear_model import LogisticRegression
LR_model = LogisticRegression(C=0.01).fit(X_train,y_train)
LR_model
```

Out[73]:

LogisticRegression(C=0.01)
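As a sanity check on the formula above, the model's `predict_proba` output should match the sigmoid applied to the linear score recovered from the learned `coef_` and `intercept_`; a sketch:

```
import numpy as np

# Sketch: recover the linear score z from the learned parameters and
# confirm that the sigmoid of z matches predict_proba for the positive class
z = X_test @ LR_model.coef_.ravel() + LR_model.intercept_[0]
p = 1.0 / (1.0 + np.exp(-z))
print(np.allclose(p, LR_model.predict_proba(X_test)[:, 1]))  # classes_[1] column
```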

In [74]:

```
lLR = LR_model.predict(X_test)
lLR[0:10] #view 1st 10 results
```

Out[74]:

In [75]:

```
LR_model.score(X_test, y_test)
```

Out[75]:

0.7625

First we load the **Test** set for evaluation:

In [76]:

```
test_df = pd.read_csv('data_ML/loan_test.csv')
test_df.head(3)
```

Out[76]:

| | Unnamed: 0.1 | Unnamed: 0 | loan_status | Principal | terms | effective_date | due_date | age | education | Gender |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | PAIDOFF | 1000 | 30 | 9/8/2016 | 10/7/2016 | 50 | Bechalor | female |
| 1 | 5 | 5 | PAIDOFF | 300 | 7 | 9/9/2016 | 9/15/2016 | 35 | Master or Above | male |
| 2 | 21 | 21 | PAIDOFF | 1000 | 30 | 9/10/2016 | 10/9/2016 | 43 | High School or Below | female |

In [77]:

```
from sklearn.metrics import jaccard_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
```

In [78]:

```
## Preprocessing test data
test_df['due_date'] = pd.to_datetime(test_df['due_date'])
test_df['effective_date'] = pd.to_datetime(test_df['effective_date'])
test_df['dayofweek'] = test_df['effective_date'].dt.dayofweek
test_df['weekend'] = test_df['dayofweek'].apply(lambda x: 1 if (x>3) else 0)
test_df['Gender'].replace(to_replace=['male','female'], value=[0,1],inplace=True)
test_Feature = test_df[['Principal','terms','age','Gender','weekend']]
test_Feature = pd.concat([test_Feature,pd.get_dummies(test_df['education'])], axis=1)
test_Feature.drop(['Master or Above'], axis = 1,inplace=True)
test_X = preprocessing.StandardScaler().fit(test_Feature).transform(test_Feature)
test_X[0:5]
```

Out[78]:

array([[ 0.49362588,  0.92844966,  3.05981865,  1.97714211, -1.30384048,  2.39791576, -0.79772404, -0.86135677],
       [-3.56269116, -1.70427745,  0.53336288, -0.50578054,  0.76696499, -0.41702883, -0.79772404, -0.86135677],
       [ 0.49362588,  0.92844966,  1.88080596,  1.97714211,  0.76696499, -0.41702883,  1.25356634, -0.86135677],
       [ 0.49362588,  0.92844966, -0.98251057, -0.50578054,  0.76696499, -0.41702883, -0.79772404,  1.16095912],
       [-0.66532184, -0.78854628, -0.47721942, -0.50578054,  0.76696499,  2.39791576, -0.79772404, -0.86135677]])

In [79]:

```
test_y = test_df['loan_status'].values
test_y[0:5]
```

Out[79]:

array(['PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF', 'PAIDOFF'], dtype=object)

In [80]:

```
knn_yhat = kNN_model.predict(test_X)
print("KNN Jaccard index: %.2f" % jaccard_score(test_y, knn_yhat, pos_label = 'PAIDOFF'))
print("KNN F1-score: %.2f" % f1_score(test_y, knn_yhat, average='weighted') )
```

KNN Jaccard index: 0.81
KNN F1-score: 0.82

In [81]:

```
DT_yhat = DT_model.predict(test_X)
print("DT Jaccard index: %.2f" % jaccard_score(test_y, DT_yhat, pos_label = 'PAIDOFF'))
print("DT F1-score: %.2f" % f1_score(test_y, DT_yhat, average='weighted') )
```

DT Jaccard index: 0.80
DT F1-score: 0.79

In [82]:

```
SVM_yhat = SVM_model.predict(test_X)
print("SVM Jaccard index: %.2f" % jaccard_score(test_y, SVM_yhat, pos_label = 'PAIDOFF'))
print("SVM F1-score: %.2f" % f1_score(test_y, SVM_yhat, average='weighted') )
```

SVM Jaccard index: 0.78
SVM F1-score: 0.76

In [83]:

```
LR_yhat = LR_model.predict(test_X)
LR_yhat_prob = LR_model.predict_proba(test_X)
print("LR Jaccard index: %.2f" % jaccard_score(test_y, LR_yhat, pos_label = 'PAIDOFF'))
print("LR F1-score: %.2f" % f1_score(test_y, LR_yhat, average='weighted') )
print("LR LogLoss: %.2f" % log_loss(test_y, LR_yhat_prob))
```

LR Jaccard index: 0.74
LR F1-score: 0.63
LR LogLoss: 0.50

Below is a simple report of the accuracy of the built models using the **Jaccard score**, the `F1 score`, and (for logistic regression) the log loss. `K Nearest Neighbour` performs best on this test set:

Method | Jaccard | F1-score | LogLoss |
---|---|---|---|
KNN | 0.81 | 0.82 | NA |
Decision Tree | 0.80 | 0.79 | NA |
SVM | 0.78 | 0.76 | NA |
LogisticRegression | 0.74 | 0.63 | 0.50 |
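For reference, here is a sketch of how such a report could be assembled programmatically from the fitted models (the loop and variable names are illustrative, not from the original notebook):

```
import pandas as pd
from sklearn.metrics import jaccard_score, f1_score

# Sketch: evaluate each fitted model on the held-out test file and tabulate
models = {'KNN': kNN_model, 'Decision Tree': DT_model,
          'SVM': SVM_model, 'LogisticRegression': LR_model}
rows = []
for name, model in models.items():
    yhat = model.predict(test_X)
    rows.append({'Method': name,
                 'Jaccard': jaccard_score(test_y, yhat, pos_label='PAIDOFF'),
                 'F1-score': f1_score(test_y, yhat, average='weighted')})
print(pd.DataFrame(rows).round(2))
```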