Decision Trees, Random Forests & Support Vector Machines

Decision Trees, Random Forests & Support Vector Machines#

LendingClub is a US peer-to-peer lending company that operates an online lending platform. It was founded in 2006 and is headquartered in San Francisco, California. The platform facilitates the borrowing and lending of money directly between individuals, bypassing traditional banks. Investors can construct their portfolio of loans, according to their risk appetite. Naturally, if the borrower fails to repay their loan, investors lose money. Therefore, investors face the problem of predicting the risk of a borrower being unable to repay a loan.

The firm makes loan-level data freely available online so that investors can make informed decisions about whether to invest. Let’s try and use ML models learned in class to predict whether a loan will be fully-paid or not, based on borrower characteristics. The data is available on the firms’ webpage but here we will work with a dataset available on Kaggle.

Note that the original dataset has millions of loans. To keep things simple (and avoid computational bottlenecks) we will work with a random sample of this larger dataframe. Download two excel files from Moodle ‘loan_data.xlsx’ and ‘descriptives.xlsx’ and load them to your working environment.

Load Data#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import RandomizedSearchCV

# Load the CSV file into a Pandas DataFrame
# Load data from the first sheet
df = pd.read_excel('loan_data.xlsx')
desc = pd.read_excel('descriptives.xlsx')

# Display the DataFrame
df.head()

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
File c:\Users\Miguel\anaconda3\envs\econ5129_labs\Lib\site-packages\pandas\compat\_optional.py:142, in import_optional_dependency(name, extra, errors, min_version)
    141 try:
--> 142     module = importlib.import_module(name)
    143 except ImportError:

File c:\Users\Miguel\anaconda3\envs\econ5129_labs\Lib\importlib\__init__.py:126, in import_module(name, package)
    125         level += 1
--> 126 return _bootstrap._gcd_import(name[level:], package, level)

File <frozen importlib._bootstrap>:1204, in _gcd_import(name, package, level)

File <frozen importlib._bootstrap>:1176, in _find_and_load(name, import_)

File <frozen importlib._bootstrap>:1140, in _find_and_load_unlocked(name, import_)

ModuleNotFoundError: No module named 'openpyxl'

During handling of the above exception, another exception occurred:

ImportError                               Traceback (most recent call last)
Cell In[1], line 13
      9 from sklearn.model_selection import RandomizedSearchCV
     11 # Load the CSV file into a Pandas DataFrame
     12 # Load data from the first sheet
---> 13 df = pd.read_excel('loan_data.xlsx')
     14 desc = pd.read_excel('descriptives.xlsx')
     16 # Display the DataFrame

File c:\Users\Miguel\anaconda3\envs\econ5129_labs\Lib\site-packages\pandas\io\excel\_base.py:478, in read_excel(io, sheet_name, header, names, index_col, usecols, dtype, engine, converters, true_values, false_values, skiprows, nrows, na_values, keep_default_na, na_filter, verbose, parse_dates, date_parser, date_format, thousands, decimal, comment, skipfooter, storage_options, dtype_backend)
    476 if not isinstance(io, ExcelFile):
    477     should_close = True
--> 478     io = ExcelFile(io, storage_options=storage_options, engine=engine)
    479 elif engine and engine != io.engine:
    480     raise ValueError(
    481         "Engine should not be specified when passing "
    482         "an ExcelFile - ExcelFile already has the engine set"
    483     )

File c:\Users\Miguel\anaconda3\envs\econ5129_labs\Lib\site-packages\pandas\io\excel\_base.py:1513, in ExcelFile.__init__(self, path_or_buffer, engine, storage_options)
   1510 self.engine = engine
   1511 self.storage_options = storage_options
-> 1513 self._reader = self._engines[engine](self._io, storage_options=storage_options)

File c:\Users\Miguel\anaconda3\envs\econ5129_labs\Lib\site-packages\pandas\io\excel\_openpyxl.py:548, in OpenpyxlReader.__init__(self, filepath_or_buffer, storage_options)
    533 @doc(storage_options=_shared_docs["storage_options"])
    534 def __init__(
    535     self,
    536     filepath_or_buffer: FilePath | ReadBuffer[bytes],
    537     storage_options: StorageOptions = None,
    538 ) -> None:
    539     """
    540     Reader using openpyxl engine.
    541 
   (...)
    546     {storage_options}
    547     """
--> 548     import_optional_dependency("openpyxl")
    549     super().__init__(filepath_or_buffer, storage_options=storage_options)

File c:\Users\Miguel\anaconda3\envs\econ5129_labs\Lib\site-packages\pandas\compat\_optional.py:145, in import_optional_dependency(name, extra, errors, min_version)
    143 except ImportError:
    144     if errors == "raise":
--> 145         raise ImportError(msg)
    146     return None
    148 # Handle submodules: if we have submodule, grab parent module from sys.modules

ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.

# print descriptives of the variables
desc

	credit.policy	if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
0	purpose	The purpose of the loan (takes values "credit_...
1	int.rate	The purpose of the loan (takes values "credit_...
2	installment	The purpose of the loan (takes values "credit_...
3	log.annual.inc	The natural log of the self-reported annual in...
4	dti	The debt-to-income ratio of the borrower (amou...
5	fico	The FICO credit score of the borrower.
6	days.with.cr.line	The number of days the borrower has had a cred...
7	revol.bal	The borrower's revolving balance (amount unpai...
8	revol.util	The borrower's revolving balance (amount unpai...
9	inq.last.6mths	The borrower's number of inquiries by creditor...
10	delinq.2yrs	The number of times the borrower had been 30+ ...
11	pub.rec	The borrower's number of derogatory public rec...
12	not.fully.paid	not fully paid.

Data Descriptives#

To have a feel for the data it’s always a good idea to start with simple descriptive statistics. Because our focus is on classification (ie. predict delinquencies), let’s also plot some descriptives for loans that have and have not been fully paid.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   credit.policy      9578 non-null   int64  
 1   purpose            9578 non-null   object 
 2   int.rate           9578 non-null   float64
 3   installment        9578 non-null   float64
 4   log.annual.inc     9578 non-null   float64
 5   dti                9578 non-null   float64
 6   fico               9578 non-null   int64  
 7   days.with.cr.line  9578 non-null   float64
 8   revol.bal          9578 non-null   int64  
 9   revol.util         9578 non-null   float64
 10  inq.last.6mths     9578 non-null   int64  
 11  delinq.2yrs        9578 non-null   int64  
 12  pub.rec            9578 non-null   int64  
 13  not.fully.paid     9578 non-null   int64  
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB

df.describe().round(2)

	credit.policy	int.rate	installment	log.annual.inc	dti	fico	days.with.cr.line	revol.bal	revol.util	inq.last.6mths	delinq.2yrs	pub.rec	not.fully.paid
count	9578.0	9578.00	9578.00	9578.00	9578.00	9578.00	9578.00	9578.00	9578.00	9578.00	9578.00	9578.00	9578.00
mean	0.8	0.12	319.09	10.93	12.61	710.85	4560.77	16913.96	46.80	1.58	0.16	0.06	0.16
std	0.4	0.03	207.07	0.61	6.88	37.97	2496.93	33756.19	29.01	2.20	0.55	0.26	0.37
min	0.0	0.06	15.67	7.55	0.00	612.00	178.96	0.00	0.00	0.00	0.00	0.00	0.00
25%	1.0	0.10	163.77	10.56	7.21	682.00	2820.00	3187.00	22.60	0.00	0.00	0.00	0.00
50%	1.0	0.12	268.95	10.93	12.66	707.00	4139.96	8596.00	46.30	1.00	0.00	0.00	0.00
75%	1.0	0.14	432.76	11.29	17.95	737.00	5730.00	18249.50	70.90	2.00	0.00	0.00	0.00
max	1.0	0.22	940.14	14.53	29.96	827.00	17639.96	1207359.00	119.00	33.00	13.00	5.00	1.00

Visualization#

FICO scores ought to be an important variable for loan delinquency prediction as they measure borrower quality. To understand the firms’ credit policy with respect to FICO scores and how FICO scores differ across borrowers, let’s plot a histogram of borrowers that comply with Lending Club’s credit policy and have fully repaid their loss.

# Set up subplots with 1 row and 2 columns
plt.figure(figsize=(15, 5))

# Common color for both histograms
color = 'blue'

# Plot the first histogram
plt.subplot(1, 2, 1)
df[df['credit.policy']==0]['fico'].hist(bins=30, alpha=0.5, label='0', color='red')
df[df['credit.policy']==1]['fico'].hist(bins=30, alpha=0.5, label='1', color=color)
plt.legend()
plt.xlabel('FICO')
plt.title('Distribution of FICO scores by credit policy')

# Plot the first histogram
plt.subplot(1, 2, 2)
df[df['not.fully.paid']==0]['fico'].hist(bins=30, alpha=0.5, label='0', color='red')
df[df['not.fully.paid']==1]['fico'].hist(bins=30, alpha=0.5, label='1', color=color)
plt.legend()
plt.xlabel('FICO')
plt.title('Distribution of FICO scores by Loan Status')

# Adjust layout for better spacing
plt.tight_layout()

# Show the plot
plt.show()

../_images/2b3d3735c0e81ea5b8cfa76fb8480a9548c2a348b7d18b5405f7ad19e4c0b1da.png

Now, let’s also have a look at the distribution of FICO scores, interest rates charged and how they relate to each other.

sns.jointplot(x='fico',y='int.rate',data=df)

<seaborn.axisgrid.JointGrid at 0x288650b0d50>

../_images/eb31ea636558d1aeefb2a1d2fbcff842f3c40a3088f44854f60eed606bb4e429.png

sns.lmplot(x='fico',y='int.rate',data=df,col='not.fully.paid',hue='credit.policy',palette='coolwarm')

<seaborn.axisgrid.FacetGrid at 0x288653b3210>

../_images/8c443ef937fc2ca36eac2ffb7a6d658d5a355376cf199c0c2bbe2a3e175fe0da.png

Given a FICO score, there seems to be greater dispersion, in terms of interest rates charges, of borrowers that haven’t paid their loan back fully and therefore may default. This is a reminder that other variables should also be taken into account when considering delinquency likelihoods. Before we start processing the data and building our ML model, let’s look at the correlation amongst the variables.

# Select numerical columns from your DataFrame
numerical_columns = df.select_dtypes(include=['number'])
# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(numerical_columns.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

../_images/699d8dcf34741fadecd88a4996303c6b2f4c85d9b048bfa53f8eafbca4f03560.png

Now convert some categorical variables into dummy variables so that they can be added to the feature set of our ML model.

df = pd.get_dummies(df,columns=['purpose'],drop_first=True)
df

	credit.policy	int.rate	installment	log.annual.inc	dti	fico	days.with.cr.line	revol.bal	revol.util	inq.last.6mths	delinq.2yrs	pub.rec	not.fully.paid	purpose_credit_card	purpose_debt_consolidation	purpose_educational	purpose_home_improvement	purpose_major_purchase	purpose_small_business
0	1	0.1189	829.10	11.350407	19.48	737	5639.958333	28854	52.1	0	0	0	0	False	True	False	False	False	False
1	1	0.1071	228.22	11.082143	14.29	707	2760.000000	33623	76.7	0	0	0	0	True	False	False	False	False	False
2	1	0.1357	366.86	10.373491	11.63	682	4710.000000	3511	25.6	1	0	0	0	False	True	False	False	False	False
3	1	0.1008	162.34	11.350407	8.10	712	2699.958333	33667	73.2	1	0	0	0	False	True	False	False	False	False
4	1	0.1426	102.92	11.299732	14.97	667	4066.000000	4740	39.5	0	1	0	0	True	False	False	False	False	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
9573	0	0.1461	344.76	12.180755	10.39	672	10474.000000	215372	82.1	2	0	0	1	False	False	False	False	False	False
9574	0	0.1253	257.70	11.141862	0.21	722	4380.000000	184	1.1	5	0	0	1	False	False	False	False	False	False
9575	0	0.1071	97.81	10.596635	13.09	687	3450.041667	10036	82.9	8	0	0	1	False	True	False	False	False	False
9576	0	0.1600	351.58	10.819778	19.18	692	1800.000000	0	3.2	5	0	0	1	False	False	False	True	False	False
9577	0	0.1392	853.43	11.264464	16.28	732	4740.000000	37879	57.0	6	0	0	1	False	True	False	False	False	False

9578 rows × 19 columns

segregate features from the output which in this case is ‘not.fully.paid’.

X=df.drop('not.fully.paid', axis=1)
y=df['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

Decision Trees#

Decision Trees are a popular machine learning model used for both classification and regression tasks (though here we use it for classification solely). Here is a brief summary of the intuition behind Decision Trees:

Structure:

Hierarchical Structure: Decision Trees organize data in a hierarchical manner.
Nodes: Internal nodes represent decisions based on specific features.

Decision Making:

Traversal: The model makes decisions by traversing the tree from the root to a leaf.
Feature-based Decisions: At each internal node, a decision is made based on a specific feature, leading to a branch.

Splitting Criteria:

Objective: The tree learns to split the data by selecting features and thresholds that maximize information gain.

Types:

Classification Trees: Leaf nodes represent different classes, and the majority class is assigned to instances in a leaf.
Regression Trees: Leaf nodes represent numeric values, usually the mean of the target values of instances in that leaf.

Pruning:

Overfitting: To prevent overfitting, trees can be pruned by removing branches that do not significantly contribute to predictive performance.

Ensemble Methods:

Random Forests and Gradient Boosted Trees: Decision Trees are often used as building blocks in ensemble methods to improve predictive accuracy and robustness.

Interpretability:

Visual Decision Process: Decision Trees offer interpretability, making it easy to understand the decision-making process and communicate.

Remarks:

Sensitive to Data Distribution: Decision Trees can be sensitive to variations in the training data, leading to different tree structures for slightly different datasets.
Handling Missing Values: Some implementations can handle missing values during the learning process.

Decision Trees provide a flexible and interpretable approach to machine learning, with the potential for overfitting mitigated through pruning and ensemble methods. Let’s use Decision trees to help predict whether a loan is fully paid or not.

dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
pred = dtree.predict(X_test)

print(classification_report(y_test,pred))
print(confusion_matrix(y_test,pred))

              precision    recall  f1-score   support

           0       0.85      0.84      0.85      2422
           1       0.21      0.23      0.22       452

    accuracy                           0.74      2874
   macro avg       0.53      0.54      0.53      2874
weighted avg       0.75      0.74      0.75      2874

[[2030  392]
 [ 346  106]]

Random forest is a popular supervised machine learning method for classification and regression that consists of using several decision trees, and combining the trees’ predictions into an overall prediction. To train the random forest is to train each of its decision trees independently. Each decision tree is typically trained on a slightly different part of the training set, and may look at different features for its node splits. Let’s also fit a Random Forest to our data.

rf = RandomForestClassifier()
rf.fit(X_train,y_train)
pred1=rf.predict(X_test)

print(classification_report(y_test,pred1))
print(confusion_matrix(y_test,pred1))

              precision    recall  f1-score   support

           0       0.85      0.99      0.91      2422
           1       0.44      0.03      0.05       452

    accuracy                           0.84      2874
   macro avg       0.64      0.51      0.48      2874
weighted avg       0.78      0.84      0.78      2874

[[2407   15]
 [ 440   12]]

As you see Random Forests take longer to run (even though our dataset is small). They deliver slightly better performance statistics.

Exercise#

Tune our random forest by using ‘RandomizedSearchCV()’ function in sklearn and comment on prediction gains.

n_estimators = [int(x) for x in np.linspace(start = 200, stop = 1000, num = 5)]
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(20, 100, num = 5)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf1 = RandomForestClassifier()
rf2 = RandomizedSearchCV(estimator = rf1, param_distributions = random_grid, 
                              cv = 3, verbose=2,n_jobs = -1)
rf2.fit(X_train,y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits

RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [20, 40, 60, 80, 100,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000]},
                   verbose=2)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

rf2.best_params_

{'n_estimators': 800,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 60,
 'bootstrap': True}

rf3=RandomForestClassifier(n_estimators= 200,
 min_samples_split=2,
 min_samples_leaf=2,
 max_features='auto',
 max_depth=40,
 bootstrap=True)
rf3.fit(X_train,y_train)
preds3=rf3.predict(X_test)

print(classification_report(y_test,preds3))
print(confusion_matrix(y_test,preds3))

c:\Users\Miguel\anaconda3\Lib\site-packages\sklearn\ensemble\_forest.py:424: FutureWarning: `max_features='auto'` has been deprecated in 1.1 and will be removed in 1.3. To keep the past behaviour, explicitly set `max_features='sqrt'` or remove this parameter as it is also the default value for RandomForestClassifiers and ExtraTreesClassifiers.
  warn(

              precision    recall  f1-score   support

           0       0.84      1.00      0.91      2422
           1       0.36      0.01      0.02       452

    accuracy                           0.84      2874
   macro avg       0.60      0.50      0.47      2874
weighted avg       0.77      0.84      0.77      2874

[[2413    9]
 [ 447    5]]

Support Vector Machines (SVMs)#

SVMs are Supervised learning algorithms for classification and regression. They work by finding a hyperplane that best separates data points into different classes while maximizing the margin. Remember that a hyperplane is a decision boundary that separates data points of different classes. The margin is the distance between the hyperplane and the nearest data point from either class. SVMs seek to maximize this margin, promoting better generalization to new, unseen data.

Support Vectors:#

Definition: Support vectors are the data points that lie closest to the hyperplane and influence its position.
SVMs are named after these crucial data points as they support the definition of the decision boundary.

Types:#

1. Linear SVM:#

Hyperplane: A straight line in 2D, a plane in 3D, and a hyperplane in higher dimensions.
Use Case: Suitable for linearly separable data.

2. Non-Linear SVM (Kernel Trick):#

Idea: Transform data into a higher-dimensional space to make it linearly separable.
Kernel Functions: Radial Basis Function (RBF), Polynomial, Sigmoid, etc., are used to achieve non-linear separations.

The advantages of support vector machines are:#

Effective in high dimensional spaces.
Still effective in cases where number of dimensions is greater than the number of samples.
Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:#

If the number of features is much greater than the number of samples, avoid over-fitting in choosing Kernel functions and regularization term is crucial.
SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).

Commonly used for:#

Image Classification, Text Classification, and Bioinformatics.

Tips:#

Normalize Features: It is often beneficial to normalize input features to ensure equal importance in the model.

Support Vector Machines are powerful and versatile classifiers, effective in scenarios with complex decision boundaries and high-dimensional feature spaces. Let’s explore their performance with our loan data.

# Import Support Vector Machine classifier
from sklearn.svm import SVC

# Create a Support Vector Machine classifier
svm_classifier = SVC()

# Train the Support Vector Machine classifier on the training data
svm_classifier.fit(X_train, y_train)

# Make predictions on the test data
svm_pred = svm_classifier.predict(X_test)

# Print the classification report for SVM
print("Support Vector Machine - Classification Report:\n", classification_report(y_test, svm_pred))

# Print the confusion matrix for SVM
print("Support Vector Machine - Confusion Matrix:\n", confusion_matrix(y_test, svm_pred))

Support Vector Machine - Classification Report:
               precision    recall  f1-score   support

           0       0.84      1.00      0.91      2422
           1       0.33      0.00      0.00       452

    accuracy                           0.84      2874
   macro avg       0.59      0.50      0.46      2874
weighted avg       0.76      0.84      0.77      2874

Support Vector Machine - Confusion Matrix:
 [[2420    2]
 [ 451    1]]

Exercise#

Similarly to what you did with Random Forests, Tune the SVM for loan classification.

from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from scipy.stats import uniform
import numpy as np

# Assuming you have already loaded your data into X and y

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform training data
X_train_standardized = scaler.fit_transform(X_train)

# Transform test data using the same scaler
X_test_standardized = scaler.transform(X_test)

# Define a more compact parameter distribution for random search
param_dist = {
    'C': uniform(0.1, 1),  # Uniform distribution between 0.1 and 1
    'penalty': ['l1', 'l2'],  # Penalty term for LinearSVC
    'dual': [False],  # Dual parameter for LinearSVC
}

# Create a Linear Support Vector Machine classifier
linear_svm_classifier = LinearSVC()

# Use RandomizedSearchCV for a more efficient search
random_search = RandomizedSearchCV(linear_svm_classifier, param_dist, n_iter=20, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
random_search.fit(X_train_standardized, y_train)

# Get the best hyperparameters
best_params_random = random_search.best_params_
print("Best Hyperparameters (Random Search):", best_params_random)

# Make predictions on the test data using the best model from random search
best_linear_svm_random = random_search.best_estimator_
linear_svm_pred_random = best_linear_svm_random.predict(X_test_standardized)

# Evaluate the performance of the tuned Linear SVM
print("Tuned Linear Support Vector Machine (Random Search) - Classification Report:\n", classification_report(y_test, linear_svm_pred_random))
print("Tuned Linear Support Vector Machine (Random Search) - Confusion Matrix:\n", confusion_matrix(y_test, linear_svm_pred_random))

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best Hyperparameters (Random Search): {'C': 1.0872834785177143, 'dual': False, 'penalty': 'l1'}
Tuned Linear Support Vector Machine (Random Search) - Classification Report:
               precision    recall  f1-score   support

           0       0.83      1.00      0.91      2393
           1       0.00      0.00      0.00       481

    accuracy                           0.83      2874
   macro avg       0.42      0.50      0.45      2874
weighted avg       0.69      0.83      0.76      2874

Tuned Linear Support Vector Machine (Random Search) - Confusion Matrix:
 [[2389    4]
 [ 481    0]]

Some notes about the exercise:

Performing grid-search with a SVM with a nonlinear Kernel is a computationally prohibitive task.
Kernelized SVMs require the computation of a distance function between each point in the dataset, which is the dominating cost of \(O(n_{features},n^2_{observations})\).

To make this task feasible, it helps to think about it a bit before writing code:

Standardizing the features helps a lot because the Kernel requires the storage of the distances putting a burden on memory
Using RandomizedSearchCV() instead of GridSearchCV() also helps.
Another option is to use LinearSVC().

Decision Trees, Random Forests & Support Vector Machines

Contents

Decision Trees, Random Forests & Support Vector Machines#

Load Data#

Data Descriptives#

Visualization#

Decision Trees#

Exercise#

Support Vector Machines (SVMs)#

Support Vectors:#

Types:#

1. Linear SVM:#

2. Non-Linear SVM (Kernel Trick):#

The advantages of support vector machines are:#

The disadvantages of support vector machines include:#

Commonly used for:#

Tips:#

Exercise#