Decision Trees, Random Forests & Support Vector Machines#

LendingClub is a US peer-to-peer lending company that operates an online lending platform. It was founded in 2006 and is headquartered in San Francisco, California. The platform facilitates the borrowing and lending of money directly between individuals, bypassing traditional banks. Investors can construct their portfolio of loans, according to their risk appetite. Naturally, if the borrower fails to repay their loan, investors lose money. Therefore, investors face the problem of predicting the risk of a borrower being unable to repay a loan.

The firm makes loan-level data freely available online so that investors can make informed decisions about whether to invest. Let’s use the ML models covered in class to predict whether a loan will be fully paid, based on borrower characteristics. The data is available on the firm’s webpage, but here we will work with a dataset available on Kaggle.

Note that the original dataset has millions of loans. To keep things simple (and avoid computational bottlenecks), we will work with a random sample of this larger dataset. Download the two Excel files, ‘loan_data.xlsx’ and ‘descriptives.xlsx’, from Moodle and load them into your working environment.

Load Data#

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.model_selection import RandomizedSearchCV

# Load the Excel files into pandas DataFrames (the first sheet is read by default)
# Note: pd.read_excel needs the optional openpyxl package to read .xlsx files
df = pd.read_excel('loan_data.xlsx')
desc = pd.read_excel('descriptives.xlsx')

# Display the DataFrame
df.head()
# print descriptives of the variables
desc
    credit.policy      if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
0   purpose            The purpose of the loan (takes values "credit_...
1   int.rate           The purpose of the loan (takes values "credit_...
2   installment        The purpose of the loan (takes values "credit_...
3   log.annual.inc     The natural log of the self-reported annual in...
4   dti                The debt-to-income ratio of the borrower (amou...
5   fico               The FICO credit score of the borrower.
6   days.with.cr.line  The number of days the borrower has had a cred...
7   revol.bal          The borrower's revolving balance (amount unpai...
8   revol.util         The borrower's revolving balance (amount unpai...
9   inq.last.6mths     The borrower's number of inquiries by creditor...
10  delinq.2yrs        The number of times the borrower had been 30+ ...
11  pub.rec            The borrower's number of derogatory public rec...
12  not.fully.paid     not fully paid.

Data Descriptives#

To get a feel for the data, it’s always a good idea to start with simple descriptive statistics. Because our focus is on classification (i.e., predicting delinquencies), let’s also plot some descriptives for loans that have and have not been fully paid.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9578 entries, 0 to 9577
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   credit.policy      9578 non-null   int64  
 1   purpose            9578 non-null   object 
 2   int.rate           9578 non-null   float64
 3   installment        9578 non-null   float64
 4   log.annual.inc     9578 non-null   float64
 5   dti                9578 non-null   float64
 6   fico               9578 non-null   int64  
 7   days.with.cr.line  9578 non-null   float64
 8   revol.bal          9578 non-null   int64  
 9   revol.util         9578 non-null   float64
 10  inq.last.6mths     9578 non-null   int64  
 11  delinq.2yrs        9578 non-null   int64  
 12  pub.rec            9578 non-null   int64  
 13  not.fully.paid     9578 non-null   int64  
dtypes: float64(6), int64(7), object(1)
memory usage: 1.0+ MB
df.describe().round(2)
       credit.policy  int.rate  installment  log.annual.inc      dti     fico  days.with.cr.line   revol.bal  revol.util  inq.last.6mths  delinq.2yrs  pub.rec  not.fully.paid
count         9578.0   9578.00      9578.00         9578.00  9578.00  9578.00            9578.00     9578.00     9578.00         9578.00      9578.00  9578.00         9578.00
mean             0.8      0.12       319.09           10.93    12.61   710.85            4560.77    16913.96       46.80            1.58         0.16     0.06            0.16
std              0.4      0.03       207.07            0.61     6.88    37.97            2496.93    33756.19       29.01            2.20         0.55     0.26            0.37
min              0.0      0.06        15.67            7.55     0.00   612.00             178.96        0.00        0.00            0.00         0.00     0.00            0.00
25%              1.0      0.10       163.77           10.56     7.21   682.00            2820.00     3187.00       22.60            0.00         0.00     0.00            0.00
50%              1.0      0.12       268.95           10.93    12.66   707.00            4139.96     8596.00       46.30            1.00         0.00     0.00            0.00
75%              1.0      0.14       432.76           11.29    17.95   737.00            5730.00    18249.50       70.90            2.00         0.00     0.00            0.00
max              1.0      0.22       940.14           14.53    29.96   827.00           17639.96  1207359.00      119.00           33.00        13.00     5.00            1.00
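
The mean of not.fully.paid in the table above (0.16) is the share of loans that were not fully paid, so the two classes are far from balanced. A minimal sketch of how one might check this directly is shown here; it uses a small made-up frame (`toy`, a hypothetical stand-in for the lab's `df`) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the lab's df: 1 = not fully paid, 0 = fully paid
toy = pd.DataFrame({
    "fico": [737, 707, 682, 712, 667, 672, 722, 687],
    "not.fully.paid": [0, 0, 0, 0, 0, 1, 0, 1],
})

# Share of each class; normalize=True returns proportions instead of counts
class_share = toy["not.fully.paid"].value_counts(normalize=True)
print(class_share)

# Average FICO score within each class
fico_by_status = toy.groupby("not.fully.paid")["fico"].mean()
print(fico_by_status)
```

Running the same two calls on `df` gives a quick view of how imbalanced the target is before any model is fitted.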

Visualization#

FICO scores ought to be an important variable for predicting loan delinquency, as they measure borrower quality. To understand the firm’s credit policy with respect to FICO scores, and how FICO scores differ across borrowers, let’s plot histograms of FICO scores split by whether the borrower meets LendingClub’s credit policy and by whether the loan has been fully repaid.

# Set up subplots with 1 row and 2 columns
plt.figure(figsize=(15, 5))

# Common color for both histograms
color = 'blue'

# Plot the first histogram
plt.subplot(1, 2, 1)
df[df['credit.policy']==0]['fico'].hist(bins=30, alpha=0.5, label='0', color='red')
df[df['credit.policy']==1]['fico'].hist(bins=30, alpha=0.5, label='1', color=color)
plt.legend()
plt.xlabel('FICO')
plt.title('Distribution of FICO scores by credit policy')

# Plot the second histogram
plt.subplot(1, 2, 2)
df[df['not.fully.paid']==0]['fico'].hist(bins=30, alpha=0.5, label='0', color='red')
df[df['not.fully.paid']==1]['fico'].hist(bins=30, alpha=0.5, label='1', color=color)
plt.legend()
plt.xlabel('FICO')
plt.title('Distribution of FICO scores by Loan Status')

# Adjust layout for better spacing
plt.tight_layout()

# Show the plot
plt.show()
../_images/2b3d3735c0e81ea5b8cfa76fb8480a9548c2a348b7d18b5405f7ad19e4c0b1da.png

Now, let’s also have a look at the distribution of FICO scores, interest rates charged and how they relate to each other.

sns.jointplot(x='fico',y='int.rate',data=df)
<seaborn.axisgrid.JointGrid at 0x288650b0d50>
../_images/eb31ea636558d1aeefb2a1d2fbcff842f3c40a3088f44854f60eed606bb4e429.png
sns.lmplot(x='fico',y='int.rate',data=df,col='not.fully.paid',hue='credit.policy',palette='coolwarm')
<seaborn.axisgrid.FacetGrid at 0x288653b3210>
../_images/8c443ef937fc2ca36eac2ffb7a6d658d5a355376cf199c0c2bbe2a3e175fe0da.png

For a given FICO score, interest rates appear to be more dispersed among borrowers who have not fully repaid their loan. This is a reminder that other variables should also be taken into account when assessing the likelihood of delinquency. Before we start processing the data and building our ML models, let’s look at the correlations among the variables.

# Select numerical columns from your DataFrame
numerical_columns = df.select_dtypes(include=['number'])
# Create a heatmap of the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(numerical_columns.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()
../_images/699d8dcf34741fadecd88a4996303c6b2f4c85d9b048bfa53f8eafbca4f03560.png

Now convert the categorical variable ‘purpose’ into dummy variables so that it can be added to the feature set of our ML model.

df = pd.get_dummies(df,columns=['purpose'],drop_first=True)
df
credit.policy int.rate installment log.annual.inc dti fico days.with.cr.line revol.bal revol.util inq.last.6mths delinq.2yrs pub.rec not.fully.paid purpose_credit_card purpose_debt_consolidation purpose_educational purpose_home_improvement purpose_major_purchase purpose_small_business
0 1 0.1189 829.10 11.350407 19.48 737 5639.958333 28854 52.1 0 0 0 0 False True False False False False
1 1 0.1071 228.22 11.082143 14.29 707 2760.000000 33623 76.7 0 0 0 0 True False False False False False
2 1 0.1357 366.86 10.373491 11.63 682 4710.000000 3511 25.6 1 0 0 0 False True False False False False
3 1 0.1008 162.34 11.350407 8.10 712 2699.958333 33667 73.2 1 0 0 0 False True False False False False
4 1 0.1426 102.92 11.299732 14.97 667 4066.000000 4740 39.5 0 1 0 0 True False False False False False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
9573 0 0.1461 344.76 12.180755 10.39 672 10474.000000 215372 82.1 2 0 0 1 False False False False False False
9574 0 0.1253 257.70 11.141862 0.21 722 4380.000000 184 1.1 5 0 0 1 False False False False False False
9575 0 0.1071 97.81 10.596635 13.09 687 3450.041667 10036 82.9 8 0 0 1 False True False False False False
9576 0 0.1600 351.58 10.819778 19.18 692 1800.000000 0 3.2 5 0 0 1 False False False True False False
9577 0 0.1392 853.43 11.264464 16.28 732 4740.000000 37879 57.0 6 0 0 1 False True False False False False

9578 rows × 19 columns

Separate the features from the target, which in this case is ‘not.fully.paid’, and split the data into training and test sets.

X=df.drop('not.fully.paid', axis=1)
y=df['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
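
The split above is random, so the numbers reported below change from run to run. A minimal sketch of a reproducible split follows, assuming you also want the share of unpaid loans to be the same in the training and test sets. The data here is synthetic (`X_demo`, `y_demo` are made up for illustration), and the `random_state` value is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan data: roughly 16% positives
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"fico": rng.integers(612, 828, size=1000)})
y_demo = pd.Series((rng.random(1000) < 0.16).astype(int), name="not.fully.paid")

# random_state fixes the shuffle; stratify keeps the class share equal in both sets
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=42, stratify=y_demo
)

print(len(Xtr), len(Xte))
print(round(ytr.mean(), 3), round(yte.mean(), 3))
```

With the lab data, the same two arguments could be added to the existing `train_test_split(X, y, test_size=0.30)` call.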

Decision Trees#

Decision Trees are a popular machine learning model used for both classification and regression tasks (though here we use them solely for classification). Here is a brief summary of the intuition behind Decision Trees:

  1. Structure:

  • Hierarchical Structure: Decision Trees organize data in a hierarchical manner.

  • Nodes: Internal nodes represent decisions based on specific features; leaf nodes hold the predictions.

  2. Decision Making:

  • Traversal: The model makes a prediction by traversing the tree from the root to a leaf.

  • Feature-based Decisions: At each internal node, a decision is made based on a specific feature, leading to a branch.

  3. Splitting Criteria:

  • Objective: The tree learns to split the data by selecting features and thresholds that maximize information gain, i.e. that most reduce an impurity measure such as Gini impurity (the default in sklearn's DecisionTreeClassifier) or entropy.

  4. Types:

  • Classification Trees: Leaf nodes represent different classes, and the majority class is assigned to instances in a leaf.

  • Regression Trees: Leaf nodes represent numeric values, usually the mean of the target values of instances in that leaf.

  5. Pruning:

  • Overfitting: To prevent overfitting, trees can be pruned by removing branches that do not significantly contribute to predictive performance.

  6. Ensemble Methods:

  • Random Forests and Gradient Boosted Trees: Decision Trees are often used as building blocks in ensemble methods to improve predictive accuracy and robustness.

  7. Interpretability:

  • Visual Decision Process: Decision Trees offer interpretability, making it easy to understand and communicate the decision-making process.

  8. Remarks:

  • Sensitive to Data Distribution: Decision Trees can be sensitive to variations in the training data, leading to different tree structures for slightly different datasets.

  • Handling Missing Values: Some implementations can handle missing values during the learning process.
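
The splitting criterion described above can be sketched in a few lines. This is an illustration rather than sklearn's actual implementation: the helper names (`gini`, `entropy`, `impurity_decrease`) and the eight data points are made up, and only a handful of candidate thresholds on a single feature are compared.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class shares."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def impurity_decrease(y, mask, impurity):
    """Parent impurity minus the size-weighted impurity of the two children."""
    left, right = y[mask], y[~mask]
    w_left, w_right = len(left) / len(y), len(right) / len(y)
    return impurity(y) - (w_left * impurity(left) + w_right * impurity(right))

# Made-up example: FICO scores and repayment status (1 = not fully paid)
fico = np.array([640, 660, 680, 700, 720, 740, 760, 780])
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])

# Compare three candidate thresholds under both impurity measures
for threshold in (670, 710, 750):
    mask = fico <= threshold
    print(threshold,
          round(impurity_decrease(y, mask, gini), 3),
          round(impurity_decrease(y, mask, entropy), 3))
```

A tree-growing algorithm would evaluate such decreases over every feature and candidate threshold and keep the split with the largest one. With entropy as the impurity measure, the decrease is the information gain.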

Decision Trees provide a flexible and interpretable approach to machine learning, with the potential for overfitting mitigated through pruning and ensemble methods. Let’s use a Decision Tree to predict whether a loan is fully paid or not.

dtree = DecisionTreeClassifier()
dtree.fit(X_train,y_train)
pred = dtree.predict(X_test)

print(classification_report(y_test,pred))
print(confusion_matrix(y_test,pred))
              precision    recall  f1-score   support

           0       0.85      0.84      0.85      2422
           1       0.21      0.23      0.22       452

    accuracy                           0.74      2874
   macro avg       0.53      0.54      0.53      2874
weighted avg       0.75      0.74      0.75      2874

[[2030  392]
 [ 346  106]]
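
The tree above is grown with default settings, so nothing limits its size. As a sketch of the pruning idea mentioned earlier, the example below compares a fully grown tree with one pruned through sklearn's `ccp_alpha` parameter (cost-complexity pruning). The data is synthetic and the value `ccp_alpha=0.005` is an arbitrary assumption; in practice it would be tuned.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the loan data (about 84% / 16%)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.84, 0.16], random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# Fully grown tree versus a cost-complexity-pruned tree (ccp_alpha chosen arbitrarily)
full_tree = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.005, random_state=0).fit(Xtr, ytr)

# Report size, depth, and train/test accuracy for each tree
for name, tree in [("full", full_tree), ("pruned", pruned_tree)]:
    print(name, tree.tree_.node_count, tree.get_depth(),
          round(tree.score(Xtr, ytr), 3), round(tree.score(Xte, yte), 3))
```

A large gap between training and test accuracy for the full tree, and a smaller gap for the pruned one, is the usual sign that pruning is reducing overfitting.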

A random forest is a popular supervised machine learning method for classification and regression that trains several decision trees and combines their predictions into an overall prediction. Training a random forest amounts to training each of its decision trees independently. Each tree is typically trained on a different bootstrap sample of the training set and considers only a random subset of the features at each node split. Let’s also fit a Random Forest to our data.

rf = RandomForestClassifier()
rf.fit(X_train,y_train)
pred1=rf.predict(X_test)

print(classification_report(y_test,pred1))
print(confusion_matrix(y_test,pred1))
              precision    recall  f1-score   support

           0       0.85      0.99      0.91      2422
           1       0.44      0.03      0.05       452

    accuracy                           0.84      2874
   macro avg       0.64      0.51      0.48      2874
weighted avg       0.78      0.84      0.78      2874

[[2407   15]
 [ 440   12]]

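As you can see, Random Forests take longer to run (even though our dataset is small). Overall accuracy rises from 0.74 to 0.84, but 0.84 is also the share of fully paid loans in the test set (2422 of 2874), and recall for loans that are not fully paid falls from 0.23 to 0.03. The forest reaches its higher accuracy mostly by predicting that almost every loan will be repaid.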
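
To put these accuracy figures in context, it helps to compare them with a trivial baseline. The sketch below builds labels with the same class counts as the test set in the reports above (2422 fully paid, 452 not) and scores a "model" that always predicts the most frequent class. The names `y_true`, `X_dummy`, and `baseline` are made up for this illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Labels with the same class counts as the test set above: 2422 zeros, 452 ones
y_true = np.array([0] * 2422 + [1] * 452)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by DummyClassifier

# A "model" that always predicts the most frequent class (fully paid)
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

acc = accuracy_score(y_true, y_base)
rec = recall_score(y_true, y_base)
bal = balanced_accuracy_score(y_true, y_base)
print("accuracy:", round(acc, 3))
print("recall (class 1):", rec)
print("balanced accuracy:", bal)
```

Any classifier whose accuracy is close to this baseline has learned little about the minority class, which is why recall for class 1 or balanced accuracy is a more informative yardstick here. One option worth trying is the `class_weight='balanced'` argument of `RandomForestClassifier`.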

Exercise#

Tune our random forest using the ‘RandomizedSearchCV()’ function in sklearn and comment on the prediction gains.

# Candidate values for each hyperparameter
n_estimators = [int(x) for x in np.linspace(start=200, stop=1000, num=5)]
# 'auto' (used in the run shown below) was deprecated in scikit-learn 1.1 and
# removed in 1.3; for classifiers it was equivalent to 'sqrt'
max_features = ['sqrt', 'log2']
max_depth = [int(x) for x in np.linspace(20, 100, num=5)]
max_depth.append(None)  # None: no limit on tree depth
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

rf1 = RandomForestClassifier()
rf2 = RandomizedSearchCV(estimator = rf1, param_distributions = random_grid, 
                              cv = 3, verbose=2,n_jobs = -1)
rf2.fit(X_train,y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(), n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': [20, 40, 60, 80, 100,
                                                      None],
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': [1, 2, 4],
                                        'min_samples_split': [2, 5, 10],
                                        'n_estimators': [200, 400, 600, 800,
                                                         1000]},
                   verbose=2)
rf2.best_params_
{'n_estimators': 800,
 'min_samples_split': 5,
 'min_samples_leaf': 4,
 'max_features': 'sqrt',
 'max_depth': 60,
 'bootstrap': True}
# Refit a forest with hyperparameters taken from a randomized search.
# Note: these values differ from the best_params_ printed above (the search is
# random, so results vary between runs); rf2.best_estimator_ gives the tuned model directly.
rf3 = RandomForestClassifier(n_estimators=200,
                             min_samples_split=2,
                             min_samples_leaf=2,
                             max_features='sqrt',  # 'auto' is deprecated; 'sqrt' is the equivalent setting
                             max_depth=40,
                             bootstrap=True)
rf3.fit(X_train, y_train)
preds3 = rf3.predict(X_test)

print(classification_report(y_test, preds3))
print(confusion_matrix(y_test, preds3))
              precision    recall  f1-score   support

           0       0.84      1.00      0.91      2422
           1       0.36      0.01      0.02       452

    accuracy                           0.84      2874
   macro avg       0.60      0.50      0.47      2874
weighted avg       0.77      0.84      0.77      2874

[[2413    9]
 [ 447    5]]
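
Rather than copying hyperparameters by hand, the fitted search object can be used directly. The sketch below runs on synthetic data with a deliberately small search space so that it finishes quickly. Choosing `scoring="recall"` is an assumption made for illustration (the search above uses the default accuracy), and fixing `random_state` makes the search repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the loan data
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.84, 0.16], random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

# Deliberately small search space so the sketch runs quickly
param_dist = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_dist,
    n_iter=5,
    cv=3,
    scoring="recall",  # optimize recall of class 1 instead of accuracy
    random_state=0,
)
search.fit(Xtr, ytr)

# best_estimator_ is already refit on the full training set with best_params_
best_rf = search.best_estimator_
print(search.best_params_)
print("test recall:", round(recall_score(yte, best_rf.predict(Xte)), 3))
```

Using `best_estimator_` avoids any mismatch between the parameters the search selected and the ones typed into a new model.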

Support Vector Machines (SVMs)#

SVMs are supervised learning algorithms for classification and regression. They work by finding a hyperplane that best separates data points into different classes while maximizing the margin. Remember that a hyperplane is a decision boundary that separates data points of different classes. The margin is the distance between the hyperplane and the nearest data point from either class. SVMs seek to maximize this margin, promoting better generalization to new, unseen data.

Support Vectors:#

  • Definition: Support vectors are the data points that lie closest to the hyperplane and influence its position.

  • SVMs are named after these crucial data points as they support the definition of the decision boundary.

Types:#

1. Linear SVM:#

  • Hyperplane: A straight line in 2D, a plane in 3D, and a hyperplane in higher dimensions.

  • Use Case: Suitable for linearly separable data.

2. Non-Linear SVM (Kernel Trick):#

  • Idea: Transform data into a higher-dimensional space to make it linearly separable.

  • Kernel Functions: Radial Basis Function (RBF), Polynomial, Sigmoid, etc., are used to achieve non-linear separations.
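
The difference between the two types can be seen on a toy problem. The sketch below uses sklearn's `make_circles` to generate two concentric rings, which no straight line can separate, and compares a linear kernel with an RBF kernel. The dataset parameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line
X_demo, y_demo = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

linear_svm = SVC(kernel="linear").fit(Xtr, ytr)
rbf_svm = SVC(kernel="rbf").fit(Xtr, ytr)

acc_linear = linear_svm.score(Xte, yte)
acc_rbf = rbf_svm.score(Xte, yte)
print("linear kernel accuracy:", round(acc_linear, 3))
print("RBF kernel accuracy:", round(acc_rbf, 3))

# Number of support vectors per class for the RBF model
print("support vectors per class:", rbf_svm.n_support_)
```

The RBF model separates the rings because the kernel implicitly works in a higher-dimensional space where the two classes become linearly separable.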

The advantages of support vector machines are:#

  • Effective in high dimensional spaces.

  • Still effective in cases where number of dimensions is greater than the number of samples.

  • Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

  • Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:#

  • If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.

  • SVMs do not directly provide probability estimates; in scikit-learn these are calculated using an expensive five-fold cross-validation.

Commonly used for:#

  • Image Classification, Text Classification, and Bioinformatics.

Tips:#

  • Normalize Features: SVMs are sensitive to feature scales, so it is often beneficial to standardize the input features so that no feature dominates simply because of its units.
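
One convenient way to apply this tip is to bundle the scaler and the classifier in a pipeline, so the scaler is fit on the training data only. The sketch below is an illustration on synthetic data. Multiplying one column by 10,000 is an assumption meant to mimic a variable like revol.bal, which sits on a much larger scale than the others.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_demo[:, 0] *= 10_000  # put one feature on a much larger scale, like revol.bal

Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.30, random_state=0)

# The scaler is fit on the training data only, then reused on the test data
scaled_svm = make_pipeline(StandardScaler(), SVC())
scaled_svm.fit(Xtr, ytr)

# Same classifier without any scaling, for comparison
raw_svm = SVC().fit(Xtr, ytr)

print("without scaling:", round(raw_svm.score(Xte, yte), 3))
print("with scaling:   ", round(scaled_svm.score(Xte, yte), 3))
```

Because the pipeline behaves like a single estimator, it can also be passed to `RandomizedSearchCV`, which then re-fits the scaler inside each cross-validation fold.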

Support Vector Machines are powerful and versatile classifiers, effective in scenarios with complex decision boundaries and high-dimensional feature spaces. Let’s explore their performance with our loan data.

# Import Support Vector Machine classifier
from sklearn.svm import SVC

# Create a Support Vector Machine classifier
svm_classifier = SVC()

# Train the Support Vector Machine classifier on the training data
svm_classifier.fit(X_train, y_train)

# Make predictions on the test data
svm_pred = svm_classifier.predict(X_test)

# Print the classification report for SVM
print("Support Vector Machine - Classification Report:\n", classification_report(y_test, svm_pred))

# Print the confusion matrix for SVM
print("Support Vector Machine - Confusion Matrix:\n", confusion_matrix(y_test, svm_pred))
Support Vector Machine - Classification Report:
               precision    recall  f1-score   support

           0       0.84      1.00      0.91      2422
           1       0.33      0.00      0.00       452

    accuracy                           0.84      2874
   macro avg       0.59      0.50      0.46      2874
weighted avg       0.76      0.84      0.77      2874

Support Vector Machine - Confusion Matrix:
 [[2420    2]
 [ 451    1]]

Exercise#

Similarly to what you did with Random Forests, tune the SVM for loan classification.

from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from scipy.stats import uniform
import numpy as np

# Assuming you have already loaded your data into X and y

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)

# Create a StandardScaler object
scaler = StandardScaler()

# Fit and transform training data
X_train_standardized = scaler.fit_transform(X_train)

# Transform test data using the same scaler
X_test_standardized = scaler.transform(X_test)

# Define a more compact parameter distribution for random search
param_dist = {
    'C': uniform(0.1, 1),  # scipy's uniform(loc, scale) is uniform on [loc, loc + scale], i.e. [0.1, 1.1]
    'penalty': ['l1', 'l2'],  # Penalty term for LinearSVC
    'dual': [False],  # Dual parameter for LinearSVC
}

# Create a Linear Support Vector Machine classifier
linear_svm_classifier = LinearSVC()

# Use RandomizedSearchCV for a more efficient search
random_search = RandomizedSearchCV(linear_svm_classifier, param_dist, n_iter=20, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
random_search.fit(X_train_standardized, y_train)

# Get the best hyperparameters
best_params_random = random_search.best_params_
print("Best Hyperparameters (Random Search):", best_params_random)

# Make predictions on the test data using the best model from random search
best_linear_svm_random = random_search.best_estimator_
linear_svm_pred_random = best_linear_svm_random.predict(X_test_standardized)

# Evaluate the performance of the tuned Linear SVM
print("Tuned Linear Support Vector Machine (Random Search) - Classification Report:\n", classification_report(y_test, linear_svm_pred_random))
print("Tuned Linear Support Vector Machine (Random Search) - Confusion Matrix:\n", confusion_matrix(y_test, linear_svm_pred_random))
Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best Hyperparameters (Random Search): {'C': 1.0872834785177143, 'dual': False, 'penalty': 'l1'}
Tuned Linear Support Vector Machine (Random Search) - Classification Report:
               precision    recall  f1-score   support

           0       0.83      1.00      0.91      2393
           1       0.00      0.00      0.00       481

    accuracy                           0.83      2874
   macro avg       0.42      0.50      0.45      2874
weighted avg       0.69      0.83      0.76      2874

Tuned Linear Support Vector Machine (Random Search) - Confusion Matrix:
 [[2389    4]
 [ 481    0]]
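
Note that the best C reported above (about 1.09) lies above 1. This follows from how scipy parameterizes the uniform distribution, as the short sketch below shows. The alternative distributions (`c_dist_01_to_1`, `c_log`) are suggestions for illustration and are not part of the original exercise.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) is supported on [loc, loc + scale], not [loc, scale]
c_dist = uniform(0.1, 1)
low, high = c_dist.ppf(0), c_dist.ppf(1)  # lower and upper end of the support
print(low, high)

# To sample on [0.1, 1], set scale = upper - lower
c_dist_01_to_1 = uniform(loc=0.1, scale=0.9)
print(c_dist_01_to_1.ppf(0), c_dist_01_to_1.ppf(1))

# A log-uniform distribution is a common alternative when C spans orders of magnitude
c_log = loguniform(1e-2, 1e2)
samples = c_log.rvs(size=5, random_state=0)
print(samples)
```

Any of these objects can be used as the value for 'C' in `param_dist`, since `RandomizedSearchCV` only requires an `rvs` method for sampling.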

Some notes about the exercise:

  • Performing a grid search over an SVM with a nonlinear kernel can be computationally prohibitive.

  • Kernelized SVMs require the computation of a distance function between every pair of points in the dataset, which is the dominating cost, of order \(O(n_{features} \times n^2_{observations})\).

To make this task feasible, it helps to think about it a bit before writing code:

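  • Standardizing the features helps a lot: kernel distances are then not dominated by the features with the largest scale, and the solver typically converges faster. Memory is a separate burden, because the kernel requires storing pairwise distances.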

  • Using RandomizedSearchCV() instead of GridSearchCV() also helps.

  • Another option is to use LinearSVC().
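
If you still want a nonlinear kernel, one way to combine these ideas is to tune on a random subsample and refit the chosen configuration once on all the training data. The sketch below is an illustration on synthetic data. The subsample size, the parameter ranges, and the use of `balanced_accuracy` as the score are all assumptions, not part of the original exercise.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.84, 0.16], random_state=0
)

# Tune on a random subsample to keep the pairwise kernel computations cheap
rng = np.random.default_rng(0)
idx = rng.choice(len(X_demo), size=600, replace=False)
X_sub, y_sub = X_demo[idx], y_demo[idx]

# Scaling happens inside the pipeline, so it is re-fit within each CV fold
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_dist = {
    "svc__C": loguniform(1e-1, 1e2),
    "svc__gamma": loguniform(1e-3, 1e0),
}

search = RandomizedSearchCV(
    pipe, param_dist, n_iter=8, cv=3, scoring="balanced_accuracy", random_state=0
)
search.fit(X_sub, y_sub)
print(search.best_params_)

# Refit the chosen configuration once on all the training data
final_model = search.best_estimator_.fit(X_demo, y_demo)
```

The search cost grows with the subsample size rather than the full dataset, at the price of somewhat noisier hyperparameter estimates.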