Statistics for Data Scientists


6 - Classification

6.1 - Data preparation

6.1.1 - Importing the Python packages

from pathlib import Path
import pandas as pd
import numpy as np
#
from sklearn.naive_bayes import MultinomialNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression #, LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support
from sklearn.metrics import roc_curve, accuracy_score, roc_auc_score
#
import statsmodels.api as sm
#
from imblearn.over_sampling import SMOTE, ADASYN, BorderlineSMOTE
from pygam import LinearGAM, s, f, l
#
from dmba import classificationSummary
#
import seaborn as sns
import matplotlib.pyplot as plt

6.1.2 - Data directory

The DATA directory contains the .csv files used in the examples.

DATA = './'

6.1.3 - Dataset paths

If you do not keep your data in the same directory as the code, adapt the path names.

LOAN3000_CSV = DATA + 'loan3000.csv'
LOAN_DATA_CSV = DATA + 'loan_data.csv.gz'
FULL_TRAIN_SET_CSV = DATA + 'full_train_set.csv.gz'
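
One possible way to adapt the paths is sketched below with `pathlib`. The directory name `data` is a made-up example, not a location used by this page.

```python
from pathlib import Path

# Hypothetical data directory; point this at wherever the .csv files live
DATA = Path('data')

# The / operator joins path components in a platform-independent way
LOAN3000_CSV = DATA / 'loan3000.csv'
LOAN_DATA_CSV = DATA / 'loan_data.csv.gz'
FULL_TRAIN_SET_CSV = DATA / 'full_train_set.csv.gz'

print(LOAN3000_CSV)
```

`pd.read_csv` accepts `Path` objects, so the rest of the code would not need to change.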

6.2 - Naive Bayes

6.2.1 - The naive solution

We load the loan dataset.

loan_data = pd.read_csv(LOAN_DATA_CSV)
print(loan_data.head())

   Unnamed: 0       status  loan_amnt       term  annual_inc    dti  payment_inc_ratio  revol_bal  revol_util  ... pub_rec_zero open_acc  grade  outcome  emp_length            purpose_ home_   emp_len_ borrower_score
0           1  Charged Off       2500  60 months       30000   1.00            2.39320       1687         9.4  ...            1        3    4.8  default           1      major_purchase  RENT   > 1 Year           0.65
1           2  Charged Off       5600  60 months       40000   5.55            4.57170       5210        32.6  ...            1       11    1.4  default           5      small_business   OWN   > 1 Year           0.80
2           3  Charged Off       5375  60 months       15000  18.08            9.71600       9279        36.5  ...            1        2    6.0  default           1               other  RENT   > 1 Year           0.60
3           4  Charged Off       9000  36 months       30000  10.08           12.21520      10452        91.7  ...            1        4    4.2  default           1  debt_consolidation  RENT   > 1 Year           0.50
4           5  Charged Off      10000  36 months      100000   7.06            3.90888      11997        55.5  ...            1       14    5.4  default           4               other  RENT   > 1 Year           0.55

[5 rows x 21 columns]

We convert the relevant columns to categorical type.

loan_data.outcome = loan_data.outcome.astype('category')
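# Note: reorder_categories returns a new Series and the result is not assigned here,
# so the categories keep their default alphabetical order ('default', 'paid off')
loan_data.outcome.cat.reorder_categories(['paid off', 'default'])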
loan_data.purpose_ = loan_data.purpose_.astype('category')
loan_data.home_ = loan_data.home_.astype('category')
loan_data.emp_len_ = loan_data.emp_len_.astype('category')

We compute the probabilities.

predictors = ['purpose_', 'home_', 'emp_len_']
outcome = 'outcome'
X = pd.get_dummies(loan_data[predictors], prefix='', prefix_sep='')
y = loan_data[outcome]
#
# naive_model = MultinomialNB(alpha=0.01, fit_prior=True)  # alternative setting, overridden below
naive_model = MultinomialNB(alpha=1e-10, fit_prior=False)
naive_model.fit(X, y)
#
new_loan = X.loc[146:146, :]
print('predicted class: ', naive_model.predict(new_loan)[0])
#
probabilities = pd.DataFrame(
    naive_model.predict_proba(new_loan),
    columns=naive_model.classes_)
print('predicted probabilities')
print(probabilities)

predicted class:  default
predicted probabilities
    default  paid off
0  0.653699  0.346301
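
The idea behind the naive calculation can be sketched by hand: multiply the class prior by the per-predictor conditional probabilities, then normalize. The records below are made up, and this is the textbook categorical calculation, not a reproduction of how `MultinomialNB` works internally.

```python
from collections import Counter

# Made-up records: (purpose, home ownership, outcome)
rows = [
    ('small_business', 'RENT', 'default'),
    ('small_business', 'OWN', 'default'),
    ('credit_card', 'RENT', 'default'),
    ('credit_card', 'RENT', 'paid off'),
    ('credit_card', 'MORTGAGE', 'paid off'),
    ('small_business', 'MORTGAGE', 'paid off'),
]

def naive_bayes_posterior(rows, new_x):
    """Posterior class probabilities under the naive independence assumption."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(rows)  # prior P(class)
        for j, value in enumerate(new_x):
            # conditional P(x_j = value | class), estimated from counts
            n_match = sum(1 for r in rows if r[-1] == c and r[j] == value)
            score *= n_match / n_c
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

posterior = naive_bayes_posterior(rows, ('small_business', 'RENT'))
print(posterior)
```

Without smoothing, a predictor value never seen in a class drives that class's score to zero, which is why the model above passes a small `alpha`.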

6.3 - Discriminant analysis

A simple example:

loan3000 = pd.read_csv(LOAN3000_CSV)
print(loan3000.head())

   Unnamed: 0   outcome            purpose_    dti  borrower_score  payment_inc_ratio
0       32109  paid off  debt_consolidation  21.23            0.40            5.11135
1       16982   default         credit_card  15.49            0.40            5.43165
2       25335  paid off  debt_consolidation  27.30            0.70            9.23003
3       34580  paid off      major_purchase  21.11            0.40            2.33482
4       14424   default  debt_consolidation  16.46            0.45           12.10320

We convert the outcome column to a categorical type.

loan3000.outcome = loan3000.outcome.astype('category')
print(loan3000.head())

   Unnamed: 0   outcome            purpose_    dti  borrower_score  payment_inc_ratio
0       32109  paid off  debt_consolidation  21.23            0.40            5.11135
1       16982   default         credit_card  15.49            0.40            5.43165
2       25335  paid off  debt_consolidation  27.30            0.70            9.23003
3       34580  paid off      major_purchase  21.11            0.40            2.33482
4       14424   default  debt_consolidation  16.46            0.45           12.10320

We fit an LDA model to the loan data.

predictors = ['borrower_score', 'payment_inc_ratio']
outcome = 'outcome'
#
X = loan3000[predictors]
y = loan3000[outcome]
#
loan_lda = LinearDiscriminantAnalysis()
loan_lda.fit(X, y)
print(pd.DataFrame(loan_lda.scalings_, index=X.columns))
#
pred = pd.DataFrame(
    loan_lda.predict_proba(loan3000[predictors]),
    columns=loan_lda.classes_)
print(pred.head())

                          0
borrower_score     7.175839
payment_inc_ratio -0.099676
    default  paid off
0  0.553544  0.446456
1  0.558953  0.441047
2  0.272696  0.727304
3  0.506254  0.493746
4  0.609952  0.390048

Figure 5-1: use the scalings and the center of the class means to determine the decision boundary.

center = np.mean(loan_lda.means_, axis=0)
slope = - loan_lda.scalings_[0] / loan_lda.scalings_[1]
intercept = center[1] - center[0] * slope
#
# borrower_score on the boundary at payment_inc_ratio of 0 and 20
#
x_0 = (0 - intercept) / slope
x_20 = (20 - intercept) / slope
#
lda_df = pd.concat([loan3000, pred['default']], axis=1)
print(lda_df.head())

   Unnamed: 0   outcome            purpose_    dti  borrower_score  payment_inc_ratio   default
0       32109  paid off  debt_consolidation  21.23            0.40            5.11135  0.553544
1       16982   default         credit_card  15.49            0.40            5.43165  0.558953
2       25335  paid off  debt_consolidation  27.30            0.70            9.23003  0.272696
3       34580  paid off      major_purchase  21.11            0.40            2.33482  0.506254
4       14424   default  debt_consolidation  16.46            0.45           12.10320  0.609952
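
The boundary construction above can be checked in isolation. The sketch below reuses the scalings from the printed output, but the class means are made-up placeholders rather than the fitted `loan_lda.means_`. It follows the same construction as the code above: a line through the midpoint of the class means, perpendicular to the scaling vector.

```python
# Scalings copied from the fitted model output above
w = (7.175839, -0.099676)  # (borrower_score, payment_inc_ratio)

# Made-up class means, NOT the fitted loan_lda.means_
means = [(0.45, 8.5), (0.55, 7.5)]
center = tuple(sum(m[j] for m in means) / len(means) for j in range(2))

# Boundary: w[0]*(x - cx) + w[1]*(y - cy) = 0, i.e. y = intercept + slope * x
slope = -w[0] / w[1]
intercept = center[1] - center[0] * slope

def borrower_score_at(payment_inc_ratio):
    """borrower_score on the boundary for a given payment_inc_ratio."""
    return (payment_inc_ratio - intercept) / slope

def discriminant(x, y):
    """Signed score relative to the boundary; zero means on the boundary."""
    return w[0] * (x - center[0]) + w[1] * (y - center[1])

x_0, x_20 = borrower_score_at(0), borrower_score_at(20)
print(x_0, x_20)
print(discriminant(x_0, 0), discriminant(x_20, 20))  # both close to zero
```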

Plot:

fig, ax = plt.subplots(figsize=(4, 4))
g = sns.scatterplot(
    x='borrower_score', y='payment_inc_ratio',
    hue='default', data=lda_df, 
    palette=sns.diverging_palette(240, 10, n=9, as_cmap=True),
    ax=ax, legend=False)
#
ax.set_ylim(0, 20)
ax.set_xlim(0.15, 0.8)
ax.plot((x_0, x_20), (0, 20), linewidth=3)
ax.plot(*loan_lda.means_.transpose())
#
plt.tight_layout()
plt.show()

6.4 - Logistic regression

6.4.1 - Logistic response function and logit

p = np.arange(0.01, 1, 0.01)
df = pd.DataFrame({
    'p': p,
    'logit': np.log(p / (1 - p)),
    'odds': p / (1 - p),
})
#
fig, ax = plt.subplots(figsize=(3, 3))
ax.axhline(0, color='grey', linestyle='--')
ax.axvline(0.5, color='grey', linestyle='--')
ax.plot(df['p'], df['logit'])
ax.set_xlabel('Probability')
ax.set_ylabel('logit(p)')
#
plt.tight_layout()
plt.show()
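
As a small companion to the plot, the logit and its inverse (the logistic response function) can be written in a few lines. The function names here are chosen for illustration.

```python
import math

def logit(p):
    """Map a probability in (0, 1) to the log-odds scale."""
    return math.log(p / (1 - p))

def inverse_logit(x):
    """Logistic response function: map log-odds back to a probability."""
    return 1 / (1 + math.exp(-x))

for p in (0.1, 0.5, 0.9):
    x = logit(p)
    print(p, x, inverse_logit(x))
```

A probability of 0.5 corresponds to log-odds of 0, which is where the two dashed reference lines in the plot cross.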

6.4.2 - Logistic regression and the GLM

The Scikit-Learn package provides the LogisticRegression class, which is specialized for logistic regression.

Statsmodels offers a more general method based on the generalized linear model (GLM).

predictors = [
    'payment_inc_ratio',
    'purpose_',
    'home_',
    'emp_len_', 
    'borrower_score']
#
outcome = 'outcome'
#
X = pd.get_dummies(
    loan_data[predictors], 
    prefix='', 
    prefix_sep='', 
    drop_first=True)
y = loan_data[outcome] # .cat.categories
#
logit_reg = LogisticRegression(
    penalty='l2', 
    C=1e42, 
    solver='liblinear')
#
logit_reg.fit(X, y)
#
print('intercept ', logit_reg.intercept_[0])
print('classes', logit_reg.classes_)
print(pd.DataFrame({'coeff': logit_reg.coef_[0]},index=X.columns))

intercept  -1.6380886373237824
classes ['default' 'paid off']
                       coeff
payment_inc_ratio  -0.079728
borrower_score      4.611037
debt_consolidation -0.249342
home_improvement   -0.407614
major_purchase     -0.229376
medical            -0.510087
other              -0.620534
small_business     -1.215662
OWN                -0.048454
RENT               -0.157355
 > 1 Year           0.357464

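Note that the intercept and coefficients have the opposite sign compared with the R model: Scikit-Learn orders the classes alphabetically (see the classes printed above), so the coefficients here describe the log-odds of 'paid off' rather than of 'default'.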

print(loan_data['purpose_'].cat.categories)
print(loan_data['home_'].cat.categories)
print(loan_data['emp_len_'].cat.categories)

Index(['credit_card', 'debt_consolidation', 'home_improvement',
       'major_purchase', 'medical', 'other', 'small_business'],
      dtype='object')
Index(['MORTGAGE', 'OWN', 'RENT'], dtype='object')
Index([' < 1 Year', ' > 1 Year'], dtype='object')

Not in the book:

If you have a feature or outcome variable that is ordinal, use Scikit-Learn's OrdinalEncoder to replace the categories (here 'paid off' and 'default') with numbers.

In the code below, we replace 'paid off' with 0 and 'default' with 1.

This reverses the order of the predicted classes and, as a consequence, the signs of the coefficients flip.

from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder(categories=[['paid off', 'default']])
y_enc = enc.fit_transform(loan_data[[outcome]]).ravel()
#
logit_reg_enc = LogisticRegression(
    penalty="l2",
    C=1e42,
    solver='liblinear')
logit_reg_enc.fit(X, y_enc)
#
print('intercept ', logit_reg_enc.intercept_[0])
print('classes', logit_reg_enc.classes_)
print(pd.DataFrame({'coeff': logit_reg_enc.coef_[0]},index=X.columns))

intercept  1.6380885435754573
classes [0. 1.]
                       coeff
payment_inc_ratio   0.079728
borrower_score     -4.611037
debt_consolidation  0.249342
home_improvement    0.407614
major_purchase      0.229376
medical             0.510087
other               0.620534
small_business      1.215662
OWN                 0.048453
RENT                0.157355
 > 1 Year          -0.357463
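
The sign flip can be reproduced on a toy problem. The sketch below fits a one-predictor logistic regression by Newton's method on a made-up dataset, once with the original 0/1 labels and once with the labels swapped. It is an illustration only, not the solver Scikit-Learn uses.

```python
import math

def fit_logistic(x, y, n_iter=25):
    """Fit intercept b0 and slope b1 by Newton's method (one predictor)."""
    b0 = b1 = 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1 / (1 + math.exp(-(b0 + b1 * xi)))
            w = p * (1 - p)
            g0 += yi - p           # gradient of the log-likelihood
            g1 += (yi - p) * xi
            h00 += w               # negative Hessian entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Made-up data: labels follow the sign of x, with an alternating overlap zone
ks = range(-20, 21)
x = [k / 10 for k in ks]
y = [1 if k + (8 if k % 2 == 0 else -8) > 0 else 0 for k in ks]
y_flipped = [1 - yi for yi in y]

b0, b1 = fit_logistic(x, y)
f0, f1 = fit_logistic(x, y_flipped)
print(b0, b1)
print(f0, f1)
```

Swapping which class counts as 1 changes nothing about the fit except the sign of the intercept and slope.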

6.4.3 - Predicted values from logistic regression

pred = pd.DataFrame(
    logit_reg.predict_log_proba(X),
    columns=logit_reg.classes_)
print("predict_log_proba: ", pred.describe())
#
pred = pd.DataFrame(
    logit_reg.predict_proba(X),
    columns=logit_reg.classes_)
print("predict_proba: ", pred.describe())

predict_log_proba:              default      paid off
count  45342.000000  45342.000000
mean      -0.757850     -0.760423
std        0.378032      0.390419
min       -2.768873     -3.538865
25%       -0.985728     -0.977164
50%       -0.697366     -0.688946
75%       -0.472209     -0.467076
max       -0.029476     -0.064787
predict_proba:              default      paid off
count  45342.000000  45342.000000
mean       0.500001      0.499999
std        0.167336      0.167336
min        0.062733      0.029046
25%        0.373167      0.376377
50%        0.497895      0.502105
75%        0.623623      0.626833
max        0.970954      0.937267

6.4.4 - Interpreting the coefficients and odds ratios

fig, ax = plt.subplots(figsize=(3, 3))
ax.plot(df['logit'], df['odds'])
ax.set_xlabel('log(odds ratio)')
ax.set_ylabel('odds ratio')
ax.set_xlim(0, 5.1)
ax.set_ylim(-5, 105)
#
plt.tight_layout()
plt.show()
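
To make the interpretation concrete, a coefficient can be turned into an odds ratio by exponentiating it. The sketch below copies three coefficients from the encoded model printed earlier (where 1 = default); the dictionary name is chosen for illustration.

```python
import math

# Coefficients copied from the encoded model output (classes 0 = paid off, 1 = default)
coefficients = {
    'payment_inc_ratio': 0.079728,
    'small_business': 1.215662,
    'borrower_score': -4.611037,
}

# exp(coef) is the multiplicative change in the odds of default
# for a one-unit increase in the predictor
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in odds_ratios.items():
    print(f'{name}: odds ratio = {ratio:.4f}')
```

For a dummy variable such as `small_business`, the ratio compares that category with the reference category dropped by `drop_first=True` (here `credit_card`).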

6.4.5 - Assessing the model

For comparison, here is the GLM model fitted with the statsmodels package.

This method requires the outcome to be mapped to numbers.

Use GLM (generalized linear model) with the binomial family to fit a logistic regression.

y_numbers = [1 if yi == 'default' else 0 for yi in y]
logit_reg_sm = sm.GLM(
    y_numbers, X.assign(const=1), 
    family=sm.families.Binomial())
logit_result = logit_reg_sm.fit()
print("logit_reg_sm: ", logit_result.summary())

logit_reg_sm:                   Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                      y   No. Observations:                45342
Model:                            GLM   Df Residuals:                    45330
Model Family:                Binomial   Df Model:                           11
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -28757.
Date:                Wed, 31 Jul 2024   Deviance:                       57515.
Time:                        14:02:04   Pearson chi2:                 4.54e+04
No. Iterations:                     4   Pseudo R-squ. (CS):             0.1112
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
payment_inc_ratio      0.0797      0.002     32.058      0.000       0.075       0.085
borrower_score        -4.6126      0.084    -55.203      0.000      -4.776      -4.449
debt_consolidation     0.2494      0.028      9.030      0.000       0.195       0.303
home_improvement       0.4077      0.047      8.747      0.000       0.316       0.499
major_purchase         0.2296      0.054      4.277      0.000       0.124       0.335
medical                0.5105      0.087      5.882      0.000       0.340       0.681
other                  0.6207      0.039     15.738      0.000       0.543       0.698
small_business         1.2153      0.063     19.192      0.000       1.091       1.339
OWN                    0.0483      0.038      1.271      0.204      -0.026       0.123
RENT                   0.1573      0.021      7.420      0.000       0.116       0.199
 > 1 Year             -0.3567      0.053     -6.779      0.000      -0.460      -0.254
const                  1.6381      0.074     22.224      0.000       1.494       1.783
======================================================================================

6.5 - Splines

import statsmodels.formula.api as smf
formula = ('outcome ~ bs(payment_inc_ratio, df=8) + purpose_ + ' +
           'home_ + emp_len_ + bs(borrower_score, df=3)')
model = smf.glm(formula=formula, data=loan_data, family=sm.families.Binomial())
results = model.fit()
print(results.summary())

                             Generalized Linear Model Regression Results                             
=====================================================================================================
Dep. Variable:     ['outcome[default]', 'outcome[paid off]']   No. Observations:                45342
Model:                                                   GLM   Df Residuals:                    45321
Model Family:                                       Binomial   Df Model:                           20
Link Function:                                         Logit   Scale:                          1.0000
Method:                                                 IRLS   Log-Likelihood:                -28731.
Date:                                       Wed, 31 Jul 2024   Deviance:                       57462.
Time:                                               14:02:04   Pearson chi2:                 4.54e+04
No. Iterations:                                            6   Pseudo R-squ. (CS):             0.1122
Covariance Type:                                   nonrobust                                         
==================================================================================================
                                     coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------------
Intercept                          1.5756      0.331      4.765      0.000       0.928       2.224
purpose_[T.debt_consolidation]     0.2486      0.028      8.998      0.000       0.194       0.303
purpose_[T.home_improvement]       0.4097      0.047      8.757      0.000       0.318       0.501
purpose_[T.major_purchase]         0.2382      0.054      4.416      0.000       0.132       0.344
purpose_[T.medical]                0.5206      0.087      5.980      0.000       0.350       0.691
purpose_[T.other]                  0.6284      0.040     15.781      0.000       0.550       0.706
purpose_[T.small_business]         1.2250      0.063     19.305      0.000       1.101       1.349
home_[T.OWN]                       0.0498      0.038      1.309      0.191      -0.025       0.124
home_[T.RENT]                      0.1577      0.021      7.431      0.000       0.116       0.199
emp_len_[T. > 1 Year]             -0.3526      0.053     -6.699      0.000      -0.456      -0.249
bs(payment_inc_ratio, df=8)[0]     0.7042      0.342      2.060      0.039       0.034       1.374
bs(payment_inc_ratio, df=8)[1]     0.6621      0.198      3.351      0.001       0.275       1.049
bs(payment_inc_ratio, df=8)[2]     0.8118      0.245      3.309      0.001       0.331       1.293
bs(payment_inc_ratio, df=8)[3]     1.0377      0.223      4.644      0.000       0.600       1.476
bs(payment_inc_ratio, df=8)[4]     1.1901      0.233      5.112      0.000       0.734       1.646
bs(payment_inc_ratio, df=8)[5]     2.8404      0.316      8.980      0.000       2.220       3.460
bs(payment_inc_ratio, df=8)[6]    -1.3427      1.229     -1.092      0.275      -3.752       1.067
bs(payment_inc_ratio, df=8)[7]     7.1094      6.393      1.112      0.266      -5.420      19.639
bs(borrower_score, df=3)[0]       -2.9011      0.533     -5.448      0.000      -3.945      -1.857
bs(borrower_score, df=3)[1]       -2.6056      0.196    -13.284      0.000      -2.990      -2.221
bs(borrower_score, df=3)[2]       -5.7421      0.508    -11.313      0.000      -6.737      -4.747
==================================================================================================

Plot:

from statsmodels.genmod.generalized_linear_model import GLMResults
def partialResidualPlot(model, df, outcome, feature, fig, ax):
    y_actual = [0 if s == 'default' else 1 for s in df[outcome]]
    y_pred = model.predict(df)
    org_params = model.params.copy()
    zero_params = model.params.copy()
    # set the model parameters of all other features to 0
    for i, name in enumerate(zero_params.index):
        if feature in name:
            continue
        zero_params.iloc[i] = 0.0
    model.initialize(model.model, zero_params)
    feature_prediction = model.predict(df)
    ypartial = -np.log(1/feature_prediction - 1)
    ypartial = ypartial - np.mean(ypartial)
    model.initialize(model.model, org_params)
    results = pd.DataFrame({
        'feature': df[feature],
        'residual': -2 * (y_actual - y_pred),
        'ypartial': ypartial/ 2,
    })
    results = results.sort_values(by=['feature'])
    #
    ax.scatter(results.feature, results.residual, marker=".", s=72./fig.dpi)
    ax.plot(results.feature, results.ypartial, color='black')
    ax.set_xlabel(feature)
    ax.set_ylabel(f'Residual + {feature} contribution')
    return ax
#
fig, ax = plt.subplots(figsize=(5, 5))
partialResidualPlot(results, loan_data, 'outcome', 'payment_inc_ratio', fig, ax)
ax.set_xlim(0, 25)
ax.set_ylim(-2.5, 2.5)
#
plt.tight_layout()
plt.show()

6.6 - Evaluating classification models

6.6.1 - Confusion matrix

pred = logit_reg.predict(X)
pred_y = logit_reg.predict(X) == 'default'
true_y = y == 'default'
true_pos = true_y & pred_y
true_neg = ~true_y & ~pred_y
false_pos = ~true_y & pred_y
false_neg = true_y & ~pred_y
#
conf_mat = pd.DataFrame([[np.sum(true_pos), np.sum(false_neg)], [np.sum(false_pos), np.sum(true_neg)]],
                       index=['Y = default', 'Y = paid off'],
                       columns=['Yhat = default', 'Yhat = paid off'])
print(conf_mat)
#
print(confusion_matrix(y, logit_reg.predict(X)))

              Yhat = default  Yhat = paid off
Y = default            14336             8335
Y = paid off            8148            14523
[[14336  8335]
 [ 8148 14523]]

The _dmba_ package contains the function `classificationSummary`, which prints the confusion matrix and accuracy for a classification model.

# classificationSummary prints its report itself and returns None, so no print() wrapper is needed
classificationSummary(
    y,
    logit_reg.predict(X),
    class_names=logit_reg.classes_)

Confusion Matrix (Accuracy 0.6365)

         Prediction
  Actual  default paid off
 default    14336     8335
paid off     8148    14523

6.6.2 - Precision, recall, and specificity

The _Scikit-Learn_ function `precision_recall_fscore_support` returns precision, recall, F-beta score, and support.

conf_mat = confusion_matrix(y, logit_reg.predict(X))
#
print('Precision', conf_mat[0, 0] / sum(conf_mat[:, 0]))
print('Recall', conf_mat[0, 0] / sum(conf_mat[0, :]))
print('Specificity', conf_mat[1, 1] / sum(conf_mat[1, :]))
#
prfs = precision_recall_fscore_support(
    y,
    logit_reg.predict(X), 
    labels=['default', 'paid off'])
print(prfs)

Precision 0.6376089663760897
Recall 0.6323496978518812
Specificity 0.6405981209474659
(array([0.63760897, 0.63535742]), array([0.6323497 , 0.64059812]), array([0.63496844, 0.63796701]), array([22671, 22671], dtype=int64))
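
The same metrics can be recomputed directly from the four counts in the confusion matrix, which makes the definitions explicit. The counts below are copied from the output above, with 'default' treated as the positive class.

```python
# Counts copied from the confusion matrix above ('default' is the positive class)
tp, fn = 14336, 8335   # actual default:  predicted default / predicted paid off
fp, tn = 8148, 14523   # actual paid off: predicted default / predicted paid off

precision = tp / (tp + fp)      # share of predicted defaults that are real defaults
recall = tp / (tp + fn)         # share of real defaults that are found
specificity = tn / (tn + fp)    # share of real paid-off loans that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + fn + fp + tn)

print('precision', precision)
print('recall', recall)
print('specificity', specificity)
print('f1', f1)
print('accuracy', accuracy)
```

The F1 value matches the first entry of the third array returned by `precision_recall_fscore_support`, and the accuracy matches the figure reported by `classificationSummary`.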

6.6.3 - ROC curve

The `roc_curve` function in _Scikit-Learn_ computes all the information needed to plot a ROC curve.

fpr, tpr, thresholds = roc_curve(
    y,
    logit_reg.predict_proba(X)[:, 0], 
    pos_label='default')
roc_df = pd.DataFrame({'recall': tpr, 'specificity': 1 - fpr})
ax = roc_df.plot(x='specificity', y='recall', figsize=(4, 4), legend=False)
ax.set_ylim(0, 1)
ax.set_xlim(1, 0)
ax.plot((1, 0), (0, 1))
ax.set_xlabel('specificity')
ax.set_ylabel('recall')
plt.tight_layout()
plt.show()

6.6.4 - AUC

The AUC can be computed with the _Scikit-Learn_ function `roc_auc_score`; the first line below approximates the same area directly from the ROC curve.

print(np.sum(roc_df.recall[:-1] * np.diff(1 - roc_df.specificity)))
print(roc_auc_score([1 if yi == 'default' else 0 for yi in y], logit_reg.predict_proba(X)[:, 0]))

fpr, tpr, thresholds = roc_curve(y, logit_reg.predict_proba(X)[:,0], 
                                 pos_label='default')
roc_df = pd.DataFrame({'recall': tpr, 'specificity': 1 - fpr})

ax = roc_df.plot(x='specificity', y='recall', figsize=(4, 4), legend=False)
ax.set_ylim(0, 1)
ax.set_xlim(1, 0)
# ax.plot((1, 0), (0, 1))
ax.set_xlabel('specificity')
ax.set_ylabel('recall')
ax.fill_between(roc_df.specificity, 0, roc_df.recall, alpha=0.3)

plt.tight_layout()
plt.show()

0.6917107933430462
0.691710871167958
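
What the ROC curve and AUC measure can be sketched without any library: sweep the threshold from the highest score down, record the true and false positive rates, and integrate. The labels and scores below are made up and contain no tied scores.

```python
# Made-up labels (1 = positive) and scores, with no tied scores
y_true = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

n_pos = sum(y_true)
n_neg = len(y_true) - n_pos

# Lower the threshold one observation at a time, from the highest score down
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
tpr, fpr = [0.0], [0.0]
tp = fp = 0
for i in order:
    if y_true[i] == 1:
        tp += 1
    else:
        fp += 1
    tpr.append(tp / n_pos)
    fpr.append(fp / n_neg)

# Area under the curve by the trapezoidal rule
auc = sum((fpr[k + 1] - fpr[k]) * (tpr[k + 1] + tpr[k]) / 2
          for k in range(len(fpr) - 1))

# Cross-check: share of (positive, negative) pairs that are ranked correctly
pairs = [(scores[i], scores[j])
         for i in range(len(y_true)) for j in range(len(y_true))
         if y_true[i] == 1 and y_true[j] == 0]
auc_pairs = sum(sp > sn for sp, sn in pairs) / len(pairs)

print(auc, auc_pairs)
```

The two numbers agree, which reflects the usual reading of the AUC as the probability that a randomly chosen positive is scored above a randomly chosen negative.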

6.7 - Strategies for imbalanced data

6.7.1 - Undersampling

The model-based results are of similar magnitude.

full_train_set = pd.read_csv(FULL_TRAIN_SET_CSV)
print(full_train_set.shape)
#
print(
    'percentage of loans in default: ', 
    100 * np.mean(full_train_set.outcome == 'default'))
#
predictors = [
    'payment_inc_ratio', 'purpose_', 'home_', 'emp_len_', 
    'dti', 'revol_bal', 'revol_util']
outcome = 'outcome'
X = pd.get_dummies(
    full_train_set[predictors], prefix='',
    prefix_sep='', drop_first=True)
y = full_train_set[outcome]

full_model = LogisticRegression(
    penalty='l2', C=1e42, solver='liblinear')
full_model.fit(X, y)
print(
    'percentage of loans predicted to default: ',
    100 * np.mean(full_model.predict(X) == 'default'))
# ratio of actual to predicted default rate (inf when no loan is predicted to default)
print(
    np.mean(full_train_set.outcome == 'default') / 
    np.mean(full_model.predict(X) == 'default'))

(119987, 19)
percentage of loans in default:  18.894546909248504
percentage of loans predicted to default:  0.0
inf
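
The model above is fitted on the full, imbalanced data. Undersampling itself can be sketched as follows: keep every minority record and draw an equally sized random sample of the majority. The label list is made up to mimic roughly the 19% default rate shown above.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Made-up labels with roughly the imbalance seen above (about 19% default)
labels = ['default'] * 19 + ['paid off'] * 81

minority_idx = [i for i, lab in enumerate(labels) if lab == 'default']
majority_idx = [i for i, lab in enumerate(labels) if lab == 'paid off']

# Keep every minority row and an equally sized random sample of the majority
sampled_majority = random.sample(majority_idx, k=len(minority_idx))
keep_idx = sorted(minority_idx + sampled_majority)

balanced = [labels[i] for i in keep_idx]
print(len(balanced), balanced.count('default') / len(balanced))
```

With a DataFrame, the same positions could be used to select rows via `.iloc` before fitting the model.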

6.7.2 - Oversampling and up/down weighting

default_wt = 1 / np.mean(full_train_set.outcome == 'default')
wt = [default_wt if outcome == 'default' else 1 for outcome in full_train_set.outcome]
#
full_model = LogisticRegression(penalty="l2", C=1e42, solver='liblinear')
full_model.fit(X, y, sample_weight=wt)
#
print(
    'percentage of loans predicted to default (weighting): ',
    100 * np.mean(full_model.predict(X) == 'default'))
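
The effect of the weight `1 / (default rate)` can be checked on a toy label list. The labels below are made up (20% default); the arithmetic follows the weighting scheme used in the code above.

```python
# Made-up labels: 20% default
labels = ['default'] * 20 + ['paid off'] * 80

default_rate = labels.count('default') / len(labels)
default_wt = 1 / default_rate
weights = [default_wt if lab == 'default' else 1 for lab in labels]

# Total weight carried by each class
weight_default = sum(w for w, lab in zip(weights, labels) if lab == 'default')
weight_paid_off = sum(w for w, lab in zip(weights, labels) if lab == 'paid off')

print(default_wt, weight_default, weight_paid_off)
```

Under this scheme the total weight of the default class equals the total number of records, so the minority class ends up weighing slightly more than the majority class rather than exactly the same.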

6.7.3 - Data Generation

The package _imbalanced-learn_ provides an implementation of the _SMOTE_ and similar algorithms.

X_resampled, y_resampled = SMOTE().fit_resample(X, y)
print('percentage of loans in default (SMOTE resampled): ', 
      100 * np.mean(y_resampled == 'default'))

full_model = LogisticRegression(penalty="l2", C=1e42, solver='liblinear')
full_model.fit(X_resampled, y_resampled)
print('percentage of loans predicted to default (SMOTE): ', 
      100 * np.mean(full_model.predict(X) == 'default'))


X_resampled, y_resampled = ADASYN().fit_resample(X, y)
print('percentage of loans in default (ADASYN resampled): ', 
      100 * np.mean(y_resampled == 'default'))

full_model = LogisticRegression(penalty="l2", C=1e42, solver='liblinear')
full_model.fit(X_resampled, y_resampled)
print('percentage of loans predicted to default (ADASYN): ', 
      100 * np.mean(full_model.predict(X) == 'default'))

percentage of loans in default (SMOTE resampled):  50.0
percentage of loans predicted to default (SMOTE):  29.254835940560227
percentage of loans in default (ADASYN resampled):  48.56040383751355
percentage of loans predicted to default (ADASYN):  27.323793410952852
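
The core idea behind SMOTE, interpolating between a minority record and one of its minority-class neighbors, can be sketched as below. This is a simplified illustration with made-up points that uses only the single nearest neighbor; it is not the _imbalanced-learn_ implementation, which chooses among several nearest neighbors.

```python
import random

random.seed(1)  # fixed seed so the sketch is reproducible

# Made-up minority-class points in two dimensions
minority = [(1.0, 2.0), (1.5, 1.8), (3.0, 3.5), (1.2, 2.4)]

def nearest_neighbor(point, others):
    """Return the point in others with the smallest squared Euclidean distance."""
    return min(others, key=lambda q: sum((a - b) ** 2 for a, b in zip(point, q)))

def synthesize(point, points):
    """Create one synthetic point on the segment to the nearest other minority point."""
    others = [q for q in points if q != point]
    neighbor = nearest_neighbor(point, others)
    u = random.random()  # position along the segment, in [0, 1)
    return tuple(a + u * (b - a) for a, b in zip(point, neighbor))

new_point = synthesize(minority[0], minority)
print(new_point)
```

Because each synthetic record lies between two real minority records, the resampled data stays within the region the minority class already occupies.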

Arduino

Co-author

Betobyte

Author

Authors
