Statistics for Data Scientists

5 - Regression and Prediction

5.1 - Data Preparation

5.1.1 - Importing the Python Packages

from pathlib import Path
#
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.linear_model import LinearRegression
#
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence
#
from pygam import LinearGAM, s, l
from pygam.datasets import wage
#
import seaborn as sns
import matplotlib.pyplot as plt
#
from dmba import stepwise_selection
from dmba import AIC_score

5.1.2 - Data Directory

The DATA directory contains the .csv files used in the examples.

DATA = './'

5.1.3 - Paths to the Data Sets

If you do not keep your data in the same directory as the code, adapt the path names.

LUNG_CSV = DATA + 'LungDisease.csv'
HOUSE_CSV = DATA + 'house_sales.csv'

5.2 - Simple Linear Regression

5.2.1 - The Regression Equation

lung = pd.read_csv(LUNG_CSV)
print(lung.head())

   PEFR  Exposure
0   390         0
1   410         0
2   430         0
3   460         0
4   420         1

Scatter plot:

lung.plot.scatter(x='Exposure', y='PEFR')
plt.tight_layout()
plt.show()

We can use the LinearRegression model from _Scikit-Learn_.

predictors = ['Exposure']
outcome = 'PEFR'
#
model = LinearRegression()
model.fit(lung[predictors], lung[outcome])
#
print(f'Intercept: {model.intercept_:.3f}')
print(f'Coefficient Exposure: {model.coef_[0]:.3f}')

Intercept: 424.583
Coefficient Exposure: -4.185
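
The intercept and slope printed above are least-squares estimates. As a minimal sketch of the arithmetic behind a single-predictor fit, the plain-Python function below computes the slope as the sum of cross-deviations divided by the sum of squared deviations of x, and the intercept from the means. The name simple_ols and the toy data are illustrative assumptions, not part of the original example.

```python
from statistics import mean

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy data (hypothetical): points lying exactly on y = 3 - 2x
x_toy = [0, 1, 2, 3, 4]
y_toy = [3, 1, -1, -3, -5]
b0, b1 = simple_ols(x_toy, y_toy)
print(f'Intercept: {b0:.3f}')
print(f'Slope: {b1:.3f}')
```

Applied to the Exposure and PEFR columns, the same formulas should reproduce the values reported by LinearRegression, up to floating-point rounding.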

Plot of the fitted line, marking the intercept b0 and the slope b1:

fig, ax = plt.subplots(figsize=(4, 4))
ax.set_xlim(0, 23)
ax.set_ylim(295, 450)
ax.set_xlabel('Exposure')
ax.set_ylabel('PEFR')
ax.plot((0, 23), model.predict(pd.DataFrame({'Exposure': [0, 23]})))
ax.text(0.4, model.intercept_, r'$b_0$', size='larger')
#
x = pd.DataFrame({'Exposure': [7.5,17.5]})
y = model.predict(x)
ax.plot((7.5, 7.5, 17.5), (y[0], y[1], y[1]), '--')
ax.text(5, np.mean(y), r'$\Delta Y$', size='larger')
ax.text(12, y[1] - 10, r'$\Delta X$', size='larger')
ax.text(12, 390, r'$b_1 = \frac{\Delta Y}{\Delta X}$', size='larger')
#
plt.tight_layout()
plt.show()

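5.2.2 - Fitted Values and Residuals

The predict method of a fitted _Scikit-Learn_ model returns the fitted values for the training data and can also be used to predict new data points. The residuals are the observed values minus the fitted values.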

fitted = model.predict(lung[predictors])
residuals = lung[outcome] - fitted
#
ax = lung.plot.scatter(x='Exposure', y='PEFR', figsize=(4, 4))
ax.plot(lung.Exposure, fitted)
for x, yactual, yfitted in zip(lung.Exposure, lung.PEFR, fitted): 
    ax.plot((x, x), (yactual, yfitted), '--', color='C1')
#
plt.tight_layout()
plt.show()

5.3 - Multiple Linear Regression

Loading the house data set:

house = pd.read_csv(HOUSE_CSV, sep='\t')
print(house.head())
#
subset_house = [
    'AdjSalePrice',
    'SqFtTotLiving',
    'SqFtLot',
    'Bathrooms', 
    'Bedrooms',
    'BldgGrade']
print(house[subset_house].head())
#
outcome_house = 'AdjSalePrice'

  DocumentDate  SalePrice  PropertyID   PropertyType          ym  zhvi_px  zhvi_idx  AdjSalePrice  NbrLivingUnits  ...  Bedrooms  BldgGrade  YrBuilt  YrRenovated  TrafficNoise  LandVal  ImpsVal  ZipCode  NewConstruction
1   2014-09-16     280000     1000102      Multiplex  2014-09-01   405100  0.930836      300805.0               2  ...         6          7     1991            0             0    70000   229000    98002            False
2   2006-06-16    1000000     1200013  Single Family  2006-06-01   404400  0.929228     1076162.0               1  ...         4         10     2005            0             0   203000   590000    98166             True
3   2007-01-29     745000     1200019  Single Family  2007-01-01   425600  0.977941      761805.0               1  ...         4          8     1947            0             0   183000   275000    98166            False
4   2008-02-25     425000     2800016  Single Family  2008-02-01   418400  0.961397      442065.0               1  ...         5          7     1966            0             0   104000   229000    98168            False
5   2013-03-29     240000     2800024  Single Family  2013-03-01   351600  0.807904      297065.0               1  ...         4          7     1948            0             0   104000   205000    98168            False

[5 rows x 22 columns]
   AdjSalePrice  SqFtTotLiving  SqFtLot  Bathrooms  Bedrooms  BldgGrade
1      300805.0           2400     9373       3.00         6          7
2     1076162.0           3764    20156       3.75         4         10
3      761805.0           2060    26036       1.75         4          8
4      442065.0           3200     8618       3.75         5          7
5      297065.0           1720     8620       1.75         4          7

#
#
predictors = [
    'SqFtTotLiving',
    'SqFtLot',
    'Bathrooms', 
    'Bedrooms',
    'BldgGrade']
#
house_lm = LinearRegression()
house_lm.fit(house[predictors], house[outcome_house])
#
print(f'Intercept: {house_lm.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(predictors, house_lm.coef_):
    print(f' {name}: {coef}')

Intercept: -521871.368
Coefficients:
 SqFtTotLiving: 228.83060360240822
 SqFtLot: -0.060466820653058306
 Bathrooms: -19442.840398321114
 Bedrooms: -47769.95518521422
 BldgGrade: 106106.96307898105

5.3.1 - Assessing the Model

The _Scikit-Learn_ metrics module provides a number of metrics for assessing the quality of a model. Here we use mean_squared_error, from which we compute the RMSE, and r2_score.

fitted = house_lm.predict(house[predictors])
RMSE = np.sqrt(mean_squared_error(house[outcome_house], fitted))
r2 = r2_score(house[outcome_house], fitted)
print(f'RMSE: {RMSE:.0f}')
print(f'r2: {r2:.4f}')

RMSE: 261220
r2: 0.5406
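
To make the two metrics concrete, here is a minimal plain-Python sketch of what they compute: RMSE is the square root of the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. The function names rmse and r_squared and the toy values are assumptions for illustration only.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    n = len(y_true)
    return sqrt(sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred)) / n)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - y_bar) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot

# Toy values (hypothetical)
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 6.0, 6.0, 10.0]
print(f'RMSE: {rmse(y_true, y_pred):.3f}')
print(f'r2: {r_squared(y_true, y_pred):.3f}')
```

Note that this RMSE divides by the number of observations, as mean_squared_error does, not by the residual degrees of freedom.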

While _Scikit-Learn_ provides a variety of metrics, _statsmodels_ provides a more in-depth analysis of the linear regression model.

This package offers two ways to specify a model: one that is similar to _Scikit-Learn_, and another that uses _R_-style formulas.

Here we use the first approach. Because _statsmodels_ does not add an intercept automatically, we need to add a constant column with value 1 to the predictors.

We can use the _pandas_ assign method for this.

model = sm.OLS(house[outcome_house], house[predictors].assign(const=1))
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           AdjSalePrice   R-squared:                       0.541
Model:                            OLS   Adj. R-squared:                  0.540
Method:                 Least Squares   F-statistic:                     5338.
Date:                Wed, 31 Jul 2024   Prob (F-statistic):               0.00
Time:                        14:01:53   Log-Likelihood:            -3.1517e+05
No. Observations:               22687   AIC:                         6.304e+05
Df Residuals:                   22681   BIC:                         6.304e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
SqFtTotLiving   228.8306      3.899     58.694      0.000     221.189     236.472
SqFtLot          -0.0605      0.061     -0.988      0.323      -0.180       0.059
Bathrooms     -1.944e+04   3625.388     -5.363      0.000   -2.65e+04   -1.23e+04
Bedrooms      -4.777e+04   2489.732    -19.187      0.000   -5.27e+04   -4.29e+04
BldgGrade      1.061e+05   2396.445     44.277      0.000    1.01e+05    1.11e+05
const         -5.219e+05   1.57e+04    -33.342      0.000   -5.53e+05   -4.91e+05
==============================================================================
Omnibus:                    29676.557   Durbin-Watson:                   1.247
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         19390738.346
Skew:                           6.889   Prob(JB):                         0.00
Kurtosis:                     145.559   Cond. No.                     2.86e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.86e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

5.3.2 - Model Selection and Stepwise Regression

predictors = [
    'SqFtTotLiving', 'SqFtLot',
    'Bathrooms', 'Bedrooms',
    'BldgGrade', 'PropertyType',
    'NbrLivingUnits', 'SqFtFinBasement',
    'YrBuilt', 'YrRenovated', 
    'NewConstruction']
X = pd.get_dummies(house[predictors], drop_first=True)
X['NewConstruction'] = [1 if nc else 0 for nc in X['NewConstruction']]
house_full = sm.OLS(house[outcome_house], X.assign(const=1))
results = house_full.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           AdjSalePrice   R-squared:                       0.595
Model:                            OLS   Adj. R-squared:                  0.594
Method:                 Least Squares   F-statistic:                     2771.
Date:                Wed, 31 Jul 2024   Prob (F-statistic):               0.00
Time:                        14:01:53   Log-Likelihood:            -3.1375e+05
No. Observations:               22687   AIC:                         6.275e+05
Df Residuals:                   22674   BIC:                         6.276e+05
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
SqFtTotLiving                198.6364      4.234     46.920      0.000     190.338     206.934
SqFtLot                        0.0771      0.058      1.330      0.184      -0.037       0.191
Bathrooms                   4.286e+04   3808.114     11.255      0.000    3.54e+04    5.03e+04
Bedrooms                   -5.187e+04   2396.904    -21.638      0.000   -5.66e+04   -4.72e+04
BldgGrade                   1.373e+05   2441.242     56.228      0.000    1.32e+05    1.42e+05
NbrLivingUnits              5723.8438   1.76e+04      0.326      0.744   -2.87e+04    4.01e+04
SqFtFinBasement                7.0611      4.627      1.526      0.127      -2.009      16.131
YrBuilt                    -3574.2210     77.228    -46.282      0.000   -3725.593   -3422.849
YrRenovated                   -2.5311      3.924     -0.645      0.519     -10.222       5.160
NewConstruction            -2489.1122   5936.692     -0.419      0.675   -1.41e+04    9147.211
PropertyType_Single Family  2.997e+04   2.61e+04      1.149      0.251   -2.12e+04    8.11e+04
PropertyType_Townhouse      9.286e+04    2.7e+04      3.438      0.001    3.99e+04    1.46e+05
const                       6.182e+06   1.55e+05     39.902      0.000    5.88e+06    6.49e+06
==============================================================================
Omnibus:                    31006.128   Durbin-Watson:                   1.393
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         26251977.078
Skew:                           7.427   Prob(JB):                         0.00
Kurtosis:                     168.984   Cond. No.                     2.98e+06
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.98e+06. This might indicate that there are
strong multicollinearity or other numerical problems.

We can use the stepwise_selection function from the _dmba_ package.

y = house[outcome_house]
#
def train_model(variables):
    if len(variables) == 0:
        return None
    model = LinearRegression()
    model.fit(X[variables], y)
    return model
#
def score_model(model, variables):
    if len(variables) == 0:
        return AIC_score(y, [y.mean()] * len(y), model, df=1)
    return AIC_score(y, model.predict(X[variables]), model)
#
best_model, best_variables = stepwise_selection(
    X.columns, train_model,
    score_model, verbose=True)
#
print()
print(f'Intercept: {best_model.intercept_:.3f}')
print('Coefficients:')
z = zip(best_variables, best_model.coef_)
for name, coef in z:
    print(f' {name}: {coef}')

Variables: SqFtTotLiving, SqFtLot, Bathrooms, Bedrooms, BldgGrade, NbrLivingUnits, SqFtFinBasement, YrBuilt, YrRenovated, NewConstruction, PropertyType_Single Family, PropertyType_Townhouse
Start: score=647988.32, constant
Step: score=633013.35, add SqFtTotLiving
Step: score=630793.74, add BldgGrade
Step: score=628230.29, add YrBuilt
Step: score=627784.16, add Bedrooms
Step: score=627602.21, add Bathrooms
Step: score=627525.65, add PropertyType_Townhouse
Step: score=627525.08, add SqFtFinBasement
Step: score=627524.98, add PropertyType_Single Family
Step: score=627524.98, unchanged None

Intercept: 6178645.017
Coefficients:
 SqFtTotLiving: 199.27755304201412
 BldgGrade: 137159.56022619843
 YrBuilt: -3565.424939249455
 Bedrooms: -51947.38367361435
 Bathrooms: 42396.16452772081
 PropertyType_Townhouse: 84479.16203299857
 SqFtFinBasement: 7.046974967583083
 PropertyType_Single Family: 22912.055187017773
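
The score minimized in the steps above is an AIC value. As a rough sketch, assuming Gaussian errors, AIC can be written as n + n·log(2π·SSE/n) + 2·(p + 1), where p counts the regression coefficients including the intercept and the extra 1 accounts for the error variance. The function gaussian_aic below is a hypothetical helper written for this illustration; the exact constant used by AIC_score in _dmba_ may differ, and only differences in AIC between models fitted to the same data matter.

```python
from math import log, pi

def gaussian_aic(y_true, y_pred, n_params):
    """AIC for a least-squares fit under Gaussian errors.

    n_params counts the regression coefficients including the intercept;
    one extra parameter is added for the error variance.
    """
    n = len(y_true)
    sse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    return n + n * log(2 * pi * sse / n) + 2 * (n_params + 1)

# Toy values (hypothetical): two models with identical predictions
y_obs = [1.0, 2.0, 4.0, 3.0, 5.0]
pred = [1.2, 2.1, 3.0, 3.9, 4.8]
aic_small = gaussian_aic(y_obs, pred, n_params=2)  # e.g. intercept + 1 slope
aic_large = gaussian_aic(y_obs, pred, n_params=4)  # two extra, useless terms
print(aic_small, aic_large)
```

With the same fit quality, each additional parameter raises AIC by 2, which is why the stepwise procedure stops adding variables once the reduction in SSE no longer offsets the penalty.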

5.3.3 - Weighted Regression

We can compute the year from the date column using either a list comprehension or the apply method of the column. The first two lines below are equivalent.

house['Year'] = [int(date.split('-')[0]) for date in house.DocumentDate]
house['Year'] = house.DocumentDate.apply(lambda d: int(d.split('-')[0]))
house['Weight'] = house.Year - 2005
#
predictors = [
    'SqFtTotLiving', 'SqFtLot', 'Bathrooms', 
    'Bedrooms', 'BldgGrade']
#
house_wt = LinearRegression()
house_wt.fit(house[predictors], house[outcome_house], sample_weight=house.Weight)
#
pd.concat([
    pd.DataFrame({
        'predictor': predictors,
        'house_lm': house_lm.coef_,
        'house_wt': house_wt.coef_,    
    }),
    pd.DataFrame({
        'predictor': ['intercept'],
        'house_lm': house_lm.intercept_,
        'house_wt': house_wt.intercept_,    
    })
])
#
residuals = pd.DataFrame({
    'abs_residual_lm': np.abs(house_lm.predict(house[predictors]) - house[outcome_house]),
    'abs_residual_wt': np.abs(house_wt.predict(house[predictors]) - house[outcome_house]),
    'Year': house['Year'],
})
print(residuals.head())

   abs_residual_lm  abs_residual_wt  Year
1    123750.814194    107108.553965  2014
2     59145.413089     96191.882094  2006
3    190108.725716    187004.492880  2007
4    198788.774412    196132.996857  2008
5     91774.996129     84277.577512  2013

# axes = residuals.boxplot(['abs_residual_lm', 'abs_residual_wt'], by='Year', figsize=(10, 4))
# axes[0].set_ylim(0, 300000)
#
pd.DataFrame(([year, np.mean(group['abs_residual_lm']), np.mean(group['abs_residual_wt'])] 
              for year, group in residuals.groupby('Year')),
             columns=['Year', 'mean abs_residual_lm', 'mean abs_residual_wt'])
# for year, group in residuals.groupby('Year'):
#     print(year, np.mean(group['abs_residual_lm']), np.mean(group['abs_residual_wt']))
#
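
Passing sample_weight changes the quantity being minimized from the plain sum of squared residuals to a weighted sum, so observations with larger weights (here, more recent sales) pull the fit harder. The following plain-Python sketch shows this for a single predictor; the name weighted_ols and the toy data are assumptions for illustration.

```python
def weighted_ols(x, y, w):
    """Return (intercept, slope) minimizing sum(w_i * (y_i - b0 - b1*x_i)^2)."""
    w_sum = sum(w)
    # Weighted means replace the ordinary means
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Toy data (hypothetical): four points on y = 1 + 2x plus one stray point
x_toy = [0.0, 1.0, 2.0, 3.0, 4.0]
y_toy = [1.0, 3.0, 5.0, 7.0, 20.0]

equal = weighted_ols(x_toy, y_toy, [1, 1, 1, 1, 1])         # ordinary least squares
downweighted = weighted_ols(x_toy, y_toy, [1, 1, 1, 1, 0])  # stray point ignored
print(equal)
print(downweighted)
```

With equal weights the stray point inflates the slope; giving it zero weight recovers the line through the other four points.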

5.4 - Factor Variables in Regression

5.4.1 - Dummy Variable Representation

print(house.PropertyType.head())
print(pd.get_dummies(house['PropertyType']).head(6))
print(pd.get_dummies(house['PropertyType'], drop_first=True).head(6))
predictors = [
    'SqFtTotLiving', 'SqFtLot',
    'Bathrooms', 'Bedrooms',
    'BldgGrade', 'PropertyType']
#
X = pd.get_dummies(house[predictors], drop_first=True)
#
house_lm_factor = LinearRegression()
house_lm_factor.fit(X, house[outcome_house])
#
print(f'Intercept: {house_lm_factor.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(X.columns, house_lm_factor.coef_):
    print(f' {name}: {coef}')

1        Multiplex
2    Single Family
3    Single Family
4    Single Family
5    Single Family
Name: PropertyType, dtype: object
   Multiplex  Single Family  Townhouse
1          1              0          0
2          0              1          0
3          0              1          0
4          0              1          0
5          0              1          0
6          0              0          1
   Single Family  Townhouse
1              0          0
2              1          0
3              1          0
4              1          0
5              1          0
6              0          1
Intercept: -446841.366
Coefficients:
 SqFtTotLiving: 223.37362892503847
 SqFtLot: -0.07036798136812017
 Bathrooms: -15979.013473415256
 Bedrooms: -50889.73218483006
 BldgGrade: 109416.30516146206
 PropertyType_Single Family: -84678.21629549324
 PropertyType_Townhouse: -115121.97921609218
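
The effect of drop_first=True can be sketched in plain Python: one level (the first in sorted order, here Multiplex) is left out and becomes the reference level, so each remaining dummy coefficient is read as a difference from that reference. The function get_dummies_sketch is a hypothetical stand-in for illustration, not the _pandas_ implementation.

```python
def get_dummies_sketch(values, drop_first=False):
    """One-hot encode a list of labels; levels are sorted alphabetically."""
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]   # the dropped level becomes the reference
    return [{level: int(v == level) for level in levels} for v in values]

property_type = ['Multiplex', 'Single Family', 'Townhouse', 'Single Family']
for row in get_dummies_sketch(property_type, drop_first=True):
    print(row)
```

A Multiplex row is encoded as all zeros, which is why the regression output above lists coefficients only for Single Family and Townhouse.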

5.4.2 - Factor Variables with Many Levels

print(pd.DataFrame(house['ZipCode'].value_counts()).transpose())

house = pd.read_csv(HOUSE_CSV, sep='\t')

predictors = [
    'SqFtTotLiving',
    'SqFtLot',
    'Bathrooms', 
    'Bedrooms',
    'BldgGrade']
#
house_lm = LinearRegression()
house_lm.fit(house[predictors], house[outcome_house])
#
zip_groups = pd.DataFrame([
    *pd.DataFrame({
        'ZipCode': house['ZipCode'],
        'residual' : house[outcome_house] - house_lm.predict(house[predictors]),
    })
    .groupby(['ZipCode'])
    .apply(lambda x: {
        'ZipCode': x.iloc[0,0],
        'count': len(x),
        'median_residual': x.residual.median()
    })
]).sort_values('median_residual')
zip_groups['cum_count'] = np.cumsum(zip_groups['count'])
zip_groups['ZipGroup'] = pd.qcut(
    zip_groups['cum_count'], 5,
    labels=False, retbins=False)
zip_groups.head()
print(zip_groups.ZipGroup.value_counts().sort_index())
#
to_join = zip_groups[['ZipCode', 'ZipGroup']].set_index('ZipCode')
house = house.join(to_join, on='ZipCode')
house['ZipGroup'] = house['ZipGroup'].astype('category')
print(house['ZipGroup'])

         98038  98103  98042  98115  98117  98052  98034  98033  98059  98074  98053  98118  98029  98126  98133  ...  98014  98010  98047  98039  98148  98051  98024  98354  98050  98057  98288  98224  98068  98113  98043
ZipCode    788    671    641    620    619    614    575    517    513    502    499    492    475    473    465  ...     85     56     48     47     40     32     31      9      7      4      4      3      1      1      1

[1 rows x 80 columns]
0    16
1    16
2    16
3    16
4    16
Name: ZipGroup, dtype: int64
1        2
2        2
3        2
4        2
5        2
        ..
27057    3
27058    4
27061    0
27062    2
27063    4
Name: ZipGroup, Length: 22687, dtype: category
Categories (5, int64): [0, 1, 2, 3, 4]

5.5 - Interpreting the Regression Equation

5.5.1 - Correlated Predictors

The results of the stepwise regression are:

print(f'Intercept: {best_model.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(best_variables, best_model.coef_):
    print(f' {name}: {coef}')
#
predictors = [
    'Bedrooms',
    'BldgGrade',
    'PropertyType',
    'YrBuilt']
#
X = pd.get_dummies(house[predictors], drop_first=True)
#
reduced_lm = LinearRegression()
reduced_lm.fit(X, house[outcome_house])
#
print(f'Intercept: {reduced_lm.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(X.columns, reduced_lm.coef_):
    print(f' {name}: {coef}')

Intercept: 6178645.017
Coefficients:
 SqFtTotLiving: 199.27755304201412
 BldgGrade: 137159.56022619843
 YrBuilt: -3565.424939249455
 Bedrooms: -51947.38367361435
 Bathrooms: 42396.16452772081
 PropertyType_Townhouse: 84479.16203299857
 SqFtFinBasement: 7.046974967583083
 PropertyType_Single Family: 22912.055187017773
Intercept: 4913973.344
Coefficients:
 Bedrooms: 27150.537230208818
 BldgGrade: 248997.7936621273
 YrBuilt: -3211.744862155145
 PropertyType_Single Family: -19898.49534049739
 PropertyType_Townhouse: -47355.43687333949

5.5.2 - Confounding Variables

predictors = [
    'SqFtTotLiving',
    'SqFtLot',
    'Bathrooms',
    'Bedrooms',
    'BldgGrade',
    'PropertyType',
    'ZipGroup']
#
X = pd.get_dummies(house[predictors], drop_first=True)
#
confounding_lm = LinearRegression()
confounding_lm.fit(X, house[outcome_house])
#
print(f'Intercept: {confounding_lm.intercept_:.3f}')
print('Coefficients:')
for name, coef in zip(X.columns, confounding_lm.coef_):
    print(f' {name}: {coef}')

Intercept: -666637.469
Coefficients:
 SqFtTotLiving: 210.6126600558015
 SqFtLot: 0.45498713854658845
 Bathrooms: 5928.425640001536
 Bedrooms: -41682.87184074452
 BldgGrade: 98541.18352725993
 PropertyType_Single Family: 19323.625287919327
 PropertyType_Townhouse: -78198.72092762374
 ZipGroup_1: 53317.17330659813
 ZipGroup_2: 116251.58883563514
 ZipGroup_3: 178360.53178793337
 ZipGroup_4: 338408.60185652005

5.5.3 - Interactions and Main Effects

model = smf.ols(formula='AdjSalePrice ~  SqFtTotLiving * ZipGroup + SqFtLot + ' +
     'Bathrooms + Bedrooms + BldgGrade + PropertyType', data=house)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           AdjSalePrice   R-squared:                       0.682
Model:                            OLS   Adj. R-squared:                  0.682
Method:                 Least Squares   F-statistic:                     3247.
Date:                Wed, 31 Jul 2024   Prob (F-statistic):               0.00
Time:                        14:01:55   Log-Likelihood:            -3.1098e+05
No. Observations:               22687   AIC:                         6.220e+05
Df Residuals:                   22671   BIC:                         6.221e+05
Df Model:                          15                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept                     -4.853e+05   2.05e+04    -23.701      0.000   -5.25e+05   -4.45e+05
ZipGroup[T.1]                 -1.113e+04   1.34e+04     -0.830      0.407   -3.74e+04    1.52e+04
ZipGroup[T.2]                  2.032e+04   1.18e+04      1.717      0.086   -2877.441    4.35e+04
ZipGroup[T.3]                   2.05e+04   1.21e+04      1.697      0.090   -3180.870    4.42e+04
ZipGroup[T.4]                 -1.499e+05   1.13e+04    -13.285      0.000   -1.72e+05   -1.28e+05
PropertyType[T.Single Family]  1.357e+04   1.39e+04      0.975      0.330   -1.37e+04    4.09e+04
PropertyType[T.Townhouse]     -5.884e+04   1.51e+04     -3.888      0.000   -8.85e+04   -2.92e+04
SqFtTotLiving                   114.7650      4.863     23.600      0.000     105.233     124.297
SqFtTotLiving:ZipGroup[T.1]      32.6043      5.712      5.708      0.000      21.409      43.799
SqFtTotLiving:ZipGroup[T.2]      41.7822      5.187      8.056      0.000      31.616      51.948
SqFtTotLiving:ZipGroup[T.3]      69.3415      5.619     12.341      0.000      58.329      80.354
SqFtTotLiving:ZipGroup[T.4]     226.6836      4.820     47.032      0.000     217.237     236.131
SqFtLot                           0.6869      0.052     13.296      0.000       0.586       0.788
Bathrooms                     -3619.4533   3202.296     -1.130      0.258   -9896.174    2657.267
Bedrooms                       -4.18e+04   2120.279    -19.715      0.000    -4.6e+04   -3.76e+04
BldgGrade                      1.047e+05   2069.472     50.592      0.000    1.01e+05    1.09e+05
==============================================================================
Omnibus:                    30927.394   Durbin-Watson:                   1.581
Prob(Omnibus):                  0.000   Jarque-Bera (JB):         34361794.502
Skew:                           7.279   Prob(JB):                         0.00
Kurtosis:                     193.101   Cond. No.                     5.80e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
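
With an interaction term, the slope for SqFtTotLiving is no longer a single number: the main-effect coefficient applies to the reference group (ZipGroup 0), and each SqFtTotLiving:ZipGroup coefficient is added to it for the corresponding group. The short sketch below does that arithmetic with the rounded coefficients copied from the summary above; the variable names are illustrative.

```python
# Coefficients copied from the summary above (rounded as printed)
base_slope = 114.7650   # SqFtTotLiving, applies to ZipGroup 0
interaction = {1: 32.6043, 2: 41.7822, 3: 69.3415, 4: 226.6836}

# Effective price per square foot in each zip group
slopes = {0: base_slope}
for group, delta in interaction.items():
    slopes[group] = base_slope + delta

for group, slope in sorted(slopes.items()):
    print(f'ZipGroup {group}: {slope:.2f} per sq ft')
```

In the highest-priced group the effective slope is roughly three times the slope of the reference group, which the main effect alone would hide.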

5.6 - Testing the Assumptions: Regression Diagnostics

5.6.1 - Outliers

The _statsmodels_ package has the most developed support for outlier analysis.

house_98105 = house.loc[house['ZipCode'] == 98105, ]
#
predictors = [
    'SqFtTotLiving',
    'SqFtLot',
    'Bathrooms',
    'Bedrooms',
    'BldgGrade']
#
house_outlier = sm.OLS(house_98105[outcome_house], house_98105[predictors].assign(const=1))
result_98105 = house_outlier.fit()
print(result_98105.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           AdjSalePrice   R-squared:                       0.795
Model:                            OLS   Adj. R-squared:                  0.792
Method:                 Least Squares   F-statistic:                     238.7
Date:                Wed, 31 Jul 2024   Prob (F-statistic):          1.69e-103
Time:                        14:01:55   Log-Likelihood:                -4226.0
No. Observations:                 313   AIC:                             8464.
Df Residuals:                     307   BIC:                             8486.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
SqFtTotLiving   209.6023     24.408      8.587      0.000     161.574     257.631
SqFtLot          38.9333      5.330      7.305      0.000      28.445      49.421
Bathrooms      2282.2641      2e+04      0.114      0.909    -3.7e+04    4.16e+04
Bedrooms      -2.632e+04   1.29e+04     -2.043      0.042   -5.17e+04    -973.867
BldgGrade        1.3e+05   1.52e+04      8.533      0.000       1e+05     1.6e+05
const         -7.725e+05   9.83e+04     -7.861      0.000   -9.66e+05   -5.79e+05
==============================================================================
Omnibus:                       82.127   Durbin-Watson:                   1.508
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              586.561
Skew:                           0.859   Prob(JB):                    4.26e-128
Kurtosis:                       9.483   Cond. No.                     5.63e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.63e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

The OLSInfluence class is initialized with the OLS regression results and gives access to a number of useful properties.

Here we use the studentized residuals.

influence = OLSInfluence(result_98105)
sresiduals = influence.resid_studentized_internal
outlier = house_98105.loc[sresiduals.idxmin(), :]
#
print(sresiduals.idxmin(), sresiduals.min())
print(result_98105.resid.loc[sresiduals.idxmin()])
print('AdjSalePrice', outlier[outcome_house])
print(outlier[predictors])

24333 -4.326731804078562
-757753.6192115826

5.6.2 - Influential Values

from scipy.stats import linregress
#
np.random.seed(5)
x = np.random.normal(size=25)
y = -x / 5 + np.random.normal(size=25)
x[0] = 8
y[0] = 8
#
def abline(slope, intercept, ax):
    # Calculate coordinates of a line based on slope and intercept
    x_vals = np.array(ax.get_xlim())
    return (x_vals, intercept + slope * x_vals)
#
fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter(x, y)
slope, intercept, _, _, _ = linregress(x, y)
ax.plot(*abline(slope, intercept, ax))
slope, intercept, _, _, _ = linregress(x[1:], y[1:])
ax.plot(*abline(slope, intercept, ax), '--')
ax.set_xlim(-2.5, 8.5)
ax.set_ylim(-2.5, 8.5)
#
plt.tight_layout()
plt.show()

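The _statsmodels_ package provides a number of plots to analyze the influence of individual data points. The plot below combines hat values, studentized residuals, and Cook's distance (shown as bubble size).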

influence = OLSInfluence(result_98105)
fig, ax = plt.subplots(figsize=(5, 5))
ax.axhline(-2.5, linestyle='--', color='C1')
ax.axhline(2.5, linestyle='--', color='C1')
ax.scatter(influence.hat_matrix_diag, influence.resid_studentized_internal, 
           s=1000 * np.sqrt(influence.cooks_distance[0]),
           alpha=0.5)
#
ax.set_xlabel('hat values')
ax.set_ylabel('studentized residuals')
#
plt.tight_layout()
plt.show()
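
The three quantities in this plot can be written out explicitly for a simple regression with one predictor. The sketch below, in plain Python, computes the hat values h_i = 1/n + (x_i - x̄)²/Sxx, the internally studentized residuals e_i / (s·sqrt(1 - h_i)), and Cook's distance r_i²/p · h_i/(1 - h_i) with p = 2 parameters. The function name influence_measures and the toy data are assumptions for illustration; OLSInfluence handles the general multi-predictor case.

```python
from math import sqrt

def influence_measures(x, y):
    """Hat values, internally studentized residuals and Cook's distance
    for a simple linear regression y = b0 + b1*x (p = 2 parameters)."""
    n, p = len(x), 2
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in resid) / (n - p)   # residual variance estimate
    hat = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    student = [e / sqrt(s2 * (1 - h)) for e, h in zip(resid, hat)]
    cooks = [r ** 2 / p * h / (1 - h) for r, h in zip(student, hat)]
    return hat, student, cooks

# Toy data (hypothetical): the last point lies far from the rest in x
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y_toy = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]
hat, student, cooks = influence_measures(x_toy, y_toy)
most_influential = max(range(len(cooks)), key=cooks.__getitem__)
print(most_influential, round(hat[most_influential], 3))
```

The hat values always sum to the number of fitted parameters, so a single point with a hat value close to 1 dominates the fit, even if its residual is modest.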

mask = [dist < .08 for dist in influence.cooks_distance[0]]
house_infl = house_98105.loc[mask]

ols_infl = sm.OLS(house_infl[outcome_house], house_infl[predictors].assign(const=1))
result_infl = ols_infl.fit()

df = pd.DataFrame({
    'Original': result_98105.params,
    'Influential removed': result_infl.params,
})
print(df)

5.6.3 - Heteroskedasticity, Non-Normality, and Correlated Errors

The regplot function in _Seaborn_ can add a lowess smoothing line to the scatter plot.

fig, ax = plt.subplots(figsize=(5, 5))
sns.regplot(
    x=result_98105.fittedvalues,
    y=np.abs(result_98105.resid), 
    scatter_kws={'alpha': 0.25},
    line_kws={'color': 'C1'},
    lowess=True, ax=ax)
ax.set_xlabel('predicted')
ax.set_ylabel('abs(residual)')
#
plt.tight_layout()
plt.show()

fig, ax = plt.subplots(figsize=(4, 4))
pd.Series(influence.resid_studentized_internal).hist(ax=ax)
ax.set_xlabel('std. residual')
ax.set_ylabel('Frequency')
#
plt.tight_layout()
plt.show()

5.6.4 - Partial Residual Plots and Nonlinearity

The plot_ccpr function creates a partial residual plot for a single predictor:

fig, ax = plt.subplots(figsize=(5, 5))
fig = sm.graphics.plot_ccpr(result_98105, 'SqFtTotLiving', ax=ax)
#
plt.tight_layout()
plt.show()

The plot_ccpr_grid function creates partial residual plots for all predictors:

fig = sm.graphics.plot_ccpr_grid(
    result_98105,
    fig=plt.figure(figsize=(8,12)))
plt.show()

5.6.5 - Polynomial and Spline Regression

model_poly = smf.ols(formula='AdjSalePrice ~  SqFtTotLiving + np.power(SqFtTotLiving, 2) + ' + 
                'SqFtLot + Bathrooms + Bedrooms + BldgGrade', data=house_98105)
result_poly = model_poly.fit()
print(result_poly.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           AdjSalePrice   R-squared:                       0.806
Model:                            OLS   Adj. R-squared:                  0.802
Method:                 Least Squares   F-statistic:                     211.6
Date:                Wed, 31 Jul 2024   Prob (F-statistic):          9.95e-106
Time:                        14:01:56   Log-Likelihood:                -4217.9
No. Observations:                 313   AIC:                             8450.
Df Residuals:                     306   BIC:                             8476.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
==============================================================================================
                                 coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------
Intercept                  -6.159e+05   1.03e+05     -5.953      0.000   -8.19e+05   -4.12e+05
SqFtTotLiving                  7.4521     55.418      0.134      0.893    -101.597     116.501
np.power(SqFtTotLiving, 2)     0.0388      0.010      4.040      0.000       0.020       0.058
SqFtLot                       32.5594      5.436      5.990      0.000      21.863      43.256
Bathrooms                  -1435.1231   1.95e+04     -0.074      0.941   -3.99e+04     3.7e+04
Bedrooms                   -9191.9441   1.33e+04     -0.693      0.489   -3.53e+04    1.69e+04
BldgGrade                   1.357e+05   1.49e+04      9.087      0.000    1.06e+05    1.65e+05
==============================================================================
Omnibus:                       75.161   Durbin-Watson:                   1.625
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              637.978
Skew:                           0.699   Prob(JB):                    2.92e-139
Kurtosis:                       9.853   Cond. No.                     7.37e+07
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.37e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
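Note [2] in the summary is expected for polynomial terms: a raw predictor and its square are on very different scales and are strongly correlated. As a minimal sketch on synthetic data (the column names sqft and price and all numbers are hypothetical), centering the predictor before squaring lowers the condition number while leaving the fitted values unchanged. The formula uses patsy's I() wrapper, which here plays the same role as np.power in the model above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
df = pd.DataFrame({'sqft': rng.uniform(500, 4000, 300)})
df['price'] = (50_000 + 100 * df['sqft'] + 0.04 * df['sqft'] ** 2
               + rng.normal(scale=20_000, size=300))

# Quadratic fit on the raw predictor.
raw = smf.ols('price ~ sqft + I(sqft ** 2)', data=df).fit()

# Same model after centering the predictor (same column space, new basis).
df['sqft_c'] = df['sqft'] - df['sqft'].mean()
centered = smf.ols('price ~ sqft_c + I(sqft_c ** 2)', data=df).fit()

print(f'condition number, raw:      {raw.condition_number:.3g}')
print(f'condition number, centered: {centered.condition_number:.3g}')
same_fit = np.allclose(raw.fittedvalues, centered.fittedvalues)
print(f'identical fitted values: {same_fit}')
```

The individual coefficients change meaning after centering, so this is a numerical convenience rather than a different model.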

The statsmodels implementation of a partial residual plot works only for linear terms.

Here is an implementation of a partial residual plot that, although inefficient, also works for polynomial regression.

def partialResidualPlot(model, df, outcome, feature, ax):
    # Predictions of the full model, used for the ordinary residuals
    y_pred = model.predict(df)
    # Zero out every column except the feature of interest, so that the
    # prediction reduces to the intercept plus that feature's contribution
    copy_df = df.copy()
    for c in copy_df.columns:
        if c == feature:
            continue
        copy_df[c] = 0.0
    feature_prediction = model.predict(copy_df)
    results = pd.DataFrame({
        'feature': df[feature],
        'residual': df[outcome] - y_pred,
        # Subtract the intercept (first parameter); .iloc avoids the
        # deprecated positional lookup on a label-indexed Series
        'ypartial': feature_prediction - model.params.iloc[0],
    })
    results = results.sort_values(by=['feature'])
    smoothed = sm.nonparametric.lowess(results.ypartial, results.feature, frac=1/3)
    #
    ax.scatter(results.feature, results.ypartial + results.residual)
    ax.plot(smoothed[:, 0], smoothed[:, 1], color='gray')
    ax.plot(results.feature, results.ypartial, color='black')
    ax.set_xlabel(feature)
    ax.set_ylabel(f'Residual + {feature} contribution')
    return ax
#
fig, ax = plt.subplots(figsize=(5, 5))
partialResidualPlot(result_poly, house_98105, 'AdjSalePrice', 'SqFtTotLiving', ax)
#
plt.tight_layout()
plt.show()

The coefficient of the quadratic term, result_poly.params.iloc[2]:

print(result_poly.params.iloc[2])

0.038791281682365585

5.6.6 - Splines

formula = ('AdjSalePrice ~ bs(SqFtTotLiving, df=6, degree=3) + ' + 
           'SqFtLot + Bathrooms + Bedrooms + BldgGrade')
model_spline = smf.ols(formula=formula, data=house_98105)
result_spline = model_spline.fit()
print(result_spline.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           AdjSalePrice   R-squared:                       0.814
Model:                            OLS   Adj. R-squared:                  0.807
Method:                 Least Squares   F-statistic:                     131.8
Date:                Wed, 31 Jul 2024   Prob (F-statistic):          7.10e-104
Time:                        14:01:56   Log-Likelihood:                -4211.4
No. Observations:                 313   AIC:                             8445.
Df Residuals:                     302   BIC:                             8486.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
========================================================================================================
                                           coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------
Intercept                            -4.142e+05   1.43e+05     -2.899      0.004   -6.95e+05   -1.33e+05
bs(SqFtTotLiving, df=6, degree=3)[0] -1.995e+05   1.86e+05     -1.076      0.283   -5.65e+05    1.66e+05
bs(SqFtTotLiving, df=6, degree=3)[1] -1.206e+05   1.23e+05     -0.983      0.326   -3.62e+05    1.21e+05
bs(SqFtTotLiving, df=6, degree=3)[2] -7.164e+04   1.36e+05     -0.525      0.600    -3.4e+05    1.97e+05
bs(SqFtTotLiving, df=6, degree=3)[3]  1.957e+05   1.62e+05      1.212      0.227   -1.22e+05    5.14e+05
bs(SqFtTotLiving, df=6, degree=3)[4]  8.452e+05   2.18e+05      3.878      0.000    4.16e+05    1.27e+06
bs(SqFtTotLiving, df=6, degree=3)[5]  6.955e+05   2.14e+05      3.255      0.001    2.75e+05    1.12e+06
SqFtLot                                 33.3258      5.454      6.110      0.000      22.592      44.059
Bathrooms                            -4778.2080   1.94e+04     -0.246      0.806    -4.3e+04    3.34e+04
Bedrooms                             -5778.7045   1.32e+04     -0.437      0.663   -3.18e+04    2.03e+04
BldgGrade                             1.345e+05   1.52e+04      8.842      0.000    1.05e+05    1.64e+05
==============================================================================
Omnibus:                       58.816   Durbin-Watson:                   1.633
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              622.021
Skew:                           0.330   Prob(JB):                    8.51e-136
Kurtosis:                       9.874   Cond. No.                     1.97e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.97e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Partial residual plot for the spline model:

fig, ax = plt.subplots(figsize=(5, 5))
partialResidualPlot(
    result_spline, house_98105,
    'AdjSalePrice', 'SqFtTotLiving', ax)
plt.tight_layout()
plt.show()
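The six bs(...)[0] to bs(...)[5] rows in the summary above correspond to the six columns of the B-spline basis that the formula builds. The sketch below builds such a basis directly with patsy on a hypothetical evenly spaced x (not the housing data) to show its shape. With df=6 and degree=3, patsy should place 6 - 3 = 3 interior knots, which by default it chooses at quantiles of the data.

```python
import numpy as np
from patsy import dmatrix

x = np.linspace(0, 10, 200)

# "- 1" drops the intercept column that patsy would otherwise add,
# so only the spline basis columns remain.
basis = np.asarray(dmatrix('bs(x, df=6, degree=3) - 1', {'x': x}))

print(basis.shape)  # (200, 6)
```

Each column is a smooth bump that is nonzero only over part of the range of x, and the regression estimates one coefficient per column. This is why the individual spline coefficients are hard to interpret on their own and a partial residual plot is the more useful view.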

5.6.7 - Generalized Additive Models

predictors = [
    'SqFtTotLiving', 'SqFtLot', 'Bathrooms',
    'Bedrooms', 'BldgGrade']
#
X = house_98105[predictors].values
y = house_98105[outcome_house]
#
## model
#
gam = LinearGAM(s(0, n_splines=12) + l(1) + l(2) + l(3) + l(4))
gam.gridsearch(X, y)
#
# summary() prints the report itself, so it does not need to be wrapped in print()
gam.summary()

When this page was generated, the cell above failed with the error below. Most likely the installed pygam release still referred to the np.int alias, which NumPy deprecated in version 1.20 and has since removed. Upgrading pygam, or installing an older NumPy, should allow the model to fit.

AttributeError: module 'numpy' has no attribute 'int'.
`np.int` was a deprecated alias for the builtin `int`. To avoid this error in existing code, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
The aliases was originally deprecated in NumPy 1.20; for more details and guidance see the original release note at:
    https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
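If upgrading the failing library is not possible, one commonly used stopgap is to restore the removed alias before the library is imported. This is only a sketch of that workaround, under the assumption that the alias is the only thing the library is missing; other removed aliases or unrelated incompatibilities may still surface, so upgrading remains the cleaner fix.

```python
import numpy as np

# Stopgap (assumption: the failing library only needs the old alias to exist).
# np.int used to be an alias for the builtin int, so restoring it this way
# matches its former meaning. Run this before importing the affected library.
if 'int' not in vars(np):
    np.int = int

print(np.int is int)
```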

Partial dependence plots for each term of the GAM:

fig, axes = plt.subplots(figsize=(8, 8), ncols=2, nrows=3)
#
titles = ['SqFtTotLiving', 'SqFtLot', 'Bathrooms', 'Bedrooms', 'BldgGrade']
for i, title in enumerate(titles):
    ax = axes[i // 2, i % 2]
    XX = gam.generate_X_grid(term=i)
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX))
    ax.plot(XX[:, i], gam.partial_dependence(term=i, X=XX, width=.95)[1], c='r', ls='--')
    ax.set_title(title)
#
axes[2][1].set_visible(False)
#
plt.tight_layout()
plt.show()

Because the GAM was never fitted, the plotting cell failed as well:

AttributeError: GAM has not been fitted. Call fit first.

5.7 - Regularization

5.7.1 - Lasso

from sklearn.linear_model import Lasso, LassoLars, LassoCV, LassoLarsCV
from sklearn.preprocessing import StandardScaler
#
predictors = [
    'SqFtTotLiving','SqFtLot','Bathrooms',
    'Bedrooms','BldgGrade','PropertyType',
    'NbrLivingUnits','SqFtFinBasement',
    'YrBuilt','YrRenovated', 
    'NewConstruction']
#
X = pd.get_dummies(house[predictors], drop_first=True)
X['NewConstruction'] = [1 if nc else 0 for nc in X['NewConstruction']]
#
columns = X.columns
# X = StandardScaler().fit_transform(X * 1.0)
y = house[outcome_house]
#
house_lm = LinearRegression()
print(house_lm.fit(X, y))
#
house_lasso = Lasso(alpha=10)
print(house_lasso.fit(X, y))

LinearRegression()
Lasso(alpha=10)

# LARS-based alternative: Method = LassoLars; MethodCV = LassoLarsCV
Method = Lasso
MethodCV = LassoCV
#
alpha_values = []
results = []
for alpha in [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]:
    model = Method(alpha=alpha)
    model.fit(X, y)
    alpha_values.append(alpha)
    results.append(model.coef_)
modelCV = MethodCV(cv=5)
modelCV.fit(X, y)
ax = pd.DataFrame(results, index=alpha_values, columns=columns).plot(logx=True, legend=False)
ax.axvline(modelCV.alpha_)
plt.show()
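The plot above traces each coefficient as alpha grows. As a small illustration of the same behavior, the sketch below fits Lasso on synthetic data (hypothetical features, of which only the first two affect the outcome) and counts how many coefficients remain nonzero at each alpha.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
# Only the first two features matter in this synthetic example.
y = 4.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

alphas = [0.01, 0.1, 1, 10]
nonzero_counts = []
for alpha in alphas:
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    nonzero_counts.append(int(np.count_nonzero(coef)))
    print(f'alpha={alpha:<5} nonzero={nonzero_counts[-1]} coef={np.round(coef, 2)}')
```

Small penalties leave the fit close to ordinary least squares, while a large enough penalty shrinks every coefficient to exactly zero. Cross-validation, as with LassoCV above, picks a value in between.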

print(pd.DataFrame({
    'name': columns,
    'coef': modelCV.coef_,}))
# Note: the reference values listed below do not match the LassoCV output printed after this cell.
# Intercept: 6177658.144
# Coefficients:
# SqFtTotLiving: 199.27474217544048
# BldgGrade: 137181.13724627026
# YrBuilt: -3564.934870415041
# Bedrooms: -51974.76845567939
# Bathrooms: 42403.059999677665
# PropertyType_Townhouse: 84378.9333363999
# SqFtFinBasement: 7.032178917565108
# PropertyType_Single Family: 22854.87954019308

                          name        coef
0                SqFtTotLiving  289.048846
1                      SqFtLot    0.029471
2                    Bathrooms    0.000000
3                     Bedrooms   -0.000000
4                    BldgGrade    0.000000
5               NbrLivingUnits   -0.000000
6              SqFtFinBasement    3.316479
7                      YrBuilt   -0.000000
8                  YrRenovated   45.727472
9              NewConstruction   -0.000000
10  PropertyType_Single Family   -0.000000
11      PropertyType_Townhouse    0.000000
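Several predictors are set to zero in this output, including BldgGrade, which was strongly significant in the earlier regressions. One possible contributor is that the predictors were not standardized (the StandardScaler line is commented out), and the lasso penalty depends on the units of each column. The sketch below illustrates that effect on synthetic data; the two features and all numbers are hypothetical, and it does not reproduce the housing result.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 500
x_small = rng.normal(scale=0.1, size=n)     # small units, large coefficient
x_big = rng.normal(scale=1000.0, size=n)    # large units, small coefficient
X = np.column_stack([x_small, x_big])
# Both features contribute equally (standard deviation of about 100) to y.
y = 1000.0 * x_small + 0.1 * x_big + rng.normal(scale=10.0, size=n)

# Same penalty, with and without standardizing the columns first.
raw = Lasso(alpha=20).fit(X, y)
scaled = make_pipeline(StandardScaler(), Lasso(alpha=20)).fit(X, y)
scaled_coef = scaled.named_steps['lasso'].coef_

print('raw coefficients:   ', raw.coef_)
print('scaled coefficients:', scaled_coef)
```

On the raw columns the penalty removes the small-unit feature entirely, even though it matters as much as the other one. After standardization both features are penalized on the same footing, and the coefficients are then expressed per standard deviation of each feature.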

Authors: Betobyte (author), Arduino (co-author)
