## Penalized (Regularized) Estimators {#penalized-regularized-estimators}
Penalized or regularized estimators are extensions of [Ordinary Least Squares](#ordinary-least-squares) designed to address its limitations, particularly in high-dimensional settings. Regularization methods introduce a penalty term to the loss function to prevent overfitting, handle multicollinearity, and improve model interpretability.
Three popular regularization techniques (among others) are:
1. [Ridge Regression]
2. [Lasso Regression]
3. [Elastic Net]
------------------------------------------------------------------------
### Motivation for Penalized Estimators
OLS minimizes the Residual Sum of Squares (RSS):
$$
RSS = \sum_{i=1}^n \left( y_i - \hat{y}_i \right)^2 = \sum_{i=1}^n \left( y_i - x_i'\beta \right)^2,
$$
where:
- $y_i$ is the observed outcome,
- $x_i$ is the vector of predictors for observation $i$,
- $\beta$ is the vector of coefficients.
While OLS works well under ideal conditions (e.g., low dimensionality, no multicollinearity), it struggles when:
- **Multicollinearity**: Predictors are highly correlated, leading to large variances in $\beta$ estimates.
- **High Dimensionality**: The number of predictors ($p$) exceeds or approaches the sample size ($n$), making OLS inapplicable or unstable.
- **Overfitting**: When $p$ is large, OLS fits noise in the data, reducing generalizability.
To address these issues, **penalized regression** modifies the OLS loss function by adding a **penalty term** that shrinks the coefficients toward zero. This discourages overfitting and improves predictive performance.
The general form of the penalized loss function is:
$$
L(\beta) = \sum_{i=1}^n \left( y_i - x_i'\beta \right)^2 + \lambda P(\beta),
$$
where:
- $\lambda \geq 0$: Tuning parameter controlling the strength of regularization.
- $P(\beta)$: Penalty term that quantifies model complexity.
Different choices of $P(\beta)$ lead to ridge regression, lasso regression, or elastic net.
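As a concrete sketch, the general penalized loss can be written as a small R function that takes the penalty $P(\beta)$ as an argument. The helper names below are illustrative, not from any package:

```{r}
# Generic penalized loss: RSS + lambda * P(beta)
penalized_loss <- function(beta, X, y, lambda, penalty) {
  rss <- sum((y - X %*% beta)^2)
  rss + lambda * penalty(beta)
}

# Penalty choices that yield ridge and lasso
ridge_penalty <- function(beta) sum(beta^2)
lasso_penalty <- function(beta) sum(abs(beta))

set.seed(1)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)
y <- rnorm(10)
beta <- c(0.5, -0.25)
penalized_loss(beta, X, y, lambda = 1, penalty = ridge_penalty)
penalized_loss(beta, X, y, lambda = 1, penalty = lasso_penalty)
```

Swapping in a different `penalty` function is all that distinguishes the estimators discussed below.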
------------------------------------------------------------------------
### Ridge Regression
Ridge regression, also known as **L2 regularization**, penalizes the sum of squared coefficients:
$$
P(\beta) = \sum_{j=1}^p \beta_j^2.
$$
The ridge objective function becomes:
$$
L_{ridge}(\beta) = \sum_{i=1}^n \left( y_i - x_i'\beta \right)^2 + \lambda \sum_{j=1}^p \beta_j^2,
$$
where:
- $\lambda \geq 0$ controls the degree of shrinkage. Larger $\lambda$ leads to greater shrinkage.
Ridge regression has a closed-form solution:
$$
\hat{\beta}_{ridge} = \left( X'X + \lambda I \right)^{-1} X'y,
$$
where $I$ is the $p \times p$ identity matrix.
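The closed form can be verified directly in R. A minimal sketch, assuming standardized predictors so the intercept can be ignored:

```{r}
set.seed(123)
n <- 50; p <- 5
X <- scale(matrix(rnorm(n * p), n, p))
y <- rnorm(n)
lambda <- 2

# Closed-form ridge solution: (X'X + lambda * I)^{-1} X'y
beta_ridge <- solve(t(X) %*% X + lambda * diag(p), t(X) %*% y)

# With a very large lambda, every coefficient is shrunk close to zero
beta_heavy <- solve(t(X) %*% X + 1e6 * diag(p), t(X) %*% y)
max(abs(beta_heavy))
```

Note that adding $\lambda I$ to $X'X$ also makes the matrix invertible even when $p > n$, which is why ridge remains computable where OLS fails.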
**Key Features**
- Shrinks coefficients but does **not set them exactly to zero**.
- Handles multicollinearity effectively by stabilizing the coefficient estimates [@hoerl1970].
- Works well when all predictors contribute to the response.
**Example Use Case**
Ridge regression is ideal for applications with many correlated predictors, such as:
- Predicting housing prices based on a large set of features (e.g., size, location, age of the house).
------------------------------------------------------------------------
### Lasso Regression
Lasso regression, or **L1 regularization**, penalizes the sum of absolute coefficients:
$$
P(\beta) = \sum_{j=1}^p |\beta_j|.
$$
The lasso objective function is:
$$
L_{lasso}(\beta) = \sum_{i=1}^n \left( y_i - x_i'\beta \right)^2 + \lambda \sum_{j=1}^p |\beta_j|.
$$
**Key Features**
- Unlike ridge regression, lasso can set coefficients to **exactly zero**, performing automatic feature selection.
- Encourages sparse models, making it suitable for high-dimensional data [@tibshirani1996].
**Optimization**
Lasso does not have a closed-form solution due to the non-differentiability of $|\beta_j|$ at $\beta_j = 0$. It requires iterative algorithms, such as:
- **Coordinate Descent**,
- **Least Angle Regression (LARS)**.
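To illustrate, coordinate descent cycles through the coefficients one at a time, applying a soft-thresholding operator to each partial residual. The sketch below is illustrative only (fixed iteration count, no convergence check), written for the objective $\sum_i (y_i - x_i'\beta)^2 + \lambda \sum_j |\beta_j|$ used above:

```{r}
# Soft-thresholding operator: S(z, g) = sign(z) * max(|z| - g, 0)
soft_threshold <- function(z, g) sign(z) * pmax(abs(z) - g, 0)

# Minimal coordinate descent for RSS + lambda * sum(|beta_j|)
lasso_cd <- function(X, y, lambda, n_iter = 100) {
  p <- ncol(X)
  beta <- rep(0, p)
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(p)) {
      # Partial residual: response with predictor j left out
      r_j <- y - X[, -j, drop = FALSE] %*% beta[-j]
      beta[j] <- soft_threshold(sum(X[, j] * r_j), lambda / 2) / sum(X[, j]^2)
    }
  }
  beta
}

set.seed(123)
X <- matrix(rnorm(100 * 5), 100, 5)
y <- X[, 1] * 2 + rnorm(100)   # only the first predictor matters
round(lasso_cd(X, y, lambda = 100), 3)
```

The soft-thresholding step is what sets small coefficients exactly to zero, which is the source of the lasso's variable-selection behavior.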
**Example Use Case**
Lasso regression is useful when many predictors are irrelevant, such as:
- Genomics, where only a small subset of genes is associated with a disease outcome.
------------------------------------------------------------------------
### Elastic Net
Elastic Net combines the penalties of ridge and lasso regression:
$$
P(\beta) = \alpha \sum_{j=1}^p |\beta_j| + \frac{1 - \alpha}{2} \sum_{j=1}^p \beta_j^2,
$$
where:
- $0 \leq \alpha \leq 1$ determines the balance between lasso (L1) and ridge (L2) penalties.
- $\lambda$ controls the overall strength of regularization.
The elastic net objective function is:
$$
L_{elastic\ net}(\beta) = \sum_{i=1}^n \left( y_i - x_i'\beta \right)^2 + \lambda \left( \alpha \sum_{j=1}^p |\beta_j| + \frac{1 - \alpha}{2} \sum_{j=1}^p \beta_j^2 \right).
$$
**Key Features**
- Combines the strengths of lasso (sparse models) and ridge (stability with correlated predictors) [@zou2005a].
- Effective when predictors are highly correlated or when $p > n$.
**Example Use Case**
Elastic net is ideal for high-dimensional datasets with correlated predictors, such as:
- Predicting customer churn using demographic and behavioral features.
------------------------------------------------------------------------
### Tuning Parameter Selection
Choosing the regularization parameter $\lambda$ (and $\alpha$ for elastic net) is critical for balancing model complexity (fit) and regularization (parsimony). If $\lambda$ is too large, coefficients are overly shrunk (or even set to zero in the case of L1 penalty), leading to underfitting. If $\lambda$ is too small, the model might overfit because coefficients are not penalized sufficiently. Hence, a systematic approach is needed to determine the optimal $\lambda$. For elastic net, we also choose an appropriate $\alpha$ to balance the L1 and L2 penalties.
#### Cross-Validation
A common approach to selecting $\lambda$ (and $\alpha$) is $K$-Fold Cross-Validation:
1. **Partition the data** into $K$ roughly equal-sized "folds."
2. **Train the model** on $K-1$ folds and **validate** on the remaining fold, computing a validation error.
3. Repeat this process **for all folds**, and compute the average validation error across the $K$ folds.
4. **Select** the value of $\lambda$ (and $\alpha$ if tuning it) that **minimizes** the cross-validated error.
This method helps maintain a good bias-variance trade-off because every observation is used for validation exactly once and for training $K-1$ times.
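The steps above can be sketched by hand over a $\lambda$ grid; in practice `cv.glmnet` automates all of this, and fitting `glmnet` at a single $\lambda$ value, as done here, is discouraged for real work and shown only for clarity:

```{r}
library(glmnet)

set.seed(123)
n <- 100; p <- 20; K <- 5
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)
lambdas <- 10^seq(1, -3, length.out = 50)

# Step 1: randomly assign each observation to one of K folds
folds <- sample(rep(seq_len(K), length.out = n))

# Steps 2-3: for each lambda, average the held-out MSE across folds
cv_mse <- sapply(lambdas, function(lam) {
  errs <- sapply(seq_len(K), function(k) {
    fit <- glmnet(X[folds != k, ], y[folds != k], alpha = 1, lambda = lam)
    pred <- predict(fit, newx = X[folds == k, ])
    mean((y[folds == k] - pred)^2)
  })
  mean(errs)
})

# Step 4: pick the lambda minimizing the cross-validated error
best_lambda <- lambdas[which.min(cv_mse)]
best_lambda
```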
#### Information Criteria
Alternatively, one can use **information criteria**---like the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC)---to guide model selection. These criteria reward goodness-of-fit while penalizing model complexity, thereby helping in selecting an appropriately regularized model.
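For the lasso, a common approximation takes the number of nonzero coefficients at each $\lambda$ as the model's degrees of freedom, which makes AIC- and BIC-style selection easy to sketch along the `glmnet` path:

```{r}
library(glmnet)

set.seed(123)
n <- 100; p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] + rnorm(n)

fit  <- glmnet(X, y, alpha = 1)
pred <- predict(fit, newx = X)        # n x length(lambda) matrix of fits
rss  <- colSums((y - pred)^2)
df   <- fit$df                        # nonzero coefficients per lambda

aic <- n * log(rss / n) + 2 * df
bic <- n * log(rss / n) + log(n) * df

lambda_aic <- fit$lambda[which.min(aic)]
lambda_bic <- fit$lambda[which.min(bic)]
c(AIC = lambda_aic, BIC = lambda_bic)
```

Because BIC's $\log(n)$ penalty on complexity is heavier than AIC's, it tends to select a larger $\lambda$ and hence a sparser model.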
------------------------------------------------------------------------
### Properties of Penalized Estimators
1. **Bias-Variance Tradeoff**:
- Regularization introduces some bias in exchange for reducing variance, often resulting in better predictive performance on new data.
2. **Shrinkage**:
- Ridge shrinks coefficients toward zero but usually retains all predictors.
- Lasso shrinks some coefficients exactly to zero, performing inherent feature selection.
3. **Flexibility**:
- Elastic net allows for a continuum between ridge and lasso, so it can adapt to different data structures (e.g., many correlated features or very high-dimensional feature spaces).
------------------------------------------------------------------------
```{r fig-coef-ridge, fig.cap="Coefficient Paths for Ridge Regression", fig.alt="Plot showing coefficient paths for ridge regression. The X-axis represents Log Lambda, ranging from -4 to 4, and the Y-axis represents Coefficients, ranging from -0.15 to 0.15. Multiple colored lines depict the change in coefficients as Log Lambda varies, converging towards zero as Log Lambda increases. The title reads 'Coefficient Paths for Ridge Regression.'", out.width="100%", fig.align='center'}
# Load required libraries
library(glmnet)
# Simulate data
set.seed(123)
n <- 100 # Number of observations
p <- 20 # Number of predictors
X <- matrix(rnorm(n * p), nrow = n, ncol = p) # Predictor matrix
y <- rnorm(n) # Response vector
# Ridge regression (alpha = 0)
ridge_fit <- glmnet(X, y, alpha = 0)
plot(ridge_fit, xvar = "lambda", label = TRUE)
title("Coefficient Paths for Ridge Regression")
```
- In this plot, each curve represents a coefficient's value as a function of $\lambda$.
- As $\lambda$ increases (moving from left to right on a log-scale by default), coefficients shrink toward zero but typically stay non-zero.
- Ridge regression tends to shrink coefficients but does not force them to be exactly zero.
```{r fig-coef-lasso, fig.cap="Coefficient Paths for Lasso Regression", fig.alt="Graph showing coefficient paths for Lasso Regression. The x-axis represents Log Lambda, ranging from -7 to -2, and the y-axis represents Coefficients, ranging from -0.15 to 0.15. Multiple colored lines depict the paths of different coefficients as Log Lambda changes, illustrating how coefficients shrink towards zero as regularization increases.", out.width="100%", fig.align='center'}
# Lasso regression (alpha = 1)
lasso_fit <- glmnet(X, y, alpha = 1)
plot(lasso_fit, xvar = "lambda", label = TRUE)
title("Coefficient Paths for Lasso Regression")
```
Here, as $\lambda$ grows, several coefficient paths **hit zero exactly**, illustrating the variable selection property of lasso.
```{r fig-coef-elastic, fig.cap="Coefficient Paths for Elastic Net", fig.alt="Plot titled 'Coefficient Paths for Elastic Net (alpha = 0.5)' showing multiple colored lines representing coefficient paths against the x-axis labeled 'Log Lambda' and y-axis labeled 'Coefficients.' The lines converge towards zero as Log Lambda increases, illustrating the effect of regularization on coefficients in an Elastic Net model.", out.width="100%", fig.align='center'}
# Elastic net (alpha = 0.5)
elastic_net_fit <- glmnet(X, y, alpha = 0.5)
plot(elastic_net_fit, xvar = "lambda", label = TRUE)
title("Coefficient Paths for Elastic Net (alpha = 0.5)")
```
- Elastic net combines ridge and lasso penalties. With $\alpha = 0.5$, we see partial shrinkage, with some coefficients going exactly to zero.
- This model is often helpful when you suspect both group-wise shrinkage (like ridge) and sparse solutions (like lasso) might be beneficial.
We can further refine our choice of $\lambda$ by performing cross-validation on the lasso model:
```{r fig-mse-log-lambda, fig.cap="Cross-Validation Curve for Lasso Regression: Mean-Squared Error by Log Lambda", fig.alt="A line chart displaying the relationship between Log(lambda) on the x-axis and Mean-Squared Error on the y-axis. The chart features a series of red dots representing data points, which form a downward trend from left to right. Vertical gray lines indicate error bars for each data point. The x-axis ranges from -7 to -2, while the y-axis ranges from 0.75 to 1.05. The top of the chart shows numbers from 19 to 0, likely indicating additional data or context.", out.width="100%", fig.align='center'}
cv_lasso <- cv.glmnet(X, y, alpha = 1)
plot(cv_lasso)
best_lambda <- cv_lasso$lambda.min
best_lambda
```
- The plot displays the cross-validated error (often mean-squared error or deviance) on the y-axis versus $\log(\lambda)$ on the x-axis.
- Two vertical dotted lines typically appear:
1. `lambda.min`: The $\lambda$ that achieves the minimum cross-validated error.
2. `lambda.1se`: The largest $\lambda$ such that the cross-validated error is still within one standard error of the minimum. This is a more conservative choice that favors stronger regularization (simpler models).
- `best_lambda` above prints the numeric value of `lambda.min`. This is the $\lambda$ that gave the lowest cross-validation error for the lasso model.
**Interpretation**:
- By using `cv.glmnet`, we systematically compare different values of $\lambda$ in terms of their predictive performance (cross-validation error).
- The selected $\lambda$ typically balances having a smaller model (due to regularization) with retaining sufficient predictive power.
- If we used real-world data, we might also look at performance metrics on a hold-out test set to ensure that the chosen $\lambda$ generalizes well.
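Once a $\lambda$ has been selected, the fitted coefficients at that value can be extracted with `coef()`. The chunk below recreates the simulated data and cross-validated fit from above so it runs on its own:

```{r}
library(glmnet)
set.seed(123)
X <- matrix(rnorm(100 * 20), nrow = 100, ncol = 20)
y <- rnorm(100)
cv_lasso <- cv.glmnet(X, y, alpha = 1)

coef_min <- coef(cv_lasso, s = "lambda.min")   # coefficients at lambda.min
coef_1se <- coef(cv_lasso, s = "lambda.1se")   # more conservative choice
sum(coef_min != 0)                             # retained terms (incl. intercept)
```

Comparing the two coefficient vectors makes the trade-off concrete: `lambda.1se` retains no more terms than `lambda.min`, at the cost of slightly higher cross-validated error.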
------------------------------------------------------------------------