## Robust Estimators
Robust estimators are statistical techniques designed to provide reliable parameter estimates even when the assumptions underlying classical methods, such as [Ordinary Least Squares](#ordinary-least-squares), are violated. Specifically, they address issues caused by outliers, non-normal errors, or heavy-tailed distributions, which can render OLS inefficient or biased.
The goal of robust estimation is to reduce the sensitivity of the estimator to extreme or aberrant data points, thereby ensuring a more reliable and accurate fit to the majority of the data.
We will cover the key robust estimation techniques, their properties, and applications, along with practical examples and mathematical derivations. The focus will include $M$-estimators, $R$-estimators, $L$-estimators, $LTS$, $S$-estimators, $MM$-estimators, and more.
------------------------------------------------------------------------
### Motivation for Robust Estimation
OLS seeks to minimize the Residual Sum of Squares (RSS):
$$
RSS = \sum_{i=1}^n (y_i - x_i'\beta)^2,
$$
where:
- $y_i$ is the observed response for the $i$th observation,
- $x_i$ is the vector of predictors for the $i$th observation,
- $\beta$ is the vector of coefficients.
OLS assumes:
1. Errors are normally distributed and the data contain no outliers ([A6 Normal Distribution](#a6-normal-distribution)).
2. Homoscedasticity (constant variance of errors) ([A4 Homoskedasticity](#a4-homoskedasticity)).
In real-world scenarios:
- **Outliers** in $y$ or $x$ can disproportionately affect the estimates, leading to biased or inefficient results.
- **Heavy-tailed distributions** (e.g., Cauchy) violate the normality assumption, making OLS inappropriate.
For example, @huber1964 demonstrates that a single extreme observation can arbitrarily distort OLS estimates, while @hampel2005 define the breakdown point as a measure of robustness. Robust estimators aim to mitigate these problems by limiting the influence of problematic observations.
OLS inherently squares the residuals $e_i = y_i - x_i'\beta$, amplifying the influence of large residuals. For example, if a single residual is much larger than the others, its squared value can dominate the RSS, distorting the estimated coefficients.
Consider a simple case where $y_i = \beta_0 + \beta_1 x_i + e_i$, with $e_i \sim N(0, \sigma^2)$ under the classical assumptions. Now introduce an outlier: a single observation with an unusually large $e_i$. The squared residual for this point will dominate the RSS and pull the estimated regression line towards it, leading to biased estimates of $\beta_0$ and $\beta_1$.
The **breakdown point** of an estimator is the proportion of contamination (e.g., outliers) that the estimator can tolerate before yielding arbitrarily large or incorrect results. For OLS, the breakdown point is $1/n$, meaning even one outlier can cause substantial distortion in the estimates.
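This sensitivity is easy to reproduce. The following base R sketch (simulated data with illustrative values) fits OLS before and after corrupting a single response:

```r
# Demonstrate OLS's sensitivity to a single outlier (breakdown point 1/n).
set.seed(42)
n <- 50
x <- seq(1, 10, length.out = n)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.5)   # true beta0 = 2, beta1 = 0.5

fit_clean <- lm(y ~ x)

# Corrupt one observation with an extreme response value.
y_out <- y
y_out[n] <- 100
fit_outlier <- lm(y_out ~ x)

round(coef(fit_clean), 2)    # close to (2, 0.5)
round(coef(fit_outlier), 2)  # slope pulled far from 0.5 by a single point
```

A single corrupted observation out of fifty is enough to more than double the estimated slope.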
------------------------------------------------------------------------
### $M$-Estimators
To address the sensitivity of OLS, robust estimators minimize a different objective function:
$$
\sum_{i=1}^n \rho\left(\frac{y_i - x_i'\beta}{\sigma}\right),
$$
where:
- $\rho(\cdot)$ is a robust loss function that grows slower than the quadratic function used in OLS,
- $\sigma$ is a scale parameter to normalize residuals.
In OLS, the quadratic loss function $\rho(z) = z^2$ penalizes large residuals disproportionately. Robust estimators replace this with alternative $\rho$ functions that limit the penalty for large residuals, thus reducing their influence on the parameter estimates.
A robust $\rho$ function should satisfy the following properties:
1. **Bounded Influence**: Large residuals contribute a finite amount to the objective function.
2. **Symmetry**: $\rho(z) = \rho(-z)$ ensures that positive and negative residuals are treated equally.
3. **Differentiability**: For computational tractability, $\rho$ should be smooth and differentiable.
------------------------------------------------------------------------
#### Examples of Robust $\rho$ Functions
1. **Huber's Loss Function** [@huber1964]
Huber's loss function transitions between quadratic and linear growth:
$$
\rho(z) =
\begin{cases}
\frac{z^2}{2} & \text{if } |z| \leq c, \\
c|z| - \frac{c^2}{2} & \text{if } |z| > c.
\end{cases}
$$
Key features:
- For small residuals ($|z| \leq c$), the loss is quadratic, mimicking OLS.
- For large residuals ($|z| > c$), the loss grows linearly, limiting their influence.
The parameter $c$ controls the threshold at which the loss function transitions from quadratic to linear. Smaller values of $c$ make the estimator more robust but potentially less efficient under normality.
2. **Tukey's Bisquare Function** [@beaton1974]
Tukey's bisquare function completely bounds the influence of large residuals:
$$
\rho(z) =
\begin{cases}
c^2 \left(1 - \left(1 - \left(\frac{z}{c}\right)^2\right)^3\right)/6 & \text{if } |z| \leq c, \\
c^2/6 & \text{if } |z| > c.
\end{cases}
$$
Key features:
- Residuals larger than $c$ contribute a constant value to the objective function, effectively excluding them from the estimation process.
- This approach achieves high robustness at the cost of lower efficiency for small residuals.
3. **Andrews' Sine Function** [@andrews1974]
Andrews' sine function smoothly downweights extreme residuals:
$$
\rho(z) =
\begin{cases}
c^2 \left(1 - \cos\left(\frac{z}{c}\right)\right)/2 & \text{if } |z| \leq \pi c, \\
c^2/2 & \text{if } |z| > \pi c.
\end{cases}
$$
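These loss functions translate directly into code. A base R sketch, using the conventional tuning constants $c = 1.345$ (Huber), $c = 4.685$ (Tukey), and $c = 1.339$ (Andrews); these particular values are the customary defaults tuned for high efficiency under normality:

```r
# The three robust loss functions defined above.
huber_rho <- function(z, c = 1.345) {
  ifelse(abs(z) <= c, z^2 / 2, c * abs(z) - c^2 / 2)
}

tukey_rho <- function(z, c = 4.685) {
  ifelse(abs(z) <= c,
         c^2 * (1 - (1 - (z / c)^2)^3) / 6,
         c^2 / 6)
}

andrews_rho <- function(z, c = 1.339) {
  ifelse(abs(z) <= pi * c,
         c^2 * (1 - cos(z / c)) / 2,
         c^2 / 2)
}

# For large residuals, quadratic loss explodes while the robust losses
# grow linearly (Huber) or plateau at a constant (Tukey, Andrews).
z <- c(0.5, 2, 10)
rbind(ols = z^2, huber = huber_rho(z), tukey = tukey_rho(z))
```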
------------------------------------------------------------------------
#### Weighting Scheme: Influence Functions
A critical concept in robust estimation is the **influence function**, which describes the sensitivity of the estimator to individual observations. For $M$-estimators, the influence function is derived as the derivative of the loss function $\rho(z)$ with respect to $z$:
$$
\psi(z) = \frac{d}{dz} \rho(z).
$$
This function plays a crucial role in downweighting large residuals. The weight assigned to each residual is proportional to $\psi(z)/z$, which decreases as $|z|$ increases for robust estimators.
For Huber's loss function, the influence function is:
$$
\psi(z) =
\begin{cases}
z & \text{if } |z| \leq c, \\
c \cdot \text{sign}(z) & \text{if } |z| > c.
\end{cases}
$$
- For small residuals, $\psi(z) = z$, matching OLS.
- For large residuals, $\psi(z)$ is constant, ensuring bounded influence.
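A small sketch of Huber's $\psi$ and the implied weight $\psi(z)/z$ (again with the conventional default $c = 1.345$):

```r
# Huber's influence function psi(z) and the implied weight psi(z)/z.
psi_huber <- function(z, c = 1.345) ifelse(abs(z) <= c, z, c * sign(z))
w_huber   <- function(z, c = 1.345) ifelse(z == 0, 1, psi_huber(z, c) / z)

z <- c(0.5, 1, 2, 5, 10)
round(w_huber(z), 3)   # weight is 1 for |z| <= c, then shrinks as c/|z|
```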
------------------------------------------------------------------------
A key consideration when selecting a robust estimator is the trade-off between **robustness** (resistance to outliers) and **efficiency** (performance under ideal conditions). The tuning parameters in $\rho$ functions (e.g., $c$ in Huber's loss) directly affect this balance:
- Smaller $c$ increases robustness but reduces efficiency under normality.
- Larger $c$ improves efficiency under normality but decreases robustness to outliers.
This trade-off reflects the fundamental goal of robust estimation: to achieve a balance between reliability and precision across a wide range of data scenarios.
------------------------------------------------------------------------
#### Properties of $M$-Estimators
Robust estimators, particularly $M$-estimators, possess the following mathematical properties:
1. **Asymptotic Normality**: Under mild regularity conditions, $M$-estimators are asymptotically normal: $$
\sqrt{n} (\hat{\beta} - \beta) \xrightarrow{d} N(0, \Sigma),
$$ where $\Sigma$ depends on the choice of $\rho$ and the distribution of residuals.
2. **Consistency**: As $n \to \infty$, $\hat{\beta} \to \beta$ in probability, provided the majority of the data satisfies the model assumptions.
3. **Breakdown Point**: $M$-estimators typically have a moderate breakdown point, sufficient to handle a reasonable proportion of contamination.
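In practice, $M$-estimates are computed by iteratively reweighted least squares (IRLS). A minimal hand-rolled sketch for Huber's loss follows; the MAD-based scale estimate and convergence tolerance are illustrative choices, and `MASS::rlm()` is the production-quality alternative:

```r
# Iteratively reweighted least squares (IRLS) for a Huber M-estimator.
huber_m <- function(x, y, c = 1.345, tol = 1e-8, max_iter = 100) {
  X <- cbind(1, x)                              # add intercept column
  beta <- solve(crossprod(X), crossprod(X, y))  # OLS starting values
  for (iter in seq_len(max_iter)) {
    r <- as.vector(y - X %*% beta)
    sigma <- median(abs(r)) / 0.6745            # MAD scale estimate
    z <- r / sigma
    w <- ifelse(abs(z) <= c, 1, c / abs(z))     # Huber weights psi(z)/z
    beta_new <- solve(crossprod(X * w, X), crossprod(X * w, y))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  as.vector(beta)
}

set.seed(1)
x <- rnorm(100); y <- 1 + 2 * x + rnorm(100)
y[1:5] <- y[1:5] + 20        # contaminate 5% of the responses
huber_m(x, y)                # near (1, 2) despite the outliers
```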
### $R$-Estimators
$R$-estimators are a class of robust estimators that rely on the ranks of residuals rather than their raw magnitudes. This approach makes them naturally resistant to the influence of outliers and highly effective in scenarios involving ordinal data or heavy-tailed error distributions. By leveraging rank-based methods, $R$-estimators are particularly useful in situations where classical assumptions about the data, such as normality or homoscedasticity, do not hold.
The general form of an $R$-estimator can be expressed as:
$$
\hat{\beta}_R = \arg\min_\beta \sum_{i=1}^n w_i R_i \left(y_i - x_i'\beta\right),
$$
where:
- $R_i$ are the ranks of residuals $e_i = y_i - x_i'\beta$,
- $w_i$ are rank-based weights determined by a chosen scoring function,
- $y_i$ are observed responses, $x_i$ are predictor values, and $\beta$ is the vector of coefficients.
This formulation differs from $M$-estimators, which directly minimize a loss function $\rho$, by instead using the **ordering of residuals** to drive the estimation.
------------------------------------------------------------------------
#### Ranks and Scoring Function
##### Definition of Ranks
The rank $R_i$ of a residual $e_i$ is its position in the sorted sequence of all residuals:
$$
R_i = \text{rank}(e_i) = \sum_{j=1}^n \mathbb{I}(e_j \leq e_i),
$$
where $\mathbb{I}(\cdot)$ is the indicator function, equal to 1 if the condition is true and 0 otherwise. This step transforms the residuals into an ordinal scale, eliminating their dependency on magnitude.
##### Scoring Function
The weights $w_i$ are derived from a scoring function $S(R_i)$, which assigns importance to each rank. A common choice is the **Wilcoxon scoring function**, defined as:
$$
S(R_i) = \frac{R_i}{n + 1},
$$
which gives equal weight to all ranks, scaled by their position relative to the total number of observations $n$.
Other scoring functions can emphasize different parts of the rank distribution:
- **Normal Scores**: Derived from the quantiles of a standard normal distribution.
- **Logarithmic Scores**: Weight lower ranks more heavily.
The flexibility of the scoring function allows $R$-estimators to adapt to various data structures and assumptions.
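A short R illustration of ranks and Wilcoxon scores (the residual values are made up):

```r
# Wilcoxon scores from residual ranks.
e <- c(-1.2, 0.3, 5.0, -0.1, 0.8)   # residuals, one outlier (5.0)
R <- rank(e)                         # ranks 1..n
S_wilcoxon <- R / (length(e) + 1)    # Wilcoxon scoring function
S_wilcoxon
# The outlier gets the top rank, but its score is capped at n/(n+1)
# no matter how extreme its magnitude.
```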
------------------------------------------------------------------------
#### Properties of $R$-Estimators
##### Influence Function and Robustness
A key feature of $R$-estimators is their **bounded influence function**, which ensures robustness. Because the estimator depends only on the ranks of the residuals, extreme values in $y$ or $x$ do not disproportionately affect the results.
For $R$-estimators, the influence function $\psi(e_i)$ is proportional to the derivative of the rank-based objective function:
$$
\psi(e_i) = S'(R_i),
$$
where $S'(R_i)$ is the derivative of the scoring function. Since $R_i$ depends only on the ordering of residuals, outliers in the data cannot produce excessive changes in $R_i$, resulting in bounded influence.
##### Breakdown Point
The **breakdown point** of $R$-estimators is higher than that of OLS and comparable to other robust methods. This means they can tolerate a larger proportion of contaminated data without yielding unreliable results.
##### Asymptotic Efficiency
Under specific scoring functions, $R$-estimators achieve high asymptotic efficiency. For example, the Wilcoxon $R$-estimator performs nearly as well as OLS under normality while retaining robustness to non-normality.
------------------------------------------------------------------------
#### Derivation of $R$-Estimators for Simple Linear Regression
Consider the simple linear regression model:
$$
y_i = \beta_0 + \beta_1 x_i + e_i,
$$
where $e_i = y_i - (\beta_0 + \beta_1 x_i)$ are the residuals.
1. **Rank the Residuals**: Compute the residuals $e_i$ for all observations and rank them from smallest to largest.
2. **Assign Weights**: Compute weights $w_i$ for each residual rank based on the scoring function $S(R_i)$.
3. **Minimize the Rank-Based Objective**: Solve the following optimization problem:
$$
\hat{\beta}_R = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^n w_i R_i \left( y_i - (\beta_0 + \beta_1 x_i) \right).
$$
This minimization can be performed iteratively using numerical methods, as the rank-based nature of the function makes direct analytic solutions challenging.
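A sketch of this numerical approach in base R, using centered Wilcoxon scores (Jaeckel's dispersion). One detail worth noting: with centered scores the objective is invariant to the intercept, so a standard device is to estimate the slope by minimizing the dispersion and then recover the intercept as the median residual.

```r
# Rank-based (Jaeckel-type) slope estimate for simple regression.
rank_dispersion <- function(b, x, y) {
  e <- y - b * x
  R <- rank(e)
  a <- R / (length(e) + 1) - 0.5   # centered Wilcoxon scores
  sum(a * e)                        # Jaeckel's dispersion (convex in b)
}

set.seed(2)
x <- runif(80, 0, 10)
y <- 1 + 0.7 * x + rt(80, df = 2)  # heavy-tailed t_2 errors

b1 <- optimize(rank_dispersion, interval = c(-5, 5), x = x, y = y)$minimum
b0 <- median(y - b1 * x)           # intercept from the residual median
c(b0, b1)                          # near (1, 0.7)
```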
------------------------------------------------------------------------
#### Comparison to $M$-Estimators
While $M$-estimators downweight large residuals using robust loss functions, $R$-estimators completely avoid reliance on the magnitude of residuals by using their ranks. This distinction has important implications:
- $R$-estimators are naturally robust to leverage points and extreme outliers.
- The performance of $R$-estimators is less sensitive to the choice of scale parameter compared to $M$-estimators.
- However, $R$-estimators may be less efficient than $M$-estimators under normality because they do not use the full information contained in the residual magnitudes.
------------------------------------------------------------------------
### $L$-Estimators
$L$-estimators are a class of robust estimators constructed as **linear combinations of order statistics**, where order statistics are simply the sorted values of a dataset. These estimators are particularly appealing due to their intuitive nature and computational simplicity. By using the relative ranks of observations, $L$-estimators offer robustness against outliers and heavy-tailed distributions.
Order statistics are denoted as $y_{(1)}, y_{(2)}, \dots, y_{(n)}$, where $y_{(i)}$ is the $i$th smallest observation in the sample.
------------------------------------------------------------------------
The general form of an $L$-estimator is:
$$
\hat{\theta}_L = \sum_{i=1}^n c_i y_{(i)},
$$
where:
- $y_{(i)}$ are the order statistics (sorted observations),
- $c_i$ are coefficients (weights) that determine the contribution of each order statistic to the estimator.
By appropriately choosing the weights $c_i$, different types of $L$-estimators can be constructed to suit specific needs, such as handling outliers or capturing central tendencies robustly.
------------------------------------------------------------------------
#### Examples of $L$-Estimators
1. **Sample Median**: The sample median is a simple $L$-estimator where only the middle order statistic contributes (for odd $n$) or the average of the two middle order statistics contributes (for even $n$):
$$
\hat{\mu}_{\text{median}} =
\begin{cases}
y_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd}, \\
\frac{1}{2}\left(y_{\left(\frac{n}{2}\right)} + y_{\left(\frac{n}{2} + 1\right)}\right) & \text{if } n \text{ is even}.
\end{cases}
$$
- **Robustness**: The median has a breakdown point of $50\%$, meaning it remains unaffected unless more than half the data are corrupted.
- **Efficiency**: Under normality, the efficiency of the median is lower than that of the mean (about $64\%$).
2. **Trimmed Mean**: The trimmed mean excludes the smallest and largest $k\%$ of observations before averaging the remaining values:
$$
\hat{\mu}_T = \frac{1}{n - 2k} \sum_{i=k+1}^{n-k} y_{(i)},
$$
where:
- $k$ is the number of observations trimmed from each tail,
- $n$ is the sample size.
- **Robustness**: The trimmed mean is less sensitive to extreme values than the sample mean.
- **Efficiency**: By retaining most observations, the trimmed mean achieves a good balance between robustness and efficiency.
3. **Winsorized Mean**: Similar to the trimmed mean, but instead of excluding extreme values, it replaces them with the nearest remaining observations:
$$
\hat{\mu}_W = \frac{1}{n} \sum_{i=1}^n y_{(i)}^*,
$$
where $y_{(i)}^*$ are "Winsorized" values: $$
y_{(i)}^* =
\begin{cases}
y_{(k+1)} & \text{if } i \leq k, \\
y_{(i)} & \text{if } k+1 \leq i \leq n-k, \\
y_{(n-k)} & \text{if } i > n-k.
\end{cases}
$$
- **Robustness**: The Winsorized mean reduces the influence of outliers without discarding data.
- **Efficiency**: Slightly less efficient than the trimmed mean under normality.
4. **Midrange**: The midrange is the average of the smallest and largest observations:
$$
\hat{\mu}_{\text{midrange}} = \frac{y_{(1)} + y_{(n)}}{2}.
$$
- **Robustness**: Poor robustness, as it depends entirely on the extreme observations.
- **Simplicity**: Highly intuitive and computationally trivial.
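The four estimators above can be compared on a small contaminated sample in base R (the `winsorize` helper is written here for illustration; base R has no built-in Winsorized mean):

```r
x <- c(2, 3, 3, 4, 5, 5, 6, 7, 8, 120)   # one gross outlier

mean(x)                  # 16.3: dragged upward by the outlier
median(x)                # 5: unaffected
mean(x, trim = 0.1)      # 10% trimmed mean: drops min and max

# Winsorized mean: clamp the k most extreme values on each side.
winsorize <- function(x, k = 1) {
  s <- sort(x)
  s[seq_len(k)] <- s[k + 1]
  s[(length(s) - k + 1):length(s)] <- s[length(s) - k]
  mean(s)
}
winsorize(x)

(min(x) + max(x)) / 2    # midrange: 61, ruined by the outlier
```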
------------------------------------------------------------------------
#### Properties of $L$-Estimators
1. **Robustness to Outliers**: $L$-estimators gain robustness by downweighting or excluding extreme observations. For instance:
- The trimmed mean completely removes outliers from the estimation process.
- The Winsorized mean limits the influence of outliers by bounding their values.
2. **Breakdown Point**:
- The breakdown point of an $L$-estimator depends on how many extreme observations are excluded or replaced.
- The median has the highest possible breakdown point ($50\%$), while the trimmed and Winsorized means have breakdown points proportional to the trimming percentage.
3. **Efficiency**:
- The efficiency of $L$-estimators varies depending on the underlying data distribution and the specific estimator.
- For symmetric distributions, the trimmed mean and Winsorized mean approach the efficiency of the sample mean while being much more robust.
4. **Computational Simplicity**:
- $L$-estimators involve simple operations like sorting and averaging, making them computationally efficient even for large datasets.
------------------------------------------------------------------------
#### Derivation of the Trimmed Mean
To understand the robustness of the trimmed mean, consider a dataset with $n$ observations. Sorting the data gives $y_{(1)} \leq y_{(2)} \leq \dots \leq y_{(n)}$. After trimming the smallest $k$ and largest $k$ observations, the remaining $n - 2k$ observations are used to compute the mean:
$$
\hat{\mu}_T = \frac{1}{n - 2k} \sum_{i=k+1}^{n-k} y_{(i)}.
$$
Key observations:
- **Impact of** $k$: Larger $k$ increases robustness by removing more extreme values but reduces efficiency by discarding more data.
- **Choosing** $k$: In practice, $k$ is often chosen as a percentage of the total sample size, such as $10\%$ trimming ($k = \lfloor 0.1n \rfloor$).
------------------------------------------------------------------------
### Least Trimmed Squares (LTS)
Least Trimmed Squares (LTS) is a robust regression method that minimizes the sum of the smallest $h$ squared residuals, rather than using all residuals as in [Ordinary Least Squares](#ordinary-least-squares). This approach ensures that large residuals, often caused by outliers or leverage points, have no influence on the parameter estimation.
The LTS estimator is defined as:
$$
\hat{\beta}_{LTS} = \arg\min_\beta \sum_{i=1}^h r_{[i]}^2,
$$
where:
- $r_{[i]}^2$ are the ordered squared residuals, ranked from smallest to largest,
- $h$ is the subset size of residuals to include in the minimization, typically chosen as $h = \lfloor n/2 \rfloor + 1$ (where $n$ is the sample size).
This trimming process ensures robustness by focusing on the best-fitting $h$ observations and ignoring the most extreme residuals.
------------------------------------------------------------------------
#### Motivation for LTS
In OLS regression, the objective is to minimize the Residual Sum of Squares (RSS):
$$
RSS = \sum_{i=1}^n r_i^2,
$$
where $r_i = y_i - x_i'\beta$ are the residuals. However, this method is highly sensitive to outliers because even one large residual ($r_i^2$) can dominate the RSS, distorting the parameter estimates $\beta$.
LTS addresses this issue by trimming the largest residuals and focusing only on the $h$ smallest ones, thus preventing extreme values from affecting the fit. This approach provides a more robust estimate of the regression coefficients $\beta$.
------------------------------------------------------------------------
#### Properties of LTS
1. **Objective Function**: The LTS objective function is non-differentiable because it involves ordering the squared residuals. Formally, the ordered residuals are denoted as:
$$
r_{[1]}^2 \leq r_{[2]}^2 \leq \dots \leq r_{[n]}^2,
$$
and the objective is to minimize:
$$
\sum_{i=1}^h r_{[i]}^2.
$$
This requires sorting the squared residuals, making the computation more complex than OLS.
2. **Choice of** $h$: The parameter $h$ determines the number of residuals included in the minimization. A common choice is:
$$
h = \lfloor n/2 \rfloor + 1,
$$
which ensures a high breakdown point (discussed below). Smaller values of $h$ increase robustness but reduce efficiency, while larger $h$ values improve efficiency but decrease robustness.
3. **Breakdown Point**: LTS has a **breakdown point** of approximately $50\%$, the highest possible for a regression estimator. This means that LTS can handle up to $50\%$ of contaminated data (e.g., outliers) without yielding unreliable estimates.
4. **Robustness**: By focusing only on the $h$ best-fitting observations, LTS naturally excludes outliers from the estimation process, making it highly robust to both vertical outliers (extreme values in $y$) and leverage points (extreme values in $x$).
------------------------------------------------------------------------
#### Algorithm for LTS
Computing the LTS estimator involves the following steps:
1. **Initialization**: Select an initial subset of $h$ observations to compute a preliminary fit for $\beta$.
2. **Residual Calculation**: For each observation, compute the squared residuals:
$$
r_i^2 = \left(y_i - x_i'\beta\right)^2.
$$
3. **Trimming**: Rank the residuals from smallest to largest and retain only the $h$ smallest residuals.
4. **Refitting**: Use the $h$ retained observations to recompute the regression coefficients $\beta$.
5. **Iterative Refinement**: Repeat the process (residual calculation, trimming, refitting) until convergence, typically when $\beta$ stabilizes.
Efficient algorithms, such as the **Fast-LTS** algorithm, are used in practice to reduce computational complexity.
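A minimal random-restart sketch of this algorithm in base R; the two-point elemental starts and fixed number of concentration steps are simplifications of what FAST-LTS actually does:

```r
# Random-subsample LTS for simple regression.
lts_fit <- function(x, y, h = floor(length(y) / 2) + 1, n_starts = 200) {
  X <- cbind(1, x)
  best <- list(obj = Inf, beta = NULL)
  for (s in seq_len(n_starts)) {
    idx <- sample(length(y), 2)                 # elemental start: 2 points
    beta <- tryCatch(solve(X[idx, ], y[idx]), error = function(e) NULL)
    if (is.null(beta)) next
    for (step in 1:10) {                        # concentration steps
      r2 <- (y - X %*% beta)^2
      keep <- order(r2)[1:h]                    # h smallest squared residuals
      beta <- solve(crossprod(X[keep, ]), crossprod(X[keep, ], y[keep]))
    }
    obj <- sum(sort((y - X %*% beta)^2)[1:h])
    if (obj < best$obj) best <- list(obj = obj, beta = as.vector(beta))
  }
  best$beta
}

set.seed(3)
x <- runif(100, 0, 10)
y <- 1 + 2 * x + rnorm(100, sd = 0.5)
y[1:30] <- rnorm(30, mean = 40, sd = 1)   # 30% gross contamination
lts_fit(x, y)                              # near (1, 2); lm() fails badly here
```

For production use, `robustbase::ltsReg()` implements FAST-LTS directly.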
------------------------------------------------------------------------
#### Comparison of LTS with OLS
| Property | OLS | LTS |
|-----------------------------|-------------------------------|--------------------------------------------|
| **Objective** | Minimize $\sum_{i=1}^n r_i^2$ | Minimize $\sum_{i=1}^h r_{[i]}^2$ |
| **Sensitivity to Outliers** | High | Low |
| **Breakdown Point** | $1/n$ | $\approx 50\%$ |
| **Computational Cost** | Low | Moderate (requires sorting and iterations) |
: Comparison of LTS with OLS
------------------------------------------------------------------------
### $S$-Estimators
$S$-estimators are a class of robust estimators that focus on minimizing a robust measure of the **dispersion** of residuals. Unlike methods such as $M$-estimators, which directly minimize a loss function based on residuals, $S$-estimators aim to find the parameter values $\beta$ that produce residuals with the smallest robust scale. These estimators are particularly useful in handling datasets with outliers, heavy-tailed distributions, or other violations of classical assumptions.
The scale $\sigma$ is estimated by solving the following minimization problem:
$$
\hat{\sigma}_S = \arg\min_\sigma \frac{1}{n} \sum_{i=1}^n \rho\left(\frac{y_i - x_i'\beta}{\sigma}\right),
$$
where:
- $\rho$ is a robust loss function that controls the influence of residuals,
- $y_i$ are observed responses, $x_i$ are predictors, $\beta$ is the vector of regression coefficients,
- $\sigma$ represents the robust scale of the residuals.
Once the scale function is defined, the $S$-estimator of $\beta$ is obtained by minimizing the robust scale of the residuals, viewed as a function of the coefficients:
$$
\hat{\beta}_S = \arg\min_\beta \hat{\sigma}_S(\beta),
$$
where $\hat{\sigma}_S(\beta)$ denotes the scale estimate computed from the residuals $y_i - x_i'\beta$.
------------------------------------------------------------------------
#### Motivation for $S$-Estimators
In regression analysis, classical methods such as Ordinary Least Squares rely on minimizing the Residual Sum of Squares (RSS). However, OLS is highly sensitive to outliers because even a single extreme residual can dominate the sum of squared residuals, leading to biased estimates of $\beta$.
$S$-estimators address this limitation by using a robust scale $\sigma$ to evaluate the dispersion of residuals. By minimizing this scale, $S$-estimators effectively downweight the influence of outliers, resulting in parameter estimates that are more resistant to contamination in the data.
------------------------------------------------------------------------
#### Key Concepts in $S$-Estimators
1. **Robust Scale Function**: The key idea of $S$-estimators is to minimize a robust measure of scale. The scale $\sigma$ is computed such that the residuals normalized by $\sigma$ produce a value close to the expected contribution of well-behaved observations.
Formally, $\sigma$ satisfies:
$$
\frac{1}{n} \sum_{i=1}^n \rho\left(\frac{y_i - x_i'\beta}{\sigma}\right) = \delta,
$$
where $\delta$ is a constant that depends on the choice of $\rho$ and ensures consistency under normality. This equation balances the residuals and controls their influence on the scale estimate.
2. **Choice of** $\rho$-Function: The choice of the robust $\rho$ function is critical in determining the behavior of $S$-estimators. Common $\rho$ functions include:
- **Huber's** $\rho$-Function: $$
\rho(z) =
\begin{cases}
z^2/2 & \text{if } |z| \leq c, \\
c|z| - c^2/2 & \text{if } |z| > c.
\end{cases}
$$
- **Tukey's Bisquare**: $$
\rho(z) =
\begin{cases}
c^2 \left(1 - \left(1 - \left(\frac{z}{c}\right)^2\right)^3\right)/6 & \text{if } |z| \leq c, \\
c^2/6 & \text{if } |z| > c.
\end{cases}
$$
- **Andrews' Sine**: $$
\rho(z) =
\begin{cases}
c^2 \left(1 - \cos\left(\frac{z}{c}\right)\right)/2 & \text{if } |z| \leq \pi c, \\
c^2/2 & \text{if } |z| > \pi c.
\end{cases}
$$
Robust $\rho$ functions grow more slowly than the quadratic function used in OLS, limiting the impact of large residuals.
------------------------------------------------------------------------
#### Properties of $S$-Estimators
1. **Breakdown Point**: $S$-estimators have a breakdown point of up to $50\%$, meaning they can tolerate up to half the data being contaminated (e.g., outliers) without yielding unreliable estimates.
2. **Efficiency**: The efficiency of $S$-estimators depends on the choice of $\rho$. While they are highly robust, their efficiency under ideal conditions (e.g., normality) may be lower than that of OLS. Proper tuning of $\rho$ can balance robustness and efficiency.
3. **Influence Function**: The **influence function** measures the sensitivity of the estimator to a small perturbation in the data. For $S$-estimators, the influence function is bounded, ensuring robustness to outliers.
4. **Consistency**: Under mild regularity conditions, $S$-estimators are consistent, meaning $\hat{\beta}_S \to \beta$ as the sample size $n \to \infty$.
5. **Asymptotic Normality**: $S$-estimators are asymptotically normal, with:
$$
\sqrt{n}(\hat{\beta}_S - \beta) \xrightarrow{d} N(0, \Sigma),
$$
where $\Sigma$ depends on the choice of $\rho$ and the distribution of residuals.
------------------------------------------------------------------------
#### Algorithm for Computing $S$-Estimators
1. **Initial Guess**: Compute an initial estimate of $\beta$ using a robust method (e.g., LTS or an $M$-estimator).
2. **Scale Estimation**: Compute a robust estimate of scale $\hat{\sigma}$ by solving:
$$
\frac{1}{n} \sum_{i=1}^n \rho\left(\frac{y_i - x_i'\beta}{\sigma}\right) = \delta.
$$
3. **Iterative Refinement**:
- Recalculate residuals $r_i = y_i - x_i'\beta$.
- Update $\beta$ and $\sigma$ iteratively until convergence, typically using numerical optimization techniques.
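Step 2's scale equation can be solved numerically with `uniroot()`. The following sketch uses Tukey's bisquare with $c = 1.547$ (the value conventionally associated with a $50\%$ breakdown point) and takes $\delta = E[\rho(Z)]$ under the standard normal, computed by numerical integration:

```r
# Solve the S-scale equation (1/n) sum rho(r_i / sigma) = delta for sigma.
tukey_rho <- function(z, c = 1.547) {
  ifelse(abs(z) <= c, c^2 * (1 - (1 - (z / c)^2)^3) / 6, c^2 / 6)
}

# delta = E[rho(Z)] for Z ~ N(0,1), which makes the scale consistent
# at the normal model.
delta <- integrate(function(z) tukey_rho(z) * dnorm(z), -Inf, Inf)$value

s_scale <- function(r, c = 1.547) {
  uniroot(function(s) mean(tukey_rho(r / s, c)) - delta,
          interval = c(1e-6, 10 * max(abs(r)) + 1e-6))$value
}

set.seed(4)
r <- rnorm(1000, sd = 2)
s_scale(r)        # approximately 2: consistent at the normal model
```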
------------------------------------------------------------------------
### $MM$-Estimators
$MM$-estimators are a robust regression method that combines the strengths of two powerful techniques: $S$-estimators and $M$-estimators. They are designed to achieve both a **high breakdown point** (up to $50\%$) and **high efficiency** under ideal conditions (e.g., normality). This combination makes $MM$-estimators one of the most versatile and widely used robust regression methods.
The process of computing $MM$-estimators involves three main steps:
1. Compute an initial robust estimate of scale using an $S$-estimator.
2. Use this robust scale to define weights for an $M$-estimator.
3. Estimate regression coefficients by solving the weighted $M$-estimation problem.
This stepwise approach ensures robustness in the initial scale estimation while leveraging the efficiency of $M$-estimators for the final parameter estimates.
------------------------------------------------------------------------
Step 1: Robust Scale Estimation
The first step estimates the robust scale $\sigma$ with an $S$-estimator: choose $\beta$ to minimize the $M$-scale of the residuals,
$$
\hat{\sigma}_S = \min_\beta \hat{\sigma}(\beta),
$$
where, for each $\beta$, $\hat{\sigma}(\beta)$ solves
$$
\frac{1}{n} \sum_{i=1}^n \rho_S\left(\frac{y_i - x_i'\beta}{\hat{\sigma}(\beta)}\right) = \delta.
$$
Here $\rho_S$ is a loss function chosen to control the influence of extreme residuals; Tukey's bisquare function is the standard choice, since the scale step requires a bounded $\rho$ to attain a high breakdown point. This scale estimate provides a robust baseline for weighting residuals in the subsequent $M$-estimation step.
------------------------------------------------------------------------
Step 2: Weight Definition for $M$-Estimation
Using the robust scale $\hat{\sigma}_S$ obtained in Step 1, the weights for the $M$-estimator are defined based on a second loss function, $\rho_M$. The weights downweight residuals proportional to their deviation relative to $\hat{\sigma}_S$. For each residual $r_i = y_i - x_i'\beta$, the weight is computed as:
$$
w_i = \frac{\psi_M\left(r_i / \hat{\sigma}_S\right)}{r_i / \hat{\sigma}_S},
$$
where:
- $\psi_M$ is the derivative of the robust $\rho_M$ function, known as the **influence function**.
- $\rho_M$ is often chosen to provide high efficiency under normality, such as Huber's or Hampel's function.
These weights reduce the impact of large residuals while preserving the influence of small, well-behaved residuals.
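A minimal sketch of these weights for Huber's $\psi$, assuming the usual $95\%$-efficiency tuning constant $k = 1.345$:

```{r}
# Huber weights: w(z) = psi(z) / z with psi(z) = max(-k, min(k, z)),
# so w(z) = 1 for |z| <= k and k / |z| beyond that.
psi_huber <- function(z, k = 1.345) pmax(-k, pmin(k, z))
huber_weight <- function(z, k = 1.345) {
  ifelse(z == 0, 1, psi_huber(z, k) / z)  # the limit at z = 0 is 1
}

z <- c(0, 0.5, 1, 2, 5, 10)
round(huber_weight(z), 3)  # weight 1 for small residuals, shrinking thereafter
```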
------------------------------------------------------------------------
Step 3: Final $M$-Estimation
The final step solves the $M$-estimation problem at the fixed robust scale $\hat{\sigma}_S$:
$$
\hat{\beta}_{MM} = \arg\min_\beta \sum_{i=1}^n \rho_M\left(\frac{y_i - x_i'\beta}{\hat{\sigma}_S}\right).
$$
In practice this is solved by iteratively reweighted least squares, recomputing the weights $w_i$ from Step 2 at each iteration. Holding the scale fixed ensures that the final estimates combine the robustness of the initial $S$-estimator with the efficiency of the $M$-estimator.
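The three steps can be strung together by hand. The sketch below is illustrative only: it uses an LTS fit as the robust start, the MAD of its residuals as a stand-in for the $S$-scale of Step 1, and Huber weights in an iteratively reweighted least-squares $M$-step; `robustbase::lmrob()` is the production implementation.

```{r}
# Hand-rolled MM sketch: robust start, fixed robust scale, IRLS M-step.
mm_sketch <- function(x, y, k = 1.345, tol = 1e-8, maxit = 100) {
  X <- cbind(Intercept = 1, x = x)
  beta <- coef(MASS::ltsreg(y ~ x))      # Step 1: high-breakdown start
  sigma <- mad(drop(y - X %*% beta))     # stand-in for the S-scale
  for (i in seq_len(maxit)) {            # Steps 2-3: IRLS at fixed scale
    z <- drop(y - X %*% beta) / sigma
    w <- pmin(1, k / abs(z))             # Huber weights psi(z) / z
    beta_new <- coef(lm.wfit(X, y, w))
    if (max(abs(beta_new - beta)) < tol) break
    beta <- beta_new
  }
  beta
}

set.seed(42)
x <- rnorm(60)
y <- 1 + 2 * x + rnorm(60)
y[1:6] <- y[1:6] + 15      # contaminate 10% of the responses
mm_sketch(x, y)            # stays near the true values (1, 2)
```

Because the scale is estimated once from a high-breakdown fit and then held fixed, the contaminated observations receive small weights throughout the IRLS iterations.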
------------------------------------------------------------------------
#### Properties of $MM$-Estimators
1. **High Breakdown Point**:
- The $S$-estimator in the first step ensures a breakdown point of up to $50\%$, meaning the estimator can handle up to half the data being contaminated without producing unreliable results.
2. **Asymptotic Efficiency**:
- The use of an efficient $\rho_M$ function in the final $M$-estimation step ensures that $MM$-estimators achieve high asymptotic efficiency under normality, often close to that of OLS.
3. **Robustness**:
- The combination of robust scale estimation and downweighting of large residuals makes $MM$-estimators highly robust to outliers and leverage points.
4. **Influence Function**:
- The influence function of $MM$-estimators is bounded, ensuring that no single observation can exert disproportionate influence on the parameter estimates.
5. **Consistency**:
- $MM$-estimators are consistent, converging to the true parameter values as the sample size increases, provided the majority of the data satisfies the model assumptions.
6. **Asymptotic Normality**:
- $MM$-estimators are asymptotically normal, with:
$$
\sqrt{n} (\hat{\beta}_{MM} - \beta) \xrightarrow{d} N(0, \Sigma),
$$
where $\Sigma$ depends on the choice of $\rho_M$ and the distribution of residuals.
------------------------------------------------------------------------
#### Choice of $\rho$-Functions for $MM$-Estimators
The robustness and efficiency of $MM$-estimators depend on the choice of $\rho_S$ (for scale) and $\rho_M$ (for final estimation). Common choices include:
1. **Huber's** $\rho$-Function: Combines quadratic and linear growth to balance robustness and efficiency:
$$
\rho(z) =
\begin{cases}
\frac{z^2}{2} & \text{if } |z| \leq c, \\
c|z| - \frac{c^2}{2} & \text{if } |z| > c.
\end{cases}
$$
2. **Tukey's Bisquare Function**: Provides high robustness by completely bounding large residuals:
$$
\rho(z) =
\begin{cases}
c^2 \left(1 - \left(1 - \left(\frac{z}{c}\right)^2\right)^3\right)/6 & \text{if } |z| \leq c, \\
c^2/6 & \text{if } |z| > c.
\end{cases}
$$
3. **Hampel's Three-Part Redescending Function**: Further limits the influence of large residuals in stages, with the penalty becoming constant beyond the outermost threshold (for constants $0 < a \leq b \leq c$):
$$
\rho(z) =
\begin{cases}
z^2/2 & \text{if } |z| \leq a, \\
a|z| - a^2/2 & \text{if } a < |z| \leq b, \\
\dfrac{a(b + c - a)}{2} - \dfrac{a(c - |z|)^2}{2(c - b)} & \text{if } b < |z| \leq c, \\
\dfrac{a(b + c - a)}{2} & \text{if } |z| > c.
\end{cases}
$$
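A compact way to compare the three $\rho$-functions is to evaluate them at a few residual magnitudes. The tuning constants below (Huber $k = 1.345$; bisquare $c = 4.685$; Hampel $a = 2$, $b = 4$, $c = 8$) are common defaults, assumed here for illustration:

```{r}
rho_huber <- function(z, k = 1.345)
  ifelse(abs(z) <= k, z^2 / 2, k * abs(z) - k^2 / 2)
rho_bisquare <- function(z, c = 4.685)
  ifelse(abs(z) <= c, c^2 * (1 - (1 - (z / c)^2)^3) / 6, c^2 / 6)
rho_hampel <- function(z, a = 2, b = 4, c = 8) {
  az <- abs(z)
  ifelse(az <= a, z^2 / 2,
  ifelse(az <= b, a * az - a^2 / 2,
  ifelse(az <= c, a * (b + c - a) / 2 - a * (c - az)^2 / (2 * (c - b)),
                  a * (b + c - a) / 2)))
}

z <- c(1, 3, 10, 100)
rbind(huber = rho_huber(z),
      bisquare = rho_bisquare(z),
      hampel = rho_hampel(z))
```

Huber's penalty keeps growing linearly, so gross outliers still pull on the fit; the bisquare and Hampel penalties level off at a constant, which is what makes them redescending.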
------------------------------------------------------------------------
### Practical Considerations
The following table summarizes the key properties, advantages, and limitations of the robust estimators discussed:
| Estimator | Key Features | Breakdown Point | Efficiency (Under Normality) | Applications | Advantages |
|-----------------|------------------------------------------------------------------------------------------------------------|-------------------------|------------------------------|-----------------------------------------------------------|------------------------------------------------|
| $M$-Estimators | Generalization of OLS; robust $\rho$ reduces large residual influence; flexible tuning via $\rho$-function | Moderate (up to $0.29$) | High with proper tuning | Wide applicability in regression with moderate robustness | Balances robustness and efficiency |
| $R$-Estimators | Rank-based method; immune to outliers in $x$ and $y$; suitable for ordinal or rank-based data | High (depends on ranks) | Moderate | Ordinal data or heavily skewed distributions | Handles both predictor and response outliers |
| $L$-Estimators | Linear combination of order statistics; easy to compute even for large datasets | High (up to $50\%$) | Moderate | Descriptive statistics, robust averages | Simple and intuitive |
| **LTS** | Minimizes smallest $h$ squared residuals; resistant to leverage points | High (up to $50\%$) | Moderate | Data with high contamination, fault detection | High robustness to outliers |
| $S$-Estimators | Minimizes robust scale of residuals; effective at detecting extreme outliers | High (up to $50\%$) | Low to moderate | Outlier detection, data with heavy-tailed distributions | Focus on robust scale estimation |
| $MM$-Estimators | High robustness (scale) + high efficiency (coefficients); versatile and flexible | High (up to $50\%$) | High | Mixed contamination and heavy-tailed errors | Combines robustness and efficiency effectively |
: Comparison of Robust Regression Estimators
------------------------------------------------------------------------
Notes on Choosing an Estimator
- $M$-Estimators: Best suited for general-purpose robust regression, offering a balance between robustness and efficiency with moderate contamination.
- $R$-Estimators: Ideal for rank-based data or ordinal data, especially when outliers are present in both predictors and responses.
- $L$-Estimators: Simple and effective for descriptive statistics or data cleaning with limited computational resources.
- **LTS**: Recommended for datasets with significant contamination or leverage points due to its high breakdown point.
- $S$-Estimators: Focus on robust scale estimation, suitable for identifying and mitigating the influence of extreme residuals.
- $MM$-Estimators: Combines the robustness of $S$-estimators with the efficiency of $M$-estimators, making it the most versatile choice for heavily contaminated data.
------------------------------------------------------------------------
```{r}
# Load necessary libraries
library(MASS) # For robust regression functions like rlm
library(robustbase) # For LTS regression and MM-estimators
library(dplyr) # For data manipulation
library(ggplot2) # For visualization
# Simulate dataset
set.seed(123)
n <- 100
x <- rnorm(n, mean = 5, sd = 2) # Predictor
y <- 3 + 2 * x + rnorm(n, sd = 1) # Response
# Introduce outliers
y[95:100] <- y[95:100] + 20 # Vertical outliers
x[90:95] <- x[90:95] + 10 # Leverage points
data <- data.frame(x, y)
```
```{r fig-scatterplot-sim-outlier, fig.cap="Scatterplot of Simulated Data with Outliers", fig.alt="Scatterplot titled 'Scatterplot of Simulated Data with Outliers' showing data points with 'Predictor (x)' on the horizontal axis and 'Response (y)' on the vertical axis. The plot displays a general upward trend with several outliers deviating from the main cluster of points.", out.width="100%", fig.align='center'}
# Visualize the data
ggplot(data, aes(x, y)) +
geom_point() +
labs(title = "Scatterplot of Simulated Data with Outliers",
x = "Predictor (x)",
y = "Response (y)") +
theme_minimal()
```
```{r}
# Ordinary Least Squares
ols_model <- lm(y ~ x, data = data)
summary(ols_model)
```
OLS coefficients are highly influenced by the presence of outliers. For example, the slope (x coefficient) and intercept are shifted to fit the outliers, resulting in a poor fit to the majority of the data.
```{r}
# $M$-Estimators
m_model <- rlm(y ~ x, data = data, psi = psi.huber)
summary(m_model)
```
The $M$-estimator reduces the influence of large residuals using Huber's psi function. This results in coefficients that are less affected by outliers compared to OLS.
```{r}
# Least Trimmed Squares (LTS)
lts_model <- ltsReg(y ~ x, data = data)
(lts_coefficients <- coef(lts_model))  # parentheses print the coefficients
```
LTS minimizes the smallest squared residuals, ignoring extreme residuals. This results in a more robust fit, particularly in the presence of both vertical outliers and leverage points.
```{r}
# MM-estimators (lmrob fits an MM-estimate by default)
mm_model <- lmrob(y ~ x, data = data)
summary(mm_model)
```
$MM$-estimators combine robust scale estimation (from $S$-estimators) with efficient coefficient estimation (from $M$-estimators). This achieves both high robustness and high efficiency under normal conditions.
```{r}
# Add fitted values from each model for visualization
data <- data %>%
  mutate(
    ols_fit = predict(ols_model, newdata = data),
    m_fit   = predict(m_model, newdata = data),
    lts_fit = fitted(lts_model),  # use fitted() for ltsReg objects
    mm_fit  = predict(mm_model, newdata = data)
  )
```
```{r fig-comp-regression-fit, fig.cap="Comparison of Regression Fits", fig.alt="Scatter plot titled 'Comparison of Regression Fits' showing data points with four dashed regression lines in red, blue, green, and purple, representing the OLS, M-estimator, LTS, and MM-estimator fits. The x-axis is labeled 'Predictor (x)' and the y-axis is labeled 'Response (y)'.", out.width="100%", fig.align='center'}
# Mapping the method name to the color aesthetic builds a legend;
# `label` is not a geom_line parameter, and `linewidth` replaces the
# deprecated `size` for lines in ggplot2 >= 3.4
ggplot(data, aes(x, y)) +
  geom_point() +
  geom_line(aes(y = ols_fit, color = "OLS"),
            linetype = "dashed", linewidth = 1) +
  geom_line(aes(y = m_fit, color = "M-Estimator"),
            linetype = "dashed", linewidth = 1) +
  geom_line(aes(y = lts_fit, color = "LTS"),
            linetype = "dashed", linewidth = 1) +
  geom_line(aes(y = mm_fit, color = "MM-Estimator"),
            linetype = "dashed", linewidth = 1) +
  scale_color_manual(
    name = "Method",
    values = c("OLS" = "red", "M-Estimator" = "blue",
               "LTS" = "green", "MM-Estimator" = "purple")
  ) +
  labs(title = "Comparison of Regression Fits",
       x = "Predictor (x)",
       y = "Response (y)") +
  theme_minimal()
```
Visualization shows the differences in regression fits:
- OLS is heavily influenced by outliers and provides a poor fit to the majority of the data.
- The $M$-estimator downweights large residuals, resulting in a better fit.
- LTS regression ignores the extreme residuals entirely, providing the most robust fit.
- $MM$-estimators balance robustness and efficiency, producing coefficients close to the LTS but with improved efficiency under normality.
```{r}
# Comparing Coefficients
comparison <- data.frame(
Method = c("OLS", "M-Estimator", "LTS", "MM-Estimator"),
Intercept = c(
coef(ols_model)[1],
coef(m_model)[1],
lts_coefficients[1],
coef(mm_model)[1]
),
Slope = c(
coef(ols_model)[2],
coef(m_model)[2],
lts_coefficients[2],
coef(mm_model)[2]
)
)
print(comparison)
```
The table above shows how the coefficients vary across methods:
- OLS coefficients are the most distorted by outliers.
- $M$-estimators and $MM$-estimators provide coefficients that are less influenced by extreme values.
- LTS regression, with its trimming mechanism, produces the most robust coefficients by excluding the largest residuals.
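Because the data were simulated from $y = 3 + 2x + \varepsilon$, the estimation error of each method can be read off directly from the `comparison` table:

```{r}
# Absolute deviation from the true simulation values (intercept 3, slope 2)
comparison %>%
  mutate(Intercept_Error = abs(Intercept - 3),
         Slope_Error     = abs(Slope - 2))
```

The robust fits should land much closer to the true values than OLS does on this contaminated sample.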