# Controls {#sec-controls}
The framework in this chapter follows the "good and bad controls" treatment of @cinelli2022crash and the canonical "bad control" discussion in [@angrist2009mostly, ch. 3].
```{r, message=FALSE, warning=FALSE}
library(dagitty)
library(ggdag)
```
Adding more control variables in regression models is often considered harmless or even beneficial. The idea is simple: controlling for potential confounders (variables that influence both the treatment and the outcome) should help isolate the causal effect of the variable of interest. This intuition can be misleading, however, especially when it comes to overcontrolling, an issue that is easy to overlook in applied work.
A common default in applied practice is to add control variables in order to "control" for potential confounding. Without a solid theoretical justification, this practice can lead to erroneous conclusions: irrelevant or inappropriate controls can obscure the true causal relationship between the treatment (e.g., a marketing campaign) and the outcome (e.g., sales) rather than clarify it. The symmetric problem also matters: variables that distort the analysis are sometimes left in the model when they should be removed. This is a subtle issue that connects to the concept of [Coefficient Stability](#sec-coefficient-stability-bounds).
Overcontrolling occurs when we include control variables that, given the causal structure, should not be conditioned on when estimating the effect of the exposure (e.g., treatment) on the outcome. This practice can introduce **bias** into the estimated causal effect. Two mechanisms recur throughout this chapter: **overcontrol bias**, where conditioning on a variable along the causal path blocks part of the effect we want to measure, and **collider bias** (as described in [Directed Acyclic Graphs](#sec-directed-acyclic-graphs)), where conditioning on a common effect induces a spurious correlation between the exposure and the outcome.
Multicollinearity, by contrast, is a separate concern about *precision*, not *bias*: including highly correlated regressors inflates standard errors but leaves OLS estimates unbiased, and it is treated under the regression diagnostics in [Linear Regression] rather than under the bad-controls framework here.
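A short simulation illustrates the contrast (the setup is illustrative): the coefficient on `x1` stays centered on its true value of 1 in both models, but its standard error balloons once a near-duplicate regressor enters.

```{r}
set.seed(42)
n <- 1e4
x1 <- rnorm(n)
x2 <- x1 + 0.1 * rnorm(n)   # near-duplicate of x1: severe multicollinearity
y  <- x1 + rnorm(n)
# Point estimate on x1 is unbiased in both models; only its precision suffers
jtools::export_summs(
  lm(y ~ x1),
  lm(y ~ x1 + x2),
  model.names = c("x1 only", "x1 + collinear x2")
)
```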
------------------------------------------------------------------------
## Bad Controls
The label *bad control* covers a surprisingly varied family of pre-treatment or post-treatment variables that look like sensible adjustments but actively distort the estimate. The clearest cases, collider conditioning (M-bias and selection bias) and overcontrol (conditioning on a mediator), share a common diagnosis: they open or block paths in the DAG that the analyst did not mean to touch. We work through them in order of how often they appear in published work, starting with M-bias, which is also the trap most often baked into a textbook recommendation to "control for everything pre-treatment."
### M-bias
A common intuition in causal inference is to control for any variable that precedes the treatment. This logic underpins much of the guidance in traditional econometric texts [@imbens2015causal; @angrist2009mostly], where pre-treatment variables like $Z$ are often recommended as controls if they correlate with both the treatment $X$ and the outcome $Y$.
This perspective is especially prevalent in [Matching Methods], where all observed pre-treatment covariates are typically included in the matching process. However, controlling for *every* pre-treatment variable can lead to *bad control bias*.
One such example is **M-bias**, which arises when conditioning on a *collider* --- a variable that is influenced by two unobserved causes. The DAG below illustrates a case where $Z$ appears to be a good control but actually opens a biasing path:
```{r}
# Clean workspace
rm(list = ls())
# DAG specification
model <- dagitty("dag{
x -> y
u1 -> x
u1 -> z
u2 -> z
u2 -> y
}")
# Set latent variables
latents(model) <- c("u1", "u2")
# Coordinates for plotting
coordinates(model) <- list(
x = c(x = 1, u1 = 1, z = 2, u2 = 3, y = 3),
y = c(x = 1, u1 = 2, z = 1.5, u2 = 2, y = 1)
)
# Plot the DAG
ggdag(model) + theme_dag()
```
In this structure, $Z$ is a **collider** on the path $X \leftarrow U_1 \rightarrow Z \leftarrow U_2 \rightarrow Y$. Controlling for $Z$ opens this path, introducing a spurious association between $X$ and $Y$ even if none existed originally.
Even though $Z$ is statistically correlated with both $X$ and $Y$, it is **not a confounder**, because it does not lie on a *back-door path* that needs to be blocked. Instead, adjusting for $Z$ biases the estimate of the causal effect of $X \to Y$.
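We can check this with `dagitty` directly, reusing the `model` object defined in the chunk above: the minimal sufficient adjustment set is empty, meaning nothing needs to be conditioned on, and in particular not $Z$.

```{r}
# "{}" (the empty set) indicates that no adjustment is necessary
adjustmentSets(model, exposure = "x", outcome = "y")
```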
Let's illustrate this with a simulation:
```{r}
set.seed(123)
n <- 1e4
u1 <- rnorm(n)
u2 <- rnorm(n)
z <- u1 + u2 + rnorm(n)
x <- u1 + rnorm(n)
causal_coef <- 2
y <- causal_coef * x - 4 * u2 + rnorm(n)
# Compare unadjusted and adjusted models
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Unadjusted", "Adjusted")
)
```
Notice how adjusting for $Z$ changes the estimate of the effect of $X$ on $Y$, even though $Z$ is not a true confounder. This is a textbook example of M-bias in practice.
#### Worse: M-bias with Direct Effect from Z to Y
A more difficult case arises when $Z$ also has a direct effect on $Y$. Consider the DAG below:
```{r}
# Clean workspace
rm(list = ls())
# DAG specification
model <- dagitty("dag{
x -> y
u1 -> x
u1 -> z
u2 -> z
u2 -> y
z -> y
}")
# Set latent variables
latents(model) <- c("u1", "u2")
# Coordinates for plotting
coordinates(model) <- list(
x = c(x = 1, u1 = 1, z = 2, u2 = 3, y = 3),
y = c(x = 1, u1 = 2, z = 1.5, u2 = 2, y = 1)
)
# Plot the DAG
ggdag(model) + theme_dag()
```
This situation presents a dilemma:
- **Not controlling for** $Z$ leaves the back-door path $X \leftarrow U_1 \to Z \to Y$ open, introducing confounding bias.
- **Controlling for** $Z$ opens the collider path $X \leftarrow U_1 \to Z \leftarrow U_2 \to Y$, which also biases the estimate.
In short, **no adjustment strategy** can fully remove bias from the estimate of $X \to Y$ using observed data alone.
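`dagitty` reaches the same verdict. Reusing the `model` object defined in the chunk above, with $U_1$ and $U_2$ declared latent, it finds no valid adjustment set for this DAG:

```{r}
# Returns no sets at all: no combination of the observed variables
# identifies the effect of x on y
adjustmentSets(model, exposure = "x", outcome = "y")
```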
**What Can Be Done?**
When facing such situations, we often turn to **sensitivity analysis** to assess how robust our causal conclusions are to unmeasured confounding. Specifically, recent advances [@cinelli2019sensitivity; @cinelli2020making] allow us to quantify:
1. **Plausible bounds** on the strength of the direct effect $Z \to Y$
2. **Sensitivity parameters** reflecting the possible influence of the latent variables $U_1$ and $U_2$
These tools help us understand how large the unmeasured biases would have to be in order to overturn our conclusions --- a pragmatic approach when perfect control is impossible.
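As a sketch, the `sensemakr` package [@cinelli2020making] implements this kind of analysis for linear models. The benchmark covariate and bound multipliers below are illustrative choices, not prescriptions, and the chunk assumes a fitted model with treatment `x` and observed covariate `z`:

```{r, eval=FALSE}
library(sensemakr)
fit <- lm(y ~ x + z)
# How strong would an unobserved confounder have to be (relative to z,
# at 1x, 2x, and 3x its strength) to overturn the estimate on x?
sens <- sensemakr(model = fit, treatment = "x",
                  benchmark_covariates = "z", kd = 1:3)
summary(sens)
plot(sens)  # contour plot of bias-adjusted estimates
```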
### Bias Amplification
Bias amplification occurs when controlling for a variable that is not a confounder magnifies the bias arising from an unobserved confounder.
In the DAG below, $U$ is an unobserved common cause of both $X$ and $Y$. $Z$ influences $X$ but has no causal relationship with $Y$. Including $Z$ in the model does not block any back-door path but instead increases the bias from $U$ by amplifying its association with $X$.
```{r}
# Clean workspace
rm(list = ls())
# DAG specification
model <- dagitty("dag{
x -> y
u -> x
u -> y
z -> x
}")
# Set latent variable
latents(model) <- c("u")
# Coordinates for plotting
coordinates(model) <- list(
x = c(z = 1, x = 2, u = 3, y = 4),
y = c(z = 1, x = 1, u = 2, y = 1)
)
# Plot the DAG
ggdag(model) + theme_dag()
```
Even though $Z$ is a strong predictor of $X$, it is **not a confounder**, because it is not a common cause of $X$ and $Y$. Controlling for $Z$ increases the portion of $X$'s variation explained by $U$, thus amplifying bias in estimating the effect of $X$ on $Y$.
Simulation:
```{r}
set.seed(123)
n <- 1e4
z <- rnorm(n)
u <- rnorm(n)
x <- 2*z + u + rnorm(n)
y <- x + 2*u + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Unadjusted", "Adjusted")
)
```
Observe that the adjusted model is **more biased** than the unadjusted one. This illustrates how controlling for a variable like $Z$ can *amplify* omitted variable bias.
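The amplification can also be derived analytically. With the simulated coefficients above ($x = 2z + u + \varepsilon$, $y = x + 2u + e$, all shocks standard normal), the omitted-variable bias formula gives

$$\operatorname{plim}\hat\beta_{\text{unadjusted}} = 1 + 2\,\frac{\operatorname{Cov}(x,u)}{\operatorname{Var}(x)} = 1 + 2 \cdot \tfrac{1}{6} \approx 1.33,$$

$$\operatorname{plim}\hat\beta_{\text{adjusted}} = 1 + 2\,\frac{\operatorname{Cov}(x,u \mid z)}{\operatorname{Var}(x \mid z)} = 1 + 2 \cdot \tfrac{1}{2} = 2.$$

Conditioning on $z$ shrinks the variance of $x$ from 6 to 2 without touching its covariance with $u$, tripling the bias term from $1/3$ to $1$.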
### Overcontrol Bias {#sec-overcontrol-bias}
Overcontrol bias arises when we adjust for variables that lie on the causal path from treatment to outcome, or that serve as proxies for the outcome.
#### Mediator Control
Controlling for a **mediator** --- a variable that lies on the causal path between treatment and outcome --- removes part of the effect we are trying to estimate.
```{r}
# Clean workspace
rm(list = ls())
# DAG: X → Z → Y
model <- dagitty("dag{
x -> z
z -> y
}")
coordinates(model) <- list(
x = c(x = 1, z = 2, y = 3),
y = c(x = 1, z = 1, y = 1)
)
ggdag(model) + theme_dag()
```
If we want to estimate the **total effect** of $X$ on $Y$, controlling for $Z$ (a mediator) leads to [overcontrol bias](#sec-overcontrol-bias).
```{r}
set.seed(123)
n <- 1e4
x <- rnorm(n)
z <- x + rnorm(n)
y <- z + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Total Effect", "Controlled for Mediator")
)
```
Here, $Z$ will appear significant, but including it blocks the causal path from $X$ to $Y$. This is misleading when the goal is to estimate the **total effect** of $X$.
#### Proxy for Mediator
In more complex scenarios, controlling for variables that **proxy** for mediators can introduce similar distortions.
```{r}
# Clean workspace
rm(list = ls())
# DAG: X → M → Z, M → Y
model <- dagitty("dag{
x -> m
m -> z
m -> y
}")
coordinates(model) <- list(
x = c(x = 1, m = 2, z = 2, y = 3),
y = c(x = 2, m = 2, z = 1, y = 2)
)
ggdag(model) + theme_dag()
```
```{r}
set.seed(123)
n <- 1e4
x <- rnorm(n)
m <- x + rnorm(n)
z <- m + rnorm(n)
y <- m + rnorm(n)
jtools::export_summs(lm(y ~ x),
lm(y ~ x + z),
model.names = c("Total Effect", "Controlled for Proxy Z"))
```
Even though $Z$ is not on the path from $X$ to $Y$, controlling for it removes part of the causal variation coming through $M$.
#### Overcontrol with Unobserved Confounding
When $Z$ is influenced by both $X$ and a latent confounder $U$ that also affects $Y$, controlling for $Z$ again biases the estimate.
```{r}
# Clean workspace
rm(list = ls())
# DAG: X → Z → Y; U → Z, U → Y
model <- dagitty("dag{
x -> z
z -> y
u -> z
u -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
x = c(x = 1, z = 2, u = 3, y = 4),
y = c(x = 1, z = 1, u = 2, y = 1)
)
ggdag(model) + theme_dag()
```
```{r}
set.seed(1)
n <- 1e4
x <- rnorm(n)
u <- rnorm(n)
z <- x + u + rnorm(n)
y <- z + u + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Unadjusted", "Controlled for Z")
)
```
Although the **total effect** of $X$ on $Y$ is correctly captured in the unadjusted model, adjusting for $Z$ introduces bias via the collider path $X \to Z \leftarrow U \to Y$.
> **Insight:** Controlling for $Z$ inadvertently **blocks** the direct effect of $X$ and **opens** a biasing path through $U$. This makes the adjusted model unreliable for causal inference.
These examples highlight the importance of **conceptual clarity** and **causal reasoning** in model specification. Not all covariates should be controlled for --- especially not those that are:
- Mediators (on the causal path)
- Proxies for mediators or outcomes
- Colliders or descendants of colliders
In business contexts, this often arises when analysts include *intermediate* variables like sales leads, customer engagement scores, or operational metrics without understanding whether these mediate the effect of a treatment (e.g., ad spend) or confound it.
### Selection Bias
Selection bias --- also known as **collider stratification bias** --- occurs when conditioning on a variable that is a *collider* (a common effect of two or more variables). This inadvertently opens non-causal paths, inducing spurious associations between variables that are otherwise independent or unconfounded.
Selection bias matters because it is rarely framed as a "control" decision in the first place. It often enters silently through sample restrictions, survey eligibility filters, or the way an outcome was operationalized, which means an analyst who never adds a single covariate to the regression can still be conditioning on a collider through the data they chose to observe. Recognizing the situation in practice usually starts with two questions: was a unit's *presence* in the sample affected by the treatment or outcome, and does the dataset contain a variable that is a downstream consequence of both? When the answer is yes, the standard back-door adjustment intuition (see the [Conditional Ignorability Assumption](#sec-conditional-ignorability-assumption)) breaks down even though every "confounder" appears to be controlled. Several of the cleanest fixes for this problem live outside the regression toolkit and instead exploit design-based identification, including [Instrumental Variables](#sec-instrumental-variables) and [Quasi-Experimental](#sec-quasi-experimental) approaches.
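A short simulation makes the "silent" version concrete: no collider is ever added to the regression, yet keeping only units with high values of a collider $S$ (an illustrative selection rule) biases the estimate of the true effect of 1.

```{r}
set.seed(123)
n <- 1e5
x <- rnorm(n)
y <- x + rnorm(n)
s <- x + y + rnorm(n)   # selection variable: a collider of x and y
keep <- s > 0           # e.g., only "successful" units enter the dataset
jtools::export_summs(
  lm(y ~ x),                 # full population: unbiased
  lm(y ~ x, subset = keep),  # selected sample: biased, with no control added
  model.names = c("Full sample", "Selected sample")
)
```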
#### Classic Collider Bias
In the DAG below, $Z$ is a collider between $X$ and a latent variable $U$. Controlling for $Z$ opens a back-door path from $X$ to $Y$ through $U$, introducing bias.
```{r}
rm(list = ls())
# DAG
model <- dagitty("dag{
x -> y
x -> z
u -> z
u -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
x = c(x = 1, z = 2, u = 2, y = 3),
y = c(x = 3, z = 2, u = 4, y = 3)
)
ggdag(model) + theme_dag()
```
Simulation:
```{r}
set.seed(123)
n <- 1e4
x <- rnorm(n)
u <- rnorm(n)
z <- x + u + rnorm(n)
y <- x + 2*u + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Unadjusted", "Adjusted for Z (collider)")
)
```
Controlling for $Z$ opens the non-causal path $X \to Z \leftarrow U \to Y$, resulting in **biased estimates** of the effect of $X$ on $Y$.
#### Collider Between Treatment and Outcome
In some cases, the collider is influenced directly by both the treatment and the outcome. This setting is also highly relevant in observational designs, particularly in retrospective or convenience sampling scenarios.
```{r}
rm(list = ls())
# DAG: X → Z ← Y
model <- dagitty("dag{
x -> y
x -> z
y -> z
}")
coordinates(model) <- list(
x = c(x = 1, z = 2, y = 3),
y = c(x = 2, z = 1, y = 2)
)
ggdag(model) + theme_dag()
```
Simulation:
```{r}
set.seed(123)
n <- 1e4
x <- rnorm(n)
y <- x + rnorm(n)
z <- x + y + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Unadjusted", "Adjusted for Collider Z")
)
```
Even though $Z$ is associated with both $X$ and $Y$, it **should not be controlled for**, because doing so opens the collider path $X \to Z \leftarrow Y$, generating spurious dependence.
### Case-Control Bias
Case-control studies often condition on the outcome itself or on a *descendant* of the outcome, which can bias the estimated effect of the treatment on the outcome through several distinct mechanisms.
In the DAG below, $Z$ is a **descendant of the outcome** $Y$ (not a descendant of a collider). Controlling for $Z$ blocks part of the variation in $Y$ that the treatment $X$ induces and can attenuate or bias the estimated $X \to Y$ effect.
```{r}
rm(list = ls())
# DAG: X -> Y -> Z (Z is a descendant of the outcome Y)
model <- dagitty("dag{
x -> y
y -> z
}")
coordinates(model) <- list(
x = c(x = 1, z = 2, y = 3),
y = c(x = 2, z = 1, y = 2)
)
ggdag(model) + theme_dag()
```
Simulation:
```{r}
set.seed(123)
n <- 1e4
x <- rnorm(n)
y <- x + rnorm(n)
z <- y + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Unadjusted", "Adjusted for Descendant Z")
)
```
Note the subtlety: if $X$ has a **true causal effect** on $Y$, then controlling for $Z$ (a descendant of $Y$) biases the estimate, because $Z$ inherits the treatment-induced variation in $Y$ and adjusting for it removes part of the channel $X \to Y$ we are trying to measure. However, if $X$ has **no causal effect** on $Y$, then $X$ remains *d-separated* from $Y$ even when adjusting for $Z$. In that special case, controlling for $Z$ will not falsely suggest an effect.
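The null-effect case is easy to verify in simulation: with no true effect of $X$ on $Y$, the coefficient on $X$ stays centered at zero whether or not we adjust for the descendant $Z$.

```{r}
set.seed(123)
n <- 1e4
x <- rnorm(n)
y <- rnorm(n)        # no causal effect of x on y
z <- y + rnorm(n)    # z is a descendant of y
jtools::export_summs(
  lm(y ~ x),
  lm(y ~ x + z),
  model.names = c("Unadjusted", "Adjusted for Descendant Z")
)
```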
> **Key Insight:** Whether or not adjustment induces bias depends on the presence or absence of a true causal path. This highlights the importance of DAGs in clarifying assumptions and guiding valid statistical inference.
### Summary
| Bias Type | Key Mistake | Path Opened | Consequence |
|------------------|-------------------|------------------|------------------|
| M-Bias | Controlling for a collider | $X \leftarrow U_1 \to Z \leftarrow U_2 \to Y$ | Spurious association |
| Bias Amplification | Controlling for a non-confounder | Amplifies unobserved confounding | Larger bias than before |
| Overcontrol Bias | Controlling for a mediator or proxy | Blocks part of causal effect | Underestimates total effect |
| Selection Bias | Conditioning on a collider or its descendant | $X \to Z \leftarrow Y$ or similar | Induced non-causal correlation |
| Case-Control Bias | Conditioning on a descendant of the outcome | $X \to Y \to Z$ | Attenuated $X \to Y$ effect |
------------------------------------------------------------------------
## Good Controls
### Omitted Variable Bias Correction
A variable $Z$ is a **good control** when it blocks all *back-door paths* from the treatment $X$ to the outcome $Y$. This is the fundamental criterion from the back-door adjustment theorem in causal inference.
#### Simple Confounder
In this DAG, $Z$ is a common cause of both $X$ and $Y$, i.e., a **confounder**.
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
z -> x
z -> y
}")
coordinates(model) <- list(
x = c(x = 1, z = 2, y = 3),
y = c(x = 1, z = 2, y = 1)
)
ggdag(model) + theme_dag()
```
Controlling for $Z$ removes the bias from the back-door path $X \leftarrow Z \rightarrow Y$.
```{r}
n <- 1e4
z <- rnorm(n)
causal_coef <- 2
beta2 <- 3
x <- z + rnorm(n)
y <- causal_coef * x + beta2 * z + rnorm(n)
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
```
#### Confounding via a Latent Variable
In this structure, $U$ is unobserved but causes both $Z$ and $Y$, and $Z$ affects $X$. Even though $U$ is not observed, adjusting for $Z$ helps block the back-door path from $X$ to $Y$ that goes through $U$.
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
u -> z
z -> x
u -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
x = c(x = 1, z = 2, u = 3, y = 4),
y = c(x = 1, z = 2, u = 3, y = 1)
)
ggdag(model) + theme_dag()
```
```{r}
n <- 1e4
u <- rnorm(n)
z <- u + rnorm(n)
x <- z + rnorm(n)
y <- 2 * x + u + rnorm(n)
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
```
Even though $Z$ appears significant, **its inclusion serves to reduce omitted variable bias** rather than having a causal interpretation itself.
#### $Z$ is caused by $U$, but also causes $Y$
This DAG illustrates a subtle case where $Z$ is on a non-causal path from $X$ to $Y$ and helps block bias through a shared cause $U$.
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
u -> z
u -> x
z -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
x = c(x = 1, z = 3, u = 2, y = 4),
y = c(x = 1, z = 2, u = 3, y = 1)
)
ggdag(model) + theme_dag()
```
```{r}
n <- 1e4
u <- rnorm(n)
z <- u + rnorm(n)
x <- u + rnorm(n)
y <- 2 * x + z + rnorm(n)
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
```
Again, **we cannot interpret the coefficient on** $Z$ **causally**, but including $Z$ helps reduce omitted variable bias from the unobserved confounder $U$.
#### Summary of Omitted Variable Correction
```{r}
# Model 1: Z is a confounder
model1 <- dagitty("dag{
x -> y
z -> x
z -> y
}")
coordinates(model1) <-
list(x = c(x = 1, z = 2, y = 3), y = c(x = 1, z = 2, y = 1))
# Model 2: Z is on path from U to X
model2 <- dagitty("dag{
x -> y
u -> z
z -> x
u -> y
}")
latents(model2) <- "u"
coordinates(model2) <-
list(x = c(
x = 1,
z = 2,
u = 3,
y = 4
),
y = c(
x = 1,
z = 2,
u = 3,
y = 1
))
# Model 3: Z influenced by U, affects Y
model3 <- dagitty("dag{
x -> y
u -> z
u -> x
z -> y
}")
latents(model3) <- "u"
coordinates(model3) <-
list(x = c(
x = 1,
z = 3,
u = 2,
y = 4
),
y = c(
x = 1,
z = 2,
u = 3,
y = 1
))
# par() has no effect on ggplot objects; patchwork arranges them instead
library(patchwork)
(ggdag(model1) + theme_dag()) |
  (ggdag(model2) + theme_dag()) |
  (ggdag(model3) + theme_dag())
```
### Omitted Variable Bias in Mediation Correction
When a variable $Z$ is a confounder of both the treatment $X$ and a mediator $M$, controlling for $Z$ helps isolate the **indirect and direct effects** more accurately.
#### Observed Confounder of Mediator and Treatment
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
z -> x
x -> m
z -> m
m -> y
}")
coordinates(model) <-
list(x = c(
x = 1,
z = 2,
m = 3,
y = 4
),
y = c(
x = 1,
z = 2,
m = 1,
y = 1
))
ggdag(model) + theme_dag()
```
```{r}
n <- 1e4
z <- rnorm(n)
x <- z + rnorm(n)
m <- 2 * x + z + rnorm(n)
y <- m + rnorm(n)
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
```
#### Latent Common Cause of Mediator and Treatment
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
u -> z
z -> x
x -> m
u -> m
m -> y
}")
latents(model) <- "u"
coordinates(model) <-
list(x = c(
x = 1,
z = 2,
u = 3,
m = 4,
y = 5
),
y = c(
x = 1,
z = 2,
u = 3,
m = 1,
y = 1
))
ggdag(model) + theme_dag()
```
```{r}
n <- 1e4
u <- rnorm(n)
z <- u + rnorm(n)
x <- z + rnorm(n)
m <- 2 * x + u + rnorm(n)
y <- m + rnorm(n)
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
```
#### Z Affects Mediator, U Affects Both X and Z
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
u -> z
z -> m
x -> m
u -> x
m -> y
}")
latents(model) <- "u"
coordinates(model) <-
list(x = c(
x = 1,
z = 3,
u = 2,
m = 4,
y = 5
),
y = c(
x = 1,
z = 2,
u = 3,
m = 1,
y = 1
))
ggdag(model) + theme_dag()
```
```{r}
n <- 1e4
u <- rnorm(n)
z <- u + rnorm(n)
x <- u + rnorm(n)
m <- 2 * x + z + rnorm(n)
y <- m + rnorm(n)
jtools::export_summs(lm(y ~ x), lm(y ~ x + z))
```
#### Summary of Mediation Correction
```{r}
# Model 4
model4 <- dagitty("dag{
x -> y
z -> x
x -> m
z -> m
m -> y
}")
coordinates(model4) <-
list(x = c(
x = 1,
z = 2,
m = 3,
y = 4
),
y = c(
x = 1,
z = 2,
m = 1,
y = 1
))
# Model 5
model5 <- dagitty("dag{
x -> y
u -> z
z -> x
x -> m
u -> m
m -> y
}")
latents(model5) <- "u"
coordinates(model5) <-
list(x = c(
x = 1,
z = 2,
u = 3,
m = 4,
y = 5
),
y = c(
x = 1,
z = 2,
u = 3,
m = 1,
y = 1
))
# Model 6
model6 <- dagitty("dag{
x -> y
u -> z
z -> m
x -> m
u -> x
m -> y
}")
latents(model6) <- "u"
coordinates(model6) <-
list(x = c(
x = 1,
z = 3,
u = 2,
m = 4,
y = 5
),
y = c(
x = 1,
z = 2,
u = 3,
m = 1,
y = 1
))
# par() has no effect on ggplot objects; patchwork arranges them instead
library(patchwork)
(ggdag(model4) + theme_dag()) |
  (ggdag(model5) + theme_dag()) |
  (ggdag(model6) + theme_dag())
```
While $Z$ may be statistically significant, this **does not imply a causal effect** unless $Z$ is directly on the causal path from $X$ to $Y$. In many valid control scenarios, $Z$ simply serves to isolate the causal effect of $X$, not to be interpreted as a cause itself.
------------------------------------------------------------------------
## Neutral Controls
Not all covariates used in regression adjustment are necessary for identification. **Neutral controls** do not help with causal identification but may affect estimation **precision**. Including them:
- **Does not introduce bias**, because they do not lie on back-door or collider paths.
- **May reduce standard errors**, by explaining additional variation in the outcome.
------------------------------------------------------------------------
### Good Predictive Controls
When a variable is correlated with the outcome $Y$, but not a cause of the treatment $X$, controlling for it is optional for identification but may increase precision.
Good predictive controls matter because they are often the easiest free lunch in an applied analysis: they tighten standard errors without altering the point estimate or threatening identification. Recognizing them in practice means asking whether a candidate covariate has any plausible *causal* link back to the treatment. If the answer is no, and the variable independently moves the outcome, it acts like a baseline covariate in a randomized trial, soaking up residual variance in $Y$ and shrinking the standard error on the treatment coefficient. The same logic underwrites the use of pre-treatment outcome levels in [Difference-in-Differences](#sec-difference-in-differences) and lagged covariates in [Event Studies](#sec-event-studies), where strong predictors of the outcome trajectory are routinely included for precision rather than identification.
#### $Z$ predicts $Y$, not $X$
```{r}
# Clean workspace
rm(list = ls())
model <- dagitty("dag{
x -> y
z -> y
}")
coordinates(model) <- list(
x = c(x = 1, z = 2, y = 2),
y = c(x = 1, z = 2, y = 1)
)
ggdag(model) + theme_dag()
```
```{r}
n <- 1e4
z <- rnorm(n)
x <- rnorm(n)
y <- x + 2 * z + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Without Z", "With Predictive Z")
)
```
The coefficient on $X$ remains unbiased in both models, but standard errors are smaller in the model with $Z$.
#### $Z$ predicts a mediator $M$
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
x -> m
z -> m
m -> y
}")
coordinates(model) <- list(
x = c(x = 1, z = 2, m = 2, y = 3),
y = c(x = 1, z = 2, m = 1, y = 1)
)
ggdag(model) + theme_dag()
```
```{r}
n <- 1e4
z <- rnorm(n)
x <- rnorm(n)
m <- 2 * z + rnorm(n)
y <- x + 2 * m + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + z),
model.names = c("Without Z", "With Predictive Z")
)
```
Even though $Z$ is not on any causal path from $X$ to $Y$, controlling for it may reduce residual variance in $Y$, hence increasing precision.
### Good Selection Bias
In more complex selection structures, adjusting for selection variables can improve identification, but only in the presence of additional post-selection information.
#### $W$ is a collider; $Z$ helps condition on selection
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
x -> z
z -> w
u -> w
u -> y
}")
latents(model) <- "u"
coordinates(model) <- list(
x = c(x = 1, z = 2, w = 3, u = 3, y = 5),
y = c(x = 3, z = 2, w = 1, u = 4, y = 3)
)
ggdag(model) + theme_dag()
```
```{r}
n <- 1e4
x <- rnorm(n)
u <- rnorm(n)
z <- x + rnorm(n)
w <- z + u + rnorm(n)
y <- x - 2 * u + rnorm(n)
jtools::export_summs(
lm(y ~ x),
lm(y ~ x + w),
lm(y ~ x + z + w),
model.names = c("Unadjusted", "Control W", "Control W + Z")
)
```
- Unadjusted model is unbiased.
- Controlling only for $W$ is biased due to collider path $X \to Z \to W \leftarrow U \to Y$.
- Adding $Z$ restores identification by blocking that path.
### Bad Predictive Controls
Not all predictive variables are useful --- some may reduce precision by soaking up degrees of freedom or increasing multicollinearity.
Bad predictive controls matter because they masquerade as good ones: they look statistically related to the treatment, they often improve in-sample fit, and a reviewer can reasonably ask why they were left out. Recognizing the situation in practice means flipping the question and asking what the variable predicts. A covariate that strongly predicts the *treatment* but adds nothing to explaining the outcome (beyond what the treatment already captures) competes with the treatment for variance and inflates the standard error on the causal coefficient. This is the same intuition that motivates the relevance condition in [Instrumental Variables](#sec-instrumental-variables): a variable can be powerfully associated with $X$ without being a useful adjustment, and the closer such a variable lies to a perfect predictor of $X$, the more a regression that conditions on it begins to resemble a near-multicollinear specification.
#### $Z$ predicts $X$, not $Y$
```{r}
rm(list = ls())
model <- dagitty("dag{
x -> y
z -> x