# Causal Inference {#sec-causal-inference}
Throughout our journey into statistical concepts, we've uncovered patterns, relationships, and trends in data. But now, we arrive at one of the most profound questions in all of research and decision-making: **What truly causes what?**
We've all heard the phrase:
> *Correlation is not causation.*
Just because two things move together doesn't mean one is pulling the strings of the other. Ice cream sales and drowning incidents both rise in the summer, but ice cream isn't to blame.
But what exactly *is* causation? Let's explore.
One of the most insightful books on this topic is *The Book of Why* by Judea Pearl [@Pearl_2018], which explains the nuances of causal reasoning beautifully. We'll briefly summarize key ideas from Pearl's work, supplemented with insights from econometrics and statistics.
Understanding causal relationships is essential in research, particularly in fields like economics, finance, marketing, and medicine. While statistical methods have traditionally focused on associational reasoning, causal inference allows us to answer **what-if questions** and make decisions based on interventions.
However, one must be aware of the limitations of statistical methods. As discussed throughout this book, relying solely on data without incorporating domain expertise can lead to misleading conclusions. To establish causality, we often need expert judgment, prior research, and rigorous experimental design.
------------------------------------------------------------------------
You may have come across amusing examples of **spurious correlations**---such as the famous [Tyler Vigen collection](http://www.tylervigen.com/spurious-correlations), which shows absurd relationships (e.g., "the number of Nicolas Cage movies correlates with drowning accidents"). These highlight the danger of mistaking correlation for causation.
Historically, one of the earliest attempts to infer causation using **regression analysis** was by @yule1899, who investigated the effect of relief policies on poverty. Unfortunately, his analysis suggested that relief policies increased poverty---a misleading conclusion due to unaccounted confounders.
------------------------------------------------------------------------
For a long time, statistics was largely a **causality-free discipline**. The field only began addressing causation in the 1920s, when Sewall Wright introduced **path analysis**, a graphical approach to representing causal relationships [@wright1920relative]. However, it wasn't until Judea Pearl's **Causal Revolution** (1990s) that we gained a formal calculus for causation.
Pearl's framework introduced two key innovations:
1. **Causal Diagrams** (Directed Acyclic Graphs) -- A graphical representation of cause-and-effect relationships.
2. **A Symbolic Language**: The Do-Operator ($do(X)$) -- A mathematical notation for interventions.
------------------------------------------------------------------------
Traditional statistics deals with **conditional probabilities**:
$$
P(Y | X)
$$
This formula tells us the probability of event $Y$ occurring given that event $X$ has occurred. In the context of observed data, $P(Y \mid X)$ reflects the association between $X$ and $Y$, showing how likely $Y$ is when $X$ happens.
However, causal inference requires a different concept:
$$
P(Y | do(X))
$$
which describes what happens when we **actively intervene** and set $X$. The crucial distinction is:
$$
P(Y | X) \neq P(Y | do(X))
$$
in general, because **passively observing** $X$ is not the same as **actively manipulating** it.
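To make the distinction concrete, here is a minimal simulation sketch (a hypothetical data-generating process, not from Pearl's text) in which $Y$ depends only on a confounder $Z$, so $X$ has no causal effect on $Y$, yet observing $X$ still changes our belief about $Y$:

```{r}
# Hypothetical DGP: Z causes both X and Y; X has NO causal effect on Y.
set.seed(42)
n <- 1e5
z <- rbinom(n, 1, 0.5)                        # confounder
x <- rbinom(n, 1, ifelse(z == 1, 0.8, 0.2))   # Z raises the chance of X
y <- rbinom(n, 1, 0.2 + 0.3 * z)              # Y depends on Z only

# Seeing: P(Y = 1 | X = 1) is about 0.44, inflated by the shared cause Z
mean(y[x == 1])

# Doing: forcing X = 1 breaks the Z -> X arrow; Y is untouched, so
# P(Y = 1 | do(X = 1)) equals the marginal P(Y = 1), about 0.35
mean(y)
```

Observing $X = 1$ raises the apparent probability of $Y$ because the two share the cause $Z$; intervening on $X$ leaves $P(Y = 1)$ at its marginal value.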
To make causal claims, we need to answer **counterfactual questions**:
> *What would have happened if we had NOT done* $X$?
This concept is essential in fields like policy evaluation, medicine, and business decision-making.
------------------------------------------------------------------------
<!--# To build intelligent systems that can reason causally, we need an inference engine: -->
<!-- ![p. 12 [@Pearl_2018]](images/Figure%20I.png "Inference Engine"){style="display: block; margin: 1em auto" width="600" height="400"} -->
<!--# Flow chart illustrating an "inference engine" process. It begins with "Knowledge" as background, leading to "Assumptions" and then to "Causal model" under inputs. The "Inference Engine" evaluates if a query can be answered, directing to "Testable implications" or returning to assumptions if not. If yes, it proceeds to "Statistical estimation" using "Data," leading to "Estimand" and finally to "Estimate" as outputs. Arrows indicate the flow, with notes on potential additional connections. -->
@Pearl_2018 outlines **three levels of cognitive ability** required for causal learning:
1. **Seeing** -- Observing associations in data.
2. **Doing** -- Understanding interventions and predicting their outcomes.
3. **Imagining** -- Reasoning about counterfactuals.
This hierarchy is known as Pearl's *Ladder of Causation*, which describes three increasingly demanding levels of causal reasoning.
<!-- +---------------------+--------------+----------------------------------------------------------+---------------------------------------------------------------+ -->
<!-- | **Level** | **Activity** | **Questions Answered** | **Examples** | -->
<!-- +=====================+==============+==========================================================+===============================================================+ -->
<!-- | **Association** | *Seeing* | What is? How does seeing $X$ change my belief in $Y$? | What does a symptom tell me about a disease? | -->
<!-- +---------------------+--------------+----------------------------------------------------------+---------------------------------------------------------------+ -->
<!-- | **Intervention** | *Doing* | What if? What happens if I intervene and change $X$? | If I study more, will my test score improve? | -->
<!-- +---------------------+--------------+----------------------------------------------------------+---------------------------------------------------------------+ -->
<!-- | **Counterfactuals** | *Imagining* | Why? What would have happened if $X$ had been different? | If I had quit smoking a year ago, would I be healthier today? | -->
<!-- +---------------------+--------------+----------------------------------------------------------+---------------------------------------------------------------+ -->
<!-- *(Adapted from [@pearl2019seven], p. 57)* -->
Each level requires more cognitive ability and data. Classical statistics operates at Level 1 (association), while causal inference enables us to reach Levels 2 and 3.
------------------------------------------------------------------------
## The Formal Notation of Causality
A common mistake is to define causation using ordinary conditional probability:
$$
X \text{ causes } Y \text{ if } P(Y \mid X) \ne P(Y).
$$
This statement only says $X$ and $Y$ are *associated* (1st-rung). The association can arise for several reasons:
1. $X$ causes $Y$,
2. $Y$ causes $X$ (reverse causality),
3. a common cause $Z$ affects both $X$ and $Y$ (confounding), or
4. conditioning on a common effect of $X$ and $Y$ (selection / collider bias).
Adjusting for an observed set of variables $Z$ — that is, comparing $P(Y \mid X, Z = z)$ across values of $X$ — only resolves (3), and only if (a) we knew which $Z$ to choose, (b) all relevant confounders are observed, and (c) we condition only on non-descendants of $X$. In practice this is rarely guaranteed.
The correct causal statement separates *seeing* from *doing*. Pearl's $do$-operator denotes a hypothetical intervention that sets $X$ to a specific value, breaking any incoming arrows into $X$. We then say $X$ has a causal effect on $Y$ if
$$
P(Y \mid do(X = x)) \ne P(Y \mid do(X = x')) \quad \text{for some } x \neq x',
$$
which in general is *not* the same as $P(Y \mid X = x) \neq P(Y \mid X = x')$.
With **causal diagrams** and **do-calculus**, we can formally express interventions and answer questions at the 2nd level (Intervention).
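As an illustration, the `dagitty` R package (an assumption on our part; it is not used elsewhere in this chapter) lets us encode a DAG and query which covariate sets satisfy the backdoor criterion. A minimal sketch with a single confounder $Z$:

```{r}
# Encode a simple DAG and ask which covariates close the backdoor paths.
library(dagitty)

g <- dagitty("dag {
  X -> Y
  Z -> X
  Z -> Y
}")

# Covariate sets blocking all backdoor paths from X to Y; returns { Z }
adjustmentSets(g, exposure = "X", outcome = "Y")
```

Here `adjustmentSets()` confirms that conditioning on the confounder `Z` suffices to identify the effect of `X` on `Y` in this toy graph.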
------------------------------------------------------------------------
@pearl2019seven also introduces Pearl's *Structural Causal Model (SCM)* framework for causal inference:
1. **Encoding Causal Assumptions** -- Using **causal graphs** for transparency and testability.
2. **Do-Calculus** -- Controlling for confounding using the **backdoor criterion**.
3. **Algorithmization of Counterfactuals** -- Modeling "what if?" scenarios.
4. **Mediation Analysis** -- Understanding direct vs. indirect effects.
5. **External Validity & Adaptability** -- Addressing selection bias and domain adaptation.
6. **Handling Missing Data** -- Using causal methods to infer missing information.
7. **Causal Discovery** -- Learning causal relationships from data using:
- **d-separation**
- **Functional decomposition** [@hoyer2008nonlinear]
- **Spontaneous local changes** [@pearl2014graphical]
To explore causal inference in R, check out the [CRAN Task View for Causal Inference](https://cran.r-project.org/web/views/CausalInference.html).
For further reading:
- *The Book of Why* -- Judea Pearl [@Pearl_2018]
- *Causal Inference in Statistics: A Primer* -- Pearl, Glymour, Jewell
- *Causality: Models, Reasoning, and Inference* -- Judea Pearl
------------------------------------------------------------------------
## Simpson's Paradox
Simpson's Paradox is one of the most striking examples of **why causality matters** and why simple statistical associations can be misleading.
At its core, Simpson's Paradox occurs when:
> A trend observed in an overall population **reverses** when the population is divided into subgroups.
This means that **statistical associations in raw data can be misleading** if important confounding variables are ignored.
Understanding Simpson's Paradox is critical in causal inference because:
1. It highlights the danger of naive data analysis -- Just looking at overall trends can lead to incorrect conclusions.
2. It emphasizes the importance of confounding variables -- We must control for relevant factors before making causal claims.
3. It demonstrates why causal reasoning is necessary -- Simply relying on statistical associations ($P(Y | X)$) without considering structural relationships can lead to paradoxical results.
### Comparison between Simpson's Paradox and Omitted Variable Bias
Simpson's Paradox occurs when a trend in an overall dataset **reverses** when broken into subgroups. This happens due to **data aggregation issues**, where differences in subgroup sizes distort the overall trend.
While this often resembles **omitted variable bias (OVB)**---where missing confounders lead to misleading conclusions---Simpson's Paradox is not just a causal inference problem. It is a **mathematical phenomenon** that can arise purely from **improper weighting of data**, even in descriptive statistics.
#### Similarities Between Simpson's Paradox and OVB
1. Both involve a missing variable:
- In Simpson's Paradox, a key confounding variable (e.g., customer segment) is hidden in the aggregate data, leading to misleading conclusions.
- In OVB, a relevant variable (e.g., seasonality) is missing from the regression model, causing bias.
2. Both distort causal conclusions:
- OVB biases effect estimates by failing to control for confounding.
- Simpson's Paradox flips statistical relationships when controlling for a confounder.
#### Differences Between Simpson's Paradox and OVB
1. Not all OVB cases show Simpson's Paradox:
- OVB generally causes bias, but it doesn't always create a reversal of trends.
- Example: If seasonality increases both ad spend and sales, omitting it inflates the ad spend → sales relationship but does not necessarily reverse it.
2. Simpson's Paradox can occur even without causal inference:
- Simpson's Paradox is a mathematical/statistical phenomenon that can arise even in purely observational data, not just causal inference.
- It results from data weighting issues, even if causality is not considered.
3. OVB is a model specification issue; Simpson's Paradox is a data aggregation issue:
- OVB occurs in regression models when we fail to include relevant predictors.
- Simpson's Paradox arises from incorrect data aggregation when groups are not properly analyzed separately.
#### The Right Way to Think About It
- Simpson's Paradox is often *caused* by omitted variable bias, but they are not the same thing.
- OVB is a problem in causal inference models; Simpson's Paradox is a problem in raw data interpretation.
#### How to Fix These Issues?
- For OVB: Use causal diagrams, add control variables, and use regression adjustments.
- For Simpson's Paradox: Always analyze subgroup-level trends before making conclusions based on aggregate data.
- Bottom line: Simpson's Paradox is often *caused* by omitted variable bias, but it is not just OVB---it is a fundamental issue of misleading data aggregation.
------------------------------------------------------------------------
### Illustrating Simpson's Paradox: Marketing Campaign Success Rates
Let's explore this paradox using a practical business example.
#### Scenario: Marketing Campaign Performance
Imagine a company running two marketing campaigns, Campaign A and Campaign B, to attract new customers. We analyze which campaign has a higher conversion rate.
##### Step 1: Creating the Data
We will simulate conversion rates for two different customer segments:
1. **High-Value** customers (who typically convert at a higher rate)
2. **Low-Value** customers (who convert at a lower rate).
```{r}
# Load necessary libraries
library(dplyr)

# Create a dataset where:
# - B is better than A in each individual segment.
# - A turns out better when we look at the overall (aggregated) data.
marketing_data <- data.frame(
  Campaign    = c("A", "A", "B", "B"),
  Segment     = c("High-Value", "Low-Value", "High-Value", "Low-Value"),
  Visitors    = c(500, 2000, 300, 3000),  # total visitors in each segment
  Conversions = c(290, 170, 180, 270)     # successful conversions
)

# Compute segment-level conversion rate
marketing_data <- marketing_data %>%
  mutate(Conversion_Rate = Conversions / Visitors)

# Print the data
print(marketing_data)
```
###### Interpreting This Data
- **Campaign B** in the High-Value segment: $\frac{180}{300} = 60\%$
- **Campaign A** in the High-Value segment: $\frac{290}{500} = 58\%$
=\> B is better in the High-Value segment (60% vs 58%).
- **Campaign B** in the Low-Value segment: $\frac{270}{3000} = 9\%$
- **Campaign A** in the Low-Value segment: $\frac{170}{2000} = 8.5\%$
=\> B is better in the Low-Value segment (9% vs 8.5%).
Thus, **B** outperforms **A** in each **individual** segment.
##### Step 2: Aggregating Data (Ignoring Customer Segments)
Now, let's calculate the overall conversion rate for each campaign without considering customer segments.
```{r}
# Compute overall conversion rates for each campaign
overall_rates <- marketing_data %>%
  group_by(Campaign) %>%
  summarise(
    Total_Visitors = sum(Visitors),
    Total_Conversions = sum(Conversions),
    Overall_Conversion_Rate = Total_Conversions / Total_Visitors
  )

# Print overall conversion rates
print(overall_rates)
```
##### Step 3: Observing Simpson's Paradox
Let's determine which campaign appears to have a higher conversion rate.
```{r}
# Identify the campaign with the higher overall conversion rate
best_campaign_overall <- overall_rates %>%
  filter(Overall_Conversion_Rate == max(Overall_Conversion_Rate)) %>%
  select(Campaign, Overall_Conversion_Rate)

print(best_campaign_overall)
```
Even though **Campaign B** is **better** in each **segment**, you should see that **Campaign A** has a **higher** aggregated (overall) conversion rate!
##### Step 4: Conversion Rates Within Customer Segments
We now analyze the conversion rates separately for high-value and low-value customers.
```{r}
# Compute conversion rates by customer segment
by_segment <- marketing_data %>%
  select(Campaign, Segment, Conversion_Rate) %>%
  arrange(Segment)

print(by_segment)
```
- In **High-Value**, B \> A.
- In **Low-Value**, B \> A.
Yet, overall, A \> B.
This **reversal** is the hallmark of **Simpson's Paradox**.
##### Step 5: Visualizing the Paradox
Fig. \@ref(fig:simpson-paradox) makes this clearer by visualizing the results.
```{r simpson-paradox, fig.cap='Simpson’s paradox in marketing: campaign-level conversion rates appear inconsistent with segment-level rates.', fig.alt='Bar chart comparing conversion rates for Campaigns A and B across High-Value and Low-Value customer segments.'}
library(ggplot2)

# Plot conversion rates by campaign and segment
ggplot(marketing_data,
       aes(x = Segment,
           y = Conversion_Rate,
           fill = Campaign)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Simpson’s Paradox in Marketing",
    x = "Customer Segment",
    y = "Conversion Rate"
  ) +
  theme_minimal()
```
This bar chart reveals that for **both** segments, B's bar is taller (i.e., B's conversion rate is higher). If you only examined segment-level data, you would conclude that B is the superior campaign.
However, if you aggregate the data (ignore segments), you get the opposite conclusion --- that **A** is better overall.
### Why Does This Happen?
This paradox arises because of a **confounding variable** --- in this case, the **distribution of visitors** across segments.
- **Campaign A** has **more** of its traffic in the High-Value segment (where conversions are generally high).
- **Campaign B** has **many** of its visitors in the Low-Value segment.
Because the **volume** of Low-Value visitors in B is extremely large (3000 vs. 2000 for A), it weighs down B's **overall** average more, allowing A's overall rate to exceed B's.
### How Does Causal Inference Solve This?
To avoid Simpson's Paradox, we need to move beyond association and use causal analysis:
1. **Use causal diagrams (DAGs) to model relationships**
- The marketing campaign choice is confounded by customer segment.
- We must control for the confounding variable.
2. **Use stratification or regression adjustment**
- Instead of comparing raw conversion rates, we should compare rates within each customer segment.
- This ensures that confounding factors do not distort results.
3. **Use the do-operator** to simulate interventions (sketched after this list)
- Instead of asking $P(\text{Conversion} \mid \text{Campaign})$, ask: $P(\text{Conversion} \mid do(\text{Campaign}))$
- This estimates what would happen if we randomly assigned campaigns (removing confounding bias).
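For this example, $P(\text{Conversion} \mid do(\text{Campaign}))$ can be estimated by standardization (the g-formula), assuming Segment is the only confounder: average the segment-specific conversion rates over the pooled segment distribution, so each campaign faces the same customer mix. A minimal sketch reusing `marketing_data` from above:

```{r}
# Standardization (g-formula): weight segment-specific rates by the
# marginal segment distribution, shared across campaigns.
library(dplyr)

segment_totals <- marketing_data %>%
  group_by(Segment) %>%
  summarise(Segment_Visitors = sum(Visitors), .groups = "drop") %>%
  mutate(Segment_Share = Segment_Visitors / sum(Segment_Visitors))

standardized <- marketing_data %>%
  left_join(segment_totals, by = "Segment") %>%
  group_by(Campaign) %>%
  summarise(Do_Rate = sum(Conversion_Rate * Segment_Share), .groups = "drop")

print(standardized)
```

The standardized rate for B (about 16.0%) now exceeds A's (about 15.3%), agreeing with the segment-level comparison rather than the misleading aggregate.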
### Correcting Simpson's Paradox with Regression Adjustment
Let's adjust for the confounding variable using **logistic regression**.
```{r}
# Logistic regression adjusting for the Segment
model <- glm(
  cbind(Conversions, Visitors - Conversions) ~ Campaign + Segment,
  family = binomial(),
  data = marketing_data
)

summary(model)
```
This model includes both **Campaign** and **Segment** as predictors, giving a clearer picture of the *true effect* of each campaign on conversion, **after** controlling for differences in segment composition.
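As a quick follow-up (assuming R's default treatment coding, with Campaign A as the reference level), the adjusted effect can be read off the fitted model:

```{r}
# Adjusted odds ratio for Campaign B vs. A, holding Segment fixed;
# a value above 1 is consistent with B's within-segment advantage.
exp(coef(model)["CampaignB"])
```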
### Key Takeaways
1. **Simpson's Paradox demonstrates why causal inference is essential.**
- Aggregated statistics can be misleading due to hidden confounding.
- Breaking data into subgroups can reverse conclusions.
2. **Causal reasoning helps identify and correct paradoxes.**
- Using causal graphs, do-calculus, and adjustment techniques, we can find the true causal effect.
3. **Naïve data analysis can lead to bad business decisions.**
- If a company allocated more budget to Campaign B based on overall conversion rates, it might be investing in the wrong strategy!
------------------------------------------------------------------------
## Experimental vs. Quasi-Experimental Designs
Determining whether a particular intervention or treatment causes an observed outcome requires more than observing associations---it demands a framework for **causal inference**.
To address this, researchers rely on two broad classes of research designs: **experimental** and **quasi-experimental**. Both aim to estimate causal effects, but they differ significantly in their level of control over the assignment mechanism and the assumptions required for valid inference.
- [Experimental designs](#sec-experimental-design), particularly [randomized controlled trials (RCTs)](#sec-the-gold-standard-randomized-controlled-trials), are considered the gold standard for causal inference. By randomly assigning units (e.g., customers, users, regions) to treatment or control groups, these designs eliminate confounding and allow for straightforward interpretation of treatment effects. However, in many business settings, true randomization is costly, impractical, or ethically constrained.
- [Quasi-experimental](#sec-quasi-experimental) designs provide an alternative when random assignment is not feasible. These designs rely on observational data and statistical techniques to approximate the conditions of an experiment. While more flexible in application, they typically require stronger assumptions and careful methodological implementation to yield credible causal insights.
Table \@ref(tab:exp-vs-quasi-exp) summarizes the key differences between these two approaches, highlighting their respective strengths, limitations, and use cases in applied research.
Table: (\#tab:exp-vs-quasi-exp) Key differences between experimental and quasi-experimental designs.
+------------------------------+-----------------------------------------------+------------------------------------------------+
| Feature | Experimental Design | Quasi-Experimental Design |
+==============================+===============================================+================================================+
| Assignment to Treatment | Randomized | Non-randomized (observational) |
+------------------------------+-----------------------------------------------+------------------------------------------------+
| Control over Confounding | High | Limited (requires statistical control) |
+------------------------------+-----------------------------------------------+------------------------------------------------+
| Causal Inference Validity | Strong (if properly implemented) | Weaker (depends on assumptions) |
+------------------------------+-----------------------------------------------+------------------------------------------------+
| Feasibility in Field Studies | Often difficult or costly | More flexible and practical |
+------------------------------+-----------------------------------------------+------------------------------------------------+
| Examples | A/B testing, clinical trials | Difference-in-differences, matching |
+------------------------------+-----------------------------------------------+------------------------------------------------+
| Principal Investigator | Conducted by an experimentalist | Conducted by an observationalist |
+------------------------------+-----------------------------------------------+------------------------------------------------+
| Type of Data | Uses experimental data | Uses observational data |
+------------------------------+-----------------------------------------------+------------------------------------------------+
| Randomness helps | Random assignment reduces treatment imbalance | Random sampling reduces sample selection error |
+------------------------------+-----------------------------------------------+------------------------------------------------+
### Criticisms of Quasi-Experimental Designs
[Quasi-experimental methods](#sec-quasi-experimental) do not always approximate experimental results accurately. For instance, @lalonde1986evaluating demonstrates that commonly used methods such as:
- [Matching Methods]
- [Difference-in-differences]
- [Tobit-2] (Heckman-type models)
often fail to replicate experimental estimates reliably. This finding cast serious doubt on the credibility of observational studies for estimating causal effects, igniting an ongoing debate in econometrics and statistics about the reliability of nonexperimental evaluations.
LaLonde's critical assessment served as a catalyst for significant methodological and practical advancements in causal inference. In the decades since its publication, the field has evolved considerably, introducing both theoretical innovations and empirical practices aimed at addressing the limitations that were exposed [@imbens2024lalonde]. Among these advances are:
- **Emphasis on estimators based on unconfoundedness (selection on observables):** Modern causal inference frameworks frequently adopt the *unconfoundedness* or *conditional independence* assumption. Under this premise, treatment assignment is assumed to be independent of potential outcomes, conditional on observed covariates. This theoretical foundation underpins many widely used estimation techniques, such as matching methods, inverse probability weighting, and regression adjustment.
- **Focus on covariate overlap (common support):** Researchers now recognize the critical importance of *overlap*, also referred to as *common support*, in the distributions of covariates across treatment and control groups. Without sufficient overlap, comparisons between treated and untreated units rely on extrapolation, which weakens causal claims. Modern methods explicitly assess and often impose restrictions to ensure overlap before proceeding with estimation.
- **Introduction of propensity score-based methods and doubly robust estimators:** The introduction of *propensity score* methods [@rosenbaum1983central] was a breakthrough, offering a way to reduce the dimensionality of the covariate space while balancing observed characteristics across groups. More recently, *doubly robust* estimators have emerged, combining propensity score weighting with outcome regression. These estimators provide consistent treatment effect estimates if either the propensity score model or the outcome model is correctly specified, offering greater robustness in practice.
- **Greater emphasis on validation exercises to bolster credibility:** Modern studies increasingly incorporate validation techniques to evaluate the credibility of their findings. *Placebo tests*, *falsification exercises*, and *sensitivity analyses* are commonly employed to assess whether estimated effects may be driven by unobserved confounding or model misspecification. Such practices go beyond traditional goodness-of-fit statistics, directly interrogating the assumptions underlying causal inference.
- **Methods for estimating and exploiting treatment effect heterogeneity:** Beyond estimating average treatment effects, contemporary research frequently explores *heterogeneous treatment effects*. These methods identify subgroups that may experience different causal impacts, which is of particular relevance in fields like personalized marketing, targeted interventions, and policy design.
To illustrate the practical lessons from these methodological advances, @imbens2024lalonde reexamine two canonical datasets:
1. **LaLonde's National Supported Work Demonstration data**
2. **The Imbens-Rubin-Sacerdote draft lottery data**
Applying modern causal inference methods to these datasets demonstrates that, when sufficient covariate overlap exists, robust estimates of the adjusted differences between treatment and control groups can be achieved. However, it is critical to underscore that robustness in estimation does not equate to validity. Without direct validation exercises, such as placebo tests, even well-behaved estimates may be misleading.
@imbens2024lalonde also highlight several key lessons for practitioners working with nonexperimental data to estimate causal effects:
- **Careful examination of the assignment process is essential.**\
Understanding the mechanisms by which units are assigned to treatment or control conditions informs the plausibility of the unconfoundedness assumption.
- **Inspection of covariate overlap is non-negotiable.**\
Without sufficient overlap, causal effect estimation may rely heavily on model extrapolation, undermining credibility.
- **Validation exercises are indispensable.**\
Placebo tests and falsification strategies help ensure that estimated treatment effects are not artifacts of modeling choices or unobserved confounding.
While methodological advances have substantially improved the tools available for causal inference with observational data, their effective application requires rigorous attention to the underlying assumptions and diligent validation to support credible causal claims.
------------------------------------------------------------------------
## Hierarchical Ordering of Causal Tools
Causal inference tools can be categorized based on their methodological rigor, with [randomized controlled trials](#sec-the-gold-standard-randomized-controlled-trials) (RCTs) considered the gold standard.
1. [Experimental Design](#sec-experimental-design): [Randomized Controlled Trials (Gold standard)](#sec-the-gold-standard-randomized-controlled-trials)
2. [Quasi-experimental](#sec-quasi-experimental)
1. [Regression Discontinuity]
2. [Synthetic Difference-in-Differences]
3. [Difference-In-Differences]
4. [Synthetic Control]
5. [Event Studies]
6. [Fixed Effects Estimator](#sec-fixed-effects-estimator)
7. [Endogenous Treatment]: mostly [Instrumental Variables]
8. [Matching Methods]
9. [Interrupted Time Series]
10. [Endogenous Sample Selection]
------------------------------------------------------------------------
## Types of Validity in Research
Validity in research includes:
1. [Measurement Validity](#sec-measurement-validity) (e.g., construct, content, criterion, face validity)
2. [Internal Validity](#sec-internal-validity)
3. [External Validity](#sec-external-validity)
4. [Ecological Validity](#sec-ecological-validity)
5. [Statistical Conclusion Validity](#sec-statistical-conclusion-validity)
By examining these, you can ensure that your study's measurements are accurate, your findings are reliably causal, and your conclusions generalize to broader contexts.
------------------------------------------------------------------------
### Measurement Validity {#sec-measurement-validity}
**Measurement validity** pertains to whether the instrument or method you use truly measures what it's intended to measure. Within this umbrella, there are several sub-types:
#### Face Validity
- **Definition**: The extent to which a measurement or test *appears* to measure what it is supposed to measure, at face value (i.e., does it "look" right to experts or users?).
- **Importance**: While often considered a less rigorous form of validity, it's useful for ensuring the test or instrument is intuitively acceptable to stakeholders, participants, or experts in the field.
- **Example**: A questionnaire measuring "anxiety" that has questions about nervousness, worries, and stress has good face validity because it obviously seems to address anxiety.
#### Content Validity
- **Definition**: The extent to which a test or measurement covers *all* relevant facets of the construct it aims to measure.
- **Importance**: Especially critical in fields like education or psychological testing, where you want to ensure the entire domain of a subject/construct is properly sampled.
- **Example**: A math test that includes questions on algebra, geometry, and calculus might have high content validity for a comprehensive math skill assessment. If it only tested algebra, the content validity would be low.
#### Construct Validity {#sec-construct-validity}
- **Definition**: The degree to which a test or measurement tool accurately represents the theoretical construct it intends to measure (e.g., intelligence, motivation, self-esteem).
- **Types of Evidence**:
- **Convergent Validity**: Demonstrated when measures that are supposed to be related (theoretically) are observed to correlate.
- **Discriminant (Divergent) Validity**: Demonstrated when measures that are supposed to be unrelated theoretically do not correlate.
- **Example**: A new questionnaire on "job satisfaction" should correlate with other established job satisfaction questionnaires (convergent validity) but should not correlate strongly with unrelated constructs like "physical health" (discriminant validity).
#### Criterion Validity {#sec-criterion-validity}
- **Definition**: The extent to which the measurement predicts or correlates with an outcome criterion. In other words, do scores on the measure relate to an external standard or "criterion"?
- **Types**:
- **Predictive Validity**: The measure predicts a future outcome (e.g., an entrance exam predicting college success).
- **Concurrent Validity**: The measure correlates with an existing, accepted measure taken at the same time (e.g., a new depression scale compared with a gold-standard clinical interview).
- **Example**: A new test of driving skills has high criterion validity if people who score highly perform better on actual road tests (predictive validity).
### Internal Validity {#sec-internal-validity}
**Internal validity** refers to the extent to which a study can establish a *cause-and-effect* relationship. High internal validity means you can be confident that the observed effects are due to the treatment or intervention itself and not due to confounding factors or alternative explanations. This is the validity that economists and applied scientists largely care about.
#### Major Threats to Internal Validity
1. **Selection Bias**: Systematic differences between groups that exist before the treatment is applied.
2. **History Effects**: External events occurring during the study can affect outcomes (e.g., economic downturn during a job-training study).
3. **Maturation**: Participants might change over time simply due to aging, learning, fatigue, etc., independent of the treatment.
4. **Testing Effects**: Taking a test more than once can influence participants' responses (practice effect).
5. **Instrumentation**: Changes in the measurement instrument or the observers can lead to inconsistencies in data collection.
6. **Regression to the Mean**: Extreme pre-test scores tend to move closer to the average on subsequent tests.
7. **Attrition (Mortality)**: Participants dropping out of the study in ways that are systematically related to the treatment or outcomes.
#### Strategies to Improve Internal Validity
- **Random Assignment**: Ensures that, on average, groups are equivalent on both known and unknown variables.
- **Control Groups**: Provide a baseline for comparison to isolate the effect of the intervention.
- **Blinding (Single-, Double-, or Triple-blind)**: Reduces biases from participants, researchers, or analysts.
- **Standardized Procedures and Protocols**: Minimizes variability in how interventions or measurements are administered.
- **Matching or Stratification**: When randomization is not possible, matching participants on key characteristics can reduce selection bias.
- **Pretest-Posttest Designs**: Compare participant performance before and after the intervention (though watch for testing effects).
### External Validity {#sec-external-validity}
**External validity** addresses the generalizability of the findings beyond the specific context of the study. A study with high external validity can be applied to other populations, settings, or times. Conversely, findings from a highly localized sample or setting may not travel, which limits external validity.
#### Subtypes (or Related Concepts) of External Validity
1. **Population Validity**: The degree to which study findings can be generalized to the larger population from which the sample was drawn.
2. **Ecological Validity** (sometimes considered separately): Whether findings obtained in controlled conditions can be applied to real-world settings.
3. **Temporal Validity**: Whether the results of the study hold true over time. Changing societal norms, technologies, or economic conditions might render findings obsolete.
#### Threats to External Validity
- **Unrepresentative Samples**: If the sample does not reflect the wider population (in demographics, culture, etc.), generalization is limited.
- **Artificial Research Environments**: Highly controlled lab settings may not capture real-world complexities.
- **Treatment-Setting Interaction**: The effect of the treatment might depend on the unique conditions of the setting (e.g., a particular school, hospital, or region).
- **Treatment-Selection Interaction**: Certain characteristics of the selected participants might interact with the treatment (e.g., results from a specialized population do not apply to the general public).
#### Strategies to Improve External Validity
- **Use of Diverse and Representative Samples**: Recruit participants that mirror the larger population.
- **Field Studies and Naturalistic Settings**: Conduct experiments in real-world environments rather than artificial labs.
- **Replication in Multiple Contexts**: Replicate the study across different settings, geographic locations, and populations.
- **Longitudinal Studies**: Evaluate whether relationships hold over extended periods.
### Ecological Validity {#sec-ecological-validity}
**Ecological validity** is often discussed as a subcategory of external validity. It specifically focuses on the *realism* of the study environment and tasks:
- **Definition**: The degree to which study findings can be generalized to the real-life settings where people actually live, work, and interact.
- **Key Idea**: Even if a lab experiment shows a particular behavior, do people behave the same way in their daily lives with everyday distractions, social pressures, and contextual factors?
#### Enhancing Ecological Validity
- **Naturalistic Observation**: Conduct observations or experiments in participants' usual environments.
- **Realistic Tasks**: Use tasks that closely mimic real-world challenges or behaviors.
- **Minimal Interference**: Researchers strive to reduce the artificiality of the setting, allowing participants to behave as naturally as possible.
### Statistical Conclusion Validity {#sec-statistical-conclusion-validity}
Though often discussed alongside internal validity, **statistical conclusion validity** focuses on whether the statistical tests used in a study are appropriate, powerful enough, and applied correctly.
#### Threats to Statistical Conclusion Validity
1. **Low Statistical Power**: If the sample size is too small, the study may fail to detect a real effect (Type II error).
2. **Violations of Statistical Assumptions**: Incorrect application of statistical tests can lead to spurious conclusions (e.g., using parametric tests with non-normal data without appropriate adjustments).
3. **Fishing and Error Rate Problem**: Running many statistical tests without correction increases the chance of a Type I error (finding a false positive).
4. **Reliability of Measures**: If the measurement instruments are unreliable, statistical correlations or differences may be undervalued or overstated.
#### Improving Statistical Conclusion Validity
- **Adequate Sample Size**: Conduct a power analysis to determine the necessary size to detect meaningful effects (see the sketch after this list).
- **Appropriate Statistical Techniques**: Ensure your chosen analysis matches the nature of the data and research question.
- **Multiple Testing Corrections**: Use methods like Bonferroni or false discovery rate corrections when conducting multiple comparisons.
- **High-Quality Measurements**: Use reliable and valid measures to reduce measurement error.
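For the first point, a minimal sketch with base R's `power.t.test()`, using illustrative numbers rather than values from any particular study:

```{r}
# Sample size per group needed to detect a hypothetical standardized
# effect of 0.3 with 80% power at the 5% level (two-sample t-test).
power.t.test(delta = 0.3, sd = 1, sig.level = 0.05, power = 0.80)
```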
### Putting It All Together
1. **Face Validity**: Does it look like it measures what it should?
2. **Content Validity**: Does it cover all facets of the construct?
3. **Construct Validity**: Does it truly reflect the theoretical concept?
4. **Criterion Validity**: Does it correlate with or predict other relevant outcomes?
5. **Internal Validity**: Is the relationship between treatment and outcome truly causal?
6. **External Validity**: Can findings be generalized to other populations, settings, and times?
7. **Ecological Validity**: Are the findings applicable to real-world scenarios?
8. **Statistical Conclusion Validity**: Are the statistical inferences correct and robust?
Researchers typically need to strike a balance among these different validities:
- A **highly controlled lab study** might excel in internal validity but might have limited external and ecological validity.
- A **broad, naturalistic field study** might have stronger external or ecological validity but weaker internal validity due to less control over confounding variables.
**No single study** can maximize all validity types simultaneously, so replication, triangulation (using multiple methods), and transparent reporting are crucial strategies to bolster overall credibility.
------------------------------------------------------------------------
## Types of Subjects in a Treatment Setting
When conducting causal inference, particularly in randomized experiments or quasi-experimental settings, individuals in the study can be classified into four distinct groups based on their response to treatment assignment. These groups differ in how they react when they are assigned to receive or not receive treatment.
### Non-Switchers
Non-switchers are individuals whose treatment status does not change regardless of whether they are assigned to the treatment or control group. These individuals do not provide useful causal information because their behavior remains unchanged. They are further divided into:
- **Always-Takers**: These individuals will **always receive** the treatment, even if they are assigned to the control group.
- **Never-Takers**: These individuals will **never receive** the treatment, even if they are assigned to the treatment group.
Since their behavior is independent of the assignment, always-takers and never-takers do not contribute to identifying causal effects in standard randomized experiments. Instead, their presence can introduce bias in treatment effect estimation, particularly in **intention-to-treat analysis**.
### Switchers
Switchers are individuals whose treatment status **depends on the assignment**. These individuals are the primary focus of causal inference because they provide meaningful information about the effect of treatment. They are classified into:
- **Compliers**: Individuals who follow the assigned treatment protocol.
- If assigned to the treatment group, they **accept and receive** the treatment.
- If assigned to the control group, they **do not receive** the treatment.
- **Why are compliers important?**
- They are the only group for whom treatment assignment affects actual treatment receipt.
- Causal effect estimates (such as the local average treatment effect, LATE) are typically identified using compliers.
- If the dataset only contains compliers, then the intention-to-treat effect (ITT) is equal to the treatment effect.
- **Defiers**: Individuals who do the **opposite** of what they are assigned.
- If assigned to the treatment group, they **refuse the treatment**.
- If assigned to the control group, they **seek out and receive** the treatment anyway.
- **Why are defiers typically ignored?**
- In most studies, defiers are assumed to be a small or negligible group.
- Standard causal inference frameworks assume **monotonicity**, meaning no one behaves as a defier.
- If defiers exist in large numbers, estimating causal effects becomes significantly more complex.
### Classification of Individuals Based on Treatment Assignment
The following table summarizes how different types of individuals respond to treatment and control assignments:
| | Treatment Assignment | Control Assignment |
|-------------------|----------------------|--------------------|
| **Compliers** | Treated | Not Treated |
| **Always-Takers** | Treated | Treated |
| **Never-Takers** | Not Treated | Not Treated |
| **Defiers** | Not Treated | Treated |
: Classification of Individuals Based on Treatment Assignment
#### Key Takeaways
1. **Compliers** are the only group that allows us to estimate causal effects using **randomized or quasi-experimental designs**.
2. **Always-Takers and Never-Takers** do not provide meaningful variation in treatment status, making them less useful for causal inference.
3. **Defiers** typically violate the assumption of monotonicity, and their presence complicates causal estimation.
4. If a dataset consists **only of compliers**, the **intention-to-treat effect** will be equal to the **treatment effect**.
By correctly identifying and accounting for these different subject types, researchers can ensure more accurate causal inference and minimize biases in estimating treatment effects.
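A minimal simulation sketch (hypothetical shares and effect sizes, chosen only for illustration) of how compliers drive identification: the Wald estimator divides the ITT effect on the outcome by the ITT effect on treatment receipt, recovering the local average treatment effect (LATE) for compliers.

```{r}
# Hypothetical population: 60% compliers, 20% always-takers,
# 20% never-takers; the treatment effect is +2 for compliers only.
set.seed(123)
n <- 1e5
type <- sample(c("complier", "always", "never"), size = n,
               replace = TRUE, prob = c(0.6, 0.2, 0.2))
z <- rbinom(n, 1, 0.5)                       # random assignment
d <- ifelse(type == "always", 1,
            ifelse(type == "never", 0, z))   # actual treatment receipt
y <- 1 + 2 * d * (type == "complier") + rnorm(n)

itt_y <- mean(y[z == 1]) - mean(y[z == 0])   # ITT effect on the outcome
itt_d <- mean(d[z == 1]) - mean(d[z == 0])   # ITT effect on receipt (~0.6)
itt_y / itt_d                                # Wald estimate of LATE, ~2
```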
------------------------------------------------------------------------
## Types of Treatment Effects
When evaluating the causal impact of an intervention, different estimands (quantities of interest) can be used to measure treatment effects, depending on the study design and assumptions about compliance.
### Terminology
- **Estimands**: The causal effect parameters we seek to measure.
- **Estimators**: The statistical procedures used to estimate those parameters.
- **Sources of Bias** [@keele2025so]:
$$
\begin{aligned}
&\text{Estimator - True Causal Effect} \\
&= \underbrace{\textbf{Hidden bias}}_{\text{Due to design}}
+ \underbrace{\textbf{Misspecification bias}}_{\text{Due to modeling}}
+\underbrace{\textbf{Statistical noise}}_{\text{Due to finite sample}}
\end{aligned}
$$
1. **Hidden Bias (Due to Design)**
- Arises from **unobserved confounders** and **measurement error** that remain after conditioning on observed covariates.
- Is "hidden" because its true magnitude or direction cannot be directly observed.
- Violations of **conditional exchangeability** (also called no unobserved confounding) imply the presence of hidden bias.
2. **Misspecification Bias (Due to Modeling)**
- Occurs when the assumed model for the outcome or treatment assignment does not reflect the true data-generating process.
- Persists even if we have perfect exchangeability (i.e., no hidden bias).
- Can be viewed as **under-specification** (omitting essential terms or functional forms) or **over-specification** (including unnecessary parameters).
3. **Statistical Noise (Due to Finite Sample)**
- Even with perfect design and correct model specification, finite samples lead to randomness in estimates.
- Standard errors, confidence intervals, and p-values reflect this uncertainty.
In practice, all three sources of bias and uncertainty can coexist to varying degrees.
------------------------------------------------------------------------
### Average Treatment Effect {#sec-average-treatment-effect}
The [Average Treatment Effect](#sec-average-treatment-effect) (ATE) is the expected difference in outcomes between individuals who receive treatment and those who do not.
#### Definition
Let:
- $Y_i(1)$ be the outcome of individual $i$ under treatment.
- $Y_i(0)$ be the outcome of individual $i$ under control.
The **individual treatment effect** is:
$$
\tau_i = Y_i(1) - Y_i(0)
$$
Since we cannot observe both $Y_i(1)$ and $Y_i(0)$ for the same individual (the fundamental problem of causal inference), we estimate the ATE across a population:
$$
ATE = E[Y(1)] - E[Y(0)]
$$
#### Identification Under Randomization
If treatment assignment is randomized (under [Experimental Design]), then the observed difference in means between treatment and control groups provides an unbiased estimator of ATE:
$$
ATE = \frac{1}{N} \sum_{i=1}^{N} \tau_i = \frac{\sum_{i=1}^{N} Y_i(1)}{N} - \frac{\sum_{i=1}^{N} Y_i(0)}{N}
$$
With **randomization**, we assume:
$$
E[Y(1) | D = 1] = E[Y(1) | D = 0] = E[Y(1)]
$$
$$
E[Y(0) | D = 1] = E[Y(0) | D = 0] = E[Y(0)]
$$
Thus, the difference in observed means between treated and control groups provides an unbiased estimate of ATE.
$$
ATE = E[Y \mid D = 1] - E[Y \mid D = 0] = E[Y(1)] - E[Y(0)]
$$
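A minimal simulation sketch (hypothetical potential outcomes with a constant true ATE of 1.5) showing that under randomization the difference in observed means recovers the ATE:

```{r}
# Hypothetical potential outcomes; true ATE = 1.5 by construction.
set.seed(1)
n <- 1e5
y0 <- rnorm(n, mean = 10)          # outcome under control
y1 <- y0 + 1.5                     # outcome under treatment
d  <- rbinom(n, 1, 0.5)            # randomized assignment
y  <- ifelse(d == 1, y1, y0)       # only one potential outcome is observed

mean(y[d == 1]) - mean(y[d == 0])  # difference in means, approx. 1.5
```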
------------------------------------------------------------------------
Alternatively, we can express the potential outcomes framework in a regression form, which allows us to connect causal inference concepts with standard regression analysis.
Instead of writing treatment effects as potential outcomes, we can define the observed outcome $Y_i$ in terms of a regression equation:
$$
Y_i = Y_i(0) + [Y_i (1) - Y_i(0)] D_i
$$
where:
- $Y_i(0)$ is the outcome if individual $i$ does not receive treatment.
- $Y_i(1)$ is the outcome if individual $i$ does receive treatment.
- $D_i$ is a binary indicator for treatment assignment:
- $D_i = 1$ if individual $i$ receives treatment.
- $D_i = 0$ if individual $i$ is in the control group.
We can redefine this equation using regression notation:
$$
Y_i = \beta_{0i} + \beta_{1i} D_i
$$
where:
- $\beta_{0i} = Y_i(0)$ represents the baseline (control group) outcome.
- $\beta_{1i} = Y_i(1) - Y_i(0)$ represents the individual treatment effect.
Thus, in an ideal setting, the coefficient on $D_i$ in a regression gives us the treatment effect.
------------------------------------------------------------------------
In observational studies, treatment assignment $D_i$ is often **not random**, leading to **endogeneity**. This means that the error term in the regression equation might be correlated with $D_i$, violating one of the key assumptions of the [Ordinary Least Squares] estimator.
To formalize this issue, we can express the outcome equation as:
$$
\begin{aligned}
Y_i &= \beta_{0i} + \beta_{1i} D_i \\
&= ( \bar{\beta}_{0} + \epsilon_{0i} ) + (\bar{\beta}_{1} + \epsilon_{1i} )D_i \\
&= \bar{\beta}_{0} + \epsilon_{0i} + \bar{\beta}_{1} D_i + \epsilon_{1i} D_i
\end{aligned}
$$
where:
- $\bar{\beta}_{0}$ is the average baseline outcome.
- $\bar{\beta}_{1}$ is the average treatment effect.
- $\epsilon_{0i}$ captures individual-specific deviations in control group outcomes.
- $\epsilon_{1i}$ captures heterogeneous treatment effects.
If treatment assignment is truly **random**, then:
$$
E[\epsilon_{0i}] = E[\epsilon_{1i}] = 0
$$
which ensures:
- **No selection bias**: $D_i \perp \epsilon_{0i}$ (i.e., treatment assignment is independent of the baseline error).
- **Treatment effect is independent of assignment**: $D_i \perp \epsilon_{1i}$.
However, in observational studies, these assumptions often fail. This leads to:
- **Selection bias**: If individuals self-select into treatment based on unobserved characteristics, then $D_i$ correlates with $\epsilon_{0i}$.
- **Heterogeneous treatment effects**: If the treatment effect itself varies across individuals, then $D_i$ correlates with $\epsilon_{1i}$.
These issues violate the exogeneity assumption in OLS regression, leading to biased estimates of $\beta_1$.
------------------------------------------------------------------------
When estimating treatment effects using OLS regression, we need to be aware of potential estimation issues.
1. **OLS Estimator and Difference-in-Means**
Under random assignment, the OLS estimator for $\beta_1$ simplifies to the difference in means estimator:
$$
\hat{\beta}_1^{OLS} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}
$$
which is an unbiased estimator of the [Average Treatment Effect](#sec-average-treatment-effect); the sketch after this list verifies this numerically.
However, when treatment assignment is not random, OLS estimates may be biased due to unobserved confounders.
2. **Heteroskedasticity and Robust Standard Errors**
If treatment effects vary across individuals (i.e., treatment effect heterogeneity), the error term contains an interaction:
$$
\epsilon_i = \epsilon_{0i} + D_i \epsilon_{1i}
$$
which leads to heteroskedasticity (i.e., the variance of errors depends on $D_i$ and possibly on covariates $X_i$).
To address this, we use heteroskedasticity-robust standard errors, which ensure valid inference even when variance is not constant across observations.
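A minimal simulation sketch (hypothetical parameters; the `sandwich` and `lmtest` packages are assumed installed) illustrating both points: under random assignment the OLS coefficient on $D_i$ equals the difference in means, and heteroskedasticity-robust standard errors remain valid when treatment effects are heterogeneous.

```{r}
# Hypothetical randomized experiment with heterogeneous effects
# (mean effect 1.5), so the error variance depends on d.
library(sandwich)
library(lmtest)

set.seed(11)
n <- 10000
d     <- rbinom(n, 1, 0.5)
tau_i <- rnorm(n, mean = 1.5, sd = 2)      # individual treatment effects
y     <- 10 + tau_i * d + rnorm(n)

# Point 1: the OLS coefficient on d equals the difference in means
fit <- lm(y ~ d)
coef(fit)["d"]
mean(y[d == 1]) - mean(y[d == 0])

# Point 2: heteroskedasticity-robust (HC1) inference on the same fit
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))
```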
------------------------------------------------------------------------
### Conditional Average Treatment Effect {#sec-conditional-average-treatment-effect}
Treatment effects may vary across different subgroups in a population. The Conditional Average Treatment Effect (CATE) captures heterogeneity in treatment effects across subpopulations.
For a subgroup characterized by covariates $X_i$:
$$
CATE = E[Y(1) - Y(0) | X_i]