### Proxy Variables {#sec-proxy-variables}
In applied business and economic analysis, we often confront a frustrating reality: the variables we truly care about---like *brand loyalty*, *employee ability*, or *investor sentiment*---are not directly observable. Instead, we rely on **proxy variables**, which are observable measures that stand in for these latent or omitted variables. Though useful, proxy variables must be used with care, as they introduce their own risks, most notably **measurement error** and **incomplete control of endogeneity**.
A **proxy variable** is an observed variable used in place of a variable that is either unobservable or omitted from a model. It is typically used under the assumption that it is correlated with the latent variable and explains some of its variation.
Let:
- $X^*$ be the latent (unobserved) variable,
- $X$ be the observed proxy,
- $Y$ be the outcome.
We may desire to estimate: $$
Y = \beta_0 + \beta_1 X^* + \varepsilon,
$$ but since $X^*$ is unavailable, we instead estimate:
$$
Y = \beta_0 + \beta_1 X + u.
$$
The effectiveness of this approach hinges on whether $X$ can validly stand in for $X^*$.
#### Proxy Use and Omitted Variable Bias
Proxy variables are sometimes used as **substitutes** for omitted variables that cause [endogeneity](#sec-endogeneity). Including a proxy can **reduce** endogeneity, but it will **not** generally eliminate bias unless strict conditions are met.
> **Key Insight**: Including a proxy does not allow us to estimate the effect of the omitted variable; rather, it helps mitigate the bias introduced by its omission.
To be more precise, let's consider a classic omitted variable setup:
Suppose the true model is: $$
Y = \beta_0 + \beta_1 X + \beta_2 Z + \varepsilon,
$$ but $Z$ is omitted from the estimation. If $Z$ is correlated with $X$, the OLS estimate of $\beta_1$ will be biased.
Now, suppose we have a proxy $Z_p$ for $Z$. Including $Z_p$ in the regression: $$
Y = \beta_0 + \beta_1 X + \beta_2 Z_p + u
$$ can help reduce the bias **if** $Z_p$ meets the following criteria.
------------------------------------------------------------------------
Let $Z$ be the unobserved variable and $Z_p$ be the proxy. Then, $Z_p$ is a valid proxy if:
1. **Correlation**: $Z_p$ is correlated with $Z$ (i.e., $\text{Cov}(Z_p, Z) \ne 0$).
2. **Residual Independence**: The residual variation in $Z$ unexplained by $Z_p$ is uncorrelated with all regressors (including $Z_p$ and $X$): $$
Z = \gamma_0 + \gamma_1 Z_p + \nu, \quad \text{where } \text{Cov}(\nu, X) = \text{Cov}(\nu, Z_p) = 0.
$$
3. **No direct effect**: $Z_p$ affects $Y$ only through $Z$ (or at least not directly).
Violation of these conditions can lead to **biased** or **inconsistent** estimates.
::: {.rmdcaution}
The residual-independence condition is strong, and often violated in practice. Condition 2 above requires that the part of $Z$ *not* captured by the proxy $Z_p$ (i.e., $\nu$) be uncorrelated with both the proxy and every other regressor. There are three common ways this fails.
The first is reverse causation. If the unobserved construct $Z$ is a cause of the proxy $Z_p$ (the usual story: ability causes IQ scores), then $Z = \gamma_0 + \gamma_1 Z_p + \nu$ is a misspecified regression. The structural model runs the other direction, and $\text{Cov}(\nu, Z_p) = 0$ is not guaranteed.
The second is an endogenous proxy. If $Z_p$ is correlated with $X$ for reasons unrelated to $Z$ (e.g., IQ scores themselves respond to education), plugging it in induces a new bias rather than removing an old one.
The third is multiplicity of omitted variables. If $Z$ is one of several unobserved confounders, even a clean proxy for $Z$ leaves the others uncontrolled.
Because Condition 2 is untestable without additional structure, proxy variables *reduce* rather than *eliminate* omitted-variable bias. A sensible workflow is to report estimates with and without the proxy and to treat the proxy-adjusted estimate as bounding the effect rather than point-identifying it.
:::
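To see why the residual-independence condition has teeth, consider a small simulation (illustrative only; the coefficients and the proxy construction are assumptions of this sketch, not from the text). Here the proxy responds to $X$ directly rather than only through $Z$, so including it overcorrects instead of removing the omitted-variable bias:

```{r}
# Hypothetical simulation: an endogenous proxy that violates Condition 2
set.seed(1)
n <- 10000
Z <- rnorm(n)                                # unobserved confounder
X <- 0.8 * Z + rnorm(n)                      # regressor correlated with Z
Z_p_bad <- Z + 0.5 * X + rnorm(n, sd = 0.3)  # proxy contaminated by X directly
Y <- 1 + 2 * X + 3 * Z + rnorm(n)            # true coefficient on X is 2
coef(lm(Y ~ X))["X"]            # omitted-variable bias: well above 2
coef(lm(Y ~ X + Z_p_bad))["X"]  # bad proxy overcorrects: well below 2
```

Neither regression recovers $\beta_1 = 2$: the second replaces upward bias with downward bias, which is why an invalid proxy can be worse than reporting the omitted-variable estimate with an honest caveat.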
------------------------------------------------------------------------
#### Example: IQ as a Proxy for Ability in Wage Regressions
In labor economics, researchers often study the effect of education on wages. But ability---an unobservable factor---also affects both education and wages, leading to omitted variable bias.
Let:
- $Y$ = wage,
- $X$ = education,
- $Z$ = ability (unobserved),
- $Z_p$ = IQ test score (proxy for ability).
Suppose the true model is: $$
\text{wage} = \beta_0 + \beta_1 \text{education} + \beta_2 \text{ability} + \varepsilon.
$$
Since ability is unobserved, we estimate: $$
\text{wage} = \beta_0 + \beta_1 \text{education} + \beta_2 \text{IQ} + u,
$$ under the assumption: $$
\text{ability} = \gamma_0 + \gamma_1 \text{IQ} + \nu,
$$ with $\text{Cov}(\nu, \text{education}) = \text{Cov}(\nu, \text{IQ}) = 0$.
This inclusion of IQ helps reduce [endogeneity](#sec-endogeneity) but does **not** identify the pure effect of ability unless all variation in ability is captured by IQ.
------------------------------------------------------------------------
#### Pros and Cons of Proxy Variables
**Advantages**
- **Make latent variables measurable**: Allows analysis of constructs that cannot be directly observed.
- **Practicality**: Makes use of available data to address [endogeneity](#sec-endogeneity).
- **Improved specification**: Can reduce omitted variable bias if proxies are well chosen.
**Disadvantages**
- [Measurement error](#sec-measurement-error): Proxies usually include noise, causing **attenuation bias** (i.e., coefficients biased toward zero).
If $X = X^* + e$, with $e$ classical measurement error (zero mean, uncorrelated with $X^*$ and $\varepsilon$), then: $$
\text{plim}(\hat{\beta}_1) = \lambda \beta_1, \quad \text{where } \lambda = \frac{\sigma^2_{X^*}}{\sigma^2_{X^*} + \sigma^2_e} < 1.
$$
- **Interpretation issues**: Coefficients on proxies conflate the causal effect with proxy quality.
- **Insufficient control**: Proxies only partially reduce omitted variable bias unless they meet strict independence conditions.
#### Empirical Illustration: Simulating Attenuation Bias
```{r}
set.seed(2025)
n <- 1000
ability <- rnorm(n) # latent variable
IQ <- ability + rnorm(n, sd = 0.5) # proxy variable
education <- 12 + 0.5 * ability + rnorm(n) # correlated regressor
wage <- 20 + 1.5 * education + 2 * ability + rnorm(n) # true model
# Model using education only (omitted variable bias)
mod1 <- lm(wage ~ education)
# Model using education and proxy
mod2 <- lm(wage ~ education + IQ)
summary(mod1)
summary(mod2)
```
Observe how including the proxy reduces the bias in the coefficient on education, even if it doesn't eliminate it entirely.
#### Example: Marketing --- Brand Loyalty
Suppose you're modeling the effect of brand loyalty ($X^*$) on repeat purchase ($Y$). Since loyalty is latent, we might use:
- Number of prior purchases,
- Duration of current brand use,
- Membership in loyalty programs.
These proxies are likely to be correlated with true loyalty, but none is a perfect substitute.
```{r}
# Simulating attenuation bias with a proxy
set.seed(42)
n <- 1000
X_star <- rnorm(n) # true unobserved brand loyalty
proxy <- X_star + rnorm(n, sd = 0.6) # proxy with measurement error
error <- rnorm(n)
Y <- 3 + 2 * X_star + error # true model
# Model using the proxy variable
model_proxy <- lm(Y ~ proxy)
summary(model_proxy)
```
Observe that the estimated coefficient on `proxy` is less than the true coefficient (2), due to [measurement error](#sec-measurement-error).
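The attenuation-factor formula predicts the size of this shortfall. In the simulation above, $\sigma^2_{X^*} = 1$ and the measurement error has standard deviation $0.6$, so:

```{r}
# Attenuation factor implied by the classical measurement-error formula,
# using the parameters of the simulation above: var(X*) = 1, sd of error = 0.6
lambda <- 1 / (1 + 0.6^2)
lambda       # attenuation factor, about 0.74
2 * lambda   # predicted plim of the proxy coefficient, about 1.47
```

The estimated coefficient on `proxy` should land near this predicted value rather than near the true coefficient of 2.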
------------------------------------------------------------------------
#### Example: Finance --- Investor Sentiment
Investor sentiment affects market movements but cannot be directly measured. Proxies include:
- **Put-call ratios**,
- **Bullish/bearish sentiment surveys**,
- **Volume of IPO activity**,
- **Retail investor trading flows**.
These capture different dimensions of sentiment, and their effectiveness varies by context.
------------------------------------------------------------------------
#### Strategies to Improve Proxy Use
- **Multiple proxies**: Use several proxies and combine them via factor analysis or principal component analysis (PCA).
- [Instrumental variables](#sec-instrumental-variables): If a valid instrument exists for the proxy, use two-stage least squares to correct for [measurement error](#sec-measurement-error).
- **Latent variable models**: Structural Equation Modeling (SEM) allows estimation of models with latent variables explicitly.
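As a sketch of the multiple-proxies strategy (the data-generating values are illustrative assumptions), the first principal component of several noisy proxies typically tracks the latent variable more closely than any single proxy does:

```{r}
# Combining three noisy proxies of one latent variable via PCA
set.seed(7)
n <- 5000
X_star <- rnorm(n)                                  # latent variable
P <- sapply(c(0.5, 0.7, 0.9),
            function(s) X_star + rnorm(n, sd = s))  # three noisy proxies
pc1 <- prcomp(P, scale. = TRUE)$x[, 1]              # first principal component
abs(cor(X_star, P[, 1]))  # best single proxy
abs(cor(X_star, pc1))     # PCA composite: higher correlation with X_star
```

(The sign of a principal component is arbitrary, hence the absolute values.) A composite that is more highly correlated with the latent variable implies a larger attenuation factor $\lambda$, and therefore less attenuation bias.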
Proxy variables are valuable tools in empirical research when used with caution. They offer a bridge between theory and data when important variables are unobservable. However, this bridge is built on assumptions---especially regarding correlation, measurement error, and residual independence---that must be carefully justified.
> **Key Takeaway**: A proxy can reduce bias from omitted variables but introduces its own risks---especially measurement error and interpretive ambiguity. The best practice is to use proxies transparently, test assumptions when possible, and consider alternative solutions such as instruments or structural models.
------------------------------------------------------------------------
#### Negative Controls and Proximal Causal Inference
The classical proxy-for-omitted-variable framework assumes the analyst can plausibly substitute an observed variable for a latent confounder. In modern epidemiology, biostatistics, and applied econometrics, two related ideas push this logic further: **negative controls** for confounding *detection* and **proximal causal inference** for confounding *correction* under unmeasured confounding.
**Negative outcome controls** [@lipsitch2010negative] are outcomes that, by subject-matter knowledge, should *not* be causally affected by the treatment but *are* plausibly affected by the same unmeasured confounder $U$ that contaminates the primary outcome. The logic is diagnostic: if the estimated treatment effect on the negative outcome is far from zero, residual confounding is unlikely to be negligible for the primary outcome either.
> **Example**: In a study of statin use and dementia risk, mortality from accidents is a candidate negative outcome --- statins should not causally affect accident mortality, but unmeasured healthy-user effects could. A non-zero estimated effect on accident mortality flags residual confounding in the dementia analysis.
**Negative exposure controls** are treatment-like variables that, again by domain knowledge, are known to have no causal effect on the outcome of interest, but share confounders with the actual treatment. They serve the symmetric role of the negative outcome control: a non-zero estimated effect of the negative exposure on the outcome signals that the confounding-control strategy is incomplete.
**Common settings**: pharmacoepidemiology (healthy-user / sick-stopper effects), labor economics (estimating returns to education with placebo "treatments" assigned outside the policy window), marketing (brand-exposure studies with unrelated outcomes), and climate-and-health research where seasonality is a known confounder.
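A minimal simulation of the negative-outcome-control diagnostic (the effect sizes and the confounding mechanism are assumptions of this sketch): the negative outcome $N$ is causally unaffected by the treatment but shares the unmeasured confounder $U$ with the primary outcome, so a nonzero "effect" on $N$ flags residual confounding.

```{r}
# Negative outcome control as a confounding detector
set.seed(99)
n <- 5000
U <- rnorm(n)                 # unmeasured confounder (healthy-user effect)
A <- rbinom(n, 1, plogis(U))  # treatment uptake rises with U
Y <- 0.5 * A + U + rnorm(n)   # primary outcome: true effect of A is 0.5
N <- U + rnorm(n)             # negative outcome: true effect of A is 0
coef(lm(Y ~ A))["A"]  # confounded: substantially above 0.5
coef(lm(N ~ A))["A"]  # far from 0 --- flags residual confounding
```

The diagnostic is one-directional: a near-zero estimate on the negative outcome is reassuring but does not prove the primary estimate is unconfounded.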
**Proximal causal inference** [@miao2018identifying; @tchetgen2024proximal] goes beyond detection: under stronger conditions, a *pair* of proxies --- one tied to the treatment-side confounder ("treatment-inducing proxy" $Z$) and one tied to the outcome-side confounder ("outcome-inducing proxy" $W$) --- can identify and consistently estimate the causal effect of $A$ on $Y$ even when the confounder $U$ is *entirely unmeasured*.
The key structural assumption is that $Z$ and $W$ are independent of each other given $(A, U, X)$ and that each is informative about $U$ in a specific direction (the so-called *bridge function* or *completeness* condition). Estimation proceeds by solving an integral equation for an *outcome confounding bridge function* $h(W, A, X)$ such that $E[Y \mid A, U, X] = E[h(W, A, X) \mid A, U, X]$; the average causal effect is then identified by
$$
E[Y(a)] = E[\, h(W, a, X)\, ].
$$
A schematic implementation looks like:
```{r, eval=FALSE}
# Pseudocode --- proximal two-stage least squares (P2SLS) for a binary treatment
# A: treatment, Y: outcome, W: outcome-inducing proxy,
# Z: treatment-inducing proxy, X: measured covariates
# Stage 1: regress the outcome-inducing proxy W on (A, Z, X); the fitted
# values estimate the outcome confounding bridge function h(W, A, X).
stage1 <- lm(W ~ A + Z + X) # or a flexible learner
# Stage 2: replace W with its fitted bridge values and regress Y on them;
# this is the proximal analogue of an IV second stage.
What <- predict(stage1)
fit <- lm(Y ~ A + What + X) # coefficient on A is the proximal ATE
# In practice, the proximal-IV machinery is still in active development.
# There is no canonical CRAN or PyPI package; representative research
# implementations on GitHub include:
# - `pci2s` (Li): proximal causal inference for survival outcomes (R)
# - `SyntheticControl` (Shi, Miao, Hu, Tchetgen Tchetgen): synthetic
# control under a proximal framework (R)
# - `proxci` (Yang): minimax kernel machine learning for doubly
# robust proximal functionals (Python)
```
**Caveats and scope conditions**:
1. Both $Z$ and $W$ must be *more than* simple proxies for $U$ --- they need to satisfy *completeness* (a strong identifying condition akin to relevance in IV).
2. Misclassifying a true post-treatment variable as a proxy can re-introduce bias, just as in IV designs.
3. Negative-control validity often rests on subject-matter argument; pre-registering the set of negative controls before unblinding helps prevent post-hoc selection.
> **Key Takeaway**: Proxies have evolved from a single observed substitute for a latent confounder into a *system* of proxies (treatment-inducing, outcome-inducing, and negative controls) that, jointly, can both *detect* and *correct* unmeasured confounding. When a credible classical-proxy argument is unavailable, negative controls supply a sanity check, and proximal causal inference supplies a constructive identification strategy.