## Maximum Likelihood {#maximum-likelihood-estimator}

**Maximum Likelihood Estimation (MLE)** is a statistical method for estimating the parameters of a model: it chooses the parameter values under which the observed data are most probable, i.e., the values that maximize the likelihood.

The likelihood function, denoted as $L(\theta)$, is expressed as:

$$
L(\theta) = \prod_{i=1}^{n} f(y_i|\theta)
$$

where:

-   $f(y|\theta)$ is the probability density or mass function of observing a single value of $Y$ given the parameter $\theta$.
-   The product runs over all $n$ observations.

For different types of data, $f(y|\theta)$ can take different forms. For example, if $y$ is dichotomous (e.g., success/failure), then the likelihood function becomes:

$$
L(\theta) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i}
$$

Here, $\hat{\theta}$ is the Maximum Likelihood Estimator (MLE) if:

$$
L(\hat{\theta}) \geq L(\theta), \quad \forall \theta \text{ in the parameter space.}
$$

See [Distributions] for a review on variable distributions.

### Motivation for MLE

Suppose we know the **conditional distribution** of $Y$ given $X$, denoted as:

$$
f_{Y|X}(y, x; \theta)
$$

where $\theta$ is an unknown parameter of the distribution. Sometimes, we are only concerned with the unconditional distribution $f_Y(y; \theta)$.

For a sample of independent and identically distributed (i.i.d.) data, the joint density (or probability mass) of the sample factorizes as:

$$
f_{Y_1, \ldots, Y_n|X_1, \ldots, X_n}(y_1, \ldots, y_n, x_1, \ldots, x_n; \theta) = \prod_{i=1}^{n} f_{Y|X}(y_i, x_i; \theta)
$$

The **joint distribution**, evaluated at the observed data, defines the likelihood function. The goal of MLE is to find the parameter $\theta$ that maximizes this likelihood.

To estimate $\theta$, we maximize the likelihood function:

$$
\max_{\theta} \prod_{i=1}^{n} f_{Y|X}(y_i, x_i; \theta)
$$

In practice, it is easier to work with the natural logarithm of the likelihood (log-likelihood), as it transforms the product into a sum:

$$
\max_{\theta} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \theta))
$$
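
The log transform is also a numerical necessity, not just an algebraic convenience. The sketch below uses simulated standard normal data (the sample size and seed are arbitrary choices, not taken from the text): the raw product of densities underflows to zero in double precision, while the sum of log-densities remains a usable number.

```r
# Illustration only: simulated data, arbitrary seed and sample size
set.seed(1)
y <- rnorm(2000)                                        # 2000 i.i.d. N(0, 1) draws

# Raw likelihood: product of 2000 densities, each below 0.4
lik <- prod(dnorm(y, mean = 0, sd = 1))

# Log-likelihood: sum of the log-densities
loglik <- sum(dnorm(y, mean = 0, sd = 1, log = TRUE))

lik                 # underflows to 0 in double precision
is.finite(loglik)   # the log-likelihood is still finite
```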

------------------------------------------------------------------------

**Solving for the Maximum Likelihood Estimator**

1.  **First-Order Condition**: Set the first derivative of the log-likelihood function with respect to $\theta$ equal to zero and solve for $\hat{\theta}_{MLE}$:

    $$
    \left. \frac{\partial}{\partial \theta} \ell(\theta) \right|_{\theta = \hat{\theta}_{MLE}} = \left. \frac{\partial}{\partial \theta} \ln L(\theta) \right|_{\theta = \hat{\theta}_{MLE}} = \sum_{i=1}^{n} \left. \frac{\partial}{\partial \theta} \ln(f_{Y|X}(y_i, x_i; \theta)) \right|_{\theta = \hat{\theta}_{MLE}} = 0
    $$

    This yields the critical points of the log-likelihood, which are the candidates for its maximum. This derivative, sometimes written as $U(\theta)$, is called the **score**. Intuitively, the log-likelihood's "peak" indicates the parameter value(s) that make the observed data "most likely."

2.  **Second-Order Condition**: Verify that the second derivative of the log-likelihood function is negative at the critical point:

    $$
    \left. \frac{\partial^2}{\partial \theta^2} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \theta)) \right|_{\theta = \hat{\theta}_{MLE}} < 0
    $$

    This ensures that the critical point is a (local) maximum rather than a minimum or a saddle point.
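
As a worked instance of both conditions, take the dichotomous likelihood $L(\theta) = \prod_{i=1}^{n} \theta^{y_i}(1-\theta)^{1-y_i}$ from the start of this section. Writing $s = \sum_{i=1}^{n} y_i$ (a shorthand introduced here) and assuming $0 < s < n$:

$$
\begin{aligned}
\ell(\theta) &= s \ln \theta + (n - s) \ln(1 - \theta), \\
\frac{\partial \ell}{\partial \theta} &= \frac{s}{\theta} - \frac{n - s}{1 - \theta} = 0 \quad \Longrightarrow \quad \hat{\theta}_{MLE} = \frac{s}{n} = \bar{y}, \\
\frac{\partial^2 \ell}{\partial \theta^2} &= -\frac{s}{\theta^2} - \frac{n - s}{(1 - \theta)^2} < 0 \quad \text{for all } \theta \in (0, 1).
\end{aligned}
$$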

------------------------------------------------------------------------

**Examples of Likelihood Functions**

1.  Unconditional [Poisson Distribution]

The Poisson distribution models count data, such as the number of website visits in a day or product orders per hour. Its likelihood function is:

$$
L(\theta) = \prod_{i=1}^{n} \frac{\theta^{y_i} e^{-\theta}}{y_i!}
$$
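
For this Poisson likelihood the score equation has the closed-form solution $\hat{\theta} = \bar{y}$. A minimal sketch that checks this numerically on simulated counts follows; `theta_true`, the sample size, the search interval, and the seed are illustrative assumptions.

```r
# Illustration only: simulated counts; theta_true, n, and the seed are arbitrary
set.seed(42)
theta_true <- 3
y <- rpois(500, lambda = theta_true)

# Poisson log-likelihood: sum over i of log( theta^y_i * exp(-theta) / y_i! )
loglik_pois <- function(theta) sum(dpois(y, lambda = theta, log = TRUE))

# Maximize numerically over an interval that comfortably contains the sample mean
fit <- optimize(loglik_pois, interval = c(0.01, 20), maximum = TRUE, tol = 1e-8)

theta_hat_numeric <- fit$maximum
theta_hat_closed  <- mean(y)   # closed-form solution of the score equation

c(numeric = theta_hat_numeric, closed_form = theta_hat_closed)
```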

2.  [Exponential Distribution]

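The exponential distribution is often used to model the time between events, such as the time until a machine fails. Here its mean is allowed to depend on a positive covariate, $E(Y \mid X = x) = x\theta$, so the conditional probability density function (PDF) is: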

$$
f_{Y|X}(y, x; \theta) = \frac{\exp(-y / (x \theta))}{x \theta}
$$

The joint likelihood for $n$ observations is:

$$
L(\theta) = \prod_{i=1}^{n} \frac{\exp(-y_i / (x_i \theta))}{x_i \theta}
$$

By taking the logarithm, we obtain the log-likelihood for ease of maximization.
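
For this model the log-likelihood is $\ell(\theta) = \sum_{i=1}^{n} \left[-\frac{y_i}{x_i\theta} - \ln(x_i\theta)\right]$, and setting its derivative to zero gives $\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} y_i / x_i$. The sketch below compares that expression with a numerical maximizer on simulated data; the covariate range, `theta_true`, and the seed are assumptions made for illustration.

```r
# Illustration only: simulated data; theta_true, n, the covariate range, and the seed are arbitrary
set.seed(7)
n <- 400
theta_true <- 2
x <- runif(n, min = 0.5, max = 3)            # positive covariate
y <- rexp(n, rate = 1 / (x * theta_true))    # exponential with mean x * theta

# Log-likelihood: sum of  -y_i / (x_i * theta) - log(x_i * theta)
loglik_exp <- function(theta) sum(-y / (x * theta) - log(x * theta))

fit <- optimize(loglik_exp, interval = c(0.01, 50), maximum = TRUE, tol = 1e-8)

theta_hat_numeric <- fit$maximum
theta_hat_closed  <- mean(y / x)   # solves the first-order condition analytically

c(numeric = theta_hat_numeric, closed_form = theta_hat_closed)
```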

------------------------------------------------------------------------

### Key Quantities for Inference

1.  **Score Function**\
    The **score** is given by\
    $$
    U(\theta) = \frac{d}{d\theta} \ell(\theta).
    $$\
    Setting $U(\hat{\theta}_{\mathrm{MLE}}) = 0$ yields the critical points of the log-likelihood, from which we can find $\hat{\theta}_{\mathrm{MLE}}$.

2.  **Observed Information**\
    The negative of the second derivative of the log-likelihood is called the **observed information**; it is usually evaluated at the MLE, $\theta = \hat{\theta}_{\mathrm{MLE}}$:

    $$
    I_O(\theta) = - \frac{d^2}{d\theta^2} \ell(\theta).
    $$

    (The negative sign is included so that $I_O(\theta)$ is *positive* when $\ell(\theta)$ is concave near its maximum. Some texts define it without the negative sign, but the idea is the same: it measures the "pointedness" or curvature of $\ell(\theta)$ at its maximum.)

3.  **Fisher Information**\
    The **Fisher Information** (or **expected information**) is the expectation of the observed information over the distribution of the data:

    $$
    I(\theta) = \mathbb{E}\left[I_O(\theta)\right].
    $$

    It quantifies how much information the data carry about the parameter $\theta$. A larger Fisher information suggests that you can estimate $\theta$ more precisely.

4.  **Approximate Variance of** $\hat{\theta}_{\mathrm{MLE}}$\
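    One of the key results from standard asymptotic theory is that, for large $n$, the variance of $\hat{\theta}_{\mathrm{MLE}}$ can be approximated by the inverse of the Fisher information of the full sample (because $\ell(\theta)$ sums over all $n$ observations, $I(\theta)$ here grows with $n$):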

    $$
    \mathrm{Var}\left(\hat{\theta}_{\mathrm{MLE}}\right) \approx I(\theta)^{-1}.
    $$

    This also lays the groundwork for constructing confidence intervals for $\theta$ in large samples.
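
The four quantities above can be computed by hand for the dichotomous (Bernoulli) model. The following is a minimal sketch on simulated data, not a general-purpose routine; `theta_true`, the sample size, and the seed are assumptions, and `optimHess()` from base R is used only as a numerical cross-check of the analytic curvature.

```r
# Illustration only: simulated Bernoulli data; theta_true, n, and the seed are arbitrary
set.seed(123)
n <- 1000
theta_true <- 0.3
y <- rbinom(n, size = 1, prob = theta_true)
s <- sum(y)

loglik   <- function(theta) s * log(theta) + (n - s) * log(1 - theta)
score    <- function(theta) s / theta - (n - s) / (1 - theta)         # U(theta)
obs_info <- function(theta) s / theta^2 + (n - s) / (1 - theta)^2     # I_O(theta) = -l''(theta)

theta_hat <- s / n                       # root of the score equation
score(theta_hat)                         # approximately 0 at the MLE

# Numerical check: second derivative of the NEGATIVE log-likelihood at the MLE
info_numeric <- optimHess(theta_hat, function(th) -loglik(th))[1, 1]
info_exact   <- obs_info(theta_hat)      # equals n / (theta_hat * (1 - theta_hat))

se_hat <- sqrt(1 / info_exact)           # large-sample standard error
ci_95  <- theta_hat + c(-1, 1) * qnorm(0.975) * se_hat
ci_95
```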

------------------------------------------------------------------------

### Assumptions of MLE

MLE has desirable properties---*consistency*, *asymptotic normality*, and *efficiency*---but these do not come "for free." Instead, they rely on certain assumptions. Below is a breakdown of the main **regularity conditions**. These conditions are typically mild in many practical settings (for example, in exponential families, such as the normal distribution), but need to be checked in more complex models.

**High-Level Regularity Assumptions**

1.  **Independence and Identical Distribution (iid)**\
    The sample $\{(x_i, y_i)\}$ is usually assumed to be composed of independent and identically distributed observations. This independence assumption simplifies the likelihood to a product of individual densities:

    $$
    L(\theta) = \prod_{i=1}^n f_{Y\mid X}(y_i, x_i; \theta).
    $$

    In practice, if you have dependent data (e.g., time series, spatial data), modifications are required in the likelihood function.

2.  **Same Density Function**\
    All observations must come from the *same* conditional probability density function $f_{Y\mid X}(\cdot,\cdot;\theta)$. If the model changes across observations, you cannot simply multiply all of them together in one unified likelihood.

3.  **Multivariate Normality (for certain models)**\
    In many practical cases---especially for continuous outcomes---you might assume (multivariate) normal distributions with finite second or fourth moments [@little1988test]. Under these assumptions, the MLE for the mean vector and covariance matrix is consistent and (under further conditions) asymptotically normal. This assumption is quite common in regression, ANOVA, and other classical statistical frameworks.

------------------------------------------------------------------------

#### Large Sample Properties of MLE

##### Consistency of MLE

**Definition:** An estimator $\hat{\theta}_n$ is *consistent* if it converges in probability to the true parameter value $\theta_0$ as the sample size $n \to \infty$:

$$
\hat{\theta}_n \to^p \theta_0.
$$

For the MLE, a set of regularity conditions $R1$--$R4$ is commonly used to ensure consistency:

1.  **R1**\
    If $\theta \neq \theta_0$, then\
    $$
    f_{Y\mid X}(y_i, x_i; \theta) \neq f_{Y\mid X}(y_i, x_i; \theta_0).
    $$

    In simpler terms, the model is identifiable: no two distinct parameter values generate the *exact* same distribution for the data.

2.  **R2**\
    The parameter space $\Theta$ is compact (closed and bounded), and it contains the true parameter $\theta_0$. This ensures that $\theta$ lies in a "nice" region (no parameter going to infinity, etc.), making it easier to prove that a maximum in that space indeed exists.

3.  **R3**\
    The log-likelihood function $\ln(f_{Y\mid X}(y_i, x_i; \theta))$ is continuous in $\theta$ with probability $1$. Continuity is important so that we can apply theorems (like the Continuous Mapping Theorem or the Extreme Value Theorem) to find maxima.

4.  **R4**\
    The expected supremum of the absolute value of the log-likelihood is finite:

    $$
    \mathbb{E}\left(\sup_{\theta \in \Theta} \left|\ln(f_{Y\mid X}(y_i, x_i; \theta))\right|\right) < \infty.
    $$

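    This is a technical dominance condition: it allows a uniform law of large numbers to be applied, so that the sample average log-likelihood converges to its expectation uniformly over $\Theta$, a step needed in many consistency proofs.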

When these conditions are satisfied, you can show via standard arguments (e.g., the [Law of Large Numbers], uniform convergence of the log-likelihood) that:

$$
\hat{\theta}_{\mathrm{MLE}} \to^p \theta_0 \quad (\text{consistency}).
$$
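
Consistency can be seen in a small simulation. The sketch below estimates $P(|\hat{\theta}_n - \theta_0| > \varepsilon)$ for a Poisson model at increasing sample sizes; `theta0`, `eps`, the grid of sample sizes, the number of replications, and the seed are all illustrative assumptions.

```r
# Illustration only: Poisson model; theta0, eps, sample sizes, and the seed are arbitrary
set.seed(2024)
theta0 <- 2
eps    <- 0.2
n_grid <- c(10, 100, 1000)
reps   <- 2000

# For each n, estimate P(|theta_hat - theta0| > eps) by simulation.
# The Poisson MLE is the sample mean, so no optimizer is needed.
miss_rate <- sapply(n_grid, function(n) {
  theta_hat <- replicate(reps, mean(rpois(n, lambda = theta0)))
  mean(abs(theta_hat - theta0) > eps)
})

# The miss rate should shrink toward zero as n grows
data.frame(n = n_grid, miss_rate = miss_rate)
```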

------------------------------------------------------------------------

##### Asymptotic Normality of MLE

**Definition:** An estimator $\hat{\theta}_n$ is *asymptotically normal* if

$$
\sqrt{n}(\hat{\theta}_n - \theta_0) \to^d \mathcal{N}\left(0,\Sigma\right),
$$

where $\to^d$ denotes convergence in distribution and $\Sigma$ is some covariance matrix. For the MLE, $\Sigma$ is typically $I(\theta_0)^{-1}$, where $I(\theta_0)$ is the Fisher information of a single observation evaluated at the true parameter.

Beyond $R1$--$R4$, we need the following additional assumptions:

1.  **R5**\
    The true parameter $\theta_0$ is in the *interior* of the parameter space $\Theta$. If $\theta_0$ sits on the boundary, different arguments are required to handle edge effects.

2.  **R6**\
    The pdf $f_{Y\mid X}(y_i, x_i; \theta)$ is *twice* continuously differentiable (in $\theta$) and strictly positive in a neighborhood $N$ of $\theta_0$. This allows us to use second-order Taylor expansions around $\theta_0$ to get the approximate distribution of $\hat{\theta}_{\mathrm{MLE}}$.

3.  **R7**\
    The following integrals are finite in some neighborhood $N$ of $\theta_0$:

    -   $\int \sup_{\theta \in N} \left\|\frac{\partial f_{Y\mid X}(y_i, x_i; \theta)}{\partial \theta} \right\| d(y,x) < \infty$.
    -   $\int \sup_{\theta \in N} \left\|\frac{\partial^2 f_{Y\mid X}(y_i, x_i; \theta)}{\partial \theta \partial \theta'} \right\| d(y,x) < \infty$.
    -   $\mathbb{E}\left(\sup_{\theta \in N} \left\|\frac{\partial^2 \ln(f_{Y\mid X}(y_i, x_i; \theta))}{\partial \theta \partial \theta'} \right\|\right) < \infty$.

    These conditions ensure that differentiating inside integrals is justified (via the dominated convergence theorem) and that we can expand the log-likelihood in a Taylor series safely.

4.  **R8**\
    The information matrix $I(\theta_0)$ exists and is nonsingular:

    $$
    I(\theta_0) = \mathrm{Var}\left(\frac{\partial}{\partial \theta} \ln\left(f_{Y\mid X}(y_i, x_i; \theta_0)\right)\right) \neq 0.
    $$

    Nonsingularity implies there is enough information in the data to estimate $\theta$ uniquely.

Under $R1$--$R8$, you can show that

$$
\sqrt{n}(\hat{\theta}_{\mathrm{MLE}} - \theta_0) \to^d \mathcal{N}\left(0,I(\theta_0)^{-1}\right).
$$

This result is central to frequentist inference, allowing you to construct approximate confidence intervals and hypothesis tests using the normal approximation for large $n$.
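
A simulation sketch of this limit, under assumed settings, uses an exponential model with rate $\lambda_0$: the MLE is $\hat{\lambda} = 1/\bar{y}$ and the per-observation Fisher information is $1/\lambda^2$, so the limiting variance is $\lambda_0^2$. The values of `lambda0`, `n`, `reps`, and the seed are arbitrary.

```r
# Illustration only: exponential model with rate lambda0; n, reps, and the seed are arbitrary
set.seed(99)
lambda0 <- 1.5
n       <- 500
reps    <- 2000

# MLE of the exponential rate is 1 / sample mean
lambda_hat <- replicate(reps, 1 / mean(rexp(n, rate = lambda0)))

# Per-observation Fisher information is 1 / lambda^2, so I(lambda0)^{-1} = lambda0^2
z <- sqrt(n) * (lambda_hat - lambda0)
c(simulated_var = var(z), theoretical_var = lambda0^2)

# Wald 95% intervals: lambda_hat +/- z_{0.975} * lambda_hat / sqrt(n), plugging in the MLE
half_width <- qnorm(0.975) * lambda_hat / sqrt(n)
coverage   <- mean(abs(lambda_hat - lambda0) <= half_width)
coverage   # should be close to the nominal 0.95
```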

------------------------------------------------------------------------

### Properties of MLE

Having established in earlier sections that Maximum Likelihood Estimators (MLEs) are **consistent** ([Consistency of MLE]) and **asymptotically normal** ([Asymptotic Normality of MLE]) under standard regularity conditions, we now highlight additional properties that make MLE a powerful estimation technique.

1.  **Asymptotic Efficiency**

-   **Definition**: An estimator is *asymptotically efficient* if it attains the smallest possible asymptotic variance among all consistent estimators (i.e., it achieves the *Cramér-Rao Lower Bound*).
-   **Interpretation**: In large samples, MLE typically has smaller standard errors than other consistent estimators that do not fully use the assumed distributional form.
-   **Implication**: When the true model is correctly specified, MLE is the *most efficient* among a broad class of estimators, leading to more precise inference for $\theta$.
    -   **Cramér-Rao Lower Bound (CRLB)**: A theoretical lower limit on the variance of any unbiased (or asymptotically unbiased) estimator [@cramer1999mathematical; @rao1992information].
    -   **When MLE Meets CRLB**: Under correct specification and standard regularity conditions, the asymptotic variance of the MLE matches the CRLB, making it *asymptotically efficient*.
    -   **Interpretation**: Achieving CRLB means no other unbiased estimator can consistently outperform MLE in terms of variance for large $n$.

2.  **Invariance**

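-   **Core Idea**: If $\hat{\theta}$ is the MLE for $\theta$, then for any function $g(\theta)$, the MLE for $g(\theta)$ is simply $g(\hat{\theta})$. (Smoothness of $g$ is needed only when deriving standard errors for $g(\hat{\theta})$, e.g., via the delta method.)
-   **Example**: If $\hat{\sigma}^2$ is the MLE for the variance $\sigma^2$ of a normal distribution and you want the MLE for the *standard deviation* $\sigma$, you can just take the square root: $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$.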
-   **Key Point**: This *invariance property* saves considerable effort---there is no need to re-derive a new likelihood for the transformed parameter.
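
The invariance property can be checked numerically for a normal sample: transforming the MLE of the variance should give the same answer as maximizing the likelihood directly over the standard deviation. This is a sketch on simulated data; the true mean and standard deviation, the sample size, the search interval, and the seed are assumptions.

```r
# Illustration only: simulated normal data; the true values, n, and the seed are arbitrary
set.seed(5)
y <- rnorm(300, mean = 10, sd = 4)
mu_hat <- mean(y)                        # MLE of the mean

# Route 1: MLE of the variance (divides by n, not n - 1), then transform
sigma2_hat <- mean((y - mu_hat)^2)
sigma_via_invariance <- sqrt(sigma2_hat)

# Route 2: maximize the likelihood directly over sigma, holding mu at its MLE
loglik_sigma <- function(sigma) sum(dnorm(y, mean = mu_hat, sd = sigma, log = TRUE))
sigma_direct <- optimize(loglik_sigma, interval = c(0.1, 50),
                         maximum = TRUE, tol = 1e-8)$maximum

c(via_invariance = sigma_via_invariance, direct = sigma_direct)
```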

3.  **Explicit vs. Implicit MLE**

-   **Explicit MLE**:\
    Occurs when the score equation can be solved in *closed form*. A classic example is the MLE for the mean and variance in a normal distribution.
-   **Implicit MLE**:\
    Happens when no closed-form solution exists. Iterative numerical methods, such as **Newton-Raphson**, **Expectation-Maximization (EM)**, or other optimization algorithms, are used to find $\hat{\theta}$.
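
Logistic regression is a standard case of an implicit MLE. The sketch below hand-codes Newton-Raphson updates, $\beta \leftarrow \beta + I_O(\beta)^{-1} U(\beta)$, on simulated data and cross-checks the result against base R's `glm()`. The coefficients, sample size, starting value, iteration cap, and stopping tolerance are illustrative assumptions.

```r
# Illustration only: simulated logistic data; coefficients, n, and the seed are arbitrary
set.seed(11)
n <- 500
X <- cbind(1, rnorm(n))                       # design matrix with intercept
beta_true <- c(-0.5, 1.2)
y <- rbinom(n, size = 1, prob = as.vector(plogis(X %*% beta_true)))

beta <- c(0, 0)                               # starting value
for (iter in 1:25) {
  p     <- as.vector(plogis(X %*% beta))      # fitted probabilities
  score <- t(X) %*% (y - p)                   # gradient of the log-likelihood
  info  <- t(X) %*% (X * (p * (1 - p)))       # observed information, X' W X
  step  <- solve(info, score)                 # Newton step: info^{-1} score
  beta  <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break           # stop once updates are negligible
}

# Cross-check against R's built-in fit (IRLS, which coincides with Newton for the logit link)
beta_glm <- coef(glm(y ~ X[, 2], family = binomial))
rbind(newton = beta, glm = unname(beta_glm))
```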

------------------------------------------------------------------------

**Distributional Mis-Specification**

-   **Definition**: If you assume a distribution for $f_{Y|X}(\cdot;\theta)$ that does *not* reflect the true data-generating process, the MLE may become *inconsistent* or biased in finite samples.
-   **Quasi-MLE**:
    -   A strategy to handle certain forms of mis-specification.
    -   If the chosen distribution belongs to the linear exponential family (e.g., normal, Poisson, Bernoulli) and the conditional mean is correctly specified, the estimates of the mean parameters remain consistent even when the rest of the distribution is wrong; standard errors should then be computed with a robust (sandwich) estimator.
-   **Nonparametric & Semiparametric Approaches**:
    -   Require minimal or no distributional assumptions.
    -   More robust to mis-specification but can be *harder to implement* and may exhibit higher variance or require larger sample sizes to achieve comparable precision.
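
A small simulation can illustrate the quasi-MLE idea. In the sketch below the counts are generated from an overdispersed negative binomial distribution, yet a Poisson likelihood with the correct log-linear mean is fitted. The data-generating process, sample size, and seed are assumptions chosen for illustration.

```r
# Illustration only: the data-generating process, n, and the seed are arbitrary choices
set.seed(314)
n <- 20000
x <- runif(n)
beta_true <- c(0.5, 1.0)
mu <- exp(beta_true[1] + beta_true[2] * x)     # correctly specified conditional mean

# True distribution is negative binomial (overdispersed), NOT Poisson
y <- rnbinom(n, size = 2, mu = mu)

# Poisson quasi-MLE: the wrong distribution, but the right mean function
qmle <- glm(y ~ x, family = poisson)
beta_hat <- unname(coef(qmle))

# Pearson dispersion estimate; values well above 1 signal that the
# Poisson variance assumption (variance = mean) fails
dispersion <- sum(residuals(qmle, type = "pearson")^2) / df.residual(qmle)

rbind(true = beta_true, poisson_qmle = beta_hat)
dispersion
```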

------------------------------------------------------------------------

### Practical Considerations

1.  **Use Cases**
    -   MLE is extremely popular for:
        -   **Binary Outcomes** (logistic regression)
        -   **Count Data** (Poisson regression)
        -   **Strictly Positive Outcomes** (Gamma regression)
        -   **Heteroskedastic Settings** (models with variance related to mean, e.g., [GLM](#generalized-linear-models))
2.  **Distributional Assumptions**
    -   The efficiency gains of [MLE](#maximum-likelihood-estimator) stem from using a specific probability model.
    -   If the assumed model closely reflects the data-generating process, MLE gives accurate parameter estimates and reliable standard errors.
    -   MLE assumes knowledge of the conditional distribution of the outcome variable. This assumption parallels the normality assumption in linear regression models (e.g., [A6 Normal Distribution](#a6-normal-distribution)).
    -   If severely mis-specified, consider robust or semi-/nonparametric methods.
3.  **Comparison with OLS**: See [Comparison of MLE and OLS] for more details.
    -   [Ordinary Least Squares](#ordinary-least-squares) is a special case of [MLE](#maximum-likelihood-estimator) when errors are normally distributed and homoscedastic.
    -   In more general settings (e.g., non-Gaussian or heteroskedastic data), [MLE](#maximum-likelihood-estimator) can outperform OLS in terms of smaller standard errors and better inference.
4.  **Numerical Stability & Computation**
    -   For complex likelihoods, iterative methods can fail to converge or converge to local maxima.
    -   Proper initialization and diagnostics (e.g., checking multiple start points) are crucial.
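
Point 3 above (OLS as a special case of MLE) can be checked directly. This sketch fits the same simulated linear model with `lm()` and by minimizing the negative Gaussian log-likelihood with `optim()`. The coefficients, starting values, and seed are assumptions; the log-scale parameterization of $\sigma$ is one convenient choice, not the only one.

```r
# Illustration only: simulated data; coefficients, n, and the seed are arbitrary
set.seed(21)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 1.5)

ols <- lm(y ~ x)

# Negative Gaussian log-likelihood; sigma is parameterized on the log scale
# so the optimizer can search over the whole real line
negloglik <- function(par) {
  mu    <- par[1] + par[2] * x
  sigma <- exp(par[3])
  -sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))
}
start <- c(mean(y), 0, log(sd(y)))             # data-driven starting values
mle   <- optim(start, negloglik, method = "BFGS")

beta_mle   <- mle$par[1:2]
sigma2_mle <- exp(2 * mle$par[3])

rbind(ols = unname(coef(ols)), mle = beta_mle)   # coefficients agree
c(mle = sigma2_mle,                               # RSS / n
  ols = summary(ols)$sigma^2)                     # RSS / (n - 2)
```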

------------------------------------------------------------------------

### Comparison of MLE and OLS

While [Maximum Likelihood](#maximum-likelihood-estimator) Estimation is a powerful estimation method, it does not solve all of the challenges associated with [Ordinary Least Squares](#ordinary-least-squares). Below is a detailed comparison highlighting similarities, differences, and limitations.

**Key Points of Comparison**

1.  **Inference Methods**:
    -   [MLE](#maximum-likelihood-estimator):
        -   Joint inference is typically conducted using **log-likelihood calculations**, such as likelihood ratio tests or information criteria (e.g., AIC, BIC).
        -   These methods replace the use of F-statistics commonly associated with OLS.
    -   [OLS](#ordinary-least-squares):
        -   Relies on the **F-statistic** for hypothesis testing and joint inference.
2.  **Sensitivity to Functional Form**:
    -   Both MLE and [OLS](#ordinary-least-squares) are sensitive to the **functional form** of the model. Incorrect specification (e.g., linear vs. nonlinear relationships) can lead to biased or inefficient estimates in both cases.
3.  **Perfect Collinearity and Multicollinearity**:
    -   Both methods are affected by collinearity:
        -   **Perfect collinearity** (e.g., two identical predictors) makes parameter estimation impossible.
        -   **Multicollinearity** (highly correlated predictors) inflates standard errors, reducing the precision of estimates.
    -   Neither MLE nor [OLS](#ordinary-least-squares) directly resolves these issues without additional measures, such as regularization or variable selection.
4.  **Endogeneity**:
    -   Problems like **omitted variable bias** or **simultaneous equations** affect both MLE and OLS:
        -   If relevant predictors are omitted, estimates from both methods are likely to be biased and inconsistent.
        -   Similarly, in systems of simultaneous equations, both methods yield biased results unless endogeneity is addressed through instrumental variables or other approaches.
    -   [MLE](#maximum-likelihood-estimator), while efficient under correct model specification, does not inherently address endogeneity.
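
To make the first point concrete, the following sketch carries out a likelihood ratio test of a joint restriction in a logistic regression using only base R. The simulated design, the coefficient values, and the seed are illustrative assumptions.

```r
# Illustration only: simulated data; coefficients, n, and the seed are arbitrary
set.seed(8)
n  <- 600
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.3 + 0.8 * x1 + 0.5 * x2))  # x3 is irrelevant

full    <- glm(y ~ x1 + x2 + x3, family = binomial)
reduced <- glm(y ~ x1,           family = binomial)   # H0: coefficients on x2 and x3 are 0

# Likelihood ratio statistic: twice the gain in maximized log-likelihood
lr_stat <- as.numeric(2 * (logLik(full) - logLik(reduced)))
df_lr   <- attr(logLik(full), "df") - attr(logLik(reduced), "df")   # number of restrictions
p_value <- pchisq(lr_stat, df = df_lr, lower.tail = FALSE)

c(LR = lr_stat, df = df_lr, p = p_value)

# Information criteria are built from the same maximized log-likelihoods
AIC(reduced, full)
```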

------------------------------------------------------------------------

**Situations Where MLE and OLS Differ**

| **Aspect**                   | **MLE**                                                               | **OLS**                                                    |
|------------------------------|-----------------------------------------------------------------------|------------------------------------------------------------|
| **Estimator Efficiency**     | Efficient for correctly specified distributions.                      | Efficient among linear unbiased estimators (Gauss-Markov). |
| **Assumptions about Errors** | Requires specifying a distribution (e.g., normal, binomial).          | Requires only mean-zero errors and homoscedasticity.       |
| **Use of Likelihood**        | Based on maximizing the likelihood function for parameter estimation. | Based on minimizing the sum of squared residuals.          |
| **Model Flexibility**        | More flexible (supports various distributions, non-linear models).    | Primarily linear models (extensions for non-linear exist). |
| **Interpretation**           | Log-likelihood values guide model comparison (AIC/BIC).               | R-squared and adjusted R-squared measure fit.              |

: Comparative Summary of MLE and OLS Across Estimation, Assumptions, and Interpretation

**Practical Considerations**

1.  **When to Use MLE**:
    -   Situations where the dependent variable is:
        -   Binary (e.g., logistic regression)
        -   Count data (e.g., Poisson regression)
        -   Skewed or bounded (e.g., survival models)
    -   When the model naturally arises from a probabilistic framework.
2.  **When to Use OLS**:
    -   Suitable for continuous dependent variables with approximately linear relationships between predictors and outcomes.
    -   Simpler to implement and interpret when the assumptions of linear regression are reasonably met.

### Applications of MLE

MLE is widely used across various applications to estimate parameters in models tailored for specific data structures. Below are key applications of MLE, categorized by problem type and estimation method.

+-------------------------------+-------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| **Model Type**                | **Examples**                              | **Key Characteristics**                                                        | **Common Estimation Methods**             | **Additional Notes**                                                                                                                                  |
+:==============================+:==========================================+:===============================================================================+:==========================================+:======================================================================================================================================================+
| **Corner Solution Models**    | Hours worked                              | Dependent variable is often **censored at zero** (or another threshold).       | Tobit regression                          | Useful when a continuous outcome has a **mass point** at zero but also positive values (e.g., 30% of individuals donate \$0, the rest donate \> \$0). |
|                               |                                           |                                                                                |                                           |                                                                                                                                                       |
|                               | Donations to charity                      | Large fraction of observations at the corner (e.g., 0 hours, 0 donations).     | (latent variable approach with censoring) |                                                                                                                                                       |
|                               |                                           |                                                                                |                                           |                                                                                                                                                       |
|                               | Household consumption of a good           |                                                                                |                                           |                                                                                                                                                       |
+-------------------------------+-------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| **Non-Negative Count Models** | Number of arrests                         | Dependent variable consists of **non-negative integer counts**.                | Poisson regression,                       | Poisson assumes mean = variance, so often Negative Binomial is preferred for real data.                                                               |
|                               |                                           |                                                                                |                                           |                                                                                                                                                       |
|                               | Number of cigarettes smoked               | Possible **overdispersion** (variance \> mean).                                | Negative Binomial regression              | Zero-inflated models (ZIP/ZINB) may be used for data with **excess zeros**.                                                                           |
|                               |                                           |                                                                                |                                           |                                                                                                                                                       |
|                               | Doctor visits per year                    |                                                                                |                                           |                                                                                                                                                       |
+-------------------------------+-------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| **Multinomial Choice Models** | Demand for different car brands           | Dependent variable is a **categorical choice** among **3+ alternatives**.      | Multinomial logit,                        | Extension of binary choice (logit/probit) to multiple categories.                                                                                     |
|                               |                                           |                                                                                |                                           |                                                                                                                                                       |
|                               | Votes in a primary election               | Each category is distinct, with no inherent ordering (e.g., brand A, B, or C). | Multinomial probit                        | **Independence of Irrelevant Alternatives (IIA)** can be a concern for the multinomial logit.                                                         |
|                               |                                           |                                                                                |                                           |                                                                                                                                                       |
|                               | Choice of travel mode                     |                                                                                |                                           |                                                                                                                                                       |
+-------------------------------+-------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+
| **Ordinal Choice Models**     | Self-reported happiness (low/medium/high) | Dependent variable is **ordered** (e.g., **low \< medium \< high**).           | Ordered logit,                            | Probit/logit framework adapted to preserve **ordinal information**.                                                                                   |
|                               |                                           |                                                                                |                                           |                                                                                                                                                       |
|                               | Income level brackets                     | Distances between categories are not necessarily equal.                        | Ordered probit                            | Interprets latent continuous variable mapped to discrete ordered categories.                                                                          |
|                               |                                           |                                                                                |                                           |                                                                                                                                                       |
|                               | Likert-scale surveys                      |                                                                                |                                           |                                                                                                                                                       |
+-------------------------------+-------------------------------------------+--------------------------------------------------------------------------------+-------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------+

: Applications of Maximum Likelihood Estimation in Nonlinear and Limited Dependent Variable Models

------------------------------------------------------------------------

#### Binary Response Models

A binary response variable ($y_i$) follows a [Bernoulli](#bernoulli-distribution) distribution:

$$
f_Y(y_i; p) = p^{y_i}(1-p)^{(1-y_i)}
$$

where $p$ is the probability of success. For conditional models, the likelihood becomes:

$$
f_{Y|X}(y_i, x_i; p(.)) = p(x_i)^{y_i}(1 - p(x_i))^{(1-y_i)}
$$

To model $p(x_i)$, we specify it as a function of $x_i$ and unknown parameters $\beta$. A common approach involves a **latent variable model**:

$$
\begin{aligned}
y_i &= 1\{y_i^* > 0 \}, \\
y_i^* &= x_i \beta - \epsilon_i,
\end{aligned}
$$

where:

-   $y_i^*$ is an unobserved (latent) variable.
-   $\epsilon_i$ is a random variable with mean 0, representing unobserved noise.

Rewriting in terms of observed data:

$$
y_i = 1\{x_i \beta > \epsilon_i\}.
$$

The probability function becomes:

$$
\begin{aligned}
p(x_i) &= P(y_i = 1 | x_i) \\
&= P(x_i \beta > \epsilon_i | x_i) \\
&= F_{\epsilon|X}(x_i \beta | x_i),
\end{aligned}
$$

where $F_{\epsilon|X}(.)$ is the conditional cumulative distribution function (CDF) of $\epsilon_i$ given $x_i$. Assuming $\epsilon_i$ is independent of $x_i$, the conditional CDF equals the marginal CDF $F_\epsilon(.)$, and the probability function simplifies to:

$$
p(x_i) = F_\epsilon(x_i \beta).
$$

The conditional expectation function is equivalent:

$$
E(y_i | x_i) = P(y_i = 1 | x_i) = F_\epsilon(x_i \beta).
$$
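
The latent-variable derivation can be checked by simulation. The sketch below is illustrative only: the coefficient vector `beta`, the covariate row `x0`, and the sample size are hypothetical values, and the error is assumed to be standard normal.

```r
set.seed(123)

n    <- 200000          # number of simulated draws (assumed)
beta <- c(-0.5, 1.2)    # hypothetical intercept and slope
x0   <- c(1, 0.8)       # fixed covariate row: intercept and one regressor value

# Latent index y* = x beta - eps, with eps ~ N(0, 1) independent of x
eps    <- rnorm(n)
y_star <- sum(x0 * beta) - eps
y      <- as.integer(y_star > 0)

# Empirical P(y = 1 | x = x0) versus the model-implied F_eps(x0 beta)
p_sim    <- mean(y)
p_theory <- pnorm(sum(x0 * beta))
c(simulated = p_sim, theoretical = p_theory)
```

With this many draws, the simulated share of ones should be close to $\Phi(x_0 \beta)$, which is what the equality $E(y_i | x_i) = F_\epsilon(x_i \beta)$ predicts.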

**Common Distributional Assumptions**

1.  **Probit Model**:
    -   Assumes $\epsilon_i$ follows a standard normal distribution.
    -   $F_\epsilon(.) = \Phi(.)$, where $\Phi(.)$ is the standard normal CDF.
2.  **Logit Model**:
    -   Assumes $\epsilon_i$ follows a standard logistic distribution.
    -   $F_\epsilon(.) = \Lambda(.)$, where $\Lambda(z) = \frac{1}{1 + e^{-z}}$ is the standard logistic CDF.
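
Both CDFs are available in base R as `pnorm()` and `plogis()`. A small comparison on an arbitrary grid of index values `z` (chosen here only for illustration) shows how similar the two curves are near zero and how they differ in the tails:

```r
z <- seq(-4, 4, by = 0.5)

probit_cdf <- pnorm(z)    # Phi(z): standard normal CDF
logit_cdf  <- plogis(z)   # Lambda(z): standard logistic CDF

# The logistic CDF has the closed form 1 / (1 + exp(-z))
all.equal(logit_cdf, 1 / (1 + exp(-z)))

# Both CDFs equal 0.5 at z = 0, but the standard logistic has
# variance pi^2 / 3 (about 3.29) rather than 1, so its tails are heavier
round(cbind(z, probit_cdf, logit_cdf), 4)
```

Because the two error distributions have different variances, Probit and Logit coefficients are on different scales and should not be compared directly.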

------------------------------------------------------------------------

**Steps to Derive MLE for Binary Models**

1.  **Specify the Log-Likelihood**:
    -   For a chosen distribution (e.g., normal for Probit or logistic for Logit), the log-likelihood is:

        $$
        \ln(f_{Y|X}(y_i, x_i; \beta)) = y_i \ln(F_\epsilon(x_i \beta)) + (1 - y_i) \ln(1 - F_\epsilon(x_i \beta)).
        $$
2.  **Maximize the Log-Likelihood**:
    -   Find the parameter estimates that maximize the log-likelihood:

        $$
        \hat{\beta}_{MLE} = \underset{\beta}{\text{argmax}} \sum_{i=1}^{n} \ln(f_{Y|X}(y_i, x_i; \beta)).
        $$
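
The two steps above can be sketched directly in R. This is a minimal illustration on simulated data, not a recommended workflow: the design, the coefficients `beta0`, and the helper `neg_loglik()` are all hypothetical, and in practice `glm()` would be used. The sketch writes the Probit log-likelihood by hand, maximizes it with `optim()`, and compares the result with `glm()`.

```r
set.seed(42)

# Simulated data (all values hypothetical)
n     <- 2000
x     <- cbind(1, rnorm(n))             # intercept plus one regressor
beta0 <- c(0.3, -0.8)                   # "true" coefficients used to simulate
y     <- as.integer(x %*% beta0 - rnorm(n) > 0)

# Negative log-likelihood for the probit model, F_eps = pnorm.
# log.p = TRUE evaluates ln F and ln(1 - F) stably in the tails.
neg_loglik <- function(beta, y, x) {
  xb <- drop(x %*% beta)
  -sum(y * pnorm(xb, log.p = TRUE) +
       (1 - y) * pnorm(xb, lower.tail = FALSE, log.p = TRUE))
}

# Maximize the log-likelihood by minimizing its negative
opt <- optim(par = c(0, 0), fn = neg_loglik, y = y, x = x, method = "BFGS")

# Reference fit: glm() with a probit link maximizes the same likelihood
fit <- glm(y ~ x[, 2], family = binomial(link = "probit"))

rbind(optim = opt$par, glm = unname(coef(fit)))
```

The two rows should agree up to the numerical tolerance of the optimizer.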

------------------------------------------------------------------------

**Properties of Probit and Logit Estimators**

-   **Consistency and Asymptotic Normality**:
    -   Probit and Logit estimators are consistent and asymptotically normal if:
        -   [A2 Full Rank](#a2-full-rank): $E(x_i' x_i)$ exists and is non-singular.
        -   [A5 Data Generation (Random Sampling)](#a5-data-generation-random-sampling): $\{y_i, x_i\}$ are iid (or stationary and weakly dependent).
        -   Distributional assumptions on $\epsilon_i$ hold (e.g., normal or logistic, independent of $x_i$).
-   **Asymptotic Efficiency**:
    -   Under these assumptions, Probit and Logit estimators are asymptotically efficient: the asymptotic variance of $\sqrt{n}(\hat{\beta}_{MLE} - \beta_0)$ is the inverse of the Fisher information,

        $$
        I(\beta_0)^{-1} = \left[E\left(\frac{(f_\epsilon(x_i \beta_0))^2}{F_\epsilon(x_i \beta_0)(1 - F_\epsilon(x_i \beta_0))} x_i' x_i \right)\right]^{-1},
        $$

        where $f_\epsilon(x_i \beta_0)$ is the PDF (derivative of the CDF).
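
As a rough numerical check of the variance formula, the sketch below builds the sample analogue of the information matrix for a Probit fit and compares its inverse with `vcov()` from `glm()`. The simulated design and coefficients are hypothetical. The comparison relies on `glm()` computing its covariance matrix from the same expected-information weights, so the two matrices should agree closely.

```r
set.seed(1)

n  <- 5000
x1 <- rnorm(n)
X  <- cbind(1, x1)
y  <- as.integer(X %*% c(-0.2, 0.7) - rnorm(n) > 0)   # hypothetical probit DGP

fit <- glm(y ~ x1, family = binomial(link = "probit"))
xb  <- drop(X %*% coef(fit))

# Weight f^2 / (F (1 - F)) evaluated at the estimated index
w <- dnorm(xb)^2 / (pnorm(xb) * (1 - pnorm(xb)))

# Sample analogue of n * I(beta): sum_i w_i x_i' x_i
info_hat <- crossprod(X * sqrt(w))

# Its inverse estimates Var(beta_hat); compare with glm's vcov()
V_manual <- solve(info_hat)
list(manual = V_manual, glm = vcov(fit))
```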

------------------------------------------------------------------------

**Interpretation of Binary Response Models**

Binary response models, such as Probit and Logit, estimate the probability of an event occurring ($y_i = 1$) given predictor variables $x_i$. However, interpreting the estimated coefficients ($\beta$) in these models differs significantly from linear models. Below, we explore how to interpret these coefficients and the concept of **partial effects**.

1.  Interpreting $\beta$ in Binary Response Models

In binary response models, the coefficient $\beta_j$ is the change in the expected value of the **latent variable** $y_i^*$ (an unobserved variable) for a one-unit change in $x_{ij}$. Because $y_i^*$ is never observed, this tells us about the direction of the relationship but little else:

-   **Magnitudes** of $\beta_j$ do not have a direct, meaningful interpretation in terms of $y_i$.
-   **Direction** of $\beta_j$ is meaningful:
    -   $\beta_j > 0$: A positive association between $x_{ij}$ and the probability of $y_i = 1$.
    -   $\beta_j < 0$: A negative association between $x_{ij}$ and the probability of $y_i = 1$.

2.  Partial Effects in Nonlinear Binary Models

To interpret the effect of a change in a predictor $x_{ij}$ on the probability of an event occurring ($P(y_i = 1|x_i)$), we use the **partial effect**, which is obtained from the conditional expectation function:

$$
E(y_i | x_i) = F_\epsilon(x_i \beta),
$$

where $F_\epsilon(.)$ is the cumulative distribution function (CDF) of the error term $\epsilon_i$ (e.g., standard normal for Probit, standard logistic for Logit). For a continuous regressor $x_{ij}$, the **partial effect** is the derivative of this response probability with respect to $x_{ij}$:

$$
PE(x_{ij}) = \frac{\partial E(y_i | x_i)}{\partial x_{ij}} = f_\epsilon(x_i \beta) \beta_j,
$$

where:

-   $f_\epsilon(.)$ is the probability density function (PDF) of the error term $\epsilon_i$.

-   $\beta_j$ is the coefficient associated with $x_{ij}$.
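
The formula can be verified numerically for a single observation. In the sketch below the coefficients `beta` and the observation `x_i` are made-up values, and a Probit model is assumed, so $f_\epsilon$ is `dnorm()` and $F_\epsilon$ is `pnorm()`.

```r
beta <- c(0.2, 0.5, -1.0)     # hypothetical coefficients (intercept, x1, x2)
x_i  <- c(1, 0.4, 0.3)        # one hypothetical observation (leading 1 = intercept)
j    <- 2                     # position of the regressor of interest (x1)

# Analytical partial effect for a probit: f_eps(x_i beta) * beta_j
pe_analytic <- dnorm(sum(x_i * beta)) * beta[j]

# Numerical check: central finite difference of P(y = 1 | x) in x_ij
h      <- 1e-5
x_up   <- x_i; x_up[j]   <- x_i[j] + h
x_down <- x_i; x_down[j] <- x_i[j] - h
pe_numeric <- (pnorm(sum(x_up * beta)) - pnorm(sum(x_down * beta))) / (2 * h)

c(analytic = pe_analytic, numeric = pe_numeric)
```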

3.  Key Characteristics of Partial Effects

-   **Scaling Factor**:
    -   The partial effect depends on a **scaling factor**, $f_\epsilon(x_i \beta)$, which is derived from the density function $f_\epsilon(.)$.
    -   The scaling factor varies depending on the values of $x_i$, making the partial effect nonlinear and **context-dependent**.
-   **Non-Constant Partial Effects**:
    -   Unlike linear models where coefficients directly represent constant marginal effects, the partial effect in binary models changes based on $x_i$.
    -   For example, in a Logit model, the partial effect is largest when $P(y_i = 1 | x_i)$ is around 0.5 (the midpoint of the S-shaped logistic curve) and smaller at the extremes (close to 0 or 1).
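
For the Logit case this can be seen by tabulating the scaling factor over a grid of index values. The coefficient `beta_j` below is a hypothetical value used only to turn the scaling factor into a partial effect.

```r
beta_j <- 0.8                          # hypothetical logit coefficient
index  <- seq(-4, 4, by = 1)           # values of the index x_i beta

p       <- plogis(index)               # P(y = 1 | x) at each index value
scaling <- dlogis(index)               # logistic density f_eps(x_i beta)

# For the logistic distribution the density equals Lambda * (1 - Lambda)
all.equal(scaling, p * (1 - p))

# Partial effect peaks where p = 0.5 (index = 0), at 0.25 * beta_j
pe <- scaling * beta_j
data.frame(index, p = round(p, 3), partial_effect = round(pe, 4))
```

Since $\Lambda(1 - \Lambda)$ is at most 0.25, a Logit partial effect can never exceed $0.25\,|\beta_j|$ in absolute value.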

4.  Single Values for Partial Effects

In practice, researchers often summarize partial effects using either:

-   **Partial Effect at the Average (PEA)**:
    -   The partial effect is calculated for an "average individual," where $x_i = \bar{x}$ (the sample mean of the predictors):

        $$
        PEA = f_\epsilon(\bar{x}\hat{\beta}) \hat{\beta}_j.
        $$

    -   This provides a single, interpretable value, but the "average individual" may not resemble any actual observation (for example, when $x_i$ contains dummy variables), so the PEA need not be representative of the sample.
-   **Average Partial Effect (APE)**:
    -   The average of all individual-level partial effects across the sample:

        $$
        APE = \frac{1}{n} \sum_{i=1}^{n} f_\epsilon(x_i \hat{\beta}) \hat{\beta}_j.
        $$

    -   This accounts for the nonlinearity of the partial effects and provides a more accurate summary of the marginal effect in the population.
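
Both summaries are easy to compute by hand from a fitted model. The following sketch uses simulated Logit data with a hypothetical design and coefficients. It is meant only to show the two formulas side by side and does not replace a dedicated marginal-effects routine.

```r
set.seed(2024)

# Simulated logit data (hypothetical design)
n  <- 3000
x1 <- rnorm(n, mean = 1, sd = 2)
y  <- rbinom(n, size = 1, prob = plogis(-1 + 0.9 * x1))

fit <- glm(y ~ x1, family = binomial(link = "logit"))
b   <- coef(fit)
X   <- model.matrix(fit)               # n x 2 matrix: intercept and x1

# PEA: density evaluated at the sample mean of the regressors
pea <- dlogis(sum(colMeans(X) * b)) * b["x1"]

# APE: density evaluated for each observation, then averaged
ape <- mean(dlogis(drop(X %*% b))) * b["x1"]

c(PEA = unname(pea), APE = unname(ape))
```

In this design the average index is close to zero, where the logistic density peaks, so the PEA comes out larger than the APE. Other designs can produce the opposite ordering.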

5.  Comparing Partial Effects in Linear and Nonlinear Models

-   **Linear Models**:
    -   Partial effects are constant: $APE = PEA$.
    -   The coefficients directly represent the marginal effects on $E(y_i | x_i)$.
-   **Nonlinear Models**:
    -   Partial effects are **not constant** due to the dependence on $f_\epsilon(x_i \beta)$.
    -   As a result, $APE \neq PEA$ in general.

------------------------------------------------------------------------