# Generalized Linear Models {#generalized-linear-models}
Generalized Linear Models (GLMs) extend the traditional linear regression framework to accommodate response variables that do not necessarily follow a normal distribution. They provide a flexible approach to modeling relationships between a set of predictors and various types of dependent variables.
While [Ordinary Least Squares] regression assumes that the response variable is continuous and normally distributed, GLMs allow for response variables that follow distributions from the [exponential family](#sec-exponential-family-glm), such as [binomial](#sec-binomial-regression), [Poisson](#sec-poisson-regression), and gamma distributions. This flexibility makes them particularly useful in a wide range of business and research applications.
A [GLM](#generalized-linear-models) consists of three key components:
1. **A random component**: The response variable $Y_i$ follows a distribution from the exponential family (e.g., binomial, Poisson, gamma).
2. **A systematic component**: A linear predictor $\eta_i = \mathbf{x'_i} \beta$, where $\mathbf{x'_i}$ is a vector of observed covariates (predictor variables) and $\beta$ is a vector of parameters to be estimated.
3. **A link function**: A function $g(\cdot)$ that relates the expected value of the response variable, $\mu_i = E(Y_i)$, to the linear predictor (i.e., $\eta_i = g(\mu_i)$).
Although the relationship between the predictors and the outcome *may appear nonlinear* on the original outcome scale (due to the link function), a GLM is still considered "linear" in the statistical sense because it remains linear in the parameters $\beta$. Consequently, GLMs are **not** generally classified as nonlinear regression models. They "generalize" the traditional linear model by allowing for a broader range of response variable distributions and link functions, but retain linearity in their parameters.
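These components map directly onto R's `glm()` interface. Below is a minimal sketch with simulated, illustrative data (the variable names are assumptions for this example only):

```{r}
# The three GLM components in glm() (illustrative sketch)
set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
y  <- rpois(100, lambda = exp(0.2 + 0.5 * x1 - 0.3 * x2))

fit <- glm(
  y ~ x1 + x2,                   # systematic component: linear predictor
  family = poisson(link = "log") # random component and link function
)
summary(fit)
```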
The choice of **distribution** and **link function** depends on the nature of the response variable. In the following sections, we will explore several important GLM variants:
- [Logistic Regression](#sec-logistic-regression): Used for binary response variables (e.g., customer churn, loan defaults).
- [Probit Regression](#sec-probit-regression): Similar to logistic regression but assumes a normal distribution for the underlying probability.
- [Poisson Regression](#sec-poisson-regression): Used for modeling count data (e.g., number of purchases, call center inquiries).
- [Negative Binomial Regression](#sec-negative-binomial-regression): An extension of Poisson regression that accounts for overdispersion in count data.
- [Quasi-Poisson Regression](#sec-quasi-poisson-regression): A variation of Poisson regression that adjusts for overdispersion by allowing the variance to be a linear function of the mean.
- [Multinomial Logistic Regression](#sec-multinomial-logistic-regression): A generalization of logistic regression for categorical response variables with more than two outcomes.
- [Generalization of Generalized Linear Models](#sec-generalization-of-generalized-linear-models): Further extensions of the GLM framework, accommodating an even broader range of response distributions and model structures than the variants listed above.
------------------------------------------------------------------------
## Logistic Regression {#sec-logistic-regression}
Logistic regression is a widely used [Generalized Linear Model](#generalized-linear-models) designed for modeling binary response variables. It is particularly useful in applications such as credit scoring, medical diagnosis, and customer churn prediction.
### Logistic Model
Given a set of predictor variables $\mathbf{x}_i$, the probability of a positive outcome (e.g., success, event occurring) is modeled as:
$$
p_i = f(\mathbf{x}_i ; \beta) = \frac{\exp(\mathbf{x_i'\beta})}{1 + \exp(\mathbf{x_i'\beta})}
$$
where:
- $p_i = \mathbb{E}[Y_i]$ is the probability of success for observation $i$.
- $\mathbf{x_i}$ is the vector of predictor variables.
- $\beta$ is the vector of model coefficients.
#### Logit Transformation {#sec-logit-transformation}
The logistic function can be rewritten in terms of the **log-odds**, also known as the **logit function**:
$$
\text{logit}(p_i) = \log \left(\frac{p_i}{1 - p_i} \right) = \mathbf{x_i'\beta}
$$
where:
- $\frac{p_i}{1 - p_i}$ represents the **odds** of success (the ratio of the probability of success to the probability of failure).
- The logit function **ensures linearity** in the parameters, which aligns with the GLM framework.
Thus, logistic regression belongs to the family of **Generalized Linear Models** because **a function of the mean response (logit) is linear in the predictors**.
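As a quick sketch, the logit and its inverse are available in base R as `qlogis()` and `plogis()`:

```{r}
# The logit (log-odds) and its inverse in base R
p <- 0.75
qlogis(p)          # logit(p) = log(p / (1 - p))
plogis(qlogis(p))  # the inverse logit recovers p = 0.75
```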
### Likelihood Function {#sec-likelihood-function-logistic}
Since $Y_i$ follows a **Bernoulli distribution** with probability $p_i$, the likelihood function for $n$ independent observations is:
$$
L(p_i) = \prod_{i=1}^{n} p_i^{Y_i} (1 - p_i)^{1 - Y_i}
$$
By substituting the logistic function for $p_i$:
$$
p_i = \frac{\exp(\mathbf{x'_i \beta})}{1+\exp(\mathbf{x'_i \beta})}, \quad 1 - p_i = \frac{1}{1+\exp(\mathbf{x'_i \beta})}
$$
we obtain:
$$
L(\beta) = \prod_{i=1}^{n} \left( \frac{\exp(\mathbf{x'_i \beta})}{1+\exp(\mathbf{x'_i \beta})} \right)^{Y_i} \left( \frac{1}{1+\exp(\mathbf{x'_i \beta})} \right)^{1 - Y_i}
$$
Taking the natural logarithm of the likelihood function gives the **log-likelihood function**:
$$
Q(\beta) = \log L(\beta) = \sum_{i=1}^n Y_i \mathbf{x'_i \beta} - \sum_{i=1}^n \log(1 + \exp(\mathbf{x'_i \beta}))
$$
Since this function is **concave**, we can maximize it numerically using **iterative optimization techniques**, such as:
- [Newton-Raphson Algorithm]
- **Fisher Scoring Algorithm**
These methods allow us to obtain the [Maximum Likelihood] Estimates of the parameters, $\hat{\beta}$.
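To make the iteration concrete, here is a minimal Newton-Raphson sketch for the logistic log-likelihood on simulated data. This is an illustration only; in practice, `glm()` performs the equivalent iteratively reweighted least squares for us.

```{r}
# Newton-Raphson for the logistic log-likelihood (illustrative sketch)
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n))                  # design matrix with intercept
y <- rbinom(n, 1, plogis(X %*% c(-0.5, 1)))

beta <- rep(0, ncol(X))                  # starting values
for (iter in 1:25) {
  p    <- as.vector(plogis(X %*% beta))  # current fitted probabilities
  grad <- t(X) %*% (y - p)               # score vector
  info <- t(X) %*% (X * (p * (1 - p)))   # Fisher information
  step <- solve(info, grad)              # Newton step
  beta <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break      # converged
}

# The iterates agree with glm() to numerical precision
cbind(newton = beta, glm = coef(glm(y ~ X[, 2], family = binomial)))
```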
Under standard regularity conditions, the **MLEs of logistic regression parameters are asymptotically normal**:
$$
\hat{\beta} \dot{\sim} AN(\beta, [\mathbf{I}(\beta)]^{-1})
$$
where:
- $\mathbf{I}(\beta)$ is the [Fisher Information Matrix], which determines the variance-covariance structure of $\hat{\beta}$.
### Fisher Information Matrix
The **Fisher Information Matrix** quantifies the amount of information that an observable random variable carries about the **unknown parameter** $\beta$. It is crucial in estimating the **variance-covariance matrix** of the estimated coefficients in logistic regression.
Mathematically, the Fisher Information Matrix is defined as:
$$
\mathbf{I}(\beta) = E\left[ \frac{\partial \log L(\beta)}{\partial \beta} \frac{\partial \log L(\beta)}{\partial \beta'} \right]
$$
which expands to:
$$
\mathbf{I}(\beta) = E\left[ \left(\frac{\partial \log L(\beta)}{\partial \beta_i} \frac{\partial \log L(\beta)}{\partial \beta_j} \right)_{ij} \right]
$$
Under **regularity conditions**, the Fisher Information Matrix is equivalent to the **negative expected Hessian matrix**:
$$
\mathbf{I}(\beta) = -E\left[ \frac{\partial^2 \log L(\beta)}{\partial \beta \partial \beta'} \right]
$$
which further expands to:
$$
\mathbf{I}(\beta) = -E \left[ \left( \frac{\partial^2 \log L(\beta)}{\partial \beta_i \partial \beta_j} \right)_{ij} \right]
$$
This representation is particularly useful because it allows us to compute the Fisher Information Matrix directly from the Hessian of the log-likelihood function.
------------------------------------------------------------------------
**Example: Fisher Information Matrix in Logistic Regression**
Consider a **simple logistic regression model** with one predictor:
$$
x_i' \beta = \beta_0 + \beta_1 x_i
$$
From the log-[likelihood function](#sec-likelihood-function-logistic), the second-order partial derivatives are:
$$
\begin{aligned}
- \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_0} &= \sum_{i=1}^n \left( \frac{\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - \left[\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}\right]^2 \right) & \text{Intercept} \\
&= \sum_{i=1}^n p_i (1-p_i) \\
- \frac{\partial^2 \ln(L(\beta))}{\partial \beta^2_1} &= \sum_{i=1}^n \left( \frac{x_i^2\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - \left[\frac{x_i\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}\right]^2 \right) & \text{Slope}\\
&= \sum_{i=1}^n x_i^2 p_i (1-p_i) \\
- \frac{\partial^2 \ln(L(\beta))}{\partial \beta_0 \partial \beta_1} &= \sum_{i=1}^n \left( \frac{x_i\exp(x'_i \beta)}{1 + \exp(x'_i \beta)} - x_i\left[\frac{\exp(x_i' \beta)}{1+ \exp(x'_i \beta)}\right]^2 \right) & \text{Cross-derivative}\\
&= \sum_{i=1}^n x_i p_i (1-p_i)
\end{aligned}
$$
Combining these elements, the **Fisher Information Matrix** for the logistic regression model is:
$$
\mathbf{I} (\beta) =
\begin{bmatrix}
\sum_{i=1}^{n} p_i(1 - p_i) & \sum_{i=1}^{n} x_i p_i(1 - p_i) \\
\sum_{i=1}^{n} x_i p_i(1 - p_i) & \sum_{i=1}^{n} x_i^2 p_i(1 - p_i)
\end{bmatrix}
$$
where:
- $p_i = \frac{\exp(x_i' \beta)}{1+\exp(x_i' \beta)}$ represents the predicted probability.
- $p_i (1 - p_i)$ is the **variance of the Bernoulli response variable**.
- The diagonal elements quantify the information the data carry about $\beta_0$ and $\beta_1$ individually.
- The off-diagonal elements quantify the shared information between $\beta_0$ and $\beta_1$.
The inverse of the Fisher Information Matrix provides the **variance-covariance matrix** of the estimated coefficients:
$$
\mathbf{Var}(\hat{\beta}) = \mathbf{I}(\hat{\beta})^{-1}
$$
This matrix is essential for:
- **Estimating standard errors** of the logistic regression coefficients.
- **Constructing confidence intervals** for $\beta$.
- **Performing hypothesis tests** (e.g., Wald Test).
------------------------------------------------------------------------
```{r fisher-information-matrix}
# Simulated dataset
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))

# Fit logistic regression model
model <- glm(y ~ x, family = binomial)

# summary(model)$cov.unscaled is the estimated variance-covariance matrix
# of the coefficients, i.e., the *inverse* of the Fisher Information Matrix
var_cov <- summary(model)$cov.unscaled

# Invert it to recover the Fisher Information Matrix
fisher_info <- solve(var_cov)
fisher_info
```
### Inference in Logistic Regression
Once we estimate the model parameters $\hat{\beta}$ using [Maximum Likelihood] Estimation, we can conduct inference to assess the significance of predictors, construct confidence intervals, and perform hypothesis testing. The two most common inference approaches in logistic regression are:
1. [Likelihood Ratio Test](#sec-likelihood-ratio-test-logistic)
2. [Wald Statistics](#sec-wald-test-logistic)
These tests rely on the **asymptotic normality** of MLEs and the properties of the [Fisher Information Matrix].
------------------------------------------------------------------------
#### Likelihood Ratio Test {#sec-likelihood-ratio-test-logistic}
The **Likelihood Ratio Test** compares two models:
- **Restricted Model**: A simpler model where some parameters are constrained to specific values.
- **Unrestricted Model**: The full model without constraints.
To test a hypothesis about a subset of parameters $\beta_1$, we leave $\beta_2$ (nuisance parameters) unspecified.
Hypothesis Setup:
$$
H_0: \beta_1 = \beta_{1,0}
$$
where $\beta_{1,0}$ is a specified value (often zero). Let:
- $\hat{\beta}_{2,0}$ be the **MLE of** $\beta_2$ under the constraint $\beta_1 = \beta_{1,0}$.
- $\hat{\beta}_1, \hat{\beta}_2$ be the **MLEs under the full model**.
The **likelihood ratio test statistic** is:
$$
-2\log\Lambda = -2[\log L(\beta_{1,0}, \hat{\beta}_{2,0}) - \log L(\hat{\beta}_1, \hat{\beta}_2)]
$$
where:
- The first term is the **log-likelihood of the restricted model**.
- The second term is the **log-likelihood of the unrestricted model**.
Under the null hypothesis:
$$
-2 \log \Lambda \sim \chi^2_{\upsilon}
$$
where $\upsilon$ is the number of restricted parameters. We **reject** $H_0$ if:
$$
-2\log \Lambda > \chi^2_{\upsilon,1-\alpha}
$$
**Interpretation**: If the likelihood ratio test statistic is large, this suggests that the restricted model (under $H_0$) fits significantly worse than the full model, leading us to reject the null hypothesis.
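As a quick illustration on simulated data, the statistic can be computed by fitting the restricted (here, intercept-only) and full models and differencing their log-likelihoods:

```{r}
# Likelihood ratio test computed by hand (illustrative sketch)
set.seed(123)
x <- rnorm(100)
y <- rbinom(100, 1, prob = plogis(0.5 + 1.2 * x))

full       <- glm(y ~ x, family = binomial)  # unrestricted model
restricted <- glm(y ~ 1, family = binomial)  # restricted: beta_1 = 0

LR <- -2 * (logLik(restricted) - logLik(full))
pchisq(as.numeric(LR), df = 1, lower.tail = FALSE)  # p-value
```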
------------------------------------------------------------------------
#### Wald Test {#sec-wald-test-logistic}
The **Wald test** is based on the asymptotic normality of MLEs:
$$
\hat{\beta} \sim AN (\beta, [\mathbf{I}(\beta)]^{-1})
$$
We test:
$$
H_0: \mathbf{L} \beta = 0
$$
where $\mathbf{L}$ is a $q \times p$ matrix with $q$ linearly independent rows (often used to test multiple coefficients simultaneously). The **Wald test statistic** is:
$$
W = (\mathbf{L\hat{\beta}})'(\mathbf{L[I(\hat{\beta})]^{-1}L'})^{-1}(\mathbf{L\hat{\beta}})
$$
Under $H_0$:
$$
W \sim \chi^2_q
$$
**Interpretation**: If $W$ is large, the null hypothesis is rejected, suggesting that at least one of the tested coefficients is significantly different from zero.
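A minimal sketch of the matrix computation for a single coefficient ($H_0: \beta_1 = 0$, so $\mathbf{L} = (0, 1)$), using simulated data:

```{r}
# Wald test computed by hand for H0: beta_1 = 0 (illustrative sketch)
set.seed(123)
x <- rnorm(100)
y <- rbinom(100, 1, prob = plogis(0.5 + 1.2 * x))
fit <- glm(y ~ x, family = binomial)

L <- matrix(c(0, 1), nrow = 1)  # selects beta_1
b <- coef(fit)
V <- vcov(fit)                  # estimate of [I(beta_hat)]^{-1}

W <- t(L %*% b) %*% solve(L %*% V %*% t(L)) %*% (L %*% b)
pchisq(as.numeric(W), df = nrow(L), lower.tail = FALSE)  # p-value
```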
------------------------------------------------------------------------
| Test | Best Used When... |
|--------------------------|----------------------------------------------|
| [Likelihood Ratio Test](#sec-likelihood-ratio-test-logistic) | More accurate in small samples, providing better control of error rates. Recommended when sample sizes are small. |
| [Wald Test](#sec-wald-test-logistic) | Easier to compute but may be inaccurate in small samples. Recommended when computational efficiency is a priority. |
: Comparing Likelihood Ratio and Wald Tests
------------------------------------------------------------------------
```{r}
# Load necessary library
library(stats)
# Simulate some binary outcome data
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))
# Fit logistic regression model
model <- glm(y ~ x, family = binomial)
# Display model summary (includes Wald tests)
summary(model)
# Perform likelihood ratio test using anova()
anova(model, test="Chisq")
```
#### Confidence Intervals for Coefficients
A **95% confidence interval** for a logistic regression coefficient $\beta_i$ is given by:
$$
\hat{\beta}_i \pm 1.96 \hat{s}_{ii}
$$
where:
- $\hat{\beta}_i$ is the estimated coefficient.
- $\hat{s}_{ii}$ is the standard error (square root of the diagonal element of $\mathbf{[I(\hat{\beta})]}^{-1}$).
This confidence interval provides a **range of plausible values** for $\beta_i$. If the interval does not include **zero**, we conclude that $\beta_i$ is statistically significant.
- **For large sample sizes**, the [Likelihood Ratio Test](#sec-likelihood-ratio-test-logistic) and [Wald Test](#sec-wald-test-logistic) yield similar results.
- **For small sample sizes**, the [Likelihood Ratio Test](#sec-likelihood-ratio-test-logistic) is preferred because the [Wald test](#sec-wald-test-logistic) can be less reliable.
------------------------------------------------------------------------
#### Interpretation of Logistic Regression Coefficients
For a **single predictor variable**, the logistic regression model is:
$$
\text{logit}(\hat{p}_i) = \log\left(\frac{\hat{p}_i}{1 - \hat{p}_i} \right) = \hat{\beta}_0 + \hat{\beta}_1 x_i
$$
where:
- $\hat{p}_i$ is the predicted probability of success at $x_i$.
- $\hat{\beta}_1$ represents the **log odds change** for a one-unit increase in $x$.
**Interpreting** $\beta_1$ **in Terms of Odds**
When the predictor variable increases by **one unit**, the logit of the probability changes by $\hat{\beta}_1$:
$$
\text{logit}(\hat{p}_{x_i +1}) = \hat{\beta}_0 + \hat{\beta}_1 (x_i + 1) = \text{logit}(\hat{p}_{x_i}) + \hat{\beta}_1
$$
Thus, the difference in log odds is:
$$
\begin{aligned}
\text{logit}(\hat{p}_{x_i +1}) - \text{logit}(\hat{p}_{x_i})
&= \log ( \text{odds}(\hat{p}_{x_i + 1})) - \log (\text{odds}(\hat{p}_{x_i}) )\\
&= \log\left( \frac{\text{odds}(\hat{p}_{x_i + 1})}{\text{odds}(\hat{p}_{x_i})} \right) \\
&= \hat{\beta}_1
\end{aligned}
$$
Exponentiating both sides:
$$
\exp(\hat{\beta}_1) = \frac{\text{odds}(\hat{p}_{x_i + 1})}{\text{odds}(\hat{p}_{x_i})}
$$
This quantity, $\exp(\hat{\beta}_1)$, is the **odds ratio**, which quantifies the effect of a one-unit increase in $x$ on the odds of success.
------------------------------------------------------------------------
**Generalization: Odds Ratio for Any Change in** $x$
For a difference of $c$ units in the predictor $x$, the estimated odds ratio is:
$$
\exp(c\hat{\beta}_1)
$$
For multiple predictors, $\exp(\hat{\beta}_k)$ represents the odds ratio for $x_k$, holding all other variables constant.
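For instance, reusing the `model` fitted in the chunk above, the estimated odds ratio for a two-unit increase in $x$ is a one-liner:

```{r}
# Odds ratio for a c-unit change in x (sketch, reusing `model` from above)
c_units <- 2
exp(c_units * coef(model)["x"])
```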
------------------------------------------------------------------------
#### Inference on the Mean Response
For a given set of predictor values $x_h = (1, x_{h1}, ..., x_{h,p-1})'$, the estimated **mean response** (probability of success) is:
$$
\hat{p}_h = \frac{\exp(\mathbf{x'_h \hat{\beta}})}{1 + \exp(\mathbf{x'_h \hat{\beta}})}
$$
The variance of the estimated **linear predictor** $\hat{\eta}_h = \mathbf{x'_h \hat{\beta}}$ is:
$$
s^2(\hat{\eta}_h) = \mathbf{x'_h[I(\hat{\beta})]^{-1}x_h}
$$
where:
- $\mathbf{I}(\hat{\beta})^{-1}$ is the **variance-covariance matrix** of $\hat{\beta}$.
- Transforming confidence limits for $\hat{\eta}_h$ through the inverse logit yields an interval for $\hat{p}_h$, quantifying the **uncertainty** in the estimated probability.
In many applications, logistic regression is used for **classification**, where we predict whether an observation belongs to **category 0 or 1**. A commonly used decision rule is:
- Assign $y = 1$ if $\hat{p}_h \geq \tau$
- Assign $y = 0$ if $\hat{p}_h < \tau$
where $\tau$ is a chosen cutoff threshold (typically $\tau = 0.5$).
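A minimal sketch of this decision rule, reusing the `model` and simulated data from the chunks above:

```{r}
# Classify observations at the cutoff tau = 0.5 (illustrative sketch)
tau <- 0.5
p_hat   <- predict(model, type = "response")
y_class <- ifelse(p_hat >= tau, 1, 0)
table(Predicted = y_class, Actual = y)
```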
------------------------------------------------------------------------
```{r}
# Load necessary library
library(stats)
# Simulated dataset
set.seed(123)
n <- 100
x <- rnorm(n)
y <- rbinom(n, 1, prob = plogis(0.5 + 1.2 * x))
# Fit logistic regression model
model <- glm(y ~ x, family = binomial)
# Display model summary
summary(model)
# Extract coefficients and standard errors
coef_estimates <- coef(summary(model))
beta_hat <- coef_estimates[, 1] # Estimated coefficients
se_beta <- coef_estimates[, 2] # Standard errors
# Compute 95% confidence intervals for coefficients
conf_intervals <- cbind(
beta_hat - 1.96 * se_beta,
beta_hat + 1.96 * se_beta
)
# Compute Odds Ratios
odds_ratios <- exp(beta_hat)
# Display results
print("Confidence Intervals for Coefficients:")
print(conf_intervals)
print("Odds Ratios:")
print(odds_ratios)
# Predict probability for a new observation (e.g., x = 1)
new_x <- data.frame(x = 1)
predicted_prob <- predict(model, newdata = new_x, type = "response")
print("Predicted Probability for x = 1:")
print(predicted_prob)
```
### Application: Logistic Regression
In this section, we demonstrate the application of **logistic regression** using simulated data. We explore model fitting, inference, residual analysis, and goodness-of-fit testing.
**1. Load Required Libraries**
```{r}
library(kableExtra)
library(dplyr)
library(pscl)
library(ggplot2)
library(faraway)
library(nnet)
library(agridat)
library(nlstools)
```
**2. Data Generation**
We generate a dataset where the predictor variable $X$ follows a **uniform distribution**:
$$
x \sim Unif(-0.5,2.5)
$$
The **linear predictor** is given by:
$$
\eta = 0.5 + 0.75 x
$$
Passing $\eta$ into the inverse-logit function, we obtain:
$$
p = \frac{\exp(\eta)}{1+ \exp(\eta)}
$$
which ensures that $p \in [0,1]$. We then generate the binary response variable:
$$
y \sim Bernoulli(p)
$$
```{r}
set.seed(23) # Set seed for reproducibility
x <- runif(1000, min = -0.5, max = 2.5) # Generate X values
eta1 <- 0.5 + 0.75 * x # Compute linear predictor
p <- exp(eta1) / (1 + exp(eta1)) # Compute probabilities
y <- rbinom(1000, 1, p) # Generate binary response
BinData <- data.frame(X = x, Y = y) # Create data frame
```
**3. Model Fitting**
We fit a **logistic regression model** to the simulated data:
$$
\text{logit}(p) = \beta_0 + \beta_1 X
$$
```{r}
Logistic_Model <- glm(formula = Y ~ X,
# Specifies the response distribution
family = binomial,
data = BinData)
summary(Logistic_Model) # Model summary
nlstools::confint2(Logistic_Model) # Confidence intervals
# Compute odds ratios
OddsRatio <- coef(Logistic_Model) %>% exp
OddsRatio
```
Interpretation of the Odds Ratio
- When $x = 0$, the estimated **odds of success** are **1.59** (the exponentiated intercept).
- Each one-unit increase in $x$ **multiplies the odds of success by 2.19**, i.e., a **119.29% increase**.
**4. Deviance Test**
We assess the model's significance using the **deviance test**, which compares:
- $H_0$: No predictors are related to the response (intercept-only model).
- $H_1$: At least one predictor is related to the response.
The **test statistic** is:
$$
D = \text{Null Deviance} - \text{Residual Deviance}
$$
```{r}
Test_Dev <- Logistic_Model$null.deviance - Logistic_Model$deviance
p_val_dev <- 1 - pchisq(q = Test_Dev, df = 1)
p_val_dev
```
Since the **p-value is approximately 0**, we **reject** $H_0$, confirming that $X$ is significantly related to $Y$.
**5. Residual Analysis**
We compute **deviance residuals** and plot them against $X$.
```{r}
Logistic_Resids <- residuals(Logistic_Model, type = "deviance")
```
```{r fig-deviance-resid, fig.cap="Deviance Residuals", fig.alt="Scatter plot showing the relationship between the variable X and deviance residuals. The plot features two distinct clusters of data points, both showing a downward trend. The x-axis ranges from -0.5 to 2.5 and the y-axis, labeled as deviance residuals, ranges from -2.0 to 1.0", out.width="100%", fig.align="center"}
plot(
y = Logistic_Resids,
x = BinData$X,
xlab = 'X',
ylab = 'Deviance Residuals'
)
```
This plot is not very informative. A more insightful approach is **binned residual plots**.
**6. Binned Residual Plot**
We group residuals into **bins** based on predicted values.
```{r fig-binned-resid-plot, fig.cap="Binned Residual Plot", fig.alt="Scatter plot showing the relationship between BinData X on the x-axis and Logistic Residuals on the y-axis. Data points are scattered, showing variability in residuals relative to x values. The x-axis ranges from -0.5 to 2.5 and the y-axis ranges from -1.0 to 0.5", out.width="100%", fig.align="center"}
plot_bin <- function(Y,
X,
bins = 100,
return.DF = FALSE) {
Y_Name <- deparse(substitute(Y))
X_Name <- deparse(substitute(X))
Binned_Plot <- data.frame(Plot_Y = Y, Plot_X = X)
Binned_Plot$bin <-
cut(Binned_Plot$Plot_X, breaks = bins) %>% as.numeric
Binned_Plot_summary <- Binned_Plot %>%
group_by(bin) %>%
summarise(
Y_ave = mean(Plot_Y),
X_ave = mean(Plot_X),
Count = n()
) %>% as.data.frame
plot(
y = Binned_Plot_summary$Y_ave,
x = Binned_Plot_summary$X_ave,
ylab = Y_Name,
xlab = X_Name
)
if (return.DF)
return(Binned_Plot_summary)
}
plot_bin(Y = Logistic_Resids, X = BinData$X, bins = 100)
```
We also examine **predicted values vs residuals**:
```{r fig-pred-resdi-glm, fig.cap="Predicted Values versus Residuals", fig.alt='Scatter plot showing the relationship between logistic predictions on the x-axis and logistic residuals on the y-axis. Data points are scattered across the plot, indicating variability in residuals for different prediction values. The x-axis ranges from 0.6 to 0.9, and the y-axis ranges from -1.0 to 1.0.'}
Logistic_Predictions <- predict(Logistic_Model, type = "response")
plot_bin(Y = Logistic_Resids, X = Logistic_Predictions, bins = 100)
```
Finally, we compare **predicted probabilities** to actual outcomes:
```{r fig-pred-actual-glm, fig.cap="Predicted versus Actual Values", fig.alt="Scatter plot showing data points with a blue dashed trend line. The x-axis is labeled as logistic predictions ranging from 0.6 to 0.9. The y-axis is labeled as observed values from BinData Y, also ranging from 0.6 to 0.9. The plot illustrates the relationship between logistic predictions and observed data", out.width="100%", fig.align="center"}
NumBins <- 10
Binned_Data <- plot_bin(
Y = BinData$Y,
X = Logistic_Predictions,
bins = NumBins,
return.DF = TRUE
)
Binned_Data
abline(0, 1, lty = 2, col = 'blue')
```
**7. Model Goodness-of-Fit: Hosmer-Lemeshow Test**
The **Hosmer-Lemeshow test** evaluates whether the model fits the data well. The test statistic is:

$$
X^2_{HL} = \sum_{j=1}^{J} \frac{(y_j - m_j \hat{p}_j)^2}{m_j \hat{p}_j(1-\hat{p}_j)}
$$

where:
- $y_j$ is the observed number of successes in bin $j$.
- $m_j$ is the number of observations in bin $j$.
- $\hat{p}_j$ is the predicted probability in bin $j$.
Under $H_0$, we assume:
$$
X^2_{HL} \sim \chi^2_{J-1}
$$
```{r}
HL_BinVals <- (Binned_Data$Count * Binned_Data$Y_ave -
Binned_Data$Count * Binned_Data$X_ave) ^ 2 /
(Binned_Data$Count * Binned_Data$X_ave * (1 - Binned_Data$X_ave))
HLpval <- pchisq(q = sum(HL_BinVals),
df = NumBins - 1,
lower.tail = FALSE)
HLpval
```
**Conclusion:**
- Since $p$-value = 0.99, we **fail to reject** $H_0$.
- This indicates that **the model fits the data well**.
------------------------------------------------------------------------
## Probit Regression {#sec-probit-regression}
Probit regression is a type of [Generalized Linear Model](#generalized-linear-models) used for binary outcome variables. Unlike [logistic regression](#sec-logistic-regression), which uses the [logit function](#sec-logit-transformation), probit regression assumes that the probability of success is determined by an **underlying normally distributed latent variable**.
### Probit Model
Let $Y_i$ be a binary response variable:
$$
Y_i =
\begin{cases}
1, & \text{if success occurs} \\
0, & \text{otherwise}
\end{cases}
$$
We assume that $Y_i$ follows a **Bernoulli distribution**:
$$
Y_i \sim \text{Bernoulli}(p_i), \quad \text{where } p_i = P(Y_i = 1 | \mathbf{x_i})
$$
Instead of the [logit function](#sec-logit-transformation) in [logistic regression](#sec-logistic-regression):
$$
\text{logit}(p_i) = \log\left( \frac{p_i}{1 - p_i} \right) = \mathbf{x_i'\beta}
$$
Probit regression uses the **inverse standard normal CDF**:
$$
\Phi^{-1}(p_i) = \mathbf{x_i'\theta}
$$
where:
- $\Phi(\cdot)$ is the **CDF of the standard normal distribution**.
- $\mathbf{x_i}$ is the vector of predictors.
- $\theta$ is the vector of regression coefficients.
Thus, the probability of success is:
$$
p_i = P(Y_i = 1 | \mathbf{x_i}) = \Phi(\mathbf{x_i'\theta})
$$
where:
$$
\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-t^2/2} dt
$$
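In R, $\Phi(\cdot)$ and $\Phi^{-1}(\cdot)$ correspond to `pnorm()` and `qnorm()`. A quick sketch of the link:

```{r}
# The probit link and its inverse in base R
eta <- 0.8         # an illustrative value of the linear predictor
pnorm(eta)         # p = Phi(eta)
qnorm(pnorm(eta))  # Phi^{-1} recovers the linear predictor
```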
------------------------------------------------------------------------
### Application: Probit Regression
```{r}
# Load necessary library
library(ggplot2)
# Set seed for reproducibility
set.seed(123)
# Simulate data
n <- 1000
x1 <- rnorm(n)
x2 <- rnorm(n)
latent <- 0.5 * x1 + 0.7 * x2 + rnorm(n) # Linear combination
y <- ifelse(latent > 0, 1, 0) # Binary outcome
# Create dataframe
data <- data.frame(y, x1, x2)
# Fit Probit model
probit_model <-
glm(y ~ x1 + x2, family = binomial(link = "probit"), data = data)
summary(probit_model)
# Fit Logit model
logit_model <-
glm(y ~ x1 + x2, family = binomial(link = "logit"), data = data)
summary(logit_model)
# Compare Coefficients
coef_comparison <- data.frame(
Variable = names(coef(probit_model)),
Probit_Coef = coef(probit_model),
Logit_Coef = coef(logit_model),
Logit_Probit_Ratio = coef(logit_model) / coef(probit_model)
)
print(coef_comparison)
# Compute predicted probabilities
data$probit_pred <- predict(probit_model, type = "response")
data$logit_pred <- predict(logit_model, type = "response")
```
```{r fig-probit-logit, fig.cap="Comparison of Predicted Probabilities", fig.alt="Scatter plot titled Comparison of Predicted Probabilities showing a linear relationship between Probit predictions on the x-axis and Logit predictions on the y-axis. Data points are closely aligned along a red diagonal line, indicating a strong correlation. Axes range from 0 to 1", out.width="100%", fig.align="center"}
# Plot Probit vs Logit predictions
ggplot(data, aes(x = probit_pred, y = logit_pred)) +
geom_point(alpha = 0.5) +
geom_abline(slope = 1,
intercept = 0,
col = "red") +
labs(title = "Comparison of Predicted Probabilities",
x = "Probit Predictions", y = "Logit Predictions")
```
```{r}
# Classification Accuracy
threshold <- 0.5
data$probit_class <- ifelse(data$probit_pred > threshold, 1, 0)
data$logit_class <- ifelse(data$logit_pred > threshold, 1, 0)
probit_acc <- mean(data$probit_class == data$y)
logit_acc <- mean(data$logit_class == data$y)
print(paste("Probit Accuracy:", round(probit_acc, 4)))
print(paste("Logit Accuracy:", round(logit_acc, 4)))
```
------------------------------------------------------------------------
## Binomial Regression {#sec-binomial-regression}
In previous sections, we introduced **binomial regression models**, including both [Logistic Regression](#sec-logistic-regression) and [Probit Regression](#sec-probit-regression), and discussed their theoretical foundations. Now, we apply these methods to real-world data using the `esoph` dataset, which examines the relationship between esophageal cancer and potential risk factors such as **alcohol consumption** and **age group**.
### Dataset Overview
The `esoph` dataset consists of:
- **Successes (`ncases`)**: The number of individuals diagnosed with esophageal cancer.
- **Failures (`ncontrols`)**: The number of individuals in the control group (without cancer).
- **Predictors**:
- `agegp`: Age group of individuals.
- `alcgp`: Alcohol consumption category.
- `tobgp`: Tobacco consumption category.
Before fitting our models, let's inspect the dataset and visualize some key relationships.
```{r}
# Load and inspect the dataset
data("esoph")
head(esoph, n = 3)
```
```{r fig-cancer-data, fig.cap="Esophageal Cancer Data", fig.alt="Box plot titled Esophageal Cancer Data showing the proportion of cancer cases across four alcohol consumption groups: 0 to 39 g per day, 40 to 79 g, 80 to 119 g, and 120 or more grams. The y-axis represents the proportion of cancer cases from 0 to 1. Each box plot shows the median, quartiles, and outliers, indicating an increase in cancer cases with higher alcohol consumption", out.width="100%", fig.align="center"}
# Visualizing the proportion of cancer cases by alcohol consumption
plot(
esoph$ncases / (esoph$ncases + esoph$ncontrols) ~ esoph$alcgp,
ylab = "Proportion of Cancer Cases",
xlab = "Alcohol Consumption Group",
main = "Esophageal Cancer Data"
)
```
```{r}
# The grouping variables in `esoph` are ordered factors; convert them to
# plain (unordered) factors so that glm() uses treatment (dummy) contrasts
esoph$agegp <- factor(esoph$agegp, ordered = FALSE)
esoph$alcgp <- factor(esoph$alcgp, ordered = FALSE)
esoph$tobgp <- factor(esoph$tobgp, ordered = FALSE)
```
### Apply Logistic Model
We first fit a **logistic regression model**, where the response variable is the proportion of cancer cases (`ncases`) relative to total observations (`ncases + ncontrols`).
```{r}
# Logistic regression using alcohol consumption as a predictor
model <-
glm(cbind(ncases, ncontrols) ~ alcgp,
data = esoph,
family = binomial)
# Summary of the model
summary(model)
```
Interpretation
- The coefficients represent the **log-odds** of having esophageal cancer relative to the baseline alcohol consumption group.
- **P-values** indicate whether alcohol consumption levels significantly influence cancer risk.
Model Diagnostics
```{r}
# Convert coefficients to odds ratios
exp(coefficients(model))
# Model goodness-of-fit measures
deviance(model) / df.residual(model) # Closer to 1 suggests a better fit
model$aic # Lower AIC is preferable for model comparison
```
To improve our model, we include **age group (`agegp`)** as an additional predictor.
```{r}
# Logistic regression with alcohol consumption and age
better_model <- glm(
cbind(ncases, ncontrols) ~ agegp + alcgp,
data = esoph,
family = binomial
)
# Summary of the improved model
summary(better_model)
# Model evaluation
better_model$aic # Lower AIC is better
# Convert coefficients to odds ratios
# exp(coefficients(better_model))
data.frame(`Odds Ratios` = exp(coefficients(better_model)))
# Compare models using likelihood ratio test (Chi-square test)
pchisq(
q = model$deviance - better_model$deviance,
df = model$df.residual - better_model$df.residual,
lower.tail = FALSE
)
```
Key Takeaways
- **AIC Reduction**: A lower AIC suggests that adding age as a predictor improves the model.
- **Likelihood Ratio Test**: This test compares the two models and determines whether the improvement is statistically significant.
### Apply Probit Model
As discussed earlier, the **probit** model is an alternative to logistic regression, using a cumulative normal distribution instead of the logistic function.
```{r}
# Probit regression model
Prob_better_model <- glm(
cbind(ncases, ncontrols) ~ agegp + alcgp,
data = esoph,
family = binomial(link = probit)
)
# Summary of the probit model
summary(Prob_better_model)
```
Why Consider a Probit Model?
- Like logistic regression, probit regression estimates probabilities, but it assumes a **normal distribution of the latent variable**.
- While the **interpretation of coefficients differs**, model comparisons can still be made using **AIC**, as shown in the sketch below.
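A minimal sketch comparing the two fitted models:

```{r}
# Compare the logit and probit fits of the same model via AIC
AIC(better_model, Prob_better_model)
```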
## Poisson Regression {#sec-poisson-regression}
### The Poisson Distribution
Poisson regression is used for modeling **count data**, where the response variable represents the number of occurrences of an event within a fixed period, space, or other unit. The Poisson distribution is defined as:
$$
\begin{aligned} f(Y_i) &= \frac{\mu_i^{Y_i} \exp(-\mu_i)}{Y_i!}, \quad Y_i = 0,1,2, \dots \\ E(Y_i) &= \mu_i \\ \text{Var}(Y_i) &= \mu_i \end{aligned}
$$

where:
- $Y_i$ is the count variable.
- $\mu_i$ is the expected count for the $i$-th observation.
- The **mean and variance are equal** $E(Y_i) = \text{Var}(Y_i)$, making Poisson regression suitable when variance follows this property.
However, real-world count data often exhibit **overdispersion**, where the variance exceeds the mean. We will discuss remedies such as [Quasi-Poisson](#sec-quasi-poisson-regression) and [Negative Binomial Regression](#sec-negative-binomial-regression) later.
### Poisson Model
We model the expected count $\mu_i$ as a function of predictors $\mathbf{x_i}$ and parameters $\boldsymbol{\theta}$:
$$
\mu_i = f(\mathbf{x_i; \theta})
$$
### Link Function Choices
Since $\mu_i$ must be positive, we often use a **log-link function**:
$$
\log(\mu_i) = \mathbf{x_i' \theta}
$$
This ensures that the predicted counts are always non-negative. This is analogous to [logistic regression](#sec-logistic-regression), where we use the logit link for binary outcomes.
Rewriting:
$$
\mu_i = \exp(\mathbf{x_i' \theta})
$$

which ensures $\mu_i > 0$ for all parameter values.
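A quick sketch of the log link on simulated data (variable names are illustrative):

```{r}
# The log link guarantees positive fitted means (illustrative sketch)
set.seed(42)
x  <- rnorm(50)
mu <- exp(0.3 + 0.6 * x)  # inverse link: mu > 0 for any coefficients
y  <- rpois(50, lambda = mu)

fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)  # estimates of 0.3 and 0.6, on the log scale
```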
### Application: Poisson Regression
We apply Poisson regression to the **bioChemists** dataset (from the `pscl` package), which contains information on academic productivity in terms of published articles.
#### Dataset Overview
```{r}
library(tidyverse)
# Load dataset
data(bioChemists, package = "pscl")