## Application
### Nonlinear Estimation Using Gauss-Newton Algorithm
This section demonstrates nonlinear parameter estimation using the Gauss-Newton algorithm and compares results with `nls()`. The model is given by:
$$
y_i = \frac{\theta_0 + \theta_1 x_i}{1 + \theta_2 \exp(0.4 x_i)} + \epsilon_i
$$
where
- $i = 1, \dots, n$,
- $\theta_0$, $\theta_1$, $\theta_2$ are the unknown parameters,
- $\epsilon_i$ is the random error term.
1. Loading and Visualizing the Data
```{r}
library(dplyr)
library(ggplot2)
library(investr)
# Load the dataset
my_data <- read.delim("images/S21hw1pr4.txt", header = FALSE, sep = "") %>%
dplyr::rename(x = V1, y = V2)
```
```{r fig-observed-data-nonlinear, fig.cap="Observed Data", fig.alt="Scatter plot titled 'Observed Data' showing a distribution of blue data points. The X-axis is labeled 'X' and the Y-axis is labeled 'Y.' The data points form a downward curve initially, then level out and slightly rise, indicating a potential trend or pattern in the data.", out.width="100%", fig.align='center'}
# Plot data
ggplot(my_data, aes(x = x, y = y)) +
geom_point(color = "blue") +
labs(title = "Observed Data", x = "X", y = "Y") +
theme_minimal()
```
2. Deriving Starting Values for Parameters
Since nonlinear optimization is sensitive to starting values, we estimate reasonable initial values based on model interpretation.
Finding the Maximum $Y$ Value
```{r}
max(my_data$y)
my_data$x[which.max(my_data$y)]
```
- When $y = 2.6722$, the corresponding $x = 0.0094$.
- From the model equation: $\theta_0 + 0.0094 \theta_1 = 2.6722$
Estimating $\theta_2$ from the Mean $y$ Value
- At the $x$ where $y$ is closest to its mean, we assume the denominator of the model equals 2, so the equation simplifies to: $1 + \theta_2 \exp(0.4 x) = 2$
```{r}
# find mean y
mean(my_data$y)
# find y closest to its mean
my_data$y[which.min(abs(my_data$y - (mean(my_data$y))))]
# find x closest to the mean y
my_data$x[which.min(abs(my_data$y - (mean(my_data$y))))]
```
- Evaluating $\exp(0.4 x)$ at that $x$ gives $83.58967$, which yields the equation: $83.58967 \theta_2 = 1$
Finding the Value of $\theta_0$ and $\theta_1$
```{r}
# find value of x closest to 1
my_data$x[which.min(abs(my_data$x - 1))]
# find index of x closest to 1
match(my_data$x[which.min(abs(my_data$x - 1))], my_data$x)
# find y value
my_data$y[match(my_data$x[which.min(abs(my_data$x - 1))], my_data$x)]
```
- This provides another equation: $\theta_0 + \theta_1 \times 0.9895 - 2.164479 \theta_2 = 1.457$
Solving for $\theta_0, \theta_1, \theta_2$
```{r}
library(matlib)
# Define coefficient matrix (rows follow the three equations derived above;
# note the leading 1 for the theta_0 term in the first equation)
A <- matrix(
  c(1, 0.0094, 0,
    0, 0, 83.58967,
    1, 0.9895, -2.164479),
  nrow = 3,
  ncol = 3,
  byrow = TRUE
)
# Define constant vector
b <- c(2.6722, 1, 1.457)
# Display system of equations
showEqn(A, b)
# Solve for parameters
theta_start <- Solve(A, b, fractions = FALSE)
theta_start
```
3. Implementing the Gauss-Newton Algorithm
Using these estimates, we manually implement the **Gauss-Newton optimization**.
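Each Gauss-Newton step linearizes the model around the current estimates and solves a least-squares problem for the correction:

$$
\boldsymbol{\theta}^{(i+1)} = \boldsymbol{\theta}^{(i)} + \left( \mathbf{F}^\top \mathbf{F} \right)^{-1} \mathbf{F}^\top \left( \mathbf{y} - \mathbf{f}(\boldsymbol{\theta}^{(i)}) \right)
$$

where $\mathbf{F}$ is the $n \times 3$ Jacobian of the model function (one column per parameter) evaluated at $\boldsymbol{\theta}^{(i)}$. In the loop below, the iteration stops when the relative change in the residual sum of squares falls below 0.001.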
**Defining the Model and Its Derivatives**
```{r}
# Starting values
theta_0_strt <- as.numeric(gsub(".*=\\s*", "", theta_start[1]))
theta_1_strt <- as.numeric(gsub(".*=\\s*", "", theta_start[2]))
theta_2_strt <- as.numeric(gsub(".*=\\s*", "", theta_start[3]))
# Model function
mod_4 <- function(theta_0, theta_1, theta_2, x) {
(theta_0 + theta_1 * x) / (1 + theta_2 * exp(0.4 * x))
}
# Define function expression
f_4 = expression((theta_0 + theta_1 * x) / (1 + theta_2 * exp(0.4 * x)))
# First derivatives
df_4.d_theta_0 <- D(f_4, 'theta_0')
df_4.d_theta_1 <- D(f_4, 'theta_1')
df_4.d_theta_2 <- D(f_4, 'theta_2')
```
Iterative Gauss-Newton Optimization
```{r}
# Initialize
theta_vec <- matrix(c(theta_0_strt, theta_1_strt, theta_2_strt))
delta <- matrix(NA, nrow = 3, ncol = 1)
i <- 1
# Evaluate function at initial estimates
f_theta <- as.matrix(eval(f_4, list(
x = my_data$x,
theta_0 = theta_vec[1, 1],
theta_1 = theta_vec[2, 1],
theta_2 = theta_vec[3, 1]
)))
repeat {
# Compute Jacobian matrix
F_theta_0 <- as.matrix(cbind(
eval(df_4.d_theta_0, list(
x = my_data$x,
theta_0 = theta_vec[1, i],
theta_1 = theta_vec[2, i],
theta_2 = theta_vec[3, i]
)),
eval(df_4.d_theta_1, list(
x = my_data$x,
theta_0 = theta_vec[1, i],
theta_1 = theta_vec[2, i],
theta_2 = theta_vec[3, i]
)),
eval(df_4.d_theta_2, list(
x = my_data$x,
theta_0 = theta_vec[1, i],
theta_1 = theta_vec[2, i],
theta_2 = theta_vec[3, i]
))
))
# Compute parameter updates
delta[, i] <- solve(t(F_theta_0) %*% F_theta_0) %*% t(F_theta_0) %*% (my_data$y - f_theta[, i])
# Update parameter estimates
theta_vec <- cbind(theta_vec, theta_vec[, i] + delta[, i])
# Increment iteration counter
i <- i + 1
# Compute new function values
f_theta <- cbind(f_theta, as.matrix(eval(f_4, list(
x = my_data$x,
theta_0 = theta_vec[1, i],
theta_1 = theta_vec[2, i],
theta_2 = theta_vec[3, i]
))))
delta <- cbind(delta, matrix(NA, nrow = 3, ncol = 1))
# Convergence criteria based on SSE
if (abs(sum((my_data$y - f_theta[, i])^2) - sum((my_data$y - f_theta[, i - 1])^2)) /
sum((my_data$y - f_theta[, i - 1])^2) < 0.001) {
break
}
}
# Final parameter estimates
theta_vec[, ncol(theta_vec)]
```
4. Checking Convergence and Variance
```{r}
# Final objective function value (SSE)
sum((my_data$y - f_theta[, i])^2)
sigma2 <- 1 / (nrow(my_data) - 3) *
(t(my_data$y - f_theta[, ncol(f_theta)]) %*%
(my_data$y - f_theta[, ncol(f_theta)])) # p = 3
# Asymptotic variance-covariance matrix
as.numeric(sigma2) * as.matrix(solve(crossprod(F_theta_0)))
```
5. Validating with `nls()`
```{r}
nlin_4 <- nls(
y ~ mod_4(theta_0, theta_1, theta_2, x),
start = list(
theta_0 = as.numeric(gsub(".*=\\s*", "", theta_start[1])),
theta_1 = as.numeric(gsub(".*=\\s*", "", theta_start[2])),
theta_2 = as.numeric(gsub(".*=\\s*", "", theta_start[3]))
),
data = my_data
)
summary(nlin_4)
```
### Logistic Growth Model
A classic logistic growth model follows the equation:
$$
P = \frac{K}{1 + \exp(P_0 + r t)} + \epsilon
$$
where:
- $P$ = population at time $t$
- $K$ = carrying capacity (maximum population)
- $r$ = population growth rate
- $P_0$ = initial population log-ratio
However, R's built-in `SSlogis` function uses a slightly different parameterization:
$$
P = \frac{asym}{1 + \exp\left(\frac{xmid - t}{scal}\right)}
$$
where:
- $asym$ = carrying capacity ($K$)
- $xmid$ = the $x$-value at the inflection point of the curve
- $scal$ = scaling parameter
This gives the parameter relationships:
- $K = asym$
- $r = -1 / scal$
- $P_0 = -r \cdot xmid$
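These relationships follow from matching the exponents of the two parameterizations:

$$
\frac{xmid - t}{scal} = P_0 + r t
\quad\Rightarrow\quad
r = -\frac{1}{scal},
\qquad
P_0 = \frac{xmid}{scal} = -r \cdot xmid
$$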
```{r}
# Simulated time-series data
time <- c(1, 2, 3, 5, 10, 15, 20, 25, 30, 35)
population <- c(2.8, 4.2, 3.5, 6.3, 15.7, 21.3, 23.7, 25.1, 25.8, 25.9)
```
```{r fig-logistic-growth-example, fig.cap='Logistic Growth Model', fig.alt="Scatter plot titled Logistic Growth Model showing population growth over time. The x-axis shows time and the y-axis shows population. Data points indicate an initial slow increase in population, followed by a rapid rise and then a plateau, illustrating logistic growth", out.width="100%", fig.align='center'}
# Plot data points
plot(time,
population,
las = 1,
pch = 16,
main = "Logistic Growth Model")
```
```{r}
# Fit the logistic growth model using programmed starting values
logisticModelSS <- nls(population ~ SSlogis(time, Asym, xmid, scal))
# Model summary
summary(logisticModelSS)
# Extract parameter estimates
coef(logisticModelSS)
```
To fit the model using an alternative parameterization ($K, r, P_0$), we convert the estimated coefficients:
```{r}
# Convert parameter estimates to alternative logistic model parameters
Ks <- as.numeric(coef(logisticModelSS)[1]) # Carrying capacity (K)
rs <- -1 / as.numeric(coef(logisticModelSS)[3]) # Growth rate (r)
Pos <- -rs * as.numeric(coef(logisticModelSS)[2]) # P_0
# Fit the logistic model with the alternative parameterization
logisticModel <- nls(
population ~ K / (1 + exp(Po + r * time)),
start = list(Po = Pos, r = rs, K = Ks)
)
# Model summary
summary(logisticModel)
```
Visualizing the Logistic Model Fit
```{r fig-logistic-model-fit, fig.cap="Logistic Growth Model Fit", fig.alt="Chart titled Logistic Growth Model Fit showing population growth over time. The x-axis shows time and the y-axis shows population. Black data points are plotted with a red curve that begins low, rises steeply, and levels off to illustrate logistic growth", out.width="100%", fig.align="center"}
# Plot original data
plot(time,
population,
las = 1,
pch = 16,
main = "Logistic Growth Model Fit")
# Overlay the fitted logistic curve
lines(time,
predict(logisticModel),
col = "red",
lwd = 2)
```
### Nonlinear Plateau Model
This example is based on [@Schabenberger_2001] and demonstrates the use of a **plateau model** to estimate the relationship between soil nitrate ($NO_3$) concentration and relative yield percent (RYP) at two different depths (30 cm and 60 cm).
```{r fig-nonlinear-plateau, fig.cap='Nonlinear Plateau Plot', fig.alt='Scatter plot showing the relationship between soil NO3 concentration and relative yield percent. Data points are color-coded by depth: red for 30 cm and blue for 60 cm. The x-axis represents soil NO3 concentration, ranging from 0 to 100, and the y-axis represents relative yield percent, ranging from 30 to 110. Most data points cluster at higher yield percentages with lower NO3 concentrations.', out.width="100%", fig.align='center'}
# Load data
dat <- read.table("images/dat.txt", header = TRUE)
# Plot NO3 concentration vs. relative yield percent, colored by depth
library(ggplot2)
dat.plot <- ggplot(dat) +
geom_point(aes(x = no3, y = ryp, color = as.factor(depth))) +
labs(color = 'Depth (cm)') +
xlab('Soil NO3 Concentration') +
ylab('Relative Yield Percent') +
theme_minimal()
# Display plot
dat.plot
```
The **suggested nonlinear plateau model** is given by:
$$
E(Y_{ij}) = (\beta_{0j} + \beta_{1j}N_{ij})I_{N_{ij}\le \alpha_j} + (\beta_{0j} + \beta_{1j}\alpha_j)I_{N_{ij} > \alpha_j}
$$
where:
- $N_{ij}$ represents the soil nitrate ($NO_3$) concentration for observation $i$ at depth $j$.
- $i$ indexes individual observations.
- $j = 1, 2$ corresponds to depths **30 cm** and **60 cm**.
This model assumes a **linear increase** up to a threshold ($\alpha_j$), beyond which the response **levels off (plateaus).**
Defining the Plateau Model as a Function
```{r}
# Define the nonlinear plateau model function
nonlinModel <- function(predictor, b0, b1, alpha) {
ifelse(predictor <= alpha,
b0 + b1 * predictor, # Linear growth below threshold
b0 + b1 * alpha) # Plateau beyond threshold
}
```
Creating a Self-Starting Function for `nls`
Since the model is **piecewise linear**, we can estimate starting values using:
1. A **linear regression** on the **first half** of sorted predictor values to estimate $b_0$ and $b_1$.
2. The **last predictor value** used in the regression as the plateau threshold ($\alpha$)
```{r}
# Define initialization function for self-starting plateau model
nonlinModelInit <- function(mCall, LHS, data) {
# Sort data by increasing predictor value
xy <- sortedXyData(mCall[['predictor']], LHS, data)
n <- nrow(xy)
# Fit a simple linear model using the first half of the sorted data
lmFit <- lm(xy[1:(n / 2), 'y'] ~ xy[1:(n / 2), 'x'])
# Extract initial estimates
b0 <- coef(lmFit)[1] # Intercept
b1 <- coef(lmFit)[2] # Slope
alpha <- xy[(n / 2), 'x'] # Last x-value in the fitted linear range
# Return initial parameter estimates
value <- c(b0, b1, alpha)
names(value) <- mCall[c('b0', 'b1', 'alpha')]
value
}
```
Combining Model and Self-Start Function
```{r}
# Define a self-starting nonlinear model for nls
SS_nonlinModel <- selfStart(nonlinModel,
nonlinModelInit,
c('b0', 'b1', 'alpha'))
```
The `nls` function is used to estimate parameters separately for **each soil depth (30 cm and 60 cm).**
```{r}
# Fit the model for depth = 30 cm
sep30_nls <- nls(ryp ~ SS_nonlinModel(predictor = no3, b0, b1, alpha),
data = dat[dat$depth == 30,])
# Fit the model for depth = 60 cm
sep60_nls <- nls(ryp ~ SS_nonlinModel(predictor = no3, b0, b1, alpha),
data = dat[dat$depth == 60,])
```
We generate separate plots for **30 cm** and **60 cm** depths, showing both **confidence** and **prediction intervals.**
```{r fig-yield-depth-comparison, fig.cap="Fitted Curves for Relative Yield Response by Soil NO3 Concentration at 30 cm and 60 cm Depths", fig.alt="Two XY charts comparing relative yield percent to soil NO3 concentration at different depths. The left chart, titled Results at 30 cm Depth, shows data points clustered at higher yield percentages with a blue shaded confidence interval. The right chart, titled Results at 60 cm Depth, shows a similar pattern with a red shaded confidence interval. Both charts show a positive relationship between soil NO3 and relative yield percent", out.width="100%", fig.align="center"}
# Set plotting layout
par(mfrow = c(1, 2))
# Plot model fit for 30 cm depth
investr::plotFit(
sep30_nls,
interval = "both",
pch = 19,
shade = TRUE,
col.conf = "skyblue4",
col.pred = "lightskyblue2",
data = dat[dat$depth == 30,],
main = "Results at 30 cm Depth",
ylab = "Relative Yield Percent",
xlab = "Soil NO3 Concentration",
xlim = c(0, 120)
)
# Plot model fit for 60 cm depth
investr::plotFit(
sep60_nls,
interval = "both",
pch = 19,
shade = TRUE,
col.conf = "lightpink4",
col.pred = "lightpink2",
data = dat[dat$depth == 60,],
main = "Results at 60 cm Depth",
ylab = "Relative Yield Percent",
xlab = "Soil NO3 Concentration",
xlim = c(0, 120)
)
```
```{r}
summary(sep30_nls)
summary(sep60_nls)
```
**Modeling Soil Depths Together and Comparing Models**
Instead of fitting separate models for different soil depths, we first fit a **combined model** where all observations share a **common slope, intercept, and plateau**. We then test whether modeling the two depths separately provides a significantly better fit.
1. **Fitting a Reduced (Combined) Model**
The reduced model assumes that **all soil depths follow the same nonlinear relationship**.
```{r reduce-model}
# Fit the combined model (common parameters across all depths)
red_nls <- nls(
ryp ~ SS_nonlinModel(predictor = no3, b0, b1, alpha),
data = dat
)
# Display model summary
summary(red_nls)
```
```{r fig-res-combined-mod, fig.cap="Results for Combined Model", fig.alt="Scatter plot titled Results for Combined Model showing the relationship between Soil NO3 concentration on the x-axis and relative yield percent on the y-axis. Data points are scattered with a trend line that increases rapidly and then levels off. A shaded band around the line represents variability or confidence intervals", out.width="100%", fig.align='center'}
# Visualizing the combined model fit
par(mfrow = c(1, 1))
plotFit(
red_nls,
interval = "both",
pch = 19,
shade = TRUE,
col.conf = "lightblue4",
col.pred = "lightblue2",
data = dat,
main = 'Results for Combined Model',
ylab = 'Relative Yield Percent',
xlab = 'Soil NO3 Concentration'
)
```
2. **Examining Residuals for the Combined Model**
Checking residuals helps diagnose potential **lack of fit**.
```{r}
library(nlstools)
# Residual diagnostics using nlstools
resid <- nlsResiduals(red_nls)
```
```{r fig-resid-nonlinear, fig.cap="Residual Plot Nonlinear Example", fig.alt="Four panel figure showing residual analysis. Top left shows residuals versus fitted values labeled Residuals. Top right shows standardized residuals versus fitted values labeled Standardized Residuals. Bottom left shows autocorrelation of residuals i versus residuals i+1 labeled Autocorrelation. Bottom right shows a normal QQ plot comparing sample quantiles to theoretical quantiles labeled Normal QQ Plot of Standardized Residuals. Each plot includes a horizontal reference line", out.width="100%", fig.align="center"}
# Plot residuals
plot(resid)
```
If there is a **pattern** in the residuals (e.g., systematic deviations based on soil depth), this suggests that a **separate model for each depth** may be necessary.
3. **Testing Whether Depths Require Separate Models**
To formally test whether soil depth significantly affects the model parameters, we introduce a **parameterization where depth-specific parameters are increments from a baseline model** (30 cm depth):
$$
\begin{aligned}
\beta_{02} &= \beta_{01} + d_0 \\
\beta_{12} &= \beta_{11} + d_1 \\
\alpha_{2} &= \alpha_{1} + d_a
\end{aligned}
$$
where:
- $\beta_{01}, \beta_{11}, \alpha_1$ are parameters for **30 cm depth**.
- $d_0, d_1, d_a$ represent **depth-specific differences** for **60 cm depth**.
- If $d_0, d_1, d_a$ are significantly **different from 0**, the **two depths should be modeled separately**.
4. **Defining the Full (Depth-Specific) Model**
```{r full-model, eval=TRUE}
nonlinModelF <- function(predictor, soildep, b01, b11, a1, d0, d1, da) {
# Define parameters for 60 cm depth as increments from 30 cm parameters
b02 <- b01 + d0
b12 <- b11 + d1
a2 <- a1 + da
# Compute model output for 30 cm depth
y1 <- ifelse(
predictor <= a1,
b01 + b11 * predictor,
b01 + b11 * a1
)
# Compute model output for 60 cm depth
y2 <- ifelse(
predictor <= a2,
b02 + b12 * predictor,
b02 + b12 * a2
)
# Assign correct model output based on depth
y <- y1 * (soildep == 30) + y2 * (soildep == 60)
return(y)
}
```
5. **Fitting the Full (Depth-Specific) Model**
The starting values are taken from the **separately fitted models** for each depth.
```{r full-model-fit}
Soil_full <- nls(
ryp ~ nonlinModelF(
predictor = no3,
soildep = depth,
b01,
b11,
a1,
d0,
d1,
da
),
data = dat,
start = list(
b01 = 15.2, # Intercept for 30 cm depth
b11 = 3.58, # Slope for 30 cm depth
a1 = 23.13, # Plateau cutoff for 30 cm depth
d0 = -9.74, # Intercept difference (60 cm - 30 cm)
d1 = 2.11, # Slope difference (60 cm - 30 cm)
da = -6.85 # Plateau cutoff difference (60 cm - 30 cm)
)
)
# Display model summary
summary(Soil_full)
```
6. **Model Comparison: Does Depth Matter?**
- If $d_0, d_1, d_a$ **are significantly different from 0**, the depths should be modeled separately.
- The **p-values** for these parameters indicate whether depth-specific modeling is necessary.
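Because the reduced model is nested within the full model, the two fits can also be compared directly. A minimal sketch, assuming `red_nls` and `Soil_full` from the chunks above are in the workspace, uses the extra-sum-of-squares F-test provided by `anova()` for nested `nls` fits:

```{r}
# Extra-sum-of-squares F-test: reduced (common parameters) vs.
# full (depth-specific parameters) model
anova(red_nls, Soil_full)
```

A small p-value indicates that allowing depth-specific parameters significantly improves the fit, i.e., the two depths should be modeled separately.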
------------------------------------------------------------------------