# Prerequisites
This chapter reviews essential mathematical and computational tools required for effective statistical analysis. Beginning with [Matrix Theory] and core concepts in linear algebra, we then explore foundational [Probability Theory], including random variables, distributions, and expectations. The chapter also revisits important mathematical skills, such as functions, derivatives, and logarithms, before moving into practical elements like data import/export and data manipulation using R. These topics ensure that readers are well-equipped to tackle the more advanced statistical methods presented in later chapters.
```{r setup, include=FALSE}
# knitr::opts_chunk$set(warning = FALSE, message = FALSE,cache = TRUE)
knitr::opts_chunk$set(out.width = "100%")
```
If you are confident in your understanding of these topics, you can proceed directly to the [Descriptive Statistics] section to begin exploring applied data analysis.
Figure \@ref(fig:include-image) offers a light-hearted take on how the same underlying statistical work gets rebranded across disciplines.
```{r include-image, fig.cap="Meme about AI and ML", out.width="70%", fig.alt="A three-panel comic illustration depicting the evolution of a crack in a wall as a metaphor for data analysis. In the first panel, a person observes a crack labeled 'Statistics.' In the second panel, the person frames the crack, labeled 'Machine Learning.' In the third panel, the framed crack is displayed to an audience, labeled 'AI,' with the person smiling confidently. Below, a larger image shows the person with a satisfied expression, standing next to the bold text 'AI.' The comic humorously illustrates the progression from basic observation to advanced presentation in data science."}
knitr::include_graphics("images/meme_01.png")
```
<!-- {width="90%"} -->
## Matrix Theory
Matrix $A$ represents the original matrix. It's a 2x2 matrix with elements $a_{ij}$, where $i$ represents the row and $j$ represents the column.
$$
A =
\begin{bmatrix}
a_{11} & a_{12} \\
a_{21} & a_{22}
\end{bmatrix}
$$
$A'$ is the transpose of $A$. The transpose of a matrix flips its rows and columns.
$$
A' =
\begin{bmatrix}
a_{11} & a_{21} \\
a_{12} & a_{22}
\end{bmatrix}
$$
Fundamental properties and rules of matrices, essential for understanding operations in linear algebra:
$$
\begin{aligned}
\mathbf{(ABC)'} & = \mathbf{C'B'A'} \quad &\text{(Transpose reverses order in a product)} \\
\mathbf{A(B+C)} & = \mathbf{AB + AC} \quad &\text{(Distributive property)} \\
\mathbf{AB} & \neq \mathbf{BA} \quad &\text{(Multiplication is not commutative)} \\
\mathbf{(A')'} & = \mathbf{A} \quad &\text{(Double transpose is the original matrix)} \\
\mathbf{(A+B)'} & = \mathbf{A' + B'} \quad &\text{(Transpose of a sum is the sum of transposes)} \\
\mathbf{(AB)'} & = \mathbf{B'A'} \quad &\text{(Transpose reverses order in a product)} \\
\mathbf{(AB)^{-1}} & = \mathbf{B^{-1}A^{-1}} \quad &\text{(Inverse reverses order in a product)} \\
\mathbf{A+B} & = \mathbf{B + A} \quad &\text{(Addition is commutative)} \\
\mathbf{AA^{-1}} & = \mathbf{I} \quad &\text{(Matrix times its inverse is identity)}
\end{aligned}
$$
These properties are critical in solving systems of equations, optimizing models, and performing data transformations.
If a matrix $\mathbf{A}$ has an inverse, it is called **invertible**. If $\mathbf{A}$ does not have an inverse, it is referred to as **singular**.
The product of two matrices $\mathbf{A}$ and $\mathbf{B}$ is computed as:
$$
\begin{aligned}
\mathbf{AB} &=
\begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
a_{21} & a_{22} & a_{23}
\end{bmatrix}
\begin{bmatrix}
b_{11} & b_{12} & b_{13} \\
b_{21} & b_{22} & b_{23} \\
b_{31} & b_{32} & b_{33}
\end{bmatrix} \\
&=
\begin{bmatrix}
a_{11}b_{11}+a_{12}b_{21}+a_{13}b_{31} & \sum_{i=1}^{3}a_{1i}b_{i2} & \sum_{i=1}^{3}a_{1i}b_{i3} \\
\sum_{i=1}^{3}a_{2i}b_{i1} & \sum_{i=1}^{3}a_{2i}b_{i2} & \sum_{i=1}^{3}a_{2i}b_{i3}
\end{bmatrix}
\end{aligned}
$$
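In R, the `%*%` operator carries out exactly this row-by-column computation. A minimal sketch with small arbitrary matrices:
```{r}
A <- matrix(1:6, nrow = 2, byrow = TRUE)  # 2 x 3 matrix
B <- matrix(1:9, nrow = 3, byrow = TRUE)  # 3 x 3 matrix
A %*% B                                   # resulting 2 x 3 product
```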
**Quadratic Form**
Let $\mathbf{a}$ be a $3 \times 1$ vector and $\mathbf{B}$ a $3 \times 3$ matrix. The quadratic form is given by:
$$
\mathbf{a'Ba} = \sum_{i=1}^{3}\sum_{j=1}^{3}a_i b_{ij} a_{j}
$$
**Length of a Vector**
The **length** (or 2-norm) of a vector $\mathbf{a}$, denoted as $||\mathbf{a}||$, is defined as the square root of the inner product of the vector with itself:
$$
||\mathbf{a}|| = \sqrt{\mathbf{a'a}}
$$
### Rank of a Matrix
The **rank** of a matrix refers to:
- The dimension of the space spanned by its columns (or rows).
- The number of linearly independent columns or rows.
For an $n \times k$ matrix $\mathbf{A}$ and a $k \times k$ matrix $\mathbf{B}$, the following properties hold:
- $\text{rank}(\mathbf{A}) \leq \min(n, k)$
- $\text{rank}(\mathbf{A}) = \text{rank}(\mathbf{A'}) = \text{rank}(\mathbf{A'A}) = \text{rank}(\mathbf{AA'})$
- $\text{rank}(\mathbf{AB}) \leq \min(\text{rank}(\mathbf{A}), \text{rank}(\mathbf{B}))$
- $\mathbf{B}$ is invertible (non-singular) if and only if $\text{rank}(\mathbf{B}) = k$.
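In R, the rank can be read off a QR decomposition; the matrix below is an arbitrary example with one linearly dependent row:
```{r}
A <- matrix(c(1, 2, 3,
              2, 4, 6,
              1, 0, 1), nrow = 3, byrow = TRUE)  # row 2 = 2 * row 1
qr(A)$rank                                       # rank 2 < 3, so A is singular
```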
### Inverse of a Matrix
In scalar algebra, if $a = 0$, then $1/a$ does not exist.
In matrix algebra, a matrix is invertible if it is **non-singular**, meaning it has a non-zero determinant and its inverse exists. A square matrix $\mathbf{A}$ is invertible if there exists another square matrix $\mathbf{B}$ such that:
$$
\mathbf{AB} = \mathbf{I} \quad \text{(Identity Matrix)}.
$$
In this case, $\mathbf{A}^{-1} = \mathbf{B}$.
For a $2 \times 2$ matrix:
$$
\mathbf{A} =
\begin{bmatrix}
a & b \\
c & d
\end{bmatrix}
$$
The inverse is:
$$
\mathbf{A}^{-1} =
\frac{1}{ad-bc}
\begin{bmatrix}
d & -b \\
-c & a
\end{bmatrix}
$$
This inverse exists only if $ad - bc \neq 0$, where $ad - bc$ is the determinant of $\mathbf{A}$.
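A quick check in R with an arbitrary invertible $2 \times 2$ matrix:
```{r}
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
det(A)        # ad - bc = 5, non-zero, so the inverse exists
A_inv <- solve(A)
A %*% A_inv   # returns the identity matrix (up to rounding)
```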
For a partitioned block matrix:
$$
\begin{bmatrix}
A & B \\
C & D
\end{bmatrix}^{-1}
=
\begin{bmatrix}
\mathbf{(A-BD^{-1}C)^{-1}} & \mathbf{-(A-BD^{-1}C)^{-1}BD^{-1}} \\
\mathbf{-D^{-1}C(A-BD^{-1}C)^{-1}} & \mathbf{D^{-1}+D^{-1}C(A-BD^{-1}C)^{-1}BD^{-1}}
\end{bmatrix}
$$
This formula assumes that $\mathbf{D}$ and $\mathbf{A - BD^{-1}C}$ are invertible.
Properties of the Inverse for Non-Singular Matrices
1. $\mathbf{(A^{-1})^{-1}} = \mathbf{A}$
2. For a non-zero scalar $b$, $\mathbf{(bA)^{-1} = b^{-1}A^{-1}}$
3. For a non-singular matrix $\mathbf{B}$ conformable with $\mathbf{A}$, $\mathbf{(BA)^{-1} = A^{-1}B^{-1}}$ (the inverse reverses the order of the product).
4. $\mathbf{(A^{-1})' = (A')^{-1}}$ (the transpose of the inverse equals the inverse of the transpose).
5. Never notate $\mathbf{1/A}$; use $\mathbf{A^{-1}}$ instead.
**Notes**:
- The determinant of a matrix determines whether it is invertible. For square matrices, a determinant of $0$ means the matrix is **singular** and has no inverse.
- Always verify the conditions for invertibility, particularly when dealing with partitioned or block matrices.
### Definiteness of a Matrix
A symmetric square $k \times k$ matrix $\mathbf{A}$ is classified based on the following conditions:
- **Positive Semi-Definite (PSD)**: $\mathbf{A}$ is PSD if, for any non-zero $k \times 1$ vector $\mathbf{x}$: $$
\mathbf{x'Ax \geq 0}.
$$
- **Negative Semi-Definite (NSD)**: $\mathbf{A}$ is NSD if, for any non-zero $k \times 1$ vector $\mathbf{x}$: $$
\mathbf{x'Ax \leq 0}.
$$
- **Indefinite**: $\mathbf{A}$ is indefinite if it is neither PSD nor NSD.
The **identity matrix** is always positive definite (PD).
Example
Let $\mathbf{x} = (x_1, x_2)'$, and consider a $2 \times 2$ identity matrix $\mathbf{I}$:
$$
\begin{aligned}
\mathbf{x'Ix}
&= (x_1, x_2)
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix} \\
&=
(x_1, x_2)
\begin{bmatrix}
x_1 \\
x_2
\end{bmatrix} \\
&=
x_1^2 + x_2^2 \geq 0.
\end{aligned}
$$
Thus, $\mathbf{I}$ is PD because $\mathbf{x'Ix} > 0$ for all non-zero $\mathbf{x}$.
**Properties of Definiteness**
1. Any variance-covariance matrix is PSD.
2. A matrix $\mathbf{A}$ is PSD if and only if there exists a matrix $\mathbf{B}$ such that: $$
\mathbf{A = B'B}.
$$
3. If $\mathbf{A}$ is PSD, then $\mathbf{B'AB}$ is also PSD for any conformable matrix $\mathbf{B}$.
4. If $\mathbf{A}$ and $\mathbf{C}$ are non-singular, then $\mathbf{A - C}$ is PSD if and only if $\mathbf{C^{-1} - A^{-1}}$ is PSD.
5. If $\mathbf{A}$ is PD (or ND), then $\mathbf{A^{-1}}$ is also PD (or ND).
**Notes**
- An **indefinite** matrix $\mathbf{A}$ is neither PSD nor NSD. This concept does not have a direct counterpart in scalar algebra.
- If a square matrix is PSD and invertible, then it is PD.
Examples of Definiteness
1. **Invertible / Indefinite**: $$
\begin{bmatrix}
-1 & 0 \\
0 & 10
\end{bmatrix}
$$
2. **Non-Invertible / Indefinite**: $$
\begin{bmatrix}
0 & 1 \\
0 & 0
\end{bmatrix}
$$
3. **Invertible / PSD**: $$
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
$$
4. **Non-Invertible / PSD**: $$
\begin{bmatrix}
0 & 0 \\
0 & 1
\end{bmatrix}
$$
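In R, definiteness can be checked by inspecting the eigenvalues of a symmetric matrix; the sketch below reuses examples 1, 3, and 4 from the list above:
```{r}
eigen(matrix(c(-1, 0, 0, 10), 2, 2))$values  # mixed signs: indefinite
eigen(diag(2))$values                        # all positive: positive definite
eigen(matrix(c(0, 0, 0, 1), 2, 2))$values    # non-negative with a zero: PSD, not PD
```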
### Matrix Calculus
Consider a scalar function $y = f(x_1, x_2, \dots, x_k) = f(x)$, where $x$ is a $1 \times k$ row vector.
#### Gradient (First-Order Derivative)
The gradient, or the first-order derivative of $f(x)$ with respect to the vector $x$, is given by:
$$
\frac{\partial f(x)}{\partial x} =
\begin{bmatrix}
\frac{\partial f(x)}{\partial x_1} \\
\frac{\partial f(x)}{\partial x_2} \\
\vdots \\
\frac{\partial f(x)}{\partial x_k}
\end{bmatrix}
$$
#### Hessian (Second-Order Derivative)
The Hessian, or the second-order derivative of $f(x)$ with respect to $x$, is a symmetric matrix defined as:
$$
\frac{\partial^2 f(x)}{\partial x \partial x'} =
\begin{bmatrix}
\frac{\partial^2 f(x)}{\partial x_1^2} & \frac{\partial^2 f(x)}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_k} \\
\frac{\partial^2 f(x)}{\partial x_2 \partial x_1} & \frac{\partial^2 f(x)}{\partial x_2^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_2 \partial x_k} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f(x)}{\partial x_k \partial x_1} & \frac{\partial^2 f(x)}{\partial x_k \partial x_2} & \cdots & \frac{\partial^2 f(x)}{\partial x_k^2}
\end{bmatrix}
$$
#### Derivative of a Scalar Function with Respect to a Matrix
Let $f(\mathbf{X})$ be a scalar function, where $\mathbf{X}$ is an $n \times p$ matrix. The derivative is:
$$
\frac{\partial f(\mathbf{X})}{\partial \mathbf{X}} =
\begin{bmatrix}
\frac{\partial f(\mathbf{X})}{\partial x_{11}} & \cdots & \frac{\partial f(\mathbf{X})}{\partial x_{1p}} \\
\vdots & \ddots & \vdots \\
\frac{\partial f(\mathbf{X})}{\partial x_{n1}} & \cdots & \frac{\partial f(\mathbf{X})}{\partial x_{np}}
\end{bmatrix}
$$
#### Common Matrix Derivatives
1. If $\mathbf{a}$ is a vector and $\mathbf{A}$ is a matrix independent of $\mathbf{y}$:
- $\frac{\partial \mathbf{a'y}}{\partial \mathbf{y}} = \mathbf{a}$
- $\frac{\partial \mathbf{y'y}}{\partial \mathbf{y}} = 2\mathbf{y}$
- $\frac{\partial \mathbf{y'Ay}}{\partial \mathbf{y}} = (\mathbf{A} + \mathbf{A'})\mathbf{y}$
2. If $\mathbf{X}$ is symmetric:
- $\frac{\partial |\mathbf{X}|}{\partial x_{ij}} = \begin{cases} X_{ii}, & i = j \\ 2X_{ij}, & i \neq j \end{cases}$ where $X_{ij}$ is the $(i,j)$-th cofactor of $\mathbf{X}$.
3. If $\mathbf{X}$ is symmetric and $\mathbf{A}$ is a matrix independent of $\mathbf{X}$:
- $\frac{\partial \text{tr}(\mathbf{XA})}{\partial \mathbf{X}} = \mathbf{A} + \mathbf{A'} - \text{diag}(\mathbf{A})$.
4. If $\mathbf{X}$ is symmetric, let $\mathbf{J}_{ij}$ be a matrix with 1 at the $(i,j)$-th position and 0 elsewhere:
- $\frac{\partial \mathbf{X}^{-1}}{\partial x_{ij}} = \begin{cases} -\mathbf{X}^{-1}\mathbf{J}_{ii}\mathbf{X}^{-1}, & i = j \\ -\mathbf{X}^{-1}(\mathbf{J}_{ij} + \mathbf{J}_{ji})\mathbf{X}^{-1}, & i \neq j \end{cases}.$
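As a sanity check, the identity $\frac{\partial \mathbf{y'Ay}}{\partial \mathbf{y}} = (\mathbf{A} + \mathbf{A'})\mathbf{y}$ from the list above can be verified numerically in R with central finite differences (the matrix and vector are arbitrary illustrations):
```{r}
set.seed(1)
A <- matrix(rnorm(9), 3, 3)              # arbitrary (non-symmetric) matrix
y <- rnorm(3)
f <- function(y) drop(t(y) %*% A %*% y)  # scalar quadratic form y'Ay
num_grad <- sapply(1:3, function(i) {    # central finite-difference gradient
  h <- 1e-6
  e <- replace(numeric(3), i, h)
  (f(y + e) - f(y - e)) / (2 * h)
})
cbind(numerical = num_grad, analytic = drop((A + t(A)) %*% y))
```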
### Optimization in Scalar and Vector Spaces
Optimization is the process of finding the minimum or maximum of a function. The conditions for optimization differ depending on whether the function involves a scalar or a vector. Below is a comparison of scalar and vector optimization:
+--------------------------------------------------------+------------------------------------------------+-----------------------------------------------------------------------------------------+
| Condition | **Scalar Optimization** | **Vector Optimization** |
+========================================================+================================================+=========================================================================================+
| **First-Order Condition** | $$\frac{\partial f(x_0)}{\partial x} = 0$$ | $$\frac{\partial f(x_0)}{\partial x} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}$$ |
+--------------------------------------------------------+------------------------------------------------+-----------------------------------------------------------------------------------------+
| **Second-Order Condition** | $$\frac{\partial^2 f(x_0)}{\partial x^2} > 0$$ | $$\frac{\partial^2 f(x_0)}{\partial x \partial x'} > 0$$ |
| | | |
| For **convex** functions, this implies a **minimum**. | | |
+--------------------------------------------------------+------------------------------------------------+-----------------------------------------------------------------------------------------+
| For **concave** functions, this implies a **maximum**. | $$\frac{\partial^2 f(x_0)}{\partial x^2} < 0$$ | $$\frac{\partial^2 f(x_0)}{\partial x \partial x'} < 0$$ |
+--------------------------------------------------------+------------------------------------------------+-----------------------------------------------------------------------------------------+
: Conditions for scalar and vector optimization
**Key Concepts**
1. **First-Order Condition**:
- The **first-order derivative** of the function must equal zero at a critical point. This holds for both scalar and vector functions:
- In the scalar case, $\frac{\partial f(x)}{\partial x} = 0$ identifies critical points.
- In the vector case, $\frac{\partial f(x)}{\partial x}$ is a **gradient vector**, and the condition is satisfied when all elements of the gradient are zero.
2. **Second-Order Condition**:
- The **second-order derivative** determines whether the critical point is a minimum, maximum, or saddle point:
- For scalar functions, $\frac{\partial^2 f(x)}{\partial x^2} > 0$ implies a **local minimum**, while $\frac{\partial^2 f(x)}{\partial x^2} < 0$ implies a **local maximum**.
- For vector functions, the Hessian matrix $\frac{\partial^2 f(x)}{\partial x \partial x'}$ must be:
- **Positive Definite**: For a **minimum** (convex function).
- **Negative Definite**: For a **maximum** (concave function).
- **Indefinite**: For a **saddle point** (neither minimum nor maximum).
3. **Convex and Concave Functions**:
- A function $f(x)$ is:
- **Convex** if $\frac{\partial^2 f(x)}{\partial x^2} > 0$ or the Hessian $\frac{\partial^2 f(x)}{\partial x \partial x'}$ is positive definite.
- **Concave** if $\frac{\partial^2 f(x)}{\partial x^2} < 0$ or the Hessian is negative definite.
- Convexity ensures global optimization for minimization problems, while concavity ensures global optimization for maximization problems.
4. **Hessian Matrix**:
- In vector optimization, the Hessian $\frac{\partial^2 f(x)}{\partial x \partial x'}$ plays a crucial role in determining the nature of critical points:
- Positive definite Hessian: All eigenvalues are positive.
- Negative definite Hessian: All eigenvalues are negative.
- Indefinite Hessian: Eigenvalues have mixed signs.
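The sketch below illustrates both conditions on a simple convex quadratic (an arbitrary objective chosen for illustration): `optim()` locates the critical point, and the eigenvalues of the returned Hessian confirm it is a minimum.
```{r}
f <- function(x) (x[1] - 1)^2 + 2 * (x[2] + 0.5)^2  # convex, minimized at (1, -0.5)
opt <- optim(c(0, 0), f, method = "BFGS", hessian = TRUE)
opt$par                    # approximately (1, -0.5): gradient is zero here
eigen(opt$hessian)$values  # all positive => positive definite Hessian => minimum
```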
### Cholesky Decomposition
In statistical analysis and numerical linear algebra, decomposing matrices into more tractable forms is crucial for efficient computation. One such important factorization is the **Cholesky Decomposition**. It applies to **Hermitian** (in the complex case) or **symmetric** (in the real case), **positive-definite** matrices.
Given an $n \times n$ positive-definite matrix $A$, the Cholesky Decomposition states:
$$
A = L L^{*},
$$
where:
- $L$ is a lower-triangular matrix with strictly positive diagonal entries.
- $L^{*}$ denotes the *conjugate transpose* of $L$ (simply the transpose $L^{T}$ for real matrices).
Cholesky Decomposition is both computationally efficient and numerically stable, making it a go-to technique for many applications---particularly in statistics where we deal extensively with covariance matrices, linear systems, and probability distributions.
Before diving into how we compute a Cholesky Decomposition, we need to clarify what it means for a matrix to be *positive-definite*. For a real symmetric matrix $A$:
- $A$ is *positive-definite* if for every nonzero vector $x$, we have $$
x^T A x > 0.
$$
- Alternatively, you can characterize positive-definiteness by noting that all eigenvalues of $A$ are strictly positive.
Many important matrices in statistics---particularly *covariance* or *precision* matrices---are both symmetric and positive-definite.
------------------------------------------------------------------------
#### Existence
A real $n \times n$ matrix $A$ that is symmetric and positive-definite always admits a Cholesky Decomposition $A = L L^T$. This theorem guarantees that for any covariance matrix in statistics---assuming it is valid (i.e., positive-definite)---we can decompose it via Cholesky.
#### Uniqueness
If we additionally require that the diagonal entries of $L$ are *strictly positive*, then $L$ is *unique*. That is, no other lower-triangular matrix with strictly positive diagonal entries will produce the same factorization. This uniqueness is helpful for ensuring consistent numerical outputs in software implementations.
#### Constructing the Cholesky Factor $L$
Given a real, symmetric, positive-definite matrix $A \in \mathbb{R}^{n \times n}$, we want to find the lower-triangular matrix $L$ such that $A = LL^T$. One way to do this is by using a simple step-by-step procedure (often part of standard linear algebra libraries):
1. **Initialize** $L$ to be an $n \times n$ zero matrix.
2. **Iterate** through the rows $i = 1, 2, \dots, n$:
1. For each row $i$, compute $$
L_{ii} = \sqrt{A_{ii} - \sum_{k=1}^{i-1} L_{ik}^2}.
$$
2. For each column $j = i+1, i+2, \dots, n$: $$
L_{ji} = \frac{1}{L_{ii}}
\left(A_{ji} - \sum_{k=1}^{i-1} L_{jk} L_{ik}\right).
$$
3. All other entries of $L$ remain zero or are computed in subsequent steps.
3. **Result**: $L$ is lower-triangular, and $L^T$ is its transpose.
Cholesky Decomposition is roughly half the computational cost of a more general LU Decomposition. Specifically, it requires on the order of $\frac{1}{3} n^3$ floating-point operations (flops), making it significantly more efficient in practice than other decompositions for positive-definite systems.
------------------------------------------------------------------------
#### Illustrative Example
Consider a small $3 \times 3$ positive-definite matrix:
$$
A =
\begin{pmatrix}
4 & 2 & 4 \\
2 & 5 & 6 \\
4 & 6 & 20
\end{pmatrix}.
$$
We claim $A$ is positive-definite (one could check by calculating principal minors or verifying $x^T A x > 0$ for all $x \neq 0$). We find $L$ step-by-step:
1. **Compute** $L_{11}$:\
$$
L_{11} = \sqrt{A_{11}} = \sqrt{4} = 2.
$$
2. **Compute** $L_{21}$ and $L_{31}$:
- $L_{21} = \frac{A_{21}}{L_{11}} = \frac{2}{2} = 1.$\
- $L_{31} = \frac{A_{31}}{L_{11}} = \frac{4}{2} = 2.$
3. **Compute** $L_{22}$:\
$$
L_{22} = \sqrt{A_{22} - L_{21}^2}
= \sqrt{5 - 1^2}
= \sqrt{4} = 2.
$$
4. **Compute** $L_{32}$:\
$$
L_{32} = \frac{A_{32} - L_{31} L_{21}}{L_{22}}
= \frac{6 - (2)(1)}{2}
= \frac{4}{2} = 2.
$$
5. **Compute** $L_{33}$:\
$$
L_{33} = \sqrt{A_{33} - (L_{31}^2 + L_{32}^2)}
= \sqrt{20 - (2^2 + 2^2)}
= \sqrt{20 - 8}
= \sqrt{12}
= 2\sqrt{3}.
$$
Thus,
$$
L =
\begin{pmatrix}
2 & 0 & 0 \\
1 & 2 & 0 \\
2 & 2 & 2\sqrt{3}
\end{pmatrix}.
$$
One can verify $L L^T = A$.
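The same factorization in R; note that base R's `chol()` returns the upper-triangular factor $U$ with $A = U'U$, so the lower-triangular $L$ is its transpose:
```{r}
A <- matrix(c(4, 2, 4,
              2, 5, 6,
              4, 6, 20), nrow = 3, byrow = TRUE)
L <- t(chol(A))  # lower-triangular Cholesky factor
L                # matches the hand computation (L[3, 3] = 2 * sqrt(3) ~ 3.464)
L %*% t(L)       # reproduces A
```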
------------------------------------------------------------------------
#### Applications in Statistics
##### Solving Linear Systems
A common statistical problem is solving $A x = b$ for $x$. For instance, in regression or in computing Bayesian posterior modes, we often need to solve linear equations with covariance or precision matrices. With $A = LL^T$:
1. **Forward Substitution**: Solve $L y = b$.
2. **Backward Substitution**: Solve $L^T x = y$.
This two-step process is more stable and efficient than directly inverting $A$ (which is typically discouraged due to numerical issues).
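A minimal R sketch of the two-step solve, using the $3 \times 3$ matrix from the example above and an assumed right-hand side `b`:
```{r}
A <- matrix(c(4, 2, 4, 2, 5, 6, 4, 6, 20), nrow = 3)  # symmetric example matrix
L <- t(chol(A))
b <- c(1, 0, 2)                              # hypothetical right-hand side
y <- forwardsolve(L, b)                      # step 1: solve L y = b
x <- backsolve(t(L), y)                      # step 2: solve L' x = y
cbind(chol_solve = x, direct = solve(A, b))  # both approaches agree
```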
##### Generating Correlated Random Vectors
In simulation-based statistics (e.g., Monte Carlo methods), we often need to generate random draws from a **multivariate normal** distribution $\mathcal{N}(\mu, \Sigma)$, where $\Sigma$ is the covariance matrix. The steps are:
1. Generate a vector $z \sim \mathcal{N}(0, I)$ of independent standard normal variables.
2. Compute $x = \mu + Lz$, where $\Sigma = LL^T$.
Then $x$ has the desired covariance structure $\Sigma$. This technique is widely used in Bayesian statistics (e.g., MCMC sampling) and financial modeling (e.g., portfolio simulations).
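A small simulation sketch; the mean vector and covariance matrix below are assumed values for illustration:
```{r}
set.seed(42)
mu    <- c(0, 1)                          # assumed mean vector
Sigma <- matrix(c(2, 0.8,
                  0.8, 1), 2, 2)          # assumed covariance matrix
L <- t(chol(Sigma))
Z <- matrix(rnorm(2 * 10000), nrow = 2)   # columns of independent N(0, 1) draws
X <- mu + L %*% Z                         # each column is a draw from N(mu, Sigma)
rowMeans(X)                               # close to mu
cov(t(X))                                 # close to Sigma
```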
##### Gaussian Processes and Kriging
In Gaussian Process modeling (common in spatial statistics, machine learning, and geostatistics), we frequently work with large covariance matrices that describe the correlations between observed data points:
$$
\Sigma =
\begin{pmatrix}
k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\
k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\
\vdots & \vdots & \ddots & \vdots \\
k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n)
\end{pmatrix},
$$
where $k(\cdot, \cdot)$ is a covariance (kernel) function. We may need to invert or factorize $\Sigma$ repeatedly to evaluate the log-likelihood:
$$
\log \mathcal{L}(\theta) \sim
- \tfrac{1}{2} \left( y - m(\theta) \right)^T \Sigma^{-1} \left( y - m(\theta) \right)
- \tfrac{1}{2} \log \left| \Sigma \right|,
$$
where $m(\theta)$ is the mean function and $\theta$ are parameters. Using the Cholesky factor $L$ of $\Sigma$ helps:
- Terms involving $\Sigma^{-1}$ can be evaluated by solving triangular systems with $L$ instead of explicitly computing the inverse.
- $\log|\Sigma|$ can be computed as $2 \sum_{i=1}^n \log L_{ii}$.
Hence, Cholesky Decomposition becomes the backbone of Gaussian Process computations.
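For instance, the log-determinant term can be computed from the Cholesky factor without ever forming $\Sigma^{-1}$; the covariance matrix below is an arbitrary positive-definite example:
```{r}
Sigma <- matrix(c(2, 0.8,
                  0.8, 1), 2, 2)  # any positive-definite covariance matrix
L <- t(chol(Sigma))
2 * sum(log(diag(L)))             # log|Sigma| via the Cholesky factor
log(det(Sigma))                   # same value, computed directly
```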
##### Bayesian Inference with Covariance Matrices
Many Bayesian models---especially hierarchical models---assume a multivariate normal prior on parameters. Cholesky Decomposition is used to:
- Sample from these priors or from posterior distributions.
- Regularize large covariance matrices.
- Speed up Markov Chain Monte Carlo (MCMC) computations by factorizing covariance structures.
------------------------------------------------------------------------
#### Other Notes
1. Numerical Stability Considerations
Cholesky Decomposition is considered more stable than a general LU Decomposition when applied to positive-definite matrices. Since no row or column pivots are required, rounding errors can be smaller. Of course, in practice, software implementations can vary, and extremely ill-conditioned matrices can still pose numerical challenges.
2. Why We *Don't* Usually Compute $\mathbf{A}^{-1}$
It is common in statistics (especially in older texts) to see formulas involving $\Sigma^{-1}$. However, computing an inverse explicitly is often discouraged because:
- It is numerically less stable.
- It requires more computations.
- Many tasks that *appear* to need $\Sigma^{-1}$ can be done more efficiently by solving systems via the Cholesky factor $L$.
Hence, "**solve, don't invert**" is a common mantra. If you see an expression like $\Sigma^{-1} b$, you can use the Cholesky factors $L$ and $L^T$ to solve $\Sigma x = b$ by forward and backward substitution, bypassing the direct inverse calculation.
3. Further Variants and Extensions
- **Incomplete Cholesky**: Sometimes used in iterative solvers where a full Cholesky factorization is too expensive, especially for large sparse systems.
- **LDL\^T Decomposition**: A variant that avoids taking square roots; used for positive *semi*-definite or indefinite systems, but with caution about pivoting strategies.
------------------------------------------------------------------------
## Probability Theory
<!-- {width="80%"} -->
### Axioms and Theorems of Probability
1. Let $S$ denote the sample space of an experiment. Then: $$
P[S] = 1
$$ (The probability of the sample space is always 1.)
2. For any event $A$: $$
P[A] \geq 0
$$ (Probabilities are always non-negative.)
3. Let $A_1, A_2, A_3, \dots$ be a finite or infinite collection of mutually exclusive events. Then: $$
P[A_1 \cup A_2 \cup A_3 \dots] = P[A_1] + P[A_2] + P[A_3] + \dots
$$ (The probability of the union of mutually exclusive events is the sum of their probabilities.)
4. The probability of the empty set is: $$
P[\emptyset] = 0
$$
5. The complement rule: $$
P[A'] = 1 - P[A]
$$
6. The probability of the union of two events: $$
P[A_1 \cup A_2] = P[A_1] + P[A_2] - P[A_1 \cap A_2]
$$
#### Conditional Probability
The conditional probability of $A$ given $B$ is defined as:
$$
P[A|B] = \frac{P[A \cap B]}{P[B]}, \quad \text{provided } P[B] \neq 0.
$$
#### Independent Events
Two events $A$ and $B$ are independent if and only if:
1. $P[A \cap B] = P[A]P[B]$
2. $P[A|B] = P[A]$
3. $P[B|A] = P[B]$
A collection of events $A_1, A_2, \dots, A_n$ is independent if and only if every subcollection is independent.
#### Multiplication Rule
The probability of the intersection of two events can be calculated as: $$
P[A \cap B] = P[A|B]P[B] = P[B|A]P[A].
$$
#### Bayes' Theorem
Let $A_1, A_2, \dots, A_n$ be a collection of mutually exclusive events whose union is $S$, and let $B$ be an event with $P[B] \neq 0$. Then, for any event $A_j$ ($j = 1, 2, \dots, n$): $$
P[A_j|B] = \frac{P[B|A_j]P[A_j]}{\sum_{i=1}^n P[B|A_i]P[A_i]}.
$$
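A small numerical illustration in R, using hypothetical disease-testing numbers (1% prevalence, 95% sensitivity, 5% false-positive rate):
```{r}
prior      <- c(disease = 0.01, healthy = 0.99)  # P(A_j)
likelihood <- c(disease = 0.95, healthy = 0.05)  # P(B | A_j), B = positive test
posterior  <- likelihood * prior / sum(likelihood * prior)
posterior["disease"]  # P(disease | positive test), roughly 0.16
```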
#### Jensen's Inequality
- If $g(x)$ is convex, then: $$
E[g(X)] \geq g(E[X])
$$
- If $g(x)$ is concave, then: $$
E[g(X)] \leq g(E[X]).
$$
Jensen's inequality provides a useful way to demonstrate why the standard error calculated using the sample standard deviation ($s$) as a proxy for the population standard deviation ($\sigma$) is a biased estimator.
- The population standard deviation $\sigma$ is defined as: $$
\sigma = \sqrt{\mathbb{E}[(X - \mu)^2]},
$$ where $\mu = \mathbb{E}[X]$ is the population mean.
- The sample standard deviation $s$ is given by: $$
s = \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2},
$$ where $\bar{X}$ is the sample mean.
- When $s$ is used as an estimator for $\sigma$, the expectation involves the square root function, which is concave.
**Applying Jensen's Inequality**
Since $s^2$ is an unbiased estimator of $\sigma^2$, we have: $$
\mathbb{E}[s^2] = \sigma^2.
$$
Because the square root function is concave, Jensen's inequality implies: $$
\mathbb{E}[s] = \mathbb{E}\left[\sqrt{s^2}\right] \leq \sqrt{\mathbb{E}[s^2]} = \sigma.
$$
This inequality shows that the expected value of $s$ (the sample standard deviation) systematically underestimates the population standard deviation $\sigma$.
**Quantifying the Bias**
The bias arises because: $$
\mathbb{E}[s] \neq \sigma.
$$
To quantify the bias, note that for a normally distributed population the sample standard deviation satisfies: $$
\mathbb{E}[s] = c_4(n)\,\sigma, \quad \text{where } c_4(n) = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma(n/2)}{\Gamma\left(\frac{n-1}{2}\right)} < 1,
$$ and $n$ is the sample size. The bias decreases as $n$ increases, and the estimator becomes asymptotically unbiased.
By leveraging Jensen's inequality, we observe that the concavity of the square root function ensures that $s$ is a biased estimator of $\sigma$, systematically underestimating the population standard deviation.
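A quick simulation sketch in R illustrates the downward bias; the normal population with $\sigma = 2$ and samples of size $n = 5$ are assumed values:
```{r}
set.seed(123)
sigma <- 2
n <- 5
s_vals <- replicate(10000, sd(rnorm(n, mean = 0, sd = sigma)))
mean(s_vals)  # noticeably below sigma = 2, illustrating E[s] < sigma
```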
#### Law of Iterated Expectation
The **Law of Iterated Expectation** states that for random variables $X$ and $Y$:
$$
E(X) = E(E(X|Y)).
$$
This means the expected value of $X$ can be obtained by first calculating the conditional expectation $E(X|Y)$ and then taking the expectation of this quantity over the distribution of $Y$.
#### Correlation and Independence
The strength of the relationship between random variables can be ranked from strongest to weakest as:
1. **Independence**:
- $f(x, y) = f_X(x)f_Y(y)$
- $f_{Y|X}(y|x) = f_Y(y)$ and $f_{X|Y}(x|y) = f_X(x)$
- $E[g_1(X)g_2(Y)] = E[g_1(X)]E[g_2(Y)]$
2. **Mean Independence** (implied by independence):
- $Y$ is mean independent of $X$ if: $$
E[Y|X] = E[Y].
$$
- This implies $E[Yg(X)] = E[Y]E[g(X)]$ for any function $g(\cdot)$ of $X$.
3. **Uncorrelatedness** (implied by independence and mean independence):
- $\text{Cov}(X, Y) = 0$
- $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$
- $E[XY] = E[X]E[Y]$
### Central Limit Theorem
The **Central Limit Theorem** states that for a sufficiently large sample size ($n \geq 25$), the sampling distribution of the sample mean or proportion approaches a normal distribution, regardless of the population's original distribution.
Let $X_1, X_2, \dots, X_n$ be a random sample of size $n$ from a distribution $X$ with mean $\mu$ and variance $\sigma^2$. Then, for large $n$:
1. The sample mean $\bar{X}$ is approximately normal: $$
\mu_{\bar{X}} = \mu, \quad \sigma^2_{\bar{X}} = \frac{\sigma^2}{n}.
$$
2. The sample proportion $\hat{p}$ is approximately normal: $$
\mu_{\hat{p}} = p, \quad \sigma^2_{\hat{p}} = \frac{p(1-p)}{n}.
$$
3. The difference in sample proportions $\hat{p}_1 - \hat{p}_2$ is approximately normal: $$
\mu_{\hat{p}_1 - \hat{p}_2} = p_1 - p_2, \quad \sigma^2_{\hat{p}_1 - \hat{p}_2} = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}.
$$
4. The difference in sample means $\bar{X}_1 - \bar{X}_2$ is approximately normal: $$
\mu_{\bar{X}_1 - \bar{X}_2} = \mu_1 - \mu_2, \quad \sigma^2_{\bar{X}_1 - \bar{X}_2} = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}.
$$
5. The following random variables are approximately standard normal:
- $\frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
- $\frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}$
- $\frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}}$
- $\frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$
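A brief simulation sketch of the first result: even for a strongly skewed population (an exponential distribution, assumed here purely for illustration), the sampling distribution of $\bar{X}$ with $n = 30$ looks approximately normal.
```{r}
set.seed(1)
n <- 30
means <- replicate(10000, mean(rexp(n, rate = 1)))   # population mean 1, variance 1
c(mean = mean(means), sd = sd(means), theory_sd = 1 / sqrt(n))
hist(means, breaks = 50, main = "Sampling distribution of the sample mean",
     xlab = expression(bar(X)))
```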
#### Limiting Distribution of the Sample Mean
If $\{X_i\}_{i=1}^{n}$ is an iid random sample from a distribution with finite mean $\mu$ and finite variance $\sigma^2$, the sample mean $\bar{X}$ scaled by $\sqrt{n}$ has the following limiting distribution:
$$
\sqrt{n}(\bar{X} - \mu) \xrightarrow{d} N(0, \sigma^2).
$$
Standardizing the sample mean gives: $$
\frac{\sqrt{n}(\bar{X} - \mu)}{\sigma} \xrightarrow{d} N(0, 1).
$$
**Notes**:
- The CLT holds for most random samples from any distribution (continuous, discrete, or unknown).
- It extends to the multivariate case: A random sample of a random vector converges to a multivariate normal distribution.
#### Asymptotic Variance and Limiting Variance
**Asymptotic Variance** (Avar): $$
Avar(\sqrt{n}(\bar{X} - \mu)) = \sigma^2.
$$
- Refers to the variance of the limiting distribution of an estimator as the sample size ($n$) approaches infinity.
- It characterizes the variability of the scaled estimator $\sqrt{n}(\bar{X} - \mu)$ in its asymptotic distribution (e.g., normal distribution).
**Limiting Variance** ($\lim_{n \to \infty} Var$)
$$
\lim_{n \to \infty} Var(\sqrt{n}(\bar{X}-\mu)) = \sigma^2
$$
- Represents the value that the actual variance of $\sqrt{n}(\bar{X} - \mu)$ converges to as $n \to \infty$.
For a well-behaved estimator,
$$
Avar(\sqrt{n}(\bar{X} - \mu)) = \lim_{n \to \infty} Var(\sqrt{n}(\bar{X}-\mu)) = \sigma^2.
$$
However, **asymptotic variance is not necessarily equal to the limiting value of the variance** because asymptotic variance is derived from the limiting distribution, while limiting variance is a convergence result of the sequence of variances.
$$
Avar(\cdot) \neq \lim_{n \to \infty} Var(\cdot)
$$
- Both the asymptotic variance $Avar$ and the limiting variance $\lim_{n \to \infty} Var$ are numerically equal to $\sigma^2$, but their conceptual definitions differ.
- $Avar(\cdot) \neq \lim_{n \to \infty} Var(\cdot)$. This emphasizes that while the numerical result may match, their derivation and meaning differ:
- $Avar$ depends on the asymptotic (large-sample) distribution of the estimator.
- $\lim_{n \to \infty} Var(\cdot)$ involves the sequence of variances as $n$ grows.
Cases where the two do not match:
1. **Sample Quantiles**: Consider the sample quantile of order $p$, for some $0 < p < 1$. Under regularity conditions, the asymptotic distribution of the sample quantile is normal, with a variance that depends on $p$ and the density of the distribution at the $p$-th quantile. However, the variance of the sample quantile itself does not necessarily converge to this limit as the sample size grows.
2. **Bootstrap Methods**: When using bootstrapping techniques to estimate the distribution of a statistic, the bootstrap distribution might converge to a different limiting distribution than the original statistic. In these cases, the variance of the bootstrap distribution (or the bootstrap variance) might differ from the limiting variance of the original statistic.
3. **Statistics with Randomly Varying Asymptotic Behavior**: In some cases, the asymptotic behavior of a statistic can vary randomly depending on the sample path. For such statistics, the asymptotic variance might not provide a consistent estimate of the limiting variance.
4. **M-estimators with Varying Asymptotic Behavior**: M-estimators can sometimes have different asymptotic behaviors depending on the tail behavior of the underlying distribution. For heavy-tailed distributions, the variance of the estimator might not stabilize even as the sample size grows large, making the asymptotic variance different from the variance of any limiting distribution.
### Random Variable
Random variables can be categorized as either **discrete** or **continuous**, with distinct properties and functions defining each type.
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| | **Discrete Variable** | **Continuous Variable** |
+======================================+=========================================================================================================+==============================================================================================================================+
| **Definition** | A random variable is discrete if it can assume at most a finite or countably infinite number of values. | A random variable is continuous if it can assume any value in some interval or intervals of real numbers, with $P(X=x) = 0$. |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| **Density Function** | A function $f$ is called a density for $X$ if: | A function $f$ is called a density for $X$ if: |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| | 1\. $f(x) \geq 0$ | 1\. $f(x) \geq 0$ for $x$ real |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| | 2\. $\sum_{x} f(x) = 1$ | 2\. $\int_{-\infty}^{\infty} f(x) dx = 1$ |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| | 3\. $f(x) = P(X = x)$ for $x$ real | 3\. $P[a \leq X \leq b] = \int_{a}^{b} f(x) dx$ for $a, b$ real |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| **Cumulative Distribution Function** | $F(x) = P(X \leq x)$ | $F(x) = P(X \leq x) = \int_{-\infty}^{x} f(t) dt$ |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| $E[H(X)]$ | $\sum_{x} H(x) f(x)$ | $\int_{-\infty}^{\infty} H(x) f(x) dx$ |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| $\mu = E[X]$ | $\sum_{x} x f(x)$ | $\int_{-\infty}^{\infty} x f(x) dx$ |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| **Ordinary Moments** | $\sum_{x} x^k f(x)$ | $\int_{-\infty}^{\infty} x^k f(x) dx$ |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
| **Moment Generating Function** | $m_X(t) = E[e^{tX}] = \sum_{x} e^{tx} f(x)$ | $m_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f(x) dx$ |
+--------------------------------------+---------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------+
: Comparison of discrete and continuous random variables
**Expected Value Properties**
- $E[c] = c$ for any constant $c$.
- $E[cX] = cE[X]$ for any constant $c$.
- $E[X + Y] = E[X] + E[Y]$.
- $E[XY] = E[X]E[Y]$ (if $X$ and $Y$ are independent).
**Variance Properties**
- $\text{Var}(c) = 0$ for any constant $c$.
- $\text{Var}(cX) = c^2 \text{Var}(X)$ for any constant $c$.
- $\text{Var}(X) \geq 0$.
- $\text{Var}(X) = E[X^2] - (E[X])^2$.
- $\text{Var}(X + c) = \text{Var}(X)$.
- $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$ (if $X$ and $Y$ are independent).
The standard deviation $\sigma$ is given by: $$
\sigma = \sqrt{\sigma^2} = \sqrt{\text{Var}(X)}.
$$
#### Multivariate Random Variables
Suppose $y_1, \dots, y_p$ are random variables with means $\mu_1, \dots, \mu_p$. Then:
$$
\mathbf{y} = \begin{bmatrix}
y_1 \\
\vdots \\
y_p
\end{bmatrix}, \quad E[\mathbf{y}] = \begin{bmatrix}
\mu_1 \\
\vdots \\
\mu_p
\end{bmatrix} = \boldsymbol{\mu}.
$$
The covariance between $y_i$ and $y_j$ is $\sigma_{ij} = \text{Cov}(y_i, y_j)$. The variance-covariance (or dispersion) matrix is:
$$
\mathbf{\Sigma} = (\sigma_{ij})= \begin{bmatrix}
\sigma_{11} & \sigma_{12} & \dots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \dots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \dots & \sigma_{pp}
\end{bmatrix}.
$$
And $\mathbf{\Sigma}$ is symmetric with $(p+1)p/2$ unique parameters.
Alternatively, let $\mathbf{u}_{p \times 1}$ and $\mathbf{v}_{q \times 1}$ be random vectors with means $\mathbf{\mu_u}$ and $\mathbf{\mu_v}$. Then
$$ \mathbf{\Sigma_{uv}} = cov(\mathbf{u,v}) = E[\mathbf{(u-\mu_u)(v-\mu_v)'}] $$
Note that, in general, $\mathbf{\Sigma_{uv}} \neq \mathbf{\Sigma_{vu}}$, but $\mathbf{\Sigma_{uv}} = \mathbf{\Sigma_{vu}'}$.
**Properties of Covariance Matrices**
1. **Symmetry**: $\mathbf{\Sigma}' = \mathbf{\Sigma}$.
2. **Eigen-Decomposition** (spectral or symmetric decomposition): $\mathbf{\Sigma = \Phi \Lambda \Phi'}$, where $\mathbf{\Phi}$ is a matrix of eigenvectors such that $\mathbf{\Phi \Phi' = I}$ (orthonormal), and $\mathbf{\Lambda}$ is a diagonal matrix with eigenvalues $(\lambda_1,...,\lambda_p)$ on the diagonal.
3. **Non-Negative Definiteness**: $\mathbf{a' \Sigma a} \ge 0$ for any $\mathbf{a} \in R^p$. Equivalently, the eigenvalues of $\mathbf{\Sigma}$ satisfy $\lambda_1 \ge ... \ge \lambda_p \ge 0$.
4. **Generalized Variance**: $|\mathbf{\Sigma}| = \lambda_1 \dots \lambda_p \geq 0$.
5. **Trace**: $\text{tr}(\mathbf{\Sigma}) = \lambda_1 + \dots + \lambda_p = \sigma_{11} + \dots+ \sigma_{pp} = \sum \sigma_{ii}$ = sum of variances (total variance).
**Note**: $\mathbf{\Sigma}$ is required to be positive definite. This implies that all eigenvalues are positive, and $\mathbf{\Sigma}$ has an inverse $\mathbf{\Sigma}^{-1}$, such that $\mathbf{\Sigma}^{-1}\mathbf{\Sigma}= \mathbf{I}_{p \times p} = \mathbf{\Sigma}\mathbf{\Sigma}^{-1}$
#### Correlation Matrices
The correlation coefficient $\rho_{ij}$ and correlation matrix $\mathbf{R}$ are defined as:
$$
\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}, \quad \mathbf{R} = \begin{bmatrix}
1 & \rho_{12} & \dots & \rho_{1p} \\
\rho_{21} & 1 & \dots & \rho_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{p1} & \rho_{p2} & \dots & 1
\end{bmatrix}.
$$
where $\rho_{ii} = 1$ for all $i$.
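In R, `cov2cor()` converts a covariance matrix into the corresponding correlation matrix (the covariance matrix below is an arbitrary example):
```{r}
Sigma <- matrix(c(4, 2,
                  2, 9), 2, 2)  # variances 4 and 9, covariance 2
cov2cor(Sigma)                  # off-diagonal: 2 / sqrt(4 * 9) = 1/3
```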
#### Linear Transformations
Let $\mathbf{A}$ and $\mathbf{B}$ be matrices of constants, and $\mathbf{c}$ and $\mathbf{d}$ be vectors of constants. Then:
- $E[\mathbf{Ay + c}] = \mathbf{A \mu_y + c}$.
- $\text{Var}(\mathbf{Ay + c}) = \mathbf{A \Sigma_y A'}$.
- $\text{Cov}(\mathbf{Ay + c, By + d}) = \mathbf{A \Sigma_y B'}$.
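A Monte Carlo check of the second rule, $\text{Var}(\mathbf{Ay + c}) = \mathbf{A \Sigma_y A'}$, with assumed values for $\mathbf{A}$, $\mathbf{c}$, and $\mathbf{\Sigma_y}$:
```{r}
set.seed(7)
A     <- matrix(c(1, 0.5, -1, 2), 2, 2)   # assumed constant matrix
c_vec <- c(1, -1)                          # assumed constant vector
Sigma <- matrix(c(2, 0.6, 0.6, 1), 2, 2)   # assumed Var(y)
y <- t(chol(Sigma)) %*% matrix(rnorm(2 * 1e5), nrow = 2)  # draws with Var(y) = Sigma
z <- A %*% y + c_vec                       # linear transformation of each draw
round(cov(t(z)), 2)                        # empirical Var(Ay + c)
A %*% Sigma %*% t(A)                       # theoretical A Sigma A'
```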
### Moment Generating Function
#### Properties of the Moment Generating Function
1. $\frac{d^k(m_X(t))}{dt^k} \bigg|_{t=0} = E[X^k]$ (The $k$-th derivative at $t=0$ gives the $k$-th moment of $X$).
2. $\mu = E[X] = m_X'(0)$ (The first derivative at $t=0$ gives the mean).
3. $E[X^2] = m_X''(0)$ (The second derivative at $t=0$ gives the second moment).
#### Theorems Involving MGFs
Let $X_1, X_2, \dots, X_n, Y$ be random variables with MGFs $m_{X_1}(t), m_{X_2}(t), \dots, m_{X_n}(t), m_Y(t)$:
1. If $m_{X_1}(t) = m_{X_2}(t)$ for all $t$ in some open interval about 0, then $X_1$ and $X_2$ have the same distribution.
2. If $Y = \alpha + \beta X_1$, then: $$
m_Y(t) = e^{\alpha t}m_{X_1}(\beta t).
$$
3. If $X_1, X_2, \dots, X_n$ are independent and $Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \dots + \alpha_n X_n$, where $\alpha_0, \alpha_1, \dots, \alpha_n$ are constants, then: $$
m_Y(t) = e^{\alpha_0 t} m_{X_1}(\alpha_1 t) m_{X_2}(\alpha_2 t) \dots m_{X_n}(\alpha_n t).
$$
4. Suppose $X_1, X_2, \dots, X_n$ are independent normal random variables with means $\mu_1, \mu_2, \dots, \mu_n$ and variances $\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2$. If $Y = \alpha_0 + \alpha_1 X_1 + \alpha_2 X_2 + \dots + \alpha_n X_n$, then:
- $Y$ is normally distributed.
- Mean: $\mu_Y = \alpha_0 + \alpha_1 \mu_1 + \alpha_2 \mu_2 + \dots + \alpha_n \mu_n$.
- Variance: $\sigma_Y^2 = \alpha_1^2 \sigma_1^2 + \alpha_2^2 \sigma_2^2 + \dots + \alpha_n^2 \sigma_n^2$.
------------------------------------------------------------------------
### Moments
+--------------+-------------------------------+-------------------------------------------+
| Moment | Uncentered | Centered |
+==============+===============================+===========================================+
| 1st | $E[X] = \mu = \text{Mean}(X)$ | |
+--------------+-------------------------------+-------------------------------------------+
| 2nd | $E[X^2]$ | $E[(X-\mu)^2] = \text{Var}(X) = \sigma^2$ |
+--------------+-------------------------------+-------------------------------------------+
| 3rd | $E[X^3]$ | $E[(X-\mu)^3]$ |
+--------------+-------------------------------+-------------------------------------------+
| 4th | $E[X^4]$ | $E[(X-\mu)^4]$ |
+--------------+-------------------------------+-------------------------------------------+
: First four moments of a random variable, centered and uncentered
- **Skewness**: $\text{Skewness}(X) = \frac{E[(X-\mu)^3]}{\sigma^3}$
- **Definition**: Skewness measures the asymmetry of a probability distribution around its mean.
- **Interpretation**:
- **Positive skewness**: The right tail (higher values) is longer or heavier than the left tail.
- **Negative skewness**: The left tail (lower values) is longer or heavier than the right tail.
- **Zero skewness**: The data is symmetric.
- **Kurtosis**: $\text{Kurtosis}(X) = \frac{E[(X-\mu)^4]}{\sigma^4}$
- **Definition**: Kurtosis measures the "tailedness" or the heaviness of the tails of a probability distribution.
- Excess kurtosis (often reported) is the kurtosis minus 3 (to compare against the normal distribution's kurtosis of 3).
- **Interpretation**:
- **High kurtosis (\>3)**: Heavy tails, more extreme outliers.
- **Low kurtosis (\<3)**: Light tails, fewer outliers.
- **Normal distribution kurtosis = 3**: Benchmark for comparison.
### Skewness
Skewness measures the asymmetry of the distribution:
1. **Positive skew**: The right side (high values) is stretched out.
- Positive skew occurs when the right tail (higher values) of the distribution is longer or heavier.
- **Examples**:
- **Income Distribution**: In many countries, most people earn a moderate income, but a small fraction of ultra-high earners stretches the distribution's right tail.
- **Housing Prices**: Most homes may be around an affordable price, but a few extravagant mansions create a very long (and expensive) upper tail.
```{r, fig.cap="Visualization of positie skew", fig.alt="Two histograms side by side showing positively skewed distributions. The left plot represents Housing Prices with a long tail to the right, indicating higher-priced houses. The right plot represents Income Distribution, with a long right tail, indicating high income earners."}
# Load required libraries
library(ggplot2)
# Simulate data for positive skew
set.seed(123)
positive_skew_income <-
  rbeta(1000, 2, 5) * 100 # Income distribution example (long right tail)
positive_skew_housing <-
  rbeta(1000, 2, 5) * 1000 # Housing prices example (long right tail)
# Combine data
data_positive_skew <- data.frame(