data_analysis/43-rep_synthetic_data.Rmd at main · mikenguyen13/data_analysis · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
# Replication and Synthetic Data

Access to comprehensive data is pivotal for replication, especially in the realm of social sciences. Yet, data are often inaccessible due to proprietary restrictions, privacy concerns, or logistical constraints, making replication a challenge [@king1995replication]. This chapter explores the nuances of replication, exceptions to its norms, and the significance of synthetic data as a solution.

## The Replication Standard

Replicability in research ensures:

-   **Credibility** -- Reinforces trust in empirical studies by allowing independent verification.
-   **Continuity** -- Enables future research to build upon prior findings, promoting cumulative knowledge.
-   **Visibility** -- Increases readership and citations, benefiting both individual researchers and the broader academic community.

For research to be replicable, adhering to the **replication standard** is essential. This standard requires researchers to provide all necessary information---data, code, and methodological details---so that third parties can independently reproduce the study's findings. While quantitative research often allows for clearer replication, qualitative studies pose challenges due to their depth, contextual nature, and reliance on subjective interpretation.

### Solutions for Empirical Replication

Several approaches help address replication challenges in empirical research:

1.  **Role of Individual Authors**
    -   Researchers must commit to transparency and provide well-documented data and code.
    -   Repositories such as the **Inter-University Consortium for Political and Social Research (ICPSR)** offer secure, long-term storage for replication datasets.
2.  **Creation of a Replication Data Set**
    -   A dedicated **replication dataset** should include original data, relevant supplementary data, and the exact procedures used for analysis.
    -   Metadata and documentation should be provided to ensure clarity.
3.  **Professional Data Archives**
    -   Organizations like **ICPSR, Dataverse, and Zenodo** facilitate open access to datasets while maintaining proper governance over sensitive information.
    -   These archives help address data accessibility and preservation issues.
4.  **Educational Implications**
    -   Teaching replication strengthens students' understanding of empirical methods and reproducibility.
    -   Many graduate programs now incorporate replication studies into coursework, emphasizing their importance in methodological rigor.

### Free Data Repositories

1.  **Zenodo**: Hosted by CERN, it provides a place for researchers to deposit datasets. It's not subject-specific, so it caters to various disciplines.

2.  **figshare**: Allows researchers to upload, share, and cite their datasets.

3.  **Dryad**: Primarily for datasets associated with published articles in the biological and medical sciences.

4.  **OpenICPSR**: A public-facing version of the Inter-University Consortium for Political and Social Research (ICPSR) where researchers can deposit data without any cost.

5.  **Harvard Dataverse**: Hosted by Harvard University, this is an open-source repository software application dedicated to archiving, sharing, and citing research data.

6.  **Mendeley Data**: A multidisciplinary, free-to-use open access data repository where researchers can upload and share their datasets.

7.  **Open Science Framework (OSF)**: Offers both a platform for conducting research and a place to deposit datasets.

8.  **PubMed Central**: Specific to life sciences, but it's an open repository for journal articles, preprints, and datasets.

9.  **Registry of Research Data Repositories (re3data)**: While not a repository itself, it provides a global registry of research data repositories from various academic disciplines.

10. **SocArXiv**: An open archive for the social sciences.

11. **EarthArXiv**: A preprints archive for earth science.

12. **Protein Data Bank (PDB)**: For 3D structures of large biological molecules.

13. **Gene Expression Omnibus (GEO)**: A public functional genomics data repository.

14. **The Language Archive (TLA)**: Dedicated to data on languages worldwide, especially endangered languages.

15. **B2SHARE**: A platform for storing and sharing research data sets in various disciplines, especially from European research projects.

### Exceptions to Replication

While the replication standard is fundamental to scientific integrity, certain constraints may prevent full adherence. Some common exceptions include:

1.  **Confidentiality**
    -   Some datasets contain highly sensitive information (e.g., medical records, personal financial data) that cannot be disclosed, even in a fragmented form.
    -   Anonymization techniques and data aggregation can sometimes mitigate these concerns, but privacy regulations (e.g., GDPR, HIPAA) impose strict limitations.
2.  **Proprietary Data**
    -   Datasets owned by corporations, governments, or third-party vendors often have restricted access due to intellectual property concerns.
    -   In many cases, researchers can share **summary statistics, derived variables, or synthetic versions** of the data while respecting proprietary restrictions.
3.  **Rights of First Publication**
    -   Some studies involve data embargoes, where researchers must delay public release until initial publications are completed.
    -   Despite embargoes, the essential data and methodology should eventually be accessible to ensure transparency.

### Replication Landscape

@brodeur2025comparing finds that while AI-assisted teams improve upon AI-led approaches, human-only teams remain the most effective at detecting major errors and ensuring reproducibility in quantitative social science research.

-   Human teams and AI-assisted teams achieved similar reproducibility success rates, both significantly outperforming AI-led teams.

-   Human-only teams were 57 percentage points more successful than AI-led teams (p \< 0.001).

-   Error detection: Human teams identified significantly more major errors than AI-assisted teams (0.7 more errors per team, p = 0.017) and AI-led teams (1.1 more errors per team, p \< 0.001).

-   AI-assisted teams detected 0.4 more errors per team than AI-led teams (p = 0.029) but still fewer than human teams.

-   Robustness checks: Both human and AI-assisted teams were significantly better than AI-led teams in proposing (25 percentage points, p = 0.017) and implementing (33 percentage points, p = 0.005) comprehensive robustness checks.

@huntington2025sources uses a three-stage many-analysts design to examine how researcher decisions influence variation in treatment effect estimates.

-    146 research teams completed the same causal inference task three times under increasingly standardized conditions:

    -    **Stage 1:** Few constraints (free-form analysis).

    -    **Stage 2:** Prescribed research design.

    -    **Stage 3:** Prescribed design plus pre-cleaned data.

```{=html}
<!-- -->
```
-    **Key findings:**

    -    **Stage 1:** High variation in reported effects (IQR = 3.1 percentage points), with outliers.

    -    **Stage 2:** Even greater variation (IQR = 4.0), due to imperfect protocol adherence.

    -    **Stage 3:** Lowest variation (IQR = 2.4), suggesting data cleaning substantially reduces result heterogeneity.

```{=html}
<!-- -->
```
-    **Sample size convergence:**

    -   IQR dropped from 295,187 (Stage 1) to 29,144 (Stage 2), and was effectively zero in Stage 3.

    The results highlight the **critical role of data cleaning** in applied microeconomics and suggest **new directions for replication research**.

------------------------------------------------------------------------

## Synthetic Data

Synthetic data, which models real data while ensuring anonymity, is becoming an essential tool in research. By generating artificial datasets that retain key statistical properties of the original data, researchers can **preserve privacy, enhance data accessibility, and facilitate replication**. However, synthetic data also introduces complexities and should be used with caution.

------------------------------------------------------------------------

### Benefits of Synthetic Data

1.  **Privacy Preservation**
    -   Protects sensitive or proprietary information while enabling research collaboration.
2.  **Data Fairness and Augmentation**
    -   Helps mitigate biases by generating more balanced datasets.
    -   Can supplement real data when sample sizes are limited.
3.  **Acceleration in Research**
    -   Allows for data sharing in environments where access to real data is restricted.
    -   Enables large-scale simulations without legal or ethical constraints.

------------------------------------------------------------------------

### Concerns and Limitations

1.  **Misconceptions About Privacy**
    -   Synthetic data does not guarantee absolute privacy---re-identification risks remain if it is too similar to the real dataset.
2.  **Challenges with Data Outliers**
    -   Rare but important data points may be poorly represented or excluded.
3.  **Risks of Solely Relying on Synthetic Data**
    -   Models trained exclusively on synthetic data may lack generalizability.
    -   Differences between real and synthetic distributions can introduce biases.

------------------------------------------------------------------------

### Further Insights on Synthetic Data

Synthetic data acts as a bridge between **model-centric and data-centric perspectives**, making it a vital tool in modern research. An analogy can be drawn to **viewing a replica of the Mona Lisa**---the essence remains, but the original is securely stored.

For a deeper dive into synthetic data and its applications, refer to [@jordon2022synthetic].

------------------------------------------------------------------------

### Generating Synthetic Data

When generating synthetic data, the approach depends on whether researchers have full access to the original dataset or are working under restricted conditions.

#### When You Have Access to the Original Dataset

If researchers can directly use the dataset, various techniques can be employed to generate synthetic data while preserving the statistical properties of the original:

-   **Statistical Approaches**
    -   **Parametric models (e.g., Gaussian Mixture Models)**
        -   Fit statistical distributions to real data and sample synthetic observations.
-   **Machine Learning-Based Methods**
    -   **Variational Autoencoders (VAEs)** -- Useful for structured, complex data representations.
    -   **Generative Adversarial Networks (GANs)** -- Effective for generating high-dimensional data (e.g., tabular, image, and text data).
    -   **CTGAN (Conditional Tabular GAN)** -- Specifically designed for structured, tabular datasets, addressing categorical and imbalanced data challenges.
-   **Differential Privacy Techniques**
    -   **Noise Addition** -- Introduces controlled noise while maintaining the overall statistical structure.

#### When You Have a Restricted Dataset

In cases where **data cannot be exported** due to security, privacy, or proprietary constraints, researchers must rely on alternative strategies to generate synthetic data:

-   **Summarization and Approximation**
    -   Extract summary statistics (e.g., means, variances, correlations) to approximate the dataset's structure.
    -   If permitted, share aggregated or anonymized data instead of raw observations.
-   **Server-Based Computation**
    -   Conduct in-server analyses where raw data remains inaccessible, but synthetic outputs can be generated on the secure system.
-   **Synthetic Data Generation with Preserved Properties**
    -   Use models trained on the secure dataset to produce synthetic data without directly copying real observations.
    -   Ensure that key statistical relationships are maintained, even if individual values differ.

## Application

### Original Dataset

1.  Import libraries

```{r}
library(copula)
library(moments)
library(PerformanceAnalytics)  # For correlation plots
library(ggplot2)
library(dplyr)
```

2.  Simulate a Complex, Nonlinear, Hierarchical Time Series

Suppose we have:

-   $G = 3$ groups (e.g., groups 1, 2, 3)

-   Each group has $N = 50$ units (e.g., individuals or devices)

-   Each unit is measured at $T = 20$ time points

We'll create four continuous variables, `X1` through `X4`, each influenced by:

-   A group-level random effect (different intercept by group)

-   A unit-level random effect (different intercept by unit)

-   Time (with some nonlinear relationships, e.g., sine, polynomial)

-   Nonlinear cross-relationships among X1--X4

This gives us a total of $3 \times 50 \times 20=3000$ rows in the "original" dataset.

```{r}
set.seed(123)  # For reproducibility

G <- 3    # Number of groups
N <- 50   # Units per group
Tt <- 20  # Time points per unit

# Create a data frame structure
df_list <- list()

for(g in 1:G) {

  # Group-level random intercept
  group_intercept <- rnorm(1, mean = 0, sd = 1)

  for(u in 1:N) {

    # Unit-level random intercept
    unit_intercept <- rnorm(1, mean = 0, sd = 0.5)

    # Simulate time points
    time_points <- 1:Tt

    # Create some base patterns
    X1_base <- group_intercept + unit_intercept +
               sin(0.2 * time_points) +  # Nonlinear time pattern
               rnorm(Tt, mean = 0, sd = 0.2)

    # Introduce different relationships for X2, X3, X4
    # Some polynomial in time, plus dependence on X1
    X2_base <- (X1_base^2) + 0.5 * time_points + rnorm(Tt, 0, 0.3)
    X3_base <- 1 + group_intercept - 0.3 * X1_base + log(time_points+1) +
               rnorm(Tt, mean = 0, sd = 0.2)
    X4_base <- exp(0.1 * X1_base) + 0.2 * (X2_base) - 0.5 * (X3_base) +
               rnorm(Tt, mean = 0, sd = 0.5)

    df_temp <- data.frame(
      group = g,
      unit  = paste0("G", g, "_U", u),
      time  = time_points,
      X1    = X1_base,
      X2    = X2_base,
      X3    = X3_base,
      X4    = X4_base
    )

    df_list[[length(df_list) + 1]] <- df_temp
  }
}

df_original <- do.call(rbind, df_list)
row.names(df_original) <- NULL

# Inspect the first rows
head(df_original)

```

3.  Explore the Original Dataset

Let's do some **descriptive statistics** and look at the **correlations** among X1--X4.\
Because we have repeated measures (time series) nested in units and groups, these correlations are "pooled" across all rows. This is a simplification, but it will let us demonstrate how to do a copula-based synthetic approach.

```{r}
# Descriptive stats (overall)
summary(df_original[, c("X1", "X2", "X3", "X4")])

# Skewness & Kurtosis
apply(df_original[, c("X1", "X2", "X3", "X4")], 2, skewness)
apply(df_original[, c("X1", "X2", "X3", "X4")], 2, kurtosis)

# Correlation matrix
(cor_mat <- cor(df_original[, c("X1", "X2", "X3", "X4")]))

chart.Correlation(df_original[, c("X1", "X2", "X3", "X4")],
                  histogram = TRUE, pch = 19)

```

4.  Convert to Pseudo-Observations for Copula Fitting

Copulas need variables in $[0,1]$ space, so we use the **empirical CDF** ("probability integral transform") on each variable.

> **Important**: We have discrete variables like `group`, or ID-like columns such as `unit`. For the synthetic generation of group/unit/time, we have multiple strategies:
>
> 1.  **Model them as random**: e.g., re-sample `group`, `unit`, `time` from the original distribution.
>
> 2.  **Treat them as continuous** in a copula (not recommended for true IDs).
>
> 3.  **Do a hierarchical approach**: fit a separate copula for each group or each time slice (advanced).

For simplicity, we'll:

1.  Re-sample `group` and `time` from their original distributions (like "bootstrapping").

2.  Use a **multivariate copula** only for `X1–X4`.

```{r}
# Extract the numeric columns we want to transform
original_data <- df_original[, c("X1","X2","X3","X4")]

# Convert each to uniform [0,1] by empirical CDF
u_data <- pobs(as.matrix(original_data))

# Check ranges (should be between 0 and 1)
apply(u_data, 2, range)

```

5.  Fit a Copula Model

We'll fit a **Gaussian copula** (you could try **t-copula** or **vine copulas** for heavier tails or more complex dependencies). We use maximum likelihood estimation:

```{r}
# Define an unstructured Gaussian copula
gaussCop <- normalCopula(dim = ncol(u_data), dispstr = "un")

# Fit to the pseudo-observations
fit_gauss <- fitCopula(gaussCop, data = u_data, method = "ml")

summary(fit_gauss)

```

Check the estimated correlation matrix within the copula. This should reflect the dependency among X1--X4 (though not time, group, or unit).

6.  Generate Synthetic Data

-   Synthetic X1--X4
    1.  **Sample** from the fitted copula to get synthetic $[0,1]$ values.
    2.  **Invert** them via the original empirical distributions (quantiles).

```{r}
n_synth <- nrow(df_original)  # same size as original

# Sample from the copula
u_synth <- rCopula(n_synth, fit_gauss@copula)

# Convert from [0,1] -> real scale by matching original distribution
synth_X <- data.frame(
    X1_synth = quantile(original_data$X1, probs = u_synth[, 1],
                        type = 8),
    X2_synth = quantile(original_data$X2, probs = u_synth[, 2],
                        type = 8),
    X3_synth = quantile(original_data$X3, probs = u_synth[, 3],
                        type = 8),
    X4_synth = quantile(original_data$X4, probs = u_synth[, 4],
                        type = 8)
)

head(synth_X)
```

-   Synthetic Group, Unit, and Time

A simple approach is to:

-   **Re-sample "group"** with the same probabilities as the original distribution.

-   **Re-sample "unit"** within each group or treat it as purely random labels (depending on your needs).

-   **Re-sample "time"** from the original distribution or replicate the same time points.

Below, we do a simplistic approach: for each row, pick a random row from the original data to copy `group`, `unit`, and `time`. This preserves the real distribution of group/time pairs and the frequency of each unit. (But it does **not** preserve the original time-series ordering or autoregressive structure!)

```{r}
indices <-
    sample(seq_len(nrow(df_original)),
           size = n_synth,
           replace = TRUE)
synth_meta <- df_original[indices, c("group", "unit", "time")]

# Combine the meta-info with the synthetic X's
df_synth <- cbind(synth_meta, synth_X)

head(df_synth)

```

If you **need** to preserve the exact time-ordering or real "per-unit" correlation across time, you'd need a more advanced approach (e.g., separate copula by unit or a hierarchical time-series model).

7.  Validate the Synthetic Data

-   Compare Descriptive Statistics

```{r}
# Original
orig_means  <- colMeans(df_original[, c("X1", "X2", "X3", "X4")])
orig_sds    <- apply(df_original[, c("X1", "X2", "X3", "X4")], 2, sd)
orig_skew   <-
    apply(df_original[, c("X1", "X2", "X3", "X4")], 2, skewness)
orig_kurt   <-
    apply(df_original[, c("X1", "X2", "X3", "X4")], 2, kurtosis)

# Synthetic
synth_means <-
    colMeans(df_synth[, c("X1_synth", "X2_synth", "X3_synth", "X4_synth")])
synth_sds   <-
    apply(df_synth[, c("X1_synth", "X2_synth", "X3_synth", "X4_synth")], 2, sd)
synth_skew  <-
    apply(df_synth[, c("X1_synth", "X2_synth", "X3_synth", "X4_synth")], 2, skewness)
synth_kurt  <-
    apply(df_synth[, c("X1_synth", "X2_synth", "X3_synth", "X4_synth")], 2, kurtosis)

cat(
    "### Means ###\nOriginal:",
    round(orig_means, 3),
    "\nSynthetic:",
    round(synth_means, 3),
    "\n\n"
)

cat(
    "### SDs ###\nOriginal:",
    round(orig_sds, 3),
    "\nSynthetic:",
    round(synth_sds, 3),
    "\n\n"
)

cat(
    "### Skewness ###\nOriginal:",
    round(orig_skew, 3),
    "\nSynthetic:",
    round(synth_skew, 3),
    "\n\n"
)

cat(
    "### Kurtosis ###\nOriginal:",
    round(orig_kurt, 3),
    "\nSynthetic:",
    round(synth_kurt, 3),
    "\n\n"
)

```

-   Compare Correlation Matrices

```{r}
cat("Original correlation:\n")
round(cor(df_original[, c("X1", "X2", "X3", "X4")]), 3)

cat("\nSynthetic correlation:\n")
round(cor(df_synth[, c("X1_synth", "X2_synth", "X3_synth", "X4_synth")]), 3)

```

-   Visual Comparison of Distributions

```{r}
par(mfrow = c(2, 2))
vars <- c("X1", "X2", "X3", "X4")
for (i in seq_along(vars)) {
    hist(
        df_original[[vars[i]]],
        probability = TRUE,
        breaks = 30,
        main = paste("Original", vars[i]),
        col = rgb(1, 0, 0, 0.5)
    )
    hist(
        df_synth[[paste0(vars[i], "_synth")]],
        probability = TRUE,
        breaks = 30,
        main = paste("Synthetic", vars[i]),
        col = rgb(0, 0, 1, 0.5),
        add = TRUE
    )
    legend(
        "topright",
        legend = c("Original", "Synthetic"),
        fill = c(rgb(1, 0, 0, 0.5), rgb(0, 0, 1, 0.5))
    )
}

```

-   Correlation Plot for Synthetic Data

```{r}
chart.Correlation(df_synth[, c("X1_synth", "X2_synth", "X3_synth", "X4_synth")],
                  histogram = TRUE, pch = 19)

```

-   **Indistinguishability**: If the synthetic summary statistics (means, variances, skewness, kurtosis) and correlation structure match closely, the synthetic data is often "indistinguishable" from the original for many analytical purposes.

-   **Hierarchical / Time-Series**: True hierarchical time-series replication (i.e., preserving each unit's time autocorrelation and group structure) may require more advanced methods, such as:

    -   **Hierarchical copulas** or **vine copulas** over time slices.

    -   **Mixed-effects / random-effects** modeling (e.g., for group and unit) plus a copula for residuals.

    -   **Deep generative approaches** (e.g., TimeGAN) for strong temporal dynamics, currently more common in Python.

-   **Categorical Variables**: For strictly categorical variables (e.g., group, unit ID), you can:

    -   Fit separate copulas within each group.

    -   Convert categories to numeric in a naive way (not recommended for actual IDs) or use specialized **discrete copulas**.

-   **Privacy Considerations**: Even if data is synthetic, do check that it doesn't inadvertently leak private information (e.g., via memorizing outliers). Techniques like **differential privacy** or **post-hoc checks** might be required.

### Restricted Dataset

1.  **Generate the "Original" Complex Dataset**

We'll simulate a hierarchical time-series with:

-   $G = 3$ groups

-   $N = 50$ units per group

-   $T = 20$ time points per unit

-   Nonlinear relationships between `X1`, `X2`, `X3`, `X4`.

```{r}
# Step 1: Generate "df_original" (what the partner owns internally)
set.seed(123)  # For reproducibility

G <- 3    # Number of groups
N <- 50   # Units per group
Tt <- 20  # Time points per unit

df_list <- list()

for(g in 1:G) {

  # Group-level random intercept
  group_intercept <- rnorm(1, mean = 0, sd = 1)

  for(u in 1:N) {

    # Unit-level random intercept
    unit_intercept <- rnorm(1, mean = 0, sd = 0.5)

    # Simulate time points
    time_points <- 1:Tt

    # Create some base patterns (X1)
    X1_base <- group_intercept + unit_intercept +
               sin(0.2 * time_points) +  # Nonlinear time pattern
               rnorm(Tt, mean = 0, sd = 0.2)

    # X2 depends on polynomial in time, plus dependence on X1
    X2_base <- (X1_base^2) + 0.5 * time_points + rnorm(Tt, 0, 0.3)

    # X3 depends on group intercept, negative correlation with X1, and log(time)
    X3_base <- 1 + group_intercept - 0.3 * X1_base + log(time_points + 1) +
               rnorm(Tt, mean = 0, sd = 0.2)

    # X4 depends on X1, X2, X3 in a more complex, nonlinear form
    X4_base <- exp(0.1 * X1_base) + 0.2 * X2_base - 0.5 * X3_base +
               rnorm(Tt, mean = 0, sd = 0.5)

    df_temp <- data.frame(
      group = g,
      unit  = paste0("G", g, "_U", u),
      time  = time_points,
      X1    = X1_base,
      X2    = X2_base,
      X3    = X3_base,
      X4    = X4_base
    )

    df_list[[length(df_list) + 1]] <- df_temp
  }
}

df_original <- do.call(rbind, df_list)
row.names(df_original) <- NULL

# Inspect the first rows (just for illustration)
head(df_original)

```

At this point, imagine `df_original` lives *only* on the partner's server and cannot be exported in its raw form.

2.  **Manually Collect Summary Statistics (Inside Secure Server)**

Within the secure environment, you would run commands to get:

-   Means, standard deviations for each variable

-   Correlation matrix

-   Group distribution info (how many groups, units, etc.)

-   Any other relevant stats (min, max, skewness, kurtosis, etc.) you might use

Below, we'll do that *directly* in code---**but in reality**, you would just write these numbers down or save them in a doc, not export the raw data.

```{r}
# Step 2: Summaries from "df_original" (pretend we can't take the actual df out)
library(dplyr)

# For demonstration, we'll compute them here:
stats_summary <- df_original %>%
  summarise(
    mean_X1 = mean(X1),
    mean_X2 = mean(X2),
    mean_X3 = mean(X3),
    mean_X4 = mean(X4),

    sd_X1 = sd(X1),
    sd_X2 = sd(X2),
    sd_X3 = sd(X3),
    sd_X4 = sd(X4)
  )

# Extract the correlation matrix among (X1, X2, X3, X4)
cor_matrix <- cor(df_original[, c("X1","X2","X3","X4")])

# Also note the group info
unique_groups <- unique(df_original$group)
group_sizes   <- table(df_original$group)
N_groups      <- length(unique_groups)
unit_example  <- length(unique(df_original$unit[df_original$group == 1]))
time_points   <- length(unique(df_original$time[df_original$group == 1 & df_original$unit == "G1_U1"]))

# Print them out as if we wrote them down
stats_summary
cor_matrix
group_sizes
N_groups
unit_example
time_points
```

**Example Output** (numbers will vary):

-   Means, SDs of each variable

-   4×4 correlation matrix

-   `group_sizes`: each group has 50×20 = 1000 rows

-   `N_groups`: 3

**Simulating What We "Take Out"**

Pretend these are the **only** data you're allowed to copy into your local machine:

```{r}
# (Pretend these are typed or copy-pasted from the secure environment)

# Means and SDs:
means <- c(stats_summary$mean_X1, stats_summary$mean_X2,
           stats_summary$mean_X3, stats_summary$mean_X4)
sds   <- c(stats_summary$sd_X1,   stats_summary$sd_X2,
           stats_summary$sd_X3,   stats_summary$sd_X4)

# Correlation matrix:
R <- cor_matrix

# Hierarchical structure info:
G_outside   <- N_groups  # 3
N_outside   <- unit_example  # 50
Tt_outside  <- time_points   # 20

```

4.  **Reconstruct Covariance Matrix and Distribution (Outside)**

Outside, you now have:

-   A mean vector for `(X1, X2, X3, X4)`

-   Standard deviations for each

-   A correlation matrix $R$

-   Basic knowledge: 3 groups, 50 units each, 20 time points each (or however the real data is structured)

Build the covariance $\Sigma$ from the correlation matrix and SDs:

```{r}
# Step 3: Covariance matrix = diag(SDs) %*% R %*% diag(SDs)
Sigma <- diag(sds) %*% R %*% diag(sds)
Sigma
```

4.  **Generate a Synthetic Dataset Matching Those Stats**

We'll replicate the same **hierarchical shape**: 3 groups, 50 units, 20 time points. But we'll fill in `(X1, X2, X3, X4)` by sampling from **multivariate normal** with `(means, Sigma)`.

> In practice, you might want to add back random intercepts for groups or time trends if your manual stats include that. However, if all you have are overall means, SDs, and a correlation matrix, the simplest approach is to assume a single global distribution for X1--X4.

```{r}
library(MASS)

set.seed(999)  # Synthetic data seed (different from original)

df_synth <- data.frame()

for(g in 1:G_outside) {
  for(u in 1:N_outside) {
    for(t in 1:Tt_outside) {
      # Draw one sample from the 4D normal
      X_vector <- mvrnorm(n = 1, mu = means, Sigma = Sigma)

      df_temp <- data.frame(
        group = g,
        unit  = paste0("G", g, "_U", u),
        time  = t,
        X1    = X_vector[1],
        X2    = X_vector[2],
        X3    = X_vector[3],
        X4    = X_vector[4]
      )
      df_synth <- rbind(df_synth, df_temp)
    }
  }
}

# Check the first rows of the synthetic dataset
head(df_synth)

```

At this point, `df_synth` is a dataset that has the same shape (3 groups × 50 units × 20 time points = 3000 rows) and is drawn from the same approximate distribution (matching the partner's means, SDs, correlation matrix).

Alternatively, if the goal is to capture even skewness and kurtosis, it's a bit more complex.

```{r, eval = FALSE}
# Load required libraries
library(MASS)   # For multivariate normal correlation structure
library(sn)     # For skewed normal distribution

# Step 1: Define Structure
num_groups <- 10   # Number of groups
num_timepoints <- 50  # Time series length per group
total_samples <- num_groups * num_timepoints  # Total data points

# Define statistical properties for each group
set.seed(123)  # For reproducibility

group_means <- rnorm(num_groups, mean=50, sd=10)  # Each group has a different mean
group_variances <- runif(num_groups, 50, 150)  # Random variance per group
group_skewness <- runif(num_groups, -1, 2)  # Skewness for each group
# group_kurtosis <- runif(num_groups, 3, 6)  # Excess kurtosis for each group

# Define AR(3) autocorrelation coefficients
phi <- c(0.5, 0.3, 0.2)  # AR(3) coefficients (must sum to < 1 for stationarity)
p <- length(phi)  # Order of the AR process

# Define correlation matrix for groups (cross-sectional correlation)
group_corr_matrix <- matrix(0.5, nrow=num_groups, ncol=num_groups)  # Moderate correlation
diag(group_corr_matrix) <- 1  # Set diagonal to 1 for perfect self-correlation

# Cholesky decomposition for group-level correlation
chol_decomp <- chol(group_corr_matrix)

# Step 2: Generate Hierarchical Time Series Data
data_list <- list()

for (g in 1:num_groups) {
    # Generate base time series (AR(p) process)
    ts_data <- numeric(num_timepoints)

    # Initialize first 'p' values randomly
    ts_data[1:p] <- rnorm(p, mean=group_means[g], sd=sqrt(group_variances[g]))

    for (t in (p+1):num_timepoints) {
        # AR(p) process with multiple past values
        ts_data[t] <- sum(phi * ts_data[(t-p):(t-1)]) +
            rnorm(1, mean=0, sd=sqrt(group_variances[g] * (1 - sum(phi^2))))
    }

    # Add skewness using skewed normal distribution
    ts_data <- rsn(num_timepoints, xi=mean(ts_data), omega=sd(ts_data), alpha=group_skewness[g])

    # Store data in list
    data_list[[g]] <- data.frame(
        Group = g,
        Time = 1:num_timepoints,
        Value = ts_data
    )
}

# Combine all group data into a single DataFrame
df <- do.call(rbind, data_list)

# Step 3: Apply Cross-Group Correlation
# Reshape the dataset for correlation application
wide_df <- reshape(df, idvar="Time", timevar="Group", direction="wide")

# Apply correlation across groups at each time step
for (t in 1:num_timepoints) {
    wide_df[t, -1] <- as.numeric(as.matrix(wide_df[t, -1]) %*% chol_decomp)
}

# Convert back to long format correctly
long_df <- reshape(wide_df,
                   varying=colnames(wide_df)[-1],  # Select all group columns
                   v.names="Value",
                   idvar="Time",
                   timevar="Group",
                   times=1:num_groups,
                   direction="long")

# Ensure no unexpected columns
long_df <- long_df[, c("Time", "Group", "Value")]


# Display first few rows
head(long_df)
```

5.  **Evaluate & Compare**

In reality, you might do this comparison inside the partner's environment to confirm your synthetic data is a close match. For demonstration, we'll just compare directly here.

```{r}
# Step 5: Evaluate

# A) Check Means & SDs
synth_means <- colMeans(df_synth[, c("X1","X2","X3","X4")])
synth_sds   <- apply(df_synth[, c("X1","X2","X3","X4")], 2, sd)

cat("Original (Collected) Means:\n", round(means, 3), "\n")
cat("Synthetic Means:\n", round(synth_means, 3), "\n\n")

cat("Original (Collected) SDs:\n", round(sds, 3), "\n")
cat("Synthetic SDs:\n", round(synth_sds, 3), "\n\n")

# B) Check Correlation
synth_cor <- cor(df_synth[, c("X1","X2","X3","X4")])
cat("Original (Collected) Correlation Matrix:\n")
print(round(R, 3))
cat("\nSynthetic Correlation Matrix:\n")
print(round(synth_cor, 3))

```

You should see that the synthetic dataset's means, SDs, and correlation matrix are **very close** to the manually collected values from `df_original`.

```{r}
# Histograms or density plots
par(mfrow = c(2,2))
hist(df_synth$X1, main="X1 Synthetic", col="lightblue", breaks=30)
hist(df_synth$X2, main="X2 Synthetic", col="lightblue", breaks=30)
hist(df_synth$X3, main="X3 Synthetic", col="lightblue", breaks=30)
hist(df_synth$X4, main="X4 Synthetic", col="lightblue", breaks=30)

# Pairwise correlation scatterplots
library(PerformanceAnalytics)
chart.Correlation(df_synth[, c("X1","X2","X3","X4")],
                  histogram=TRUE, pch=19)

```

### Synthpop

The easiest way to create synthetic data is to use the `synthpop` package.

```{r}
library(synthpop)
library(tidyverse)
library(performance)

# library(effectsize)
# library(see)
# library(patchwork)
# library(knitr)
# library(kableExtra)

head(iris)

synthpop::codebook.syn(iris)

syn_df <- syn(iris, seed = 3)

# check for replciated uniques
replicated.uniques(syn_df, iris)


# remove replicated uniques and adds a FAKE_DATA label
# (in case a person can see his or own data in
# the replicated data by chance)

syn_df_sdc <- sdc(syn_df, iris,
                  label = "FAKE_DATA",
                  rm.replicated.uniques = T)
```

```{r, message=FALSE, warning=FALSE}
iris |>
    GGally::ggpairs()

syn_df$syn |>
    GGally::ggpairs()
```

```{r}
lm_ori <- lm(Sepal.Length ~ Sepal.Width + Petal.Length , data = iris)
# performance::check_model(lm_ori)
summary(lm_ori)

lm_syn <- lm(Sepal.Length ~ Sepal.Width + Petal.Length , data = syn_df$syn)
# performance::check_model(lm_syn)
summary(lm_syn)
```

Open data can be assessed for its utility in two distinct ways:

1.  **General Utility**: This refers to the broad resemblances within the dataset, allowing for preliminary data exploration.

2.  **Specific Utility**: This focuses on the comparability of models derived from synthetic and original datasets, emphasizing analytical reproducibility.

For General utility

```{r, eval = FALSE}
compare(syn_df, iris)
```

Specific utility

```{r}
# just like regular lm, but for synthetic data
lm_syn <- lm.synds(Sepal.Length ~ Sepal.Width + Petal.Length , data = syn_df)
compare(lm_syn, iris)

# summary(lm_syn)
```

You basically want your lack-of-fit test to be non-significant.