A step-by-step guideline for data simulation is as follows:
Obtain the distribution parameters for the desired variables.
calc_theory
. If the goal is to mimic an empirical data set,
these values can be found using calc_moments
(using the
method of moments) or calc_fisherk
(using Fisher’s
k-statistics). If the standardized cumulants are obtained from
calc_theory
, the user may need to use rounded values as
inputs (i.e. skews = round(skews, 8)
). Due to the nature of
the integration involved in calc_theory
, the results are
approximations. Greater accuracy can be achieved by increasing the
number of subdivisions (sub
) used in the integration
process. For example, in order to ensure that skew is exactly 0 for
symmetric distributions. For some sets of cumulants, it is either not
possible to find power method constants or the calculated constants do
not generate valid power method pdfs. In these situations, adding a
value to the sixth cumulant may provide solutions (see
find_constants
). If simulation results indicate that a
continuous variable does not generate a valid pdf, the user can try
find_constants
with various sixth cumulant correction
vectors to determine if a valid pdf can be found.stats::dpois
).stats::dnbinom
). The variable
represents the number of failures which occur in a sequence of Bernoulli
trials before the target number of successes is achieved.If continuous variables are desired, verify that the standardized
kurtoses are greater than the lower kurtosis bounds.
These bounds can be calculated using calc_lower_skurt
,
given the skewness (for method
= “Fleishman”) and
standardized fifth and sixth cumulants (for method
=
“Polynomial”, referring to Headrick’s method) for each variable.
Different seeds should be examined to see if a lower boundary can be
found. If a lower bound produces power method constants that yield an
invalid pdf, the user may wish to provide a Skurt
vector of
kurtosis corrections. In this case, calc_lower_skurt
will
attempt to find the smallest value that produces a kurtosis which yields
a valid power method pdf. In addition, if method
=
“Polynomial”, a sixth cumulant correction vector (Six
) may
be used to facilitate convergence of the root-solving algorithm. Since
this step can take considerable computation time, the user may instead
wish to perform this check after simulation if any of the variables have
invalid power method pdfs.
Check if the target correlation matrix falls within the
feasible correlation bounds, given the parameters for
the desired distributions. The ordering of the variables in the
correlation matrix MUST be 1st ordinal, 2nd continuous, 3rd
Poisson, and 4th Negative Binomial. These bounds can be calculated using
either valid_corr
(correlation method 1) or
valid_corr2
(correlation method 2). Note that falling
within these bounds does not guarantee that the target correlation can
be achieved. However, the check can alert the user to pairwise
correlations that obviously fall outside the bounds.
Generate the variables using either correlation
method 1 and rcorrvar
or correlation method 2 and
rcorrvar2
. The user may want to try both to see which gives
a better approximation to the variables and correlation matrix. The
accuracy and simulation time will vary by situation. Again, the
ordering of the variables in the correlation matrix
MUST be 1st ordinal, 2nd continuous, 3rd Poisson, and 4th
Negative Binomial. In addition, the error loop can minimize the
correlation errors in most situations.
Summarize the results numerically. The functions
rcorrvar
and rcorrvar2
provide data.frames
containing summaries by variable type and the maximum error between the
final and target correlation matrices. Additional summary functions
include: sim_cdf_prob
(to calculate a cumulative
probability up to a given continuous y value),
power_norm_corr
(to calculate the correlation between a
continuous variable and the generating standard normal variable),
stats_pdf
(to calculate the 100 * alpha percent symmetric
trimmed mean, median, mode, and maximum height of a valid power method
pdf).
Summarize the results graphically. Comparing the simulated data to the target distribution demonstrates simulation accuracy. The graphing functions provided in this package can be used to display simulated data values, pdfs, or cdfs. The target distributions (either by theoretical distribution name or given an empirical data set) can be added to the data value or pdf plots. Cumulative probabilities can be added to the cdf plots (for continuous variables).
The following example generates 3 continuous, 1 binary, 1 ordinal, 3
Poisson, and 2 Negative Binomial variables. The standardized cumulants
produce power method constants that yield valid pdfs, so no sixth
cumulant corrections are needed. See the Using the Sixth
Cumulant Correction to Find Valid Power Method Pdfs vignette
for examples of using the sixth cumulant correction. (Note that the
printr
Xie (2017) package is
invoked to display the tables.)
The continuous variables come from the following distributions:
The ordinal variables have the following cumulative probabilities:
The last probability in each case is assumed to be 1, and should not be included.
The Poisson variables have the following lambda
(mean, lam
) values:
The Negative Binomial variables have the following sizes and success probabilities:
Either success probabilities (prob
) or means
(mu
) should be given for all variables.
library("SimMultiCorrData")
library("printr")
# Turn off scientific notation
options(scipen = 999)
# Set seed and sample size
seed <- 11
n <- 10000
# Continuous Distributions
Dist <- c("Gaussian", "Chisq", "Beta")
# Calculate standardized cumulants
# Those for the normal distribution are rounded to ensure the correct values
# are obtained.
M1 <- round(calc_theory(Dist = "Gaussian", params = c(0, 1)), 8)
M2 <- calc_theory(Dist = "Chisq", params = 4)
M3 <- calc_theory(Dist = "Beta", params = c(4, 2))
M <- cbind(M1, M2, M3)
# Binary and Ordinal Distributions
marginal <- list(c(0.3, 0.75), c(0.2, 0.5, 0.9))
support <- list() # default support will be generated inside simulation
# Poisson Distributions
lam <- c(1, 5, 10)
# Negative Binomial Distributions
size <- c(3, 6)
prob <- c(0.2, 0.8)
ncat <- length(marginal)
ncont <- ncol(M)
npois <- length(lam)
nnb <- length(size)
# Create correlation matrix from a uniform distribution (0.2, 0.7)
set.seed(seed)
Rey <- diag(1, nrow = (ncat + ncont + npois + nnb))
for (i in 1:nrow(Rey)) {
for (j in 1:ncol(Rey)) {
if (i > j) Rey[i, j] <- runif(1, 0.2, 0.7)
Rey[j, i] <- Rey[i, j]
}
}
# Check to see if Rey is positive-definite
min(eigen(Rey, symmetric = TRUE)$values) < 0
## [1] FALSE
Since this step takes considerable computation time, the user may wish to calculate these after simulation.
Lower <- list()
# list of standardized kurtosis values to add in case only invalid power
# method pdfs are produced
Skurt <- list(seq(0.5, 2, 0.5), seq(0.02, 0.05, 0.01), seq(0.02, 0.05, 0.01))
start.time <- Sys.time()
for (i in 1:ncol(M)) {
Lower[[i]] <- calc_lower_skurt(method = "Polynomial", skews = M[3, i],
fifths = M[5, i], sixths = M[6, i],
Skurt = Skurt[[i]], seed = 104)
}
## Only invalid pdf constants could be found.
## Try more starting values (increase n)
## or a different seed or Skurt vector.
## Also verify standardized cumulant values.
stop.time <- Sys.time()
Time <- round(difftime(stop.time, start.time, units = "min"), 3)
cat("Total computation time:", Time, "minutes \n")
## Total computation time: 0.492 minutes
In each case, the lower kurtosis boundary calculated from the
original Lagrangean constraint equations (see
poly_skurt_check
) generates constants that yield an invalid
power method pdf. This is indicated by the fact that each
Invalid.C
data.frame contains solutions (i.e. see
Lower[[2]]$Invalid.C
).
For Distributions 2 and 3, lower kurtosis values that generate
constants that yield valid power method pdfs could be found by adding
the values displayed in SkurtCorr1
to the original lower
kurtosis boundary. For Distribution 1 (Normal), no kurtosis addition (of
those specified in Skurt
) generated constants that yield a
valid pdf. This does not cause a problem since the simulated variable
has a valid power method pdf.
Look at lower kurtosis boundaries and sixth cumulant corrections:
as.matrix(Lower[[1]]$Min[1, c("skew", "fifth", "sixth", "valid.pdf",
"skurtosis")],
nrow = 1, ncol = 5, byrow = TRUE)
skew | fifth | sixth | valid.pdf | skurtosis | |
---|---|---|---|---|---|
5 | 0 | 0 | 0 | FALSE | -0.2829743 |
Note that valid.pdf = FALSE
, which means that a kurtosis
correction could not be found that yielded constants that produce a
valid power method pdf. The original lower kurtosis boundary (see
Lower[[1]]$Min
) is -0.282974. The standardized kurtosis for
the distribution (0) falls above this boundary.
as.matrix(Lower[[2]]$Min[1, c("skew", "fifth", "sixth", "valid.pdf",
"skurtosis")],
nrow = 1, ncol = 5, byrow = TRUE)
skew | fifth | sixth | valid.pdf | skurtosis |
---|---|---|---|---|
1.414214 | 8.485281 | 30 | TRUE | 3.005959 |
## [1] 0.03
The original lower kurtosis boundary (see
Lower[[2]]$Invalid.C
) of 2.975959 has been increased to
3.005959, so that the kurtosis correction is 0.03. The standardized
kurtosis for the distribution (3) is approximately equal to this
boundary. This does not cause a problem since the simulated variable has
a valid power method pdf.
as.matrix(Lower[[3]]$Min[1, c("skew", "fifth", "sixth", "valid.pdf",
"skurtosis")],
nrow = 1, ncol = 5, byrow = TRUE)
skew | fifth | sixth | valid.pdf | skurtosis |
---|---|---|---|---|
-0.4677072 | 1.403122 | -0.4261364 | TRUE | -0.3722661 |
## [1] 0.02
The original lower kurtosis boundary (see
Lower[[3]]$Invalid.C
) of -0.392266 has been increased to
-0.372266, so that the kurtosis correction is 0.02. The standardized
kurtosis for the distribution (-0.2727) falls above this boundary.
The remaining steps vary by simulation method:
# Make sure Rey is within upper and lower correlation limits
valid <- valid_corr(k_cat = ncat, k_cont = ncont, k_pois = npois,
k_nb = nnb, method = "Polynomial", means = M[1, ],
vars = (M[2, ])^2, skews = M[3, ], skurts = M[4, ],
fifths = M[5, ], sixths = M[6, ], marginal = marginal,
lam = lam, size = size, prob = prob, rho = Rey,
seed = seed)
##
## Constants: Distribution 1
##
## Constants: Distribution 2
##
## Constants: Distribution 3
## All correlations are in feasible range!
Simulate variables without the error loop.
A <- rcorrvar(n = 10000, k_cont = ncont, k_cat = ncat, k_pois = npois,
k_nb = nnb, method = "Polynomial", means = M[1, ],
vars = (M[2, ])^2, skews = M[3, ], skurts = M[4, ],
fifths = M[5, ], sixths = M[6, ], marginal = marginal,
lam = lam, size = size, prob = prob, rho = Rey, seed = seed)
##
## Constants: Distribution 1
##
## Constants: Distribution 2
##
## Constants: Distribution 3
##
## Constants calculation time: 0 minutes
## Intercorrelation calculation time: 0.045 minutes
## Error loop calculation time: 0 minutes
## Total Simulation time: 0.046 minutes
Summarize the correlation errors:
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
-0.01898 | -0.000629 | 0.0004595 | 0.0024193 | 0.003966 | 0.057726 |
Simulate variables with the error loop (using default settings of
epsilon
= 0.001 and maxit
= 1000).
B <- rcorrvar(n = 10000, k_cont = ncont, k_cat = ncat, k_pois = npois,
k_nb = nnb, method = "Polynomial", means = M[1, ],
vars = (M[2, ])^2, skews = M[3, ], skurts = M[4, ],
fifths = M[5, ], sixths = M[6, ], marginal = marginal,
lam = lam, size = size, prob = prob, rho = Rey, seed = seed,
errorloop = TRUE)
##
## Constants: Distribution 1
##
## Constants: Distribution 2
##
## Constants: Distribution 3
##
## Constants calculation time: 0 minutes
## Intercorrelation calculation time: 0.045 minutes
## Error loop calculation time: 0.15 minutes
## Total Simulation time: 0.196 minutes
Summarize the correlation errors:
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
-0.009503 | -0.000196 | 0.0003535 | 0.0009695 | 0.002282 | 0.011725 |
Based on the interquartile range, the simulation utilizing the error loop will be chosen for subsequent analysis.
1) Ordinal variables
Target | Simulated |
---|---|
0.30 | 0.3016 |
0.75 | 0.7500 |
1.00 | 1.0000 |
Target | Simulated |
---|---|
0.2 | 0.2028 |
0.5 | 0.4970 |
0.9 | 0.9002 |
1.0 | 1.0000 |
2) Count variables
Poisson variables: Note the expected means and variances are also given.
Distribution | mean | Exp_mean | var | Exp_var | min | max |
---|---|---|---|---|---|---|
1 | 0.9998 | 1 | 1.019102 | 1 | 0 | 10 |
2 | 5.0002 | 5 | 4.982898 | 5 | 0 | 15 |
3 | 9.9982 | 10 | 9.991796 | 10 | 0 | 25 |
Negative Binomial variables:
Distribution | success_prob | mean | Exp_mean | var | Exp_var | min | max |
---|---|---|---|---|---|---|---|
1 | 0.2 | 12.0050 | 12.0 | 60.187994 | 60.000 | 0 | 58 |
2 | 0.8 | 1.4988 | 1.5 | 1.879587 | 1.875 | 0 | 12 |
3) Continuous variables
Constants:
c0 | c1 | c2 | c3 | c4 | c5 |
---|---|---|---|---|---|
0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
-0.227508 | 0.900716 | 0.231610 | 0.015466 | -0.001367 | 0.000055 |
0.108304 | 1.104253 | -0.123347 | -0.045284 | 0.005014 | 0.001285 |
Target distributions:
Distribution | mean | sd | skew | skurtosis | fifth | sixth | |
---|---|---|---|---|---|---|---|
M1 | 1 | 0.00000 | 1.00000 | 0.00000 | 0.000 | 0.00000 | 0.00000 |
M2 | 2 | 4.00000 | 2.82843 | 1.41421 | 3.000 | 8.48528 | 30.00000 |
M3 | 3 | 0.66667 | 0.17817 | -0.46771 | -0.375 | 1.40312 | -0.42614 |
Simulated distributions:
as.matrix(round(B$summary_continuous[, c("Distribution", "mean", "sd",
"skew", "skurtosis", "fifth",
"sixth")], 5), nrow = 3, ncol = 7,
byrow = TRUE)
Distribution | mean | sd | skew | skurtosis | fifth | sixth | |
---|---|---|---|---|---|---|---|
X1 | 1 | 0.00000 | 1.00000 | 0.01164 | 0.00722 | 0.07345 | 0.47773 |
X2 | 2 | 4.00174 | 2.83894 | 1.40359 | 2.88806 | 7.46541 | 20.04478 |
X3 | 3 | 0.66676 | 0.17804 | -0.48268 | -0.32270 | 1.37467 | -0.82989 |
Valid power method pdf check:
## [1] "TRUE" "TRUE" "TRUE"
All continuous variables have valid power method pdfs. We can compute
additional summary statistics:
1) Normal($\Large 0,1$)
Distribution
trimmed_mean | median | mode | max_height |
---|---|---|---|
0 | 0 | 0 | 0.3989 |
trimmed_mean | median | mode | max_height |
---|---|---|---|
-0.0536 | -0.2275 | -0.7095 | 0.5203 |
trimmed_mean | median | mode | max_height |
---|---|---|---|
0.0215 | 0.1083 | 0.4517 | 0.3745 |
Look at the Chisq ($\Large df = 4$) distribution ($\Large 2^{nd}$ continuous variable):
Look at the empirical cdf of the $\Large 2^{nd}$ ordinal distribution:
Look at the Poisson ($\Large \lambda = 5$) distribution ($\Large 2^{nd}$ Poisson variable):
Look at the Negative Binomial ($\Large size = 3,\ prob = 0.2$) distribution ($\Large 1^{st}$ Negative Binomial variable):
plot_sim_theory(B$Neg_Bin_variables[, 1], cont_var = FALSE,
Dist = "Negative_Binomial", params = c(3, 0.2))
Method 2 requires cumulative probability truncation vectors for the
count variables (pois_eps
for Poisson and
nb_eps
for Negative Binomial). Each entry is the amount
removed from the total cumulative probability when creating a finite
support for that variable (see max_count_support
). The
entries may vary by variable.
pois_eps <- rep(0.0001, npois)
nb_eps <- rep(0.0001, nnb)
# Make sure Rey is within upper and lower correlation limits
valid2 <- valid_corr2(k_cat = ncat, k_cont = ncont, k_pois = npois,
k_nb = nnb, method = "Polynomial", means = M[1, ],
vars = (M[2, ])^2, skews = M[3, ], skurts = M[4, ],
fifths = M[5, ], sixths = M[6, ], marginal = marginal,
lam = lam, pois_eps = pois_eps, size = size,
prob = prob, nb_eps = nb_eps, rho = Rey, seed = seed)
##
## Constants: Distribution 1
##
## Constants: Distribution 2
##
## Constants: Distribution 3
## All correlations are in feasible range!
Simulate variables without the error loop.
C <- rcorrvar2(n = 10000, k_cont = ncont, k_cat = ncat, k_pois = npois,
k_nb = nnb, method = "Polynomial", means = M[1, ],
vars = (M[2, ])^2, skews = M[3, ], skurts = M[4, ],
fifths = M[5, ], sixths = M[6, ], marginal = marginal,
lam = lam, pois_eps = pois_eps, size = size, prob = prob,
nb_eps = nb_eps, rho = Rey, seed = seed)
##
## Constants: Distribution 1
##
## Constants: Distribution 2
##
## Constants: Distribution 3
##
## Constants calculation time: 0 minutes
## Intercorrelation calculation time: 0.03 minutes
## Error loop calculation time: 0 minutes
## Total Simulation time: 0.03 minutes
Summarize the correlation errors:
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
-0.018342 | -0.002686 | 0.000006 | 0.0017004 | 0.003214 | 0.055076 |
These results indicate that for these distributions, Correlation Method 1 and Correlation Method 2 have similar correlation errors.
Simulate variables with the error loop (using default settings of
epsilon
= 0.001 and maxit
= 1000).
D <- rcorrvar2(n = 10000, k_cont = ncont, k_cat = ncat, k_pois = npois,
k_nb = nnb, method = "Polynomial", means = M[1, ],
vars = (M[2, ])^2, skews = M[3, ], skurts = M[4, ],
fifths = M[5, ], sixths = M[6, ], marginal = marginal,
lam = lam, pois_eps = pois_eps, size = size, prob = prob,
nb_eps = nb_eps, rho = Rey, seed = seed, errorloop = TRUE)
##
## Constants: Distribution 1
##
## Constants: Distribution 2
##
## Constants: Distribution 3
##
## Constants calculation time: 0 minutes
## Intercorrelation calculation time: 0.029 minutes
## Error loop calculation time: 0.167 minutes
## Total Simulation time: 0.198 minutes
Summarize the correlation errors:
Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
---|---|---|---|---|---|
-0.005901 | -0.001527 | 0 | -0.0001389 | 0.001189 | 0.007552 |
Based on the interquartile range, the simulation utilizing the error loop will be chosen for subsequent analysis.
1) Ordinal variables
Target | Simulated |
---|---|
0.30 | 0.3051 |
0.75 | 0.7504 |
1.00 | 1.0000 |
Target | Simulated |
---|---|
0.2 | 0.2029 |
0.5 | 0.4973 |
0.9 | 0.9023 |
1.0 | 1.0000 |
2) Count variables
Poisson variables: Note the expected means and variances are also given.
Distribution | mean | Exp_mean | var | Exp_var | min | max |
---|---|---|---|---|---|---|
1 | 0.9992 | 1 | 1.018301 | 1 | 0 | 10 |
2 | 5.0027 | 5 | 4.981391 | 5 | 0 | 15 |
3 | 9.9982 | 10 | 10.009198 | 10 | 0 | 25 |
Negative Binomial variables:
Distribution | success_prob | mean | Exp_mean | var | Exp_var | min | max |
---|---|---|---|---|---|---|---|
1 | 0.2 | 11.9998 | 12.0 | 60.223222 | 60.000 | 0 | 61 |
2 | 0.8 | 1.5003 | 1.5 | 1.865787 | 1.875 | 0 | 12 |
3) Continuous variables
The constants are the same for both methods.
Target distributions:
Distribution | mean | sd | skew | skurtosis | fifth | sixth | |
---|---|---|---|---|---|---|---|
M1 | 1 | 0.00000 | 1.00000 | 0.00000 | 0.000 | 0.00000 | 0.00000 |
M2 | 2 | 4.00000 | 2.82843 | 1.41421 | 3.000 | 8.48528 | 30.00000 |
M3 | 3 | 0.66667 | 0.17817 | -0.46771 | -0.375 | 1.40312 | -0.42614 |
Simulated distributions:
as.matrix(round(D$summary_continuous[, c("Distribution", "mean", "sd",
"skew", "skurtosis", "fifth",
"sixth")], 5), nrow = 3, ncol = 7,
byrow = TRUE)
Distribution | mean | sd | skew | skurtosis | fifth | sixth | |
---|---|---|---|---|---|---|---|
X1 | 1 | 0.00000 | 1.00000 | 0.01169 | 0.00718 | 0.07684 | 0.48806 |
X2 | 2 | 4.00186 | 2.83899 | 1.39556 | 2.81246 | 6.97246 | 17.38131 |
X3 | 3 | 0.66677 | 0.17806 | -0.48314 | -0.32449 | 1.37948 | -0.82019 |
Valid power method pdf check:
## [1] "TRUE" "TRUE" "TRUE"
All continuous variables have valid power method pdfs. We can compute
additional summary statistics:
1) Normal($\Large 0,1$)
Distribution
trimmed_mean | median | mode | max_height |
---|---|---|---|
0 | 0 | 0 | 0.3989 |
trimmed_mean | median | mode | max_height |
---|---|---|---|
-0.0536 | -0.2275 | -0.7095 | 0.5203 |
trimmed_mean | median | mode | max_height |
---|---|---|---|
0.0215 | 0.1083 | 0.4517 | 0.3745 |
Since the methods vary primarily according to the calculation of the intermediate correlation for count variables, we will only look at the distributions of two count variables.
Look at the Poisson ($\Large \lambda = 5$) distribution ($\Large 2^{nd}$ Poisson variable):
Look at the Negative Binomial ($\Large size = 3,\ prob = 0.2$) distribution ($\Large 1^{st}$ Negative Binomial variable):
plot_sim_theory(D$Neg_Bin_variables[, 1], cont_var = FALSE,
Dist = "Negative_Binomial", params = c(3, 0.2))