SimMultiCorrData generates continuous (normal or non-normal), binary, ordinal, and count (Poisson or Negative Binomial) variables with a specified correlation matrix. It can also produce a single continuous variable. This package can be used to simulate data sets that mimic real-world situations (e.g. clinical data sets or plasmodes, as in Vaughan et al. (2009)). All variables are generated from standard normal variables with an imposed intermediate correlation matrix. Continuous variables are simulated by specifying mean, variance, skewness, standardized kurtosis, and fifth and sixth standardized cumulants using either Fleishman’s Third-Order (1978) or Headrick’s Fifth-Order (2002) Polynomial Transformation. Binary and ordinal variables are simulated using a modification of GenOrd::ordsample (Barbiero and Ferrari 2015a). Count variables are simulated using the inverse cdf method. There are two simulation pathways, which differ primarily in how the intermediate correlation matrix Sigma is calculated. In Correlation Method 1, the intercorrelations involving count variables are determined using a simulation-based, logarithmic correlation correction (adapting the method of Yahav and Shmueli (2012)). In Correlation Method 2, the count variables are treated as ordinal (adapting Barbiero and Ferrari’s (2015b) modification of GenOrd). There is an optional error loop that corrects the final correlation matrix to be within a user-specified precision value. The package also includes functions to calculate standardized cumulants for theoretical distributions or from real data sets, check whether a target correlation matrix is within the possible correlation bounds (given the distributions of the simulated variables), summarize results (numerically or graphically), verify valid power method pdfs, and calculate lower standardized kurtosis bounds.
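The two marginal transformations named above can be illustrated with a minimal sketch. It is written in Python with only the standard library and is not the package's R code; the function names (fleishman_transform, fleishman_residuals, poisson_quantile, count_from_normal) are hypothetical. It assumes the standard form of Fleishman's (1978) equations for a unit-variance polynomial and a plain cumulative-sum search for the Poisson quantile.

```python
import math
from statistics import NormalDist


def fleishman_transform(z, b, c, d):
    """Fleishman third-order polynomial Y = a + b*Z + c*Z^2 + d*Z^3 with a = -c."""
    return -c + b * z + c * z**2 + d * z**3


def fleishman_residuals(b, c, d, skew, skurt):
    """Residuals of Fleishman's equations (variance 1, target skew, target
    standardized kurtosis). All three are zero when (b, c, d) are valid constants."""
    variance_eq = b**2 + 6 * b * d + 2 * c**2 + 15 * d**2 - 1
    skew_eq = 2 * c * (b**2 + 24 * b * d + 105 * d**2 + 2) - skew
    skurt_eq = 24 * (
        b * d
        + c**2 * (1 + b**2 + 28 * b * d)
        + d**2 * (12 + 48 * b * d + 141 * c**2 + 225 * d**2)
    ) - skurt
    return variance_eq, skew_eq, skurt_eq


def poisson_quantile(u, lam, max_k=100_000):
    """Inverse cdf: smallest k with P(X <= k) >= u for X ~ Poisson(lam)."""
    u = min(u, 1.0 - 1e-12)  # guard against u == 1, which has no finite quantile
    k = 0
    pmf = math.exp(-lam)
    cdf = pmf
    while cdf < u and k < max_k:
        k += 1
        pmf *= lam / k  # P(X = k) from P(X = k - 1)
        cdf += pmf
    return k


def count_from_normal(z, lam):
    """Map a standard normal draw to a Poisson count via u = Phi(z)."""
    return poisson_quantile(NormalDist().cdf(z), lam)


# The constants b = 1, c = d = 0 leave Z unchanged and satisfy the equations
# for skew = 0 and standardized kurtosis = 0 (the normal case).
print(fleishman_residuals(1.0, 0.0, 0.0, skew=0.0, skurt=0.0))
print(count_from_normal(0.0, lam=2.0))
```

Because both transformations start from the same standard normal draws, correlation imposed on the normal variables carries over, in attenuated form, to the transformed variables. That is why an intermediate correlation matrix is needed.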
The main strengths of SimMultiCorrData are:
The user may generate correlated continuous (normal or non-normal), ordinal (r >= 2 categories), Poisson and/or Negative Binomial variables simultaneously, based on either theoretical distributions or empirical data.
Two distinct methods for generating non-normal continuous variables: Fleishman’s third-order or Headrick’s fifth-order polynomial transformation.
Two distinct methods for generating count variables (see Comparison of Correlation Method 1 and Correlation Method 2 vignette). The user may test each to see which yields greater simulation accuracy.
Calculation of the precise lower kurtosis boundary using the Lagrangean constraint equations, instead of an approximation (see calc_lower_skurt).
Valid power method pdf checks during the calculation of the constants for continuous variables, and optional use of a sixth cumulant correction value to enable the discovery of valid pdf constants.
Computation of feasible correlation bounds based on data simulation method (see valid_corr for correlation method 1 or valid_corr2 for correlation method 2).
Numerous attempts to reproduce the desired correlation matrix, including correcting for non-positive-definite intermediate correlation matrices and an optional final error loop (see Overview of Error Loop vignette). This error loop enables reproduction of many correlation structures that cannot be achieved through other methods.
Function arguments (e.g. seed, n, maxit, epsilon) that allow the user to have greater control over simulation accuracy, speed, and reproducibility.
Detailed simulation results, including the simulation time (in minutes) and descriptions of the generated variables and the correlation structure.
Additional functions to supplement the simulation process:
Calculation of standardized cumulants for a theoretical distribution (calc_theory) or a vector of data by the method of moments (calc_moments) or based on Fisher’s k-statistics (calc_fisherk). Additional summary functions compute important statistics for the generated continuous variables.
Graphing functions return ggplot2 objects so that the user may save them or further adapt the graphs as necessary.
There are several other simulation packages, for example Barbiero and Ferrari’s (2015a) GenOrd, Amatya and Demirtas’ (2016a) MultiOrd, Leisch, Kaiser, and Hornik’s (2010) orddata, and Demirtas, Nordgren, and Allozi’s (2017) PoisBinOrdNonNor. The first three generate only binary and ordinal data, while the last generates Poisson, binary, ordinal, and non-normal variables.
GenOrd
GenOrd generates discrete random variables (i.e. binary or ordinal) with a given correlation matrix and marginal distributions. The method used to determine the intermediate MVN correlation matrix in GenOrd::ordcont has been modified in SimMultiCorrData’s ordnorm function. It works by initially setting the intermediate correlation equal to the target correlation of the discrete variables. Each intermediate pairwise correlation is then updated until the final pairwise correlation is within a user-specified precision value (epsilon) of the target correlation or the maximum number of iterations (maxit) has been reached.
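The iteration described above can be sketched for a single pair of binary variables. This is a Python illustration using only the standard library, not the code of ordcont or ordnorm. The helper names (bvn_cdf, binary_corr, intermediate_corr) are hypothetical, the multiplicative update rule is an assumption chosen for simplicity, and the bivariate normal probability is obtained by numerically integrating the bivariate density over the correlation (Plackett's identity). No random numbers are generated.

```python
import math
from statistics import NormalDist

_ND = NormalDist()


def bvn_cdf(a, b, rho, steps=200):
    """P(Z1 <= a, Z2 <= b) for a standard bivariate normal with correlation rho.

    Uses Phi2(a, b; rho) = Phi(a) * Phi(b) + integral_0^rho phi2(a, b; r) dr,
    evaluated with Simpson's rule (steps must be even, |rho| < 1).
    """
    def density(r):
        q = (a * a - 2.0 * r * a * b + b * b) / (2.0 * (1.0 - r * r))
        return math.exp(-q) / (2.0 * math.pi * math.sqrt(1.0 - r * r))

    h = rho / steps
    total = density(0.0) + density(rho)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    return _ND.cdf(a) * _ND.cdf(b) + total * h / 3.0


def binary_corr(p1, p2, rho):
    """Correlation of two binary variables obtained by cutting normals at the
    p1 and p2 quantiles, where p1 and p2 are the probabilities of category 0."""
    a, b = _ND.inv_cdf(p1), _ND.inv_cdf(p2)
    p00 = bvn_cdf(a, b, rho)
    return (p00 - p1 * p2) / math.sqrt(p1 * (1 - p1) * p2 * (1 - p2))


def intermediate_corr(p1, p2, target, epsilon=1e-4, maxit=1000):
    """Start at the target, then rescale until the achieved binary correlation
    is within epsilon of the target or maxit iterations have been used."""
    rho = target
    for _ in range(maxit):
        achieved = binary_corr(p1, p2, rho)
        if abs(achieved - target) < epsilon:
            break
        # Assumed multiplicative update, clipped to keep |rho| < 1.
        rho = max(-0.999, min(0.999, rho * target / achieved))
    return rho


print(intermediate_corr(0.5, 0.5, target=0.3))
```

For two balanced binary variables the achieved correlation has the closed form (2/pi) * asin(rho), so with a target of 0.3 the result should be close to sin(0.15 * pi), roughly 0.454. The intermediate normal correlation has to be noticeably larger than the target.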
GenOrd::ordcont has been modified in the following ways:
SimMultiCorrData::valid_corr or valid_corr2.
Sigma for all variable types, and if necessary, Sigma is converted to the nearest positive-definite matrix using Higham’s (2002) algorithm in Matrix::nearPD.
SimMultiCorrData::ordnorm uses GenOrd::contord to calculate the ordinal correlation obtained from discretizing the normal variables generated from the intermediate correlation matrix Sigma. This is because the function does not require random generation of the normal variables, which ensures greater reproducibility.
SimMultiCorrData also improves the way ordinal variables are generated, as compared to GenOrd::ordsample:
SimMultiCorrData::rcorrvar and rcorrvar2 allow a user-specified seed, maximum number of iterations, and epsilon value.
GenOrd::ordsample stops if the intermediate correlation matrix Sigma is not positive-definite. As described above, SimMultiCorrData attempts to correct the problem and a warning is given that it may not be possible to produce the desired correlation matrix.
MultiOrd
MultiOrd generates multivariate ordinal data with given correlation matrix and marginal distributions via the binary conversion method of Demirtas (2006). This method computes the binary marginals by collapsing the marginal distributions of the ordinal variables. The intermediate correlation matrix is also computed through an iterative process based on matching the target matrix. Binary data are then converted to ordinal data through a randomization step. This procedure requires the simulation of large samples of binary data in order to maximize accuracy, which requires greater computational time and resources than the methods used in SimMultiCorrData.
orddata
orddata generates binary and ordinal data through 4 available methods:
PoisBinOrdNonNor
PoisBinOrdNonNor is one in an extensive series of simulation packages created by Demirtas with additional authors. Other packages include OrdNor (Amatya and Demirtas 2015), BinNonNor (Inan and Demirtas 2016a), BinOrdNonNor (Demirtas, Wang, and Allozi 2017), PoisBinOrd (Inan and Demirtas 2016b), PoisNor (Amatya and Demirtas 2016b), and PoisBinOrdNor (Demirtas, Hu, and Allozi 2017). PoisBinOrdNonNor generates Poisson, binary, ordinal, and non-normal variables. It differs from SimMultiCorrData in the following ways:
SimMultiCorrData’s simulation functions rcorrvar and rcorrvar2 allow the user either to provide an intermediate matrix or to have the matrix calculated during the simulation.
SimMultiCorrData. However, PoisBinOrdNonNor does not produce Negative Binomial variables.
SimMultiCorrData. However, those for ordinal variables are found using ordcont, which, as previously mentioned, will stop if the intermediate matrix is not positive-definite.
SimMultiCorrData contains the functions power_norm_corr and pdf_check. The function that solves for the constants (SimMultiCorrData::find_constants) executes these checks when finding the constants and attempts to produce valid pdf constants. In the case of Headrick’s fifth-order method, the user may specify a sixth cumulant correction value to help find these constants.
The lower kurtosis boundary used in PoisBinOrdNonNor is a simple approximation: $\text{standardized kurtosis} \ge \text{skew}^2 - 2$. SimMultiCorrData::calc_lower_skurt solves the Lagrangean expressions (as described in Headrick (2002) and Headrick and Sawilowsky (2002)) that determine the precise lower kurtosis boundary. Examination of the boundaries computed in PoisBinOrdNonNor demonstrates that the approximate boundaries are much lower than the actual Fleishman boundaries, indicating that the guideline is not accurate (see calc_lower_skurt for examples).
PoisBinOrdNonNor does not allow the user to specify a seed for random number generation, or an epsilon value and maximum number of iterations to use when determining the intermediate ordinal correlations. These specifications, as found in SimMultiCorrData’s simulation functions rcorrvar and rcorrvar2, are essential for reproducibility and controlling accuracy.
SimMultiCorrData’s simulation functions produce detailed summaries of the variables, the final correlation matrix, the maximum error between the final and target correlation matrices, and the simulation time.
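The approximate kurtosis guideline discussed in the list above is simple enough to state as code. The following is a minimal Python sketch of that guideline only; the function names are hypothetical, and it does not reproduce the Lagrangean calculation performed by calc_lower_skurt.

```python
def approx_lower_skurt(skew):
    """Approximate lower bound on standardized kurtosis: skew^2 - 2."""
    return skew**2 - 2


def passes_guideline(skew, skurt):
    """True if (skew, skurt) satisfies the approximate guideline.

    Passing this check is necessary for any distribution, but by the
    comparison above it is not sufficient for a valid Fleishman pdf,
    whose actual lower boundary is higher.
    """
    return skurt >= approx_lower_skurt(skew)


# Exponential distribution: skew = 2, standardized kurtosis = 6.
print(passes_guideline(2.0, 6.0))
# A pair below the approximate bound (2^2 - 2 = 2 > 1).
print(passes_guideline(2.0, 1.0))
```

A pair that fails this check cannot be simulated at all, while a pair that passes may still lie below the precise boundary for the chosen polynomial method.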