All variables are generated via the appropriate transformation of standard normal variables, as described below. See the documentation for corrvar and corrvar2 for more details about all inputs. The parameter inputs should be checked first using validpar. Summaries of simulated variables can be obtained using summary_var. Some code has been modified from the SimMultiCorrData package (Fialkowski 2018).
1) Continuous Variables: Continuous variables are simulated using either Fleishman (1978)’s third-order (method = “Fleishman”) or Headrick (2002)’s fifth-order (method = “Polynomial”) power method transformation (PMT). This is a computationally efficient algorithm that simulates continuous distributions through the method of moments. It works by matching standardized cumulants: the first four (mean, variance, skew, and standardized kurtosis) for Fleishman’s third-order method, or the first six (mean, variance, skew, standardized kurtosis, and standardized fifth and sixth cumulants) for Headrick’s fifth-order method. The transformation is expressed as follows:

$$Y = c_0 + c_1 Z + c_2 Z^2 + c_3 Z^3 + c_4 Z^4 + c_5 Z^5, \qquad Z \sim N(0, 1),$$
where c_4 and c_5 both equal 0 for Fleishman’s method. The real constants are calculated by SimMultiCorrData::find_constants, which solves the system of equations given in SimMultiCorrData::fleish or SimMultiCorrData::poly. All variables are simulated with mean 0 and variance 1, and then transformed to the specified mean and variance at the end.
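As an illustration, here is a minimal sketch of the fifth-order PMT for a standardized Logistic(0, 1) variable. The cumulant values (skew 0, standardized kurtosis 1.2, fifth cumulant 0, sixth cumulant 48/7) are standard results for the logistic distribution, and find_constants is assumed to return its solution in a constants component (see the SimMultiCorrData documentation):

```r
# Sketch: simulate one continuous variable via Headrick's fifth-order PMT
library(SimMultiCorrData)

# Solve for the power method constants c0, ..., c5 (assumed returned in
# con$constants; verify against ?find_constants)
con <- find_constants(method = "Polynomial", skews = 0, skurts = 1.2,
                      fifths = 0, sixths = 48/7)

# Apply Y = c0 + c1*Z + c2*Z^2 + c3*Z^3 + c4*Z^4 + c5*Z^5 to standard normals
set.seed(1234)
Z <- rnorm(10000)
cs <- as.numeric(con$constants)
Y <- cs[1] + cs[2] * Z + cs[3] * Z^2 + cs[4] * Z^3 + cs[5] * Z^4 + cs[6] * Z^5
# Y has approximately mean 0 and variance 1; rescale as mu + sigma * Y
```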
Continuous Mixture Variables: Mixture distributions describe random variables that are drawn from more than one component distribution. For a random variable Y from a finite continuous mixture distribution with k components, the probability density function (PDF) can be described by:

$$h_Y(y) = \sum_{i=1}^{k} \pi_i f_{Y_i}(y), \qquad \sum_{i=1}^{k} \pi_i = 1.$$

The π_i are mixing parameters which determine the weight of each component distribution f_{Y_i}(y) in the overall probability distribution. As long as each component has a valid PDF, the overall distribution h_Y(y) has a valid PDF. The main assumption is statistical independence between the process of randomly selecting the component distribution and the distributions themselves: assume there is a random selection process that first generates the numbers 1, ..., k with probabilities π_1, ..., π_k. After selecting number i, where 1 ≤ i ≤ k, a random variable y is drawn from component distribution f_{Y_i}(y).
Mixture distributions provide a useful way of describing heterogeneity in a population, especially when an outcome is a composite response from multiple sources. They may also be used to model distributions with outliers. For example, the contaminated normal outlier distribution is used when a proportion of measurement errors, which are usually modeled as normal variables with zero mean and common variance σ², have larger variability. These distributions can be thought of as mixture distributions with contamination percentage π ∈ (0, 1). For example, the PDF of Y may be given by:

$$h_Y(y) = (1 - \pi)\, \phi_{(0,\, \sigma^2)}(y) + \pi\, \phi_{(0,\, \sigma_o^2)}(y), \qquad \sigma_o^2 > \sigma^2.$$

Here, φ_{(μ, σ²)}(y) is the PDF of the normal distribution with mean μ and variance σ² (Davenport, Bezdek, and Hathaway 1988; Schork, Allison, and Thiel 1996; Everitt 1996; Pearson 2011).
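For illustration, a contaminated normal can be generated directly in base R as a two-component mixture (a minimal sketch; the 10% contamination rate and the component variances are arbitrary choices):

```r
# Contaminated normal: errors are N(0, 1), but 10% come from N(0, 25)
set.seed(1)
n <- 10000
pi_c <- 0.1                              # contamination percentage
contaminated <- rbinom(n, 1, pi_c)       # component selection process
y <- ifelse(contaminated == 1, rnorm(n, 0, 5), rnorm(n, 0, 1))
```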
Continuous mixture variables are generated at the component level. Each component variable is created using the PMT, and the mixture variable is generated from these based on a random multinomial variable described by the mixing probabilities. Correlations are also controlled at the component level. Users specify the target pairwise correlations between the components across continuous mixture variables and between components and other types of variables.
The function rho_M1M2 approximates the expected correlation between two mixture variables M1 and M2 based on the component distributions and the correlations between components. The function rho_M1Y approximates the expected correlation between a mixture variable M1 and another random variable Y, which may have an ordinal, continuous, or count distribution. If the user desires a specific correlation between a mixture variable M1 and another simulated variable M2 or Y, various component-level correlations can be tried in these two functions to see whether the desired correlation can be achieved. See the Expected Cumulants and Correlations for Continuous Mixture Variables vignette for more details on continuous mixture variables.
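A hypothetical call to rho_M1M2 might look like the following sketch for two normal mixtures (all component skews and higher cumulants are 0). The argument names are as recalled from the package documentation, so verify against ?rho_M1M2 before use:

```r
library(SimCorrMix)
# M1 = 0.4*N(-2, 1) + 0.6*N(2, 1); M2 = 0.3*N(0, 1) + 0.7*N(5, 4)
mix_pis    <- list(c(0.4, 0.6), c(0.3, 0.7))
mix_mus    <- list(c(-2, 2), c(0, 5))
mix_sigmas <- list(c(1, 1), c(1, 2))
mix_skews  <- list(c(0, 0), c(0, 0))
mix_skurts <- list(c(0, 0), c(0, 0))
mix_fifths <- list(c(0, 0), c(0, 0))
mix_sixths <- list(c(0, 0), c(0, 0))
# Target correlations between components of M1 (rows) and of M2 (columns)
p_M1M2 <- matrix(0.4, nrow = 2, ncol = 2)
rho_M1M2(mix_pis, mix_mus, mix_sigmas, mix_skews, mix_skurts,
         mix_fifths, mix_sixths, p_M1M2 = p_M1M2, method = "Polynomial")
```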
The required parameters for simulating continuous variables include: skewness, standardized kurtosis (kurtosis − 3), and standardized fifth and sixth cumulants (for the fifth-order method). If the goal is to simulate a theoretical distribution (e.g., Gamma, Beta, Logistic), these values can be obtained using SimMultiCorrData::calc_theory. If the goal is to mimic an empirical data set, these values can be found using SimMultiCorrData::calc_moments (using the method of moments) or SimMultiCorrData::calc_fisherk (using Fisher’s k-statistics). If the standardized cumulants are obtained from calc_theory, the user may need to use rounded values as inputs (e.g., skews = round(skews, 8)), for example, to ensure that the skew is exactly 0 for symmetric distributions. Due to the nature of the integration involved in calc_theory, the results are approximations. Greater accuracy can be achieved by increasing the number of subdivisions (sub) used in the integration process.
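For example, the cumulants of a symmetric Laplace distribution could be obtained and rounded as in this sketch (the Dist name and params order follow the calc_theory documentation):

```r
library(SimMultiCorrData)
# First six standardized results for a Laplace(0, 1) distribution:
# mean, sd, skew, standardized kurtosis, fifth and sixth cumulants
stcum <- calc_theory(Dist = "Laplace", params = c(0, 1))
# Round so the skew of this symmetric distribution is exactly 0
skews <- round(stcum[3], 8)
# Increasing sub in calc_theory refines the underlying integration
```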
For mixture variables, the parameters are specified at the component level by the inputs mix_skews, mix_skurts, mix_fifths, mix_sixths, and mix_Six. The mixing probabilities, means, and standard deviations of the component variables are given by mix_pis, mix_mus, and mix_sigmas.
The means and variances of all continuous variables are specified by means and vars. These are at the variable level, i.e., they refer to the continuous non-mixture and mixture variables themselves. The function calc_mixmoments calculates the expected mean, standard deviation, and standardized cumulants for mixture variables based on the component distributions.
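A sketch of calc_mixmoments for a single normal-mixture variable, with component-level inputs given as vectors (verify the argument names against ?calc_mixmoments):

```r
library(SimCorrMix)
# Expected mean, SD, and standardized cumulants of 0.4*N(-2, 1) + 0.6*N(2, 1)
calc_mixmoments(mix_pis = c(0.4, 0.6), mix_mus = c(-2, 2),
                mix_sigmas = c(1, 1), mix_skews = c(0, 0),
                mix_skurts = c(0, 0), mix_fifths = c(0, 0),
                mix_sixths = c(0, 0))
```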
For some sets of cumulants, it is either not possible to find power method constants or the calculated constants do not generate valid power method PDFs. In these situations, adding a value to the sixth cumulant may provide solutions (see find_constants). When using Headrick’s fifth-order approximation, if simulation results indicate that a continuous variable does not generate a valid PDF, the user can try find_constants with various sixth cumulant correction vectors to determine whether a valid PDF can be found. These sixth cumulant corrections are specified in the simulation functions using Six or mix_Six.
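A sixth cumulant correction vector can be passed to find_constants as in this sketch (the cumulant values here are purely illustrative):

```r
library(SimMultiCorrData)
# Try corrections 0.1, 0.2, ..., 2.0 added to the sixth cumulant until
# constants that yield a valid power method PDF are found (if any)
con <- find_constants(method = "Polynomial", skews = 0.8, skurts = 1,
                      fifths = 2, sixths = 5, Six = seq(0.1, 2, by = 0.1))
```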
Choice of Fleishman’s or Headrick’s Method: Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving accuracy. In addition, the range of feasible standardized kurtosis (γ2) values, given skew (γ1) and standardized fifth (γ3) and sixth (γ4) cumulants, is larger than with the third-order approximation. For example, Fleishman’s method cannot be used to generate a non-normal distribution with a ratio of γ1²/γ2 > 9/14 (Headrick and Kowalchuk 2007). This eliminates the Chi-squared family of distributions, which has a constant ratio of γ1²/γ2 = 2/3. The fifth-order method also generates more distributions with valid PDFs. However, if the fifth and sixth cumulants do not exist, the Fleishman approximation should be used.
2) Ordinal Variables: Ordinal variables (r ≥ 2 categories) are generated by discretizing the standard normal variables at quantiles. These quantiles are determined by evaluating the inverse standard normal CDF at the cumulative probabilities defined by each variable’s marginal distribution. If the support for variable Y_i is not provided, the default is to use {1, 2, ..., r_i}.
The required parameters for simulating ordinal variables include: the cumulative marginal probabilities and support values (if desired). The probabilities should be combined into a list of length equal to the number of ordinal variables. The ith element is a vector of the cumulative probabilities defining the marginal distribution of the ith variable. If the variable can take r values, the vector will contain r − 1 probabilities (the rth is assumed to be 1).
For binary variables, the user-supplied probability should be the probability of the 1st (lower) support value. This would ordinarily be considered the probability of failure (q), while the probability of the 2nd (upper) support value would be considered the probability of success (p = 1 − q). The support values should be combined into a separate list. The ith element is a vector containing the r ordered support values. If not provided, the default is for the ith element to be the vector 1, ..., r.
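The discretization step can be sketched in base R for a hypothetical four-category marginal:

```r
# Discretize a standard normal at the quantiles given by the cumulative
# probabilities; the default support is 1, ..., r
set.seed(5)
z <- rnorm(10000)
marginal <- c(0.2, 0.55, 0.85)   # r - 1 = 3 cumulative probabilities
y <- cut(z, breaks = c(-Inf, qnorm(marginal), Inf), labels = FALSE)
table(y) / length(y)             # approx. 0.20, 0.35, 0.30, 0.15
```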
3) Poisson and Negative Binomial Variables: Count variables are generated using the inverse CDF method. The cumulative distribution function of a standard normal variable has a uniform distribution. The appropriate quantile function F_Y^{-1} is applied to this uniform variable with the designated parameters to generate the count variable: Y = F_Y^{-1}(Φ(Z)).
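In base R, this step can be sketched for a Poisson(3) target as:

```r
# Inverse CDF method: Y = F_Y^{-1}(Phi(Z))
set.seed(5)
z <- rnorm(10000)
y <- qpois(pnorm(z), lambda = 3)
```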
Zero-inflated distributions: A zero-inflated (ZI) random variable Y_ZI is a mixture of a degenerate distribution with point mass at 0 and another distribution Y that contributes both zero and non-zero values. If the mixing probability is ϕ, then:

$$\Pr[Y_{ZI} = 0] = \phi + (1 - \phi) \Pr[Y = 0], \qquad \Pr[Y_{ZI} = y] = (1 - \phi) \Pr[Y = y], \quad y > 0.$$

Therefore, ϕ is the probability of a structural zero, and setting ϕ = 0 reduces Y_ZI to the variable Y. In SimCorrMix, Y can have either a Poisson distribution (Y_P) or a Negative Binomial (NB) distribution (Y_NB) (Ismail and Zamani 2013; Lambert 1992; Zhang, Mallick, and Yi 2016).
The mean of Y_ZIP is μ = (1 − ϕ) * λ, and the variance is μ + μ² * ϕ/(1 − ϕ), i.e., (1 − ϕ) * λ * (1 + ϕ * λ) (see VGAM::dzipois).
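These moments can be checked against a simulated sample with VGAM, where ϕ corresponds to the pstr0 argument:

```r
library(VGAM)
set.seed(5)
lam <- 3
phi <- 0.2
y <- rzipois(100000, lambda = lam, pstr0 = phi)
mean(y)   # approx. (1 - phi) * lam = 2.4
var(y)    # approx. (1 - phi) * lam * (1 + phi * lam) = 3.84
```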
The zero-deflated Poisson distribution may be obtained by setting ϕ ∈ (−1/(exp(λ) − 1), 0), so that the probability of a zero count is less than the nominal Poisson value. In this case, ϕ no longer represents a probability. When ϕ = −1/(exp(λ) − 1), the random variable has a positive-Poisson distribution (see VGAM::dpospois). The probability of a zero response is 0, and the other probabilities are scaled to sum to 1.
The parameters λ and ϕ are specified through the inputs lam and p_zip. Setting p_zip = 0 (the default) generates a regular Poisson variable. Parameters for zero-inflated Poisson variables should follow those for regular Poisson variables in all function inputs. For Correlation Method 2, a vector of total cumulative probability truncation values should be given in pois_eps. These values represent the amount removed from the total cumulative probability when creating finite supports. The value may differ by variable, but the default value is 0.0001 (suggested by Barbiero and Ferrari (2015)). For example, pois_eps = 0.0001 means that the support values removed have a total probability of 0.0001 of occurring in the distribution of that variable. The effect is to remove improbable values, which may be of concern if the user wishes to replicate a distribution with outliers.
In this formulation, the Negative Binomial component Y_NB represents the number of failures that occur in a sequence of independent Bernoulli trials before a target number of successes (size) is reached. The probability of success in each trial is p. Y_NB may also be parameterized by the mean μ and dispersion parameter size, so that p = size/(size + μ) or μ = size * (1 − p)/p (see stats::dnbinom).
The mean of Y_ZINB is (1 − ϕ) * μ, and the variance is (1 − ϕ) * μ * (1 + μ * (ϕ + 1/size)) (see VGAM::dzinegbin).
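Analogously for the ZINB moments, a sketch with VGAM (munb is assumed to be the mean parameter of the NB component; check ?rzinegbin):

```r
library(VGAM)
set.seed(5)
size <- 5
mu <- 2
phi <- 0.2
y <- rzinegbin(100000, size = size, munb = mu, pstr0 = phi)
mean(y)   # approx. (1 - phi) * mu = 1.6
var(y)    # approx. (1 - phi) * mu * (1 + mu * (phi + 1/size)) = 2.88
```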
The zero-deflated NB distribution may be obtained by setting ϕ ∈ (−p^size/(1 − p^size), 0), so that the probability of a zero count is less than the nominal NB value. Again, in this case, ϕ no longer represents a probability. The positive-NB distribution can be obtained with ϕ = −p^size/(1 − p^size) (see VGAM::dposnegbin). The probability of a zero response is 0, and the other probabilities are scaled to sum to 1.
The parameters size, p, μ, and ϕ are specified through the inputs size, prob, mu, and p_zinb. Either prob or mu should be given for all NB and ZINB variables. Setting p_zinb = 0 (the default) generates a regular NB variable. Parameters for zero-inflated NB variables should follow those for regular NB variables in all function inputs. For Correlation Method 2, a vector of total cumulative probability truncation values should be given in nb_eps. The default value in the simulation functions is 0.0001.
The distribution functions are taken from the VGAM package (Yee 2018).
The error loop may be used to correct the final pairwise correlation of simulated variables to be within a user-specified precision value (epsilon) of the target correlation. It updates the pairwise intermediate MVN correlation iteratively in a loop until either the maximum error is less than epsilon or the number of iterations exceeds the maximum number set by the user (maxit). Some code has been modified from the SimMultiCorrData package (Fialkowski 2018). Below is a description of the algorithm, which is executed within each variable pair across all k = k_cat + k_cont + k_comp + k_pois + k_nb variables (k_comp is the number of component distributions for the continuous mixture variables). rho_calc is the calculated final correlation matrix updated in each iteration, rho is the target final correlation, Sigmaold is the intermediate correlation from the previous iteration, rho_old is the calculated final correlation from the previous iteration, it is the iteration number, q is the row number, r is the column number, and maxerr is the absolute pairwise correlation error between rho_calc[q, r] and rho[q, r].
1) If rho[q, r] = 0, set Sigma[q, r] = 0.
2) While maxerr is greater than epsilon and it is less than maxit (the update rule is sketched in code after this list):
   a) If rho[q, r] * (rho[q, r]/rho_calc[q, r]) <= -1, then: Sigma[q, r] = Sigmaold[q, r] * (1 + 0.1 * (1 - Sigmaold[q, r]) * -sign(rho[q, r] - rho_old[q, r]))
   b) If rho[q, r] * (rho[q, r]/rho_calc[q, r]) >= 1, then: Sigma[q, r] = Sigmaold[q, r] * (1 + 0.1 * (1 - Sigmaold[q, r]) * sign(rho[q, r] - rho_old[q, r]))
   c) Else: Sigma[q, r] = Sigmaold[q, r] * (rho[q, r]/rho_calc[q, r]).
3) An eigenvalue decomposition is done on the k x k Sigma correlation matrix. If Sigma is not positive-definite, the negative eigenvalues are replaced with 0.
4) Intermediate multivariate normal variables X of dimension k are generated with correlation matrix Sigma.
5) The k final variables Y are generated using the appropriate transformations on the Xi.
6) The final correlation matrix rho_calc of all the Y variables is calculated.
7) Set Sigmaold = Sigma and increase it by 1.
8) The number of iterations used for each variable pair is stored in niter.
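The update rule in step 2 can be transcribed directly into an R helper (a sketch of the stated rules, not the package's internal code):

```r
# One error-loop update of the intermediate correlation for the pair (q, r)
update_sigma <- function(Sigma, Sigmaold, rho, rho_calc, rho_old, q, r) {
  ratio <- rho[q, r] / rho_calc[q, r]
  if (rho[q, r] * ratio <= -1) {
    Sigma[q, r] <- Sigmaold[q, r] *
      (1 + 0.1 * (1 - Sigmaold[q, r]) * -sign(rho[q, r] - rho_old[q, r]))
  } else if (rho[q, r] * ratio >= 1) {
    Sigma[q, r] <- Sigmaold[q, r] *
      (1 + 0.1 * (1 - Sigmaold[q, r]) * sign(rho[q, r] - rho_old[q, r]))
  } else {
    Sigma[q, r] <- Sigmaold[q, r] * ratio
  }
  Sigma[r, q] <- Sigma[q, r]   # keep the matrix symmetric
  Sigma
}
```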
The error loop does increase simulation time, but it can improve accuracy in most situations. It may be unsuccessful for correlation structures that are more difficult to obtain. Some cases utilizing negative correlations will have results similar to those without the error loop. Trying different values of epsilon (e.g., 0.01 instead of 0.001) can help in these cases. For a given row (q = 1, ..., nrow(Sigma)), the error loop progresses through the intermediate correlation matrix Sigma by increasing column index (r = 2, ..., ncol(Sigma), r not equal to q). Each time a new pairwise correlation Sigma[q, r] is calculated, the new Sigma matrix is imposed on the intermediate normal variables X, the appropriate transformations are applied to get Y, and the final correlation matrix rho_calc is found. Even though the intermediate correlations from previous q, r combinations are not changed, the final correlations are affected. The fewer iterations used for a given q, r combination, the less rho_calc[q, r] changes. Since larger values of epsilon require fewer iterations, using epsilon = 0.01 may give better results than epsilon = 0.001.
Each simulation pathway in SimCorrMix has its own function to calculate correlation boundaries, given specified distributional parameters, and to check whether a target correlation matrix rho falls within these boundaries. Correlation method 1 uses validcorr and correlation method 2 uses validcorr2. The parameter inputs should be checked first using validpar. Some code has been modified from the SimMultiCorrData package (Fialkowski 2018). The distribution functions for Poisson and Negative Binomial variables are taken from the VGAM package (Yee 2018).
The GSC algorithm is a flexible method for determining empirical correlation bounds when the theoretical bounds are unknown. The steps are as follows:
Generate independent random samples from the desired distributions using a large number of observations (e.g., n = 100,000).
Lower Bound: Sort the two variables in opposite directions (i.e., one increasing and one decreasing) and find the sample correlation.
Upper Bound: Sort the two variables in the same direction and find the sample correlation.
Demirtas and Hedeker (2011) showed that the empirical bounds computed from the GSC method are similar to the theoretical bounds (when they are known).
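The GSC steps translate directly into base R, as in this sketch with two illustrative distributions:

```r
# Empirical correlation bounds for a Poisson(3) and a Gamma(4, 2) variable
set.seed(5)
n  <- 100000
y1 <- rpois(n, lambda = 3)
y2 <- rgamma(n, shape = 4, rate = 2)
lower <- cor(sort(y1), sort(y2, decreasing = TRUE))  # opposite ordering
upper <- cor(sort(y1), sort(y2))                     # same ordering
```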
Suppose two random variables Y_1 and Y_2 have cumulative distribution functions given by F_1 and F_2. Let U be a Uniform(0, 1) random variable, i.e., U represents the distribution of Φ(Z), the standard normal CDF evaluated at a standard normal variable. Then Hoeffding (1994) and Fréchet (1951) showed that bounds for the correlation between Y_1 and Y_2 are given by:

$$\mathrm{cor}\!\left(F_1^{-1}(U),\ F_2^{-1}(1 - U)\right) \;\le\; \rho_{Y_1 Y_2} \;\le\; \mathrm{cor}\!\left(F_1^{-1}(U),\ F_2^{-1}(U)\right).$$
First, the calculations which are equivalent in the two pathways will be discussed by variable type.
Ordinal Variables:
Binary pairs: The correlation bounds are determined as in Demirtas, Hedeker, and Mermelstein (2012), who used the method of Emrich and Piedmonte (1991). The joint distribution is determined by “borrowing” the moments of a multivariate normal distribution. For two binary variables Y_1 and Y_2, with success probabilities p_1 and p_2, the boundaries are given by:

$$\max\!\left(-\sqrt{\frac{p_1 p_2}{q_1 q_2}},\ -\sqrt{\frac{q_1 q_2}{p_1 p_2}}\right) \;\le\; \rho_{Y_1 Y_2} \;\le\; \min\!\left(\sqrt{\frac{p_1 q_2}{q_1 p_2}},\ \sqrt{\frac{q_1 p_2}{p_1 q_2}}\right),$$

where q_1 = 1 − p_1 and q_2 = 1 − p_2.
Binary-Ordinal or Ordinal-Ordinal pairs: Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
Continuous Variables: Continuous variables are randomly generated using constants from SimMultiCorrData::find_constants and a vector of sixth cumulant correction values (if provided). The GSC algorithm is used to find the lower and upper bounds.
Continuous - Ordinal Pairs: Randomly generated ordinal variables with the given marginal distributions and the previously generated continuous variables are used in the GSC algorithm to find the correlation bounds.
In correlation method 1, the GSC bounds are used for all variable pair types except Poisson and Negative Binomial variables. The Fréchet-Hoeffding bounds are used for Poisson-Poisson, NB-NB, and Poisson-NB variable pairs.
In correlation method 2, count variables (regular or zero-inflated) are treated as “ordinal” by truncating their infinite supports. The maximum support values for the Poisson variables, given the cumulative probability truncation values (pois_eps), means (lam), and probabilities of structural zeros (p_zip for zero-inflated Poisson), are calculated using maxcount_support. The finite supports are used to determine marginal distributions for each Poisson variable. The maximum support values for the Negative Binomial variables, given the cumulative probability truncation values (nb_eps), sizes (size), success probabilities (prob) or means (mu), and probabilities of structural zeros (p_zinb for zero-inflated NB), are calculated using maxcount_support. The finite supports are used to determine marginal distributions for each NB variable.
In correlation method 2, the GSC bounds are used for all variable pair types.