All variables are generated via the appropriate transformation of
standard normal variables, as described below. See the documentation for
corrvar and corrvar2 for more details about
all inputs. The parameter inputs should be checked first using
validpar. Summaries of simulated variables can be obtained
using summary_var. Some code has been modified from the
SimMultiCorrData package (Fialkowski 2018).
1) Continuous Variables: Continuous variables are
simulated using either Fleishman (1978)’s
third-order (method = “Fleishman”) or Headrick (2002)’s fifth-order
(method = “Polynomial”) power method transformation (PMT).
This is a computationally efficient algorithm that simulates continuous
distributions through the method of moments. It works by matching
standardized cumulants – the first four (mean, variance, skew, and
standardized kurtosis) for Fleishman’s third-order method, or the first
six (mean, variance, skew, standardized kurtosis, and standardized fifth
and sixth cumulants) for Headrick’s fifth-order method. The
transformation is expressed as follows:
\[\begin{equation}
Y = c_{0} + c_{1} * Z + c_{2} * Z^2 + c_{3} * Z^3 + c_{4} * Z^4 + c_{5}
* Z^5,\ Z \sim iid\ N(0,1), (\#eq:System1)
\end{equation}\]
where \(c_{4}\) and \(c_{5}\) both equal \(0\) for Fleishman’s method. The real
constants are calculated by
SimMultiCorrData::find_constants, which solves the system
of equations given in SimMultiCorrData::fleish or
SimMultiCorrData::poly. All variables are simulated with
mean \(0\) and variance \(1\), and then transformed to the specified
mean and variance at the end.
Continuous Mixture Variables: Mixture distributions
describe random variables that are drawn from more than one component
distribution. For a random variable \(Y\) from a finite continuous mixture
distribution with \(k\) components, the
probability density function (PDF) can be described by:
\[\begin{equation}
h_{Y}(y) = \sum_{i=1}^{k} \pi_{i} f_{Y_{i}}(y),\ \ \sum_{i=1}^{k}
\pi_{i} = 1. (\#eq:System2)
\end{equation}\]
The \(\pi_{i}\) are mixing parameters
which determine the weight of each component distribution \(f_{Y_{i}}(y)\) in the overall probability
distribution. As long as each component has a valid PDF, the overall
distribution \(h_{Y}(y)\) has a valid
PDF. The main assumption is statistical independence between the process
of randomly selecting the component distribution and the distributions
themselves: Assume there is a random selection process that first
generates the numbers \(1,\ ...,\ k\)
with probabilities \(\pi_{1},\ ...,\
\pi_{k}\). After selecting number \(i\), where \(1
\leq i \leq k\), a random variable \(y\) is drawn from component distribution
\(f_{Y_{i}}(y)\).
Mixture distributions provide a useful way for describing
heterogeneity in a population, especially when an outcome is a composite
response from multiple sources. They may also be used to model
distributions with outliers. For example, the contaminated normal
outlier distribution is used when a proportion of measurement
errors, which are usually modeled as normal variables with zero mean and
common variance \(\sigma^2\), have
larger variability. These distributions can be thought of as mixture
distributions with contamination percentage \(\pi\ \epsilon\ (0,\ 1)\). For example, the
PDF of \(Y\) may be given by:
\[\begin{equation}
h_{Y}(y) = \pi \phi(0,\ 9\sigma^2)(y) + (1-\pi) \phi(0,\ \sigma^2)(y),\
\ -\infty < y < \infty. (\#eq:System3)
\end{equation}\]
Here, \(\phi(\mu,\ \sigma^2)(y)\) is
the PDF of the normal distribution (Davenport et
al. 1988; Schork et al. 1996; Everitt 1996; Pearson 2011).
Continuous mixture variables are generated at the component level. Each component variable is created using the PMT, and the mixture variable is generated from these based on a random multinomial variable described by the mixing probabilities. Correlations are also controlled at the component level. Users specify the target pairwise correlations between the components across continuous mixture variables and between components and other types of variables.
The function rho_M1M2 approximates the expected
correlation between two mixture variables \(M1\) and \(M2\) based on the component distributions
and the correlations between components. The function
rho_M1Y approximates the expected correlation between a
mixture variable \(M1\) and another
random variable \(Y\), that may have an
ordinal, continuous, or count distribution. If the user desires a
specific correlation between a mixture variable \(M1\) and another simulated variable \(M2\) or \(Y\), various component level correlations
can be tried in these 2 functions to see if the desired correlation can
be achieved. See the Expected Cumulants and
Correlations for Continuous Mixture Variables vignette for more
details on continuous mixture variables.
The required parameters for simulating
continuous variables include: skewness, standardized kurtosis (kurtosis
- 3), and standardized fifth and sixth cumulants (for the fifth-order
method). If the goal is to simulate a theoretical distribution
(i.e. Gamma, Beta, Logistic, etc.), these values can be obtained using
SimMultiCorrData::calc_theory. If the goal is to mimic an
empirical data set, these values can be found using
SimMultiCorrData::calc_moments (using the method of
moments) or SimMultiCorrData::calc_fisherk (using Fisher’s
k-statistics). If the standardized cumulants are obtained from
calc_theory, the user may need to use rounded values as
inputs (i.e. skews = round(skews, 8)). For example, in
order to ensure that skew is exactly \(0\) for symmetric distributions. Due to the
nature of the integration involved in calc_theory, the
results are approximations. Greater accuracy can be achieved by
increasing the number of subdivisions (sub) used in the
integration process.
For mixture variables, the parameters are specified at the
component level by the inputs mix_skews,
mix_skurts, mix_fifths,
mix_sixths, and mix_Six. The mixing
probabilities, means and standard deviations of the component variables
are given by mix_pis, mix_mus and
mix_sigmas.
The means and variances of all continuous variables are specified
by means and vars. These are at the variable
level, i.e., they refer to the continous non-mixture and mixture
variables themselves. The function calc_mixmoments
calculates the expected mean, standard deviation, and standardized
cumulants for mixture variables based on the component
distributions.
For some sets of cumulants, it is either not possible to find
power method constants or the calculated constants do not generate valid
power method PDFs. In these situations, adding a value to the sixth
cumulant may provide solutions (see find_constants). When
using Headrick’s fifth-order approximation, if simulation results
indicate that a continuous variable does not generate a valid PDF, the
user can try find_constants with various sixth cumulant
correction vectors to determine if a valid PDF can be found. These sixth
cumulant corrections are specified in the simulation functions using
Six or mix_Six.
Choice of Fleishman’s or Headrick’s Method: Using the fifth-order approximation allows additional control over the fifth and sixth moments of the generated distribution, improving accuracy. In addition, the range of feasible standardized kurtosis (\(\gamma_{2}\)) values, given skew (\(\gamma_{1}\)) and standardized fifth (\(\gamma_{3}\)) and sixth (\(\gamma_{4}\)) cumulants, is larger than with the third-order approximation. For example, Fleishman’s method can not be used to generate a non-normal distribution with a ratio of \(\gamma_{1}^2/\gamma_{2} > 9/14\) (Headrick and Kowalchuk 2007). This eliminates the family of distributions, which has a constant ratio of \(\gamma_{1}^2/\gamma_{2} = 2/3\). The fifth-order method also generates more distributions with valid PDF’s. However, if the fifth and sixth cumulants do not exist, the Fleishman approximation should be used.
2) Ordinal Variables: Ordinal variables (\(r \ge 2\) categories) are generated by discretizing the standard normal variables at quantiles. These quantiles are determined by evaluating the inverse standard normal CDF at the cumulative probabilities defined by each variable’s marginal distribution. If the support for variable \({Y}_{i}\) is not provided, the default is to use \(\{1,\ 2,\ ...,\ r_{i}\}\).
The required parameters for simulating ordinal variables include: the cumulative marginal probabilities and support values (if desired). The probabilities should be combined into a list of length equal to the number of ordinal variables. The \(i^{th}\) element is a vector of the cumulative probabilities defining the marginal distribution of the \(i^{th}\) variable. If the variable can take \(r\) values, the vector will contain \(r - 1\) probabilities (the \(r^{th}\) is assumed to be \(1\)).
For binary variables, the user-supplied probability should be the probability of the \(1^{st}\) (lower) support value. This would ordinarily be considered the probability of failure (\(q\)), while the probability of the \(2^{nd}\) (upper) support value would be considered the probability of success (\(p = 1 - q\)). The support values should be combined into a separate list. The \(i^{th}\) element is a vector containing the \(r\) ordered support values. If not provided, the default is for the \(i^{th}\) element to be the vector \(1, ..., r\).
3) Poisson and Negative Binomial Variables: Count variables are generated using the inverse CDF method. The cumulative distribution function of a standard normal variable has a uniform distribution. The appropriate quantile function \(F_{Y}^{-1}\) is applied to this uniform variable with the designated parameters to generate the count variable: \(Y = F_{y}^{-1}(\Phi(Z))\).
Zero-inflated distributions: A zero-inflated (ZI)
random variable \(Y_{ZI}\) is a mixture
of a degenerate distribution having the point mass at \(0\) and another distribution \(Y\) that contributes both \(0\) and non-\(0\) values. If the mixing probability is
\(\phi\), then:
\[\begin{equation}
P(Y_{ZI} = 0) = \phi + (1 - \phi) * P(Y = 0),\ \ 0 < \phi < 1.
(\#eq:System8)
\end{equation}\]
Therefore, \(\phi\) is the probability
of a structural zero, and setting \(\phi =
0\) reduces \(Y_{ZI}\) to the
variable \(Y\). In
SimCorrMix, \(Y\) can
have either a Poisson distribution (\(Y_P\)) or a NB distribution (\(Y_{NB}\)) (Ismail
and Zamani 2013; Lambert 1992; Zhang et al. 2016).
\[\begin{equation} \begin{split} P(Y_{ZIP} = 0) &= \phi + (1 - \phi) * exp(-\lambda) \\ P(Y_{ZIP} = y) &= (1 - \phi) * exp(-\lambda) * \frac{\lambda^y}{y!},\ \ y = 1, 2, ... \end{split} (\#eq:System9) \end{equation}\]
The mean of \(Y_{ZIP}\) is \((1 - \phi) * \lambda\), and the variance is
\(\lambda + \lambda^2 * \phi/(1 -
\phi)\) (see VGAM::dzipois).
The zero-deflated Poisson distribution may be obtained by setting \(\phi \in (-1/(exp(\lambda) - 1),\ 0)\), so that the probability of a zero count is less than the nominal Poisson value. In this case, \(\phi\) no longer represents a probability.
When \(\phi = -1/(exp(\lambda) -
1)\), the random variable has a positive-Poisson distribution
(see VGAM::dpospois). The probability of a zero response is
\(0\), and the other probabilities are
scaled to sum to \(1\).
The parameters \(\lambda\) and \(\phi\) are specified through the inputs
lam and p_zip. Setting p_zip = 0
(default setting) generates a regular Poisson variable. Parameters for
zero-inflated Poisson variables should follow those for regular Poisson
variables in all function inputs. For Correlation Method
2, a vector of total cumulative probability truncation values
should be given in pois_eps. These values represent the
amount removed from the total cumulative probability when creating
finite supports. The value may differ by variable, but the default value
is \(0.0001\) (suggested by Barbiero and Ferrari (2015)). For example,
pois_eps = 0.0001 means that the support values removed
have a total probability of \(0.0001\)
of occurring in the distribution of that variable. The effect is to
remove improbable values, which may be of concern if the user wishes to
replicate a distribution with outliers.
\[\begin{equation} \begin{split} P(Y_{ZINB} = 0) &= \phi + (1 - \phi) * p^{size} \\ P(Y_{ZINB} = y) &= (1 - \phi) * \frac{\Gamma(y + size)}{\Gamma(size) * y!} * p^{size} * (1 - p)^{size},\ \ y = 1, 2, ... \end{split} (\#eq:System10) \end{equation}\]
In this formulation, the Negative Binomial component \(Y_{NB}\) represents the number of failures
which occur in a sequence of independent Bernoulli trials before a
target number of successes (\(size\))
is reached. The probability of success in each trial is \(p\). \(Y_{NB}\) may also be parametrized by the
mean \(\mu\) and dispersion parameter
\(size\) so that \(p = size/(size + \mu)\) or \(\mu = size * (1-p)/p\) (see
stats::dnbinom).
The mean of \(Y_{ZINB}\) is \((1 - \phi) * \mu\), and the variance is
\((1 - \phi) * \mu * (1 + \mu * (\phi +
1/size))\) (see VGAM::dzinegbin).
The zero-deflated NB distribution may be obtained by setting \(\phi \in (-p^{size}/(1 - p^{size}),\ 0)\),
so that the probability of a zero count is less than the nominal NB
value. Again, in this case, \(\phi\) no
longer represents a probability.
The positive-NB distribution can be obtained with \(\phi = -p^{size}/(1 - p^{size})\) (see
VGAM::dposnegbin). The probability of a zero response is
\(0\), and the other probabilities are
scaled to sum to \(1\).
The parameters \(size\), \(p\), \(\mu\), and \(\phi\) are specified through the inputs
size, prob, mu, and
p_zinb. Either prob or mu should
be given for all NB and ZINB variables. Setting p_zinb = 0
(default setting) generates a regular NB variable. Parameters for
zero-inflated NB variables should follow those for regular NB variables
in all function inputs. For Correlation Method 2, a
vector of total cumulative probability truncation values should be given
in nb_eps. The default value in the simulation functions is
\(0.0001\).
The distributions functions are taken from the VGAM package (Yee 2018).
The error loop may be used to correct the final
pairwise correlation of simulated variables to be within a
user-specified precision value (epsilon) of the target
correlation. It updates the pairwise intermediate MVN correlation
iteratively in a loop until either the maximum error is less than
epsilon or the number of iterations exceeds the maximum
number set by the user (maxit). Some code has been modified
from the SimMultiCorrData package (Fialkowski 2018). Below is a description of the
algorithm, which is executed within each variable pair across all
k = k_cat + k_cont + k_comp + k_pois + k_nb variables
(k_comp is the number of component distributions for the
continuous mixture variables). rho_calc is the calculated
final correlation matrix updated in each iteration, rho is
the target final correlation, Sigmaold is the intermediate
correlation from the previous iteration, it is the
iteration number, q is the row number, r is
the column number, and maxerr is the absolute pairwise
correlation error between rho_calc[q, r] and
rho[q, r].
If rho[q, r] = 0, set
Sigma[q, r] = 0.
While maxerr is greater than epsilon
and it is less than maxit:
rho[q, r] * (rho[q, r]/rho_calc[q, r]) <= -1,
then:
Sigma[q, r] = Sigmaold[q, r] * (1 + 0.1 * (1 - Sigmaold[q, r]) * -sign(rho[q, r] - rho_old[q, r]))rho[q, r] * (rho[q, r]/rho_calc[q, r]) >= 1,
then:
Sigma[q, r] = Sigmaold[q, r] * (1 + 0.1 * (1 - Sigmaold[q, r]) * sign(rho[q, r] - rho_old[q, r]))Sigma[q, r] = Sigmaold[q, r] * (rho[q, r]/rho_calc[q, r]).kxk
Sigma correlation matrix. If Sigma is not
positive-definite, the negative eigenvalues are replaced with \(0\).k with correlation matrix
Sigma.k using the
appropriate transformations on the \(X_{i}\).rho_calc of all the \(\bm{Y}\) variables is calculated.Sigmaold = Sigma and increase it by
1.niter.The error loop does increase simulation time, but it can improve
accuracy in most situations. It may be unsuccessful in more difficult to
obtain correlation structures. Some cases utilizing negative
correlations will have results similar to those without the error loop.
Trying different values of epsilon (i.e., \(0.01\) instead of \(0.001\)) can help in these cases. For a
given row (q = 1, …, nrow(Sigma)), the error
loop progresses through the intermediate correlation matrix
Sigma by increasing column index (r = 2, …,
ncol(Sigma), r not equal to q).
Each time a new pairwise correlation Sigma[q, r] is
calculated, the new Sigma matrix is imposed on the
intermediate normal variables X, the appropriate
transformations are applied to get Y, and the final
correlation matrix rho_calc is found. Even though the
intermediate correlations from previous q, r combinations
are not changed, the final correlations are affected. The fewer
iterations for a given q, r combination, the less
rho_calc[q, r] changes. Since larger values of
epsilon require fewer iterations, using
epsilon = 0.01 may give better results than
epsilon = 0.001.
Each simulation pathway in SimCorrMix has its own
function to calculate correlation boundaries, given specified
distributional parameters, and to check if a target correlation matrix
rho falls within these boundaries. Correlation method 1
uses validcorr and correlation method 2 uses
validcorr2. The parameter inputs should be checked first
using validpar. Some code has been modified from the
SimMultiCorrData package (Fialkowski 2018). The distribution functions
for Poisson and Negative Binomial variables are taken from the
VGAM package (Yee
2018).
The GSC algorithm is a flexible method for determining empirical correlation bounds when the theoretical bounds are unknown. The steps are as follows:
Generate independent random samples from the desired distributions using a large number of observations (i.e. \(n = 100,000\)).
Lower Bound: Sort the two variables in opposite directions (i.e., one increasing and one decreasing) and find the sample correlation.
Upper Bound: Sort the two variables in the same direction and find the sample correlation.
Demirtas and Hedeker (2011) showed that the empirical bounds computed from the GSC method are similar to the theoretical bounds (when they are known).
Suppose two random variables \(Y_1\)
and \(Y_2\) have cumulative
distribution functions given by \(F_1\)
and \(F_2\). Let \(U\) be a Uniform(0,1) random variable,
i.e. representing the distribution of the standard normal CDF. Then
Hoeffding (1994) and Fréchet (1951) showed that bounds for the
correlation between \(Y_1\) and \(Y_2\) are given by:
\[\begin{equation}
\Big\{cor\left(F_1^{-1}(U), F_2^{-1}(1 - U)\right),\
cor\left(F_1^{-1}(U), F_2^{-1}(U)\right)\Big\}. (\#eq:System1)
\end{equation}\]
First, the calculations which are equivalent in the two pathways will be discussed by variable type.
Ordinal Variables:
Binary pairs: The correlation bounds are determined as
in Demirtas et al. (2012), who used the
method of Emrich and Piedmonte (1991). The
joint distribution is determined by “borrowing” the moments of a
multivariate normal distribution. For two binary variables \(Y_1\) and \(Y_2\), with success probabilities \(p_1\) and \(p_2\), the boundaries are given by:
\[\begin{equation}
\Big\{max\left(-\sqrt{(p_1p_2)/(q_1q_2)},\
-\sqrt{(q_1q_2)/(p_1p_2)}\right),\ \ \
min\left(\sqrt{(p_1q_2)/(q_1p_2)},\
\sqrt{(q_1p_2)/(p_1q_2)}\right)\Big\}, (\#eq:System2)
\end{equation}\]
where \(q_1 = 1 - p_1\) and \(q_2 = 1 - p_2\).
Binary-Ordinal or Ordinal-Ordinal pairs: Randomly generated variables with the given marginal distributions are used in the GSC algorithm to find the correlation bounds.
Continuous Variables: Continuous variables are
randomly generated using constants from
SimMultiCorrData::find_constants and a vector of sixth
cumulant correction values (if provided). The GSC algorithm is used to
find the lower and upper bounds.
Continuous - Ordinal Pairs: Randomly generated ordinal variables with the given marginal distributions and the previously generated continuous variables are used in the GSC algorithm to find the correlation bounds.
The GSC bounds are used for all variable pair types except Poisson and Negative Binomial variables. The Frechet-Hoeffding bounds are used for Poisson-Poisson, NB-NB, and Poisson-NB variable pairs.
In correlation method 2, count variables (regular or zero-inflated)
are treated as “ordinal” by truncating their infinite supports. The
maximum support values for the Poisson variables, given the cumulative
probability truncation values (pois_eps), means
(lam), and probabilities of structural zeros
(p_zip for zero-inflated Poisson), are calculated using
maxcount_support. The finite supports are used to determine
marginal distributions for each Poisson variable. The maximum support
values for the Negative Binomial variables, given the cumulative
probability truncation values (nb_eps), sizes
(size), success probabilities (prob) or means
(mu), and probabilities of structural zeros
(p_zinb for zero-inflated NB), are calculated using
maxcount_support. The finite supports are used to determine
marginal distributions for each NB variable.
The GSC bounds are used for all variable pair types.