There are two simulation pathways which differ primarily according to
the calculation of the intermediate correlations involving count
variables. The simulation function corrvar for correlation
method 1 calls the function intercorr. The simulation
function corrvar2 for correlation method 2 calls the
function intercorr2. Both of these call additional
auxiliary functions as needed. The ordering of the variables in
the target correlation matrix rho must be 1st ordinal, 2nd
continuous non-mixture, 3rd components of continuous mixture variables,
4th regular Poisson, 5th zero-inflated Poisson, 6th regular NB, and 7th
zero-inflated NB. Note that the target correlations are specified in
terms of the correlations with components of continuous mixture
variables. This allows the user to set the correlation between
components of the same mixture variable to any desired value. If this
correlation is set to \(0\), the
intermediate correlation matrix Sigma may need to be
converted to the nearest positive-definite matrix. This is done within
the simulation functions by specifying use.nearPD = TRUE.
Higham (2002)’s algorithm is executed with
the Matrix::nearPD function (Bates
and Maechler 2018). Otherwise, negative eigenvalues are replaced
with \(0\). Some code has been modified
from the SimMultiCorrData package (Fialkowski 2018).
First, the intermediate correlation calculations which are equivalent in the two pathways will be discussed by variable type.
If both variables are binary, the method of Demirtas et al. (2012) is used to find the
tetrachoric correlation (code adapted from Inan and Demirtas (2016)’s
BinNonNor::Tetra.Corr.BB). The tetrachoric correlation is
an estimate of the binary correlation measured on a continuous scale.
The assumptions are that the binary variables arise from latent normal
variables, and the actual trait is continuous and not discrete. This
method is based on Emrich and Piedmonte
(1991)’s work, in which the joint binary distribution is
determined from the third and higher moments of a multivariate normal
distribution:
Let \(Y_{1}\) and \(Y_{2}\) be binary variables with \(E[Y_{1}] = Pr(Y_{1} = 1) = p_{1}\), \(E[Y_{2}] = Pr(Y_{2} = 1) = p_{2}\), and
correlation \(\rho_{y1y2}\). Note here
that \(p_1 = 1 -\)
marginal[[1]][1] and \(p_2 = 1
-\) marginal[[2]][1] so that the user-supplied
probabilities are associated with the lower support value. Then solving
the equation
\[\begin{equation}
\Phi[z(p_{1}), z(p_{2}), \rho_{x1x2}] = \rho_{y1y2}\sqrt{p_{1}(1 -
p_{1})p_{2}(1 - p_{2})} + p_{1}p_{2} (\#eq:System2)
\end{equation}\]
for \(\rho_{x1x2}\) gives the
intermediate correlation of the standard normal variables needed to
generate binary variables with correlation \(\rho_{y1y2}\). Here, \(\Phi\) is the standard bivariate normal CDF
and \(z(p)\) indicates the \(p^{th}\) quantile of the standard normal
distribution. To generate the binary variables from the standard normal
variables, set \(Y_{1} = 1\) if \(Z_{1} \le z(p_{1})\) and \(Y_{1} = 0\) otherwise. Similarly, set \(Y_{2} = 1\) if \(Z_{2} \le z(p_{2})\) and \(Y_{2} = 0\) otherwise.
For binary-ordinal or ordinal
pairs, ord_norm is called. The algorithm to simulate
k_cat ordinal random variables is as follows:
If a support is not provided, create the default of
\(1, ..., r\) for an ordinal variable
with \(r\) categories.
Use the norm_ord function to calculate the initial
correlation of the ordinal variables (rhoordold) formed by
discretizing k_cat random normal variates with correlation
matrix set equal to rho0, using the cumulative
probabilities supplied in marginal and the corresponding
normal quantiles.
Let rho be the intermediate normal correlation
updated in each iteration, rhoord be the ordinal
correlation calculated in each iteration (initialized at
rhoordold), rhoold be the intermediate
correlation from the previous iteration, it be the
iteration number, maxit be the user-specified maximum
iteration number, and epsilon be the user-specified maximum
pairwise correlation error. For each variable pair, execute the
following:
rho0 \(= 0\), set
rho \(= 0\).rhoord and
rho0 is greater than epsilon and
it is less than maxit:rho0 * (rho0/rhoord) \(\leq
-1\), then:
rho = rhoold * (1 + 0.1 * (1 - rhoold) * -sign(rho0 - rhoord)).rho0 * (rho0/rhoord) \(\geq 1\), then:
rho = rhoold * (1 + 0.1 * (1 - rhoold) * sign(rho0 - rhoord)).rho = rhoold * (rho0/rhoord).rho \(> 1\), set
rho \(= 1\). If
rho \(< -1\), set
rho \(= -1\). v) Calculate
rhoord using norm_ord and the 2$$2 correlation
matrix formed by rho.rhoold = rho and increase it by
1.niter.SigmaC = rho for the random normal variables. Discretize
these to produce ordinal variables with the desired correlation
matrix.Correlations are computed pairwise. The function
intercorr_cont uses the equations derived by Headrick and Sawilowsky (1999) for the
third-order and Headrick (2002) for the
fifth-order power method transformation (PMT).
For two continuous variables \(Y_i\) and \(Y_j\) generated using Headrick (2002)’s fifth-order PMT, the intermediate normal correlation \(\rho_{{Z}_{i}{Z}_{j}}\) required to obtain the target correlation \(\rho_{{Y}_{i}{Y}_{j}}\) is the solution to the following:
\[\begin{equation} \begin{split} {\rho}_{{Y}_{i}{Y}_{j}} &= 3{c}_{{4}_{i}}{c}_{{0}_{j}}+3{c}_{{4}_{i}}{c}_{{2}_{j}}+9{c}_{{4}_{i}}{c}_{{4}_{j}}+{c}_{{0}_{i}}({c}_{{0}_{j}}+{c}_{{2}_{j}}+3{c}_{{4}_{j}})+{c}_{{1}_{i}}{c}_{{1}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+3{c}_{{3}_{i}}{c}_{{1}_{j}}{\rho}_{{Z}_{i}{Z}_{j}} \\ & +15{c}_{{5}_{i}}{c}_{{1}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+3{c}_{{1}_{i}}{c}_{{1}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+9{c}_{{3}_{i}}{c}_{{3}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+45{c}_{{5}_{i}}{c}_{{3}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+15{c}_{{1}_{i}}{c}_{{5}_{j}}{\rho}_{{Z}_{i}{Z}_{j}} \\ & +45{c}_{{3}_{i}}{c}_{{5}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+225{c}_{{5}_{i}}{c}_{{5}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+12{c}_{{4}_{i}}{c}_{{2}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{2}+72{c}_{{4}_{i}}{c}_{{4}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{2}+6{c}_{{3}_{i}}{c}_{{3}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{3} \\ & +60{c}_{{5}_{i}}{c}_{{3}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{3}+60{c}_{{3}_{i}}{c}_{{5}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{3}+600{c}_{{5}_{i}}{c}_{{5}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{3}+24{c}_{{4}_{i}}{c}_{{4}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{4}+120{c}_{{5}_{i}}{c}_{{5}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{5} \\ & +{c}_{{2}_{i}}({c}_{{0}_{j}}+{c}_{{2}_{j}}+3{c}_{{4}_{j}}+2{c}_{{2}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{2}+12{c}_{{4}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{2}). \end{split} (\#eq:System4) \end{equation}\]
For two continuous variables \(Y_i\)
and \(Y_j\) generated using Fleishman (1978)’s third-order PMT, the
intermediate normal correlation \(\rho_{{Z}_{i}{Z}_{j}}\) required to obtain
the target correlation \(\rho_{{Y}_{i}{Y}_{j}}\) is the solution to
the following:
\[\begin{equation}
{\rho}_{{Y}_{i}{Y}_{j}} =
\rho_{{Z}_{i}{Z}_{j}}(c_{1_{i}}c_{1_{j}}+3c_{1_{j}}c_{3_{i}}+3c_{1_{i}}c_{3_{j}}+9c_{3_{i}}c_{3_{j}}+2c_{0_{i}}c_{0_{j}}\rho_{{Z}_{i}{Z}_{j}}+6c_{3_{i}}c_{3_{j}}\rho_{{Z}_{i}{Z}_{j}}^2).
(\#eq:System5)
\end{equation}\]
The function SimMultiCorrData::findintercorr_cont_cat is
called to calculate the intermediate normal correlations. The
intermediate correlation between \(Z_1\) and \(Z_2\) (where \(Z_1\) is the standard normal variable
transformed using Headrick (2002)’s
fifth-order or Fleishman (1978)’s
third-order PMT to produce a continuous variable \(Y_1\), and \(Z_2\) is the standard normal variable
discretized to produce an ordinal variable \(Y_2\)) is calculated by dividing the target
correlation by a correction factor. The correction factor is the product
of the point-polyserial correlation between \(Y_2\) and \(Z_2\) (Olsson et al.
1982):
\[\begin{equation}
{{\rho}}_{{Y}_{2}{Z}_{2}} =
\frac{{\rho}_{{Z}_{2}{Z}_{2}}}{{\sigma}_{{Y}_{2}}}
\sum_{j=1}^{r-1}\phi({\tau}_{j})({y}_{{2}_{j+1}}-{y}_{{2}_{j}}) =
\frac{1}{{\sigma}_{{Y}_{2}}}
\sum_{j=1}^{r-1}\phi({\tau}_{j})({y}_{{2}_{j+1}}-{y}_{{2}_{j}}),
(\#eq:System6)
\end{equation}\]
and the power method correlation between \(Y_1\) and \(Z_1\) (Headrick and
Kowalchuk 2007):
\[\begin{equation}
{c}_{1}+3{c}_{3}+15{c}_{5}. (\#eq:System7)
\end{equation}\]
The constant \(c_5 = 0\) for the
third-order PMT. Then the intermediate normal correlation \(\rho_{{Z}_{1}{Z}_{2}}\) required to obtain
the target correlation \(\rho_{{Y}_{1}{Y}_{2}}\) is given by:
\[\begin{equation} \begin{split} {\rho}_{{Z}_{1}{Z}_{2}} &= \frac{{\rho}_{{Y}_{1}{Y}_{2}}{\sigma}_{{Y}_{2}}}{({c}_{1}+3{c}_{3}+15{c}_{5})\ \sum_{j=1}^{r-1}\phi({\tau}_{j})({y}_{{2}_{j+1}}-{y}_{{2}_{j}})} \\ &= \frac{{\rho}_{{Y}_{1}{Y}_{2}}{\sigma}_{{Y}_{2}}}{({c}_{1}+3{c}_{3}+15{c}_{5})\ \sum_{j=1}^{r-1}\phi(\Phi^{-1}(\sum_{i=1}^{j}{p}_{j}))({y}_{{2}_{j+1}}-{y}_{{2}_{j}})}. \end{split} (\#eq:System8) \end{equation}\]
Here, \(\phi\) is the standard
normal PDF, \({\sigma}_{{Y}_{2}}\) is
the standard deviation of the ordinal variable, and \({\mu}_{{Y}_{2}}\) is its expected
value:
\[\begin{equation}
{\sigma}_{{Y}_{2}} = \sqrt{\sum_{j=1}^{r}{y}_{{2}_{j}}^2{p}_{j} -
{\mu}_{{Y}_{2}}^2},\ \ \ {\mu}_{{Y}_{2}} =
\sum_{j=1}^{r}{y}_{{2}_{j}}{p}_{j}. (\#eq:System9)
\end{equation}\]
Now the two methods will be contrasted.
The intermediate correlations used in correlation method 1 are more simulation based than those in correlation method 2, which means that accuracy increases with sample size and the number of repetitions. Specifying the seed allows for reproducibility. In addition, method 1 differs from method 2 in the following ways:
The count variable correlations extend the
method of Yahav and Shmueli (2012). The
intermediate correlation between \(Z_1\) and \(Z_2\) (the standard normal variables used
to generate the count variables \(Y_1\)
and \(Y_2\) via the inverse CDF method)
is calculated using a logarithmic transformation of the target
correlation. First, the upper and lower Frechet-Hoeffding bounds
(mincor, maxcor) on \({\rho}_{{Y}_{1}{Y}_{2}}\) are simulated
(Fréchet 1951; Hoeffding 1994). See the Calculation of Correlation Boundaries
vignette for details on the Frechet-Hoeffding correlation boundaries.
The intermediate correlation \({\rho}_{{Z}_{1}{Z}_{2}}\) is found as
follows:
\[\begin{equation}
{\rho}_{{Z}_{1}{Z}_{2}} = \frac{1}{b} * log
\Bigg(\frac{{\rho}_{{Y}_{1}{Y}_{2}} - c}{a} \Bigg),
(\#eq:System10)
\end{equation}\]
where \[a = -\frac{maxcor * mincor}{maxcor +
mincor},\ \ \ b = log \Bigg(\frac{maxcor + a}{a} \Bigg),\ \ \ c =
-a.\]
intercorr_pois is called to
calculate the intermediate correlations for all variables.intercorr_nb is
called to calculate the intermediate correlations for all
variables.intercorr_pois_nb is called to calculate the intermediate
correlations for all variables.This method becomes less accurate as the variable mean gets closer to zero. The distribution functions are taken from the VGAM package (Yee 2018).
The ordinal - count variable correlations are based on an extension of the method of Amatya and Demirtas (2015), in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and a simulated upper bound on the correlation between an ordinal variable and the normal variable used to generate it (Demirtas and Hedeker 2011). The intermediate correlations are the ratio of the target correlations to the correction factors.
intercorr_cat_pois is
called to calculate the intermediate correlations for all
variables.intercorr_cat_nb
is called to calculate the intermediate correlations for all
variables.The continuous - count variable correlations are based on an extension of the methods of Amatya and Demirtas (2015) and Demirtas et al. (2012), in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and the power method correlation between the continuous variable and the normal variable used to generate it (Headrick and Kowalchuk 2007). The intermediate correlations are the ratio of the target correlations to the correction factors.
intercorr_cont_pois is
called to calculate the intermediate correlations for all
variables.intercorr_cont_nb
is called to calculate the intermediate correlations for all
variables.The simulation functions do not perform checks on the distribution
parameters or target correlation matrix rho. This should be
done first using validpar to ensure they are of the correct
dimension, format, and/or sign. The function validcorr
should also be used to check if rho is within the feasible
bounds and determine the lower and upper correlation limits. Summaries
of simulated variables can be obtained using
summary_var.
The algorithm used in the simulation function corrvar
that employs correlation method 1 is as follows:
If continuous variables are desired, the standardized cumulants are checked to see if there are any repeated distributions (i.e., if the user wishes to simulate two variables). These are noted so that the constants are calculated only once.
The constants are calculated for the continuous non-mixture
variables and components of continuous mixture variables using
SimMultiCorrData::find_constants. If no solutions are found
that generate valid power method PDF’s, the function will return
constants that produce invalid PDF’s (or a stop error if no solutions
can be found). Errors regarding constant calculation are the most
probable cause of function failure. Possible solutions include changing
the seed or using a list of sixth cumulant correction values (if
method = “Polynomial”).
The default support is created for the ordinal variables (if no support is provided).
The intermediate correlation matrix Sigma is
calculated using intercorr. Note that this will return a
matrix that is not positive-definite. If so and
use.nearPD = TRUE, the algorithm of Higham (2002) is used (see
Matrix::nearPD) to produce the nearest positive-definite
matrix and a message is given. Otherwise, negative eigenvalues are
replaced with \(0\).
k <- k_cat + k_cont + k_mix + k_pois + k_nb
multivariate normal variables (\(X_{nxk}\)) with correlation matrix
Sigma are generated using singular value decomposition on a
\(MVN_{nxk}(0,\ 1)\) matrix and
eigenvalue decomposition on Sigma.
The variables are generated from \(X_{nxk}\) using the appropriate transformations (see Variable Types vignette).
The final correlation matrix is calculated, and the maximum error
(maxerr) from the target correlation matrix is
found.
If the error loop is specified (error_loop = TRUE),
it is used on each variable pair to correct the final correlation until
it is within epsilon of the target correlation or the
maximum number of iterations maxit has been
reached.
If continuous mixture variables are desired, these are created from the component variables.
The intermediate correlations used in correlation method 2 are less simulation based than those in correlation method 1. Their calculations involve greater utilization of correction loops which make iterative adjustments until a maximum error has been reached (if possible). In addition, method 2 differs from method 1 in the following ways:
The intermediate correlations involving count
variables are based on the methods of Barbiero & Ferrari
(2012; 2015). The Poisson or Negative
Binomial support is made finite by removing a small user-specified value
(i.e. \(0.0001\)) from the total
cumulative probability. This truncation factor may differ for each count
variable (see maxcount_support). The count variables are
subsequently treated as ordinal and intermediate correlations are
calculated using the correction loop of ord_norm.
The continuous - count variable correlations are based on an extension of the method of Demirtas et al. (2012), and the count variables are treated as ordinal. The correction factor is the product of the power method correlation between the continuous variable and the normal variable used to generate it (Headrick and Kowalchuk 2007) and the point-polyserial correlation between the ordinalized count variable and the normal variable used to generate it (Olsson et al. 1982). The intermediate correlations are the ratio of the target correlations to the correction factor.
intercorr_cont_pois2 is
called to calculate the intermediate correlations for all
variables.intercorr_cont_nb2 is called to calculate the intermediate
correlations for all variables.The algorithm used in the simulation function corrvar2
that employs correlation method 2 is similar to that described for
corrvar, with a few modifications:
The feasibility of rho, given the distribution
parameters, should be checked first using the function
validcorr2, which checks if rho is within the
feasible bounds and returns the lower and upper correlation
limits.
After the support is created for the ordinal variables (if no
support is provided), the maximum support for the count variables is
determined using maxcount_support, given truncation value
vector pois_eps for Poisson variables and/or
nb_eps for Negative Binomial variables. The cumulative
probability truncation value may differ by variable, but a good value is
\(0.0001\). The resulting supports and
distribution parameters are used to create marginal lists, consisting of
the cumulative probabilities for each count variable.
The intermediate correlation matrix Sigma is
calculated using intercorr2.