There are two simulation pathways
which differ primarily according to the calculation of the intermediate
correlation matrix Sigma
. Note, unless otherwise indicated,
the functions referenced below come from
SimMultiCorrData
.
First, the intermediate correlation calculations which are equivalent in the two pathways will be discussed by variable type.
Correlations are computed pairwise. If both variables are
binary, the method of Demirtas et al. (2012) is used to find the tetrachoric
correlation (code adapted from
BinNonNor::Tetra.Corr.BB
). The tetrachoric correlation is
an estimate of the binary correlation measured on a continuous scale.
The assumptions are that the binary variables arise from latent normal
variables, and the actual trait is continuous and not discrete. This
method is based on Emrich and Piedmonte
(1991)’s work, in which the joint binary distribution is
determined from the third and higher moments of a multivariate normal
distribution:
Let $\Large Y_{1}$ and $\Large Y_{2}$ be binary variables with $\Large E[Y_{1}] = Pr(Y_{1} = 1) = p_{1}$, $\Large E[Y_{2}] = Pr(Y_{2} = 1) = p_{2}$, and correlation $\Large \rho_{y1y2}$.
Let $\Large \Phi[x_{1}, x_{2}, \rho_{x1x2}]$ be the standard bivariate normal cumulative distribution function, given by: $$\Large \Phi[x_{1}, x_{2}, \rho_{x1x2}] = \int_{-\infty}^{x_{1}} \int_{-\infty}^{x_{2}} f(z_{1}, z_{2}, \rho_{x1x2})\ dz_{1} dz_{2},$$ where $$\Large f(z_{1}, z_{2}, \rho_{x1x2}) = [2\pi\sqrt{1 - \rho_{x1x2}^2}]^{-1} * exp[-0.5(z_{1}^2 - 2\rho_{x1x2}z_{1}z_{2} + z_{2}^2)/(1 - \rho_{x1x2}^2)].$$ Then solving the equation $$\Large \Phi[z(p_{1}), z(p_{2}), \rho_{x1x2}] = \rho_{y1y2}\sqrt{p_{1}(1 - p_{1})p_{2}(1 - p_{2})} + p_{1}p_{2}$$ for $\Large \rho_{x1x2}$ gives the intermediate correlation of the standard normal variables needed to generate binary variables with correlation $\Large \rho_{y1y2}$. Here $\Large z(p)$ indicates the $\Large p^{th}$ quantile of the standard normal distribution.
To generate the binary variables from the standard normal variables, set $\Large Y_{1} = 1$ if $\Large Z_{1} \le z(p_{1})$ and $\Large Y_{1} = 0$ otherwise. Similarly, set $\Large Y_{2} = 1$ if $\Large Z_{2} \le z(p_{2})$ and $\Large Y_{2} = 0$ otherwise.
This ensures: $$\Large E[Y_{1}] = Pr(Y_{1} = 1) = Pr(Z_{1} \le z(p_{1})) = p_{1},$$ $$\Large E[Y_{2}] = Pr(Y_{2} = 1) = Pr(Z_{2} \le z(p_{2})) = p_{2},$$ $$\Large Cov(Y_{1}, Y_{2}) = Pr(Y_{1} = 1, Y_{2} = 1) - p_{1}p_{2}$$ $$\Large = Pr(Z_{1} \le z(p_{1}), Z_{2} \le z(p_{2})) - p_{1}p_{2}$$ $$\Large = \Phi[z(p_{1}), z(p_{2}), \rho_{x1x2}] - p_{1}p_{2}$$ $$\Large = \rho_{y1y2}\sqrt{p_{1}(1 - p_{1})p_{2}(1 - p_{2})},$$ $$\Large Cor(Y_{1}, Y_{2}) = Cov(Y_{1}, Y_{2})/\sqrt{p_{1}(1 - p_{1})p_{2}(1 - p_{2})} = \rho_{y1y2}.$$
Otherwise, ordnorm
is called for each
pair. If the resulting intermediate matrix is not positive-definite,
this is corrected for later.
Correlations are computed pairwise. findintercorr_cont
is called for each pair.
findintercorr_cont_cat
is called to calculate the
intermediate MVN correlation for all Continuous and Ordinal
combinations.
Now the two methods will be contrasted.
The intermediate correlations used in correlation method 1 are more
simulation based than those in correlation method 2, which means that
accuracy increases with sample size and the number of repetitions (see
findintercorr
). Specifying the seed allows for
reproducibility. In addition, method 1 differs from method 2 in the
following ways:
The intermediate correlation for count variables is based on the method of Yahav & Shmueli (2012), which uses a simulation based, logarithmic transformation of the target correlation. This method becomes less accurate as the variable mean gets closer to zero.
findintercorr_pois
is
called to calculate the intermediate MVN correlation for all
variables.findintercorr_nb
is called to calculate the intermediate MVN correlation for all
variables.The ordinal - count variable correlations are based on an extension of the method of Amatya & Demirtas (2015), in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and a simulated upper bound on the correlation between an ordinal variable and the normal variable used to generate it (see Demirtas and Hedeker (2011)).
findintercorr_cat_pois
is
called to calculate the intermediate MVN correlation for all
variables.findintercorr_cat_nb
is called to calculate the
intermediate MVN correlation for all variables.The continuous - count variable correlations are based on an extension of the methods of Amatya & Demirtas (2015) and Demirtas et al. (2012), in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and the power method correlation between the continuous variable and the normal variable used to generate it (see Headrick and Kowalchuk (2007)). The intermediate correlations are the ratio of the target correlations to the correction factor.
findintercorr_cont_pois
is
called to calculate the intermediate MVN correlation for all
variables.findintercorr_cont_nb
is called to calculate the
intermediate MVN correlation for all variables.The algorithm used in the simulation function rcorrvar
that employs correlation method 1 is as follows:
Preliminary checks on the distribution parameters and target
correlation matrix rho
are performed to ensure they are of
the correct dimension, format, and/or sign. This function does NOT
verify the feasibility of rho
, given the distribution
parameters. That should be done first using valid_corr
,
which checks if rho
is within the feasible bounds and
returns the lower and upper correlation limits.
The constants are calculated for the continuous variables using
find_constants
. If no solutions are found that generate
valid power method pdfs, the function will return constants that produce
invalid pdfs (or a stop error if no solutions can be found). Errors
regarding constant calculation are the most probable cause of function
failure. Possible solutions include:
method
= “Polynomial”).The support is created for the ordinal variables (if no support provided).
The intermediate correlation matrix Sigma
is
calculated using findintercorr
. Note that this will return
a matrix that is not positive-definite. If so, there will be a message
that it may not be possible to produce variables with the desired
distributions. Also, the algorithm of Higham
(2002) is used (see Matrix::nearPD
) to produce the
nearest positive-definite matrix and a message is given.
k <- k_cat + k_cont + k_pois + k_nb
multivariate
normal variables ($\Large X_{nxk}$)
with correlation matrix Sigma
are generated using
eigen-value and spectral value decompositions on a $\Large MVN_{nxk}(0,1)$ matrix.
The variables are generated from $\Large X_{nxk}$ using the appropriate transformations (see Variable Types vignette)
The final correlation matrix is calculated, and the maximum error
(maxerr
) from the target correlation matrix is
found.
If the error loop is specified (error_loop
= TRUE),
it is used on each variable pair to correct the final correlation until
it is within epsilon of the target correlation or the maximum number of
iterations has been reached. Additionally, if the extra correction is
specified(extra_correct
= TRUE), if the maximum error
within each variable pair is still greater than 0.1, the intermediate
correlation is set equal to the target correlation (with the assumption
that the calculated final correlation will be less than 0.1 away from
the target).
Summary statistics are calculated by variable type.
The intermediate correlations used in correlation method 2 are less
simulation based than those in correlation method 1 (see
findintercorr2
). Their calculations involve greater
utilization of correction loops which make iterative adjustments until a
maximum error has been reached (if possible). In addition, method 2
differs from method 1 in the following ways:
The intermediate correlations involving count
variables are based on the methods of Barbiero &
Ferrari (2012; 2015). The Poisson or
Negative Binomial support is made finite by removing a small
user-specified value (i.e. 1e-06) from the total cumulative probability.
This truncation factor may differ for each count variable (see
max_count_support
). The count variables are subsequently
treated as ordinal and intermediate correlations are calculated using
the correction loop of ordnorm
.
The continuous - count variable correlations are based on an extension of the method of Demirtas et al. (2012), and the count variables are treated as ordinal. The correction factor is the product of the power method correlation between the continuous variable and the normal variable used to generate it (see Headrick and Kowalchuk (2007)) and the point-polyserial correlation between the ordinalized count variable and the normal variable used to generate it (see Olsson, Drasgow, and Dorans (1982)). The intermediate correlations are the ratio of the target correlations to the correction factor.
findintercorr_cont_pois2
is
called to calculate the intermediate MVN correlation for all
variables.findintercorr_cont_nb2
is called to calculate the
intermediate MVN correlation for all variables.The algorithm used in the simulation function rcorrvar2
that employs correlation method 2 is similar to that described for
rcorrvar
, with a few modifications:
The feasibility of rho
, given the distribution
parameters, should be checked first using the function
valid_corr2
, which checks if rho
is within the
feasible bounds and returns the lower and upper correlation
limits.
After the support is created for the ordinal variables (if no
support is provided), the maximum support for the count variables is
determined using max_count_support
, given truncation value
vector pois_eps
for Poisson variables and/or
nb_eps
for Negative Binomial variables. The cumulative
probability truncation value may differ by variable, but a good value is
0.0001. The resulting supports and
distribution parameters are used to create marginal lists, consisting of
the cumulative probabilities for each count variable.
The intermediate correlation matrix Sigma
is
calculated using findintercorr2
.