Comparison of Correlation Methods 1 and 2

There are two simulation pathways which differ primarily according to the calculation of the intermediate correlations involving count variables. The simulation function corrvar for correlation method 1 calls the function intercorr. The simulation function corrvar2 for correlation method 2 calls the function intercorr2. Both of these call additional auxiliary functions as needed. The ordering of the variables in the target correlation matrix rho must be 1st ordinal, 2nd continuous non-mixture, 3rd components of continuous mixture variables, 4th regular Poisson, 5th zero-inflated Poisson, 6th regular NB, and 7th zero-inflated NB. Note that the target correlations are specified in terms of the correlations with components of continuous mixture variables. This allows the user to set the correlation between components of the same mixture variable to any desired value. If this correlation is set to 0, the intermediate correlation matrix Sigma may need to be converted to the nearest positive-definite matrix. This is done within the simulation functions by specifying use.nearPD = TRUE. Higham (2002)’s algorithm is executed with the Matrix::nearPD function (Bates and Maechler 2018). Otherwise, negative eigenvalues are replaced with 0. Some code has been modified from the SimMultiCorrData package (Fialkowski 2018).
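For illustration, the sketch below shows how a non-positive-definite intermediate correlation matrix could be repaired with Matrix::nearPD; the indefinite matrix here is constructed only for this example.

```r
# Minimal sketch: repair a non-positive-definite intermediate correlation
# matrix with Higham (2002)'s algorithm via Matrix::nearPD. The indefinite
# matrix below is only an example.
library(Matrix)
Sigma <- matrix(c(1, 0.9, -0.9,
                  0.9, 1, 0.9,
                  -0.9, 0.9, 1), 3, 3)
min(eigen(Sigma, symmetric = TRUE)$values)             # negative, so not positive-definite
Sigma_pd <- as.matrix(nearPD(Sigma, corr = TRUE)$mat)  # nearest positive-definite correlation matrix
```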

Methods Used in Both Pathways:

First, the intermediate correlation calculations which are equivalent in the two pathways will be discussed by variable type.

Ordinal Variables:

If both variables are binary, the method of Demirtas, Hedeker, and Mermelstein (2012) is used to find the tetrachoric correlation (code adapted from Inan and Demirtas (2016)’s BinNonNor::Tetra.Corr.BB). The tetrachoric correlation is an estimate of the binary correlation measured on a continuous scale. The assumptions are that the binary variables arise from latent normal variables, and the actual trait is continuous and not discrete. This method is based on Emrich and Piedmonte (1991)’s work, in which the joint binary distribution is determined from the third and higher moments of a multivariate normal distribution:

Let Y1 and Y2 be binary variables with E[Y1] = Pr(Y1 = 1) = p1, E[Y2] = Pr(Y2 = 1) = p2, and correlation ρy1y2. Note here that p1 = 1 − marginal[[1]][1] and p2 = 1 − marginal[[2]][1], so that the user-supplied probabilities are associated with the lower support value. Then solving the equation

$$\Phi\left[z(p_1), z(p_2), \rho_{x_1x_2}\right] = \rho_{y_1y_2}\sqrt{p_1 q_1 p_2 q_2} + p_1 p_2,\ \ \ q_1 = 1 - p_1,\ q_2 = 1 - p_2,$$

for ρx1x2 gives the intermediate correlation of the standard normal variables needed to generate binary variables with correlation ρy1y2. Here, Φ is the standard bivariate normal CDF and z(p) indicates the pth quantile of the standard normal distribution. To generate the binary variables from the standard normal variables, set Y1 = 1 if Z1 ≤ z(p1) and Y1 = 0 otherwise. Similarly, set Y2 = 1 if Z2 ≤ z(p2) and Y2 = 0 otherwise.
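The sketch below illustrates one way to solve this equation numerically; it is not the package's Tetra.Corr.BB code, and it assumes the mvtnorm package for the bivariate normal CDF and a target correlation within the feasible bounds.

```r
# Minimal sketch (not the package's Tetra.Corr.BB code): solve the equation
# above for the tetrachoric (intermediate normal) correlation with uniroot,
# using mvtnorm::pmvnorm for the standard bivariate normal CDF. The target
# correlation rho_y is assumed to lie within its feasible bounds.
library(mvtnorm)
tetra_corr <- function(p1, p2, rho_y) {
  q1 <- 1 - p1; q2 <- 1 - p2
  f <- function(rho_x) {
    pmvnorm(upper = c(qnorm(p1), qnorm(p2)),
            corr = matrix(c(1, rho_x, rho_x, 1), 2, 2)) -
      rho_y * sqrt(p1 * q1 * p2 * q2) - p1 * p2
  }
  uniroot(f, interval = c(-0.999, 0.999))$root
}
tetra_corr(p1 = 0.6, p2 = 0.4, rho_y = 0.3)
```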

For binary-ordinal or ordinal-ordinal pairs, ord_norm is called. The algorithm to simulate k_cat ordinal random variables is as follows:

  1. If a support is not provided, create the default of 1, ..., r for an ordinal variable with r categories.

  2. Use the norm_ord function to calculate the initial correlation of the ordinal variables (rhoordold) formed by discretizing k_cat random normal variates with correlation matrix set equal to rho0, using the cumulative probabilities supplied in marginal and the corresponding normal quantiles.

  3. Let rho be the intermediate normal correlation updated in each iteration, rhoord be the ordinal correlation calculated in each iteration (initialized at rhoordold), rhoold be the intermediate correlation from the previous iteration, it be the iteration number, maxit be the user-specified maximum iteration number, and epsilon be the user-specified maximum pairwise correlation error. For each variable pair, execute the following:

  a. If rho0 = 0, set rho = 0.
  b. While the absolute error between rhoord and rho0 is greater than epsilon and it is less than maxit:
    i) If rho0 * (rho0/rhoord) ≤ −1, then: rho = rhoold * (1 + 0.1 * (1 - rhoold) * -sign(rho0 - rhoord)).
    ii) If rho0 * (rho0/rhoord) ≥ 1, then: rho = rhoold * (1 + 0.1 * (1 - rhoold) * sign(rho0 - rhoord)).
    iii) Otherwise, rho = rhoold * (rho0/rhoord).
    iv) If rho > 1, set rho = 1. If rho < −1, set rho = −1.
    v) Calculate rhoord using norm_ord and the 2 × 2 correlation matrix formed by rho.
    vi) Set rhoold = rho and increase it by 1.
  c. Store the number of iterations in the matrix niter.

  4. Return the final intermediate correlation matrix SigmaC = rho for the random normal variables. Discretize these to produce ordinal variables with the desired correlation matrix.
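The pairwise correction loop can be illustrated with the following sketch. It is an assumed re-implementation for illustration only: calc_rho_ord stands in for norm_ord and is approximated here by simulation for two binary variables, which is why a looser tolerance is used in the example call (the package computes the ordinal correlation analytically).

```r
# Minimal sketch (assumed re-implementation, not the ord_norm code).
# calc_rho_ord() stands in for norm_ord(): given an intermediate normal
# correlation rho, it returns the correlation of the ordinal variables obtained
# by discretizing the normal variables. Here it is approximated by simulation
# for two binary variables with P(Y = 0) = 0.3 and 0.5.
set.seed(1)
z <- matrix(rnorm(2e5), ncol = 2)
calc_rho_ord <- function(rho) {
  z2 <- cbind(z[, 1], rho * z[, 1] + sqrt(1 - rho^2) * z[, 2])
  cor(ifelse(z2[, 1] > qnorm(0.3), 1, 0), ifelse(z2[, 2] > qnorm(0.5), 1, 0))
}
ord_norm_pair <- function(rho0, epsilon = 0.001, maxit = 1000) {
  if (rho0 == 0) return(list(rho = 0, niter = 0))
  rho <- rho0
  rho_ord <- calc_rho_ord(rho)
  it <- 0
  while (abs(rho_ord - rho0) > epsilon && it < maxit) {
    rho_old <- rho
    if (rho0 * (rho0 / rho_ord) <= -1) {
      rho <- rho_old * (1 + 0.1 * (1 - rho_old) * -sign(rho0 - rho_ord))
    } else if (rho0 * (rho0 / rho_ord) >= 1) {
      rho <- rho_old * (1 + 0.1 * (1 - rho_old) * sign(rho0 - rho_ord))
    } else {
      rho <- rho_old * (rho0 / rho_ord)
    }
    rho <- max(min(rho, 1), -1)      # step iv: keep rho in [-1, 1]
    rho_ord <- calc_rho_ord(rho)     # step v: recalculate the ordinal correlation
    it <- it + 1                     # step vi
  }
  list(rho = rho, niter = it)
}
ord_norm_pair(rho0 = 0.4, epsilon = 0.01)  # looser tolerance: calc_rho_ord is simulation-based here
```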

Continuous Variables:

Correlations are computed pairwise. The function intercorr_cont uses the equations derived by Headrick and Sawilowsky (1999) for the third-order and Headrick (2002) for the fifth-order power method transformation (PMT).

For two continuous variables Yi and Yj generated using Headrick (2002)’s fifth-order PMT, the intermediate normal correlation ρZiZj required to obtain the target correlation ρYiYj is the solution to the following:

$$\begin{split} \rho_{Y_iY_j} = \ & c_{0_i}c_{0_j} + c_{0_i}c_{2_j} + 3c_{0_i}c_{4_j} + c_{2_i}c_{0_j} + c_{2_i}c_{2_j} + 3c_{2_i}c_{4_j} + 3c_{4_i}c_{0_j} + 3c_{4_i}c_{2_j} + 9c_{4_i}c_{4_j} \\ & + \rho_{Z_iZ_j}\left(c_{1_i}c_{1_j} + 3c_{1_i}c_{3_j} + 15c_{1_i}c_{5_j} + 3c_{3_i}c_{1_j} + 9c_{3_i}c_{3_j} + 45c_{3_i}c_{5_j} + 15c_{5_i}c_{1_j} + 45c_{5_i}c_{3_j} + 225c_{5_i}c_{5_j}\right) \\ & + \rho_{Z_iZ_j}^2\left(2c_{2_i}c_{2_j} + 12c_{2_i}c_{4_j} + 12c_{4_i}c_{2_j} + 72c_{4_i}c_{4_j}\right) \\ & + \rho_{Z_iZ_j}^3\left(6c_{3_i}c_{3_j} + 60c_{3_i}c_{5_j} + 60c_{5_i}c_{3_j} + 600c_{5_i}c_{5_j}\right) + 24\rho_{Z_iZ_j}^4\,c_{4_i}c_{4_j} + 120\rho_{Z_iZ_j}^5\,c_{5_i}c_{5_j}. \end{split}$$

For two continuous variables Yi and Yj generated using Fleishman (1978)’s third-order PMT, the intermediate normal correlation ρZiZj required to obtain the target correlation ρYiYj is the solution to the following:

$$\rho_{Y_iY_j} = \rho_{Z_iZ_j}\left(c_{1_i}c_{1_j} + 3c_{1_i}c_{3_j} + 3c_{3_i}c_{1_j} + 9c_{3_i}c_{3_j}\right) + 2\rho_{Z_iZ_j}^2\,c_{2_i}c_{2_j} + 6\rho_{Z_iZ_j}^3\,c_{3_i}c_{3_j}.$$
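The sketch below illustrates solving the third-order equation numerically for a single variable pair; it is not intercorr_cont itself, and the constants in the example call are illustrative rather than output from SimMultiCorrData::find_constants.

```r
# Minimal sketch (assumed, not intercorr_cont itself): solve the third-order
# equation above for the intermediate normal correlation of one variable pair.
# ci and cj hold each variable's Fleishman constants (c0, c1, c2, c3), and the
# target correlation rho_y is assumed to lie within the feasible bounds.
fleishman_rho_z <- function(rho_y, ci, cj) {
  f <- function(r) {
    r * (ci[2] * cj[2] + 3 * ci[2] * cj[4] + 3 * ci[4] * cj[2] + 9 * ci[4] * cj[4]) +
      r^2 * (2 * ci[3] * cj[3]) + r^3 * (6 * ci[4] * cj[4]) - rho_y
  }
  uniroot(f, interval = c(-1, 1))$root
}
# Example with constants for a standard normal variable (c = (0, 1, 0, 0)) and
# illustrative constants for a mildly skewed variable:
fleishman_rho_z(0.5, c(0, 1, 0, 0), c(-0.1, 0.95, 0.1, 0.01))
```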

Continuous-Ordinal Pairs:

The function SimMultiCorrData::findintercorr_cont_cat is called to calculate the intermediate normal correlations. The intermediate correlation between Z1 and Z2 (where Z1 is the standard normal variable transformed using Headrick (2002)’s fifth-order or Fleishman (1978)’s third-order PMT to produce a continuous variable Y1, and Z2 is the standard normal variable discretized to produce an ordinal variable Y2) is calculated by dividing the target correlation by a correction factor. The correction factor is the product of the point-polyserial correlation between Y2 and Z2 (Olsson, Drasgow, and Dorans 1982):

$$\rho_{Y_2Z_2} = \frac{1}{\sigma_{Y_2}}\sum_{j=1}^{r-1} \phi(\tau_j)\left(y_{j+1} - y_j\right)$$

and the power method correlation between Y1 and Z1 (Headrick and Kowalchuk 2007):

$$\rho_{Y_1Z_1} = c_1 + 3c_3 + 15c_5.$$

The constant c5 = 0 for the third-order PMT. Then the intermediate normal correlation ρZ1Z2 required to obtain the target correlation ρY1Y2 is given by:

$$\rho_{Z_1Z_2} = \frac{\rho_{Y_1Y_2}}{\rho_{Y_1Z_1}\,\rho_{Y_2Z_2}} = \frac{\rho_{Y_1Y_2}}{\left(c_1 + 3c_3 + 15c_5\right)\rho_{Y_2Z_2}}.$$

Here, ϕ is the standard normal PDF, the ordinal variable Y2 takes r ordered support values y1 < ... < yr with marginal probabilities p1, ..., pr and normal quantiles τj = Φ−1(p1 + ... + pj), σY2 is the standard deviation of Y2, and μY2 is its expected value:

$$\mu_{Y_2} = \sum_{j=1}^{r} y_j\,p_j,\ \ \ \sigma_{Y_2} = \sqrt{\sum_{j=1}^{r} \left(y_j - \mu_{Y_2}\right)^2 p_j}.$$
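The following sketch pulls these pieces together for one continuous - ordinal pair; it is an assumed illustration, not findintercorr_cont_cat itself, and the cumulative probabilities, support, and constants in the example call are made up for this example.

```r
# Minimal sketch (assumed, not findintercorr_cont_cat itself): intermediate
# correlation for one continuous - ordinal pair. 'marg' holds the ordinal
# variable's cumulative probabilities (excluding the final 1), 'support' its
# ordered support values, and 'con' the continuous variable's power method
# constants (c0, ..., c5).
cont_cat_rho_z <- function(rho_y, marg, support, con) {
  p <- diff(c(0, marg, 1))                        # marginal (cell) probabilities
  mu_y2 <- sum(support * p)                       # E[Y2]
  sd_y2 <- sqrt(sum((support - mu_y2)^2 * p))     # standard deviation of Y2
  tau <- qnorm(marg)                              # normal quantiles of the cumulative probabilities
  pp <- sum(dnorm(tau) * diff(support)) / sd_y2   # point-polyserial correlation
  pm <- con[2] + 3 * con[4] + 15 * con[6]         # power method correlation
  rho_y / (pp * pm)                               # intermediate normal correlation
}
# Example: 3-category ordinal variable with cumulative probabilities 0.3 and 0.75,
# paired with a nearly normal continuous variable (illustrative constants).
cont_cat_rho_z(0.35, marg = c(0.3, 0.75), support = 1:3,
               con = c(0, 0.98, 0, 0.005, 0, 0.0005))
```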

Now the two methods will be contrasted.

Overview of Correlation Method 1:

The intermediate correlations used in correlation method 1 are more simulation-based than those in correlation method 2, which means that accuracy increases with sample size and the number of repetitions. Specifying the seed allows for reproducibility. In addition, method 1 differs from method 2 in the following ways:

  1. The count variable correlations extend the method of Yahav and Shmueli (2012). The intermediate correlation between Z1 and Z2 (the standard normal variables used to generate the count variables Y1 and Y2 via the inverse CDF method) is calculated using a logarithmic transformation of the target correlation. First, the lower and upper Frechet-Hoeffding bounds (mincor, maxcor) on ρY1Y2 are simulated (Fréchet 1951; Hoeffding 1994). See the Calculation of Correlation Boundaries vignette for details on the Frechet-Hoeffding correlation boundaries. The intermediate correlation ρZ1Z2 is then found as follows (a sketch in R follows this list):

    $$\rho_{Z_1Z_2} = \frac{1}{b}\,log \Bigg(\frac{\rho_{Y_1Y_2} - c}{a} \Bigg),$$

    where $$a = -\frac{maxcor * mincor}{maxcor + mincor},\ \ \ b = log \Bigg(\frac{maxcor + a}{a} \Bigg),\ \ \ c = -a.$$

    1. Poisson variables: intercorr_pois is called to calculate the intermediate correlations for all variables.
    2. Negative Binomial variables: intercorr_nb is called to calculate the intermediate correlations for all variables.
    3. Poisson-Negative Binomial variable pairs: intercorr_pois_nb is called to calculate the intermediate correlations for all variables.

This method becomes less accurate as the variable mean gets closer to zero. The distribution functions are taken from the VGAM package (Yee 2018).

  2. The ordinal - count variable correlations are based on an extension of the method of Amatya and Demirtas (2015), in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and a simulated upper bound on the correlation between an ordinal variable and the normal variable used to generate it (Demirtas and Hedeker 2011). The intermediate correlations are the ratio of the target correlations to the correction factors.

    1. Poisson variables: intercorr_cat_pois is called to calculate the intermediate correlations for all variables.
    2. Negative Binomial variables: intercorr_cat_nb is called to calculate the intermediate correlations for all variables.
  3. The continuous - count variable correlations are based on an extension of the methods of Amatya and Demirtas (2015) and Demirtas, Hedeker, and Mermelstein (2012), in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and the power method correlation between the continuous variable and the normal variable used to generate it (Headrick and Kowalchuk 2007). The intermediate correlations are the ratio of the target correlations to the correction factors.

    1. Poisson variables: intercorr_cont_pois is called to calculate the intermediate correlations for all variables.
    2. Negative Binomial variables: intercorr_cont_nb is called to calculate the intermediate correlations for all variables.
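As referenced in the first item above, the logarithmic transformation can be sketched as follows; the bound values in the example call are placeholders standing in for simulated Frechet-Hoeffding bounds.

```r
# Minimal sketch (assumed): intermediate correlation for a pair of count
# variables in method 1, given simulated Frechet-Hoeffding bounds mincor and
# maxcor (mincor < 0 < maxcor) on the pair's correlation.
count_rho_z <- function(rho_y, mincor, maxcor) {
  a <- -(maxcor * mincor) / (maxcor + mincor)
  b <- log((maxcor + a) / a)
  # invert rho_y = a * exp(b * rho_z) + c with c = -a
  log((rho_y + a) / a) / b
}
count_rho_z(rho_y = 0.4, mincor = -0.85, maxcor = 0.95)
```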

Simulation Process:

The simulation functions do not perform checks on the distribution parameters or target correlation matrix rho. This should be done first using validpar to ensure they are of the correct dimension, format, and/or sign. The function validcorr should also be used to check if rho is within the feasible bounds and determine the lower and upper correlation limits. Summaries of simulated variables can be obtained using summary_var.

The algorithm used in the simulation function corrvar that employs correlation method 1 is as follows:

  1. If continuous variables are desired, the standardized cumulants are checked to see if there are any repeated distributions (i.e., if the user wishes to simulate two or more variables with the same standardized cumulants). These are noted so that the constants are calculated only once.

  2. The constants are calculated for the continuous non-mixture variables and components of continuous mixture variables using SimMultiCorrData::find_constants. If no solutions can be found that generate valid power method PDFs, the function returns constants that produce invalid PDFs (or a stop error if no solutions can be found at all). Errors regarding constant calculation are the most probable cause of function failure. Possible solutions include changing the seed or using a list of sixth cumulant correction values (if method = “Polynomial”).

  3. The default support is created for the ordinal variables (if no support is provided).

  4. The intermediate correlation matrix Sigma is calculated using intercorr. Note that this may return a matrix that is not positive-definite. If so and use.nearPD = TRUE, the algorithm of Higham (2002) is used (see Matrix::nearPD) to produce the nearest positive-definite matrix and a message is given. Otherwise, negative eigenvalues are replaced with 0.

  5. k <- k_cat + k_cont + k_mix + k_pois + k_nb multivariate normal variables (Xnxk) with correlation matrix Sigma are generated using singular value decomposition on an MVNnxk(0, 1) matrix and eigenvalue decomposition on Sigma (a sketch of one way to carry out this step follows the list).

  6. The variables are generated from Xnxk using the appropriate transformations (see Variable Types vignette).

  7. The final correlation matrix is calculated, and the maximum error (maxerr) from the target correlation matrix is found.

  8. If the error loop is specified (error_loop = TRUE), it is used on each variable pair to correct the final correlation until it is within epsilon of the target correlation or the maximum number of iterations maxit has been reached.

  9. If continuous mixture variables are desired, these are created from the component variables.
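As noted in step 5, one common way to impose the intermediate correlation matrix on independent normal variates is via the eigenvalue decomposition of Sigma. The sketch below shows that idea; it is not necessarily the exact corrvar implementation.

```r
# Sketch (assumed; one common way to carry out step 5, not necessarily the
# exact corrvar implementation): impose an intermediate correlation matrix
# Sigma on an n x k matrix of independent standard normal variates via the
# eigenvalue decomposition of Sigma.
set.seed(1)
Sigma <- matrix(c(1, 0.4, 0.4, 1), 2, 2)           # example intermediate matrix
n <- 10000
k <- nrow(Sigma)
X0 <- matrix(rnorm(n * k), n, k)                   # independent N(0, 1) variates
eig <- eigen(Sigma, symmetric = TRUE)
sqrt_Sigma <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0)), k) %*% t(eig$vectors)
X <- X0 %*% sqrt_Sigma                             # columns now correlated approximately as Sigma
cor(X)                                             # close to Sigma for large n
```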

Overview of Correlation Method 2:

The intermediate correlations used in correlation method 2 are less simulation-based than those in correlation method 1. Their calculations involve greater utilization of correction loops, which make iterative adjustments until the pairwise correlation errors are within a specified tolerance (if possible). In addition, method 2 differs from method 1 in the following ways:

  1. The intermediate correlations involving count variables are based on the methods of Ferrari and Barbiero (2012) and Barbiero and Ferrari (2015). The Poisson or Negative Binomial support is made finite by removing a small user-specified value (e.g., 0.0001) from the total cumulative probability. This truncation factor may differ for each count variable (see maxcount_support and the sketch following this list). The count variables are subsequently treated as ordinal, and the intermediate correlations are calculated using the correction loop of ord_norm.

  2. The continuous - count variable correlations are based on an extension of the method of Demirtas, Hedeker, and Mermelstein (2012), and the count variables are treated as ordinal. The correction factor is the product of the power method correlation between the continuous variable and the normal variable used to generate it (Headrick and Kowalchuk 2007) and the point-polyserial correlation between the ordinalized count variable and the normal variable used to generate it (Olsson, Drasgow, and Dorans 1982). The intermediate correlations are the ratio of the target correlations to the correction factor.

    1. Poisson variables: intercorr_cont_pois2 is called to calculate the intermediate correlations for all variables.
    2. Negative Binomial variables: intercorr_cont_nb2 is called to calculate the intermediate correlations for all variables.
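As referenced in the first item above, the truncated count support and cumulative probabilities can be sketched as follows; this is an assumed illustration of the idea behind maxcount_support, not the function itself.

```r
# Minimal sketch (assumed illustration of the idea behind maxcount_support,
# not the function itself): build a finite support for a Poisson(lam) variable
# by dropping a tail probability of eps, then form the cumulative probabilities
# used to treat the count variable as ordinal.
lam <- 3
eps <- 0.0001
max_y <- qpois(1 - eps, lambda = lam)     # largest support value retained
support <- 0:max_y
cum_prob <- ppois(support, lambda = lam)  # cumulative probabilities over the support
cum_prob[length(cum_prob)] <- 1           # close the truncated support
```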

Simulation Process:

The algorithm used in the simulation function corrvar2 that employs correlation method 2 is similar to that described for corrvar, with a few modifications:

  1. The feasibility of rho, given the distribution parameters, should be checked first using the function validcorr2, which checks if rho is within the feasible bounds and returns the lower and upper correlation limits.

  2. After the support is created for the ordinal variables (if no support is provided), the maximum support for the count variables is determined using maxcount_support, given truncation value vector pois_eps for Poisson variables and/or nb_eps for Negative Binomial variables. The cumulative probability truncation value may differ by variable, but a good value is 0.0001. The resulting supports and distribution parameters are used to create marginal lists, consisting of the cumulative probabilities for each count variable.

  3. The intermediate correlation matrix Sigma is calculated using intercorr2.

References

Amatya, A, and H Demirtas. 2015. “Simultaneous Generation of Multivariate Mixed Data with Poisson and Normal Marginals.” Journal of Statistical Computation and Simulation 85 (15): 3129–39. https://doi.org/10.1080/00949655.2014.953534.
Barbiero, A, and P A Ferrari. 2015. “Simulation of Correlated Poisson Variables.” Applied Stochastic Models in Business and Industry 31: 669–80. https://doi.org/10.1002/asmb.2072.
Bates, Douglas, and Martin Maechler. 2018. Matrix: Sparse and Dense Matrix Classes and Methods. https://CRAN.R-project.org/package=Matrix.
Demirtas, H, and D Hedeker. 2011. “A Practical Way for Computing Approximate Lower and Upper Correlation Bounds.” The American Statistician 65 (2): 104–9. https://doi.org/10.1198/tast.2011.10090.
Demirtas, H, D Hedeker, and R J Mermelstein. 2012. “Simulation of Massive Public Health Data by Power Polynomials.” Statistics in Medicine 31 (27): 3337–46. https://doi.org/10.1002/sim.5362.
Emrich, L J, and M R Piedmonte. 1991. “A Method for Generating High-Dimensional Multivariate Binary Variates.” The American Statistician 45: 302–4. https://doi.org/10.1080/00031305.1991.10475828.
Ferrari, P A, and A Barbiero. 2012. “Simulating Ordinal Data.” Multivariate Behavioral Research 47 (4): 566–89. https://doi.org/10.1080/00273171.2012.692630.
Fialkowski, A C. 2018. SimMultiCorrData: Simulation of Correlated Data with Multiple Variable Types. https://CRAN.R-project.org/package=SimMultiCorrData.
Fleishman, A I. 1978. “A Method for Simulating Non-Normal Distributions.” Psychometrika 43: 521–32. https://doi.org/10.1007/BF02293811.
Fréchet, M. 1951. “Sur Les Tableaux de Corrélation Dont Les Marges Sont Données.” Annales de l’Université de Lyon. Section A: Sciences Mathématiques Et Astronomie 9: 53–77.
Headrick, T C. 2002. “Fast Fifth-Order Polynomial Transforms for Generating Univariate and Multivariate Non-Normal Distributions.” Computational Statistics and Data Analysis 40 (4): 685–711. https://doi.org/10.1016/S0167-9473(02)00072-5.
Headrick, T C, and R K Kowalchuk. 2007. “The Power Method Transformation: Its Probability Density Function, Distribution Function, and Its Further Use for Fitting Data.” Journal of Statistical Computation and Simulation 77: 229–49. https://doi.org/10.1080/10629360600605065.
Headrick, T C, and S S Sawilowsky. 1999. “Simulating Correlated Non-Normal Distributions: Extending the Fleishman Power Method.” Psychometrika 64: 25–35. https://doi.org/10.1007/BF02294317.
Higham, N. 2002. “Computing the Nearest Correlation Matrix - a Problem from Finance.” IMA Journal of Numerical Analysis 22 (3): 329–43. https://doi.org/10.1093/imanum/22.3.329.
Hoeffding, W. 1994. “Scale-Invariant Correlation Theory.” In The Collected Works of Wassily Hoeffding, edited by N I Fisher and P K Sen, 57–107. Springer Series in Statistics (Perspectives in Statistics). New York: Springer-Verlag. https://doi.org/10.1007/978-1-4612-0865-5_4.
Inan, Gul, and Hakan Demirtas. 2016. BinNonNor: Data Generation with Binary and Continuous Non-Normal Components. https://CRAN.R-project.org/package=BinNonNor.
Olsson, U, F Drasgow, and N J Dorans. 1982. “The Polyserial Correlation Coefficient.” Psychometrika 47 (3): 337–47. https://doi.org/10.1007/BF02294164.
Yahav, I, and G Shmueli. 2012. “On Generating Multivariate Poisson Data in Management Science Applications.” Applied Stochastic Models in Business and Industry 28 (1): 91–102. https://doi.org/10.1002/asmb.901.
Yee, Thomas W. 2018. VGAM: Vector Generalized Linear and Additive Models. https://CRAN.R-project.org/package=VGAM.