--- title: "Comparison of Correlation Methods 1 and 2" author: "Allison C Fialkowski" date: "`r Sys.Date()`" output: bookdown::html_document2 bibliography: Bibliography.bib vignette: > %\VignetteIndexEntry{Comparison of Correlation Methods 1 and 2} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include=FALSE} library("bookdown") ``` There are two simulation pathways which differ primarily according to the calculation of the intermediate correlations involving count variables. The simulation function `corrvar` for correlation method 1 calls the function `intercorr`. The simulation function `corrvar2` for correlation method 2 calls the function `intercorr2`. Both of these call additional auxiliary functions as needed. The *ordering of the variables* in the target correlation matrix `rho` must be 1st ordinal, 2nd continuous non-mixture, 3rd components of continuous mixture variables, 4th regular Poisson, 5th zero-inflated Poisson, 6th regular NB, and 7th zero-inflated NB. Note that the target correlations are specified in terms of the correlations with components of continuous mixture variables. This allows the user to set the correlation between components of the same mixture variable to any desired value. If this correlation is set to $0$, the intermediate correlation matrix `Sigma` may need to be converted to the nearest positive-definite matrix. This is done within the simulation functions by specifying `use.nearPD = TRUE`. @Higham's algorithm is executed with the `Matrix::nearPD` function [@Matrix]. Otherwise, negative eigenvalues are replaced with $0$. Some code has been modified from the **SimMultiCorrData** package [@SMCD]. # Methods Used in Both Pathways: {-} First, the intermediate correlation calculations which are equivalent in the two pathways will be discussed by variable type. ## Ordinal Variables: {-} If both variables are **binary**, the method of @Dem_Power is used to find the *tetrachoric correlation* (code adapted from @BinNonNor's `BinNonNor::Tetra.Corr.BB`). The tetrachoric correlation is an estimate of the binary correlation measured on a continuous scale. The assumptions are that the binary variables arise from latent normal variables, and the actual trait is continuous and not discrete. This method is based on @EmPied's work, in which the joint binary distribution is determined from the third and higher moments of a multivariate normal distribution: Let $Y_{1}$ and $Y_{2}$ be binary variables with $E[Y_{1}] = Pr(Y_{1} = 1) = p_{1}$, $E[Y_{2}] = Pr(Y_{2} = 1) = p_{2}$, and correlation $\rho_{y1y2}$. Note here that $p_1 = 1 -$ `marginal[[1]][1]` and $p_2 = 1 -$ `marginal[[2]][1]` so that the user-supplied probabilities are associated with the lower support value. Then solving the equation \begin{equation} \Phi[z(p_{1}), z(p_{2}), \rho_{x1x2}] = \rho_{y1y2}\sqrt{p_{1}(1 - p_{1})p_{2}(1 - p_{2})} + p_{1}p_{2} (\#eq:System2) \end{equation} for $\rho_{x1x2}$ gives the intermediate correlation of the standard normal variables needed to generate binary variables with correlation $\rho_{y1y2}$. Here, $\Phi$ is the standard bivariate normal CDF and $z(p)$ indicates the $p^{th}$ quantile of the standard normal distribution. To generate the binary variables from the standard normal variables, set $Y_{1} = 1$ if $Z_{1} \le z(p_{1})$ and $Y_{1} = 0$ otherwise. Similarly, set $Y_{2} = 1$ if $Z_{2} \le z(p_{2})$ and $Y_{2} = 0$ otherwise. For **binary-ordinal** or **ordinal** pairs, `ord_norm` is called. The algorithm to simulate `k_cat` ordinal random variables is as follows: 1) If a `support` is not provided, create the default of $1, ..., r$ for an ordinal variable with $r$ categories. 2) Use the `norm_ord` function to calculate the initial correlation of the ordinal variables (`rhoordold`) formed by discretizing `k_cat` random normal variates with correlation matrix set equal to `rho0`, using the cumulative probabilities supplied in `marginal` and the corresponding normal quantiles. 3) Let `rho` be the intermediate normal correlation updated in each iteration, `rhoord` be the ordinal correlation calculated in each iteration (initialized at `rhoordold`), `rhoold` be the intermediate correlation from the previous iteration, `it` be the iteration number, `maxit` be the user-specified maximum iteration number, and `epsilon` be the user-specified maximum pairwise correlation error. For each variable pair, execute the following: a. If `rho0` $= 0$, set `rho` $= 0$. b. While the absolute error between `rhoord` and `rho0` is greater than `epsilon` and `it` is less than `maxit`: i) If `rho0 * (rho0/rhoord)` $\leq -1$, then: `rho = rhoold * (1 + 0.1 * (1 - rhoold) * -sign(rho0 - rhoord))`. ii) If `rho0 * (rho0/rhoord)` $\geq 1$, then: `rho = rhoold * (1 + 0.1 * (1 - rhoold) * sign(rho0 - rhoord))`. iii) Otherwise, `rho = rhoold * (rho0/rhoord)`. iv) If `rho` $> 1$, set `rho` $= 1$. If `rho` $< -1$, set `rho` $= -1$. v) Calculate `rhoord` using `norm_ord` and the 2$\times$2 correlation matrix formed by `rho`. vi) Set `rhoold = rho` and increase `it` by 1. c. Store the number of iterations in the matrix `niter`. 4) Return the final intermediate correlation matrix `SigmaC = rho` for the random normal variables. Discretize these to produce ordinal variables with the desired correlation matrix. ## Continuous Variables: {-} Correlations are computed pairwise. The function `intercorr_cont` uses the equations derived by @HeadSaw1 for the third-order and @Head2002 for the fifth-order power method transformation (PMT). For two continuous variables $Y_i$ and $Y_j$ generated using @Head2002's fifth-order PMT, the intermediate normal correlation $\rho_{{Z}_{i}{Z}_{j}}$ required to obtain the target correlation $\rho_{{Y}_{i}{Y}_{j}}$ is the solution to the following: \begin{equation} \begin{split} {\rho}_{{Y}_{i}{Y}_{j}} &= 3{c}_{{4}_{i}}{c}_{{0}_{j}}+3{c}_{{4}_{i}}{c}_{{2}_{j}}+9{c}_{{4}_{i}}{c}_{{4}_{j}}+{c}_{{0}_{i}}({c}_{{0}_{j}}+{c}_{{2}_{j}}+3{c}_{{4}_{j}})+{c}_{{1}_{i}}{c}_{{1}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+3{c}_{{3}_{i}}{c}_{{1}_{j}}{\rho}_{{Z}_{i}{Z}_{j}} \\ & +15{c}_{{5}_{i}}{c}_{{1}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+3{c}_{{1}_{i}}{c}_{{1}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+9{c}_{{3}_{i}}{c}_{{3}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+45{c}_{{5}_{i}}{c}_{{3}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+15{c}_{{1}_{i}}{c}_{{5}_{j}}{\rho}_{{Z}_{i}{Z}_{j}} \\ & +45{c}_{{3}_{i}}{c}_{{5}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+225{c}_{{5}_{i}}{c}_{{5}_{j}}{\rho}_{{Z}_{i}{Z}_{j}}+12{c}_{{4}_{i}}{c}_{{2}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{2}+72{c}_{{4}_{i}}{c}_{{4}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{2}+6{c}_{{3}_{i}}{c}_{{3}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{3} \\ & +60{c}_{{5}_{i}}{c}_{{3}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{3}+60{c}_{{3}_{i}}{c}_{{5}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{3}+600{c}_{{5}_{i}}{c}_{{5}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{3}+24{c}_{{4}_{i}}{c}_{{4}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{4}+120{c}_{{5}_{i}}{c}_{{5}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{5} \\ & +{c}_{{2}_{i}}({c}_{{0}_{j}}+{c}_{{2}_{j}}+3{c}_{{4}_{j}}+2{c}_{{2}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{2}+12{c}_{{4}_{j}}{{\rho}_{{Z}_{i}{Z}_{j}}}^{2}). \end{split} (\#eq:System4) \end{equation} For two continuous variables $Y_i$ and $Y_j$ generated using @Fleish's third-order PMT, the intermediate normal correlation $\rho_{{Z}_{i}{Z}_{j}}$ required to obtain the target correlation $\rho_{{Y}_{i}{Y}_{j}}$ is the solution to the following: \begin{equation} {\rho}_{{Y}_{i}{Y}_{j}} = \rho_{{Z}_{i}{Z}_{j}}(c_{1_{i}}c_{1_{j}}+3c_{1_{j}}c_{3_{i}}+3c_{1_{i}}c_{3_{j}}+9c_{3_{i}}c_{3_{j}}+2c_{0_{i}}c_{0_{j}}\rho_{{Z}_{i}{Z}_{j}}+6c_{3_{i}}c_{3_{j}}\rho_{{Z}_{i}{Z}_{j}}^2). (\#eq:System5) \end{equation} ## Continuous-Ordinal Pairs: {-} The function `SimMultiCorrData::findintercorr_cont_cat` is called to calculate the intermediate normal correlations. The intermediate correlation between $Z_1$ and $Z_2$ (where $Z_1$ is the standard normal variable transformed using @Head2002's fifth-order or @Fleish's third-order PMT to produce a continuous variable $Y_1$, and $Z_2$ is the standard normal variable discretized to produce an ordinal variable $Y_2$) is calculated by dividing the target correlation by a correction factor. The correction factor is the product of the point-polyserial correlation between $Y_2$ and $Z_2$ [@PolyCorr]: \begin{equation} {{\rho}}_{{Y}_{2}{Z}_{2}} = \frac{{\rho}_{{Z}_{2}{Z}_{2}}}{{\sigma}_{{Y}_{2}}} \sum_{j=1}^{r-1}\phi({\tau}_{j})({y}_{{2}_{j+1}}-{y}_{{2}_{j}}) = \frac{1}{{\sigma}_{{Y}_{2}}} \sum_{j=1}^{r-1}\phi({\tau}_{j})({y}_{{2}_{j+1}}-{y}_{{2}_{j}}), (\#eq:System6) \end{equation} and the power method correlation between $Y_1$ and $Z_1$ [@HeadKow]: \begin{equation} {c}_{1}+3{c}_{3}+15{c}_{5}. (\#eq:System7) \end{equation} The constant $c_5 = 0$ for the third-order PMT. Then the intermediate normal correlation $\rho_{{Z}_{1}{Z}_{2}}$ required to obtain the target correlation $\rho_{{Y}_{1}{Y}_{2}}$ is given by: \begin{equation} \begin{split} {\rho}_{{Z}_{1}{Z}_{2}} &= \frac{{\rho}_{{Y}_{1}{Y}_{2}}{\sigma}_{{Y}_{2}}}{({c}_{1}+3{c}_{3}+15{c}_{5})\ \sum_{j=1}^{r-1}\phi({\tau}_{j})({y}_{{2}_{j+1}}-{y}_{{2}_{j}})} \\ &= \frac{{\rho}_{{Y}_{1}{Y}_{2}}{\sigma}_{{Y}_{2}}}{({c}_{1}+3{c}_{3}+15{c}_{5})\ \sum_{j=1}^{r-1}\phi(\Phi^{-1}(\sum_{i=1}^{j}{p}_{j}))({y}_{{2}_{j+1}}-{y}_{{2}_{j}})}. \end{split} (\#eq:System8) \end{equation} Here, $\phi$ is the standard normal PDF, ${\sigma}_{{Y}_{2}}$ is the standard deviation of the ordinal variable, and ${\mu}_{{Y}_{2}}$ is its expected value: \begin{equation} {\sigma}_{{Y}_{2}} = \sqrt{\sum_{j=1}^{r}{y}_{{2}_{j}}^2{p}_{j} - {\mu}_{{Y}_{2}}^2},\ \ \ {\mu}_{{Y}_{2}} = \sum_{j=1}^{r}{y}_{{2}_{j}}{p}_{j}. (\#eq:System9) \end{equation} Now the two methods will be contrasted. # Overview of Correlation Method 1: {-} The intermediate correlations used in correlation method 1 are more simulation based than those in correlation method 2, which means that accuracy increases with sample size and the number of repetitions. Specifying the seed allows for reproducibility. In addition, method 1 differs from method 2 in the following ways: 1. The **count variable** correlations extend the method of @YahShm. The intermediate correlation between $Z_1$ and $Z_2$ (the standard normal variables used to generate the count variables $Y_1$ and $Y_2$ via the inverse CDF method) is calculated using a logarithmic transformation of the target correlation. First, the upper and lower Frechet-Hoeffding bounds (mincor, maxcor) on ${\rho}_{{Y}_{1}{Y}_{2}}$ are simulated [@Frech;@Hoeff]. See the [Calculation of Correlation Boundaries](corr_bounds.html) vignette for details on the Frechet-Hoeffding correlation boundaries. The intermediate correlation ${\rho}_{{Z}_{1}{Z}_{2}}$ is found as follows: \begin{equation} {\rho}_{{Z}_{1}{Z}_{2}} = \frac{1}{b} * log \Bigg(\frac{{\rho}_{{Y}_{1}{Y}_{2}} - c}{a} \Bigg), (\#eq:System10) \end{equation} where \[a = -\frac{maxcor * mincor}{maxcor + mincor},\ \ \ b = log \Bigg(\frac{maxcor + a}{a} \Bigg),\ \ \ c = -a.\] a) *Poisson variables:* `intercorr_pois` is called to calculate the intermediate correlations for all variables. b) *Negative Binomial variables:* `intercorr_nb` is called to calculate the intermediate correlations for all variables. c) *Poisson-Negative Binomial variable pairs:* `intercorr_pois_nb` is called to calculate the intermediate correlations for all variables. This method becomes less accurate as the variable mean gets closer to zero. The distribution functions are taken from the **VGAM** package [@VGAM]. 1. The **ordinal - count variable** correlations are based on an extension of the method of @AmaDem, in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and a simulated upper bound on the correlation between an ordinal variable and the normal variable used to generate it [@DemHed]. The intermediate correlations are the ratio of the target correlations to the correction factors. a) *Poisson variables:* `intercorr_cat_pois` is called to calculate the intermediate correlations for all variables. b) *Negative Binomial variables:* `intercorr_cat_nb` is called to calculate the intermediate correlations for all variables. 1. The **continuous - count variable** correlations are based on an extension of the methods of @AmaDem and @Dem_Power, in which the correlation correction factor is the product of the upper Frechet-Hoeffding bound on the correlation between the count variable and the normal variable used to generate it and the power method correlation between the continuous variable and the normal variable used to generate it [@HeadKow]. The intermediate correlations are the ratio of the target correlations to the correction factors. a) *Poisson variables:* `intercorr_cont_pois` is called to calculate the intermediate correlations for all variables. b) *Negative Binomial variables:* `intercorr_cont_nb` is called to calculate the intermediate correlations for all variables. ## Simulation Process: {-} The simulation functions do not perform checks on the distribution parameters or target correlation matrix `rho`. This should be done first using `validpar` to ensure they are of the correct dimension, format, and/or sign. The function `validcorr` should also be used to check if `rho` is within the feasible bounds and determine the lower and upper correlation limits. Summaries of simulated variables can be obtained using `summary_var`. The algorithm used in the simulation function `corrvar` that employs correlation method 1 is as follows: 1. If continuous variables are desired, the standardized cumulants are checked to see if there are any repeated distributions (i.e., if the user wishes to simulate two \raisebox{2pt}{${\chi_4^{2}}$} variables). These are noted so that the constants are calculated only once. 1. The constants are calculated for the continuous non-mixture variables and components of continuous mixture variables using `SimMultiCorrData::find_constants`. If no solutions are found that generate valid power method PDF's, the function will return constants that produce invalid PDF's (or a stop error if no solutions can be found). Errors regarding constant calculation are the most probable cause of function failure. Possible solutions include changing the seed or using a list of sixth cumulant correction values (if `method` = "Polynomial"). 1. The default support is created for the ordinal variables (if no support is provided). 1. The intermediate correlation matrix `Sigma` is calculated using `intercorr`. Note that this will return a matrix that is not positive-definite. If so and `use.nearPD = TRUE`, the algorithm of @Higham is used (see `Matrix::nearPD`) to produce the nearest positive-definite matrix and a message is given. Otherwise, negative eigenvalues are replaced with $0$. 1. `k <- k_cat + k_cont + k_mix + k_pois + k_nb` multivariate normal variables ($X_{nxk}$) with correlation matrix `Sigma` are generated using singular value decomposition on a $MVN_{nxk}(0,\ 1)$ matrix and eigenvalue decomposition on `Sigma`. 1. The variables are generated from $X_{nxk}$ using the appropriate transformations (see [Variable Types](variable_types.html) vignette). 1. The final correlation matrix is calculated, and the maximum error (`maxerr`) from the target correlation matrix is found. 1. If the error loop is specified (`error_loop` = TRUE), it is used on each variable pair to correct the final correlation until it is within `epsilon` of the target correlation or the maximum number of iterations `maxit` has been reached. 1. If continuous mixture variables are desired, these are created from the component variables. # Overview of Correlation Method 2: {-} The intermediate correlations used in correlation method 2 are less simulation based than those in correlation method 1. Their calculations involve greater utilization of correction loops which make iterative adjustments until a maximum error has been reached (if possible). In addition, method 2 differs from method 1 in the following ways: 1. The intermediate correlations involving **count variables** are based on the methods of Barbiero & Ferrari [-@FerrBarb_Ord; -@FerrBarb_Pois]. The Poisson or Negative Binomial support is made finite by removing a small user-specified value (i.e. $0.0001$) from the total cumulative probability. This truncation factor may differ for each count variable (see `maxcount_support`). The count variables are subsequently treated as ordinal and intermediate correlations are calculated using the correction loop of `ord_norm`. 1. The **continuous - count variable** correlations are based on an extension of the method of @Dem_Power, and the count variables are treated as ordinal. The correction factor is the product of the power method correlation between the continuous variable and the normal variable used to generate it [@HeadKow] and the point-polyserial correlation between the ordinalized count variable and the normal variable used to generate it [@PolyCorr]. The intermediate correlations are the ratio of the target correlations to the correction factor. a) *Poisson variables:* `intercorr_cont_pois2` is called to calculate the intermediate correlations for all variables. b) *Negative Binomial variables:* `intercorr_cont_nb2` is called to calculate the intermediate correlations for all variables. ## Simulation Process: {-} The algorithm used in the simulation function `corrvar2` that employs correlation method 2 is similar to that described for `corrvar`, with a few modifications: 1. The feasibility of `rho`, given the distribution parameters, should be checked first using the function `validcorr2`, which checks if `rho` is within the feasible bounds and returns the lower and upper correlation limits. 1. After the support is created for the ordinal variables (if no support is provided), the maximum support for the count variables is determined using `maxcount_support`, given truncation value vector `pois_eps` for Poisson variables and/or `nb_eps` for Negative Binomial variables. The cumulative probability truncation value may differ by variable, but a good value is $0.0001$. The resulting supports and distribution parameters are used to create marginal lists, consisting of the cumulative probabilities for each count variable. 1. The intermediate correlation matrix `Sigma` is calculated using `intercorr2`. # References {-}