--- title: "Benefits of SimMultiCorrData and Comparison to Other Packages" author: "Allison C Fialkowski" date: "`r Sys.Date()`" output: rmarkdown::html_vignette bibliography: Bibliography.bib vignette: > %\VignetteIndexEntry{Benefits of SimMultiCorrData and Comparison to Other Packages} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} #knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` `SimMultiCorrData` generates continuous (normal or non-normal), binary, ordinal, and count (Poisson or Negative Binomial) variables with a specified correlation matrix. It can also produce a single continuous variable. This package can be used to simulate data sets that mimic real-world situations (i.e. clinical data sets, plasmodes, as in Vaughan *et al.* [-@Vaughan]). All variables are generated from standard normal variables with an imposed intermediate correlation matrix. Continuous variables are simulated by specifying mean, variance, skewness, standardized kurtosis, and fifth and sixth standardized cumulants using either Fleishman's Third-Order [-@Fleish] or Headrick's Fifth-Order [-@Head2002] Polynomial Transformation. Binary and ordinal variables are simulated using a modification of `GenOrd::ordsample` @GenOrd. Count variables are simulated using the inverse cdf method. There are two simulation pathways which differ primarily according to the calculation of the intermediate correlation matrix `Sigma`. In **Correlation Method 1**, the intercorrelations involving count variables are determined using a simulation based, logarithmic correlation correction (adapting @YahShm's method). In **Correlation Method 2**, the count variables are treated as ordinal (adapting @FerrBarb_Pois's modification of `GenOrd`). There is an optional error loop that corrects the final correlation matrix to be within a user-specified precision value. The package also includes functions to calculate standardized cumulants for theoretical distributions or from real data sets, check if a target correlation matrix is within the possible correlation bounds (given the distributions of the simulated variables), summarize results (numerically or graphically), verify valid power method pdfs, and calculate lower standardized kurtosis bounds. ## Benefits of this package: The main strengths of `SimMultiCorrData` are: 1. The user may generate **correlated continuous (normal or non-normal), ordinal (r >= 2 categories), Poisson and/or Negative Binomial variables** simultaneously, based on either theoretical distributions or empirical data. 1. Two distinct methods for generating non-normal continuous variables: **Fleishman's third-order or Headrick's fifth-order polynomial transformation**. * Simulate distributions whose higher moments are unknown or do not exist using Fleishman's third-order method. * Accurately reproduce non-normal data up to the sixth moment using Headrick's fifth-order method. * Generate data with a wider range of standardized kurtosis values, given skewness and standardized fifth and sixth cumulants, using Headrick's fifth-order method. * Generate more distributions with a valid *pdf* using Headrick's fifth-order method. 1. Two distinct methods for generating count variables (see **Comparison of Correlation Method 1 and Correlation Method 2** vignette). The user may test each to see which yields greater simulation accuracy. 1. Calculation of the **precise lower kurtosis boundary** using the Lagrangean constraint equations, instead of an approximation (see `calc_lower_skurt`). 1. **Valid power method pdf checks** during the calculation of the constants for continuous variables, and optional use of a sixth cumulant correction value to enable the discovery of valid pdf constants. 1. **Computation of feasible correlation bounds** based on data simulation method (see `valid_corr` for correlation method 1 or `valid_corr2` for correlation method 2). 1. **Numerous attempts to reproduce the desired correlation matrix**, including correcting for non-positive-definite intermediate correlation matrices and an optional final error loop (see **Overview of Error Loop** vignette). This error loop enables reproduction of many correlation structures that can not be achieved through other methods. 1. Function arguments (i.e. `seed`, `n`, `maxit`, `epsilon`) that allow the user to have **greater control** over simulation accuracy, speed, and reproducibility. 1. **Detailed simulation results**, including the simulation time (in minutes) and descriptions of the generated variables and the correlation structure. 1. Additional functions to supplement the simulation process: * **Summary functions** that may be used to determine standardized cumulants from theoretical distributions (`calc_theory`) or a vector of data by the method of moments (`calc_moments`) or based on Fisher's k-statistics (`calc_fisherk`). Additional summary functions compute important statistics for the generated continuous variables. * **Plotting functions** for comparing simulated data to either theoretical distributions or empirical data. The plots display either actual data, cdfs, or pdfs, and work for continuous, ordinal, and/or count variables. The cdf plotting functions enable the user to specify a y-value to calculate the cumulative probability. This region on the graph is filled in and the probability is given on the graph. The data values and pdf graphs enable the user to overlay the target distributions. These functions produce `ggplot2` objects so that the user may save them or further adapt the graphs as necessary. ## Comparison to other R packages: There are several other simulation packages. For example, Barbiero & Ferrari's [-@GenOrd] `GenOrd`, Amatya & Demirtas' [-@MultiOrd] `MultiOrd`, Leisch, Kaiser, & Hornik's [-@Orddata] `orddata`, and Demirtas, Nordgren, & Allozi's [-@PoisBinOrdNonNor] `PoisBinOrdNonNor`. The first three generate only binary and ordinal data, while the last generates Poisson, binary, ordinal, and non-normal variables. ### Barbiero & Ferrari's (2015) `GenOrd` `GenOrd` generates discrete random variables (i.e. binary or ordinal) with given correlation matrix and marginal distributions. The method used to determine the intermediate MVN correlation matrix in `GenOrd::ordcont` has been modified in `SimMultiCorrData`'s `ordnorm` function. It works by setting the intermediate correlation equal to the target correlation of the discrete variables. Each intermediate pairwise correlation is updated until the final pairwise correlation is within a user-specified precision value (`epsilon`) of the target correlation or the maximum number of iterations (`maxit`) has been reached. `GenOrd::ordcont` has been modified in the following ways: 1. The initial correlation check has been removed because it is assumed the user has done this before simulation using `SimMultiCorrData::valid_corr` or `valid_corr2`. 1. The final positive-definite check has been removed because this is done on the intermediate correlation matrix `Sigma` for all variable types, and if necessary, `Sigma` is converted to the nearest positive-definite matrix using Higham's (2002) algorithm in `Matrix::nearPD`. 1. The intermediate correlation update function was changed to accommodate more situations. `SimMultiCorrData::ordnorm` uses `GenOrd::contord` to calculate the ordinal correlation obtained from discretizing the normal variables generated from the intermediate correlation matrix `Sigma`. The reason is because the function does not require random generation of the normal variables, which ensures greater reproducibility. `SimMultiCorrData` also improves the way ordinal variables are generated, as compared to `GenOrd::ordsample`: 1. The simulation functions `SimMultiCorrData::rcorrvar` and `rcorrvar2` allow a user-specified seed, maximum number of iterations, and epsilon value. 1. `GenOrd::ordsample` stops if the intermediate correlation matrix `Sigma` is not positive-definite. As described above, `SimMultiCorrData` attempts to correct the problem and a warning is given that it may not be possible to produce the desired correlation matrix. ### Amatya & Demirtas' (2016) `MultiOrd` `MultiOrd` generates multivariate ordinal data with given correlation matrix and marginal distributions via the *binary conversion method* of Demirtas [-@Dem_Ord]. This method computes the binary marginals by collapsing the marginal distributions of the ordinal variables. The intermediate correlation matrix is also computed through an iterative process based on matching the target matrix. Binary data are then converted to ordinal data through a randomization step. This procedure requires the simulation of large samples of binary data in order to maximize accuracy, which requires greater computational time and resources than the methods used in `SimMultiCorrData`. ### Leisch, Kaiser, & Hornik's (2010) `orddata` `orddata` generates binary and ordinal data through 4 available methods: 1. *Mean mapping:* this method involves an analytical algorithm for determining the intermediate MVN correlation. It creates the ordinal variables by thresholding the normal variates. The mean mapping method provides a wider correlation range than the binary conversion method, but the run time can be greater, depending on the number of categories and variables. 1. *Binary conversion:* Kaiser et al.'s binary conversion is similar to the method of Demirtas. However, while Demirtas' method uses simulations to determine the binary correlation required to create the desired ordinal variables, Kaiser et al. developed an algorithm to analytically determine the binary correlation (see @Kais_Ord). The authors note that although the feasible correlation range is smaller than that of Demirtas, the analytical solution decreases computational time. 1. *Monte Carlo simulation:* this method generates ordinal values by shifting the variables until the desired correlation structure is achieved. 1. *A "native" method:* this method involves a simpler analytical algorithm than the mean mapping method and also thresholds normal variates to generate the ordinal values. ### Demirtas, Nordgren, & Allozi's (2017) `PoisBinOrdNonNor` `PoisBinOrdNonNor` is one in an extensive series of simulation packages created by Demirtas with additional authors. Other packages include `OrdNor` [@OrdNor], `BinNonNor` [@BinNonNor], `BinOrdNonNor` [@BinOrdNonNor], `PoisBinOrd` [@PoisBinOrd], `PoisNor` [@PoisNor], and `PoisBinOrdNor` [@PoisBinOrdNor]. `PoisBinOrdNonNor` generates Poisson, binary, ordinal, and non-normal variables. It differs from `SimMultiCorrData` in the following ways: 1. The intermediate correlation matrix must be found using a function separate from the simulation function, while `SimMultiCorrData`'s simulation functions `rcorrvar` and `rcorrvar2` allow the user to either provide an intermediate matrix or the matrix is calculated during the simulation. 1. The intermediate correlations for the Poisson variables are found using the method of Yahav & Shmueli [-@YahShm], as in method 1 of `SimMultiCorrData`. However, `PoisBinOrdNonNor` does not produce Negative Binomial variables. 1. Binary intermediate correlations are calculated from tetrachoric correlations based on the method of @Dem_Power, as in `SimMultiCorrData`. However, those for ordinal variables are found using `ordcont`, which, as previously mentioned, will stop if the intermediate matrix is not positive-definite. 1. Non-normal variables are generated using Fleishman's third-order polynomial transformation, which generally yields simulation accuracy inferior to Headrick's fifth-order transformation and has a smaller range of possible kurtosis values. 1. The authors do not provide any checks for positive correlation of the continuous variable with the generating standard normal variable or for a valid power method pdf, while `SimMultiCorrData` contains the functions `power_norm_corr` and `pdf_check`. The function that solves for the constants (`SimMultiCorrData::find_constants`) executes these checks when finding the constants and attempts to produce valid pdf constants. In the case of Headrick's fifth-order method, the user may specify a sixth cumulant correction value to help find these constants. 1. The lower kurtosis boundary check in `PoisBinOrdNonNor` is a simple approximation: $\Large standardized\ kurtosis \ge skew^2 - 2$. `SimMultiCorrData::calc_lower_skurt` solves the Lagrangean expressions (as described in @Head2002 and @HeadSaw2) that determine the precise lower kurtosis boundary. Examination of the boundaries computed in `PoisBinOrdNonNor` demonstrates that the approximate boundaries are much lower than the actual Fleishman boundaries, indicating that the guideline is not accurate (see `calc_lower_skurt` for examples). 1. `PoisBinOrdNonNor` does not allow the user to specify a seed for random number generation, or an epsilon value and maximum number of iterations to use when determining the intermediate ordinal correlations. These specifications, as found in `SimMultiCorr`'s simulation functions `rcorrvar` and `rcorrvar2`, are essential for reproducibility and controlling accuracy. 1. The results include only the generated variables, while `SimMultiCorr`'s simulation functions produce detailed summaries of the variables, the final correlation matrix, the maximum error between the final and target correlation matrices, and the simulation time. ## References