The following gives a description
of SimMultiCorrData
’s functions by topic. The user should
visit the appropriate help page for more information.
nonnormvar1
simulates one non-normal continuous
variable using either Fleishman (1978)’s
third-order (method
= “Fleishman”) or Headrick (2002)’s fifth-order
(method
= “Polynomial”) approximation. See
Comparison of Simulated Distribution to Theoretical Distribution
or Empirical Data vignette for an example.
rcorrvar
simulates k_cat
ordinal ($\Large r \ge 2$ categories),
k_cont
continuous, k_pois
Poisson, and/or
k_nb
Negative Binomial variables with a specified
correlation matrix rho
using Correlation Method
1. The variables are generated from multivariate normal
variables with intermediate correlation matrix Sigma
,
calculated by findintercorr
, and then transformed
appropriately. The ordering of the variables in rho
must be
ordinal, continuous, Poisson, and Negative Binomial (note that it is
possible for k_cat
, k_cont
,
k_pois
, and/or k_nb
to be 0).
rcorrvar2
simulates k_cat
ordinal
($\Large r \ge 2$ categories),
k_cont
continuous, k_pois
Poisson, and/or
k_nb
Negative Binomial variables with a specified
correlation matrix rho
using Correlation Method
2. The variables are generated from multivariate normal
variables with intermediate correlation matrix Sigma
,
calculated by findintercorr2
, and then transformed
appropriately. The ordering of the variables in rho
must be
ordinal, continuous, Poisson, and Negative Binomial (note that it is
possible for k_cat
, k_cont
,
k_pois
, and/or k_nb
to be 0).
Please see the Comparison of Correlation Method 1 and Correlation Method 2 vignette for more information about the two different simulation pathways.
find_constants
calculates the constants used to
generate continuous variables via either Fleishman’s third-order (using
fleish
equations) or Headrick’s fifth-order (using
poly
equations) polynomial transformation. It attempts to
find constants that generate a valid power method pdf. When using
Headrick’s method, if no solutions converged or no valid pdf solutions
could be found and a vector of sixth cumulant correction values
(Six
) is provided, the function will attempt to find the
smallest correction value that generates a valid power method pdf. If
not, invalid pdf constants will be given.
fleish
contains Fleishman’s third-order polynomial
transformation equations.
poly
contains Headrick’s fifth-order polynomial
transformation equations.
calc_fisherk
uses Fisher’s k-statistics to calculate
the mean, standard deviation, skewness, standardized kurtosis, and
standardized fifth and sixth cumulants given a vector of data.
calc_moments
uses the method of moments to calculate
the mean, standard deviation, skewness, standardized kurtosis, and
standardized fifth and sixth cumulants given a vector of data.
calc_theory
calculates the mean, standard deviation,
skewness, standardized kurtosis, and standardized fifth and sixth
cumulants given either a distribution name with up to 4 associated
parameters or pdf function fx with lower and upper support bounds. There
are 39 available distributions by name. Please see the appropriate help
pages for information regarding parameter inputs (in the
VGAM
Yee (2018),
triangle
Carnell (2016), or
stats
R Core Team (2018)
packages).
cdf_prob
calculates a cumulative probability using
the theoretical power method cdf $\Large
F_p(Z)(p(z)) = F_p(Z)(p(z), F_Z(z))$ up to $\Large sigma * y + mu = delta$, where $\Large y = p(z)$, after using
pdf_check
to verify that the given constants produce a
valid pdf. If the given constants do not produce a valid power method
pdf, a warning is given.
power_norm_corr
calculates the correlation between a
continuous variable produced using a polynomial transformation and the
generating standard normal variable. If the correlation is <= 0, the
signs of c1 and c3 should be reversed (for method
=
“Fleishman”), or c1, c3, and c5 (for method
=
“Polynomial”). These sign changes have no effect on the cumulants of the
resulting distribution.
pdf_check
determines if a given set of constants
generates a valid power method pdf. This requires yielding a continuous
variable with a positive correlation with the generating standard normal
variable and satisfying certain contraints that vary by approximation
method (see Headrick and Kowalchuk
(2007)).
sim_cdf_prob
calculates the simulated (empirical)
cumulative probability up to a given y-value (delta
). It
uses Martin Maechler’s stats::ecdf
function to find the
empirical cdf $\Large Fn$. $\Large Fn$ is a step function with jumps
$\Large i/n$ at observation values,
where $\Large i$ is the number of tied
observations at that value. Missing values are ignored. For observations
$\Large y = (y1, y2, ..., yn)$, $\Large Fn$ is the fraction of observations
less or equal to $\Large t$, i.e.,
$\Large Fn(t) = \#[y_{i} <=
t]/n$.
stats_pdf
calculates the $\Large 100 * \alpha %$ symmetric trimmed
mean ($\Large 0 < \alpha <
0.50$), median, mode, and maximum height of a valid power method
pdf using the equations given by Headrick & Kowalchuk
(2007).
calc_lower_skurt
determines the lower standardized
kurtosis boundary for a continuous variable generated using the power
method transformation. This boundary depends on skewness (for
Fleishman’s third-order method, see Headrick and
Sawilowsky (2002)) or skewness and standardized fifth and sixth
cumulants (for Headrick’s fifth-order method, see Headrick (2002)).
fleish_Hessian
calculates the Fleishman
transformation Hessian matrix and its determinant, which are used in
finding the lower kurtosis boundary for asymmetric
distributions.
fleish_skurt_check
contains the Fleishman
transformation Lagrangean constraints which are used in finding the
lower kurtosis boundary for asymmetric distributions.
poly_skurt_check
contains the Headrick
transformation Lagrangean constraints which are used in finding the
lower kurtosis boundary.
valid_corr
(correlation method 1) and
valid_corr2
(correlation method 2) determine the feasible
correlation bounds for ordinal, continuous, Poisson, and/or Negative
Binomial variables. If a target correlation matrix rho
is
specified, the functions check each pairwise correlation to see if it
falls within the bounds. The indices of any variable pair with a target
correlation that is outside the bounds are given. If continuous
variables are required, the functions return the calculated constants,
the required sixth cumulant correction (if a Six
vector of
possible values was given), and whether each set of constants generate a
valid power method pdf.
findintercorr
(correlation method 1) and
findintercorr2
(correlation method 2) are the two main
intermediate correlation calculation functions. These functions call the
other functions:
chat_nb
calculates the upper Frechet-Hoeffding
correlation bound for Negative Binomial - Normal variable pairs used to
determine the intermediate correlation for Negative Binomial -
Continuous variable pairs in method 1.
chat_pois
calculates the upper Frechet-Hoeffding
correlation bound for Poisson - Normal variable pairs used to determine
the intermediate correlation for Poisson - Continuous variable pairs in
method 1.
denom_corr_cat
is used in intermediate correlation
calculations involving ordinal variables (or variables treated as
ordinal, as in method 2).
findintercorr_cat_nb
calculates the intermediate
correlation for ordinal - Negative Binomial variables in method
1.
findintercorr_cat_pois
calculates the intermediate
correlation for ordinal - Poisson variables in method 1.
findintercorr_cont
calculates the intermediate
correlation for continuous variables based on either Fleishman’s
third-order or Headrick’s fifth-order approximation.
findintercorr_cont_cat
calculates the intermediate
correlation for continuous - ordinal variables.
findintercorr_cont_nb
and
findintercorr_cont_nb2
calculate the intermediate
correlations for continuous - Negative Binomial variables in method 1 or
2 (respectively).
findintercorr_cont_pois
and
findintercorr_cont_pois2
calculate the intermediate
correlation for continuous - Poisson variables in method 1 or 2
(respectively).
findintercorr_nb
calculates the intermediate
correlation for Negative Binomial variables in method 1.
findintercorr_pois
calculates the intermediate
correlation for Poisson variables in method 1.
findintercorr_pois_nb
calculates the intermediate
correlation for Poisson - Negative Binomial variables in method
1.
intercorr_fleish
contains Fleishman’s third-order
polynomial transformation intercorrelation equations.
intercorr_poly
contains Headrick’s fifth-order
polynomial transformation intercorrelation equations.
max_count_support
calculates the maximum support
value for count variables by extending the method of Barbiero and Ferrari (2015) to include Negative
Binomial variables. It is used in method 2.
ordnorm
calculates the intermediate correlation for
ordinal variables or variables treated as ordinal (as in method 2). It
is based off of GenOrd::ordcont
with some important
corrections.
var_cat
is used in intermediate correlation
calculations involving ordinal variables (or variables treated as
ordinal, as in method 2) to calculate the variance.
error_loop
is the main error_loop function called by
rcorrvar
or rcorrvar2
.
error_vars
is used to generate variable pairs within
the error loop.
The 8 graphing functions either use simulated data as an input or a
set of constants (found by find_constants
or from
simulation). In the first case, the empirical cdf or pdf is found. In
the second case, the theoretical cdf or pdf is found using the equations
from Headrick and Kowalchuk (2007). These
functions (plot_cdf
, plot_pdf_ext
,
plot_pdf_theory
) work only for continuous variable inputs.
The other graphing functions work for continuous or count variable
inputs. The graphs either display data values, pdfs, or cdfs. In the
case of cdfs of continuous variables, the cumulative probability up to a
given y-value (delta) can be calculated and displayed on the graph
(using cdf_prob
for a set of constants or
sim_cdf_prob
for a vector of simulated data). The empirical
cdf can also be graphed for ordinal data. In the case of pdfs or actual
data values, the target distribution can be overlayed on the graph. This
target distribution can either be an empirical data set, or a
distribution specified by name (Dist
plus up to 4
parameters) or by a user-supplied pdf fx
with support
bounds. See plot_sim_pdf_theory
for names of
Dist
inputs. The graphing functions work for invalid or
valid power method pdfs. They are ggplot2
objects so
designated graphing parameters (i.e. line color and type, title) can be
specified by the user and the results can be further modified as
necessary.
plot_cdf
plots the theoretical power method
cumulative distribution function $\Large
F_p(Z)(p(z)) = F_p(Z)(p(z), F_Z(z))$, given a set of constants.
If calc_prob
= TRUE, it will also calculate the cumulative
probability up to a user-specified delta
value, where $\Large sigma * y + mu = delta$ and $\Large y = p(z)$.
plot_sim_cdf
plots the empirical cdf $\Large Fn$ of simulated continuous, ordinal,
or count data (see ggplot2::stat_ecdf
). If
calc_cprob
= TRUE and the variable is continuous, the
cumulative probability up to a user-specified y-value
(delta
) is calculated (see sim_cdf_prob
) and
the region on the plot is filled with a dashed horizontal line drawn at
$\Large Fn(delta)$.
plot_pdf_ext
plots the theoretical probability
density function $\Large f_p(Z)(p(z)) =
f_p(Z)(p(z), f_Z(z)/p'(z))$, given a set of constants, and the
target pdf calculated from a vector of external data. Unlike in
plot_pdf_theory
, the vector of external data is required.
If the user wants to plot only the theoretical pdf,
plot_pdf_theory
should be used with overlay
=
FALSE.
plot_pdf_theory
plots the theoretical probability
density function $\Large f_p(Z)(p(z)) =
f_p(Z)(p(z), f_Z(z)/p'(z))$, given a set of constants, and the
target pdf (if overlay
= TRUE), given either a continuous
distribution name and parameters or a user-supplied pdf fx
(bounds set equal to bounds of simulated data).
plot_sim_ext
plots simulated continuous or count
data and overlays external data (both as histograms). Unlike in
plot_sim_theory
, the vector of external data is required.
If the user wants to plot only the simulated data,
plot_sim_theory
should be used with overlay
=
FALSE.
plot_sim_pdf_ext
plots the pdf of simulated
continuous or count data and overlays the target pdf computed from a
vector of external data. Unlike in plot_sim_pdf_theory
, the
vector of external data is required. If the user wants to plot only the
pdf of simulated data, plot_sim_pdf_theory
should be used
with overlay
= FALSE.
plot_sim_pdf_theory
plots the pdf of simulated
continuous or count data and overlays the target pdf (if
overlay
= TRUE) specified by distribution name and
parameters or by pdf fx (bounds set equal to bounds of simulated
data).
plot_sim_theory
plots simulated continuous or count
data and overlays data (if overlay
= TRUE) randomly
generated from a target distribution specified by name and parameters or
by pdf fx
(bounds set equal to bounds of simulated data).
Both distributions are plotted as histograms. If the target distribution
is specified by a function fx
, it must be
continuous.
These would not ordinarily be called by the user.
calc_final_corr
calculates the final correlation
matrix.
separate_rho
separates a target correlation matrix
by variable type.