Mixture distributions describe random variables that are drawn from more than one component distribution. They provide a useful way to describe heterogeneity in a population, especially when an outcome is a composite response from multiple sources. For a random variable $Y$ from a finite mixture distribution with $k$ components, the PDF can be described by:

$$h_Y(y) = \sum_{i=1}^{k} \pi_i f_{Y_i}(y),$$

where $\sum_{i=1}^{k} \pi_{i} = 1$. The $\pi_i$ are mixing parameters which determine the weight of each component distribution $f_{Y_i}(y)$ in the overall probability distribution. As long as each component has a valid PDF, the overall distribution $h_Y(y)$ has a valid PDF. The main assumption is statistical independence between the process of randomly selecting the component distribution and the distributions themselves. Assume there is a random selection process that first generates the numbers $1, ..., k$ with probabilities $\pi_1, ..., \pi_k$. After selecting number $i$, where $1 \leq i \leq k$, a value $y$ is drawn from component distribution $f_{Y_i}(y)$ (Davenport, Bezdek, and Hathaway 1988; Schork, Allison, and Thiel 1996; Everitt 1996; Pearson 2011).
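This two-step selection process can be sketched directly in base R. The following minimal example (not package code) mirrors the first mixture variable used later in this vignette: two normal components with means -2 and 2, standard deviations 1, and mixing parameters 0.4 and 0.6.

```r
# Draw from a two-component normal mixture by first selecting a component
# index with probabilities pis, then sampling from the selected component.
set.seed(1)
n <- 10000
pis <- c(0.4, 0.6)        # mixing parameters (sum to 1)
mus <- c(-2, 2)           # component means
sigmas <- c(1, 1)         # component standard deviations
comp <- sample(seq_along(pis), n, replace = TRUE, prob = pis)
y <- rnorm(n, mean = mus[comp], sd = sigmas[comp])
```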
Continuous mixture variables are generated at the component level in SimCorrMix. The target correlation matrix `rho` in the simulation functions `corrvar` and `corrvar2` is specified in terms of the correlations with components of continuous mixture variables. This allows the user to set the correlation between components of the same mixture variable to any desired value. If this correlation is set to 0, the intermediate correlation matrix `Sigma` may need to be converted to the nearest positive-definite matrix. This is done within the simulation functions by specifying `use.nearPD = TRUE`, which executes the algorithm of Higham (2002) via the `Matrix::nearPD` function (Bates and Maechler 2018). Otherwise, negative eigenvalues are replaced with 0.
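As an illustration of this adjustment step, the following hedged sketch (the matrix below is made up for demonstration and is not produced by the package) shows how `Matrix::nearPD` replaces a non-positive-definite target with the nearest positive-definite correlation matrix.

```r
library(Matrix)
# A symmetric matrix with unit diagonal that is not positive definite
Sigma <- matrix(c(1.0, 0.9, 0.3,
                  0.9, 1.0, 0.9,
                  0.3, 0.9, 1.0), 3, 3)
min(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values)  # negative eigenvalue
# Higham (2002) algorithm via Matrix::nearPD; keep the unit diagonal
Sigma_PD <- as.matrix(nearPD(Sigma, corr = TRUE, keepDiag = TRUE)$mat)
```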
The components of the continuous mixture variables are created using either Fleishman's (1978) third-order or Headrick's (2002) fifth-order power method transformation (PMT) applied to standard normal variables. The PMT simulates continuous variables by matching standardized cumulants derived from central moments. Since some distributions may have large central moments, using standardized cumulants decreases the complexity involved in the calculations. In view of this, let $Y$ be a real-valued random variable with cumulative distribution function $F$. Define the central moments, $\mu_r$, of $Y$ as:

$$\mu_r = E\big[(Y - E[Y])^r\big] = \int_{-\infty}^{+\infty} (y - E[Y])^r \, dF(y).$$

Then the first six cumulants are given by (Kendall and Stuart 1977):

$$\kappa_1 = E[Y], \quad \kappa_2 = \mu_2, \quad \kappa_3 = \mu_3, \quad \kappa_4 = \mu_4 - 3\mu_2^2,$$
$$\kappa_5 = \mu_5 - 10\mu_2\mu_3, \quad \kappa_6 = \mu_6 - 15\mu_2\mu_4 - 10\mu_3^2 + 30\mu_2^3.$$

The cumulants are standardized by dividing $\kappa_1$ - $\kappa_6$ by $\sqrt{{{\kappa}_{2}}^{r}} = ({\sigma}^{2})^{r/2} = {\sigma}^{r}$, where $\sigma^2$ is the variance of $Y$ and $r$ is the order of the cumulant:

$$\gamma_1 = \frac{\kappa_3}{\kappa_2^{3/2}} = \frac{\mu_3}{\sigma^3}, \quad \gamma_2 = \frac{\kappa_4}{\kappa_2^{2}} = \frac{\mu_4}{\sigma^4} - 3, \quad \gamma_3 = \frac{\kappa_5}{\kappa_2^{5/2}} = \frac{\mu_5}{\sigma^5} - 10\gamma_1, \quad \gamma_4 = \frac{\kappa_6}{\kappa_2^{3}} = \frac{\mu_6}{\sigma^6} - 15\gamma_2 - 10\gamma_1^2 - 15.$$

The values $\gamma_1$ and $\gamma_2$ correspond to skewness and standardized kurtosis (so that the normal distribution has a value of 0, hereafter referred to as skurtosis). The corresponding sample values for the above can be obtained by replacing $\mu_r$ by ${m}_{r} = \sum_{j=1}^{n}{({x}_{j}-{m}_{1})}^{r}/n$ (Headrick 2002).
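The sample versions can be computed directly from the central moment estimates $m_r$. The short base-R sketch below (not package code; the gamma example is arbitrary) computes the sample skew, skurtosis, and standardized fifth and sixth cumulants using the standardization above.

```r
set.seed(1)
x <- rgamma(100000, shape = 4, rate = 2)
m <- sapply(2:6, function(r) mean((x - mean(x))^r))  # central moments m_2, ..., m_6
s <- sqrt(m[1])                                       # m[1] is m_2, the variance
g1 <- m[2] / s^3                                      # skew
g2 <- m[3] / s^4 - 3                                  # skurtosis (excess kurtosis)
g3 <- m[4] / s^5 - 10 * g1                            # standardized fifth cumulant
g4 <- m[5] / s^6 - 15 * g2 - 10 * g1^2 - 15           # standardized sixth cumulant
round(c(skew = g1, skurtosis = g2, fifth = g3, sixth = g4), 3)
```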
The standardized cumulants for a continuous mixture variable can be derived in terms of the standardized cumulants of its component distributions. Suppose the goal is to simulate a continuous mixture variable $Y$ with PDF $h_Y(y)$ that contains two component distributions $Y_a$ and $Y_b$ with mixing parameters $\pi_a$ and $\pi_b$:

$$h_Y(y) = \pi_a f_{Y_a}(y) + \pi_b g_{Y_b}(y), \quad \pi_a + \pi_b = 1.$$

Here,

$$Y_a = \sigma_a Z_a' + \mu_a, \quad Y_b = \sigma_b Z_b' + \mu_b,$$

so that $Y_a$ and $Y_b$ have expected values $\mu_a$ and $\mu_b$ and variances $\sigma_a^2$ and $\sigma_b^2$. Assume the variables $Z_a'$ and $Z_b'$ are generated with zero mean and unit variance using Headrick's fifth-order PMT given the specified values for skew ($\gamma_{1a}'$, $\gamma_{1b}'$), skurtosis ($\gamma_{2a}'$, $\gamma_{2b}'$), and standardized fifth ($\gamma_{3a}'$, $\gamma_{3b}'$) and sixth ($\gamma_{4a}'$, $\gamma_{4b}'$) cumulants:

$$Z_a' = c_{0a} + c_{1a} Z_a + c_{2a} Z_a^2 + c_{3a} Z_a^3 + c_{4a} Z_a^4 + c_{5a} Z_a^5, \quad Z_a \sim N(0, 1),$$
$$Z_b' = c_{0b} + c_{1b} Z_b + c_{2b} Z_b^2 + c_{3b} Z_b^3 + c_{4b} Z_b^4 + c_{5b} Z_b^5, \quad Z_b \sim N(0, 1).$$

The constants $c_{0a}, ..., c_{5a}$ and $c_{0b}, ..., c_{5b}$ are the solutions to the system of equations given in `SimMultiCorrData::poly` and calculated by `SimMultiCorrData::find_constants`. Similar results hold for Fleishman's third-order PMT, where the constants $c_{0a}, ..., c_{3a}$ and $c_{0b}, ..., c_{3b}$ are the solutions to the system of equations given in `SimMultiCorrData::fleish` and $c_{4a} = c_{5a} = c_{4b} = c_{5b} = 0$ (Fialkowski 2018).
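As a hedged illustration, the constants for a single component could be obtained as follows; the argument names and the `$constants` element are assumptions based on my reading of the SimMultiCorrData documentation, so treat the exact call as a sketch rather than the package's definitive interface.

```r
library("SimMultiCorrData")
# calc_theory returns the mean, sd, skew, skurtosis, and standardized
# fifth and sixth cumulants of the named distribution.
chisq4 <- calc_theory("Chisq", 4)
con <- find_constants(method = "Polynomial", skews = chisq4[3], skurts = chisq4[4],
                      fifths = chisq4[5], sixths = chisq4[6])
con$constants  # c0, ..., c5 used in the fifth-order PMT
```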
The $r$th expected value of $Y$ can be expressed as:

$$E_h[Y^r] = \int y^r h_Y(y) \, dy = \pi_a \int y^r f_{Y_a}(y) \, dy + \pi_b \int y^r g_{Y_b}(y) \, dy = \pi_a E_f[Y_a^r] + \pi_b E_g[Y_b^r].$$

This expression can be used to derive expressions for the mean, variance, skew, skurtosis, and standardized fifth and sixth cumulants of $Y$ in terms of the $r$th expected values of $Y_a$ and $Y_b$. The mean is given by:

$$E_h[Y] = \pi_a E_f[Y_a] + \pi_b E_g[Y_b] = \pi_a E_f[\sigma_a Z_a' + \mu_a] + \pi_b E_g[\sigma_b Z_b' + \mu_b].$$

Since $E_f[Z_a'] = E_g[Z_b'] = 0$, this becomes $E_h[Y] = \pi_a \mu_a + \pi_b \mu_b$.
Applying the variance relation to $Z_a'$ and $Z_b'$ gives:

$$E_f[{Z_a'}^2] = \text{Var}_f[Z_a'] + (E_f[Z_a'])^2, \quad E_g[{Z_b'}^2] = \text{Var}_g[Z_b'] + (E_g[Z_b'])^2.$$

Since $E_f[Z_a'] = E_g[Z_b'] = 0$ and $\text{Var}_f[Z_a'] = \text{Var}_g[Z_b'] = 1$, $E_f[{Z_a'}^2]$ and $E_g[{Z_b'}^2]$ both equal 1. Therefore, $E_h[Y^2]$ simplifies to:

$$E_h[Y^2] = \pi_a (\sigma_a^2 + \mu_a^2) + \pi_b (\sigma_b^2 + \mu_b^2),$$

and the variance of $Y$ is given by:

$$\text{Var}_h[Y] = E_h[Y^2] - (E_h[Y])^2 = \pi_a (\sigma_a^2 + \mu_a^2) + \pi_b (\sigma_b^2 + \mu_b^2) - (\pi_a \mu_a + \pi_b \mu_b)^2.$$
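A quick base-R check of these mean and variance expressions, a minimal sketch (not package code) using a two-component normal mixture with means -2 and 2, standard deviations 1, and weights 0.4 and 0.6:

```r
pi_a <- 0.4; pi_b <- 0.6
mu_a <- -2;  mu_b <- 2
sigma_a <- 1; sigma_b <- 1
mean_Y <- pi_a * mu_a + pi_b * mu_b
var_Y  <- pi_a * (sigma_a^2 + mu_a^2) + pi_b * (sigma_b^2 + mu_b^2) - mean_Y^2
set.seed(1)
comp <- sample(1:2, 100000, replace = TRUE, prob = c(pi_a, pi_b))
y <- rnorm(100000, mean = c(mu_a, mu_b)[comp], sd = c(sigma_a, sigma_b)[comp])
round(c(mean_theory = mean_Y, mean_sim = mean(y),
        var_theory = var_Y, var_sim = var(y)), 3)
```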
Then $E_f[{Z_a'}^3] = \mu_{3a}'$ and $E_g[{Z_b'}^3] = \mu_{3b}'$ are given by:

Combining these with $E_f[Z_a'] = E_g[Z_b'] = 0$ and $E_f[{Z_a'}^2] = E_g[{Z_b'}^2] = 1$, $E_h[Y^3]$ simplifies to:

Therefore, the skew of $Y$ is:

Then $E_f[{Z_a'}^4] = \mu_{4a}'$ and $E_g[{Z_b'}^4] = \mu_{4b}'$ are given by:

Since $E_f[Z_a'] = E_g[Z_b'] = 0$ and $E_f[{Z_a'}^2] = E_g[{Z_b'}^2] = 1$, $E_h[Y^4]$ simplifies to:

Therefore, the skurtosis of $Y$ is:

Then $E_f[{Z_a'}^5] = \mu_{5a}'$ and $E_g[{Z_b'}^5] = \mu_{5b}'$ are given by:

Since $E_f[Z_a'] = E_g[Z_b'] = 0$ and $E_f[{Z_a'}^2] = E_g[{Z_b'}^2] = 1$, $E_h[Y^5]$ simplifies to:

Therefore, the standardized fifth cumulant of $Y$ is:

Then $E_f[{Z_a'}^6] = \mu_{6a}'$ and $E_g[{Z_b'}^6] = \mu_{6b}'$ are given by:

Since $E_f[Z_a'] = E_g[Z_b'] = 0$ and $E_f[{Z_a'}^2] = E_g[{Z_b'}^2] = 1$, $E_h[Y^6]$ simplifies to:

Therefore, the standardized sixth cumulant of $Y$ is:
If the desired mixture distribution $Y$ contains more than two component distributions, the expected values of $Y$ are again expressed as sums of the expected values of the component distributions, with weights equal to the associated mixing parameters. For example, assume $Y$ contains $k$ component distributions $Y_1, ..., Y_k$ with mixing parameters given by $\pi_1, ..., \pi_k$, where $\sum_{i=1}^{k} \pi_{i} = 1$. The component distributions are described by the following parameters: means $\mu_1, ..., \mu_k$, variances $\sigma_1^2, ..., \sigma_k^2$, skews $\gamma_{11}', ..., \gamma_{1k}'$, skurtoses $\gamma_{21}', ..., \gamma_{2k}'$, fifth cumulants $\gamma_{31}', ..., \gamma_{3k}'$, and sixth cumulants $\gamma_{41}', ..., \gamma_{4k}'$. Then the $r$th expected value of $Y$ can be expressed as:

$$E_h[Y^r] = \int y^r h_Y(y) \, dy = \sum_{i=1}^{k} \pi_i \int y^r f_{Y_i}(y) \, dy = \sum_{i=1}^{k} \pi_i E[Y_i^r].$$
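This weighted-sum relation can be checked numerically. The following minimal base-R sketch (not package code; the three normal components are arbitrary choices for the demonstration) compares the weighted sum of the component third moments with a simulated third moment of the mixture.

```r
pis    <- c(0.3, 0.2, 0.5)            # mixing parameters
mus    <- c(-1, 0, 3)                 # component means
sigmas <- c(1, 2, 0.5)                # component standard deviations
m3_comp <- mus^3 + 3 * mus * sigmas^2 # E[Y_i^3] for a N(mu, sigma) component
m3_mix  <- sum(pis * m3_comp)         # weighted sum across components
set.seed(1)
comp <- sample(1:3, 1e6, replace = TRUE, prob = pis)
y <- rnorm(1e6, mean = mus[comp], sd = sigmas[comp])
round(c(weighted_sum = m3_mix, simulated = mean(y^3)), 3)
```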
Therefore, a method similar to that above can be used to derive the system of equations defining the mean, variance, skew, skurtosis, and standardized fifth and sixth cumulants of $Y$. These equations are used within the function `calc_mixmoments` to determine these values for a mixture variable. Some code has been modified from the SimMultiCorrData package (Fialkowski 2018).
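For instance, a hedged sketch of `calc_mixmoments` applied to the two-component normal mixture used above; the argument order shown (mixing probabilities, means, standard deviations, skews, skurtoses, fifth and sixth cumulants of the components) is an assumption based on the component-level inputs used elsewhere in this vignette, and the call should return the mean, standard deviation, and standardized cumulants of the mixture described by the equations above.

```r
library("SimCorrMix")
# Mixture of N(-2, 1) and N(2, 1) with weights 0.4 and 0.6;
# normal components have skew, skurtosis, fifth, and sixth cumulants of 0.
calc_mixmoments(c(0.4, 0.6),   # mixing parameters
                c(-2, 2),      # component means
                c(1, 1),       # component standard deviations
                c(0, 0),       # component skews
                c(0, 0),       # component skurtoses
                c(0, 0),       # component fifth cumulants
                c(0, 0))       # component sixth cumulants
```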
Even though the correlations for the continuous mixture variables are set at the component level, we can approximate the resulting correlations for the mixture variables. The example from the *Overall Workflow for Generation of Correlated Data* vignette is used for demonstration.

Assume $M_1$ and $M_2$ are two continuous mixture variables. Let $M_1$ have $k_{M_1}$ components with mixing probabilities $\alpha_1, ..., \alpha_{k_{M_1}}$. The standard deviations of the components are $\sigma_{M_11}, \sigma_{M_12}, ..., \sigma_{M_1k_{M_1}}$. Let $M_2$ have $k_{M_2}$ components with mixing probabilities $\beta_1, ..., \beta_{k_{M_2}}$. The standard deviations of the components are $\sigma_{M_21}, \sigma_{M_22}, ..., \sigma_{M_2k_{M_2}}$.

The correlation between the mixture variables $M_1$ and $M_2$ is given by:
Equation @ref(eq:System24a) requires the expected value of the product of $M_1$ and $M_2$. Since $M_1$ and $M_2$ may contain any desired number of components and these components may have any continuous distribution, there is no general way to determine this expected value. Therefore, it will be approximated by expressing $M_1$ and $M_2$ as sums of their component variables:

where

Using the general correlation equation, for $1 \leq i \leq k_{M_1}$ and $1 \leq j \leq k_{M_2}$:

so that we can rewrite $\text{Cor}(M_1, M_2)$ as:
For this example:

```r
library("SimCorrMix")
L <- calc_theory("Logistic", c(0, 1))
C <- calc_theory("Chisq", 4)
B <- calc_theory("Beta", c(4, 1.5))
mix_pis <- list(c(0.4, 0.6), c(0.3, 0.2, 0.5))
mix_mus <- list(c(-2, 2), c(L[1], C[1], B[1]))
mix_sigmas <- list(c(1, 1), c(L[2], C[2], B[2]))
p_M11M21 <- p_M11M22 <- p_M11M23 <- 0.35
p_M12M21 <- p_M12M22 <- p_M12M23 <- 0.35
p_M1M2 <- matrix(c(p_M11M21, p_M11M22, p_M11M23, p_M12M21, p_M12M22, p_M12M23),
                 2, 3, byrow = TRUE)
rhoM1M2 <- rho_M1M2(mix_pis, mix_mus, mix_sigmas, p_M1M2)
```

The correlation between $M_1$ and $M_2$ is approximated as 0.0877342.
Here $Y$ can be an ordinal, a continuous non-mixture, or a regular or zero-inflated Poisson or Negative Binomial variable. The correlation between the mixture variable $M_1$ and $Y$ is given by:

Equation @ref(eq:System28a) requires the expected value of the product of $M_1$ and $Y$. Since $M_1$ may contain any desired number of components and these components may have any continuous distribution, there is no general way to determine this expected value. Therefore, it will again be approximated by expressing $M_1$ as a sum of its component variables:

where

Using the general correlation equation, for $1 \leq i \leq k_{M_1}$:

so that we can rewrite $\text{Cor}(M_1, Y)$ as:
Similarly,

For this example, $Y$ can be $O_1$, $C_1$, $C_2$, $P_1$, or $NB_1$. Let $Y = C_1$. Then we have:

```r
p_M11C1 <- p_M12C1 <- 0.35
p_M1C1 <- c(p_M11C1, p_M12C1)
rho_M1C1 <- rho_M1Y(mix_pis[[1]], mix_mus[[1]], mix_sigmas[[1]], p_M1C1)
```

The correlation between $M_1$ and $C_1$ is approximated as 0.1590909. Since $O_1$, $C_2$, $P_1$, and $NB_1$ have the same target pairwise correlations with $M_{11}$ and $M_{12}$ as $C_1$, their correlations with $M_1$ are also approximated as 0.1590909.
Similarly,

```r
p_M21C1 <- p_M22C1 <- p_M23C1 <- 0.35
p_M2C1 <- c(p_M21C1, p_M22C1, p_M23C1)
rho_M2C1 <- rho_M1Y(mix_pis[[2]], mix_mus[[2]], mix_sigmas[[2]], p_M2C1)
```

The correlation between $M_2$ and $C_1$ is approximated as 0.1930151. Since $O_1$, $C_2$, $P_1$, and $NB_1$ have the same target pairwise correlations with $M_{21}$, $M_{22}$, and $M_{23}$ as $C_1$, their correlations with $M_2$ are also approximated as 0.1930151.