Monday, February 5, 2024

Random Data Generation and the Integral Probability Transformation in One and Two Dimensions.

1.      Random data generation?

Random data generation is a fundamental aspect of statistical analysis and modeling. At its core, it involves the creation of data points that exhibit random behavior, adhering to specified distributions or patterns. This technique plays a crucial role in various applications, including simulation studies, hypothesis testing, and Monte Carlo methods.

Applications:

  • Monte Carlo simulations: Random data generation is essential for conducting Monte Carlo simulations, which involve repeatedly sampling from probability distributions to estimate numerical results.
  • Synthetic data generation: In situations where real data is scarce or sensitive, random data generation can be used to create synthetic datasets that mimic the statistical properties of the original data.
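As a minimal sketch of the Monte Carlo idea above (the target integral and variable names are my own choices, not from the post), we can estimate an integral over (0,1) by averaging the integrand over uniform draws:

```r
# Monte Carlo estimate of the integral of g(u) = exp(u) over (0, 1).
# The exact value is e - 1, roughly 1.7183.
set.seed(123)                       # for reproducibility
u <- runif(1e5)                     # 100,000 draws from Uniform(0, 1)
g <- exp(u)                         # evaluate the integrand at each draw
estimate <- mean(g)                 # sample mean approximates the integral
std.err <- sd(g) / sqrt(length(g))  # Monte Carlo standard error
estimate; std.err
```

The standard error shrinks at rate 1/sqrt(n), which is why Monte Carlo estimates improve slowly but steadily with more draws.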

2.      Integral Probability Transformation?

Integral Probability Transformation (IPT) is a powerful method for generating random variables with specified marginal distributions. It works by transforming uniformly distributed random variables into draws from a desired distribution through its cumulative distribution function (CDF). As discussed in Angus (1994), IPT lets one efficiently generate random variables with complex distributions. The theorems are stated as follows.

Theorem (Integral Probability): If $X$ has a continuous CDF $F(\cdot)$, then the random variable $Y = F(X)$ follows the $U(0,1)$ distribution.

Theorem (Quantile Function): Let $F$ be a CDF. If $F^{-1}:(0,1)\to(-\infty,\infty)$ is defined by $F^{-1}(y)=\inf\{x : F(x)\ge y\}$, $0<y<1$, and $U$ follows the $U(0,1)$ distribution, then $X=F^{-1}(U)$ has CDF $F$.

It is also worth noting that since $F(X)$ is uniformly distributed, the survival function $S(\cdot)=1-F(\cdot)$, the complement of the CDF, evaluated at $X$ is also uniformly distributed.
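The Integral Probability theorem can be checked empirically in R (a quick sketch; the distribution and sample size here are my own choices): draw from a known distribution, push the draws through their own CDF, and inspect the result for uniformity.

```r
set.seed(42)
x <- rexp(1000, rate = 0.2)   # X ~ Exponential(rate = 0.2)
y <- pexp(x, rate = 0.2)      # Y = F(X), which should be Uniform(0, 1)
ks.test(y, "punif")           # Kolmogorov-Smirnov test against U(0, 1)
mean(y); var(y)               # should be near 1/2 and 1/12, respectively
```

The sample mean and variance of $Y$ land near the Uniform(0,1) values of $1/2$ and $1/12$, and the KS test finds no evidence against uniformity.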

Applications:

  • Risk assessment: IPT can be applied in risk assessment models to simulate random variables representing uncertain outcomes, such as financial losses or environmental hazards.
  • Reliability analysis: In reliability engineering, IPT is utilized to generate random variables representing component lifetimes or failure rates, enabling the evaluation of system reliability.

Examples:

  1.  Let’s say that we know that the time (in hours) until a bus arrives follows Exponential($\mu=0.2$). Then the CDF is given by,

$$\int_0^x \mu e^{-\mu u}\,du = F(x)$$

$$1 - e^{-\mu x} = F(x)$$

$$1 - F(x) = e^{-\mu x}$$

$$x = -\frac{\ln\bigl(1 - F(x)\bigr)}{\mu}. \qquad (1)$$

We can generate 500 exponentially distributed random values as follows:

  • Generate 500 values from Uniform(0,1).
  • Convert each value to x using (1).

If we plot the generated data against the true density curve for Exponential(μ=0.2), we can see that the generated data show a similar pattern.

F.x <- runif(500, 0, 1)       # 500 draws from Uniform(0, 1)
mu <- 0.2
X <- -log(1 - F.x) / mu       # inverse of F, equation (1)

# True density curve
tru.x <- dexp(seq(0, 40, length = 500), rate = mu)

hist(X, col = "blue", xlab = "X", main = "Density function with Expo(0.2)",
     probability = TRUE)                       # generated data
lines(seq(0, 40, length = 500), tru.x, lwd = 2)  # true curve
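As a quick sanity check (my own addition, not from the original post), the inverse-CDF sample can be compared with R's built-in exponential generator:

```r
set.seed(1)
mu <- 0.2
X <- -log(1 - runif(500)) / mu   # inverse-CDF sample, as in equation (1)
Y <- rexp(500, rate = mu)        # built-in generator with the same rate
ks.test(X, Y)                    # two-sample KS test; a large p-value is expected
c(mean(X), 1 / mu)               # sample mean vs. theoretical mean 1/mu = 5
```

Agreement between the two samples, and a sample mean near $1/\mu = 5$, confirms the transformation is producing the intended distribution.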
  2.  In survival studies, the proportional hazards model plays a vital role in understanding the relationship between covariates in the risk set and failure time. For instance, suppose we observe two subjects, a smoker ($x=1$) and a nonsmoker ($x=0$). The hazard function and survival function under this model are, respectively,

$$h(t\mid x) = h_0(t)\,e^{\beta x},$$

$$S(t\mid x) = e^{-\int_0^t h_0(u)\,e^{\beta x}\,du}. \qquad (2)$$

where $h_0(t)=f(t)/S(t)$ is the baseline hazard rate, the instantaneous failure rate at time $t$ given that the subject survives to time $t$, and $\beta$ represents the association between $x$ and $t$. Note from (2) that if we know how the failure time is distributed, say Exponential($\mu=0.2$), which implies $h_0(t)=\mu$, and $X \sim$ Normal($\mu=5,\sigma=1$), then we can find the time $t$ corresponding to an observed $x$ through,

$$S(t\mid x) = e^{-\int_0^t \mu e^{\beta x}\,du}$$

$$S(t\mid x) = e^{-\mu e^{\beta x}\int_0^t du}$$

$$S(t\mid x) = e^{-\mu e^{\beta x} t}$$

$$\ln S(t\mid x) = -\mu e^{\beta x} t$$

$$t = -\frac{\ln S(t\mid x)}{\mu e^{\beta x}}. \qquad (3)$$

Equation (3) provides a basis for generating bivariate data from the proportional hazards model. The procedure works as follows:

  • Generate x from Normal(μ=5,σ=1).
  • Generate S(t|x) from Uniform(0,1).
  • Use equation (3) to find t.

mu.x <- 5; sigma.x <- 1; mu.t <- 0.2; beta <- 0.5
x <- rnorm(500, mean = mu.x, sd = sigma.x)
St.x <- runif(500, 0, 1)   # Uniform draws for S(t|x)

t <- -log(St.x) / (mu.t * exp(beta * x))   # inverse of S(t|x), equation (3)

# Scatter plot with marginal histograms
library(ggplot2)
p <- ggplot(data.frame(x = x, t = t), aes(x, t)) +
  geom_point(size = 1) +
  theme(text = element_text(size = 16)) +
  labs(title = "Bivariate data of X and T")
ggExtra::ggMarginal(p, type = "histogram")
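One way to validate the simulated pairs $(x, t)$, in the spirit of Bender et al. (2005) though this particular check is my own addition, is to fit a Cox proportional hazards model to the generated data and confirm that the estimated coefficient is close to the true $\beta = 0.5$. This assumes the `survival` package is installed:

```r
library(survival)

set.seed(2024)
mu.x <- 5; sigma.x <- 1; mu.t <- 0.2; beta <- 0.5
x <- rnorm(2000, mean = mu.x, sd = sigma.x)
St.x <- runif(2000)
t <- -log(St.x) / (mu.t * exp(beta * x))   # inverse survival function, as in (3)

# All simulated subjects are observed failures (no censoring), so status = 1.
fit <- coxph(Surv(t, rep(1, length(t))) ~ x)
coef(fit)   # should be close to the true beta = 0.5
```

Recovering the true coefficient from the fitted model is strong evidence that the inversion in (3) generates data consistent with the proportional hazards structure.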

3.      Conclusion.

In conclusion, random data generation via the integral probability transformation is an invaluable tool in the arsenal of statisticians and data scientists. From simulating complex systems to modeling multivariate dependencies, these methods offer practical solutions to diverse challenges in statistical analysis and decision-making. By understanding and harnessing these techniques, researchers and practitioners can unlock new insights and drive advancements across various fields.

References:

  • Angus, J. E. (1994). The Probability Integral Transform and Related Results. SIAM Review, 36(4), 652–654. https://doi.org/10.1137/1036146
  • Bender, R., Augustin, T., & Blettner, M. (2005). Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine, 24(11), 1713–1723. https://doi.org/10.1002/sim.2059
