Monday, February 5, 2024

Random Data Generation and the Integral Probability Transformation in One and Two Dimensions.

1.      Random data generation?

Random data generation is a fundamental aspect of statistical analysis and modeling. At its core, it involves the creation of data points that exhibit random behavior, adhering to specified distributions or patterns. This technique plays a crucial role in various applications, including simulation studies, hypothesis testing, and Monte Carlo methods.

Applications:

  • Monte Carlo simulations: Random data generation is essential for conducting Monte Carlo simulations, which involve repeatedly sampling from probability distributions to estimate numerical results.
  • Synthetic data generation: In situations where real data is scarce or sensitive, random data generation can be used to create synthetic datasets that mimic the statistical properties of the original data.

2.      Integral Probability Transformation?

Integral Probability Transformation (IPT) is a powerful method for generating random variables with specified marginal distributions. It operates by transforming uniformly distributed random variables into desired distributions using cumulative distribution functions (CDFs). As discussed in the work of Angus (1994), by employing the IPT one can efficiently generate random variables with complex distributions. The theorems are stated as follows,

Theorem (Integral Probability): If X has a continuous CDF F(\cdot), then the random variable Y = F(X) follows the U(0,1) distribution.

Theorem (Quantile Function): Let F be a CDF. If F^{-1} : (0,1) \to (-\infty, \infty) is defined by F^{-1}(y) = \inf\{x : F(x) \ge y\}, \; 0 < y < 1, and U follows the U(0,1) distribution, then X = F^{-1}(U) has CDF F.

It is also worth noting that since F(X) is uniformly distributed, the survival function S(\cdot) = 1 - F(\cdot), the complement of the CDF, is also uniformly distributed.
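
To see the first theorem in action, here is a minimal sketch in R (the Exponential distribution with rate 0.2 is chosen arbitrarily for illustration): applying the CDF to draws from a known continuous distribution should yield values that look Uniform(0,1).

set.seed(1)                      # for reproducibility
x <- rexp(1000, rate=0.2)        # draws from an arbitrary continuous distribution
u <- pexp(x, rate=0.2)           # Y = F(X)
hist(u, probability=TRUE, main="F(X) looks Uniform(0,1)")
ks.test(u, "punif")              # Kolmogorov-Smirnov test against U(0,1)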

Applications:

  • Risk assessment: IPT can be applied in risk assessment models to simulate random variables representing uncertain outcomes, such as financial losses or environmental hazards.
  • Reliability analysis: In reliability engineering, IPT is utilized to generate random variables representing component lifetimes or failure rates, enabling the evaluation of system reliability.

Examples:

  1.  Let's say we know that the time (in hours) until a bus arrives follows an Exponential(μ=0.2) distribution, where μ is the rate parameter. Then the CDF is given by,

F(x) = \int_0^x \mu e^{-\mu u}\, du

F(x) = 1 - e^{-\mu x}

1 - F(x) = e^{-\mu x}

x = -\frac{\ln(1 - F(x))}{\mu}. \quad (1)

We can generate 500 exponential random data points as follows:

  • Generate 500 values from Uniform(0,1).
  • Convert each value to x using (1).

If we plot the generated data against the true density of Exponential(μ=0.2), we can see that the generated data show a similar pattern.

F.x <- runif(500, 0, 1)  # uniform draws playing the role of F(x)
mu <- 0.2
X <- -log(1 - F.x)/mu    # invert F via equation (1)

# true density curve
tru.x <- dexp(seq(0, 40, length=500), rate=mu)

hist(X, col="blue", xlab="X", main="Density function with Expo(0.2)", probability=TRUE) # generated data
lines(seq(0, 40, length=500), tru.x, lwd=2) # true curve
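
Beyond the visual check, a Kolmogorov-Smirnov test (an optional sanity check, not part of the original procedure) can compare the generated sample against the Exponential(0.2) CDF directly:

ks.test(X, "pexp", rate=mu)  # a large p-value is consistent with X ~ Expo(0.2)
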
  2.  In survival studies, the proportional hazards model plays a vital role in understanding the relation between risk factors and failure time (morbidity). For instance, suppose we observe two subjects, a smoker (x=1) and a nonsmoker (x=0). Then the hazard function and survival function under this model are, respectively,

h(t|x) = h_0(t) e^{\beta x},

S(t|x) = e^{-\int_0^t h_0(u) e^{\beta x}\, du}. \quad (2)

where h_0(t) is the baseline hazard, i.e. the instantaneous failure rate at time t given that the subject has survived up to time t (in general, the hazard is h(t) = f(t)/S(t)), and \beta represents the association between x and t. Note from (2) that if we know how the failure time is distributed, say Exponential(μ=0.2), which implies h_0(t) = \mu, and X \sim Normal(μ=5, σ=1), then we can find the time t corresponding to an observed x through,

S(t|x) = e^{-\int_0^t \mu e^{\beta x}\, du}

S(t|x) = e^{-\mu e^{\beta x} \int_0^t du}

S(t|x) = e^{-\mu e^{\beta x} t}

\ln S(t|x) = -\mu e^{\beta x} t

t = -\frac{\ln S(t|x)}{\mu e^{\beta x}}. \quad (3)

Equation (3) provides a basis for generating bivariate data under the proportional hazards model (see Bender, Augustin, & Blettner, 2005). The procedure works as follows:

  • Generate x from Normal(μ=5,σ=1).
  • Generate S(t|x) from Uniform(0,1).
  • Use equation (3) to find t.

library(ggplot2)  # needed for ggplot(); the ggExtra package must also be installed

mu.x <- 5; sigma.x <- 1; mu.t <- 0.2; beta <- 0.5
x <- rnorm(500, mean=mu.x, sd=sigma.x)
St.x <- runif(500, 0, 1)  # uniform draws standing in for S(t|x)

t <- -log(St.x)/(mu.t*exp(beta*x))  # invert S(t|x) via equation (3)

# scatter plot with marginal histograms
p <- ggplot(data.frame(x=x, t=t), aes(x, t)) + geom_point(size=1) +
  theme(text=element_text(size=16)) + labs(title = 'Bivariate data of X and T')
ggExtra::ggMarginal(p, type = "histogram")

3.      Conclusion.

In conclusion, random data generation via the integral probability transformation is an invaluable tool in the arsenal of statisticians and data scientists. From simulating complex systems to modeling multivariate dependencies, these methods offer practical solutions to diverse challenges in statistical analysis and decision-making. By understanding and harnessing these techniques, researchers and practitioners can unlock new insights and drive advancements across various fields.

References:

  • Angus, J. E. (1994). The Probability Integral Transform and Related Results. SIAM Review, 36(4), 652–654. https://doi.org/10.1137/1036146
  • Bender, R., Augustin, T., & Blettner, M. (2005). Generating survival times to simulate Cox proportional hazards models. Statistics in Medicine, 24(11), 1713–1723. https://doi.org/10.1002/sim.2059

Thursday, September 21, 2023

Newton-Method / Newton-Raphson Method

 1. What is Newton-Method?

Since our early education in mathematics, we have been exposed to techniques for finding roots. For instance, when we need to find the values of y that fulfill the following equation

y^2 = 25,

we can use a root-finding method to solve it. Analytically, we can proceed to solve the above equation as follows

y^2 - 25 = 0

y^2 - 5^2 = 0

(y - 5)(y + 5) = 0.

Thus, the values of y that form our solution are 5 and -5.

This simple procedure becomes hard when the degree of the polynomial increases or when the equation involves more complex functions such as trigonometric functions. To make it worse, the difficulties are magnified when we want to automate the root-finding procedure on a computer. In such cases we can use the Newton method (also called the Newton-Raphson method, after Joseph Raphson, who published a simplified form of Newton's approach), one of the earliest algorithms for approximating roots.

2. How to perform Newton Method?

We will first look at how this algorithm is derived. Recall from calculus that the derivative of a function f(x) is defined as,

f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{(x+h) - x}.

Our goal is to develop an algorithm that proposes a new x in each iteration such that f(x) \to 0 (a root) as the iterations continue. But how do we generate the new x?

This is obvious if we manipulate the definition above as follows.

Let x_{n+1} = x + h be the new value that we want to propose and x_n = x be the current value whose f(x_n) we already know. Then it is noted that,

f'(x_n) \approx \frac{f(x_{n+1}) - f(x_n)}{x_{n+1} - x_n}

x_{n+1} - x_n \approx \frac{f(x_{n+1}) - f(x_n)}{f'(x_n)}

x_{n+1} \approx x_n + \frac{f(x_{n+1}) - f(x_n)}{f'(x_n)}.

Since we expect the new x_{n+1} to drive f(x_{n+1}) toward 0, the last equation is updated to,

x_{n+1} = x_n + \frac{0 - f(x_n)}{f'(x_n)} = x_n - \frac{f(x_n)}{f'(x_n)}.

Thus completing our algorithm. 

EG:

a) Find a positive y that satisfies y^2 = 25.

Solution: Note that y^2 - 25 = 0. We can use the root-finding algorithm discussed above. The information we need is as follows,

f(y) = y^2 - 25

f'(y) = 2y.

R script to run Newton-Raphson method:

f <- function(y) y^2 - 25
f.prime <- function(y) 2*y

y <- 3        # initial guess (closer to the root is better)
tol <- 100    # holds f(y); the loop exits when |f(y)| is near 0
n.iter <- 0   # iteration counter

while(abs(tol) > 1e-5 && n.iter < 500){
  tol <- f(y)
  y <- y - f(y)/f.prime(y)
  n.iter <- n.iter + 1
}

print(paste0("With ", n.iter, " iterations, y = ",y))
## [1] "With 5 iterations, y = 5"

b) Find all solutions to 2x34x=1.

[Figure: curve of f(x) = 2x^3 - 4x - 1, which crosses the x-axis at the three roots found below.]


Solution: There are many R packages that implement root-finding algorithms. Among them is the 'rootSolve' package, whose uniroot.all() function searches an interval for multiple roots.

The R script below is an example of how to find multiple roots.
f <- function(x) 2*x^3 - 4*x - 1
rootSolve::uniroot.all(f, c(-5,5))
## [1] -1.2669017 -0.2586498  1.5256917
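
As a quick sanity check (a small addition for illustration), plugging the returned roots back into f should give values near zero:

roots <- rootSolve::uniroot.all(f, c(-5,5))
f(roots)  # each entry should be approximately 0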

3. When does it not work?

Although this algorithm is simple to implement, we still need to be careful when using it. As we have seen, the procedure relies on computing the first derivative and selecting an initial point, and these are precisely the reasons why the Newton method might fail.

For instance, it is common to encounter functions whose first derivative is cumbersome to compute. In such cases, approximating the derivative by the slope of a line through recent points is more practical; this is the idea behind the secant method.
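
Below is a minimal sketch of the secant method in R (the function and the two starting points are chosen arbitrarily for illustration); it replaces f'(x_n) with the slope between the two most recent iterates:

f <- function(y) y^2 - 25  # same example function as above

y0 <- 3; y1 <- 4           # two starting points
for(i in 1:50){
  slope <- (f(y1) - f(y0))/(y1 - y0)  # secant slope replaces f'(y)
  y2 <- y1 - f(y1)/slope              # same form as the Newton update
  if(abs(f(y2)) < 1e-5) break
  y0 <- y1; y1 <- y2
}
y2  # approximately 5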

The choice of initial point can also cause problems when the solution is far from the selected point or when roots lie close together; this can lead to slow convergence, or to no convergence at all.
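
As a concrete illustration of failure (a textbook example, not from the original post), the function f(x) = x^3 - 2x + 2 started at x = 0 makes the Newton update cycle between 0 and 1 forever:

f <- function(x) x^3 - 2*x + 2
f.prime <- function(x) 3*x^2 - 2

x <- 0
for(i in 1:6){
  x <- x - f(x)/f.prime(x)  # Newton update
  print(x)                  # alternates 1, 0, 1, 0, ...
}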

4. Conclusion. 

The Newton method is among the earliest techniques used in root finding. Although superior techniques have since been developed, the Newton method serves as a natural starting point for a journey into numerical analysis.

Monday, September 18, 2023

Monte Carlo Integration

1. What is Monte Carlo Integration?


Monte Carlo Integration is a tool that researchers can use to compute integrals of complex functions. For instance, the Normal(μ, σ) cumulative distribution function, written as the probability that the variable X is less than a certain value x (i.e. \Pr(X \le x)), is hard to solve analytically because it leads to a 'non-elementary' integral. This can be seen through the following equation

\Pr(X \le x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(u-\mu)^2}{2\sigma^2}}\, du.

2. When to apply Monte Carlo Integration?


We can use this technique whenever the value of a complex integral is needed. For instance, in my master's research, I needed to calculate a double integral of the following form.

\int_0^x \int_0^t \frac{1}{\sqrt{2\pi\sigma^2}}\, \lambda e^{u\beta}\, e^{-e^{u\beta}\lambda v}\, e^{-\frac{(u-\mu)^2}{2\sigma^2}}\, dv\, du

where the X variable represents a biomarker and the T variable represents time-to-event. The double integral cannot be solved analytically, and other numerical techniques resulted in a tedious trial-and-error procedure when setting initial values. Thus, Monte Carlo Integration came to the rescue.
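
To preview the ideas in the examples below: if the integrand above is read as the joint density of U ~ Normal(μ, σ) and V | U ~ Exponential(rate = λe^{βU}), then the double integral is just the probability that (U, V) falls in the rectangle (0, x] x (0, t]. Here is a minimal sketch along those lines; every parameter and limit value is invented for illustration:

set.seed(1)
mu <- 5; sigma <- 1; lambda <- 0.2; beta <- 0.5  # invented parameter values
x <- 6; t <- 2                                   # invented integration limits

u <- rnorm(100000, mean=mu, sd=sigma)            # U ~ Normal(mu, sigma)
v <- rexp(100000, rate=lambda*exp(beta*u))       # V | U ~ Exponential(lambda*e^(beta*U))
mean(u > 0 & u <= x & v > 0 & v <= t)            # estimates the double integral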

3. Why it works?


This technique works because it uses random numbers to perform the estimation. In probability theory, if we know that a variable (let's say Y) is randomly distributed according to some distribution g(y) (e.g. Normal, Weibull, Exponential, or Poisson), then we can calculate the expected value of a function f(Y) by solving the following equation

E[f(Y)] = \int f(y)\, g(y)\, dy

if g(y) is a continuous distribution and

E[f(Y)] = \sum_y f(y)\, g(y)

if g(y) is a discrete distribution. 

In statistics, E[f(Y)] can be estimated by its unbiased estimator \widehat{E[f(Y)]}, which is

\widehat{E[f(Y)]} = \frac{1}{n}\sum_i^n f(y_i),

or, in layman's terms, the average of our f(y_i)'s.
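
As a quick numerical check (a minimal sketch; the choice of f(y) = y^2 with Y ~ Normal(0,1) is arbitrary), the sample average of f(y_i) should approach the known value E[Y^2] = Var(Y) + E[Y]^2 = 1:

set.seed(1)
y_i <- rnorm(10000)  # Y ~ Normal(0, 1)
mean(y_i^2)          # approximates E[Y^2] = 1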

4. How to perform Monte Carlo Integration?


  • Assume our observations y_i come from a specific distribution (typically Uniform(0,1); in the following examples, I will use the Normal and Exponential distributions).
  • Generate many y_i's.
  • If the integration region is restricted, say to y \le a, then for each i compute f(y_i) when y_i \le a and assign 0 otherwise.
  • Compute the average value.
EG: 
a. Find E[X^3] for x \in [0,10] if X follows an Exponential(0.3) distribution.

Solution: 
Here, we want to compute the following integral.
\int_0^{10} x^3 (0.3) e^{-0.3x}\, dx.

If we use Monte Carlo Integration, our answer can be approximated by
\widehat{E[X^3]} = \frac{1}{N}\sum_i f(x_i), \qquad f(x_i) = x_i^3\, \mathbb{1}(x_i \le 10),
where the x_i are randomly generated from the Exponential distribution and the indicator restricts the range to [0, 10].

Using the R script below, we get the estimated answer as 77.13 

x_i <- rexp(10000, rate=0.3) # generate x randomly
# create indicator function
f<- function(x){
  if(x <= 10){
    return(x^3)
  }
  else{
    return(0)
  }
}

f_xi <- lapply(x_i, f)

mean(unlist(f_xi)) # compute the average
## [1] 77.1272

When solving analytically, the question above is solvable with integration by parts (IBP). We obtain the following result when using IBP.
E[X^3] = 0.3\left[-x^3 e^{-0.3x}(1/0.3) - 3x^2 e^{-0.3x}(1/0.09) - 6x e^{-0.3x}(1/0.027) - 6 e^{-0.3x}(1/0.0081)\right]_0^{10}
E[X^3] = 0.3(-479.4310287 + 740.7407407) = 78.3929136.
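
We can also verify this result with R's built-in numerical integration (a quick cross-check, not part of the original derivation):

integrate(function(x) x^3 * 0.3 * exp(-0.3*x), lower=0, upper=10)  # approximately 78.3929, matching IBP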

b. Find \Pr(X \ge 5) where X comes from a Normal(μ=3, σ=0.8) distribution.

Solution:
Here, we are trying to solve the following equation
\int_5^{\infty} 1 \times \frac{1}{\sqrt{2\pi(0.8^2)}}\, e^{-\frac{(x-3)^2}{2(0.8^2)}}\, dx.

Our Monte Carlo estimator can be calculated as follows
\widehat{\Pr(X \ge 5)} = \frac{1}{N}\sum_i \mathbb{1}_i,
where \mathbb{1}_i is our indicator function with the following rule

\mathbb{1}_i = \begin{cases} 0, & x_i < 5 \\ 1, & x_i \ge 5 \end{cases}
Using the R script below, the estimated result is 0.0061.

x_i <- rnorm(10000, mean=3, sd=0.8) # generate x randomly
# create indicator function
f<- function(x){
  if(x >= 5){
    return(1)
  }
  else{
    return(0)
  }
}

f_xi <- sapply(x_i, f)

mean(unlist(f_xi)) # compute the average
## [1] 0.0061
# comparing with the true value
1 - pnorm(5, mean=3, sd=0.8)
## [1] 0.006209665

c. For a 2-dimensional problem, it is a little more complex to show a full R script, but the following steps can be used to estimate the solution; a short sketch follows the list.

Let's say that we need to find the solution to the following integrals, 
dcbaf(x,y)dxdy

The steps to use are:
  • Assume the X and Y variables follow Uniform(a,b) and Uniform(c,d) distributions, respectively.
  • Randomly generate many x's and y's.
  • The answer is approximated as
(d-c)(b-a)\, \frac{1}{N}\sum_i^N f(x_i, y_i).
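
Here is a minimal sketch of these steps in R; the integrand f(x,y) = xy over [0,1] x [0,2] is invented for illustration, and its exact value is 1:

set.seed(1)
a <- 0; b <- 1; c <- 0; d <- 2   # invented integration limits
N <- 100000
x <- runif(N, a, b)              # X ~ Uniform(a, b)
y <- runif(N, c, d)              # Y ~ Uniform(c, d)
(d - c)*(b - a)*mean(x*y)        # approximates the exact value 1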

d. Find the area of a circle with radius 2.

Solution:
Since a circle can be divided into four quadrants, it is sufficient to compute the area of a single quadrant and then multiply the answer by four.


Recall that, with centre (0,0), a circle can be expressed as
x^2 + y^2 = r^2,
so the area of the first quadrant is the following integral, where the indicator function picks out the points inside the circle and the box [0,3] \times [0,3] is any convenient region containing the quadrant,
\int_0^3 \int_0^3 \mathbb{1}(x^2 + y^2 \le 2^2)\, dx\, dy.

To apply our Monte Carlo Integration, we manipulate the integral so that it contains a probability distribution. For this case, we choose Uniform(0,3) for both x and y. Letting g(x) and f(y) denote the chosen Uniform densities, the integral above is extended into
\int_0^3 \int_0^3 \mathbb{1}(x^2 + y^2 \le 2^2)\, \frac{g(x) f(y)}{g(x) f(y)}\, dx\, dy

= \frac{1}{g(x) f(y)} \int_0^3 \int_0^3 \mathbb{1}(x^2 + y^2 \le 2^2)\, g(x) f(y)\, dx\, dy,

where 1/(g(x)f(y)) can be pulled out of the integral because the Uniform densities are constant. The final expression means we just need to find the expected value of the indicator \mathbb{1}(x^2 + y^2 \le 2^2), where x and y are randomly generated from the Uniform distributions. Since the Uniform(0,3) density is
h(x) = \frac{1}{3-0},
the desired integral is approximated as

(3-0)(3-0)\, \frac{1}{N}\sum_i^N \mathbb{1}_i,
with
\mathbb{1}_i = \begin{cases} 0, & x_i^2 + y_i^2 > 2^2 \\ 1, & x_i^2 + y_i^2 \le 2^2 \end{cases}

The following is the R script used.
x_i <- runif(10000, 0, 3) # generate x randomly
y_i <- runif(10000, 0, 3) # generate y randomly

# create indicator function
f<- function(x,y){
  if(x^2 + y^2 <= 2^2){
    return(1)
  }
  else{
    return(0)
  }
}

result <- mapply(f, x_i, y_i)  # apply the indicator to each (x_i, y_i) pair

4*(3-0)*(3-0)*mean(result) # 4 * average in a quadrant
## [1] 12.5352
# comparing with the true value
pi*2^2
## [1] 12.56637

5. Conclusion. 


We have seen how Monte Carlo Integration works. In particular, I have shown how we can use this numerical technique to solve integration problems that revolve around computing complex functions. We have also seen how the method extends to multiple dimensions and how we can use any known distribution to sample our observations before calculating the average (see examples a, b, c). Moreover, if our integral does not involve a probability function, we can still use this technique by manipulating the integral to contain one (see example d).

With the improvements in the speed and efficiency of modern computers, this technique has become ever more practical for researchers and professionals to apply to their daily problems.

😄😄😄