---
title: "Reciprocal Normal Distribution"
author: "Paul A. Rubin (rubin@msu.edu)"
date: "10/11/2014"
output: html_document
runtime: shiny
---
```{r, echo=FALSE}
sampleSize <- 1000; # set the sample size to use
dist <- 100; # fixed distance to use
almostZero <- 1e-5; # lower limit for numerical integrations
```
A [recent question](https://www.or-exchange.org/questions/10215/reciprocal-normal-distribution) on OR-Exchange started me thinking about the reciprocal normal distribution (which is just what it sounds like: the distribution of the reciprocal of a [Gaussian](https://en.wikipedia.org/wiki/Normal_distribution) random variable). It's not something I've messed with in the past, but apparently it occasionally comes up, and in some literature it is (somewhat cavalierly?) approximated by a normal distribution.
For an example (admittedly contrived), suppose that we have a machine that will drill through a target of depth $d$ at a constant but somewhat random rate. Ignoring minor issues such as zero or negative drill speed (implying that the drill penetrates before we turn it on), we approximate the penetration rate $P$ of the drill by a normal distribution with mean $\mu_P$ and standard deviation $\sigma_P$. The time to drill through, $T = d/P$, is then random. The remainder of this post will attempt to answer questions about the distribution of $T$.
### Is $T$ normal (Gaussian)?
Heck no! If $P$ is Gaussian (we won't get into the wisdom of treating an inherently positive quantity as Gaussian here), it has a symmetric distribution, which guarantees that the distribution of $d/P$ is [positively skewed](https://en.wikipedia.org/wiki/Skewness). (I'll demonstrate this below.)
### Okay, is $T$ approximately normal?
Approximate normality, like beauty, is in the eye of the beholder. The one generalization I can make is that the larger the [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation) $c_v = \sigma_P/\mu_P$, the more skewed/less Gaussian the distribution will seem. That is fairly intuitive: the larger the $c_v$, the more likely $P$ is to get close to (or, worse, attain/surpass) the value zero, where $T$ blows up.
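A quick simulation makes the point. The following is a standalone sketch (not part of the app below; the sample size and parameter values are arbitrary assumptions), comparing the sample skewness of $d/P$ at two coefficients of variation:

```r
# Standalone illustration: the skewness of T = d/P grows with the
# coefficient of variation of P. (Parameter values are arbitrary.)
set.seed(123)
d <- 100; mu <- 100
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3  # simple moment estimator
t.small <- d / rnorm(1e5, mean = mu, sd = 0.05 * mu)     # cv = 0.05
t.large <- d / rnorm(1e5, mean = mu, sd = 0.20 * mu)     # cv = 0.20
skewness(t.small)  # modestly positive
skewness(t.large)  # noticeably larger
```

Both samples are positively skewed, but the larger $c_v$ produces markedly more skew.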
Henceforth I'll assume that the distribution of $T$ is sufficiently symmetric that you're willing to approximate it with a Gaussian distribution. The logical next question is ...
### What moments $(\mu_T,\sigma_T)$ should I use for the (approximate) normal distribution of $T$?
Hmm. I can think of a few choices, or more precisely a few different approaches to picking them. Bear in mind that your estimated normal distribution for $T$ will have mean and median equal, whereas the actual (skewed) distribution of $T$ will not. So one question (among many) is whether you're more interested in having the mean of the estimated distribution be accurate, or more interested in having the median be accurate.
One thing at least is clear: $\mu_T \neq d/\mu_P$ (and $\sigma_T \neq d/\sigma_P$).
#### Bring on the calculus
We can try to compute the mean $\mathrm{E}[T]$ and variance $\mathrm{V}[T]$ of $T$ the old-fashioned way, by integration, and then set $\mu_T = \mathrm{E}[T]$ and $\sigma_T = \sqrt{\mathrm{V}[T]}$. The relevant formulas are $$\mathrm{E}[T] = \int_{-\infty}^{\infty} \frac{d}{x} \frac{1}{\sigma_P \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu_P}{\sigma_P}\right)^2} dx$$and $$\mathrm{V}[T] = \int_{-\infty}^{\infty} \frac{d^2}{x^2} \frac{1}{\sigma_P \sqrt{2 \pi}} e^{-\frac{1}{2}\left(\frac{x-\mu_P}{\sigma_P}\right)^2} dx - \mathrm{E}[T]^2.$$
Unfortunately, both integrals diverge due to the singularity at $P=0$. This would not happen if you used a [truncated normal distribution](https://en.wikipedia.org/wiki/Truncated_normal_distribution) for $P$, but we'll stick with the assumption that $P$ is normal.
##### Surely you jest?
Okay, another possibility, given the values of $\mu_P$ and $\sigma_P$ (and of course $d$), would be to perform numerical integration, replacing the limits of integration $(-\infty, \infty)$ with something a bit more finite, say from $\mu_P - k \sigma_P$ to $\mu_P + k \sigma_P$ for a reasonable value of $k$. Still another possibility (used below) is to integrate from $P=\epsilon$ to $P=\infty$ (or some really large value of $P$), where $\epsilon > 0$ is small but sufficiently positive that the integrals converge.
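Here is a standalone sketch of that $\epsilon$-truncated integration, with illustrative (assumed) values $\mu_P = 100$, $\sigma_P = 10$, $d = 100$, and $\epsilon = 10^{-5}$:

```r
# Moments of T = d/P via numerical integration over (eps, Inf).
# (Parameter values are illustrative assumptions.)
d <- 100; mu <- 100; sigma <- 10; eps <- 1e-5
m1 <- integrate(function(x) (d / x) * dnorm(x, mu, sigma), eps, Inf)$value
m2 <- integrate(function(x) (d / x)^2 * dnorm(x, mu, sigma), eps, Inf)$value
c(mean = m1, sd = sqrt(m2 - m1^2))  # note the mean exceeds d/mu
```

The estimated mean comes out slightly above $d/\mu_P$, consistent with the positive skew noted earlier.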
#### Use sample estimates
If you have a sample $\{p_1,\dots,p_n\}$ of observations of $P$, it's easy enough to compute $t_i = d/p_i\ \ (i=1,\dots,n)$ and use the sample mean $\overline{T}$ and standard deviation $S_T$ as moments for the approximate distribution of $T$. (If you don't have a sample for $P$ but have access to a convenient program, such as R, you can generate a sample.)
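For instance (a standalone sketch with made-up parameters, independent of the app below):

```r
# Moment estimates for T from a simulated sample of speeds.
# (mu, sigma, d, and the sample size are illustrative assumptions.)
set.seed(123)
d <- 100
p <- rnorm(1000, mean = 100, sd = 10)   # sample of drill speeds
t <- d / p                              # corresponding drilling times
c(mean = mean(t), sd = sd(t))           # use these as mu_T and sigma_T
```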
#### Order statistics
While inverting $P$ skews the resulting distribution and produces less-than-obvious moments, all it does to order statistics (besides inverting them) is swap the order. If we denote the $q$-th quantile of $P$ by $P_q$ ($q = 0.5$ for the median, $q = 0.9$ for the 90th percentile, etc.), then $T_{1-q} = d/P_q$.
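A quick empirical check of the relation $T_{1-q} = d/P_q$ (standalone code; the parameters and the choice $q = 0.9$ are arbitrary):

```r
# Quantiles invert and swap order: the (1-q) quantile of T = d/P matches
# d divided by the q quantile of P. (Parameter values are arbitrary.)
set.seed(123)
d <- 100
p <- rnorm(1e5, mean = 100, sd = 10)
qu <- 0.9
quantile(d / p, 1 - qu)  # empirical (1-q) quantile of T
d / quantile(p, qu)      # d over the empirical q quantile of P
```

The two values agree up to the interpolation error in the empirical quantiles.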
```{r, echo=FALSE}
q <- qnorm(0.75)
```
That implies, in particular, that the median of $T$ is $d/\mu_P$. Since the Gaussian approximation to the distribution of $T$ is symmetric (mean = median), one possible choice for $\mu_T$ is $d/\mu_P$. The mean of a Gaussian distribution is also the average of the $q$ and $1-q$ quantiles, so we could also use $$\mu_T = 0.5\left(T_{0.25} + T_{0.75}\right) = 0.5\left(\frac{d}{P_{0.75}} + \frac{d}{P_{0.25}}\right) = 0.5\left(\frac{d}{\mu_P + `r q`\sigma_P} + \frac{d}{\mu_P - `r q`\sigma_P}\right).$$(I chose quartiles, but any quantile could be employed with equal lack of rigor.)
A similar trick can be employed to get an approximate value for $\sigma_T$. Again using quartiles just as a matter of personal preference, the [interquartile range](https://en.wikipedia.org/wiki/Interquartile_range) of a variable with distribution $Z \sim N(\mu_Z, \sigma_Z)$ is $$\mathrm{IQR}_Z = Z_{0.75} - Z_{0.25} = \left(\mu_Z + `r q` \sigma_Z\right) - \left(\mu_Z - `r q` \sigma_Z\right) = `r 2*q`\sigma_Z.$$So $\sigma_Z = \mathrm{IQR}_Z/`r 2*q` = `r 0.5/q`\times \mathrm{IQR}_Z$. Using that logic, we get an estimated standard deviation $$\sigma_T = `r 0.5/q`\left(\frac{d}{\mu_P - `r q`\sigma_P} - \frac{d}{\mu_P + `r q`\sigma_P} \right) = \frac{d\sigma_P}{\mu_P^2 - `r q*q`\sigma_P^2}$$for $T$.
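In code, the two quartile-based estimates reduce to a couple of lines (a standalone sketch; $\mu_P = 100$, $\sigma_P = 10$, $d = 100$ are assumptions):

```r
# Closed-form quartile-based estimates for the moments of T.
# (Parameter values are illustrative assumptions.)
d <- 100; mu <- 100; sigma <- 10
q <- qnorm(0.75)  # third-quartile z-score, about 0.674
mu.t <- 0.5 * (d / (mu + q * sigma) + d / (mu - q * sigma))  # quartile average
sd.t <- d * sigma / (mu^2 - q^2 * sigma^2)                   # from the IQR
c(mu.t = mu.t, sd.t = sd.t)
```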
### So, does any of this actually work?
You be the judge. The plot below shows a histogram of the drilling times computed from a random sample of `r sampleSize` observations of drill speed (using a pseudorandom number generator, since this entire example is "pseudo"). The depth to penetrate ($d$) is arbitrarily fixed at `r dist`. You can choose the mean speed $\mu_P$ and the coefficient of variation $c_v = \sigma_P /\mu_P$ using sliders, and optionally specify a seed for the pseudorandom number generator. You can also use the control beneath the plot to adjust the number of bins in the histogram.
Superimposed on the histogram (in blue) is the density function of a normal distribution for drilling time $T$, using estimates for mean and standard deviation. The second set of controls allows you to choose which of the estimates discussed above are used for both the mean ($\mu_T$) and the standard deviation ($\sigma_T$).
```{r, echo=FALSE}
#
# Get parameters for the sample of speeds.
#
div(
  inputPanel(
    sliderInput("mean.p",
                label = "Mean speed",
                min = 10,
                max = 200,
                value = 100,
                step = 1),
    sliderInput("cv.p",
                label = "Coefficient of variation",
                min = 0.01,
                max = 0.5,
                value = 0.1,
                step = 0.01),
    numericInput("seed",
                 label = "Random generator seed",
                 value = 123)
  ),
  align = "center"
)
#
# Generate the samples of speed and time.
#
sd.p <- reactive({
  (input$cv.p)*(input$mean.p) # std. dev. of rate P
})
speedSample <- reactive({
  set.seed(input$seed);
  rnorm(sampleSize, mean = input$mean.p, sd = sd.p())
})
timeSample <- reactive({
  sort(dist/speedSample())
})
#
# Choose how to estimate moments of the time distribution.
#
div(
  inputPanel(
    selectInput("mean.t",
                label = "Estimate for mean time",
                choices = list("Numerical integration" = 1,
                               "Sample mean" = 2,
                               "Reciprocal of mean speed" = 3,
                               "Quartile average" = 4),
                selected = 2),
    selectInput("sd.t",
                label = "Estimate for std. dev.",
                choices = list("Numerical integration" = 1,
                               "Sample std. dev." = 2,
                               "From interquartile range" = 3),
                selected = 2)
  ),
  align = "center"
)
#
# Create methods for estimating moments of the time distribution.
#
# Estimate the mean via numerical integration.
mean.integrate <- reactive({
  f <- function(x) dist/x*dnorm(x, mean = input$mean.p, sd = sd.p());
  integrate(f, almostZero, Inf)$value
})
# Estimate the mean via a sample mean.
mean.sample <- reactive({
  mean(timeSample())
})
# Estimate the mean via the reciprocal of the mean speed (the median of T).
mean.reciprocal <- reactive({
  dist/input$mean.p
})
# Estimate the mean via quartiles.
mean.quartile <- reactive({
  0.5*dist*(1/(input$mean.p + q*sd.p()) + 1/(input$mean.p - q*sd.p()))
})
# Estimate the std. dev. via integration.
sd.integrate <- reactive({
  f <- function(x) ((dist/x)^2)*dnorm(x, mean = input$mean.p, sd = sd.p());
  sqrt(integrate(f, almostZero, Inf)$value - mean.integrate()^2)
})
# Estimate the std. dev. via the sample std. dev.
sd.sample <- reactive({
  sd(timeSample())
})
# Estimate the std. dev. via quartiles.
sd.quartile <- reactive({
  dist*sd.p()/(input$mean.p^2 - q^2*sd.p()^2)
})
# Estimate those moments.
mean.est <- reactive({
  c(mean.integrate(), mean.sample(), mean.reciprocal(), mean.quartile())
})
sd.est <- reactive({
  c(sd.integrate(), sd.sample(), sd.quartile())
})
mean.time <- reactive({
  v <- mean.est();
  switch(input$mean.t,
         "1" = v[1],
         "2" = v[2],
         "3" = v[3],
         "4" = v[4])
})
sd.time <- reactive({
  v <- sd.est();
  switch(input$sd.t,
         "1" = v[1],
         "2" = v[2],
         "3" = v[3])
})
# Create a candidate density function.
dens <- reactive({
  xy.coords(timeSample(), dnorm(timeSample(), mean.time(), sd.time()))
})
# Plot the sample distribution of time.
renderPlot({
  hist(timeSample(), freq = FALSE, breaks = as.numeric(input$n_time_breaks),
       xlab = "Time", main = "Sample of Drilling Times");
  # Superimpose a normal density.
  lines(dens(), col = "blue");
})
# Let the user select the number of histogram bins.
div(
  inputPanel(
    selectInput("n_time_breaks", label = "Number of bins:",
                choices = seq(10, 50, 10), selected = 20)
  ),
  align = "center"
)
# Perform a Shapiro-Wilk test of normality.
sw.test <- reactive({
  shapiro.test(timeSample())
})
```
These are the estimated moments of the distribution for $T$, using the various methods suggested above:
```{r, echo=FALSE}
meanTable <- reactive({
  data.frame(Method = c("Numerical integration", "Sample mean", "Reciprocal of mean speed", "Quartile average"),
             Estimate = mean.est())
})
sdTable <- reactive({
  data.frame(Method = c("Numerical integration", "Sample std. dev.", "From interquartile range"),
             Estimate = sd.est())
})
fluidRow(
  column(width = 6, h5("Estimation of Mean Time", align = "center")),
  column(width = 6, h5("Estimation of Std. Dev. of Time", align = "center"))
)
fluidRow(
  column(width = 6, div(renderTable(meanTable(), digits = 4), align = "center")),
  column(width = 6, div(renderTable(sdTable(), digits = 4), align = "center"))
)
```
Try adjusting the mean and coefficient of variation of the drill speed, and judge how well the normal approximation to drilling time seems to fit.
Is drilling time normally distributed? The [Shapiro-Wilk test](https://en.wikipedia.org/wiki/Shapiro%E2%80%93Wilk_test) ($H_0$: $T$ is normally distributed v. $H_A$: not so much) yields a test statistic value of `r renderText(sw.test()$statistic)` and a $p$-value of `r renderText(sw.test()$p.value)`. For a sufficiently small coefficient of variation (nearly constant drill speed), you can convince the Shapiro-Wilk test not to reject the null hypothesis (so the normal approximation would seem to fit well).
Incidentally, for a fixed coefficient of variation of $P$, the Shapiro-Wilk statistic for $T=d/P$ is invariant under changes to both $d$ and $\mu_P$. If you tweak the slider for $\mu_P$ and observe that the Shapiro-Wilk test results do not change, that's not a bug, it's correct behavior. (It took me a couple of minutes to convince myself of this.)
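If you want to verify the invariance outside the app, note that fixing $c_v$ and writing $P = \mu_P(1 + c_v Z)$ with $Z \sim N(0,1)$ makes $T = (d/\mu_P) \cdot 1/(1 + c_v Z)$, so changing $d$ or $\mu_P$ merely rescales $T$, and the Shapiro-Wilk statistic is location/scale invariant. A standalone sketch (sample size and parameter values are arbitrary):

```r
# Shapiro-Wilk statistic is unchanged when d or mu_P is rescaled at fixed cv.
# (Sample size and parameter values are arbitrary.)
set.seed(123)
cv <- 0.1
z <- rnorm(1000)
t1 <- 100 / (100 * (1 + cv * z))  # d = 100, mu_P = 100
t2 <- 500 / (250 * (1 + cv * z))  # d = 500, mu_P = 250, same cv
abs(shapiro.test(t1)$statistic - shapiro.test(t2)$statistic)  # floating-point noise only
```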