Section 6.4 Hypergeometric Distribution
For the discrete uniform distribution, the presumption is that you make a single selection from the collection of items. If instead you take a larger sample without replacement from a collection in which all items are originally equally likely, the result will not be uniform. Indeed, consider a collection of n items from which you want to take a sample of size r without replacement. If n1 of the items are "desired" and the remaining n2 = n − n1 are not, let the random variable X measure the number of items from the first group in your sample, with R = {0, 1, ..., min(r, n1)}. The resulting collection of probabilities is called a Hypergeometric Distribution.
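To see the non-uniformity concretely, here is a small simulation sketch in plain Python (the parameters n1 = 4, n2 = 6, r = 3 are illustrative, not from the text): draw r items without replacement and tally how many "desired" items land in each sample.

```python
import random
from collections import Counter

random.seed(1)
n1, n2, r = 4, 6, 3                 # desired items, others, sample size
items = [1] * n1 + [0] * n2         # 1 marks a "desired" item

counts = Counter()
trials = 100_000
for _ in range(trials):
    sample = random.sample(items, r)    # sampling without replacement
    counts[sum(sample)] += 1

for x in range(r + 1):
    print(x, counts[x] / trials)    # clearly not uniform over {0, 1, 2, 3}
```

The empirical frequencies cluster around x = 1 and x = 2 rather than spreading evenly, which is exactly the behavior the hypergeometric probability function below captures.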
Theorem 6.4.1. Hypergeometric Probability Function.
For a Hypergeometric random variable with R = {0, 1, ..., r} and assuming \(n_1 \ge r\) and \(n-n_1 \ge r\text{,}\)
\begin{equation*} f(x) = \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}}. \end{equation*}
Proof.
Since you are sampling without replacement and measuring only the number of items from your desired group in the sample, the space of X will be R = {0, 1, ..., r} assuming \(n_1 \ge r\) and \(n-n_1 \ge r\text{.}\) In the case when r is too large for either of these, the formulas below still follow, noting that binomial coefficients are zero if the top is smaller than the bottom or if the bottom is negative.
So f(x) = P(X = x) = P(x from the sample are from the target group and the remainder are not). Breaking these up gives
\begin{equation*} f(x) = \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}}. \end{equation*}
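As a quick sanity check, the formula can be evaluated directly with Python's `math.comb` (a minimal sketch, separate from the Sage and R cells later in this section; `hyper_pmf` and the parameters are illustrative):

```python
from math import comb

def hyper_pmf(x, n1, n2, r):
    """P(X = x) when drawing r items without replacement from
    n1 desired items and n2 others."""
    return comb(n1, x) * comb(n2, r - x) / comb(n1 + n2, r)

# Example: n1 = 4 desired among n = 10 items, sample of size r = 3.
probs = [hyper_pmf(x, 4, 6, 3) for x in range(4)]
print(probs)        # probabilities for x = 0, 1, 2, 3
print(sum(probs))   # should be 1, up to floating point
```

The probabilities sum to 1, as the first part of the theorem below asserts.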
Theorem 6.4.2. Properties of the Hypergeometric Distribution.
- \(f(x)=\frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}}\) satisfies the properties of a probability function.
- \(\mu = r \frac{n_1}{n}\)
- \(\sigma^2 = r \frac{n_1}{n} \frac{n_2}{n} \frac{n-r}{n-1}\)
- \(\gamma_1 = \frac{(n-2n_1)\sqrt{n-1}\,(n-2r)}{\sqrt{r n_1 (n-n_1)(n-r)}\,(n-2)}\)
- \(\gamma_2 =\) very complicated!
Proof.
-
\begin{align*} \sum_{x=0}^n \binom{n}{x} y^x & = (1+y)^n, \text{ by the Binomial Theorem}\\ & = (1+y)^{n_1} \cdot (1+y)^{n_2} \\ & = \sum_{x=0}^{n_1} \binom{n_1}{x} y^x \cdot \sum_{x=0}^{n_2} \binom{n_2}{x} y^x \\ & = \sum_{x=0}^n \sum_{t=0}^x \binom{n_1}{t} \binom{n_2}{x-t} y^x \end{align*}
Equating like coefficients for the various powers of y gives
\begin{equation*} \binom{n}{r} = \sum_{t=0}^r \binom{n_1}{t} \binom{n_2}{r-t}. \end{equation*}Dividing both sides by \(\binom{n}{r}\) gives
\begin{equation*} 1 = \sum_{x=0}^r f(x). \end{equation*} -
For the mean
\begin{align*} \sum_{x=0}^n x \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}} & = \frac{1}{\binom{n}{r}} \sum_{x=1}^n \frac{n_1(n_1-1)!}{(x-1)!(n_1-x)!} \binom{n-n_1}{r-x}\\ & = \frac{n_1}{\binom{n}{r}} \sum_{x=1}^n \frac{(n_1-1)!}{(x-1)!((n_1-1)-(x-1))!} \binom{n-n_1}{r-x} \\ & = \frac{n_1}{\frac{n(n-1)!}{r!(n-r)!}} \sum_{x=1}^n \binom{n_1-1}{x-1} \binom{n-n_1}{r-x} \end{align*}Consider the following change of variables for the summation:
\begin{gather*} y = x-1\\ n_3 = n_1-1\\ s = r-1\\ m = n-1 \end{gather*}Then, this becomes
\begin{align*} \mu = \sum_{x=0}^n x \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}} & = r \frac{n_1}{n} \sum_{y=0}^m \frac{\binom{n_3}{y} \binom{m-n_3}{s-y}}{\binom{m}{s}}\\ & = r \frac{n_1}{n} \cdot 1 \end{align*}noting that the summation has the same form as the one shown to equal 1 above.
-
For the variance, we will use an alternate form of the definition that is useful when looking for cancellations among the numerous factorials in the hypergeometric probability function of Theorem 6.4.1. Indeed, you can easily notice that
\begin{equation*} \sigma^2 = E[X^2] - \mu^2 = E[X^2-X]+E[X] -\mu^2 = E[X(X-1)] + \mu - \mu^2. \end{equation*}Since we have \(\mu = r \frac{n_1}{n}\) from above then let's focus on the first term only and use the substitutions
\begin{gather*} y = x-2\\ n_3 = n_1-2\\ s = r-2\\ m = n-2 \end{gather*}to get
\begin{gather*} E[X(X-1)] = \sum_{x=0}^n x(x-1) \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}}\\ = \sum_{x=2}^n x(x-1) \frac{\frac{n_1!}{x(x-1)(x-2)!(n_1-x)! } \binom{n-n_1}{r-x}}{\binom{n}{r}}\\ = \sum_{x=2}^n \frac{\frac{n_1!}{(x-2)!(n_1-x)! } \frac{n_2!}{(r-x)!(n_2-r+x)!}}{\binom{n}{r}}\\ = n_1 \cdot (n_1-1) \cdot \sum_{x=2}^n \frac{\frac{(n_3)!}{(x-2)!(n_3 -(x-2))! } \frac{n_2!}{((r-2)-(x-2))!(n_2-(r-2)+(x-2))!}}{\binom{n}{r}}\\ = n_1 \cdot (n_1-1) \sum_{y=0}^{m} \frac{\frac{(n_3)!}{y!(n_3 -y)! } \frac{n_2!}{(s-y)!(n_2-s+y)!}}{\binom{n}{r}}\\ = \frac{n_1 \cdot (n_1-1) \cdot r \cdot (r-1)}{n (n-1)} \sum_{y=0}^{m} \frac{\binom{(n_3)}{y} \binom{n_2}{s-y}}{\binom{m}{s}}\\ = \frac{n_1 \cdot (n_1-1) \cdot r \cdot (r-1)}{n (n-1)} \end{gather*}where we have used the summation formula above that showed that f(x) was a probability function.
Putting this together with the earlier formula gives
\begin{equation*} \sigma^2 = \frac{n_1 \cdot (n_1-1) \cdot r \cdot (r-1)}{n (n-1)} + r \frac{n_1}{n} - \left ( r \frac{n_1}{n} \right )^2, \end{equation*} which simplifies to the formula stated in the theorem. The kurtosis formula and its proof are even messier, and we will not prove it for this distribution. If you search the web, you will find that many sources simply omit the kurtosis of the hypergeometric, so we will do the same here and just note that, as the parameters increase, the resulting graph looks increasingly bell-shaped; that is, the kurtosis approaches 3.
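The pieces of this proof are easy to check numerically. A plain-Python sketch (using the illustrative parameters n1 = 10, n2 = 16, r = 5, which match the cells below) compares the factorial-moment formula, the two forms of the variance, and the skewness formula from the theorem:

```python
from math import comb, sqrt

n1, n2, r = 10, 16, 5
n = n1 + n2
pmf = [comb(n1, x) * comb(n2, r - x) / comb(n, r) for x in range(r + 1)]

# Moments computed directly from the probability function
mu = sum(x * p for x, p in enumerate(pmf))
E_fact = sum(x * (x - 1) * p for x, p in enumerate(pmf))      # E[X(X-1)]
var = sum((x - mu) ** 2 * p for x, p in enumerate(pmf))
skew = sum((x - mu) ** 3 * p for x, p in enumerate(pmf)) / var ** 1.5

# Closed forms from the proof and the theorem
E_fact_formula = n1 * (n1 - 1) * r * (r - 1) / (n * (n - 1))
var_assembled = E_fact_formula + r * n1 / n - (r * n1 / n) ** 2
var_product = r * (n1 / n) * (n2 / n) * (n - r) / (n - 1)
skew_formula = ((n - 2 * n1) * sqrt(n - 1) * (n - 2 * r)
                / (sqrt(r * n1 * (n - n1) * (n - r)) * (n - 2)))

print(mu, r * n1 / n)                    # mean two ways
print(E_fact, E_fact_formula)            # factorial moment two ways
print(var, var_assembled, var_product)   # variance three ways
print(skew, skew_formula)                # skewness two ways
```

Each pair (or triple) agrees, confirming both the factorial-moment route and the simplification to the product form of the variance.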
var('x,n,n1,r,m')
assume(x,'integer')
assume(n,'integer')
assume(n1,'integer')
assume(r,'integer')
assume(r<n1)
f = (factorial(n1)/(factorial(x)*factorial(n1-x))
*factorial(n-n1)/(factorial(r-x)*factorial(n-n1-r+x))
*(factorial(r)*factorial(n-r))/factorial(n))
mu = sum(x*f,x,0,r)
M2 = sum(x^2*f,x,0,r)
M3 = sum(x^3*f,x,0,r)
M4 = sum(x^4*f,x,0,r)
pretty_print('Mean = ',mu)
v = (M2-mu^2).factor()
pretty_print('Variance = ',v)
stand = sqrt(v)
sk = ((M3 - 3*M2*mu + 2*mu^3)).factor()/stand^3
pretty_print('Skewness = ',sk)
kurt = (M4 - 4*M3*mu + 6*M2*mu^2 -3*mu^4).factor()/stand^4
pretty_print('Kurtosis = ',(kurt-3).factor(),'+3')
k = 30 # make this number larger for limiting distribution
N1 = k*10 # desirable items
N2 = k*16 # the others
n = N1+N2
r = k*5 # sample size extracted without replacement
X = 0:r # the space R of the random variable
mu = r*N1/n # the mean of the Hypergeometric Distribution
sdev = sqrt(mu*(N2/n)*(n-r)/(n-1)) # standard deviation
invisible(dhyper( X, N1, N2, r )) # actual probs (suppressed: too many values for large k)
Phyper = dhyper(X, N1, N2, r ) # probability function over X
Psample = rhyper(10^6, N1, N2, r) # to create a histogram
Xtop = max(Psample) # scaling the x-axis. Shift by 1/2
hist(Psample, prob=TRUE, br=(-1:Xtop)+0.5, col="skyblue2", xlab="X",
main="Hypergeometric pdf vs Approximating 'Bell Curve'")
points(X, Phyper, pch=19, col="darkgreen") # actual (x,f(x))
invisible(Pnormal <- function(X){dnorm(X, mean=mu, sd=sdev)})
invisible(curve(Pnormal, col="red", lwd=2, add=TRUE))
Theorem 6.4.3. Hypergeometric Limiting Distribution.
For the hypergeometric distribution of Theorem 6.4.1, as \(n_1\text{,}\) \(n_2\text{,}\) and \(r\) are uniformly scaled toward infinity,
\begin{equation*} \gamma_1 \to 0 \end{equation*}
and
\begin{equation*} \gamma_2 \to 3. \end{equation*}
Proof.
For skewness, take the limit of the skewness result above as \(n_1, n_2\text{,}\) and \(r\) are uniformly increased together (i.e., all are doubled, tripled, etc.). To model this behavior, one can simply scale each of those terms by some variable \(k\) and then let \(k\) increase. Asymptotically, the result is \(\gamma_1 \to 0\text{.}\)
Similarly for kurtosis. However, since we did not derive a closed formula for the kurtosis, taking the limit directly is difficult, so we must appeal to general results proved in a course that follows this one. Sadly, not here.
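The scaling argument can be illustrated directly. In this sketch (with illustrative base parameters n1 = 10, n2 = 16, r = 5, matching the cells below), each parameter is multiplied by k and the closed-form skewness from Theorem 6.4.2 is recomputed:

```python
from math import sqrt

def hyper_skewness(n1, n2, r):
    """Closed-form skewness of the hypergeometric distribution."""
    n = n1 + n2
    return ((n - 2 * n1) * sqrt(n - 1) * (n - 2 * r)
            / (sqrt(r * n1 * (n - n1) * (n - r)) * (n - 2)))

for k in (1, 10, 100, 1000):
    print(k, hyper_skewness(10 * k, 16 * k, 5 * k))
# the skewness shrinks toward 0, roughly like 1/sqrt(k)
```

Counting powers of k in the formula (k^{5/2} in the numerator against k^3 in the denominator) explains the observed 1/sqrt(k) decay.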
N1 = 10
N2 = 16
n = N1+N2
r = 5
X = 0:r # the space R of the random variable
mu = r*N1/n # mean of the Hypergeometric Distribution
sdev = sqrt(mu*(N2/n)*(n-r)/(n-1)) # the standard deviation
dhyper( X, N1, N2, r ) # let's print out a bunch of actual probs
Phyper = dhyper(X, N1, N2, r ) # the probability function over X
Psample = rhyper(10^6, N1, N2, r) # to create a histogram, sample
Xtop=max(Psample) # scaling the x-axis. Shift by 1/2 below.
hist(Psample, prob=TRUE, br=(-1:Xtop)+0.5, col="skyblue2", xlab="X",
main="Hypergeometric Probability Function vs Approximating 'Bell Curve'")
points(X, Phyper, pch=19, col="darkgreen") # to create actual (x,f(x))
Pnormal <- function(X){dnorm(X, mean=mu, sd=sdev)} # overlap a bell
curve(Pnormal, col="red", lwd=2, add=TRUE)
# Hypergeometric distribution over 0 .. N
# Size of classes N1 and N2 must be given as well as subset size r
var('x')
@interact(layout=dict(top=[['N1','N2','r']]))
def _(N1=input_box(10,width=5,label='$$ N_1 $$'),
      N2=input_box(10,width=5,label='$$ N_2 $$'),
      r=input_box(5,width=5,label='$$ r $$')):
    N = N1 + N2
    f = binomial(N1,x)*binomial(N2,r-x)/binomial(N,r)
    @interact
    def __(X = slider(0,min(r,N1),1)):
        Px = f(x=X)
        Pxapprox = Px.n(digits=8)
        formula = ("\\frac{\\binom{%s}"%str(N1)+"{%s}"%str(X)
                   +" \\binom{%s}"%str(N2)+"{%s}}"%str(r-X)
                   +"{ \\binom{%s}"%str(N)+"{%s}}"%str(r))
        pretty_print(html("P(X = %s"%str(X)+") = $ %s $"%str(formula)
                          +" = $ %s $"%str(latex(Px))
                          +" $ \\approx %s $"%str(Pxapprox)))
        G = polygon([(0,0), (1,0), (1.2,1), (-.2,1)],color='lightblue')
        G += line([(0.5,0),(0.5,1)],thickness=2)
        G += circle((0.5,0.3),0.25,fill=True,
                    facecolor='yellow',alpha=0.5,zorder=2)
        G += text("Things I want",(0.2,0.9))
        G += text("Everything else",(0.8,0.9))
        G += text("%s"%X,(0.4,0.3),fontsize=30)
        G += text("%s"%(r-X),(0.6,0.3),fontsize=30)
        G += text("%s"%(N1-X),(0.1,0.5),fontsize=30)
        G += text("%s"%(N2-(r-X)),(0.9,0.5),fontsize=30)
        show(G, axes=0, figsize=(4,4))