Hypergeometric Distribution

Section 6.4 Hypergeometric Distribution

For the discrete uniform distribution, the presumption is that you make a single selection from the collection of items. However, if you take a larger sample without replacement from a collection in which all items are originally equally likely, then the number of items of a given type in your sample will not be uniformly distributed.

Indeed, consider a collection of n items from which you want to take a sample of size r without replacement. If $n_1$ of the items are "desired" and the remaining $n_2 = n - n_1$ are not, let the random variable X count the number of desired items in your sample, with space $R = \{0, 1, ..., r \}\text{.}$ The resulting collection of probabilities is called a Hypergeometric Distribution.

Since you are sampling without replacement and are only trying to measure the number of items from your desired group in the sample, the space of X will be R = {0, 1, ..., r}, assuming $n_1 \ge r$ and $n-n_1 \ge r\text{.}$ When r is too large for either of these, the formulas below still hold, noting that a binomial coefficient is zero if the top is smaller than the bottom or if the bottom is negative.

So f(x) = P(X = x) = P(x items in the sample are from the target group and the remaining r - x are not). Counting the ways to choose each group separately and dividing by the total number of possible samples gives

$\begin{equation*} f(x) = \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}} \end{equation*}$
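The formula can be evaluated directly. The following is a minimal sketch in plain Python (not Sage), using exact fractions; the helper names `binom` and `hypergeom_pmf` and the sample values are illustrative assumptions. The helper `binom` encodes the convention mentioned above that a binomial coefficient is zero when the bottom is negative or larger than the top.

```python
from fractions import Fraction
from math import comb


def binom(top, bottom):
    """Binomial coefficient, taken to be 0 if bottom < 0 or bottom > top."""
    if bottom < 0 or bottom > top:
        return 0
    return comb(top, bottom)


def hypergeom_pmf(x, n, n1, r):
    """f(x) for a sample of size r taken without replacement
    from n items, n1 of which are desired."""
    return Fraction(binom(n1, x) * binom(n - n1, r - x), binom(n, r))


# Illustrative values: 13 items, 5 desired, sample of size 4.
probs = [hypergeom_pmf(x, 13, 5, 4) for x in range(5)]
print(probs)
print(sum(probs))

# A case where r exceeds n - n1: some values in {0, ..., r} get probability 0.
print([hypergeom_pmf(x, 5, 2, 4) for x in range(5)])
```

Working with `Fraction` keeps every probability exact, so the sum of the probabilities can be compared with 1 without rounding concerns.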

Theorem 6.4.1 Properties of the Hypergeometric Distribution

$f(x) = \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}}$ satisfies the properties of a probability function.
$\mu = r \frac{n_1}{n}$
$\sigma^2 = r \frac{n_1}{n} \frac{n_2}{n} \frac{n-r}{n-1}$
$\gamma_1 = \frac{(n - 2 n_1)\sqrt{n-1}(n - 2r)}{\sqrt{r n_1 (n - n_1)(n-r)}\,(n-2)}$
$\gamma_2 = \frac{n^2(n-1)}{r(n-2)(n-3)(n-r)} \left[ \frac{n(n+1)-6r(n-r)}{n_1(n-n_1)} + \frac{3r(n-r)(n+6)}{n^2} - 6 \right]$

Proof

\begin{align*} \sum_{k=0}^n \binom{n}{k} y^k & = (1+y)^n, \text{ by the Binomial Theorem}\\ & = (1+y)^{n_1} \cdot (1+y)^{n_2} \\ & = \sum_{t=0}^{n_1} \binom{n_1}{t} y^t \cdot \sum_{u=0}^{n_2} \binom{n_2}{u} y^u \\ & = \sum_{k=0}^n \sum_{t=0}^k \binom{n_1}{t} \binom{n_2}{k-t} y^k \end{align*}

Equating the coefficients of $y^r$ on both sides gives

\begin{equation*} \binom{n}{r} = \sum_{t=0}^r \binom{n_1}{t} \binom{n_2}{r-t}. \end{equation*}

Dividing both sides by $\binom{n}{r}$ and renaming the index t as x gives

\begin{equation*} 1 = \sum_{x=0}^r f(x). \end{equation*}

For the mean,

\begin{align*} \sum_{x=0}^n x \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}} & = \frac{1}{\binom{n}{r}} \sum_{x=1}^n \frac{n_1(n_1-1)!}{(x-1)!(n_1-x)!} \binom{n-n_1}{r-x} \\ & = \frac{n_1}{\binom{n}{r}} \sum_{x=1}^n \frac{(n_1-1)!}{(x-1)!((n_1-1)-(x-1))!} \binom{n-n_1}{r-x} \\ & = \frac{n_1}{\frac{n(n-1)!}{r!(n-r)!}} \sum_{x=1}^n \binom{n_1-1}{x-1} \binom{n-n_1}{r-x} \end{align*}

Consider the following change of variables for the summation:

\begin{gather*} y = x-1\\ n_3 = n_1-1\\ s = r-1\\ m = n-1 \end{gather*}

Then, this becomes

\begin{align*} \mu = \sum_{x=0}^n x \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}} & = r \frac{n_1}{n} \sum_{y=0}^m \frac{\binom{n_3}{y} \binom{m-n_3}{s-y}}{\binom{m}{s}}\\ & = r \frac{n_1}{n} \cdot 1 \end{align*}

noting that the summation has the same form as the one shown above to equal 1.

The proof of the variance formula is similar and uses $\sigma^2 = E[X(X-1)] + \mu - \mu^2\text{.}$ The proofs of the skewness and kurtosis formulas are messy, and we won't bother with them for this distribution!
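For readers who want to see the variance computation, here is a sketch of how it can go, assuming $r \ge 2$ and $n_1 \ge 2$ and using the same cancellation idea as for the mean. The two factors x(x-1) are absorbed into the first binomial coefficient, and the remaining sum is again of the form handled in the first part of the proof:

\begin{align*} E[X(X-1)] & = \sum_{x=2}^{r} x(x-1) \frac{\binom{n_1}{x} \binom{n-n_1}{r-x}}{\binom{n}{r}} \\ & = \frac{n_1(n_1-1)}{\binom{n}{r}} \sum_{x=2}^{r} \binom{n_1-2}{x-2} \binom{n-n_1}{r-x} \\ & = \frac{n_1(n_1-1) \binom{n-2}{r-2}}{\binom{n}{r}} \\ & = \frac{r(r-1) n_1(n_1-1)}{n(n-1)} \end{align*}

Combining this with $\mu = r \frac{n_1}{n}$ then gives

\begin{align*} \sigma^2 & = E[X(X-1)] + \mu - \mu^2 \\ & = r \frac{n_1}{n} \left[ \frac{(r-1)(n_1-1)}{n-1} + 1 - \frac{r n_1}{n} \right] \\ & = r \frac{n_1}{n} \cdot \frac{(n-n_1)(n-r)}{n(n-1)} \end{align*}

where the last step follows by placing the bracket over the common denominator n(n-1) and factoring the numerator as (n - n_1)(n - r).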

Note, if r=1 then you are back at a regular discrete uniform model. Indeed,

$\begin{equation*} P(\text{desired item}) = 1 \cdot \frac{n_1}{n} = \mu . \end{equation*}$

which is indeed what you might expect when selecting once.
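As a quick numerical sanity check, the sketch below computes the mean and variance directly from the probabilities and compares them with the closed forms in Theorem 6.4.1, including the r = 1 case just discussed. It is written in plain Python with exact fractions rather than Sage, and the function names and the values n = 13 and $n_1 = 5$ are illustrative assumptions.

```python
from fractions import Fraction
from math import comb


def pmf(x, n, n1, r):
    """Exact f(x); assumes 0 <= x <= r with r <= n1 and r <= n - n1."""
    return Fraction(comb(n1, x) * comb(n - n1, r - x), comb(n, r))


def mean_and_variance(n, n1, r):
    """Mean and variance computed directly from their definitions."""
    xs = range(r + 1)
    mu = sum(x * pmf(x, n, n1, r) for x in xs)
    var = sum((x - mu) ** 2 * pmf(x, n, n1, r) for x in xs)
    return mu, var


n, n1 = 13, 5      # illustrative values
n2 = n - n1
for r in (1, 4):
    mu, var = mean_and_variance(n, n1, r)
    # Closed forms for the mean and variance from the theorem
    mu_formula = Fraction(r * n1, n)
    var_formula = Fraction(r * n1 * n2 * (n - r), n * n * (n - 1))
    print(r, mu == mu_formula, var == var_formula)
```

For r = 1 the computed mean is simply $n_1/n$, the probability of picking a desired item in a single selection.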

Consider the Hypergeometric distribution for various values of $n_1, n_2,$ and r using the interactive cell below. Notice what happens when you start with relatively small values of $n_1, n_2,$ and r (say, $n_1 = 5, n_2 = 8,$ and r = 4) and then double them all again and again. Consider the likely skewness and kurtosis of the graph as the values get larger.

# Hypergeometric distribution over 0 .. r
# Sizes of the two classes N1 and N2 must be given, as well as the sample size r
var('x')
@interact
def _(N1=slider(1,40,1,10,label='$N_1$'),
    N2=slider(1,40,1,10,label='$N_2$'),
    r=slider(1,40,1,10,label='$r$')):
    N = N1 + N2
    R = range(r+1)
    if (r > N1)|(r > N2):
        pretty_print('When r is bigger than N1 or N2, special consideration must be made')
    else:
        f(x) = binomial(N1,x)*binomial(N2,r-x)/binomial(N,r)
        pretty_print(html('Density Function: $f(x) =%s$'%str(latex(f(x)))))
        pretty_print(html('over the space $R = %s$'%str(R)))
        points((k,f(x=k)) for k in R).show()
        for k in R:
            pretty_print(html('$f(%s'%k+') = %s'%latex(f(x=k))+' \\approx %s$'%f(x=k).n(digits=5)))
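
The doubling experiment suggested above can also be carried out numerically. The sketch below is plain Python rather than Sage, and the function name `shape` is a hypothetical helper. It computes the skewness and kurtosis directly from the probabilities instead of from closed-form formulas. Here kurtosis means the fourth central moment divided by $\sigma^4$, so a normal curve would have skewness 0 and kurtosis 3. The sketch assumes r does not exceed $n_1$ or $n_2$.

```python
from math import comb


def shape(n1, n2, r):
    """Return (skewness, kurtosis) computed directly from the probabilities.

    Assumes r <= n1 and r <= n2. Kurtosis is not the excess kurtosis.
    """
    n = n1 + n2
    xs = range(r + 1)
    f = [comb(n1, x) * comb(n2, r - x) / comb(n, r) for x in xs]
    mu = sum(x * p for x, p in zip(xs, f))
    var = sum((x - mu) ** 2 * p for x, p in zip(xs, f))
    skew = sum((x - mu) ** 3 * p for x, p in zip(xs, f)) / var ** 1.5
    kurt = sum((x - mu) ** 4 * p for x, p in zip(xs, f)) / var ** 2
    return skew, kurt


# Start with n1 = 5, n2 = 8, r = 4 and double all three several times.
n1, n2, r = 5, 8, 4
for _ in range(5):
    skew, kurt = shape(n1, n2, r)
    print(f"n1={n1:3d} n2={n2:3d} r={r:3d} skewness={skew:8.4f} kurtosis={kurt:8.4f}")
    n1, n2, r = 2 * n1, 2 * n2, 2 * r
```

Comparing the printed values against 0 and 3 gives a numerical companion to what the graph in the interactive cell shows.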