Statistical Measures of the Middle

Section 1.4 Statistical Measures of the Middle

Definition 1.4.1. Arithmetic Mean.

Suppose X is a discrete random variable with range $R = \{x_1, x_2, ..., x_n\}\text{.}$ The arithmetic mean is given by

$\begin{equation*} \frac{x_1 + ... + x_n}{n} = \frac{\sum_{k=1}^n x_k}{n}. \end{equation*}$

If the data come from a sample, we call this value the sample mean and denote it by $\overline{x}\text{.}$ If the data cover the entire universe of possibilities, we call it the population mean and denote it by $\mu\text{.}$ When presented with raw data, it is generally safest to presume that the data come from a sample and to use $\overline{x}\text{.}$

To illustrate, consider the previous data set: {2, 5, 8, 10}. The arithmetic mean is given by

$\begin{equation*} \overline{x} = \frac{2+5+8+10}{4} = \frac{25}{4} = 6.25. \end{equation*}$
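
As a quick check, the same computation can be sketched in plain Python. The variable names `data` and `x_bar` are illustrative choices and are not part of the interactive cells in this section.

```python
# Arithmetic mean of the example data set {2, 5, 8, 10}.
data = [2, 5, 8, 10]

# Add up the values and divide by how many there are.
x_bar = sum(data) / len(data)

print(x_bar)  # 6.25
```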

The mean 1.4.1 is often called the centroid in the sense that if the x values were locations of objects of equal weight, then the centroid would be the point where this system of n equal masses would balance. Play around with the interactive cell below by entering your own data values into the first list.

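@interact
def _(x = input_box(default=[2, 5, 8, 10, 11], width=40)):
    x.sort()
    mu = mean(x)            # balance point of the equal-weight system
    n = len(x)
    pts = [(x[0], 0.05)]    # first point sits just above the axis
    M = 0.5                 # upper limit of the plot window
    for k in range(1, n):
        if x[k] == x[k-1]:
            # a repeated value is stacked above the previous point
            pts.append((x[k], pts[k-1][1] + 0.1))
            M += 0.1
        else:
            pts.append((x[k], 0.05))
    G = points(pts, size=100, ymin=-0.5, ymax=M,
            xmin=min(x)-2, xmax=max(x)+2, ticks=[[], []], figsize=[5,4])
    # brown triangle: a fulcrum placed at the mean
    G += polygon([(mu,0), (mu+0.2,-0.5), (mu-0.2,-0.5)], color='brown')
    G.show(figsize=[5,4])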
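
The balance-point description of the mean can also be checked numerically: the signed distances of the data values from the mean should cancel. The following is a minimal sketch in plain Python (not Sage), using the data set {2, 5, 8, 10} from above; the names `deviations` and `total` are illustrative.

```python
from statistics import mean

data = [2, 5, 8, 10]
x_bar = mean(data)

# Signed distance of each value from the mean.
deviations = [x - x_bar for x in data]

# Values below the mean pull one way and values above pull the other;
# for equal weights these pulls cancel at the mean.
total = sum(deviations)

print(deviations)
print(total)
```

Up to floating-point rounding, `total` is zero for any data set, which is what it means for the system of equal masses to balance at the mean.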

If desired, each value $x_k$ can be given its own weight $m_k\text{.}$ The result is called the weighted arithmetic mean and is given by

$\begin{equation*} \frac{m_1 x_1 + ... + m_n x_n}{m_1 + ... + m_n} = \frac{\sum_{k=1}^n m_k x_k}{\sum_{k=1}^n m_k}. \end{equation*}$

This is often how your teacher will actually compute your final grade in a class, where the $m_k$ are the relative weights for each assignment grade.

x = [2, 5, 8, 10, 10]   #  x values 
w = [1, 2.5, 2.5, 4,2]  # weights for each x
wsum = sum(w)

n = len(x)
pts = [(x[0],0.05)]
M = 0.2
mu = x[0]*w[0]   # include the first term, since the loop below starts at k = 1
for k in range(1,n):
    mu += x[k]*w[k]
    if x[k]==x[k-1]:
        pts.append((x[k],pts[k-1][1]+0.2*w[k]))
        M += 0.2
    else:
        pts.append((x[k],0.05))
mu = mu/wsum
G = Graphics()
for k in range(n):
    G += point(pts[k],size=100*w[k]) 
P = polygon([(mu,0), (mu+0.2,-0.5), 
           (mu-0.2,-0.5)],color='brown')
(G+P).show(xmin=min(x)-0.5, xmax=max(x)+0.5, 
           ymin=-2*M, ymax = 2*M, ticks=[[], []],figsize=[5,3])

Example 1.4.2. Computing class final grade.

Suppose in a given class you have a daily grade of 92, an exam 1 grade of 85, an exam 2 grade of 87, and a final exam grade of 93. If the daily grade counts 10 percent, the first two exams count 25 percent each, and the final counts 40 percent, then your final grade would be

$\begin{equation*} \frac{0.10 \cdot 92 + 0.25 \cdot 85 + 0.25 \cdot 87 + 0.40 \cdot 93}{0.10 + 0.25 + 0.25 + 0.40} = 89.4 . \end{equation*}$

It would then appear that you might want to do some bargaining with your teacher about how nice it would be to round that up.
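
The grade computation in this example can be reproduced with a short Python sketch of the weighted arithmetic mean. The helper name `weighted_mean` is hypothetical, chosen here for illustration.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(m_k * x_k) / sum(m_k)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    numerator = sum(m * x for m, x in zip(weights, values))
    return numerator / sum(weights)

# Daily grade, exam 1, exam 2, final exam, with their relative weights.
grades = [92, 85, 87, 93]
weights = [0.10, 0.25, 0.25, 0.40]

final_grade = weighted_mean(grades, weights)
print(round(final_grade, 1))  # 89.4
```

Dividing by the sum of the weights means the weights do not have to add up to 1; entering them as 10, 25, 25, 40 gives the same result.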

Definition 1.4.3. Mode.

The mode for a given data set is the data value that repeats the greatest number of times. If there are two or more such data values, then each is a mode. If all of the data values are unique, then there is no mode.

Example 1.4.4. Computing mode.

Consider the data set

$\begin{equation*} 2,4,2,5,5,7,3,2,5,9 \end{equation*}$

Notice that the number 2 appears 3 times in this set, as does the number 5. Hence both are modes, and we might say that this data set is "bimodal".
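
Counting repetitions by hand gets tedious for larger data sets. Here is a minimal Python sketch that follows the definition of the mode given above, including the convention that a data set with all unique values has no mode. The function name `modes` is an illustrative choice.

```python
from collections import Counter

def modes(data):
    """Return the list of modes, or an empty list if there is no mode."""
    if not data:
        return []
    counts = Counter(data)          # value -> number of occurrences
    top = max(counts.values())      # the greatest number of repetitions
    if top == 1:
        return []                   # every value is unique: no mode
    return sorted(value for value, c in counts.items() if c == top)

data = [2, 4, 2, 5, 5, 7, 3, 2, 5, 9]
print(modes(data))  # [2, 5]
```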

Definition 1.4.5. Median.

A positional measure of the middle is often utilized by finding the location of the 50th percentile. This value is also called the median and indicates the value at which approximately half the sorted data lies below and half lies above.

For data sets with an odd number of values, this is the "middle" data value if one were to successively cross off pairs from the two ends of the sorted data. For data sets with an even number of values, this is the average of the two data values left after crossing off all other pairs. Using the order statistics $y_1 \le y_2 \le \cdots \le y_n$ (the sorted data values), the median equals

$\begin{equation*} y_{\frac{n+1}{2}} \end{equation*}$

if n is odd and

$\begin{equation*} \frac{y_{\frac{n}{2}} + y_{\frac{n}{2}+1}}{2} \end{equation*}$

if n is even.
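
The two cases translate directly into code. The sketch below is plain Python, with the hypothetical function name `median_from_order_stats`; note that Python lists are indexed from 0, so the order statistic $y_k$ is stored at position `k - 1`.

```python
def median_from_order_stats(data):
    """Median computed from the sorted data (the order statistics)."""
    y = sorted(data)        # y[0] is y_1, ..., y[n-1] is y_n
    n = len(y)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    if n % 2 == 1:
        # n odd: the single middle value y_{(n+1)/2}
        return y[(n + 1) // 2 - 1]
    # n even: average of y_{n/2} and y_{n/2 + 1}
    return (y[n // 2 - 1] + y[n // 2]) / 2

print(median_from_order_stats([2, 5, 8, 10, 11]))  # 8
print(median_from_order_stats([2, 5, 8, 10]))      # 6.5
```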

Checkpoint 1.4.6. WeBWorK - Computing Mean, Median and Mode.

From the Presidential data 1.3.2, note that you are considering an even number of data values and so the median is given by (54+64)/2 = 59.

Definition 1.4.7. Midrange.

The midrange is a mixture of the mean and median where one takes the simple average of the maximum and minimum values in the data set. Using the order statistics, this equals

$\begin{equation*} \frac{y_1+y_n}{2} \end{equation*}$

From the Presidential data 1.3.2, the maximum is 70 and the minimum is 46 so the midrange is 58, the average of these two.
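
Since the midrange only needs the two extremes, a Python sketch of it is very short. The function name `midrange` is an illustrative choice; the second call uses only the minimum and maximum quoted above, because the midrange does not depend on any other data value.

```python
def midrange(data):
    """Average of the smallest and largest values, (y_1 + y_n) / 2."""
    if not data:
        raise ValueError("midrange of an empty data set is undefined")
    return (min(data) + max(data)) / 2

print(midrange([2, 5, 8, 10]))  # 6.0
print(midrange([46, 70]))       # 58.0
```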

There are several advantages and disadvantages associated with each of these measures. The mean utilizes all of the data values, so each term matters, but it does so even when some of the data values suffer from collection errors. The median is largely unaffected by outliers (which might be a result of collection errors) but does not account for the relative differences between terms. The midrange is very easy to compute but ignores the relative differences for all terms but the two extremes, so it is the most sensitive to outliers. A similar collection of features and drawbacks is associated with all descriptive statistics.
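
To see these trade-offs in action, the following Python sketch compares the three measures on the earlier data set {2, 5, 8, 10} and on a copy containing a made-up collection error. The flawed value 100 is hypothetical and is used only to illustrate the effect of an outlier.

```python
from statistics import mean, median

def midrange(data):
    """Average of the smallest and largest values."""
    return (min(data) + max(data)) / 2

clean = [2, 5, 8, 10]
flawed = [2, 5, 8, 100]   # hypothetical recording error: 10 entered as 100

for label, data in (("clean", clean), ("flawed", flawed)):
    print(label, mean(data), median(data), midrange(data))
```

The single bad value moves the mean from 6.25 to 28.75 and the midrange from 6 to 51, while the median stays at 6.5.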

You can again compute many statistics automatically using R...

data <- c(1, 2, 5, 7, 7, -1, 3, 2)   # c() combines the values into a vector
print(paste("Mean =", mean(data)))
print(paste("Median =", median(data)))

Now replace the explicit data in the interactive cell above with one of the built-in data sets. For example, comment out the first line and replace it with something like data = faithful[,2] .

Example 1.4.8. USA State Population Measures of the Middle.

The US Census Bureau reported the following state populations (in millions) for 2013:

Table 1.4.9. USA State Populations - 2013

State	Population (millions)
Wyoming	0.6
Vermont	0.6
District of Columbia	0.6
North Dakota	0.7
Alaska	0.7
South Dakota	0.8
Delaware	0.9
Montana	1
Rhode Island	1.1
New Hampshire	1.3
Maine	1.3
Hawaii	1.4
Idaho	1.6
West Virginia	1.9
Nebraska	1.9
New Mexico	2.1
Nevada	2.8
Kansas	2.9
Utah	2.9
Arkansas	3
Mississippi	3
Iowa	3.1
Connecticut	3.6
Oklahoma	3.9
Oregon	3.9
Kentucky	4.4
Louisiana	4.6
South Carolina	4.8
Alabama	4.8
Colorado	5.3
Minnesota	5.4
Wisconsin	5.7
Maryland	5.9
Missouri	6
Tennessee	6.5
Indiana	6.6
Arizona	6.6
Massachusetts	6.7
Washington	7
Virginia	8.3
New Jersey	8.9
North Carolina	9.8
Michigan	9.9
Georgia	10
Ohio	11.6
Pennsylvania	12.8
Illinois	12.9
Florida	19.6
New York	19.7
Texas	26.4
California	38.3

Determine the minimum, maximum, midrange 1.4.7, mean 1.4.1, and median 1.4.5 for this data.

Solution

Notice that these are already in order, so $y_1 = 0.6$ million is the minimum and $y_{51} = 38.3$ million is the maximum. Therefore, the midrange is given by

$\begin{equation*} \frac{0.6+38.3}{2} = \frac{38.9}{2} = 19.45 \text{ million}. \end{equation*}$

In this collection of "states" data the District of Columbia is included so that the number of data items is n=51. The mean of this data takes a bit of arithmetic but gives

$\begin{equation*} \overline{x} = \frac{\sum_{k=1}^{51} y_k }{51} = \frac{316.1}{51} \approx 6.20 \end{equation*}$

million residents.

Since the number of states is odd, the median is found by looking at the 26th order statistic. In this case, that is the 4.4 million residents of Kentucky, i.e. $y_{26} = 4.4\text{.}$
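
The arithmetic in this solution can be double-checked with a short Python sketch. The list `populations` is typed in from the table above (in millions, already sorted), and the variable names are illustrative.

```python
from statistics import mean, median

# State populations in millions, copied in order from the table above.
populations = [
    0.6, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 1, 1.1, 1.3,
    1.3, 1.4, 1.6, 1.9, 1.9, 2.1, 2.8, 2.9, 2.9, 3,
    3, 3.1, 3.6, 3.9, 3.9, 4.4, 4.6, 4.8, 4.8, 5.3,
    5.4, 5.7, 5.9, 6, 6.5, 6.6, 6.6, 6.7, 7, 8.3,
    8.9, 9.8, 9.9, 10, 11.6, 12.8, 12.9, 19.6, 19.7, 26.4,
    38.3,
]

n = len(populations)                    # 50 states plus DC
smallest = min(populations)
largest = max(populations)
mid_range = (smallest + largest) / 2    # average of the two extremes
x_bar = mean(populations)               # total divided by n
med = median(populations)               # 26th order statistic, since n is odd

print(n, smallest, largest)
print(round(mid_range, 2), round(x_bar, 2), med)
```

Notice how far the midrange (about 19.45) sits above both the mean (about 6.20) and the median (4.4): a few very large states pull the maximum, and with it the midrange, well away from the bulk of the data.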
