Section 1.4 Statistical Measures of the Middle
Definition 1.4.1. Arithmetic Mean.
Suppose \(X\) is a discrete random variable with range \(R = \{x_1, x_2, \ldots, x_n\}\text{.}\) The arithmetic mean is given by
\[\frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{k=1}^{n} x_k\text{.}\]
If the data come from a sample then we call this value a sample mean and denote it by \(\overline{x}\text{.}\) If the data come from the entire universe of possibilities then we call it a population mean and denote it by \(\mu\text{.}\) When presented with raw data, it is generally reasonable to presume that the data come from a sample and to use \(\overline{x}\text{.}\)
To illustrate, consider the previous data set: {2, 5, 8, 10}. The arithmetic mean is given by
\[\frac{2 + 5 + 8 + 10}{4} = \frac{25}{4} = 6.25\text{.}\]
The mean of Definition 1.4.1 is often called the centroid in the sense that if the \(x\) values were locations of objects of equal weight, then the centroid would be the point where this system of \(n\) equal masses would balance. Play around with the interactive cell below by entering your own data values into the first list.
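As a minimal sketch in R (the variable names are illustrative and not taken from the original interactive cell), the mean of the data set {2, 5, 8, 10} can be computed straight from the definition. The last line checks the balancing property: the deviations from the mean cancel out.

```r
# Data set from the example above
x <- c(2, 5, 8, 10)

# Arithmetic mean from the definition: sum of the values divided by n
n <- length(x)
xbar <- sum(x) / n
print(xbar)            # 6.25

# Built-in equivalent
print(mean(x))

# Balance-point check: deviations on either side of the mean sum to zero
print(sum(x - xbar))
```

Replacing the values in `x` with your own data plays the same role as editing the first list in the interactive cell.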
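The values can be given varying weights if desired. The result is called the weighted arithmetic mean and, with weight \(m_k\) attached to the value \(x_k\text{,}\) is given by
\[\frac{m_1 x_1 + m_2 x_2 + \cdots + m_n x_n}{m_1 + m_2 + \cdots + m_n} = \frac{\sum_{k=1}^{n} m_k x_k}{\sum_{k=1}^{n} m_k}\text{.}\]
This is often how your teacher will actually compute your final grade in a class, where the \(m_k\) are the relative weights for each assignment grade.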
Example 1.4.2. Computing class final grade.
Suppose in a given class you have a daily grade of 92, an exam 1 grade of 85, an exam 2 grade of 87, and a final exam grade of 93. If the daily grade counts 10 percent, the first two exams count 25 percent each, and the final counts 40 percent, then your final grade would be
\[\frac{0.10 \cdot 92 + 0.25 \cdot 85 + 0.25 \cdot 87 + 0.40 \cdot 93}{0.10 + 0.25 + 0.25 + 0.40} = \frac{89.4}{1} = 89.4\text{.}\]
It would then appear that you might want to do some bargaining with your teacher about how nice it would be to round that up.
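The final grade computation can be sketched in R as follows. The vector names `grades` and `weights` are illustrative; `weighted.mean` is the built-in function from R's stats package.

```r
# Grades and their relative weights from the example
grades  <- c(daily = 92, exam1 = 85, exam2 = 87, final = 93)
weights <- c(0.10, 0.25, 0.25, 0.40)

# Weighted arithmetic mean: sum of weight * value, divided by the sum of the weights
final_grade <- sum(weights * grades) / sum(weights)
print(final_grade)

# Built-in equivalent
print(weighted.mean(grades, weights))
```

Because the result is divided by the sum of the weights, the weights do not need to add up to 1. Entering them as 10, 25, 25 and 40 gives the same grade.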
Definition 1.4.3. Mode.
The mode for a given data set is the data value that repeats the greatest number of times. If there are two or more such data values, then each is a mode. If all of the data values are unique, then there is no mode.
Example 1.4.4. Computing mode.
Consider a data set in which the number 2 is included 3 times, as is the number 5, and no other value appears more often. Then both are modes and we might say that this data set is "bimodal".
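R has no built-in function for the statistical mode, so the definition can be sketched with `table`, which counts how often each value occurs. The function name `modes` and the data below are hypothetical, chosen only so that 2 and 5 each appear three times. The sketch assumes a non-empty numeric vector.

```r
# Return every value that attains the highest count.
# If no value repeats, return an empty vector to signal "no mode".
modes <- function(x) {
  counts <- table(x)
  if (max(counts) == 1) {
    return(x[0])
  }
  as.numeric(names(counts)[counts == max(counts)])
}

# Hypothetical data: 2 and 5 each appear three times
x <- c(2, 2, 2, 3, 5, 5, 5, 7, 8)
print(modes(x))             # 2 5

# All values unique, so there is no mode
print(modes(c(1, 4, 9)))    # numeric(0)
```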
Definition 1.4.5. Median.
A positional measure of the middle is often obtained by finding the location of the 50th percentile. This value is also called the median and indicates the value at which approximately half of the sorted data lies below and half lies above.
For data sets with an odd number of values, this is the "middle" data value left if one successively crosses off pairs from the two ends of the sorted data. For data sets with an even number of values, this is the average of the two data values left after crossing off all other pairs. Using the order statistics \(y_1 \le y_2 \le \cdots \le y_n\text{,}\) the median equals
\[y_{(n+1)/2}\]
if \(n\) is odd and
\[\frac{y_{n/2} + y_{n/2+1}}{2}\]
if \(n\) is even.
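The two cases of the definition can be sketched in R by sorting the data to obtain the order statistics. The function name `median_from_order_stats` and the five-value data set are illustrative assumptions. R's built-in `median` is printed alongside for comparison.

```r
# Median computed from the order statistics y_1 <= y_2 <= ... <= y_n
median_from_order_stats <- function(x) {
  y <- sort(x)
  n <- length(y)
  if (n %% 2 == 1) {
    y[(n + 1) / 2]                   # odd n: the single middle value
  } else {
    (y[n / 2] + y[n / 2 + 1]) / 2    # even n: average of the two middle values
  }
}

odd_data  <- c(10, 2, 8, 5, 7)   # hypothetical; sorted: 2 5 7 8 10
even_data <- c(2, 5, 8, 10)      # data set used earlier in this section

print(median_from_order_stats(odd_data))    # 7
print(median_from_order_stats(even_data))   # 6.5
print(median(even_data))                    # built-in, for comparison
```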
Checkpoint 1.4.6. WeBWorK - Computing Mean, Median and Mode.
From the Presidential data 1.3.2, note that you are considering an even number of data values and so the median is given by (54+64)/2 = 59.
Definition 1.4.7. Midrange.
The midrange is a mixture of the mean and median where one takes the simple average of the maximum and minimum values in the data set. Using the order statistics, this equals
\[\frac{y_1 + y_n}{2}\text{.}\]
From the Presidential data 1.3.2, the maximum is 70 and the minimum is 46 so the midrange is 58, the average of these two.
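A short R sketch of the midrange follows. The helper name `midrange` is an assumption, since base R does not provide one. Only the two extremes enter the computation, so passing just the minimum 46 and maximum 70 quoted above is enough to reproduce the value 58.

```r
# Midrange: simple average of the smallest and largest values
midrange <- function(x) {
  (min(x) + max(x)) / 2
}

# Only the extremes of the Presidential data are needed
print(midrange(c(46, 70)))        # 58

# Data set used earlier in this section
print(midrange(c(2, 5, 8, 10)))   # 6

# range() returns c(min, max), so its mean is an equivalent formulation
print(mean(range(c(2, 5, 8, 10))))
```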
There are several advantages and disadvantages associated with each of these measures. The mean utilizes all of the data values, so each term is important, but it does so even if some of the data values suffer from collection errors. The median ignores outliers (which might be a result of collection errors) but does not account for the relative differences between terms. The midrange is very easy to compute but ignores the relative differences for all terms but the two extremes. A similar collection of features and drawbacks is associated with all descriptive statistics.
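To see these trade-offs numerically, the following R sketch compares the three measures on the data set {2, 5, 8, 10} and on a copy with one hypothetical collection error, where the value 10 is recorded as 100. The helper names `midrange` and `middle_measures` are illustrative.

```r
midrange <- function(x) (min(x) + max(x)) / 2

# Collect the three measures of the middle in one named vector
middle_measures <- function(x) {
  c(mean = mean(x), median = median(x), midrange = midrange(x))
}

clean    <- c(2, 5, 8, 10)
with_err <- c(2, 5, 8, 100)   # hypothetical recording error: 10 entered as 100

print(middle_measures(clean))      # mean 6.25, median 6.5, midrange 6
print(middle_measures(with_err))   # mean 28.75, median 6.5, midrange 51
```

The median is unchanged by the error, the mean shifts noticeably, and the midrange, which depends only on the extremes, shifts the most.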
You can again compute many statistics automatically using R...
Now, replace the explicit data in the above interactive cell with one of the built-in data sets. For example, comment out the first line and replace it with something like data = faithful[,2].
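As a sketch of what such a cell might contain (the original cell is not reproduced here, so the exact statistics it reported are an assumption), the following R code loads the second column of the built-in `faithful` data set, which holds waiting times between eruptions, and computes the measures from this section.

```r
# faithful ships with R; column 2 holds the waiting times between eruptions
data <- faithful[, 2]

print(length(data))                   # number of observations
print(mean(data))                     # arithmetic mean
print(median(data))                   # median
print((min(data) + max(data)) / 2)    # midrange
```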
Example 1.4.8. USA State Population Measures of the Middle.
The US Census Bureau reported the following state populations (in millions) for 2013, listed in the table below. In this collection of "states" data the District of Columbia is included, so the number of data items is \(n = 51\text{.}\) Notice that these are already in order, so you can presume \(y_1 = 0.6\) million is the minimum and \(y_{51} = 38.3\) million is the maximum. Therefore, the midrange is given by
\[\frac{y_1 + y_{51}}{2} = \frac{0.6 + 38.3}{2} = 19.45\]
million residents. The mean of this data takes a bit of arithmetic: the 51 tabulated values sum to 316.1, which gives \(316.1/51 \approx 6.2\) million residents. Since the number of data items is odd, the median is found by looking at the 26th order statistic. In this case, that is the 4.4 million residents of Kentucky, i.e. \(y_{26} = 4.4\text{.}\)
State                   Population (millions)
Wyoming                 0.6
Vermont                 0.6
District of Columbia    0.6
North Dakota            0.7
Alaska                  0.7
South Dakota            0.8
Delaware                0.9
Montana                 1
Rhode Island            1.1
New Hampshire           1.3
Maine                   1.3
Hawaii                  1.4
Idaho                   1.6
West Virginia           1.9
Nebraska                1.9
New Mexico              2.1
Nevada                  2.8
Kansas                  2.9
Utah                    2.9
Arkansas                3
Mississippi             3
Iowa                    3.1
Connecticut             3.6
Oklahoma                3.9
Oregon                  3.9
Kentucky                4.4
Louisiana               4.6
South Carolina          4.8
Alabama                 4.8
Colorado                5.3
Minnesota               5.4
Wisconsin               5.7
Maryland                5.9
Missouri                6
Tennessee               6.5
Indiana                 6.6
Arizona                 6.6
Massachusetts           6.7
Washington              7
Virginia                8.3
New Jersey              8.9
North Carolina          9.8
Michigan                9.9
Georgia                 10
Ohio                    11.6
Pennsylvania            12.8
Illinois                12.9
Florida                 19.6
New York                19.7
Texas                   26.4
California              38.3
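The three measures for this example can be checked with a short R sketch. The vector `pop` simply transcribes the tabulated populations in the order listed, and its name is illustrative.

```r
# State populations (in millions) for 2013, in the order tabulated above
pop <- c(0.6, 0.6, 0.6, 0.7, 0.7, 0.8, 0.9, 1, 1.1, 1.3,
         1.3, 1.4, 1.6, 1.9, 1.9, 2.1, 2.8, 2.9, 2.9, 3,
         3, 3.1, 3.6, 3.9, 3.9, 4.4, 4.6, 4.8, 4.8, 5.3,
         5.4, 5.7, 5.9, 6, 6.5, 6.6, 6.6, 6.7, 7, 8.3,
         8.9, 9.8, 9.9, 10, 11.6, 12.8, 12.9, 19.6, 19.7, 26.4,
         38.3)

y <- sort(pop)     # order statistics y_1, ..., y_n
n <- length(y)
print(n)                     # 51 (50 states plus the District of Columbia)

print((y[1] + y[n]) / 2)     # midrange, in millions
print(mean(pop))             # mean, in millions
print(y[(n + 1) / 2])        # median: the 26th order statistic (n is odd)
```

The mean lies well above the median because a few very large states pull it upward, while the midrange is dominated entirely by California.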