These measures provide some indication of how much the data set is "spread out". Indeed, note that the data sets {-2,-1,0,1,2} and {-200,-100,0,100,200} have the same mean but one is much more spread out than the other. Measures of variation should catch this difference.
Let's now consider a measure of variation that is the counterpart to the mean in that it involves a computation that utilizes the actual values of all the data.
Average Deviation from the Mean (Population): Given a population data set \(x_1, x_2, \ldots, x_n\) with mean \(\mu\), each term deviates from the mean by the value \(x_k - \mu\). So, averaging these gives
\[
\frac{1}{n}\sum_{k=1}^{n}(x_k-\mu).
\]
This average is always zero, no matter what data is provided, since
\[
\frac{1}{n}\sum_{k=1}^{n}(x_k-\mu) = \frac{1}{n}\sum_{k=1}^{n}x_k - \mu = \mu - \mu = 0.
\]
The cancellation makes this quantity useless as a measure of spread. You should have expected this to be true, since the mean is precisely the place where the data is "balanced". So, we need a way to avoid the cancellation. One natural option is to average the absolute deviations,
\[
\frac{1}{n}\sum_{k=1}^{n}\lvert x_k-\mu\rvert,
\]
which, although nicely stated, is difficult to work with because absolute values do not simplify well algebraically. Indeed, when \(n=3\), the mean lies somewhere between \(y_1\) and \(y_3\) (using the ordered data) but could be on either side of \(y_2\), and so
\[
\lvert y_1-\mu\rvert + \lvert y_2-\mu\rvert + \lvert y_3-\mu\rvert = (\mu - y_1) + (y_3 - \mu) \;?\; (y_2 - \mu),
\]
where the ? is either a + or a −, and in general there is no way to know which. To avoid this algebraic roadblock, we look for another way to accomplish nearly the same goal: squaring the deviations and then taking a square root.
In the average squared deviation from the mean, each difference is squared before averaging. All of the squared differences are therefore non-negative, but deviations smaller than 1 become even smaller while larger deviations become relatively larger (for example, a deviation of 0.5 squares to 0.25 while a deviation of 3 squares to 9). To undo this change of scale, we take a square root to bring the measure back into the right ballpark, in the same units as the data.
The variance is the average squared deviation from the mean. If this data comes from the entire universe of possibilities, then we call it a population variance and denote this value by \(\sigma^2\). Therefore
\[
\sigma^2 = \frac{\sum_{k=1}^{n}(x_k-\mu)^2}{n}.
\]
The standard deviation is the square root of the variance. If this data comes from the entire universe of possibilities, then we call it a population standard deviation and denote this value by \(\sigma\). Therefore
\[
\sigma = \sqrt{\frac{\sum_{k=1}^{n}(x_k-\mu)^2}{n}}.
\]
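For instance, applying this formula to the two data sets from the start of this section (both of which have mean \(\mu = 0\)) confirms that the standard deviation captures the difference in spread:
\[
\sigma = \sqrt{\frac{(-2)^2+(-1)^2+0^2+1^2+2^2}{5}} = \sqrt{2} \approx 1.41
\qquad\text{versus}\qquad
\sigma = \sqrt{\frac{(-200)^2+(-100)^2+0^2+100^2+200^2}{5}} = \sqrt{20000} \approx 141.42.
\]
Both data sets balance at 0, yet their standard deviations differ by a factor of 100, exactly reflecting the scaling of the data.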
If the data comes only from a sample of the population, then the corresponding computation uses the sample mean \(\bar{x}\); denote this value by
\[
v = \frac{\sum_{k=1}^{n}(x_k-\bar{x})^2}{n}.
\]
Sample data tends to reflect certain "biases". For example, a small data set is unlikely to contain a member of the data set that is far away from the major portion of the data. However, data values that are far from the mean provide a much greater contribution to the calculation of v than do values that are close to the mean. Technically, bias is defined mathematically using something called "expected value" and would be discussed in a course that might follow this one.
To account for this, we increase the value computed for \(v\) slightly, by the factor \(\frac{n}{n-1}\), to give the sample variance via
\[
s^2 = \frac{n}{n-1}\,v = \frac{n}{n-1}\cdot\frac{\sum_{k=1}^{n}(x_k-\bar{x})^2}{n} = \frac{\sum_{k=1}^{n}(x_k-\bar{x})^2}{n-1}.
\]
The sample standard deviation \(s\) is then, similarly, the square root of the sample variance.
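As a quick sanity check of the \(n-1\) denominator, here is a short R sketch using an arbitrary small data set chosen only for illustration; note that base R's var and sd functions compute the sample versions.

x <- c(-2, -1, 0, 1, 2)         # arbitrary small data set, for illustration only
n <- length(x)
v <- sum((x - mean(x))^2) / n   # divide by n: gives 2
var(x)                          # base R divides by n - 1 instead: gives 2.5
(n / (n - 1)) * v               # rescaling v by n/(n-1) matches var(x)
sd(x)                           # square root of var(x), about 1.58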
From the data \(\{2,5,8,10\}\), you have found that the mean is 6.25. Computing the variance then involves accumulating and averaging the squared differences between each data value and this mean.
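Carrying out that computation with the population formulas above gives
\[
\sigma^2 = \frac{(2-6.25)^2+(5-6.25)^2+(8-6.25)^2+(10-6.25)^2}{4}
= \frac{18.0625+1.5625+3.0625+14.0625}{4} = \frac{36.75}{4} = 9.1875,
\]
so \(\sigma = \sqrt{9.1875} \approx 3.03\). If instead these four values are treated as a sample, the same squared deviations give \(s^2 = 36.75/3 = 12.25\) and \(s = 3.5\).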
Once again, we can compute these descriptive statistics using R. We again display a box-and-whisker diagram, and you should now notice that the "box" corresponds to the IQR.
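A minimal sketch of the corresponding R commands, assuming the population data has already been loaded into a vector called populations (a placeholder name, not one defined in the text):

summary(populations)    # five-number summary plus the mean
mean(populations)       # the mean
var(populations)        # sample variance (divides by n - 1)
sd(populations)         # sample standard deviation
IQR(populations)        # interquartile range, using R's default method
boxplot(populations)    # box-and-whisker diagram; the box spans the IQR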
For the IQR, we first must determine the quartiles. The median (found earlier) is already the second quartile, so we have \(Q_2 = 4.5\) million. For the other two, the formula for computing percentiles gives the 25th percentile \(Q_1 = 2.05\) million and the 75th percentile \(Q_3 = 7.325\) million.
Hence, the IQR = 7.325 - 2.05 = 5.275 million residents.
From the computation before, again note that \(n=51\) since the District of Columbia is included. The mean of this data was found earlier to be approximately 6.20 million residents. So, to determine the variance, you may find it easier to compute using the alternate variance formulas of Theorem 1.5.4.
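For reference, the usual computational (shortcut) forms of the variance, which is presumably what Theorem 1.5.4 provides, are
\[
\sigma^2 = \frac{\sum_{k=1}^{n}x_k^2}{n} - \mu^2
\qquad\text{and}\qquad
s^2 = \frac{\sum_{k=1}^{n}x_k^2 - n\,\bar{x}^2}{n-1},
\]
since these avoid subtracting the mean from each of the 51 data values before squaring.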
The interquartile range is sometimes also called the interquantile range. Note that, by default, R computes the IQR using a different quantile convention than the one in this text. You can look up the R documentation for the quantile function to see exactly what it does, but you can also force R to use the approach from this text. Below, for example, the interactive cell determines the IQR using R's preferred method. However, to use the method from this text, you can make the function call IQR(data, type=2). This might be helpful to know when working on some WeBWorK exercises where the author could be using R to compute the correct answer.
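A brief sketch of that comparison, assuming a placeholder vector named data (substitute whatever values the exercise provides):

data <- c(2, 5, 8, 10)   # placeholder values; replace with the actual data of interest
IQR(data)                # R's default method (quantile type 7)
IQR(data, type = 2)      # the averaging convention used in this text
quantile(data, probs = c(0.25, 0.75), type = 2)   # the quartiles Q1 and Q3 themselves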