
Section 1.3 Statistical Measures of Position

Sorting a collection of data provides several useful descriptors. For larger data sets you can easily sort with a tool such as a spreadsheet, but in this section you will also see that there are ways to perform a sort by hand. In either case, statistical measures of position generally involve very little computational work once the data is sorted, and they take into account only the order of the data from lowest to highest. To assist with notation, we will generally use x-values to represent the original raw data and y-values to represent that same data when ordered, with the subscript indicating the positional placement.

Definition 1.3.1 Order Statistic

From the data set $x_1, x_2, ... , x_n\text{,}$ assume that when sorted it is denoted $y_1, y_2, ..., y_n$ where

$\begin{equation*} y_1 \le y_2 \le ... \le y_n. \end{equation*}$

Then, $y_k$ is known as the kth order statistic.

Example 1.3.2 Age of Presidents - order statistics

The age at inauguration for presidents from 1981-2019 gives the data $x_1 = 69, x_2 = 64, x_3 = 46, x_4 = 54, x_5 = 47, x_6 = 70$ (Reagan, Bush, Clinton, Bush, Obama, Trump). For this data, the order statistics are denoted $y_1 = 46, y_2 = 47, y_3 = 54, y_4 = 64, y_5 = 69, y_6 = 70\text{.}$
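As a minimal sketch of Definition 1.3.1, the following Python snippet sorts the inauguration ages and reads off individual order statistics. The names `ages`, `y`, and `order_statistic` are chosen here for illustration and do not appear in the text.

```python
# Ages at inauguration in their original order: the x-values.
ages = [69, 64, 46, 54, 47, 70]

# The sorted copy holds the order statistics: the y-values.
y = sorted(ages)

def order_statistic(sorted_data, k):
    """Return y_k, the kth order statistic, with k counted from 1 as in the text."""
    if not 1 <= k <= len(sorted_data):
        raise ValueError("k must satisfy 1 <= k <= n")
    return sorted_data[k - 1]  # Python lists are indexed from 0

print(y)                      # [46, 47, 54, 64, 69, 70]
print(order_statistic(y, 3))  # 54
```

The only subtlety is the shift between the text's 1-based subscripts and Python's 0-based list indices.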

Once the data is sorted, it should be very easy for you to locate the smallest and largest values.

Definition 1.3.3 Minimum/Maximum:

For a given data set, the smallest and largest values are known as the minimum and maximum, respectively. In our notation, minimum = $y_1$ and maximum = $y_n\text{.}$

Example 1.3.4 Age of Presidents - Minimum/Maximum

Using the President inauguration data from Example 1.3.2, minimum = $y_1 = 46$ and maximum = $y_6 = 70\text{.}$

A value that separates ordered data into two groups with a desired percentage on each side is called a percentile. Several methods have been devised to achieve this goal. In this text we present two and will consistently use the first one presented below. In each case, a given percentile is a numerical value below which approximately the given percentage of the data lies.

The definition presented below provides a unique measure for each value of s and corresponds to the PERCENTILE.EXC function in Excel. This version starts by computing $(n+1)s$, where $0 \lt s \lt 1$, and uses this to linearly interpolate between two adjacent entries in the sorted list. Another option, which corresponds to PERCENTILE.INC (and PERCENTILE) in Excel, is to start with $(n-1)s+1$ to determine the two adjacent entries and then proceed with linear interpolation. Again, the definition below uses the first approach.
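To see that the two approaches can give different answers, here is a small sketch using Python's standard `statistics.quantiles` function on the inauguration ages. It assumes, in line with the library's documented behavior, that `method="exclusive"` follows the $(n+1)$ rule and `method="inclusive"` follows the $(n-1)$ rule; the variable names are illustrative only.

```python
import statistics

ages = [69, 64, 46, 54, 47, 70]  # inauguration ages from Example 1.3.2

# n=4 requests the three cut points at the 25th, 50th and 75th percentiles.
exclusive = statistics.quantiles(ages, n=4, method="exclusive")  # (n+1) rule
inclusive = statistics.quantiles(ages, n=4, method="inclusive")  # (n-1) rule

print(exclusive)  # [46.75, 59.0, 69.25]
print(inclusive)  # [48.75, 59.0, 67.75]
```

The two methods agree on the middle cut point for this data but disagree on the outer two, which is why it matters to know which convention a given tool uses.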

Definition 1.3.5 Percentiles

For $0 \lt s \lt 1$ and for order statistics $y_1, y_2, ..., y_n$ define the 100s-th percentile to be

$\begin{equation*} P^{s} = (1-r)y_m + ry_{m+1} \end{equation*}$

where m is the integer part of (n+1)s, namely

$\begin{equation*} m = \left\lfloor (n+1)s \right\rfloor \end{equation*}$

and

$\begin{equation*} r = (n+1)s - m, \end{equation*}$

the fractional part of (n+1)s.

Example 1.3.6 Presidential Percentile

To compute, say, the 42nd percentile for the President inauguration data presented earlier in Example 1.3.2, consider s = 0.42. Since there are 6 numbers in our data set,

\begin{equation*} (n+1)s = 7 \cdot 0.42 = 2.94 \end{equation*}

and so m = 2 and r = 0.94. Thus, the percentile will lie between $y_2 = 47$ and $y_3 = 54$ and much closer to 54 than 47. Numerically

\begin{equation*} P^{0.42} = 0.06 \cdot 47 + 0.94 \cdot 54 = 53.58. \end{equation*}
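The computation above can be sketched directly from Definition 1.3.5. The function name `percentile` and its handling of the edge cases (raising an error when $(n+1)s$ falls outside the range 1 to n) are assumptions made for this sketch, since the definition does not say what to do there.

```python
import math

def percentile(data, s):
    """100s-th percentile using the (n+1)s rule of Definition 1.3.5."""
    if not 0 < s < 1:
        raise ValueError("s must satisfy 0 < s < 1")
    y = sorted(data)              # order statistics y_1 <= ... <= y_n
    n = len(y)
    position = (n + 1) * s
    if not 1 <= position <= n:
        # Assumption: outside this range y_m or y_{m+1} does not exist,
        # so this sketch refuses to guess a value.
        raise ValueError("(n+1)s must lie between 1 and n")
    m = math.floor(position)      # integer part of (n+1)s
    r = position - m              # fractional part of (n+1)s
    if m == n:                    # position is exactly n, so r is 0
        return float(y[-1])
    return (1 - r) * y[m - 1] + r * y[m]  # y[m - 1] is y_m in 0-based indexing

ages = [69, 64, 46, 54, 47, 70]
print(round(percentile(ages, 0.42), 2))  # 53.58
```

Because the interpolation is continuous in s, small floating-point errors in computing $(n+1)s$ only perturb the result slightly rather than changing which value is returned.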

The formula for percentiles determines a weighted average between $y_m$ and $y_{m+1}$ which is unique for distinct values of s provided the data values are all distinct. Note that if some of the y-values are equal then some of these averages might be averages of equal numbers and will therefore equal the common value.

Some percentiles are given special names.

Definition 1.3.7 Quartiles

Given a sorted data set, the first, second, and third quartiles are the values of $Q_1 = P^{0.25}, Q_2 = P^{0.5}$ and $Q_3 = P^{0.75}\text{.}$

It should be noted that many graphing calculators compute quartiles using a straight average of two adjacent entries rather than the formula above. This causes some difficulty, especially when n mod 4 = 2.

Example 1.3.8 $Q_1$ and $Q_3$ when n mod 4 = 2

Suppose n = 22 = 5(4) + 2. Computing the first quartile as defined above gives (n+1)s = 23(0.25) = 5.75 = 5 + 0.75 = m + r. Therefore,

\begin{equation*} Q_1 = 0.25 \times y_5 + 0.75 \times y_6 \end{equation*}

which is a value closer to $y_6\text{.}$ Many graphing calculators however quickly approximate this with

\begin{equation*} 0.5 \times y_5 + 0.5 \times y_6 \end{equation*}

so you should be aware of this possible difference. You should also notice that in this case s = 0.25 but r = 0.75, so these values are not required to be the same.
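To see how the interpolation weight for the first quartile depends on n mod 4, the following sketch computes m and r from the percentile definition for four consecutive sample sizes. It uses exact fractions to avoid rounding; the helper name `position_parts` is hypothetical.

```python
from fractions import Fraction
import math

def position_parts(n, s):
    """Split (n+1)s into its integer part m and fractional part r."""
    position = (n + 1) * s
    m = math.floor(position)
    r = position - m
    return m, r

quarter = Fraction(1, 4)  # s = 0.25 for the first quartile
for n in (20, 21, 22, 23):
    m, r = position_parts(n, quarter)
    print(f"n = {n}, n mod 4 = {n % 4}: m = {m}, r = {r}")
# For n = 22 this reports m = 5 and r = 3/4, matching the example above.
```

Only when n mod 4 = 1 does r equal 1/2, so that is the only case in which a straight average of $y_m$ and $y_{m+1}$ agrees with the definition for the first quartile.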

Definition 1.3.9 Deciles:

Given a sorted data set, the first, second, ..., ninth deciles are the values $D_1 = P^{0.1}, D_2 = P^{0.2}, ... , D_9 = P^{0.9}\text{.}$

Example 1.3.10 Small Example - Quartiles

Consider the following data set: {2,5,8,10}. The 50th percentile should be a numerical value for which approximately 50% of the data is smaller. In this case, that would be some number between 5 and 8. For now, let's just take 6.5 so that two numbers in the set lie below 6.5 and two lie above. This is a perfect 50% for the 50th percentile. In a similar manner, the 25th percentile would be some number between 2 and 5, say 2.75, so that one number lies below 2.75 and three numbers lie above.

Using Definition 1.3.5, the 25th percentile is computed by considering

\begin{equation*} (n+1)s = (4+1)0.25 = 5/4 = 1.25\text{.} \end{equation*}

So, m = 1 and r = 0.25. Therefore

\begin{equation*} P^{0.25} = 0.75 \times 2 + 0.25 \times 5 = 2.75 \end{equation*}

as noted above.

Similarly, the 75th percentile is given by

\begin{equation*} (n+1)s = (4+1)0.75 = 15/4 = 3.75\text{.} \end{equation*}

So, m = 3 and r = 0.75. Therefore

\begin{equation*} P^{0.75} = 0.25 \times 8 + 0.75 \times 10 = 9.5 \end{equation*}

It is interesting to note that 3 also lies between 2 and 5, as does 2.75, and has the same percentages of the data above (75 percent) and below (25 percent). However, since 3 is larger than 2.75, it should correspond to a slightly larger percentile. Indeed, going backward:

\begin{gather*} 3 = (1-r) \times 2 + r \times 5\\ \Rightarrow r = \frac{1}{3}\\ \Rightarrow (n+1)s = 1 + \frac{1}{3} = \frac{4}{3}\\ \Rightarrow s = \frac{4}{15} \approx 0.267 \end{gather*}

and so 3 would actually be at approximately the 26.7th percentile.
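The backward calculation just shown can be sketched in code as a check on the arithmetic. The function name `percentile_rank` is hypothetical, and the treatment of repeated data values (using the last interval that contains the value) is an assumption of this sketch rather than something the text specifies.

```python
def percentile_rank(data, value):
    """Return s such that value is the 100s-th percentile under the (n+1)s rule."""
    y = sorted(data)
    n = len(y)
    if not y[0] <= value <= y[-1]:
        raise ValueError("value must lie between the minimum and the maximum")
    # Look for m (counted from 1) with y_m <= value < y_{m+1}.
    for m in range(1, n):
        low, high = y[m - 1], y[m]
        if low <= value < high:
            r = (value - low) / (high - low)  # solve value = (1-r)*low + r*high
            return (m + r) / (n + 1)          # solve (n+1)s = m + r for s
    return n / (n + 1)                        # value equals the maximum y_n

print(round(percentile_rank([2, 5, 8, 10], 3), 3))  # 0.267
print(percentile_rank([2, 5, 8, 10], 2.75))         # 0.25
```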

Checkpoint 1.3.11

In general, given a numerical value within the range of a given data set, one can determine the percentile ranking of that value by reversing the general formula for percentile and solving for s, given the value $P^s\text{.}$ Determine such a formula/process for doing this in general.

For the data set {2,5,8,10}, $Q_1 = 2.75, Q_2 = 6.5,$ and $Q_3 = 9.5\text{.}$

For a given data set, a summary of these statistics is often desired in order to give the user a quick overview of the more important order statistics.

Definition 1.3.12 5-number summary

Given a set of data, the 5-number summary is a vector of the order statistics given by

$\begin{equation*} \lt \text{minimum}, Q_1, Q_2, Q_3, \text{maximum} \gt . \end{equation*}$

You can also compute these statistics automatically using the open-source statistical software known simply as "R". The following interactive cell uses the open-source software "Sage" to perform this calculation through the freely available web portal at sagemath.sagecell.org. You can change the data list to compute values for a different collection of numbers. The five-number summary is displayed graphically using a "box plot"; graphical representations of data will be discussed later in this chapter. You should compare the answers found using R with the values produced by Definition 1.3.5. Note that R's quantile function defaults to the second approach described earlier (the one matching PERCENTILE.INC, which R calls type 7), so its results can differ from those of our definition; passing type = 6 to quantile selects the approach used in Definition 1.3.5.

data <- c( 1, 2, 5, 7, 7, -1, 3, 2)   # combine the values into a numeric vector
paste("Quartiles:")
quantile(data)
paste("Specific Percentiles:")
quantile(data, c(.32, .57, .98))   # find the 32nd, 57th and 98th percentiles
paste("Box and Whisker Diagram:")
boxplot(data, horizontal=TRUE)

Example 1.3.13 Small example - 5 number summary

Returning to our previous example, the five number summary would be

\begin{equation*} \lt 2, 2.75, 6.5, 9.5, 10 \gt . \end{equation*}
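As a minimal sketch, the same summary can be assembled in Python. It assumes that `statistics.quantiles` with `method="exclusive"` applies the $(n+1)$ rule of Definition 1.3.5, which matches its documented behavior; the function name `five_number_summary` is hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) using the (n+1) percentile rule."""
    # n=4 yields the three quartile cut points.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
    return (min(data), q1, q2, q3, max(data))

print(five_number_summary([2, 5, 8, 10]))  # (2, 2.75, 6.5, 9.5, 10)
```

Applying the same function to the data list in the R cell gives a way to compare the two conventions on a slightly larger data set.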