Section 1.3 Statistical Measures of Position

Given a collection of data, sorting it provides several useful descriptors. A spreadsheet makes it easy to sort larger data sets, but small data sets can also be sorted by hand. In either case, statistical measures of position involve very little computation once the data is sorted, and they take into account only the order of the data from lowest to highest. For notation, we will generally use x-values to represent the original raw data and y-values to represent the same data in sorted order, with the subscript indicating the position.

Definition 1.3.1. Order Statistic.

From the data set

$\begin{equation*} x_1, x_2, ... , x_n, \end{equation*}$

assume that when sorted it is denoted

$\begin{equation*} y_1, y_2, ..., y_n \end{equation*}$

where

$\begin{equation*} y_1 \le y_2 \le ... \le y_n. \end{equation*}$

Then, $y_k$ is known as the kth order statistic.

Example 1.3.2. Age of Presidents - order statistics.

The age at inauguration for presidents from 1981-2019 gives the data

$\begin{equation*} x_1 = 69, x_2 = 64, x_3 = 46, x_4 = 54, x_5 = 47, x_6 = 70 \end{equation*}$

(Reagan, Bush, Clinton, Bush, Obama, Trump). For this data, the order statistics are denoted

$\begin{equation*} y_1 = 46, y_2 = 47, y_3 = 54, y_4 = 64, y_5 = 69, y_6 = 70. \end{equation*}$
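
As a minimal sketch, the same ordering can be reproduced in R. The variable names x, y, and k simply mirror the notation of this section and are otherwise arbitrary.

```r
# Raw ages at inauguration, in the order listed in the example
x <- c(69, 64, 46, 54, 47, 70)

# sort() returns the values from lowest to highest: the order statistics
y <- sort(x)
print(y)

# y[k] is the kth order statistic; here, the third one
k <- 3
print(y[k])
```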

Once the data is sorted, it should be very easy for you to locate the smallest and largest values.

Definition 1.3.3. Minimum/Maximum.

For a given data set, the smallest and largest values are known as the minimum and maximum, respectively. In our notation, for a data set of size n, the minimum is $y_1$ and the maximum is $y_n\text{.}$

Example 1.3.4. Age of Presidents - Minimum/Maximum.

Using the presidential inauguration data from Example 1.3.2, the minimum is $y_1 = 46$ and the maximum is $y_6 = 70\text{.}$
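
If you are working in R, a quick check along the following lines should confirm that the first and last order statistics agree with the built-in min() and max() functions. This is only a sketch using the same six ages.

```r
x <- c(69, 64, 46, 54, 47, 70)
y <- sort(x)
n <- length(y)

# The first and last order statistics ...
print(c(y[1], y[n]))

# ... agree with R's built-in min() and max() applied to the raw data
print(c(min(x), max(x)))
```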

A value that separates ordered data into two groups with a desired percentage on each side is called a percentile. Several conventions exist for computing percentiles. This text presents two and consistently uses the first one presented below. In both, a given percentile is a numerical value below which approximately the given percentage of the data lies.

The definition presented below yields a unique value for each value of s and corresponds to the PERCENTILE.EXC function in Excel. This version starts by computing $(n+1)s\text{,}$ where $0 < s < 1\text{,}$ and uses the result to linearly interpolate between two adjacent entries in the sorted list. Another option, which corresponds to PERCENTILE.INC (and PERCENTILE) in Excel, starts with $(n-1)s+1$ to pick the two adjacent entries and then proceeds with linear interpolation. Again, the definition below uses the first approach.

Definition 1.3.5. Percentiles.

For $0 \lt s \lt 1$ and for order statistics $y_1, y_2, ..., y_n$ define the 100s-th percentile to be

$\begin{equation*} P^{s} = (1-r)y_m + ry_{m+1} \end{equation*}$

where m is the integer part of $(n+1)s\text{,}$ namely

$\begin{equation*} m = \left\lfloor (n+1)s \right\rfloor \end{equation*}$

and

$\begin{equation*} r = (n+1)s - m, \end{equation*}$

the fractional part of $(n+1)s\text{.}$

In Excel, this is PERCENTILE.EXC.
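
The definition translates almost line by line into R. The following is a minimal sketch, not a definitive implementation: percentile_exc is a hypothetical helper name, and the choice to stop with an error when $(n+1)s$ falls outside the range 1 to n is an assumption made here, since the formula would otherwise need a nonexistent $y_0$ or $y_{n+1}\text{.}$ R's built-in quantile function with type = 6 uses the same $(n+1)s$ rule, so the two printed values should agree.

```r
# percentile_exc is a hypothetical helper name, not a built-in R function.
percentile_exc <- function(x, s) {
  y <- sort(x)              # order statistics y_1 <= ... <= y_n
  n <- length(y)
  h <- (n + 1) * s          # target position in the sorted list
  m <- floor(h)             # integer part
  r <- h - m                # fractional part
  # Assumption: outside 1 <= (n + 1)s <= n the formula needs y_0 or y_{n+1},
  # which do not exist, so this sketch stops rather than guessing.
  if (m < 1 || m > n || (m == n && r > 0)) {
    stop("(n + 1) * s must lie between 1 and n")
  }
  if (r == 0) return(y[m])  # avoids touching y[n + 1] when m equals n
  (1 - r) * y[m] + r * y[m + 1]
}

ages <- c(69, 64, 46, 54, 47, 70)
print(percentile_exc(ages, 0.42))
print(quantile(ages, 0.42, type = 6, names = FALSE))
```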

Definition 1.3.6. Alternate Percentile Definition.

For $0 \lt s \lt 1$ and for order statistics $y_1, y_2, ..., y_n$ define the 100s-th percentile to be

$\begin{equation*} P^{s} = (1-r)y_m + ry_{m+1} \end{equation*}$

where m is the integer part of $(n-1)s + 1\text{,}$ namely

$\begin{equation*} m = \left\lfloor (n-1)s + 1 \right\rfloor \end{equation*}$

and

$\begin{equation*} r = (n-1)s + 1 - m, \end{equation*}$

the fractional part of $(n-1)s+1\text{.}$

In Excel, this is PERCENTILE.INC or just PERCENTILE.
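
A matching sketch for the alternate definition is given below. Again, percentile_inc is a hypothetical helper name chosen for illustration. Because $(n-1)s+1$ always lies between 1 and n when $0 < s < 1\text{,}$ no range check is needed here. R's quantile function with type = 7 (its default) uses this rule, so the two printed values should agree.

```r
# percentile_inc is a hypothetical helper name, not a built-in R function.
percentile_inc <- function(x, s) {
  y <- sort(x)              # order statistics y_1 <= ... <= y_n
  n <- length(y)
  h <- (n - 1) * s + 1      # target position; stays within 1..n for 0 < s < 1
  m <- floor(h)             # integer part
  r <- h - m                # fractional part
  if (r == 0) return(y[m])
  (1 - r) * y[m] + r * y[m + 1]
}

ages <- c(69, 64, 46, 54, 47, 70)
print(percentile_inc(ages, 0.42))
print(quantile(ages, 0.42, type = 7, names = FALSE))
```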

Compute the following percentile values using both formulas (Definition 1.3.5 and Definition 1.3.6) and determine which one the problem author picked.

Checkpoint 1.3.7. WeBWorK - Computing Percentiles.

Example 1.3.8. Presidential Percentile.

To compute, say, the 42nd percentile using Definition 1.3.5 for the presidential inauguration data of Example 1.3.2, take s = 0.42. Since there are 6 numbers in the data set,

$\begin{equation*} (n+1)s = 7 \cdot 0.42 = 2.94 \end{equation*}$

and so m = 2 and r = 0.94. Thus, the percentile will lie between $y_2 = 47$ and $y_3 = 54$ and much closer to 54 than 47. Numerically

$\begin{equation*} P^{0.42} = 0.06 \cdot 47 + 0.94 \cdot 54 = 53.58. \end{equation*}$
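
For comparison, here is the same percentile worked under the alternate Definition 1.3.6. This is a supplementary calculation, included only to show how far the two conventions can differ on a small data set. With n = 6 and s = 0.42,

$\begin{equation*} (n-1)s + 1 = 5 \cdot 0.42 + 1 = 3.1 \end{equation*}$

so m = 3 and r = 0.1. The percentile now lies between $y_3 = 54$ and $y_4 = 64\text{,}$ and

$\begin{equation*} P^{0.42} = 0.9 \cdot 54 + 0.1 \cdot 64 = 55. \end{equation*}$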

Both percentile formulas compute a weighted average of $y_m$ and $y_{m+1}\text{,}$ which is unique for distinct values of s provided all of the data values are distinct. Note that if some of the y-values are equal, then some of these averages may be averages of equal numbers and will therefore equal the common value.

Some percentiles are given special names.

Definition 1.3.9. Quartiles.

Given a sorted data set, the first, second, and third quartiles are the values of

$\begin{equation*} Q_1 = P^{0.25}, Q_2 = P^{0.5} \end{equation*}$

and

$\begin{equation*} Q_3 = P^{0.75}. \end{equation*}$

It should be noted that many graphing calculators compute quartiles using a straight average of two adjacent entries rather than the formula above. This causes some discrepancy, especially when n mod 4 = 2.

Example 1.3.10. $Q_1$ and $Q_3$ vs Calculators.

Suppose n = 22 = 5(4) + 2. Computing the first quartile as defined above gives $(n+1)s = 23(0.25) = 5.75 = 5 + 0.75 = m + r\text{.}$ Therefore,

$\begin{equation*} Q_1 = 0.25 \times y_5 + 0.75 \times y_6 \end{equation*}$

which is a value closer to $y_6\text{.}$ Many graphing calculators, however, quickly approximate this with

$\begin{equation*} 0.5 \times y_5 + 0.5 \times y_6 \end{equation*}$

so you should be aware of this possible difference. You should also notice that in this case s = 0.25 but r = 0.75 so these values are not required to be the same.
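
To see the size of this difference on concrete numbers, the sketch below uses made-up data, the integers 1 through 22, so that $y_k = k\text{.}$ It compares the definition's value (via quantile with type = 6, which uses the $(n+1)s$ rule) against the straight average of $y_5$ and $y_6\text{.}$ It illustrates only the arithmetic described above and is not a claim about any particular calculator model.

```r
# Illustrative data only: the integers 1..22, so that y_k = k
y <- 1:22

# First quartile by the (n + 1)s rule: 0.25 * y_5 + 0.75 * y_6
q1_def <- quantile(y, 0.25, type = 6, names = FALSE)

# Straight average of the same two adjacent entries
q1_avg <- mean(y[5:6])

print(c(definition = q1_def, straight_average = q1_avg))
```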

Definition 1.3.11. Deciles.

Given a sorted data set, the first, second, ..., ninth deciles are the values of

$\begin{equation*} D_1 = P^{0.1}, D_2 = P^{0.2}, ... , D_9 = P^{0.9}. \end{equation*}$
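
As a rough illustration, all nine deciles can be requested from R in one call. The data below are made up for this sketch: 19 values, chosen so that $(n+1)s = 20s$ is a whole number for every decile and each $D_k$ lands exactly on the order statistic $y_{2k}\text{.}$ The type = 6 argument selects the $(n+1)s$ rule. For small data sets (fewer than 9 values) the outer deciles would need $y_0$ or $y_{n+1}\text{,}$ so the definition does not directly apply there.

```r
# Made-up data: 5, 10, ..., 95 (n = 19), so that y_k = 5k
scores <- seq(5, 95, by = 5)

# The nine decile fractions 0.1, 0.2, ..., 0.9
s <- (1:9) / 10

deciles <- quantile(scores, s, type = 6, names = FALSE)
names(deciles) <- paste0("D", 1:9)
print(deciles)
```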

Example 1.3.12. Small Example - Quartiles.

Consider the data set

$\begin{equation*} \{2, 5, 8, 10 \}. \end{equation*}$

The 50th percentile should be a numerical value for which approximately 50% of the data is smaller. In this case, that would be some number between 5 and 8. For now, let's just take 6.5 so that two numbers in the set lie below 6.5 and two lie above. This is a perfect 50% for the 50th percentile. In a similar manner, the 25th percentile would be some number between 2 and 5, say 2.75, so that one number lies below 2.75 and three numbers lie above.

Using Definition 1.3.5, the 25th percentile is computed by considering

$\begin{equation*} (n+1)s = (4+1)0.25 = 5/4 = 1.25\text{.} \end{equation*}$

So, m = 1 and r = 0.25. Therefore

$\begin{equation*} P^{0.25} = 0.75 \times 2 + 0.25 \times 5 = 2.75 \end{equation*}$

as noted above.

Similarly, the 75th percentile is given by

$\begin{equation*} (n+1)s = (4+1)0.75 = 15/4 = 3.75\text{.} \end{equation*}$

So, m = 3 and r = 0.75. Therefore

$\begin{equation*} P^{0.75} = 0.25 \times 8 + 0.75 \times 10 = 9.5. \end{equation*}$

It is interesting to note that 3 also lies between 2 and 5, as does 2.75, and has the same percentages of the data above (75 percent) and below (25 percent). However, it should correspond to a slightly larger percentile. Indeed, working backward:

$\begin{gather*} 3 = (1-r) \times 2 + r \times 5\\ \Rightarrow r = \frac{1}{3}\\ \Rightarrow (n+1)s = 1 + \frac{1}{3} = \frac{4}{3}\\ \Rightarrow s = \frac{4}{15} \approx 0.267 \end{gather*}$

and so 3 would actually be at approximately the 26.7th percentile.

In general, given a numerical value within the range of a data set, one can determine the percentile ranking of that value by reversing the general formula for the percentile and solving for s, given $P^s\text{.}$ Since two ways to determine percentiles have been provided above, there are two possible answers. Let's address this using the main definition above.

Theorem 1.3.13.

Given any value y with $y_1 < y < y_n\text{,}$ y is the 100s-th percentile $P^s$ with

$\begin{equation*} s = \frac{r+m}{n+1} \end{equation*}$

where n is the number of data items, m is the positional subscript such that $y_m < y < y_{m+1}\text{,}$ and

$\begin{equation*} r = \frac{y - y_m}{y_{m+1} - y_m}. \end{equation*}$

Proof.

By the definition of percentile with $y = P^s\text{,}$

$\begin{equation*} y = (1-r) y_m + r y_{m+1}. \end{equation*}$

Solving for r and equating to the formula for $r$ in the definition of percentile yields

$\begin{equation*} r = \frac{y - y_m}{y_{m+1} - y_m} = (n+1)s - m. \end{equation*}$

Solving this for s gives

$\begin{equation*} s = \frac{r+m}{n+1}. \end{equation*}$

Hence,

$\begin{equation*} y = P^s. \end{equation*}$
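
The theorem can be turned into a short R function. This is a sketch under stated assumptions: percentile_rank is a hypothetical helper name, and m is taken as the largest index with $y_m \le y$ (rather than strictly less), so that a value equal to one of the data items simply gets r = 0. For the data set {2, 5, 8, 10} and y = 3 it should reproduce s = 4/15.

```r
# percentile_rank is a hypothetical helper name, not a built-in R function.
percentile_rank <- function(x, y_value) {
  y <- sort(x)                                # order statistics
  n <- length(y)
  stopifnot(y_value > y[1], y_value < y[n])   # theorem assumes y_1 < y < y_n
  # Assumption: m is the largest index with y[m] <= y_value,
  # so that y[m] <= y_value < y[m + 1]
  m <- max(which(y <= y_value))
  r <- (y_value - y[m]) / (y[m + 1] - y[m])   # fractional position
  (r + m) / (n + 1)
}

s <- percentile_rank(c(2, 5, 8, 10), 3)
print(s)   # the value 3 sits at the 100s-th percentile
```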

Again using the data set {2, 5, 8, 10}, the quartiles are

$\begin{equation*} Q_1 = 2.75, Q_2 = 6.5, Q_3 = 9.5. \end{equation*}$
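
The first and third quartiles were computed in Example 1.3.12. For completeness, the second quartile follows from the same definition 1.3.5 with s = 0.5:

$\begin{equation*} (n+1)s = (4+1)0.5 = 2.5 \end{equation*}$

so m = 2 and r = 0.5. Therefore

$\begin{equation*} Q_2 = P^{0.5} = 0.5 \times 5 + 0.5 \times 8 = 6.5. \end{equation*}$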

For a given data set, a summary of these statistics is often desired in order to give the user a quick overview of the more important order statistics.

Definition 1.3.14. 5-number summary.

Given a set of data, the 5-number summary is a vector of the order statistics given by

$\begin{equation*} \lt \text{minimum}, Q_1, Q_2, Q_3, \text{maximum} \gt . \end{equation*}$

You can also compute these statistics automatically using the open source statistical software known simply as "R". The following interactive cell uses the open source software "Sage" to perform this calculation through the freely available web portal at sagemath.sagecell.org. You can change the data list if you want to compute values for a different collection of numbers. The five-number summary is displayed graphically using a "Box-Plot". Graphical representations of data will be discussed later in this chapter. You should compare the answers found using R with the values produced by Definition 1.3.5. Keep in mind that R's quantile function defaults to type = 7, which corresponds to the alternate Definition 1.3.6, while passing type = 6 selects the $(n+1)s$ rule of Definition 1.3.5.

data <- c(1, 2, 5, 7, 7, -1, 3, 2)   # combine the values into a vector
print("Quartiles:")
print(quantile(data))                # default type = 7 (alternate definition)
print("Specific Percentiles:")
print(quantile(data, c(.32, .57, .98)))   # 32nd, 57th and 98th percentiles
print("Box and Whisker Diagram:")
boxplot(data, horizontal=TRUE)

Example 1.3.15. Small example - 5 number summary.

Returning to the data set {2, 5, 8, 10} from Example 1.3.12, the 5-number summary of Definition 1.3.14 is

$\begin{equation*} \lt 2, 2.75, 6.5, 9.5, 10 \gt . \end{equation*}$
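
One way to reproduce this summary in R is sketched below. It takes the minimum and maximum directly and asks quantile for the three quartiles with type = 6, which uses the $(n+1)s$ rule of Definition 1.3.5. The variable name five and the labels are arbitrary choices for this illustration.

```r
data <- c(2, 5, 8, 10)

# Minimum, the three quartiles by the (n + 1)s rule, and maximum
five <- c(min(data),
          quantile(data, c(0.25, 0.5, 0.75), type = 6, names = FALSE),
          max(data))
names(five) <- c("min", "Q1", "Q2", "Q3", "max")
print(five)
```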

Of course, the data sets you can use by manually entering data will be relatively small, and entering them is tedious. The open source statistical software R, however, has a number of built-in data sets, many of which include a relatively large number of values. We will be using some of those data sets throughout the remainder of this text. To see which data sets are available, execute the command data(). You can load one of these sets by executing the data command with the desired data set name inside the parentheses and then print out the first few data values with headers using the head command. Execute the interactive cell below and play around with some of the data sets. Remove the hash symbol (#) to 'uncomment' those lines as needed.

data()
# data(faithful)
# head(faithful,10)
# nrow(faithful)  #  the number of observations (rows)
# ncol(faithful)  #  the number of variables per observation (columns)
# print(? faithful)      #  a more exhaustive description of the data

Now that you can access large ready-to-go data sets, you will want to evaluate them in various ways. The following interactive cell helps you calculate the measures presented in this section as well as a few measures we will investigate in the next section. Note that an R cell routinely displays only the result of the final command. If you want to output any of the intermediate values, use the print() command as shown below. Also, to make things easier to enter, let's give the formal data set name a shorter local identifier x.

data(faithful)
x <- faithful
x1 <- x[,1]
x2 <- x[,2]
m <- min(x2)   # minimum of the second column only
M <- max(x2)   # maximum of the second column only
print(paste("Minimum of the second outcome = ",m))
print(paste("Maximum of the second outcome = ",M))
cat("\n\n")   #  a couple of blank lines in the displayed output
print(quantile(x1, c(.25,.29,.57)))   # percentiles of the first outcome
cat("\n\n")   #  a couple of blank lines in the displayed output
# mu1 <- mean(x1)
# mu2 <- mean(x2)
# med1 <- median(x1)
# med2 <- median(x2)
# print(paste("Mean of the first outcome = ",mu1))
# print(paste("Mean of the second outcome = ",mu2))
# print(paste("Median of the first outcome = ",med1))
# print(paste("Median of the second outcome = ",med2))