Statistical Measures of Position

Section 1.3 Statistical Measures of Position

Given a collection of data, sorting it provides several useful descriptors. A spreadsheet handles the sort easily for larger data sets, but in this section you will also see that there are ways to perform a sort by hand. In either case, statistical measures of position generally involve very little computational work once the data is sorted, and they take into account only the order of the data from lowest to highest. To assist with notation, we will generally use x-values to represent the original raw data and y-values to represent that same data when ordered, with the subscript indicating the positional placement.

Definition 1.3.1. Order Statistic.

From the data set $x_{1}, x_{2}, \ldots, x_{n}$, assume that the sorted data is denoted $y_{1}, y_{2}, \ldots, y_{n}$, where

$$y_{1} \leq y_{2} \leq \cdots \leq y_{n} .$$

Then $y_{k}$ is known as the kth order statistic.

Example 1.3.2. Age of Presidents - order statistics.

The age at inauguration for presidents from 1981-2019 gives the data

$$x_{1} = 69, x_{2} = 64, x_{3} = 46, x_{4} = 54, x_{5} = 47, x_{6} = 70$$

(Reagan, Bush, Clinton, Bush, Obama, Trump). For this data, the order statistics are

$$y_{1} = 46, y_{2} = 47, y_{3} = 54, y_{4} = 64, y_{5} = 69, y_{6} = 70 .$$
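In R, the same ordering can be sketched with the built-in sort() function. The variable names x, y, and k below simply mirror the notation of Definition 1.3.1 and are otherwise arbitrary.

```r
# Raw data x_1, ..., x_6: ages at inauguration from Example 1.3.2
x <- c(69, 64, 46, 54, 47, 70)

# sort() returns the values in increasing order: the order statistics y_1, ..., y_n
y <- sort(x)
print(y)          # 46 47 54 64 69 70

# order() reports which original position supplies each order statistic
print(order(x))   # 3 5 4 2 1 6

# The kth order statistic is simply the kth entry of the sorted vector
k <- 3
print(y[k])       # 54
```

Here order(x) shows, for instance, that the smallest value came from the third entry of the raw data (Clinton).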

Once the data is sorted, it should be very easy for you to locate the smallest and largest values.

Definition 1.3.3. Minimum/Maximum.

For a given data set, the smallest and largest values are known as the minimum and maximum, respectively. In our notation, for a data set of size n, the minimum is $y_{1}$ and the maximum is $y_{n}$.

Example 1.3.4. Age of Presidents - Minimum/Maximum.

Using the President inauguration data from Example 1.3.2, the minimum is $y_{1} = 46$ and the maximum is $y_{6} = 70$.
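As a quick check in R, the first and last order statistics can be compared with the built-in min(), max(), and range() functions, which do not require sorting first. The variable names are illustrative only.

```r
x <- c(69, 64, 46, 54, 47, 70)   # data from Example 1.3.2
y <- sort(x)                     # order statistics
n <- length(y)

# Minimum and maximum read off the sorted data
print(c(minimum = y[1], maximum = y[n]))   # 46 and 70

# The built-in functions give the same values directly from the raw data
print(min(x))     # 46
print(max(x))     # 70
print(range(x))   # 46 70
```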

A value that separates ordered data into two groups with a desired percentage on each side is called a percentile. Several conventions exist for computing one. In this text we present two and will consistently use the first one presented below. For each, a given percentile is a numerical value below which approximately the given percentage of the data lies.

The first definition below provides a unique measure for each value of s and corresponds to the PERCENTILE.EXC function in Excel. This version starts by computing $(n + 1) s$, where $0 < s < 1$, and uses the result to linearly interpolate between two adjacent entries in the sorted list. Another option, which corresponds to PERCENTILE.INC (and PERCENTILE) in Excel, is to start with $(n - 1) s + 1$ to pick the two adjacent entries and then proceed with linear interpolation. Again, this text uses the first approach.

Definition 1.3.5. Percentiles (EXC).

For $0 < s < 1$ and for order statistics $y_{1}, y_{2}, \ldots, y_{n}$, define the 100s-th percentile to be

$$P^{s} = (1 - r) y_{m} + r y_{m + 1}$$

where m is the integer part of $(n + 1) s$, namely

$$m = \lfloor (n + 1) s \rfloor$$

and

$$r = (n + 1) s - m,$$

the fractional part of $(n + 1) s$. The formula needs $m \geq 1$, and $r = 0$ whenever $m = n$, so s must satisfy $1/(n + 1) \leq s \leq n/(n + 1)$.

In Excel, this is PERCENTILE.EXC.

Definition 1.3.6. Percentiles (INC).

For $0 < s < 1$ and for order statistics $y_{1}, y_{2}, \ldots, y_{n}$, define the 100s-th percentile to be

$$P^{s} = (1 - r) y_{m} + r y_{m + 1}$$

where m is the integer part of $(n - 1) s + 1$, namely

$$m = \lfloor (n - 1) s + 1 \rfloor$$

and

$$r = (n - 1) s + 1 - m,$$

the fractional part of $(n - 1) s + 1$.

In Excel, this is PERCENTILE.INC or just PERCENTILE.
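Since the two definitions differ only in the position at which they interpolate, both can be sketched as one small R function. The helper name percentile and the data vector are illustrative assumptions, not part of R or of this text. The comparison relies on R's built-in quantile() function, whose type = 6 and type = 7 options interpolate at the positions $(n + 1) s$ and $(n - 1) s + 1$ respectively.

```r
# Hypothetical helper (not part of base R): 100s-th percentile of x using
# Definition 1.3.5 ("exc") or Definition 1.3.6 ("inc").
percentile <- function(x, s, method = c("exc", "inc")) {
  method <- match.arg(method)
  y <- sort(x)                       # order statistics y_1, ..., y_n
  n <- length(y)
  pos <- if (method == "exc") (n + 1) * s else (n - 1) * s + 1
  m <- floor(pos)                    # integer part
  r <- pos - m                       # fractional part
  # Assumption: refuse positions that fall outside y_1, ..., y_n
  if (m < 1 || m > n || (m == n && r > 0)) stop("s is out of range for this n")
  if (m == n) return(y[n])           # r is 0 here, so y_{n+1} is not needed
  (1 - r) * y[m] + r * y[m + 1]
}

x <- c(3, 1, 4, 1, 5, 9, 2, 6)       # made-up data, n = 8
for (s in c(0.25, 0.42, 0.80)) {
  cat("s =", s,
      "| EXC:", percentile(x, s, "exc"),
      "vs type 6:", quantile(x, s, type = 6, names = FALSE),
      "| INC:", percentile(x, s, "inc"),
      "vs type 7:", quantile(x, s, type = 7, names = FALSE), "\n")
}
```

Each printed pair should agree. Note that quantile() uses type = 7 by default, so type = 6 must be requested explicitly to follow the main definition of this text.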

Compute the following percentile values using both Definition 1.3.5 and Definition 1.3.6 to determine which one the problem author picked.

Checkpoint 1.3.7. WeBWorK - Computing Percentiles.

Consider the following data set:

$$\begin{array}{ccccccccc} 47 & 37 & 30 & 65 & 20 & 38 & 37 & 45 & 59 \\ 49 & 53 & 21 & 23 & 37 & 49 & 20 & 62 & 62 \end{array}$$

Find the 15th and 89th percentiles for this data.

15th percentile =

89th percentile =

Answer 1. 20.85

Answer 2. 62
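One way to see which definition produced the posted answers is to compute both in R. This is a sketch that assumes R's quantile() with type = 6 follows the $(n + 1) s$ position of Definition 1.3.5 and type = 7 follows the $(n - 1) s + 1$ position of Definition 1.3.6.

```r
data <- c(47, 37, 30, 65, 20, 38, 37, 45, 59,
          49, 53, 21, 23, 37, 49, 20, 62, 62)
s <- c(0.15, 0.89)

exc <- quantile(data, s, type = 6, names = FALSE)   # (n + 1) s position
inc <- quantile(data, s, type = 7, names = FALSE)   # (n - 1) s + 1 position

print(exc)   # 20.85 and 62
print(inc)   # 22.10 and 62
```

Only the first pair matches both posted answers, so this problem was written with Definition 1.3.5. The 89th percentile agrees under both definitions because the two neighboring order statistics are both 62.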

Example 1.3.8. Presidential Percentile.

To compute, say, the 42nd percentile of the President inauguration data from Example 1.3.2 using Definition 1.3.5, take s = 0.42. Since there are 6 numbers in our data set,

$$(n + 1) s = 7 \cdot 0.42 = 2.94$$

and so m = 2 and r = 0.94. Thus, the percentile will lie between $y_{2} = 47$ and $y_{3} = 54$, and much closer to 54 than to 47. Numerically,

$$P^{0.42} = 0.06 \cdot 47 + 0.94 \cdot 54 = 53.58 .$$

On the other hand, using Definition 1.3.6 on the same data with s = 0.42 gives

$$(n - 1) s + 1 = 5 \cdot 0.42 + 1 = 2.1 + 1 = 3.1$$

and so in this case m = 3 and r = 0.1. Thus, the percentile will lie between $y_{3} = 54$ and $y_{4} = 64$, and much closer to 54 than to 64. Numerically,

$$P^{0.42} = 0.9 \cdot 54 + 0.1 \cdot 64 = 55 .$$
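The arithmetic of this example can be retraced step by step in R. The variable names below are chosen only to mirror the symbols n, s, m, and r.

```r
y <- sort(c(69, 64, 46, 54, 47, 70))   # order statistics from Example 1.3.2
n <- length(y)
s <- 0.42

# Definition 1.3.5: interpolate at position (n + 1) s
pos <- (n + 1) * s                      # 2.94
m <- floor(pos)                         # 2
r <- pos - m                            # 0.94
p_exc <- (1 - r) * y[m] + r * y[m + 1]
print(p_exc)                            # 53.58

# Definition 1.3.6: interpolate at position (n - 1) s + 1
pos <- (n - 1) * s + 1                  # 3.1
m <- floor(pos)                         # 3
r <- pos - m                            # 0.1
p_inc <- (1 - r) * y[m] + r * y[m + 1]
print(p_inc)                            # 55
```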

Both formulas determine a weighted average of $y_{m}$ and $y_{m + 1}$, which is unique for distinct values of s provided the data values are distinct. Note that if some of the y-values are equal, then some of these averages might be averages of equal numbers and will therefore be the common value.

Some percentiles are given special names.

Definition 1.3.9. Quartiles.

Given a sorted data set, the first, second, and third quartiles are the values

$$Q_{1} = P^{0.25}, \quad Q_{2} = P^{0.5}, \quad Q_{3} = P^{0.75} .$$

It should be noted that many graphing calculators compute quartiles using a straight average of two adjacent entries rather than the percentile-based formula above. This causes some difficulty, especially when n mod 4 = 2.

Example 1.3.10. $Q_{1}$ and $Q_{3}$ vs Calculators.

Suppose n = 22 = 5(4) + 2. Computing the first quartile as defined using the "n+1" approach gives

$$(n + 1) s = 23 (0.25) = 5.75 = 5 + 0.75 = m + r .$$

Therefore,

$$Q_{1} = 0.25 \times y_{5} + 0.75 \times y_{6},$$

which is a value between $y_{5}$ and $y_{6}$ and closer to $y_{6}$. Using the "n-1" approach gives

$$(n - 1) s + 1 = 21 (0.25) + 1 = 6.25 = 6 + 0.25 = m + r .$$

Therefore,

$$Q_{1} = 0.75 \times y_{6} + 0.25 \times y_{7},$$

which is a value between $y_{6}$ and $y_{7}$ and closer to $y_{6}$. Many graphing calculators, however, quickly approximate this with

$$0.5 \times y_{5} + 0.5 \times y_{6},$$

so you should be aware of this possible difference. You should also notice that in the "n+1" computation s = 0.25 but r = 0.75, so these values are not required to be the same.
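A convenient way to see the three candidate values side by side is to use stand-in data with $y_{k} = k$, so that each computed quartile equals the position it interpolates at. The data 1, 2, ..., 22 is an assumption made purely for illustration, and the straight average is coded exactly as described in the example rather than taken from any particular calculator.

```r
y <- 1:22        # stand-in sorted data with y_k = k, so n = 22
n <- length(y)
s <- 0.25

print((n + 1) * s)       # 5.75, the "n+1" position
print((n - 1) * s + 1)   # 6.25, the "n-1" position

q_exc <- quantile(y, s, type = 6, names = FALSE)   # "n+1" approach
q_inc <- quantile(y, s, type = 7, names = FALSE)   # "n-1" approach
q_avg <- (y[5] + y[6]) / 2                         # straight average of y_5 and y_6

print(c(exc = q_exc, inc = q_inc, average = q_avg))   # 5.75, 6.25, 5.5
```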

Definition 1.3.11. Deciles.

Given a sorted data set, the first, second, ..., ninth deciles are the values

$$D_{1} = P^{0.1}, D_{2} = P^{0.2}, \ldots, D_{9} = P^{0.9} .$$
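All nine deciles can be requested at once in R by passing a vector of s-values to quantile(). The sketch below assumes type = 6 to follow Definition 1.3.5 and uses an arbitrary 18-value data set, since a very small n would push $(n + 1) s$ below 1 for the first decile.

```r
data <- c(47, 37, 30, 65, 20, 38, 37, 45, 59,
          49, 53, 21, 23, 37, 49, 20, 62, 62)

s <- (1:9) / 10                                      # 0.1, 0.2, ..., 0.9
d <- quantile(data, s, type = 6, names = FALSE)
names(d) <- paste0("D", 1:9)                         # label the values D1, ..., D9
print(d)

# The fifth decile is the 50th percentile, so it matches the median
print(d[["D5"]])      # 41.5
print(median(data))   # 41.5
```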

Example 1.3.12. Small Example - Quartiles.

Consider the data set {2, 5, 8, 10}.

The 50th percentile should be a numerical value for which approximately 50% of the data is smaller. In this case, that would be some number between 5 and 8. For now, let’s just take 6.5, so that two numbers in the set lie below 6.5 and two lie above. This is a perfect 50% for the 50th percentile. In a similar manner, the 25th percentile would be some number between 2 and 5, say 2.75, so that one number lies below 2.75 and three numbers lie above.

Using Definition 1.3.5, the 25th percentile is computed by considering

$$(n + 1) s = (4 + 1) 0.25 = 5 / 4 = 1.25 .$$

So, m = 1 and r = 0.25. Therefore

$$P^{0.25} = 0.75 \times 2 + 0.25 \times 5 = 2.75$$

as noted above.

Similarly, the 75th percentile is given by

$$(n + 1) s = (4 + 1) 0.75 = 15 / 4 = 3.75 .$$

So, m = 3 and r = 0.75. Therefore

$$P^{0.75} = 0.25 \times 8 + 0.75 \times 10 = 9.5 .$$

It is interesting to note that 3 also lies between 2 and 5, as does 2.75, and has the same percentages above (75 percent) and below (25 percent). However, it should designate a slightly larger percentile location. Indeed, going backward:

$$\begin{matrix} 3 = (1 - r) \times 2 + r \times 5 \\ \Rightarrow r = \frac{1}{3} \\ \Rightarrow (n + 1) s = 1 + \frac{1}{3} = \frac{4}{3} \\ \Rightarrow s = \frac{4}{15} \approx 0.267 \end{matrix}$$

and so 3 would actually be at approximately the 26.7th percentile.

One might also use the alternate Definition 1.3.6, in which case the 25th percentile is computed by considering

$$(n - 1) s + 1 = (4 - 1) 0.25 + 1 = 7 / 4 = 1.75 .$$

So, m = 1 and r = 0.75. Therefore

$$P^{0.25} = 0.25 \times 2 + 0.75 \times 5 = 4.25$$

which is still between 2 and 5 but now closer to 5. So you will want to settle ahead of time which method for computing percentiles is preferred and stick with it. (Note: when working online exercises, you might need to work some of them both ways, since you may not know which method the author chose.)
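The gap between the two conventions on this small data set can be reproduced in R. This sketch assumes quantile() with type = 6 for Definition 1.3.5 and type = 7 (R's default) for Definition 1.3.6.

```r
data <- c(2, 5, 8, 10)
s <- c(0.25, 0.50, 0.75)

q_exc <- quantile(data, s, type = 6, names = FALSE)   # Definition 1.3.5
q_inc <- quantile(data, s, type = 7, names = FALSE)   # Definition 1.3.6

print(q_exc)   # 2.75, 6.5, 9.5
print(q_inc)   # 4.25, 6.5, 8.5
```

The two conventions agree at the 50th percentile here but pull the 25th and 75th percentiles in opposite directions, which is why the choice needs to be fixed in advance.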

In general, given a numerical value within the range of a given data set, one can determine the percentile ranking of that value by reversing the general formula for the percentile and solving for s, given $P^{s}$. Since two ways to determine percentiles have been provided above, there will be two possible answers. Let’s address this using the main definition, Definition 1.3.5.

Theorem 1.3.13.

Given any value y with $y_{1} < y < y_{n}$, y is at the 100s-th percentile, that is $y = P^{s}$, with

$$s = \frac{r + m}{n + 1}$$

where n is the number of data items, m is the positional subscript such that $y_{m} \leq y < y_{m + 1}$, and

$$r = \frac{y - y_{m}}{y_{m + 1} - y_{m}} .$$

Proof.

By the definition of percentile with $y = P^{s}$,

$$y = (1 - r) y_{m} + r y_{m + 1} .$$

Solving for r and equating to the formula for r in the definition of percentile yields

$$r = \frac{y - y_{m}}{y_{m + 1} - y_{m}} = (n + 1) s - m .$$

Solving this for s gives

$$s = \frac{r + m}{n + 1} .$$

Hence, $y = P^{s}$.
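The theorem translates into a short R function. The name percentile_rank is a hypothetical helper, not a built-in, and the sketch assumes the "n+1" convention of Definition 1.3.5. It picks m as the largest subscript with $y_{m} \leq y$, an assumption that also covers a y equal to one of the data values.

```r
# Hypothetical helper: the s for which y is the 100s-th percentile of x,
# following Theorem 1.3.13 with the "n+1" convention.
percentile_rank <- function(x, y) {
  ys <- sort(x)                        # order statistics
  n <- length(ys)
  stopifnot(y > ys[1], y < ys[n])      # the theorem needs y_1 < y < y_n
  m <- max(which(ys <= y))             # largest m with y_m <= y
  r <- (y - ys[m]) / (ys[m + 1] - ys[m])
  (r + m) / (n + 1)
}

data <- c(2, 5, 8, 10)
s <- percentile_rank(data, 3)
print(s)                                           # 4/15, about 0.267

# Round trip: the 100s-th percentile should return the starting value
print(quantile(data, s, type = 6, names = FALSE))  # 3
```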

Again using the data set {2, 5, 8, 10},

$$Q_{1} = 2.75, \quad Q_{2} = 6.5, \quad Q_{3} = 9.5 .$$

For a given data set, a summary of these statistics is often desired in order to give the user a quick overview of the more important order statistics.

Definition 1.3.14. 5-number summary.

Given a set of data, the 5-number summary is a vector of the order statistics given by

$$\langle \text{minimum}, Q_{1}, Q_{2}, Q_{3}, \text{maximum} \rangle .$$

You can also compute these statistics automatically using the open-source statistical software known simply as "R". The following interactive cell uses the open-source software "Sage" to run R through the freely available web portal at sagecell.sagemath.org. You can change the data list if you want to compute values for a different collection of numbers. The five-number summary is displayed graphically using a "Box-Plot". Graphical representations of data will be discussed later in this chapter. You should compare the answers found using R with the values produced by Definition 1.3.6.


    
        
    data <- c(1, 2, 5, 7, 7, -1, 3, 2)        # combine the values into a vector
    print("Quartiles:")
    print(quantile(data))                     # default type 7 matches Definition 1.3.6
    print("Specific Percentiles:")
    print(quantile(data, c(.32, .57, .98)))   # 32nd, 57th and 98th percentiles
    print("Box and Whisker Diagram:")
    boxplot(data, horizontal=TRUE)

Example 1.3.15. Small example - 5 number summary.

Returning to our previous example, the 5-number summary of Definition 1.3.14 would be

$$\langle 2, 2.75, 6.5, 9.5, 10 \rangle .$$
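A 5-number summary that follows Definition 1.3.5 can be assembled in R from min(), max(), and quantile(). The helper name five_number is an assumption made for illustration. R's own fivenum() and summary() functions use different quartile conventions, so their middle values may not match the ones computed here.

```r
# Hypothetical helper: 5-number summary with quartiles from the "n+1"
# convention (type = 6); pass type = 7 to follow Definition 1.3.6 instead.
five_number <- function(x, type = 6) {
  q <- quantile(x, c(0.25, 0.50, 0.75), type = type, names = FALSE)
  c(minimum = min(x), Q1 = q[1], Q2 = q[2], Q3 = q[3], maximum = max(x))
}

data <- c(2, 5, 8, 10)
print(five_number(data))             # 2, 2.75, 6.5, 9.5, 10
print(five_number(data, type = 7))   # 2, 4.25, 6.5, 8.5, 10
```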

Of course, data sets that you enter manually will be relatively small and tedious to deal with. The open-source statistical software R, however, has a number of built-in data sets, many of which include a relatively large number of values. We will be utilizing some of those data sets throughout the remainder of this text. To see which data sets are available, execute the command data(). You can load one of these sets by executing the data command with the desired data set name inside the parentheses, and then print the first few data values with headers using the head command. Execute the interactive cell below and play around with some of the data sets. Remove the hash symbol (#) to ’uncomment’ those lines as needed.


    
        
    data()                  #  list the available built-in data sets
    # data(faithful)
    # head(faithful,10)
    # nrow(faithful)  #  the number of observations (rows)
    # ncol(faithful)  #  the number of variables per observation (columns)
    # help(faithful)  #  a more exhaustive description of the data

Now that you can access large ready-to-go data sets, you will want to evaluate them in various ways. The following interactive cell helps you calculate the measures presented in this section as well as a few measures we will investigate in the next section. Note that an R cell routinely outputs only the result of the final command. If you want to output any of the intermediate values, use the print() command as shown below. Also, to make things easier to enter, let’s give the formal data set name a shorter local identifier x.


    
        
    data(faithful)
    x <- faithful        #  shorter local name for the data frame
    x1 <- x[,1]          #  first outcome (column 1)
    x2 <- x[,2]          #  second outcome (column 2)
    m <- min(x2)         #  minimum of the second outcome only
    M <- max(x2)         #  maximum of the second outcome only
    print(paste("Minimum of the second outcome = ",m))
    print(paste("Maximum of the second outcome = ",M))
    cat("\n\n")   #  a couple of blank lines in the displayed output
    print(quantile(x1, c(.25,.29,.57)))   #  percentiles of the first outcome
    cat("\n\n")   #  a couple of blank lines in the displayed output
    # mu1 <- mean(x1)
    # mu2 <- mean(x2)
    # med1 <- median(x1)
    # med2 <- median(x2)
    # print(paste("Mean of the first outcome = ",mu1))
    # print(paste("Mean of the second outcome = ",mu2))
    # print(paste("Median of the first outcome = ",med1))
    # print(paste("Median of the second outcome = ",med2))

Essentials of Mathematical Probability and Statistics
