Given a collection of data, sorting the data provides several useful descriptors. For larger data sets you can easily sort with something like a spreadsheet, but in this section you will also see ways to perform a sort by hand. In either case, statistical measures of position generally involve very little computational work once the data is sorted, and they take into account only the order of the data from lowest to highest. To assist with notation, we will generally use x-values to represent the original raw data and y-values to represent that same data when ordered, with the subscript indicating the positional placement.
Once the data is sorted, it should be very easy for you to locate the smallest and largest values.
Definition 1.3.3. Minimum/Maximum.
For a given data set, the smallest and largest values are known as the minimum and maximum, respectively. In our notation and presuming a data set of size n, the minimum = \(y_1\) and the maximum = \(y_n\).
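As a minimal sketch of this notation in Python (the data values and variable names are hypothetical, chosen only for illustration), sorting the raw x-values gives the ordered y-values, and the minimum and maximum are the first and last entries. Python lists are indexed from 0, so \(y_1\) is stored at index 0 and \(y_n\) at index -1.

```python
x = [8, 2, 10, 5]        # hypothetical raw data (the x-values)
y = sorted(x)            # the same data in order (the y-values)

minimum = y[0]           # y_1
maximum = y[-1]          # y_n
print(y, minimum, maximum)
```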
A value that separates ordered data into two groups with a desired percentage on each side is called a percentile. Several methods have been devised for computing percentiles. In this text we present two and will consistently use the first one presented below. For each, a given percentile is a numerical value below which approximately the given percentage of the data lies.
The definition presented below gives a unique value for each choice of s and corresponds to the PERCENTILE.EXC function in Excel. This version starts by computing \((n+1)s\), where \(0 < s < 1\), and uses this to linearly interpolate between two adjacent entries in the sorted list. Another option, which corresponds to PERCENTILE.INC (and PERCENTILE) in Excel, starts with \((n-1)s+1\) to pick the two adjacent entries and then proceeds with linear interpolation. Again, the definition below uses the first approach.
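Definition 1.3.5. Percentiles (EXC).
For \(0 \lt s \lt 1\) and for order statistics \(y_1, y_2, ..., y_n\) define the 100s-th percentile to be
\begin{equation*}
P^s = (1-r) y_m + r y_{m+1}
\end{equation*}
where m is the integer part of \((n+1)s\) and \(r = (n+1)s - m\) is its fractional part.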
Both formula approaches for percentiles determine a weighted average between \(y_m\) and \(y_{m+1}\), which is unique for distinct values of s provided the data values are all distinct. Note that if some of the y-values are equal, then some of these averages might be averages of equal numbers and will therefore be the common value.
Some special percentiles are given special names.
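The first approach can be sketched in Python as follows. This is an illustration under stated assumptions rather than a definitive implementation: it takes m and r to be the integer and fractional parts of \((n+1)s\) and forms the weighted average \((1-r) y_m + r y_{m+1}\). The function name percentile_exc and the sample data are hypothetical.

```python
import math


def percentile_exc(data, s):
    """Return the 100s-th percentile using the (n+1)s interpolation rule."""
    y = sorted(data)             # order statistics y_1..y_n (stored 0-indexed)
    n = len(y)
    position = (n + 1) * s
    m = math.floor(position)     # integer part of (n+1)s
    r = position - m             # fractional part of (n+1)s
    if m < 1 or m > n or (m == n and r > 0):
        raise ValueError("(n+1)s must lie between 1 and n for this rule")
    if r == 0:
        return y[m - 1]          # the position lands exactly on y_m
    return (1 - r) * y[m - 1] + r * y[m]


sample = [12, 3, 20, 7, 15, 8, 21]    # hypothetical data, n = 7
print(percentile_exc(sample, 0.25))   # (n+1)s = 2, so this is y_2
print(percentile_exc(sample, 0.60))   # (n+1)s is about 4.8, between y_4 and y_5
```

The range check reflects the fact that interpolation needs an entry on each side of the computed position, so very small or very large values of s cannot be handled for a small data set.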
Definition 1.3.9. Quartiles.
Given a sorted data set, the first, second, and third quartiles are the values of \(P^{0.25}\), \(P^{0.50}\), and \(P^{0.75}\), respectively.
It should be noted that many graphing calculators compute quartiles using a straight average of two adjacent entries rather than the percentile-based formula above. This causes some difficulty, especially when n mod 4 = 2, so you should be aware of this possible difference. You should also notice that in this case the first quartile uses s = 0.25 but \((n+1)s\) has fractional part r = 0.75, so these two values are not required to be the same.
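To see the case n mod 4 = 2 concretely, here is a small Python sketch using a hypothetical data set of size 6. It computes the first quartile directly from the \((n+1)s\) rule and compares it with the standard library function statistics.quantiles, whose "exclusive" method also places the data at positions out of n + 1.

```python
import statistics

data = [9, 1, 12, 4, 3, 7]       # hypothetical data with n = 6, so n mod 4 == 2
y = sorted(data)
n = len(y)

position = (n + 1) * 0.25        # s = 0.25 gives position 1.75
m = int(position)                # integer part, m = 1
r = position - m                 # fractional part, r = 0.75
q1_by_formula = (1 - r) * y[m - 1] + r * y[m]

# Library check: three cut points splitting the data into four groups
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")
print(r, q1_by_formula, (q1, q2, q3))
```

A calculator that averages adjacent entries with equal weights would not reproduce the 0.25 and 0.75 weighting used here, which is one way the two answers can differ.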
Definition 1.3.11. Deciles.
Given a sorted data set, the first, second, ..., ninth deciles are the values of \(P^{0.1}, P^{0.2}, ..., P^{0.9}\), respectively.
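As a brief illustration in Python, the nine deciles can be obtained from statistics.quantiles with n=10 and the "exclusive" method, which follows the same \((n+1)s\) positioning. The data below is hypothetical and was chosen with nine values so that \((n+1)s = 10s\) is a whole number for every decile, meaning each decile lands exactly on a data value.

```python
import statistics

data = [15, 22, 27, 31, 36, 40, 44, 51, 58]   # hypothetical data, n = 9

# Nine cut points: the 10th, 20th, ..., 90th percentiles
deciles = statistics.quantiles(data, n=10, method="exclusive")
print(deciles)
```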
The 50th percentile should be a numerical value for which approximately 50% of the data is smaller. In this case, that would be some number between 5 and 8. For now, let’s just take 6.5 so that two numbers in the set lie below 6.5 and two lie above. This is a perfect 50% for the 50th percentile. In a similar manner, the 25th percentile would be some number between 2 and 5, say 2.75, so that one number lies below 2.75 and three numbers lie above.
It is interesting to note that 3, like 2.75, also lies between 2 and 5 and has the same percentages above (75 percent) and below (25 percent). However, it should designate a slightly larger percentile location. Indeed, going backward:
\begin{gather*}
3 = (1-r) \times 2 + r \times 5\\
\Rightarrow r = \frac{1}{3}\\
\Rightarrow (n+1)s = 1 + \frac{1}{3} = \frac{4}{3}\\
\Rightarrow s = \frac{4}{15} \approx 0.267
\end{gather*}
and so 3 would actually be at approximately the 26.7th percentile.
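It should be noted that one might also use the alternate percentile definition 1.3.6, in which case the 25th percentile is computed by considering
\begin{gather*}
(n-1)s + 1 = 3 \times 0.25 + 1 = 1.75\\
\Rightarrow m = 1, \quad r = 0.75\\
\Rightarrow (1-r) \times 2 + r \times 5 = 4.25
\end{gather*}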
which is still between 2 and 5 but now closer to 5. So you will want to settle ahead of time which method for computing percentiles is preferred and stick with it. (Note: when working online exercises you might need to work some of them both ways, since you may not know which method the author has chosen.)
In general, given a numerical value within the range of a given data set, one can determine the percentile ranking of that value by reversing the general formula for percentile and solving for s, given \(P^s\text{.}\) Since two ways to determine percentiles have been provided above, there will be two possible answers. Let’s address this using the main definition above.
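The two methods can be compared side by side with a short Python sketch. It assumes a data set of size 4 whose three smallest values are 2, 5, and 8, as in the discussion above; the largest value used here, 10, is only a placeholder and does not affect the 25th percentile. The helper name interpolate is hypothetical, and it performs no range checking, so it should only be used for positions between 1 and n.

```python
import math


def interpolate(y, position):
    """Weighted average of the sorted entries on either side of a 1-based position."""
    m = math.floor(position)     # integer part
    r = position - m             # fractional part
    if r == 0:
        return y[m - 1]
    return (1 - r) * y[m - 1] + r * y[m]


y = [2, 5, 8, 10]   # 2, 5, 8 follow the example; 10 is a placeholder
n = len(y)
s = 0.25

p25_exc = interpolate(y, (n + 1) * s)       # position 1.25
p25_inc = interpolate(y, (n - 1) * s + 1)   # position 1.75
print(p25_exc, p25_inc)
```

Both results lie between 2 and 5, with the second method giving the value closer to 5.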
Theorem 1.3.13.
Given any value y with \(y_1 < y < y_n\text{,}\) then y is at the \(P^s\) percentile with
\begin{equation*}
s = \frac{r+m}{n+1}
\end{equation*}
where n = the number of data items, m = the positional subscript such that \(y_m \le y < y_{m+1}\text{,}\) and \(r = \frac{y - y_m}{y_{m+1} - y_m}\text{.}\)
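A minimal Python sketch of this reversal follows. It takes r to be the fraction of the way that y lies from \(y_m\) to \(y_{m+1}\), which matches the backward calculation for the value 3 earlier in this section. The function name percentile_rank_exc is hypothetical, and the data reuses 2, 5, 8 with a placeholder largest value of 10.

```python
import bisect


def percentile_rank_exc(data, value):
    """Return s such that value sits at the 100s-th percentile under the (n+1)s rule."""
    y = sorted(data)
    n = len(y)
    if not (y[0] < value < y[-1]):
        raise ValueError("value must lie strictly between the minimum and maximum")
    # Number of entries <= value, which gives m with y_m <= value < y_(m+1)
    m = bisect.bisect_right(y, value)
    r = (value - y[m - 1]) / (y[m] - y[m - 1])
    return (m + r) / (n + 1)


y = [2, 5, 8, 10]   # 10 is a placeholder for the largest value
print(percentile_rank_exc(y, 3))   # approximately 0.267
```

Using bisect_right means repeated data values are handled by taking the last of the equal entries as \(y_m\), so the denominator is never zero.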
For a given data set, a summary of these statistics is often desired in order to give the user a quick overview of the more important order statistics.
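Definition 1.3.14. 5-number summary.
Given a set of data, the 5-number summary is a vector of the order statistics given by \(\langle y_1, P^{0.25}, P^{0.50}, P^{0.75}, y_n \rangle\text{,}\) that is, the minimum, the three quartiles, and the maximum.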
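One possible way to assemble such a summary in Python is sketched below, assuming the summary consists of the minimum, the three quartiles, and the maximum, with quartiles taken from the \((n+1)s\) rule via the "exclusive" method of statistics.quantiles. The function name five_number_summary and the sample data are hypothetical.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, Q2, Q3, maximum) using exclusive-method quartiles."""
    y = sorted(data)
    q1, q2, q3 = statistics.quantiles(y, n=4, method="exclusive")
    return (y[0], q1, q2, q3, y[-1])


sample = [12, 3, 20, 7, 15, 8, 21]   # hypothetical data, n = 7
print(five_number_summary(sample))
```

With seven values, \((n+1)s\) equals 2, 4, and 6 for the three quartiles, so each quartile here is one of the data values. Software that uses a different percentile rule may report different quartiles for the same data.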
You can also compute these statistics automatically using the open-source statistical software known simply as "R". The following interactive cell uses the open-source software "Sage" to perform this calculation using the freely available web portal at sagemath.sagecell.org. You can change the data list if you want to use this to compute values for a different collection of numbers. The five-number summary is displayed graphically using a "Box-Plot". Graphical representations of data will be discussed later in this chapter. You should compare the answers found using R with the values produced by our definition 1.3.6.
Of course, the data sets you can use by manually entering data will be relatively small and tedious to deal with. The open-source statistical software R, however, has a number of built-in data sets, many of them including a relatively large number of values. We will be using some of those data sets throughout the remainder of this text. To see what data sets are actually available, execute the command data(). You can load the data from one of these sets by executing the data command with the desired data set name inside the parentheses and then print out the first few data values with headers using the head command. Execute the interactive cell below and play around with some of the data sets. Remove the hash symbol (#) to ’uncomment’ those lines as needed.
Now that you can access large ready-to-go data sets, you will want to evaluate them in various ways. The following interactive cell helps you calculate the measures presented in this section as well as a few measures we will investigate in the next section. Note that an R cell routinely outputs only the result of the final command. If you want to output any of the intermediate values, use the print() command as shown below. Also, to make things easier to enter, let’s just go ahead and give the formal data set name a shorter local identifier x.