Section 1.6 Adjusting Statistical Measures for Grouped Data
In the earlier measures of center and spread, each data point was considered individually. Often, however, data is grouped into categories. The number of data items in each category is called the "frequency" of that outcome, and the collection of these frequencies for all outcomes is called a "frequency distribution".
Data Grouped into Single-valued Categories
Rather than treating \(x_k\) as the kth data value, you can take advantage of the grouping to save some arithmetic. Assume that the data is grouped into m categories \(x_1, x_2, ..., x_m\) with corresponding frequencies \(f_1, f_2, ..., f_m\text{.}\) Then, when computing the mean, rather than adding \(x_1\) to itself \(f_1\) times, just compute \(x_1 \times f_1\) for the first category and continue in the same way through the remaining categories. This gives the following grouped data formula for the mean
\begin{equation*}
\mu = \frac{x_1 f_1 + ... + x_m f_m}{f_1 + ... + f_m} = \frac{\sum_{k=1}^m x_k f_k}{\sum_{k=1}^m f_k}.
\end{equation*}
and the following grouped data formula for the variance (along with one equivalent form)
\begin{equation*}
\sigma^2 = \frac{\sum_{k=1}^m ( x_k-\mu )^2 f_k}{\sum_{k=1}^m f_k} = \frac{\sum_{k=1}^m x_k^2 f_k}{\sum_{k=1}^m f_k} - \mu^2
\end{equation*}
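To see why the two forms of the variance agree, expand the square and use the definition of \(\mu\text{.}\) In the sketch below, \(N = \sum_{k=1}^m f_k\) is shorthand introduced here for the total frequency; it is not notation used elsewhere in this section.
\begin{align*}
\frac{1}{N}\sum_{k=1}^m ( x_k-\mu )^2 f_k & = \frac{1}{N}\sum_{k=1}^m x_k^2 f_k - 2\mu \cdot \frac{1}{N}\sum_{k=1}^m x_k f_k + \mu^2 \cdot \frac{1}{N}\sum_{k=1}^m f_k \\
& = \frac{1}{N}\sum_{k=1}^m x_k^2 f_k - 2\mu \cdot \mu + \mu^2 \\
& = \frac{\sum_{k=1}^m x_k^2 f_k}{\sum_{k=1}^m f_k} - \mu^2
\end{align*}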
Consider the following data set
{3, 1, 2, 2, 3, 1, 3, 4, 5, 5, 1, 4, 5, 1, 2, 4, 5, 3, 2, 5, 2, 1, 2, 2, 5}
Create a frequency distribution and determine the sample mean and variance.
Solution.
Collecting this data into a frequency distribution gives Table 1.6.2.
Table 1.6.2. Grouped Discrete Data
\(x_k\) | \(f_k\) |
1 | 5 |
2 | 7 |
3 | 4 |
4 | 3 |
5 | 6 |
Therefore,
\begin{equation*}
\overline{x} = \frac{1 \times 5 + 2 \times 7 + 3 \times 4 + 4 \times 3 + 5 \times 6}{5+7+4+3+6} \\
= \frac{5 + 14 + 12 + 12 + 30}{25} = \frac{73}{25} = 2.92
\end{equation*}
and
\begin{align*}
v & = \frac{1^2 \times 5 + 2^2 \times 7 + 3^2 \times 4 + 4^2 \times 3 + 5^2 \times 6}{5+7+4+3+6} - \left ( \frac{73}{25} \right )^2 \\
& = \frac{5 + 28 + 36 + 48 + 150}{25} - \left ( \frac{73}{25} \right )^2 \\
& = \frac{6675 - 5329}{625} \\
& = \frac{1346}{625}\\
& = 2.1536
\end{align*}
and so \(s^2 = \frac{25}{24} \cdot \frac{1346}{625} \approx 2.243\text{.}\)
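As a check on this kind of arithmetic, the grouped formulas can be evaluated with exact fractions. The following is a minimal sketch, not part of the original text: the function name `grouped_mean_variance` is made up for illustration, and it uses the second (equivalent) form of the variance formula with the frequency table from this example.

```python
from fractions import Fraction


def grouped_mean_variance(values, freqs):
    """Return (mean, v) for data grouped into single-valued categories.

    v is the variance computed with divisor sum(freqs), using the
    equivalent form: (sum of x^2 * f) / (sum of f) - mean^2.
    """
    total = sum(freqs)
    mean = Fraction(sum(x * f for x, f in zip(values, freqs)), total)
    mean_of_squares = Fraction(sum(x * x * f for x, f in zip(values, freqs)), total)
    return mean, mean_of_squares - mean ** 2


# Frequency table from the worked example above.
values = [1, 2, 3, 4, 5]
freqs = [5, 7, 4, 3, 6]

mean, v = grouped_mean_variance(values, freqs)

# Rescale by n / (n - 1) to get the sample variance s^2.
n = sum(freqs)
s2 = Fraction(n, n - 1) * v

print("mean =", mean, "v =", v, "s^2 =", s2, "~", float(s2))
```

Working in `Fraction` keeps every intermediate value exact, so a slip in hand arithmetic shows up as a different fraction rather than a small rounding difference.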
Checkpoint 1.6.3. WeBWorK - Grouped Discrete Data.
The U.S. Bureau of the Census conducts nationwide surveys on characteristics of U.S. households. Following are data on the number of people per household for a sample of 50 households. Construct a grouped data table for these household sizes.
4 | 1 | 2 | 4 | 6 | 1 | 6 | 6 | 5 | 7 |
3 | 2 | 4 | 3 | 6 | 3 | 1 | 3 | 2 | 5 |
5 | 2 | 5 | 2 | 2 | 4 | 5 | 6 | 5 | 6 |
5 | 1 | 4 | 7 | 2 | 5 | 7 | 6 | 7 | 1 |
4 | 3 | 3 | 5 | 3 | 2 | 5 | 6 | 1 | 6 |
Household size | Frequency | Relative Frequency |
1 | ||
2 | ||
3 | ||
4 | ||
5 | ||
6 | ||
7 | ||
Total | 50 | 1 |
Answers.
Household size | Frequency | Relative Frequency |
1 | \(6\) | \(0.12\) |
2 | \(8\) | \(0.16\) |
3 | \(7\) | \(0.14\) |
4 | \(6\) | \(0.12\) |
5 | \(10\) | \(0.2\) |
6 | \(9\) | \(0.18\) |
7 | \(4\) | \(0.08\) |
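Tallying fifty values by hand is error-prone, so here is one way the table could be built programmatically. This is only a sketch: `frequency_table` is a hypothetical helper name, and relative frequency is taken to be the frequency divided by the total number of data items.

```python
from collections import Counter

# The 50 household sizes from the checkpoint, row by row.
household_sizes = [
    4, 1, 2, 4, 6, 1, 6, 6, 5, 7,
    3, 2, 4, 3, 6, 3, 1, 3, 2, 5,
    5, 2, 5, 2, 2, 4, 5, 6, 5, 6,
    5, 1, 4, 7, 2, 5, 7, 6, 7, 1,
    4, 3, 3, 5, 3, 2, 5, 6, 1, 6,
]


def frequency_table(data):
    """Return a sorted list of (value, frequency, relative frequency) rows."""
    counts = Counter(data)
    total = len(data)
    return [(x, counts[x], counts[x] / total) for x in sorted(counts)]


table = frequency_table(household_sizes)
for size, freq, rel in table:
    print(f"{size} | {freq} | {rel:.2f}")
```

The frequencies should sum to the number of data items and the relative frequencies to 1, which gives a quick sanity check on any table built this way.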
For data grouped into intervals, calculations are harder because the individual values are no longer available; all you know is the frequency of each interval. You can use "class marks", the midpoints of the intervals, as representatives for all of the items that fell into each interval when computing means and variances.
Indeed, for means of grouped data, presume that all of the data in a given interval lies at the midpoint of that interval and then use the frequency formula described above (1.4.9), but with these class marks as the x-values.
Consider data collected into disjoint intervals of the form
- \(\displaystyle [a_1,b_1)\)
- \(\displaystyle [a_2,b_2)\)
- \(\displaystyle [a_3,b_3)\)
- ...
- \(\displaystyle [a_n,b_n)\)
where \(f_k\) is the frequency of data items in interval \([a_k,b_k)\text{.}\) Since the intervals are disjoint, put them in order from low to high so that \(b_1 \le a_2, b_2 \le a_3\text{,}\) etc. Compute class marks \(mid_k = \frac{a_k+b_k}{2}\) and then use
\begin{equation*}
\mu = \frac{mid_1 f_1 + ... + mid_n f_n}{f_1 + ... + f_n} = \frac{\sum_{k=1}^n mid_k f_k}{\sum_{k=1}^n f_k}.
\end{equation*}
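The class-mark formula can be sketched in a few lines of code. The function name `class_mark_mean` and the three intervals below are made up purely for illustration; they do not come from the text.

```python
def class_mark_mean(intervals, freqs):
    """Approximate the mean of interval-grouped data.

    Every item in [a, b) is treated as if it sat at the class mark
    (a + b) / 2, and the grouped-data mean formula is applied.
    """
    mids = [(a + b) / 2 for a, b in intervals]
    return sum(mid * f for mid, f in zip(mids, freqs)) / sum(freqs)


# Made-up illustrative data: intervals [0,10), [10,20), [20,30).
intervals = [(0, 10), (10, 20), (20, 30)]
freqs = [2, 5, 3]

approx_mean = class_mark_mean(intervals, freqs)
print(approx_mean)  # (5*2 + 15*5 + 25*3) / 10 = 16.0
```

Because the true positions of the items inside each interval are unknown, the result is an approximation of the mean of the underlying data, not its exact value.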
For positional measures, approach this in the same manner as with percentiles before, that is, by doing linear interpolation across the width of each interval.
So, for medians, consider the following approach: Consider data collected into disjoint intervals of the form
- \(\displaystyle [a_1,b_1)\)
- \(\displaystyle [a_2,b_2)\)
- \(\displaystyle [a_3,b_3)\)
- ...
- \(\displaystyle [a_n,b_n)\)
where \(f_k\) is the frequency of data items in interval \([a_k,b_k)\) with corresponding cumulative frequency \(F_k = f_1 + \cdots + f_k\) (and \(F_0 = 0\)). Again, order them from low to high so that \(b_1 \le a_2, b_2 \le a_3\text{,}\) etc.
- Set m = total cumulative frequency/2 = \(F_n/2\)
- Determine the interval \(k\) where \(m \in [F_{k-1},F_k]\)
- Set median = \((b_k-a_k)\frac{m - F_{k-1}}{f_k}+a_k\)
Example 1.6.4. Computing Median for Interval Grouped Data.
\([a_k,b_k)\) | \(f_k\) |
[0,5) | 5 |
[5,10) | 7 |
[10,20) | 4 |
[20,23) | 3 |
[23,30) | 6 |
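Here the total frequency is 25, so \(m = 25/2 = 12.5\text{.}\) The cumulative frequencies are 5, 12, 16, 19, and 25, so \(m\) lies in \([F_2, F_3] = [12, 16]\text{,}\) which corresponds to the third interval \([10,20)\) with \(f_3 = 4\text{.}\) Therefore,
\begin{equation*}
\text{median} = (20-10) \frac{12.5-12}{4} + 10 = 11.25
\end{equation*}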
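The three-step median procedure can also be sketched in code. This is one possible reading of the steps, not a definitive implementation: `grouped_median` is a hypothetical name, the cumulative frequencies are assumed to start at 0, and when m falls exactly on a boundary the first interval with nonzero frequency that contains it is used.

```python
from itertools import accumulate


def grouped_median(intervals, freqs):
    """Estimate the median of interval-grouped data by linear interpolation.

    intervals: list of (a_k, b_k) pairs ordered from low to high.
    freqs: matching list of frequencies f_k.
    """
    cum = [0] + list(accumulate(freqs))  # cum[k] is the cumulative frequency F_k
    m = cum[-1] / 2                      # half of the total frequency
    for k, (a, b) in enumerate(intervals, start=1):
        f_k = freqs[k - 1]
        # Skip empty intervals; look for F_{k-1} <= m <= F_k.
        if f_k > 0 and cum[k - 1] <= m <= cum[k]:
            return (b - a) * (m - cum[k - 1]) / f_k + a
    raise ValueError("no data: all frequencies are zero")


# Table from Example 1.6.4.
intervals = [(0, 5), (5, 10), (10, 20), (20, 23), (23, 30)]
freqs = [5, 7, 4, 3, 6]

median = grouped_median(intervals, freqs)
print(median)  # 11.25
```

Replacing the factor 1/2 in `m` with another fraction would give the corresponding percentile under the same interpolation assumption.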
Checkpoint 1.6.6. Statistics using Continuous Grouped Data.
Given the following table, compute the mean of the grouped data.
Class | Midpoint | Frequency |
\([9,15)\) | | \(2\) |
\([15,21)\) | | \(1\) |
\([21,27)\) | | \(9\) |
\([27,33)\) | | \(7\) |
\([33,39)\) | | \(5\) |
\([39,45)\) | | \(1\) |
\([45,51)\) | | \(0\) |
Totals | | |
What is the mean of the grouped data?