Section 1.6 Adjusting Statistical Measures for Grouped Data

In the earlier measures of center and spread, each data point was considered individually. Often, however, data is grouped into categories. The number of data items in each category is called the "frequency" of that outcome, and the collection of these frequencies for all outcomes is called a "frequency distribution".

Data Grouped into Single-valued Categories

Rather than treating \(x_k\) as the \(k\)th data value, take advantage of the grouping to save some arithmetic. Assume that the data is grouped into \(m\) categories \(x_1, x_2, \ldots, x_m\) with corresponding frequencies \(f_1, f_2, \ldots, f_m\). Then, when computing the mean, rather than adding \(x_1\) to itself \(f_1\) times, just compute \(x_1 \times f_1\) for the first category and continue in the same way through the remaining categories. This gives the following grouped data formula for the mean

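\begin{equation*} \mu = \frac{x_1 f_1 + \cdots + x_m f_m}{f_1 + \cdots + f_m} = \frac{\sum_{k=1}^m x_k f_k}{\sum_{k=1}^m f_k} \end{equation*}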

and the following grouped data formula for the variance (along with one equivalent form)

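\begin{equation*} \sigma^2 = \frac{\sum_{k=1}^m (x_k - \mu)^2 f_k}{\sum_{k=1}^m f_k} = \frac{\sum_{k=1}^m x_k^2 f_k}{\sum_{k=1}^m f_k} - \mu^2 \end{equation*}
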
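The equivalence of the two variance forms follows from expanding the square and applying the grouped formula for the mean. A brief derivation, writing \(n = \sum_{k=1}^m f_k\) for the total frequency (a shorthand introduced here, not part of the original notation):

\begin{align*} \frac{1}{n}\sum_{k=1}^m (x_k - \mu)^2 f_k & = \frac{1}{n}\sum_{k=1}^m x_k^2 f_k - 2\mu \cdot \frac{1}{n}\sum_{k=1}^m x_k f_k + \mu^2 \cdot \frac{1}{n}\sum_{k=1}^m f_k \\ & = \frac{1}{n}\sum_{k=1}^m x_k^2 f_k - 2\mu^2 + \mu^2 \\ & = \frac{1}{n}\sum_{k=1}^m x_k^2 f_k - \mu^2 \end{align*}
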
Checkpoint 1.6.1.

Consider the following data set

{3, 1, 2, 2, 3, 1, 3, 4, 5, 5, 1, 4, 5, 1, 2, 4, 5, 3, 2, 5, 2, 1, 2, 2, 5}

Create a frequency distribution and determine the sample mean and variance.

Solution

Collecting this data into a frequency distribution gives

Table 1.6.2. Grouped Discrete Data
\(x_k\) \(f_k\)
1 5
2 7
3 4
4 3
5 6
Therefore,

\begin{equation*} \overline{x} = \frac{1 \times 5 + 2 \times 7 + 3 \times 4 + 4 \times 3 + 5 \times 6}{5+7+4+3+6} = \frac{5 + 14 + 12 + 12 + 30}{25} = \frac{73}{25} = 2.92 \end{equation*}

and

\begin{align*} v & = \frac{1^2 \times 5 + 2^2 \times 7 + 3^2 \times 4 + 4^2 \times 3 + 5^2 \times 6}{5+7+4+3+6} - \left ( \frac{73}{25} \right )^2 \\ & = \frac{5 + 28 + 36 + 48 + 150}{25} - \left ( \frac{73}{25} \right )^2 \\ & = \frac{267}{25} - \frac{5329}{625} \\ & = \frac{1346}{625}\\ & = 2.1536 \end{align*}

and so \(s^2 = \frac{25}{24} \cdot \frac{1346}{625} \approx 2.243\text{.}\)
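
The arithmetic in this checkpoint can be checked mechanically. The following is a minimal sketch, assuming Python's standard `collections.Counter` to build the frequency distribution and `fractions.Fraction` to keep the results exact; the variable names are illustrative choices, not part of the text.

```python
from collections import Counter
from fractions import Fraction

data = [3, 1, 2, 2, 3, 1, 3, 4, 5, 5, 1, 4, 5,
        1, 2, 4, 5, 3, 2, 5, 2, 1, 2, 2, 5]

# Frequency distribution: maps each value x_k to its frequency f_k
freq = Counter(data)
n = sum(freq.values())

# Grouped mean: sum of x_k * f_k divided by the total frequency
mean = Fraction(sum(x * f for x, f in freq.items()), n)

# Grouped variance (second form): sum of x_k^2 * f_k over n, minus mean^2
v = Fraction(sum(x * x * f for x, f in freq.items()), n) - mean ** 2

# Sample variance rescales by n / (n - 1)
s2 = Fraction(n, n - 1) * v

print(sorted(freq.items()))  # [(1, 5), (2, 7), (3, 4), (4, 3), (5, 6)]
print(mean, v, s2)           # 73/25 1346/625 673/300
```

Working with exact fractions avoids rounding questions when comparing against a hand computation; converting with `float(s2)` gives the decimal approximation.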

Checkpoint 1.6.3. WeBWorK - Grouped Discrete Data.

For data grouped into intervals, calculations are more difficult because the data no longer exists as individual values; all you know is the frequency of each interval. For computing means and variances, you can use "class marks" (the midpoints of each interval) as representatives for all of the items that fell into that interval. For positional measures, approach this in the same manner as with percentiles before, that is, by doing some sort of linear interpolation across the width of each interval.
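
As a rough sketch of the class mark idea, the grouped formulas for the mean and variance can be applied with each interval's midpoint standing in for its data. The function name `class_mark_stats` and the intervals and frequencies below are made up for illustration only.

```python
def class_mark_stats(intervals, freqs):
    """Approximate mean and variance using each interval's midpoint (class mark)."""
    marks = [(a + b) / 2 for a, b in intervals]
    n = sum(freqs)
    mean = sum(x * f for x, f in zip(marks, freqs)) / n
    variance = sum(x * x * f for x, f in zip(marks, freqs)) / n - mean ** 2
    return mean, variance

# Made-up intervals [a_k, b_k) and frequencies, for illustration only
intervals = [(0, 10), (10, 20), (20, 30)]
freqs = [2, 5, 3]

mean, variance = class_mark_stats(intervals, freqs)
print(mean, variance)  # 16.0 49.0
```

These values are only approximations of the statistics of the underlying data, since every item in an interval is treated as if it sat at the midpoint.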

For medians, consider the following approach. Suppose the data is collected into intervals of the form

  • \([a_1,b_1)\)
  • \([a_2,b_2)\)
  • \([a_3,b_3)\)
  • ...
  • \([a_n,b_n)\)

where \(f_k\) is the frequency of data items in interval \([a_k,b_k)\) with corresponding cumulative frequency \(F_k = f_1 + \cdots + f_k\) (and \(F_0 = 0\)).

  1. Set \(m = \text{total cumulative frequency}/2 = F_{\text{last}}/2\)
  2. Determine the interval \(k\) where \(m \in [F_{k-1},F_k]\)
  3. Set \(\text{median} = (b_k - a_k)\frac{m - F_{k-1}}{f_k} + a_k\)

Example 1.6.4. Computing Median for Interval Grouped Data.

Table 1.6.5. Interval Frequency Distribution
\([a_k,b_k)\) \(f_k\)
[0,5) 5
[5,10) 7
[10,20) 4
[20,23) 3
[23,30) 6

The total cumulative frequency is 25, so \(m = 25/2 = 12.5\), which lies in the \(k = 3\) interval \([10,20)\) since \(F_2 = 12\) and \(F_3 = 16\). Therefore

\begin{equation*} \text{median} = (20-10)\frac{12.5-12}{4} + 10 = 11.25 \end{equation*}
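
The three-step median procedure can be sketched in code as follows. This is one possible reading of the steps, assuming the intervals are listed in increasing order; the function name `grouped_median` is a hypothetical choice. When \(m\) falls exactly on a cumulative frequency, this sketch selects the earlier interval.

```python
def grouped_median(intervals, freqs):
    """Median by linear interpolation for data grouped into intervals [a_k, b_k)."""
    m = sum(freqs) / 2          # step 1: half of the total cumulative frequency
    cum_prev = 0                # F_{k-1}, starting from F_0 = 0
    for (a, b), f in zip(intervals, freqs):
        cum = cum_prev + f      # F_k
        if f > 0 and m <= cum:  # step 2: m lies in [F_{k-1}, F_k]
            # step 3: interpolate linearly across the width of the interval
            return (b - a) * (m - cum_prev) / f + a
        cum_prev = cum
    raise ValueError("total frequency must be positive")

# Data from Table 1.6.5
intervals = [(0, 5), (5, 10), (10, 20), (20, 23), (23, 30)]
freqs = [5, 7, 4, 3, 6]

print(grouped_median(intervals, freqs))  # 11.25
```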

Checkpoint 1.6.6. Statistics using Continuous Grouped Data.