Section 9.5 Normal Distribution as a Limiting Distribution
Over the past several chapters you should have noticed that many distributions have skewness and kurtosis formulas with limiting values of 0 and 3, respectively, which are exactly the skewness and kurtosis of the normal distribution. This suggests that each of those distributions can be approximated by the normal distribution for "large" parameter values.
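As a quick numerical check of these limits, here is a minimal sketch in plain Python. It uses the standard closed-form skewness and kurtosis of the Poisson and binomial distributions, which are assumed to match the formulas from the earlier chapters. The function names `poisson_shape` and `binomial_shape` are illustrative only.

```python
from math import sqrt

def poisson_shape(mu):
    """Return (skewness, kurtosis) of a Poisson distribution with mean mu."""
    return 1 / sqrt(mu), 3 + 1 / mu

def binomial_shape(n, p):
    """Return (skewness, kurtosis) of a binomial distribution with n trials."""
    q = 1 - p
    variance = n * p * q
    return (q - p) / sqrt(variance), 3 + (1 - 6 * p * q) / variance

# Let the parameter grow and watch both measures settle toward 0 and 3.
for size in (1, 10, 100, 10000):
    p_skew, p_kurt = poisson_shape(size)
    b_skew, b_kurt = binomial_shape(size, 0.3)
    print(f"parameter {size:>6}: "
          f"Poisson skew={p_skew:.4f} kurt={p_kurt:.4f} | "
          f"Binomial(p=0.3) skew={b_skew:.4f} kurt={b_kurt:.4f}")
```

In both families the correction terms have the growing parameter in the denominator, which is why the values drift toward 0 and 3 as that parameter increases.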
To see how this works, consider uniformly random data in the following two interactive graphs. For the first graph below, N random samples, each of size r, are generated with integer values ranging from 0 up to "Range", and the values are graphed as small data points. (In this scatter plot each row represents the data for one sample of size r; the larger N is, the more rows there are.) As the number of samples N and the sample size r increase, notice that the data seem to cover the entire range of possible values relatively uniformly. Each row is averaged and that mean value is plotted on the graph as a larger circle. If you check the "Show_Mean" box, the mean of all the data (and hence of these circles) is indicated by the green line in the middle of the plot.
For the second graph below, the means are collected and the frequency of each is plotted. As N increases, you should see that the results begin to show an interesting tendency. As you increase the data range, you may notice that this graph contains a larger number of distinct values. Checking "Smoothing" groups the means into intervals of length two, which may give a graph with less variability.
Consider each of the following:
- As N increases with single-digit values of r, what appears to happen to the mean and the range of the sample means? How does increasing the data range from 1-100 to 1-200 or 1-300 affect these results?
- As N increases (say, for a middle value of r), what appears to happen to the means? How does increasing the data range from 1-100 to 1-200 or 1-300 affect these results?
- As r increases (say, for a middle value of N), what appears to happen to the range of the averages? Does your conclusion actually depend upon the value of N? (Look at the graph and don't worry about the actual numerical values.) How does increasing N for the second graph affect the skewness and kurtosis of that graph? Do things change significantly as r is increased?
var('n,k')
from sage.finance.time_series import TimeSeries  # only used by the commented-out TimeSeries histogram below
@interact(layout=dict(top=[['Range'],['Show_Mean', 'Smoothing']],
                      bottom=[['N'],['r']]))
def _(Range=[100,200,300,500],N=slider(5,200,2,2,label="N = Number of Samples"),r=slider(3,200,1,2,label="r = Sample Size"),Show_Mean=False,Smoothing=False):
    R=[1..N] # R ranges over the number of samples...will point to the list of averages
    rangemax = Range
    # N x r matrix of random integers; each row is one sample of size r
    data = random_matrix(ZZ,N,r,x=rangemax)
    datapoints = []
    avg_values = []
    avg_string = []
    averages = []
    for n in range(N):
        temp = 0
        for k in range(r):
            datapoints += [(data[n][k],n)]
            temp += data[n][k]
        avg_values.append(round(temp/r))
        if Smoothing:
            avg_string.append(str(2*round((temp/r)/2)))
        else:
            avg_string.append(str(round(temp/r)))
        averages += [(round(temp/r),n)] # make these averages integers for use in grouping later
    SCAT = scatter_plot(datapoints,markersize=2,edgecolor='red',figsize=(10,4),axes_labels=['Sample Values','Sample Number'])
    AVGS = scatter_plot(averages,markersize=50,edgecolor='blue',marker='o',figsize=(7,4))
    freqslist = frequency_distribution(avg_string,1).function().items()
    # compute sample statistics for the raw data as well as for the N averages
    Mean_data = (sum(sum(data))/(N*r)).n()
    # STD_data = sqrt(sum(sum( (data-Mean_data)^2 ))/(N*r)).n()
    Mean_averages = mean(avg_values).n()
    # STD_averages = sqrt(variance(avg_values).n())
    # print "Data mean =",Mean_data," vs Mean of the averages =",Mean_averages
    # print "Data STD = ",STD_data," vs Standard Dev of avgs =", STD_averages
    if Show_Mean:
        # vertical green line at the overall mean of the data
        avg_line = line([(Mean_data,0),(Mean_data,N-1)],rgbcolor='green',thickness=10)
        avg_text = text('xbar',(Mean_data,N),horizontal_alignment='right',rgbcolor='green')
    else:
        avg_line = Graphics()
        avg_text = Graphics()
    # Describe the scatter plot exhibiting uniformly random data and the collection of averages
    print(html("In the random data plot, each row represents a sample with size determined by\n"+
    "the slider above and each circle represents the average for that particular sample.\n"+
    "First, keep sample size relatively low and increase the number of samples. Then, \n"+
    "watch what happens when you slowly increase the sample size."))
    # Describe the plot of the frequencies of the grouped sample averages
    print(html("Now, the averages (i.e. the circles) from above are collected and counted\n"+
    "with the frequency of each average graphed below. For a relatively large number of\n"+
    "samples, notice what seems to happen to these averages as the sample size increases."))
    if Smoothing:
        binRange = Range//2
    else:
        binRange = Range
    # normed=True # if you want to have relative frequencies below
    his_low = 2*rangemax/7
    his_high = 5*rangemax/7
    T = histogram(avg_values,normed=False,bins=binRange,range=(his_low,his_high),axes_labels=['Sample Averages','Frequency'])
    #T = TimeSeries(avg_values).plot_histogram(axes_labels=['Sample Averages','Frequency'])
    pretty_print('Scatter Plot of random data. Each row is one sample.')
    (SCAT+AVGS+avg_line+avg_text).show()
    pretty_print('Histogram of Sample Averages')
    T.show(figsize=(5,2))
So, even with uniformly random data, if you consider the distribution of the collected means rather than the distribution of the raw data, the means appear to follow a bell-shaped distribution.
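The same observation can be checked numerically without the interactive cell. The following is a minimal sketch in plain Python, not the code behind the graphs above. It assumes each data value is an integer drawn uniformly from 0 to Range - 1, and the helper names `sample_means` and `shape` are illustrative only.

```python
import random

def sample_means(N, r, range_max, rng):
    """Return N sample means, each from r integers drawn uniformly from 0..range_max-1."""
    return [sum(rng.randrange(range_max) for _ in range(r)) / r for _ in range(N)]

def shape(values):
    """Return (mean, standard deviation, skewness, kurtosis) from population moments."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n
    return m, m2 ** 0.5, m3 / m2 ** 1.5, m4 / m2 ** 2

rng = random.Random(2024)  # fixed seed (an arbitrary choice) so reruns agree
for r in (1, 2, 5, 30):
    means = sample_means(N=20000, r=r, range_max=100, rng=rng)
    m, sd, skew, kurt = shape(means)
    print(f"r={r:>2}: mean={m:6.2f}  sd={sd:6.2f}  skewness={skew:6.3f}  kurtosis={kurt:5.3f}")
```

With r = 1 the "means" are just the raw uniform data, so the kurtosis should sit well below 3. As r grows, the standard deviation of the means should shrink while the skewness and kurtosis move toward 0 and 3, which is the bell-shaped tendency seen in the histogram.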