Section 2.2 Best-fit Line
In this section, we will presume only one independent variable x and one dependent variable y.
Consider a collection of data points
\begin{equation*}
(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)
\end{equation*}
and a general linear function
\begin{equation*}
f(x) = mx + b.
\end{equation*}
It is possible that each of the given data points is exactly "interpolated" by the linear function so that
\begin{equation*}
f(x_k) = y_k
\end{equation*}
for k = 1, 2, ..., n. In general, however, this is unlikely, since even three points are rarely collinear. You may nevertheless notice that the data points exhibit a linear tendency, or the underlying physics might suggest a linear model. If so, you may find it easier to predict values of y for given values of x using a linear approximation. Here you will investigate a method for doing so, called "linear regression", "least-squares", or the "best-fit line".
But why bother creating a formula (here, a line) to approximate data that does not satisfy that formula exactly? Remember that collected data varies slightly upon repeated collection, in the same way that you would expect a slightly different score on repeated attempts at exams on the same material. Creating a formula that is close to your data gives a well-defined way to predict a y value for a given x value. This predictive behavior is illustrated in the exercise below.
A train station has determined that the relationship between the number of passengers on a train and the total weight of luggage stored in the baggage compartment can be estimated by the least squares regression equation \(y = 177 + 24 x\text{.}\) Predict the weight of luggage for a train with 148 passengers.
Answer: \(3729\) pounds
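Making this kind of prediction amounts to evaluating the regression equation at the given x value. A minimal sketch (the function name is ours, not part of the exercise):

```python
# Predict luggage weight from passenger count using the regression
# equation y = 177 + 24x given in the exercise above.
def predict_weight(passengers):
    return 177 + 24 * passengers

print(predict_weight(148))  # 3729 pounds
```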
To determine the best-fit line, you need to determine what is meant by the word "best". Here, we will derive the standard approach which interprets this to mean that the total vertical error between the line and the provided data points is minimized in some fashion. Indeed, this vertical error would be of the form
\begin{equation*}
e_k = f(x_k) - y_k
\end{equation*}
and would be zero if f(x) exactly interpolated the given data point. Note that some of these errors will be positive and some will be negative. To avoid any possible cancellation of errors, you can either take absolute values (which are difficult to deal with algebraically) or square the errors. The second option is the approach taken here, and it is similar to the approach taken earlier when developing formulas for the variance.
The best-fit line therefore will be the line \(f(x) = mx+b\) so that the "total squared error" is minimized. This total squared error is given by
\begin{equation*}
TSE(m,b) = \sum_{k=1}^n e_k^2 = \sum_{k=1}^n (f(x_k) - y_k)^2 = \sum_{k=1}^n (m x_k + b - y_k)^2.
\end{equation*}
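The total squared error above is straightforward to compute directly. Here is a minimal sketch, using small made-up data points for illustration:

```python
# Total squared error TSE(m, b) of the candidate line f(x) = m*x + b
# against data points (x_k, y_k), matching the formula above.
def tse(m, b, xs, ys):
    return sum((m * xk + b - yk) ** 2 for xk, yk in zip(xs, ys))

# Hypothetical data: the line y = 2x + 1 passes exactly through
# (1, 3) and (2, 5), so its total squared error is 0.
print(tse(2, 1, [1, 2], [3, 5]))  # 0
# A worse line accumulates squared vertical errors:
print(tse(1, 0, [1, 2], [3, 5]))  # (1-3)^2 + (2-5)^2 = 13
```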
In the following interactive cell, try various values for the slope and y-intercept for the given data points and see if you can make the total squared error as small as possible. In doing so, notice that the vertical distances from the line to the given data points generally decrease as this error measure gets smaller.
So that we don’t have to guess the best values for slope and intercept, we can appeal to calculus. To minimize this function of the two variables m and b, take partial derivatives and set them equal to zero to find the critical point:
\begin{equation*}
TSE_m = \sum_{k=1}^n 2(m x_k + b - y_k) \cdot x_k
\end{equation*}
and
\begin{equation*}
TSE_b = \sum_{k=1}^n 2(m x_k + b - y_k) \cdot 1 .
\end{equation*}
Setting these partial derivatives equal to zero and simplifying gives what are known as the "normal equations":
\begin{equation*}
m \sum_{k=1}^n x_k^2 + b \sum_{k=1}^n x_k = \sum_{k=1}^n x_k y_k
\end{equation*}
and
\begin{equation*}
m \sum_{k=1}^n x_k + b \sum_{k=1}^n 1 = \sum_{k=1}^n y_k.
\end{equation*}
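Since \(\sum_{k=1}^n 1 = n\), the second normal equation reads \(m \sum x_k + b n = \sum y_k\). Eliminating b between the two equations yields the familiar closed-form solution:
\begin{equation*}
m = \frac{n \sum_{k=1}^n x_k y_k - \left(\sum_{k=1}^n x_k\right)\left(\sum_{k=1}^n y_k\right)}{n \sum_{k=1}^n x_k^2 - \left(\sum_{k=1}^n x_k\right)^2}, \qquad b = \frac{\sum_{k=1}^n y_k - m \sum_{k=1}^n x_k}{n}.
\end{equation*}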
Solving these two linear equations for m and b gives the best-fit line.
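The normal equations can be solved directly from the five sums they involve. A minimal sketch (function name and sample data are ours, chosen so the exact answer is known):

```python
# Solve the normal equations for the best-fit line f(x) = m*x + b
# using the sums n, sum(x), sum(y), sum(x^2), and sum(x*y).
def best_fit_line(xs, ys):
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

# Points lying exactly on y = 2x + 1 should be recovered exactly.
m, b = best_fit_line([1, 2, 3], [3, 5, 7])
print(m, b)  # 2.0 1.0
```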
Checkpoint 2.2.2. WeBWorK - Best fit line.
A study was conducted to determine whether the final grade of a student in an introductory psychology course is linearly related to his or her performance on the verbal ability test administered before college entrance. The verbal scores and final grades for \(10\) students are shown in the table below.
Student | Verbal Score \(x\) | Final Grade \(y\)
\(1\) | \(31\) | \(38\)
\(2\) | \(77\) | \(91\)
\(3\) | \(68\) | \(84\)
\(4\) | \(40\) | \(48\)
\(5\) | \(76\) | \(93\)
\(6\) | \(77\) | \(88\)
\(7\) | \(32\) | \(39\)
\(8\) | \(36\) | \(39\)
\(9\) | \(44\) | \(53\)
\(10\) | \(46\) | \(59\)
Find the least squares line.
\(\hat{y} =\) ______ \(+\) ______ \(x\)
Should the regression be used to predict the final grade of a student with a verbal score of 100?
answer:
Sometimes it is useful to organize these calculations in a spreadsheet such as Spreadsheet 1.
To summarize these ideas about the best-fit line and to show how to create one using EXCEL, a TI graphing calculator, and the normal equations, consider watching this video.
Data/Best_Fit_Line.xlsx