
Section 2.2 Best-fit Line

In this section, we will presume only one independent variable x and one dependent variable y.

Consider a collection of data points

\begin{equation*} (x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n) \end{equation*}

and a general linear function

\begin{equation*} f(x) = mx + b. \end{equation*}

It is possible that each of the given data points is exactly "interpolated" by the linear function, so that

\begin{equation*} f(x_k) = y_k \end{equation*}

for k = 1, 2, ..., n. In general, however, this is unlikely, since even three points are not likely to be collinear. Still, you may notice that the data points exhibit a linear tendency, or the underlying physics might suggest a linear model. If so, you may find it easier to predict values of y for given values of x using a linear approximation. Here you will investigate a method for doing so, known variously as "linear regression", "least-squares", or the "best-fit line".

But why even bother creating a formula (a line here) to approximate data that does not satisfy that formula? Remember that you would expect collected data to vary slightly each time it is collected, in the same way that you would expect to earn a slightly different score on repeated attempts at exams on the same material. Creating a formula that is close to your data gives a well-defined way to predict a y value for a given x value. This predictive behavior is illustrated in the exercise below.

Checkpoint 2.2.1. WeBWorK - best-fit line for approximation.

To determine the best-fit line, you need to determine what is meant by the word "best". Here, we will derive the standard approach which interprets this to mean that the total vertical error between the line and the provided data points is minimized in some fashion. Indeed, this vertical error would be of the form

\begin{equation*} e_k = f(x_k) - y_k \end{equation*}

and would be zero if f(x) exactly interpolated the given data point. Note that some of these errors will be positive and some will be negative. To avoid any possible cancellation of errors, you can either take absolute values (which are tough to deal with algebraically) or square the errors. The second option is the approach taken here. This is similar to the approach taken earlier when developing formulas for the variance.

The best-fit line therefore will be the line \(f(x) = mx+b\) so that the "total squared error" is minimized. This total squared error is given by

\begin{equation*} TSE(m,b) = \sum_{k=1}^n e_k^2 = \sum_{k=1}^n (f(x_k) - y_k)^2 = \sum_{k=1}^n (m x_k + b - y_k)^2. \end{equation*}

In the following interactive cell, try various values of the slope and y-intercept for the given data points and see if you can make the total squared error as small as possible. In doing so, notice that the vertical distances from the line to the given data points generally decrease as this error measure gets smaller.
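The same trial-and-error experiment can also be sketched in plain Python. This is only an illustration: the function name `total_squared_error` and the four data points are assumptions made up for this example, not data from the text.

```python
def total_squared_error(m, b, xs, ys):
    """Return TSE(m, b), the sum over k of (m*x_k + b - y_k)^2."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

# Try a few candidate (slope, intercept) pairs and compare their errors.
for m, b in [(1.0, 0.0), (1.0, 1.0), (1.5, 0.0), (1.4, 0.5)]:
    error = total_squared_error(m, b, xs, ys)
    print(f"m = {m}, b = {b}: TSE = {error:.2f}")
```

For this particular data the four candidates give total squared errors of 10, 2, 0.5, and 0.2, so each guess improves on the previous one. Guessing in this way never tells you whether an even better pair exists.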

So that we don't have to guess the best values for the slope and intercept, we can appeal to calculus. Indeed, to minimize this function of the two variables m and b, take the partial derivatives with respect to m and b and set them equal to zero to find the critical point:

\begin{equation*} TSE_m = \sum_{k=1}^n 2(m x_k + b - y_k) \cdot x_k \end{equation*}

and

\begin{equation*} TSE_b = \sum_{k=1}^n 2(m x_k + b - y_k) \cdot 1 . \end{equation*}
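As a sanity check on these two formulas, the following sketch compares them with central-difference approximations of the partial derivatives. The function names, the sample data, the test point (m, b) = (1, 1), and the step size `h` are all assumptions chosen for illustration.

```python
def tse(m, b, xs, ys):
    """Total squared error of the line y = m*x + b on the data."""
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys))


def tse_partials(m, b, xs, ys):
    """Return (TSE_m, TSE_b) computed from the two formulas above."""
    d_m = sum(2 * (m * x + b - y) * x for x, y in zip(xs, ys))
    d_b = sum(2 * (m * x + b - y) for x, y in zip(xs, ys))
    return d_m, d_b


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

m, b = 1.0, 1.0   # an arbitrary point at which to compare
h = 1e-6          # small step for the central differences

exact_m, exact_b = tse_partials(m, b, xs, ys)
approx_m = (tse(m + h, b, xs, ys) - tse(m - h, b, xs, ys)) / (2 * h)
approx_b = (tse(m, b + h, xs, ys) - tse(m, b - h, xs, ys)) / (2 * h)

print("formula:           ", exact_m, exact_b)
print("central difference:", approx_m, approx_b)
```

The two rows should agree up to rounding error, which suggests the formulas were differentiated correctly.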

Setting these partial derivatives equal to zero, dividing by 2, and collecting the terms in m and b gives what are known as the "normal equations":

\begin{equation*} m \sum_{k=1}^n x_k^2 + b \sum_{k=1}^n x_k = \sum_{k=1}^n x_k y_k \end{equation*}

and

\begin{equation*} m \sum_{k=1}^n x_k + b \sum_{k=1}^n 1 = \sum_{k=1}^n y_k. \end{equation*}

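Since \(\sum_{k=1}^n 1 = n\), these are two linear equations in the two unknowns m and b. Solving them gives the best-fit line; the solution is unique provided the \(x_k\) are not all equal.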
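One way to carry out this last step is to solve the 2-by-2 system with Cramer's rule. The sketch below does so in plain Python; the function name `best_fit_line` and the sample data are assumptions made for illustration, and a numerical library could be used instead.

```python
def best_fit_line(xs, ys):
    """Solve the normal equations for the slope m and intercept b.

    The system is
        m*Sxx + b*Sx = Sxy
        m*Sx  + b*n  = Sy
    where Sx, Sy, Sxx, Sxy are the sums of x_k, y_k, x_k^2, x_k*y_k.
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # Determinant of the coefficient matrix; it is zero exactly when
    # all the x values are equal (or there are fewer than two points).
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("x values are all equal; the slope is undetermined")

    # Cramer's rule for the two unknowns.
    m = (n * sxy - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return m, b


# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 3.0, 5.0, 6.0]

m, b = best_fit_line(xs, ys)
print(f"best-fit line: f(x) = {m:.2f} x + {b:.2f}")
```

For this data the sums are 10, 16, 30, and 47, so the determinant is 20 and the line is f(x) = 1.4x + 0.5.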

Checkpoint 2.2.2. WeBWorK - Best fit line.