Section 2.2 Linear Regression - Best Fit Line

In the next few sections, we will assume only one independent variable x and one dependent variable y. Toward that end, consider a collection of n data points

\begin{equation*} (x_1,y_1), (x_2,y_2), \ldots , (x_n,y_n) \end{equation*}

and a general linear function

\begin{equation*} f(x) = mx + b. \end{equation*}

It is possible, but generally unlikely, that each of the given data points will be interpolated exactly by the linear function. However, you may notice that the data points exhibit a linear tendency, or the underlying physics might suggest a linear model. A "scatter plot" of an example data set is created in the interactive cell below, and the provided data appears to indicate a linear trend. In general, when this is the case you may find it easier to predict values of y for given values of x using a linear approximation, which is why the line produced by this method is often called a "best-fit line".

But why bother creating a formula (here, a line) to approximate data that does not satisfy that formula exactly? Remember that you would expect collected data to vary slightly upon repeated collection, in the same way that you would expect slightly different scores on repeated attempts at exams over the same material. Creating a formula that is close to your data gives a well-defined way to predict a y value for a given x value. This predictive behavior is illustrated in the exercise below.

To determine this best-fit line, you first need to decide what is meant by the word "best". For linear regression, the goal is to make the total of all vertical deviations between the desired line and the provided data points as small as possible. Each such vertical error is of the form

\begin{equation*} e_k = f(x_k) - y_k \end{equation*}

and would be zero if f(x) exactly interpolated the given data point. Note that some of these errors will be positive and some will be negative. To avoid any cancellation of errors, we could take absolute values (which are tough to deal with algebraically) or square the errors. The second option is the standard approach, and it mirrors the approach taken earlier when developing formulas for the variance.
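
To see the cancellation concretely, here is a quick R sketch; the error values are made up for illustration:

    # signed vertical errors for a hypothetical line and three data points
    errors <- c(2, -2, 0.5)

    sum(errors)    # 0.5 -- the two large errors of opposite sign cancel
    sum(errors^2)  # 8.25 -- squaring keeps every deviation visible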

The best-fit line therefore will be the line \(f(x) = mx+b\) for which the "total squared error" is minimized. This total squared error is given by

\begin{equation*} TSE(m,b) = \sum_{k=1}^n e_k^2 = \sum_{k=1}^n (f(x_k) - y_k)^2 = \sum_{k=1}^n (m x_k + b - y_k)^2. \end{equation*}

In the following interactive cell, try various values for the slope and y-intercept of a line through the given data points, and see if you can make the total squared error as small as possible. In doing so, notice that the vertical distances from the line to the given data points generally decrease as this error measure gets smaller.
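
If you want to experiment outside the interactive cell, a minimal R sketch of the same computation follows; the function name tse and the sample data are my own choices for illustration:

    # total squared error for the line f(x) = m*x + b over data vectors x, y
    tse <- function(m, b, x, y) {
      sum((m * x + b - y)^2)
    }

    # hypothetical sample data with a roughly linear trend
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

    tse(2, 0, x, y)    # try a candidate slope and intercept
    tse(1.5, 1, x, y)  # compare against another guess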

Experiment in the interactive cell above using exactly two data points that have the same x-value, such as (1,1) and (1,2). Next, add some additional data points in the same general vicinity as your original two points. What is the effect on your best-fit line of adding non-functional points?

So that we don't have to guess the best values for the slope and intercept, we can appeal to calculus. To minimize this function of the two variables m and b, start by computing the partial derivatives with respect to each variable:

\begin{equation*} TSE_m = \sum_{k=1}^n 2(m x_k + b - y_k) \cdot x_k \end{equation*}

and

\begin{equation*} TSE_b = \sum_{k=1}^n 2(m x_k + b - y_k) \cdot 1 . \end{equation*}

Setting each of these partial derivatives equal to zero, dividing by 2, and separating the sums gives what are known as the "normal equations":

\begin{equation*} m \sum_{k=1}^n x_k^2 + b \sum_{k=1}^n x_k = \sum_{k=1}^n x_k y_k \end{equation*}

and

\begin{equation*} m \sum_{k=1}^n x_k + b \sum_{k=1}^n 1 = \sum_{k=1}^n y_k. \end{equation*}

Notice that these normal equations form a linear system of equations in the unknowns m and b, which (among perhaps other reasons) is why this method is called linear regression. Since \(\sum_{k=1}^n 1 = n\), the second equation simplifies, and solving the system for m and b gives the best-fit line.
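
For reference, carrying out that solution (assuming the x-values are not all identical, so that the system is nonsingular) yields

\begin{equation*} m = \frac{n \sum_{k=1}^n x_k y_k - \sum_{k=1}^n x_k \sum_{k=1}^n y_k}{n \sum_{k=1}^n x_k^2 - \left( \sum_{k=1}^n x_k \right)^2} \qquad \text{and} \qquad b = \frac{1}{n} \left( \sum_{k=1}^n y_k - m \sum_{k=1}^n x_k \right). \end{equation*}

The following R sketch sets up the 2-by-2 system and solves it with solve(), then cross-checks the answer against R's built-in lm() function; the data vectors are the same made-up sample as above:

    # hypothetical sample data with a roughly linear trend
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
    n <- length(x)

    # coefficient matrix and right-hand side of the normal equations
    A <- matrix(c(sum(x^2), sum(x),
                  sum(x),   n), nrow = 2, byrow = TRUE)
    rhs <- c(sum(x * y), sum(y))

    coeffs <- solve(A, rhs)  # coeffs[1] is the slope m, coeffs[2] is the intercept b
    coeffs

    # cross-check with R's built-in least-squares fit
    coef(lm(y ~ x))          # returns the intercept b, then the slope m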

So, let's see how to graph data points against the best-fit line using R.
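
Here is a minimal sketch, again with made-up data: plot() draws the scatter plot and abline() overlays the fitted line returned by lm().

    # hypothetical sample data with a roughly linear trend
    x <- c(1, 2, 3, 4, 5)
    y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

    fit <- lm(y ~ x)             # least-squares best-fit line

    plot(x, y, pch = 16,         # scatter plot of the data points
         main = "Data with best-fit line")
    abline(fit, col = "blue")    # overlay the fitted line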

Ok, let's see if you can apply this to find a best-fit line.