Section 2.2 Best-fit Line
In this section, we will presume only one independent variable x and one dependent variable y. Consider a collection of data points
\[ (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n) \]
and a general linear function
f(x)=mx+b.
It is possible that each of the given data points is exactly "interpolated" by the linear function so that
\[ f(x_k) = y_k \]
for k = 1, 2, ... , n. In general this is unlikely, since even three points are not likely to be collinear. You may notice, however, that the data points exhibit a linear tendency or that the underlying physics suggests a linear model. If so, you may find it easier to predict values of y for given values of x using a linear approximation. Here you will investigate a method for doing so, known variously as "linear regression", "least-squares", or the "best-fit line".
But why even bother creating a formula (a line here) to approximate data that does not satisfy that formula? Remember that collected data will vary slightly each time it is collected, in the same way that you would expect to make a slightly different score on repeated attempts at exams on the same material. Creating a formula that is close to your data gives a well-defined way to predict a y value for a given x value. This predictive behavior is illustrated in the exercise below.
Checkpoint 2.2.1. WeBWorK - best-fit line for approximation.
For each data point, the error between the value given by the line and the observed value is
\[ e_k = f(x_k) - y_k \]
and would be zero if f(x) exactly interpolated the given data point. Note that some of these errors will be positive and some will be negative. To avoid any possible cancellation of errors, you can either take absolute values (which are tough to deal with algebraically) or square the errors. The second option is the approach taken here. This is similar to the approach taken earlier when developing formulas for the variance.
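The cancellation mentioned above can be seen in a small Python sketch. The data points are the defaults from the interactive cell in this section; the helper name `errors` and the trial line (a horizontal line through the mean of the y-values) are illustrative assumptions, not part of the original text.

```python
# Sample data: the default points from the interactive cell in this section.
points = [(-1, 1), (3, 1), (4, 3), (6, 4)]

def errors(m, b, pts):
    """Return the signed errors e_k = f(x_k) - y_k for the line f(x) = m*x + b."""
    return [m * x + b - y for x, y in pts]

# Hypothetical trial line: horizontal, at the mean of the y-values (9/4 = 2.25).
e = errors(0, 2.25, points)

signed_total = sum(e)                     # positive and negative errors cancel
squared_total = sum(ek ** 2 for ek in e)  # squaring prevents the cancellation

print(e)              # [1.25, 1.25, -0.75, -1.75]
print(signed_total)   # 0.0
print(squared_total)  # 6.75
```

The signed errors add up to zero even though the line misses every point, while the sum of the squared errors is clearly nonzero.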
The best-fit line is therefore the line f(x)=mx+b for which the "total squared error" is minimized. This total squared error is given by
\[ \mathrm{TSE}(m,b) = \sum_{k=1}^{n} e_k^2 = \sum_{k=1}^{n} \left( f(x_k) - y_k \right)^2 = \sum_{k=1}^{n} \left( m x_k + b - y_k \right)^2. \]
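As a rough illustration of this formula, the following Python sketch evaluates the total squared error for two trial lines. The function name `total_squared_error` and the trial values of m and b are assumptions chosen for illustration; the data points are the defaults from the interactive cell in this section.

```python
# Sample data: the default points from the interactive cell in this section.
points = [(-1, 1), (3, 1), (4, 3), (6, 4)]

def total_squared_error(m, b, pts):
    """Sum of (m*x_k + b - y_k)^2 over all data points (x_k, y_k)."""
    return sum((m * x + b - y) ** 2 for x, y in pts)

# Two hypothetical trial lines: slope 1 and slope 0.5, both with intercept 1.
tse_steep = total_squared_error(1, 1, points)
tse_flatter = total_squared_error(0.5, 1, points)

print(tse_steep)    # 23
print(tse_flatter)  # 2.5
```

For this data, the flatter line gives a much smaller total squared error, which matches the visual impression that it passes closer to the points.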
In the following interactive cell, try various values for the slope and y-intercept with the given data points and see if you can make the total squared error as small as possible. In doing so, notice that the vertical distances from the line to the given data points generally decrease as this error measure gets smaller.
var('x')
@interact
def _(Points = input_box([(-1,1),(3,1),(4,3),(6,4)]),
      m = slider(-4,4,0.05,1), b = slider(-2,2,0.05,1)):
    G = points(Points,size=20)            # plot the data points
    xpt = []
    ypt = []
    f = m*x + b                           # candidate line for the chosen m and b
    TSE = 0
    for k in range(len(Points)):
        x0 = Points[k][0]
        xpt.append(x0)
        y0 = Points[k][1]
        ypt.append(y0)
        TSE += (f(x=x0) - y0)^2           # accumulate the squared error e_k^2
        # vertical segment from the line to the data point
        G += line([(x0,f(x=x0)),(x0,y0)],color='orange')
    # draw the candidate line across the range of the x-values
    G += plot(f,x,min(xpt),max(xpt),color='gray')
    T = 'Total Squared Error = $%s$'%str(TSE)
    G.show(title = T)
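To minimize the total squared error, take the partial derivatives with respect to m and b. These are
\[ \mathrm{TSE}_m = \sum_{k=1}^{n} 2\left( m x_k + b - y_k \right) \cdot x_k \]
and
\[ \mathrm{TSE}_b = \sum_{k=1}^{n} 2\left( m x_k + b - y_k \right) \cdot 1. \]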
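As a sanity check on these two partial derivatives, the following Python sketch compares them with central finite differences of the total squared error. The function names, the step size h, and the trial values m = 1, b = 1 are illustrative assumptions; the data points are the defaults from the interactive cell above.

```python
# Sample data: the default points from the interactive cell in this section.
points = [(-1, 1), (3, 1), (4, 3), (6, 4)]

def tse(m, b, pts):
    """Total squared error for the line f(x) = m*x + b."""
    return sum((m * x + b - y) ** 2 for x, y in pts)

def tse_m(m, b, pts):
    """Partial derivative of TSE with respect to m."""
    return sum(2 * (m * x + b - y) * x for x, y in pts)

def tse_b(m, b, pts):
    """Partial derivative of TSE with respect to b."""
    return sum(2 * (m * x + b - y) for x, y in pts)

# Hypothetical trial point and step size for the finite differences.
m0, b0, h = 1.0, 1.0, 1e-3
fd_m = (tse(m0 + h, b0, points) - tse(m0 - h, b0, points)) / (2 * h)
fd_b = (tse(m0, b0 + h, points) - tse(m0, b0 - h, points)) / (2 * h)

print(tse_m(m0, b0, points), fd_m)  # analytic value, then numerical estimate
print(tse_b(m0, b0, points), fd_b)
```

At this trial point the analytic values are 72 and 14, and the finite-difference estimates should agree with them up to rounding, since TSE is quadratic in both m and b.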
Setting each of these equal to zero and rearranging gives what are known as the "normal equations":
\[ m \sum_{k=1}^{n} x_k^2 + b \sum_{k=1}^{n} x_k = \sum_{k=1}^{n} x_k y_k \]
and
\[ m \sum_{k=1}^{n} x_k + b \sum_{k=1}^{n} 1 = \sum_{k=1}^{n} y_k. \]
Solving this pair of linear equations for m and b gives the best-fit line.
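One way to carry out this last step is sketched below in Python, solving the two normal equations by Cramer's rule with exact rational arithmetic. The function name `best_fit_line` and the choice of Cramer's rule are assumptions made for illustration; the data points are the defaults from the interactive cell above.

```python
from fractions import Fraction

def best_fit_line(pts):
    """Solve the normal equations for (m, b) using Cramer's rule."""
    n = len(pts)
    sx = sum(Fraction(x) for x, _ in pts)        # sum of x_k
    sy = sum(Fraction(y) for _, y in pts)        # sum of y_k
    sxx = sum(Fraction(x) * x for x, _ in pts)   # sum of x_k^2
    sxy = sum(Fraction(x) * y for x, y in pts)   # sum of x_k * y_k

    # System:  sxx*m + sx*b = sxy
    #          sx*m  + n*b  = sy
    det = sxx * n - sx * sx
    if det == 0:
        raise ValueError("all x-values are equal; the slope is undetermined")
    m = (sxy * n - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return m, b

# Sample data: the default points from the interactive cell in this section.
points = [(-1, 1), (3, 1), (4, 3), (6, 4)]
m, b = best_fit_line(points)
print(m, b)  # 11/26 51/52
```

For this data the normal equations are 62m + 12b = 38 and 12m + 4b = 9, so the slope is 11/26 (about 0.42) and the y-intercept is 51/52 (about 0.98), both of which fall within the slider ranges of the interactive cell.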