Section 2.5 Multi-variable Regression

The regression models that we have considered so far have always presumed a single independent variable and coefficients that enter "linearly." When investigating cause and effect relationships, it is much more likely that several variables contribute, or that the model depends on the unknown coefficients in a non-linear way. To tease you into considering another course that covers multi-variable regression, in this section we briefly consider a two-variable model. We also consider an interesting example that illustrates the danger of using a model to estimate values well beyond the range of the data that was used to create it.

Consider then a model of the form

\begin{equation*} z = \alpha_1 x + \alpha_2 y + \beta \end{equation*}

and the data points

\begin{equation*} (x_1,y_1,z_1), (x_2,y_2,z_2), \ldots , (x_n,y_n,z_n). \end{equation*}

Evaluating the model at these data points gives, in matrix form,

\begin{equation*} \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} x_1 \amp y_1 \amp 1 \\ x_2 \amp y_2 \amp 1 \\ \vdots \amp \vdots \amp \vdots \\ x_n \amp y_n \amp 1 \end{bmatrix} \cdot \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \end{equation*}

where each \(\epsilon_k\) is the deviation between the exact data point and its approximation on some plane. Symbolically,

\begin{equation*} Z = XA + \epsilon. \end{equation*}

Unless all of the points happen to lie on a single plane (unlikely), setting the vector \(\epsilon = 0\) produces an overdetermined system with more independent equations than unknowns, so no exact solution exists. Applying a least squares approach is the same as minimizing \(\epsilon^t \epsilon\text{,}\) which leads to the normal equations \(X^t X A = X^t Z\) and eventually gives

\begin{equation*} A = (X^t X)^{-1} X^t Z \end{equation*}

in general. Evaluating this with \(X\) and \(Z\) as above gives the coefficient vector

\begin{equation*} A = \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} \end{equation*}
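As a concrete illustration, here is a minimal sketch in Python (using NumPy) of fitting the plane \(z = \alpha_1 x + \alpha_2 y + \beta\) with the formula above. The data points are made up purely for illustration, and the call to np.linalg.lstsq is included only as a numerically stabler cross-check of the normal-equations formula, not as part of the derivation.

```python
import numpy as np

# Hypothetical sample data: n observations of (x, y, z), chosen for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
z = np.array([5.1, 5.9, 9.2, 9.8, 13.9, 14.1])

# Design matrix X: columns for x, y, and a column of ones for the intercept beta.
X = np.column_stack([x, y, np.ones_like(x)])

# Normal-equations solution A = (X^t X)^{-1} X^t Z.
A = np.linalg.inv(X.T @ X) @ X.T @ z
alpha1, alpha2, beta = A
print("alpha1 =", alpha1, "alpha2 =", alpha2, "beta =", beta)

# Cross-check: np.linalg.lstsq solves the same least squares problem
# and should agree with the normal-equations result.
A_lstsq, *_ = np.linalg.lstsq(X, z, rcond=None)
print("lstsq  =", A_lstsq)
```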

A good example of both the usefulness and the limitations of multi-variable regression is the calculation of the "Heat Index." This index measures discomfort relative to the ambient temperature and the relative humidity. Indeed, in warm climates a high temperature is more difficult to bear if the humidity is also high, in part because with high humidity the body is less effective at shedding heat through the evaporation of sweat.

The National Weather Service in 1990 published the following multiple regression equation for the Heat Index (HI) in terms of the ambient temperature (T) and the relative humidity (RH):

\begin{align*} HI \amp = -42.379 + 2.04901523 \cdot T + 10.14333127 \cdot RH - 0.22475541 \cdot T \cdot RH \\ \amp \quad - 6.83783 \cdot 10^{-3} \cdot T^2 - 5.481717 \cdot 10^{-2} \cdot RH^2 + 1.22874 \cdot 10^{-3} \cdot T^2 \cdot RH \\ \amp \quad + 8.5282 \cdot 10^{-4} \cdot T \cdot RH^2 - 1.99 \cdot 10^{-6} \cdot T^2 \cdot RH^2. \end{align*}

Notice that their model includes quadratic and cross terms, and therefore relies on a generalization of the linear result presented above. Details on how this equation was determined are available at https://www.wpc.ncep.noaa.gov/html/heatindex_equation.shtml .

Below one can compute a table of Heat Index values for various ambient temperature readings at a single value of the relative humidity. Notice what happens for a relatively high humidity combined with a relatively high temperature.
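One way to produce such a table is sketched below in Python, which simply evaluates the Heat Index equation above over a range of temperatures at a fixed relative humidity. The function name heat_index and the particular values of T and RH are choices made here for illustration only.

```python
def heat_index(T, RH):
    """NWS regression equation for the Heat Index (T in degrees F, RH in percent)."""
    return (-42.379 + 2.04901523 * T + 10.14333127 * RH
            - 0.22475541 * T * RH - 6.83783e-3 * T**2
            - 5.481717e-2 * RH**2 + 1.22874e-3 * T**2 * RH
            + 8.5282e-4 * T * RH**2 - 1.99e-6 * T**2 * RH**2)

RH = 95.0  # fixed relative humidity (percent), chosen for illustration
print(f"{'T (F)':>8} {'HI (F)':>10}")
for T in range(80, 125, 5):
    print(f"{T:>8} {heat_index(T, RH):>10.1f}")
```

For high humidity combined with high temperature, the computed index grows to physically implausible values, which is exactly the point of the discussion that follows.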

Indeed, you cannot roast a turkey simply by turning the oven to 120F and pumping in a lot of humidity, since the turkey is not trying to cool itself. Any discomfort measured on the turkey's behalf would certainly be matched by the human, since the bird would remain very much uncooked. The issue is that this model does not anticipate the combination of 120F and 95% humidity; in situations where the temperature can reach that level, such as a desert, the relative humidity is correspondingly low. Using a model to predict values beyond the range of the measured data is called extrapolation and should be done with care. Interpolation, estimating values within the confines of the measured data, is however generally a safe bet.