Consider then a model of the form
\begin{equation*} z = \alpha_1 x + \alpha_2 y + \beta \end{equation*}
and the data points
\begin{equation*} (x_1,y_1,z_1), (x_2,y_2,z_2), \ldots, (x_n,y_n,z_n). \end{equation*}
Evaluating the model at each of these data points gives, in matrix form,
\begin{equation*} \begin{bmatrix} z_1 \\ z_2 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} x_1 \amp y_1 \amp 1 \\ x_2 \amp y_2 \amp 1 \\ \vdots \amp \vdots \amp \vdots \\ x_n \amp y_n \amp 1 \end{bmatrix} \cdot \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \end{equation*}
where each \(\epsilon_k\) is the deviation between the measured value \(z_k\) and the value the plane predicts at \((x_k, y_k)\text{.}\) Symbolically,
\begin{equation*} Z = XA + \epsilon. \end{equation*}
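Before turning to the solution, it may help to see these matrices concretely. The following is a minimal sketch in Python with NumPy; the data points are hypothetical, generated near the plane \(z = 2x + 3y + 1\) with small perturbations playing the role of the \(\epsilon_k\text{.}\)
\begin{verbatim}
import numpy as np

# Hypothetical data: n = 5 points scattered near the plane z = 2x + 3y + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
z = 2.0*x + 3.0*y + 1.0 + np.array([0.1, -0.05, 0.02, -0.1, 0.07])

# Design matrix X: row k is [x_k, y_k, 1], matching the matrix equation above.
X = np.column_stack([x, y, np.ones_like(x)])
Z = z

print(X.shape)  # (5, 3): five equations in the three unknowns
\end{verbatim}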
Setting the vector \(\epsilon = 0\) yields an overdetermined system, with more independent equations than unknowns, so an exact solution exists only if all of the points happen to lie on a single plane (unlikely). The least squares approach instead minimizes \(\epsilon^t \epsilon = (Z - XA)^t (Z - XA)\text{.}\) Setting the derivative of this quantity with respect to \(A\) equal to zero produces the normal equations
\begin{equation*} X^t X A = X^t Z \end{equation*}
and, provided \(X^t X\) is invertible, solving for \(A\) gives
\begin{equation*} A = (X^t X)^{-1} X^t Z \end{equation*}
With \(X\) and \(Z\) as above, evaluating this expression gives the vector
\begin{equation*} A = \begin{bmatrix} \alpha_1 \\ \alpha_2 \\ \beta \end{bmatrix} \end{equation*}
whose entries are the least squares estimates of the plane's coefficients \(\alpha_1\text{,}\) \(\alpha_2\text{,}\) and \(\beta\text{.}\)
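Continuing the sketch above, the formula for \(A\) translates directly into code. In practice np.linalg.lstsq, which solves the least squares problem without forming \((X^t X)^{-1}\) explicitly, is generally preferred for numerical stability; both are shown for comparison.
\begin{verbatim}
import numpy as np

# Same hypothetical data as above, near the plane z = 2x + 3y + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
z = 2.0*x + 3.0*y + 1.0 + np.array([0.1, -0.05, 0.02, -0.1, 0.07])

X = np.column_stack([x, y, np.ones_like(x)])
Z = z

# Direct translation of A = (X^t X)^{-1} X^t Z.
A_formula = np.linalg.inv(X.T @ X) @ X.T @ Z

# Numerically preferable: solve the least squares problem directly.
A_lstsq, residuals, rank, sv = np.linalg.lstsq(X, Z, rcond=None)

print(A_formula)  # approximately [2., 3., 1.] = [alpha1, alpha2, beta]
print(A_lstsq)
\end{verbatim}
Both computations recover a vector close to \([2, 3, 1]\text{,}\) the coefficients used to generate the hypothetical data.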