Consider then a model of the form
\begin{equation*}
z = \alpha_1 x + \alpha_2 y + \beta
\end{equation*}
and the data points
\begin{equation*}
(x_1,y_1,z_1), (x_2,y_2,z_2), \ldots, (x_n,y_n,z_n).
\end{equation*}
Evaluating the model at these data points gives, in matrix form,
\begin{equation*}
\begin{bmatrix}
z_1
\\ z_2
\\ \vdots
\\ z_n
\end{bmatrix}
=
\begin{bmatrix}
x_1 \amp y_1 \amp 1 \\
x_2 \amp y_2 \amp 1 \\
\vdots \amp \vdots \amp \vdots \\
x_n \amp y_n \amp 1
\end{bmatrix}
\cdot
\begin{bmatrix}
\alpha_1
\\
\alpha_2
\\
\beta
\end{bmatrix}
+
\begin{bmatrix}
\epsilon_1 \\
\epsilon_2 \\
\vdots \\
\epsilon_n
\end{bmatrix}
\end{equation*}
where the \(\epsilon_k\) terms are the deviations between the exact data points and their approximations on the plane. Symbolically,
\begin{equation*}
Z = XA + \epsilon.
\end{equation*}
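As a concrete sketch of how \(X\) and \(Z\) are assembled (using NumPy and made-up data values; the arrays x, y, z and the near-plane numbers below are assumptions for illustration, not values from the text):
\begin{verbatim}
import numpy as np

# Hypothetical data: five points (x_k, y_k, z_k) generated near the
# plane z = 2x - y + 3, with small perturbations so that they do not
# lie exactly on any single plane.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.0, 0.0, 2.0, 1.0, 3.0])
z = 2*x - y + 3 + np.array([0.05, -0.02, 0.03, -0.04, 0.01])

# Design matrix X with rows [x_k, y_k, 1] and observation vector Z.
X = np.column_stack([x, y, np.ones_like(x)])
Z = z
\end{verbatim}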
Unless all of the points lie on the same plane (unlikely), setting the vector \(\epsilon = 0\) leaves an overdetermined system with more independent equations than unknowns. Applying a least squares approach is the same as minimizing \(\epsilon^t \epsilon\) and eventually gives
\begin{equation*}
A = (X^t X)^{-1} X^t Z
\end{equation*}
in general. Evaluating this with \(X\) and \(Z\) as above gives the coefficient vector
\begin{equation*}
A = \begin{bmatrix}
\alpha_1 \\
\alpha_2 \\
\beta
\end{bmatrix}.
\end{equation*}
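As a minimal sketch of this computation (continuing the NumPy example above; the use of np.linalg.solve and np.linalg.lstsq is a standard numerical choice, not something prescribed by the derivation):
\begin{verbatim}
# Normal-equations solution A = (X^t X)^{-1} X^t Z, written as a linear
# solve rather than an explicit matrix inverse for numerical stability.
A = np.linalg.solve(X.T @ X, X.T @ Z)
alpha1, alpha2, beta = A   # close to 2, -1, 3 for the sample data above

# Equivalent, and generally preferred in practice:
A_lstsq, *_ = np.linalg.lstsq(X, Z, rcond=None)
\end{verbatim}
Either route recovers the entries \(\alpha_1, \alpha_2, \beta\) of \(A\) that minimize \(\epsilon^t \epsilon\text{.}\)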