[proofplan]
We isolate the part of the predictor $x_j$ that is orthogonal to the intercept and the remaining predictors. This residual vector $z_j$ is the only direction in the design that can identify the coefficient $\beta_j$ after the other columns have been accounted for. We then express $\hat{\beta}_j-\beta_j$ as the scalar projection of the noise vector $\varepsilon$ onto $z_j$, compute its conditional variance using homoskedasticity, and finally use the auxiliary regression identity $z_j^\top z_j=S_{jj}(1-R_j^2)$.
[/proofplan]
custom_env
admin
[step:Residualize $x_j$ against the remaining columns of the design]Let
\begin{align*}
A := X_{-j} \in \mathbb{R}^{n \times p}.
\end{align*}
Since $X$ has full column rank, the columns of $A$ are linearly independent, so $A^\top A \in \mathbb{R}^{p \times p}$ is invertible. Define the [linear map](/page/Linear%20Map)
\begin{align*}
P_A: \mathbb{R}^n &\to \mathbb{R}^n \\
v &\mapsto A(A^\top A)^{-1}A^\top v.
\end{align*}
This is the [orthogonal projection](/theorems/437) onto $\operatorname{Range}(A)$ with respect to the Euclidean inner product $u^\top v$ on $\mathbb{R}^n$.
Define the residual vector
\begin{align*}
z_j := (I_n-P_A)x_j \in \mathbb{R}^n.
\end{align*}
Then $A^\top z_j=0$, because
\begin{align*}
A^\top z_j
&= A^\top x_j-A^\top A(A^\top A)^{-1}A^\top x_j \\
&= A^\top x_j-A^\top x_j \\
&=0.
\end{align*}
Also $z_j \ne 0$. If $z_j=0$, then $x_j=P_Ax_j \in \operatorname{Range}(A)$, so the column $x_j$ would be a linear combination of the remaining columns of $X$, contradicting the full column rank of $X$.[/step]
custom_env
admin
[guided]We want to separate the contribution of $x_j$ from the contribution of the other regressors. Let
\begin{align*}
A := X_{-j} \in \mathbb{R}^{n \times p}.
\end{align*}
The matrix $A$ contains the intercept column and every predictor column except $x_j$. Since the full design matrix $X$ has full column rank, deleting one column leaves a matrix $A$ whose columns are still linearly independent. Therefore $A^\top A$ is invertible.
Define the linear map
\begin{align*}
P_A: \mathbb{R}^n &\to \mathbb{R}^n \\
v &\mapsto A(A^\top A)^{-1}A^\top v.
\end{align*}
This map is the Euclidean orthogonal projection onto $\operatorname{Range}(A)$. The residual part of $x_j$ after regressing on the other columns is
\begin{align*}
z_j := (I_n-P_A)x_j \in \mathbb{R}^n.
\end{align*}
The key property of $z_j$ is orthogonality to every column of $A$. Indeed,
\begin{align*}
A^\top z_j
&= A^\top (I_n-P_A)x_j \\
&= A^\top x_j-A^\top A(A^\top A)^{-1}A^\top x_j \\
&= A^\top x_j-A^\top x_j \\
&=0.
\end{align*}
The vector $z_j$ cannot vanish. If $z_j=0$, then $x_j=P_Ax_j$, so $x_j$ lies in the span of the columns of $A$. That would make the columns of $X$ linearly dependent, contradicting the assumed full column rank of $X$.[/guided]
custom_env
admin
[step:Express the coefficient error as a projection of the noise onto $z_j$]Let $\hat{\beta}_j$ denote the coordinate of $\hat{\beta}$ corresponding to the predictor column $x_j$. Since $\hat{\beta}$ satisfies the normal equations,
\begin{align*}
X^\top(Y-X\hat{\beta})=0.
\end{align*}
In particular,
\begin{align*}
z_j^\top(Y-X\hat{\beta})=0,
\end{align*}
because $z_j$ is orthogonal to every column of $A=X_{-j}$ and hence only tests the $x_j$ direction. Write the true coefficient vector as
\begin{align*}
\beta=(\beta_0,\beta_1,\dots,\beta_p)^\top \in \mathbb{R}^{p+1}.
\end{align*}
Using $Y=X\beta+\varepsilon$ and $z_j^\top A=0$, we obtain
\begin{align*}
0
&= z_j^\top(Y-X\hat{\beta}) \\
&= z_j^\top(X\beta+\varepsilon-X\hat{\beta}) \\
&= z_j^\top\varepsilon+z_j^\top X(\beta-\hat{\beta}) \\
&= z_j^\top\varepsilon+z_j^\top x_j(\beta_j-\hat{\beta}_j).
\end{align*}
Since $z_j=(I_n-P_A)x_j$ and $P_A$ is an orthogonal projection, $z_j^\top x_j=z_j^\top z_j$. Therefore
\begin{align*}
\hat{\beta}_j-\beta_j=\frac{z_j^\top\varepsilon}{z_j^\top z_j}.
\end{align*}[/step]
custom_env
admin
[guided]The normal equations for ordinary least squares say that the fitted residual vector is orthogonal to every column of the design:
\begin{align*}
X^\top(Y-X\hat{\beta})=0.
\end{align*}
The vector $z_j$ is not itself a column of $X$, but it lies in the span of the columns of $X$: it is $x_j$ minus its projection onto the span of the remaining columns. Therefore the same orthogonality gives
\begin{align*}
z_j^\top(Y-X\hat{\beta})=0.
\end{align*}
Write
\begin{align*}
\beta=(\beta_0,\beta_1,\dots,\beta_p)^\top \in \mathbb{R}^{p+1}.
\end{align*}
Substitute the model equation $Y=X\beta+\varepsilon$ into the preceding identity:
\begin{align*}
0
&= z_j^\top(Y-X\hat{\beta}) \\
&= z_j^\top(X\beta+\varepsilon-X\hat{\beta}) \\
&= z_j^\top\varepsilon+z_j^\top X(\beta-\hat{\beta}).
\end{align*}
Now $X$ consists of the columns in $A=X_{-j}$ together with $x_j$. Since $z_j$ is orthogonal to every column of $A$, all terms in $z_j^\top X(\beta-\hat{\beta})$ vanish except the one involving $x_j$. Hence
\begin{align*}
0=z_j^\top\varepsilon+z_j^\top x_j(\beta_j-\hat{\beta}_j).
\end{align*}
Because $z_j=(I_n-P_A)x_j$ and $P_Ax_j$ is orthogonal to $z_j$, we have
\begin{align*}
z_j^\top x_j
&= z_j^\top(P_Ax_j+z_j) \\
&= z_j^\top P_Ax_j+z_j^\top z_j \\
&= z_j^\top z_j.
\end{align*}
Since $z_j\ne 0$, the scalar $z_j^\top z_j$ is positive, so division is valid. We get
\begin{align*}
\hat{\beta}_j-\beta_j=\frac{z_j^\top\varepsilon}{z_j^\top z_j}.
\end{align*}
This identity is the Frisch-Waugh-Lovell residualization mechanism in this setting: the uncertainty in $\hat{\beta}_j$ is exactly the uncertainty in the noise projected onto the part of $x_j$ not explained by the other regressors.[/guided]
custom_env
admin
[step:Compute the conditional variance of the residualized coefficient]
Since $X$ is fixed and $z_j$ is determined by $X$, the vector $z_j$ is non-random conditional on $X$. Using $\operatorname{Var}(\varepsilon\mid X)=\sigma^2 I_n$, we compute
\begin{align*}
\operatorname{Var}(\hat{\beta}_j\mid X)
&= \operatorname{Var}\left(\frac{z_j^\top\varepsilon}{z_j^\top z_j}\,\middle|\,X\right) \\
&= \frac{1}{(z_j^\top z_j)^2}\operatorname{Var}(z_j^\top\varepsilon\mid X) \\
&= \frac{1}{(z_j^\top z_j)^2}z_j^\top\operatorname{Var}(\varepsilon\mid X)z_j \\
&= \frac{1}{(z_j^\top z_j)^2}z_j^\top(\sigma^2 I_n)z_j \\
&= \frac{\sigma^2}{z_j^\top z_j}.
\end{align*}
[/step]
custom_env
admin
[step:Identify the residual sum of squares with $S_{jj}(1-R_j^2)$]By definition, $R_j^2$ is the coefficient of determination from the least-squares regression of $x_j$ on the columns of $A=X_{-j}$. The fitted vector in that auxiliary regression is $P_Ax_j$, and the residual vector is $z_j=(I_n-P_A)x_j$. Since the predictors are centred and the intercept column belongs to $A$, the total sum of squares in the auxiliary regression is
\begin{align*}
S_{jj}=x_j^\top x_j.
\end{align*}
The residual sum of squares is
\begin{align*}
z_j^\top z_j.
\end{align*}
Therefore
\begin{align*}
R_j^2
&=1-\frac{z_j^\top z_j}{S_{jj}},
\end{align*}
and hence
\begin{align*}
z_j^\top z_j=S_{jj}(1-R_j^2).
\end{align*}[/step]
custom_env
admin
[guided]The auxiliary regression is the least-squares regression with response vector $x_j\in\mathbb{R}^n$ and design matrix $A=X_{-j}$. Its fitted vector is the projection of $x_j$ onto the column space of $A$:
\begin{align*}
P_Ax_j.
\end{align*}
Its residual vector is therefore
\begin{align*}
x_j-P_Ax_j=(I_n-P_A)x_j=z_j.
\end{align*}
Thus the residual sum of squares in the auxiliary regression is
\begin{align*}
z_j^\top z_j.
\end{align*}
The total sum of squares is computed around the sample mean of the auxiliary response $x_j$. The predictor $x_j$ is centred by hypothesis, so its sample mean is $\bar{x}_j=0$. Hence
\begin{align*}
S_{jj}
= \sum_{i=1}^n (x_{ij}-\bar{x}_j)^2
= \sum_{i=1}^n x_{ij}^2
= x_j^\top x_j.
\end{align*}
The coefficient of determination is
\begin{align*}
R_j^2
=1-\frac{\text{residual sum of squares}}{\text{total sum of squares}}
=1-\frac{z_j^\top z_j}{S_{jj}}.
\end{align*}
Rearranging gives
\begin{align*}
z_j^\top z_j=S_{jj}(1-R_j^2).
\end{align*}
This identity is exactly where collinearity enters the variance calculation: if $x_j$ is nearly explained by the other columns, then $z_j^\top z_j$ is small and $R_j^2$ is close to $1$.[/guided]
custom_env
admin
[step:Substitute the auxiliary regression identity into the variance formula]
Combining the variance computation with the auxiliary regression identity gives
\begin{align*}
\operatorname{Var}(\hat{\beta}_j\mid X)
&= \frac{\sigma^2}{z_j^\top z_j} \\
&= \frac{\sigma^2}{S_{jj}(1-R_j^2)}.
\end{align*}
Since $z_j\ne 0$, the denominator is positive. This proves the stated variance inflation factor formula.
[/step]