[proofplan]
The least-squares estimator $\hat{\beta} = (X^\top X)^{-1} X^\top \mathbf Y$ is a **fixed linear function** of the data vector $\mathbf Y$. Under the normal linear model $\mathbf Y \sim N_n(X\beta, \sigma^2 I_n)$, any linear function of $\mathbf Y$ is itself multivariate normal. We therefore only need to compute the mean and covariance of $\hat{\beta}$: the mean follows from $\mathbb E[\mathbf Y] = X\beta$ and the identity $(X^\top X)^{-1}(X^\top X) = I_p$, while the covariance follows from $\operatorname{Cov}(\mathbf Y) = \sigma^2 I_n$ together with the algebraic simplification $C(\sigma^2 I)C^\top = \sigma^2(X^\top X)^{-1}$ for $C = (X^\top X)^{-1}X^\top$.
[/proofplan]
[step:Express $\hat{\beta}$ as a linear function $C\mathbf Y$ of the data]
Under the normal linear model, we assume $X \in \mathbb R^{n \times p}$ has full column rank $p < n$, so that $X^\top X \in \mathbb R^{p \times p}$ is symmetric and positive definite, hence invertible. By the [normal equations](/theorems/501), the ordinary least squares estimator is
\begin{align*}
\hat{\beta} = (X^\top X)^{-1} X^\top \mathbf Y.
\end{align*}
Define the deterministic matrix
\begin{align*}
C : \mathbb R^n &\to \mathbb R^p \\
\mathbf y &\mapsto (X^\top X)^{-1} X^\top \mathbf y,
\end{align*}
so that $C \in \mathbb R^{p \times n}$ and $\hat{\beta} = C \mathbf Y$. Note two useful identities we will cite below:
\begin{align*}
CX = (X^\top X)^{-1} X^\top X = I_p, \qquad CC^\top = (X^\top X)^{-1} X^\top X (X^\top X)^{-1} = (X^\top X)^{-1}.
\end{align*}
The first uses the defining identity $A^{-1}A = I$ with $A = X^\top X$; the second uses $(X^\top X)^{-1}$ being symmetric (the inverse of a symmetric matrix is symmetric), so $((X^\top X)^{-1})^\top = (X^\top X)^{-1}$, giving $C^\top = X(X^\top X)^{-1}$.
[/step]
[step:Conclude multivariate normality of $\hat{\beta}$ from linearity]
By the model assumption $\mathbf Y \sim N_n(X\beta, \sigma^2 I_n)$. Multivariate normality is preserved under linear transformations: if $\mathbf Y \sim N_n(\mu, \Sigma)$ and $C$ is a deterministic $p \times n$ matrix, then $C\mathbf Y \sim N_p(C\mu, C\Sigma C^\top)$. This is the [behaviour of multivariate normals under linear maps](/theorems/1434) applied to the full-rank matrix $C$. Therefore $\hat{\beta} = C\mathbf Y$ is multivariate normal in $\mathbb R^p$:
\begin{align*}
\hat{\beta} \sim N_p\bigl( C(X\beta),\; C(\sigma^2 I_n) C^\top \bigr).
\end{align*}
It remains to compute the two parameters.
[guided]
The key observation is that $\hat{\beta}$ is a deterministic linear image of $\mathbf Y$: the matrix $C = (X^\top X)^{-1}X^\top$ depends on the design $X$, which is non-random, **not** on $\mathbf Y$. Linear combinations of jointly Gaussian variables are Gaussian, so once we know $\mathbf Y$ is jointly Gaussian and $C$ is constant, $C\mathbf Y$ is automatically Gaussian. This avoids any direct density computation.
Concretely, we invoke the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) in its general linear form: if $\mathbf Y \sim N_n(\mu, \Sigma)$ and $C \in \mathbb R^{p \times n}$ is a deterministic matrix, then the characteristic function of $C\mathbf Y$ evaluated at $\mathbf s \in \mathbb R^p$ is
\begin{align*}
\phi_{C\mathbf Y}(\mathbf s) = \mathbb E[e^{i\mathbf s^\top C\mathbf Y}] = \phi_{\mathbf Y}(C^\top \mathbf s) = \exp\!\left( i\mathbf s^\top(C\mu) - \tfrac{1}{2} \mathbf s^\top (C\Sigma C^\top)\mathbf s \right),
\end{align*}
which is the characteristic function of $N_p(C\mu, C\Sigma C^\top)$. Characteristic functions determine distributions uniquely, so $C\mathbf Y$ has exactly that distribution.
In our case $\mu = X\beta$ and $\Sigma = \sigma^2 I_n$, so
\begin{align*}
\hat{\beta} = C\mathbf Y \sim N_p\!\bigl( C X\beta,\; \sigma^2 CC^\top \bigr).
\end{align*}
The remaining steps compute $CX\beta$ and $CC^\top$ in closed form.
[/guided]
[/step]
[step:Compute the mean $\mathbb E[\hat{\beta}] = \beta$]
Using linearity of expectation (valid because every component of $C\mathbf Y$ is integrable — being jointly Gaussian, all moments exist),
\begin{align*}
\mathbb E[\hat{\beta}] = \mathbb E[C\mathbf Y] = C\, \mathbb E[\mathbf Y] = C X\beta.
\end{align*}
Apply the identity $CX = I_p$ from Step 1:
\begin{align*}
\mathbb E[\hat{\beta}] = I_p \beta = \beta.
\end{align*}
Thus $\hat{\beta}$ is an unbiased estimator of $\beta$.
[/step]
[step:Compute the covariance $\operatorname{Cov}(\hat{\beta}) = \sigma^2(X^\top X)^{-1}$]
Using $\operatorname{Cov}(C\mathbf Y) = C\operatorname{Cov}(\mathbf Y)C^\top$ (valid for any deterministic $C$ and any random vector $\mathbf Y$ with finite second moments — satisfied here because $\mathbf Y$ is Gaussian),
\begin{align*}
\operatorname{Cov}(\hat{\beta}) = C\, (\sigma^2 I_n)\, C^\top = \sigma^2\, CC^\top.
\end{align*}
Apply the identity $CC^\top = (X^\top X)^{-1}$ from Step 1:
\begin{align*}
\operatorname{Cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}.
\end{align*}
[guided]
We expand the product $CC^\top$ explicitly to make the cancellation visible. With $C = (X^\top X)^{-1} X^\top$ and $C^\top = X (X^\top X)^{-1}$ (using that $(X^\top X)^{-1}$ is symmetric, being the inverse of a symmetric matrix),
\begin{align*}
CC^\top = (X^\top X)^{-1} X^\top \cdot X (X^\top X)^{-1} = (X^\top X)^{-1} (X^\top X) (X^\top X)^{-1} = (X^\top X)^{-1}.
\end{align*}
The middle factors $(X^\top X)(X^\top X)^{-1}$ collapse to $I_p$, leaving a single $(X^\top X)^{-1}$. Multiplying by $\sigma^2$ (from the covariance of $\mathbf Y$) gives the covariance of $\hat{\beta}$:
\begin{align*}
\operatorname{Cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}.
\end{align*}
This is a beautiful and characteristic formula for least-squares regression: the entire dependence on the design is packaged into the matrix $(X^\top X)^{-1}$. Its diagonal entries $(X^\top X)^{-1}_{jj}$ give the variances of the individual coefficient estimates $\hat\beta_j$, while off-diagonal entries $(X^\top X)^{-1}_{jk}$ give the covariances $\operatorname{Cov}(\hat\beta_j, \hat\beta_k)$. An "ill-conditioned" design (one where $X^\top X$ is nearly singular) inflates $(X^\top X)^{-1}$ and hence the variability of the estimates — this is the phenomenon of multicollinearity.
[/guided]
[/step]
[step:Assemble the final distribution]
Combining the three preceding steps,
\begin{align*}
\hat{\beta} \sim N_p\!\bigl( \beta,\; \sigma^2 (X^\top X)^{-1} \bigr).
\end{align*}
This is the claimed distribution of the least squares estimator under the normal linear model: Gaussian, centred at the true parameter $\beta$, with covariance $\sigma^2 (X^\top X)^{-1}$.
[/step]