Distribution of the Least Squares Estimator — Statement & Proof

Distribution of the Least Squares Estimator (Theorem # 1442)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The least-squares estimator $\hat{\beta} = (X^\top X)^{-1} X^\top \mathbf Y$ is a **fixed linear function** of the data vector $\mathbf Y$. Under the normal linear model $\mathbf Y \sim N_n(X\beta, \sigma^2 I_n)$, any linear function of $\mathbf Y$ is itself multivariate normal. We therefore only need to compute the mean and covariance of $\hat{\beta}$: the mean follows from $\mathbb E[\mathbf Y] = X\beta$ and the identity $(X^\top X)^{-1}(X^\top X) = I_p$, while the covariance follows from $\operatorname{Cov}(\mathbf Y) = \sigma^2 I_n$ together with the algebraic simplification $C(\sigma^2 I)C^\top = \sigma^2(X^\top X)^{-1}$ for $C = (X^\top X)^{-1}X^\top$. [/proofplan] [step:Express $\hat{\beta}$ as a linear function $C\mathbf Y$ of the data] Under the normal linear model, we assume $X \in \mathbb R^{n \times p}$ has full column rank $p < n$, so that $X^\top X \in \mathbb R^{p \times p}$ is symmetric and positive definite, hence invertible. By the [normal equations](/theorems/501), the ordinary least squares estimator is \begin{align*} \hat{\beta} = (X^\top X)^{-1} X^\top \mathbf Y. \end{align*} Define the deterministic matrix \begin{align*} C : \mathbb R^n &\to \mathbb R^p \\ \mathbf y &\mapsto (X^\top X)^{-1} X^\top \mathbf y, \end{align*} so that $C \in \mathbb R^{p \times n}$ and $\hat{\beta} = C \mathbf Y$. Note two useful identities we will cite below: \begin{align*} CX = (X^\top X)^{-1} X^\top X = I_p, \qquad CC^\top = (X^\top X)^{-1} X^\top X (X^\top X)^{-1} = (X^\top X)^{-1}. \end{align*} The first uses the defining identity $A^{-1}A = I$ with $A = X^\top X$; the second uses $(X^\top X)^{-1}$ being symmetric (the inverse of a symmetric matrix is symmetric), so $((X^\top X)^{-1})^\top = (X^\top X)^{-1}$, giving $C^\top = X(X^\top X)^{-1}$. [/step] [step:Conclude multivariate normality of $\hat{\beta}$ from linearity] By the model assumption $\mathbf Y \sim N_n(X\beta, \sigma^2 I_n)$. Multivariate normality is preserved under linear transformations: if $\mathbf Y \sim N_n(\mu, \Sigma)$ and $C$ is a deterministic $p \times n$ matrix, then $C\mathbf Y \sim N_p(C\mu, C\Sigma C^\top)$. This is the [behaviour of multivariate normals under linear maps](/theorems/1434) applied to the full-rank matrix $C$. Therefore $\hat{\beta} = C\mathbf Y$ is multivariate normal in $\mathbb R^p$: \begin{align*} \hat{\beta} \sim N_p\bigl( C(X\beta),\; C(\sigma^2 I_n) C^\top \bigr). \end{align*} It remains to compute the two parameters. [guided] The key observation is that $\hat{\beta}$ is a deterministic linear image of $\mathbf Y$: the matrix $C = (X^\top X)^{-1}X^\top$ depends on the design $X$, which is non-random, **not** on $\mathbf Y$. Linear combinations of jointly Gaussian variables are Gaussian, so once we know $\mathbf Y$ is jointly Gaussian and $C$ is constant, $C\mathbf Y$ is automatically Gaussian. This avoids any direct density computation. Concretely, we invoke the [Orthogonal Transformations Preserve Multivariate Normality](/theorems/1434) in its general linear form: if $\mathbf Y \sim N_n(\mu, \Sigma)$ and $C \in \mathbb R^{p \times n}$ is a deterministic matrix, then the characteristic function of $C\mathbf Y$ evaluated at $\mathbf s \in \mathbb R^p$ is \begin{align*} \phi_{C\mathbf Y}(\mathbf s) = \mathbb E[e^{i\mathbf s^\top C\mathbf Y}] = \phi_{\mathbf Y}(C^\top \mathbf s) = \exp\!\left( i\mathbf s^\top(C\mu) - \tfrac{1}{2} \mathbf s^\top (C\Sigma C^\top)\mathbf s \right), \end{align*} which is the characteristic function of $N_p(C\mu, C\Sigma C^\top)$. Characteristic functions determine distributions uniquely, so $C\mathbf Y$ has exactly that distribution. In our case $\mu = X\beta$ and $\Sigma = \sigma^2 I_n$, so \begin{align*} \hat{\beta} = C\mathbf Y \sim N_p\!\bigl( C X\beta,\; \sigma^2 CC^\top \bigr). \end{align*} The remaining steps compute $CX\beta$ and $CC^\top$ in closed form. [/guided] [/step] [step:Compute the mean $\mathbb E[\hat{\beta}] = \beta$] Using linearity of expectation (valid because every component of $C\mathbf Y$ is integrable — being jointly Gaussian, all moments exist), \begin{align*} \mathbb E[\hat{\beta}] = \mathbb E[C\mathbf Y] = C\, \mathbb E[\mathbf Y] = C X\beta. \end{align*} Apply the identity $CX = I_p$ from Step 1: \begin{align*} \mathbb E[\hat{\beta}] = I_p \beta = \beta. \end{align*} Thus $\hat{\beta}$ is an unbiased estimator of $\beta$. [/step] [step:Compute the covariance $\operatorname{Cov}(\hat{\beta}) = \sigma^2(X^\top X)^{-1}$] Using $\operatorname{Cov}(C\mathbf Y) = C\operatorname{Cov}(\mathbf Y)C^\top$ (valid for any deterministic $C$ and any random vector $\mathbf Y$ with finite second moments — satisfied here because $\mathbf Y$ is Gaussian), \begin{align*} \operatorname{Cov}(\hat{\beta}) = C\, (\sigma^2 I_n)\, C^\top = \sigma^2\, CC^\top. \end{align*} Apply the identity $CC^\top = (X^\top X)^{-1}$ from Step 1: \begin{align*} \operatorname{Cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}. \end{align*} [guided] We expand the product $CC^\top$ explicitly to make the cancellation visible. With $C = (X^\top X)^{-1} X^\top$ and $C^\top = X (X^\top X)^{-1}$ (using that $(X^\top X)^{-1}$ is symmetric, being the inverse of a symmetric matrix), \begin{align*} CC^\top = (X^\top X)^{-1} X^\top \cdot X (X^\top X)^{-1} = (X^\top X)^{-1} (X^\top X) (X^\top X)^{-1} = (X^\top X)^{-1}. \end{align*} The middle factors $(X^\top X)(X^\top X)^{-1}$ collapse to $I_p$, leaving a single $(X^\top X)^{-1}$. Multiplying by $\sigma^2$ (from the covariance of $\mathbf Y$) gives the covariance of $\hat{\beta}$: \begin{align*} \operatorname{Cov}(\hat{\beta}) = \sigma^2 (X^\top X)^{-1}. \end{align*} This is a beautiful and characteristic formula for least-squares regression: the entire dependence on the design is packaged into the matrix $(X^\top X)^{-1}$. Its diagonal entries $(X^\top X)^{-1}_{jj}$ give the variances of the individual coefficient estimates $\hat\beta_j$, while off-diagonal entries $(X^\top X)^{-1}_{jk}$ give the covariances $\operatorname{Cov}(\hat\beta_j, \hat\beta_k)$. An "ill-conditioned" design (one where $X^\top X$ is nearly singular) inflates $(X^\top X)^{-1}$ and hence the variability of the estimates — this is the phenomenon of multicollinearity. [/guided] [/step] [step:Assemble the final distribution] Combining the three preceding steps, \begin{align*} \hat{\beta} \sim N_p\!\bigl( \beta,\; \sigma^2 (X^\top X)^{-1} \bigr). \end{align*} This is the claimed distribution of the least squares estimator under the normal linear model: Gaussian, centred at the true parameter $\beta$, with covariance $\sigma^2 (X^\top X)^{-1}$. [/step]

Explore Further

Total Sum of Squares Decomposition Statistics Continuous Mapping Theorem Statistics Fisher Information Tensorizes for I.I.D. Samples Statistics Bootstrap Mean Statistics Bias-Variance Decomposition Statistics Donsker–Kolmogorov–Doob Statistics Multivariate Cramér-Rao Bound Statistics Consistency of the MLE Statistics

What brings you to Androma?

Start with a route through the knowledge graph.

Distribution of the Least Squares Estimator (Theorem # 1442)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Distribution of the Least Squares Estimator (Theorem # 1442)

Discussion

Proof

Explore Further