[proofplan]
We first use orthogonal invariance of the Frobenius norm and invariance of rank under multiplication by orthogonal matrices to reduce the problem to approximating the rectangular diagonal singular value matrix $\Sigma$. The upper bound is attained by the truncated diagonal matrix $\Sigma_k$. For the lower bound, we show that any rank-$k$ approximation $C$ can only use a $k$-dimensional column space, so the part of $\Sigma$ orthogonal to that column space already contributes at least the sum of the discarded squared singular values. Translating the diagonal result back by $U$ and $V^\top$ gives the theorem.
[/proofplan]
[step:Reduce the minimisation problem to the diagonal singular value matrix]
Let $B \in \mathbb{R}^{n \times p}$ satisfy $\operatorname{rank}(B) \leq k$. Define
\begin{align*}
C := U^\top B V \in \mathbb{R}^{n \times p}.
\end{align*}
Since $U^\top: \mathbb{R}^n \to \mathbb{R}^n$ and $V: \mathbb{R}^p \to \mathbb{R}^p$ are invertible linear maps, multiplication by them preserves rank, and therefore
\begin{align*}
\operatorname{rank}(C) = \operatorname{rank}(B) \leq k.
\end{align*}
Also,
\begin{align*}
A - B = U(\Sigma - C)V^\top.
\end{align*}
Let $I_n \in \mathbb{R}^{n \times n}$ and $I_p \in \mathbb{R}^{p \times p}$ denote the identity matrices, and let $\operatorname{tr}: \mathbb{R}^{p \times p} \to \mathbb{R}$ denote the trace map $M \mapsto \sum_{j=1}^{p} M_{jj}$. The Frobenius norm is invariant under left and right multiplication by orthogonal matrices. Indeed,
\begin{align*}
\|U(\Sigma - C)V^\top\|_F^2
&= \operatorname{tr}\left((U(\Sigma - C)V^\top)^\top U(\Sigma - C)V^\top\right)\\
&= \operatorname{tr}\left(V(\Sigma - C)^\top U^\top U(\Sigma - C)V^\top\right)\\
&= \operatorname{tr}\left(V(\Sigma - C)^\top(\Sigma - C)V^\top\right)\\
&= \operatorname{tr}\left((\Sigma - C)^\top(\Sigma - C)V^\top V\right)\\
&= \operatorname{tr}\left((\Sigma - C)^\top(\Sigma - C)\right)\\
&= \|\Sigma - C\|_F^2.
\end{align*}
Thus minimising $\|A-B\|_F$ over rank at most $k$ matrices $B$ is equivalent to minimising $\|\Sigma-C\|_F$ over rank at most $k$ matrices $C$.
[guided]
The [singular value decomposition](/theorems/3071) expresses $A$ as an orthogonal change of coordinates on the left, followed by the diagonal singular value matrix, followed by an orthogonal change of coordinates on the right. Since orthogonal changes of coordinates preserve Euclidean lengths, they should also preserve the Frobenius norm.
Let $B \in \mathbb{R}^{n \times p}$ be any matrix with $\operatorname{rank}(B) \leq k$, and define
\begin{align*}
C := U^\top B V \in \mathbb{R}^{n \times p}.
\end{align*}
Because $U^\top$ and $V$ are invertible matrices, multiplying by them does not change the dimension of the range. Hence
\begin{align*}
\operatorname{rank}(C) = \operatorname{rank}(B) \leq k.
\end{align*}
Using $A = U\Sigma V^\top$, we have
\begin{align*}
A - B = U\Sigma V^\top - B = U(\Sigma - U^\top B V)V^\top = U(\Sigma - C)V^\top.
\end{align*}
Let $I_n \in \mathbb{R}^{n \times n}$ and $I_p \in \mathbb{R}^{p \times p}$ denote the identity matrices, and let $\operatorname{tr}: \mathbb{R}^{p \times p} \to \mathbb{R}$ denote the trace map $M \mapsto \sum_{j=1}^{p} M_{jj}$.
We now verify the Frobenius norm equality directly. Since $U^\top U = I_n$, $V^\top V = I_p$, and the trace satisfies cyclic permutation under compatible products,
\begin{align*}
\|U(\Sigma - C)V^\top\|_F^2
&= \operatorname{tr}\left((U(\Sigma - C)V^\top)^\top U(\Sigma - C)V^\top\right)\\
&= \operatorname{tr}\left(V(\Sigma - C)^\top U^\top U(\Sigma - C)V^\top\right)\\
&= \operatorname{tr}\left(V(\Sigma - C)^\top(\Sigma - C)V^\top\right)\\
&= \operatorname{tr}\left((\Sigma - C)^\top(\Sigma - C)V^\top V\right)\\
&= \operatorname{tr}\left((\Sigma - C)^\top(\Sigma - C)\right)\\
&= \|\Sigma - C\|_F^2.
\end{align*}
Therefore the original approximation problem for $A$ is exactly the same as the approximation problem for $\Sigma$, after rotating every competitor $B$ into $C = U^\top B V$.
[/guided]
[/step]
[step:Compute the error of the truncated diagonal approximation]
The matrix $\Sigma_k$ has at most $k$ nonzero columns, hence $\operatorname{rank}(\Sigma_k) \leq k$. Since $\Sigma-\Sigma_k$ is rectangular diagonal with diagonal entries $d_{k+1},\dots,d_r$ and all other entries equal to $0$,
\begin{align*}
\|\Sigma-\Sigma_k\|_F^2
=
\sum_{j=k+1}^{r} d_j^2.
\end{align*}
Consequently,
\begin{align*}
\|A-A_k\|_F^2
=
\|U(\Sigma-\Sigma_k)V^\top\|_F^2
=
\|\Sigma-\Sigma_k\|_F^2
=
\sum_{j=k+1}^{r} d_j^2.
\end{align*}
[/step]
[step:Bound every rank-$k$ diagonal approximation from below]
Let $C \in \mathbb{R}^{n \times p}$ satisfy $\operatorname{rank}(C) \leq k$. Let $S := \operatorname{Range}(C) \subset \mathbb{R}^n$, and let $P: \mathbb{R}^n \to \mathbb{R}^n$ be the [orthogonal projection](/theorems/437) onto $S$. Since every column of $C$ lies in $S$, we have $PC = C$.
Using the [orthogonal decomposition](/theorems/436)
\begin{align*}
\Sigma - C = (I_n-P)\Sigma + (P\Sigma-C),
\end{align*}
and the fact that $(I_n-P)\Sigma$ has columns in $S^\perp$ while $P\Sigma-C$ has columns in $S$, the Frobenius inner product of these two matrices is $0$. Hence
\begin{align*}
\|\Sigma-C\|_F^2
=
\|(I_n-P)\Sigma\|_F^2+\|P\Sigma-C\|_F^2
\geq
\|(I_n-P)\Sigma\|_F^2.
\end{align*}
For $1 \leq j \leq r$, let $e_j \in \mathbb{R}^n$ denote the $j$th standard basis vector. Since the $j$th column of $\Sigma$ is $d_j e_j$ for $1 \leq j \leq r$ and the remaining columns contribute non-negatively,
\begin{align*}
\|(I_n-P)\Sigma\|_F^2
=
\sum_{j=1}^{r} d_j^2 |(I_n-P)e_j|^2.
\end{align*}
Let
\begin{align*}
a_j := |Pe_j|^2 \quad \text{for } 1 \leq j \leq r.
\end{align*}
Then $0 \leq a_j \leq 1$ and $|(I_n-P)e_j|^2 = 1-a_j$. Since $P$ is an orthogonal projection onto a subspace of dimension at most $k$,
\begin{align*}
\sum_{j=1}^{r} a_j
\leq
\sum_{j=1}^{n} |Pe_j|^2
=
\operatorname{tr}(P)
=
\dim S
\leq k.
\end{align*}
Therefore
\begin{align*}
\sum_{j=1}^{r} d_j^2 |(I_n-P)e_j|^2
=
\sum_{j=1}^{r} d_j^2(1-a_j).
\end{align*}
Because $d_1^2 \geq \cdots \geq d_r^2$ and the numbers $a_j$ satisfy $0 \leq a_j \leq 1$ with $\sum_{j=1}^{r} a_j \leq k$, we have
\begin{align*}
\sum_{j=1}^{r} d_j^2 a_j
\leq
\sum_{j=1}^{k} d_j^2.
\end{align*}
Indeed, if $k=0$, then $\sum_{j=1}^{r} a_j=0$, so every $a_j=0$. If $k\geq 1$, set $m:=\sum_{j=k+1}^{r}a_j$. Then $m \leq \sum_{j=1}^{k}(1-a_j)$, and the monotonicity of the $d_j$ gives
\begin{align*}
\sum_{j=k+1}^{r} d_j^2a_j
\leq
d_k^2 m
\leq
d_k^2\sum_{j=1}^{k}(1-a_j)
\leq
\sum_{j=1}^{k}d_j^2(1-a_j),
\end{align*}
which is equivalent to the displayed inequality.
It follows that
\begin{align*}
\|\Sigma-C\|_F^2
\geq
\sum_{j=1}^{r} d_j^2(1-a_j)
\geq
\sum_{j=1}^{r} d_j^2-\sum_{j=1}^{k}d_j^2
=
\sum_{j=k+1}^{r}d_j^2.
\end{align*}
[guided]
The key point is that a rank-$k$ matrix $C$ can only have a $k$-dimensional column space. Let
\begin{align*}
S := \operatorname{Range}(C) \subset \mathbb{R}^n,
\end{align*}
and let $P: \mathbb{R}^n \to \mathbb{R}^n$ be the orthogonal projection onto $S$. Since each column of $C$ belongs to $S$, applying $P$ to $C$ leaves $C$ unchanged:
\begin{align*}
PC = C.
\end{align*}
Now decompose the error according to the orthogonal splitting $\mathbb{R}^n = S \oplus S^\perp$:
\begin{align*}
\Sigma - C = (I_n-P)\Sigma + (P\Sigma-C).
\end{align*}
The columns of $(I_n-P)\Sigma$ lie in $S^\perp$, while the columns of $P\Sigma-C$ lie in $S$. Therefore the corresponding columns are orthogonal in $\mathbb{R}^n$, and so the Frobenius inner product of the two matrices is zero. By the Pythagorean identity for the Frobenius norm,
\begin{align*}
\|\Sigma-C\|_F^2
=
\|(I_n-P)\Sigma\|_F^2+\|P\Sigma-C\|_F^2
\geq
\|(I_n-P)\Sigma\|_F^2.
\end{align*}
Thus even before asking how well $C$ approximates the part of $\Sigma$ inside $S$, the component of $\Sigma$ orthogonal to $S$ already gives a lower bound.
For $1 \leq j \leq r$, let $e_j \in \mathbb{R}^n$ be the $j$th standard basis vector. The $j$th column of $\Sigma$ is $d_j e_j$ for $1 \leq j \leq r$, while every other nonzero contribution is absent. Hence
\begin{align*}
\|(I_n-P)\Sigma\|_F^2
=
\sum_{j=1}^{r} |(I_n-P)(d_j e_j)|^2
=
\sum_{j=1}^{r} d_j^2 |(I_n-P)e_j|^2.
\end{align*}
Define
\begin{align*}
a_j := |Pe_j|^2 \quad \text{for } 1 \leq j \leq r.
\end{align*}
Since $P$ is an orthogonal projection, $Pe_j$ and $(I_n-P)e_j$ are orthogonal and
\begin{align*}
|e_j|^2 = |Pe_j|^2 + |(I_n-P)e_j|^2.
\end{align*}
Because $|e_j|=1$, this gives
\begin{align*}
|(I_n-P)e_j|^2 = 1-a_j.
\end{align*}
Also $0 \leq a_j \leq 1$.
The total amount of projection that the first $r$ coordinate directions can place inside $S$ is at most $\dim S$. Indeed,
\begin{align*}
\sum_{j=1}^{r} a_j
\leq
\sum_{j=1}^{n} |Pe_j|^2
=
\operatorname{tr}(P)
=
\dim S
\leq k.
\end{align*}
The equality $\sum_{j=1}^{n}|Pe_j|^2=\operatorname{tr}(P)$ follows because $P$ is symmetric and idempotent, so
\begin{align*}
\sum_{j=1}^{n}|Pe_j|^2
=
\sum_{j=1}^{n} e_j^\top P^\top P e_j
=
\sum_{j=1}^{n} e_j^\top P e_j
=
\operatorname{tr}(P).
\end{align*}
We now use only the ordering $d_1^2 \geq \cdots \geq d_r^2$. The numbers $a_j$ are weights between $0$ and $1$, and their total mass is at most $k$. We claim that
\begin{align*}
\sum_{j=1}^{r} d_j^2 a_j
\leq
\sum_{j=1}^{k} d_j^2.
\end{align*}
When $k=0$, the constraint $\sum_{j=1}^{r}a_j\leq 0$ and the non-negativity of the $a_j$ force every $a_j$ to be $0$, so the claim holds. Assume now that $k\geq 1$, and define
\begin{align*}
m:=\sum_{j=k+1}^{r}a_j.
\end{align*}
Since $\sum_{j=1}^{r}a_j\leq k$, we have
\begin{align*}
m\leq k-\sum_{j=1}^{k}a_j=\sum_{j=1}^{k}(1-a_j).
\end{align*}
The coefficients after the $k$th index are no larger than $d_k^2$, while each of $d_1^2,\dots,d_k^2$ is at least $d_k^2$. Hence
\begin{align*}
\sum_{j=k+1}^{r} d_j^2a_j
\leq
d_k^2m
\leq
d_k^2\sum_{j=1}^{k}(1-a_j)
\leq
\sum_{j=1}^{k}d_j^2(1-a_j).
\end{align*}
Adding $\sum_{j=1}^{k}d_j^2a_j$ to both sides gives the claimed inequality. This is the exchange argument: any weight assigned after index $k$ can be moved into unused weight among the first $k$ indices without decreasing the weighted sum.
Therefore
\begin{align*}
\|\Sigma-C\|_F^2
\geq
\sum_{j=1}^{r} d_j^2(1-a_j)
=
\sum_{j=1}^{r} d_j^2-\sum_{j=1}^{r}d_j^2a_j
\geq
\sum_{j=1}^{r} d_j^2-\sum_{j=1}^{k}d_j^2
=
\sum_{j=k+1}^{r}d_j^2.
\end{align*}
This lower bound applies to every rank at most $k$ matrix $C$.
[/guided]
[/step]
[step:Translate the diagonal minimiser back to the original matrix]
The previous step shows that every $C \in \mathbb{R}^{n \times p}$ with $\operatorname{rank}(C) \leq k$ satisfies
\begin{align*}
\|\Sigma-C\|_F^2 \geq \sum_{j=k+1}^{r}d_j^2.
\end{align*}
The matrix $\Sigma_k$ has rank at most $k$ and attains equality:
\begin{align*}
\|\Sigma-\Sigma_k\|_F^2 = \sum_{j=k+1}^{r}d_j^2.
\end{align*}
Therefore $\Sigma_k$ minimises $\|\Sigma-C\|_F$ among all $C$ with rank at most $k$.
By the reduction step, the corresponding minimiser for the original matrix $A$ is
\begin{align*}
A_k = U\Sigma_k V^\top.
\end{align*}
Thus
\begin{align*}
\|A-A_k\|_F
=
\min_{\substack{B \in \mathbb{R}^{n \times p}\\ \operatorname{rank}(B)\leq k}}
\|A-B\|_F,
\end{align*}
and the exact squared error is
\begin{align*}
\|A-A_k\|_F^2=\sum_{j=k+1}^{r}d_j^2.
\end{align*}
This proves the theorem.
[/step]