[guided]Let $\mathcal{L}^n$ denote $n$-dimensional [Lebesgue Measure](/page/Lebesgue%20Measure) on $\mathbb{R}^n$, and define
\begin{align*}
M := \|u_0\|_{L^1(\mathbb{R}^n)}.
\end{align*}
First dispose of the case $M=0$. Since the $L^1$ norm of $u_0$ is zero, $u_0=0$ $\mathcal{L}^n$-a.e. on $\mathbb{R}^n$. The heat semigroup is linear on $L^2(\mathbb{R}^n)$, so $u(t)=e^{t\Delta}u_0=0$ in $L^2(\mathbb{R}^n)$ for every $t\geq 0$, and all asserted estimates follow. Hence, for the remainder of the proof, assume $M>0$; this is what later makes the factor $M^{-4/n}$ meaningful.
The goal is to replace the partial differential equation by an ordinary differential inequality for one scalar quantity. The right scalar quantity is the square of the $L^2$ norm, so define $Y: (0,\infty) \to [0,\infty)$ by
\begin{align*}
Y(t) := \|u(t)\|_{L^2(\mathbb{R}^n)}^2.
\end{align*}
The heat equation dissipates $L^2$ energy. The justification for rough initial data comes from the Fourier-multiplier construction of the heat semigroup, in parallel with the heat-kernel representation in the [Solution Of The Cauchy Problem For The Heat Equation](/theorems/54). For any $g\in L^2(\mathbb{R}^n)$, let $\widehat{g}:\mathbb{R}^n\to\mathbb{C}$ denote its Fourier transform in the Plancherel $L^2$ sense. Let $H^1(\mathbb{R}^n)$ denote the Sobolev space of all $g\in L^2(\mathbb{R}^n)$ whose weak first partial derivatives $\partial_{x_i}g$, for $1\leq i\leq n$, belong to $L^2(\mathbb{R}^n)$, with
\begin{align*}
\|\nabla g\|_{L^2(\mathbb{R}^n)}^2 := \sum_{i=1}^n \|\partial_{x_i}g\|_{L^2(\mathbb{R}^n)}^2.
\end{align*}
For each $t>0$, the solution operator $S(t): L^2(\mathbb{R}^n) \to L^2(\mathbb{R}^n)$ is defined by the Fourier multiplier $e^{-t|\xi|^2}$. The multiplier is bounded by $1$, so Plancherel gives $\|S(t)g\|_{L^2(\mathbb{R}^n)}\leq \|g\|_{L^2(\mathbb{R}^n)}$ for every $g\in L^2(\mathbb{R}^n)$. Since $e^{-t|\xi|^2}\to 1$ pointwise as $t\downarrow 0$, dominated convergence in Fourier space gives strong continuity on $L^2(\mathbb{R}^n)$. The smoothing assertion follows because $|\xi|^2e^{-2t|\xi|^2}\leq (2et)^{-1}$, so $\xi\widehat{S(t)g}(\xi)$ is square-integrable whenever $g\in L^2(\mathbb{R}^n)$; hence $S(t)g\in H^1(\mathbb{R}^n)$ for $t>0$.
For Schwartz initial data, the map $t\mapsto S(t)g$ is differentiable in $L^2(\mathbb{R}^n)$, and differentiating the Fourier multiplier gives $\partial_t S(t)g=\Delta S(t)g$. Taking the $L^2$ [inner product](/page/Inner%20Product) with $S(t)g$, using Plancherel, and integrating by parts in Fourier variables gives the integrated energy identity on every interval $[a,b]\subset(0,\infty)$. For the approximating Schwartz data this reads
\begin{align*}
\frac{1}{2}\|S(b)u_{0,k}\|_{L^2(\mathbb{R}^n)}^2 + \int_a^b \|\nabla S(t)u_{0,k}\|_{L^2(\mathbb{R}^n)}^2\, d\mathcal{L}^1(t) = \frac{1}{2}\|S(a)u_{0,k}\|_{L^2(\mathbb{R}^n)}^2.
\end{align*}
For general $u_0\in L^2(\mathbb{R}^n)$, choose Schwartz functions $u_{0,k}\to u_0$ in $L^2(\mathbb{R}^n)$. The $L^2$ convergence at the endpoints $a$ and $b$ is immediate from the contraction estimate for $S(t)$. For the gradient convergence, fix $0<a<b<\infty$. Plancherel gives
\begin{align*}
\int_a^b \|\nabla S(t)(u_{0,k}-u_0)\|_{L^2(\mathbb{R}^n)}^2\, d\mathcal{L}^1(t)
= \int_{\mathbb{R}^n} |\widehat{u_{0,k}-u_0}(\xi)|^2\left(\int_a^b |\xi|^2e^{-2t|\xi|^2}\, d\mathcal{L}^1(t)\right)d\mathcal{L}^n(\xi).
\end{align*}
For $t\in[a,b]$, the multiplier bound $|\xi|^2e^{-2t|\xi|^2}\leq (2ea)^{-1}$ gives
\begin{align*}
\int_a^b \|\nabla S(t)(u_{0,k}-u_0)\|_{L^2(\mathbb{R}^n)}^2\, d\mathcal{L}^1(t)
\leq (b-a)(2ea)^{-1}\|u_{0,k}-u_0\|_{L^2(\mathbb{R}^n)}^2.
\end{align*}
The right-hand side tends to $0$, so $S(t)u_{0,k}\to S(t)u_0$ in the time-integrated $H^1$ sense needed to pass to the limit in the integrated energy identity on compact subintervals of $(0,\infty)$. Passing to the limit gives
\begin{align*}
\frac{1}{2}\|S(b)u_0\|_{L^2(\mathbb{R}^n)}^2 + \int_a^b \|\nabla S(t)u_0\|_{L^2(\mathbb{R}^n)}^2\, d\mathcal{L}^1(t) = \frac{1}{2}\|S(a)u_0\|_{L^2(\mathbb{R}^n)}^2.
\end{align*}
Since this identity holds for every compact interval $[a,b]\subset(0,\infty)$, the function $Y$ is absolutely continuous on compact subintervals of $(0,\infty)$. Differentiating the integrated identity gives the whole-space form of [Energy Dissipation For The Heat Equation](/theorems/564). Therefore, for a.e. $t>0$,
\begin{align*}
\frac{1}{2}\frac{d}{dt}\|u(t)\|_{L^2(\mathbb{R}^n)}^2 + \|\nabla u(t)\|_{L^2(\mathbb{R}^n)}^2 = 0.
\end{align*}
In terms of $Y$, this becomes
\begin{align*}
Y'(t) = -2\|\nabla u(t)\|_{L^2(\mathbb{R}^n)}^2
\end{align*}
for a.e. $t>0$.
We also need a quantity that does not grow in time. The heat kernel representation writes $u(t)$ as convolution with a nonnegative kernel of total $\mathcal{L}^n$-mass $1$. Applying [Young's Convolution Inequality](/theorems/463) with exponents $1$ and $1$ gives contraction on $L^1(\mathbb{R}^n)$. Hence
\begin{align*}
\|u(t)\|_{L^1(\mathbb{R}^n)} \leq \|u_0\|_{L^1(\mathbb{R}^n)} = M
\end{align*}
for every $t\geq 0$. This is the point of introducing $M$: Nash's inequality contains an $L^1$ factor, and contractivity allows us to replace the time-dependent factor $\|u(t)\|_{L^1}$ by the fixed initial size $M$.[/guided]