[proofplan]
We prove an upper-tail estimate for the largest eigenvalue of $S:=\sum_{i=1}^nY_i$ by applying the matrix [Laplace transform](/page/Laplace%20Transform) method to $\operatorname{tr}\exp(\theta S)$. The scalar Bernstein exponential bound gives a Loewner-order estimate for each matrix moment generating function, while the noncommutative trace moment generating function estimate combines the independent summands inside a single trace. Optimising the resulting one-parameter exponential bound gives the Bernstein denominator $\sigma^2+Lt/3$. Applying the same argument to $-S$ gives the lower tail, and the operator norm bound follows by the union bound.
[/proofplan]
custom_env
admin
[step:Reduce to the non-degenerate case and introduce the variance matrix]
Let $(\Omega,\mathcal F,\mathbb P)$ denote the probability space on which the matrices $Y_1,\dots,Y_n$ are defined, and let $\mathbb E$ denote expectation with respect to $\mathbb P$. Let $\mathbb S^d\subset \mathbb R^{d\times d}$ denote the [vector space](/page/Vector%20Space) of real symmetric $d\times d$ matrices, and let $I\in\mathbb S^d$ denote the $d\times d$ identity matrix. For a matrix $A\in\mathbb S^d$, let $\lambda_{\max}(A)$ denote its largest eigenvalue, let $\operatorname{tr}(A)$ denote its trace, and let $\exp(A)\in\mathbb S^d$ denote the matrix exponential defined by the absolutely convergent [power series](/page/Power%20Series)
\begin{align*}
\exp(A)=\sum_{m=0}^{\infty}\frac{A^m}{m!}.
\end{align*}
Define the random matrix $S:\Omega\to \mathbb S^d$ by
\begin{align*}
S(\omega)=\sum_{i=1}^nY_i(\omega).
\end{align*}
Define the deterministic variance matrix
\begin{align*}
V:=\sum_{i=1}^n\mathbb E[Y_i^2]\in\mathbb S^d.
\end{align*}
Each matrix $Y_i^2$ is positive semidefinite almost surely, so $V$ is positive semidefinite and $\sigma^2=\|V\|_{\mathrm{op}}=\lambda_{\max}(V)$.
If $L=0$, then $\|Y_i\|_{\mathrm{op}}=0$ almost surely for every $i$, hence $Y_i=0$ almost surely for every $i$. For $t=0$, the right-hand side is $2d$ by the zero-denominator convention in the theorem statement, so $\mathbb P(0\ge 0)=1\le 2d$. For $t>0$, we have $\mathbb P(0\ge t)=0$, so the desired inequality holds. If $\sigma^2=0$, then $V=0$. Since each $\mathbb E[Y_i^2]$ is positive semidefinite and
\begin{align*}
0=\operatorname{tr}V=\sum_{i=1}^n\mathbb E[\operatorname{tr}(Y_i^2)],
\end{align*}
we have $\operatorname{tr}(Y_i^2)=0$ almost surely for every $i$, hence $Y_i=0$ almost surely for every $i$. Thus the conclusion is again immediate, with the same convention handling $t=0$ and with probability $0$ when $t>0$. Hence it remains to prove the estimate when $L>0$ and $\sigma^2>0$.
[/step]
custom_env
admin
[step:Bound each matrix moment generating function by the variance contribution]Define the scalar function $g:(0,3/L)\to(0,\infty)$ by
\begin{align*}
g(\theta)=\frac{\theta^2}{2(1-\theta L/3)}.
\end{align*}
Fix $\theta\in(0,3/L)$. For every $i\in\{1,\dots,n\}$,
\begin{align*}
\mathbb E[\exp(\theta Y_i)]\preceq \exp\left(g(\theta)\mathbb E[Y_i^2]\right),
\end{align*}
where $\preceq$ denotes the Loewner order on $\mathbb S^d$.
Indeed, for every real number $x$ with $|x|\le L$,
\begin{align*}
e^{\theta x}\le 1+\theta x+g(\theta)x^2.
\end{align*}
This is the scalar Bernstein exponential estimate; it follows from the power series for $e^{\theta x}$ and the bound
\begin{align*}
\frac{\theta^m x^m}{m!}\le \frac{\theta^m L^{m-2}x^2}{m!}
\end{align*}
for all $m\ge 2$ whenever $|x|\le L$, followed by
\begin{align*}
\sum_{m=2}^{\infty}\frac{\theta^mL^{m-2}}{m!}
\le \frac{\theta^2}{2}\sum_{m=0}^{\infty}\left(\frac{\theta L}{3}\right)^m
=\frac{\theta^2}{2(1-\theta L/3)}.
\end{align*}
Since $\|Y_i\|_{\mathrm{op}}\le L$ almost surely and $Y_i$ is symmetric almost surely, functional calculus applies this scalar inequality to every eigenvalue of $Y_i$ and gives
\begin{align*}
\exp(\theta Y_i)\preceq I+\theta Y_i+g(\theta)Y_i^2
\end{align*}
almost surely. Taking entrywise expectations preserves the Loewner order, and $\mathbb E[Y_i]=0$, so
\begin{align*}
\mathbb E[\exp(\theta Y_i)]
\preceq I+g(\theta)\mathbb E[Y_i^2].
\end{align*}
Since $g(\theta)\mathbb E[Y_i^2]$ is positive semidefinite, applying the scalar inequality $1+x\le e^x$ to its eigenvalues gives
\begin{align*}
I+g(\theta)\mathbb E[Y_i^2]\preceq \exp\left(g(\theta)\mathbb E[Y_i^2]\right).
\end{align*}
Combining the last two Loewner inequalities proves the claimed matrix moment generating function bound.[/step]
custom_env
admin
[guided]The purpose of this step is to convert the almost sure operator norm bound into a usable exponential estimate. Define the scalar function $g:(0,3/L)\to(0,\infty)$ by
\begin{align*}
g(\theta)=\frac{\theta^2}{2(1-\theta L/3)}.
\end{align*}
Fix $\theta\in(0,3/L)$.
We first prove the scalar inequality that drives the whole argument. If $|x|\le L$, then the power series for the exponential gives
\begin{align*}
e^{\theta x}=1+\theta x+\sum_{m=2}^{\infty}\frac{\theta^m x^m}{m!}.
\end{align*}
For $m\ge 2$ and $|x|\le L$, we have $x^m\le |x|^m\le L^{m-2}x^2$, hence
\begin{align*}
\sum_{m=2}^{\infty}\frac{\theta^m x^m}{m!}
\le x^2\sum_{m=2}^{\infty}\frac{\theta^mL^{m-2}}{m!}.
\end{align*}
The Bernstein simplification replaces the exact exponential remainder by a geometric upper bound. Since $m!\ge 2\cdot 3^{m-2}$ for every $m\ge 2$, we get
\begin{align*}
\sum_{m=2}^{\infty}\frac{\theta^mL^{m-2}}{m!}
\le \frac{\theta^2}{2}\sum_{m=0}^{\infty}\left(\frac{\theta L}{3}\right)^m
=\frac{\theta^2}{2(1-\theta L/3)}
=g(\theta),
\end{align*}
where the geometric series converges because $\theta L/3<1$. Therefore
\begin{align*}
e^{\theta x}\le 1+\theta x+g(\theta)x^2
\end{align*}
for every $x\in[-L,L]$.
Now apply this scalar inequality to the eigenvalues of the symmetric random matrix $Y_i$. Since $\|Y_i\|_{\mathrm{op}}\le L$ almost surely, every eigenvalue of $Y_i$ belongs to $[-L,L]$ almost surely. Functional calculus therefore gives the Loewner-order inequality
\begin{align*}
\exp(\theta Y_i)\preceq I+\theta Y_i+g(\theta)Y_i^2
\end{align*}
almost surely. Taking expectations is legitimate because all entries are bounded random variables, and expectation preserves the Loewner order: if $A(\omega)\preceq B(\omega)$ almost surely, then $\mathbb E[A]\preceq\mathbb E[B]$. Since $Y_i$ is centered, $\mathbb E[Y_i]=0$, so
\begin{align*}
\mathbb E[\exp(\theta Y_i)]
\preceq I+g(\theta)\mathbb E[Y_i^2].
\end{align*}
Finally, $\mathbb E[Y_i^2]$ is positive semidefinite, because $Y_i^2$ is positive semidefinite almost surely. Hence $g(\theta)\mathbb E[Y_i^2]$ is positive semidefinite. Applying $1+x\le e^x$ to each eigenvalue gives
\begin{align*}
I+g(\theta)\mathbb E[Y_i^2]\preceq \exp\left(g(\theta)\mathbb E[Y_i^2]\right).
\end{align*}
Combining the two Loewner inequalities yields
\begin{align*}
\mathbb E[\exp(\theta Y_i)]\preceq \exp\left(g(\theta)\mathbb E[Y_i^2]\right).
\end{align*}[/guided]
custom_env
admin
[step:Combine the independent summands inside a trace exponential]We use the [Lieb trace moment generating function estimate](https://doi.org/10.1007/s00440-011-0379-z), cited here as Tropp, *User-Friendly Tail Bounds for Sums of Random Matrices*, Theorem 3.6. We use the following Loewner-dominated corollary of that theorem: if $X_1,\dots,X_n$ are independent symmetric random matrices and $A_i$ are deterministic symmetric matrices such that
\begin{align*}
\mathbb E[\exp(X_i)]\preceq \exp(A_i)
\end{align*}
for every $i$, then
\begin{align*}
\mathbb E\left[\operatorname{tr}\exp\left(\sum_{i=1}^nX_i\right)\right]
\le \operatorname{tr}\exp\left(\sum_{i=1}^nA_i\right).
\end{align*}
Indeed, $\exp(X_i)$ is positive definite almost surely, so $\mathbb E[\exp(X_i)]$ is positive definite and its matrix logarithm is well-defined. Tropp's theorem is stated with $\log\mathbb E[\exp(X_i)]$ in place of $A_i$. From $\mathbb E[\exp(X_i)]\preceq\exp(A_i)$ and the operator monotonicity of the matrix logarithm on positive definite matrices, we get
\begin{align*}
\log\mathbb E[\exp(X_i)]\preceq A_i.
\end{align*}
The trace exponential map $B\mapsto\operatorname{tr}\exp(B)$ is monotone in the Loewner order on $\mathbb S^d$, so replacing $\log\mathbb E[\exp(X_i)]$ by the larger matrix $A_i$ preserves the upper bound. This is the finite-dimensional noncommutative trace moment generating function estimate obtained from Lieb's concavity theorem.
Apply this estimate with $X_i:=\theta Y_i$ and
\begin{align*}
A_i:=g(\theta)\mathbb E[Y_i^2]\in\mathbb S^d.
\end{align*}
The matrices $\theta Y_i$ are independent and symmetric almost surely, and the preceding step verifies the required moment generating function hypothesis. Therefore
\begin{align*}
\mathbb E[\operatorname{tr}\exp(\theta S)]
\le \operatorname{tr}\exp\left(g(\theta)V\right).
\end{align*}
Since $V$ is positive semidefinite and $\lambda_{\max}(V)=\sigma^2$, every eigenvalue of $g(\theta)V$ is at most $g(\theta)\sigma^2$. Thus
\begin{align*}
\operatorname{tr}\exp\left(g(\theta)V\right)
\le d\exp\left(g(\theta)\sigma^2\right).
\end{align*}
Consequently,
\begin{align*}
\mathbb E[\operatorname{tr}\exp(\theta S)]
\le d\exp\left(\frac{\theta^2\sigma^2}{2(1-\theta L/3)}\right).
\end{align*}[/step]
custom_env
admin
[guided]This step is where independence is used in a genuinely matrix-valued way. We apply the [Lieb trace moment generating function estimate](https://doi.org/10.1007/s00440-011-0379-z) in the following Loewner-dominated form: if $X_1,\dots,X_n$ are independent symmetric random matrices and deterministic symmetric matrices $A_1,\dots,A_n$ satisfy
\begin{align*}
\mathbb E[\exp(X_i)]\preceq\exp(A_i)
\end{align*}
for every $i$, then
\begin{align*}
\mathbb E\left[\operatorname{tr}\exp\left(\sum_{i=1}^nX_i\right)\right]
\le \operatorname{tr}\exp\left(\sum_{i=1}^nA_i\right).
\end{align*}
The usual statement has $\log\mathbb E[\exp(X_i)]$ in place of $A_i$. This is compatible with the displayed form because $\exp(X_i)$ is positive definite almost surely, hence $\mathbb E[\exp(X_i)]$ is positive definite and its logarithm is defined. The matrix logarithm is operator monotone on positive definite matrices, and $B\mapsto\operatorname{tr}\exp(B)$ is monotone in the Loewner order on $\mathbb S^d$.
We apply the estimate with $X_i:=\theta Y_i$ and
\begin{align*}
A_i:=g(\theta)\mathbb E[Y_i^2]\in\mathbb S^d.
\end{align*}
The matrices $\theta Y_i$ are independent because the $Y_i$ are independent, and they are symmetric almost surely because each $Y_i$ is symmetric almost surely. The preceding step proves exactly the required hypothesis
\begin{align*}
\mathbb E[\exp(\theta Y_i)]\preceq\exp\left(g(\theta)\mathbb E[Y_i^2]\right).
\end{align*}
Therefore
\begin{align*}
\mathbb E[\operatorname{tr}\exp(\theta S)]
\le \operatorname{tr}\exp\left(g(\theta)V\right).
\end{align*}
Since $V$ is positive semidefinite and $\lambda_{\max}(V)=\sigma^2$, every eigenvalue of $g(\theta)V$ is at most $g(\theta)\sigma^2$. Summing the exponentials of the $d$ eigenvalues gives
\begin{align*}
\operatorname{tr}\exp\left(g(\theta)V\right)
\le d\exp\left(g(\theta)\sigma^2\right).
\end{align*}
Substituting the definition of $g(\theta)$ yields
\begin{align*}
\mathbb E[\operatorname{tr}\exp(\theta S)]
\le d\exp\left(\frac{\theta^2\sigma^2}{2(1-\theta L/3)}\right).
\end{align*}[/guided]
custom_env
admin
[step:Derive and optimise the upper tail bound for the largest eigenvalue]For $t\ge 0$ and $\theta\in(0,3/L)$, let $\mathbb 1_{\{\lambda_{\max}(S)\ge t\}}:\Omega\to\{0,1\}$ denote the indicator function of the event $\{\omega\in\Omega:\lambda_{\max}(S(\omega))\ge t\}$. This event implies
\begin{align*}
e^{\theta\lambda_{\max}(S)}\ge e^{\theta t}.
\end{align*}
Since $\operatorname{tr}\exp(\theta S)=\sum_{j=1}^d e^{\theta\lambda_j(S)}\ge e^{\theta\lambda_{\max}(S)}$, where $\lambda_1(S),\dots,\lambda_d(S)$ are the eigenvalues of $S$, we obtain the pointwise inequality
\begin{align*}
\mathbb 1_{\{\lambda_{\max}(S)\ge t\}}
\le e^{-\theta t}\operatorname{tr}\exp(\theta S).
\end{align*}
Taking expectations and using the trace moment generating function bound gives
\begin{align*}
\mathbb P(\lambda_{\max}(S)\ge t)
\le d\exp\left(-\theta t+\frac{\theta^2\sigma^2}{2(1-\theta L/3)}\right).
\end{align*}
For $t>0$, choose
\begin{align*}
\theta:=\frac{t}{\sigma^2+Lt/3}.
\end{align*}
Because $L>0$ and $\sigma^2>0$, this choice satisfies $0<\theta<3/L$. With $a:=L/3$, we compute
\begin{align*}
1-a\theta
=1-\frac{at}{\sigma^2+at}
=\frac{\sigma^2}{\sigma^2+at}.
\end{align*}
Therefore
\begin{align*}
-\theta t+\frac{\theta^2\sigma^2}{2(1-a\theta)}
=-\frac{t^2}{\sigma^2+at}
+\frac{t^2}{(\sigma^2+at)^2}\cdot \frac{\sigma^2}{2}\cdot \frac{\sigma^2+at}{\sigma^2}.
\end{align*}
Simplifying the right-hand side gives
\begin{align*}
-\theta t+\frac{\theta^2\sigma^2}{2(1-a\theta)}
=-\frac{t^2}{2(\sigma^2+at)}.
\end{align*}
Since $a=L/3$, this is
\begin{align*}
-\theta t+\frac{\theta^2\sigma^2}{2(1-a\theta)}
=-\frac{t^2/2}{\sigma^2+Lt/3}.
\end{align*}
Hence, for every $t>0$,
\begin{align*}
\mathbb P(\lambda_{\max}(S)\ge t)
\le d\exp\left(-\frac{t^2/2}{\sigma^2+Lt/3}\right).
\end{align*}
For $t=0$, the same inequality holds because the left-hand side is at most $1$ and the right-hand side is $d\ge 1$.[/step]
custom_env
admin
[guided]Fix $t\ge0$ and $\theta\in(0,3/L)$. Let $\mathbb 1_{\{\lambda_{\max}(S)\ge t\}}:\Omega\to\{0,1\}$ denote the indicator function of the event $\{\omega\in\Omega:\lambda_{\max}(S(\omega))\ge t\}$. This event forces
\begin{align*}
e^{\theta\lambda_{\max}(S)}\ge e^{\theta t},
\end{align*}
because the scalar exponential is increasing and $\theta>0$. If $\lambda_1(S),\dots,\lambda_d(S)$ are the eigenvalues of the symmetric matrix $S$, then
\begin{align*}
\operatorname{tr}\exp(\theta S)=\sum_{j=1}^d e^{\theta\lambda_j(S)}\ge e^{\theta\lambda_{\max}(S)}.
\end{align*}
Thus the indicator of the upper-tail event is pointwise bounded by
\begin{align*}
\mathbb 1_{\{\lambda_{\max}(S)\ge t\}}
\le e^{-\theta t}\operatorname{tr}\exp(\theta S).
\end{align*}
Taking expectations with respect to $\mathbb P$ and using the trace moment generating function bound from the previous step gives
\begin{align*}
\mathbb P(\lambda_{\max}(S)\ge t)
\le d\exp\left(-\theta t+\frac{\theta^2\sigma^2}{2(1-\theta L/3)}\right).
\end{align*}
For $t>0$, choose
\begin{align*}
\theta:=\frac{t}{\sigma^2+Lt/3}.
\end{align*}
Since $L>0$ and $\sigma^2>0$, this choice lies in $(0,3/L)$. Put $a:=L/3$. Then
\begin{align*}
1-a\theta
=1-\frac{at}{\sigma^2+at}
=\frac{\sigma^2}{\sigma^2+at}.
\end{align*}
Substitution gives
\begin{align*}
-\theta t+\frac{\theta^2\sigma^2}{2(1-a\theta)}
=-\frac{t^2}{\sigma^2+at}
+\frac{t^2}{(\sigma^2+at)^2}\cdot\frac{\sigma^2}{2}\cdot\frac{\sigma^2+at}{\sigma^2}
=-\frac{t^2}{2(\sigma^2+at)}.
\end{align*}
Since $a=L/3$, this becomes
\begin{align*}
\mathbb P(\lambda_{\max}(S)\ge t)
\le d\exp\left(-\frac{t^2/2}{\sigma^2+Lt/3}\right).
\end{align*}
For $t=0$, the same estimate holds because the left-hand side is at most $1$ and the right-hand side is $d\ge1$.[/guided]
custom_env
admin
[step:Apply the upper tail estimate to the negative sum and combine the tails]Define the random matrix $\widetilde S:\Omega\to\mathbb S^d$ by
\begin{align*}
\widetilde S(\omega)=-S(\omega)=\sum_{i=1}^n(-Y_i(\omega)).
\end{align*}
The matrices $-Y_1,\dots,-Y_n$ are independent, centered, symmetric random matrices, and
\begin{align*}
\|-Y_i\|_{\mathrm{op}}=\|Y_i\|_{\mathrm{op}}\le L
\end{align*}
almost surely for every $i$. Their variance matrix is unchanged because $(-Y_i)^2=Y_i^2$, so the same value of $\sigma^2$ applies. Applying the largest-eigenvalue bound to $\widetilde S=-S$ gives
\begin{align*}
\mathbb P(\lambda_{\max}(-S)\ge t)
\le d\exp\left(-\frac{t^2/2}{\sigma^2+Lt/3}\right).
\end{align*}
For every symmetric matrix $A\in\mathbb S^d$,
\begin{align*}
\|A\|_{\mathrm{op}}=\max\{\lambda_{\max}(A),\lambda_{\max}(-A)\}.
\end{align*}
Thus
\begin{align*}
\left\{\|S\|_{\mathrm{op}}\ge t\right\}
\subset
\left\{\lambda_{\max}(S)\ge t\right\}
\cup
\left\{\lambda_{\max}(-S)\ge t\right\}.
\end{align*}
Using the union bound and the two tail estimates,
\begin{align*}
\mathbb P(\|S\|_{\mathrm{op}}\ge t)
\le \mathbb P(\lambda_{\max}(S)\ge t)+\mathbb P(\lambda_{\max}(-S)\ge t).
\end{align*}
The right-hand side is bounded by
\begin{align*}
2d\exp\left(-\frac{t^2/2}{\sigma^2+Lt/3}\right).
\end{align*}
Since $S=\sum_{i=1}^nY_i$, this is the asserted Matrix Bernstein inequality.[/step]
custom_env
admin
[guided]The largest-eigenvalue estimate controls only the upper spectral tail. To control the operator norm of a symmetric matrix, we also need the lower spectral tail. Define the random matrix $\widetilde S:\Omega\to\mathbb S^d$ by
\begin{align*}
\widetilde S(\omega)=-S(\omega)=\sum_{i=1}^n(-Y_i(\omega)).
\end{align*}
The matrices $-Y_1,\dots,-Y_n$ remain independent because [measurable functions](/page/Measurable%20Functions) of independent random variables are independent. They are centered because $\mathbb E[-Y_i]=-\mathbb E[Y_i]=0$, and they are symmetric almost surely because each $Y_i$ is symmetric almost surely. Their operator norms satisfy
\begin{align*}
\|-Y_i\|_{\mathrm{op}}=\|Y_i\|_{\mathrm{op}}\le L
\end{align*}
almost surely, and their variance matrix is the same since $(-Y_i)^2=Y_i^2$. Therefore the upper-tail estimate applied to $\widetilde S=-S$ gives
\begin{align*}
\mathbb P(\lambda_{\max}(-S)\ge t)
\le d\exp\left(-\frac{t^2/2}{\sigma^2+Lt/3}\right).
\end{align*}
For every symmetric matrix $A\in\mathbb S^d$, the operator norm is the larger of the positive and negative spectral extremes:
\begin{align*}
\|A\|_{\mathrm{op}}=\max\{\lambda_{\max}(A),\lambda_{\max}(-A)\}.
\end{align*}
Hence
\begin{align*}
\left\{\|S\|_{\mathrm{op}}\ge t\right\}
\subset
\left\{\lambda_{\max}(S)\ge t\right\}
\cup
\left\{\lambda_{\max}(-S)\ge t\right\}.
\end{align*}
Applying the union bound gives
\begin{align*}
\mathbb P(\|S\|_{\mathrm{op}}\ge t)
\le \mathbb P(\lambda_{\max}(S)\ge t)+\mathbb P(\lambda_{\max}(-S)\ge t).
\end{align*}
Applying the two one-sided estimates to the right-hand side gives
\begin{align*}
\mathbb P(\|S\|_{\mathrm{op}}\ge t)
\le 2d\exp\left(-\frac{t^2/2}{\sigma^2+Lt/3}\right).
\end{align*}
Since $S=\sum_{i=1}^nY_i$, this is exactly the claimed operator-norm bound.[/guided]