Bounded Gaussian Mean Minimax Rate — Statement & Proof

Theorem

Proof

[proofplan] The upper bound is obtained by comparing two elementary estimators: the unbiased estimator $\hat{\theta}(X)=X$, whose risk is $d\sigma^2$, and the zero estimator, whose risk is at most $R^2$ on the parameter ball. For the lower bound, we place a product Rademacher prior on a coordinate hypercube contained in $B_2^d(R)$. A scalar two-point Gaussian testing calculation shows that estimating one active coordinate costs at least a universal multiple of $a^2$ when the coordinate amplitude $a$ is at most $\sigma$. Choosing the number of active coordinates and the amplitude so that the hypercube lies in the ball gives a Bayes risk lower bound of order $\min\{R^2,d\sigma^2\}$, and the minimax risk dominates this Bayes risk. [/proofplan] [step:Compare the identity estimator and the zero estimator for the upper bound] Define the estimator $\hat{\theta}_{\mathrm{id}}: \mathbb{R}^d \to \mathbb{R}^d$ by $\hat{\theta}_{\mathrm{id}}(x)=x$ for $x \in \mathbb{R}^d$. For every $\theta \in B_2^d(R)$, write $X=\theta+\sigma Z$, where $Z \sim \mathcal{N}(0,I_d)$. Then \begin{align*} \mathbb{E}_\theta[|\hat{\theta}_{\mathrm{id}}(X)-\theta|^2] = \mathbb{E}[|\sigma Z|^2] = \sigma^2 \sum_{i=1}^d \mathbb{E}[Z_i^2] = d\sigma^2. \end{align*} Define also the estimator $\hat{\theta}_{0}: \mathbb{R}^d \to \mathbb{R}^d$ by $\hat{\theta}_{0}(x)=0$ for $x \in \mathbb{R}^d$. For every $\theta \in B_2^d(R)$, \begin{align*} \mathbb{E}_\theta[|\hat{\theta}_{0}(X)-\theta|^2] = |\theta|^2 \leq R^2. \end{align*} Taking the better of these two estimators gives \begin{align*} \mathfrak{M}(B_2^d(R),|\cdot|^2) \leq \min\{R^2,d\sigma^2\}. \end{align*} Thus the desired upper bound holds with $C=1$. [guided] The minimax risk is an infimum over all estimators, so any particular estimator gives an upper bound. We use two estimators adapted to the two possible scales of the problem. 
First define $\hat{\theta}_{\mathrm{id}}: \mathbb{R}^d \to \mathbb{R}^d$ by $\hat{\theta}_{\mathrm{id}}(x)=x$ for $x \in \mathbb{R}^d$. If $X \sim \mathcal{N}(\theta,\sigma^2 I_d)$, then $X=\theta+\sigma Z$ for a standard Gaussian vector $Z \sim \mathcal{N}(0,I_d)$. Therefore \begin{align*} \mathbb{E}_\theta[|\hat{\theta}_{\mathrm{id}}(X)-\theta|^2] = \mathbb{E}[|X-\theta|^2] = \mathbb{E}[|\sigma Z|^2] = \sigma^2 \sum_{i=1}^d \mathbb{E}[Z_i^2] = d\sigma^2. \end{align*} This estimator ignores the boundedness of the parameter set and pays exactly the total noise variance. Second define $\hat{\theta}_{0}: \mathbb{R}^d \to \mathbb{R}^d$ by $\hat{\theta}_{0}(x)=0$ for $x \in \mathbb{R}^d$. This estimator ignores the data. On the ball $B_2^d(R)$ its squared error is deterministically bounded by the squared radius: \begin{align*} \mathbb{E}_\theta[|\hat{\theta}_{0}(X)-\theta|^2] = |\theta|^2 \leq R^2. \end{align*} Since the minimax risk is no larger than the risk of either estimator, it is no larger than the smaller of the two bounds: \begin{align*} \mathfrak{M}(B_2^d(R),|\cdot|^2) \leq \min\{R^2,d\sigma^2\}. \end{align*} [/guided] [/step] [step:Prove a scalar two-point Gaussian Bayes risk lower bound] Let $a \in [0,\sigma]$. Let $\varepsilon$ be a Rademacher [random variable](/page/Random%20Variable), meaning $\mathbb{P}(\varepsilon=1)=\mathbb{P}(\varepsilon=-1)=1/2$, and let $\xi \sim \mathcal{N}(0,\sigma^2)$ be independent of $\varepsilon$. Define \begin{align*} Y := a\varepsilon+\xi. \end{align*} For every measurable function $T:\mathbb{R}\to\mathbb{R}$, \begin{align*} \mathbb{E}[(T(Y)-a\varepsilon)^2] \geq a^2 \Phi(-1), \end{align*} where $\mathcal{L}^1$ denotes one-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) and $\Phi: \mathbb{R} \to [0,1]$ is the standard normal distribution function defined by \begin{align*} \Phi(t):=\frac{1}{\sqrt{2\pi}}\int_{(-\infty,t]} e^{-u^2/2}\,d\mathcal{L}^1(u) \end{align*} for $t \in \mathbb{R}$. 
Indeed, define the sign estimator $\psi_T:\mathbb{R}\to\{-1,1\}$ by declaring, for each $y\in\mathbb{R}$, that $\psi_T(y)=1$ if $T(y)\geq 0$ and $\psi_T(y)=-1$ if $T(y)<0$. If $\psi_T(Y)\neq \varepsilon$, then $T(Y)$ and $a\varepsilon$ have opposite signs or $T(Y)=0$ while $a\varepsilon \neq 0$, hence $|T(Y)-a\varepsilon|\geq a$. Therefore \begin{align*} \mathbb{E}[(T(Y)-a\varepsilon)^2] \geq a^2\mathbb{P}(\psi_T(Y)\neq \varepsilon). \end{align*} The two conditional densities of $Y$ with respect to $\mathcal{L}^1$ are $p_{+}:\mathbb{R}\to[0,\infty)$ and $p_{-}:\mathbb{R}\to[0,\infty)$, corresponding respectively to $\varepsilon=1$ and $\varepsilon=-1$, where \begin{align*} p_{+}(y)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y-a)^2}{2\sigma^2}\right) \end{align*} and \begin{align*} p_{-}(y)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y+a)^2}{2\sigma^2}\right) \end{align*} for $y \in \mathbb{R}$. Since the two prior probabilities are equal, every measurable sign rule $\psi:\mathbb{R}\to\{-1,1\}$ has error probability \begin{align*} \mathbb{P}(\psi(Y)\neq \varepsilon) = \frac{1}{2}\int_{\{y:\psi(y)=-1\}} p_{+}(y)\,d\mathcal{L}^1(y) + \frac{1}{2}\int_{\{y:\psi(y)=1\}} p_{-}(y)\,d\mathcal{L}^1(y). \end{align*} Pointwise minimization of the integrand gives \begin{align*} \inf_{\psi}\mathbb{P}(\psi(Y)\neq \varepsilon) = \frac{1}{2}\int_{\mathbb{R}}\min\{p_{+}(y),p_{-}(y)\}\,d\mathcal{L}^1(y). \end{align*} When $a>0$, the inequality $p_{+}(y)\geq p_{-}(y)$ is equivalent to $y\geq 0$; when $a=0$, the two densities coincide and every rule attains the infimum. In either case the rule that decides $\varepsilon=1$ exactly when $Y\geq 0$ is a minimizer. Its error probability is \begin{align*} \frac{1}{2}\mathbb{P}(Y<0\mid \varepsilon=1) + \frac{1}{2}\mathbb{P}(Y\geq 0\mid \varepsilon=-1) = \Phi(-a/\sigma) \geq \Phi(-1), \end{align*} because both conditional probabilities equal $\Phi(-a/\sigma)$ and $0\leq a/\sigma\leq 1$. Since $\mathbb{P}(\psi_T(Y)\neq \varepsilon)$ is at least this infimum, every measurable $T$ satisfies the claimed bound. [guided] We reduce scalar estimation to scalar testing. 
Let $T:\mathbb{R}\to\mathbb{R}$ be any measurable estimator of $a\varepsilon$ from the observation $Y=a\varepsilon+\xi$. Define the induced sign rule $\psi_T:\mathbb{R}\to\{-1,1\}$ by $\psi_T(y)=1$ when $T(y)\geq 0$ and $\psi_T(y)=-1$ when $T(y)<0$. If $\psi_T(Y)\neq\varepsilon$, then $T(Y)$ lies on the wrong side of $0$ relative to $a\varepsilon$, or equals $0$ while $a\varepsilon\neq 0$. Hence $|T(Y)-a\varepsilon|\geq a$, and therefore \begin{align*} \mathbb{E}[(T(Y)-a\varepsilon)^2] \geq a^2\mathbb{P}(\psi_T(Y)\neq\varepsilon). \end{align*} It remains to lower-bound the best possible testing error. Conditional on $\varepsilon=1$, the observation $Y$ has density $p_{+}:\mathbb{R}\to[0,\infty)$ with respect to $\mathcal{L}^1$ given by \begin{align*} p_{+}(y)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y-a)^2}{2\sigma^2}\right), \end{align*} and conditional on $\varepsilon=-1$, it has density $p_{-}:\mathbb{R}\to[0,\infty)$ given by \begin{align*} p_{-}(y)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y+a)^2}{2\sigma^2}\right). \end{align*} For any measurable sign rule $\psi:\mathbb{R}\to\{-1,1\}$, equal prior probabilities give \begin{align*} \mathbb{P}(\psi(Y)\neq \varepsilon) = \frac{1}{2}\int_{\{y:\psi(y)=-1\}} p_{+}(y)\,d\mathcal{L}^1(y) + \frac{1}{2}\int_{\{y:\psi(y)=1\}} p_{-}(y)\,d\mathcal{L}^1(y). \end{align*} At each observed value $y$, deciding in favor of the larger of $p_{+}(y)$ and $p_{-}(y)$ minimizes the contribution to the error integral; where the two densities are equal, either decision is optimal. Thus \begin{align*} \inf_{\psi}\mathbb{P}(\psi(Y)\neq \varepsilon) = \frac{1}{2}\int_{\mathbb{R}}\min\{p_{+}(y),p_{-}(y)\}\,d\mathcal{L}^1(y). \end{align*} The comparison $p_{+}(y)\geq p_{-}(y)$ is equivalent, after taking logarithms and cancelling common constants, to $(y-a)^2\leq (y+a)^2$, that is, to $4ay\geq 0$. When $a>0$ this is equivalent to $y\geq 0$; when $a=0$ the densities coincide and every rule is optimal. Therefore an optimal test decides $\varepsilon=1$ when $Y\geq 0$ and $\varepsilon=-1$ when $Y<0$. 
Its error probability is \begin{align*} \frac{1}{2}\mathbb{P}(Y<0\mid \varepsilon=1) + \frac{1}{2}\mathbb{P}(Y\geq 0\mid \varepsilon=-1) = \Phi(-a/\sigma). \end{align*} Since $0\leq a\leq\sigma$, we have $0\leq a/\sigma\leq 1$, and monotonicity of the standard normal distribution function gives $\Phi(-a/\sigma)\geq\Phi(-1)$. Combining the testing lower bound with the reduction from estimation to testing yields \begin{align*} \mathbb{E}[(T(Y)-a\varepsilon)^2] \geq a^2\Phi(-1). \end{align*} [/guided] [/step] [step:Build a Rademacher hypercube contained in the parameter ball] Assume first that $\min\{R^2,d\sigma^2\}>0$. Define \begin{align*} s := \min\{R^2,d\sigma^2\}. \end{align*} Choose an integer \begin{align*} m := \max\left\{1,\left\lfloor \frac{s}{\sigma^2}\right\rfloor\right\} \end{align*} when $s\geq \sigma^2$, and choose $m:=1$ when $s<\sigma^2$. Define the amplitude $a \in [0,\infty)$ as follows. If $s\geq \sigma^2$, set \begin{align*} a:=\sigma. \end{align*} If $s<\sigma^2$, set \begin{align*} a:=\sqrt{s}. \end{align*} Then $1\leq m\leq d$, $0\leq a\leq \sigma$, and \begin{align*} ma^2 \leq s \leq R^2. \end{align*} Moreover, \begin{align*} ma^2 \geq \frac{s}{2}. \end{align*} When $s<\sigma^2$, this is equality because $m=1$ and $a^2=s$. When $s\geq\sigma^2$, the inequality follows from $\lfloor s/\sigma^2\rfloor \geq s/(2\sigma^2)$. Let $(\Omega,\mathcal{F},\mathbb{P})$ be a probability space supporting independent Rademacher random variables $\varepsilon_1,\dots,\varepsilon_m$. Define the random parameter $\Theta:(\Omega,\mathcal{F})\to(\mathbb{R}^d,\mathcal{B}(\mathbb{R}^d))$ by \begin{align*} \Theta(\omega):=(a\varepsilon_1(\omega),\dots,a\varepsilon_m(\omega),0,\dots,0) \end{align*} for $\omega\in\Omega$. For every $\omega \in \Omega$, \begin{align*} |\Theta(\omega)|^2 = ma^2 \leq R^2, \end{align*} so the prior distribution of $\Theta$ is supported on $B_2^d(R)$. 
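The choice of $m$ and $a$ above can be checked numerically. The following is a minimal Python sketch, assuming $\min\{R^2,d\sigma^2\}>0$; the helper name `hypercube_params` is hypothetical and is introduced only for illustration.

```python
import math


def hypercube_params(R: float, d: int, sigma: float) -> tuple[int, float]:
    """Return (m, a) following the construction above.

    Hypothetical helper; assumes min(R^2, d * sigma^2) > 0.
    """
    s = min(R * R, d * sigma * sigma)
    if s >= sigma * sigma:
        # Many active coordinates, each at the largest allowed amplitude sigma.
        m = max(1, math.floor(s / (sigma * sigma)))
        a = sigma
    else:
        # A single active coordinate carrying the whole budget s.
        m = 1
        a = math.sqrt(s)
    return m, a


# Example: R = 2, d = 10, sigma = 1 gives s = 4, hence m = 4 and a = 1.
print(hypercube_params(2.0, 10, 1.0))  # (4, 1.0)
```

In this example $ma^2=4=s$, so both $ma^2\leq s$ and $ma^2\geq s/2$ hold.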
[/step] [step:Lower-bound the Bayes risk of the hypercube prior coordinate by coordinate] Let $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$ be any measurable estimator, and let $\hat{\theta}_1,\dots,\hat{\theta}_d:\mathbb{R}^d\to\mathbb{R}$ denote its coordinate functions, so that \begin{align*} \hat{\theta}(x)=(\hat{\theta}_1(x),\dots,\hat{\theta}_d(x)) \end{align*} for $x \in \mathbb{R}^d$. Let \begin{align*} X := \Theta+\sigma Z, \end{align*} where $Z=(Z_1,\dots,Z_d)$ is a standard Gaussian vector independent of $\Theta$. Then \begin{align*} \mathbb{E}[|\hat{\theta}(X)-\Theta|^2] = \sum_{i=1}^d \mathbb{E}[(\hat{\theta}_i(X)-\Theta_i)^2] \geq \sum_{i=1}^m \mathbb{E}[(\hat{\theta}_i(X)-a\varepsilon_i)^2]. \end{align*} Fix $i \in \{1,\dots,m\}$. Let $X_{-i}$ denote the vector obtained from $X$ by deleting the $i$th coordinate, and let $\mathbb{P}_{X_{-i}}$ denote the law of $X_{-i}$ on $\mathbb{R}^{d-1}$. Since $X_{-i}$ is independent of the pair $(\varepsilon_i,Z_i)$, and hence of $(X_i,\varepsilon_i)$, the regular conditional distribution of $(X_i,\varepsilon_i)$ given $X_{-i}=z$ exists for $\mathbb{P}_{X_{-i}}$-almost every $z$ and is the same as the unconditional law of $(a\varepsilon_i+\sigma Z_i,\varepsilon_i)$. For such $z$, the map $T_z:\mathbb{R}\to\mathbb{R}$ defined by \begin{align*} T_z(x_i):=\hat{\theta}_i(x_i,z) \end{align*} for $x_i \in \mathbb{R}$ is a scalar estimator of $a\varepsilon_i$ from $X_i=a\varepsilon_i+\sigma Z_i$. By the scalar lower bound from the previous step, \begin{align*} \mathbb{E}[(\hat{\theta}_i(X)-a\varepsilon_i)^2\mid X_{-i}=z] \geq a^2\Phi(-1) \end{align*} for $\mathbb{P}_{X_{-i}}$-almost every $z$. Integrating over the law of $X_{-i}$ gives \begin{align*} \mathbb{E}[(\hat{\theta}_i(X)-a\varepsilon_i)^2] \geq a^2\Phi(-1). \end{align*} Summing over $i=1,\dots,m$ yields \begin{align*} \mathbb{E}[|\hat{\theta}(X)-\Theta|^2] \geq m a^2\Phi(-1) \geq \frac{\Phi(-1)}{2}\min\{R^2,d\sigma^2\}. 
\end{align*} [guided] The point of the hypercube prior is that it creates many independent scalar estimation problems inside the ball. Recall the construction: $m \in \{1,\dots,d\}$ and $a \in [0,\sigma]$ were chosen so that $ma^2\leq R^2$ and $ma^2\geq \frac{1}{2}\min\{R^2,d\sigma^2\}$, and the random parameter is \begin{align*} \Theta=(a\varepsilon_1,\dots,a\varepsilon_m,0,\dots,0), \end{align*} where $\varepsilon_1,\dots,\varepsilon_m$ are independent Rademacher random variables. Let $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$ be arbitrary, and write its coordinate functions as $\hat{\theta}_1,\dots,\hat{\theta}_d:\mathbb{R}^d\to\mathbb{R}$, so that $\hat{\theta}(x)=(\hat{\theta}_1(x),\dots,\hat{\theta}_d(x))$ for $x\in\mathbb{R}^d$. Under this prior, the observation has the form \begin{align*} X=\Theta+\sigma Z, \end{align*} where $Z=(Z_1,\dots,Z_d)\sim \mathcal{N}(0,I_d)$ is independent of $\Theta$. Expanding the squared Euclidean norm coordinate by coordinate gives \begin{align*} \mathbb{E}[|\hat{\theta}(X)-\Theta|^2] = \sum_{i=1}^d \mathbb{E}[(\hat{\theta}_i(X)-\Theta_i)^2]. \end{align*} The inactive coordinates only add non-negative terms, so \begin{align*} \mathbb{E}[|\hat{\theta}(X)-\Theta|^2] \geq \sum_{i=1}^m \mathbb{E}[(\hat{\theta}_i(X)-a\varepsilon_i)^2]. \end{align*} Now fix an active coordinate $i$. The estimator $\hat{\theta}_i(X)$ is allowed to depend on all coordinates of $X$, not just $X_i$, so we must justify why the scalar lower bound still applies. Define $X_{-i}$ to be the vector obtained from $X$ by deleting its $i$th coordinate, and let $\mathbb{P}_{X_{-i}}$ denote the law of $X_{-i}$ on $\mathbb{R}^{d-1}$. Because the prior signs $\varepsilon_1,\dots,\varepsilon_m$ are independent, the Gaussian noises $Z_1,\dots,Z_d$ are independent, and $Z$ is independent of $\Theta$, the random vector $X_{-i}$ is independent of the pair $(\varepsilon_i,Z_i)$, and hence of $(X_i,\varepsilon_i)$. 
Hence, after conditioning on $X_{-i}=z$, the only remaining information about $\varepsilon_i$ is contained in \begin{align*} X_i=a\varepsilon_i+\sigma Z_i. \end{align*} Because $X_{-i}$ takes values in a Euclidean space, regular conditional distributions exist. For $\mathbb{P}_{X_{-i}}$-almost every fixed value $z$, define the map $T_z:\mathbb{R}\to\mathbb{R}$ by $T_z(x_i)=\hat{\theta}_i(x_i,z)$ for $x_i\in\mathbb{R}$. This is a scalar estimator of $a\varepsilon_i$ from the one-dimensional Gaussian observation $X_i$. The scalar two-point bound from the previous step applies because its hypotheses are satisfied: $\varepsilon_i$ is Rademacher, $\sigma Z_i\sim\mathcal{N}(0,\sigma^2)$ is independent of $\varepsilon_i$, and the amplitude satisfies $0\leq a\leq\sigma$. Hence \begin{align*} \mathbb{E}[(T_z(X_i)-a\varepsilon_i)^2\mid X_{-i}=z] \geq a^2\Phi(-1) \end{align*} for $\mathbb{P}_{X_{-i}}$-almost every $z$. Equivalently, \begin{align*} \mathbb{E}[(\hat{\theta}_i(X)-a\varepsilon_i)^2\mid X_{-i}=z] \geq a^2\Phi(-1). \end{align*} Integrating this conditional inequality over the distribution of $X_{-i}$ gives \begin{align*} \mathbb{E}[(\hat{\theta}_i(X)-a\varepsilon_i)^2] \geq a^2\Phi(-1). \end{align*} Because this holds for every active coordinate $i=1,\dots,m$, summation gives \begin{align*} \mathbb{E}[|\hat{\theta}(X)-\Theta|^2] \geq m a^2\Phi(-1). \end{align*} The construction ensured $ma^2\geq \frac{1}{2}\min\{R^2,d\sigma^2\}$, so \begin{align*} \mathbb{E}[|\hat{\theta}(X)-\Theta|^2] \geq \frac{\Phi(-1)}{2}\min\{R^2,d\sigma^2\}. \end{align*} [/guided] [/step] [step:Pass from Bayes risk to minimax risk] Let $\Pi$ denote the prior distribution of $\Theta$ on $B_2^d(R)$. For every measurable estimator $\hat{\theta}:\mathbb{R}^d\to\mathbb{R}^d$, \begin{align*} \sup_{\theta\in B_2^d(R)} \mathbb{E}_\theta[|\hat{\theta}(X)-\theta|^2] \geq \int_{B_2^d(R)} \mathbb{E}_\theta[|\hat{\theta}(X)-\theta|^2]\,d\Pi(\theta). 
\end{align*} The right-hand side is exactly the Bayes risk computed in the previous step, so \begin{align*} \sup_{\theta\in B_2^d(R)} \mathbb{E}_\theta[|\hat{\theta}(X)-\theta|^2] \geq \frac{\Phi(-1)}{2}\min\{R^2,d\sigma^2\}. \end{align*} Taking the infimum over all measurable estimators gives \begin{align*} \mathfrak{M}(B_2^d(R),|\cdot|^2) \geq \frac{\Phi(-1)}{2}\min\{R^2,d\sigma^2\}. \end{align*} If $\min\{R^2,d\sigma^2\}=0$, the same lower bound is immediate because the minimax risk is non-negative. Combining this lower bound with the upper bound proves that there exist universal constants $c:=\Phi(-1)/2>0$ and $C:=1$ such that \begin{align*} c\min\{R^2,d\sigma^2\} \leq \mathfrak{M}(B_2^d(R),|\cdot|^2) \leq C\min\{R^2,d\sigma^2\}. \end{align*} Equivalently, in the notation $A\asymp B$ meaning that $c_0B\leq A\leq C_0B$ for some universal constants $c_0,C_0>0$, this is \begin{align*} \mathfrak{M}(B_2^d(R),|\cdot|^2) \asymp \min\{R^2,d\sigma^2\}. \end{align*} [/step]
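
To get a feel for the size of the constants $c=\Phi(-1)/2$ and $C=1$ in the two-sided bound, they can be evaluated numerically. The following is a minimal Python sketch; the function names `std_normal_cdf` and `minimax_bounds` are hypothetical helpers introduced for illustration, and $\Phi$ is computed through the standard identity $\Phi(t)=\tfrac{1}{2}\operatorname{erfc}(-t/\sqrt{2})$.

```python
import math


def std_normal_cdf(t: float) -> float:
    """Standard normal distribution function Phi(t), via the complementary error function."""
    return 0.5 * math.erfc(-t / math.sqrt(2.0))


def minimax_bounds(R: float, d: int, sigma: float) -> tuple[float, float]:
    """Return (c * s, C * s) with s = min(R^2, d * sigma^2).

    Hypothetical helper: c = Phi(-1) / 2 and C = 1 are the constants from the proof.
    """
    s = min(R * R, d * sigma * sigma)
    c = std_normal_cdf(-1.0) / 2.0
    C = 1.0
    return c * s, C * s


lower, upper = minimax_bounds(R=2.0, d=10, sigma=1.0)
print(f"c = {std_normal_cdf(-1.0) / 2.0:.4f}")
print(f"{lower:.4f} <= minimax risk <= {upper:.4f}")
```

Since $\Phi(-1)$ is roughly $0.159$, the lower constant is about $0.079$. The argument therefore determines the minimax risk only up to a constant factor and makes no claim about the sharp constant.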

Bounded Gaussian Mean Minimax Rate (Theorem # 5896)
