Sparse Gaussian Design Prediction Minimax Lower Bound

Sparse Gaussian Design Prediction Minimax Lower Bound (Theorem # 5904)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We build a finite family of $k$-sparse regression vectors by placing a small amplitude $a>0$ on the coordinates of a Varshamov-Gilbert packing of $k$-subsets of $\{1,\dots,d\}$. The upper sparse eigenvalue bounds the Kullback-Leibler divergence from each packed model to the null model, while the lower sparse eigenvalue converts Hamming separation into prediction-norm separation. Choosing $a^2$ proportional to $\sigma^2\log(d/k)/(n\kappa_+)$ makes the information radius small enough for [Fano's inequality](/theorems/1654) and, under the assumed lower bound on $R$, keeps every packed vector inside $\mathcal{B}_k(R)$. The unrestricted result follows because $\mathcal{B}_k(R)\subset \mathcal{B}_k$. [/proofplan] [step:Construct a Hamming-separated family of sparse binary vectors] Let $\{0,1\}^d_k$ denote the set of all binary vectors $v \in \{0,1\}^d$ with exactly $k$ nonzero coordinates. By the standard Varshamov-Gilbert packing bound for constant-weight binary vectors, there is a subset $\mathcal{V}\subset \{0,1\}^d_k$ and a universal constant $c_{\mathrm{VG}}>0$ such that, for all distinct $v,w\in\mathcal{V}$, \begin{align*} d_H(v,w) \geq \frac{k}{2}. \end{align*} The same packing satisfies \begin{align*} \log |\mathcal{V}| \geq c_{\mathrm{VG}} k\log(d/k), \end{align*} where $d_H(v,w):=|\{j\in\{1,\dots,d\}:v_j\neq w_j\}|$ is the Hamming distance. Let $\gamma>0$ be a universal constant to be fixed below, and define the amplitude \begin{align*} a^2 := \gamma \frac{\sigma^2\log(d/k)}{n\kappa_+}. \end{align*} For each $v\in\mathcal{V}$, define the parameter vector $\beta_v\in\mathbb{R}^d$ by \begin{align*} (\beta_v)_j = a v_j \quad \text{for every } j\in\{1,\dots,d\}. \end{align*} Equivalently, $\beta_v=a v$ as an element of $\mathbb{R}^d$. Since $v$ has exactly $k$ nonzero coordinates, each $\beta_v$ is $k$-sparse and \begin{align*} |\beta_v|^2 = a^2 k. \end{align*} If $C_0\geq \gamma$, then the radius hypothesis gives \begin{align*} |\beta_v|^2 = \gamma \frac{\sigma^2}{\kappa_+}\frac{k\log(d/k)}{n} \leq R^2, \end{align*} so $\beta_v\in\mathcal{B}_k(R)$ for every $v\in\mathcal{V}$. [guided] The packing is designed to create many parameters that are far apart in prediction norm but hard to distinguish statistically. We first choose binary vectors because their Hamming distance directly counts how many coordinates differ. Let $\{0,1\}^d_k$ be the set of all binary vectors in $\mathbb{R}^d$ with exactly $k$ entries equal to $1$. The standard Varshamov-Gilbert packing bound for constant-weight binary vectors gives a subset $\mathcal{V}\subset \{0,1\}^d_k$ and a universal constant $c_{\mathrm{VG}}>0$ such that any two distinct vectors in $\mathcal{V}$ differ on at least $k/2$ coordinates. It also gives the entropy lower bound \begin{align*} \log |\mathcal{V}| \geq c_{\mathrm{VG}} k\log(d/k). \end{align*} Here the assumption $k\leq d/2$ is the regime in which the usual lower bound has the stated order $k\log(d/k)$. We now turn binary vectors into regression parameters by assigning a common amplitude. Let $\gamma>0$ be a small universal constant to be fixed when we control the Kullback-Leibler divergence, and define \begin{align*} a^2 := \gamma \frac{\sigma^2\log(d/k)}{n\kappa_+}. \end{align*} For every $v\in\mathcal{V}$, define $\beta_v=a v\in\mathbb{R}^d$. Since $v$ has exactly $k$ nonzero coordinates, $\beta_v$ is $k$-sparse and \begin{align*} |\beta_v|^2 = \sum_{j=1}^d a^2 v_j^2 = a^2 k. \end{align*} The radius condition is precisely what ensures that this packing lies in the restricted parameter space. Indeed, if $C_0\geq \gamma$, then \begin{align*} |\beta_v|^2 = \gamma \frac{\sigma^2}{\kappa_+}\frac{k\log(d/k)}{n} \leq C_0 \frac{\sigma^2}{\kappa_+}\frac{k\log(d/k)}{n} \leq R^2. \end{align*} Thus every packed parameter belongs to $\mathcal{B}_k(R)$. [/guided] [/step] [step:Bound the information radius of the packed models] Let $P_\beta$ denote the joint law of $(X_1,Y_1),\dots,(X_n,Y_n)$ under parameter $\beta$, and let $P_0$ denote the law corresponding to $\beta=0$. Let $X\in\mathbb{R}^{n\times d}$ denote the random design matrix whose $i$th row is $X_i^\top$. Because the marginal distribution of $X$ does not depend on $\beta$, the chain rule for Kullback-Leibler divergence gives \begin{align*} D(P_{\beta_v}\|P_0) = \mathbb{E}\left[ D\left(P_{\beta_v}(Y\in \cdot \mid X)\,\|\,P_0(Y\in \cdot \mid X)\right) \right]. \end{align*} Conditional on $X$, the vector $Y=(Y_1,\dots,Y_n)\in\mathbb{R}^n$ is Gaussian with covariance matrix $\sigma^2 I_n$ and mean $X\beta_v$ under $P_{\beta_v}$, and with mean $0$ under $P_0$. The standard Kullback-Leibler divergence formula for Gaussian measures with common covariance applies because both conditional Gaussian laws have the same positive definite covariance matrix $\sigma^2 I_n$ and hence are mutually absolutely continuous. It gives \begin{align*} D\left(P_{\beta_v}(Y\in \cdot \mid X)\,\|\,P_0(Y\in \cdot \mid X)\right) = \frac{|X\beta_v|^2}{2\sigma^2}. \end{align*} Taking expectation over $X$ and using independence and identical distribution of the rows, \begin{align*} D(P_{\beta_v}\|P_0)=\frac{1}{2\sigma^2}\mathbb{E}\left[|X\beta_v|^2\right] \end{align*} and \begin{align*} \frac{1}{2\sigma^2}\mathbb{E}\left[|X\beta_v|^2\right]=\frac{1}{2\sigma^2}\sum_{i=1}^n \mathbb{E}\left[(X_i^\top \beta_v)^2\right] \end{align*} and therefore \begin{align*} D(P_{\beta_v}\|P_0)=\frac{n}{2\sigma^2}\beta_v^\top \Sigma \beta_v. \end{align*} Since $\beta_v$ is $k$-sparse and hence $2k$-sparse, the upper sparse eigenvalue assumption gives \begin{align*} D(P_{\beta_v}\|P_0) \leq \frac{n\kappa_+|\beta_v|^2}{2\sigma^2} = \frac{n\kappa_+a^2k}{2\sigma^2} = \frac{\gamma}{2}k\log(d/k). \end{align*} Using $\log|\mathcal{V}|\geq c_{\mathrm{VG}}k\log(d/k)$, choose $\gamma\leq c_{\mathrm{VG}}/8$. Then \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0) \leq \frac{1}{16}\log|\mathcal{V}|. \end{align*} [guided] The information calculation measures how difficult it is to distinguish a packed alternative from the null model. Let $P_\beta$ be the joint law of the full sample under parameter $\beta$, and let $P_0$ be the law at $\beta=0$. Let $X\in\mathbb{R}^{n\times d}$ be the random design matrix with $i$th row $X_i^\top$. The marginal law of $X$ is the same for every $\beta$, so the chain rule for Kullback-Leibler divergence reduces the calculation to the conditional law of $Y$ given $X$: \begin{align*} D(P_{\beta_v}\|P_0) = \mathbb{E}\left[ D\left(P_{\beta_v}(Y\in \cdot \mid X)\,\|\,P_0(Y\in \cdot \mid X)\right) \right]. \end{align*} Conditional on $X$, the vector $Y=(Y_1,\dots,Y_n)$ is Gaussian with covariance matrix $\sigma^2 I_n$. Under $P_{\beta_v}$ its mean is $X\beta_v$, while under $P_0$ its mean is $0$. The Kullback-Leibler formula for Gaussian measures with common covariance applies because both conditional laws have the same positive definite covariance matrix $\sigma^2 I_n$; it gives \begin{align*} D\left(P_{\beta_v}(Y\in \cdot \mid X)\,\|\,P_0(Y\in \cdot \mid X)\right) = \frac{|X\beta_v|^2}{2\sigma^2}. \end{align*} Taking expectation over the design and using that the rows $X_1,\dots,X_n$ are independent with covariance matrix $\Sigma$, we obtain \begin{align*} D(P_{\beta_v}\|P_0) = \frac{1}{2\sigma^2}\mathbb{E}\left[|X\beta_v|^2\right]. \end{align*} Expanding the Euclidean norm in $\mathbb{R}^n$ gives \begin{align*} \mathbb{E}\left[|X\beta_v|^2\right] = \sum_{i=1}^n \mathbb{E}\left[(X_i^\top \beta_v)^2\right]. \end{align*} Since each row has covariance matrix $\Sigma$, \begin{align*} \sum_{i=1}^n \mathbb{E}\left[(X_i^\top \beta_v)^2\right] = n\beta_v^\top \Sigma \beta_v. \end{align*} Therefore \begin{align*} D(P_{\beta_v}\|P_0)=\frac{n}{2\sigma^2}\beta_v^\top \Sigma \beta_v. \end{align*} Because $\beta_v$ is $k$-sparse, it is also $2k$-sparse. The upper sparse eigenvalue assumption therefore yields \begin{align*} D(P_{\beta_v}\|P_0) \leq \frac{n\kappa_+|\beta_v|^2}{2\sigma^2} = \frac{n\kappa_+a^2k}{2\sigma^2} = \frac{\gamma}{2}k\log(d/k). \end{align*} The purpose of the amplitude choice is now visible: the divergence is proportional to the same entropy scale as the packing. Since $\log|\mathcal{V}|\geq c_{\mathrm{VG}}k\log(d/k)$, choosing $\gamma\leq c_{\mathrm{VG}}/8$ gives \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0) \leq \frac{1}{16}\log|\mathcal{V}|. \end{align*} [/guided] [/step] [step:Convert Hamming separation into prediction separation] For distinct $v,w\in\mathcal{V}$, the difference $\beta_v-\beta_w=a(v-w)$ is supported on the union of two $k$-element supports, so it is $2k$-sparse. The lower sparse eigenvalue assumption therefore gives \begin{align*} |\Sigma^{1/2}(\beta_v-\beta_w)|^2 = (\beta_v-\beta_w)^\top\Sigma(\beta_v-\beta_w) \geq \kappa_-|\beta_v-\beta_w|^2. \end{align*} Because $v_j-w_j\in\{-1,0,1\}$, \begin{align*} |\beta_v-\beta_w|^2 = a^2\sum_{j=1}^d (v_j-w_j)^2 = a^2 d_H(v,w) \geq \frac{a^2 k}{2}. \end{align*} Hence the squared prediction distance between any two distinct packed parameters is at least \begin{align*} \delta^2 := \frac{\kappa_-a^2k}{2} = \frac{\gamma}{2}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} [guided] This step translates the combinatorial packing into separation in the loss used by the theorem. Fix distinct $v,w\in\mathcal{V}$. The vector $\beta_v-\beta_w=a(v-w)$ can be nonzero only on the union of the supports of $v$ and $w$. Each support has cardinality $k$, so this union has cardinality at most $2k$, and therefore $\beta_v-\beta_w$ is $2k$-sparse. The lower sparse eigenvalue assumption applies to every $2k$-sparse vector. Applying it to $u=\beta_v-\beta_w$ gives \begin{align*} |\Sigma^{1/2}(\beta_v-\beta_w)|^2 = (\beta_v-\beta_w)^\top\Sigma(\beta_v-\beta_w) \geq \kappa_-|\beta_v-\beta_w|^2. \end{align*} It remains to compute the Euclidean norm of the difference. Since each coordinate difference $v_j-w_j$ belongs to $\{-1,0,1\}$, its square is $1$ exactly on the coordinates where $v$ and $w$ differ. Hence \begin{align*} |\beta_v-\beta_w|^2 = a^2\sum_{j=1}^d (v_j-w_j)^2 = a^2 d_H(v,w). \end{align*} The packing guarantees $d_H(v,w)\geq k/2$, so \begin{align*} |\beta_v-\beta_w|^2 \geq \frac{a^2k}{2}. \end{align*} Combining the two inequalities gives the uniform prediction separation \begin{align*} |\Sigma^{1/2}(\beta_v-\beta_w)|^2 \geq \frac{\kappa_-a^2k}{2}. \end{align*} We denote this lower bound by \begin{align*} \delta^2 := \frac{\kappa_-a^2k}{2} = \frac{\gamma}{2}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} [/guided] [/step] [step:Apply Fano's inequality to the packed subproblem] Let $V$ be uniformly distributed on $\mathcal{V}$, and conditionally on $V=v$, let the observation have law $P_{\beta_v}$. Fano's inequality in its testing form with a fixed reference law applies because $\mathcal{V}$ is finite, the prior on $V$ is uniform, and the divergences $D(P_{\beta_v}\|P_0)$ are finite by the Gaussian calculation above. It implies that every estimator $\widehat{V}$ of $V$ satisfies \begin{align*} \mathbb{P}(\widehat{V}\neq V) \geq 1-\frac{\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0)+\log 2}{\log|\mathcal{V}|}. \end{align*} The preceding information bound gives an average divergence at most $\frac{1}{16}\log|\mathcal{V}|$. If $\log|\mathcal{V}|\geq 4\log 2$, this yields \begin{align*} \inf_{\widehat{V}}\mathbb{P}(\widehat{V}\neq V) \geq 1-\frac{1}{16}-\frac{1}{4} = \frac{11}{16} > \frac{1}{2}. \end{align*} If $\log|\mathcal{V}|<4\log 2$, then the entropy lower bound implies \begin{align*} k\log(d/k)<\frac{4\log 2}{c_{\mathrm{VG}}}. \end{align*} For this bounded-entropy regime, take two disjoint $k$-subsets $A,B\subset\{1,\dots,d\}$, which exist because $k\leq d/2$, and define $\beta_A,\beta_B\in\mathbb{R}^d$ by setting $(\beta_A)_j=a\mathbb{1}_A(j)$ and $(\beta_B)_j=a\mathbb{1}_B(j)$ for every $j\in\{1,\dots,d\}$. The same radius argument places both points in $\mathcal{B}_k(R)$. Since $\beta_A-\beta_B$ is $2k$-sparse and has Euclidean norm squared $2a^2k$, the lower sparse eigenvalue assumption gives \begin{align*} |\Sigma^{1/2}(\beta_A-\beta_B)|^2 \geq 2\kappa_-a^2k =: \Delta^2. \end{align*} The Gaussian KL calculation and the upper sparse eigenvalue assumption give \begin{align*} D(P_{\beta_A}\|P_{\beta_B}) = \frac{n}{2\sigma^2}(\beta_A-\beta_B)^\top\Sigma(\beta_A-\beta_B) \leq \gamma k\log(d/k) < \frac{4\gamma\log 2}{c_{\mathrm{VG}}}. \end{align*} Choose $\gamma\leq c_{\mathrm{VG}}/(16\log 2)$, in addition to the previous restrictions. Then $D(P_{\beta_A}\|P_{\beta_B})\leq 1/4$. Pinsker's inequality gives $\|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}\leq \sqrt{1/8}$, and the two-point testing lower bound gives every test $\widehat{T}\in\{A,B\}$ the error probability bound \begin{align*} \max\left\{P_{\beta_A}(\widehat{T}=B),P_{\beta_B}(\widehat{T}=A)\right\} \geq \frac{1-\|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}}{2} \geq \frac{1-\sqrt{1/8}}{2} > \frac{1}{4}. \end{align*} For any estimator $\hat{\beta}$, decode to the nearer of $\beta_A$ and $\beta_B$ in the norm $b\mapsto |\Sigma^{1/2}b|$, with ties broken deterministically. The same nearest-neighbour argument used below shows that a decoding error implies squared prediction loss at least $\Delta^2/4$. Therefore \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)}\mathbb{E}_{\beta}\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{\Delta^2}{16} = \frac{\gamma}{8}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} This is stronger than the final bound after replacing the universal constant by $\gamma/16$. Thus, in the remaining nontrivial packing case, we may work with \begin{align*} \inf_{\widehat{V}}\mathbb{P}(\widehat{V}\neq V) \geq \frac{1}{2}. \end{align*} Now let $\hat{\beta}: (\mathbb{R}^d\times\mathbb{R})^n\to\mathbb{R}^d$ be any measurable estimator. Define the nearest-neighbour decoder $\widehat{V}_{\hat{\beta}}: (\mathbb{R}^d\times\mathbb{R})^n \to \mathcal{V}$ as follows. For a sample $z=((X_1,Y_1),\dots,(X_n,Y_n))\in(\mathbb{R}^d\times\mathbb{R})^n$, set \begin{align*} \widehat{V}_{\hat{\beta}}(z) := \operatorname*{argmin}_{v\in\mathcal{V}} |\Sigma^{1/2}(\hat{\beta}(z)-\beta_v)|, \end{align*} with ties broken by a fixed ordering of $\mathcal{V}$. If $V=v$ and \begin{align*} |\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2 < \frac{\delta^2}{4}, \end{align*} then the triangle inequality in the Euclidean norm after applying $\Sigma^{1/2}$ implies that no other $\beta_w$ with $w\neq v$ can be at least as close as $\beta_v$, because \begin{align*} |\Sigma^{1/2}(\beta_v-\beta_w)| \leq |\Sigma^{1/2}(\hat{\beta}-\beta_v)| + |\Sigma^{1/2}(\hat{\beta}-\beta_w)|. \end{align*} Since the left-hand side is at least $\delta$, this forces \begin{align*} |\Sigma^{1/2}(\hat{\beta}-\beta_w)|>\frac{\delta}{2}>|\Sigma^{1/2}(\hat{\beta}-\beta_v)|. \end{align*} Therefore \begin{align*} \{\widehat{V}_{\hat{\beta}}\neq V\} \subset \left\{|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right\}. \end{align*} Consequently, \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} \mathbb{E}_{\beta_v}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2\right]. \end{align*} The preceding event inclusion then gives \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} \mathbb{E}_{\beta_v}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2\right] \geq \frac{\delta^2}{4}\mathbb{P}(\widehat{V}_{\hat{\beta}}\neq V) \geq \frac{\delta^2}{8}. \end{align*} Substituting the value of $\delta^2$ yields \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{\gamma}{16}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} Taking the infimum over all measurable estimators $\hat{\beta}$ proves the restricted lower bound with $c:=\gamma/16>0$. [guided] The statistical lower bound now comes from converting estimation into a finite testing problem. Suppose an estimator $\hat{\beta}$ had prediction error much smaller than the squared separation $\delta^2$ on every packed point. Then we could identify which packed point generated the data by choosing the nearest packed parameter to $\hat{\beta}$ in prediction norm. This would contradict Fano's inequality. Let $V$ be uniformly distributed on $\mathcal{V}$, and given $V=v$, let the data have law $P_{\beta_v}$. Fano's inequality in testing form with a fixed reference law states that, for any estimator $\widehat{V}$ of $V$, \begin{align*} \mathbb{P}(\widehat{V}\neq V) \geq 1-\frac{\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0)+\log 2}{\log|\mathcal{V}|}. \end{align*} Its hypotheses are satisfied here: $\mathcal{V}$ is finite, $V$ has the uniform prior, and the divergences to $P_0$ are finite by the Gaussian KL computation. The previous information calculation gives \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0) \leq \frac{1}{16}\log|\mathcal{V}|. \end{align*} When $\log|\mathcal{V}|\geq 4\log 2$, Fano gives \begin{align*} \mathbb{P}(\widehat{V}\neq V) \geq 1-\frac{1}{16}-\frac{1}{4} = \frac{11}{16} > \frac{1}{2}. \end{align*} It remains to handle the case $\log|\mathcal{V}|<4\log 2$ without referring back to the exact proof. The entropy lower bound implies \begin{align*} k\log(d/k)<\frac{4\log 2}{c_{\mathrm{VG}}}. \end{align*} Because $k\leq d/2$, there exist two disjoint $k$-element subsets $A,B\subset\{1,\dots,d\}$. Define two vectors $\beta_A,\beta_B\in\mathbb{R}^d$ by the coordinate rules $(\beta_A)_j=a\mathbb{1}_A(j)$ and $(\beta_B)_j=a\mathbb{1}_B(j)$ for every $j\in\{1,\dots,d\}$. Each vector is $k$-sparse, and the same radius calculation as above gives $\beta_A,\beta_B\in\mathcal{B}_k(R)$. Since $A$ and $B$ are disjoint, $\beta_A-\beta_B$ has $2k$ nonzero coordinates, each of absolute value $a$, so \begin{align*} |\beta_A-\beta_B|^2=2a^2k. \end{align*} Applying the lower sparse eigenvalue assumption to the $2k$-sparse vector $\beta_A-\beta_B$ gives \begin{align*} |\Sigma^{1/2}(\beta_A-\beta_B)|^2 \geq 2\kappa_-a^2k =: \Delta^2. \end{align*} The same Gaussian KL computation applied to the pair $\beta_A,\beta_B$ gives \begin{align*} D(P_{\beta_A}\|P_{\beta_B}) = \frac{n}{2\sigma^2}(\beta_A-\beta_B)^\top\Sigma(\beta_A-\beta_B). \end{align*} The vector $\beta_A-\beta_B$ is $2k$-sparse, so the upper sparse eigenvalue assumption yields \begin{align*} D(P_{\beta_A}\|P_{\beta_B}) \leq \frac{n\kappa_+}{2\sigma^2}|\beta_A-\beta_B|^2 = \gamma k\log(d/k) < \frac{4\gamma\log 2}{c_{\mathrm{VG}}}. \end{align*} If we choose $\gamma\leq c_{\mathrm{VG}}/(16\log 2)$, then $D(P_{\beta_A}\|P_{\beta_B})\leq 1/4$. Pinsker's inequality gives \begin{align*} \|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}\leq \sqrt{\frac{1}{2}D(P_{\beta_A}\|P_{\beta_B})}\leq \sqrt{\frac{1}{8}}. \end{align*} Le Cam's two-point testing bound therefore gives, for every test $\widehat{T}\in\{A,B\}$, \begin{align*} \max\left\{P_{\beta_A}(\widehat{T}=B),P_{\beta_B}(\widehat{T}=A)\right\} \geq \frac{1-\|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}}{2} \geq \frac{1-\sqrt{1/8}}{2} > \frac{1}{4}. \end{align*} Given any regression estimator $\hat{\beta}$, decode to whichever of $\beta_A$ and $\beta_B$ is nearer in the norm $b\mapsto |\Sigma^{1/2}b|$, breaking ties deterministically. If the squared prediction loss is less than $\Delta^2/4$ at the true point, the triangle inequality forces this two-point decoder to be correct. Hence a decoding error implies squared loss at least $\Delta^2/4$, and so \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)}\mathbb{E}_{\beta}\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{\Delta^2}{4}\cdot\frac{1}{4} = \frac{\Delta^2}{16} = \frac{\gamma}{8}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} Thus the bounded-entropy case is already covered after decreasing the final universal constant to $\gamma/16$. In the nontrivial packing case used below, every decoder satisfies \begin{align*} \mathbb{P}(\widehat{V}\neq V)\geq \frac{1}{2}. \end{align*} We now construct a decoder from an arbitrary regression estimator. Let $\hat{\beta}: (\mathbb{R}^d\times\mathbb{R})^n\to\mathbb{R}^d$ be measurable. Define $\widehat{V}_{\hat{\beta}}:(\mathbb{R}^d\times\mathbb{R})^n \to \mathcal{V}$ as follows. For a sample $z=((X_1,Y_1),\dots,(X_n,Y_n))\in(\mathbb{R}^d\times\mathbb{R})^n$, set \begin{align*} \widehat{V}_{\hat{\beta}}(z) := \operatorname*{argmin}_{v\in\mathcal{V}} |\Sigma^{1/2}(\hat{\beta}(z)-\beta_v)|, \end{align*} breaking ties by a fixed ordering of $\mathcal{V}$. This map is measurable because $\mathcal{V}$ is finite and each displayed distance is a measurable function of the data. If the true index is $V=v$ and the estimator satisfies \begin{align*} |\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2<\frac{\delta^2}{4}, \end{align*} then $\hat{\beta}$ is within distance $\delta/2$ of the true packed point. Since every other packed point is at least distance $\delta$ from $\beta_v$, the triangle inequality gives, for every $w\neq v$, \begin{align*} |\Sigma^{1/2}(\hat{\beta}-\beta_w)| \geq |\Sigma^{1/2}(\beta_v-\beta_w)| - |\Sigma^{1/2}(\hat{\beta}-\beta_v)| > \delta-\frac{\delta}{2} = \frac{\delta}{2}. \end{align*} Thus the nearest-neighbour decoder must return $v$. Equivalently, \begin{align*} \{\widehat{V}_{\hat{\beta}}\neq V\} \subset \left\{|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right\}. \end{align*} Using this implication and averaging over the uniform prior on $\mathcal{V}$, \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} \mathbb{E}_{\beta_v}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2\right]. \end{align*} Averaging over the uniform prior identifies the right-hand side as \begin{align*} \mathbb{E}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\right]. \end{align*} Using the elementary bound $Z\geq t\mathbb{1}_{\{Z\geq t\}}$ for the nonnegative [random variable](/page/Random%20Variable) $Z:=|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2$ and the threshold $t:=\delta^2/4$, we get \begin{align*} \mathbb{E}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\right] \geq \frac{\delta^2}{4} \mathbb{P}\left(|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right). \end{align*} The event inclusion gives \begin{align*} \frac{\delta^2}{4} \mathbb{P}\left(|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right) \geq \frac{\delta^2}{4}\mathbb{P}(\widehat{V}_{\hat{\beta}}\neq V) \geq \frac{\delta^2}{8}. \end{align*} Finally, from the previous step, \begin{align*} \delta^2 = \frac{\gamma}{2}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} Therefore \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{\gamma}{16}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} Since $\hat{\beta}$ was arbitrary, taking the infimum over all measurable estimators proves the restricted minimax lower bound. [/guided] [/step] [step:Pass from the restricted sparse ball to the unrestricted sparse class] Since $\mathcal{B}_k(R)\subset \mathcal{B}_k$, for every measurable estimator $\hat{\beta}$, \begin{align*} \sup_{\beta\in\mathcal{B}_k} \mathbb{E}_{\beta}\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_{\beta}\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right]. \end{align*} Taking the infimum over $\hat{\beta}$ on both sides and applying the restricted lower bound gives \begin{align*} \inf_{\hat{\beta}}\sup_{\beta\in\mathcal{B}_k} \mathbb{E}_{\beta}\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq c\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} This is the claimed unrestricted lower bound. [/step]

Prerequisites (0/9 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Definitions & Concepts

Explore Further

Distribution Definition Expectation Definition Event Definition Function Definition Matrix Definition Set Definition test Theorem #89 Fano's Inequality Theorem #1654 Triangle Inequality For Inner Product Spaces Theorem #433 Donsker's Invariance Principle Brownian Motion PGF and Moments Probability Theory Sparse Gaussian Mean Minimax Lower Bound Probability & Statistics Pinsker Inequality Probability & Statistics Gambler's Ruin Probability Probability Theory Consistency of the Cluster-Robust Sandwich Variance Estimator Under Many Clusters Probability & Statistics Existence and Uniqueness of Conditional Expectation Conditional Expectation Weak Law of Large Numbers Probability Theory Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.