Androma — The Home of Mathematics on the Internet

custom_env admin

[step:Bound the information radius of the packed models]Let $P_\beta$ denote the joint law of $(X_1,Y_1),\dots,(X_n,Y_n)$ under parameter $\beta$, and let $P_0$ denote the law corresponding to $\beta=0$. Let $X\in\mathbb{R}^{n\times d}$ denote the random design matrix whose $i$th row is $X_i^\top$. Because the marginal distribution of $X$ does not depend on $\beta$, the chain rule for Kullback-Leibler divergence gives \begin{align*} D(P_{\beta_v}\|P_0) = \mathbb{E}\left[ D\left(P_{\beta_v}(Y\in \cdot \mid X)\,\|\,P_0(Y\in \cdot \mid X)\right) \right]. \end{align*} Conditional on $X$, the vector $Y=(Y_1,\dots,Y_n)\in\mathbb{R}^n$ is Gaussian with covariance matrix $\sigma^2 I_n$ and mean $X\beta_v$ under $P_{\beta_v}$, and with mean $0$ under $P_0$. The standard Kullback-Leibler divergence formula for Gaussian measures with common covariance applies because both conditional Gaussian laws have the same positive definite covariance matrix $\sigma^2 I_n$ and hence are mutually absolutely continuous. It gives \begin{align*} D\left(P_{\beta_v}(Y\in \cdot \mid X)\,\|\,P_0(Y\in \cdot \mid X)\right) = \frac{|X\beta_v|^2}{2\sigma^2}. \end{align*} Taking expectation over $X$ and using independence and identical distribution of the rows, \begin{align*} D(P_{\beta_v}\|P_0)=\frac{1}{2\sigma^2}\mathbb{E}\left[|X\beta_v|^2\right] \end{align*} and \begin{align*} \frac{1}{2\sigma^2}\mathbb{E}\left[|X\beta_v|^2\right]=\frac{1}{2\sigma^2}\sum_{i=1}^n \mathbb{E}\left[(X_i^\top \beta_v)^2\right] \end{align*} and therefore \begin{align*} D(P_{\beta_v}\|P_0)=\frac{n}{2\sigma^2}\beta_v^\top \Sigma \beta_v. \end{align*} Since $\beta_v$ is $k$-sparse and hence $2k$-sparse, the upper sparse eigenvalue assumption gives \begin{align*} D(P_{\beta_v}\|P_0) \leq \frac{n\kappa_+|\beta_v|^2}{2\sigma^2} = \frac{n\kappa_+a^2k}{2\sigma^2} = \frac{\gamma}{2}k\log(d/k). \end{align*} Using $\log|\mathcal{V}|\geq c_{\mathrm{VG}}k\log(d/k)$, choose $\gamma\leq c_{\mathrm{VG}}/8$. Then \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0) \leq \frac{1}{16}\log|\mathcal{V}|. \end{align*}[/step]

custom_env admin

[guided]The information calculation measures how difficult it is to distinguish a packed alternative from the null model. Let $P_\beta$ be the joint law of the full sample under parameter $\beta$, and let $P_0$ be the law at $\beta=0$. Let $X\in\mathbb{R}^{n\times d}$ be the random design matrix with $i$th row $X_i^\top$. The marginal law of $X$ is the same for every $\beta$, so the chain rule for Kullback-Leibler divergence reduces the calculation to the conditional law of $Y$ given $X$: \begin{align*} D(P_{\beta_v}\|P_0) = \mathbb{E}\left[ D\left(P_{\beta_v}(Y\in \cdot \mid X)\,\|\,P_0(Y\in \cdot \mid X)\right) \right]. \end{align*} Conditional on $X$, the vector $Y=(Y_1,\dots,Y_n)$ is Gaussian with covariance matrix $\sigma^2 I_n$. Under $P_{\beta_v}$ its mean is $X\beta_v$, while under $P_0$ its mean is $0$. The Kullback-Leibler formula for Gaussian measures with common covariance applies because both conditional laws have the same positive definite covariance matrix $\sigma^2 I_n$; it gives \begin{align*} D\left(P_{\beta_v}(Y\in \cdot \mid X)\,\|\,P_0(Y\in \cdot \mid X)\right) = \frac{|X\beta_v|^2}{2\sigma^2}. \end{align*} Taking expectation over the design and using that the rows $X_1,\dots,X_n$ are independent with covariance matrix $\Sigma$, we obtain \begin{align*} D(P_{\beta_v}\|P_0) = \frac{1}{2\sigma^2}\mathbb{E}\left[|X\beta_v|^2\right]. \end{align*} Expanding the Euclidean norm in $\mathbb{R}^n$ gives \begin{align*} \mathbb{E}\left[|X\beta_v|^2\right] = \sum_{i=1}^n \mathbb{E}\left[(X_i^\top \beta_v)^2\right]. \end{align*} Since each row has covariance matrix $\Sigma$, \begin{align*} \sum_{i=1}^n \mathbb{E}\left[(X_i^\top \beta_v)^2\right] = n\beta_v^\top \Sigma \beta_v. \end{align*} Therefore \begin{align*} D(P_{\beta_v}\|P_0)=\frac{n}{2\sigma^2}\beta_v^\top \Sigma \beta_v. \end{align*} Because $\beta_v$ is $k$-sparse, it is also $2k$-sparse. The upper sparse eigenvalue assumption therefore yields \begin{align*} D(P_{\beta_v}\|P_0) \leq \frac{n\kappa_+|\beta_v|^2}{2\sigma^2} = \frac{n\kappa_+a^2k}{2\sigma^2} = \frac{\gamma}{2}k\log(d/k). \end{align*} The purpose of the amplitude choice is now visible: the divergence is proportional to the same entropy scale as the packing. Since $\log|\mathcal{V}|\geq c_{\mathrm{VG}}k\log(d/k)$, choosing $\gamma\leq c_{\mathrm{VG}}/8$ gives \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0) \leq \frac{1}{16}\log|\mathcal{V}|. \end{align*}[/guided]

custom_env admin

[step:Apply Fano's inequality to the packed subproblem]Let $V$ be uniformly distributed on $\mathcal{V}$, and conditionally on $V=v$, let the observation have law $P_{\beta_v}$. Fano's inequality in its testing form with a fixed reference law applies because $\mathcal{V}$ is finite, the prior on $V$ is uniform, and the divergences $D(P_{\beta_v}\|P_0)$ are finite by the Gaussian calculation above. It implies that every estimator $\widehat{V}$ of $V$ satisfies \begin{align*} \mathbb{P}(\widehat{V}\neq V) \geq 1-\frac{\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0)+\log 2}{\log|\mathcal{V}|}. \end{align*} The preceding information bound gives an average divergence at most $\frac{1}{16}\log|\mathcal{V}|$. If $\log|\mathcal{V}|\geq 4\log 2$, this yields \begin{align*} \inf_{\widehat{V}}\mathbb{P}(\widehat{V}\neq V) \geq 1-\frac{1}{16}-\frac{1}{4} = \frac{11}{16} > \frac{1}{2}. \end{align*} If $\log|\mathcal{V}|<4\log 2$, then the entropy lower bound implies \begin{align*} k\log(d/k)<\frac{4\log 2}{c_{\mathrm{VG}}}. \end{align*} For this bounded-entropy regime, take two disjoint $k$-subsets $A,B\subset\{1,\dots,d\}$, which exist because $k\leq d/2$, and define $\beta_A,\beta_B\in\mathbb{R}^d$ by setting $(\beta_A)_j=a\mathbb{1}_A(j)$ and $(\beta_B)_j=a\mathbb{1}_B(j)$ for every $j\in\{1,\dots,d\}$. The same radius argument places both points in $\mathcal{B}_k(R)$. Since $\beta_A-\beta_B$ is $2k$-sparse and has Euclidean norm squared $2a^2k$, the lower sparse eigenvalue assumption gives \begin{align*} |\Sigma^{1/2}(\beta_A-\beta_B)|^2 \geq 2\kappa_-a^2k =: \Delta^2. \end{align*} The Gaussian KL calculation and the upper sparse eigenvalue assumption give \begin{align*} D(P_{\beta_A}\|P_{\beta_B}) = \frac{n}{2\sigma^2}(\beta_A-\beta_B)^\top\Sigma(\beta_A-\beta_B) \leq \gamma k\log(d/k) < \frac{4\gamma\log 2}{c_{\mathrm{VG}}}. \end{align*} Choose $\gamma\leq c_{\mathrm{VG}}/(16\log 2)$, in addition to the previous restrictions. Then $D(P_{\beta_A}\|P_{\beta_B})\leq 1/4$. Pinsker's inequality gives $\|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}\leq \sqrt{1/8}$, and the two-point testing lower bound gives every test $\widehat{T}\in\{A,B\}$ the error probability bound \begin{align*} \max\left\{P_{\beta_A}(\widehat{T}=B),P_{\beta_B}(\widehat{T}=A)\right\} \geq \frac{1-\|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}}{2} \geq \frac{1-\sqrt{1/8}}{2} > \frac{1}{4}. \end{align*} For any estimator $\hat{\beta}$, decode to the nearer of $\beta_A$ and $\beta_B$ in the norm $b\mapsto |\Sigma^{1/2}b|$, with ties broken deterministically. The same nearest-neighbour argument used below shows that a decoding error implies squared prediction loss at least $\Delta^2/4$. Therefore \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)}\mathbb{E}_{\beta}\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{\Delta^2}{16} = \frac{\gamma}{8}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} This is stronger than the final bound after replacing the universal constant by $\gamma/16$. Thus, in the remaining nontrivial packing case, we may work with \begin{align*} \inf_{\widehat{V}}\mathbb{P}(\widehat{V}\neq V) \geq \frac{1}{2}. \end{align*} Now let $\hat{\beta}: (\mathbb{R}^d\times\mathbb{R})^n\to\mathbb{R}^d$ be any measurable estimator. Define the nearest-neighbour decoder $\widehat{V}_{\hat{\beta}}: (\mathbb{R}^d\times\mathbb{R})^n \to \mathcal{V}$ as follows. For a sample $z=((X_1,Y_1),\dots,(X_n,Y_n))\in(\mathbb{R}^d\times\mathbb{R})^n$, set \begin{align*} \widehat{V}_{\hat{\beta}}(z) := \operatorname*{argmin}_{v\in\mathcal{V}} |\Sigma^{1/2}(\hat{\beta}(z)-\beta_v)|, \end{align*} with ties broken by a fixed ordering of $\mathcal{V}$. If $V=v$ and \begin{align*} |\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2 < \frac{\delta^2}{4}, \end{align*} then the triangle inequality in the Euclidean norm after applying $\Sigma^{1/2}$ implies that no other $\beta_w$ with $w\neq v$ can be at least as close as $\beta_v$, because \begin{align*} |\Sigma^{1/2}(\beta_v-\beta_w)| \leq |\Sigma^{1/2}(\hat{\beta}-\beta_v)| + |\Sigma^{1/2}(\hat{\beta}-\beta_w)|. \end{align*} Since the left-hand side is at least $\delta$, this forces \begin{align*} |\Sigma^{1/2}(\hat{\beta}-\beta_w)|>\frac{\delta}{2}>|\Sigma^{1/2}(\hat{\beta}-\beta_v)|. \end{align*} Therefore \begin{align*} \{\widehat{V}_{\hat{\beta}}\neq V\} \subset \left\{|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right\}. \end{align*} Consequently, \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} \mathbb{E}_{\beta_v}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2\right]. \end{align*} The preceding event inclusion then gives \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} \mathbb{E}_{\beta_v}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2\right] \geq \frac{\delta^2}{4}\mathbb{P}(\widehat{V}_{\hat{\beta}}\neq V) \geq \frac{\delta^2}{8}. \end{align*} Substituting the value of $\delta^2$ yields \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{\gamma}{16}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} Taking the infimum over all measurable estimators $\hat{\beta}$ proves the restricted lower bound with $c:=\gamma/16>0$.[/step]

custom_env admin

[guided]The statistical lower bound now comes from converting estimation into a finite testing problem. Suppose an estimator $\hat{\beta}$ had prediction error much smaller than the squared separation $\delta^2$ on every packed point. Then we could identify which packed point generated the data by choosing the nearest packed parameter to $\hat{\beta}$ in prediction norm. This would contradict Fano's inequality. Let $V$ be uniformly distributed on $\mathcal{V}$, and given $V=v$, let the data have law $P_{\beta_v}$. Fano's inequality in testing form with a fixed reference law states that, for any estimator $\widehat{V}$ of $V$, \begin{align*} \mathbb{P}(\widehat{V}\neq V) \geq 1-\frac{\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0)+\log 2}{\log|\mathcal{V}|}. \end{align*} Its hypotheses are satisfied here: $\mathcal{V}$ is finite, $V$ has the uniform prior, and the divergences to $P_0$ are finite by the Gaussian KL computation. The previous information calculation gives \begin{align*} \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}D(P_{\beta_v}\|P_0) \leq \frac{1}{16}\log|\mathcal{V}|. \end{align*} When $\log|\mathcal{V}|\geq 4\log 2$, Fano gives \begin{align*} \mathbb{P}(\widehat{V}\neq V) \geq 1-\frac{1}{16}-\frac{1}{4} = \frac{11}{16} > \frac{1}{2}. \end{align*} It remains to handle the case $\log|\mathcal{V}|<4\log 2$ without referring back to the exact proof. The entropy lower bound implies \begin{align*} k\log(d/k)<\frac{4\log 2}{c_{\mathrm{VG}}}. \end{align*} Because $k\leq d/2$, there exist two disjoint $k$-element subsets $A,B\subset\{1,\dots,d\}$. Define two vectors $\beta_A,\beta_B\in\mathbb{R}^d$ by the coordinate rules $(\beta_A)_j=a\mathbb{1}_A(j)$ and $(\beta_B)_j=a\mathbb{1}_B(j)$ for every $j\in\{1,\dots,d\}$. Each vector is $k$-sparse, and the same radius calculation as above gives $\beta_A,\beta_B\in\mathcal{B}_k(R)$. Since $A$ and $B$ are disjoint, $\beta_A-\beta_B$ has $2k$ nonzero coordinates, each of absolute value $a$, so \begin{align*} |\beta_A-\beta_B|^2=2a^2k. \end{align*} Applying the lower sparse eigenvalue assumption to the $2k$-sparse vector $\beta_A-\beta_B$ gives \begin{align*} |\Sigma^{1/2}(\beta_A-\beta_B)|^2 \geq 2\kappa_-a^2k =: \Delta^2. \end{align*} The same Gaussian KL computation applied to the pair $\beta_A,\beta_B$ gives \begin{align*} D(P_{\beta_A}\|P_{\beta_B}) = \frac{n}{2\sigma^2}(\beta_A-\beta_B)^\top\Sigma(\beta_A-\beta_B). \end{align*} The vector $\beta_A-\beta_B$ is $2k$-sparse, so the upper sparse eigenvalue assumption yields \begin{align*} D(P_{\beta_A}\|P_{\beta_B}) \leq \frac{n\kappa_+}{2\sigma^2}|\beta_A-\beta_B|^2 = \gamma k\log(d/k) < \frac{4\gamma\log 2}{c_{\mathrm{VG}}}. \end{align*} If we choose $\gamma\leq c_{\mathrm{VG}}/(16\log 2)$, then $D(P_{\beta_A}\|P_{\beta_B})\leq 1/4$. Pinsker's inequality gives \begin{align*} \|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}\leq \sqrt{\frac{1}{2}D(P_{\beta_A}\|P_{\beta_B})}\leq \sqrt{\frac{1}{8}}. \end{align*} Le Cam's two-point testing bound therefore gives, for every test $\widehat{T}\in\{A,B\}$, \begin{align*} \max\left\{P_{\beta_A}(\widehat{T}=B),P_{\beta_B}(\widehat{T}=A)\right\} \geq \frac{1-\|P_{\beta_A}-P_{\beta_B}\|_{\mathrm{TV}}}{2} \geq \frac{1-\sqrt{1/8}}{2} > \frac{1}{4}. \end{align*} Given any regression estimator $\hat{\beta}$, decode to whichever of $\beta_A$ and $\beta_B$ is nearer in the norm $b\mapsto |\Sigma^{1/2}b|$, breaking ties deterministically. If the squared prediction loss is less than $\Delta^2/4$ at the true point, the triangle inequality forces this two-point decoder to be correct. Hence a decoding error implies squared loss at least $\Delta^2/4$, and so \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)}\mathbb{E}_{\beta}\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{\Delta^2}{4}\cdot\frac{1}{4} = \frac{\Delta^2}{16} = \frac{\gamma}{8}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} Thus the bounded-entropy case is already covered after decreasing the final universal constant to $\gamma/16$. In the nontrivial packing case used below, every decoder satisfies \begin{align*} \mathbb{P}(\widehat{V}\neq V)\geq \frac{1}{2}. \end{align*} We now construct a decoder from an arbitrary regression estimator. Let $\hat{\beta}: (\mathbb{R}^d\times\mathbb{R})^n\to\mathbb{R}^d$ be measurable. Define $\widehat{V}_{\hat{\beta}}:(\mathbb{R}^d\times\mathbb{R})^n \to \mathcal{V}$ as follows. For a sample $z=((X_1,Y_1),\dots,(X_n,Y_n))\in(\mathbb{R}^d\times\mathbb{R})^n$, set \begin{align*} \widehat{V}_{\hat{\beta}}(z) := \operatorname*{argmin}_{v\in\mathcal{V}} |\Sigma^{1/2}(\hat{\beta}(z)-\beta_v)|, \end{align*} breaking ties by a fixed ordering of $\mathcal{V}$. This map is measurable because $\mathcal{V}$ is finite and each displayed distance is a measurable function of the data. If the true index is $V=v$ and the estimator satisfies \begin{align*} |\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2<\frac{\delta^2}{4}, \end{align*} then $\hat{\beta}$ is within distance $\delta/2$ of the true packed point. Since every other packed point is at least distance $\delta$ from $\beta_v$, the triangle inequality gives, for every $w\neq v$, \begin{align*} |\Sigma^{1/2}(\hat{\beta}-\beta_w)| \geq |\Sigma^{1/2}(\beta_v-\beta_w)| - |\Sigma^{1/2}(\hat{\beta}-\beta_v)| > \delta-\frac{\delta}{2} = \frac{\delta}{2}. \end{align*} Thus the nearest-neighbour decoder must return $v$. Equivalently, \begin{align*} \{\widehat{V}_{\hat{\beta}}\neq V\} \subset \left\{|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right\}. \end{align*} Using this implication and averaging over the uniform prior on $\mathcal{V}$, \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}} \mathbb{E}_{\beta_v}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_v)|^2\right]. \end{align*} Averaging over the uniform prior identifies the right-hand side as \begin{align*} \mathbb{E}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\right]. \end{align*} Using the elementary bound $Z\geq t\mathbb{1}_{\{Z\geq t\}}$ for the nonnegative [random variable](/page/Random%20Variable) $Z:=|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2$ and the threshold $t:=\delta^2/4$, we get \begin{align*} \mathbb{E}\left[|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\right] \geq \frac{\delta^2}{4} \mathbb{P}\left(|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right). \end{align*} The event inclusion gives \begin{align*} \frac{\delta^2}{4} \mathbb{P}\left(|\Sigma^{1/2}(\hat{\beta}-\beta_V)|^2\geq \frac{\delta^2}{4}\right) \geq \frac{\delta^2}{4}\mathbb{P}(\widehat{V}_{\hat{\beta}}\neq V) \geq \frac{\delta^2}{8}. \end{align*} Finally, from the previous step, \begin{align*} \delta^2 = \frac{\gamma}{2}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} Therefore \begin{align*} \sup_{\beta\in\mathcal{B}_k(R)} \mathbb{E}_\beta\left[|\Sigma^{1/2}(\hat{\beta}-\beta)|^2\right] \geq \frac{\gamma}{16}\frac{\kappa_-}{\kappa_+}\sigma^2\frac{k\log(d/k)}{n}. \end{align*} Since $\hat{\beta}$ was arbitrary, taking the infimum over all measurable estimators proves the restricted minimax lower bound.[/guided]

custom_env admin

What brings you to Androma?

Start with a route through the knowledge graph.

Attributions & Verification

Proof

Verification Progress

Contributors

Who Can Verify

Quick Actions

Sign in to Androma

Check your inbox

One last step

Attributions & Verification

Proof

Verification Progress

Contributors

Who Can Verify

Quick Actions

Raw Attribution Data