Minimax $L^2$ Risk for Density Estimation over Hölder Classes

Minimax $L^2$ Risk for Density Estimation over Hölder Classes (Theorem # 6360)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We prove the two-sided minimax rate by combining the standard bias-variance upper bound for a boundary-corrected density estimator with an Assouad hypercube lower bound. The upper bound chooses the resolution \begin{align*} h_n\asymp n^{-1/(2s+d)}, \end{align*} so the squared Hölder bias \begin{align*} h_n^{2s} \end{align*} and stochastic variance \begin{align*} (nh_n^d)^{-1} \end{align*} balance. The lower bound embeds a finite family of localized density perturbations of the uniform density; the $L^2$ separation is of order \begin{align*} h^{2s} \end{align*} while adjacent likelihoods have bounded Hellinger distance at the same resolution scale. These two estimates give matching constants in front of \begin{align*} n^{-2s/(2s+d)}. \end{align*} [/proofplan] [step:Apply a boundary-corrected estimator to obtain the upper bound] Use a boundary-corrected compactly supported orthonormal wavelet basis on $[0,1]^d$ with regularity strictly larger than $s$. Let $V_h$ be the span of wavelets up to spatial scale $h$, so $\dim V_h\le C_Dh^{-d}$. Let $P_h$ be the $L^2$ projection onto $V_h$. Define the estimator as the measurable map \begin{align*} \hat f_{n,h}:[0,1]^{dn}\to L^2([0,1]^d) \end{align*} obtained by replacing each coefficient $\langle f,\varphi\rangle$ of $P_hf$ by the empirical average $n^{-1}\sum_{i=1}^n\varphi(X_i)$ over the finitely many basis functions $\varphi\in V_h$. The boundary-wavelet approximation theorem for Hölder balls gives \begin{align*} \|P_hf-f\|_{L^2([0,1]^d)}^2\le C_1h^{2s} \end{align*} for every $f\in\mathcal H^s(L;[0,1]^d)$. Write $\mathbb E_f$ and $\operatorname{Var}_f$ for one-sample expectation and variance when $X_1$ has density $f$. Orthogonality and independence give the integrated variance bound \begin{align*} \mathbb E_{f,n}\|\hat f_{n,h}-P_hf\|_{L^2([0,1]^d)}^2=\sum_{\varphi\in V_h}\operatorname{Var}_f\left(\frac{1}{n}\sum_{i=1}^n\varphi(X_i)\right)\le \frac{1}{n}\sum_{\varphi\in V_h}\mathbb E_f[\varphi(X_1)^2]. \end{align*} Since $f\le B$ and the basis is orthonormal in $L^2([0,1]^d)$, the last expression is at most $B\dim(V_h)/n\le C_2/(nh^d)$. Therefore, for every $f\in\mathcal F$, \begin{align*} \mathbb E_{f,n}\left[\int_{[0,1]^d}|\hat f_{n,h}(x)-f(x)|^2\,d\mathcal L^d(x)\right]\le C_1 h^{2s}+\frac{C_2}{n h^d}. \end{align*} The first term is the squared approximation error of the boundary-corrected projection on the Hölder ball, and the second term is the integrated variance under the product density law. Choose \begin{align*} h_n=n^{-1/(2s+d)}. \end{align*} If necessary replace $h_n$ by the nearest admissible dyadic resolution; this changes the estimate only by a constant depending on $s$ and $d$. Therefore there is $C=C(s,d,L,b,B)>0$ such that \begin{align*} R_n(\mathcal F)\le C n^{-2s/(2s+d)}. \end{align*} [guided] The upper bound is the familiar bias-variance calculation, but the boundary of $[0,1]^d$ matters: an ordinary translation-invariant kernel can lose its approximation order near the boundary. Use a boundary-corrected compactly supported orthonormal wavelet basis with regularity greater than $s$. At resolution $h$, let $V_h$ be the finite-dimensional space spanned by wavelets up to that scale, with $\dim V_h\le C_Dh^{-d}$. Define \begin{align*} \hat f_{n,h}:[0,1]^{dn}\to L^2([0,1]^d) \end{align*} by estimating each projection coefficient by the sample average $n^{-1}\sum_{i=1}^n\varphi(X_i)$. The boundary-wavelet approximation bound gives $\|P_hf-f\|_2^2\le C_1h^{2s}$ for $f\in\mathcal H^s(L;[0,1]^d)$. In this paragraph $\mathbb E_f$ denotes one-sample expectation under density $f$. For the variance, orthogonality gives a sum over coefficients, and since $f\le B$, \begin{align*} \mathbb E_{f,n}\|\hat f_{n,h}-P_hf\|_2^2\le\frac{1}{n}\sum_{\varphi\in V_h}\mathbb E_f[\varphi(X_1)^2]\le\frac{B\dim(V_h)}{n}\le\frac{C_2}{nh^d}. \end{align*} Adding squared bias and variance gives constants $C_1=C_1(s,d,L,B)>0$ and $C_2=C_2(s,d,B)>0$ with \begin{align*} \mathbb E_{f,n}\left[\int_{[0,1]^d}|\hat f_{n,h}(x)-f(x)|^2\,d\mathcal L^d(x)\right]\le C_1 h^{2s}+\frac{C_2}{n h^d}. \end{align*} Here $C_1h^{2s}$ is the squared Hölder approximation error, and $C_2/(nh^d)$ is the integrated variance coming from averaging $n$ independent observations over cells or wavelets of volume comparable to $h^d$. To minimize the displayed upper bound, balance the two terms: \begin{align*} h^{2s}=\frac{1}{n h^d}. \end{align*} This is equivalent to $h^{2s+d}=n^{-1}$, so we choose \begin{align*} h_n=n^{-1/(2s+d)}. \end{align*} Substituting this value gives \begin{align*} h_n^{2s}=n^{-2s/(2s+d)} \end{align*} and \begin{align*} \frac{1}{n h_n^d}=n^{-1}n^{d/(2s+d)}=n^{-2s/(2s+d)}. \end{align*} Thus the estimator has risk at most a constant multiple of $n^{-2s/(2s+d)}$, uniformly over $f\in\mathcal F$. Taking the infimum over all estimators can only decrease the risk, hence \begin{align*} R_n(\mathcal F)\le C n^{-2s/(2s+d)}. \end{align*} [/guided] [/step] [step:Embed a localized Assouad hypercube inside the density class] Choose once and for all a nonzero function \begin{align*} \psi:[0,1]^d\to\mathbb R \end{align*} in $C_c^\infty((0,1)^d)$ such that \begin{align*} \int_{[0,1]^d}\psi(x)\,d\mathcal L^d(x)=0. \end{align*} The bump $\psi$ is fixed as part of the proof, depending only on $d$ and the chosen Hölder norm convention, so all constants involving $\psi$ are absorbed into constants depending on $s,d,L,b,B,\rho$. For $h\in(0,1]$, place $M\asymp h^{-d}$ coordinate-aligned cubes $Q_1,\dots,Q_M\subset[0,1]^d$ of side length comparable to $h$, with pairwise gaps also comparable to $h$. Let $x_j\in Q_j$ denote the coordinatewise minimal vertex of $Q_j$, and define localized bumps $\psi_j:[0,1]^d\to\mathbb R$ by \begin{align*} \psi_j(x)=h^s\psi\left(\frac{x-x_j}{h}\right). \end{align*} For each sign vector $\theta=(\theta_1,\dots,\theta_M)\in\{-1,1\}^M$, define $f_\theta:[0,1]^d\to\mathbb R$ by \begin{align*} f_\theta(x)=1+a\sum_{j=1}^M\theta_j\psi_j(x), \end{align*} where $a=a(s,d,L,b,B,\rho)>0$ is chosen sufficiently small. Since the supports of the $\psi_j$ are disjoint, the zero integral of $\psi$ gives \begin{align*} \int_{[0,1]^d}f_\theta(x)\,d\mathcal L^d(x)=1. \end{align*} Because the supports of the bumps are separated by gaps comparable to $h$, the Hölder seminorm across different supports is controlled by the same scale as the local seminorms. Under the convention $m=\lfloor s\rfloor$ and $\alpha=s-m\in[0,1)$, with the Hölder-Zygmund interpretation when $\alpha=0$, the scaling $\psi_j(x)=h^s\psi((x-x_j)/h)$ gives a uniform bound $\|\sum_j\theta_j\psi_j\|_{\mathcal H^s}\le C_3$ for a constant $C_3=C_3(s,d,\psi)>0$. The uniform bound gives $|a\sum_j\theta_j\psi_j(x)|\le aC_4h^s\le aC_4$ with $C_4=C_4(\psi)>0$. Choose \begin{align*} a\le \min\left\{\frac{\rho}{C_3},\frac{1-b}{C_4},\frac{B-1}{C_4}\right\}. \end{align*} Then the stated interior-radius hypothesis gives $f_\theta=1+g_\theta\in\mathcal H^s(L;[0,1]^d)$ for $g_\theta=a\sum_j\theta_j\psi_j$, while $0<h\le1$ gives $b\le f_\theta\le B$. Hence $\{f_\theta:\theta\in\{-1,1\}^M\}\subset\mathcal F$. [/step] [step:Use Assouad's lemma to prove the lower bound] Let $\mathbb P_{\theta,n}$ denote the $n$-fold product law generated by the density $f_\theta$. Define the squared Hellinger distance between probability measures $P$ and $Q$ dominated by a measure $\mu$ by \begin{align*} H^2(P,Q)=\int \left(\sqrt{\frac{dP}{d\mu}}-\sqrt{\frac{dQ}{d\mu}}\right)^2\,d\mu, \end{align*} so $0\le H^2(P,Q)\le 2$. If $\theta$ and $\theta'$ differ only in coordinate $j$, then \begin{align*} \int_{[0,1]^d}|f_\theta(x)-f_{\theta'}(x)|^2\,d\mathcal L^d(x)=4a^2h^{2s+d}\int_{[0,1]^d}|\psi(u)|^2\,d\mathcal L^d(u). \end{align*} Since all densities in the hypercube are bounded below by $b$, the squared Hellinger distance between adjacent one-sample laws is bounded by a constant multiple of $a^2h^{2s+d}$. For the product laws this gives \begin{align*} H^2(\mathbb P_{\theta,n},\mathbb P_{\theta',n})\le C_5 n a^2 h^{2s+d}, \end{align*} with $C_5=C_5(b,\psi)>0$. Choose \begin{align*} h_n=n^{-1/(2s+d)}. \end{align*} Then the adjacent Hellinger distances are bounded by a constant strictly below $2$ after reducing $a$ if necessary. We use the following finite-hypercube form of [Assouad's lemma](/theorems/5906). Suppose $\{P_\theta:\theta\in\{-1,1\}^M\}$ is a statistical experiment and $\{g_\theta\}$ is a family in a [metric space](/page/Metric%20Space) with squared loss $\ell(g,\tilde g)=\|g-\tilde g\|_{L^2([0,1]^d)}^2$. If, for each coordinate $j$, changing only $\theta_j$ changes $g_\theta$ on a disjoint coordinate block with squared $L^2$ separation at least $\delta^2$, and if every adjacent pair has Hellinger affinity at least $\kappa>0$, then \begin{align*} \inf_{\tilde g}\sup_\theta \mathbb E_\theta[\ell(g_\theta,\tilde g)]\ge \frac{\kappa}{4}M\delta^2. \end{align*} Here the infimum is over all measurable estimators in $L^2([0,1]^d)$. The first hypothesis holds by the construction of $\theta\mapsto f_\theta$. If $\theta$ and $\theta'$ differ only in coordinate $j$, then changing the $j$th sign changes the density only on $Q_j$, so the coordinatewise squared $L^2$ separation is \begin{align*} \delta^2=4a^2h_n^{2s+d}\int_{[0,1]^d}|\psi(u)|^2\,d\mathcal L^d(u). \end{align*} The second hypothesis holds because the preceding Hellinger estimate and the choice of $a$ give $H^2(\mathbb P_{\theta,n},\mathbb P_{\theta',n})\le 2-\eta$ for some $\eta=\eta(s,d,L,b,B,\rho,\psi)>0$ and every adjacent pair. Since Hellinger affinity is $1-H^2/2$ under this normalization, every adjacent affinity is at least $\eta/2$. Assouad's lemma therefore yields a constant $c_1=c_1(s,d,L,b,B,\rho)>0$ such that \begin{align*} R_n(\mathcal F)\ge c_1 M h_n^{2s+d}. \end{align*} Since $M\asymp h_n^{-d}$, this becomes \begin{align*} R_n(\mathcal F)\ge c n^{-2s/(2s+d)} \end{align*} for a constant $c=c(s,d,L,b,B,\rho)>0$. [guided] We now rebuild the lower-bound construction from the beginning, because Assouad's lemma needs an explicit finite cube of alternatives inside $\mathcal F$. Choose a nonzero bump \begin{align*} \psi:[0,1]^d\to\mathbb R \end{align*} in $C_c^\infty((0,1)^d)$ such that \begin{align*} \int_{[0,1]^d}\psi(x)\,d\mathcal L^d(x)=0. \end{align*} For a resolution $h\in(0,1]$, choose $M$ coordinate-aligned cubes $Q_1,\dots,Q_M\subset[0,1]^d$ of side length comparable to $h$ and separated by gaps comparable to $h$. A regular grid inside a fixed interior subcube gives constants $c_M,C_M>0$, depending only on $d$, such that \begin{align*} c_Mh^{-d}\le M\le C_Mh^{-d}. \end{align*} Let $x_j\in Q_j$ be the coordinatewise minimal vertex of $Q_j$, and define \begin{align*} \psi_j:[0,1]^d\to\mathbb R \end{align*} by \begin{align*} \psi_j(x)=h^s\psi\left(\frac{x-x_j}{h}\right). \end{align*} For each sign vector $\theta=(\theta_1,\dots,\theta_M)\in\{-1,1\}^M$, define \begin{align*} f_\theta:[0,1]^d\to\mathbb R \end{align*} by \begin{align*} f_\theta(x)=1+a\sum_{j=1}^M\theta_j\psi_j(x), \end{align*} where $a>0$ is a small constant to be fixed. We verify that these alternatives are densities in $\mathcal F$. The supports of the $\psi_j$ are disjoint, and the change of variables $u=(x-x_j)/h$ gives \begin{align*} \int_{[0,1]^d}\psi_j(x)\,d\mathcal L^d(x)=h^{s+d}\int_{[0,1]^d}\psi(u)\,d\mathcal L^d(u)=0. \end{align*} Therefore \begin{align*} \int_{[0,1]^d}f_\theta(x)\,d\mathcal L^d(x)=1. \end{align*} The separated supports ensure that cross-support Hölder quotients are bounded by the same scale as the local quotients. The Hölder scaling, with $m=\lfloor s\rfloor$ and the Hölder-Zygmund convention when $s$ is an integer, gives a constant $C_3=C_3(s,d,\psi)>0$ such that \begin{align*} \left\|\sum_{j=1}^M\theta_j\psi_j\right\|_{\mathcal H^s}\le C_3. \end{align*} The uniform bound gives a constant $C_4=C_4(\psi)>0$ such that \begin{align*} \left|a\sum_{j=1}^M\theta_j\psi_j(x)\right|\le aC_4h^s\le aC_4. \end{align*} Choose \begin{align*} a\le \min\left\{\frac{\rho}{C_3},\frac{1-b}{C_4},\frac{B-1}{C_4}\right\}. \end{align*} Then the perturbation $g_\theta=a\sum_j\theta_j\psi_j$ has $\|g_\theta\|_{\mathcal H^s([0,1]^d)}\le\rho$, so the interior-radius hypothesis gives $f_\theta=1+g_\theta\in\mathcal H^s(L;[0,1]^d)$. The strict inequalities $b<1<B$ make the remaining two bounds positive, so $b\le f_\theta\le B$, and $f_\theta$ integrates to $1$. Therefore $f_\theta\in\mathcal F$ for every $\theta\in\{-1,1\}^M$. Now compare adjacent vertices. Let $\theta,\theta'\in\{-1,1\}^M$ differ only in coordinate $j$. Then $f_\theta-f_{\theta'}=2a\theta_j\psi_j$ on $Q_j$ and is zero off $Q_j$. Hence \begin{align*} \int_{[0,1]^d}|f_\theta(x)-f_{\theta'}(x)|^2\,d\mathcal L^d(x)=4a^2h^{2s+d}\int_{[0,1]^d}|\psi(u)|^2\,d\mathcal L^d(u). \end{align*} This quantity is the coordinatewise squared-loss separation \begin{align*} \delta^2=4a^2h^{2s+d}\int_{[0,1]^d}|\psi(u)|^2\,d\mathcal L^d(u). \end{align*} We also need adjacent statistical experiments to remain close. Let $\mathbb P_{\theta,n}$ denote the $n$-fold product law generated by $f_\theta$. For probability measures $P$ and $Q$ dominated by a measure $\mu$, define \begin{align*} H^2(P,Q)=\int \left(\sqrt{\frac{dP}{d\mu}}-\sqrt{\frac{dQ}{d\mu}}\right)^2\,d\mu. \end{align*} Since every $f_\theta$ is bounded below by $b$, the pointwise inequality \begin{align*} (\sqrt{u}-\sqrt{v})^2\le \frac{(u-v)^2}{b} \end{align*} for $u,v\ge b$ gives a one-sample Hellinger bound bounded by a constant multiple of $a^2h^{2s+d}$. [Tensorization of Hellinger affinity](/theorems/5908) for independent samples then gives a constant $C_5=C_5(b,\psi)>0$ such that \begin{align*} H^2(\mathbb P_{\theta,n},\mathbb P_{\theta',n})\le C_5na^2h^{2s+d}. \end{align*} Choose \begin{align*} h_n=n^{-1/(2s+d)}. \end{align*} After reducing $a$ if necessary, this gives \begin{align*} H^2(\mathbb P_{\theta,n},\mathbb P_{\theta',n})\le 2-\eta \end{align*} for some $\eta>0$ and every adjacent pair. Since the Hellinger affinity equals $1-H^2/2$ under this normalization, every adjacent affinity is at least $\eta/2$. We apply the finite-hypercube form of Assouad's lemma: if the coordinatewise squared separation is at least $\delta^2$ and adjacent affinities are at least $\kappa>0$, then \begin{align*} \inf_{\tilde f_n}\sup_\theta\mathbb E_\theta\left[\int_{[0,1]^d}|\tilde f_n(x)-f_\theta(x)|^2\,d\mathcal L^d(x)\right]\ge \frac{\kappa}{4}M\delta^2. \end{align*} For completeness, this form follows by assigning to any estimator $\tilde f_n$ a sign estimate for each coordinate according to which of the two adjacent alternatives is closer on the block $Q_j$. On the event that the $j$th sign is misclassified, the disjoint-block $L^2$ loss is at least $\delta^2/4$. The two-point testing lower bound gives misclassification probability at least one half of the Hellinger affinity of the adjacent laws, hence at least $\kappa/2$. Summing the block losses over disjoint $Q_j$ gives the displayed constant after weakening constants. Here $\kappa=\eta/2$, and the hypercube lies inside $\mathcal F$, so \begin{align*} R_n(\mathcal F)\ge c_1M h_n^{2s+d} \end{align*} for a constant $c_1=c_1(s,d,L,b,B,\psi)>0$. Finally $M\asymp h_n^{-d}$, so \begin{align*} R_n(\mathcal F)\ge c h_n^{2s}=c n^{-2s/(2s+d)}. \end{align*} [/guided] [/step] [step:Combine the two estimates to obtain the minimax equivalence] The upper bound gives $R_n(\mathcal F)\le Cn^{-2s/(2s+d)}$, while the Assouad construction gives $R_n(\mathcal F)\ge cn^{-2s/(2s+d)}$. Therefore \begin{align*} c n^{-2s/(2s+d)}\le R_n(\mathcal F)\le C n^{-2s/(2s+d)}. \end{align*} The constants $c,C>0$ depend only on $s,d,L,b,B,\rho$ and the fixed Hölder norm convention, and not on $n$. This is exactly \begin{align*} R_n(\mathcal F)\asymp n^{-2s/(2s+d)}. \end{align*} [/step]

Prerequisites (0/6 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Definitions & Concepts

Explore Further

Event Definition Expectation Definition Orthogonality Definition Variance Definition Assouad's Lemma for Coordinatewise Support Recovery Theorem #5910 Assouad's Lemma Theorem #5906 Bias–Variance Decomposition for Expected Prediction Error Probability & Statistics Martingale Difference Bound under the Bounded Differences Condition Probability & Statistics Marchenko Pastur Edge for White Noise Probability & Statistics Tower Property of Conditional Expectation Conditional Expectation Marchenko-Pastur Upper Edge for Gaussian Null Sample Covariance Matrices Probability & Statistics Preservation of Independence Probability Theory Minimax Euclidean Risk for Sparse Linear Regression with Gaussian Design Probability & Statistics Primal-Dual Witness Theorem for Lasso Sign Consistency Probability & Statistics Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.