Convex Concentration from Talagrand's Convex Distance Inequality

Convex Concentration from Talagrand's Convex Distance Inequality (Theorem # 6815)

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

Proof

[proofplan] We apply Talagrand's convex distance inequality to the median sublevel set $A=\{x\in K:F(x)\le m_F\}$. Convexity makes $A$ a closed convex subset of the product space, and the median condition gives $\mathbb P(X\in A)\ge 1/2$. The main geometric point is that if $F(x)\ge m_F+r$, then the convex distance from $x$ to $A$ is at least $r/L$: the Euclidean projection of $x$ onto $A$ supplies a separating unit vector, and the coordinate intervals having length at most $1$ converts that separation into Talagrand's weighted Hamming distance. Talagrand's inequality then gives the stated Gaussian upper-tail bound, and the case of interval length at most $D$ follows by scaling the coordinates. [/proofplan] [step:Define the median sublevel set and Talagrand convex distance] Let \begin{align*} A:=\{x\in K:F(x)\le m_F\}. \end{align*} Since $F:K\to\mathbb R$ is convex, the set $A$ is convex. Since $F$ is Lipschitz on $K$, it is continuous on $K$, so $A$ is closed in the compact set $K$. Since $m_F$ is a median of $F(X)$, \begin{align*} \mathbb P(X\in A)=\mathbb P(F(X)\le m_F)\ge \frac{1}{2}. \end{align*} Let $\mathcal P(K)$ denote the power set of $K$. Define the Talagrand convex distance map used in [Talagrand's Convex Distance Inequality](/theorems/1182) as the map $d_T:K\times\mathcal P(K)\to[0,\infty]$ given by \begin{align*} d_T(x,B):=\sup_{\alpha\in[0,\infty)^n,\ |\alpha|=1}\ \inf_{y\in B}\sum_{i=1}^{n}\alpha_i\mathbb 1_{\{x_i\ne y_i\}}. \end{align*} Here $B\subset K$ is arbitrary, $|\alpha|$ denotes the Euclidean norm of $\alpha\in\mathbb R^n$, and $\mathbb 1_{\{x_i\ne y_i\}}$ is the indicator of the event that the $i$th coordinates of $x$ and $y$ differ. [/step] [step:Show that a large value of $F$ forces large convex distance from the sublevel set] Since the theorem statement assumes $L>0$, the quotient $r/L$ is well-defined. We prove \begin{align*} \{x\in K:F(x)-m_F\ge r\}\subseteq \{x\in K:d_T(x,A)\ge r/L\}. \end{align*} Fix $x\in K$ with $F(x)-m_F\ge r$. If $r=0$, then $d_T(x,A)\ge 0=r/L$, so assume $r>0$. Then $x\notin A$. Because $A$ is a nonempty compact convex subset of $\mathbb R^n$, choose $p\in A$ such that \begin{align*} |x-p|=\inf_{y\in A}|x-y|. \end{align*} Define $u\in\mathbb R^n$ by \begin{align*} u:=\frac{x-p}{|x-p|}. \end{align*} For every $y\in A$ and every $t\in[0,1]$, convexity gives $p+t(y-p)\in A$. The function $\varphi_y:[0,1]\to\mathbb R$ defined by $\varphi_y(t):=|x-p-t(y-p)|^2$ has a minimum at $t=0$, so its right derivative at $0$ is nonnegative. Hence $(x-p)\cdot(y-p)\le 0$, and therefore \begin{align*} u\cdot(x-y)=\frac{|x-p|^2-(x-p)\cdot(y-p)}{|x-p|}\ge |x-p| \end{align*} for every $y\in A$. Since $F$ is $L$-Lipschitz and $F(p)\le m_F$, \begin{align*} r\le F(x)-m_F\le F(x)-F(p)\le L|x-p|. \end{align*} Hence $|x-p|\ge r/L$. Define $\alpha\in[0,\infty)^n$ by $\alpha_i:=|u_i|$ for $i\in\{1,\dots,n\}$. Since $|u|=1$, we have $|\alpha|=1$. For every $y\in A$, using that each interval $K_i$ has length at most $1$, \begin{align*} \sum_{i=1}^{n}\alpha_i\mathbb 1_{\{x_i\ne y_i\}} = \sum_{i=1}^{n}|u_i|\mathbb 1_{\{x_i\ne y_i\}} \ge \sum_{i=1}^{n}|u_i||x_i-y_i| \ge u\cdot(x-y). \end{align*} Therefore, for every $y\in A$, \begin{align*} \sum_{i=1}^{n}\alpha_i\mathbb 1_{\{x_i\ne y_i\}}\ge \frac{r}{L}. \end{align*} Taking the infimum over $y\in A$ and then the supremum over all admissible weights in the definition of $d_T$ gives \begin{align*} d_T(x,A)\ge \frac{r}{L}. \end{align*} [guided] The goal is to turn the analytic statement $F(x)\ge m_F+r$ into a combinatorial-distance statement suitable for Talagrand's inequality. Fix $x\in K$ with $F(x)-m_F\ge r$. If $r=0$, the conclusion $d_T(x,A)\ge 0$ follows from the definition of $d_T$, so assume $r>0$. Then $F(x)>m_F$, while every point of $A$ has $F$-value at most $m_F$, so $x\notin A$. The set $A$ is nonempty because $\mathbb P(X\in A)\ge 1/2$. It is compact and convex, so there is a point $p\in A$ minimizing Euclidean distance to $x$: \begin{align*} |x-p|=\inf_{y\in A}|x-y|. \end{align*} Define the unit vector \begin{align*} u:=\frac{x-p}{|x-p|}. \end{align*} For any $y\in A$ and any $t\in[0,1]$, convexity of $A$ gives $p+t(y-p)\in A$. Since $p$ minimizes the squared distance from $x$ to $A$, define the function $\varphi:[0,1]\to\mathbb R$ by \begin{align*} \varphi(t):=|x-p-t(y-p)|^2. \end{align*} This function has its minimum at $t=0$. Thus its right derivative at $0$ is nonnegative: \begin{align*} 0\le \varphi'(0)=-2(x-p)\cdot(y-p). \end{align*} Hence \begin{align*} (x-p)\cdot(y-p)\le 0. \end{align*} Rearranging gives \begin{align*} u\cdot(x-y)=\frac{(x-p)\cdot(x-y)}{|x-p|} = \frac{|x-p|^2-(x-p)\cdot(y-p)}{|x-p|} \ge |x-p|. \end{align*} Now the Lipschitz hypothesis converts the increase in $F$ into a lower bound on $|x-p|$. Since $p\in A$, we have $F(p)\le m_F$. Therefore \begin{align*} r\le F(x)-m_F\le F(x)-F(p)\le L|x-p|. \end{align*} Thus \begin{align*} |x-p|\ge \frac{r}{L}. \end{align*} It remains to connect this Euclidean separation to Talagrand's convex distance. Define the weight vector $\alpha\in[0,\infty)^n$ by \begin{align*} \alpha_i:=|u_i| \end{align*} for $i\in\{1,\dots,n\}$. Since $u$ is a Euclidean unit vector, \begin{align*} |\alpha|^2=\sum_{i=1}^{n}|u_i|^2=|u|^2=1. \end{align*} For any $y\in A$, the condition that $K_i$ has length at most $1$ implies $|x_i-y_i|\le 1$ for each coordinate. Hence \begin{align*} |x_i-y_i|\le \mathbb 1_{\{x_i\ne y_i\}} \end{align*} for every $i\in\{1,\dots,n\}$. Therefore \begin{align*} \sum_{i=1}^{n}\alpha_i\mathbb 1_{\{x_i\ne y_i\}} = \sum_{i=1}^{n}|u_i|\mathbb 1_{\{x_i\ne y_i\}} \ge \sum_{i=1}^{n}|u_i||x_i-y_i|. \end{align*} By the triangle inequality in each coordinate, \begin{align*} \sum_{i=1}^{n}|u_i||x_i-y_i| \ge \sum_{i=1}^{n}u_i(x_i-y_i) = u\cdot(x-y). \end{align*} Combining this with the projection inequality and the Lipschitz lower bound gives \begin{align*} \sum_{i=1}^{n}\alpha_i\mathbb 1_{\{x_i\ne y_i\}}\ge \frac{r}{L} \end{align*} for every $y\in A$. Taking the infimum over $y\in A$ gives at least $r/L$ for this particular admissible weight vector $\alpha$, and taking the supremum over all admissible weights can only increase the value. Hence \begin{align*} d_T(x,A)\ge \frac{r}{L}. \end{align*} [/guided] [/step] [step:Apply Talagrand's convex distance inequality to the median sublevel set] We use [Talagrand's Convex Distance Inequality](/theorems/1182) for product probability spaces in the following form: for every measurable set $B\subset K$ and every $s\ge 0$, \begin{align*} \mathbb P(X\in B)\,\mathbb P(d_T(X,B)\ge s)\le \exp\left(-\frac{s^2}{4}\right). \end{align*} Here $B$ denotes an arbitrary measurable subset of $K$, $s$ denotes a nonnegative real parameter, and $d_T$ is the convex distance defined above. Apply this inequality with $B=A$ and $s=r/L$. The set $A$ is Borel measurable because it is closed in $K$, and the coordinates of $X$ are independent by assumption, so the product-space hypothesis is satisfied. Since $\mathbb P(X\in A)\ge 1/2$, Talagrand's inequality gives \begin{align*} \mathbb P\left(d_T(X,A)\ge \frac{r}{L}\right)\le 2\exp\left(-\frac{r^2}{4L^2}\right). \end{align*} By the inclusion proved in the preceding step, \begin{align*} \mathbb P(F(X)-m_F\ge r)\le \mathbb P\left(d_T(X,A)\ge \frac{r}{L}\right). \end{align*} For $r=0$, the desired estimate holds because probabilities are at most $1$. For $r>0$, the event $\{F(X)-m_F\ge r\}$ is contained in $\{F(X)>m_F\}$, and the median property gives \begin{align*} \mathbb P(F(X)>m_F)\le 1-\mathbb P(F(X)\le m_F)\le \frac{1}{2}. \end{align*} Set $a:=r^2/L^2$. If $0<a\le 8\log 2$, then \begin{align*} \mathbb P(F(X)-m_F\ge r)\le \frac{1}{2}\le \exp\left(-\frac{a}{8}\right). \end{align*} If $a\ge 8\log 2$, then the Talagrand bound yields \begin{align*} \mathbb P(F(X)-m_F\ge r)\le 2\exp\left(-\frac{a}{4}\right)\le \exp\left(-\frac{a}{8}\right). \end{align*} Thus, with the universal constant $c:=1/8$, \begin{align*} \mathbb P(F(X)-m_F\ge r)\le \exp\left(-\frac{c r^2}{L^2}\right). \end{align*} [guided] The geometric work in the previous step produced exactly the input needed for [Talagrand's Convex Distance Inequality](/theorems/1182). We use the following form of that inequality: if $B\subset K$ is a measurable set and $s\ge 0$, then \begin{align*} \mathbb P(X\in B)\,\mathbb P(d_T(X,B)\ge s)\le \exp\left(-\frac{s^2}{4}\right). \end{align*} The symbols here are auxiliary: $B$ is an arbitrary measurable subset of the product space $K$, and $s$ is a nonnegative real number. The distance $d_T$ is the weighted Hamming convex distance already defined in the proof. We apply the inequality to the particular set $B=A$ and to the value $s=r/L$. The hypotheses are satisfied: $A$ is measurable because it is closed in $K$, and the law of $X$ is a product probability measure because the coordinates $X_1,\dots,X_n$ are independent. Since $m_F$ is a median, we already know \begin{align*} \mathbb P(X\in A)\ge \frac{1}{2}. \end{align*} Therefore Talagrand's inequality implies \begin{align*} \mathbb P\left(d_T(X,A)\ge \frac{r}{L}\right)\le 2\exp\left(-\frac{r^2}{4L^2}\right). \end{align*} Now we use the inclusion from the preceding step. That inclusion says that every point with $F(x)-m_F\ge r$ must have convex distance at least $r/L$ from $A$. Hence \begin{align*} \mathbb P(F(X)-m_F\ge r)\le \mathbb P\left(d_T(X,A)\ge \frac{r}{L}\right). \end{align*} This gives the correct Gaussian decay for large $r$, but it contains a prefactor $2$. To obtain the theorem's no-prefactor form with a universal constant, we split into small and large deviations. For $r=0$, the desired bound is immediate because the right-hand side equals $1$ and probabilities are at most $1$. Assume $r>0$ and set \begin{align*} a:=\frac{r^2}{L^2}. \end{align*} Because $r>0$, the event $\{F(X)-m_F\ge r\}$ is contained in $\{F(X)>m_F\}$. The median condition gives \begin{align*} \mathbb P(F(X)>m_F)\le 1-\mathbb P(F(X)\le m_F)\le \frac{1}{2}. \end{align*} If $0<a\le 8\log 2$, then \begin{align*} \mathbb P(F(X)-m_F\ge r)\le \frac{1}{2}\le \exp\left(-\frac{a}{8}\right). \end{align*} If $a\ge 8\log 2$, then the prefactor in the Talagrand estimate is absorbed into the exponential: \begin{align*} 2\exp\left(-\frac{a}{4}\right)\le \exp\left(-\frac{a}{8}\right). \end{align*} Combining the two cases gives \begin{align*} \mathbb P(F(X)-m_F\ge r)\le \exp\left(-\frac{r^2}{8L^2}\right). \end{align*} Thus the theorem holds with the universal constant $c:=1/8$. [/guided] [/step] [step:Reduce intervals of length at most $D$ to the unit-length case] Assume now that every $K_i$ has length at most $D$, where $D>0$. The unit-length case has been proved with the universal constant $c=1/8$, so the scaled conclusion will have the same constant after replacing $L$ by $DL$. For each $i\in\{1,\dots,n\}$, write $K_i=[a_i,b_i]$ with $b_i-a_i\le D$, and define \begin{align*} \widetilde K_i:=\left\{\frac{x_i-a_i}{D}:x_i\in K_i\right\}\subset\mathbb R. \end{align*} Then $\widetilde K_i$ is a compact interval of length at most $1$. Set \begin{align*} \widetilde K:=\widetilde K_1\times\cdots\times\widetilde K_n. \end{align*} Define the affine map $T:\widetilde K\to K$ by \begin{align*} T(z):=(a_1+Dz_1,\dots,a_n+Dz_n). \end{align*} Let $(\Omega,\mathcal F,\mathbb P)$ denote the probability space on which $X$ is defined. Define the random vector $\widetilde X:\Omega\to\widetilde K$ by \begin{align*} \widetilde X:=T^{-1}(X). \end{align*} The coordinates of $\widetilde X$ remain independent because each coordinate is obtained from the corresponding coordinate of $X$ by a measurable one-dimensional affine transformation. Define the function $\widetilde F:\widetilde K\to\mathbb R$ by \begin{align*} \widetilde F(z):=F(Tz). \end{align*} The function $\widetilde F$ is convex because it is the composition of the convex function $F$ with an affine map. For $z,w\in\widetilde K$, \begin{align*} |\widetilde F(z)-\widetilde F(w)| = |F(Tz)-F(Tw)| \le L|Tz-Tw| = DL|z-w|. \end{align*} Thus $\widetilde F$ is $DL$-Lipschitz. Since $\widetilde F(\widetilde X)=F(X)$, the same number $m_F$ is a median of $\widetilde F(\widetilde X)$. Applying the unit-length case to $\widetilde F$ and $\widetilde X$ gives \begin{align*} \mathbb P(F(X)-m_F\ge r)=\mathbb P(\widetilde F(\widetilde X)-m_F\ge r) \end{align*} \begin{align*} \mathbb P(\widetilde F(\widetilde X)-m_F\ge r)\le \exp\left(-\frac{c r^2}{D^2L^2}\right). \end{align*} This is the claimed replacement of $L$ by $DL$ with the same universal constant $c$ from the unit-length case. [/step] [step:Explain why no lower-tail estimate follows for a general convex function] The proof used convexity of $F$ to make the sublevel set $\{x\in K:F(x)\le m_F\}$ convex. A lower-tail estimate for $F$ would be an upper-tail estimate for $-F$, but the preceding argument applies to $-F$ only when $-F$ is convex on $K$. Therefore no lower-tail estimate is implied for a general convex $F$ by this theorem alone. [/step]

Prerequisites (0/2 completed)

Prerequisites Graph

Interactive dependency map showing how this theorem builds on foundational concepts

Loading dependency graph...

Theorems

Triangle Inequality For Inner Product Spaces

Definitions & Concepts

Event

Explore Further

Event Definition Triangle Inequality For Inner Product Spaces Theorem #433 Tower Property of Conditional Expectation Conditional Expectation Bandwidth-Scale Stochastic Equicontinuity for Kernel Density Processes Probability & Statistics Second-Moment Criterion for Contiguity Probability & Statistics Cook's Distance Formula for Internally Studentized Residuals Probability & Statistics Singular Value Decomposition Formula for Ridge Regression Probability & Statistics ANOVA Decomposition for Ordinary Least Squares with Intercept Probability & Statistics Matching Minimax Lower and Upper Bounds Principle Probability & Statistics Law of Total Probability Probability Theory Probability & Statistics Area

What brings you to Androma?

Start with a route through the knowledge graph.