Attributions & Verification

Track contributions and verify content correctness

Proof

[guided]The pivot of the entire representer theorem is here: the reproducing kernel structure forces the orthogonal complement $\mathcal{A}^\perp$ to consist of functions that vanish at the training inputs. Once we know this, the loss term cannot distinguish between $f$ and its projection $g$ onto $\mathcal{A}$.

Fix $f \in \mathcal{H}_\phi$ and decompose $f = g + f^\perp$ as in Step 1, with $g \in \mathcal{A}$ and $f^\perp \in \mathcal{A}^\perp$. Pick any training index $j \in \{1, \ldots, N\}$. The kernel section $k_\phi(\cdot, x^j)$ is one of the spanning vectors of $\mathcal{A} = \operatorname{Span}\{k_\phi(\cdot, x^i) : 1 \le i \le N\}$, so $k_\phi(\cdot, x^j) \in \mathcal{A}$. By the definition $\mathcal{A}^\perp = \{h \in \mathcal{H}_\phi : (h, g')_{\mathcal{H}_\phi} = 0 \text{ for all } g' \in \mathcal{A}\}$, every $h \in \mathcal{A}^\perp$ is orthogonal to every element of $\mathcal{A}$, in particular to $k_\phi(\cdot, x^j)$. Taking $h = f^\perp$:
\begin{align*}
(f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0.
\end{align*}

Now we apply the **reproducing property** of $k_\phi$: for any $h \in \mathcal{H}_\phi$ and any $z \in \widetilde{\mathcal{C}_p}$,
\begin{align*}
h(z) = (h, k_\phi(\cdot, z))_{\mathcal{H}_\phi}.
\end{align*}
This is the defining property of an RKHS and is what makes evaluation $h \mapsto h(z)$ a continuous linear functional. Apply this with $h = f^\perp$ and $z = x^j$:
\begin{align*}
f^\perp(x^j) = (f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0.
\end{align*}
The first equality is the reproducing property, the second is the orthogonality just established. This holds for every $j \in \{1, \ldots, N\}$, so $f^\perp$ vanishes on the training input set $\{x^1, \ldots, x^N\}$.

Note that $f^\perp$ is generally **not** the zero function: it can be non-zero away from the training set. The point is precisely that the reproducing kernel structure allows non-zero elements of $\mathcal{H}_\phi$ that nonetheless vanish on a specified finite set of inputs (by the reproducing property, $\mathcal{A}^\perp$ is exactly the set of elements of $\mathcal{H}_\phi$ that vanish at all training inputs).

Why is this the engine of the proof? Because the loss in the next step depends on $f$ only through the values $f(x^j)$, and we have just shown $f(x^j) = g(x^j) + f^\perp(x^j) = g(x^j) + 0 = g(x^j)$. So the loss cannot tell $f$ from $g$, even though they differ as functions in $\mathcal{H}_\phi$.[/guided]
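The vanishing of $f^\perp$ at the training inputs can be checked numerically. The following is a minimal sketch, not part of the proof, under assumptions that are not in the source: a one-dimensional Gaussian kernel stands in for $k_\phi$, and the names `gaussian_kernel`, `project_onto_span`, `train_x`, `centers` and `coeffs` are hypothetical. It takes $f$ to be a finite combination of kernel sections, some centred off the training set, computes the projection $g$ by solving $K\beta = (f(x^j))_j$, and evaluates $f^\perp = f - g$ on and off the training inputs.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Matrix of k(a_i, b_j) = exp(-gamma * (a_i - b_j)^2) for 1-D inputs.
    Assumption: this kernel is only a stand-in for k_phi."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def project_onto_span(train_x, centers, coeffs, gamma=1.0):
    """Coefficients beta of g = sum_i beta_i k(., x^i), the orthogonal
    projection of f = sum_m coeffs_m k(., centers_m) onto A.

    Orthogonality of f - g to every k(., x^j) is, by the reproducing
    property, the same as g(x^j) = f(x^j), i.e. K beta = f(train_x)."""
    K = gaussian_kernel(train_x, train_x, gamma)
    f_at_train = gaussian_kernel(train_x, centers, gamma) @ coeffs
    return np.linalg.solve(K, f_at_train)

# Hypothetical data: three training inputs, and an f that also uses
# kernel sections centred at 0.5 and 2.0, which are not training inputs.
train_x = np.array([-1.0, 0.0, 1.0])
centers = np.array([-1.0, 0.0, 1.0, 0.5, 2.0])
coeffs = np.array([0.3, -1.2, 0.7, 1.5, -0.4])

beta = project_onto_span(train_x, centers, coeffs)

def f(z):
    return gaussian_kernel(z, centers) @ coeffs

def g(z):
    return gaussian_kernel(z, train_x) @ beta

# f_perp = f - g, evaluated on the training inputs and at one other point.
perp_at_train = f(train_x) - g(train_x)
perp_off_train = f(np.array([0.5])) - g(np.array([0.5]))

print("f_perp at training inputs:", perp_at_train)
print("f_perp at z = 0.5:", perp_off_train)
```

In this sketch the first printed vector should be zero up to floating-point error, while the value at $z = 0.5$ is not, matching the remark that $f^\perp$ vanishes on the training set without being the zero function.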

[guided]We assemble the two pieces. The loss equality $\mathcal{L}(f) = \mathcal{L}(g)$ from Step 3 says: going from $f$ to its $\mathcal{A}$-component $g$ does not change the loss. Pythagoras from Step 4, $\|f\|_{\mathcal{H}_\phi}^2 = \|g\|_{\mathcal{H}_\phi}^2 + \|f^\perp\|_{\mathcal{H}_\phi}^2$, says: going from $f$ to $g$ does not increase the squared norm, and strictly decreases it if $f^\perp \neq 0$. Adding $\lambda$ times the resulting inequality $\|g\|_{\mathcal{H}_\phi}^2 \leq \|f\|_{\mathcal{H}_\phi}^2$ to the loss equality:
\begin{align*}
\mathcal{L}(g) + \lambda\,\|g\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f) + \lambda\,\|f\|_{\mathcal{H}_\phi}^2.
\end{align*}
This says the $\mathcal{A}$-component $g$ achieves a regularised objective no worse than $f$. The inequality is strict whenever $\|f^\perp\|_{\mathcal{H}_\phi} > 0$, **because $\lambda > 0$**: this is where the regularisation hypothesis is consumed. With $\lambda = 0$ there would be no penalty for keeping a non-zero $f^\perp$, so adding any element of $\mathcal{A}^\perp$ to a minimiser would give another minimiser, and the conclusion that every minimiser lies in $\mathcal{A}$ would fail.

Now suppose $f^*$ is an actual minimiser. Decompose $f^* = g^* + (f^*)^\perp$. The displayed inequality, applied to $f = f^*$, gives
\begin{align*}
\mathcal{L}(g^*) + \lambda\,\|g^*\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f^*) + \lambda\,\|f^*\|_{\mathcal{H}_\phi}^2.
\end{align*}
But $f^*$ is a minimiser and $g^* \in \mathcal{H}_\phi$ is a candidate, so the reverse inequality $\mathcal{L}(g^*) + \lambda\,\|g^*\|_{\mathcal{H}_\phi}^2 \geq \mathcal{L}(f^*) + \lambda\,\|f^*\|_{\mathcal{H}_\phi}^2$ holds too. Hence equality, which by the strict-inequality clause forces $\|(f^*)^\perp\|_{\mathcal{H}_\phi}^2 = 0$, so $(f^*)^\perp = 0$ and $f^* = g^* \in \mathcal{A}$. The minimiser lies in the span of the kernel sections at the training inputs, which is the conclusion of the Representer Theorem.

What does $f^* \in \mathcal{A}$ buy us in practice? Since $\mathcal{A} = \operatorname{Span}\{k_\phi(\cdot, x^i)\}_{i=1}^N$, there exist coefficients $\alpha_1, \ldots, \alpha_N \in \mathbb{R}$ with $f^*(\cdot) = \sum_{i=1}^N \alpha_i\,k_\phi(\cdot, x^i)$. The infinite-dimensional optimisation problem reduces to optimising over $N$ real coefficients, which is the computational basis of kernel methods.[/guided]
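To make the reduction to $N$ coefficients concrete, here is a minimal sketch under assumptions that go beyond the source: the loss is taken to be the squared loss $\mathcal{L}(f) = \sum_j (f(x^j) - y^j)^2$ (the proof itself does not fix a loss), a one-dimensional Gaussian kernel stands in for $k_\phi$, and all function and variable names are hypothetical. For $f = \sum_i \alpha_i k_\phi(\cdot, x^i)$ the regularised objective becomes $\|K\alpha - y\|^2 + \lambda\,\alpha^\top K \alpha$, with $K_{ij} = k_\phi(x^i, x^j)$, and setting its gradient $2K\big((K + \lambda I)\alpha - y\big)$ to zero gives $\alpha = (K + \lambda I)^{-1} y$.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Matrix of k(a_i, b_j) = exp(-gamma * (a_i - b_j)^2) for 1-D inputs.
    Assumption: this kernel is only a stand-in for k_phi."""
    return np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)

def fit_coefficients(train_x, train_y, lam, gamma=1.0):
    """Solve (K + lam * I) alpha = y, the stationarity condition of the
    squared-loss objective restricted to the span A."""
    K = gaussian_kernel(train_x, train_x, gamma)
    return np.linalg.solve(K + lam * np.eye(len(train_x)), train_y)

def objective(alpha, train_x, train_y, lam, gamma=1.0):
    """Regularised objective for f = sum_i alpha_i k(., x^i):
    squared loss plus lam * ||f||^2, where ||f||^2 = alpha^T K alpha."""
    K = gaussian_kernel(train_x, train_x, gamma)
    residual = K @ alpha - train_y
    return residual @ residual + lam * (alpha @ K @ alpha)

def predict(alpha, train_x, z, gamma=1.0):
    """Evaluate f(z) = sum_i alpha_i k(z, x^i) at new inputs z."""
    return gaussian_kernel(z, train_x, gamma) @ alpha

# Hypothetical training data and regularisation strength.
train_x = np.array([-1.0, 0.0, 1.0, 2.0])
train_y = np.array([0.5, -0.3, 0.8, 0.1])
lam = 0.1

alpha = fit_coefficients(train_x, train_y, lam)
print("alpha:", alpha)
print("objective at alpha:", objective(alpha, train_x, train_y, lam))
print("f*(0.5):", predict(alpha, train_x, np.array([0.5])))
```

The only unknowns are the $N$ entries of `alpha`; the function space $\mathcal{H}_\phi$ never has to be represented explicitly. Other losses would lead to a different finite-dimensional problem in the same coefficients.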

Verification Progress

11 Total Blocks

0 Verified

0% verified

Contributors

Unknown 11 blocks (0 verified)

Who Can Verify

Areas: Analysis
Subareas: Functional Analysis

Viktor Miykov Admin

Max Vassiliev Global Reviewer

Horia Neagu Global Reviewer

강현욱 Global Reviewer

Demo Testing Global Reviewer

Archie Pennycook Global Reviewer
