Representer Theorem for Signature Kernels — Statement & Proof

Theorem

Edit Issues Pull Requests Attributions Admin

Discussion

No discussion available for this theorem.

Proof

[proofplan] The key fact is that $\mathcal{H}_\phi$ is a reproducing kernel Hilbert space and $\mathcal{A}$ is a finite-dimensional, hence closed, subspace; therefore $\mathcal{H}_\phi = \mathcal{A} \oplus \mathcal{A}^\perp$ as an orthogonal direct sum. Decompose any $f$ as $f = g + f^\perp$ with $g \in \mathcal{A}$ and $f^\perp \in \mathcal{A}^\perp$. The reproducing property of $k_\phi$ shows that $f^\perp$ vanishes at every training input $x^j$, so $\mathcal{L}(f) = \mathcal{L}(g)$ — the loss only sees the $\mathcal{A}$-component. Pythagoras gives $\|f\|^2 = \|g\|^2 + \|f^\perp\|^2 \geq \|g\|^2$ with equality iff $f^\perp = 0$. Combining, the regularised objective is no larger at $g$ than at $f$, and is strictly smaller (since $\lambda > 0$) unless $f^\perp = 0$. Hence any minimiser must lie in $\mathcal{A}$. [/proofplan] [step:Set up the orthogonal decomposition $\mathcal{H}_\phi = \mathcal{A} \oplus \mathcal{A}^\perp$] The space $\mathcal{H}_\phi$ is a reproducing kernel Hilbert space with reproducing kernel $k_\phi : \widetilde{\mathcal{C}_p} \times \widetilde{\mathcal{C}_p} \to \mathbb{R}$, $k_\phi(x, x') = (k_\phi(\cdot, x), k_\phi(\cdot, x'))_{\mathcal{H}_\phi}$, and reproducing property $f(x) = (f, k_\phi(\cdot, x))_{\mathcal{H}_\phi}$ for all $f \in \mathcal{H}_\phi$, $x \in \widetilde{\mathcal{C}_p}$. The hypothesis that $|\phi|$ satisfies the summability condition for $\mathcal{S} \subset T_\phi((V))$ guarantees that $k_\phi$ is a well-defined positive-definite kernel and that the associated RKHS exists. The set $\mathcal{A} := \operatorname{Span}\{ k_\phi(\cdot, x^i) : i = 1, \ldots, N \} \subseteq \mathcal{H}_\phi$ is the linear span of finitely many vectors in a Hilbert space, hence a finite-dimensional subspace of dimension at most $N$. Every finite-dimensional subspace of a Hilbert space is closed. By the Hilbert space orthogonal decomposition theorem applied to the closed subspace $\mathcal{A} \subseteq \mathcal{H}_\phi$, \begin{align*} \mathcal{H}_\phi = \mathcal{A} \oplus \mathcal{A}^\perp, \end{align*} where every $f \in \mathcal{H}_\phi$ admits a unique decomposition $f = g + f^\perp$ with $g \in \mathcal{A}$, $f^\perp \in \mathcal{A}^\perp$, and $(g, f^\perp)_{\mathcal{H}_\phi} = 0$. [guided] We need to set up the central machinery: writing every element of $\mathcal{H}_\phi$ as the sum of an $\mathcal{A}$-part and an $\mathcal{A}^\perp$-part. The orthogonal decomposition theorem requires the subspace being projected onto to be **closed**. Why is $\mathcal{A}$ closed? It is the span of the $N$ vectors $k_\phi(\cdot, x^1), \ldots, k_\phi(\cdot, x^N)$, all lying in $\mathcal{H}_\phi$. Therefore $\mathcal{A}$ is finite-dimensional (dimension at most $N$). A standard result (proved using equivalence of all norms on a finite-dimensional space and completeness of $\mathbb{R}^N$) is that every finite-dimensional subspace of a normed space is automatically closed — Cauchy sequences in $\mathcal{A}$ have limits inside $\mathcal{A}$. So $\mathcal{A}$ is a closed subspace. The Hilbert space orthogonal decomposition theorem then gives \begin{align*} \mathcal{H}_\phi = \mathcal{A} \oplus \mathcal{A}^\perp, \end{align*} where $\mathcal{A}^\perp = \{ h \in \mathcal{H}_\phi : (h, g)_{\mathcal{H}_\phi} = 0 \text{ for all } g \in \mathcal{A} \}$. Any $f \in \mathcal{H}_\phi$ has a **unique** representation $f = g + f^\perp$ with $g \in \mathcal{A}$, $f^\perp \in \mathcal{A}^\perp$, and these two components are mutually orthogonal: $(g, f^\perp)_{\mathcal{H}_\phi} = 0$. Why does $\mathcal{H}_\phi$ exist as an RKHS in the first place? The hypothesis on $|\phi|$ — that it satisfies the summability condition guaranteeing $\mathcal{S} \subset T_\phi((V))$ — ensures that $k_\phi(x, x') = \sum_{i=0}^\infty \phi(i)\,\langle S(x)^i, S(x')^i\rangle_{V^{\otimes i}}$ converges and is positive definite. The Moore-Aronszajn theorem then constructs $\mathcal{H}_\phi$ uniquely from $k_\phi$ with the reproducing property $f(x) = (f, k_\phi(\cdot, x))_{\mathcal{H}_\phi}$. [/guided] [/step] [step:Show that $f^\perp$ vanishes at every training input via the reproducing property] Fix any $f \in \mathcal{H}_\phi$ and write $f = g + f^\perp$ as above. For each training index $j \in \{1, \ldots, N\}$, the kernel section $k_\phi(\cdot, x^j)$ lies in $\mathcal{A}$ by definition of $\mathcal{A}$. Hence by definition of $\mathcal{A}^\perp$, \begin{align*} (f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0. \end{align*} By the reproducing property applied to $f^\perp$, \begin{align*} f^\perp(x^j) = (f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0. \end{align*} This holds for every $j = 1, \ldots, N$, so $f^\perp$ vanishes on the entire training input set. [guided] The pivot of the entire representer theorem is here: the reproducing kernel structure forces the orthogonal complement $\mathcal{A}^\perp$ to consist of functions that vanish at the training inputs. Once we know this, the loss term cannot distinguish between $f$ and its projection $g$ onto $\mathcal{A}$. Fix $f \in \mathcal{H}_\phi$ and decompose $f = g + f^\perp$ as in Step 1, with $g \in \mathcal{A}$ and $f^\perp \in \mathcal{A}^\perp$. Pick any training index $j \in \{1, \ldots, N\}$. The kernel section $k_\phi(\cdot, x^j)$ is one of the spanning vectors of $\mathcal{A} = \operatorname{Span}\{k_\phi(\cdot, x^i) : 1 \le i \le N\}$, so $k_\phi(\cdot, x^j) \in \mathcal{A}$. By the very definition of $\mathcal{A}^\perp = \{h \in \mathcal{H}_\phi : (h, g')_{\mathcal{H}_\phi} = 0 \text{ for all } g' \in \mathcal{A}\}$, every $h \in \mathcal{A}^\perp$ is orthogonal to every element of $\mathcal{A}$, in particular to $k_\phi(\cdot, x^j)$: \begin{align*} (f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0. \end{align*} Now we apply the **reproducing property** of $k_\phi$: for any $h \in \mathcal{H}_\phi$ and any $z \in \widetilde{\mathcal{C}_p}$, \begin{align*} h(z) = (h, k_\phi(\cdot, z))_{\mathcal{H}_\phi}. \end{align*} This is the defining property of an RKHS and is what makes evaluation $h \mapsto h(z)$ a continuous linear functional. Apply this with $h = f^\perp$ and $z = x^j$: \begin{align*} f^\perp(x^j) = (f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0. \end{align*} The first equality is reproducing, the second is the orthogonality just established. This holds for every $j \in \{1, \ldots, N\}$, so $f^\perp$ vanishes on the training input set $\{x^1, \ldots, x^N\}$. Note that $f^\perp$ is generally **not** the zero function — it can be non-zero off the training set. The point is precisely that the reproducing kernel structure allows non-zero elements of $\mathcal{H}_\phi$ that nonetheless vanish at any specified finite set of inputs (the $\mathcal{A}^\perp$ space is exactly such a space). Why is this the engine of the proof? Because the loss in the next step depends on $f$ only through the values $f(x^j)$, and we have just shown $f(x^j) = g(x^j) + f^\perp(x^j) = g(x^j) + 0 = g(x^j)$. So the loss cannot tell $f$ from $g$, even though they differ as functions in $\mathcal{H}_\phi$. [/guided] [/step] [step:Deduce that $\mathcal{L}(f) = \mathcal{L}(g)$] By the previous step, for every $j = 1, \ldots, N$, \begin{align*} f(x^j) = g(x^j) + f^\perp(x^j) = g(x^j) + 0 = g(x^j). \end{align*} The functional $\mathcal{L}$ depends on $f$ only through the values $f(x^1), \ldots, f(x^N)$ at the training inputs: \begin{align*} \mathcal{L}(f) = \ell\!\left((x^1, y^1, f(x^1)), \ldots, (x^N, y^N, f(x^N))\right) = \ell\!\left((x^1, y^1, g(x^1)), \ldots, (x^N, y^N, g(x^N))\right) = \mathcal{L}(g). \end{align*} The loss is unaffected by the $\mathcal{A}^\perp$-component of $f$. [guided] The crucial observation is that the loss functional $\mathcal{L}$ is a function of $f$ that depends on $f$ only through its values at the $N$ training points. Why? Look at its definition: \begin{align*} \mathcal{L}(f) = \ell\!\left((x^1, y^1, f(x^1)), \ldots, (x^N, y^N, f(x^N))\right). \end{align*} The third slot in each tuple is $f(x^j)$. The arbitrary loss $\ell$ never sees $f$ itself — only the $N$ scalars $f(x^1), \ldots, f(x^N)$. So if two functions $f, g \in \mathcal{H}_\phi$ agree at all $N$ training inputs, they have the same loss. In our decomposition, $f(x^j) = g(x^j) + f^\perp(x^j)$, and we showed in the previous step that $f^\perp(x^j) = 0$ for all $j$. Therefore $f(x^j) = g(x^j)$ for $j = 1, \ldots, N$, which gives $\mathcal{L}(f) = \mathcal{L}(g)$. This is the half of the argument that controls the loss term. The orthogonal complement $f^\perp$ is invisible to the loss because the loss only probes $f$ at the training points. The next step controls the regulariser. [/guided] [/step] [step:Apply Pythagoras to compare the regulariser at $f$ and $g$] Since $g \in \mathcal{A}$ and $f^\perp \in \mathcal{A}^\perp$ are orthogonal in $\mathcal{H}_\phi$, the Pythagorean theorem in a Hilbert space gives \begin{align*} \|f\|_{\mathcal{H}_\phi}^2 = \|g + f^\perp\|_{\mathcal{H}_\phi}^2 = \|g\|_{\mathcal{H}_\phi}^2 + \|f^\perp\|_{\mathcal{H}_\phi}^2 \geq \|g\|_{\mathcal{H}_\phi}^2, \end{align*} with equality if and only if $\|f^\perp\|_{\mathcal{H}_\phi} = 0$, i.e. $f^\perp = 0$. [guided] Step 3 controlled the loss. We now control the regulariser $\|f\|_{\mathcal{H}_\phi}^2$, and the result is the opposite direction: replacing $f$ by $g$ can only **decrease** the squared norm. By Step 1 the decomposition $f = g + f^\perp$ is orthogonal in $\mathcal{H}_\phi$: $(g, f^\perp)_{\mathcal{H}_\phi} = 0$. Expanding the squared norm using the inner product (this is just the polarisation identity for the orthogonal sum): \begin{align*} \|f\|_{\mathcal{H}_\phi}^2 &= \|g + f^\perp\|_{\mathcal{H}_\phi}^2 = (g + f^\perp, g + f^\perp)_{\mathcal{H}_\phi} \\ &= (g, g)_{\mathcal{H}_\phi} + (g, f^\perp)_{\mathcal{H}_\phi} + (f^\perp, g)_{\mathcal{H}_\phi} + (f^\perp, f^\perp)_{\mathcal{H}_\phi} \\ &= \|g\|_{\mathcal{H}_\phi}^2 + 0 + 0 + \|f^\perp\|_{\mathcal{H}_\phi}^2 = \|g\|_{\mathcal{H}_\phi}^2 + \|f^\perp\|_{\mathcal{H}_\phi}^2, \end{align*} where the cross terms vanish by orthogonality. This is the **Pythagorean theorem** in a Hilbert space. Since $\|f^\perp\|_{\mathcal{H}_\phi}^2 \ge 0$ (norms are non-negative), \begin{align*} \|f\|_{\mathcal{H}_\phi}^2 = \|g\|_{\mathcal{H}_\phi}^2 + \|f^\perp\|_{\mathcal{H}_\phi}^2 \ge \|g\|_{\mathcal{H}_\phi}^2, \end{align*} with equality iff $\|f^\perp\|_{\mathcal{H}_\phi}^2 = 0$, equivalently $f^\perp = 0$ (a normed space has $\|h\| = 0 \iff h = 0$). Geometrically: $g$ is the orthogonal projection of $f$ onto $\mathcal{A}$, and the projection of any vector onto a closed subspace has norm no larger than the original — a basic fact about projections in Hilbert spaces. The amount by which the norm shrinks is exactly $\|f^\perp\|^2$, the squared distance from $f$ to $\mathcal{A}$. This is the **strict inequality** half of the argument. Combined with Step 3's loss equality, it will give a strict decrease in the regularised objective whenever $f^\perp \neq 0$ — so a minimiser must satisfy $f^\perp = 0$. [/guided] [/step] [step:Combine to conclude the minimiser lies in $\mathcal{A}$] Combining the equality $\mathcal{L}(f) = \mathcal{L}(g)$ from Step 3 with the inequality $\|f\|_{\mathcal{H}_\phi}^2 \geq \|g\|_{\mathcal{H}_\phi}^2$ from Step 4 and using $\lambda > 0$, \begin{align*} \mathcal{L}(g) + \lambda\,\|g\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f) + \lambda\,\|f\|_{\mathcal{H}_\phi}^2, \end{align*} with equality if and only if $f^\perp = 0$, i.e. $f = g \in \mathcal{A}$. Now suppose $f^*$ is a minimiser of $\mathcal{L}(f) + \lambda \|f\|_{\mathcal{H}_\phi}^2$ over $\mathcal{H}_\phi$. Decompose $f^* = g^* + (f^*)^\perp$ with $g^* \in \mathcal{A}$, $(f^*)^\perp \in \mathcal{A}^\perp$. By the displayed inequality, \begin{align*} \mathcal{L}(g^*) + \lambda\,\|g^*\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f^*) + \lambda\,\|f^*\|_{\mathcal{H}_\phi}^2. \end{align*} Since $g^* \in \mathcal{H}_\phi$ and $f^*$ minimises the regularised objective over $\mathcal{H}_\phi$, the reverse inequality also holds, hence equality. Equality forces $\|(f^*)^\perp\|_{\mathcal{H}_\phi}^2 = 0$ (because $\lambda > 0$), so $(f^*)^\perp = 0$ and \begin{align*} f^* = g^* \in \mathcal{A}. \end{align*} This completes the proof. [guided] We assemble the two pieces. The loss-equality $\mathcal{L}(f) = \mathcal{L}(g)$ from Step 3 says: going from $f$ to its $\mathcal{A}$-component $g$ does not change the loss. Pythagoras from Step 4 says: going from $f$ to $g$ does not increase the squared norm, and strictly decreases it if $f^\perp \neq 0$. Adding $\lambda$ times Pythagoras to the loss-equality: \begin{align*} \mathcal{L}(g) + \lambda\,\|g\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f) + \lambda\,\|f\|_{\mathcal{H}_\phi}^2. \end{align*} This says the $\mathcal{A}$-component $g$ achieves a regularised objective no worse than $f$. And the inequality is strict whenever $\|f^\perp\|_{\mathcal{H}_\phi} > 0$, **because $\lambda > 0$** — this is where the regularisation hypothesis is consumed. Without $\lambda > 0$, there would be no penalty for keeping a non-zero $f^\perp$, and the representer theorem would fail (the minimiser would be non-unique modulo $\mathcal{A}^\perp$). Now suppose $f^*$ is an actual minimiser. Decompose $f^* = g^* + (f^*)^\perp$. The displayed inequality, applied to $f = f^*$, gives \begin{align*} \mathcal{L}(g^*) + \lambda\,\|g^*\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f^*) + \lambda\,\|f^*\|_{\mathcal{H}_\phi}^2. \end{align*} But $f^*$ is a minimiser, so the reverse inequality $\mathcal{L}(g^*) + \lambda\,\|g^*\|_{\mathcal{H}_\phi}^2 \geq \mathcal{L}(f^*) + \lambda\,\|f^*\|_{\mathcal{H}_\phi}^2$ holds too (with $g^* \in \mathcal{H}_\phi$ a candidate). Hence equality, which by the strict-inequality clause forces $\|(f^*)^\perp\|_{\mathcal{H}_\phi}^2 = 0$, so $(f^*)^\perp = 0$ and $f^* = g^* \in \mathcal{A}$. The minimiser lies in the span of the kernel sections at the training inputs — the conclusion of the Representer Theorem. What does $f^* \in \mathcal{A}$ buy us in practice? Since $\mathcal{A} = \operatorname{Span}\{k_\phi(\cdot, x^i)\}_{i=1}^N$, we know there exist coefficients $\alpha_1, \ldots, \alpha_N \in \mathbb{R}$ with $f^*(\cdot) = \sum_{i=1}^N \alpha_i\,k_\phi(\cdot, x^i)$. The infinite-dimensional optimisation problem reduces to optimising over $N$ real coefficients — the basis of all kernel methods. [/guided] [/step]

What brings you to Androma?

Start with a route through the knowledge graph.

Representer Theorem for Signature Kernels (Theorem # 2529)

Discussion

Proof

Explore Further

Sign in to Androma

Check your inbox

One last step

Representer Theorem for Signature Kernels (Theorem # 2529)

Discussion

Proof

Explore Further