[proofplan]
The key fact is that $\mathcal{H}_\phi$ is a reproducing kernel Hilbert space and $\mathcal{A}$ is a finite-dimensional, hence closed, subspace; therefore $\mathcal{H}_\phi = \mathcal{A} \oplus \mathcal{A}^\perp$ as an orthogonal direct sum. Decompose any $f$ as $f = g + f^\perp$ with $g \in \mathcal{A}$ and $f^\perp \in \mathcal{A}^\perp$. The reproducing property of $k_\phi$ shows that $f^\perp$ vanishes at every training input $x^j$, so $\mathcal{L}(f) = \mathcal{L}(g)$ — the loss only sees the $\mathcal{A}$-component. Pythagoras gives $\|f\|^2 = \|g\|^2 + \|f^\perp\|^2 \geq \|g\|^2$ with equality iff $f^\perp = 0$. Combining, the regularised objective is no larger at $g$ than at $f$, and is strictly smaller (since $\lambda > 0$) unless $f^\perp = 0$. Hence any minimiser must lie in $\mathcal{A}$.
[/proofplan]
[step:Set up the orthogonal decomposition $\mathcal{H}_\phi = \mathcal{A} \oplus \mathcal{A}^\perp$]
The space $\mathcal{H}_\phi$ is a reproducing kernel Hilbert space with reproducing kernel $k_\phi : \widetilde{\mathcal{C}_p} \times \widetilde{\mathcal{C}_p} \to \mathbb{R}$, $k_\phi(x, x') = (k_\phi(\cdot, x), k_\phi(\cdot, x'))_{\mathcal{H}_\phi}$, and reproducing property $f(x) = (f, k_\phi(\cdot, x))_{\mathcal{H}_\phi}$ for all $f \in \mathcal{H}_\phi$, $x \in \widetilde{\mathcal{C}_p}$. The hypothesis that $|\phi|$ satisfies the summability condition for $\mathcal{S} \subset T_\phi((V))$ guarantees that $k_\phi$ is a well-defined positive-definite kernel and that the associated RKHS exists.
The set $\mathcal{A} := \operatorname{Span}\{ k_\phi(\cdot, x^i) : i = 1, \ldots, N \} \subseteq \mathcal{H}_\phi$ is the linear span of finitely many vectors in a Hilbert space, hence a finite-dimensional subspace of dimension at most $N$. Every finite-dimensional subspace of a Hilbert space is closed. By the Hilbert space orthogonal decomposition theorem applied to the closed subspace $\mathcal{A} \subseteq \mathcal{H}_\phi$,
\begin{align*}
\mathcal{H}_\phi = \mathcal{A} \oplus \mathcal{A}^\perp,
\end{align*}
where every $f \in \mathcal{H}_\phi$ admits a unique decomposition $f = g + f^\perp$ with $g \in \mathcal{A}$, $f^\perp \in \mathcal{A}^\perp$, and $(g, f^\perp)_{\mathcal{H}_\phi} = 0$.
[guided]
We need to set up the central machinery: writing every element of $\mathcal{H}_\phi$ as the sum of an $\mathcal{A}$-part and an $\mathcal{A}^\perp$-part. The orthogonal decomposition theorem requires the subspace being projected onto to be **closed**.
Why is $\mathcal{A}$ closed? It is the span of the $N$ vectors $k_\phi(\cdot, x^1), \ldots, k_\phi(\cdot, x^N)$, all lying in $\mathcal{H}_\phi$. Therefore $\mathcal{A}$ is finite-dimensional (dimension at most $N$). A standard result (proved using equivalence of all norms on a finite-dimensional space and completeness of $\mathbb{R}^N$) is that every finite-dimensional subspace of a normed space is automatically closed — Cauchy sequences in $\mathcal{A}$ have limits inside $\mathcal{A}$. So $\mathcal{A}$ is a closed subspace.
The Hilbert space orthogonal decomposition theorem then gives
\begin{align*}
\mathcal{H}_\phi = \mathcal{A} \oplus \mathcal{A}^\perp,
\end{align*}
where $\mathcal{A}^\perp = \{ h \in \mathcal{H}_\phi : (h, g)_{\mathcal{H}_\phi} = 0 \text{ for all } g \in \mathcal{A} \}$. Any $f \in \mathcal{H}_\phi$ has a **unique** representation $f = g + f^\perp$ with $g \in \mathcal{A}$, $f^\perp \in \mathcal{A}^\perp$, and these two components are mutually orthogonal: $(g, f^\perp)_{\mathcal{H}_\phi} = 0$.
Why does $\mathcal{H}_\phi$ exist as an RKHS in the first place? The hypothesis on $|\phi|$ — that it satisfies the summability condition guaranteeing $\mathcal{S} \subset T_\phi((V))$ — ensures that $k_\phi(x, x') = \sum_{i=0}^\infty \phi(i)\,\langle S(x)^i, S(x')^i\rangle_{V^{\otimes i}}$ converges and is positive definite. The Moore-Aronszajn theorem then constructs $\mathcal{H}_\phi$ uniquely from $k_\phi$ with the reproducing property $f(x) = (f, k_\phi(\cdot, x))_{\mathcal{H}_\phi}$.
[/guided]
[/step]
[step:Show that $f^\perp$ vanishes at every training input via the reproducing property]
Fix any $f \in \mathcal{H}_\phi$ and write $f = g + f^\perp$ as above. For each training index $j \in \{1, \ldots, N\}$, the kernel section $k_\phi(\cdot, x^j)$ lies in $\mathcal{A}$ by definition of $\mathcal{A}$. Hence by definition of $\mathcal{A}^\perp$,
\begin{align*}
(f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0.
\end{align*}
By the reproducing property applied to $f^\perp$,
\begin{align*}
f^\perp(x^j) = (f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0.
\end{align*}
This holds for every $j = 1, \ldots, N$, so $f^\perp$ vanishes on the entire training input set.
[guided]
The pivot of the entire representer theorem is here: the reproducing kernel structure forces the orthogonal complement $\mathcal{A}^\perp$ to consist of functions that vanish at the training inputs. Once we know this, the loss term cannot distinguish between $f$ and its projection $g$ onto $\mathcal{A}$.
Fix $f \in \mathcal{H}_\phi$ and decompose $f = g + f^\perp$ as in Step 1, with $g \in \mathcal{A}$ and $f^\perp \in \mathcal{A}^\perp$. Pick any training index $j \in \{1, \ldots, N\}$. The kernel section $k_\phi(\cdot, x^j)$ is one of the spanning vectors of $\mathcal{A} = \operatorname{Span}\{k_\phi(\cdot, x^i) : 1 \le i \le N\}$, so $k_\phi(\cdot, x^j) \in \mathcal{A}$.
By the very definition of $\mathcal{A}^\perp = \{h \in \mathcal{H}_\phi : (h, g')_{\mathcal{H}_\phi} = 0 \text{ for all } g' \in \mathcal{A}\}$, every $h \in \mathcal{A}^\perp$ is orthogonal to every element of $\mathcal{A}$, in particular to $k_\phi(\cdot, x^j)$:
\begin{align*}
(f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0.
\end{align*}
Now we apply the **reproducing property** of $k_\phi$: for any $h \in \mathcal{H}_\phi$ and any $z \in \widetilde{\mathcal{C}_p}$,
\begin{align*}
h(z) = (h, k_\phi(\cdot, z))_{\mathcal{H}_\phi}.
\end{align*}
This is the defining property of an RKHS and is what makes evaluation $h \mapsto h(z)$ a continuous linear functional. Apply this with $h = f^\perp$ and $z = x^j$:
\begin{align*}
f^\perp(x^j) = (f^\perp, k_\phi(\cdot, x^j))_{\mathcal{H}_\phi} = 0.
\end{align*}
The first equality is reproducing, the second is the orthogonality just established.
This holds for every $j \in \{1, \ldots, N\}$, so $f^\perp$ vanishes on the training input set $\{x^1, \ldots, x^N\}$. Note that $f^\perp$ is generally **not** the zero function — it can be non-zero off the training set. The point is precisely that the reproducing kernel structure allows non-zero elements of $\mathcal{H}_\phi$ that nonetheless vanish at any specified finite set of inputs (the $\mathcal{A}^\perp$ space is exactly such a space).
Why is this the engine of the proof? Because the loss in the next step depends on $f$ only through the values $f(x^j)$, and we have just shown $f(x^j) = g(x^j) + f^\perp(x^j) = g(x^j) + 0 = g(x^j)$. So the loss cannot tell $f$ from $g$, even though they differ as functions in $\mathcal{H}_\phi$.
[/guided]
[/step]
[step:Deduce that $\mathcal{L}(f) = \mathcal{L}(g)$]
By the previous step, for every $j = 1, \ldots, N$,
\begin{align*}
f(x^j) = g(x^j) + f^\perp(x^j) = g(x^j) + 0 = g(x^j).
\end{align*}
The functional $\mathcal{L}$ depends on $f$ only through the values $f(x^1), \ldots, f(x^N)$ at the training inputs:
\begin{align*}
\mathcal{L}(f) = \ell\!\left((x^1, y^1, f(x^1)), \ldots, (x^N, y^N, f(x^N))\right) = \ell\!\left((x^1, y^1, g(x^1)), \ldots, (x^N, y^N, g(x^N))\right) = \mathcal{L}(g).
\end{align*}
The loss is unaffected by the $\mathcal{A}^\perp$-component of $f$.
[guided]
The crucial observation is that the loss functional $\mathcal{L}$ is a function of $f$ that depends on $f$ only through its values at the $N$ training points. Why? Look at its definition:
\begin{align*}
\mathcal{L}(f) = \ell\!\left((x^1, y^1, f(x^1)), \ldots, (x^N, y^N, f(x^N))\right).
\end{align*}
The third slot in each tuple is $f(x^j)$. The arbitrary loss $\ell$ never sees $f$ itself — only the $N$ scalars $f(x^1), \ldots, f(x^N)$. So if two functions $f, g \in \mathcal{H}_\phi$ agree at all $N$ training inputs, they have the same loss.
In our decomposition, $f(x^j) = g(x^j) + f^\perp(x^j)$, and we showed in the previous step that $f^\perp(x^j) = 0$ for all $j$. Therefore $f(x^j) = g(x^j)$ for $j = 1, \ldots, N$, which gives $\mathcal{L}(f) = \mathcal{L}(g)$.
This is the half of the argument that controls the loss term. The orthogonal complement $f^\perp$ is invisible to the loss because the loss only probes $f$ at the training points. The next step controls the regulariser.
[/guided]
[/step]
[step:Apply Pythagoras to compare the regulariser at $f$ and $g$]
Since $g \in \mathcal{A}$ and $f^\perp \in \mathcal{A}^\perp$ are orthogonal in $\mathcal{H}_\phi$, the Pythagorean theorem in a Hilbert space gives
\begin{align*}
\|f\|_{\mathcal{H}_\phi}^2 = \|g + f^\perp\|_{\mathcal{H}_\phi}^2 = \|g\|_{\mathcal{H}_\phi}^2 + \|f^\perp\|_{\mathcal{H}_\phi}^2 \geq \|g\|_{\mathcal{H}_\phi}^2,
\end{align*}
with equality if and only if $\|f^\perp\|_{\mathcal{H}_\phi} = 0$, i.e. $f^\perp = 0$.
[guided]
Step 3 controlled the loss. We now control the regulariser $\|f\|_{\mathcal{H}_\phi}^2$, and the result is the opposite direction: replacing $f$ by $g$ can only **decrease** the squared norm.
By Step 1 the decomposition $f = g + f^\perp$ is orthogonal in $\mathcal{H}_\phi$: $(g, f^\perp)_{\mathcal{H}_\phi} = 0$. Expanding the squared norm using the inner product (this is just the polarisation identity for the orthogonal sum):
\begin{align*}
\|f\|_{\mathcal{H}_\phi}^2 &= \|g + f^\perp\|_{\mathcal{H}_\phi}^2 = (g + f^\perp, g + f^\perp)_{\mathcal{H}_\phi} \\
&= (g, g)_{\mathcal{H}_\phi} + (g, f^\perp)_{\mathcal{H}_\phi} + (f^\perp, g)_{\mathcal{H}_\phi} + (f^\perp, f^\perp)_{\mathcal{H}_\phi} \\
&= \|g\|_{\mathcal{H}_\phi}^2 + 0 + 0 + \|f^\perp\|_{\mathcal{H}_\phi}^2 = \|g\|_{\mathcal{H}_\phi}^2 + \|f^\perp\|_{\mathcal{H}_\phi}^2,
\end{align*}
where the cross terms vanish by orthogonality. This is the **Pythagorean theorem** in a Hilbert space.
Since $\|f^\perp\|_{\mathcal{H}_\phi}^2 \ge 0$ (norms are non-negative),
\begin{align*}
\|f\|_{\mathcal{H}_\phi}^2 = \|g\|_{\mathcal{H}_\phi}^2 + \|f^\perp\|_{\mathcal{H}_\phi}^2 \ge \|g\|_{\mathcal{H}_\phi}^2,
\end{align*}
with equality iff $\|f^\perp\|_{\mathcal{H}_\phi}^2 = 0$, equivalently $f^\perp = 0$ (a normed space has $\|h\| = 0 \iff h = 0$).
Geometrically: $g$ is the orthogonal projection of $f$ onto $\mathcal{A}$, and the projection of any vector onto a closed subspace has norm no larger than the original — a basic fact about projections in Hilbert spaces. The amount by which the norm shrinks is exactly $\|f^\perp\|^2$, the squared distance from $f$ to $\mathcal{A}$.
This is the **strict inequality** half of the argument. Combined with Step 3's loss equality, it will give a strict decrease in the regularised objective whenever $f^\perp \neq 0$ — so a minimiser must satisfy $f^\perp = 0$.
[/guided]
[/step]
[step:Combine to conclude the minimiser lies in $\mathcal{A}$]
Combining the equality $\mathcal{L}(f) = \mathcal{L}(g)$ from Step 3 with the inequality $\|f\|_{\mathcal{H}_\phi}^2 \geq \|g\|_{\mathcal{H}_\phi}^2$ from Step 4 and using $\lambda > 0$,
\begin{align*}
\mathcal{L}(g) + \lambda\,\|g\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f) + \lambda\,\|f\|_{\mathcal{H}_\phi}^2,
\end{align*}
with equality if and only if $f^\perp = 0$, i.e. $f = g \in \mathcal{A}$.
Now suppose $f^*$ is a minimiser of $\mathcal{L}(f) + \lambda \|f\|_{\mathcal{H}_\phi}^2$ over $\mathcal{H}_\phi$. Decompose $f^* = g^* + (f^*)^\perp$ with $g^* \in \mathcal{A}$, $(f^*)^\perp \in \mathcal{A}^\perp$. By the displayed inequality,
\begin{align*}
\mathcal{L}(g^*) + \lambda\,\|g^*\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f^*) + \lambda\,\|f^*\|_{\mathcal{H}_\phi}^2.
\end{align*}
Since $g^* \in \mathcal{H}_\phi$ and $f^*$ minimises the regularised objective over $\mathcal{H}_\phi$, the reverse inequality also holds, hence equality. Equality forces $\|(f^*)^\perp\|_{\mathcal{H}_\phi}^2 = 0$ (because $\lambda > 0$), so $(f^*)^\perp = 0$ and
\begin{align*}
f^* = g^* \in \mathcal{A}.
\end{align*}
This completes the proof.
[guided]
We assemble the two pieces. The loss-equality $\mathcal{L}(f) = \mathcal{L}(g)$ from Step 3 says: going from $f$ to its $\mathcal{A}$-component $g$ does not change the loss. Pythagoras from Step 4 says: going from $f$ to $g$ does not increase the squared norm, and strictly decreases it if $f^\perp \neq 0$. Adding $\lambda$ times Pythagoras to the loss-equality:
\begin{align*}
\mathcal{L}(g) + \lambda\,\|g\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f) + \lambda\,\|f\|_{\mathcal{H}_\phi}^2.
\end{align*}
This says the $\mathcal{A}$-component $g$ achieves a regularised objective no worse than $f$. And the inequality is strict whenever $\|f^\perp\|_{\mathcal{H}_\phi} > 0$, **because $\lambda > 0$** — this is where the regularisation hypothesis is consumed. Without $\lambda > 0$, there would be no penalty for keeping a non-zero $f^\perp$, and the representer theorem would fail (the minimiser would be non-unique modulo $\mathcal{A}^\perp$).
Now suppose $f^*$ is an actual minimiser. Decompose $f^* = g^* + (f^*)^\perp$. The displayed inequality, applied to $f = f^*$, gives
\begin{align*}
\mathcal{L}(g^*) + \lambda\,\|g^*\|_{\mathcal{H}_\phi}^2 \leq \mathcal{L}(f^*) + \lambda\,\|f^*\|_{\mathcal{H}_\phi}^2.
\end{align*}
But $f^*$ is a minimiser, so the reverse inequality $\mathcal{L}(g^*) + \lambda\,\|g^*\|_{\mathcal{H}_\phi}^2 \geq \mathcal{L}(f^*) + \lambda\,\|f^*\|_{\mathcal{H}_\phi}^2$ holds too (with $g^* \in \mathcal{H}_\phi$ a candidate). Hence equality, which by the strict-inequality clause forces $\|(f^*)^\perp\|_{\mathcal{H}_\phi}^2 = 0$, so $(f^*)^\perp = 0$ and $f^* = g^* \in \mathcal{A}$. The minimiser lies in the span of the kernel sections at the training inputs — the conclusion of the Representer Theorem.
What does $f^* \in \mathcal{A}$ buy us in practice? Since $\mathcal{A} = \operatorname{Span}\{k_\phi(\cdot, x^i)\}_{i=1}^N$, we know there exist coefficients $\alpha_1, \ldots, \alpha_N \in \mathbb{R}$ with $f^*(\cdot) = \sum_{i=1}^N \alpha_i\,k_\phi(\cdot, x^i)$. The infinite-dimensional optimisation problem reduces to optimising over $N$ real coefficients — the basis of all kernel methods.
[/guided]
[/step]