[proofplan]
We prove the inclusion by testing the definition of the wave front set in local coordinates. Away from $S$, a conormal distribution is smooth, so only covectors based on $S$ need attention. In coordinates adapted to $S$, a conormal distribution is an oscillatory integral in the normal variables; after multiplying by a compactly supported cutoff, its [Fourier transform](/page/Fourier%20Transform) has a phase whose tangential gradient is nonzero in every cone avoiding the conormal directions. Repeated [integration by parts](/theorems/210) with the corresponding nonstationary phase operator gives rapid decay in that cone, which is exactly the exclusion criterion for the wave front set.
[/proofplan]
[step:Reduce the assertion to adapted local coordinates near a point of $S$]
Let $(p_0,\xi_0) \in T^*X \setminus 0$ satisfy $(p_0,\xi_0) \notin N^*S$. We prove that $(p_0,\xi_0) \notin \operatorname{WF}(u)$.
If $p_0 \notin S$, then the defining local form of a conormal distribution implies that $u$ is smooth in a neighbourhood of $p_0$. Hence no nonzero covector over $p_0$ lies in $\operatorname{WF}(u)$.
It remains to consider $p_0 \in S$. Choose an adapted coordinate chart $(U,\kappa)$ around $p_0$ with
\begin{align*}
\kappa: U \to V \subset \mathbb{R}^k \times \mathbb{R}^{\ell}
\end{align*}
a diffeomorphism onto an [open set](/page/Open%20Set) $V$, such that
\begin{align*}
\kappa(S \cap U)= V \cap (\mathbb{R}^k \times \{0\}).
\end{align*}
Write points of $V$ as $x=(y,z)$ with $y \in \mathbb{R}^k$ tangential and $z \in \mathbb{R}^{\ell}$ normal. For each $n \in \mathbb{N}$, let $\mathcal{L}^n$ denote $n$-dimensional [Lebesgue measure](/page/Lebesgue%20Measure) on $\mathbb{R}^n$. Under the induced cotangent coordinates, write a covector as $(\eta,\zeta) \in \mathbb{R}^k \times \mathbb{R}^{\ell}$. In these coordinates,
\begin{align*}
N^*S \cap T^*U = \{(y,0;0,\zeta) : y \in \mathbb{R}^k,\ \zeta \in \mathbb{R}^{\ell}\}.
\end{align*}
Since $(p_0,\xi_0) \notin N^*S$ and $p_0 \in S$, the tangential component $\eta_0$ of $\xi_0$ is nonzero.
[/step]
[step:Write the localized conormal distribution as an oscillatory integral]
Let $\chi: U \to \mathbb{R}$ be a compactly supported smooth cutoff equal to $1$ on a neighbourhood of $p_0$ and with support contained in the adapted coordinate patch. By the local definition of conormal distributions, after applying $\kappa$ we may write $\chi u$ as the sum of a smooth compactly supported function and an oscillatory integral distribution
\begin{align*}
v \in \mathcal{D}'(V)
\end{align*}
defined by the regularized oscillatory-integral pairing stated in the theorem. Equivalently, for every [test function](/page/Test%20Function) $\phi \in C_c^\infty(V)$,
\begin{align*}
v(\phi)=\lim_{\varepsilon \downarrow 0}\int_{\mathbb{R}^k \times \mathbb{R}^{\ell} \times \mathbb{R}^{\ell}} e^{i z \cdot \theta}\rho(\varepsilon\theta)a(y,\theta)\phi(y,z)\,d\mathcal{L}^k(y)\,d\mathcal{L}^{\ell}(z)\,d\mathcal{L}^{\ell}(\theta),
\end{align*}
where $\rho \in C_c^\infty(\mathbb{R}^{\ell})$ is equal to $1$ near $0$ and the limit is independent of $\rho$. Here $a: \mathbb{R}^k \times \mathbb{R}^{\ell} \to \mathbb{C}$ is smooth, compactly supported in $y$ after multiplication by the cutoff, and satisfies symbol estimates of one fixed order $m \in \mathbb{R}$: for every pair of multi-indices $\alpha \in \mathbb{N}_0^k$ and $\beta \in \mathbb{N}_0^{\ell}$ there is a constant $C_{\alpha,\beta}>0$ such that
\begin{align*}
|\partial_y^\alpha \partial_\theta^\beta a(y,\theta)| \le C_{\alpha,\beta}(1+|\theta|)^{m-|\beta|}.
\end{align*}
The smooth compactly supported summand has rapidly decreasing Fourier transform, so it suffices to prove rapid decay for the localized Fourier transform of $v$ near $(\eta_0,\zeta_0)$.
[/step]
[step:Choose a cone where the tangential frequency stays nonzero]
Write $\xi_0 = (\eta_0,\zeta_0)$ in the induced cotangent coordinates, so that $\zeta_0$ denotes the normal component of $\xi_0$. Because $\eta_0 \ne 0$, choose an open conic neighbourhood $\Gamma \subset \mathbb{R}^{k+\ell} \setminus \{0\}$ of $(\eta_0,\zeta_0)$ and a constant $c>0$ such that
\begin{align*}
|\eta| \ge c|(\eta,\zeta)|
\end{align*}
for every $(\eta,\zeta) \in \Gamma$.
Let $\psi: \mathbb{R}^k \times \mathbb{R}^{\ell} \to \mathbb{R}$ be a compactly supported smooth function supported in $V$ and equal to $1$ near $\kappa(p_0)$. Define the localized Fourier transform
\begin{align*}
F: \mathbb{R}^{k+\ell} \to \mathbb{C}
\end{align*}
by the distributional pairing
\begin{align*}
F(\eta,\zeta)=v\left(\psi(y,z)e^{-i(y\cdot \eta+z\cdot \zeta)}\right).
\end{align*}
For each $\zeta \in \mathbb{R}^{\ell}$, define the partial Fourier transform of the cutoff in the normal variable by the map $b_\zeta: \mathbb{R}^k \times \mathbb{R}^{\ell} \to \mathbb{C}$ with
\begin{align*}
b_\zeta(y,\theta)=\int_{\mathbb{R}^{\ell}} e^{i z\cdot(\theta-\zeta)}\psi(y,z)\,d\mathcal{L}^{\ell}(z).
\end{align*}
Since $\psi$ is compactly supported and smooth in $z$, [integration by parts](/theorems/2098) in the $z$-variables shows that for every multi-index $\alpha \in \mathbb{N}_0^k$ and every $M \in \mathbb{N}$ there is a constant $B_{\alpha,M}>0$ such that
\begin{align*}
|\partial_y^\alpha b_\zeta(y,\theta)| \le B_{\alpha,M}(1+|\theta-\zeta|)^{-M}.
\end{align*}
Therefore the product $a(y,\theta)b_\zeta(y,\theta)$ is integrable in $(y,\theta)$ after choosing $M>|m|+\ell+1$. Indeed, the $y$-support is compact and the integrand is bounded by a constant times $(1+|\theta|)^m(1+|\theta-\zeta|)^{-M}$. Applying the regularized definition of $v$, first with the factor $\rho(\varepsilon\theta)$ and then passing to the limit by dominated convergence with this integrable majorant, justifies the exchange of the $z$- and $\theta$-integrations and gives the absolutely convergent representation
\begin{align*}
F(\eta,\zeta)=\int_{\mathbb{R}^k \times \mathbb{R}^{\ell}} e^{-iy\cdot\eta}a(y,\theta)b_\zeta(y,\theta)\,d\mathcal{L}^k(y)\,d\mathcal{L}^{\ell}(\theta).
\end{align*}
The remaining phase function $\Phi_y: \mathbb{R}^k \times \Gamma \to \mathbb{R}$ is defined by $\Phi_y(y;\eta,\zeta)=-y\cdot\eta$, and its tangential gradient is
\begin{align*}
\nabla_y \Phi_y(y;\eta,\zeta)=-\eta.
\end{align*}
Thus on $\Gamma$ we have the uniform lower bound
\begin{align*}
|\nabla_y\Phi_y(y;\eta,\zeta)|=|\eta| \ge c|(\eta,\zeta)|.
\end{align*}
[/step]
[step:Integrate by parts in the tangential variables to gain arbitrary decay]
For each $(\eta,\zeta)\in \Gamma$, define the first-order differential operator $L_{\eta}: C^\infty(\mathbb{R}^k) \to C^\infty(\mathbb{R}^k)$ by
\begin{align*}
L_{\eta} f=\frac{i\eta\cdot \nabla_y f}{|\eta|^2}.
\end{align*}
Since $\nabla_y\Phi_y=-\eta$, direct differentiation gives
\begin{align*}
L_{\eta}e^{-iy\cdot\eta}=e^{-iy\cdot\eta}.
\end{align*}
Let $L_{\eta}^*: C^\infty(\mathbb{R}^k) \to C^\infty(\mathbb{R}^k)$ denote the formal adjoint of $L_{\eta}$ with respect to $d\mathcal{L}^k(y)$. Because $\eta$ is independent of $y$,
\begin{align*}
L_{\eta}^* f=-\frac{i\eta\cdot \nabla_y f}{|\eta|^2}.
\end{align*}
Applying integration by parts $N$ times in the $y$-variables to the absolutely convergent integral gives
\begin{align*}
F(\eta,\zeta)=\int_{\mathbb{R}^k \times \mathbb{R}^{\ell}} e^{-iy\cdot\eta}(L_{\eta}^*)^N[a(y,\theta)b_\zeta(y,\theta)]\,d\mathcal{L}^k(y)\,d\mathcal{L}^{\ell}(\theta).
\end{align*}
No boundary term appears, because $a(y,\theta)b_\zeta(y,\theta)$ is compactly supported in $y$.
Each application of $L_{\eta}^*$ contributes one factor $|\eta|^{-1}$ and one $y$-derivative falling on $a b_\zeta$. By the symbol estimates for $a$ and the rapid-decay estimate for $b_\zeta$, every summand in $(L_{\eta}^*)^N[a(y,\theta)b_\zeta(y,\theta)]$ is bounded by a constant times
\begin{align*}
|\eta|^{-N}(1+|\theta|)^m(1+|\theta-\zeta|)^{-M}.
\end{align*}
Choose $M>|m|+\ell+1$. The convolution-type estimate
\begin{align*}
\int_{\mathbb{R}^{\ell}}(1+|\theta|)^m(1+|\theta-\zeta|)^{-M}\,d\mathcal{L}^{\ell}(\theta) \le A(1+|\zeta|)^r
\end{align*}
holds for constants $A>0$ and $r=\max\{m,0\}$ depending only on $m$, $M$, and $\ell$; it follows from $1+|\theta| \le (1+|\theta-\zeta|)(1+|\zeta|)$ when $m\ge 0$, while for $m<0$ the factor $(1+|\theta|)^m$ is bounded by $1$. Since the $y$-support is compact, integration in $y$ only contributes a finite constant. Hence, for every $N \in \mathbb{N}$ there are constants $C_N>0$ and $r \ge 0$, with $r$ depending only on $m$ and $\ell$, such that
\begin{align*}
|F(\eta,\zeta)| \le C_N |\eta|^{-N}(1+|\zeta|)^r.
\end{align*}
On $\Gamma$, the inequality $|\eta| \ge c|(\eta,\zeta)|$ implies $1+|\zeta| \le (1+c^{-1})(1+|\eta|)$. Hence, after increasing the constant,
\begin{align*}
|F(\eta,\zeta)| \le C_N' |\eta|^{-N+r}.
\end{align*}
Given any $Q \in \mathbb{N}$, choose $N>Q+r$. Then $|\eta| \ge c|(\eta,\zeta)|$ gives a constant $D_Q>0$ such that
\begin{align*}
|F(\eta,\zeta)| \le D_Q |(\eta,\zeta)|^{-Q}
\end{align*}
for all $(\eta,\zeta) \in \Gamma$ with $|(\eta,\zeta)| \ge 1$. Thus $F$ is rapidly decreasing in the cone $\Gamma$.
[guided]
The analytic point is that tangential integration by parts alone does not make the $\theta$-integral harmless. The amplitude $a(y,\theta)$ is only a symbol, so it may grow polynomially in $\theta$. The compact support and smoothness of the cutoff in the normal variable provide the missing decay: for
\begin{align*}
b_\zeta(y,\theta)=\int_{\mathbb{R}^{\ell}} e^{i z\cdot(\theta-\zeta)}\psi(y,z)\,d\mathcal{L}^{\ell}(z),
\end{align*}
integration by parts in $z$ gives rapid decay in $\theta-\zeta$. Therefore $a(y,\theta)b_\zeta(y,\theta)$ is integrable in $\theta$, and the localized Fourier transform is the ordinary integral
\begin{align*}
F(\eta,\zeta)=\int_{\mathbb{R}^k \times \mathbb{R}^{\ell}} e^{-iy\cdot\eta}a(y,\theta)b_\zeta(y,\theta)\,d\mathcal{L}^k(y)\,d\mathcal{L}^{\ell}(\theta).
\end{align*}
Now the phase $-y\cdot\eta$ is nonstationary in the $y$-variables on the cone $\Gamma$, because $|\eta| \ge c|(\eta,\zeta)|$. We use the operator $L_{\eta}: C^\infty(\mathbb{R}^k) \to C^\infty(\mathbb{R}^k)$ defined by
\begin{align*}
L_\eta f=\frac{i\eta\cdot\nabla_y f}{|\eta|^2}.
\end{align*}
It satisfies $L_\eta e^{-iy\cdot\eta}=e^{-iy\cdot\eta}$. Its formal adjoint with respect to $d\mathcal{L}^k(y)$ is
\begin{align*}
L_\eta^*f=-\frac{i\eta\cdot\nabla_y f}{|\eta|^2}.
\end{align*}
Integrating by parts $N$ times transfers $N$ tangential derivatives from the exponential to the amplitude and gives
\begin{align*}
F(\eta,\zeta)=\int_{\mathbb{R}^k \times \mathbb{R}^{\ell}} e^{-iy\cdot\eta}(L_{\eta}^*)^N[a(y,\theta)b_\zeta(y,\theta)]\,d\mathcal{L}^k(y)\,d\mathcal{L}^{\ell}(\theta).
\end{align*}
There are no boundary terms because the integrand is compactly supported in $y$. The derivatives falling on $a$ preserve the same symbol order in $\theta$, while the derivatives falling on $b_\zeta$ preserve rapid decay in $\theta-\zeta$. More explicitly, for $M>|m|+\ell+1$, the remaining $\theta$-integral is controlled by
\begin{align*}
\int_{\mathbb{R}^{\ell}}(1+|\theta|)^m(1+|\theta-\zeta|)^{-M}\,d\mathcal{L}^{\ell}(\theta).
\end{align*}
This integral is bounded by $A(1+|\zeta|)^r$, where $A>0$ and $r=\max\{m,0\}$ depend only on $m$, $M$, and $\ell$. For $m\ge 0$ this follows from $1+|\theta| \le (1+|\theta-\zeta|)(1+|\zeta|)$; for $m<0$ the symbol factor is bounded by $1$. Consequently the remaining $\theta$-integral is bounded by a fixed polynomial in $|\zeta|$:
\begin{align*}
|F(\eta,\zeta)| \le C_N |\eta|^{-N}(1+|\zeta|)^r.
\end{align*}
The cone condition converts this polynomial growth into a power of $|\eta|$, and because $N$ is arbitrary we choose $N$ larger than the desired decay order plus $r$. This gives, for every $Q \in \mathbb{N}$,
\begin{align*}
|F(\eta,\zeta)| \le D_Q |(\eta,\zeta)|^{-Q}
\end{align*}
on $\Gamma$ for large frequency. This is exactly rapid decay of the localized Fourier transform in the cone avoiding the conormal directions.
[/guided]
[/step]
[step:Translate rapid decay into absence from the wave front set]
The definition of the wave front set says that a nonzero covector is absent from $\operatorname{WF}(u)$ if there is a compactly supported smooth cutoff equal to $1$ near the base point and a conic neighbourhood of the covector on which the localized Fourier transform decays faster than every negative power of the frequency. The previous step verifies this condition for every covector over $S$ whose tangential component is nonzero. Together with smoothness away from $S$, this proves that no covector outside $N^*S \setminus 0$ lies in $\operatorname{WF}(u)$.
Therefore
\begin{align*}
\operatorname{WF}(u)\subset N^*S\setminus 0.
\end{align*}
The exclusion of the zero section is part of the definition of the wave front set, which contains only nonzero covectors.
[/step]