[proofplan]
We first verify the canonical pair directly: the observability matrix of $(C_o,A_o)$ contains the standard coordinate rows in order, so it has full rank. We then compute the characteristic polynomial of $A_o$ from the determinant of $sI-A_o$, obtaining exactly $p(s)$. For a general observable pair $(C,A)$, we use its observability matrix as the coordinate-change matrix; the first $n-1$ rows shift by multiplication with $A$, while the final row is determined by the Cayley-Hamilton identity applied to the given characteristic polynomial.
[/proofplan]
[step:Verify observability of the canonical pair]
Let $I_n \in \mathbb R^{n \times n}$ denote the identity matrix. Let $\mathcal O(C_o,A_o) \in \mathbb R^{n \times n}$ denote the observability matrix whose $k$-th row is $C_oA_o^{k-1}$ for $1 \leq k \leq n$. We claim that
\begin{align*}
C_oA_o^{k-1}(x) = x_k
\end{align*}
for every $x=(x_1,\dots,x_n) \in \mathbb R^n$ and every $1 \leq k \leq n$.
For $m\in\{0,\dots,n-1\}$, we prove the stronger assertion that the first $n-m$ coordinates of $A_o^m x$ are $x_{m+1},\dots,x_n$. The case $m=0$ is immediate. Suppose the assertion holds for some $m\leq n-2$, and set $y=A_o^m x\in\mathbb R^n$, so that the first $n-m$ coordinates of $y$ are $x_{m+1},\dots,x_n$ by the induction hypothesis. By the entrywise definition of $A_o$, the $i$-th coordinate of $A_oy$ is $y_{i+1}$ for $1\leq i\leq n-1$. Therefore the first $n-m-1$ coordinates of $A_o^{m+1}x=A_oy$ are $x_{m+2},\dots,x_n$. This proves the stronger assertion by induction. Taking the first coordinate with $m=k-1$ gives $C_oA_o^{k-1}(x)=x_k$ for $1\leq k\leq n$. Equivalently, the rows of $\mathcal O(C_o,A_o)$ are, in order, the coordinate functionals $x \mapsto x_1,\dots,x \mapsto x_n$. Hence
\begin{align*}
\mathcal O(C_o,A_o) = I_n.
\end{align*}
Therefore $\mathcal O(C_o,A_o)$ has rank $n$. By the definition of observability for a single-output pair, this proves that $(C_o,A_o)$ is observable.
[guided]
Let $I_n \in \mathbb R^{n \times n}$ denote the identity matrix. The observability matrix records which linear functionals of the initial state can be recovered from repeated observations. Here the output map is $C_o(x)=x_1$, so the first observed coordinate is the first coordinate. The special shape of $A_o$ shifts coordinates upward: for every $y=(y_1,\dots,y_n)\in\mathbb R^n$ and every $1\leq i\leq n-1$, the $i$-th coordinate of $A_oy$ is $y_{i+1}$.
Formally, define $\mathcal O(C_o,A_o) \in \mathbb R^{n \times n}$ to be the matrix whose $k$-th row is $C_oA_o^{k-1}$. We prove a stronger induction statement than just the first-coordinate identity: for each $m\in\{0,\dots,n-1\}$, the first $n-m$ coordinates of $A_o^m x$ are $x_{m+1},\dots,x_n$ for every $x=(x_1,\dots,x_n)\in\mathbb R^n$. The case $m=0$ says that the first $n$ coordinates of $x$ are $x_1,\dots,x_n$, so it is immediate.
Assume the statement holds for some $m\leq n-2$, and set $y=A_o^m x\in\mathbb R^n$. By the induction hypothesis, the first $n-m$ coordinates of $y$ are $x_{m+1},\dots,x_n$. Since the $i$-th coordinate of $A_oy$ is $y_{i+1}$ for $1\leq i\leq n-1$, the first $n-m-1$ coordinates of $A_o^{m+1}x=A_oy$ are $x_{m+2},\dots,x_n$. This proves the stronger statement by induction.
Now choose $m=k-1$. The first coordinate of $A_o^{k-1}x$ is $x_k$, and applying $C_o$ extracts exactly that first coordinate. Hence
\begin{align*}
C_oA_o^{k-1}(x) = x_k
\end{align*}
for every $1\leq k\leq n$. Thus the rows of $\mathcal O(C_o,A_o)$ are exactly the standard coordinate rows. Hence $\mathcal O(C_o,A_o)=I_n$, which has rank $n$. This is precisely the definition of observability for a single-output pair.
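As a small illustration, take $n=3$, and assume that the entrywise definition of $A_o$ in the theorem statement places ones on the superdiagonal and $-a_0,-a_1,-a_2$ in the last row, which is the form the shift property suggests. Under that assumption,
\begin{align*}
C_o=\begin{pmatrix}1&0&0\end{pmatrix},\qquad A_o=\begin{pmatrix}0&1&0\\0&0&1\\-a_0&-a_1&-a_2\end{pmatrix}.
\end{align*}
The row vector $C_o$ selects the first row of the matrix it multiplies, and $\begin{pmatrix}0&1&0\end{pmatrix}$ selects the second row, so
\begin{align*}
C_oA_o=\begin{pmatrix}0&1&0\end{pmatrix},\qquad C_oA_o^2=\begin{pmatrix}0&1&0\end{pmatrix}A_o=\begin{pmatrix}0&0&1\end{pmatrix}.
\end{align*}
Stacking $C_o$, $C_oA_o$, $C_oA_o^2$ gives $\mathcal O(C_o,A_o)=I_3$. The coefficients $a_0,a_1,a_2$ do not appear, because the last row of $A_o$ is first selected only when forming $C_oA_o^3$.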
[/guided]
[/step]
[step:Compute the characteristic polynomial of the canonical matrix]
Let $q \in \mathbb R[s]$ denote the characteristic polynomial of $A_o$, defined by
\begin{align*}
q(s)=\det(sI_n-A_o)
\end{align*}
for each $s\in\mathbb R$. Define the polynomial matrix $B \in \mathbb R[s]^{n\times n}$ by
\begin{align*}
B(s)=sI_n-A_o
\end{align*}
for each $s\in\mathbb R$. By the entrywise definition of $A_o$, the nonzero entries of $B(s)$ are $B(s)_{ii}=s$ for $1\leq i\leq n-1$, $B(s)_{i,i+1}=-1$ for $1\leq i\leq n-1$, $B(s)_{n,j}=a_{j-1}$ for $1\leq j\leq n-1$, and $B(s)_{n,n}=s+a_{n-1}$.
We compute $\det B(s)$ from the Leibniz determinant formula. A nonzero term is determined by the column $j$ chosen by the last row. If $1\leq j\leq n-1$, then rows $1,\dots,j-1$ must choose their diagonal entries, and rows $j,\dots,n-1$ must choose their superdiagonal entries. The corresponding permutation is the cycle sending $j$ to $j+1$, $j+1$ to $j+2$, and so on, with $n$ sent to $j$; its sign is $(-1)^{n-j}$. The product of matrix entries contains $n-j$ factors equal to $-1$, so the sign from the entries is also $(-1)^{n-j}$. Hence the total contribution for this $j$ is
\begin{align*}
a_{j-1}s^{j-1}.
\end{align*}
If $j=n$, all first $n-1$ rows choose their diagonal entries and the last row chooses $s+a_{n-1}$, so the contribution is
\begin{align*}
s^{n-1}(s+a_{n-1})=s^n+a_{n-1}s^{n-1}.
\end{align*}
These are the only nonzero Leibniz terms, because each row $i<n$ has nonzero entries only in columns $i$ and $i+1$. Therefore
\begin{align*}
q(s)=s^n+a_{n-1}s^{n-1}+a_{n-2}s^{n-2}+\cdots+a_1s+a_0.
\end{align*}
Thus $q=p$, so the characteristic polynomial of $A_o$ is $p$.
[guided]
We need to show that the canonical matrix has the prescribed characteristic polynomial, so we compute the determinant of $sI_n-A_o$ directly. Let $q \in \mathbb R[s]$ denote the characteristic polynomial of $A_o$, defined by
\begin{align*}
q(s)=\det(sI_n-A_o)
\end{align*}
for each $s\in\mathbb R$. Define the polynomial matrix $B \in \mathbb R[s]^{n\times n}$ by
\begin{align*}
B(s)=sI_n-A_o.
\end{align*}
The entrywise form of $A_o$ gives $B(s)_{ii}=s$ for $1\leq i\leq n-1$, $B(s)_{i,i+1}=-1$ for $1\leq i\leq n-1$, $B(s)_{n,j}=a_{j-1}$ for $1\leq j\leq n-1$, and $B(s)_{n,n}=s+a_{n-1}$.
We now apply the Leibniz determinant formula to $B(s)$. Because each row $i<n$ has nonzero entries only in columns $i$ and $i+1$, a nonzero product in the determinant is determined by the column $j$ chosen by the last row. If $1\leq j\leq n-1$, then rows $1,\dots,j-1$ must choose their diagonal entries, while rows $j,\dots,n-1$ must choose their superdiagonal entries. The corresponding permutation cycles the columns $j,j+1,\dots,n$, so its sign is $(-1)^{n-j}$. The selected superdiagonal entries contribute $n-j$ factors of $-1$, giving another factor $(-1)^{n-j}$. These two signs multiply to $1$, and the contribution is
\begin{align*}
a_{j-1}s^{j-1}.
\end{align*}
If $j=n$, the first $n-1$ rows choose the diagonal entries and the last row chooses $s+a_{n-1}$, giving
\begin{align*}
s^{n-1}(s+a_{n-1})=s^n+a_{n-1}s^{n-1}.
\end{align*}
There are no other nonzero Leibniz terms, because choosing any other pattern would require a zero entry in one of the first $n-1$ rows. Therefore
\begin{align*}
q(s)=s^n+a_{n-1}s^{n-1}+a_{n-2}s^{n-2}+\cdots+a_1s+a_0.
\end{align*}
This is exactly the polynomial $p(s)$ from the theorem statement, so the characteristic polynomial of $A_o$ is $p$.
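To see the sign bookkeeping in a concrete case, here is a sketch for $n=3$, using the entries of $B(s)$ listed above:
\begin{align*}
B(s)=\begin{pmatrix}s&-1&0\\0&s&-1\\a_0&a_1&s+a_2\end{pmatrix}.
\end{align*}
The three nonzero Leibniz terms, indexed by the column $j$ chosen in the last row, are
\begin{align*}
j=1:&\quad (+1)\cdot(-1)\cdot(-1)\cdot a_0=a_0,\\
j=2:&\quad (-1)\cdot s\cdot(-1)\cdot a_1=a_1s,\\
j=3:&\quad (+1)\cdot s\cdot s\cdot(s+a_2)=s^3+a_2s^2.
\end{align*}
In each line the first factor is the sign of the permutation: a $3$-cycle for $j=1$, a transposition for $j=2$, and the identity for $j=3$. Adding the three terms gives
\begin{align*}
\det B(s)=s^3+a_2s^2+a_1s+a_0,
\end{align*}
in agreement with the general formula for $n=3$.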
[/guided]
[/step]
[step:Use the observability matrix as the coordinate basis]
Let $A \in \mathbb R^{n \times n}$ have characteristic polynomial $p$, and let $C:\mathbb R^n \to \mathbb R$ be a [linear map](/page/Linear%20Map) such that $(C,A)$ is observable. Define the observability matrix $\mathcal O(C,A) \in \mathbb R^{n \times n}$ by declaring its $k$-th row to be $CA^{k-1}$ for $1 \leq k \leq n$. For a single-output pair, observability means that this $n\times n$ observability matrix has rank $n$. Hence $\mathcal O(C,A)$ is invertible. Set
\begin{align*}
P = \mathcal O(C,A)^{-1}.
\end{align*}
We use the observable coordinates $z=\mathcal O(C,A)x$, equivalently $x=Pz$, for a state vector $x\in\mathbb R^n$ and its coordinate vector $z\in\mathbb R^n$.
We compute $\mathcal O(C,A)A$. Its first $n-1$ rows are $CA,CA^2,\dots,CA^{n-1}$, which are the second through $n$-th rows of $\mathcal O(C,A)$. Its last row is $CA^n$. Since $p$ is the characteristic polynomial of $A$, the [Cayley-Hamilton theorem](/theorems/921) gives $p(A)=0$, that is,
\begin{align*}
A^n + a_{n-1}A^{n-1} + \cdots + a_1A + a_0I_n = 0.
\end{align*}
Multiplying on the left by $C$ yields
\begin{align*}
CA^n = -a_{n-1}CA^{n-1} - \cdots - a_1CA - a_0C.
\end{align*}
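Thus the last row of $\mathcal O(C,A)A$ is the linear combination of the rows of $\mathcal O(C,A)$ with coefficients $-a_0,-a_1,\dots,-a_{n-1}$, which are the entries of the last row of $A_o$. Likewise, for $1\leq i\leq n-1$, the $i$-th row of $A_o\mathcal O(C,A)$ is the $(i+1)$-th row of $\mathcal O(C,A)$, matching the row shift computed above. By the entrywise definition of $A_o$, this proves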
\begin{align*}
\mathcal O(C,A)A = A_o\mathcal O(C,A).
\end{align*}
Multiplying on the right by $\mathcal O(C,A)^{-1}=P$ gives
\begin{align*}
P^{-1}AP = A_o.
\end{align*}
Finally, since the first row of $\mathcal O(C,A)$ is $C$ and $C_o$ is the first row of $I_n$, we have $C=C_o\mathcal O(C,A)$, and therefore
\begin{align*}
CP = C\mathcal O(C,A)^{-1} = C_o.
\end{align*}
Therefore the observable pair $(C,A)$ is similar to $(C_o,A_o)$ in the observable coordinate basis determined by $P$.
[guided]
The goal is to build coordinates in which the output and dynamics take the canonical form. Let $A\in\mathbb R^{n\times n}$ have characteristic polynomial $p$, and let $C:\mathbb R^n\to\mathbb R$ be a linear map such that $(C,A)$ is observable. Define the observability matrix $\mathcal O(C,A)\in\mathbb R^{n\times n}$ by making its $k$-th row equal to $CA^{k-1}$ for $1\leq k\leq n$. For a single-output pair, observability means precisely that this $n\times n$ observability matrix has rank $n$, hence $\mathcal O(C,A)$ is invertible. Set
\begin{align*}
P=\mathcal O(C,A)^{-1}.
\end{align*}
The observable coordinate vector is $z=\mathcal O(C,A)x$, equivalently $x=Pz$, for a state vector $x\in\mathbb R^n$.
We next compute how the dynamics look in these coordinates. Multiplying $\mathcal O(C,A)$ on the right by $A$ shifts the rows: the first $n-1$ rows of $\mathcal O(C,A)A$ are $CA,CA^2,\dots,CA^{n-1}$, which are exactly the second through $n$-th rows of $\mathcal O(C,A)$. The only row not obtained by this shift is the last row, $CA^n$. Since $A\in\mathbb R^{n\times n}$ is a square real matrix and $p$ is its characteristic polynomial, the [Cayley-Hamilton theorem](/theorems/921) gives $p(A)=0$. Thus
\begin{align*}
A^n+a_{n-1}A^{n-1}+\cdots+a_1A+a_0I_n=0.
\end{align*}
Multiplying this identity on the left by the linear map $C$ gives
\begin{align*}
CA^n=-a_{n-1}CA^{n-1}-\cdots-a_1CA-a_0C.
\end{align*}
Therefore the last row of $\mathcal O(C,A)A$ is the linear combination of the rows of $\mathcal O(C,A)$ with coefficients $-a_0,-a_1,\dots,-a_{n-1}$. Comparing this row-shift structure with the entrywise definition of $A_o$, we obtain
\begin{align*}
\mathcal O(C,A)A=A_o\mathcal O(C,A).
\end{align*}
Now multiply on the right by $\mathcal O(C,A)^{-1}=P$ to get
\begin{align*}
P^{-1}AP=A_o.
\end{align*}
Finally, the first row of $\mathcal O(C,A)$ is $C$, while the first row of the identity matrix is $C_o$. Hence
\begin{align*}
CP=C\mathcal O(C,A)^{-1}=C_o.
\end{align*}
Thus, in the observable coordinate basis determined by $P$, the pair $(C,A)$ becomes $(C_o,A_o)$. This proves that every observable single-output pair with characteristic polynomial $p$ is similar to the canonical pair.
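The construction can be checked by hand on a small instance. The following $n=2$ pair is a hypothetical example chosen for illustration; it does not come from the theorem statement. Take
\begin{align*}
A=\begin{pmatrix}1&1\\0&2\end{pmatrix},\qquad C=\begin{pmatrix}1&0\end{pmatrix},\qquad p(s)=(s-1)(s-2)=s^2-3s+2,
\end{align*}
so that $a_1=-3$ and $a_0=2$. Then $CA=\begin{pmatrix}1&1\end{pmatrix}$, and
\begin{align*}
\mathcal O(C,A)=\begin{pmatrix}1&0\\1&1\end{pmatrix},\qquad P=\mathcal O(C,A)^{-1}=\begin{pmatrix}1&0\\-1&1\end{pmatrix},\qquad A_o=\begin{pmatrix}0&1\\-2&3\end{pmatrix}.
\end{align*}
Both sides of the intertwining identity agree:
\begin{align*}
\mathcal O(C,A)A=\begin{pmatrix}1&1\\1&3\end{pmatrix}=A_o\mathcal O(C,A),
\end{align*}
and the output map becomes
\begin{align*}
CP=\begin{pmatrix}1&0\end{pmatrix}\begin{pmatrix}1&0\\-1&1\end{pmatrix}=\begin{pmatrix}1&0\end{pmatrix}=C_o.
\end{align*}
Here the last row $\begin{pmatrix}1&3\end{pmatrix}$ of $\mathcal O(C,A)A$ equals $-a_0C-a_1CA=-2\begin{pmatrix}1&0\end{pmatrix}+3\begin{pmatrix}1&1\end{pmatrix}$, as the Cayley-Hamilton step predicts.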
[/guided]
[/step]