[proofplan]
If $P$ is not absolutely continuous with respect to $Q$, the right-hand side is infinite and the result is immediate. Otherwise, after disposing of the zero total-variation case, we reduce the problem to a two-point probability space by choosing a measurable set $A_\varepsilon$ whose probability gap is almost maximal. The key estimate, obtained from Jensen's inequality for the convex function $t\mapsto t\log t$, is that relative entropy does not increase under the coarse-graining into $A_\varepsilon$ and $A_\varepsilon^c$. We then prove the remaining Bernoulli inequality directly and let the approximating set approach the total variation supremum.
[/proofplan]
[step:Dispose of the singular case and fix an almost extremal set]
Let $(\Omega,\mathcal{F})$ be the measurable space on which the probability measures $P$ and $Q$ are defined.
If $P \not\ll Q$, then $D_{\mathrm{KL}}(P \mid Q)=+\infty$, and the asserted inequality holds trivially: its right-hand side is $+\infty$, while $\operatorname{TV}(P,Q) \leq 1$.
Assume henceforth that $P \ll Q$. If $\operatorname{TV}(P,Q)=0$, then the asserted inequality reads $0 \leq \frac{1}{2}D_{\mathrm{KL}}(P \mid Q)$, which holds because relative entropy is nonnegative. Hence assume $\operatorname{TV}(P,Q)>0$, and let $\varepsilon$ satisfy $0<\varepsilon<\operatorname{TV}(P,Q)$. By the definition of the supremum, choose a measurable set $A_\varepsilon \in \mathcal{F}$ such that
\begin{align*}
|P(A_\varepsilon)-Q(A_\varepsilon)| \geq \operatorname{TV}(P,Q)-\varepsilon.
\end{align*}
Replacing $A_\varepsilon$ by its complement if necessary, we may assume
\begin{align*}
P(A_\varepsilon)-Q(A_\varepsilon) \geq \operatorname{TV}(P,Q)-\varepsilon.
\end{align*}
Define $a_\varepsilon := P(A_\varepsilon)$ and $b_\varepsilon := Q(A_\varepsilon)$. Since $0<\varepsilon<\operatorname{TV}(P,Q)$, the preceding lower bound is positive, so $a_\varepsilon-b_\varepsilon>0$. Hence $0 \leq b_\varepsilon < a_\varepsilon \leq 1$, and
\begin{align*}
a_\varepsilon-b_\varepsilon \geq \operatorname{TV}(P,Q)-\varepsilon.
\end{align*}
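As an illustration only (this example is not part of the proof, and it assumes, as the argument above suggests, that $\operatorname{TV}(P,Q)=\sup_{A\in\mathcal{F}}|P(A)-Q(A)|$), consider the hypothetical three-point space $\Omega=\{1,2,3\}$ with $P=(0.5,0.3,0.2)$ and $Q=(0.2,0.3,0.5)$. Checking all subsets gives $\operatorname{TV}(P,Q)=0.3$. The set $B=\{2,3\}$ attains the supremum with the wrong sign, so one passes to its complement $B^c=\{1\}$:
\begin{align*}
P(B)-Q(B) &= 0.5-0.8=-0.3,\\
P(B^c)-Q(B^c) &= 0.5-0.2=0.3.
\end{align*}
In this example the supremum is attained, so $A_\varepsilon=\{1\}$ works for every admissible $\varepsilon$, with $a_\varepsilon=0.5$ and $b_\varepsilon=0.2$.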
[/step]
[step:Coarse-grain the relative entropy to the two sets $A_\varepsilon$ and $A_\varepsilon^c$]
By the [Radon-Nikodym theorem](/page/Radon-Nikodym%20Theorem), let $r:\Omega\to[0,\infty]$ be the measurable function given by $r(\omega)=\frac{dP}{dQ}(\omega)$ for $\omega\in\Omega$, the Radon-Nikodym derivative of $P$ with respect to $Q$. Since $P(\Omega)=1<\infty$, the derivative $r$ may be chosen finite $Q$-a.e.; changing $r$ on a $Q$-null set does not affect any integral below. Define the function $\varphi:[0,\infty)\to\mathbb{R}$ by $\varphi(t)=t\log t$ for $t>0$ and $\varphi(0)=0$. The function $\varphi$ is convex on $[0,\infty)$ because $\varphi''(t)=1/t\geq 0$ for $t>0$ and $\varphi$ is the continuous extension at $0$ of this convex function. Then
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)=\int_\Omega \varphi(r)\,dQ.
\end{align*}
We claim that
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)
\geq
a_\varepsilon\log\left(\frac{a_\varepsilon}{b_\varepsilon}\right)
+
(1-a_\varepsilon)\log\left(\frac{1-a_\varepsilon}{1-b_\varepsilon}\right),
\end{align*}
with the usual conventions $0\log(0/c)=0$ for $c\geq 0$ (so that, in particular, $0\log(0/0)=0$) and $c\log(c/0)=+\infty$ for $c>0$.
Indeed, on the measurable set $A_\varepsilon$, [Jensen's inequality](/page/Jensen%27s%20Inequality) for the probability measure $Q(\cdot \cap A_\varepsilon)/Q(A_\varepsilon)$ gives, when $b_\varepsilon>0$,
\begin{align*}
\int_{A_\varepsilon}\varphi(r)\,dQ
\geq
b_\varepsilon\,
\varphi\left(
\frac{1}{b_\varepsilon}\int_{A_\varepsilon} r\,dQ
\right)
=
b_\varepsilon\,
\varphi\left(\frac{a_\varepsilon}{b_\varepsilon}\right)
=
a_\varepsilon\log\left(\frac{a_\varepsilon}{b_\varepsilon}\right).
\end{align*}
If $b_\varepsilon=0$, then $P \ll Q$ implies $a_\varepsilon=0$, so the same term is $0$. If $1-b_\varepsilon>0$, applying [Jensen's inequality](/page/Jensen%27s%20Inequality) to the probability measure $Q(\cdot\cap A_\varepsilon^c)/Q(A_\varepsilon^c)$ gives
\begin{align*}
\int_{A_\varepsilon^c}\varphi(r)\,dQ
\geq
(1-a_\varepsilon)\log\left(\frac{1-a_\varepsilon}{1-b_\varepsilon}\right).
\end{align*}
If $1-b_\varepsilon=0$, then $Q(A_\varepsilon^c)=0$, and $P\ll Q$ implies $P(A_\varepsilon^c)=1-a_\varepsilon=0$; the complementary entropy term is therefore $0$.
Adding the two estimates proves the claimed coarse-grained lower bound.
[guided]
The purpose of this step is to replace the original probability space by the two events $A_\varepsilon$ and $A_\varepsilon^c$. This is useful because the total variation distance only asks for probability gaps of measurable sets, and a single nearly extremal set contains enough information to prove the estimate.
Since $P \ll Q$, the [Radon-Nikodym theorem](/page/Radon-Nikodym%20Theorem) gives a Radon-Nikodym derivative. Let $r:\Omega\to[0,\infty]$ be the measurable function given by $r(\omega)=\frac{dP}{dQ}(\omega)$ for $\omega\in\Omega$. Thus, for every measurable set $E \in \mathcal{F}$,
\begin{align*}
P(E)=\int_E r\,dQ.
\end{align*}
Because $P(\Omega)=1<\infty$, we may choose this derivative finite $Q$-a.e.; if necessary, redefine $r$ on the $Q$-null set where it is infinite. This redefinition preserves the displayed identity for every $E\in\mathcal{F}$ and preserves all integrals with respect to $Q$. We also define the function $\varphi:[0,\infty)\to\mathbb{R}$ by $\varphi(t)=t\log t$ for $t>0$ and $\varphi(0)=0$. The function $\varphi$ is convex on $[0,\infty)$: on $(0,\infty)$ it has second derivative $\varphi''(t)=1/t\geq 0$, and the value $\varphi(0)=0$ is the continuous endpoint extension of that convex function. The relative entropy can be written as
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)=\int_\Omega \varphi(r)\,dQ.
\end{align*}
We now estimate the contribution from $A_\varepsilon$. If $b_\varepsilon=Q(A_\varepsilon)>0$, then $Q(\cdot \cap A_\varepsilon)/b_\varepsilon$ is a probability measure on $A_\varepsilon$. [Jensen's inequality](/page/Jensen%27s%20Inequality) applied to the convex function $\varphi$ gives
\begin{align*}
\frac{1}{b_\varepsilon}\int_{A_\varepsilon}\varphi(r)\,dQ
\geq
\varphi\left(
\frac{1}{b_\varepsilon}\int_{A_\varepsilon}r\,dQ
\right).
\end{align*}
Multiplying by $b_\varepsilon$ and using $\int_{A_\varepsilon}r\,dQ=P(A_\varepsilon)=a_\varepsilon$, we get
\begin{align*}
\int_{A_\varepsilon}\varphi(r)\,dQ
\geq
b_\varepsilon\varphi\left(\frac{a_\varepsilon}{b_\varepsilon}\right)
=
a_\varepsilon\log\left(\frac{a_\varepsilon}{b_\varepsilon}\right).
\end{align*}
If $b_\varepsilon=0$, absolute continuity $P \ll Q$ forces $a_\varepsilon=P(A_\varepsilon)=0$, so the corresponding entropy term is interpreted as $0$ and the inequality remains valid.
We next apply the same Jensen mechanism to the complement $A_\varepsilon^c$, but we must first check the endpoint. Since $Q(A_\varepsilon^c)=1-b_\varepsilon$ and $P(A_\varepsilon^c)=1-a_\varepsilon$, if $1-b_\varepsilon>0$, then $Q(\cdot\cap A_\varepsilon^c)/(1-b_\varepsilon)$ is a probability measure on $A_\varepsilon^c$ and [Jensen's inequality](/page/Jensen%27s%20Inequality) gives
\begin{align*}
\int_{A_\varepsilon^c}\varphi(r)\,dQ
\geq
(1-a_\varepsilon)\log\left(\frac{1-a_\varepsilon}{1-b_\varepsilon}\right).
\end{align*}
If $1-b_\varepsilon=0$, then $Q(A_\varepsilon^c)=0$, and absolute continuity $P\ll Q$ gives $P(A_\varepsilon^c)=1-a_\varepsilon=0$, so the complementary entropy term is $0$.
Adding the two estimates yields
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)
\geq
a_\varepsilon\log\left(\frac{a_\varepsilon}{b_\varepsilon}\right)
+
(1-a_\varepsilon)\log\left(\frac{1-a_\varepsilon}{1-b_\varepsilon}\right).
\end{align*}
This is exactly the relative entropy of the two-point distributions $(a_\varepsilon,1-a_\varepsilon)$ and $(b_\varepsilon,1-b_\varepsilon)$.
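A small numerical sanity check of this coarse-graining bound may help; it is a sketch on a hypothetical three-point space, with $\log$ taken to be the natural logarithm. Let $\Omega=\{1,2,3\}$, $Q=(1/3,1/3,1/3)$, $P=(1/2,1/3,1/6)$, and take the set $\{1\}$, so that $a=1/2$ and $b=1/3$. Then
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)
&= \frac{1}{2}\log\left(\frac{3}{2}\right)+\frac{1}{3}\log(1)+\frac{1}{6}\log\left(\frac{1}{2}\right)
= \frac{1}{6}\log\left(\frac{27}{16}\right)\approx 0.0872,\\
a\log\left(\frac{a}{b}\right)+(1-a)\log\left(\frac{1-a}{1-b}\right)
&= \frac{1}{2}\log\left(\frac{3}{2}\right)+\frac{1}{2}\log\left(\frac{3}{4}\right)
= \frac{1}{2}\log\left(\frac{9}{8}\right)\approx 0.0589.
\end{align*}
The first quantity dominates the second, as the bound predicts. The gap arises because merging the points $2$ and $3$ discards the information that $P$ and $Q$ differ on $\{3\}$.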
[/guided]
[/step]
[step:Prove the Bernoulli form of Pinsker's inequality]
We prove that for $0 \leq b \leq a \leq 1$,
\begin{align*}
a\log\left(\frac{a}{b}\right)
+
(1-a)\log\left(\frac{1-a}{1-b}\right)
\geq
2(a-b)^2.
\end{align*}
We first handle the equality and endpoint cases. If $a=b$, then both sides are $0$ (using the convention $0\log(0/0)=0$ when $a=b\in\{0,1\}$), so the inequality holds. Assume henceforth that $b<a$. If $b=0$, then $a>0$ and the left-hand side is $+\infty$ by the convention $a\log(a/0)=+\infty$, so the inequality holds. If $a=1$ and $0<b<1$, the inequality becomes
\begin{align*}
\log\left(\frac{1}{b}\right) \geq 2(1-b)^2.
\end{align*}
This follows from the interior argument below by taking $a\uparrow 1$, because the function
\begin{align*}
a \mapsto a\log\left(\frac{a}{b}\right)+(1-a)\log\left(\frac{1-a}{1-b}\right)-2(a-b)^2
\end{align*}
is continuous on $(b,1]$ after setting $(1-a)\log((1-a)/(1-b))=0$ at $a=1$. It remains to consider $0<b<a<1$.
For fixed $b\in(0,1)$, define $F_b:[b,1)\to\mathbb{R}$ by
\begin{align*}
F_b(a)
=
a\log\left(\frac{a}{b}\right)
+
(1-a)\log\left(\frac{1-a}{1-b}\right)
-
2(a-b)^2.
\end{align*}
Then
\begin{align*}
F_b'(a)
=
\log\left(\frac{a(1-b)}{b(1-a)}\right)-4(a-b),
\end{align*}
and
\begin{align*}
F_b''(a)
=
\frac{1}{a}+\frac{1}{1-a}-4
=
\frac{1}{a(1-a)}-4
=
\frac{(2a-1)^2}{a(1-a)}
\geq 0.
\end{align*}
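For completeness, the two derivative formulas can be checked term by term. The following is a routine verification from the definition of $F_b$, for $a\in(b,1)$:
\begin{align*}
\frac{d}{da}\left[a\log\left(\frac{a}{b}\right)\right] &= \log\left(\frac{a}{b}\right)+1,\\
\frac{d}{da}\left[(1-a)\log\left(\frac{1-a}{1-b}\right)\right] &= -\log\left(\frac{1-a}{1-b}\right)-1,\\
\frac{d}{da}\left[2(a-b)^2\right] &= 4(a-b),\\
\frac{d}{da}\log\left(\frac{a(1-b)}{b(1-a)}\right) &= \frac{1}{a}+\frac{1}{1-a},\\
1-4a(1-a) &= (2a-1)^2.
\end{align*}
The first three lines combine to the formula for $F_b'$, and the last two give $F_b''(a)=\frac{(2a-1)^2}{a(1-a)}\geq 0$ on $(b,1)$.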
Thus $F_b'$ is nondecreasing on $(b,1)$. The expressions for $F_b$ and $F_b'$ are also defined and continuous at $a=b$, and since
\begin{align*}
F_b'(b)=0,
\end{align*}
we have $F_b'(a)\geq 0$ for $a\geq b$. Hence $F_b$ is nondecreasing on $[b,1)$, and because
\begin{align*}
F_b(b)=0,
\end{align*}
we obtain $F_b(a)\geq 0$. This proves the Bernoulli inequality.
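As a quick numerical check of the Bernoulli inequality (an illustrative instance with hypothetical values, $\log$ being the natural logarithm), take $a=3/4$ and $b=1/2$:
\begin{align*}
a\log\left(\frac{a}{b}\right)+(1-a)\log\left(\frac{1-a}{1-b}\right)
&= \frac{3}{4}\log\left(\frac{3}{2}\right)+\frac{1}{4}\log\left(\frac{1}{2}\right)
= \frac{1}{4}\log\left(\frac{27}{16}\right)\approx 0.1308,\\
2(a-b)^2 &= 2\left(\frac{1}{4}\right)^2 = 0.125.
\end{align*}
The left-hand side exceeds the right-hand side, consistent with $F_{1/2}(3/4)\geq 0$.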
[/step]
[step:Apply the Bernoulli estimate and pass to the total variation supremum]
Applying the Bernoulli inequality with $a=a_\varepsilon$ and $b=b_\varepsilon$, and using the coarse-grained entropy bound, gives
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)
\geq
2(a_\varepsilon-b_\varepsilon)^2.
\end{align*}
Since $a_\varepsilon-b_\varepsilon \geq \operatorname{TV}(P,Q)-\varepsilon>0$, squaring preserves the inequality, and we have
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)
\geq
2(\operatorname{TV}(P,Q)-\varepsilon)^2.
\end{align*}
Letting $\varepsilon \downarrow 0$ gives
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)
\geq
2\operatorname{TV}(P,Q)^2.
\end{align*}
Equivalently,
\begin{align*}
\operatorname{TV}(P,Q)^2 \leq \frac{1}{2}D_{\mathrm{KL}}(P \mid Q).
\end{align*}
This is the desired inequality.
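To see the final inequality in a concrete case (a hypothetical example, assuming $\operatorname{TV}(P,Q)=\sup_{A\in\mathcal{F}}|P(A)-Q(A)|$ and the natural logarithm), take the three-point measures $P=(0.5,0.3,0.2)$ and $Q=(0.2,0.3,0.5)$ on $\Omega=\{1,2,3\}$. Then $\operatorname{TV}(P,Q)=0.3$ and
\begin{align*}
D_{\mathrm{KL}}(P \mid Q)
&= 0.5\log\left(\frac{0.5}{0.2}\right)+0.3\log\left(\frac{0.3}{0.3}\right)+0.2\log\left(\frac{0.2}{0.5}\right)
= 0.3\log\left(\frac{5}{2}\right)\approx 0.2749,\\
\operatorname{TV}(P,Q)^2 &= 0.09 \leq \frac{1}{2}D_{\mathrm{KL}}(P \mid Q)\approx 0.1374.
\end{align*}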
[/step]