[proofplan]
Each property is proved by expanding the expectation as a sum over the range of the discrete random variable and manipulating the resulting series. Part (v) uses the algebraic completion-of-the-square identity to show the minimum of $\mathbb{E}[(X - c)^2]$ over $c \in \mathbb{R}$ is attained at $c = \mathbb{E}[X]$.
[/proofplan]
[step:Prove non-negativity: $X \ge 0$ implies $\mathbb{E}[X] \ge 0$]
If $X \ge 0$, then $X(\omega) \ge 0$ for all $\omega$, so every value $x$ in the range of $X$ satisfies $x \ge 0$. Since $\mathbb{P}(X = x) \ge 0$ for all $x$,
\begin{align*}
\mathbb{E}[X] = \sum_x x \, \mathbb{P}(X = x) \ge 0,
\end{align*}
where each term in the sum is the product of a non-negative number $x$ and a non-negative probability.
[/step]
[step:Prove that $X \ge 0$ and $\mathbb{E}[X] = 0$ imply $\mathbb{P}(X = 0) = 1$]
Suppose $X \ge 0$ and $\mathbb{E}[X] = 0$. Then $\sum_x x \, \mathbb{P}(X = x) = 0$, where every term $x \, \mathbb{P}(X = x) \ge 0$ (since $x \ge 0$ and $\mathbb{P}(X = x) \ge 0$). A sum of non-negative terms equals zero if and only if every term is zero. For $x > 0$, the term $x \, \mathbb{P}(X = x) = 0$ forces $\mathbb{P}(X = x) = 0$ (since $x \ne 0$). Therefore
\begin{align*}
\mathbb{P}(X \ne 0) = \sum_{x \ne 0} \mathbb{P}(X = x) = 0,
\end{align*}
which gives $\mathbb{P}(X = 0) = 1$.
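The hypothesis $X \ge 0$ cannot be dropped. As an illustration (the distribution below is an assumption chosen for this example, not part of the statement), let $X$ take the values $-1$ and $1$ with probability $\tfrac12$ each. Then
\begin{align*}
\mathbb{E}[X] = (-1) \cdot \tfrac12 + 1 \cdot \tfrac12 = 0, \qquad \text{yet} \qquad \mathbb{P}(X = 0) = 0,
\end{align*}
because the terms of the sum now cancel rather than vanish individually.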
[/step]
[step:Prove linearity: $\mathbb{E}[a + bX] = a + b\,\mathbb{E}[X]$]
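Let $Y = a + bX$. If $b = 0$, then $Y = a$ is constant and $\mathbb{E}[Y] = a = a + b\,\mathbb{E}[X]$, so assume $b \ne 0$. Then the map $x \mapsto a + bx$ is injective, so the values of $Y$ are $\{a + bx : x \in \operatorname{Range}(X)\}$ and $\mathbb{P}(Y = a + bx) = \mathbb{P}(X = x)$. Substituting $y = a + bx$ into the definition $\mathbb{E}[Y] = \sum_y y \, \mathbb{P}(Y = y)$: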
\begin{align*}
\mathbb{E}[a + bX] &= \sum_x (a + bx)\,\mathbb{P}(X = x) \\
&= a \sum_x \mathbb{P}(X = x) + b \sum_x x \, \mathbb{P}(X = x) \\
&= a \cdot 1 + b \, \mathbb{E}[X],
\end{align*}
where we used $\sum_x \mathbb{P}(X = x) = 1$ (the total probability over the range of $X$).
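As a quick sanity check, consider a worked example that is not part of the original argument: assume $X$ is the score of a fair six-sided die, so that $\mathbb{E}[X] = \tfrac{7}{2}$, and take the illustrative constants $a = 3$ and $b = 2$. Computing directly from the definition,
\begin{align*}
\mathbb{E}[3 + 2X] = \sum_{x=1}^{6} (3 + 2x) \cdot \tfrac{1}{6} = \tfrac{1}{6}\,(6 \cdot 3 + 2 \cdot 21) = 10,
\end{align*}
which agrees with $a + b\,\mathbb{E}[X] = 3 + 2 \cdot \tfrac{7}{2} = 10$.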
[/step]
[step:Prove additivity: $\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$]
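Every outcome with $X = x$ and $Y = y$ contributes the value $x + y$ to $X + Y$, so grouping the sum that defines $\mathbb{E}[X + Y]$ by the joint values $(x, y)$ gives the double sum below. When the ranges are infinite, this regrouping and the later splitting of the sum are justified provided the series defining $\mathbb{E}[X]$ and $\mathbb{E}[Y]$ converge absolutely.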
\begin{align*}
\mathbb{E}[X + Y] &= \sum_{x}\sum_{y} (x + y)\,\mathbb{P}(X = x, Y = y) \\
&= \sum_{x}\sum_{y} x\,\mathbb{P}(X = x, Y = y) + \sum_{x}\sum_{y} y\,\mathbb{P}(X = x, Y = y).
\end{align*}
In the first double sum, $x$ does not depend on $y$, so we may factor:
\begin{align*}
\sum_{x} x \sum_{y} \mathbb{P}(X = x, Y = y) = \sum_x x \, \mathbb{P}(X = x) = \mathbb{E}[X],
\end{align*}
where we used the marginal probability $\sum_y \mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x)$. By the same argument with the roles of $x$ and $y$ exchanged, the second double sum equals $\mathbb{E}[Y]$.
[guided]
This proof does not require independence: it uses only the marginalisation identity $\sum_y \mathbb{P}(X = x, Y = y) = \mathbb{P}(X = x)$, which holds for any joint distribution. The key step is to write the expectation as a double sum over the joint distribution:
\begin{align*}
\mathbb{E}[X + Y] = \sum_x \sum_y (x+y)\,\mathbb{P}(X = x, Y = y).
\end{align*}
We split $x + y$ into two terms and handle each separately. For the $x$-term: since $x$ is constant in the inner sum over $y$, we pull it out and sum $\mathbb{P}(X = x, Y = y)$ over $y$, which gives the marginal $\mathbb{P}(X = x)$ by the [law of total probability](/theorems/1113). The same reasoning applies symmetrically to the $y$-term.
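A small example, with a distribution assumed purely for illustration, shows the identity at work for strongly dependent variables. Let $X$ take the value $1$ with probability $p$ and $0$ with probability $1 - p$, and set $Y = 1 - X$, so that $Y$ is completely determined by $X$. The only joint values with positive probability are $(1, 0)$ and $(0, 1)$, and
\begin{align*}
\mathbb{E}[X + Y] &= (1 + 0)\,p + (0 + 1)\,(1 - p) = 1, \\
\mathbb{E}[X] + \mathbb{E}[Y] &= p + (1 - p) = 1.
\end{align*}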
[/guided]
[/step]
[step:Show $\mathbb{E}[X]$ minimises $\mathbb{E}[(X - c)^2]$ over $c \in \mathbb{R}$]
Let $\mu = \mathbb{E}[X]$. For any $c \in \mathbb{R}$, add and subtract $\mu$:
\begin{align*}
(X - c)^2 = ((X - \mu) + (\mu - c))^2 = (X - \mu)^2 + 2(X - \mu)(\mu - c) + (\mu - c)^2.
\end{align*}
Taking expectations, using additivity and linearity, and noting that $\mu - c$ is a constant:
\begin{align*}
\mathbb{E}[(X - c)^2] &= \mathbb{E}[(X - \mu)^2] + 2(\mu - c)\,\mathbb{E}[X - \mu] + (\mu - c)^2.
\end{align*}
By linearity, $\mathbb{E}[X - \mu] = \mathbb{E}[X] - \mu = 0$, so the middle term vanishes. This gives
\begin{align*}
\mathbb{E}[(X - c)^2] = \mathbb{E}[(X - \mu)^2] + (\mu - c)^2.
\end{align*}
Since $(\mu - c)^2 \ge 0$ with equality if and only if $c = \mu$, the minimum of $\mathbb{E}[(X - c)^2]$ over $c \in \mathbb{R}$ is $\mathbb{E}[(X - \mu)^2]$, attained uniquely at $c = \mu = \mathbb{E}[X]$.
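To see the decomposition concretely, here is a sketch with an assumed distribution (not taken from the statement): let $X$ take the values $0$ and $1$ with probability $\tfrac12$ each, so that $\mu = \tfrac12$ and $\mathbb{E}[(X - \mu)^2] = \tfrac14$. Then
\begin{align*}
\mathbb{E}[(X - c)^2] = \tfrac12 c^2 + \tfrac12 (1 - c)^2 = c^2 - c + \tfrac12 = \tfrac14 + \left(\tfrac12 - c\right)^2,
\end{align*}
which matches $\mathbb{E}[(X - \mu)^2] + (\mu - c)^2$ and is smallest exactly when $c = \tfrac12$.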
[guided]
Why does the "add and subtract $\mu$" trick work? The idea is to decompose the squared error $(X - c)^2$ into a term that depends on $c$ and a term that does not. Writing $X - c = (X - \mu) + (\mu - c)$ and expanding the square produces a cross-term $2(X - \mu)(\mu - c)$. The factor $(\mu - c)$ is a constant and can be pulled out of the expectation, leaving $\mathbb{E}[X - \mu]$. Since $\mu = \mathbb{E}[X]$, linearity gives $\mathbb{E}[X - \mu] = \mathbb{E}[X] - \mu = 0$; this is precisely the property that makes $\mu$ special. Once the cross-term vanishes, we are left with $\mathbb{E}[(X - \mu)^2] + (\mu - c)^2$, where the first term does not depend on $c$ and the second is non-negative and vanishes only when $c = \mu$.
[/guided]
[/step]