Continuous Random Variable

Also known as: continuous distribution, continuous-valued random variable, real-valued continuous random variable, continuous law, absolutely continuous random variable

A spinner can land anywhere on a circle, a measurement instrument can report a real number, and a waiting time can take values in an interval rather than in a countable list. In these situations the most tempting discrete question, "what is the probability of this exact value?", gives the least useful answer. For an idealized uniform measurement on $[0,1]$, every point has probability $0$, yet the interval $[0,1/2]$ has probability $1/2$. The theory of continuous random variables begins with this failure: probabilities are no longer recovered by adding masses at individual points, but by measuring regions and integrating a density over them. This page is a child of the general notion of a [random variable](/page/Random%20Variable). The parent concept says that a random variable is a measurable map out of a [probability space](/page/Probability%20Space). The continuous child concept adds a structural assumption on its distribution: the law is absolutely continuous with respect to [Lebesgue measure](/page/Lebesgue%20Measure), so probabilities can be computed by integration against a density. That extra assumption is powerful, but it is also restrictive; many random variables are neither discrete nor continuous in this sense. The first warning is worth making concrete. A continuous model can assign probability $0$ to every point without assigning probability $0$ to every set. We write $\mathbb 1_E$ for the indicator function of a set $E$, so $\mathbb 1_E(x)=1$ when $x\in E$ and $\mathbb 1_E(x)=0$ otherwise. We also write $\mathcal L^1$ for one-dimensional Lebesgue measure, which assigns intervals their usual lengths, and $\mathcal B(\mathbb R)$ for the Borel $\sigma$-algebra, the standard collection of measurable subsets of the real line. [example: A Point Has Probability Zero but an Interval Does Not] Let $X \sim \operatorname{Unif}(0,1)$, so its density is $f_X(x)=\mathbb 1_{(0,1)}(x)$. 
For any $a\in[0,1]$, the event $\{X=a\}$ is the same as $\{X\in\{a\}\}$, and the density formula gives \begin{align*} \mathbb P(X=a)=\int_{\{a\}}\mathbb 1_{(0,1)}(x)\,d\mathcal L^1(x). \end{align*} Since $0\le \mathbb 1_{(0,1)}(x)\le 1$ for all $x$ and $\mathcal L^1(\{a\})=0$, \begin{align*} 0\le \int_{\{a\}}\mathbb 1_{(0,1)}(x)\,d\mathcal L^1(x)\le \int_{\{a\}}1\,d\mathcal L^1(x)=\mathcal L^1(\{a\})=0. \end{align*} Therefore $\mathbb P(X=a)=0$. For the interval event, \begin{align*} \mathbb P(0\le X\le 1/2)=\int_{[0,1/2]}\mathbb 1_{(0,1)}(x)\,d\mathcal L^1(x). \end{align*} Changing the integrand at the single point $0$ does not change the integral, because $\mathcal L^1(\{0\})=0$, so \begin{align*} \int_{[0,1/2]}\mathbb 1_{(0,1)}(x)\,d\mathcal L^1(x)=\int_{(0,1/2]}1\,d\mathcal L^1(x). \end{align*} The last integral is the length of $(0,1/2]$: \begin{align*} \int_{(0,1/2]}1\,d\mathcal L^1(x)=\mathcal L^1((0,1/2])=\frac{1}{2}. \end{align*} Thus every individual point has probability $0$, while the interval $[0,1/2]$ has probability $1/2$. This is why a continuous random variable is described by probabilities of intervals, or more generally Borel sets, rather than by point probabilities. [/example] The lesson is not that individual outcomes are impossible in ordinary language. The mathematical point is sharper: the event $\{X=a\}$ is too small for the density model to detect. Continuous probability lives at the scale of sets with positive Lebesgue measure. ## Definition The parent definition of a random variable gives measurability, but it does not say what the distribution looks like. To isolate the class for which probabilities are integrals over subsets of Euclidean space, we ask whether the law of the random variable is controlled by Lebesgue measure. [definition: Continuous Random Variable] Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. 
A real-valued random variable $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$ is a continuous random variable if there exists a [measurable function](/page/Measurable%20Function) $f_X: \mathbb R \to [0,\infty)$ such that for every Borel set $A \in \mathcal B(\mathbb R)$, \begin{align*} \mathbb P(X \in A) = \int_A f_X(x)\,d\mathcal L^1(x). \end{align*} [/definition] This definition identifies the class. It says that the law $\mu_X=\mathbb P\circ X^{-1}$ is absolutely continuous with respect to $\mathcal L^1$, and the function $f_X$ is the object that carries the probability information. It is called a probability density function for $X$. ## Densities and Distribution Functions ### Densities as Probability per Unit Length The density deserves its own name because it is the computational replacement for a [probability mass function](/page/Probability%20Mass%20Function). A probability mass function assigns probability directly to a value; a density assigns probability per unit length, so only integrals over sets have probabilistic meaning. [definition: Probability Density Function] Let $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$ be a continuous random variable. A probability density function for $X$ is a measurable function $f_X: \mathbb R \to [0,\infty)$ satisfying \begin{align*} \mathbb P(X \in A) = \int_A f_X(x)\,d\mathcal L^1(x) \end{align*} for every Borel set $A \in \mathcal B(\mathbb R)$. [/definition] A density is determined only up to equality $\mathcal L^1$-a.e. Changing $f_X$ at a single point, or on any Lebesgue null set, does not change any probability. This raises a necessary validation question: when does a nonnegative function actually define a probability model? [quotetheorem:10097] The normalization condition is the test that separates a candidate density from a merely nonnegative function. 
Necessity comes from applying the density formula to the whole real line: the event $X\in\mathbb R$ has probability $1$, so the total integral of $f_X$ must be $1$. Sufficiency says that this same condition is enough to build a probability law by setting the probability of each Borel set to the integral over that set. If the total integral is less than $1$, some probability mass is missing; if it is greater than $1$, the proposed rule assigns too much total mass. Thus normalization is the first check before using a displayed formula as a density. A one-dimensional density assigns mass by integrating over intervals and Borel sets in $\mathbb R$. For several coordinates, point probabilities are still zero, but the relevant events are regions in $\mathbb R^n$ rather than intervals. The same normalization issue remains, now with volume measure in place of length. The obstruction is to distinguish an arbitrary nonnegative function of several variables from one that controls the whole joint law through volume integration. [definition: Continuous Random Vector] Let $(\Omega, \mathcal F, \mathbb P)$ be a probability space. A random vector $X: (\Omega, \mathcal F) \to (\mathbb R^n, \mathcal B(\mathbb R^n))$ is a continuous random vector if there exists a measurable function $f_X: \mathbb R^n \to [0,\infty)$ such that for every Borel set $A \in \mathcal B(\mathbb R^n)$, \begin{align*} \mathbb P(X \in A) = \int_A f_X(x)\,d\mathcal L^n(x). \end{align*} [/definition] This vector version is essential for multivariate statistics and stochastic processes. A pair of measurements, a location in the plane, or a collection of errors in a regression model is usually studied through a joint density. ### Cumulative Probability A density is local: it tells us how probability accumulates near a point. A distribution function is cumulative: it records how much probability lies to the left of a threshold. 
Both describe the same law when the random variable is continuous, and moving between them is one of the basic skills of the subject. The cumulative description is useful because it exists for every real-valued random variable, not only for continuous ones. For continuous variables, it becomes an integral of the density. [definition: Distribution Function] Let $X: (\Omega, \mathcal F) \to (\mathbb R, \mathcal B(\mathbb R))$ be a real-valued random variable. The distribution function of $X$ is the function $F_X: \mathbb R \to [0,1]$ defined by \begin{align*} F_X(x) = \mathbb P(X \le x). \end{align*} [/definition] The formula $F_X(x)=\mathbb P(X\le x)$ records accumulated probability, while a density records how that accumulation changes locally. The difficulty is that densities are only determined up to changes on Lebesgue null sets, whereas differentiating a CDF is a pointwise operation. We need a precise bridge between these two descriptions, including the extra regularity needed to recover a displayed density value from a derivative. [quotetheorem:10098] This theorem explains why introductory probability courses often introduce continuous random variables through derivatives of CDFs. The representative matters: changing a density at a single point changes its displayed value there but not the law. The measure-theoretic definition is more robust, while the calculus rule is the main computational interface when a density has a genuinely continuous version near the point being differentiated. ### Standard First Models The simplest nonconstant density is constant on an interval. It gives a model of pure ignorance over a bounded range: no subinterval of a given length is favoured over another. [example: Uniform Density on an Interval] Let $a,b \in \mathbb R$ with $a<b$, and let $X \sim \operatorname{Unif}(a,b)$, so \begin{align*} f_X(x)=\frac{1}{b-a}\mathbb 1_{(a,b)}(x). \end{align*} Fix $s,t$ with $a\le s<t\le b$. 
Since $(s,t)\subseteq(a,b)$, the indicator satisfies $\mathbb 1_{(a,b)}(x)=1$ for every $x\in(s,t)$, and the density formula gives \begin{align*} \mathbb P(s<X<t)=\int_{(s,t)} \frac{1}{b-a}\mathbb 1_{(a,b)}(x)\,d\mathcal L^1(x). \end{align*} Thus \begin{align*} \mathbb P(s<X<t)=\int_{(s,t)} \frac{1}{b-a}\,d\mathcal L^1(x). \end{align*} Pulling out the constant integrand, \begin{align*} \int_{(s,t)} \frac{1}{b-a}\,d\mathcal L^1(x)=\frac{1}{b-a}\mathcal L^1((s,t)). \end{align*} The Lebesgue length of $(s,t)$ is $t-s$, so \begin{align*} \mathbb P(s<X<t)=\frac{t-s}{b-a}. \end{align*} Therefore, inside the support interval, the probability of an interval depends only on its length $t-s$, not on its position. [/example] A different shape appears when the random variable measures waiting time. The exponential density is high near $0$ and then decays, reflecting that shorter waiting times are more likely than much longer ones. [example: Exponential Waiting Time] Let $\lambda>0$ and let $X \sim \operatorname{Exp}(\lambda)$, with density \begin{align*} f_X(x) = \lambda e^{-\lambda x}\mathbb 1_{[0,\infty)}(x). \end{align*} For $t\ge 0$, the distribution function is \begin{align*} F_X(t)=\mathbb P(X\le t)=\int_{(-\infty,t]} \lambda e^{-\lambda x}\mathbb 1_{[0,\infty)}(x)\,d\mathcal L^1(x). \end{align*} On $(-\infty,0)$ the indicator is $0$, and on $[0,t]$ it is $1$, so \begin{align*} F_X(t)=\int_0^t \lambda e^{-\lambda x}\,d\mathcal L^1(x). \end{align*} Since $\frac{d}{dx}(-e^{-\lambda x})=\lambda e^{-\lambda x}$, \begin{align*} \int_0^t \lambda e^{-\lambda x}\,d\mathcal L^1(x)=(-e^{-\lambda t})-(-e^0). \end{align*} Because $e^0=1$, this gives \begin{align*} F_X(t)=1-e^{-\lambda t}. \end{align*} Therefore \begin{align*} \mathbb P(X>t)=1-\mathbb P(X\le t). \end{align*} Substituting the value of $F_X(t)$, \begin{align*} \mathbb P(X>t)=1-(1-e^{-\lambda t})=e^{-\lambda t}. 
\end{align*} The tail probability is the multiplicative factor $e^{-\lambda t}$: since $e^{-\lambda(s+t)}=e^{-\lambda s}e^{-\lambda t}$, tails over successive time spans multiply, which is the calculation behind the memoryless property of exponential waiting times. [/example] The examples above motivate a diagnostic theorem: if a model has a genuine density, then no point and no Lebesgue null set can carry positive probability. This result is needed whenever we must distinguish continuous laws from laws with hidden point masses. [quotetheorem:10099] The converse is false: a random variable may assign probability $0$ to every point without having a density. Singular distributions, such as the Cantor distribution, show that atomlessness is weaker than continuity in the density sense. ## Computing Probabilities and Expectations ### Integrating Functions of the Variable Once a density is available, probabilities of events are computed by integrating the density over subsets, and expectations of functions of the random variable are computed by integrating those functions against the density. This is the continuous analogue of summing $g(k)p_X(k)$ over possible values in the discrete case. The expectation formula is the main computational reason densities matter. It replaces an integral over the sample space by an integral over the real line. [definition: Expectation of a Function of a Continuous Random Variable] Let $X$ be a continuous random variable with density $f_X$, and let $g: \mathbb R \to \mathbb R$ be Borel measurable. If \begin{align*} \int_{\mathbb R} |g(x)|f_X(x)\,d\mathcal L^1(x) < \infty, \end{align*} then the expectation of $g(X)$ is defined by \begin{align*} \mathbb E[g(X)] = \int_{\mathbb R} g(x)f_X(x)\,d\mathcal L^1(x). \end{align*} [/definition] For $g(x)=x$, this gives the mean. For $g(x)=x^2$, it gives the second moment. The integrability condition is not cosmetic: a density can be normalized while its mean or variance fails to exist.
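The expectation formula can be tried out numerically. The following Python sketch approximates $\int g(x)f_X(x)\,dx$ with a midpoint rule for an exponential density. The helper name `expectation`, the rate `lam = 2.0`, the truncation point, and the number of steps are illustrative assumptions, not part of the definition. The comparison values $1/\lambda$ for the mean and $2/\lambda^2$ for the second moment are the standard closed forms for $\operatorname{Exp}(\lambda)$.

```python
import math


def expectation(g, density, lower, upper, steps=200_000):
    """Midpoint-rule approximation of the integral of g(x) * density(x) on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        x = lower + (i + 0.5) * width  # midpoint of the i-th subinterval
        total += g(x) * density(x)
    return total * width


lam = 2.0  # assumed rate parameter, chosen only for illustration


def exp_density(x):
    """Density of Exp(lam): lam * exp(-lam * x) on [0, infinity), zero elsewhere."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0


# Truncate the half-line at a point where the remaining tail mass exp(-40) is negligible.
upper = 40.0 / lam

mass = expectation(lambda x: 1.0, exp_density, 0.0, upper)             # normalization check
mean = expectation(lambda x: x, exp_density, 0.0, upper)               # g(x) = x
second_moment = expectation(lambda x: x * x, exp_density, 0.0, upper)  # g(x) = x^2

print(mass, mean, second_moment)
```

The truncation is a numerical convenience only; for a heavy-tailed density the same approach can silently report a finite number for an integral that diverges, which is one reason the integrability condition has to be checked analytically.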
The remaining issue is compatibility with the original probability-space expectation: after replacing the random outcome $\omega$ by the value $x=X(\omega)$, the integral over $\mathbb R$ must give the same number as averaging $g(X)$ over $\Omega$. [quotetheorem:3536] This result lets us compute moments directly from the original density. It is especially useful when the transformation $g$ destroys one-to-one structure. ### Moments and Tails The Gaussian density is the central continuous model in probability and statistics. Its normalizing constant is chosen so that the total area under the curve is $1$, and its second moment gives the variance parameter. [example: Mean and Variance of a Standard Normal Variable] Let $Z \sim \mathcal N(0,1)$ with density \begin{align*} \phi(x)=\frac{1}{\sqrt{2\pi}}e^{-x^2/2}. \end{align*} We compute its mean and variance from the density. First, $x\phi(x)$ is odd because $\phi(-x)=\phi(x)$, so for every $R>0$, \begin{align*} \int_{-R}^R x\phi(x)\,d\mathcal L^1(x)=0. \end{align*} Also, \begin{align*} \int_{\mathbb R}|x|\phi(x)\,d\mathcal L^1(x)=2\int_0^\infty x\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,d\mathcal L^1(x). \end{align*} Since $\frac{d}{dx}(-e^{-x^2/2})=xe^{-x^2/2}$ on $[0,\infty)$, \begin{align*} 2\int_0^\infty x\frac{1}{\sqrt{2\pi}}e^{-x^2/2}\,d\mathcal L^1(x)=\frac{2}{\sqrt{2\pi}}. \end{align*} Thus $Z$ is integrable, and taking the limit of the symmetric integrals gives \begin{align*} \mathbb E[Z]=\int_{\mathbb R}x\phi(x)\,d\mathcal L^1(x)=0. \end{align*} For the second moment, differentiate the density: \begin{align*} \phi'(x)=-x\phi(x). \end{align*} Hence $x^2\phi(x)=-x\phi'(x)$. For $R>0$, integrating this identity over $[-R,R]$ gives \begin{align*} \int_{-R}^R x^2\phi(x)\,d\mathcal L^1(x)=\int_{-R}^R -x\phi'(x)\,d\mathcal L^1(x). \end{align*} [Integration by parts](/theorems/2098), with $u=-x$ and $dv=\phi'(x)\,dx$, then gives \begin{align*} \int_{-R}^R -x\phi'(x)\,d\mathcal L^1(x)=-R\phi(R)-R\phi(-R)+\int_{-R}^R\phi(x)\,d\mathcal L^1(x).
\end{align*} Because $\phi(-R)=\phi(R)$, this is \begin{align*} \int_{-R}^R x^2\phi(x)\,d\mathcal L^1(x)=-2R\phi(R)+\int_{-R}^R\phi(x)\,d\mathcal L^1(x). \end{align*} Now \begin{align*} 2R\phi(R)=\frac{2R}{\sqrt{2\pi}}e^{-R^2/2}\to 0 \end{align*} as $R\to\infty$, while $\int_{\mathbb R}\phi(x)\,d\mathcal L^1(x)=1$ because $\phi$ is the standard normal density. Therefore \begin{align*} \mathbb E[Z^2]=\int_{\mathbb R}x^2\phi(x)\,d\mathcal L^1(x)=1. \end{align*} Finally, \begin{align*} \operatorname{Var}(Z)=\mathbb E[Z^2]-(\mathbb E[Z])^2=1-0^2=1. \end{align*} The standard normal is therefore centered at $0$, and its variance parameter is exactly $1$. [/example] Tail probabilities often matter more than exact densities. In reliability, finance, and limit theorems, one asks how likely it is that $X$ exceeds a threshold. For nonnegative variables, the tail itself can recover the mean. [quotetheorem:4993] For $X\sim \operatorname{Exp}(\lambda)$, the tail formula gives \begin{align*} \mathbb E[X] = \int_0^\infty e^{-\lambda t}\,d\mathcal L^1(t)=\frac{1}{\lambda}. \end{align*} This calculation is often simpler than integrating $x\lambda e^{-\lambda x}$ directly. Not every normalized density has finite expectation. Heavy tails show why integrability assumptions must be stated. [example: A Continuous Random Variable with Infinite Mean] Define \begin{align*} f(x)=x^{-2}\mathbb 1_{[1,\infty)}(x). \end{align*} The function $f$ is nonnegative and measurable. Its total integral is computed as an improper integral: \begin{align*} \int_{\mathbb R} f(x)\,d\mathcal L^1(x)=\int_{[1,\infty)}x^{-2}\,d\mathcal L^1(x). \end{align*} For $R>1$, \begin{align*} \int_1^R x^{-2}\,d\mathcal L^1(x)=\left[-x^{-1}\right]_1^R. \end{align*} Evaluating the endpoints gives \begin{align*} \left[-x^{-1}\right]_1^R=-R^{-1}-(-1)=1-\frac{1}{R}. \end{align*} Taking $R\to\infty$, \begin{align*} \int_{\mathbb R} f(x)\,d\mathcal L^1(x)=\lim_{R\to\infty}\left(1-\frac{1}{R}\right)=1. 
\end{align*} Therefore $f$ is a probability density. Let $X$ be a continuous random variable with density $f$. Since $X$ is supported on $[1,\infty)$, its expectation is the extended nonnegative integral \begin{align*} \mathbb E[X]=\int_{\mathbb R}x f(x)\,d\mathcal L^1(x)=\int_1^\infty x\cdot x^{-2}\,d\mathcal L^1(x). \end{align*} For $x\ge 1$, $x\cdot x^{-2}=x^{-1}$, so \begin{align*} \mathbb E[X]=\int_1^\infty \frac{1}{x}\,d\mathcal L^1(x). \end{align*} Again using improper integrals, for $R>1$, \begin{align*} \int_1^R \frac{1}{x}\,d\mathcal L^1(x)=\log R-\log 1. \end{align*} Since $\log 1=0$, \begin{align*} \int_1^R \frac{1}{x}\,d\mathcal L^1(x)=\log R. \end{align*} As $R\to\infty$, $\log R\to\infty$, and hence \begin{align*} \mathbb E[X]=\infty. \end{align*} The density is normalized, so the continuous random variable is well-defined, but its tail is heavy enough that the mean is not finite. [/example] ## Transformations and Change of Variables ### Smooth Transformations Real data are rarely used in their original units forever. We take logarithms, square errors, standardize observations, and transform uniform random numbers into variables with desired distributions. A transformation of a continuous random variable is still a random variable, but it may or may not remain continuous. The first reliable rule is the monotone change-of-variables formula. It tells us how densities stretch when the horizontal scale is changed. [quotetheorem:1138] The derivative factor is the price of changing scale. If $h$ expands lengths near a point, the density of $Y$ decreases there; if $h$ compresses lengths, the density increases. The squaring map illustrates a common complication. It is not one-to-one on $\mathbb R$, so the density receives contributions from both preimages of a positive value. [example: Squaring a Standard Normal Variable] Let $Z\sim \mathcal N(0,1)$ with density \begin{align*} \phi(z)=\frac{1}{\sqrt{2\pi}}e^{-z^2/2}, \end{align*} and define $Y=Z^2$. 
On $(0,\infty)$, the map $h_+(z)=z^2$ is a continuously differentiable bijection from $(0,\infty)$ to $(0,\infty)$ with inverse $h_+^{-1}(y)=\sqrt y$ and derivative $h_+'(z)=2z$. On $(-\infty,0)$, the same formula $h_-(z)=z^2$ is a continuously differentiable bijection from $(-\infty,0)$ to $(0,\infty)$ with inverse $h_-^{-1}(y)=-\sqrt y$ and derivative $h_-'(z)=2z$. Therefore, by *Monotone Change of Variables for Densities*, for $y>0$ the density of $Y$ is the sum of the two branch contributions: \begin{align*} f_Y(y)=\frac{\phi(\sqrt y)}{|2\sqrt y|}+\frac{\phi(-\sqrt y)}{|2(-\sqrt y)|}. \end{align*} Since $y>0$, $\sqrt y>0$, so $|2\sqrt y|=2\sqrt y$ and $|2(-\sqrt y)|=2\sqrt y$. Hence \begin{align*} f_Y(y)=\phi(\sqrt y)\frac{1}{2\sqrt y}+\phi(-\sqrt y)\frac{1}{2\sqrt y}. \end{align*} Now \begin{align*} \phi(\sqrt y)=\frac{1}{\sqrt{2\pi}}e^{-(\sqrt y)^2/2}=\frac{1}{\sqrt{2\pi}}e^{-y/2}. \end{align*} Also, \begin{align*} \phi(-\sqrt y)=\frac{1}{\sqrt{2\pi}}e^{-(-\sqrt y)^2/2}=\frac{1}{\sqrt{2\pi}}e^{-y/2}. \end{align*} Substituting these two values, \begin{align*} f_Y(y)=\frac{1}{\sqrt{2\pi}}e^{-y/2}\frac{1}{2\sqrt y}+\frac{1}{\sqrt{2\pi}}e^{-y/2}\frac{1}{2\sqrt y}. \end{align*} The two summands are equal, so \begin{align*} f_Y(y)=\frac{1}{\sqrt{2\pi}\sqrt y}e^{-y/2}. \end{align*} Because $\sqrt{2\pi}\sqrt y=\sqrt{2\pi y}$ for $y>0$, \begin{align*} f_Y(y)=\frac{1}{\sqrt{2\pi y}}e^{-y/2}. \end{align*} For $y<0$, the event $\{Y\le y\}$ is empty because $Y=Z^2\ge 0$, so the density is $0$ on $(-\infty,0)$. At the single point $y=0$, we may set $f_Y(0)=0$ because changing a density on a Lebesgue null set does not change the represented law. Thus \begin{align*} f_Y(y)=\frac{1}{\sqrt{2\pi y}}e^{-y/2}\mathbb 1_{(0,\infty)}(y). \end{align*} This is the $\chi^2_1$ density, so $Z^2\sim\chi^2_1$. [/example] A transformation can also destroy continuity. 
Collapsing an interval to a point creates an atom, and atoms cannot be represented by a density with respect to Lebesgue measure. [example: A Transformation That Creates an Atom] Let $X\sim \operatorname{Unif}(-1,1)$, so its density is \begin{align*} f_X(x)=\frac{1}{2}\mathbb 1_{(-1,1)}(x). \end{align*} Define \begin{align*} Y=\max\{X,0\}. \end{align*} For every real number $x$, $\max\{x,0\}=0$ exactly when $x\le 0$, so the events $\{Y=0\}$ and $\{X\le 0\}$ are equal. Therefore \begin{align*} \mathbb P(Y=0)=\mathbb P(X\le 0). \end{align*} Using the density of $X$, \begin{align*} \mathbb P(X\le 0)=\int_{(-\infty,0]}\frac{1}{2}\mathbb 1_{(-1,1)}(x)\,d\mathcal L^1(x). \end{align*} On $(-\infty,0]$ the indicator $\mathbb 1_{(-1,1)}$ equals $1$ exactly on $(-1,0]$ and $0$ elsewhere. Hence \begin{align*} \int_{(-\infty,0]}\frac{1}{2}\mathbb 1_{(-1,1)}(x)\,d\mathcal L^1(x)=\int_{(-1,0]}\frac{1}{2}\,d\mathcal L^1(x). \end{align*} Pulling out the constant gives \begin{align*} \int_{(-1,0]}\frac{1}{2}\,d\mathcal L^1(x)=\frac{1}{2}\mathcal L^1((-1,0]). \end{align*} Since $\mathcal L^1((-1,0])=1$, \begin{align*} \mathbb P(Y=0)=\frac{1}{2}. \end{align*} Thus $Y$ has an atom at $0$. By *Continuous Random Variables Have No Atoms*, a random variable with a positive point probability cannot be continuous in the density sense. On the other hand, for every Borel set $A\subseteq(0,1)$, the event $\{Y\in A\}$ equals $\{X\in A\}$, so the remaining mass $\frac12$ is spread over $(0,1)$ with density $\frac{1}{2}\mathbb 1_{(0,1)}$, which integrates to $\frac12$ rather than $1$. [/example] ### Quantiles and Simulation The previous examples show that transformations must be handled with care. A more constructive question now appears: can we start from a uniform random variable and build a variable with any desired distribution function? The answer is the probability integral transform.
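As a small illustration of this idea, the following Python sketch turns uniform draws into exponential ones by inverting the distribution function $F(t)=1-e^{-\lambda t}$ computed in the exponential waiting time example, which gives $F^{-1}(u)=-\log(1-u)/\lambda$. The function name `sample_exponential`, the rate, the seed, and the sample size are assumptions made only for this sketch, and the comparison with the exact values is a sanity check, not a proof.

```python
import math
import random


def sample_exponential(rate, rng):
    """Return -log(1 - u) / rate for u ~ Unif[0, 1), i.e. the inverse of F(t) = 1 - exp(-rate * t)."""
    u = rng.random()  # u lies in [0, 1), so 1 - u lies in (0, 1] and the logarithm is defined
    return -math.log(1.0 - u) / rate


rng = random.Random(0)  # fixed seed so the run is reproducible
rate = 2.0              # assumed rate parameter
n = 100_000             # assumed sample size

samples = [sample_exponential(rate, rng) for _ in range(n)]

# Compare simulated quantities with the exact values for Exp(rate).
sample_mean = sum(samples) / n
t = 0.5
empirical_cdf_at_t = sum(1 for x in samples if x <= t) / n
exact_cdf_at_t = 1.0 - math.exp(-rate * t)

print(sample_mean, 1.0 / rate)
print(empirical_cdf_at_t, exact_cdf_at_t)
```

The same pattern applies to any distribution function with a computable inverse; only the formula inside `sample_exponential` would change.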
[quotetheorem:1139] This theorem is the mathematical basis of [inverse transform sampling](/theorems/1139). It also explains why quantiles are natural coordinates for continuous distributions: they label outcomes by accumulated probability. The first direction is intentionally stated with a genuine inverse scale on the part of the line where $X$ lives. This avoids relying on ordinary inverse intuition in cases with flat regions or singular continuous behaviour. The converse uses the generalized inverse, so flat parts and jumps of $F$ are handled by the infimum formula rather than by an ordinary inverse function. ## Joint Continuous Distributions ### Joint and Marginal Densities Many probabilistic questions involve several random quantities at once. The marginal behaviour of each variable is not enough; dependence lives in the joint law. A joint density describes how probability is spread across regions in the plane or higher-dimensional Euclidean space. The joint density is the multivariate version of a density. It assigns probability to Borel sets in $\mathbb R^n$ by integrating over those sets. [definition: Joint Density] Let $X=(X_1,\ldots,X_n)$ be a random vector from $(\Omega,\mathcal F,\mathbb P)$ to $(\mathbb R^n,\mathcal B(\mathbb R^n))$. A joint density for $X$ is a measurable function $f_X:\mathbb R^n\to[0,\infty)$ such that for every Borel set $A\in\mathcal B(\mathbb R^n)$, \begin{align*} \mathbb P(X\in A) = \int_A f_X(x)\,d\mathcal L^n(x). \end{align*} [/definition] In two dimensions, the probability that $(X,Y)$ lies in a region is an area integral. This motivates the marginal density: when only one coordinate is observed, we need a principled way to remove the unobserved coordinate from the joint density. [definition: Marginal Density] Let $(X,Y)$ be a continuous random vector taking values in $\mathbb R^m\times\mathbb R^n$ with joint density $f_{X,Y}$. 
A marginal density of $X$ is any measurable function $f_X:\mathbb R^m\to[0,\infty)$ such that \begin{align*} f_X(x) = \int_{\mathbb R^n} f_{X,Y}(x,y)\,d\mathcal L^n(y) \end{align*} for $\mathcal L^m$-a.e. $x\in\mathbb R^m$. On the remaining null set, $f_X$ may be assigned arbitrary values in $[0,\infty)$. [/definition] The word marginal comes from tables, where row and column totals were written in the margins. In the continuous setting, margins are integrals rather than sums. But the displayed formula only integrates the joint density in the unobserved coordinate; it still has to be checked that this produces the correct probabilities for events involving $X$ alone. The key question is whether integrating out $Y$ is merely a formal operation or whether it really recovers the law of $X$. The following result supplies that bridge: it turns the [marginal density formula](/theorems/10100) into the probability rule needed for Borel events involving $X$ alone. [quotetheorem:10100] Marginal densities discard dependence information. Two variables can have the same marginal densities but different joint behaviour. Independence is the special case where no dependence information remains beyond the marginals. ### Independence and Conditioning Independence is first a statement about events, not about formulas. For continuous variables, the event-based definition is still the right definition; the density factorization theorem comes afterward as a testable criterion. [definition: Independence of Continuous Random Variables] Let $(\Omega,\mathcal F,\mathbb P)$ be a probability space, and let $X:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ and $Y:(\Omega,\mathcal F)\to(\mathbb R,\mathcal B(\mathbb R))$ be continuous random variables with densities $f_X$ and $f_Y$. They are independent if for all Borel sets $A,B\in\mathcal B(\mathbb R)$, \begin{align*} \mathbb P(X\in A,\,Y\in B) = \mathbb P(X\in A)\mathbb P(Y\in B). 
\end{align*} [/definition] The definition quantifies over all Borel sets, which is conceptually clean but not convenient for computation. In practice we usually know a joint density, not every rectangle probability. The useful question is therefore whether independence can be recognized directly from the density by seeing that the joint mass assigned to rectangles separates into one-dimensional contributions. [quotetheorem:10101] The factorization condition is stronger than having a rectangular-looking support. Dependence can hide in the shape of the density even when both marginals look familiar. [example: Dependent Variables on a Triangle] Let $(X,Y)$ have joint density \begin{align*} f_{X,Y}(x,y)=2\mathbb 1_{\{0<y<x<1\}}. \end{align*} We compute the two marginal densities and then show that the joint density cannot factor into their product. Fix $x\in\mathbb R$. By the marginal-density formula, \begin{align*} f_X(x)=\int_{\mathbb R}2\mathbb 1_{\{0<y<x<1\}}\,d\mathcal L^1(y). \end{align*} If $0<x<1$, then the condition $0<y<x<1$ is equivalent, as a condition on $y$, to $0<y<x$. Hence \begin{align*} f_X(x)=\int_{(0,x)}2\,d\mathcal L^1(y). \end{align*} Pulling out the constant integrand gives \begin{align*} \int_{(0,x)}2\,d\mathcal L^1(y)=2\mathcal L^1((0,x)). \end{align*} Since $\mathcal L^1((0,x))=x$, we get \begin{align*} f_X(x)=2x. \end{align*} If $x\le 0$, there is no $y$ satisfying $0<y<x$, so $f_X(x)=0$. If $x\ge 1$, the condition $0<y<x<1$ is false because $x<1$ fails, so $f_X(x)=0$. Thus \begin{align*} f_X(x)=2x\mathbb 1_{(0,1)}(x). \end{align*} Similarly, fix $y\in\mathbb R$. The marginal density of $Y$ is \begin{align*} f_Y(y)=\int_{\mathbb R}2\mathbb 1_{\{0<y<x<1\}}\,d\mathcal L^1(x). \end{align*} If $0<y<1$, then the condition $0<y<x<1$ is equivalent, as a condition on $x$, to $y<x<1$. Therefore \begin{align*} f_Y(y)=\int_{(y,1)}2\,d\mathcal L^1(x). 
\end{align*} Pulling out the constant gives \begin{align*} \int_{(y,1)}2\,d\mathcal L^1(x)=2\mathcal L^1((y,1)). \end{align*} Since $\mathcal L^1((y,1))=1-y$, we get \begin{align*} f_Y(y)=2(1-y). \end{align*} If $y\le 0$ or $y\ge 1$, the same support condition gives no contribution except on null endpoint changes, so \begin{align*} f_Y(y)=2(1-y)\mathbb 1_{(0,1)}(y). \end{align*} Now take the rectangle $A=(3/4,1)$ and $B=(0,1/4)$. For every $(x,y)\in A\times B$, we have $0<y<x<1$, so \begin{align*} \mathbb P(X\in A,\,Y\in B)=\int_{A\times B}2\,d\mathcal L^2. \end{align*} Since $\mathcal L^2(A\times B)=\mathcal L^1(A)\mathcal L^1(B)=\frac14\cdot\frac14=\frac{1}{16}$, \begin{align*} \mathbb P(X\in A,\,Y\in B)=2\cdot\frac{1}{16}=\frac18. \end{align*} Also, \begin{align*} \mathbb P(X\in A)=\int_{3/4}^1 2x\,d\mathcal L^1(x). \end{align*} Since an antiderivative of $2x$ is $x^2$, \begin{align*} \mathbb P(X\in A)=1^2-\left(\frac34\right)^2=1-\frac{9}{16}=\frac{7}{16}. \end{align*} And \begin{align*} \mathbb P(Y\in B)=\int_0^{1/4}2(1-y)\,d\mathcal L^1(y). \end{align*} Since an antiderivative of $2(1-y)$ is $2y-y^2$, \begin{align*} \mathbb P(Y\in B)=\left(2\cdot\frac14-\left(\frac14\right)^2\right)-(0-0)=\frac12-\frac{1}{16}=\frac{7}{16}. \end{align*} Thus \begin{align*} \mathbb P(X\in A)\mathbb P(Y\in B)=\frac{7}{16}\cdot\frac{7}{16}=\frac{49}{256}. \end{align*} Since $\frac18=\frac{32}{256}$ and $\frac{32}{256}\ne\frac{49}{256}$, the independence identity fails for these Borel sets. Hence $X$ and $Y$ are dependent. The geometry is visible in the support: the joint density lives only on the triangle $0<y<x<1$, while the marginal product is positive on the whole square $(0,1)\times(0,1)$. [/example] Conditioning on a continuous observation introduces a new difficulty: the event $\{Y=y\}$ usually has probability $0$. 
To discuss the distribution of $X$ after observing $Y=y$, we need a density-level replacement for elementary [conditional probability](/page/Conditional%20Probability). [definition: Conditional Density] Let $(X,Y)$ be a continuous random vector in $\mathbb R^2$ with chosen joint density $f_{X,Y}$ and chosen marginal density $f_Y$. For $\mathcal L^1$-a.e. $y\in\mathbb R$ with $f_Y(y)>0$, the conditional density of $X$ given $Y=y$ is the function \begin{align*} f_{X\mid Y}(\cdot\mid y):\mathbb R \to [0,\infty) \end{align*} defined by \begin{align*} f_{X\mid Y}(x\mid y)=\frac{f_{X,Y}(x,y)}{f_Y(y)}. \end{align*} [/definition] The notation should not be read as conditioning on a positive-probability event. It is a density-level object that behaves like a probability density in the $x$ variable for fixed $y$. Since densities are defined only up to almost-everywhere equality, conditional densities are version-dependent on null sets in the conditioning variable. ### Sums and Convolution Sums of independent continuous variables lead to convolution. The density at a value $z$ is obtained by considering all decompositions $z=x+y$ and integrating over them. [quotetheorem:10102] Convolution is the density version of adding independent uncertainties. It explains why repeated addition smooths distributions, a phenomenon that later culminates in the [central limit theorem](/theorems/521). [example: Sum of Two Independent Uniform Variables] Let $X,Y\overset{\text{i.i.d.}}{\sim}\operatorname{Unif}(0,1)$ and define $S=X+Y$. Since both variables have density $\mathbb 1_{(0,1)}$, the *Convolution Formula for Sums* gives \begin{align*} f_S(s)=\int_{\mathbb R}\mathbb 1_{(0,1)}(x)\mathbb 1_{(0,1)}(s-x)\,d\mathcal L^1(x). \end{align*} The product of indicators equals $1$ exactly when both inequalities \begin{align*} 0<x<1 \end{align*} and \begin{align*} 0<s-x<1 \end{align*} hold. The [second inequality](/theorems/2136) is equivalent to \begin{align*} s-1<x<s. 
\end{align*} Therefore the integrand is $1$ precisely for \begin{align*} x\in (0,1)\cap(s-1,s), \end{align*} and is $0$ otherwise. Hence \begin{align*} f_S(s)=\mathcal L^1\bigl((0,1)\cap(s-1,s)\bigr). \end{align*} If $0<s<1$, then $s-1<0$ and $s<1$, so \begin{align*} (0,1)\cap(s-1,s)=(0,s). \end{align*} Thus \begin{align*} f_S(s)=\mathcal L^1((0,s))=s. \end{align*} If $1\le s<2$, then $0\le s-1<1$ and $s\ge 1$, so \begin{align*} (0,1)\cap(s-1,s)=(s-1,1). \end{align*} The length of this interval is \begin{align*} \mathcal L^1((s-1,1))=1-(s-1)=2-s, \end{align*} so \begin{align*} f_S(s)=2-s. \end{align*} If $s\le 0$, then $(s-1,s)$ lies to the left of or ends at $0$, so its intersection with $(0,1)$ is empty. If $s\ge 2$, then $s-1\ge 1$, so $(s-1,s)$ has no overlap with $(0,1)$. In both cases, \begin{align*} f_S(s)=0. \end{align*} Therefore \begin{align*} f_S(s)=s\mathbb 1_{(0,1)}(s)+(2-s)\mathbb 1_{[1,2)}(s). \end{align*} The density is triangular because, for each value of $s$, its height is exactly the length of the set of decompositions $s=x+y$ with $x,y\in(0,1)$. [/example]

## Approximation, Simulation, and Statistical Models

### From Data to Densities

Continuous distributions are idealizations. Real instruments round measurements, computers generate finite strings of bits, and datasets contain finitely many observations. The usefulness of continuous random variables comes from the way they organize approximation: histograms approximate densities, empirical distribution functions approximate CDFs, and parametric families approximate mechanisms. A histogram is not a density by itself until it is scaled by bin width. This scaling is the difference between counting observations and estimating probability per unit length. [definition: Histogram Density Estimate] Let $x_1,\ldots,x_n\in\mathbb R$ be observed data. 
Let $I_1,\ldots,I_m\subset\mathbb R$ be pairwise disjoint intervals such that $0<\mathcal L^1(I_j)<\infty$ for each $j\in\{1,\ldots,m\}$ and each observation $x_i$ lies in exactly one of the intervals. The histogram density estimate associated to these bins is the function $\hat f:\mathbb R\to[0,\infty)$ defined by \begin{align*} \hat f(x)=\sum_{j=1}^m \frac{1}{n\mathcal L^1(I_j)}\sum_{i=1}^n \mathbb 1_{\{x_i\in I_j\}}\mathbb 1_{I_j}(x),\qquad x\in\mathbb R. \end{align*} [/definition] The coverage assumption is part of the normalization. Since every observation belongs to exactly one bin, integrating $\hat f$ over $\mathbb R$ gives $1$; without that assumption the same formula would describe only the mass captured by the chosen bins. The estimate is piecewise constant because each bin treats all points inside it as indistinguishable. Narrow bins reveal local shape but increase variability; wide bins reduce variability but blur features. This motivates a second construction for finite data: estimate the probability of $(-\infty,x]$ directly at every threshold $x$, avoiding bin choices altogether. [definition: Empirical Distribution Function] Let $x_1,\ldots,x_n\in\mathbb R$ be observed data. The empirical distribution function is the function $F_n:\mathbb R\to[0,1]$ defined by \begin{align*} F_n(x)=\frac{1}{n}\sum_{i=1}^n \mathbb 1_{(-\infty,x]}(x_i),\qquad x\in\mathbb R. \end{align*} [/definition] Even when the underlying model is continuous, $F_n$ has jumps because the dataset is finite. This is not a contradiction; empirical objects are discrete approximations to continuous laws.

### Likelihood and Rounding

Parametric statistics uses densities to compare models. Once a family of densities is chosen, the observed data are treated as fixed and the parameter is varied. [definition: Likelihood for a Continuous Model] Let $\{f_\theta:\theta\in\Theta\}$ be a family of probability densities on $\mathbb R$, with each density a map $f_\theta:\mathbb R\to[0,\infty)$. 
Suppose the observed data $x_1,\ldots,x_n\in\mathbb R$ are modeled as independent random variables with common density $f_\theta$ for some parameter $\theta\in\Theta$. The likelihood function is the map $L(\cdot;x_1,\ldots,x_n):\Theta\to [0,\infty)$ defined by \begin{align*} L(\theta;x_1,\ldots,x_n) = \prod_{i=1}^n f_\theta(x_i). \end{align*} [/definition] The likelihood is not the probability of observing the exact data values, since that probability is $0$ under a continuous model. It is a density evaluated at the observed configuration, and likelihood ratios compare how strongly different parameter values support the observed data. [example: Likelihood for a Normal Location Model] Suppose $X_1,\ldots,X_n$ are modeled as i.i.d. random variables with $X_i\sim\mathcal N(\mu,\sigma^2)$, where $\sigma>0$ is known and $\mu\in\mathbb R$ is unknown. For observations $x_1,\ldots,x_n$, the normal density at $x_i$ is \begin{align*} f_\mu(x_i)=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right). \end{align*} Independence makes the joint density the product of the one-dimensional densities, so the likelihood is \begin{align*} L(\mu;x_1,\ldots,x_n)=\prod_{i=1}^n \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right). \end{align*} Pulling the constant factor out of the product gives \begin{align*} L(\mu;x_1,\ldots,x_n)=\left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n\prod_{i=1}^n \exp\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right). \end{align*} Using $\prod_{i=1}^n e^{a_i}=e^{\sum_{i=1}^n a_i}$, \begin{align*} L(\mu;x_1,\ldots,x_n)=\left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2\right). \end{align*} The prefactor $\left(\frac{1}{\sqrt{2\pi}\sigma}\right)^n$ is positive and does not depend on $\mu$, and the exponential function is strictly increasing. Therefore maximizing $L(\mu;x_1,\ldots,x_n)$ over $\mu$ is equivalent to maximizing \begin{align*} -\frac{1}{2\sigma^2}\sum_{i=1}^n(x_i-\mu)^2. 
\end{align*} Since $-\frac{1}{2\sigma^2}<0$, this is equivalent to minimizing \begin{align*} Q(\mu)=\sum_{i=1}^n(x_i-\mu)^2. \end{align*} Let \begin{align*} \bar x=\frac{1}{n}\sum_{i=1}^n x_i. \end{align*} For each $i$, \begin{align*} x_i-\mu=(x_i-\bar x)+(\bar x-\mu). \end{align*} Squaring and summing, \begin{align*} Q(\mu)=\sum_{i=1}^n\left((x_i-\bar x)+(\bar x-\mu)\right)^2. \end{align*} Expanding the square, \begin{align*} Q(\mu)=\sum_{i=1}^n(x_i-\bar x)^2+2(\bar x-\mu)\sum_{i=1}^n(x_i-\bar x)+\sum_{i=1}^n(\bar x-\mu)^2. \end{align*} Now \begin{align*} \sum_{i=1}^n(x_i-\bar x)=\sum_{i=1}^n x_i-n\bar x=n\bar x-n\bar x=0. \end{align*} Also, \begin{align*} \sum_{i=1}^n(\bar x-\mu)^2=n(\bar x-\mu)^2. \end{align*} Hence \begin{align*} Q(\mu)=\sum_{i=1}^n(x_i-\bar x)^2+n(\bar x-\mu)^2. \end{align*} The first term does not depend on $\mu$, and the second term is nonnegative with equality exactly when $\mu=\bar x$. Thus $Q(\mu)$ is minimized at \begin{align*} \hat\mu=\bar x=\frac{1}{n}\sum_{i=1}^n x_i. \end{align*} Therefore the maximum likelihood estimate of the unknown normal mean is the sample mean. [/example] Rounding reveals a final modelling distinction. A recorded measurement may be discrete even if the underlying physical quantity is treated as continuous. [example: Rounding a Continuous Measurement] Let $X\sim\operatorname{Unif}(0,1)$ and define \begin{align*} Y=\frac{\lfloor 10X\rfloor}{10}. \end{align*} Since $0<X<1$ with probability $1$, we have $0<10X<10$ with probability $1$. Hence $\lfloor 10X\rfloor$ takes values in $\{0,1,\ldots,9\}$ with probability $1$, and therefore $Y$ takes values in $\{0,1/10,\ldots,9/10\}$ with probability $1$. Fix $k\in\{0,\ldots,9\}$. By the defining property of the floor function, \begin{align*} \lfloor 10X\rfloor=k \quad \text{if and only if} \quad k\le 10X<k+1. \end{align*} Dividing the two inequalities by $10>0$ gives \begin{align*} \lfloor 10X\rfloor=k \quad \text{if and only if} \quad \frac{k}{10}\le X<\frac{k+1}{10}. 
\end{align*} Since $Y=k/10$ is equivalent to $\lfloor 10X\rfloor=k$, we get \begin{align*} \mathbb P(Y=k/10)=\mathbb P\left(\frac{k}{10}\le X<\frac{k+1}{10}\right). \end{align*} The density of $X$ is $\mathbb 1_{(0,1)}$, so \begin{align*} \mathbb P\left(\frac{k}{10}\le X<\frac{k+1}{10}\right)=\int_{[k/10,(k+1)/10)}\mathbb 1_{(0,1)}(x)\,d\mathcal L^1(x). \end{align*} For $k\in\{0,\ldots,9\}$, the interval $[k/10,(k+1)/10)$ differs from its intersection with $(0,1)$ only possibly at the point $0$, and that singleton has Lebesgue measure $0$. Thus the integral is the length of the interval: \begin{align*} \int_{[k/10,(k+1)/10)}\mathbb 1_{(0,1)}(x)\,d\mathcal L^1(x)=\mathcal L^1([k/10,(k+1)/10))=\frac{k+1}{10}-\frac{k}{10}=\frac{1}{10}. \end{align*} Therefore each displayed value of $Y$ has probability $1/10$. The rounded observation is discrete, even though the unrounded variable $X$ is continuous. [/example]

## Beyond and Connected Topics

Continuous random variables sit at the meeting point of probability, measure theory, calculus, and statistics. The parent page [Random Variable](/page/Random%20Variable) gives the general measurable-map viewpoint; the present page studies the absolutely continuous child case where laws have densities. The natural measure-theoretic continuation is [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure). There, densities become Radon--Nikodym derivatives, and the condition defining a continuous random variable is expressed as absolute [continuity of measures](/theorems/1082). This perspective also clarifies why singular distributions are atomless but not continuous in the density sense. For probabilistic limit theory, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability) develops [convergence in distribution](/page/Convergence%20In%20Distribution), characteristic functions, martingales, and [weak convergence](/page/Weak%20Convergence). 
Continuous random variables provide many of the standard examples, but advanced probability studies laws in a broader space where densities may disappear under limits. For computations and modelling, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability) supplies the first systematic treatment of standard continuous distributions, expectation, independence, and transformations. [Cambridge IB Statistics](/page/Cambridge%20IB%20Statistics) then uses continuous densities for likelihood, estimation, confidence intervals, and hypothesis testing. Several neighbouring topics are worth separating carefully. A [discrete random variable](/page/Discrete%20Random%20Variable) is controlled by a probability mass function. A continuous random variable is controlled by a density with respect to Lebesgue measure. A general real-valued random variable may contain both discrete and continuous parts, or may be singular. The measure-theoretic language of laws and pushforwards is the common framework behind all three.

## References

- Androma, [Random Variable](/page/Random%20Variable).
- Androma, [Cambridge IA Probability](/page/Cambridge%20IA%20Probability).
- Androma, [Cambridge IB Probability and Measure](/page/Cambridge%20IB%20Probability%20and%20Measure).
- Androma, [Cambridge III Advanced Probability](/page/Cambridge%20III%20Advanced%20Probability).
- Androma, [Cambridge IB Statistics](/page/Cambridge%20IB%20Statistics).
- Patrick Billingsley, *Probability and Measure* (1995).
- Rick Durrett, *Probability: Theory and Examples* (2019).
- Larry Wasserman, *All of Statistics* (2004).

Created by admin on 6/23/2026 | Last updated on 6/23/2026
