2 Probability
For the convenience of online teaching in the fall semester of 2020, the layout uses wide margins and extra line spacing for note taking.
2.1 Introduction
With the advent of big data, computer scientists have come up with a plethora of new algorithms aimed at revealing patterns from data. Machine learning and artificial intelligence have become buzzwords that attract public attention. They have defeated the best human Go players, automated manufacturing, powered self-driving vehicles, recognized human faces, and recommended online purchases. Some of these industrial successes are based on statistical theory, and statistical theory is based on probability theory. Although the probabilistic approach is not the only perspective from which to understand the behavior of machine learning and artificial intelligence, it offers one of the most promising paradigms to rationalize existing algorithms and engineer new ones.
Economics has been an empirical social science since Adam Smith (1723–1790). Many numerical observations and anecdotes were scattered in his Wealth of Nations published in 1776. Ragnar Frisch (1895–1973) and Jan Tinbergen (1903–1994), two pioneer econometricians, were awarded the first Nobel Prize in economics in 1969. Econometrics provides quantitative insights about economic data. It flourishes in real-world management practices, from households and firms up to governance at the global level. Today, the big data revolution is pumping fresh energy into the research and practice of econometric methods. The mathematical foundation of econometric theory is built on probability theory as well.
2.2 Axiomatic Probability
Human beings are awed by uncertainty in daily life. In the old days, Egyptians consulted oracles, Hebrews inquired of prophets, and Chinese counted on diviners to interpret cracks in tortoise shells or bones. Fortunetellers are abundant in today’s Hong Kong.
Probability theory is a philosophy about uncertainty. Over centuries, mathematicians strove to contribute to the understanding of randomness. As measure theory matured in the early 20th century, Andrey Kolmogorov (1903–1987) built the edifice of modern probability theory in his monograph published in 1933. This formal mathematical language allows rigorous explorations that have yielded fruitful advances, and it is now widely accepted in academic and industrial research.
In this lecture, we will briefly introduce axiomatic probability theory along with familiar results covered in undergraduate probability and statistics. This lecture note is at the level of
Hansen (2020): Introduction to Econometrics, or
Stachurski (2016): A Primer in Econometric Theory, or
Casella and Berger (2002): Statistical Inference (second edition)
Interested readers may want to read these textbooks for more examples.
2.2.1 Probability Space
A sample space \(\Omega\) is a collection of all possible outcomes. It is a set of things. An event \(A\) is a subset of \(\Omega\). It is something of interest on the sample space. A \(\sigma\)-field, denoted by \(\mathcal{F}\), is a collection of events such that
\(\emptyset\in\mathcal{F}\);
if an event \(A\in\mathcal{F}\), then \(A^{c}\in\mathcal{F}\);
if \(A_{i}\in\mathcal{F}\) for \(i\in\mathbb{N}\), then \(\bigcup_{i\in\mathbb{N}}A_{i}\in\mathcal{F}\).
Implications: (a) Since \(\Omega=\emptyset^{c}\in\mathcal{F}\), we have \(\Omega\in\mathcal{F}\). (b) If \(A_{i}\in\mathcal{F}\) for \(i\in\mathbb{N}\), then \(A_{i}^{c}\in\mathcal{F}\) for \(i\in\mathbb{N}\), so \(\bigcup_{i\in\mathbb{N}}A_{i}^{c}\in\mathcal{F}\), and therefore \(\bigcap_{i\in\mathbb{N}}A_{i}=(\bigcup_{i\in\mathbb{N}}A_{i}^{c})^{c}\in\mathcal{F}\).
Remark. Intuitively, a \(\sigma\)-field is a pool of events that is closed under countably many union, difference, and intersection operations. These are algebraic operations on sets, so a \(\sigma\)-field is also called a \(\sigma\)-algebra.
Example 2.1. Let \(\Omega=\{1,2,3,4,5,6\}\). Some examples of \(\sigma\)-fields include
\(\mathcal{F}_{1}=\{\emptyset,\{1,2,3\},\{4,5,6\},\Omega\}\);
\(\mathcal{F}_{2}=\{\emptyset,\{1,3\},\{2,4,5,6\},\Omega\}\).
Counterexample: \(\mathcal{F}_{3}=\{\emptyset,\{1,2\},\{4,6\},\Omega\}\) is not a \(\sigma\)-field since \(\{1,2,4,6\}=\{1,2\}\bigcup\{4,6\}\) does not belong to \(\mathcal{F}_{3}\).
The \(\sigma\)-field can be viewed as a well-organized structure built on the ground of the sample space. The pair \(\left(\Omega,\mathcal{F}\right)\) is called a measurable space.
Let \(\mathcal{G}=\{B_{1},B_{2},\ldots\}\) be an arbitrary collection of sets, not necessarily a \(\sigma\)-field. We say \(\mathcal{F}\) is the smallest \(\sigma\)-field generated by \(\mathcal{G}\) if \(\mathcal{G}\subseteq\mathcal{F}\) and \(\mathcal{F}\subseteq\mathcal{\tilde{F}}\) for any \(\sigma\)-field \(\mathcal{\tilde{F}}\) such that \(\mathcal{G}\subseteq\mathcal{\tilde{F}}\). The Borel \(\sigma\)-field \(\mathcal{R}\) is the smallest \(\sigma\)-field generated by the open sets on the real line \(\mathbb{R}\).
Example 2.2. Let \(\Omega=\{1,2,3,4,5,6\}\) and \(A=\{\{1\},\{1,3\}\}\). Then the smallest \(\sigma\)-field generated by \(A\) is \[\sigma(A)=\{\emptyset,\{1\},\{1,3\},\{3\},\{2,4,5,6\},\{2,3,4,5,6\},\{1,2,4,5,6\},\Omega\}.\]
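On a finite sample space, the generating operation can even be mechanized. Below is a minimal Python sketch, added here as an illustration (the function name and implementation are mine, not part of the original notes), that closes a collection of events under complement and finite union; on a finite space this closure is exactly the generated \(\sigma\)-field, and running it on Example 2.2 reproduces the eight sets in \(\sigma(A)\).

```python
from itertools import combinations

def generate_sigma_field(omega, generators):
    """Close a collection of events under complement and finite union.

    On a finite sample space this closure equals the sigma-field
    generated by `generators`.
    """
    omega = frozenset(omega)
    sigma = {frozenset(), omega} | {frozenset(g) for g in generators}
    changed = True
    while changed:
        changed = False
        snapshot = list(sigma)
        for a in snapshot:                      # add complements
            if omega - a not in sigma:
                sigma.add(omega - a)
                changed = True
        for a, b in combinations(snapshot, 2):  # add pairwise unions
            if a | b not in sigma:
                sigma.add(a | b)
                changed = True
    return sigma

sigma_A = generate_sigma_field({1, 2, 3, 4, 5, 6}, [{1}, {1, 3}])
print(sorted(sorted(s) for s in sigma_A))  # the eight sets listed in Example 2.2
```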
A function \(\mu:(\Omega,\mathcal{F})\mapsto\left[0,\infty\right]\) is called a measure if it satisfies
(positiveness) \(\mu\left(A\right)\geq0\) for all \(A\in\mathcal{F}\);
(countable additivity) if \(A_{i}\in\mathcal{F}\), \(i\in\mathbb{N}\), are mutually disjoint, then \[\mu\left(\bigcup_{i\in\mathbb{N}}A_{i}\right)=\sum_{i\in\mathbb{N}}\mu\left(A_{i}\right).\]
A measure can be understood as a weight or length. In particular, we call \(\mu\) a probability measure if \(\mu\left(\Omega\right)=1\). A probability measure is often denoted as \(P\). The triple \(\left(\Omega,\mathcal{F},P\right)\) is called a probability space.
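For concreteness, here is a small worked example on the die space of Example 2.1. On \((\Omega,\mathcal{F}_{1})\), assigning \[P\left(\emptyset\right)=0,\quad P\left(\{1,2,3\}\right)=\tfrac{1}{2},\quad P\left(\{4,5,6\}\right)=\tfrac{1}{2},\quad P\left(\Omega\right)=1\] defines a probability measure: the only non-trivial disjoint union in \(\mathcal{F}_{1}\) is \(\{1,2,3\}\cup\{4,5,6\}=\Omega\), and indeed \(\tfrac{1}{2}+\tfrac{1}{2}=1=P\left(\Omega\right)\).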
So far we have answered the question: “What is a mathematically well-defined probability?”, but we have not yet answered “How to assign the probability?” There are two major schools of thinking on probability assignment. One is the frequentist, who considers probability as the average chance of occurrence if a large number of experiments are carried out. The other is the Bayesian, who deems probability a subjective belief. The principles of these two schools are largely incompatible, and each school has its merits and difficulties, which will be elaborated when we discuss hypothesis testing.
2.2.2 Random Variable
The terminology random variable is a historical relic that belies its modern definition as a deterministic mapping. It is a link between two measurable spaces such that any event in the \(\sigma\)-field installed on the range can be traced back to an event in the \(\sigma\)-field installed on the domain.
Formally, a function \(X:\Omega\mapsto\mathbb{R}\) is \(\left(\Omega,\mathcal{F}\right)\backslash\left(\mathbb{R},\mathcal{R}\right)\) measurable if \[X^{-1}\left(B\right)=\left\{ \omega\in\Omega:X\left(\omega\right)\in B\right\} \in\mathcal{F}\] for any \(B\in\mathcal{R}.\) Random variable is an alternative, and somewhat romantic, name for a measurable function. The \(\sigma\)-field generated by the random variable \(X\) is defined as \(\sigma\left(X\right)=\left\{ X^{-1}\left(B\right):B\in\mathcal{R}\right\}\).
We say a measurable function is a discrete random variable if the set \(\left\{ X\left(\omega\right):\omega\in\Omega\right\}\) is finite or countable. We say it is a continuous random variable if the set \(\left\{ X\left(\omega\right):\omega\in\Omega\right\}\) is uncountable.
A measurable function connects two measurable spaces. No probability is involved in its definition yet. However, if a probability measure \(P\) is installed on \((\Omega,\mathcal{F})\), the measurable function \(X\) induces a probability measure on \((\mathbb{R},\mathcal{R})\). It is easy to verify that \(P_{X}:(\mathbb{R},\mathcal{R})\mapsto\left[0,1\right]\) is also a probability measure if defined as \[P_{X}\left(B\right)=P\left(X^{-1}\left(B\right)\right)\] for any \(B\in\mathcal{R}\). This \(P_{X}\) is called the probability measure induced by the measurable function \(X\). The induced probability measure \(P_{X}\) is an offspring of the parent probability measure \(P\) through the channel of \(X\).
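To make the induced measure concrete, equip \(\Omega=\{1,\ldots,6\}\) with its power set and the fair-die probability \(P\left(\{\omega\}\right)=1/6\), and let \(X\left(\omega\right)=1\{\omega\text{ is even}\}\). Then \(\sigma\left(X\right)=\{\emptyset,\{2,4,6\},\{1,3,5\},\Omega\}\), and for any Borel set \(B\), \[P_{X}\left(B\right)=P\left(X^{-1}\left(B\right)\right)=\tfrac{1}{2}\cdot1\{0\in B\}+\tfrac{1}{2}\cdot1\{1\in B\}.\]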
2.2.3 Distribution Function
We go back to some terms that we have learned in an undergraduate probability course. A (cumulative) distribution function \(F:\mathbb{R}\mapsto[0,1]\) is defined as \[F\left(x\right)=P\left(X\leq x\right)=P\left(\{X\leq x\}\right)=P\left(\left\{ \omega\in\Omega:X\left(\omega\right)\leq x\right\} \right).\] It is often abbreviated as CDF, and it has the following properties.
\(\lim_{x\to-\infty}F\left(x\right)=0\),
\(\lim_{x\to\infty}F\left(x\right)=1\),
non-decreasing,
right-continuity: \(\lim_{y\to x^{+}}F\left(y\right)=F\left(x\right)\).
Exercise. Draw the CDF of a binary distribution; that is, \(X=1\) with probability \(p\in\left(0,1\right)\) and \(X=0\) with probability \(1-p\).
For a continuous distribution, if there exists a function \(f\) such that for all \(x\), \[F\left(x\right)=\int_{-\infty}^{x}f\left(y\right)\mathrm{d}y,\] then \(f\) is called the probability density function of \(X\), often abbreviated as PDF. It is easy to show that \(f\left(x\right)\geq0\) and \(\int_{a}^{b}f\left(x\right)\mathrm{d}x=F\left(b\right)-F\left(a\right)\).
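A standard parametric example may help fix ideas: the exponential distribution with rate \(\lambda>0\) has \[f\left(x\right)=\lambda e^{-\lambda x}1\{x\geq0\},\qquad F\left(x\right)=\int_{-\infty}^{x}f\left(y\right)\mathrm{d}y=\left(1-e^{-\lambda x}\right)1\{x\geq0\},\] which satisfies the CDF properties listed above, and \(\int_{a}^{b}f\left(x\right)\mathrm{d}x=e^{-\lambda a}-e^{-\lambda b}=F\left(b\right)-F\left(a\right)\) for \(0\leq a\leq b\).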
Example 2.3. We have learned many parametric distributions, such as the binary distribution, the Poisson distribution, the uniform distribution, the exponential distribution, the normal distribution, and the \(\chi^{2}\), \(t\), and \(F\) distributions. They are called parametric distributions because their CDF or PDF can be completely characterized by a few parameters.
2.3 Expected Value
2.3.1 Integration
Integration is one of the most fundamental operations in mathematical analysis. We studied the Riemann integral in undergraduate calculus. The Riemann integral is intuitive, but the Lebesgue integral is a more general approach to defining integration. The Lebesgue integral is constructed in the following three steps. \(X\) is called a simple function on a measurable space \(\left(\Omega,\mathcal{F}\right)\) if \(X=\sum_{i}a_{i}\cdot1\left\{ A_{i}\right\}\) and this summation is finite, where \(a_{i}\in\mathbb{R}\) and \(\{A_{i}\in\mathcal{F}\}_{i\in\mathbb{N}}\) is a partition of \(\Omega\). A simple function is measurable.
Step 1: Let \(\left(\Omega,\mathcal{F},\mu\right)\) be a measure space. The integral of the simple function \(X\) with respect to \(\mu\) is \[\int X\mathrm{d}\mu=\sum_{i}a_{i}\mu\left(A_{i}\right).\] Unlike the Riemann integral, this definition of integration does not partition the domain into intervals of equal length. Instead, it tracks the distinct values of the function and the measure of the corresponding sets.
Step 2: Let \(X\) be a non-negative measurable function. The integral of \(X\) with respect to \(\mu\) is \[\int X\mathrm{d}\mu=\sup\left\{ \int Y\mathrm{d}\mu:0\leq Y\leq X,\text{ }Y\text{ is simple}\right\} .\]
Step 3: Let \(X\) be a measurable function. Define \(X^{+}=\max\left\{ X,0\right\}\) and \(X^{-}=-\min\left\{ X,0\right\}\). Both \(X^{+}\) and \(X^{-}\) are non-negative functions. The integral of \(X\) with respect to \(\mu\) is \[\int X\mathrm{d}\mu=\int X^{+}\mathrm{d}\mu-\int X^{-}\mathrm{d}\mu.\]
Step 1 above defines the integral of a simple function. Step 2 defines the integral of a non-negative function as the supremum of the integrals of simple functions that approximate it from below. Step 3 defines the integral of a general function as the difference of the integrals of its two non-negative parts.
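For a tiny illustration of Step 1, take the die space \(\Omega=\{1,\ldots,6\}\) with the uniform probability measure \(\mu\left(\{\omega\}\right)=1/6\) and the simple function \(X=1\cdot1\left\{ \{1,2,3\}\right\} +5\cdot1\left\{ \{4,5,6\}\right\}\). Then \[\int X\mathrm{d}\mu=1\cdot\mu\left(\{1,2,3\}\right)+5\cdot\mu\left(\{4,5,6\}\right)=1\cdot\tfrac{1}{2}+5\cdot\tfrac{1}{2}=3.\]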
Remark. The integrand that highlights the difference between the Lebesgue integral and the Riemann integral is the Dirichlet function on the unit interval, \(1\left\{ x\in\mathbb{Q}\cap[0,1]\right\}\). It is not Riemann-integrable, whereas its Lebesgue integral is well defined and \(\int1\left\{ x\in\mathbb{Q}\cap[0,1]\right\} \mathrm{d}x=0\).
If the measure \(\mu\) is a probability measure \(P\), then the integral \(\int X\mathrm{d}P\) is called the expected value, or expectation, of \(X\). We often use the notation \(E\left[X\right]\), instead of \(\int X\mathrm{d}P\), for convenience.
Expectation provides the average of a random variable, even though we cannot foresee the realization of a random variable in a particular trial (otherwise the study of uncertainty would be trivial). In the frequentist’s view, the expectation is the average outcome if we carry out a large number of independent trials.
If we know the probability mass function of a discrete random variable, its expectation is calculated as \(E\left[X\right]=\sum_{x}xP\left(X=x\right)\), which is the integral of a simple function. If a continuous random variable has a PDF \(f(x)\), its expectation can be computed as \(E\left[X\right]=\int xf\left(x\right)\mathrm{d}x\). These two expressions are unified as \(E[X]=\int X\mathrm{d}P\) by the Lebesgue integral.
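The frequentist reading of the expectation can be checked numerically. The following Python sketch, added here as an illustration (the distributions and parameter values are my own choices), approximates \(E[X]\) by the sample average of many independent draws, once for a discrete and once for a continuous distribution.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n_draws = 1_000_000

# Discrete case: Binomial(10, 0.3); the theoretical mean is 10 * 0.3 = 3.
x_discrete = rng.binomial(n=10, p=0.3, size=n_draws)
print(x_discrete.mean())    # close to 3.0

# Continuous case: Exponential with rate 2 (scale = 1/2); the theoretical mean is 0.5.
x_continuous = rng.exponential(scale=0.5, size=n_draws)
print(x_continuous.mean())  # close to 0.5
```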
2.3.2 Properties of Expectations
Here are some properties of mathematical expectations.
The probability of an event \(A\) is the expectation of an indicator function. \(E\left[1\left\{ A\right\} \right]=1\times P(A)+0\times P(A^{c})=P\left(A\right)\).
\(E\left[X^{r}\right]\) is called the \(r\)th moment of \(X\). The mean of a random variable is the first moment \(\mu=E\left[X\right]\), and the second centered moment is called the variance \(\mathrm{var}\left[X\right]=E\left[\left(X-\mu\right)^{2}\right]\). The third centered moment \(E\left[\left(X-\mu\right)^{3}\right]\), called skewness, is a measurement of the symmetry of a random variable, and the fourth centered moment \(E\left[\left(X-\mu\right)^{4}\right]\), called kurtosis, is a measurement of the tail thickness.
Moments do not always exist. For example, the mean of the Cauchy distribution does not exist, and the variance of the \(t(2)\) distribution does not exist.
\(E[\cdot]\) is a linear operation. If \(\phi(\cdot)\) is a linear function, then \(E[\phi(X)]=\phi(E[X]).\)
Jensen’s inequality is an important fact. A function \(\varphi(\cdot)\) is convex if \(\varphi(ax_{1}+(1-a)x_{2})\leq a\varphi(x_{1})+(1-a)\varphi(x_{2})\) for all \(x_{1},x_{2}\) in the domain and \(a\in[0,1]\). For instance, \(x^{2}\) is a convex function. Jensen’s inequality says that if \(\varphi\left(\cdot\right)\) is a convex function, then \(\varphi\left(E\left[X\right]\right)\leq E\left[\varphi\left(X\right)\right].\)
Markov inequality is another simple but important fact. If \(E\left[\left|X\right|^{r}\right]\) exists, then \(P\left(\left|X\right|>\epsilon\right)\leq E\left[\left|X\right|^{r}\right]/\epsilon^{r}\) for all \(r\geq1\). Chebyshev inequality \(P\left(\left|X\right|>\epsilon\right)\leq E\left[X^{2}\right]/\epsilon^{2}\) is a special case of the Markov inequality when \(r=2\).
The distribution of a random variable is completely characterized by its CDF or PDF. A moment is a function of the distribution. To back out the underlying distribution from moments, we can use the moment-generating function (mgf) \(M_{X}(t)=E[e^{tX}]\) for \(t\in\mathbb{R}\), whenever the expectation exists. The \(r\)th moment can be computed from the mgf as \[E[X^{r}]=\frac{\mathrm{d}^{r}M_{X}(t)}{\mathrm{d}t^{r}}\big\vert_{t=0}.\] Just like moments, the mgf does not always exist.
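As a worked example of recovering moments from the mgf, take the exponential distribution with rate \(\lambda>0\), whose mgf exists for \(t<\lambda\): \[M_{X}(t)=\int_{0}^{\infty}e^{tx}\lambda e^{-\lambda x}\mathrm{d}x=\frac{\lambda}{\lambda-t}.\] Differentiating and evaluating at \(t=0\) gives \(E[X]=M_{X}'(0)=1/\lambda\) and \(E[X^{2}]=M_{X}''(0)=2/\lambda^{2}\), so \(\mathrm{var}[X]=1/\lambda^{2}\). In contrast, the Cauchy distribution mentioned above has no mgf at any \(t\neq0\).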
2.4 Multivariate Random Variable
A bivariate random variable is a measurable function \(X:\Omega\mapsto\mathbb{R}^{2}\), and more generally a multivariate random variable is a measurable function \(X:\Omega\mapsto\mathbb{R}^{n}\). We can define the joint CDF as \(F\left(x_{1},\ldots,x_{n}\right)=P\left(X_{1}\leq x_{1},\ldots,X_{n}\leq x_{n}\right)\). Joint PDF is defined similarly.
It is sufficient to introduce the joint distribution, conditional distribution and marginal distribution in the simple bivariate case; these definitions can be extended to multivariate distributions. Suppose a bivariate random variable \((X,Y)\) has a joint density \(f(\cdot,\cdot)\). The conditional density can be roughly written as \(f\left(y|x\right)=f\left(x,y\right)/f\left(x\right)\) if we do not formally deal with the case \(f(x)=0\). The marginal density \(f\left(y\right)=\int f\left(x,y\right)\mathrm{d}x\) integrates out the coordinate that is not of interest.
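For instance, the bivariate density \(f(x,y)=x+y\) on the unit square \([0,1]^{2}\) integrates to one, and direct calculation gives the marginal and conditional densities \[f\left(x\right)=\int_{0}^{1}\left(x+y\right)\mathrm{d}y=x+\tfrac{1}{2},\qquad f\left(y|x\right)=\frac{x+y}{x+\tfrac{1}{2}},\qquad x,y\in[0,1].\]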
2.4.1 Conditional Probability and Bayes’ Theorem
In a probability space \((\Omega,\mathcal{F},P)\), for two events \(A_{1},A_{2}\in\mathcal{F}\) the conditional probability is \[P\left(A_{1}|A_{2}\right)=\frac{P\left(A_{1}A_{2}\right)}{P\left(A_{2}\right)}\] if \(P(A_{2})>0\). In the definition of conditional probability, \(A_{2}\) plays the role of the outcome space so that \(P(A_{1}A_{2})\) is standardized by the total mass \(P(A_{2})\). If \(P(A_{2})=0\), the conditional probability can still be valid in some cases, but we need to introduce the dominance between two measures, which we do not elaborate here.
Since \(A_{1}\) and \(A_{2}\) enter the joint event symmetrically, we also have \(P(A_{1}A_{2})=P(A_{2}|A_{1})P(A_{1})\). It implies \[P(A_{1}|A_{2})=\frac{P\left(A_{2}|A_{1}\right)P\left(A_{1}\right)}{P\left(A_{2}\right)}.\] This formula is Bayes’ Theorem.
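A textbook-style numerical illustration (with hypothetical numbers) shows how the formula reverses the direction of conditioning. Suppose a disease affects \(1\%\) of a population, a test detects it with probability \(0.95\) when the disease is present, and it returns a false positive with probability \(0.05\) when the disease is absent. Writing \(A_{1}=\{\text{disease}\}\) and \(A_{2}=\{\text{positive test}\}\), the law of total probability gives \(P(A_{2})=0.95\times0.01+0.05\times0.99\), and \[P\left(A_{1}|A_{2}\right)=\frac{P\left(A_{2}|A_{1}\right)P\left(A_{1}\right)}{P\left(A_{2}\right)}=\frac{0.95\times0.01}{0.95\times0.01+0.05\times0.99}\approx0.16.\] A positive result raises the probability of disease from \(1\%\) to only about \(16\%\).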
2.4.2 Independence
We say two events \(A_{1}\) and \(A_{2}\) are independent if \(P(A_{1}A_{2})=P(A_{1})P(A_{2})\). If \(P(A_{2})\neq0\), it is equivalent to \(P(A_{1}|A_{2})=P(A_{1})\). In words, knowing \(A_{2}\) does not change the probability of \(A_{1}\).
Regarding the independence of two random variables, \(X\) and \(Y\) are independent if \(P\left(X\in B_{1},Y\in B_{2}\right)=P\left(X\in B_{1}\right)P\left(Y\in B_{2}\right)\) for any two Borel sets \(B_{1}\) and \(B_{2}\).
If \(X\) and \(Y\) are independent, then \(E[XY]=E[X]E[Y]\). The expectation of their product is the product of their expectations.
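In the discrete case this claim follows from a one-line factorization (assuming the moments exist; the general case is an application of Fubini’s theorem): \[E[XY]=\sum_{x}\sum_{y}xy\,P\left(X=x,Y=y\right)=\sum_{x}\sum_{y}xy\,P\left(X=x\right)P\left(Y=y\right)=E[X]\,E[Y].\]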
2.4.3 Law of Iterated Expectations
Given a probability space \(\left(\Omega,\mathcal{F},P\right)\), a sub \(\sigma\)-algebra \(\mathcal{G}\subseteq\mathcal{F}\) and an \(\mathcal{F}\)-measurable function \(Y\) with \(E\left|Y\right|<\infty\), the conditional expectation \(E\left[Y|\mathcal{G}\right]\) is defined as a \(\mathcal{G}\)-measurable function such that \[\int_{A}Y\mathrm{d}P=\int_{A}E\left[Y|\mathcal{G}\right]\mathrm{d}P\] for all \(A\in\mathcal{G}\). Here \(\mathcal{G}\) is a coarser \(\sigma\)-field and \(\mathcal{F}\) is a finer one.
Taking \(A=\Omega\), we have \(E\left[Y\right]=\int Y\mathrm{d}P=\int E\left[Y|\mathcal{G}\right]\mathrm{d}P=E\left[E\left[Y|\mathcal{G}\right]\right]\). The law of iterated expectations \[E\left[Y\right]=E\left[E\left[Y|\mathcal{G}\right]\right]\] follows immediately from this definition of the conditional expectation. In the bivariate case, if the conditional density exists, the conditional expectation can be computed as \(E\left[Y|X\right]=\int yf\left(y|X\right)\mathrm{d}y\), where \(E\left[\cdot|X\right]=E\left[\cdot|\sigma\left(X\right)\right]\) is concise notation for conditioning on \(\sigma\left(X\right)\), the smallest \(\sigma\)-field generated by \(X\). The law of iterated expectations implies \(E\left[E\left[Y|X\right]\right]=E\left[Y\right]\).
Below are some properties of conditional expectations; a simulation check of the law of iterated expectations is sketched after the list.
\(E\left[E\left[Y|X_{1},X_{2}\right]|X_{1}\right]=E\left[Y|X_{1}\right];\)
\(E\left[E\left[Y|X_{1}\right]|X_{1},X_{2}\right]=E\left[Y|X_{1}\right];\)
\(E\left[h\left(X\right)Y|X\right]=h\left(X\right)E\left[Y|X\right].\)
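As promised above, here is a minimal Python simulation of the law of iterated expectations, added as an illustration (the bivariate model is a hypothetical choice of mine): with \(X\sim\mathrm{Bernoulli}(0.3)\) and \(Y|X\sim N(2X,1)\), we have \(E\left[Y|X\right]=2X\), so averaging \(2X\) over draws of \(X\) should match the direct average of \(Y\), both close to \(E[Y]=0.6\).

```python
import numpy as np

# Monte Carlo check of E[Y] = E[ E[Y|X] ].
rng = np.random.default_rng(seed=0)
n = 1_000_000

x = rng.binomial(n=1, p=0.3, size=n)          # X ~ Bernoulli(0.3)
y = rng.normal(loc=2 * x, scale=1.0, size=n)  # Y | X ~ Normal(2X, 1)

print(y.mean())        # direct estimate of E[Y], close to 0.6
print((2 * x).mean())  # estimate of E[ E[Y|X] ] using E[Y|X] = 2X, also close to 0.6
```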
2.5 Summary
If this is your first encounter with measure theory, the new definitions here may seem overwhelmingly abstract. A natural question is: “I earned a high grade in my undergraduate probability and statistics; do I really need the fancy mathematics in this lecture to do well in econometrics?” The answer is yes and no. No in the sense that if you want to use econometric methods, instead of grasping the underlying theory, then axiomatic probability does not add much to your weaponry. You can be an excellent economist or applied econometrician without knowing measure-theoretic probability. Yes in the sense that without measure theory we cannot even formally define conditional expectation, which will be the subject of our next lecture and is a core concept of econometrics. Moreover, the pillars of asymptotic theory, the law of large numbers and the central limit theorem, can only be made precise with this foundation. If you aspire to work on econometric theory, you will meet and use measure theory so often in your future study that it will eventually become part of your muscle memory.
In this course, we try to strike a balance. On the one hand, many econometric topics can be presented with elementary mathematics. Whenever possible, econometrics should reach a wider audience with a plain appearance, instead of intimidating people with arcane language. On the other hand, we introduce these concepts in this lecture and will invoke them in the discussion of asymptotic theory later. Your investment in advanced mathematics will not be in vain.
Historical notes: Measure theory was established in the early 20th century by a constellation of French and German mathematicians, represented by Émile Borel, Henri Lebesgue, and Johann Radon, among others. Generations of Russian mathematicians such as Andrey Markov and Andrey Kolmogorov made fundamental contributions to mathematizing the seemingly abstract concepts of uncertainty and randomness. Their names are immortalized in the Borel set, the Lebesgue integral, the Radon measure, the Markov chain, Kolmogorov’s zero–one law, and many other terms named after them.
Fascinating questions about probability attracted great economists. Francis Edgeworth (1845–1926) wrote extensively on probability and statistics. John Maynard Keynes (1883–1946) published A Treatise on Probability in 1921 which mixed probability and philosophy, although this piece of work was not as influential as his General Theory of Employment, Interest and Money in 1936 which later revolutionized economics.
Today, the technology for collecting and processing data is unbelievably cheaper than it was 100 years ago. Unfortunately, the cost of learning and developing mathematics has not fallen significantly over the century. Only a small handful of talents, like you, enjoy the privilege and luxury of appreciating the ideas of these great minds.
Further reading: Doob (1996) summarized the development of axiomatic probability in the first half of the 20th century.
Zhentao Shi. Sep 12, 2020.