9 Hypothesis Testing

Notation: \(\mathbf{X}\) denotes a random variable or random vector. \(\mathbf{x}\) is its realization.

A hypothesis is a statement about the parameter space \(\Theta\). Hypothesis testing checks whether the data support a null hypothesis \(\Theta_{0}\), which is a subset of \(\Theta\) of interest. Ideally the null hypothesis should be suggested by scientific theory. The alternative hypothesis \(\Theta_{1}=\Theta\backslash\Theta_{0}\) is the complement of \(\Theta_{0}\). Based on the observed evidence, hypothesis testing decides to accept or reject the null hypothesis. If the null hypothesis is rejected by the data, it implies that from the statistical perspective the data is incompatible with the proposed scientific theory.

In this chapter, we first introduce the idea and practice of hypothesis testing and the related confidence interval. While we mainly focus on the frequentist interpretation of hypothesis testing, we briefly discuss the Bayesian approach to statistical decisions. As an application of the testing procedures to the linear regression model, we elaborate on how to test a linear or nonlinear hypothesis about the slope coefficients based on the unrestricted or restricted OLS estimators.

9.1 Testing

9.1.1 Decision Rule and Errors

If \(\Theta_{0}\) is a singleton, we call it a simple hypothesis; otherwise we call it a composite hypothesis. For example, if the parameter space \(\Theta=\mathbb{R}\), then \(\Theta_{0}=\left\{ 0\right\}\) (or equivalently \(\theta_{0}=0\)) is a simple hypothesis, whereas \(\Theta_{0}=(-\infty,0]\) (or equivalently \(\theta_{0}\leq0\)) is a composite hypothesis.

A test function is a mapping \[\phi:\mathcal{X}^{n}\mapsto\left\{ 0,1\right\} ,\] where \(\mathcal{X}\) is the sample space. The null hypothesis is accepted if \(\phi\left(\mathbf{x}\right)=0\), or rejected if \(\phi\left(\mathbf{x}\right)=1\). We call the set \(A_{\phi}=\left\{ \mathbf{x}\in\mathcal{X}^{n}:\phi\left(\mathbf{x}\right)=0\right\}\) the acceptance region, and its complement \(R_{\phi}=\left\{ \mathbf{x}\in\mathcal{X}^{n}:\phi\left(\mathbf{x}\right)=1\right\}\) the rejection region.

The power function of a test \(\phi\) is \[\beta\left(\theta\right)=P_{\theta}\left\{ \phi\left(\mathbf{X}\right)=1\right\} =E_{\theta}\left[\phi\left(\mathbf{X}\right)\right].\] The power function measures the probability that the test function rejects the null when the data is generated under the true parameter \(\theta\), reflected in \(P_{\theta}\) and \(E_{\theta}\).

The power of a test at some \(\theta\in\Theta_{1}\) is the value of \(\beta\left(\theta\right)\). The size of the test is \(\sup_{\theta\in\Theta_{0}}\beta\left(\theta\right).\) Notice that the definition of power depends on a \(\theta\) in the alternative hypothesis \(\Theta_{1}\), whereas that of size is independent of \(\theta\) due to the supremum over the null set \(\Theta_{0}\). The level of a test is any value \(\alpha\in\left(0,1\right)\) such that \(\alpha\geq\sup_{\theta\in\Theta_{0}}\beta\left(\theta\right)\), which is often used when it is difficult to attain the exact supremum. A test of size \(\alpha\) is also of any level greater than or equal to \(\alpha\), while a test of level \(\alpha\) must have size no larger than \(\alpha\).

The concept of level is useful when we do not have sufficient information to derive the exact size of a test. Suppose \(\left(X_{1i},X_{2i}\right)_{i=1}^{n}\) are randomly drawn from some unknown joint distribution, but we know the marginal distributions are \(X_{ji}\sim N\left(\theta_{j},1\right)\) for \(j=1,2\). In order to test the joint hypothesis \(\theta_{1}=\theta_{2}=0\), we can construct a test function \[\phi_{\theta_{1}=\theta_{2}=0}\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)=1\left\{ \left\{ \sqrt{n}\left|\overline{X}_{1}\right|\geq z_{1-\alpha/4}\right\} \cup\left\{ \sqrt{n}\left|\overline{X}_{2}\right|\geq z_{1-\alpha/4}\right\} \right\} ,\] where \(z_{1-\alpha/4}\) is the \(\left(1-\alpha/4\right)\)-th quantile of the standard normal distribution. Under the null, the rejection probability satisfies \[\begin{aligned}P\left(\phi_{\theta_{1}=\theta_{2}=0}\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)=1\right) & \leq P\left(\sqrt{n}\left|\overline{X}_{1}\right|\geq z_{1-\alpha/4}\right)+P\left(\sqrt{n}\left|\overline{X}_{2}\right|\geq z_{1-\alpha/4}\right)\\ & =\alpha/2+\alpha/2=\alpha, \end{aligned}\] where the inequality follows by the Bonferroni inequality \[P\left(A\cup B\right)\leq P\left(A\right)+P\left(B\right).\] (The seemingly trivial Bonferroni inequality is useful in many proofs of probability results.) Therefore, the level of \(\phi_{\theta_{1}=\theta_{2}=0}\left(\mathbf{X}_{1},\mathbf{X}_{2}\right)\) is \(\alpha\), but the exact size is unknown without knowledge of the joint distribution. (Even if we know the correlation of \(X_{1i}\) and \(X_{2i}\), putting two marginally normal distributions together does not make a jointly normal vector in general.)
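
The level calculation can be checked by a small simulation. The following sketch (not part of the original derivation) draws \(\left(X_{1i},X_{2i}\right)\) from one particular joint distribution with \(N\left(0,1\right)\) marginals, a bivariate normal with correlation \(\rho\); the sample size, correlation, and number of replications are arbitrary illustrative choices.

```r
# Simulation sketch: empirical rejection frequency of the Bonferroni-type joint test
# under the null theta_1 = theta_2 = 0. The joint distribution (bivariate normal with
# correlation rho) is one arbitrary choice consistent with the N(0,1) marginals.
set.seed(1)
n <- 100; alpha <- 0.05; rho <- 0.7; Rep <- 5000
crit <- qnorm(1 - alpha / 4)
reject <- logical(Rep)
for (r in 1:Rep) {
  z1 <- rnorm(n)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)   # N(0,1) marginal, correlated with z1
  reject[r] <- (sqrt(n) * abs(mean(z1)) >= crit) || (sqrt(n) * abs(mean(z2)) >= crit)
}
mean(reject)   # empirical size: below the level alpha = 0.05, up to simulation error
```

Under this particular joint distribution the two rejection events overlap, so the empirical size falls below the nominal level \(\alpha\), illustrating the difference between level and size.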


                   accept $H_{0}$       reject $H_{0}$
  ---------------  -------------------  -------------------
  $H_{0}$ true     correct decision     Type I error
  $H_{0}$ false    Type II error        correct decision

: \[tab:Decisions-and-States\] Actions, States and Consequences

  • The probability of committing a Type I error is \(\beta\left(\theta\right)\) for some \(\theta\in\Theta_{0}\).

  • The probability of committing a Type II error is \(1-\beta\left(\theta\right)\) for some \(\theta\in\Theta_{1}\).

The philosophy of hypothesis testing has been debated for centuries. At present the prevailing framework in statistics textbooks is the frequentist perspective. A frequentist views the parameter as a fixed constant and keeps a conservative attitude toward the Type I error: only if overwhelming evidence is demonstrated shall a researcher reject the null. Under the principle of protecting the null hypothesis, a desirable test should have a small level. Conventionally we take \(\alpha=0.01,\) 0.05 or 0.1. We say a test is unbiased if \(\beta\left(\theta\right)\geq\sup_{\theta'\in\Theta_{0}}\beta\left(\theta'\right)\) for all \(\theta\in\Theta_{1}\). There can be many tests of correct size.

A trivial test function \(\phi(\mathbf{x})=1\left\{ 0\leq U\leq\alpha\right\}\), where \(U\) is a random variable from the uniform distribution on \(\left[0,1\right]\) drawn independently of the data, has correct size \(\alpha\), but its power equals \(\alpha\) at every alternative, so it has no non-trivial power. On the other extreme, the trivial test function \(\phi\left(\mathbf{x}\right)=1\) for all \(\mathbf{x}\) enjoys the biggest power but suffers from incorrect size (its size is 1).

Usually, we design a test by proposing a test statistic \(T_{n}:\mathcal{X}^{n}\mapsto\mathbb{R}^{+}\) and a critical value \(c_{1-\alpha}\). Given \(T_{n}\) and \(c_{1-\alpha}\), we write the test function as \[\phi\left(\mathbf{X}\right)=1\left\{ T_{n}\left(\mathbf{X}\right)>c_{1-\alpha}\right\} .\] To ensure such a \(\phi\left(\mathbf{x}\right)\) has correct size, we need to figure out the distribution of \(T_{n}\) under the null hypothesis (called the null distribution), and choose a critical value \(c_{1-\alpha}\) according to the null distribution and the desirable size or level \(\alpha\).

Another commonly used indicator in hypothesis testing is the \(p\)-value: \[\sup_{\theta\in\Theta_{0}}P_{\theta}\left\{ T_{n}\left(\mathbf{x}\right)\leq T_{n}\left(\mathbf{X}\right)\right\} .\] In the above expression, \(T_{n}\left(\mathbf{x}\right)\) is the realized value of the test statistic \(T_{n}\), while \(T_{n}\left(\mathbf{X}\right)\) is the random variable generated by \(\mathbf{X}\) under the null \(\theta\in\Theta_{0}\). The interpretation of the \(p\)-value is tricky. The \(p\)-value is the probability, computed under the null hypothesis, of observing a \(T_{n}(\mathbf{X})\) at least as large as the realized \(T_{n}(\mathbf{x})\).

The \(p\)-value is not the probability that the null hypothesis is true. Under the frequentist perspective, the null hypothesis is either true or false, with certainty. The randomness of a test comes only from sampling, not from the hypothesis. The \(p\)-value measures whether the dataset is compatible with the null hypothesis, and it is closely related to the corresponding test: when the \(p\)-value is smaller than the specified test size \(\alpha\), the test rejects the null.

So far we have been talking about hypothesis testing in finite samples. The discussion and terminologies can be carried over to the asymptotic world as \(n\to\infty\). If we denote the power function as \(\beta_{n}\left(\theta\right)\), making its dependence on the sample size \(n\) explicit, the test is of asymptotic size \(\alpha\) if \(\limsup_{n\to\infty}\beta_{n}\left(\theta\right)\leq\alpha\) for all \(\theta\in\Theta_{0}\). A test is consistent if \(\beta_{n}\left(\theta\right)\to1\) for every \(\theta\in\Theta_{1}\).

9.1.2 Optimality

Just as there may be multiple valid estimators for an estimation task, there may be multiple tests for a task of hypothesis testing. Consider the class of tests of level \(\alpha\), \(\Psi_{\alpha}=\left\{ \phi:\sup_{\theta\in\Theta_{0}}\beta_{\phi}\left(\theta\right)\leq\alpha\right\}\), where we put a subscript \(\phi\) in \(\beta_{\phi}\left(\theta\right)\) to distinguish the power functions of different tests. It is natural to prefer a test \(\phi^{*}\) whose power is no lower than that of any other test under consideration at each point of the alternative hypothesis, in that \[\beta_{\phi^{*}}\left(\theta\right)\geq\beta_{\phi}\left(\theta\right)\] for every \(\theta\in\Theta_{1}\) and every \(\phi\in\Psi_{\alpha}\). If such a test \(\phi^{*}\in\Psi_{\alpha}\) exists, we call it the uniformly most powerful test.

Suppose a random sample of size 6 is generated from \[\left(X_{1},\ldots,X_{6}\right)\sim\text{iid }N\left(\theta,1\right),\] where \(\theta\) is unknown. We want to infer the population mean of the normal distribution. The null hypothesis is \(H_{0}\): \(\theta\leq0\) and the alternative is \(H_{1}\): \(\theta>0\). All tests in \[\Psi=\left\{ 1\left\{ \bar{X}\geq c/\sqrt{6}\right\} :c\geq1.64\right\}\] have the correct level \(\alpha=0.05\) (note \(1.64\approx z_{0.95}\)). Since \(\bar{X}\sim N\left(\theta,1/6\right)\), the power function for tests in \(\Psi\) is \[\begin{aligned} \beta_{\phi}\left(\theta\right) & =P\left(\bar{X}\geq\frac{c}{\sqrt{6}}\right)=P\left(\frac{\bar{X}-\theta}{1/\sqrt{6}}\geq\frac{\frac{c}{\sqrt{6}}-\theta}{1/\sqrt{6}}\right)\\ & =P\left(N\geq c-\sqrt{6}\theta\right)=1-\Phi\left(c-\sqrt{6}\theta\right)\end{aligned}\] where \(N=\frac{\bar{X}-\theta}{1/\sqrt{6}}\) follows the standard normal distribution, and \(\Phi\) is the cdf of the standard normal. It is clear that \(\beta_{\phi}\left(\theta\right)\) is monotonically decreasing in \(c\). Thus the test function \[\phi_{\theta=0}\left(\mathbf{X}\right)=1\left\{ \bar{X}\geq1.64/\sqrt{6}\right\}\] is the most powerful test in \(\Psi\), as \(c=1.64\) is the smallest critical value that \(\Psi\) allows in order to keep the level \(\alpha\).
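
To visualize the comparison, the following sketch (an illustration, not from the original text) plots the power function \(\beta_{\phi}\left(\theta\right)=1-\Phi\left(c-\sqrt{6}\theta\right)\) for several admissible critical values \(c\).

```r
# Power functions of the tests 1{xbar >= c/sqrt(6)} for several c >= 1.64;
# smaller c gives uniformly higher power on the alternative theta > 0.
theta <- seq(-1, 2, by = 0.01)
power <- function(cc, theta) 1 - pnorm(cc - sqrt(6) * theta)
plot(theta, power(1.64, theta), type = "l", ylim = c(0, 1),
     xlab = expression(theta), ylab = "power")
lines(theta, power(2.00, theta), lty = 2)
lines(theta, power(2.50, theta), lty = 3)
abline(v = 0, h = 0.05, col = "gray")   # level 0.05 is attained at theta = 0 when c = 1.64
legend("topleft", legend = c("c = 1.64", "c = 2.00", "c = 2.50"), lty = 1:3)
```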

9.1.3 Likelihood-Ratio Test and Wilks’ theorem

When estimators are not available in closed form, the likelihood-ratio test (LRT) serves as a very general test statistic under the likelihood principle. Let \(\ell_{n}\left(\theta\right)=n^{-1}\sum_{i}\log f\left(x_{i};\theta\right)\) be the average sample log-likelihood, and let \(\widehat{\theta}=\arg\max_{\theta\in\Theta}\ell_{n}\left(\theta\right)\) be the maximum likelihood estimator (MLE). Take a second-order Taylor expansion of \(\ell_{n}\left(\theta_{0}\right)\) around \(\widehat{\theta}\): \[\begin{aligned} \ell_{n}\left(\theta_{0}\right)-\ell_{n}\left(\widehat{\theta}\right) & =\frac{\partial\ell_{n}}{\partial\theta}\left(\widehat{\theta}\right)'\left(\theta_{0}-\widehat{\theta}\right)+\frac{1}{2}\left(\theta_{0}-\widehat{\theta}\right)'\left(\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\right)\left(\theta_{0}-\widehat{\theta}\right)\\ & =\frac{1}{2}\left(\widehat{\theta}-\theta_{0}\right)'\left(\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\right)\left(\widehat{\theta}-\theta_{0}\right),\end{aligned}\] where \(\dot{\theta}\) lies on the line segment between \(\theta_{0}\) and \(\widehat{\theta}\), and the first-order term vanishes because \(\frac{\partial\ell_{n}}{\partial\theta}\left(\widehat{\theta}\right)=0\) by the first-order condition of optimality. Define \(L_{n}\left(\theta\right):=\sum_{i}\log f\left(x_{i};\theta\right)\), and the likelihood-ratio statistic as \[\mathcal{LR}:=2\left(L_{n}\left(\widehat{\theta}\right)-L_{n}\left(\theta_{0}\right)\right)=2n\left(\ell_{n}\left(\widehat{\theta}\right)-\ell_{n}\left(\theta_{0}\right)\right).\] Obviously \(\mathcal{LR}\geq0\) because \(\widehat{\theta}\) maximizes \(\ell_{n}\left(\theta\right)\). Multiply both sides of the above Taylor expansion by \(-2n\): \[\mathcal{LR}=\sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)'\left(-\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\right)\sqrt{n}\left(\widehat{\theta}-\theta_{0}\right).\] Notice that when the model is correctly specified, \(\widehat{\theta}\stackrel{p}{\to}\theta_{0}\) (and hence \(\dot{\theta}\stackrel{p}{\to}\theta_{0}\)), so that \[\begin{aligned} -\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right) & \stackrel{p}{\to}-\mathcal{H}\left(\theta_{0}\right)=\mathcal{I}\left(\theta_{0}\right)\\ \sqrt{n}\left(\widehat{\theta}-\theta_{0}\right) & \stackrel{d}{\to}N\left(0,\mathcal{I}^{-1}\left(\theta_{0}\right)\right).\end{aligned}\] By Slutsky’s theorem, \[\left(-\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\right)^{1/2}\left[\sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)\right]\stackrel{d}{\to}\mathcal{I}^{1/2}\left(\theta_{0}\right)\times N\left(0,\mathcal{I}^{-1}\left(\theta_{0}\right)\right)\sim N\left(0,I_{K}\right),\] and then \(\mathcal{LR}\stackrel{d}{\to}\chi_{K}^{2}\) by the continuous mapping theorem, where \(K\) is the dimension of \(\theta\).

Wilks’ theorem, or Wilks’ phenomenon, refers to the fact that \(\mathcal{LR}\stackrel{d}{\to}\chi_{K}^{2}\) when the parametric model is correctly specified.
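
Wilks’ phenomenon can be visualized by simulation. The sketch below (an illustration under assumed design choices: an Exponential model parameterized by its rate, a simple null, and arbitrary \(n\) and replication numbers) compares the simulated distribution of \(\mathcal{LR}\) with the \(\chi_{1}^{2}\) distribution.

```r
# Simulation sketch of Wilks' theorem: X_i ~ iid Exponential(rate = theta), H0: theta = 1.
# Here K = 1, so the LR statistic should be approximately chi-square with 1 degree of freedom.
set.seed(2)
n <- 200; Rep <- 2000
lr <- replicate(Rep, {
  x <- rexp(n, rate = 1)                      # data generated under the null
  theta_hat <- 1 / mean(x)                    # MLE of the rate parameter
  loglik <- function(theta) n * log(theta) - theta * sum(x)
  2 * (loglik(theta_hat) - loglik(1))         # likelihood-ratio statistic
})
qqplot(qchisq(ppoints(Rep), df = 1), lr,
       xlab = "chi-square(1) quantiles", ylab = "LR quantiles")
abline(0, 1)
```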

9.1.4 Score Test

The score test is based on the score function (the gradient of the log-likelihood) evaluated at the restricted estimator. In the regression context it takes the form of the Lagrangian multiplier (LM) test developed in Section 9.4.2.

9.2 Confidence Interval

An interval estimate is a function \(C:\mathcal{X}^{n}\mapsto\left\{ A:A\subseteq\Theta\right\}\) that maps a point in the sample space to a subset of the parameter space. The coverage probability of an interval estimator \(C\left(\mathbf{X}\right)\) is defined as \(P_{\theta}\left(\theta\in C\left(\mathbf{X}\right)\right)\). When \(\theta\) is of one dimension, we usually call the interval estimator a confidence interval. When \(\theta\) is of multiple dimensions, we call it a confidence region, which of course includes the one-dimensional \(\theta\) as a special case. The coverage probability is the frequency with which the interval estimator captures the true parameter that generates the sample. From the frequentist perspective, the parameter is fixed while the confidence region is random. The coverage probability is not the probability that \(\theta\) is inside a given confidence interval.

Suppose a random sample of size 6 is generated from \(\left(X_{1},\ldots,X_{6}\right)\sim\text{iid }N\left(\theta,1\right).\) Find the coverage probability of the random interval \(\left[\bar{X}-1.96/\sqrt{6},\ \bar{X}+1.96/\sqrt{6}\right].\)
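
A quick simulation sketch of this exercise (with an arbitrary true value \(\theta=2\)) computes the empirical frequency with which the random interval covers \(\theta\); it should be close to 0.95 since \(P\left(\left|N\left(0,1\right)\right|\leq1.96\right)\approx0.95\).

```r
# Simulation sketch: coverage frequency of [xbar - 1.96/sqrt(6), xbar + 1.96/sqrt(6)].
set.seed(3)
theta <- 2; Rep <- 10000
covered <- replicate(Rep, {
  x <- rnorm(6, mean = theta, sd = 1)
  abs(mean(x) - theta) <= 1.96 / sqrt(6)      # does the interval cover theta?
})
mean(covered)   # approximately 0.95
```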

Hypothesis testing and confidence regions are closely related. Sometimes it is difficult to construct a confidence region directly, but easy to test a hypothesis. One way to construct a confidence region is by inverting a test. Suppose that for each \(\theta_{0}\in\Theta\) we have a test \(\phi_{\theta_{0}}\) of the simple null hypothesis \(\Theta_{0}=\left\{ \theta_{0}\right\}\) with level \(\alpha\). If \(C\left(\mathbf{X}\right)\) is constructed as \[C\left(\mathbf{X}\right)=\left\{ \theta\in\Theta:\phi_{\theta}\left(\mathbf{X}\right)=0\right\} ,\] the coverage probability of the true data generating parameter \(\theta\) is \[P_{\theta}\left\{ \theta\in C\left(\mathbf{X}\right)\right\} =P_{\theta}\left\{ \phi_{\theta}\left(\mathbf{X}\right)=0\right\} =1-P_{\theta}\left\{ \phi_{\theta}\left(\mathbf{X}\right)=1\right\} =1-\beta_{\phi_{\theta}}\left(\theta\right)\geq1-\alpha,\] where the last inequality follows because \(\beta_{\phi_{\theta}}\left(\theta\right)\leq\alpha\) under the null \(\left\{ \theta\right\}\). If each test has exact size \(\alpha\), the equality holds.


9.3 Bayesian Credible Set

The Bayesian framework offers a coherent and natural language for statistical decision. However, the major criticism against Bayesian statistics is the arbitrariness of the choice of the prior.

The Bayesian approach views both the data \(\mathbf{X}_{n}\) and the parameter \(\theta\) as random variables. Before observing the data, the researcher holds a prior distribution \(\pi\) about \(\theta\). After observing the data, she updates the prior distribution into a posterior distribution \(p(\theta|\mathbf{X}_{n})\). Bayes’ Theorem connects the prior and the posterior via \[p(\theta|\mathbf{X}_{n})\propto f(\mathbf{X}_{n}|\theta)\pi(\theta),\] where \(f(\mathbf{X}_{n}|\theta)\) is the likelihood function.

Here is a classical example to illustrate the Bayesian approach to statistical inference. Suppose \(\mathbf{X}_{n}=(X_{1},\ldots,X_{n})\) is an iid sample drawn from a normal distribution with unknown mean \(\theta\) and known standard deviation \(\sigma\). If the researcher’s prior distribution is \(\theta\sim N(\theta_{0},\sigma_{0}^{2})\), her posterior distribution is, by some routine calculation, also a normal distribution \[\theta|\mathbf{x}_{n}\sim N\left(\tilde{\theta},\tilde{\sigma}^{2}\right),\] where \(\tilde{\theta}=\frac{\sigma^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\theta_{0}+\frac{n\sigma_{0}^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\bar{x}\) and \(\tilde{\sigma}^{2}=\frac{\sigma_{0}^{2}\sigma^{2}}{n\sigma_{0}^{2}+\sigma^{2}}\). Thus the \(\left(1-\alpha\right)\) Bayesian credible set is \[\left(\tilde{\theta}-z_{1-\alpha/2}\cdot\tilde{\sigma},\ \tilde{\theta}+z_{1-\alpha/2}\cdot\tilde{\sigma}\right).\] This posterior distribution depends on \(\theta_{0}\) and \(\sigma_{0}^{2}\) from the prior. When the sample size is sufficiently large, the posterior can be approximated by \(N(\bar{x},\sigma^{2}/n)\), in which the prior information is overwhelmed by the information accumulated from the data.

In contrast, a frequentist estimates \(\hat{\theta}=\bar{x}\sim N(\theta,\sigma^{2}/n)\). Her \(\left(1-\alpha\right)\) confidence interval is \[\left(\bar{x}-z_{1-\alpha/2}\cdot\sigma/\sqrt{n},\ \bar{x}+z_{1-\alpha/2}\cdot\sigma/\sqrt{n}\right).\] The Bayesian credible set and the frequentist confidence interval are different for finite \(n\), but they coincide as \(n\to\infty\).
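
The following sketch (simulated data; the prior mean and variance are arbitrary illustrative choices) computes the posterior from the formulas above and reports both the Bayesian credible set and the frequentist confidence interval.

```r
# Normal-normal example: posterior, credible set, and confidence interval.
set.seed(4)
n <- 50; sigma <- 1; theta_true <- 0.5
x <- rnorm(n, mean = theta_true, sd = sigma)
theta0 <- 0; sigma0 <- 2                              # prior: theta ~ N(theta0, sigma0^2)
w <- n * sigma0^2 / (n * sigma0^2 + sigma^2)          # weight on the sample mean
theta_tilde <- (1 - w) * theta0 + w * mean(x)         # posterior mean
sigma_tilde2 <- sigma0^2 * sigma^2 / (n * sigma0^2 + sigma^2)   # posterior variance
z <- qnorm(0.975)
c(theta_tilde - z * sqrt(sigma_tilde2), theta_tilde + z * sqrt(sigma_tilde2))  # credible set
c(mean(x) - z * sigma / sqrt(n), mean(x) + z * sigma / sqrt(n))                # confidence interval
```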

9.4 Applications in OLS

We will introduce three tests for a hypothesis about the linear regression coefficients, namely the Wald test, the Lagrangian multiplier (LM) test, and the likelihood-ratio test. The Wald test is based on the unrestricted OLS estimator \(\widehat{\beta}\). The LM test is based on the restricted estimator \(\tilde{\beta}\). The LRT, as we have discussed, is based on the difference between the log-likelihood function evaluated at the unrestricted OLS estimator and that evaluated at the restricted estimator.

Let \(R\) be a \(q\times K\) constant matrix with \(q\leq K\) and \(\mbox{rank}\left(R\right)=q\). All linear restrictions about \(\beta\) can be written in the form of \(R\beta=r\), where \(r\) is a \(q\times1\) constant vector.

Suppose, in a regression with \(K=5\) coefficients (such as the wage example in the next subsection), we want to simultaneously test \(\beta_{1}=1\) and \(\beta_{3}+\beta_{4}=2\). The null hypothesis can be expressed in the general form \(R\beta=r\), where the restriction matrix \(R\) is \[R=\begin{pmatrix}1 & 0 & 0 & 0 & 0\\ 0 & 0 & 1 & 1 & 0 \end{pmatrix}\] and \(r=\left(1,2\right)'\).

9.4.1 Wald Test

Suppose the OLS estimator \(\widehat{\beta}\) is asymptotically normal, i.e. \[\sqrt{n}\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}N\left(0,\Omega\right)\] where \(\Omega\) is a \(K\times K\) positive definite covariance matrix. Since \(R\sqrt{n}\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}N\left(0,R\Omega R'\right)\), the quadratic form \[n\left(\widehat{\beta}-\beta\right)'R'\left(R\Omega R'\right)^{-1}R\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}\chi_{q}^{2}.\] Now we intend to test the linear null hypothesis \(R\beta=r\). Under the null, the Wald statistic \[\mathcal{W}=n\left(R\widehat{\beta}-r\right)'\left(R\widehat{\Omega}R'\right)^{-1}\left(R\widehat{\beta}-r\right)\stackrel{d}{\to}\chi_{q}^{2}\] where \(\widehat{\Omega}\) is a consistent estimator of \(\Omega\).

(Single test) In a linear regression \[\begin{aligned}y_{i} & =x_{i}'\beta+e_{i}=\sum_{k=1}^{5}\beta_{k}x_{ik}+e_{i},\nonumber\\ E\left[e_{i}x_{i}\right] & =\mathbf{0}_{5},\label{eq:example} \end{aligned}\] where \(y_{i}\) is wage and \[x_{i}=\left(\mbox{edu},\mbox{age},\mbox{experience},\mbox{experience}^{2},1\right)'.\] To test whether education affects wage, we specify the null hypothesis \(\beta_{1}=0\). Let \(R=\left(1,0,0,0,0\right)\) and \(r=0\). \[\sqrt{n}\widehat{\beta}_{1}=\sqrt{n}\left(\widehat{\beta}_{1}-\beta_{1}\right)=\sqrt{n}R\left(\widehat{\beta}-\beta\right)\stackrel{d}{\to}N\left(0,R\Omega R'\right)\sim N\left(0,\Omega_{11}\right),\label{eq:R11}\] where \(\Omega_{11}\) is the \(\left(1,1\right)\) (scalar) element of \(\Omega\). Under \[H_{0}:R\beta=\left(1,0,0,0,0\right)\left(\beta_{1},\ldots,\beta_{5}\right)'=\beta_{1}=0,\] we have \(\sqrt{n}R\left(\widehat{\beta}-\beta\right)=\sqrt{n}\widehat{\beta}_{1}\stackrel{d}{\to}N\left(0,\Omega_{11}\right).\) Therefore, \[\sqrt{n}\frac{\widehat{\beta}_{1}}{\widehat{\Omega}_{11}^{1/2}}=\sqrt{\frac{\Omega_{11}}{\widehat{\Omega}_{11}}}\sqrt{n}\frac{\widehat{\beta}_{1}}{\sqrt{\Omega_{11}}}.\] If \(\widehat{\Omega}\stackrel{p}{\to}\Omega\), then \(\left(\Omega_{11}/\widehat{\Omega}_{11}\right)^{1/2}\stackrel{p}{\to}1\) by the continuous mapping theorem. As \(\sqrt{n}\widehat{\beta}_{1}/\Omega_{11}^{1/2}\stackrel{d}{\to}N\left(0,1\right)\), we conclude \(\sqrt{n}\widehat{\beta}_{1}/\widehat{\Omega}_{11}^{1/2}\stackrel{d}{\to}N\left(0,1\right).\)

The above example is a test about a single coefficient. The Wald statistic in this case is essentially the square of the t-statistic, and its null distribution is \(\chi_{1}^{2}\), the square of a standard normal.
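
The mechanics of the Wald test can be illustrated with simulated data (this is only a sketch: the data-generating design, the heteroskedastic error, and the sample size are arbitrary choices, not the wage data). It tests the joint hypothesis \(R\beta=r\) with the \(R\) and \(r\) given above, using a heteroskedasticity-robust (sandwich) estimate of \(\Omega\).

```r
# Wald test of the joint hypothesis beta_1 = 1 and beta_3 + beta_4 = 2 on simulated data.
set.seed(5)
n <- 500
X <- cbind(rnorm(n), rnorm(n), rnorm(n), rnorm(n), 1)    # K = 5, last column is the intercept
beta <- c(1, 0.5, 1.2, 0.8, 2)                           # satisfies the null
e <- rnorm(n) * (1 + 0.5 * abs(X[, 1]))                  # heteroskedastic error
y <- X %*% beta + e
b_hat <- solve(t(X) %*% X, t(X) %*% y)                   # unrestricted OLS
e_hat <- as.vector(y - X %*% b_hat)
Q_hat <- t(X) %*% X / n
Omega_hat <- solve(Q_hat) %*% (t(X * e_hat^2) %*% X / n) %*% solve(Q_hat)   # sandwich form
R_mat <- rbind(c(1, 0, 0, 0, 0),
               c(0, 0, 1, 1, 0))
r_vec <- c(1, 2)
W <- n * t(R_mat %*% b_hat - r_vec) %*%
  solve(R_mat %*% Omega_hat %*% t(R_mat)) %*% (R_mat %*% b_hat - r_vec)
c(W, qchisq(0.95, df = 2))    # compare the statistic with the chi-square(2) critical value
```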

In order to test a nonlinear hypothesis, we use the delta method.

(This is not a good example because it can be rewritten into a linear hypothesis.) In the example of linear regression, the optimal experience level can be found by setting to zero the first-order condition with respect to experience, \(\beta_{3}+2\beta_{4}\mbox{experience}^{*}=0\). We test the hypothesis that the optimal experience level is 20 years; in other words, \[\mbox{experience}^{*}=-\frac{\beta_{3}}{2\beta_{4}}=20.\] This is a nonlinear hypothesis. If \(q\leq K\) where \(q\) is the number of restrictions, we have \[n\left(f\left(\widehat{\theta}\right)-f\left(\theta_{0}\right)\right)'\left(\frac{\partial f}{\partial\theta}\left(\theta_{0}\right)\Omega\frac{\partial f}{\partial\theta}\left(\theta_{0}\right)'\right)^{-1}\left(f\left(\widehat{\theta}\right)-f\left(\theta_{0}\right)\right)\stackrel{d}{\to}\chi_{q}^{2},\] where in this example, \(\theta=\beta\), \(f\left(\beta\right)=-\beta_{3}/\left(2\beta_{4}\right)\). The gradient is \[\frac{\partial f}{\partial\beta'}\left(\beta\right)=\left(0,0,-\frac{1}{2\beta_{4}},\frac{\beta_{3}}{2\beta_{4}^{2}},0\right).\] Since \(\widehat{\beta}\stackrel{p}{\to}\beta_{0}\), by the continuous mapping theorem, if \(\beta_{0,4}\neq0\), we have \(\frac{\partial}{\partial\beta}f\left(\widehat{\beta}\right)\stackrel{p}{\to}\frac{\partial}{\partial\beta}f\left(\beta_{0}\right)\). Therefore, the (nonlinear) Wald statistic is \[\mathcal{W}=n\left(f\left(\widehat{\beta}\right)-20\right)'\left(\frac{\partial f}{\partial\beta'}\left(\widehat{\beta}\right)\widehat{\Omega}\frac{\partial f}{\partial\beta'}\left(\widehat{\beta}\right)'\right)^{-1}\left(f\left(\widehat{\beta}\right)-20\right)\stackrel{d}{\to}\chi_{1}^{2}.\] This is a valid test with correct asymptotic size.

However, we can equivalently state the null hypothesis as \(\beta_{3}+40\beta_{4}=0\) and construct a Wald statistic accordingly. Although the two statistics are asymptotically equivalent, in general a linear hypothesis is preferred to a nonlinear one, due to the approximation error of the delta method under the null and, more importantly, the invalidity of the Taylor expansion under the alternative. This also highlights the problem that the Wald test is not invariant to re-parametrization.
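
The point can be illustrated numerically. The sketch below (simulated data with arbitrary coefficients chosen so that \(-\beta_{3}/(2\beta_{4})=20\) holds) computes both the delta-method Wald statistic for the nonlinear hypothesis and the Wald statistic for the equivalent linear restriction \(\beta_{3}+40\beta_{4}=0\); the two are close but generally not identical in finite samples.

```r
# Nonlinear (delta-method) versus linear Wald statistic for the same null hypothesis.
set.seed(6)
n <- 500
X <- cbind(rnorm(n), rnorm(n), rnorm(n), rnorm(n), 1)
beta <- c(1, 0.5, 20, -0.5, 2)                   # -beta_3 / (2 * beta_4) = 20 holds
y <- X %*% beta + rnorm(n)
b <- solve(t(X) %*% X, t(X) %*% y)
e_hat <- as.vector(y - X %*% b)
Q_hat <- t(X) %*% X / n
Omega_hat <- solve(Q_hat) %*% (t(X * e_hat^2) %*% X / n) %*% solve(Q_hat)
# delta method: f(beta) = -beta_3 / (2 * beta_4), gradient as in the text
f_hat <- -b[3] / (2 * b[4])
grad <- c(0, 0, -1 / (2 * b[4]), b[3] / (2 * b[4]^2), 0)
W_nonlinear <- n * (f_hat - 20)^2 / (t(grad) %*% Omega_hat %*% grad)
# equivalent linear restriction: beta_3 + 40 * beta_4 = 0
R_lin <- matrix(c(0, 0, 1, 40, 0), nrow = 1)
W_linear <- n * (R_lin %*% b)^2 / (R_lin %*% Omega_hat %*% t(R_lin))
c(W_nonlinear, W_linear, qchisq(0.95, df = 1))   # both are compared with chi-square(1)
```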

9.4.2 Lagrangian Multiplier Test

The key difference between the Wald test and the LM test is that the former is based on the unrestricted OLS estimator while the latter is based on the restricted OLS estimator. The restricted OLS estimator solves \[\min_{\beta}\left(y-X\beta\right)'\left(y-X\beta\right)\mbox{ s.t. }R\beta=r.\] This constrained minimization problem can be converted into an unconstrained one via \[L\left(\beta,\lambda\right)=\frac{1}{2n}\left(y-X\beta\right)'\left(y-X\beta\right)+\lambda'\left(R\beta-r\right),\label{eq:Lagran}\] where \(L\left(\beta,\lambda\right)\) is called the Lagrangian, and \(\lambda\) is the Lagrangian multiplier.

The LM test is also called the score test, because the derivation is based on the score function of the restricted OLS estimator. Set the first-order conditions of \[eq:Lagran\] to zero: \[\begin{aligned} \frac{\partial}{\partial\beta}L & =-\frac{1}{n}X'\left(y-X\tilde{\beta}\right)+R'\tilde{\lambda}=-\frac{1}{n}X'e+\frac{1}{n}X'X\left(\tilde{\beta}-\beta_{0}\right)+R'\tilde{\lambda}=0,\\ \frac{\partial}{\partial\lambda}L & =R\tilde{\beta}-r=R\left(\tilde{\beta}-\beta_{0}\right)=0,\end{aligned}\] where \(\tilde{\beta}\) and \(\tilde{\lambda}\) denote the roots of these equations, \(\beta_{0}\) is the hypothesized true value, and the second equality in the first line uses \(y=X\beta_{0}+e\) under the null. The two equations can be written as a linear system \[\begin{pmatrix}\widehat{Q} & R'\\ R & 0 \end{pmatrix}\begin{pmatrix}\tilde{\beta}-\beta_{0}\\ \tilde{\lambda} \end{pmatrix}=\begin{pmatrix}\frac{1}{n}X'e\\ 0 \end{pmatrix},\] where \(\widehat{Q}=X'X/n\).

By the formula for the inverse of a partitioned matrix, \[\begin{pmatrix}\widehat{Q}^{-1}-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1} & \widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}\\ \left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1} & -\left(R\widehat{Q}^{-1}R'\right)^{-1} \end{pmatrix}\begin{pmatrix}\widehat{Q} & R'\\ R & 0 \end{pmatrix}=I_{K+q}.\]

Given the above fact, we can explicitly express \[\begin{aligned} \begin{pmatrix}\tilde{\beta}-\beta_{0}\\ \tilde{\lambda} \end{pmatrix} & =\begin{pmatrix}\widehat{Q}^{-1}-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1} & \widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}\\ \left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1} & -\left(R\widehat{Q}^{-1}R'\right)^{-1} \end{pmatrix}\begin{pmatrix}\frac{1}{n}X'e\\ 0 \end{pmatrix}\\ & =\begin{pmatrix}\widehat{Q}^{-1}\frac{1}{n}X'e-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{n}X'e\\ \left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{n}X'e \end{pmatrix}.\end{aligned}\] The \(\tilde{\lambda}\) component is \[\begin{aligned} \sqrt{n}\tilde{\lambda} & =\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e\\ & \stackrel{d}{\to}N\left(0,\left(RQ^{-1}R'\right)^{-1}RQ^{-1}\Omega Q^{-1}R'\left(RQ^{-1}R'\right)^{-1}\right)\end{aligned}\] as \(\widehat{Q}\stackrel{p}{\to}Q\), where in this derivation \(\Omega\) denotes the asymptotic variance of \(n^{-1/2}X'e\). Denote \(\Sigma=\left(RQ^{-1}R'\right)^{-1}RQ^{-1}\Omega Q^{-1}R'\left(RQ^{-1}R'\right)^{-1}\); then \[n\tilde{\lambda}'\Sigma^{-1}\tilde{\lambda}\stackrel{d}{\to}\chi_{q}^{2}.\] Let \[\widehat{\Sigma}=\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\widehat{\Omega}\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}.\] If \(\widehat{\Omega}\stackrel{p}{\to}\Omega\), we have \[\begin{aligned} \mathcal{LM} & =n\tilde{\lambda}'\widehat{\Sigma}^{-1}\tilde{\lambda}=n\tilde{\lambda}'\Sigma^{-1}\tilde{\lambda}+n\tilde{\lambda}'\left(\widehat{\Sigma}^{-1}-\Sigma^{-1}\right)\tilde{\lambda}\\ & =n\tilde{\lambda}'\Sigma^{-1}\tilde{\lambda}+o_{p}\left(1\right)\stackrel{d}{\to}\chi_{q}^{2}.\end{aligned}\] This is the general expression of the LM test.

In the special case of homoskedasticity, \(\Sigma=\sigma^{2}\left(RQ^{-1}R'\right)^{-1}RQ^{-1}QQ^{-1}R'\left(RQ^{-1}R'\right)^{-1}=\sigma^{2}\left(RQ^{-1}R'\right)^{-1}.\) Replacing \(\Sigma\) with its estimate \(\hat{\Sigma}=\hat{\sigma}^{2}(R\hat{Q}^{-1}R')^{-1}\), where \(\hat{\sigma}^{2}\) is a consistent estimator of \(\sigma^{2}\) (for example the restricted residual variance), we have \[\begin{aligned}\frac{n\tilde{\lambda}'R\hat{Q}^{-1}R'\tilde{\lambda}}{\hat{\sigma}^{2}} & =\frac{1}{n\hat{\sigma}^{2}}\left(y-X\tilde{\beta}\right)'X\hat{Q}^{-1}R'(R\hat{Q}^{-1}R')^{-1}R\hat{Q}^{-1}X'\left(y-X\tilde{\beta}\right)\stackrel{d}{\to}\chi_{q}^{2}.\end{aligned}\]
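
A sketch of the homoskedastic LM statistic on simulated data follows (illustrative design; the restricted estimator is computed by the standard constrained least-squares formula \(\tilde{\beta}=\widehat{\beta}-\left(X'X\right)^{-1}R'\left(R\left(X'X\right)^{-1}R'\right)^{-1}\left(R\widehat{\beta}-r\right)\), and \(\hat{\sigma}^{2}\) is the restricted residual variance).

```r
# LM (score) test under homoskedasticity for the joint hypothesis R beta = r.
set.seed(7)
n <- 500
X <- cbind(rnorm(n), rnorm(n), rnorm(n), rnorm(n), 1)
beta <- c(1, 0.5, 1.2, 0.8, 2)                           # the null R beta = r holds
y <- X %*% beta + rnorm(n)
R_mat <- rbind(c(1, 0, 0, 0, 0), c(0, 0, 1, 1, 0)); r_vec <- c(1, 2)
XXinv <- solve(t(X) %*% X)
b_hat <- XXinv %*% t(X) %*% y                            # unrestricted OLS
b_tilde <- b_hat - XXinv %*% t(R_mat) %*%
  solve(R_mat %*% XXinv %*% t(R_mat)) %*% (R_mat %*% b_hat - r_vec)   # restricted OLS
e_tilde <- as.vector(y - X %*% b_tilde)                  # restricted residuals
sigma2_hat <- mean(e_tilde^2)
Q_hat_inv <- n * XXinv                                   # inverse of X'X/n
score <- t(X) %*% e_tilde / n                            # score; "small" under the null
LM <- n * t(score) %*% Q_hat_inv %*% t(R_mat) %*%
  solve(R_mat %*% Q_hat_inv %*% t(R_mat)) %*% R_mat %*% Q_hat_inv %*% score / sigma2_hat
c(LM, qchisq(0.95, df = 2))
```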

Suppose we test the hypothesis that the optimal experience level is 20 years, that is, \(\mbox{experience}^{*}=-\frac{\beta_{3}}{2\beta_{4}}=20.\) Under the null we can replace \(\beta_{3}\) by \(-40\beta_{4}\), so we only need to estimate 3 slope coefficients in the restricted OLS to construct the LM test. Moreover, the LM test is invariant to re-parametrization.

9.4.3 Likelihood-Ratio Test for Regression

In the previous section we discussed the LRT. Here we put it into the context of regression with Gaussian errors. Let \(\gamma=\sigma_{e}^{2}\). Under the classical assumptions of the normal regression model, \[L_{n}\left(\beta,\gamma\right)=-\frac{n}{2}\log\left(2\pi\right)-\frac{n}{2}\log\gamma-\frac{1}{2\gamma}\left(Y-X\beta\right)'\left(Y-X\beta\right).\] For the unrestricted estimator, we know \[\widehat{\gamma}=\gamma\left(\widehat{\beta}\right)=n^{-1}\left(Y-X\widehat{\beta}\right)'\left(Y-X\widehat{\beta}\right)\] and the sample log-likelihood function evaluated at the MLE is \[\widehat{L}_{n}=L_{n}\left(\widehat{\beta},\widehat{\gamma}\right)=-\frac{n}{2}\log\left(2\pi\right)-\frac{n}{2}\log\widehat{\gamma}-\frac{n}{2};\] similarly, with \(\tilde{\gamma}=n^{-1}\left(Y-X\tilde{\beta}\right)'\left(Y-X\tilde{\beta}\right)\), the log-likelihood evaluated at the restricted estimator is \(\tilde{L}_{n}=L_{n}\left(\tilde{\beta},\tilde{\gamma}\right)=-\frac{n}{2}\log\left(2\pi\right)-\frac{n}{2}\log\tilde{\gamma}-\frac{n}{2}\). The likelihood-ratio statistic is \[\begin{aligned} \mathcal{LR} & =2\left(\widehat{L}_{n}-\tilde{L}_{n}\right)=n\log\left(\tilde{\gamma}/\widehat{\gamma}\right).\end{aligned}\] If the normal regression model is correctly specified, we can immediately conclude \(\mathcal{LR}\stackrel{d}{\to}\chi_{q}^{2}.\)
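
A sketch of the regression LR statistic on simulated data (illustrative design, with the same restriction \(R\beta=r\) as before; the restricted estimator uses the same constrained least-squares formula as in the LM sketch).

```r
# Likelihood-ratio statistic LR = n * log(gamma_tilde / gamma_hat) for R beta = r.
set.seed(8)
n <- 500
X <- cbind(rnorm(n), rnorm(n), rnorm(n), rnorm(n), 1)
beta <- c(1, 0.5, 1.2, 0.8, 2)                           # the null holds
y <- X %*% beta + rnorm(n)
R_mat <- rbind(c(1, 0, 0, 0, 0), c(0, 0, 1, 1, 0)); r_vec <- c(1, 2)
XXinv <- solve(t(X) %*% X)
b_hat <- XXinv %*% t(X) %*% y
b_tilde <- b_hat - XXinv %*% t(R_mat) %*%
  solve(R_mat %*% XXinv %*% t(R_mat)) %*% (R_mat %*% b_hat - r_vec)
gamma_hat <- mean((y - X %*% b_hat)^2)                   # unrestricted mean squared residual
gamma_tilde <- mean((y - X %*% b_tilde)^2)               # restricted mean squared residual
LR <- n * log(gamma_tilde / gamma_hat)
c(LR, qchisq(0.95, df = 2))
```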

Now we drop the Gaussian error assumption while keeping conditional homoskedasticity. In this case, the classical result is not applicable because \(L_{n}\left(\beta,\gamma\right)\) is not a (genuine) log-likelihood function; instead, it is a quasi-log-likelihood function. Notice \[\begin{aligned} \mathcal{LR} & =n\log\left(1+\frac{\tilde{\gamma}-\widehat{\gamma}}{\widehat{\gamma}}\right)=n\left(\log1+\frac{\tilde{\gamma}-\widehat{\gamma}}{\widehat{\gamma}}+O\left(\frac{\left|\tilde{\gamma}-\widehat{\gamma}\right|^{2}}{\widehat{\gamma}^{2}}\right)\right)\nonumber \\ & =n\frac{\tilde{\gamma}-\widehat{\gamma}}{\widehat{\gamma}}+o_{p}\left(1\right)\label{eq:LRT1}\end{aligned}\] by a Taylor expansion of \(\log\left(1+\frac{\tilde{\gamma}-\widehat{\gamma}}{\widehat{\gamma}}\right)\) around \(\log1=0\). We focus on \[\begin{aligned} n\left(\tilde{\gamma}-\widehat{\gamma}\right) & =n\left(\gamma\left(\tilde{\beta}\right)-\gamma\left(\widehat{\beta}\right)\right)\nonumber \\ & =n\left(\frac{\partial\gamma\left(\widehat{\beta}\right)}{\partial\beta}\left(\tilde{\beta}-\widehat{\beta}\right)+\frac{1}{2}\left(\tilde{\beta}-\widehat{\beta}\right)'\frac{\partial^{2}\gamma\left(\widehat{\beta}\right)}{\partial\beta\partial\beta'}\left(\tilde{\beta}-\widehat{\beta}\right)+O\left(\left\Vert \tilde{\beta}-\widehat{\beta}\right\Vert _{2}^{3}\right)\right)\nonumber \\ & =\sqrt{n}\left(\tilde{\beta}-\widehat{\beta}\right)'\widehat{Q}\sqrt{n}\left(\tilde{\beta}-\widehat{\beta}\right)+o_{p}\left(1\right)\label{eq:LRT2}\end{aligned}\] where the last line follows by \(\frac{\partial\gamma\left(\widehat{\beta}\right)}{\partial\beta}=-\frac{2}{n}X'\left(Y-X\widehat{\beta}\right)=-\frac{2}{n}X'\widehat{e}=0\) and \(\frac{1}{2}\cdot\frac{\partial^{2}\gamma\left(\widehat{\beta}\right)}{\partial\beta\partial\beta'}=\frac{1}{2}\cdot\frac{2}{n}X'X=\widehat{Q}\) (in fact, since \(\gamma\left(\beta\right)\) is quadratic in \(\beta\), the cubic remainder is exactly zero).

From the derivation of the LM test, we have \[\begin{aligned}\sqrt{n}\left(\tilde{\beta}-\beta_{0}\right) & =\left(\widehat{Q}^{-1}-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\right)\frac{1}{\sqrt{n}}X'e\\ & =\sqrt{n}\left(X'X\right)^{-1}X'e-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e\\ & =\sqrt{n}\left(\widehat{\beta}-\beta_{0}\right)-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e. \end{aligned}\] Rearrange the above equation to obtain \[\sqrt{n}\left(\tilde{\beta}-\widehat{\beta}\right)=-\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e\] and thus the quadratic form \[\begin{aligned} & & \sqrt{n}\left(\tilde{\beta}-\widehat{\beta}\right)'\widehat{Q}\sqrt{n}\left(\tilde{\beta}-\widehat{\beta}\right)\nonumber \\ & = & \frac{1}{\sqrt{n}}e'X\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\widehat{Q}\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e\nonumber \\ & = & \frac{1}{\sqrt{n}}e'X\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e\nonumber \\ & = & \frac{1}{\sqrt{n}}e'X\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'e.\label{eq:LRT3}\end{aligned}\] Collecting (\[eq:LRT1\]), (\[eq:LRT2\]) and (\[eq:LRT3\]), we have \[\begin{aligned} \mathcal{LR} & =n\frac{\sigma_{e}^{2}}{\widehat{\gamma}}\cdot\frac{\tilde{\gamma}-\widehat{\gamma}}{\sigma_{e}^{2}}+o_{p}\left(1\right)\\ & =\frac{\sigma_{e}^{2}}{\widehat{\gamma}}\frac{1}{\sqrt{n}}\frac{e}{\sigma_{e}}'X\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'\frac{e}{\sigma_{e}}+o_{p}\left(1\right).\end{aligned}\] Notice that under homoskedasticity, the CLT gives \[\begin{aligned} R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'\frac{e}{\sigma_{e}} & =R\widehat{Q}^{-1/2}\widehat{Q}^{-1/2}\frac{1}{\sqrt{n}}X'\frac{e}{\sigma_{e}}\\ & \stackrel{d}{\to}RQ^{-1/2}\times N\left(0,I_{K}\right)\sim N\left(0,RQ^{-1}R'\right),\end{aligned}\] and thus \[\frac{1}{\sqrt{n}}\frac{e}{\sigma_{e}}'X\widehat{Q}^{-1}R'\left(R\widehat{Q}^{-1}R'\right)^{-1}R\widehat{Q}^{-1}\frac{1}{\sqrt{n}}X'\frac{e}{\sigma_{e}}\stackrel{d}{\to}\chi_{q}^{2}.\] Moreover, \(\frac{\sigma_{e}^{2}}{\widehat{\gamma}}\stackrel{p}{\to}1\). By Slutsky’s theorem, we conclude \[\mathcal{LR}\stackrel{d}{\to}\chi_{q}^{2}\] under homoskedasticity.

9.5 Summary

Applied econometrics is a field obsessed with hypothesis testing, in the hope of establishing at least statistical association and ideally causality. Hypothesis testing is a fundamentally important topic in statistics. The states and the decisions in Table \[tab:Decisions-and-States\] remind us of the intrinsic connection with game theory in economics. I, a game player, play a sequential game against “nature”.

Step 0:

The parameter space \(\Theta\) is partitioned into the null hypothesis \(\Theta_{0}\) and the alternative hypothesis \(\Theta_{1}\) according to a scientific theory.

Step 1:

Before I observe the data, I design a test function \(\phi\) according to \(\Theta_{0}\) and \(\Theta_{1}\). In game theory terminology, the contingency plan \(\phi\) is my strategy.

Step 2:

Once I observe the fixed data \(\mathbf{x}\), I act according to the instruction of \(\phi\left(\mathbf{x}\right)\) — either accept \(\Theta_{0}\) or reject \(\Theta_{0}\).

Step 3:

Nature reveals the true parameter \(\theta^{*}\) behind \(\mathbf{x}\). Then I can evaluate the gain/loss of my decision \(\phi\left(\mathbf{x}\right)\).

When the loss function (negative payoff) is specified as \[\mathscr{L}\left(\theta,\phi\left(\mathbf{x}\right)\right)=\phi\left(\mathbf{x}\right)\cdot1\left\{ \theta\in\Theta_{0}\right\} +\left(1-\phi\left(\mathbf{x}\right)\right)\cdot1\left\{ \theta\in\Theta_{1}\right\} ,\] the randomness of the data incurs the risk (expected loss) \[\mathscr{R}\left(\theta,\phi\right)=E\left[\mathscr{L}\left(\theta,\phi\left(\mathbf{X}\right)\right)\right]=\beta_{\phi}\left(\theta\right)\cdot1\left\{ \theta\in\Theta_{0}\right\} +\left(1-\beta_{\phi}\left(\theta\right)\right)\cdot1\left\{ \theta\in\Theta_{1}\right\} .\] I am a rational person. I understand the structure of the game and I want to do a good job in Step 1 in designing my strategy. I want to minimize my risk.

If I am a frequentist, exactly one of \(1\left\{ \theta\in\Theta_{0}\right\}\) and \(1\left\{ \theta\in\Theta_{1}\right\}\) equals one. A test of level \(\alpha\) makes sure \(\sup_{\theta\in\Theta_{0}}\beta_{\phi}\left(\theta\right)\leq\alpha\), which bounds the risk under the null. When many tests have the correct level, ideally I would like to pick the best one. If it exists, in a class \(\Psi_{\alpha}\) of tests of level \(\alpha\) the uniformly most powerful test \(\phi^{*}\) satisfies \(\mathscr{R}\left(\theta,\phi^{*}\right)\leq\mathscr{R}\left(\theta,\phi\right)\) for every \(\theta\in\Theta_{1}\) and every \(\phi\in\Psi_{\alpha}\). For simple versus simple hypotheses, the LRT is the uniformly most powerful test according to the Neyman-Pearson Lemma.

If I am a Bayesian, I do not mind imposing probability (weight) on the parameter space, which is my prior belief \(\pi\left(\theta\right)\). My Bayesian risk becomes \[\begin{aligned} \mathscr{BR}\left(\pi,\phi\right) & =E_{\pi\left(\theta\right)}\left[\mathscr{R}\left(\theta,\phi\right)\right]=\int\left[\beta_{\phi}\left(\theta\right)\cdot1\left\{ \theta\in\Theta_{0}\right\} +\left(1-\beta_{\phi}\left(\theta\right)\right)\cdot1\left\{ \theta\in\Theta_{1}\right\} \right]\pi\left(\theta\right)d\theta\\ & =\int_{\left\{ \theta\in\Theta_{0}\right\} }\beta_{\phi}\left(\theta\right)\pi\left(\theta\right)d\theta+\int_{\left\{ \theta\in\Theta_{1}\right\} }(1-\beta_{\phi}\left(\theta\right))\pi\left(\theta\right)d\theta.\end{aligned}\] This is the average (with respect to \(\pi\left(\theta\right)\)) risk over the null and the alternative.

Historical notes: Hypothesis testing started to take its modern shape at the beginning of the 20th century. Karl Pearson (1857–1936) laid the foundation of hypothesis testing and introduced the \(\chi^{2}\) test and the \(p\)-value, among many other concepts that we keep using today. The Neyman-Pearson Lemma was named after Jerzy Neyman (1894–1981) and Egon Pearson (1895–1980), Karl’s son.

Further reading: Young and Smith (2005) is a concise but in-depth reference for statistical inference.

9.6 Appendix

9.6.1 Neyman-Pearson Lemma

We have discussed an example of the uniformly most powerful test in the Gaussian location model. Under the likelihood principle, if the test is simple versus simple (the null hypothesis is a singleton \(\left\{ \theta_{0}\right\}\) and the alternative hypothesis is another singleton \(\left\{ \theta_{1}\right\}\)), then the LRT \[\begin{aligned} \phi\left(\mathbf{X}\right) & :=1\left\{ \mathcal{LR}\geq c_{LR}\right\} ,\end{aligned}\] where \(c_{LR}\) is the critical value depending on the size of the test, is a uniformly most powerful test. This result is the well-known Neyman-Pearson Lemma.

Notice that \(\exp\left(L_{n}\left(\theta\right)\right)=\Pi_{i}f\left(x_{i};\theta\right)=f\left(\mathbf{x};\theta\right)\), where \(f\left(\mathbf{x};\theta\right)\) is the joint density of \(\left(x_{1},\ldots,x_{n}\right)\). In the simple versus simple setting the LRT can therefore be equivalently written in likelihood-ratio form (without the log) as \[\phi\left(\mathbf{X}\right)=1\left\{ f\left(\mathbf{X};\theta_{1}\right)/f\left(\mathbf{X};\theta_{0}\right)\geq c\right\}\] where \(c:=\exp\left(c_{LR}/2\right)\).

To see that it is the most powerful test in the simple versus simple context, consider another test \(\psi\) of the same size at the simple null hypothesis, \(\int\phi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{0}\right)=\int\psi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{0}\right)=\alpha\), where \(f\left(\mathbf{x};\theta_{0}\right)\) is the joint density of the sample \(\mathbf{X}\) under \(\theta_{0}\). For any constant \(c>0\), the power of \(\phi\) at the alternative \(\theta_{1}\) is \[\begin{aligned} E_{\theta_{1}}\left[\phi\left(\mathbf{X}\right)\right] & =\int\phi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{1}\right)\nonumber \\ & =\int\phi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{1}\right)-c\left[\int\phi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{0}\right)-\int\psi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{0}\right)\right]\nonumber \\ & =\int\phi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{1}\right)-c\int\phi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{0}\right)+c\int\psi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{0}\right)\nonumber \\ & =\int\phi\left(\mathbf{x}\right)\left(f\left(\mathbf{x};\theta_{1}\right)-cf\left(\mathbf{x};\theta_{0}\right)\right)+c\int\psi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{0}\right).\label{eq:NP1}\end{aligned}\] Define \(\xi_{c}:=f\left(\mathbf{x};\theta_{1}\right)-cf\left(\mathbf{x};\theta_{0}\right)\). The fact that \(\phi\left(\mathbf{x}\right)=1\) if \(\xi_{c}\geq0\) and \(\phi\left(\mathbf{x}\right)=0\) if \(\xi_{c}<0\) implies \[\begin{aligned} & & \int\phi\left(\mathbf{x}\right)\left(f\left(\mathbf{x};\theta_{1}\right)-cf\left(\mathbf{x};\theta_{0}\right)\right)=\int\phi\left(\mathbf{x}\right)\xi_{c}\\ & = & \int_{\left\{ \xi_{c}\geq0\right\} }\phi\left(\mathbf{x}\right)\xi_{c}+\int_{\left\{ \xi_{c}<0\right\} }\phi\left(\mathbf{x}\right)\xi_{c}=\int_{\left\{ \xi_{c}\geq0\right\} }\xi_{c}=\int\xi_{c}\cdot1\left\{ \xi_{c}\geq0\right\} \\ & \geq & \int\psi\left(\mathbf{x}\right)\xi_{c}\cdot1\left\{ \xi_{c}\geq0\right\} =\int_{\left\{ \xi_{c}\geq0\right\} }\psi\left(\mathbf{x}\right)\xi_{c}\\ & \geq & \int_{\left\{ \xi_{c}\geq0\right\} }\psi\left(\mathbf{x}\right)\xi_{c}+\int_{\left\{ \xi_{c}<0\right\} }\psi\left(\mathbf{x}\right)\xi_{c}=\int\psi\left(\mathbf{x}\right)\xi_{c}\\ & = & \int\psi\left(\mathbf{x}\right)\left(f\left(\mathbf{x};\theta_{1}\right)-cf\left(\mathbf{x};\theta_{0}\right)\right)\end{aligned}\] where the first inequality follows because the test function \(0\leq\psi\left(\mathbf{x}\right)\leq1\) for any realization of \(\mathbf{x}\), and the second inequality holds because \(\int_{\left\{ \xi_{c}<0\right\} }\psi\left(\mathbf{x}\right)\xi_{c}\leq0\). We continue from \[eq:NP1\]: \[\begin{aligned} E_{\theta_{1}}\left[\phi\left(\mathbf{X}\right)\right] & \geq & \int\psi\left(\mathbf{x}\right)\left(f\left(\mathbf{x};\theta_{1}\right)-cf\left(\mathbf{x};\theta_{0}\right)\right)+c\int\psi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{0}\right)\\ & = & \int\psi\left(\mathbf{x}\right)f\left(\mathbf{x};\theta_{1}\right)=E_{\theta_{1}}\left[\psi\left(\mathbf{X}\right)\right].\end{aligned}\] In other words, \(\phi\left(\mathbf{X}\right)\) is more powerful at \(\theta_{1}\) than any other test \(\psi\) of the same size at the null.

The Neyman-Pearson lemma establishes the optimality of the LRT in simple versus simple hypothesis testing. It can be generalized to show the existence of the uniformly most powerful test for the one-sided composite null hypothesis \(H_{0}:\theta\leq\theta_{0}\) or \(H_{0}:\theta\geq\theta_{0}\) in parametric classes of distributions exhibiting a monotone likelihood ratio.
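
As a numerical illustration of the lemma (a sketch with arbitrary choices: \(X_{i}\sim\text{iid }N\left(\theta,1\right)\), \(n=6\), \(H_{0}:\theta=0\) versus \(H_{1}:\theta=1\), \(\alpha=0.05\)), the LRT rejects for large \(\bar{X}\) because the likelihood ratio is monotone in \(\bar{X}\); we compare its power with that of another size-\(\alpha\) test that uses only the first observation.

```r
# Power comparison at theta = 1: the LRT (reject for large xbar) versus a size-alpha
# test based on X_1 alone. The LRT attains strictly higher power, as the lemma predicts.
n <- 6; alpha <- 0.05; z <- qnorm(1 - alpha)
power_lrt <- 1 - pnorm(z - sqrt(n) * 1)   # P(xbar >= z / sqrt(n)) when theta = 1
power_alt <- 1 - pnorm(z - 1)             # P(X_1 >= z) when theta = 1
c(power_lrt, power_alt)
```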

Zhentao Shi. Nov 4, 2020.