8 Asymptotic Properties of MLE

8.1 Examples of MLE

Normal, Logistic, Probit, Poisson
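Each of these models is estimated by maximizing the sample log-likelihood. Below is a minimal sketch, assuming simulated Poisson data (true rate 3, sample size 1000) and a generic scipy optimizer: it computes the Poisson MLE numerically and checks it against the closed-form answer, the sample mean.

```python
# Minimal sketch (assumed simulated data and optimizer settings): Poisson MLE
# by numerically maximizing the average log-likelihood, compared with the
# closed-form MLE, the sample mean.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
x = rng.poisson(lam=3.0, size=1000)

def neg_avg_loglik(lam):
    # -ell_n(lambda), with log f(x; lambda) = -lambda + x*log(lambda) - log(x!)
    return -np.mean(-lam + x * np.log(lam) - gammaln(x + 1))

res = minimize_scalar(neg_avg_loglik, bounds=(1e-6, 20.0), method="bounded")
print("numerical MLE:", res.x, "closed-form MLE (sample mean):", x.mean())
```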

8.2 Consistency

We specify a parametric distribution (pdf) \(f\left(x;\theta\right)\) and a parameter space \(\Theta\). Define \(Q\left(\theta\right)=E\left[\log f\left(x;\theta\right)\right]\), and let \(\theta_{0}=\arg\max_{\theta\in\Theta}Q\left(\theta\right)\) be the maximizer of the expected log-likelihood. Given a sample of \(n\) observations, we compute the average sample log-likelihood \(\ell_{n}\left(\theta\right)=\frac{1}{n}\sum_{i=1}^{n}\log f\left(x_{i};\theta\right)\). The MLE estimator is \(\widehat{\theta}=\arg\max_{\theta\in\Theta}\ell_{n}\left(\theta\right)\).

We say the model is correctly specified if the data \(\left(x_{1},\ldots,x_{n}\right)\) are generated from the pdf \(f\left(x;\theta\right)\) for some \(\theta\in\Theta\). Otherwise, if the data are not generated from any member of the class of distributions \(\mathcal{M}^{*}:=\left\{ f\left(x;\theta\right):\theta\in\Theta\right\}\), we say the model is misspecified. When the model is misspecified, strictly speaking the log-likelihood function \(\ell_{n}\left(\theta\right)\) should be called the quasi log-likelihood and the MLE estimator \(\widehat{\theta}\) should be called the quasi MLE.

We will discuss under what conditions \(\widehat{\theta}\stackrel{p}{\to}\theta_{0}\), that is, the maximizer of the sample log-likelihood converges in probability to the maximizer of the expected log-likelihood in the population. Notice that, unlike OLS, most MLE estimators do not admit a closed-form solution. They are defined as maximizers and solved by numerical optimization.
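For instance, the probit model has no closed-form MLE. Below is a minimal sketch, assuming a simulated design with an intercept and one regressor and a generic quasi-Newton routine, of maximizing the average probit log-likelihood numerically.

```python
# Probit MLE by numerical optimization (illustrative sketch; the simulated
# design and the starting value are assumptions).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=n)])       # intercept + one regressor
beta_true = np.array([0.5, -1.0])
y = (X @ beta_true + rng.normal(size=n) > 0).astype(float)  # binary outcome

def neg_avg_loglik(beta):
    p = norm.cdf(X @ beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)                        # avoid log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

res = minimize(neg_avg_loglik, x0=np.zeros(2), method="BFGS")
print("probit MLE:", res.x)                                 # close to beta_true for large n
```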

The first requirement for the consistency of MLE is that \(\theta_{0}\) is uniquely defined. Suppose \(\theta_{0}\in\mathrm{int}\left(\Theta\right)\) lies in the interior of \(\Theta\). Let \(N\left(\theta_{0},\varepsilon\right)=\left\{ \theta\in\Theta:\left|\theta-\theta_{0}\right|<\varepsilon\right\}\) be the neighborhood around \(\theta_{0}\) with radius \(\varepsilon>0\).

The value \(\theta_{0}\) is identified if for any \(\varepsilon>0\), there exists a \(\delta=\delta\left(\varepsilon\right)>0\) such that \(Q\left(\theta_{0}\right)>\sup_{\theta\in\Theta\backslash N\left(\theta_{0},\varepsilon\right)}Q\left(\theta\right)+\delta\).
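For a simple example, suppose \(x\sim N\left(\theta_{0},1\right)\) and the model is \(f\left(x;\theta\right)=\phi\left(x-\theta\right)\) with \(\Theta=\mathbb{R}\). Then \[Q\left(\theta\right)=-\frac{1}{2}\log\left(2\pi\right)-\frac{1}{2}\left(1+\left(\theta-\theta_{0}\right)^{2}\right),\] so \(Q\left(\theta_{0}\right)-Q\left(\theta\right)=\left(\theta-\theta_{0}\right)^{2}/2\geq\varepsilon^{2}/2\) for every \(\theta\notin N\left(\theta_{0},\varepsilon\right)\), and the identification condition holds with, say, \(\delta\left(\varepsilon\right)=\varepsilon^{2}/4\).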

We know that, under suitable conditions, the LLN implies \(\ell_{n}\left(\theta\right)\stackrel{p}{\to}Q\left(\theta\right)\) for each \(\theta\in\Theta\). This is a pointwise result, meaning \(\theta\) is taken as fixed as \(n\to\infty\). However, \(\widehat{\theta}\) is random in finite samples, which makes \(\ell_{n}(\widehat{\theta})\) a complicated function of the data, in particular when \(\widehat{\theta}\) has no closed-form solution. We therefore need to strengthen the pointwise LLN.

We say a uniform law of large numbers (ULLN) holds for \(\ell_{n}\left(\theta\right)\) on \(\Theta\) if \[P\left\{ \sup_{\theta\in\Theta}\left|\ell_{n}\left(\theta\right)-Q\left(\theta\right)\right|\geq\varepsilon\right\} \to0\label{eq:ULLN}\] for all \(\varepsilon>0\) as \(n\to\infty\).

The ULLN can be established under the pointwise LLN plus some regularity conditions, for example when \(\Theta\) is a compact set and \(\log f\left(x;\cdot\right)\) is continuous in \(\theta\) almost everywhere on the support of \(x\).
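To visualize what the ULLN asserts, here is a small simulation sketch, assuming \(N\left(0,1\right)\) data, the normal location model \(N\left(\theta,1\right)\), and a grid approximation of the supremum over \(\Theta=\left[-5,5\right]\): the uniform gap \(\sup_{\theta\in\Theta}\left|\ell_{n}\left(\theta\right)-Q\left(\theta\right)\right|\) shrinks as \(n\) grows.

```python
# Monte Carlo illustration of the ULLN for the normal location model
# (assumptions: theta_0 = 0, unit variance, grid approximation of the sup).
import numpy as np

rng = np.random.default_rng(2)
theta_grid = np.linspace(-5, 5, 501)

def sup_gap(n):
    x = rng.normal(size=n)
    # ell_n(theta) = -0.5*log(2*pi) - (1/(2n)) * sum_i (x_i - theta)^2
    ell_n = -0.5 * np.log(2 * np.pi) - 0.5 * np.mean((x[:, None] - theta_grid) ** 2, axis=0)
    # Q(theta) = -0.5*log(2*pi) - 0.5*(1 + theta^2) when x ~ N(0, 1)
    Q = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + theta_grid ** 2)
    return np.max(np.abs(ell_n - Q))

for n in (100, 1000, 10000):
    print(n, sup_gap(n))   # the supremum gap should decrease with n
```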

If \(\theta_{0}\) is identified and the ULLN \[eq:ULLN\] holds, then \(\widehat{\theta}\stackrel{p}{\to}\theta_{0}\).

According to the definition of consistency, we can check \[\begin{aligned} & P\left\{ \left|\widehat{\theta}-\theta_{0}\right|>\varepsilon\right\} \leq P\left\{ Q\left(\theta_{0}\right)-Q(\widehat{\theta})>\delta\right\} \\ & =P\left\{ Q\left(\theta_{0}\right)-\ell_{n}\left(\theta_{0}\right)+\ell_{n}\left(\theta_{0}\right)-\ell_{n}(\widehat{\theta})+\ell_{n}\left(\widehat{\theta}\right)-Q(\widehat{\theta})>\delta\right\} \\ & \leq P\left\{ \left|Q\left(\theta_{0}\right)-\ell_{n}\left(\theta_{0}\right)\right|+\ell_{n}\left(\theta_{0}\right)-\ell_{n}(\widehat{\theta})+\left|\ell_{n}\left(\widehat{\theta}\right)-Q(\widehat{\theta})\right|>\delta\right\} \\ & \leq P\left\{ \left|Q\left(\theta_{0}\right)-\ell_{n}\left(\theta_{0}\right)\right|+\left|\ell_{n}(\widehat{\theta})-Q(\widehat{\theta})\right|\geq\delta\right\} \\ & \leq P\left\{ 2\sup_{\theta\in\Theta}\left|\ell_{n}\left(\theta\right)-Q\left(\theta\right)\right|\geq\delta\right\} =P\left\{ \sup_{\theta\in\Theta}\left|\ell_{n}\left(\theta\right)-Q\left(\theta\right)\right|\geq\frac{\delta}{2}\right\} \to0.\end{aligned}\] The first line holds because of identification, the third line because each term is no greater than its absolute value, the fourth line by the definition of MLE that \(\ell_{n}(\widehat{\theta})\geq\ell_{n}\left(\theta_{0}\right)\), and the last line by the ULLN.

Identification is a necessary condition for consistent estimation. Although \(\widehat{\theta}\) has no closed-form solution in general, we establish consistency via the ULLN over all points \(\theta\in\Theta\) under consideration.

8.3 Asymptotic Normality

The next step is to derive the asymptotic distribution of the MLE estimator. Let \(s\left(x;\theta\right)=\partial\log f\left(x;\theta\right)/\partial\theta\) and \(h\left(x;\theta\right)=\frac{\partial^{2}}{\partial\theta\partial\theta'}\log f\left(x;\theta\right)\) denote the score and the Hessian of the individual log-likelihood, respectively.

\[thm:mis-MLE\] Under suitable regularity conditions, the MLE estimator satisfies \[\sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)\stackrel{d}{\to}N\left(0,\left(E\left[h\left(x;\theta_{0}\right)\right]\right)^{-1}\mathrm{var}\left[s\left(x;\theta_{0}\right)\right]\left(E\left[h\left(x;\theta_{0}\right)\right]\right)^{-1}\right).\]

The “suitable regularity conditions” are not spelled out one by one here; they can be observed in the steps of the proof below.

That \(\widehat{\theta}\) is an interior maximizer entails \(\frac{\partial}{\partial\theta}\ell_{n}\left(\widehat{\theta}\right)=0\). Take a Taylor expansion of \(\frac{\partial}{\partial\theta}\ell_{n}\left(\widehat{\theta}\right)\) around \(\theta_{0}\): \[0-\frac{\partial}{\partial\theta}\ell_{n}\left(\theta_{0}\right)=\frac{\partial}{\partial\theta}\ell_{n}\left(\widehat{\theta}\right)-\frac{\partial}{\partial\theta}\ell_{n}\left(\theta_{0}\right)=\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\left(\widehat{\theta}-\theta_{0}\right)\] where \(\dot{\theta}\) is some point on the line segment connecting \(\widehat{\theta}\) and \(\theta_{0}.\) Rearrange the above equation and multiply both sides by \(\sqrt{n}:\) \[\sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)=-\left(\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\right)^{-1}\sqrt{n}\frac{\partial}{\partial\theta}\ell_{n}\left(\theta_{0}\right).\label{eq:taylor1}\]

When \(Q\left(\theta\right)\) is differentiable at \(\theta_{0}\), we have \(\frac{\partial}{\partial\theta}Q\left(\theta_{0}\right)=0\) by the first-order condition for the optimality of \(\theta_{0}\) for \(Q\left(\theta\right)\). Notice that \(E\left[s\left(x;\theta_{0}\right)\right]=\frac{\partial}{\partial\theta}Q\left(\theta_{0}\right)=0\) if differentiation and integration are interchangeable. By the CLT, the second factor in \[eq:taylor1\] satisfies \[\sqrt{n}\frac{\partial}{\partial\theta}\ell_{n}\left(\theta_{0}\right)\stackrel{d}{\to}N\left(0,\mathrm{var}\left[s\left(x;\theta_{0}\right)\right]\right).\] Suppose the first factor in \[eq:taylor1\] satisfies \(\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\stackrel{p}{\to}E\left[h\left(x;\theta_{0}\right)\right]\) (it is sufficient, for example, that \(E\left[\frac{\partial^{3}}{\partial\theta_{i}\partial\theta_{j}\partial\theta_{l}}\log f\left(x;\theta\right)\right]\) is continuous in \(\theta\) for all \(i,j,l\leq K\)). Thus we have the conclusion by Slutsky’s theorem.

When the model is misspecified, the asymptotic variance takes a complicated sandwich form. When the parametric model is correctly specified, the asymptotic variance can be further simplified, thanks to the following important result, the information matrix equality.
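In practice, the sandwich variance is estimated by plugging in sample analogues \(\widehat{H}=\frac{1}{n}\sum_{i=1}^{n}h(x_{i};\widehat{\theta})\) and \(\widehat{\Omega}=\frac{1}{n}\sum_{i=1}^{n}s(x_{i};\widehat{\theta})s(x_{i};\widehat{\theta})'\) to form \(\widehat{H}^{-1}\widehat{\Omega}\widehat{H}^{-1}\). Below is a sketch for the scalar Poisson quasi-MLE, assuming overdispersed negative binomial data so that the Poisson model is misspecified; the sandwich standard error then differs visibly from the information-based one.

```python
# Plug-in sandwich variance for a (possibly misspecified) Poisson quasi-MLE.
# Assumptions: negative binomial data with mean 3 and variance 7.5, so the
# Poisson model is misspecified (overdispersion).
import numpy as np

rng = np.random.default_rng(3)
x = rng.negative_binomial(n=2, p=0.4, size=5000)

lam_hat = x.mean()                     # Poisson quasi-MLE solves mean(x)/lam - 1 = 0
score = x / lam_hat - 1.0              # s(x; lam) = x/lam - 1
hess = -x / lam_hat**2                 # h(x; lam) = -x/lam^2

H_hat = hess.mean()                    # estimates E_g[h(x; theta_0)]
Omega_hat = np.mean(score**2)          # estimates var_g[s]; score has mean zero at lam_hat by the FOC
avar_sandwich = Omega_hat / H_hat**2   # H^{-1} Omega H^{-1} in the scalar case
avar_info = lam_hat                    # I(lam)^{-1} = lam, valid only under correct specification

n = len(x)
print("sandwich SE:", np.sqrt(avar_sandwich / n))
print("info-based SE:", np.sqrt(avar_info / n))   # understates the variability here
```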

8.4 Information Matrix Equality

When the model is correctly specified, \(\theta_{0}\) is the true parameter value. The variance \(\mathcal{I}\left(\theta_{0}\right):=\mathrm{var}_{f\left(x;\theta_{0}\right)}\left[\frac{\partial}{\partial\theta}\log f\left(x;\theta_{0}\right)\right]\) is called the (Fisher) information matrix, and \(\mathcal{H}\left(\theta_{0}\right):=E_{f\left(x;\theta_{0}\right)}\left[h\left(x;\theta_{0}\right)\right]\) is called the expected Hessian matrix. Here we emphasize the true underlying distribution \(f\left(x;\theta_{0}\right)\) by writing it as the subscript of the mathematical expectations.

\[fact:Info\] Under suitable regularity conditions, we have \(\mathcal{I}\left(\theta_{0}\right)=-\mathcal{H}\left(\theta_{0}\right)\).

Because \(f\left(x;\theta_{0}\right)\) is a pdf, \(\int f\left(x;\theta_{0}\right)dx=1\). Take the partial derivative with respect to \(\theta\): \[\begin{aligned} 0 & =\int\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)dx=\int\frac{\partial f\left(x;\theta_{0}\right)/\partial\theta}{f\left(x;\theta_{0}\right)}f\left(x;\theta_{0}\right)dx\nonumber \\ & =\int\left[s\left(x;\theta_{0}\right)\right]f\left(x;\theta_{0}\right)dx=E_{f\left(x;\theta_{0}\right)}\left[s\left(x;\theta_{0}\right)\right]\label{eq:info_eqn_1}\end{aligned}\] where the third equality holds because, by the chain rule, \[s\left(x;\theta_{0}\right)=\frac{\partial f\left(x;\theta_{0}\right)/\partial\theta}{f\left(x;\theta_{0}\right)}.\label{eq:ell_d}\] Take a further partial derivative of (\[eq:info\_eqn\_1\]) with respect to \(\theta'\); by the product rule, \[\begin{aligned} 0 & =\int\left[h\left(x;\theta_{0}\right)\right]f\left(x;\theta_{0}\right)dx+\int\left[s\left(x;\theta_{0}\right)\right]\frac{\partial}{\partial\theta'}f\left(x;\theta_{0}\right)dx\\ & =\int\left[h\left(x;\theta_{0}\right)\right]f\left(x;\theta_{0}\right)dx+\int s\left(x;\theta_{0}\right)\frac{\partial f\left(x;\theta_{0}\right)/\partial\theta'}{f\left(x;\theta_{0}\right)}f\left(x;\theta_{0}\right)dx\\ & =\int\left[h\left(x;\theta_{0}\right)\right]f\left(x;\theta_{0}\right)dx+\int\left[s\left(x;\theta_{0}\right)s\left(x;\theta_{0}\right)'\right]f\left(x;\theta_{0}\right)dx\\ & =E_{f\left(x;\theta_{0}\right)}\left[h\left(x;\theta_{0}\right)\right]+E_{f\left(x;\theta_{0}\right)}\left[s\left(x;\theta_{0}\right)s\left(x;\theta_{0}\right)'\right]\\ & =\mathcal{H}\left(\theta_{0}\right)+\mathcal{I}\left(\theta_{0}\right).\end{aligned}\] The second equality follows by (\[eq:ell\_d\]). The last equality follows by (\[eq:info\_eqn\_1\]): the zero mean ensures that the variance of \(\frac{\partial}{\partial\theta}\log f\left(x;\theta_{0}\right)\) equals the expectation of its outer product.
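As a concrete check (with the integral replaced by a sum for a discrete distribution), consider the Poisson model \(f\left(x;\lambda\right)=e^{-\lambda}\lambda^{x}/x!\): \[\log f\left(x;\lambda\right)=-\lambda+x\log\lambda-\log x!,\qquad s\left(x;\lambda\right)=\frac{x}{\lambda}-1,\qquad h\left(x;\lambda\right)=-\frac{x}{\lambda^{2}}.\] Under the true value \(\lambda_{0}\) we have \(E\left[x\right]=\mathrm{var}\left[x\right]=\lambda_{0}\), so \(\mathcal{I}\left(\lambda_{0}\right)=\mathrm{var}\left[s\left(x;\lambda_{0}\right)\right]=\lambda_{0}/\lambda_{0}^{2}=1/\lambda_{0}\) and \(\mathcal{H}\left(\lambda_{0}\right)=E\left[h\left(x;\lambda_{0}\right)\right]=-1/\lambda_{0}\), confirming \(\mathcal{I}\left(\lambda_{0}\right)=-\mathcal{H}\left(\lambda_{0}\right)\).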

Notice that correct specification is essential for the information matrix equality. If the true data generating distribution is \(g\notin\mathcal{M}^{*}\), then \[eq:info\_eqn\_1\] breaks down, because \[0=\int\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)dx=\int\left[g\left(x\right)^{-1}\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)\right]g\left(x\right)dx=E_{g}\left[g\left(x\right)^{-1}\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)\right]\] but \(g\left(x\right)^{-1}\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)\neq\left(f\left(x;\theta_{0}\right)\right)^{-1}\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)=\frac{\partial}{\partial\theta}\log f\left(x;\theta_{0}\right)\). The asymptotic variance in Theorem \[thm:mis-MLE\], \[\left(E_{g}\left[h\left(x;\theta_{0}\right)\right]\right)^{-1}\mathrm{var}_{g}\left[s\left(x;\theta_{0}\right)\right]\left(E_{g}\left[h\left(x;\theta_{0}\right)\right]\right)^{-1},\] written explicitly in terms of \(E_{g}\left[\cdot\right]\), remains valid.

When the parametric model \(\mathcal{M}^{*}\) is correctly specified, we can replace \(E_{g}\left[h\left(x;\theta_{0}\right)\right]\) by \(\mathcal{H}\left(\theta_{0}\right)\) and \(\mathrm{var}_{g}\left[s\left(x;\theta_{0}\right)\right]\) by \(\mathcal{I}\left(\theta_{0}\right)\), so the asymptotic variance simplifies to \[\left(\mathcal{H}\left(\theta_{0}\right)\right)^{-1}\mathcal{I}\left(\theta_{0}\right)\left(\mathcal{H}\left(\theta_{0}\right)\right)^{-1}=\left(-\mathcal{I}\left(\theta_{0}\right)\right)^{-1}\mathcal{I}\left(\theta_{0}\right)\left(-\mathcal{I}\left(\theta_{0}\right)\right)^{-1}=\left(\mathcal{I}\left(\theta_{0}\right)\right)^{-1}\] by the information matrix equality in Fact \[fact:Info\].

If the model is correctly specified, then under the conditions of Theorem \[thm:mis-MLE\] and Fact \[fact:Info\], the MLE estimator satisfies \[\sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)\stackrel{d}{\to}N\left(0,\left[\mathcal{I}\left(\theta_{0}\right)\right]^{-1}\right).\]

This is the classical asymptotic normality result of MLE.
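A small Monte Carlo sketch, assuming a correctly specified Poisson model with \(\lambda_{0}=3\), illustrates the result: the simulated variance of \(\sqrt{n}(\widehat{\lambda}-\lambda_{0})\) should be close to \(\mathcal{I}\left(\lambda_{0}\right)^{-1}=\lambda_{0}\).

```python
# Monte Carlo check of sqrt(n)(lambda_hat - lambda_0) ~ N(0, I(lambda_0)^{-1})
# for a correctly specified Poisson model (assumed settings: lam0 = 3, n = 500).
import numpy as np

rng = np.random.default_rng(4)
lam0, n, reps = 3.0, 500, 2000
lam_hat = rng.poisson(lam=lam0, size=(reps, n)).mean(axis=1)   # MLE = sample mean in each replication
z = np.sqrt(n) * (lam_hat - lam0)

print("simulated variance:", z.var())        # should be close to lam0
print("theoretical variance I^{-1}:", lam0)
```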

8.5 Cramer-Rao Lower Bound

8.6 Summary

Further reading: White (1996), Newey and McFadden (1994).

Zhentao Shi. Oct 29, 2020.