
8 Asymptotic Properties of MLE

8.1 Examples of MLE

Leading examples include the Normal, Logistic, Probit, and Poisson models; the Poisson case is illustrated numerically below.
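
To make the Poisson example concrete, here is a minimal sketch in Python (using numpy and scipy; the helper name `poisson_loglik` and the simulated data are illustrative, not part of these notes). It maximizes the average log-likelihood numerically and confirms that the maximizer coincides with the closed-form Poisson MLE, the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(0)
x = rng.poisson(lam=2.5, size=1_000)            # data simulated with theta_0 = 2.5

def poisson_loglik(theta, x):
    """Average Poisson log-likelihood ell_n(theta)."""
    return np.mean(-theta + x * np.log(theta) - gammaln(x + 1))

# Numerical MLE: minimize the negative average log-likelihood over theta > 0.
res = minimize_scalar(lambda t: -poisson_loglik(t, x),
                      bounds=(1e-6, 20.0), method="bounded")

print("numerical MLE:", res.x)
print("sample mean  :", x.mean())               # the Poisson MLE in closed form is x-bar
```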

8.2 Consistency

We specify a parametric distribution (pdf) f\left(x;\theta\right) and a parameter space \Theta. Define Q\left(\theta\right)=E\left[\log f\left(x;\theta\right)\right], and let \theta_{0}=\arg\max_{\theta\in\Theta}Q\left(\theta\right) be the maximizer of the expected log-likelihood. Given a sample of n observations, we compute the average sample log-likelihood \ell_{n}\left(\theta\right)=\frac{1}{n}\sum_{i=1}^{n}\log f\left(x_{i};\theta\right). The MLE estimator is \widehat{\theta}=\arg\max_{\theta\in\Theta}\ell_{n}\left(\theta\right).

We say that the model is correctly specified if the data \left(x_{1},\ldots,x_{n}\right) are generated from the pdf f\left(x;\theta\right) for some \theta\in\Theta. Otherwise, if the data are not generated from any member of the class of distributions \mathcal{M}^{*}:=\left\{ f\left(x;\theta\right):\theta\in\Theta\right\}, we say the model is misspecified. When the model is misspecified, strictly speaking the log-likelihood function \ell_{n}\left(\theta\right) should be called the quasi log-likelihood and the MLE estimator \widehat{\theta} should be called the quasi-MLE.

We will discuss under what conditions \widehat{\theta}\stackrel{p}{\to}\theta_{0}, that is, when the maximizer of the sample log-likelihood converges in probability to the maximizer of the expected log-likelihood in the population. Notice that unlike OLS, most MLE estimators do not admit a closed form. They are defined as maximizers and are solved for by numerical optimization.

The first requirement for the consistency of MLE is that \theta_{0} is uniquely defined. Suppose \theta_{0}\in\mathrm{int}\left(\Theta\right) lies in the interior of \Theta. Let N\left(\theta_{0},\varepsilon\right)=\left\{ \theta\in\Theta:\left|\theta-\theta_{0}\right|<\varepsilon\right\} be a neighborhood around \theta_{0} with radius \varepsilon for some \varepsilon>0.

The value \theta_{0} is identified if for any \varepsilon>0, there exists a \delta=\delta\left(\varepsilon\right)>0 such that Q\left(\theta_{0}\right)>\sup_{\theta\in\Theta\backslash N\left(\theta_{0},\varepsilon\right)}Q\left(\theta\right)+\delta.

We know that under suitable conditions, the LLN implies \ell_{n}\left(\theta\right)\stackrel{p}{\to}Q\left(\theta\right) for each \theta\in\Theta. This is a pointwise result, meaning \theta is taken as fixed as n\to\infty. However, \widehat{\theta} is random in finite samples, which makes \ell_{n}(\widehat{\theta}) a complicated function of the data, in particular when \widehat{\theta} has no closed-form solution. We therefore need to strengthen the pointwise LLN.

We say a uniform law of large numbers (ULLN) holds on \Theta if P\left\{ \sup_{\theta\in\Theta}\left|\ell_{n}\left(\theta\right)-Q\left(\theta\right)\right|\geq\varepsilon\right\} \to0 for all \varepsilon>0 as n\to\infty.

A ULLN can be established under a pointwise LLN plus some regularity conditions, for example when \Theta is a compact set and \log f\left(x;\cdot\right) is continuous in \theta almost everywhere on the support of x.
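
As an informal illustration of the ULLN (not part of the original notes), the sketch below approximates Q\left(\theta\right) for the Poisson model by a large simulated sample and evaluates \sup_{\theta\in\Theta}\left|\ell_{n}\left(\theta\right)-Q\left(\theta\right)\right| on a grid over a compact \Theta; the sup-deviation shrinks as n grows. Names such as `avg_loglik` are hypothetical.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(1)
theta0 = 2.5
grid = np.linspace(0.5, 6.0, 200)                  # a compact parameter space Theta

# Q(theta) = E[log f(x; theta)] = -theta + theta0*log(theta) - E[log x!];
# the last constant is approximated by a large simulated sample.
c = np.mean(gammaln(rng.poisson(theta0, size=1_000_000) + 1))
Q = -grid + theta0 * np.log(grid) - c

def avg_loglik(theta_grid, x):
    """ell_n(theta) on the grid: the average Poisson log-likelihood."""
    return np.array([np.mean(-t + x * np.log(t) - gammaln(x + 1)) for t in theta_grid])

for n in (100, 1_000, 10_000, 100_000):
    x = rng.poisson(theta0, size=n)
    dev = np.max(np.abs(avg_loglik(grid, x) - Q))  # sup_theta |ell_n(theta) - Q(theta)|
    print(f"n = {n:>6}:  sup deviation = {dev:.4f}")
```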

If \theta_{0} is identified and the ULLN holds, then \widehat{\theta}\stackrel{p}{\to}\theta_{0}.

According to the definition of consistency, we can check \begin{aligned} & P\left\{ \left|\widehat{\theta}-\theta_{0}\right|>\varepsilon\right\} \leq P\left\{ Q\left(\theta_{0}\right)-Q(\widehat{\theta})>\delta\right\} \\ & =P\left\{ Q\left(\theta_{0}\right)-\ell_{n}\left(\theta_{0}\right)+\ell_{n}\left(\theta_{0}\right)-\ell_{n}(\widehat{\theta})+\ell_{n}\left(\widehat{\theta}\right)-Q(\widehat{\theta})>\delta\right\} \\ & \leq P\left\{ \left|Q\left(\theta_{0}\right)-\ell_{n}\left(\theta_{0}\right)\right|+\ell_{n}\left(\theta_{0}\right)-\ell_{n}(\widehat{\theta})+\left|\ell_{n}\left(\widehat{\theta}\right)-Q(\widehat{\theta})\right|>\delta\right\} \\ & \leq P\left\{ \left|Q\left(\theta_{0}\right)-\ell_{n}\left(\theta_{0}\right)\right|+\left|\ell_{n}(\widehat{\theta})-Q(\widehat{\theta})\right|\geq\delta\right\} \\ & \leq P\left\{ 2\sup_{\theta\in\Theta}\left|\ell_{n}\left(\theta\right)-Q\left(\theta\right)\right|\geq\delta\right\} =P\left\{ \sup_{\theta\in\Theta}\left|\ell_{n}\left(\theta\right)-Q\left(\theta\right)\right|\geq\frac{\delta}{2}\right\} \to0.\end{aligned} The first line holds because of identification, the third line by the triangle inequality, the fourth line by the definition of MLE that \ell_{n}(\widehat{\theta})\geq\ell_{n}\left(\theta_{0}\right), and the last line by ULLN.

Identification is a necessary condition for consistent estimation. Although \widehat{\theta} has no closed-form solution in general, we establish consistency via the ULLN over all points \theta\in\Theta under consideration.
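
For a model without a closed-form MLE, a small Monte Carlo sketch (illustrative only; the Probit design and the helper `probit_negloglik` are assumptions of this example, not part of the notes) shows the numerically optimized \widehat{\theta} approaching \theta_{0} as n grows, in line with the consistency result above.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
theta0 = 1.0                                       # true Probit coefficient

def probit_negloglik(theta, y, x):
    """Negative average Probit log-likelihood; the maximizer has no closed form."""
    p = norm.cdf(x * theta)
    p = np.clip(p, 1e-10, 1 - 1e-10)               # guard against log(0)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

for n in (100, 1_000, 10_000, 100_000):
    x = rng.normal(size=n)
    y = (x * theta0 + rng.normal(size=n) > 0).astype(float)
    theta_hat = minimize(probit_negloglik, x0=np.array([0.0]),
                         args=(y, x), method="BFGS").x[0]
    print(f"n = {n:>6}:  theta_hat = {theta_hat:.4f},  |error| = {abs(theta_hat - theta0):.4f}")
```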

8.3 Asymptotic Normality

The next step is to derive the asymptotic distribution of the MLE estimator. Let s\left(x;\theta\right)=\partial\log f\left(x;\theta\right)/\partial\theta denote the score and h\left(x;\theta\right)=\frac{\partial^{2}}{\partial\theta\partial\theta'}\log f\left(x;\theta\right) denote the Hessian of the log-likelihood.

thm:mis-MLE Under suitable regularity conditions, the MLE estimator satisfies \sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)\stackrel{d}{\to}N\left(0,\left(E\left[h\left(x;\theta_{0}\right)\right]\right)^{-1}\mathrm{var}\left[s\left(x;\theta_{0}\right)\right]\left(E\left[h\left(x;\theta_{0}\right)\right]\right)^{-1}\right).

The “suitable regularity conditions” will be spelled out later. Indeed, those conditions can be observed in the proof.

That \widehat{\theta} is a maximizer entails \frac{\partial}{\partial\theta}\ell_{n}\left(\widehat{\theta}\right)=0. Take a Taylor expansion of \frac{\partial}{\partial\theta}\ell_{n}\left(\widehat{\theta}\right) around \theta_{0}: 0-\frac{\partial}{\partial\theta}\ell_{n}\left(\theta_{0}\right)=\frac{\partial}{\partial\theta}\ell_{n}\left(\widehat{\theta}\right)-\frac{\partial}{\partial\theta}\ell_{n}\left(\theta_{0}\right)=\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\left(\widehat{\theta}-\theta_{0}\right), where \dot{\theta} is some point on the line segment connecting \widehat{\theta} and \theta_{0}. Rearrange the above equation and multiply both sides by \sqrt{n}: \sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)=-\left(\frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\right)^{-1}\sqrt{n}\frac{\partial}{\partial\theta}\ell_{n}\left(\theta_{0}\right). (eq:taylor1)

When Q\left(\theta\right) is differentiable at \theta_{0}, we have \frac{\partial}{\partial\theta}Q\left(\theta_{0}\right)=0 by the first-order condition of optimality of \theta_{0} for Q\left(\theta\right). Notice that E\left[s\left(x;\theta_{0}\right)\right]=\frac{\partial}{\partial\theta}Q\left(\theta_{0}\right)=0 if differentiation and integration are interchangeable. By the CLT, the second factor in eq:taylor1 satisfies \sqrt{n}\frac{\partial}{\partial\theta}\ell_{n}\left(\theta_{0}\right)\stackrel{d}{\to}N\left(0,\mathrm{var}\left[s\left(x;\theta_{0}\right)\right]\right). Suppose the first factor in eq:taylor1 satisfies \frac{\partial^{2}}{\partial\theta\partial\theta'}\ell_{n}\left(\dot{\theta}\right)\stackrel{p}{\to}E\left[h\left(x;\theta_{0}\right)\right] (a sufficient condition is that E\left[\frac{\partial^{3}}{\partial\theta_{i}\partial\theta_{j}\partial\theta_{l}}\log f\left(x;\theta\right)\right] is continuous in \theta for all i,j,l\leq K). The conclusion then follows by Slutsky's theorem.
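
The theorem suggests a plug-in "sandwich" variance estimator \widehat{H}^{-1}\widehat{\Omega}\widehat{H}^{-1} with \widehat{H}=\frac{1}{n}\sum_{i=1}^{n}h\left(x_{i};\widehat{\theta}\right) and \widehat{\Omega}=\frac{1}{n}\sum_{i=1}^{n}s\left(x_{i};\widehat{\theta}\right)s\left(x_{i};\widehat{\theta}\right)'. Below is an illustrative sketch (it uses the analytic Poisson score and Hessian, which are not derived in these notes) that computes it for the Poisson example; because the data there are generated from the model itself, the sandwich essentially coincides with the simpler form discussed in the next subsection.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.poisson(2.5, size=5_000)

theta_hat = x.mean()                               # Poisson MLE (closed form)

# Analytic score and Hessian of the Poisson log-likelihood:
# s(x; theta) = -1 + x/theta,  h(x; theta) = -x/theta^2.
s = -1.0 + x / theta_hat
h = -x / theta_hat**2

H_hat = h.mean()                                   # estimate of E[h(x; theta_0)]
Omega_hat = np.mean(s**2)                          # estimate of var[s]; the score has mean zero at the MLE

sandwich_avar = Omega_hat / H_hat**2               # H^{-1} Omega H^{-1} in the scalar case
print("sandwich s.e. of theta_hat  :", np.sqrt(sandwich_avar / len(x)))
print("classical s.e. (info matrix):", np.sqrt(theta_hat / len(x)))   # inverse Fisher information = theta for Poisson
```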

When the model is misspecified, the asymptotic variance takes a complicated sandwich form. When the parametric model is correctly specified, the asymptotic variance can be further simplified, thanks to the following important result, the information matrix equality.

8.4 Information Matrix Equality

When the model is correctly specified, \theta_{0} is the true parameter value. The variance \mathcal{I}\left(\theta_{0}\right):=\mathrm{var}_{f\left(x;\theta_{0}\right)}\left[\frac{\partial}{\partial\theta}\log f\left(x;\theta_{0}\right)\right] is called the (Fisher) information matrix, and \mathcal{H}\left(\theta_{0}\right):=E_{f\left(x;\theta_{0}\right)}\left[h\left(x;\theta_{0}\right)\right] is called the expected Hessian matrix. Here we emphasize the true underlying distribution f\left(x;\theta_{0}\right) by writing it as the subscript of the mathematical expectations.

fact:Info Under suitable regularity conditions, we have \mathcal{I}\left(\theta_{0}\right)=-\mathcal{H}\left(\theta_{0}\right).

Because f\left(x;\theta_{0}\right) is a pdf, \int f\left(x;\theta_{0}\right)dx=1. Take the partial derivative with respect to \theta: \begin{aligned} 0 & =\int\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)dx=\int\frac{\partial f\left(x;\theta_{0}\right)/\partial\theta}{f\left(x;\theta_{0}\right)}f\left(x;\theta_{0}\right)dx\\ & =\int\left[s\left(x;\theta_{0}\right)\right]f\left(x;\theta_{0}\right)dx=E_{f\left(x;\theta_{0}\right)}\left[s\left(x;\theta_{0}\right)\right]\quad\text{(eq:info_eqn_1)}\end{aligned} where the third equality holds because, by the chain rule, s\left(x;\theta_{0}\right)=\frac{\partial f\left(x;\theta_{0}\right)/\partial\theta}{f\left(x;\theta_{0}\right)}. \quad\text{(eq:ell_d)} Take another partial derivative of (eq:info_eqn_1) with respect to \theta', according to the product rule: \begin{aligned} 0 & =\int\left[h\left(x;\theta_{0}\right)\right]f\left(x;\theta_{0}\right)dx+\int\left[s\left(x;\theta_{0}\right)\right]\frac{\partial}{\partial\theta'}f\left(x;\theta_{0}\right)dx\\ & =\int\left[h\left(x;\theta_{0}\right)\right]f\left(x;\theta_{0}\right)dx+\int s\left(x;\theta_{0}\right)\frac{\partial f\left(x;\theta_{0}\right)/\partial\theta'}{f\left(x;\theta_{0}\right)}f\left(x;\theta_{0}\right)dx\\ & =\int\left[h\left(x;\theta_{0}\right)\right]f\left(x;\theta_{0}\right)dx+\int\left[s\left(x;\theta_{0}\right)s\left(x;\theta_{0}\right)'\right]f\left(x;\theta_{0}\right)dx\\ & =E_{f\left(x;\theta_{0}\right)}\left[h\left(x;\theta_{0}\right)\right]+E_{f\left(x;\theta_{0}\right)}\left[s\left(x;\theta_{0}\right)s\left(x;\theta_{0}\right)'\right]\\ & =\mathcal{H}\left(\theta_{0}\right)+\mathcal{I}\left(\theta_{0}\right).\end{aligned} The second equality follows by (eq:ell_d). The last equality holds by (eq:info_eqn_1), as the zero mean ensures that the variance of \frac{\partial}{\partial\theta}\log f\left(x;\theta_{0}\right) equals the expectation of its outer product.
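
As a quick numerical check of Fact fact:Info (illustrative, not part of the original notes), the sketch below evaluates the sample analogues of \mathcal{I}\left(\theta_{0}\right) and -\mathcal{H}\left(\theta_{0}\right) for a correctly specified Poisson model; both are close to the analytic value 1/\theta_{0}.

```python
import numpy as np

rng = np.random.default_rng(4)
theta0 = 2.5
x = rng.poisson(theta0, size=1_000_000)            # correctly specified Poisson data

# Score and Hessian of the Poisson log-likelihood evaluated at theta_0:
s = -1.0 + x / theta0                              # s(x; theta_0)
h = -x / theta0**2                                 # h(x; theta_0)

print("I(theta_0)  ~", np.var(s))                  # Fisher information (sample analogue)
print("-H(theta_0) ~", -np.mean(h))                # negative expected Hessian
print("analytic value 1/theta_0 =", 1 / theta0)    # both should be close to 0.4
```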

Notice that a correct specification is essential for the information matrix equality. If the true data generating distribution is g\notin\mathcal{M}^{*}, then (eq:info_eqn_1) breaks down because 0=\int\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)=\int\left[g^{-1}\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)\right]g=E_{g}\left[g^{-1}\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)\right] but g^{-1}\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)\neq\left(f\left(x;\theta_{0}\right)\right)^{-1}\frac{\partial}{\partial\theta}f\left(x;\theta_{0}\right)=\frac{\partial}{\partial\theta}\log f\left(x;\theta_{0}\right). The asymptotic variance in Theorem thm:mis-MLE, \left(E_{g}\left[h\left(x;\theta_{0}\right)\right]\right)^{-1}\mathrm{var}_{g}\left[s\left(x;\theta_{0}\right)\right]\left(E_{g}\left[h\left(x;\theta_{0}\right)\right]\right)^{-1}, written explicitly with E_{g}\left[\cdot\right], is still valid.
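
To see the breakdown numerically (an illustrative sketch under an assumed overdispersed design, not from the notes), the code below fits a Poisson quasi-likelihood to negative binomial data with mean 2.5 and variance 3.75: the sample analogues of \mathcal{I} and -\mathcal{H} no longer agree, and only the sandwich form gives the valid asymptotic variance.

```python
import numpy as np

rng = np.random.default_rng(5)
# Overdispersed data: negative binomial with mean 2.5 and variance 3.75,
# while the fitted model is Poisson -- i.e., the model is misspecified.
x = rng.negative_binomial(5, 2 / 3, size=1_000_000)

theta_hat = x.mean()                               # quasi-MLE; pseudo-true value is E_g[x]
s = -1.0 + x / theta_hat
h = -x / theta_hat**2

I_hat = np.var(s)                                  # var_g[s(x; theta_0)]
negH_hat = -np.mean(h)                             # -E_g[h(x; theta_0)]
print("I        ~", I_hat)                         # ~ 0.60: the equality I = -H fails
print("-H       ~", negH_hat)                      # ~ 0.40
print("sandwich avar ~", I_hat / negH_hat**2)      # ~ 3.75, the valid asymptotic variance
print("naive (-H)^-1 ~", 1 / negH_hat)             # ~ 2.50, incorrect under misspecification
```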

When the parametric model \mathcal{M}^{*} is correctly specified, we can replace E_{g}\left[\frac{\partial^{2}\ell_{n}}{\partial\theta\partial\theta'}\left(\theta_{0}\right)\right] by \mathcal{H}\left(\theta_{0}\right) and \mathrm{var}_{g}\left[\frac{\partial\ell_{n}}{\partial\theta}\left(\theta_{0}\right)\right] by \mathcal{I}\left(\theta_{0}\right), and the asymptotic variance simplifies to \left(\mathcal{H}\left(\theta_{0}\right)\right)^{-1}\mathcal{I}\left(\theta_{0}\right)\left(\mathcal{H}\left(\theta_{0}\right)\right)^{-1}=\left(-\mathcal{I}\left(\theta_{0}\right)\right)^{-1}\mathcal{I}\left(\theta_{0}\right)\left(-\mathcal{I}\left(\theta_{0}\right)\right)^{-1}=\left(\mathcal{I}\left(\theta_{0}\right)\right)^{-1} by the information matrix equality (Fact fact:Info).

If the model is correctly specified, then under the conditions of Theorem thm:mis-MLE and Fact fact:Info, the MLE estimator satisfies \sqrt{n}\left(\widehat{\theta}-\theta_{0}\right)\stackrel{d}{\to}N\left(0,\left[\mathcal{I}\left(\theta_{0}\right)\right]^{-1}\right).

This is the classical asymptotic normality result of MLE.

8.5 Cramér-Rao Lower Bound

8.6 Summary

Further reading: White (1996), Newey and McFadden (1994).

Zhentao Shi. Oct 29, 2020.