Likelihood


When I taught a graduate-level course last week, I asked the students, “Did you learn maximum likelihood estimation (MLE) in your undergraduate statistics course?” Not a single hand was raised. Perhaps the students were too shy to signal their know-how.

Twenty-plus years ago, I was a finance major. My undergraduate statistics teacher was a small, energetic woman in her forties who always wore sports shoes when teaching. She covered two estimation methods: MLE and the method of moments (MM). MM is intuitive almost to a fault: it simply uses the sample moments to mimic the population moments. MLE, on the other hand, is a conceptual leap, and it struck me when I first encountered it. It seemed a reasonable way to go, since one certainly does not want to minimize the likelihood or somehow average it, yet I had a hard time immediately appreciating why working with the likelihood function was a good practice, let alone an optimal approach. There was a non-trivial logical gap to fill in connecting the parameter of interest with the likelihood function.
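
To make the contrast concrete, here is a minimal sketch in Python, my own illustration rather than anything from that course. It estimates the upper bound θ of a Uniform(0, θ) sample with both methods; the choice of distribution, the true value of 10, and the sample size of 200 are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Simulated sample from Uniform(0, theta); the true theta is assumed to be 10.
theta_true = 10.0
x = rng.uniform(0.0, theta_true, size=200)

# Method of moments: the population mean of Uniform(0, theta) is theta / 2,
# so matching it to the sample mean gives theta_hat = 2 * sample mean.
theta_mm = 2.0 * x.mean()

# Maximum likelihood: the likelihood (1/theta)^n, valid for theta >= max(x),
# is decreasing in theta, so it is maximized at the sample maximum.
theta_mle = x.max()

print(f"MM  estimate: {theta_mm:.3f}")
print(f"MLE estimate: {theta_mle:.3f}")
```

The uniform case is picked deliberately because the two estimators visibly differ; for many textbook families, such as the exponential, MM and MLE happen to coincide.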

It was not until I took graduate-level econometrics that I grasped the (relative) “entropy” at play (a sketch of that connection appears at the end of this post). Entropy is a deep, grand, and enigmatic building block of the universe. It was first introduced in thermodynamics to describe disorder, and von Neumann suggested that Claude Shannon borrow the term because “no one really knows what entropy really is.” The hero of this post, though, is Ronald Fisher, the father of modern statistics. An enthusiastic advocate of MLE, he laid its foundations in 1922. Fisher also introduced terminology that remains in use today, such as the score and sufficiency. Fisher was a pioneer and a trailblazer, well ahead of Shannon’s work on information theory. Information theory, it turns out, is a cornerstone of modern science far beyond statistics.
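
For completeness, the relative-entropy connection alluded to above can be sketched in a single display. This is the standard asymptotic argument, written here in generic notation (p for the true density, p_θ for the model density) rather than anything taken from a particular set of lecture notes.

```latex
% Requires amsmath and amssymb. p is the true density, p_theta the model density.
\[
\frac{1}{n}\sum_{i=1}^{n} \log p_\theta(x_i)
\;\xrightarrow[n \to \infty]{}\;
\mathbb{E}_p\!\left[\log p_\theta(X)\right]
= \underbrace{\mathbb{E}_p\!\left[\log p(X)\right]}_{\text{constant in } \theta}
\;-\; \underbrace{\mathbb{E}_p\!\left[\log \frac{p(X)}{p_\theta(X)}\right]}_{\mathrm{KL}(p \,\|\, p_\theta)}
\]
% Maximizing the left-hand side over theta is therefore, asymptotically,
% the same as minimizing the relative entropy KL(p || p_theta).
```

In words: as the sample grows, the average log-likelihood converges to a constant minus the Kullback–Leibler divergence from the true distribution to the model, so maximizing the likelihood is asymptotically the same as driving the relative entropy to its minimum.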