Open Mind

Take it to the Limit: Part 2, the Central Limit Theorem

June 15, 2009 · 3 Comments

In the last post we introduced the moment generating function. It’s defined as a power series, with its coefficients determined by the moments of a probability distribution:

m(t) = 1 + \mu_1 t + \frac{1}{2} \mu_2 t^2 + ... = \sum_{n=0}^\infty \frac{1}{n!} \mu_n t^n.

The moment-generating function turned out to be equal to the expected value of the function e^{tx}

m(t) = {\bf E}(e^{tx}).

We also mentioned that the moments determine the probability distribution just as the probability distribution determines the moments. So, if we know the moment-generating function we know the probability distribution, and vice versa. To get the moments from the generating function, we can either use cleverness to figure out the series expansion for m(t) and use the fact that the nth coefficient c_n is the nth moment \mu_n divided by n!

c_n = \mu_n/n!,

or we can compute the coefficients directly from a Taylor series expansion and get the result

\mu_n = (d^n m / dt^n)(0),

i.e., the nth derivative of the moment generating function, evaluated at t=0.
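
As a quick sanity check, here’s a short Python sketch of that recipe. It uses sympy, and the Bernoulli(p) variable is just my own illustrative choice (it wasn’t part of the last post); its moment generating function is 1 - p + p e^t, and every one of its moments equals p.

import sympy as sp

t, p = sp.symbols('t p')
m = 1 - p + p * sp.exp(t)   # moment generating function of a Bernoulli(p) variable

for n in range(1, 5):
    mu_n = sp.diff(m, t, n).subs(t, 0)   # nth derivative of m(t), evaluated at t = 0
    print(n, sp.simplify(mu_n))          # prints p every time

Every printed moment is p, as it should be: a variable that’s either 0 or 1 satisfies x^n = x.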

We can compute the moment generating function for well-known probability distributions. For example, for the uniform distribution from 0 to 1 it turns out to be

m(t) = (e^t - 1)/t.

For the chi-square distribution with n degrees of freedom it’s

m(t) = 1/\sqrt{(1-2t)^n}.

For the ubiquitous and all-important normal distribution, the moment generating function is

m(t) = e^{\mu t + \sigma^2 t^2 /2},

where of course \mu is the mean and \sigma^2 is the variance.
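
These closed forms are easy to check by brute force: estimate {\bf E}(e^{tx}) from a big batch of random numbers and compare with the formula. Here’s a rough Monte Carlo sketch in Python (numpy); the value of t, the sample sizes, and the normal parameters are arbitrary choices of mine.

import numpy as np

rng = np.random.default_rng(0)
t = 0.7

u = rng.uniform(0.0, 1.0, 200_000)                  # uniform from 0 to 1
print(np.mean(np.exp(t * u)), (np.exp(t) - 1) / t)  # estimate vs. (e^t - 1)/t

mu, sigma = 1.5, 2.0                                # arbitrary mean and standard deviation
z = rng.normal(mu, sigma, 200_000)
print(np.mean(np.exp(t * z)), np.exp(mu * t + sigma**2 * t**2 / 2))  # estimate vs. normal formula

Each pair of numbers agrees to within Monte Carlo noise.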

If the random variable isn’t a random variable at all, but instead is just some constant \mu, then the moment generating function is simply

m(t) = e^{\mu t}.

From that we can deduce that {\bf E}(x)=\mu, {\bf E}(x^2)=\mu^2, {\bf E}(x^3)=\mu^3, etc., just as a constant should behave.

What if we took two independent random variables, which may not follow the same distribution, and added them? Let the moment generating function for the distribution of the random variable x be m_x(t), and let that for the random variable y be m_y(t). Since the variables may not follow the same distribution, the generating functions m_x(t) and m_y(t) may not be the same. What is the moment generating function for the distribution of the sum x+y?

We can determine it by using the definition

m_{(x+y)}(t) = {\bf E}(e^{t(x+y)}).

This is of course

m_{(x+y)}(t) = {\bf E}(e^{tx}e^{ty}).

Because we assumed that x and y are independent, this is equal to

m_{(x+y)}(t) = {\bf E}(e^{tx}) {\bf E}(e^{ty}) = m_x(t) m_y(t).

We have the valuable result that the moment generating function for the sum of two independent random variables is the product of their moment generating functions.
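
Here’s a minimal numerical sketch of that result; the particular pairing of a uniform variable with an exponential one is just my own choice of two different distributions.

import numpy as np

rng = np.random.default_rng(1)
t = 0.4
x = rng.uniform(0.0, 1.0, 500_000)   # x uniform from 0 to 1
y = rng.exponential(1.0, 500_000)    # y exponential with mean 1, independent of x

lhs = np.mean(np.exp(t * (x + y)))                      # estimate of m_{x+y}(t)
rhs = np.mean(np.exp(t * x)) * np.mean(np.exp(t * y))   # estimate of m_x(t) m_y(t)
print(lhs, rhs)                                         # the two agree within sampling noise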

This motivates us to define yet another function, the cumulant generating function, which is the logarithm of the moment generating function

g(t) = \ln(m(t)).

Then we have the useful result that the cumulant generating function for the sum of two independent random variables is the sum of their cumulant generating functions. If we know the cumulant generating function we can compute the moment generating function, so we can in turn determine the probability distribution. Put another way, if two variables have the same cumulant generating function then they follow the same probability distribution.

Just as the moment generating function determines the moments \mu_j, the cumulant generating function determines the cumulants \kappa_j

g(t) = \kappa_1 t + \frac{1}{2} \kappa_2 t^2 + ... = \sum_{j=1}^\infty \frac{1}{j!} \kappa_j t^j.

Note that the sum starts at j=1, i.e., there’s no term which doesn’t contain a power of t.
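
Just like the moments, the cumulants can be read off as derivatives, this time of g(t) at t=0. Here’s a small sympy sketch. I’ve used a Poisson variable, which isn’t one of the distributions above, simply because all of its cumulants equal its rate \lambda, which makes the pattern easy to spot.

import sympy as sp

t, lam = sp.symbols('t lam', positive=True)
m = sp.exp(lam * (sp.exp(t) - 1))   # moment generating function of a Poisson(lam) variable
g = sp.log(m)                       # cumulant generating function

for j in range(1, 5):
    kappa_j = sp.diff(g, t, j).subs(t, 0)   # jth cumulant = jth derivative of g at t = 0
    print(j, sp.simplify(kappa_j))          # prints lam every time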

We can also determine the cumulant generating function for well-known distributions, just by taking the logarithm of the moment generating function. For a constant (a random variable that’s not random), it’s simply

g(t) = \mu t,

while for the normal distribution

g(t) = \mu t + \frac{1}{2} \sigma^2 t^2.

Averaging

Now suppose we have some random variable which follows an unknown probability distribution, and that we sample a large number N of data points with which we compute an average. The average is

\bar x = {1 \over N} \sum x_j.

Assume further that each sample value is independent of the others. The moment generating function for \bar x is

M(t) = {\bf E}(e^{t(x_1+x_2+...+x_N)/N}) = {\bf E}(e^{tx_1/N}e^{tx_2/N}...e^{tx_N/N}).

Since all the values are independent, this is

M(t) = {\bf E}(e^{tx_1/N}) {\bf E}(e^{tx_2/N}) ... {\bf E}(e^{tx_N/N}).

Each term is the moment generating function for a single value at t/N, so we have

M(t) = [m(t/N)]^N.

Taking the logarithm of both sides, we see that the cumulant generating function for the average has a similar simplicity

G(t) = N g(t/N).
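
To see M(t) = [m(t/N)]^N in action, here’s a rough check using averages of N exponential values with mean 1. Their single-value moment generating function is m(t) = 1/(1-t) for t < 1 (a standard result I’m quoting, not deriving); everything else in the sketch is an arbitrary choice.

import numpy as np

rng = np.random.default_rng(2)
N, t = 10, 0.8
xbar = rng.exponential(1.0, size=(200_000, N)).mean(axis=1)   # 200,000 averages of N values

lhs = np.mean(np.exp(t * xbar))    # direct estimate of M(t) for the average
rhs = (1.0 / (1.0 - t / N)) ** N   # [m(t/N)]^N with m(t) = 1/(1 - t)
print(lhs, rhs)                    # the two agree within sampling noise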

Let’s use our series expansion for the cumulant generating function, with \kappa_1,\kappa_2... the cumulants for a single data value, to get

G(t) = N \Bigl [ \kappa_1(t/N) + \frac{1}{2}\kappa_2(t/N)^2 + \frac{1}{6}\kappa_3(t/N)^3 + ... \Bigr ]
= \kappa_1 t + \frac{1}{2} \kappa_2 t^2/N + \frac{1}{6} \kappa_3 t^3/N^2 + ...

Hence the 1st cumulant of the average is \kappa_1, the same as the 1st cumulant of the data itself. The 2nd cumulant of the average is \kappa_2/N, the 3rd cumulant of the average is \kappa_3/N^2, etc.
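
Sympy will happily confirm that scaling. Sticking with the Poisson example from the sketch above (every single-value cumulant equals \lambda):

import sympy as sp

t, lam, N = sp.symbols('t lam N', positive=True)
g = lam * (sp.exp(t) - 1)   # cumulant generating function of one Poisson(lam) value
G = N * g.subs(t, t / N)    # cumulant generating function of the average of N values

for j in range(1, 5):
    print(j, sp.simplify(sp.diff(G, t, j).subs(t, 0)))
    # prints lam, lam/N, lam/N**2, lam/N**3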

Now consider what happens as N grows larger and larger. Each successive term in the cumulant generating function has a higher power of 1/N, so each successive term gets smaller and smaller. To zeroth order, as N goes to infinity the cumulant generating function becomes

G(t) = \kappa_1 t.

But that’s just the cumulant generating function for a random variable which is not a random variable, i.e., a constant. To get the lowest-order random behavior as N grows, we have to include the 1st-order term

G(t) = \kappa_1 t + \frac{1}{2} \kappa_2 t^2/N.

This is the lowest nontrivial order as N approaches infinity, so it gives us the asymptotic cumulant generating function for the average of a large number of data points.

And what is the asymptotic distribution for the average? Its cumulant generating function has the same form as that of the normal distribution, with only two nonzero cumulants. We therefore have the central limit theorem: the average of a large number of data points asymptotically follows the normal distribution.

The 1st cumulant of the average is the same as that of the raw data, while the 2nd cumulant is that of the raw data divided by N. But the 1st and 2nd cumulants are directly related to the mean and variance by

\kappa_1 = \mu,

\kappa_2 = \sigma^2.

Therefore the mean of the average is the same as the mean of the data, and the variance of the average is the variance of the data divided by N.
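
Here’s the whole theorem in a few lines of Python: draw a large number of averages of N values from a strongly skewed distribution (exponential with mean 1, so \mu = 1 and \sigma^2 = 1) and watch them settle toward a normal with mean \mu and variance \sigma^2/N. The sample sizes, the choice of N, and the use of scipy’s skewness function are all just illustrative choices on my part.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
N = 100
xbar = rng.exponential(1.0, size=(100_000, N)).mean(axis=1)   # 100,000 averages of N values

print(xbar.mean(), xbar.var())   # close to mu = 1 and sigma**2/N = 0.01
print(skew(xbar))                # near 2/sqrt(N) = 0.2; the raw data's skewness of 2 has shrunk
# a histogram of xbar shows the familiar bell curve

The averages aren’t exactly normal for any finite N (the higher cumulants are small but not zero), which is why the theorem is a statement about the limit.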

This “proof” of the central limit theorem has been more a sketch than a proof. A rigorous approach would use not the moment generating function, but the characteristic function

{\bf E}(e^{itx}).

This has the virtue that it always exists. It’s also equal to the complex conjugate of the Fourier transform of the probability density function (if that exists!). There are a lot more complications than I’ve outlined, but I hope this brief introduction has at least made the central limit theorem sensible and plausible.
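
For the curious, here’s one last quick sketch comparing a Monte Carlo estimate of {\bf E}(e^{itx}) for a normal variable with the standard closed form e^{i\mu t - \sigma^2 t^2/2} (again quoted rather than derived, with arbitrary parameters).

import numpy as np

rng = np.random.default_rng(4)
mu, sigma, t = 0.5, 1.2, 0.9
x = rng.normal(mu, sigma, 300_000)

print(np.mean(np.exp(1j * t * x)))                # Monte Carlo estimate of E(exp(i t x))
print(np.exp(1j * mu * t - sigma**2 * t**2 / 2))  # closed form for the normal distribution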

Categories: Global Warming

3 responses so far

  • Deep Climate // June 15, 2009 at 7:14 pm

    The we have the useful result

    I guess that’s a typo and should be “Thus …”

  • naught101 // June 16, 2009 at 12:48 pm

    /me waits for the punchline

  • Ray Ladbury // June 17, 2009 at 1:30 am

    Nice summary and quite elegant.

    I often use the method of moments when I have limited data and want to get a feel for model dependence of the conclusion. For instance, for data between 0 and infinity, I’ll get a best fit to a lognormal, which is skewed right, and a Weibull, which is skewed left. If the likelihood doesn’t distinguish between the two forms, I know I don’t have enough data to estimate skew. Is there maybe some more elegant way to look at this?
