Open Mind

AIC part 1: Kullback-Leibler Divergence

October 5, 2009 · 116 Comments

Warning! This post is mathematical. Disinterested readers beware!

One of the goals of time series analysis is to model the signal underlying the data. If the data have some random element to them, they’ll follow some probability distribution. The distribution might be dependent on external variables (like time), in which case we usually create a model in which the mean of the distribution is time-dependent. Suppose, for example, we model a variable as following a straight line time trend, plus random noise:

x(t) = a + bt + \epsilon,


where the “noise” part \epsilon follows the normal distribution with mean 0 and standard deviation \sigma. This is the same as saying that x follows the normal distribution with time-dependent mean a+bt and standard deviation \sigma. Then our model is that the probability density for a given x value is

g(x) = e^{-(x-a-bt)^2/2\sigma^2} / \sqrt{2\pi \sigma^2}.

We generally call this the probability (more precisely, the probability density), but we can also call it the likelihood of getting the value x.
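
For readers who like to compute along, here's a minimal sketch in Python of the model density above (the function name and parameter values are illustrative, not part of the post):

```python
import numpy as np

def model_density(x, t, a, b, sigma):
    """Density g(x) of the straight-line-plus-noise model:
    x is normal with mean a + b*t and standard deviation sigma."""
    mu = a + b * t
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

# Likelihood of observing x = 1.3 at time t = 10 under an
# illustrative model with intercept 0.2, slope 0.1, noise sd 0.5.
print(model_density(1.3, t=10, a=0.2, b=0.1, sigma=0.5))
```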

The model g(x) may be very useful; it might enable us to make inferences about the physical system or even make forecasts of future behavior. But as useful as it is, it’s probably not true. The behavior of physical systems can be quite complicated, and many systems show chaotic detailed behavior (although their long-term statistical behavior may be stable), so a complete description of reality may not be simple enough to encompass in a practical statistical model. And, data are often limited so we’re forced to use simple models (like straight-line trends) which we can’t expect to reflect the absolute truth of complex systems. If you do statistics long enough you begin to appreciate the classic statement that “All models are wrong. Some models are useful.”

But presumably the “truth” is out there, right? There must be some probability function f(x) which describes the “true” distribution for our variable x — we just don’t know what it is. It probably involves time dependence, maybe dependence on other external variables, and requires some parameters as well, but at least (we presume) it exists. Maybe we can’t know what it is (not in a finite lifetime anyway), but that doesn’t alter the fact that it exists.

If the “true” distribution is f(x), and our (probably simple, maybe useful, almost surely wrong) model is g(x), then how far are we from the truth? To get an idea, let’s recall the definition of the entropy of a probability distribution. The entropy of the distribution f(x) is the expected value of the negative of the logarithm of f(x). That’s a lot of words! We can express it as an equation by saying

S = - \langle \ln(f(x)) \rangle,

where the angle brackets \langle ~ \rangle indicate the expected value of the enclosed quantity. The expected value of any quantity Q(x) depending on a random variable x which is governed by the probability distribution f(x) is

\langle Q \rangle = \int f(x) Q(x) ~dx,

so the entropy of the distribution is

S = - \int f(x) \ln(f(x)) ~dx.

If the expected value of the negative logarithm of the true distribution is the entropy, what about the expected value of the negative logarithm of our model distribution? That would be

\Delta = - \langle \ln(g(x)) \rangle = - \int f(x) \ln(g(x)) ~dx,

and is called the cross-entropy.

There are many ways to show that the cross-entropy is always bigger than the entropy — unless the model g(x) is equal to the “truth” f(x), in which case it is equal to the entropy. One standard route is Jensen’s inequality applied to the concave logarithm (Gibbs’ inequality): \langle \ln(g/f) \rangle \le \ln \langle g/f \rangle = \ln \int g(x) ~dx = 0, so the difference between cross-entropy and entropy, which is - \langle \ln(g/f) \rangle, can never be negative.

Since the cross-entropy is always bigger than the entropy (unless the model is correct), this motivates us to use the difference as a measure of how close the model is to the “truth”. We define the Kullback-Leibler divergence as the difference between the cross-entropy and the entropy

KL = \Delta - S = - \int f(x) \ln(g(x)) ~dx + \int f(x) \ln(f(x)) ~dx

= \int f(x) \ln(f/g) ~dx.
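
To make the definitions concrete, here's a small numerical sketch in Python (the particular “true” and model densities, and the grid, are invented purely for illustration) that approximates the entropy, the cross-entropy, and their difference by integration on a grid:

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

x = np.linspace(-10.0, 10.0, 200001)       # integration grid
dx = x[1] - x[0]

f = normal_pdf(x, mu=0.0, sigma=1.0)       # the "true" distribution (assumed known here)
g = normal_pdf(x, mu=0.5, sigma=1.5)       # our (wrong) model

S     = -np.sum(f * np.log(f)) * dx        # entropy of f
Delta = -np.sum(f * np.log(g)) * dx        # cross-entropy
KL    = Delta - S                          # Kullback-Leibler divergence

# KL comes out positive (about 0.183), matching the closed-form result
# ln(s2/s1) + (s1^2 + (m1 - m2)^2)/(2*s2^2) - 1/2 for two normal distributions.
print(S, Delta, KL)
```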

Let’s illustrate with a simple example: flipping a coin. The random variable x is discrete rather than continuous, taking only the values 0 (“tails”) or 1 (“heads”), so instead of a continuous probability distribution f(x) we only have probabilities p_0 of getting tails and p_1 of getting heads. Suppose the coin is fair so that p_0 = p_1 = \frac{1}{2}. Then the entropy is computed by replacing the integral by a sum

S = - \sum_j p_j \ln(p_j) = - \frac{1}{2} \ln(\frac{1}{2}) - \frac{1}{2} \ln(\frac{1}{2}) = \ln(2) = 0.693147...

We can model the coin flip by supposing that the probability of heads has some value \theta, so the probability of tails is of course 1-\theta. The expected negative logarithm of this model (the cross-entropy) is

\Delta = - \frac{1}{2} \ln(1-\theta) - \frac{1}{2} \ln \theta.

We can combine terms to get

\Delta = - \frac{1}{2} \ln(\theta (1-\theta)).

Here’s a plot of the cross-entropy as a function of our model parameter \theta:

[Figure: the cross-entropy \Delta = - \frac{1}{2} \ln(\theta(1-\theta)) plotted against \theta, with its minimum at \theta = 0.5.]

Clearly, the cross-entropy is minimum when the parameter \theta is the true probability 0.5, hence the KL divergence is zero only when the model is correct. It’s also obvious that the further our model parameter is from the truth, the greater is the KL divergence.
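
A quick sketch in Python reproduces both statements (the grid of \theta values is arbitrary):

```python
import numpy as np

theta = np.linspace(0.01, 0.99, 99)         # candidate values of the model parameter

S     = np.log(2.0)                         # entropy of the fair coin
Delta = -0.5 * np.log(theta * (1 - theta))  # cross-entropy of the model
KL    = Delta - S                           # Kullback-Leibler divergence

print(theta[np.argmin(KL)])   # 0.5: the divergence is smallest at the true probability
print(KL.min())               # 0.0 (to rounding): zero only when the model is correct
```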

Of course, to measure the KL divergence we need to know what the “true” probability f(x) is. But if we knew that, we wouldn’t need to test models! So how does this help us choose the best model when the truth is unknown? Stay tuned …

P.S. Thanks to Mrs. Tamino for her help … especially since all the equations mean nothing to her.

Categories: Global Warming

116 responses so far ↓

  • CM // October 5, 2009 at 10:12 pm | Reply

    Mrs Tamino is not just typing, she’s typing LaTeX? You \emph{lucky} man!

  • Nick Barnes // October 5, 2009 at 11:31 pm | Reply

    You mean “uninterested”. Otherwise excellent, looking forward to part 2.

  • Ray Ladbury // October 5, 2009 at 11:40 pm | Reply

    Tamino,

    YES!!! A very clear exposition of this concept. I think a lot of people don’t fully appreciate the power of having a metric that can be minimized for the true model. Some people dismiss the K-L divergence merely because it cannot be calculated. However, its very existence is the important thing. It shows that as a model approaches the true model, the K-L distance decreases. We cannot know when we’ve reached the true model, but the concept is really key to understanding an information theoretic approach to statistical modeling. I’m looking forward to the rest!

  • David B. Benson // October 6, 2009 at 12:55 am | Reply

    I’m tuned! I’m tuned!

    And hearty thanks to Mrs. Tamino.

  • Nathan // October 6, 2009 at 1:13 am | Reply

    Nick Barnes

    Don’t want to be a pedant, but I think Tamino may be correct. My understanding is that ‘un’ refers to something that has now changed to be the opposite of what it was (like undo, undone etc).

  • dhogaza // October 6, 2009 at 1:34 am | Reply

    Sigh …

    Disinterested and uninterested share a confused and confusing history. Disinterested was originally used to mean “not interested, indifferent”; uninterested in its earliest use meant “impartial.” By various developmental twists, disinterested is now used in both senses. Uninterested is used mainly in the sense “not interested, indifferent.” It is occasionally used to mean “not having a personal or property interest.”
    Many object to the use of disinterested to mean “not interested, indifferent.” They insist that disinterested can mean only “impartial”: A disinterested observer is the best judge of behavior. However, both senses are well established in all varieties of English, and the sense intended is almost always clear from the context.

    Either serves for this meaning … ” By various developmental twists” is probably secret code for “people centuries ago were just as confused about the two as we are today …”

  • David Horton // October 6, 2009 at 1:46 am | Reply

    Fine, fine, but what about those mysterious oscillating graphs from central England? Is the body in the library with the candlestick?

  • David Horton // October 6, 2009 at 3:20 am | Reply

    Yeah, I’m a pedant to. It should be “uninterested” meaning to have no interest in the topic. Disinterested means to have no involvement (usually of a financial or legal nature) in the topic – that is you can comment on it objectively with nothing to gain personally. One of those pairs of words where meaning is lost in modern times (think reluctant and reticent)

  • David Horton // October 6, 2009 at 3:22 am | Reply

    Or, as us pedants like to say, “I’m a pedant too …”

  • Gavin's Pussycat // October 6, 2009 at 11:01 am | Reply

    mee tooo ;-)

  • ekzept // October 6, 2009 at 1:03 pm | Reply

    K-L divergence can be adapted to other purposes, too. Suppose there is a time series of estimates for a probability density, D[k], say each given by an empirical cumulative distribution function. Assuming successive captures of D[k] are suitably adjusted for power and representativeness, the symmetrized K-L can be used to generate an index of how dissimilar D[k] is from D[1+k].
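
    For concreteness, one possible sketch of such an index in Python, assuming the density estimates have been binned onto a common set of bins (the example bin probabilities and the small regularizing floor are assumptions for illustration):

    ```python
    import numpy as np

    def kl(p, q, eps=1e-12):
        """Discrete K-L divergence sum(p * log(p/q)), with a small floor to avoid log(0)."""
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return np.sum(p * np.log(p / q))

    def symmetrized_kl(p, q):
        """Symmetrized K-L: KL(p||q) + KL(q||p)."""
        return kl(p, q) + kl(q, p)

    # D[k] and D[k+1] as binned probabilities estimated from successive samples
    d_k  = np.array([0.1, 0.3, 0.4, 0.2])
    d_k1 = np.array([0.2, 0.3, 0.3, 0.2])
    print(symmetrized_kl(d_k, d_k1))   # dissimilarity index between successive estimates
    ```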

  • Hank Roberts // October 6, 2009 at 3:06 pm | Reply

    We must bury this grammatical kerfluffle. It should remain uninterred no longer. Or if was once interred, then disinterred, it should be reinterred, permanently, not intermittently. Thank you.

  • Timothy Chase // October 6, 2009 at 3:36 pm | Reply

    Ray Ladbury wrote:

    Some people dismiss the K-L divergence merely because it cannot be calculated. However, its very existence is the important thing…

    Well, I am interested, Tamino. Just my luck, I bet part two comes out tomorrow. In any case, I will see how far I can follow. Oh — and congratulations regarding your wife — sounds like a real partner in life.

  • ekzept // October 6, 2009 at 3:54 pm | Reply

    I don’t understand the “cannot be calculated” part. Surely K-L is useless as an objective standard as the post seems to imply, but comparisons like those I suggested may serve as the basis for inference.

    Indeed, there’s a book by Pardo based upon the idea of using divergence measures for inference. I kind of reviewed it on Amazon. It is highly technical, and is not really self-standing, needing support from the book by Read and Cressie (Goodness-of-Fit Statistics for Discrete Multivariate Data), and that’s unfortunate.

  • ekzept // October 6, 2009 at 3:56 pm | Reply

    There is also a readable account of how these measures might be used: M.R. Forster, “The New Science of Simplicity”, in A. Zellner, H.A. Keuzenkamp and M. McAleer (eds.), Simplicity, Inference and Modelling, Cambridge University Press, 83-119, 2001.

  • Mark // October 6, 2009 at 9:18 pm | Reply

    “Or, as us pedants like to say, “I’m a pedant too …””

    Pendants are well known for hanging about.

  • Ray Ladbury // October 6, 2009 at 11:38 pm | Reply

    ekzept,
    The “cannot be calculated” statement derives from the fact that K-L divergence requires knowledge of the true model. That is precisely why its utility is as a comparative metric. My comment was meant to address some of the attitudes I’ve run into when using such metrics. Not everyone understands the value of a comparative metric or of model selection.

  • ekzept // October 7, 2009 at 3:18 am | Reply

    Thanks, Ray. Indeed. M.R.Forster addresses some of these matters in the context of biology.

    My only puzzlement is their puzzlement. I mean, aren’t likelihood ratios comparison tests, often of a candidate against the default “explainable by change”? I can see it takes a tad more sophistication to see LR as a means of comparing competing hypotheses, but not much.

  • ekzept // October 7, 2009 at 3:19 am | Reply

    Sorry “change” –> “chance”.

  • Al // October 7, 2009 at 5:38 am | Reply

    Sorry, but I’ve just got to say this:
    Some of my best friends are Americans, (of course) but…”uninterested” always meant “not interested”, and “disinterested” always meant “impartial” UNTIL THE AMERICANS GOT HOLD OF THEM. (“momentarily” is another particularly disconcerting one.) dhogaza can euphemistically describe the situation as “various developmental twists”, but it just points to his being from the USA, where the importance of spelling as a means of precise communication just doesn’t seem to apply so much (in my opinion, and it doesn’t apply to their literary giants, of course, only the education system, I guess)!
    By the way, a correction of a minor typo: -0.5 ln(0.5) – 0.5 ln(0.5) = -ln(0.5), not -ln(2)

    [Response: First: the post says ln(2) not -ln(2) (always has), so the error is yours.

    Second: I'm no fan of the verbal or written language skills of Americans, but it sounds like you're just "puttin' on airs."]

    • TrueSceptic // October 7, 2009 at 5:19 pm | Reply

      I wonder what happened here. Did Al forget that
      a*ln(b) = ln(b^a) ?
      I don’t see it as a punctuation issue.

  • Ian // October 7, 2009 at 11:45 am | Reply

    If only the English would learn to punctuate correctly! (Perhaps that’s the root of Al’s misreading of the post…)

  • dhogaza // October 7, 2009 at 1:38 pm | Reply

    dhogaza can euphemistically describe the situation as “various developmental twists”, but it just points to his being from the USA

    No, it just points to my being the one person to actually look at a good dictionary to see what the professionals have to say about it.

    I won’t say anything insulting about those who think their own opinion triumphs that of professionals, though, even if there is a strange parallel with those high school-educated people who’ve overturned climate science …

  • dhogaza // October 7, 2009 at 1:46 pm | Reply

    My last note on this dumb subject (I had expected my first dictionary post to put an end to it, fat chance of that):

    Oddly enough, “not interested” is the oldest sense of the word, going back to the 17th century. This sense became outmoded in the 18th century but underwent a revival in the first quarter of the early 20th.

  • Kevin McKinney // October 7, 2009 at 1:47 pm | Reply

    At this rate, we’ll be needing posts on Differential Indices of Grammatical Solecisms (DIGS.)

  • Timothy Chase // October 7, 2009 at 4:15 pm | Reply

    Al wrote:

    … the USA, where the importance of spelling as a means of precise communication just doesn’t seem to apply so much…

    Blame Daniel Webster. His prejudice against the English so coloured his views that he deliberately went out of his way to create an American way of spelling English… and caused my poor wife to lose a spelling bee in elementary school as a result. (She grew up on English literature.)

    It is also my understanding that while the likelihood of it has been the subject of some exaggeration, at one point at least one colony was considering switching to German.

  • Igor Samoylenko // October 7, 2009 at 5:01 pm | Reply

    Is it not abundantly clear from context what Tamino meant when he said “Disinterested readers beware!” regardless of what you may think “disinterested” really means?

    But I am not a native English speaker, so it may be I am missing something subtle here… :-)

  • Eli Rabett // October 7, 2009 at 6:15 pm | Reply

    It’s not surprising that the cross entropy is equivalent to the entropy of mixing for a solution, but it is interesting.

  • Adrian Burd // October 7, 2009 at 8:47 pm | Reply

    Timothy,

    “Blame Daniel Webster”

    I think you mean Noah Webster, his cousin.

    Adrian

  • dhogaza // October 7, 2009 at 8:54 pm | Reply

    His prejudice against the English so coloured his views that he deliberately went out of his way to create an American way of spelling English…

    Yeah, but actually he was trying to regularize spelling so it would be more phonetic.

    Getting rid of “ou” when it’s not pronounced as in “our” or “hour”, for instance (thus “color”).

    Only a few of his innovations stuck, but they were systematic and not due to “prejudice”. They were meant to make it easier to learn proper spelling by making spelling more … proper :)

    Take Spanish, for instance: if you hear it pronounced (and know the accent, i.e. Mexico vs. Spain) you can almost always spell it correctly.

    While English … ummm … not so true. Webster’s motivation was reasonable.

  • george // October 7, 2009 at 11:05 pm | Reply

    “to measure the KL divergence we need to know what the “true” probability f(x) is. But if we knew that, we wouldn’t need to test models!”

    In the process of minimizing the cross-entropy, aren’t you essentially finding the “true” probability distribution? (or at least something close to it)

    I assume such a minimization approach works for cases that are more involved than the trivial coin toss case above, though I can also appreciate that it may not be feasible for some cases.

    the further our model parameter is from the truth, the greater is the KL divergence.

    Are there particular classes of models for which this is true?

    Is there some test to determine whether the approach is applicable?

  • David B. Benson // October 7, 2009 at 11:41 pm | Reply

    george // October 7, 2009 at 11:05 pm — Patience. Subsequent parts on this topic will clarify.

  • ekzept // October 7, 2009 at 11:50 pm | Reply

    Clearly, the cross-entropy is minimum when the parameter is the true probability 0.5, hence the KL divergence is zero only when the model is correct. It’s also obvious that the further our model parameter is from the truth, the greater is the KL divergence.

    I wonder if d/dθ of the cross-entropy might not have a useful interpretation? Is it something like the information lost or gained by improving knowledge of θ?

  • suricat // October 8, 2009 at 1:06 am | Reply

    Tamino: I think ekzept has a point!

    Where does ‘enthalpy’ feature in this?

    Best regards, suricat.

  • Mark // October 8, 2009 at 10:39 am | Reply

    “In the process of minimizing the cross-entropy, aren’t you essentially finding the “true” probability distribution? (or at least something close to it)”

    To my rough understanding, the removal of cross-entropy will remove double-accounting for errors.

    I.e. if two dependent values are assumed incorrectly to be independent, the error range you get will be sqrt(2) times bigger than the “real” error range in your dataset.

    And of course, assuming that the independent values are dependent has the opposite effect.

    This error will also change the possible forms of probability distribution, since you’d be mixing shapes of distribution together.

  • Mark // October 8, 2009 at 10:42 am | Reply

    From what I remember from English history and the development of language, the American spelling is an older form of the English spelling and, to that extent, is more “English” than the English spelling.

    The two countries had the same spelling and then England went through one rationalisation of the spelling of English words, making a closer tie to the French (hence the appearance of “u” in colour). The American organisation of US spelling was much less radical.

    There have been attempts to make the US spelling even more phonetic, but that failed. It did give rise to a long internet joke about spelling, though…

  • Ray Ladbury // October 8, 2009 at 1:27 pm | Reply

    Ekzept and Suricat,

    My guess is that you will see some more development in subsequent entries. Keep in mind here that we are talking about the cross entropy over a space of different possible models–which will in general be a lot more complicated than the coin-flip example Tamino used for illustration above. As such, while the entropy is well defined, other thermodynamic quantities (pressure, chemical potential, even temperature) may not have obvious analogies.

    I’ve thought about this question somewhat. It seems to me that such thermodynamic analogues might be useful in defining the “best model” subject to some constraints–such as cost or finiteness of resources. In essence, any additional terms that we add will tend to bias the solution away from the “true” model and toward a model that is optimal in some other criterion. The temperature, pressure and chemical potentials would serve as weights for each criterion. The problem would be to come up with a way of doing so that was not arbitrary.

    • suricat // October 8, 2009 at 10:51 pm | Reply

      Ray Ladbury: It seems to me that, like myself, you are also looking for ‘attractors’. OK, let’s wait and see.

      Best regards, suricat.

      • Ray Ladbury // October 9, 2009 at 12:57 am

        Suricat,
        My interest in this issue derives from its possible application in finding an optimal model for prediction in the face of constraints like finite resources, data, etc. For instance, one could perhaps view a unit-test cost as a sort of chemical potential and the “temperature” as the cost of an error in model determination. Still not sure what the analogue of pressure or volume might be.

        Keep in mind, though that there are at least three types of entropy (thermodynamic, information and model), and the relations between them are not 100% understood.

  • dhogaza // October 8, 2009 at 1:35 pm | Reply

    From what I remember from English history and the development of language, the American spelling is an older form of the english spelling and, to that extent, is more “English” than the english spelling.

    Rather than trusting to memory, one can look stuff up …

    The origin of the word ‘colour’ is in Middle English (developed into Modern English in 16th Century), which actually borrows from Anglo-Norman French in this case. ‘Colour’ has many definitions and uses (About nine, and then a tonne of little bullets). Somewhere between colonisation, revolution, and the Industrial Revolution, the English language had no central regulation. Samuel Johnson’s Dictionary of the English Language (1755) is the source of most of the current British spellings, but American English became somewhat simplified in spelling during the times between this book’s publication and Noah Webster and his An American Dictionary of the English Language of 1828. Webster was a large part in changing the spelling of the language because of his philosophies and strong nationalism. What would’ve been seen then as the “correct” spellings have been listed as variants, and still are today.

    So, the unstressed -our (favour, flavour, colour, savour) became -or (favor, flavor, color, savor), the few -re endings in British spelling (centre, metre, litre, manoeuvre) became -er (center, meter, liter, maneuver), and -ce (defence, offence, pretence) became -se (defense, offense, pretense). Because of wide usage in both countries and acceptance onto the pedastal of dictionaries, both spellings are accepted today, though it seems that “when in Rome” follows. And Canada got caught in the middle of it all, using mostly British spellings with some American leaking in.

    These various amateur hypotheses we’re being exposed to are interesting, but really, it’s written down.

  • Mark // October 8, 2009 at 2:17 pm | Reply

    “These various amateur hypotheses we’re being exposed to are interesting, but really, it’s written down.”

    What, though, is the spelling…

  • Mark // October 8, 2009 at 2:45 pm | Reply

    “the few -re endings in British spelling (centre, metre, litre, manoeuvre) became -er (center, meter, liter, maneuver)”

    Though a water meter is meter not metre.

  • dhogaza // October 8, 2009 at 3:30 pm | Reply

    What, though, is the spelling…

    Sammy J, in his dictionary which fixed most modern British spellings: colour.

    Afterwards, in the US, and codified by Noah Webster: color.

    That should’ve been clear with a close reading of the resource I pasted above.

  • ekzept // October 9, 2009 at 3:18 am | Reply

    So, can we calculate the K-L divergence of different spellings across countries for the same language?

  • Barton Paul Levenson // October 9, 2009 at 9:23 am | Reply

    The actual, proper spelling of color/colour should be “ghoti.”

  • Ray Ladbury // October 9, 2009 at 11:42 am | Reply

    Count so far: 45 posts. 23 off topic. Perhaps we could move the discussion of the common language that separates the US from Britain to Open Thread.

    [Response: Good idea.]

  • Kevin McKinney // October 9, 2009 at 12:22 pm | Reply

    BPL, you’ve hooked me. . .

    (OK, for those who missed the reference:

    http://en.wikipedia.org/wiki/Ghoti)

  • Kevin McKinney // October 9, 2009 at 12:24 pm | Reply

    Sorry about that link; but follow the connecting links–the story is there on Wiki.

  • ekzept // October 9, 2009 at 2:41 pm | Reply

    Proof of a special case of Akaike’s Theorem.

  • ekzept // October 9, 2009 at 2:45 pm | Reply

    (Hmmm, post failed. Retry.)

    http://stanford.edu/~joelv/teaching/249/Akaike.pdf

    “How to Tell when Simpler, More Unified,
    or Less Ad Hoc Theories will Provide
    More Accurate Predictions”

  • Aaron Lewis // October 12, 2009 at 8:17 pm | Reply

    Be very glad that you can study under Tamino (with his wife’s clear transcriptions) rather than under some earlier generation (George Gamow comes to mind) with their heavily accented lectures in some changing mix of languages with crude scrawls on a blackboard.

  • ekzept // October 13, 2009 at 5:13 pm | Reply

    Another reference:

    K.P.Burnham, D.R.Anderson, “Kullback-Leibler information as a basis for strong inference in ecological studies”, Wildlife Research, 2001, 28, 111-119

    http://warnercnr.colostate.edu/~anderson/PDF_files/K-LINFO.pdf

  • Ray Ladbury // October 13, 2009 at 7:23 pm | Reply

    ekzept, Thanks. This was the sort of summary I was looking for for a colleague. General and broad, but still informative.

  • Timothy Chase // October 13, 2009 at 7:42 pm | Reply

    ekzept wrote:

    Another reference: … “Kullback-Leibler information as a basis for strong inference in ecological studies” …

    I am a little out of my depth here — perhaps more than a little. But you have managed to pique my interest: the paper argues that Kullback-Leibler bears on the application of Occam’s Razor to scientific theories — and as such to issues regarding the philosophy of science — in what is essentially an alternative to Bayesian inference. (pg. 114)
    *
    As such it would even bear upon the issue of “What is knowledge?” In essence, the claim that all “theories and models are wrong but some are better than others” means that all theories and models are simply approximations — ones that we can expect to improve upon over time.

    In much the same way, when I say that two animals, say a beagle and a doberman, are both “dogs,” in essence I am saying that they are both the same “kind” of animal. But this is simply an approximation, and with a finer grid of concepts I acknowledge that one is a “beagle” and another a “doberman.”

    But this is still an approximation. And conceptual knowledge will always be an approximation as it cannot grasp things in all their particularity, whether at the level of everyday discourse or our most advanced scientific theories.
    *
    In fact, the authors seem to be suggesting that based upon Kullback-Leibler, one can attempt a kind of a middle way between frequentist and Bayesian approaches which combines insights from both while avoiding their respective weaknesses.

    Another point this might have some bearing on: why multi-model means of single model ensembles tend to do better than even the best single model ensembles. See page 115. Something which Gavin has remarked on more than once.

    Ambitious work — and certainly more than I would have expected from ecological studies, at least at this point. But then they are trying to lay the foundation for “deciding” between alternative theories in a field where I suspect this is often quite difficult.

  • ekzept // October 13, 2009 at 9:49 pm | Reply

    @Timothy Chase,

    There is a treatment of the Raven Paradox using the Akaike frame given by Forster which illustrates how the K-L kind of approach addresses what is considered a classic Bayesian win.

  • Ray Ladbury // October 13, 2009 at 9:55 pm | Reply

    Timothy, the fact that the K-L information turned out to be related to the likelihood was a very cool development. Likelihood plays a role in nearly every school of statistical inference–be it Bayesian or frequentist.

    There are some who have even argued that likelihood is THE fundamental quantity for comparing different models/theories. K-L and AIC extend that way of looking at things.

    The Occam’s razor analogy of course stems from the form of AIC, which has a term proportional to the log-likelihood (a measure of goodness of fit) and a penalty term proportional to the number of parameters in the model. Since likelihood enters into the picture as a log term, the goodness of fit must actually increase exponentially with model complexity to justify the added complexity. Pretty cool, really. Burnham and Anderson have a pretty good book on the subject.
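
    For reference, the standard form (not spelled out explicitly in this thread) is

    AIC = 2k - 2 \ln L,

    where L is the maximized likelihood and k is the number of fitted parameters, so one extra parameter pays for itself only if it raises L by at least a factor of e.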

  • David B. Benson // October 13, 2009 at 11:26 pm | Reply

    Timothy Chase // October 13, 2009 at 7:42 pm — Actually, up to advances 5 & 6, this is just a restatement of (the modern formulation of) Bayesian reasoning. The meme is

    MaxEnt == Bayesian

    See E.T. Jaynes’s “Probability Theory: the logic of science”. Comes with many important recommendations.

  • ekzept // October 14, 2009 at 1:17 am | Reply

    If one wants to get philosophical, the other way of looking at this is to say AIC or the Bayesian Information Criterion replace Occam’s Razor, being quantitative. After all, how does one really know when something amounts to a “simpler explanation”? That’s like the old “law of the unspecialized” in biology: Which species are “unspecialized”?

  • Timothy Chase // October 14, 2009 at 2:39 pm | Reply

    ekzept wrote:

    If one wants to get philosophical, the other way of looking at this is to say AIC or the Bayesian Information Criterion replace Occam’s Razor, being quantitative.

    In my view, “refine” might be better than “replace.” Despite the differences in terms of languages in which they are expressed, much of the knowledge which exists in Newton’s gravitational theory is preserved in Einstein’s gravitational theory. In a sense, this is the meaning of the principle of correspondence. Regarding the difference between qualitative and quantitative reasoning, qualitative generally precedes quantitative.

    First you recognize that two entities are “different,” that is, “different” in relation to one another. But bring in a third object, and one object may appear similar, that is, of the “same” kind as the second, as the differences between the first and the second recede into the background whereas the differences between the first and the third are brought into the foreground. And it is only then that one is able to conceive of two units of the same kind and thus of quantitative measurement.
    *
    ekzept wrote:

    If one wants to get philosophical, the other way of looking at this is to say AIC or the Bayesian Information Criterion replace Occam’s Razor, being quantitative.

    Clearly AIC/BIC would be an improvement on Occam. In some contexts Occam can easily and unambiguously be applied, such as when two explanations are equally good at explaining all of the evidence but one is considerably more convoluted, involving more entities or assumptions than the other. But this is something we more or less take for granted nowadays, and in the areas where we need additional guidance some sort of improvement upon Occam’s Razor is required.
    *
    ekzept wrote:

    That’s like the old “law of the unspecialized” in biology: Which species are “unspecialized”?

    I believe what you may be getting at here is that “unspecialized” is in fact a comparative concept rather than an absolute one. Something is large only in comparison to something else that is small.

    As for the “law of the unspecialized” being antiquated, I see that it dates back to 1896. However, Stephen J. Gould regarded it as an insight of sorts as it forms the basis for a higher level selection.

    Furthermore, one species may very well be unspecialized — when compared with another. But of course the “law” would be more of a “rule” rather than an inviolable law. Biology.

  • ekzept // October 14, 2009 at 5:18 pm | Reply

    @Timothy Chase ,

    While Newton may fit the same date as relativity, the equations and conceptual frames are vastly different. And, if that’s not sufficiently different to click acknowledgement from you, surely the frame of quantum expectations are another notion, another qualitatively different world from the classic means of calculating, say, the hydrogen atom.

    One way a law or rule can fail to be falsifiable is if its terms are too ambiguous for someone to definitively know whether or not it applies.

  • Timothy Chase // October 15, 2009 at 3:02 am | Reply

    ekzept wrote:

    While Newton may fit the same date as relativity, the equations and conceptual frames are vastly different.

    I believe that is “data” and “conceptual framework.” In any case, Newtonian gravitational theory can be expressed in terms of the language of curved spacetime — where the curvature exists strictly between the spatial dimensions and the temporal dimension. Likewise, so long as the spacetime that is a solution to Einstein’s field equations is topologically equivalent to an extended Riemannian pseudo-sphere, one may replace the curved spacetime by a flat Newtonian three plus one dimensional spacetime and gravitational fields.
    *
    ekzept wrote:

    And, if that’s not sufficiently different to click acknowledgement from you, surely the frame of quantum expectations are another notion, another qualitatively different world from the classic means of calculating, say, the hydrogen atom.

    Qualitatively different? Surely. And as a matter of fact with our understanding of classical physics electrons would continuously emit electromagnetic energy and spiral into the nucleus of their respective atoms within a small fraction of a second — if classical physics held at that level. But we also know that at some level the equations of quantum mechanics and general relativity will break down, either one, the other or both. In all likelihood the language of one, the other or both will have to change as a result. Does this mean that they do not apply? They apply over the range over which they are applicable.

    The reason why one wouldn’t express Newtonian gravitational theory in terms of a curved spacetime with no gravitational forces is not because it would be false or because it would be any less accurate than Newtonian gravitational theory with a flat spacetime and gravitational fields, but due to the consequent complexity of the equations and the difficulty in applying them — not whether or not spacetime is in fact curved. (This sort of argument likely applies to General Relativity as well — insofar as topologies that differ from an extended Riemannian pseudo-sphere may be physically unrealizable or, if physically realizable, incapable of scientific verification.)

    In the same way that climatologists are often fond of saying that all models are false, but some are useful, one may also say that all scientific theories are false, but some are useful — and some are more useful than others — over larger domains, or due to the ease with which one may apply the equations in the required context.
    *
    ekzept wrote:

    One way a law or rule can fail to be falsifiable is if its terms are too ambiguous for someone to definitively know whether or not it applies.

    I take it that you are referring to the ambiguity (outside of a given domain) of Occam’s razor that I referred to. Of course another way that a principle or theory may fail to be falsifiable is by being unfalsifiable — such as with a prioristic reasoning.

    So much for AIK:

    Abstract. We describe an information-theoretic paradigm for analysis of ecological data, based on Kullback-Leibler information, that is an extension of likelihood theory and avoids the pitfalls of null hypothesis testing. Information-theoretic approaches emphasise a deliberate focus on the a priori science in developing a set of multiple working hypotheses or models. Simple methods then allow these hypotheses (models) to be ranked…

    K.P.Burnham, D.R.Anderson, Kullback-Leibler information as a basis for strong inference in ecological studies, Wildlife Research, 2001, 28, 111-119
    http://warnercnr.colostate.edu/~anderson/PDF_files/K-LINFO.pdf

    However, even as Popper saw it, not all knowledge is necessarily falsifiable. Ethical, aesthetic or theological statements, for example. Likewise the norms which define scientific discourse might very well be untestable — in the sense that they are used to define what it means for a scientific theory to be testable. However, falsifiability is a criterion that the philosophy of science gave up some time ago due to the interdependence that exists between scientific theories.

    Please see:

    Do Scientific Theories Ever Receive Justification?
    A Critique of the Principle of Falsifiability
    http://axismundi.hostzi.com/0/006.php
    *
    In any case, I thank you for bringing to my attention the papers on AIK and I look forward to your participation in their discussion. I believe it is quite likely that I will learn more as a result. And as a matter of fact, I either was never aware of the raven paradox you brought up, or if in fact I had been at one time, I forgot. And I have yet to wrap my brain around how it is resolved by AIK. No doubt like you, I wish to understand things, and oftentimes the first step in understanding is the recognition that one still has things left to understand.

  • ekzept // October 15, 2009 at 5:58 pm | Reply

    @Timothy Chase,

    In any case, Newtonian gravitational theory can be expressed in terms of the language of curved spacetime — where the curvature exists strictly between the spatial dimensions and the temporal dimension. Likewise, so long as the spacetime that is a solution to Einstein’s field equations is topologically equivalent to an extended Riemannian pseudo-sphere, one may replace the curved spacetime by a flat Newtonian three plus one dimensional spacetime and gravitational fields.

    Yeah, that can be done, but that certainly is not how Newton — or 19th century physics — thought about the problem. Sure, I understand entirely that the better model has to fit the older one, which is a special case.

    But we also know that at some level the equations of quantum mechanics and general relativity will break down, either one, the other or both. In all likelihood the language of one, the other or both will have to change as a result. Does this mean that they do not apply? They apply over the range over which they are applicable.

    Presumably there are “residuals” between what models predict and what the data shows, although for as far-reaching a model like relativity and quantum, it’s hard to imagine a single such depiction. I would also say, and any model selection process really ought to consider this, too, that a theory which is no more accurate than quantum but also no less accurate yet is computationally simpler to execute is a superior model.

    So much for AIK …

    I don’t see your point here. If “all models are false, but some are useful” (from statistician G.E.P. Box, BTW), surely there is a possibility that a “combination of models” may serve as a useful model. The thing is, physicists and our collective notion of reality balk at the idea of having Keplerian ellipses and Ptolemaic epicycles both being true. I think that’s more our problem than one of the models’.

    However, falsifiability is a criterion that the philosophy of science gave up some time ago due to the interdependence that exists between scientific theories.

    The trouble is that traditional logic as used and as far as I know cannot model iteratively convergent methods for finding truth, even if these are qualitative. These are used all the time in science, and not only in numerical methods. For example, the determination of ages of geological strata depends upon multiple techniques that mutually constrain one another, with ties to findings from other strata elsewhere which, by circumstance, are better constrained in various ways. This is something people outside the field often founder upon, but it is entirely legitimate.

    And as a matter of fact, I either was never aware of the raven paradox you brought up, or if in fact I had been at one time, I forgot. And I have yet to wrap my brain around how it is resolved by AIK.

    I don’t think the Anderson AIC “resolves” the paradox, but offers an explanation as compelling as the Bayesian one.

  • Timothy Chase // October 17, 2009 at 6:54 am | Reply

    Science and Philosophy, Part I

    Regarding my observation that either Newtonian gravitational theory or Einstein’s gravitational theory may be expressed either in terms of a flat spacetime with gravitational fields or a curved spacetime in which gravitational fields have been eliminated, ekzept wrote:

    Yeah, that can be done, but that certainly is not how Newton — or 19th century physics — thought about the problem. Sure, I understand entirely that the better model has to fit the older one, which is a special case.

    Agreed. Even today I believe one would be hard-put to find an engineer who prefers to perform calculations using Newton’s gravitational theory as it would be expressed within the language of curved spacetime rather than flat spacetime plus gravitational fields. Engineers use the traditional language in which it is expressed because the concepts, equations and calculations are simpler in the traditional language of Newton’s gravitational theory.

    However, if one could more easily and readily solve engineering problems by employing the language of curved spacetime I doubt that it would be very long before engineers regarded flat spacetime as some sort of bad dream. Flat spacetime with gravitational fields is the language (form) in which Newtonian gravitational theory can be most simply expressed, whereas curved spacetime without gravitational fields is the language in which Einsteinian gravitational theory can be most simply and economically expressed.

    ekzept wrote:

    Presumably there are “residuals” between what models predict and what the data shows, although for as far-reaching a model like relativity and quantum, it’s hard to imagine a single such depiction.

    As I understand it, the problem isn’t so much with a lack of correspondence between theory and evidence at this point. There is just about always one experiment or another which suggests as much until the results are overturned a few years later. But there exists a great deal of difficulty reconciling the two theories with each other – despite their having proven exceedingly accurate in their respective domains.

    Integrating the insights of quantum mechanics with those of special relativity is not especially problematic. Integrating the insights of quantum mechanics and general relativity however is. When one studies the evolution of a wave function or probability density operator, one does so against the backdrop of a spatial geometry that is well-defined.

    But what does one do when the evolution of the geometry of spacetime becomes probabilistic — and how would one express a theory of such an evolution? When the very concept of geometry begins to fall apart as one nears the Planck-Wheeler level?
    *
    ekzept had written:

    One way a law or rule can fail to be falsifiable is if its terms are too ambiguous for someone to definitively know whether or not it applies.

    I responded:

    I take it that you are referring to the ambiguity (outside of a given domain) of Occam’s razor that I referred to. Of course another way that a principle or theory may fail to be falsifiable is by being unfalsifiable — such as with a prioristic reasoning.

    So much for AIK:…

    ekzept then responded:

    I don’t see your point here. If “all models are false, but some are useful” (from statistician G.E.P. Box, BTW), surely there is a possibility that a “combination of models” may serve as a useful model.

    I wasn’t thinking so much of the multimodel approach as the simple fact that AIK itself is presumably a prioristic in nature, at least according to the abstract that I then quoted, with the most relevant sentence being:

    Information-theoretic approaches emphasise a deliberate focus on the a priori science in developing a set of multiple working hypotheses or models.

    If it is “a prioristic” then it is no more falsifiable than Occam’s razor — and the principle of falsifiability cuts both ways. But the fact that both principles are essentially normative in nature would suggest that they are not theories to be tested but rather are elements in the framework through which one defines what it means for scientific theories to be testable — like the principle of falsifiability itself — which presumably is itself unfalsifiable.

    However, as I pointed out, Popper’s principle of falsifiability presupposes that scientific theories can be tested in isolation. Given the interdependence that exists between scientific theories we know that this assumption is false, and thus the principle of falsifiability must be, and has been, abandoned by the philosophy of science — essentially since the 1950s. And the theoretical basis for its abandonment was more or less understood as far back as the 1890s, roughly forty years before the principle of falsifiability was first formulated.

    Please see:

    A Critique of the Principle of Falsifiability
    http://axismundi.hostzi.com/0/006.php

    As such there are a variety of reasons why I would regard the lack of falsifiability for either Occam’s razor or AIK irrelevant to either. However, I myself would go even further and abandon the concept of the a priori along with the analytic/synthetic dichotomy.

    Please see:

    Something Revolutionary: A Critique of Kant’s The Critique of Pure Reason
    Section 25: Self-Reference and the Analytic/Synthetic Dichotomy
    http://axismundi.hostzi.com/0/021.php

  • Timothy Chase // October 17, 2009 at 6:56 am | Reply

    Science and Philosophy, Part II of II

    ekzept wrote:

    If “all models are false, but some are useful” (from statistician G.E.P. Box, BTW), surely there is a possibility that a “combination of models” may serve as a useful model. The thing is, physicists and our collective notion of reality balk at the idea of having Keplerian ellipses and Ptolemaic epicycles both being true. I think that’s more our problem than one of the models’.

    I would essentially agree, although the way that I would put it is not that “all models are false,” as that would be in my view a colloquialism, but rather that all models (or theories) are approximations. Likewise, presumably with time we could improve upon the models, some will drop away perhaps to be replaced. But for the time being at least, each does some things better than others — otherwise it would have already fallen away — and given the law of large numbers, the average is closer to that which each is an attempt to model than any of the models that go into the average.
    *
    ekzept wrote:

    The trouble is that traditional logic as used and as far as I know cannot model iteratively convergent methods for finding truth, even if these are qualitative. These are used all the time in science, and not only in numerical methods.

    If by “traditional logic” you mean “formal logic” (either categorical or propositional) then no, I wouldn’t expect to find any “iteratively convergent methods for finding the truth” there. However, there has been a convergence of sorts in epistemology with respect to theories of justification between coherentialism and an empirical foundationalism towards what may be termed a “coherentialist moderate foundationalism.” Or at least this is what Robert Audi proposes in “Fallibilist Foundationalism and Holistic Coherentialism.”

    This sort of approach acknowledges the fact that multiple, independent lines of evidence are often capable of transmitting far greater justification to a given conclusion than any one line of evidence in isolation. Likewise, some element of coherentialism would seem to follow in recognition of Duhem’s thesis first put forward during the 1890s. And consistent with moderate foundationalist elements, knowledge would consist primarily of corrigible knowledge — where justification is always a matter of degree.

    This would lend itself to what is sometimes termed “social epistemology” (which might study, e.g., such things as dialogue and debate, and at an abstract level at least the division of cognitive labor). The philosophy of science would be a branch of this. Then under the “philosophy of science” exists the study of “the” scientific method – which would address such questions as whether there is even any one single scientific method — or whether there are different scientific methods for different sciences. Presumably when Burnham and Anderson state, “Information-theoretic approaches emphasise a deliberate focus on the a priori science in developing a set of multiple working hypotheses or models,” at one level or another, they are dealing with issues related to the philosophy of science.
    *
    ekzept wrote:

    For example, the determination of ages of geological strata depends upon multiple techniques that mutually constrain one another, with ties to findings from other strata elsewhere which, by circumstance, are better constrained in various ways. This is something people outside the field often founder upon, but it is entirely legitimate.

    Certainly one of the things I find most beautiful about science. And although I use other examples, I believe I have done a fair job of illustrating just this sort of interdependence in the piece I linked to earlier critiquing Karl Popper’s principle of falsifiability — and thereby illustrating Duhem’s thesis. In case you are interested, that piece belongs to a ten part paper of mine that aims at providing a critical history of early twentieth century empiricism which I have made available here:

    A Question of Meaning
    http://axismundi.hostzi.com/0/024.php

  • ekzept // October 17, 2009 at 2:24 pm | Reply

    Likewise, presumably with time we could improve upon the models, some will drop away perhaps to be replaced. But for the time being at least, each does some things better than others — otherwise it would have already fallen away — and given the law of large numbers, the average is closer to that which each is an attempt to model than any of the models that go into the average.

    Something I learned years after graduate school is that the embrace of a new Kuhn-type frame is not a perfect or consistent improvement. When Newtonian — or perhaps better put, Lagrangian — astronomy is displaced by curved space-time, there are things and insights which were deep and convenient which are lost. They are outweighed, however, by the greater accuracy of the new.

    Similarly, the advent of computational models has ushered in a new paradigm with many benefits, but there are deep insights and economies to be had using traditional, difficult mathematics, even if these are aided by creatures like Mathematica.

    In case you are interested, that piece belongs to a ten part paper of mine that aims at providing a critical history of early twentieth century empiricism which I have made available here

    I am interested, but I’m afraid I have not the time. In addition to learning many new things about maths — which always goes slow — I’m wading through the new (and important) Reinhart & Rogoff book This Time Is Different.

  • Ray Ladbury // October 17, 2009 at 3:58 pm | Reply

    Timothy,
    The emphasis on a priori science is basically a way to restrict potential theories/distributions/hypotheses that will be considered according to theoretical expectations. For instance, presumably if a quantity is positive definite, we need consider only distributions defined from 0 to infinity, not the entire real line. In essence this is similar to the selection of a family of models for a Bayesian prior. One selects the best model according to how economically it fits the data, or weights the results for the different models according to the same criterion (e.g. Akaike weights). Thus, in the former scheme a model is falsified when it has negligible support from the data (e.g. delta(AIC)>10), and in the latter case when its weight is sufficiently small that it contributes negligibly to the result. Rather than outright falsification, the scheme is probabilistic, and so probably a better fit for active science as opposed to settled science.
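
    For concreteness, a minimal sketch in Python of the weighting scheme just described (the AIC values are invented purely for illustration):

    ```python
    import numpy as np

    aic = np.array([100.0, 102.3, 111.5])   # AIC for three candidate models
    delta = aic - aic.min()                  # delta(AIC) relative to the best model

    weights = np.exp(-0.5 * delta)
    weights /= weights.sum()                 # Akaike weights

    print(delta)     # a model with delta(AIC) > ~10 has essentially no support
    print(weights)   # relative weight each model gets in a weighted result
    ```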

  • Timothy Chase // October 17, 2009 at 5:26 pm | Reply

    ekzept,

    Sorry — I hadn’t checked the link you have back to your blog. It looks like you get the chance to do some fascinating work — and it likewise looks like we share some other interests. Photography, for example. And likewise the influence of the Religious Right on American politics, including their attempts to foist creationism in the guise of intelligent design upon the US educational system. Here is one of the responses to Thoughts on the Intelligent Design Inference which you might enjoy — posted at the British Centre for Science Education, an organization that I helped form.

    And yes, unfortunately creationism — particularly young earth creationism — has spread with a vengeance to other parts of the world: Australia, France, former Eastern Europe, and even the birthplace of Charles Darwin. Likewise the religious right plays a large part in the attacks upon the legitimacy of climate science.

    For that particular reason they figure somewhat prominently in a custom Google search engine I have been putting together:

    Climate Search
    http://www.google.com/cse/home?cx=014999486326905263983:6yjomxybqli&hl=en

    Included among the sites that this search engine searches is:

    Notable Names Database
    http://www.nndb.com

    … which you might also like as it contains a tool for visualizing data, relationships and the exploring of networks.

    For example, this is the entry they have for:

    Howard Ahmanson, Jr.
    http://www.nndb.com/people/374/000058200/

    … who largely financed the Discovery Institute, and this is an example of the sort of network someone has created using the tool for studying the attempt to impose theocracy upon the United States:

    Theocracy Now!
    http://mapper.nndb.com/start/?map=1789

    … which you might also find of interest, given some of the links you have at your website, such as the link to the book “The Eliminationists: How Hate Talk Radicalized the American Right” which you linked to under your blog photos.

    Of course the religious right isn’t the only place on our political landscape where one will find hate talk. Given the phenomenon of complementary schismogenesis, one will find one brand of extremism giving rise to its “opposite” on the other side. (Quite commonplace in history, e.g., communism vs. Nazism in Germany during the 1930s, or French Algeria between the French separatists and French loyalists in the 1950s.) But it would seem to be where hate talk is most pronounced and widespread.

  • Timothy Chase // October 17, 2009 at 5:38 pm | Reply

    Ray Ladbury wrote:

    The emphasis on a priori science is basically a way to restrict potential theories/distributions/hypotheses that will be considered according to theoretical expectations.

    Understood — and actually I regard that sort of a priori approach as entirely valid — and in my personal view AIK is quite likely the correct approach. My argument against the “a priori” mentioned above is not in my view an argument against the AIK approach but rather an argument against any sort of criticism based upon the principle of falsifiability. Ultimately, however, I would argue that both the dichotomies between the a priori and a posteriori and the analytic and synthetic are false dichotomies. In fact this is a large part of my argument with both Kant and Logical Positivism.

  • Timothy Chase // October 17, 2009 at 6:06 pm | Reply

    ekzept wrote:

    Something I learned years after graduate school is that the embrace of a new Kuhn-type frame is not a perfect or consistent improvement.

    I would most certainly agree. As a matter of fact, one of the attacks on the general legitimacy of science is grounded in a relativistic approach that is greatly indebted to Kuhn, one proponent of which is…

    Steve Fuller
    http://www.bcseweb.org.uk/index.php/Main/SteveFuller

    One element (of several) in the response to this sort of approach can be found in:

    A Question of Meaning, [6]: The Criterion of Self-Referential Coherence Vs. Logical Positivism, [6.1] The Criterion of Self-Referential Coherence applied to Radical Skepticism
    http://axismundi.hostzi.com/0/030.php

    … as would:

    A Question of Meaning, [10]: Against W.V. Quine and the Analytic/Synthetic Dichotomy
    http://axismundi.hostzi.com/0/034.php
    *
    ekzept wrote:

    When Newtonian — or perhaps better put, Lagrangian — astronomy is displaced by curved space-time, there are things and insights which were deep and convenient which are lost. They are outweighed, however, by the greater accuracy of the new.

    I would most certainly agree. Then again there are some unexpected insights that are preserved, such as the absence of a gravitational field throughout the hollow interior of a spherical shell of uniform density.
    *
    ekzept wrote:

    Similarly, the advent of computational models has ushered in a new paradigm with many benefits, but there are deep insights and economies to be had using traditional, difficult mathematics, even if these are aided by creatures like Mathematica.

    I have no reason to think otherwise — despite some of my arguments with Russell — such as those involving self-referential logic. I believe that one should always be open to learning from those that one disagrees with. There are often insights which combined with one’s own can be very illuminating. This is the power of dialogue. And, borrowing from “Babylon 5,” I find that there is much value in the view that,

    “Truth is a three-edged sword: your side, their side and the truth.”
    *
    ekzept wrote:

    I am interested [in your paper], but I’m afraid I have not the time.

    Understood. Likewise, I would have included my critique of Kuhn and what came after in a broader version of “A Question of Meaning” if I had had the time, but as Morpheus says in The Matrix, “Time is always against us.”

  • Timothy Chase // October 17, 2009 at 6:11 pm | Reply

    PS

    Correction:

    “Truth is a three-edged sword: your side, their side and the truth,”

    … should have been:

    “Understanding is a three-edged sword: your side, their side and the truth.”

  • Timothy Chase // October 18, 2009 at 12:52 am | Reply

    Not quite the Akaike information criterion, but something I ran into a few years back that may be of interest…

    Recent psychophysical experiments indicate that humans perform near-optimal Bayesian inference in a wide variety of tasks, ranging from cue integration to decision making to motor control. This implies that neurons both represent probability distributions and combine those distributions according to a close approximation to Bayes’ rule. At first sight, it would seem that the high variability in the responses of cortical neurons would make it difficult to implement such optimal statistical inference in cortical circuits. We argue that, in fact, this variability implies that populations of neurons automatically represent probability distributions over the stimulus, a type of code we call probabilistic population codes. Moreover, we demonstrate that the Poisson-like variability observed in cortex reduces a broad class of Bayesian inference to simple linear combinations of populations of neural activity. These results hold for arbitrary probability distributions over the stimulus, for tuning curves of arbitrary shape and for realistic neuronal variability.

    Wei Ji Ma, Jeffrey M Beck, Peter E Latham & Alexandre Pouget (2006) Bayesian inference with probabilistic population codes, Nature Neuroscience 9, 1432-1438
    http://www.bcs.rochester.edu/people/alex/pub/articles/mabecklathampougetnn06.pdf

  • Barton Paul Levenson // October 18, 2009 at 11:12 am | Reply

    Falsifiability works for me. Yes, if theory A predicts Jupiter should be bright green and it turns out to be orange, you can’t absolutely rule out theory A yet, because something might be masking the greenness, or it might have been green in the absence of some extra factor. But using Occam’s Razor, you can put those ideas aside at least provisionally. Falsifiability plus Occam’s Razor seems like an excellent working combination to me, even if neither technically works by itself.

  • Timothy Chase // October 18, 2009 at 6:03 pm | Reply

    Falsifiability and Simplicity, Part I of II

    Barton Paul Levenson wrote:

    Falsifiability works for me.

    I sometimes come off as sounding like I see no value in the “principle of falsifiability,” but that really isn’t the case, and I have stated as much before in a somewhat informal essay…

    Real scientists do everything they can to make their theories as testable and as falsifiable as possible. Real scientists make specific predictions about what has yet to be discovered, and try to make those predictions as risky as possible given the current state of our generally accepted knowledge, knowing that if they do so and win, then they win big.

    New Lenny Flank Essay at Talk Reason
    by Timothy Chase
    http://www.talkreason.org/articles/newflank.cfm

    Falsifiability is something to be aimed for, but in my view, particularly with the more advanced scientific theories, it is difficult if not impossible to achieve. The reason is that no scientific theory stands or falls in absolute isolation from the rest. We can see suggestions of this in the attempt to rescue Newton’s theory with a hypothetical planet to explain the orbit of Uranus (prior to the discovery of Neptune), or with a hypothetical planet Vulcan to explain the orbit of Mercury (or, alternatively, an appeal to unobserved oblateness of the distribution of mass within the Sun).

    In either case, the additional “auxiliary hypotheses” were appropriate responses — so long as one sought to make them testable independently of the hypothesis or theory that they were proposed to save. (When they aren’t independently testable they are referred to as ad hoc hypotheses.)

    The possibility that a theory might be properly “saved” by means of an auxiliary hypothesis (hypothesizing an additional planet in order to reconcile Newton with the orbit of Uranus — prior to the discovery of Neptune) or improperly “rescued” by means of an ad hoc hypothesis (what could have happened given the divergence between Newton and the precession of Mercury’s orbit if scientists had been unwilling to test or relinquish the hypothetical planet Vulcan as a means of explaining that precession) indicates the beginning of this, but it is only the beginning.

    Almost inevitably, in any test of the theories of modern science one must presuppose certain premises, or more well-established theories of science, in order to test the theory in question. One must presuppose that these premises are already established or true, because in logic the result one predicts is not dependent upon the theory under test alone but on a variety of background assumptions; if the prediction turns out to be false, one does not know that the theory itself is false, only that at least one of the premises which formed the basis for that prediction was false.

    I offer an extended example of this here — involving evolution and the reasons why 19th Century science regarded it as unlikely that the earth was old enough for evolution to explain the origin of the earth’s many species of life:

    A Critique of the Principle of Falsifiability
    http://axismundi.hostzi.com/0/006.php

  • Timothy Chase // October 18, 2009 at 6:05 pm | Reply

    Falsifiability and Simplicity, Part II of II

    Barton Paul Levenson wrote:

    Yes, if theory A predicts Jupiter should be bright green and it turns out to be orange, you can’t absolutely rule out theory A yet, because something might be masking the greenness, or it might have been green in the absence of some extra factor. But using Occam’s Razor, you can put those ideas aside at least provisionally.

    Agreed. One discounts the theory and the hypothesis proposed to save it — until such time as one is able to propose a means of independently testing the hypothesis. Without such a test the proposed hypothesis is ad hoc, and it becomes “auxiliary” only once it can be tested independently of the theory it is intended to save. But if one is setting aside the theory “only provisionally,” then strictly speaking it hasn’t actually been falsified. And as such, given the interdependence of our scientific knowledge, validation (e.g., induction) and falsification are always matters of degree, that is, forms of confirmation or disconfirmation. In which case I would argue that it is entirely appropriate to regard a well-supported and well-established theory as true, and consequently as a form of knowledge — provisionally — that is, as a form of corrigible knowledge.
    *
    Barton Paul Levenson wrote:

    Falsifiability plus Occam’s Razor seems like an excellent working combination to me, even if neither technically works by itself.

    If I understand things correctly, this is essentially what the Akaike Information Criterion does — but rather than doing so in a qualitative language, where a theory either receives confirmation or disconfirmation or where an additional hypothesis is either auxiliary or ad hoc — it does so quantitatively.

    Please see for example Ray Ladbury’s statement above:

    The Occam’s razor analogy of course stems from the form of AIC, which has a term proportional to the likelihood (a measure of goodness of fit) and the penalty term in the number of terms in the model. Since likelihood enters into the picture as a log term, goodness of fit must actually increase exponentially with model complexity to justify the added complexity…

    … as well as of course the essay by Tamino itself.

    And as Ray Ladbury states later, this ability to quantify the reasons for regarding a given theory as “confirmed” or “disconfirmed” is a strength where science is still active or in flux:

    One selects the best model according to how economically it fits the data or weights the results for different models according to how the same criterion (e.g. Akaike weights). Thus, in the former model scheme a model is falsified when it has negligible support from the data (e.g. delta(AIC)>10) or in the latter case when its weight is sufficiently small that it contributes negligibly to the result. Rather than outright falsification, the scheme is probabilistic, and so probably a better fit for active science as opposed to settled science.

    … which is after all precisely where you would want some sort of guidance from a principle of scientific method. If the science is already well established there isn’t much call for guidance, is there? When he states, “One selects the best model according to how economically it fits the data…” this is in essence where the insight from Occam’s razor (or alternatively, the “rule of simplicity”) is preserved but is given mathematical form.
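    To make that concrete, here is a minimal sketch (in Python, with synthetic data of my own, nothing from Tamino’s post) of how AIC rewards a better fit only if the improvement outweighs the two-points-per-parameter penalty, when comparing polynomial trend models:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic data: a straight-line "truth" plus Gaussian noise.
    t = np.arange(50, dtype=float)
    x = 2.0 + 0.1 * t + rng.normal(0.0, 1.0, size=t.size)

    def aic_for_poly(order):
        """AIC for a least-squares polynomial fit of the given order.
        For Gaussian errors, -2 ln L = n ln(RSS/n) up to an additive
        constant; k counts the polynomial coefficients plus the
        estimated noise variance."""
        coeffs = np.polyfit(t, x, order)
        rss = float(np.sum((x - np.polyval(coeffs, t)) ** 2))
        n = t.size
        k = order + 2
        return n * np.log(rss / n) + 2 * k

    for order in range(1, 6):
        print(order, round(aic_for_poly(order), 2))

    # Higher orders always shave a little off the RSS, but AIC only
    # prefers them if the improvement beats the 2-per-parameter penalty,
    # so a low-order model is usually selected here.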

  • ekzept // October 18, 2009 at 7:57 pm | Reply

    There are many other problems with a strictly Bayesian approach to things, including the need to be unnecessarily precise in the specification of an initial prior: sure, you might guess that there are 3 ranges of possible values for a prior, with each lower one being about twice as probable as the next one up, but you might not really know it’s twice as probable; you might simply think it’s somewhere between 25% and 300% more probable than the next one up. Bayes makes you specify, and has no organic means of passing this uncertainty on into your inference.
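    To illustrate with a toy sketch (Python, entirely made-up numbers): two exact priors that are both compatible with that vague ratio judgement can give noticeably different posteriors from the same data, because the machinery forces one precise prior to be written down:

    import numpy as np

    # Three candidate values for a binomial success probability.
    p = np.array([0.2, 0.5, 0.8])

    # Two exact priors, both loosely consistent with "each lower value
    # is somewhat-to-several times more probable than the next one up".
    prior_a = np.array([4.0, 2.0, 1.0]); prior_a /= prior_a.sum()   # 2x ratios
    prior_b = np.array([9.0, 3.0, 1.0]); prior_b /= prior_b.sum()   # 3x ratios

    # Toy data: 6 successes in 10 trials.
    k, n = 6, 10
    likelihood = p**k * (1 - p)**(n - k)

    def posterior(prior):
        post = prior * likelihood
        return post / post.sum()

    print(posterior(prior_a))
    print(posterior(prior_b))
    # The two posteriors differ, yet both priors honour the same vague
    # judgement -- the imprecision never gets carried into the inference.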

  • Timothy Chase // October 19, 2009 at 2:31 am | Reply

    For those who are interested…

    Two Bayesian Defenses in response to AIC:

    In the curve fitting problem two conflicting desiderata, simplicity and goodness-of-fit pull in opposite directions. To solve this problem, two proposals, the first one based on Bayes’ theorem criterion (BTC) and the second one advocated by Forster and Sober based on Akaike’s Information Criterion (AIC) are discussed. We show that AIC, which is frequentist in spirit, is logically equivalent to BTC, provided that a suitable choice of priors is made. We evaluate the charges against Bayesianism and contend that AIC approach has shortcomings…

    Prasanta S. Bandyopadhyay et al (1998) The Curve Fitting Problem: A Bayesian Rejoinder. Philosophy of Science 66 (3):402
    http://scistud.umkc.edu/psa98/papers/bandyo.pdf

    The advent of formal definitions of the simplicity of a theory has important implications for model selection. But what is the best way to define simplicity? Forster and Sober ([1994]) advocate the use of Akaike’s Information Criterion (AIC), a non-Bayesian formalisation of the notion of simplicity. This forms an important part of their wider attack on Bayesianism in the philosophy of science. We defend a Bayesian alternative: the simplicity of a theory is to be characterised in terms of Wallace’s Minimum Message Length (MML). We show that AIC is inadequate for many statistical problems where MML performs well. Whereas MML is always defined, AIC can be undefined….

    David L. Dowe, Steve Gardner, and Graham Oppy (2007) Bayes not Bust! Why Simplicity is no Problem for Bayesians, Brit. J. Phil. Sci. 58, 709–754
    http://bjps.oxfordjournals.org/cgi/content/full/58/4/709

    Two Papers on AIC, Curve-Fitting and Grue…

    Malcolm R. Forster (1999) Model Selection in Science: The Problem of Language Variance, Brit. J. Phil. Sci. 50, 83-102
    http://philosophy.wisc.edu/forster/papers/BJPS1999.pdf

    An article that has been the basis for a number of presentations but is currently in draft:

    Gruesome Simplicity: A Guide to Truth
    http://www.fitelson.org/few/few_07/lyon.pdf

  • Timothy Chase // October 19, 2009 at 2:52 am | Reply

    With registration, free access to full text until October 31, 2009:

    Jouni Kuha (2004) AIC and BIC: Comparisons of Assumptions and Performance, Sociological Methods & Research, Vol. 33, No. 2, November 2004 188-229
    http://smr.sagepub.com/cgi/content/abstract/33/2/188

  • Timothy Chase // October 19, 2009 at 6:12 am | Reply

    Two more comparisons that may be of interest that are open access…

    Michael E. Alfaro and John P. Huelsenbeck (2006) Comparative Performance of Bayesian and AIC-Based Measures of Phylogenetic Model Uncertainty, Systematic Biology 55(1):89-96
    http://sysbio.oxfordjournals.org/cgi/content/full/55/1/89

    Russell J. Steele, Adrian E. Raftery (Sept. 2009) Performance of Bayesian Model Selection Criteria for Gaussian Mixture Models, University of Washington Department of Statistics, Technical Report No. 559
    http://www.stat.washington.edu/research/reports/2009/tr559.pdf

    The last of these compares the performance of several different criteria, including
    BIC (Schwarz 1978), ICL (Biernacki, Celeux, and Govaert 1998), DIC (Spiegelhalter, Best, Carlin, and van der Linde 2002) and AIC.

  • Barton Paul Levenson // October 19, 2009 at 11:00 am | Reply

    Well then, let me say that a theory can be “effectively falsified” or “essentially falsified.”

    BTW, did you guys know that in addition to the Akaike Information Criterion, there is the “Corrected Akaike Information Criterion” and the “Schwarz Information Criterion,” all of which sometimes give different answers? The multiple regression program I wrote gives all three so you can use whichever one you want.

  • Ray Ladbury // October 19, 2009 at 2:24 pm | Reply

    ekzept,
    For a very practical and cogent formulation of the Bayesian approach, see E. T. Jaynes’s Probability: The Logic of Science. In many ways, information theory and Bayesian approaches are complementary. Indeed, there is no reason why one must choose only a single Prior–one can compare the performance of different Priors and select the one that performs best or even average over Priors.
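    (A minimal sketch in Python, with toy numbers of my own, of that last point: score each candidate Prior by its marginal likelihood on the data, then either keep the best one or mix the resulting posteriors with those scores as weights.)

    import numpy as np

    p = np.array([0.2, 0.5, 0.8])                 # candidate parameter values
    priors = {
        "flat":   np.array([1.0, 1.0, 1.0]) / 3.0,
        "skewed": np.array([0.6, 0.3, 0.1]),
    }

    k, n = 6, 10                                  # toy data: 6 successes in 10 trials
    likelihood = p**k * (1 - p)**(n - k)

    # Marginal likelihood ("evidence") of the data under each Prior.
    evidence = {name: float(np.sum(pr * likelihood)) for name, pr in priors.items()}
    posteriors = {name: pr * likelihood / evidence[name] for name, pr in priors.items()}

    # Select the Prior that performs best...
    best = max(evidence, key=evidence.get)

    # ...or average over Priors, weighting each posterior by its
    # normalised evidence.
    total = sum(evidence.values())
    averaged = sum((evidence[name] / total) * posteriors[name] for name in priors)

    print(best)
    print(averaged)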

    The whole issue of falsifiability is an interesting one. Science has become increasingly probabilistic as we have started to understand how errors propagate. Even so, at some point, a theory becomes so improbable that we cease to consider it. In effect, our “Prior” as repeatedly updated with evidence would have effectively zero probability. Helen Quinn argued in a Reference Frame column in Physics Today that at some point, evidence in favor of a theory would become sufficiently strong that it could be considered established fact–probability effectively 1. Of course, this is contingent upon the results of all subsequent experiments, but unless we find that “The Matrix” is history rather than fiction, we can be pretty confident that Earth is round and orbits the Sun.

    Finally, wrt the different information criteria (AIC, B/SIC, DIC), one can also look at likelihood itself. In effect one can view these different criteria as applying different weights to goodness of fit vs. the model complexity penalty–and, for AICc, the amount of data. So, if you want to get really speculative, you can ask how the cross entropy (or even Shannon entropy) relates to a thermodynamic entropy. And even further out–what corresponds to “energy,” “temperature,” etc.?

  • Timothy Chase // October 19, 2009 at 4:04 pm | Reply

    Barton Paul Levenson wrote:

    Well then, let me say that a theory can be “effectively falsified” or “essentially falsified.”

    Hmmm…. That may work.
    *
    Barton Paul Levenson wrote:

    BTW, did you guys know that in addition to the Akaike Information Criterion, there is the “Corrected Akaike Information Criterion” and the “Schwarz Information Criterion,” all of which sometimes give different answers?

    I believe I might have run across something to that effect. Well, not the “Corrected Akaike Information Criterion” as of yet, but there were others.

    Akaike and Schwarz tend to give similar answers, but Schwarz tends to do better by various criteria — homing in on the model that best fits the data, suggesting fewer parameters, etc. Michael E. Alfaro and John P. Huelsenbeck (2006) state at one point in their abstract that, “The AIC–based credible interval appeared to be more robust to the violation of the rate homogeneity assumption,” but for the most part they seem to lean toward the Bayesian Information Criterion proposed by Schwarz.

    Jouni Kuha (2004) suggests that, “… useful information for model selection can be obtained from using AIC and BIC together, particularly from trying as far as possible to find models favored by both criteria.” Russell J. Steele and Adrian E. Raftery (Sept. 2009) come down rather strongly on the side of the Bayesian Information Criteria — but this seems to be in an area where BIC was already thought to perform better than AIC.

    But I limited my reading mostly to the abstracts. Mostly. My main objective was simply to find articles that had something to say regarding AIC and BIC, both versus one another and in relation to one another.
    *
    Barton Paul Levenson wrote:

    The multiple regression program I wrote gives all three so you can use whichever one you want.

    Oh? What language is it written in? I will probably just stick with Excel — that makes me blunderingly dangerous as it is — I believe arctic ice might ring a bell…. But I am still curious.

  • Deep Climate // October 19, 2009 at 4:32 pm | Reply

    I was wondering when BIC would come up.

    I noted this passage on climate sensitivity in the CCSP 5.2, p.67:

    An approximation to the log of the Bayes Factor for large sample sizes, Schwarz’s Bayesian Information Criterion or BIC, is often used as a model-fitting criterion when selecting among all possible subset models. The BIC allows models to be evaluated in terms of a lack of fit component (a function of the sample size and mean squared error) and a penalty term for the number of parameters in a model. The BIC differs from the well-known Akaike’s Information Criterion (AIC) only in the penalty for the number of included model terms.

    An interesting informal comparison of AIC and BIC is here:

    http://www.cs.cmu.edu/~zhuxj/courseproject/aicbic/

    Enjoy!

  • Deep Climate // October 19, 2009 at 4:36 pm | Reply

    As to how BIC has been used in climate studies, CCSP continues:

    Another related model selection statistic is Mallow’s Cp (Laud and Ibrahim, 1995). Karl et al. (1996) utilize the BIC to select among ARMA models for climate change, finding that the Climate Extremes Index (CEI) and the United States Greenhouse Climate Response Index (GCRI) increased abruptly during the 1970s.

  • David B. Benson // October 19, 2009 at 10:21 pm | Reply

    Deep Climate // October 19, 2009 at 4:32 pm — Thanks for the link! That was fun and informative!

  • Timothy Chase // October 20, 2009 at 2:03 am | Reply

    Ray Ladbury wrote:

    ekzept,
    For a very practical and cogent formulation of the Bayesian approach, see E. T. Jaynes’s Probability: The Logic of Science.

    Ray, actually I believe that might be:

    Probability Theory: The Logic of Science
    by E.T. Jaynes
    http://bayes.wustl.edu/etj/prob/book.pdf

    Just a hunch.

  • hipparchia // October 20, 2009 at 2:05 am | Reply

    de-lurking to thank mrs tamino for typing this for us.

  • Barton Paul Levenson // October 20, 2009 at 10:46 am | Reply

    Tim,

    I’m ashamed to say I wrote it in Just Basic, which is an interpreted language. I wanted a GUI interface, and while my Fortran compiler is supposed to provide one, I never really learned how to use it. (I mainly use Fortran for writing RCMs in console mode.) I should probably rewrite the whole thing in Fortran some time.

  • Timothy Chase // October 20, 2009 at 4:32 pm | Reply

    Barton,

    My now-deceased permanent position was as a VB6 programmer for five years. So I don’t look down on visual basic. However, I have a friend I worked for more recently who managed to keep a temp job at Boeing for as long as my “permanent” job. And at this point, VB6 is in roughly the same position as Sanskrit.

    But if you were going into DotNet, C-sharp would be the way to go. VB.Net seems to have been mostly a bridge to the world of DotNet, and DotNet job growth will be mostly in C-sharp from now on. And while there are fewer C-Sharp jobs relative to Java, the demand for C-Sharp (being new, I suppose) is expanding much more rapidly.

    Then there are the other C-languages. Javascript, PHP, ActionScript and so on. Learning any one of them gives you a leg-up on the rest. So that is where I am focusing.

    Plus C-Sharp is incorporating functional-language and query-language features. I can only assume Java is doing the same. (But of course C-Sharp came after the object-oriented revolution in programming and has consistently incorporated its principles. Java can’t quite claim the same.)

    I know Fortran is what gets used the most in climatology, but in the long-run, code-reuse and all, it might be a good idea if they started moving to C-type languages — or at least using them more often.

  • Deep Climate // October 20, 2009 at 6:29 pm | Reply

    I have written programs for part or most of my living at various times (still do, but it is quite peripheral now).

    I consider C++ and Java full-fledged OO languages. AFAIK, Java was designed as an implementation of OO principles, but I’m not a software historian. I can’t comment on C#, although part of me recoils at platform-specific languages.

    It seems to me all three must have baggage inherited, so to speak, from C.

    As far as I can see, Matlab seems to be the current choice for statistical analysis in climate research (e.g. Mann et al 2008). It has a huge advantage over R in that it is comilable.

    I agree that a migration away from Fortran (e.g. in climate data processing, or climate models) could be helpful, but there’s probably a lot of legacy data processing code that would need to be rewritten from scratch.

    • suricat // October 21, 2009 at 12:19 am | Reply

      If you use a compiler then the only other addition you need to include any other computer language is a translator!

      The problem with this is that the translator uses up so much of the ‘run time’ making its comparisons that the ‘other languages’ aren’t justifiable due to the increased ‘run time’.

      I’ve seen a lot of this problem in the GCMs that I’ve looked at, but, so what if the code needs to be rewritten in the name of shorter run times.

      Best regards, suricat.

      • dhogaza // October 21, 2009 at 3:44 am

        I worked much of my early professional life as a very highly-regarded compiler writer …

        And you’re full of shit, though your note is so convoluted I can’t say exactly how.

      • dhogaza // October 21, 2009 at 3:46 am

        “I’ve seen a lot of this problem in the GCMs that I’ve looked at”

        Name them, give URL references to the code, and then prepare to be bitch-slapped by people who know what they’re talking about.

  • Deep Climate // October 20, 2009 at 6:30 pm | Reply

    Oops, “comilable” should be “compilable”

  • Timothy Chase // October 20, 2009 at 9:55 pm | Reply

    Deep Climate wrote:

    I can’t comment on C#, although part of me recoils at platform-specific languages.

    I don’t know that much about Java as of yet.

    But there is Mono for the Linux world. The specifications for C# are public, and as such you can create open source versions of C# that are fully compatible with DotNet — or even port DotNet to the Linux world. Mono is the result of one such endeavor.

    Some details:

    Sponsored by Novell: Mono 2.4
    Run your applications (.NET, Winforms, ASP.NET, GTK#)
    on all the platforms (Linux, Mac, Windows)

    Get MonoDevelop: A free C# ASP.NET IDE
    http://mono-project.com/Main_Page

    It is also my understanding that Microsoft has been developing their own port to Linux, but I don’t know as much about that.

    However, just as Java is (or at least was, last I checked) slower than C# on most tests, typically by factors of 2 to 3, Mono is slower than Java — last I checked. Roughly by a factor of 2, if I remember correctly. Nevertheless, it has been winning awards.

    Personally I think Microsoft may have already seen its better days, but C# is something special. Then again I grew up on Microsoft. Still, I am picking up other C-family languages.

  • Timothy Chase // October 20, 2009 at 9:57 pm | Reply

    Another paper that may be of interest:

    Within the context of regression functions, and for squared-error loss, Yang (2005) has shown that consistent model selection procedures, such as BIC, produce estimators which cannot attain the asymptotic minimax rate. Failure to attain this optimal rate extends also to model combination, or Bayesian model averaging with subjectively specified priors. On the other hand, it is known that AIC, which is inconsistent, does attain the minimax rate, see Proposition 1 in Yang (2005). This tension between model consistency and inference-optimality is sometimes referred to as the “AIC-BIC dilemma”.

    Reconciling Model Selection and Prediction
    George Casella and Guido Consonni, March 7, 2009
    http://www.stat.ufl.edu/~casella/Papers/AICBIC-3.pdf

  • Barton Paul Levenson // October 21, 2009 at 9:59 am | Reply

    C-related languages have a couple of problems.

    1. All those semicolons, and remembering which lines get them and which lines don’t.

    2. No exponentiation operator. This may seem trivial, but in a climate situation, you’ve got equation after equation. It’s a lot easier to write

    F = epsilon * sigma * T **4

    than to write

    F = epsilon * sigma * pow(T, 4.0);

    let alone

    F := epsilon * sigma * exp(4.0 * ln(T));

    And it’s easier to read and figure out as well.

    3. All the class and object-oriented stuff is irrelevant to straightforward simulation, since errors can be avoided accurately enough with top-down programming and modular design. You don’t want clever programming tricks and elegant data structures for a simulation, you want speed, speed, and more speed. Fortran beats every other high-level language out there. Only assembly language is faster–and try programming complicated equations in assembler. You’d need about 20 lines to do what I do in one line of Fortran above.

    4. Object-oriented stuff involves a lot of concealing information from the programmer who uses your routines–thus inheritance and private variables and interfaces and so on and so on. In science you don’t WANT to conceal your code or your algorithms, and you can assume the users of your code are not idiots who need to be protected from themselves.

    5. In C or Java or Pascal/Delphi you need to import units or packages or libraries to get anything not absolutely basic done. In Fortran you don’t. All kinds of math is available as built-in functions, from complex numbers to hyperbolics to Bessels. You don’t have to remember which package has the statistical functions and which has the complex number functions. And it’s extremely rare that you have to write your own.

    6. With Fortran modules (or earlier, common blocks), you can easily declare and use global variables. The recent languages frown on global variables as one of those confusing things amateurs might make a hash of, and several of them make it really hard to declare any. But if you’re doing atmosphere physics you WANT nearly every routine to have access to b, c, c1, c2, c3, P0, pi, 2 pi, 4 pi, R, sigma and T0 without having to declare them in each module.

    I have written fully functioning RCMs in Basic, C, Fortran, and Pascal, and I prefer Fortran.

    More ranting of this sort can be found at

    http://BartonPaulLevenson.com/PerfectProgrammingLanguage.html

  • Ray Ladbury // October 21, 2009 at 11:43 am | Reply

    Timothy, Regarding your last cite( Casella and Consonni), I wonder if the fact that different information criteria are optimal for different purposes might correspond in some way to the thermodynamic case, where the natural variable is in some cases energy, in others enthalpy…

    Inconsistency and worst-case error represent different types of error that might have differential importance for different problems, and the different weights given to the model complexity penalty and goodness of fit in AIC and BIC might tend to bias the outcome in the desired direction.

  • Timothy Chase // October 23, 2009 at 4:41 pm | Reply

    Ray Ladbury wrote:

    Timothy, Regarding your last cite( Casella and Consonni), I wonder if the fact that different information criteria are optimal for different purposes might correspond in some way to the thermodynamic case, where the natural variable is in some cases energy, in others enthalpy…

    Quite possible. After downloading a copy of Jayne’s book (which unfortunately I haven’t had a chance to get into very deeply as of yet) I see that it is his view that despite the fact that he believes that the debate between frequentist and Beyesian has been largely won by the Beyesians, there are roles to be played by the frequentist approach. (Personally I might like to see some form of complementarity between the two approaches — but that is more a matter of personal aesthetics than anything else.)

    I have seen that AIC presumably grows out of a frequentist approach, but then I recently ran across a paper that argued that a frequentist derivation of BIC is certainly possible as is a Beyesian derivation of AIC. And there have been papers that argue either applying both methods in order to achieve the strongest results, and papers that argue that BIC and AIC both have strengths even though according to most criteria BIC performs better.

    Anyway, I am looking forward to learning more — but at the moment I have to shift into more of a lurker-type role due to time constraints, mostly from the demands of school.

  • Timothy Chase // October 23, 2009 at 5:10 pm | Reply

    I had just written:

    And there have been papers that argue either applying both methods in order to achieve the strongest results, and papers that argue that BIC and AIC both have strengths even though according to most criteria BIC performs better.

    An example of the former (arguing in essence that the methods work best when used in combination) would be:

    Jouni Kuha (2004) AIC and BIC: Comparisons of Assumptions and Performance, Sociological Methods & Research, Vol. 33, No. 2, November 2004 188-229
    http://smr.sagepub.com/cgi/content/abstract/33/2/188

    … which I cited above and then quoted, in part, in a later comment as saying:

    [From the Abstract:] The behavior of the criteria in selecting good models for observed data is examined with simulated data and also illustrated with the analysis of two well-known data sets on social mobility. It is argued that useful information for model selection can be obtained from using AIC and BIC together, particularly from trying as far as possible to find models favored by both criteria.

    (NOTE: open access with registration until 31 Oct 2009)

    An example of the latter (arguing that both have strengths, but according to most criteria BIC performs better) was given by me as:

    Michael E. Alfaro and John P. Huelsenbeck (2006) Comparative Performance of Bayesian and AIC-Based Measures of Phylogenetic Model Uncertainty, Systematic Biology 55(1):89-96
    http://sysbio.oxfordjournals.org/cgi/content/full/55/1/89

    … of which I stated:

    Michael E. Alfaro and John P. Huelsenbeck (2006) state at one point in their abstract that, “The AIC–based credible interval appeared to be more robust to the violation of the rate homogeneity assumption,” but for the most part they seem to lean toward the Bayesian Information Criterion proposed by Schwarz.

  • Ray Ladbury // October 24, 2009 at 1:10 pm | Reply

    Timothy, Really the only aspect of AIC that is at all “frequentist” stems from its relation to K-L information–that is that the “TRUE” distribution exists and is part of the family of distributions under consideration. AIC, BIC and DIC are all related to the likelihood, and likelihood is fundamental to both frequentist and Bayesian schools. Anderson and Burnham have argued that AIC can be derived from a “savvy Prior” as opposed to a “maximum entropy Prior,” and that we are never in a state of maximum ignorance.

    The main difference in the behavior of AIC and BIC arises from the penalty term–2k in AIC and ln(n)*k for BIC, where k is the number of parameters and n is the sample size. This makes it much less likely that the favored distribution will change once the sample size is sufficiently large. This has both advantages and disadvantages. If the convergence of the answer with increased data is slow, this may give rise to some odd behavior. In any case, the use of AIC-weighted averages diminishes the importance of “getting the right answer” on distribution form. It is my impression that AIC weights ought to be better behaved and more intuitive than BIC weights.
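    As an illustration of the weighting scheme, here is a minimal sketch (Python, with made-up AIC values rather than anything from a real analysis) of how delta-AIC values become Akaike weights:

    import numpy as np

    def akaike_weights(aic_values):
        """Akaike weights: w_i = exp(-delta_i/2) / sum_j exp(-delta_j/2),
        with delta_i = AIC_i - min(AIC)."""
        aic = np.asarray(aic_values, dtype=float)
        delta = aic - aic.min()
        w = np.exp(-0.5 * delta)
        return w / w.sum()

    # Made-up AIC values for three candidate models.
    print(akaike_weights([100.0, 102.0, 110.0]))
    # delta = [0, 2, 10]: the last model gets negligible weight, in line
    # with the delta(AIC) > 10 rule of thumb quoted earlier in the thread.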

    Unfortunately, Jaynes never seems to have embraced model selection/averaging. It would have been interesting to have his take on it. Hmm, anybody got a Ouija board…

  • David B. Benson // October 24, 2009 at 10:32 pm | Reply

    Timothy Chase // October 23, 2009 at 4:41 pm — Bayesian

  • Timothy Chase // October 25, 2009 at 7:53 pm | Reply

    David B. Benson wrote:

    Timothy Chase // October 23, 2009 at 4:41 pm — Bayesian

    Thank you. Sometimes I spell things the way they sound and pronounce things the way they are spelled. Some sort of issue which also manifested itself in a hand-to-eye coordination problem early in life — if I remember correctly — but the name currently escapes me. I hate being unable to trust my own mind — so of course I was “cursed” with a bipolar condition as well. But at least I will always reach for the red pill…

  • Phil Scadden // October 27, 2009 at 1:05 am | Reply

    Language wars!! Yes. I still maintain half a million lines of Fortran, but sorry, despite being raised in the language, I hate it.

    >1. All those semicolons, and remembering >which lines get them and which lines don’t.

    Sheesh! It’s a pretty simple rule. Much preferable to single-statement lines or continuation marks.

    >2. No exponentiation operator. This may seem >trivial, but in a climate situation, you’ve got >equation after equation. It’s a lot easier to write

    But a potential source of trouble – you will do exponentiation by which method? And if you use a “general” routine, then you pay a cost for internal method determination. I still think it’s a trivial point, however.

    >3. All the class and object-oriented stuff is >irrelevant to straightforward simulation

    Amazing the no. of times you need to use generics though. Actually I really like objects for simulation, but I agree that they get in the way when it comes to a numerical algorithm. No one forces you to use them when inappropriate.

    >Only assembly language is faster–and try >programming complicated equations in >assembler.

    c is more or less high level assembler. For every apples for apples compiler test, I would bet on C winning over fortran.

    > You’d need about 20 lines to do what I do in >one line of Fortran above.
    This could only be a reference to f90 or f95 array syntax.

    > In science you don’t WANT to conceal your >code or your algorithms, and you can assume

    I think this is a misunderstanding of data hiding. You are writing to avoid inadvertent manipulation of variables and to enforce rules. No algorithms are hidden at all – far from it.

    > And it’s extremely rare that you have to write >your own.
    Likewise in C++. Unless you want more than the standard provides or faster code because you know that the data has attributes that allow faster methods.
    Including the maths library is hardly a chore.

    >6. With Fortran modules (or earlier, common >blocks), you can easily declare and use global >variables.
    Common blocks are an extreme form of evil, costing me I don’t know how much of my life. Till modules came along, you had no name space control and could kill complex code by inadvertently using the same common block name as another part of the program. No checks at all.
    You want everyone using same globals? Stick them in a unit and include where required. What on earth is hard about that? At least as easy as commons and no chance of name space collision. Add a variable to common? Fine, now watch the fun if something used it that you forgot about and it didn’t get checked in common. Many compilers don’t even check for size equality between units because it’s perfectly legal not to. The only safe way is to put commons into include files – oh, and that’s what C/Pascal/Java do anyway.

    And I am doing thermal evolution of petroleum basins and geochemistry. These are not trivial physical models. Fortunately they were translated out of Fortran and into C++ some years ago. The same can’t be said for the second-law analysis code for power stations, which is stuck in Fortran and is the source of endless hassles to maintain.

  • Barton Paul Levenson // October 27, 2009 at 4:34 pm | Reply

    Phil Scadden:

    c is more or less high level assembler. For every apples for apples compiler test, I would bet on C winning over fortran.

    I wouldn’t.

    > You’d need about 20 lines to do what I do in >one line of Fortran above.
    This could only be a reference to f90 or f95 array syntax.

    Nope. I was comparing assembler to Fortran, not C to Fortran. Read for context.

    Including the maths library is hardly a chore.

    Excuse me, it slows me down, especially when I have to remember WHICH libraries (multiple) I need. I want to concentrate on the PROBLEM, not the LANGUAGE. Block-structured languages make the programmer think about the language instead of the problem.

    You want everyone using same globals? Stick them in a unit and include where required. What on earth is hard about that? At least as easy as commons and no chance of name space collision. Add a variable to common?

    Modules have been available in Fortran since the 1990 standard, which was, let me see, 19 years ago. In my RCMs I use ONE module. Period. All the globals in there. Haven’t had any problems. And I haven’t had to write and separately compile a unit, or worry about namespaces, or make files, or any of the rest of the language overhead you inevitably get with C-like languages.

    Plus there’s the fact that you can READ Fortran and understand what’s going on right away, which is rarely the case with C. It’s a lot easier for anyone to figure out

    do i = 1, 10
    x(i) = x(i) + j
    end do

    at a glance than it is to figure out

    for (i = 0; i < 10; i++)
    {
    x[i] += j;
    }

    at a glance. Or worse,

    for (i = 0; i < 10; i++)
    {
    *ptr[i] += j;
    }

    Oh yeah, did I mention the computer-friendly but programmer-hostile convention of starting all arrays from 0 in C, Java, Javascript, and even a lot of the modern object-oriented versions of Basic, versus the natural and obvious Fortran convention of starting them at 1? Or the ease of writing / to start a new line in a Fortran format compared to '\n' in a C++ format? Or how easy it is to write F7.3 to write out a real number with three decimal places in Fortran, compared to %7.3f in C++? I could go on all day. The bottom line is that in Fortran I can FORGET about the arcane details of the language, because there just aren't that many, and concentrate on the problem.

    I am NOT saying Fortran is better than C++ in all ways and for all purposes. But for numerical simulation, I would rather program in Fortran and be tied up in my basement listening to 48 hours of Rush Limbaugh than use C++.

  • David B. Benson // October 27, 2009 at 6:21 pm | Reply

    Barton Paul Levenson // October 27, 2009 at 4:34 pm — Looks like you would have trouble with elevators in Germany; ground floor is numeral 0. :-)

    I don’t care that much for C myself, but I’ll point out that “Numerical Recipes in C, 2nd edition” uses one-origin indexing. Imagine that, in a zero-origin programming language.

    My biggest objection to all of the programming languages mentioned so far here is their imperative nature; that is always a source of hidden problems. I prefer mostly-functional languages, currently using SML/NJ even for numerical work. There are faster (run-time) implementations of SML available, and maybe F# from Microsoft is faster than OCaml (although I still wouldn’t use it).

    All of this does touch upon the basic point of this thread, the tension between best fit to the data and a measure of complexity. Now, the number of parameters is a perfectly decent notion of complexity for polynomial models; somewhat worse for models with trig functions, exp and log. But what about a more complex model with some considerable decision making along the way? Certainly a GCM qualifies?

    So far, what is called computational complexity does help very much if the models are in class P. Most physically based models are in class P? I don’t even know that.

  • David B. Benson // October 27, 2009 at 8:17 pm | Reply

    Oops! Computational complexity does not help…

    The “not” got left out.

  • dhogaza // October 28, 2009 at 12:39 am | Reply

    c is more or less high level assembler. For every apples for apples compiler test, I would bet on C winning over fortran.

    I wouldn’t.

    As someone who spent roughly 25 years of his life writing highly optimizing compilers for a variety of languages, I would claim that it depends on the program and compiler, and I would win that claim.

    There’s nothing in Fortran that can intrinsically be compiled to more efficient code than C.

  • dhogaza // October 28, 2009 at 12:49 am | Reply

    Oh, and for the record, I hate C. And hate C++ more (some of that hatred having come from working on C++ implementations).

  • Barton Paul Levenson // October 28, 2009 at 9:34 am | Reply

    dhogaza:

    There’s nothing in Fortran that can intrinsically be compiled to more efficient code than C.

    Undoubtedly true, but in practice I’ve never found a C compiler that’s faster (I’ve used the old Mix Power C, Borland Turbo C, Borland C/C++, and MinGW C). In practice, if not in theory, C is fast, but not as fast as Fortran.

  • dhogaza // October 28, 2009 at 2:47 pm | Reply

    It will depend entirely on the program and how it’s written. It’s true that Fortran compiler writers spend a lot more time concentrating on optimizing numerical computation than is typical for a C compiler.

    Fortran and C tend to be used for different types of problems, and compiler writers (and those paying their salaries) know this.

    C compiler writers also know that skilled C programmers will be writing low-level code, and have traditionally ignored much of the high-level optimizations (such as vectorization of array operations within loops) that are necessary to generate decent code from languages that *aren’t* glorified high-level assemblers.

  • Kevin McKinney // October 28, 2009 at 3:58 pm | Reply

    You guys are making me feel better about my mathematical/computational naivete. . .

  • Kevin McKinney // October 28, 2009 at 4:00 pm | Reply

    . . . meaning I’m glad not to have to worry about these arcana myself, even though I’m glad the expertise exists and is usefully deployed.

  • Donald Oats // October 29, 2009 at 7:32 am | Reply

    For statistics with a smattering of programming, I can recommend the entirely free “R” statistics environment, and “Tinn-R” as one of many editors available for writing R code and running it. R comes in Windows and Unix flavours.

    R is in use by such a large statistics community that it isn’t going to disappear, and furthermore, there is a vast set of packages for anything from generalized linear models to bioinformatics (Affymetrix genomic data analysis), ODE solvers, and a great integrated graphics toolset. The R programming language is simple and interpreted, using LAPACK as the engine for matrix and vector calculations, so speed is rarely an issue if you can write a matrix equation using high-level matrix and vector operations.

    R is available from CRAN (google “R” and the first few references should get you to it: http://www.r-project.org etc) and various mirrors.

    PS: R can interface to C, C++ and Fortran if necessary.

  • Andrew Dodds // October 29, 2009 at 2:38 pm | Reply

    D Benson -

    I’ve seen that 1-origin stuff in Numerical Recipes in C and it is pretty hideous. In any case, we should stick to the one true programming language:

    http://www.dangermouse.net/esoteric/ook.html

  • KenM // October 29, 2009 at 5:05 pm | Reply

    OOK! is nice, but I prefer whitespace for scientific programming.

  • David B. Benson // October 29, 2009 at 10:12 pm | Reply

    Andrew Dodds // October 29, 2009 at 2:38 pm &
    KenM // October 29, 2009 at 5:05 pm —

    :D
