John O. Campbell

This post is an early draft of an article published as Universal Darwinism as a process of Bayesian inference.

Although Darwin must be counted amongst history’s foremost scientific geniuses, he had very little talent for mathematics. His theory of natural selection was developed in great detail, with many compelling examples, but without a mathematical framework. Mathematics made up only a small part of his Cambridge education, and in his autobiography he wrote:

I attempted mathematics, and even went during the summer of 1828 with a private tutor (a very dull man) to Barmouth, but I got on very slowly. The work was repugnant to me, chiefly from my not being able to see any meaning in the early steps in algebra. This impatience was very foolish, and in after years I have deeply regretted that I did not proceed far enough at least to understand something of the great leading principles of mathematics, for men thus endowed seem to have an extra sense.

In general, mathematics aids scientific theories because a theory whose basics are expressed as mathematical relationships can be expanded into a larger network of predictive implications, and the whole of the expanded theory can then be subjected to the test of evidence. As a bonus, any interpretation of the theory must also conform to this larger network of implications.

Natural selection describes the change in frequency or probability of biological traits over succeeding generations. One might suppose that a mathematical description complete with an insightful interpretation would be straightforward but even today this remains elusive. The current impasse involves conceptual difficulties arising from one of mathematics’ bitterest interpretational controversies.

That controversy is between the Bayesian and the frequentist interpretations of probability theory. Frequentists take probability, or frequency, to be a propensity inherent in nature. For instance, the fact that each face of a die lands with probability 1/6 is understood by frequentists to be a physical property of the die. Bayesians, on the other hand, understand that humans assign probabilities to hypotheses on the basis of the knowledge they have. On this view the probability of each face of a die is 1/6 because the observer has no knowledge which would favour one face over another, and the only way to favour no face is to assign every hypothesis the same probability.

Frequentists have attacked the Bayesian interpretation of probability on the grounds that the knowledge which a particular person has is ‘subjective’ and that mathematics and science deal only with objective phenomena. Bayesians counter that their view is objective, as all observers with the same knowledge must assign the same probability. As the great Bayesian theoretician E.T. Jaynes put it (1):

In the theory we are developing, any probability assignment is necessarily “subjective" in the sense that it describes only a state of knowledge, and not anything that could be measured in a physical experiment. …. Now it was just the function of our interface desiderata to make these probability assignments completely “objective" in the sense that they are independent of the personality of the user. They are a means of describing (or what is the same thing, of encoding) the information given in the statement of a problem, independently of whatever personal feelings (hopes, fears, value judgments, etc.) you or I might have about the propositions involved. It is “objectivity" in this sense that is needed for a scientifically respectable theory of inference.

The Bayesian framework is arguably more comprehensive and has been developed into the mathematics of Bayesian inference, at the heart of which is Bayes’ theorem describing how probabilistic models gain knowledge and learn from evidence. In my opinion the major drawback of the Bayesian approach is its anthropomorphic reliance on human agency.

Despite the lack of mathematics in Darwin’s initial formulation of the theory it was not long before researchers began developing a mathematical framework describing natural selection. Perhaps the first step was taken by Darwin’s cousin, Francis Galton. He developed numerous probabilistic techniques for describing the variance in natural traits as well as for natural selection in general. His conception of natural selection appears to have been curiously Bayesian although he may never have heard of Bayes' theorem. Evidence for his Bayesian bent is in the form of a visual aid which he built for a lecture given to the Royal Society.

He used this device as an aid to explain natural selection in probabilistic terms. It contains three compartments: a top one representing the frequency of traits in the parent population, a middle one representing the application of 'relative fitness' to this prior and a third representing the normalization of the resulting distribution in the child generation. Beads are loaded in the top compartment to represent the distribution in the parent generation and then are allowed to fall into the second compartment. The trick is in the second compartment where there is a vertical division in the shape of the relative fitness distribution. Some of the beads fall behind this division and are ‘wasted’; they do not survive and are removed from sight. The remaining beads represent the distribution of the 'survivors' in the child generation.

Galton’s device has recently been rediscovered and employed by Stephen Stigler and others in the statistics community as a visual aid, not of natural selection, but of Bayes' theorem. The top compartment represents the prior distribution, the middle one represents the application of the likelihood to the prior, and the third represents the normalization of the resulting distribution. The change between the initial distribution and the final one is due to the Bayesian update.
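The three compartments can be mimicked numerically. What follows is only a minimal sketch, not a reconstruction of Galton's actual apparatus: the bead counts and survival chances are invented for illustration, and the function names `galton_device` and `exact_posterior` are hypothetical. It drops beads through a simulated division and compares the surviving proportions with the result of multiplying prior by likelihood and normalizing.

```python
import random

def galton_device(parent_beads, survival, rng):
    """Drop beads through the device and return the offspring distribution.

    parent_beads[i] is the number of beads loaded for trait i (top compartment);
    survival[i] is the chance that a bead of trait i passes the division (middle).
    """
    survivors = [
        sum(1 for _ in range(n) if rng.random() < s)
        for n, s in zip(parent_beads, survival)
    ]
    total = sum(survivors)
    # Bottom compartment: renormalize the surviving beads into a distribution.
    return [k / total for k in survivors]

def exact_posterior(prior, likelihood):
    """Discrete Bayes update: prior times likelihood, then normalize."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Illustrative numbers only (assumed, not taken from Galton's lecture).
parent_beads = [10_000, 20_000, 40_000, 20_000, 10_000]
survival = [0.1, 0.3, 0.6, 0.9, 0.5]

prior = [n / sum(parent_beads) for n in parent_beads]
simulated = galton_device(parent_beads, survival, random.Random(0))
exact = exact_posterior(prior, survival)

for s, e in zip(simulated, exact):
    print(f"simulated {s:.3f}   exact {e:.3f}")
```

With this many beads the simulated proportions should sit close to the exact Bayesian posterior, which is the sense in which the same device can illustrate either selection or inference.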

R.A. Fisher further developed the mathematics describing natural selection during the 1920s and 1930s. He applied statistical methods to the analysis of natural selection via Mendelian genetics and arrived at the fundamental theorem of natural selection which states (2):

the rate of increase in fitness of any organism at any time is equal to its genetic variance in fitness at that time.

Fisher was a fierce critic of the Bayesian interpretation which he considered subjective. Instead he pioneered and made many advances with the frequentist interpretation.

The next major development in the mathematics of natural selection came in 1970 with the publication of the Price equation which built on the fundamental theorem of natural selection. As Wikipedia describes it (3):

Price developed a new interpretation of Fisher's fundamental theorem of natural selection, the Price equation, which has now been accepted as the best interpretation of a formerly enigmatic result.

Although the Price equation fully describes evolutionary change, its meaning has only recently begun to be unravelled, chiefly by Stephen A. Frank in a series of papers spanning the last couple of decades. Frank’s insights into the meaning of the Price equation culminated in his 2012 paper (4) which derives a description of natural selection using the mathematics of information theory.
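For readers who have not met it, the Price equation in its usual textbook form says that the change in the average value of a trait z, multiplied by average fitness, equals the covariance between fitness and the trait plus the fitness-weighted average change in the trait during transmission. The sketch below checks that identity numerically. The frequencies, fitnesses and trait values are invented for illustration, and the helper `mean` is a hypothetical name, not something taken from Price or Frank.

```python
def mean(freq, values):
    """Frequency-weighted average over the types in the population."""
    return sum(f * v for f, v in zip(freq, values))

# Illustrative values only (assumptions for this sketch).
q = [0.5, 0.3, 0.2]      # frequency of each type in the parent population
w = [1.0, 2.0, 4.0]      # fitness of each type
z = [1.0, 2.0, 3.0]      # trait value of each parent type
dz = [0.1, -0.2, 0.0]    # change in the trait between parent and offspring

w_bar = mean(q, w)
z_bar = mean(q, z)

# Offspring generation: frequencies reweighted by relative fitness.
q_next = [qi * wi / w_bar for qi, wi in zip(q, w)]
z_next = [zi + dzi for zi, dzi in zip(z, dz)]
delta_z_bar = mean(q_next, z_next) - z_bar

# The two terms on the right-hand side of the Price equation.
selection = mean(q, [(wi - w_bar) * (zi - z_bar) for wi, zi in zip(w, z)])
transmission = mean(q, [wi * dzi for wi, dzi in zip(w, dz)])

print(w_bar * delta_z_bar, selection + transmission)
```

The two printed numbers agree up to floating-point rounding, whatever values are chosen, because the equation is an identity rather than an empirical claim.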

In my opinion this paper represents a huge advance in the understanding of evolutionary change as it shifts interpretation from the objective statistical description of frequentist probability to an interpretation in terms of Bayesian inference. Unfortunately Frank does not share my appreciation of his accomplishment. Instead he seems to take it for granted, in the frequentist tradition, that a Bayesian interpretation is not useful. While he understands that his mathematics are very close to those of Bayesian inference he is unable to endorse a Bayesian interpretation of his results.

However, the mathematics of information theory and Bayesian probability are joined at the hip, as their basic definitions are in terms of one another. Information theory begins with a definition of information in terms of probability:

$$I(h_i) = -\log_2 P(h_i)$$

Here we may view h_i as the i^th hypothesis in a mutually exclusive and exhaustive family of competing hypotheses composing a model. I is the information gained by the model on learning that hypothesis h_i is true. P is the probability which the model had previously assigned to hypothesis h_i being true. Thus information is ‘surprise’: the less likely a model initially considered a hypothesis that turns out to be the case, the more surprise it experiences and the more information it receives.

Thus information theory, starting with the very definition of information, is aligned with the Bayesian interpretation of probability; information is ‘surprise’ or the gap between an existing state of knowledge and a new state of knowledge gained through receiving new information or evidence.

The 'expected' information contained by the model composed of the distribution of the p_i, where p_i is the probability assigned to hypothesis h_i, is the entropy:

$$S = -\sum_i p_i \log_2 p_i$$
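Both quantities are easy to compute. The sketch below assumes the standard definitions, with information measured in bits as minus the base-2 logarithm of a probability and entropy as its probability-weighted average; the two dice distributions are invented examples and the function names are mine.

```python
import math

def surprisal(p):
    """Information, in bits, gained on learning that a hypothesis of probability p is true."""
    return -math.log2(p)

def entropy(probs):
    """Expected surprisal in bits; hypotheses with zero probability contribute nothing."""
    return sum(p * surprisal(p) for p in probs if p > 0)

# Illustrative models (assumed for this sketch).
fair_die = [1 / 6] * 6
loaded_die = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

print("surprise at a face of the fair die:", surprisal(1 / 6))
print("entropy of the fair die:", entropy(fair_die))
print("entropy of the loaded die:", entropy(loaded_die))
```

The fair die, about which nothing is known, has the larger entropy; the loaded die encodes some knowledge about which face to expect and so is, on average, less surprised by the outcome.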

Bayes' theorem follows directly from the axioms of probability theory and may be understood as the implication which new evidence or information holds for the model described by the distribution of the p_i. The theorem states that when the model (H) receives new information (I), the probability of each component hypothesis (h_i) making up the model must be updated according to:

$$P(h_i \mid I, X) = P(h_i \mid X)\,\frac{P(I \mid h_i, X)}{P(I \mid X)}$$

where X is the information we had prior to receiving the new evidence or information I. Bayesian inference is commonly understood as any process which employs Bayes’ theorem to accumulate evidence-based knowledge. As Wikipedia puts it (5):

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as evidence is acquired.
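As a small worked illustration, consider a model made of three competing hypotheses about a coin. This is a sketch under stated assumptions: the coin, its three candidate biases and the function name `bayes_update` are invented for the example. Each update multiplies the prior probability of a hypothesis by the probability that hypothesis gave to the evidence, then divides by the total so that the probabilities again sum to one.

```python
def bayes_update(prior, likelihood):
    """Return the posterior over hypotheses.

    prior[i] is the probability of hypothesis i before the evidence;
    likelihood[i] is the probability hypothesis i assigned to the evidence.
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)  # total probability of the evidence under the model
    if evidence == 0:
        raise ValueError("the evidence is impossible under every hypothesis")
    return [j / evidence for j in joint]

# Hypothetical model: the coin lands heads with chance 0.25, 0.5 or 0.75.
heads_chance = [0.25, 0.5, 0.75]
prior = [1 / 3, 1 / 3, 1 / 3]

after_heads = bayes_update(prior, heads_chance)
after_heads_then_tails = bayes_update(after_heads, [1 - h for h in heads_chance])

print(after_heads)
print(after_heads_then_tails)
```

The posterior from one observation serves as the prior for the next, which is how a model accumulates evidence-based knowledge over a sequence of observations.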

Thus we see that, contrary to Frank’s view, Bayesian inference and information theory have the same logical structure. However, it is instructive to follow Frank’s development of the mathematics of evolutionary change in terms of information theory while keeping in mind his explicit denial of its relationship to Bayesian inference. Frank begins his unpacking of the Price equation by describing the ‘simple model’ he will develop:

A simple model starts with n different types of individuals. The frequency of each type is q_i. Each type has w_i offspring, where w expresses fitness. In the simplest case, each type is a clone producing w_i copies of itself in each round of reproduction. The frequency of each type after selection is

$$q_i' = q_i \frac{w_i}{\bar{w}} \qquad (1)$$

where $\bar{w} = \sum_i q_i w_i$ is the average fitness of the trait in the population. The summation is over all of the n different types indexed by the i subscripts.

Equation (1) is clearly an instance of a Bayesian update where the new evidence or information is given in terms of relative fitness and thus the development of his simple model is in terms of Bayesian inference; I consider this to be neither an analogy nor a speculation.
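The correspondence can be checked directly. The sketch below assumes the update rule of Frank's simple model, in which each new frequency is the old frequency multiplied by that type's fitness and divided by the population's average fitness; the frequencies, fitness values and function names are invented for illustration. It compares one round of selection with a Bayes update whose likelihood is proportional to fitness.

```python
def select(q, w):
    """One round of selection: each frequency is reweighted by fitness over average fitness."""
    w_bar = sum(qi * wi for qi, wi in zip(q, w))
    return [qi * wi / w_bar for qi, wi in zip(q, w)]

def bayes_update(prior, likelihood):
    """Discrete Bayes update: prior times likelihood, then normalize."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Illustrative population (assumed values).
q = [0.5, 0.3, 0.2]   # parent frequencies, playing the role of the prior
w = [1.0, 2.0, 4.0]   # fitness of each type

after_selection = select(q, w)

# Any likelihood proportional to fitness gives the same posterior,
# because the constant of proportionality cancels in the normalization.
likelihood = [wi / max(w) for wi in w]
posterior = bayes_update(q, likelihood)

print(after_selection)
print(posterior)

# Repeated rounds concentrate the population on the fittest type.
population = q
for _ in range(50):
    population = select(population, w)
print(population)
```

The two printed distributions coincide, and iterating the update behaves as repeated Bayesian conditioning does: probability mass accumulates on the hypothesis best supported by the evidence.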

While Frank acknowledges that there is an isomorphism between Bayes’ theorem and his simple model he cannot bring himself to admit that this means his simple model involves a Bayesian update and therefore describes a process of Bayesian inference. Instead he refers to the relationship between Bayes’ theorem and his model as an analogy:

Part of the problem is that the analogy, as currently developed, provides little more than a match of labels between the theory of selection and Bayesian theory. As Harper (2010) shows, if one begins with the replicator equation (eqn 1), then one can label the set {q_i} as the initial (prior) population, $w_i/\bar{w}$ as the new information through differential fitness and {q_i'} as the updated (posterior) population.

Frank refers to Bayesian inference at many further points in his paper and even devotes a box to a discussion of it. In the process he gives a very coherent account of natural selection in terms of Bayesian inference:

The Bayesian process makes an obvious analogy with selection. The initial population encodes predictions about the fit of characters to the environment. Selection through differential fitness provides new information. The updated population combines the prior information in the initial population with the new information from selection to improve the fit of the new population to the environment.

Perhaps he considers it only an analogy because he doesn't realize that his 'simple model' is in fact an instance of Bayes' theorem. He makes the somewhat dismissive remark:

I am sure this Bayesian analogy has been noted many times. But it has never developed into a coherent framework that has contributed significantly to understanding selection.

On the contrary, I would suggest that Frank’s paper itself develops a coherent framework for natural selection in terms of Bayesian inference.

He concludes his paper with the derivation of:

$$J = \beta_{mw} \frac{V_w}{\bar{w}}$$

On the right-hand side of this equation are statistical variables describing selection. On the left is Jeffreys’ divergence. (Jeffreys, in the course of his geophysical research, had led the Bayesian revival during the 1930s.) Frank acknowledges that Jeffreys’ divergence was discovered and used by Jeffreys in his development of Bayesian inference:

Jeffreys (1946) divergence first appeared in an attempt to derive prior distributions for use in Bayesian analysis rather than as the sort of divergence used in this article.

Frank believes that his article is an analysis in terms of frequentist probability and information theory and he can therefore assert that Jeffreys’ divergence is not to be understood in terms of Bayesian analysis. He contends that his analysis relates the statistics of evolutionary change to information theory.

Equation 30 shows the equivalence between the expression of information gain and the expression of it in terms of statistical quantities. There is nothing in the mathematics to favour either an information interpretation or a statistical interpretation.
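The equivalence can be verified numerically. The sketch below rests on my reading of the notation, which is an assumption: m is the logarithm of fitness, the coefficient in the equation is the regression of m on fitness w, V_w is the variance in fitness, and the divisor is average fitness. Jeffreys’ divergence is computed as the sum of the two relative entropies between the populations before and after selection. The frequencies and fitness values are invented for illustration.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence of p from q, in natural-log units."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative population (assumed values).
q = [0.5, 0.3, 0.2]
w = [1.0, 2.0, 4.0]

w_bar = sum(qi * wi for qi, wi in zip(q, w))
q_next = [qi * wi / w_bar for qi, wi in zip(q, w)]

# Information side: Jeffreys' divergence between the two populations.
jeffreys = relative_entropy(q_next, q) + relative_entropy(q, q_next)

# Statistical side: regression of log fitness on fitness, times variance, over the mean.
m = [math.log(wi) for wi in w]
m_bar = sum(qi * mi for qi, mi in zip(q, m))
var_w = sum(qi * (wi - w_bar) ** 2 for qi, wi in zip(q, w))
cov_wm = sum(qi * (wi - w_bar) * (mi - m_bar) for qi, wi, mi in zip(q, w, m))
beta_mw = cov_wm / var_w
statistical = beta_mw * var_w / w_bar

print(jeffreys, statistical)
```

The two numbers agree to rounding error, so the same quantity can be read either as the information gained through selection or as a summary of the statistics of fitness.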

Frank puts himself in the awkward position of admitting that his description of evolutionary change utilizes the mathematics of Bayesian inference while at the same time denying that evolutionary change can be interpreted as a process of Bayesian inference. Why is he compelled to do this?

Part of the reason may be due to the near tribal loyalty demanded within both the Bayesian and Frequentist camps. Frank, in the tradition of the mathematics of evolutionary change, is a frequentist and jumping ship is not easy. A more substantial reason may be due to a peculiarity, and I would suggest a flaw, in the Bayesian interpretation. The consensus Bayesian position is that probability theory describes only inferences made by humans. As E.T. Jaynes put it:

it is...the job of probability theory to describe human inferences at the level of epistemology.

Epistemology is the branch of philosophy which studies the nature and scope of knowledge. Since Plato the accepted definition of knowledge within epistemology has been ‘justified true belief’. In the Bayesian interpretation ‘justified’ means justified by the evidence. ‘True belief’ is the degree of belief in a given hypothesis which is justified by the evidence; it is the probability that the hypothesis is true within the terms of the model. Thus knowledge is the probability, based on the evidence, that a given belief or model is true. I have proposed a technical definition of knowledge as 2^-S where S is the entropy of the model (6).

A perhaps interesting interpretation of this definition is that knowledge occurs within the confines of entropy or ignorance. For example, in a model composed of a family of 64 competing hypotheses where no evidence is available to decide amongst them, we would assign a probability of 1/64 to each hypothesis. The model has an entropy of 6 bits and a knowledge of 2^-6 = 1/64. Let’s say some evidence becomes available and the model’s entropy or ignorance is reduced to 3 bits. Then the knowledge of the updated model is 2^-3 = 1/8, the same as that of a maximally ignorant model, one with no available evidence, composed of only 8 competing hypotheses. The effect which evidence has on the model is to increase its knowledge by reducing the scope of its ignorance.
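The arithmetic of this example can be reproduced in a few lines. The sketch assumes entropy is measured in bits and takes knowledge to be 2 raised to minus the entropy, as defined above. The particular way the evidence acts here, ruling out 56 of the 64 hypotheses and leaving 8 equally likely, is only one illustrative case of an update that leaves 3 bits of entropy.

```python
import math

def entropy_bits(probs):
    """Entropy in bits; hypotheses with zero probability contribute nothing."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

def knowledge(probs):
    """Knowledge as 2 to the power of minus the entropy."""
    return 2 ** -entropy_bits(probs)

# No evidence: 64 equally probable hypotheses.
before = [1 / 64] * 64

# Illustrative evidence (an assumption): 56 hypotheses are ruled out.
after = [1 / 8] * 8 + [0.0] * 56

print(entropy_bits(before), knowledge(before))
print(entropy_bits(after), knowledge(after))
```

The updated 64-hypothesis model ends up with the same entropy and knowledge as a fresh, maximally ignorant model of only 8 hypotheses.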

Due to their anthropomorphic focus both the fields of epistemology and Bayesian inference deny themselves the option of commenting on the many sources of knowledge encountered in non-human realms of nature. They must deny some obviously true facts about the world such as that a bird ‘knows’ how to fly. Instead they equivocate that a bird has genetic and neural information which allows it to fly but must deny it the knowledge of flight. The notion that we are different from the rest of nature in that we have knowledge is, in my opinion, but an anti-Copernican conceit, an anthropomorphic attempt to claim a privileged position.

This is unfortunate because it forbids the application of Bayesian inference to phenomena other than models conceived by humans; it denies that knowledge may be accumulated in natural processes unconnected to human agency. Thus even though natural selection is clearly described in terms of the mathematics of Bayesian inference, neither Bayesians such as Jaynes nor frequentists such as Frank can acknowledge this fact due to another hard fact: natural selection is unconnected to human agency. In both their views this rules out its having a Bayesian interpretation.

Natural selection involves a 'natural' model rather than one conceived by humans. Biological knowledge is stored in biological structures, principally the genome, and the probabilities involved are frequencies, namely counts of the relative proportions of traits, such as alleles, in a population. The process of natural selection which updates these frequencies has nothing to do with human agency; in fact this process of knowledge accumulation operated quite efficiently for billions of years before the arrival of humans. How, then, should we interpret the mathematics of evolutionary change which clearly take the form of Bayesian inference but fall outside of the arena of Bayesian interpretation?

I believe that the way out of this conundrum is to simply acknowledge that in many cases inference is performed by non-human agents as in the case of natural selection. The genome may for instance be understood as an example of a non-human conceived model involving families of competing hypotheses in the form of competing alleles within the population. Such models are capable of accumulating evidence-based knowledge in a Bayesian manner. The evidence involved is simply the proportion of traits in ancestral generations which make it into succeeding generations. In other words, we just need to broaden Jaynes' definition of probability to include non-human agency in order to view natural selection in terms of Bayesian inference.

Bayesian probability, epistemology and science in general tend to draw a false distinction between the human and non-human realms of nature. In this view the human realm is replete with knowledge and thus infused with meaning, purpose and goals and the mathematical framework describing these knowledge-accumulating attributes is Bayesian inference. On the other hand the non-human realm is viewed as devoid of these attributes and thus Bayesian inference is considered inapplicable.

However if we recognize expanded instances, such as natural selection, in which nature accumulates knowledge then we may also recognize that Bayesian inference provides a suitable mathematical description. Evolutionary processes, as described by the mathematics of Bayesian inference, are those which accumulate knowledge. Not just any arbitrary type of knowledge, but that required for increased fitness, for increased chances of continued existence. Thus the mathematics imply purpose, meaning and goals, and thus provide legitimacy for Daniel Dennett’s interpretation of natural selection in those terms (7):

If I could give a prize to the single best idea anybody ever had, I’d give it to Darwin—ahead of Newton, ahead of Einstein, ahead of everybody else. Why? Because Darwin’s idea put together the two biggest worlds, the world of mechanism and material, and physical causes on the one hand (the lifeless world of matter) and the world of meaning, and purpose, and goals. And those had seemed really just—an unbridgeable gap between them and he showed “no,” he showed how meaning and purposes could arise out of physical law, out of the workings of ultimately inanimate nature. And that’s just a stunning unification and opens up a tremendous vista for all inquiries, not just for biology, but for the relationship between the second law of thermodynamics and the existence of poetry.

If we allow an expanded scope to Bayesian inference we may view Dennett’s poetic interpretation of Darwinian processes as being supported by their most powerful mathematical formulation.

An important aspect of these mathematics is that they apply not only to natural selection but also to any generalized evolutionary processes where inherited traits change in frequencies between generations. As noted by the cosmologists Conlon and Gardner (8):

Specifically, Price’s equation of evolutionary genetics has generalized the concept of selection acting upon any substrate and, in principle, can be used to formalize the selection of universes as readily as the selection of biological organisms.

Given that the Price equation is a general mathematical framework for evolutionary change, its Bayesian interpretation allows us to consider all evolutionary change as due to the accumulation of evidence-based knowledge. So quantum theory (9), biology (10), neural-based behaviour (11) and culture (12) may all be understood in terms of such evolutionary processes. Thus a wide scope of subject matters may be unified within a single philosophical and mathematical framework.

Bibliography

1. Jaynes, Edwin T. Probability Theory: the logic of science. Cambridge : Cambridge University Press, 2003.

2. Fisher, R.A. The Genetical Theory of Natural Selection. s.l. : Clarendon Press, Oxford, 1930.

3. Wikipedia. George R. Price. Wikipedia. [Online] [Cited: September 30, 2015.] https://en.wikipedia.org/wiki/George_R._Price.

4. Natural selection. V. How to read the fundamental equations of evolutionary change in terms of information theory. Frank, S.A. s.l. : Journal of Evolutionary Biology , 2012, Vols. 25:2377-2396.

5. Wikipedia. Bayesian Inference. Wikipedia. [Online] [Cited: September 26, 2015.] https://en.wikipedia.org/wiki/Bayesian_inference.

6. Campbell, John O. Darwin Does Physics. s.l. : createspace, 2014.

7. Dennett, Daniel. Darwin's Dangerous Idea. 1995. ISBN-10: 068482471X.

8. Cosmological natural selection and the purpose of the universe. Conlon, Joseph and Gardner, Andy. 2013.

9. Quantum Darwinism. Zurek, Wojciech H. 2009, Nature Physics, vol. 5, pp. 181-188.

10. Darwin, Charles. The Origin of Species. sixth edition. New York : The New American Library - 1958, 1872. pp. 391 -392.

11. Selectionist and evolutionary approaches to brain function: a critical appraisal. Fernando, Chrisantha, Szathmary, Eros and Husbands, Phil. s.l. : http://www.sussex.ac.uk/Users/philh/pubs/fncom-Fernandoetal2012.pdf, 2012, Computational Neuroscience.

12. Towards a unified science of cultural evolution. Mesoudi, Alex , Whiten, Andrew and Laland, Kevin N. . 2006, BEHAVIORAL AND BRAIN SCIENCES.

13. A framework for the unification of the behavioral sciences. Gintis, Herbert. s.l. : http://www2.econ.iastate.edu/tesfatsi/FrameworkForUnificationOfBehavioralSciences.HGintis2007.pdf, 2007, BEHAVIORAL AND BRAIN SCIENCES.


Monday, 12 October 2015
