## Statistical mechanics and Bayesian inference

#### This post is an excerpt from my book Darwin Does Physics.

Thermodynamics models physical systems in terms of macro-variables such as heat, pressure, volume and temperature. Macro-variables have the great advantage of being measurable quantities, and through careful experimentation during the 18th and 19th centuries many complex relationships were discovered between them. However, it was unclear how the thermodynamic model related to other types of physical models or how macro-variables could be understood in more fundamental terms.

Entropy gained importance as a concept within thermodynamics because it is a measurable quantity and in any physical or chemical process the total entropy of an isolated system never decreases. This became a fundamental law which scientists could use to predict the outcome of reactions: the reaction’s resultant state could be predicted to have higher entropy than its initial state. The change in entropy has a simple mathematical description in terms of other macro-variables:

$$\Delta S = \int \frac{dQ}{T}$$

Where $\Delta S$ is the change in entropy, $dQ$ is an increment of heat energy added to the system during a reversible process and $T$ is the temperature of the system.

John Dalton proposed an experimentally supported atomic theory in the first decade of the 19th century but this view took over a hundred years to become fully accepted. Some converts to atomic theory suspected that thermodynamics would be put on a more fundamental foundation if its results were explained in terms of atoms and molecules.

Ludwig Boltzmann helped to found the branch of physics known as statistical mechanics which undertook this investigation. The challenge was to explain how the macro-variables of thermodynamics such as pressure, volume, heat and temperature may be understood as statistical properties of the micro-states or the molecular states which make up the thermodynamic system. His work was further developed by Josiah Gibbs.

Statistical mechanics grew to provide an extremely powerful scientific model of thermodynamic systems. Central to its findings is that thermodynamic entropy has a simple statistical interpretation:

$$S_B = \log W$$

Where $S_B$ is Boltzmann entropy (in units where Boltzmann’s constant is 1) and $W$ is the number of possible distinct microstates which a thermodynamic system in equilibrium can occupy. This simple formula seemed so significant to Boltzmann that he had it inscribed on his gravestone.

Gibbs was able to generalize Boltzmann’s entropy to non-equilibrium systems:

$$S_G = -\sum_{i=1}^{n} p_i \log p_i$$

Where each of the $n$ possible microstates is assigned a probability $p_i$.

In 1948 Claude Shannon founded information theory when he demonstrated that any probability distribution used as a model carried information and that the expected value of the information in a distribution having n members is:

$$E(I) = -\sum_{i=1}^{n} p_i \log p_i$$

Where $p_i$ is the probability which the model assigns to the occurrence of event $i$ and $-\log p_i$ is the information obtained if it does occur.

Shannon was unsure what to call this important feature and consulted John von Neumann, one of the century’s leading mathematical physicists. Von Neumann suggested that as a joke he might call it entropy, as it had exactly the same structure as Gibbs entropy. So it was as a joke that information entropy was first equated to the entropy of statistical mechanics. Nevertheless it proved to be extremely useful.

Shannon worked for Bell Labs, where he was attempting to discover the basic rules governing how much information could be sent over a given communication channel. The probabilities in Shannon’s theory concern a model of the language used in messages. They are the relative frequencies with which letters are used in a language. If no evidence concerning the relative frequency of letter use in English were known then the model should use the uniform distribution and assign a 1/26 probability to each letter. To distinguish one letter out of 26 equal possibilities requires about 4.7 bits of information, which is the Shannon entropy of the model. A message 100 characters long requires about 470 bits to encode using this model.

Evidence concerning actual English letter usage is available, and sophisticated models may be constructed using this evidence which assign probabilities quite different from the uniform distribution. For instance, in written English the probability that a letter will be an ‘e’ is about 13/100 while the probability it will be a ‘z’ is about 3/4,000. More sophisticated models might include statistics such as the frequency with which words follow one another. Shannon developed one such model with entropy of only 2.6 bits per letter (7). Using this model nearly twice the amount of information can be sent over a communication channel as with a model based on the uniform distribution.

The Shannon entropy in the uniform distribution where each probability is 1/n is quite simple:

$$S_S = -\sum_{i=1}^{n} \frac{1}{n} \log \frac{1}{n} = \log n$$
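The letter example and the uniform-distribution formula can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms so that entropy comes out in bits; the function name `shannon_entropy` and the variable names are illustrative choices, not anything from Shannon's work.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Expected information -sum(p * log p) of a discrete distribution."""
    # Terms with p == 0 contribute nothing and are skipped to avoid log(0).
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# Uniform model of English: 26 letters, each with probability 1/26.
n = 26
uniform = [1.0 / n] * n

h_uniform = shannon_entropy(uniform)      # should equal log2(26)
bits_for_100_chars = 100 * h_uniform      # cost of a 100-letter message

print(f"entropy per letter: {h_uniform:.2f} bits")
print(f"100 characters:     {bits_for_100_chars:.0f} bits")
```

The per-letter entropy matches `math.log2(26)`, which is the $\log n$ of the formula above with $n = 26$.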

In view of Shannon’s discovery, Boltzmann’s log W may be viewed as the expected information of the uniform distribution having W members, each with a probability equal to 1/W. Could this be a coincidence or a joke? It seems that central to statistical mechanics is a model in the form of a uniform probability distribution, but what is that model?

Statistical mechanics, especially as developed by Gibbs, enjoyed spectacular success, and even survived the transition from a classical atomic model to a quantum model largely intact. The current view of statistical mechanics, based on a quantum model, is that all of the micro-states of a system at equilibrium which are consistent with a macro conserved quantity such as energy are equally likely. In the case of energy, each combination of allowed individual molecular energies which add up to the total energy of the system is as likely as any other combination. As an introductory thermodynamics text explains:

The fundamental assumption of statistical mechanics is that all quantum states have an equal likelihood of being occupied. The probability that a particle may find itself in a given quantum state is the same for all states.

The meaning of the simple formula which so enthralled Boltzmann is made obvious: entropy is the expected information of the uniform distribution over allowed micro-states.

The uniform distribution over n members has the largest expected information, or entropy, of any distribution with n members. Given that the equal probability of micro-states typifies the equilibrium state of a thermodynamic system, the second law is implied: the entropy of any system which starts in non-equilibrium, with its micro-states in a distribution other than the uniform distribution, will tend towards the uniform distribution as it nears equilibrium and thus its entropy will tend to increase.
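A small numerical illustration may help here. The sketch below is a toy, not a physical simulation: it assumes a hypothetical six-state system, an arbitrary non-uniform starting distribution, and a made-up relaxation rule (straight-line mixing towards the uniform distribution). It only shows that entropy rises along such a path and peaks at the uniform distribution.

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; zero-probability terms are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mix(p, q, t):
    """Point a fraction t of the way along the straight line from p to q."""
    return [(1 - t) * a + t * b for a, b in zip(p, q)]

n = 6
uniform = [1.0 / n] * n
# Hypothetical non-equilibrium starting distribution (sums to 1).
start = [0.5, 0.2, 0.1, 0.1, 0.05, 0.05]

# Toy relaxation: move from `start` to `uniform` in ten equal steps.
steps = [i / 10 for i in range(11)]
entropies = [entropy_bits(mix(start, uniform, t)) for t in steps]

for t, h in zip(steps, entropies):
    print(f"t = {t:.1f}  entropy = {h:.3f} bits")
```

Because entropy is a concave function of the probabilities and the uniform distribution is its maximum, the values never decrease along this path and end at $\log_2 6$.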

Gibbs understood that the laws of physics constrain the possible micro-states a thermodynamic system can occupy. For instance both the number of molecules within the system and its total energy do not change: both particle number and energy are conserved. The possible microstates of the system are constrained to those having the initial number of particles and the initial energy.

The problem for Gibbs was to find the probability distribution of the $p_i$ for which the entropy $S_G = -\sum_i p_i \log p_i$ has a maximum value subject to the constraints.

He was able to demonstrate a method which produced a distribution that took the constraints into account while remaining as close as possible to the uniform distribution. As this distribution described the system in the highest accessible state of entropy it could be used to make all of the predictions of thermodynamics.
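The text does not spell out the method, so the following is a sketch of the standard textbook route using Lagrange multipliers, not necessarily Gibbs' own presentation. It assumes a single energy constraint; the symbols $E_i$ (energy of micro-state $i$), $U$ (fixed mean energy), $\alpha$, $\beta$ (multipliers) and $Z$ (normalising sum) are introduced here for illustration.

$$
\begin{aligned}
\mathcal{L} &= -\sum_i p_i \log p_i \;-\; \alpha\Big(\sum_i p_i - 1\Big) \;-\; \beta\Big(\sum_i p_i E_i - U\Big) \\
\frac{\partial \mathcal{L}}{\partial p_i} &= -\log p_i - 1 - \alpha - \beta E_i = 0 \\
p_i &= e^{-1-\alpha}\, e^{-\beta E_i} = \frac{e^{-\beta E_i}}{Z}, \qquad Z = \sum_j e^{-\beta E_j}
\end{aligned}
$$

If the energy constraint is dropped ($\beta = 0$), every $p_i$ equals $1/n$: the maximum-entropy distribution with no constraint beyond normalisation is the uniform one.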

Edwin Jaynes saw that Gibbs’ probability distribution could be viewed as the distribution which has maximum entropy subject to the constraints. Based on this understanding he proposed the principle of maximum entropy: that systems move to the state of maximum entropy consistent with their constraints. In 1957 he published papers which derived all of statistical mechanics from this principle.

Even at this early stage of his research career Jaynes understood that the principle of maximum entropy had more general applications beyond statistical mechanics. He understood it as a basic form of reasoning, as the way in which probabilities were best assigned to the hypotheses of a model. The probability distribution produced by this method retains the implications of the data in the form of constraints but is otherwise as uncertain as possible.

Jaynes also solved the mystery of the analogy between Shannon’s information entropy, which has the same form as Gibbs’ entropy, and thermodynamic entropy by showing that they are equivalent. An analogy that was initially offered in jest has proved to be a deep truth. Specifically, Jaynes demonstrated that statistical mechanics provides a model having incomplete knowledge of a thermodynamic system and the limits to conclusions which may be inferred using this model.

Boltzmann, Gibbs and Jaynes had searched for the probability distribution over micro-states which could serve as a model for thermodynamics. When $n$ micro-states are available to the system a probability $p_i$ can be assigned to each one. The $p_i$ may be viewed as the probability that the hypothesis ‘the system is in micro-state $i$’ is true. One and only one of these hypotheses must be true within the context of the model, but the model does not yet know which one. The best that can be done is to assign probabilities to each possible hypothesis that consistently balance the existing evidence against the remaining uncertainty. In this manner the probability distribution forms a model of the thermodynamic system.

Jaynes completed this search by showing that the correct distribution forming a model should be as close to the uniform distribution as allowed by the evidence which the model already has; otherwise some knowledge that is unsupported by evidence would be claimed by the model. He further demonstrated that this principle is applicable to models in general.

It is appropriate to assign the uniform distribution to models or hypothesis spaces when there is no evidence that any one of the hypotheses is more likely to be true than any of the others. This type of model has maximum ignorance or entropy and is the only one that honestly reflects a state of ignorance. A different distribution, one necessarily with less entropy, should only be assigned if the model has access to evidence which implies that some hypotheses are more likely than others. If a model is initially assigned the uniform distribution and then additional evidence becomes available, the model should be updated by adjusting the various probabilities using the methods of Bayesian inference.

For instance, if after many rolls of the dice one of the faces appeared much more often than 1/6 of the time, this might indicate we had the wrong model and should consider others. We might want to adjust our model to one including the hypothesis that the dice is not fair. If a single face continued to appear more frequently than 1/6 of the time then the probability of the 'unfair' hypothesis would grow as calculated by the Bayesian update.
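The update described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the hypothesis space has only two members, 'fair' and one particular 'unfair' alternative in which a six appears half the time; the prior is split evenly between them; and the sequence of rolls is invented for the example. All of these choices are hypothetical.

```python
from fractions import Fraction

# Hypothesis 1: a fair dice, every face has probability 1/6.
FAIR = {face: Fraction(1, 6) for face in range(1, 7)}
# Hypothesis 2 (assumed for illustration): six comes up half the time,
# and the other five faces share the remaining probability equally.
LOADED = {face: Fraction(1, 10) for face in range(1, 6)}
LOADED[6] = Fraction(1, 2)

def update(prior_fair, roll):
    """One step of Bayes' rule; returns the posterior P(fair | roll)."""
    joint_fair = FAIR[roll] * prior_fair
    joint_loaded = LOADED[roll] * (1 - prior_fair)
    return joint_fair / (joint_fair + joint_loaded)

# Invented data: six appears far more often than 1/6 of the time.
rolls = [6, 6, 3, 6, 6, 1, 6, 6]

belief_fair = Fraction(1, 2)          # uniform prior over the two hypotheses
history = [belief_fair]
for roll in rolls:
    belief_fair = update(belief_fair, roll)
    history.append(belief_fair)

print("P(unfair | data) =", float(1 - belief_fair))
```

Each six multiplies the odds in favour of 'fair' by 1/3 and each other face multiplies them by 5/3, so a run dominated by sixes drives the probability of the 'unfair' hypothesis towards 1.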

Models in the form of the uniform distribution have a very useful property: if a number of the hypotheses all share a single characteristic, then assigning a probability to that characteristic is easily accomplished simply by counting the number of hypotheses which have that characteristic. This property of the uniform distribution has the technical name multiplicity.

For example, if two fair six-sided dice are rolled there are 36 distinct possible outcomes; the first dice may display six possible values and for each of those the second dice may display six possible values. As the dice are fair, we have no evidence which would indicate that one outcome is more likely than another, so our model must assign a probability of 1/36 to each possible outcome.

Rather than modelling the outcomes of all distinct possibilities we might wish to model the probability of the various possible sums of the two dice. In this case the appropriate model has only 11 hypotheses, each having the form ‘the sum of the two dice is n’, where n is a number between 2 and 12. What probabilities should we assign to these hypotheses? A moment’s consideration will confirm that it cannot be the uniform distribution because some sums may occur in more than one way. For example a sum of five may occur in four ways: [1,4], [2,3], [3,2] and [4,1]. Thus the multiplicity of this outcome is 4 and the probability assigned to the hypothesis ‘the sum of the two dice is 5’ must be 4/36 or 1/9.
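The counting argument is easy to reproduce by enumerating the 36 equally likely outcomes. The sketch below is illustrative; the variable names are arbitrary.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered outcomes of two six-sided dice: (1, 1), (1, 2), ..., (6, 6).
outcomes = list(product(range(1, 7), repeat=2))

# Multiplicity: how many of the equally likely outcomes give each sum.
multiplicity = Counter(a + b for a, b in outcomes)

# Probability of each sum = multiplicity / total number of outcomes.
probability = {
    total: Fraction(count, len(outcomes))
    for total, count in multiplicity.items()
}

for total in sorted(multiplicity):
    print(total, multiplicity[total], probability[total])
```

The possible sums run from 2 to 12, the sum of five has multiplicity 4 and probability 1/9 as computed above, and seven is the most probable sum because it has the largest multiplicity.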

In statistical mechanics multiplicity is the number of micro-states corresponding to a particular macro-state. As the fundamental assumption of statistical mechanics is that each micro-state is equally probable, the probability of a macro-state may be calculated merely by counting the multiplicity of its micro-states.

Perhaps we would like a model of dice having predictive power concerning the outcome of the next roll. The model of a fair dice gives only 1/6 for each outcome; it is completely ignorant of what that outcome might be. If the dice is fair we can gather evidence on the outcomes by continuing to roll the dice, but this evidence is unlikely to cause a revision of the 1/6 prediction.

Only by adopting a radically different hypothesis space can we make progress. Such a hypothesis space would involve a mutually exclusive and exhaustive set of hypotheses connecting the outcomes to the initial physical conditions of the dice as it is thrown, including its orientation, linear and angular momentum, and many other details such as the spatial relationship between the dice and the surface it will land on.

Initially each of the hypotheses making up this model might be assigned equal probability. Provided with data on the initial conditions and outcomes of many rolls, such a model is capable of learning through Bayesian inference and will gain knowledge that provides predictive power.

Gibbs first demonstrated that a probabilistic model could be used to derive thermodynamic quantities. In a similar manner Shannon showed that a probabilistic model could be used to derive the communication channel capacity required for information transfer. Jaynes revealed that both of these accomplishments may be understood as applications of the general principle of maximum entropy and even more generally as examples of Bayesian inference. In this sense Bayesian inference may be understood as the means by which models can most accurately reflect the reality they are modelling. This general observation is applicable not only to the scientific models of statistical mechanics and communication theory discussed above but also to models such as genomes. In fact it is applicable wherever models containing knowledge are found in nature.