Saturday, 10 May 2014

Knowledge




The prominent physicist David Deutsch reminds us of the power of knowledge:

Everything that is not forbidden by the laws of nature is achievable, given the right knowledge.

Undoubtedly knowledge is the essential ingredient of order. I will argue however that the preeminent position which Deutsch bestows upon knowledge does not go far enough, that the very laws of nature have themselves come into being through the accumulation of knowledge, that the laws of physics may be considered but a summary of nature’s accumulated knowledge.
Information at the basis of structure
Since atomic theory came to general acceptance near the start of the 20th century, physics has learned that atoms themselves are composed of yet more fundamental particles. There is little reason to suppose that the quarks and leptons currently understood to be fundamental will not in turn be found to be composed of an underlying family of even more fundamental ‘preons’.

All structures appear to be composed of ever more fundamental units. At each level new structures may form because the subunits are able to interact and exchange information. Without the exchange of information between them no structures could exist; at that level such a reality would be composed entirely of isolated entities unable to detect or influence any other isolated entity.

The complexity of the world we experience is due, in a primary sense, to this transfer of information; the ability of one entity to record information about another entity. The four forces of nature by which fundamental particles are understood to influence each other may be described as instances of information transfer. Indeed these forces appear to be the ultimate form of all information transfer.

It is akin to the miraculous that our universe is not a barren one where each fundamental particle is isolated and uninfluenced by anything else. In such a world there could be no bonds between matter and no entity could contain information regarding another entity. Indeed, up to the level of the universe as a whole, science understands all structures to be composed of interacting subunits.
Principle of incomplete information
However information transfer is highly constrained; it appears to be a universal principle, which I will name the principle of incomplete knowledge, that one entity may convey only very little information about itself to other entities[1]. This principle seems as applicable to human communications as it does to information transfer between quantum entities. Complete information concerning one entity is never available to another entity; some degree of ignorance is unavoidable.

A general argument I will make is that knowledge is essential for the existence of many natural systems including quantum systems, biological systems, behavioural systems, and cultural systems. In this view knowledge has played a long and illustrious role in the evolution of the universe.

Towards a definition of knowledge
Surprisingly, within science, our primary means of understanding the universe, the term ‘knowledge’ is used in only a vague sense and does not have a clear technical definition. This lack of an adequate definition for a phenomenon playing a central role in the structure of the universe has resulted in a good deal of confusion. I will suggest a technical definition of knowledge later in this section which may help to resolve this problem.

While science has not provided a clear definition of knowledge it has developed a detailed understanding of ignorance. Ignorance is the amount of information in bits which any entity lacks in its knowledge of another and has the technical term entropy. Entropy is conceptually and mathematically well understood. 

My proposed definition of knowledge leverages our deep understanding of ignorance. Knowledge is a probability that is the mathematical inverse of entropy: for a model with an entropy of E bits it is 2 raised to the power of -E. This probability is essentially the chance that a random pick within the realm of ignorance will be the correct choice; our odds of being correct (knowledge) increase as ignorance decreases.

Knowledge and Bayesian inference

 The principle of incomplete knowledge requires that any knowledge must be uncertain knowledge. The field of mathematics which describes degrees of uncertainty or degrees of plausibility is Bayesian probability. It provides all the necessary mechanisms for determining the probability of knowledge from the information content of supporting evidence. Thus Bayesian probability prescribes the evolution of knowledge as evidential information is gained.
Common usage of the word knowledge usually involves an internal representation or model of external phenomena. The definition of knowledge I will develop is in close agreement with this view. In order for an internal model to be accurate it must receive information about the external phenomena and be capable of updating its representation as the phenomena change.

In contrast to knowledge, information has a fairly straightforward scientific definition. It is measured in bits, which may be considered answers to ‘yes’ or ‘no’ type questions. For instance the game of twenty questions can be considered as one where the questioner receives twenty bits of information in order to identify the correct answer. Twenty bits of information is powerful; it is able to distinguish between $2^{20}$, or over a million, different possibilities.
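
The twenty questions claim can be checked with a small sketch. It assumes an idealised game, not described in the text, in which the answer is a number and every question is ‘is it below the midpoint?’; the function name `count_questions` is hypothetical.

```python
def count_questions(secret: int, n_possibilities: int) -> int:
    """Identify `secret` in range(n_possibilities) using yes/no questions.

    Each question asks "is the secret below the midpoint?", so every
    answer (one bit) halves the range that still has to be searched.
    Returns the number of questions that were needed.
    """
    if not 0 <= secret < n_possibilities:
        raise ValueError("secret must lie in range(n_possibilities)")
    low, high = 0, n_possibilities  # the secret lies in [low, high)
    questions = 0
    while high - low > 1:
        mid = (low + high) // 2
        questions += 1
        if secret < mid:   # answer "yes": keep the lower half
            high = mid
        else:              # answer "no": keep the upper half
            low = mid
    return questions


n = 2 ** 20
print(n)                            # 1048576
print(count_questions(123_456, n))  # 20
```

Because each answer removes half of the remaining candidates, twenty answers are enough to single out one of 1,048,576 possibilities.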

Information may also take the form of a coded message that represents an entity. For instance all of the concepts dealt with by computers are represented by messages in binary code. As a simple example we might consider the sixteen binary states of four flipped coins represented by ones for heads and zeroes for tails.

0000  0001  0010  0011
0100  0101  0110  0111
1000  1001  1010  1011
1100  1101  1110  1111

These sixteen distinguishable outcomes of flipping four coins may each be identified with four bits of information. In general n distinguishable states can be distinguished or coded with $\log_2(n)$ bits of information. The probability of randomly choosing a specific state from n states is $1/n$ or, if we let $I$ stand for the information required to code for the state, the probability is $1/2^I$ or equivalently $2^{-I}$.
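
As a minimal sketch of this relationship, the four-coin states can be enumerated and the two quantities computed directly. The variable names are illustrative only.

```python
import math
from itertools import product

# Enumerate every outcome of four coin flips, 1 for heads and 0 for tails.
states = ["".join(bits) for bits in product("01", repeat=4)]

n = len(states)                  # number of distinguishable states
bits_needed = math.log2(n)       # information I needed to code one state
p_specific = 2 ** -bits_needed   # probability of picking one given state

print(states[:4])    # ['0000', '0001', '0010', '0011']
print(n)             # 16
print(bits_needed)   # 4.0
print(p_specific)    # 0.0625
```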

The history of information as a well-defined scientific concept has been quite brief. Claude Shannon introduced our modern conception of information in 1948. Since then it has come to be seen by many as perhaps the most fundamental concept in science. The great physicist John Wheeler said that he had come to view ‘everything as information’ (52). This astonishing ascendance of a scientific concept, from its introduction to perhaps the most fundamental in science, has occurred in only fifty years.
There can be difficulties when a colloquial term such as ‘information’ is adopted by science and given a precise technical definition. The technical definition may be quite different from common usage and confusion may arise.

Technical definition of information
 Dictionary.com defines information:
  1.  knowledge communicated or received concerning a particular fact or circumstance; news: information concerning a crime.
  2. knowledge gained through study, communication, research, instruction, etc.; factual data: His wealth of general information is amazing.
These definitions describe information in terms of knowledge but this is not the technical definition of information. Technically information is defined in terms of probability:

$$I(w_n) = -\log P(w_n)$$

Where each $w$ is one of the possible outcomes of some event and $w_n$ is the nth possible outcome. $P(w_n)$ is the probability that the nth possible outcome will actually occur. In our discussion it should be assumed that the log function is to the base 2 and thus information is given in bits.

The term on the left side of the equal sign might be paraphrased as ‘the information (I) received on learning that the outcome $w_n$ has occurred’. The right hand side of the expression might be paraphrased as ‘the negative log function of the probability previously assigned to the possibility that the outcome $w_n$ would occur’.
So this definition says that the information received when an event occurs equals the negative log of the probability that had been previously assigned to the event happening. Information may be thought of as the amount of surprise experienced when the actual outcome is learned.

Thus technically information is a measure of probability. If we assigned a low probability to an outcome we receive a lot of information if it does occur. If we expected it to be sunny today but it rained we received a lot of information; our plans may have to be fully revised. On the other hand if we expected rain and it did rain then we did not receive much information and not much needs to be updated. 
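
The weather example can be put into numbers with a short sketch. The prior probabilities of 0.9 for sun and 0.1 for rain are assumptions chosen purely for illustration, and `information_bits` is a hypothetical helper name.

```python
import math


def information_bits(prior_probability: float) -> float:
    """Information, in bits, received when an outcome occurs that had
    previously been assigned `prior_probability`."""
    if not 0 < prior_probability <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -math.log2(prior_probability)


# Assumed forecast: we expected sun with probability 0.9, rain with 0.1.
print(round(information_bits(0.1), 2))  # 3.32 bits if it rains (a surprise)
print(round(information_bits(0.9), 2))  # 0.15 bits if it is sunny (expected)
```

The less probable outcome carries roughly twenty times more information than the expected one.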

It is perhaps somewhat paradoxical that although information has come to be considered perhaps the most fundamental concept in science it is not simple. It requires the assignment of a probability to an outcome and in addition it requires that this probability be compared to the actual outcome, requiring a rather complicated mechanism for any physical instantiation. Thus information transfer is itself a complex phenomenon.

We might also use the above equation as a definition of probability: probability is a numerical assignment of the degree of plausibility for a given outcome. 

Bayesian interpretation of information and knowledge

In Bayesian terminology probabilities represent states of knowledge, thus making a connection with information’s colloquial meaning.
Perhaps surprisingly, although the term knowledge is used extensively within the Bayesian scientific literature, there does not seem to be an accepted definition. In fact Jaynes uses the term to define probability itself:
In our terminology, a probability is something that we assign, in order to represent a state of knowledge.
However nowhere in his writing or in other Bayesian literature have I been able to find an in-depth description of what is meant by ‘knowledge’. Unfortunately the primary technical definition of knowledge still seems to be the one offered by Plato over two thousand years ago, and still embraced by many philosophers today: that knowledge is ‘justified true belief’.

The first problem with this definition is that it just refocuses our attempts at clarity onto deciphering what is meant by ‘justified true belief’. This seems to offer only a regress to other vague terms. A perhaps more serious problem is that this definition has come to be understood as referring to human knowledge and justified true human beliefs. It does not refer to knowledge found anywhere else in nature.

Jaynes himself seems to have accepted the philosophers’ definition.
it is...the job of probability theory to describe human inferences at the level of epistemology.
I suggest that this confusion over the nature and scope of ‘knowledge’ within Bayesian thought has led to numerous difficulties in its proper application to fields such as biology where the existence of non-human knowledge is evident.

In our brief review of the proper context for knowledge we have encountered a number of related concepts including: models, information, updating models with information and probability. We can now combine these to gain an understanding of the process by which knowledge may evolve.

Returning to the definition of information as a measure of probability we should consider that the probabilities assigned to the mutually exclusive and exhaustive set of all the possible outcomes of an event must sum to 1. One and only one outcome of the model must occur. We may consider this set as a list of hypotheses, each assigned a probability that the associated outcome will occur. This kind of complete set of hypotheses forms a model of the event. To find the correct hypothesis in the set we must gather enough information to label one true and the rest false.

A set of probabilities which sums to 1 is called a probability distribution and has many interesting mathematical properties. Perhaps foremost amongst them is entropy. Entropy is the sum of the information contained in the set of hypotheses, the information of each hypothesis weighted by its probability:

$$E(H) = -\sum_{n} P(h_n) \log P(h_n)$$

Where $E$ is entropy, $H$ is our model and $h_n$ are the n hypotheses making up the model. This expression for entropy may be paraphrased as: the expected surprise that a model of the outcomes will experience when the actual outcome becomes known.

Surprise, and thus increased entropy, occurs when the model lacks predictive accuracy. The entropy of every probability distribution has a value between zero and infinity. It equals zero when the probability distribution is a certainty; one hypothesis has a probability of 1 and the rest have a probability of 0. The uniform distribution which has n members all having probability 1/n has the highest entropy of any distribution with n members. Its entropy approaches infinity as n approaches infinity.
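
These properties can be illustrated with a minimal sketch of the entropy calculation in bits. The helper name `entropy_bits` is an assumption, and terms with zero probability are skipped on the usual convention that they contribute nothing.

```python
import math


def entropy_bits(distribution: list[float]) -> float:
    """Entropy in bits: each hypothesis's information weighted by its
    probability, summed over the whole distribution."""
    if abs(sum(distribution) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Zero-probability hypotheses are skipped (they contribute nothing).
    return sum(-p * math.log2(p) for p in distribution if p > 0)


print(entropy_bits([1.0, 0.0, 0.0, 0.0]))  # 0.0  (a certainty)
print(entropy_bits([0.5, 0.25, 0.25]))     # 1.5
print(entropy_bits([1 / 8] * 8))           # 3.0  (uniform, n = 8)
print(entropy_bits([1 / 1024] * 1024))     # 10.0 (uniform, n = 1024)
```

The uniform cases grow as the log of n, which is why the entropy has no upper bound as n increases.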

Entropy measures what a model does not know or its uncertainty. In the case of thermodynamics entropy is the amount of uncertainty in the exact microstate of the system when we have some partial information such as temperature:

The amount of additional information that would allow us to pinpoint the actual microstate is given by the entropy of the distribution.
Definition of knowledge
As entropy measures a lack of knowledge or ignorance it is a kind of inverse of knowledge and we might expect a technical definition of knowledge could be formed in terms of entropy. A first step forward is to recognize that knowledge, like entropy, is a property of a model; it is a measure of a model’s predictive accuracy. Drawing on the relationship between information and probability we noted earlier and noting that entropy is a form of information I propose the technical definition for the knowledge of a model K(H) as:

$$K(H) = 2^{-E(H)}$$

For example the model describing a coin flip is the two member uniform probability distribution {.5, .5}. It has entropy = 1 bit. There are $2^1 = 2$ distinct possible outcomes about which the model is ignorant: [Heads] and [Tails]. The model’s knowledge is $2^{-1} = .5$ which is the probability that an arbitrary choice will produce the correct prediction.
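
The coin-flip calculation can be sketched as follows. The function names are hypothetical, and the non-uniform distribution {.9, .1} is an added illustration rather than an example from the text.

```python
import math


def entropy_bits(distribution: list[float]) -> float:
    """Entropy of a probability distribution, in bits."""
    return sum(-p * math.log2(p) for p in distribution if p > 0)


def knowledge(distribution: list[float]) -> float:
    """Knowledge of a model as proposed here: 2 to the power of minus
    the model's entropy in bits."""
    return 2 ** -entropy_bits(distribution)


print(knowledge([0.5, 0.5]))            # 0.5   (the coin flip)
print(knowledge([0.25] * 4))            # 0.25  (four equally likely outcomes)
print(round(knowledge([0.9, 0.1]), 3))  # 0.722 (assumed, heavily weighted model)
```

A model that favours one hypothesis has lower entropy than the uniform model over the same outcomes and therefore a higher knowledge value.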
Knowledge and ignorance
In general entropy is a measure of ignorance and ignorance is described by the uniform distribution; when nothing is known all possibilities are equally likely. The entropy of the uniform distribution which has n possibilities is $\log(n)$. Using our definition its knowledge is $2^{-\log(n)}$ or more simply $1/n$. But this is the probability that any one of the possibilities within the space of our ignorance is the correct one. Our definition of knowledge is the probability of randomly guessing the correct possibility within the boundaries of ignorance.

The amazing implication is that knowledge amounts to a random guess within the sphere of ignorance. The only way the guess may become more likely is if the space of ignorance is reduced.

As an example let us consider a model in the form of a distribution which has 16 possibilities. To begin with we have no information which would make one possibility more likely than the others so we assign the uniform distribution where each probability is 1/16. This distribution has 4 bits of entropy. Let’s say we get some evidence concerning the model and when this is applied via Bayesian updating some possibilities become more likely than others and the entropy of the new distribution is reduced to 3 bits.

The new state of knowledge is $2^{-3} = 1/8$. But this is the same knowledge as contained in a uniform distribution with only 8 possibilities; it is the same probability as a random guess amongst 8 possibilities. The change in certainty of the model due to the evidence is equivalent to reducing the scope of our ignorance from 16 possibilities to 8.
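
One concrete way to obtain such a one-bit reduction is sketched below. It assumes, purely for illustration, that the 16 possibilities are the four-coin states and that the evidence is learning that the first coin landed heads; this simple case rules out half the states outright, whereas the example above allows for evidence that merely shifts the probabilities.

```python
import math
from itertools import product


def entropy_bits(distribution: list[float]) -> float:
    """Entropy of a probability distribution, in bits."""
    return sum(-p * math.log2(p) for p in distribution if p > 0)


states = ["".join(bits) for bits in product("01", repeat=4)]
prior = {s: 1 / 16 for s in states}  # uniform: nothing is known yet

# Assumed evidence: "the first coin shows heads (1)".
# Its likelihood is 1 for states consistent with it and 0 otherwise.
likelihood = {s: 1.0 if s[0] == "1" else 0.0 for s in states}

# Bayesian update: posterior = prior * likelihood / total evidence.
evidence = sum(prior[s] * likelihood[s] for s in states)
posterior = {s: prior[s] * likelihood[s] / evidence for s in states}

prior_entropy = entropy_bits(list(prior.values()))
posterior_entropy = entropy_bits(list(posterior.values()))
knowledge_after = 2 ** -posterior_entropy

print(prior_entropy)           # 4.0
print(posterior_entropy)       # 3.0
print(knowledge_after)         # 0.125
print(2 ** posterior_entropy)  # 8.0, the reduced scope of ignorance
```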

The example bears out the implication stated above: knowledge may be considered a random guess within a scope of ignorance, and the only way for the guess to become more likely, for knowledge to increase, is for the scope of ignorance to be reduced.
 
While this definition might seem mathematically cumbersome it has some attractive properties:
  1. Knowledge is a positive number between 0 and 1.
  2. Knowledge increases in value as entropy or ignorance decreases and vice versa.
  3. Knowledge approaches 1 when one of the model’s hypotheses approaches certainty.
  4. The knowledge of the uniform distribution is especially simple; it is 1/n, the same as the probability for any particular outcome. Thus a fair six-sided die has a distribution with knowledge of 1/6 which agrees with our common sense perception of knowledge as a measure of how close we are to certainty.
  5. The knowledge of the uniform distribution approaches zero as n approaches infinity. Again in agreement with common sense; all else being equal we know the least when there are a great many possibilities and we have no information that would allow us to prefer one over any of the others.
Our definition of knowledge in terms of a probability agrees with the usual Bayesian definition of probability as a state of knowledge with one important difference; knowledge is a property of any model in nature and such models are not necessarily closely related to humans.

With this definition in hand we might next ask ‘how is a model’s knowledge increased?’ Fortunately mathematicians have shown that knowledge increase must follow a unique algorithm: the Bayesian update. On the reception of new information (I) by the model (H) the probability of each component hypothesis ($h_n$) making up the model must be updated according to:

$$P(h_n \mid I X) = P(h_n \mid X)\,\frac{P(I \mid h_n X)}{P(I \mid X)}$$

Where X is the information we had prior to receiving our new information I.

This theorem demonstrates that the model composed of the updated probabilities will have the greatest accuracy possible given the data. Models which are updated according to Bayes’ theorem on the reception of new information will tend to have the greatest knowledge or predictive accuracy. There are however some important caveats to this that are explored in Appendix 1 using as an example the results of medical tests. 
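
A minimal sketch of the update follows. The model used is an assumption for illustration only: three hypotheses about a coin’s chance of landing heads (0.25, 0.5 and 0.75), updated after a single observed head. The function name `bayes_update` is likewise hypothetical.

```python
def bayes_update(prior: list[float], likelihood: list[float]) -> list[float]:
    """Return the updated probability of each hypothesis.

    prior[n]      is the probability of hypothesis n before the new
                  information arrives, P(h_n | X).
    likelihood[n] is the probability of the new information if
                  hypothesis n is true, P(I | h_n X).
    """
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)  # P(I | X), the normalising constant
    if evidence == 0:
        raise ValueError("the information is impossible under every hypothesis")
    return [j / evidence for j in joint]


# Assumed model: the coin's chance of heads is 0.25, 0.5 or 0.75,
# with no initial preference between the three hypotheses.
prior = [1 / 3, 1 / 3, 1 / 3]
likelihood_of_heads = [0.25, 0.5, 0.75]

posterior = bayes_update(prior, likelihood_of_heads)
print([round(p, 3) for p in posterior])  # [0.167, 0.333, 0.5]
```

After one observed head the hypothesis that most strongly predicted heads has gained probability at the expense of the one that least predicted it, and the updated probabilities still sum to 1.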

We have seen that the mathematical concepts of information, probability, entropy, knowledge, Bayesian update, and models are intimately related. They are in fact but different properties of a mathematical entity called inference. Inference is the mathematical process for basing conclusions on data and for reaching the best conclusions possible in the face of incomplete information.

A clearer view of the intimate relation amongst these concepts might begin with probability.
  1. Cox showed that any consistent process of assigning real numbers to degrees of plausibility would lead to the sum and product rules of probability theory; these rules may be taken as the axioms of probability theory.
  2. The Bayesian update is a mere rearrangement of the terms of the product rule.
  3. The Bayesian update connects new information with updated probabilities which form a probability distribution over an exhaustive and mutually exclusive set of hypotheses (H); in our terms this is a model.
  4.  Entropy and knowledge are inverse functions which are properties of models.
Thus we see that these concepts are inseparable; they are defined in terms of one another and any one of them implies the others. The integrated entity which they form is an inferential agent. When we encounter any one of these concepts we should expect to encounter them all operating together as an inferential agent.
 
This claim may appear reckless. In some views probability or information are primitive concepts which are found throughout nature and do not necessarily entail the complications of inference. However we might consider that probability and information are defined in terms of one another. Probability is the assignment of a degree of plausibility of an outcome. No such assignment can be made without considering the set of alternative outcomes, in other words without considering a probability distribution over all possible outcomes. This is a model and on the reception of new information the correct probabilities entailed by the model are given by the Bayesian update. Thus my claim that probability (and the other related concepts) has no meaning other than within the context of inference.

Rather than narrowing the context for probability, this view actually is an expansion on the usual Bayesian view of probability. Bayesians have stressed that probability is related to a state of human knowledge or inference. In the view presented here the scope of probability is expanded beyond humans to the larger arena of inferential agents in general.


[1] The basis for the principle of incomplete information may reside in the nature of quantum information which is the basic form of all information exchange. The quantum information necessary to fully describe an entity can be divided between Holevo information which may be communicated and the information of quantum discord which may not (91). The quantity of Holevo, or classical information, is usually minute compared to quantum discord.
