An introduction to joint, marginal, and conditional probabilities

This article is a companion to my piece on Bayesian statistics. In that article I tried very hard to avoid using any specialist statistical terminology. This article attempts to explain the meanings of some of the most important terms I would have used, had I not been trying to avoid jargon. Specifically, the purpose of this article is to explain what is meant by joint probability, marginal probability, and conditional probability. In particular, I want to explain what insight we can obtain by calculating and comparing these figures. As ever, I'll give an example with specific figures first, and work towards a more general treatment.

The example

Rather than using horse racing as an example, as I did in the Bayesian statistics article, here I'll be using a (completely fictitious) study of the relationship between criminal conviction and bodyweight in a prison population.

It is hypothesized (let us say) that different body compositions incline people to different types of crime. 144 men in a prison, in the age range 40-45, were divided into groups according to the crime of which they had been convicted. In this sample, only three crimes were represented -- burglary, tax evasion, and blackmail (yes, I know that's unlikely, but I didn't want to present too much raw data). Each prisoner was weighed, and classified as "underweight", "normal weight", or "overweight", according to body-mass index. The results -- the number of prisoners in each group -- are shown in the table below.

Note:
Do not cite this study! I made up all the figures. I have no idea whether they are representative of anything in real life -- almost certainly they aren't.

                Underweight  Normal weight  Overweight  Total

     Burglary        19            16             5        40

  Tax evasion         4            16            31        51

    Blackmail         6            17            30        53

        Total        29            49            66       144

Just tabulating the data this way suggests a few things.

Burglary seems to be the crime of choice for underweight people.

Overweight people, on the other hand, seem to be less inclined to burglary -- perhaps it's harder to squeeze through upper-storey windows.

There seems to be a high proportion of overweight people in the sample as a whole -- maybe it's all that prison food?

On the face of it, normal-weight people seem to be equally represented in all three crimes.

The first thing to do is to turn these numbers into probabilities. Since there are 144 people in the sample, this just amounts to dividing the individual values by 144. By definition, the probability of observation X is just the number of X observations, divided by the total number of observations. Here are the same results as above, presented as probabilities.

                Underweight  Normal weight  Overweight  Total

     Burglary       0.13          0.11         0.03     0.27

  Tax evasion       0.03          0.11         0.22     0.36

    Blackmail       0.04          0.12         0.21     0.37

        Total       0.20          0.34         0.46     1.00

This doesn't provide any additional information, but it removes the sample size from the numbers. This makes it easier to compare this experiment with the data from related ones, should we need to.
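The conversion from counts to probabilities can be sketched in a few lines of Python. This is only an illustration: the names `counts`, `total`, and `joint` are my own choices, and the figures are the made-up ones from the first table.

```python
# Counts from the (fictitious) prison study.
# Outer keys are crimes, inner keys are weight classes.
counts = {
    "burglary":    {"underweight": 19, "normal weight": 16, "overweight": 5},
    "tax evasion": {"underweight": 4,  "normal weight": 16, "overweight": 31},
    "blackmail":   {"underweight": 6,  "normal weight": 17, "overweight": 30},
}

# Total number of observations in the sample.
total = sum(sum(row.values()) for row in counts.values())

# Probability of each cell = cell count / total count.
joint = {
    crime: {weight: n / total for weight, n in row.items()}
    for crime, row in counts.items()
}

# Print each row rounded to two decimal places, as in the table.
for crime, row in joint.items():
    print(crime, {weight: round(p, 2) for weight, p in row.items()})
```

Keeping the unrounded values in `joint` and rounding only for display avoids carrying rounding error into later calculations.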

Joint probabilities

The values in the main body of the table -- other than the "total" row and column -- are known as joint probabilities. They are the probabilities of two simultaneous events or states: being a burglar and being overweight, for example. The probability of being a burglar and being underweight is 0.13 -- that's another way of saying that 13% of the sample were both burglars and underweight, or that 13% were in the "burglar and underweight" category.

Joint probability can be written in several different ways:

p (burglar ∩ underweight)
p (burglar x underweight)
p (burglar, underweight)

I will use the 'comma' notation in this article. The joint probabilities should sum to 1.0 -- there are only nine classes in the study, and everybody in the study has to be in exactly one of them. (Note: because I've rounded the joint probabilities to two decimal places, sums of the values in my table may differ slightly from the exact figures.)

In case it isn't obvious, it doesn't matter whether we write p (blackmail, overweight) or p (overweight, blackmail) -- they refer to the same quantity.

Marginal probabilities

The values in the total row and total column are known as the marginal probabilities, presumably because they are in the margins of the table. If we were looking at single variables (only weight, or only crime) the marginal probability would just be called the probability.

So, for example, p (normal weight) = 0.34, because 49 / 144 of the people in the sample were of normal weight.
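A marginal probability is just a row or column total divided by the grand total. Here is a minimal Python sketch of that, using the made-up counts from the first table; the names `p_crime` and `p_weight` are mine, chosen for illustration.

```python
# Counts from the (fictitious) prison study.
counts = {
    "burglary":    {"underweight": 19, "normal weight": 16, "overweight": 5},
    "tax evasion": {"underweight": 4,  "normal weight": 16, "overweight": 31},
    "blackmail":   {"underweight": 6,  "normal weight": 17, "overweight": 30},
}
weights = ["underweight", "normal weight", "overweight"]
total = sum(sum(row.values()) for row in counts.values())

# Marginal probability of each crime: row total / grand total.
p_crime = {crime: sum(row.values()) / total for crime, row in counts.items()}

# Marginal probability of each weight class: column total / grand total.
p_weight = {w: sum(counts[c][w] for c in counts) / total for w in weights}

print({c: round(p, 2) for c, p in p_crime.items()})
print({w: round(p, 2) for w, p in p_weight.items()})
```

Because this works from the raw counts, a value can differ in the last digit from the corresponding total in the probability table, which adds up entries that have already been rounded.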

Note that, in general, we can't multiply the marginal probabilities to get a joint probability. For example,

p (burglar) x p (underweight) = 0.27 x 0.20 = 0.05
p (burglar, underweight) = 0.13

There's a considerable difference between these two figures. In high school we got used to combining probabilities of events by multiplying them; but that only works if the events are independent.

In this case, we clearly expect bodyweight to have an effect on crime, or we wouldn't be studying it. So it's reasonable to assume, until there is evidence to the contrary, that the probabilities of being underweight and being a burglar are not independent, and we can't simply multiply them. Of course, we don't need to calculate the joint probability from any other measure -- the joint probabilities are the data we're starting with.
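To make the comparison concrete, here is a small Python sketch that sets the observed joint probability beside the figure we would get if the two attributes were independent. It uses the raw counts from the first table rather than the rounded probabilities, so the last digit may differ from the figures quoted above. The variable names are my own.

```python
# Counts taken from the first table in this article.
total = 144
n_burglar = 40              # row total for burglary
n_underweight = 29          # column total for underweight
n_burglar_underweight = 19  # count in the "burglary, underweight" cell

p_burglar = n_burglar / total
p_underweight = n_underweight / total
p_joint = n_burglar_underweight / total

# The joint probability we would expect if the two were independent.
p_if_independent = p_burglar * p_underweight

print(f"observed joint:       {p_joint:.3f}")
print(f"product of marginals: {p_if_independent:.3f}")
print(f"ratio:                {p_joint / p_if_independent:.2f}")
```

A ratio close to 1 would be consistent with independence. Here the observed joint probability is more than twice the product of the marginals.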

Note:
In some circumstances, a marginal probability might be referred to as a prior probability. The difference is one of context, rather than meaning. The term 'prior probability' is typically used when we know the marginal probabilities and want to use them in estimation.

Conditional probabilities

Conditional probability expresses the probability of one outcome, given that some other outcome has already occurred, or some other observation has already been made.

For example, if we already know that a person in the sample is overweight, the probability of that person's being a burglar may well be different from the overall probability of a person being a burglar.

The conditional probability is obtained by dividing the joint probability by the appropriate marginal probability. We write this as

p (A|B) = p (A, B) / p (B)

p (A|B) is usually read "probability of A given B". So for example:

p (burglar|overweight) = p (burglar, overweight) / p (overweight)
= 0.03 / 0.46 = 0.07

This formulation really says nothing more than that 7% of overweight people (in the prison sample) are burglars.
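The formula is simple enough to write as a one-line function. The sketch below is illustrative only; the function name `conditional` is my own. It evaluates the formula twice: once with the rounded values from the probability table, as in the calculation above, and once with the raw counts.

```python
def conditional(p_joint: float, p_given: float) -> float:
    """Return p(A|B) = p(A, B) / p(B)."""
    if p_given <= 0:
        raise ValueError("p(B) must be positive for p(A|B) to be defined")
    return p_joint / p_given

# Using the rounded values from the probability table.
from_table = conditional(0.03, 0.46)

# Using the raw counts: 5 overweight burglars, 66 overweight prisoners,
# 144 prisoners in total.
from_counts = conditional(5 / 144, 66 / 144)

print(round(from_table, 2), round(from_counts, 3))
```

The two results differ slightly because the table values were rounded before dividing. With raw counts the sample size cancels, and the conditional probability is simply 5 / 66.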

It is absolutely crucial to understand that, in general,

p (A|B) ≠ p (B|A).

The fact that 7% of overweight people (in the sample) are burglars does not mean that 7% of burglars are overweight. In fact, about 13% of burglars (in the sample) are overweight. Failing to appreciate the distinction between these two measures is a major source of error, not to mention a number of miscarriages of justice. For that reason the error is often referred to as the prosecutor's fallacy. I describe this problem in much more detail in my article on Bayesian statistics.
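The asymmetry is easy to see when both conditional probabilities are computed from the same cell of the table. In this short Python sketch (variable names are mine), the numerator is the same in both cases and only the denominator changes.

```python
# Counts taken from the first table in this article.
n_burglar_and_overweight = 5   # the "burglary, overweight" cell
n_overweight = 66              # column total for overweight
n_burglar = 40                 # row total for burglary

# p(burglar | overweight): restrict attention to the overweight column.
p_burglar_given_overweight = n_burglar_and_overweight / n_overweight

# p(overweight | burglar): restrict attention to the burglary row.
p_overweight_given_burglar = n_burglar_and_overweight / n_burglar

print(f"p(burglar | overweight) = {p_burglar_given_overweight:.3f}")
print(f"p(overweight | burglar) = {p_overweight_given_burglar:.3f}")
```

The two figures would coincide only if the row total and the column total happened to be equal.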

Now consider the conditional probabilities for that section of the sample of normal body weight. Compare these figures to the marginal probabilities of the various crimes:


p (burglary | normal weight)    = 0.11/0.34  = 0.33     p (burglary) = 0.27

p (tax evasion | normal weight) = 0.11/0.34  = 0.33     p (tax evasion) = 0.36

p (blackmail | normal weight)   = 0.12/0.34  = 0.35    p (blackmail) = 0.37

We can see that p (X | normal weight) is approximately equal to p (X), where X is the specific criminal conviction. The finding that a person (in this sample) is of normal weight does not allow us to conclude much about the crime that person committed: normal weight and crime are, to a first approximation, independent. We would expect the products of the marginal probabilities to be the same as the joint probabilities in such a case and, roughly, they are.

They aren't exactly equal, though. This is what we expect in a sample -- even if two variables are strictly independent in a large population, we will usually find some slight correspondence in a small sample like this.

Now let's make the same comparison for members of the 'underweight' group.


p (burglary | underweight)    = 0.13/0.20  = 0.65     p (burglary) = 0.27

p (tax evasion | underweight) = 0.03/0.20  = 0.15     p (tax evasion) = 0.36

p (blackmail | underweight)   = 0.04/0.20  = 0.20     p (blackmail) = 0.37

In this case, p (X | underweight) is very different from p (X). Knowing that somebody (in this sample) is underweight does have some predictive value concerning the crime of which that person was convicted.
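The same comparison can be run for every weight class at once. The Python sketch below works from the raw counts in the first table. The helper `crime_given_weight` and the "largest gap" summary are my own informal devices for illustration, not a formal statistical test of independence.

```python
# Counts from the (fictitious) prison study.
counts = {
    "burglary":    {"underweight": 19, "normal weight": 16, "overweight": 5},
    "tax evasion": {"underweight": 4,  "normal weight": 16, "overweight": 31},
    "blackmail":   {"underweight": 6,  "normal weight": 17, "overweight": 30},
}
weights = ["underweight", "normal weight", "overweight"]
total = sum(sum(row.values()) for row in counts.values())

# Marginal probability of each crime: row total / grand total.
p_crime = {c: sum(row.values()) / total for c, row in counts.items()}

def crime_given_weight(weight):
    """Return p(crime | weight) for every crime, computed from the counts."""
    column_total = sum(counts[c][weight] for c in counts)
    return {c: counts[c][weight] / column_total for c in counts}

# Largest absolute gap between p(crime | weight) and p(crime), per class.
largest_gap = {}
for w in weights:
    given = crime_given_weight(w)
    largest_gap[w] = max(abs(given[c] - p_crime[c]) for c in counts)
    print(f"{w:14s} largest gap = {largest_gap[w]:.2f}")
```

A small gap means that knowing the weight class tells us little about the crime; a large gap means that it tells us a good deal. In this sample the gap for the underweight group is several times larger than the gap for the normal-weight group.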

Summary

This article has explained the meanings of joint, marginal and conditional probability, and how these figures give some insight into how categories of observation are related.