December 30, 2011 | John Rusk | 2 Comments

Bayes’ Theorem is an important tool when analysing probabilities. It helps us to avoid cognitive traps and make better decisions. However, it is usually presented as something difficult, or even controversial. The typical article on Bayes’ Theorem stresses how difficult it is, and then goes on to bemoan the fact that it’s not more widely used! In contrast, I argue that Bayes can be explained concisely, memorably, and intuitively. Read on, and judge for yourself whether I’ve succeeded…

Example

Consider this example adapted from Wikipedia:

An entomologist finds a beetle with spots on its back. She thinks it might be a rare subspecies, because 98% of the rare subspecies have such spots, but only 5% of the common variety have them. The subspecies makes up only 0.1% of the population. Given the fact that the beetle in her hand is spotty, what is the probability that it is also rare?

The entomologist wants to know P(rare|spotty), which reads from left to right as “probability of being rare, given that it’s spotty”. This is the time to use Bayes’ Theorem. She wants P(rare|spotty) but only has P(spotty|rare). The latter is 98%, but she suspects that the former may be much lower. She’s right: the spotted beetle in her hand is probably not rare… and Bayes can tell us why.

Memorable

I want to explain Bayes’ Theorem in a way that is easy to remember. To begin, let’s start from scratch and think about the probability that one randomly-selected beetle is both rare and spotty. There are two mathematically-correct ways to express that probability.
One is to write:

P(rare and spotty) = P(spotty|rare) P(rare)
= “(the probability of it being spotty, given that it is rare) times (the probability of it being rare in the first place)”

The other is:

P(rare and spotty) = P(rare|spotty) P(spotty)
= “(the probability of it being rare, given that it is spotty) times (the probability of it being spotty in the first place)”

We’ve just written down two different ways to compute the same thing, so they must be equal to each other. I.e.:

P(spotty|rare) P(rare) = P(rare|spotty) P(spotty)

If we re-express this in a general-purpose notation, we get

P(A|B) P(B) = P(B|A) P(A)

That is the formula you need to remember. To recall its left-hand side, remember this intuitive fact: the probability of two events both happening is the (conditional) probability of one happening given that the other has already happened, times the (unconditional) probability of the other. The right-hand side is just the same thing round the other way.

That’s all you need to remember. Start with the above formula, then use basic algebra to re-arrange it into the form you want. For instance, if you want to compute P(A|B), just divide both sides by P(B). (You’ll end up with the usual form of Bayes’ equation, as seen in textbooks.)

Intuitive

Where Bayes gets interesting is when it’s used to assess the probability of some hypothesis, given the available evidence. Dividing both sides of our formula by P(B), and renaming A as the “hypothesis” and B as the “evidence”, we get this:

P(hypothesis|evidence) = P(evidence|hypothesis) P(hypothesis) / P(evidence)

Let’s consider a simple example involving a blond-haired young boy called Johnny, and an apple that is missing from the neighbour’s tree. The hypothesis in question is “Little Johnny stole the apple” and the evidence is “blond hair left snagged in apple tree”. The angry neighbour wants to know the probability that Johnny did indeed steal the apple, given the evidence of the hair – i.e.
P(Johnny-stole-apple|blond-hair-left-at-scene). Our formula becomes

P(Johnny-stole-apple|blond-hair-left-at-scene) = P(blond-hair-left-at-scene|Johnny-stole-apple) P(Johnny-stole-apple) / P(blond-hair-left-at-scene)

When used in this way, the formula includes three things:

The left-hand side, P(Johnny-stole-apple|blond-hair-left-at-scene), represents our degree of belief that Johnny stole the apple, after taking into account the evidence of the blond hair. This is called the posterior probability.

The final term of the numerator, P(Johnny-stole-apple), represents our degree of belief that Johnny stole the apple, before taking into account the evidence of the blond hair. This is called the prior probability.

The rest of the right-hand side (the other two terms) represents how the evidence of the blond hair influences the prior belief, by making it stronger or weaker.

Let’s consider each of the right-hand-side terms, and see how they are relevant to the outcome.

Firstly, consider P(blond-hair-left-at-scene|Johnny-stole-apple). This is the probability that Johnny would leave a hair behind if he did steal the apple. For instance, if the neighbour knew that Johnny suffered from excessive hair loss, then this probability would go up. I.e. if he was particularly likely to lose hair, then the presence of hair would be stronger evidence of his guilt (as opposed to someone else’s guilt). Conversely, if Johnny always wore hats, then the presence of a hair would be weaker evidence of his guilt.

Secondly, let’s consider P(blond-hair-left-at-scene). This is the overall probability of a blond hair being left at the scene, regardless of who committed the crime. If this number goes up, then the hair is considered weaker evidence of guilt. For instance, if the crime occurred in a Swedish town where most boys have blond hair, then a blond hair would be relatively weak evidence against Johnny. It could have come from almost anyone.
But if the crime occurred in China, the probability of a blond hair being left at the scene would be much lower, so the fact that one was found is stronger evidence against little blond Johnny.

Finally, let’s consider P(Johnny-stole-apple). This is the neighbour’s prior belief that Johnny stole the apple – i.e. his belief before considering this particular evidence. If the neighbour knows that Johnny is a caring, honest Boy Scout with an allergy to apples, then his prior belief in Johnny’s guilt will be relatively low. To reverse that prior belief, and produce a high posterior probability of guilt, the evidence of the hair would have to be particularly compelling. (E.g. Johnny is the only blond-haired boy in a Chinese village, and he suffers from persistent hair loss.) If the neighbour doesn’t know Johnny at all, then he has no option but to assign a prior probability based on the population in general – e.g. if there are 10 kids in the neighbourhood, the prior probability of Johnny being the thief could be assumed to be 1/10.

So we can see that all terms of the right-hand side contribute to the final result in ways that are intuitively sensible. Bayes is not so hard. It actually makes intuitive sense.

What’s so difficult then?

In practice, there can be some obstacles to applying Bayes successfully. Obstacles such as:

Remembering that P(hypothesis) is an input to computing P(hypothesis|evidence). If you use the formula, you won’t forget. But when we rely on our intuitions instead of the formula, we tend to neglect P(hypothesis). This mistake is known as base rate neglect. For instance, in the spotted beetle example, P(rare) is the base rate, and it’s only 0.1%. If you neglect the base rate, you’ll intuitively overestimate the chances of a spotted beetle being rare.

Being willing to use P(hypothesis) as an input. Having to rely on P(hypothesis) may seem inconvenient because it’s often hard to obtain an objectively-correct value for P(hypothesis).
So, out of necessity, we may end up using subjective guesses for P(hypothesis). This subjectivity is the cause of much of the controversy that dogged Bayes’ Theorem in the 20th century. The solution is that, for many kinds of problems, “probability” actually means “measure of belief” – rather than some number which is computed from repeatable trials. [Subjective evaluation of P(hypothesis) is also the cause of much of the power of Bayes, since the formula doesn’t mind if P(hypothesis) is subjective. Therefore, it’s a great way to take an initial viewpoint, which may be subjective, and adjust it as we receive new objective evidence.]

Computational difficulties when generalizing beyond simple yes/no hypotheses and one piece of evidence. For instance, you may want to consider several events (blond hair left at scene and passer-by reports glimpsing possible thief in pink tracksuit), or you may have random variables instead of boolean events. These issues are entirely solvable, but beyond the scope of this blog post.

Bayes’ Theorem gives us a powerful tool for dealing with uncertainty. I hope this page will help you (and me) recall the basics 😉

Finally, for an enjoyable read on the history and uses of Bayes’ Theorem, check out this new book from Sharon Bertsch McGrayne.

Postscript: feedback about this post included the following comment, which is worth quoting here in full: “fundamentally the idea actually is very simple and the interpretation and application where all the difficulty lies.” That seems like a point worth noting: the basics of Bayes are very simple, but successfully applying it to real-world problems is much more difficult. I hope that this blog post has helped to convey the simplicity of the underlying idea.
I wrote this post while reading the above book, since I couldn’t enjoy the book without understanding what Bayes’ Theorem really is :-) As for the difficult problems of devising real-world models and hypotheses, I’ll have to plead (relative) ignorance, I’m afraid, since I’m a software designer, not a statistician.
Great explanation, wonderfully clear, although you should have completed the calculations in your examples.
Good point. The calculations for the spotted beetle example can be found by following the Wikipedia hyperlink (above). For Johnny (maybe) stealing the apple, I have not set up any numbers to plug into the formula. The point of that example was really about an intuitive understanding of each part of the formula, rather than putting exact numbers on, say, the probability that Johnny would leave a hair snagged in the tree.
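For readers who would rather not follow the link, here is a rough sketch of the spotted beetle calculation in Python. It uses only the three numbers given in the example (0.1%, 98% and 5%). The function and variable names are my own, and the sketch assumes that every beetle is either the rare subspecies or the common variety, so that P(spotty) can be built up from those two cases.

```python
def posterior(prior, p_evidence_given_h, p_evidence_given_not_h):
    """Return P(hypothesis|evidence) for a simple yes/no hypothesis."""
    # P(evidence): add up the two ways the evidence can arise
    # (hypothesis true, hypothesis false). Assumes no third case exists.
    p_evidence = (p_evidence_given_h * prior
                  + p_evidence_given_not_h * (1 - prior))
    # Bayes' Theorem: P(H|E) = P(E|H) P(H) / P(E)
    return p_evidence_given_h * prior / p_evidence


# Numbers from the beetle example in the post.
p_rare_given_spotty = posterior(
    prior=0.001,                  # P(rare): 0.1% of the population
    p_evidence_given_h=0.98,      # P(spotty|rare)
    p_evidence_given_not_h=0.05,  # P(spotty|common)
)
print(f"P(rare|spotty) = {p_rare_given_spotty:.3f}")  # prints P(rare|spotty) = 0.019
```

So, under these assumptions, a spotted beetle has only about a 1.9% chance of being rare, even though 98% of rare beetles have spots. The tiny base rate P(rare) does most of the work.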