Beta Distribution and Coin Tossing: An Intuitive Explanation

Usman Haider
6 min read · Apr 15, 2023


Daniel Kahneman, in his popular book “Thinking, Fast and Slow,” explores the idea that humans are not naturally adept at statistics and probability. He concludes that our brains are wired to rely on intuitive, automatic thinking processes, rather than the more deliberate, effortful processes required for statistical reasoning.

Statistics and probability concepts can be difficult to remember or understand, especially if they are explained using technical and mathematical language.

So in an effort to make statistics more engaging, I’d like to share an approachable explanation of a concept that many students and professionals find intimidating: the Beta distribution. Widely used in the Bayesian framework of statistics, understanding the Beta distribution is a key component in unlocking the power of this approach to data analysis. In this post, I’ll provide an intuitive explanation of the Beta distribution, with the hope that the underlying mathematical formulas will become more natural as a result.

Background

To help illustrate the concept, let's consider the example of tossing a coin. If we already know the probability of the coin landing heads in a single toss (let's call this probability "p"), then we can calculate the probability of a particular sequence of heads and tails over a series of N tosses:

P(sequence) = p^h × (1 − p)^(N − h)

where p = probability of success (landing heads) in a single toss and h = number of heads in the sequence. So in the case of an unbiased coin, p = 0.5.

For example, given an unbiased coin, the probability of landing a total of 3 heads and 7 tails in 10 tosses (in any order) is given exactly by the binomial distribution:

P(X = 3) = C(10, 3) × 0.5^3 × 0.5^7 = 120/1024 ≈ 0.1172

Here X = the number of heads in 10 tosses.

And the probability of getting 5 heads in 10 trials, given p = 0.5, is:

P(X = 5) = C(10, 5) × 0.5^5 × 0.5^5 = 252/1024 ≈ 0.2461
Image from https://homepage.divms.uiowa.edu/~mbognar/applets/bin.html

You can play around with this calculator here.

In the examples above, it is straightforward to calculate the chance of getting x heads, because we know the value of p beforehand.
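If you prefer code to calculators, here is a minimal sketch that reproduces both numbers; Python with SciPy is my choice of tooling here, not something the examples above require:

```python
# A quick check of the two binomial probabilities above.
from scipy.stats import binom

# P(X = 3): exactly 3 heads in 10 tosses of a fair coin
print(binom.pmf(k=3, n=10, p=0.5))  # 0.1171875, i.e. ~0.1172

# P(X = 5): exactly 5 heads in 10 tosses of a fair coin
print(binom.pmf(k=5, n=10, p=0.5))  # 0.24609375, i.e. ~0.2461
```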

Building intuition around Beta Distribution

Now, let's imagine that we are given a coin by a manufacturing company, but we don't know whether it is a fair or a biased one. In other words, we don't know the true value of p (the probability of getting heads) for this particular coin. Our task is to estimate the value of p using the data we collect from tossing the coin, which will help us determine whether the coin is fair or biased towards heads or tails.

The easiest approach is to toss the coin 100 or 1,000 times and record the resulting sequence of Hs and Ts. After N tosses, count the number of heads; our best estimate of p (after the experiment) is then simply:

p = # heads / # Trials

So if we got 20 heads in 50 tosses, our best estimate of p would be:

p = 20/50 = 0.40

From here on, let N = total trials and z = total successes (heads).

But the same outcome or sequence of (N, z) = (50, 20) is also possible if p = 0.2 or p = 0.8, or indeed for any of the infinitely many values p can take in the interval (0, 1). The only exceptions are the extremes p = 0 and p = 1, where the probability of getting 20 heads out of 50 tosses is zero, since only the sequences TTT…T or HHH…H are then possible.

P(X=20) for different values of p

The table above gives the probability of the desired outcome, observing 20 heads in 50 tosses, for different candidate values of p, the quantity we are trying to estimate. I have only taken a few values, but p can take any value in [0, 1], e.g. 0.23, 0.49, 0.91, so the possibilities from the continuous interval [0, 1] are unlimited.

We have used the following formula to calculate the probability of getting 20 heads for various values of p:

P(X = 20 | p) = C(50, 20) × p^20 × (1 − p)^30

This formula follows from the rule of independence in probability. For a given p and a specific sequence such as x = {HHHHHTTTTTTT}, we have P(X = x | p) = p^(# heads) × (1 − p)^(# tails); the binomial coefficient C(50, 20) then counts how many such sequences contain exactly 20 heads.

where p can take values in the interval (0, 1).
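To reproduce the table (and, shortly, the graph), here is a small sketch in Python with SciPy; the particular grid of p values is my own illustrative choice:

```python
# Evaluate P(X = 20 | N = 50, p) for a grid of candidate values of p.
from scipy.stats import binom

N, z = 50, 20
for p in [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7]:
    # binom.pmf computes C(N, z) * p**z * (1 - p)**(N - z)
    print(f"p = {p:.1f}  ->  P(X = 20) = {binom.pmf(z, N, p):.4f}")
# The printed values peak around p = 0.4.
```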

The table above can be visualised as the graph below. On the x-axis are the possible values of p in (0, 1); on the y-axis is P(X = k | p), the probability of getting k heads (successes) for that particular p, where k is the known, observed count (here k = 20).

P(X = 20) is maximised when p = 0.4

It's interesting that, over the interval [0, 1], P(X = 20), i.e. the probability of getting 20 heads out of 50 tosses, is maximised when p = 0.4, which is exactly what we calculated with the earlier formula, dividing the total number of heads by the total number of trials N, to estimate p for a coin whose p was unknown.
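This is no coincidence. A quick calculus check: the likelihood p^z × (1 − p)^(N − z) is maximised where the derivative of its logarithm is zero,

d/dp [ z·ln(p) + (N − z)·ln(1 − p) ] = z/p − (N − z)/(1 − p) = 0,

and solving gives p = z/N = 20/50 = 0.4, exactly the simple heads-divided-by-trials estimate.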

Up until now, we’ve covered the fundamental concept of the Beta distribution, which is to model the probability (p) of an event in situations where the true value of p is not known.

Since p can take continuous values, the curve above is not yet a probability distribution: the area under it does not, in general, equal 1. To turn it into one, we have to divide the formula by a normalising factor:

Normalising factor = ∫₀¹ p^z × (1 − p)^(N − z) dp, for p in the range (0, 1)

Think of this normalising factor using the following analogy: your website gets traffic from three countries, X, Y, and Z. To calculate each country's share, you divide by their sum. That is essentially what the normalising factor does: it sums (integrates) the probability over all possible values of p, so that dividing by it bounds the total probability across the different values of p to 1.
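Here is a small sketch, assuming SciPy, that computes this normalising factor for our coin (N = 50, z = 20) both by brute-force numerical integration and via the closed-form Beta function it equals:

```python
# The normalising factor: the area under the unnormalised likelihood
# p**z * (1 - p)**(N - z) over the interval (0, 1).
from scipy.integrate import quad
from scipy.special import beta as beta_fn

N, z = 50, 20

area, _ = quad(lambda p: p**z * (1 - p)**(N - z), 0, 1)

print(area)                       # a tiny number, certainly not 1
print(beta_fn(z + 1, N - z + 1))  # the same value, in closed form
```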

Beta distribution with parameters alpha and beta

In the above distribution:

Alpha = # of heads (successes)
Beta = # of tails (failures)
x = possible values of the probability p across [0, 1]

(Strictly speaking, starting from a uniform prior, the normalised distribution is Beta(z + 1, N − z + 1), i.e. alpha = heads + 1 and beta = tails + 1; the intuition is the same, though: alpha counts successes and beta counts failures.)

In essence, the Beta distribution gives us a probability distribution over a probability itself, making it a useful tool for modelling uncertainty in statistics. This idea is at the heart of the Beta distribution, which is widely used for estimating unknown probabilities; the curve we maximised above is known as the likelihood function, and its peak is the maximum likelihood estimate.

Now, in the example of the coin manufactured by the company, we can report a range of plausible values of p instead of a single fixed estimate. The likelihood is maximised around p = 0.4, but values such as 0.3 and 0.5 are also possible, just less likely. The more data we collect, the narrower the resulting Beta distribution becomes around the true p.
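Here is a minimal sketch of that narrowing effect, assuming SciPy, a uniform prior (so the distribution after z heads in N tosses is Beta(z + 1, N − z + 1)), and the same 40% heads rate at larger, hypothetical sample sizes:

```python
# How the Beta distribution narrows as we collect more data.
from scipy.stats import beta

for N, z in [(50, 20), (500, 200), (5000, 2000)]:
    posterior = beta(z + 1, N - z + 1)  # uniform prior assumed
    lo, hi = posterior.interval(0.95)   # central 95% interval for p
    print(f"N = {N:>4}: 95% of the probability mass for p lies in "
          f"({lo:.3f}, {hi:.3f})")
```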

Concluding Remarks

Any system that follows a Bernoulli (success/failure) process can be modelled with a Beta distribution when p is unknown. For example, the conversion rate of a website can easily be modelled as a Beta distribution, giving a probable range for the conversion rate in the interval (0, 1) instead of the single fixed number that results from simple division.
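A minimal sketch of that idea; the visit and conversion counts are made up for illustration, and SciPy is again assumed:

```python
# Modelling a website's conversion rate with a Beta distribution.
from scipy.stats import beta

visits, conversions = 1000, 32                          # hypothetical data
rate = beta(conversions + 1, visits - conversions + 1)  # uniform prior

print(rate.mean())          # point estimate, close to 32/1000 = 0.032
print(rate.interval(0.95))  # a plausible range instead of a single number
```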

Last but not least, here is the mathematical formula of the Beta distribution from Wikipedia. Now you have no reason to be afraid of complex mathematical formulas :D

f(x; α, β) = x^(α − 1) × (1 − x)^(β − 1) / B(α, β), where B(α, β) = Γ(α) Γ(β) / Γ(α + β)

(Formula from Wikipedia)

If you liked this article, follow me or share it with anyone who wants to learn about this topic. If you have any questions, do post them in the comments.

And don’t hesitate to add me on LinkedIn :)
