The birthday paradox

Posted by Diego Assencio on 2013.12.16 under Mathematics (Statistics and probability)

If $n$ people are randomly selected from a very large population, what is the probability that at least two of them will have the same birthday?

This problem is a very interesting one and is commonly known as the birthday paradox (the reason for the "paradox" will become clear as we solve the problem). The solution is relatively simple if we assume that the probability $P$ that a randomly selected person is born on a given day $D$ of the year is the same for every day of the year: $P(D) = 1/365$ (in other words, if we assume a uniform distribution; we will ignore years with $366$ days).

More generally, one could ask: if $n$ elements are randomly selected from a set $S$ with $N$ elements (all elements of $S$ are distinct), with the act of selecting each element being independent from all other element selections, what is the probability that the at least one element is selected two or more times?

In the original question, a birthday is an element of the set $Y = \{1,2,\ldots,365\}$ which represents all days of the year ($Y$ contains $365$ elements). Selecting a person at random is equivalent to randomly selecting an element from $Y$. In what follows, the event of selecting a given element more than once will be called a collision.

Let's solve the problem in its general formulation (not necessarily thinking of birthdays). We first randomly select an element $E_1$ from a set which contains $N$ elements. The total number of possible $E_1$ values is $N$. Then we randomly select the second element $E_2$. The probability that this second element is distinct from the first one is: $$ P^{(2)} := P(E_2 \notin \{E_1\}) = \displaystyle\frac{N-1}{N} $$ since there are $N-1$ elements which are distinct from $E_1$ (above I wrote $E_2 \notin \{E_1\}$ instead of $E_2 \neq E_1$ for reasons which will become clear below). Then we randomly select the third element $E_3$. The probability that this element is distinct from the first two selected elements is: $$ P^{(3)} := P(E_3 \notin \{E_1,E_2\}) = \displaystyle\frac{N-2}{N} $$ since there are $N-2$ elements distinct from both $E_1$ and $E_2$ if $E_1 \neq E_2$. We keep on going this way until we select the $n$-th element. Assuming the $(n-1)$ already selected elements are all distinct from each other, the probability that the $n$-th element is distinct from the $(n-1)$ previously selected elements is: $$ P^{(n)} := P(E_n \notin \{E_1,E_2,\ldots,E_{n-1}\}) = \displaystyle\frac{N-(n-1)}{N} $$ since after the first $(n-1)$ elements are selected, there are still $N - (n-1)$ elements left which were not yet selected (assuming, as mentioned above, that the $(n-1)$ elements already selected are all distinct from each other).

Since the act of randomly selecting an element is independent from all other random element selections, the probability $P^*(n)$ of selecting $n$ distinct elements is then the multiplication of the probabilities $P^{(i)}$ computed above ($i = 2,\ldots,n$). To emphasize, $P^{(i)}$ is the probability of selecting an $i$-th element which is distinct from the $(i-1)$ previously selected elements assuming they are all distinct from each other: $$ P^*(n) = P^{(2)} \times \ldots \times P^{(n)} $$ The probability $P(n)$ of at least two among the $n$ selected elements being equal is then: $$ \begin{eqnarray} P(n) &=& 1 - P^*(n) \nonumber\\[5pt] &=& 1 - P^{(2)} \times \ldots \times P^{(n)} \nonumber\\[5pt] &=& 1 - \displaystyle\left(\frac{N-1}{N}\right)\left(\frac{N-2}{N}\right) \ldots \left(\frac{N-(n-1)}{N}\right) \nonumber\\[5pt] &=& 1 - \prod_{i=1}^{n-1}\left(\frac{N-i}{N}\right) \nonumber\\[5pt] &=& 1 - \frac{\prod_{i=0}^{n-1}(N-i)}{N^n} \nonumber\\[5pt] &=& 1 - \frac{\prod_{i=0}^{N-1}(N-i)}{N^n\prod_{i=n}^{N-1}(N-i)} \label{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_deriv_pn} \end{eqnarray} $$ And therefore, noticing that the top and bottom products on the last line of equation \eqref{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_deriv_pn} are the definitions of $N!$ and $(N-n)!$ respectively, we get: $$ \boxed{ \displaystyle P(n) = 1 - \prod_{i=1}^{n-1}\left(\frac{N-i}{N}\right) = 1 - \frac{N!}{N^n(N - n)!} } \label{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_pn} $$ The result on the fourth line of equation \eqref{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_deriv_pn} has been added to the final result as it makes the computation of $P(n)$ very easy to do on a computer. To clarify, computing $P(n)$ directly using the version involving factorials will not be a trivial task since $N!$ can be an immense number (it will definitely be for $N = 365$) which most calculators/programs will not be able to deal with due to the use of finite precision arithmetic. Roughly speaking, computers and calculators usually store numbers with a fixed number of precision digits and are therefore uncapable of handling very large numbers. I will show later how one can use the product version of $P(n)$ (the middle one on equation \eqref{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_pn}) to easily compute $P(n)$ with octave (see the "bonus" section below).

Consider again the original problem. How many people do we need to select to have the probability of two or more of them having the same birthday be larger than or equal to $50\%$? Perhaps $365/2 \approx 182$? Or $150$? The answer is, surprisingly, $23$: $$ P(23) = 1 - \frac{365!}{(365)^{23}(365 - 23)!} \approx 0.5073 $$ In other words, with only $n = 23$ randomly selected people we can already expect two or more of them to have the same birthday with more than $50\%$ probability! If the "paradox" on "birthday paradox" was not yet clear to you, it probably is now. Figure 1 shows the probability $P(n)$ of a collision (two or more people having the same birthday) for different values of $n$ and $N = 365$.

Fig. 1: Birthday paradox: collision probability $P(n)$ versus number $n$ of elements randomly selected from of a set $Y = \{1,2,\ldots,365\}$ representing the days of the year. $Q(n) = 1 - e^{-n(n-1)/(2N)}$ is a lower bound estimate for $P(n)$.

Also interestingly, we can expect a collision with $99\%$ probability if we select as few as $n = 57$ people: $$ P(57) \approx 0.9901 $$

A lower bound estimate for $P(n)$

There is a simple lower bound estimate for the probability of a collision $P(n)$ when $n$ elements are randomly selected from a set of set $N$. From equation \eqref{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_pn} we get: $$ \displaystyle P(n) = 1 - \prod_{i=1}^{n-1}\left(\frac{N-i}{N}\right) = 1 - \prod_{i=1}^{n-1}\left(1 - \frac{i}{N}\right) \geq 1 - \prod_{i=1}^{n-1}e^{-i/N} $$ where above we used the fact that $1 - x \leq e^{-x}$ for all $x \geq 0$ (indeed, for $x = 0$ both sides are equal to $1$; the derivative of $(1-x)$ is $-1$ while the derivative of $e^{-x}$ is $-e^{-x} \geq -1$ for all $x \geq 0$, so $(1-x)$ decreases faster than $e^{-x}$ for $x \geq 0$). Then: $$ \displaystyle P(n) \geq 1 - e^{-\frac{1}{N}\sum_{i=1}^{n-1} i} $$ But since: $$ \sum_{i=1}^{n-1} i = \displaystyle\frac{n(n-1)}{2} $$ we finally obtain: $$ \boxed{ P(n) \geq Q(n) := 1 - e^{-n(n-1)/(2N)} } \label{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_approx_p} $$ You can compare how close $Q(n)$ is to $P(n)$ on figure 1. Equation \eqref{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_approx_p} is very interesting since it makes it easy to compute a value of $n$ for which we will have a collision with probability larger than or equal to $p$: $$ \begin{eqnarray} p = 1 - e^{-n(n-1)/(2N)} &\Longrightarrow& \displaystyle\frac{-n(n-1)}{2N} = \log(1 - p) \nonumber\\[5pt] &\Longrightarrow& \displaystyle\frac{n(n-1)}{2N} = \log\left(\displaystyle\frac{1}{1-p}\right) \nonumber\\[5pt] &\Longrightarrow& n^2 \geq 2N\log\left(\displaystyle\frac{1}{1-p}\right) \end{eqnarray} $$ Therefore: $$ \boxed{ n \geq \sqrt{2N\log\left(\displaystyle\frac{1}{1-p}\right)} } \label{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_n} $$ will yield a collision with probability larger than or equal to $p$. For $N = 365$ and $p = 0.5$, we get $n \approx 22.49$, so indeed for $n = 23$ we will have collisions with at least $50\%$ probability.

The birthday paradox gets even stranger if we take $N = 10^6$ (a million) and $p = 0.5$: for those parameters, $n \approx 1177.4$, so we will have a collision with more than $50\%$ probability with as few as $1180$ elements! In general, to produce a collision with $50\%$ probability for a given $N$, we can select a number of elements given by: $$ \boxed{ n \geq 1.2\sqrt{N} } \label{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_n_approx} $$ The equation above is obtained by setting $p = 0.5$ on equation \eqref{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_n} and rounding up the constant factor. For $N = 365$, equation \eqref{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_n_approx} yields $n \geq 22.92$ (no surprises here).

Equation \eqref{post_c59e4c7dd11d1d61bf902136ee9cafcb_eq_n_approx} closes our discussion of the problem. To have a collision with more than $50\%$ probability when randomly selecting elements from a set of size $N$, we must merely select around $1.2\sqrt{N}$ elements. The paradox originates from the fact that for large values of $N$, $\sqrt{N}$ is significantly smaller than $N$ itself.

Bonus: using octave to compute $P(n)$

On Ubuntu/Debian, you can install octave by opening a terminal and running the following command:

sudo apt-get install octave

Now create a file called birthday.m with the following contents (or download it directly):

# n: number of randomly selected elements
# N: number of elements on the set from which elements are selected
function p = birthday(n,N)

	# we need to choose at least two elements
	if n < 2
		p = 0;
		pc = 1.0;
		for i = 1:(n-1)
			pc = pc * (N-i)/N;
		p = 1 - pc;


Now start octave:


and compute $P(n)$ as in the example below (here I am assuming $n = 23$ and $N = 365$):

octave:1> birthday(23,365)
ans =  0.50730


No comments posted yet.