How hard is it to get a git hash collision?


Posted by Diego Assencio on 2013.12.17 under Programming (Git)

Every time a commit is added to a git repository, a hash string which identifies this commit is generated. This hash is computed with the SHA-1 algorithm and is 160 bits (20 bytes) long. Expressed in hexadecimal notation, such hashes are 40 digit strings.

Since each commit must have a unique hash, one question that naturally arises is: how likely is it that the hash generated for a future commit will coincide with the hash of some past commit?

This question is closely related to the birthday paradox. In a previous post, I have discussed the birthday paradox in detail and proved that when we randomly select $n$ elements from a set containing $N$ elements (a set contains only distinct elements), the probability of obtaining a given element more than once will be greater than $50\%$ if $n \geq 1.2\sqrt{N}$.

Since each hash value produced by SHA-1 is 160 bits long, the total number of possible generated hashes is: $$ N_{\textrm{SHA-1}} = 2^{160} \approx 1.45 \times 10^{48} $$ To obtain the same SHA-1 hash value for different inputs (in technical words, to obtain a hash collision) with probability larger than $50\%$, we must generate approximately: $$ n = 1.2\sqrt{2^{160}} = 1.2\times 2^{80} = 1.45 \times 10^{24} $$ hashes.

Assume that a given project contains $D$ active developers who generate $C_{\textrm{day}}$ commits per day. The number of commits $C_{\textrm{year}}$ generated per year by such a group is (years with $366$ days are ignored here; the impact of this assumption on the end result is negligible for our purposes): $$ C_{\textrm{year}} = 365 D C_{\textrm{day}} $$ At such a rate, the number of years $Y$ necessary to produce a hash collision with $50\%$ probability would be: $$ Y = \displaystyle\frac{N_{\textrm{SHA-1}}}{C_{\textrm{year}}} \Longrightarrow \boxed{ \displaystyle Y = \frac{4.0 \times 10^{21}}{D C_{\textrm{day}}} } \label{post_eacd6eedf46c9dd596a5f12221ad15b8_eq_y} $$

Assume the world has $7$ billion ($7 \times 10^9$) people and every person is a developer (in other words, $D = 7 \times 10^9$). If every person on Earth adds a new commit every second ($C_{\textrm{day}} = 86400$ since a day has $86400$ seconds and our developers are assumed to never sleep!), then equation \eqref{post_eacd6eedf46c9dd596a5f12221ad15b8_eq_y} yields: $$ \displaystyle Y = \frac{4.0 \times 10^{21}}{(7 \times 10^9)(86400)} = 6.6 \times 10^6 \textrm{years} $$ At that rate, mankind would need nothing less than $6.6$ million years to produce a number of commits large enough to create a hash collision with $50\%$ probability!

In a more realistic scenario, a large project might have $ 1000$ active developers who commit $10$ commits per day. In these circumstances, it would take approximately $4 \times 10^{17}$ years for a collision to happen with $50\%$ probability. For comparison, the age of the universe is estimated to be $13.8 \times 10^9$ years.

To summarize, the probability of producing a hash collision on a git repository is so small that it will very likely not happen during your lifetime. In any case, if you are paranoid about this, take a look at this page.

Comments

Andrei on May 02, 2016:
And if developers will be 1 or 10 million who will be making the same amount of commits per day. how much is enough to git hash
Diego Assencio on May 02, 2016:
@Andrei: I'm not sure if I understand your question correctly, but for any number of developers $D$ and if $C_{\textrm{day}}$ is the number of commits per day per developer, a hash collision will take about $Y$ years to happen, where:

$
\displaystyle Y = \frac{4.0 \times 10^{21}}{D C_{\textrm{day}}}.
$
Chucky on Mar 23, 2018:
@Diego: Nice post.

However, I am pretty sure that the _tone_ of the post (esp. the conclusion) induces the reader to think that 6.6 x 10^6 years are required for the first collision to actually happen and that none will occur before that time.

Probabilities (theoretical matter) are related with statistics (practical matter) only through an analysis of "big-enough samples".

As far as mathematics are concerned, a non null probability means that an event may happen, no matter how close it is to 0. The first collision may actually happen tomorrow !

The conclusion to the post should have been something more like (I am no writer though):
"Starting with the very first git commit, the probability of a collision is non-null (= may happen), while it rises over time, as more and more commits are created all around the world.

The chances are pretty slim this actually happen, though, given how far in time we are from a 50% prob. status.

Leave a reply

NOTE: A name and a comment (max. 1024 characters) must be provided; all other fields are optional. Equations will be processed if surrounded with dollar signs (as in LaTeX). You can post up to 5 comments per day.