How hard is it to get a git hash collision?

Posted by Diego Assencio on 2013.12.17 under Programming (Git)

Every time a commit is added to a git repository, a hash string which identifies this commit is generated. This hash is computed with the SHA-1 algorithm and is 160 bits (20 bytes) long. Expressed in hexadecimal notation, such hashes are 40 digit strings.

Since each commit must have a unique hash, one question that naturally arises is: how likely is it that the hash generated for a future commit will coincide with the hash of some past commit?

This question is closely related to the birthday paradox. In a previous post, I have discussed the birthday paradox in detail and proved that when we randomly select $n$ elements from a set containing $N$ elements (a set contains only distinct elements), the probability of obtaining a given element more than once will be greater than $50\%$ if $n \geq 1.2\sqrt{N}$.

Since each hash value produced by SHA-1 is 160 bits long, the total number of possible generated hashes is: $$ N_{\textrm{SHA-1}} = 2^{160} \approx 1.45 \times 10^{48} $$ To obtain the same SHA-1 hash value for different inputs (in technical words, to obtain a hash collision) with probability larger than $50\%$, we must generate approximately: $$ n = 1.2\sqrt{2^{160}} = 1.2\times 2^{80} = 1.45 \times 10^{24} $$ hashes.

Assume that a given project contains $D$ active developers who generate $C_{\textrm{day}}$ commits per day. The number of commits $C_{\textrm{year}}$ generated per year by such a group is (years with $366$ days are ignored here; the impact of this assumption on the end result is negligible for our purposes): $$ C_{\textrm{year}} = 365 D C_{\textrm{day}} $$ At such a rate, the number of years $Y$ necessary to produce a hash collision with $50\%$ probability would be: $$ Y = \displaystyle\frac{N_{\textrm{SHA-1}}}{C_{\textrm{year}}} \Longrightarrow \boxed{ \displaystyle Y = \frac{4.0 \times 10^{21}}{D C_{\textrm{day}}} } \label{post_eacd6eedf46c9dd596a5f12221ad15b8_eq_y} $$

Assume the world has $7$ billion ($7 \times 10^9$) people and every person is a developer (in other words, $D = 7 \times 10^9$). If every person on Earth adds a new commit every second ($C_{\textrm{day}} = 86400$ since a day has $86400$ seconds and our developers are assumed to never sleep!), then equation \eqref{post_eacd6eedf46c9dd596a5f12221ad15b8_eq_y} yields: $$ \displaystyle Y = \frac{4.0 \times 10^{21}}{(7 \times 10^9)(86400)} = 6.6 \times 10^6 \textrm{years} $$ At that rate, mankind would need nothing less than $6.6$ million years to produce a number of commits large enough to create a hash collision with $50\%$ probability!

In a more realistic scenario, a large project might have $ 1000$ active developers who commit $10$ commits per day. In these circumstances, it would take approximately $4 \times 10^{17}$ years for a collision to happen with $50\%$ probability. For comparison, the age of the universe is estimated to be $13.8 \times 10^9$ years.

To summarize, the probability of producing a hash collision on a git repository is so small that it will very likely not happen during your lifetime. In any case, if you are paranoid about this, take a look at this page.


Andrei on May 02, 2016:
And if developers will be 1 or 10 million who will be making the same amount of commits per day. how much is enough to git hash
Diego Assencio on May 02, 2016:
@Andrei: I'm not sure if I understand your question correctly, but for any number of developers $D$ and if $C_{\textrm{day}}$ is the number of commits per day per developer, a hash collision will take about $Y$ years to happen, where:

\displaystyle Y = \frac{4.0 \times 10^{21}}{D C_{\textrm{day}}}.
Chucky on Mar 23, 2018:
@Diego: Nice post.

However, I am pretty sure that the _tone_ of the post (esp. the conclusion) induces the reader to think that 6.6 x 10^6 years are required for the first collision to actually happen and that none will occur before that time.

Probabilities (theoretical matter) are related with statistics (practical matter) only through an analysis of "big-enough samples".

As far as mathematics are concerned, a non null probability means that an event may happen, no matter how close it is to 0. The first collision may actually happen tomorrow !

The conclusion to the post should have been something more like (I am no writer though):
"Starting with the very first git commit, the probability of a collision is non-null (= may happen), while it rises over time, as more and more commits are created all around the world.

The chances are pretty slim this actually happen, though, given how far in time we are from a 50% prob. status.
Thomas W. on Aug 16, 2018:
In 2017 SHA1 was broken after 6600 CPU years. Nice...
Mattias De Wael on Mar 17, 2019:
(thank you for this nice article) Am I correct that you only consider commits in your calculation? What about all files in revision and all 'old' files? They are also hashed alongside the commits and tree nodes? How big would the effect be on your calculations?
Diego Assencio on Mar 17, 2019:
@Mattias: Yes, I only considered commits. Git does indeed hash more than "simply commits", as is nicely described here:

If you are considering the possibility of a hash collision while taking into account all objects for which hashes are computed, the problem becomes a bit more complex, but the end result will still be the same: the probability of a collision is small enough to make it unworthy of your concern.

A valid concern though is whether someone can maliciously craft a commit to generate a collision. The answer is getting closer to "yes" as time goes by since attacks on SHA-1 are becoming dangerously strong:

Notice that even though attacks may exist and become only better over time, a collision is still extremely unlikely to happen in normal circumstances, i.e., through developers simply doing their work.

I hope that helps! :-)
Jaxcie on Apr 29, 2021:
I think its nice that you allow the developers to take a day of rest every Februrary 29th