How hard is it to get a git hash collision?


Posted by Diego Assencio on 2013.12.17 under Programming (Git)

Every time a commit is added to a git repository, a hash string which identifies this commit is generated. This hash is computed with the SHA-1 algorithm and is 160 bits (20 bytes) long. Expressed in hexadecimal notation, such hashes are 40 digit strings.

Since each commit must have a unique hash, one question that naturally arises is: how likely is it that the hash generated for a future commit will coincide with the hash of some past commit?

This question is closely related to the birthday paradox. In a previous post, I have discussed the birthday paradox in detail and proved that when we randomly select $n$ elements from a set containing $N$ elements (a set contains only distinct elements), the probability of obtaining a given element more than once will be greater than $50\%$ if $n \geq 1.2\sqrt{N}$.

Since each hash value produced by SHA-1 is 160 bits long, the total number of possible generated hashes is: $$ N_{\textrm{SHA-1}} = 2^{160} \approx 1.45 \times 10^{48} $$ To obtain the same SHA-1 hash value for different inputs (in technical words, to obtain a hash collision) with probability larger than $50\%$, we must generate approximately: $$ n = 1.2\sqrt{2^{160}} = 1.2\times 2^{80} = 1.45 \times 10^{24} $$ hashes.

Assume that a given project contains $D$ active developers who generate $C_{\textrm{day}}$ commits per day. The number of commits $C_{\textrm{year}}$ generated per year by such a group is (years with $366$ days are ignored here; the impact of this assumption on the end result is negligible for our purposes): $$ C_{\textrm{year}} = 365 D C_{\textrm{day}} $$ At such a rate, the number of years $Y$ necessary to produce a hash collision with $50\%$ probability would be: $$ Y = \displaystyle\frac{N_{\textrm{SHA-1}}}{C_{\textrm{year}}} \Longrightarrow \boxed{ \displaystyle Y = \frac{4.0 \times 10^{21}}{D C_{\textrm{day}}} } \label{post_eacd6eedf46c9dd596a5f12221ad15b8_eq_y} $$

Assume the world has $7$ billion ($7 \times 10^9$) people and every person is a developer (in other words, $D = 7 \times 10^9$). If every person on Earth adds a new commit every second ($C_{\textrm{day}} = 86400$ since a day has $86400$ seconds and our developers are assumed to never sleep!), then equation \eqref{post_eacd6eedf46c9dd596a5f12221ad15b8_eq_y} yields: $$ \displaystyle Y = \frac{4.0 \times 10^{21}}{(7 \times 10^9)(86400)} = 6.6 \times 10^6 \textrm{years} $$ At that rate, mankind would need nothing less than $6.6$ million years to produce a number of commits large enough to create a hash collision with $50\%$ probability!

In a more realistic scenario, a large project might have $ 1000$ active developers who commit $10$ commits per day. In these circumstances, it would take approximately $4 \times 10^{17}$ years for a collision to happen with $50\%$ probability. For comparison, the age of the universe is estimated to be $13.8 \times 10^9$ years.

To summarize, the probability of producing a hash collision on a git repository is so small that it will very likely not happen during your lifetime. In any case, if you are paranoid about this, take a look at this page.

Comments

Andrei on May 02, 2016:
And if developers will be 1 or 10 million who will be making the same amount of commits per day. how much is enough to git hash
Diego Assencio on May 02, 2016:
@Andrei: I'm not sure if I understand your question correctly, but for any number of developers $D$ and if $C_{\textrm{day}}$ is the number of commits per day per developer, a hash collision will take about $Y$ years to happen, where:

$
\displaystyle Y = \frac{4.0 \times 10^{21}}{D C_{\textrm{day}}}.
$

Leave a reply

NOTE: A name and a comment (max. 1024 characters) must be provided; all other fields are optional. Equations will be processed if surrounded with dollar signs (as in LaTeX). You can post up to 5 comments per day.