All of lore.kernel.org
 help / color / mirror / Atom feed
* What about SHA-1 collisions?
@ 2012-11-06 20:26 Josef Wolf
  2012-11-06 21:41 ` John McKown
  0 siblings, 1 reply; 5+ messages in thread
From: Josef Wolf @ 2012-11-06 20:26 UTC (permalink / raw)
  To: git

Hello,

we all know, the probability for SHA-1 collisions is very, very low, almost
non-existant. But we also know that they are not impossible.

Just for curiosity: what would happen if such a collision would occur within
one repository?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: What about SHA-1 collisions?
  2012-11-06 20:26 What about SHA-1 collisions? Josef Wolf
@ 2012-11-06 21:41 ` John McKown
  2012-11-06 22:09   ` Josef Wolf
  0 siblings, 1 reply; 5+ messages in thread
From: John McKown @ 2012-11-06 21:41 UTC (permalink / raw)
  To: git

Josef Wolf <jw <at> raven.inka.de> writes:

> 
> Hello,
> 
> we all know, the probability for SHA-1 collisions is very, very low, almost
> non-existant. But we also know that they are not impossible.
> 
> Just for curiosity: what would happen if such a collision would occur within
> one repository?
> 

In a sense, this cannot happen. Suppose you have a new working directory.
You do a "git init" to initialize it for use by git. You then copy in a
bunch of data from elsewhere. By chance, files "a" and "b" have different
content, but the same sha1 (they collide). The "git add ." command is
basically a short cut for doing something like:
for i in *;do git add $i;done
That is, it seems to add each file, one at a time in some order. Suppose
it creates the sha1 for "a" first. It then creates the appropriate
"stuff" for file "a" in the .git subdirectory, based on the sha1 value. Now,
it gets around to processing "b". It gets the sha1 value of b and finds
that it already has an entry for that value. At that point, the "git add" thinks
"Oh, I've already processed this file. No need to do anything!" So the contents
of file "b" are not saved anywhere in git and, bottom line, that version of "b" 
will not be in the git repository. Ever. Because "a" already has that SHA1 "tied 
up" and it is (theoretically) never released.

I think of the SHA1 value being a unique key into a "write once" database. Once 
you've added some content (a file) into the database, then the SHA1 value of
that content (file) is unmodifiable. Attempts to write another record into the 
database is rejected (in a read DB, you'd get some sort of DUPLICATE KEY 
response). Git considers the "duplicate key" to be just fine because it ASSUMES 
that the SHA1 is unique to the first file (content) which generates it.

Hope I made sense.

John

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: What about SHA-1 collisions?
  2012-11-06 21:41 ` John McKown
@ 2012-11-06 22:09   ` Josef Wolf
  2012-11-07 15:42     ` Shawn Pearce
  0 siblings, 1 reply; 5+ messages in thread
From: Josef Wolf @ 2012-11-06 22:09 UTC (permalink / raw)
  To: git

On Tue, Nov 06, 2012 at 09:41:29PM +0000, John McKown wrote:
> Josef Wolf <jw <at> raven.inka.de> writes:
> > Just for curiosity: what would happen if such a collision would occur within
> > one repository?

> In a sense, this cannot happen.

In the scenario you described, contents of this version of file "b" are lost
and replaced by the contents of file "a". So file "b" is broken.

What happens when files "a" and "b" are added into different repositories?
File "a" is added to repos "A", and file "b" is added to repos "B". Now it
depends from which repository you fetch the collided blob first. If you fetch
it from "A", file "b" will be broken. If you fetch first from "B", your "a"
will be broken.

It becomes even more interesting, if some commit or tree object would have
the same SHA1 as some other object. I guess, in such a case the repository
would be completely hosed?

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: What about SHA-1 collisions?
  2012-11-06 22:09   ` Josef Wolf
@ 2012-11-07 15:42     ` Shawn Pearce
  2012-11-07 23:44       ` Andrew Ardill
  0 siblings, 1 reply; 5+ messages in thread
From: Shawn Pearce @ 2012-11-07 15:42 UTC (permalink / raw)
  To: Josef Wolf, git

On Tue, Nov 6, 2012 at 2:09 PM, Josef Wolf <jw@raven.inka.de> wrote:
>
> On Tue, Nov 06, 2012 at 09:41:29PM +0000, John McKown wrote:
> > Josef Wolf <jw <at> raven.inka.de> writes:
> > > Just for curiosity: what would happen if such a collision would occur within
> > > one repository?
>
> > In a sense, this cannot happen.
>
> In the scenario you described, contents of this version of file "b" are lost
> and replaced by the contents of file "a". So file "b" is broken.
>
> What happens when files "a" and "b" are added into different repositories?
> File "a" is added to repos "A", and file "b" is added to repos "B". Now it
> depends from which repository you fetch the collided blob first. If you fetch
> it from "A", file "b" will be broken. If you fetch first from "B", your "a"
> will be broken.
>
> It becomes even more interesting, if some commit or tree object would have
> the same SHA1 as some other object. I guess, in such a case the repository
> would be completely hosed?

When exchanging objects over the network, Git compares byte-for-byte
any object that one side sent that the other side already has, and
complains loudly when there is a collision detected. This only works
if the sender includes the "wrong" content for the named object. Git
also does assume the SHA-1 is unique and that it is not always
necessary to transmit the object. In these cases you would not be able
to detect the collision, because there isn't one. Your repository
would simply be using the wrong content for a file. Presumably one
would notice your build doesn't work anymore and investigate why.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: What about SHA-1 collisions?
  2012-11-07 15:42     ` Shawn Pearce
@ 2012-11-07 23:44       ` Andrew Ardill
  0 siblings, 0 replies; 5+ messages in thread
From: Andrew Ardill @ 2012-11-07 23:44 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Josef Wolf, git

 A recent article [1] did an analysis on the number of items needed
from a given range to have a 50% chance of a collision. The famous
birthday paradox scenario was used, where you only need 23 people
before the chance of two of them having the same birthday is over 50%.
In this scenario there are ~366 options available to be picked from,
and 23 is significantly small in comparison.

The mathematics behind these statistics was extended to account for
any sized range (call it N) and it turns out that the number of items
(k) that can be picked before you have exceeded a given percentage
chance (T) of _not_ having a collision is roughly

k ~= sqrt(-2N.ln(T))

As pedrocr pointed out on Hacker News [2]

"Applying the formula for 160bit SHA-1 you need 1.7e23 objects to get
a 1% chance of collision. The current Linus kernel repository has 2.7
million objects. So to get a collision you'd need a repository that's
6e16 times larger. That should be plenty.

For some wacky perspective that's 10 million kernel sized
contributions for every man woman and child on earth together in a
single repository. It would seem git will reach plenty of other
bottlenecks before SHA-1 becomes a problem..."

An interesting analysis, even given that the OP presumes a collision
in their question.

Regards,

Andrew Ardill

[1] http://www.solipsys.co.uk/new/TheBirthdayParadox.html
[2] http://news.ycombinator.com/item?id=4753198

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2012-11-07 23:45 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-11-06 20:26 What about SHA-1 collisions? Josef Wolf
2012-11-06 21:41 ` John McKown
2012-11-06 22:09   ` Josef Wolf
2012-11-07 15:42     ` Shawn Pearce
2012-11-07 23:44       ` Andrew Ardill

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.