* Re: [PATCH] Implement fast hash-collision detection
       [not found] <1322546563.1719.22.camel@yos>
@ 2011-11-29  9:07 ` Jeff King
  2011-11-29 10:24   ` Ævar Arnfjörð Bjarmason
                     ` (3 more replies)
  2011-11-29 17:08 ` Shawn Pearce
  1 sibling, 4 replies; 20+ messages in thread
From: Jeff King @ 2011-11-29  9:07 UTC (permalink / raw)
  To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds

[Your original message was almost certainly bounced by git@vger because
 it surpassed the 100K limit; I'll try to quote liberally for those who
 missed the original. You may want to split your content and repost.]

On Mon, Nov 28, 2011 at 10:02:43PM -0800, Bill Zaumen wrote:

> Maintains a database of CRCs of Git objects to allow SHA-1 hash
> collisions to be detected with high probability (1 - 1/2^32) and with
> little computational overhead.  The CRCs cover the content of Git
> objects, but not the header.  For loose objects, these are stored in
> subdirectories of GIT_DIRECTORY/objects/crcs, with each subdirectory's
> name consisting of the first two hexadecimal digits of the
> corresponding object's SHA-1 hash.  For each pack file, FILE.pack, the
> CRCs are stored in a FILE.mds, in the same order as the SHA-1 hashes
> that appear in FILE.idx.  Checks for hash collisions are made whenever
> a new loose object is created.

I'm confused. Is this about avoiding accidental collisions, or about
avoiding malicious collision attacks?

Let's assume the former for a minute. The usual way of considering
collision likelihood is not by probabilities, but by the number of items
that must be selected to achieve a >50% probability that there is a
collision. Which is the square root of the number of total items, or
half the bit-space.

So we expect to see a collision in 160-bit SHA-1 after writing about
2^80 objects. The linux-2.6 repository has about 2.2 million objects
after 6.5 years of development. If development continues at this pace,
we would expect a collision in a mere 10^18 years.

Assuming your 32-bit CRC is statistically independent of the SHA-1
value, it adds 32 bits to the hash, or 16 bits to the number of objects
we expect to generate (i.e., 2^96). That bumps us up to 10^23 years.
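
To make the arithmetic concrete, here is a back-of-the-envelope sketch
in Python (the rate is the linux-2.6 figure above; "collision likely
after 2^(bits/2) objects" is the usual birthday approximation):

    rate = 2.2e6 / 6.5             # linux-2.6 objects per year
    for bits in (160, 192):        # sha-1 alone, sha-1 plus a 32-bit CRC
        objects = 2 ** (bits / 2)  # objects until a collision is likely
        print("%d bits: ~%.0e years" % (bits, objects / rate))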

However, it may be of interest that the Sun is expected to burn out in a
mere 10^10 years[1].

So I'm not sure there is really any point in adding a few bits to the
hash length to detect an accidental collision. It's already
fantastically unlikely. Adding another probability on top does make it
less likely, but in the same ballpark of fantastic.

You can argue about whether linux-2.6 is a representative sample, or
whether the pace of object creation might increase. But the simple
answer is that we're many orders of magnitude away from having to care.

However, in your patch you write:

> +Security-Issue Details
> +----------------------
> +
> +Without hash-collision detection, Git has a higher risk of data
> +corruption due to the obvious hash-collision vulnerability, so the
> +issue is really whether a usable vulnerability exists. Recent research
> +has shown that SHA-1 collisions can be found in 2^63 operations or
> +less.  While one result claimed 2^53 operations, the paper claiming
> +that value was withdrawn from publication due to an error in the
> +estimate. Another result claimed a complexity of between 2^51 and 2^57
> +operations, and still another claimed a complexity of 2^57.5 SHA-1
> +computations. A summary is available at
> +<http://hackipedia.org/Checksums/SHA/html/SHA-1.htm#SHA-1>. Given the
> +number of recent attacks, possibly by governments or large-scale
> +criminal enterprises
> +(<http://www.csmonitor.com/World/terrorism-security/2011/0906/Iranian-government-may-be-behind-hack-of-Dutch-security-firm>,
> +<http://en.wikipedia.org/wiki/Operation_Aurora>,
> +<http://en.wikipedia.org/wiki/Botnet#Historical_list_of_botnets>),
> +which include botnets with an estimated 30 million computers, there is
> +reason for some concern: while generating a SHA-1 collision for
> +purposes of damaging a Git repository is extremely expensive
> +computationally, it is possibly within reach of very well funded
> +organizations. 2^32 operations, even if the operations are as
> +expensive as computing a SHA-1 hash of a modest source-code file, can
> +be performed in a reasonably short period of time on the type of
> +hardware widely used in desktop or laptop computers at present. With
> +sufficient parallelism, 30 million personal computers sufficient for
> +playing the latest video games could perform 2^56 operations in a
> +reasonable time.

...which makes me think that you do care about malicious collisions. All
of what you wrote above seems fairly accurate. Let's leave aside that
those numbers are for a collision attack as opposed to a pre-image
attack (collision attacks are hard to execute, as they require you to
generate two "matched" objects, one good and one evil, and then convince
somebody to first accept your good object, only to later replace it with
the evil one).

I have two concerns with this as a security mechanism:

First, let us assume that the implementation details of your scheme
work, and that git users will effectively be checking not just a SHA-1,
but now a SHA-1 plus your additional digest.

In that case, why use crc32 as the digest? It seems like a terrible
choice. Even assuming it's cryptographically secure, it's adding a mere
16 bits to the numbers you mentioned above. IOW, it's at best a band-aid
that pushes attacks off for a few years. But more importantly, it's
_not_ secure, and can be trivially forged.
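
To illustrate "trivially forged": crc32 is affine over GF(2), so for
equal-length messages the identity below holds, and an attacker can use
it to patch a message to any CRC value they like. A minimal sketch
(standard library only; the strings are made up):

    import zlib

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    a, b, c = b"good contract v1", b"evil contract v2", b"any padding here"
    lhs = zlib.crc32(xor(xor(a, b), c))
    rhs = zlib.crc32(a) ^ zlib.crc32(b) ^ zlib.crc32(c)
    assert lhs == rhs  # holds for any equal-length a, b, c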

But that's a relatively simple problem. crc32 could be replaced in your
scheme with any of the SHA-2 family, or the upcoming SHA-3, or whatever.

That brings me to my second concern: how does this alternative message
digest have any authority?

For example, your patch teaches the git protocol a new extension to pass
these digests along with the object sha1s. But how do we know the server
isn't feeding us a bad digest along with the bad object?

The "usual" security model discussed in git is that of verifying a
signed tag. Linus signs a tag and pushes it to a server. I fetch the
tag, and can verify the signature on the tag. I want to know that I have
the exact same objects that Linus had. But I can't assume the server is
trustworthy; it may have fed me a bad object with a collided sha1.

But what's in the signed part of the tag object? Only the sha1 of the
commit the tag points to, but not the new digest. So an attacker can
replace the commit with one that collides, and it can in turn point to
arbitrary trees and blobs.

You can fix this by including an extra header in the signed part of the
tag that says "also, the digest of the commit I point to is X". Then you
know you have the same commit that Linus had. But the commit points to a
tree by its sha1. So you have to add a similar header in the commit
object that says "also, the digest of the tree I point to is X". And
ditto for all of the parent pointers, if you want to care about signing
history. And then you have the same problem in the tree: each sub-tree
and blob is referenced by its sha1.

In other words, authority flows from the gpg-signed tag portion, and
links in the chain to each object are made by referencing sha1s. Every
time such a link is made, you need to also include the digest of the
object, or you are giving the attacker a place to insert a collided
object.

For tag and commit objects, this actually isn't that hard; they have a
relatively small number of links, and they have room for arbitrary
headers. I.e., add a "tree-md-sha256" header that gives the expected
sha-256 of the tree object referenced by sha-1 in the "tree" header.
Older versions of git will skip over this header (though obviously they
won't be smart enough to do the more-secure verification).
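
Concretely, a commit object carrying such a header might look like this
(the header name and the hex values are made up for illustration; the
new header sits after the existing ones so older parsers skip it):

    tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
    parent 16d28cb6aa44339e99880f0b29908e7850a0fcf1
    author A U Thor <author@example.com> 1322546563 -0800
    committer A U Thor <author@example.com> 1322546563 -0800
    tree-md-sha256 <sha-256 of the tree object named on the first line>

    Subject line of the commit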

However, it's harder for trees. Each entry needs to have the new digest
added, but there simply isn't room in the format. So we would have to
change the tree format, breaking interoperability with older versions of
git. And all of your tree sha1's would change as soon as you wrote them
with the new format. That's only slightly better than just swapping
sha1 out for a better algorithm.

One trick you could do is to include the digest of each blob in the
commit object itself. This really bloats the size of commit objects,
though. You can store a digest of their digests instead (which your
patch actually calculates, but AFAICT does not actually insert into the
commit object). That is small and relatively cheap (it turns commit from
an O(1) operation into an O(number of files) operation, but the per-file
constant is pretty inexpensive). But it comes at the expense of not being
able to tell which of the constituent blobs was actually attacked.
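
A sketch of that "digest of digests" in Python (hashlib only; the walk
over the commit's blobs is stubbed out, and the iteration order would
have to be canonical):

    import hashlib

    def digest_of_digests(blob_contents):
        outer = hashlib.sha256()
        for content in blob_contents:   # must be a deterministic order
            outer.update(hashlib.sha256(content).digest())
        return outer.hexdigest()

    # a single commit header could then carry this one value
    print(digest_of_digests([b"blob one\n", b"blob two\n"]))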

So I think all of that would work for verification starting at a signed
tag that points to a commit or a blob. For a tag pointing straight to a
tree, it could include the same "digest of digests" that the commit
would.

But it can never handle direct sha1 references outside of git. For
example, a bug tracker or CI system that references a commit in git will
do so by sha1. But that's an insecure link in the chain; you really need
it to refer to both the sha1 _and_ the digest, and then you can verify
that the object you have under that sha1 matches the digest. So you'd
have to teach a whole generation of tools that they can't trust git sha1
ids, and that you need an "extended" id that includes the digest.

At that point, I really wonder if a flag day to switch to a new
repository format is all that bad. You'd have to either rewrite existing
history (which means re-signing tags if you care about them!), or just
decide to leave the new and old histories disconnected.  Whereas a
scheme based around an added-on digest could keep the old history
connected. But at the same time, the old history will have zero value
from a security perspective. If sha1 is broken, then those old
signatures are worthless, anyway. And just switching to a new algorithm
means the implementation remains very simple.

-Peff

[1] Fun fact of the day: if linux development continues at the same
    rate until the Sun burns out, there is a 1/2^60 chance of a
    collision!


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29  9:07 ` [PATCH] Implement fast hash-collision detection Jeff King
@ 2011-11-29 10:24   ` Ævar Arnfjörð Bjarmason
  2011-11-29 10:29     ` Jeff King
  2011-11-29 13:17   ` Nguyen Thai Ngoc Duy
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 20+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2011-11-29 10:24 UTC (permalink / raw)
  To: Jeff King; +Cc: Bill Zaumen, git, gitster, pclouds, spearce, torvalds

On Tue, Nov 29, 2011 at 10:07, Jeff King <peff@peff.net> wrote:

> However, it may be of interest that the Sun is expected to burn out in a
> mere 10^10 years[1].

Off topic, but it's a lot sooner than that. The total age of the sun
is around 10^10 (10 billion) years, but we're already ~4.6 billion
years along that line.

However the Sun is currently in a stage of gradual heating until it
turns into a red giant in ~5 billion years. In around 500 million
years the earth will be uninhabitable as we know it, and in around 1
billion years the surface will be hot enough to have boiled all the
oceans. In other words the earth in a billion years will probably look
similar to how Venus looks now.


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29 10:24   ` Ævar Arnfjörð Bjarmason
@ 2011-11-29 10:29     ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2011-11-29 10:29 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git

On Tue, Nov 29, 2011 at 11:24:18AM +0100, Ævar Arnfjörð Bjarmason wrote:

> On Tue, Nov 29, 2011 at 10:07, Jeff King <peff@peff.net> wrote:
> 
> > However, it may be of interest that the Sun is expected to burn out in a
> > mere 10^10 years[1].
> 
> Off topic, but it's a lot sooner than that. The total age of the sun
> is around 10^10 (10 billion) years, but we're already ~4.6 billion
> years along that line.

Yeah, I checked Wikipedia, but rounded up for simplicity. I did use 5
billion for my fun fact at the end of the email, which is close to
accurate.

> However the Sun is currently in a stage of gradual heating until it
> turns into a red giant in ~5 billion years. In around 500 million
> years the earth will be uninhabitable as we know it, and in around 1
> billion years the surface will be hot enough to have boiled all the
> oceans. In other words the earth in a billion years will probably look
> similar to how Venus looks now.

Good point. If we want an accidental collision in the next 500 million
years, we'd better step up the pace of development!

-Peff


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29  9:07 ` [PATCH] Implement fast hash-collision detection Jeff King
  2011-11-29 10:24   ` Ævar Arnfjörð Bjarmason
@ 2011-11-29 13:17   ` Nguyen Thai Ngoc Duy
  2011-11-29 15:23     ` Shawn Pearce
  2011-11-29 14:04   ` Nguyen Thai Ngoc Duy
  2011-11-29 21:56   ` Bill Zaumen
  3 siblings, 1 reply; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-11-29 13:17 UTC (permalink / raw)
  To: Jeff King; +Cc: Bill Zaumen, git, gitster, spearce, torvalds

On Tue, Nov 29, 2011 at 4:07 PM, Jeff King <peff@peff.net> wrote:
> However, it's harder for trees. Each entry needs to have the new digest
> added, but there simply isn't room in the format. So we would have to
> change the tree format, breaking interoperability with older versions of
> git. And all of your tree sha1's would change as soon as you wrote them
> with the new format. That's only slightly better than just swapping
> sha1 out for a better algorithm.

I think we could hide the new digest after the NUL at the end of the
path name. C Git at least does not seem to care what comes after the
NUL.
-- 
Duy


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29  9:07 ` [PATCH] Implement fast hash-collision detection Jeff King
  2011-11-29 10:24   ` Ævar Arnfjörð Bjarmason
  2011-11-29 13:17   ` Nguyen Thai Ngoc Duy
@ 2011-11-29 14:04   ` Nguyen Thai Ngoc Duy
  2011-11-29 20:59     ` Jeff King
  2011-11-29 21:56   ` Bill Zaumen
  3 siblings, 1 reply; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-11-29 14:04 UTC (permalink / raw)
  To: Jeff King; +Cc: Bill Zaumen, git, gitster, spearce, torvalds

On Tue, Nov 29, 2011 at 4:07 PM, Jeff King <peff@peff.net> wrote:
> That brings me to my second concern: how does this alternative message
> digest have any authority?
>
> For example, your patch teaches the git protocol a new extension to pass
> these digests along with the object sha1s. But how do we know the server
> isn't feeding us a bad digest along with the bad object?
>
> The "usual" security model discussed in git is that of verifying a
> signed tag. Linus signs a tag and pushes it to a server. I fetch the
> tag, and can verify the signature on the tag. I want to know that I have
> the exact same objects that Linus had. But I can't assume the server is
> trustworthy; it may have fed me a bad object with a collided sha1.
>
> But what's in the signed part of the tag object? Only the sha1 of the
> commit the tag points to, but not the new digest. So an attacker can
> replace the commit with one that collides, and it can in turn point to
> arbitrary trees and blobs.
>
> You can fix this by including an extra header in the signed part of the
> tag that says "also, the digest of the commit I point to is X". Then you
> know you have the same commit that Linus had. But the commit points to a
> tree by its sha1. So you have to add a similar header in the commit
> object that says "also, the digest of the tree I point to is X". And
> ditto for all of the parent pointers, if you want to care about signing
> history. And then you have the same problem in the tree: each sub-tree
> and blob is referenced by its sha1.
>
> ..

(Security ignorant speaking)

Can we just hash all objects in a pack from bottom up (replacing
sha-1 in trees/commits/tags with the new digest in memory before
hashing), then attach the new top digest to the tag's content? The
sender is required by the receiver to send new digests for all objects
in the pack together with the pack. The receiver can then go through
the same process to produce the top digest and match it with the one
saved in the tag.

I do agree that we should use stronger digests, not weaker, preferably
a combination of them.
-- 
Duy


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29 13:17   ` Nguyen Thai Ngoc Duy
@ 2011-11-29 15:23     ` Shawn Pearce
  0 siblings, 0 replies; 20+ messages in thread
From: Shawn Pearce @ 2011-11-29 15:23 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Jeff King, Bill Zaumen, git, gitster, torvalds

On Tue, Nov 29, 2011 at 05:17, Nguyen Thai Ngoc Duy <pclouds@gmail.com> wrote:
> On Tue, Nov 29, 2011 at 4:07 PM, Jeff King <peff@peff.net> wrote:
>> However, it's harder for trees. Each entry needs to have the new digest
>> added, but there simply isn't room in the format. So we would have to
>> change the tree format, breaking interoperability with older versions of
>> git. And all of your tree sha1's would change as soon as you wrote them
>> with the new format. That's only slightly better than just swapping
>> sha1 out for a better algorithm.
>
> I think we could hide the new digest after the NUL at the end of the
> path name. C Git at least does not seem to care what comes after the
> NUL.

No, you can't. The next byte after the NUL at the end of a path name
is the SHA-1 of that entry. After those 20 bytes of SHA-1 data is
consumed, the "mode SP name\0" of the next entry is parsed.

There is not room in the tree format for any additional data. Period.
At best you can modify the mode value to be a new octal code that is
not recognized by current Git implementations. But that doesn't get
you a whole lot of data, and it's a pretty risky change because it's
rather incompatible.
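
For reference, this is the entire framing of a canonical tree object -
a rough Python parser makes it obvious there is nowhere to hide extra
bytes:

    def parse_tree(data):
        i, entries = 0, []
        while i < len(data):
            nul = data.index(b"\0", i)
            mode, _, name = data[i:nul].partition(b" ")
            sha1 = data[nul + 1:nul + 21]   # exactly 20 raw bytes follow
            entries.append((mode, name, sha1.hex()))
            i = nul + 21                    # next "mode SP name" begins
        return entries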


* Re: [PATCH] Implement fast hash-collision detection
       [not found] <1322546563.1719.22.camel@yos>
  2011-11-29  9:07 ` [PATCH] Implement fast hash-collision detection Jeff King
@ 2011-11-29 17:08 ` Shawn Pearce
  2011-11-29 22:05   ` Jeff King
  2011-11-30  4:01   ` Bill Zaumen
  1 sibling, 2 replies; 20+ messages in thread
From: Shawn Pearce @ 2011-11-29 17:08 UTC (permalink / raw)
  To: Bill Zaumen; +Cc: git, gitster, pclouds, peff, torvalds

On Mon, Nov 28, 2011 at 22:02, Bill Zaumen <bill.zaumen+git@gmail.com> wrote:
> Several years ago (in 2006) there was a discussion about whether
> SHA-1 is adequate given the very small but non-zero probability of a
> hash collision, particularly given the possibility of a malicious
> attempt to generate a collision. At roughly the same time, Git was
> modified to support "thin packs" during data transfers. These allow
> one to send deltas based on objects that are not in the pack file
> being transferred. As a result a previously undetected hash collision
> could result in a corrupted repository when (for example) the same
> delta is applied to different objects that have the same SHA-1 hash.

I don't think you understand how these thin packs are processed.

If the pack contains <100 objects, it is inflated to loose objects. If
the receiving side (so client in fetch, server in push) already has an
object by that SHA-1, the new object is discarded. If the pack
contains >=100 objects, and the receiving side already has the object,
it is compared byte-for-byte to ensure the incoming copy exactly
matches the already existing copy.

Either way the first object to arrive always wins.
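
In other words, the receiving side behaves roughly like this sketch
(illustrative Python, not the actual implementation):

    def add_object(odb, sha1, data):
        existing = odb.get(sha1)    # odb maps sha-1 hex -> object bytes
        if existing is None:
            odb[sha1] = data        # first object to arrive wins
        elif existing != data:      # byte-for-byte comparison
            raise RuntimeError("SHA-1 collision on %s" % sha1)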


The recipient has to trust that the remote side is providing it
something reasonable. If the recipient has *ZERO* trust in the sender,
then s/he should read the content of all newly arrived objects before
compiling or executing them. This is one reason why Git does not run
hooks that are transported as part of the repository. If the recipient
thinks reading the content is too onerous or impossible, then they
have to make a trust assertion on the remote side.

This trust assertion should be derived from the community, and not so
much from the machine claiming the content is what it says it is. We
have yet to disprove the halting problem, so we have yet to construct
a machine that can verify those Linux kernel sources you downloaded
don't contain a local root exploit (for example). Instead we have to
trust the community of developers and users who work on and run that
code to have confidence that the code works as expected, etc. We base
our trust off reputable people making statements like "Linus kernel at
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git is
pretty good", and "kernel.org is where Linus pushes directly to, so
its reasonable to trust the kernel.org server".

The recipient should have some understanding of the remote server's
security policies, or pay attention to notices posted by others who
are also fetching and reviewing content from the same repository. At
some level, the community using a repository from a given site should
be policing that site and establishing trust that the host is not
providing garbage content. After an incident, it is possible to pick
up again by rebuilding the environment from an already known
repository that people trust. In the case of the recent kernel.org
environment rebuild, that is exactly what they did: the community
picked up again from Linus' personal repository.

DNS could be abused to send you the wrong IP for a site, but most
people don't use random DNS servers, they have some level of trust in
their DNS provider. DNSSEC is helping to improve the security of the
name->IP translation process, and using protocols like HTTPS with SSL
certificate verification can help to reduce the chances that a forged
DNS entry sends you to an evil source, rather than the community
trusted one. (Although SSL certificates seem to be getting forged left
and right these days, so again you can't really rely on strong
cryptography to magically solve security problems when the attacker
holds the private key you have decided to trust with no further
verification.)


But trust aside, consider an object C that is sent as a delta to the
remote side. The delta base B is not included in the pack, and is
referenced by SHA-1. When the remote side processes delta C, it looks
up a copy
of base B from its own repository. We assume this content of B is
correct, due to the "first to arrive wins" rule, and the community
review/trust/notification process.

The inflated length of B is checked against a size that is stored in
the front of the delta instruction stream that describes C. These
lengths must match exactly, if they do not match then the delta
application aborts, the pack is rejected, and any temporary data is
removed from the filesystem. As Peff pointed out elsewhere in this
thread, the odds of a SHA-1 collision in a project are low, on the
order of 1/(2^80). Although there are some attacks on message digest
functions like MD-5 or SHA-1 that might make it possible to generate a
duplicate in 2^57 time, any that I have read require producing content
of a different length than the original you are trying to replace.
Assuming the copy of B on the remote system actually inflates and
computes to the correct SHA-1 B, it probably does not also have the
correct length if it is an object with correct hash but wrong content.
So delta application should still be checking for collision with a
1/(2^80) probability.
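
For the curious, the check works because a git delta stream begins with
two base-128 varints, the expected base size and the result size; a
rough Python sketch of just the header parse:

    def parse_delta_header(delta):
        def varint(i):
            shift = value = 0
            while True:
                byte = delta[i]
                i += 1
                value |= (byte & 0x7f) << shift
                shift += 7
                if not byte & 0x80:
                    return value, i
        base_size, i = varint(0)
        result_size, i = varint(i)
        return base_size, result_size, i

    # delta application aborts unless len(inflated_base) == base_size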

Assuming the remote's copy of B passed the size check, the delta is
applied on top, and the SHA-1 of the result buffer is computed. The
attacker must craft the delta such that the SHA-1 of C is the result,
otherwise connectivity checks will fail.

Assuming the attacker successfully stores a C' that has the right
SHA-1, but wrong content... the community around that repository will
eventually notice this and message that the source site cannot be
trusted. I refer you back to the statement above about trusting the
site you pull from, or trusting the users that you authorize to push
into a repository.

But thin pack aside, this problem exists in any form of a packed
object. An attacker can try to send object P' (as a non-delta) in
place of P. SHA1(P') = SHA1(P), but the content differs. This is far
easier to construct than the thin pack delta case you think is a
problem, and is the most likely approach for an attacker to take. I
refer you back to the statement above about trusting the site you pull
from, or trusting the users that you authorize to push into a
repository, or reading every object you receive.


Even if you magically fix the hash function somehow to decrease the
odds of collision (e.g. by switching to a member of the SHA-2 or SHA-3
family), there is no way to prevent a bug or root exploit from
entering the project except by never adding new code, or by carefully
reviewing everything that is submitted, and building up a basis of
trust around that review method. It is far more likely for an attacker
to try and submit a plain text patch to the Linux kernel mailing list
that reads completely correct, hashes to the correct SHA-1s when
applied in Git, etc... but just "happens" to contain an off by one
pointer bug in some weird case that allows the attacker to overflow a
critical memory buffer and later inject some code that can later be
used to create an exploit. If they are ever "caught" they may just
claim "I AM A MORON I AM SORRY I MISSED THAT BUFFER CHECK AND SO DID
YOU DURING CODE REVIEW SO ITS NOT ALL MY FAULT LEAVE ME ALONE" and get
away with it.

Trust. Review. Verify.

I don't know about you, but I don't just pull random code from
arbitrary sites on the Internet. Nor do I compile or execute that code
on my workstation. I do trust some individuals based on their
reputation on the Internet, or my past experiences working with their
code. And I also trust some hosting environments like kernel.org, or
GitHub, or code.google.com, to provide reasonably secure hosting, and
to aggressively react to any event that might make it harder for me to
trust the content they supply. And I also read a lot of code that I
pull.

It really isn't the problem you claim it is.


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29 14:04   ` Nguyen Thai Ngoc Duy
@ 2011-11-29 20:59     ` Jeff King
  2011-11-30 13:35       ` Nguyen Thai Ngoc Duy
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2011-11-29 20:59 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Bill Zaumen, git, gitster, spearce, torvalds

On Tue, Nov 29, 2011 at 09:04:06PM +0700, Nguyen Thai Ngoc Duy wrote:

> > You can fix this by including an extra header in the signed part of the
> > tag that says "also, the digest of the commit I point to is X". Then you
> > know you have the same commit that Linus had. But the commit points to a
> > tree by its sha1. So you have to add a similar header in the commit
> > object that says "also, the digest of the tree I point to is X". And
> > ditto for all of the parent pointers, if you want to care about signing
> > history. And then you have the same problem in the tree: each sub-tree
> > and blob is referenced by its sha1.
> >
> Can we just hash all objects in a pack from bottom up (replacing
> sha-1 in trees/commits/tags with the new digest in memory before
> hashing), then attach the new top digest to the tag's content? The
> sender is required by the receiver to send new digests for all objects
> in the pack together with the pack. The receiver can then go through
> the same process to produce the top digest and match it with the one
> saved in the tag.

I think that is conflating two different layers of git. The security for
tags happens at the conceptual object db layer: you sign a tag, and that
points to a commit, which points to a tree, and so on. The authenticity
comes from the tag signature, but the integrity of each link in the
chain is verifiable because of the hash property. The pack layer, on the
other hand, is just an implementation detail about how those conceptual
objects are stored. More than just your tag will be in a pack, and the
contents of your tag may be spread across several packs (or even loose
objects).

So I don't think it's right to talk about packs at all in the signature
model.

If you wanted to say "make a digest of all of the sub-objects pointed to
by the tag", then yes, that does work (security-wise). But it's
expensive to calculate. Instead, you want to use a "digest of digests"
as much as possible. Which is what git already does, of course; you hash
the tree object, which contains the blob sha1s. Git's
conceptual model is fine. The only problem is that sha1 is potentially
going to lose its security properties, weakening the links in the chain.
So as much as possible, we want to insert additional links at the exact
same places, but using a stronger algorithm.

Does that make sense?

-Peff


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29  9:07 ` [PATCH] Implement fast hash-collision detection Jeff King
                     ` (2 preceding siblings ...)
  2011-11-29 14:04   ` Nguyen Thai Ngoc Duy
@ 2011-11-29 21:56   ` Bill Zaumen
  2011-11-30  6:25     ` Jeff King
  3 siblings, 1 reply; 20+ messages in thread
From: Bill Zaumen @ 2011-11-29 21:56 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds

Thanks for mentioning the 100K limit, which I didn't know about.  I
will have to see how to split it into two patches.

The intent is to increase the cost of a malicious attack (which requires
generating two different files with the same SHA-1 value), to detect
such an attack early, and to slow such an attack down.  Because of Git's
rule that the first object with a given SHA-1 value is the one the
repository keeps, if it takes longer to generate a collision than it
takes to get the original object into all repositories (which is done
manually by multiple individuals), the forged file will never appear in
any "official" repository.

The additional CRC (easily changed to whatever message digest one might
prefer) makes a malicious attack far more difficult: the modified file
has to have both the same SHA-1 hash (including the Git header) and 
the same CRC (not including the Git header).  An efficient algorithm to
do both simultaneously does not yet exist.  So, if we could generate a
SHA-1 collision in one second, it would presumably take billions of
seconds (many decades of continuous computation) to generate a SHA-1
hash with the same CRC, and well before a year has elapsed, the original
object should have been in all the repositories, preventing a forged
object from being inserted. Of course, eventually you might need a
real message digest.
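
The arithmetic behind "billions of seconds", under the generous
assumption of one SHA-1 collision per second and a brute-force search
over the extra 32-bit value, is roughly:

    seconds = 2 ** 32                        # expected CRC-matching tries
    print(seconds / (3600 * 24 * 365.25))    # ~136 years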

The weakness of a CRC as an integrity check is not an issue since it is
never used alone: its use is more analogous to the few extra bits added
to a data stream when error-detecting codes are used.  I used a CRC in
the initial implementation rather than a message digest because it is
faster, and because the initial goal was to get things to work
correctly.  In any case, the patch does not eliminate any code in which
Git already does a byte-by-byte comparison.  In cases where Git
currently assumes that two objects are the same because the SHA-1 hashes
are the same, the patch compares CRCs as an additional test.

Regarding your [Jeff's] second concern, "how does this alternative
digest have any authority?" there are two things to keep in mind. First,
it is a supplement to the existing digest.  Second, any value of the CRC
that is stored permanently (barring bugs in my implementation, of
course) is computed locally - when a loose object is created or when a
pack file's index is created.  At no point is a CRC that was obtained
from another repository trusted. While the patch modifies Git so that it
can send CRCs when using the git protocol, these CRCs are never stored,
but are instead used only for cross checks.  If one side or the other
"lies", you get an error.  

To give a concrete example, during a fetch, the git protocol currently
sends "have" messages that contain the SHA-1 hashes of commits.  The
extension allows two CRCs to be sent along with each hash.  If these do
not match the local values (tested only if the local values exist),
something is wrong and the server sends an error report to the client.
The server never uses these CRCs for any other purpose, and it never
sends its own CRCs to the client, because of backwards-compatibility
issues. For objects that are transferred, you
end up with a pack file, with index-pack called to build the index (and
with the patch, the corresponding MDS file), but index-pack already does
a byte-by-byte comparison to detect collisions - the comparison is much
faster than the SHA-1 computation index-pack has to do anyway.

Where this helps is when one is using multiple repositories. If you
fetch a commit from repository B, which we'll assume has a forged blob
(different content, but the original SHA-1 hash), and then run fetch
using repository A, which has the same commit with the original
blob, the original blob will not be transferred from server A and the
client will not be notified that there is an inconsistency - the
protocol is "smart" enough to know that the client already has the
commit and assumes there is nothing to do regarding it.

BTW, regarding your [Jeff's] discussion about putting an additional
header in commit messages - I tried that.  The existing versions of
Git didn't like it: barring a bug in my test code, it seems that Git
expects headers in commit messages to be in a particular order and
treats deviations from that as an error.  I even tried appending
blank lines at the end of a commit, with spaces and tabs encoding an
additional CRC, and that didn't work either - at least it never got
through all the test programs, failing in places like the tests
involving notes. In any case, you'd have to phase in such a change
gradually, first putting in the code to read the new header if it is
there, and subsequently (after ample time so that everyone is running
a sufficiently new version) enabling the code to create the new
header.

Also, regarding "At that point, I really wonder if a flag day to switch
to a new repository format is all that bad," if that turns out to be
the decision, I'd recommend doing it sooner rather than later. The
reason is cost, which grows with the number of git users and the
number and size of Git repositories.

Bill


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29 17:08 ` Shawn Pearce
@ 2011-11-29 22:05   ` Jeff King
  2011-11-30  4:01   ` Bill Zaumen
  1 sibling, 0 replies; 20+ messages in thread
From: Jeff King @ 2011-11-29 22:05 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: Bill Zaumen, git, gitster, pclouds, torvalds

On Tue, Nov 29, 2011 at 09:08:27AM -0800, Shawn O. Pearce wrote:

> As Peff pointed out elsewhere in this thread, the odds of a SHA-1
> collision in a project are low, on the order of 1/(2^80).

Minor nit: it's actually way less than that. You have to do on the order
of 2^80 operations to get a 50% chance of a collision. But that's not
the probability for a collision given a particular number of
operations[1].

The probability for a SHA-1 collision on 10 million hashes (where
linux-2.6 will be in a decade or two) is about 1/(2^115).
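
If you want to check that figure, the usual birthday approximation is
p ~= n^2 / 2^(b+1) for n items in a b-bit space:

    import math

    n, b = 10e6, 160
    print(math.log2(n ** 2 / 2 ** (b + 1)))   # ~ -115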

That doesn't change the validity of any of your points, of course. 1 in
2^80 and 1 in 2^115 are both in the range of "impossibly small enough
not to care about".

To continue our astronomy analogies, NASA estimates[2] the impact
probability of most tracked asteroids at around 1 in 10^6 (about 2^-20).
So getting a collision in linux-2.6 in the next decade has roughly the
same odds as the Earth being hit by 5 or 6 large asteroids.

-Peff

[1] http://en.wikipedia.org/wiki/Birthday_problem#Cast_as_a_collision_problem

[2] http://neo.jpl.nasa.gov/risk/


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29 17:08 ` Shawn Pearce
  2011-11-29 22:05   ` Jeff King
@ 2011-11-30  4:01   ` Bill Zaumen
  1 sibling, 0 replies; 20+ messages in thread
From: Bill Zaumen @ 2011-11-30  4:01 UTC (permalink / raw)
  To: Shawn Pearce; +Cc: git, gitster, pclouds, peff, torvalds

Note: for some reason my email is not showing up on the mailing list.
I'm trying a different email address - previously my 'From' field
contained a subaddress "+git" but gmail won't put that in the 'Sender'
field, so possibly the email is being filtered for that reason.

On Tue, 2011-11-29 at 09:08 -0800, Shawn Pearce wrote:

> I don't think you understand how these thin packs are processed.

I think the confusion was due to me being a bit too terse.  The
documentation clearly states that thin packs allow deltas to be
sent when the delta is based on an object that the server and client
both have in common, given the commits each already has.  If there is
one server and one client, there isn't an issue.  The case I meant is
the one in which a user does a fetch from one server, gets a forged
blob, and then fetches from another server with the original blob, and
with additional commits along the same branch. If a server bases the
delta off of the original blob, and the client applies the delta to the
forged blob, the client will most likely end up with a blob with a
different SHA-1 hash than the one expected.  Since an object in a tree
is then missing (no object with the expected SHA-1 hash), the repository
is corrupted.

The "first to arrive wins" policy isn't sufficient in one specific case:
multiple remote repositories where new commits are added asynchronously,
with the repositories out of sync possibly for days at a time (e.g.,
over a 3-day weekend).  In this case, the first to arrive at one
repository may not be the first to arrive at another, so what happens at
a particular client in the presence of hash collisions is dependent on
the sequence of remotes from which updates were fetched.  The risk
occurs in the window where the repositories are out of sync.

Regarding the kernel.org problem that you used as a separate example,
while it was fortunately possible to rebuild things (and git provided
significant advantages), earlier detection of the problem might have
reduced the time for which kernel.org was down.  Early detection of
errors in general is a good practice if it can be done at a reasonable
cost.

> Trust. Review. Verify.

While good advice in principle, you should keep in mind that there are
a lot of people out there working at various companies who are not as
capable as you are.  Some of them are overworked and make mistakes
because they've been working 16 hour days for weeks trying to meet a
deadline. Given that, extra checks to catch problems early
are probably a good idea if they don't impact performance significantly.


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29 21:56   ` Bill Zaumen
@ 2011-11-30  6:25     ` Jeff King
  2011-12-01  0:41       ` Bill Zaumen
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2011-11-30  6:25 UTC (permalink / raw)
  To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds

On Tue, Nov 29, 2011 at 01:56:28PM -0800, Bill Zaumen wrote:

> The additional CRC (easily changed to whatever message digest one might
> prefer) makes a malicious attack far more difficult: the modified file
> has to have both the same SHA-1 hash (including the Git header) and 
> the same CRC (not including the Git header).

Only if the attack actually involves creating a collision on both. But I
think the important attacks bypass your CRC anyway. Consider this attack
scenario:

  1. Linus signs a tag (or a commit) and pushes it to kernel.org.

  2. kernel.org gets hacked, and the attacker replaces an object with
     an evil colliding version[1].

  3. I clone from kernel.org, and run "git tag --verify". Git says it's
     OK, because the signature checks out, but I have a bogus object.

How does your CRC help? If I understand your scheme correctly,
kernel.org will have told me the CRC of all of the objects during the
clone. But that isn't part of what Linus signed, so the attacker in step
2 could just as easily have overwritten kernel.org's crc file, and the
signature will remain valid.

[1] This is an over-simplification, of course. Because the only even
    remotely feasible attacks on sha1 are birthday attacks, not pre-image
    attacks, there is a step 0 in which the attacker generates a
    colliding pair, convinces Linus to commit it, and then waits.

    Which is probably really hard, but for the purposes of this
    discussion, we assume the attacker is capable of inserting a
    colliding object maliciously into a repo you will fetch from.
    Otherwise, the integrity of sha1 isn't an issue at all.

> An efficient algorithm to do both simultaneously does not yet exist.
> So, if we could generate a SHA-1 collision in one second, it would
> presumably take billions of seconds (many decades of continuous
> computation) to generate a SHA-1 hash with the same CRC, and well
> before a year has elapsed, the original object should have been in all
> the repositories, preventing a forged object from being inserted. Of
> course, eventually you might need a real message digest.

This is wrong, for two reasons.

  1. The method for generating an object that collides in both sha-1 and
     CRC is not necessarily to generate a colliding sha-1 and then do a
     pre-image attack on the CRC. It is to do a birthday attack on the
     sha-1 and the CRC together. Which halves the bit-strength of the
     CRC to 16 bits (just as we can generally find collisions in 160-bit
     sha1s in 2^80). 16 bits isn't a lot to add when you are trying to
     fix a broken cryptosystem (it's not broken yet, obviously, but when
     it does get broken, will it be because computing reaches the 2^57
     or so that sha1 is broken at, or will it be because a new weakness
     is found that drops sha1's bit-strength to something much lower?).

     This assumes that you can combine the two in a birthday attack.
     Certainly this analysis works against brute-force 2^80 sha1
     collision attacks. But I haven't actually read the details of the
     sha1 attacks, so maybe some of the tweaking they do to get those
     results makes it harder. On the other hand, attacking CRC is far
from hard, so I certainly wouldn't stake money that sha1 researchers
     couldn't tweak their attacks in a way that also allows finding CRC
     collisions. You say that an algorithm to do both simultaneously
     does not yet exist. But is that because it's hard, or simply
     because nobody has bothered trying?

     Anyway, all of that is just reiterating that CRC should not be used
     as a security function. It can easily be replaced in your scheme by
     sha-256, which does have the desired properties.

  2. Your attack seems to be "find the sha-1 collision, publish one of
     your colliding objects (i.e., the innocent-looking half), then try
     to break the CRC". And then you claim that by the time you find the
     CRC, everybody will already have the object.

     But wouldn't a smarter attack be to first find the collision, including
     the CRC, and only _then_ start the attack? Then nobody will have
     the object.

     Moreover, it's not true that after a year everyone will have the
     object. People still run "git clone" against kernel.org. Those
     repos do not have the object.

> The weakness of a CRC as an integrity check is not an issue since it
> is never used alone: its use is more analogous to the few extra bits
> added to a data stream when error-detecting codes are used.  I used a
> CRC in the initial implementation rather than a message digest because
> it is faster, and because the initial goal was to get things to work
> correctly.  In any case, the patch does not eliminate any code in
> which Git already does a byte-by-byte comparison.  In cases where Git
> currently assumes that two objects are the same because the SHA-1
> hashes are the same, the patch compares CRCs as an additional test.

Right. I don't claim that your scheme makes git any weaker. I just claim
that it fails to solve the problems people are actually concerned about,
and it adds a lot of complexity while doing so.

> Regarding your [Jeff's] second concern, "how does this alternative
> digest have any authority?" there are two things to keep in mind.
> First, it is a supplement to the existing digest.

Right, but we are assuming that sha1 is broken. That's the whole
security problem. So the existing digest is not worth much.

> Second, any value of the CRC that is stored permanently (barring bugs
> in my implementation, of course) is computed locally - when a loose
> object is created or when a pack file's index is created.  At no point
> is a CRC that was obtained from another repository trusted. While the
> patch modifies Git so that it can send CRCs when using the git
> protocol, these CRCs are never stored, but are instead used only for
> cross checks.  If one side or the other "lies", you get an error.

But if I don't already have the object, then I have nothing to compare
against. So when I get it from kernel.org, I have to simply accept that
the object I'm getting is good, and write it into my object db.

> BTW, regarding your [Jeff's] discussion about putting an additional
> header in commit messages - I tried that.  The existing versions of
> Git didn't like it: barring a bug in my test code, it seems that Git
> expects headers in commit messages to be in a particular order and
> treats deviations from that as an error.

Yes, the header has to go at the end of the existing headers. But I
don't see any reason that would be a problem for the scheme I described.

> I even tried appending blank lines at the end of a commit, with spaces
> and tabs encoding an additional CRC, and that didn't work either - at
> least it never got through all the test programs, failing in places
> like the tests involving notes.

Yes, git will helpfully trim whitespace in commit messages. With the
current code, you can hide arbitrary bytes in a commit message after a
NUL, but don't do that. It's not guaranteed to stay that way, and the
appropriate place to add new information is in a header.

> In any case, you'd have to phase in such a change gradually, first
> putting in the code to read the new header if it is there, and
> subsequently (after ample time so that everyone is running a
> sufficiently new version) enabling the code to create the new header.

Current git should ignore headers that it doesn't understand. I haven't
tested this, but Junio recently has been experimenting with
gpg-signature lines in commits, and I'm pretty sure he checked that
older gits properly ignore them.

-Peff


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-29 20:59     ` Jeff King
@ 2011-11-30 13:35       ` Nguyen Thai Ngoc Duy
  2011-11-30 18:05         ` Junio C Hamano
  2011-11-30 19:00         ` Bill Zaumen
  0 siblings, 2 replies; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-11-30 13:35 UTC (permalink / raw)
  To: Jeff King; +Cc: Bill Zaumen, git, gitster, spearce, torvalds

On Wed, Nov 30, 2011 at 3:59 AM, Jeff King <peff@peff.net> wrote:
> If you wanted to say "make a digest of all of the sub-objects pointed to
> by the tag", then yes, that does work (security-wise). But it's
> expensive to calculate. Instead, you want to use a "digest of digests"
> as much as possible. Which is what git already does, of course; you hash
> the tree object, which contains the blob sha1s. Git's
> conceptual model is fine. The only problem is that sha1 is potentially
> going to lose its security properties, weakening the links in the chain.
> So as much as possible, we want to insert additional links at the exact
> same places, but using a stronger algorithm.

What I'm thinking is whether it's possible to decouple sha-1's two
roles in git, object identifier and digest. Each sha-1 identifies an
object and an extra set of digests on the "same" object. The object
database is extended to store all these new digests and the mapping
between sha-1 and them. When we need to verify an object, given a
sha-1, we rehash that object and check the resulting digest against
the ones linked to the sha-1.

These new digests would be "digest of digests" just like how we use
sha-1. In fact this new digest set could be just sha-1. We just don't
hash trees/commits/tags as-is when computing these new digests. When a
tree object is hashed, for example, a new tree object with new digests
will be created for hashing (we keep sha-1 <-> new digest mapping on
disk). Think of duplicating an entire DAG with new digests as links
instead of sha-1, then we keep digests and drop the DAG.
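
A toy version of that mapping, to make the idea concrete (Python;
"rewritten" means the object's child sha-1 links have already been
replaced by their new digests, bottom up):

    import hashlib

    digest_db = {}  # sha-1 hex -> sha-256 hex of the rewritten object

    def record(sha1_hex, rewritten):
        digest_db[sha1_hex] = hashlib.sha256(rewritten).hexdigest()

    def verify(sha1_hex, rewritten):
        expected = hashlib.sha256(rewritten).hexdigest()
        return digest_db.get(sha1_hex) == expected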

The day sha-1 is broken, a project can generate new digests from its
old good repo and require developers to use new digests for
verification instead of sha-1. sha-1 is still used by git as
identifier after that day.

Computing these digests is expensive, but for local, day-to-day use,
we only need sha-1 as identifier (correct me if I'm wrong here), so we
can delay computing/storing these new digests until we absolutely need
them (e.g. push/fetch). There is also an interesting case: assuming
these digests are strong enough, we could replace sha-1 as the
identifier in git with a cheaper hash. A new hash must fit in the
160-bit space that
sha-1 takes and should have good distribution, of course. Projects
with a lot of data may like it that way.
-- 
Duy


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-30 13:35       ` Nguyen Thai Ngoc Duy
@ 2011-11-30 18:05         ` Junio C Hamano
  2011-12-01  4:43           ` Nguyen Thai Ngoc Duy
  2011-11-30 19:00         ` Bill Zaumen
  1 sibling, 1 reply; 20+ messages in thread
From: Junio C Hamano @ 2011-11-30 18:05 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy
  Cc: Jeff King, Bill Zaumen, git, gitster, spearce, torvalds

Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:

> What I'm thinking is whether it's possible to decouple sha-1's two
> roles in git, object identifier and digest.

Why would that be a good thing? If you have a collided identifier, somebody
has to choose which blob a particular tree wants to have at the path, and
because the tree would not record anything but the identifier, you cannot.

> ...
> The day sha-1 is broken, a project can generate new digests from its
> old good repo and require developers to use new digests for
> verification instead of sha-1. sha-1 is still used by git as
> identifier after that day.

And an old blob that is identified with a SHA-1 now has a new blob that
has different contents but happens to have the same SHA-1. How does Git
decide which blob to use when a particular object is named by the SHA-1?


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-30 13:35       ` Nguyen Thai Ngoc Duy
  2011-11-30 18:05         ` Junio C Hamano
@ 2011-11-30 19:00         ` Bill Zaumen
  1 sibling, 0 replies; 20+ messages in thread
From: Bill Zaumen @ 2011-11-30 19:00 UTC (permalink / raw)
  To: Nguyen Thai Ngoc Duy; +Cc: Jeff King, git, gitster, spearce, torvalds

[Will send a reply to Jeff's comment from last night with some 
clarifications and explanations later].

> What I'm thinking is whether it's possible to decouple sha-1's two
> roles in git, object identifier and digest. Each sha-1 identifies an
> object and an extra set of digests on the "same" object. The object
> database is extended to store all these new digests and the mapping
> between sha-1 and them. When we need to verify an object, given a
> sha-1, we rehash that object and check the resulting digest against
> the ones linked to the sha-1.

The patch I created (at least, a reasonable chunk of the code) kind of
does that:  it is very easy to change the CRC to whatever message digest
one wants.  I used a CRC primarily because I had the impression that
people were very concerned about speed, but it is easy to change that to
the message digest of your choice.  In any case, it might be a good
starting point if you want to try something in a different direction.

Basically, when you create a loose object, in addition to getting a
SHA-1 ID, you get a message digest that gets stored as well (in a
separate file). When you index a pack file, you get an IDX file
containing the SHA-1 ID  plus a corresponding MDS file containing the
message digest. Index-pack calculates the SHA-1 value from the object
stored in the pack file, and the (additional) message digest is computed
at the same time using the same data.  Commands like verify-pack check
both the IDX file and the MDS file for consistency with the matching
pack file.  The new message digest (the CRC in the patch) is used only
in cases where a repository is being altered (e.g., a loose object or
pack file is being created or a fetch, push, or pull operation) or some
explicit verification operation is running (e.g., git verify-pack).

Adding an additional header to the commit message is a good idea (I had
actually tried that, but something went wrong, although one of you
suggested what the problem might have been --- I can try again if there
is some interest in pursuing that).

It might be worth pointing out that you can use the SHA-1 hash of the
contents of objects (i.e., without the Git object header) as an
additional digest:  I tried a test using two 128-byte files with the
same MD5 hash, differing past the 20th byte, and deleted the first
four bytes of each.  With those bytes deleted, the hash collision
went away. I doubt if there is a known efficient algorithm that can
generate a hash collision for two files and for two other files that
differ from the first set by deleting N bytes from both.
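
The header-stripping idea in miniature (Python; "blob <len>\0" is the
header git prepends to a loose object before hashing):

    import hashlib

    content = b"hello world\n"
    object_id = hashlib.sha1(b"blob %d\x00" % len(content) + content)
    extra = hashlib.sha1(content)    # header-less check value
    print(object_id.hexdigest(), extra.hexdigest())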


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-30  6:25     ` Jeff King
@ 2011-12-01  0:41       ` Bill Zaumen
  2011-12-01  5:26         ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Zaumen @ 2011-12-01  0:41 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds

On Wed, 2011-11-30 at 01:25 -0500, Jeff King wrote:
> On Tue, Nov 29, 2011 at 01:56:28PM -0800, Bill Zaumen wrote:
> But I
> think the important attacks bypass your CRC anyway. Consider this attack
> scenario:
> 
>   1. Linus signs a tag (or a commit) and pushes it to kernel.org.
> 
>   2. kernel.org gets hacked, and the attacker replaces an object with
>      an evil colliding version[1].
> 
>   3. I clone from kernel.org, and run "git tag --verify". Git says it's
>      OK, because the signature checks out, but I have a bogus object.
> 
> How does your CRC help? If I understand your scheme correctly,
> kernel.org will have told me the CRC of all of the objects during the
> clone. But that isn't part of what Linus signed, so the attacker in step
> 2 could just as easily have overwritten kernel.org's crc file, and the
> signature will remain valid.

First, there is a misconception - the server will not tell you the CRC.
The CRC will be computed locally by the client instead.  

Aside from that, suppose the attacker does what you suggest (providing
a valid CRC so that git commands like verify-pack don't have an error
to detect).  You can't tell that something is wrong, but Linus can - 
the next time he does a fetch.  If he fetches, the server sends
some SHA-1 hashes and the client responds with 'have' or 'want' in a
reply.  If the client wants it, the client doesn't have a CRC; if the
client sends 'have', the CRCs are available so those get sent.  The
server then checks, notices a mismatch (probably in the CRC of the CRCs
of all the blobs in the commit's tree), and generates an error.
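
Schematically (this shows the idea, not the exact wire format in the
patch):

    server: <advertises the commit SHA-1s it has>
    client: want <sha1-X>              (no CRC - nothing local to send)
    client: have <sha1-Y> <crc-of-Y>   (CRC was computed locally when
                                        the object was first stored)
    server: compares <crc-of-Y> with its own record; on a mismatch it
            reports a collision error instead of proceeding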

It's also possible to write some additional commands to (for example)
fetch the SHA-1 hashes and CRCs from all remote repositories you use
and compare these to make sure they are all consistent, something that
can be run occasionally.


> > An efficient algorithm to do both simultaneously does not yet exist.
> > So, if we could generate a SHA-1 collision in one second, it would
> > presumably take billions of seconds (many decades of continuous
> > computation) to generate a SHA-1 hash with the same CRC, and well
> > before a year has elapsed, the original object should have been in all
> > the repositories, preventing a forged object from being inserted. Of
> > course, eventually you might need a real message digest.
> 
> This is wrong, for two reasons.
> 
>   1. The method for generating an object that collides in both sha-1 and
>      CRC is not necessarily to generate a colliding sha-1 and then do a
>      pre-image attack on the CRC. It is to do a birthday attack on the
>      sha-1 and the CRC together. Which halves the bit-strength of the
>      CRC to 16 bits (just as we can generally find collisions in 160-bit
>      sha1s in 2^80). 

That result for a birthday attack assumes the goal is to find two files
that will have the same SHA-1 value (or SHA-1 + CRC).  The case I was
talking about (apologies if I did not state this clearly) is that you
have an existing object and an existing CRC and you want to generate
a second object with the same SHA-1 and same CRC as the first.  A
birthday attack doesn't work so well in that case - the expected number
of tries is on the order of 2^n rather than 2^(n/2) for an n-bit digest.
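
In rough numbers, for an ideal n-bit digest:

    find any colliding pair (birthday attack):   ~2^(n/2) tries
    match one fixed, existing digest
    (second preimage):                           ~2^n tries

    SHA-1 alone:         2^80 vs. 2^160
    SHA-1 + 32-bit CRC:  2^96 vs. 2^192 (assuming the CRC is
                         independent of the SHA-1)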

http://en.wikipedia.org/wiki/Birthday_attack#Digital_signature_susceptibility
has a discussion regarding digital signatures.  The trick is for a
person to create a large number of variations of a "fair" and "unfair"
contract, and use a birthday attack to find a pair that have the same
hash.  The variations are typically inconsequential changes (extra
spacing, commas, etc.)  In the case I was discussing, a developer
creates some code, commits it and pushes it to a shared repository -
the developer is not given code by the attacker.  The attacker can,
however, see the code by fetching it.  An attack then consists of
generating a collision, changing the object in the attacker's local
repository, and then pushing the original developer's commit (with the
modified object) to another shared repository before someone else
puts the correct objects into that repository.  A birthday attack
does not work in this case.

There is one issue that this suggests, however - it is not clear if the
2^57 number given for the best SHA-1 attacks was for attempts to generate
a new file with the same SHA-1 hash as an existing file or for finding
a pair of files that have the same SHA-1 hash.  If it is the latter,
then an attack is significantly harder than I assumed as a worst case,
but still possibly much, much better than brute force.

http://en.wikipedia.org/wiki/Birthday_problem has an analysis of the
birthday problem (the basis of a birthday attack) and clearly notes that
this is different from the "same birthday as you" variation - you don't
do nearly so well in that case.

> Anyway, all of that is just reiterating that CRC should not be used
>      as a security function. It can easily be replaced in your scheme by
>      sha-256, which does have the desired properties.

Oh, I'm perfectly happy with sha-256 (and indicated that in the
documentation that is in the patch) - there's a tradeoff between error
detection and speed, and I simply guessed that the community was more
concerned with speed.

>   2. Your attack seems to be "find the sha-1 collision, publish one of
>      your colliding objects (i.e., the innocent-looking half), then try
>      to break the CRC". And then you claim that by the time you find the
>      CRC, everybody will already have the object.
> 
>      But wouldn't a smarter attack be to first find the collision, including
>      the CRC, and only _then_ start the attack? Then nobody will have
>      the object.

My assumption was that a legitimate developer wrote some code, put it
in one of the remote repositories, and then an attacker downloaded that
code and tried to get a modified version into a different remote
repository.

> 
>      Moreover, it's not true that after a year everyone will have the
>      object. People still run "git clone" against kernel.org. Those
>      repos do not have the object.

This isn't the problem - a clone is going to be an exact copy of an
existing repository.  I was referring to the case where a commit made it
into one remote repository but not another - to get into the second one,
someone else would have had to review the code, creating a window where
an attacker with write access to the second repository could put the
modified object there.

> Right, but we are assuming that sha1 is broken. That's the whole
> security problem. So the existing digest is not worth much.

The assumption was more that "J Random Hacker's" code would be
trusted far less than code submitted by Linus, so if "J Random Hacker"
can create a replacement file with the same SHA-1 hash as a file in
one of Linus' commits, others will initially assume that Linus
wrote that code.  But Git uses a "first in wins" rule, so the bad guy
has to generate the replacement file and get it inserted before Linus'
commit reaches all the shared repositories for the project.  A SHA-1
collision is harmless if computed too late to get in.

> > Second, any value of the CRC that is stored permanently (barring bugs
> > in my implementation, of course) is computed locally

> But if I don't already have the object, then I have nothing to compare
> against. So when I get it from kernel.org, I have to simply accept that
> the object I'm getting is good, and write it into my object db.

The value is computed locally but can be compared remotely.

You (specifically) won't find anything while talking to kernel.org in
this example, but when you try to fetch from a different repository,
instead of just noting that you have the object, you'll get a
notification that there was a collision, so you'll find the problem
earlier than you would otherwise.  Also, anyone who had the right object
from kernel.org (the object before the repository was 'hacked') would get
a warning when trying to fetch from kernel.org after the change.

> Yes, the header has to go at the end of the existing headers. But I
> don't see any reason that would be a problem for the scheme I described.

Thanks for mentioning it - I tried putting the header in multiple
locations and probably missed the one spot where it would work.

> Current git should ignore headers that it doesn't understand. I haven't
> tested this, but Junio recently has been experimenting with
> gpg-signature lines in commits, and I'm pretty sure he checked that
> older gits properly ignore them.

Possibly there was another bug in the test I tried as well - something
was preventing the directory cleanup in one of the 'notes' tests.

Anyway, thanks for the comments.

Bill


* Re: [PATCH] Implement fast hash-collision detection
  2011-11-30 18:05         ` Junio C Hamano
@ 2011-12-01  4:43           ` Nguyen Thai Ngoc Duy
  0 siblings, 0 replies; 20+ messages in thread
From: Nguyen Thai Ngoc Duy @ 2011-12-01  4:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Bill Zaumen, git, spearce, torvalds

On Thu, Dec 1, 2011 at 1:05 AM, Junio C Hamano <gitster@pobox.com> wrote:
> Nguyen Thai Ngoc Duy <pclouds@gmail.com> writes:
>
>> What I'm thinking is whether it's possible to decouple two sha-1 roles
>> in git, as object identifier and digest, separately.
>
> Why would it be a good thing? If you have a collided identifier, somebody
> has to choose which blob a particular tree wants to have at the path, and
> because the tree would not record anything but the identifier, you cannot.

Accidental collision likelihood is small enough that we don't have to
care about it.

>> ...
>> The day sha-1 is broken, a project can generate new digests from its
>> old good repo and enforce developers to use new digests for
>> verification instead of sha-1. sha-1 is still used by git as
>> identifier after that day.
>
> And an old blob that is identified with a SHA-1 now has a new blob that
> has different contents but happens to have the same SHA-1. How does Git
> decide which blob to use when a particular object is named by the SHA-1?

Again, I assume the likelihood that one piece of content happens to
have the same sha-1 as another is too low to care about. If two do
collide, it must be an attack. We use the other digests to keep
malicious objects from entering in the first place. Once objects are
in, they are safe to use.
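
(A sketch of what I mean - sha256sum stands in for whatever stronger
digest gets picked:)

    # in the known-good repository: record a stronger digest for
    # every object, keyed by its sha-1 name
    git rev-list --objects --all | cut -d' ' -f1 | sort -u |
    while read obj
    do
            t=$(git cat-file -t $obj)
            echo "$obj $(git cat-file $t $obj | sha256sum | cut -d' ' -f1)"
    done >good-digests

Developers then verify fetched objects against good-digests out of
band; sha-1 stays the identifier.
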
-- 
Duy


* Re: [PATCH] Implement fast hash-collision detection
  2011-12-01  0:41       ` Bill Zaumen
@ 2011-12-01  5:26         ` Jeff King
  2011-12-02  2:59           ` Bill Zaumen
  0 siblings, 1 reply; 20+ messages in thread
From: Jeff King @ 2011-12-01  5:26 UTC (permalink / raw)
  To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds

On Wed, Nov 30, 2011 at 04:41:15PM -0800, Bill Zaumen wrote:

> Aside from that, suppose the attacker does what you suggest (providing
> a valid CRC so that git commands like verify-pack don't have an error
> to detect).  You can't tell that something is wrong, but Linus can -
> the next time he does a fetch.  During a fetch, the server sends
> some SHA-1 hashes and the client responds with 'have' or 'want' in a
> reply.  If the client wants an object, it has no CRC to send; if it
> sends 'have', the CRCs are available, so those get sent too.  The
> server then checks, notices a mismatch (probably in the CRC of the CRCs
> of all the blobs in the commit's tree), and generates an error.

OK. I thought your claim was that this would help detect
collision-related forgeries in the chain of hashes while verifying a
signed tag. But your claim is only that your scheme gives people who
already have the objects an additional chance to notice discrepancies
between what the server is serving and what they have. And then they can notify
other people of the problem in an out-of-band way (i.e., telling
kernel.org admins that the system is hacked, or telling users not to
fetch from there).

Cryptographically speaking, I think that claim is sound, and you can
certainly construct attack scenarios where this detection would help.
However, quantifying the effectiveness is difficult.  What is the
likelihood of a malicious collided-object replacement being detected
without your scheme? What is it with your scheme in place?

There are many questions that factor into guessing the latter.

How often does Linus fetch from his own kernel.org repository? He would
usually push, I would think.  Even if he does fetch, he wouldn't be
getting the old objects that he already has. I guess this is the reason
for your digest-of-digests for each commit object sent? But what about
objects that are no longer in current commits, but are in older ones?
E.g., v1.0 of foo.c has sha1 X, and an attacker finds a collision and
replaces it. Meanwhile, the tip of development now has replaced foo.c
with X'. When Linus talks to the server, git will never care about X (it
is neither being transmitted, nor is part of a commit that is being
transmitted).  But people may still be cloning and using the v1.0 tag,
assuming it is valid.

What about the server being more clever about hiding the replacement
object? E.g., instead of just breaking into kernel.org and inserting a
replacement object, the attacker runs a malicious git-daemon that
returns the bogus object to cloners, but the real object to fetchers.
Thus there is nothing for Linus to detect, but new cloners get the
malicious object. Or you could give the bogus object to people who are
getting the object for the first time (since they presumably have no
crc for that object), but otherwise use the real object.

You could get around this deception by pretending to be a "victim";
i.e., cloning fresh and seeing what the server gives you, comparing it
to your known-good repository. Which is similar to what you suggest
here:

> It's also possible to write some additional commands to (for example)
> fetch the SHA-1 hashes and CRCs from all remote repositories you use
> and compare these to make sure they are all consistent, something that
> can be run occasionally.

But we can already do that. Assume you have an existing repo "foo". To
verify the copy at git://example.com/foo.git, do a fresh clone to
"bar", and then compare the objects in "foo" to "bar", either byte-wise
or by digest.
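
For example, something along these lines (a sketch; it assumes "foo" is
the known-good repository next to the fresh clone, and compares every
object byte-wise):

    git clone --bare git://example.com/foo.git bar.git
    cd foo
    git rev-list --objects --all | cut -d' ' -f1 | sort -u |
    while read obj
    do
            t=$(git cat-file -t $obj)
            git cat-file $t $obj >/tmp/ours
            GIT_DIR=../bar.git git cat-file $t $obj >/tmp/theirs ||
                    echo "missing from clone: $obj"
            cmp -s /tmp/ours /tmp/theirs || echo "mismatch: $obj"
    done

Since both repositories name objects by the same SHA-1, any byte
difference under the same name is exactly a collision.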

> That result for a birthday attack assumes the goal is to find two files
> that will have the same SHA-1 value (or SHA-1 + CRC).  The case I was
> talking about (apologies if I did not state this clearly) is that you
> have an existing object and an existing CRC and you want to generate
> a second object with the same SHA-1 and same CRC as the first.  A
> birthday attack doesn't work so well in that case - the expected number
> of tries is on the order of 2^n rather than 2^(n/2) for an n-bit digest.

Yes. This is called a "pre-image" attack (to be more specific, it is a
"second pre-image attack", since you have the original object). And
AFAIK, it's not feasible against SHA-1, nor will it be in the near
future (even MD5, which is considered totally broken, does not have
feasible pre-image attacks).

> http://en.wikipedia.org/wiki/Birthday_attack#Digital_signature_susceptibility
> has a discussion regarding digital signatures.  The trick is for a
> person to create a large number of variations of a "fair" and "unfair"
> contract, and use a birthday attack to find a pair that have the same
> hash.  The variations are typically inconsequential changes (extra
> spacing, commas, etc.)

Right. This is the classic birthday attack. I don't keep up with the
state of the art in collision attacks these days, but I think in
practice they are usually executed by adding arbitrary binary garbage
into a "hidden" spot in a file. Which makes it much harder to execute
against text files like source code (as opposed to, say, a binary
firmware blob). IIRC, the practical MD5 attacks involved postscript
documents (with non-printing garbage sections that printers would
ignore).

> In the case I was discussing, a developer creates some code, commits
> it and pushes it to a shared repository - the developer is not given
> code by the attacker.  The attacker can, however, see the code by
> fetching it.  An attack then consists of generating a collision,
> changing the object in the attacker's local repository, and then
> pushing the original developer's commit (with the modified object) to
> another shared repository before someone else puts the correct objects
> into that repository.  A birthday attack does not work in this case.

Yeah, that is simply not feasible, and is not likely to become so
anytime soon.

> There is one issue that this suggests, however - it is not clear if the
> 2^57 number given for the best SHA-1 attacks was for attempts to generate
> a new file with the same SHA-1 hash as an existing file or for finding
> a pair of files that have the same SHA-1 hash.  If it is the latter,
> then an attack is significantly harder than I assumed as a worst case,
> but still possibly much, much better than brute force.

The 2^57 number is for a collision of two new objects, not for a
pre-image attack. AFAIK, there are currently no pre-image attacks on
SHA-1 at all.

I don't think there's a need to respond individually to the points in
the rest of your email; they're all based on the assumption that the
attacker is replacing a known-good written-by-Linus commit with one
found in a pre-image attack.

-Peff


* Re: [PATCH] Implement fast hash-collision detection
  2011-12-01  5:26         ` Jeff King
@ 2011-12-02  2:59           ` Bill Zaumen
  2011-12-02 17:00             ` Jeff King
  0 siblings, 1 reply; 20+ messages in thread
From: Bill Zaumen @ 2011-12-02  2:59 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds

On Thu, 2011-12-01 at 00:26 -0500, Jeff King wrote:

> Cryptographically speaking, I think that claim is sound, and you can
> certainly construct attack scenarios where this detection would help.
> However, quantifying the effectiveness is difficult.  What is the
> likelihood of a malicious collided-object replacement being detected
> without your scheme? What is it with your scheme in place?
> 
> There are many questions that factor into guessing the latter.
> 
> How often does Linus fetch from his own kernel.org repository? He would
> usually push, I would think.  Even if he does fetch, he wouldn't be
> getting the old objects that he already has. I guess this is the reason
> for your digest-of-digests for each commit object sent? 

Yes - the digest of digests is used to check things that would not
otherwise be sent.

>
> But what about
> objects that are no longer in current commits, but are in older ones?

That's a good question, of course. After Linus pushes a commit, he'll
notify others, and if some of them fetch before the repository is hacked,
they will detect an error on a subsequent fetch.  For fetches, the
server reports a series of commits, and the client responds with 'have'
for those it has, with the CRCs added so the server can check for a
mismatch.  I made minimal changes to fetch-pack.c and upload-pack.c:
just adding the CRC fields to the messages already sent.  The server
asks about more commits than it actually transfers, but all the ones
it asks about are tested.  One could send additional 'have' replies
if necessary (for ones the server didn't mention) but I didn't do
that, partly for simplicity but also because I was looking at the
fetch-pack.c and upload-pack.c code for the first time. If desired,
such changes can be added.

I also do some similar checking when a commit is pushed - the server
tells the client the last commit it has and the client will send the
CRCs in its reply to allow the server to cross-check those.  I didn't
mention that before because only the latest is really checked. Again,
I just changed a message format (backwards compatible, of course),
but additional checking could be added if desired.

You could also add options to check tips of branches and all commits
that have tags (e.g., a v1.0 tag).  All of that simply requires more
work on commands such as fetch-pack, upload-pack, send-pack and
receive-pack.

> What about the server being more clever about hiding the replacement
> object? E.g., instead of just breaking into kernel.org and inserting a
> replacement object, the attacker runs a malicious git-daemon that
> returns the bogus object to cloners, but the real object to fetchers.

That's really a server-security issue, not a git one.  Perhaps
repositories should be configured so that all the executables are on
read-only partitions.  It's an important question in general of
course, but it is probably useful to distinguish attacks that put
bad data on a server from ones that install new software.

> 
> > It's also possible to write some additional commands to (for example)
> > fetch the SHA-1 hashes and CRCs from all remote repositories you use
> > and compare these to make sure they are all consistent, something that
> > can be run occasionally.
> 
> But we can already do that. Assume you have an existing repo "foo". To
> verify the copy at git://example.com/foo.git, do a fresh clone to
> "bar", and then compare the objects in "foo" to "bar", either byte-wise
> or by digest.

Of course, but that is an expensive operation - in the case of Git
transferring some 50 MBytes of data per repository.  A command to
fetch the SHA-1 ID and a CRC or message digest for each object would
not only run faster, but should put a much lower load on the server.

Getting back to the birthday attack question (this is an area where
your comments were very useful for me), there's a case I didn't
consider.

Suppose two developers bounce code back and forth by email and one of
them commits it, but the other developer is a bad guy.  The bad guy
would then have had an opportunity to use a birthday attack by sending
back subtly modified code (e.g., changes to how comments are formatted,
additional blank lines, etc.)  He can even put a humorous comment at
the end of the file such as "the first 200 Chinese characters I learned"
and then include the Chinese characters (I've tested this with gcc -
the Chinese characters, represented in Unicode, print in an editor and
are ignored by the compiler.)  Unicode is a lot closer to binary data
so you have a lot of bits you can alter in a small amount of space,
with each character requiring multiple bytes to represent it. The
comment will look silly but innocuous.  I think Linus Torvalds once
suggested being suspicious of anything that looked like "line noise"
in a patch.  Non-Western Unicode characters can serve the same function
but look legitimate, at least to people who don't know the language
and when coupled with some "social engineering" to set expectations.
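
For instance (the comment below is a placeholder - in a real attack
those are exactly the bytes that would be varied while searching for
the collision):

    #include <stdio.h>

    int main(void)
    {
            printf("looks harmless\n");
            return 0;
    }
    /* 我学的前两百个汉字... the bytes inside this trailing comment
     * never reach the compiled output, so they are free degrees of
     * freedom for the collision search. */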

As an example of how this attack might work, without breaking into a
system, assume two programmers collaborating on a project both have
write-access to the same shared repository.

1. The project is using Java, with a rule that all classes and methods
that are protected or public be documented (so javadoc can create API
documentation).

2. Programmer A emails some Java source code to programmer B with a
request to edit the comments to improve them or fix any obvious
mistakes.

3. Programmer B fixes the comments, but also creates a modified file
with the same SHA-1 hash as the correct file in order to add some bugs
or security flaws. 

4. Programmer B creates a branch from an earlier version, adds some
tests, puts the contents of the modified file into the directory tree
under an obscure name, adds it and does a commit.  B then pushes it,
creating a new remote branch.

5. Programmer B then tells Programmer A that he'll have the corrected
file back quickly, but could A please fetch the new remote branch,
run a test, and answer some questions about what happens, as B needs
that information to finish his review of the documentation.

6. Programmer A tells B the results, so B knows that A has fetched the
remote branch, and B then sends A the corrected file (not the modified
one). Programmer A reviews the file, notes that everything seems OK,
specifically that only comments were changed, and runs commit -a
followed by a push.  Because Git tries to be smart, it will (I think)
notice from the SHA-1 hashes and from the remote branches that the
server already has the object so there is no need to send it.

7. Programmer B fetches the changes and deletes his temporary branch,
both locally and on the shared repository.  He tells A that the
temporary branch is deleted, so that A can prune it (e.g., with
"git remote prune ...").

So what happens?  Hopefully someone finds the problem, either through a
source-code review or some QA testing, but regardless Programmer A may
get the blame as the evidence of any tampering has pretty much been
erased.  In the worst case, a release with a security hole goes out.

Why would Programmer B do that?  Maybe he's leaving the company because
he's hard to work with and is blaming Programmer A for the problem, and
wants to "get back" at Programmer A by harming his career. But in any
case he didn't have to break into the repository to get the effect he
wanted - though it is still extremely hard to do in terms of the
computational resources required.

Bill


* Re: [PATCH] Implement fast hash-collision detection
  2011-12-02  2:59           ` Bill Zaumen
@ 2011-12-02 17:00             ` Jeff King
  0 siblings, 0 replies; 20+ messages in thread
From: Jeff King @ 2011-12-02 17:00 UTC (permalink / raw)
  To: Bill Zaumen; +Cc: git, gitster, pclouds, spearce, torvalds

On Thu, Dec 01, 2011 at 06:59:04PM -0800, Bill Zaumen wrote:

> > What about the server being more clever about hiding the replacement
> > object? E.g., instead of just breaking into kernel.org and inserting a
> > replacement object, the attacker runs a malicious git-daemon that
> > returns the bogus object to cloners, but the real object to fetchers.
> 
> That's really a server-security issue, not a git one.  Perhaps
> repositories should be configured so that all the executables are on
> read-only partitions.  It's an important question in general of
> course, but it is probably useful to distinguish attacks that put
> bad data on a server from ones that install new software.

I don't agree here. You have to assume that the attacker will ignore
attacks you have blocked, but continue with ones you haven't (just to
counter your example, why not replace the running git-daemon
in-memory?).

You can target the narrow window of attacks that compromise the on-disk
repository without being able to execute arbitrary code. But I don't see
a point. After the kernel.org hack, yes, people are interested in
hardening kernel.org. But they are much more interested in cryptographic
sources of authority that let us not have to trust kernel.org at all.
Having some weird half-way trust just complicates things.

> > But we can already do that. Assume you have an existing repo "foo". To
> > verify the copy at git://example.com/foo.git, do a fresh clone to
> > "bar", and then compare the objects in "foo" to "bar", either byte-wise
> > or by digest.
> 
> Of course, but that is an expensive operation - in the case of Git
> transferring some 50 MBytes of data per repository.  A command to
> fetch the SHA-1 ID and a CRC or message digest for each object would
> not only run faster, but should put a much lower load on the server.

Yes, it is more expensive. But again, my threat model is that the server
is not trusted to serve data accurately or consistently. So you can't
come to the server and say "Hey, I'm doing a security verification. Can
you send me the CRCs?" You _have_ to present yourself as one of the
victims to be infected by the bad object, or a smart attacker will send
you the unmodified data.

> Getting back to the birthday attack question (this is an area where
> your comments were very useful for me), there's a case I didn't
> consider.
> [elaborate birthday attack scenario]

From my quick reading of your scenario, yes, that is a possible attack.
To me, though, it just highlights the need for either a non-colliding
algorithm, or for better trust verification about the authors of objects
(i.e., cryptographically strong trust).

-Peff

