Git Mailing List Archive on lore.kernel.org
* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 20:52                               ` Junio C Hamano
@ 2017-06-16 21:12                                 ` Junio C Hamano
  2017-06-16 21:24                                   ` Jonathan Nieder
  0 siblings, 1 reply; 23+ messages in thread
From: Junio C Hamano @ 2017-06-16 21:12 UTC (permalink / raw)
  To: Adam Langley
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	brian m. carlson, Jeff King, Mike Hommey, Brandon Williams,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	Jonathan Tan

Junio C Hamano <gitster@pobox.com> writes:

> Adam Langley <agl@google.com> writes:
>
>> However, as I'm not a git developer, I've no opinion on whether the
>> cost of carrying implementations of these functions is worth the speed
>> vs using SHA-256, which can be assumed to be supported everywhere
>> already.
>
> Thanks.
>
> My impression from this thread is that even though fast may be
> better than slow, ubiquity trumps it for our use case, as long as
> the thing is not absurdly and unusably slow, of course.  Which makes
> me lean towards something older/more established like SHA-256, and
> it would be a very nice bonus if it gets hardware acceleration more
> widely than others ;-)

Ah, I recall one thing that was mentioned but not discussed much in
the thread: possible use of tree-hashing to exploit multiple cores
hashing a large-ish payload.  As long as it is OK to pick a sound
tree hash coding on top of any (secure) underlying hash function,
I do not think the use of tree-hashing should affect which exact
underlying hash function is to be used, and I also am not convinced
that we really want tree hashing (some codepaths that deal with a large
payload want to stream the data in a single pass from head to tail)
in the context of Git, but I am not a crypto person, so ...
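
[Editor's sketch] To make the tree-hashing idea above concrete, here is a
minimal two-level tree hash layered on SHA-256. It is an illustration only,
not a construction anyone proposed in this thread: the name tree_hash, the
1 MiB leaf size, and the 0x00/0x01 domain-separation prefixes are all
assumptions chosen for the example. It also shows the streaming concern:
the root cannot be finalized until every leaf digest is available.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 1 << 20  # hypothetical leaf size (1 MiB), not a value from the thread


def _leaf(chunk: bytes) -> bytes:
    # The 0x00 prefix marks a leaf so it cannot be confused with the root node.
    return hashlib.sha256(b"\x00" + chunk).digest()


def tree_hash(data: bytes, chunk_size: int = CHUNK_SIZE) -> str:
    """Two-level tree hash: hash fixed-size leaves, then hash the leaf digests."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    # hashlib releases the GIL for large inputs, so leaves can be hashed on
    # several cores even with plain threads.
    with ThreadPoolExecutor() as pool:
        leaves = list(pool.map(_leaf, chunks))
    # The 0x01 prefix marks the root; the total length is bound in as well.
    root = hashlib.sha256(b"\x01" + len(data).to_bytes(8, "big") + b"".join(leaves))
    return root.hexdigest()
```

Note that the chunk size becomes part of the function's definition: the
same payload hashed with a different leaf size yields a different digest.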




* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 20:42                             ` Jeff King
@ 2017-06-19  9:26                               ` Johannes Schindelin
  0 siblings, 0 replies; 23+ messages in thread
From: Johannes Schindelin @ 2017-06-19  9:26 UTC (permalink / raw)
  To: Jeff King
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Adam Langley, Mike Hommey, Brandon Williams, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

Hi Peff,

On Fri, 16 Jun 2017, Jeff King wrote:

> On Fri, Jun 16, 2017 at 03:24:19PM +0200, Johannes Schindelin wrote:
> 
> > I have no doubt that Visual Studio Team Services, GitHub and Atlassian
> > will eventually end up with FPGAs for hash computation. So that's
> > that.
> 
> I actually doubt this from the GitHub side. Hash performance is not even
> on our radar as a bottleneck. In most cases the problem is touching
> uncompressed data _at all_, not computing the hash over it (so things
> like reusing on-disk deltas are really important).

Thanks for pointing that out! As a mainly client-side person, I rarely get
insights into the server side...

Ciao,
Dscho


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 21:24                                   ` Jonathan Nieder
@ 2017-06-16 21:39                                     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 23+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-16 21:39 UTC (permalink / raw)
  To: Jonathan Nieder
  Cc: Junio C Hamano, Adam Langley, Johannes Schindelin,
	brian m. carlson, Jeff King, Mike Hommey, Brandon Williams,
	Linus Torvalds, Git Mailing List, Stefan Beller, Jonathan Tan


On Fri, Jun 16 2017, Jonathan Nieder jotted:
> Part of the reason I suggested previously that it would be helpful to
> try to benchmark Git with various hash functions (which didn't go over
> well, for some reason) is that it makes these comparisons more
> concrete.  Without measuring, it is hard to get a sense of the
> distribution of input sizes and how much practical effect the
> differences we are talking about have.

It would be great to have such benchmarks (I probably missed the "didn't
go over well" part), but FWIW you can get pretty close to this right now
in git by running various t/perf benchmarks with
BLKSHA1/OPENSSL/SHA1DC.

Between the three of those (particularly SHA1DC being slower than
OpenSSL) you get a similar performance difference to some SHA-1
vs. SHA-256 benchmarks I've seen, so to the extent that we have
existing performance tests it's revealing to see what's slower & faster.

It makes a particularly big difference for e.g. p3400-rebase.sh.


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 21:12                                 ` Junio C Hamano
@ 2017-06-16 21:24                                   ` Jonathan Nieder
  2017-06-16 21:39                                     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 23+ messages in thread
From: Jonathan Nieder @ 2017-06-16 21:24 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Adam Langley, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, brian m. carlson,
	Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds,
	Git Mailing List, Stefan Beller, Jonathan Tan

Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>> Adam Langley <agl@google.com> writes:

>>> However, as I'm not a git developer, I've no opinion on whether the
>>> cost of carrying implementations of these functions is worth the speed
>>> vs using SHA-256, which can be assumed to be supported everywhere
>>> already.
>>
>> Thanks.
>>
>> My impression from this thread is that even though fast may be
>> better than slow, ubiquity trumps it for our use case, as long as
>> the thing is not absurdly and unusably slow, of course.  Which makes
>> me lean towards something older/more established like SHA-256, and
>> it would be a very nice bonus if it gets hardware acceleration more
>> widely than others ;-)
>
> Ah, I recall one thing that was mentioned but not discussed much in
> the thread: possible use of tree-hashing to exploit multiple cores
> hashing a large-ish payload.  As long as it is OK to pick a sound
> tree hash coding on top of any (secure) underlying hash function,
> I do not think the use of tree-hashing should affect which exact
> underlying hash function is to be used, and I also am not convinced
> that we really want tree hashing (some codepaths that deal with a large
> payload want to stream the data in a single pass from head to tail)
> in the context of Git, but I am not a crypto person, so ...

Tree hashing also affects single-core performance because of the
availability of SIMD instructions.

That is how software implementations of e.g. blake2bp-256 and
SHA-256x16[1] are able to have competitive performance with (slightly
better performance than, at least in some cases) hardware
implementations of SHA-256.

It is also satisfying that we have options like these that are faster
than SHA-1.

All that said, SHA-256 seems like a fine choice, despite its worse
performance.  The wide availability of reasonable-quality
implementations (e.g. in Java you can use
'MessageDigest.getInstance("SHA-256")') makes it a very tempting one.

Part of the reason I suggested previously that it would be helpful to
try to benchmark Git with various hash functions (which didn't go over
well, for some reason) is that it makes these comparisons more
concrete.  Without measuring, it is hard to get a sense of the
distribution of input sizes and how much practical effect the
differences we are talking about have.
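
[Editor's sketch] As a rough illustration of how input size changes the
picture, the following micro-benchmark times a few hash functions over
several payload sizes. It measures Python's hashlib (which typically wraps
OpenSSL), not Git itself, so it says nothing about Git's real workload; the
function name throughput_mb_s and the chosen sizes are assumptions made for
the example.

```python
import hashlib
import time


def throughput_mb_s(name: str, size: int, repeats: int = 5) -> float:
    """Best-of-`repeats` throughput in MB/s for hashing `size` zero bytes."""
    payload = b"\0" * size
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        hashlib.new(name, payload).digest()
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading on timers with coarse resolution.
    return size / max(best, 1e-9) / 1e6


if __name__ == "__main__":
    # Small, medium and large inputs; real object sizes in a repository vary.
    for size in (1 << 10, 1 << 16, 1 << 24):
        for name in ("sha1", "sha256", "blake2b"):
            print(f"{name:8} {size:>9} bytes {throughput_mb_s(name, size):10.1f} MB/s")
```

For small inputs the per-call overhead dominates, which is one reason the
distribution of input sizes matters as much as the cycles/byte figures.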

Thanks,
Jonathan

[1] https://eprint.iacr.org/2012/476.pdf


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 17:38                             ` Adam Langley
@ 2017-06-16 20:52                               ` Junio C Hamano
  2017-06-16 21:12                                 ` Junio C Hamano
  0 siblings, 1 reply; 23+ messages in thread
From: Junio C Hamano @ 2017-06-16 20:52 UTC (permalink / raw)
  To: Adam Langley
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	brian m. carlson, Jeff King, Mike Hommey, Brandon Williams,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	Jonathan Tan

Adam Langley <agl@google.com> writes:

> However, as I'm not a git developer, I've no opinion on whether the
> cost of carrying implementations of these functions is worth the speed
> vs using SHA-256, which can be assumed to be supported everywhere
> already.

Thanks.

My impression from this thread is that even though fast may be
better than slow, ubiquity trumps it for our use case, as long as
the thing is not absurdly and unusably slow, of course.  Which makes
me lean towards something older/more established like SHA-256, and
it would be a very nice bonus if it gets hardware acceleration more
widely than others ;-)



* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 13:24                           ` Johannes Schindelin
  2017-06-16 17:38                             ` Adam Langley
@ 2017-06-16 20:42                             ` Jeff King
  2017-06-19  9:26                               ` Johannes Schindelin
  1 sibling, 1 reply; 23+ messages in thread
From: Jeff King @ 2017-06-16 20:42 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Adam Langley, Mike Hommey, Brandon Williams, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

On Fri, Jun 16, 2017 at 03:24:19PM +0200, Johannes Schindelin wrote:

> I have no doubt that Visual Studio Team Services, GitHub and Atlassian
> will eventually end up with FPGAs for hash computation. So that's that.

I actually doubt this from the GitHub side. Hash performance is not even
on our radar as a bottleneck. In most cases the problem is touching
uncompressed data _at all_, not computing the hash over it (so things
like reusing on-disk deltas are really important).

-Peff


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16 13:24                           ` Johannes Schindelin
@ 2017-06-16 17:38                             ` Adam Langley
  2017-06-16 20:52                               ` Junio C Hamano
  2017-06-16 20:42                             ` Jeff King
  1 sibling, 1 reply; 23+ messages in thread
From: Adam Langley @ 2017-06-16 17:38 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, brian m. carlson,
	Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

On Fri, Jun 16, 2017 at 6:24 AM, Johannes Schindelin
<Johannes.Schindelin@gmx.de> wrote:
>
> And while I am really thankful that Adam chimed in, I think he would agree
> that BLAKE2 is a purposefully weakened version of BLAKE, for the benefit
> of speed

That is correct.

It is worth keeping in mind, though, that the analysis results from the
SHA-3 process informed this rebalancing. Indeed, NIST proposed[1] to
do the same with Keccak before stamping it as SHA-3 (although it
ultimately did not, given public feeling in late 2013). The
Keccak team have essentially done the same with K12. Thus there is
evidence of a fairly widespread belief that the SHA-3 parameters were
excessively cautious.

[1] https://docs.google.com/file/d/0BzRYQSHuuMYOQXdHWkRiZXlURVE/edit, slide 48

> (with the caveat that one of my experts disagrees that BLAKE2b
> would be faster than hardware-accelerated SHA-256).

The numbers given above for SHA-256 on Ryzen and Cortex-A72 must be
with hardware acceleration and I thank Brian Carlson for digging them
up as I hadn't seen them before.

I suggested above that BLAKE2bp (note the p at the end) might be
faster than hardware SHA-256 and that appears to be plausible based on
benchmarks[2] of that function. (With the caveat those numbers are for
Haswell and Skylake and so cannot be directly compared with Ryzen.)

K12 reports similar speeds on Skylake[3] and thus is also plausibly
faster than hardware SHA-256.

[2] https://github.com/sneves/blake2-avx2
[3] http://keccak.noekeon.org/KangarooTwelve.pdf

However, as I'm not a git developer, I've no opinion on whether the
cost of carrying implementations of these functions is worth the speed
vs using SHA-256, which can be assumed to be supported everywhere
already.


Cheers

AGL


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
@ 2017-06-16 13:24                           ` Johannes Schindelin
  2017-06-16 17:38                             ` Adam Langley
  2017-06-16 20:42                             ` Jeff King
  0 siblings, 2 replies; 23+ messages in thread
From: Johannes Schindelin @ 2017-06-16 13:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: brian m. carlson, Adam Langley, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Junio Hamano


Hi,

On Fri, 16 Jun 2017, Ævar Arnfjörð Bjarmason wrote:

> On Fri, Jun 16 2017, brian m. carlson jotted:
> 
> > On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> >
> >> So I don't follow the argument that we shouldn't weigh future HW
> >> acceleration highly just because you can't easily buy a laptop today
> >> with these features.
> >>
> >> Aside from that I think you've got this backwards, it's AMD that's
> >> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
> >> starting at the lower end this year with Goldmont which'll be in
> >> lower-end consumer devices[2]. If you read the github issue I linked
> >> to upthread[3] you can see that the cryptopp devs already tested
> >> their SHA accelerated code on a consumer Celeron[4] recently.
> >>
> >> I don't think Intel has announced the SHA extensions for future Xeon
> >> releases, but it seems a given that they're going to have it there as
> >> well. Have there ever been x86 extensions that aren't eventually
> >> portable across the entire line, or that they've ended up removing
> >> from x86 once introduced?
> >>
> >> In any case, I think by the time we're ready to follow-up the current
> >> hash refactoring efforts with actually changing the hash
> >> implementation many of us are likely to have laptops with these
> >> extensions, making this easy to test.
> >
> > I think you underestimate the life of hardware and software.  I have
> > servers running KVM development instances that have been running since
> > at least 2012.  Those machines are not scheduled for replacement
> > anytime soon.
> >
> > Whatever we deploy within the next year is going to run on existing
> > hardware for probably a decade, whether we want it to or not.  Most of
> > those machines don't have acceleration.
> 
> To clarify, I'm not dismissing the need to consider existing hardware
> without these acceleration functions or future processors without them.
> I don't think that makes any sense, we need to keep those in mind.
> 
> I was replying to a bit in your comment where you (it seems to me) were
> making the claim that we shouldn't consider the HW acceleration of
> certain hash functions either.

Yes, I also had the impression that it stressed the status quo quite a bit
too much.

We know for a fact that SHA-256 acceleration is coming to consumer CPUs.
We know of no plans for any of the other mentioned hash functions to
hardware-accelerate them in consumer CPUs.

And remember: for those who are affected most (humongous monorepos, source
code hosters), upgrading hardware is less of an issue than having a secure
hash function for the rest of us.

And while I am really thankful that Adam chimed in, I think he would agree
that BLAKE2 is a purposefully weakened version of BLAKE, for the benefit
of speed (with the caveat that one of my experts disagrees that BLAKE2b
would be faster than hardware-accelerated SHA-256). And while BLAKE has
seen roughly as much cryptanalysis as Keccak (which became SHA-3),
BLAKE2 has not.

That makes me *very* uneasy about choosing BLAKE2.

> > Furthermore, you need a reasonably modern crypto library to get hardware
> > acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
> > does not currently support it, and probably never will.  That OS is
> > going to be around for the next 6 years.
> >
> > If we're optimizing for performance, I don't want to optimize for the
> > latest, greatest machines.  Those machines are going to outperform
> > everything else either way.  I'd rather optimize for something which
> > performs well on the whole everywhere.  There are a lot of developers
> > who have older machines, for cost reasons or otherwise.
> 
> We have real data showing that the intersection between people who care
> about the hash slowing down and those who can't afford the latest
> hardware is pretty much nil.
> 
> I.e. in 2.13.0 SHA-1 got slower, and pretty much nobody noticed or cared
> except Johannes Schindelin, myself & Christian Couder. This is because
> in practice hashing only becomes a bottleneck on huge monorepos that
> need to e.g. re-hash the contents of a huge index.

Indeed. I am still concerned about that. As you mention, though, it really
only affects users of ginormous monorepos, and of course source code
hosters.

The jury's still out on how much it impacts my colleagues, by the way.

I have no doubt that Visual Studio Team Services, GitHub and Atlassian
will eventually end up with FPGAs for hash computation. So that's that.

Side note: BLAKE is actually *not* friendly to hardware acceleration, I
have been told by one cryptography expert. In contrast, the Keccak team
claims SHA3-256 to be the easiest to hardware-accelerate, making it "a
green cryptographic primitive":
http://keccak.noekeon.org/is_sha3_slow.html

> > Here are some stats (cycles/byte for long messages):
> >
> >                    SHA-256    BLAKE2b
> > Ryzen                 1.89       3.06
> > Knight's Landing     19.00       5.65
> > Cortex-A72            1.99       5.48
> > Cortex-A57           11.81       5.47
> > Cortex-A7            28.19      15.16
> >
> > In other words, BLAKE2b performs well uniformly across a wide variety of
> > architectures even without acceleration.  I'd rather tell people that
> > upgrading to a new hash algorithm is a performance win either way, not
> > just if they have the latest hardware.
> 
> Yup, all of those need to be considered, although given my comment above
> about big repos a 40% improvement on Ryzen (a processor likely to be
> used for big repos) stands out. Where are those numbers from, and is
> that with or without HW accel for SHA-256 on Ryzen?

When it comes to BLAKE2, I would actually strongly suggest considering the
number of attempts to break it. Or rather, how much less attention it got
than, say, SHA-256.

In any case, I have been encouraged to stress the importance of
"crypto-agility", i.e. the ability to switch to another algorithm when the
current one gets broken "enough".

And I am delighted that that is exactly the direction we are going. In
other words, even if I still think (backed up by the experts on whose
knowledge I lean heavily to form my opinions) that SHA-256 would be the
best choice for now, it should be relatively easy to offer BLAKE2b support
for (and by [*1*]) those who want it.
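
[Editor's sketch] One way to picture "crypto-agility" is an object-ID
function that takes the algorithm as a parameter. The sketch below uses
Git's well-known object header ("<type> <size>" followed by a NUL byte);
the registry HASH_ALGORITHMS, its keys, and the function name object_id are
hypothetical names for this example and do not reflect Git's actual code or
configuration.

```python
import hashlib

# Hypothetical registry: the names are illustrative, not Git configuration keys.
HASH_ALGORITHMS = {
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
    "blake2b-256": lambda: hashlib.blake2b(digest_size=32),
}


def object_id(kind: str, content: bytes, algo: str = "sha1") -> str:
    """Hash `content` with a Git-style header using the selected algorithm."""
    hasher = HASH_ALGORITHMS[algo]()
    hasher.update(f"{kind} {len(content)}".encode() + b"\0")
    hasher.update(content)
    return hasher.hexdigest()


# With SHA-1 this reproduces the familiar ID of the empty blob.
print(object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Swapping algorithms is then a matter of passing a different key, which is
the property that makes a later move away from a broken hash cheaper.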

Ciao,
Dscho

Footnote *1*: I say that the support for BLAKE2b should come from those
parties who desire it also because it is not as ubiquitous as SHA-256.
Hence, it would add the burden of having a performant and reasonably
bug-free implementation in Git's source tree. IIUC OpenSSL added BLAKE2b
support only in OpenSSL 1.1.0; the 1.0.2 line (which is still in use in
many places, e.g. Git for Windows' SDK) does not have it, meaning: Git's
implementation would be the one *everybody* relies on, with *no*
fall-back.


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-16  0:17                       ` brian m. carlson
@ 2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
  2017-06-16 13:24                           ` Johannes Schindelin
  0 siblings, 1 reply; 23+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-16  6:25 UTC (permalink / raw)
  To: brian m. carlson
  Cc: Adam Langley, Johannes Schindelin, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Junio Hamano


On Fri, Jun 16 2017, brian m. carlson jotted:

> On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
>> On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson
>> <sandals@crustytoothpaste.net> wrote:
>> > SHA-256 acceleration exists for some existing Intel platforms already.
>> > However, they're not practically present on anything but servers at the
>> > moment, and so I don't think the acceleration of SHA-256 is
>> > something we should consider.
>>
>> Whatever next-gen hash Git ends up with is going to be in use for
>> decades, so what hardware acceleration exists in consumer products
>> right now is practically irrelevant, but what acceleration is likely
>> to exist for the lifetime of the hash existing *is* relevant.
>
> The life of MD5 was about 23 years (introduction to first document
> collision).  SHA-1 had about 22.  Decades, yes, but just barely.  SHA-2
> was introduced in 2001, and by the same estimate, we're a little over
> halfway through its life.

I'm talking about the lifetime of SHA-1 or $newhash's use in Git. As our
continued use of SHA-1 demonstrates, the window of practical hash
function use extends well beyond the window from introduction to
published breakage.

It's also telling that SHA-1, which any cryptographer would have waved
you off from since around 2011, is just getting widely deployed HW
acceleration now in 2017. The practical use of hash functions far
exceeds their recommended use in new projects.

>> So I don't follow the argument that we shouldn't weigh future HW
>> acceleration highly just because you can't easily buy a laptop today
>> with these features.
>>
>> Aside from that I think you've got this backwards, it's AMD that's
>> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
>> starting at the lower end this year with Goldmont which'll be in
>> lower-end consumer devices[2]. If you read the github issue I linked
>> to upthread[3] you can see that the cryptopp devs already tested their
>> SHA accelerated code on a consumer Celeron[4] recently.
>>
>> I don't think Intel has announced the SHA extensions for future Xeon
>> releases, but it seems a given that they're going to have it there as
>> well. Have there ever been x86 extensions that aren't eventually
>> portable across the entire line, or that they've ended up removing
>> from x86 once introduced?
>>
>> In any case, I think by the time we're ready to follow-up the current
>> hash refactoring efforts with actually changing the hash
>> implementation many of us are likely to have laptops with these
>> extensions, making this easy to test.
>
> I think you underestimate the life of hardware and software.  I have
> servers running KVM development instances that have been running since
> at least 2012.  Those machines are not scheduled for replacement anytime
> soon.
>
> Whatever we deploy within the next year is going to run on existing
> hardware for probably a decade, whether we want it to or not.  Most of
> those machines don't have acceleration.

To clarify, I'm not dismissing the need to consider existing hardware
without these acceleration functions or future processors without
them. I don't think that makes any sense, we need to keep those in mind.

I was replying to a bit in your comment where you (it seems to me) were
making the claim that we shouldn't consider the HW acceleration of
certain hash functions either.

Clearly both need to be considered.

> Furthermore, you need a reasonably modern crypto library to get hardware
> acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
> does not currently support it, and probably never will.  That OS is
> going to be around for the next 6 years.
>
> If we're optimizing for performance, I don't want to optimize for the
> latest, greatest machines.  Those machines are going to outperform
> everything else either way.  I'd rather optimize for something which
> performs well on the whole everywhere.  There are a lot of developers
> who have older machines, for cost reasons or otherwise.

We have real data showing that the intersection between people who care
about the hash slowing down and those who can't afford the latest
hardware is pretty much nil.

I.e. in 2.13.0 SHA-1 got slower, and pretty much nobody noticed or cared
except Johannes Schindelin, myself & Christian Couder. This is because
in practice hashing only becomes a bottleneck on huge monorepos that
need to e.g. re-hash the contents of a huge index.

> Here are some stats (cycles/byte for long messages):
>
>                    SHA-256    BLAKE2b
> Ryzen                 1.89       3.06
> Knight's Landing     19.00       5.65
> Cortex-A72            1.99       5.48
> Cortex-A57           11.81       5.47
> Cortex-A7            28.19      15.16
>
> In other words, BLAKE2b performs well uniformly across a wide variety of
> architectures even without acceleration.  I'd rather tell people that
> upgrading to a new hash algorithm is a performance win either way, not
> just if they have the latest hardware.

Yup, all of those need to be considered, although given my comment above
about big repos a 40% improvement on Ryzen (a processor likely to be
used for big repos) stands out. Where are those numbers from, and is
that with or without HW accel for SHA-256 on Ryzen?
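
[Editor's sketch] The relative differences in the table above can be
worked out directly. The figures are copied from the quoted table; the
helper name sha256_saving is an assumption for this example. On Ryzen the
result is about 38%, which matches the "40%" mentioned above.

```python
# cycles/byte for long messages, copied from the table above: (SHA-256, BLAKE2b)
CYCLES_PER_BYTE = {
    "Ryzen": (1.89, 3.06),
    "Knight's Landing": (19.00, 5.65),
    "Cortex-A72": (1.99, 5.48),
    "Cortex-A57": (11.81, 5.47),
    "Cortex-A7": (28.19, 15.16),
}


def sha256_saving(sha256: float, blake2b: float) -> float:
    """Fraction of cycles SHA-256 saves relative to BLAKE2b (negative if slower)."""
    return 1.0 - sha256 / blake2b


for cpu, (sha256, blake2b) in CYCLES_PER_BYTE.items():
    print(f"{cpu:18} {sha256_saving(sha256, blake2b):+.0%}")
```

A positive value means SHA-256 needs fewer cycles per byte on that CPU; a
negative value means BLAKE2b is the faster of the two there.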


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 21:10             ` Mike Hommey
@ 2017-06-16  4:30               ` Jeff King
  0 siblings, 0 replies; 23+ messages in thread
From: Jeff King @ 2017-06-16  4:30 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Johannes Schindelin, Brandon Williams, brian m. carlson,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	jonathantanmy, Junio Hamano

On Fri, Jun 16, 2017 at 06:10:22AM +0900, Mike Hommey wrote:

> > > What do the experts think of SHA512/256, which completely removes the
> > > concerns over length extension attack? (which I'd argue is better than
> > > sweeping them under the carpet)
> > 
> > I don't think it's sweeping them under the carpet. Git does not use the
> > hash as a MAC, so length extension attacks aren't a thing (and even if
> > we later wanted to use the same algorithm as a MAC, the HMAC
> > construction is a well-studied technique for dealing with it).
> 
> AIUI, length extension does make brute force collision attacks (which,
> really Shattered was) cheaper by allowing one to create the collision
> with a small message and extend it later.
> 
> This might not be a credible threat against git, but if we go by that
> standard, post-shattered SHA-1 is still fine for git. As a matter of
> fact, MD5 would also be fine: there is still, to this day, no preimage
> attack against them.

I think collision attacks are of interest to Git. But I would think
2^128 would be enough (TBH, 2^80 probably would have been enough for
SHA-1; it was the weaknesses that brought that down by a factor of a
million that made it a problem).
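
[Editor's sketch] The numbers above follow from the generic birthday bound:
finding a collision in an ideal n-bit hash costs on the order of 2^(n/2)
evaluations. The snippet below just does that arithmetic; the function name
generic_collision_bits is an assumption for this example. Since a factor of
a million is roughly 2^20, the figure above puts the effective cost for
SHA-1 at around 2^60.

```python
import math


def generic_collision_bits(digest_bits: int) -> int:
    """log2 of the generic (birthday) collision cost for an ideal hash."""
    return digest_bits // 2


sha1_bits = generic_collision_bits(160)
sha256_bits = generic_collision_bits(256)
million_bits = math.log2(1_000_000)  # a factor of a million, expressed in bits

print(f"SHA-1 generic collision cost:   2^{sha1_bits}")
print(f"SHA-256 generic collision cost: 2^{sha256_bits}")
print(f"SHA-1 cost after a 10^6 speedup: about 2^{sha1_bits - million_bits:.0f}")
```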

-Peff


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
@ 2017-06-16  0:17                       ` brian m. carlson
  2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 23+ messages in thread
From: brian m. carlson @ 2017-06-16  0:17 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Adam Langley, Johannes Schindelin, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Junio Hamano


On Fri, Jun 16, 2017 at 01:36:13AM +0200, Ævar Arnfjörð Bjarmason wrote:
> On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson
> <sandals@crustytoothpaste.net> wrote:
> > SHA-256 acceleration exists for some existing Intel platforms already.
> > However, they're not practically present on anything but servers at the
> > moment, and so I don't think the acceleration of SHA-256 is
> > something we should consider.
> 
> Whatever next-gen hash Git ends up with is going to be in use for
> decades, so what hardware acceleration exists in consumer products
> right now is practically irrelevant, but what acceleration is likely
> to exist for the lifetime of the hash existing *is* relevant.

The life of MD5 was about 23 years (introduction to first document
collision).  SHA-1 had about 22.  Decades, yes, but just barely.  SHA-2
was introduced in 2001, and by the same estimate, we're a little over
halfway through its life.

> So I don't follow the argument that we shouldn't weigh future HW
> acceleration highly just because you can't easily buy a laptop today
> with these features.
> 
> Aside from that I think you've got this backwards, it's AMD that's
> adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
> starting at the lower end this year with Goldmont which'll be in
> lower-end consumer devices[2]. If you read the github issue I linked
> to upthread[3] you can see that the cryptopp devs already tested their
> SHA accelerated code on a consumer Celeron[4] recently.
> 
> I don't think Intel has announced the SHA extensions for future Xeon
> releases, but it seems a given that they're going to have it there as
> well. Have there ever been x86 extensions that aren't eventually
> portable across the entire line, or that they've ended up removing
> from x86 once introduced?
> 
> In any case, I think by the time we're ready to follow-up the current
> hash refactoring efforts with actually changing the hash
> implementation many of us are likely to have laptops with these
> extensions, making this easy to test.

I think you underestimate the life of hardware and software.  I have
servers running KVM development instances that have been running since
at least 2012.  Those machines are not scheduled for replacement anytime
soon.

Whatever we deploy within the next year is going to run on existing
hardware for probably a decade, whether we want it to or not.  Most of
those machines don't have acceleration.

Furthermore, you need a reasonably modern crypto library to get hardware
acceleration.  OpenSSL has only recently gained support for it.  RHEL 7
does not currently support it, and probably never will.  That OS is
going to be around for the next 6 years.

If we're optimizing for performance, I don't want to optimize for the
latest, greatest machines.  Those machines are going to outperform
everything else either way.  I'd rather optimize for something which
performs well on the whole everywhere.  There are a lot of developers
who have older machines, for cost reasons or otherwise.

Here are some stats (cycles/byte for long messages):

                   SHA-256    BLAKE2b
Ryzen                 1.89       3.06
Knight's Landing     19.00       5.65
Cortex-A72            1.99       5.48
Cortex-A57           11.81       5.47
Cortex-A7            28.19      15.16

In other words, BLAKE2b performs well uniformly across a wide variety of
architectures even without acceleration.  I'd rather tell people that
upgrading to a new hash algorithm is a performance win either way, not
just if they have the latest hardware.
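
A quick way to read the table above is to compare how much each
algorithm's cost varies across the listed CPUs.  The sketch below only
reuses the cycles/byte figures quoted in this message; the names `CPB`
and `spread` are made up for illustration.

```python
# Cycles/byte for long messages, copied from the table above:
# (SHA-256, BLAKE2b) per CPU.
CPB = {
    "Ryzen": (1.89, 3.06),
    "Knight's Landing": (19.00, 5.65),
    "Cortex-A72": (1.99, 5.48),
    "Cortex-A57": (11.81, 5.47),
    "Cortex-A7": (28.19, 15.16),
}


def spread(values):
    """Ratio between the slowest and the fastest figure in a column."""
    return max(values) / min(values)


sha256_spread = spread([sha for sha, _ in CPB.values()])
blake2b_spread = spread([blake for _, blake in CPB.values()])

for cpu, (sha, blake) in CPB.items():
    # A ratio above 1 means SHA-256 costs more cycles/byte on that CPU.
    print(f"{cpu:18} SHA-256/BLAKE2b = {sha / blake:.2f}")
print(f"SHA-256 spread: {sha256_spread:.1f}x, BLAKE2b spread: {blake2b_spread:.1f}x")
```

On these five data points the SHA-256 column varies far more between
machines than the BLAKE2b column does, which is the "performs well
uniformly" argument in numeric form.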
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204



* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 22:41                   ` brian m. carlson
@ 2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
  2017-06-16  0:17                       ` brian m. carlson
  0 siblings, 1 reply; 23+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-15 23:36 UTC (permalink / raw)
  To: brian m. carlson, Adam Langley, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, Jeff King, Mike Hommey,
	Brandon Williams, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, Jonathan Tan, Junio Hamano

On Fri, Jun 16, 2017 at 12:41 AM, brian m. carlson
<sandals@crustytoothpaste.net> wrote:
> On Thu, Jun 15, 2017 at 02:59:57PM -0700, Adam Langley wrote:
>> (I was asked to comment a few points in public by Jonathan.)
>>
>> I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
>> K12, etc are all secure to the extent that I don't believe that making
>> comparisons between them on that axis is meaningful. Thus I think the
>> question is primarily concerned with performance and implementation
>> availability.
>>
>> I think any of the above would be reasonable choices. I don't believe
>> that length-extension is a concern here.
>>
>> SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
>> The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
>> my Ivy Bridge system, it's about 20%.
>>
>> (SHA-512/256 does not enjoy the same availability in common libraries however.)
>>
>> Both Intel and ARM have SHA-256 instructions defined. I've not seen
>> good benchmarks of them yet, but they will make SHA-256 faster than
>> SHA-512 when available. However, it's very possible that something
>> like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
>> the ubiquity of SHA-256, but nor do you have to wait years for the CPU
>> population to advance for high performance.
>
> SHA-256 acceleration exists for some existing Intel platforms already.
> However, it's not practically present on anything but servers at the
> moment, and so I don't think the acceleration of SHA-256 is
> something we should consider.

Whatever next-gen hash Git ends up with is going to be in use for
decades, so what hardware acceleration exists in consumer products
right now is practically irrelevant, but what acceleration is likely
to exist over the lifetime of the hash *is* relevant.

So I don't follow the argument that we shouldn't weigh future HW
acceleration highly just because you can't easily buy a laptop today
with these features.

Aside from that I think you've got this backwards, it's AMD that's
adding SHA acceleration to their high-end Ryzen chips[1] but Intel is
starting at the lower end this year with Goldmont which'll be in
lower-end consumer devices[2]. If you read the github issue I linked
to upthread[3] you can see that the cryptopp devs already tested their
SHA accelerated code on a consumer Celeron[4] recently.

I don't think Intel has announced the SHA extensions for future Xeon
releases, but it seems a given that they're going to have it there as
well. Have there ever been x86 extensions that aren't eventually
portable across the entire line, or that they've ended up removing
from x86 once introduced?

In any case, I think by the time we're ready to follow-up the current
hash refactoring efforts with actually changing the hash
implementation many of us are likely to have laptops with these
extensions, making this easy to test.

1. https://en.wikipedia.org/wiki/Intel_SHA_extensions
2. https://en.wikipedia.org/wiki/Goldmont
3. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385
4. https://ark.intel.com/products/95594/Intel-Celeron-Processor-J3455-2M-Cache-up-to-2_3-GHz


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 21:59                 ` Adam Langley
@ 2017-06-15 22:41                   ` brian m. carlson
  2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
  0 siblings, 1 reply; 23+ messages in thread
From: brian m. carlson @ 2017-06-15 22:41 UTC (permalink / raw)
  To: Adam Langley
  Cc: Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	Jeff King, Mike Hommey, Brandon Williams, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano



On Thu, Jun 15, 2017 at 02:59:57PM -0700, Adam Langley wrote:
> (I was asked to comment a few points in public by Jonathan.)
> 
> I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
> K12, etc are all secure to the extent that I don't believe that making
> comparisons between them on that axis is meaningful. Thus I think the
> question is primarily concerned with performance and implementation
> availability.
> 
> I think any of the above would be reasonable choices. I don't believe
> that length-extension is a concern here.
> 
> SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
> The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
> my Ivy Bridge system, it's about 20%.
> 
> (SHA-512/256 does not enjoy the same availability in common libraries however.)
> 
> Both Intel and ARM have SHA-256 instructions defined. I've not seen
> good benchmarks of them yet, but they will make SHA-256 faster than
> SHA-512 when available. However, it's very possible that something
> like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
> the ubiquity of SHA-256, but nor do you have to wait years for the CPU
> population to advance for high performance.

SHA-256 acceleration exists for some existing Intel platforms already.
However, it's not practically present on anything but servers at the
moment, and so I don't think the acceleration of SHA-256 is
something we should consider.

The SUPERCOP benchmarks tell me that generally, on 64-bit systems where
acceleration is not available, SHA-256 is the slowest, followed by
SHA3-256.  BLAKE2b is the fastest.

If our goal is performance, then I would argue BLAKE2b-256 is the best
choice.  It is secure and extremely fast.  It does have the benefit that
we get to tell people that by moving away from SHA-1, they will get a
performance boost, pretty much no matter what the system.

BLAKE2bp may be faster, but it introduces additional implementation
complexity.  I'm not sure crypto libraries will implement it, but then
again, OpenSSL only implements BLAKE2b-512 at the moment.  I don't care
much either way, but we should add good tests to exercise the
implementation thoroughly.  We're generally going to need to ship our
own implementation anyway.
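
One detail matters for the point about libraries that only ship
BLAKE2b-512: BLAKE2b-256 is not BLAKE2b-512 cut in half, because the
digest length is part of BLAKE2's parameter block.  A minimal sketch
using Python's hashlib (chosen here only for illustration; the input
bytes are arbitrary):

```python
import hashlib

# Arbitrary example input.
data = b"example payload"

# BLAKE2b with a 32-byte (256-bit) digest, i.e. "BLAKE2b-256".
b256 = hashlib.blake2b(data, digest_size=32).digest()

# BLAKE2b with its default 64-byte (512-bit) digest.
b512 = hashlib.blake2b(data).digest()

# The requested digest length is mixed into the initial state, so the
# 256-bit output differs from a truncated 512-bit output.
same_as_truncation = b256 == b512[:32]
print(len(b256), len(b512), same_as_truncation)
```

So a library that exposes only the 512-bit variant cannot be used to
produce BLAKE2b-256 by truncating its output.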

I've argued that SHA3-256 probably has the longest life and good
unaccelerated performance, and for that reason, I've preferred it.  But
if AGL says that they're all secure (and I generally think he knows
what he's talking about), we could consider performance more.
-- 
brian m. carlson / brian with sandals: Houston, Texas, US
https://www.crustytoothpaste.net/~bmc | My opinion only
OpenPGP: https://keybase.io/bk2204



* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 19:34               ` Johannes Schindelin
@ 2017-06-15 21:59                 ` Adam Langley
  2017-06-15 22:41                   ` brian m. carlson
  0 siblings, 1 reply; 23+ messages in thread
From: Adam Langley @ 2017-06-15 21:59 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Ævar Arnfjörð Bjarmason, Jeff King, Mike Hommey,
	Brandon Williams, brian m. carlson, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan,
	Junio Hamano

(I was asked to comment a few points in public by Jonathan.)

I think this group can safely assume that SHA-256, SHA-512, BLAKE2,
K12, etc are all secure to the extent that I don't believe that making
comparisons between them on that axis is meaningful. Thus I think the
question is primarily concerned with performance and implementation
availability.

I think any of the above would be reasonable choices. I don't believe
that length-extension is a concern here.

SHA-512/256 will be faster than SHA-256 on 64-bit systems in software.
The graph at https://blake2.net/ suggests a 50% speedup on Skylake. On
my Ivy Bridge system, it's about 20%.

(SHA-512/256 does not enjoy the same availability in common libraries however.)

Both Intel and ARM have SHA-256 instructions defined. I've not seen
good benchmarks of them yet, but they will make SHA-256 faster than
SHA-512 when available. However, it's very possible that something
like BLAKE2bp will still be faster. Of course, BLAKE2bp does not enjoy
the ubiquity of SHA-256, but nor do you have to wait years for the CPU
population to advance for high performance.

So, overall, none of these choices should obviously be excluded. The
considerations at this point are not cryptographic and the tradeoff
between implementation ease and performance is one that the git
community would have to make.


Cheers

AGL


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 13:01           ` Jeff King
  2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
@ 2017-06-15 21:10             ` Mike Hommey
  2017-06-16  4:30               ` Jeff King
  1 sibling, 1 reply; 23+ messages in thread
From: Mike Hommey @ 2017-06-15 21:10 UTC (permalink / raw)
  To: Jeff King
  Cc: Johannes Schindelin, Brandon Williams, brian m. carlson,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	jonathantanmy, Junio Hamano

On Thu, Jun 15, 2017 at 09:01:45AM -0400, Jeff King wrote:
> On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
> 
> > On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> > > Footnote *1*: SHA-256, as all hash functions whose output is essentially
> > > the entire internal state, are susceptible to a so-called "length
> > > extension attack", where the hash of a secret+message can be used to
> > > generate the hash of secret+message+piggyback without knowing the secret.
> > > This is not the case for Git: only visible data are hashed. The type of
> > > attacks Git has to worry about is very different from the length extension
> > > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> > > say, a collision attack.
> > 
> > What do the experts think of SHA512/256, which completely removes the
> > concerns over length extension attack? (which I'd argue is better than
> > sweeping them under the carpet)
> 
> I don't think it's sweeping them under the carpet. Git does not use the
> hash as a MAC, so length extension attacks aren't a thing (and even if
> we later wanted to use the same algorithm as a MAC, the HMAC
> construction is a well-studied technique for dealing with it).

AIUI, length extension does make brute force collision attacks (which,
really, Shattered was) cheaper by allowing one to create the collision
with a small message and extend it later.

This might not be a credible threat against git, but if we go by that
standard, post-Shattered SHA-1 is still fine for git. As a matter of
fact, MD5 would also be fine: there is still, to this day, no preimage
attack against either.

Mike


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
@ 2017-06-15 19:34               ` Johannes Schindelin
  2017-06-15 21:59                 ` Adam Langley
  0 siblings, 1 reply; 23+ messages in thread
From: Johannes Schindelin @ 2017-06-15 19:34 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Jeff King, Mike Hommey, Brandon Williams, brian m. carlson,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	jonathantanmy, Junio Hamano



Hi,

On Thu, 15 Jun 2017, Ævar Arnfjörð Bjarmason wrote:

> On Thu, Jun 15 2017, Jeff King jotted:
> 
> > On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
> >
> >> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> >>
> >> > Footnote *1*: SHA-256, as all hash functions whose output is
> >> > essentially the entire internal state, are susceptible to a
> >> > so-called "length extension attack", where the hash of a
> >> > secret+message can be used to generate the hash of
> >> > secret+message+piggyback without knowing the secret.  This is not
> >> > the case for Git: only visible data are hashed. The type of attacks
> >> > Git has to worry about is very different from the length extension
> >> > attacks, and it is highly unlikely that that weakness of SHA-256
> >> > leads to, say, a collision attack.
> >>
> >> What do the experts think of SHA512/256, which completely removes the
> >> concerns over length extension attack? (which I'd argue is better than
> >> sweeping them under the carpet)
> >
> > I don't think it's sweeping them under the carpet. Git does not use the
> > hash as a MAC, so length extension attacks aren't a thing (and even if
> > we later wanted to use the same algorithm as a MAC, the HMAC
> > construction is a well-studied technique for dealing with it).

I really tried to drive that point home, as it had been made very clear to
me that the length extension attack is something that Git need not concern
itself with.

The length extension attack *only* comes into play when there are secrets
that are hashed. In that case, one would not want others to be able to
produce a valid hash *without* knowing the secrets. And SHA-256 allows one
to "reconstruct" the internal state (which is the hash value) in order to
continue at any point, i.e. if the hash for secret+message is known, it is
easy to calculate the hash for secret+message+addition, without knowing
the secret at all.
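
To make the mechanism concrete, here is a minimal sketch of a length
extension against SHA-256 in Python.  It assumes the attacker knows only
the digest and the length of secret+message.  The secret, message and
suffix values, and the helper names (`compress`, `padding`, `extend`),
are hypothetical; the compression function follows FIPS 180-4, and the
round constants are derived from cube roots of primes rather than typed
in.

```python
import hashlib
import struct


def first_primes(n):
    primes, candidate = [], 2
    while len(primes) < n:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def icbrt(n):
    """Integer cube root (floor) by binary search."""
    lo, hi = 0, 1 << ((n.bit_length() + 2) // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


# Round constants: first 32 fractional bits of the cube roots of the
# first 64 primes.
K = [icbrt(p << 96) & 0xFFFFFFFF for p in first_primes(64)]


def rotr(x, n):
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


def compress(state, block):
    """One SHA-256 compression step over a 64-byte block."""
    w = list(struct.unpack(">16I", block))
    for i in range(16, 64):
        s0 = rotr(w[i - 15], 7) ^ rotr(w[i - 15], 18) ^ (w[i - 15] >> 3)
        s1 = rotr(w[i - 2], 17) ^ rotr(w[i - 2], 19) ^ (w[i - 2] >> 10)
        w.append((w[i - 16] + s0 + w[i - 7] + s1) & 0xFFFFFFFF)
    a, b, c, d, e, f, g, h = state
    for i in range(64):
        big_s1 = rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25)
        ch = (e & f) ^ (~e & g)
        t1 = (h + big_s1 + ch + K[i] + w[i]) & 0xFFFFFFFF
        big_s0 = rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22)
        maj = (a & b) ^ (a & c) ^ (b & c)
        t2 = (big_s0 + maj) & 0xFFFFFFFF
        h, g, f, e = g, f, e, (d + t1) & 0xFFFFFFFF
        d, c, b, a = c, b, a, (t1 + t2) & 0xFFFFFFFF
    return [(x + y) & 0xFFFFFFFF for x, y in zip(state, (a, b, c, d, e, f, g, h))]


def padding(total_len):
    """SHA-256 padding for a message of total_len bytes."""
    zeros = b"\x00" * ((55 - total_len) % 64)
    return b"\x80" + zeros + struct.pack(">Q", total_len * 8)


def extend(known_digest, known_len, suffix):
    """Forge H(secret+message+glue+suffix) without knowing the secret."""
    glue = padding(known_len)
    # The digest *is* the internal state after the last block.
    state = list(struct.unpack(">8I", known_digest))
    forged_len = known_len + len(glue) + len(suffix)
    tail = suffix + padding(forged_len)
    for i in range(0, len(tail), 64):
        state = compress(state, tail[i:i + 64])
    return glue, struct.pack(">8I", *state)


# Hypothetical values for the demonstration.
secret = b"hypothetical-secret"
message = b"amount=10"
suffix = b"&amount=1000000"

digest = hashlib.sha256(secret + message).digest()
glue, forged = extend(digest, len(secret) + len(message), suffix)
real = hashlib.sha256(secret + message + glue + suffix).digest()
print(forged == real)
```

The forged digest only has value when the prefix is secret.  With Git
objects everything that is hashed is visible, so anyone can compute the
same digest directly.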

That is exactly *not* the case with Git. In Git, what we want to hash is
known in its entirety. If the hash value were not identical to the
internal state, it would be easy enough to reconstruct, because *there are
no secrets*.

So please understand that even the direction that the length extension
attack takes is completely different than the direction any attack would
have to take that weakens SHA-256 for Git's purposes. As far as Git's
usage is concerned, SHA-256 has no known weaknesses.

It is *really, really, really* important to understand this before going
on to suggest another hash function such as SHA-512/256 (i.e. SHA-512 with
different initial values, truncated to 256 bits), based only on that
perceived weakness of SHA-256.

> > That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
> > platforms. I don't know if that will change with the advent of
> > hardware instructions oriented towards SHA-256.
> 
> Quoting my own
> CACBZZX7JRA2niwt9wsGAxnzS+gWS8hTUgzWm8NaY1gs87o8xVQ@mail.gmail.com sent
> ~2 weeks ago to the list:
> 
>     On Fri, Jun 2, 2017 at 7:54 PM, Jonathan Nieder <jrnieder@gmail.com>
>     wrote:
>     [...]
>     > 4. When choosing a hash function, people may argue about performance.
>     >    It would be useful to run some benchmarks for git (running
>     >    the test suite, t/perf tests, etc) using a variety of hash
>     >    functions as input to such a discussion.
> 
>     To the extent that such benchmarks matter, it seems prudent to heavily
>     weigh them in favor of whatever seems to be likely to be the more
>     common hash function going forward, since those are likely to get
>     faster through future hardware acceleration.
> 
>     E.g. Intel announced Goldmont last year which according to one SHA-1
>     implementation improved from 9.5 cycles per byte to 2.7 cpb[1]. They
>     only have acceleration for SHA-1 and SHA-256[2]
> 
>     1. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385
> 
>     2. https://en.wikipedia.org/wiki/Goldmont
> 
> Maybe someone else knows of better numbers / benchmarks, but such a
> reduction in cpb likely makes it faster than SHA-512.

Very, very likely faster than SHA-512.

I'd like to stress explicitly that the Intel SHA extensions do *not* cover
SHA-512:

	https://en.wikipedia.org/wiki/Intel_SHA_extensions

In other words, once those extensions become commonplace, SHA-256 will be
faster than SHA-512, hands down.

Ciao,
Dscho


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 17:36         ` Brandon Williams
@ 2017-06-15 19:20           ` Junio C Hamano
  0 siblings, 0 replies; 23+ messages in thread
From: Junio C Hamano @ 2017-06-15 19:20 UTC (permalink / raw)
  To: Brandon Williams
  Cc: Johannes Schindelin, brian m. carlson, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, jonathantanmy,
	Jeff King

Brandon Williams <bmwill@google.com> writes:

>> It would make a whole lot of sense to make that knob not Boolean,
>> but to specify which hash function is in use.
>
> 100% agree on this point.  I believe the current plan is to have the
> hashing function used for a repository be a repository format extension
> which would be a value (most likely a string like 'sha1', 'sha256',
> 'blake2', etc) stored in a repository's .git/config.  This way, upon
> startup git will die or ignore a repository which uses a hashing
> function which it does not recognize or was not compiled to handle.
>
> I hope (and expect) that the end product of this transition is a nice,
> clean hashing API and interface with sufficient abstractions such that
> if I wanted to switch to a different hashing function I would just need
> to implement the interface with the new hashing function and ensure that
> 'verify_repository_format' allows the new function.

Yup.  I thought that part has already been agreed upon, but it is a
good thing that somebody is writing it down (perhaps "again", if not
"for the first time").

Thanks.



* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
  2017-06-15 11:05         ` Mike Hommey
  2017-06-15 17:36         ` Brandon Williams
@ 2017-06-15 19:13         ` Jonathan Nieder
  2 siblings, 0 replies; 23+ messages in thread
From: Jonathan Nieder @ 2017-06-15 19:13 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Williams, brian m. carlson, Linus Torvalds,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	Junio Hamano

Hi Dscho,

Johannes Schindelin wrote:

> From what I read, pretty much everybody who participated in the discussion
> was aware that the essential question is: performance vs security.

I don't completely agree with this framing.  The essential question is:
how to get the right security properties without abysmal performance.

> It turns out that we can have essentially both.
>
> SHA-256 is most likely the best-studied hash function we currently know
[... etc ...]

Thanks for a thoughtful restart to the discussion.  This is much more
concrete than your previous objections about process, and that is very
helpful.

In the interest of transparency: here are my current questions for
cryptographers to whom I have forwarded this thread.  Several of these
questions involve predictions or opinions, so in my ideal world we'd
want multiple, well reasoned answers to them.  Please feel free to
forward them to appropriate people or add more.

 1. Now it sounds like SHA-512/256 is the safest choice (see also Mike
    Hommey's response to Dscho's message).  Please poke holes in my
    understanding.

 2. Would you be willing to weigh in publicly on the mailing list? I
    think that would be the most straightforward way to move this
    forward (and it would give you a chance to ask relevant questions,
    etc).  Feel free to contact me privately if you have any questions
    about how this particular mailing list works.

 3. On the speed side, Dscho states "SHA-256 will be faster than BLAKE
    (and even than BLAKE2) once the Intel and AMD CPUs with hardware
    support for SHA-256 become common."  Do you agree?

 4. On the security side, Dscho states "to compete in the SHA-3
    contest, BLAKE added complexity so that it would be roughly on par
    with its competitors.  To allow for faster execution in software,
    this complexity was *removed* from BLAKE to create BLAKE2, making
    it weaker than SHA-256."  Putting aside the historical questions,
    do you agree with this "weaker than" claim?

 5. On the security side, Dscho states, "The type of attacks Git has to
    worry about is very different from the length extension attacks,
    and it is highly unlikely that that weakness of SHA-256 leads to,
    say, a collision attack", and Jeff King states, "Git does not use
    the hash as a MAC, so length extension attacks aren't a thing (and
    even if we later wanted to use the same algorithm as a MAC, the
    HMAC construction is a well-studied technique for dealing with
    it)."  Is this correct in spirit?  Is SHA-256 equally strong to
    SHA-512/256 for Git's purposes, or are the increased bits of
    internal state (or other differences) relevant?  How would you
    compare the two functions' properties?

 6. On the speed side, Jeff King states "That said, SHA-512 is
    typically a little faster than SHA-256 on 64-bit platforms. I
    don't know if that will change with the advent of hardware
    instructions oriented towards SHA-256."  Thoughts?

 7. If the answer to (2) is "no", do I have permission to quote or
    paraphrase your replies that were given here?

Thanks, sincerely,
Jonathan


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
  2017-06-15 11:05         ` Mike Hommey
@ 2017-06-15 17:36         ` Brandon Williams
  2017-06-15 19:20           ` Junio C Hamano
  2017-06-15 19:13         ` Jonathan Nieder
  2 siblings, 1 reply; 23+ messages in thread
From: Brandon Williams @ 2017-06-15 17:36 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: brian m. carlson, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	Junio Hamano

On 06/15, Johannes Schindelin wrote:
> Hi,
> 
> I thought it better to revive this old thread rather than start a new
> thread, so as to automatically reach everybody who chimed in originally.
> 
> On Mon, 6 Mar 2017, Brandon Williams wrote:
> 
> > On 03/06, brian m. carlson wrote:
> >
> > > On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> > >
> > > > Btw, I do think the particular choice of hash should still be on the
> > > > table. sha-256 may be the obvious first choice, but there are
> > > > definitely a few reasons to consider alternatives, especially if
> > > > it's a complete switch-over like this.
> > > > 
> > > > One is large-file behavior - a parallel (or tree) mode could improve
> > > > on that noticeably. BLAKE2 does have special support for that, for
> > > > example. And SHA-256 does have known attacks compared to SHA-3-256
> > > > or BLAKE2 - whether that is due to age or due to more effort, I
> > > > can't really judge. But if we're switching away from SHA1 due to
> > > > known attacks, it does feel like we should be careful.
> > > 
> > > I agree with Linus on this.  SHA-256 is the slowest option, and it's
> > > the one with the most advanced cryptanalysis.  SHA-3-256 is faster on
> > > 64-bit machines (which, as we've seen on the list, is the overwhelming
> > > majority of machines using Git), and even BLAKE2b-256 is stronger.
> > > 
> > > Doing this all over again in another couple years should also be a
> > > non-goal.
> > 
> > I agree that when we decide to move to a new algorithm that we should
> > select one which we plan on using for as long as possible (much longer
> > than a couple years).  While writing the document we simply used
> > "sha256" because it was more tangible and easier to reference.
> 
> The SHA-1 transition *requires* a knob telling Git that the current
> repository uses a hash function different from SHA-1.
> 
> It would make *a whole lot of sense* to make that knob *not* Boolean,
> but to specify *which* hash function is in use.

100% agree on this point.  I believe the current plan is to have the
hashing function used for a repository be a repository format extension
which would be a value (most likely a string like 'sha1', 'sha256',
'blake2', etc) stored in a repository's .git/config.  This way, upon
startup git will die or ignore a repository which uses a hashing
function which it does not recognize or was not compiled to handle.

I hope (and expect) that the end product of this transition is a nice,
clean hashing API and interface with sufficient abstractions such that
if I wanted to switch to a different hashing function I would just need
to implement the interface with the new hashing function and ensure that
'verify_repository_format' allows the new function.
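
As a rough illustration of that idea (not Git's actual C code), the
sketch below keys a small registry on the configured hash name and
refuses anything it does not know.  The registry `HASH_ALGOS`, the
exception type and the Python function `verify_repository_format` are
hypothetical stand-ins; only the "<type> <size>\0" object header is
Git's real format.

```python
import hashlib

# Hypothetical registry keyed by the string stored in the repository
# configuration.  Supporting a new hash means adding one entry here.
HASH_ALGOS = {
    "sha1": lambda: hashlib.sha1(),
    "sha256": lambda: hashlib.sha256(),
    "blake2": lambda: hashlib.blake2b(digest_size=32),
}


class UnsupportedHash(Exception):
    """Raised when a repository names a hash this build cannot handle."""


def verify_repository_format(hash_name):
    """Return a hash constructor, or refuse an unknown hash name."""
    try:
        return HASH_ALGOS[hash_name]
    except KeyError:
        raise UnsupportedHash(f"unknown hash function: {hash_name!r}") from None


def hash_object(hash_name, obj_type, data):
    """Hash a Git-style object: '<type> <size>\\0' header plus contents."""
    hasher = verify_repository_format(hash_name)()
    hasher.update(f"{obj_type} {len(data)}\0".encode())
    hasher.update(data)
    return hasher.hexdigest()


# The well-known SHA-1 id of the empty blob.
print(hash_object("sha1", "blob", b""))
```

Callers never name a concrete algorithm, so switching hashes stays a
configuration change plus one registry entry.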

> 
> That way, it will be easier to switch another time when it becomes
> necessary.
> 
> And it will also make it easier for interested parties to use a different
> hash function in their infrastructure if they want.
> 
> And it lifts part of that burden that we have to consider *very carefully*
> which function to pick. We still should be more careful than in 2005, when
> Git was born, and when, incidentally, the first attacks on SHA-1
> became known, of course. We were just lucky for almost 12 years.
> 
> Now, with Dunning-Kruger in mind, I feel that my degree in mathematics
> equips me with *just enough* competence to know just how little *even I*
> know about cryptography.
> 
> The smart thing to do, hence, was to get involved in this discussion and
> act as Lt. Tawny Madison between us Git developers and experts in
> cryptography.
> 
> It just so happens that I work at a company with access to excellent
> cryptographers, and as we own the largest Git repository on the planet, we
> have a vested interest in ensuring Git's continued success.
> 
> After a couple of conversations with a couple of experts who I cannot
> thank enough for their time and patience, let alone their knowledge about
> this matter, it would appear that we may not have had a complete enough
> picture yet to even start to make the decision on the hash function to
> use.
> 

-- 
Brandon Williams


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 13:01           ` Jeff King
@ 2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
  2017-06-15 19:34               ` Johannes Schindelin
  2017-06-15 21:10             ` Mike Hommey
  1 sibling, 1 reply; 23+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-06-15 16:30 UTC (permalink / raw)
  To: Jeff King
  Cc: Mike Hommey, Johannes Schindelin, Brandon Williams,
	brian m. carlson, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, jonathantanmy, Junio Hamano


On Thu, Jun 15 2017, Jeff King jotted:

> On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:
>
>> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
>> > Footnote *1*: SHA-256, as all hash functions whose output is essentially
>> > the entire internal state, are susceptible to a so-called "length
>> > extension attack", where the hash of a secret+message can be used to
>> > generate the hash of secret+message+piggyback without knowing the secret.
>> > This is not the case for Git: only visible data are hashed. The type of
>> > attacks Git has to worry about is very different from the length extension
>> > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
>> > say, a collision attack.
>>
>> What do the experts think of SHA512/256, which completely removes the
>> concerns over length extension attack? (which I'd argue is better than
>> sweeping them under the carpet)
>
> I don't think it's sweeping them under the carpet. Git does not use the
> hash as a MAC, so length extension attacks aren't a thing (and even if
> we later wanted to use the same algorithm as a MAC, the HMAC
> construction is a well-studied technique for dealing with it).
>
> That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
> platforms. I don't know if that will change with the advent of hardware
> instructions oriented towards SHA-256.

Quoting my own
CACBZZX7JRA2niwt9wsGAxnzS+gWS8hTUgzWm8NaY1gs87o8xVQ@mail.gmail.com sent
~2 weeks ago to the list:

    On Fri, Jun 2, 2017 at 7:54 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
    [...]
    > 4. When choosing a hash function, people may argue about performance.
    >    It would be useful to run some benchmarks for git (running
    >    the test suite, t/perf tests, etc) using a variety of hash
    >    functions as input to such a discussion.

    To the extent that such benchmarks matter, it seems prudent to heavily
    weigh them in favor of whatever seems to be likely to be the more
    common hash function going forward, since those are likely to get
    faster through future hardware acceleration.

    E.g. Intel announced Goldmont last year which according to one SHA-1
    implementation improved from 9.5 cycles per byte to 2.7 cpb[1]. They
    only have acceleration for SHA-1 and SHA-256[2]

    1. https://github.com/weidai11/cryptopp/issues/139#issuecomment-264283385

    2. https://en.wikipedia.org/wiki/Goldmont

Maybe someone else knows of better numbers / benchmarks, but such a
reduction in cpb likely makes it faster than SHA-512.
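
For anyone who wants local numbers, a rough timing harness along these
lines would do.  It is only a sketch: it measures whatever
implementation Python's hashlib links against on the machine at hand,
in MB/s rather than cycles per byte, and the function name `throughput`
and the buffer size are arbitrary choices.

```python
import hashlib
import time


def throughput(name, size=1 << 20, repeats=5):
    """Best-of-N hashing throughput in MB/s for one hashlib algorithm."""
    data = bytes(size)  # zero-filled buffer; the content does not matter
    best = float("inf")
    for _ in range(repeats):
        hasher = hashlib.new(name)
        start = time.perf_counter()
        hasher.update(data)
        hasher.digest()
        best = min(best, time.perf_counter() - start)
    return size / best / 1e6


results = {name: throughput(name) for name in ("sha1", "sha256", "sha512", "blake2b")}
for name, mbps in results.items():
    print(f"{name:8} {mbps:10.1f} MB/s")
```

The ranking depends on the CPU and on whether the underlying library
uses hardware SHA instructions, so results from one machine should not
be generalized.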


* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 11:05         ` Mike Hommey
@ 2017-06-15 13:01           ` Jeff King
  2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
  2017-06-15 21:10             ` Mike Hommey
  0 siblings, 2 replies; 23+ messages in thread
From: Jeff King @ 2017-06-15 13:01 UTC (permalink / raw)
  To: Mike Hommey
  Cc: Johannes Schindelin, Brandon Williams, brian m. carlson,
	Linus Torvalds, Jonathan Nieder, Git Mailing List, Stefan Beller,
	jonathantanmy, Junio Hamano

On Thu, Jun 15, 2017 at 08:05:18PM +0900, Mike Hommey wrote:

> On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> > Footnote *1*: SHA-256, as all hash functions whose output is essentially
> > the entire internal state, are susceptible to a so-called "length
> > extension attack", where the hash of a secret+message can be used to
> > generate the hash of secret+message+piggyback without knowing the secret.
> > This is not the case for Git: only visible data are hashed. The type of
> > attacks Git has to worry about is very different from the length extension
> > attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> > say, a collision attack.
> 
> What do the experts think of SHA-512/256, which completely removes the
> concerns over length extension attack? (which I'd argue is better than
> sweeping them under the carpet)

I don't think it's sweeping them under the carpet. Git does not use the
hash as a MAC, so length extension attacks aren't a thing (and even if
we later wanted to use the same algorithm as a MAC, the HMAC
construction is a well-studied technique for dealing with it).
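
For readers unfamiliar with the distinction, a minimal sketch of the HMAC
construction using Python's standard library follows. The key, the message
and the helper names are hypothetical; nothing here reflects code in Git,
which, as said above, does not use its hash as a MAC at all.

```python
import hashlib
import hmac


def tag(key: bytes, message: bytes) -> bytes:
    """Compute an HMAC-SHA-256 tag; unlike a bare hash(key + message),
    this construction is not affected by length extension."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, candidate: bytes) -> bool:
    """Check a tag using a constant-time comparison."""
    return hmac.compare_digest(tag(key, message), candidate)


# Hypothetical values, for illustration only.
key = b"hypothetical-shared-secret"
message = b"payload"

t = tag(key, message)
print(verify(key, message, t))
print(verify(key, message + b" plus piggyback", t))
```

The second check fails because an attacker who only knows the tag cannot
extend the message without knowing the key.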

That said, SHA-512 is typically a little faster than SHA-256 on 64-bit
platforms. I don't know if that will change with the advent of hardware
instructions oriented towards SHA-256.
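
A rough way to check this on a given machine is sketched below. It is only
an assumption-laden micro-benchmark: it measures whatever implementation
Python's hashlib is linked against, not Git's own code, and the payload size
and function name are arbitrary choices.

```python
import hashlib
import time


def measure_mb_per_s(name: str, payload: bytes, rounds: int = 3) -> float:
    """Return the best observed throughput in MB/s for one hash function."""
    best = float("inf")
    for _ in range(rounds):
        start = time.perf_counter()
        hashlib.new(name, payload).digest()
        best = min(best, time.perf_counter() - start)
    # Guard against a timer reading of zero on very small payloads.
    return len(payload) / max(best, 1e-9) / 1e6


# Assumption: 8 MiB of zero bytes is representative enough for a rough
# comparison; real workloads hash many small objects as well.
payload = bytes(8 * 1024 * 1024)

results = {name: measure_mb_per_s(name, payload)
           for name in ("sha1", "sha256", "sha512")}

for name, mb_per_s in results.items():
    print(f"{name:8s} {mb_per_s:8.0f} MB/s")
```

The numbers depend heavily on the CPU and on whether the underlying library
uses SHA-specific instructions, which is exactly the open question here.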

-Peff

* Re: Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
@ 2017-06-15 11:05         ` Mike Hommey
  2017-06-15 13:01           ` Jeff King
  2017-06-15 17:36         ` Brandon Williams
  2017-06-15 19:13         ` Jonathan Nieder
  2 siblings, 1 reply; 23+ messages in thread
From: Mike Hommey @ 2017-06-15 11:05 UTC (permalink / raw)
  To: Johannes Schindelin
  Cc: Brandon Williams, brian m. carlson, Linus Torvalds,
	Jonathan Nieder, Git Mailing List, Stefan Beller, jonathantanmy,
	Jeff King, Junio Hamano

On Thu, Jun 15, 2017 at 12:30:46PM +0200, Johannes Schindelin wrote:
> Footnote *1*: SHA-256, like all hash functions whose output is essentially
> the entire internal state, is susceptible to a so-called "length
> extension attack", where the hash of a secret+message can be used to
> generate the hash of secret+message+piggyback without knowing the secret.
> This is not the case for Git: only visible data are hashed. The type of
> attacks Git has to worry about is very different from the length extension
> attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
> say, a collision attack.

What do the experts think of SHA-512/256, which completely removes the
concerns over length extension attack? (which I'd argue is better than
sweeping them under the carpet)

Mike

* Which hash function to use, was Re: RFC: Another proposed hash function transition plan
  2017-03-06 18:24     ` Brandon Williams
@ 2017-06-15 10:30       ` Johannes Schindelin
  2017-06-15 11:05         ` Mike Hommey
                           ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Johannes Schindelin @ 2017-06-15 10:30 UTC (permalink / raw)
  To: Brandon Williams
  Cc: brian m. carlson, Linus Torvalds, Jonathan Nieder,
	Git Mailing List, Stefan Beller, jonathantanmy, Jeff King,
	Junio Hamano

Hi,

I thought it better to revive this old thread rather than start a new
thread, so as to automatically reach everybody who chimed in originally.

On Mon, 6 Mar 2017, Brandon Williams wrote:

> On 03/06, brian m. carlson wrote:
>
> > On Sat, Mar 04, 2017 at 06:35:38PM -0800, Linus Torvalds wrote:
> >
> > > Btw, I do think the particular choice of hash should still be on the
> > > table. sha-256 may be the obvious first choice, but there are
> > > definitely a few reasons to consider alternatives, especially if
> > > it's a complete switch-over like this.
> > > 
> > > One is large-file behavior - a parallel (or tree) mode could improve
> > > on that noticeably. BLAKE2 does have special support for that, for
> > > example. And SHA-256 does have known attacks compared to SHA-3-256
> > > or BLAKE2 - whether that is due to age or due to more effort, I
> > > can't really judge. But if we're switching away from SHA1 due to
> > > known attacks, it does feel like we should be careful.
> > 
> > I agree with Linus on this.  SHA-256 is the slowest option, and it's
> > the one with the most advanced cryptanalysis.  SHA-3-256 is faster on
> > 64-bit machines (which, as we've seen on the list, is the overwhelming
> > majority of machines using Git), and even BLAKE2b-256 is stronger.
> > 
> > Doing this all over again in another couple years should also be a
> > non-goal.
> 
> I agree that when we decide to move to a new algorithm that we should
> select one which we plan on using for as long as possible (much longer
> than a couple years).  While writing the document we simply used
> "sha256" because it was more tangible and easier to reference.

The SHA-1 transition *requires* a knob telling Git that the current
repository uses a hash function different from SHA-1.

It would make *a whole lot of sense* to make that knob *not* Boolean,
but to specify *which* hash function is in use.

That way, it will be easier to switch another time when it becomes
necessary.

And it will also make it easier for interested parties to use a different
hash function in their infrastructure if they want.
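
A minimal sketch of what a non-Boolean knob could look like, in Python for
brevity. The setting name "hashAlgorithm", the table of supported names and
the function name are all hypothetical; this is not an existing Git setting
or API, just an illustration of selecting the function by name.

```python
import hashlib

# Hypothetical table of hash functions a repository could name.
SUPPORTED_HASHES = {
    "sha1": hashlib.sha1,
    "sha256": hashlib.sha256,
    "sha3-256": hashlib.sha3_256,
}


def hasher_for(repo_config: dict):
    """Pick the hash constructor named by the (hypothetical) setting.

    A missing setting means a legacy SHA-1 repository; an unknown name is
    an error rather than a silent fallback.
    """
    name = repo_config.get("hashAlgorithm", "sha1")
    try:
        return SUPPORTED_HASHES[name]
    except KeyError:
        raise ValueError(f"unsupported hash function: {name!r}") from None


legacy = hasher_for({})
modern = hasher_for({"hashAlgorithm": "sha256"})
print(legacy().name, modern().name)
```

With a Boolean knob, adding a third function later would need yet another
setting; with a named one, it is one more table entry.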

And it lifts part of the burden of having to consider *very carefully*
which function to pick. Of course, we still should be more careful than in
2005, when Git was born and when, incidentally, the first attacks on SHA-1
became known. We were just lucky for almost 12 years.

Now, with Dunning-Kruger in mind, I feel that my degree in mathematics
equips me with *just enough* competence to know just how little *even I*
know about cryptography.

The smart thing to do, hence, was to get involved in this discussion and
act as Lt. Tawny Madison between us Git developers and experts in
cryptography.

It just so happens that I work at a company with access to excellent
cryptographers, and as we own the largest Git repository on the planet, we
have a vested interest in ensuring Git's continued success.

After a couple of conversations with a couple of experts who I cannot
thank enough for their time and patience, let alone their knowledge about
this matter, it would appear that we may not have had a complete enough
picture yet to even start to make the decision on the hash function to
use.

From what I read, pretty much everybody who participated in the discussion
was aware that the essential question is: performance vs security.

It turns out that we can have essentially both.

SHA-256 is most likely the best-studied hash function we currently know
about (*maybe* SHA3-256 has been studied slightly more, but only
slightly). All the experts in the field banged on it with multiple sticks
and other weapons. And so far, they only found one weakness that does not
even apply to Git's usage [*1*]. For cryptography experts, this is the
ultimate measure of security: if something has been attacked that
intensely, by that many experts, for that long, with that little effect,
it is the best we have at this time.

And since SHA-256 has become the standard, and more importantly: since
SHA-256 was explicitly designed to allow for relatively inexpensive
hardware acceleration, this is what we will soon have: hardware support in
the form of, say, special CPU instructions. (That is what I meant by: we
can have performance *and* security.)

This is a rather important point to stress, by the way: BLAKE's design is
apparently *not* friendly to CPU instruction implementations. Meaning that
SHA-256 will be faster than BLAKE (and even than BLAKE2) once the Intel
and AMD CPUs with hardware support for SHA-256 become common.

I also heard something really worrisome about BLAKE2 that makes me want to
stay away from it (in addition to the difficulty it poses for hardware
acceleration): to compete in the SHA-3 contest, BLAKE added complexity so
that it would be roughly on par with its competitors. To allow for faster
execution in software, this complexity was *removed* from BLAKE to create
BLAKE2, making it weaker than SHA-256.

Another important point to consider is that SHA-256 implementations are
everywhere. Think e.g. how difficult we would make it on, say, JGit or
go-git if we chose a less common hash function.

As to KangarooTwelve: it has seen substantially less cryptanalysis than
SHA-256 and SHA3-256. That does not necessarily mean that it is weaker,
but it means that we simply cannot know whether it is as strong. On that
basis alone, I would already reject it, and then there are far fewer
implementations, too.

When it comes to choosing SHA-256 vs SHA3-256, I would like to point out
that hardware acceleration for SHA3-256 is a lot farther in the future than
it is for SHA-256. And according to the experts I asked, they are roughly equally
secure as far as Git's usage is concerned, even if the SHA-3 contest
provided SHA3-256 with even fiercer cryptanalysis than SHA-256.

In short: my takeaway from the conversations with cryptography experts was
that SHA-256 would be the best choice for now, and that we should make
sure that the next switch is not as painful as this one (read: we should
not repeat the mistake of hard-coding the new hash function into Git as
much as we hard-coded SHA-1 into it).

Ciao,
Johannes

Footnote *1*: SHA-256, like all hash functions whose output is essentially
the entire internal state, is susceptible to a so-called "length
extension attack", where the hash of a secret+message can be used to
generate the hash of secret+message+piggyback without knowing the secret.
This is not the case for Git: only visible data are hashed. The type of
attacks Git has to worry about is very different from the length extension
attacks, and it is highly unlikely that that weakness of SHA-256 leads to,
say, a collision attack.
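
To illustrate "only visible data are hashed": a Git object name is the hash
of a short type-and-length header followed by the object's content, with no
secret involved. The sketch below reproduces that for blobs in Python. The
function name is hypothetical, and applying SHA-256 to the same header
layout is an assumption made for illustration only.

```python
import hashlib


def object_id(obj_type: str, content: bytes, algo: str = "sha1") -> str:
    """Hash "<type> <length>\\0<content>", the way Git names an object."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


# Same result as: printf 'hello world\n' | git hash-object --stdin
print(object_id("blob", b"hello world\n"))

# Assumption: the same header layout fed to a different hash function.
print(object_id("blob", b"hello world\n", algo="sha256"))
```

Everything that goes into the hash is data the attacker can already see, so
there is no secret prefix for a length extension attack to work around.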

Thread overview: 23+ messages
2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
2017-03-05  2:35 ` Linus Torvalds
2017-03-06  0:26   ` brian m. carlson
2017-03-06 18:24     ` Brandon Williams
2017-06-15 10:30       ` Which hash function to use, was " Johannes Schindelin
2017-06-15 11:05         ` Mike Hommey
2017-06-15 13:01           ` Jeff King
2017-06-15 16:30             ` Ævar Arnfjörð Bjarmason
2017-06-15 19:34               ` Johannes Schindelin
2017-06-15 21:59                 ` Adam Langley
2017-06-15 22:41                   ` brian m. carlson
2017-06-15 23:36                     ` Ævar Arnfjörð Bjarmason
2017-06-16  0:17                       ` brian m. carlson
2017-06-16  6:25                         ` Ævar Arnfjörð Bjarmason
2017-06-16 13:24                           ` Johannes Schindelin
2017-06-16 17:38                             ` Adam Langley
2017-06-16 20:52                               ` Junio C Hamano
2017-06-16 21:12                                 ` Junio C Hamano
2017-06-16 21:24                                   ` Jonathan Nieder
2017-06-16 21:39                                     ` Ævar Arnfjörð Bjarmason
2017-06-16 20:42                             ` Jeff King
2017-06-19  9:26                               ` Johannes Schindelin
2017-06-15 21:10             ` Mike Hommey
2017-06-16  4:30               ` Jeff King
2017-06-15 17:36         ` Brandon Williams
2017-06-15 19:20           ` Junio C Hamano
2017-06-15 19:13         ` Jonathan Nieder
