* RFC: Another proposed hash function transition plan
@ 2017-03-04 1:12 Jonathan Nieder
  2017-03-05 2:35 ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread

From: Jonathan Nieder @ 2017-03-04 1:12 UTC (permalink / raw)
To: git; +Cc: sbeller, bmwill, jonathantanmy, peff, Linus Torvalds

Hi,

This past week we came up with this idea for what a transition to a
new hash function for Git would look like.  I'd be interested in your
thoughts (especially if you can make them as comments on the document,
which makes it easier to address them and update the document).

This document is still in flux but I thought it best to send it out
early to start getting feedback.

We tried to incorporate some thoughts from the thread
http://public-inbox.org/git/20170223164306.spg2avxzukkggrpb@kitenet.net
but it is a little long so it is easy to imagine we've missed some
things already discussed there.

You can use the doc URL https://goo.gl/gh2Mzc to view the latest
version and comment.

Thoughts welcome, as always.

Git hash function transition
============================
Status: Draft
Last Updated: 2017-03-03

Objective
---------
Migrate Git from SHA-1 to a stronger hash function.

Background
----------
The Git version control system can be thought of as a content
addressable filesystem. It uses the SHA-1 hash function to name
content. For example, files, trees, and commits are referred to by
hash values, unlike in other traditional version control systems where
files or versions are referred to via sequential numbers. The use of a
hash function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
* Lookup of objects is fast.

Using a cryptographically secure hash function brings additional
advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to address
  stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. https://shattered.io demonstrated a practical SHA-1 hash
collision. As a result, SHA-1 cannot be considered cryptographically
secure any more. This impacts the communication of hash values because
we cannot trust that a given hash value represents the known good
version of content that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions that are believed to
be cryptographically secure are equally suitable.

Goals
-----
1. The transition to SHA256 can be done one local repository at a
   time.
   a. Requiring no action by any other party.
   b. A SHA256 repository can communicate with SHA-1 Git servers and
      clients (push/fetch).
   c. Users can use SHA-1 and SHA256 identifiers for objects
      interchangeably.
   d. New signed objects make use of a stronger hash function than
      SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be dropped in a
      repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a. The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
---------
1. Add SHA256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in git's formats and
   protocols.
5. Shallow clones and fetches into a SHA256 repository. (This will
   change when we add SHA256 support to Git protocol.)
6.
Skip fetching some submodules of a project into a SHA256
   repository. (This also depends on SHA256 support in Git protocol.)

Overview
--------
We introduce a new repository format extension `sha256`. Repositories
with this extension enabled use SHA256 instead of SHA-1 to name their
objects. This affects both object names and object content --- both
the names of objects and all references to other objects within an
object are switched to the new hash function. sha256 repositories
cannot be read by older versions of Git.

Alongside the packfile, a sha256 repository stores a bidirectional
mapping between sha256 and sha1 object names. The mapping is generated
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their sha1 or sha256
names interchangeably.

"git cat-file" and "git hash-object" gain options to display a sha256
object in its sha1 form and write a sha256 object given its sha1 form.
This requires all objects referenced by that object to be present in
the object database so that they can be named using the appropriate
name (using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
sha256 form and record the mapping in the bidirectional mapping table
(see below for details). Pushes to a SHA-1 based server convert the
objects being pushed into sha1 form so the server does not have to be
aware of the hash function the client is using.

Detailed Design
---------------
Object names
~~~~~~~~~~~~
Objects can be named by their 40 hexadecimal digit sha1-name or 64
hexadecimal digit sha256-name, plus names derived from those (see
gitrevisions(7)).

The sha1-name of an object is the SHA-1 of the concatenation of its
type, length, a nul byte, and the object's sha1-content. This is the
traditional <sha1> used in Git to name objects.

The sha256-name of an object is the SHA-256 of the concatenation of
its type, length, a nul byte, and the object's sha256-content.
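The naming rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not Git's implementation; the helper name `object_name` is made up here:

```python
import hashlib

def object_name(obj_type: str, content: bytes, algo: str) -> str:
    # An object's name is the hash of "<type> <length>\0<content>",
    # where <content> is the sha1-content or sha256-content depending
    # on which hash function is in use.
    header = b"%s %d\0" % (obj_type.encode(), len(content))
    return hashlib.new(algo, header + content).hexdigest()

# The empty blob keeps its well-known 40-digit sha1-name; its
# sha256-name is computed over the same (empty) content.
print(object_name("blob", b"", "sha1"))    # 40 hex digits
print(object_name("blob", b"", "sha256"))  # 64 hex digits
```

For a blob, sha1-content and sha256-content are identical, so only the hash function changes; for trees, commits, and tags the content itself differs between the two namings, as described in the Object format section.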
Object format ~~~~~~~~~~~~~ Objects are stored using a compressed representation of their sha256-content. The sha256-content of an object is the same as its sha1-content, except that: * objects referenced by the object are named using their sha256-names instead of sha1-names * signed tags, commits, and merges of signed tags get some additional fields (see below) The format allows round-trip conversion between sha256-content and sha1-content. Loose objects use zlib compression and packed objects use the packed format described in Documentation/technical/pack-format.txt, just like today. Translation table ~~~~~~~~~~~~~~~~~ A fast bidirectional mapping between sha1-names and sha256-names of all local objects in the repository is kept on disk. The exact format of that mapping is to be determined. All operations that make new objects (e.g., "git commit") add the new objects to the translation table. Reading an object's sha1-content ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The sha1-content of an object can be read by converting all sha256-names its sha256-content references to sha1-names using the translation table. There is an additional minor transformation needed for signed tags, commits, and merges (see below). Fetch ~~~~~ Fetching from a SHA-1 based server requires translating between SHA-1 and SHA-256 based representations on the fly. SHA-1s named in the ref advertisement can be translated to SHA-256 and looked up as local objects using the translation table. Negotiation proceeds as today. Any "have"s or "want"s generated locally are converted to SHA-1 before being sent to the server, and SHA-1s mentioned by the server are converted to SHA-256 when looking them up locally. After negotiation, the server sends a packfile containing the requested objects. We convert the packfile to SHA-256 format using the following steps: 1. index-pack: inflate each object in the packfile and compute its SHA-1. 
Objects can contain deltas in OBJ_REF_DELTA format against objects the client has locally. These objects can be looked up using the translation table and their sha1-content read as described above to resolve the deltas. 2. topological sort: starting at the "want"s from the negotiation phase, walk through objects in the pack and emit a list of them in topologically sorted order. (This list only contains objects reachable from the "wants". If the pack from the server contained additional extraneous objects, then they will be discarded.) 3. convert to sha256: open a new (sha256) packfile. Read the topologically sorted list just generated in reverse order. For each object, inflate its sha1-content, convert to sha256-content, and write it to the sha256 pack. Write an idx file for this pack and include the new sha1<->sha256 mapping entry in the translation table. 4. clean up: remove the SHA-1 based pack file, index, and topologically sorted list obtained from the server and steps 1 and 2. Step 3 requires every object referenced by the new object to be in the translation table. This is why the topological sort step is necessary. As an optimization, step 1 can write a file describing what objects each object it has inflated from the packfile references. This makes the topological sort in step 2 possible without inflating the objects in the packfile for a second time. The objects need to be inflated again in step 3, for a total of two inflations. Push ~~~~ Push is simpler than fetch because the objects referenced by the pushed objects are already in the translation table. The sha1-content of each object being pushed can be read as described in the "Reading an object's sha1-content" section to generate the pack written by git send-pack. Signed Objects ~~~~~~~~~~~~~~ Commits ^^^^^^^ Commits currently have the following sequence of header lines: "tree" SP object-name ("parent" SP object-name)* "author" SP ident "committer" SP ident ("mergetag" SP object-content)? 
("gpgsig" SP pgp-signature)? We introduce new header lines "hash" and "nohash" that come after the "gpgsig" field. No "hash" lines may appear unless the "gpgsig" field is present. Hash lines have the form "hash" SP hash-function SP field SP alternate-object-name Nohash lines have the form "nohash" SP hash-function There are only two recognized values of hash-function: "sha1" and "sha256". "git fsck" will tolerate values of hash-function it does not recognize, as long as they do not come before either of those two. All "nohash" lines come before all "hash" lines. Any "hash sha1" lines must come before all "hash sha256" lines, and likewise for nohash. The Git project determines any future supported hash-functions that can come after those two and their order. There can be at most one "nohash <hash-function>" for one hash function, indicating that this hash function should not be used when checking the commit's signature. There is one "hash <hash-function>" line for each tree or parent field in the commit object header. The hash lines record object names for those trees and parents using the indicated hash function, to be used when checking the commit's signature. TODO: simplify signature rules, handle the mergetag field better. sha256-content of signed commits ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The sha256-content of a commit with a "gpgsig" header can include no hash and nohash lines, a "nohash sha256" line and "hash sha1", or just a "hash sha1" line. Examples: 1. tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315 author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 gpgsig ... 2. 
tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315 author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 gpgsig ... nohash sha256 hash sha1 tree c7b1cff039a93f3600a1d18b82d26688668c7dea hash sha1 parent c33429be94b5f2d3ee9b0adad223f877f174b05d hash sha1 parent 04b871796dc0420f8e7561a895b52484b701d51a 3. tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315 author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 gpgsig ... hash sha1 tree c7b1cff039a93f3600a1d18b82d26688668c7dea hash sha1 parent c33429be94b5f2d3ee9b0adad223f877f174b05d hash sha1 parent 04b871796dc0420f8e7561a895b52484b701d51a sha1-content of signed commits ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The sha1-content of a commit with a "gpgsig" header can contain a "nohash sha1" and "hash sha256" line, no hash or nohash lines, or just a "hash sha256" line. Examples: 1. tree c7b1cff039a93f3600a1d18b82d26688668c7dea parent c33429be94b5f2d3ee9b0adad223f877f174b05d parent 04b871796dc0420f8e7561a895b52484b701d51a author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 gpgsig ... nohash sha1 hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980 hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315 2. 
tree c7b1cff039a93f3600a1d18b82d26688668c7dea parent c33429be94b5f2d3ee9b0adad223f877f174b05d parent 04b871796dc0420f8e7561a895b52484b701d51a author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 gpgsig ... 3. tree c7b1cff039a93f3600a1d18b82d26688668c7dea parent c33429be94b5f2d3ee9b0adad223f877f174b05d parent 04b871796dc0420f8e7561a895b52484b701d51a author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 gpgsig ... hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980 hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315 Converting signed commits ^^^^^^^^^^^^^^^^^^^^^^^^^ To convert the sha1-content of a signed commit to its sha256-content: 1. Change "tree" and "parent" lines to use the sha256-names of referenced objects, as with unsigned commits. 2. If there is a "mergetag" field, convert it from sha1-content to sha256-content, as with unsigned commits with a mergetag (see the "Mergetag" section below). 3. Unless there is a "nohash sha1" line, add a full set of "hash sha1 <field> <sha1>" lines indicating the sha1-names of the tree and parents. 4. Remove any "hash sha256 <field> <sha256>" lines. If no such lines were present, add a "nohash sha256" line. Converting the sha256-content of a signed commit to sha1-content uses the same process with sha1 and sha256 switched. Verifying signed commit signatures ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ If the commit has a "hash sha1" line (or is sha1-content without a "nohash sha1" line): check that the signature matches the sha1-content with gpgsig field stripped out. Otherwise: check that the signature matches the sha1-content with gpgsig, nohash, tree, and parents fields stripped out. With the examples above, the signed payloads are 1. 
author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980 hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315 2. tree c7b1cff039a93f3600a1d18b82d26688668c7dea parent c33429be94b5f2d3ee9b0adad223f877f174b05d parent 04b871796dc0420f8e7561a895b52484b701d51a author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 3. tree c7b1cff039a93f3600a1d18b82d26688668c7dea parent c33429be94b5f2d3ee9b0adad223f877f174b05d parent 04b871796dc0420f8e7561a895b52484b701d51a author A U Thor <author@example.com> 1465982009 +0000 committer C O Mitter <committer@example.com> 1465982009 +0000 hash sha1 hash sha256 tree 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4 hash sha256 parent e094bc809626f0a401a40d75c56df478e546902ff812772c4594265203b23980 hash sha256 parent 1059dab4748aa33b86dad5ca97357bd322abaa558921255623fbddd066bb3315 Current versions of "git verify-commit" can verify examples (2) and (3) (but not (1)). Tags ~~~~ Tags currently have the following sequence of header lines: "object" SP object-name "type" SP type "tag" SP identifier "tagger" SP ident A tag's signature, if it exists, is in the message body. We introduce new header lines "nohash" and "hash" that come after the "tagger" field. No "nohash" or "hash" lines may appear unless the message body contains a PGP signature. As with commits, "nohash" lines have the form "nohash <hash-function>", indicating that this hash function should not be used when checking the tag's signature. "hash" lines have the form "hash" SP hash-function SP alternate-object-name This records the pointed-to object name using the indicated hash function, to be used when checking the tag's signature. 
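Mechanically, these header lines are easy to tokenize. The sketch below is illustrative only (the helper name and returned tuple shape are not part of the proposal); it accepts both the commit form, which carries a field name, and the tag form, which does not:

```python
def parse_hash_header(line: str):
    """Parse a "hash"/"nohash" header line into a tuple
    (kind, hash_function, field, object_name); field and
    object_name are None when absent."""
    tokens = line.split(" ")
    if tokens[0] == "nohash" and len(tokens) == 2:
        # "nohash" SP hash-function
        return ("nohash", tokens[1], None, None)
    if tokens[0] == "hash" and len(tokens) == 3:
        # Tag form: "hash" SP hash-function SP alternate-object-name
        return ("hash", tokens[1], None, tokens[2])
    if tokens[0] == "hash" and len(tokens) == 4:
        # Commit form: "hash" SP hash-function SP field SP alternate-object-name
        return ("hash", tokens[1], tokens[2], tokens[3])
    raise ValueError("not a hash/nohash header line: %r" % line)
```

A real implementation would additionally enforce the ordering and uniqueness rules described above ("nohash" before "hash", sha1 before sha256, at most one line per hash function and field).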
As with commits, "sha1" and "sha256" are the only permitted values of
hash-function; when both appear for a field, they must appear in that
order. There can be at most one "nohash" line, and it comes before any
"hash" lines. There can be only one "hash" line for a given hash
function.

sha256-content of signed tags
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha256-content of a signed tag can include no "hash" or "nohash"
lines, a "nohash sha256" and "hash sha1 <sha1>" line, or just a
"hash sha1 <sha1>" line. Examples:

1. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   Version: GnuPG v1

   iQEcBAABAgAGBQJXYRhOAAoJEGEJLoW3InGJklkIAIcnhL7RwEb/+QeX9enkXhxn
   ...

2. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   nohash sha256
   hash sha1 c7b1cff039a93f3600a1d18b82d26688668c7dea

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...

3. object 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4
   type tree
   tag v1.0
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   hash sha1 c7b1cff039a93f3600a1d18b82d26688668c7dea

   Tag Demo v1.0
   ...

sha1-content of signed tags
^^^^^^^^^^^^^^^^^^^^^^^^^^^
The sha1-content of a signed tag can include a "nohash sha1" and
"hash sha256" line, no "nohash" or "hash" lines, or just a
"hash sha256 <sha256>" line. Examples:

1. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
   tagger C O Mitter <committer@example.com> 1465981006 +0000
   nohash sha1
   hash sha256 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...

2. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
   tagger C O Mitter <committer@example.com> 1465981006 +0000

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...

3. object c7b1cff039a93f3600a1d18b82d26688668c7dea
   ...
tagger C O Mitter <committer@example.com> 1465981006 +0000
   hash sha256 98ea6e4f216f2fb4b69fff9b3a44842c38686ca685f3f55dc48c5d3fb1107be4

   Tag Demo v1.0
   -----BEGIN PGP SIGNATURE-----
   ...

Signed tags can be converted between sha1-content and sha256-content
using the same process as signed commits.

Verifying signed tags
^^^^^^^^^^^^^^^^^^^^^
As with commits, if the tag has a "hash sha1" line (or is sha1-content
without a "nohash sha1" line): check that the signature matches the
sha1-content with PGP signature stripped out. Otherwise: check that
the signature matches the sha1-content with nohash and object fields
and PGP signature stripped out.

Mergetag signatures
~~~~~~~~~~~~~~~~~~~
The mergetag field in the sha1-content of a commit contains the
sha1-content of a tag that was merged by that commit. The mergetag
field in the sha256-content of the same commit contains the
sha256-content of the same tag.

Submodules
~~~~~~~~~~
To convert recorded submodule pointers, you need to have the converted
submodule repository in place. The bidirectional mapping of the
submodule can be used to look up the new hash.

Caveats
-------
Shallow clone and submodules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because this requires all referenced objects to be available in the
locally generated translation table, this design does not support
shallow clone or unfetched submodules. Protocol improvements might
allow lifting this restriction.

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible. Not only would all developers and server operators
supporting developers have to switch on the same flag day, but
supporting tooling (continuous integration, code review, bug trackers,
etc) would have to be adapted as well.
This also makes it difficult to get early feedback from some project participants testing before it is time for mass adoption. Using hash functions in parallel ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ (e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ ) Objects newly created would be addressed by the new hash, but inside such an object (e.g. commit) it is still possible to address objects using the old hash function. * You cannot trust its history (needed for bisectability) in the future without further work * Maintenance burden as the number of supported hash functions grows (they will never go away, so they accumulate). In this proposal, by comparison, converted objects lose all references to SHA-1 except where needed to verify signatures. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC: Another proposed hash function transition plan
  2017-03-04 1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
@ 2017-03-05 2:35 ` Linus Torvalds
  2017-03-07 0:17   ` RFC v3: " Jonathan Nieder
  0 siblings, 1 reply; 49+ messages in thread

From: Linus Torvalds @ 2017-03-05 2:35 UTC (permalink / raw)
To: Jonathan Nieder
Cc: Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King

On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote:
>
> This document is still in flux but I thought it best to send it out
> early to start getting feedback.

This actually looks very reasonable if you can implement it cleanly
enough. In many ways the "convert entirely to a new 256-bit hash" is
the cleanest model, and interoperability was at least my personal
concern. Maybe your model solves it (devil in the details), in which
case I really like it.

I do think that if you end up essentially converting the objects
without really having any true backwards compatibility at the object
layer (just the translation code), you should seriously look at doing
some other changes at the same time. Like not using zlib compression;
it really is very slow.

Btw, I do think the particular choice of hash should still be on the
table. sha-256 may be the obvious first choice, but there are
definitely a few reasons to consider alternatives, especially if it's
a complete switch-over like this.

One is large-file behavior - a parallel (or tree) mode could improve
on that noticeably. BLAKE2 does have special support for that, for
example. And SHA-256 does have known attacks compared to SHA-3-256 or
BLAKE2 - whether that is due to age or due to more effort, I can't
really judge. But if we're switching away from SHA1 due to known
attacks, it does feel like we should be careful.

Linus
* RFC v3: Another proposed hash function transition plan 2017-03-05 2:35 ` Linus Torvalds @ 2017-03-07 0:17 ` Jonathan Nieder 2017-03-09 19:14 ` Shawn Pearce 2017-09-06 6:28 ` Junio C Hamano 0 siblings, 2 replies; 49+ messages in thread From: Jonathan Nieder @ 2017-03-07 0:17 UTC (permalink / raw) To: Linus Torvalds Cc: Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King, David Lang, brian m. carlson Linus Torvalds wrote: > On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote: >> This document is still in flux but I thought it best to send it out >> early to start getting feedback. > > This actually looks very reasonable if you can implement it cleanly > enough. Thanks for the kind words on what had quite a few flaws still. Here's a new draft. I think the next version will be a patch against Documentation/technical/. As before, comments welcome, both here and inline at https://goo.gl/gh2Mzc Changes since v2: Use SHA3-256 instead of SHA2 (thanks, Linus and brian m. carlson).[1][2] Make sha3-based signatures a separate field, avoiding the need for "hash" and "nohash" fields (thanks to peff[3]). Add a sorting phase to fetch (thanks to Junio for noticing the need for this). Omit blobs from the topological sort during fetch (thanks to peff). Discuss alternates, git notes, and git servers in the caveats section (thanks to Junio Hamano, brian m. carlson[4], and Shawn Pearce). Clarify language throughout (thanks to various commenters, especially Junio). Sincerely, Jonathan Git hash function transition ============================ Status: Draft Last Updated: 2017-03-06 Objective --------- Migrate Git from SHA-1 to a stronger hash function. Background ---------- At its core, the Git version control system is a content addressable filesystem. It uses the SHA-1 hash function to name content. 
For example, files, directories, and revisions are referred to by hash
values, unlike in other traditional version control systems where
files or versions are referred to via sequential numbers. The use of a
hash function to address its content delivers a few advantages:

* Integrity checking is easy. Bit flips, for example, are easily
  detected, as the hash of corrupted content does not match its name.
* Lookup of objects is fast.

Using a cryptographically secure hash function brings additional
advantages:

* Object names can be signed and third parties can trust the hash to
  address the signed object and all objects it references.
* Communication using Git protocol and out of band communication
  methods have a short reliable string that can be used to address
  stored content.

Over time some flaws in SHA-1 have been discovered by security
researchers. https://shattered.io demonstrated a practical SHA-1 hash
collision. As a result, SHA-1 cannot be considered cryptographically
secure any more. This impacts the communication of hash values because
we cannot trust that a given hash value represents the known good
version of content that the speaker intended.

SHA-1 still possesses the other properties such as fast object lookup
and safe error checking, but other hash functions that are believed to
be cryptographically secure are equally suitable.

Goals
-----
1. The transition to SHA3-256 can be done one local repository at a
   time.
   a. Requiring no action by any other party.
   b. A SHA3-256 repository can communicate with SHA-1 Git servers
      (push/fetch).
   c. Users can use SHA-1 and SHA3-256 identifiers for objects
      interchangeably.
   d. New signed objects make use of a stronger hash function than
      SHA-1 for their security guarantees.
2. Allow a complete transition away from SHA-1.
   a. Local metadata for SHA-1 compatibility can be removed from a
      repository if compatibility with SHA-1 is no longer needed.
3. Maintainability throughout the process.
   a.
The object format is kept simple and consistent.
   b. Creation of a generalized repository conversion tool.

Non-Goals
---------
1. Add SHA3-256 support to Git protocol. This is valuable and the
   logical next step but it is out of scope for this initial design.
2. Transparently improving the security of existing SHA-1 signed
   objects.
3. Intermixing objects using multiple hash functions in a single
   repository.
4. Taking the opportunity to fix other bugs in git's formats and
   protocols.
5. Shallow clones and fetches into a SHA3-256 repository. (This will
   change when we add SHA3-256 support to Git protocol.)
6. Skip fetching some submodules of a project into a SHA3-256
   repository. (This also depends on SHA3-256 support in Git
   protocol.)

Overview
--------
We introduce a new repository format extension `sha3`. Repositories
with this extension enabled use SHA3-256 instead of SHA-1 to name
their objects. This affects both object names and object content ---
both the names of objects and all references to other objects within
an object are switched to the new hash function. sha3 repositories
cannot be read by older versions of Git.

Alongside the packfile, a sha3 repository stores a bidirectional
mapping between sha3 and sha1 object names. The mapping is generated
locally and can be verified using "git fsck". Object lookups use this
mapping to allow naming objects using either their sha1 or sha3 names
interchangeably.

"git cat-file" and "git hash-object" gain options to display an object
in its sha1 form and write an object given its sha1 form. This
requires all objects referenced by that object to be present in the
object database so that they can be named using the appropriate name
(using the bidirectional hash mapping).

Fetches from a SHA-1 based server convert the fetched objects into
sha3 form and record the mapping in the bidirectional mapping table
(see below for details).
Pushes to a SHA-1 based server convert the objects being pushed into
sha1 form so the server does not have to be aware of the hash function
the client is using.

Detailed Design
---------------
Object names
~~~~~~~~~~~~
Objects can be named by their 40 hexadecimal digit sha1-name or 64
hexadecimal digit sha3-name, plus names derived from those (see
gitrevisions(7)).

The sha1-name of an object is the SHA-1 of the concatenation of its
type, length, a nul byte, and the object's sha1-content. This is the
traditional <sha1> used in Git to name objects.

The sha3-name of an object is the SHA3-256 of the concatenation of its
type, length, a nul byte, and the object's sha3-content.

Object format
~~~~~~~~~~~~~
The content as a byte sequence of a tag, commit, or tree object named
by sha1 and sha3 differs because an object named by sha3-name refers
to other objects by their sha3-names and an object named by sha1-name
refers to other objects by their sha1-names.

The sha3-content of an object is the same as its sha1-content, except
that objects referenced by the object are named using their sha3-names
instead of sha1-names. Because a blob object does not refer to any
other object, its sha1-content and sha3-content are the same.

The format allows round-trip conversion between sha3-content and
sha1-content.

Object storage
~~~~~~~~~~~~~~
Loose objects use zlib compression and packed objects use the packed
format described in Documentation/technical/pack-format.txt, just like
today. The content that is compressed and stored uses sha3-content
instead of sha1-content.

Translation table
~~~~~~~~~~~~~~~~~
A fast bidirectional mapping between sha1-names and sha3-names of all
local objects in the repository is kept on disk. The exact format of
that mapping is to be determined.

All operations that make new objects (e.g., "git commit") add the new
objects to the translation table. (This work could have been deferred
to push time, but that would significantly complicate and slow down
pushes.
Calculating the sha1-name at object creation time at the same time it is being streamed to disk and having its sha3-name calculated should be an acceptable cost.) Reading an object's sha1-content ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The sha1-content of an object can be read by converting all sha3-names its sha3-content references to sha1-names using the translation table. Fetch ~~~~~ Fetching from a SHA-1 based server requires translating between SHA-1 and SHA3-256 based representations on the fly. SHA-1s named in the ref advertisement that are present on the client can be translated to SHA3-256 and looked up as local objects using the translation table. Negotiation proceeds as today. Any "have"s generated locally are converted to SHA-1 before being sent to the server, and SHA-1s mentioned by the server are converted to SHA3-256 when looking them up locally. After negotiation, the server sends a packfile containing the requested objects. We convert the packfile to SHA3-256 format using the following steps: 1. index-pack: inflate each object in the packfile and compute its SHA-1. Objects can contain deltas in OBJ_REF_DELTA format against objects the client has locally. These objects can be looked up using the translation table and their sha1-content read as described above to resolve the deltas. 2. topological sort: starting at the "want"s from the negotiation phase, walk through objects in the pack and emit a list of them, excluding blobs, in reverse topologically sorted order, with each object coming later in the list than all objects it references. (This list only contains objects reachable from the "wants". If the pack from the server contained additional extraneous objects, then they will be discarded.) 3. convert to sha3: open a new (sha3) packfile. Read the topologically sorted list just generated. For each object, inflate its sha1-content, convert to sha3-content, and write it to the sha3 pack. Include the new sha1<->sha3 mapping entry in the translation table. 
4. sort: reorder entries in the new pack to match the order of objects
   in the pack the server generated and include blobs. Write a sha3
   idx file.
5. clean up: remove the SHA-1 based pack file, index, and
   topologically sorted list obtained from the server and steps 1
   and 2.

Step 3 requires every object referenced by the new object to be in the
translation table. This is why the topological sort step is necessary.

As an optimization, step 1 could write a file describing what non-blob
objects each object it has inflated from the packfile references. This
makes the topological sort in step 2 possible without inflating the
objects in the packfile for a second time. The objects need to be
inflated again in step 3, for a total of two inflations.

Step 4 is probably necessary for good read-time performance. "git
pack-objects" on the server optimizes the pack file for good data
locality (see Documentation/technical/pack-heuristics.txt).

Details of this process are likely to change. It will take some
experimenting to get this to perform well.

Push
~~~~
Push is simpler than fetch because the objects referenced by the
pushed objects are already in the translation table. The sha1-content
of each object being pushed can be read as described in the "Reading
an object's sha1-content" section to generate the pack written by git
send-pack.

Signed Commits
~~~~~~~~~~~~~~
We add a new field "gpgsig-sha3" to the commit object format to allow
signing commits without relying on SHA-1. It is similar to the
existing "gpgsig" field. Its signed payload is the sha3-content of the
commit object with any "gpgsig" and "gpgsig-sha3" fields removed.

This means commits can be signed

1. using SHA-1 only, as in existing signed commit objects
2. using both SHA-1 and SHA3-256, by using both gpgsig-sha3 and gpgsig
   fields.
3. using only SHA3-256, by only using the gpgsig-sha3 field.
Old versions of "git verify-commit" can verify the gpgsig signature in
cases (1) and (2) without modifications and view case (3) as an
ordinary unsigned commit.

Signed Tags
~~~~~~~~~~~
We add a new field "gpgsig-sha3" to the tag object format to allow
signing tags without relying on SHA-1. Its signed payload is the
sha3-content of the tag with its gpgsig-sha3 field and "-----BEGIN PGP
SIGNATURE-----" delimited in-body signature removed.

This means tags can be signed

1. using SHA-1 only, as in existing signed tag objects
2. using both SHA-1 and SHA3-256, by using gpgsig-sha3 and an in-body
   signature.
3. using only SHA3-256, by only using the gpgsig-sha3 field.

Mergetag embedding
~~~~~~~~~~~~~~~~~~
The mergetag field in the sha1-content of a commit contains the
sha1-content of a tag that was merged by that commit. The mergetag
field in the sha3-content of the same commit contains the sha3-content
of the same tag.

Submodules
~~~~~~~~~~
To convert recorded submodule pointers, you need to have the converted
submodule repository in place. The translation table of the submodule
can be used to look up the new hash.

Caveats
-------
Invalid objects
~~~~~~~~~~~~~~~
The conversion from sha1-content to sha3-content retains any
brokenness in the original object (e.g., tree entry modes encoded with
leading 0, tree objects whose paths are not sorted correctly, and
commit objects without an author or committer). This is a deliberate
feature of the design to allow the conversion to round-trip. More
profoundly broken objects (e.g., a commit with a truncated "tree"
header line) cannot be converted but were not usable by current Git
anyway.

Shallow clone and submodules
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Because it requires all referenced objects to be available in the
locally generated translation table, this design does not support
shallow clone or unfetched submodules. Protocol improvements might
allow lifting this restriction.
Alternates
~~~~~~~~~~
For the same reason, a sha3 repository cannot borrow objects from a
sha1 repository using objects/info/alternates or
$GIT_ALTERNATE_OBJECT_REPOSITORIES.

git notes
~~~~~~~~~
The "git notes" tool annotates objects using their sha1-name as key.
This design does not describe a way to migrate notes trees to use
sha3-names. That migration is expected to happen separately (for
example using a file at the root of the notes tree to describe which
hash it uses).

Server-side cost
~~~~~~~~~~~~~~~~
Until Git protocol gains SHA3-256 support, using sha3 based storage on
public-facing Git servers is strongly discouraged. Once Git protocol
gains SHA3-256 support, sha3 based servers are likely not to support
sha1 compatibility, to avoid what may be a very expensive hash
reencode during clone and to encourage peers to modernize.

The design described here allows fetches by SHA-1 clients of a
personal SHA3-256 repository because it's not much more difficult than
allowing pushes from that repository. This support needs to be guarded
by a configuration option --- servers like git.kernel.org that serve a
large number of clients would not be expected to bear that cost.

Meaning of signatures
~~~~~~~~~~~~~~~~~~~~~
The signed payload for signed commits and tags does not explicitly
name the hash used to identify objects. If some day Git adopts a new
hash function with the same length as the current SHA-1 (40
hexadecimal digit) or SHA3-256 (64 hexadecimal digit) objects then the
intent behind the PGP signed payload in an object signature is
unclear:

    object e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7
    type commit
    tag v2.12.0
    tagger Junio C Hamano <gitster@pobox.com> 1487962205 -0800

    Git 2.12

Does this mean Git v2.12.0 is the commit with sha1-name
e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7 or the commit with
new-40-digit-hash-name e7e07d5a4fcc2a203d9873968ad3e6bd4d7419d7?

Fortunately SHA3-256 and SHA-1 have different lengths.
If Git starts using another hash with the same length to name objects,
then it will need to change the format of signed payloads using that
hash to address this issue.

Alternatives considered
-----------------------
Upgrading everyone working on a particular project on a flag day
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Projects like the Linux kernel are large and complex enough that
flipping the switch for all projects based on the repository at once
is infeasible. Not only would all developers and server operators
supporting developers have to switch on the same flag day, but
supporting tooling (continuous integration, code review, bug trackers,
etc) would have to be adapted as well. This also makes it difficult to
get early feedback from some project participants testing before it is
time for mass adoption.

Using hash functions in parallel
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
(e.g. https://public-inbox.org/git/22708.8913.864049.452252@chiark.greenend.org.uk/ )

Objects newly created would be addressed by the new hash, but inside
such an object (e.g. commit) it is still possible to address objects
using the old hash function.

* You cannot trust its history (needed for bisectability) in the
  future without further work
* Maintenance burden as the number of supported hash functions grows
  (they will never go away, so they accumulate).

In this proposal, by comparison, converted objects lose all references
to SHA-1.

Signed objects with multiple hashes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of introducing the gpgsig-sha3 field in commit and tag objects
for sha3-content based signatures, an earlier version of this design
added "hash sha3 <sha3-name>" fields to strengthen the existing
sha1-content based signatures. In other words, a single signature was
used to attest to the object content using both hash functions. This
had some advantages:

* Using one signature instead of two speeds up the signing process.
* Having one signed payload with both hashes allows the signer to
  attest to the sha1-name and sha3-name referring to the same object.
* All users consume the same signature. Broken signatures are likely
  to be detected quickly using current versions of git.

However, it also came with disadvantages:

* Verifying a signed object requires access to the sha1-names of all
  objects it references, even after the transition is complete and the
  translation table is no longer needed for anything else. To support
  this, the design added fields such as "hash sha1 tree <sha1-name>"
  and "hash sha1 parent <sha1-name>" to the sha3-content of a signed
  commit, complicating the conversion process.
* Allowing signed objects without a sha1 (for after the transition is
  complete) complicated the design further, requiring a "nohash sha1"
  field to suppress including "hash sha1" fields in the sha3-content
  and signed payload.

Document History
----------------
2017-03-03 bmwill@google.com, jonathantanmy@google.com,
jrnieder@gmail.com, sbeller@google.com

  Initial version sent to
  http://public-inbox.org/git/20170304011251.GA26789@aiede.mtv.corp.google.com

2017-03-03 jrnieder@gmail.com

  Incorporated suggestions from jonathantanmy and sbeller:
  * describe purpose of signed objects with each hash type
  * redefine signed object verification using object content under the
    first hash function

2017-03-06 jrnieder@gmail.com

  * Use SHA3-256 instead of SHA2 (thanks, Linus and brian m.
    carlson).[1][2]
  * Make sha3-based signatures a separate field, avoiding the need for
    "hash" and "nohash" fields (thanks to peff[3]).
  * Add a sorting phase to fetch (thanks to Junio for noticing the
    need for this).
  * Omit blobs from the topological sort during fetch (thanks to
    peff).
  * Discuss alternates, git notes, and git servers in the caveats
    section (thanks to Junio Hamano, brian m. carlson[4], and Shawn
    Pearce).
  * Clarify language throughout (thanks to various commenters,
    especially Junio).
[1] http://public-inbox.org/git/CA+55aFzJtejiCjV0e43+9oR3QuJK2PiFiLQemytoLpyJWe6P9w@mail.gmail.com/
[2] http://public-inbox.org/git/CA+55aFz+gkAsDZ24zmePQuEs1XPS9BP_s8O7Q4wQ7LV7X5-oDA@mail.gmail.com/
[3] http://public-inbox.org/git/20170306084353.nrns455dvkdsfgo5@sigill.intra.peff.net/
[4] http://public-inbox.org/git/20170304224936.rqqtkdvfjgyezsht@genre.crustytoothpaste.net

^ permalink raw reply	[flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-03-07 0:17 ` RFC v3: " Jonathan Nieder @ 2017-03-09 19:14 ` Shawn Pearce 2017-03-09 20:24 ` Jonathan Nieder 2017-09-06 6:28 ` Junio C Hamano 1 sibling, 1 reply; 49+ messages in thread From: Shawn Pearce @ 2017-03-09 19:14 UTC (permalink / raw) To: Jonathan Nieder Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote: > Linus Torvalds wrote: >> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote: > >>> This document is still in flux but I thought it best to send it out >>> early to start getting feedback. >> >> This actually looks very reasonable if you can implement it cleanly >> enough. > > Thanks for the kind words on what had quite a few flaws still. Here's > a new draft. I think the next version will be a patch against > Documentation/technical/. FWIW, I like this approach. > Alongside the packfile, a sha3 repository stores a bidirectional > mapping between sha3 and sha1 object names. The mapping is generated > locally and can be verified using "git fsck". Object lookups use this > mapping to allow naming objects using either their sha1 and sha3 names > interchangeably. I saw some discussion about using LevelDB for this mapping table. I think any existing database may be overkill. For packs, you may be able to simplify by having only one file (pack-*.msha1) that maps SHA-1 to pack offset; idx v2. The CRC32 table in v2 is unnecessary, but you need the 64 bit offset support. SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to read the SHA-3. SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to translate offset to SHA-1. For loose objects, the loose object directories should have only O(4000) entries before auto gc is strongly encouraging packing/pruning. 
With 256 shards, each given directory has O(16) loose objects in it.
When writing a SHA-3 loose object, Git could also append a line
"$sha3 $sha1\n" to objects/${first_byte}/sha1, which GC/prune rewrites
to remove entries. With O(16) objects in a directory, these files
should only have O(16) entries in them.

SHA-3 to SHA-1: open objects/${sha3_first_byte}/sha1 and scan until a
match is found.
SHA-1 to SHA-3: brute force read 256 files. Callers performing this
mapping may load all 256 files into a table in memory.
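The per-shard mapping files Shawn describes could be modeled like this (an illustrative in-memory sketch, not Git code; the `LooseMapping` class and the dict standing in for the `objects/${first_byte}/sha1` files are assumptions for demonstration):

```python
class LooseMapping:
    """Toy model of per-shard "sha3 sha1" mapping files for loose objects."""

    def __init__(self):
        # shard key (first byte of the sha3-name, as two hex digits)
        # -> list of "sha3 sha1" lines, standing in for the file
        # objects/${first_byte}/sha1
        self.shards = {}

    def record(self, sha3, sha1):
        # Writing a loose object appends one line; no index rebuild.
        self.shards.setdefault(sha3[:2], []).append(f"{sha3} {sha1}")

    def sha3_to_sha1(self, sha3):
        # Open only the matching shard file and scan its O(16) entries.
        for line in self.shards.get(sha3[:2], []):
            s3, s1 = line.split()
            if s3 == sha3:
                return s1
        return None

    def sha1_to_sha3(self, sha1):
        # Brute force: read all 256 shard files (here, all dict entries).
        for lines in self.shards.values():
            for line in lines:
                s3, s1 = line.split()
                if s1 == sha1:
                    return s3
        return None
```

GC/prune would rewrite the shard files to drop entries for pruned objects; this sketch omits that step.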
* Re: RFC v3: Another proposed hash function transition plan 2017-03-09 19:14 ` Shawn Pearce @ 2017-03-09 20:24 ` Jonathan Nieder 2017-03-10 19:38 ` Jeff King 0 siblings, 1 reply; 49+ messages in thread From: Jonathan Nieder @ 2017-03-09 20:24 UTC (permalink / raw) To: Shawn Pearce Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi, Shawn Pearce wrote: > On Mon, Mar 6, 2017 at 4:17 PM, Jonathan Nieder <jrnieder@gmail.com> wrote: >> Alongside the packfile, a sha3 repository stores a bidirectional >> mapping between sha3 and sha1 object names. The mapping is generated >> locally and can be verified using "git fsck". Object lookups use this >> mapping to allow naming objects using either their sha1 and sha3 names >> interchangeably. > > I saw some discussion about using LevelDB for this mapping table. I > think any existing database may be overkill. > > For packs, you may be able to simplify by having only one file > (pack-*.msha1) that maps SHA-1 to pack offset; idx v2. The CRC32 table > in v2 is unnecessary, but you need the 64 bit offset support. > > SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to > read the SHA-3. > SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to > translate offset to SHA-1. Thanks for this suggestion. I was initially vaguely nervous about lookup times in an idx-style file, but as you say, object reads from a packfile already have to deal with this kind of lookup and work fine. > For loose objects, the loose object directories should have only > O(4000) entries before auto gc is strongly encouraging > packing/pruning. With 256 shards, each given directory has O(16) loose > objects in it. When writing a SHA-3 loose object, Git could also > append a line "$sha3 $sha1\n" to objects/${first_byte}/sha1, which > GC/prune rewrites to remove entries. With O(16) objects in a > directory, these files should only have O(16) entries in them. 
Insertion time is what worries me. When writing a small number of
objects using a command like "git commit", I don't want to have to
regenerate an entire idx file.

I don't want to move the pain to O(loose objects) work at read time,
either --- some people disable auto gc, and others have a large number
of loose objects due to gc ejecting unreachable objects.

But some kind of simplification along these lines should be possible.
I'll experiment.

Jonathan
* Re: RFC v3: Another proposed hash function transition plan 2017-03-09 20:24 ` Jonathan Nieder @ 2017-03-10 19:38 ` Jeff King 2017-03-10 19:55 ` Jonathan Nieder 0 siblings, 1 reply; 49+ messages in thread From: Jeff King @ 2017-03-10 19:38 UTC (permalink / raw) To: Jonathan Nieder Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, David Lang, brian m. carlson On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote: > > SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to > > read the SHA-3. > > SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to > > translate offset to SHA-1. > > Thanks for this suggestion. I was initially vaguely nervous about > lookup times in an idx-style file, but as you say, object reads from a > packfile already have to deal with this kind of lookup and work fine. Not exactly. The "reverse .idx" step has to build the reverse mapping on the fly, and it's non-trivial. For instance, try: sha1=$(git rev-parse HEAD) time echo $sha1 | git cat-file --batch-check='%(objectsize)' time echo $sha1 | git cat-file --batch-check='%(objectsize:disk)' on a large repo (where HEAD is in a big pack). The on-disk size is conceptually simpler, as we only need to look at the offset of the object versus the offset of the object after it. But in practice it takes much longer, because it has to build the revindex on the fly (I get 7ms versus 179ms on linux.git). The effort is linear in the number of objects (we create the revindex with a radix sort). The reachability bitmaps suffer from this, too, as they need the revindex to know which object is at which bit position. At GitHub we added an extension to the .bitmap files that stores this "bit cache". 
Here are timings before and after on linux.git: $ time git rev-list --use-bitmap-index --count master 659371 real 0m0.182s user 0m0.136s sys 0m0.044s $ time git.gh rev-list --use-bitmap-index --count master 659371 real 0m0.016s user 0m0.008s sys 0m0.004s It's not a full revindex, but it's enough for bitmap use. You can also use it to generate the revindex slightly more quickly, because you can skip the sorting step (you just insert the entries in the correct order by walking the bit cache and dereferencing the offsets from the .idx portion). So it's still linear, but with a smaller constant factor. I think for the purposes here, though, we don't actually care about the offsets. For the cost of one uint32_t per object, you can keep a list mapping positions in the sha1 index into the sha3 index. So then you do the log-n binary search to find the sha1, a constant-time lookup in the mapping array, and that gives you the position in the sha3 index, from which you can then access the sha3 (or the actual pack offset, for that matter). So I think it's solvable, but I suspect we would want an extension to the .idx format to store the mapping array, in order to keep it log-n. -Peff ^ permalink raw reply [flat|nested] 49+ messages in thread
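The mapping-array idea in the last paragraph might be sketched as follows (a simplified model using short placeholder names; a real implementation would store uint32 positions in an .idx extension rather than a Python list):

```python
import bisect

def build_tables(pairs):
    """Build sorted sha1/sha3 index tables plus the position-mapping array.

    pairs: list of (sha1, sha3) names for the objects in one pack.
    """
    sha1_sorted = sorted(p[0] for p in pairs)
    sha3_sorted = sorted(p[1] for p in pairs)
    sha3_pos = {name: i for i, name in enumerate(sha3_sorted)}
    by_sha1 = dict(pairs)
    # mapping[i] = position in the sha3 table of the object found at
    # position i of the sha1 table (one uint32 per object on disk)
    mapping = [sha3_pos[by_sha1[name]] for name in sha1_sorted]
    return sha1_sorted, sha3_sorted, mapping

def sha1_to_sha3(sha1, sha1_sorted, sha3_sorted, mapping):
    # O(log n) binary search in the sha1 table...
    i = bisect.bisect_left(sha1_sorted, sha1)
    if i == len(sha1_sorted) or sha1_sorted[i] != sha1:
        return None
    # ...then a constant-time hop through the mapping array.
    return sha3_sorted[mapping[i]]
```

The same array could equally point at pack offsets; the key property is that lookup stays O(log n) without building a reverse index on the fly.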
* Re: RFC v3: Another proposed hash function transition plan 2017-03-10 19:38 ` Jeff King @ 2017-03-10 19:55 ` Jonathan Nieder 0 siblings, 0 replies; 49+ messages in thread From: Jonathan Nieder @ 2017-03-10 19:55 UTC (permalink / raw) To: Jeff King Cc: Shawn Pearce, Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, Jonathan Tan, David Lang, brian m. carlson Jeff King wrote: > On Thu, Mar 09, 2017 at 12:24:08PM -0800, Jonathan Nieder wrote: >>> SHA-1 to SHA-3: lookup SHA-1 in .msha1, reverse .idx, find offset to >>> read the SHA-3. >>> SHA-3 to SHA-1: lookup SHA-3 in .idx, and reverse the .msha1 file to >>> translate offset to SHA-1. >> >> Thanks for this suggestion. I was initially vaguely nervous about >> lookup times in an idx-style file, but as you say, object reads from a >> packfile already have to deal with this kind of lookup and work fine. > > Not exactly. The "reverse .idx" step has to build the reverse mapping on > the fly, and it's non-trivial. Sure. To be clear, I was handwaving over that since adding an on-disk reverse .idx is a relatively small change. [...] > So I think it's solvable, but I suspect we would want an extension to > the .idx format to store the mapping array, in order to keep it log-n. i.e., this. The loose object side is the more worrying bit, since we currently don't have any practical bound on the number of loose objects. One way to deal with that is to disallow loose objects completely. Use packfiles for new objects, batching the objects produced by a single process into a single packfile. Teach "git gc --auto" a behavior similar to Martin Fick's "git exproll" to combine packfiles between full gcs to maintain reasonable performance. For unreachable objects, instead of using loose objects, use "unreachable garbage" packs explicitly labeled as such, with similar semantics to what JGit's DfsRepository backend uses (described in the discussion at https://git.eclipse.org/r/89455). 
That's a direction that I want in the long term anyway. I was hoping
not to couple such changes with the hash transition but it might be
one of the simpler ways to go.

Jonathan
* Re: RFC v3: Another proposed hash function transition plan 2017-03-07 0:17 ` RFC v3: " Jonathan Nieder 2017-03-09 19:14 ` Shawn Pearce @ 2017-09-06 6:28 ` Junio C Hamano 2017-09-08 2:40 ` Junio C Hamano 1 sibling, 1 reply; 49+ messages in thread From: Junio C Hamano @ 2017-09-06 6:28 UTC (permalink / raw) To: Jonathan Nieder Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King, David Lang, brian m. carlson Jonathan Nieder <jrnieder@gmail.com> writes: > Linus Torvalds wrote: >> On Fri, Mar 3, 2017 at 5:12 PM, Jonathan Nieder <jrnieder@gmail.com> wrote: > >>> This document is still in flux but I thought it best to send it out >>> early to start getting feedback. >> >> This actually looks very reasonable if you can implement it cleanly >> enough. > > Thanks for the kind words on what had quite a few flaws still. Here's > a new draft. I think the next version will be a patch against > Documentation/technical/. Can we reboot the discussion and advance this to v4 state? > As before, comments welcome, both here and inline at > > https://goo.gl/gh2Mzc I think what you have over there looks pretty-much ready as the final outline. One thing I still do not know how I feel about after re-reading the thread, and I didn't find the above doc, is Linus's suggestion to use the objects themselves as NewHash-to-SHA-1 mapper [*1*]. It does not help the reverse mapping that is needed while pushing things out (the SHA-1 receiver tells us what they have in terms of SHA-1 names; we need to figure out where we stop sending based on that). While it does help maintaining itself (while constructing SHA3-content, we'd be required to find out its SHA1 name but the SHA3 objects that we refer to all know their SHA-1 names), if it is not useful otherwise, then that does not count as a plus. 
Also having to bake the corresponding SHA-1 name in the object would
mean mistakes can easily propagate and cannot be corrected without
rewriting the history, which would be a huge downside. So perhaps we
are better off without it, I guess.

[Reference]

*1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@mail.gmail.com>
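The reverse-mapping step described above (deciding where to stop sending based on the peer's SHA-1 "have"s) could be sketched like this (an illustrative helper, not Git code; the function name and the dict standing in for the translation table are assumptions):

```python
def translate_haves(advertised_sha1s, sha1_to_sha3):
    """Map a SHA-1 peer's advertised 'have's to local sha3-names.

    advertised_sha1s: sha1-names the peer says it already has.
    sha1_to_sha3: the locally maintained translation table.

    SHA-1 names that are not in the table correspond to objects we do
    not have locally, so they cannot serve as common-history cut
    points and are simply skipped.
    """
    haves = set()
    for sha1 in advertised_sha1s:
        sha3 = sha1_to_sha3.get(sha1)
        if sha3 is not None:
            haves.add(sha3)
    return haves
```

The point Junio makes is that objects which merely record their own SHA-1 name cannot answer this query; the sender needs a lookup keyed by SHA-1, i.e. the external table.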
* Re: RFC v3: Another proposed hash function transition plan 2017-09-06 6:28 ` Junio C Hamano @ 2017-09-08 2:40 ` Junio C Hamano 2017-09-08 3:34 ` Jeff King 2017-09-11 18:59 ` Brandon Williams 0 siblings, 2 replies; 49+ messages in thread From: Junio C Hamano @ 2017-09-08 2:40 UTC (permalink / raw) To: Jonathan Nieder Cc: Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, jonathantanmy, Jeff King, David Lang, brian m. carlson Junio C Hamano <gitster@pobox.com> writes: > One thing I still do not know how I feel about after re-reading the > thread, and I didn't find the above doc, is Linus's suggestion to > use the objects themselves as NewHash-to-SHA-1 mapper [*1*]. > ... > [Reference] > > *1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@mail.gmail.com> I think this falls into the same category as the often-talked-about addition of the "generation number" field. It is very tempting to add these "mechanically derivable but expensive to compute" pieces of information to the sha3-content while converting from sha1-content and creating anew. Because the "sha1-name" or the "generation number" can mechanically be computed, as long as everybody agrees to _always_ place them in the sha3-content, the same sha1-content will be converted into exactly the same sha3-content without ambiguity, and converting them back to sha1-content while pushing to an older repository will correctly produce the original sha1-content, as it would just be the matter of simply stripping these extra pieces of information. The reason why I still feel a bit uneasy about adding these things (aside from the fact that sha1-name thing will be a baggage we would need to carry forever even after we completely wean ourselves off of the old hash) is because I am not sure what we should do when we encounter sha3-content in the wild that has these things _wrong_. 
An object that exists today in the SHA-1 world is fetched into the new
repository and converted to SHA-3 contents, and Linus's extra
"original SHA-1 name" field is added to the object's header while
recording the SHA-3 content. But for whatever reason, the original
SHA-1 name is recorded incorrectly in the resulting SHA-3 object.

The same thing could happen if we decide to bake "generation number"
in the SHA-3 commit objects. One possible definition would be that a
root commit will have gen #0; a commit with 1 or more parents will get
max(parents' gen numbers) + 1 as its gen number. But somebody may
botch the counting and record sum(parents' gen numbers) as its gen
number.

In these cases, not just the SHA3-content but also the resulting SHA-3
object name would be different from the name of the object that would
have recorded the same contents correctly. So converting back to the
SHA-1 world from these botched SHA-3 contents may produce the original
contents, but we may end up with multiple "plausibly looking" sets of
SHA-3 objects that (claim to) correspond to a single SHA-1 object,
only one of which is a valid one.

Our "git fsck" already treats certain brokenness (like a tree whose
entry has a mode that is 0-padded to the left) as broken but still
tolerates it. I am not sure if it is sufficient to diagnose and
declare broken and invalid when we see sha3-content that records these
"mechanically derivable but expensive to compute" pieces of
information incorrectly.

I am leaning towards saying "yes, catching in fsck is enough" and
suggesting to add generation numbers to the sha3-content of commit
objects, and to add even the "original sha1 name" thing if we find
good use of it. But I cannot shake off this nagging feeling that I am
missing some huge problems that adding these fields would cause by
opening ourselves to more classes of broken objects.

Thoughts?
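The generation-number rule Junio describes, and the fsck-style check of a baked-in value, can be sketched as follows (an illustrative model only; function names and the parent-map representation are assumptions, not Git's implementation):

```python
def generation(commit, parents_of, memo=None):
    """Generation number: 0 for a root commit, else
    max(parents' generation numbers) + 1."""
    memo = {} if memo is None else memo
    if commit not in memo:
        parents = parents_of.get(commit, [])
        if not parents:
            memo[commit] = 0
        else:
            memo[commit] = 1 + max(
                generation(p, parents_of, memo) for p in parents)
    return memo[commit]

def fsck_generation(commit, recorded_gen, parents_of):
    # Recompute the generation number and compare it against the value
    # baked into the object; a mismatch (e.g. a writer that summed the
    # parents' numbers instead of taking the max) marks it invalid.
    return recorded_gen == generation(commit, parents_of)
```

Note that, as Peff points out downthread, this check cannot be done on a commit in isolation: it needs the parents (and transitively the whole history), which is what makes it awkward for shallow clones.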
* Re: RFC v3: Another proposed hash function transition plan 2017-09-08 2:40 ` Junio C Hamano @ 2017-09-08 3:34 ` Jeff King 2017-09-11 18:59 ` Brandon Williams 1 sibling, 0 replies; 49+ messages in thread From: Jeff King @ 2017-09-08 3:34 UTC (permalink / raw) To: Junio C Hamano Cc: Jonathan Nieder, Linus Torvalds, Git Mailing List, Stefan Beller, bmwill, jonathantanmy, David Lang, brian m. carlson On Fri, Sep 08, 2017 at 11:40:21AM +0900, Junio C Hamano wrote: > Our "git fsck" already treats certain brokenness (like a tree whose > entry has mode that is 0-padded to the left) as broken but still > tolerate them. I am not sure if it is sufficient to diagnose and > declare broken and invalid when we see sha3-content that records > these "mechanically derivable but expensive to compute" pieces of > information incorrectly. > > I am leaning towards saying "yes, catching in fsck is enough" and > suggesting to add generation number to sha3-content of the commit > objects, and to add even the "original sha1 name" thing if we find > good use of it. But I cannot shake this nagging feeling off that I > am missing some huge problems that adding these fields and opening > ourselves to more classes of broken objects. I share your nagging feeling. I have two thoughts on the "fsck can catch it" line of reasoning. 1. It's harder to fsck generation numbers than other syntactic elements of an object, because it inherently depends on the links. So I can't fsck a commit object in isolation. I have to open its parents and check _their_ generation numbers. In some sense that isn't a big deal. A real fsck wants to know that we _have_ the parents in the first place. But traditionally we've separated "is this syntactically valid" from "do we have full connectivity". And features like shallow clones rely on us fudging the latter but not the former. A shallow history could never properly fsck the generation numbers. A multiple-hash field doesn't have this problem. 
It's purely a function of the bytes in the object. 2. I wouldn't classify the current fsck checks as a wild success in containing breakages. If a buggy implementation produces invalid objects, the same buggy implementation generally lets people (and their colleagues) unwittingly build on top of those objects. It's only later (sometimes much later) that they interact with a non-buggy implementation whose fsck complains. And what happens then? If they're lucky, the invalid objects haven't spread far, and the worst thing is that they have to learn to use filter-branch (which itself is punishment enough). But sometimes a significant bit of history has been built on top, and it's awkward or impossible to rewrite it. That puts the burden on whoever is running the non-buggy implementation that wants to reject the objects. Do they accept these broken objects? If so, what do they do to mitigate the wrong answers that Git will return? I'm much more in favor of keeping that data outside the object-hash computation, and caching the pre-computed results as necessary. Those cache can disagree with the objects, of course, but the cost to dropping and re-building them is much lower than a history rewrite. I'm speaking primarily to the generation-number thing, where I really don't think there's any benefit to embedding it in the object beyond the obvious "well, it has to go _somewhere_, and this saves us implementing a local cache layer". I haven't thought hard enough on the multiple-hash thing to know if there's some other benefit to having it inside the objects. -Peff ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan
  2017-09-08  2:40 ` Junio C Hamano
  2017-09-08  3:34 ` Jeff King
@ 2017-09-11 18:59 ` Brandon Williams
  2017-09-13 12:05 ` Johannes Schindelin
  1 sibling, 1 reply; 49+ messages in thread
From: Brandon Williams @ 2017-09-11 18:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Jonathan Nieder, Linus Torvalds, Git Mailing List, Stefan Beller,
	jonathantanmy, Jeff King, David Lang, brian m. carlson

On 09/08, Junio C Hamano wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
> > One thing I still do not know how I feel about after re-reading the
> > thread, and I didn't find the above doc, is Linus's suggestion to
> > use the objects themselves as NewHash-to-SHA-1 mapper [*1*].
> > ...
> > [Reference]
> >
> > *1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@mail.gmail.com>
>
> I think this falls into the same category as the often-talked-about
> addition of the "generation number" field. It is very tempting to
> add these "mechanically derivable but expensive to compute" pieces
> of information to the sha3-content while converting from
> sha1-content and creating anew.

We didn't discuss that in the doc since this particular transition plan
we made uses an external NewHash-to-SHA1 map instead of an internal one
because we believe that at some point we would be able to drop
compatibility with SHA1. Now I suspect that won't happen for a long
time, but I think it would be preferable to carrying the SHA1 luggage
indefinitely. At some point, then, we would be able to stop hashing
objects twice (once with SHA1 and once with NewHash) instead of always
requiring that we hash them with each hash function which was used
historically.
> > Because the "sha1-name" or the "generation number" can mechanically > be computed, as long as everybody agrees to _always_ place them in > the sha3-content, the same sha1-content will be converted into > exactly the same sha3-content without ambiguity, and converting them > back to sha1-content while pushing to an older repository will > correctly produce the original sha1-content, as it would just be the > matter of simply stripping these extra pieces of information. > > The reason why I still feel a bit uneasy about adding these things > (aside from the fact that sha1-name thing will be a baggage we would > need to carry forever even after we completely wean ourselves off of > the old hash) is because I am not sure what we should do when we > encounter sha3-content in the wild that has these things _wrong_. > An object that exists today in the SHA-1 world is fetched into the > new repository and converted to SHA-3 contents, and Linus's extra > "original SHA-1 name" field is added to the object's header while > recording the SHA-3 content. But for whatever reason, the original > SHA-1 name is recorded incorrectly in the resulting SHA-3 object. This wasn't one of the issues that I thought of but it just makes the argument against adding sha1's to the sha3 content stronger. > > The same thing could happen if we decide to bake "generation number" > in the SHA-3 commit objects. One possible definition would be that > a root commit will have gen #0; a commit with 1 or more parents will > get max(parents' gen numbers) + 1 as its gen number. But somebody > may botch the counting and records sum(parents' gen numbers) as its > gen number. > > In these cases, not just the SHA3-content but also the resulting > SHA-3 object name would be different from the name of the object > that would have recorded the same contents correctly. 
So converting > back to SHA-1 world from these botched SHA-3 contents may produce > the original contents, but we may end up with multiple > "plausible-looking" sets of SHA-3 objects that (claim to) correspond > to a single SHA-1 object, only one of which is a valid one. > > Our "git fsck" already treats certain brokenness (like a tree whose > entry has mode that is 0-padded to the left) as broken but still > tolerates them. I am not sure if it is sufficient to diagnose and > declare broken and invalid when we see sha3-content that records > these "mechanically derivable but expensive to compute" pieces of > information incorrectly. > > I am leaning towards saying "yes, catching in fsck is enough" and > suggesting to add generation number to sha3-content of the commit > objects, and to add even the "original sha1 name" thing if we find > good use of it. But I cannot shake this nagging feeling off that I > am missing some huge problems that adding these fields would cause, > opening ourselves to more classes of broken objects. > > Thoughts? > > -- Brandon Williams ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-11 18:59 ` Brandon Williams @ 2017-09-13 12:05 ` Johannes Schindelin 2017-09-13 13:43 ` demerphq 2017-09-13 16:30 ` Jonathan Nieder 0 siblings, 2 replies; 49+ messages in thread From: Johannes Schindelin @ 2017-09-13 12:05 UTC (permalink / raw) To: Brandon Williams Cc: Junio C Hamano, Jonathan Nieder, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Hi Brandon, On Mon, 11 Sep 2017, Brandon Williams wrote: > On 09/08, Junio C Hamano wrote: > > Junio C Hamano <gitster@pobox.com> writes: > > > > > One thing I still do not know how I feel about after re-reading the > > > thread, and I didn't find the above doc, is Linus's suggestion to > > > use the objects themselves as NewHash-to-SHA-1 mapper [*1*]. > > > ... > > > [Reference] > > > > > > *1* <CA+55aFxj7Vtwac64RfAz_u=U4tob4Xg+2pDBDFNpJdmgaTCmxA@mail.gmail.com> > > > > I think this falls into the same category as the often-talked-about > > addition of the "generation number" field. It is very tempting to add > > these "mechanically derivable but expensive to compute" pieces of > > information to the sha3-content while converting from sha1-content and > > creating anew. > > We didn't discuss that in the doc since this particular transition plan > we made uses an external NewHash-to-SHA1 map instead of an internal one > because we believe that at some point we would be able to drop > compatibility with SHA1. Is there even a question about that? I mean, why would *any* project that switches entirely to SHA-256 want to carry the SHA-1 baggage around? So even if the code to generate a bidirectional old <-> new hash mapping might be with us forever, it *definitely* should be optional ("optional" at least as in "config setting"), allowing developers who only work with new-hash repositories to save the time and electrons. 
> Now I suspect that won't happen for a long time, but I think it would be > preferable to carrying the SHA1 luggage indefinitely. It should be possible to push back the SHA-1 genie into a small gin bottle inside Git's source code, so to say, i.e. encapsulate it to the point where it is a compile-time option, in addition to a runtime option. Of course, that's only unless the SHA-1 calculation is made mandatory as suggested above. I really shudder at the idea of SHA-1 being required forever. We ignored advice in 2005 against making ourselves too dependent on SHA-1, and I would hope that we would learn from this. > At some point, then, we would be able to stop hashing objects twice > (once with SHA1 and once with NewHash) instead of always requiring that > we hash them with each hash function which was used historically. Yes, please. > > Because the "sha1-name" or the "generation number" can mechanically > > be computed, ... as long as a shallow clone you do not have, of course... > > as long as everybody agrees to _always_ place them in the > > sha3-content, the same sha1-content will be converted into exactly the > > same sha3-content without ambiguity, and converting them back to > > sha1-content while pushing to an older repository will correctly > > produce the original sha1-content, as it would just be the matter of > > simply stripping these extra pieces of information. ... or Git would simply handle the absence of the generation number header gracefully, so that sha1-content == sha3-content... > > The same thing could happen if we decide to bake "generation number" > > in the SHA-3 commit objects. One possible definition would be that a > > root commit will have gen #0; a commit with 1 or more parents will get > > max(parents' gen numbers) + 1 as its gen number. But somebody may > > botch the counting and record sum(parents' gen numbers) as its gen > > number.
> > > > In these cases, not just the SHA3-content but also the resulting SHA-3 > > object name would be different from the name of the object that would > > have recorded the same contents correctly. So converting back to > > SHA-1 world from these botched SHA-3 contents may produce the original > > contents, but we may end up with multiple "plausible-looking" sets of > > SHA-3 objects that (claim to) correspond to a single SHA-1 object, > > only one of which is a valid one. > > > > Our "git fsck" already treats certain brokenness (like a tree whose > > entry has mode that is 0-padded to the left) as broken but still > > tolerates them. I am not sure if it is sufficient to diagnose and > > declare broken and invalid when we see sha3-content that records > > these "mechanically derivable but expensive to compute" pieces of > > information incorrectly. > > > > I am leaning towards saying "yes, catching in fsck is enough" and > > suggesting to add generation number to sha3-content of the commit > > objects, and to add even the "original sha1 name" thing if we find > > good use of it. But I cannot shake this nagging feeling off that I > > am missing some huge problems that adding these fields would cause, > > opening ourselves to more classes of broken objects. > > > > Thoughts? Seeing as current Git versions would always ignore the generation number (and therefore work perfectly even with erroneous baked-in generation numbers), and seeing as it would be easy to add a config option to force Git to ignore the embedded generation numbers, I would consider `fsck` catching those problems the best idea. It seems that every major Git hoster already has some sort of fsck on the fly for newly-pushed objects, so that would be another "line of defense".
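The generation-number rule Junio quotes (gen #0 for a root commit, max of the parents' numbers plus one otherwise) and the fsck-style check discussed here can be sketched in a few lines. This is a toy model using a plain dict of commit ids to parent ids, not Git's actual commit format:

```python
# Toy model of the generation-number rule discussed above.  `parents`
# maps commit ids to lists of parent ids; NOT Git's real object format.

def generation_numbers(parents):
    """Correct rule: root commits get 0, others max(parents' gens) + 1."""
    gen = {}
    def compute(c):
        if c not in gen:
            gen[c] = max((compute(p) for p in parents[c]), default=-1) + 1
        return gen[c]
    for c in parents:
        compute(c)
    return gen

def botched_numbers(parents):
    """The botched rule Junio warns about: sum() instead of max() + 1."""
    gen = {}
    def compute(c):
        if c not in gen:
            gen[c] = sum(compute(p) for p in parents[c])
        return gen[c]
    for c in parents:
        compute(c)
    return gen

def fsck_generations(parents, recorded):
    """fsck-style check: list commits whose recorded gen number is wrong."""
    good = generation_numbers(parents)
    return sorted(c for c in parents if recorded[c] != good[c])

# A small history: one root, two children, and a merge.
history = {"root": [], "a": ["root"], "b": ["root"], "merge": ["a", "b"]}
gens = generation_numbers(history)
assert gens == {"root": 0, "a": 1, "b": 1, "merge": 2}

# Botched numbers baked into converted objects are cheap to flag:
assert fsck_generations(history, gens) == []
assert fsck_generations(history, botched_numbers(history)) == ["a", "b", "merge"]
```

Since the correct value is recomputable from the parents alone, such a check fits naturally into an fsck pass on newly-pushed objects.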
Taking a step back, though, it may be a good idea to leave the generation number business for later, as much fun as it is to get sidetracked and focus on relatively trivial stuff instead of the far more difficult and complex task of getting the transition plan to a new hash ironed out. For example, I am still in favor of SHA-256 over SHA3-256, after learning some background details from in-house cryptographers: it provides essentially the same level of security, according to my sources, while hardware support seems to be coming to SHA-256 a lot sooner than to SHA3-256. Which hash algorithm to choose is a tough question to answer, and discussing generation numbers will sadly not help us answer it any quicker. Ciao, Dscho ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 12:05 ` Johannes Schindelin @ 2017-09-13 13:43 ` demerphq 2017-09-13 22:51 ` Jonathan Nieder 2017-09-13 23:30 ` Linus Torvalds 2017-09-13 16:30 ` Jonathan Nieder 1 sibling, 2 replies; 49+ messages in thread From: demerphq @ 2017-09-13 13:43 UTC (permalink / raw) To: Johannes Schindelin Cc: Brandon Williams, Junio C Hamano, Jonathan Nieder, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson On 13 September 2017 at 14:05, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > For example, I am still in favor of SHA-256 over SHA3-256, after learning > some background details from in-house cryptographers: it provides > essentially the same level of security, according to my sources, while > hardware support seems to be coming to SHA-256 a lot sooner than to > SHA3-256. FWIW, and I know it is not worth much, as far as I can tell there is at least some security/math basis to prefer SHA3-256 to SHA-256. The SHA1 and SHA-256 hash functions (iirc along with their older cousins MD5 and MD2) all have a common design feature where they mix a relatively large block into a much smaller state *each block*. So for instance SHA-256 mixes a 512 bit block into a 256 bit state with a 2:1 "leverage" between the block being read and the state. In SHA1 this was worse, mixing a 512 bit block into a 160 bit state, closer to 3:1 leverage. SHA3, however, uses a completely different design where it mixes a 1088 bit block into a 1600 bit state, for a leverage of 2:3, and the excess is *preserved between each block*. Assuming everything else is equal between SHA-256 and SHA3, this difference alone would seem to justify choosing SHA3 over SHA-256. We know that there MUST be collisions when compressing a 512 bit block into a 256 bit space; however, one cannot say the same about mixing 1088 bits into a 1600 bit state.
The excess state, which is not directly modified by the input block, makes a big difference when reading the next block. Of course, in both cases we end up compressing the entire source document down to the same number of bits; however, SHA3 does that *once*, in finalization only, whereas SHA-256 does it *every* block read. So it seems to me that the opportunity for collisions is *much* higher in SHA-256 than it is in SHA3-256. (Even if they should be vanishingly rare regardless.) For this reason, if I had a vote, I would definitely vote for SHA3-256, or even for SHA3-512. The latter has an impressive 1:2 leverage between block and state, and much better theoretical security levels. cheers, Yves Note: I am not a cryptographer, although I am probably pretty well informed as far as hobby hash-function enthusiasts go. -- perl -Mre=debug -e "/just|another|perl|hacker/" ^ permalink raw reply [flat|nested] 49+ messages in thread
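For what it is worth, the block and digest sizes Yves quotes can be sanity-checked against Python's hashlib (assuming a Python 3.6+ build with SHA-3 support); the sponge's full 1600-bit state is not exposed, but the rate (bits absorbed per block) is reported as block_size:

```python
# Checking the block/state sizes quoted above with Python's hashlib.
import hashlib

sha1, sha256, sha3 = hashlib.sha1(), hashlib.sha256(), hashlib.sha3_256()

assert sha1.block_size * 8 == 512 and sha1.digest_size * 8 == 160      # ~3:1
assert sha256.block_size * 8 == 512 and sha256.digest_size * 8 == 256  # 2:1
assert sha3.block_size * 8 == 1088                                     # the "rate"

# Keccak's state is rate + capacity = 1600 bits, so the hidden capacity
# carried from block to block (never output directly) is:
capacity = 1600 - sha3.block_size * 8
assert capacity == 512
```

The Merkle-Damgård designs (SHA-1, SHA-256) compress each 512-bit block straight into the digest-sized state; the sponge keeps those 512 extra capacity bits inside the permutation until finalization, which is the "excess preserved between each block" point above.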
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 13:43 ` demerphq @ 2017-09-13 22:51 ` Jonathan Nieder 2017-09-14 18:26 ` Johannes Schindelin 2017-09-13 23:30 ` Linus Torvalds 1 sibling, 1 reply; 49+ messages in thread From: Jonathan Nieder @ 2017-09-13 22:51 UTC (permalink / raw) To: demerphq Cc: Johannes Schindelin, Brandon Williams, Junio C Hamano, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Hi, Yves wrote: > On 13 September 2017 at 14:05, Johannes Schindelin >> For example, I am still in favor of SHA-256 over SHA3-256, after learning >> some background details from in-house cryptographers: it provides >> essentially the same level of security, according to my sources, while >> hardware support seems to be coming to SHA-256 a lot sooner than to >> SHA3-256. > > FWIW, and I know it is not worth much, as far as I can tell there is > at least some security/math basis to prefer SHA3-256 to SHA-256. Thanks for spelling this out. From my (very cursory) understanding of the math, what you are saying makes sense. I think there were some hints of this topic on-list before, but it had not been made this explicit. Here's my summary of the discussion of other aspects of the choice of hash functions so far: My understanding from asking cryptographers matches what Dscho said. One of the lessons of the history of hash functions is that some kinds of attempts to improve the security margin of a hash function do not help as much as expected once a function is broken. In practice, what we are looking for is - is the algorithm broken, or likely to be broken soon - do the algorithm's guarantees match the application - is the algorithm fast enough - are high-quality implementations widely available On that first question, every well-informed person I have asked has assured me that SHA-256, SHA-512, SHA-512/256, SHA-256x16, SHA3-256, K12, BLAKE2bp-256, etc. are equally likely to be broken in the next 10 years.
The main difference for the longevity question is that some of those algorithms have had more scrutiny than others, but all have had significant scrutiny. See [1] and the surrounding thread for more discussion on that. On the second question, SHA-256 is vulnerable to length extension attacks, which means it would not be usable as a MAC directly (one would have to use the HMAC construction instead). Fortunately, Git doesn't use its hash function that way. On the third question, SHA-256 is one of the slower ones, even with hardware acceleration, but it should be fast enough. On the fourth question, SHA-256 shines. See [2]. That is where I had thought the conversation ended up. For what it's worth, I'm pretty happy both with the level of scrutiny we've given to this question and SHA-256 as an answer. Luckily, even if at the last minute we learn something that changes the choice of hash function, that would not significantly affect the transition plan, so we have a chance to learn more. See also [3]. Thanks, Jonathan [1] https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/#t [2] https://public-inbox.org/git/xmqq37azy7ru.fsf@gitster.mtv.corp.google.com/ [3] https://www.imperialviolet.org/2017/05/31/skipsha3.html, https://news.ycombinator.com/item?id=14453622 ^ permalink raw reply [flat|nested] 49+ messages in thread
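A minimal sketch of the length-extension point: with a Merkle-Damgård hash such as SHA-256, a naive hash(key + message) MAC can be extended by an attacker who knows only the digest and the key length, which is why the HMAC construction exists. As noted above, Git only uses the hash to name content, so this is purely illustrative (the key and message below are made-up values):

```python
# Why a MAC built on SHA-256 needs HMAC rather than hash(key + msg).
import hashlib
import hmac

key = b"secret-key"      # hypothetical values for illustration
msg = b"some content"

# Naive keyed hash: the digest equals SHA-256's internal state after
# processing key+msg, so an attacker can keep absorbing blocks and
# forge a valid tag for msg + padding + suffix (length extension).
naive = hashlib.sha256(key + msg).hexdigest()

# HMAC runs the hash twice with derived inner/outer pads, so the
# outer digest reveals no extendable internal state.
mac = hmac.new(key, msg, hashlib.sha256).hexdigest()

assert naive != mac
# Verification should use a constant-time comparison:
assert hmac.compare_digest(mac, hmac.new(key, msg, hashlib.sha256).hexdigest())
```

SHA3, being a sponge, does not leak its full internal state in the digest and is not length-extendable, which is one of the "guarantees match the application" differences being weighed here.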
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 22:51 ` Jonathan Nieder @ 2017-09-14 18:26 ` Johannes Schindelin 2017-09-14 18:40 ` Jonathan Nieder 0 siblings, 1 reply; 49+ messages in thread From: Johannes Schindelin @ 2017-09-14 18:26 UTC (permalink / raw) To: Jonathan Nieder Cc: demerphq, Brandon Williams, Junio C Hamano, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Hi Jonathan, On Wed, 13 Sep 2017, Jonathan Nieder wrote: > [3] https://www.imperialviolet.org/2017/05/31/skipsha3.html, I had read this shortly after it was published, and had missed the updates. One link in particular caught my eye: https://eprint.iacr.org/2012/476 Essentially, the authors demonstrate that using SIMD technology can speed up computation by a factor of 2 for longer messages (2kB being considered "long" already). It is a little bit unclear to me from a cursory look whether their fast algorithm computes SHA-256, or something similar. As the author of that paper is also known to have contributed to OpenSSL, I had a quick look and it would appear that a comment in crypto/sha/asm/sha256-mb-x86_64.pl speaking about "lanes" suggests that OpenSSL uses the ideas from the paper, even if b783858654 (x86_64 assembly pack: add multi-block AES-NI, SHA1 and SHA256., 2013-10-03) does not talk about the paper specifically. The numbers shown in https://github.com/openssl/openssl/blob/master/crypto/sha/asm/keccak1600-x86_64.pl#L28 and in https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha256-mb-x86_64.pl#L17 are sufficiently satisfying. Ciao, Dscho ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 18:26 ` Johannes Schindelin @ 2017-09-14 18:40 ` Jonathan Nieder 2017-09-14 22:09 ` Johannes Schindelin 0 siblings, 1 reply; 49+ messages in thread From: Jonathan Nieder @ 2017-09-14 18:40 UTC (permalink / raw) To: Johannes Schindelin Cc: demerphq, Brandon Williams, Junio C Hamano, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Hi, Johannes Schindelin wrote: > On Wed, 13 Sep 2017, Jonathan Nieder wrote: >> [3] https://www.imperialviolet.org/2017/05/31/skipsha3.html, > > I had read this shortly after it was published, and had missed the updates. > One link in particular caught my eye: > > https://eprint.iacr.org/2012/476 > > Essentially, the authors demonstrate that using SIMD technology can speed > up computation by a factor of 2 for longer messages (2kB being considered > "long" already). It is a little bit unclear to me from a cursory look > whether their fast algorithm computes SHA-256, or something similar. The latter: that paper is about a variant on SHA-256 called SHA-256x4 (or SHA-256x16 to take advantage of newer instructions). It's a different hash function. This is what I was alluding to at [1]. > As the author of that paper is also known to have contributed to OpenSSL, > I had a quick look and it would appear that a comment in > crypto/sha/asm/sha256-mb-x86_64.pl speaking about "lanes" suggests that > OpenSSL uses the ideas from the paper, even if b783858654 (x86_64 assembly > pack: add multi-block AES-NI, SHA1 and SHA256., 2013-10-03) does not talk > about the paper specifically. > > The numbers shown in > https://github.com/openssl/openssl/blob/master/crypto/sha/asm/keccak1600-x86_64.pl#L28 > and in > https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha256-mb-x86_64.pl#L17 > > are sufficiently satisfying. This one is about actual SHA-256, but computing the hash of multiple streams in a single function call.
The paper to read is [2]. We could probably take advantage of it for e.g. bulk-checkin and index-pack. Most other code paths that compute hashes wouldn't be able to benefit from it. Thanks, Jonathan [1] https://public-inbox.org/git/20170616212414.GC133952@aiede.mtv.corp.google.com/ [2] https://eprint.iacr.org/2012/371 ^ permalink raw reply [flat|nested] 49+ messages in thread
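The workload shape that multi-buffer SHA-256 accelerates can be modeled as follows. Python's hashlib has no multi-lane API, so this sketch only shows the interface: many independent streams hashed in one call, which is what index-pack or bulk-checkin could feed to a SIMD implementation; a real multi-buffer routine would advance all lanes one block at a time instead of looping over them:

```python
# Model of a multi-buffer hash interface: several *independent*
# streams hashed "at once".  hashlib has no SIMD lanes, so this is
# only the interface shape, not the speedup.
import hashlib

def sha256_multi(streams):
    lanes = [hashlib.sha256() for _ in streams]  # one lane per stream
    for lane, data in zip(lanes, streams):
        lane.update(data)                        # SIMD would interleave these
    return [lane.hexdigest() for lane in lanes]

objects = [b"blob one", b"blob two", b"blob three", b"blob four"]
digests = sha256_multi(objects)

# Results are identical to hashing each stream separately; only
# throughput differs in a real multi-buffer implementation.
assert digests == [hashlib.sha256(o).hexdigest() for o in objects]
```

This also shows why most single-object code paths cannot benefit: with only one stream, there is nothing to fill the other lanes with.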
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 18:40 ` Jonathan Nieder @ 2017-09-14 22:09 ` Johannes Schindelin 0 siblings, 0 replies; 49+ messages in thread From: Johannes Schindelin @ 2017-09-14 22:09 UTC (permalink / raw) To: Jonathan Nieder Cc: demerphq, Brandon Williams, Junio C Hamano, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Hi Jonathan, On Thu, 14 Sep 2017, Jonathan Nieder wrote: > Johannes Schindelin wrote: > > On Wed, 13 Sep 2017, Jonathan Nieder wrote: > > >> [3] https://www.imperialviolet.org/2017/05/31/skipsha3.html, > > > > I had read this shortly after it was published, and had missed the updates. > > One link in particular caught my eye: > > > > https://eprint.iacr.org/2012/476 > > > > Essentially, the authors demonstrate that using SIMD technology can speed > > up computation by a factor of 2 for longer messages (2kB being considered > > "long" already). It is a little bit unclear to me from a cursory look > > whether their fast algorithm computes SHA-256, or something similar. > > The latter: that paper is about a variant on SHA-256 called SHA-256x4 > (or SHA-256x16 to take advantage of newer instructions). It's a > different hash function. This is what I was alluding to at [1]. Thanks for the explanation! > > As the author of that paper is also known to have contributed to OpenSSL, > > I had a quick look and it would appear that a comment in > > crypto/sha/asm/sha256-mb-x86_64.pl speaking about "lanes" suggests that > > OpenSSL uses the ideas from the paper, even if b783858654 (x86_64 assembly > > pack: add multi-block AES-NI, SHA1 and SHA256., 2013-10-03) does not talk > > about the paper specifically. > > > > The numbers shown in > > https://github.com/openssl/openssl/blob/master/crypto/sha/asm/keccak1600-x86_64.pl#L28 > > and in > > https://github.com/openssl/openssl/blob/master/crypto/sha/asm/sha256-mb-x86_64.pl#L17 > > > > are sufficiently satisfying. 
> > This one is about actual SHA-256, but computing the hash of multiple > streams in a single function call. The paper to read is [2]. We could > probably take advantage of it for e.g. bulk-checkin and index-pack. > Most other code paths that compute hashes wouldn't be able to benefit > from it. Again, thanks for the explanation. Ciao, Dscho > [1] https://public-inbox.org/git/20170616212414.GC133952@aiede.mtv.corp.google.com/ > [2] https://eprint.iacr.org/2012/371 > ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 13:43 ` demerphq 2017-09-13 22:51 ` Jonathan Nieder @ 2017-09-13 23:30 ` Linus Torvalds 2017-09-14 18:45 ` Johannes Schindelin 1 sibling, 1 reply; 49+ messages in thread From: Linus Torvalds @ 2017-09-13 23:30 UTC (permalink / raw) To: demerphq Cc: Johannes Schindelin, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote: > > SHA3 however uses a completely different design where it mixes a 1088 > bit block into a 1600 bit state, for a leverage of 2:3, and the excess > is *preserved between each block*. Yes. And considering that the SHA1 attack was actually predicated on the fact that each block was independent (no extra state between), I do think SHA3 is a better model. So I'd rather see SHA3-256 than SHA256. Linus ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 23:30 ` Linus Torvalds @ 2017-09-14 18:45 ` Johannes Schindelin 2017-09-18 12:17 ` Gilles Van Assche 2017-09-26 17:05 ` Jason Cooper 0 siblings, 2 replies; 49+ messages in thread From: Johannes Schindelin @ 2017-09-14 18:45 UTC (permalink / raw) To: Linus Torvalds Cc: demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi Linus, On Wed, 13 Sep 2017, Linus Torvalds wrote: > On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote: > > > > SHA3 however uses a completely different design where it mixes a 1088 > > bit block into a 1600 bit state, for a leverage of 2:3, and the excess > > is *preserved between each block*. > > Yes. And considering that the SHA1 attack was actually predicated on > the fact that each block was independent (no extra state between), I > do think SHA3 is a better model. > > So I'd rather see SHA3-256 than SHA256. SHA-256 got much more cryptanalysis than SHA3-256, and apart from the length-extension problem that does not affect Git's usage, there are no known weaknesses so far. It would seem that the experts I talked to were much more concerned about that amount of attention than the particulars of the algorithm. My impression was that the new features of SHA3 were less studied than the well-known features of SHA2, and that the newness of SHA3 is not necessarily a good thing. You will have to deal with the fact that I trust the crypto experts' opinion on this a lot more than your opinion. Sure, you learned from the fact that you were warned about theoretical attacks on SHA-1 as early as 2005 and still chose to hard-wire it into Git. And yet, you are still no more of a cryptography expert than I am. Ciao, Dscho ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 18:45 ` Johannes Schindelin @ 2017-09-18 12:17 ` Gilles Van Assche 2017-09-18 22:16 ` Johannes Schindelin 2017-09-18 22:25 ` Jonathan Nieder 2017-09-26 17:05 ` Jason Cooper 1 sibling, 2 replies; 49+ messages in thread From: Gilles Van Assche @ 2017-09-18 12:17 UTC (permalink / raw) To: Johannes Schindelin Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Hi Johannes, > SHA-256 got much more cryptanalysis than SHA3-256 […]. I do not think this is true. Keccak/SHA-3 actually got (and is still getting) a lot of cryptanalysis, with papers published at renowned crypto conferences [1]. Keccak/SHA-3 is recognized to have a significant safety margin. E.g., one can cut the number of rounds in half (as in Keyak or KangarooTwelve) and still get a very strong function. I don't think we could say the same for SHA-256 or SHA-512… Kind regards, Gilles, for the Keccak team [1] https://keccak.team/third_party.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-18 12:17 ` Gilles Van Assche @ 2017-09-18 22:16 ` Johannes Schindelin 2017-09-19 16:45 ` Gilles Van Assche 2017-09-18 22:25 ` Jonathan Nieder 1 sibling, 1 reply; 49+ messages in thread From: Johannes Schindelin @ 2017-09-18 22:16 UTC (permalink / raw) To: Gilles Van Assche Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Hi Gilles, On Mon, 18 Sep 2017, Gilles Van Assche wrote: > > SHA-256 got much more cryptanalysis than SHA3-256 […]. > > I do not think this is true. Please read what I said again: SHA-256 got much more cryptanalysis than SHA3-256. I never said that SHA3-256 got little cryptanalysis. Personally, I think that SHA3-256 got a ton more cryptanalysis than SHA-1, and that SHA-256 *still* got more cryptanalysis. But my opinion does not count, really. However, the two experts I pestered with questions over questions left me with that strong impression, and their opinion does count. > Keccak/SHA-3 actually got (and is still getting) a lot of cryptanalysis, > with papers published at renowned crypto conferences [1]. > > Keccak/SHA-3 is recognized to have a significant safety margin. E.g., > one can cut the number of rounds in half (as in Keyak or KangarooTwelve) > and still get a very strong function. I don't think we could say the > same for SHA-256 or SHA-512… Again, I do not want to criticize SHA3/Keccak. Personally, I have a lot of respect for Keccak. I also have a lot of respect for everybody who scrutinized the SHA2 family of algorithms. 
I also respect the fact that there are more implementations of SHA-256, and since everybody seems to demand SHA-256 checksums instead of SHA-1 or MD5 for downloads, bugs in those implementations are probably discovered relatively quickly, and I also cannot ignore the prospect of hardware support for SHA-256. In any case, having SHA3 as a fallback in case SHA-256 gets broken seems like a very good safety net to me. Ciao, Johannes ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-18 22:16 ` Johannes Schindelin @ 2017-09-19 16:45 ` Gilles Van Assche 2017-09-29 13:17 ` Johannes Schindelin 0 siblings, 1 reply; 49+ messages in thread From: Gilles Van Assche @ 2017-09-19 16:45 UTC (permalink / raw) To: Johannes Schindelin Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Hi Johannes, Thanks for your feedback. On 19/09/17 00:16, Johannes Schindelin wrote: >>> SHA-256 got much more cryptanalysis than SHA3-256 […]. >> >> I do not think this is true. > > Please read what I said again: SHA-256 got much more cryptanalysis > than SHA3-256. Indeed. What I meant is that SHA3-256 got at least as much cryptanalysis as SHA-256. :-) > I never said that SHA3-256 got little cryptanalysis. Personally, I > think that SHA3-256 got a ton more cryptanalysis than SHA-1, and that > SHA-256 *still* got more cryptanalysis. But my opinion does not count, > really. However, the two experts I pestered with questions over > questions left me with that strong impression, and their opinion does > count. OK, I respect your opinion and that of your two experts. Yet, the "much more" part of your statement, in particular, is something that may require a bit more explanation. Kind regards, Gilles ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-19 16:45 ` Gilles Van Assche @ 2017-09-29 13:17 ` Johannes Schindelin 2017-09-29 14:54 ` Joan Daemen 0 siblings, 1 reply; 49+ messages in thread From: Johannes Schindelin @ 2017-09-29 13:17 UTC (permalink / raw) To: Gilles Van Assche Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Hi Gilles, On Tue, 19 Sep 2017, Gilles Van Assche wrote: > On 19/09/17 00:16, Johannes Schindelin wrote: > >>> SHA-256 got much more cryptanalysis than SHA3-256 […]. > >> > >> I do not think this is true. > > > > Please read what I said again: SHA-256 got much more cryptanalysis > > than SHA3-256. > > Indeed. What I meant is that SHA3-256 got at least as much cryptanalysis > as SHA-256. :-) Oh? I got the opposite impression... I got the impression that *everybody* in the field banged on all the SHA-2 candidates because everybody was worried that SHA-1 would be utterly broken soon, and I got the impression that after this SHA-2 competition, people were less worried? Besides, I would expect the difference in age (at *least* 7 years by my humble arithmetic skills) to make a difference... > > I never said that SHA3-256 got little cryptanalysis. Personally, I > > think that SHA3-256 got a ton more cryptanalysis than SHA-1, and that > > SHA-256 *still* got more cryptanalysis. But my opinion does not count, > > really. However, the two experts I pestered with questions over > > questions left me with that strong impression, and their opinion does > > count. > > OK, I respect your opinion and that of your two experts. Yet, the "much > more" part of your statement, in particular, is something that may > require a bit more explanation. I would also like to point out the ubiquitousness of SHA-256. 
I have been asked to provide SHA-256 checksums for the downloads of Git for Windows, but not SHA3-256... And this is a practically-relevant thing: the more users of an algorithm there are, the more high-quality implementations you can choose from. And this becomes relevant, say, when you have to switch implementations due to license changes (*cough, cough looking in OpenSSL's direction*). Or when you have to support the biggest Git repository on this planet and have to eke out 5-10% more performance using the latest hardware. All of a sudden, your consideration cannot only be "security of the algorithm" any longer. Having said that, I am *really* happy to have SHA3-256 as a valid fallback option in case SHA-256 should be broken. Ciao, Johannes ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-29 13:17 ` Johannes Schindelin @ 2017-09-29 14:54 ` Joan Daemen 2017-09-29 22:33 ` Johannes Schindelin 0 siblings, 1 reply; 49+ messages in thread From: Joan Daemen @ 2017-09-29 14:54 UTC (permalink / raw) To: Johannes Schindelin, Gilles Van Assche Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Dear Johannes, if ever there was a SHA-2 competition, it must have been held inside the NSA :-) But maybe you are confusing it with the SHA-3 competition. In any case, when considering SHA-2 vs SHA-3 for usage in git, you may have a look at arguments we give in the following blogpost: https://keccak.team/2017/open_source_crypto.html Kind regards, Joan Daemen On 29/09/17 15:17, Johannes Schindelin wrote: > Hi Gilles, > > On Tue, 19 Sep 2017, Gilles Van Assche wrote: > >> On 19/09/17 00:16, Johannes Schindelin wrote: >>>>> SHA-256 got much more cryptanalysis than SHA3-256 […]. >>>> I do not think this is true. >>> Please read what I said again: SHA-256 got much more cryptanalysis >>> than SHA3-256. >> Indeed. What I meant is that SHA3-256 got at least as much cryptanalysis >> as SHA-256. :-) > Oh? I got the opposite impression... I got the impression that *everybody* > in the field banged on all the SHA-2 candidates because everybody was > worried that SHA-1 would be utterly broken soon, and I got the impression > that after this SHA-2 competition, people were less worried? > > Besides, I would expect the difference in age (at *least* 7 years by > my humble arithmetic skills) to make a difference... > >>> I never said that SHA3-256 got little cryptanalysis. Personally, I >>> think that SHA3-256 got a ton more cryptanalysis than SHA-1, and that >>> SHA-256 *still* got more cryptanalysis. But my opinion does not count, >>> really. 
However, the two experts I pestered with questions over >>> questions left me with that strong impression, and their opinion does >>> count. >> OK, I respect your opinion and that of your two experts. Yet, the "much >> more" part of your statement, in particular, is something that may >> require a bit more explanations. > I would also like to point out the ubiquitousness of SHA-256. I have been > asked to provide SHA-256 checksums for the downloads of Git for Windows, > but not SHA3-256... > > And this is a practically-relevant thing: the more users of an algorithm > there are, the more high-quality implementations you can choose from. And > this becomes relevant, say, when you have to switch implementations due to > license changes (*cough, cough looking in OpenSSL's direction*). Or when > you have to support the biggest Git repository on this planet and have to > eek out 5-10% more performance using the latest hardware. All of a sudden, > your consideration cannot only be "security of the algorithm" any longer. > > Having said that, I am *really* happy to have SHA3-256 as a valid fallback > option in case SHA-256 should be broken. > > Ciao, > Johannes ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-29 14:54 ` Joan Daemen @ 2017-09-29 22:33 ` Johannes Schindelin 2017-09-30 22:02 ` Joan Daemen 0 siblings, 1 reply; 49+ messages in thread From: Johannes Schindelin @ 2017-09-29 22:33 UTC (permalink / raw) To: Joan Daemen Cc: Gilles Van Assche, Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Hi Joan, On Fri, 29 Sep 2017, Joan Daemen wrote: > if ever there was a SHA-2 competition, it must have been held inside NSA:-) Oops. My bad, I indeed got confused about that, as you suggest below (I actually thought of the AES competition, but that was obviously not about SHA-2). Sorry. > But maybe you are confusing with the SHA-3 competition. In any case, > when considering SHA-2 vs SHA-3 for usage in git, you may have a look at > arguments we give in the following blogpost: > > https://keccak.team/2017/open_source_crypto.html Thanks for the pointer! Small nit: the post uses "its" in place of "it's", twice. It does have a good point, of course: the scientific exchange (which you call "open-source" in spirit) makes tons of sense. As far as Git is concerned, we not only care about the source code of the hash algorithm we use, we need to care even more about what you call "executable": ready-to-use, high-quality, well-tested implementations. We carry source code for SHA-1 as part of Git's source code, which was hand-tuned to be as fast as Linus could get it, which was tricky given that the tuning should be general enough to apply to all common Intel CPUs. This hand-crafted code was blown out of the water by OpenSSL's SHA-1 in our tests here at Microsoft, thanks to the fact that OpenSSL does vectorized SHA-1 computation now. To me, this illustrates why it is not good enough to have only a reference implementation available at our fingertips. 
Of course, above-mentioned OpenSSL supports SHA-256 and SHA3-256, too, and at least recent versions vectorize those, too. Also, ARM processors have become a lot more popular, so we'll want to have high-quality implementations of the hash algorithm also for those processors. Likewise, in contrast to 2005, nowadays implementations of Git in languages as obscure as JavaScript are not only theoretical but do exist in practice (https://github.com/creationix/js-git). I had a *very* quick look for libraries providing crypto in JavaScript and immediately found the Stanford JavaScript Crypto Library (https://github.com/bitwiseshiftleft/sjcl/), which seems to offer SHA-256 but not SHA3-256 computation. Back to Intel processors: I read some vague hints about extensions accelerating SHA-256 computation on future Intel processors, but not SHA3-256. It would make sense, of course, that more crypto libraries and more hardware support would be available for SHA-256 than for SHA3-256 given the time since publication: 16 vs 5 years (I am playing it loose here, taking just the year into account, not the exact date, so please treat that merely as a ballpark figure). So from a practical point of view, I wonder what your take is on, say, hardware support for SHA3-256. Do you think this will become a focus soon? Also, what is your take on the question whether SHA-256 is good enough? SHA-1 was broken theoretically already 10 years after it was published (which unfortunately did not prevent us from baking it into Git), after all, while SHA-256 is 16 years old and the only known weakness does not apply to Git's usage? Also, while I have the attention of somebody who knows a heck more about cryptography than Git's top 10 committers combined: how soon do you expect practical SHA-1 attacks that are much worse than what we already have seen? 
I am concerned that if we do not move fast enough to a new hash algorithm, and somebody finds a way in the meantime to craft arbitrary messages given a prefix and an SHA-1, then we have a huge problem on our hands. Ciao, Johannes ^ permalink raw reply [flat|nested] 49+ messages in thread
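For what it's worth, the library-availability gap is narrower in some ecosystems than the sjcl example above suggests: Python's standard hashlib (OpenSSL-backed) has shipped both functions since Python 3.6. A quick sketch comparing the two side by side:

```python
import hashlib

msg = b"The quick brown fox jumps over the lazy dog"

# Both digests are 256 bits wide, but the constructions differ
# completely: SHA-256 is a Merkle-Damgard design, SHA3-256 a
# Keccak sponge.
d_sha2 = hashlib.sha256(msg).hexdigest()
d_sha3 = hashlib.sha3_256(msg).hexdigest()

print("SHA-256 :", d_sha2)
print("SHA3-256:", d_sha3)
assert len(d_sha2) == len(d_sha3) == 64  # 64 hex chars = 256 bits
```

The equal output width is what makes the two candidates interchangeable from the object-format point of view; the design difference is what the whole thread is arguing about.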
* Re: RFC v3: Another proposed hash function transition plan 2017-09-29 22:33 ` Johannes Schindelin @ 2017-09-30 22:02 ` Joan Daemen 2017-10-02 14:26 ` Johannes Schindelin 0 siblings, 1 reply; 49+ messages in thread From: Joan Daemen @ 2017-09-30 22:02 UTC (permalink / raw) To: Johannes Schindelin Cc: Gilles Van Assche, Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Dear Johannes, thanks for your response and taking the effort to express your concerns. Please see below for some feedback. On 30/09/17 00:33, Johannes Schindelin wrote: > Hi Joan, > > On Fri, 29 Sep 2017, Joan Daemen wrote: > >> if ever there was a SHA-2 competition, it must have been held inside >> NSA:-) > Oops. My bad, I indeed got confused about that, as you suggest below (I > actually thought of the AES competition, but that was obviously not > about > SHA-2). Sorry. > >> But maybe you are confusing with the SHA-3 competition. In any case, >> when considering SHA-2 vs SHA-3 for usage in git, you may have a look >> at >> arguments we give in the following blogpost: >> >> https://keccak.team/2017/open_source_crypto.html > Thanks for the pointer! > > Small nit: the post uses "its" in place of "it's", twice. Thanks, we'll correct that. > It does have a good point, of course: the scientific exchange (which > you > call "open-source" in spirit) makes tons of sense. > > As far as Git is concerned, we not only care about the source code of > the > hash algorithm we use, we need to care even more about what you call > "executable": ready-to-use, high quality, well-tested implementations. > > We carry source code for SHA-1 as part of Git's source code, which was > hand-tuned to be as fast as Linus could get it, which was tricky given > that the tuning should be general enough to apply to all common intel > CPUs. 
> > This hand-crafted code was blown out of the water by OpenSSL's SHA-1 in > our tests here at Microsoft, thanks to the fact that OpenSSL does > vectorized SHA-1 computation now. > > To me, this illustrates why it is not good enough to have only a > reference > implementation available at our finger tips. Of course, above-mentioned > OpenSSL supports SHA-256 and SHA3-256, too, and at least recent > versions > vectorize those, too. There is a lot of high-quality optimized code for all SHA-3 functions and many CPUs in the Keccak code package https://github.com/gvanas/KeccakCodePackage but also OpenSSL contains some good SHA-3 code and then there are all those related to Ethereum. By the way, you speak about SHA3-256, but the right choice would be to use SHAKE128. Well, what exactly the right choice is depends on what you want. If you want to have a function in the SHA3 standard (FIPS 202), it is SHAKE128. You can boost performance on high-end CPUs by adopting ParallelHash from NIST SP 800-185, still a NIST standard. You can multiply that performance again by a factor of 2 by adopting KangarooTwelve. This is our (Keccak team) proposal for a parallelizable Keccak-based hash function that has a safety margin comparable to that of the SHA-2 functions. See https://keccak.team/kangarootwelve.html May I also suggest you read https://keccak.team/2017/is_sha3_slow.html > Also, ARM processors have become a lot more popular, so we'll want to > have > high-quality implementations of the hash algorithm also for those > processors. > > Likewise, in contrast to 2005, nowadays implementations of Git in > languages as obscure as JavaScript are not only theoretical but do > exist > in practice (https://github.com/creationix/js-git). I had a *very* > quick > look for libraries providing crypto in JavaScript and immediately found > the Stanford JavaScript Crypto Library > (https://github.com/bitwiseshiftleft/sjcl/) which seems to offer > SHA-256 > but not SHA3-256 computation. 
> > Back to Intel processors: I read some vague hints about extensions > accelerating SHA-256 computation on future Intel processors, but not > SHA3-256. > > It would make sense, of course, that more crypto libraries and more > hardware support would be available for SHA-256 than for SHA3-256 given > the time since publication: 16 vs 5 years (I am playing it loose here, > taking just the year into account, not the exact date, so please treat > that merely as a ballpark figure). > > So from a practical point of view, I wonder what your take is on, say, > hardware support for SHA3-256. Do you think this will become a focus > soon? I think this is a chicken-and-egg problem. In any case, hardware support for SHA3-256 will also work for the other SHA3 and SHAKE functions as they all use the same underlying primitive: the Keccak-f permutation. This is not the case for SHA2 because SHA224 and SHA256 use a different compression function than SHA384, SHA512, SHA512/224 and SHA512/256. > Also, what is your take on the question whether SHA-256 is good enough? > SHA-1 was broken theoretically already 10 years after it was published > (which unfortunately did not prevent us from baking it into Git), after > all, while SHA-256 is 16 years old and the only known weakness does not > apply to Git's usage? SHA-256 is more conservative than SHA-1 and I don't expect it to be broken in the coming decades (unless NSA inserted a backdoor but I don't think that is likely). But looking at the existing cryptanalysis, I think it is even less likely that SHAKE128, ParallelHash or KangarooTwelve will be broken anytime soon. > Also, while I have the attention of somebody who knows a heck more > about > cryptography than Git's top 10 committers combined: how soon do you > expect > practical SHA-1 attacks that are much worse than what we already have > seen? 
I am concerned that if we do not move fast enough to a new hash > algorithm, and somebody finds a way in the meantime to craft arbitrary > messages given a prefix and an SHA-1, then we have a huge problem on > our hands. This is hard to say. To be honest, when witnessing the first MD5 collisions I did not expect them to lead to some real world attacks and just a few years later we saw real-world forged certificates based on MD5 collisions. And SHA-1 has a lot in common with MD5... But let me end with a philosophical note. Independent of all the arguments for and against, I think this is ultimately about doing the right thing. The choice is here between SHA1/SHA2 on the one hand and SHA3/Keccak on the other. The former standards are imposed on us by NSA and the latter are the best that came out of an open competition involving all experts in the field worldwide. What would be closest to the philosophy of Git (and by extension Linux or open-source in general)? Kind regards, Joan Begin forwarded message: From: Gilles Van Assche <gilles.van.assche@noekeon.org> Subject: Re: RFC v3: Another proposed hash function transition plan Date: 30 Sep 2017 22:20:42 CEST To: Joan Daemen <joan@cs.ru.nl>, keccak@noekeon.org Dag Joan, About the implementations, there are many high-quality implementations of Keccak besides the KCP that you could also mention. E.g., those in OpenSSL are very good. And there are all those related to Ethereum. I tend to agree with Guido regarding SHA-1, even if you are right, there is no need to reduce/excuse too much the impact of collisions, there could be unexpected use cases. And it's not clean. (And don't underestimate the probability to be quoted on this.) Finally, just to say that I like your last paragraph. Kind regards, Gilles Joan Daemen <joan@cs.ru.nl> wrote: what about replying with something like this (please have a critical look). I sent this from my Radboud account as I have problems with my Thunderbird settings. 
When trying to send a mail, it sometimes works and sometimes it says “An error occurred while sending mail: Outgoing server (SMTP) error. The server responded: 4.7.1 <joans-mbp.home>: Helo command rejected: Host not found." Dear Johannes, thanks for your response and taking the effort to express your concerns. Please see below for some feedback. On 30/09/17 00:33, Johannes Schindelin wrote: Hi Joan, On Fri, 29 Sep 2017, Joan Daemen wrote: if ever there was a SHA-2 competition, it must have been held inside NSA:-) Oops. My bad, I indeed got confused about that, as you suggest below (I actually thought of the AES competition, but that was obviously not about SHA-2). Sorry. But maybe you are confusing with the SHA-3 competition. In any case, when considering SHA-2 vs SHA-3 for usage in git, you may have a look at arguments we give in the following blogpost: https://keccak.team/2017/open_source_crypto.html Thanks for the pointer! Small nit: the post uses "its" in place of "it's", twice. Thanks, we'll correct that. It does have a good point, of course: the scientific exchange (which you call "open-source" in spirit) makes tons of sense. As far as Git is concerned, we not only care about the source code of the hash algorithm we use, we need to care even more about what you call "executable": ready-to-use, high quality, well-tested implementations. We carry source code for SHA-1 as part of Git's source code, which was hand-tuned to be as fast as Linus could get it, which was tricky given that the tuning should be general enough to apply to all common intel CPUs. This hand-crafted code was blown out of the water by OpenSSL's SHA-1 in our tests here at Microsoft, thanks to the fact that OpenSSL does vectorized SHA-1 computation now. To me, this illustrates why it is not good enough to have only a reference implementation available at our finger tips. Of course, above-mentioned OpenSSL supports SHA-256 and SHA3-256, too, and at least recent versions vectorize those, too. 
There is a lot of high-quality optimized code for all SHA-3 functions and many CPUs in the Keccak code package https://github.com/gvanas/KeccakCodePackage By the way, you speak about SHA3-256, but the right choice would be to use SHAKE128. Well, what exactly the right choice is depends on what you want. If you want to have a function in the SHA3 standard (FIPS 202), it is SHAKE128. You can boost performance on high-end CPUs by adopting ParallelHash from NIST SP 800-185, still a NIST standard. You can multiply that performance again by a factor of 2 by adopting KangarooTwelve. This is our (Keccak team) proposal for a parallelizable Keccak-based hash function that has a safety margin comparable to that of the SHA-2 functions. See https://keccak.team/kangarootwelve.html May I also suggest you read https://keccak.team/2017/is_sha3_slow.html Also, ARM processors have become a lot more popular, so we'll want to have high-quality implementations of the hash algorithm also for those processors. Likewise, in contrast to 2005, nowadays implementations of Git in languages as obscure as JavaScript are not only theoretical but do exist in practice (https://github.com/creationix/js-git). I had a *very* quick look for libraries providing crypto in JavaScript and immediately found the Stanford JavaScript Crypto Library (https://github.com/bitwiseshiftleft/sjcl/) which seems to offer SHA-256 but not SHA3-256 computation. Back to Intel processors: I read some vague hints about extensions accelerating SHA-256 computation on future Intel processors, but not SHA3-256. It would make sense, of course, that more crypto libraries and more hardware support would be available for SHA-256 than for SHA3-256 given the time since publication: 16 vs 5 years (I am playing it loose here, taking just the year into account, not the exact date, so please treat that merely as a ballpark figure). So from a practical point of view, I wonder what your take is on, say, hardware support for SHA3-256. 
Do you think this will become a focus soon? I think this is a chicken-and-egg problem. In any case, hardware support for SHA3-256 will also work for the other SHA3 and SHAKE functions as they all use the same underlying primitive: the Keccak-f permutation. This is not the case for SHA2 because SHA224 and SHA256 use a different compression function than SHA384, SHA512, SHA512/224 and SHA512/256. Also, what is your take on the question whether SHA-256 is good enough? SHA-1 was broken theoretically already 10 years after it was published (which unfortunately did not prevent us from baking it into Git), after all, while SHA-256 is 16 years old and the only known weakness does not apply to Git's usage? I think even the weakness of SHA-1 will be hard to exploit to do something bad in Git. SHA-256 is more conservative than SHA-1 and I don't expect it to be broken (unless NSA inserted a backdoor but I don't think that is likely). But I also don't expect SHAKE128, ParallelHash or KangarooTwelve to be broken, looking at the existing cryptanalysis. Also, while I have the attention of somebody who knows a heck more about cryptography than Git's top 10 committers combined: how soon do you expect practical SHA-1 attacks that are much worse than what we already have seen? I am concerned that if we do not move fast enough to a new hash algorithm, and somebody finds a way in the meantime to craft arbitrary messages given a prefix and an SHA-1, then we have a huge problem on our hands. As said, I don't expect practical SHA-1 attacks soon. But let me end with a philosophical note. Independent of all the arguments for and against, I think this is about doing the right thing. The choice is here between SHA1/SHA2 on the one hand and SHA3/Keccak on the other. The former standards are imposed on us by NSA and the latter are the best that came out of an open competition involving all experts worldwide. 
What would be closest to the philosophy of Git (and by extension Linux or open-source in general)? Kind regards, Joan ^ permalink raw reply [flat|nested] 49+ messages in thread
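Joan's point that SHAKE128 is an extendable-output function (XOF) rather than a fixed-width hash can be illustrated with Python's hashlib, which exposes shake_128; the digest lengths below are illustrative choices, not anything the thread settled on:

```python
import hashlib

msg = b"some git object payload"

# An XOF lets the caller choose the output length. Taking 32 bytes
# gives an output as wide as SHA-256's, with roughly 128-bit
# collision resistance from SHAKE128.
d32 = hashlib.shake_128(msg).hexdigest(32)
d64 = hashlib.shake_128(msg).hexdigest(64)

# Outputs of different lengths are prefixes of one output stream.
assert d64.startswith(d32)
print(d32)
```

This prefix property is also why hardware or optimized software for one Keccak-based function carries over to the others: they all squeeze the same Keccak-f permutation.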
* Re: RFC v3: Another proposed hash function transition plan 2017-09-30 22:02 ` Joan Daemen @ 2017-10-02 14:26 ` Johannes Schindelin 0 siblings, 0 replies; 49+ messages in thread From: Johannes Schindelin @ 2017-10-02 14:26 UTC (permalink / raw) To: Joan Daemen Cc: Gilles Van Assche, Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Hi Joan, On Sun, 1 Oct 2017, Joan Daemen wrote: > On 30/09/17 00:33, Johannes Schindelin wrote: > > > As far as Git is concerned, we not only care about the source code of > > the hash algorithm we use, we need to care even more about what you > > call "executable": ready-to-use, high quality, well-tested > > implementations. > > > > We carry source code for SHA-1 as part of Git's source code, which was > > hand-tuned to be as fast as Linus could get it, which was tricky given > > that the tuning should be general enough to apply to all common intel > > CPUs. > > > > This hand-crafted code was blown out of the water by OpenSSL's SHA-1 > > in our tests here at Microsoft, thanks to the fact that OpenSSL does > > vectorized SHA-1 computation now. > > > > To me, this illustrates why it is not good enough to have only a > > reference implementation available at our finger tips. Of course, > > above-mentioned OpenSSL supports SHA-256 and SHA3-256, too, and at > > least recent versions vectorize those, too. > > There is a lot of high-quality optimized code for all SHA-3 functions > and many CPUs in the Keccak code package > https://github.com/gvanas/KeccakCodePackage but also OpenSSL contains > some good SHA-3 code and then there are all those related to Ethereum. > > By the way, you speak about SHA3-256, but the right choice would be to > use SHAKE128. Well, what is exactly the right choice depends on what you > want. If you want to have a function in the SHA3 standard (FIPS 202), it > is SHAKE128. 
You can boost performance on high-end CPUs by adopting > Parallelhash from NIST SP 800-185, still a NIST standard. You can > multiply that performance again by a factor of 2 by adopting > KangarooTwelve. This is our (Keccak team) proposal for a parallelizable > Keccak-based hash function that has a safety margin comparable to that > of the SHA-2 functions. See https://keccak.team/kangarootwelve.html May > I also suggest you read https://keccak.team/2017/is_sha3_slow.html Thanks. I have to admit that all those names that do not start with SHA and do not end in 256 make me a bit dizzy. > > Back to Intel processors: I read some vague hints about extensions > > accelerating SHA-256 computation on future Intel processors, but not > > SHA3-256. > > > > It would make sense, of course, that more crypto libraries and more > > hardware support would be available for SHA-256 than for SHA3-256 > > given the time since publication: 16 vs 5 years (I am playing it loose > > here, taking just the year into account, not the exact date, so please > > treat that merely as a ballpark figure). > > > > So from a practical point of view, I wonder what your take is on, say, > > hardware support for SHA3-256. Do you think this will become a focus > > soon? > > I think this is a chicken-and-egg problem. In any case, hardware support > for one SHA3-256 will also work for the other SHA3 and SHAKE functions > as they all use the same underlying primitive: the Keccak-f permutation. > This is not the case for SHA2 because SHA224 and SHA256 use a different > compression function than SHA384, SHA512, SHA512/224 and SHA512/256. Okay. So given that Git does not exactly have a big sway on hardware vendors, we would have to hope that some other chicken lays that egg. > > Also, what is your take on the question whether SHA-256 is good > > enough? 
SHA-1 was broken theoretically already 10 years after it was > > published (which unfortunately did not prevent us from baking it into > > Git), after all, while SHA-256 is 16 years old and the only known > > weakness does not apply to Git's usage? > > SHA-256 is more conservative than SHA-1 and I don't expect it to be > broken in the coming decades (unless NSA inserted a backdoor but I don't > think that is likely). But looking at the existing cryptanalysis, I > think it is even less likely that SHAKE128, ParallelHash or > KangarooTwelve will be broken anytime soon. That's reassuring! ;-) > > Also, while I have the attention of somebody who knows a heck more > > about cryptography than Git's top 10 committers combined: how soon do > > you expect practical SHA-1 attacks that are much worse than what we > > already have seen? I am concerned that if we do not move fast enough > > to a new hash algorithm, and somebody finds a way in the meantime to > > craft arbitrary messages given a prefix and an SHA-1, then we have a > > huge problem on our hands. > > This is hard to say. To be honest, when witnessing the first MD5 > collisions I did not expect them to lead to some real world attacks and > just a few years later we saw real-world forged certificates based on > MD5 collisions. And SHA-1 has a lot in common with MD5... Oh, okay. I did not realize that MD5 and SHA-1 are so similar in design, thank you for educating me! > But let me end with a philosophical note. Independent of all the > arguments for and against, I think this is ultimately about doing the > right thing. The choice is here between SHA1/SHA2 on the one hand and > SHA3/Keccak on the other. The former standards are imposed on us by NSA > and the latter are the best that came out of an open competition > involving all experts in the field worldwide. What would be closest to > the philosophy of Git (and by extension Linux or open-source in > general)? Heh. 
Do you realize that you are talking to a Microsoftie, i.e. someone from the "evil company"? ;-) So philosophically, I am much more pragmatic. Or maybe I am not; after all, I joined a company at a time when it is arguably going through one of the most dramatic cultural changes any company has seen lately (a year ago, we became the #1 contributor on GitHub according to Business Insider, and as far as I can tell, we're not willing to pass that belt to anyone else). But when it comes to the philosophy of Git, I fear I have to disappoint you: Git's fundamental concepts were not developed in an open process. Git even went so far as to reject professional advice *not* to bake SHA-1 into everything. Of course, we are undoing this damage right now, and your input helps greatly, I would think. While I feel reassured by your response that SHA-256 would be "good enough" and would have some real-life benefits from announced hardware support, I would now also feel comfortable if my preference was overruled in the end, in favor of a hash from the Keccak family. I would understand, for example, if the parallel option turned out to be enticing enough for other core Git contributors to aim for, say, K12. Again, thank you very much for chiming in, Johannes ^ permalink raw reply [flat|nested] 49+ messages in thread
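The "parallel option" discussed above (ParallelHash, KangarooTwelve) gets its speed from tree hashing: chunks are hashed independently, so SIMD lanes or CPU cores can work in parallel, and the chunk digests are then hashed together. The following is a toy illustration of that idea only; it is explicitly not the standardized ParallelHash or K12 constructions, and the chunk size is arbitrary:

```python
import hashlib

CHUNK = 8192  # illustrative chunk size; the real standards fix their own

def toy_tree_hash(data: bytes) -> str:
    """Toy two-level tree hash: hash each chunk, then hash the chunk
    digests. ParallelHash and KangarooTwelve apply the same basic idea
    (with a proper, standardized encoding), which is what lets them
    spread the work across SIMD lanes or cores."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)] or [b""]
    leaves = b"".join(hashlib.sha3_256(c).digest() for c in chunks)
    # Bind the chunk count so different chunkings cannot trivially collide.
    return hashlib.sha3_256(len(chunks).to_bytes(8, "big") + leaves).hexdigest()
```

A sequential hash must process bytes in order; here the per-chunk hashes have no data dependency on each other, which is the whole performance argument.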
* Re: RFC v3: Another proposed hash function transition plan 2017-09-18 12:17 ` Gilles Van Assche 2017-09-18 22:16 ` Johannes Schindelin @ 2017-09-18 22:25 ` Jonathan Nieder 1 sibling, 0 replies; 49+ messages in thread From: Jonathan Nieder @ 2017-09-18 22:25 UTC (permalink / raw) To: Gilles Van Assche Cc: Johannes Schindelin, Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson, Keccak Team Hi, Gilles Van Assche wrote: > Hi Johannes, >> SHA-256 got much more cryptanalysis than SHA3-256 […]. > > I do not think this is true. Keccak/SHA-3 actually got (and is still > getting) a lot of cryptanalysis, with papers published at renowned > crypto conferences [1]. > > Keccak/SHA-3 is recognized to have a significant safety margin. E.g., > one can cut the number of rounds in half (as in Keyak or KangarooTwelve) > and still get a very strong function. I don't think we could say the > same for SHA-256 or SHA-512… I just wanted to thank you for paying attention to this conversation and weighing in. Most of the regulars in the git project are not crypto experts. This kind of extra information (and e.g. [2]) is very useful to us. Thanks, Jonathan > Kind regards, > Gilles, for the Keccak team > > [1] https://keccak.team/third_party.html [2] https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5bcf886@noekeon.org/ ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 18:45 ` Johannes Schindelin 2017-09-18 12:17 ` Gilles Van Assche @ 2017-09-26 17:05 ` Jason Cooper 2017-09-26 22:11 ` Johannes Schindelin 1 sibling, 1 reply; 49+ messages in thread From: Jason Cooper @ 2017-09-26 17:05 UTC (permalink / raw) To: Johannes Schindelin Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi all, Sorry for the late commentary... On Thu, Sep 14, 2017 at 08:45:35PM +0200, Johannes Schindelin wrote: > On Wed, 13 Sep 2017, Linus Torvalds wrote: > > On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote: > > > SHA3 however uses a completely different design where it mixes a 1088 > > > bit block into a 1600 bit state, for a leverage of 2:3, and the excess > > > is *preserved between each block*. > > > > Yes. And considering that the SHA1 attack was actually predicated on > > the fact that each block was independent (no extra state between), I > > do think SHA3 is a better model. > > > > So I'd rather see SHA3-256 than SHA256. Well, for what it's worth, we need to be aware that SHA3 is *different*. In crypto, "different" = "bugs haven't been found yet". :-P And SHA2 is *known*. So we have a pretty good handle on how it'll weaken over time. > SHA-256 got much more cryptanalysis than SHA3-256, and apart from the > length-extension problem that does not affect Git's usage, there are no > known weaknesses so far. While I think that statement is true on its face (particularly when including post-competition analysis), I don't think it's sufficient justification to choose one over the other. > It would seem that the experts I talked to were much more concerned about > that amount of attention than the particulars of the algorithm. 
My > impression was that the new features of SHA3 were less studied than the > well-known features of SHA2, and that the new-ness of SHA3 is not > necessarily a good thing. The only thing I really object to here is the abstract "experts". We're talking about cryptography and integrity here. It's no longer sufficient to cite anonymous experts. Either they can put their thoughts, opinions and analysis on record here, or it shouldn't be considered. Sorry. Other than their anonymity, though, I do agree with your experts' assessments. However, whether we choose SHA2 or SHA3 doesn't matter. Moving away from SHA1 does. Once the object_id code is in place to facilitate that transition, the problem is solved from git's perspective. If SHA3 is chosen as the successor, it's going to get a *lot* more adoption, and thus, a lot more analysis. If cracks start to show, the hard work of making git flexible is already done. We can migrate to SHA4/5/whatever in an orderly fashion with far less effort than the transition away from SHA1. For my use cases, as a user of git, I have a plan to maintain provable integrity of existing objects stored in git under sha1 while migrating away from sha1. The same plan works for migrating away from SHA2 or SHA3 when the time comes. thx, Jason. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-26 17:05 ` Jason Cooper @ 2017-09-26 22:11 ` Johannes Schindelin 2017-09-26 23:51 ` Jonathan Nieder 2017-10-02 14:00 ` Jason Cooper 0 siblings, 2 replies; 49+ messages in thread From: Johannes Schindelin @ 2017-09-26 22:11 UTC (permalink / raw) To: Jason Cooper Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi Jason, On Tue, 26 Sep 2017, Jason Cooper wrote: > On Thu, Sep 14, 2017 at 08:45:35PM +0200, Johannes Schindelin wrote: > > On Wed, 13 Sep 2017, Linus Torvalds wrote: > > > On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote: > > > > SHA3 however uses a completely different design where it mixes a 1088 > > > > bit block into a 1600 bit state, for a leverage of 2:3, and the excess > > > > is *preserved between each block*. > > > > > > Yes. And considering that the SHA1 attack was actually predicated on > > > the fact that each block was independent (no extra state between), I > > > do think SHA3 is a better model. > > > > > > So I'd rather see SHA3-256 than SHA256. > > Well, for what it's worth, we need to be aware that SHA3 is *different*. > In crypto, "different" = "bugs haven't been found yet". :-P > > And SHA2 is *known*. So we have a pretty good handle on how it'll > weaken over time. Here, you seem to agree with me. > > SHA-256 got much more cryptanalysis than SHA3-256, and apart from the > > length-extension problem that does not affect Git's usage, there are no > > known weaknesses so far. > > While I think that statement is true on it's face (particularly when > including post-competition analysis), I don't think it's sufficient > justification to chose one over the other. And here you don't. I find that very confusing. 
> > It would seem that the experts I talked to were much more concerned about > > that amount of attention than the particulars of the algorithm. My > > impression was that the new features of SHA3 were less studied than the > > well-known features of SHA2, and that the new-ness of SHA3 is not > > necessarily a good thing. > > The only thing I really object to here is the abstract "experts". We're > talking about cryptography and integrity here. It's no longer > sufficient to cite anonymous experts. Either they can put their > thoughts, opinions and analysis on record here, or it shouldn't be > considered. Sorry. Sorry, you are asking cryptography experts to spend their time on the Git mailing list. I tried to get them to speak out on the Git mailing list. They respectfully declined. I can't fault them, they have real jobs to do, and none of their managers would be happy for them to educate the Git mailing list on matters of cryptography, not after what happened in 2005. > Other than their anonymity, though, I do agree with your experts > assessments. I know what our in-house cryptography experts have to prove to start working at Microsoft. Forgive me, but you are not a known entity to me. > However, whether we chose SHA2 or SHA3 doesn't matter. To you, it does not matter. To me, it matters. To the several thousand developers working on Windows, probably the largest Git repository in active use, it matters. It matters because the speed difference that has little impact on you has a lot more impact on us. > Moving away from SHA1 does. Once the object_id code is in place to > facilitate that transition, the problem is solved from git's > perspective. Uh oh. You forgot the mapping. And the protocol. And pretty much everything except the oid. > If SHA3 is chosen as the successor, it's going to get a *lot* more > adoption, and thus, a lot more analysis. If cracks start to show, the > hard work of making git flexible is already done. 
We can migrate to > SHA4/5/whatever in an orderly fashion with far less effort than the > transition away from SHA1. Sure. And if XYZ789 is chosen, it's going to get a *lot* more adoption, too. We think. Let's be realistic. Git is pretty important to us, but it is not important enough to sway, say, Intel into announcing hardware support for SHA3. And if you try to force through *any* hash function only so that it gets more adoption and hence more support, in the short run you will make life harder for developers on more obscure platforms, who may not easily get high-quality, high-speed implementations of anything but the very mainstream (which is, let's face it, MD5, SHA-1 and SHA-256). I know I would have cursed you for such a decision back when I had to work on AIX and IRIX. > For my use cases, as a user of git, I have a plan to maintain provable > integrity of existing objects stored in git under sha1 while migrating > away from sha1. The same plan works for migrating away from SHA2 or > SHA3 when the time comes. Please do not make the mistake of taking your use case to be a template for everybody's use case. Migrating a large team away from any hash function to another one *will* be painful, and costly. Migrating will be very costly for hosting companies like GitHub, Microsoft and BitBucket, too. Ciao, Johannes ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-26 22:11 ` Johannes Schindelin @ 2017-09-26 23:51 ` Jonathan Nieder 2017-10-02 14:54 ` Jason Cooper 2017-10-02 14:00 ` Jason Cooper 1 sibling, 1 reply; 49+ messages in thread From: Jonathan Nieder @ 2017-09-26 23:51 UTC (permalink / raw) To: Johannes Schindelin Cc: Jason Cooper, Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi, Johannes Schindelin wrote: > Sorry, you are asking cryptography experts to spend their time on the Git > mailing list. I tried to get them to speak out on the Git mailing list. > They respectfully declined. > > I can't fault them, they have real jobs to do, and none of their managers > would be happy for them to educate the Git mailing list on matters of > cryptography, not after what happened in 2005. Fortunately we have had a few public comments from crypto specialists: https://public-inbox.org/git/91a34c5b-7844-3db2-cf29-411df5bcf886@noekeon.org/ https://public-inbox.org/git/CAL9PXLzhPyE+geUdcLmd=pidT5P8eFEBbSgX_dS88knz2q_LSw@mail.gmail.com/ https://public-inbox.org/git/CAL9PXLxMHG1nP5_GQaK_WSJTNKs=_qbaL6V5v2GzVG=9VU2+gA@mail.gmail.com/ https://public-inbox.org/git/59BFB95D.1030903@st.com/ https://public-inbox.org/git/59C149A3.6080506@st.com/ [...] > Let's be realistic. Git is pretty important to us, but it is not important > enough to sway, say, Intel into announcing hardware support for SHA3. Yes, I agree with this. (Adoption by Git could lead to adoption by some other projects, leading to more work on high quality software implementations in projects like OpenSSL, but I am not convinced that that would be a good thing for the world anyway. There are downsides to a proliferation of too many crypto primitives. This is the basic argument described in more detail at [1].) [...] 
> On Tue, 26 Sep 2017, Jason Cooper wrote: >> For my use cases, as a user of git, I have a plan to maintain provable >> integrity of existing objects stored in git under sha1 while migrating >> away from sha1. The same plan works for migrating away from SHA2 or >> SHA3 when the time comes. > > Please do not make the mistake of taking your use case to be a template > for everybody's use case. That said, I'm curious what plan you are alluding to. Is it something that could benefit others on the list? Thanks, Jonathan [1] https://www.imperialviolet.org/2017/05/31/skipsha3.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-26 23:51 ` Jonathan Nieder @ 2017-10-02 14:54 ` Jason Cooper 2017-10-02 16:50 ` Brandon Williams 0 siblings, 1 reply; 49+ messages in thread From: Jason Cooper @ 2017-10-02 14:54 UTC (permalink / raw) To: Jonathan Nieder Cc: Johannes Schindelin, Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi Jonathan, On Tue, Sep 26, 2017 at 04:51:58PM -0700, Jonathan Nieder wrote: > Johannes Schindelin wrote: > > On Tue, 26 Sep 2017, Jason Cooper wrote: > >> For my use cases, as a user of git, I have a plan to maintain provable > >> integrity of existing objects stored in git under sha1 while migrating > >> away from sha1. The same plan works for migrating away from SHA2 or > >> SHA3 when the time comes. > > > > Please do not make the mistake of taking your use case to be a template > > for everybody's use case. > > That said, I'm curious at what plan you are alluding to. Is it > something that could benefit others on the list? Well, it's just a plan at this point. As there's a lot of other work to do in the mean-time, and there's no possibility of transitioning until the dust has settled on NEWHASH. :-) Given an existing repository that needs to migrate from SHA1 to NEWHASH, and maintain backwards compatibility with clients that haven't migrated yet, how do we a) perform that migration, b) allow non-updated clients to use the data prior to the switch, and c) maintain provable integrity of the old objects as well as the new. The primary method is counter-hashing, which re-uses the blobs, and creates parallel, deterministic tree, commit, and tag objects using NEWHASH for everything up to flag day. post-flag-day only uses NEWHASH. A PGP "transition" key is used to counter-sign the NEWHASH version of the old signed tags. The transition key is not required to be different than the existing maintainers key. 
A critical feature is the ability of entities other than the maintainer to migrate to NEWHASH. For example, let's say that git has fully implemented and tested NEWHASH. linux.git intends to migrate, but it's going to take several months (get all the developers herded up). In the interim, a security company, relying on Linux for its products can counter-hash Linus' repo, and continue to do so every time he updates his tree. This shrinks the attack window for an entity (with an undisclosed break of SHA1) down to a few minutes to an hour. Otherwise, a check of the counter hashes in the future would reveal the substitution. The deterministic feature is critical here because there is valuable integrity and trust built by counter-hashing quickly after publication. So once Linux migrates to NEWHASH, the hashes calculated by the security company should be identical. IOW, use the timestamps that are in the SHA1 commit objects for the NEWHASH objects. Which should be obvious, but it's worth explicitly mentioning that determinism provides great value. We're in the process of writing this up formally, which will provide a lot more detail and rationale than this quick stream of thought. :-) I'm sure a lot of this has already been discussed on the list. If so, I apologize for being repetitive. Unfortunately, I'm not able to keep up with the MLs like I used to. thx, Jason. ^ permalink raw reply [flat|nested] 49+ messages in thread
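[Editorial note: the counter-hashing scheme described in the message above can be sketched in a few lines. This is an illustrative reconstruction, not code from the thread: object encodings follow git's `<type> <size>\0<body>` loose-object convention, SHA-256 stands in for NEWHASH, and the dict-based commit representation (`tree`, `parents`, `author`, `committer`, `message` keys) is a hypothetical stand-in for real object parsing. Determinism comes from reusing the SHA-1 object's own fields, including its timestamps, verbatim.]

```python
import hashlib

def counter_hash_blob(data: bytes) -> str:
    # Blobs are reused byte-for-byte; only the header's hash changes.
    obj = b"blob %d\0" % len(data) + data
    return hashlib.sha256(obj).hexdigest()

def counter_hash_commit(old_commit: dict, sha1_to_new: dict) -> str:
    # Rebuild the commit body deterministically: same author/committer
    # lines (and thus the same timestamps) and message as the SHA-1
    # object, with the tree and parent ids translated to their NEWHASH
    # counterparts via the sha1_to_new mapping built so far.
    lines = ["tree " + sha1_to_new[old_commit["tree"]]]
    for p in old_commit["parents"]:
        lines.append("parent " + sha1_to_new[p])
    lines.append("author " + old_commit["author"])
    lines.append("committer " + old_commit["committer"])
    body = ("\n".join(lines) + "\n\n" + old_commit["message"]).encode()
    obj = b"commit %d\0" % len(body) + body
    return hashlib.sha256(obj).hexdigest()
```

Because every input is taken from the SHA-1 object itself, any party running this over the same history computes identical counter-hashes, which is the property the message argues makes third-party counter-hashing valuable.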
* Re: RFC v3: Another proposed hash function transition plan 2017-10-02 14:54 ` Jason Cooper @ 2017-10-02 16:50 ` Brandon Williams 0 siblings, 0 replies; 49+ messages in thread From: Brandon Williams @ 2017-10-02 16:50 UTC (permalink / raw) To: Jason Cooper Cc: Jonathan Nieder, Johannes Schindelin, Linus Torvalds, demerphq, Junio C Hamano, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson On 10/02, Jason Cooper wrote: > Hi Jonathan, > > On Tue, Sep 26, 2017 at 04:51:58PM -0700, Jonathan Nieder wrote: > > Johannes Schindelin wrote: > > > On Tue, 26 Sep 2017, Jason Cooper wrote: > > >> For my use cases, as a user of git, I have a plan to maintain provable > > >> integrity of existing objects stored in git under sha1 while migrating > > >> away from sha1. The same plan works for migrating away from SHA2 or > > >> SHA3 when the time comes. > > > > > > Please do not make the mistake of taking your use case to be a template > > > for everybody's use case. > > > > That said, I'm curious at what plan you are alluding to. Is it > > something that could benefit others on the list? > > Well, it's just a plan at this point. As there's a lot of other work to > do in the mean-time, and there's no possibility of transitioning until > the dust has settled on NEWHASH. :-) > > Given an existing repository that needs to migrate from SHA1 to NEWHASH, > and maintain backwards compatibility with clients that haven't migrated > yet, how do we > > a) perform that migration, > b) allow non-updated clients to use the data prior to the switch, and > c) maintain provable integrity of the old objects as well as the new. > > The primary method is counter-hashing, which re-uses the blobs, and > creates parallel, deterministic tree, commit, and tag objects using > NEWHASH for everything up to flag day. post-flag-day only uses NEWHASH. > A PGP "transition" key is used to counter-sign the NEWHASH version of > the old signed tags. 
The transition key is not required to be different > than the existing maintainers key. > > A critical feature is the ability of entities other than the maintainer > to migrate to NEWHASH. For example, let's say that git has fully > implemented and tested NEWHASH. linux.git intends to migrate, but it's > going to take several months (get all the developers herded up). > > In the interim, a security company, relying on Linux for it's products > can counter-hash Linus' repo, and continue to do so every time he > updates his tree. This shrinks the attack window for an entity (with an > undisclosed break of SHA1) down to a few minutes to an hour. Otherwise, > a check of the counter hashes in the future would reveal the > substitution. > > The deterministic feature is critical here because there is valuable > integrity and trust built by counter-hashing quickly after publication. > So once Linux migrates to NEWHASH, the hashes calculated by the security > company should be identical. IOW, use the timestamps that are in the > SHA1 commit objects for the NEWHASH objects. Which should be obvious, > but it's worth explicitly mentioning that determinism provides great > value. > > We're in the process of writing this up formally, which will provide a > lot more detail and rationale that this quick stream of thought. :-) > > I'm sure a lot of this has already been discussed on the list. If so, I > apologize for being repetitive. Unfortunately, I'm not able to keep up > with the MLs like I used to. > > thx, > > Jason. Given the interests that you've expressed here I'd recommend taking a look at https://public-inbox.org/git/20170928044320.GA84719@aiede.mtv.corp.google.com/ which is the current version of the transition plan that the community has settled on (https://public-inbox.org/git/xmqqlgkyxgvq.fsf@gitster.mtv.corp.google.com/ shows that it should be merged to 'next' soon). 
One neat aspect of this transition plan is that it doesn't require a flag day but rather anyone can migrate to the new hash function and still interact with repositories (via the wire) which are still running SHA1. -- Brandon Williams ^ permalink raw reply [flat|nested] 49+ messages in thread
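[Editorial note: the wire-level interoperability Brandon describes rests on a local translation table between the two object names. The following is a minimal, hypothetical sketch of that idea only; git's real translation table is a dedicated on-disk structure alongside the object database, not a Python class.]

```python
class OidMap:
    """Bidirectional SHA-1 <-> SHA-256 object-name map (illustrative).

    A repository that has migrated internally records both names for
    every object, so it can translate ids at the protocol boundary
    when talking to a peer that still speaks SHA-1.
    """

    def __init__(self):
        self._to_new = {}  # sha1 hex -> sha256 hex
        self._to_old = {}  # sha256 hex -> sha1 hex

    def record(self, sha1: str, sha256: str) -> None:
        self._to_new[sha1] = sha256
        self._to_old[sha256] = sha1

    def to_sha256(self, sha1: str) -> str:
        return self._to_new[sha1]

    def to_sha1(self, sha256: str) -> str:
        return self._to_old[sha256]
```

Incoming SHA-1 names from the wire are looked up with `to_sha256` before touching local storage, and outgoing names are translated back with `to_sha1`, which is why no flag day is needed.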
* Re: RFC v3: Another proposed hash function transition plan 2017-09-26 22:11 ` Johannes Schindelin 2017-09-26 23:51 ` Jonathan Nieder @ 2017-10-02 14:00 ` Jason Cooper 2017-10-02 17:18 ` Linus Torvalds 1 sibling, 1 reply; 49+ messages in thread From: Jason Cooper @ 2017-10-02 14:00 UTC (permalink / raw) To: Johannes Schindelin Cc: Linus Torvalds, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi Johannes, Thanks for the response. Sorry for the delay. Had a large deadline for $dayjob. On Wed, Sep 27, 2017 at 12:11:14AM +0200, Johannes Schindelin wrote: > On Tue, 26 Sep 2017, Jason Cooper wrote: > > On Thu, Sep 14, 2017 at 08:45:35PM +0200, Johannes Schindelin wrote: > > > On Wed, 13 Sep 2017, Linus Torvalds wrote: > > > > On Wed, Sep 13, 2017 at 6:43 AM, demerphq <demerphq@gmail.com> wrote: > > > > > SHA3 however uses a completely different design where it mixes a 1088 > > > > > bit block into a 1600 bit state, for a leverage of 2:3, and the excess > > > > > is *preserved between each block*. > > > > > > > > Yes. And considering that the SHA1 attack was actually predicated on > > > > the fact that each block was independent (no extra state between), I > > > > do think SHA3 is a better model. > > > > > > > > So I'd rather see SHA3-256 than SHA256. > > > > Well, for what it's worth, we need to be aware that SHA3 is *different*. > > In crypto, "different" = "bugs haven't been found yet". :-P > > > > And SHA2 is *known*. So we have a pretty good handle on how it'll > > weaken over time. > > Here, you seem to agree with me. Yep. > > > SHA-256 got much more cryptanalysis than SHA3-256, and apart from the > > > length-extension problem that does not affect Git's usage, there are no > > > known weaknesses so far. 
> > > > While I think that statement is true on it's face (particularly when > > including post-competition analysis), I don't think it's sufficient > > justification to chose one over the other. > > And here you don't. > > I find that very confusing. What I'm saying is that there is more to selecting a hash function for git than just the cryptographic assessment. In fact I would argue that the primary cryptographic concern for git is "What is the likelihood that we'll wake up one day to full collisions with no warning?" To that, I'd argue that SHA-256's time in the field and SHA3-256's competition give them both passing marks in that regard. fwiw, I'd also put Blake and Skein in there as well. The chance that any of those will suffer sudden, catastrophic failure is minimal. IOW, we'll have warnings, and time to migrate to the next function. None of us can predict the future, but having a significant amount of vetting reduces the chances of catastrophic failure. > > > It would seem that the experts I talked to were much more concerned about > > > that amount of attention than the particulars of the algorithm. My > > > impression was that the new features of SHA3 were less studied than the > > > well-known features of SHA2, and that the new-ness of SHA3 is not > > > necessarily a good thing. > > > > The only thing I really object to here is the abstract "experts". We're > > talking about cryptography and integrity here. It's no longer > > sufficient to cite anonymous experts. Either they can put their > > thoughts, opinions and analysis on record here, or it shouldn't be > > considered. Sorry. > > Sorry, you are asking cryptography experts to spend their time on the Git > mailing list. I tried to get them to speak out on the Git mailing list. > They respectfully declined. Ok, fair enough. Just please understand that it's difficult to place much weight on statements that we can't discuss with the person who made them. 
> > However, whether we choose SHA2 or SHA3 doesn't matter. > > To you, it does not matter. Well, I'd say it does not matter for *most* users. > To me, it matters. To the several thousand developers working on Windows, > probably the largest Git repository in active use, it matters. It matters > because the speed difference that has little impact on you has a lot more > impact on us. Ahhh, so if I understand you correctly, you'd prefer SHA-256 over SHA3-256 because it's more performant for your usecase? Well, that's a completely different animal than cryptographic suitability. Have you been able to crunch numbers yet? Will you be able to share some empirical data? I'd love to see some comparisons between SHA1, SHA-256, SHA512-256, and SHA3-256 for different git operations under your workload. > > If SHA3 is chosen as the successor, it's going to get a *lot* more > adoption, and thus, a lot more analysis. If cracks start to show, the > hard work of making git flexible is already done. We can migrate to > SHA4/5/whatever in an orderly fashion with far less effort than the > transition away from SHA1. > > Sure. And if XYZ789 is chosen, it's going to get a *lot* more adoption, > too. > > We think. > > Let's be realistic. Git is pretty important to us, but it is not important > enough to sway, say, Intel into announcing hardware support for SHA3. > And if you try to force through *any* hash function only so that it gets > more adoption and hence more support, That's quite a jump from what I was saying. I would never advise using code in a production setting just to increase adoption. What I /was/ saying: Let's say you don't get what you want, and SHA3-256 is chosen. It's not the end of the world from a cryptographic PoV. The hard work of making the git (and libgit2) codebases hash-flexible is already done. So, if you're correct, and SHA3 was too immature, the increased visibility will help us discover that more quickly. 
And, the code will already be in a position to conduct an orderly migration. Will it still be costly? Yes. But I would argue that it's naive to think that we will be using git/sha3-256 or git/sha-256 10 to 15 years from now. It might be git, it might not. But there *will* be another migration of existing data (code, history, etc) from one object storage model to another. It might be git/SHA4-512, or hg/sha4-384. So, we aren't trying to find the perfect hash function so that we naively think we'll never have to change again. Rather, we're choosing the next hash function so that we can hold off another migration for as long as possible. After all, SHA4-512 doesn't exist yet. ;-) > in the short run you will make life > harder for developers on more obscure platforms, who may not easily get > high-quality, high-speed implementations of anything but the very > mainstream (which is, let's face it, MD5, SHA-1 and SHA-256). I know I > would have cursed you for such a decision back when I had to work on AIX > and IRIX. I think you're assuming that all developers on obscure platforms have a similar git usecase to your current one. I've not heard of that being the case. > > For my use cases, as a user of git, I have a plan to maintain provable > > integrity of existing objects stored in git under sha1 while migrating > > away from sha1. The same plan works for migrating away from SHA2 or > > SHA3 when the time comes. > > Please do not make the mistake of taking your use case to be a template > for everybody's use case. I wasn't. But I will argue that my usecase is valid. Just as yours is. > Migrating a large team away from any hash function to another one *will* > be painful, and costly. Assuming that it will never happen again would make that doubly costly. > Migrating will be very costly for hosting companies like GitHub, Microsoft > and BitBucket, too. <with_my_business_hat_on> GitHub and BitBucket have git as the core of their business model. 
If they aren't keeping an eye on the future path of git and maintaining migration plans, shame on them. </with_my_business_hat_on> Thanks, Jason. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-10-02 14:00 ` Jason Cooper @ 2017-10-02 17:18 ` Linus Torvalds 2017-10-02 19:37 ` Jeff King 0 siblings, 1 reply; 49+ messages in thread From: Linus Torvalds @ 2017-10-02 17:18 UTC (permalink / raw) To: Jason Cooper Cc: Johannes Schindelin, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, Jeff King, David Lang, brian m. carlson On Mon, Oct 2, 2017 at 7:00 AM, Jason Cooper <jason@lakedaemon.net> wrote: > > Ahhh, so if I understand you correctly, you'd prefer SHA-256 over > SHA3-256 because it's more performant for your usecase? Well, that's a > completely different animal than cryptographic suitability. In almost all loads I've seen, zlib inflate() cost is a bigger deal than the crypto load. The crypto people talk about cycles per byte, but the deflate code is what usually takes the page faults and cache misses etc, and has bad branch prediction. That ends up easily being tens of thousands of cycles, even for small data. But it does obviously depend on exactly what you do. The Windows people saw SHA1 as costly mainly due to the index file (which is just a "fancy crc", and not even cryptographically important, and where the cache misses actually happen when doing crypto, not decompressing the data). And fsck and big initial checkins can have a very different profile than most "regular use" profiles. Again, there the crypto happens first, and takes the cache misses. And the crypto is almost certainly _much_ cheaper than just the act of loading the index file contents in the first place. It may show up on profiles fairly clearly, but that's mostly because crypto is *intensive*, not because crypto takes up most of the cycles. End result: honestly, the real cost on almost any load is not crypto or necessarily even (de)compression, even if those are the things that show up. 
It's the cache misses and the "get data into user space" (whether using "read()" or page faulting). Worrying about cycles per byte of compression speed is almost certainly missing the real issue. The people who benchmark cryptography tend to intentionally avoid the actual real work, because they just want to know the crypto costs. So when you see numbers like "9 cycles per byte" vs "12 cycles per byte" and think that it's a big deal - 30% performance difference! - it's almost certainly complete garbage. It may be 30%, but it is likely 30% out of 10% total, meaning that it's almost in the noise for any but some very special case. Linus ^ permalink raw reply [flat|nested] 49+ messages in thread
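[Editorial note: Linus's point that hashing is rarely the dominant cost can be probed with a crude micro-benchmark like the one below, which times SHA-256 hashing against zlib inflation of the same payload. This is an illustrative sketch, not from the thread; the ratio will vary with hardware, buffer size, and compression level, and as Linus notes it still misses the cache-miss and page-fault costs that dominate real workloads.]

```python
import hashlib
import time
import zlib

def profile(data: bytes, reps: int = 50):
    """Return (seconds spent hashing, seconds spent inflating)."""
    packed = zlib.compress(data)  # pre-deflate so we only time inflate()

    t0 = time.perf_counter()
    for _ in range(reps):
        hashlib.sha256(data).digest()
    t_hash = time.perf_counter() - t0

    t0 = time.perf_counter()
    for _ in range(reps):
        zlib.decompress(packed)
    t_inflate = time.perf_counter() - t0

    return t_hash, t_inflate
```

Even where such a benchmark shows hashing and inflation to be comparable, the thread's caveat applies: cycles-per-byte numbers measured in isolation say little about end-to-end git operations, where I/O and cache behavior dominate.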
* Re: RFC v3: Another proposed hash function transition plan 2017-10-02 17:18 ` Linus Torvalds @ 2017-10-02 19:37 ` Jeff King 0 siblings, 0 replies; 49+ messages in thread From: Jeff King @ 2017-10-02 19:37 UTC (permalink / raw) To: Linus Torvalds Cc: Jason Cooper, Johannes Schindelin, demerphq, Brandon Williams, Junio C Hamano, Jonathan Nieder, Git Mailing List, Stefan Beller, Jonathan Tan, David Lang, brian m. carlson On Mon, Oct 02, 2017 at 10:18:02AM -0700, Linus Torvalds wrote: > On Mon, Oct 2, 2017 at 7:00 AM, Jason Cooper <jason@lakedaemon.net> wrote: > > > > Ahhh, so if I understand you correctly, you'd prefer SHA-256 over > > SHA3-256 because it's more performant for your usecase? Well, that's a > > completely different animal that cryptographic suitability. > > In almost all loads I've seen, zlib inflate() cost is a bigger deal > than the crypto load. The crypto people talk about cycles per byte, > but the deflate code is what usually takes the page faults and cache > misses etc, and has bad branch prediction. That ends up easily being > tens or thousands of cycles, even for small data. If anyone is interested in the user-visible effects of slower crypto, I think, there are some numbers in 8325e43b82 (Makefile: add DC_SHA1 knob, 2017-03-16). I don't know how SHA-256 compares to sha1dc exactly, but certainly the latter is a lot slower than normal sha1. The only real-world case I found with a noticeable slowdown was index-pack. Which in the worst case is roughly the same operation as "git fsck" (inflate and compute the sha1 on every byte), but people tend to actually do it a lot more often. And it really _is_ slower for real-world operations; the CPU for computing the sha1 of an incoming clone of linux.git jumped from ~3 minutes to ~6 minutes. 
But I don't think we've seen a lot of complaints, probably because that time is lumped in with "time to transfer a gigabyte of data", so unless you're on a slow machine on fast connection, you don't even really notice. For day-to-day operations in a repository, I never came up with a good example where the speed difference mattered. I think Dscho's giant-index example is an outlier and the right answer there is not "pick a fast crypto algorithm" but "stop using a slow crypto algorithm as a checksum" (and also, stop routinely reading and writing 400MB for day-to-day operations). -Peff ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 12:05 ` Johannes Schindelin 2017-09-13 13:43 ` demerphq @ 2017-09-13 16:30 ` Jonathan Nieder 2017-09-13 21:52 ` Junio C Hamano 2017-09-14 12:39 ` Johannes Schindelin 1 sibling, 2 replies; 49+ messages in thread From: Jonathan Nieder @ 2017-09-13 16:30 UTC (permalink / raw) To: Johannes Schindelin Cc: Brandon Williams, Junio C Hamano, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Hi Dscho, Johannes Schindelin wrote: > So even if the code to generate a bidirectional old <-> new hash mapping > might be with us forever, it *definitely* should be optional ("optional" > at least as in "config setting"), allowing developers who only work with > new-hash repositories to save the time and electrons. Agreed. This is a good reason not to store the sha1 inside the sha256-encoded objects. I think that is exactly what Brandon was saying in response to Junio --- did you read it differently? [...] > ... or Git would simply handle the absence of the generation number header > gracefully, so that sha1-content == sha3-content... Part of the sha1-content is references to other objects using their sha1-name, so it is not possible to have sha1-content == sha3-content. That said, I am also leaning against including generation numbers as part of this design. There is an argument for including generation numbers. It is much simpler to have generation numbers in *all* commit objects than only in some, since it means the slop-based heuristics for faking generation numbers using commit timestamp can be completely avoided for a repository using such a format. Including generation numbers in all commit objects is a painless thing to do during a format change, since it can happen without harming round-tripping. 
Treating generation numbers as derived data (as in Jeff King's preferred design, if I have understood his replies correctly) would also be possible but it does not interact well with shallow clone or narrow clone. All that said, for simplicity I still lean against including generation numbers as part of a hash function transition. Nothing stops us from having another format change later. This is a particularly hard decision because I don't have a strong preference. That leads me to err on the side of simplicity. I will make sure to discuss this issue in my patch to Documentation/technical/, so we don't have to repeat the same conversations again and again. [...] > Taking a step back, though, it may be a good idea to leave the generation > number business for later, as much fun as it is to get side tracked and > focus on relatively trivial stuff instead of the far more difficult and > complex task to get the transition plan to a new hash ironed out. > > For example, I am still in favor of SHA-256 over SHA3-256, after learning > some background details from in-house cryptographers: it provides > essentially the same level of security, according to my sources, while > hardware support seems to be coming to SHA-256 a lot sooner than to > SHA3-256. > > Which hash algorithm to choose is a tough question to answer, and > discussing generation numbers will sadly not help us answer it any quicker. This is unrelated to Brandon's message, except for his use of SHA3 as a placeholder for "the next hash function". My assumption based on previous conversations (and other external conversations like [1]) is that we are going to use SHA2-256 and have a pretty strong consensus for that. Don't worry! As a side note, I am probably misreading, but I found this set of paragraphs a bit condescending. It sounds to me like you are saying "You are making the wrong choice of hash function and everything else you are describing is irrelevant when compared to that monumental mistake. 
Please stop working on things I don't consider important". With that reading it is quite demotivating to read. An alternative reading is that you are saying that the transition plan described in this thread is not ironed out. Can you spell that out more? What particular aspect of the transition plan (which is of course orthogonal to the choice of hash function) are you discontent with? Thanks and hope that helps, Jonathan [1] https://www.imperialviolet.org/2017/05/31/skipsha3.html ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 16:30 ` Jonathan Nieder @ 2017-09-13 21:52 ` Junio C Hamano 2017-09-13 22:07 ` Stefan Beller 2017-09-13 22:15 ` Junio C Hamano 2017-09-14 12:39 ` Johannes Schindelin 1 sibling, 2 replies; 49+ messages in thread From: Junio C Hamano @ 2017-09-13 21:52 UTC (permalink / raw) To: Jonathan Nieder Cc: Johannes Schindelin, Brandon Williams, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Jonathan Nieder <jrnieder@gmail.com> writes: > Treating generation numbers as derived data (as in Jeff King's > preferred design, if I have understood his replies correctly) would > also be possible but it does not interact well with shallow clone or > narrow clone. Just like we have skewed committer timestamps, there is no reason to believe that generation numbers embedded in objects are trustable, and there is no way for narrow clients to even verify their correctness. So I agree with Peff that having generation numbers in object is pointless; I agree any other derivables like corresponding sha-1 name is also pointless to have. This is a tangent, but it may be fine for a shallow clone to treat the cut-off points in the history as if they are root commits and compute generation numbers locally, just like everybody else does. As generation numbers won't have to be global (because we will not be embedding them in objects), nobody gets hurt if they do not match across repositories---just like often-mentioned rename detection cache, it can be kept as a mere local performance aid and does not have to participate in the object model. > All that said, for simplicity I still lean against including > generation numbers as part of a hash function transition. Good. > This is unrelated to Brandon's message, except for his use of SHA3 as > a placeholder for "the next hash function". 
> > My assumption based on previous conversations (and other external > conversations like [1]) is that we are going to use SHA2-256 and have > a pretty strong consensus for that. Don't worry! Hmph, I actually re-read the thread recently, and my impression was that we didn't quite have a consensus but were leaning towards SHA3-256. I do not personally have a strong preference myself and I would say that anything will do as long as it is with good longevity and availability. SHA2 family would be a fine choice due to its age on both counts, being scrutinized longer and having a chance to be implemented in many places, even though its age itself may have to be subtracted from the longevity factor. Thanks. ^ permalink raw reply [flat|nested] 49+ messages in thread
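Junio's tangent above, treating the cut-off points of a shallow clone as root commits and computing generation numbers locally, can be sketched as follows (illustrative Python, not Git code; the generation of a root is taken to be 1, and of any other commit 1 plus the maximum generation of its parents):

```python
# Hypothetical shallow history: "B" sits at the shallow boundary, so
# its (unavailable) parents are ignored and it is treated as a root.
parents = {
    "B": [],
    "C": ["B"],
    "D": ["B"],
    "E": ["C", "D"],
}

def generation(commit):
    ps = parents[commit]
    if not ps:
        return 1                      # cut-off point treated as a root
    return 1 + max(generation(p) for p in ps)

assert generation("E") == 3           # E -> C/D -> B
```

Because these numbers are local (the cut-off point resets the count), they cannot match numbers computed elsewhere, which is exactly why they can only serve as a local performance aid and not participate in the object model.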
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 21:52 ` Junio C Hamano @ 2017-09-13 22:07 ` Stefan Beller 2017-09-13 22:18 ` Jonathan Nieder 2017-09-13 22:15 ` Junio C Hamano 1 sibling, 1 reply; 49+ messages in thread From: Stefan Beller @ 2017-09-13 22:07 UTC (permalink / raw) To: Junio C Hamano Cc: Jonathan Nieder, Johannes Schindelin, Brandon Williams, Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King, David Lang, brian m. carlson On Wed, Sep 13, 2017 at 2:52 PM, Junio C Hamano <gitster@pobox.com> wrote: > Jonathan Nieder <jrnieder@gmail.com> writes: > >> Treating generation numbers as derived data (as in Jeff King's >> preferred design, if I have understood his replies correctly) would >> also be possible but it does not interact well with shallow clone or >> narrow clone. > > Just like we have skewed committer timestamps, there is no reason to > believe that generation numbers embedded in objects are trustable, > and there is no way for narrow clients to even verify their correctness. > > So I agree with Peff that having generation numbers in object is > pointless; I agree any other derivables like corresponding sha-1 > name is also pointless to have. > > This is a tangent, but it may be fine for a shallow clone to treat > the cut-off points in the history as if they are root commits and > compute generation numbers locally, just like everybody else does. > As generation numbers won't have to be global (because we will not > be embedding them in objects), nobody gets hurt if they do not match > across repositories---just like often-mentioned rename detection > cache, it can be kept as a mere local performance aid and does not > have to participate in the object model. Locally it helps for some operations such as correct walks. For the network case however, it doesn't really help either. 
If we had global generation numbers, one could imagine that they are used in the pack negotiation (server advertises the maximum generation number or even gen number per branch; client could binary search in there for the fork point) I wonder if locally generated generation numbers (for the shallow case) could be used somehow to still improve network operations. >> My assumption based on previous conversations (and other external >> conversations like [1]) is that we are going to use SHA2-256 and have >> a pretty strong consensus for that. Don't worry! > > Hmph, I actually re-read the thread recently, and my impression was > that we didn't quite have a consensus but were leaning towards > SHA3-256. > > I do not personally have a strong preference myself and I would say > that anything will do as long as it is with good longevity and > availability. SHA2 family would be a fine choice due to its age on > both counts, being scrutinized longer and having a chance to be > implemented in many places, even though its age itself may have to > be subtracted from the longevity factor. If we'd get the transition somewhat right, the next transition will be easier than the current transition, such that I am not that concerned about longevity. I am rather concerned about the complexity that is added to the code base (whilst accumulating technical debt instead of clearer abstraction layers) ^ permalink raw reply [flat|nested] 49+ messages in thread
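The negotiation idea floated above is speculative (no such protocol exists), but the client-side binary search could look roughly like this, assuming the server advertised the generation numbers along a branch in ascending order:

```python
import bisect

# Speculative sketch: assume the server advertised the generation
# numbers along one branch; the client binary-searches for the newest
# advertised commit whose generation does not exceed its own tip's.
branch_gens = [1, 2, 3, 4, 7, 11, 18]   # hypothetical advertisement
client_tip_gen = 9                      # generation of the client's tip

i = bisect.bisect_right(branch_gens, client_tip_gen) - 1
fork_point_gen = branch_gens[i]         # candidate fork point
assert fork_point_gen == 7
```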
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 22:07 ` Stefan Beller @ 2017-09-13 22:18 ` Jonathan Nieder 2017-09-14 2:13 ` Junio C Hamano 0 siblings, 1 reply; 49+ messages in thread From: Jonathan Nieder @ 2017-09-13 22:18 UTC (permalink / raw) To: Stefan Beller Cc: Junio C Hamano, Johannes Schindelin, Brandon Williams, Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi, Stefan Beller wrote: > On Wed, Sep 13, 2017 at 2:52 PM, Junio C Hamano <gitster@pobox.com> wrote: >> This is a tangent, but it may be fine for a shallow clone to treat >> the cut-off points in the history as if they are root commits and >> compute generation numbers locally, just like everybody else does. [...] > Locally it helps for some operations such as correct walks. > For the network case however, it doesn't really help either. > > If we had global generation numbers, one could imagine that they > are used in the pack negotiation (server advertises the maximum > generation number or even gen number per branch; client > could binary search in there for the fork point) > > I wonder if locally generated generation numbers (for the shallow > case) could be used somehow to still improve network operations. I have a different concern about locally generated generation numbers in a shallow clone. My concern is that it is slow to recompute them when deepening the shallow clone. However: 1. That only affects performance and for some use cases could be mitigated e.g. by introducing some laziness, and, more convincingly, 2. With a small protocol change, the server could communicate the generation numbers for commit objects at the edge of a shallow clone, avoiding this trouble. So I am not too concerned. More generally, unless there is a very very compelling reason to, I don't want to couple other changes into the hash function transition. 
If they're worthwhile enough to do, they're worthwhile enough to do whether we're transitioning to a new hash function or not: I have not heard a convincing example yet of a "while at it" that is worth the complexity of such coupling. (That said, if two format changes are worth doing and happen to be implemented at the same time, then we can save users the trouble of experiencing two format change transitions. That is a kind of coupling from the end user's point of view. But from the perspective of someone writing the code, there is no need to count on that, and it is not likely to happen anyway.) > If we'd get the transition somewhat right, the next transition will > be easier than the current transition, such that I am not that concerned > about longevity. I am rather concerned about the complexity that is added > to the code base (whilst accumulating technical debt instead of clearer > abstraction layers) During the transition, users have to suffer reencoding overhead, so it is not good for such transitions to need to happen very often. If the new hash function breaks early, then we have to cope with it and as you say, having the framework in place means we'd be ready for that. But I still don't want the chosen hash function to break early. In other words, a long lifetime for the hash absolutely is a design goal. Coping well with an unexpectedly short lifetime for the hash is also a design goal. If the hash function lasts 10 years then I am happy. Thanks, Jonathan ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 22:18 ` Jonathan Nieder @ 2017-09-14 2:13 ` Junio C Hamano 2017-09-14 15:23 ` Johannes Schindelin 0 siblings, 1 reply; 49+ messages in thread From: Junio C Hamano @ 2017-09-14 2:13 UTC (permalink / raw) To: Jonathan Nieder Cc: Stefan Beller, Johannes Schindelin, Brandon Williams, Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King, David Lang, brian m. carlson Jonathan Nieder <jrnieder@gmail.com> writes: > In other words, a long lifetime for the hash absolutely is a design > goal. Coping well with an unexpectedly short lifetime for the hash is > also a design goal. > > If the hash function lasts 10 years then I am happy. Absolutely. When two functions have similar expected remaining life and are equally widely supported, then faster is better than slower. Otherwise our primary goal when picking the function from candidates should be to optimize for its remaining life and wider availability. Thanks. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 2:13 ` Junio C Hamano @ 2017-09-14 15:23 ` Johannes Schindelin 2017-09-14 15:45 ` demerphq 0 siblings, 1 reply; 49+ messages in thread From: Johannes Schindelin @ 2017-09-14 15:23 UTC (permalink / raw) To: Junio C Hamano Cc: Jonathan Nieder, Stefan Beller, Brandon Williams, Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi Junio, On Thu, 14 Sep 2017, Junio C Hamano wrote: > Jonathan Nieder <jrnieder@gmail.com> writes: > > > In other words, a long lifetime for the hash absolutely is a design > > goal. Coping well with an unexpectedly short lifetime for the hash is > > also a design goal. > > > > If the hash function lasts 10 years then I am happy. > > Absolutely. When two functions have similar expected remaining life > and are equally widely supported, then faster is better than slower. > Otherwise our primary goal when picking the function from candidates > should be to optimize for its remaining life and wider availability. SHA-256 has been hammered on a lot more than SHA3-256. That would be a strong point in favor of SHA2. Ciao, Dscho ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 15:23 ` Johannes Schindelin @ 2017-09-14 15:45 ` demerphq 2017-09-14 22:06 ` Johannes Schindelin 0 siblings, 1 reply; 49+ messages in thread From: demerphq @ 2017-09-14 15:45 UTC (permalink / raw) To: Johannes Schindelin Cc: Junio C Hamano, Jonathan Nieder, Stefan Beller, Brandon Williams, Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King, David Lang, brian m. carlson On 14 September 2017 at 17:23, Johannes Schindelin <Johannes.Schindelin@gmx.de> wrote: > Hi Junio, > > On Thu, 14 Sep 2017, Junio C Hamano wrote: > >> Jonathan Nieder <jrnieder@gmail.com> writes: >> >> > In other words, a long lifetime for the hash absolutely is a design >> > goal. Coping well with an unexpectedly short lifetime for the hash is >> > also a design goal. >> > >> > If the hash function lasts 10 years then I am happy. >> >> Absolutely. When two functions have similar expected remaining life >> and are equally widely supported, then faster is better than slower. >> Otherwise our primary goal when picking the function from candidates >> should be to optimize for its remaining life and wider availability. > > SHA-256 has been hammered on a lot more than SHA3-256. Last year that was even more true of SHA1 than it is true of SHA-256 today. Anyway, Yves -- perl -Mre=debug -e "/just|another|perl|hacker/" ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 15:45 ` demerphq @ 2017-09-14 22:06 ` Johannes Schindelin 0 siblings, 0 replies; 49+ messages in thread From: Johannes Schindelin @ 2017-09-14 22:06 UTC (permalink / raw) To: demerphq Cc: Junio C Hamano, Jonathan Nieder, Stefan Beller, Brandon Williams, Linus Torvalds, Git Mailing List, Jonathan Tan, Jeff King, David Lang, brian m. carlson Hi, On Thu, 14 Sep 2017, demerphq wrote: > On 14 September 2017 at 17:23, Johannes Schindelin > <Johannes.Schindelin@gmx.de> wrote: > > > > SHA-256 has been hammered on a lot more than SHA3-256. > > Last year that was even more true of SHA1 than it is true of SHA-256 > today. I hope you are not deliberately trying to annoy me. I say that because you seemed to be interested enough in cryptography to know that the known attacks on SHA-256 *today* are unlikely to extend to Git's use case, whereas the known attacks on SHA-1 *in 2005* were already raising doubts. So while SHA-1 has been hammered on for longer than SHA-256, the latter came out a lot less scathed than the former. Besides, you are totally missing the point here that the choice is *not* between SHA-1 and SHA-256, but between SHA-256 and SHA3-256. After all, we would not consider any hash algorithm with known problems (as far as Git's usage is concerned). The amount of scrutiny with which the algorithm was investigated would only be a deciding factor among the remaining choices, yes? In any case, don't trust me on cryptography (just like I do not trust you on that matter). Trust the cryptographers. I contacted some of my colleagues who are responsible for crypto, and the two who seem to disagree on pretty much everything agreed on this one thing: that SHA-256 would be a good choice for Git (and one of them suggested that it would be much better than SHA3-256, because SHA-256 saw more cryptanalysis). Ciao, Johannes ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 21:52 ` Junio C Hamano 2017-09-13 22:07 ` Stefan Beller @ 2017-09-13 22:15 ` Junio C Hamano 2017-09-13 22:27 ` Jonathan Nieder 1 sibling, 1 reply; 49+ messages in thread From: Junio C Hamano @ 2017-09-13 22:15 UTC (permalink / raw) To: Jonathan Nieder Cc: Johannes Schindelin, Brandon Williams, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Junio C Hamano <gitster@pobox.com> writes: > Jonathan Nieder <jrnieder@gmail.com> writes: > >> Treating generation numbers as derived data (as in Jeff King's >> preferred design, if I have understood his replies correctly) would >> also be possible but it does not interact well with shallow clone or >> narrow clone. > > Just like we have skewed committer timestamps, there is no reason to > believe that generation numbers embedded in objects are trustable, > and there is no way for narrow clients to even verify their correctness. > > So I agree with Peff that having generation numbers in object is > pointless; I agree any other derivables like corresponding sha-1 > name is also pointless to have. > > This is a tangent, but it may be fine for a shallow clone to treat > the cut-off points in the history as if they are root commits and > compute generation numbers locally, just like everybody else does. > As generation numbers won't have to be global (because we will not > be embedding them in objects), nobody gets hurt if they do not match > across repositories---just like often-mentioned rename detection > cache, it can be kept as a mere local performance aid and does not > have to participate in the object model. > >> All that said, for simplicity I still lean against including >> generation numbers as part of a hash function transition. > > Good. In the proposed transition plan, the treatment of various signatures (deliberately) makes the conversion not quite roundtrip. 
When existing SHA-1 history in individual clones is converted to NewHash, we obviously cannot re-sign the corresponding NewHash contents with the same PGP key, so these converted objects will carry only a signature on the SHA-1 contents. They can still be validated when they are exported back to the SHA-1 world via the fetch/push protocol, and can be validated locally by converting them back to SHA-1 contents and then passing the result to gpgv. The plan also states, if I remember what I read correctly, that newly created and signed objects (this includes signed commits and signed tags; mergetags merely carry over what the tag object that was merged was signed with, so we do not have to worry about them unless the resulting commit that has mergetag is signed itself, but that is already covered by how we handle signed commits) would be signed both for NewHash contents and their corresponding SHA-1 contents (after internally converting it to SHA-1 contents). That would allow us to strip the signature over NewHash contents and derive the SHA-1 contents to be shown to the outside world while migration is going on, and I'd imagine it would be a good practice; it would allow us to sign something that allows everybody to verify, when some participants of the project are not yet NewHash capable. But the signing over SHA-1 contents has to stop at some point, when everybody's Git becomes completely unaware of SHA-1. We may want to have a guideline in the transition plan to (1) encourage signing for both for quite some time, and (2) the criteria for us to decide when to stop. Thanks. ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 22:15 ` Junio C Hamano @ 2017-09-13 22:27 ` Jonathan Nieder 2017-09-14 2:10 ` Junio C Hamano 0 siblings, 1 reply; 49+ messages in thread From: Jonathan Nieder @ 2017-09-13 22:27 UTC (permalink / raw) To: Junio C Hamano Cc: Johannes Schindelin, Brandon Williams, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Junio C Hamano wrote: > In the proposed transition plan, the treatment of various signatures > (deliberately) makes the conversion not quite roundtrip. That's not precisely true. Details below. > When existing SHA-1 history in individual clones is converted to > NewHash, we obviously cannot re-sign the corresponding NewHash > contents with the same PGP key, so these converted objects will > carry only a signature on the SHA-1 contents. They can still be validated > when they are exported back to the SHA-1 world via the fetch/push > protocol, and can be validated locally by converting them back to > SHA-1 contents and then passing the result to gpgv. Correct. > The plan also states, if I remember what I read correctly, that > newly created and signed objects (this includes signed commits and > signed tags; mergetags merely carry over what the tag object that > was merged was signed with, so we do not have to worry about them > unless the resulting commit that has mergetag is signed itself, but > that is already covered by how we handle signed commits) would be > signed both for NewHash contents and their corresponding SHA-1 > contents (after internally converting it to SHA-1 contents). Also correct. > That would allow us to strip the signature over NewHash contents and > derive the SHA-1 contents to be shown to the outside world while > migration is going on, and I'd imagine it would be a good practice; > it would allow us to sign something that allows everybody to verify, > when some participants of the project are not yet NewHash capable.
The NewHash-based signature is included in the SHA-1 content as well, for the sake of round-tripping. It is not stripped out. > But the signing over SHA-1 contents has to stop at some point, when > everybody's Git becomes completely unaware of SHA-1. We may want to > have a guideline in the transition plan to (1) encourage signing for > both for quite some time, and (2) the criteria for us to decide when > to stop. Yes, spelling out a rough schedule is a good idea. I'll add that. A version of Git that is aware of NewHash should be able to verify NewHash signatures even for users that are using SHA-1 locally for the sake of faster fetches and pushes to SHA-1 based peers. In addition to a new enough Git, this requires the translation table to translate to NewHash to be present. So the criterion (2) is largely based on how up-to-date the Git used by users wanting to verify signatures is and whether they are willing to tolerate the performance implications of having a translation table. My hope is that when communicating with peers using the same hash function, the translation table will not add too much performance overhead. Thank you, Jonathan ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 22:27 ` Jonathan Nieder @ 2017-09-14 2:10 ` Junio C Hamano 0 siblings, 0 replies; 49+ messages in thread From: Junio C Hamano @ 2017-09-14 2:10 UTC (permalink / raw) To: Jonathan Nieder Cc: Johannes Schindelin, Brandon Williams, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Jonathan Nieder <jrnieder@gmail.com> writes: > The NewHash-based signature is included in the SHA-1 content as well, > for the sake of round-tripping. It is not stripped out. Ah, OK, that allays my worries. We rely on the fact that unknown object headers from the future are ignored. We use something other than "gpgsig" header (say, "gpgsigN") to store NewHash based signature on a commit object created in the NewHash world, so that SHA-1 clients will ignore it but still include in the signature computation---is that the idea? Existing versions of Git that live in the SHA-1 world may still need to learn to ignore/drop "gpgsigN" while amending a commit that originally was created in the NewHash world. Or to force upgrade we may freeze the SHA-1 only versions of Git and stop updating them altogether. I dunno. We also need to use something other than "mergetag" when carrying over the contents of a tag being merged in the NewHash world, but I'd imagine that you've thought about this already. Thanks. ^ permalink raw reply [flat|nested] 49+ messages in thread
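Junio's scheme relies on the fact that Git ignores commit headers it does not recognize while still including them in what gets hashed and signed. A sketch of that property (the "gpgsigN" header name is Junio's placeholder from this discussion, not a settled format; the object below is hand-written for illustration):

```python
# The property relied on: clients skip commit headers they do not
# recognize.  "gpgsigN" is a placeholder name, not a settled format.

commit_sha1_content = b"""tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904
author A U Thor <author@example.com> 1505340000 +0000
committer A U Thor <author@example.com> 1505340000 +0000
gpgsigN -----BEGIN PGP SIGNATURE-----
 ...signature over the NewHash contents...
 -----END PGP SIGNATURE-----

a commit created in the NewHash world
"""

def header_keys(content):
    keys = []
    for line in content.split(b"\n"):
        if line == b"":                # blank line ends the header section
            break
        if not line.startswith(b" "):  # skip continuation lines
            keys.append(line.split(b" ", 1)[0])
    return keys

recognized = {b"tree", b"parent", b"author", b"committer", b"gpgsig"}
unknown = [k for k in header_keys(commit_sha1_content) if k not in recognized]
assert unknown == [b"gpgsigN"]         # an old client would simply skip it
```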
* Re: RFC v3: Another proposed hash function transition plan 2017-09-13 16:30 ` Jonathan Nieder 2017-09-13 21:52 ` Junio C Hamano @ 2017-09-14 12:39 ` Johannes Schindelin 2017-09-14 16:36 ` Brandon Williams 2017-09-14 18:49 ` Jonathan Nieder 1 sibling, 2 replies; 49+ messages in thread From: Johannes Schindelin @ 2017-09-14 12:39 UTC (permalink / raw) To: Jonathan Nieder Cc: Brandon Williams, Junio C Hamano, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Hi Jonathan, On Wed, 13 Sep 2017, Jonathan Nieder wrote: > As a side note, I am probably misreading, but I found this set of > paragraphs a bit condescending. It sounds to me like you are saying > "You are making the wrong choice of hash function and everything else > you are describing is irrelevant when compared to that monumental > mistake. Please stop working on things I don't consider important". > With that reading it is quite demotivating to read. I am sorry you read it that way. I did not feel condescending when I wrote that mail, I felt annoyed by the side track, and anxious. In my mind, the transition is too important for side tracking, and I worry that we are not fast enough (imagine what would happen if a better attack was discovered that is not as easily detected as the one we know about?). > An alternative reading is that you are saying that the transition plan > described in this thread is not ironed out. Can you spell that out > more? What particular aspect of the transition plan (which is of > course orthogonal to the choice of hash function) are you discontent > with? My impression from reading Junio's mail was that he does not consider the transition plan ironed out yet, and that he wants to spend time on discussing generation numbers right now. I was particularly frightened by the suggestion to "reboot" [*1*]. Hopefully I misunderstand and he meant "finishing touches" instead. 
As to *my* opinion: after reading https://goo.gl/gh2Mzc (is it really correct that its last update has been on March 6th?), my only concern is really that it still talks about SHA3-256 when I think that the performance benefits of SHA-256 (think: "Git at scale", and also hardware support) really make the latter a better choice. In order to be "ironed out", I think we need to talk about the implementation detail "Translation table". This is important. It needs to be *fast*. Speaking of *fast*, I could imagine that it would make sense to store the SHA-1 objects on disk, still, instead of converting them on the fly. I am not sure whether this is something we need to define in the document, though, as it may very well be premature optimization; Maybe mention that we could do this if necessary? Apart from that, I would *love* to see this document as The Official Plan that I can Show To The Manager so that I can ask to Allocate Time. Ciao, Dscho Footnote *1*: https://public-inbox.org/git/xmqqa828733s.fsf@gitster.mtv.corp.google.com/ ^ permalink raw reply [flat|nested] 49+ messages in thread
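The translation table itself can be conceptually simple even if making it fast is the hard part; a minimal sketch of the bidirectional sha1 <-> NewHash mapping follows (illustrative only — the hashes are fake and the on-disk format was not settled in this thread):

```python
import bisect

# Minimal sketch of the bidirectional translation table: two sorted
# lists allow O(log n) lookup in either direction.
pairs = [
    ("1111" * 10, "aaaa" * 16),   # (sha1-name, newhash-name), hex
    ("2222" * 10, "bbbb" * 16),
    ("3333" * 10, "cccc" * 16),
]

by_sha1 = sorted(pairs)
by_newhash = sorted((n, s) for s, n in pairs)

def lookup(table, key):
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None                    # object not in the table

assert lookup(by_sha1, "2222" * 10) == "bbbb" * 16
assert lookup(by_newhash, "cccc" * 16) == "3333" * 10
```

A real implementation would presumably live alongside the pack index and be memory-mapped rather than built in memory, but the lookup discipline is the same.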
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 12:39 ` Johannes Schindelin @ 2017-09-14 16:36 ` Brandon Williams 2017-09-14 18:49 ` Jonathan Nieder 1 sibling, 0 replies; 49+ messages in thread From: Brandon Williams @ 2017-09-14 16:36 UTC (permalink / raw) To: Johannes Schindelin Cc: Jonathan Nieder, Junio C Hamano, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson On 09/14, Johannes Schindelin wrote: > Hi Jonathan, > > On Wed, 13 Sep 2017, Jonathan Nieder wrote: > > > As a side note, I am probably misreading, but I found this set of > > paragraphs a bit condescending. It sounds to me like you are saying > > "You are making the wrong choice of hash function and everything else > > you are describing is irrelevant when compared to that monumental > > mistake. Please stop working on things I don't consider important". > > With that reading it is quite demotivating to read. > > I am sorry you read it that way. I did not feel condescending when I wrote > that mail, I felt annoyed by the side track, and anxious. In my mind, the > transition is too important for side tracking, and I worry that we are not > fast enough (imagine what would happen if a better attack was discovered > that is not as easily detected as the one we know about?). > > > An alternative reading is that you are saying that the transition plan > > described in this thread is not ironed out. Can you spell that out > > more? What particular aspect of the transition plan (which is of > > course orthogonal to the choice of hash function) are you discontent > > with? > > My impression from reading Junio's mail was that he does not consider the > transition plan ironed out yet, and that he wants to spend time on > discussing generation numbers right now. > > I was particularly frightened by the suggestion to "reboot" [*1*]. > Hopefully I misunderstand and he meant "finishing touches" instead. 
> > As to *my* opinion: after reading https://goo.gl/gh2Mzc (is it really > correct that its last update has been on March 6th?), my only concern is > really that it still talks about SHA3-256 when I think that the > performance benefits of SHA-256 (think: "Git at scale", and also hardware > support) really make the latter a better choice. > > In order to be "ironed out", I think we need to talk about the > implementation detail "Translation table". This is important. It needs to > be *fast*. Agreed, when that document was written it was hand waved as an implementation detail, but we should probably start ironing out those details soon so that we have a concrete plan in place. > > Speaking of *fast*, I could imagine that it would make sense to store the > SHA-1 objects on disk, still, instead of converting them on the fly. I am > not sure whether this is something we need to define in the document, > though, as it may very well be premature optimization; Maybe mention that > we could do this if necessary? > > Apart from that, I would *love* to see this document as The Official Plan > that I can Show To The Manager so that I can ask to Allocate Time. Speaking of having a concrete plan, we discussed in the office the other day about finally converting the doc into a Documentation patch. That was always our intention, but after writing up the doc we got busy working on other projects. Getting it in as a patch (with a more concrete road map) is probably the next step we'd need to take. I do want to echo what Jonathan has said in other parts of this thread, that the transition plan itself doesn't depend on which hash function we end up going with in the end. I fully expect that for the transition plan to succeed we'll have infrastructure for dropping in different hash functions so that we can do some sort of benchmarking before selecting one to use. This would also give us the ability to more easily transition to another hash function when the time comes. 
-- Brandon Williams ^ permalink raw reply [flat|nested] 49+ messages in thread
* Re: RFC v3: Another proposed hash function transition plan 2017-09-14 12:39 ` Johannes Schindelin 2017-09-14 16:36 ` Brandon Williams @ 2017-09-14 18:49 ` Jonathan Nieder 2017-09-15 20:42 ` Philip Oakley 1 sibling, 1 reply; 49+ messages in thread From: Jonathan Nieder @ 2017-09-14 18:49 UTC (permalink / raw) To: Johannes Schindelin Cc: Brandon Williams, Junio C Hamano, Linus Torvalds, Git Mailing List, Stefan Beller, jonathantanmy, Jeff King, David Lang, brian m. carlson Johannes Schindelin wrote: > On Wed, 13 Sep 2017, Jonathan Nieder wrote: >> As a side note, I am probably misreading, but I found this set of >> paragraphs a bit condescending. It sounds to me like you are saying >> "You are making the wrong choice of hash function and everything else >> you are describing is irrelevant when compared to that monumental >> mistake. Please stop working on things I don't consider important". >> With that reading it is quite demotivating to read. > > I am sorry you read it that way. I did not feel condescending when I wrote > that mail, I felt annoyed by the side track, and anxious. In my mind, the > transition is too important for side tracking, and I worry that we are not > fast enough (imagine what would happen if a better attack was discovered > that is not as easily detected as the one we know about?). Thanks for clarifying. That makes sense. [...] > As to *my* opinion: after reading https://goo.gl/gh2Mzc (is it really > correct that its last update has been on March 6th?), my only concern is > really that it still talks about SHA3-256 when I think that the > performance benefits of SHA-256 (think: "Git at scale", and also hardware > support) really make the latter a better choice. > > In order to be "ironed out", I think we need to talk about the > implementation detail "Translation table". This is important. It needs to > be *fast*. 
>
> Speaking of *fast*, I could imagine that it would make sense to store the
> SHA-1 objects on disk, still, instead of converting them on the fly. I am
> not sure whether this is something we need to define in the document,
> though, as it may very well be premature optimization; Maybe mention that
> we could do this if necessary?
>
> Apart from that, I would *love* to see this document as The Official Plan
> that I can Show To The Manager so that I can ask to Allocate Time.

Sounds promising!

Thanks much for this feedback. This is very helpful for knowing what
v4 of the doc needs.

The discussion of the translation table in [1] didn't make it to the
doc. You're right that it needs to.

Caching SHA-1 objects (and the pros and cons involved) makes sense to
mention in an "ideas for future work" section.

An implementation plan with well-defined pieces for people to take on
and estimates of how much work each involves may be useful for Showing
To The Manager. So I'll include a sketch of that for reviewers to
poke holes in, too.

Another thing the doc doesn't currently describe is how Git protocol
would work. That's worth sketching in a "future work" section as
well.

Sorry it has been taking so long to get this out. I think we should
have something ready to send on Monday.

Thanks,
Jonathan

[1] https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/

^ permalink raw reply	[flat|nested] 49+ messages in thread
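[Editorial note: the "translation table" under discussion maps each object's SHA-1 name to its SHA-256 name and back, so the two can be used interchangeably. As a rough illustration of the idea only, here is a sketch with hypothetical names; Git's real table would need a fast on-disk representation (sorted, binary-searchable, mmap-friendly), which is exactly the implementation detail the thread says still needs ironing out.]

```python
import hashlib

class TranslationTable:
    """Bidirectional SHA-1 <-> SHA-256 name mapping (in-memory sketch).

    Two dicts give O(1) lookup in either direction; the persistence
    format is deliberately out of scope here.
    """

    def __init__(self):
        self._sha1_to_sha256 = {}
        self._sha256_to_sha1 = {}

    def record(self, payload: bytes) -> None:
        """Hash the same payload under both algorithms and link the names."""
        h1 = hashlib.sha1(payload).hexdigest()
        h256 = hashlib.sha256(payload).hexdigest()
        self._sha1_to_sha256[h1] = h256
        self._sha256_to_sha1[h256] = h1

    def to_sha256(self, sha1_name: str) -> str:
        return self._sha1_to_sha256[sha1_name]

    def to_sha1(self, sha256_name: str) -> str:
        return self._sha256_to_sha1[sha256_name]
```

Every lookup a mixed-hash operation performs (e.g. rewriting object references while converting between repository formats) goes through a table like this, which is why its speed dominates the conversion cost.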
* Re: RFC v3: Another proposed hash function transition plan
  2017-09-14 18:49 ` Jonathan Nieder
@ 2017-09-15 20:42 ` Philip Oakley
  0 siblings, 0 replies; 49+ messages in thread
From: Philip Oakley @ 2017-09-15 20:42 UTC (permalink / raw)
To: Jonathan Nieder, Johannes Schindelin
Cc: Brandon Williams, Junio C Hamano, Linus Torvalds, Git Mailing List,
	Stefan Beller, jonathantanmy, Jeff King, David Lang,
	brian m. carlson

Hi Jonathan,

"Jonathan Nieder" <jrnieder@gmail.com> wrote;
> Johannes Schindelin wrote:
>> On Wed, 13 Sep 2017, Jonathan Nieder wrote:
>
>>> As a side note, I am probably misreading, but I found this set of
>>> paragraphs a bit condescending. It sounds to me like you are saying
>>> "You are making the wrong choice of hash function and everything else
>>> you are describing is irrelevant when compared to that monumental
>>> mistake. Please stop working on things I don't consider important".
>>> With that reading it is quite demotivating to read.
>>
>> I am sorry you read it that way. I did not feel condescending when I
>> wrote
>> that mail, I felt annoyed by the side track, and anxious. In my mind, the
>> transition is too important for side tracking, and I worry that we are
>> not
>> fast enough (imagine what would happen if a better attack was discovered
>> that is not as easily detected as the one we know about?).
>
> Thanks for clarifying. That makes sense.
>
> [...]
>> As to *my* opinion: after reading https://goo.gl/gh2Mzc (is it really
>> correct that its last update has been on March 6th?), my only concern is
>> really that it still talks about SHA3-256 when I think that the
>> performance benefits of SHA-256 (think: "Git at scale", and also hardware
>> support) really make the latter a better choice.
>>
>> In order to be "ironed out", I think we need to talk about the
>> implementation detail "Translation table". This is important. It needs to
>> be *fast*.
>>
>> Speaking of *fast*, I could imagine that it would make sense to store the
>> SHA-1 objects on disk, still, instead of converting them on the fly. I am
>> not sure whether this is something we need to define in the document,
>> though, as it may very well be premature optimization; Maybe mention that
>> we could do this if necessary?
>>
>> Apart from that, I would *love* to see this document as The Official Plan
>> that I can Show To The Manager so that I can ask to Allocate Time.
>
> Sounds promising!
>
> Thanks much for this feedback. This is very helpful for knowing what
> v4 of the doc needs.
>
> The discussion of the translation table in [1] didn't make it to the
> doc. You're right that it needs to.
>
> Caching SHA-1 objects (and the pros and cons involved) makes sense to
> mention in an "ideas for future work" section.
>
> An implementation plan with well-defined pieces for people to take on
> and estimates of how much work each involves may be useful for Showing
> To The Manager. So I'll include a sketch of that for reviewers to
> poke holes in, too.
>
> Another thing the doc doesn't currently describe is how Git protocol
> would work. That's worth sketching in a "future work" section as
> well.
>
> Sorry it has been taking so long to get this out. I think we should
> have something ready to send on Monday.

I had a look at the current doc https://goo.gl/gh2Mzc and thought that
the selection of the "NewHash" should be separated out into a section of
its own as a 'separation of concerns', so that the general transition
plan only refers to the "NewHash", so as not to accidentally pre-judge
that selection.
I did look up the arguments regarding sha2 (sha256) versus sha3-256 and
found these two Q&A items
https://security.stackexchange.com/questions/152360/should-we-be-using-sha3-2017
https://security.stackexchange.com/questions/86283/how-does-sha3-keccak-shake-compare-to-sha2-should-i-use-non-shake-parameter
with an onward link to this:
https://www.imperialviolet.org/2012/10/21/nist.html
"NIST may not have you in mind (21 Oct 2012)"

"A couple of weeks back, NIST announced that Keccak would be SHA-3.
Keccak has somewhat disappointing software performance but is a gift to
hardware implementations."

which does appear to cover some of the concerns that dscho had noted,
and speed does appear to be a core Git selling point.

It would be worth at least covering these trade-offs in the "select a
NewHash" section of the document, as at the end of the day it will be a
political judgement about what the future might hold regarding the
contenders.

What may also be worth noting is the fall-back plan should the chosen
NewHash be the first to fail, perhaps spectacularly, as having a ready
plan could support the choice at risk.

> Thanks,
> Jonathan
>
> [1]
> https://public-inbox.org/git/CAJo=hJtoX9=AyLHHpUJS7fueV9ciZ_MNpnEPHUz8Whui6g9F0A@mail.gmail.com/

--
Philip

^ permalink raw reply	[flat|nested] 49+ messages in thread
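[Editorial note: the software-performance gap between the candidates that dscho and the linked articles describe is easy to measure directly. The sketch below uses Python's hashlib and timeit as stand-ins for a real benchmark harness; it only illustrates how one might compare throughput on a single machine, and the actual ranking depends on the CPU and on whether SHA extensions are available in hardware.]

```python
import hashlib
import timeit

def throughput(algo: str, size: int = 1 << 20, reps: int = 20) -> float:
    """Return the seconds taken to hash `reps` buffers of `size` bytes
    with the named hashlib algorithm."""
    payload = b"x" * size
    ctor = getattr(hashlib, algo)
    return timeit.timeit(lambda: ctor(payload).digest(), number=reps)

# Compare the contenders discussed in this thread on identical input.
for algo in ("sha1", "sha256", "sha3_256"):
    print(f"{algo:9s} {throughput(algo):.3f}s for 20 x 1 MiB")
```

On typical hardware without SHA acceleration, SHA3-256 tends to be the slowest of the three in software, which is the trade-off the imperialviolet article alludes to; numbers on any particular machine may differ.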
end of thread, back to index

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-04  1:12 RFC: Another proposed hash function transition plan Jonathan Nieder
2017-03-05  2:35 ` Linus Torvalds
2017-03-07  0:17 ` RFC v3: " Jonathan Nieder
2017-03-09 19:14 ` Shawn Pearce
2017-03-09 20:24 ` Jonathan Nieder
2017-03-10 19:38 ` Jeff King
2017-03-10 19:55 ` Jonathan Nieder
2017-09-06  6:28 ` Junio C Hamano
2017-09-08  2:40 ` Junio C Hamano
2017-09-08  3:34 ` Jeff King
2017-09-11 18:59 ` Brandon Williams
2017-09-13 12:05 ` Johannes Schindelin
2017-09-13 13:43 ` demerphq
2017-09-13 22:51 ` Jonathan Nieder
2017-09-14 18:26 ` Johannes Schindelin
2017-09-14 18:40 ` Jonathan Nieder
2017-09-14 22:09 ` Johannes Schindelin
2017-09-13 23:30 ` Linus Torvalds
2017-09-14 18:45 ` Johannes Schindelin
2017-09-18 12:17 ` Gilles Van Assche
2017-09-18 22:16 ` Johannes Schindelin
2017-09-19 16:45 ` Gilles Van Assche
2017-09-29 13:17 ` Johannes Schindelin
2017-09-29 14:54 ` Joan Daemen
2017-09-29 22:33 ` Johannes Schindelin
2017-09-30 22:02 ` Joan Daemen
2017-10-02 14:26 ` Johannes Schindelin
2017-09-18 22:25 ` Jonathan Nieder
2017-09-26 17:05 ` Jason Cooper
2017-09-26 22:11 ` Johannes Schindelin
2017-09-26 23:51 ` Jonathan Nieder
2017-10-02 14:54 ` Jason Cooper
2017-10-02 16:50 ` Brandon Williams
2017-10-02 14:00 ` Jason Cooper
2017-10-02 17:18 ` Linus Torvalds
2017-10-02 19:37 ` Jeff King
2017-09-13 16:30 ` Jonathan Nieder
2017-09-13 21:52 ` Junio C Hamano
2017-09-13 22:07 ` Stefan Beller
2017-09-13 22:18 ` Jonathan Nieder
2017-09-14  2:13 ` Junio C Hamano
2017-09-14 15:23 ` Johannes Schindelin
2017-09-14 15:45 ` demerphq
2017-09-14 22:06 ` Johannes Schindelin
2017-09-13 22:15 ` Junio C Hamano
2017-09-13 22:27 ` Jonathan Nieder
2017-09-14  2:10 ` Junio C Hamano
2017-09-14 12:39 ` Johannes Schindelin
2017-09-14 16:36 ` Brandon Williams
2017-09-14 18:49 ` Jonathan Nieder
2017-09-15 20:42 ` Philip Oakley
Git Mailing List Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/git/0 git/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 git git/ https://lore.kernel.org/git \
		git@vger.kernel.org
	public-inbox-index git

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.git

AGPL code for this site: git clone https://public-inbox.org/public-inbox.git