Re: Is the sha256 object format experimental or not?

From: dwh@linuxprogrammer.org
To: Junio C Hamano <gitster@pobox.com>
Cc: "brian m. carlson" <sandals@crustytoothpaste.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	git@vger.kernel.org
Subject: Re: Is the sha256 object format experimental or not?
Date: Thu, 13 May 2021 16:26:14 -0700
Message-ID: <20210513232614.GF11882@localhost>
In-Reply-To: <xmqqo8de9wis.fsf@gitster.g>

On 14.05.2021 06:03, Junio C Hamano wrote:
>dwh@linuxprogrammer.org writes:
>
>> I think Git should externalize the calculation of object digests just
>> like it externalizes the calculation of object digital signatures.
>
>The hashing algorithms used to generate object names has
>requirements fundamentally different from that of digital
>signatures.  I strongly suspect that that fact would change the
>equation when you rethink what you said above.

I agree with you. Object names are exactly that: names. Names for
resources/data must be persistent, as well as global in scope and
uniqueness, and autonomously assigned. What this means is that once an
object has a name, that name shall never change as long as the object
remains unchanged. The names must be unique in the scope of all objects
(e.g. all copies of a repo) and generated without coordination.

Calculating object names using a digest algorithm meets all of these
requirements. Choosing a strong digest algorithm creates a strong
cryptographic binding between the name and the object contents. Using
self-describing digests allows for a repo to switch digest algorithms at
arbitrary points in the history.
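
As a rough sketch of what a self-describing name could look like: the
one-byte algorithm tags and the "base64url(tag + digest)" layout below are
assumptions made up for illustration, not an existing Git or multihash
format. Only the "<type> <size>\0<content>" payload is what Git actually
hashes today.

```python
import base64
import hashlib
from typing import Tuple

# Hypothetical one-byte algorithm tags (illustration only, not a real registry).
ALGORITHMS = {
    0x01: "sha256",
    0x02: "sha3_256",
}
TAG_FOR = {name: tag for tag, name in ALGORITHMS.items()}


def git_payload(obj_type: str, content: bytes) -> bytes:
    """Build the bytes Git hashes: '<type> <size>', a NUL byte, then the content."""
    return f"{obj_type} {len(content)}".encode() + b"\x00" + content


def self_describing_name(algorithm: str, payload: bytes) -> str:
    """Return base64url(tag byte + digest) with the '=' padding stripped."""
    digest = hashlib.new(algorithm, payload).digest()
    raw = bytes([TAG_FOR[algorithm]]) + digest
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def parse_name(name: str) -> Tuple[str, bytes]:
    """Recover (algorithm, digest) from a self-describing name."""
    padded = name + "=" * (-len(name) % 4)  # restore the stripped padding
    raw = base64.urlsafe_b64decode(padded)
    return ALGORITHMS[raw[0]], raw[1:]


if __name__ == "__main__":
    payload = git_payload("blob", b"hello\n")
    name = self_describing_name("sha3_256", payload)
    print(name, parse_name(name)[0])
```

Because the tag travels with the digest, a reader of the name learns which
algorithm to run without consulting any repository-wide setting.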

I think that objects named with SHA1 digests should remain named with
the SHA1 digest. I do *not* advocate going back and rewriting history
to change all of the object names to a digest with a different
algorithm. Git is a provenance log and history matters. I recommend
preserving all existing names, even if they were created with known-weak
digest algorithms, and making the change to a new algorithm at a
specific point in time (e.g. at a tag). Using self-describing digest
encoding and externalizing digest calculation future-proofs
repositories and allows for preservation of history while allowing
algorithm agility.

To illustrate my point, I envision that a repo could have a history
like this:

object 2923f6fa36614586ea09b4424b438915cc1b9b67 (naked SHA1)
  |
<many objects named with SHA1>
  |
object 5f167fb6b3e96273b564fff0b041fb94fee4d3de (naked SHA1)
  |
<modify Git to externalize digest calculation and use self-describing encoding>
  |
object 98c2e1c0965e60b0f137577ac5dd0a5c96ce224d (naked SHA1)
  |
<many objects named with SHA1>
  |
<a project decides to switch to SHA2-256, maybe marked in a tag>
  |
object IAOdLVxteOxQwKa-xn8yCBUkuPkjAqcuQ2V7fKAlao8o (self-desc.SHA2-256)
  |
<many objects named with self-describing SHA2-256 digests>
  |
<a project decides to switch to SHA3-256, maybe marked in a tag>
  |
object EK832G0PFhBFf-Dfgr205UKpUMqmVXJX9ltLwQo4Awct (self-desc.SHA3-256)
  |
<many objects named with self-describing SHA3-256 digests>
  .
  .
  .

Neither the switch to SHA2-256 nor the switch to SHA3-256 would require
any code changes. If we continue down the current SHA-256 road, we will
have to repeat that multi-year effort in the future to switch to SHA3 or
something else. Most importantly, the choice of digest algorithm would
be left up to the maintainers of a given repo and not limited to the
algorithms we have hard-coded into Git.
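
To make the mixed history concrete, here is a minimal sketch of how a
verifier might handle both kinds of names. Everything in it is an
assumption for illustration: the DIGEST_HELPERS table stands in for
external digest programs, the one-byte tags are hypothetical, and telling
the two name forms apart by shape only works because a naked SHA1 is 40
hex characters while these tagged 256-bit names are 44 characters.

```python
import base64
import hashlib
import re

# Hypothetical stand-in for external digest helpers, keyed by a made-up tag byte.
# In an externalized design these entries would be configured, not built in.
DIGEST_HELPERS = {
    0x01: lambda data: hashlib.sha256(data).digest(),
    0x02: lambda data: hashlib.sha3_256(data).digest(),
}

NAKED_SHA1 = re.compile(r"[0-9a-f]{40}")


def verify(name: str, payload: bytes) -> bool:
    """Check that payload hashes to name, for naked SHA1 or tagged names."""
    if NAKED_SHA1.fullmatch(name):
        # Legacy object: the name is the bare hex SHA1 digest and stays that way.
        return hashlib.sha1(payload).hexdigest() == name
    # Tagged object: decode base64url(tag byte + digest), padding restored.
    raw = base64.urlsafe_b64decode(name + "=" * (-len(name) % 4))
    helper = DIGEST_HELPERS.get(raw[0])
    if helper is None:
        raise ValueError(f"no digest helper registered for tag {raw[0]:#04x}")
    return helper(payload) == raw[1:]


if __name__ == "__main__":
    empty_blob = b"blob 0\x00"  # what Git hashes for an empty blob
    print(verify("e69de29bb2d1d6434b8b29ae775ad8c2e48c5391", empty_blob))
```

Supporting another algorithm in this sketch means adding one entry to the
table; the verification logic and every existing name stay untouched.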

Brian's work on the SHA-256 switch is valuable. We can leverage a lot of
it to switch to externalized digest calculation and self-describing
digests and never have to worry about doing that again.

Cheers!
Dave