All of lore.kernel.org
 help / color / mirror / Atom feed
From: "brian m. carlson" <sandals@crustytoothpaste.net>
To: "Ævar Arnfjörð Bjarmason" <avarab@gmail.com>
Cc: Derrick Stolee <stolee@gmail.com>,
	git@vger.kernel.org, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 03/15] cache: add an algo member to struct object_id
Date: Thu, 15 Apr 2021 23:51:20 +0000	[thread overview]
Message-ID: <YHjReOSYlTF8XZL9@camp.crustytoothpaste.net> (raw)
In-Reply-To: <87pmyw0waj.fsf@evledraar.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 9484 bytes --]

On 2021-04-15 at 08:47:00, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Apr 14 2021, brian m. carlson wrote:
> > I will do some performance numbers on these patches, but it will likely
> > be the weekend before I can get to it.  I think this will add 4 bytes on
> > most platforms, since int is typically 32 bits, and the alignment
> > requirement would be for the most strictly aligned member, which is the
> > int, so a 4-byte alignment.  I don't think the alignment requirements
> > are especially onerous here.
> 
> I think if you're doing such a perf test one where we have SHA-1 mode
> with SHA-1 length OID v.s. SHA-256 (the current behavior) would be
> interesting as well.
> 
> It seems like good SHA1s to test that are 5a0cc8aca79 and
> 13eeedb5d17. Running:
> 
>     GIT_PERF_REPEAT_COUNT=10 \
>     GIT_SKIP_TESTS="p0001.[3-9] p1450.2" \
>     GIT_TEST_OPTS= GIT_PERF_MAKE_OPTS='-j8 CFLAGS=-O3' \
>     ./run 5a0cc8aca79 13eeedb5d17 -- p0001-rev-list.sh p1450-fsck.sh
> 
> (I added a fsck --connectivity-only test)
> 
> Gives us:
> 
>     Test                               5a0cc8aca79         13eeedb5d17
>     ------------------------------------------------------------------------------
>     0001.1: rev-list --all             2.46(2.22+0.14)     2.48(2.25+0.14) +0.8%
>     0001.2: rev-list --all --objects   10.79(10.22+0.14)   10.92(10.24+0.20) +1.2%
>     1450.1: fsck --connectivity-only   16.61(15.42+0.34)   16.94(15.60+0.32) +2.0%
> 
> So at least on my box none of those are outside of the confidence
> intervals. This was against my copy of git.git. Perhaps it matters more
> under memory pressure.

I do plan to take a deeper look at this this weekend and do some
performance numbers, and I think these are great examples to use.  I
just have a limited amount of time most weeknights because, among other
things, I am taking French a couple nights a week.

I talked with Stolee today about this approach and the desire for
performance, so I think we're on the same page about trying to make this
as fast as possible.  I plan to try a couple alternative solutions which
may work as well (or at least, I will make notes this time about why
they didn't work) and be less impactful.

> >> I assume that we already checked what happened when GIT_MAX_RAWSZ
> >> increased, but that seemed worth the cost so we could have SHA-256
> >> at all. I find the justification for this interoperability mode to
> >> be less significant, and potentially adding too much of a tax onto
> >> both SHA-1 repos that will never upgrade, and SHA-256 repos that
> >> upgrade all at once (or start as SHA-256).
> >
> > The entire goal of the interoperability is to let people seamlessly and
> > transparently move from SHA-1 to SHA-256.  Currently, the only way
> > people can move a SHA-1 repository to a SHA-256 repository is with
> > fast-import and fast-export, which loses all digital signatures and tags
> > to blobs.  This also requires a flag day.
> >
> > SHA-1 can now be attacked for USD 45,000.  That means it is within the
> > budget of a dedicated professional and virtually all medium or large
> > corporations, including even most municipal governments, to create a
> > SHA-1 collision.
> 
> Is that for vanilla SHA-1, or SHA-1DC?

Well, for SHA-1 in general.  SHA-1DC doesn't do anything except detect
collisions.  People can still create collisions, but we detect them.

> > Unfortunately, the way we deal with this is to die, so
> > as soon as this happens, the repository fails closed.  While an attacker
> > cannot make use of the collisions to spread malicious objects, because
> > of the way Git works, they can effectively DoS a repository, which is in
> > itself a security issue.  Fixing this requires major surgery.
> 
> Can you elaborate on this? I believe that we die on any known collision
> via the SHA1-DC code, and even if we didn't have that we'd detect the
> collision (see [1] for the code) and die while the object is in the
> temporary quarantine.
> 
> I believe such a request is cheaper to serve than one that doesn't
> upload colliding objects, we die earlier (less CPU etc.), and don't add
> objects to the store.
> 
> So what's the DoS vector?

That assumes that the server is using SHA-1DC and that the object can't
be uploaded any way except a push where its hash is checked.  Those are
true for many, but not all, hosting providers.  For example, anyone
using Git in a FIPS-validated environment can only use FIPS-validated
crypto, and I'm not aware of any SHA-1DC implementations that are.
Also, some implementations like Dulwich don't use SHA-1DC.

Once someone can get that object to a standard Git which does use
SHA-1DC, then the repository is pretty much hosed.  I probably can make
this better by just dropping the non SHA-1DC mode here and in libgit2,
to at least disincentivize poor choices in the most common
implementations.

> > We need the interoperability code to let people transition their
> > repositories away from SHA-1, even if it has some performance impact,
> > because without that most SHA-1 repositories will never transition.
> > That's what's outlined in the transition plan, and why that approach was
> > proposed, even though it would be nicer to avoid having to implement it
> > at all.
> 
> There's no question that we need working interop.
> 
> The question at least in my mind is why that interop is happening by
> annotating every object held in memory with whether they're SHA-1 or
> SHA-256, as opposed to having some translation layer earlier in the
> chain.

This is a good question.  Let me provide an example.

When we speak the remote protocol with a remote system, we'll parse out
several object ID of the appropriate algorithm.  We will then pass those
around to various places in our transport code.  It makes it a lot
easier if we can just run every object ID through an inline mapping
function which immediately does nothing if the object ID is of the
appropriate type rather than adding additional code to keep a check of
the current algorithm that's being used in the transport code.

It also makes it a lot easier when we let people store data in SHA-256
and then print things in SHA-1 for compatibility if we can add an
oid_to_hex_output and just map every object automatically on output,
regardless of where it came from, without needing to keep track of what
algorithm it's in.  For example, think about situations where the user
may have passed in a SHA-1 object ID and we reuse that value instead of
reading a SHA-256 object from the store.

So it's not completely impossible to avoid a hash algorithm member, but
it does significantly complicate our code not to do it.

> Not all our file or in-memory structures are are like that, e.g. the
> commit graph has a header saying "this is a bunch of SHA-1/256", and the
> objects that follow are padded to that actual hash size, not the max
> size we know about.
> 
> My understanding of the transition plan was that we'd e.g. have a
> SHA-1<->SHA-256 mapping of objects, which we'd say use to push/pull.

Correct.  That code exists and mostly works.  There are still a lot of
failing tests, but I have a pack index v3 that stores that data and a
loose object store.

> But that by the time I ran say a "git commit" none of that machinery
> needed to care that I was interoping with a SHA-1 repo on the other end.
> It would just happily have all SHA-256 objects, create new ones, and
> only by the time I needed to push them would something kick in to
> re-hash them.
> 
> I *think* the anwer is just "it's easier on the C-level" and "the
> wastage doesn't seem to matter much", which is fair enough.

I think that's accurate.

> *Goes and digs in the ML archive*:
> 
>     https://lore.kernel.org/git/1399147942-165308-1-git-send-email-sandals@crustytoothpaste.net/#t
>     https://lore.kernel.org/git/55016A3A.6010100@alum.mit.edu/
> 
> To answer (some) of that myself:
> 
> Digging up some of the initial discussion that seems to be the case, at
> that point there was a suggestion of:
> 
>     struct object_id {
>         unsigned char hash_type;
>         union {
>             unsigned char sha1[GIT_SHA1_RAWSZ];
>             unsigned char sha256[GIT_SHA256_RAWSZ];
>         } hash;
>     };
> 
> To which you replied:
> 
>     What I think might be more beneficial is to make the hash function
>     specific to a particular repository, and therefore you could maintain
>     something like this:
> 
>       struct object_id {
>           unsigned char hash[GIT_MAX_RAWSZ];
>       };
> 
> It wouldn't matter for the memory use to make it a union, but my reading
> of the above is that the reason for the current object_id not-a-union
> structure might not be valid now that there's a "hash_type" member, no?

Probably at that time we didn't have the formal transition plan and
didn't anticipate interoperability as a concern.  I do think that using
a single hash member instead of a union makes things easier because we
generally don't want to look up two different members in cases like
printing a hex OID.  We ultimately just want to print the right number
of bytes from that data, and the union just makes things trickier with
the compiler when we do that.
-- 
brian m. carlson (he/him or they/them)
Houston, Texas, US

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 263 bytes --]

  reply	other threads:[~2021-04-15 23:51 UTC|newest]

Thread overview: 71+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-10 15:21 [PATCH 00/15] SHA-256 / SHA-1 interop, part 1 brian m. carlson
2021-04-10 15:21 ` [PATCH 01/15] sha1-file: allow hashing objects literally with any algorithm brian m. carlson
2021-04-15  8:55   ` Denton Liu
2021-04-15 23:03     ` brian m. carlson
2021-04-16 15:04   ` Ævar Arnfjörð Bjarmason
2021-04-16 18:55     ` Junio C Hamano
2021-04-10 15:21 ` [PATCH 02/15] builtin/hash-object: allow literally hashing with a given algorithm brian m. carlson
2021-04-11  8:52   ` Ævar Arnfjörð Bjarmason
2021-04-11 21:07     ` brian m. carlson
2021-04-16 15:21   ` Ævar Arnfjörð Bjarmason
2021-04-16 17:27   ` Ævar Arnfjörð Bjarmason
2021-04-10 15:21 ` [PATCH 03/15] cache: add an algo member to struct object_id brian m. carlson
2021-04-11 11:55   ` Ævar Arnfjörð Bjarmason
2021-04-11 21:37     ` brian m. carlson
2021-04-13 12:12   ` Derrick Stolee
2021-04-14  1:08     ` brian m. carlson
2021-04-15  8:47       ` Ævar Arnfjörð Bjarmason
2021-04-15 23:51         ` brian m. carlson [this message]
2021-04-10 15:21 ` [PATCH 04/15] Always use oidread to read into " brian m. carlson
2021-04-10 15:21 ` [PATCH 05/15] hash: add a function to finalize object IDs brian m. carlson
2021-04-10 15:21 ` [PATCH 06/15] Use the final_oid_fn to finalize hashing of " brian m. carlson
2021-04-10 15:21 ` [PATCH 07/15] builtin/pack-redundant: avoid casting buffers to struct object_id brian m. carlson
2021-04-10 15:21 ` [PATCH 08/15] cache: compare the entire buffer for " brian m. carlson
2021-04-11  8:17   ` Chris Torek
2021-04-11 11:36   ` Ævar Arnfjörð Bjarmason
2021-04-11 21:05     ` brian m. carlson
2021-04-10 15:21 ` [PATCH 09/15] hash: set and copy algo field in " brian m. carlson
2021-04-11 11:57   ` Ævar Arnfjörð Bjarmason
2021-04-11 21:48     ` brian m. carlson
2021-04-11 22:12       ` Ævar Arnfjörð Bjarmason
2021-04-11 23:52         ` brian m. carlson
2021-04-12 11:02           ` [PATCH 0/2] C99: harder dependency on variadic macros Ævar Arnfjörð Bjarmason
2021-04-12 11:02             ` [PATCH 1/2] git-compat-util.h: clarify comment on GCC-specific code Ævar Arnfjörð Bjarmason
2021-04-13  7:57               ` Jeff King
2021-04-13 21:07                 ` Junio C Hamano
2021-04-14  5:21                   ` Jeff King
2021-04-14  6:12                     ` Ævar Arnfjörð Bjarmason
2021-04-14  7:31                       ` Jeff King
2021-05-21  2:06               ` Jonathan Nieder
2021-04-12 11:02             ` [PATCH 2/2] C99 support: remove non-HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2021-04-12 17:58               ` Junio C Hamano
2021-04-13  8:00                 ` Jeff King
2021-05-21  2:50               ` Jonathan Nieder
2021-04-12 12:14             ` [PATCH 0/2] C99: harder dependency on variadic macros Bagas Sanjaya
2021-04-12 12:41               ` Ævar Arnfjörð Bjarmason
2021-04-12 22:57                 ` brian m. carlson
2021-04-12 23:19                   ` Junio C Hamano
2022-01-28 11:11             ` [PATCH v2 0/2] C99: remove hardcoded-out !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2022-01-28 11:11               ` [PATCH v2 1/2] git-compat-util.h: clarify GCC v.s. C99-specific in comment Ævar Arnfjörð Bjarmason
2022-01-28 11:11               ` [PATCH v2 2/2] C99: remove hardcoded-out !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2022-01-28 22:40                 ` Junio C Hamano
2022-02-19 10:41               ` [PATCH v3 0/3] C99: remove dead " Ævar Arnfjörð Bjarmason
2022-02-19 10:41                 ` [PATCH v3 1/3] git-compat-util.h: clarify GCC v.s. C99-specific in comment Ævar Arnfjörð Bjarmason
2022-02-19 10:41                 ` [PATCH v3 2/3] C99: remove hardcoded-out !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2022-02-19 10:41                 ` [PATCH v3 3/3] trace.h: remove never-used TRACE_CONTEXT Ævar Arnfjörð Bjarmason
2022-02-20 12:02                   ` Junio C Hamano
2022-02-20 12:38                     ` Ævar Arnfjörð Bjarmason
2022-02-20 20:12                       ` Junio C Hamano
2022-02-21 16:05                 ` [PATCH v4 0/2] C99: remove dead !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2022-02-21 16:05                   ` [PATCH v4 1/2] git-compat-util.h: clarify GCC v.s. C99-specific in comment Ævar Arnfjörð Bjarmason
2022-02-21 16:05                   ` [PATCH v4 2/2] C99: remove hardcoded-out !HAVE_VARIADIC_MACROS code Ævar Arnfjörð Bjarmason
2021-04-12 10:53         ` [PATCH 09/15] hash: set and copy algo field in struct object_id Junio C Hamano
2021-04-12 11:13           ` Ævar Arnfjörð Bjarmason
2021-04-10 15:21 ` [PATCH 10/15] hash: provide per-algorithm null OIDs brian m. carlson
2021-04-11 14:03   ` Junio C Hamano
2021-04-11 21:51     ` brian m. carlson
2021-04-10 15:21 ` [PATCH 11/15] builtin/show-index: set the algorithm for object IDs brian m. carlson
2021-04-10 15:21 ` [PATCH 12/15] commit-graph: don't store file hashes as struct object_id brian m. carlson
2021-04-10 15:21 ` [PATCH 13/15] builtin/pack-objects: avoid using struct object_id for pack hash brian m. carlson
2021-04-10 15:21 ` [PATCH 14/15] hex: default to the_hash_algo on zero algorithm value brian m. carlson
2021-04-10 15:21 ` [PATCH 15/15] hex: print objects using the hash algorithm member brian m. carlson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YHjReOSYlTF8XZL9@camp.crustytoothpaste.net \
    --to=sandals@crustytoothpaste.net \
    --cc=avarab@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.