* parse_object does check_sha1_signature but not parse_object_buffer?
@ 2016-02-02  1:57 Mike Hommey
  2016-02-02  2:06 ` Mike Hommey
  2016-02-02  3:10 ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Mike Hommey @ 2016-02-02  1:57 UTC (permalink / raw)
  To: git

Hi,

You might or might not be aware of this thread:
https://groups.google.com/forum/#!topic/binary-transparency/f-BI4o8HZW0

Anyways, this got me to take a look around, and I noticed that
parse_object does SHA-1 validation through check_sha1_signature. What
surprised me is that parse_object_buffer doesn't. So we end up with
inconsistent behavior across commands:

$ git init
$ echo a > a ; echo b > b
$ git add a b
$ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
a
$ cp -f .git/objects/61/780798228d17af2d34fce4cfbdf35556832472 .git/objects/78/981922613b2afb6025042ff6bd878ac1994e85
$ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
b
$ git show 78981922613b2afb6025042ff6bd878ac1994e85
error: sha1 mismatch 78981922613b2afb6025042ff6bd878ac1994e85
fatal: bad object 78981922613b2afb6025042ff6bd878ac1994e85
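
The check that `git show` performs here boils down to recomputing the
object name from the inflated contents. A minimal Python sketch of that
recomputation (the expected hashes are the two from the transcript above):

```python
import hashlib

def object_name(obj_type: str, payload: bytes) -> str:
    # A git object name is the SHA-1 of the header "<type> <size>\0"
    # followed by the payload; this is what check_sha1_signature
    # recomputes and compares against the name used to look the object up.
    header = f"{obj_type} {len(payload)}".encode() + b"\0"
    return hashlib.sha1(header + payload).hexdigest()

print(object_name("blob", b"a\n"))  # 78981922613b2afb6025042ff6bd878ac1994e85
print(object_name("blob", b"b\n"))  # 61780798228d17af2d34fce4cfbdf35556832472
```

So after the cp above, the file named 78981922... inflates to contents
that hash to 61780798..., which is exactly the mismatch `git show` reports.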

Shouldn't parse_object_buffer also do check_sha1_signature?

Mike


* Re: parse_object does check_sha1_signature but not parse_object_buffer?
  2016-02-02  1:57 parse_object does check_sha1_signature but not parse_object_buffer? Mike Hommey
@ 2016-02-02  2:06 ` Mike Hommey
  2016-02-02  3:10 ` Junio C Hamano
  1 sibling, 0 replies; 7+ messages in thread
From: Mike Hommey @ 2016-02-02  2:06 UTC (permalink / raw)
  To: git

On Tue, Feb 02, 2016 at 10:57:01AM +0900, Mike Hommey wrote:
> Hi,
> 
> You might or might not be aware of this thread:
> https://groups.google.com/forum/#!topic/binary-transparency/f-BI4o8HZW0
> 
> Anyways, this got me to take a look around, and I noticed that
> parse_object does SHA-1 validation through check_sha1_signature. What
> surprised me is that parse_object_buffer doesn't. So we end up with
> inconsistent behavior across commands:
> 
> $ git init
> $ echo a > a ; echo b > b
> $ git add a b
> $ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
> a
> $ cp -f .git/objects/61/780798228d17af2d34fce4cfbdf35556832472 .git/objects/78/981922613b2afb6025042ff6bd878ac1994e85
> $ git cat-file blob 78981922613b2afb6025042ff6bd878ac1994e85
> b
> $ git show 78981922613b2afb6025042ff6bd878ac1994e85
> error: sha1 mismatch 78981922613b2afb6025042ff6bd878ac1994e85
> fatal: bad object 78981922613b2afb6025042ff6bd878ac1994e85
> 
> Shouldn't parse_object_buffer also do check_sha1_signature?

Well, except cat-file doesn't use parse_object_buffer either...

Mike


* Re: parse_object does check_sha1_signature but not parse_object_buffer?
  2016-02-02  1:57 parse_object does check_sha1_signature but not parse_object_buffer? Mike Hommey
  2016-02-02  2:06 ` Mike Hommey
@ 2016-02-02  3:10 ` Junio C Hamano
  2016-02-02  4:36   ` Mike Hommey
  1 sibling, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-02-02  3:10 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git

Mike Hommey <mh@glandium.org> writes:

> Shouldn't parse_object_buffer also do check_sha1_signature?

In general, it shouldn't; its callers are supposed to do it as an
additional check when/if needed.  Callers like the one in fsck.c
do not want to die after seeing one bad object; we want to report
it and keep checking other things.


* Re: parse_object does check_sha1_signature but not parse_object_buffer?
  2016-02-02  3:10 ` Junio C Hamano
@ 2016-02-02  4:36   ` Mike Hommey
  2016-02-02 19:25     ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Mike Hommey @ 2016-02-02  4:36 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Mon, Feb 01, 2016 at 07:10:04PM -0800, Junio C Hamano wrote:
> Mike Hommey <mh@glandium.org> writes:
> 
> > Shouldn't parse_object_buffer also do check_sha1_signature?
> 
> In general, it shouldn't; its callers are supposed to do it as an
> additional check when/if needed.  Callers like the one in fsck.c
> do not want to die after seeing one bad object; we want to report
> it and keep checking other things.

Shouldn't some things like, at least, `git checkout`, still check
the sha1s, though?

Mike


* Re: parse_object does check_sha1_signature but not parse_object_buffer?
  2016-02-02  4:36   ` Mike Hommey
@ 2016-02-02 19:25     ` Junio C Hamano
  2016-02-04  7:19       ` Jeff King
  0 siblings, 1 reply; 7+ messages in thread
From: Junio C Hamano @ 2016-02-02 19:25 UTC (permalink / raw)
  To: Mike Hommey; +Cc: git

Mike Hommey <mh@glandium.org> writes:

> On Mon, Feb 01, 2016 at 07:10:04PM -0800, Junio C Hamano wrote:
>> Mike Hommey <mh@glandium.org> writes:
>> 
>> > Shouldn't parse_object_buffer also do check_sha1_signature?
>> 
>> In general, it shouldn't; its callers are supposed to do it as an
>> additional check when/if needed.  Callers like the one in fsck.c
>> do not want to die after seeing one bad object; we want to report
>> it and keep checking other things.
>
> Shouldn't some things like, at least, `git checkout`, still check
> the sha1s, though?

That is a different issue--my answer was about the quoted question
regarding parse_object_buffer().  Its callers are supposed to do
additional checks when/if needed, and there may be codepaths
currently using parse_object_buffer() that want to do their own
check, or call a different function that does the check for them.

Having said that, I do not necessarily think "git checkout" should
revalidate the object name.  The repository that you use for your
daily work would have the same error/corruption rate as your working
tree files, and I do not think you would constantly "validate" what
is in your working tree by comparing their contents with what you
think ought to be there.

If you are working on extremely poor quality disks and SSDs, it
might make sense to constantly revalidate the object data to catch
corruption early, as that is something we can do (as opposed to the
working tree files, for which you probably do not have anything to
catch bit-flipping).

Unless, that is, you are placing your working tree on a filesystem
with checksums; but then your object data would also be protected
against corruption, so an extra revalidation at "git checkout" time
would not buy you much, if anything at all.

http://article.gmane.org/gmane.comp.version-control.git/283380 (not
necessarily the entire thread, but that exact article) is a
reasonable summary that illustrates the way we view object
integrity.

    So "index-pack" is the enforcement point, and the rest of the
    git commands generally assume that we can trust what is on disk
    (as it has either been generated by us, or checked by
    index-pack).  The rest of the commands do not spend time
    checking that the on-disk contents are sane (though you can run
    git-fsck if you want to do that).

If anything, we may want to reduce the number of codepaths that
calls check_sha1_signature() on data that we know we have read from
our own repository.  Even though I am not opposed to an idea to have
a "paranoid" mode that revalidates the object name every time (and
if "git checkout" does not currently, allow it to revalidate when we
are operating under the "paranoid" mode), I do not think it should
be on by default.

In fact, I have this suspicion that the original justification to
have the call to check_sha1_signature() in parse_object() might have
been invalidated by the restructuring of the code over the past 10
years.  e9eefa67 ([PATCH] Add function to parse an object of
unspecified type (take 2), 2005-04-28) says

    It also checks the hash of the file, due to its heritage as part
    of fsck-cache.

I.e. we do not need this call here, as long as we make sure that the
fsck codepath does not depend on the fact that parse_object() calls
check_sha1_signature() to validate the consistency between the data
and the object name.

In fact, we do this, which is quite suboptimal:

        static int fsck_sha1(const unsigned char *sha1)
        {
                struct object *obj = parse_object(sha1);
                if (!obj) {
                        errors_found |= ERROR_OBJECT;
                        return error("%s: object corrupt or missing",
                                     sha1_to_hex(sha1));
                }
                obj->flags |= HAS_OBJ;
                return fsck_obj(obj);
        }

This function is called for each loose object file we find in
fsck_object_dir(), and there are a few problems:

 * The function parse_object() called from here would issue an error
   message and return NULL; then you get another "corrupt or
   missing" error message, because this code cannot tell from the
   NULL which is the case.

 * The intent of the callchain to fsck_sha1() is to iterate over
   loose object files xx/x{38} and validate what is contained in
   them, but that behaviour is not guaranteed because it calls
   parse_object(), which may get the object data from a packfile
   if the loose object is also in the packfile.

This function should instead take "path" (the only caller of this
function fsck_loose() has it), read the data in the file, hash the
data to validate that it matches "sha1" and then create the object
out of that data it read by calling parse_object_buffer().
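
That proposed flow can be sketched in Python (a sketch only; the one git
detail assumed is the loose-object format, a zlib-deflated
"<type> <size>\0<payload>"):

```python
import hashlib
import zlib

def check_loose_object(raw: bytes, expected: str):
    # "raw" stands in for the bytes read from the loose object file at
    # "path".  Inflate, split the "<type> <size>\0" header from the
    # payload, and verify the name before building any object from it.
    data = zlib.decompress(raw)
    header, _, payload = data.partition(b"\0")
    obj_type, size = header.split(b" ")
    if int(size) != len(payload):
        raise ValueError("header size does not match payload length")
    actual = hashlib.sha1(data).hexdigest()
    if actual != expected:
        raise ValueError("sha1 mismatch: expected %s, got %s" % (expected, actual))
    # Here the real code would hand obj_type/payload to parse_object_buffer().
    return obj_type.decode(), payload

loose = zlib.compress(b"blob 2\0a\n")  # simulated on-disk loose object
print(check_loose_object(loose, "78981922613b2afb6025042ff6bd878ac1994e85"))
```

Because the raw file bytes are hashed directly, this version cannot be
silently satisfied by a packed copy of the same object.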

I didn't check other callers of parse_object(), but I doubt that
they need or want a check_sha1_signature() call in the function.


* Re: parse_object does check_sha1_signature but not parse_object_buffer?
  2016-02-02 19:25     ` Junio C Hamano
@ 2016-02-04  7:19       ` Jeff King
  2016-02-04 17:40         ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Jeff King @ 2016-02-04  7:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Mike Hommey, git

On Tue, Feb 02, 2016 at 11:25:44AM -0800, Junio C Hamano wrote:

> Having said that, I do not necessarily think "git checkout" should
> revalidate the object name.  The repository that you use for your
> daily work would have the same error/corruption rate as your working
> tree files, and I do not think you would constantly "validate" what
> is in your working tree by comparing their contents with what you
> think ought to be there.

No, but there is a question of how much it costs to do so.

In the past, I've spent a lot of time trying to speed up object access,
but it was usually about replacing `parse_object` with
`lookup_object` or similar to avoid loading from disk in the first
place, as most of the time goes to zlib. If we are accessing the object
already (and obviously we have to for something like checkout), I'm not
sure what the marginal cost is of computing the sha1 on the data as it
passes through.

It might be significant, but I don't have numbers.
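
For what it's worth, a standalone sketch can at least give ballpark
numbers for the raw operations involved (this times hashlib and zlib on
an arbitrary Python buffer, not git itself; real numbers depend on
object sizes, compression ratios, and the SHA-1 implementation):

```python
import hashlib
import time
import zlib

def best_time(fn, reps=5):
    # Best-of-N wall-clock time, to damp scheduler noise.
    best = float("inf")
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

payload = bytes(range(256)) * 16384            # ~4 MB stand-in for object data
deflated = zlib.compress(payload)

t_inflate = best_time(lambda: zlib.decompress(deflated))
t_sha1 = best_time(lambda: hashlib.sha1(payload).digest())

print("inflate: %.2f ms, sha1: %.2f ms" % (t_inflate * 1e3, t_sha1 * 1e3))
```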

If it's not, then I think it's a nice feature that we would notice
problems earlier rather than later.

> If you are working on extremely poor quality disks and SSDs, it
> might make sense to constantly revalidate the object data to catch
> corruption early, as that is something we can do (as opposed to the
> working tree files, for which you probably do not have anything to
> catch bit-flipping).

Hopefully if your working tree files bit-flip, then you notice via `git
diff`. I guess you wouldn't for stat-clean ones. But if a bit flips in
the forest, and it's not stat-dirty enough for anyone to hear it, does
it make a sound?

> http://article.gmane.org/gmane.comp.version-control.git/283380 (not
> necessarily the entire thread, but that exact article) is a
> reasonable summary that illustrates the way we view object
> integrity.
> 
>     So "index-pack" is the enforcement point, and the rest of the
>     git commands generally assume that we can trust what is on disk
>     (as it has either been generated by us, or checked by
>     index-pack).  The rest of the commands do not spend time
>     checking that the on-disk contents are sane (though you can run
>     git-fsck if you want to do that).

I think that's me you're quoting.  I was specifically talking about
malicious tampering there. Which isn't to say I disagree with the
world-view you're proposing, but I think for random errors it's a little
more complicated. We certainly should be checking incoming data (and we
do).

For local repository operations, most of them are about reading data.
And there I generally favor performance over extra validation, with the
caveat that we should always be aware of the tradeoff. An extra
comparison to make sure we are not going out-of-bounds on a pack .idx
pointer is cheap. Loading a blob just to make sure its sha1 is valid
before we mention it in `diff --raw` output is stupid. Checking the sha1
on objects we are otherwise accessing is somewhere in between. :)

For local write operations, like repacking, we should err on the careful
side. And I think we do a good job of balancing performance and
validation there (e.g., we reuse deltas without reconstructing the
object, but _with_ a crc check on the delta data itself).
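
The shape of that delta-reuse check can be illustrated in Python, with
zlib.crc32 standing in for the per-object CRC-32 that version-2 .idx
files record: verify the still-compressed bytes instead of inflating
and reconstructing the whole object.

```python
import zlib

def safe_to_reuse(on_disk: bytes, recorded_crc: int) -> bool:
    # Cheap integrity check: CRC the compressed bytes as they sit on
    # disk; no inflation, no object reconstruction.
    return zlib.crc32(on_disk) == recorded_crc

stored = zlib.compress(b"stand-in for delta data")
crc = zlib.crc32(stored)                        # what would have been recorded

print(safe_to_reuse(stored, crc))               # True: copy verbatim
corrupted = bytes([stored[0] ^ 1]) + stored[1:] # one flipped bit
print(safe_to_reuse(corrupted, crc))            # False: corruption caught cheaply
```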

> In fact, we do this, which is quite suboptimal:
> 
>         static int fsck_sha1(const unsigned char *sha1)
>         {
>                 struct object *obj = parse_object(sha1);
>                 if (!obj) {
>                         errors_found |= ERROR_OBJECT;
>                         return error("%s: object corrupt or missing",
>                                      sha1_to_hex(sha1));
>                 }
>                 obj->flags |= HAS_OBJ;
>                 return fsck_obj(obj);
>         }
> 
> This function is called for each loose object file we find in
> fsck_object_dir(), and there are a few problems:
> 
>  * The function parse_object() called from here would issue an error
>    message and return NULL; then you get another "corrupt or
>    missing" error message, because this code cannot tell from the
>    NULL which is the case.
> 
>  * The intent of the callchain to fsck_sha1() is to iterate over
>    loose object files xx/x{38} and validate what is contained in
>    them, but that behaviour is not guaranteed because it calls
>    parse_object(), which may get the object data from a packfile
>    if the loose object is also in the packfile.
> 
> This function should instead take "path" (the only caller of this
> function fsck_loose() has it), read the data in the file, hash the
> data to validate that it matches "sha1" and then create the object
> out of that data it read by calling parse_object_buffer().

Yeah, I agree. I think as it is written, we also end up loading the
loose objects twice (once in parse_object, and then later again in
fsck_object to do the real work). Your solution would fix that, too.

It looks like we _don't_ load loose commits twice, though. We never turn
off save_commit_buffer, so we happily cache the buffer for every single
commit, even though we won't need it again after fsck_object returns.

I guess nobody has really noticed either issue, because repositories
large enough for it to make a difference will usually be packed.

-Peff


* Re: parse_object does check_sha1_signature but not parse_object_buffer?
  2016-02-04  7:19       ` Jeff King
@ 2016-02-04 17:40         ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2016-02-04 17:40 UTC (permalink / raw)
  To: Jeff King; +Cc: Mike Hommey, git

Jeff King <peff@peff.net> writes:

> For local repository operations, most of them are about reading data.
> And there I generally favor performance over extra validation, with the
> caveat that we should always be aware of the tradeoff. An extra
> comparison to make sure we are not going out-of-bounds on a pack .idx
> pointer is cheap. Loading a blob just to make sure its sha1 is valid
> before we mention it in `diff --raw` output is stupid. Checking the sha1
> on objects we are otherwise accessing is somewhere in between. :)
>
> For local write operations, like repacking, we should err on the careful
> side. And I think we do a good job of balancing performance and
> validation there (e.g., we reuse deltas without reconstructing the
> object, but _with_ a crc check on the delta data itself).

OK, that is a reasonable set of rules.  We can say "checkout" is
writing out the contents to the filesystem, and make the "somewhere
in between" be closer to "err on the careful side".

>> In fact, we do this, which is quite suboptimal:
>> 
>>         static int fsck_sha1(const unsigned char *sha1)
>>         {
>>                 struct object *obj = parse_object(sha1);
>>                 if (!obj) {
>>                         errors_found |= ERROR_OBJECT;
>>                         return error("%s: object corrupt or missing",
>>                                      sha1_to_hex(sha1));
>>                 }
>>                 obj->flags |= HAS_OBJ;
>>                 return fsck_obj(obj);
>>         }
>> 
>> This function is called for each loose object file we find in
>> fsck_object_dir(), and there are a few problems:
>> 
>>  * The function parse_object() called from here would issue an error
>>    message and return NULL; then you get another "corrupt or
>>    missing" error message, because this code cannot tell from the
>>    NULL which is the case.
>> 
>>  * The intent of the callchain to fsck_sha1() is to iterate over
>>    loose object files xx/x{38} and validate what is contained in
>>    them, but that behaviour is not guaranteed because it calls
>>    parse_object(), which may get the object data from a packfile
>>    if the loose object is also in the packfile.
>> 
>> This function should instead take "path" (the only caller of this
>> function fsck_loose() has it), read the data in the file, hash the
>> data to validate that it matches "sha1" and then create the object
>> out of that data it read by calling parse_object_buffer().
>
> Yeah, I agree. I think as it is written, we also end up loading the
> loose objects twice (once in parse_object, and then later again in
> fsck_object to do the real work). Your solution would fix that, too.
> ...
> I guess nobody has really noticed either issue, because repositories
> large enough for it to make a difference will usually be packed.

I'll throw it in the leftover-bits list, then ;-)

