git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: 程洋 <chengyang@xiaomi.com>
Cc: "Derrick Stolee" <derrickstolee@github.com>,
	"git@vger.kernel.org" <git@vger.kernel.org>,
	何浩 <hehao@xiaomi.com>, "Xin7 Ma 马鑫" <maxin7@xiaomi.com>,
	石奉兵 <shifengbing@xiaomi.com>, 凡军辉 <fanjunhui@xiaomi.com>,
	王汉基 <wanghanji@xiaomi.com>
Subject: [PATCH 1/3] parse_object(): allow skipping hash check
Date: Tue, 6 Sep 2022 19:01:34 -0400	[thread overview]
Message-ID: <YxfRTubqh7aFvNJs@coredump.intra.peff.net> (raw)
In-Reply-To: <YxfQi4qg8uJHs7Gp@coredump.intra.peff.net>

The parse_object() function checks the object hash of any object it
parses. This is a nice feature, as it means we may catch bit corruption
during normal use, rather than waiting for specific fsck operations.

But it also can be slow. It's particularly noticeable for blobs, where
except for the hash check, we could return without loading the object
contents at all. Now one may wonder what is the point of calling
parse_object() on a blob in the first place then, but usually it's not
intentional: we were fed an oid from somewhere, don't know the type, and
want an object struct. For commits and trees, the parsing is usually
helpful; we're about to look at the contents anyway. But this is less
true for blobs, where we may be collecting them as part of a
reachability traversal, etc, and don't actually care what's in them. And
blobs, of course, tend to be larger.

We don't want to just throw out the hash-checks for blobs, though. We do
depend on them in some circumstances (e.g., rev-list --verify-objects
uses parse_object() to check them). It's only the callers that know
how they're going to use the result. And so we can help them by
providing a special flag to skip the hash check.

We could just apply this to blobs, as they're going to be the main
source of performance improvement. But if a caller doesn't care about
checking the hash, we might as well skip it for other object types, too.
Even though we can't avoid reading the object contents, we can still
skip the actual hash computation.

If this seems like it is making Git a little bit less safe against
corruption, it may be. But it's part of a series of tradeoffs we're
already making. For instance, "rev-list --objects" does not open the
contents of blobs it prints. And when a commit graph is present, we skip
opening most commits entirely. The important thing will be to use this
flag in cases where it's safe to skip the check. For instance, when
serving a pack for a fetch, we know the client will fully index the
objects and do a connectivity check itself. There's little to be gained
from the server side re-hashing a blob itself. And indeed, most of the
time we don't! The revision machinery won't open up a blob reached by
traversal, but only one requested directly with a "want" line. So
applied properly, this new feature shouldn't make anything less safe in
practice.

Signed-off-by: Jeff King <peff@peff.net>
---
I'm sorry, I know the argument here is really hand-wavy. But I really
think this isn't making anything much less safe.

I was actually tempted to rip out the blob hash-check entirely by
default!  Anybody who really cares about checking the bits can do so
with read_object_file(). That's what fsck does, and we could pretty
easily convert "rev-list --verify-objects" to do so, too. So this is the
less extreme version of the patch. ;)

 object.c | 15 ++++++++++++---
 object.h |  6 ++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/object.c b/object.c
index 588b8156f1..8f6de09078 100644
--- a/object.c
+++ b/object.c
@@ -263,8 +263,11 @@ struct object *parse_object_or_die(const struct object_id *oid,
 	die(_("unable to parse object: %s"), name ? name : oid_to_hex(oid));
 }
 
-struct object *parse_object(struct repository *r, const struct object_id *oid)
+struct object *parse_object_with_flags(struct repository *r,
+				       const struct object_id *oid,
+				       enum parse_object_flags flags)
 {
+	int skip_hash = !!(flags & PARSE_OBJECT_SKIP_HASH_CHECK);
 	unsigned long size;
 	enum object_type type;
 	int eaten;
@@ -279,7 +282,7 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	if ((obj && obj->type == OBJ_BLOB && repo_has_object_file(r, oid)) ||
 	    (!obj && repo_has_object_file(r, oid) &&
 	     oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
-		if (stream_object_signature(r, repl) < 0) {
+		if (!skip_hash && stream_object_signature(r, repl) < 0) {
 			error(_("hash mismatch %s"), oid_to_hex(oid));
 			return NULL;
 		}
@@ -289,7 +292,8 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 
 	buffer = repo_read_object_file(r, oid, &type, &size);
 	if (buffer) {
-		if (check_object_signature(r, repl, buffer, size, type) < 0) {
+		if (!skip_hash &&
+		    check_object_signature(r, repl, buffer, size, type) < 0) {
 			free(buffer);
 			error(_("hash mismatch %s"), oid_to_hex(repl));
 			return NULL;
@@ -304,6 +308,11 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
 	return NULL;
 }
 
+struct object *parse_object(struct repository *r, const struct object_id *oid)
+{
+	return parse_object_with_flags(r, oid, 0);
+}
+
 struct object_list *object_list_insert(struct object *item,
 				       struct object_list **list_p)
 {
diff --git a/object.h b/object.h
index 9caef89f1f..31ebe11458 100644
--- a/object.h
+++ b/object.h
@@ -128,7 +128,13 @@ void *object_as_type(struct object *obj, enum object_type type, int quiet);
  *
  * Returns NULL if the object is missing or corrupt.
  */
+enum parse_object_flags {
+	PARSE_OBJECT_SKIP_HASH_CHECK = 1 << 0,
+};
 struct object *parse_object(struct repository *r, const struct object_id *oid);
+struct object *parse_object_with_flags(struct repository *r,
+				       const struct object_id *oid,
+				       enum parse_object_flags flags);
 
 /*
  * Like parse_object, but will die() instead of returning NULL. If the
-- 
2.37.3.1134.gfd534b3986


  reply	other threads:[~2022-09-06 23:01 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-11  8:09 Partial-clone cause big performance impact on server 程洋
2022-08-11 17:22 ` Jonathan Tan
2022-08-13  7:55   ` 回复: [External Mail]Re: " 程洋
2022-08-13 11:41     ` 程洋
2022-08-15  5:16     ` ZheNing Hu
2022-08-15 13:15       ` 程洋
2022-08-12 12:21 ` Derrick Stolee
2022-08-14  6:48 ` Jeff King
2022-08-15 13:18   ` Derrick Stolee
2022-08-15 14:50     ` [External Mail]Re: " 程洋
2022-08-17 10:22     ` 程洋
2022-08-17 13:41       ` Derrick Stolee
2022-08-18  5:49         ` Jeff King
2022-09-01  6:53   ` 程洋
2022-09-01 16:19     ` Jeff King
2022-09-05 11:17       ` 程洋
2022-09-06 18:38         ` Jeff King
2022-09-06 22:58           ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Jeff King
2022-09-06 23:01             ` Jeff King [this message]
2022-09-07 14:15               ` [PATCH 1/3] parse_object(): allow skipping hash check Derrick Stolee
2022-09-07 20:44                 ` Jeff King
2022-09-06 23:05             ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
2022-09-07 14:36               ` Derrick Stolee
2022-09-07 14:45                 ` Derrick Stolee
2022-09-07 20:50                   ` Jeff King
2022-09-07 19:26               ` Junio C Hamano
2022-09-07 20:36                 ` Jeff King
2022-09-07 20:48                   ` [BUG] t1800: Fails for error text comparison rsbecker
2022-09-07 21:55                     ` Junio C Hamano
2022-09-07 22:23                       ` rsbecker
2022-09-07 21:02                   ` [PATCH 2/3] upload-pack: skip parse-object re-hashing of "want" objects Jeff King
2022-09-07 22:07                     ` Junio C Hamano
2022-09-08  5:04                       ` Jeff King
2022-09-08 16:41                         ` Junio C Hamano
2022-09-06 23:06             ` [PATCH 3/3] parse_object(): check commit-graph when skip_hash set Jeff King
2022-09-07 14:46               ` Derrick Stolee
2022-09-07 19:31               ` Junio C Hamano
2022-09-08 10:39                 ` [External Mail]Re: " 程洋
2022-09-08 18:42                   ` Jeff King
2022-09-07 14:48             ` [PATCH 0/3] speeding up on-demand fetch for blobs in partial clone Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YxfRTubqh7aFvNJs@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=chengyang@xiaomi.com \
    --cc=derrickstolee@github.com \
    --cc=fanjunhui@xiaomi.com \
    --cc=git@vger.kernel.org \
    --cc=hehao@xiaomi.com \
    --cc=maxin7@xiaomi.com \
    --cc=shifengbing@xiaomi.com \
    --cc=wanghanji@xiaomi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).