git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] (re-)optimizing fsck --connectivity-only
@ 2017-01-26  4:10 Jeff King
  2017-01-26  4:11 ` [PATCH 1/2] fsck: move typename() printing to its own function Jeff King
  2017-01-26  4:12 ` [PATCH 2/2] fsck: lazily load types under --connectivity-only Jeff King
  0 siblings, 2 replies; 6+ messages in thread
From: Jeff King @ 2017-01-26  4:10 UTC (permalink / raw)
  To: git

I've been playing around with the newly-fixed `fsck --connectivity-only`.
While it's much faster than it was, I was sad to find a case that is
still much slower than "git rev-list --objects --all".

This goes on top of my origin/jk/fsck-connectivity-check-fix, and gives
a noticeable speedup. IMHO it's worth considering as part of the same
topic (which is currently in 'next').

  [1/2]: fsck: move typename() printing to its own function
  [2/2]: fsck: lazily load types under --connectivity-only

 builtin/fsck.c | 87 ++++++++++++++++++----------------------------------------
 fsck.c         |  4 +++
 2 files changed, 31 insertions(+), 60 deletions(-)

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] fsck: move typename() printing to its own function
  2017-01-26  4:10 [PATCH 0/2] (re-)optimizing fsck --connectivity-only Jeff King
@ 2017-01-26  4:11 ` Jeff King
  2017-01-26  4:12 ` [PATCH 2/2] fsck: lazily load types under --connectivity-only Jeff King
  1 sibling, 0 replies; 6+ messages in thread
From: Jeff King @ 2017-01-26  4:11 UTC (permalink / raw)
  To: git

When an object has a problem, we mention its type. But we do
so by feeding the result of typename() directly to
fprintf(). This is potentially dangerous because typename()
can return NULL for some type values (like OBJ_NONE).

It's doubtful that this can be triggered in practice with
the current code, so this is probably not fixing a bug. But
it future-proofs us against modifications that make things
like OBJ_NONE more likely (and gives future patches a
central point to handle them).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fsck.c | 29 ++++++++++++++++++++---------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 57f529b41..3d5ced2d3 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -56,6 +56,17 @@ static const char *describe_object(struct object *obj)
 	return buf.buf;
 }
 
+static const char *printable_type(struct object *obj)
+{
+	const char *ret;
+
+	ret = typename(obj->type);
+	if (!ret)
+		ret = "unknown";
+
+	return ret;
+}
+
 static int fsck_config(const char *var, const char *value, void *cb)
 {
 	if (strcmp(var, "fsck.skiplist") == 0) {
@@ -83,7 +94,7 @@ static void objreport(struct object *obj, const char *msg_type,
 			const char *err)
 {
 	fprintf(stderr, "%s in %s %s: %s\n",
-		msg_type, typename(obj->type), describe_object(obj), err);
+		msg_type, printable_type(obj), describe_object(obj), err);
 }
 
 static int objerror(struct object *obj, const char *err)
@@ -114,7 +125,7 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt
 	if (!obj) {
 		/* ... these references to parent->fld are safe here */
 		printf("broken link from %7s %s\n",
-			   typename(parent->type), describe_object(parent));
+			   printable_type(parent), describe_object(parent));
 		printf("broken link from %7s %s\n",
 			   (type == OBJ_ANY ? "unknown" : typename(type)), "unknown");
 		errors_found |= ERROR_REACHABLE;
@@ -131,9 +142,9 @@ static int mark_object(struct object *obj, int type, void *data, struct fsck_opt
 	if (!(obj->flags & HAS_OBJ)) {
 		if (parent && !has_object_file(&obj->oid)) {
 			printf("broken link from %7s %s\n",
-				 typename(parent->type), describe_object(parent));
+				 printable_type(parent), describe_object(parent));
 			printf("              to %7s %s\n",
-				 typename(obj->type), describe_object(obj));
+				 printable_type(obj), describe_object(obj));
 			errors_found |= ERROR_REACHABLE;
 		}
 		return 1;
@@ -205,7 +216,7 @@ static void check_reachable_object(struct object *obj)
 	if (!(obj->flags & HAS_OBJ)) {
 		if (has_sha1_pack(obj->oid.hash))
 			return; /* it is in pack - forget about it */
-		printf("missing %s %s\n", typename(obj->type),
+		printf("missing %s %s\n", printable_type(obj),
 			describe_object(obj));
 		errors_found |= ERROR_REACHABLE;
 		return;
@@ -231,7 +242,7 @@ static void check_unreachable_object(struct object *obj)
 	 * since this is something that is prunable.
 	 */
 	if (show_unreachable) {
-		printf("unreachable %s %s\n", typename(obj->type),
+		printf("unreachable %s %s\n", printable_type(obj),
 			describe_object(obj));
 		return;
 	}
@@ -250,7 +261,7 @@ static void check_unreachable_object(struct object *obj)
 	 */
 	if (!obj->used) {
 		if (show_dangling)
-			printf("dangling %s %s\n", typename(obj->type),
+			printf("dangling %s %s\n", printable_type(obj),
 			       describe_object(obj));
 		if (write_lost_and_found) {
 			char *filename = git_pathdup("lost-found/%s/%s",
@@ -324,7 +335,7 @@ static int fsck_obj(struct object *obj)
 
 	if (verbose)
 		fprintf(stderr, "Checking %s %s\n",
-			typename(obj->type), describe_object(obj));
+			printable_type(obj), describe_object(obj));
 
 	if (fsck_walk(obj, NULL, &fsck_obj_options))
 		objerror(obj, "broken links");
@@ -350,7 +361,7 @@ static int fsck_obj(struct object *obj)
 		struct tag *tag = (struct tag *) obj;
 
 		if (show_tags && tag->tagged) {
-			printf("tagged %s %s", typename(tag->tagged->type),
+			printf("tagged %s %s", printable_type(tag->tagged),
 				describe_object(tag->tagged));
 			printf(" (%s) in %s\n", tag->tag,
 				describe_object(&tag->object));
-- 
2.11.0.840.gd37c5973a


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] fsck: lazily load types under --connectivity-only
  2017-01-26  4:10 [PATCH 0/2] (re-)optimizing fsck --connectivity-only Jeff King
  2017-01-26  4:11 ` [PATCH 1/2] fsck: move typename() printing to its own function Jeff King
@ 2017-01-26  4:12 ` Jeff King
  2017-01-26 12:07   ` Johannes Schindelin
  2017-01-26 18:51   ` Junio C Hamano
  1 sibling, 2 replies; 6+ messages in thread
From: Jeff King @ 2017-01-26  4:12 UTC (permalink / raw)
  To: git

The recent fixes to "fsck --connectivity-only" load all of
the objects with their correct types. This keeps the
connectivity-only code path close to the regular one, but it
also introduces some unnecessary inefficiency. While getting
the type of an object is cheap compared to actually opening
and parsing the object (as the non-connectivity-only case
would do), it's still not free.

For reachable non-blob objects, we end up having to parse
them later anyway (to see what they point to), making our
type lookup here redundant.

For unreachable objects, we might never hit them at all in
the reachability traversal, making the lookup completely
wasted. And in some cases, we might have quite a few
unreachable objects (e.g., when alternates are used for
shared object storage between repositories, it's normal for
there to be objects reachable from other repositories but
not the one running fsck).

The comment in mark_object_for_connectivity() claims two
benefits to getting the type up front:

  1. We need to know the types during fsck_walk(). (And not
     explicitly mentioned, but we also need them when
     printing the types of broken or dangling commits).

     We can address this by lazy-loading the types as
     necessary. Most objects never need this lazy-load at
     all, because they fall into one of these categories:

       a. Reachable from our tips, and are coerced into the
	  correct type as we traverse (e.g., a parent link
	  will call lookup_commit(), which converts OBJ_NONE
	  to OBJ_COMMIT).

       b. Unreachable, but not at the tip of a chunk of
          unreachable history. We only mention the tips as
	  "dangling", so an unreachable commit which links
	  to hundreds of other objects needs only report the
	  type of the tip commit.

  2. It serves as a cross-check that the coercion in (1a) is
     correct (i.e., we'll complain about a parent link that
     points to a blob). But we get most of this for free
     already, because right after coercing, we'll parse any
     non-blob objects. So we'd notice then if we expected a
     commit and got a blob.

     The one exception is when we expect a blob, in which
     case we never actually read the object contents.

     So this is a slight weakening, but given that the whole
     point of --connectivity-only is to sacrifice some data
     integrity checks for speed, this seems like an
     acceptable tradeoff.

Here are before and after timings for an extreme case with
~5M reachable objects and another ~12M unreachable (it's the
torvalds/linux repository on GitHub, connected to shared
storage for all of the other kernel forks):

  [before]
  $ time git fsck --no-dangling --connectivity-only
  real	3m4.323s
  user	1m25.121s
  sys	1m38.710s

  [after]
  $ time git fsck --no-dangling --connectivity-only
  real	0m51.497s
  user	0m49.575s
  sys	0m1.776s

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/fsck.c | 58 +++++++---------------------------------------------------
 fsck.c         |  4 ++++
 2 files changed, 11 insertions(+), 51 deletions(-)

diff --git a/builtin/fsck.c b/builtin/fsck.c
index 3d5ced2d3..140357b6f 100644
--- a/builtin/fsck.c
+++ b/builtin/fsck.c
@@ -60,6 +60,12 @@ static const char *printable_type(struct object *obj)
 {
 	const char *ret;
 
+	if (obj->type == OBJ_NONE) {
+		enum object_type type = sha1_object_info(obj->oid.hash, NULL);
+		if (type > 0)
+			object_as_type(obj, type, 0);
+	}
+
 	ret = typename(obj->type);
 	if (!ret)
 		ret = "unknown";
@@ -595,57 +601,7 @@ static int fsck_cache_tree(struct cache_tree *it)
 
 static void mark_object_for_connectivity(const unsigned char *sha1)
 {
-	struct object *obj = lookup_object(sha1);
-
-	/*
-	 * Setting the object type here isn't strictly necessary for a
-	 * connectivity check. In most cases, our walk will expect a certain
-	 * type (e.g., a tree referencing a blob) and will use lookup_blob() to
-	 * assign the type. But doing it here has two advantages:
-	 *
-	 *   1. When the fsck_walk code looks at objects that _don't_ come from
-	 *      links (e.g., the tip of a ref), it may complain about the
-	 *      "unknown object type".
-	 *
-	 *   2. This serves as a nice cross-check that the graph links are
-	 *      sane. So --connectivity-only does not check that the bits of
-	 *      blobs are not corrupted, but it _does_ check that 100644 tree
-	 *      entries point to blobs, and so forth.
-	 *
-	 * Unfortunately we can't just use parse_object() here, because the
-	 * whole point of --connectivity-only is to avoid reading the object
-	 * data more than necessary.
-	 */
-	if (!obj || obj->type == OBJ_NONE) {
-		enum object_type type = sha1_object_info(sha1, NULL);
-		switch (type) {
-		case OBJ_BAD:
-			error("%s: unable to read object type",
-			      sha1_to_hex(sha1));
-			break;
-		case OBJ_COMMIT:
-			obj = (struct object *)lookup_commit(sha1);
-			break;
-		case OBJ_TREE:
-			obj = (struct object *)lookup_tree(sha1);
-			break;
-		case OBJ_BLOB:
-			obj = (struct object *)lookup_blob(sha1);
-			break;
-		case OBJ_TAG:
-			obj = (struct object *)lookup_tag(sha1);
-			break;
-		default:
-			error("%s: unknown object type %d",
-			      sha1_to_hex(sha1), type);
-		}
-
-		if (!obj || obj->type == OBJ_NONE) {
-			errors_found |= ERROR_OBJECT;
-			return;
-		}
-	}
-
+	struct object *obj = lookup_unknown_object(sha1);
 	obj->flags |= HAS_OBJ;
 }
 
diff --git a/fsck.c b/fsck.c
index 4a3069e20..939792752 100644
--- a/fsck.c
+++ b/fsck.c
@@ -458,6 +458,10 @@ int fsck_walk(struct object *obj, void *data, struct fsck_options *options)
 {
 	if (!obj)
 		return -1;
+
+	if (obj->type == OBJ_NONE)
+		parse_object(obj->oid.hash);
+
 	switch (obj->type) {
 	case OBJ_BLOB:
 		return 0;
-- 
2.11.0.840.gd37c5973a

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] fsck: lazily load types under --connectivity-only
  2017-01-26  4:12 ` [PATCH 2/2] fsck: lazily load types under --connectivity-only Jeff King
@ 2017-01-26 12:07   ` Johannes Schindelin
  2017-01-26 18:51   ` Junio C Hamano
  1 sibling, 0 replies; 6+ messages in thread
From: Johannes Schindelin @ 2017-01-26 12:07 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Hi Peff,

On Wed, 25 Jan 2017, Jeff King wrote:

>  builtin/fsck.c | 58 +++++++---------------------------------------------------
>  fsck.c         |  4 ++++
>  2 files changed, 11 insertions(+), 51 deletions(-)

Patch looks good to my eyes.

Ciao,
Johannes

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] fsck: lazily load types under --connectivity-only
  2017-01-26  4:12 ` [PATCH 2/2] fsck: lazily load types under --connectivity-only Jeff King
  2017-01-26 12:07   ` Johannes Schindelin
@ 2017-01-26 18:51   ` Junio C Hamano
  2017-01-26 19:12     ` Jeff King
  1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2017-01-26 18:51 UTC (permalink / raw)
  To: Jeff King; +Cc: git

Jeff King <peff@peff.net> writes:

> The recent fixes to "fsck --connectivity-only" load all of
> the objects with their correct types. This keeps the
> connectivity-only code path close to the regular one, but it
> also introduces some unnecessary inefficiency. While getting
> the type of an object is cheap compared to actually opening
> and parsing the object (as the non-connectivity-only case
> would do), it's still not free.
>
> For reachable non-blob objects, we end up having to parse
> them later anyway (to see what they point to), making our
> type lookup here redundant.
>
> For unreachable objects, we might never hit them at all in
> the reachability traversal, making the lookup completely
> wasted. And in some cases, we might have quite a few
> unreachable objects (e.g., when alternates are used for
> shared object storage between repositories, it's normal for
> there to be objects reachable from other repositories but
> not the one running fsck).
>
> The comment in mark_object_for_connectivity() claims two
> benefits to getting the type up front:
>
>   1. We need to know the types during fsck_walk(). (And not
>      explicitly mentioned, but we also need them when
>      printing the types of broken or dangling commits).
>
>      We can address this by lazy-loading the types as
>      necessary. Most objects never need this lazy-load at
>      all, because they fall into one of these categories:
>
>        a. Reachable from our tips, and are coerced into the
> 	  correct type as we traverse (e.g., a parent link
> 	  will call lookup_commit(), which converts OBJ_NONE
> 	  to OBJ_COMMIT).
>
>        b. Unreachable, but not at the tip of a chunk of
>           unreachable history. We only mention the tips as
> 	  "dangling", so an unreachable commit which links
> 	  to hundreds of other objects needs only report the
> 	  type of the tip commit.
>
>   2. It serves as a cross-check that the coercion in (1a) is
>      correct (i.e., we'll complain about a parent link that
>      points to a blob). But we get most of this for free
>      already, because right after coercing, we'll parse any
>      non-blob objects. So we'd notice then if we expected a
>      commit and got a blob.
>
>      The one exception is when we expect a blob, in which
>      case we never actually read the object contents.
>
>      So this is a slight weakening, but given that the whole
>      point of --connectivity-only is to sacrifice some data
>      integrity checks for speed, this seems like an
>      acceptable tradeoff.

The only weakening is that a non-blob (or a corrupt blob) object
that sits where we expect a blob (because the containing tree or the
tag says so) would not be diagnosed as an error, then?  I think that
is in line with the spirit of --connectivity-only and is a good
trade-off.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] fsck: lazily load types under --connectivity-only
  2017-01-26 18:51   ` Junio C Hamano
@ 2017-01-26 19:12     ` Jeff King
  0 siblings, 0 replies; 6+ messages in thread
From: Jeff King @ 2017-01-26 19:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

On Thu, Jan 26, 2017 at 10:51:00AM -0800, Junio C Hamano wrote:

> >   2. It serves as a cross-check that the coercion in (1a) is
> >      correct (i.e., we'll complain about a parent link that
> >      points to a blob). But we get most of this for free
> >      already, because right after coercing, we'll parse any
> >      non-blob objects. So we'd notice then if we expected a
> >      commit and got a blob.
> >
> >      The one exception is when we expect a blob, in which
> >      case we never actually read the object contents.
> >
> >      So this is a slight weakening, but given that the whole
> >      point of --connectivity-only is to sacrifice some data
> >      integrity checks for speed, this seems like an
> >      acceptable tradeoff.
> 
> The only weakening is that a non-blob (or a corrupt blob) object
> that sits where we expect a blob (because the containing tree or the
> tag says so) would not be diagnosed as an error, then?  I think that
> is in line with the spirit of --connectivity-only and is a good
> trade-off.

Correct. The corrupt-blob case we always knew was a tradeoff (that's the
whole point of --connectivity-only). We could add back in the "we expect
a blob, is it really one?" at the moment we traverse to it, but IMHO
it's not interesting enough to even be worth the sha1_object_info()
lookup time.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-01-26 19:13 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-01-26  4:10 [PATCH 0/2] (re-)optimizing fsck --connectivity-only Jeff King
2017-01-26  4:11 ` [PATCH 1/2] fsck: move typename() printing to its own function Jeff King
2017-01-26  4:12 ` [PATCH 2/2] fsck: lazily load types under --connectivity-only Jeff King
2017-01-26 12:07   ` Johannes Schindelin
2017-01-26 18:51   ` Junio C Hamano
2017-01-26 19:12     ` Jeff King

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).