git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing
@ 2005-10-18 17:52 Linus Torvalds
  2005-10-18 18:45 ` Junio C Hamano
  2005-10-18 18:49 ` Junio C Hamano
  0 siblings, 2 replies; 7+ messages in thread
From: Linus Torvalds @ 2005-10-18 17:52 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List


If everything is up-to-date locally, we don't need to even ask for a
pack-file from the remote, or try to unpack it.

This is especially important for tags - since the pack-file common commit
logic is based purely on the commit history, it will never be able to find
a common tag, and will thus always end up re-fetching them.

Especially notably, if the tag points to a non-commit (eg a tagged tree),
the pack-file would be unnecessarily big, just because it cannot any most
recent common point between commits for pruning.

Short-circuiting the case where we already have that reference means that
we avoid a lot of these in the common case.

NOTE! This only matches remote ref names against the same local name,
which works well for tags, but is not as generic as it could be. If we
ever need to, we could match against _any_ local ref (if we have it, we
have it), but this "match against same name" is simpler and more
efficient, and covers the common case.

Renaming of refs is common for branch heads, but since those are always
commits, the pack-file generation can optimize that case.

In some cases we might still end up fetching pack-files unnecessarily, but
this at least avoids the re-fetching of tags over and over if you use a
regular

	git fetch --tags ...

which was the main reason behind the change.

Signed-off-by: Linus Torvalds <torvalds@osdl.org>
---

Ok, this should be the trivially fixed (famous last words) patch, which 
should correct the case where we don't have a local name for the remote 
ref at all.

If you already applied the previous patch, you just need to fix the

	if (read_ref(...) < 0)
		continue
	if (memcmp(..)) {

to be one case (a failing read_ref should do the exact same thing as a 
failed memcmp):

	if (read_ref(...) < 0 ||
	    memcmp(...) {

and everything should be ok.

This has gotten _some_ testing, but obviously not enough ;)

		Linus

diff --git a/fetch-pack.c b/fetch-pack.c
index 953c0cf..4597369 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -12,6 +12,7 @@ static const char *exec = "git-upload-pa
 static int find_common(int fd[2], unsigned char *result_sha1,
 		       struct ref *refs)
 {
+	int fetching;
 	static char line[1000];
 	int count = 0, flushes = 0, retval;
 	FILE *revs;
@@ -20,16 +21,19 @@ static int find_common(int fd[2], unsign
 	if (!revs)
 		die("unable to run 'git-rev-list'");
 
-	while (refs) {
+	fetching = 0;
+	for ( ; refs ; refs = refs->next) {
 		unsigned char *remote = refs->old_sha1;
-		if (verbose)
-			fprintf(stderr,
-				"want %s (%s)\n", sha1_to_hex(remote),
-				refs->name);
+		unsigned char *local = refs->new_sha1;
+
+		if (!memcmp(remote, local, 20))
+			continue;
 		packet_write(fd[1], "want %s\n", sha1_to_hex(remote));
-		refs = refs->next;
+		fetching++;
 	}
 	packet_flush(fd[1]);
+	if (!fetching)
+		return 1;
 	flushes = 1;
 	retval = -1;
 	while (fgets(line, sizeof(line), revs) != NULL) {
@@ -74,6 +78,35 @@ static int find_common(int fd[2], unsign
 	return retval;
 }
 
+static int everything_local(struct ref *refs)
+{
+	int retval;
+
+	for (retval = 1; refs ; refs = refs->next) {
+		const unsigned char *remote = refs->old_sha1;
+		unsigned char local[20];
+
+		if (read_ref(git_path("%s", refs->name), local) < 0 ||
+		    memcmp(remote, local, 20)) {
+			retval = 0;
+			if (!verbose)
+				continue;
+			fprintf(stderr,
+				"want %s (%s)\n", sha1_to_hex(remote),
+				refs->name);
+			continue;
+		}
+
+		memcpy(refs->new_sha1, local, 20);
+		if (!verbose)
+			continue;
+		fprintf(stderr,
+			"already have %s (%s)\n", sha1_to_hex(remote),
+			refs->name);
+	}
+	return retval;
+}
+
 static int fetch_pack(int fd[2], int nr_match, char **match)
 {
 	struct ref *ref;
@@ -86,6 +119,10 @@ static int fetch_pack(int fd[2], int nr_
 		packet_flush(fd[1]);
 		die("no matching remote head");
 	}
+	if (everything_local(ref)) {
+		packet_flush(fd[1]);
+		goto all_done;
+	}
 	if (find_common(fd, sha1, ref) < 0)
 		fprintf(stderr, "warning: no common commits\n");
 	pid = fork();
@@ -109,6 +146,7 @@ static int fetch_pack(int fd[2], int nr_
 		int code = WEXITSTATUS(status);
 		if (code)
 			die("git-unpack-objects died with error code %d", code);
+all_done:
 		while (ref) {
 			printf("%s %s\n",
 			       sha1_to_hex(ref->old_sha1), ref->name);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing
  2005-10-18 17:52 [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing Linus Torvalds
@ 2005-10-18 18:45 ` Junio C Hamano
  2005-10-18 18:49 ` Junio C Hamano
  1 sibling, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2005-10-18 18:45 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> If everything is up-to-date locally, we don't need to even ask for a
> pack-file from the remote, or try to unpack it.
>
> This is especially important for tags - since the pack-file common commit
> logic is based purely on the commit history, it will never be able to find
> a common tag, and will thus always end up re-fetching them.
>
> Especially notably, if the tag points to a non-commit (eg a tagged tree),
> the pack-file would be unnecessarily big, just because it cannot any most
> recent common point between commits for pruning.
>
> Short-circuiting the case where we already have that reference means that
> we avoid a lot of these in the common case.
>
> NOTE! This only matches remote ref names against the same local name,
> which works well for tags, but is not as generic as it could be. If we
> ever need to, we could match against _any_ local ref (if we have it, we
> have it), but this "match against same name" is simpler and more
> efficient, and covers the common case.
>
> Renaming of refs is common for branch heads, but since those are always
> commits, the pack-file generation can optimize that case.
>
> In some cases we might still end up fetching pack-files unnecessarily, but
> this at least avoids the re-fetching of tags over and over if you use a
> regular
>
> 	git fetch --tags ...
>
> which was the main reason behind the change.
>
> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
> ---
>
> Ok, this should be the trivially fixed (famous last words) patch, which 
> should correct the case where we don't have a local name for the remote 
> ref at all.
>
> If you already applied the previous patch, you just need to fix the
>
> 	if (read_ref(...) < 0)
> 		continue
> 	if (memcmp(..)) {
>
> to be one case (a failing read_ref should do the exact same thing as a 
> failed memcmp):
>
> 	if (read_ref(...) < 0 ||
> 	    memcmp(...) {
>
> and everything should be ok.
>
> This has gotten _some_ testing, but obviously not enough ;)
>
> 		Linus
>
> diff --git a/fetch-pack.c b/fetch-pack.c
> index 953c0cf..4597369 100644
> --- a/fetch-pack.c
> +++ b/fetch-pack.c
> @@ -12,6 +12,7 @@ static const char *exec = "git-upload-pa
>  static int find_common(int fd[2], unsigned char *result_sha1,
>  		       struct ref *refs)
>  {
> +	int fetching;
>  	static char line[1000];
>  	int count = 0, flushes = 0, retval;
>  	FILE *revs;
> @@ -20,16 +21,19 @@ static int find_common(int fd[2], unsign
>  	if (!revs)
>  		die("unable to run 'git-rev-list'");
>  
> -	while (refs) {
> +	fetching = 0;
> +	for ( ; refs ; refs = refs->next) {
>  		unsigned char *remote = refs->old_sha1;
> -		if (verbose)
> -			fprintf(stderr,
> -				"want %s (%s)\n", sha1_to_hex(remote),
> -				refs->name);
> +		unsigned char *local = refs->new_sha1;
> +
> +		if (!memcmp(remote, local, 20))
> +			continue;
>  		packet_write(fd[1], "want %s\n", sha1_to_hex(remote));
> -		refs = refs->next;
> +		fetching++;
>  	}
>  	packet_flush(fd[1]);
> +	if (!fetching)
> +		return 1;
>  	flushes = 1;
>  	retval = -1;
>  	while (fgets(line, sizeof(line), revs) != NULL) {
> @@ -74,6 +78,35 @@ static int find_common(int fd[2], unsign
>  	return retval;
>  }
>  
> +static int everything_local(struct ref *refs)
> +{
> +	int retval;
> +
> +	for (retval = 1; refs ; refs = refs->next) {
> +		const unsigned char *remote = refs->old_sha1;
> +		unsigned char local[20];
> +
> +		if (read_ref(git_path("%s", refs->name), local) < 0 ||
> +		    memcmp(remote, local, 20)) {
> +			retval = 0;
> +			if (!verbose)
> +				continue;
> +			fprintf(stderr,
> +				"want %s (%s)\n", sha1_to_hex(remote),
> +				refs->name);
> +			continue;
> +		}
> +
> +		memcpy(refs->new_sha1, local, 20);
> +		if (!verbose)
> +			continue;
> +		fprintf(stderr,
> +			"already have %s (%s)\n", sha1_to_hex(remote),
> +			refs->name);
> +	}
> +	return retval;
> +}
> +
>  static int fetch_pack(int fd[2], int nr_match, char **match)
>  {
>  	struct ref *ref;
> @@ -86,6 +119,10 @@ static int fetch_pack(int fd[2], int nr_
>  		packet_flush(fd[1]);
>  		die("no matching remote head");
>  	}
> +	if (everything_local(ref)) {
> +		packet_flush(fd[1]);
> +		goto all_done;
> +	}
>  	if (find_common(fd, sha1, ref) < 0)
>  		fprintf(stderr, "warning: no common commits\n");
>  	pid = fork();
> @@ -109,6 +146,7 @@ static int fetch_pack(int fd[2], int nr_
>  		int code = WEXITSTATUS(status);
>  		if (code)
>  			die("git-unpack-objects died with error code %d", code);
> +all_done:
>  		while (ref) {
>  			printf("%s %s\n",
>  			       sha1_to_hex(ref->old_sha1), ref->name);

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing
  2005-10-18 17:52 [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing Linus Torvalds
  2005-10-18 18:45 ` Junio C Hamano
@ 2005-10-18 18:49 ` Junio C Hamano
  2005-10-18 19:19   ` Junio C Hamano
  2005-10-18 20:38   ` Linus Torvalds
  1 sibling, 2 replies; 7+ messages in thread
From: Junio C Hamano @ 2005-10-18 18:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

Linus Torvalds <torvalds@osdl.org> writes:

> Ok, this should be the trivially fixed (famous last words) patch, which 
> should correct the case where we don't have a local name for the remote 
> ref at all.
>
> If you already applied the previous patch, you just need to fix the

No, I haven't (not my git day today).

It strikes me that we could walk from our refs, depth reasonably
limited to say 20 or so commit chain and/or last 5 days of
commit time, to see if any of the remotes are reachable from our
refs and omit issuing "want" quite cheaply.  Do you think that
would be a worthy change to make things more efficient?

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing
  2005-10-18 18:49 ` Junio C Hamano
@ 2005-10-18 19:19   ` Junio C Hamano
  2005-10-18 20:38   ` Linus Torvalds
  1 sibling, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2005-10-18 19:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git

Junio C Hamano <junkio@cox.net> writes:

> No, I haven't (not my git day today).
>
> It strikes me that we could walk from our refs, depth reasonably
> limited to say 20 or so commit chain and/or last 5 days of
> commit time, to see if any of the remotes are reachable from our
> refs and omit issuing "want" quite cheaply.  Do you think that
> would be a worthy change to make things more efficient?

Something like this on top of your second patch?

 ------------
Subject: do not ask for objects known to be complete.

On top of optimization by Linus not to ask refs that already
match, we can shallowly walk our refs and not issue "want" for
things that we know are reachable from them.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

diff --git a/fetch-pack.c b/fetch-pack.c
index 4597369..212e00f 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -1,6 +1,9 @@
 #include "cache.h"
 #include "refs.h"
 #include "pkt-line.h"
+#include "commit.h"
+#include "tag.h"
+#include <time.h>
 #include <sys/wait.h>
 
 static int quiet;
@@ -78,16 +81,58 @@ static int find_common(int fd[2], unsign
 	return retval;
 }
 
+#define COMPLETE	(1U << 0)
+
+/*
+ * 5 days - this should be configurable.
+ */
+#define RECENT		(5 * 24 * 60 * 60)
+
+static struct commit_list *complete = NULL;
+
+static int mark_complete(const char *path, const unsigned char *sha1)
+{
+	struct object *o = parse_object(sha1);
+
+	while (o && o->type == tag_type) {
+		o->flags |= COMPLETE;
+		o = parse_object(((struct tag *)o)->tagged->sha1);
+	}
+	if (o->type == commit_type) {
+		struct commit *commit = (struct commit *)o;
+		commit->object.flags |= COMPLETE;
+		insert_by_date(commit, &complete);
+	}
+	return 0;
+}
+
+static void mark_recent_complete_commits(unsigned long cutoff_date)
+{
+	while (complete && cutoff_date <= complete->item->date) {
+		if (verbose)
+			fprintf(stderr, "Marking %s as complete\n",
+				sha1_to_hex(complete->item->object.sha1));
+		pop_most_recent_commit(&complete, COMPLETE);
+	}
+}
+
 static int everything_local(struct ref *refs)
 {
 	int retval;
+	time_t now;
+
+	time(&now);
+
+	for_each_ref(mark_complete);
+	mark_recent_complete_commits((unsigned long) now - RECENT);
 
 	for (retval = 1; refs ; refs = refs->next) {
 		const unsigned char *remote = refs->old_sha1;
 		unsigned char local[20];
+		struct object *o;
 
-		if (read_ref(git_path("%s", refs->name), local) < 0 ||
-		    memcmp(remote, local, 20)) {
+		o = parse_object(remote);
+		if (!o || !(o->flags & COMPLETE)) {
 			retval = 0;
 			if (!verbose)
 				continue;

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing
  2005-10-18 18:49 ` Junio C Hamano
  2005-10-18 19:19   ` Junio C Hamano
@ 2005-10-18 20:38   ` Linus Torvalds
  2005-10-18 20:42     ` Linus Torvalds
  1 sibling, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2005-10-18 20:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List



On Tue, 18 Oct 2005, Junio C Hamano wrote:
> 
> It strikes me that we could walk from our refs, depth reasonably
> limited to say 20 or so commit chain and/or last 5 days of
> commit time, to see if any of the remotes are reachable from our
> refs and omit issuing "want" quite cheaply.  Do you think that
> would be a worthy change to make things more efficient?

Probably doesn't make a huge difference, but it might be worth trying.

There's a cheap test you can do _before_ you even start walking: check if 
you have the object that is pointed to by the remote ref at all. If you 
don't have it, then you know it can't be reachable from any of the local 
refs. And if you do have it, the likelihood that it _is_ reachable is 
likely pretty high. 

(I didn't do that for the current fetch-pack optimization, since just 
doing the read_ref() is likely faster than even bothering with the object 
lookup. But if you start traversing commit lists, it suddenly becomes 
more worthwhile).

You'd need to look up the object anyway in order to figure out that it's a 
commit (or points to a commit).

So it might be wasting a bit of time, but the good news is that it wastes 
time on the _client_ side, where we've got plenty.

I'll see if I can come up with a good patch.

		Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing
  2005-10-18 20:38   ` Linus Torvalds
@ 2005-10-18 20:42     ` Linus Torvalds
  2005-10-19  0:27       ` Junio C Hamano
  0 siblings, 1 reply; 7+ messages in thread
From: Linus Torvalds @ 2005-10-18 20:42 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List



On Tue, 18 Oct 2005, Linus Torvalds wrote:
> 
> I'll see if I can come up with a good patch.

I see you already did. Looks fine. I'd suggest limiting the commits by 
number in mark_recent_commit_complete(), because

 (a) somebody might have their clock set wrong and you don't want to walk 
     a huge tree just because of something like that.
 (b) you might just have imported a huge history (badly) from somewhere 
     else
 (c) a _lot_ can happen in five days with automated things.

but yes, the approach looks very sane otherwise.

		Linus

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing
  2005-10-18 20:42     ` Linus Torvalds
@ 2005-10-19  0:27       ` Junio C Hamano
  0 siblings, 0 replies; 7+ messages in thread
From: Junio C Hamano @ 2005-10-19  0:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Linus Torvalds <torvalds@osdl.org> writes:

> I see you already did. Looks fine. I'd suggest limiting the commits by 
> number in mark_recent_commit_complete(), because
>
>  (a) somebody might have their clock set wrong and you don't want to walk 
>      a huge tree just because of something like that.
>  (b) you might just have imported a huge history (badly) from somewhere 
>      else
>  (c) a _lot_ can happen in five days with automated things.
>
> but yes, the approach looks very sane otherwise.

When you have several dozen commits on top of a head you fetched
from the remote last time you polled them, and the remote has
not updated that head since then, it may be worthwhile to have
the client dig deeper to avoid asking the server.

What I am thinking is:

    - For objects our refs directly refer to, mark them COMPLETE
      as the patch I sent out.

    - See if we have any objects the remote refs refer to
      already; find the timestamp of the latest one if we have
      commits among them, and use its time as the cutoff time.

      It is likely that we have synched with them after that
      timestamp (either upload or download).  walk the commits
      from our ref, and mark *everything* that are newer than
      that timestamp.  This can turn out to be a huge walking
      but that happens on the client side.

This way I can get rid of the arbitrary 5-day window, and I do
not have to invent another arbitrary number to limit the commits
we walk.

In other words, let's put the burden on the client if its effort
possibly can help the server.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2005-10-19  0:28 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-10-18 17:52 [CORRECTED PATCH] git-fetch-pack: avoid unnecessary zero packing Linus Torvalds
2005-10-18 18:45 ` Junio C Hamano
2005-10-18 18:49 ` Junio C Hamano
2005-10-18 19:19   ` Junio C Hamano
2005-10-18 20:38   ` Linus Torvalds
2005-10-18 20:42     ` Linus Torvalds
2005-10-19  0:27       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).