All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/25] prune-safety
@ 2014-10-15 22:32 Jeff King
  2014-10-15 22:33 ` [PATCH v2 01/25] foreach_alt_odb: propagate return value from callback Jeff King
                   ` (25 more replies)
  0 siblings, 26 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:32 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

Here's a re-roll of the patch series I posted earlier to make "git
prune" keep more contiguous chunks of the object graph. The cleanups to
t5304 were spun off into their own series, and are dropped here.
However, the other patches seem to have multiplied in number (I must
have fed them after midnight).

Here are the changes since the first round (thanks everybody for your
comments):

  - fix bogus return values from freshen_file, foreach_alt_odb, and
    for_each_packed_object

  - make for_each_object_in_pack static

  - clarify commit message for "keep objects reachable from recent
    objects" patch (this was the one that confused Junio, and I
    elaborated based on our discussion)

  - clarify the definition of "loose object dirs" in the comment above
    for_each_loose_file_in_object_dir

  - in for_each_loose_file, traverse hashed loose object directories in
    numeric order, and pass the number to the subdir callback (this is
    used by prune-packed for its progress updates); as a side effect,
    this fixes the bugs Michael noticed with the subdir callback.

  - prune-packed now reuses the for_each_loose_file interface

  - use revs->ignore_missing_links so we don't barf on already-missing
    unreferenced objects

  - convert reachable.c to use traverse_commit_list instead of its own
    custom walk; this gives support for ignore_missing_links above, and
    saves us a fair bit of code.

  - while in the area, I noticed that reachable.c's reflog handling is
    the same as rev-list's --reflog option; it now builds on what's in
    revision.c.

That takes us up to patch 17. While working in reachable.c, I noticed an
oddity: we consider objects in the index to be reachable during prune
(which is good), but we do not when dropping them during a repack that
uses --unpack-unreachable=<expiration>. The remaining patches fix that,
which needed a fair bit of preparatory cleanup.

I'm really beginning to question whether the "just drop objects that are
about to be pruned" optimization done in 7e52f56 (gc: do not explode
objects which will be immediately pruned, 2012-04-07). It really
complicates things as pack-objects and prune need to have the exact same
rules (and implementing it naively, by having pack-objects run the same
code as prune, is not desirable because pack-objects has _already_ done
a complete expensive traversal to generate the packing list). And I fear
it will get even worse if we implement some of the race-condition fixes
that Michael suggested earlier.

On the other hand, the loosening behavior without 7e52f56 has some
severe pathological cases. A repository which has had a chunk of history
deleted can easily increase in size several orders of magnitude due to
loosening (since we lose the benefit of all deltas in the loosened
objects).

Finally, there are a few things that were discussed that I didn't
address/fix. I don't think any of them is a critical blocker, but I
did want to summarize the state:

  - when refreshing, we may update a pack's mtime multiple times. It
    probably wouldn't be too hard to cache this and only update once per
    program run, but I also don't think it's that big a deal in
    practice.

  - We will munge mtimes of objects found in alternates. If we don't
    have write access to the alternate, we'll create a local duplicate
    of the object. This is the safer thing, but I'm not sure if there
    are cases where we might try to write out a large number of objects
    which exist in an alternate (OTOH, we will eventually drop them at
    the next repack).

  - I didn't implement the "sort by inode" trick that fsck does when
    traversing the loose objects. It wouldn't be too hard, but I'm not
    convinced it's actually important.

  - I didn't convert fsck to the for_each_loose_file interface (mostly
    because I didn't do the inode-sorting trick, and while I don't think
    it matters, I didn't go to the work to show that it _doesn't_).

Here are the patches:

  [01/25]: foreach_alt_odb: propagate return value from callback
  [02/25]: isxdigit: cast input to unsigned char
  [03/25]: object_array: factor out slopbuf-freeing logic
  [04/25]: object_array: add a "clear" function
  [05/25]: clean up name allocation in prepare_revision_walk
  [06/25]: reachable: use traverse_commit_list instead of custom walk
  [07/25]: reachable: reuse revision.c "add all reflogs" code
  [08/25]: prune: factor out loose-object directory traversal
  [09/25]: reachable: mark index blobs as SEEN
  [10/25]: prune-packed: use for_each_loose_file_in_objdir
  [11/25]: count-objects: do not use xsize_t when counting object size
  [12/25]: count-objects: use for_each_loose_file_in_objdir
  [13/25]: sha1_file: add for_each iterators for loose and packed objects
  [14/25]: prune: keep objects reachable from recent objects
  [15/25]: pack-objects: refactor unpack-unreachable expiration check
  [16/25]: pack-objects: match prune logic for discarding objects
  [17/25]: write_sha1_file: freshen existing objects
  [18/25]: make add_object_array_with_context interface more sane
  [19/25]: traverse_commit_list: support pending blobs/trees with paths
  [20/25]: rev-list: document --reflog option
  [21/25]: rev-list: add --index-objects option
  [22/25]: reachable: use revision machinery's --index-objects code
  [23/25]: pack-objects: use argv_array
  [24/25]: repack: pack objects mentioned by the index
  [25/25]: pack-objects: double-check options before discarding objects

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v2 01/25] foreach_alt_odb: propagate return value from callback
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
@ 2014-10-15 22:33 ` Jeff King
  2014-10-15 22:34 ` [PATCH v2 02/25] isxdigit: cast input to unsigned char Jeff King
                   ` (24 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:33 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

We check the return value of the callback and stop iterating
if it is non-zero. However, we do not make the non-zero
return value available to the caller, so they have no way of
knowing whether the operation succeeded or not (technically
they can keep their own error flag in the callback data, but
that is unlike our other for_each functions).

Signed-off-by: Jeff King <peff@peff.net>
---
 cache.h     |  2 +-
 sha1_file.c | 12 ++++++++----
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/cache.h b/cache.h
index 5b86065..13fadb6 100644
--- a/cache.h
+++ b/cache.h
@@ -1125,7 +1125,7 @@ extern void prepare_alt_odb(void);
 extern void read_info_alternates(const char * relative_base, int depth);
 extern void add_to_alternates_file(const char *reference);
 typedef int alt_odb_fn(struct alternate_object_database *, void *);
-extern void foreach_alt_odb(alt_odb_fn, void*);
+extern int foreach_alt_odb(alt_odb_fn, void*);
 
 struct pack_window {
 	struct pack_window *next;
diff --git a/sha1_file.c b/sha1_file.c
index 83f77f0..fa881bf 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -413,14 +413,18 @@ void add_to_alternates_file(const char *reference)
 		link_alt_odb_entries(alt, strlen(alt), '\n', NULL, 0);
 }
 
-void foreach_alt_odb(alt_odb_fn fn, void *cb)
+int foreach_alt_odb(alt_odb_fn fn, void *cb)
 {
 	struct alternate_object_database *ent;
+	int r = 0;
 
 	prepare_alt_odb();
-	for (ent = alt_odb_list; ent; ent = ent->next)
-		if (fn(ent, cb))
-			return;
+	for (ent = alt_odb_list; ent; ent = ent->next) {
+		r = fn(ent, cb);
+		if (r)
+			break;
+	}
+	return r;
 }
 
 void prepare_alt_odb(void)
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 02/25] isxdigit: cast input to unsigned char
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
  2014-10-15 22:33 ` [PATCH v2 01/25] foreach_alt_odb: propagate return value from callback Jeff King
@ 2014-10-15 22:34 ` Jeff King
  2014-10-16 17:16   ` Junio C Hamano
  2014-10-15 22:34 ` [PATCH v2 03/25] object_array: factor out slopbuf-freeing logic Jeff King
                   ` (23 subsequent siblings)
  25 siblings, 1 reply; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:34 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

Otherwise, callers must do so or risk triggering warnings
-Wchar-subscript (and rightfully so; a signed char might
cause us to use a bogus negative index into the
hexval_table).

While we are dropping the now-unnecessary casts from the
caller in urlmatch.c, we can get rid of similar casts in
actually parsing the hex by using the hexval() helper, which
implicitly casts to unsigned (but note that we cannot
implement isxdigit in terms of hexval(), as it also casts
its return value to unsigned).

Signed-off-by: Jeff King <peff@peff.net>
---
The patch that added more calls to isxdigit later in the series actually
got reworked. So this is purely a cleanup and can be dropped if need be,
though I still think it is an improvement.

 git-compat-util.h | 2 +-
 urlmatch.c        | 8 ++++----
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/git-compat-util.h b/git-compat-util.h
index fb41118..44890d5 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -677,7 +677,7 @@ extern const unsigned char sane_ctype[256];
 #define iscntrl(x) (sane_istest(x,GIT_CNTRL))
 #define ispunct(x) sane_istest(x, GIT_PUNCT | GIT_REGEX_SPECIAL | \
 		GIT_GLOB_SPECIAL | GIT_PATHSPEC_MAGIC)
-#define isxdigit(x) (hexval_table[x] != -1)
+#define isxdigit(x) (hexval_table[(unsigned char)(x)] != -1)
 #define tolower(x) sane_case((unsigned char)(x), 0x20)
 #define toupper(x) sane_case((unsigned char)(x), 0)
 #define is_pathspec_magic(x) sane_istest(x,GIT_PATHSPEC_MAGIC)
diff --git a/urlmatch.c b/urlmatch.c
index 3d4c54b..618d216 100644
--- a/urlmatch.c
+++ b/urlmatch.c
@@ -43,11 +43,11 @@ static int append_normalized_escapes(struct strbuf *buf,
 		from_len--;
 		if (ch == '%') {
 			if (from_len < 2 ||
-			    !isxdigit((unsigned char)from[0]) ||
-			    !isxdigit((unsigned char)from[1]))
+			    !isxdigit(from[0]) ||
+			    !isxdigit(from[1]))
 				return 0;
-			ch = hexval_table[(unsigned char)*from++] << 4;
-			ch |= hexval_table[(unsigned char)*from++];
+			ch = hexval(*from++) << 4;
+			ch |= hexval(*from++);
 			from_len -= 2;
 			was_esc = 1;
 		}
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 03/25] object_array: factor out slopbuf-freeing logic
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
  2014-10-15 22:33 ` [PATCH v2 01/25] foreach_alt_odb: propagate return value from callback Jeff King
  2014-10-15 22:34 ` [PATCH v2 02/25] isxdigit: cast input to unsigned char Jeff King
@ 2014-10-15 22:34 ` Jeff King
  2014-10-16 17:39   ` Junio C Hamano
  2014-10-15 22:34 ` [PATCH v2 04/25] object_array: add a "clear" function Jeff King
                   ` (22 subsequent siblings)
  25 siblings, 1 reply; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:34 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

This is not a lot of code, but it's a logical construct that
should not need to be repeated (and we are about to add a
third repetition).

Signed-off-by: Jeff King <peff@peff.net>
---
 object.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/object.c b/object.c
index ca9d790..60f4864 100644
--- a/object.c
+++ b/object.c
@@ -355,6 +355,16 @@ void add_object_array_with_context(struct object *obj, const char *name, struct
 		add_object_array_with_mode_context(obj, name, array, S_IFINVALID, context);
 }
 
+/*
+ * Free all memory associated with an entry; the result is
+ * in an unspecified state and should not be examined.
+ */
+static void object_array_release_entry(struct object_array_entry *ent)
+{
+	if (ent->name != object_array_slopbuf)
+		free(ent->name);
+}
+
 void object_array_filter(struct object_array *array,
 			 object_array_each_func_t want, void *cb_data)
 {
@@ -367,8 +377,7 @@ void object_array_filter(struct object_array *array,
 				objects[dst] = objects[src];
 			dst++;
 		} else {
-			if (objects[src].name != object_array_slopbuf)
-				free(objects[src].name);
+			object_array_release_entry(&objects[src]);
 		}
 	}
 	array->nr = dst;
@@ -400,8 +409,7 @@ void object_array_remove_duplicates(struct object_array *array)
 				objects[array->nr] = objects[src];
 			array->nr++;
 		} else {
-			if (objects[src].name != object_array_slopbuf)
-				free(objects[src].name);
+			object_array_release_entry(&objects[src]);
 		}
 	}
 }
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 04/25] object_array: add a "clear" function
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (2 preceding siblings ...)
  2014-10-15 22:34 ` [PATCH v2 03/25] object_array: factor out slopbuf-freeing logic Jeff King
@ 2014-10-15 22:34 ` Jeff King
  2014-10-15 22:35 ` [PATCH v2 05/25] clean up name allocation in prepare_revision_walk Jeff King
                   ` (21 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:34 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

There's currently no easy way to free the memory associated
with an object_array (and in most cases, we simply leak the
memory in a rev_info's pending array). Let's provide a
helper to make this easier to handle.

We can make use of it in list-objects.c, which does the same
thing by hand (but fails to free the "name" field of each
entry, potentially leaking memory).

Signed-off-by: Jeff King <peff@peff.net>
---
 list-objects.c |  7 +------
 object.c       | 10 ++++++++++
 object.h       |  6 ++++++
 3 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index 3595ee7..fad6808 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -228,11 +228,6 @@ void traverse_commit_list(struct rev_info *revs,
 		die("unknown pending object %s (%s)",
 		    sha1_to_hex(obj->sha1), name);
 	}
-	if (revs->pending.nr) {
-		free(revs->pending.objects);
-		revs->pending.nr = 0;
-		revs->pending.alloc = 0;
-		revs->pending.objects = NULL;
-	}
+	object_array_clear(&revs->pending);
 	strbuf_release(&base);
 }
diff --git a/object.c b/object.c
index 60f4864..6aeb1bb 100644
--- a/object.c
+++ b/object.c
@@ -383,6 +383,16 @@ void object_array_filter(struct object_array *array,
 	array->nr = dst;
 }
 
+void object_array_clear(struct object_array *array)
+{
+	int i;
+	for (i = 0; i < array->nr; i++)
+		object_array_release_entry(&array->objects[i]);
+	free(array->objects);
+	array->objects = NULL;
+	array->nr = array->alloc = 0;
+}
+
 /*
  * Return true iff array already contains an entry with name.
  */
diff --git a/object.h b/object.h
index e028ced..2a755a2 100644
--- a/object.h
+++ b/object.h
@@ -133,6 +133,12 @@ void object_array_filter(struct object_array *array,
  */
 void object_array_remove_duplicates(struct object_array *array);
 
+/*
+ * Remove any objects from the array, freeing all used memory; afterwards
+ * the array is ready to store more objects with add_object_array().
+ */
+void object_array_clear(struct object_array *array);
+
 void clear_object_flags(unsigned flags);
 
 #endif /* OBJECT_H */
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 05/25] clean up name allocation in prepare_revision_walk
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (3 preceding siblings ...)
  2014-10-15 22:34 ` [PATCH v2 04/25] object_array: add a "clear" function Jeff King
@ 2014-10-15 22:35 ` Jeff King
  2014-10-15 22:37 ` [PATCH v2 06/25] reachable: use traverse_commit_list instead of custom walk Jeff King
                   ` (20 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:35 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

When we enter prepare_revision_walk, we have zero or more
entries in our "pending" array. We disconnect that array
from the rev_info, and then process each entry:

  1. If the entry is a commit and the --source option is in
     effect, we keep a pointer to the object name.

  2. Otherwise, we re-add the item to the pending list with
     a blank name.

We then throw away the old array by freeing the array
itself, but do not touch the "name" field of each entry. For
any items of type (2), we leak the memory associated with
the name. This commit fixes that by calling object_array_clear,
which handles the cleanup for us.

That breaks (1), though, because it depends on the memory
pointed to by the name to last forever. We can solve that by
making a copy of the name. This is slightly less efficient,
but it shouldn't matter in practice, as we do it only for
the tip commits of the traversal.

Signed-off-by: Jeff King <peff@peff.net>
---
 revision.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/revision.c b/revision.c
index e498b7c..01cc276 100644
--- a/revision.c
+++ b/revision.c
@@ -300,7 +300,7 @@ static struct commit *handle_commit(struct rev_info *revs,
 			revs->limited = 1;
 		}
 		if (revs->show_source && !commit->util)
-			commit->util = (void *) name;
+			commit->util = xstrdup(name);
 		return commit;
 	}
 
@@ -2656,15 +2656,16 @@ void reset_revision_walk(void)
 
 int prepare_revision_walk(struct rev_info *revs)
 {
-	int nr = revs->pending.nr;
-	struct object_array_entry *e, *list;
+	int i;
+	struct object_array old_pending;
 	struct commit_list **next = &revs->commits;
 
-	e = list = revs->pending.objects;
+	memcpy(&old_pending, &revs->pending, sizeof(old_pending));
 	revs->pending.nr = 0;
 	revs->pending.alloc = 0;
 	revs->pending.objects = NULL;
-	while (--nr >= 0) {
+	for (i = 0; i < old_pending.nr; i++) {
+		struct object_array_entry *e = old_pending.objects + i;
 		struct commit *commit = handle_commit(revs, e->item, e->name);
 		if (commit) {
 			if (!(commit->object.flags & SEEN)) {
@@ -2672,10 +2673,9 @@ int prepare_revision_walk(struct rev_info *revs)
 				next = commit_list_append(commit, next);
 			}
 		}
-		e++;
 	}
 	if (!revs->leak_pending)
-		free(list);
+		object_array_clear(&old_pending);
 
 	/* Signal whether we need per-parent treesame decoration */
 	if (revs->simplify_merges ||
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 06/25] reachable: use traverse_commit_list instead of custom walk
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (4 preceding siblings ...)
  2014-10-15 22:35 ` [PATCH v2 05/25] clean up name allocation in prepare_revision_walk Jeff King
@ 2014-10-15 22:37 ` Jeff King
  2014-10-16 17:53   ` Junio C Hamano
  2014-10-15 22:38 ` [PATCH v2 07/25] reachable: reuse revision.c "add all reflogs" code Jeff King
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:37 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

To find the set of reachable objects, we add a bunch of
possible sources to our rev_info, call prepare_revision_walk,
and then launch into a custom walker that handles each
object top. This is a subset of what traverse_commit_list
does, so we can just reuse that code (it can also handle
more complex cases like UNINTERESTING commits and pathspecs,
but we don't use those features).

Signed-off-by: Jeff King <peff@peff.net>
---
I was concerned this would be slower because traverse_commit_list is
more featureful. To my surprise, it was consistently about 3-4% faster!
The major difference is that traverse_commit_list will hit all of the
commits first, and then the trees. For reachability that doesn't matter
either way, but I suspect the new way has slightly better cache
locality, leading to the minor speedup.

 reachable.c | 130 ++++++++----------------------------------------------------
 1 file changed, 17 insertions(+), 113 deletions(-)

diff --git a/reachable.c b/reachable.c
index 6f6835b..02bf6c2 100644
--- a/reachable.c
+++ b/reachable.c
@@ -8,6 +8,7 @@
 #include "reachable.h"
 #include "cache-tree.h"
 #include "progress.h"
+#include "list-objects.h"
 
 struct connectivity_progress {
 	struct progress *progress;
@@ -21,118 +22,6 @@ static void update_progress(struct connectivity_progress *cp)
 		display_progress(cp->progress, cp->count);
 }
 
-static void process_blob(struct blob *blob,
-			 struct object_array *p,
-			 struct name_path *path,
-			 const char *name,
-			 struct connectivity_progress *cp)
-{
-	struct object *obj = &blob->object;
-
-	if (!blob)
-		die("bad blob object");
-	if (obj->flags & SEEN)
-		return;
-	obj->flags |= SEEN;
-	update_progress(cp);
-	/* Nothing to do, really .. The blob lookup was the important part */
-}
-
-static void process_gitlink(const unsigned char *sha1,
-			    struct object_array *p,
-			    struct name_path *path,
-			    const char *name)
-{
-	/* I don't think we want to recurse into this, really. */
-}
-
-static void process_tree(struct tree *tree,
-			 struct object_array *p,
-			 struct name_path *path,
-			 const char *name,
-			 struct connectivity_progress *cp)
-{
-	struct object *obj = &tree->object;
-	struct tree_desc desc;
-	struct name_entry entry;
-	struct name_path me;
-
-	if (!tree)
-		die("bad tree object");
-	if (obj->flags & SEEN)
-		return;
-	obj->flags |= SEEN;
-	update_progress(cp);
-	if (parse_tree(tree) < 0)
-		die("bad tree object %s", sha1_to_hex(obj->sha1));
-	add_object(obj, p, path, name);
-	me.up = path;
-	me.elem = name;
-	me.elem_len = strlen(name);
-
-	init_tree_desc(&desc, tree->buffer, tree->size);
-
-	while (tree_entry(&desc, &entry)) {
-		if (S_ISDIR(entry.mode))
-			process_tree(lookup_tree(entry.sha1), p, &me, entry.path, cp);
-		else if (S_ISGITLINK(entry.mode))
-			process_gitlink(entry.sha1, p, &me, entry.path);
-		else
-			process_blob(lookup_blob(entry.sha1), p, &me, entry.path, cp);
-	}
-	free_tree_buffer(tree);
-}
-
-static void process_tag(struct tag *tag, struct object_array *p,
-			const char *name, struct connectivity_progress *cp)
-{
-	struct object *obj = &tag->object;
-
-	if (obj->flags & SEEN)
-		return;
-	obj->flags |= SEEN;
-	update_progress(cp);
-
-	if (parse_tag(tag) < 0)
-		die("bad tag object %s", sha1_to_hex(obj->sha1));
-	if (tag->tagged)
-		add_object(tag->tagged, p, NULL, name);
-}
-
-static void walk_commit_list(struct rev_info *revs,
-			     struct connectivity_progress *cp)
-{
-	int i;
-	struct commit *commit;
-	struct object_array objects = OBJECT_ARRAY_INIT;
-
-	/* Walk all commits, process their trees */
-	while ((commit = get_revision(revs)) != NULL) {
-		process_tree(commit->tree, &objects, NULL, "", cp);
-		update_progress(cp);
-	}
-
-	/* Then walk all the pending objects, recursively processing them too */
-	for (i = 0; i < revs->pending.nr; i++) {
-		struct object_array_entry *pending = revs->pending.objects + i;
-		struct object *obj = pending->item;
-		const char *name = pending->name;
-		if (obj->type == OBJ_TAG) {
-			process_tag((struct tag *) obj, &objects, name, cp);
-			continue;
-		}
-		if (obj->type == OBJ_TREE) {
-			process_tree((struct tree *)obj, &objects, NULL, name, cp);
-			continue;
-		}
-		if (obj->type == OBJ_BLOB) {
-			process_blob((struct blob *)obj, &objects, NULL, name, cp);
-			continue;
-		}
-		die("unknown pending object %s (%s)", sha1_to_hex(obj->sha1), name);
-	}
-}
-
 static int add_one_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
 		const char *email, unsigned long timestamp, int tz,
 		const char *message, void *cb_data)
@@ -210,6 +99,21 @@ static void add_cache_refs(struct rev_info *revs)
 		add_cache_tree(active_cache_tree, revs);
 }
 
+/*
+ * The traversal will have already marked us as SEEN, so we
+ * only need to handle any progress reporting here.
+ */
+static void mark_object(struct object *obj, const struct name_path *path,
+			const char *name, void *data)
+{
+	update_progress(data);
+}
+
+static void mark_commit(struct commit *c, void *data)
+{
+	mark_object(&c->object, NULL, NULL, data);
+}
+
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 			    struct progress *progress)
 {
@@ -245,6 +149,6 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 	 */
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
-	walk_commit_list(revs, &cp);
+	traverse_commit_list(revs, mark_commit, mark_object, &cp);
 	display_progress(cp.progress, cp.count);
 }
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 07/25] reachable: reuse revision.c "add all reflogs" code
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (5 preceding siblings ...)
  2014-10-15 22:37 ` [PATCH v2 06/25] reachable: use traverse_commit_list instead of custom walk Jeff King
@ 2014-10-15 22:38 ` Jeff King
  2014-10-15 22:38 ` [PATCH v2 08/25] prune: factor out loose-object directory traversal Jeff King
                   ` (18 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:38 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

We want to add all reflog entries as tips for finding
reachable objects. The revision machinery can already do
this (to support "rev-list --reflog"); we can reuse that
code.

Signed-off-by: Jeff King <peff@peff.net>
---
This one is not strictly necessary, but it seems like a nice cleanup.
Note that the big difference in the revision.c code is that it will
print a warning for broken reflogs, but I think that's fine (and perhaps
even desirable) here.

 reachable.c | 24 +-----------------------
 revision.c  |  4 ++--
 revision.h  |  1 +
 3 files changed, 4 insertions(+), 25 deletions(-)

diff --git a/reachable.c b/reachable.c
index 02bf6c2..4e68cfa 100644
--- a/reachable.c
+++ b/reachable.c
@@ -22,22 +22,6 @@ static void update_progress(struct connectivity_progress *cp)
 		display_progress(cp->progress, cp->count);
 }
 
-static int add_one_reflog_ent(unsigned char *osha1, unsigned char *nsha1,
-		const char *email, unsigned long timestamp, int tz,
-		const char *message, void *cb_data)
-{
-	struct object *object;
-	struct rev_info *revs = (struct rev_info *)cb_data;
-
-	object = parse_object(osha1);
-	if (object)
-		add_pending_object(revs, object, "");
-	object = parse_object(nsha1);
-	if (object)
-		add_pending_object(revs, object, "");
-	return 0;
-}
-
 static int add_one_ref(const char *path, const unsigned char *sha1, int flag, void *cb_data)
 {
 	struct object *object = parse_object_or_die(sha1, path);
@@ -48,12 +32,6 @@ static int add_one_ref(const char *path, const unsigned char *sha1, int flag, vo
 	return 0;
 }
 
-static int add_one_reflog(const char *path, const unsigned char *sha1, int flag, void *cb_data)
-{
-	for_each_reflog_ent(path, add_one_reflog_ent, cb_data);
-	return 0;
-}
-
 static void add_one_tree(const unsigned char *sha1, struct rev_info *revs)
 {
 	struct tree *tree = lookup_tree(sha1);
@@ -138,7 +116,7 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 
 	/* Add all reflog info */
 	if (mark_reflog)
-		for_each_reflog(add_one_reflog, revs);
+		add_reflogs_to_pending(revs, 0);
 
 	cp.progress = progress;
 	cp.count = 0;
diff --git a/revision.c b/revision.c
index 01cc276..b8e02e2 100644
--- a/revision.c
+++ b/revision.c
@@ -1275,7 +1275,7 @@ static int handle_one_reflog(const char *path, const unsigned char *sha1, int fl
 	return 0;
 }
 
-static void handle_reflog(struct rev_info *revs, unsigned flags)
+void add_reflogs_to_pending(struct rev_info *revs, unsigned flags)
 {
 	struct all_refs_cb cb;
 	cb.all_revs = revs;
@@ -2061,7 +2061,7 @@ static int handle_revision_pseudo_opt(const char *submodule,
 		for_each_glob_ref_in(handle_one_ref, arg + 10, "refs/remotes/", &cb);
 		clear_ref_exclusion(&revs->ref_excludes);
 	} else if (!strcmp(arg, "--reflog")) {
-		handle_reflog(revs, *flags);
+		add_reflogs_to_pending(revs, *flags);
 	} else if (!strcmp(arg, "--not")) {
 		*flags ^= UNINTERESTING | BOTTOM;
 	} else if (!strcmp(arg, "--no-walk")) {
diff --git a/revision.h b/revision.h
index a620530..e644044 100644
--- a/revision.h
+++ b/revision.h
@@ -276,6 +276,7 @@ extern void add_pending_sha1(struct rev_info *revs,
 			     unsigned int flags);
 
 extern void add_head_to_pending(struct rev_info *);
+extern void add_reflogs_to_pending(struct rev_info *, unsigned int flags);
 
 enum commit_action {
 	commit_ignore,
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 08/25] prune: factor out loose-object directory traversal
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (6 preceding siblings ...)
  2014-10-15 22:38 ` [PATCH v2 07/25] reachable: reuse revision.c "add all reflogs" code Jeff King
@ 2014-10-15 22:38 ` Jeff King
  2014-10-15 22:40 ` [PATCH v2 09/25] reachable: mark index blobs as SEEN Jeff King
                   ` (17 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:38 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

Prune has to walk $GIT_DIR/objects/?? in order to find the
set of loose objects to prune. Other parts of the code
(e.g., count-objects) want to do the same. Let's factor it
out into a reusable for_each-style function.

Note that this is not quite a straight code movement. The
original code had strange behavior when it found a file of
the form "[0-9a-f]{2}/.{38}" that did _not_ contain all hex
digits. It executed a "break" from the loop, meaning that we
stopped pruning in that directory (but still pruned other
directories!). This was probably a bug; we do not want to
process the file as an object, but we should keep going
otherwise (and that is how the new code handles it).

We are also a little more careful with loose object
directories which fail to open. The original code silently
ignored any failures, but the new code will complain about
any problems besides ENOENT.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/prune.c | 87 +++++++++++++++++----------------------------------------
 cache.h         | 33 ++++++++++++++++++++++
 sha1_file.c     | 84 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 143 insertions(+), 61 deletions(-)

diff --git a/builtin/prune.c b/builtin/prune.c
index 144a3bd..763f53e 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -31,11 +31,23 @@ static int prune_tmp_file(const char *fullpath)
 	return 0;
 }
 
-static int prune_object(const char *fullpath, const unsigned char *sha1)
+static int prune_object(const unsigned char *sha1, const char *fullpath,
+			void *data)
 {
 	struct stat st;
-	if (lstat(fullpath, &st))
-		return error("Could not stat '%s'", fullpath);
+
+	/*
+	 * Do we know about this object?
+	 * It must have been reachable
+	 */
+	if (lookup_object(sha1))
+		return 0;
+
+	if (lstat(fullpath, &st)) {
+		/* report errors, but do not stop pruning */
+		error("Could not stat '%s'", fullpath);
+		return 0;
+	}
 	if (st.st_mtime > expire)
 		return 0;
 	if (show_only || verbose) {
@@ -48,68 +60,20 @@ static int prune_object(const char *fullpath, const unsigned char *sha1)
 	return 0;
 }
 
-static int prune_dir(int i, struct strbuf *path)
+static int prune_cruft(const char *basename, const char *path, void *data)
 {
-	size_t baselen = path->len;
-	DIR *dir = opendir(path->buf);
-	struct dirent *de;
-
-	if (!dir)
-		return 0;
-
-	while ((de = readdir(dir)) != NULL) {
-		char name[100];
-		unsigned char sha1[20];
-
-		if (is_dot_or_dotdot(de->d_name))
-			continue;
-		if (strlen(de->d_name) == 38) {
-			sprintf(name, "%02x", i);
-			memcpy(name+2, de->d_name, 39);
-			if (get_sha1_hex(name, sha1) < 0)
-				break;
-
-			/*
-			 * Do we know about this object?
-			 * It must have been reachable
-			 */
-			if (lookup_object(sha1))
-				continue;
-
-			strbuf_addf(path, "/%s", de->d_name);
-			prune_object(path->buf, sha1);
-			strbuf_setlen(path, baselen);
-			continue;
-		}
-		if (starts_with(de->d_name, "tmp_obj_")) {
-			strbuf_addf(path, "/%s", de->d_name);
-			prune_tmp_file(path->buf);
-			strbuf_setlen(path, baselen);
-			continue;
-		}
-		fprintf(stderr, "bad sha1 file: %s/%s\n", path->buf, de->d_name);
-	}
-	closedir(dir);
-	if (!show_only)
-		rmdir(path->buf);
+	if (starts_with(basename, "tmp_obj_"))
+		prune_tmp_file(path);
+	else
+		fprintf(stderr, "bad sha1 file: %s\n", path);
 	return 0;
 }
 
-static void prune_object_dir(const char *path)
+static int prune_subdir(int nr, const char *path, void *data)
 {
-	struct strbuf buf = STRBUF_INIT;
-	size_t baselen;
-	int i;
-
-	strbuf_addstr(&buf, path);
-	strbuf_addch(&buf, '/');
-	baselen = buf.len;
-
-	for (i = 0; i < 256; i++) {
-		strbuf_addf(&buf, "%02x", i);
-		prune_dir(i, &buf);
-		strbuf_setlen(&buf, baselen);
-	}
+	if (!show_only)
+		rmdir(path);
+	return 0;
 }
 
 /*
@@ -173,7 +137,8 @@ int cmd_prune(int argc, const char **argv, const char *prefix)
 
 	mark_reachable_objects(&revs, 1, progress);
 	stop_progress(&progress);
-	prune_object_dir(get_object_directory());
+	for_each_loose_file_in_objdir(get_object_directory(), prune_object,
+				      prune_cruft, prune_subdir, NULL);
 
 	prune_packed_objects(show_only ? PRUNE_PACKED_DRY_RUN : 0);
 	remove_temporary_files(get_object_directory());
diff --git a/cache.h b/cache.h
index 13fadb6..8ffefaa 100644
--- a/cache.h
+++ b/cache.h
@@ -1221,6 +1221,39 @@ extern unsigned long unpack_object_header_buffer(const unsigned char *buf, unsig
 extern unsigned long get_size_from_delta(struct packed_git *, struct pack_window **, off_t);
 extern int unpack_object_header(struct packed_git *, struct pack_window **, off_t *, unsigned long *);
 
+/*
+ * Iterate over the files in the loose-object parts of the object
+ * directory "path", triggering the following callbacks:
+ *
+ *  - loose_object is called for each loose object we find.
+ *
+ *  - loose_cruft is called for any files that do not appear to be
+ *    loose objects. Note that we only look in the loose object
+ *    directories "objects/[0-9a-f]{2}/", so we will not report
+ *    "objects/foobar" as cruft.
+ *
+ *  - loose_subdir is called for each top-level hashed subdirectory
+ *    of the object directory (e.g., "$OBJDIR/f0"). It is called
+ *    after the objects in the directory are processed.
+ *
+ * Any callback that is NULL will be ignored. Callbacks returning non-zero
+ * will end the iteration.
+ */
+typedef int each_loose_object_fn(const unsigned char *sha1,
+				 const char *path,
+				 void *data);
+typedef int each_loose_cruft_fn(const char *basename,
+				const char *path,
+				void *data);
+typedef int each_loose_subdir_fn(int nr,
+				 const char *path,
+				 void *data);
+int for_each_loose_file_in_objdir(const char *path,
+				  each_loose_object_fn obj_cb,
+				  each_loose_cruft_fn cruft_cb,
+				  each_loose_subdir_fn subdir_cb,
+				  void *data);
+
 struct object_info {
 	/* Request */
 	enum object_type *typep;
diff --git a/sha1_file.c b/sha1_file.c
index fa881bf..a20240b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3265,3 +3265,87 @@ void assert_sha1_type(const unsigned char *sha1, enum object_type expect)
 		die("%s is not a valid '%s' object", sha1_to_hex(sha1),
 		    typename(expect));
 }
+
+static int for_each_file_in_obj_subdir(int subdir_nr,
+				       struct strbuf *path,
+				       each_loose_object_fn obj_cb,
+				       each_loose_cruft_fn cruft_cb,
+				       each_loose_subdir_fn subdir_cb,
+				       void *data)
+{
+	size_t baselen = path->len;
+	DIR *dir = opendir(path->buf);
+	struct dirent *de;
+	int r = 0;
+
+	if (!dir) {
+		if (errno == ENOENT)
+			return 0;
+		return error("unable to open %s: %s", path->buf, strerror(errno));
+	}
+
+	while ((de = readdir(dir))) {
+		if (is_dot_or_dotdot(de->d_name))
+			continue;
+
+		strbuf_setlen(path, baselen);
+		strbuf_addf(path, "/%s", de->d_name);
+
+		if (strlen(de->d_name) == 38)  {
+			char hex[41];
+			unsigned char sha1[20];
+
+			snprintf(hex, sizeof(hex), "%02x%s",
+				 subdir_nr, de->d_name);
+			if (!get_sha1_hex(hex, sha1)) {
+				if (obj_cb) {
+					r = obj_cb(sha1, path->buf, data);
+					if (r)
+						break;
+				}
+				continue;
+			}
+		}
+
+		if (cruft_cb) {
+			r = cruft_cb(de->d_name, path->buf, data);
+			if (r)
+				break;
+		}
+	}
+	strbuf_setlen(path, baselen);
+
+	if (!r && subdir_cb)
+		r = subdir_cb(subdir_nr, path->buf, data);
+
+	closedir(dir);
+	return r;
+}
+
+int for_each_loose_file_in_objdir(const char *path,
+			    each_loose_object_fn obj_cb,
+			    each_loose_cruft_fn cruft_cb,
+			    each_loose_subdir_fn subdir_cb,
+			    void *data)
+{
+	struct strbuf buf = STRBUF_INIT;
+	size_t baselen;
+	int r = 0;
+	int i;
+
+	strbuf_addstr(&buf, path);
+	strbuf_addch(&buf, '/');
+	baselen = buf.len;
+
+	for (i = 0; i < 256; i++) {
+		strbuf_addf(&buf, "%02x", i);
+		r = for_each_file_in_obj_subdir(i, &buf, obj_cb, cruft_cb,
+						subdir_cb, data);
+		strbuf_setlen(&buf, baselen);
+		if (r)
+			break;
+	}
+
+	strbuf_release(&buf);
+	return r;
+}
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 09/25] reachable: mark index blobs as SEEN
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (7 preceding siblings ...)
  2014-10-15 22:38 ` [PATCH v2 08/25] prune: factor out loose-object directory traversal Jeff King
@ 2014-10-15 22:40 ` Jeff King
  2014-10-15 22:40 ` [PATCH v2 10/25] prune-packed: use for_each_loose_file_in_objdir Jeff King
                   ` (16 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:40 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

When we mark all reachable objects for pruning, that
includes blobs mentioned by the index. However, we do not
mark these with the SEEN flag, as we do for objects that we
find by traversing (we also do not add them to the pending
list, but that is because there is nothing further to
traverse with them).

This doesn't cause any problems with prune, because it
checks only that the object exists in the global object
hash, and not its flags. However, let's mark these objects
to be consistent and avoid any later surprises.

Signed-off-by: Jeff King <peff@peff.net>
---
This code actually gets dropped later on in the series (when we teach
the revision machinery to look at index objects), so this could
technically be omitted. But it also keeps the minor behavior change
here by itself, where it is explained, instead of as a side effect of
the movement.

 reachable.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/reachable.c b/reachable.c
index 4e68cfa..d03f829 100644
--- a/reachable.c
+++ b/reachable.c
@@ -55,6 +55,8 @@ static void add_cache_refs(struct rev_info *revs)
 
 	read_cache();
 	for (i = 0; i < active_nr; i++) {
+		struct blob *blob;
+
 		/*
 		 * The index can contain blobs and GITLINKs, GITLINKs are hashes
 		 * that don't actually point to objects in the repository, it's
@@ -65,7 +67,10 @@ static void add_cache_refs(struct rev_info *revs)
 		if (S_ISGITLINK(active_cache[i]->ce_mode))
 			continue;
 
-		lookup_blob(active_cache[i]->sha1);
+		blob = lookup_blob(active_cache[i]->sha1);
+		if (blob)
+			blob->object.flags |= SEEN;
+
 		/*
 		 * We could add the blobs to the pending list, but quite
 		 * frankly, we don't care. Once we've looked them up, and
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 10/25] prune-packed: use for_each_loose_file_in_objdir
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (8 preceding siblings ...)
  2014-10-15 22:40 ` [PATCH v2 09/25] reachable: mark index blobs as SEEN Jeff King
@ 2014-10-15 22:40 ` Jeff King
  2014-10-15 22:40 ` [PATCH v2 11/25] count-objects: do not use xsize_t when counting object size Jeff King
                   ` (15 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:40 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

This saves us from manually traversing the directory
structure ourselves.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/prune-packed.c | 69 +++++++++++++++++---------------------------------
 1 file changed, 23 insertions(+), 46 deletions(-)

diff --git a/builtin/prune-packed.c b/builtin/prune-packed.c
index d430731..f24a2c2 100644
--- a/builtin/prune-packed.c
+++ b/builtin/prune-packed.c
@@ -10,65 +10,42 @@ static const char * const prune_packed_usage[] = {
 
 static struct progress *progress;
 
-static void prune_dir(int i, DIR *dir, struct strbuf *pathname, int opts)
+static int prune_subdir(int nr, const char *path, void *data)
 {
-	struct dirent *de;
-	char hex[40];
-	int top_len = pathname->len;
+	int *opts = data;
+	display_progress(progress, nr + 1);
+	if (!(*opts & PRUNE_PACKED_DRY_RUN))
+		rmdir(path);
+	return 0;
+}
+
+static int prune_object(const unsigned char *sha1, const char *path,
+			 void *data)
+{
+	int *opts = data;
 
-	sprintf(hex, "%02x", i);
-	while ((de = readdir(dir)) != NULL) {
-		unsigned char sha1[20];
-		if (strlen(de->d_name) != 38)
-			continue;
-		memcpy(hex + 2, de->d_name, 38);
-		if (get_sha1_hex(hex, sha1))
-			continue;
-		if (!has_sha1_pack(sha1))
-			continue;
+	if (!has_sha1_pack(sha1))
+		return 0;
 
-		strbuf_add(pathname, de->d_name, 38);
-		if (opts & PRUNE_PACKED_DRY_RUN)
-			printf("rm -f %s\n", pathname->buf);
-		else
-			unlink_or_warn(pathname->buf);
-		display_progress(progress, i + 1);
-		strbuf_setlen(pathname, top_len);
-	}
+	if (*opts & PRUNE_PACKED_DRY_RUN)
+		printf("rm -f %s\n", path);
+	else
+		unlink_or_warn(path);
+	return 0;
 }
 
 void prune_packed_objects(int opts)
 {
-	int i;
-	const char *dir = get_object_directory();
-	struct strbuf pathname = STRBUF_INIT;
-	int top_len;
-
-	strbuf_addstr(&pathname, dir);
 	if (opts & PRUNE_PACKED_VERBOSE)
 		progress = start_progress_delay(_("Removing duplicate objects"),
 			256, 95, 2);
 
-	if (pathname.len && pathname.buf[pathname.len - 1] != '/')
-		strbuf_addch(&pathname, '/');
-
-	top_len = pathname.len;
-	for (i = 0; i < 256; i++) {
-		DIR *d;
+	for_each_loose_file_in_objdir(get_object_directory(),
+				      prune_object, NULL, prune_subdir, &opts);
 
-		display_progress(progress, i + 1);
-		strbuf_setlen(&pathname, top_len);
-		strbuf_addf(&pathname, "%02x/", i);
-		d = opendir(pathname.buf);
-		if (!d)
-			continue;
-		prune_dir(i, d, &pathname, opts);
-		closedir(d);
-		strbuf_setlen(&pathname, top_len + 2);
-		rmdir(pathname.buf);
-	}
+	/* Ensure we show 100% before finishing progress */
+	display_progress(progress, 256);
 	stop_progress(&progress);
-	strbuf_release(&pathname);
 }
 
 int cmd_prune_packed(int argc, const char **argv, const char *prefix)
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 11/25] count-objects: do not use xsize_t when counting object size
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (9 preceding siblings ...)
  2014-10-15 22:40 ` [PATCH v2 10/25] prune-packed: use for_each_loose_file_in_objdir Jeff King
@ 2014-10-15 22:40 ` Jeff King
  2014-10-15 22:41 ` [PATCH v2 12/25] count-objects: use for_each_loose_file_in_objdir Jeff King
                   ` (14 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:40 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

The point of xsize_t is to safely cast an off_t into a size_t
(because we are about to mmap). But in count-objects, we are
summing the sizes in an off_t. Using xsize_t means that
count-objects could fail on a 32-bit system with a 4G
object (not likely, as other parts of git would fail, but
we should at least be correct here).

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/count-objects.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index a7f70cb..316a805 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -53,7 +53,7 @@ static void count_objects(DIR *d, char *path, int len, int verbose,
 			if (lstat(path, &st) || !S_ISREG(st.st_mode))
 				bad = 1;
 			else
-				(*loose_size) += xsize_t(on_disk_bytes(st));
+				(*loose_size) += on_disk_bytes(st);
 		}
 		if (bad) {
 			if (verbose) {
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 12/25] count-objects: use for_each_loose_file_in_objdir
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (10 preceding siblings ...)
  2014-10-15 22:40 ` [PATCH v2 11/25] count-objects: do not use xsize_t when counting object size Jeff King
@ 2014-10-15 22:41 ` Jeff King
  2014-10-15 22:41 ` [PATCH v2 13/25] sha1_file: add for_each iterators for loose and packed objects Jeff King
                   ` (13 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

This drops our line count considerably, and should make
things more readable by keeping the counting logic separate
from the traversal.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/count-objects.c | 101 ++++++++++++++----------------------------------
 1 file changed, 30 insertions(+), 71 deletions(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index 316a805..e47ef0b 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -11,6 +11,9 @@
 
 static unsigned long garbage;
 static off_t size_garbage;
+static int verbose;
+static unsigned long loose, packed, packed_loose;
+static off_t loose_size;
 
 static void real_report_garbage(const char *desc, const char *path)
 {
@@ -21,61 +24,31 @@ static void real_report_garbage(const char *desc, const char *path)
 	garbage++;
 }
 
-static void count_objects(DIR *d, char *path, int len, int verbose,
-			  unsigned long *loose,
-			  off_t *loose_size,
-			  unsigned long *packed_loose)
+static void loose_garbage(const char *path)
 {
-	struct dirent *ent;
-	while ((ent = readdir(d)) != NULL) {
-		char hex[41];
-		unsigned char sha1[20];
-		const char *cp;
-		int bad = 0;
+	if (verbose)
+		report_garbage("garbage found", path);
+}
 
-		if (is_dot_or_dotdot(ent->d_name))
-			continue;
-		for (cp = ent->d_name; *cp; cp++) {
-			int ch = *cp;
-			if (('0' <= ch && ch <= '9') ||
-			    ('a' <= ch && ch <= 'f'))
-				continue;
-			bad = 1;
-			break;
-		}
-		if (cp - ent->d_name != 38)
-			bad = 1;
-		else {
-			struct stat st;
-			memcpy(path + len + 3, ent->d_name, 38);
-			path[len + 2] = '/';
-			path[len + 41] = 0;
-			if (lstat(path, &st) || !S_ISREG(st.st_mode))
-				bad = 1;
-			else
-				(*loose_size) += on_disk_bytes(st);
-		}
-		if (bad) {
-			if (verbose) {
-				struct strbuf sb = STRBUF_INIT;
-				strbuf_addf(&sb, "%.*s/%s",
-					    len + 2, path, ent->d_name);
-				report_garbage("garbage found", sb.buf);
-				strbuf_release(&sb);
-			}
-			continue;
-		}
-		(*loose)++;
-		if (!verbose)
-			continue;
-		memcpy(hex, path+len, 2);
-		memcpy(hex+2, ent->d_name, 38);
-		hex[40] = 0;
-		if (get_sha1_hex(hex, sha1))
-			die("internal error");
-		if (has_sha1_pack(sha1))
-			(*packed_loose)++;
+static int count_loose(const unsigned char *sha1, const char *path, void *data)
+{
+	struct stat st;
+
+	if (lstat(path, &st) || !S_ISREG(st.st_mode))
+		loose_garbage(path);
+	else {
+		loose_size += on_disk_bytes(st);
+		loose++;
+		if (verbose && has_sha1_pack(sha1))
+			packed_loose++;
 	}
+	return 0;
+}
+
+static int count_cruft(const char *basename, const char *path, void *data)
+{
+	loose_garbage(path);
+	return 0;
 }
 
 static char const * const count_objects_usage[] = {
@@ -85,12 +58,7 @@ static char const * const count_objects_usage[] = {
 
 int cmd_count_objects(int argc, const char **argv, const char *prefix)
 {
-	int i, verbose = 0, human_readable = 0;
-	const char *objdir = get_object_directory();
-	int len = strlen(objdir);
-	char *path = xmalloc(len + 50);
-	unsigned long loose = 0, packed = 0, packed_loose = 0;
-	off_t loose_size = 0;
+	int human_readable = 0;
 	struct option opts[] = {
 		OPT__VERBOSE(&verbose, N_("be verbose")),
 		OPT_BOOL('H', "human-readable", &human_readable,
@@ -104,19 +72,10 @@ int cmd_count_objects(int argc, const char **argv, const char *prefix)
 		usage_with_options(count_objects_usage, opts);
 	if (verbose)
 		report_garbage = real_report_garbage;
-	memcpy(path, objdir, len);
-	if (len && objdir[len-1] != '/')
-		path[len++] = '/';
-	for (i = 0; i < 256; i++) {
-		DIR *d;
-		sprintf(path + len, "%02x", i);
-		d = opendir(path);
-		if (!d)
-			continue;
-		count_objects(d, path, len, verbose,
-			      &loose, &loose_size, &packed_loose);
-		closedir(d);
-	}
+
+	for_each_loose_file_in_objdir(get_object_directory(),
+				      count_loose, count_cruft, NULL, NULL);
+
 	if (verbose) {
 		struct packed_git *p;
 		unsigned long num_pack = 0;
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 13/25] sha1_file: add for_each iterators for loose and packed objects
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (11 preceding siblings ...)
  2014-10-15 22:41 ` [PATCH v2 12/25] count-objects: use for_each_loose_file_in_objdir Jeff King
@ 2014-10-15 22:41 ` Jeff King
  2014-10-15 22:41 ` [PATCH v2 14/25] prune: keep objects reachable from recent objects Jeff King
                   ` (12 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

We typically iterate over the reachable objects in a
repository by starting at the tips and walking the graph.
There's no easy way to iterate over all of the objects,
including unreachable ones. Let's provide a way of doing so.

Signed-off-by: Jeff King <peff@peff.net>
---
 cache.h     | 11 +++++++++++
 sha1_file.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+)

diff --git a/cache.h b/cache.h
index 8ffefaa..d24dd32 100644
--- a/cache.h
+++ b/cache.h
@@ -1254,6 +1254,17 @@ int for_each_loose_file_in_objdir(const char *path,
 				  each_loose_subdir_fn subdir_cb,
 				  void *data);
 
+/*
+ * Iterate over loose and packed objects in both the local
+ * repository and any alternates repositories.
+ */
+typedef int each_packed_object_fn(const unsigned char *sha1,
+				  struct packed_git *pack,
+				  uint32_t pos,
+				  void *data);
+extern int for_each_loose_object(each_loose_object_fn, void *);
+extern int for_each_packed_object(each_packed_object_fn, void *);
+
 struct object_info {
 	/* Request */
 	enum object_type *typep;
diff --git a/sha1_file.c b/sha1_file.c
index a20240b..eb410a2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -3349,3 +3349,65 @@ int for_each_loose_file_in_objdir(const char *path,
 	strbuf_release(&buf);
 	return r;
 }
+
+struct loose_alt_odb_data {
+	each_loose_object_fn *cb;
+	void *data;
+};
+
+static int loose_from_alt_odb(struct alternate_object_database *alt,
+			      void *vdata)
+{
+	struct loose_alt_odb_data *data = vdata;
+	return for_each_loose_file_in_objdir(alt->base,
+					     data->cb, NULL, NULL,
+					     data->data);
+}
+
+int for_each_loose_object(each_loose_object_fn cb, void *data)
+{
+	struct loose_alt_odb_data alt;
+	int r;
+
+	r = for_each_loose_file_in_objdir(get_object_directory(),
+					  cb, NULL, NULL, data);
+	if (r)
+		return r;
+
+	alt.cb = cb;
+	alt.data = data;
+	return foreach_alt_odb(loose_from_alt_odb, &alt);
+}
+
+static int for_each_object_in_pack(struct packed_git *p, each_packed_object_fn cb, void *data)
+{
+	uint32_t i;
+	int r = 0;
+
+	for (i = 0; i < p->num_objects; i++) {
+		const unsigned char *sha1 = nth_packed_object_sha1(p, i);
+
+		if (!sha1)
+			return error("unable to get sha1 of object %u in %s",
+				     i, p->pack_name);
+
+		r = cb(sha1, p, i, data);
+		if (r)
+			break;
+	}
+	return r;
+}
+
+int for_each_packed_object(each_packed_object_fn cb, void *data)
+{
+	struct packed_git *p;
+	int r = 0;
+
+	prepare_packed_git();
+	for (p = packed_git; p; p = p->next) {
+		r = for_each_object_in_pack(p, cb, data);
+		if (r)
+			break;
+	}
+	return r;
+}
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 14/25] prune: keep objects reachable from recent objects
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (12 preceding siblings ...)
  2014-10-15 22:41 ` [PATCH v2 13/25] sha1_file: add for_each iterators for loose and packed objects Jeff King
@ 2014-10-15 22:41 ` Jeff King
  2014-10-15 22:41 ` [PATCH v2 15/25] pack-objects: refactor unpack-unreachable expiration check Jeff King
                   ` (11 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

Our current strategy with prune is that an object falls into
one of three categories:

  1. Reachable (from ref tips, reflogs, index, etc).

  2. Not reachable, but recent (based on the --expire time).

  3. Not reachable and not recent.

We keep objects from (1) and (2), but prune objects in (3).
The point of (2) is that these objects may be part of an
in-progress operation that has not yet updated any refs.

However, it is not always the case that objects for an
in-progress operation will have a recent mtime. For example,
the object database may have an old copy of a blob (from an
abandoned operation, a branch that was deleted, etc). If we
create a new tree that points to it, a simultaneous prune
will leave our tree, but delete the blob. Referencing that
tree with a commit will then work (we check that the tree is
in the object database, but not that all of its referred
objects are), as will mentioning the commit in a ref. But
the resulting repo is corrupt; we are missing the blob
reachable from a ref.

One way to solve this is to be more thorough when
referencing a sha1: make sure that not only do we have that
sha1, but that we have objects it refers to, and so forth
recursively. The problem is that this is very expensive.
Creating a parent link would require traversing the entire
object graph!

Instead, this patch pushes the extra work onto prune, which
runs less frequently (and has to look at the whole object
graph anyway). It creates a new category of objects: objects
which are not recent, but which are reachable from a recent
object. We do not prune these objects, just like the
reachable and recent ones.

This lets us avoid the recursive check above, because if we
have an object, even if it is unreachable, we should have
its referent. We can make a simple inductive argument that
with this patch, this property holds (that there are no
objects with missing referents in the repository):

  0. When we have no objects, we have nothing to refer or be
     referred to, so the property holds.

  1. If we add objects to the repository, their direct
     referents must generally exist (e.g., if you create a
     tree, the blobs it references must exist; if you create
     a commit to point at the tree, the tree must exist).
     This is already the case before this patch. And it is
     not 100% foolproof (you can make bogus objects using
     `git hash-object`, for example), but it should be the
     case for normal usage.

     Therefore for any sequence of object additions, the
     property will continue to hold.

  2. If we remove objects from the repository, then we will
     not remove a child object (like a blob) if an object
     that refers to it is being kept. That is the part
     implemented by this patch.

     Note, however, that our reachability check and the
     actual pruning are not atomic. So it _is_ still
     possible to violate the property (e.g., an object
     becomes referenced just as we are deleting it). This
     patch is shooting for eliminating problems where the
     mtimes of dependent objects differ by hours or days,
     and one is dropped without the other. It does nothing
     to help with short races.

Naively, the simplest way to implement this would be to add
all recent objects as tips to the reachability traversal.
However, this does not perform well. In a recently-packed
repository, all reachable objects will also be recent, and
therefore we have to look at each object twice. This patch
instead performs the reachability traversal, then follows up
with a second traversal for recent objects, skipping any
that have already been marked.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/prune.c            |   2 +-
 builtin/reflog.c           |   2 +-
 reachable.c                | 112 +++++++++++++++++++++++++++++++++++++++++++++
 reachable.h                |   3 +-
 t/t6501-freshen-objects.sh |  88 +++++++++++++++++++++++++++++++++++
 5 files changed, 204 insertions(+), 3 deletions(-)
 create mode 100755 t/t6501-freshen-objects.sh

diff --git a/builtin/prune.c b/builtin/prune.c
index 763f53e..04d3b12 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -135,7 +135,7 @@ int cmd_prune(int argc, const char **argv, const char *prefix)
 	if (show_progress)
 		progress = start_progress_delay(_("Checking connectivity"), 0, 0, 2);
 
-	mark_reachable_objects(&revs, 1, progress);
+	mark_reachable_objects(&revs, 1, expire, progress);
 	stop_progress(&progress);
 	for_each_loose_file_in_objdir(get_object_directory(), prune_object,
 				      prune_cruft, prune_subdir, NULL);
diff --git a/builtin/reflog.c b/builtin/reflog.c
index b6388f7..2d85d26 100644
--- a/builtin/reflog.c
+++ b/builtin/reflog.c
@@ -649,7 +649,7 @@ static int cmd_reflog_expire(int argc, const char **argv, const char *prefix)
 		init_revisions(&cb.revs, prefix);
 		if (cb.verbose)
 			printf("Marking reachable objects...");
-		mark_reachable_objects(&cb.revs, 0, NULL);
+		mark_reachable_objects(&cb.revs, 0, 0, NULL);
 		if (cb.verbose)
 			putchar('\n');
 	}
diff --git a/reachable.c b/reachable.c
index d03f829..55589a0 100644
--- a/reachable.c
+++ b/reachable.c
@@ -97,7 +97,109 @@ static void mark_commit(struct commit *c, void *data)
 	mark_object(&c->object, NULL, NULL, data);
 }
 
+struct recent_data {
+	struct rev_info *revs;
+	unsigned long timestamp;
+};
+
+static void add_recent_object(const unsigned char *sha1,
+			      unsigned long mtime,
+			      struct recent_data *data)
+{
+	struct object *obj;
+	enum object_type type;
+
+	if (mtime <= data->timestamp)
+		return;
+
+	/*
+	 * We do not want to call parse_object here, because
+	 * inflating blobs and trees could be very expensive.
+	 * However, we do need to know the correct type for
+	 * later processing, and the revision machinery expects
+	 * commits and tags to have been parsed.
+	 */
+	type = sha1_object_info(sha1, NULL);
+	if (type < 0)
+		die("unable to get object info for %s", sha1_to_hex(sha1));
+
+	switch (type) {
+	case OBJ_TAG:
+	case OBJ_COMMIT:
+		obj = parse_object_or_die(sha1, NULL);
+		break;
+	case OBJ_TREE:
+		obj = (struct object *)lookup_tree(sha1);
+		break;
+	case OBJ_BLOB:
+		obj = (struct object *)lookup_blob(sha1);
+		break;
+	default:
+		die("unknown object type for %s: %s",
+		    sha1_to_hex(sha1), typename(type));
+	}
+
+	if (!obj)
+		die("unable to lookup %s", sha1_to_hex(sha1));
+
+	add_pending_object(data->revs, obj, "");
+}
+
+static int add_recent_loose(const unsigned char *sha1,
+			    const char *path, void *data)
+{
+	struct stat st;
+	struct object *obj = lookup_object(sha1);
+
+	if (obj && obj->flags & SEEN)
+		return 0;
+
+	if (stat(path, &st) < 0) {
+		/*
+		 * It's OK if an object went away during our iteration; this
+		 * could be due to a simultaneous repack. But anything else
+		 * we should abort, since we might then fail to mark objects
+		 * which should not be pruned.
+		 */
+		if (errno == ENOENT)
+			return 0;
+		return error("unable to stat %s: %s",
+			     sha1_to_hex(sha1), strerror(errno));
+	}
+
+	add_recent_object(sha1, st.st_mtime, data);
+	return 0;
+}
+
+static int add_recent_packed(const unsigned char *sha1,
+			     struct packed_git *p, uint32_t pos,
+			     void *data)
+{
+	struct object *obj = lookup_object(sha1);
+
+	if (obj && obj->flags & SEEN)
+		return 0;
+	add_recent_object(sha1, p->mtime, data);
+	return 0;
+}
+
+static int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
+						  unsigned long timestamp)
+{
+	struct recent_data data;
+	int r;
+
+	data.revs = revs;
+	data.timestamp = timestamp;
+
+	r = for_each_loose_object(add_recent_loose, &data);
+	if (r)
+		return r;
+	return for_each_packed_object(add_recent_packed, &data);
+}
+
 void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
+			    unsigned long mark_recent,
 			    struct progress *progress)
 {
 	struct connectivity_progress cp;
@@ -133,5 +235,15 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 	if (prepare_revision_walk(revs))
 		die("revision walk setup failed");
 	traverse_commit_list(revs, mark_commit, mark_object, &cp);
+
+	if (mark_recent) {
+		revs->ignore_missing_links = 1;
+		if (add_unseen_recent_objects_to_traversal(revs, mark_recent))
+			die("unable to mark recent objects");
+		if (prepare_revision_walk(revs))
+			die("revision walk setup failed");
+		traverse_commit_list(revs, mark_commit, mark_object, &cp);
+	}
+
 	display_progress(cp.progress, cp.count);
 }
diff --git a/reachable.h b/reachable.h
index 5d082ad..141fe30 100644
--- a/reachable.h
+++ b/reachable.h
@@ -2,6 +2,7 @@
 #define REACHEABLE_H
 
 struct progress;
-extern void mark_reachable_objects(struct rev_info *revs, int mark_reflog, struct progress *);
+extern void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
+				   unsigned long mark_recent, struct progress *);
 
 #endif
diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh
new file mode 100755
index 0000000..de941c2
--- /dev/null
+++ b/t/t6501-freshen-objects.sh
@@ -0,0 +1,88 @@
+#!/bin/sh
+#
+# This test covers the handling of objects which might have old
+# mtimes in the filesystem (because they were used previously)
+# and are just now becoming referenced again.
+#
+# We're going to do two things that are a little bit "fake" to
+# help make our simulation easier:
+#
+#   1. We'll turn off reflogs. You can still run into
+#      problems with reflogs on, but your objects
+#      don't get pruned until both the reflog expiration
+#      has passed on their references, _and_ they are out
+#      of prune's expiration period. Dropping reflogs
+#      means we only have to deal with one variable in our tests,
+#      but the results generalize.
+#
+#   2. We'll use a temporary index file to create our
+#      works-in-progress. Most workflows would mention
+#      referenced objects in the index, which prune takes
+#      into account. However, many operations don't. For
+#      example, a partial commit with "git commit foo"
+#      will use a temporary index. Or they may not need
+#      an index at all (e.g., creating a new commit
+#      to refer to an existing tree).
+
+test_description='check pruning of dependent objects'
+. ./test-lib.sh
+
+# We care about reachability, so we do not want to use
+# the normal test_commit, which creates extra tags.
+add () {
+	echo "$1" >"$1" &&
+	git add "$1"
+}
+commit () {
+	test_tick &&
+	add "$1" &&
+	git commit -m "$1"
+}
+
+test_expect_success 'disable reflogs' '
+	git config core.logallrefupdates false &&
+	rm -rf .git/logs
+'
+
+test_expect_success 'setup basic history' '
+	commit base
+'
+
+test_expect_success 'create and abandon some objects' '
+	git checkout -b experiment &&
+	commit abandon &&
+	git checkout master &&
+	git branch -D experiment
+'
+
+test_expect_success 'simulate time passing' '
+	find .git/objects -type f |
+	xargs test-chmtime -v -86400
+'
+
+test_expect_success 'start writing new commit with old blob' '
+	tree=$(
+		GIT_INDEX_FILE=index.tmp &&
+		export GIT_INDEX_FILE &&
+		git read-tree HEAD &&
+		add unrelated &&
+		add abandon &&
+		git write-tree
+	)
+'
+
+test_expect_success 'simultaneous gc' '
+	git gc --prune=12.hours.ago
+'
+
+test_expect_success 'finish writing out commit' '
+	commit=$(echo foo | git commit-tree -p HEAD $tree) &&
+	git update-ref HEAD $commit
+'
+
+# "abandon" blob should have been rescued by reference from new tree
+test_expect_success 'repository passes fsck' '
+	git fsck
+'
+
+test_done
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 15/25] pack-objects: refactor unpack-unreachable expiration check
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (13 preceding siblings ...)
  2014-10-15 22:41 ` [PATCH v2 14/25] prune: keep objects reachable from recent objects Jeff King
@ 2014-10-15 22:41 ` Jeff King
  2014-10-15 22:42 ` [PATCH v2 16/25] pack-objects: match prune logic for discarding objects Jeff King
                   ` (10 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:41 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

When we are loosening unreachable packed objects, we do not
bother to process objects that would simply be pruned
immediately anyway. The "would be pruned" check is a simple
comparison, but is about to get more complicated. Let's pull
it out into a separate function.

Note that this is slightly less efficient than the original,
which avoided even opening old packs, since no object in
them could pass the current check, which cares only about
the pack mtime.  But the new rules will depend on the exact
object, so we need to perform the check even for old packs.

Note also that we fix a minor buglet when the pack mtime is
exactly the same as the expiration time. The prune code
considers that worth pruning, whereas our check here
considered it worth keeping. This wasn't a big deal. Besides
being unlikely to happen, the result was simply that the
object was loosened and then pruned, missing the
optimization. Still, we can easily fix it while we are here.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index d391934..2fe2ab0 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2407,6 +2407,16 @@ static int has_sha1_pack_kept_or_nonlocal(const unsigned char *sha1)
 	return 0;
 }
 
+static int loosened_object_can_be_discarded(const unsigned char *sha1,
+					    unsigned long mtime)
+{
+	if (!unpack_unreachable_expiration)
+		return 0;
+	if (mtime > unpack_unreachable_expiration)
+		return 0;
+	return 1;
+}
+
 static void loosen_unused_packed_objects(struct rev_info *revs)
 {
 	struct packed_git *p;
@@ -2417,17 +2427,14 @@ static void loosen_unused_packed_objects(struct rev_info *revs)
 		if (!p->pack_local || p->pack_keep)
 			continue;
 
-		if (unpack_unreachable_expiration &&
-		    p->mtime < unpack_unreachable_expiration)
-			continue;
-
 		if (open_pack_index(p))
 			die("cannot open pack index");
 
 		for (i = 0; i < p->num_objects; i++) {
 			sha1 = nth_packed_object_sha1(p, i);
 			if (!packlist_find(&to_pack, sha1, NULL) &&
-				!has_sha1_pack_kept_or_nonlocal(sha1))
+			    !has_sha1_pack_kept_or_nonlocal(sha1) &&
+			    !loosened_object_can_be_discarded(sha1, p->mtime))
 				if (force_object_loose(sha1, p->mtime))
 					die("unable to force loose object");
 		}
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 16/25] pack-objects: match prune logic for discarding objects
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (14 preceding siblings ...)
  2014-10-15 22:41 ` [PATCH v2 15/25] pack-objects: refactor unpack-unreachable expiration check Jeff King
@ 2014-10-15 22:42 ` Jeff King
  2014-10-15 22:42 ` [PATCH v2 17/25] write_sha1_file: freshen existing objects Jeff King
                   ` (9 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:42 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

A recent commit taught git-prune to keep non-recent objects
that are reachable from recent ones. However, pack-objects,
when loosening unreachable objects, tries to optimize out
the write in the case that the object will be immediately
pruned. It now gets this wrong, since its rule does not
reflect the new prune code (and this can be seen by running
t6501 with a strategically placed repack).

Let's teach pack-objects similar logic.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c     | 39 +++++++++++++++++++
 reachable.c                |  4 +-
 reachable.h                |  2 +
 t/t6501-freshen-objects.sh | 93 +++++++++++++++++++++++++++-------------------
 4 files changed, 98 insertions(+), 40 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 2fe2ab0..4df9499 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -20,6 +20,8 @@
 #include "streaming.h"
 #include "thread-utils.h"
 #include "pack-bitmap.h"
+#include "reachable.h"
+#include "sha1-array.h"
 
 static const char *pack_usage[] = {
 	N_("git pack-objects --stdout [options...] [< ref-list | < object-list]"),
@@ -2407,6 +2409,15 @@ static int has_sha1_pack_kept_or_nonlocal(const unsigned char *sha1)
 	return 0;
 }
 
+/*
+ * Store a list of sha1s that are should not be discarded
+ * because they are either written too recently, or are
+ * reachable from another object that was.
+ *
+ * This is filled by get_object_list.
+ */
+static struct sha1_array recent_objects;
+
 static int loosened_object_can_be_discarded(const unsigned char *sha1,
 					    unsigned long mtime)
 {
@@ -2414,6 +2425,8 @@ static int loosened_object_can_be_discarded(const unsigned char *sha1,
 		return 0;
 	if (mtime > unpack_unreachable_expiration)
 		return 0;
+	if (sha1_array_lookup(&recent_objects, sha1) >= 0)
+		return 0;
 	return 1;
 }
 
@@ -2470,6 +2483,19 @@ static int get_object_list_from_bitmap(struct rev_info *revs)
 	return 0;
 }
 
+static void record_recent_object(struct object *obj,
+				 const struct name_path *path,
+				 const char *last,
+				 void *data)
+{
+	sha1_array_append(&recent_objects, obj->sha1);
+}
+
+static void record_recent_commit(struct commit *commit, void *data)
+{
+	sha1_array_append(&recent_objects, commit->object.sha1);
+}
+
 static void get_object_list(int ac, const char **av)
 {
 	struct rev_info revs;
@@ -2517,10 +2543,23 @@ static void get_object_list(int ac, const char **av)
 	mark_edges_uninteresting(&revs, show_edge);
 	traverse_commit_list(&revs, show_commit, show_object, NULL);
 
+	if (unpack_unreachable_expiration) {
+		revs.ignore_missing_links = 1;
+		if (add_unseen_recent_objects_to_traversal(&revs,
+				unpack_unreachable_expiration))
+			die("unable to add recent objects");
+		if (prepare_revision_walk(&revs))
+			die("revision walk setup failed");
+		traverse_commit_list(&revs, record_recent_commit,
+				     record_recent_object, NULL);
+	}
+
 	if (keep_unreachable)
 		add_objects_in_unpacked_packs(&revs);
 	if (unpack_unreachable)
 		loosen_unused_packed_objects(&revs);
+
+	sha1_array_clear(&recent_objects);
 }
 
 static int option_parse_index_version(const struct option *opt,
diff --git a/reachable.c b/reachable.c
index 55589a0..0176a88 100644
--- a/reachable.c
+++ b/reachable.c
@@ -183,8 +183,8 @@ static int add_recent_packed(const unsigned char *sha1,
 	return 0;
 }
 
-static int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
-						  unsigned long timestamp)
+int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
+					   unsigned long timestamp)
 {
 	struct recent_data data;
 	int r;
diff --git a/reachable.h b/reachable.h
index 141fe30..d23efc3 100644
--- a/reachable.h
+++ b/reachable.h
@@ -2,6 +2,8 @@
 #define REACHEABLE_H
 
 struct progress;
+extern int add_unseen_recent_objects_to_traversal(struct rev_info *revs,
+						  unsigned long timestamp);
 extern void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 				   unsigned long mark_recent, struct progress *);
 
diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh
index de941c2..e25c47d 100755
--- a/t/t6501-freshen-objects.sh
+++ b/t/t6501-freshen-objects.sh
@@ -39,50 +39,67 @@ commit () {
 	git commit -m "$1"
 }
 
-test_expect_success 'disable reflogs' '
-	git config core.logallrefupdates false &&
-	rm -rf .git/logs
-'
+maybe_repack () {
+	if test -n "$repack"; then
+		git repack -ad
+	fi
+}
+
+for repack in '' true; do
+	title=${repack:+repack}
+	title=${title:-loose}
+
+	test_expect_success "make repo completely empty ($title)" '
+		rm -rf .git &&
+		git init
+	'
+
+	test_expect_success "disable reflogs ($title)" '
+		git config core.logallrefupdates false &&
+		rm -rf .git/logs
+	'
 
-test_expect_success 'setup basic history' '
-	commit base
-'
+	test_expect_success "setup basic history ($title)" '
+		commit base
+	'
 
-test_expect_success 'create and abandon some objects' '
-	git checkout -b experiment &&
-	commit abandon &&
-	git checkout master &&
-	git branch -D experiment
-'
+	test_expect_success "create and abandon some objects ($title)" '
+		git checkout -b experiment &&
+		commit abandon &&
+		maybe_repack &&
+		git checkout master &&
+		git branch -D experiment
+	'
 
-test_expect_success 'simulate time passing' '
-	find .git/objects -type f |
-	xargs test-chmtime -v -86400
-'
+	test_expect_success "simulate time passing ($title)" '
+		find .git/objects -type f |
+		xargs test-chmtime -v -86400
+	'
 
-test_expect_success 'start writing new commit with old blob' '
-	tree=$(
-		GIT_INDEX_FILE=index.tmp &&
-		export GIT_INDEX_FILE &&
-		git read-tree HEAD &&
-		add unrelated &&
-		add abandon &&
-		git write-tree
-	)
-'
+	test_expect_success "start writing new commit with old blob ($title)" '
+		tree=$(
+			GIT_INDEX_FILE=index.tmp &&
+			export GIT_INDEX_FILE &&
+			git read-tree HEAD &&
+			add unrelated &&
+			add abandon &&
+			git write-tree
+		)
+	'
 
-test_expect_success 'simultaneous gc' '
-	git gc --prune=12.hours.ago
-'
+	test_expect_success "simultaneous gc ($title)" '
+		git gc --prune=12.hours.ago
+	'
 
-test_expect_success 'finish writing out commit' '
-	commit=$(echo foo | git commit-tree -p HEAD $tree) &&
-	git update-ref HEAD $commit
-'
+	test_expect_success "finish writing out commit ($title)" '
+		commit=$(echo foo | git commit-tree -p HEAD $tree) &&
+		git update-ref HEAD $commit
+	'
 
-# "abandon" blob should have been rescued by reference from new tree
-test_expect_success 'repository passes fsck' '
-	git fsck
-'
+	# "abandon" blob should have been rescued by reference from new tree
+	test_expect_success "repository passes fsck ($title)" '
+		git fsck
+	'
+done
 
 test_done
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 17/25] write_sha1_file: freshen existing objects
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (15 preceding siblings ...)
  2014-10-15 22:42 ` [PATCH v2 16/25] pack-objects: match prune logic for discarding objects Jeff King
@ 2014-10-15 22:42 ` Jeff King
  2014-10-15 22:42 ` [PATCH v2 18/25] make add_object_array_with_context interface more sane Jeff King
                   ` (8 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:42 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

When we try to write a loose object file, we first check
whether that object already exists. If so, we skip the
write as an optimization. However, this can interfere with
prune's strategy of using mtimes to mark files in progress.

For example, if a branch contains a particular tree object
and is deleted, that tree object may become unreachable, and
have an old mtime. If a new operation then tries to write
the same tree, this ends up as a noop; we notice we
already have the object and do nothing. A prune running
simultaneously with this operation will see the object as
old, and may delete it.

We can solve this by "freshening" objects that we avoid
writing by updating their mtime. The algorithm for doing so
is essentially the same as that of has_sha1_file. Therefore
we provide a new (static) interface "check_and_freshen",
which finds and optionally freshens the object. It's trivial
to implement freshening and simple checking by tweaking a
single parameter.

Signed-off-by: Jeff King <peff@peff.net>
---
 sha1_file.c                | 51 +++++++++++++++++++++++++++++++++++++++-------
 t/t6501-freshen-objects.sh | 27 ++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index eb410a2..d7f1838 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -443,27 +443,53 @@ void prepare_alt_odb(void)
 	read_info_alternates(get_object_directory(), 0);
 }
 
-static int has_loose_object_local(const unsigned char *sha1)
+static int freshen_file(const char *fn)
 {
-	return !access(sha1_file_name(sha1), F_OK);
+	struct utimbuf t;
+	t.actime = t.modtime = time(NULL);
+	return !utime(fn, &t);
 }
 
-int has_loose_object_nonlocal(const unsigned char *sha1)
+static int check_and_freshen_file(const char *fn, int freshen)
+{
+	if (access(fn, F_OK))
+		return 0;
+	if (freshen && freshen_file(fn))
+		return 0;
+	return 1;
+}
+
+static int check_and_freshen_local(const unsigned char *sha1, int freshen)
+{
+	return check_and_freshen_file(sha1_file_name(sha1), freshen);
+}
+
+static int check_and_freshen_nonlocal(const unsigned char *sha1, int freshen)
 {
 	struct alternate_object_database *alt;
 	prepare_alt_odb();
 	for (alt = alt_odb_list; alt; alt = alt->next) {
 		fill_sha1_path(alt->name, sha1);
-		if (!access(alt->base, F_OK))
+		if (check_and_freshen_file(alt->base, freshen))
 			return 1;
 	}
 	return 0;
 }
 
+static int check_and_freshen(const unsigned char *sha1, int freshen)
+{
+	return check_and_freshen_local(sha1, freshen) ||
+	       check_and_freshen_nonlocal(sha1, freshen);
+}
+
+int has_loose_object_nonlocal(const unsigned char *sha1)
+{
+	return check_and_freshen_nonlocal(sha1, 0);
+}
+
 static int has_loose_object(const unsigned char *sha1)
 {
-	return has_loose_object_local(sha1) ||
-	       has_loose_object_nonlocal(sha1);
+	return check_and_freshen(sha1, 0);
 }
 
 static unsigned int pack_used_ctr;
@@ -2966,6 +2992,17 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	return move_temp_to_file(tmp_file, filename);
 }
 
+static int freshen_loose_object(const unsigned char *sha1)
+{
+	return check_and_freshen(sha1, 1);
+}
+
+static int freshen_packed_object(const unsigned char *sha1)
+{
+	struct pack_entry e;
+	return find_pack_entry(sha1, &e) && freshen_file(e.p->pack_name);
+}
+
 int write_sha1_file(const void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
 {
 	unsigned char sha1[20];
@@ -2978,7 +3015,7 @@ int write_sha1_file(const void *buf, unsigned long len, const char *type, unsign
 	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
 	if (returnsha1)
 		hashcpy(returnsha1, sha1);
-	if (has_sha1_file(sha1))
+	if (freshen_loose_object(sha1) || freshen_packed_object(sha1))
 		return 0;
 	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
 }
diff --git a/t/t6501-freshen-objects.sh b/t/t6501-freshen-objects.sh
index e25c47d..157f3f9 100755
--- a/t/t6501-freshen-objects.sh
+++ b/t/t6501-freshen-objects.sh
@@ -100,6 +100,33 @@ for repack in '' true; do
 	test_expect_success "repository passes fsck ($title)" '
 		git fsck
 	'
+
+	test_expect_success "abandon objects again ($title)" '
+		git reset --hard HEAD^ &&
+		find .git/objects -type f |
+		xargs test-chmtime -v -86400
+	'
+
+	test_expect_success "start writing new commit with same tree ($title)" '
+		tree=$(
+			GIT_INDEX_FILE=index.tmp &&
+			export GIT_INDEX_FILE &&
+			git read-tree HEAD &&
+			add abandon &&
+			add unrelated &&
+			git write-tree
+		)
+	'
+
+	test_expect_success "simultaneous gc ($title)" '
+		git gc --prune=12.hours.ago
+	'
+
+	# tree should have been refreshed by write-tree
+	test_expect_success "finish writing out commit ($title)" '
+		commit=$(echo foo | git commit-tree -p HEAD $tree) &&
+		git update-ref HEAD $commit
+	'
 done
 
 test_done
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 18/25] make add_object_array_with_context interface more sane
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (16 preceding siblings ...)
  2014-10-15 22:42 ` [PATCH v2 17/25] write_sha1_file: freshen existing objects Jeff King
@ 2014-10-15 22:42 ` Jeff King
  2014-10-15 22:43 ` [PATCH v2 19/25] traverse_commit_list: support pending blobs/trees with paths Jeff King
                   ` (7 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:42 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

When you resolve a sha1, you can optionally keep any context
found during the resolution, including the path and mode of
a tree entry (e.g., when looking up "HEAD:subdir/file.c").

The add_object_array_with_context function lets you then
attach that context to an entry in a list. Unfortunately,
the interface for doing so is horrible. The object_context
structure is large and most object_array users do not use
it. Therefore we keep a pointer to the structure to avoid
burdening other users too much. But that means when we do
use it that we must allocate the struct ourselves. And the
struct contains a fixed PATH_MAX-sized buffer, which makes
this wholly unsuitable for any large arrays.

We can observe that there is only a single user of the
"with_context" variant: builtin/grep.c. And in that use
case, the only element we care about is the path. We can
therefore store only the path as a pointer (the context's
mode field was redundant with the object_array_entry itself,
and nobody actually cared about the surrounding tree). This
still requires a strdup of the pathname, but at least we are
only consuming the minimum amount of memory for each string.

We can also handle the copying ourselves in
add_object_array_*, and free it as appropriate in
object_array_release_entry.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/grep.c |  8 ++++----
 object.c       | 23 +++++++++--------------
 object.h       |  4 ++--
 3 files changed, 15 insertions(+), 20 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index c86a142..4063882 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -456,10 +456,10 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 }
 
 static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
-		       struct object *obj, const char *name, struct object_context *oc)
+		       struct object *obj, const char *name, const char *path)
 {
 	if (obj->type == OBJ_BLOB)
-		return grep_sha1(opt, obj->sha1, name, 0, oc ? oc->path : NULL);
+		return grep_sha1(opt, obj->sha1, name, 0, path);
 	if (obj->type == OBJ_COMMIT || obj->type == OBJ_TREE) {
 		struct tree_desc tree;
 		void *data;
@@ -501,7 +501,7 @@ static int grep_objects(struct grep_opt *opt, const struct pathspec *pathspec,
 	for (i = 0; i < nr; i++) {
 		struct object *real_obj;
 		real_obj = deref_tag(list->objects[i].item, NULL, 0);
-		if (grep_object(opt, pathspec, real_obj, list->objects[i].name, list->objects[i].context)) {
+		if (grep_object(opt, pathspec, real_obj, list->objects[i].name, list->objects[i].path)) {
 			hit = 1;
 			if (opt->status_only)
 				break;
@@ -821,7 +821,7 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 			struct object *object = parse_object_or_die(sha1, arg);
 			if (!seen_dashdash)
 				verify_non_filename(prefix, arg);
-			add_object_array_with_context(object, arg, &list, xmemdupz(&oc, sizeof(struct object_context)));
+			add_object_array_with_path(object, arg, &list, oc.mode, oc.path);
 			continue;
 		}
 		if (!strcmp(arg, "--")) {
diff --git a/object.c b/object.c
index 6aeb1bb..df86bdd 100644
--- a/object.c
+++ b/object.c
@@ -307,10 +307,9 @@ int object_list_contains(struct object_list *list, struct object *obj)
  */
 static char object_array_slopbuf[1];
 
-static void add_object_array_with_mode_context(struct object *obj, const char *name,
-					       struct object_array *array,
-					       unsigned mode,
-					       struct object_context *context)
+void add_object_array_with_path(struct object *obj, const char *name,
+				struct object_array *array,
+				unsigned mode, const char *path)
 {
 	unsigned nr = array->nr;
 	unsigned alloc = array->alloc;
@@ -333,7 +332,10 @@ static void add_object_array_with_mode_context(struct object *obj, const char *n
 	else
 		entry->name = xstrdup(name);
 	entry->mode = mode;
-	entry->context = context;
+	if (path)
+		entry->path = xstrdup(path);
+	else
+		entry->path = NULL;
 	array->nr = ++nr;
 }
 
@@ -344,15 +346,7 @@ void add_object_array(struct object *obj, const char *name, struct object_array
 
 void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode)
 {
-	add_object_array_with_mode_context(obj, name, array, mode, NULL);
-}
-
-void add_object_array_with_context(struct object *obj, const char *name, struct object_array *array, struct object_context *context)
-{
-	if (context)
-		add_object_array_with_mode_context(obj, name, array, context->mode, context);
-	else
-		add_object_array_with_mode_context(obj, name, array, S_IFINVALID, context);
+	add_object_array_with_path(obj, name, array, mode, NULL);
 }
 
 /*
@@ -363,6 +357,7 @@ static void object_array_release_entry(struct object_array_entry *ent)
 {
 	if (ent->name != object_array_slopbuf)
 		free(ent->name);
+	free(ent->path);
 }
 
 void object_array_filter(struct object_array *array,
diff --git a/object.h b/object.h
index 2a755a2..e5178a5 100644
--- a/object.h
+++ b/object.h
@@ -18,8 +18,8 @@ struct object_array {
 		 * empty string.
 		 */
 		char *name;
+		char *path;
 		unsigned mode;
-		struct object_context *context;
 	} *objects;
 };
 
@@ -115,7 +115,7 @@ int object_list_contains(struct object_list *list, struct object *obj);
 /* Object array handling .. */
 void add_object_array(struct object *obj, const char *name, struct object_array *array);
 void add_object_array_with_mode(struct object *obj, const char *name, struct object_array *array, unsigned mode);
-void add_object_array_with_context(struct object *obj, const char *name, struct object_array *array, struct object_context *context);
+void add_object_array_with_path(struct object *obj, const char *name, struct object_array *array, unsigned mode, const char *path);
 
 typedef int (*object_array_each_func_t)(struct object_array_entry *, void *);
 
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 19/25] traverse_commit_list: support pending blobs/trees with paths
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (17 preceding siblings ...)
  2014-10-15 22:42 ` [PATCH v2 18/25] make add_object_array_with_context interface more sane Jeff King
@ 2014-10-15 22:43 ` Jeff King
  2014-10-15 22:43 ` [PATCH v2 20/25] rev-list: document --reflog option Jeff King
                   ` (6 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:43 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

When we call traverse_commit_list, we may have trees and
blobs in the pending array. As we process these, we pass the
"name" field from the pending entry as the path of the
object within the tree (which then becomes the root path if
we recurse into subtrees).

When we set up the traversal in prepare_revision_walk,
though, the "name" field of any pending trees and blobs is
likely to be the ref at which we found the object. We would
not want to make this part of the path (e.g., doing so would
make "git rev-list --objects v2.6.11-tree" in linux.git show
paths like "v2.6.11-tree/Makefile", which is nonsensical).
Therefore prepare_revision_walk sets the name field of each
pending tree and blobs to the empty string.

However, this leaves no room for a caller who does know the
correct path of a pending object to propagate that
information to the revision walker. We can fix this by
making two related changes:

  1. Use the "path" field as the path instead of the "name"
     field in traverse_commit_list. If the path is not set,
     default to "" (which is what we always ended up with in
     the current code, because of prepare_revision_walk).

  2. In prepare_revision_walk, make a complete copy of the
     entry. This makes the path field available to the
     walker (if there is one), solving our problem.
     Leaving the name field intact is now OK, as we do not
     use it as a path due to point (1) above (and we can use
     it to make more meaningful error messages if we want).
     We also make the original "mode" field available to the
     walker, though it does not actually use it.

Note that we still re-add the pending objects and free the
old ones (so we may strdup the path and name only to free
the old ones). This could be made more efficient by simply
copying the object_array entries that we are keeping.
However, that would require more restructuring of the code,
and is not done here.

Signed-off-by: Jeff King <peff@peff.net>
---
 list-objects.c |  7 +++++--
 revision.c     | 30 +++++++++++++++++++++++-------
 2 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/list-objects.c b/list-objects.c
index fad6808..2910bec 100644
--- a/list-objects.c
+++ b/list-objects.c
@@ -208,6 +208,7 @@ void traverse_commit_list(struct rev_info *revs,
 		struct object_array_entry *pending = revs->pending.objects + i;
 		struct object *obj = pending->item;
 		const char *name = pending->name;
+		const char *path = pending->path;
 		if (obj->flags & (UNINTERESTING | SEEN))
 			continue;
 		if (obj->type == OBJ_TAG) {
@@ -215,14 +216,16 @@ void traverse_commit_list(struct rev_info *revs,
 			show_object(obj, NULL, name, data);
 			continue;
 		}
+		if (!path)
+			path = "";
 		if (obj->type == OBJ_TREE) {
 			process_tree(revs, (struct tree *)obj, show_object,
-				     NULL, &base, name, data);
+				     NULL, &base, path, data);
 			continue;
 		}
 		if (obj->type == OBJ_BLOB) {
 			process_blob(revs, (struct blob *)obj, show_object,
-				     NULL, name, data);
+				     NULL, path, data);
 			continue;
 		}
 		die("unknown pending object %s (%s)",
diff --git a/revision.c b/revision.c
index b8e02e2..9a0f99a 100644
--- a/revision.c
+++ b/revision.c
@@ -198,9 +198,10 @@ void mark_parents_uninteresting(struct commit *commit)
 	}
 }
 
-static void add_pending_object_with_mode(struct rev_info *revs,
+static void add_pending_object_with_path(struct rev_info *revs,
 					 struct object *obj,
-					 const char *name, unsigned mode)
+					 const char *name, unsigned mode,
+					 const char *path)
 {
 	if (!obj)
 		return;
@@ -220,7 +221,20 @@ static void add_pending_object_with_mode(struct rev_info *revs,
 		if (st)
 			return;
 	}
-	add_object_array_with_mode(obj, name, &revs->pending, mode);
+	add_object_array_with_path(obj, name, &revs->pending, mode, path);
+}
+
+static void add_pending_object_with_mode(struct rev_info *revs,
+					 struct object *obj,
+					 const char *name, unsigned mode)
+{
+	add_pending_object_with_path(revs, obj, name, mode, NULL);
+}
+
+static void add_pending_object_from_entry(struct rev_info *revs,
+					  struct object_array_entry *e)
+{
+	add_pending_object_with_path(revs, e->item, e->name, e->mode, e->path);
 }
 
 void add_pending_object(struct rev_info *revs,
@@ -265,8 +279,10 @@ void add_pending_sha1(struct rev_info *revs, const char *name,
 }
 
 static struct commit *handle_commit(struct rev_info *revs,
-				    struct object *object, const char *name)
+				    struct object_array_entry *entry)
 {
+	struct object *object = entry->item;
+	const char *name = entry->name;
 	unsigned long flags = object->flags;
 
 	/*
@@ -316,7 +332,7 @@ static struct commit *handle_commit(struct rev_info *revs,
 			mark_tree_contents_uninteresting(tree);
 			return NULL;
 		}
-		add_pending_object(revs, object, "");
+		add_pending_object_from_entry(revs, entry);
 		return NULL;
 	}
 
@@ -328,7 +344,7 @@ static struct commit *handle_commit(struct rev_info *revs,
 			return NULL;
 		if (flags & UNINTERESTING)
 			return NULL;
-		add_pending_object(revs, object, "");
+		add_pending_object_from_entry(revs, entry);
 		return NULL;
 	}
 	die("%s is unknown object", name);
@@ -2666,7 +2682,7 @@ int prepare_revision_walk(struct rev_info *revs)
 	revs->pending.objects = NULL;
 	for (i = 0; i < old_pending.nr; i++) {
 		struct object_array_entry *e = old_pending.objects + i;
-		struct commit *commit = handle_commit(revs, e->item, e->name);
+		struct commit *commit = handle_commit(revs, e);
 		if (commit) {
 			if (!(commit->object.flags & SEEN)) {
 				commit->object.flags |= SEEN;
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 20/25] rev-list: document --reflog option
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (18 preceding siblings ...)
  2014-10-15 22:43 ` [PATCH v2 19/25] traverse_commit_list: support pending blobs/trees with paths Jeff King
@ 2014-10-15 22:43 ` Jeff King
  2014-10-15 22:44 ` [PATCH v2 21/25] rev-list: add --index-objects option Jeff King
                   ` (5 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:43 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

This is mostly used internally, but it does not hurt to
explain it.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/rev-list-options.txt | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 5d311b8..4cf94c6 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -168,6 +168,10 @@ respectively, and they must begin with `refs/` when applied to `--glob`
 or `--all`. If a trailing '/{asterisk}' is intended, it must be given
 explicitly.
 
+--reflog::
+	Pretend as if all objects mentioned by reflogs are listed on the
+	command line as `<commit>`.
+
 --ignore-missing::
 	Upon seeing an invalid object name in the input, pretend as if
 	the bad input was not given.
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 21/25] rev-list: add --index-objects option
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (19 preceding siblings ...)
  2014-10-15 22:43 ` [PATCH v2 20/25] rev-list: document --reflog option Jeff King
@ 2014-10-15 22:44 ` Jeff King
  2014-10-16 18:41   ` Junio C Hamano
  2014-10-15 22:44 ` [PATCH v2 22/25] reachable: use revision machinery's --index-objects code Jeff King
                   ` (4 subsequent siblings)
  25 siblings, 1 reply; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:44 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

There is currently no easy way to ask the revision traversal
machinery to include objects reachable from the index (e.g.,
blobs and trees that have not yet been committed). This
patch adds an option to do so.

Signed-off-by: Jeff King <peff@peff.net>
---
I was tempted to call this just "--index", because I could not think of
what else "--index" would mean in the context of rev-list. But I also
worried about weird interactions with other commands that take revision
arguments. And since this is mostly for internal use anyway, I figured
the more verbose name is not too bad. I could be convinced otherwise,
though.

 Documentation/rev-list-options.txt |  7 ++++++
 revision.c                         | 51 ++++++++++++++++++++++++++++++++++++++
 revision.h                         |  1 +
 t/t6000-rev-list-misc.sh           | 23 +++++++++++++++++
 4 files changed, 82 insertions(+)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 4cf94c6..03ab343 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -172,6 +172,13 @@ explicitly.
 	Pretend as if all objects mentioned by reflogs are listed on the
 	command line as `<commit>`.
 
+--index-objects::
+	Pretend as if all objects used by the index (any blobs, and any
+	trees which are mentioned by the index's cache-tree extension)
+	ad listed on the command line. Note that you probably want to
+	use `--objects`, too, as there are by definition no commits in
+	the index.
+
 --ignore-missing::
 	Upon seeing an invalid object name in the input, pretend as if
 	the bad input was not given.
diff --git a/revision.c b/revision.c
index 9a0f99a..caabaf1 100644
--- a/revision.c
+++ b/revision.c
@@ -17,6 +17,7 @@
 #include "mailmap.h"
 #include "commit-slab.h"
 #include "dir.h"
+#include "cache-tree.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -1299,6 +1300,53 @@ void add_reflogs_to_pending(struct rev_info *revs, unsigned flags)
 	for_each_reflog(handle_one_reflog, &cb);
 }
 
+static void add_cache_tree(struct cache_tree *it, struct rev_info *revs,
+			   struct strbuf *path)
+{
+	size_t baselen = path->len;
+	int i;
+
+	if (it->entry_count >= 0) {
+		struct tree *tree = lookup_tree(it->sha1);
+		add_pending_object_with_path(revs, &tree->object, "",
+					     040000, path->buf);
+	}
+
+	for (i = 0; i < it->subtree_nr; i++) {
+		struct cache_tree_sub *sub = it->down[i];
+		strbuf_addf(path, "%s%s", baselen ? "/" : "", sub->name);
+		add_cache_tree(sub->cache_tree, revs, path);
+		strbuf_setlen(path, baselen);
+	}
+
+}
+
+void add_index_objects_to_pending(struct rev_info *revs, unsigned flags)
+{
+	int i;
+
+	read_cache();
+	for (i = 0; i < active_nr; i++) {
+		struct cache_entry *ce = active_cache[i];
+		struct blob *blob;
+
+		if (S_ISGITLINK(ce->ce_mode))
+			continue;
+
+		blob = lookup_blob(ce->sha1);
+		if (!blob)
+			die("unable to add index blob to traversal");
+		add_pending_object_with_path(revs, &blob->object, "",
+					     ce->ce_mode, ce->name);
+	}
+
+	if (active_cache_tree) {
+		struct strbuf path = STRBUF_INIT;
+		add_cache_tree(active_cache_tree, revs, &path);
+		strbuf_release(&path);
+	}
+}
+
 static int add_parents_only(struct rev_info *revs, const char *arg_, int flags)
 {
 	unsigned char sha1[20];
@@ -1649,6 +1697,7 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 	    !strcmp(arg, "--reflog") || !strcmp(arg, "--not") ||
 	    !strcmp(arg, "--no-walk") || !strcmp(arg, "--do-walk") ||
 	    !strcmp(arg, "--bisect") || starts_with(arg, "--glob=") ||
+	    !strcmp(arg, "--index-objects") ||
 	    starts_with(arg, "--exclude=") ||
 	    starts_with(arg, "--branches=") || starts_with(arg, "--tags=") ||
 	    starts_with(arg, "--remotes=") || starts_with(arg, "--no-walk="))
@@ -2078,6 +2127,8 @@ static int handle_revision_pseudo_opt(const char *submodule,
 		clear_ref_exclusion(&revs->ref_excludes);
 	} else if (!strcmp(arg, "--reflog")) {
 		add_reflogs_to_pending(revs, *flags);
+	} else if (!strcmp(arg, "--index-objects")) {
+		add_index_objects_to_pending(revs, *flags);
 	} else if (!strcmp(arg, "--not")) {
 		*flags ^= UNINTERESTING | BOTTOM;
 	} else if (!strcmp(arg, "--no-walk")) {
diff --git a/revision.h b/revision.h
index e644044..e6dcd5d 100644
--- a/revision.h
+++ b/revision.h
@@ -277,6 +277,7 @@ extern void add_pending_sha1(struct rev_info *revs,
 
 extern void add_head_to_pending(struct rev_info *);
 extern void add_reflogs_to_pending(struct rev_info *, unsigned int flags);
+extern void add_index_objects_to_pending(struct rev_info *, unsigned int flags);
 
 enum commit_action {
 	commit_ignore,
diff --git a/t/t6000-rev-list-misc.sh b/t/t6000-rev-list-misc.sh
index 3794e4c..c2fb09c 100755
--- a/t/t6000-rev-list-misc.sh
+++ b/t/t6000-rev-list-misc.sh
@@ -73,4 +73,27 @@ test_expect_success 'symleft flag bit is propagated down from tag' '
 	test_cmp expect actual
 '
 
+test_expect_success 'rev-list can show index objects' '
+	# Of the blobs and trees in the index, note:
+	#
+	#   - we do not show two/three, because it is the
+	#     same blob as "one", and we show objects only once
+	#
+	#   - we do show the tree "two", because it has a valid cache tree
+	#     from the last commit
+	#
+	#   - we do not show the root tree; since we updated the index, it
+	#     does not have a valid cache tree
+	#
+	cat >expect <<-\EOF
+	8e4020bb5a8d8c873b25de15933e75cc0fc275df one
+	d9d3a7417b9605cfd88ee6306b28dadc29e6ab08 only-in-index
+	9200b628cf9dc883a85a7abc8d6e6730baee589c two
+	EOF
+	echo only-in-index >only-in-index &&
+	git add only-in-index &&
+	git rev-list --objects --index-objects >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 22/25] reachable: use revision machinery's --index-objects code
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (20 preceding siblings ...)
  2014-10-15 22:44 ` [PATCH v2 21/25] rev-list: add --index-objects option Jeff King
@ 2014-10-15 22:44 ` Jeff King
  2014-10-15 22:45 ` [PATCH v2 23/25] pack-objects: use argv_array Jeff King
                   ` (3 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:44 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

This does the same thing as our custom code, so let's not
repeat ourselves.

Signed-off-by: Jeff King <peff@peff.net>
---
 reachable.c | 52 +---------------------------------------------------
 1 file changed, 1 insertion(+), 51 deletions(-)

diff --git a/reachable.c b/reachable.c
index 0176a88..a647267 100644
--- a/reachable.c
+++ b/reachable.c
@@ -32,56 +32,6 @@ static int add_one_ref(const char *path, const unsigned char *sha1, int flag, vo
 	return 0;
 }
 
-static void add_one_tree(const unsigned char *sha1, struct rev_info *revs)
-{
-	struct tree *tree = lookup_tree(sha1);
-	if (tree)
-		add_pending_object(revs, &tree->object, "");
-}
-
-static void add_cache_tree(struct cache_tree *it, struct rev_info *revs)
-{
-	int i;
-
-	if (it->entry_count >= 0)
-		add_one_tree(it->sha1, revs);
-	for (i = 0; i < it->subtree_nr; i++)
-		add_cache_tree(it->down[i]->cache_tree, revs);
-}
-
-static void add_cache_refs(struct rev_info *revs)
-{
-	int i;
-
-	read_cache();
-	for (i = 0; i < active_nr; i++) {
-		struct blob *blob;
-
-		/*
-		 * The index can contain blobs and GITLINKs, GITLINKs are hashes
-		 * that don't actually point to objects in the repository, it's
-		 * almost guaranteed that they are NOT blobs, so we don't call
-		 * lookup_blob() on them, to avoid populating the hash table
-		 * with invalid information
-		 */
-		if (S_ISGITLINK(active_cache[i]->ce_mode))
-			continue;
-
-		blob = lookup_blob(active_cache[i]->sha1);
-		if (blob)
-			blob->object.flags |= SEEN;
-
-		/*
-		 * We could add the blobs to the pending list, but quite
-		 * frankly, we don't care. Once we've looked them up, and
-		 * added them as objects, we've really done everything
-		 * there is to do for a blob
-		 */
-	}
-	if (active_cache_tree)
-		add_cache_tree(active_cache_tree, revs);
-}
-
 /*
  * The traversal will have already marked us as SEEN, so we
  * only need to handle any progress reporting here.
@@ -213,7 +163,7 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 	revs->tree_objects = 1;
 
 	/* Add all refs from the index file */
-	add_cache_refs(revs);
+	add_index_objects_to_pending(revs, 0);
 
 	/* Add all external refs */
 	for_each_ref(add_one_ref, revs);
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 23/25] pack-objects: use argv_array
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (21 preceding siblings ...)
  2014-10-15 22:44 ` [PATCH v2 22/25] reachable: use revision machinery's --index-objects code Jeff King
@ 2014-10-15 22:45 ` Jeff King
  2014-10-15 22:46 ` [PATCH v2 24/25] repack: pack objects mentioned by the index Jeff King
                   ` (2 subsequent siblings)
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:45 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

This saves us from having to bump the rp_av count when we
add new traversal options.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4df9499..b26276b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -22,6 +22,7 @@
 #include "pack-bitmap.h"
 #include "reachable.h"
 #include "sha1-array.h"
+#include "argv-array.h"
 
 static const char *pack_usage[] = {
 	N_("git pack-objects --stdout [options...] [< ref-list | < object-list]"),
@@ -2614,8 +2615,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	int use_internal_rev_list = 0;
 	int thin = 0;
 	int all_progress_implied = 0;
-	const char *rp_av[6];
-	int rp_ac = 0;
+	struct argv_array rp = ARGV_ARRAY_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
@@ -2705,24 +2705,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (pack_to_stdout != !base_name || argc)
 		usage_with_options(pack_usage, pack_objects_options);
 
-	rp_av[rp_ac++] = "pack-objects";
+	argv_array_push(&rp, "pack-objects");
 	if (thin) {
 		use_internal_rev_list = 1;
-		rp_av[rp_ac++] = "--objects-edge";
+		argv_array_push(&rp, "--objects-edge");
 	} else
-		rp_av[rp_ac++] = "--objects";
+		argv_array_push(&rp, "--objects");
 
 	if (rev_list_all) {
 		use_internal_rev_list = 1;
-		rp_av[rp_ac++] = "--all";
+		argv_array_push(&rp, "--all");
 	}
 	if (rev_list_reflog) {
 		use_internal_rev_list = 1;
-		rp_av[rp_ac++] = "--reflog";
+		argv_array_push(&rp, "--reflog");
 	}
 	if (rev_list_unpacked) {
 		use_internal_rev_list = 1;
-		rp_av[rp_ac++] = "--unpacked";
+		argv_array_push(&rp, "--unpacked");
 	}
 
 	if (!reuse_object)
@@ -2766,8 +2766,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!use_internal_rev_list)
 		read_object_list_from_stdin();
 	else {
-		rp_av[rp_ac] = NULL;
-		get_object_list(rp_ac, rp_av);
+		get_object_list(rp.argc, rp.argv);
+		argv_array_clear(&rp);
 	}
 	cleanup_preferred_base();
 	if (include_tag && nr_result)
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 24/25] repack: pack objects mentioned by the index
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (22 preceding siblings ...)
  2014-10-15 22:45 ` [PATCH v2 23/25] pack-objects: use argv_array Jeff King
@ 2014-10-15 22:46 ` Jeff King
  2014-10-15 22:48 ` [PATCH v2 25/25] pack-objects: double-check options before discarding objects Jeff King
  2014-10-16 21:07 ` [PATCH v2 0/25] prune-safety Junio C Hamano
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:46 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

When we pack all objects, we use only the objects reachable
from references and reflogs. This misses any objects which
are reachable from the index, but not yet referenced.

By itself this isn't a big deal; the objects can remain
loose until they are actually used in a commit. However, it
does create a problem when we drop packed but unreachable
objects. We try to optimize out the writing of objects that
we will immediately prune, which means we must follow the
same rules as prune in determining what is reachable. And
prune uses the index for this purpose.

This is rather uncommon in practice, as objects in the index
would not usually have been packed in the first place. But
it could happen in a sequence like:

  1. You make a commit on a branch that references blob X.

  2. You repack, moving X into the pack.

  3. You delete the branch (and its reflog), so that X is
     unreferenced.

  4. You "git add" blob X so that it is now referenced only
     by the index.

  5. You repack again with git-gc. The pack-objects we
     invoke will see that X is neither referenced nor
     recent and not bother loosening it.

Signed-off-by: Jeff King <peff@peff.net>
---
In addition to fixing the safety, this is actually _packing_ those index
objects. I don't see anything wrong with that, but if would prefer not
to, it should be possible to do a separate traversal over the index
objects (and mark them as "to keep", but not "to pack").

 builtin/pack-objects.c               |  8 ++++++++
 builtin/repack.c                     |  1 +
 t/t7701-repack-unpack-unreachable.sh | 13 +++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b26276b..3dc9108 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2617,6 +2617,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	int all_progress_implied = 0;
 	struct argv_array rp = ARGV_ARRAY_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
+	int rev_list_index = 0;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
 			    N_("do not show progress meter"), 0),
@@ -2663,6 +2664,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_SET_INT, 0, "reflog", &rev_list_reflog, NULL,
 		  N_("include objects referred by reflog entries"),
 		  PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
+		{ OPTION_SET_INT, 0, "index-objects", &rev_list_index, NULL,
+		  N_("include objects referred to by the index"),
+		  PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
 		OPT_BOOL(0, "stdout", &pack_to_stdout,
 			 N_("output pack to stdout")),
 		OPT_BOOL(0, "include-tag", &include_tag,
@@ -2720,6 +2724,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_internal_rev_list = 1;
 		argv_array_push(&rp, "--reflog");
 	}
+	if (rev_list_index) {
+		use_internal_rev_list = 1;
+		argv_array_push(&rp, "--index-objects");
+	}
 	if (rev_list_unpacked) {
 		use_internal_rev_list = 1;
 		argv_array_push(&rp, "--unpacked");
diff --git a/builtin/repack.c b/builtin/repack.c
index 2aae05d..80b6021 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -209,6 +209,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	argv_array_push(&cmd_args, "--non-empty");
 	argv_array_push(&cmd_args, "--all");
 	argv_array_push(&cmd_args, "--reflog");
+	argv_array_push(&cmd_args, "--index-objects");
 	if (window)
 		argv_array_pushf(&cmd_args, "--window=%s", window);
 	if (window_memory)
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index b8d4cde..aad8a9c 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -109,4 +109,17 @@ test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'keep packed objects found only in index' '
+	echo my-unique-content >file &&
+	git add file &&
+	git commit -m "make it reachable" &&
+	git gc &&
+	git reset HEAD^ &&
+	git reflog expire --expire=now --all &&
+	git add file &&
+	test-chmtime =-86400 .git/objects/pack/* &&
+	git gc --prune=1.hour.ago &&
+	git cat-file blob :file
+'
+
 test_done
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v2 25/25] pack-objects: double-check options before discarding objects
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (23 preceding siblings ...)
  2014-10-15 22:46 ` [PATCH v2 24/25] repack: pack objects mentioned by the index Jeff King
@ 2014-10-15 22:48 ` Jeff King
  2014-10-16 21:07 ` [PATCH v2 0/25] prune-safety Junio C Hamano
  25 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-15 22:48 UTC (permalink / raw)
  To: git; +Cc: Michael Haggerty, Junio C Hamano

When we are given an expiration time like
--unpack-unreachable=2.weeks.ago, we avoid writing out old,
unreachable loose objects entirely, under the assumption
that running "prune" would simply delete them immediately
anyway. However, this is only valid if we computed the same
set of reachable objects as prune would.

In practice, this is the case, because only git-repack uses
the --unpack-unreachable option with an expiration, and it
always feeds as many objects into the pack as possible. But
we can double-check at runtime just to be sure.

Signed-off-by: Jeff King <peff@peff.net>
---
No test here, because the potential breakage cannot be seen by running
"git repack" (because it gives sane options), nor by just running "git
pack-objects" (you can convince it not explode a loose object, but then
you would have to reimplement the part of "git repack" where it deletes
all the packs except the one we just created). So really this would only
be protecting in practice against somebody who tried to reimplement
git-repack themselves (and I do not know of anybody who has done that).

 builtin/pack-objects.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 3dc9108..72589ed 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2757,6 +2757,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (keep_unreachable && unpack_unreachable)
 		die("--keep-unreachable and --unpack-unreachable are incompatible.");
+	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
+		unpack_unreachable_expiration = 0;
 
 	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
 		use_bitmap_index = 0;
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 02/25] isxdigit: cast input to unsigned char
  2014-10-15 22:34 ` [PATCH v2 02/25] isxdigit: cast input to unsigned char Jeff King
@ 2014-10-16 17:16   ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2014-10-16 17:16 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty

Jeff King <peff@peff.net> writes:

> Otherwise, callers must do so or risk triggering warnings
> -Wchar-subscript (and rightfully so; a signed char might
> cause us to use a bogus negative index into the
> hexval_table).
>
> While we are dropping the now-unnecessary casts from the
> caller in urlmatch.c, we can get rid of similar casts in
> actually parsing the hex by using the hexval() helper, which
> implicitly casts to unsigned (but note that we cannot
> implement isxdigit in terms of hexval(), as it also casts
> its return value to unsigned).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> The patch that added more calls to isxdigit later in the series actually
> got reworked. So this is purely a cleanup and can be dropped if need be,
> though I still think it is an improvement.

Yes, thanks.

>
>  git-compat-util.h | 2 +-
>  urlmatch.c        | 8 ++++----
>  2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/git-compat-util.h b/git-compat-util.h
> index fb41118..44890d5 100644
> --- a/git-compat-util.h
> +++ b/git-compat-util.h
> @@ -677,7 +677,7 @@ extern const unsigned char sane_ctype[256];
>  #define iscntrl(x) (sane_istest(x,GIT_CNTRL))
>  #define ispunct(x) sane_istest(x, GIT_PUNCT | GIT_REGEX_SPECIAL | \
>  		GIT_GLOB_SPECIAL | GIT_PATHSPEC_MAGIC)
> -#define isxdigit(x) (hexval_table[x] != -1)
> +#define isxdigit(x) (hexval_table[(unsigned char)(x)] != -1)
>  #define tolower(x) sane_case((unsigned char)(x), 0x20)
>  #define toupper(x) sane_case((unsigned char)(x), 0)
>  #define is_pathspec_magic(x) sane_istest(x,GIT_PATHSPEC_MAGIC)
> diff --git a/urlmatch.c b/urlmatch.c
> index 3d4c54b..618d216 100644
> --- a/urlmatch.c
> +++ b/urlmatch.c
> @@ -43,11 +43,11 @@ static int append_normalized_escapes(struct strbuf *buf,
>  		from_len--;
>  		if (ch == '%') {
>  			if (from_len < 2 ||
> -			    !isxdigit((unsigned char)from[0]) ||
> -			    !isxdigit((unsigned char)from[1]))
> +			    !isxdigit(from[0]) ||
> +			    !isxdigit(from[1]))
>  				return 0;
> -			ch = hexval_table[(unsigned char)*from++] << 4;
> -			ch |= hexval_table[(unsigned char)*from++];
> +			ch = hexval(*from++) << 4;
> +			ch |= hexval(*from++);
>  			from_len -= 2;
>  			was_esc = 1;
>  		}

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 03/25] object_array: factor out slopbuf-freeing logic
  2014-10-15 22:34 ` [PATCH v2 03/25] object_array: factor out slopbuf-freeing logic Jeff King
@ 2014-10-16 17:39   ` Junio C Hamano
  2014-10-17  0:33     ` git-bundle rev handling and de-duping Jeff King
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2014-10-16 17:39 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty

Jeff King <peff@peff.net> writes:

> This is not a lot of code, but it's a logical construct that
> should not need to be repeated (and we are about to add a
> third repetition).

Good, but I have two and a half tangential comments about the
context that appears in this patch ;-)

>  void object_array_filter(struct object_array *array,
>  			 object_array_each_func_t want, void *cb_data)
>  {
> @@ -367,8 +377,7 @@ void object_array_filter(struct object_array *array,
>  				objects[dst] = objects[src];
>  			dst++;
>  		} else {
> -			if (objects[src].name != object_array_slopbuf)
> -				free(objects[src].name);
> +			object_array_release_entry(&objects[src]);
>  		}
>  	}
>  	array->nr = dst;
> @@ -400,8 +409,7 @@ void object_array_remove_duplicates(struct object_array *array)
>  				objects[array->nr] = objects[src];
>  			array->nr++;
>  		} else {
> -			if (objects[src].name != object_array_slopbuf)
> -				free(objects[src].name);
> +			object_array_release_entry(&objects[src]);
>  		}
>  	}
>  }

 1.  These two functions both remove elements from a given array
     in-place, the former being in a more generic form that takes a
     caller-specified criterion while the latter uses a hardcoded
     condition to decide what to filter.  aeb4a51e (object_array:
     add function object_array_filter(), 2013-05-25) and later
     1506510c (object_array_remove_duplicates(): rewrite to reduce
     copying, 2013-05-25) should have refactored the latter further
     to implement it in terms of the former, perhaps?

 1.5 I would have expected a function to "remove duplicates from an
     array" to remove duplicates from the array by comparing the objects
     contained in the array, not entries that may (or may not) point at
     different objects but happens to share the same name; I think this
     function is misnamed.

 2.  We use object_array_remove_duplicates() to de-dup "git bundle
     create x master master", which came from b2a6d1c6 (bundle:
     allow the same ref to be given more than once, 2009-01-17),
     which is still the sole caller of the function, and I think
     this is bogus.  Comparing .name would not de-dup "git bundle
     create x master refs/heads/master".

I think the right way to fix these two and a half problems is to do
the following:

 - object_array_remove_duplicates() (and contains_name() helper it
   uses) should be removed from object.c;

 - create_bundle() in bundle.c should implement a helper that is
   similar to contains_name() but knows about ref dwimming and use
   it to call object_array_filter() to replace its call to
   object_array_remove_duplicates().

I am not doing this myself, and I do not expect either you or
Michael to do so, either.  I am just writing this down to point out
a low hanging fruit to aspiring new contributors (hint, hint).

Thanks.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 06/25] reachable: use traverse_commit_list instead of custom walk
  2014-10-15 22:37 ` [PATCH v2 06/25] reachable: use traverse_commit_list instead of custom walk Jeff King
@ 2014-10-16 17:53   ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2014-10-16 17:53 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty

Jeff King <peff@peff.net> writes:

> To find the set of reachable objects, we add a bunch of
> possible sources to our rev_info, call prepare_revision_walk,
> and then launch into a custom walker that handles each
> object top. This is a subset of what traverse_commit_list
> does, so we can just reuse that code (it can also handle
> more complex cases like UNINTERESTING commits and pathspecs,
> but we don't use those features).
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I was concerned this would be slower because traverse_commit_list is
> more featureful. To my surprise, it was consistently about 3-4% faster!
> The major difference is that traverse_commit_list will hit all of the
> commits first, and then the trees. For reachability that doesn't matter
> either way, but I suspect the new way has slightly better cache
> locality, leading to the minor speedup.

I am not very surprised, as "custom walk" hasn't changed much ever
since it was done in ba84a797 (builtin "git prune", 2006-07-06),
while the generic traversal code has been worked heavily while it
was still in builtin-rev-list.c and then later moved to
list-objects.c.

>  reachable.c | 130 ++++++++----------------------------------------------------
>  1 file changed, 17 insertions(+), 113 deletions(-)

;-) ;-) ;-) ;-) ;-)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 21/25] rev-list: add --index-objects option
  2014-10-15 22:44 ` [PATCH v2 21/25] rev-list: add --index-objects option Jeff King
@ 2014-10-16 18:41   ` Junio C Hamano
  2014-10-17  0:12     ` Jeff King
  0 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2014-10-16 18:41 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty

Jeff King <peff@peff.net> writes:

> There is currently no easy way to ask the revision traversal
> machinery to include objects reachable from the index (e.g.,
> blobs and trees that have not yet been committed). This
> patch adds an option to do so.
>
> Signed-off-by: Jeff King <peff@peff.net>
> ---
> I was tempted to call this just "--index", because I could not think of
> what else "--index" would mean in the context of rev-list. But I also
> worried about weird interactions with other commands that take revision
> arguments. And since this is mostly for internal use anyway, I figured
> the more verbose name is not too bad. I could be convinced otherwise,
> though.

I agree that "--index" is a bad name as it usually is used in a
particular context: the command can work on various combination of
working tree and the index, and I am asking it to work on both
(e.g. "apply --index" as opposed to "apply --cached").

> diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
> index 4cf94c6..03ab343 100644
> --- a/Documentation/rev-list-options.txt
> +++ b/Documentation/rev-list-options.txt
> @@ -172,6 +172,13 @@ explicitly.
>  	Pretend as if all objects mentioned by reflogs are listed on the
>  	command line as `<commit>`.
>  
> +--index-objects::

This risks "index" getting misunderstood as a verb, e.g. "please
enumerate the objects and assign labels to later refer to them",
doesn't it?

"--indexed-objects" (short for "--show-objects-in-the-index") or
something?

> +	Pretend as if all objects used by the index (any blobs, and any
> +	trees which are mentioned by the index's cache-tree extension)
> +	ad listed on the command line. Note that you probably want to

s/ad/are/, probably?

> +	use `--objects`, too, as there are by definition no commits in
> +	the index.

For gitlinks/submodules, the index records names of the commit
objects, they are not listed, and that is the right behaviour, but
this description invites some confusion.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/25] prune-safety
  2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
                   ` (24 preceding siblings ...)
  2014-10-15 22:48 ` [PATCH v2 25/25] pack-objects: double-check options before discarding objects Jeff King
@ 2014-10-16 21:07 ` Junio C Hamano
  2014-10-16 21:10   ` Junio C Hamano
  2014-10-16 21:21   ` Jeff King
  25 siblings, 2 replies; 48+ messages in thread
From: Junio C Hamano @ 2014-10-16 21:07 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty

Somewhere in this series the object enumeration seems to get
broken.  "git clone --no-local" of git.git repository died
with

    Cloning into 'victim-7'...
    remote: Counting objects: 173727, done.
    remote: Compressing objects: 100% (43697/43697), done.
    remote: Total 173727 (delta 130908), reused 170009 (delta 128151)
    Receiving objects: 100% (173727/173727), 33.45 MiB | 13.69 MiB/s,
    done.
    Resolving deltas: 100% (130908/130908), done.
    fatal: did not receive expected object a74380c3117994efba501df1707418eb6feb9284
    fatal: index-pack failed

where a74380c... is such an object.

If you have a clone of https://github.com/git/git.git

$ git rev-list --parents --objects --all | grep  a74380c3117994

would be an easy way to reproduce.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/25] prune-safety
  2014-10-16 21:07 ` [PATCH v2 0/25] prune-safety Junio C Hamano
@ 2014-10-16 21:10   ` Junio C Hamano
  2014-10-16 21:21   ` Jeff King
  1 sibling, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2014-10-16 21:10 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty

Junio C Hamano <gitster@pobox.com> writes:

> Somewhere in this series the object enumeration seems to get
> broken.  "git clone --no-local" of git.git repository died
> with
>
>     Cloning into 'victim-7'...
>     remote: Counting objects: 173727, done.
>     remote: Compressing objects: 100% (43697/43697), done.
>     remote: Total 173727 (delta 130908), reused 170009 (delta 128151)
>     Receiving objects: 100% (173727/173727), 33.45 MiB | 13.69 MiB/s,
>     done.
>     Resolving deltas: 100% (130908/130908), done.
>     fatal: did not receive expected object a74380c3117994efba501df1707418eb6feb9284
>     fatal: index-pack failed
>
> where a74380c... is such an object.

Ehh, "such" is a nonsense.  It is a blob that directly is pointed by
a tag that is in refs/tags/*.

> If you have a clone of https://github.com/git/git.git
>
> $ git rev-list --parents --objects --all | grep  a74380c3117994
>
> would be an easy way to reproduce.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/25] prune-safety
  2014-10-16 21:07 ` [PATCH v2 0/25] prune-safety Junio C Hamano
  2014-10-16 21:10   ` Junio C Hamano
@ 2014-10-16 21:21   ` Jeff King
  2014-10-16 21:39     ` Jeff King
  1 sibling, 1 reply; 48+ messages in thread
From: Jeff King @ 2014-10-16 21:21 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

On Thu, Oct 16, 2014 at 02:07:47PM -0700, Junio C Hamano wrote:

> Somewhere in this series the object enumeration seems to get
> broken.  "git clone --no-local" of git.git repository died
> with
> 
>     Cloning into 'victim-7'...
>     remote: Counting objects: 173727, done.
>     remote: Compressing objects: 100% (43697/43697), done.
>     remote: Total 173727 (delta 130908), reused 170009 (delta 128151)
>     Receiving objects: 100% (173727/173727), 33.45 MiB | 13.69 MiB/s,
>     done.
>     Resolving deltas: 100% (130908/130908), done.
>     fatal: did not receive expected object a74380c3117994efba501df1707418eb6feb9284
>     fatal: index-pack failed
> 
> where a74380c... is such an object.
> 
> If you have a clone of https://github.com/git/git.git

Hrm. I cannot reproduce the clone failure...

> $ git rev-list --parents --objects --all | grep  a74380c3117994
> 
> would be an easy way to reproduce.

But this I can. Digging into it...

-Peff

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/25] prune-safety
  2014-10-16 21:21   ` Jeff King
@ 2014-10-16 21:39     ` Jeff King
  2014-10-16 22:18       ` Junio C Hamano
                         ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Jeff King @ 2014-10-16 21:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

On Thu, Oct 16, 2014 at 05:21:12PM -0400, Jeff King wrote:

> Hrm. I cannot reproduce the clone failure...

Oh, because I have bitmaps turned on, and we do not run the list-objects
code at all in that case.

The bug unsurprisingly bisects to "traverse_commit_list: support
pending blobs/trees with paths". The problem is that I didn't notice
that handle_commit actually rewrites the "object" pointer when
unwrapping the tags, and that commit reuses the original pointer from
the entry. So we end up putting two copies of the tag object into the
pending list, rather than the tag and the blob.

The fix is:

diff --git a/revision.c b/revision.c
index 9a0f99a..8030fc8 100644
--- a/revision.c
+++ b/revision.c
@@ -231,12 +231,6 @@ static void add_pending_object_with_mode(struct rev_info *revs,
 	add_pending_object_with_path(revs, obj, name, mode, NULL);
 }
 
-static void add_pending_object_from_entry(struct rev_info *revs,
-					  struct object_array_entry *e)
-{
-	add_pending_object_with_path(revs, e->item, e->name, e->mode, e->path);
-}
-
 void add_pending_object(struct rev_info *revs,
 			struct object *obj, const char *name)
 {
@@ -283,6 +277,8 @@ static struct commit *handle_commit(struct rev_info *revs,
 {
 	struct object *object = entry->item;
 	const char *name = entry->name;
+	const char *path = entry->path;
+	unsigned int mode = entry->mode;
 	unsigned long flags = object->flags;
 
 	/*
@@ -301,6 +297,14 @@ static struct commit *handle_commit(struct rev_info *revs,
 			die("bad object %s", sha1_to_hex(tag->tagged->sha1));
 		}
 		object->flags |= flags;
+		/*
+		 * We'll handle the tagged object by looping or dropping
+		 * through to the non-tag handlers below. Do not
+		 * propagate data from the tag's pending entry.
+		 */
+		name = NULL;
+		path = NULL;
+		mode = 0;
 	}
 
 	/*
@@ -332,7 +336,7 @@ static struct commit *handle_commit(struct rev_info *revs,
 			mark_tree_contents_uninteresting(tree);
 			return NULL;
 		}
-		add_pending_object_from_entry(revs, entry);
+		add_pending_object_with_path(revs, object, name, mode, path);
 		return NULL;
 	}
 
@@ -344,7 +348,7 @@ static struct commit *handle_commit(struct rev_info *revs,
 			return NULL;
 		if (flags & UNINTERESTING)
 			return NULL;
-		add_pending_object_from_entry(revs, entry);
+		add_pending_object_with_path(revs, object, name, mode, path);
 		return NULL;
 	}
 	die("%s is unknown object", name);

We should probably add a test for cloning a tag of an otherwise
unreferenced object, too.

-Peff

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/25] prune-safety
  2014-10-16 21:39     ` Jeff King
@ 2014-10-16 22:18       ` Junio C Hamano
  2014-10-17  0:03         ` Jeff King
       [not found]       ` <CAPc5daX0AFv9jDrFyd_OnupW5AfZW9Je_rgzaViX_xxs3SG5zg@mail.gmail.com>
  2014-10-18 12:31       ` Jeff King
  2 siblings, 1 reply; 48+ messages in thread
From: Junio C Hamano @ 2014-10-16 22:18 UTC (permalink / raw)
  To: Jeff King; +Cc: git, Michael Haggerty

Jeff King <peff@peff.net> writes:

> On Thu, Oct 16, 2014 at 05:21:12PM -0400, Jeff King wrote:
>
>> Hrm. I cannot reproduce the clone failure...
>
> Oh, because I have bitmaps turned on, and we do not run the list-objects
> code at all in that case.

;-)

> We should probably add a test for cloning a tag of an otherwise
> unreferenced object, too.

Yeah, that too ;-)

Thanks for a quick diag.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/25] prune-safety
  2014-10-16 22:18       ` Junio C Hamano
@ 2014-10-17  0:03         ` Jeff King
  0 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:03 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

On Thu, Oct 16, 2014 at 03:18:34PM -0700, Junio C Hamano wrote:

> > We should probably add a test for cloning a tag of an otherwise
> > unreferenced object, too.
> 
> Yeah, that too ;-)

Here's that test (after the scissors below). It can be applied totally
separately, though I stuck it in as patch "18.5/25" in the existing
series to confirm that my original series does cause the test to fail,
and that it passes with the patch I posted earlier.

Do you want to just squash the fix I posted earlier (into patch 19, the
"traverse_commit_list: ..." one)? I'm happy to repost the revised patch,
but my impression of your workflow is that squashing is just as easy
than replacing a patch (i.e., you're running "rebase -i" either way).

Or I'm happy to re-post the whole series, but it's rather long. :)

> Thanks for a quick diag.

I'm impressed that you found the bug so quickly. :) My biggest fear with
the whole series is that it's disrupting and refactoring some
fundamental parts of the code that might cause regressions. I put a lot
of my faith in the test suite, but that did not work out here (I do
still feel pretty good about the series overall, though I think I'd cook
it longer than usual given the complexity and the areas it touches).

-- >8 --
Subject: t5516: test pushing a tag of an otherwise unreferenced blob

It's not unreasonable to have a tag that points to a blob
that is not part of the normal history. We do this in
git.git to distribute gpg keys. However, we never explicitly
checked in our test suite that this actually works (i.e.,
that pack-objects actually sends the blob because of the tag
mentioning it).

It does in fact work fine, but a recent patch under
discussion broke this, and the test suite didn't notice.
Let's make the test suite more complete.

Signed-off-by: Jeff King <peff@peff.net>
---
The "should have" below is belt-and-suspenders. The test actually fails
with my series without the cat-file check, but I'm concerned a bug
that affects pack-objects could also affect the connectivity check on
the receiving side.

 t/t5516-fetch-push.sh | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/t/t5516-fetch-push.sh b/t/t5516-fetch-push.sh
index 67e0ab3..7c8a769 100755
--- a/t/t5516-fetch-push.sh
+++ b/t/t5516-fetch-push.sh
@@ -1277,4 +1277,17 @@ EOF
 	git push --no-thin --receive-pack="$rcvpck" no-thin/.git refs/heads/master:refs/heads/foo
 '
 
+test_expect_success 'pushing a tag pushes the tagged object' '
+	rm -rf dst.git &&
+	blob=$(echo unreferenced | git hash-object -w --stdin) &&
+	git tag -m foo tag-of-blob $blob &&
+	git init --bare dst.git &&
+	git push dst.git tag-of-blob &&
+	# the receiving index-pack should have noticed
+	# any problems, but we double check
+	echo unreferenced >expect &&
+	git --git-dir=dst.git cat-file blob tag-of-blob >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 21/25] rev-list: add --index-objects option
  2014-10-16 18:41   ` Junio C Hamano
@ 2014-10-17  0:12     ` Jeff King
  2014-10-17  0:43       ` Jeff King
  0 siblings, 1 reply; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:12 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

On Thu, Oct 16, 2014 at 11:41:35AM -0700, Junio C Hamano wrote:

> I agree that "--index" is a bad name as it usually is used in a
> particular context: the command can work on various combination of
> working tree and the index, and I am asking it to work on both
> (e.g. "apply --index" as opposed to "apply --cached").

Thanks. I wasn't sure if I was just being paranoid or not, but now there
are at least two of us.

> > +--index-objects::
> 
> This risks "index" getting misunderstood as a verb, e.g. "please
> enumerate the objects and assign labels to later refer to them",
> doesn't it?
> 
> "--indexed-objects" (short for "--show-objects-in-the-index") or
> something?

That sounds reasonable. We could technically do `--indexed` as that is
different from `--index`, but maybe they are still confusingly close.

> > +	Pretend as if all objects used by the index (any blobs, and any
> > +	trees which are mentioned by the index's cache-tree extension)
> > +	ad listed on the command line. Note that you probably want to
> 
> s/ad/are/, probably?

Yeah, sorry, vi cruft I think (at least I didn't accidentally insert
"C-a C-k" ;) ).

> > +	use `--objects`, too, as there are by definition no commits in
> > +	the index.
> 
> For gitlinks/submodules, the index records names of the commit
> objects, they are not listed, and that is the right behaviour, but
> this description invites some confusion.

Good point. How about this:

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 03ab343..3301fde 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -172,12 +172,10 @@ explicitly.
 	Pretend as if all objects mentioned by reflogs are listed on the
 	command line as `<commit>`.
 
---index-objects::
-	Pretend as if all objects used by the index (any blobs, and any
-	trees which are mentioned by the index's cache-tree extension)
-	ad listed on the command line. Note that you probably want to
-	use `--objects`, too, as there are by definition no commits in
-	the index.
+--indexed-objects::
+	Pretend as if all trees and blobs used by the index are listed
+	on the command line.  Note that you probably want to use
+	`--objects`, too.
 
 --ignore-missing::
 	Upon seeing an invalid object name in the input, pretend as if

I was tempted to not document this at all (nor to add documentation for
--reflog), as I think these are really only going to be used internally.
But it's nice to have documentation even for this internal stuff (if
anything, we should probably be making sure they are just limited to
rev-list plumbing, and not included in the log manpage).

-Peff

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* git-bundle rev handling and de-duping
  2014-10-16 17:39   ` Junio C Hamano
@ 2014-10-17  0:33     ` Jeff King
  2014-10-17 21:03       ` Philip Oakley
  0 siblings, 1 reply; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:33 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

[subject tweaked as we have veered quite far off the original, and
 this might get more attention from interested people]

On Thu, Oct 16, 2014 at 10:39:54AM -0700, Junio C Hamano wrote:

>  2.  We use object_array_remove_duplicates() to de-dup "git bundle
>      create x master master", which came from b2a6d1c6 (bundle:
>      allow the same ref to be given more than once, 2009-01-17),
>      which is still the sole caller of the function, and I think
>      this is bogus.  Comparing .name would not de-dup "git bundle
>      create x master refs/heads/master".

Hmm. We also doesn't respect the UNINTERESTING flag, so it will dedup
"foo" and "^foo" into a single entry. I _think_ this doesn't have any
bad side effects, because they are equivalent (i.e., the flag is carried
on the object, not the pending entry). I would expect this:

  git bundle create ... foo ^foo

to barf (and it does) because the bundle is empty. But with this:

  git bundle create ... another-ref foo ^foo

I would expect the resulting bundle to contain the objects from
another-ref, but still mention "foo" as an update (we didn't need to
send any objects, since presumably the other side already had them). It
doesn't, but that is not because of the dedup. It is because we generate
the list of refs to send based on the pending list, and we skip all
UNINTERESTING objects (we must, because otherwise "^foo" would make an
entry). I.e., it is the same "the flag is on the object, not the pending
entry" that saved us above now causing its own bug. Obviously the
example above is sort of silly, but if you have:

  # two branches point at the same object
  git branch foo master

  # the other side already has master. Let's send them foo.
  # this will fail because the bundle is empty. That's mildly
  # annoying because we really want to tell them "hey, update
  # your foo branch". But at least we get an error.
  git bundle create tmp.bundle foo ^master

  # now the same thing, but we're sending them some other objects.
  # This one succeeds, but silently omits foo from the bundle!
  git bundle create tmp.bundle foo another-ref ^master

I have a feeling that the list needs to be generated from revs.cmdline,
not revs.pending, and the de-duplication needs to happen there (with the
ref resolution that you mention).

I also have the feeling that fast-export had to deal with this exact
same issue. It looks like we use revs.cmdline there. I seem to recall
there was some ongoing work in that area, but I stopped paying close
attention due to some personality conflicts, and I don't know if all of
the issues were ever resolved.

> I think the right way to fix these two and a half problems is to do
> the following:
> 
>  - object_array_remove_duplicates() (and contains_name() helper it
>    uses) should be removed from object.c;
> 
>  - create_bundle() in bundle.c should implement a helper that is
>    similar to contains_name() but knows about ref dwimming and use
>    it to call object_array_filter() to replace its call to
>    object_array_remove_duplicates().

Agreed. The loop in create_bundle right after the de-dup does the
dwimming. Probably it would be simple to just skip duplicates there
using a hashmap or sorted list.

> I am not doing this myself, and I do not expect either you or
> Michael to do so, either.  I am just writing this down to point out
> a low hanging fruit to aspiring new contributors (hint, hint).

I am also not planning on working on it soon, but now we have hopefully
fed plenty of possibilities to anybody who wants to. :)

-Peff

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 21/25] rev-list: add --index-objects option
  2014-10-17  0:12     ` Jeff King
@ 2014-10-17  0:43       ` Jeff King
  2014-10-17  0:44         ` [PATCH v3 22/26] rev-list: add --indexed-objects option Jeff King
                           ` (4 more replies)
  0 siblings, 5 replies; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

On Thu, Oct 16, 2014 at 08:12:31PM -0400, Jeff King wrote:

> > "--indexed-objects" (short for "--show-objects-in-the-index") or
> > something?
> 
> That sounds reasonable. We could technically do `--indexed` as that is
> different from `--index`, but maybe they are still confusingly close.

Here's a re-roll of the final 5 patches of the series with the updated
name (`--indexed-objects`). The name change cascades from patch 22 to
patches 23 and 25 (and I renamed the matching pack-objects option as
well). 24 and 26 are unchanged, but I figured it is easier on you to
replace the whole block of patches at once.

  [22/26]: rev-list: add --indexed-objects option
  [23/26]: reachable: use revision machinery's --indexed-objects code
  [24/26]: pack-objects: use argv_array
  [25/26]: repack: pack objects mentioned by the index
  [26/26]: pack-objects: double-check options before discarding objects

^ permalink raw reply	[flat|nested] 48+ messages in thread

* [PATCH v3 22/26] rev-list: add --indexed-objects option
  2014-10-17  0:43       ` Jeff King
@ 2014-10-17  0:44         ` Jeff King
  2014-10-17  0:44         ` [PATCH v3 23/26] reachable: use revision machinery's --indexed-objects code Jeff King
                           ` (3 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

There is currently no easy way to ask the revision traversal
machinery to include objects reachable from the index (e.g.,
blobs and trees that have not yet been committed). This
patch adds an option to do so.

Signed-off-by: Jeff King <peff@peff.net>
---
 Documentation/rev-list-options.txt |  5 ++++
 revision.c                         | 51 ++++++++++++++++++++++++++++++++++++++
 revision.h                         |  1 +
 t/t6000-rev-list-misc.sh           | 23 +++++++++++++++++
 4 files changed, 80 insertions(+)

diff --git a/Documentation/rev-list-options.txt b/Documentation/rev-list-options.txt
index 4cf94c6..3301fde 100644
--- a/Documentation/rev-list-options.txt
+++ b/Documentation/rev-list-options.txt
@@ -172,6 +172,11 @@ explicitly.
 	Pretend as if all objects mentioned by reflogs are listed on the
 	command line as `<commit>`.
 
+--indexed-objects::
+	Pretend as if all trees and blobs used by the index are listed
+	on the command line.  Note that you probably want to use
+	`--objects`, too.
+
 --ignore-missing::
 	Upon seeing an invalid object name in the input, pretend as if
 	the bad input was not given.
diff --git a/revision.c b/revision.c
index 8030fc8..a6c5dc3 100644
--- a/revision.c
+++ b/revision.c
@@ -17,6 +17,7 @@
 #include "mailmap.h"
 #include "commit-slab.h"
 #include "dir.h"
+#include "cache-tree.h"
 
 volatile show_early_output_fn_t show_early_output;
 
@@ -1303,6 +1304,53 @@ void add_reflogs_to_pending(struct rev_info *revs, unsigned flags)
 	for_each_reflog(handle_one_reflog, &cb);
 }
 
+static void add_cache_tree(struct cache_tree *it, struct rev_info *revs,
+			   struct strbuf *path)
+{
+	size_t baselen = path->len;
+	int i;
+
+	if (it->entry_count >= 0) {
+		struct tree *tree = lookup_tree(it->sha1);
+		add_pending_object_with_path(revs, &tree->object, "",
+					     040000, path->buf);
+	}
+
+	for (i = 0; i < it->subtree_nr; i++) {
+		struct cache_tree_sub *sub = it->down[i];
+		strbuf_addf(path, "%s%s", baselen ? "/" : "", sub->name);
+		add_cache_tree(sub->cache_tree, revs, path);
+		strbuf_setlen(path, baselen);
+	}
+
+}
+
+void add_index_objects_to_pending(struct rev_info *revs, unsigned flags)
+{
+	int i;
+
+	read_cache();
+	for (i = 0; i < active_nr; i++) {
+		struct cache_entry *ce = active_cache[i];
+		struct blob *blob;
+
+		if (S_ISGITLINK(ce->ce_mode))
+			continue;
+
+		blob = lookup_blob(ce->sha1);
+		if (!blob)
+			die("unable to add index blob to traversal");
+		add_pending_object_with_path(revs, &blob->object, "",
+					     ce->ce_mode, ce->name);
+	}
+
+	if (active_cache_tree) {
+		struct strbuf path = STRBUF_INIT;
+		add_cache_tree(active_cache_tree, revs, &path);
+		strbuf_release(&path);
+	}
+}
+
 static int add_parents_only(struct rev_info *revs, const char *arg_, int flags)
 {
 	unsigned char sha1[20];
@@ -1653,6 +1701,7 @@ static int handle_revision_opt(struct rev_info *revs, int argc, const char **arg
 	    !strcmp(arg, "--reflog") || !strcmp(arg, "--not") ||
 	    !strcmp(arg, "--no-walk") || !strcmp(arg, "--do-walk") ||
 	    !strcmp(arg, "--bisect") || starts_with(arg, "--glob=") ||
+	    !strcmp(arg, "--indexed-objects") ||
 	    starts_with(arg, "--exclude=") ||
 	    starts_with(arg, "--branches=") || starts_with(arg, "--tags=") ||
 	    starts_with(arg, "--remotes=") || starts_with(arg, "--no-walk="))
@@ -2082,6 +2131,8 @@ static int handle_revision_pseudo_opt(const char *submodule,
 		clear_ref_exclusion(&revs->ref_excludes);
 	} else if (!strcmp(arg, "--reflog")) {
 		add_reflogs_to_pending(revs, *flags);
+	} else if (!strcmp(arg, "--indexed-objects")) {
+		add_index_objects_to_pending(revs, *flags);
 	} else if (!strcmp(arg, "--not")) {
 		*flags ^= UNINTERESTING | BOTTOM;
 	} else if (!strcmp(arg, "--no-walk")) {
diff --git a/revision.h b/revision.h
index e644044..e6dcd5d 100644
--- a/revision.h
+++ b/revision.h
@@ -277,6 +277,7 @@ extern void add_pending_sha1(struct rev_info *revs,
 
 extern void add_head_to_pending(struct rev_info *);
 extern void add_reflogs_to_pending(struct rev_info *, unsigned int flags);
+extern void add_index_objects_to_pending(struct rev_info *, unsigned int flags);
 
 enum commit_action {
 	commit_ignore,
diff --git a/t/t6000-rev-list-misc.sh b/t/t6000-rev-list-misc.sh
index 3794e4c..2602086 100755
--- a/t/t6000-rev-list-misc.sh
+++ b/t/t6000-rev-list-misc.sh
@@ -73,4 +73,27 @@ test_expect_success 'symleft flag bit is propagated down from tag' '
 	test_cmp expect actual
 '
 
+test_expect_success 'rev-list can show index objects' '
+	# Of the blobs and trees in the index, note:
+	#
+	#   - we do not show two/three, because it is the
+	#     same blob as "one", and we show objects only once
+	#
+	#   - we do show the tree "two", because it has a valid cache tree
+	#     from the last commit
+	#
+	#   - we do not show the root tree; since we updated the index, it
+	#     does not have a valid cache tree
+	#
+	cat >expect <<-\EOF
+	8e4020bb5a8d8c873b25de15933e75cc0fc275df one
+	d9d3a7417b9605cfd88ee6306b28dadc29e6ab08 only-in-index
+	9200b628cf9dc883a85a7abc8d6e6730baee589c two
+	EOF
+	echo only-in-index >only-in-index &&
+	git add only-in-index &&
+	git rev-list --objects --indexed-objects >actual &&
+	test_cmp expect actual
+'
+
 test_done
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 23/26] reachable: use revision machinery's --indexed-objects code
  2014-10-17  0:43       ` Jeff King
  2014-10-17  0:44         ` [PATCH v3 22/26] rev-list: add --indexed-objects option Jeff King
@ 2014-10-17  0:44         ` Jeff King
  2014-10-17  0:44         ` [PATCH v3 24/26] pack-objects: use argv_array Jeff King
                           ` (2 subsequent siblings)
  4 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

This does the same thing as our custom code, so let's not
repeat ourselves.

Signed-off-by: Jeff King <peff@peff.net>
---
 reachable.c | 52 +---------------------------------------------------
 1 file changed, 1 insertion(+), 51 deletions(-)

diff --git a/reachable.c b/reachable.c
index 0176a88..a647267 100644
--- a/reachable.c
+++ b/reachable.c
@@ -32,56 +32,6 @@ static int add_one_ref(const char *path, const unsigned char *sha1, int flag, vo
 	return 0;
 }
 
-static void add_one_tree(const unsigned char *sha1, struct rev_info *revs)
-{
-	struct tree *tree = lookup_tree(sha1);
-	if (tree)
-		add_pending_object(revs, &tree->object, "");
-}
-
-static void add_cache_tree(struct cache_tree *it, struct rev_info *revs)
-{
-	int i;
-
-	if (it->entry_count >= 0)
-		add_one_tree(it->sha1, revs);
-	for (i = 0; i < it->subtree_nr; i++)
-		add_cache_tree(it->down[i]->cache_tree, revs);
-}
-
-static void add_cache_refs(struct rev_info *revs)
-{
-	int i;
-
-	read_cache();
-	for (i = 0; i < active_nr; i++) {
-		struct blob *blob;
-
-		/*
-		 * The index can contain blobs and GITLINKs, GITLINKs are hashes
-		 * that don't actually point to objects in the repository, it's
-		 * almost guaranteed that they are NOT blobs, so we don't call
-		 * lookup_blob() on them, to avoid populating the hash table
-		 * with invalid information
-		 */
-		if (S_ISGITLINK(active_cache[i]->ce_mode))
-			continue;
-
-		blob = lookup_blob(active_cache[i]->sha1);
-		if (blob)
-			blob->object.flags |= SEEN;
-
-		/*
-		 * We could add the blobs to the pending list, but quite
-		 * frankly, we don't care. Once we've looked them up, and
-		 * added them as objects, we've really done everything
-		 * there is to do for a blob
-		 */
-	}
-	if (active_cache_tree)
-		add_cache_tree(active_cache_tree, revs);
-}
-
 /*
  * The traversal will have already marked us as SEEN, so we
  * only need to handle any progress reporting here.
@@ -213,7 +163,7 @@ void mark_reachable_objects(struct rev_info *revs, int mark_reflog,
 	revs->tree_objects = 1;
 
 	/* Add all refs from the index file */
-	add_cache_refs(revs);
+	add_index_objects_to_pending(revs, 0);
 
 	/* Add all external refs */
 	for_each_ref(add_one_ref, revs);
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 24/26] pack-objects: use argv_array
  2014-10-17  0:43       ` Jeff King
  2014-10-17  0:44         ` [PATCH v3 22/26] rev-list: add --indexed-objects option Jeff King
  2014-10-17  0:44         ` [PATCH v3 23/26] reachable: use revision machinery's --indexed-objects code Jeff King
@ 2014-10-17  0:44         ` Jeff King
  2014-10-17  0:44         ` [PATCH v3 25/26] repack: pack objects mentioned by the index Jeff King
  2014-10-17  0:44         ` [PATCH v3 26/26] pack-objects: double-check options before discarding objects Jeff King
  4 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

This saves us from having to bump the rp_av count when we
add new traversal options.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 4df9499..b26276b 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -22,6 +22,7 @@
 #include "pack-bitmap.h"
 #include "reachable.h"
 #include "sha1-array.h"
+#include "argv-array.h"
 
 static const char *pack_usage[] = {
 	N_("git pack-objects --stdout [options...] [< ref-list | < object-list]"),
@@ -2614,8 +2615,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	int use_internal_rev_list = 0;
 	int thin = 0;
 	int all_progress_implied = 0;
-	const char *rp_av[6];
-	int rp_ac = 0;
+	struct argv_array rp = ARGV_ARRAY_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
@@ -2705,24 +2705,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (pack_to_stdout != !base_name || argc)
 		usage_with_options(pack_usage, pack_objects_options);
 
-	rp_av[rp_ac++] = "pack-objects";
+	argv_array_push(&rp, "pack-objects");
 	if (thin) {
 		use_internal_rev_list = 1;
-		rp_av[rp_ac++] = "--objects-edge";
+		argv_array_push(&rp, "--objects-edge");
 	} else
-		rp_av[rp_ac++] = "--objects";
+		argv_array_push(&rp, "--objects");
 
 	if (rev_list_all) {
 		use_internal_rev_list = 1;
-		rp_av[rp_ac++] = "--all";
+		argv_array_push(&rp, "--all");
 	}
 	if (rev_list_reflog) {
 		use_internal_rev_list = 1;
-		rp_av[rp_ac++] = "--reflog";
+		argv_array_push(&rp, "--reflog");
 	}
 	if (rev_list_unpacked) {
 		use_internal_rev_list = 1;
-		rp_av[rp_ac++] = "--unpacked";
+		argv_array_push(&rp, "--unpacked");
 	}
 
 	if (!reuse_object)
@@ -2766,8 +2766,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	if (!use_internal_rev_list)
 		read_object_list_from_stdin();
 	else {
-		rp_av[rp_ac] = NULL;
-		get_object_list(rp_ac, rp_av);
+		get_object_list(rp.argc, rp.argv);
+		argv_array_clear(&rp);
 	}
 	cleanup_preferred_base();
 	if (include_tag && nr_result)
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 25/26] repack: pack objects mentioned by the index
  2014-10-17  0:43       ` Jeff King
                           ` (2 preceding siblings ...)
  2014-10-17  0:44         ` [PATCH v3 24/26] pack-objects: use argv_array Jeff King
@ 2014-10-17  0:44         ` Jeff King
  2014-10-17  0:44         ` [PATCH v3 26/26] pack-objects: double-check options before discarding objects Jeff King
  4 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

When we pack all objects, we use only the objects reachable
from references and reflogs. This misses any objects which
are reachable from the index, but not yet referenced.

By itself this isn't a big deal; the objects can remain
loose until they are actually used in a commit. However, it
does create a problem when we drop packed but unreachable
objects. We try to optimize out the writing of objects that
we will immediately prune, which means we must follow the
same rules as prune in determining what is reachable. And
prune uses the index for this purpose.

This is rather uncommon in practice, as objects in the index
would not usually have been packed in the first place. But
it could happen in a sequence like:

  1. You make a commit on a branch that references blob X.

  2. You repack, moving X into the pack.

  3. You delete the branch (and its reflog), so that X is
     unreferenced.

  4. You "git add" blob X so that it is now referenced only
     by the index.

  5. You repack again with git-gc. The pack-objects we
     invoke will see that X is neither referenced nor
     recent and not bother loosening it.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c               |  8 ++++++++
 builtin/repack.c                     |  1 +
 t/t7701-repack-unpack-unreachable.sh | 13 +++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b26276b..0cf95c9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2617,6 +2617,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 	int all_progress_implied = 0;
 	struct argv_array rp = ARGV_ARRAY_INIT;
 	int rev_list_unpacked = 0, rev_list_all = 0, rev_list_reflog = 0;
+	int rev_list_index = 0;
 	struct option pack_objects_options[] = {
 		OPT_SET_INT('q', "quiet", &progress,
 			    N_("do not show progress meter"), 0),
@@ -2663,6 +2664,9 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		{ OPTION_SET_INT, 0, "reflog", &rev_list_reflog, NULL,
 		  N_("include objects referred by reflog entries"),
 		  PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
+		{ OPTION_SET_INT, 0, "indexed-objects", &rev_list_index, NULL,
+		  N_("include objects referred to by the index"),
+		  PARSE_OPT_NOARG | PARSE_OPT_NONEG, NULL, 1 },
 		OPT_BOOL(0, "stdout", &pack_to_stdout,
 			 N_("output pack to stdout")),
 		OPT_BOOL(0, "include-tag", &include_tag,
@@ -2720,6 +2724,10 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 		use_internal_rev_list = 1;
 		argv_array_push(&rp, "--reflog");
 	}
+	if (rev_list_index) {
+		use_internal_rev_list = 1;
+		argv_array_push(&rp, "--indexed-objects");
+	}
 	if (rev_list_unpacked) {
 		use_internal_rev_list = 1;
 		argv_array_push(&rp, "--unpacked");
diff --git a/builtin/repack.c b/builtin/repack.c
index 2aae05d..2845620 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -209,6 +209,7 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 	argv_array_push(&cmd_args, "--non-empty");
 	argv_array_push(&cmd_args, "--all");
 	argv_array_push(&cmd_args, "--reflog");
+	argv_array_push(&cmd_args, "--indexed-objects");
 	if (window)
 		argv_array_pushf(&cmd_args, "--window=%s", window);
 	if (window_memory)
diff --git a/t/t7701-repack-unpack-unreachable.sh b/t/t7701-repack-unpack-unreachable.sh
index b8d4cde..aad8a9c 100755
--- a/t/t7701-repack-unpack-unreachable.sh
+++ b/t/t7701-repack-unpack-unreachable.sh
@@ -109,4 +109,17 @@ test_expect_success 'do not bother loosening old objects' '
 	test_must_fail git cat-file -p $obj2
 '
 
+test_expect_success 'keep packed objects found only in index' '
+	echo my-unique-content >file &&
+	git add file &&
+	git commit -m "make it reachable" &&
+	git gc &&
+	git reset HEAD^ &&
+	git reflog expire --expire=now --all &&
+	git add file &&
+	test-chmtime =-86400 .git/objects/pack/* &&
+	git gc --prune=1.hour.ago &&
+	git cat-file blob :file
+'
+
 test_done
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* [PATCH v3 26/26] pack-objects: double-check options before discarding objects
  2014-10-17  0:43       ` Jeff King
                           ` (3 preceding siblings ...)
  2014-10-17  0:44         ` [PATCH v3 25/26] repack: pack objects mentioned by the index Jeff King
@ 2014-10-17  0:44         ` Jeff King
  4 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-17  0:44 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

When we are given an expiration time like
--unpack-unreachable=2.weeks.ago, we avoid writing out old,
unreachable loose objects entirely, under the assumption
that running "prune" would simply delete them immediately
anyway. However, this is only valid if we computed the same
set of reachable objects as prune would.

In practice, this is the case, because only git-repack uses
the --unpack-unreachable option with an expiration, and it
always feeds as many objects into the pack as possible. But
we can double-check at runtime just to be sure.

Signed-off-by: Jeff King <peff@peff.net>
---
 builtin/pack-objects.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index 0cf95c9..64123d4 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -2757,6 +2757,8 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
 
 	if (keep_unreachable && unpack_unreachable)
 		die("--keep-unreachable and --unpack-unreachable are incompatible.");
+	if (!rev_list_all || !rev_list_reflog || !rev_list_index)
+		unpack_unreachable_expiration = 0;
 
 	if (!use_internal_rev_list || !pack_to_stdout || is_repository_shallow())
 		use_bitmap_index = 0;
-- 
2.1.2.596.g7379948

^ permalink raw reply related	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/25] prune-safety
       [not found]       ` <CAPc5daX0AFv9jDrFyd_OnupW5AfZW9Je_rgzaViX_xxs3SG5zg@mail.gmail.com>
@ 2014-10-17  4:49         ` Jeff King
  0 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-17  4:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Michael Haggerty, Git Mailing List

On Thu, Oct 16, 2014 at 09:13:39PM -0700, Junio C Hamano wrote:

> Just being curious, but would the same bug, if allowed to be triggered
> cutting repacking of your repository, have corrupted the resulting bitmap?

I didn't test, but yes, almost certainly. The bug was in list-objects.c,
which is used by pack-objects to generate the list of objects to pack,
as well as to build the bitmaps. So not only would it have corrupted the
bitmaps, a `git repack -ad`[1] would have dropped the object completely,
corrupting the repository!

-Peff

[1] git-gc uses `repack -Ad`, of course. So assuming you had packed more
    recently than 2 weeks ago, it would have just been ejected to a
    loose object. Small comfort. :)

-Peff

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: git-bundle rev handling and de-duping
  2014-10-17  0:33     ` git-bundle rev handling and de-duping Jeff King
@ 2014-10-17 21:03       ` Philip Oakley
  2014-10-17 22:41         ` Junio C Hamano
  0 siblings, 1 reply; 48+ messages in thread
From: Philip Oakley @ 2014-10-17 21:03 UTC (permalink / raw)
  To: Jeff King, Junio C Hamano; +Cc: Git List, Michael Haggerty

From: "Jeff King" <peff@peff.net>
> [subject tweaked as we have veered quite far off the original, and
> this might get more attention from interested people]
>
> On Thu, Oct 16, 2014 at 10:39:54AM -0700, Junio C Hamano wrote:
>
>>  2.  We use object_array_remove_duplicates() to de-dup "git bundle
>>      create x master master", which came from b2a6d1c6 (bundle:
>>      allow the same ref to be given more than once, 2009-01-17),
>>      which is still the sole caller of the function, and I think
>>      this is bogus.  Comparing .name would not de-dup "git bundle
>>      create x master refs/heads/master".
>
> Hmm. We also doesn't respect the UNINTERESTING flag, so it will dedup
> "foo" and "^foo" into a single entry. I _think_ this doesn't have any
> bad side effects, because they are equivalent (i.e., the flag is
> carried
> on the object, not the pending entry). I would expect this:
>
>  git bundle create ... foo ^foo
>
> to barf (and it does) because the bundle is empty. But with this:
>
>  git bundle create ... another-ref foo ^foo
>
> I would expect the resulting bundle to contain the objects from
> another-ref, but still mention "foo" as an update (we didn't need to
> send any objects, since presumably the other side already had them).
> It
> doesn't, but that is not because of the dedup. It is because we
> generate
> the list of refs to send based on the pending list, and we skip all
> UNINTERESTING objects (we must, because otherwise "^foo" would make an
> entry). I.e., it is the same "the flag is on the object, not the
> pending
> entry" that saved us above now causing its own bug. Obviously the
> example above is sort of silly, but if you have:
>
>  # two branches point at the same object
>  git branch foo master
>
>  # the other side already has master. Let's send them foo.
>  # this will fail because the bundle is empty. That's mildly
>  # annoying because we really want to tell them "hey, update
>  # your foo branch". But at least we get an error.
>  git bundle create tmp.bundle foo ^master

Isn't this kindof what happens as an issue when we want the right HEAD 
to be included explicitly in a bundle. Though 
ttp://thread.gmane.org/gmane.comp.version-control.git/234053 suggests 
its more complicated than that.

>
>  # now the same thing, but we're sending them some other objects.
>  # This one succeeds, but silently omits foo from the bundle!
>  git bundle create tmp.bundle foo another-ref ^master
>
> I have a feeling that the list needs to be generated from
> revs.cmdline,
> not revs.pending, and the de-duplication needs to happen there (with
> the
> ref resolution that you mention).
>
> I also have the feeling that fast-export had to deal with this exact
> same issue. It looks like we use revs.cmdline there. I seem to recall
> there was some ongoing work in that area, but I stopped paying close
> attention due to some personality conflicts, and I don't know if all
> of
> the issues were ever resolved.
>
>> I think the right way to fix these two and a half problems is to do
>> the following:
>>
>>  - object_array_remove_duplicates() (and contains_name() helper it
>>    uses) should be removed from object.c;
>>
>>  - create_bundle() in bundle.c should implement a helper that is
>>    similar to contains_name() but knows about ref dwimming and use
>>    it to call object_array_filter() to replace its call to
>>    object_array_remove_duplicates().
>
> Agreed. The loop in create_bundle right after the de-dup does the
> dwimming. Probably it would be simple to just skip duplicates there
> using a hashmap or sorted list.
>
>> I am not doing this myself, and I do not expect either you or
>> Michael to do so, either.  I am just writing this down to point out
>> a low hanging fruit to aspiring new contributors (hint, hint).
>
> I am also not planning on working on it soon, but now we have
> hopefully
> fed plenty of possibilities to anybody who wants to. :)
>
--
Philip 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: git-bundle rev handling and de-duping
  2014-10-17 21:03       ` Philip Oakley
@ 2014-10-17 22:41         ` Junio C Hamano
  0 siblings, 0 replies; 48+ messages in thread
From: Junio C Hamano @ 2014-10-17 22:41 UTC (permalink / raw)
  To: Philip Oakley; +Cc: Jeff King, Git List, Michael Haggerty

"Philip Oakley" <philipoakley@iee.org> writes:

>>  # two branches point at the same object
>>  git branch foo master
>>
>>  # the other side already has master. Let's send them foo.
>>  # this will fail because the bundle is empty. That's mildly
>>  # annoying because we really want to tell them "hey, update
>>  # your foo branch". But at least we get an error.
>>  git bundle create tmp.bundle foo ^master
>
> Isn't this kindof what happens as an issue when we want the right HEAD
> to be included explicitly in a bundle. Though


What we are discussing here is "we tell from the command line where
the histories end, but do we correctly record all these end points
as fetchable refs in the resulting bundle?"

It does not have anything to do with "bundle that does not record
its HEAD cannot be cloned", which happens when you do not mention
HEAD when creating the bundle in the first place, which is a totally
different thing.

> http://thread.gmane.org/gmane.comp.version-control.git/234053 suggests
> its more complicated than that.

The main topic of discussion does not have much to what bundle
records and what a reader of a bundle guesses.  It is about what
goes on the wire and mention of bundle was just a tangent brought up
by those who do not know what was being discussed, I think.

I think the right fix to the "git bundle" issue is to make it easier
on the "git bundle create" side to have the resulting bundle record
its HEAD, even when the user did not mention HEAD on the command
line.  For example, when there is only one end point, e.g. "git
bundle create x next", record refs/heads/next _and_ HEAD pointing at
the same commit, because there is no other seneible choice.  

"git bundle create y master next" may record master, next and HEAD
while HEAD is likely pointing at the same commit as master (because
'master' is special).  Or we could give a warning and even go
interactive to ask which ref to record as HEAD.

But the above three paragraphs are tangent so I'd stop here.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [PATCH v2 0/25] prune-safety
  2014-10-16 21:39     ` Jeff King
  2014-10-16 22:18       ` Junio C Hamano
       [not found]       ` <CAPc5daX0AFv9jDrFyd_OnupW5AfZW9Je_rgzaViX_xxs3SG5zg@mail.gmail.com>
@ 2014-10-18 12:31       ` Jeff King
  2 siblings, 0 replies; 48+ messages in thread
From: Jeff King @ 2014-10-18 12:31 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, Michael Haggerty

On Thu, Oct 16, 2014 at 05:39:18PM -0400, Jeff King wrote:

> @@ -301,6 +297,14 @@ static struct commit *handle_commit(struct rev_info *revs,
>  			die("bad object %s", sha1_to_hex(tag->tagged->sha1));
>  		}
>  		object->flags |= flags;
> +		/*
> +		 * We'll handle the tagged object by looping or dropping
> +		 * through to the non-tag handlers below. Do not
> +		 * propagate data from the tag's pending entry.
> +		 */
> +		name = NULL;
> +		path = NULL;
> +		mode = 0;

Hmm. On second thought (and after seeing a warning from Coverity), this
should be:

diff --git a/revision.c b/revision.c
index 8030fc8..ebe3e93 100644
--- a/revision.c
+++ b/revision.c
@@ -302,7 +302,7 @@ static struct commit *handle_commit(struct rev_info *revs,
 		 * through to the non-tag handlers below. Do not
 		 * propagate data from the tag's pending entry.
 		 */
-		name = NULL;
+		name = "";
 		path = NULL;
 		mode = 0;
 	}

The rest of the function assumes that name is not NULL (which I'm not
sure is entirely safe, as add_pending_object can take a NULL; presumably
every "add" uses the empty string instead of NULL. But either way,
setting it to NULL here is definite wrong).

The "path" field is explicitly OK to be NULL.

-Peff

^ permalink raw reply related	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2014-10-18 12:31 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-10-15 22:32 [PATCH v2 0/25] prune-safety Jeff King
2014-10-15 22:33 ` [PATCH v2 01/25] foreach_alt_odb: propagate return value from callback Jeff King
2014-10-15 22:34 ` [PATCH v2 02/25] isxdigit: cast input to unsigned char Jeff King
2014-10-16 17:16   ` Junio C Hamano
2014-10-15 22:34 ` [PATCH v2 03/25] object_array: factor out slopbuf-freeing logic Jeff King
2014-10-16 17:39   ` Junio C Hamano
2014-10-17  0:33     ` git-bundle rev handling and de-duping Jeff King
2014-10-17 21:03       ` Philip Oakley
2014-10-17 22:41         ` Junio C Hamano
2014-10-15 22:34 ` [PATCH v2 04/25] object_array: add a "clear" function Jeff King
2014-10-15 22:35 ` [PATCH v2 05/25] clean up name allocation in prepare_revision_walk Jeff King
2014-10-15 22:37 ` [PATCH v2 06/25] reachable: use traverse_commit_list instead of custom walk Jeff King
2014-10-16 17:53   ` Junio C Hamano
2014-10-15 22:38 ` [PATCH v2 07/25] reachable: reuse revision.c "add all reflogs" code Jeff King
2014-10-15 22:38 ` [PATCH v2 08/25] prune: factor out loose-object directory traversal Jeff King
2014-10-15 22:40 ` [PATCH v2 09/25] reachable: mark index blobs as SEEN Jeff King
2014-10-15 22:40 ` [PATCH v2 10/25] prune-packed: use for_each_loose_file_in_objdir Jeff King
2014-10-15 22:40 ` [PATCH v2 11/25] count-objects: do not use xsize_t when counting object size Jeff King
2014-10-15 22:41 ` [PATCH v2 12/25] count-objects: use for_each_loose_file_in_objdir Jeff King
2014-10-15 22:41 ` [PATCH v2 13/25] sha1_file: add for_each iterators for loose and packed objects Jeff King
2014-10-15 22:41 ` [PATCH v2 14/25] prune: keep objects reachable from recent objects Jeff King
2014-10-15 22:41 ` [PATCH v2 15/25] pack-objects: refactor unpack-unreachable expiration check Jeff King
2014-10-15 22:42 ` [PATCH v2 16/25] pack-objects: match prune logic for discarding objects Jeff King
2014-10-15 22:42 ` [PATCH v2 17/25] write_sha1_file: freshen existing objects Jeff King
2014-10-15 22:42 ` [PATCH v2 18/25] make add_object_array_with_context interface more sane Jeff King
2014-10-15 22:43 ` [PATCH v2 19/25] traverse_commit_list: support pending blobs/trees with paths Jeff King
2014-10-15 22:43 ` [PATCH v2 20/25] rev-list: document --reflog option Jeff King
2014-10-15 22:44 ` [PATCH v2 21/25] rev-list: add --index-objects option Jeff King
2014-10-16 18:41   ` Junio C Hamano
2014-10-17  0:12     ` Jeff King
2014-10-17  0:43       ` Jeff King
2014-10-17  0:44         ` [PATCH v3 22/26] rev-list: add --indexed-objects option Jeff King
2014-10-17  0:44         ` [PATCH v3 23/26] reachable: use revision machinery's --indexed-objects code Jeff King
2014-10-17  0:44         ` [PATCH v3 24/26] pack-objects: use argv_array Jeff King
2014-10-17  0:44         ` [PATCH v3 25/26] repack: pack objects mentioned by the index Jeff King
2014-10-17  0:44         ` [PATCH v3 26/26] pack-objects: double-check options before discarding objects Jeff King
2014-10-15 22:44 ` [PATCH v2 22/25] reachable: use revision machinery's --index-objects code Jeff King
2014-10-15 22:45 ` [PATCH v2 23/25] pack-objects: use argv_array Jeff King
2014-10-15 22:46 ` [PATCH v2 24/25] repack: pack objects mentioned by the index Jeff King
2014-10-15 22:48 ` [PATCH v2 25/25] pack-objects: double-check options before discarding objects Jeff King
2014-10-16 21:07 ` [PATCH v2 0/25] prune-safety Junio C Hamano
2014-10-16 21:10   ` Junio C Hamano
2014-10-16 21:21   ` Jeff King
2014-10-16 21:39     ` Jeff King
2014-10-16 22:18       ` Junio C Hamano
2014-10-17  0:03         ` Jeff King
     [not found]       ` <CAPc5daX0AFv9jDrFyd_OnupW5AfZW9Je_rgzaViX_xxs3SG5zg@mail.gmail.com>
2014-10-17  4:49         ` Jeff King
2014-10-18 12:31       ` Jeff King

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.