git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Question: .idx without .pack causes performance issues?
@ 2015-07-21 18:41 Doug Kelly
  2015-07-21 18:57 ` Junio C Hamano
  0 siblings, 1 reply; 34+ messages in thread
From: Doug Kelly @ 2015-07-21 18:41 UTC (permalink / raw)
  To: Git List

Hi all,

I just wanted to relay an issue we've seen before at my day job (and
it just recently cropped up again).  When moving users from Git for
Windows 1.8.3 to 1.9.5, we found a few users started having operations
take an excruciatingly long amount of time.  At some point, we traced
the issue to a number of .pack files had been deleted (possibly
garbage collected?) -- but their associated .idx files were still
present.  Upon removing the "orphaned" idx files, we found performance
returned to normal.  Otherwise, git fsck reported no issues with the
repositories.

Other users have noted that using git gc would sometimes correct the
issue for them, but not always.

Anyway, has anyone else experienced this performance degradation? I
have some feeling that it's an issue that may be exclusive to Windows
(or at least, only slow enough to matter on Windows), but I have no
proof, and I've never heard of an issue like this outside work. (One
idea that came to mind was even the .idx files were locked, and thus
not deleted.)  Something tells me deleting the orphaned .idx files
isn't the "nicest" solution, either.

Thanks!

--Doug

P.S. In addition to running the Git for Windows/msysgit builds, we
have a handful of users running Git Extensions as well, and also have
been seeing an increase in use of Visual Studio 2013 -- which of
course has libgit2 integrated. So, I think the chance that any one of
these might be using the repo or holding files open is very high.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-07-21 18:41 Question: .idx without .pack causes performance issues? Doug Kelly
@ 2015-07-21 18:57 ` Junio C Hamano
  2015-07-21 19:15   ` Junio C Hamano
  2015-07-21 19:49   ` Doug Kelly
  0 siblings, 2 replies; 34+ messages in thread
From: Junio C Hamano @ 2015-07-21 18:57 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Git List

Doug Kelly <dougk.ff7@gmail.com> writes:

> I just wanted to relay an issue we've seen before at my day job (and
> it just recently cropped up again).  When moving users from Git for
> Windows 1.8.3 to 1.9.5, we found a few users started having operations
> take an excruciatingly long amount of time.  At some point, we traced
> the issue to a number of .pack files had been deleted (possibly
> garbage collected?) -- but their associated .idx files were still
> present.  Upon removing the "orphaned" idx files, we found performance
> returned to normal.  Otherwise, git fsck reported no issues with the
> repositories.
>
> Other users have noted that using git gc would sometimes correct the
> issue for them, but not always.
>
> Anyway, has anyone else experienced this performance degradation?

I wouldn't be surprised if such a configuration to have leftover
".idx" files that lack ".pack" affected performance, but I think you
really have to work on getting into such a situation (unless your
operating system is very cooperative and tries hard to corrupt your
repository, that is ;-), so I wouldn't be surprised if you were the
first one to report this.

We open the ".idx" file and try to keep as many of them in-core,
without opening corresponding ".pack" until the data is needed. 

When we need an object, we learn from an ".idx" file that a
particular pack ought to have a copy of it, and then attempt to open
the corresponding ".pack" file.  If this fails, we do protect
ourselves from strange repositories with only ".idx" files by not
using that ".idx" and try to see if the sought-after object exists
elsewhere (and if there isn't we say "no such object", which is also
a correct thing to do).

I however do not think that we mark the in-core structure that
corresponds to an open ".idx" file in any way when such a failure
happens.  If we really cared enough, we could do so, saying "we know
there is .idx file, but do not bother looking at it again, as we
know the corresponding .pack is missing", and that would speed things
up a bit, essentially bringing us back to a sane situation without
any ".idx" without corresponding ".pack".

I do not think it is worth the effort, though.  It would be more
fruitful to find out how you end up with ".idx exists but not
corresponding .pack" and if that is some systemic failure, see if
there is a way to prevent that from happening in the first place.

Also, I think it may not be a bad idea to teach "gc" to remove stale
".idx" files that do not have corresponding ".pack" as garbage.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-07-21 18:57 ` Junio C Hamano
@ 2015-07-21 19:15   ` Junio C Hamano
  2015-07-21 20:48     ` Junio C Hamano
       [not found]     ` <CABYiQpn7r2Vcf=S5RaWHBN85eBYGPV_e02+BY=4L98qfUzDT1Q@mail.gmail.com>
  2015-07-21 19:49   ` Doug Kelly
  1 sibling, 2 replies; 34+ messages in thread
From: Junio C Hamano @ 2015-07-21 19:15 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Git List

Junio C Hamano <gitster@pobox.com> writes:

> I however do not think that we mark the in-core structure that
> corresponds to an open ".idx" file in any way when such a failure
> happens.  If we really cared enough, we could do so, saying "we know
> there is .idx file, but do not bother looking at it again, as we
> know the corresponding .pack is missing", and that would speed things
> up a bit, essentially bringing us back to a sane situation without
> any ".idx" without corresponding ".pack".
>
> I do not think it is worth the effort, though.  It would be more
> fruitful to find out how you end up with ".idx exists but not
> corresponding .pack" and if that is some systemic failure, see if
> there is a way to prevent that from happening in the first place.

While I still think that it is more important to prevent such a
situation from occurring in the first place, ignoring .idx that lack
corresponding .pack should be fairly simple, perhaps like this.

Note that if we wanted to do this for real, I think such an ".idx"
file should also be added to the "garbage" list in the loop in which
the second hunk of the following patch appears.

 sha1_file.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/sha1_file.c b/sha1_file.c
index 1cee438..b69298e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1240,6 +1240,19 @@ static void report_pack_garbage(struct string_list *list)
 	report_helper(list, seen_bits, first, list->nr);
 }
 
+static int packfile_exists(const char *base, size_t base_len)
+{
+	struct strbuf path = STRBUF_INIT;
+	int status;
+
+	strbuf_add(&path, base, base_len);
+	strbuf_addstr(&path, ".pack");
+	status = file_exists(path.buf);
+
+	strbuf_release(&path);
+	return status;
+}
+
 static void prepare_packed_git_one(char *objdir, int local)
 {
 	struct strbuf path = STRBUF_INIT;
@@ -1281,6 +1294,7 @@ static void prepare_packed_git_one(char *objdir, int local)
 					break;
 			}
 			if (p == NULL &&
+			    packfile_exists(path.buf, base_len) &&
 			    /*
 			     * See if it really is a valid .idx file with
 			     * corresponding .pack file that we can map.

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-07-21 18:57 ` Junio C Hamano
  2015-07-21 19:15   ` Junio C Hamano
@ 2015-07-21 19:49   ` Doug Kelly
  1 sibling, 0 replies; 34+ messages in thread
From: Doug Kelly @ 2015-07-21 19:49 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List

On Tue, Jul 21, 2015 at 1:57 PM, Junio C Hamano <gitster@pobox.com> wrote:
> I wouldn't be surprised if such a configuration to have leftover
> ".idx" files that lack ".pack" affected performance, but I think you
> really have to work on getting into such a situation (unless your
> operating system is very cooperative and tries hard to corrupt your
> repository, that is ;-), so I wouldn't be surprised if you were the
> first one to report this.

I'm inclined to believe Windows isn't helping this situation: seems
like something it might do, especially because of how it behaves if
one process has a file open. Since I haven't caught a case where these
files show up, maybe adding some tweaks to look for it occurring (such
as on our Jenkins workers, if it's happening there now) would give us
a better indication of the "why" question.  It could even be that it
has occurred long ago, and the performance issue is just now observed:
our environment has run 1.7.4, 1.8.3, and now 1.9.5 -- so even an
unknown bug in a previous version could impact us now.

>
> We open the ".idx" file and try to keep as many of them in-core,
> without opening corresponding ".pack" until the data is needed.
>
> When we need an object, we learn from an ".idx" file that a
> particular pack ought to have a copy of it, and then attempt to open
> the corresponding ".pack" file.  If this fails, we do protect
> ourselves from strange repositories with only ".idx" files by not
> using that ".idx" and try to see if the sought-after object exists
> elsewhere (and if there isn't we say "no such object", which is also
> a correct thing to do).
>
> I however do not think that we mark the in-core structure that
> corresponds to an open ".idx" file in any way when such a failure
> happens.  If we really cared enough, we could do so, saying "we know
> there is .idx file, but do not bother looking at it again, as we
> know the corresponding .pack is missing", and that would speed things
> up a bit, essentially bringing us back to a sane situation without
> any ".idx" without corresponding ".pack".

I think this is where the performance hit occurs on Windows: file stat
operations in general are pretty slow, and I know msysgit did some
things to emulate as much of the POSIX API as possible -- which isn't
always easy on Windows.  But, some of the developers that know
compat/win32/ better would know more (I recall the dirent stuff being
pretty complicated, but open/fopen seem rather straightforward).  And
yes -- retrying the operation each time and failing only compounds the
issue.

>
> I do not think it is worth the effort, though.  It would be more
> fruitful to find out how you end up with ".idx exists but not
> corresponding .pack" and if that is some systemic failure, see if
> there is a way to prevent that from happening in the first place.

Agreed.  It feels like a workaround for a case where you're already in
a bad state...

>
> Also, I think it may not be a bad idea to teach "gc" to remove stale
> ".idx" files that do not have corresponding ".pack" as garbage.

I agree.  This seems like a more correct solution -- if gc understands
to clean up these bad .idx files, it would then be a non-issue when
searching the packs.  The solution you posted to check if an
associated packfile exists -- while perhaps not belonging there --
could still be useful to delete orphanend .idx files.

I think you're correct, though -- if you did propose the solution to
sha1_file.c, it would be necessary to prevent scanning that .idx
again, or else any potential gains would be lost continually
stat()'ing the file.  Now, msysgit does have core.fscache to try
caching the stat()/lstat() results to lessen the impact, but this
isn't on by default, I believe.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-07-21 19:15   ` Junio C Hamano
@ 2015-07-21 20:48     ` Junio C Hamano
  2015-07-21 21:37       ` Doug Kelly
       [not found]     ` <CABYiQpn7r2Vcf=S5RaWHBN85eBYGPV_e02+BY=4L98qfUzDT1Q@mail.gmail.com>
  1 sibling, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2015-07-21 20:48 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Git List

Junio C Hamano <gitster@pobox.com> writes:

> While I still think that it is more important to prevent such a
> situation from occurring in the first place, ignoring .idx that lack
> corresponding .pack should be fairly simple, perhaps like this.
> ...

Sorry for the noise, but this patch is worthless.  We already have
an equivalent test in add_packed_git() that is called from this same
place.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-07-21 20:48     ` Junio C Hamano
@ 2015-07-21 21:37       ` Doug Kelly
  2015-08-03 22:17         ` Doug Kelly
  0 siblings, 1 reply; 34+ messages in thread
From: Doug Kelly @ 2015-07-21 21:37 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List

On Tue, Jul 21, 2015 at 3:48 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> While I still think that it is more important to prevent such a
>> situation from occurring in the first place, ignoring .idx that lack
>> corresponding .pack should be fairly simple, perhaps like this.
>> ...
>
> Sorry for the noise, but this patch is worthless.  We already have
> an equivalent test in add_packed_git() that is called from this same
> place.

And a few extra updates from me: we found that this appears to occur
even after update to 1.9.5, and setting core.fscache on 2.4.6 has no
appreciable impact on the time it takes to run "git fetch", either.
Our thought was antivirus (or something else?) might have the file
open when git attempts to unlink the .idx, but perhaps it's something
else, too?  In one case, we had ~560 orphaned .idx files, but 150
seems sufficient to slow a fetch operation for a few minutes until it
actually begins transferring objects.

The "git gc" approach to cleaning up the mess is certainly looking
more and more attractive... :)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-07-21 21:37       ` Doug Kelly
@ 2015-08-03 22:17         ` Doug Kelly
  2015-08-04  1:27           ` Junio C Hamano
  0 siblings, 1 reply; 34+ messages in thread
From: Doug Kelly @ 2015-08-03 22:17 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List

On Tue, Jul 21, 2015 at 4:37 PM, Doug Kelly <dougk.ff7@gmail.com> wrote:
> On Tue, Jul 21, 2015 at 3:48 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>>> While I still think that it is more important to prevent such a
>>> situation from occurring in the first place, ignoring .idx that lack
>>> corresponding .pack should be fairly simple, perhaps like this.
>>> ...
>>
>> Sorry for the noise, but this patch is worthless.  We already have
>> an equivalent test in add_packed_git() that is called from this same
>> place.
>
> And a few extra updates from me: we found that this appears to occur
> even after update to 1.9.5, and setting core.fscache on 2.4.6 has no
> appreciable impact on the time it takes to run "git fetch", either.
> Our thought was antivirus (or something else?) might have the file
> open when git attempts to unlink the .idx, but perhaps it's something
> else, too?  In one case, we had ~560 orphaned .idx files, but 150
> seems sufficient to slow a fetch operation for a few minutes until it
> actually begins transferring objects.
>
> The "git gc" approach to cleaning up the mess is certainly looking
> more and more attractive... :)

Here's a change to prune.c that at least addresses the issue by removing
.idx files without an associated pack, but it's by no means pretty.  If anyone
has any feedback before I turn this into a formal patch, it's more than welcome!

diff --git a/builtin/prune.c b/builtin/prune.c
index 10b03d3..8a60282 100644
--- a/builtin/prune.c
+++ b/builtin/prune.c
@@ -1,6 +1,7 @@
 #include "cache.h"
 #include "commit.h"
 #include "diff.h"
+#include "dir.h"
 #include "revision.h"
 #include "builtin.h"
 #include "reachable.h"
@@ -85,15 +86,31 @@ static void remove_temporary_files(const char *path)
 {
        DIR *dir;
        struct dirent *de;
+       struct strbuf idx, pack;

        dir = opendir(path);
        if (!dir) {
                fprintf(stderr, "Unable to open directory %s\n", path);
                return;
        }
-       while ((de = readdir(dir)) != NULL)
+       while ((de = readdir(dir)) != NULL) {
                if (starts_with(de->d_name, "tmp_"))
                        prune_tmp_file(mkpath("%s/%s", path, de->d_name));
+               if (ends_with(de->d_name, ".idx")) {
+                       strbuf_init(&idx, 0);
+                       strbuf_init(&pack, 0);
+                       strbuf_addstr(&idx, de->d_name);
+                       strbuf_addbuf(&pack, &idx);
+                       if (strbuf_strip_suffix(&pack, ".idx")) {
+                               strbuf_addstr(&pack, ".pack");
+                               if (!file_exists(mkpath("%s/%s", path,
pack.buf)))
+                                       prune_tmp_file(mkpath("%s/%s",
path, idx.buf));
+                       }
+                       strbuf_release(&idx);
+                       strbuf_release(&pack);
+               }
+
+       }
        closedir(dir);
 }

--

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-08-03 22:17         ` Doug Kelly
@ 2015-08-04  1:27           ` Junio C Hamano
  2015-08-07 21:36             ` Doug Kelly
  0 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2015-08-04  1:27 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Git List

Doug Kelly <dougk.ff7@gmail.com> writes:

> Here's a change to prune.c that at least addresses the issue by removing
> .idx files without an associated pack, but it's by no means pretty.  If anyone
> has any feedback before I turn this into a formal patch, it's more than welcome!

I'd hesitate to see removal of a file (for that matter, a creation
too) inside a "while (de = readdir)" loop.  As the original function
is about temporary files, and the new thing is not about temporary
files at all, I'd further prefer that we do not do it in the same
loop.

I am wondering if we can add a new mode to report_pack_garbage() in
sha1_file.c to allow it to remove stale and lone ".idx".  Most of
the time we are accessing packs read-only, and I do not want the
function to unconditionally remove lone ".idx", but perhaps we
can teach "prune" to set a custom report_garbage() routine and
react to a call to its custom report_garbage()?

Perhaps that custom report_garbage() can make a list of ".idx"
files, iterate over it to pick the lone one without ".pack" and
remove them.  Or the custom report_garbage() can make a list of lone
".idx" files, if you tweak the interface to report_garbage() to
contain th seen_bits value, avoiding the need to check the existence
of ".pack" for the second time.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-08-04  1:27           ` Junio C Hamano
@ 2015-08-07 21:36             ` Doug Kelly
  2015-08-07 22:27               ` Junio C Hamano
  0 siblings, 1 reply; 34+ messages in thread
From: Doug Kelly @ 2015-08-07 21:36 UTC (permalink / raw)
  Cc: Git List

On Mon, Aug 3, 2015 at 8:27 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Doug Kelly <dougk.ff7@gmail.com> writes:
>
>> Here's a change to prune.c that at least addresses the issue by removing
>> .idx files without an associated pack, but it's by no means pretty.  If anyone
>> has any feedback before I turn this into a formal patch, it's more than welcome!
>
> I'd hesitate to see removal of a file (for that matter, a creation
> too) inside a "while (de = readdir)" loop.  As the original function
> is about temporary files, and the new thing is not about temporary
> files at all, I'd further prefer that we do not do it in the same
> loop.
>
> I am wondering if we can add a new mode to report_pack_garbage() in
> sha1_file.c to allow it to remove stale and lone ".idx".  Most of
> the time we are accessing packs read-only, and I do not want the
> function to unconditionally remove lone ".idx", but perhaps we
> can teach "prune" to set a custom report_garbage() routine and
> react to a call to its custom report_garbage()?
>
> Perhaps that custom report_garbage() can make a list of ".idx"
> files, iterate over it to pick the lone one without ".pack" and
> remove them.  Or the custom report_garbage() can make a list of lone
> ".idx" files, if you tweak the interface to report_garbage() to
> contain th seen_bits value, avoiding the need to check the existence
> of ".pack" for the second time.

Yeah, I didn't think this was the cleanest solution, and I wasn't even
thinking about removing while inside the readdir loop, but I can see
how that might be a very bad idea.  In any case, thanks for the
suggestions... I'll be completely blunt in saying I'm far less than
well-versed in the Git internals.  Looking at the implementation of
report_pack_garbage(), it does look like seen_bits already has this
logic, and indeed, git count-objects -v reports the files as garbage.

So, I think you're right: prune would need to set report_garbage
appropriately, then call count-objects to clean that up.  If we wanted
it to *only* care for lone idx files, we would have to string match on
the message (seems fragile), but perhaps a more observant approach
would be to add a custom flag to prune to clean *all* garbage in the
repository, as passed to report_garbage?  Probably wouldn't want to be
enabled by default, but only on invocation or with careful
consideration and setting an appropriate config flag.

Thoughts?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: Question: .idx without .pack causes performance issues?
  2015-08-07 21:36             ` Doug Kelly
@ 2015-08-07 22:27               ` Junio C Hamano
  2015-08-13 18:02                 ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Doug Kelly
  0 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2015-08-07 22:27 UTC (permalink / raw)
  To: Doug Kelly

Doug Kelly <dougk.ff7@gmail.com> writes:

> So, I think you're right: prune would need to set report_garbage
> appropriately, then call count-objects to clean that up.  If we wanted
> it to *only* care for lone idx files, we would have to string match on
> the message (seems fragile), but perhaps a more observant approach
> would be to add a custom flag to prune to clean *all* garbage in the
> repository, as passed to report_garbage?  Probably wouldn't want to be
> enabled by default, but only on invocation or with careful
> consideration and setting an appropriate config flag.

I was thinking along this line.

Then you would set "report_garbage" to your own function, call
prepare_packed_git(), and in your report-garbagte function, collect
paths with seen_bits set exactly to PACKDIR_FILE_IDX.  By the time
prepare_packed_git() returns, you would have a list of paths only
with .idx but without .pack, which you can prune.

We can later start pruning other garbage, but one step at a time.

-- >8 --
Subject: prepare_packed_git(): refactor garbage reporting in pack directory

The hook to report "garbage" files in $GIT_OBJECT_DIRECTORY/pack/
could be generic but is too specific to count-object's needs.

Move the part to produce human-readable messages to count-objects,
and refine the interface to callback with the "bits" with values
defined in the cache.h header file, so that other callers (e.g.
prune) can later use the same mechanism to enumerate different
kinds of garbage files and do something intelligent about them,
other than reporting in textual messages.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/count-objects.c | 26 ++++++++++++++++++++++++--
 cache.h                 |  7 +++++--
 path.c                  |  2 +-
 sha1_file.c             | 23 ++++++-----------------
 4 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index ad0c799..4c3198e 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -15,9 +15,31 @@ static int verbose;
 static unsigned long loose, packed, packed_loose;
 static off_t loose_size;
 
-static void real_report_garbage(const char *desc, const char *path)
+const char *bits_to_msg(unsigned seen_bits)
+{
+	switch (seen_bits) {
+	case 0:
+		return "no corresponding .idx or .pack";
+	case PACKDIR_FILE_GARBAGE:
+		return "garbage found";
+	case PACKDIR_FILE_PACK:
+		return "no corresponding .idx";
+	case PACKDIR_FILE_IDX:
+		return "no corresponding .pack";
+	case PACKDIR_FILE_PACK|PACKDIR_FILE_IDX:
+	default:
+		return NULL;
+	}
+}
+
+static void real_report_garbage(unsigned seen_bits, const char *path)
 {
 	struct stat st;
+	const char *desc = bits_to_msg(seen_bits);
+
+	if (!desc)
+		return;
+
 	if (!stat(path, &st))
 		size_garbage += st.st_size;
 	warning("%s: %s", desc, path);
@@ -27,7 +49,7 @@ static void real_report_garbage(const char *desc, const char *path)
 static void loose_garbage(const char *path)
 {
 	if (verbose)
-		report_garbage("garbage found", path);
+		report_garbage(PACKDIR_FILE_GARBAGE, path);
 }
 
 static int count_loose(const unsigned char *sha1, const char *path, void *data)
diff --git a/cache.h b/cache.h
index 6bb7119..2d4dedc 100644
--- a/cache.h
+++ b/cache.h
@@ -1212,8 +1212,11 @@ struct pack_entry {
 
 extern struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
-/* A hook for count-objects to report invalid files in pack directory */
-extern void (*report_garbage)(const char *desc, const char *path);
+/* A hook to report invalid files in pack directory */
+#define PACKDIR_FILE_PACK 1
+#define PACKDIR_FILE_IDX 2
+#define PACKDIR_FILE_GARBAGE 4
+extern void (*report_garbage)(unsigned seen_bits, const char *path);
 
 extern void prepare_packed_git(void);
 extern void reprepare_packed_git(void);
diff --git a/path.c b/path.c
index 10f4cbf..75ec236 100644
--- a/path.c
+++ b/path.c
@@ -143,7 +143,7 @@ void report_linked_checkout_garbage(void)
 		strbuf_setlen(&sb, len);
 		strbuf_addstr(&sb, path);
 		if (file_exists(sb.buf))
-			report_garbage("unused in linked checkout", sb.buf);
+			report_garbage(PACKDIR_FILE_GARBAGE, sb.buf);
 	}
 	strbuf_release(&sb);
 }
diff --git a/sha1_file.c b/sha1_file.c
index 1cee438..0c0b652 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1183,27 +1183,16 @@ void install_packed_git(struct packed_git *pack)
 	packed_git = pack;
 }
 
-void (*report_garbage)(const char *desc, const char *path);
+void (*report_garbage)(unsigned seen_bits, const char *path);
 
 static void report_helper(const struct string_list *list,
 			  int seen_bits, int first, int last)
 {
-	const char *msg;
-	switch (seen_bits) {
-	case 0:
-		msg = "no corresponding .idx or .pack";
-		break;
-	case 1:
-		msg = "no corresponding .idx";
-		break;
-	case 2:
-		msg = "no corresponding .pack";
-		break;
-	default:
+	if (seen_bits == (PACKDIR_FILE_PACK|PACKDIR_FILE_IDX))
 		return;
-	}
+
 	for (; first < last; first++)
-		report_garbage(msg, list->items[first].string);
+		report_garbage(seen_bits, list->items[first].string);
 }
 
 static void report_pack_garbage(struct string_list *list)
@@ -1226,7 +1215,7 @@ static void report_pack_garbage(struct string_list *list)
 		if (baselen == -1) {
 			const char *dot = strrchr(path, '.');
 			if (!dot) {
-				report_garbage("garbage found", path);
+				report_garbage(PACKDIR_FILE_GARBAGE, path);
 				continue;
 			}
 			baselen = dot - path + 1;
@@ -1298,7 +1287,7 @@ static void prepare_packed_git_one(char *objdir, int local)
 		    ends_with(de->d_name, ".keep"))
 			string_list_append(&garbage, path.buf);
 		else
-			report_garbage("garbage found", path.buf);
+			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
 	}
 	closedir(dir);
 	report_pack_garbage(&garbage);

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-08-07 22:27               ` Junio C Hamano
@ 2015-08-13 18:02                 ` Doug Kelly
  2015-08-13 18:02                   ` [PATCH 2/2] gc: Remove garbage .idx files from pack dir Doug Kelly
  2015-08-13 18:46                   ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Eric Sunshine
  0 siblings, 2 replies; 34+ messages in thread
From: Doug Kelly @ 2015-08-13 18:02 UTC (permalink / raw)
  To: git; +Cc: peff, Junio C Hamano

From: Junio C Hamano <gitster@pobox.com>

The hook to report "garbage" files in $GIT_OBJECT_DIRECTORY/pack/
could be generic but is too specific to count-object's needs.

Move the part to produce human-readable messages to count-objects,
and refine the interface to callback with the "bits" with values
defined in the cache.h header file, so that other callers (e.g.
prune) can later use the same mechanism to enumerate different
kinds of garbage files and do something intelligent about them,
other than reporting in textual messages.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/count-objects.c | 26 ++++++++++++++++++++++++--
 cache.h                 |  7 +++++--
 path.c                  |  2 +-
 sha1_file.c             | 23 ++++++-----------------
 4 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index ad0c799..4c3198e 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -15,9 +15,31 @@ static int verbose;
 static unsigned long loose, packed, packed_loose;
 static off_t loose_size;
 
-static void real_report_garbage(const char *desc, const char *path)
+const char *bits_to_msg(unsigned seen_bits)
+{
+	switch (seen_bits) {
+	case 0:
+		return "no corresponding .idx or .pack";
+	case PACKDIR_FILE_GARBAGE:
+		return "garbage found";
+	case PACKDIR_FILE_PACK:
+		return "no corresponding .idx";
+	case PACKDIR_FILE_IDX:
+		return "no corresponding .pack";
+	case PACKDIR_FILE_PACK|PACKDIR_FILE_IDX:
+	default:
+		return NULL;
+	}
+}
+
+static void real_report_garbage(unsigned seen_bits, const char *path)
 {
 	struct stat st;
+	const char *desc = bits_to_msg(seen_bits);
+
+	if (!desc)
+		return;
+
 	if (!stat(path, &st))
 		size_garbage += st.st_size;
 	warning("%s: %s", desc, path);
@@ -27,7 +49,7 @@ static void real_report_garbage(const char *desc, const char *path)
 static void loose_garbage(const char *path)
 {
 	if (verbose)
-		report_garbage("garbage found", path);
+		report_garbage(PACKDIR_FILE_GARBAGE, path);
 }
 
 static int count_loose(const unsigned char *sha1, const char *path, void *data)
diff --git a/cache.h b/cache.h
index 6bb7119..2d4dedc 100644
--- a/cache.h
+++ b/cache.h
@@ -1212,8 +1212,11 @@ struct pack_entry {
 
 extern struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
-/* A hook for count-objects to report invalid files in pack directory */
-extern void (*report_garbage)(const char *desc, const char *path);
+/* A hook to report invalid files in pack directory */
+#define PACKDIR_FILE_PACK 1
+#define PACKDIR_FILE_IDX 2
+#define PACKDIR_FILE_GARBAGE 4
+extern void (*report_garbage)(unsigned seen_bits, const char *path);
 
 extern void prepare_packed_git(void);
 extern void reprepare_packed_git(void);
diff --git a/path.c b/path.c
index 10f4cbf..75ec236 100644
--- a/path.c
+++ b/path.c
@@ -143,7 +143,7 @@ void report_linked_checkout_garbage(void)
 		strbuf_setlen(&sb, len);
 		strbuf_addstr(&sb, path);
 		if (file_exists(sb.buf))
-			report_garbage("unused in linked checkout", sb.buf);
+			report_garbage(PACKDIR_FILE_GARBAGE, sb.buf);
 	}
 	strbuf_release(&sb);
 }
diff --git a/sha1_file.c b/sha1_file.c
index 1cee438..0c0b652 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1183,27 +1183,16 @@ void install_packed_git(struct packed_git *pack)
 	packed_git = pack;
 }
 
-void (*report_garbage)(const char *desc, const char *path);
+void (*report_garbage)(unsigned seen_bits, const char *path);
 
 static void report_helper(const struct string_list *list,
 			  int seen_bits, int first, int last)
 {
-	const char *msg;
-	switch (seen_bits) {
-	case 0:
-		msg = "no corresponding .idx or .pack";
-		break;
-	case 1:
-		msg = "no corresponding .idx";
-		break;
-	case 2:
-		msg = "no corresponding .pack";
-		break;
-	default:
+	if (seen_bits == (PACKDIR_FILE_PACK|PACKDIR_FILE_IDX))
 		return;
-	}
+
 	for (; first < last; first++)
-		report_garbage(msg, list->items[first].string);
+		report_garbage(seen_bits, list->items[first].string);
 }
 
 static void report_pack_garbage(struct string_list *list)
@@ -1226,7 +1215,7 @@ static void report_pack_garbage(struct string_list *list)
 		if (baselen == -1) {
 			const char *dot = strrchr(path, '.');
 			if (!dot) {
-				report_garbage("garbage found", path);
+				report_garbage(PACKDIR_FILE_GARBAGE, path);
 				continue;
 			}
 			baselen = dot - path + 1;
@@ -1298,7 +1287,7 @@ static void prepare_packed_git_one(char *objdir, int local)
 		    ends_with(de->d_name, ".keep"))
 			string_list_append(&garbage, path.buf);
 		else
-			report_garbage("garbage found", path.buf);
+			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
 	}
 	closedir(dir);
 	report_pack_garbage(&garbage);
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/2] gc: Remove garbage .idx files from pack dir
  2015-08-13 18:02                 ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Doug Kelly
@ 2015-08-13 18:02                   ` Doug Kelly
  2015-08-17 16:35                     ` Junio C Hamano
  2015-08-17 20:30                     ` Junio C Hamano
  2015-08-13 18:46                   ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Eric Sunshine
  1 sibling, 2 replies; 34+ messages in thread
From: Doug Kelly @ 2015-08-13 18:02 UTC (permalink / raw)
  To: git; +Cc: peff, Doug Kelly

Add a custom report_garbage handler to collect and remove garbage
.idx files from the pack directory.

Signed-off-by: Doug Kelly <dougk.ff7@gmail.com>
---
 builtin/gc.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/builtin/gc.c b/builtin/gc.c
index bcc75d9..8352616 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -42,8 +42,18 @@ static struct argv_array prune = ARGV_ARRAY_INIT;
 static struct argv_array prune_worktrees = ARGV_ARRAY_INIT;
 static struct argv_array rerere = ARGV_ARRAY_INIT;
 
+static struct string_list pack_garbage = STRING_LIST_INIT_DUP;
+
 static char *pidfile;
 
+static void clean_pack_garbage(void)
+{
+	int i;
+	for (i = 0; i < pack_garbage.nr; i++)
+		unlink_or_warn(pack_garbage.items[i].string);
+	string_list_clear(&pack_garbage, 0);
+}
+
 static void remove_pidfile(void)
 {
 	if (pidfile)
@@ -57,6 +67,12 @@ static void remove_pidfile_on_signal(int signo)
 	raise(signo);
 }
 
+static void report_pack_garbage(unsigned seen_bits, const char *path)
+{
+	if (seen_bits == PACKDIR_FILE_IDX)
+		string_list_append(&pack_garbage, path);
+}
+
 static void git_config_date_string(const char *key, const char **output)
 {
 	if (git_config_get_string_const(key, output))
@@ -372,6 +388,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
 		return error(FAILED_RUN, rerere.argv[0]);
 
+	report_garbage = report_pack_garbage;
+	reprepare_packed_git();
+	if (pack_garbage.nr > 0)
+		clean_pack_garbage();
+
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));
-- 
2.0.5

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-08-13 18:02                 ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Doug Kelly
  2015-08-13 18:02                   ` [PATCH 2/2] gc: Remove garbage .idx files from pack dir Doug Kelly
@ 2015-08-13 18:46                   ` Eric Sunshine
  2015-08-17 16:53                     ` Junio C Hamano
  1 sibling, 1 reply; 34+ messages in thread
From: Eric Sunshine @ 2015-08-13 18:46 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Git List, Jeff King, Junio C Hamano

On Thu, Aug 13, 2015 at 2:02 PM, Doug Kelly <dougk.ff7@gmail.com> wrote:
> From: Junio C Hamano <gitster@pobox.com>
>
> The hook to report "garbage" files in $GIT_OBJECT_DIRECTORY/pack/
> could be generic but is too specific to count-object's needs.
>
> Move the part to produce human-readable messages to count-objects,
> and refine the interface to callback with the "bits" with values
> defined in the cache.h header file, so that other callers (e.g.
> prune) can later use the same mechanism to enumerate different
> kinds of garbage files and do something intelligent about them,
> other than reporting in textual messages.
>
> Signed-off-by: Junio C Hamano <gitster@pobox.com>

Since you're forwarding Junio's patch, you'd also want to sign-off
(following his).

> ---
> diff --git a/builtin/count-objects.c b/builtin/count-objects.c
> index ad0c799..4c3198e 100644
> --- a/builtin/count-objects.c
> +++ b/builtin/count-objects.c
> @@ -15,9 +15,31 @@ static int verbose;
>  static unsigned long loose, packed, packed_loose;
>  static off_t loose_size;
>
> -static void real_report_garbage(const char *desc, const char *path)
> +const char *bits_to_msg(unsigned seen_bits)

If you don't expect other callers outside this file, then this should
be declared 'static'. If you do expect future external callers, then
this should be declared in a public header file (but renamed to be
more meaningful).

> +{
> +       switch (seen_bits) {
> +       case 0:
> +               return "no corresponding .idx or .pack";
> +       case PACKDIR_FILE_GARBAGE:
> +               return "garbage found";
> +       case PACKDIR_FILE_PACK:
> +               return "no corresponding .idx";
> +       case PACKDIR_FILE_IDX:
> +               return "no corresponding .pack";
> +       case PACKDIR_FILE_PACK|PACKDIR_FILE_IDX:
> +       default:
> +               return NULL;
> +       }
> +}

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] gc: Remove garbage .idx files from pack dir
  2015-08-13 18:02                   ` [PATCH 2/2] gc: Remove garbage .idx files from pack dir Doug Kelly
@ 2015-08-17 16:35                     ` Junio C Hamano
  2015-08-17 20:30                     ` Junio C Hamano
  1 sibling, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2015-08-17 16:35 UTC (permalink / raw)
  To: Doug Kelly; +Cc: git, peff

Doug Kelly <dougk.ff7@gmail.com> writes:

> Add a custom report_garbage handler to collect and remove garbage
> .idx files from the pack directory.

You need to explain "why" here.  Why do we want to remove them?  And
the definition of what is "garbage" depends on that exact reason why
we want to remove them.

You and I may remember the discussion we had that started with your
initial problem description right now.  IIRC, it had to do with
something with the performance when there are many lone .idx files
without corresponding .pack file in the repository, or something?

But those who will be reading "git log" output later will not know,
and we would forget, too.

As discussed elsewhere, the removal needs to be protected with grace
period, probably controlled via prune_expire.

Perhaps along this line on top of your patch you can squash but this
is not even compile tested yet, so take it with a grain of salt.

Thanks.

 builtin/gc.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 4a459f3..a652773 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -49,8 +49,21 @@ static char *pidfile;
 static void clean_pack_garbage(void)
 {
 	int i;
-	for (i = 0; i < pack_garbage.nr; i++)
+	unsigned long expire = approxidate(prune_expire);
+
+	for (i = 0; i < pack_garbage.nr; i++) {
+		const char *path = pack_garbage.items[i].string;
+		struct stat st;
+
+		if (lstat(path, &st)) {
+			error("could not stat '%s'", path);
+			continue; /* don't risk */
+		}
+		if (st.st_mtime > expire)
+			continue; /* too young */
+
 		unlink_or_warn(pack_garbage.items[i].string);
+	}
 	string_list_clear(&pack_garbage, 0);
 }
 

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-08-13 18:46                   ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Eric Sunshine
@ 2015-08-17 16:53                     ` Junio C Hamano
  2015-10-28 17:48                       ` Junio C Hamano
  0 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2015-08-17 16:53 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Doug Kelly, Git List, Jeff King

Eric Sunshine <sunshine@sunshineco.com> writes:

>> -static void real_report_garbage(const char *desc, const char *path)
>> +const char *bits_to_msg(unsigned seen_bits)
>
> If you don't expect other callers outside this file, then this should
> be declared 'static'. If you do expect future external callers, then
> this should be declared in a public header file (but renamed to be
> more meaningful).

I think this can be private to this file.  The sole point of moving
this logic to this file is to make it private, after all ;-)  Thanks
for sharp eyes.

Together with the need for a description on "why", this probably
deserves a test or two, probably at the end of t5304.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] gc: Remove garbage .idx files from pack dir
  2015-08-13 18:02                   ` [PATCH 2/2] gc: Remove garbage .idx files from pack dir Doug Kelly
  2015-08-17 16:35                     ` Junio C Hamano
@ 2015-08-17 20:30                     ` Junio C Hamano
  1 sibling, 0 replies; 34+ messages in thread
From: Junio C Hamano @ 2015-08-17 20:30 UTC (permalink / raw)
  To: Doug Kelly; +Cc: git, peff

Doug Kelly <dougk.ff7@gmail.com> writes:

> +static struct string_list pack_garbage = STRING_LIST_INIT_DUP;
> +
>  static char *pidfile;
>  
> +static void clean_pack_garbage(void)
> +{
> +	int i;
> +	for (i = 0; i < pack_garbage.nr; i++)
> +		unlink_or_warn(pack_garbage.items[i].string);
> +	string_list_clear(&pack_garbage, 0);
> +}
> +
>  static void remove_pidfile(void)
>  {
>  	if (pidfile)
> @@ -57,6 +67,12 @@ static void remove_pidfile_on_signal(int signo)
>  	raise(signo);
>  }
>  
> +static void report_pack_garbage(unsigned seen_bits, const char *path)

This change makes"pidfile management" and "pack garbage cleaning"
tangled in the result.  By inserting the definition of the variable
pack_garbage and the function clean_pack_garbage() just before this
new function, you can keep everything related to 'pack garbage
cleaning" together.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-08-17 16:53                     ` Junio C Hamano
@ 2015-10-28 17:48                       ` Junio C Hamano
  2015-10-28 22:43                         ` Doug Kelly
  0 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2015-10-28 17:48 UTC (permalink / raw)
  To: Eric Sunshine; +Cc: Doug Kelly, Git List, Jeff King

Junio C Hamano <gitster@pobox.com> writes:

> Eric Sunshine <sunshine@sunshineco.com> writes:
>
>>> -static void real_report_garbage(const char *desc, const char *path)
>>> +const char *bits_to_msg(unsigned seen_bits)
>>
>> If you don't expect other callers outside this file, then this should
>> be declared 'static'. If you do expect future external callers, then
>> this should be declared in a public header file (but renamed to be
>> more meaningful).
>
> I think this can be private to this file.  The sole point of moving
> this logic to this file is to make it private, after all ;-)  Thanks
> for sharp eyes.
>
> Together with the need for a description on "why", this probably
> deserves a test or two, probably at the end of t5304.
>
> Thanks.

Does somebody want to help tying the final loose ends on this topic?
It has been listed in the [Stalled] section for too long, I _think_
what it attempts to do is a worthy thing, and it is shame to see the
initial implementation and review cycles we have spent so far go to
waste.

If I find nothing else to do before any taker appears, I could
volunteer myself, but thought I should ask first.

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-10-28 17:48                       ` Junio C Hamano
@ 2015-10-28 22:43                         ` Doug Kelly
  2015-11-04  3:05                           ` [PATCH 1/3] " Doug Kelly
  2015-11-04  3:12                           ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Doug Kelly
  0 siblings, 2 replies; 34+ messages in thread
From: Doug Kelly @ 2015-10-28 22:43 UTC (permalink / raw)
  To: Git List

On Wed, Oct 28, 2015 at 12:48 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Junio C Hamano <gitster@pobox.com> writes:
>
>> Eric Sunshine <sunshine@sunshineco.com> writes:
>>
>>>> -static void real_report_garbage(const char *desc, const char *path)
>>>> +const char *bits_to_msg(unsigned seen_bits)
>>>
>>> If you don't expect other callers outside this file, then this should
>>> be declared 'static'. If you do expect future external callers, then
>>> this should be declared in a public header file (but renamed to be
>>> more meaningful).
>>
>> I think this can be private to this file.  The sole point of moving
>> this logic to this file is to make it private, after all ;-)  Thanks
>> for sharp eyes.
>>
>> Together with the need for a description on "why", this probably
>> deserves a test or two, probably at the end of t5304.
>>
>> Thanks.
>
> Does somebody want to help tying the final loose ends on this topic?
> It has been listed in the [Stalled] section for too long, I _think_
> what it attempts to do is a worthy thing, and it is shame to see the
> initial implementation and review cycles we have spent so far go to
> waste.
>
> If I find nothing else to do before any taker appears, I could
> volunteer myself, but thought I should ask first.
>
> Thanks.

I agree; I've been wanting to get back to it, but had some
higher-priority things at work for a while, so I've not had time.  I'd
be happy to get back into it, but if you get to it first, believe me,
I'm not going to be offended. :)

I'll see if I can't devote a little extra time to it this upcoming
week, though.  Hopefully it doesn't need too much additional polishing
to be ready.

P.S. Does a Googler want to tell the Inbox team that the inability to
send plain-text email is really annoying? :P

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/3] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-10-28 22:43                         ` Doug Kelly
@ 2015-11-04  3:05                           ` Doug Kelly
  2015-11-04  3:05                             ` [PATCH 2/3] t5304: Add test for cleaning pack garbage Doug Kelly
  2015-11-04  3:05                             ` [PATCH 3/3] gc: Remove garbage .idx files from pack dir Doug Kelly
  2015-11-04  3:12                           ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Doug Kelly
  1 sibling, 2 replies; 34+ messages in thread
From: Doug Kelly @ 2015-11-04  3:05 UTC (permalink / raw)
  To: git; +Cc: Junio C Hamano, Doug Kelly

From: Junio C Hamano <gitster@pobox.com>

The hook to report "garbage" files in $GIT_OBJECT_DIRECTORY/pack/
could be generic but is too specific to count-object's needs.

Move the part to produce human-readable messages to count-objects,
and refine the interface to callback with the "bits" with values
defined in the cache.h header file, so that other callers (e.g.
prune) can later use the same mechanism to enumerate different
kinds of garbage files and do something intelligent about them,
other than reporting in textual messages.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
Signed-off-by: Doug Kelly <dougk.ff7@gmail.com>
---
 builtin/count-objects.c | 26 ++++++++++++++++++++++++--
 cache.h                 |  7 +++++--
 path.c                  |  2 +-
 sha1_file.c             | 23 ++++++-----------------
 4 files changed, 36 insertions(+), 22 deletions(-)

diff --git a/builtin/count-objects.c b/builtin/count-objects.c
index ad0c799..ba92919 100644
--- a/builtin/count-objects.c
+++ b/builtin/count-objects.c
@@ -15,9 +15,31 @@ static int verbose;
 static unsigned long loose, packed, packed_loose;
 static off_t loose_size;
 
-static void real_report_garbage(const char *desc, const char *path)
+static const char *bits_to_msg(unsigned seen_bits)
+{
+	switch (seen_bits) {
+	case 0:
+		return "no corresponding .idx or .pack";
+	case PACKDIR_FILE_GARBAGE:
+		return "garbage found";
+	case PACKDIR_FILE_PACK:
+		return "no corresponding .idx";
+	case PACKDIR_FILE_IDX:
+		return "no corresponding .pack";
+	case PACKDIR_FILE_PACK|PACKDIR_FILE_IDX:
+	default:
+		return NULL;
+	}
+}
+
+static void real_report_garbage(unsigned seen_bits, const char *path)
 {
 	struct stat st;
+	const char *desc = bits_to_msg(seen_bits);
+
+	if (!desc)
+		return;
+
 	if (!stat(path, &st))
 		size_garbage += st.st_size;
 	warning("%s: %s", desc, path);
@@ -27,7 +49,7 @@ static void real_report_garbage(const char *desc, const char *path)
 static void loose_garbage(const char *path)
 {
 	if (verbose)
-		report_garbage("garbage found", path);
+		report_garbage(PACKDIR_FILE_GARBAGE, path);
 }
 
 static int count_loose(const unsigned char *sha1, const char *path, void *data)
diff --git a/cache.h b/cache.h
index 3ba0b8f..736abc0 100644
--- a/cache.h
+++ b/cache.h
@@ -1289,8 +1289,11 @@ struct pack_entry {
 
 extern struct packed_git *parse_pack_index(unsigned char *sha1, const char *idx_path);
 
-/* A hook for count-objects to report invalid files in pack directory */
-extern void (*report_garbage)(const char *desc, const char *path);
+/* A hook to report invalid files in pack directory */
+#define PACKDIR_FILE_PACK 1
+#define PACKDIR_FILE_IDX 2
+#define PACKDIR_FILE_GARBAGE 4
+extern void (*report_garbage)(unsigned seen_bits, const char *path);
 
 extern void prepare_packed_git(void);
 extern void reprepare_packed_git(void);
diff --git a/path.c b/path.c
index c740c4f..f28ace2 100644
--- a/path.c
+++ b/path.c
@@ -363,7 +363,7 @@ void report_linked_checkout_garbage(void)
 		strbuf_setlen(&sb, len);
 		strbuf_addstr(&sb, path);
 		if (file_exists(sb.buf))
-			report_garbage("unused in linked checkout", sb.buf);
+			report_garbage(PACKDIR_FILE_GARBAGE, sb.buf);
 	}
 	strbuf_release(&sb);
 }
diff --git a/sha1_file.c b/sha1_file.c
index c5b31de..3d56746 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1217,27 +1217,16 @@ void install_packed_git(struct packed_git *pack)
 	packed_git = pack;
 }
 
-void (*report_garbage)(const char *desc, const char *path);
+void (*report_garbage)(unsigned seen_bits, const char *path);
 
 static void report_helper(const struct string_list *list,
 			  int seen_bits, int first, int last)
 {
-	const char *msg;
-	switch (seen_bits) {
-	case 0:
-		msg = "no corresponding .idx or .pack";
-		break;
-	case 1:
-		msg = "no corresponding .idx";
-		break;
-	case 2:
-		msg = "no corresponding .pack";
-		break;
-	default:
+	if (seen_bits == (PACKDIR_FILE_PACK|PACKDIR_FILE_IDX))
 		return;
-	}
+
 	for (; first < last; first++)
-		report_garbage(msg, list->items[first].string);
+		report_garbage(seen_bits, list->items[first].string);
 }
 
 static void report_pack_garbage(struct string_list *list)
@@ -1260,7 +1249,7 @@ static void report_pack_garbage(struct string_list *list)
 		if (baselen == -1) {
 			const char *dot = strrchr(path, '.');
 			if (!dot) {
-				report_garbage("garbage found", path);
+				report_garbage(PACKDIR_FILE_GARBAGE, path);
 				continue;
 			}
 			baselen = dot - path + 1;
@@ -1332,7 +1321,7 @@ static void prepare_packed_git_one(char *objdir, int local)
 		    ends_with(de->d_name, ".keep"))
 			string_list_append(&garbage, path.buf);
 		else
-			report_garbage("garbage found", path.buf);
+			report_garbage(PACKDIR_FILE_GARBAGE, path.buf);
 	}
 	closedir(dir);
 	report_pack_garbage(&garbage);
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 2/3] t5304: Add test for cleaning pack garbage
  2015-11-04  3:05                           ` [PATCH 1/3] " Doug Kelly
@ 2015-11-04  3:05                             ` Doug Kelly
  2015-11-04  3:05                             ` [PATCH 3/3] gc: Remove garbage .idx files from pack dir Doug Kelly
  1 sibling, 0 replies; 34+ messages in thread
From: Doug Kelly @ 2015-11-04  3:05 UTC (permalink / raw)
  To: git; +Cc: Doug Kelly

Pack garbage, noticeably stale .idx files, can be cleaned up during
a garbage collection.  This tests to ensure such garbage is properly
cleaned up.

Note that the prior test for checking pack garbage with count-objects
left some stale garbage after the test exited.  This has also been
corrected.

Signed-off-by: Doug Kelly <dougk.ff7@gmail.com>
---
 t/t5304-prune.sh | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh
index 023d7c6..0297515 100755
--- a/t/t5304-prune.sh
+++ b/t/t5304-prune.sh
@@ -219,6 +219,7 @@ test_expect_success 'gc: prune old objects after local clone' '
 
 test_expect_success 'garbage report in count-objects -v' '
 	test_when_finished "rm -f .git/objects/pack/fake*" &&
+	test_when_finished "rm -f .git/objects/pack/foo*" &&
 	: >.git/objects/pack/foo &&
 	: >.git/objects/pack/foo.bar &&
 	: >.git/objects/pack/foo.keep &&
@@ -244,6 +245,26 @@ EOF
 	test_cmp expected actual
 '
 
+test_expect_failure 'clean pack garbage with gc' '
+	test_when_finished "rm -f .git/objects/pack/fake*" &&
+	test_when_finished "rm -f .git/objects/pack/foo*" &&
+	: >.git/objects/pack/foo.keep &&
+	: >.git/objects/pack/foo.pack &&
+	: >.git/objects/pack/fake.idx &&
+	: >.git/objects/pack/fake2.keep &&
+	: >.git/objects/pack/fake2.idx &&
+	: >.git/objects/pack/fake3.keep &&
+	git gc &&
+	git count-objects -v 2>stderr &&
+	grep "^warning:" stderr | sort >actual &&
+	cat >expected <<\EOF &&
+warning: no corresponding .idx or .pack: .git/objects/pack/fake3.keep
+warning: no corresponding .idx: .git/objects/pack/foo.keep
+warning: no corresponding .idx: .git/objects/pack/foo.pack
+EOF
+	test_cmp expected actual
+'
+
 test_expect_success 'prune .git/shallow' '
 	SHA1=`echo hi|git commit-tree HEAD^{tree}` &&
 	echo $SHA1 >.git/shallow &&
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH 3/3] gc: Remove garbage .idx files from pack dir
  2015-11-04  3:05                           ` [PATCH 1/3] " Doug Kelly
  2015-11-04  3:05                             ` [PATCH 2/3] t5304: Add test for cleaning pack garbage Doug Kelly
@ 2015-11-04  3:05                             ` Doug Kelly
  1 sibling, 0 replies; 34+ messages in thread
From: Doug Kelly @ 2015-11-04  3:05 UTC (permalink / raw)
  To: git; +Cc: Doug Kelly

Add a custom report_garbage handler to collect and remove garbage
.idx files from the pack directory.

Signed-off-by: Doug Kelly <dougk.ff7@gmail.com>
---
 builtin/gc.c     | 20 ++++++++++++++++++++
 t/t5304-prune.sh |  2 +-
 2 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index df3e454..668f975 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -45,6 +45,21 @@ static struct argv_array rerere = ARGV_ARRAY_INIT;
 
 static struct tempfile pidfile;
 static struct lock_file log_lock;
+static struct string_list pack_garbage = STRING_LIST_INIT_DUP;
+
+static void clean_pack_garbage(void)
+{
+	int i;
+	for (i = 0; i < pack_garbage.nr; i++)
+		unlink_or_warn(pack_garbage.items[i].string);
+	string_list_clear(&pack_garbage, 0);
+}
+
+static void report_pack_garbage(unsigned seen_bits, const char *path)
+{
+	if (seen_bits == PACKDIR_FILE_IDX)
+		string_list_append(&pack_garbage, path);
+}
 
 static void git_config_date_string(const char *key, const char **output)
 {
@@ -416,6 +431,11 @@ int cmd_gc(int argc, const char **argv, const char *prefix)
 	if (run_command_v_opt(rerere.argv, RUN_GIT_CMD))
 		return error(FAILED_RUN, rerere.argv[0]);
 
+	report_garbage = report_pack_garbage;
+	reprepare_packed_git();
+	if (pack_garbage.nr > 0)
+		clean_pack_garbage();
+
 	if (auto_gc && too_many_loose_objects())
 		warning(_("There are too many unreachable loose objects; "
 			"run 'git prune' to remove them."));
diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh
index 0297515..def203c 100755
--- a/t/t5304-prune.sh
+++ b/t/t5304-prune.sh
@@ -245,7 +245,7 @@ EOF
 	test_cmp expected actual
 '
 
-test_expect_failure 'clean pack garbage with gc' '
+test_expect_success 'clean pack garbage with gc' '
 	test_when_finished "rm -f .git/objects/pack/fake*" &&
 	test_when_finished "rm -f .git/objects/pack/foo*" &&
 	: >.git/objects/pack/foo.keep &&
-- 
2.5.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-10-28 22:43                         ` Doug Kelly
  2015-11-04  3:05                           ` [PATCH 1/3] " Doug Kelly
@ 2015-11-04  3:12                           ` Doug Kelly
  2015-11-04 19:35                             ` Junio C Hamano
  1 sibling, 1 reply; 34+ messages in thread
From: Doug Kelly @ 2015-11-04  3:12 UTC (permalink / raw)
  To: Git List, Junio C Hamano, Eric Sunshine, Jeff King

On Wed, Oct 28, 2015 at 5:43 PM, Doug Kelly <dougk.ff7@gmail.com> wrote:
> On Wed, Oct 28, 2015 at 12:48 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> Junio C Hamano <gitster@pobox.com> writes:
>>
>>> Eric Sunshine <sunshine@sunshineco.com> writes:
>>>
>>>>> -static void real_report_garbage(const char *desc, const char *path)
>>>>> +const char *bits_to_msg(unsigned seen_bits)
>>>>
>>>> If you don't expect other callers outside this file, then this should
>>>> be declared 'static'. If you do expect future external callers, then
>>>> this should be declared in a public header file (but renamed to be
>>>> more meaningful).
>>>
>>> I think this can be private to this file.  The sole point of moving
>>> this logic to this file is to make it private, after all ;-)  Thanks
>>> for sharp eyes.
>>>
>>> Together with the need for a description on "why", this probably
>>> deserves a test or two, probably at the end of t5304.
>>>
>>> Thanks.
>>
>> Does somebody want to help tying the final loose ends on this topic?
>> It has been listed in the [Stalled] section for too long, I _think_
>> what it attempts to do is a worthy thing, and it is shame to see the
>> initial implementation and review cycles we have spent so far go to
>> waste.
>>
>> If I find nothing else to do before any taker appears, I could
>> volunteer myself, but thought I should ask first.
>>
>> Thanks.
>
> I agree; I've been wanting to get back to it, but had some
> higher-priority things at work for a while, so I've not had time.  I'd
> be happy to get back into it, but if you get to it first, believe me,
> I'm not going to be offended. :)
>
> I'll see if I can't devote a little extra time to it this upcoming
> week, though.  Hopefully it doesn't need too much additional polishing
> to be ready.
>
> P.S. Does a Googler want to tell the Inbox team that the inability to
> send plain-text email is really annoying? :P

I think the patches I sent (a bit prematurely) address the remaining
comments... I did find there was a relevant test in t5304 already, so
I added a new test in the same section (and cleaned up some of the
garbage it wasn't removing before).  I'm not sure if it's poor form to
move tests around like this, but I figured it might be best to keep
them logically grouped.

Let me know if there's anything I can do, and once again, sorry for the delay!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-11-04  3:12                           ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Doug Kelly
@ 2015-11-04 19:35                             ` Junio C Hamano
  2015-11-04 19:56                               ` Doug Kelly
  2015-11-04 19:56                               ` Jeff King
  0 siblings, 2 replies; 34+ messages in thread
From: Junio C Hamano @ 2015-11-04 19:35 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Git List, Eric Sunshine, Jeff King

Doug Kelly <dougk.ff7@gmail.com> writes:

> I think the patches I sent (a bit prematurely) address the
> remaining comments... I did find there was a relevant test in
> t5304 already, so I added a new test in the same section (and
> cleaned up some of the garbage it wasn't removing before).  I'm
> not sure if it's poor form to move tests around like this, but I
> figured it might be best to keep them logically grouped.

OK, will queue as I didn't spot anything glaringly wrong ;-)

I did wonder if we want to say anything about .bitmap files, though.
If there is one without matching .idx and .pack, shouldn't we report
just like we report .idx without .pack (or vice versa)?

Thanks.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-11-04 19:35                             ` Junio C Hamano
@ 2015-11-04 19:56                               ` Doug Kelly
  2015-11-04 20:02                                 ` Jeff King
  2015-11-04 19:56                               ` Jeff King
  1 sibling, 1 reply; 34+ messages in thread
From: Doug Kelly @ 2015-11-04 19:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git List, Eric Sunshine, Jeff King

On Wed, Nov 4, 2015 at 1:35 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Doug Kelly <dougk.ff7@gmail.com> writes:
>
>> I think the patches I sent (a bit prematurely) address the
>> remaining comments... I did find there was a relevant test in
>> t5304 already, so I added a new test in the same section (and
>> cleaned up some of the garbage it wasn't removing before).  I'm
>> not sure if it's poor form to move tests around like this, but I
>> figured it might be best to keep them logically grouped.
>
> OK, will queue as I didn't spot anything glaringly wrong ;-)
>
> I did wonder if we want to say anything about .bitmap files, though.
> If there is one without matching .idx and .pack, shouldn't we report
> just like we report .idx without .pack (or vice versa)?
>
> Thanks.

I think you're right -- this would be something worth following up on.
At least, t5304 doesn't cover this case explicitly, but when I tried
adding an empty bitmap with a bogus name, I did see a "no
corresponding .idx or .pack" error, similar to the stale .keep file.

I'd trust your (and Jeff's) knowledge on this far more than my own,
but would it be a bad idea to clean up .keep and .bitmap files if the
.idx/.pack pair are missing?  I think we may have had a discussion
previously on how things along these lines might be racey -- but I
don't know what order the .keep file is created in relation to the
.idx/.pack.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-11-04 19:35                             ` Junio C Hamano
  2015-11-04 19:56                               ` Doug Kelly
@ 2015-11-04 19:56                               ` Jeff King
  1 sibling, 0 replies; 34+ messages in thread
From: Jeff King @ 2015-11-04 19:56 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Doug Kelly, Git List, Eric Sunshine

On Wed, Nov 04, 2015 at 11:35:52AM -0800, Junio C Hamano wrote:

> Doug Kelly <dougk.ff7@gmail.com> writes:
> 
> > I think the patches I sent (a bit prematurely) address the
> > remaining comments... I did find there was a relevant test in
> > t5304 already, so I added a new test in the same section (and
> > cleaned up some of the garbage it wasn't removing before).  I'm
> > not sure if it's poor form to move tests around like this, but I
> > figured it might be best to keep them logically grouped.
> 
> OK, will queue as I didn't spot anything glaringly wrong ;-)
> 
> I did wonder if we want to say anything about .bitmap files, though.
> If there is one without matching .idx and .pack, shouldn't we report
> just like we report .idx without .pack (or vice versa)?

Yeah, I think so. The logic should really extend to anything without a
matching .pack. And I think the sane rule is probably:

  If we have pack-$sha.$ext, but not pack-$sha.pack, then:

    1. if $ext is known to us as a cache that can be regenerated from the
       .pack (i.e., .idx, .bitmap), then delete it

    2. if $ext is known to us as precious, do nothing (there is nothing
       in this category right now, though)

    3. if $ext is not known to us, warn but do not delete (in case a
       future version adds something precious)

The conservatism in (3) is the right thing to do, I think, but I doubt
it will ever matter, because we probably cannot ever add non-cache
auxiliary files to the pack. Old versions of git would not delete such
precious files, but nor would they carry them forward during a repack.
So short of a repo-version bump, I think we are effectively limited to
adding only caches which can be re-generated from an original .pack.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-11-04 19:56                               ` Doug Kelly
@ 2015-11-04 20:02                                 ` Jeff King
  2015-11-04 20:08                                   ` Doug Kelly
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2015-11-04 20:02 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Junio C Hamano, Git List, Eric Sunshine

On Wed, Nov 04, 2015 at 01:56:38PM -0600, Doug Kelly wrote:

> > I did wonder if we want to say anything about .bitmap files, though.
> > If there is one without matching .idx and .pack, shouldn't we report
> > just like we report .idx without .pack (or vice versa)?
> 
> I think you're right -- this would be something worth following up on.
> At least, t5304 doesn't cover this case explicitly, but when I tried
> adding an empty bitmap with a bogus name, I did see a "no
> corresponding .idx or .pack" error, similar to the stale .keep file.

Yeah, that should be harmless warning (although note because the bitmap
code only really handles a single bitmap, it can prevent loading of the
"real" bitmap; so you'd want to clean it up, for sure).

> I'd trust your (and Jeff's) knowledge on this far more than my own,
> but would it be a bad idea to clean up .keep and .bitmap files if the
> .idx/.pack pair are missing?  I think we may have had a discussion
> previously on how things along these lines might be racey -- but I
> don't know what order the .keep file is created in relation to the
> .idx/.pack.

Definitely cleaning up the .bitmap is sane and not racy (it's in the
same boat as the .idx, I think).

.keep files are more tricky. I'd have to go over the receive-pack code
to confirm, but I think they _are_ racy. That is, receive-pack will
create them as a lockfile before moving the pack into place. That's OK,
though, if we use mtimes to give ourselves a grace period (I haven't
looked at your series yet).

But moreover, .keep files can be created manually by the user. If the
pack they referenced goes away, they are not really serving any purpose.
But it's possible that the user would want to salvage the content of the
file, or know that it was there.

So I'd argue we should leave them. Or at least leave ones that do not
have the generic "{receive,fetch}-pack $pid on $host comment in them,
which were clearly created as lockfiles.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-11-04 20:02                                 ` Jeff King
@ 2015-11-04 20:08                                   ` Doug Kelly
  2015-11-04 20:15                                     ` Jeff King
  2015-12-30  7:37                                     ` Jeff King
  0 siblings, 2 replies; 34+ messages in thread
From: Doug Kelly @ 2015-11-04 20:08 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git List, Eric Sunshine

On Wed, Nov 4, 2015 at 2:02 PM, Jeff King <peff@peff.net> wrote:
> On Wed, Nov 04, 2015 at 01:56:38PM -0600, Doug Kelly wrote:
>
>> > I did wonder if we want to say anything about .bitmap files, though.
>> > If there is one without matching .idx and .pack, shouldn't we report
>> > just like we report .idx without .pack (or vice versa)?
>>
>> I think you're right -- this would be something worth following up on.
>> At least, t5304 doesn't cover this case explicitly, but when I tried
>> adding an empty bitmap with a bogus name, I did see a "no
>> corresponding .idx or .pack" error, similar to the stale .keep file.
>
> Yeah, that should be harmless warning (although note because the bitmap
> code only really handles a single bitmap, it can prevent loading of the
> "real" bitmap; so you'd want to clean it up, for sure).
>
>> I'd trust your (and Jeff's) knowledge on this far more than my own,
>> but would it be a bad idea to clean up .keep and .bitmap files if the
>> .idx/.pack pair are missing?  I think we may have had a discussion
>> previously on how things along these lines might be racey -- but I
>> don't know what order the .keep file is created in relation to the
>> .idx/.pack.
>
> Definitely cleaning up the .bitmap is sane and not racy (it's in the
> same boat as the .idx, I think).
>
> .keep files are more tricky. I'd have to go over the receive-pack code
> to confirm, but I think they _are_ racy. That is, receive-pack will
> create them as a lockfile before moving the pack into place. That's OK,
> though, if we use mtimes to give ourselves a grace period (I haven't
> looked at your series yet).
>
> But moreover, .keep files can be created manually by the user. If the
> pack they referenced goes away, they are not really serving any purpose.
> But it's possible that the user would want to salvage the content of the
> file, or know that it was there.
>
> So I'd argue we should leave them. Or at least leave ones that do not
> have the generic "{receive,fetch}-pack $pid on $host comment in them,
> which were clearly created as lockfiles.
>
> -Peff

Currently there's no mtime-guarding logic (I dug up that conversation
earlier, though, but after I'd done the respin on this series)... OK,
in that case, I'll create a separate patch that tests/cleans up
.bitmap, but doesn't touch .keep.  This might be a small series since
I think the logic for finding pack garbage doesn't know anything about
.bitmap per-se, so it's looking like I'll extend that relevant code,
before adding the handling in gc and appropriate tests.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-11-04 20:08                                   ` Doug Kelly
@ 2015-11-04 20:15                                     ` Jeff King
  2015-12-30  7:37                                     ` Jeff King
  1 sibling, 0 replies; 34+ messages in thread
From: Jeff King @ 2015-11-04 20:15 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Junio C Hamano, Git List, Eric Sunshine

On Wed, Nov 04, 2015 at 02:08:21PM -0600, Doug Kelly wrote:

> Currently there's no mtime-guarding logic (I dug up that conversation
> earlier, though, but after I'd done the respin on this series)... OK,
> in that case, I'll create a separate patch that tests/cleans up
> .bitmap, but doesn't touch .keep.  This might be a small series since
> I think the logic for finding pack garbage doesn't know anything about
> .bitmap per-se, so it's looking like I'll extend that relevant code,
> before adding the handling in gc and appropriate tests.

I'd hoped you could reuse the list of extensions found in
builtin/repack.c (e.g., see remove_redundant_pack). But I guess that is
not connected with the garbage-reporting code. And anyway, the simple
list probably does not carry sufficient information (it does not know
that ".keep" is potentially more precious than ".idx", for example).

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Fwd: Question: .idx without .pack causes performance issues?
       [not found]     ` <CABYiQpn7r2Vcf=S5RaWHBN85eBYGPV_e02+BY=4L98qfUzDT1Q@mail.gmail.com>
@ 2015-11-11 14:58       ` Thomas Berg
  0 siblings, 0 replies; 34+ messages in thread
From: Thomas Berg @ 2015-11-11 14:58 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Doug Kelly, Git List

Hi all,

(re-sending because my first e-mail was rejected due to html formatting)

While debugging a git fetch performance problem on Windows I came
across this thread. The problem in our case was also caused by
orphaned .idx files.

On Tue, Jul 21, 2015 at 9:15 PM, Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > I however do not think that we mark the in-core structure that
> > corresponds to an open ".idx" file in any way when such a failure
> > happens.  If we really cared enough, we could do so, saying "we know
> > there is .idx file, but do not bother looking at it again, as we
> > know the corresponding .pack is missing", and that would speed things
> > up a bit, essentially bringing us back to a sane situation without
> > any ".idx" without corresponding ".pack".
> >
> > I do not think it is worth the effort, though.  It would be more
> > fruitful to find out how you end up with ".idx exists but not
> > corresponding .pack" and if that is some systemic failure, see if
> > there is a way to prevent that from happening in the first place.
>
> While I still think that it is more important to prevent such a
> situation from occurring in the first place, ignoring .idx that lack
> corresponding .pack should be fairly simple, perhaps like this.

I have observed the following: if garbage collection is triggered
during a git fetch, I always get messages like this:

$ git fetch origin
> Auto packing the repository for optimum performance. You may also
> run "git gc" manually. See "git help gc" for more information.
> Counting objects: 396468, done.
> Delta compression using up to 12 threads.
> Compressing objects: 100% (98683/98683), done.
> Writing objects: 100% (396468/396468), done.
> Total 396468 (delta 289422), reused 395212 (delta 288289)
> Unlink of file '.git/objects/pack/pack-343b6cfdf58171f53c235b900a75d09bd9219e06.pack' failed. Should I try again? (y/n) n
> Unlink of file '.git/objects/pack/pack-343b6cfdf58171f53c235b900a75d09bd9219e06.idx' failed. Should I try again? (y/n) n
> Unlink of file '.git/objects/pack/pack-63a6cb5e2a9f72eea72b02ac74a167e1d71d417f.idx' failed. Should I try again? (y/n) n
> Unlink of file '.git/objects/pack/pack-9b616a2501bb9c13acecf3e981c39868dd2f5ff7.pack' failed. Should I try again? (y/n) n
> Unlink of file '.git/objects/pack/pack-9b616a2501bb9c13acecf3e981c39868dd2f5ff7.idx' failed. Should I try again? (y/n) n
> Checking connectivity: 396468, done.

Windows has the property that if a file is open it can't be deleted.
If so, it could be that git fetch needs to close the files first. I
can't remember observing this problem when running git gc by itself.

In the repos where we have problems I observed both unnecessary .pack
files and .idx files, but way more .idx files. Maybe, over time,
unnecessary pack files have been cleaned up but not .idx files?

If so, this would explain how we get into this situation. I have been
testing this with very old git versions on Windows (1.7.4 and 1.8.4),
sorry if these problems are already fixed in later versions.

- Thomas

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-11-04 20:08                                   ` Doug Kelly
  2015-11-04 20:15                                     ` Jeff King
@ 2015-12-30  7:37                                     ` Jeff King
  2016-01-13 17:14                                       ` Doug Kelly
  1 sibling, 1 reply; 34+ messages in thread
From: Jeff King @ 2015-12-30  7:37 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Junio C Hamano, Git List, Eric Sunshine

On Wed, Nov 04, 2015 at 02:08:21PM -0600, Doug Kelly wrote:

> On Wed, Nov 4, 2015 at 2:02 PM, Jeff King <peff@peff.net> wrote:
> > Definitely cleaning up the .bitmap is sane and not racy (it's in the
> > same boat as the .idx, I think).
> >
> > .keep files are more tricky. I'd have to go over the receive-pack code
> > to confirm, but I think they _are_ racy. That is, receive-pack will
> > create them as a lockfile before moving the pack into place. That's OK,
> > though, if we use mtimes to give ourselves a grace period (I haven't
> > looked at your series yet).
> >
> > But moreover, .keep files can be created manually by the user. If the
> > pack they referenced goes away, they are not really serving any purpose.
> > But it's possible that the user would want to salvage the content of the
> > file, or know that it was there.
> >
> > So I'd argue we should leave them. Or at least leave ones that do not
> > have the generic "{receive,fetch}-pack $pid on $host comment in them,
> > which were clearly created as lockfiles.
> 
> Currently there's no mtime-guarding logic (I dug up that conversation
> earlier, though, but after I'd done the respin on this series)... OK,
> in that case, I'll create a separate patch that tests/cleans up
> .bitmap, but doesn't touch .keep.  This might be a small series since
> I think the logic for finding pack garbage doesn't know anything about
> .bitmap per-se, so it's looking like I'll extend that relevant code,
> before adding the handling in gc and appropriate tests.

I happened to be looking over your series again, and I noticed that we
didn't end up with any mtime logic at all in what got merged.

I _think_ that is probably OK, because we always write the pack,
followed by the .idx, followed by the .bitmap (if any). And we don't
drop .keep files (though I think we would perhaps note them as possible
cruft?).

So I don't think there are any races introduced here, but I wonder if we
want to be a bit more conservative. Sorry to bring this up so much after
the fact; I completely forgot about it when reviewing the patches.

These changes are slated for the v2.7 release. Like I said, I don't
think it's buggy, so we don't necessarily need to address it before the
release. We could add an mtime check in the next cycle as a
belt-and-suspenders safety, rather than a fix.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2015-12-30  7:37                                     ` Jeff King
@ 2016-01-13 17:14                                       ` Doug Kelly
  2016-01-13 20:08                                         ` Junio C Hamano
  0 siblings, 1 reply; 34+ messages in thread
From: Doug Kelly @ 2016-01-13 17:14 UTC (permalink / raw)
  To: Jeff King; +Cc: Junio C Hamano, Git List, Eric Sunshine

On Wed, Dec 30, 2015 at 1:37 AM, Jeff King <peff@peff.net> wrote:
> On Wed, Nov 04, 2015 at 02:08:21PM -0600, Doug Kelly wrote:
>
>> On Wed, Nov 4, 2015 at 2:02 PM, Jeff King <peff@peff.net> wrote:
>> > Definitely cleaning up the .bitmap is sane and not racy (it's in the
>> > same boat as the .idx, I think).
>> >
>> > .keep files are more tricky. I'd have to go over the receive-pack code
>> > to confirm, but I think they _are_ racy. That is, receive-pack will
>> > create them as a lockfile before moving the pack into place. That's OK,
>> > though, if we use mtimes to give ourselves a grace period (I haven't
>> > looked at your series yet).
>> >
>> > But moreover, .keep files can be created manually by the user. If the
>> > pack they referenced goes away, they are not really serving any purpose.
>> > But it's possible that the user would want to salvage the content of the
>> > file, or know that it was there.
>> >
>> > So I'd argue we should leave them. Or at least leave ones that do not
>> > have the generic "{receive,fetch}-pack $pid on $host comment in them,
>> > which were clearly created as lockfiles.
>>
>> Currently there's no mtime-guarding logic (I dug up that conversation
>> earlier, though, but after I'd done the respin on this series)... OK,
>> in that case, I'll create a separate patch that tests/cleans up
>> .bitmap, but doesn't touch .keep.  This might be a small series since
>> I think the logic for finding pack garbage doesn't know anything about
>> .bitmap per-se, so it's looking like I'll extend that relevant code,
>> before adding the handling in gc and appropriate tests.
>
> I happened to be looking over your series again, and I noticed that we
> didn't end up with any mtime logic at all in what got merged.
>
> I _think_ that is probably OK, because we always write the pack,
> followed by the .idx, followed by the .bitmap (if any). And we don't
> drop .keep files (though I think we would perhaps note them as possible
> cruft?).
>
> So I don't think there are any races introduced here, but I wonder if we
> want to be a bit more conservative. Sorry to bring this up so much after
> the fact; I completely forgot about it when reviewing the patches.
>
> These changes are slated for the v2.7 release. Like I said, I don't
> think it's buggy, so we don't necessarily need to address it before the
> release. We could add an mtime check in the next cycle as a
> belt-and-suspenders safety, rather than a fix.
>
> -Peff

Yeah, I know I never got to adding the mtime logic, but for a simple (naive,
hard-coded) case, I did come up with a basic patch today.  I think this could
be extended to a configuration option(?) which would allow a default longer
than 10 seconds (an hour? a day?), then during the regression tests, we
could provide a shorter timeout to ensure the guarding both works and also
not wait forever for tests to complete.  Thoughts?

---
 builtin/gc.c     | 14 ++++++++++++--
 t/t5304-prune.sh |  2 ++
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/builtin/gc.c b/builtin/gc.c
index 79e9886..a4ce616 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -51,8 +51,18 @@ static struct string_list pack_garbage =
STRING_LIST_INIT_DUP;
 static void clean_pack_garbage(void)
 {
  int i;
- for (i = 0; i < pack_garbage.nr; i++)
- unlink_or_warn(pack_garbage.items[i].string);
+ /* Define a cutoff time for "new" garbage to prevent race conditions */
+ time_t cutoff = time(NULL) - 10;
+ for (i = 0; i < pack_garbage.nr; i++) {
+ struct stat s;
+ char *garbage = pack_garbage.items[i].string;
+ if (!stat(garbage, &s)) {
+ if (s.st_mtime < cutoff)
+ unlink_or_warn(garbage);
+ } else
+ fprintf(stderr, _("stat failed on pack garbage: %s"),
+ garbage);
+ }
  string_list_clear(&pack_garbage, 0);
 }

diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh
index cbcc0c0..7b4650f 100755
--- a/t/t5304-prune.sh
+++ b/t/t5304-prune.sh
@@ -272,6 +272,7 @@ test_expect_success 'clean pack garbage with gc' '
  : >.git/objects/pack/fake6.keep &&
  : >.git/objects/pack/fake6.bitmap &&
  : >.git/objects/pack/fake6.idx &&
+ sleep 10 &&
  git gc &&
  git count-objects -v 2>stderr &&
  grep "^warning:" stderr | sort >actual &&
@@ -291,6 +292,7 @@ test_expect_success 'ensure unknown garbage kept with gc' '
  : >.git/objects/pack/foo.keep &&
  : >.git/objects/pack/fake.pack &&
  : >.git/objects/pack/fake2.foo &&
+ sleep 10 &&
  git gc &&
  git count-objects -v 2>stderr &&
  grep "^warning:" stderr | sort >actual &&
-- 
2.6.1

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2016-01-13 17:14                                       ` Doug Kelly
@ 2016-01-13 20:08                                         ` Junio C Hamano
  2016-01-13 20:19                                           ` Doug Kelly
  0 siblings, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2016-01-13 20:08 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Jeff King, Git List, Eric Sunshine

Doug Kelly <dougk.ff7@gmail.com> writes:

> Yeah, I know I never got to adding the mtime logic, but for a simple (naive,
> hard-coded) case, I did come up with a basic patch today.  I think this could
> be extended to a configuration option(?) which would allow a default longer
> than 10 seconds (an hour? a day?), then during the regression tests, we
> could provide a shorter timeout to ensure the guarding both works and also
> not wait forever for tests to complete.  Thoughts?

Please do not sleep in the tests.  Instead, please try to see if you
can use test-chmtime to set the timestamps of these files to the
necessary ages for the purpose of your tests.

Thanks.

>
> ---
>  builtin/gc.c     | 14 ++++++++++++--
>  t/t5304-prune.sh |  2 ++
>  2 files changed, 14 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/gc.c b/builtin/gc.c
> index 79e9886..a4ce616 100644
> --- a/builtin/gc.c
> +++ b/builtin/gc.c
> @@ -51,8 +51,18 @@ static struct string_list pack_garbage =
> STRING_LIST_INIT_DUP;
>  static void clean_pack_garbage(void)
>  {
>   int i;
> - for (i = 0; i < pack_garbage.nr; i++)
> - unlink_or_warn(pack_garbage.items[i].string);
> + /* Define a cutoff time for "new" garbage to prevent race conditions */
> + time_t cutoff = time(NULL) - 10;
> + for (i = 0; i < pack_garbage.nr; i++) {
> + struct stat s;
> + char *garbage = pack_garbage.items[i].string;
> + if (!stat(garbage, &s)) {
> + if (s.st_mtime < cutoff)
> + unlink_or_warn(garbage);
> + } else
> + fprintf(stderr, _("stat failed on pack garbage: %s"),
> + garbage);
> + }
>   string_list_clear(&pack_garbage, 0);
>  }
>
> diff --git a/t/t5304-prune.sh b/t/t5304-prune.sh
> index cbcc0c0..7b4650f 100755
> --- a/t/t5304-prune.sh
> +++ b/t/t5304-prune.sh
> @@ -272,6 +272,7 @@ test_expect_success 'clean pack garbage with gc' '
>   : >.git/objects/pack/fake6.keep &&
>   : >.git/objects/pack/fake6.bitmap &&
>   : >.git/objects/pack/fake6.idx &&
> + sleep 10 &&
>   git gc &&
>   git count-objects -v 2>stderr &&
>   grep "^warning:" stderr | sort >actual &&
> @@ -291,6 +292,7 @@ test_expect_success 'ensure unknown garbage kept with gc' '
>   : >.git/objects/pack/foo.keep &&
>   : >.git/objects/pack/fake.pack &&
>   : >.git/objects/pack/fake2.foo &&
> + sleep 10 &&
>   git gc &&
>   git count-objects -v 2>stderr &&
>   grep "^warning:" stderr | sort >actual &&

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2016-01-13 20:08                                         ` Junio C Hamano
@ 2016-01-13 20:19                                           ` Doug Kelly
  2016-01-13 20:23                                             ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Doug Kelly @ 2016-01-13 20:19 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, Git List, Eric Sunshine

On Wed, Jan 13, 2016 at 2:08 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Doug Kelly <dougk.ff7@gmail.com> writes:
>
>> Yeah, I know I never got to adding the mtime logic, but for a simple (naive,
>> hard-coded) case, I did come up with a basic patch today.  I think this could
>> be extended to a configuration option(?) which would allow a default longer
>> than 10 seconds (an hour? a day?), then during the regression tests, we
>> could provide a shorter timeout to ensure the guarding both works and also
>> not wait forever for tests to complete.  Thoughts?
>
> Please do not sleep in the tests.  Instead, please try to see if you
> can use test-chmtime to set the timestamps of these files to the
> necessary ages for the purpose of your tests.
>
> Thanks.

Ah, thanks -- I actually didn't know about that.  I didn't intend this
to be a formal
patch (although I'm sorry the mail client clobbered all the
whitespace), but I will
keep this in mind for when I continue to refine this change.

However, back to the point: should the wait value be hard coded? Configurable
as a new option?  What should our default wait be?

Thanks!

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory
  2016-01-13 20:19                                           ` Doug Kelly
@ 2016-01-13 20:23                                             ` Jeff King
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2016-01-13 20:23 UTC (permalink / raw)
  To: Doug Kelly; +Cc: Junio C Hamano, Git List, Eric Sunshine

On Wed, Jan 13, 2016 at 02:19:05PM -0600, Doug Kelly wrote:

> However, back to the point: should the wait value be hard coded? Configurable
> as a new option?  What should our default wait be?

I'd think it would make sense to match the expiration time we feed to
prune. That's overly conservative, but I think that's OK (and the user
can tweak it down).

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2016-01-13 20:23 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-07-21 18:41 Question: .idx without .pack causes performance issues? Doug Kelly
2015-07-21 18:57 ` Junio C Hamano
2015-07-21 19:15   ` Junio C Hamano
2015-07-21 20:48     ` Junio C Hamano
2015-07-21 21:37       ` Doug Kelly
2015-08-03 22:17         ` Doug Kelly
2015-08-04  1:27           ` Junio C Hamano
2015-08-07 21:36             ` Doug Kelly
2015-08-07 22:27               ` Junio C Hamano
2015-08-13 18:02                 ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Doug Kelly
2015-08-13 18:02                   ` [PATCH 2/2] gc: Remove garbage .idx files from pack dir Doug Kelly
2015-08-17 16:35                     ` Junio C Hamano
2015-08-17 20:30                     ` Junio C Hamano
2015-08-13 18:46                   ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Eric Sunshine
2015-08-17 16:53                     ` Junio C Hamano
2015-10-28 17:48                       ` Junio C Hamano
2015-10-28 22:43                         ` Doug Kelly
2015-11-04  3:05                           ` [PATCH 1/3] " Doug Kelly
2015-11-04  3:05                             ` [PATCH 2/3] t5304: Add test for cleaning pack garbage Doug Kelly
2015-11-04  3:05                             ` [PATCH 3/3] gc: Remove garbage .idx files from pack dir Doug Kelly
2015-11-04  3:12                           ` [PATCH 1/2] prepare_packed_git(): refactor garbage reporting in pack directory Doug Kelly
2015-11-04 19:35                             ` Junio C Hamano
2015-11-04 19:56                               ` Doug Kelly
2015-11-04 20:02                                 ` Jeff King
2015-11-04 20:08                                   ` Doug Kelly
2015-11-04 20:15                                     ` Jeff King
2015-12-30  7:37                                     ` Jeff King
2016-01-13 17:14                                       ` Doug Kelly
2016-01-13 20:08                                         ` Junio C Hamano
2016-01-13 20:19                                           ` Doug Kelly
2016-01-13 20:23                                             ` Jeff King
2015-11-04 19:56                               ` Jeff King
     [not found]     ` <CABYiQpn7r2Vcf=S5RaWHBN85eBYGPV_e02+BY=4L98qfUzDT1Q@mail.gmail.com>
2015-11-11 14:58       ` Fwd: Question: .idx without .pack causes performance issues? Thomas Berg
2015-07-21 19:49   ` Doug Kelly

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).