All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] repack: Add option to preserve and prune old pack files
@ 2017-03-10 22:00 James Melvin
  2017-03-10 23:43 ` Junio C Hamano
  2017-03-11 12:38 ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 6+ messages in thread
From: James Melvin @ 2017-03-10 22:00 UTC (permalink / raw)
  To: gitster; +Cc: git, nasserg, mfick, peff, sbeller, James Melvin

The new --preserve-and-prune option renames old pack files
instead of deleting them after repacking and prunes previously
preserved pack files.

This option is designed to prevent stale file handle exceptions
during git operations which can happen on users of NFS repos when
repacking is done on them. The strategy is to preserve old pack files
around until the next repack with the hopes that they will become
unreferenced by then and not cause any exceptions to running processes
when they are finally deleted (pruned).

Signed-off-by: James Melvin <jmelvin@codeaurora.org>
---
 Documentation/git-repack.txt |  4 +++
 builtin/repack.c             | 66 +++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-repack.txt b/Documentation/git-repack.txt
index 26afe6ed5..effd98a43 100644
--- a/Documentation/git-repack.txt
+++ b/Documentation/git-repack.txt
@@ -143,6 +143,10 @@ other objects in that pack they already have locally.
 	being removed. In addition, any unreachable loose objects will
 	be packed (and their loose counterparts removed).
 
+--preserve-and-prune::
+	 Preserve old pack files by renaming them instead of deleting. Prune any
+	 previously preserved pack files before preserving new ones.
+
 Configuration
 -------------
 
diff --git a/builtin/repack.c b/builtin/repack.c
index 677bc7c81..78fad8f1a 100644
--- a/builtin/repack.c
+++ b/builtin/repack.c
@@ -10,6 +10,7 @@
 
 static int delta_base_offset = 1;
 static int pack_kept_objects = -1;
+static int preserve_and_prune;
 static int write_bitmaps;
 static char *packdir, *packtmp;
 
@@ -108,6 +109,60 @@ static void get_non_kept_pack_filenames(struct string_list *fname_list)
 	closedir(dir);
 }
 
+/*
+ * Adds all preserved packs hex strings to the fname list
+ */
+static void get_preserved_pack_filenames(struct string_list *fname_list)
+{
+	DIR *dir;
+	struct dirent *e;
+	char *fname;
+
+	if (!(dir = opendir(packdir)))
+		return;
+
+	while ((e = readdir(dir)) != NULL) {
+		size_t len;
+		if (!strip_suffix(e->d_name, ".old-pack", &len))
+			continue;
+
+		fname = xmemdupz(e->d_name, len);
+		string_list_append_nodup(fname_list, fname);
+	}
+	closedir(dir);
+}
+
+static void remove_preserved_packs(void) {
+	const char *exts[] = {"pack", "idx", "keep", "bitmap"};
+	int i;
+	struct string_list names = STRING_LIST_INIT_DUP;
+	struct string_list_item *item;
+
+	get_preserved_pack_filenames(&names);
+
+	for_each_string_list_item(item, &names) {
+		for (i = 0; i < ARRAY_SIZE(exts); i++) {
+			char *fname;
+			fname = mkpathdup("%s/%s.old-%s",
+					  packdir,
+					  item->string,
+					  exts[i]);
+			if (remove_path(fname))
+				warning(_("failed to remove '%s'"), fname);
+			free(fname);
+		}
+	}
+}
+
+static void preserve_pack(const char *file_path, const char *file_name,  const char *file_ext)
+{
+	char *fname_old;
+
+	fname_old = mkpathdup("%s/%s.old-%s", packdir, file_name, ++file_ext);
+	rename(file_path, fname_old);
+	free(fname_old);
+}
+
 static void remove_redundant_pack(const char *dir_name, const char *base_name)
 {
 	const char *exts[] = {".pack", ".idx", ".keep", ".bitmap"};
@@ -121,7 +176,10 @@ static void remove_redundant_pack(const char *dir_name, const char *base_name)
 	for (i = 0; i < ARRAY_SIZE(exts); i++) {
 		strbuf_setlen(&buf, plen);
 		strbuf_addstr(&buf, exts[i]);
-		unlink(buf.buf);
+		if (preserve_and_prune)
+			preserve_pack(buf.buf, base_name, exts[i]);
+		else
+			unlink(buf.buf);
 	}
 	strbuf_release(&buf);
 }
@@ -194,6 +252,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 				N_("maximum size of each packfile")),
 		OPT_BOOL(0, "pack-kept-objects", &pack_kept_objects,
 				N_("repack objects in packs marked with .keep")),
+		OPT_BOOL(0, "preserve-and-prune", &preserve_and_prune,
+				N_("preserve old pack files by renaming them instead of deleting, prune any "
+						"previously preserved pack files before preserving new ones")),
 		OPT_END()
 	};
 
@@ -404,6 +465,9 @@ int cmd_repack(int argc, const char **argv, const char *prefix)
 
 	/* End of pack replacement. */
 
+	if (preserve_and_prune)
+		remove_preserved_packs();
+
 	if (delete_redundant) {
 		int opts = 0;
 		string_list_sort(&names);
-- 
2.12.0.191.ga15447d9f


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] repack: Add option to preserve and prune old pack files
  2017-03-10 22:00 [PATCH v2] repack: Add option to preserve and prune old pack files James Melvin
@ 2017-03-10 23:43 ` Junio C Hamano
  2017-03-12 12:24   ` Jeff King
  2017-03-11 12:38 ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2017-03-10 23:43 UTC (permalink / raw)
  To: James Melvin; +Cc: git, nasserg, mfick, peff, sbeller

James Melvin <jmelvin@codeaurora.org> writes:

> The new --preserve-and-prune option renames old pack files
> instead of deleting them after repacking and prunes previously
> preserved pack files.
>
> This option is designed to prevent stale file handle exceptions
> during git operations which can happen on users of NFS repos when
> repacking is done on them. The strategy is to preserve old pack files
> around until the next repack with the hopes that they will become
> unreferenced by then and not cause any exceptions to running processes
> when they are finally deleted (pruned).

This certainly is simpler than the previous one, but if you run

	git repack --preserve-and-prune 
	sleep for N minutes
	git repack --preserve-and-prune 

the second "repack" will drop the obsoleted packs that were
preserved by the first one no matter how short the value of N is,
no?

I suspect that users would be more comfortable with something based
on expiration timestamp that gives them a guaranteed grace period,
e.g. "--preserve-expire=8.hours", like how we expire reflog entries
and loose objects.

Perhaps builtin/prune.c can be a source of inspiration?

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] repack: Add option to preserve and prune old pack files
  2017-03-10 22:00 [PATCH v2] repack: Add option to preserve and prune old pack files James Melvin
  2017-03-10 23:43 ` Junio C Hamano
@ 2017-03-11 12:38 ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 6+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2017-03-11 12:38 UTC (permalink / raw)
  To: James Melvin
  Cc: Junio C Hamano, Git Mailing List, nasserg, mfick, Jeff King,
	Stefan Beller

On Fri, Mar 10, 2017 at 11:00 PM, James Melvin <jmelvin@codeaurora.org> wrote:
> The new --preserve-and-prune option renames old pack files
> instead of deleting them after repacking and prunes previously
> preserved pack files.

I think some of this rationale...

> This option is designed to prevent stale file handle exceptions
> during git operations which can happen on users of NFS repos when
> repacking is done on them. The strategy is to preserve old pack files
> around until the next repack with the hopes that they will become
> unreferenced by then and not cause any exceptions to running processes
> when they are finally deleted (pruned).

...belongs in the actual docs, i.e. here:

> +--preserve-and-prune::
> +        Preserve old pack files by renaming them instead of deleting. Prune any
> +        previously preserved pack files before preserving new ones.
> +

This is a really obscure option with an obscure use-case, let's
explain that in the docs, and also mention NFS & what problems it
solves there, so someone having the same issue can find the solution
more easily.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] repack: Add option to preserve and prune old pack files
  2017-03-10 23:43 ` Junio C Hamano
@ 2017-03-12 12:24   ` Jeff King
  2017-03-12 18:03     ` Junio C Hamano
  0 siblings, 1 reply; 6+ messages in thread
From: Jeff King @ 2017-03-12 12:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: James Melvin, git, nasserg, mfick, sbeller

On Fri, Mar 10, 2017 at 03:43:43PM -0800, Junio C Hamano wrote:

> James Melvin <jmelvin@codeaurora.org> writes:
> 
> > The new --preserve-and-prune option renames old pack files
> > instead of deleting them after repacking and prunes previously
> > preserved pack files.
> >
> > This option is designed to prevent stale file handle exceptions
> > during git operations which can happen on users of NFS repos when
> > repacking is done on them. The strategy is to preserve old pack files
> > around until the next repack with the hopes that they will become
> > unreferenced by then and not cause any exceptions to running processes
> > when they are finally deleted (pruned).
> 
> This certainly is simpler than the previous one, but if you run
> 
> 	git repack --preserve-and-prune 
> 	sleep for N minutes
> 	git repack --preserve-and-prune 
> 
> the second "repack" will drop the obsoleted packs that were
> preserved by the first one no matter how short the value of N is,
> no?
> 
> I suspect that users would be more comfortable with something based
> on expiration timestamp that gives them a guaranteed grace period,
> e.g. "--preserve-expire=8.hours", like how we expire reflog entries
> and loose objects.
> 
> Perhaps builtin/prune.c can be a source of inspiration?

I have been a bit slow to read this series, but FWIW I had the exact
same thought while reading up to this point. There should be an
expiration, and that expiration time corresponds roughly to "how long do
you expect a long-running git operation to last".

You'd probably want "--preserve-expire" as an option to repack, and then
a "gc.preservePackExpire" option so that "git gc" triggers it reliably.

I can think of one downside of a time-based solution, though: if you run
multiple gc's during the time period, you may end up using a lot of disk
space (one repo's worth per gc). But that's a fundamental tension in the
problem space; the whole point is to waste disk to keep helping old
processes. But you may want a knob that lets you slide between those two
things. For instance, if you kept a sliding window of N sets of
preserved packs, and ejected from one end of the window (regardless of
time), while inserting into the other end. James' existing patch is that
same strategy with a hardcoded window of "1".

The other variable you can manipulate is whether to gc in the first
place. E.g., don't gc if there are N preserved sets (or sets consuming
more than N bytes, or whatever). You could do that check outside of git
entirely (or in an auto-gc hook, if you're using it).

Note that something like a sliding window gets complicated pretty
quickly. I'm not really proposing it as a direction; I'm just trying to
think of ways the time-based system could go wrong. IMHO it would
probably be fine to just ignore the problem and assume that repacking
doesn't happen often enough for it to come up in practice.

-Peff

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] repack: Add option to preserve and prune old pack files
  2017-03-12 12:24   ` Jeff King
@ 2017-03-12 18:03     ` Junio C Hamano
  2017-03-13 16:39       ` Martin Fick
  0 siblings, 1 reply; 6+ messages in thread
From: Junio C Hamano @ 2017-03-12 18:03 UTC (permalink / raw)
  To: Jeff King; +Cc: James Melvin, git, nasserg, mfick, sbeller

Jeff King <peff@peff.net> writes:

> I can think of one downside of a time-based solution, though: if you run
> multiple gc's during the time period, you may end up using a lot of disk
> space (one repo's worth per gc). But that's a fundamental tension in the
> problem space; the whole point is to waste disk to keep helping old
> processes.

Yes.  If you want to help a process that mmap's a packfile and wants
to keep using it for N seconds, no matter how many times somebody
else ran "git repack" while you are doing your work within that
timeframe, you somehow need to make sure the NFS server knows the
file is still in use for that N seconds.

> But you may want a knob that lets you slide between those two
> things. For instance, if you kept a sliding window of N sets of
> preserved packs, and ejected from one end of the window (regardless of
> time), while inserting into the other end. James' existing patch is that
> same strategy with a hardcoded window of "1".

Again, yes.  But then the user does not get any guarantee of how
long-living a process the user can have without getting broken by
the NFS server losing track of a packfile that is still in use.  My
suggestion for the "expiry" based approach is essentially that I do
not see a useful tradeoff afforded by having such a knob.

> The other variable you can manipulate is whether to gc in the first
> place. E.g., don't gc if there are N preserved sets (or sets consuming
> more than N bytes, or whatever). You could do that check outside of git
> entirely (or in an auto-gc hook, if you're using it).

Yes, "don't gc/repack more than once within N seconds" may also be
an alternative and may generally be more useful by covering general
source of wastage coming from doing gc too frequently, not necessarily
limited to preserved pack accumulation.



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH v2] repack: Add option to preserve and prune old pack files
  2017-03-12 18:03     ` Junio C Hamano
@ 2017-03-13 16:39       ` Martin Fick
  0 siblings, 0 replies; 6+ messages in thread
From: Martin Fick @ 2017-03-13 16:39 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Jeff King, James Melvin, git, nasserg, sbeller

On Sunday, March 12, 2017 11:03:44 AM Junio C Hamano wrote:
> Jeff King <peff@peff.net> writes:
> > I can think of one downside of a time-based solution,
> > though: if you run multiple gc's during the time
> > period, you may end up using a lot of disk space (one
> > repo's worth per gc). But that's a fundamental tension
> > in the problem space; the whole point is to waste disk
> > to keep helping old processes.
> 
> Yes.  If you want to help a process that mmap's a packfile
> and wants to keep using it for N seconds, no matter how
> many times somebody else ran "git repack" while you are
> doing your work within that timeframe, you somehow need
> to make sure the NFS server knows the file is still in
> use for that N seconds.
> 
> > But you may want a knob that lets you slide between
> > those two things. For instance, if you kept a sliding
> > window of N sets of preserved packs, and ejected from
> > one end of the window (regardless of time), while
> > inserting into the other end. James' existing patch is
> > that same strategy with a hardcoded window of "1".
> 
> Again, yes.  But then the user does not get any guarantee
> of how long-living a process the user can have without
> getting broken by the NFS server losing track of a
> packfile that is still in use.  My suggestion for the
> "expiry" based approach is essentially that I do not see
> a useful tradeoff afforded by having such a knob.
> > The other variable you can manipulate is whether to gc
> > in the first place. E.g., don't gc if there are N
> > preserved sets (or sets consuming more than N bytes, or
> > whatever). You could do that check outside of git
> > entirely (or in an auto-gc hook, if you're using it).
> Yes, "don't gc/repack more than once within N seconds" may
> also be an alternative and may generally be more useful
> by covering general source of wastage coming from doing
> gc too frequently, not necessarily limited to preserved
> pack accumulation.

As someone who helps manage a Gerrit server for several 
thousand repos, all on the same NFS disks, a time based 
expiry seems unpractical, and not something that I am very 
interested in having.  I favor the simpler (single for now) 
repacking cycle approach, and it is what we have been using 
for almost 6 months now successfully, without suffering any 
more stale file handle exceptions.

While time is indeed the factor that is going to determine 
whether someone is going to see the stale file handles or 
not, on a server (which is what this feature is aimed at), 
this is secondary in my mind to predictability about space 
utilization.  I have no specific minimum time that I can 
reason about, i.e. I cannot reasonably say "I want all 
operations that last less than 1 hour, 1 min, or 1 second... 
to succeed".  I don't really want ANY failures, and I am 
willing to sacrifice some disk space to prevent as many as 
possible.  So the question to me is "How much disk space am 
I willing to sacrifice?", not "How long do I want operations 
to be able to last?".  The only way that time enters my 
equation is to compare it to how long repacking takes, i.e. 
I want the preserved files cleaned up on the next repack.   
So effectively I am choosing a repacking cycle based 
approach, so that I can reasonably predict the extra disk 
space that I need to reserve for my collection of repos.  
With a single cycle, I am effectively doubling the "static" 
size of repos.  

To achieve this predictability with a time based approach 
requires coordination between the expiry setting and the 
repacking time cycle.  This coordination is extra effort for 
me, with no apparent gain.  It is also an additional risk 
that I don't want to have.  If I decide to bump up how often 
I run repacking, and I forget to reduce the expiry time, my 
disk utilization will grow and potentially cause serious 
issues for all my repositories (since they share the same 
volume).  This problem is even more difficult if I decide to 
use a usage (instead of time) based algorithm to determine 
when I repack.

Admittedly, a repacking cycle based approach happens to be 
very easy and practical when it is a "single" cycle.  If I 
determine eventually empirically that a single cycle is not 
long enough for my server, I don't know what I will do?  
Perhaps I would then want a switch that preserves the repos 
for another cycle?  Maybe it could work the way that log 
rotation works, add a number to the end of each file name for 
each preserved cycle?  This option seems preferable to me 
than a time based approach because it makes it more obvious 
what the impact on disk utilization will be.  However, so 
far in practice, this does not seem necessary.

I don't really see a good use case for a time based expiry 
(other than "this is how it was done for other things in 
git").  Of course, that doesn't mean such a use case doesn't 
exist, but I don't support adding a feature unless I really 
understand why and how someone would want to use it.  I 
think that a time based expiry should only be added if 
someone has a specific use case they expect to achieve with 
it, and they actually plan to use it that way, not just for 
uniformity. 

One might even eventually decide that some of the other 
current use cases for time based expiries should be 
converted to cycle based expiries; I suspect server admins 
will have fewer surprises that way,

-Martin

-- 
The Qualcomm Innovation Center, Inc. is a member of Code 
Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2017-03-13 16:40 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-10 22:00 [PATCH v2] repack: Add option to preserve and prune old pack files James Melvin
2017-03-10 23:43 ` Junio C Hamano
2017-03-12 12:24   ` Jeff King
2017-03-12 18:03     ` Junio C Hamano
2017-03-13 16:39       ` Martin Fick
2017-03-11 12:38 ` Ævar Arnfjörð Bjarmason

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.