All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: peff@peff.net, jrnieder@google.com, stolee@gmail.com,
	Derrick Stolee <dstolee@microsoft.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH 05/15] run-job: implement pack-files job
Date: Fri, 03 Apr 2020 20:48:04 +0000	[thread overview]
Message-ID: <5277398f189dd6e1456b0afdd6d85dbac7d98d07.1585946894.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.597.git.1585946894.gitgitgadget@gmail.com>

From: Derrick Stolee <dstolee@microsoft.com>

The previous change cleaned up loose objects using the
'loose-objects' that can be run safely in the background. Add a
similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'pack-files' job runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced" which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updaes the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those objects. The set of pack-
   files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximats how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

   The next run of the pack-files job will delete these repacked
   pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.

Each of the above steps update the multi-pack-index file. After
each step, we verify the new multi-pack-index. If the new
multi-pack-index is corrupt, then delete the multi-pack-index,
rewrite it from scratch, and stop doing the later steps of the
job. This is intended to be an extra-safe check without leaving
a repo with many pack-files without a multi-pack-index.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3392 (midx: add packs to
packed_git linked list, 2019-04-29).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was done all without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 18 ++++++-
 builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++++-
 midx.c                        |  2 +-
 midx.h                        |  1 +
 t/t7900-run-job.sh            | 39 +++++++++++++++
 5 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 43ca1160b5a..108ed25b8bd 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job (commit-graph|fetch|loose-objects)'
+'git run-job (commit-graph|fetch|loose-objects|pack-files)'
 
 
 DESCRIPTION
@@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
 objects to prevent the job from taking too long on a repository with
 many loose objects.
 
+'pack-files'::
+
+The `pack-files` job incrementally repacks the object directory using
+the `multi-pack-index` feature. In order to prevent race conditions with
+concurrent Git commands, it follows a two-step process. First, it
+deletes any pack-files included in the `multi-pack-index` where none of
+the objects in the `multi-pack-index` reference those pack-files; this
+only happens if all objects in the pack-file are also stored in a newer
+pack-file. Second, it selects a group of pack-files whose "expected
+size" is below the batch size until the group has total expected size at
+least the batch size; see the `--batch-size` option for the `repack`
+subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
+zero, which is a special case that attempts to repack all pack-files
+into a single pack-file.
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/run-job.c b/builtin/run-job.c
index cecf9058c51..d3543f7ccb9 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -1,13 +1,14 @@
 #include "builtin.h"
 #include "config.h"
 #include "commit-graph.h"
+#include "midx.h"
 #include "object-store.h"
 #include "parse-options.h"
 #include "repository.h"
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job (commit-graph|fetch|loose-objects)"),
+	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
 	NULL
 };
 
@@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
 	return prune_packed() || pack_loose();
 }
 
+static int multi_pack_index_write(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "write",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int rewrite_multi_pack_index(void)
+{
+	char *midx_name = get_midx_filename(the_repository->objects->odb->path);
+
+	unlink(midx_name);
+	free(midx_name);
+
+	if (multi_pack_index_write()) {
+		error(_("failed to rewrite multi-pack-index"));
+		return 1;
+	}
+
+	return 0;
+}
+
+static int multi_pack_index_verify(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "verify",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int multi_pack_index_expire(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "expire",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int multi_pack_index_repack(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "repack",
+			 "--no-progress", "--batch-size=0", NULL);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	if (result && multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after repack"));
+		result = rewrite_multi_pack_index();
+	}
+
+	return result;
+}
+
+static int run_pack_files_job(void)
+{
+	if (multi_pack_index_write()) {
+		error(_("failed to write multi-pack-index"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after initial write"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_expire()) {
+		error(_("multi-pack-index expire failed"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after expire"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_repack()) {
+		error(_("multi-pack-index repack failed"));
+		return 1;
+	}
+
+	return 0;
+}
+
 int cmd_run_job(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_run_job_options[] = {
@@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 			return run_fetch_job();
 		if (!strcmp(argv[0], "loose-objects"))
 			return run_loose_objects_job();
+		if (!strcmp(argv[0], "pack-files"))
+			return run_pack_files_job();
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/midx.c b/midx.c
index 1527e464a7b..0f0d0a38812 100644
--- a/midx.c
+++ b/midx.c
@@ -36,7 +36,7 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index e6fa356b5ca..cf2c09dffc2 100644
--- a/midx.h
+++ b/midx.h
@@ -39,6 +39,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 
+char *get_midx_filename(const char *object_dir);
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 41da083257b..416ba04989d 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
 '
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
 
 . ./test-lib.sh
 
@@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
 	test_cmp packs-between packs-after
 '
 
+test_expect_success 'pack-files job' '
+	packDir=.git/objects/pack &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+
+	echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
+
+	test_tick &&
+	git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+
+	test_tick &&
+	git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+
+	rm -f client/$packDir/pack-* &&
+	rm -f client/$packDir/loose-* &&
+
+	ls client/$packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git -C client run-job pack-files &&
+	ls client/$packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git -C client run-job pack-files &&
+	ls client/.git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
-- 
gitgitgadget


  parent reply	other threads:[~2020-04-03 20:48 UTC|newest]

Thread overview: 57+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
2020-04-05 15:10   ` Phillip Wood
2020-04-05 19:21     ` Junio C Hamano
2020-04-06 14:42       ` Derrick Stolee
2020-04-07  0:58         ` Danh Doan
2020-04-07 10:54           ` Derrick Stolee
2020-04-07 14:16             ` Danh Doan
2020-04-07 14:30               ` Johannes Schindelin
2020-04-03 20:48 ` [PATCH 02/15] run-job: implement commit-graph job Derrick Stolee via GitGitGadget
2020-05-20 19:08   ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
2020-04-05 15:14   ` Phillip Wood
2020-04-06 12:48     ` Derrick Stolee
2020-04-05 20:28   ` Junio C Hamano
2020-04-06 12:46     ` Derrick Stolee
2020-05-20 19:08   ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 04/15] run-job: implement loose-objects job Derrick Stolee via GitGitGadget
2020-04-05 20:33   ` Junio C Hamano
2020-04-03 20:48 ` Derrick Stolee via GitGitGadget [this message]
2020-05-27 22:17   ` [PATCH 05/15] run-job: implement pack-files job Josh Steadmon
2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 07/15] config: add job.pack-files.batchSize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 08/15] job-runner: create builtin for job loop Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
2020-04-05 15:18   ` Phillip Wood
2020-04-06 12:49     ` Derrick Stolee
2020-04-05 15:41   ` Phillip Wood
2020-04-06 12:57     ` Derrick Stolee
2020-04-03 20:48 ` [PATCH 10/15] job-runner: use config to limit job frequency Derrick Stolee via GitGitGadget
2020-04-05 15:24   ` Phillip Wood
2020-04-03 20:48 ` [PATCH 11/15] job-runner: use config for loop interval Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 12/15] job-runner: add --interval=<span> option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 13/15] job-runner: skip a job if job.<job-name>.enabled is false Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 14/15] job-runner: add --daemonize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 15/15] runjob: customize the loose-objects batch size Derrick Stolee via GitGitGadget
2020-04-03 21:40 ` [PATCH 00/15] [RFC] Maintenance jobs and job runner Junio C Hamano
2020-04-04  0:16   ` Derrick Stolee
2020-04-07  0:50     ` Danh Doan
2020-04-07 10:59       ` Derrick Stolee
2020-04-07 14:26         ` Danh Doan
2020-04-07 14:43           ` Johannes Schindelin
2020-04-07  1:48     ` brian m. carlson
2020-04-07 20:08       ` Junio C Hamano
2020-04-07 22:23       ` Johannes Schindelin
2020-04-08  0:01         ` brian m. carlson
2020-05-27 22:39           ` Josh Steadmon
2020-05-28  0:47             ` Junio C Hamano
2020-05-27 21:52               ` Johannes Schindelin
2020-05-28 14:48                 ` Junio C Hamano
2020-05-28 14:50                 ` Jonathan Nieder
2020-05-28 14:57                   ` Junio C Hamano
2020-05-28 15:03                     ` Jonathan Nieder
2020-05-28 15:30                       ` Derrick Stolee
2020-05-28  4:39                         ` Johannes Schindelin
2020-05-02  7:56 [PATCH 05/15] run-job: implement pack-files job Son Luong Ngoc
2020-05-04 14:34 ` Derrick Stolee

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5277398f189dd6e1456b0afdd6d85dbac7d98d07.1585946894.git.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=jrnieder@google.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.