All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: sandals@crustytoothpaste.net, steadmon@google.com,
	jrnieder@gmail.com, peff@peff.net, congdanhqx@gmail.com,
	phillip.wood123@gmail.com, emilyshaffer@google.com,
	sluongng@gmail.com, jonathantanmy@google.com,
	Derrick Stolee <derrickstolee@github.com>,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH v3 6/8] maintenance: add incremental-repack task
Date: Tue, 25 Aug 2020 18:36:37 +0000	[thread overview]
Message-ID: <0dd26bb584bb7e8b9616eb32f7b1235462df77fa.1598380599.git.gitgitgadget@gmail.com> (raw)
In-Reply-To: <pull.696.v3.git.1598380599.gitgitgadget@gmail.com>

From: Derrick Stolee <dstolee@microsoft.com>

The previous change cleaned up loose objects using the
'loose-objects' that can be run safely in the background. Add a
similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'incremental-repack' task runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced" which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updaes the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those objects. The set of pack-
   files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximates how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

   The next run of the incremental-repack task will delete these
   repacked pack-files during the 'expire' step.

   In this version, the batch size is set to "0" which ignores the
   size restrictions when selecting the pack-files. It instead
   selects all pack-files and repacks all packed objects into a
   single pack-file. This will be updated in the next change, but
   it requires doing some calculations that are better isolated to
   a separate change.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3 (midx: add packs to
packed_git linked list, 2019-04-29).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was done all without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-maintenance.txt | 15 ++++++
 builtin/gc.c                      | 77 +++++++++++++++++++++++++++++++
 midx.c                            |  2 +-
 midx.h                            |  1 +
 t/t5319-multi-pack-index.sh       |  1 +
 t/t7900-maintenance.sh            | 38 +++++++++++++++
 6 files changed, 133 insertions(+), 1 deletion(-)

diff --git a/Documentation/git-maintenance.txt b/Documentation/git-maintenance.txt
index fc95eb594f..b44efb05a3 100644
--- a/Documentation/git-maintenance.txt
+++ b/Documentation/git-maintenance.txt
@@ -85,6 +85,21 @@ loose-objects::
 	advisable to enable both the `loose-objects` and `gc` tasks at the
 	same time.
 
+incremental-repack::
+	The `incremental-repack` job repacks the object directory
+	using the `multi-pack-index` feature. In order to prevent race
+	conditions with concurrent Git commands, it follows a two-step
+	process. First, it deletes any pack-files included in the
+	`multi-pack-index` where none of the objects in the
+	`multi-pack-index` reference those pack-files; this only happens
+	if all objects in the pack-file are also stored in a newer
+	pack-file. Second, it selects a group of pack-files whose "expected
+	size" is below the batch size until the group has total expected
+	size at least the batch size; see the `--batch-size` option for
+	the `repack` subcommand in linkgit:git-multi-pack-index[1]. The
+	default batch-size is zero, which is a special case that attempts
+	to repack all pack-files into a single pack-file.
+
 OPTIONS
 -------
 --auto::
diff --git a/builtin/gc.c b/builtin/gc.c
index 25245bcc10..fbf84996fa 100644
--- a/builtin/gc.c
+++ b/builtin/gc.c
@@ -30,6 +30,7 @@
 #include "promisor-remote.h"
 #include "refs.h"
 #include "remote.h"
+#include "midx.h"
 
 #define FAILED_RUN "failed to run %s"
 
@@ -1001,6 +1002,77 @@ static int maintenance_task_loose_objects(struct maintenance_run_opts *opts)
 	return prune_packed(opts) || pack_loose(opts);
 }
 
+static int multi_pack_index_write(struct maintenance_run_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "write", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	if (run_command(&child))
+		return error(_("failed to write multi-pack-index"));
+
+	return 0;
+}
+
+static int multi_pack_index_expire(struct maintenance_run_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "expire", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	close_object_store(the_repository->objects);
+
+	if (run_command(&child))
+		return error(_("'git multi-pack-index expire' failed"));
+
+	return 0;
+}
+
+static int multi_pack_index_repack(struct maintenance_run_opts *opts)
+{
+	struct child_process child = CHILD_PROCESS_INIT;
+
+	child.git_cmd = 1;
+	strvec_pushl(&child.args, "multi-pack-index", "repack", NULL);
+
+	if (opts->quiet)
+		strvec_push(&child.args, "--no-progress");
+
+	strvec_push(&child.args, "--batch-size=0");
+
+	close_object_store(the_repository->objects);
+
+	if (run_command(&child))
+		return error(_("'git multi-pack-index repack' failed"));
+
+	return 0;
+}
+
+static int maintenance_task_incremental_repack(struct maintenance_run_opts *opts)
+{
+	prepare_repo_settings(the_repository);
+	if (!the_repository->settings.core_multi_pack_index) {
+		warning(_("skipping incremental-repack task because core.multiPackIndex is disabled"));
+		return 0;
+	}
+
+	if (multi_pack_index_write(opts))
+		return 1;
+	if (multi_pack_index_expire(opts))
+		return 1;
+	if (multi_pack_index_repack(opts))
+		return 1;
+	return 0;
+}
+
 typedef int maintenance_task_fn(struct maintenance_run_opts *opts);
 
 /*
@@ -1023,6 +1095,7 @@ struct maintenance_task {
 enum maintenance_task_label {
 	TASK_PREFETCH,
 	TASK_LOOSE_OBJECTS,
+	TASK_INCREMENTAL_REPACK,
 	TASK_GC,
 	TASK_COMMIT_GRAPH,
 
@@ -1040,6 +1113,10 @@ static struct maintenance_task tasks[] = {
 		maintenance_task_loose_objects,
 		loose_object_auto_condition,
 	},
+	[TASK_INCREMENTAL_REPACK] = {
+		"incremental-repack",
+		maintenance_task_incremental_repack,
+	},
 	[TASK_GC] = {
 		"gc",
 		maintenance_task_gc,
diff --git a/midx.c b/midx.c
index aa37d5da86..66d7053d83 100644
--- a/midx.c
+++ b/midx.c
@@ -37,7 +37,7 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index b18cf53bc4..baeecc70c9 100644
--- a/midx.h
+++ b/midx.h
@@ -37,6 +37,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS     (1 << 0)
 
+char *get_midx_filename(const char *object_dir);
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/t/t5319-multi-pack-index.sh b/t/t5319-multi-pack-index.sh
index ec87f616c6..2f942ee1fa 100755
--- a/t/t5319-multi-pack-index.sh
+++ b/t/t5319-multi-pack-index.sh
@@ -3,6 +3,7 @@
 test_description='multi-pack-indexes'
 . ./test-lib.sh
 
+GIT_TEST_MULTI_PACK_INDEX=0
 objdir=.git/objects
 
 midx_read_expect () {
diff --git a/t/t7900-maintenance.sh b/t/t7900-maintenance.sh
index efda1cf69b..dde28cf837 100755
--- a/t/t7900-maintenance.sh
+++ b/t/t7900-maintenance.sh
@@ -5,6 +5,7 @@ test_description='git maintenance builtin'
 . ./test-lib.sh
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
 
 test_expect_success 'help text' '
 	test_expect_code 129 git maintenance -h 2>err &&
@@ -150,4 +151,41 @@ test_expect_success 'maintenance.loose-objects.auto' '
 	done
 '
 
+test_expect_success 'incremental-repack task' '
+	packDir=.git/objects/pack &&
+	for i in $(test_seq 1 5)
+	do
+		test_commit $i || return 1
+	done &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+	echo HEAD~2 | git pack-objects --revs $packDir/test-1 &&
+	test_tick &&
+	git pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+	test_tick &&
+	git pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+	rm -f $packDir/pack-* &&
+	rm -f $packDir/loose-* &&
+	ls $packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git maintenance run --task=incremental-repack &&
+	ls $packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git maintenance run --task=incremental-repack &&
+	ls .git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
-- 
gitgitgadget


  parent reply	other threads:[~2020-08-25 18:37 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-08-06 16:30 [PATCH 0/9] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 1/9] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-13  0:03     ` Junio C Hamano
2020-08-13  1:45       ` Jonathan Nieder
2020-08-13  4:37       ` [PATCH v3] " Junio C Hamano
2020-08-14  1:13         ` Derrick Stolee
2020-08-14  1:32           ` Junio C Hamano
2020-08-06 16:30 ` [PATCH 2/9] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-14  1:28     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 3/9] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-08-12 23:10   ` Emily Shaffer
2020-08-14  1:46     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 4/9] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 5/9] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 6/9] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 7/9] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-08-06 16:30 ` [PATCH 8/9] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-06 17:02   ` Son Luong Ngoc
2020-08-06 18:13     ` Derrick Stolee
2020-08-06 16:30 ` [PATCH 9/9] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-08-18 14:25 ` [PATCH v2 0/9] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 1/9] fetch: optionally allow disabling FETCH_HEAD update Junio C Hamano via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 2/9] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 3/9] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 4/9] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 5/9] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 6/9] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 7/9] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 8/9] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-18 14:25   ` [PATCH v2 9/9] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-08-25 18:36   ` [PATCH v3 0/8] Maintenance II: prefetch, loose-objects, incremental-repack tasks Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` [PATCH v3 1/8] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-09-22 23:05       ` Jonathan Tan
2020-08-25 18:36     ` [PATCH v3 2/8] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-09-22 23:09       ` Jonathan Tan
2020-09-24 13:45         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 3/8] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-09-22 23:15       ` Jonathan Tan
2020-09-24 13:51         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 4/8] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-09-22 23:16       ` Jonathan Tan
2020-09-24 13:53         ` Derrick Stolee
2020-08-25 18:36     ` [PATCH v3 5/8] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` Derrick Stolee via GitGitGadget [this message]
2020-09-22 23:26       ` [PATCH v3 6/8] maintenance: add incremental-repack task Jonathan Tan
2020-09-24 14:05         ` Derrick Stolee
2020-09-24 22:01           ` Jonathan Tan
2020-08-25 18:36     ` [PATCH v3 7/8] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-08-25 18:36     ` [PATCH v3 8/8] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget
2020-09-22 23:52       ` Jonathan Tan
2020-08-25 20:59     ` [PATCH v3 0/8] Maintenance II: prefetch, loose-objects, incremental-repack tasks Junio C Hamano
2020-08-26 15:15     ` Son Luong Ngoc
2020-08-26 16:21       ` Derrick Stolee
2020-09-25 12:33     ` [PATCH v4 " Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 1/8] maintenance: add prefetch task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 2/8] maintenance: add loose-objects task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 3/8] maintenance: create auto condition for loose-objects Derrick Stolee via GitGitGadget
2020-09-25 18:00         ` Junio C Hamano
2020-09-25 18:43           ` Derrick Stolee
2020-09-25 12:33       ` [PATCH v4 4/8] midx: enable core.multiPackIndex by default Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 5/8] midx: use start_delayed_progress() Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 6/8] maintenance: add incremental-repack task Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 7/8] maintenance: auto-size incremental-repack batch Derrick Stolee via GitGitGadget
2020-09-25 12:33       ` [PATCH v4 8/8] maintenance: add incremental-repack auto condition Derrick Stolee via GitGitGadget

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=0dd26bb584bb7e8b9616eb32f7b1235462df77fa.1598380599.git.gitgitgadget@gmail.com \
    --to=gitgitgadget@gmail.com \
    --cc=congdanhqx@gmail.com \
    --cc=derrickstolee@github.com \
    --cc=dstolee@microsoft.com \
    --cc=emilyshaffer@google.com \
    --cc=git@vger.kernel.org \
    --cc=jonathantanmy@google.com \
    --cc=jrnieder@gmail.com \
    --cc=peff@peff.net \
    --cc=phillip.wood123@gmail.com \
    --cc=sandals@crustytoothpaste.net \
    --cc=sluongng@gmail.com \
    --cc=steadmon@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.