From: "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com>
To: git@vger.kernel.org
Cc: peff@peff.net, jrnieder@google.com, stolee@gmail.com,
	Derrick Stolee <dstolee@microsoft.com>
Subject: [PATCH 06/15] run-job: auto-size or use custom pack-files batch
Date: Fri, 03 Apr 2020 20:48:05 +0000
Message-ID: <c39a35a959e485623aacdbbecc1fce7d16a956c9.1585946894.git.gitgitgadget@gmail.com>
In-Reply-To: <pull.597.git.1585946894.gitgitgadget@gmail.com>

From: Derrick Stolee <dstolee@microsoft.com>

When repacking during the 'pack-files' job, we use the --batch-size
option in 'git multi-pack-index repack'. The initial setting used
--batch-size=0 to repack everything into a single pack-file. This
is not sustainable for a large repository. The amount of work
required is also likely to use too many system resources for a
background job.

Update the 'git run-job pack-files' command to accept a --batch-size
option that overrides the value passed to the repack. Change the
default from "0" to a size computed from the existing pack-files.
While computing that size, count the existing pack-files and skip
the repack step if there are at most two pack-files.

The dynamic default targets a client repository that was cloned
from a very large remote: there is likely one "big" pack-file that
was created at clone time. Do not try to repack it, as the server
likely packed it efficiently. Instead, try packing the other
pack-files into a single pack-file.

The size is then computed as follows:

	batch size = total size - max pack size
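
As a rough illustration of the formula above (a sketch, not the
patch's code: the helper name sum_without_largest() and the use of
int64_t in place of off_t are assumptions made for this example),
the computation over a list of pack sizes in bytes could look like:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: return the sum of all pack sizes (in bytes)
 * except the largest one. With no packs, or a single pack, the
 * result is zero.
 */
static int64_t sum_without_largest(const int64_t *sizes, size_t nr)
{
	int64_t total = 0, max = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		assert(sizes[i] >= 0);	/* pack sizes are never negative */
		total += sizes[i];
		if (sizes[i] > max)
			max = sizes[i];
	}

	return total - max;
}
```

For packs of 5000, 300, 200 and 100 bytes, this yields 600, which
is exactly the combined size of the three small packs.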

The batch size is also capped at two gigabytes. This serves two
purposes. First, having a limit prevents doing too much work when
the repository is extremely large. Pack-files larger than two
gigabytes are likely to either contain large blobs or have been
carefully repacked by a previous repack operation. Second, two
gigabytes is the limit for a signed 32-bit int. That is a
convenient limit, and it stays well below the unsigned 32-bit int
limit. This matters because on Windows an "unsigned long" is 32
bits and OPT_MAGNITUDE() uses unsigned longs for its parsing logic.
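
The cap can be sketched as follows. This is an illustration under
stated assumptions, not the code in this patch: clamp_batch_size()
and BATCH_SIZE_CAP are hypothetical names, the cap value matches
the patch's TWO_GIGABYTES (2147483647), and uint32_t stands in for
a 32-bit "unsigned long" as found on Windows.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name; same value as TWO_GIGABYTES in the patch. */
#define BATCH_SIZE_CAP INT64_C(2147483647)

/*
 * Clamp a computed size (in bytes) to the cap so that the result
 * always fits in a 32-bit unsigned integer.
 */
static uint32_t clamp_batch_size(int64_t size)
{
	assert(size >= 0);	/* a batch size is never negative */

	if (size > BATCH_SIZE_CAP)
		size = BATCH_SIZE_CAP;

	return (uint32_t)size;
}
```

Because the cap equals INT32_MAX, the clamped value is roughly half
of UINT32_MAX, so it survives a round trip through a 32-bit
unsigned type without truncation.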

This calculation mimics a similar calculation in Scalar [1],
except for a 3% drop in the calculated batch size due to the
round-off error that can happen with the "expected size"
calculation for a pack-file.

[1] https://github.com/microsoft/scalar/blob/616e9b16dd120b8fdb652d6d5a55618c731a8aea/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs#L141-L143

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 13 +++--
 builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++---
 t/t7900-run-job.sh            |  6 ++-
 3 files changed, 96 insertions(+), 13 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 108ed25b8bd..cdd6417f7c9 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job (commit-graph|fetch|loose-objects|pack-files)'
+'git run-job (commit-graph|fetch|loose-objects|pack-files) [<options>]'
 
 
 DESCRIPTION
@@ -82,9 +82,14 @@ only happens if all objects in the pack-file are also stored in a newer
 pack-file. Second, it selects a group of pack-files whose "expected
 size" is below the batch size until the group has total expected size at
 least the batch size; see the `--batch-size` option for the `repack`
-subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
-zero, which is a special case that attempts to repack all pack-files
-into a single pack-file.
+subcommand in linkgit:git-multi-pack-index[1].
++
+The default batch size is computed to optimize for having a single large
+pack-file and many small pack-files. When there are two or fewer
+pack-files, the job does not attempt to repack. Otherwise, the batch
+size is the sum of all pack-file sizes minus the largest pack-file size.
+The batch size is capped at two gigabytes. The intent is to pack all
+small pack-files into a single pack-file.
 
 
 GIT
diff --git a/builtin/run-job.c b/builtin/run-job.c
index d3543f7ccb9..2ccc3bbae2d 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -3,12 +3,18 @@
 #include "commit-graph.h"
 #include "midx.h"
 #include "object-store.h"
+#include "packfile.h"
 #include "parse-options.h"
 #include "repository.h"
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
+	N_("git run-job (commit-graph|fetch|loose-objects|pack-files) [<options>]"),
+	NULL
+};
+
+static char const * const builtin_run_job_pack_file_usage[] = {
+	N_("git run-job pack-files [--batch-size=<size>]"),
 	NULL
 };
 
@@ -278,15 +284,74 @@ static int multi_pack_index_expire(void)
 	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
 }
 
-static int multi_pack_index_repack(void)
+#define TWO_GIGABYTES (2147483647)
+
+static off_t get_auto_pack_size(int *count)
+{
+	/*
+	 * The "auto" value is special: we optimize for
+	 * one large pack-file (i.e. from a clone) and
+	 * expect the rest to be small and they can be
+	 * repacked quickly. Find the sum of the sizes
+	 * other than the largest pack-file, then use
+	 * that as the batch size.
+	 */
+	off_t total_size = 0;
+	off_t max_size = 0;
+	off_t result_size;
+	struct packed_git *p;
+
+	*count = 0;
+
+	reprepare_packed_git(the_repository);
+	for (p = get_all_packs(the_repository); p; p = p->next) {
+		(*count)++;
+		total_size += p->pack_size;
+
+		if (p->pack_size > max_size)
+			max_size = p->pack_size;
+	}
+
+	result_size = total_size - max_size;
+
+	/* But limit ourselves to a batch size of 2g */
+	if (result_size > TWO_GIGABYTES)
+		result_size = TWO_GIGABYTES;
+
+	return result_size;
+}
+
+#define UNSET_BATCH_SIZE ((unsigned long)-1)
+static int multi_pack_index_repack(unsigned long batch_size)
 {
 	int result;
 	struct argv_array cmd = ARGV_ARRAY_INIT;
+	struct strbuf batch_arg = STRBUF_INIT;
+	int count;
+	off_t default_size = get_auto_pack_size(&count);
+
+	if (count <= 2)
+		return 0;
+
+	strbuf_addstr(&batch_arg, "--batch-size=");
+
+	if (batch_size != UNSET_BATCH_SIZE)
+		strbuf_addf(&batch_arg, "%"PRIuMAX, (uintmax_t)batch_size);
+	else
+		strbuf_addf(&batch_arg, "%"PRIuMAX,
+			    (uintmax_t)default_size);
+
 	argv_array_pushl(&cmd, "multi-pack-index", "repack",
-			 "--no-progress", "--batch-size=0", NULL);
+			 "--no-progress", batch_arg.buf, NULL);
 	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
 
-	if (result && multi_pack_index_verify()) {
+	strbuf_release(&batch_arg);
+
+	/*
+	 * Verify here to avoid verifying again when there are two
+	 * or fewer pack-files.
+	 */
+	if (!result && multi_pack_index_verify()) {
 		warning(_("multi-pack-index verify failed after repack"));
 		result = rewrite_multi_pack_index();
 	}
@@ -294,8 +359,19 @@ static int multi_pack_index_repack(void)
 	return result;
 }
 
-static int run_pack_files_job(void)
+static int run_pack_files_job(int argc, const char **argv)
 {
+	static unsigned long batch_size = UNSET_BATCH_SIZE;
+	static struct option builtin_run_job_pack_file_options[] = {
+		OPT_MAGNITUDE(0, "batch-size", &batch_size,
+			      N_("specify a batch-size for the incremental repack")),
+		OPT_END(),
+	};
+
+	argc = parse_options(argc, argv, NULL,
+			     builtin_run_job_pack_file_options,
+			     builtin_run_job_pack_file_usage, 0);
+
 	if (multi_pack_index_write()) {
 		error(_("failed to write multi-pack-index"));
 		return 1;
@@ -316,7 +392,7 @@ static int run_pack_files_job(void)
 		return rewrite_multi_pack_index();
 	}
 
-	if (multi_pack_index_repack()) {
+	if (multi_pack_index_repack(batch_size)) {
 		error(_("multi-pack-index repack failed"));
 		return 1;
 	}
@@ -348,7 +424,7 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 		if (!strcmp(argv[0], "loose-objects"))
 			return run_loose_objects_job();
 		if (!strcmp(argv[0], "pack-files"))
-			return run_pack_files_job();
+			return run_pack_files_job(argc, argv);
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 416ba04989d..2d9f6cdf328 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -128,8 +128,10 @@ test_expect_success 'pack-files job' '
 	# the job deletes the two old packs, and does not write
 	# a new one because only one pack remains.
 	git -C client run-job pack-files &&
-	ls client/.git/objects/pack/*.pack >packs-after &&
-	test_line_count = 1 packs-after
+	ls client/$packDir/*.pack >packs-after &&
+	test_line_count = 2 packs-after &&
+	cat packs-after | grep "pack/test-1-" &&
+	cat packs-after | grep "pack/pack-"
 '
 
 test_done
-- 
gitgitgadget


