* Re: [PATCH 05/15] run-job: implement pack-files job
@ 2020-05-02  7:56 Son Luong Ngoc
  2020-05-04 14:34 ` Derrick Stolee
  0 siblings, 1 reply; 4+ messages in thread

From: Son Luong Ngoc @ 2020-05-02 7:56 UTC (permalink / raw)
To: gitgitgadget; +Cc: dstolee, git, jrnieder, peff, stolee

Hi Derrick,

Sorry for another late reply on this RFC.
This time I have a question about the multi-pack-index repack process.

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index cecf9058c51..d3543f7ccb9 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c

...

> +static int multi_pack_index_repack(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "repack",
> +			 "--no-progress", "--batch-size=0", NULL);
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	if (result && multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after repack"));
> +		result = rewrite_multi_pack_index();
> +	}

It's a bit inconsistent that write() and expire() do not include
verify() within them, but repack() does. What makes repack() different?

> +
> +	return result;
> +}
> +
> +static int run_pack_files_job(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}

Why expire *before* repack and not after?

I thought `core.multiPackIndex=true` would prevent old pack files from
being used, so expiring immediately after repack would be safe? (On
that note, are users required to set this config prior to running the
job?)

If expiring immediately after repack()+verify() is not safe, then
should we have a minimum allowed interval set? (Although I would
prefer to make expire() safe.)

> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}

Again, I just think the decision to include verify() inside repack()
made this part a bit inconsistent.

> +
> +	return 0;
> +}
> +

Cheers,
Son Luong

^ permalink raw reply	[flat|nested] 4+ messages in thread
* Re: [PATCH 05/15] run-job: implement pack-files job
  2020-05-02  7:56 [PATCH 05/15] run-job: implement pack-files job Son Luong Ngoc
@ 2020-05-04 14:34 ` Derrick Stolee
  0 siblings, 0 replies; 4+ messages in thread

From: Derrick Stolee @ 2020-05-04 14:34 UTC (permalink / raw)
To: Son Luong Ngoc, gitgitgadget; +Cc: dstolee, git, jrnieder, peff

On 5/2/2020 3:56 AM, Son Luong Ngoc wrote:
> Hi Derrick,
>
> Sorry for another late reply on this RFC.
> This time is a question on the multi-pack-index repack process.
>
> Why expire *before* repack and not after?
>
> I thought `core.multiPackIndex=true` would prevent old pack files from
> being used thus expiring immediately after repack is safe? (on that
> note, are users required to set this config prior to running the job?)
>
> If expiring immediately after repack()+verify() is not safe, then should
> we have a minimum allowed interval set? (although I would preferred to
> make expire() safe)

Suppose we run "repack" and then "expire". Suppose that there is a
concurrent Git process that has a read handle on the multi-pack-index
from before the repack. This multi-pack-index has a pack-file that was
repacked by the "repack" command, and hence was expired and deleted by
the "expire" command (because all of its objects are in the _new_
pack). However, when the Git process with the old multi-pack-index
reads an object pointing to that pack-file, it finds that the pack
does not exist when it tries to load it.

Git is usually pretty resilient to these kinds of things, but it
seemed prudent to be extra careful here. By spacing out these jobs
across a time where we don't expect concurrent Git processes to hold a
read handle on that old multi-pack-index (say, 24 hours), this method
is safe.

In fact, Scalar ensures that no concurrent Git process is running at
the start of the job [1], which means that no Git processes are
_still_ running since the last job (but this does not stop concurrent
processes from starting after that point and before the 'expire'
command).

[1] https://github.com/microsoft/scalar/blob/489764b815dfea32bec5cd4228f2157766012784/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs#L106-L111

>> +
>> +	if (multi_pack_index_verify()) {
>> +		warning(_("multi-pack-index verify failed after expire"));
>> +		return rewrite_multi_pack_index();
>> +	}
>> +
>> +	if (multi_pack_index_repack()) {
>> +		error(_("multi-pack-index repack failed"));
>> +		return 1;
>> +	}
>
> Again, I just think the decision to include verify() inside repack()
> made this part a bit inconsistent.

You're right. It is a bit inconsistent. That should be fixed in the
next version.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 4+ messages in thread
* [PATCH 00/15] [RFC] Maintenance jobs and job runner
@ 2020-04-03 20:47 Derrick Stolee via GitGitGadget
  2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
  0 siblings, 1 reply; 4+ messages in thread

From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:47 UTC (permalink / raw)
To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee

Background maintenance. I've blogged about it [1]. I've talked about it
[2]. I brought it up at the contributor summit [3]. I think it's
important. It's a critical feature in Scalar and VFS for Git. It's the
main reason we created the "scalar register" feature for existing repos
(so the Scalar background maintenance would apply to a "vanilla" Git
repo).

[1] https://devblogs.microsoft.com/devops/introducing-scalar/
[2] https://stolee.dev/docs/git-merge-2020.pdf
[3] https://lore.kernel.org/git/35FDF767-CE0C-4C51-88A8-12965CD2D4FF@jramsay.com.au/

This RFC does the "maintenance" half of "background maintenance". It
creates two new builtins to assist users in scheduling their own
background maintenance:

* git run-job <job-name>: This builtin will run a single instance of a
  maintenance job.

* git job-runner [--repo=<path>]: This builtin will run an infinite
  loop that executes git run-job as a subcommand.

I believe that I could replace most maintenance steps in Scalar with
the git run-job command and all that we would lose is some logging.
(The only one that would remain is the one that sets recommended config
settings, but that could be changed to a one-time operation at service
start-up.) I haven't tested this yet, but I plan to before sending a
non-RFC version.

Of course, just because we think these maintenance operations work for
our scenario does not mean that they immediately work as the
"canonical" maintenance activities. I attempt to demonstrate the value
of these jobs as they are implemented. My perspective is skewed towards
very large repositories (2000+ full-time contributors), but even
medium-size repositories (100-200 full-time contributors) can benefit
from these jobs.

I've tried to split the series into logical commits. This includes each
specific maintenance job being completely described in its own commit
(plus maybe an extra one to allow a custom option). Most commit
messages have "RFC QUESTION(S)" for things I was unsure about.

I'm currently testing this locally by setting job.repo in my global
config to be a few important repos on my Linux VM, then running

  GIT_TRACE2_PERF=<path> git job-runner --daemonize

to launch a background process that logs the subcommands to <path>.

Here I will call out things that could definitely be improved:

1. The git job-runner process is not tested AT ALL. It's a bit tricky
   to test it since it shouldn't exit when things work as expected. I
   expect to create a --no-loop option to stop after one iteration of
   the job loop. But I would like a bit more feedback on the concept
   before jumping into those tests. (I do test the git run-job builtin
   for each job, but not the customization through arguments and
   config.) Perhaps we could do some funny business about mocking git
   using --exec-path to check that it is calling git run-job in the
   correct order (and more importantly, not calling it when certain
   config settings are present).

2. The --daemonize option at the end is shamelessly stolen from git gc
   --daemonize and git daemon, but has limited abilities on some
   platforms (I've tested on Linux and macOS). I have not done my
   research on how far this gets us to allowing users to launch this
   at startup or something.

3. As I said, this is the "maintenance" "half" of "background
   maintenance". The "background" part is harder in my opinion because
   it involves creating platform-specific ways to consistently launch
   background processes. For example, Unix systems should have one way
   to service start X while Windows has another. macOS has launchd to
   launch processes as users log in, which should be a good way
   forward. Scalar implements a Windows Service that runs as root but
   impersonates the latest user to log in, and it implements a macOS
   "service" that is running only with the current user. I expect to
   need to create these services myself as a follow-up, but I lack the
   expertise to do it (currently). If someone else has experience
   creating these things and wants to take over or advise that half
   then I would appreciate the help!

4. I noticed late in the RFC process that I'm not clearing my
   argv_arrays carefully in the job-runner. This will need to be
   rectified and carefully checked with valgrind before merging this
   code. While it leaks memory very slowly, it will be important that
   we don't leak any memory at all since this is a long-lived process.
   There's also some places where I was careful to not include too
   much of libgit.a to help keep the memory footprint low.

Thanks,
-Stolee

Derrick Stolee (15):
  run-job: create barebones builtin
  run-job: implement commit-graph job
  run-job: implement fetch job
  run-job: implement loose-objects job
  run-job: implement pack-files job
  run-job: auto-size or use custom pack-files batch
  config: add job.pack-files.batchSize option
  job-runner: create builtin for job loop
  job-runner: load repos from config by default
  job-runner: use config to limit job frequency
  job-runner: use config for loop interval
  job-runner: add --interval=<span> option
  job-runner: skip a job if job.<job-name>.enabled is false
  job-runner: add --daemonize option
  runjob: customize the loose-objects batch size

 .gitignore                       |   2 +
 Documentation/config.txt         |   2 +
 Documentation/config/job.txt     |  37 +++
 Documentation/git-job-runner.txt |  63 +++++
 Documentation/git-run-job.txt    | 102 +++++++
 Makefile                         |   2 +
 builtin.h                        |   2 +
 builtin/job-runner.c             | 347 +++++++++++++++++++++++
 builtin/run-job.c                | 458 +++++++++++++++++++++++++++++++
 cache.h                          |   4 +-
 command-list.txt                 |   2 +
 commit-graph.c                   |   2 +-
 commit-graph.h                   |   1 +
 daemon.h                         |   7 +
 git.c                            |   2 +
 midx.c                           |   2 +-
 midx.h                           |   1 +
 t/t7900-run-job.sh               | 137 +++++++++
 18 files changed, 1168 insertions(+), 5 deletions(-)
 create mode 100644 Documentation/config/job.txt
 create mode 100644 Documentation/git-job-runner.txt
 create mode 100644 Documentation/git-run-job.txt
 create mode 100644 builtin/job-runner.c
 create mode 100644 builtin/run-job.c
 create mode 100644 daemon.h
 create mode 100755 t/t7900-run-job.sh

base-commit: 9fadedd637b312089337d73c3ed8447e9f0aa775
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-597%2Fderrickstolee%2Fjobs-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-597/derrickstolee/jobs-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/597
--
gitgitgadget

^ permalink raw reply	[flat|nested] 4+ messages in thread
* [PATCH 05/15] run-job: implement pack-files job
  2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
@ 2020-04-03 20:48 ` Derrick Stolee via GitGitGadget
  2020-05-27 22:17   ` Josh Steadmon
  0 siblings, 1 reply; 4+ messages in thread

From: Derrick Stolee via GitGitGadget @ 2020-04-03 20:48 UTC (permalink / raw)
To: git; +Cc: peff, jrnieder, stolee, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <dstolee@microsoft.com>

The previous change cleaned up loose objects using the
'loose-objects' job that can be run safely in the background. Add a
similar job that performs similar cleanups for pack-files.

One issue with running 'git repack' is that it is designed to
repack all pack-files into a single pack-file. While this is the
most space-efficient way to store object data, it is not time or
memory efficient. This becomes extremely important if the repo is
so large that a user struggles to store two copies of the pack on
their disk.

Instead, perform an "incremental" repack by collecting a few small
pack-files into a new pack-file. The multi-pack-index facilitates
this process ever since 'git multi-pack-index expire' was added in
19575c7 (multi-pack-index: implement 'expire' subcommand,
2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1
(midx: implement midx_repack(), 2019-06-10).

The 'pack-files' job runs the following steps:

1. 'git multi-pack-index write' creates a multi-pack-index file if
   one did not exist, and otherwise will update the multi-pack-index
   with any new pack-files that appeared since the last write. This
   is particularly relevant with the background fetch job.

   When the multi-pack-index sees two copies of the same object, it
   stores the offset data into the newer pack-file. This means that
   some old pack-files could become "unreferenced", which I will use
   to mean "a pack-file that is in the pack-file list of the
   multi-pack-index but none of the objects in the multi-pack-index
   reference a location inside that pack-file."

2. 'git multi-pack-index expire' deletes any unreferenced pack-files
   and updaes the multi-pack-index to drop those pack-files from the
   list. This is safe to do as concurrent Git processes will see the
   multi-pack-index and not open those packs when looking for object
   contents. (Similar to the 'loose-objects' job, there are some Git
   commands that open pack-files regardless of the multi-pack-index,
   but they are rarely used. Further, a user that self-selects to
   use background operations would likely refrain from using those
   commands.)

3. 'git multi-pack-index repack --bacth-size=<size>' collects a set
   of pack-files that are listed in the multi-pack-index and creates
   a new pack-file containing the objects whose offsets are listed
   by the multi-pack-index to be in those objects. The set of pack-
   files is selected greedily by sorting the pack-files by modified
   time and adding a pack-file to the set if its "expected size" is
   smaller than the batch size, until the total expected size of the
   selected pack-files is at least the batch size. The "expected
   size" is calculated by taking the size of the pack-file divided
   by the number of objects in the pack-file and multiplied by the
   number of objects from the multi-pack-index with offset in that
   pack-file. The expected size approximats how much data from that
   pack-file will contribute to the resulting pack-file size. The
   intention is that the resulting pack-file will be close in size
   to the provided batch size.

The next run of the pack-files job will delete these repacked
pack-files during the 'expire' step.

In this version, the batch size is set to "0", which ignores the
size restrictions when selecting the pack-files. It instead
selects all pack-files and repacks all packed objects into a
single pack-file. This will be updated in the next change, but
it requires doing some calculations that are better isolated to
a separate change.

Each of the above steps updates the multi-pack-index file. After
each step, we verify the new multi-pack-index. If the new
multi-pack-index is corrupt, then delete the multi-pack-index,
rewrite it from scratch, and stop doing the later steps of the
job. This is intended to be an extra-safe check without leaving
a repo with many pack-files without a multi-pack-index.

These steps are based on a similar background maintenance step in
Scalar (and VFS for Git) [1]. This was incredibly effective for
users of the Windows OS repository. After using the same VFS for Git
repository for over a year, some users had _thousands_ of pack-files
that combined to up to 250 GB of data. We noticed a few users were
running into the open file descriptor limits (due in part to a bug
in the multi-pack-index fixed by af96fe3392 (midx: add packs to
packed_git linked list, 2019-04-29)).

These pack-files were mostly small since they contained the commits
and trees that were pushed to the origin in a given hour. The GVFS
protocol includes a "prefetch" step that asks for pre-computed pack-
files containing commits and trees by timestamp. These pack-files
were grouped into "daily" pack-files once a day for up to 30 days.
If a user did not request prefetch packs for over 30 days, then they
would get the entire history of commits and trees in a new, large
pack-file. This led to a large number of pack-files that had poor
delta compression.

By running this pack-file maintenance step once per day, these repos
with thousands of packs spanning 200+ GB dropped to dozens of pack-
files spanning 30-50 GB. This was all done without removing objects
from the system and using a constant batch size of two gigabytes.
Once the work was done to reduce the pack-files to small sizes, the
batch size of two gigabytes means that not every run triggers a
repack operation, so the following run will not expire a pack-file.
This has kept these repos in a "clean" state.

[1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs

Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
---
 Documentation/git-run-job.txt | 18 ++++++-
 builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++++-
 midx.c                        |  2 +-
 midx.h                        |  1 +
 t/t7900-run-job.sh            | 39 +++++++++++++++
 5 files changed, 147 insertions(+), 3 deletions(-)

diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
index 43ca1160b5a..108ed25b8bd 100644
--- a/Documentation/git-run-job.txt
+++ b/Documentation/git-run-job.txt
@@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
 SYNOPSIS
 --------
 [verse]
-'git run-job (commit-graph|fetch|loose-objects)'
+'git run-job (commit-graph|fetch|loose-objects|pack-files)'
 
 
 DESCRIPTION
@@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
 objects to prevent the job from taking too long on a repository with
 many loose objects.
 
+'pack-files'::
+
+The `pack-files` job incrementally repacks the object directory using
+the `multi-pack-index` feature. In order to prevent race conditions with
+concurrent Git commands, it follows a two-step process. First, it
+deletes any pack-files included in the `multi-pack-index` where none of
+the objects in the `multi-pack-index` reference those pack-files; this
+only happens if all objects in the pack-file are also stored in a newer
+pack-file. Second, it selects a group of pack-files whose "expected
+size" is below the batch size until the group has total expected size at
+least the batch size; see the `--batch-size` option for the `repack`
+subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
+zero, which is a special case that attempts to repack all pack-files
+into a single pack-file.
+
+
 GIT
 ---
 Part of the linkgit:git[1] suite
diff --git a/builtin/run-job.c b/builtin/run-job.c
index cecf9058c51..d3543f7ccb9 100644
--- a/builtin/run-job.c
+++ b/builtin/run-job.c
@@ -1,13 +1,14 @@
 #include "builtin.h"
 #include "config.h"
 #include "commit-graph.h"
+#include "midx.h"
 #include "object-store.h"
 #include "parse-options.h"
 #include "repository.h"
 #include "run-command.h"
 
 static char const * const builtin_run_job_usage[] = {
-	N_("git run-job (commit-graph|fetch|loose-objects)"),
+	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
 	NULL
 };
 
@@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
 	return prune_packed() || pack_loose();
 }
 
+static int multi_pack_index_write(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "write",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int rewrite_multi_pack_index(void)
+{
+	char *midx_name = get_midx_filename(the_repository->objects->odb->path);
+
+	unlink(midx_name);
+	free(midx_name);
+
+	if (multi_pack_index_write()) {
+		error(_("failed to rewrite multi-pack-index"));
+		return 1;
+	}
+
+	return 0;
+}
+
+static int multi_pack_index_verify(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "verify",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int multi_pack_index_expire(void)
+{
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "expire",
+			 "--no-progress", NULL);
+	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+}
+
+static int multi_pack_index_repack(void)
+{
+	int result;
+	struct argv_array cmd = ARGV_ARRAY_INIT;
+	argv_array_pushl(&cmd, "multi-pack-index", "repack",
+			 "--no-progress", "--batch-size=0", NULL);
+	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
+
+	if (result && multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after repack"));
+		result = rewrite_multi_pack_index();
+	}
+
+	return result;
+}
+
+static int run_pack_files_job(void)
+{
+	if (multi_pack_index_write()) {
+		error(_("failed to write multi-pack-index"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after initial write"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_expire()) {
+		error(_("multi-pack-index expire failed"));
+		return 1;
+	}
+
+	if (multi_pack_index_verify()) {
+		warning(_("multi-pack-index verify failed after expire"));
+		return rewrite_multi_pack_index();
+	}
+
+	if (multi_pack_index_repack()) {
+		error(_("multi-pack-index repack failed"));
+		return 1;
+	}
+
+	return 0;
+}
+
 int cmd_run_job(int argc, const char **argv, const char *prefix)
 {
 	static struct option builtin_run_job_options[] = {
@@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
 			return run_fetch_job();
 		if (!strcmp(argv[0], "loose-objects"))
 			return run_loose_objects_job();
+		if (!strcmp(argv[0], "pack-files"))
+			return run_pack_files_job();
 	}
 
 	usage_with_options(builtin_run_job_usage,
diff --git a/midx.c b/midx.c
index 1527e464a7b..0f0d0a38812 100644
--- a/midx.c
+++ b/midx.c
@@ -36,7 +36,7 @@
 
 #define PACK_EXPIRED UINT_MAX
 
-static char *get_midx_filename(const char *object_dir)
+char *get_midx_filename(const char *object_dir)
 {
 	return xstrfmt("%s/pack/multi-pack-index", object_dir);
 }
diff --git a/midx.h b/midx.h
index e6fa356b5ca..cf2c09dffc2 100644
--- a/midx.h
+++ b/midx.h
@@ -39,6 +39,7 @@ struct multi_pack_index {
 
 #define MIDX_PROGRESS (1 << 0)
 
+char *get_midx_filename(const char *object_dir);
 struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
 int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
 int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
index 41da083257b..416ba04989d 100755
--- a/t/t7900-run-job.sh
+++ b/t/t7900-run-job.sh
@@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
 '
 
 GIT_TEST_COMMIT_GRAPH=0
+GIT_TEST_MULTI_PACK_INDEX=0
 
 . ./test-lib.sh
 
@@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
 	test_cmp packs-between packs-after
 '
 
+test_expect_success 'pack-files job' '
+	packDir=.git/objects/pack &&
+
+	# Create three disjoint pack-files with size BIG, small, small.
+
+	echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
+
+	test_tick &&
+	git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
+	HEAD~1
+	^HEAD~2
+	EOF
+
+	test_tick &&
+	git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
+	HEAD
+	^HEAD~1
+	EOF
+
+	rm -f client/$packDir/pack-* &&
+	rm -f client/$packDir/loose-* &&
+
+	ls client/$packDir/*.pack >packs-before &&
+	test_line_count = 3 packs-before &&
+
+	# the job repacks the two into a new pack, but does not
+	# delete the old ones.
+	git -C client run-job pack-files &&
+	ls client/$packDir/*.pack >packs-between &&
+	test_line_count = 4 packs-between &&
+
+	# the job deletes the two old packs, and does not write
+	# a new one because only one pack remains.
+	git -C client run-job pack-files &&
+	ls client/.git/objects/pack/*.pack >packs-after &&
+	test_line_count = 1 packs-after
+'
+
 test_done
--
gitgitgadget

^ permalink raw reply related	[flat|nested] 4+ messages in thread
* Re: [PATCH 05/15] run-job: implement pack-files job 2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget @ 2020-05-27 22:17 ` Josh Steadmon 0 siblings, 0 replies; 4+ messages in thread From: Josh Steadmon @ 2020-05-27 22:17 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: git, peff, jrnieder, stolee, Derrick Stolee On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote: > From: Derrick Stolee <dstolee@microsoft.com> > > The previous change cleaned up loose objects using the > 'loose-objects' that can be run safely in the background. Add a > similar job that performs similar cleanups for pack-files. > > One issue with running 'git repack' is that it is designed to > repack all pack-files into a single pack-file. While this is the > most space-efficient way to store object data, it is not time or > memory efficient. This becomes extremely important if the repo is > so large that a user struggles to store two copies of the pack on > their disk. > > Instead, perform an "incremental" repack by collecting a few small > pack-files into a new pack-file. The multi-pack-index facilitates > this process ever since 'git multi-pack-index expire' was added in > 19575c7 (multi-pack-index: implement 'expire' subcommand, > 2019-06-10) and 'git multi-pack-index repack' was added in ce1e4a1 > (midx: implement midx_repack(), 2019-06-10). > > The 'pack-files' job runs the following steps: > > 1. 'git multi-pack-index write' creates a multi-pack-index file if > one did not exist, and otherwise will update the multi-pack-index > with any new pack-files that appeared since the last write. This > is particularly relevant with the background fetch job. > > When the multi-pack-index sees two copies of the same object, it > stores the offset data into the newer pack-file. 
This means that > some old pack-files could become "unreferenced" which I will use > to mean "a pack-file that is in the pack-file list of the > multi-pack-index but none of the objects in the multi-pack-index > reference a location inside that pack-file." > > 2. 'git multi-pack-index expire' deletes any unreferenced pack-files > and updaes the multi-pack-index to drop those pack-files from the Typo: updaes -> updates > list. This is safe to do as concurrent Git processes will see the > multi-pack-index and not open those packs when looking for object > contents. (Similar to the 'loose-objects' job, there are some Git Is it still safe for concurrent processes if the repo did not have a multi-pack-index when the first process started? > commands that open pack-files regardless of the multi-pack-index, > but they are rarely used. Further, a user that self-selects to > use background operations would likely refrain from using those > commands.) > > 3. 'git multi-pack-index repack --bacth-size=<size>' collects a set Typo: bacth-size -> batch-size > of pack-files that are listed in the multi-pack-index and creates > a new pack-file containing the objects whose offsets are listed > by the multi-pack-index to be in those objects. The set of pack- > files is selected greedily by sorting the pack-files by modified > time and adding a pack-file to the set if its "expected size" is > smaller than the batch size until the total expected size of the > selected pack-files is at least the batch size. The "expected > size" is calculated by taking the size of the pack-file divided > by the number of objects in the pack-file and multiplied by the > number of objects from the multi-pack-index with offset in that > pack-file. The expected size approximats how much data from that Typo: approximats -> approximates > pack-file will contribute to the resulting pack-file size. The > intention is that the resulting pack-file will be close in size > to the provided batch size. 
> > The next run of the pack-files job will delete these repacked > pack-files during the 'expire' step. > > In this version, the batch size is set to "0" which ignores the > size restrictions when selecting the pack-files. It instead > selects all pack-files and repacks all packed objects into a > single pack-file. This will be updated in the next change, but > it requires doing some calculations that are better isolated to > a separate change. > > Each of the above steps update the multi-pack-index file. After > each step, we verify the new multi-pack-index. If the new > multi-pack-index is corrupt, then delete the multi-pack-index, > rewrite it from scratch, and stop doing the later steps of the > job. This is intended to be an extra-safe check without leaving > a repo with many pack-files without a multi-pack-index. > > These steps are based on a similar background maintenance step in > Scalar (and VFS for Git) [1]. This was incredibly effective for > users of the Windows OS repository. After using the same VFS for Git > repository for over a year, some users had _thousands_ of pack-files > that combined to up to 250 GB of data. We noticed a few users were > running into the open file descriptor limits (due in part to a bug > in the multi-pack-index fixed by af96fe3392 (midx: add packs to > packed_git linked list, 2019-04-29). > > These pack-files were mostly small since they contained the commits > and trees that were pushed to the origin in a given hour. The GVFS > protocol includes a "prefetch" step that asks for pre-computed pack- > files containing commits and trees by timestamp. These pack-files > were grouped into "daily" pack-files once a day for up to 30 days. > If a user did not request prefetch packs for over 30 days, then they > would get the entire history of commits and trees in a new, large > pack-file. This led to a large number of pack-files that had poor > delta compression. 
>
> By running this pack-file maintenance step once per day, these repos
> with thousands of packs spanning 200+ GB dropped to dozens of
> pack-files spanning 30-50 GB. This was all done without removing
> objects from the system, using a constant batch size of two
> gigabytes. Once the work was done to reduce the pack-files to small
> sizes, the batch size of two gigabytes means that not every run
> triggers a repack operation, so the following run will not expire a
> pack-file. This has kept these repos in a "clean" state.
>
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/PackfileMaintenanceStep.cs
>
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 18 ++++++-
>  builtin/run-job.c             | 90 ++++++++++++++++++++++++++++++++++-
>  midx.c                        |  2 +-
>  midx.h                        |  1 +
>  t/t7900-run-job.sh            | 39 +++++++++++++++
>  5 files changed, 147 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 43ca1160b5a..108ed25b8bd 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job (commit-graph|fetch|loose-objects)'
> +'git run-job (commit-graph|fetch|loose-objects|pack-files)'
>
>
>  DESCRIPTION
> @@ -71,6 +71,22 @@ a batch of loose objects. The batch size is limited to 50 thousand
>  objects to prevent the job from taking too long on a repository with
>  many loose objects.
>
> +'pack-files'::
> +
> +The `pack-files` job incrementally repacks the object directory using
> +the `multi-pack-index` feature. In order to prevent race conditions with
> +concurrent Git commands, it follows a two-step process. First, it
> +deletes any pack-files included in the `multi-pack-index` where none of
> +the objects in the `multi-pack-index` reference those pack-files; this
> +only happens if all objects in the pack-file are also stored in a newer
> +pack-file. Second, it selects a group of pack-files whose "expected
> +size" is below the batch size until the group has total expected size at
> +least the batch size; see the `--batch-size` option for the `repack`
> +subcommand in linkgit:git-multi-pack-index[1]. The default batch-size is
> +zero, which is a special case that attempts to repack all pack-files
> +into a single pack-file.
> +
> +
>  GIT
>  ---
>  Part of the linkgit:git[1] suite
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index cecf9058c51..d3543f7ccb9 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,13 +1,14 @@
>  #include "builtin.h"
>  #include "config.h"
>  #include "commit-graph.h"
> +#include "midx.h"
>  #include "object-store.h"
>  #include "parse-options.h"
>  #include "repository.h"
>  #include "run-command.h"
>
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job (commit-graph|fetch|loose-objects)"),
> +	N_("git run-job (commit-graph|fetch|loose-objects|pack-files)"),
>  	NULL
>  };
>
> @@ -238,6 +239,91 @@ static int run_loose_objects_job(void)
>  	return prune_packed() || pack_loose();
>  }
>
> +static int multi_pack_index_write(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "write",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int rewrite_multi_pack_index(void)
> +{
> +	char *midx_name = get_midx_filename(the_repository->objects->odb->path);
> +
> +	unlink(midx_name);
> +	free(midx_name);
> +
> +	if (multi_pack_index_write()) {
> +		error(_("failed to rewrite multi-pack-index"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
> +static int multi_pack_index_verify(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "verify",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_expire(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "expire",
> +			 "--no-progress", NULL);
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int multi_pack_index_repack(void)
> +{
> +	int result;
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +	argv_array_pushl(&cmd, "multi-pack-index", "repack",
> +			 "--no-progress", "--batch-size=0", NULL);
> +	result = run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +
> +	if (result && multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after repack"));
> +		result = rewrite_multi_pack_index();
> +	}
> +
> +	return result;
> +}
> +
> +static int run_pack_files_job(void)
> +{
> +	if (multi_pack_index_write()) {
> +		error(_("failed to write multi-pack-index"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after initial write"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_expire()) {
> +		error(_("multi-pack-index expire failed"));
> +		return 1;
> +	}
> +
> +	if (multi_pack_index_verify()) {
> +		warning(_("multi-pack-index verify failed after expire"));
> +		return rewrite_multi_pack_index();
> +	}
> +
> +	if (multi_pack_index_repack()) {
> +		error(_("multi-pack-index repack failed"));
> +		return 1;
> +	}
> +
> +	return 0;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -261,6 +347,8 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  			return run_fetch_job();
>  		if (!strcmp(argv[0], "loose-objects"))
>  			return run_loose_objects_job();
> +		if (!strcmp(argv[0], "pack-files"))
> +			return run_pack_files_job();
>  	}
>
>  	usage_with_options(builtin_run_job_usage,
> diff --git a/midx.c b/midx.c
> index 1527e464a7b..0f0d0a38812 100644
> --- a/midx.c
> +++ b/midx.c
> @@ -36,7 +36,7 @@
>
>  #define PACK_EXPIRED UINT_MAX
>
> -static char *get_midx_filename(const char *object_dir)
> +char *get_midx_filename(const char *object_dir)
>  {
>  	return xstrfmt("%s/pack/multi-pack-index", object_dir);
>  }
> diff --git a/midx.h b/midx.h
> index e6fa356b5ca..cf2c09dffc2 100644
> --- a/midx.h
> +++ b/midx.h
> @@ -39,6 +39,7 @@ struct multi_pack_index {
>
>  #define MIDX_PROGRESS (1 << 0)
>
> +char *get_midx_filename(const char *object_dir);
>  struct multi_pack_index *load_multi_pack_index(const char *object_dir, int local);
>  int prepare_midx_pack(struct repository *r, struct multi_pack_index *m, uint32_t pack_int_id);
>  int bsearch_midx(const struct object_id *oid, struct multi_pack_index *m, uint32_t *result);
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 41da083257b..416ba04989d 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -6,6 +6,7 @@ Testing the background jobs, in the foreground
>  '
>
>  GIT_TEST_COMMIT_GRAPH=0
> +GIT_TEST_MULTI_PACK_INDEX=0
>
>  . ./test-lib.sh
>
> @@ -93,4 +94,42 @@ test_expect_success 'loose-objects job' '
>  	test_cmp packs-between packs-after
>  '
>
> +test_expect_success 'pack-files job' '
> +	packDir=.git/objects/pack &&
> +
> +	# Create three disjoint pack-files with size BIG, small, small.
> +
> +	echo HEAD~2 | git -C client pack-objects --revs $packDir/test-1 &&
> +
> +	test_tick &&
> +	git -C client pack-objects --revs $packDir/test-2 <<-\EOF &&
> +	HEAD~1
> +	^HEAD~2
> +	EOF
> +
> +	test_tick &&
> +	git -C client pack-objects --revs $packDir/test-3 <<-\EOF &&
> +	HEAD
> +	^HEAD~1
> +	EOF
> +
> +	rm -f client/$packDir/pack-* &&
> +	rm -f client/$packDir/loose-* &&
> +
> +	ls client/$packDir/*.pack >packs-before &&
> +	test_line_count = 3 packs-before &&
> +
> +	# the job repacks the two into a new pack, but does not
> +	# delete the old ones.
> +	git -C client run-job pack-files &&
> +	ls client/$packDir/*.pack >packs-between &&
> +	test_line_count = 4 packs-between &&
> +
> +	# the job deletes the two old packs, and does not write
> +	# a new one because only one pack remains.
> +	git -C client run-job pack-files &&
> +	ls client/.git/objects/pack/*.pack >packs-after &&
> +	test_line_count = 1 packs-after
> +'
> +
>  test_done
> --
> gitgitgadget

^ permalink raw reply	[flat|nested] 4+ messages in thread
end of thread, other threads:[~2020-05-27 22:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-05-02  7:56 [PATCH 05/15] run-job: implement pack-files job Son Luong Ngoc
2020-05-04 14:34 ` Derrick Stolee
  -- strict thread matches above, loose matches on Subject: below --
2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
2020-05-27 22:17   ` Josh Steadmon