git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Josh Steadmon <steadmon@google.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, peff@peff.net, jrnieder@google.com,
	stolee@gmail.com, Derrick Stolee <dstolee@microsoft.com>
Subject: Re: [PATCH 02/15] run-job: implement commit-graph job
Date: Wed, 20 May 2020 12:08:20 -0700	[thread overview]
Message-ID: <20200520190820.GB163566@google.com> (raw)
In-Reply-To: <cd3facd1e03923204bd2259ff0635a5bbf68d700.1585946894.git.gitgitgadget@gmail.com>

Thanks for this series, and sorry for the delayed (and partial) review.
I hope comments here are still useful. I don't have any comments on the
larger design that other people haven't already made, but I do have a
few small questions and suggestions, inline below.


On 2020.04.03 20:48, Derrick Stolee via GitGitGadget wrote:
> From: Derrick Stolee <dstolee@microsoft.com>
> 
> The first job to implement in the 'git run-job' command is the
> 'commit-graph' job. It is based on the sequence of events in the
> 'commit-graph' job in Scalar [1]. This sequence is as follows:
> 
> 1. git commit-graph write --reachable --split
> 2. git commit-graph verify --shallow
> 3. If the verify succeeds, stop.
> 4. Delete the commit-graph-chain file.
> 5. git commit-graph write --reachable --split
> 
> By writing an incremental commit-graph file using the "--split"
> option we minimize the disruption from this operation. The default
> behavior is to merge layers until the new "top" layer is less than
> half the size of the layer below. This provides quick writes "most"
> of the time, with the longer writes following a power law
> distribution.
> 
> Most importantly, concurrent Git processes only look at the
> commit-graph-chain file for a very short amount of time, so they
> will verly likely not be holding a handle to the file when we try
> to replace it. (This only matters on Windows.)
> 
> If a concurrent process reads the old commit-graph-chain file, but
> our job expires some of the .graph files before they can be read,
> then those processes will see a warning message (but not fail).
> This could be avoided by a future update to use the --expire-time
> argument when writing the commit-graph.
> 
> By using 'git commit-graph verify --shallow' we can ensure that
> the file we just wrote is valid. This is an extra safety precaution
> that is faster than our 'write' subcommand. In the rare situation
> that the newest layer of the commit-graph is corrupt, we can "fix"
> the corruption by deleting the commit-graph-chain file and rewrite
> the full commit-graph as a new one-layer commit graph. This does
> not completely prevent _that_ file from being corrupt, but it does
> recompute the commit-graph by parsing commits from the object
> database. In our use of this step in Scalar and VFS for Git, we
> have only seen this issue arise because our microsoft/git fork
> reverted 43d3561 ("commit-graph write: don't die if the existing
> graph is corrupt" 2019-03-25) for a while to keep commit-graph
> writes very fast. We dropped the revert when updating to v2.23.0.
> The verify still has potential for catching corrupt data across
> the layer boundary: if the new file has commit X with parent Y
> in an old file but the commit ID for Y in the old file had a
> bitswap, then we will notice that in the 'verify' command.
> 
> [1] https://github.com/microsoft/scalar/blob/master/Scalar.Common/Maintenance/CommitGraphStep.cs
> 
> RFC QUESTIONS:
> 
> 1. Are links like [1] helpful?

Yes, I appreciate being able to reference these. However, given that
these will show up in commit logs for ~forever, and ongoing work might
happen on Scalar, perhaps the links should be pinned to a specific
revision?


> 
> 2. Can anyone think of a way to test the rewrite fallback?
>    It requires corrupting the latest file between two steps of
>    this one command, so that is a tricky spot to inject a fault.
> 
> Signed-off-by: Derrick Stolee <dstolee@microsoft.com>
> ---
>  Documentation/git-run-job.txt | 21 ++++++++++--
>  builtin/run-job.c             | 60 ++++++++++++++++++++++++++++++++++-
>  commit-graph.c                |  2 +-
>  commit-graph.h                |  1 +
>  t/t7900-run-job.sh            | 32 +++++++++++++++++++
>  5 files changed, 112 insertions(+), 4 deletions(-)
> 
> diff --git a/Documentation/git-run-job.txt b/Documentation/git-run-job.txt
> index 0627b3ed259..8bf2762d650 100644
> --- a/Documentation/git-run-job.txt
> +++ b/Documentation/git-run-job.txt
> @@ -9,7 +9,7 @@ git-run-job - Run a maintenance job. Intended for background operation.
>  SYNOPSIS
>  --------
>  [verse]
> -'git run-job <job-name>'
> +'git run-job commit-graph'
>  
>  
>  DESCRIPTION
> @@ -28,7 +28,24 @@ BEHAVIOR MAY BE ALTERED IN THE FUTURE.
>  JOBS
>  ----
>  
> -TBD
> +'commit-graph'::
> +
> +The `commit-graph` job updates the `commit-graph` files incrementally,
> +then verifies that the written data is correct. If the new layer has an
> +issue, then the chain file is removed and the `commit-graph` is
> +rewritten from scratch.
> ++
> +The verification only checks the top layer of the `commit-graph` chain.
> +If the incremental write merged the new commits with at least one
> +existing layer, then there is potential for on-disk corruption being
> +carried forward into the new file. This will be noticed and the new
> +commit-graph file will be clean as Git reparses the commit data from
> +the object database.
> ++
> +The incremental write is safe to run alongside concurrent Git processes
> +since it will not expire `.graph` files that were in the previous
> +`commit-graph-chain` file. They will be deleted by a later run based on
> +the expiration delay.
>  
>  
>  GIT
> diff --git a/builtin/run-job.c b/builtin/run-job.c
> index 2c78d053aa4..dd7709952d3 100644
> --- a/builtin/run-job.c
> +++ b/builtin/run-job.c
> @@ -1,12 +1,65 @@
>  #include "builtin.h"
>  #include "config.h"
> +#include "commit-graph.h"
> +#include "object-store.h"
>  #include "parse-options.h"
> +#include "repository.h"
> +#include "run-command.h"
>  
>  static char const * const builtin_run_job_usage[] = {
> -	N_("git run-job"),
> +	N_("git run-job commit-graph"),
>  	NULL
>  };
>  
> +static int run_write_commit_graph(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "commit-graph", "write",
> +			 "--split", "--reachable", "--no-progress",
> +			 NULL);
> +
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_verify_commit_graph(void)
> +{
> +	struct argv_array cmd = ARGV_ARRAY_INIT;
> +
> +	argv_array_pushl(&cmd, "commit-graph", "verify",
> +			 "--shallow", "--no-progress", NULL);
> +
> +	return run_command_v_opt(cmd.argv, RUN_GIT_CMD);
> +}
> +
> +static int run_commit_graph_job(void)
> +{
> +	char *chain_path;
> +
> +	if (run_write_commit_graph()) {
> +		error(_("failed to write commit-graph"));
> +		return 1;
> +	}
> +
> +	if (!run_verify_commit_graph())
> +		return 0;
> +
> +	warning(_("commit-graph verify caught error, rewriting"));
> +
> +	chain_path = get_chain_filename(the_repository->objects->odb);

Should we avoid new uses of `the_repository` and take a repo pointer
argument here instead?


> +	if (unlink(chain_path)) {
> +		UNLEAK(chain_path);
> +		die(_("failed to remove commit-graph at %s"), chain_path);
> +	}
> +	free(chain_path);
> +
> +	if (!run_write_commit_graph())
> +		return 0;

Is there a reason why we don't verify the second write attempt?


> +
> +	error(_("failed to rewrite commit-graph"));
> +	return 1;
> +}
> +
>  int cmd_run_job(int argc, const char **argv, const char *prefix)
>  {
>  	static struct option builtin_run_job_options[] = {
> @@ -23,6 +76,11 @@ int cmd_run_job(int argc, const char **argv, const char *prefix)
>  			     builtin_run_job_usage,
>  			     PARSE_OPT_KEEP_UNKNOWN);
>  
> +	if (argc > 0) {
> +		if (!strcmp(argv[0], "commit-graph"))
> +			return run_commit_graph_job();
> +	}
> +
>  	usage_with_options(builtin_run_job_usage,
>  			   builtin_run_job_options);
>  }
> diff --git a/commit-graph.c b/commit-graph.c
> index f013a84e294..6867f92d04a 100644
> --- a/commit-graph.c
> +++ b/commit-graph.c
> @@ -56,7 +56,7 @@ static char *get_split_graph_filename(struct object_directory *odb,
>  		       oid_hex);
>  }
>  
> -static char *get_chain_filename(struct object_directory *odb)
> +char *get_chain_filename(struct object_directory *odb)
>  {
>  	return xstrfmt("%s/info/commit-graphs/commit-graph-chain", odb->path);
>  }
> diff --git a/commit-graph.h b/commit-graph.h
> index e87a6f63600..8b7bd5a6cb1 100644
> --- a/commit-graph.h
> +++ b/commit-graph.h
> @@ -13,6 +13,7 @@
>  struct commit;
>  
>  char *get_commit_graph_filename(struct object_directory *odb);
> +char *get_chain_filename(struct object_directory *odb);
>  int open_commit_graph(const char *graph_file, int *fd, struct stat *st);
>  
>  /*
> diff --git a/t/t7900-run-job.sh b/t/t7900-run-job.sh
> index 1eac80b7ed3..18b9bd26b3a 100755
> --- a/t/t7900-run-job.sh
> +++ b/t/t7900-run-job.sh
> @@ -5,6 +5,8 @@ test_description='git run-job
>  Testing the background jobs, in the foreground
>  '
>  
> +GIT_TEST_COMMIT_GRAPH=0
> +
>  . ./test-lib.sh
>  
>  test_expect_success 'help text' '
> @@ -12,4 +14,34 @@ test_expect_success 'help text' '
>  	test_i18ngrep "usage: git run-job" err
>  '
>  
> +test_expect_success 'commit-graph job' '
> +	git init server &&
> +	(
> +		cd server &&
> +		chain=.git/objects/info/commit-graphs/commit-graph-chain &&
> +		git checkout -b master &&
> +		for i in $(test_seq 1 15)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 1 $chain &&
> +		for i in $(test_seq 16 19)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 2 $chain &&
> +		for i in $(test_seq 20 23)
> +		do
> +			test_commit $i || return 1
> +		done &&
> +		git run-job commit-graph 2>../err &&
> +		test_must_be_empty ../err &&
> +		test_line_count = 1 $chain
> +	)
> +'
> +
>  test_done
> -- 
> gitgitgadget
> 

  reply	other threads:[~2020-05-20 19:08 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-04-03 20:47 [PATCH 00/15] [RFC] Maintenance jobs and job runner Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 01/15] run-job: create barebones builtin Derrick Stolee via GitGitGadget
2020-04-05 15:10   ` Phillip Wood
2020-04-05 19:21     ` Junio C Hamano
2020-04-06 14:42       ` Derrick Stolee
2020-04-07  0:58         ` Danh Doan
2020-04-07 10:54           ` Derrick Stolee
2020-04-07 14:16             ` Danh Doan
2020-04-07 14:30               ` Johannes Schindelin
2020-04-03 20:48 ` [PATCH 02/15] run-job: implement commit-graph job Derrick Stolee via GitGitGadget
2020-05-20 19:08   ` Josh Steadmon [this message]
2020-04-03 20:48 ` [PATCH 03/15] run-job: implement fetch job Derrick Stolee via GitGitGadget
2020-04-05 15:14   ` Phillip Wood
2020-04-06 12:48     ` Derrick Stolee
2020-04-05 20:28   ` Junio C Hamano
2020-04-06 12:46     ` Derrick Stolee
2020-05-20 19:08   ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 04/15] run-job: implement loose-objects job Derrick Stolee via GitGitGadget
2020-04-05 20:33   ` Junio C Hamano
2020-04-03 20:48 ` [PATCH 05/15] run-job: implement pack-files job Derrick Stolee via GitGitGadget
2020-05-27 22:17   ` Josh Steadmon
2020-04-03 20:48 ` [PATCH 06/15] run-job: auto-size or use custom pack-files batch Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 07/15] config: add job.pack-files.batchSize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 08/15] job-runner: create builtin for job loop Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 09/15] job-runner: load repos from config by default Derrick Stolee via GitGitGadget
2020-04-05 15:18   ` Phillip Wood
2020-04-06 12:49     ` Derrick Stolee
2020-04-05 15:41   ` Phillip Wood
2020-04-06 12:57     ` Derrick Stolee
2020-04-03 20:48 ` [PATCH 10/15] job-runner: use config to limit job frequency Derrick Stolee via GitGitGadget
2020-04-05 15:24   ` Phillip Wood
2020-04-03 20:48 ` [PATCH 11/15] job-runner: use config for loop interval Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 12/15] job-runner: add --interval=<span> option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 13/15] job-runner: skip a job if job.<job-name>.enabled is false Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 14/15] job-runner: add --daemonize option Derrick Stolee via GitGitGadget
2020-04-03 20:48 ` [PATCH 15/15] runjob: customize the loose-objects batch size Derrick Stolee via GitGitGadget
2020-04-03 21:40 ` [PATCH 00/15] [RFC] Maintenance jobs and job runner Junio C Hamano
2020-04-04  0:16   ` Derrick Stolee
2020-04-07  0:50     ` Danh Doan
2020-04-07 10:59       ` Derrick Stolee
2020-04-07 14:26         ` Danh Doan
2020-04-07 14:43           ` Johannes Schindelin
2020-04-07  1:48     ` brian m. carlson
2020-04-07 20:08       ` Junio C Hamano
2020-04-07 22:23       ` Johannes Schindelin
2020-04-08  0:01         ` brian m. carlson
2020-05-27 22:39           ` Josh Steadmon
2020-05-28  0:47             ` Junio C Hamano
2020-05-27 21:52               ` Johannes Schindelin
2020-05-28 14:48                 ` Junio C Hamano
2020-05-28 14:50                 ` Jonathan Nieder
2020-05-28 14:57                   ` Junio C Hamano
2020-05-28 15:03                     ` Jonathan Nieder
2020-05-28 15:30                       ` Derrick Stolee
2020-05-28  4:39                         ` Johannes Schindelin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200520190820.GB163566@google.com \
    --to=steadmon@google.com \
    --cc=dstolee@microsoft.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=jrnieder@google.com \
    --cc=peff@peff.net \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).