All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jeff King <peff@peff.net>
To: SURA via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, SURA <sura907@hotmail.com>
Subject: Re: [PATCH] builtin/fetch.c: clean tmp pack after receive signal
Date: Tue, 16 Mar 2021 17:54:32 -0400	[thread overview]
Message-ID: <YFEpGGLBgLSdR40V@coredump.intra.peff.net> (raw)
In-Reply-To: <pull.890.git.1615863217226.gitgitgadget@gmail.com>

On Tue, Mar 16, 2021 at 02:53:36AM +0000, SURA via GitGitGadget wrote:

> In Gitee.com, I often use scripts to start a time-limited

Not related to your patch, but I think this name falls afoul of Git's
trademark policy. See:

  https://git-scm.com/trademark

There's also some discussion in this thread:

  https://lore.kernel.org/git/20170202022655.2jwvudhvo4hmueaw@sigill.intra.peff.net/

> "git fetch". When the execution time of'git fetch' is too
> long, send a signal (such as SIGINT) to the'git fetch' process
> 
> But I found that after the process exits, there are some
> files in the format of 'objects/pack/tmp_pack_XXXXXX' on the disk.
> They are usually very large (some of them exceed 5G), and'git gc'
> has no effect on them.

This isn't quite true. "git gc" will clean up the temporary files, but
only if the mtime is sufficiently old. The purpose here is to give a
grace period to avoid deleting a file that is actively being written to.
However, we use the same grace period that we use for deleting
unreachable objects, which is absurdly long for this purpose: 2 weeks.
Probably something like an hour would be more appropriate (since the
mtime is updated on each write, this would imply a process not making
forward progress).

That said...

> Obviously this is only a temporary solution, I think it should be fixed from git
> 
> I found fetch will start a 'index-pack' sub-process, this sub-process
> create the tmp_pack
>   1. make 'index-pack' unlink tmp_pack when get signal
>   2. make 'fetch' send signal to 'index-pack' when get signal

I do think this is worth doing. One of the reasons we haven't
traditionally cleaned up temporary packfiles is that the failed state is
sometimes useful for analyzing what went wrong, or even recovering
partial data. But that probably should not be the default mode of
operation (i.e., a config option or environment variable should probably
be able to turn it on for debugging).

Looking at the implementation itself...

> diff --git a/builtin/fetch.c b/builtin/fetch.c
> index 0b90de87c7a2..a87efa23ceb5 100644
> --- a/builtin/fetch.c
> +++ b/builtin/fetch.c
> @@ -224,8 +224,18 @@ static void unlock_pack(void)
>  		transport_unlock_pack(gsecondary);
>  }
>  
> +static void send_signo_to_index_pack(int signo)
> +{
> +	if (gtransport && gtransport->index_pack_pid > 0)
> +		kill(gtransport->index_pack_pid, signo);
> +
> +	if (gsecondary && gsecondary->index_pack_pid > 0)
> +		kill(gsecondary->index_pack_pid, signo);
> +}
> +
>  static void unlock_pack_on_signal(int signo)
>  {
> +	send_signo_to_index_pack(signo);
>  	unlock_pack();
>  	sigchain_pop(signo);
>  	raise(signo);

We'd probably want to also trigger this if we just hit die(), I'd think.
We have a system for killing process children when we exit unexpectedly.
I think just setting the "clean_on_exit" flag on the index-pack
child_process struct would turn this into a one-liner.

Likewise, we'd probably want to do this from receive-pack, too (so that
pushes don't accumulate temporary packfiles on the receiving side).

> diff --git a/builtin/index-pack.c b/builtin/index-pack.c
> index bad57488079c..27d1ebba746e 100644
> --- a/builtin/index-pack.c
> +++ b/builtin/index-pack.c
> [...]
> +static void remove_tmp_pack(void)
> +{
> +	if (tmp_pack_name) 
> +		unlink_or_warn(tmp_pack_name);
> +}
> +
> +static void remove_tmp_pack_on_signal(int signo)
> +{
> +	remove_tmp_pack();
> +	sigchain_pop(signo);
> +	raise(signo);
> +}

Likewise, we have a tempfile cleanup system already.

I think this hunk:

> @@ -336,6 +339,7 @@ static const char *open_pack_file(const char *pack_name)
>  			output_fd = odb_mkstemp(&tmp_file,
>  						"pack/tmp_pack_XXXXXX");
>  			pack_name = strbuf_detach(&tmp_file, NULL);
> +			tmp_pack_name = pack_name;

...can just call register_tempfile(). It should also record the result
so that we don't try to unlink() it after we've already moved it away
from its temporary name (though it's fairly unlikely for somebody else
to have used the name in the interim).

I think you'd want to do the same for the tmp_idx_* files, too. Likewise
for ".rev" files we create starting in v2.31.

I think it would also make sense in create_tmp_packfile(), which is used
during repacking (a different problem space, but really the same thing:
if repacking fails for some reason, we probably shouldn't leave a
useless gigantic half-finished packfile on disk).

We should possibly also do so for tmp_obj_* files. Those can be written
for a fetch or push via unpack-objects (as well as normal local
commands). They're not usually as big as a pack, obviously, but I think
the same principle applies.

> [...]

It would be nice to see some tests covering this functionality, too.
Reproducing it with signals is likely to be racy and not worth it. But I
think that right now index-pack reading a bogus pack (say, one that
fails fsck checks) will leave the tmp_pack_* on disk. And it would not
if we cleanup tempfiles (again, this would be on any exit, not just
signal death, but I think that is what we'd want, and also what
register_tempfile() will do).

-Peff

  reply	other threads:[~2021-03-16 21:55 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-03-16  2:53 [PATCH] builtin/fetch.c: clean tmp pack after receive signal SURA via GitGitGadget
2021-03-16 21:54 ` Jeff King [this message]
2021-03-17 18:15   ` Junio C Hamano

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YFEpGGLBgLSdR40V@coredump.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=sura907@hotmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.