git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Michael Haggerty <mhagger@alum.mit.edu>
To: Johannes Schindelin <Johannes.Schindelin@gmx.de>
Cc: "Junio C Hamano" <gitster@pobox.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
	"Stefan Beller" <sbeller@google.com>, "Jeff King" <peff@peff.net>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Brandon Williams" <bmwill@google.com>,
	git@vger.kernel.org
Subject: Re: [PATCH 00/20] Read `packed-refs` using mmap()
Date: Fri, 15 Sep 2017 06:21:04 +0200	[thread overview]
Message-ID: <d0fe1daf-4b44-fa4f-dab6-2590e6c09163@alum.mit.edu> (raw)
In-Reply-To: <alpine.DEB.2.21.1.1709142101560.4132@virtualbox>

On 09/14/2017 10:23 PM, Johannes Schindelin wrote:
> On Wed, 13 Sep 2017, Michael Haggerty wrote:
> 
>> * `mmap()` the whole file rather than `read()`ing it.
> 
> On Windows, a memory-mapped file cannot be renamed. As a consequence, the
> following tests fail on `pu`:
> 
> [...]

Thanks for your quick investigation.

> This diff:
> 
> -- snip --
> diff --git a/refs/packed-backend.c b/refs/packed-backend.c
> index f9c5e771bdd..ee8b3603624 100644
> --- a/refs/packed-backend.c
> +++ b/refs/packed-backend.c
> @@ -1306,13 +1308,13 @@ static int packed_transaction_finish(struct
> ref_store *ref_store,
>  	char *packed_refs_path;
>  
>  	packed_refs_path = get_locked_file_path(&refs->lock);
> +	clear_snapshot(refs);
>  	if (rename_tempfile(&refs->tempfile, packed_refs_path)) {
>  		strbuf_addf(err, "error replacing %s: %s",
>  			    refs->path, strerror(errno));
>  		goto cleanup;
>  	}
>  
> -	clear_snapshot(refs);
>  	ret = 0;
>  
>  cleanup:
> -- snap --
> 
> reduces the failed tests to
> 
> t1410-reflog.counts.sh
> t3210-pack-refs.counts.sh
> t3211-peel-ref.counts.sh
> t5505-remote.counts.sh
> t5510-fetch.counts.sh
> t5600-clone-fail-cleanup.counts.sh

That change is a strict improvement on all systems; I'll integrate it
into the series.

> However, much as I tried to wrap my head around it, I could not even start
> to come up with a solution for e.g. the t1410 regression. The failure
> happens in `git branch -d one/two`.
> 
> The culprit: there is yet another unreleased snapshot while we try to
> finish the transaction.
> 
> It is the snapshot of the main_worktree_ref_store as acquired by
> add_head_info() (which is called from get_main_worktree(), which in turn
> was called from get_worktrees(), in turn having been called from
> find_shared_symref() that was called from delete_branches()), and it seems
> to me that there was never any plan to have that released, including its
> snapshot.

Yeah the idea was that the default situation would be that whenever a
`packed-refs` file needs to be read, it would be kept mmapped for the
life of the program (or until the file was detected to have been
changed). This was meant to be a replacement for the explicit
`ref_cache`. So any long-lived git process could be expected to have the
`packed-refs` file (or even multiple `packed-refs` files in the case of
submodules) mmapped.

That's obviously not going to work on Windows.

> [...]
> Do you have any idea how to ensure that all mmap()ed packed refs are
> released before attempting to finish a transaction?

[And from your other email:]
> This is only one example, as I figure that multiple worktrees could cause
> even more ref_stores to have unreleased snapshots of the packed-refs file.
> 
> I imagine that something like close_all_packs() is needed
> ("clear_all_snapshots()"?).

Yes, I wasn't really familiar with `close_all_packs()`, but it sounds
like it solves a similar problem.

Within a single process, we could make this work via an arduous process
of figuring out what functions might want to overwrite the `packed-refs`
file, and making sure that nobody up the call stack is iterating over
references during those function calls. If that condition is met, then
calling `clear_snapshot()` earlier, as you propose above, would decrease
the reference count to zero, causing the old snapshot to be freed, and
allowing the rename to succeed. We could even do

    if (!clear_snapshot(refs))
            BUG("packed-refs snapshot is still in use");

, analogously to `close_all_packs()`, to help find code that violates
the condition.

Similarly, it wouldn't be allowed to invoke a subprocess or hook
function while iterating over references, and one would have to clear
any current snapshots before doing so. It might even make sense to do
that at the same time as `close_all_packs()`, for example in a new
function `close_all_unnecessary_files()`.

But what about unrelated processes? Anybody reading `packed-refs` would
block anybody who wants to modify it, for example to delete a reference
or pack the references. I think this is worse than the packfile
situation. Packfiles all have unique names, so they never need to be
overwritten; at worst they are deleted. And the deletion is never
critical to correctness. If you can't delete a packfile, the only
consequence is that it hangs around until the next GC.

However, the `packed-refs` file always has the same name, and
overwriting it is sometimes critical for correctness. So it sounds to me
like even *readers* mustn't keep the file open or mmapped longer than
necessary!

(This makes me wonder, doesn't `rename_tempfile()` for the `packed-refs`
file *already* fail sometimes on Windows due to a concurrent reader?
Shouldn't it retry if it gets whatever error indicates that the failure
was due to the file being open?)

Anyway, if all of my reasoning is correct, it seems that the inescapable
conclusion is that having the `packed-refs` file open or mmapped is
tantamount to holding a reader lock on the file, because it causes
writers to fail or at least have to retry. Therefore, even if you are
only reading the `packed-refs` file, you should read then close/munmap
it as quickly as possible.

And if *that* is true, then ISTM that this mmap idea is unworkable on
Windows. We would either need to keep the old `ref_cache`-based code
around for use there, or we would need to make a copy the file contents
in memory immediately and use *that* as our cache. The latter would be
pretty easy to implement, actually, because something similar is done in
`sort_snapshot()` to handle the case that the `packed-refs` file on disk
is not correctly ordered by refname. I think that approach would be
competitive with the `ref_cache`-based code, which also reads (and
parses!) the whole file whenever any packed reference needs to be read.
It is possible that memory usage might go up a bit due to the fact that
the SHA-1s would be stored in memory in hex rather than binary form; on
the other hand, this would spare some memory overhead associated with
allocating lots of little objects and the pointers that string the
objects together, so it's probably a wash.

Please let me know if you agree with my reasoning. If so, I'll implement
"always copy `packed-refs` file contents to RAM" for Windows.

A possible future optimization might be that when iterating over a
subset of the references (e.g., `refs/replace/`), only that part of the
file is copied to RAM. But that would be a bigger change, and it's not
obvious whether it might make some read access patterns slower.

Michael

      reply	other threads:[~2017-09-15  4:21 UTC|newest]

Thread overview: 27+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-09-13 17:15 [PATCH 00/20] Read `packed-refs` using mmap() Michael Haggerty
2017-09-13 17:15 ` [PATCH 01/20] ref_iterator: keep track of whether the iterator output is ordered Michael Haggerty
2017-09-13 17:15 ` [PATCH 02/20] prefix_ref_iterator: break when we leave the prefix Michael Haggerty
2017-09-13 17:15 ` [PATCH 03/20] packed_ref_cache: add a backlink to the associated `packed_ref_store` Michael Haggerty
2017-09-13 17:15 ` [PATCH 04/20] die_unterminated_line(), die_invalid_line(): new functions Michael Haggerty
2017-09-13 17:15 ` [PATCH 05/20] read_packed_refs(): use mmap to read the `packed-refs` file Michael Haggerty
2017-09-13 17:16 ` [PATCH 06/20] read_packed_refs(): only check for a header at the top of the file Michael Haggerty
2017-09-13 17:16 ` [PATCH 07/20] read_packed_refs(): make parsing of the header line more robust Michael Haggerty
2017-09-13 17:16 ` [PATCH 08/20] read_packed_refs(): read references with minimal copying Michael Haggerty
2017-09-13 17:16 ` [PATCH 09/20] packed_ref_cache: remember the file-wide peeling state Michael Haggerty
2017-09-13 17:16 ` [PATCH 10/20] mmapped_ref_iterator: add iterator over a packed-refs file Michael Haggerty
2017-09-13 17:16 ` [PATCH 11/20] mmapped_ref_iterator_advance(): no peeled value for broken refs Michael Haggerty
2017-09-13 17:16 ` [PATCH 12/20] packed_ref_cache: keep the `packed-refs` file open and mmapped Michael Haggerty
2017-09-13 17:16 ` [PATCH 13/20] read_packed_refs(): ensure that references are ordered when read Michael Haggerty
2017-09-13 17:16 ` [PATCH 14/20] packed_ref_iterator_begin(): iterate using `mmapped_ref_iterator` Michael Haggerty
2017-09-13 17:16 ` [PATCH 15/20] packed_read_raw_ref(): read the reference from the mmapped buffer Michael Haggerty
2017-09-13 17:16 ` [PATCH 16/20] ref_store: implement `refs_peel_ref()` generically Michael Haggerty
2017-09-13 17:16 ` [PATCH 17/20] packed_ref_store: get rid of the `ref_cache` entirely Michael Haggerty
2017-09-13 17:16 ` [PATCH 18/20] ref_cache: remove support for storing peeled values Michael Haggerty
2017-09-13 17:16 ` [PATCH 19/20] mmapped_ref_iterator: inline into `packed_ref_iterator` Michael Haggerty
2017-09-13 17:16 ` [PATCH 20/20] packed-backend.c: rename a bunch of things and update comments Michael Haggerty
2017-09-13 23:00   ` Stefan Beller
2017-09-13 21:35 ` [PATCH 00/20] Read `packed-refs` using mmap() Junio C Hamano
2017-09-14 16:12 ` Michael Haggerty
2017-09-14 22:16   ` Johannes Schindelin
2017-09-14 20:23 ` Johannes Schindelin
2017-09-15  4:21   ` Michael Haggerty [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d0fe1daf-4b44-fa4f-dab6-2590e6c09163@alum.mit.edu \
    --to=mhagger@alum.mit.edu \
    --cc=Johannes.Schindelin@gmx.de \
    --cc=avarab@gmail.com \
    --cc=bmwill@google.com \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=sbeller@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).