All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] refs: sync loose refs to disk before committing them
@ 2021-11-04 12:38 Patrick Steinhardt
  2021-11-04 13:14 ` Ævar Arnfjörð Bjarmason
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-04 12:38 UTC (permalink / raw)
  To: git

[-- Attachment #1: Type: text/plain, Size: 2582 bytes --]

When writing loose refs, we first create a lockfile, write the new ref
into that lockfile, close it and then rename the lockfile into place
such that the actual update is atomic for that single ref. While this
works as intended under normal circumstences, at GitLab we infrequently
encounter corrupt loose refs in repositories after a machine encountered
a hard reset. The corruption is always of the same type: the ref has
been committed into place, but it is completely empty.

The root cause of this is likely that we don't sync contents of the
lockfile to disk before renaming it into place. As a result, it's not
guaranteed that the contents are properly persisted and one may observe
weird in-between states on hard resets. Quoting ext4 documentation [1]:

    Many broken applications don't use fsync() when replacing existing
    files via patterns such as fd =
    open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", "foo"), or
    worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If
    auto_da_alloc is enabled, ext4 will detect the replace-via-rename
    and replace-via-truncate patterns and force that any delayed
    allocation blocks are allocated such that at the next journal
    commit, in the default data=ordered mode, the data blocks of the new
    file are forced to disk before the rename() operation is committed.
    This provides roughly the same level of guarantees as ext3, and
    avoids the "zero-length" problem that can happen when a system
    crashes before the delayed allocation blocks are forced to disk.

This explicitly points out that one must call fsync(3P) before doing the
rename(3P) call, or otherwise data may not be correctly persisted to
disk.

Fix this by always flushing refs to disk before committing them into
place to avoid this class of corruption.

[1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 refs/files-backend.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/refs/files-backend.c b/refs/files-backend.c
index 151b0056fe..06a3f0bdea 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -1749,6 +1749,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
 	fd = get_lock_file_fd(&lock->lk);
 	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
 	    write_in_full(fd, &term, 1) < 0 ||
+	    fsync(fd) < 0 ||
 	    close_ref_gently(lock) < 0) {
 		strbuf_addf(err,
 			    "couldn't write '%s'", get_lock_file_path(&lock->lk));
-- 
2.33.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-04 12:38 [PATCH] refs: sync loose refs to disk before committing them Patrick Steinhardt
@ 2021-11-04 13:14 ` Ævar Arnfjörð Bjarmason
  2021-11-04 14:51   ` Patrick Steinhardt
  2021-11-04 21:24   ` Junio C Hamano
  2021-11-05  7:07 ` Jeff King
  2021-11-10 11:40 ` [PATCH v2 0/3] " Patrick Steinhardt
  2 siblings, 2 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-04 13:14 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git, Eric Wong, Neeraj K. Singh


On Thu, Nov 04 2021, Patrick Steinhardt wrote:

> When writing loose refs, we first create a lockfile, write the new ref
> into that lockfile, close it and then rename the lockfile into place
> such that the actual update is atomic for that single ref. While this
> works as intended under normal circumstences, at GitLab we infrequently
> encounter corrupt loose refs in repositories after a machine encountered
> a hard reset. The corruption is always of the same type: the ref has
> been committed into place, but it is completely empty.
>
> The root cause of this is likely that we don't sync contents of the
> lockfile to disk before renaming it into place. As a result, it's not
> guaranteed that the contents are properly persisted and one may observe
> weird in-between states on hard resets. Quoting ext4 documentation [1]:
>
>     Many broken applications don't use fsync() when replacing existing
>     files via patterns such as fd =
>     open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", "foo"), or
>     worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If
>     auto_da_alloc is enabled, ext4 will detect the replace-via-rename
>     and replace-via-truncate patterns and force that any delayed
>     allocation blocks are allocated such that at the next journal
>     commit, in the default data=ordered mode, the data blocks of the new
>     file are forced to disk before the rename() operation is committed.
>     This provides roughly the same level of guarantees as ext3, and
>     avoids the "zero-length" problem that can happen when a system
>     crashes before the delayed allocation blocks are forced to disk.
>
> This explicitly points out that one must call fsync(3P) before doing the
> rename(3P) call, or otherwise data may not be correctly persisted to
> disk.
>
> Fix this by always flushing refs to disk before committing them into
> place to avoid this class of corruption.
>
> [1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  refs/files-backend.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/refs/files-backend.c b/refs/files-backend.c
> index 151b0056fe..06a3f0bdea 100644
> --- a/refs/files-backend.c
> +++ b/refs/files-backend.c
> @@ -1749,6 +1749,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
>  	fd = get_lock_file_fd(&lock->lk);
>  	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
>  	    write_in_full(fd, &term, 1) < 0 ||
> +	    fsync(fd) < 0 ||
>  	    close_ref_gently(lock) < 0) {
>  		strbuf_addf(err,
>  			    "couldn't write '%s'", get_lock_file_path(&lock->lk));

Yeah, that really does seem like it's the cause of such zeroing out
issues.

This has a semantic conflict with some other changes in flight, see:

    git log -p origin/master..origin/seen -- write-or-die.c

I.e. here you do want to not die, so fsync_or_die() doesn't make sense
per-se, but in those changes that function has grown to mean
fsync_with_configured_strategy_or_die().

Also we need the loop around fsync, see cccdfd22436 (fsync(): be
prepared to see EINTR, 2021-06-04).

I think it would probably be best to create a git_fsync_fd() function
which is non-fatal and has that config/while loop, and have
fsync_or_die() be a "..or die()" wrapper around that, then you could
call that git_fsync_fd() here.

On the change more generally there's some performance numbers quoted at,
so re the recent discussions about fsync() performance I wonder how this
changes things.

I've also noted in those threads recently that our overall use of fsync
is quite, bad, and especially when it comes to assuming that we don't
need to fsync dir entries, which we still don't do here.

The ext4 docs seem to suggest that this will be the right thing to do in
either case, but I wonder if this won't increase the odds of corruption
on some other fs's.

I.e. before we'd write() && rename() without the fsync(), so on systems
that deferred fsync() until some global sync point we might have been
able to rely on those happening atomically (although clearly not on
others, e.g. ext4).

But now we'll fsync() the data explicitly, then do a rename(), but we
don't fsync the dir entry, so per POSIX an external application can't
rely on seeing that rename yet. Will that bite us still, but just in
another way on some other systems?

1. https://stackoverflow.com/questions/7433057/is-rename-without-fsync-safe

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-04 13:14 ` Ævar Arnfjörð Bjarmason
@ 2021-11-04 14:51   ` Patrick Steinhardt
  2021-11-04 21:24   ` Junio C Hamano
  1 sibling, 0 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-04 14:51 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: git, Eric Wong, Neeraj K. Singh

[-- Attachment #1: Type: text/plain, Size: 5958 bytes --]

On Thu, Nov 04, 2021 at 02:14:54PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Thu, Nov 04 2021, Patrick Steinhardt wrote:
> 
> > When writing loose refs, we first create a lockfile, write the new ref
> > into that lockfile, close it and then rename the lockfile into place
> > such that the actual update is atomic for that single ref. While this
> > works as intended under normal circumstences, at GitLab we infrequently
> > encounter corrupt loose refs in repositories after a machine encountered
> > a hard reset. The corruption is always of the same type: the ref has
> > been committed into place, but it is completely empty.
> >
> > The root cause of this is likely that we don't sync contents of the
> > lockfile to disk before renaming it into place. As a result, it's not
> > guaranteed that the contents are properly persisted and one may observe
> > weird in-between states on hard resets. Quoting ext4 documentation [1]:
> >
> >     Many broken applications don't use fsync() when replacing existing
> >     files via patterns such as fd =
> >     open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", "foo"), or
> >     worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If
> >     auto_da_alloc is enabled, ext4 will detect the replace-via-rename
> >     and replace-via-truncate patterns and force that any delayed
> >     allocation blocks are allocated such that at the next journal
> >     commit, in the default data=ordered mode, the data blocks of the new
> >     file are forced to disk before the rename() operation is committed.
> >     This provides roughly the same level of guarantees as ext3, and
> >     avoids the "zero-length" problem that can happen when a system
> >     crashes before the delayed allocation blocks are forced to disk.
> >
> > This explicitly points out that one must call fsync(3P) before doing the
> > rename(3P) call, or otherwise data may not be correctly persisted to
> > disk.
> >
> > Fix this by always flushing refs to disk before committing them into
> > place to avoid this class of corruption.
> >
> > [1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
> >
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >  refs/files-backend.c | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/refs/files-backend.c b/refs/files-backend.c
> > index 151b0056fe..06a3f0bdea 100644
> > --- a/refs/files-backend.c
> > +++ b/refs/files-backend.c
> > @@ -1749,6 +1749,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
> >  	fd = get_lock_file_fd(&lock->lk);
> >  	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
> >  	    write_in_full(fd, &term, 1) < 0 ||
> > +	    fsync(fd) < 0 ||
> >  	    close_ref_gently(lock) < 0) {
> >  		strbuf_addf(err,
> >  			    "couldn't write '%s'", get_lock_file_path(&lock->lk));
> 
> Yeah, that really does seem like it's the cause of such zeroing out
> issues.
> 
> This has a semantic conflict with some other changes in flight, see:
> 
>     git log -p origin/master..origin/seen -- write-or-die.c
> 
> I.e. here you do want to not die, so fsync_or_die() doesn't make sense
> per-se, but in those changes that function has grown to mean
> fsync_with_configured_strategy_or_die().
> 
> Also we need the loop around fsync, see cccdfd22436 (fsync(): be
> prepared to see EINTR, 2021-06-04).
> 
> I think it would probably be best to create a git_fsync_fd() function
> which is non-fatal and has that config/while loop, and have
> fsync_or_die() be a "..or die()" wrapper around that, then you could
> call that git_fsync_fd() here.

Thanks for pointing it out, I'll base v2 on next in that case.

> On the change more generally there's some performance numbers quoted at,
> so re the recent discussions about fsync() performance I wonder how this
> changes things.

Yeah, good question. I punted on doing benchmarks for this given that I
wasn't completely sure whether there's any preexisting ones which would
fit best here.

No matter the results, I'd still take the stance that we should by
default try to do the right thing and try hard to not end up with
corrupt data, and if filesystem docs explicitly say we must fsync(3P)
then that's what we should be doing. That being said, I wouldn't mind
introducing something like `core.fsyncObjectFiles` for refs, too, so
that folks who want an escape hatch have one.

> I've also noted in those threads recently that our overall use of fsync
> is quite, bad, and especially when it comes to assuming that we don't
> need to fsync dir entries, which we still don't do here.

Yeah. I also thought about not putting the fsync(3P) logic into ref
logic, but instead into our lockfiles. In theory, we should always be
doing this before committing lockfiles into place, so it would fit in
there quite naturally.

> The ext4 docs seem to suggest that this will be the right thing to do in
> either case, but I wonder if this won't increase the odds of corruption
> on some other fs's.
> 
> I.e. before we'd write() && rename() without the fsync(), so on systems
> that deferred fsync() until some global sync point we might have been
> able to rely on those happening atomically (although clearly not on
> others, e.g. ext4).
> 
> But now we'll fsync() the data explicitly, then do a rename(), but we
> don't fsync the dir entry, so per POSIX an external application can't
> rely on seeing that rename yet. Will that bite us still, but just in
> another way on some other systems?
>
> 1. https://stackoverflow.com/questions/7433057/is-rename-without-fsync-safe

Good point. I'd happy to extend this patch to also fsync(3P) the dir
entry. But it does sound like even more of a reason to move the logic
into the lockfiles such that we don't have to duplicate it wherever we
really don't want to end up with corrupted data.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-04 13:14 ` Ævar Arnfjörð Bjarmason
  2021-11-04 14:51   ` Patrick Steinhardt
@ 2021-11-04 21:24   ` Junio C Hamano
  2021-11-04 22:36     ` Neeraj Singh
  1 sibling, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2021-11-04 21:24 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Patrick Steinhardt, git, Eric Wong, Neeraj K. Singh

Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:

> I think it would probably be best to create a git_fsync_fd() function
> which is non-fatal and has that config/while loop, and have
> fsync_or_die() be a "..or die()" wrapper around that, then you could
> call that git_fsync_fd() here.

Adding git_fsync_fd() smells like a poor taste, as git_fsync() takes
an fd already.  How about making the current one into a static helper

	-int git_fsync(int fd, enum fsync_action action)
	+static int git_fsync_once(int fd, enum fsync_action action)
	 ...

and then hide the looping behavior behind git_fsync() proper?

        int git_fsync(int fd, enum fsync_action action)
        {
                while (git_fsync_once(fd, action) < 0)
                        if (errno != EINTR)
                                return -1;
                return 0;
        }

fsync_or_die() can be simplified by getting rid of its loop.

None of that needs to block Patrick's work, I think.  A version that
uses raw fsync() and punts on EINTR can graduate first, which makes
the situation better than the status quo, and all the ongoing work
that deal with fsync can be extended with an eye to make it also
usable to replace the fsync() call Patrick's fix adds.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-04 21:24   ` Junio C Hamano
@ 2021-11-04 22:36     ` Neeraj Singh
  2021-11-05  1:40       ` Junio C Hamano
  2021-11-05  8:35       ` Ævar Arnfjörð Bjarmason
  0 siblings, 2 replies; 34+ messages in thread
From: Neeraj Singh @ 2021-11-04 22:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Ævar Arnfjörð Bjarmason, Patrick Steinhardt, git,
	Eric Wong, Neeraj K. Singh

On Thu, Nov 04, 2021 at 02:24:26PM -0700, Junio C Hamano wrote:
> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
> 
> > I think it would probably be best to create a git_fsync_fd() function
> > which is non-fatal and has that config/while loop, and have
> > fsync_or_die() be a "..or die()" wrapper around that, then you could
> > call that git_fsync_fd() here.
> 
> Adding git_fsync_fd() smells like a poor taste, as git_fsync() takes
> an fd already.  How about making the current one into a static helper
> 
> 	-int git_fsync(int fd, enum fsync_action action)
> 	+static int git_fsync_once(int fd, enum fsync_action action)
> 	 ...
> 
> and then hide the looping behavior behind git_fsync() proper?
> 
>         int git_fsync(int fd, enum fsync_action action)
>         {
>                 while (git_fsync_once(fd, action) < 0)
>                         if (errno != EINTR)
>                                 return -1;
>                 return 0;
>         }
> 
> fsync_or_die() can be simplified by getting rid of its loop.
> 
> None of that needs to block Patrick's work, I think.  A version that
> uses raw fsync() and punts on EINTR can graduate first, which makes
> the situation better than the status quo, and all the ongoing work
> that deal with fsync can be extended with an eye to make it also
> usable to replace the fsync() call Patrick's fix adds.

Is there some reason we shouldn't die if writing the ref fails? We are
already accustomed to dying if fsyncing a packfile or the index fails.

I assume the number of refs updated is not that high on any given git
operation, so it's not worth having a batch mode for this eventually.

Thanks,
Neeraj

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-04 22:36     ` Neeraj Singh
@ 2021-11-05  1:40       ` Junio C Hamano
  2021-11-05  6:36         ` Jeff King
  2021-11-05  8:35       ` Ævar Arnfjörð Bjarmason
  1 sibling, 1 reply; 34+ messages in thread
From: Junio C Hamano @ 2021-11-05  1:40 UTC (permalink / raw)
  To: Neeraj Singh
  Cc: Ævar Arnfjörð Bjarmason, Patrick Steinhardt, git,
	Eric Wong, Neeraj K. Singh

Neeraj Singh <nksingh85@gmail.com> writes:

> Is there some reason we shouldn't die if writing the ref fails? We are
> already accustomed to dying if fsyncing a packfile or the index fails.

If we look at the surrounding code and the callers of the function,
this caller of fsync() is about to signal a failure to its callers
by returning an error().  The callers are prepared to see an error
and cope with it.

Introducing die() to such a library-ish part of the code deep in the
callchain is not exactly a good change, especially when it is
obvious how we can avoid it.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-05  1:40       ` Junio C Hamano
@ 2021-11-05  6:36         ` Jeff King
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2021-11-05  6:36 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Neeraj Singh, Ævar Arnfjörð Bjarmason,
	Patrick Steinhardt, git, Eric Wong, Neeraj K. Singh

On Thu, Nov 04, 2021 at 06:40:22PM -0700, Junio C Hamano wrote:

> Neeraj Singh <nksingh85@gmail.com> writes:
> 
> > Is there some reason we shouldn't die if writing the ref fails? We are
> > already accustomed to dying if fsyncing a packfile or the index fails.
> 
> If we look at the surrounding code and the callers of the function,
> this caller of fsync() is about to signal a failure to its callers
> by returning an error().  The callers are prepared to see an error
> and cope with it.
> 
> Introducing die() to such a library-ish part of the code deep in the
> callchain is not exactly a good change, especially when it is
> obvious how we can avoid it.

And here's a good concrete example that relies on it: when receive-pack
is taking in a push, if the ref fails it should report that via the
protocol back to the client, rather than just hanging up the connection.
That leaves the client less confused, and gives it the opportunity to
attempt and report status on other ref updates.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-04 12:38 [PATCH] refs: sync loose refs to disk before committing them Patrick Steinhardt
  2021-11-04 13:14 ` Ævar Arnfjörð Bjarmason
@ 2021-11-05  7:07 ` Jeff King
  2021-11-05  7:17   ` Jeff King
  2021-11-10 11:40 ` [PATCH v2 0/3] " Patrick Steinhardt
  2 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2021-11-05  7:07 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Thu, Nov 04, 2021 at 01:38:22PM +0100, Patrick Steinhardt wrote:

> When writing loose refs, we first create a lockfile, write the new ref
> into that lockfile, close it and then rename the lockfile into place
> such that the actual update is atomic for that single ref. While this
> works as intended under normal circumstences, at GitLab we infrequently
> encounter corrupt loose refs in repositories after a machine encountered
> a hard reset. The corruption is always of the same type: the ref has
> been committed into place, but it is completely empty.

Yes, I've seen this quite a bit, too, and I agree that more fsync()-ing
will probably help.

There are two things that have made me hesitate, though:

  1. This doesn't quite make the write durable. Doing that would also
     require fsyncing the surrounding directory (and so on up the tree).
     This is a much trickier problem, because we often don't even have
     open descriptors for those.

     I think we can ignore that, though. Doing an fsync() here makes
     things strictly better with respect to robustness (especially if
     you care less about "has my ref update been committed to disk" and
     more about "does this end up corrupt after a power failure").

  2. It's not clear what the performance implications will be,
     especially on a busy server doing a lot of ref updates, or on a
     filesystem where fsync() ends up syncing everything, not just the
     one file (my impression is ext3 is such a system, but not ext4).
     Whereas another solution may be journaling data and metadata writes
     in order without worrying about the durability of writing them to
     disk.

     I suspect for small updates (say, a push of one or two refs), this
     will have little impact. We'd generally fsync the incoming packfile
     and its idx anyway, so we're adding may one or two fsyncs on top of
     that. But if you're pushing 100 refs, that will be 100 sequential
     fsyncs, which may add up to quite a bit of latency. It would be
     nice if we could batch these by somehow (e.g., by opening up all of
     the lockfiles, writing and fsyncing them, and then renaming one by
     one).

     So I think we want to at least add a config knob here. That would
     make it easier to experiment, and provides an escape hatch.

> diff --git a/refs/files-backend.c b/refs/files-backend.c
> index 151b0056fe..06a3f0bdea 100644
> --- a/refs/files-backend.c
> +++ b/refs/files-backend.c
> @@ -1749,6 +1749,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
>  	fd = get_lock_file_fd(&lock->lk);
>  	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
>  	    write_in_full(fd, &term, 1) < 0 ||
> +	    fsync(fd) < 0 ||
>  	    close_ref_gently(lock) < 0) {
>  		strbuf_addf(err,
>  			    "couldn't write '%s'", get_lock_file_path(&lock->lk));

I agree with the sentiment elsewhere that this would probably make sense
for any lockfile. That does probably make batching a bit more difficult,
though it could be layered on top (i.e., we could still fsync in the ref
code, and the lockfile fsync-ing should then become a noop).

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-05  7:07 ` Jeff King
@ 2021-11-05  7:17   ` Jeff King
  2021-11-05  9:12     ` Johannes Schindelin
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2021-11-05  7:17 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: git

On Fri, Nov 05, 2021 at 03:07:18AM -0400, Jeff King wrote:

>   2. It's not clear what the performance implications will be,
>      especially on a busy server doing a lot of ref updates, or on a
>      filesystem where fsync() ends up syncing everything, not just the
>      one file (my impression is ext3 is such a system, but not ext4).
>      Whereas another solution may be journaling data and metadata writes
>      in order without worrying about the durability of writing them to
>      disk.
> 
>      I suspect for small updates (say, a push of one or two refs), this
>      will have little impact. We'd generally fsync the incoming packfile
>      and its idx anyway, so we're adding may one or two fsyncs on top of
>      that. But if you're pushing 100 refs, that will be 100 sequential
>      fsyncs, which may add up to quite a bit of latency. It would be
>      nice if we could batch these by somehow (e.g., by opening up all of
>      the lockfiles, writing and fsyncing them, and then renaming one by
>      one).

So here's a quick experiment that shows a worst case: a small push that
updates a bunch of refs. After building Git with and without your patch,
I set up a small repo like:

  git init
  git commit --allow-empty -m foo
  for i in $(seq 100); do
    git update-ref refs/heads/$i HEAD
  done

To give a clean slate between runs, I stuck this in a script called
"setup":

  #!/bin/sh
  rm -rf dst.git
  git init --bare dst.git
  sync

And then ran:

  $ hyperfine -L v orig,fsync -p ./setup '/tmp/{v}/bin/git push dst.git refs/heads/*'
  Benchmark 1: /tmp/orig/bin/git push dst.git refs/heads/*
    Time (mean ± σ):       9.9 ms ±   0.2 ms    [User: 6.3 ms, System: 4.7 ms]
    Range (min … max):     9.5 ms …  10.5 ms    111 runs
   
  Benchmark 2: /tmp/fsync/bin/git push dst.git refs/heads/*
    Time (mean ± σ):     401.0 ms ±   7.7 ms    [User: 9.4 ms, System: 15.2 ms]
    Range (min … max):   389.4 ms … 412.4 ms    10 runs
   
  Summary
    '/tmp/orig/bin/git push dst.git refs/heads/*' ran
     40.68 ± 1.16 times faster than '/tmp/fsync/bin/git push dst.git refs/heads/*'

So it really does produce a noticeable impact (this is on a system with
a decent SSD and no other disk load, so I'd expect it to be about
average for modern hardware).

Now this test isn't entirely fair. 100 refs is a larger than average
number to be pushing, and the effect is out-sized because there's
virtually no time spent dealing with the objects themselves, nor is
there any network latency. But 400ms feels like a non-trivial amount of
time just in absolute numbers.

The numbers scale pretty linearly, as you'd expect. Pushing 10 refs
takes ~40ms, 100 takes ~400ms, and 1000 takes ~4s. The non-fsyncing
version gets slower, too (there's more work to do), but much more slowly
(6ms, 10ms, and 50ms respectively).

So this will definitely hurt at edge / pathological cases.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-04 22:36     ` Neeraj Singh
  2021-11-05  1:40       ` Junio C Hamano
@ 2021-11-05  8:35       ` Ævar Arnfjörð Bjarmason
  2021-11-05  9:04         ` Jeff King
  1 sibling, 1 reply; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-05  8:35 UTC (permalink / raw)
  To: Neeraj Singh
  Cc: Junio C Hamano, Patrick Steinhardt, git, Eric Wong, Neeraj K. Singh


On Thu, Nov 04 2021, Neeraj Singh wrote:

> On Thu, Nov 04, 2021 at 02:24:26PM -0700, Junio C Hamano wrote:
>> Ævar Arnfjörð Bjarmason <avarab@gmail.com> writes:
>> 
>> > I think it would probably be best to create a git_fsync_fd() function
>> > which is non-fatal and has that config/while loop, and have
>> > fsync_or_die() be a "..or die()" wrapper around that, then you could
>> > call that git_fsync_fd() here.
>> 
>> Adding git_fsync_fd() smells like a poor taste, as git_fsync() takes
>> an fd already.  How about making the current one into a static helper
>> 
>> 	-int git_fsync(int fd, enum fsync_action action)
>> 	+static int git_fsync_once(int fd, enum fsync_action action)
>> 	 ...
>> 
>> and then hide the looping behavior behind git_fsync() proper?
>> 
>>         int git_fsync(int fd, enum fsync_action action)
>>         {
>>                 while (git_fsync_once(fd, action) < 0)
>>                         if (errno != EINTR)
>>                                 return -1;
>>                 return 0;
>>         }
>> 
>> fsync_or_die() can be simplified by getting rid of its loop.
>> 
>> None of that needs to block Patrick's work, I think.  A version that
>> uses raw fsync() and punts on EINTR can graduate first, which makes
>> the situation better than the status quo, and all the ongoing work
>> that deal with fsync can be extended with an eye to make it also
>> usable to replace the fsync() call Patrick's fix adds.
>
> Is there some reason we shouldn't die if writing the ref fails? We are
> already accustomed to dying if fsyncing a packfile or the index fails.
>
> I assume the number of refs updated is not that high on any given git
> operation, so it's not worth having a batch mode for this eventually.

Others have mostly touched on this, but usually (always?) we're about to
die anyway, but by not calling die() here we'll die with a less crappy
message.

I.e. this bit:

    strbuf_addf(err, [...]

Is effectively a deferred die() in most (all?) cases, and be passed up
to the ref update code. because instead of a nondescript message like:

    fsync: failed somewhere

We want:

    couldn't update ref xyz to OID abc: fsync failed

Or similar. 

Hrm, actually having written the above there's a really good reason not
to die, which is that fsync can return ENOSPC.

So if we're in the middle of a transaction and have created and written
the lockfile we might only notice that the disk has is full when we do
the fsync().

In that case we'll (or should, I didn't check just now) unroll the ref
transaction, and delete the *.lock files we created, which presumably
will succeed in that scenario.

So calling die() at this level is the difference between leaving the
repo in an inconsistent state due to a disk error, and something like
"git fetch --atomic" gracefully failing.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-05  8:35       ` Ævar Arnfjörð Bjarmason
@ 2021-11-05  9:04         ` Jeff King
  0 siblings, 0 replies; 34+ messages in thread
From: Jeff King @ 2021-11-05  9:04 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Neeraj Singh, Junio C Hamano, Patrick Steinhardt, git, Eric Wong,
	Neeraj K. Singh

On Fri, Nov 05, 2021 at 09:35:24AM +0100, Ævar Arnfjörð Bjarmason wrote:

> So if we're in the middle of a transaction and have created and written
> the lockfile we might only notice that the disk has is full when we do
> the fsync().
> 
> In that case we'll (or should, I didn't check just now) unroll the ref
> transaction, and delete the *.lock files we created, which presumably
> will succeed in that scenario.
> 
> So calling die() at this level is the difference between leaving the
> repo in an inconsistent state due to a disk error, and something like
> "git fetch --atomic" gracefully failing.

We should rollback the lockfiles even if we call die(), via the atexit()
handler. Ditto if we receive a fatal signal.

(But I completely agree that if we have the opportunity to just pass the
error up the stack, we should do so).

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-05  7:17   ` Jeff King
@ 2021-11-05  9:12     ` Johannes Schindelin
  2021-11-05  9:22       ` Patrick Steinhardt
  2021-11-05  9:34       ` Jeff King
  0 siblings, 2 replies; 34+ messages in thread
From: Johannes Schindelin @ 2021-11-05  9:12 UTC (permalink / raw)
  To: Jeff King; +Cc: Patrick Steinhardt, git

[-- Attachment #1: Type: text/plain, Size: 3707 bytes --]

Hi Peff & Patrick,

On Fri, 5 Nov 2021, Jeff King wrote:

> On Fri, Nov 05, 2021 at 03:07:18AM -0400, Jeff King wrote:
>
> >   2. It's not clear what the performance implications will be,
> >      especially on a busy server doing a lot of ref updates, or on a
> >      filesystem where fsync() ends up syncing everything, not just the
> >      one file (my impression is ext3 is such a system, but not ext4).
> >      Whereas another solution may be journaling data and metadata writes
> >      in order without worrying about the durability of writing them to
> >      disk.
> >
> >      I suspect for small updates (say, a push of one or two refs), this
> >      will have little impact. We'd generally fsync the incoming packfile
> >      and its idx anyway, so we're adding may one or two fsyncs on top of
> >      that. But if you're pushing 100 refs, that will be 100 sequential
> >      fsyncs, which may add up to quite a bit of latency. It would be
> >      nice if we could batch these by somehow (e.g., by opening up all of
> >      the lockfiles, writing and fsyncing them, and then renaming one by
> >      one).
>
> So here's a quick experiment that shows a worst case: a small push that
> updates a bunch of refs. After building Git with and without your patch,
> I set up a small repo like:
>
>   git init
>   git commit --allow-empty -m foo
>   for i in $(seq 100); do
>     git update-ref refs/heads/$i HEAD
>   done
>
> To give a clean slate between runs, I stuck this in a script called
> "setup":
>
>   #!/bin/sh
>   rm -rf dst.git
>   git init --bare dst.git
>   sync
>
> And then ran:
>
>   $ hyperfine -L v orig,fsync -p ./setup '/tmp/{v}/bin/git push dst.git refs/heads/*'
>   Benchmark 1: /tmp/orig/bin/git push dst.git refs/heads/*
>     Time (mean ± σ):       9.9 ms ±   0.2 ms    [User: 6.3 ms, System: 4.7 ms]
>     Range (min … max):     9.5 ms …  10.5 ms    111 runs
>
>   Benchmark 2: /tmp/fsync/bin/git push dst.git refs/heads/*
>     Time (mean ± σ):     401.0 ms ±   7.7 ms    [User: 9.4 ms, System: 15.2 ms]
>     Range (min … max):   389.4 ms … 412.4 ms    10 runs
>
>   Summary
>     '/tmp/orig/bin/git push dst.git refs/heads/*' ran
>      40.68 ± 1.16 times faster than '/tmp/fsync/bin/git push dst.git refs/heads/*'
>
> So it really does produce a noticeable impact (this is on a system with
> a decent SSD and no other disk load, so I'd expect it to be about
> average for modern hardware).
>
> Now this test isn't entirely fair. 100 refs is a larger than average
> number to be pushing, and the effect is out-sized because there's
> virtually no time spent dealing with the objects themselves, nor is
> there any network latency. But 400ms feels like a non-trivial amount of
> time just in absolute numbers.
>
> The numbers scale pretty linearly, as you'd expect. Pushing 10 refs
> takes ~40ms, 100 takes ~400ms, and 1000 takes ~4s. The non-fsyncing
> version gets slower, too (there's more work to do), but much more slowly
> (6ms, 10ms, and 50ms respectively).
>
> So this will definitely hurt at edge / pathological cases.

Ouch.

I wonder whether this could be handled similarly to the
`core.fsyncObjectFiles=batch` mode that has been proposed in
https://lore.kernel.org/git/pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com/

Essentially, we would have to find a better layer to do this, where we
can synchronize after a potentially quite large number of ref updates has
happened. That would definitely be a different layer than the file-based
refs backend, of course, and would probably apply in a different way to
other refs backends.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-05  9:12     ` Johannes Schindelin
@ 2021-11-05  9:22       ` Patrick Steinhardt
  2021-11-05  9:34       ` Jeff King
  1 sibling, 0 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-05  9:22 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Jeff King, git

[-- Attachment #1: Type: text/plain, Size: 5107 bytes --]

On Fri, Nov 05, 2021 at 10:12:25AM +0100, Johannes Schindelin wrote:
> Hi Peff & Patrick,
> 
> On Fri, 5 Nov 2021, Jeff King wrote:
> 
> > On Fri, Nov 05, 2021 at 03:07:18AM -0400, Jeff King wrote:
> >
> > >   2. It's not clear what the performance implications will be,
> > >      especially on a busy server doing a lot of ref updates, or on a
> > >      filesystem where fsync() ends up syncing everything, not just the
> > >      one file (my impression is ext3 is such a system, but not ext4).
> > >      Whereas another solution may be journaling data and metadata writes
> > >      in order without worrying about the durability of writing them to
> > >      disk.
> > >
> > >      I suspect for small updates (say, a push of one or two refs), this
> > >      will have little impact. We'd generally fsync the incoming packfile
> > >      and its idx anyway, so we're adding may one or two fsyncs on top of
> > >      that. But if you're pushing 100 refs, that will be 100 sequential
> > >      fsyncs, which may add up to quite a bit of latency. It would be
> > >      nice if we could batch these by somehow (e.g., by opening up all of
> > >      the lockfiles, writing and fsyncing them, and then renaming one by
> > >      one).
> >
> > So here's a quick experiment that shows a worst case: a small push that
> > updates a bunch of refs. After building Git with and without your patch,
> > I set up a small repo like:
> >
> >   git init
> >   git commit --allow-empty -m foo
> >   for i in $(seq 100); do
> >     git update-ref refs/heads/$i HEAD
> >   done
> >
> > To give a clean slate between runs, I stuck this in a script called
> > "setup":
> >
> >   #!/bin/sh
> >   rm -rf dst.git
> >   git init --bare dst.git
> >   sync
> >
> > And then ran:
> >
> >   $ hyperfine -L v orig,fsync -p ./setup '/tmp/{v}/bin/git push dst.git refs/heads/*'
> >   Benchmark 1: /tmp/orig/bin/git push dst.git refs/heads/*
> >     Time (mean ± σ):       9.9 ms ±   0.2 ms    [User: 6.3 ms, System: 4.7 ms]
> >     Range (min … max):     9.5 ms …  10.5 ms    111 runs
> >
> >   Benchmark 2: /tmp/fsync/bin/git push dst.git refs/heads/*
> >     Time (mean ± σ):     401.0 ms ±   7.7 ms    [User: 9.4 ms, System: 15.2 ms]
> >     Range (min … max):   389.4 ms … 412.4 ms    10 runs
> >
> >   Summary
> >     '/tmp/orig/bin/git push dst.git refs/heads/*' ran
> >      40.68 ± 1.16 times faster than '/tmp/fsync/bin/git push dst.git refs/heads/*'
> >
> > So it really does produce a noticeable impact (this is on a system with
> > a decent SSD and no other disk load, so I'd expect it to be about
> > average for modern hardware).
> >
> > Now this test isn't entirely fair. 100 refs is a larger than average
> > number to be pushing, and the effect is out-sized because there's
> > virtually no time spent dealing with the objects themselves, nor is
> > there any network latency. But 400ms feels like a non-trivial amount of
> > time just in absolute numbers.
> >
> > The numbers scale pretty linearly, as you'd expect. Pushing 10 refs
> > takes ~40ms, 100 takes ~400ms, and 1000 takes ~4s. The non-fsyncing
> > version gets slower, too (there's more work to do), but much more slowly
> > (6ms, 10ms, and 50ms respectively).
> >
> > So this will definitely hurt at edge / pathological cases.
> 
> Ouch.
> 
> I wonder whether this could be handled similarly to the
> `core.fsyncObjectFiles=batch` mode that has been proposed in
> https://lore.kernel.org/git/pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com/
> 
> Essentially, we would have to find a better layer to do this, where we
> can synchronize after a potentially quite large number of ref updates has
> happened. That would definitely be a different layer than the file-based
> refs backend, of course, and would probably apply in a different way to
> other refs backends.
> 
> Ciao,
> Dscho

Good question. We'd likely have to make this part of the ref_transaction
layer to make this work without having to update all callers. It would
likely work something like the following:

    1. The caller queues up all ref updates.

    2. The caller moves the transaction into prepared state, which
       creates all the lockfiles for us.

    3. After all lockfiles have been created, we must sync them to disk.
       This may either happen when preparing the transaction, or when
       committing it. I think it's better located in prepare though so
       that the likelihood that the transaction succeeds after a
       successful prepare is high.

    4. Now we can rename all files into place.

Naturally, we'd only benefit from this batching in case the caller
actually queues up all ref updates into a single transaction. And
there's going to be at least some which don't by default. The first that
comes to my mind is git-fetch(1), which explicitly requires the user to
use `--atomic` to queue them all up into a single transaction. And I
guess there's going to be others which use one transaction per ref.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-05  9:12     ` Johannes Schindelin
  2021-11-05  9:22       ` Patrick Steinhardt
@ 2021-11-05  9:34       ` Jeff King
  2021-11-09 11:25         ` Patrick Steinhardt
  1 sibling, 1 reply; 34+ messages in thread
From: Jeff King @ 2021-11-05  9:34 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Neeraj Singh, Patrick Steinhardt, git

On Fri, Nov 05, 2021 at 10:12:25AM +0100, Johannes Schindelin wrote:

> > So this will definitely hurt at edge / pathological cases.
> 
> Ouch.
> 
> I wonder whether this could be handled similarly to the
> `core.fsyncObjectFiles=batch` mode that has been proposed in
> https://lore.kernel.org/git/pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com/

Yeah, that was along the lines I was thinking.

I hadn't really looked at the details of the batch-fsync there. The big
trick seems to be doing a pagecache writeback for each file, and then
stimulating an actual disk write (by fsyncing a tempfile) after the
batch is done.

That would be pretty easy to do for the refs (it's git_fsync() while
closing each file where Patrick is calling fsync(), followed by a single
do_batch_fsync() after everything is closed but before we rename).

> Essentially, we would have to find a better layer to do this, where we
> can synchronize after a potentially quite large number of ref updates has
> happened. That would definitely be a different layer than the file-based
> refs backend, of course, and would probably apply in a different way to
> other refs backends.

We do have the concept of a ref_transaction, so that would be the
natural place for it. Not every caller uses it, though, because it
implies atomicity of the transaction (so some may do a sequence of N
independent transactions, because they don't want failure of one to
impact others). I think that could be changed, if the ref_transaction
learned about non-atomicity, but it may take some surgery.

I expect that reftables would similarly benefit; it is probably much
more efficient to write a table slice with N entries than it is to write
N slices, even before accounting for fsync(). And once doing that, then
the fsync() part becomes trivial.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-05  9:34       ` Jeff King
@ 2021-11-09 11:25         ` Patrick Steinhardt
  2021-11-10  8:36           ` Jeff King
  0 siblings, 1 reply; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-09 11:25 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, Neeraj Singh, git

[-- Attachment #1: Type: text/plain, Size: 3297 bytes --]

On Fri, Nov 05, 2021 at 05:34:44AM -0400, Jeff King wrote:
> On Fri, Nov 05, 2021 at 10:12:25AM +0100, Johannes Schindelin wrote:
> 
> > > So this will definitely hurt at edge / pathological cases.
> > 
> > Ouch.
> > 
> > I wonder whether this could be handled similarly to the
> > `core.fsyncObjectFiles=batch` mode that has been proposed in
> > https://lore.kernel.org/git/pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com/
> 
> Yeah, that was along the lines I was thinking.
> 
> I hadn't really looked at the details of the batch-fsync there. The big
> trick seems to be doing a pagecache writeback for each file, and then
> stimulating an actual disk write (by fsyncing a tempfile) after the
> batch is done.
> 
> That would be pretty easy to do for the refs (it's git_fsync() while
> closing each file where Patrick is calling fsync(), followed by a single
> do_batch_fsync() after everything is closed but before we rename).
> 
> > Essentially, we would have to find a better layer to do this, where we
> > can synchronize after a potentially quite large number of ref updates has
> > happened. That would definitely be a different layer than the file-based
> > refs backend, of course, and would probably apply in a different way to
> > other refs backends.
> 
> We do have the concept of a ref_transaction, so that would be the
> natural place for it. Not every caller uses it, though, because it
> implies atomicity of the transaction (so some may do a sequence of N
> independent transactions, because they don't want failure of one to
> impact others). I think that could be changed, if the ref_transaction
> learned about non-atomicity, but it may take some surgery.
> 
> I expect that reftables would similarly benefit; it is probably much
> more efficient to write a table slice with N entries than it is to write
> N slices, even before accounting for fsync(). And once doing that, then
> the fsync() part becomes trivial.
> 
> -Peff

So I've finally found the time to have another look at massaging this
into the ref_transaction mechanism. If we do want to batch the fsync(3P)
calls, then we basically have two different alternatives:

    1. We first lock all loose refs by creating the respective lock
       files and writing the updated ref into that file. We keep the
       file descriptor open such that we can then flush them all in one
       go.

    2. Same as before, we lock all refs and write the updated pointers
       into the lockfiles, but this time we close each lockfile after
       having written to it. Later, we reopen them all to fsync(3P) them
       to disk.

I'm afraid both alternatives aren't any good: the first alternative
risks running out of file descriptors if you queue up lots of refs. And
the second alternative is going to be slow, especially so on Windows if
I'm not mistaken.

So with both not being feasible, we'll likely have to come up with a
more complex scheme if we want to batch-sync files. One idea would be to
fsync(3P) all lockfiles every $n refs, but it adds complexity in a place
where I'd really like to have things as simple as possible. It also
raises the question what $n would have to be.

Does anybody else have better ideas than I do?

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-09 11:25         ` Patrick Steinhardt
@ 2021-11-10  8:36           ` Jeff King
  2021-11-10  9:16             ` Patrick Steinhardt
  0 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2021-11-10  8:36 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Johannes Schindelin, Neeraj Singh, git

On Tue, Nov 09, 2021 at 12:25:46PM +0100, Patrick Steinhardt wrote:

> So I've finally found the time to have another look at massaging this
> into the ref_transaction mechanism. If we do want to batch the fsync(3P)
> calls, then we basically have two different alternatives:
> 
>     1. We first lock all loose refs by creating the respective lock
>        files and writing the updated ref into that file. We keep the
>        file descriptor open such that we can then flush them all in one
>        go.
> 
>     2. Same as before, we lock all refs and write the updated pointers
>        into the lockfiles, but this time we close each lockfile after
>        having written to it. Later, we reopen them all to fsync(3P) them
>        to disk.
> 
> I'm afraid both alternatives aren't any good: the first alternative
> risks running out of file descriptors if you queue up lots of refs. And
> the second alternative is going to be slow, especially so on Windows if
> I'm not mistaken.

I agree the first is a dead end. I had imagined something like the
second, but I agree that we'd have to measure the cost of re-opening.
It's too bad there is not a syscall to sync a particular set of paths
(or even better, a directory tree recursively).

There is another option, though: the batch-fsync code for objects does a
"cheap" fsync while we have the descriptor open, and then later triggers
a to-disk sync on a separate file. My understanding is that this works
because modern filesystems will make sure the data write is in the
journal on the cheap sync, and then the separate-file sync makes sure
the journal goes to disk.

We could do something like that here. In fact, if you don't care about
durability and just filesystem corruption, then you only care about the
first sync (because the bad case is when the rename gets journaled but
the data write doesn't).

In fact, even if you did choose to re-open and fsync each one, that's
still sequential. We'd need some way to tell the kernel to sync them all
at once. The batch-fsync trickery above is one such way (I haven't
tried, but I wonder if making a bunch of fsync calls in parallel would
work similarly).

> So with both not being feasible, we'll likely have to come up with a
> more complex scheme if we want to batch-sync files. One idea would be to
> fsync(3P) all lockfiles every $n refs, but it adds complexity in a place
> where I'd really like to have things as simple as possible. It also
> raises the question what $n would have to be.

I do think syncing every $n would not be too hard to implement. It could
all be hidden behind a state machine API that collects lock files and
flushes when it sees fit. You'd just call a magic "batch_fsync_and_close"
instead of "fsync() and close()", though you would have to remember to
do a final flush call to tell it there are no more coming.

-Peff

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] refs: sync loose refs to disk before committing them
  2021-11-10  8:36           ` Jeff King
@ 2021-11-10  9:16             ` Patrick Steinhardt
  0 siblings, 0 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-10  9:16 UTC (permalink / raw)
  To: Jeff King; +Cc: Johannes Schindelin, Neeraj Singh, git

[-- Attachment #1: Type: text/plain, Size: 3193 bytes --]

On Wed, Nov 10, 2021 at 03:36:59AM -0500, Jeff King wrote:
> On Tue, Nov 09, 2021 at 12:25:46PM +0100, Patrick Steinhardt wrote:
> 
> > So I've finally found the time to have another look at massaging this
> > into the ref_transaction mechanism. If we do want to batch the fsync(3P)
> > calls, then we basically have two different alternatives:
> > 
> >     1. We first lock all loose refs by creating the respective lock
> >        files and writing the updated ref into that file. We keep the
> >        file descriptor open such that we can then flush them all in one
> >        go.
> > 
> >     2. Same as before, we lock all refs and write the updated pointers
> >        into the lockfiles, but this time we close each lockfile after
> >        having written to it. Later, we reopen them all to fsync(3P) them
> >        to disk.
> > 
> > I'm afraid both alternatives aren't any good: the first alternative
> > risks running out of file descriptors if you queue up lots of refs. And
> > the second alternative is going to be slow, especially so on Windows if
> > I'm not mistaken.
> 
> I agree the first is a dead end. I had imagined something like the
> second, but I agree that we'd have to measure the cost of re-opening.
> It's too bad there is not a syscall to sync a particular set of paths
> (or even better, a directory tree recursively).
> 
> There is another option, though: the batch-fsync code for objects does a
> "cheap" fsync while we have the descriptor open, and then later triggers
> a to-disk sync on a separate file. My understanding is that this works
> because modern filesystems will make sure the data write is in the
> journal on the cheap sync, and then the separate-file sync makes sure
> the journal goes to disk.
> 
> We could do something like that here. In fact, if you don't care about
> durability and just filesystem corruption, then you only care about the
> first sync (because the bad case is when the rename gets journaled but
> the data write doesn't).

Ah, interesting. That does sound like a good way forward to me, thanks
for the pointers!

Patrick

> In fact, even if you did choose to re-open and fsync each one, that's
> still sequential. We'd need some way to tell the kernel to sync them all
> at once. The batch-fsync trickery above is one such way (I haven't
> tried, but I wonder if making a bunch of fsync calls in parallel would
> work similarly).
> 
> > So with both not being feasible, we'll likely have to come up with a
> > more complex scheme if we want to batch-sync files. One idea would be to
> > fsync(3P) all lockfiles every $n refs, but it adds complexity in a place
> > where I'd really like to have things as simple as possible. It also
> > raises the question what $n would have to be.
> 
> I do think syncing every $n would not be too hard to implement. It could
> all be hidden behind a state machine API that collects lock files and
> flushes when it sees fit. You'd just call a magic "batch_fsync_and_close"
> instead of "fsync() and close()", though you would have to remember to
> do a final flush call to tell it there are no more coming.
> 
> -Peff

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 0/3] refs: sync loose refs to disk before committing them
  2021-11-04 12:38 [PATCH] refs: sync loose refs to disk before committing them Patrick Steinhardt
  2021-11-04 13:14 ` Ævar Arnfjörð Bjarmason
  2021-11-05  7:07 ` Jeff King
@ 2021-11-10 11:40 ` Patrick Steinhardt
  2021-11-10 11:40   ` [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()` Patrick Steinhardt
                     ` (4 more replies)
  2 siblings, 5 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-10 11:40 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Eric Wong, Neeraj K. Singh

[-- Attachment #1: Type: text/plain, Size: 1635 bytes --]

Hi,

[This is a resend, I initially used the wrong mailing list address.]

this is the second version of this patch series to implement syncing of
loose refs to avoid a class of ref corruption.

This series is implements the same batching as Neeraj's series [1]. It
introduces a new config switch "core.fsyncRefFiles" with the same three
toggles "on", "off" and "batch". Given that it reuses funcitonality
provided by the batched object flushing this series is based on "next".

Please note that I didn't yet add any performance numbers or tests.
Performance tests didn't show any conclusive results on my machine given
that I couldn't observe any noticeable impact at all, and I didn't write
tests yet given that I first wanted to get some feedback on this series.
If we agree that this is the correct way to go forward I'll of course
put in some more time to add them.

Patrick

[1]: <pull.1076.v8.git.git.1633366667.gitgitgadget@gmail.com>

Patrick Steinhardt (3):
  wrapper: handle EINTR in `git_fsync()`
  wrapper: provide function to sync directories
  refs: add configuration to enable flushing of refs

 Documentation/config/core.txt |  7 ++++
 bulk-checkin.c                | 13 ++------
 cache.h                       |  7 ++++
 config.c                      | 10 ++++++
 environment.c                 |  1 +
 git-compat-util.h             |  7 ++++
 refs/files-backend.c          | 62 ++++++++++++++++++++++++++++++++++-
 wrapper.c                     | 30 ++++++++++++++++-
 write-or-die.c                |  6 ++--
 9 files changed, 127 insertions(+), 16 deletions(-)

-- 
2.33.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()`
  2021-11-10 11:40 ` [PATCH v2 0/3] " Patrick Steinhardt
@ 2021-11-10 11:40   ` Patrick Steinhardt
  2021-11-10 14:33     ` Johannes Schindelin
  2021-11-10 14:39     ` Ævar Arnfjörð Bjarmason
  2021-11-10 11:40   ` [PATCH v2 2/3] wrapper: provide function to sync directories Patrick Steinhardt
                     ` (3 subsequent siblings)
  4 siblings, 2 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-10 11:40 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Eric Wong, Neeraj K. Singh

[-- Attachment #1: Type: text/plain, Size: 1651 bytes --]

While we already have a wrapper around `git_fsync()`, this wrapper
doesn't handle looping around `EINTR`. Convert it to do so such that
callers don't have to remember doing this everywhere.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 wrapper.c      | 9 ++++++++-
 write-or-die.c | 6 ++----
 2 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/wrapper.c b/wrapper.c
index ece3d2ca10..e20df4f3a6 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -546,7 +546,7 @@ int xmkstemp_mode(char *filename_template, int mode)
 	return fd;
 }
 
-int git_fsync(int fd, enum fsync_action action)
+static int git_fsync_once(int fd, enum fsync_action action)
 {
 	switch (action) {
 	case FSYNC_WRITEOUT_ONLY:
@@ -591,7 +591,14 @@ int git_fsync(int fd, enum fsync_action action)
 	default:
 		BUG("unexpected git_fsync(%d) call", action);
 	}
+}
 
+int git_fsync(int fd, enum fsync_action action)
+{
+	while (git_fsync_once(fd, action) < 0)
+		if (errno != EINTR)
+			return -1;
+	return 0;
 }
 
 static int warn_if_unremovable(const char *op, const char *file, int rc)
diff --git a/write-or-die.c b/write-or-die.c
index cc8291d979..e61220aa2d 100644
--- a/write-or-die.c
+++ b/write-or-die.c
@@ -57,10 +57,8 @@ void fprintf_or_die(FILE *f, const char *fmt, ...)
 
 void fsync_or_die(int fd, const char *msg)
 {
-	while (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0) {
-		if (errno != EINTR)
-			die_errno("fsync error on '%s'", msg);
-	}
+	if (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0)
+		die_errno("fsync error on '%s'", msg);
 }
 
 void write_or_die(int fd, const void *buf, size_t count)
-- 
2.33.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 2/3] wrapper: provide function to sync directories
  2021-11-10 11:40 ` [PATCH v2 0/3] " Patrick Steinhardt
  2021-11-10 11:40   ` [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()` Patrick Steinhardt
@ 2021-11-10 11:40   ` Patrick Steinhardt
  2021-11-10 14:40     ` Ævar Arnfjörð Bjarmason
  2021-11-10 11:41   ` [PATCH v2 3/3] refs: add configuration to enable flushing of refs Patrick Steinhardt
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-10 11:40 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Eric Wong, Neeraj K. Singh

[-- Attachment #1: Type: text/plain, Size: 3312 bytes --]

In ec983eb5d2 (core.fsyncobjectfiles: batched disk flushes, 2021-10-04),
we have introduced batched syncing of object files. This mode works by
only requesting a writeback of the page cache backing the file on
written files, followed by a single hardware-flush via a temporary file
created in the directory we want to flush. Given modern journaling file
systems, this pattern is expected to be durable.

While it's possible to reuse the `git_fsync()` helper to synchronize the
page cache only, there is no helper which would allow for doing a
hardware flush of a directory by creating a temporary file. Other
callers which want to follow the same pattern would thus have to repeat
this logic.

Extract a new helper `git_fsync_dir()` from the object files code which
neatly encapsulates this logic such that it can be reused.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 bulk-checkin.c    | 13 +++----------
 git-compat-util.h |  7 +++++++
 wrapper.c         | 21 +++++++++++++++++++++
 3 files changed, 31 insertions(+), 10 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 4deee1af46..e6ebdd1db5 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -98,16 +98,9 @@ static void do_batch_fsync(void)
 	 * hardware.
 	 */
 
-	if (needs_batch_fsync) {
-		struct strbuf temp_path = STRBUF_INIT;
-		struct tempfile *temp;
-
-		strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
-		temp = xmks_tempfile(temp_path.buf);
-		fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
-		delete_tempfile(&temp);
-		strbuf_release(&temp_path);
-	}
+	if (needs_batch_fsync &&
+	    git_fsync_dir(get_object_directory()) < 0)
+		die_errno("fsyncing object directory");
 
 	if (bulk_fsync_objdir)
 		tmp_objdir_migrate(bulk_fsync_objdir);
diff --git a/git-compat-util.h b/git-compat-util.h
index 97f97178e7..f890bd07fd 100644
--- a/git-compat-util.h
+++ b/git-compat-util.h
@@ -1221,6 +1221,13 @@ enum fsync_action {
 
 int git_fsync(int fd, enum fsync_action action);
 
+/*
+ * Issue a full hardware flush against a temporary file in the given directory
+ * to ensure that all files inside that directory are durable before any renames
+ * occur.
+ */
+int git_fsync_dir(const char *path);
+
 /*
  * Preserves errno, prints a message, but gives no warning for ENOENT.
  * Returns 0 on success, which includes trying to unlink an object that does
diff --git a/wrapper.c b/wrapper.c
index e20df4f3a6..6c6cc8b74f 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -3,6 +3,7 @@
  */
 #include "cache.h"
 #include "config.h"
+#include "tempfile.h"
 
 static int memory_limit_check(size_t size, int gentle)
 {
@@ -601,6 +602,26 @@ int git_fsync(int fd, enum fsync_action action)
 	return 0;
 }
 
+int git_fsync_dir(const char *path)
+{
+	struct strbuf temp_path = STRBUF_INIT;
+	struct tempfile *temp;
+
+	strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", path);
+
+	temp = mks_tempfile(temp_path.buf);
+	if (!temp)
+		return -1;
+
+	if (git_fsync(get_tempfile_fd(temp), FSYNC_HARDWARE_FLUSH) < 0)
+		return -1;
+
+	delete_tempfile(&temp);
+	strbuf_release(&temp_path);
+
+	return 0;
+}
+
 static int warn_if_unremovable(const char *op, const char *file, int rc)
 {
 	int err;
-- 
2.33.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v2 3/3] refs: add configuration to enable flushing of refs
  2021-11-10 11:40 ` [PATCH v2 0/3] " Patrick Steinhardt
  2021-11-10 11:40   ` [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()` Patrick Steinhardt
  2021-11-10 11:40   ` [PATCH v2 2/3] wrapper: provide function to sync directories Patrick Steinhardt
@ 2021-11-10 11:41   ` Patrick Steinhardt
  2021-11-10 14:49     ` Ævar Arnfjörð Bjarmason
  2021-11-11  0:18     ` Neeraj Singh
  2021-11-10 14:44   ` [PATCH v2 0/3] refs: sync loose refs to disk before committing them Johannes Schindelin
  2021-11-10 20:45   ` Jeff King
  4 siblings, 2 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-10 11:41 UTC (permalink / raw)
  To: git
  Cc: Jeff King, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Eric Wong, Neeraj K. Singh

[-- Attachment #1: Type: text/plain, Size: 9268 bytes --]

When writing loose refs, we first create a lockfile, write the new ref
into that lockfile, close it and then rename the lockfile into place
such that the actual update is atomic for that single ref. While this
works as intended under normal circumstences, at GitLab we infrequently
encounter corrupt loose refs in repositories after a machine encountered
a hard reset. The corruption is always of the same type: the ref has
been committed into place, but it is completely empty.

The root cause of this is likely that we don't sync contents of the
lockfile to disk before renaming it into place. As a result, it's not
guaranteed that the contents are properly persisted and one may observe
weird in-between states on hard resets. Quoting ext4 documentation [1]:

    Many broken applications don't use fsync() when replacing existing
    files via patterns such as fd =
    open("foo.new")/write(fd,..)/close(fd)/ rename("foo.new", "foo"), or
    worse yet, fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). If
    auto_da_alloc is enabled, ext4 will detect the replace-via-rename
    and replace-via-truncate patterns and force that any delayed
    allocation blocks are allocated such that at the next journal
    commit, in the default data=ordered mode, the data blocks of the new
    file are forced to disk before the rename() operation is committed.
    This provides roughly the same level of guarantees as ext3, and
    avoids the "zero-length" problem that can happen when a system
    crashes before the delayed allocation blocks are forced to disk.

This explicitly points out that one must call fsync(3P) before doing the
rename(3P) call, or otherwise data may not be correctly persisted to
disk.

Fix this by introducing a new configuration "core.fsyncRefFiles". This
config matches behaviour of "core.fsyncObjectFiles" in that it provides
three different modes:

    - "off" disables calling fsync on ref files. This is the default
      behaviour previous to this change and remains the default after
      this change.

    - "on" enables calling fsync on ref files, where each reference is
      flushed to disk before it is being committed. This is the safest
      setting, but may incur significant performance overhead.

    - "batch" will flush the page cache of each file as it is written to
      ensure its data is persisted. After all refs have been written,
      the directories which host refs are flushed.

With this change in place and when "core.fsyncRefFiles" is set to either
"on" or "batch", this kind of corruption shouldn't happen anymore.

[1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---
 Documentation/config/core.txt |  7 ++++
 cache.h                       |  7 ++++
 config.c                      | 10 ++++++
 environment.c                 |  1 +
 refs/files-backend.c          | 62 ++++++++++++++++++++++++++++++++++-
 5 files changed, 86 insertions(+), 1 deletion(-)

diff --git a/Documentation/config/core.txt b/Documentation/config/core.txt
index 200b4d9f06..e2fd0d8c90 100644
--- a/Documentation/config/core.txt
+++ b/Documentation/config/core.txt
@@ -572,6 +572,13 @@ core.fsyncObjectFiles::
   stored on HFS+ or APFS filesystems and on Windows for repos stored on NTFS or
   ReFS.
 
+core.fsyncRefFiles::
+	A value indicating the level of effort Git will expend in trying to make
+	refs added to the repo durable in the event of an unclean system
+	shutdown. This setting currently only controls loose refs in the object
+	store, so updates to packed refs may not be equally durable. Takes the
+	same parameters as `core.fsyncObjectFiles`.
+
 core.preloadIndex::
 	Enable parallel index preload for operations like 'git diff'
 +
diff --git a/cache.h b/cache.h
index 6d6e6770ec..14c8fab6b4 100644
--- a/cache.h
+++ b/cache.h
@@ -991,7 +991,14 @@ enum fsync_object_files_mode {
     FSYNC_OBJECT_FILES_BATCH
 };
 
+enum fsync_ref_files_mode {
+    FSYNC_REF_FILES_OFF,
+    FSYNC_REF_FILES_ON,
+    FSYNC_REF_FILES_BATCH
+};
+
 extern enum fsync_object_files_mode fsync_object_files;
+extern enum fsync_ref_files_mode fsync_ref_files;
 extern int core_preload_index;
 extern int precomposed_unicode;
 extern int protect_hfs;
diff --git a/config.c b/config.c
index 5eb36ecd77..4cbad5a29d 100644
--- a/config.c
+++ b/config.c
@@ -1500,6 +1500,16 @@ static int git_default_core_config(const char *var, const char *value, void *cb)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.fsyncreffiles")) {
+		if (value && !strcmp(value, "batch"))
+			fsync_ref_files = FSYNC_REF_FILES_BATCH;
+		else if (git_config_bool(var, value))
+			fsync_ref_files = FSYNC_REF_FILES_ON;
+		else
+			fsync_ref_files = FSYNC_REF_FILES_OFF;
+		return 0;
+	}
+
 	if (!strcmp(var, "core.preloadindex")) {
 		core_preload_index = git_config_bool(var, value);
 		return 0;
diff --git a/environment.c b/environment.c
index aeafe80235..1514ac9384 100644
--- a/environment.c
+++ b/environment.c
@@ -43,6 +43,7 @@ const char *git_hooks_path;
 int zlib_compression_level = Z_BEST_SPEED;
 int pack_compression_level = Z_DEFAULT_COMPRESSION;
 enum fsync_object_files_mode fsync_object_files;
+enum fsync_ref_files_mode fsync_ref_files;
 size_t packed_git_window_size = DEFAULT_PACKED_GIT_WINDOW_SIZE;
 size_t packed_git_limit = DEFAULT_PACKED_GIT_LIMIT;
 size_t delta_base_cache_limit = 96 * 1024 * 1024;
diff --git a/refs/files-backend.c b/refs/files-backend.c
index 4b14f30d48..31d7456266 100644
--- a/refs/files-backend.c
+++ b/refs/files-backend.c
@@ -1360,6 +1360,57 @@ static int commit_ref_update(struct files_ref_store *refs,
 			     const struct object_id *oid, const char *logmsg,
 			     struct strbuf *err);
 
+static int sync_loose_ref(int fd)
+{
+	switch (fsync_ref_files) {
+	case FSYNC_REF_FILES_OFF:
+		return 0;
+	case FSYNC_REF_FILES_ON:
+		return git_fsync(fd, FSYNC_HARDWARE_FLUSH);
+	case FSYNC_REF_FILES_BATCH:
+		return git_fsync(fd, FSYNC_WRITEOUT_ONLY);
+	default:
+		BUG("invalid fsync mode %d", fsync_ref_files);
+	}
+}
+
+#define SYNC_LOOSE_REF_GITDIR    (1 << 0)
+#define SYNC_LOOSE_REF_COMMONDIR (1 << 1)
+
+static int sync_loose_refs_flags(const char *refname)
+{
+	switch (ref_type(refname)) {
+	case REF_TYPE_PER_WORKTREE:
+	case REF_TYPE_PSEUDOREF:
+		return SYNC_LOOSE_REF_GITDIR;
+	case REF_TYPE_MAIN_PSEUDOREF:
+	case REF_TYPE_OTHER_PSEUDOREF:
+	case REF_TYPE_NORMAL:
+		return SYNC_LOOSE_REF_COMMONDIR;
+	default:
+		BUG("unknown ref type %d of ref %s",
+		    ref_type(refname), refname);
+	}
+}
+
+static int sync_loose_refs(struct files_ref_store *refs,
+			   int flags,
+			   struct strbuf *err)
+{
+	if (fsync_ref_files != FSYNC_REF_FILES_BATCH)
+		return 0;
+
+	if ((flags & SYNC_LOOSE_REF_GITDIR) &&
+	    git_fsync_dir(refs->base.gitdir) < 0)
+		return -1;
+
+	if ((flags & SYNC_LOOSE_REF_COMMONDIR) &&
+	    git_fsync_dir(refs->gitcommondir) < 0)
+		return -1;
+
+	return 0;
+}
+
 /*
  * Emit a better error message than lockfile.c's
  * unable_to_lock_message() would in case there is a D/F conflict with
@@ -1502,6 +1553,7 @@ static int files_copy_or_rename_ref(struct ref_store *ref_store,
 	oidcpy(&lock->old_oid, &orig_oid);
 
 	if (write_ref_to_lockfile(lock, &orig_oid, &err) ||
+	    sync_loose_refs(refs, sync_loose_refs_flags(newrefname), &err) ||
 	    commit_ref_update(refs, lock, &orig_oid, logmsg, &err)) {
 		error("unable to write current sha1 into %s: %s", newrefname, err.buf);
 		strbuf_release(&err);
@@ -1522,6 +1574,7 @@ static int files_copy_or_rename_ref(struct ref_store *ref_store,
 	flag = log_all_ref_updates;
 	log_all_ref_updates = LOG_REFS_NONE;
 	if (write_ref_to_lockfile(lock, &orig_oid, &err) ||
+	    sync_loose_refs(refs, sync_loose_refs_flags(newrefname), &err) ||
 	    commit_ref_update(refs, lock, &orig_oid, NULL, &err)) {
 		error("unable to write current sha1 into %s: %s", oldrefname, err.buf);
 		strbuf_release(&err);
@@ -1781,6 +1834,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
 	fd = get_lock_file_fd(&lock->lk);
 	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
 	    write_in_full(fd, &term, 1) < 0 ||
+	    sync_loose_ref(fd) < 0 ||
 	    close_ref_gently(lock) < 0) {
 		strbuf_addf(err,
 			    "couldn't write '%s'", get_lock_file_path(&lock->lk));
@@ -2665,7 +2719,7 @@ static int files_transaction_prepare(struct ref_store *ref_store,
 		files_downcast(ref_store, REF_STORE_WRITE,
 			       "ref_transaction_prepare");
 	size_t i;
-	int ret = 0;
+	int ret = 0, sync_flags = 0;
 	struct string_list affected_refnames = STRING_LIST_INIT_NODUP;
 	char *head_ref = NULL;
 	int head_type;
@@ -2777,8 +2831,14 @@ static int files_transaction_prepare(struct ref_store *ref_store,
 					&update->new_oid, NULL,
 					NULL);
 		}
+
+		sync_flags |= sync_loose_refs_flags(update->refname);
 	}
 
+	ret = sync_loose_refs(refs, sync_flags, err);
+	if (ret)
+		goto cleanup;
+
 	if (packed_transaction) {
 		if (packed_refs_lock(refs->packed_ref_store, 0, err)) {
 			ret = TRANSACTION_GENERIC_ERROR;
-- 
2.33.1


[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()`
  2021-11-10 11:40   ` [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()` Patrick Steinhardt
@ 2021-11-10 14:33     ` Johannes Schindelin
  2021-11-10 14:39     ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 34+ messages in thread
From: Johannes Schindelin @ 2021-11-10 14:33 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Eric Wong, Neeraj K. Singh

Hi Patrick,

just one observation, below.

On Wed, 10 Nov 2021, Patrick Steinhardt wrote:

> diff --git a/wrapper.c b/wrapper.c
> index ece3d2ca10..e20df4f3a6 100644
> --- a/wrapper.c
> +++ b/wrapper.c
> @@ -546,7 +546,7 @@ int xmkstemp_mode(char *filename_template, int mode)
>  	return fd;
>  }
>
> -int git_fsync(int fd, enum fsync_action action)
> +static int git_fsync_once(int fd, enum fsync_action action)
>  {
>  	switch (action) {
>  	case FSYNC_WRITEOUT_ONLY:
> @@ -591,7 +591,14 @@ int git_fsync(int fd, enum fsync_action action)
>  	default:
>  		BUG("unexpected git_fsync(%d) call", action);
>  	}
> +}
>
> +int git_fsync(int fd, enum fsync_action action)
> +{
> +	while (git_fsync_once(fd, action) < 0)
> +		if (errno != EINTR)
> +			return -1;
> +	return 0;
>  }

My immediate reaction was: why not fold `git_fsync_once()` into
`git_fsync()`?

And then I had a look at the function (which is sadly not in the diff
context of this email, one of the occasions when I would prefer a proper
UI for reviewing), and I agree that indenting the code even one level
would make it harder to read.

All this is to say: I agree with the approach you took here.

Ciao,
Dscho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()`
  2021-11-10 11:40   ` [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()` Patrick Steinhardt
  2021-11-10 14:33     ` Johannes Schindelin
@ 2021-11-10 14:39     ` Ævar Arnfjörð Bjarmason
  1 sibling, 0 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-10 14:39 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Johannes Schindelin, Junio C Hamano, Eric Wong,
	Neeraj K. Singh


On Wed, Nov 10 2021, Patrick Steinhardt wrote:

> [[PGP Signed Part:Undecided]]
> While we already have a wrapper around `git_fsync()`, this wrapper
> doesn't handle looping around `EINTR`. Convert it to do so such that
> callers don't have to remember doing this everywhere.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  wrapper.c      | 9 ++++++++-
>  write-or-die.c | 6 ++----
>  2 files changed, 10 insertions(+), 5 deletions(-)
>
> diff --git a/wrapper.c b/wrapper.c
> index ece3d2ca10..e20df4f3a6 100644
> --- a/wrapper.c
> +++ b/wrapper.c
> @@ -546,7 +546,7 @@ int xmkstemp_mode(char *filename_template, int mode)
>  	return fd;
>  }
>  
> -int git_fsync(int fd, enum fsync_action action)
> +static int git_fsync_once(int fd, enum fsync_action action)
>  {
>  	switch (action) {
>  	case FSYNC_WRITEOUT_ONLY:
> @@ -591,7 +591,14 @@ int git_fsync(int fd, enum fsync_action action)
>  	default:
>  		BUG("unexpected git_fsync(%d) call", action);
>  	}
> +}
>  
> +int git_fsync(int fd, enum fsync_action action)
> +{
> +	while (git_fsync_once(fd, action) < 0)
> +		if (errno != EINTR)
> +			return -1;
> +	return 0;
>  }
>  
>  static int warn_if_unremovable(const char *op, const char *file, int rc)
> diff --git a/write-or-die.c b/write-or-die.c
> index cc8291d979..e61220aa2d 100644
> --- a/write-or-die.c
> +++ b/write-or-die.c
> @@ -57,10 +57,8 @@ void fprintf_or_die(FILE *f, const char *fmt, ...)
>  
>  void fsync_or_die(int fd, const char *msg)
>  {
> -	while (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0) {
> -		if (errno != EINTR)
> -			die_errno("fsync error on '%s'", msg);
> -	}
> +	if (git_fsync(fd, FSYNC_HARDWARE_FLUSH) < 0)
> +		die_errno("fsync error on '%s'", msg);

Nit: While at it maybe change it to: "fsync() error syncing file '%s'"

Maybe having a xgit_fsync() convenience wrapper would make subsequent
steps easier?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 2/3] wrapper: provide function to sync directories
  2021-11-10 11:40   ` [PATCH v2 2/3] wrapper: provide function to sync directories Patrick Steinhardt
@ 2021-11-10 14:40     ` Ævar Arnfjörð Bjarmason
  0 siblings, 0 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-10 14:40 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Johannes Schindelin, Junio C Hamano, Eric Wong,
	Neeraj K. Singh


On Wed, Nov 10 2021, Patrick Steinhardt wrote:

> [[PGP Signed Part:Undecided]]
> In ec983eb5d2 (core.fsyncobjectfiles: batched disk flushes, 2021-10-04),
> we have introduced batched syncing of object files. This mode works by
> only requesting a writeback of the page cache backing the file on
> written files, followed by a single hardware-flush via a temporary file
> created in the directory we want to flush. Given modern journaling file
> systems, this pattern is expected to be durable.
>
> While it's possible to reuse the `git_fsync()` helper to synchronize the
> page cache only, there is no helper which would allow for doing a
> hardware flush of a directory by creating a temporary file. Other
> callers which want to follow the same pattern would thus have to repeat
> this logic.
>
> Extract a new helper `git_fsync_dir()` from the object files code which
> neatly encapsulates this logic such that it can be reused.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>  bulk-checkin.c    | 13 +++----------
>  git-compat-util.h |  7 +++++++
>  wrapper.c         | 21 +++++++++++++++++++++
>  3 files changed, 31 insertions(+), 10 deletions(-)
>
> diff --git a/bulk-checkin.c b/bulk-checkin.c
> index 4deee1af46..e6ebdd1db5 100644
> --- a/bulk-checkin.c
> +++ b/bulk-checkin.c
> @@ -98,16 +98,9 @@ static void do_batch_fsync(void)
>  	 * hardware.
>  	 */
>  
> -	if (needs_batch_fsync) {
> -		struct strbuf temp_path = STRBUF_INIT;
> -		struct tempfile *temp;
> -
> -		strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", get_object_directory());
> -		temp = xmks_tempfile(temp_path.buf);
> -		fsync_or_die(get_tempfile_fd(temp), get_tempfile_path(temp));
> -		delete_tempfile(&temp);
> -		strbuf_release(&temp_path);
> -	}
> +	if (needs_batch_fsync &&
> +	    git_fsync_dir(get_object_directory()) < 0)
> +		die_errno("fsyncing object directory");

Nit: Similar to 1/3, but this message is new: We say "fsyncing object
directory", but it would be better to pass in some "verbose" flag to
git_fsync_dir() so we can say e.g.:

    error_errno(_("couldn't create core.fsyncRefFiles=batch tempfile '%s' in '%s'"), ...)
    error_errno(_("couldn't fsync() core.fsyncRefFiles=batch tempfile '%s' in '%s'"), ...)

I.e. being able to say specifically why we failed, permission error or
the tempfile? fsync() didn't work etc?

Looking at the underlying APIs maybe they already have a mode to "die"
or "warn" appropriately? Or...

> +int git_fsync_dir(const char *path)
> +{
> +	struct strbuf temp_path = STRBUF_INIT;
> +	struct tempfile *temp;
> +
> +	strbuf_addf(&temp_path, "%s/bulk_fsync_XXXXXX", path);
> +
> +	temp = mks_tempfile(temp_path.buf);
> +	if (!temp)
> +		return -1;
> +
> +	if (git_fsync(get_tempfile_fd(temp), FSYNC_HARDWARE_FLUSH) < 0)
> +		return -1;

...if they do maybe we should use their non-fatal mode, because
with/without that these "return -1" need to be "goto cleanup" so we can
attempt to clean up after ourselves here.

I think this whole thing would be better if we generalized tmp-objdir.h
a bit, so it could create and manage an arbitrary file in an arbitrary
directory, and that API should really be generalized to a user of
tempfile.c.

I.e. we'd then create this file, sync it optionally, whine if it does't
work, and be guaranteed to try to clean anything that goes wrong up
atexit().

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/3] refs: sync loose refs to disk before committing them
  2021-11-10 11:40 ` [PATCH v2 0/3] " Patrick Steinhardt
                     ` (2 preceding siblings ...)
  2021-11-10 11:41   ` [PATCH v2 3/3] refs: add configuration to enable flushing of refs Patrick Steinhardt
@ 2021-11-10 14:44   ` Johannes Schindelin
  2021-11-10 20:45   ` Jeff King
  4 siblings, 0 replies; 34+ messages in thread
From: Johannes Schindelin @ 2021-11-10 14:44 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Eric Wong, Neeraj K. Singh

Hi Patrick,

On Wed, 10 Nov 2021, Patrick Steinhardt wrote:

> This series is implements the same batching as Neeraj's series [1]. It
> introduces a new config switch "core.fsyncRefFiles" with the same three
> toggles "on", "off" and "batch". Given that it reuses funcitonality
> provided by the batched object flushing this series is based on "next".

Excellent, I am glad that my idea was helpful.

I read through the patches and found them easy to follow. Even 3/3, which
was a bit complex because of the need to support both transacted ref
updates as well as atomic renames/copies, is clear and well-written.

FWIW you could base your patches directly on `ns/batched-fsync`, I would
think.

Thank you,
Dscho

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/3] refs: add configuration to enable flushing of refs
  2021-11-10 11:41   ` [PATCH v2 3/3] refs: add configuration to enable flushing of refs Patrick Steinhardt
@ 2021-11-10 14:49     ` Ævar Arnfjörð Bjarmason
  2021-11-10 19:15       ` Neeraj Singh
  2021-11-11 12:06       ` Patrick Steinhardt
  2021-11-11  0:18     ` Neeraj Singh
  1 sibling, 2 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-10 14:49 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Johannes Schindelin, Junio C Hamano, Eric Wong,
	Neeraj K. Singh, Linus Torvalds


On Wed, Nov 10 2021, Patrick Steinhardt wrote:

> [...]
> Fix this by introducing a new configuration "core.fsyncRefFiles". This
> config matches behaviour of "core.fsyncObjectFiles" in that it provides
> three different modes:
>
>     - "off" disables calling fsync on ref files. This is the default
>       behaviour previous to this change and remains the default after
>       this change.
>
>     - "on" enables calling fsync on ref files, where each reference is
>       flushed to disk before it is being committed. This is the safest
>       setting, but may incur significant performance overhead.
>
>     - "batch" will flush the page cache of each file as it is written to
>       ensure its data is persisted. After all refs have been written,
>       the directories which host refs are flushed.
>
> With this change in place and when "core.fsyncRefFiles" is set to either
> "on" or "batch", this kind of corruption shouldn't happen anymore.
>
> [1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt

With the understanding that my grokking of this approach is still
somewhere between "uh, that works?" and "wow, voodoo FS magic!". ....

I haven't looked at these changes in much daiter, or Neeraj's recent
related changes but...

> +core.fsyncRefFiles::
> +	A value indicating the level of effort Git will expend in trying to make
> +	refs added to the repo durable in the event of an unclean system
> +	shutdown. This setting currently only controls loose refs in the object
> +	store, so updates to packed refs may not be equally durable. Takes the
> +	same parameters as `core.fsyncObjectFiles`.
> +

...my understanding of it is basically a way of going back to what Linus
pointed out way back in aafe9fbaf4f (Add config option to enable
'fsync()' of object files, 2008-06-18).

I.e. we've got files x and y. POSIX sayeth we'd need to fsync them all
and the directory entry, but on some FS's it would be sufficient to
fsync() just y if they're created in that order. It'll imply an fsync of
both x and y, is that accurate?

If not you can probably discard the rest, but forging on:

Why do we then need to fsync() a "z" file in get_object_directory()
(i.e. .git/objects) then? That's a new "z", wasn't flushing "y" enough?

Or if you've written .git/objects/x and .git/refs/y I can imagine
wanting to create and sync a .git/z if the FS's semantics are to then
flush all remaining updates from that tree up, but here it's
.git/objects, not .git. That also seems to contract this above:

>       ensure its data is persisted. After all refs have been written,
>       the directories which host refs are flushed.

I.e. that would be .git/refs (let's ignore .git/HEAD and the like for
now), not .git/objects or .git?

And again, forging on but more generally [continued below]...

> +	if (!strcmp(var, "core.fsyncreffiles")) {

UX side: now we've got a core.fsyncRefFiles and
core.fsyncWhateverItWasCalled in Neeraj series. Let's make sure those
work together like say "fsck.*" and "fetch.fsck.*" do, i.e. you'd be
able to configure this once for objects and refs, or in two variables,
one for objects, one for refs...


> +static int sync_loose_ref(int fd)
> +{
> +	switch (fsync_ref_files) {
> +	case FSYNC_REF_FILES_OFF:
> +		return 0;
> +	case FSYNC_REF_FILES_ON:
> +		return git_fsync(fd, FSYNC_HARDWARE_FLUSH);
> +	case FSYNC_REF_FILES_BATCH:
> +		return git_fsync(fd, FSYNC_WRITEOUT_ONLY);
> +	default:
> +		BUG("invalid fsync mode %d", fsync_ref_files);
> +	}
> +}
> +
> +#define SYNC_LOOSE_REF_GITDIR    (1 << 0)
> +#define SYNC_LOOSE_REF_COMMONDIR (1 << 1)

nit: make this an enum and ...

> +static int sync_loose_refs_flags(const char *refname)
> +{
> +	switch (ref_type(refname)) {
> +	case REF_TYPE_PER_WORKTREE:
> +	case REF_TYPE_PSEUDOREF:
> +		return SYNC_LOOSE_REF_GITDIR;
> +	case REF_TYPE_MAIN_PSEUDOREF:
> +	case REF_TYPE_OTHER_PSEUDOREF:
> +	case REF_TYPE_NORMAL:
> +		return SYNC_LOOSE_REF_COMMONDIR;
> +	default:
> +		BUG("unknown ref type %d of ref %s",
> +		    ref_type(refname), refname);

... you won't need this default case...

> [...]
>  /*
>   * Emit a better error message than lockfile.c's
>   * unable_to_lock_message() would in case there is a D/F conflict with
> @@ -1502,6 +1553,7 @@ static int files_copy_or_rename_ref(struct ref_store *ref_store,
>  	oidcpy(&lock->old_oid, &orig_oid);
>  
>  	if (write_ref_to_lockfile(lock, &orig_oid, &err) ||
> +	    sync_loose_refs(refs, sync_loose_refs_flags(newrefname), &err) ||
>  	    commit_ref_update(refs, lock, &orig_oid, logmsg, &err)) {
>  		error("unable to write current sha1 into %s: %s", newrefname, err.buf);
>  		strbuf_release(&err);
> @@ -1522,6 +1574,7 @@ static int files_copy_or_rename_ref(struct ref_store *ref_store,
>  	flag = log_all_ref_updates;
>  	log_all_ref_updates = LOG_REFS_NONE;
>  	if (write_ref_to_lockfile(lock, &orig_oid, &err) ||
> +	    sync_loose_refs(refs, sync_loose_refs_flags(newrefname), &err) ||
>  	    commit_ref_update(refs, lock, &orig_oid, NULL, &err)) {
>  		error("unable to write current sha1 into %s: %s", oldrefname, err.buf);
>  		strbuf_release(&err);
> @@ -1781,6 +1834,7 @@ static int write_ref_to_lockfile(struct ref_lock *lock,
>  	fd = get_lock_file_fd(&lock->lk);
>  	if (write_in_full(fd, oid_to_hex(oid), the_hash_algo->hexsz) < 0 ||
>  	    write_in_full(fd, &term, 1) < 0 ||
> +	    sync_loose_ref(fd) < 0 ||
>  	    close_ref_gently(lock) < 0) {
>  		strbuf_addf(err,
>  			    "couldn't write '%s'", get_lock_file_path(&lock->lk));
> @@ -2665,7 +2719,7 @@ static int files_transaction_prepare(struct ref_store *ref_store,
>  		files_downcast(ref_store, REF_STORE_WRITE,
>  			       "ref_transaction_prepare");
>  	size_t i;
> -	int ret = 0;
> +	int ret = 0, sync_flags = 0;
>  	struct string_list affected_refnames = STRING_LIST_INIT_NODUP;
>  	char *head_ref = NULL;
>  	int head_type;
> @@ -2777,8 +2831,14 @@ static int files_transaction_prepare(struct ref_store *ref_store,
>  					&update->new_oid, NULL,
>  					NULL);
>  		}
> +
> +		sync_flags |= sync_loose_refs_flags(update->refname);
>  	}
>  
> +	ret = sync_loose_refs(refs, sync_flags, err);
> +	if (ret)
> +		goto cleanup;
> +
>  	if (packed_transaction) {
>  		if (packed_refs_lock(refs->packed_ref_store, 0, err)) {
>  			ret = TRANSACTION_GENERIC_ERROR;

...[continued from above]: Again, per my potentially wrong understanding
of syncing a "x" and "y" via an fsync of a subsequent "z" that's
adjacent on the FS to those two.

Isn't this setting us up for a really bad interaction between this
series and Neeraj's work? Well "bad" as in "bad for performance".

I.e. you'll turn on "use the batch thing for objects and refs" and we'll
do two fsyncs, one for the object update, and one for refs. The common
case is that we'll have both in play.

So shouldn't this go to a higher level for both so we only create a "z"
.git/sync-it-now-please.txt or whatever once we do all pending updates
on the .git/ directory?

I can also imagine that we'd want that at an even higher level, e.g. for
"git pull" surely we'd want it not for refs or objects, but in
builtin/pull.c somewhere because we'll be updating the .git/index after
we do both refs and objects, and you'd want to fsync at the very end,
no?

None of the above should mean we can't pursue a more narrow approach for
now. I'm just:

 1) Seeing if I understand what we're trying to do here, maybe not.

 2) Encouraging you two to think about a holistic way to configure some
    logical conclusion to this topic at large, so we won't end up with
    core.fsyncConfigFiles, core.fsyncObjectFiles, core.fsyncIndexFile,
    core.fsyncRefFiles, core.fsyncTheCrapRebaseWritesOutFiles etc. :)

I'll send another more generic follow-up E-Mail for #2.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/3] refs: add configuration to enable flushing of refs
  2021-11-10 14:49     ` Ævar Arnfjörð Bjarmason
@ 2021-11-10 19:15       ` Neeraj Singh
  2021-11-10 20:23         ` Ævar Arnfjörð Bjarmason
  2021-11-11 12:06       ` Patrick Steinhardt
  1 sibling, 1 reply; 34+ messages in thread
From: Neeraj Singh @ 2021-11-10 19:15 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Patrick Steinhardt, git, Jeff King, Johannes Schindelin,
	Junio C Hamano, Eric Wong, Neeraj K. Singh, Linus Torvalds

On Wed, Nov 10, 2021 at 03:49:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Nov 10 2021, Patrick Steinhardt wrote:
> 
> > +	shutdown. This setting currently only controls loose refs in the object
> > +	store, so updates to packed refs may not be equally durable. Takes the
> > +	same parameters as `core.fsyncObjectFiles`.
> > +
> 
> ...my understanding of it is basically a way of going back to what Linus
> pointed out way back in aafe9fbaf4f (Add config option to enable
> 'fsync()' of object files, 2008-06-18).
> 
> I.e. we've got files x and y. POSIX sayeth we'd need to fsync them all
> and the directory entry, but on some FS's it would be sufficient to
> fsync() just y if they're created in that order. It'll imply an fsync of
> both x and y, is that accurate?
> 
> If not you can probably discard the rest, but forging on:
> 
> Why do we then need to fsync() a "z" file in get_object_directory()
> (i.e. .git/objects) then? That's a new "z", wasn't flushing "y" enough?
> 
> Or if you've written .git/objects/x and .git/refs/y I can imagine
> wanting to create and sync a .git/z if the FS's semantics are to then
> flush all remaining updates from that tree up, but here it's
> .git/objects, not .git. That also seems to contract this above:

We're breaking through the abstraction provided by POSIX in 'batch' mode,
since the POSIX abstraction does not give us any option to be both safe
and reasonably fast on hardware that does something expensive upon fsync().

We need to ensure that 'x' and 'y' are handed off to the storage device via
a non-flushing pagecache cleaning call (e.g. sync_file_ranges or macOS fsync).
Then creating and flushing 'z' ensures that 'x' and 'y' will be visible after
successful completion. On FSes with a single journal that is flushed on fsync,
this should also ensure that any other files that have been cleaned out of the
pagecache and any other metadata updates are also persisted. The fsync provides
a barrier in addition to the durability.


> >       ensure its data is persisted. After all refs have been written,
> >       the directories which host refs are flushed.
> 
> I.e. that would be .git/refs (let's ignore .git/HEAD and the like for
> now), not .git/objects or .git?
> 
> And again, forging on but more generally [continued below]...
> 
> > +	if (!strcmp(var, "core.fsyncreffiles")) {
> 
> UX side: now we've got a core.fsyncRefFiles and
> core.fsyncWhateverItWasCalled in Neeraj series. Let's make sure those
> work together like say "fsck.*" and "fetch.fsck.*" do, i.e. you'd be
> able to configure this once for objects and refs, or in two variables,
> one for objects, one for refs...
> 

I agree with this feedback. We should have a core.fsync flag to control everything
that might be synced in the repo.  However, we'd have to decide what we want
to do with packfiles and the index which are currently always fsynced.

> ...[continued from above]: Again, per my potentially wrong understanding
> of syncing a "x" and "y" via an fsync of a subsequent "z" that's
> adjacent on the FS to those two.

I suspect Patrick is concerned about the case where the worktree is on
a separate filesystem from the main repo, so there might be a motivation
to sync the worktree refs separately. Is that right, Patrick?

> 
> Isn't this setting us up for a really bad interaction between this
> series and Neeraj's work? Well "bad" as in "bad for performance".
> 
> I.e. you'll turn on "use the batch thing for objects and refs" and we'll
> do two fsyncs, one for the object update, and one for refs. The common
> case is that we'll have both in play.
> 
> So shouldn't this go to a higher level for both so we only create a "z"
> .git/sync-it-now-please.txt or whatever once we do all pending updates
> on the .git/ directory?
> 
> I can also imagine that we'd want that at an even higher level, e.g. for
> "git pull" surely we'd want it not for refs or objects, but in
> builtin/pull.c somewhere because we'll be updating the .git/index after
> we do both refs and objects, and you'd want to fsync at the very end,
> no?
> 
> None of the above should mean we can't pursue a more narrow approach for
> now. I'm just:
> 
>  1) Seeing if I understand what we're trying to do here, maybe not.
> 
>  2) Encouraging you two to think about a holistic way to configure some
>     logical conclusion to this topic at large, so we won't end up with
>     core.fsyncConfigFiles, core.fsyncObjectFiles, core.fsyncIndexFile,
>     core.fsyncRefFiles, core.fsyncTheCrapRebaseWritesOutFiles etc. :)
> 
> I'll send another more generic follow-up E-Mail for #2.

In my view there are two separable concerns:

    1) What durability do we want for the user's data when an entire 'git'
       command completes? I.e. if I run 'git add <1000 files>; git commit' and the
       system loses power after both commands return, should I see all of those files
       in the committed state of the repo?

    2) What internal consistency do we want in the git databases (ODB and REFS) if
       we crash midway through a 'git' command? I.e. if 'git add <1000 files>' crashes
       before returning, can there be an invalid object or a torn ref state?

If were only concerned with (1), then doing a single fsync at the end of the high-level
git command would be sufficient. However, (2) requires more fsyncs to provide barriers
between different phases internal to the git commands. For instance, we need a barrier
between creating the ODB state for a commit (blobs, trees, commit obj) and the refs
pointing to it.

I am not concerned with a few additional fsyncs for (2). On recent mainstream filesystems/ssds
each fsync may cost tens of milliseconds, so the difference between 1 to 3 fsyncs would not
be user perceptible. However, going from a couple fsyncs to 1000 fsyncs (one per blob added)
would become apparent to the user.

The more optimal way to handle consistency and durability is Write-ahead-logging with log
replay on crash recovery. That approach can reduce the number of fsyncs in the active workload
to at most two (one to sync the log with a commit record and then one before truncating the
log after updating the main database). At that point, however, I think it would be best to
use an existing database engine with some modifications to create a good data layout for git.

Thanks,
Neeraj

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/3] refs: add configuration to enable flushing of refs
  2021-11-10 19:15       ` Neeraj Singh
@ 2021-11-10 20:23         ` Ævar Arnfjörð Bjarmason
  2021-11-11  0:03           ` Neeraj Singh
  2021-11-11 12:14           ` Patrick Steinhardt
  0 siblings, 2 replies; 34+ messages in thread
From: Ævar Arnfjörð Bjarmason @ 2021-11-10 20:23 UTC (permalink / raw)
  To: Neeraj Singh
  Cc: Patrick Steinhardt, git, Jeff King, Johannes Schindelin,
	Junio C Hamano, Eric Wong, Neeraj K. Singh, Linus Torvalds


On Wed, Nov 10 2021, Neeraj Singh wrote:

> On Wed, Nov 10, 2021 at 03:49:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
>> 
>> On Wed, Nov 10 2021, Patrick Steinhardt wrote:
>> 
>> > +	shutdown. This setting currently only controls loose refs in the object
>> > +	store, so updates to packed refs may not be equally durable. Takes the
>> > +	same parameters as `core.fsyncObjectFiles`.
>> > +
>> 
>> ...my understanding of it is basically a way of going back to what Linus
>> pointed out way back in aafe9fbaf4f (Add config option to enable
>> 'fsync()' of object files, 2008-06-18).
>> 
>> I.e. we've got files x and y. POSIX sayeth we'd need to fsync them all
>> and the directory entry, but on some FS's it would be sufficient to
>> fsync() just y if they're created in that order. It'll imply an fsync of
>> both x and y, is that accurate?
>> 
>> If not you can probably discard the rest, but forging on:
>> 
>> Why do we then need to fsync() a "z" file in get_object_directory()
>> (i.e. .git/objects) then? That's a new "z", wasn't flushing "y" enough?
>> 
>> Or if you've written .git/objects/x and .git/refs/y I can imagine
>> wanting to create and sync a .git/z if the FS's semantics are to then
>> flush all remaining updates from that tree up, but here it's
>> .git/objects, not .git. That also seems to contract this above:
>
> We're breaking through the abstraction provided by POSIX in 'batch' mode,
> since the POSIX abstraction does not give us any option to be both safe
> and reasonably fast on hardware that does something expensive upon fsync().
>
> We need to ensure that 'x' and 'y' are handed off to the storage device via
> a non-flushing pagecache cleaning call (e.g. sync_file_ranges or macOS fsync).
> Then creating and flushing 'z' ensures that 'x' and 'y' will be visible after
> successful completion. On FSes with a single journal that is flushed on fsync,
> this should also ensure that any other files that have been cleaned out of the
> pagecache and any other metadata updates are also persisted. The fsync provides
> a barrier in addition to the durability.

Yes. I understand that we're not doing POSIX fsyncing().

I'm asking about something else, i.e. with this not-like-POSIX-sync why
it is that when you have a directory:

    A

and files:

    A/{X,Y}

That you'd write those two, and then proceed to do the "batch flush" by
creating and fsync()-ing a:

    B/Z

As opposed to either of:

    A/Z

Or:

    Z

I.e. why update .git/refs/* and create a flush file in .git/object/* and
not .git/refs/* or .git/*?

Maybe the answer is just that this is WIP code copied from your
.git/objects/ fsync() implementation, or maybe it's more subtle and I'm
missing something.

>> >       ensure its data is persisted. After all refs have been written,
>> >       the directories which host refs are flushed.
>> 
>> I.e. that would be .git/refs (let's ignore .git/HEAD and the like for
>> now), not .git/objects or .git?
>> 
>> And again, forging on but more generally [continued below]...
>> 
>> > +	if (!strcmp(var, "core.fsyncreffiles")) {
>> 
>> UX side: now we've got a core.fsyncRefFiles and
>> core.fsyncWhateverItWasCalled in Neeraj series. Let's make sure those
>> work together like say "fsck.*" and "fetch.fsck.*" do, i.e. you'd be
>> able to configure this once for objects and refs, or in two variables,
>> one for objects, one for refs...
>> 
>
> I agree with this feedback. We should have a core.fsync flag to control everything
> that might be synced in the repo.  However, we'd have to decide what we want
> to do with packfiles and the index which are currently always fsynced.

*nod*. See this if you haven't yet:
https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/T/#u

>> ...[continued from above]: Again, per my potentially wrong understanding
>> of syncing a "x" and "y" via an fsync of a subsequent "z" that's
>> adjacent on the FS to those two.
>
> I suspect Patrick is concerned about the case where the worktree is on
> a separate filesystem from the main repo, so there might be a motivation
> to sync the worktree refs separately. Is that right, Patrick?

But he's syncing .git/objects, not .git/worktrees/<NAME>/refs/, no?

>> 
>> Isn't this setting us up for a really bad interaction between this
>> series and Neeraj's work? Well "bad" as in "bad for performance".
>> 
>> I.e. you'll turn on "use the batch thing for objects and refs" and we'll
>> do two fsyncs, one for the object update, and one for refs. The common
>> case is that we'll have both in play.
>> 
>> So shouldn't this go to a higher level for both so we only create a "z"
>> .git/sync-it-now-please.txt or whatever once we do all pending updates
>> on the .git/ directory?
>> 
>> I can also imagine that we'd want that at an even higher level, e.g. for
>> "git pull" surely we'd want it not for refs or objects, but in
>> builtin/pull.c somewhere because we'll be updating the .git/index after
>> we do both refs and objects, and you'd want to fsync at the very end,
>> no?
>> 
>> None of the above should mean we can't pursue a more narrow approach for
>> now. I'm just:
>> 
>>  1) Seeing if I understand what we're trying to do here, maybe not.
>> 
>>  2) Encouraging you two to think about a holistic way to configure some
>>     logical conclusion to this topic at large, so we won't end up with
>>     core.fsyncConfigFiles, core.fsyncObjectFiles, core.fsyncIndexFile,
>>     core.fsyncRefFiles, core.fsyncTheCrapRebaseWritesOutFiles etc. :)
>> 
>> I'll send another more generic follow-up E-Mail for #2.
>
> In my view there are two separable concerns:
>
>     1) What durability do we want for the user's data when an entire 'git'
>        command completes? I.e. if I run 'git add <1000 files>; git commit' and the
>        system loses power after both commands return, should I see all of those files
>        in the committed state of the repo?
>
>     2) What internal consistency do we want in the git databases (ODB and REFS) if
>        we crash midway through a 'git' command? I.e. if 'git add <1000 files>' crashes
>        before returning, can there be an invalid object or a torn ref state?
>
> If were only concerned with (1), then doing a single fsync at the end of the high-level
> git command would be sufficient. However, (2) requires more fsyncs to provide barriers
> between different phases internal to the git commands. For instance, we need a barrier
> between creating the ODB state for a commit (blobs, trees, commit obj) and the refs
> pointing to it.
>
> I am not concerned with a few additional fsyncs for (2). On recent mainstream filesystems/ssds
> each fsync may cost tens of milliseconds, so the difference between 1 to 3 fsyncs would not
> be user perceptible. However, going from a couple fsyncs to 1000 fsyncs (one per blob added)
> would become apparent to the user.
>
> The more optimal way to handle consistency and durability is Write-ahead-logging with log
> replay on crash recovery. That approach can reduce the number of fsyncs in the active workload
> to at most two (one to sync the log with a commit record and then one before truncating the
> log after updating the main database). At that point, however, I think it would be best to
> use an existing database engine with some modifications to create a good data layout for git.

I think that git should safely store your data by default, we clearly
don't do that well enough in some cases, and should fix it.

But there's also cases where people would much rather have performance,
and I think we should support that. E.g. running git in CI, doing a
one-off import/fetch that you'll invoke "sync(1)" yourself after
etc. This is the direction Eric Wong's patches are going into.

I understand from his patches/comments that you're not correct that just
1-3 fsyncs are OK, i.e. for some use-cases 0 is OK, and any >0 is too
many, since it'll force a flush of the whole disk or something.

Even when git is is operating in a completely safe mode I think we'd
still have license to play it fast and loose with /some/ user data,
because users don't really care about all of their data in the .git/
directory.

I.e. you do care about the *.pack file, but an associated *.bitmap is a
derived file, so we probably don't need to fsync that too. Is it worth
worrying about? Probably not, but that's what I had in mind with the
above-linked proposed config schema.

Similarly for say writing 1k loose objects I think it would be
completely fine to end up with a corrupt object during a "git fetch", as
long as we also guarantee that we can gracefully recover from that
corruption.

I.e. 1) we didn't update the ref(s) involved 2) we know the FS has the
sorts of properties you're aiming for with "batch".

Now, as some patches I've had to object*.c recently show we don't handle
those cases gracefully either. I.e. if we find a short loose object
we'll panic, even if we'd be perfectly capable of getting it from
elsewhere, and nothing otherwise references it.

But if we fixed that and gc/fsck/fetch etc. learned to deal with that
content I don't see why wouldn't e.g. default to not fsyncing loose
objects when invoked from say "fetch" by default, depending on FS/OS
detection, and if we couldn't say it was safe defaulted to some "POSIX"
mode that would be pedantic but slow and safe.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/3] refs: sync loose refs to disk before committing them
  2021-11-10 11:40 ` [PATCH v2 0/3] " Patrick Steinhardt
                     ` (3 preceding siblings ...)
  2021-11-10 14:44   ` [PATCH v2 0/3] refs: sync loose refs to disk before committing them Johannes Schindelin
@ 2021-11-10 20:45   ` Jeff King
  2021-11-11 11:47     ` Patrick Steinhardt
  4 siblings, 1 reply; 34+ messages in thread
From: Jeff King @ 2021-11-10 20:45 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Eric Wong, Neeraj K. Singh

On Wed, Nov 10, 2021 at 12:40:50PM +0100, Patrick Steinhardt wrote:

> Please note that I didn't yet add any performance numbers or tests.
> Performance tests didn't show any conclusive results on my machine given
> that I couldn't observe any noticeable impact at all, and I didn't write
> tests yet given that I first wanted to get some feedback on this series.
> If we agree that this is the correct way to go forward I'll of course
> put in some more time to add them.

Here's a mild adjustment of the quick test I showed earlier in [1]. It
uses your version for all cases, but swaps between the three config
options, and it uses --atomic to put everything into a single
transaction:

  $ hyperfine -L v false,true,batch -p ./setup 'git -C dst.git config core.fsyncreffiles {v}; git.compile push --atomic dst.git refs/heads/*'
  Benchmark 1: git -C dst.git config core.fsyncreffiles false; git.compile push --atomic dst.git refs/heads/*
    Time (mean ± σ):      10.5 ms ±   0.2 ms    [User: 7.8 ms, System: 3.9 ms]
    Range (min … max):    10.1 ms …  11.0 ms    108 runs
   
  Benchmark 2: git -C dst.git config core.fsyncreffiles true; git.compile push --atomic dst.git refs/heads/*
    Time (mean ± σ):     402.5 ms ±   6.2 ms    [User: 13.9 ms, System: 10.7 ms]
    Range (min … max):   387.3 ms … 408.9 ms    10 runs
   
  Benchmark 3: git -C dst.git config core.fsyncreffiles batch; git.compile push --atomic dst.git refs/heads/*
    Time (mean ± σ):      21.4 ms ±   0.6 ms    [User: 8.0 ms, System: 5.2 ms]
    Range (min … max):    20.7 ms …  24.9 ms    99 runs
   
  Summary
    'git -C dst.git config core.fsyncreffiles false; git.compile push --atomic dst.git refs/heads/*' ran
      2.05 ± 0.07 times faster than 'git -C dst.git config core.fsyncreffiles batch; git.compile push --atomic dst.git refs/heads/*'
     38.51 ± 1.06 times faster than 'git -C dst.git config core.fsyncreffiles true; git.compile push --atomic dst.git refs/heads/*'

So yes, the "batch" case is much improved. It's still more expensive
than skipping the fsync entirely, but it's within reason. The trick,
though is the --atomic. If I drop that flag, then "batch" takes as long
as "true". And of course that's the default.

I wonder if that means we'd want an extra mode: do the WRITEOUT_ONLY
flush, but _don't_ do a HARDWARE_FLUSH for each transaction. That
doesn't give you durability, but it (in theory) solves the out-of-order
journal problem without paying any performance cost.

I say "in theory" because I'm just assuming that the WRITEOUT thing
works as advertised. I don't have an easy way to test the outcome on a
power failure at the right moment here.

-Peff

[1] https://lore.kernel.org/git/YYTaiIlEKxHRVdCy@coredump.intra.peff.net/

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/3] refs: add configuration to enable flushing of refs
  2021-11-10 20:23         ` Ævar Arnfjörð Bjarmason
@ 2021-11-11  0:03           ` Neeraj Singh
  2021-11-11 12:14           ` Patrick Steinhardt
  1 sibling, 0 replies; 34+ messages in thread
From: Neeraj Singh @ 2021-11-11  0:03 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Patrick Steinhardt, git, Jeff King, Johannes Schindelin,
	Junio C Hamano, Eric Wong, Neeraj K. Singh, Linus Torvalds

On Wed, Nov 10, 2021 at 09:23:04PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> 
> Yes. I understand that we're not doing POSIX fsyncing().
> 
> I'm asking about something else, i.e. with this not-like-POSIX-sync why
> it is that when you have a directory:
> 
>     A
> 
> and files:
> 
>     A/{X,Y}
> 
> That you'd write those two, and then proceed to do the "batch flush" by
> creating and fsync()-ing a:
> 
>     B/Z
> 
> As opposed to either of:
> 
>     A/Z
> 
> Or:
> 
>     Z
> 
> I.e. why update .git/refs/* and create a flush file in .git/object/* and
> not .git/refs/* or .git/*?
> 
> Maybe the answer is just that this is WIP code copied from your
> .git/objects/ fsync() implementation, or maybe it's more subtle and I'm
> missing something.

It looks I didn't really answer your actual question before. On the filesystems
which I'm familiar with at a code level (NTFS and ReFS on Windows), fsyncing a file
or dir anywhere on the filesystem should ensure that metadata operations completed
before the fsync starts are durable when the fsync returns. So which specific directory
we put the file in shouldn't matter as long as it's on the same filesystem as the other
files we're interested in.

> 
> *nod*. See this if you haven't yet:
> https://lore.kernel.org/git/211110.86r1bogg27.gmgdl@evledraar.gmail.com/T/#u

I'll respond on that thread with my opinion. Thanks for reviewing this and pushing
for a good holistic approach.

Thanks,
Neeraj

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/3] refs: add configuration to enable flushing of refs
  2021-11-10 11:41   ` [PATCH v2 3/3] refs: add configuration to enable flushing of refs Patrick Steinhardt
  2021-11-10 14:49     ` Ævar Arnfjörð Bjarmason
@ 2021-11-11  0:18     ` Neeraj Singh
  1 sibling, 0 replies; 34+ messages in thread
From: Neeraj Singh @ 2021-11-11  0:18 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: git, Jeff King, Johannes Schindelin,
	Ævar Arnfjörð Bjarmason, Junio C Hamano,
	Eric Wong, Neeraj K. Singh

On Wed, Nov 10, 2021 at 12:41:03PM +0100, Patrick Steinhardt wrote:
>  
> +static int sync_loose_ref(int fd)
> +{
> +	switch (fsync_ref_files) {
> +	case FSYNC_REF_FILES_OFF:
> +		return 0;
> +	case FSYNC_REF_FILES_ON:
> +		return git_fsync(fd, FSYNC_HARDWARE_FLUSH);
> +	case FSYNC_REF_FILES_BATCH:
> +		return git_fsync(fd, FSYNC_WRITEOUT_ONLY);
> +	default:
> +		BUG("invalid fsync mode %d", fsync_ref_files);
> +	}
> +}
> +
> +#define SYNC_LOOSE_REF_GITDIR    (1 << 0)
> +#define SYNC_LOOSE_REF_COMMONDIR (1 << 1)
> +
> +static int sync_loose_refs_flags(const char *refname)
> +{
> +	switch (ref_type(refname)) {
> +	case REF_TYPE_PER_WORKTREE:
> +	case REF_TYPE_PSEUDOREF:
> +		return SYNC_LOOSE_REF_GITDIR;
> +	case REF_TYPE_MAIN_PSEUDOREF:
> +	case REF_TYPE_OTHER_PSEUDOREF:
> +	case REF_TYPE_NORMAL:
> +		return SYNC_LOOSE_REF_COMMONDIR;
> +	default:
> +		BUG("unknown ref type %d of ref %s",
> +		    ref_type(refname), refname);
> +	}
> +}
> +
> +static int sync_loose_refs(struct files_ref_store *refs,
> +			   int flags,
> +			   struct strbuf *err)
> +{
> +	if (fsync_ref_files != FSYNC_REF_FILES_BATCH)
> +		return 0;
> +
> +	if ((flags & SYNC_LOOSE_REF_GITDIR) &&
> +	    git_fsync_dir(refs->base.gitdir) < 0)
> +		return -1;
> +
> +	if ((flags & SYNC_LOOSE_REF_COMMONDIR) &&
> +	    git_fsync_dir(refs->gitcommondir) < 0)
> +		return -1;
> +
> +	return 0;
> +}
> +

Now that I understand it better, I agree with Ævar's feedback. I think
only a single sync is needed to cover everything in .git/. 

I'd also suggest renaming 'sync_loose_refs' to something which makes
it clearer that it's the batch-flush step. Maybe s/do_batch_fsync/do_batch_objects_fsync
and s/sync_loose_refs/do_batch_refs_fsync.

Thanks
Neeraj

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 0/3] refs: sync loose refs to disk before committing them
  2021-11-10 20:45   ` Jeff King
@ 2021-11-11 11:47     ` Patrick Steinhardt
  0 siblings, 0 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-11 11:47 UTC (permalink / raw)
  To: Jeff King
  Cc: git, Johannes Schindelin, Ævar Arnfjörð Bjarmason,
	Junio C Hamano, Eric Wong, Neeraj K. Singh

[-- Attachment #1: Type: text/plain, Size: 3634 bytes --]

On Wed, Nov 10, 2021 at 03:45:11PM -0500, Jeff King wrote:
> On Wed, Nov 10, 2021 at 12:40:50PM +0100, Patrick Steinhardt wrote:
> 
> > Please note that I didn't yet add any performance numbers or tests.
> > Performance tests didn't show any conclusive results on my machine given
> > that I couldn't observe any noticeable impact at all, and I didn't write
> > tests yet given that I first wanted to get some feedback on this series.
> > If we agree that this is the correct way to go forward I'll of course
> > put in some more time to add them.
> 
> Here's a mild adjustment of the quick test I showed earlier in [1]. It
> uses your version for all cases, but swaps between the three config
> options, and it uses --atomic to put everything into a single
> transaction:
> 
>   $ hyperfine -L v false,true,batch -p ./setup 'git -C dst.git config core.fsyncreffiles {v}; git.compile push --atomic dst.git refs/heads/*'
>   Benchmark 1: git -C dst.git config core.fsyncreffiles false; git.compile push --atomic dst.git refs/heads/*
>     Time (mean ± σ):      10.5 ms ±   0.2 ms    [User: 7.8 ms, System: 3.9 ms]
>     Range (min … max):    10.1 ms …  11.0 ms    108 runs
>    
>   Benchmark 2: git -C dst.git config core.fsyncreffiles true; git.compile push --atomic dst.git refs/heads/*
>     Time (mean ± σ):     402.5 ms ±   6.2 ms    [User: 13.9 ms, System: 10.7 ms]
>     Range (min … max):   387.3 ms … 408.9 ms    10 runs
>    
>   Benchmark 3: git -C dst.git config core.fsyncreffiles batch; git.compile push --atomic dst.git refs/heads/*
>     Time (mean ± σ):      21.4 ms ±   0.6 ms    [User: 8.0 ms, System: 5.2 ms]
>     Range (min … max):    20.7 ms …  24.9 ms    99 runs
>    
>   Summary
>     'git -C dst.git config core.fsyncreffiles false; git.compile push --atomic dst.git refs/heads/*' ran
>       2.05 ± 0.07 times faster than 'git -C dst.git config core.fsyncreffiles batch; git.compile push --atomic dst.git refs/heads/*'
>      38.51 ± 1.06 times faster than 'git -C dst.git config core.fsyncreffiles true; git.compile push --atomic dst.git refs/heads/*'
> 
> So yes, the "batch" case is much improved. It's still more expensive
> than skipping the fsync entirely, but it's within reason. The trick,
> though is the --atomic. If I drop that flag, then "batch" takes as long
> as "true". And of course that's the default.

Great, thanks for doing these tests.

> I wonder if that means we'd want an extra mode: do the WRITEOUT_ONLY
> flush, but _don't_ do a HARDWARE_FLUSH for each transaction. That
> doesn't give you durability, but it (in theory) solves the out-of-order
> journal problem without paying any performance cost.
> 
> I say "in theory" because I'm just assuming that the WRITEOUT thing
> works as advertised. I don't have an easy way to test the outcome on a
> power failure at the right moment here.
> 
> -Peff
> 
> [1] https://lore.kernel.org/git/YYTaiIlEKxHRVdCy@coredump.intra.peff.net/

Yeah, it's also one of the things I'm currently wondering about. I just
piggy-backed on the preexisting patch series which claim that it at
least ought to work on macOS and Windows, but left out of the equation
were Linux filesystems. I'm hoping that we'll get an answer to this
question via <211111.86pmr7pk9o.gmgdl@evledraar.gmail.com>, where Ævar
put Linus and Christopher Hellwig on Cc.

In any case, if the outcome of a WRITEOUT_ONLY flush would be that we
may not see all renames, but that every refs' contents would be
well-defined, then this is a tradeoff we'd probably want to make at
GitLab.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/3] refs: add configuration to enable flushing of refs
  2021-11-10 14:49     ` Ævar Arnfjörð Bjarmason
  2021-11-10 19:15       ` Neeraj Singh
@ 2021-11-11 12:06       ` Patrick Steinhardt
  1 sibling, 0 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-11 12:06 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: git, Jeff King, Johannes Schindelin, Junio C Hamano, Eric Wong,
	Neeraj K. Singh, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 7705 bytes --]

On Wed, Nov 10, 2021 at 03:49:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Nov 10 2021, Patrick Steinhardt wrote:
> 
> > [...]
> > Fix this by introducing a new configuration "core.fsyncRefFiles". This
> > config matches behaviour of "core.fsyncObjectFiles" in that it provides
> > three different modes:
> >
> >     - "off" disables calling fsync on ref files. This is the default
> >       behaviour previous to this change and remains the default after
> >       this change.
> >
> >     - "on" enables calling fsync on ref files, where each reference is
> >       flushed to disk before it is being committed. This is the safest
> >       setting, but may incur significant performance overhead.
> >
> >     - "batch" will flush the page cache of each file as it is written to
> >       ensure its data is persisted. After all refs have been written,
> >       the directories which host refs are flushed.
> >
> > With this change in place and when "core.fsyncRefFiles" is set to either
> > "on" or "batch", this kind of corruption shouldn't happen anymore.
> >
> > [1]: https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
> 
> With the understanding that my grokking of this approach is still
> somewhere between "uh, that works?" and "wow, voodoo FS magic!". ....
> 
> I haven't looked at these changes in much daiter, or Neeraj's recent
> related changes but...
> 
> > +core.fsyncRefFiles::
> > +	A value indicating the level of effort Git will expend in trying to make
> > +	refs added to the repo durable in the event of an unclean system
> > +	shutdown. This setting currently only controls loose refs in the object
> > +	store, so updates to packed refs may not be equally durable. Takes the
> > +	same parameters as `core.fsyncObjectFiles`.
> > +
> 
> ...my understanding of it is basically a way of going back to what Linus
> pointed out way back in aafe9fbaf4f (Add config option to enable
> 'fsync()' of object files, 2008-06-18).
> 
> I.e. we've got files x and y. POSIX sayeth we'd need to fsync them all
> and the directory entry, but on some FS's it would be sufficient to
> fsync() just y if they're created in that order. It'll imply an fsync of
> both x and y, is that accurate?
> 
> If not you can probably discard the rest, but forging on:
> 
> Why do we then need to fsync() a "z" file in get_object_directory()
> (i.e. .git/objects) then? That's a new "z", wasn't flushing "y" enough?
> 
> Or if you've written .git/objects/x and .git/refs/y I can imagine
> wanting to create and sync a .git/z if the FS's semantics are to then
> flush all remaining updates from that tree up, but here it's
> .git/objects, not .git. That also seems to contract this above:
> 
> >       ensure its data is persisted. After all refs have been written,
> >       the directories which host refs are flushed.
> 
> I.e. that would be .git/refs (let's ignore .git/HEAD and the like for
> now), not .git/objects or .git?

Yeah, .git/refs. Note that we need to flush refs in two locations though
given that there are common refs shared across worktrees, and then there
are worktree-specific refs. These may not even share the filesystem, so
there's not much we can do about this. We could play games with the
fsid, but I'm not sure it's worth it.

> And again, forging on but more generally [continued below]...
> 
> > +	if (!strcmp(var, "core.fsyncreffiles")) {
> 
> UX side: now we've got a core.fsyncRefFiles and
> core.fsyncWhateverItWasCalled in Neeraj series. Let's make sure those
> work together like say "fsck.*" and "fetch.fsck.*" do, i.e. you'd be
> able to configure this once for objects and refs, or in two variables,
> one for objects, one for refs...

Honestly, I'd prefer to just have a global variable with per-subsystem
overrides so that you can say "Please batch-sync all files by default,
but I really don't care for files in the worktree". Kind of goes into
the same direction as your RFC.

[snip]
> > @@ -2665,7 +2719,7 @@ static int files_transaction_prepare(struct ref_store *ref_store,
> >  		files_downcast(ref_store, REF_STORE_WRITE,
> >  			       "ref_transaction_prepare");
> >  	size_t i;
> > -	int ret = 0;
> > +	int ret = 0, sync_flags = 0;
> >  	struct string_list affected_refnames = STRING_LIST_INIT_NODUP;
> >  	char *head_ref = NULL;
> >  	int head_type;
> > @@ -2777,8 +2831,14 @@ static int files_transaction_prepare(struct ref_store *ref_store,
> >  					&update->new_oid, NULL,
> >  					NULL);
> >  		}
> > +
> > +		sync_flags |= sync_loose_refs_flags(update->refname);
> >  	}
> >  
> > +	ret = sync_loose_refs(refs, sync_flags, err);
> > +	if (ret)
> > +		goto cleanup;
> > +
> >  	if (packed_transaction) {
> >  		if (packed_refs_lock(refs->packed_ref_store, 0, err)) {
> >  			ret = TRANSACTION_GENERIC_ERROR;
> 
> ...[continued from above]: Again, per my potentially wrong understanding
> of syncing a "x" and "y" via an fsync of a subsequent "z" that's
> adjacent on the FS to those two.
> 
> Isn't this setting us up for a really bad interaction between this
> series and Neeraj's work? Well "bad" as in "bad for performance".
> 
> I.e. you'll turn on "use the batch thing for objects and refs" and we'll
> do two fsyncs, one for the object update, and one for refs. The common
> case is that we'll have both in play.
> 
> So shouldn't this go to a higher level for both so we only create a "z"
> .git/sync-it-now-please.txt or whatever once we do all pending updates
> on the .git/ directory?

Good question. Taking a different stance would be to say "I don't ever
want to write a ref referring to an object which isn't yet guaranteed to
be persistent". In that case, writing and syncing objects first and then
updating refs would be the correct thing to do and wouldn't need
coordination on a higher level.

> I can also imagine that we'd want that at an even higher level, e.g. for
> "git pull" surely we'd want it not for refs or objects, but in
> builtin/pull.c somewhere because we'll be updating the .git/index after
> we do both refs and objects, and you'd want to fsync at the very end,
> no?
> 
> None of the above should mean we can't pursue a more narrow approach for
> now. I'm just:
> 
>  1) Seeing if I understand what we're trying to do here, maybe not.

Personally, my most important bullet point is that I want to get rid of
the corrupt refs we see in production every now and then. As Peff
pointed out in another email, it _might_ be sufficient to just flush out
the page cache for those without synchronizing any metadata at all. But
none of us seems to be sure that this really works the way we think it
does.

>  2) Encouraging you two to think about a holistic way to configure some
>     logical conclusion to this topic at large, so we won't end up with
>     core.fsyncConfigFiles, core.fsyncObjectFiles, core.fsyncIndexFile,
>     core.fsyncRefFiles, core.fsyncTheCrapRebaseWritesOutFiles etc. :)

Yeah, I agree with you on this point. We definitely don't want to end up
with twenty different knobs to sync objects, refs, packed-refs, configs,
reflogs et al, and we're currently steering into that direction. Having
a central knob core.fsync which directs defaults for all subsystems
would be very much preferable if you ask me, where specific subsystems
can then be changed as the user sees fit.

What I think is important though is that we should have the right
defaults in place. It doesn't help to say "fsync is slow" as an excuse
to not do them at all by default, where the result is that users get a
corrupt repository.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v2 3/3] refs: add configuration to enable flushing of refs
  2021-11-10 20:23         ` Ævar Arnfjörð Bjarmason
  2021-11-11  0:03           ` Neeraj Singh
@ 2021-11-11 12:14           ` Patrick Steinhardt
  1 sibling, 0 replies; 34+ messages in thread
From: Patrick Steinhardt @ 2021-11-11 12:14 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason
  Cc: Neeraj Singh, git, Jeff King, Johannes Schindelin,
	Junio C Hamano, Eric Wong, Neeraj K. Singh, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 5709 bytes --]

On Wed, Nov 10, 2021 at 09:23:04PM +0100, Ævar Arnfjörð Bjarmason wrote:
> 
> On Wed, Nov 10 2021, Neeraj Singh wrote:
> 
> > On Wed, Nov 10, 2021 at 03:49:02PM +0100, Ævar Arnfjörð Bjarmason wrote:
> >> 
> >> On Wed, Nov 10 2021, Patrick Steinhardt wrote:
[snip]
> >> ...[continued from above]: Again, per my potentially wrong understanding
> >> of syncing a "x" and "y" via an fsync of a subsequent "z" that's
> >> adjacent on the FS to those two.
> >
> > I suspect Patrick is concerned about the case where the worktree is on
> > a separate filesystem from the main repo, so there might be a motivation
> > to sync the worktree refs separately. Is that right, Patrick?
> 
> But he's syncing .git/objects, not .git/worktrees/<NAME>/refs/, no?

That'd be a bug then ;) My intent was to sync .git/refs and the
per-worktree refs.

[snip]
> > In my view there are two separable concerns:
> >
> >     1) What durability do we want for the user's data when an entire 'git'
> >        command completes? I.e. if I run 'git add <1000 files>; git commit' and the
> >        system loses power after both commands return, should I see all of those files
> >        in the committed state of the repo?
> >
> >     2) What internal consistency do we want in the git databases (ODB and REFS) if
> >        we crash midway through a 'git' command? I.e. if 'git add <1000 files>' crashes
> >        before returning, can there be an invalid object or a torn ref state?
> >
> > If were only concerned with (1), then doing a single fsync at the end of the high-level
> > git command would be sufficient. However, (2) requires more fsyncs to provide barriers
> > between different phases internal to the git commands. For instance, we need a barrier
> > between creating the ODB state for a commit (blobs, trees, commit obj) and the refs
> > pointing to it.
> >
> > I am not concerned with a few additional fsyncs for (2). On recent mainstream filesystems/ssds
> > each fsync may cost tens of milliseconds, so the difference between 1 to 3 fsyncs would not
> > be user perceptible. However, going from a couple fsyncs to 1000 fsyncs (one per blob added)
> > would become apparent to the user.
> >
> > The more optimal way to handle consistency and durability is Write-ahead-logging with log
> > replay on crash recovery. That approach can reduce the number of fsyncs in the active workload
> > to at most two (one to sync the log with a commit record and then one before truncating the
> > log after updating the main database). At that point, however, I think it would be best to
> > use an existing database engine with some modifications to create a good data layout for git.
> 
> I think that git should safely store your data by default, we clearly
> don't do that well enough in some cases, and should fix it.
> 
> But there's also cases where people would much rather have performance,
> and I think we should support that. E.g. running git in CI, doing a
> one-off import/fetch that you'll invoke "sync(1)" yourself after
> etc. This is the direction Eric Wong's patches are going into.
> 
> I understand from his patches/comments that you're not correct that just
> 1-3 fsyncs are OK, i.e. for some use-cases 0 is OK, and any >0 is too
> many, since it'll force a flush of the whole disk or something.
> 
> Even when git is is operating in a completely safe mode I think we'd
> still have license to play it fast and loose with /some/ user data,
> because users don't really care about all of their data in the .git/
> directory.

I think we need to discern two important cases: the first case is us
losing data, and the second case is us leaving the repository in a
corrupt state. I'm okay-ish with the first case: if your machine crashes
it's not completely unexpected that losing data would be a natural
consequence (well, losing the data that's currently in transit at least).

But we should make sure that we're not leaving the repo in a corrupt
state under any circumstance, and that's exactly what can happen right
now because we don't flush refs to disk before doing the renames.
Assuming we've got semantics correct, we are thus forced to do a page
cache flush to make sure that data is on disk before doing a rename to
not corrupt the repo. But in the case where we're fine with losing some
data, we may skip doing the final fsync to also synchronize the rename
to disk.

Patrick

> I.e. you do care about the *.pack file, but an associated *.bitmap is a
> derived file, so we probably don't need to fsync that too. Is it worth
> worrying about? Probably not, but that's what I had in mind with the
> above-linked proposed config schema.
> 
> Similarly for say writing 1k loose objects I think it would be
> completely fine to end up with a corrupt object during a "git fetch", as
> long as we also guarantee that we can gracefully recover from that
> corruption.
> 
> I.e. 1) we didn't update the ref(s) involved 2) we know the FS has the
> sorts of properties you're aiming for with "batch".
> 
> Now, as some patches I've had to object*.c recently show we don't handle
> those cases gracefully either. I.e. if we find a short loose object
> we'll panic, even if we'd be perfectly capable of getting it from
> elsewhere, and nothing otherwise references it.
> 
> But if we fixed that and gc/fsck/fetch etc. learned to deal with that
> content I don't see why wouldn't e.g. default to not fsyncing loose
> objects when invoked from say "fetch" by default, depending on FS/OS
> detection, and if we couldn't say it was safe defaulted to some "POSIX"
> mode that would be pedantic but slow and safe.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2021-11-11 12:14 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-04 12:38 [PATCH] refs: sync loose refs to disk before committing them Patrick Steinhardt
2021-11-04 13:14 ` Ævar Arnfjörð Bjarmason
2021-11-04 14:51   ` Patrick Steinhardt
2021-11-04 21:24   ` Junio C Hamano
2021-11-04 22:36     ` Neeraj Singh
2021-11-05  1:40       ` Junio C Hamano
2021-11-05  6:36         ` Jeff King
2021-11-05  8:35       ` Ævar Arnfjörð Bjarmason
2021-11-05  9:04         ` Jeff King
2021-11-05  7:07 ` Jeff King
2021-11-05  7:17   ` Jeff King
2021-11-05  9:12     ` Johannes Schindelin
2021-11-05  9:22       ` Patrick Steinhardt
2021-11-05  9:34       ` Jeff King
2021-11-09 11:25         ` Patrick Steinhardt
2021-11-10  8:36           ` Jeff King
2021-11-10  9:16             ` Patrick Steinhardt
2021-11-10 11:40 ` [PATCH v2 0/3] " Patrick Steinhardt
2021-11-10 11:40   ` [PATCH v2 1/3] wrapper: handle EINTR in `git_fsync()` Patrick Steinhardt
2021-11-10 14:33     ` Johannes Schindelin
2021-11-10 14:39     ` Ævar Arnfjörð Bjarmason
2021-11-10 11:40   ` [PATCH v2 2/3] wrapper: provide function to sync directories Patrick Steinhardt
2021-11-10 14:40     ` Ævar Arnfjörð Bjarmason
2021-11-10 11:41   ` [PATCH v2 3/3] refs: add configuration to enable flushing of refs Patrick Steinhardt
2021-11-10 14:49     ` Ævar Arnfjörð Bjarmason
2021-11-10 19:15       ` Neeraj Singh
2021-11-10 20:23         ` Ævar Arnfjörð Bjarmason
2021-11-11  0:03           ` Neeraj Singh
2021-11-11 12:14           ` Patrick Steinhardt
2021-11-11 12:06       ` Patrick Steinhardt
2021-11-11  0:18     ` Neeraj Singh
2021-11-10 14:44   ` [PATCH v2 0/3] refs: sync loose refs to disk before committing them Johannes Schindelin
2021-11-10 20:45   ` Jeff King
2021-11-11 11:47     ` Patrick Steinhardt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.