All of lore.kernel.org
 help / color / mirror / Atom feed
* [GIT PULL] gfs2 fix
@ 2022-04-26 14:54 ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-26 14:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Gruenbacher, cluster-devel, linux-kernel

Hi Linus,

please consider pulling the following gfs2 fix for 5.18-rc5.

Also, note that we're fighting with a rare case of data corruption that
seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
page fault deadlocks for buffered I/O"; merged in v5.16).  The
corruption seems to occur in gfs2_file_buffered_write() when
fault_in_iov_iter_readable() is involved.  We do end up with data that's
written at an offset, as if after a fault-in, the position in the iocb
got out of sync with the position in the iov_iter.  The user buffer the
iov_iter points at isn't page aligned, so the corruption also isn't page
aligned.  The code all seems correct and we couldn't figure out what's
going on so far.

Thanks,
Andreas

The following changes since commit af2d861d4cd2a4da5137f795ee3509e6f944a25b:

  Linux 5.18-rc4 (2022-04-24 14:51:22 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix

for you to fetch changes up to e57f9af73d6b0ffb5f1aeaf6cec9a751dd8535c9:

  gfs2: Don't re-check for write past EOF unnecessarily (2022-04-26 15:38:00 +0200)

----------------------------------------------------------------
gfs2 fix

- Only re-check for direct I/O writes past the end of the file after
  re-acquiring the inode glock.

----------------------------------------------------------------
Andreas Gruenbacher (1):
      gfs2: Don't re-check for write past EOF unnecessarily

 fs/gfs2/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-26 14:54 ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-26 14:54 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi Linus,

please consider pulling the following gfs2 fix for 5.18-rc5.

Also, note that we're fighting with a rare case of data corruption that
seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
page fault deadlocks for buffered I/O"; merged in v5.16).  The
corruption seems to occur in gfs2_file_buffered_write() when
fault_in_iov_iter_readable() is involved.  We do end up with data that's
written at an offset, as if after a fault-in, the position in the iocb
got out of sync with the position in the iov_iter.  The user buffer the
iov_iter points at isn't page aligned, so the corruption also isn't page
aligned.  The code all seems correct and we couldn't figure out what's
going on so far.

Thanks,
Andreas

The following changes since commit af2d861d4cd2a4da5137f795ee3509e6f944a25b:

  Linux 5.18-rc4 (2022-04-24 14:51:22 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix

for you to fetch changes up to e57f9af73d6b0ffb5f1aeaf6cec9a751dd8535c9:

  gfs2: Don't re-check for write past EOF unnecessarily (2022-04-26 15:38:00 +0200)

----------------------------------------------------------------
gfs2 fix

- Only re-check for direct I/O writes past the end of the file after
  re-acquiring the inode glock.

----------------------------------------------------------------
Andreas Gruenbacher (1):
      gfs2: Don't re-check for write past EOF unnecessarily

 fs/gfs2/file.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-26 14:54 ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-04-26 18:31   ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-26 18:31 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, Linux Kernel Mailing List

On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Also, note that we're fighting with a rare case of data corruption that
> seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> page fault deadlocks for buffered I/O"; merged in v5.16).  The
> corruption seems to occur in gfs2_file_buffered_write() when
> fault_in_iov_iter_readable() is involved.  We do end up with data that's
> written at an offset, as if after a fault-in, the position in the iocb
> got out of sync with the position in the iov_iter.  The user buffer the
> iov_iter points at isn't page aligned, so the corruption also isn't page
> aligned.  The code all seems correct and we couldn't figure out what's
> going on so far.

So this may be stupid, but looking at gfs2_file_direct_write(), I see
what looks like two different obvious bugs.

This looks entirely wrong:

        if (ret > 0)
                read = ret;

because this code is *repeated*.

I think it should use "read += ret;" so that if we have a few
successful reads, they add up.

And then at the end:

        if (ret < 0)
                return ret;
        return read;

looks wrong too, since maybe the *last* iteration failed, but an
earlier succeeded, so I think it should be

        if (read)
                return read;
        return ret;

But hey, what do I know. I say "looks like obvious bugs", but I don't
really know the code, and I may just be completely full of sh*t.

The reason I think I'm full of sh*t is that you say that the problem
occurs in gfs2_file_buffered_write(), not in that
gfs2_file_direct_write() case.

And the *buffered* case seems to get both of those "obvious bugs" right.

So I must be wrong, but I have to say, that looks odd to me.

Now hopefully somebody who knows the code will explain to me why I'm
wrong, and in the process go "Duh, but.." and see what's up.

                  Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-26 18:31   ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-26 18:31 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Also, note that we're fighting with a rare case of data corruption that
> seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> page fault deadlocks for buffered I/O"; merged in v5.16).  The
> corruption seems to occur in gfs2_file_buffered_write() when
> fault_in_iov_iter_readable() is involved.  We do end up with data that's
> written at an offset, as if after a fault-in, the position in the iocb
> got out of sync with the position in the iov_iter.  The user buffer the
> iov_iter points at isn't page aligned, so the corruption also isn't page
> aligned.  The code all seems correct and we couldn't figure out what's
> going on so far.

So this may be stupid, but looking at gfs2_file_direct_write(), I see
what looks like two different obvious bugs.

This looks entirely wrong:

        if (ret > 0)
                read = ret;

because this code is *repeated*.

I think it should use "read += ret;" so that if we have a few
successful reads, they add up.

And then at the end:

        if (ret < 0)
                return ret;
        return read;

looks wrong too, since maybe the *last* iteration failed, but an
earlier succeeded, so I think it should be

        if (read)
                return read;
        return ret;

But hey, what do I know. I say "looks like obvious bugs", but I don't
really know the code, and I may just be completely full of sh*t.

The reason I think I'm full of sh*t is that you say that the problem
occurs in gfs2_file_buffered_write(), not in that
gfs2_file_direct_write() case.

And the *buffered* case seems to get both of those "obvious bugs" right.

So I must be wrong, but I have to say, that looks odd to me.

Now hopefully somebody who knows the code will explain to me why I'm
wrong, and in the process go "Duh, but.." and see what's up.

                  Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-26 14:54 ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-04-26 19:07   ` pr-tracker-bot
  -1 siblings, 0 replies; 64+ messages in thread
From: pr-tracker-bot @ 2022-04-26 19:07 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Andreas Gruenbacher, cluster-devel, linux-kernel

The pull request you sent on Tue, 26 Apr 2022 16:54:45 +0200:

> https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4fad37d595b9d9a2996467d780cb2e7a1b08b2c0

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-26 19:07   ` pr-tracker-bot
  0 siblings, 0 replies; 64+ messages in thread
From: pr-tracker-bot @ 2022-04-26 19:07 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The pull request you sent on Tue, 26 Apr 2022 16:54:45 +0200:

> https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4fad37d595b9d9a2996467d780cb2e7a1b08b2c0

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-26 18:31   ` [Cluster-devel] " Linus Torvalds
@ 2022-04-26 21:27     ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-26 21:27 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: cluster-devel, Linux Kernel Mailing List

On Tue, Apr 26, 2022 at 8:31 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Also, note that we're fighting with a rare case of data corruption that
> > seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> > page fault deadlocks for buffered I/O"; merged in v5.16).  The
> > corruption seems to occur in gfs2_file_buffered_write() when
> > fault_in_iov_iter_readable() is involved.  We do end up with data that's
> > written at an offset, as if after a fault-in, the position in the iocb
> > got out of sync with the position in the iov_iter.  The user buffer the
> > iov_iter points at isn't page aligned, so the corruption also isn't page
> > aligned.  The code all seems correct and we couldn't figure out what's
> > going on so far.
>
> So this may be stupid, but looking at gfs2_file_direct_write(), I see
> what looks like two different obvious bugs.
>
> This looks entirely wrong:
>
>         if (ret > 0)
>                 read = ret;
>
> because this code is *repeated*.
>
> I think it should use "read += ret;" so that if we have a few
> successful reads, they add up.

Btrfs has a comment in that place that reads:

    /* No increment (+=) because iomap returns a cumulative value. */

That's so that we can complete the tail of an asynchronous write
asynchronously after a failed page fault and subsequent fault-in.

> And then at the end:
>
>         if (ret < 0)
>                 return ret;
>         return read;
>
> looks wrong too, since maybe the *last* iteration failed, but an
> earlier succeeded, so I think it should be
>
>         if (read)
>                 return read;
>         return ret;
>
> But hey, what do I know. I say "looks like obvious bugs", but I don't
> really know the code, and I may just be completely full of sh*t.

That would be great, but applications don't seem to be able to cope
with short direct writes, so we must turn partial failure into total
failure here. There's at least one xfstest that checks for that as
well.

> The reason I think I'm full of sh*t is that you say that the problem
> occurs in gfs2_file_buffered_write(), not in that
> gfs2_file_direct_write() case.

Right, we're having that issue with buffered writes.

> And the *buffered* case seems to get both of those "obvious bugs" right.
>
> So I must be wrong, but I have to say, that looks odd to me.
>
> Now hopefully somebody who knows the code will explain to me why I'm
> wrong, and in the process go "Duh, but.." and see what's up.

Thanks for pointing out the places that seem odd to you. You'll not be
the only one.

Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-26 21:27     ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-26 21:27 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Apr 26, 2022 at 8:31 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 26, 2022 at 7:54 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Also, note that we're fighting with a rare case of data corruption that
> > seems to have been introduced by commit 00bfe02f4796 ("gfs2: Fix mmap +
> > page fault deadlocks for buffered I/O"; merged in v5.16).  The
> > corruption seems to occur in gfs2_file_buffered_write() when
> > fault_in_iov_iter_readable() is involved.  We do end up with data that's
> > written at an offset, as if after a fault-in, the position in the iocb
> > got out of sync with the position in the iov_iter.  The user buffer the
> > iov_iter points at isn't page aligned, so the corruption also isn't page
> > aligned.  The code all seems correct and we couldn't figure out what's
> > going on so far.
>
> So this may be stupid, but looking at gfs2_file_direct_write(), I see
> what looks like two different obvious bugs.
>
> This looks entirely wrong:
>
>         if (ret > 0)
>                 read = ret;
>
> because this code is *repeated*.
>
> I think it should use "read += ret;" so that if we have a few
> successful reads, they add up.

Btrfs has a comment in that place that reads:

    /* No increment (+=) because iomap returns a cumulative value. */

That's so that we can complete the tail of an asynchronous write
asynchronously after a failed page fault and subsequent fault-in.

> And then at the end:
>
>         if (ret < 0)
>                 return ret;
>         return read;
>
> looks wrong too, since maybe the *last* iteration failed, but an
> earlier succeeded, so I think it should be
>
>         if (read)
>                 return read;
>         return ret;
>
> But hey, what do I know. I say "looks like obvious bugs", but I don't
> really know the code, and I may just be completely full of sh*t.

That would be great, but applications don't seem to be able to cope
with short direct writes, so we must turn partial failure into total
failure here. There's at least one xfstest that checks for that as
well.

> The reason I think I'm full of sh*t is that you say that the problem
> occurs in gfs2_file_buffered_write(), not in that
> gfs2_file_direct_write() case.

Right, we're having that issue with buffered writes.

> And the *buffered* case seems to get both of those "obvious bugs" right.
>
> So I must be wrong, but I have to say, that looks odd to me.
>
> Now hopefully somebody who knows the code will explain to me why I'm
> wrong, and in the process go "Duh, but.." and see what's up.

Thanks for pointing out the places that seem odd to you. You'll not be
the only one.

Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-26 21:27     ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-04-26 23:33       ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-26 23:33 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, Linux Kernel Mailing List

On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Btrfs has a comment in that place that reads:
>
>     /* No increment (+=) because iomap returns a cumulative value. */

What a truly horrid interface. But you are triggering repressed
childhood memories:

> That's so that we can complete the tail of an asynchronous write
> asynchronously after a failed page fault and subsequent fault-in.

yeah, that makes me go "I remember something like that".

> That would be great, but applications don't seem to be able to cope
> with short direct writes, so we must turn partial failure into total
> failure here. There's at least one xfstest that checks for that as
> well.

What a complete crock. You're telling me that you did some writes, but
then you can't tell user space that writes happened because that would
confuse things.

Direct-IO is some truly hot disgusting garbage.

Happily it's only used for things like databases that nobody sane
would care about anyway.

Anyway, none of that makes any sense, since you do this:

                ret = gfs2_file_direct_write(iocb, from, &gh);
                if (ret < 0 || !iov_iter_count(from))
                        goto out_unlock;

                iocb->ki_flags |= IOCB_DSYNC;
                buffered = gfs2_file_buffered_write(iocb, from, &gh);

so you're saying that the direct write will never partially succeed,
but then in gfs2_file_write_iter() it very much looks like it's
falling back to buffered writes for that case.

Hmm. Very odd.

I assume this is all due to that

        /* Silently fall back to buffered I/O when writing beyond EOF */
        if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))

thing, so this all makes some perverse kind of sense, but it still
looks like this code just needs some serious serious commentary.

So you *can* have a partial write if you hit the end of the file, and
then you'll continue that partial write with the buffered code.

But an actual _failure_ will not do that, and instead return an error
even if the write was partially done.

But then *some* failures aren't failures at all, and without any
comments do this

        if (ret == -ENOTBLK)
                ret = 0;


And remind me again - this all is meant for applications that
supposedly care about consistency on disk?

And the xfs tests actually have a *test* for that case, to make sure
that nobody can sanely *really* know how much of the write succeeded
if it was a DIO write?

Gotcha.

> > The reason I think I'm full of sh*t is that you say that the problem
> > occurs in gfs2_file_buffered_write(), not in that
> > gfs2_file_direct_write() case.
>
> Right, we're having that issue with buffered writes.

I have to say, compared to all the crazy things I see in the DIO path,
the buffered write path actually looks almost entirely sane.

Of course, gfs2_file_read_iter() counts how many bytes it has read in
a variable called 'written', and gfs2_file_buffered_write() counts the
bytes it has written in a variable called 'read', so "entirely sane"
is all very very relative.

I'm sure there's some good reason (job security?) for all this insanity.

But I now have to go dig my eyes out with a rusty spoon.

But before I do that, I have one more question (I'm going to regret
this, aren't I?):

In gfs2_file_buffered_write(), when it has done that
fault_in_iov_iter_readable(), and it decides that that succeeded, it
does

                        if (gfs2_holder_queued(gh))
                                goto retry_under_glock;
                        if (read && !(iocb->ki_flags & IOCB_DIRECT))
                                goto out_uninit;
                        goto retry;

so if it still has that lock (if I understand correctly), it will always retry.

But if it *lost* the lock, it will retry only if was a direct write,
and return a partial success for a regular write() rather than
continue the write.

Whaa? I'm assuming that this is more of that "direct writes have to
succeed fully or not at all", but according to POSIX *regular* writes
should also succeed fully, unless some error happens.

Losing the lock doesn't sound like an error to me.

And there are a lot of applications that do assume "write to a regular
file didn't complete fully means that the disk is full" etc. Because
that's the traditional meaning.

This doesn't seem to be related to any data corruption issue, but it does smell.

             Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-26 23:33       ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-26 23:33 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Btrfs has a comment in that place that reads:
>
>     /* No increment (+=) because iomap returns a cumulative value. */

What a truly horrid interface. But you are triggering repressed
childhood memories:

> That's so that we can complete the tail of an asynchronous write
> asynchronously after a failed page fault and subsequent fault-in.

yeah, that makes me go "I remember something like that".

> That would be great, but applications don't seem to be able to cope
> with short direct writes, so we must turn partial failure into total
> failure here. There's at least one xfstest that checks for that as
> well.

What a complete crock. You're telling me that you did some writes, but
then you can't tell user space that writes happened because that would
confuse things.

Direct-IO is some truly hot disgusting garbage.

Happily it's only used for things like databases that nobody sane
would care about anyway.

Anyway, none of that makes any sense, since you do this:

                ret = gfs2_file_direct_write(iocb, from, &gh);
                if (ret < 0 || !iov_iter_count(from))
                        goto out_unlock;

                iocb->ki_flags |= IOCB_DSYNC;
                buffered = gfs2_file_buffered_write(iocb, from, &gh);

so you're saying that the direct write will never partially succeed,
but then in gfs2_file_write_iter() it very much looks like it's
falling back to buffered writes for that case.

Hmm. Very odd.

I assume this is all due to that

        /* Silently fall back to buffered I/O when writing beyond EOF */
        if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))

thing, so this all makes some perverse kind of sense, but it still
looks like this code just needs some serious serious commentary.

So you *can* have a partial write if you hit the end of the file, and
then you'll continue that partial write with the buffered code.

But an actual _failure_ will not do that, and instead return an error
even if the write was partially done.

But then *some* failures aren't failures at all, and without any
comments do this

        if (ret == -ENOTBLK)
                ret = 0;


And remind me again - this all is meant for applications that
supposedly care about consistency on disk?

And the xfs tests actually have a *test* for that case, to make sure
that nobody can sanely *really* know how much of the write succeeded
if it was a DIO write?

Gotcha.

> > The reason I think I'm full of sh*t is that you say that the problem
> > occurs in gfs2_file_buffered_write(), not in that
> > gfs2_file_direct_write() case.
>
> Right, we're having that issue with buffered writes.

I have to say, compared to all the crazy things I see in the DIO path,
the buffered write path actually looks almost entirely sane.

Of course, gfs2_file_read_iter() counts how many bytes it has read in
a variable called 'written', and gfs2_file_buffered_write() counts the
bytes it has written in a variable called 'read', so "entirely sane"
is all very very relative.

I'm sure there's some good reason (job security?) for all this insanity.

But I now have to go dig my eyes out with a rusty spoon.

But before I do that, I have one more question (I'm going to regret
this, aren't I?):

In gfs2_file_buffered_write(), when it has done that
fault_in_iov_iter_readable(), and it decides that that succeeded, it
does

                        if (gfs2_holder_queued(gh))
                                goto retry_under_glock;
                        if (read && !(iocb->ki_flags & IOCB_DIRECT))
                                goto out_uninit;
                        goto retry;

so if it still has that lock (if I understand correctly), it will always retry.

But if it *lost* the lock, it will retry only if was a direct write,
and return a partial success for a regular write() rather than
continue the write.

Whaa? I'm assuming that this is more of that "direct writes have to
succeed fully or not at all", but according to POSIX *regular* writes
should also succeed fully, unless some error happens.

Losing the lock doesn't sound like an error to me.

And there are a lot of applications that do assume "write to a regular
file didn't complete fully means that the disk is full" etc. Because
that's the traditional meaning.

This doesn't seem to be related to any data corruption issue, but it does smell.

             Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-26 23:33       ` [Cluster-devel] " Linus Torvalds
@ 2022-04-27 12:29         ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-27 12:29 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: cluster-devel, Linux Kernel Mailing List

On Wed, Apr 27, 2022 at 1:33 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Btrfs has a comment in that place that reads:
> >
> >     /* No increment (+=) because iomap returns a cumulative value. */
>
> What a truly horrid interface. But you are triggering repressed
> childhood memories:
>
> > That's so that we can complete the tail of an asynchronous write
> > asynchronously after a failed page fault and subsequent fault-in.
>
> yeah, that makes me go "I remember something like that".
>
> > That would be great, but applications don't seem to be able to cope
> > with short direct writes, so we must turn partial failure into total
> > failure here. There's at least one xfstest that checks for that as
> > well.
>
> What a complete crock. You're telling me that you did some writes, but
> then you can't tell user space that writes happened because that would
> confuse things.
>
> Direct-IO is some truly hot disgusting garbage.
>
> Happily it's only used for things like databases that nobody sane
> would care about anyway.
>
> Anyway, none of that makes any sense, since you do this:
>
>                 ret = gfs2_file_direct_write(iocb, from, &gh);
>                 if (ret < 0 || !iov_iter_count(from))
>                         goto out_unlock;
>
>                 iocb->ki_flags |= IOCB_DSYNC;
>                 buffered = gfs2_file_buffered_write(iocb, from, &gh);
>
> so you're saying that the direct write will never partially succeed,
> but then in gfs2_file_write_iter() it very much looks like it's
> falling back to buffered writes for that case.
>
> Hmm. Very odd.
>
> I assume this is all due to that
>
>         /* Silently fall back to buffered I/O when writing beyond EOF */
>         if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))
>
> thing, so this all makes some perverse kind of sense, but it still
> looks like this code just needs some serious serious commentary.

Yes, that, as well as writing into holes.

During direct writes, gfs2 is holding the inode glock in a special
"shared writable" mode which works like the usual "shared readable"
mode as far as metadata goes, but can be held by multiple "shared
writers" at the same time. This allows multiple nodes to write to the
storage device concurrently as long as the file is preallocated (i.e.,
databases). When it comes to allocations, it falls back to "exclusive"
locking and buffered writes.

> So you *can* have a partial write if you hit the end of the file, and
> then you'll continue that partial write with the buffered code.
>
> But an actual _failure_ will not do that, and instead return an error
> even if the write was partially done.
>
> But then *some* failures aren't failures at all, and without any
> comments do this
>
>         if (ret == -ENOTBLK)
>                 ret = 0;
>
>
> And remind me again - this all is meant for applications that
> supposedly care about consistency on disk?
>
> And the xfs tests actually have a *test* for that case, to make sure
> that nobody can sanely *really* know how much of the write succeeded
> if it was a DIO write?
>
> Gotcha.

I agree that it's pretty sad.

> > > The reason I think I'm full of sh*t is that you say that the problem
> > > occurs in gfs2_file_buffered_write(), not in that
> > > gfs2_file_direct_write() case.
> >
> > Right, we're having that issue with buffered writes.
>
> I have to say, compared to all the crazy things I see in the DIO path,
> the buffered write path actually looks almost entirely sane.
>
> Of course, gfs2_file_read_iter() counts how many bytes it has read in
> a variable called 'written', and gfs2_file_buffered_write() counts the
> bytes it has written in a variable called 'read', so "entirely sane"
> is all very very relative.
>
> I'm sure there's some good reason (job security?) for all this insanity.

Point taken; I'll fix this up.

> But I now have to go dig my eyes out with a rusty spoon.
>
> But before I do that, I have one more question (I'm going to regret
> this, aren't I?):
>
> In gfs2_file_buffered_write(), when it has done that
> fault_in_iov_iter_readable(), and it decides that that succeeded, it
> does
>
>                         if (gfs2_holder_queued(gh))
>                                 goto retry_under_glock;
>                         if (read && !(iocb->ki_flags & IOCB_DIRECT))
>                                 goto out_uninit;
>                         goto retry;
>
> so if it still has that lock (if I understand correctly), it will always retry.
>
> But if it *lost* the lock, it will retry only if was a direct write,
> and return a partial success for a regular write() rather than
> continue the write.
>
> Whaa? I'm assuming that this is more of that "direct writes have to
> succeed fully or not at all", but according to POSIX *regular* writes
> should also succeed fully, unless some error happens.
>
> Losing the lock doesn't sound like an error to me.

Regular (buffered) reads and writes are expected to be atomic with
respect to each other. That atomicity comes from holding the lock.
When we lose the lock, we can observe atomicity and return a short
result, ignore atomicity and return the full result, or retry the
entire operation. Which of those three options would you prefer?

For what it's worth, none of this matters as long as there's no lock
contention across the cluster.

> And there are a lot of applications that do assume "write to a regular
> file didn't complete fully means that the disk is full" etc. Because
> that's the traditional meaning.
>
> This doesn't seem to be related to any data corruption issue, but it does smell.
>
>              Linus

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-27 12:29         ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-27 12:29 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Apr 27, 2022 at 1:33 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 26, 2022 at 2:28 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Btrfs has a comment in that place that reads:
> >
> >     /* No increment (+=) because iomap returns a cumulative value. */
>
> What a truly horrid interface. But you are triggering repressed
> childhood memories:
>
> > That's so that we can complete the tail of an asynchronous write
> > asynchronously after a failed page fault and subsequent fault-in.
>
> yeah, that makes me go "I remember something like that".
>
> > That would be great, but applications don't seem to be able to cope
> > with short direct writes, so we must turn partial failure into total
> > failure here. There's at least one xfstest that checks for that as
> > well.
>
> What a complete crock. You're telling me that you did some writes, but
> then you can't tell user space that writes happened because that would
> confuse things.
>
> Direct-IO is some truly hot disgusting garbage.
>
> Happily it's only used for things like databases that nobody sane
> would care about anyway.
>
> Anyway, none of that makes any sense, since you do this:
>
>                 ret = gfs2_file_direct_write(iocb, from, &gh);
>                 if (ret < 0 || !iov_iter_count(from))
>                         goto out_unlock;
>
>                 iocb->ki_flags |= IOCB_DSYNC;
>                 buffered = gfs2_file_buffered_write(iocb, from, &gh);
>
> so you're saying that the direct write will never partially succeed,
> but then in gfs2_file_write_iter() it very much looks like it's
> falling back to buffered writes for that case.
>
> Hmm. Very odd.
>
> I assume this is all due to that
>
>         /* Silently fall back to buffered I/O when writing beyond EOF */
>         if (iocb->ki_pos + iov_iter_count(from) > i_size_read(&ip->i_inode))
>
> thing, so this all makes some perverse kind of sense, but it still
> looks like this code just needs some serious serious commentary.

Yes, that, as well as writing into holes.

During direct writes, gfs2 is holding the inode glock in a special
"shared writable" mode which works like the usual "shared readable"
mode as far as metadata goes, but can be held by multiple "shared
writers" at the same time. This allows multiple nodes to write to the
storage device concurrently as long as the file is preallocated (i.e.,
databases). When it comes to allocations, it falls back to "exclusive"
locking and buffered writes.

> So you *can* have a partial write if you hit the end of the file, and
> then you'll continue that partial write with the buffered code.
>
> But an actual _failure_ will not do that, and instead return an error
> even if the write was partially done.
>
> But then *some* failures aren't failures at all, and without any
> comments do this
>
>         if (ret == -ENOTBLK)
>                 ret = 0;
>
>
> And remind me again - this all is meant for applications that
> supposedly care about consistency on disk?
>
> And the xfs tests actually have a *test* for that case, to make sure
> that nobody can sanely *really* know how much of the write succeeded
> if it was a DIO write?
>
> Gotcha.

I agree that it's pretty sad.

> > > The reason I think I'm full of sh*t is that you say that the problem
> > > occurs in gfs2_file_buffered_write(), not in that
> > > gfs2_file_direct_write() case.
> >
> > Right, we're having that issue with buffered writes.
>
> I have to say, compared to all the crazy things I see in the DIO path,
> the buffered write path actually looks almost entirely sane.
>
> Of course, gfs2_file_read_iter() counts how many bytes it has read in
> a variable called 'written', and gfs2_file_buffered_write() counts the
> bytes it has written in a variable called 'read', so "entirely sane"
> is all very very relative.
>
> I'm sure there's some good reason (job security?) for all this insanity.

Point taken; I'll fix this up.

> But I now have to go dig my eyes out with a rusty spoon.
>
> But before I do that, I have one more question (I'm going to regret
> this, aren't I?):
>
> In gfs2_file_buffered_write(), when it has done that
> fault_in_iov_iter_readable(), and it decides that that succeeded, it
> does
>
>                         if (gfs2_holder_queued(gh))
>                                 goto retry_under_glock;
>                         if (read && !(iocb->ki_flags & IOCB_DIRECT))
>                                 goto out_uninit;
>                         goto retry;
>
> so if it still has that lock (if I understand correctly), it will always retry.
>
> But if it *lost* the lock, it will retry only if was a direct write,
> and return a partial success for a regular write() rather than
> continue the write.
>
> Whaa? I'm assuming that this is more of that "direct writes have to
> succeed fully or not at all", but according to POSIX *regular* writes
> should also succeed fully, unless some error happens.
>
> Losing the lock doesn't sound like an error to me.

Regular (buffered) reads and writes are expected to be atomic with
respect to each other. That atomicity comes from holding the lock.
When we lose the lock, we can observe atomicity and return a short
result, ignore atomicity and return the full result, or retry the
entire operation. Which of those three options would you prefer?

For what it's worth, none of this matters as long as there's no lock
contention across the cluster.

> And there are a lot of applications that do assume "write to a regular
> file didn't complete fully means that the disk is full" etc. Because
> that's the traditional meaning.
>
> This doesn't seem to be related to any data corruption issue, but it does smell.
>
>              Linus

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-27 12:29         ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-04-27 17:13           ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-27 17:13 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, Linux Kernel Mailing List

On Wed, Apr 27, 2022 at 5:29 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Regular (buffered) reads and writes are expected to be atomic with
> respect to each other.

Linux has actually never honored that completely broken POSIX
requirement, although I think some filesystems (notably XFS) have
tried.

It's a completely broken concept. It's not possible to honor atomicity
with mmap(), and nobody has *ever* cared.

And it causes huge amounts of problems and basically makes any sane
locking entirely impossible.

The fact that you literally broke regular file writes in ways that are
incompatible with (much MUCH more important) POSIX file behavior to
try to get that broken read/write atomicity is only one example among
many for why that alleged rule just has to be ignored.

We do honor the PIPE_BUF atomicity on pipes, which is a completely
different kind of atomicity wrt read/write, and doesn't have the
fundamental issues that arbitrary regular file reads/writes have.

There is absolutely no sane way to do that file atomicity wrt
arbitrary read/write calls (*), and you shouldn't even try.

That rule needs to be forgotten about, and buried 6ft deep.

So please scrub any mention of that idiotic rule from documentation,
and from your brain.

And please don't break "partial write means disk full or IO error" due
to trying to follow this broken rule, which was apparently what you
did.

Because that "regular file read/write is done in full" is a *MUCH*
more important rule, and there is a shitton of applications that most
definitely depend on *that* rule.

Just go to debian code search, and look for

   "if (write("

and you'll get thousands of hits, and on the first page of hits 9 out
of 10 of the hits are literally about that "partial write is an
error", eg code like this:

            if (write(fd,&triple,sizeof(triple)) != sizeof(triple))
                reporterr(1,NULL);

from libreoffice.

                        Linus

(*) Yeah, if you never care about performance(**) of mixed read/write,
and you don't care about mmap, and you have no other locking issues,
it's certainly possible. The old rule came about from original UNIX
literally taking an inode lock around the whole IO access, because
that was simple, and back in the days you'd never have multiple
concurrent readers/writers anyway.

(**) It's also instructive how O_DIRECT literally throws that rule
away, and then some direct-IO people said for years that direct-IO is
superior and used this as one of their arguments. Probably the same
people who thought that "oh, don't report partial success", because we
can't deal with it.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-27 17:13           ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-27 17:13 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Apr 27, 2022 at 5:29 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Regular (buffered) reads and writes are expected to be atomic with
> respect to each other.

Linux has actually never honored that completely broken POSIX
requirement, although I think some filesystems (notably XFS) have
tried.

It's a completely broken concept. It's not possible to honor atomicity
with mmap(), and nobody has *ever* cared.

And it causes huge amounts of problems and basically makes any sane
locking entirely impossible.

The fact that you literally broke regular file writes in ways that are
incompatible with (much MUCH more important) POSIX file behavior to
try to get that broken read/write atomicity is only one example among
many for why that alleged rule just has to be ignored.

We do honor the PIPE_BUF atomicity on pipes, which is a completely
different kind of atomicity wrt read/write, and doesn't have the
fundamental issues that arbitrary regular file reads/writes have.

There is absolutely no sane way to do that file atomicity wrt
arbitrary read/write calls (*), and you shouldn't even try.

That rule needs to be forgotten about, and buried 6ft deep.

So please scrub any mention of that idiotic rule from documentation,
and from your brain.

And please don't break "partial write means disk full or IO error" due
to trying to follow this broken rule, which was apparently what you
did.

Because that "regular file read/write is done in full" is a *MUCH*
more important rule, and there is a shitton of applications that most
definitely depend on *that* rule.

Just go to debian code search, and look for

   "if (write("

and you'll get thousands of hits, and on the first page of hits 9 out
of 10 of the hits are literally about that "partial write is an
error", eg code like this:

            if (write(fd,&triple,sizeof(triple)) != sizeof(triple))
                reporterr(1,NULL);

from libreoffice.

                        Linus

(*) Yeah, if you never care about performance(**) of mixed read/write,
and you don't care about mmap, and you have no other locking issues,
it's certainly possible. The old rule came about from original UNIX
literally taking an inode lock around the whole IO access, because
that was simple, and back in the days you'd never have multiple
concurrent readers/writers anyway.

(**) It's also instructive how O_DIRECT literally throws that rule
away, and then some direct-IO people said for years that direct-IO is
superior and used this as one of their arguments. Probably the same
people who thought that "oh, don't report partial success", because we
can't deal with it.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-27 17:13           ` [Cluster-devel] " Linus Torvalds
@ 2022-04-27 19:41             ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-27 19:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: cluster-devel, Linux Kernel Mailing List

On Wed, Apr 27, 2022 at 7:13 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 27, 2022 at 5:29 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Regular (buffered) reads and writes are expected to be atomic with
> > respect to each other.
>
> Linux has actually never honored that completely broken POSIX
> requirement, although I think some filesystems (notably XFS) have
> tried.

Okay, I can happily live with that.

I wonder if this could be documented in the read and write manual
pages. Or would that be asking too much?

> It's a completely broken concept. It's not possible to honor atomicity
> with mmap(), and nobody has *ever* cared.
>
> And it causes huge amounts of problems and basically makes any sane
> locking entirely impossible.
>
> The fact that you literally broke regular file writes in ways that are
> incompatible with (much MUCH more important) POSIX file behavior to
> try to get that broken read/write atomicity is only one example among
> many for why that alleged rule just has to be ignored.
>
> We do honor the PIPE_BUF atomicity on pipes, which is a completely
> different kind of atomicity wrt read/write, and doesn't have the
> fundamental issues that arbitrary regular file reads/writes have.
>
> There is absolutely no sane way to do that file atomicity wrt
> arbitrary read/write calls (*), and you shouldn't even try.
>
> That rule needs to be forgotten about, and buried 6ft deep.
>
> So please scrub any mention of that idiotic rule from documentation,
> and from your brain.
>
> And please don't break "partial write means disk full or IO error" due
> to trying to follow this broken rule, which was apparently what you
> did.
>
> Because that "regular file read/write is done in full" is a *MUCH*
> more important rule, and there is a shitton of applications that most
> definitely depend on *that* rule.
>
> Just go to debian code search, and look for
>
>    "if (write("
>
> and you'll get thousands of hits, and on the first page of hits 9 out
> of 10 of the hits are literally about that "partial write is an
> error", eg code like this:
>
>             if (write(fd,&triple,sizeof(triple)) != sizeof(triple))
>                 reporterr(1,NULL);
>
> from libreoffice.
>
>                         Linus
>
> (*) Yeah, if you never care about performance(**) of mixed read/write,
> and you don't care about mmap, and you have no other locking issues,
> it's certainly possible. The old rule came about from original UNIX
> literally taking an inode lock around the whole IO access, because
> that was simple, and back in the days you'd never have multiple
> concurrent readers/writers anyway.
>
> (**) It's also instructive how O_DIRECT literally throws that rule
> away, and then some direct-IO people said for years that direct-IO is
> superior and used this as one of their arguments. Probably the same
> people who thought that "oh, don't report partial success", because we
> can't deal with it.
>

Thanks a lot,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-27 19:41             ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-27 19:41 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Apr 27, 2022 at 7:13 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 27, 2022 at 5:29 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Regular (buffered) reads and writes are expected to be atomic with
> > respect to each other.
>
> Linux has actually never honored that completely broken POSIX
> requirement, although I think some filesystems (notably XFS) have
> tried.

Okay, I can happily live with that.

I wonder if this could be documented in the read and write manual
pages. Or would that be asking too much?

> It's a completely broken concept. It's not possible to honor atomicity
> with mmap(), and nobody has *ever* cared.
>
> And it causes huge amounts of problems and basically makes any sane
> locking entirely impossible.
>
> The fact that you literally broke regular file writes in ways that are
> incompatible with (much MUCH more important) POSIX file behavior to
> try to get that broken read/write atomicity is only one example among
> many for why that alleged rule just has to be ignored.
>
> We do honor the PIPE_BUF atomicity on pipes, which is a completely
> different kind of atomicity wrt read/write, and doesn't have the
> fundamental issues that arbitrary regular file reads/writes have.
>
> There is absolutely no sane way to do that file atomicity wrt
> arbitrary read/write calls (*), and you shouldn't even try.
>
> That rule needs to be forgotten about, and buried 6ft deep.
>
> So please scrub any mention of that idiotic rule from documentation,
> and from your brain.
>
> And please don't break "partial write means disk full or IO error" due
> to trying to follow this broken rule, which was apparently what you
> did.
>
> Because that "regular file read/write is done in full" is a *MUCH*
> more important rule, and there is a shitton of applications that most
> definitely depend on *that* rule.
>
> Just go to debian code search, and look for
>
>    "if (write("
>
> and you'll get thousands of hits, and on the first page of hits 9 out
> of 10 of the hits are literally about that "partial write is an
> error", eg code like this:
>
>             if (write(fd,&triple,sizeof(triple)) != sizeof(triple))
>                 reporterr(1,NULL);
>
> from libreoffice.
>
>                         Linus
>
> (*) Yeah, if you never care about performance(**) of mixed read/write,
> and you don't care about mmap, and you have no other locking issues,
> it's certainly possible. The old rule came about from original UNIX
> literally taking an inode lock around the whole IO access, because
> that was simple, and back in the days you'd never have multiple
> concurrent readers/writers anyway.
>
> (**) It's also instructive how O_DIRECT literally throws that rule
> away, and then some direct-IO people said for years that direct-IO is
> superior and used this as one of their arguments. Probably the same
> people who thought that "oh, don't report partial success", because we
> can't deal with it.
>

Thanks a lot,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-27 19:41             ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-04-27 20:25               ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-27 20:25 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, Linux Kernel Mailing List

On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> I wonder if this could be documented in the read and write manual
> pages. Or would that be asking too much?

I don't think it would be asking too much, since it's basically just
describing what Linux has always done in all the major filesystems.

Eg look at filemap_read(), which is basically the canonical read
function, and note how it doesn't take a single lock at that level.

We *do* have synchronization at a page level, though, ie we've always
had that page-level "uptodate" bit, of course (ok, so "always" isn't
true - back in the distant past it was the 'struct buffer_head' that
was the synchronization point).

That said, even that is not synchronizing against "new writes", but
only against "new creations" (which may, of course, be writers, but is
equally likely to be just reading the contents from disk).

That said:

 (a) different filesystems can and will do different things.

Not all filesystems use filemap_read() at all, and even the ones that
do often have their own wrappers. Such wrappers *can* do extra
serialization, and have their own rules. But ext4 does not, for
example (see ext4_file_read_iter()).

And as mentioned, I *think* XFS honors that old POSIX rule for
historical reasons.

 (b) we do have *different* locking

for example, we these days do actually serialize properly on the
file->f_pos, which means that a certain *class* of read/write things
are atomic wrt each other, because we actually hold that f_pos lock
over the whole operation and so if you do file reads and writes using
the same file descriptor, they'll be disjoint.

That, btw, hasn't always been true. If you had multiple threads using
the same file pointer, I think we used to get basically random
results. So we have actually strengthened our locking in this area,
and made it much better.

But note how even if you have the same file descriptor open, and then
do pread/pwrite, those can and will happen concurrently.

And mmap accesses and modifications are obviously *always* concurrent,
even if the fault itself - but not the accesses - might end up being
serialized due to some filesystem locking implementation detail.

End result: the exact serialization is complex, depends on the
filesystem, and is just not really something that should be described
or even relied on (eg that f_pos serialization is something we do
properly now, but didn't necessarily do in the past, so ..)

Is it then worth pointing out one odd POSIX rule that basically nobody
but some very low-level filesystem people have ever heard about, and
that no version of Linux has ever conformed to in the main default
filesystems, and that no user has ever cared about?

             Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-27 20:25               ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-27 20:25 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> I wonder if this could be documented in the read and write manual
> pages. Or would that be asking too much?

I don't think it would be asking too much, since it's basically just
describing what Linux has always done in all the major filesystems.

Eg look at filemap_read(), which is basically the canonical read
function, and note how it doesn't take a single lock at that level.

We *do* have synchronization at a page level, though, ie we've always
had that page-level "uptodate" bit, of course (ok, so "always" isn't
true - back in the distant past it was the 'struct buffer_head' that
was the synchronization point).

That said, even that is not synchronizing against "new writes", but
only against "new creations" (which may, of course, be writers, but is
equally likely to be just reading the contents from disk).

That said:

 (a) different filesystems can and will do different things.

Not all filesystems use filemap_read() at all, and even the ones that
do often have their own wrappers. Such wrappers *can* do extra
serialization, and have their own rules. But ext4 does not, for
example (see ext4_file_read_iter()).

And as mentioned, I *think* XFS honors that old POSIX rule for
historical reasons.

 (b) we do have *different* locking

for example, we these days do actually serialize properly on the
file->f_pos, which means that a certain *class* of read/write things
are atomic wrt each other, because we actually hold that f_pos lock
over the whole operation and so if you do file reads and writes using
the same file descriptor, they'll be disjoint.

That, btw, hasn't always been true. If you had multiple threads using
the same file pointer, I think we used to get basically random
results. So we have actually strengthened our locking in this area,
and made it much better.

But note how even if you have the same file descriptor open, and then
do pread/pwrite, those can and will happen concurrently.

And mmap accesses and modifications are obviously *always* concurrent,
even if the fault itself - but not the accesses - might end up being
serialized due to some filesystem locking implementation detail.

End result: the exact serialization is complex, depends on the
filesystem, and is just not really something that should be described
or even relied on (eg that f_pos serialization is something we do
properly now, but didn't necessarily do in the past, so ..)

Is it then worth pointing out one odd POSIX rule that basically nobody
but some very low-level filesystem people have ever heard about, and
that no version of Linux has ever conformed to in the main default
filesystems, and that no user has ever cared about?

             Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-27 20:25               ` [Cluster-devel] " Linus Torvalds
@ 2022-04-27 21:26                 ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-27 21:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: cluster-devel, Linux Kernel Mailing List

On Wed, Apr 27, 2022 at 10:26 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > I wonder if this could be documented in the read and write manual
> > pages. Or would that be asking too much?
>
> I don't think it would be asking too much, since it's basically just
> describing what Linux has always done in all the major filesystems.
>
> Eg look at filemap_read(), which is basically the canonical read
> function, and note how it doesn't take a single lock at that level.
>
> We *do* have synchronization at a page level, though, ie we've always
> had that page-level "uptodate" bit, of course (ok, so "always" isn't
> true - back in the distant past it was the 'struct buffer_head' that
> was the synchronization point).
>
> That said, even that is not synchronizing against "new writes", but
> only against "new creations" (which may, of course, be writers, but is
> equally likely to be just reading the contents from disk).
>
> That said:
>
>  (a) different filesystems can and will do different things.
>
> Not all filesystems use filemap_read() at all, and even the ones that
> do often have their own wrappers. Such wrappers *can* do extra
> serialization, and have their own rules. But ext4 does not, for
> example (see ext4_file_read_iter()).
>
> And as mentioned, I *think* XFS honors that old POSIX rule for
> historical reasons.
>
>  (b) we do have *different* locking
>
> for example, we these days do actually serialize properly on the
> file->f_pos, which means that a certain *class* of read/write things
> are atomic wrt each other, because we actually hold that f_pos lock
> over the whole operation and so if you do file reads and writes using
> the same file descriptor, they'll be disjoint.
>
> That, btw, hasn't always been true. If you had multiple threads using
> the same file pointer, I think we used to get basically random
> results. So we have actually strengthened our locking in this area,
> and made it much better.
>
> But note how even if you have the same file descriptor open, and then
> do pread/pwrite, those can and will happen concurrently.
>
> And mmap accesses and modifications are obviously *always* concurrent,
> even if the fault itself - but not the accesses - might end up being
> serialized due to some filesystem locking implementation detail.
>
> End result: the exact serialization is complex, depends on the
> filesystem, and is just not really something that should be described
> or even relied on (eg that f_pos serialization is something we do
> properly now, but didn't necessarily do in the past, so ..)
>
> Is it then worth pointing out one odd POSIX rule that basically nobody
> but some very low-level filesystem people have ever heard about, and
> that no version of Linux has ever conformed to in the main default
> filesystems, and that no user has ever cared about?

Well, POSIX explicitly mentions those atomicity expectations, e.g.,
for read [1]:

  "I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
  means that all the bytes from a single operation that started out together
  end up together, without interleaving from other I/O operations."

Users who hear about it from POSIX are led to assume that this atomic
behavior is "real", and the Linux man pages do nothing to rob them of
that illusion. They do document that the file offset aspect has been
fixed though, which only makes things more confusing.

So from that point of view, I think it would be worthwhile to mention
that most if not all filesystems ignore the "non-interleaving" aspect.

[1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/read.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/write.html
[3] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/V2_chap02.html#tag_15_09_07

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-27 21:26                 ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-27 21:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Apr 27, 2022 at 10:26 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > I wonder if this could be documented in the read and write manual
> > pages. Or would that be asking too much?
>
> I don't think it would be asking too much, since it's basically just
> describing what Linux has always done in all the major filesystems.
>
> Eg look at filemap_read(), which is basically the canonical read
> function, and note how it doesn't take a single lock at that level.
>
> We *do* have synchronization at a page level, though, ie we've always
> had that page-level "uptodate" bit, of course (ok, so "always" isn't
> true - back in the distant past it was the 'struct buffer_head' that
> was the synchronization point).
>
> That said, even that is not synchronizing against "new writes", but
> only against "new creations" (which may, of course, be writers, but is
> equally likely to be just reading the contents from disk).
>
> That said:
>
>  (a) different filesystems can and will do different things.
>
> Not all filesystems use filemap_read() at all, and even the ones that
> do often have their own wrappers. Such wrappers *can* do extra
> serialization, and have their own rules. But ext4 does not, for
> example (see ext4_file_read_iter()).
>
> And as mentioned, I *think* XFS honors that old POSIX rule for
> historical reasons.
>
>  (b) we do have *different* locking
>
> for example, we these days do actually serialize properly on the
> file->f_pos, which means that a certain *class* of read/write things
> are atomic wrt each other, because we actually hold that f_pos lock
> over the whole operation and so if you do file reads and writes using
> the same file descriptor, they'll be disjoint.
>
> That, btw, hasn't always been true. If you had multiple threads using
> the same file pointer, I think we used to get basically random
> results. So we have actually strengthened our locking in this area,
> and made it much better.
>
> But note how even if you have the same file descriptor open, and then
> do pread/pwrite, those can and will happen concurrently.
>
> And mmap accesses and modifications are obviously *always* concurrent,
> even if the fault itself - but not the accesses - might end up being
> serialized due to some filesystem locking implementation detail.
>
> End result: the exact serialization is complex, depends on the
> filesystem, and is just not really something that should be described
> or even relied on (eg that f_pos serialization is something we do
> properly now, but didn't necessarily do in the past, so ..)
>
> Is it then worth pointing out one odd POSIX rule that basically nobody
> but some very low-level filesystem people have ever heard about, and
> that no version of Linux has ever conformed to in the main default
> filesystems, and that no user has ever cared about?

Well, POSIX explicitly mentions those atomicity expectations, e.g.,
for read [1]:

  "I/O is intended to be atomic to ordinary files and pipes and FIFOs. Atomic
  means that all the bytes from a single operation that started out together
  end up together, without interleaving from other I/O operations."

Users who hear about it from POSIX are led to assume that this atomic
behavior is "real", and the Linux man pages do nothing to rob them of
that illusion. They do document that the file offset aspect has been
fixed though, which only makes things more confusing.

So from that point of view, I think it would be worthwhile to mention
that most if not all filesystems ignore the "non-interleaving" aspect.

[1] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/read.html
[2] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/write.html
[3] https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/V2_chap02.html#tag_15_09_07

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-27 21:26                 ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-04-27 22:20                   ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-27 22:20 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, Linux Kernel Mailing List

On Wed, Apr 27, 2022 at 2:26 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Well, POSIX explicitly mentions those atomicity expectations, e.g.,
> for read [1]:

Yes. I'm aware. And my point is that we've never done _that_ kind of atomicity.

It's also somewhat ambiguous what it actually means, since what it
then talks about is "all bytes that started out together ends
together" and "interleaving".

That all implies that it's about the *position* of the reads and
writes being atomic, not the *data* of the reads and writes.

That, btw, was something we honored even before we had the locking
around f_pos accesses - a read or write system call would get its own
local *copy* the file position, the read or write would then do that
IO based on that copied position - so that things that "started out
together ends together" - and then after the operation is done it
would *update* the file position atomically.

Note that that is exactly so that data would end up "together". But it
would mean that two concurrent reads using the same file position
might read the *same* area of the file.

Which still honors that "the read is atomic wrt the range", but
obviously the actual values of "f_pos" is basically random after the
read (ie is it the end of the first read, or the end of the second
read?).

The same paragraph also explicitly mentions pipes and FIFOs, despite
an earlier paragraph dismissing them, which is all just a sign of
things being very confused.

Anyway, I'm not objecting very sternously to making it very clear in
some documentation that this "data atomicity" is not what Linux has
ever done. If you do overlapping IO, you get what you deserve.

But I do have objections.

On one hand, it's not all that different from some of the other notes
we have in the man-pages (ie documenting that whole "just under 2GB"
limit on the read size, although that's actually using the wrong
constant: it's not 0x7ffff000 bytes, it's MAX_RW_COUNT, which is
"INT_MAX & PAGE_MASK" and that constant in the man-page is as such
only true on a system with 4kB page sizes)

BUT! I'm 100% convinced that NOBODY HAS EVER given the kind of
atomicity guarantees that you would see from reading that document as
a language-lawyer.

For example, that section "2.9.7 Thread Interactions with Regular File
Operations" says that "fstat()" is atomic wrt "write()", and that you
should see "all or nothing".

I *GUARANTEE* that no operating system ever has done that, and I
further claim that reading it the way you read it is not only against
reality, it's against sanity.

Example: if I do a big write to a file that I just created, do you
really want "fstat()" in another thread or process to not even be able
to see how the file grows as the write happens?

It's not what anybody has *EVER* done, I'm pretty sure.

So I really think

 (a) you are mis-reading the standard by attributing too strong logic
to paperwork that is English prose and not so exact

 (b) documenting Linux as not doing what you are mis-reading it for is
only encouraging others to mis-read it too

The whole "arbitrary writes have to be all-or-nothing wrt all other
system calls" is simply not realistic, and has never been. Not just
not in Linux, but in *ANY* operating system that POSIX was meant to
describe.

And equally importantly: if some crazy person were to actually try to
implement such "true atomicity" things, the end result would be
objectively worse. Because you literally *want* to see a big write()
updating the file length as the write happens.

The fact that the standard then doesn't take those kinds of details
into account is simply because the standard isn't meant to be read as
a language lawyer, but as a "realistically .."

             Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-27 22:20                   ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-27 22:20 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Apr 27, 2022 at 2:26 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Well, POSIX explicitly mentions those atomicity expectations, e.g.,
> for read [1]:

Yes. I'm aware. And my point is that we've never done _that_ kind of atomicity.

It's also somewhat ambiguous what it actually means, since what it
then talks about is "all bytes that started out together ends
together" and "interleaving".

That all implies that it's about the *position* of the reads and
writes being atomic, not the *data* of the reads and writes.

That, btw, was something we honored even before we had the locking
around f_pos accesses - a read or write system call would get its own
local *copy* the file position, the read or write would then do that
IO based on that copied position - so that things that "started out
together ends together" - and then after the operation is done it
would *update* the file position atomically.

Note that that is exactly so that data would end up "together". But it
would mean that two concurrent reads using the same file position
might read the *same* area of the file.

Which still honors that "the read is atomic wrt the range", but
obviously the actual values of "f_pos" is basically random after the
read (ie is it the end of the first read, or the end of the second
read?).

The same paragraph also explicitly mentions pipes and FIFOs, despite
an earlier paragraph dismissing them, which is all just a sign of
things being very confused.

Anyway, I'm not objecting very sternously to making it very clear in
some documentation that this "data atomicity" is not what Linux has
ever done. If you do overlapping IO, you get what you deserve.

But I do have objections.

On one hand, it's not all that different from some of the other notes
we have in the man-pages (ie documenting that whole "just under 2GB"
limit on the read size, although that's actually using the wrong
constant: it's not 0x7ffff000 bytes, it's MAX_RW_COUNT, which is
"INT_MAX & PAGE_MASK" and that constant in the man-page is as such
only true on a system with 4kB page sizes)

BUT! I'm 100% convinced that NOBODY HAS EVER given the kind of
atomicity guarantees that you would see from reading that document as
a language-lawyer.

For example, that section "2.9.7 Thread Interactions with Regular File
Operations" says that "fstat()" is atomic wrt "write()", and that you
should see "all or nothing".

I *GUARANTEE* that no operating system ever has done that, and I
further claim that reading it the way you read it is not only against
reality, it's against sanity.

Example: if I do a big write to a file that I just created, do you
really want "fstat()" in another thread or process to not even be able
to see how the file grows as the write happens?

It's not what anybody has *EVER* done, I'm pretty sure.

So I really think

 (a) you are mis-reading the standard by attributing too strong logic
to paperwork that is English prose and not so exact

 (b) documenting Linux as not doing what you are mis-reading it for is
only encouraging others to mis-read it too

The whole "arbitrary writes have to be all-or-nothing wrt all other
system calls" is simply not realistic, and has never been. Not just
not in Linux, but in *ANY* operating system that POSIX was meant to
describe.

And equally importantly: if some crazy person were to actually try to
implement such "true atomicity" things, the end result would be
objectively worse. Because you literally *want* to see a big write()
updating the file length as the write happens.

The fact that the standard then doesn't take those kinds of details
into account is simply because the standard isn't meant to be read as
a language lawyer, but as a "realistically .."

             Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-27 22:20                   ` [Cluster-devel] " Linus Torvalds
@ 2022-04-28  0:00                     ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-28  0:00 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, Linux Kernel Mailing List

On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I really think
>
>  (a) you are mis-reading the standard by attributing too strong logic
> to paperwork that is English prose and not so exact
>
>  (b) documenting Linux as not doing what you are mis-reading it for is
> only encouraging others to mis-read it too
>
> The whole "arbitrary writes have to be all-or-nothing wrt all other
> system calls" is simply not realistic, and has never been. Not just
> not in Linux, but in *ANY* operating system that POSIX was meant to
> describe.

Side note: a lot of those "atomic" things in that documentation have
come from a history of signal handling atomicity issues, and from all
the issues people had with (a) user-space threading implementations
and (b) emulation layers from non-Unixy environments.

So when they say that things like "rename()" has to be all-or-nothing,
it's to clarify that you can't emulate it as a "link and delete
original" kind of operation (which old UNIX *did* do) and claim to be
POSIX.

Because while the end result of rename() and link()+unlink()might be
similar, people did rely on that whole "use rename as a way to create
an atomic marker in the filesystem" (which is a very traditional UNIX
pattern).

So "rename()" has to be atomic, and the legacy behavior of link+unlink
is not valid in POSIX.

Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
because that doesn't work if somebody else is doing another "pread()"
on the same file descriptor concurrently.

Again, people *did* implement exactly those kinds of implementations
of "pread()", and yes, they were broken for both signals and for
threading.

So there's "atomicity" and then there is "atomicity".

That "all or nothing" can be a very practical thing to describe
*roughly* how it must work on a higher level, or it can be a
theoretical "transactional" thing that works literally like a database
where the operation happens in full and you must not see any
intermediate state.

And no, "write()" and friends have never ever been about some
transactional operation where you can't see how the file grows as it
is being written to. That kind of atomicity has simply never existed,
not even in theory.

So when you see POSIX saying that a "read()" system call is "atomic",
you should *not* see it as a transaction thing, but see it in the
historical context of "people used to do threading libraries in user
space, and since they didn't want a big read() to block all other
threads, they'd split it up into many smaller reads and now another
thread *also* doing 'read()' system calls would see the data it read
being not one contiguous region, but multiple regions where the file
position changed in the middle".

Similarly, a "read()" system call will not be interrupted by a signal
in the middle, where the signal handler would do a "lseek()" or
another "read()", and now the original "read()" data suddenly is
affected.

That's why things like that whole "f_pos is atomic" is a big deal.

Because there literally were threading libraries (and badly emulated
environments) where that *WASN'T* the case, and _that_ is why POSIX
then talks about it.

So think of POSIX not as some hard set of "this is exactly how things
work and we describe every detail".

Instead, treat it a bit like historians treat Herodotus - interpreting
his histories by taking the issues of the time into account.  POSIX is
trying to clarify and document the problems of the time it was
written, and taking other things for granted.

                 Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-28  0:00                     ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-28  0:00 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So I really think
>
>  (a) you are mis-reading the standard by attributing too strong logic
> to paperwork that is English prose and not so exact
>
>  (b) documenting Linux as not doing what you are mis-reading it for is
> only encouraging others to mis-read it too
>
> The whole "arbitrary writes have to be all-or-nothing wrt all other
> system calls" is simply not realistic, and has never been. Not just
> not in Linux, but in *ANY* operating system that POSIX was meant to
> describe.

Side note: a lot of those "atomic" things in that documentation have
come from a history of signal handling atomicity issues, and from all
the issues people had with (a) user-space threading implementations
and (b) emulation layers from non-Unixy environments.

So when they say that things like "rename()" has to be all-or-nothing,
it's to clarify that you can't emulate it as a "link and delete
original" kind of operation (which old UNIX *did* do) and claim to be
POSIX.

Because while the end result of rename() and link()+unlink()might be
similar, people did rely on that whole "use rename as a way to create
an atomic marker in the filesystem" (which is a very traditional UNIX
pattern).

So "rename()" has to be atomic, and the legacy behavior of link+unlink
is not valid in POSIX.

Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
because that doesn't work if somebody else is doing another "pread()"
on the same file descriptor concurrently.

Again, people *did* implement exactly those kinds of implementations
of "pread()", and yes, they were broken for both signals and for
threading.

So there's "atomicity" and then there is "atomicity".

That "all or nothing" can be a very practical thing to describe
*roughly* how it must work on a higher level, or it can be a
theoretical "transactional" thing that works literally like a database
where the operation happens in full and you must not see any
intermediate state.

And no, "write()" and friends have never ever been about some
transactional operation where you can't see how the file grows as it
is being written to. That kind of atomicity has simply never existed,
not even in theory.

So when you see POSIX saying that a "read()" system call is "atomic",
you should *not* see it as a transaction thing, but see it in the
historical context of "people used to do threading libraries in user
space, and since they didn't want a big read() to block all other
threads, they'd split it up into many smaller reads and now another
thread *also* doing 'read()' system calls would see the data it read
being not one contiguous region, but multiple regions where the file
position changed in the middle".

Similarly, a "read()" system call will not be interrupted by a signal
in the middle, where the signal handler would do a "lseek()" or
another "read()", and now the original "read()" data suddenly is
affected.

That's why things like that whole "f_pos is atomic" is a big deal.

Because there literally were threading libraries (and badly emulated
environments) where that *WASN'T* the case, and _that_ is why POSIX
then talks about it.

So think of POSIX not as some hard set of "this is exactly how things
work and we describe every detail".

Instead, treat it a bit like historians treat Herodotus - interpreting
his histories by taking the issues of the time into account.  POSIX is
trying to clarify and document the problems of the time it was
written, and taking other things for granted.

                 Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-28  0:00                     ` [Cluster-devel] " Linus Torvalds
@ 2022-04-28 13:26                       ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-28 13:26 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: cluster-devel, Linux Kernel Mailing List

On Thu, Apr 28, 2022 at 2:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > So I really think
> >
> >  (a) you are mis-reading the standard by attributing too strong logic
> > to paperwork that is English prose and not so exact
> >
> >  (b) documenting Linux as not doing what you are mis-reading it for is
> > only encouraging others to mis-read it too
> >
> > The whole "arbitrary writes have to be all-or-nothing wrt all other
> > system calls" is simply not realistic, and has never been. Not just
> > not in Linux, but in *ANY* operating system that POSIX was meant to
> > describe.
>
> Side note: a lot of those "atomic" things in that documentation have
> come from a history of signal handling atomicity issues, and from all
> the issues people had with (a) user-space threading implementations
> and (b) emulation layers from non-Unixy environments.
>
> So when they say that things like "rename()" has to be all-or-nothing,
> it's to clarify that you can't emulate it as a "link and delete
> original" kind of operation (which old UNIX *did* do) and claim to be
> POSIX.
>
> Because while the end result of rename() and link()+unlink()might be
> similar, people did rely on that whole "use rename as a way to create
> an atomic marker in the filesystem" (which is a very traditional UNIX
> pattern).
>
> So "rename()" has to be atomic, and the legacy behavior of link+unlink
> is not valid in POSIX.
>
> Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
> because that doesn't work if somebody else is doing another "pread()"
> on the same file descriptor concurrently.
>
> Again, people *did* implement exactly those kinds of implementations
> of "pread()", and yes, they were broken for both signals and for
> threading.
>
> So there's "atomicity" and then there is "atomicity".
>
> That "all or nothing" can be a very practical thing to describe
> *roughly* how it must work on a higher level, or it can be a
> theoretical "transactional" thing that works literally like a database
> where the operation happens in full and you must not see any
> intermediate state.
>
> And no, "write()" and friends have never ever been about some
> transactional operation where you can't see how the file grows as it
> is being written to. That kind of atomicity has simply never existed,
> not even in theory.
>
> So when you see POSIX saying that a "read()" system call is "atomic",
> you should *not* see it as a transaction thing, but see it in the
> historical context of "people used to do threading libraries in user
> space, and since they didn't want a big read() to block all other
> threads, they'd split it up into many smaller reads and now another
> thread *also* doing 'read()' system calls would see the data it read
> being not one contiguous region, but multiple regions where the file
> position changed in the middle".
>
> Similarly, a "read()" system call will not be interrupted by a signal
> in the middle, where the signal handler would do a "lseek()" or
> another "read()", and now the original "read()" data suddenly is
> affected.
>
> That's why things like that whole "f_pos is atomic" is a big deal.
>
> Because there literally were threading libraries (and badly emulated
> environments) where that *WASN'T* the case, and _that_ is why POSIX
> then talks about it.
>
> So think of POSIX not as some hard set of "this is exactly how things
> work and we describe every detail".
>
> Instead, treat it a bit like historians treat Herodotus - interpreting
> his histories by taking the issues of the time into account.  POSIX is
> trying to clarify and document the problems of the time it was
> written, and taking other things for granted.

Okay fine, thanks for elaborating.

Would you mind pulling the following fix to straighten this out?

The data corruption we've been getting unfortunately didn't have to do
with lock contention (we already knew that); it still occurs. I'm
running out of ideas on what to try there.

Thanks a lot,
Andreas

--

The following changes since commit 4fad37d595b9d9a2996467d780cb2e7a1b08b2c0:

  Merge tag 'gfs2-v5.18-rc4-fix' of
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
(2022-04-26 11:17:18 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git
tags/gfs2-v5.18-rc4-fix2

for you to fetch changes up to 296abc0d91d8b65d42224dd33452ace14491ad08:

  gfs2: No short reads or writes upon glock contention (2022-04-28
15:14:48 +0200)

----------------------------------------------------------------
gfs2 fix

- No short reads or writes upon glock contention

----------------------------------------------------------------
Andreas Gruenbacher (1):
      gfs2: No short reads or writes upon glock contention

 fs/gfs2/file.c | 4 ----
 1 file changed, 4 deletions(-)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-28 13:26                       ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-28 13:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Apr 28, 2022 at 2:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Apr 27, 2022 at 3:20 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > So I really think
> >
> >  (a) you are mis-reading the standard by attributing too strong logic
> > to paperwork that is English prose and not so exact
> >
> >  (b) documenting Linux as not doing what you are mis-reading it for is
> > only encouraging others to mis-read it too
> >
> > The whole "arbitrary writes have to be all-or-nothing wrt all other
> > system calls" is simply not realistic, and has never been. Not just
> > not in Linux, but in *ANY* operating system that POSIX was meant to
> > describe.
>
> Side note: a lot of those "atomic" things in that documentation have
> come from a history of signal handling atomicity issues, and from all
> the issues people had with (a) user-space threading implementations
> and (b) emulation layers from non-Unixy environments.
>
> So when they say that things like "rename()" has to be all-or-nothing,
> it's to clarify that you can't emulate it as a "link and delete
> original" kind of operation (which old UNIX *did* do) and claim to be
> POSIX.
>
> Because while the end result of rename() and link()+unlink()might be
> similar, people did rely on that whole "use rename as a way to create
> an atomic marker in the filesystem" (which is a very traditional UNIX
> pattern).
>
> So "rename()" has to be atomic, and the legacy behavior of link+unlink
> is not valid in POSIX.
>
> Similarly, you can't implement "pread()" as a "lseek+read+lseek back",
> because that doesn't work if somebody else is doing another "pread()"
> on the same file descriptor concurrently.
>
> Again, people *did* implement exactly those kinds of implementations
> of "pread()", and yes, they were broken for both signals and for
> threading.
>
> So there's "atomicity" and then there is "atomicity".
>
> That "all or nothing" can be a very practical thing to describe
> *roughly* how it must work on a higher level, or it can be a
> theoretical "transactional" thing that works literally like a database
> where the operation happens in full and you must not see any
> intermediate state.
>
> And no, "write()" and friends have never ever been about some
> transactional operation where you can't see how the file grows as it
> is being written to. That kind of atomicity has simply never existed,
> not even in theory.
>
> So when you see POSIX saying that a "read()" system call is "atomic",
> you should *not* see it as a transaction thing, but see it in the
> historical context of "people used to do threading libraries in user
> space, and since they didn't want a big read() to block all other
> threads, they'd split it up into many smaller reads and now another
> thread *also* doing 'read()' system calls would see the data it read
> being not one contiguous region, but multiple regions where the file
> position changed in the middle".
>
> Similarly, a "read()" system call will not be interrupted by a signal
> in the middle, where the signal handler would do a "lseek()" or
> another "read()", and now the original "read()" data suddenly is
> affected.
>
> That's why things like that whole "f_pos is atomic" is a big deal.
>
> Because there literally were threading libraries (and badly emulated
> environments) where that *WASN'T* the case, and _that_ is why POSIX
> then talks about it.
>
> So think of POSIX not as some hard set of "this is exactly how things
> work and we describe every detail".
>
> Instead, treat it a bit like historians treat Herodotus - interpreting
> his histories by taking the issues of the time into account.  POSIX is
> trying to clarify and document the problems of the time it was
> written, and taking other things for granted.

Okay fine, thanks for elaborating.

Would you mind pulling the following fix to straighten this out?

The data corruption we've been getting unfortunately didn't have to do
with lock contention (we already knew that); it still occurs. I'm
running out of ideas on what to try there.

Thanks a lot,
Andreas

--

The following changes since commit 4fad37d595b9d9a2996467d780cb2e7a1b08b2c0:

  Merge tag 'gfs2-v5.18-rc4-fix' of
git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2
(2022-04-26 11:17:18 -0700)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git
tags/gfs2-v5.18-rc4-fix2

for you to fetch changes up to 296abc0d91d8b65d42224dd33452ace14491ad08:

  gfs2: No short reads or writes upon glock contention (2022-04-28
15:14:48 +0200)

----------------------------------------------------------------
gfs2 fix

- No short reads or writes upon glock contention

----------------------------------------------------------------
Andreas Gruenbacher (1):
      gfs2: No short reads or writes upon glock contention

 fs/gfs2/file.c | 4 ----
 1 file changed, 4 deletions(-)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-28 13:26                       ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-04-28 17:09                         ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-28 17:09 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, Linux Kernel Mailing List

On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> The data corruption we've been getting unfortunately didn't have to do
> with lock contention (we already knew that); it still occurs. I'm
> running out of ideas on what to try there.

Hmm.

I don't see the bug, but I do have a suggestion on something to try.

In particular, you said the problem started with commit 00bfe02f4796
("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

And to me, I see two main things that are going on

 (a) the obvious "calling generic IO functions with pagefault disabled" thing

 (b) the "allow demotion" thing

And I wonder if you could at least pinpoint which of the  cases it is
that triggers it.

So I'd love to see you try three things:

 (1) just remove the "allow demotion" cases.

     This will re-introduce the deadlock the commit is trying to fix,
but that's such a special case that I assume you can run your
test-suite that shows the problem even without that fix in place?

     This would just pinpoint whether it's due to some odd locking issue or not.

Honestly, from how you describe the symptoms, I don't think (1) is the
cause, but I think making sure is good.

It sounds much more likely that it's one of those generic vfs
functions that screws up when a page fault happens and it gets a
partial result instead of handling the fault.

Which gets us to

 (2) remove the pagefault_disable/enable() around just the
generic_file_read_iter() case in gfs2_file_read_iter().

and

 (3) finally, remove the pagefault_disable/enable() around the
iomap_file_buffered_write() case in gfs2_file_buffered_write()

Yeah, yeah, you say it's just the read that fails, but humor me on
(3), just in case it's an earlier write in your test-suite and the
read just then uncovered it.

But I put it as (3) so that you'd do the obvious (2) case first, and
narrow it down (ie if (1) still shows the bug, then do (2), and if
that fixes the bug it will be fairly well pinpointed to
generic_file_read_iter().

Looking around, gfs2 is the only thing that obviously calls
generic_file_read_iter() with pagefoaults disabled, so it does smell
like filemap_read() might have some issue, but the only thing that
does is basically that

                copied = copy_folio_to_iter(folio, offset, bytes, iter);

which should just become copy_page_to_iter_iovec(), which you'd hope
would get things right.

But it would be good to just narrow things down a bit.

I'll look at that copy_page_to_iter_iovec() some more regardless, but
doing that "let's double-check it's not somethign else" would be good.

             Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-28 17:09                         ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-28 17:09 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> The data corruption we've been getting unfortunately didn't have to do
> with lock contention (we already knew that); it still occurs. I'm
> running out of ideas on what to try there.

Hmm.

I don't see the bug, but I do have a suggestion on something to try.

In particular, you said the problem started with commit 00bfe02f4796
("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

And to me, I see two main things that are going on

 (a) the obvious "calling generic IO functions with pagefault disabled" thing

 (b) the "allow demotion" thing

And I wonder if you could at least pinpoint which of the  cases it is
that triggers it.

So I'd love to see you try three things:

 (1) just remove the "allow demotion" cases.

     This will re-introduce the deadlock the commit is trying to fix,
but that's such a special case that I assume you can run your
test-suite that shows the problem even without that fix in place?

     This would just pinpoint whether it's due to some odd locking issue or not.

Honestly, from how you describe the symptoms, I don't think (1) is the
cause, but I think making sure is good.

It sounds much more likely that it's one of those generic vfs
functions that screws up when a page fault happens and it gets a
partial result instead of handling the fault.

Which gets us to

 (2) remove the pagefault_disable/enable() around just the
generic_file_read_iter() case in gfs2_file_read_iter().

and

 (3) finally, remove the pagefault_disable/enable() around the
iomap_file_buffered_write() case in gfs2_file_buffered_write()

Yeah, yeah, you say it's just the read that fails, but humor me on
(3), just in case it's an earlier write in your test-suite and the
read just then uncovered it.

But I put it as (3) so that you'd do the obvious (2) case first, and
narrow it down (ie if (1) still shows the bug, then do (2), and if
that fixes the bug it will be fairly well pinpointed to
generic_file_read_iter().

Looking around, gfs2 is the only thing that obviously calls
generic_file_read_iter() with pagefoaults disabled, so it does smell
like filemap_read() might have some issue, but the only thing that
does is basically that

                copied = copy_folio_to_iter(folio, offset, bytes, iter);

which should just become copy_page_to_iter_iovec(), which you'd hope
would get things right.

But it would be good to just narrow things down a bit.

I'll look at that copy_page_to_iter_iovec() some more regardless, but
doing that "let's double-check it's not somethign else" would be good.

             Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-28 17:09                         ` [Cluster-devel] " Linus Torvalds
@ 2022-04-28 17:17                           ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-28 17:17 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, Linux Kernel Mailing List

On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not somethign else" would be good.

Oh, and as I do that, it strikes me: What platform do you see the problem on?

Because the code for that function is very different if HIGHMEM is
enabled, so if you see this on x86-32 but not x86-64, for example,
that is relevant.

I *assume* nobody uses x86-32 any more, but just checking...

               Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-28 17:17                           ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-04-28 17:17 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not somethign else" would be good.

Oh, and as I do that, it strikes me: What platform do you see the problem on?

Because the code for that function is very different if HIGHMEM is
enabled, so if you see this on x86-32 but not x86-64, for example,
that is relevant.

I *assume* nobody uses x86-32 any more, but just checking...

               Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-28 17:17                           ` [Cluster-devel] " Linus Torvalds
@ 2022-04-28 17:21                             ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-28 17:21 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: cluster-devel, Linux Kernel Mailing List

On Thu, Apr 28, 2022 at 7:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > I'll look at that copy_page_to_iter_iovec() some more regardless, but
> > doing that "let's double-check it's not somethign else" would be good.
>
> Oh, and as I do that, it strikes me: What platform do you see the problem on?
>
> Because the code for that function is very different if HIGHMEM is
> enabled, so if you see this on x86-32 but not x86-64, for example,
> that is relevant.
>
> I *assume* nobody uses x86-32 any more, but just checking...

This is on x86_64.

Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-28 17:21                             ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-28 17:21 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Apr 28, 2022 at 7:17 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2022 at 10:09 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > I'll look at that copy_page_to_iter_iovec() some more regardless, but
> > doing that "let's double-check it's not somethign else" would be good.
>
> Oh, and as I do that, it strikes me: What platform do you see the problem on?
>
> Because the code for that function is very different if HIGHMEM is
> enabled, so if you see this on x86-32 but not x86-64, for example,
> that is relevant.
>
> I *assume* nobody uses x86-32 any more, but just checking...

This is on x86_64.

Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-28 17:09                         ` [Cluster-devel] " Linus Torvalds
@ 2022-04-28 17:38                           ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-28 17:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: cluster-devel, Linux Kernel Mailing List

On Thu, Apr 28, 2022 at 7:09 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > The data corruption we've been getting unfortunately didn't have to do
> > with lock contention (we already knew that); it still occurs. I'm
> > running out of ideas on what to try there.
>
> Hmm.
>
> I don't see the bug, but I do have a suggestion on something to try.
>
> In particular, you said the problem started with commit 00bfe02f4796
> ("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

Yes, but note that it's gfs2_file_buffered_write() that fails. When
the pagefault_disable/enable() around iomap_file_buffered_write() is
removed, the corruption goes away.

> And to me, I see two main things that are going on
>
>  (a) the obvious "calling generic IO functions with pagefault disabled" thing
>
>  (b) the "allow demotion" thing
>
> And I wonder if you could at least pinpoint which of the  cases it is
> that triggers it.
>
> So I'd love to see you try three things:
>
>  (1) just remove the "allow demotion" cases.
>
>      This will re-introduce the deadlock the commit is trying to fix,
> but that's such a special case that I assume you can run your
> test-suite that shows the problem even without that fix in place?
>
>      This would just pinpoint whether it's due to some odd locking issue or not.
>
> Honestly, from how you describe the symptoms, I don't think (1) is the
> cause, but I think making sure is good.
>
> It sounds much more likely that it's one of those generic vfs
> functions that screws up when a page fault happens and it gets a
> partial result instead of handling the fault.

The test should run just fine without allowing demotion. I'll try (1),
but I don't expect the outcome to change.

> Which gets us to
>
>  (2) remove the pagefault_disable/enable() around just the
> generic_file_read_iter() case in gfs2_file_read_iter().
>
> and
>
>  (3) finally, remove the pagefault_disable/enable() around the
> iomap_file_buffered_write() case in gfs2_file_buffered_write()
>
> Yeah, yeah, you say it's just the read that fails, but humor me on
> (3), just in case it's an earlier write in your test-suite and the
> read just then uncovered it.
>
> But I put it as (3) so that you'd do the obvious (2) case first, and
> narrow it down (ie if (1) still shows the bug, then do (2), and if
> that fixes the bug it will be fairly well pinpointed to
> generic_file_read_iter().

As mentioned above, we already did (3) and it didn't help. I'll do (1)
now, and then (2).

> Looking around, gfs2 is the only thing that obviously calls
> generic_file_read_iter() with pagefaults disabled, so it does smell
> like filemap_read() might have some issue, but the only thing that
> does is basically that
>
>                 copied = copy_folio_to_iter(folio, offset, bytes, iter);
>
> which should just become copy_page_to_iter_iovec(), which you'd hope
> would get things right.
>
> But it would be good to just narrow things down a bit.
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not somethign else" would be good.

We've actually been running most of our experiments on a 5.14-based
kernel with a plethora of backports, so pre-folio. Sorry I forgot to
mention that. I'll reproduce with mainline as well.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-28 17:38                           ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-04-28 17:38 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Apr 28, 2022 at 7:09 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2022 at 6:27 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > The data corruption we've been getting unfortunately didn't have to do
> > with lock contention (we already knew that); it still occurs. I'm
> > running out of ideas on what to try there.
>
> Hmm.
>
> I don't see the bug, but I do have a suggestion on something to try.
>
> In particular, you said the problem started with commit 00bfe02f4796
> ("gfs2: Fix mmap + page fault deadlocks for buffered I/O").

Yes, but note that it's gfs2_file_buffered_write() that fails. When
the pagefault_disable/enable() around iomap_file_buffered_write() is
removed, the corruption goes away.

> And to me, I see two main things that are going on
>
>  (a) the obvious "calling generic IO functions with pagefault disabled" thing
>
>  (b) the "allow demotion" thing
>
> And I wonder if you could at least pinpoint which of the  cases it is
> that triggers it.
>
> So I'd love to see you try three things:
>
>  (1) just remove the "allow demotion" cases.
>
>      This will re-introduce the deadlock the commit is trying to fix,
> but that's such a special case that I assume you can run your
> test-suite that shows the problem even without that fix in place?
>
>      This would just pinpoint whether it's due to some odd locking issue or not.
>
> Honestly, from how you describe the symptoms, I don't think (1) is the
> cause, but I think making sure is good.
>
> It sounds much more likely that it's one of those generic vfs
> functions that screws up when a page fault happens and it gets a
> partial result instead of handling the fault.

The test should run just fine without allowing demotion. I'll try (1),
but I don't expect the outcome to change.

> Which gets us to
>
>  (2) remove the pagefault_disable/enable() around just the
> generic_file_read_iter() case in gfs2_file_read_iter().
>
> and
>
>  (3) finally, remove the pagefault_disable/enable() around the
> iomap_file_buffered_write() case in gfs2_file_buffered_write()
>
> Yeah, yeah, you say it's just the read that fails, but humor me on
> (3), just in case it's an earlier write in your test-suite and the
> read just then uncovered it.
>
> But I put it as (3) so that you'd do the obvious (2) case first, and
> narrow it down (ie if (1) still shows the bug, then do (2), and if
> that fixes the bug it will be fairly well pinpointed to
> generic_file_read_iter().

As mentioned above, we already did (3) and it didn't help. I'll do (1)
now, and then (2).

> Looking around, gfs2 is the only thing that obviously calls
> generic_file_read_iter() with pagefaults disabled, so it does smell
> like filemap_read() might have some issue, but the only thing that
> does is basically that
>
>                 copied = copy_folio_to_iter(folio, offset, bytes, iter);
>
> which should just become copy_page_to_iter_iovec(), which you'd hope
> would get things right.
>
> But it would be good to just narrow things down a bit.
>
> I'll look at that copy_page_to_iter_iovec() some more regardless, but
> doing that "let's double-check it's not somethign else" would be good.

We've actually been running most of our experiments on a 5.14-based
kernel with a plethora of backports, so pre-folio. Sorry I forgot to
mention that. I'll reproduce with mainline as well.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: Re: [GIT PULL] gfs2 fix
  2022-04-28 13:26                       ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-04-28 18:16                         ` pr-tracker-bot
  -1 siblings, 0 replies; 64+ messages in thread
From: pr-tracker-bot @ 2022-04-28 18:16 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, cluster-devel, Linux Kernel Mailing List

The pull request you sent on Thu, 28 Apr 2022 15:26:58 +0200:

> https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4a2316a1eda4ef3ced18c7f08f7cb3726bcae44b

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-04-28 18:16                         ` pr-tracker-bot
  0 siblings, 0 replies; 64+ messages in thread
From: pr-tracker-bot @ 2022-04-28 18:16 UTC (permalink / raw)
  To: cluster-devel.redhat.com

The pull request you sent on Thu, 28 Apr 2022 15:26:58 +0200:

> https://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v5.18-rc4-fix2

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/4a2316a1eda4ef3ced18c7f08f7cb3726bcae44b

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-04-28 17:38                           ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-05-02 18:31                             ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-02 18:31 UTC (permalink / raw)
  To: Andreas Gruenbacher, Christoph Hellwig, Darrick J. Wong, Dave Chinner
  Cc: cluster-devel, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 4579 bytes --]

On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> Yes, but note that it's gfs2_file_buffered_write() that fails. When
> the pagefault_disable/enable() around iomap_file_buffered_write() is
> removed, the corruption goes away.

I looked some more at this on and off, and ended up even more confused.

For some reason, I'd mostly looked at the read case, because I had
mis-read some of your emails and thought it was the buffered reads
that caused problems.

Then I went back more carefully, and realized you had always said
gfs2_file_buffered_write() was where the issues happened, and looked
at that path more, and that confused me even *MORE*.

Because that case has always done the copy from user space with page
faults disabled, because of the traditional deadlock with reading from
user space while holding the page lock on the target page cache page.

So that is not really about the new deadlock with filesystem locks,
that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
for buffered I/O").

So now that I'm looking at the right function (maybe) I'm going "huh",
because it's none of the complex cases that would seem to fail, it's
literally just the fault_in_iov_iter_readable() that we've always done
in iomap_write_iter() that presumably starts failing.

But *that* old code seems bogus too. It's doing

                if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
                        status = -EFAULT;
                        break;
                }

which on the face of it is sane: it's saying "if we can't fault in any
bytes, then stop trying".

And it's good, and correct, but it does leave one case open.

Because what if the result is "we can fault things in _partially_"?

The code blithely goes on and tries to do the whole 'bytes' range _anyway_.

Now, with a bug-free filesystem, this really shouldn't matter, since
the later copy_page_from_iter_atomic() thing should then DTRT anyway,
but this does mean that one fundamental thing that that commit
00bfe02f4796 changed is that it basically disabled that
fault_in_iov_iter_readable() that *used* to fault in the whole range,
and now potentially only faults in a small area.

That, in turn, means that in practice it *used* to do "write_end()"
with a fully successful range, ie when it did that

                status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                                page, fsdata);

then "bytes" and "copied" were the same.

But now that commit 00bfe02f4796 added the "disable_pagefault()"
around the whole thing, fault_in_iov_iter_readable() will easily fail
half-way instead of bringing the next page in, and then that
->write_begin() to ->write_end() sequence will see the copy in the
middle failing half-way too, and you'll have that write_end()
condition with the write _partially_ succeeding.

Which is the complex case for write_end() that you practically
speaking never saw before (it *could* happen with a race with swap-out
or similar, but it was not really something you could trigger in real
life.

And I suspect this is what bites you with gfs2

To *test* that hypothesis, how about you try this attached patch? The
generic_perform_write() function in mm/filemap.c has the same exact
pattern, but as mentioned, a filesystem really needs to be able to
handle the partial write_end() case, so it's not a *bug* in that code,
but it migth be triggering a bug in gfs2.

And gfs2 only uses the iomap_write_iter() case, I think. So that's the
only case this attached patch changes.

Again - I think the unpatched iomap_write_iter() code is fine, but I
think it may be what then triggers the real bug in gfs2. So this patch
is not wrong per se, but this patch is basically a "hide the problem"
patch, and it would be very interesting to hear if it does indeed fix
your test-case.

Because that would pinpoint exactly what the bug is.

I'm adding Christoph and Darrick as iomap maintainers here to the
participants (and Dave Chinner in case he's also the temporary
maintainer because Darrick is doing reviews) not because they
necessarily care, but just because this test-patch obviously involves
the iomap code.

NOTE! This patch is entirely untested. I also didn't actually yet go
look at what gfs2 does when 'bytes' and 'copied' are different. But
since I finally think I figured out what might be going on, I decided
I'd send this out sooner rather than later.

Because this is the first thing that makes me go "Aaahh.. This might
explain it".

                   Linus

[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 906 bytes --]

 fs/iomap/buffered-io.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 8ce8720093b9..eb1973792c68 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -758,13 +758,20 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
 		if (bytes > length)
 			bytes = length;
 
+		/* Nobody _should_ call us with an empty iter, but.. */
+		if (WARN_ON_ONCE(!bytes)) {
+			status = 0;
+			break;
+		}
+
 		/*
 		 * Bring in the user page that we'll copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
 		 * same page as we're writing to, without it being marked
 		 * up-to-date.
 		 */
-		if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
+		bytes -= fault_in_iov_iter_readable(i, bytes);
+		if (unlikely(!bytes)) {
 			status = -EFAULT;
 			break;
 		}

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-02 18:31                             ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-02 18:31 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher
<agruenba@redhat.com> wrote:
>
> Yes, but note that it's gfs2_file_buffered_write() that fails. When
> the pagefault_disable/enable() around iomap_file_buffered_write() is
> removed, the corruption goes away.

I looked some more at this on and off, and ended up even more confused.

For some reason, I'd mostly looked at the read case, because I had
mis-read some of your emails and thought it was the buffered reads
that caused problems.

Then I went back more carefully, and realized you had always said
gfs2_file_buffered_write() was where the issues happened, and looked
at that path more, and that confused me even *MORE*.

Because that case has always done the copy from user space with page
faults disabled, because of the traditional deadlock with reading from
user space while holding the page lock on the target page cache page.

So that is not really about the new deadlock with filesystem locks,
that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
for buffered I/O").

So now that I'm looking at the right function (maybe) I'm going "huh",
because it's none of the complex cases that would seem to fail, it's
literally just the fault_in_iov_iter_readable() that we've always done
in iomap_write_iter() that presumably starts failing.

But *that* old code seems bogus too. It's doing

                if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
                        status = -EFAULT;
                        break;
                }

which on the face of it is sane: it's saying "if we can't fault in any
bytes, then stop trying".

And it's good, and correct, but it does leave one case open.

Because what if the result is "we can fault things in _partially_"?

The code blithely goes on and tries to do the whole 'bytes' range _anyway_.

Now, with a bug-free filesystem, this really shouldn't matter, since
the later copy_page_from_iter_atomic() thing should then DTRT anyway,
but this does mean that one fundamental thing that that commit
00bfe02f4796 changed is that it basically disabled that
fault_in_iov_iter_readable() that *used* to fault in the whole range,
and now potentially only faults in a small area.

That, in turn, means that in practice it *used* to do "write_end()"
with a fully successful range, ie when it did that

                status = a_ops->write_end(file, mapping, pos, bytes, copied,
                                                page, fsdata);

then "bytes" and "copied" were the same.

But now that commit 00bfe02f4796 added the "disable_pagefault()"
around the whole thing, fault_in_iov_iter_readable() will easily fail
half-way instead of bringing the next page in, and then that
->write_begin() to ->write_end() sequence will see the copy in the
middle failing half-way too, and you'll have that write_end()
condition with the write _partially_ succeeding.

Which is the complex case for write_end() that you practically
speaking never saw before (it *could* happen with a race with swap-out
or similar, but it was not really something you could trigger in real
life.

And I suspect this is what bites you with gfs2

To *test* that hypothesis, how about you try this attached patch? The
generic_perform_write() function in mm/filemap.c has the same exact
pattern, but as mentioned, a filesystem really needs to be able to
handle the partial write_end() case, so it's not a *bug* in that code,
but it migth be triggering a bug in gfs2.

And gfs2 only uses the iomap_write_iter() case, I think. So that's the
only case this attached patch changes.

Again - I think the unpatched iomap_write_iter() code is fine, but I
think it may be what then triggers the real bug in gfs2. So this patch
is not wrong per se, but this patch is basically a "hide the problem"
patch, and it would be very interesting to hear if it does indeed fix
your test-case.

Because that would pinpoint exactly what the bug is.

I'm adding Christoph and Darrick as iomap maintainers here to the
participants (and Dave Chinner in case he's also the temporary
maintainer because Darrick is doing reviews) not because they
necessarily care, but just because this test-patch obviously involves
the iomap code.

NOTE! This patch is entirely untested. I also didn't actually yet go
look at what gfs2 does when 'bytes' and 'copied' are different. But
since I finally think I figured out what might be going on, I decided
I'd send this out sooner rather than later.

Because this is the first thing that makes me go "Aaahh.. This might
explain it".

                   Linus
-------------- next part --------------
A non-text attachment was scrubbed...
Name: patch.diff
Type: text/x-patch
Size: 906 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/cluster-devel/attachments/20220502/e329c843/attachment.bin>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-02 18:31                             ` [Cluster-devel] " Linus Torvalds
@ 2022-05-02 18:58                               ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-02 18:58 UTC (permalink / raw)
  To: Andreas Gruenbacher, Christoph Hellwig, Darrick J. Wong, Dave Chinner
  Cc: cluster-devel, Linux Kernel Mailing List

On Mon, May 2, 2022 at 11:31 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different.

Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
as well that I brought in the iomap people.

And the iomap code has many different cases. Some of them do

        if (unlikely(copied < len && !folio_test_uptodate(folio)))
                return 0;

to force the whole IO to be re-done (looks sane - that's the "the
underlying folio wasn't uptodate, because we expected the write to
make it so").

And that might not have happened before, but it looks like gfs2 does
actually try to deal with that case.

But since Andreas said originally that the IO wasn't aligned, I don't
think that "not uptodate" case is what is going on, and it's more
about some "partial write in the middle of a buffer succeeded"

And the code also has things like

        if (ret < len)
                iomap_write_failed(iter->inode, pos, len);

which looks very wrong - it's not that the write failed, it's just
incomplete because it was done with page faults disabled. It seems to
try to do some page cache truncation based on the original 'len', but
not taking the successful part into account. Which all sounds
horrifically wrong.

But I don't know the code well enough to really judge. It just makes
me uncomfortable, and I do suspect this code may be quite buggy if the
copy of the full 'len' doesn't succeed.

Again, the patch I sent only _hides_ any issues and makes them
practically impossible to see. It doesn't really _fix_ anything, since
- as mentioned - regardless of fault_in_iov_iter_readable()
succeeding, racing with page-out could then cause the later
copy_page_from_iter_atomic() to have a partial copy anyway.

And hey, maybe there's something entirely different going on, and my
"Heureka! It might be explained by that partial write_end that
generally didn't happen before" is only my shouting at windmills.

             Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-02 18:58                               ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-02 18:58 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, May 2, 2022 at 11:31 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different.

Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
as well that I brought in the iomap people.

And the iomap code has many different cases. Some of them do

        if (unlikely(copied < len && !folio_test_uptodate(folio)))
                return 0;

to force the whole IO to be re-done (looks sane - that's the "the
underlying folio wasn't uptodate, because we expected the write to
make it so").

And that might not have happened before, but it looks like gfs2 does
actually try to deal with that case.

But since Andreas said originally that the IO wasn't aligned, I don't
think that "not uptodate" case is what is going on, and it's more
about some "partial write in the middle of a buffer succeeded"

And the code also has things like

        if (ret < len)
                iomap_write_failed(iter->inode, pos, len);

which looks very wrong - it's not that the write failed, it's just
incomplete because it was done with page faults disabled. It seems to
try to do some page cache truncation based on the original 'len', but
not taking the successful part into account. Which all sounds
horrifically wrong.

But I don't know the code well enough to really judge. It just makes
me uncomfortable, and I do suspect this code may be quite buggy if the
copy of the full 'len' doesn't succeed.

Again, the patch I sent only _hides_ any issues and makes them
practically impossible to see. It doesn't really _fix_ anything, since
- as mentioned - regardless of fault_in_iov_iter_readable()
succeeding, racing with page-out could then cause the later
copy_page_from_iter_atomic() to have a partial copy anyway.

And hey, maybe there's something entirely different going on, and my
"Heureka! It might be explained by that partial write_end that
generally didn't happen before" is only my shouting at windmills.

             Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-02 18:58                               ` [Cluster-devel] " Linus Torvalds
@ 2022-05-02 20:24                                 ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-02 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, cluster-devel,
	Linux Kernel Mailing List

On Mon, May 2, 2022 at 8:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, May 2, 2022 at 11:31 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different.
>
> Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
> as well that I brought in the iomap people.
>
> And the iomap code has many different cases. Some of them do
>
>         if (unlikely(copied < len && !folio_test_uptodate(folio)))
>                 return 0;
>
> to force the whole IO to be re-done (looks sane - that's the "the
> underlying folio wasn't uptodate, because we expected the write to
> make it so").
>
> And that might not have happened before, but it looks like gfs2 does
> actually try to deal with that case.
>
> But since Andreas said originally that the IO wasn't aligned, I don't
> think that "not uptodate" case is what is going on, and it's more
> about some "partial write in the middle of a buffer succeeded"
>
> And the code also has things like
>
>         if (ret < len)
>                 iomap_write_failed(iter->inode, pos, len);
>
> which looks very wrong - it's not that the write failed, it's just
> incomplete because it was done with page faults disabled. It seems to
> try to do some page cache truncation based on the original 'len', but
> not taking the successful part into account. Which all sounds
> horrifically wrong.
>
> But I don't know the code well enough to really judge. It just makes
> me uncomfortable, and I do suspect this code may be quite buggy if the
> copy of the full 'len' doesn't succeed.

This has thrown me off in the past as well; it should be changed to
iomap_write_failed(iter->inode, pos + ret, len - ret) for legibility.
However, iomap_write_failed() only truncates past EOF and is preceded
by i_size_write(iter->inode, pos + ret) here, so it's not strictly a
bug.

> Again, the patch I sent only _hides_ any issues and makes them
> practically impossible to see. It doesn't really _fix_ anything, since
> - as mentioned - regardless of fault_in_iov_iter_readable()
> succeeding, racing with page-out could then cause the later
> copy_page_from_iter_atomic() to have a partial copy anyway.

Indeed. Let's see what we'll get with it.

In the meantime, we've reproduced with 5.18-rc4 + commit 296abc0d91d8
("gfs2: No short reads or writes upon glock contention"), and it still
has the data corruption.

> And hey, maybe there's something entirely different going on, and my
> "Heureka! It might be explained by that partial write_end that
> generally didn't happen before" is only my shouting at windmills.
>
>              Linus
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-02 20:24                                 ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-02 20:24 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, May 2, 2022 at 8:58 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, May 2, 2022 at 11:31 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different.
>
> Oh, it's a lot of generic iomap_write_end() code, so I guess it's just
> as well that I brought in the iomap people.
>
> And the iomap code has many different cases. Some of them do
>
>         if (unlikely(copied < len && !folio_test_uptodate(folio)))
>                 return 0;
>
> to force the whole IO to be re-done (looks sane - that's the "the
> underlying folio wasn't uptodate, because we expected the write to
> make it so").
>
> And that might not have happened before, but it looks like gfs2 does
> actually try to deal with that case.
>
> But since Andreas said originally that the IO wasn't aligned, I don't
> think that "not uptodate" case is what is going on, and it's more
> about some "partial write in the middle of a buffer succeeded"
>
> And the code also has things like
>
>         if (ret < len)
>                 iomap_write_failed(iter->inode, pos, len);
>
> which looks very wrong - it's not that the write failed, it's just
> incomplete because it was done with page faults disabled. It seems to
> try to do some page cache truncation based on the original 'len', but
> not taking the successful part into account. Which all sounds
> horrifically wrong.
>
> But I don't know the code well enough to really judge. It just makes
> me uncomfortable, and I do suspect this code may be quite buggy if the
> copy of the full 'len' doesn't succeed.

This has thrown me off in the past as well; it should be changed to
iomap_write_failed(iter->inode, pos + ret, len - ret) for legibility.
However, iomap_write_failed() only truncates past EOF and is preceded
by i_size_write(iter->inode, pos + ret) here, so it's not strictly a
bug.

> Again, the patch I sent only _hides_ any issues and makes them
> practically impossible to see. It doesn't really _fix_ anything, since
> - as mentioned - regardless of fault_in_iov_iter_readable()
> succeeding, racing with page-out could then cause the later
> copy_page_from_iter_atomic() to have a partial copy anyway.

Indeed. Let's see what we'll get with it.

In the meantime, we've reproduced with 5.18-rc4 + commit 296abc0d91d8
("gfs2: No short reads or writes upon glock contention"), and it still
has the data corruption.

> And hey, maybe there's something entirely different going on, and my
> "Heureka! It might be explained by that partial write_end that
> generally didn't happen before" is only my shouting at windmills.
>
>              Linus
>


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-02 18:31                             ` [Cluster-devel] " Linus Torvalds
@ 2022-05-03  8:56                               ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-03  8:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, cluster-devel,
	Linux Kernel Mailing List

On Mon, May 2, 2022 at 8:32 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > removed, the corruption goes away.
>
> I looked some more at this on and off, and ended up even more confused.
>
> For some reason, I'd mostly looked at the read case, because I had
> mis-read some of your emails and thought it was the buffered reads
> that caused problems.
>
> Then I went back more carefully, and realized you had always said
> gfs2_file_buffered_write() was where the issues happened, and looked
> at that path more, and that confused me even *MORE*.
>
> Because that case has always done the copy from user space with page
> faults disabled, because of the traditional deadlock with reading from
> user space while holding the page lock on the target page cache page.
>
> So that is not really about the new deadlock with filesystem locks,
> that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> for buffered I/O").
>
> So now that I'm looking at the right function (maybe) I'm going "huh",
> because it's none of the complex cases that would seem to fail, it's
> literally just the fault_in_iov_iter_readable() that we've always done
> in iomap_write_iter() that presumably starts failing.
>
> But *that* old code seems bogus too. It's doing
>
>                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
>                         status = -EFAULT;
>                         break;
>                 }
>
> which on the face of it is sane: it's saying "if we can't fault in any
> bytes, then stop trying".
>
> And it's good, and correct, but it does leave one case open.
>
> Because what if the result is "we can fault things in _partially_"?
>
> The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
>
> Now, with a bug-free filesystem, this really shouldn't matter, since
> the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> but this does mean that one fundamental thing that that commit
> 00bfe02f4796 changed is that it basically disabled that
> fault_in_iov_iter_readable() that *used* to fault in the whole range,
> and now potentially only faults in a small area.
>
> That, in turn, means that in practice it *used* to do "write_end()"
> with a fully successful range, ie when it did that
>
>                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
>                                                 page, fsdata);
>
> then "bytes" and "copied" were the same.
>
> But now that commit 00bfe02f4796 added the "disable_pagefault()"
> around the whole thing, fault_in_iov_iter_readable() will easily fail
> half-way instead of bringing the next page in, and then that
> ->write_begin() to ->write_end() sequence will see the copy in the
> middle failing half-way too, and you'll have that write_end()
> condition with the write _partially_ succeeding.
>
> Which is the complex case for write_end() that you practically
> speaking never saw before (it *could* happen with a race with swap-out
> or similar, but it was not really something you could trigger in real
> life.
>
> And I suspect this is what bites you with gfs2
>
> To *test* that hypothesis, how about you try this attached patch? The
> generic_perform_write() function in mm/filemap.c has the same exact
> pattern, but as mentioned, a filesystem really needs to be able to
> handle the partial write_end() case, so it's not a *bug* in that code,
> but it migth be triggering a bug in gfs2.
>
> And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> only case this attached patch changes.
>
> Again - I think the unpatched iomap_write_iter() code is fine, but I
> think it may be what then triggers the real bug in gfs2. So this patch
> is not wrong per se, but this patch is basically a "hide the problem"
> patch, and it would be very interesting to hear if it does indeed fix
> your test-case.

We still get data corruption with the patch applied. The
WARN_ON_ONCE(!bytes) doesn't trigger.

As an additional experiment, I've added code to check the iterator
position that iomap_file_buffered_write() returns, and it's all
looking good as well: an iov_iter_advance(orig_from, written) from the
original position always gets us to the same iterator.

This points at gfs2 getting things wrong after a short write, for
example, marking a page / folio uptodate that isn't. But the uptodate
handling happens at the iomap layer, so this doesn't leave me with an
immediate suspect.

We're on filesystems with block size == page size, so none of the
struct iomap_page uptodata handling should be involved, either.

> Because that would pinpoint exactly what the bug is.
>
> I'm adding Christoph and Darrick as iomap maintainers here to the
> participants (and Dave Chinner in case he's also the temporary
> maintainer because Darrick is doing reviews) not because they
> necessarily care, but just because this test-patch obviously involves
> the iomap code.
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different. But
> since I finally think I figured out what might be going on, I decided
> I'd send this out sooner rather than later.
>
> Because this is the first thing that makes me go "Aaahh.. This might
> explain it".
>
>                    Linus

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-03  8:56                               ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-03  8:56 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Mon, May 2, 2022 at 8:32 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > removed, the corruption goes away.
>
> I looked some more at this on and off, and ended up even more confused.
>
> For some reason, I'd mostly looked at the read case, because I had
> mis-read some of your emails and thought it was the buffered reads
> that caused problems.
>
> Then I went back more carefully, and realized you had always said
> gfs2_file_buffered_write() was where the issues happened, and looked
> at that path more, and that confused me even *MORE*.
>
> Because that case has always done the copy from user space with page
> faults disabled, because of the traditional deadlock with reading from
> user space while holding the page lock on the target page cache page.
>
> So that is not really about the new deadlock with filesystem locks,
> that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> for buffered I/O").
>
> So now that I'm looking at the right function (maybe) I'm going "huh",
> because it's none of the complex cases that would seem to fail, it's
> literally just the fault_in_iov_iter_readable() that we've always done
> in iomap_write_iter() that presumably starts failing.
>
> But *that* old code seems bogus too. It's doing
>
>                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
>                         status = -EFAULT;
>                         break;
>                 }
>
> which on the face of it is sane: it's saying "if we can't fault in any
> bytes, then stop trying".
>
> And it's good, and correct, but it does leave one case open.
>
> Because what if the result is "we can fault things in _partially_"?
>
> The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
>
> Now, with a bug-free filesystem, this really shouldn't matter, since
> the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> but this does mean that one fundamental thing that that commit
> 00bfe02f4796 changed is that it basically disabled that
> fault_in_iov_iter_readable() that *used* to fault in the whole range,
> and now potentially only faults in a small area.
>
> That, in turn, means that in practice it *used* to do "write_end()"
> with a fully successful range, ie when it did that
>
>                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
>                                                 page, fsdata);
>
> then "bytes" and "copied" were the same.
>
> But now that commit 00bfe02f4796 added the "disable_pagefault()"
> around the whole thing, fault_in_iov_iter_readable() will easily fail
> half-way instead of bringing the next page in, and then that
> ->write_begin() to ->write_end() sequence will see the copy in the
> middle failing half-way too, and you'll have that write_end()
> condition with the write _partially_ succeeding.
>
> Which is the complex case for write_end() that you practically
> speaking never saw before (it *could* happen with a race with swap-out
> or similar, but it was not really something you could trigger in real
> life.
>
> And I suspect this is what bites you with gfs2
>
> To *test* that hypothesis, how about you try this attached patch? The
> generic_perform_write() function in mm/filemap.c has the same exact
> pattern, but as mentioned, a filesystem really needs to be able to
> handle the partial write_end() case, so it's not a *bug* in that code,
> but it migth be triggering a bug in gfs2.
>
> And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> only case this attached patch changes.
>
> Again - I think the unpatched iomap_write_iter() code is fine, but I
> think it may be what then triggers the real bug in gfs2. So this patch
> is not wrong per se, but this patch is basically a "hide the problem"
> patch, and it would be very interesting to hear if it does indeed fix
> your test-case.

We still get data corruption with the patch applied. The
WARN_ON_ONCE(!bytes) doesn't trigger.

As an additional experiment, I've added code to check the iterator
position that iomap_file_buffered_write() returns, and it's all
looking good as well: an iov_iter_advance(orig_from, written) from the
original position always gets us to the same iterator.

This points at gfs2 getting things wrong after a short write, for
example, marking a page / folio uptodate that isn't. But the uptodate
handling happens at the iomap layer, so this doesn't leave me with an
immediate suspect.

We're on filesystems with block size == page size, so none of the
struct iomap_page uptodata handling should be involved, either.

> Because that would pinpoint exactly what the bug is.
>
> I'm adding Christoph and Darrick as iomap maintainers here to the
> participants (and Dave Chinner in case he's also the temporary
> maintainer because Darrick is doing reviews) not because they
> necessarily care, but just because this test-patch obviously involves
> the iomap code.
>
> NOTE! This patch is entirely untested. I also didn't actually yet go
> look at what gfs2 does when 'bytes' and 'copied' are different. But
> since I finally think I figured out what might be going on, I decided
> I'd send this out sooner rather than later.
>
> Because this is the first thing that makes me go "Aaahh.. This might
> explain it".
>
>                    Linus

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-03  8:56                               ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-05-03 13:30                                 ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-03 13:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, cluster-devel,
	Linux Kernel Mailing List

On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > removed, the corruption goes away.
> >
> > I looked some more at this on and off, and ended up even more confused.
> >
> > For some reason, I'd mostly looked at the read case, because I had
> > mis-read some of your emails and thought it was the buffered reads
> > that caused problems.
> >
> > Then I went back more carefully, and realized you had always said
> > gfs2_file_buffered_write() was where the issues happened, and looked
> > at that path more, and that confused me even *MORE*.
> >
> > Because that case has always done the copy from user space with page
> > faults disabled, because of the traditional deadlock with reading from
> > user space while holding the page lock on the target page cache page.
> >
> > So that is not really about the new deadlock with filesystem locks,
> > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> > for buffered I/O").
> >
> > So now that I'm looking at the right function (maybe) I'm going "huh",
> > because it's none of the complex cases that would seem to fail, it's
> > literally just the fault_in_iov_iter_readable() that we've always done
> > in iomap_write_iter() that presumably starts failing.
> >
> > But *that* old code seems bogus too. It's doing
> >
> >                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> >                         status = -EFAULT;
> >                         break;
> >                 }
> >
> > which on the face of it is sane: it's saying "if we can't fault in any
> > bytes, then stop trying".
> >
> > And it's good, and correct, but it does leave one case open.
> >
> > Because what if the result is "we can fault things in _partially_"?
> >
> > The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
> >
> > Now, with a bug-free filesystem, this really shouldn't matter, since
> > the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> > but this does mean that one fundamental thing that that commit
> > 00bfe02f4796 changed is that it basically disabled that
> > fault_in_iov_iter_readable() that *used* to fault in the whole range,
> > and now potentially only faults in a small area.
> >
> > That, in turn, means that in practice it *used* to do "write_end()"
> > with a fully successful range, ie when it did that
> >
> >                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
> >                                                 page, fsdata);
> >
> > then "bytes" and "copied" were the same.
> >
> > But now that commit 00bfe02f4796 added the "disable_pagefault()"
> > around the whole thing, fault_in_iov_iter_readable() will easily fail
> > half-way instead of bringing the next page in, and then that
> > ->write_begin() to ->write_end() sequence will see the copy in the
> > middle failing half-way too, and you'll have that write_end()
> > condition with the write _partially_ succeeding.
> >
> > Which is the complex case for write_end() that you practically
> > speaking never saw before (it *could* happen with a race with swap-out
> > or similar, but it was not really something you could trigger in real
> > life.
> >
> > And I suspect this is what bites you with gfs2
> >
> > To *test* that hypothesis, how about you try this attached patch? The
> > generic_perform_write() function in mm/filemap.c has the same exact
> > pattern, but as mentioned, a filesystem really needs to be able to
> > handle the partial write_end() case, so it's not a *bug* in that code,
> > but it migth be triggering a bug in gfs2.
> >
> > And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> > only case this attached patch changes.
> >
> > Again - I think the unpatched iomap_write_iter() code is fine, but I
> > think it may be what then triggers the real bug in gfs2. So this patch
> > is not wrong per se, but this patch is basically a "hide the problem"
> > patch, and it would be very interesting to hear if it does indeed fix
> > your test-case.
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.
>
> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.
>
> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.
>
> We're on filesystems with block size == page size, so none of the
> struct iomap_page uptodata handling should be involved, either.

The rounding around the hole punching in gfs2_iomap_end() looks wrong.
I'm trying a fix now.

> > Because that would pinpoint exactly what the bug is.
> >
> > I'm adding Christoph and Darrick as iomap maintainers here to the
> > participants (and Dave Chinner in case he's also the temporary
> > maintainer because Darrick is doing reviews) not because they
> > necessarily care, but just because this test-patch obviously involves
> > the iomap code.
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different. But
> > since I finally think I figured out what might be going on, I decided
> > I'd send this out sooner rather than later.
> >
> > Because this is the first thing that makes me go "Aaahh.. This might
> > explain it".
> >
> >                    Linus
>
> Thanks,
> Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-03 13:30                                 ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-03 13:30 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > removed, the corruption goes away.
> >
> > I looked some more at this on and off, and ended up even more confused.
> >
> > For some reason, I'd mostly looked at the read case, because I had
> > mis-read some of your emails and thought it was the buffered reads
> > that caused problems.
> >
> > Then I went back more carefully, and realized you had always said
> > gfs2_file_buffered_write() was where the issues happened, and looked
> > at that path more, and that confused me even *MORE*.
> >
> > Because that case has always done the copy from user space with page
> > faults disabled, because of the traditional deadlock with reading from
> > user space while holding the page lock on the target page cache page.
> >
> > So that is not really about the new deadlock with filesystem locks,
> > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> > for buffered I/O").
> >
> > So now that I'm looking at the right function (maybe) I'm going "huh",
> > because it's none of the complex cases that would seem to fail, it's
> > literally just the fault_in_iov_iter_readable() that we've always done
> > in iomap_write_iter() that presumably starts failing.
> >
> > But *that* old code seems bogus too. It's doing
> >
> >                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> >                         status = -EFAULT;
> >                         break;
> >                 }
> >
> > which on the face of it is sane: it's saying "if we can't fault in any
> > bytes, then stop trying".
> >
> > And it's good, and correct, but it does leave one case open.
> >
> > Because what if the result is "we can fault things in _partially_"?
> >
> > The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
> >
> > Now, with a bug-free filesystem, this really shouldn't matter, since
> > the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> > but this does mean that one fundamental thing that that commit
> > 00bfe02f4796 changed is that it basically disabled that
> > fault_in_iov_iter_readable() that *used* to fault in the whole range,
> > and now potentially only faults in a small area.
> >
> > That, in turn, means that in practice it *used* to do "write_end()"
> > with a fully successful range, ie when it did that
> >
> >                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
> >                                                 page, fsdata);
> >
> > then "bytes" and "copied" were the same.
> >
> > But now that commit 00bfe02f4796 added the "disable_pagefault()"
> > around the whole thing, fault_in_iov_iter_readable() will easily fail
> > half-way instead of bringing the next page in, and then that
> > ->write_begin() to ->write_end() sequence will see the copy in the
> > middle failing half-way too, and you'll have that write_end()
> > condition with the write _partially_ succeeding.
> >
> > Which is the complex case for write_end() that you practically
> > speaking never saw before (it *could* happen with a race with swap-out
> > or similar, but it was not really something you could trigger in real
> > life.
> >
> > And I suspect this is what bites you with gfs2
> >
> > To *test* that hypothesis, how about you try this attached patch? The
> > generic_perform_write() function in mm/filemap.c has the same exact
> > pattern, but as mentioned, a filesystem really needs to be able to
> > handle the partial write_end() case, so it's not a *bug* in that code,
> > but it migth be triggering a bug in gfs2.
> >
> > And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> > only case this attached patch changes.
> >
> > Again - I think the unpatched iomap_write_iter() code is fine, but I
> > think it may be what then triggers the real bug in gfs2. So this patch
> > is not wrong per se, but this patch is basically a "hide the problem"
> > patch, and it would be very interesting to hear if it does indeed fix
> > your test-case.
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.
>
> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.
>
> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.
>
> We're on filesystems with block size == page size, so none of the
> struct iomap_page uptodata handling should be involved, either.

The rounding around the hole punching in gfs2_iomap_end() looks wrong.
I'm trying a fix now.

> > Because that would pinpoint exactly what the bug is.
> >
> > I'm adding Christoph and Darrick as iomap maintainers here to the
> > participants (and Dave Chinner in case he's also the temporary
> > maintainer because Darrick is doing reviews) not because they
> > necessarily care, but just because this test-patch obviously involves
> > the iomap code.
> >
> > NOTE! This patch is entirely untested. I also didn't actually yet go
> > look at what gfs2 does when 'bytes' and 'copied' are different. But
> > since I finally think I figured out what might be going on, I decided
> > I'd send this out sooner rather than later.
> >
> > Because this is the first thing that makes me go "Aaahh.. This might
> > explain it".
> >
> >                    Linus
>
> Thanks,
> Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-03  8:56                               ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-05-03 16:19                                 ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-03 16:19 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, cluster-devel,
	Linux Kernel Mailing List

On Tue, May 3, 2022 at 1:56 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.

Oh well. I was so sure that I'd finally found something.. That partial
write case has had bugs before.

> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.

Yeah, I've looked at the iterator parts (and iov_iter_revert() in
particular) multiple times, because that too is an area where we've
had bugs before.

That too may be easy to get wrong, but I couldn't for the life of me
see any issues there.

> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.

Yeah, the uptodate setting looked safe, particularly with that "if we
copied less than we thought we would, and it wasn't uptodate, just
claim we didn't do anything at all".

That said, I now have a *new* suspect: the 'iter->pos' handling in
iomap_write_iter().

In particular, let's look at iomap_file_buffered_write(), which does:

        while ((ret = iomap_iter(&iter, ops)) > 0)
                iter.processed = iomap_write_iter(&iter, i);

and then look at what happens to iter.pos here.

iomap_write_iter() does this:

        loff_t pos = iter->pos;
        ...
                pos += status;

but it never seems to write the updated position back to the iterator.

So what happens next time iomap_write_iter() gets called?

This looks like such a huge bug that I'm probably missing something,
but I wonder if this is normally hidden by the fact that usually
iomap_write_iter() consumes the whole 'iter', so despite the 'while()'
loop, it's actually effectively only called once.

Except if it gets a short write due to an unhandled page fault..

Am I entirely blind, and that 'iter.pos' is updated somewhere and I
just missed it?

Or is this maybe the reason for it all?

                Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-03 16:19                                 ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-03 16:19 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, May 3, 2022 at 1:56 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> We still get data corruption with the patch applied. The
> WARN_ON_ONCE(!bytes) doesn't trigger.

Oh well. I was so sure that I'd finally found something.. That partial
write case has had bugs before.

> As an additional experiment, I've added code to check the iterator
> position that iomap_file_buffered_write() returns, and it's all
> looking good as well: an iov_iter_advance(orig_from, written) from the
> original position always gets us to the same iterator.

Yeah, I've looked at the iterator parts (and iov_iter_revert() in
particular) multiple times, because that too is an area where we've
had bugs before.

That too may be easy to get wrong, but I couldn't for the life of me
see any issues there.

> This points at gfs2 getting things wrong after a short write, for
> example, marking a page / folio uptodate that isn't. But the uptodate
> handling happens at the iomap layer, so this doesn't leave me with an
> immediate suspect.

Yeah, the uptodate setting looked safe, particularly with that "if we
copied less than we thought we would, and it wasn't uptodate, just
claim we didn't do anything at all".

That said, I now have a *new* suspect: the 'iter->pos' handling in
iomap_write_iter().

In particular, let's look at iomap_file_buffered_write(), which does:

        while ((ret = iomap_iter(&iter, ops)) > 0)
                iter.processed = iomap_write_iter(&iter, i);

and then look at what happens to iter.pos here.

iomap_write_iter() does this:

        loff_t pos = iter->pos;
        ...
                pos += status;

but it never seems to write the updated position back to the iterator.

So what happens next time iomap_write_iter() gets called?

This looks like such a huge bug that I'm probably missing something,
but I wonder if this is normally hidden by the fact that usually
iomap_write_iter() consumes the whole 'iter', so despite the 'while()'
loop, it's actually effectively only called once.

Except if it gets a short write due to an unhandled page fault..

Am I entirely blind, and that 'iter.pos' is updated somewhere and I
just missed it?

Or is this maybe the reason for it all?

                Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-03 16:19                                 ` [Cluster-devel] " Linus Torvalds
@ 2022-05-03 16:41                                   ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-03 16:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, cluster-devel,
	Linux Kernel Mailing List

On Tue, May 3, 2022 at 6:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, May 3, 2022 at 1:56 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
>
> Oh well. I was so sure that I'd finally found something.. That partial
> write case has had bugs before.
>
> > As an additional experiment, I've added code to check the iterator
> > position that iomap_file_buffered_write() returns, and it's all
> > looking good as well: an iov_iter_advance(orig_from, written) from the
> > original position always gets us to the same iterator.
>
> Yeah, I've looked at the iterator parts (and iov_iter_revert() in
> particular) multiple times, because that too is an area where we've
> had bugs before.
>
> That too may be easy to get wrong, but I couldn't for the life of me
> see any issues there.
>
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
>
> Yeah, the uptodate setting looked safe, particularly with that "if we
> copied less than we thought we would, and it wasn't uptodate, just
> claim we didn't do anything at all".
>
> That said, I now have a *new* suspect: the 'iter->pos' handling in
> iomap_write_iter().
>
> In particular, let's look at iomap_file_buffered_write(), which does:
>
>         while ((ret = iomap_iter(&iter, ops)) > 0)
>                 iter.processed = iomap_write_iter(&iter, i);
>
> and then look at what happens to iter.pos here.
>
> iomap_write_iter() does this:
>
>         loff_t pos = iter->pos;
>         ...
>                 pos += status;
>
> but it never seems to write the updated position back to the iterator.
>
> So what happens next time iomap_write_iter() gets called?
>
> This looks like such a huge bug that I'm probably missing something,
> but I wonder if this is normally hidden by the fact that usually
> iomap_write_iter() consumes the whole 'iter', so despite the 'while()'
> loop, it's actually effectively only called once.
>
> Except if it gets a short write due to an unhandled page fault..
>
> Am I entirely blind, and that 'iter.pos' is updated somewhere and I
> just missed it?

That's happening in iomap_file_buffered_write() and iomap_iter():

        while ((ret = iomap_iter(&iter, ops)) > 0)
                iter.processed = iomap_write_iter(&iter, i);

Here, iomap_write_iter() returns how much progress it has made, which
is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
then updates iter.pos and iter.len based on iter.processed.

Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-03 16:41                                   ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-03 16:41 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, May 3, 2022 at 6:19 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Tue, May 3, 2022 at 1:56 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
>
> Oh well. I was so sure that I'd finally found something.. That partial
> write case has had bugs before.
>
> > As an additional experiment, I've added code to check the iterator
> > position that iomap_file_buffered_write() returns, and it's all
> > looking good as well: an iov_iter_advance(orig_from, written) from the
> > original position always gets us to the same iterator.
>
> Yeah, I've looked at the iterator parts (and iov_iter_revert() in
> particular) multiple times, because that too is an area where we've
> had bugs before.
>
> That too may be easy to get wrong, but I couldn't for the life of me
> see any issues there.
>
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
>
> Yeah, the uptodate setting looked safe, particularly with that "if we
> copied less than we thought we would, and it wasn't uptodate, just
> claim we didn't do anything at all".
>
> That said, I now have a *new* suspect: the 'iter->pos' handling in
> iomap_write_iter().
>
> In particular, let's look at iomap_file_buffered_write(), which does:
>
>         while ((ret = iomap_iter(&iter, ops)) > 0)
>                 iter.processed = iomap_write_iter(&iter, i);
>
> and then look at what happens to iter.pos here.
>
> iomap_write_iter() does this:
>
>         loff_t pos = iter->pos;
>         ...
>                 pos += status;
>
> but it never seems to write the updated position back to the iterator.
>
> So what happens next time iomap_write_iter() gets called?
>
> This looks like such a huge bug that I'm probably missing something,
> but I wonder if this is normally hidden by the fact that usually
> iomap_write_iter() consumes the whole 'iter', so despite the 'while()'
> loop, it's actually effectively only called once.
>
> Except if it gets a short write due to an unhandled page fault..
>
> Am I entirely blind, and that 'iter.pos' is updated somewhere and I
> just missed it?

That's happening in iomap_file_buffered_write() and iomap_iter():

        while ((ret = iomap_iter(&iter, ops)) > 0)
                iter.processed = iomap_write_iter(&iter, i);

Here, iomap_write_iter() returns how much progress it has made, which
is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
then updates iter.pos and iter.len based on iter.processed.

Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-03 16:41                                   ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-05-03 16:50                                     ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-03 16:50 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, cluster-devel,
	Linux Kernel Mailing List

On Tue, May 3, 2022 at 9:41 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> That's happening in iomap_file_buffered_write() and iomap_iter():
>
>         while ((ret = iomap_iter(&iter, ops)) > 0)
>                 iter.processed = iomap_write_iter(&iter, i);
>
> Here, iomap_write_iter() returns how much progress it has made, which
> is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
> then updates iter.pos and iter.len based on iter.processed.

Ahh. I even had that file open in my editor in another window, but missed it.

So much for that theory.

              Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-03 16:50                                     ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-03 16:50 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, May 3, 2022 at 9:41 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> That's happening in iomap_file_buffered_write() and iomap_iter():
>
>         while ((ret = iomap_iter(&iter, ops)) > 0)
>                 iter.processed = iomap_write_iter(&iter, i);
>
> Here, iomap_write_iter() returns how much progress it has made, which
> is stored in iter.processed, and iomap_iter() -> iomap_iter_advance()
> then updates iter.pos and iter.len based on iter.processed.

Ahh. I even had that file open in my editor in another window, but missed it.

So much for that theory.

              Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-03  8:56                               ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-05-03 21:35                                 ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-03 21:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, Christoph Hellwig, Darrick J. Wong,
	Dave Chinner, cluster-devel, Linux Kernel Mailing List

On Tue, May 3, 2022 at 3:30 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > >
> > > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > > removed, the corruption goes away.
> > >
> > > I looked some more at this on and off, and ended up even more confused.
> > >
> > > For some reason, I'd mostly looked at the read case, because I had
> > > mis-read some of your emails and thought it was the buffered reads
> > > that caused problems.
> > >
> > > Then I went back more carefully, and realized you had always said
> > > gfs2_file_buffered_write() was where the issues happened, and looked
> > > at that path more, and that confused me even *MORE*.
> > >
> > > Because that case has always done the copy from user space with page
> > > faults disabled, because of the traditional deadlock with reading from
> > > user space while holding the page lock on the target page cache page.
> > >
> > > So that is not really about the new deadlock with filesystem locks,
> > > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> > > for buffered I/O").
> > >
> > > So now that I'm looking at the right function (maybe) I'm going "huh",
> > > because it's none of the complex cases that would seem to fail, it's
> > > literally just the fault_in_iov_iter_readable() that we've always done
> > > in iomap_write_iter() that presumably starts failing.
> > >
> > > But *that* old code seems bogus too. It's doing
> > >
> > >                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> > >                         status = -EFAULT;
> > >                         break;
> > >                 }
> > >
> > > which on the face of it is sane: it's saying "if we can't fault in any
> > > bytes, then stop trying".
> > >
> > > And it's good, and correct, but it does leave one case open.
> > >
> > > Because what if the result is "we can fault things in _partially_"?
> > >
> > > The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
> > >
> > > Now, with a bug-free filesystem, this really shouldn't matter, since
> > > the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> > > but this does mean that one fundamental thing that that commit
> > > 00bfe02f4796 changed is that it basically disabled that
> > > fault_in_iov_iter_readable() that *used* to fault in the whole range,
> > > and now potentially only faults in a small area.
> > >
> > > That, in turn, means that in practice it *used* to do "write_end()"
> > > with a fully successful range, ie when it did that
> > >
> > >                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
> > >                                                 page, fsdata);
> > >
> > > then "bytes" and "copied" were the same.
> > >
> > > But now that commit 00bfe02f4796 added the "disable_pagefault()"
> > > around the whole thing, fault_in_iov_iter_readable() will easily fail
> > > half-way instead of bringing the next page in, and then that
> > > ->write_begin() to ->write_end() sequence will see the copy in the
> > > middle failing half-way too, and you'll have that write_end()
> > > condition with the write _partially_ succeeding.
> > >
> > > Which is the complex case for write_end() that you practically
> > > speaking never saw before (it *could* happen with a race with swap-out
> > > or similar, but it was not really something you could trigger in real
> > > life.
> > >
> > > And I suspect this is what bites you with gfs2
> > >
> > > To *test* that hypothesis, how about you try this attached patch? The
> > > generic_perform_write() function in mm/filemap.c has the same exact
> > > pattern, but as mentioned, a filesystem really needs to be able to
> > > handle the partial write_end() case, so it's not a *bug* in that code,
> > > but it migth be triggering a bug in gfs2.
> > >
> > > And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> > > only case this attached patch changes.
> > >
> > > Again - I think the unpatched iomap_write_iter() code is fine, but I
> > > think it may be what then triggers the real bug in gfs2. So this patch
> > > is not wrong per se, but this patch is basically a "hide the problem"
> > > patch, and it would be very interesting to hear if it does indeed fix
> > > your test-case.
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
> >
> > As an additional experiment, I've added code to check the iterator
> > position that iomap_file_buffered_write() returns, and it's all
> > looking good as well: an iov_iter_advance(orig_from, written) from the
> > original position always gets us to the same iterator.
> >
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
> >
> > We're on filesystems with block size == page size, so none of the
> > struct iomap_page uptodata handling should be involved, either.
>
> The rounding around the hole punching in gfs2_iomap_end() looks wrong.
> I'm trying a fix now.

More testing still ongoing, but the following patch seems to fix the
data corruption.  Provided that things go well, I'll send a pull request
tomorrow.

Thanks a lot for the huge amount of help!

Andreas

-

gfs2: Short write fix

When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.

To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the next block boundary
to arrive at the end of the allocation.  It would have to round up, but
the end of the allocation is also available as iomap->offset +
iomap->length, so just use that instead.

In addition, use round_up() for computing the start of the unused range.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 39080b2d6cf8..6abd044a176d 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1153,14 +1153,12 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 
 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
 		/* Deallocate blocks that were just allocated. */
-		loff_t blockmask = i_blocksize(inode) - 1;
-		loff_t end = (pos + length) & ~blockmask;
+		loff_t hstart = round_up(pos + written, i_blocksize(inode));
+		loff_t hend = iomap->offset + iomap->length;
 
-		pos = (pos + written + blockmask) & ~blockmask;
-		if (pos < end) {
-			truncate_pagecache_range(inode, pos, end - 1);
-			punch_hole(ip, pos, end - pos);
-		}
+		truncate_pagecache_range(inode, hstart, hend - 1);
+		if (hstart < hend)
+			punch_hole(ip, hstart, hend - hstart);
 	}
 
 	if (unlikely(!written))


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-03 21:35                                 ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-03 21:35 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, May 3, 2022 at 3:30 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> On Tue, May 3, 2022 at 10:56 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > >
> > > > Yes, but note that it's gfs2_file_buffered_write() that fails. When
> > > > the pagefault_disable/enable() around iomap_file_buffered_write() is
> > > > removed, the corruption goes away.
> > >
> > > I looked some more at this on and off, and ended up even more confused.
> > >
> > > For some reason, I'd mostly looked at the read case, because I had
> > > mis-read some of your emails and thought it was the buffered reads
> > > that caused problems.
> > >
> > > Then I went back more carefully, and realized you had always said
> > > gfs2_file_buffered_write() was where the issues happened, and looked
> > > at that path more, and that confused me even *MORE*.
> > >
> > > Because that case has always done the copy from user space with page
> > > faults disabled, because of the traditional deadlock with reading from
> > > user space while holding the page lock on the target page cache page.
> > >
> > > So that is not really about the new deadlock with filesystem locks,
> > > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks
> > > for buffered I/O").
> > >
> > > So now that I'm looking at the right function (maybe) I'm going "huh",
> > > because it's none of the complex cases that would seem to fail, it's
> > > literally just the fault_in_iov_iter_readable() that we've always done
> > > in iomap_write_iter() that presumably starts failing.
> > >
> > > But *that* old code seems bogus too. It's doing
> > >
> > >                 if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) {
> > >                         status = -EFAULT;
> > >                         break;
> > >                 }
> > >
> > > which on the face of it is sane: it's saying "if we can't fault in any
> > > bytes, then stop trying".
> > >
> > > And it's good, and correct, but it does leave one case open.
> > >
> > > Because what if the result is "we can fault things in _partially_"?
> > >
> > > The code blithely goes on and tries to do the whole 'bytes' range _anyway_.
> > >
> > > Now, with a bug-free filesystem, this really shouldn't matter, since
> > > the later copy_page_from_iter_atomic() thing should then DTRT anyway,
> > > but this does mean that one fundamental thing that that commit
> > > 00bfe02f4796 changed is that it basically disabled that
> > > fault_in_iov_iter_readable() that *used* to fault in the whole range,
> > > and now potentially only faults in a small area.
> > >
> > > That, in turn, means that in practice it *used* to do "write_end()"
> > > with a fully successful range, ie when it did that
> > >
> > >                 status = a_ops->write_end(file, mapping, pos, bytes, copied,
> > >                                                 page, fsdata);
> > >
> > > then "bytes" and "copied" were the same.
> > >
> > > But now that commit 00bfe02f4796 added the "disable_pagefault()"
> > > around the whole thing, fault_in_iov_iter_readable() will easily fail
> > > half-way instead of bringing the next page in, and then that
> > > ->write_begin() to ->write_end() sequence will see the copy in the
> > > middle failing half-way too, and you'll have that write_end()
> > > condition with the write _partially_ succeeding.
> > >
> > > Which is the complex case for write_end() that you practically
> > > speaking never saw before (it *could* happen with a race with swap-out
> > > or similar, but it was not really something you could trigger in real
> > > life.
> > >
> > > And I suspect this is what bites you with gfs2
> > >
> > > To *test* that hypothesis, how about you try this attached patch? The
> > > generic_perform_write() function in mm/filemap.c has the same exact
> > > pattern, but as mentioned, a filesystem really needs to be able to
> > > handle the partial write_end() case, so it's not a *bug* in that code,
> > > but it migth be triggering a bug in gfs2.
> > >
> > > And gfs2 only uses the iomap_write_iter() case, I think. So that's the
> > > only case this attached patch changes.
> > >
> > > Again - I think the unpatched iomap_write_iter() code is fine, but I
> > > think it may be what then triggers the real bug in gfs2. So this patch
> > > is not wrong per se, but this patch is basically a "hide the problem"
> > > patch, and it would be very interesting to hear if it does indeed fix
> > > your test-case.
> >
> > We still get data corruption with the patch applied. The
> > WARN_ON_ONCE(!bytes) doesn't trigger.
> >
> > As an additional experiment, I've added code to check the iterator
> > position that iomap_file_buffered_write() returns, and it's all
> > looking good as well: an iov_iter_advance(orig_from, written) from the
> > original position always gets us to the same iterator.
> >
> > This points at gfs2 getting things wrong after a short write, for
> > example, marking a page / folio uptodate that isn't. But the uptodate
> > handling happens at the iomap layer, so this doesn't leave me with an
> > immediate suspect.
> >
> > We're on filesystems with block size == page size, so none of the
> > struct iomap_page uptodata handling should be involved, either.
>
> The rounding around the hole punching in gfs2_iomap_end() looks wrong.
> I'm trying a fix now.

More testing still ongoing, but the following patch seems to fix the
data corruption.  Provided that things go well, I'll send a pull request
tomorrow.

Thanks a lot for the huge amount of help!

Andreas

-

gfs2: Short write fix

When a write cannot be carried out in full, gfs2_iomap_end() releases
blocks that have been allocated for this write but haven't been used.

To compute the end of the allocation, gfs2_iomap_end() incorrectly
rounded the end of the attempted write down to the next block boundary
to arrive at the end of the allocation.  It would have to round up, but
the end of the allocation is also available as iomap->offset +
iomap->length, so just use that instead.

In addition, use round_up() for computing the start of the unused range.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 39080b2d6cf8..6abd044a176d 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1153,14 +1153,12 @@ static int gfs2_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 
 	if (length != written && (iomap->flags & IOMAP_F_NEW)) {
 		/* Deallocate blocks that were just allocated. */
-		loff_t blockmask = i_blocksize(inode) - 1;
-		loff_t end = (pos + length) & ~blockmask;
+		loff_t hstart = round_up(pos + written, i_blocksize(inode));
+		loff_t hend = iomap->offset + iomap->length;
 
-		pos = (pos + written + blockmask) & ~blockmask;
-		if (pos < end) {
-			truncate_pagecache_range(inode, pos, end - 1);
-			punch_hole(ip, pos, end - pos);
-		}
+		truncate_pagecache_range(inode, hstart, hend - 1);
+		if (hstart < hend)
+			punch_hole(ip, hstart, hend - hstart);
 	}
 
 	if (unlikely(!written))

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-03 21:35                                 ` [Cluster-devel] " Andreas Gruenbacher
@ 2022-05-03 22:41                                   ` Linus Torvalds
  -1 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-03 22:41 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, cluster-devel,
	Linux Kernel Mailing List

On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> More testing still ongoing, but the following patch seems to fix the
> data corruption.

Fingers crossed.

> +               truncate_pagecache_range(inode, hstart, hend - 1);
> +               if (hstart < hend)
> +                       punch_hole(ip, hstart, hend - hstart);

Why doesn't that "hstart < hend" condition cover both the truncate and
the hole punch?

             Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-03 22:41                                   ` Linus Torvalds
  0 siblings, 0 replies; 64+ messages in thread
From: Linus Torvalds @ 2022-05-03 22:41 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> More testing still ongoing, but the following patch seems to fix the
> data corruption.

Fingers crossed.

> +               truncate_pagecache_range(inode, hstart, hend - 1);
> +               if (hstart < hend)
> +                       punch_hole(ip, hstart, hend - hstart);

Why doesn't that "hstart < hend" condition cover both the truncate and
the hole punch?

             Linus


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2022-05-03 22:41                                   ` [Cluster-devel] " Linus Torvalds
@ 2022-05-04 17:52                                     ` Andreas Gruenbacher
  -1 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-04 17:52 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Christoph Hellwig, Darrick J. Wong, Dave Chinner, cluster-devel,
	Linux Kernel Mailing List

On Wed, May 4, 2022 at 12:42 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > More testing still ongoing, but the following patch seems to fix the
> > data corruption.
>
> Fingers crossed.

It turns out that crossing fingers wasn't enough and we still get
corruption, but less frequently than before. We're going in the right
direction.

My working theory is that this is due to a subtle bug in the hole
punching done by gfs2_iomap_end() to get rid of unused blocks. With
the test case that fails, gfs2_iomap_end() is punching holes way more
often than I was expecting, and way more often than it should.
Remember that the test case is doing 32-MiB writes of a user buffer
that usually isn't entirely in memory. The first
iomap_file_buffered_write() allocates filesystem blocks for the entire
buffer, and when it finds that it could only do a partial write, it
frees a large part of those blocks. It will then call
fault_in_iov_iter_readable() for the next chunk, and the next call to
iomap_file_buffered_write() will then usually be able to write that
chunk entirely.

So it seems that we should always call fault_in_iov_iter_readable()
before calling into iomap_file_buffered_write(). This will probably
hide whatever is going wrong in punch_hole(), but we'll get to that
later ...

(Side note: the chunk size should be aligned to the page cache, not to
the iov_iter as in the current code.)

> > +               truncate_pagecache_range(inode, hstart, hend - 1);
> > +               if (hstart < hend)
> > +                       punch_hole(ip, hstart, hend - hstart);
>
> Why doesn't that "hstart < hend" condition cover both the truncate and
> the hole punch?

That was a leftover from a previous experiment in which I did the
truncate_pagecache_range() on the unaligned boundaries. Which turned
out to be pointless. I'll clean that up.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [Cluster-devel] [GIT PULL] gfs2 fix
@ 2022-05-04 17:52                                     ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2022-05-04 17:52 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Wed, May 4, 2022 at 12:42 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, May 3, 2022 at 2:35 PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > More testing still ongoing, but the following patch seems to fix the
> > data corruption.
>
> Fingers crossed.

It turns out that crossing fingers wasn't enough and we still get
corruption, but less frequently than before. We're going in the right
direction.

My working theory is that this is due to a subtle bug in the hole
punching done by gfs2_iomap_end() to get rid of unused blocks. With
the test case that fails, gfs2_iomap_end() is punching holes way more
often than I was expecting, and way more often than it should.
Remember that the test case is doing 32-MiB writes of a user buffer
that usually isn't entirely in memory. The first
iomap_file_buffered_write() allocates filesystem blocks for the entire
buffer, and when it finds that it could only do a partial write, it
frees a large part of those blocks. It will then call
fault_in_iov_iter_readable() for the next chunk, and the next call to
iomap_file_buffered_write() will then usually be able to write that
chunk entirely.

So it seems that we should always call fault_in_iov_iter_readable()
before calling into iomap_file_buffered_write(). This will probably
hide whatever is going wrong in punch_hole(), but we'll get to that
later ...

(Side note: the chunk size should be aligned to the page cache, not to
the iov_iter as in the current code.)

> > +               truncate_pagecache_range(inode, hstart, hend - 1);
> > +               if (hstart < hend)
> > +                       punch_hole(ip, hstart, hend - hstart);
>
> Why doesn't that "hstart < hend" condition cover both the truncate and
> the hole punch?

That was a leftover from a previous experiment in which I did the
truncate_pagecache_range() on the unaligned boundaries. Which turned
out to be pointless. I'll clean that up.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2024-03-25 12:10 Andreas Gruenbacher
@ 2024-03-25 18:18 ` pr-tracker-bot
  0 siblings, 0 replies; 64+ messages in thread
From: pr-tracker-bot @ 2024-03-25 18:18 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Andreas Gruenbacher, gfs2, linux-kernel

The pull request you sent on Mon, 25 Mar 2024 13:10:16 +0100:

> git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v6.8-fix

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/928a87efa42302a23bb9554be081a28058495f22

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [GIT PULL] gfs2 fix
@ 2024-03-25 12:10 Andreas Gruenbacher
  2024-03-25 18:18 ` pr-tracker-bot
  0 siblings, 1 reply; 64+ messages in thread
From: Andreas Gruenbacher @ 2024-03-25 12:10 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Gruenbacher, gfs2, linux-kernel

Hello Linus,

please consider pulling the following gfs2 fix.

Thank you very much,
Andreas

The following changes since commit e8f897f4afef0031fe618a8e94127a0934896aba:

  Linux 6.8 (2024-03-10 13:38:09 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v6.8-fix

for you to fetch changes up to c95346ac918c5badf51b9a7ac58a26d3bd5bb224:

  gfs2: Fix invalid metadata access in punch_hole (2024-03-11 17:11:18 +0100)

----------------------------------------------------------------
gfs2 fix

- Fix boundary check in punch_hole

----------------------------------------------------------------
Andrew Price (1):
      gfs2: Fix invalid metadata access in punch_hole

 fs/gfs2/bmap.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2023-06-06 12:55 ` Linus Torvalds
@ 2023-06-06 13:32   ` Andreas Gruenbacher
  0 siblings, 0 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2023-06-06 13:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: cluster-devel, linux-kernel

On Tue, Jun 6, 2023 at 2:55 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Jun 6, 2023 at 5:48 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > - Don't get stuck writing page onto itself under direct I/O.
>
> Btw, is there a test for this DIO case?

The previous test case I wrote for these kinds of page faults is:

  "generic: Test page faults during read and write"
  https://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git/commit/?id=d3cbdabff

I've added a check for this specific case, but this change hasn't been
posted/merged yet:

  "generic/728: Add mmap + DIO write test"
  https://gitlab.com/agruenba/xfstests/-/commit/8c37de03

> We've had the deadlock issue on t page lock (or for inode locks or
> whatever) for normal IO when faulting in the same page that is written
> to, and we have as pattern for solving that and I think there are
> filesystem tests that trigger this.
>
> But the DIO pattern is a bit different, with the whole "invalidate
> page cache: issue, and the fact that you send this patch now (rather
> than years ago) makes me wonder about test coverage for this all?

Yes, this case wasn't covered so far. The other page fault issues are
covered since 2021, and were fixed in gfs2 back then.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2023-06-06 12:48 Andreas Gruenbacher
  2023-06-06 12:55 ` Linus Torvalds
@ 2023-06-06 13:22 ` pr-tracker-bot
  1 sibling, 0 replies; 64+ messages in thread
From: pr-tracker-bot @ 2023-06-06 13:22 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, Andreas Gruenbacher, cluster-devel, linux-kernel

The pull request you sent on Tue,  6 Jun 2023 14:48:00 +0200:

> git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v6.4-rc4-fix

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/0bdd0f0bf17c5aac16f348ee4b1ebf23d1ec1649

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [GIT PULL] gfs2 fix
  2023-06-06 12:48 Andreas Gruenbacher
@ 2023-06-06 12:55 ` Linus Torvalds
  2023-06-06 13:32   ` Andreas Gruenbacher
  2023-06-06 13:22 ` pr-tracker-bot
  1 sibling, 1 reply; 64+ messages in thread
From: Linus Torvalds @ 2023-06-06 12:55 UTC (permalink / raw)
  To: Andreas Gruenbacher; +Cc: cluster-devel, linux-kernel

On Tue, Jun 6, 2023 at 5:48 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> - Don't get stuck writing page onto itself under direct I/O.

Btw, is there a test for this DIO case?

We've had the deadlock issue on t page lock (or for inode locks or
whatever) for normal IO when faulting in the same page that is written
to, and we have as pattern for solving that and I think there are
filesystem tests that trigger this.

But the DIO pattern is a bit different, with the whole "invalidate
page cache: issue, and the fact that you send this patch now (rather
than years ago) makes me wonder about test coverage for this all?

                Linus

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [GIT PULL] gfs2 fix
@ 2023-06-06 12:48 Andreas Gruenbacher
  2023-06-06 12:55 ` Linus Torvalds
  2023-06-06 13:22 ` pr-tracker-bot
  0 siblings, 2 replies; 64+ messages in thread
From: Andreas Gruenbacher @ 2023-06-06 12:48 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Andreas Gruenbacher, cluster-devel, linux-kernel

Hi Linus,

please consider pulling the following fix.

Thanks,
Andreas

The following changes since commit 48b1320a674e1ff5de2fad8606bee38f724594dc:

  Merge tag 'for-6.4-rc4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux (2023-05-30 17:23:50 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/gfs2/linux-gfs2.git tags/gfs2-v6.4-rc4-fix

for you to fetch changes up to fa58cc888d67e640e354d8b3ceef877ea167b0cf:

  gfs2: Don't get stuck writing page onto itself under direct I/O (2023-06-01 14:55:43 +0200)

----------------------------------------------------------------
gfs2 fix

- Don't get stuck writing page onto itself under direct I/O.

----------------------------------------------------------------
Andreas Gruenbacher (1):
      gfs2: Don't get stuck writing page onto itself under direct I/O

 fs/gfs2/file.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)


^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2024-03-25 18:18 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-26 14:54 [GIT PULL] gfs2 fix Andreas Gruenbacher
2022-04-26 14:54 ` [Cluster-devel] " Andreas Gruenbacher
2022-04-26 18:31 ` Linus Torvalds
2022-04-26 18:31   ` [Cluster-devel] " Linus Torvalds
2022-04-26 21:27   ` Andreas Gruenbacher
2022-04-26 21:27     ` [Cluster-devel] " Andreas Gruenbacher
2022-04-26 23:33     ` Linus Torvalds
2022-04-26 23:33       ` [Cluster-devel] " Linus Torvalds
2022-04-27 12:29       ` Andreas Gruenbacher
2022-04-27 12:29         ` [Cluster-devel] " Andreas Gruenbacher
2022-04-27 17:13         ` Linus Torvalds
2022-04-27 17:13           ` [Cluster-devel] " Linus Torvalds
2022-04-27 19:41           ` Andreas Gruenbacher
2022-04-27 19:41             ` [Cluster-devel] " Andreas Gruenbacher
2022-04-27 20:25             ` Linus Torvalds
2022-04-27 20:25               ` [Cluster-devel] " Linus Torvalds
2022-04-27 21:26               ` Andreas Gruenbacher
2022-04-27 21:26                 ` [Cluster-devel] " Andreas Gruenbacher
2022-04-27 22:20                 ` Linus Torvalds
2022-04-27 22:20                   ` [Cluster-devel] " Linus Torvalds
2022-04-28  0:00                   ` Linus Torvalds
2022-04-28  0:00                     ` [Cluster-devel] " Linus Torvalds
2022-04-28 13:26                     ` Andreas Gruenbacher
2022-04-28 13:26                       ` [Cluster-devel] " Andreas Gruenbacher
2022-04-28 17:09                       ` Linus Torvalds
2022-04-28 17:09                         ` [Cluster-devel] " Linus Torvalds
2022-04-28 17:17                         ` Linus Torvalds
2022-04-28 17:17                           ` [Cluster-devel] " Linus Torvalds
2022-04-28 17:21                           ` Andreas Gruenbacher
2022-04-28 17:21                             ` [Cluster-devel] " Andreas Gruenbacher
2022-04-28 17:38                         ` Andreas Gruenbacher
2022-04-28 17:38                           ` [Cluster-devel] " Andreas Gruenbacher
2022-05-02 18:31                           ` Linus Torvalds
2022-05-02 18:31                             ` [Cluster-devel] " Linus Torvalds
2022-05-02 18:58                             ` Linus Torvalds
2022-05-02 18:58                               ` [Cluster-devel] " Linus Torvalds
2022-05-02 20:24                               ` Andreas Gruenbacher
2022-05-02 20:24                                 ` [Cluster-devel] " Andreas Gruenbacher
2022-05-03  8:56                             ` Andreas Gruenbacher
2022-05-03  8:56                               ` [Cluster-devel] " Andreas Gruenbacher
2022-05-03 13:30                               ` Andreas Gruenbacher
2022-05-03 13:30                                 ` [Cluster-devel] " Andreas Gruenbacher
2022-05-03 16:19                               ` Linus Torvalds
2022-05-03 16:19                                 ` [Cluster-devel] " Linus Torvalds
2022-05-03 16:41                                 ` Andreas Gruenbacher
2022-05-03 16:41                                   ` [Cluster-devel] " Andreas Gruenbacher
2022-05-03 16:50                                   ` Linus Torvalds
2022-05-03 16:50                                     ` [Cluster-devel] " Linus Torvalds
2022-05-03 21:35                               ` Andreas Gruenbacher
2022-05-03 21:35                                 ` [Cluster-devel] " Andreas Gruenbacher
2022-05-03 22:41                                 ` Linus Torvalds
2022-05-03 22:41                                   ` [Cluster-devel] " Linus Torvalds
2022-05-04 17:52                                   ` Andreas Gruenbacher
2022-05-04 17:52                                     ` [Cluster-devel] " Andreas Gruenbacher
2022-04-28 18:16                       ` pr-tracker-bot
2022-04-28 18:16                         ` [Cluster-devel] " pr-tracker-bot
2022-04-26 19:07 ` pr-tracker-bot
2022-04-26 19:07   ` [Cluster-devel] " pr-tracker-bot
2023-06-06 12:48 Andreas Gruenbacher
2023-06-06 12:55 ` Linus Torvalds
2023-06-06 13:32   ` Andreas Gruenbacher
2023-06-06 13:22 ` pr-tracker-bot
2024-03-25 12:10 Andreas Gruenbacher
2024-03-25 18:18 ` pr-tracker-bot

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.