linux-mm.kvack.org archive mirror
* Regression in xfstests on tmpfs-backed NFS exports
@ 2022-04-06 17:18 Chuck Lever III
  2022-04-07  0:18 ` Hugh Dickins
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever III @ 2022-04-06 17:18 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Linux-MM, Linux NFS Mailing List

Good day, Hugh-

I noticed that several fsx-related tests in the xfstests suite are
failing after updating my NFS server to v5.18-rc1. I normally test
against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export
that sees these new failures:

generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad)
    --- tests/generic/075.out	2014-02-13 15:40:45.000000000 -0500
    +++ /home/cel/src/xfstests/results//generic/075.out.bad	2022-04-05 16:39:59.145991520 -0400
    @@ -4,15 +4,5 @@
     -----------------------------------------------
     fsx.0 : -d -N numops -S 0
     -----------------------------------------------
    -
    ------------------------------------------------
    -fsx.1 : -d -N numops -S 0 -x
    ------------------------------------------------
    ...
    (Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad'  to see the entire diff)

generic/091 9s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad)
    --- tests/generic/091.out	2014-02-13 15:40:45.000000000 -0500
    +++ /home/cel/src/xfstests/results//generic/091.out.bad	2022-04-05 16:41:24.329063277 -0400
    @@ -1,7 +1,75 @@
     QA output created by 091
     fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
    -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
    ...
    (Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad'  to see the entire diff)

generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad)
    --- tests/generic/112.out	2014-02-13 15:40:45.000000000 -0500
    +++ /home/cel/src/xfstests/results//generic/112.out.bad	2022-04-05 16:41:38.511075170 -0400
    @@ -4,15 +4,4 @@
     -----------------------------------------------
     fsx.0 : -A -d -N numops -S 0
     -----------------------------------------------
    -
    ------------------------------------------------
    -fsx.1 : -A -d -N numops -S 0 -x
    ------------------------------------------------
    ...
    (Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad'  to see the entire diff)

generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad)
    --- tests/generic/127.out	2016-08-28 12:16:20.000000000 -0400
    +++ /home/cel/src/xfstests/results//generic/127.out.bad	2022-04-05 16:42:07.655099652 -0400
    @@ -4,10 +4,198 @@
     === FSX Light Mode, Memory Mapping ===
     All 100000 operations completed A-OK!
     === FSX Standard Mode, No Memory Mapping ===
    -All 100000 operations completed A-OK!
    +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
    +READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap
    +OFFSET	GOOD	BAD	RANGE
    ...
    (Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad'  to see the entire diff)

I bisected the problem to:

  56a8c8eb1eaf ("tmpfs: do not allocate pages on read")

generic/075 fails almost immediately without any NFS-level errors.
Likely this is data corruption rather than an overt I/O error.


--
Chuck Lever






* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-06 17:18 Regression in xfstests on tmpfs-backed NFS exports Chuck Lever III
@ 2022-04-07  0:18 ` Hugh Dickins
  2022-04-07  4:25   ` Mark Hemment
  2022-04-07 19:24   ` Chuck Lever III
  0 siblings, 2 replies; 11+ messages in thread
From: Hugh Dickins @ 2022-04-07  0:18 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Hugh Dickins, Andrew Morton, Patrice CHOTARD, Mikulas Patocka,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM,
	Linux NFS Mailing List, linux-fsdevel

On Wed, 6 Apr 2022, Chuck Lever III wrote:

> Good day, Hugh-

Huh! If you were really wishing me a good day, would you tell me this ;-?

> 
> I noticed that several fsx-related tests in the xfstests suite are
> failing after updating my NFS server to v5.18-rc1. I normally test
> against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export
> that sees these new failures:
> 
> generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad)
>     --- tests/generic/075.out	2014-02-13 15:40:45.000000000 -0500
>     +++ /home/cel/src/xfstests/results//generic/075.out.bad	2022-04-05 16:39:59.145991520 -0400
>     @@ -4,15 +4,5 @@
>      -----------------------------------------------
>      fsx.0 : -d -N numops -S 0
>      -----------------------------------------------
>     -
>     ------------------------------------------------
>     -fsx.1 : -d -N numops -S 0 -x
>     ------------------------------------------------
>     ...
>     (Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad'  to see the entire diff)
> 
> generic/091 9s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad)
>     --- tests/generic/091.out	2014-02-13 15:40:45.000000000 -0500
>     +++ /home/cel/src/xfstests/results//generic/091.out.bad	2022-04-05 16:41:24.329063277 -0400
>     @@ -1,7 +1,75 @@
>      QA output created by 091
>      fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>     -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>     -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>     -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>     -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>     -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
>     ...
>     (Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad'  to see the entire diff)
> 
> generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad)
>     --- tests/generic/112.out	2014-02-13 15:40:45.000000000 -0500
>     +++ /home/cel/src/xfstests/results//generic/112.out.bad	2022-04-05 16:41:38.511075170 -0400
>     @@ -4,15 +4,4 @@
>      -----------------------------------------------
>      fsx.0 : -A -d -N numops -S 0
>      -----------------------------------------------
>     -
>     ------------------------------------------------
>     -fsx.1 : -A -d -N numops -S 0 -x
>     ------------------------------------------------
>     ...
>     (Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad'  to see the entire diff)
> 
> generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad)
>     --- tests/generic/127.out	2016-08-28 12:16:20.000000000 -0400
>     +++ /home/cel/src/xfstests/results//generic/127.out.bad	2022-04-05 16:42:07.655099652 -0400
>     @@ -4,10 +4,198 @@
>      === FSX Light Mode, Memory Mapping ===
>      All 100000 operations completed A-OK!
>      === FSX Standard Mode, No Memory Mapping ===
>     -All 100000 operations completed A-OK!
>     +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
>     +READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap
>     +OFFSET	GOOD	BAD	RANGE
>     ...
>     (Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad'  to see the entire diff)
> 
> I bisected the problem to:
> 
>   56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
> 
> generic/075 fails almost immediately without any NFS-level errors.
> Likely this is data corruption rather than an overt I/O error.

That's sad.  Thanks for bisecting and reporting.  Sorry for the nuisance.

I suspect this patch is heading for a revert, because I shall not have
time to debug and investigate.  Cc'ing fsdevel and a few people who have
an interest in it, to warn of that likely upcoming revert.

But if it's okay with everyone, please may we leave it in for -rc2?
Given that having it in -rc1 already smoked out another issue (problem
of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping
it in a little longer might smoke out even more.

The xfstests info above doesn't actually tell very much, beyond that
generic/075 generic/091 generic/112 generic/127, each a test with fsx,
all fall at their first hurdle.  If you have time, please rerun and
tar up the results/generic directory (maybe filter just those failing)
and send as attachment.  But don't go to any trouble, it's unlikely
that I shall even untar it - it would be mainly to go on record if
anyone has time to look into it later.  And, frankly, it's unlikely
to tell us anything more enlightening, than that the data seen was
not as expected: which we do already know.

I've had no problem with xfstests generic 075,091,112,127 testing
tmpfs here, not before and not in the month or two I've had that
patch in: so it's something in the way that NFS exercises tmpfs
that reveals it.  If I had time to duplicate your procedure, I'd be
asking for detailed instructions: but no, I won't have a chance.

But I can sit here and try to guess.  I notice fs/nfsd checks
file->f_op->splice_read, and employs fallback if not available:
if you have time, please try rerunning those xfstests on an -rc1
kernel, but with mm/shmem.c's .splice_read line commented out.
My guess is that will then pass the tests, and we shall know more.
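
To be concrete, the experiment is just to comment out one initializer
in shmem_file_operations - sketched here from memory, so treat the
surrounding fields as approximate, but in 5.18-rc1 the line should be
there:

	static const struct file_operations shmem_file_operations = {
		.mmap		= shmem_mmap,
		.get_unmapped_area = shmem_get_unmapped_area,
	#ifdef CONFIG_TMPFS
		.llseek		= shmem_file_llseek,
		.read_iter	= shmem_file_read_iter,
		.write_iter	= generic_file_write_iter,
		.fsync		= noop_fsync,
	/*	.splice_read	= generic_file_splice_read,	*/
		.splice_write	= iter_file_splice_write,
		.fallocate	= shmem_fallocate,
	#endif
	};

With no ->splice_read, fs/nfsd should take its non-splice fallback,
as noted above.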

What could be going wrong there?  I've thought of two possibilities.
A minor, hopefully easily fixed, issue would be if fs/nfsd has
trouble with seeing the same page twice in a row: since tmpfs is
now using the ZERO_PAGE(0) for all pages of a hole, and I think I
caught sight of code which looks to see if the latest page is the
same as the one before.  It's easy to imagine that might go wrong.

A more difficult issue would be, if fsx is racing writes and reads,
in a way that it can guarantee the correct result, but that correct
result is no longer delivered: because the writes go into freshly
allocated tmpfs cache pages, while reads are still delivering
stale ZERO_PAGEs from the pipe.  I'm hazy on the guarantees there.

But unless someone has time to help out, we're heading for a revert.

Thanks,
Hugh



* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-07  0:18 ` Hugh Dickins
@ 2022-04-07  4:25   ` Mark Hemment
  2022-04-07 22:04     ` Hugh Dickins
  2022-04-07 19:24   ` Chuck Lever III
  1 sibling, 1 reply; 11+ messages in thread
From: Mark Hemment @ 2022-04-07  4:25 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Chuck Lever III, Andrew Morton, Patrice CHOTARD, Mikulas Patocka,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM,
	Linux NFS Mailing List, linux-fsdevel

On Thu, 7 Apr 2022 at 01:19, Hugh Dickins <hughd@google.com> wrote:
>
> On Wed, 6 Apr 2022, Chuck Lever III wrote:
>
> > Good day, Hugh-
>
> Huh! If you were really wishing me a good day, would you tell me this ;-?
>
> >
> > I noticed that several fsx-related tests in the xfstests suite are
> > failing after updating my NFS server to v5.18-rc1. I normally test
> > against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export
> > that sees these new failures:
> >
[...]
> > generic/075 fails almost immediately without any NFS-level errors.
> > Likely this is data corruption rather than an overt I/O error.
>
> That's sad.  Thanks for bisecting and reporting.  Sorry for the nuisance.
>
> I suspect this patch is heading for a revert, because I shall not have
> time to debug and investigate.  Cc'ing fsdevel and a few people who have
> an interest in it, to warn of that likely upcoming revert.
>
> But if it's okay with everyone, please may we leave it in for -rc2?
> Given that having it in -rc1 already smoked out another issue (problem
> of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping
> it in a little longer might smoke out even more.
>
> The xfstests info above doesn't actually tell very much, beyond that
> generic/075 generic/091 generic/112 generic/127, each a test with fsx,
> all fall at their first hurdle.  If you have time, please rerun and
> tar up the results/generic directory (maybe filter just those failing)
> and send as attachment.  But don't go to any trouble, it's unlikely
> that I shall even untar it - it would be mainly to go on record if
> anyone has time to look into it later.  And, frankly, it's unlikely
> to tell us anything more enlightening, than that the data seen was
> not as expected: which we do already know.
>
> I've had no problem with xfstests generic 075,091,112,127 testing
> tmpfs here, not before and not in the month or two I've had that
> patch in: so it's something in the way that NFS exercises tmpfs
> that reveals it.  If I had time to duplicate your procedure, I'd be
> asking for detailed instructions: but no, I won't have a chance.
>
> But I can sit here and try to guess.  I notice fs/nfsd checks
> file->f_op->splice_read, and employs fallback if not available:
> if you have time, please try rerunning those xfstests on an -rc1
> kernel, but with mm/shmem.c's .splice_read line commented out.
> My guess is that will then pass the tests, and we shall know more.
>
> What could be going wrong there?  I've thought of two possibilities.
> A minor, hopefully easily fixed, issue would be if fs/nfsd has
> trouble with seeing the same page twice in a row: since tmpfs is
> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> caught sight of code which looks to see if the latest page is the
> same as the one before.  It's easy to imagine that might go wrong.

When I worked at Veritas, data corruption over NFS was hit when
sending the same page in succession.  This was triggered via VxFS's
shared page cache, after a file had been dedup'ed.
I can't remember all the details (~15yrs ago), but the core issue was
skb_can_coalesce() returning a false-positive for the 'same page' case
(no check for crossing a page boundary).

> A more difficult issue would be, if fsx is racing writes and reads,
> in a way that it can guarantee the correct result, but that correct
> result is no longer delivered: because the writes go into freshly
> allocated tmpfs cache pages, while reads are still delivering
> stale ZERO_PAGEs from the pipe.  I'm hazy on the guarantees there.
>
> But unless someone has time to help out, we're heading for a revert.
>
> Thanks,
> Hugh

Cheers,
Mark



* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-07  0:18 ` Hugh Dickins
  2022-04-07  4:25   ` Mark Hemment
@ 2022-04-07 19:24   ` Chuck Lever III
  2022-04-07 22:26     ` Hugh Dickins
  1 sibling, 1 reply; 11+ messages in thread
From: Chuck Lever III @ 2022-04-07 19:24 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner,
	Christoph Hellwig, Darrick J. Wong, Linux-MM,
	Linux NFS Mailing List, linux-fsdevel



> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote:
> 
> On Wed, 6 Apr 2022, Chuck Lever III wrote:
> 
>> Good day, Hugh-
> 
> Huh! If you were really wishing me a good day, would you tell me this ;-?
> 
>> 
>> I noticed that several fsx-related tests in the xfstests suite are
>> failing after updating my NFS server to v5.18-rc1. I normally test
>> against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export
>> that sees these new failures:
>> 
>> generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad)
>>    --- tests/generic/075.out	2014-02-13 15:40:45.000000000 -0500
>>    +++ /home/cel/src/xfstests/results//generic/075.out.bad	2022-04-05 16:39:59.145991520 -0400
>>    @@ -4,15 +4,5 @@
>>     -----------------------------------------------
>>     fsx.0 : -d -N numops -S 0
>>     -----------------------------------------------
>>    -
>>    ------------------------------------------------
>>    -fsx.1 : -d -N numops -S 0 -x
>>    ------------------------------------------------
>>    ...
>>    (Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad'  to see the entire diff)
>> 
>> generic/091 9s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad)
>>    --- tests/generic/091.out	2014-02-13 15:40:45.000000000 -0500
>>    +++ /home/cel/src/xfstests/results//generic/091.out.bad	2022-04-05 16:41:24.329063277 -0400
>>    @@ -1,7 +1,75 @@
>>     QA output created by 091
>>     fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>>    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>>    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>>    -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>>    -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>>    -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
>>    ...
>>    (Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad'  to see the entire diff)
>> 
>> generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad)
>>    --- tests/generic/112.out	2014-02-13 15:40:45.000000000 -0500
>>    +++ /home/cel/src/xfstests/results//generic/112.out.bad	2022-04-05 16:41:38.511075170 -0400
>>    @@ -4,15 +4,4 @@
>>     -----------------------------------------------
>>     fsx.0 : -A -d -N numops -S 0
>>     -----------------------------------------------
>>    -
>>    ------------------------------------------------
>>    -fsx.1 : -A -d -N numops -S 0 -x
>>    ------------------------------------------------
>>    ...
>>    (Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad'  to see the entire diff)
>> 
>> generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad)
>>    --- tests/generic/127.out	2016-08-28 12:16:20.000000000 -0400
>>    +++ /home/cel/src/xfstests/results//generic/127.out.bad	2022-04-05 16:42:07.655099652 -0400
>>    @@ -4,10 +4,198 @@
>>     === FSX Light Mode, Memory Mapping ===
>>     All 100000 operations completed A-OK!
>>     === FSX Standard Mode, No Memory Mapping ===
>>    -All 100000 operations completed A-OK!
>>    +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
>>    +READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap
>>    +OFFSET	GOOD	BAD	RANGE
>>    ...
>>    (Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad'  to see the entire diff)
>> 
>> I bisected the problem to:
>> 
>>  56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
>> 
>> generic/075 fails almost immediately without any NFS-level errors.
>> Likely this is data corruption rather than an overt I/O error.
> 
> That's sad.  Thanks for bisecting and reporting.  Sorry for the nuisance.
> 
> I suspect this patch is heading for a revert, because I shall not have
> time to debug and investigate.  Cc'ing fsdevel and a few people who have
> an interest in it, to warn of that likely upcoming revert.
> 
> But if it's okay with everyone, please may we leave it in for -rc2?
> Given that having it in -rc1 already smoked out another issue (problem
> of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping
> it in a little longer might smoke out even more.
> 
> The xfstests info above doesn't actually tell very much, beyond that
> generic/075 generic/091 generic/112 generic/127, each a test with fsx,
> all fall at their first hurdle.  If you have time, please rerun and
> tar up the results/generic directory (maybe filter just those failing)
> and send as attachment.  But don't go to any trouble, it's unlikely
> that I shall even untar it - it would be mainly to go on record if
> anyone has time to look into it later.  And, frankly, it's unlikely
> to tell us anything more enlightening, than that the data seen was
> not as expected: which we do already know.
> 
> I've had no problem with xfstests generic 075,091,112,127 testing
> tmpfs here, not before and not in the month or two I've had that
> patch in: so it's something in the way that NFS exercises tmpfs
> that reveals it.  If I had time to duplicate your procedure, I'd be
> asking for detailed instructions: but no, I won't have a chance.
> 
> But I can sit here and try to guess.  I notice fs/nfsd checks
> file->f_op->splice_read, and employs fallback if not available:
> if you have time, please try rerunning those xfstests on an -rc1
> kernel, but with mm/shmem.c's .splice_read line commented out.
> My guess is that will then pass the tests, and we shall know more.

This seemed like the most probative next step, so I commented
out the .splice_read call-out in mm/shmem.c and ran the tests
again. Yes, that change enables the fsx-related tests to pass
as expected.


> What could be going wrong there?  I've thought of two possibilities.
> A minor, hopefully easily fixed, issue would be if fs/nfsd has
> trouble with seeing the same page twice in a row: since tmpfs is
> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> caught sight of code which looks to see if the latest page is the
> same as the one before.  It's easy to imagine that might go wrong.

Are you referring to this function in fs/nfsd/vfs.c ?

 847 static int
 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
 849                   struct splice_desc *sd)
 850 {
 851         struct svc_rqst *rqstp = sd->u.data;
 852         struct page **pp = rqstp->rq_next_page;
 853         struct page *page = buf->page;
 854 
 855         if (rqstp->rq_res.page_len == 0) {
 856                 svc_rqst_replace_page(rqstp, page);
 857                 rqstp->rq_res.page_base = buf->offset;
 858         } else if (page != pp[-1]) {
 859                 svc_rqst_replace_page(rqstp, page);
 860         }
 861         rqstp->rq_res.page_len += sd->len;
 862 
 863         return sd->len;
 864 }

rq_next_page should point to the first unused element of
rqstp->rq_pages, so IIUC that check is looking for the
final page that is part of the READ payload.

But that does suggest that if page -> ZERO_PAGE and so does
pp[-1], then svc_rqst_replace_page() would not be invoked.


> A more difficult issue would be, if fsx is racing writes and reads,
> in a way that it can guarantee the correct result, but that correct
> result is no longer delivered: because the writes go into freshly
> allocated tmpfs cache pages, while reads are still delivering
> stale ZERO_PAGEs from the pipe.  I'm hazy on the guarantees there.
> 
> But unless someone has time to help out, we're heading for a revert.
> 
> Thanks,
> Hugh

--
Chuck Lever






* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-07  4:25   ` Mark Hemment
@ 2022-04-07 22:04     ` Hugh Dickins
  0 siblings, 0 replies; 11+ messages in thread
From: Hugh Dickins @ 2022-04-07 22:04 UTC (permalink / raw)
  To: Mark Hemment
  Cc: Hugh Dickins, Chuck Lever III, Andrew Morton, Patrice CHOTARD,
	Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
	Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel

On Thu, 7 Apr 2022, Mark Hemment wrote:
> On Thu, 7 Apr 2022 at 01:19, Hugh Dickins <hughd@google.com> wrote:
> >
> > What could be going wrong there?  I've thought of two possibilities.
> > A minor, hopefully easily fixed, issue would be if fs/nfsd has
> > trouble with seeing the same page twice in a row: since tmpfs is
> > now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> > caught sight of code which looks to see if the latest page is the
> > same as the one before.  It's easy to imagine that might go wrong.
> 
> When I worked at Veritas, data corruption over NFS was hit when
> sending the same page in succession.  This was triggered via VxFS's
> shared page cache, after a file had been dedup'ed.
> I can't remember all the details (~15yrs ago), but the core issue was
> skb_can_coalesce() returning a false-positive for the 'same page' case
> (no check for crossing a page boundary).

Very useful input: thank you Mark.

That tells me that, even if we spot a "bug" in fs/nfsd, there could
be various other places which get confused if given the ZERO_PAGE(0)
twice in a row.  I won't be able to find them all, and I cannot go
on risking that.

At first I thought of using ZERO_PAGE(0) for even pgoffs, alternating
with a tmpfs-specific zeroed page for odd pgoffs.  But I was forgetting
that copying ZERO_PAGE(0) is itself just a workaround to avoid the 28%
slower iov_iter_zero().

I think I have a reasonable hybrid: will reply to Chuck now with that.

Hugh



* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-07 19:24   ` Chuck Lever III
@ 2022-04-07 22:26     ` Hugh Dickins
  2022-04-07 23:45       ` Chuck Lever III
  2022-04-08 16:10       ` Chuck Lever III
  0 siblings, 2 replies; 11+ messages in thread
From: Hugh Dickins @ 2022-04-07 22:26 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Hugh Dickins, Andrew Morton, Mark Hemment, Patrice CHOTARD,
	Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
	Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel

On Thu, 7 Apr 2022, Chuck Lever III wrote:
> > On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote:
> > 
> > But I can sit here and try to guess.  I notice fs/nfsd checks
> > file->f_op->splice_read, and employs fallback if not available:
> > if you have time, please try rerunning those xfstests on an -rc1
> > kernel, but with mm/shmem.c's .splice_read line commented out.
> > My guess is that will then pass the tests, and we shall know more.
> 
> This seemed like the most probative next step, so I commented
> out the .splice_read call-out in mm/shmem.c and ran the tests
> again. Yes, that change enables the fsx-related tests to pass
> as expected.

Great, thank you for trying that.

> 
> > What could be going wrong there?  I've thought of two possibilities.
> > A minor, hopefully easily fixed, issue would be if fs/nfsd has
> > trouble with seeing the same page twice in a row: since tmpfs is
> > now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> > caught sight of code which looks to see if the latest page is the
> > same as the one before.  It's easy to imagine that might go wrong.
> 
> Are you referring to this function in fs/nfsd/vfs.c ?

I think that was it, didn't pay much attention.

> 
>  847 static int
>  848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>  849                   struct splice_desc *sd)
>  850 {
>  851         struct svc_rqst *rqstp = sd->u.data;
>  852         struct page **pp = rqstp->rq_next_page;
>  853         struct page *page = buf->page;
>  854 
>  855         if (rqstp->rq_res.page_len == 0) {
>  856                 svc_rqst_replace_page(rqstp, page);
>  857                 rqstp->rq_res.page_base = buf->offset;
>  858         } else if (page != pp[-1]) {
>  859                 svc_rqst_replace_page(rqstp, page);
>  860         }
>  861         rqstp->rq_res.page_len += sd->len;
>  862 
>  863         return sd->len;
>  864 }
> 
> rq_next_page should point to the first unused element of
> rqstp->rq_pages, so IIUC that check is looking for the
> final page that is part of the READ payload.
> 
> But that does suggest that if page -> ZERO_PAGE and so does
> pp[-1], then svc_rqst_replace_page() would not be invoked.

I still haven't studied the logic there: Mark's input made it clear
that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
there could be expectations of uniqueness in other places too.

> 
> > A more difficult issue would be, if fsx is racing writes and reads,
> > in a way that it can guarantee the correct result, but that correct
> > result is no longer delivered: because the writes go into freshly
> > allocated tmpfs cache pages, while reads are still delivering
> > stale ZERO_PAGEs from the pipe.  I'm hazy on the guarantees there.
> > 
> > But unless someone has time to help out, we're heading for a revert.

We might be able to avoid that revert, and go the whole way to using
iov_iter_zero() instead.  But the significant slowness of clear_user()
relative to copy to user, on x86 at least, does ask for a hybrid.

Suggested patch below, on top of 5.18-rc1, passes my own testing:
but will it pass yours?  It seems to me safe, and as fast as before,
but we don't know yet if this iov_iter_zero() works right for you.
Chuck, please give it a go and let us know.

(Don't forget to restore mm/shmem.c's .splice_read first!  And if
this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
in the same patch, fixing the other regression, without recourse to
#ifdefs or arch mods.)

Thanks!
Hugh

--- 5.18-rc1/mm/shmem.c
+++ linux/mm/shmem.c
@@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		pgoff_t end_index;
 		unsigned long nr, ret;
 		loff_t i_size = i_size_read(inode);
-		bool got_page;
 
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
@@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (!offset)
 				mark_page_accessed(page);
-			got_page = true;
+			/*
+			 * Ok, we have the page, and it's up-to-date, so
+			 * now we can copy it to user space...
+			 */
+			ret = copy_page_to_iter(page, offset, nr, to);
+			put_page(page);
+
+		} else if (iter_is_iovec(to)) {
+			/*
+			 * Copy to user tends to be so well optimized, but
+			 * clear_user() not so much, that it is noticeably
+			 * faster to copy the zero page instead of clearing.
+			 */
+			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
 		} else {
-			page = ZERO_PAGE(0);
-			got_page = false;
+			/*
+			 * But submitting the same page twice in a row to
+			 * splice() - or others? - can result in confusion:
+			 * so don't attempt that optimization on pipes etc.
+			 */
+			ret = iov_iter_zero(nr, to);
 		}
 
-		/*
-		 * Ok, we have the page, and it's up-to-date, so
-		 * now we can copy it to user space...
-		 */
-		ret = copy_page_to_iter(page, offset, nr, to);
 		retval += ret;
 		offset += ret;
 		index += offset >> PAGE_SHIFT;
 		offset &= ~PAGE_MASK;
 
-		if (got_page)
-			put_page(page);
 		if (!iov_iter_count(to))
 			break;
 		if (ret < nr) {



* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-07 22:26     ` Hugh Dickins
@ 2022-04-07 23:45       ` Chuck Lever III
  2022-04-08 14:38         ` Mark Hemment
  2022-04-08 16:10       ` Chuck Lever III
  1 sibling, 1 reply; 11+ messages in thread
From: Chuck Lever III @ 2022-04-07 23:45 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mark Hemment, Patrice CHOTARD, Mikulas Patocka,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM,
	Linux NFS Mailing List, linux-fsdevel



> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote:
> 
> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote:
>>> 
>>> But I can sit here and try to guess.  I notice fs/nfsd checks
>>> file->f_op->splice_read, and employs fallback if not available:
>>> if you have time, please try rerunning those xfstests on an -rc1
>>> kernel, but with mm/shmem.c's .splice_read line commented out.
>>> My guess is that will then pass the tests, and we shall know more.
>> 
>> This seemed like the most probative next step, so I commented
>> out the .splice_read call-out in mm/shmem.c and ran the tests
>> again. Yes, that change enables the fsx-related tests to pass
>> as expected.
> 
> Great, thank you for trying that.
> 
>> 
>>> What could be going wrong there?  I've thought of two possibilities.
>>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
>>> trouble with seeing the same page twice in a row: since tmpfs is
>>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
>>> caught sight of code which looks to see if the latest page is the
>>> same as the one before.  It's easy to imagine that might go wrong.
>> 
>> Are you referring to this function in fs/nfsd/vfs.c ?
> 
> I think that was it, didn't pay much attention.

This code seems to have been the issue. I added a little test
to see if @page pointed to ZERO_PAGE(0) and now the tests
pass as expected.
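
Roughly speaking - this is a paraphrase, not the final patch - the
test turns the second arm into something like:

		} else if (page != pp[-1] || page == ZERO_PAGE(0)) {
			svc_rqst_replace_page(rqstp, page);
		}

so a repeated ZERO_PAGE is still handed to svc_rqst_replace_page()
rather than being silently skipped.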


>> 847 static int
>> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> 849                   struct splice_desc *sd)
>> 850 {
>> 851         struct svc_rqst *rqstp = sd->u.data;
>> 852         struct page **pp = rqstp->rq_next_page;
>> 853         struct page *page = buf->page;
>> 854 
>> 855         if (rqstp->rq_res.page_len == 0) {
>> 856                 svc_rqst_replace_page(rqstp, page);
>> 857                 rqstp->rq_res.page_base = buf->offset;
>> 858         } else if (page != pp[-1]) {
>> 859                 svc_rqst_replace_page(rqstp, page);
>> 860         }
>> 861         rqstp->rq_res.page_len += sd->len;
>> 862 
>> 863         return sd->len;
>> 864 }
>> 
>> rq_next_page should point to the first unused element of
>> rqstp->rq_pages, so IIUC that check is looking for the
>> final page that is part of the READ payload.
>> 
>> But that does suggest that if page -> ZERO_PAGE and so does
>> pp[-1], then svc_rqst_replace_page() would not be invoked.
> 
> I still haven't studied the logic there: Mark's input made it clear
> that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> there could be expectations of uniqueness in other places too.

I can't really attest to Mark's comment, but...

After studying nfsd_splice_actor() I can't see any reason
except cleverness and technical debt for this particular
check. I'm now testing a patch that removes the check and
simplifies this function -- it seems to be a reasonable
clean-up whether you keep 56a8c8eb1eaf or choose to revert it.
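
As a sketch of the direction - not necessarily the exact patch, and
assuming svc_rqst_replace_page() takes its own reference on the page
it is given - it reduces to calling svc_rqst_replace_page()
unconditionally:

	static int
	nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
			  struct splice_desc *sd)
	{
		struct svc_rqst *rqstp = sd->u.data;

		/* account every pipe buffer's page, so a page repeated
		 * by the filesystem needs no special casing */
		svc_rqst_replace_page(rqstp, buf->page);
		if (rqstp->rq_res.page_len == 0)
			rqstp->rq_res.page_base = buf->offset;
		rqstp->rq_res.page_len += sd->len;

		return sd->len;
	}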


>>> A more difficult issue would be, if fsx is racing writes and reads,
>>> in a way that it can guarantee the correct result, but that correct
>>> result is no longer delivered: because the writes go into freshly
>>> allocated tmpfs cache pages, while reads are still delivering
>>> stale ZERO_PAGEs from the pipe.  I'm hazy on the guarantees there.
>>> 
>>> But unless someone has time to help out, we're heading for a revert.
> 
> We might be able to avoid that revert, and go the whole way to using
> iov_iter_zero() instead.  But the significant slowness of clear_user()
> relative to copy to user, on x86 at least, does ask for a hybrid.
> 
> Suggested patch below, on top of 5.18-rc1, passes my own testing:
> but will it pass yours?  It seems to me safe, and as fast as before,
> but we don't know yet if this iov_iter_zero() works right for you.
> Chuck, please give it a go and let us know.
> 
> (Don't forget to restore mm/shmem.c's .splice_read first!  And if
> this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> in the same patch, fixing the other regression, without recourse to
> #ifdefs or arch mods.)

Sure, I will try this out first thing tomorrow.

One thing that occurs to me is that for NFS/RDMA, having a
page full of zeroes that is already DMA-mapped would be a
nice optimization on the sender side (on the client for an
NFS WRITE and on the server for an NFS READ). The transport
would have to set up a scatter-gather list containing a
bunch of entries that reference the same page...
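
Purely illustrative - hypothetical, not code that exists anywhere -
the scatter-gather list would simply name the one zero page over and
over, along the lines of:

	struct scatterlist sgl[4];
	int i;

	/* every entry points at the same pre-zeroed page */
	sg_init_table(sgl, ARRAY_SIZE(sgl));
	for (i = 0; i < ARRAY_SIZE(sgl); i++)
		sg_set_page(&sgl[i], ZERO_PAGE(0), PAGE_SIZE, 0);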

</musing>


> Thanks!
> Hugh
> 
> --- 5.18-rc1/mm/shmem.c
> +++ linux/mm/shmem.c
> @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> 		pgoff_t end_index;
> 		unsigned long nr, ret;
> 		loff_t i_size = i_size_read(inode);
> -		bool got_page;
> 
> 		end_index = i_size >> PAGE_SHIFT;
> 		if (index > end_index)
> @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> 			 */
> 			if (!offset)
> 				mark_page_accessed(page);
> -			got_page = true;
> +			/*
> +			 * Ok, we have the page, and it's up-to-date, so
> +			 * now we can copy it to user space...
> +			 */
> +			ret = copy_page_to_iter(page, offset, nr, to);
> +			put_page(page);
> +
> +		} else if (iter_is_iovec(to)) {
> +			/*
> +			 * Copy to user tends to be so well optimized, but
> +			 * clear_user() not so much, that it is noticeably
> +			 * faster to copy the zero page instead of clearing.
> +			 */
> +			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
> 		} else {
> -			page = ZERO_PAGE(0);
> -			got_page = false;
> +			/*
> +			 * But submitting the same page twice in a row to
> +			 * splice() - or others? - can result in confusion:
> +			 * so don't attempt that optimization on pipes etc.
> +			 */
> +			ret = iov_iter_zero(nr, to);
> 		}
> 
> -		/*
> -		 * Ok, we have the page, and it's up-to-date, so
> -		 * now we can copy it to user space...
> -		 */
> -		ret = copy_page_to_iter(page, offset, nr, to);
> 		retval += ret;
> 		offset += ret;
> 		index += offset >> PAGE_SHIFT;
> 		offset &= ~PAGE_MASK;
> 
> -		if (got_page)
> -			put_page(page);
> 		if (!iov_iter_count(to))
> 			break;
> 		if (ret < nr) {

--
Chuck Lever






* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-07 23:45       ` Chuck Lever III
@ 2022-04-08 14:38         ` Mark Hemment
  0 siblings, 0 replies; 11+ messages in thread
From: Mark Hemment @ 2022-04-08 14:38 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Hugh Dickins, Andrew Morton, Patrice CHOTARD, Mikulas Patocka,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM,
	Linux NFS Mailing List, linux-fsdevel

On Fri, 8 Apr 2022 at 00:45, Chuck Lever III <chuck.lever@oracle.com> wrote:
> > On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote:
> >
> > On Thu, 7 Apr 2022, Chuck Lever III wrote:
> >>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote:
> >>>
> >>> But I can sit here and try to guess.  I notice fs/nfsd checks
> >>> file->f_op->splice_read, and employs fallback if not available:
> >>> if you have time, please try rerunning those xfstests on an -rc1
> >>> kernel, but with mm/shmem.c's .splice_read line commented out.
> >>> My guess is that will then pass the tests, and we shall know more.
> >>
> >> This seemed like the most probative next step, so I commented
> >> out the .splice_read call-out in mm/shmem.c and ran the tests
> >> again. Yes, that change enables the fsx-related tests to pass
> >> as expected.
> >
> > Great, thank you for trying that.
> >
> >>
> >>> What could be going wrong there?  I've thought of two possibilities.
> >>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
> >>> trouble with seeing the same page twice in a row: since tmpfs is
> >>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> >>> caught sight of code which looks to see if the latest page is the
> >>> same as the one before.  It's easy to imagine that might go wrong.
> >>
> >> Are you referring to this function in fs/nfsd/vfs.c ?
> >
> > I think that was it, didn't pay much attention.
>
> This code seems to have been the issue. I added a little test
> to see if @page pointed to ZERO_PAGE(0) and now the tests
> pass as expected.
>
>
> >> 847 static int
> >> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> >> 849                   struct splice_desc *sd)
> >> 850 {
> >> 851         struct svc_rqst *rqstp = sd->u.data;
> >> 852         struct page **pp = rqstp->rq_next_page;
> >> 853         struct page *page = buf->page;
> >> 854
> >> 855         if (rqstp->rq_res.page_len == 0) {
> >> 856                 svc_rqst_replace_page(rqstp, page);
> >> 857                 rqstp->rq_res.page_base = buf->offset;
> >> 858         } else if (page != pp[-1]) {
> >> 859                 svc_rqst_replace_page(rqstp, page);
> >> 860         }
> >> 861         rqstp->rq_res.page_len += sd->len;
> >> 862
> >> 863         return sd->len;
> >> 864 }
> >>
> >> rq_next_page should point to the first unused element of
> >> rqstp->rq_pages, so IIUC that check is looking for the
> >> final page that is part of the READ payload.
> >>
> >> But that does suggest that if page -> ZERO_PAGE and so does
> >> pp[-1], then svc_rqst_replace_page() would not be invoked.
> >
> > I still haven't studied the logic there: Mark's input made it clear
> > that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> > there could be expectations of uniqueness in other places too.
>
> I can't really attest to Mark's comment, but...
>
> After studying nfsd_splice_actor() I can't see any reason
> except cleverness and technical debt for this particular
> check. I have a patch that removes the check and simplifies
> this function that I'm testing now -- it seems to be a
> reasonable clean-up whether you keep 56a8c8eb1eaf or
> choose to revert it.

Agreed nfsd_splice_actor() is broken for the same-page case.
That function hasn't changed in logic since introduction.  So when
VxFS triggered this issue, back in 2007/2008, it must have had the
same problem with this actor (same with its predecessor; ->sendfile).
I don't remember.  But skb_can_coalesce() sticks in my mind for some
reason.  Would jumbo frames be a good stress for can_coalesce with
same-page?  Or, as Hugh is proposing to avoid sending ZERO_PAGE,
ignore this for now?


> >>> A more difficult issue would be, if fsx is racing writes and reads,
> >>> in a way that it can guarantee the correct result, but that correct
> >>> result is no longer delivered: because the writes go into freshly
> >>> allocated tmpfs cache pages, while reads are still delivering
> >>> stale ZERO_PAGEs from the pipe.  I'm hazy on the guarantees there.
> >>>
> >>> But unless someone has time to help out, we're heading for a revert.
> >
> > We might be able to avoid that revert, and go the whole way to using
> > iov_iter_zero() instead.  But the significant slowness of clear_user()
> > relative to copy to user, on x86 at least, does ask for a hybrid.
> >
> > Suggested patch below, on top of 5.18-rc1, passes my own testing:
> > but will it pass yours?  It seems to me safe, and as fast as before,
> > but we don't know yet if this iov_iter_zero() works right for you.
> > Chuck, please give it a go and let us know.
> >
> > (Don't forget to restore mm/shmem.c's .splice_read first!  And if
> > this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> > in the same patch, fixing the other regression, without recourse to
> > #ifdefs or arch mods.)
>
> Sure, I will try this out first thing tomorrow.
>
> One thing that occurs to me is that for NFS/RDMA, having a
> page full of zeroes that is already DMA-mapped would be a
> nice optimization on the sender side (on the client for an
> NFS WRITE and on the server for an NFS READ). The transport
> would have to set up a scatter-gather list containing a
> bunch of entries that reference the same page...
>
> </musing>
>
>
> > Thanks!
> > Hugh
> >
> > --- 5.18-rc1/mm/shmem.c
> > +++ linux/mm/shmem.c
> > @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >               pgoff_t end_index;
> >               unsigned long nr, ret;
> >               loff_t i_size = i_size_read(inode);
> > -             bool got_page;
> >
> >               end_index = i_size >> PAGE_SHIFT;
> >               if (index > end_index)
> > @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >                        */
> >                       if (!offset)
> >                               mark_page_accessed(page);
> > -                     got_page = true;
> > +                     /*
> > +                      * Ok, we have the page, and it's up-to-date, so
> > +                      * now we can copy it to user space...
> > +                      */
> > +                     ret = copy_page_to_iter(page, offset, nr, to);
> > +                     put_page(page);
> > +
> > +             } else if (iter_is_iovec(to)) {
> > +                     /*
> > +                      * Copy to user tends to be so well optimized, but
> > +                      * clear_user() not so much, that it is noticeably
> > +                      * faster to copy the zero page instead of clearing.
> > +                      */
> > +                     ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
> >               } else {
> > -                     page = ZERO_PAGE(0);
> > -                     got_page = false;
> > +                     /*
> > +                      * But submitting the same page twice in a row to
> > +                      * splice() - or others? - can result in confusion:
> > +                      * so don't attempt that optimization on pipes etc.
> > +                      */
> > +                     ret = iov_iter_zero(nr, to);
> >               }
> >
> > -             /*
> > -              * Ok, we have the page, and it's up-to-date, so
> > -              * now we can copy it to user space...
> > -              */
> > -             ret = copy_page_to_iter(page, offset, nr, to);
> >               retval += ret;
> >               offset += ret;
> >               index += offset >> PAGE_SHIFT;
> >               offset &= ~PAGE_MASK;
> >
> > -             if (got_page)
> > -                     put_page(page);
> >               if (!iov_iter_count(to))
> >                       break;
> >               if (ret < nr) {
>
> --
> Chuck Lever

Cheers,
Mark



* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-07 22:26     ` Hugh Dickins
  2022-04-07 23:45       ` Chuck Lever III
@ 2022-04-08 16:10       ` Chuck Lever III
  2022-04-08 19:09         ` Hugh Dickins
  1 sibling, 1 reply; 11+ messages in thread
From: Chuck Lever III @ 2022-04-08 16:10 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mark Hemment, Patrice CHOTARD, Mikulas Patocka,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM,
	Linux NFS Mailing List, linux-fsdevel


> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote:
> 
> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote:
>>> 
>>> But I can sit here and try to guess.  I notice fs/nfsd checks
>>> file->f_op->splice_read, and employs fallback if not available:
>>> if you have time, please try rerunning those xfstests on an -rc1
>>> kernel, but with mm/shmem.c's .splice_read line commented out.
>>> My guess is that will then pass the tests, and we shall know more.
>> 
>> This seemed like the most probative next step, so I commented
>> out the .splice_read call-out in mm/shmem.c and ran the tests
>> again. Yes, that change enables the fsx-related tests to pass
>> as expected.
> 
> Great, thank you for trying that.
> 
>> 
>>> What could be going wrong there?  I've thought of two possibilities.
>>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
>>> trouble with seeing the same page twice in a row: since tmpfs is
>>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
>>> caught sight of code which looks to see if the latest page is the
>>> same as the one before.  It's easy to imagine that might go wrong.
>> 
>> Are you referring to this function in fs/nfsd/vfs.c ?
> 
> I think that was it, didn't pay much attention.
> 
>> 
>> 847 static int
>> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> 849                   struct splice_desc *sd)
>> 850 {
>> 851         struct svc_rqst *rqstp = sd->u.data;
>> 852         struct page **pp = rqstp->rq_next_page;
>> 853         struct page *page = buf->page;
>> 854 
>> 855         if (rqstp->rq_res.page_len == 0) {
>> 856                 svc_rqst_replace_page(rqstp, page);
>> 857                 rqstp->rq_res.page_base = buf->offset;
>> 858         } else if (page != pp[-1]) {
>> 859                 svc_rqst_replace_page(rqstp, page);
>> 860         }
>> 861         rqstp->rq_res.page_len += sd->len;
>> 862 
>> 863         return sd->len;
>> 864 }
>> 
>> rq_next_page should point to the first unused element of
>> rqstp->rq_pages, so IIUC that check is looking for the
>> final page that is part of the READ payload.
>> 
>> But that does suggest that if page -> ZERO_PAGE and so does
>> pp[-1], then svc_rqst_replace_page() would not be invoked.

To put a little more color on this, I think the idea here
is to prevent releasing the same page twice. It might be
possible that NFSD can add the same page to the rq_pages
array more than once, and we don't want to do a double
put_page().

The only time I can think this might happen is if the
READ payload is partially contained in the page that
contains the NFS header. I'm not sure that can ever
happen these days.
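
(For anyone following along: simplified from memory, and the real
thing may batch the release, svc_rqst_replace_page() essentially does

	if (*rqstp->rq_next_page)
		put_page(*rqstp->rq_next_page);
	get_page(page);
	*rqstp->rq_next_page++ = page;

taking its own reference on the page handed in and dropping whatever
previously occupied that slot.)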


> I still haven't studied the logic there: Mark's input made it clear
> that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> there could be expectations of uniqueness in other places too.

So far I haven't seen an issue with skb_can_coalesce().
I will keep an eye out for that.


>>> A more difficult issue would be, if fsx is racing writes and reads,
>>> in a way that it can guarantee the correct result, but that correct
>>> result is no longer delivered: because the writes go into freshly
>>> allocated tmpfs cache pages, while reads are still delivering
>>> stale ZERO_PAGEs from the pipe.  I'm hazy on the guarantees there.
>>> 
>>> But unless someone has time to help out, we're heading for a revert.
> 
> We might be able to avoid that revert, and go the whole way to using
> iov_iter_zero() instead.  But the significant slowness of clear_user()
> relative to copy to user, on x86 at least, does ask for a hybrid.
> 
> Suggested patch below, on top of 5.18-rc1, passes my own testing:
> but will it pass yours?  It seems to me safe, and as fast as before,
> but we don't know yet if this iov_iter_zero() works right for you.
> Chuck, please give it a go and let us know.

Applied to stock v5.18-rc1. The tests pass as expected.


> (Don't forget to restore mm/shmem.c's .splice_read first!  And if
> this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> in the same patch, fixing the other regression, without recourse to
> #ifdefs or arch mods.)
> 
> Thanks!
> Hugh
> 
> --- 5.18-rc1/mm/shmem.c
> +++ linux/mm/shmem.c
> @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> 		pgoff_t end_index;
> 		unsigned long nr, ret;
> 		loff_t i_size = i_size_read(inode);
> -		bool got_page;
> 
> 		end_index = i_size >> PAGE_SHIFT;
> 		if (index > end_index)
> @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> 			 */
> 			if (!offset)
> 				mark_page_accessed(page);
> -			got_page = true;
> +			/*
> +			 * Ok, we have the page, and it's up-to-date, so
> +			 * now we can copy it to user space...
> +			 */
> +			ret = copy_page_to_iter(page, offset, nr, to);
> +			put_page(page);
> +
> +		} else if (iter_is_iovec(to)) {
> +			/*
> +			 * Copy to user tends to be so well optimized, but
> +			 * clear_user() not so much, that it is noticeably
> +			 * faster to copy the zero page instead of clearing.
> +			 */
> +			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
> 		} else {
> -			page = ZERO_PAGE(0);
> -			got_page = false;
> +			/*
> +			 * But submitting the same page twice in a row to
> +			 * splice() - or others? - can result in confusion:
> +			 * so don't attempt that optimization on pipes etc.
> +			 */
> +			ret = iov_iter_zero(nr, to);
> 		}
> 
> -		/*
> -		 * Ok, we have the page, and it's up-to-date, so
> -		 * now we can copy it to user space...
> -		 */
> -		ret = copy_page_to_iter(page, offset, nr, to);
> 		retval += ret;
> 		offset += ret;
> 		index += offset >> PAGE_SHIFT;
> 		offset &= ~PAGE_MASK;
> 
> -		if (got_page)
> -			put_page(page);
> 		if (!iov_iter_count(to))
> 			break;
> 		if (ret < nr) {

--
Chuck Lever






* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-08 16:10       ` Chuck Lever III
@ 2022-04-08 19:09         ` Hugh Dickins
  2022-04-08 19:52           ` Chuck Lever III
  0 siblings, 1 reply; 11+ messages in thread
From: Hugh Dickins @ 2022-04-08 19:09 UTC (permalink / raw)
  To: Chuck Lever III
  Cc: Hugh Dickins, Andrew Morton, Mark Hemment, Patrice CHOTARD,
	Mikulas Patocka, Lukas Czerner, Christoph Hellwig,
	Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel

On Fri, 8 Apr 2022, Chuck Lever III wrote:
> > On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 7 Apr 2022, Chuck Lever III wrote:
> >> 
> >> 847 static int
> >> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> >> 849                   struct splice_desc *sd)
> >> 850 {
> >> 851         struct svc_rqst *rqstp = sd->u.data;
> >> 852         struct page **pp = rqstp->rq_next_page;
> >> 853         struct page *page = buf->page;
> >> 854 
> >> 855         if (rqstp->rq_res.page_len == 0) {
> >> 856                 svc_rqst_replace_page(rqstp, page);
> >> 857                 rqstp->rq_res.page_base = buf->offset;
> >> 858         } else if (page != pp[-1]) {
> >> 859                 svc_rqst_replace_page(rqstp, page);
> >> 860         }
> >> 861         rqstp->rq_res.page_len += sd->len;
> >> 862 
> >> 863         return sd->len;
> >> 864 }
> >> 
> >> rq_next_page should point to the first unused element of
> >> rqstp->rq_pages, so IIUC that check is looking for the
> >> final page that is part of the READ payload.
> >> 
> >> But that does suggest that if page -> ZERO_PAGE and so does
> >> pp[-1], then svc_rqst_replace_page() would not be invoked.
> 
> To put a little more color on this, I think the idea here
> is to prevent releasing the same page twice. It might be
> possible that NFSD can add the same page to the rq_pages
> array more than once, and we don't want to do a double
> put_page().
> 
> The only time I can think this might happen is if the
> READ payload is partially contained in the page that
> contains the NFS header. I'm not sure that can ever
> happen these days.

I'd have thought that if a page were repeated, then its refcount would
have been raised twice, and so require a double put_page().  But it's
no concern of mine.  The only thing I'd say is, if you do find a good
way to robustify that code for duplicates, please don't make it
conditional on ZERO_PAGE - that's just a special case which I
mis-introduced and is now about to go away.

> > 
> > We might be able to avoid that revert, and go the whole way to using
> > iov_iter_zero() instead.  But the significant slowness of clear_user()
> > relative to copy to user, on x86 at least, does ask for a hybrid.
> > 
> > Suggested patch below, on top of 5.18-rc1, passes my own testing:
> > but will it pass yours?  It seems to me safe, and as fast as before,
> > but we don't know yet if this iov_iter_zero() works right for you.
> > Chuck, please give it a go and let us know.
> 
> Applied to stock v5.18-rc1. The tests pass as expected.

Great, thanks a lot, I'll move ahead with sending akpm the patch
with a proper commit message.

Hugh



* Re: Regression in xfstests on tmpfs-backed NFS exports
  2022-04-08 19:09         ` Hugh Dickins
@ 2022-04-08 19:52           ` Chuck Lever III
  0 siblings, 0 replies; 11+ messages in thread
From: Chuck Lever III @ 2022-04-08 19:52 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Andrew Morton, Mark Hemment, Patrice CHOTARD, Mikulas Patocka,
	Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM,
	Linux NFS Mailing List, linux-fsdevel



> On Apr 8, 2022, at 3:09 PM, Hugh Dickins <hughd@google.com> wrote:
> 
> On Fri, 8 Apr 2022, Chuck Lever III wrote:
>>> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote:
>>> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>>> 
>>>> 847 static int
>>>> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>>>> 849                   struct splice_desc *sd)
>>>> 850 {
>>>> 851         struct svc_rqst *rqstp = sd->u.data;
>>>> 852         struct page **pp = rqstp->rq_next_page;
>>>> 853         struct page *page = buf->page;
>>>> 854 
>>>> 855         if (rqstp->rq_res.page_len == 0) {
>>>> 856                 svc_rqst_replace_page(rqstp, page);
>>>> 857                 rqstp->rq_res.page_base = buf->offset;
>>>> 858         } else if (page != pp[-1]) {
>>>> 859                 svc_rqst_replace_page(rqstp, page);
>>>> 860         }
>>>> 861         rqstp->rq_res.page_len += sd->len;
>>>> 862 
>>>> 863         return sd->len;
>>>> 864 }
>>>> 
>>>> rq_next_page should point to the first unused element of
>>>> rqstp->rq_pages, so IIUC that check is looking for the
>>>> final page that is part of the READ payload.
>>>> 
>>>> But that does suggest that if page -> ZERO_PAGE and so does
>>>> pp[-1], then svc_rqst_replace_page() would not be invoked.
>> 
>> To put a little more color on this, I think the idea here
>> is to prevent releasing the same page twice. It might be
>> possible that NFSD can add the same page to the rq_pages
>> array more than once, and we don't want to do a double
>> put_page().
>> 
>> The only time I can think this might happen is if the
>> READ payload is partially contained in the page that
>> contains the NFS header. I'm not sure that can ever
>> happen these days.
> 
> I'd have thought that if a page were repeated, then its refcount would
> have been raised twice, and so require a double put_page().  But it's
> no concern of mine.  The only thing I'd say is, if you do find a good
> way to robustify that code for duplicates, please don't make it
> conditional on ZERO_PAGE - that's just a special case which I
> mis-introduced and is now about to go away.

100% agree on both counts: not sure (yet) why get_page() was not
used here, and a special case for ZERO_PAGE would be a band-aid.

Anyway, I haven't found a case yet where a duplicate struct page
appears in rq_pages with current kernels.


--
Chuck Lever




