* Regression in xfstests on tmpfs-backed NFS exports @ 2022-04-06 17:18 Chuck Lever III 2022-04-07 0:18 ` Hugh Dickins 0 siblings, 1 reply; 11+ messages in thread From: Chuck Lever III @ 2022-04-06 17:18 UTC (permalink / raw) To: Hugh Dickins; +Cc: Linux-MM, Linux NFS Mailing List Good day, Hugh- I noticed that several fsx-related tests in the xfstests suite are failing after updating my NFS server to v5.18-rc1. I normally test against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export that sees these new failures: generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad) --- tests/generic/075.out 2014-02-13 15:40:45.000000000 -0500 +++ /home/cel/src/xfstests/results//generic/075.out.bad 2022-04-05 16:39:59.145991520 -0400 @@ -4,15 +4,5 @@ ----------------------------------------------- fsx.0 : -d -N numops -S 0 ----------------------------------------------- - ------------------------------------------------ -fsx.1 : -d -N numops -S 0 -x ------------------------------------------------ ... (Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad' to see the entire diff) generic/091 9s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad) --- tests/generic/091.out 2014-02-13 15:40:45.000000000 -0500 +++ /home/cel/src/xfstests/results//generic/091.out.bad 2022-04-05 16:41:24.329063277 -0400 @@ -1,7 +1,75 @@ QA output created by 091 fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W ... 
(Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad' to see the entire diff) generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad) --- tests/generic/112.out 2014-02-13 15:40:45.000000000 -0500 +++ /home/cel/src/xfstests/results//generic/112.out.bad 2022-04-05 16:41:38.511075170 -0400 @@ -4,15 +4,4 @@ ----------------------------------------------- fsx.0 : -A -d -N numops -S 0 ----------------------------------------------- - ------------------------------------------------ -fsx.1 : -A -d -N numops -S 0 -x ------------------------------------------------ ... (Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad' to see the entire diff) generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad) --- tests/generic/127.out 2016-08-28 12:16:20.000000000 -0400 +++ /home/cel/src/xfstests/results//generic/127.out.bad 2022-04-05 16:42:07.655099652 -0400 @@ -4,10 +4,198 @@ === FSX Light Mode, Memory Mapping === All 100000 operations completed A-OK! === FSX Standard Mode, No Memory Mapping === -All 100000 operations completed A-OK! +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap +READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap +OFFSET GOOD BAD RANGE ... (Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad' to see the entire diff) I bisected the problem to: 56a8c8eb1eaf ("tmpfs: do not allocate pages on read") generic/075 fails almost immediately without any NFS-level errors. Likely this is data corruption rather than an overt I/O error. -- Chuck Lever ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-06 17:18 Regression in xfstests on tmpfs-backed NFS exports Chuck Lever III @ 2022-04-07 0:18 ` Hugh Dickins 2022-04-07 4:25 ` Mark Hemment 2022-04-07 19:24 ` Chuck Lever III 0 siblings, 2 replies; 11+ messages in thread From: Hugh Dickins @ 2022-04-07 0:18 UTC (permalink / raw) To: Chuck Lever III Cc: Hugh Dickins, Andrew Morton, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel On Wed, 6 Apr 2022, Chuck Lever III wrote: > Good day, Hugh- Huh! If you were really wishing me a good day, would you tell me this ;-? > > I noticed that several fsx-related tests in the xfstests suite are > failing after updating my NFS server to v5.18-rc1. I normally test > against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export > that sees these new failures: > > generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad) > --- tests/generic/075.out 2014-02-13 15:40:45.000000000 -0500 > +++ /home/cel/src/xfstests/results//generic/075.out.bad 2022-04-05 16:39:59.145991520 -0400 > @@ -4,15 +4,5 @@ > ----------------------------------------------- > fsx.0 : -d -N numops -S 0 > ----------------------------------------------- > - > ------------------------------------------------ > -fsx.1 : -d -N numops -S 0 -x > ------------------------------------------------ > ... > (Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad' to see the entire diff) > > generic/091 9s ... 
[failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad) > --- tests/generic/091.out 2014-02-13 15:40:45.000000000 -0500 > +++ /home/cel/src/xfstests/results//generic/091.out.bad 2022-04-05 16:41:24.329063277 -0400 > @@ -1,7 +1,75 @@ > QA output created by 091 > fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W > -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W > ... > (Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad' to see the entire diff) > > generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad) > --- tests/generic/112.out 2014-02-13 15:40:45.000000000 -0500 > +++ /home/cel/src/xfstests/results//generic/112.out.bad 2022-04-05 16:41:38.511075170 -0400 > @@ -4,15 +4,4 @@ > ----------------------------------------------- > fsx.0 : -A -d -N numops -S 0 > ----------------------------------------------- > - > ------------------------------------------------ > -fsx.1 : -A -d -N numops -S 0 -x > ------------------------------------------------ > ... > (Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad' to see the entire diff) > > generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad) > --- tests/generic/127.out 2016-08-28 12:16:20.000000000 -0400 > +++ /home/cel/src/xfstests/results//generic/127.out.bad 2022-04-05 16:42:07.655099652 -0400 > @@ -4,10 +4,198 @@ > === FSX Light Mode, Memory Mapping === > All 100000 operations completed A-OK! 
> === FSX Standard Mode, No Memory Mapping === > -All 100000 operations completed A-OK! > +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap > +READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap > +OFFSET GOOD BAD RANGE > ... > (Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad' to see the entire diff) > > I bisected the problem to: > > 56a8c8eb1eaf ("tmpfs: do not allocate pages on read") > > generic/075 fails almost immediately without any NFS-level errors. > Likely this is data corruption rather than an overt I/O error. That's sad. Thanks for bisecting and reporting. Sorry for the nuisance. I suspect this patch is heading for a revert, because I shall not have time to debug and investigate. Cc'ing fsdevel and a few people who have an interest in it, to warn of that likely upcoming revert. But if it's okay with everyone, please may we leave it in for -rc2? Given that having it in -rc1 already smoked out another issue (problem of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping it in a little longer might smoke out even more. The xfstests info above doesn't actually tell very much, beyond that generic/075 generic/091 generic/112 generic/127, each a test with fsx, all fall at their first hurdle. If you have time, please rerun and tar up the results/generic directory (maybe filter just those failing) and send as attachment. But don't go to any trouble, it's unlikely that I shall even untar it - it would be mainly to go on record if anyone has time to look into it later. And, frankly, it's unlikely to tell us anything more enlightening, than that the data seen was not as expected: which we do already know. I've had no problem with xfstests generic 075,091,112,127 testing tmpfs here, not before and not in the month or two I've had that patch in: so it's something in the way that NFS exercises tmpfs that reveals it. 
If I had time to duplicate your procedure, I'd be asking for detailed instructions: but no, I won't have a chance. But I can sit here and try to guess. I notice fs/nfsd checks file->f_op->splice_read, and employs fallback if not available: if you have time, please try rerunning those xfstests on an -rc1 kernel, but with mm/shmem.c's .splice_read line commented out. My guess is that will then pass the tests, and we shall know more. What could be going wrong there? I've thought of two possibilities. A minor, hopefully easily fixed, issue would be if fs/nfsd has trouble with seeing the same page twice in a row: since tmpfs is now using the ZERO_PAGE(0) for all pages of a hole, and I think I caught sight of code which looks to see if the latest page is the same as the one before. It's easy to imagine that might go wrong. A more difficult issue would be, if fsx is racing writes and reads, in a way that it can guarantee the correct result, but that correct result is no longer delivered: because the writes go into freshly allocated tmpfs cache pages, while reads are still delivering stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there. But unless someone has time to help out, we're heading for a revert. Thanks, Hugh ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-07 0:18 ` Hugh Dickins @ 2022-04-07 4:25 ` Mark Hemment 2022-04-07 22:04 ` Hugh Dickins 2022-04-07 19:24 ` Chuck Lever III 1 sibling, 1 reply; 11+ messages in thread From: Mark Hemment @ 2022-04-07 4:25 UTC (permalink / raw) To: Hugh Dickins Cc: Chuck Lever III, Andrew Morton, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel On Thu, 7 Apr 2022 at 01:19, Hugh Dickins <hughd@google.com> wrote: > > On Wed, 6 Apr 2022, Chuck Lever III wrote: > > > Good day, Hugh- > > Huh! If you were really wishing me a good day, would you tell me this ;-? > > > > > I noticed that several fsx-related tests in the xfstests suite are > > failing after updating my NFS server to v5.18-rc1. I normally test > > against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export > > that sees these new failures: > > [...] > > generic/075 fails almost immediately without any NFS-level errors. > > Likely this is data corruption rather than an overt I/O error. > > That's sad. Thanks for bisecting and reporting. Sorry for the nuisance. > > I suspect this patch is heading for a revert, because I shall not have > time to debug and investigate. Cc'ing fsdevel and a few people who have > an interest in it, to warn of that likely upcoming revert. > > But if it's okay with everyone, please may we leave it in for -rc2? > Given that having it in -rc1 already smoked out another issue (problem > of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping > it in a little longer might smoke out even more. > > The xfstests info above doesn't actually tell very much, beyond that > generic/075 generic/091 generic/112 generic/127, each a test with fsx, > all fall at their first hurdle. If you have time, please rerun and > tar up the results/generic directory (maybe filter just those failing) > and send as attachment. 
But don't go to any trouble, it's unlikely > that I shall even untar it - it would be mainly to go on record if > anyone has time to look into it later. And, frankly, it's unlikely > to tell us anything more enlightening, than that the data seen was > not as expected: which we do already know. > > I've had no problem with xfstests generic 075,091,112,127 testing > tmpfs here, not before and not in the month or two I've had that > patch in: so it's something in the way that NFS exercises tmpfs > that reveals it. If I had time to duplicate your procedure, I'd be > asking for detailed instructions: but no, I won't have a chance. > > But I can sit here and try to guess. I notice fs/nfsd checks > file->f_op->splice_read, and employs fallback if not available: > if you have time, please try rerunning those xfstests on an -rc1 > kernel, but with mm/shmem.c's .splice_read line commented out. > My guess is that will then pass the tests, and we shall know more. > > What could be going wrong there? I've thought of two possibilities. > A minor, hopefully easily fixed, issue would be if fs/nfsd has > trouble with seeing the same page twice in a row: since tmpfs is > now using the ZERO_PAGE(0) for all pages of a hole, and I think I > caught sight of code which looks to see if the latest page is the > same as the one before. It's easy to imagine that might go wrong. When I worked at Veritas, data corruption over NFS was hit when sending the same page in succession. This was triggered via VxFS's shared page cache, after a file had been dedup'ed. I can't remember all the details (~15yrs ago), but the core issue was skb_can_coalesce() returning a false-positive for the 'same page' case (no check for crossing a page boundary). 
> A more difficult issue would be, if fsx is racing writes and reads, > in a way that it can guarantee the correct result, but that correct > result is no longer delivered: because the writes go into freshly > allocated tmpfs cache pages, while reads are still delivering > stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there. > > But unless someone has time to help out, we're heading for a revert. > > Thanks, > Hugh Cheers, Mark ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-07 4:25 ` Mark Hemment @ 2022-04-07 22:04 ` Hugh Dickins 0 siblings, 0 replies; 11+ messages in thread From: Hugh Dickins @ 2022-04-07 22:04 UTC (permalink / raw) To: Mark Hemment Cc: Hugh Dickins, Chuck Lever III, Andrew Morton, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel On Thu, 7 Apr 2022, Mark Hemment wrote: > On Thu, 7 Apr 2022 at 01:19, Hugh Dickins <hughd@google.com> wrote: > > > > What could be going wrong there? I've thought of two possibilities. > > A minor, hopefully easily fixed, issue would be if fs/nfsd has > > trouble with seeing the same page twice in a row: since tmpfs is > > now using the ZERO_PAGE(0) for all pages of a hole, and I think I > > caught sight of code which looks to see if the latest page is the > > same as the one before. It's easy to imagine that might go wrong. > > When I worked at Veritas, data corruption over NFS was hit when > sending the same page in succession. This was triggered via VxFS's > shared page cache, after a file had been dedup'ed. > I can't remember all the details (~15yrs ago), but the core issue was > skb_can_coalesce() returning a false-positive for the 'same page' case > (no check for crossing a page boundary). Very useful input: thank you Mark. That tells me that, even if we spot a "bug" in fs/nfsd, there could be various other places which get confused if given the ZERO_PAGE(0) twice in a row. I won't be able to find them all, I cannot go on risking that. At first I thought of using ZERO_PAGE(0) for even pgoffs, alternating with a tmpfs-specific zeroed page for odd pgoffs. But I was forgetting that copying ZERO_PAGE(0) is itself just a workaround to avoid the 28% slower iov_iter_zero(). I think I have a reasonable hybrid: will reply to Chuck now with that. Hugh ^ permalink raw reply [flat|nested] 11+ messages in thread

* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-07 0:18 ` Hugh Dickins 2022-04-07 4:25 ` Mark Hemment @ 2022-04-07 19:24 ` Chuck Lever III 2022-04-07 22:26 ` Hugh Dickins 1 sibling, 1 reply; 11+ messages in thread From: Chuck Lever III @ 2022-04-07 19:24 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel > On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote: > > On Wed, 6 Apr 2022, Chuck Lever III wrote: > >> Good day, Hugh- > > Huh! If you were really wishing me a good day, would you tell me this ;-? > >> >> I noticed that several fsx-related tests in the xfstests suite are >> failing after updating my NFS server to v5.18-rc1. I normally test >> against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export >> that sees these new failures: >> >> generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad) >> --- tests/generic/075.out 2014-02-13 15:40:45.000000000 -0500 >> +++ /home/cel/src/xfstests/results//generic/075.out.bad 2022-04-05 16:39:59.145991520 -0400 >> @@ -4,15 +4,5 @@ >> ----------------------------------------------- >> fsx.0 : -d -N numops -S 0 >> ----------------------------------------------- >> - >> ------------------------------------------------ >> -fsx.1 : -d -N numops -S 0 -x >> ------------------------------------------------ >> ... >> (Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad' to see the entire diff) >> >> generic/091 9s ... 
[failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad) >> --- tests/generic/091.out 2014-02-13 15:40:45.000000000 -0500 >> +++ /home/cel/src/xfstests/results//generic/091.out.bad 2022-04-05 16:41:24.329063277 -0400 >> @@ -1,7 +1,75 @@ >> QA output created by 091 >> fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W >> -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W >> -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W >> -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W >> -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W >> -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W >> ... >> (Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad' to see the entire diff) >> >> generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad) >> --- tests/generic/112.out 2014-02-13 15:40:45.000000000 -0500 >> +++ /home/cel/src/xfstests/results//generic/112.out.bad 2022-04-05 16:41:38.511075170 -0400 >> @@ -4,15 +4,4 @@ >> ----------------------------------------------- >> fsx.0 : -A -d -N numops -S 0 >> ----------------------------------------------- >> - >> ------------------------------------------------ >> -fsx.1 : -A -d -N numops -S 0 -x >> ------------------------------------------------ >> ... >> (Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad' to see the entire diff) >> >> generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad) >> --- tests/generic/127.out 2016-08-28 12:16:20.000000000 -0400 >> +++ /home/cel/src/xfstests/results//generic/127.out.bad 2022-04-05 16:42:07.655099652 -0400 >> @@ -4,10 +4,198 @@ >> === FSX Light Mode, Memory Mapping === >> All 100000 operations completed A-OK! 
>> === FSX Standard Mode, No Memory Mapping === >> -All 100000 operations completed A-OK! >> +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap >> +READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap >> +OFFSET GOOD BAD RANGE >> ... >> (Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad' to see the entire diff) >> >> I bisected the problem to: >> >> 56a8c8eb1eaf ("tmpfs: do not allocate pages on read") >> >> generic/075 fails almost immediately without any NFS-level errors. >> Likely this is data corruption rather than an overt I/O error. > > That's sad. Thanks for bisecting and reporting. Sorry for the nuisance. > > I suspect this patch is heading for a revert, because I shall not have > time to debug and investigate. Cc'ing fsdevel and a few people who have > an interest in it, to warn of that likely upcoming revert. > > But if it's okay with everyone, please may we leave it in for -rc2? > Given that having it in -rc1 already smoked out another issue (problem > of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping > it in a little longer might smoke out even more. > > The xfstests info above doesn't actually tell very much, beyond that > generic/075 generic/091 generic/112 generic/127, each a test with fsx, > all fall at their first hurdle. If you have time, please rerun and > tar up the results/generic directory (maybe filter just those failing) > and send as attachment. But don't go to any trouble, it's unlikely > that I shall even untar it - it would be mainly to go on record if > anyone has time to look into it later. And, frankly, it's unlikely > to tell us anything more enlightening, than that the data seen was > not as expected: which we do already know. 
> > I've had no problem with xfstests generic 075,091,112,127 testing > tmpfs here, not before and not in the month or two I've had that > patch in: so it's something in the way that NFS exercises tmpfs > that reveals it. If I had time to duplicate your procedure, I'd be > asking for detailed instructions: but no, I won't have a chance. > > But I can sit here and try to guess. I notice fs/nfsd checks > file->f_op->splice_read, and employs fallback if not available: > if you have time, please try rerunning those xfstests on an -rc1 > kernel, but with mm/shmem.c's .splice_read line commented out. > My guess is that will then pass the tests, and we shall know more. This seemed like the most probative next step, so I commented out the .splice_read call-out in mm/shmem.c and ran the tests again. Yes, that change enables the fsx-related tests to pass as expected. > What could be going wrong there? I've thought of two possibilities. > A minor, hopefully easily fixed, issue would be if fs/nfsd has > trouble with seeing the same page twice in a row: since tmpfs is > now using the ZERO_PAGE(0) for all pages of a hole, and I think I > caught sight of code which looks to see if the latest page is the > same as the one before. It's easy to imagine that might go wrong. Are you referring to this function in fs/nfsd/vfs.c ? 
847 static int 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf, 849 struct splice_desc *sd) 850 { 851 struct svc_rqst *rqstp = sd->u.data; 852 struct page **pp = rqstp->rq_next_page; 853 struct page *page = buf->page; 854 855 if (rqstp->rq_res.page_len == 0) { 856 svc_rqst_replace_page(rqstp, page); 857 rqstp->rq_res.page_base = buf->offset; 858 } else if (page != pp[-1]) { 859 svc_rqst_replace_page(rqstp, page); 860 } 861 rqstp->rq_res.page_len += sd->len; 862 863 return sd->len; 864 } rq_next_page should point to the first unused element of rqstp->rq_pages, so IIUC that check is looking for the final page that is part of the READ payload. But that does suggest that if page -> ZERO_PAGE and so does pp[-1], then svc_rqst_replace_page() would not be invoked. > A more difficult issue would be, if fsx is racing writes and reads, > in a way that it can guarantee the correct result, but that correct > result is no longer delivered: because the writes go into freshly > allocated tmpfs cache pages, while reads are still delivering > stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there. > > But unless someone has time to help out, we're heading for a revert. > > Thanks, > Hugh -- Chuck Lever ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-07 19:24 ` Chuck Lever III @ 2022-04-07 22:26 ` Hugh Dickins 2022-04-07 23:45 ` Chuck Lever III 2022-04-08 16:10 ` Chuck Lever III 0 siblings, 2 replies; 11+ messages in thread From: Hugh Dickins @ 2022-04-07 22:26 UTC (permalink / raw) To: Chuck Lever III Cc: Hugh Dickins, Andrew Morton, Mark Hemment, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel On Thu, 7 Apr 2022, Chuck Lever III wrote: > > On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote: > > > > But I can sit here and try to guess. I notice fs/nfsd checks > > file->f_op->splice_read, and employs fallback if not available: > > if you have time, please try rerunning those xfstests on an -rc1 > > kernel, but with mm/shmem.c's .splice_read line commented out. > > My guess is that will then pass the tests, and we shall know more. > > This seemed like the most probative next step, so I commented > out the .splice_read call-out in mm/shmem.c and ran the tests > again. Yes, that change enables the fsx-related tests to pass > as expected. Great, thank you for trying that. > > > What could be going wrong there? I've thought of two possibilities. > > A minor, hopefully easily fixed, issue would be if fs/nfsd has > > trouble with seeing the same page twice in a row: since tmpfs is > > now using the ZERO_PAGE(0) for all pages of a hole, and I think I > > caught sight of code which looks to see if the latest page is the > > same as the one before. It's easy to imagine that might go wrong. > > Are you referring to this function in fs/nfsd/vfs.c ? I think that was it, didn't pay much attention. 
> > 847 static int > 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf, > 849 struct splice_desc *sd) > 850 { > 851 struct svc_rqst *rqstp = sd->u.data; > 852 struct page **pp = rqstp->rq_next_page; > 853 struct page *page = buf->page; > 854 > 855 if (rqstp->rq_res.page_len == 0) { > 856 svc_rqst_replace_page(rqstp, page); > 857 rqstp->rq_res.page_base = buf->offset; > 858 } else if (page != pp[-1]) { > 859 svc_rqst_replace_page(rqstp, page); > 860 } > 861 rqstp->rq_res.page_len += sd->len; > 862 > 863 return sd->len; > 864 } > > rq_next_page should point to the first unused element of > rqstp->rq_pages, so IIUC that check is looking for the > final page that is part of the READ payload. > > But that does suggest that if page -> ZERO_PAGE and so does > pp[-1], then svc_rqst_replace_page() would not be invoked. I still haven't studied the logic there: Mark's input made it clear that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly, there could be expectations of uniqueness in other places too. > > > A more difficult issue would be, if fsx is racing writes and reads, > > in a way that it can guarantee the correct result, but that correct > > result is no longer delivered: because the writes go into freshly > > allocated tmpfs cache pages, while reads are still delivering > > stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there. > > > > But unless someone has time to help out, we're heading for a revert. We might be able to avoid that revert, and go the whole way to using iov_iter_zero() instead. But the significant slowness of clear_user() relative to copy to user, on x86 at least, does ask for a hybrid. Suggested patch below, on top of 5.18-rc1, passes my own testing: but will it pass yours? It seems to me safe, and as fast as before, but we don't know yet if this iov_iter_zero() works right for you. Chuck, please give it a go and let us know. (Don't forget to restore mm/shmem.c's .splice_read first! 
And if this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0)) in the same patch, fixing the other regression, without recourse to #ifdefs or arch mods.) Thanks! Hugh --- 5.18-rc1/mm/shmem.c +++ linux/mm/shmem.c @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) pgoff_t end_index; unsigned long nr, ret; loff_t i_size = i_size_read(inode); - bool got_page; end_index = i_size >> PAGE_SHIFT; if (index > end_index) @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) */ if (!offset) mark_page_accessed(page); - got_page = true; + /* + * Ok, we have the page, and it's up-to-date, so + * now we can copy it to user space... + */ + ret = copy_page_to_iter(page, offset, nr, to); + put_page(page); + + } else if (iter_is_iovec(to)) { + /* + * Copy to user tends to be so well optimized, but + * clear_user() not so much, that it is noticeably + * faster to copy the zero page instead of clearing. + */ + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to); } else { - page = ZERO_PAGE(0); - got_page = false; + /* + * But submitting the same page twice in a row to + * splice() - or others? - can result in confusion: + * so don't attempt that optimization on pipes etc. + */ + ret = iov_iter_zero(nr, to); } - /* - * Ok, we have the page, and it's up-to-date, so - * now we can copy it to user space... - */ - ret = copy_page_to_iter(page, offset, nr, to); retval += ret; offset += ret; index += offset >> PAGE_SHIFT; offset &= ~PAGE_MASK; - if (got_page) - put_page(page); if (!iov_iter_count(to)) break; if (ret < nr) { ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-07 22:26 ` Hugh Dickins @ 2022-04-07 23:45 ` Chuck Lever III 2022-04-08 14:38 ` Mark Hemment 2022-04-08 16:10 ` Chuck Lever III 1 sibling, 1 reply; 11+ messages in thread From: Chuck Lever III @ 2022-04-07 23:45 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Mark Hemment, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel > On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote: > > On Thu, 7 Apr 2022, Chuck Lever III wrote: >>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote: >>> >>> But I can sit here and try to guess. I notice fs/nfsd checks >>> file->f_op->splice_read, and employs fallback if not available: >>> if you have time, please try rerunning those xfstests on an -rc1 >>> kernel, but with mm/shmem.c's .splice_read line commented out. >>> My guess is that will then pass the tests, and we shall know more. >> >> This seemed like the most probative next step, so I commented >> out the .splice_read call-out in mm/shmem.c and ran the tests >> again. Yes, that change enables the fsx-related tests to pass >> as expected. > > Great, thank you for trying that. > >> >>> What could be going wrong there? I've thought of two possibilities. >>> A minor, hopefully easily fixed, issue would be if fs/nfsd has >>> trouble with seeing the same page twice in a row: since tmpfs is >>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I >>> caught sight of code which looks to see if the latest page is the >>> same as the one before. It's easy to imagine that might go wrong. >> >> Are you referring to this function in fs/nfsd/vfs.c ? > > I think that was it, didn't pay much attention. This code seems to have been the issue. I added a little test to see if @page pointed to ZERO_PAGE(0) and now the tests pass as expected. 
>> 847 static int >> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf, >> 849 struct splice_desc *sd) >> 850 { >> 851 struct svc_rqst *rqstp = sd->u.data; >> 852 struct page **pp = rqstp->rq_next_page; >> 853 struct page *page = buf->page; >> 854 >> 855 if (rqstp->rq_res.page_len == 0) { >> 856 svc_rqst_replace_page(rqstp, page); >> 857 rqstp->rq_res.page_base = buf->offset; >> 858 } else if (page != pp[-1]) { >> 859 svc_rqst_replace_page(rqstp, page); >> 860 } >> 861 rqstp->rq_res.page_len += sd->len; >> 862 >> 863 return sd->len; >> 864 } >> >> rq_next_page should point to the first unused element of >> rqstp->rq_pages, so IIUC that check is looking for the >> final page that is part of the READ payload. >> >> But that does suggest that if page -> ZERO_PAGE and so does >> pp[-1], then svc_rqst_replace_page() would not be invoked. > > I still haven't studied the logic there: Mark's input made it clear > that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly, > there could be expectations of uniqueness in other places too. I can't really attest to Mark's comment, but... After studying nfsd_splice_actor() I can't see any reason except cleverness and technical debt for this particular check. I have a patch that removes the check and simplifies this function that I'm testing now -- it seems to be a reasonable clean-up whether you keep 56a8c8eb1eaf or choose to revert it. >>> A more difficult issue would be, if fsx is racing writes and reads, >>> in a way that it can guarantee the correct result, but that correct >>> result is no longer delivered: because the writes go into freshly >>> allocated tmpfs cache pages, while reads are still delivering >>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there. >>> >>> But unless someone has time to help out, we're heading for a revert. > > We might be able to avoid that revert, and go the whole way to using > iov_iter_zero() instead. 
But the significant slowness of clear_user() > relative to copy to user, on x86 at least, does ask for a hybrid. > > Suggested patch below, on top of 5.18-rc1, passes my own testing: > but will it pass yours? It seems to me safe, and as fast as before, > but we don't know yet if this iov_iter_zero() works right for you. > Chuck, please give it a go and let us know. > > (Don't forget to restore mm/shmem.c's .splice_read first! And if > this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0)) > in the same patch, fixing the other regression, without recourse to > #ifdefs or arch mods.) Sure, I will try this out first thing tomorrow. One thing that occurs to me is that for NFS/RDMA, having a page full of zeroes that is already DMA-mapped would be a nice optimization on the sender side (on the client for an NFS WRITE and on the server for an NFS READ). The transport would have to set up a scatter-gather list containing a bunch of entries that reference the same page... </musing> > Thanks! > Hugh > > --- 5.18-rc1/mm/shmem.c > +++ linux/mm/shmem.c > @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) > pgoff_t end_index; > unsigned long nr, ret; > loff_t i_size = i_size_read(inode); > - bool got_page; > > end_index = i_size >> PAGE_SHIFT; > if (index > end_index) > @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) > */ > if (!offset) > mark_page_accessed(page); > - got_page = true; > + /* > + * Ok, we have the page, and it's up-to-date, so > + * now we can copy it to user space... > + */ > + ret = copy_page_to_iter(page, offset, nr, to); > + put_page(page); > + > + } else if (iter_is_iovec(to)) { > + /* > + * Copy to user tends to be so well optimized, but > + * clear_user() not so much, that it is noticeably > + * faster to copy the zero page instead of clearing. 
> + */ > + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to); > } else { > - page = ZERO_PAGE(0); > - got_page = false; > + /* > + * But submitting the same page twice in a row to > + * splice() - or others? - can result in confusion: > + * so don't attempt that optimization on pipes etc. > + */ > + ret = iov_iter_zero(nr, to); > } > > - /* > - * Ok, we have the page, and it's up-to-date, so > - * now we can copy it to user space... > - */ > - ret = copy_page_to_iter(page, offset, nr, to); > retval += ret; > offset += ret; > index += offset >> PAGE_SHIFT; > offset &= ~PAGE_MASK; > > - if (got_page) > - put_page(page); > if (!iov_iter_count(to)) > break; > if (ret < nr) { -- Chuck Lever ^ permalink raw reply [flat|nested] 11+ messages in thread
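[Editor's note: the failure mode in nfsd_splice_actor() quoted above can be reproduced in miniature. A simplified userspace model of its "only capture the page if it differs from the previous one" check (hypothetical structures; the real code is in fs/nfsd/vfs.c) shows how two consecutive hole pages backed by the same ZERO_PAGE collapse into one captured page while the reply length still advances:]

```c
#include <stddef.h>

static char zero_storage;
#define ZERO_PAGE ((void *)&zero_storage)	/* one shared "zero page" */

/* Toy reply: captured page pointers plus the claimed payload length. */
struct reply {
	void *pages[16];	/* bounds unchecked: sketch only */
	int npages;
	size_t page_len;
};

/*
 * Model of nfsd_splice_actor()'s dedup check: a page is captured into
 * the reply only if it differs from the previously captured page, but
 * page_len grows regardless.
 */
static size_t splice_actor(struct reply *r, void *page, size_t len)
{
	if (r->npages == 0 || page != r->pages[r->npages - 1])
		r->pages[r->npages++] = page;	/* ~svc_rqst_replace_page() */
	r->page_len += len;
	return len;
}
```

Splicing the same zero page for two consecutive 4KB hole pages captures one page but claims 8KB of payload, consistent with the short/corrupt READ replies fsx observed.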
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-07 23:45 ` Chuck Lever III @ 2022-04-08 14:38 ` Mark Hemment 0 siblings, 0 replies; 11+ messages in thread From: Mark Hemment @ 2022-04-08 14:38 UTC (permalink / raw) To: Chuck Lever III Cc: Hugh Dickins, Andrew Morton, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel On Fri, 8 Apr 2022 at 00:45, Chuck Lever III <chuck.lever@oracle.com> wrote: > > On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote: > > > > On Thu, 7 Apr 2022, Chuck Lever III wrote: > >>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote: > >>> > >>> But I can sit here and try to guess. I notice fs/nfsd checks > >>> file->f_op->splice_read, and employs fallback if not available: > >>> if you have time, please try rerunning those xfstests on an -rc1 > >>> kernel, but with mm/shmem.c's .splice_read line commented out. > >>> My guess is that will then pass the tests, and we shall know more. > >> > >> This seemed like the most probative next step, so I commented > >> out the .splice_read call-out in mm/shmem.c and ran the tests > >> again. Yes, that change enables the fsx-related tests to pass > >> as expected. > > > > Great, thank you for trying that. > > > >> > >>> What could be going wrong there? I've thought of two possibilities. > >>> A minor, hopefully easily fixed, issue would be if fs/nfsd has > >>> trouble with seeing the same page twice in a row: since tmpfs is > >>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I > >>> caught sight of code which looks to see if the latest page is the > >>> same as the one before. It's easy to imagine that might go wrong. > >> > >> Are you referring to this function in fs/nfsd/vfs.c ? > > > > I think that was it, didn't pay much attention. > > This code seems to have been the issue. 
I added a little test > to see if @page pointed to ZERO_PAGE(0) and now the tests > pass as expected. > > > >> 847 static int > >> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf, > >> 849 struct splice_desc *sd) > >> 850 { > >> 851 struct svc_rqst *rqstp = sd->u.data; > >> 852 struct page **pp = rqstp->rq_next_page; > >> 853 struct page *page = buf->page; > >> 854 > >> 855 if (rqstp->rq_res.page_len == 0) { > >> 856 svc_rqst_replace_page(rqstp, page); > >> 857 rqstp->rq_res.page_base = buf->offset; > >> 858 } else if (page != pp[-1]) { > >> 859 svc_rqst_replace_page(rqstp, page); > >> 860 } > >> 861 rqstp->rq_res.page_len += sd->len; > >> 862 > >> 863 return sd->len; > >> 864 } > >> > >> rq_next_page should point to the first unused element of > >> rqstp->rq_pages, so IIUC that check is looking for the > >> final page that is part of the READ payload. > >> > >> But that does suggest that if page -> ZERO_PAGE and so does > >> pp[-1], then svc_rqst_replace_page() would not be invoked. > > > > I still haven't studied the logic there: Mark's input made it clear > > that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly, > > there could be expectations of uniqueness in other places too. > > I can't really attest to Mark's comment, but... > > After studying nfsd_splice_actor() I can't see any reason > except cleverness and technical debt for this particular > check. I have a patch that removes the check and simplifies > this function that I'm testing now -- it seems to be a > reasonable clean-up whether you keep 56a8c8eb1eaf or > choose to revert it. Agreed nfsd_splice_actor() is broken for the same-page case. That function hasn't changed in logic since introduction. So when VxFS triggered this issue, back in 2007/2008, it must have had the same problem with this actor (same with its predecessor; ->sendfile). I don't remember. But skb_can_coalesce() sticks in my mind for some reason. 
Would jumbo frames be a good stress for can_coalesce with same-page? Or, as Hugh is proposing to avoid sending ZERO_PAGE, ignore this for now? > >>> A more difficult issue would be, if fsx is racing writes and reads, > >>> in a way that it can guarantee the correct result, but that correct > >>> result is no longer delivered: because the writes go into freshly > >>> allocated tmpfs cache pages, while reads are still delivering > >>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there. > >>> > >>> But unless someone has time to help out, we're heading for a revert. > > > > We might be able to avoid that revert, and go the whole way to using > > iov_iter_zero() instead. But the significant slowness of clear_user() > > relative to copy to user, on x86 at least, does ask for a hybrid. > > > > Suggested patch below, on top of 5.18-rc1, passes my own testing: > > but will it pass yours? It seems to me safe, and as fast as before, > > but we don't know yet if this iov_iter_zero() works right for you. > > Chuck, please give it a go and let us know. > > > > (Don't forget to restore mm/shmem.c's .splice_read first! And if > > this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0)) > > in the same patch, fixing the other regression, without recourse to > > #ifdefs or arch mods.) > > Sure, I will try this out first thing tomorrow. > > One thing that occurs to me is that for NFS/RDMA, having a > page full of zeroes that is already DMA-mapped would be a > nice optimization on the sender side (on the client for an > NFS WRITE and on the server for an NFS READ). The transport > would have to set up a scatter-gather list containing a > bunch of entries that reference the same page... > > </musing> > > > > Thanks! 
> > Hugh > > > > --- 5.18-rc1/mm/shmem.c > > +++ linux/mm/shmem.c > > @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) > > pgoff_t end_index; > > unsigned long nr, ret; > > loff_t i_size = i_size_read(inode); > > - bool got_page; > > > > end_index = i_size >> PAGE_SHIFT; > > if (index > end_index) > > @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) > > */ > > if (!offset) > > mark_page_accessed(page); > > - got_page = true; > > + /* > > + * Ok, we have the page, and it's up-to-date, so > > + * now we can copy it to user space... > > + */ > > + ret = copy_page_to_iter(page, offset, nr, to); > > + put_page(page); > > + > > + } else if (iter_is_iovec(to)) { > > + /* > > + * Copy to user tends to be so well optimized, but > > + * clear_user() not so much, that it is noticeably > > + * faster to copy the zero page instead of clearing. > > + */ > > + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to); > > } else { > > - page = ZERO_PAGE(0); > > - got_page = false; > > + /* > > + * But submitting the same page twice in a row to > > + * splice() - or others? - can result in confusion: > > + * so don't attempt that optimization on pipes etc. > > + */ > > + ret = iov_iter_zero(nr, to); > > } > > > > - /* > > - * Ok, we have the page, and it's up-to-date, so > > - * now we can copy it to user space... > > - */ > > - ret = copy_page_to_iter(page, offset, nr, to); > > retval += ret; > > offset += ret; > > index += offset >> PAGE_SHIFT; > > offset &= ~PAGE_MASK; > > > > - if (got_page) > > - put_page(page); > > if (!iov_iter_count(to)) > > break; > > if (ret < nr) { > > -- > Chuck Lever Cheers, Mark ^ permalink raw reply [flat|nested] 11+ messages in thread
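[Editor's note: on Mark's skb_can_coalesce() question, the coalescing test is roughly "same page AND contiguous offset". A minimal model of that shape (a sketch, not the real helper in include/linux/skbuff.h) suggests that full hole pages, which always start at offset 0 of the zero page, would fail the contiguity half of the test and simply land in separate fragments:]

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal model of an skb page fragment. */
struct frag {
	const void *page;
	size_t off;
	size_t size;
};

/*
 * Rough shape of skb_can_coalesce(): a new chunk merges into the last
 * fragment only if it references the same page and begins exactly
 * where the previous fragment ended.
 */
static bool can_coalesce(const struct frag *last, const void *page, size_t off)
{
	return page == last->page && off == last->off + last->size;
}
```

Under this model a repeated zero page at offset 0 is not merged, though whether every caller handles the repeated-page-pointer case safely remains the open question in the thread.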
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-07 22:26 ` Hugh Dickins 2022-04-07 23:45 ` Chuck Lever III @ 2022-04-08 16:10 ` Chuck Lever III 2022-04-08 19:09 ` Hugh Dickins 1 sibling, 1 reply; 11+ messages in thread From: Chuck Lever III @ 2022-04-08 16:10 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Mark Hemment, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel > On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote: > > On Thu, 7 Apr 2022, Chuck Lever III wrote: >>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote: >>> >>> But I can sit here and try to guess. I notice fs/nfsd checks >>> file->f_op->splice_read, and employs fallback if not available: >>> if you have time, please try rerunning those xfstests on an -rc1 >>> kernel, but with mm/shmem.c's .splice_read line commented out. >>> My guess is that will then pass the tests, and we shall know more. >> >> This seemed like the most probative next step, so I commented >> out the .splice_read call-out in mm/shmem.c and ran the tests >> again. Yes, that change enables the fsx-related tests to pass >> as expected. > > Great, thank you for trying that. > >> >>> What could be going wrong there? I've thought of two possibilities. >>> A minor, hopefully easily fixed, issue would be if fs/nfsd has >>> trouble with seeing the same page twice in a row: since tmpfs is >>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I >>> caught sight of code which looks to see if the latest page is the >>> same as the one before. It's easy to imagine that might go wrong. >> >> Are you referring to this function in fs/nfsd/vfs.c ? > > I think that was it, didn't pay much attention. 
> >> >> 847 static int >> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf, >> 849 struct splice_desc *sd) >> 850 { >> 851 struct svc_rqst *rqstp = sd->u.data; >> 852 struct page **pp = rqstp->rq_next_page; >> 853 struct page *page = buf->page; >> 854 >> 855 if (rqstp->rq_res.page_len == 0) { >> 856 svc_rqst_replace_page(rqstp, page); >> 857 rqstp->rq_res.page_base = buf->offset; >> 858 } else if (page != pp[-1]) { >> 859 svc_rqst_replace_page(rqstp, page); >> 860 } >> 861 rqstp->rq_res.page_len += sd->len; >> 862 >> 863 return sd->len; >> 864 } >> >> rq_next_page should point to the first unused element of >> rqstp->rq_pages, so IIUC that check is looking for the >> final page that is part of the READ payload. >> >> But that does suggest that if page -> ZERO_PAGE and so does >> pp[-1], then svc_rqst_replace_page() would not be invoked. To put a little more color on this, I think the idea here is to prevent releasing the same page twice. It might be possible that NFSD can add the same page to the rq_pages array more than once, and we don't want to do a double put_page(). The only time I can think this might happen is if the READ payload is partially contained in the page that contains the NFS header. I'm not sure that can ever happen these days. > I still haven't studied the logic there: Mark's input made it clear > that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly, > there could be expectations of uniqueness in other places too. So far I haven't seen an issue with skb_can_coalesce(). I will keep an eye out for that. >>> A more difficult issue would be, if fsx is racing writes and reads, >>> in a way that it can guarantee the correct result, but that correct >>> result is no longer delivered: because the writes go into freshly >>> allocated tmpfs cache pages, while reads are still delivering >>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there. 
>>> >>> But unless someone has time to help out, we're heading for a revert. > > We might be able to avoid that revert, and go the whole way to using > iov_iter_zero() instead. But the significant slowness of clear_user() > relative to copy to user, on x86 at least, does ask for a hybrid. > > Suggested patch below, on top of 5.18-rc1, passes my own testing: > but will it pass yours? It seems to me safe, and as fast as before, > but we don't know yet if this iov_iter_zero() works right for you. > Chuck, please give it a go and let us know. Applied to stock v5.18-rc1. The tests pass as expected. > (Don't forget to restore mm/shmem.c's .splice_read first! And if > this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0)) > in the same patch, fixing the other regression, without recourse to > #ifdefs or arch mods.) > > Thanks! > Hugh > > --- 5.18-rc1/mm/shmem.c > +++ linux/mm/shmem.c > @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) > pgoff_t end_index; > unsigned long nr, ret; > loff_t i_size = i_size_read(inode); > - bool got_page; > > end_index = i_size >> PAGE_SHIFT; > if (index > end_index) > @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to) > */ > if (!offset) > mark_page_accessed(page); > - got_page = true; > + /* > + * Ok, we have the page, and it's up-to-date, so > + * now we can copy it to user space... > + */ > + ret = copy_page_to_iter(page, offset, nr, to); > + put_page(page); > + > + } else if (iter_is_iovec(to)) { > + /* > + * Copy to user tends to be so well optimized, but > + * clear_user() not so much, that it is noticeably > + * faster to copy the zero page instead of clearing. > + */ > + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to); > } else { > - page = ZERO_PAGE(0); > - got_page = false; > + /* > + * But submitting the same page twice in a row to > + * splice() - or others? 
- can result in confusion: > + * so don't attempt that optimization on pipes etc. > + */ > + ret = iov_iter_zero(nr, to); > } > > - /* > - * Ok, we have the page, and it's up-to-date, so > - * now we can copy it to user space... > - */ > - ret = copy_page_to_iter(page, offset, nr, to); > retval += ret; > offset += ret; > index += offset >> PAGE_SHIFT; > offset &= ~PAGE_MASK; > > - if (got_page) > - put_page(page); > if (!iov_iter_count(to)) > break; > if (ret < nr) { -- Chuck Lever ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-08 16:10 ` Chuck Lever III @ 2022-04-08 19:09 ` Hugh Dickins 2022-04-08 19:52 ` Chuck Lever III 0 siblings, 1 reply; 11+ messages in thread From: Hugh Dickins @ 2022-04-08 19:09 UTC (permalink / raw) To: Chuck Lever III Cc: Hugh Dickins, Andrew Morton, Mark Hemment, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel On Fri, 8 Apr 2022, Chuck Lever III wrote: > > On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote: > > On Thu, 7 Apr 2022, Chuck Lever III wrote: > >> > >> 847 static int > >> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf, > >> 849 struct splice_desc *sd) > >> 850 { > >> 851 struct svc_rqst *rqstp = sd->u.data; > >> 852 struct page **pp = rqstp->rq_next_page; > >> 853 struct page *page = buf->page; > >> 854 > >> 855 if (rqstp->rq_res.page_len == 0) { > >> 856 svc_rqst_replace_page(rqstp, page); > >> 857 rqstp->rq_res.page_base = buf->offset; > >> 858 } else if (page != pp[-1]) { > >> 859 svc_rqst_replace_page(rqstp, page); > >> 860 } > >> 861 rqstp->rq_res.page_len += sd->len; > >> 862 > >> 863 return sd->len; > >> 864 } > >> > >> rq_next_page should point to the first unused element of > >> rqstp->rq_pages, so IIUC that check is looking for the > >> final page that is part of the READ payload. > >> > >> But that does suggest that if page -> ZERO_PAGE and so does > >> pp[-1], then svc_rqst_replace_page() would not be invoked. > > To put a little more color on this, I think the idea here > is to prevent releasing the same page twice. It might be > possible that NFSD can add the same page to the rq_pages > array more than once, and we don't want to do a double > put_page(). > > The only time I can think this might happen is if the > READ payload is partially contained in the page that > contains the NFS header. I'm not sure that can ever > happen these days. 
I'd have thought that if a page were repeated, then its refcount would have been raised twice, and so require a double put_page(). But it's no concern of mine. The only thing I'd say is, if you do find a good way to robustify that code for duplicates, please don't make it conditional on ZERO_PAGE - that's just a special case which I mis-introduced and is now about to go away. > > > > We might be able to avoid that revert, and go the whole way to using > > iov_iter_zero() instead. But the significant slowness of clear_user() > > relative to copy to user, on x86 at least, does ask for a hybrid. > > > > Suggested patch below, on top of 5.18-rc1, passes my own testing: > > but will it pass yours? It seems to me safe, and as fast as before, > > but we don't know yet if this iov_iter_zero() works right for you. > > Chuck, please give it a go and let us know. > > Applied to stock v5.18-rc1. The tests pass as expected. Great, thanks a lot, I'll move ahead with sending akpm the patch with a proper commit message. Hugh ^ permalink raw reply [flat|nested] 11+ messages in thread
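[Editor's note: Hugh's expectation above is the usual kernel invariant: each reference to a page takes its own get_page(), so a duplicated page needs, and tolerates, a matching number of put_page() calls. A toy refcount makes the balanced pairing concrete (hypothetical userspace stand-ins for the real helpers):]

```c
/* Toy page refcount, standing in for the kernel's get_page()/put_page(). */
struct page {
	int refcount;
};

static void get_page(struct page *p)
{
	p->refcount++;
}

/* Returns 1 when the final reference is dropped and the page "frees". */
static int put_page(struct page *p)
{
	return --p->refcount == 0;
}
```

If each rq_pages slot held its own reference, a duplicate entry would be harmless; whether nfsd's splice path actually takes that per-slot reference is the unresolved question the check papers over.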
* Re: Regression in xfstests on tmpfs-backed NFS exports 2022-04-08 19:09 ` Hugh Dickins @ 2022-04-08 19:52 ` Chuck Lever III 0 siblings, 0 replies; 11+ messages in thread From: Chuck Lever III @ 2022-04-08 19:52 UTC (permalink / raw) To: Hugh Dickins Cc: Andrew Morton, Mark Hemment, Patrice CHOTARD, Mikulas Patocka, Lukas Czerner, Christoph Hellwig, Darrick J. Wong, Linux-MM, Linux NFS Mailing List, linux-fsdevel > On Apr 8, 2022, at 3:09 PM, Hugh Dickins <hughd@google.com> wrote: > > On Fri, 8 Apr 2022, Chuck Lever III wrote: >>> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote: >>> On Thu, 7 Apr 2022, Chuck Lever III wrote: >>>> >>>> 847 static int >>>> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf, >>>> 849 struct splice_desc *sd) >>>> 850 { >>>> 851 struct svc_rqst *rqstp = sd->u.data; >>>> 852 struct page **pp = rqstp->rq_next_page; >>>> 853 struct page *page = buf->page; >>>> 854 >>>> 855 if (rqstp->rq_res.page_len == 0) { >>>> 856 svc_rqst_replace_page(rqstp, page); >>>> 857 rqstp->rq_res.page_base = buf->offset; >>>> 858 } else if (page != pp[-1]) { >>>> 859 svc_rqst_replace_page(rqstp, page); >>>> 860 } >>>> 861 rqstp->rq_res.page_len += sd->len; >>>> 862 >>>> 863 return sd->len; >>>> 864 } >>>> >>>> rq_next_page should point to the first unused element of >>>> rqstp->rq_pages, so IIUC that check is looking for the >>>> final page that is part of the READ payload. >>>> >>>> But that does suggest that if page -> ZERO_PAGE and so does >>>> pp[-1], then svc_rqst_replace_page() would not be invoked. >> >> To put a little more color on this, I think the idea here >> is to prevent releasing the same page twice. It might be >> possible that NFSD can add the same page to the rq_pages >> array more than once, and we don't want to do a double >> put_page(). >> >> The only time I can think this might happen is if the >> READ payload is partially contained in the page that >> contains the NFS header. 
I'm not sure that can ever >> happen these days. > > I'd have thought that if a page were repeated, then its refcount would > have been raised twice, and so require a double put_page(). But it's > no concern of mine. The only thing I'd say is, if you do find a good > way to robustify that code for duplicates, please don't make it > conditional on ZERO_PAGE - that's just a special case which I > mis-introduced and is now about to go away. 100% agree on both counts: not sure (yet) why get_page() was not used here, and a special case for ZERO_PAGE would be a band-aid. Anyway, I haven't found a case yet where a duplicate struct page appears in rq_pages with current kernels. -- Chuck Lever ^ permalink raw reply [flat|nested] 11+ messages in thread
Thread overview: 11+ messages
2022-04-06 17:18 Regression in xfstests on tmpfs-backed NFS exports Chuck Lever III
2022-04-07  0:18 ` Hugh Dickins
2022-04-07  4:25 ` Mark Hemment
2022-04-07 22:04 ` Hugh Dickins
2022-04-07 19:24 ` Chuck Lever III
2022-04-07 22:26 ` Hugh Dickins
2022-04-07 23:45 ` Chuck Lever III
2022-04-08 14:38 ` Mark Hemment
2022-04-08 16:10 ` Chuck Lever III
2022-04-08 19:09 ` Hugh Dickins
2022-04-08 19:52 ` Chuck Lever III