From: Chuck Lever III <chuck.lever@oracle.com>
To: Hugh Dickins <hughd@google.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Mark Hemment <markhemm@googlemail.com>,
Patrice CHOTARD <patrice.chotard@foss.st.com>,
Mikulas Patocka <mpatocka@redhat.com>,
Lukas Czerner <lczerner@redhat.com>,
Christoph Hellwig <hch@lst.de>,
"Darrick J. Wong" <djwong@kernel.org>,
Linux-MM <linux-mm@kvack.org>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Regression in xfstests on tmpfs-backed NFS exports
Date: Thu, 7 Apr 2022 23:45:31 +0000
Message-ID: <0017C60F-0BD8-4F5A-BD68-189EEDB2195C@oracle.com>
In-Reply-To: <c5ea49a-1a76-8cf9-5c76-4bb31aa3d458@google.com>

> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <hughd@google.com> wrote:
>
> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote:
>>>
>>> But I can sit here and try to guess. I notice fs/nfsd checks
>>> file->f_op->splice_read, and employs fallback if not available:
>>> if you have time, please try rerunning those xfstests on an -rc1
>>> kernel, but with mm/shmem.c's .splice_read line commented out.
>>> My guess is that will then pass the tests, and we shall know more.
>>
>> This seemed like the most probative next step, so I commented
>> out the .splice_read call-out in mm/shmem.c and ran the tests
>> again. Yes, that change enables the fsx-related tests to pass
>> as expected.
>
> Great, thank you for trying that.
>
>>
>>> What could be going wrong there? I've thought of two possibilities.
>>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
>>> trouble with seeing the same page twice in a row: since tmpfs is
>>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
>>> caught sight of code which looks to see if the latest page is the
>>> same as the one before. It's easy to imagine that might go wrong.
>>
>> Are you referring to this function in fs/nfsd/vfs.c ?
>
> I think that was it, didn't pay much attention.

This code seems to have been the issue. I added a little test
to see if @page pointed to ZERO_PAGE(0) and now the tests
pass as expected.

>> static int
>> nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> 		  struct splice_desc *sd)
>> {
>> 	struct svc_rqst *rqstp = sd->u.data;
>> 	struct page **pp = rqstp->rq_next_page;
>> 	struct page *page = buf->page;
>>
>> 	if (rqstp->rq_res.page_len == 0) {
>> 		svc_rqst_replace_page(rqstp, page);
>> 		rqstp->rq_res.page_base = buf->offset;
>> 	} else if (page != pp[-1]) {
>> 		svc_rqst_replace_page(rqstp, page);
>> 	}
>> 	rqstp->rq_res.page_len += sd->len;
>>
>> 	return sd->len;
>> }
>>
>> rq_next_page should point to the first unused element of
>> rqstp->rq_pages, so IIUC that check is looking for the
>> final page that is part of the READ payload.
>>
>> But that does suggest that if page -> ZERO_PAGE and so does
>> pp[-1], then svc_rqst_replace_page() would not be invoked.
>
> I still haven't studied the logic there: Mark's input made it clear
> that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> there could be expectations of uniqueness in other places too.

I can't really attest to Mark's comment, but...

After studying nfsd_splice_actor() I can't see any reason
except cleverness and technical debt for this particular
check. I have a patch that removes the check and simplifies
this function that I'm testing now -- it seems to be a
reasonable clean-up whether you keep 56a8c8eb1eaf or
choose to revert it.
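
To make the concern concrete, here is a minimal userspace sketch of
the "same page as the previous one" check. It is not kernel code:
struct reply_pages, MAX_SLOTS and add_page_dedup() are hypothetical
names invented for illustration, and the sketch only mirrors the shape
of the comparison against pp[-1]. It is not the patch mentioned above.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 16	/* hypothetical capacity, for illustration only */

/* Hypothetical stand-in for the reply page array kept in struct svc_rqst. */
struct reply_pages {
	const void *slots[MAX_SLOTS];
	size_t next;		/* index of the first unused slot */
	size_t page_len;	/* payload bytes accounted so far */
};

/*
 * Mirrors the shape of the check in nfsd_splice_actor(): a page is
 * recorded only when it differs from the most recently recorded page,
 * but the payload length always grows by len.
 */
static void add_page_dedup(struct reply_pages *rp, const void *page, size_t len)
{
	if (rp->page_len == 0 || page != rp->slots[rp->next - 1]) {
		assert(rp->next < MAX_SLOTS);
		rp->slots[rp->next++] = page;
	}
	rp->page_len += len;
}
```

In this sketch, feeding the same pointer three times with a length of
4096 each leaves rp.next at 1 while rp.page_len is 12288, so the
recorded length covers more slots than were actually filled. Three
distinct pointers fill three slots. Whether that mismatch is exactly
what the failing tests hit is my assumption, not something the sketch
proves.
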
>>> A more difficult issue would be, if fsx is racing writes and reads,
>>> in a way that it can guarantee the correct result, but that correct
>>> result is no longer delivered: because the writes go into freshly
>>> allocated tmpfs cache pages, while reads are still delivering
>>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
>>>
>>> But unless someone has time to help out, we're heading for a revert.
>
> We might be able to avoid that revert, and go the whole way to using
> iov_iter_zero() instead. But the significant slowness of clear_user()
> relative to copy to user, on x86 at least, does ask for a hybrid.
>
> Suggested patch below, on top of 5.18-rc1, passes my own testing:
> but will it pass yours? It seems to me safe, and as fast as before,
> but we don't know yet if this iov_iter_zero() works right for you.
> Chuck, please give it a go and let us know.
>
> (Don't forget to restore mm/shmem.c's .splice_read first! And if
> this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> in the same patch, fixing the other regression, without recourse to
> #ifdefs or arch mods.)

Sure, I will try this out first thing tomorrow.

One thing that occurs to me is that for NFS/RDMA, having a
page full of zeroes that is already DMA-mapped would be a
nice optimization on the sender side (on the client for an
NFS WRITE and on the server for an NFS READ). The transport
would have to set up a scatter-gather list containing a
bunch of entries that reference the same page...

</musing>
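
A rough userspace analogue of that scatter-gather idea, purely as a
sketch: it uses struct iovec from <sys/uio.h> rather than any RDMA or
DMA-mapping interface, and SKETCH_PAGE_SIZE, zero_page and
build_zero_sgl() are hypothetical names made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

#define SKETCH_PAGE_SIZE 4096	/* assumed page size for this sketch */

/* One shared page of zeroes; static storage is zero-initialized. */
static const char zero_page[SKETCH_PAGE_SIZE];

/*
 * Fill iov[] so that it describes `total` bytes of zeroes, with every
 * entry pointing at the same zero_page. Returns the number of entries
 * used, or -1 if max_entries is too small.
 */
static int build_zero_sgl(struct iovec *iov, int max_entries, size_t total)
{
	int n = 0;

	assert(iov != NULL && max_entries >= 0);
	while (total > 0) {
		size_t chunk = total < SKETCH_PAGE_SIZE ? total : SKETCH_PAGE_SIZE;

		if (n == max_entries)
			return -1;
		/* The cast drops const; a sender such as writev() only reads it. */
		iov[n].iov_base = (void *)zero_page;
		iov[n].iov_len = chunk;
		n++;
		total -= chunk;
	}
	return n;
}
```

The resulting array could be handed to writev(2) to send a run of
zeroes without allocating a buffer per page. A real transport would
build its own scatter-gather structures instead; this only illustrates
many entries referencing one page.
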
> Thanks!
> Hugh
>
> --- 5.18-rc1/mm/shmem.c
> +++ linux/mm/shmem.c
> @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> pgoff_t end_index;
> unsigned long nr, ret;
> loff_t i_size = i_size_read(inode);
> - bool got_page;
>
> end_index = i_size >> PAGE_SHIFT;
> if (index > end_index)
> @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> */
> if (!offset)
> mark_page_accessed(page);
> - got_page = true;
> + /*
> + * Ok, we have the page, and it's up-to-date, so
> + * now we can copy it to user space...
> + */
> + ret = copy_page_to_iter(page, offset, nr, to);
> + put_page(page);
> +
> + } else if (iter_is_iovec(to)) {
> + /*
> + * Copy to user tends to be so well optimized, but
> + * clear_user() not so much, that it is noticeably
> + * faster to copy the zero page instead of clearing.
> + */
> + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
> } else {
> - page = ZERO_PAGE(0);
> - got_page = false;
> + /*
> + * But submitting the same page twice in a row to
> + * splice() - or others? - can result in confusion:
> + * so don't attempt that optimization on pipes etc.
> + */
> + ret = iov_iter_zero(nr, to);
> }
>
> - /*
> - * Ok, we have the page, and it's up-to-date, so
> - * now we can copy it to user space...
> - */
> - ret = copy_page_to_iter(page, offset, nr, to);
> retval += ret;
> offset += ret;
> index += offset >> PAGE_SHIFT;
> offset &= ~PAGE_MASK;
>
> - if (got_page)
> - put_page(page);
> if (!iov_iter_count(to))
> break;
> if (ret < nr) {
--
Chuck Lever