Re: Regression in xfstests on tmpfs-backed NFS exports

From: Hugh Dickins <hughd@google.com>
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Mark Hemment <markhemm@googlemail.com>,
	Patrice CHOTARD <patrice.chotard@foss.st.com>,
	Mikulas Patocka <mpatocka@redhat.com>,
	Lukas Czerner <lczerner@redhat.com>,
	Christoph Hellwig <hch@lst.de>,
	"Darrick J. Wong" <djwong@kernel.org>,
	Linux-MM <linux-mm@kvack.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>
Subject: Re: Regression in xfstests on tmpfs-backed NFS exports
Date: Thu, 7 Apr 2022 15:26:56 -0700 (PDT)
Message-ID: <c5ea49a-1a76-8cf9-5c76-4bb31aa3d458@google.com>
In-Reply-To: <2B7AF707-67B1-4ED8-A29F-957C26B7F87A@oracle.com>

On Thu, 7 Apr 2022, Chuck Lever III wrote:
> > On Apr 6, 2022, at 8:18 PM, Hugh Dickins <hughd@google.com> wrote:
> > 
> > But I can sit here and try to guess.  I notice fs/nfsd checks
> > file->f_op->splice_read, and employs fallback if not available:
> > if you have time, please try rerunning those xfstests on an -rc1
> > kernel, but with mm/shmem.c's .splice_read line commented out.
> > My guess is that will then pass the tests, and we shall know more.
> 
> This seemed like the most probative next step, so I commented
> out the .splice_read call-out in mm/shmem.c and ran the tests
> again. Yes, that change enables the fsx-related tests to pass
> as expected.

Great, thank you for trying that.

> 
> > What could be going wrong there?  I've thought of two possibilities.
> > A minor, hopefully easily fixed, issue would be if fs/nfsd has
> > trouble with seeing the same page twice in a row: since tmpfs is
> > now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> > caught sight of code which looks to see if the latest page is the
> > same as the one before.  It's easy to imagine that might go wrong.
> 
> Are you referring to this function in fs/nfsd/vfs.c ?

I think that was it, didn't pay much attention.

> 
> static int
> nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>                   struct splice_desc *sd)
> {
>         struct svc_rqst *rqstp = sd->u.data;
>         struct page **pp = rqstp->rq_next_page;
>         struct page *page = buf->page;
>
>         if (rqstp->rq_res.page_len == 0) {
>                 svc_rqst_replace_page(rqstp, page);
>                 rqstp->rq_res.page_base = buf->offset;
>         } else if (page != pp[-1]) {
>                 svc_rqst_replace_page(rqstp, page);
>         }
>         rqstp->rq_res.page_len += sd->len;
>
>         return sd->len;
> }
> 
> rq_next_page should point to the first unused element of
> rqstp->rq_pages, so IIUC that check is looking for the
> final page that is part of the READ payload.
> 
> But that does suggest that if page -> ZERO_PAGE and so does
> pp[-1], then svc_rqst_replace_page() would not be invoked.

I still haven't studied the logic there, but Mark's input made it clear
that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly:
there could be expectations of uniqueness in other places too.
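
To make the suspected failure mode concrete, here is a minimal userspace
sketch. It is a model, not nfsd code: the struct and both functions are
hypothetical names invented for illustration, and only the branch
structure is taken from the nfsd_splice_actor() quoted above. It counts
how many page slots get filled when a hole is delivered either as
distinct pages or as the same zero page over and over.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u
#define MODEL_MAX_PAGES 16

/* Hypothetical stand-in for the svc_rqst fields the actor touches. */
struct model_rqst {
	const void *pages[MODEL_MAX_PAGES];	/* models rq_pages */
	size_t next;				/* models rq_next_page */
	size_t page_len;			/* models rq_res.page_len */
};

/*
 * Same decisions as the quoted actor: take the page if the payload is
 * still empty, or if it differs from the page taken last time.
 */
static size_t model_splice_actor(struct model_rqst *rq, const void *page,
				 size_t len)
{
	if (rq->page_len == 0 || page != rq->pages[rq->next - 1]) {
		assert(rq->next < MODEL_MAX_PAGES);
		rq->pages[rq->next++] = page;
	}
	rq->page_len += len;
	return len;
}

/*
 * Feed npages full-page buffers through the actor and return how many
 * page slots were filled.  With unique_pages == 0, every buffer refers
 * to the same zero page, as tmpfs did for holes in 5.18-rc1.
 */
static size_t model_read_hole(int unique_pages, size_t npages)
{
	static const char zero_page[MODEL_PAGE_SIZE];
	static const char distinct[MODEL_MAX_PAGES][MODEL_PAGE_SIZE];
	struct model_rqst rq = {0};
	size_t i;

	assert(npages <= MODEL_MAX_PAGES);
	for (i = 0; i < npages; i++)
		model_splice_actor(&rq, unique_pages ? distinct[i] : zero_page,
				   MODEL_PAGE_SIZE);
	return rq.next;
}
```

In this model a four-page hole delivered as four distinct pages fills
four slots, but delivered as the repeated zero page it fills only one,
while page_len still grows by four pages' worth.  Whether that mismatch
is what actually breaks the fsx tests has not been confirmed here.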

> 
> > A more difficult issue would be, if fsx is racing writes and reads,
> > in a way that it can guarantee the correct result, but that correct
> > result is no longer delivered: because the writes go into freshly
> > allocated tmpfs cache pages, while reads are still delivering
> > stale ZERO_PAGEs from the pipe.  I'm hazy on the guarantees there.
> > 
> > But unless someone has time to help out, we're heading for a revert.

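We might be able to avoid that revert, and go the whole way to using
iov_iter_zero() instead.  But the significant slowness of clear_user()
relative to copy to user, on x86 at least, does ask for a hybrid: keep
copying from ZERO_PAGE(0) when the destination is a user iovec, and
use iov_iter_zero() for pipes and everything else.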

Suggested patch below, on top of 5.18-rc1, passes my own testing:
but will it pass yours?  It seems to me safe, and as fast as before,
but we don't know yet if this iov_iter_zero() works right for you.
Chuck, please give it a go and let us know.

(Don't forget to restore mm/shmem.c's .splice_read first!  And if
this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
in the same patch, fixing the other regression, without recourse to
#ifdefs or arch mods.)

Thanks!
Hugh

--- 5.18-rc1/mm/shmem.c
+++ linux/mm/shmem.c
@@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		pgoff_t end_index;
 		unsigned long nr, ret;
 		loff_t i_size = i_size_read(inode);
-		bool got_page;
 
 		end_index = i_size >> PAGE_SHIFT;
 		if (index > end_index)
@@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			 */
 			if (!offset)
 				mark_page_accessed(page);
-			got_page = true;
+			/*
+			 * Ok, we have the page, and it's up-to-date, so
+			 * now we can copy it to user space...
+			 */
+			ret = copy_page_to_iter(page, offset, nr, to);
+			put_page(page);
+
+		} else if (iter_is_iovec(to)) {
+			/*
+			 * Copy to user tends to be so well optimized, but
+			 * clear_user() not so much, that it is noticeably
+			 * faster to copy the zero page instead of clearing.
+			 */
+			ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
 		} else {
-			page = ZERO_PAGE(0);
-			got_page = false;
+			/*
+			 * But submitting the same page twice in a row to
+			 * splice() - or others? - can result in confusion:
+			 * so don't attempt that optimization on pipes etc.
+			 */
+			ret = iov_iter_zero(nr, to);
 		}
 
-		/*
-		 * Ok, we have the page, and it's up-to-date, so
-		 * now we can copy it to user space...
-		 */
-		ret = copy_page_to_iter(page, offset, nr, to);
 		retval += ret;
 		offset += ret;
 		index += offset >> PAGE_SHIFT;
 		offset &= ~PAGE_MASK;
 
-		if (got_page)
-			put_page(page);
 		if (!iov_iter_count(to))
 			break;
 		if (ret < nr) {