From: Olga Kornievskaia <olga.kornievskaia@gmail.com>
To: Chuck Lever <chuck.lever@oracle.com>
Cc: Frank van der Linden <fllinden@amazon.com>,
	Trond Myklebust <trond.myklebust@hammerspace.com>,
	Anna Schumaker <anna.schumaker@netapp.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: [PATCH 1/1] NFSv4.2: fix LISTXATTR buffer receive size
Date: Mon, 23 Nov 2020 18:14:14 -0500
Message-ID: <CAN-5tyFe-FBb_UWUmWokotEzNiYj5zJaWiu1oK+54H-1HQRurw@mail.gmail.com>
In-Reply-To: <F85397C8-3FFD-4A7F-92E4-DB84D80F6387@oracle.com>

On Mon, Nov 23, 2020 at 1:09 PM Chuck Lever <chuck.lever@oracle.com> wrote:
>
>
>
> > On Nov 23, 2020, at 12:59 PM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >
> > On Mon, Nov 23, 2020 at 12:37 PM Chuck Lever <chuck.lever@oracle.com> wrote:
> >>
> >>
> >>
> >>> On Nov 23, 2020, at 11:42 AM, Olga Kornievskaia <olga.kornievskaia@gmail.com> wrote:
> >>>
> >>> Hi Frank, Chuck,
> >>>
> >>> I would like your opinion on how LISTXATTR is supposed to work over
> >>> RDMA. Here's my current understanding of why listxattr is not
> >>> working over RDMA.
> >>>
> >>> This happens when listxattr is called with a very small buffer
> >>> size, for which RDMA wants to send an inline request. I really don't
> >>> understand why, Chuck, you are not seeing any problems with hardware;
> >>> as far as I can tell it would have the same problem, because the
> >>> inline threshold size would still make a request of this size inline.
> >>> rpcrdma_inline_fixup() is trying to write to pages that don't exist.
> >>>
> >>> When LISTXATTR sets the XDRBUF_SPARSE_PAGES flag, there is code in
> >>> xs_alloc_sparse_pages() that will allocate the pages, but this is
> >>> ONLY for TCP. RDMA doesn't have anything like that.
> >>>
> >>> Question: Should there be code added to RDMA that will do something
> >>> similar when it sees that flag set?
> >>
> >> Isn't the logic in rpcrdma_convert_iovs() allocating those pages?
> >
> > No, rpcrdma_convert_iovs() is only called when you have reply chunks,
> > lists, etc., but not for inline messages. What am I missing?
>
> So, then, rpcrdma_marshal_req() is deciding that the LISTXATTRS
> reply is supposed to fit inline. That means rqst->rq_rcv_buf.buflen
> is small.
>
> But if rpcrdma_inline_fixup() is trying to fill pages,
> rqst->rq_rcv_buf.page_len must not be zero? That sounds like the
> LISTXATTRS encoder is not setting up the receive buffer correctly.
>
> The receive buffer's buflen field is supposed to be set to a value
> that is at least as large as page_len, I would think.

Here's what the LISTXATTR code does, as far as I can see:
it allocates pointers to the pages (but no pages). It sets page_len
from hdr.replen, so yes, it's not zero (setting that value is, as far
as I know, correct). So for RDMA nothing allocates those pages,
because it's an inline request. The TCP code will allocate those
pages because that code was added. You keep saying that you don't
think this is it, but I don't know how to prove to you that KASAN's
"wild-memory access" report means the page wasn't allocated. It's a
bogus address. The page isn't there. Or at least that's how I read
the KASAN report; I don't know how else to interpret it (and the fact
that the code never allocates that memory is, I believe, a strong
argument).
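
For comparison, here is roughly what the TCP side does when it sees
the flag. This is a simplified sketch of xs_alloc_sparse_pages() in
net/sunrpc/xprtsock.c, from memory, not a verbatim copy:

/* Sketch: fill in any receive pages the caller left NULL when the
 * buffer is marked XDRBUF_SPARSE_PAGES. The real function also deals
 * with the bvec and the exact truncation length; details here are
 * approximate.
 */
static void xs_alloc_sparse_pages(struct xdr_buf *buf, gfp_t gfp)
{
	size_t i, n = xdr_buf_pagecount(buf);

	if (!(buf->flags & XDRBUF_SPARSE_PAGES))
		return;
	for (i = 0; i < n; i++) {
		if (buf->pages[i])
			continue;	/* caller supplied this page */
		buf->pages[i] = alloc_page(gfp);
		if (!buf->pages[i]) {
			/* truncate the receive at the first hole */
			buf->page_len = i << PAGE_SHIFT;
			break;
		}
	}
}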

The NFS code can't know that the request will go inline. It assumes
something will allocate that memory, but RDMA doesn't allocate memory
for inline messages.
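
If the RDMA transport were to honor the flag the same way, it would
presumably need a similar loop before it commits to an inline receive.
Something like this hypothetical helper (neither the function nor its
call site exists today; this only shows the shape of it):

/* Hypothetical RDMA-side equivalent: allocate any missing pages of a
 * sparse receive buffer before the transport decides the reply fits
 * inline. Purely illustrative.
 */
static int rpcrdma_alloc_sparse_pages(struct xdr_buf *buf)
{
	size_t i, n = xdr_buf_pagecount(buf);

	if (!(buf->flags & XDRBUF_SPARSE_PAGES))
		return 0;
	for (i = 0; i < n; i++) {
		if (buf->pages[i])
			continue;
		buf->pages[i] = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
		if (!buf->pages[i])
			return -ENOMEM;
	}
	return 0;
}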

I'm not suggesting this is the correct fix, but here is a change that
removes the oops:

diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 2b2211d1234e..faab6aedeb42 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -1258,6 +1258,15 @@ static ssize_t _nfs42_proc_listxattrs(struct inode *inode, void *buf,
                __free_page(res.scratch);
                return -ENOMEM;
        }
+       if (buflen < 1024) {
+               int i;
+               for (i = 0; i < np; i++) {
+                       pages[i] = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
+                       if (!pages[i])
+                               return -ENOMEM;
+               }
+       }
+

        arg.xattr_pages = pages;
        arg.count = xdrlen;


Basically, since I know that with soft RoCE anything smaller than 1024
bytes will go inline, I need to allocate pages for those requests. This
doesn't interfere with TCP mounts, because the TCP code checks whether
the pages are already allocated and only allocates them if they are not.
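
One more reason it's only a band-aid: if alloc_page() fails partway
through, the loop above returns -ENOMEM and leaks the pages it already
allocated (and res.scratch). A complete version would have to unwind,
roughly like this, assuming the pages array itself is cleaned up by
the function's existing error path:

	for (i = 0; i < np; i++) {
		pages[i] = alloc_page(GFP_NOWAIT | __GFP_NOWARN);
		if (!pages[i]) {
			/* free whatever was allocated before the failure */
			while (i-- > 0)
				__free_page(pages[i]);
			__free_page(res.scratch);
			return -ENOMEM;
		}
	}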

But of course this is not a real solution, because the NFS layer has
no way of knowing what the RDMA inline threshold is.



> >>> Or should LISTXATTR be rewritten
> >>> to be like READDIR, which allocates pages before making the call?
> >>
> >> AIUI READDIR reads into the directory inode's page cache. I recall
> >> that Frank couldn't do that for LISTXATTR because there's no
> >> similar page cache associated with the xattr listing.
> >>
> >> That said, I would prefer that the *XATTR procedures directly
> >> allocate pages instead of relying on SPARSE_PAGES, which is a hack
> >> IMO. I think it would have to use alloc_page() for that, and then
> >> ensure those pages are released when the call has completed.
> >>
> >> I'm not convinced this is the cause of the problem you're seeing,
> >> though.
> >>
> >> --
> >> Chuck Lever
>
> --
> Chuck Lever
>
>
>

Thread overview: 20+ messages
2020-11-13 19:08 [PATCH 1/1] NFSv4.2: fix LISTXATTR buffer receive size Olga Kornievskaia
2020-11-13 20:34 ` Chuck Lever
2020-11-18 21:44   ` Olga Kornievskaia
2020-11-18 22:16     ` Trond Myklebust
2020-11-19 14:37     ` Chuck Lever
2020-11-19 15:09       ` Olga Kornievskaia
2020-11-19 16:19         ` Chuck Lever
2020-11-19 23:26           ` Frank van der Linden
2020-11-20 16:37             ` Olga Kornievskaia
2020-11-23 16:42               ` Olga Kornievskaia
2020-11-23 17:37                 ` Chuck Lever
2020-11-23 17:59                   ` Olga Kornievskaia
2020-11-23 18:09                     ` Chuck Lever
2020-11-23 23:14                       ` Olga Kornievskaia [this message]
2020-11-23 18:20                   ` Frank van der Linden
2020-11-23 17:38                 ` Frank van der Linden
2020-11-23 17:49                   ` Chuck Lever
2020-11-23 17:56                   ` Chuck Lever
2020-11-23 18:05                   ` Olga Kornievskaia
2020-11-23 19:24                   ` [UNVERIFIED SENDER] " Frank van der Linden
