From: Chuck Lever <chuck.lever@oracle.com>
To: "J. Bruce Fields" <bfields@redhat.com>
Cc: Trond Myklebust <trond.myklebust@primarydata.com>,
	Anna Schumaker <schumakeranna@gmail.com>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
	Andreas Gruenbacher <agruenba@redhat.com>,
	Dros Adamson <dros@primarydata.com>
Subject: Re: [PATCH 0/3] getacl fixes
Date: Fri, 17 Feb 2017 16:21:36 -0500	[thread overview]
Message-ID: <42C50E3B-225A-4416-9693-388F6390EB42@oracle.com> (raw)
In-Reply-To: <20170217205245.GA18901@parsley.fieldses.org>


> On Feb 17, 2017, at 3:52 PM, J. Bruce Fields <bfields@redhat.com> wrote:
> 
> On Fri, Feb 17, 2017 at 03:36:38PM -0500, Chuck Lever wrote:
>> 
>>> On Feb 17, 2017, at 11:44 AM, J. Bruce Fields <bfields@redhat.com> wrote:
>>> 
>>> From: "J. Bruce Fields" <bfields@redhat.com>
>>> 
>>> The getacl code is allocating enough space to handle the ACL data but
>>> not to handle the bitmask, which can lead to spurious ERANGE errors when
>>> the end of the ACL gets close to a page boundary.
>>> 
>>> Dros addressed this by letting the rpc layer allocate pages as necessary
>>> on demand, as the NFSv3 ACL code does.
>>> 
>>> On its own that didn't do the job either, because we don't handle the
>>> case where xdr_shrink_bufhead needs to move data around in the xdr buf.
>>> And xdr_shrink_bufhead was getting called every time due to an incorrect
>>> estimate in an xdr_inline_pages call.
>>> 
>>> So, I fixed that estimate.  That still leaves the chance of a bug in the
>>> rare case xdr_shrink_bufhead is called.
>>> 
>>> We could fix up the handling of the xdr_shrink_bufhead case, but I don't
>>> see the point of shifting this data around in the first place.  We're
>>> not doing anything like zero-copy here, we're just going to copy the
>>> data out into the buffer we were passed.  The NFSv3 ACL code doesn't
>>> bother with this.
>>> 
>>> It's simpler just to pass down the buffer to the xdr layer and let it
>>> copy the ACL out.
>> 
>> I haven't looked closely at these yet, but I have some general
>> thoughts (worth approximately 2 cents).
>> 
>> NFS/RDMA clients have to pre-allocate and register a receive buffer
>> for requests with large replies. The client's RPC layer can't allocate
>> more memory if the reply overruns the existing buffer.
>> 
>> (Note that the server doesn't have the same problem: the client
>> sends an RPC-over-RDMA message telling the server exactly how large
>> the RPC Call message is, and the server prepares RDMA Read operations
>> to pull it over.)
>> 
>> ACLs are particularly troublesome because there doesn't seem to be
>> a way for a client to ask a server "how big is this ACL?" before it
>> actually asks for the ACL. And at least for NFSACL there does
>> not seem to be a protocol-defined size limit for these objects.
> 
> I think in practice the OS/filesystem limits end up being the limiting
> factor.  V4.0 might be the more annoying case, partly thanks to all
> those string names.

Agree, though there is no sure-fire way for either
side to know what the other peer's limits might be,
unlike, say, SYMLINK.


>> If the server can't fit an ACL into the client-provided reply buffer,
>> that causes a transport level error. The blast radius of this failure
>> includes any RPC that happens to be running on that connection, which
>> will have to be retransmitted.
>> 
>> If the client has sent a non-idempotent operation in the same
>> COMPOUND as a GETATTR requesting the ACL, there could be a problem
>> if the server can't return the RPC Reply because the client's
>> receive buffer is too small. The solution there is to always send
>> such operations in separate COMPOUNDs.
>> 
>> So I prefer in general that the NFS client (above RPC) provide as
>> large a buffer as practical for NFSACL GETACL and NFSv4 GETATTR
>> requesting an ACL. IIUC that is the direction your patches are
>> going.
> 
> No, the net effect is to make the v4 code like the v3 code and allocate
> pages for the reply only on demand.  (I understand the confusion,
> there's multiple buffers involved here, and my description could
> probably be better.)

There is a similar hack in xprtrdma's marshaling code
that allocates reply pages while constructing the RPC
Call if the upper layer hasn't provided them.

It would be great if instead, retrieving an ACL worked
like other NFS operations.
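
For illustration only, here is a rough userspace sketch of that kind
of on-demand allocation. The names are made up and this is not the
actual xprtrdma code, just the shape of the idea, with malloc()
standing in for alloc_page(GFP_KERNEL):

/*
 * Rough model of "allocate reply pages while marshaling the Call
 * if the caller didn't provide them".  Hypothetical names.
 */
#include <stdio.h>
#include <stdlib.h>

#define MODEL_PAGE_SIZE 4096

struct model_rcv_buf {
	void	**pages;	/* page array for the bulk payload */
	size_t	npages;		/* how many pages the reply may need */
};

/* Runs in process context, before the Call is sent. */
static int model_fill_reply_pages(struct model_rcv_buf *buf)
{
	for (size_t i = 0; i < buf->npages; i++) {
		if (buf->pages[i])
			continue;	/* upper layer already provided it */
		buf->pages[i] = malloc(MODEL_PAGE_SIZE);
		if (!buf->pages[i])
			return -1;	/* would be -ENOMEM in the kernel */
	}
	return 0;
}

int main(void)
{
	void *pages[4] = { NULL };
	struct model_rcv_buf buf = { .pages = pages, .npages = 4 };

	if (model_fill_reply_pages(&buf))
		return 1;
	printf("all %zu reply pages in place before the Call goes out\n",
	       buf.npages);
	return 0;
}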


> Ugh.
> 
> Does the RDMA protocol give us any other mechanism we can use for the
> case of ACL replies?

The current RPC-over-RDMA Version One transport provides
two options:

If the RPC Reply is guaranteed to be smaller than the
inline threshold (the size of what can be sent with RDMA
Send/Receive, which can be as small as 1KB), then RDMA
Send with pre-allocated reusable buffers is used to send
the reply. This works fine as long as the expected
Reply message is small.

If the RPC Reply could be larger than the inline threshold,
the client has to provide a Reply chunk, which is a region
of client memory that is registered so that the server can
use RDMA Writes to return the reply.

If that region is too small, the server is supposed to
return a transport-specific error instead of the RPC
Reply.

(There is a third mechanism but it is forbidden for
everything but NFS READ/WRITE, and it would have the
same problem because the client has to know the size
of the Reply in advance).

For NFSACL GETACL, for example, the client doesn't know
how large the RPC Reply message might be. So it will always
register a Reply chunk for GETACL requests, and it risks
underestimating the size of this region (though I've never
seen it actually get overrun, since real-world ACLs tend to
be small).
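
To make that choice concrete, here is a toy userspace sketch of the
decision as I described it above. The names and sizes are
hypothetical; this is not the real xprtrdma marshaling code:

/*
 * Model of the client-side choice between an inline reply and a
 * registered Reply chunk.  Illustrative only.
 */
#include <stdio.h>

#define INLINE_THRESHOLD 1024	/* can be as small as 1KB */

static void model_choose_reply_path(size_t max_reply_size)
{
	if (max_reply_size <= INLINE_THRESHOLD) {
		/* Small reply: server Sends it into a posted Receive buffer. */
		printf("inline reply via RDMA Send\n");
	} else {
		/*
		 * Large reply: the client registers a Reply chunk of
		 * max_reply_size bytes and the server RDMA Writes the
		 * Reply into it.  If the actual Reply is larger than the
		 * chunk, the server returns a transport error instead.
		 */
		printf("register a %zu-byte Reply chunk\n", max_reply_size);
	}
}

int main(void)
{
	/* GETACL: the ACL size is unknown, so this is only a guess. */
	model_choose_reply_path(64 * 1024);
	return 0;
}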


> It probably wouldn't be so terrible to preallocate the maximum number of
> pages possible if that's really the only option.  May as well get rid of
> the allocations in xdr_partial_copy_from_skb if we do that, as I don't
> think there are other users?

I can't think of any other use cases which rely on this
mechanism.

That mechanism is unreliable, isn't it? It can fail
because it cannot use a GFP_KERNEL allocation while
receiving a reply, or am I misinformed? That makes
either of the current implementation choices less
preferable than having the NFS client always allocate
a large buffer while still in process context, AFAICS.
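
As a toy illustration of why I think process-context allocation is
safer (malloc() standing in for page allocation; the kernel detail is
GFP_ATOMIC in the receive path versus GFP_KERNEL in process context):

/* Illustrative only; not the actual xdr_partial_copy_from_skb code. */
#include <stdio.h>
#include <stdlib.h>

#define NPAGES 8

/* Receive-path style: allocate each page only as reply data arrives.
 * The allocator cannot sleep there, so a single failure loses part
 * of the reply. */
static int copy_reply_allocating_on_demand(void *pages[NPAGES])
{
	for (int i = 0; i < NPAGES; i++) {
		if (!pages[i])
			pages[i] = malloc(4096);	/* may fail under pressure */
		if (!pages[i])
			return -1;			/* reply truncated mid-stream */
	}
	return 0;
}

int main(void)
{
	void *prealloc[NPAGES];
	void *on_demand[NPAGES] = { NULL };
	int i;

	/* Preferred: allocate everything before sending the request,
	 * where a blocking allocation is allowed. */
	for (i = 0; i < NPAGES; i++)
		prealloc[i] = malloc(4096);

	if (copy_reply_allocating_on_demand(on_demand) < 0)
		fprintf(stderr, "reply lost\n");

	for (i = 0; i < NPAGES; i++) {
		free(prealloc[i]);
		free(on_demand[i]);
	}
	return 0;
}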


> --b.
> 
>> 
>> We likely have a similar conundrum with security labels.
>> 
>> 
>>> The result looks a lot simpler and more obviously correct than this code
>>> has been, though I'm not particularly happy with the sequence of patches
>>> that gets us there; it would be better to squash together Dros's and my
>>> patch and then split out the result in some more sensible way.
>>> 
>>> Sorry for the delay getting back to this.  Older discussions:
>>> 
>>> 	https://marc.info/?t=138452791200001&r=1&w=2
>>> 	http://marc.info/?t=138506891000003&r=1&w=2
>>> 
>>> J. Bruce Fields (2):
>>> nfsd4: fix getacl head length estimation
>>> nfsd4: simplify getacl decoding
>>> 
>>> Weston Andros Adamson (1):
>>> NFSv4: fix getacl ERANGE for some ACL buffer sizes
>>> 
>>> fs/nfs/nfs4proc.c       | 116 +++++++++++++++++++++++-------------------------
>>> fs/nfs/nfs4xdr.c        |  29 +++---------
>>> include/linux/nfs_xdr.h |   4 +-
>>> 3 files changed, 64 insertions(+), 85 deletions(-)
>>> 
>>> -- 
>>> 2.9.3
>>> 
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 

--
Chuck Lever



