From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-qt0-f180.google.com ([209.85.216.180]:34047 "EHLO mail-qt0-f180.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751132AbdCNOFq (ORCPT ); Tue, 14 Mar 2017 10:05:46 -0400 Received: by mail-qt0-f180.google.com with SMTP id n21so56257581qta.1 for ; Tue, 14 Mar 2017 07:05:45 -0700 (PDT) Message-ID: <1489500341.2682.3.camel@redhat.com> Subject: Re: nfsd: delegation conflicts between NFSv3 and NFSv4 accessors From: Jeff Layton To: Chuck Lever Cc: "J. Bruce Fields" , Linux NFS Mailing List Date: Tue, 14 Mar 2017 10:05:41 -0400 In-Reply-To: <0FEB53CC-D571-469F-98AA-4D68A545DFAD@oracle.com> References: <1489252126.3367.4.camel@redhat.com> <1489266274.3367.6.camel@redhat.com> <20170313132749.GA11746@parsley.fieldses.org> <41257C5D-BFEF-4538-99A3-BBAA4EE99EE3@oracle.com> <1489422816.2807.1.camel@redhat.com> <0D674F66-1A35-4FA9-8827-111B3E9D969C@oracle.com> <0FEB53CC-D571-469F-98AA-4D68A545DFAD@oracle.com> Content-Type: text/plain; charset="UTF-8" Mime-Version: 1.0 Sender: linux-nfs-owner@vger.kernel.org List-ID: On Mon, 2017-03-13 at 14:26 -0400, Chuck Lever wrote: > > On Mar 13, 2017, at 1:12 PM, Chuck Lever wrote: > > > > > > > > On Mar 13, 2017, at 12:33 PM, Jeff Layton wrote: > > > > > > On Mon, 2017-03-13 at 11:30 -0400, Chuck Lever wrote: > > > > Hi Bruce- > > > > > > > > > > > > > On Mar 13, 2017, at 9:27 AM, J. Bruce Fields wrote: > > > > > > > > > > On Sat, Mar 11, 2017 at 04:04:34PM -0500, Jeff Layton wrote: > > > > > > On Sat, 2017-03-11 at 15:46 -0500, Chuck Lever wrote: > > > > > > > > On Mar 11, 2017, at 12:08 PM, Jeff Layton wrote: > > > > > > > > > > > > > > > > On Sat, 2017-03-11 at 11:53 -0500, Chuck Lever wrote: > > > > > > > > > Hi Bruce, Jeff- > > > > > > > > > > > > > > > > > > I've observed some interesting Linux NFS server behavior (v4.1.12). > > > > > > > > > > > > > > > > > > We have a single system that has an NFSv4 mount via the kernel NFS > > > > > > > > > client, and an NFSv3 mount of the same export via a user space NFS > > > > > > > > > client. These two clients are accessing the same set of files. > > > > > > > > > > > > > > > > > > The following pattern is seen on the wire. I've filtered a recent > > > > > > > > > capture on the FH of one of the shared files. > > > > > > > > > > > > > > > > > > ---- cut here ---- > > > > > > > > > > > > > > > > > > 18507 19.483085 10.0.2.11 -> 10.0.1.8 NFS 238 V4 Call ACCESS FH: 0xc930444f, [Check: RD MD XT XE] > > > > > > > > > 18508 19.483827 10.0.1.8 -> 10.0.2.11 NFS 194 V4 Reply (Call In 18507) ACCESS, [Access Denied: XE], [Allowed: RD MD XT] > > > > > > > > > 18510 19.484676 10.0.1.8 -> 10.0.2.11 NFS 434 V4 Reply (Call In 18509) OPEN StateID: 0x6de3 > > > > > > > > > > > > > > > > > > This OPEN reply offers a read delegation to the kernel NFS client. 
> > > > > > > > > > > > > > > > > > 18511 19.484806 10.0.2.11 -> 10.0.1.8 NFS 230 V4 Call GETATTR FH: 0xc930444f > > > > > > > > > 18512 19.485549 10.0.1.8 -> 10.0.2.11 NFS 274 V4 Reply (Call In 18511) GETATTR > > > > > > > > > 18513 19.485611 10.0.2.11 -> 10.0.1.8 NFS 230 V4 Call GETATTR FH: 0xc930444f > > > > > > > > > 18514 19.486375 10.0.1.8 -> 10.0.2.11 NFS 186 V4 Reply (Call In 18513) GETATTR > > > > > > > > > 18515 19.486464 10.0.2.11 -> 10.0.1.8 NFS 254 V4 Call CLOSE StateID: 0x6de3 > > > > > > > > > 18516 19.487201 10.0.1.8 -> 10.0.2.11 NFS 202 V4 Reply (Call In 18515) CLOSE > > > > > > > > > 18556 19.498617 10.0.2.11 -> 10.0.1.8 NFS 210 V3 READ Call, FH: 0xc930444f Offset: 8192 Len: 8192 > > > > > > > > > > > > > > > > > > This READ call by the user space client does not conflict with the > > > > > > > > > read delegation. > > > > > > > > > > > > > > > > > > 18559 19.499396 10.0.1.8 -> 10.0.2.11 NFS 8390 V3 READ Reply (Call In 18556) Len: 8192 > > > > > > > > > 18726 19.568975 10.0.1.8 -> 10.0.2.11 NFS 310 V3 LOOKUP Reply (Call In 18725), FH: 0xc930444f > > > > > > > > > 18727 19.569170 10.0.2.11 -> 10.0.1.8 NFS 210 V3 READ Call, FH: 0xc930444f Offset: 0 Len: 512 > > > > > > > > > 18728 19.569923 10.0.1.8 -> 10.0.2.11 NFS 710 V3 READ Reply (Call In 18727) Len: 512 > > > > > > > > > 18729 19.570135 10.0.2.11 -> 10.0.1.8 NFS 234 V3 SETATTR Call, FH: 0xc930444f > > > > > > > > > 18730 19.570901 10.0.1.8 -> 10.0.2.11 NFS 214 V3 SETATTR Reply (Call In 18729) Error: NFS3ERR_JUKEBOX > > > > > > > > > > > > > > > > > > The user space client has attempted to extend the file. This does > > > > > > > > > conflict with the read delegation held by the kernel NFS client, > > > > > > > > > so the server returns JUKEBOX, the equivalent of NFS4ERR_DELAY. > > > > > > > > > This causes a negative performance impact on the user space NFS > > > > > > > > > client. > > > > > > > > > > > > > > > > > > 18731 19.575396 10.0.2.11 -> 10.0.1.8 NFS 250 V4 Call DELEGRETURN StateID: 0x6de3 > > > > > > > > > 18732 19.576132 10.0.1.8 -> 10.0.2.11 NFS 186 V4 Reply (Call In 18731) DELEGRETURN > > > > > > > > > > > > > > > > > > No CB_RECALL was done to trigger this DELEGRETURN. Apparently > > > > > > > > > the application that was accessing this file via the kernel OS > > > > > > > > > client decided already that it no longer needed the file before > > > > > > > > > the server could send the CB_RECALL. Sign of perhaps a race > > > > > > > > > between the applications accessing the file via these two > > > > > > > > > mounts. > > > > > > > > > > > > > > > > > > ---- cut here ---- > > > > > > > > > > > > > > > > > > The server is aware of non-NFSv4 accessors of this file in frame > > > > > > > > > 18556. NFSv3 has no OPEN operation, of course, so it's not > > > > > > > > > possible for the server to determine how the NFSv3 client will > > > > > > > > > subsequently access this file. > > > > > > > > > > > > > > > > > > > > > > > > > Right. Why should we assume that the v3 client will do anything other > > > > > > > > than read there? If we recall the delegation just for reads, then we > > > > > > > > potentially negatively affect the performance of the v4 client. > > > > > > > > > > > > > > > > > Seems like at frame 18556, it would be a best practice to recall > > > > > > > > > the delegation to avoid potential future conflicts, such as the > > > > > > > > > SETATTR in frame 18729. > > > > > > > > > > > > > > > > > > Or, perhaps that READ isn't the first NFSv3 access of that file. 
> > > > > > > > > After all, a LOOKUP would have to be done to retrieve that file's > > > > > > > > > FH. The OPEN in frame 18556 perhaps could have avoided offering > > > > > > > > > the READ delegation, knowing there is a recent non-NFSv4 accessor > > > > > > > > > of that file. > > > > > > > > > > > > > > > > > > Would these be difficult or inappropriate policies to implement? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Reads are not currently considered to be conflicting access vs. a read > > > > > > > > delegation. > > > > > > > > > > > > > > Strictly speaking, a single NFSv3 READ does not violate the guarantee > > > > > > > made by the read delegation. And, strictly speaking, there can be no > > > > > > > OPEN conflict because NFSv3 does not have an OPEN operation. > > > > > > > > > > > > > > The question is whether the server has an adequate mechanism for > > > > > > > delaying NFSv3 accessors when an NFSv4 delegation must be recalled. > > > > > > > > > > > > > > NFS3ERR_JUKEBOX and NFS4ERR_DELAY share the same numeric value, but > > > > > > > imply different semantics. > > > > > > > > > > > > > > RFC1813 says: > > > > > > > > > > > > > > NFS3ERR_JUKEBOX > > > > > > > The server initiated the request, but was not able to > > > > > > > complete it in a timely fashion. The client should wait > > > > > > > and then try the request with a new RPC transaction ID. > > > > > > > For example, this error should be returned from a server > > > > > > > that supports hierarchical storage and receives a request > > > > > > > to process a file that has been migrated. In this case, > > > > > > > the server should start the immigration process and > > > > > > > respond to client with this error. > > > > > > > > > > > > > > Some clients respond to NFS3ERR_JUKEBOX by waiting quite some time > > > > > > > before retrying. > > > > > > > > > > > > > > RFC7530 says: > > > > > > > > > > > > > > 13.1.1.3. NFS4ERR_DELAY (Error Code 10008) > > > > > > > > > > > > > > For any of a number of reasons, the replier could not process this > > > > > > > operation in what was deemed a reasonable time. The client should > > > > > > > wait and then try the request with a new RPC transaction ID. > > > > > > > > > > > > > > The following are two examples of what might lead to this situation: > > > > > > > > > > > > > > o A server that supports hierarchical storage receives a request to > > > > > > > process a file that had been migrated. > > > > > > > > > > > > > > o An operation requires a delegation recall to proceed, and waiting > > > > > > > for this delegation recall makes processing this request in a > > > > > > > timely fashion impossible. > > > > > > > > > > > > > > An NFSv4 client is prepared to retry this error almost immediately > > > > > > > because most of the time it is due to the second bullet. > > > > > > > > > > > > > > I agree that not recalling after an NFSv3 READ is reasonable in some > > > > > > > cases. However, I demonstrated a case where the current policy does > > > > > > > not serve one of these clients well at all. In fact, the NFSv3 > > > > > > > accessor in this case is the performance-sensitive one. > > > > > > > > > > > > > > To put it another way, the NFSv4 protocol does not forbid the > > > > > > > current Linux server policy, but interoperating well with existing > > > > > > > NFSv3 clients suggests it's not an optimal policy choice. > > > > > > > > > > > > > > > > > > > I think that is entirely dependent on the workload. 
If we proactively > > > > > > recall delegations because we think the v3 client _might_ do some > > > > > > conflicting access, and then it doesn't, then that's also a non-optimal > > > > > > choice. > > > > > > > > > > > > > > > > > > > > > I think that's the correct thing to do. Until we have some > > > > > > > > sort of conflicting behavior I don't see why you'd want to prematurely > > > > > > > > recall the delegation. > > > > > > > > > > > > > > The reason to recall a delegation is to avoid returning > > > > > > > NFS3ERR_JUKEBOX if at all possible, because doing so is a drastic > > > > > > > remedy that results in a performance regression. > > > > > > > > > > > > > > The negative impact of not having a delegation is small. The negative > > > > > > > impact of returning NFS3ERR_JUKEBOX to a SETATTR or WRITE can be as > > > > > > > much as a 5 minute wait. (This is intolerably long for, say, online > > > > > > > transaction processing workloads). > > > > > > > > > > > > > > > > > > > That sounds like a deficient v3 client, IMO. There's nothing in the v3 > > > > > > spec that I know of that advocates a delay that long before > > > > > > reattempting. I'm pretty sure the Linux client treats NFSERR3_JUKEBOX > > > > > > and NFS4ERR_DELAY more or less equivalently. > > > > > > > > > > The v3 client uses a 5 second delay (see NFS_JUKEBOX_RETRY_TIME). > > > > > The v4 client, at least in the case of operations that could break a > > > > > deleg, does exponential backoff starting with a tenth of a second--see > > > > > nfs4_delay. > > > > > > > > > > So Trond's been taking the spec at its word here. > > > > > > > > > > Like Jeff I'm pretty unhappy at the idea of revoking delegations > > > > > preemptively on v3 read and lookup. > > > > > > > > To completely avoid JUKEBOX, you'd have to recall asynchronously. > > > > Even better would be not to offer delegations when it is clear > > > > there is an active NFSv3 accessor. > > > > > > > > Is there a specific use case where holding onto delegations in > > > > this case is measurably valuable? > > > > > > > > As Jeff said above, it is workload dependent, but it seems that > > > > we are choosing arbitrarily which workloads work well and which > > > > will be penalized. > > > > > > > > Clearly, speculating about future access is not allowed when > > > > only NFSv4 is in play. > > > > > > > > > > > > > And a 5 minute wait does sound like a client problem. > > > > > > > > Even a 5 second wait is not good. A simple "touch" that takes > > > > five seconds can generate user complaints. > > > > > > > > I do see the point that a NFSv3 client implementation can be > > > > changed to retry JUKEBOX more aggressively. Not all NFSv3 code > > > > bases are actively maintained, however. > > > > > > > > > > > > > > > The server can detect there are other accessors that do not provide > > > > > > > OPEN/CLOSE semantics. In addition, the server cannot predict when one > > > > > > > of these accessors may use a WRITE or SETATTR. And finally it does > > > > > > > not have a reasonably performant mechanism for delaying those > > > > > > > accessors when a delegation must be recalled. > > > > > > > > > > > > > > > > > > > Interoperability is hard (and sometimes it doesn't work well :). We > > > > > > simply don't have enough info to reliably guess what the v3 client will > > > > > > do in this situation. 
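To put rough numbers on the retry asymmetry Bruce points out above, here is
a small user-space model of the two backoff behaviours. It is only an
illustration, not the client code: the flat 5-second value and the 100ms
starting point are the ones cited above, while the 15-second cap is my
recollection of NFS4_POLL_RETRY_MAX and may not be exact.

---- cut here ----

/*
 * Simplified illustration (not the actual client code) of why a
 * JUKEBOX/DELAY reply hurts a v3 accessor more than a v4 one: the v3
 * client retries after a fixed NFS_JUKEBOX_RETRY_TIME (5 seconds),
 * while the v4 client's nfs4_delay() starts around 100ms and backs
 * off exponentially up to a cap.
 */
#include <stdio.h>

#define HZ			100	/* illustrative jiffies rate */
#define NFS_JUKEBOX_RETRY_TIME	(5 * HZ)
#define NFS4_POLL_RETRY_MIN	(HZ / 10)
#define NFS4_POLL_RETRY_MAX	(15 * HZ)	/* assumed cap */

static long v3_retry_delay(long prev)
{
	(void)prev;
	return NFS_JUKEBOX_RETRY_TIME;		/* flat 5s every time */
}

static long v4_retry_delay(long prev)
{
	long next = prev ? prev << 1 : NFS4_POLL_RETRY_MIN;

	return next > NFS4_POLL_RETRY_MAX ? NFS4_POLL_RETRY_MAX : next;
}

int main(void)
{
	long v3 = 0, v4 = 0;
	int i;

	for (i = 1; i <= 5; i++) {
		v3 = v3_retry_delay(v3);
		v4 = v4_retry_delay(v4);
		printf("retry %d: v3 waits %ld ms, v4 waits %ld ms\n",
		       i, v3 * 1000 / HZ, v4 * 1000 / HZ);
	}
	return 0;
}

---- cut here ----

After five JUKEBOX replies a v3 accessor has waited 25 seconds for the same
recall that costs a v4 client hitting DELAY about 3.
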
> > > > > > > > (This is in response to Jeff's comment) > > > > > > > > Interoperability means following the spec, but IMO it also > > > > means respecting longstanding implementation practice when > > > > a specification does not prescribe particular behavior. > > > > > > > > In this case, strictly speaking interoperability is not the > > > > concern. > > > > > > > > -> The spec authors clearly believed this is an area where > > > > implementations are to be given free rein. Otherwise the text > > > > would have provided RFC 2119 directives or other specific > > > > guidelines. There was opportunity to add specifics in RFCs > > > > 3530, 7530, and 5661, but that wasn't done. > > > > > > > > -> The scenario I reported does not involve operational > > > > failure. It eventually succeeds whether the client's retry > > > > is aggressive or lazy. It just works _better_ when there is > > > > no DELAY/JUKEBOX. > > > > > > > > There are a few normative constraints here, and I think we > > > > have a bead on what those are, but IMO the issue is one of > > > > implementation quality (on both ends). > > > > > > > > > > Yes. I'm just not sold that what you're proposing would be any better > > > than what we have for the vast majority of people. It might be, but I > > > don't think that's necessarily the case. > > > > In other words, both of you are comparing my use case with > > a counterfactual. That doesn't seem like a fair fight. > > > > Can you demonstrate a specific use case where not offering > > a delegation during mixed NFSv3 and NFSv4 access is a true > > detriment? (I am open to hearing about it). > > > > What happens when an NFSv3 client sends an NLM LOCK on a > > delegated file? I assume the correct response is for the > > server to return NLM_LCK_BLOCKED, recall the delegation, and > > then call the client back when the delegation has been > > returned. Is that known to work? > > > > > > > > > > That said, I wouldn't have a huge objection to a server side tunable > > > > > > (module parameter?) that says "Recall read delegations on v2/3 READ > > > > > > calls". Make it default to off, and then people in your situation could > > > > > > set it if they thought it a better policy for their workload. > > > > > > > > > > I also wonder if in v3 case we should try a small synchronous wait > > > > > before returning JUKEBOX. Read delegations shouldn't require the client > > > > > to do very much, so it could be they're typically returned in a > > > > > fraction of a second. > > > > > > > > That wait would have to be very short in the NFSv3 / UDP case > > > > to avoid a retransmit timeout. I know, UDP is going away. > > > > > > > > It's hard to say how long to wait. The RTT to the client might > > > > have to be taken into account. In WAN deployments, this could > > > > be as long as 50ms, for instance. > > > > > > > > Although, again, waiting is speculative. A fixed 20ms wait > > > > would be appropriate for most LAN deployments, and that's > > > > where the expectation of consistently fast operation lies. > > > > > > > > > > Not a bad idea. That delay could be tunable as well. > > > > > Since we have a fixed number of threads, I don't think we'd want to keep > > > > > one waiting much longer than that. Also, it'd be nice if we could get > > > > > woken up early when the delegation return comes in before our wait's > > > > > over, but I haven't thought about how to do that. > > > > > > > > > > And I don't know if that actually helps. 
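To make the short-synchronous-wait idea concrete, here is a rough
user-space model. The name deleg_recall_wait_ms and the pthread plumbing
are invented for the example; in nfsd this would presumably be an in-kernel
wait queue keyed on the delegation, and the 20ms default just echoes the
LAN figure above. The property worth noting is the early wakeup: a
DELEGRETURN that arrives quickly wakes the waiting request immediately, and
only a genuinely slow client ever sees the full wait followed by JUKEBOX.

---- cut here ----

/*
 * Rough illustration of "wait a little for the delegation to come back
 * before replying NFS3ERR_JUKEBOX".  Purely a user-space model: the
 * names (deleg_recall_wait_ms, struct delegation) are invented for the
 * example and are not knfsd code.
 */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static int deleg_recall_wait_ms = 20;	/* tunable, per the discussion */

struct delegation {
	pthread_mutex_t lock;
	pthread_cond_t	returned_cv;
	bool		returned;
};

/* Called when DELEGRETURN arrives; wakes any waiting v3 request early. */
static void deleg_returned(struct delegation *dp)
{
	pthread_mutex_lock(&dp->lock);
	dp->returned = true;
	pthread_cond_signal(&dp->returned_cv);
	pthread_mutex_unlock(&dp->lock);
}

/* Returns true if the delegation came back in time, false -> JUKEBOX. */
static bool wait_for_delegreturn(struct delegation *dp)
{
	struct timespec deadline;
	bool done;
	int rc = 0;

	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_nsec += (long)deleg_recall_wait_ms * 1000000L;
	deadline.tv_sec  += deadline.tv_nsec / 1000000000L;
	deadline.tv_nsec %= 1000000000L;

	pthread_mutex_lock(&dp->lock);
	while (!dp->returned && rc != ETIMEDOUT)
		rc = pthread_cond_timedwait(&dp->returned_cv, &dp->lock,
					    &deadline);
	done = dp->returned;
	pthread_mutex_unlock(&dp->lock);
	return done;
}

static void *client_thread(void *arg)
{
	usleep(5000);			/* client returns the deleg in ~5ms */
	deleg_returned(arg);
	return NULL;
}

int main(void)
{
	struct delegation d = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.returned_cv = PTHREAD_COND_INITIALIZER,
	};
	pthread_t t;

	pthread_create(&t, NULL, client_thread, &d);
	printf(wait_for_delegreturn(&d) ?
	       "delegation returned, proceed with SETATTR\n" :
	       "timed out, reply NFS3ERR_JUKEBOX\n");
	pthread_join(t, NULL);
	return 0;
}

---- cut here ----

(Builds with a plain "cc -pthread"; the timings are obviously not meant to
be representative.)
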
> > > > > > > > When there is a lot of file sharing between clients, it might > > > > be good to reduce the penalty of delegation recalls. > > > > > > > > > > The best way to do that would probably be to have better heuristics for > > > deciding whether to hand them out in the first place. > > > > I thought that was exactly what I was suggesting. ;-) > > See above ("To completely avoid..."). > > > > > > > We have a little > > > of that now with the bloom filter, but maybe those rules could be more > > > friendly to this use-case? > > > > > > > Clients, after all, cannot know when a recall has completed, > > > > so they have to guess about when to retransmit, and usually > > > > make a conservative estimate. If server behavior can shorten > > > > the delay without introducing race windows, that would be good > > > > added value. > > > > > > > > But I'm not clear why waiting must tie up the nfsd thread (pun > > > > intended). How is a COMMIT or synchronous WRITE handled? Seems > > > > like waiting for a delegation recall to complete is a similar > > > > kind of thing. > > > > > > > > > > It's not required per-se, but there currently isn't a good mechanism to > > > idle RPCs in the server without putting the thread to sleep. It may be > > > possible to do that with the svc_defer stuff, but I'm a little leery of > > > that code. > > > > There are other cases where context switching an nfsd would be > > useful. For example, inserting an opportunity for nfsd_write > > to perform transport reads (after having allocated pages in > > the right file) could provide some benefits by reducing data > > copies and page allocator calls. > > > > I'm agnostic about exactly how this is done. > > Meaning I don't have any particular design preferences. > > I'd like to help with implementation, though, if there is > agreement about what approach is preferred. > > Yeah, it would be wonderful to move knfsd to a more asynchronous model. Being able to free up a thread to do other work when it would otherwise be blocked would likely do wonders for scalability. FWIW, that was a long-term goal that I had in mind when I did the patches to move knfsd to workqueues. Once you have it running in a workqueue you could more easily idle RPCs and re-dispatch them to the workqueue for further processing as needed. You could do it with dedicated threads too of course, it's just not quite as conducive to it. -- Jeff Layton
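
P.S. A toy model of the "idle the RPC and re-dispatch it" idea, for anyone
who wants to see the shape of it. Nothing here is knfsd code: the request
structure and the one-slot deferral queue are invented purely for
illustration, and in a workqueue-based server the re-dispatch would be a
queue_work() rather than a direct call.

---- cut here ----

/*
 * Toy model: instead of a worker thread sleeping across the delegation
 * recall, the request records that it is blocked and is queued again
 * once DELEGRETURN arrives.
 */
#include <stdbool.h>
#include <stdio.h>

struct nfsd_request {
	const char *op;
	bool deleg_conflict;		/* needs the delegation back first */
};

static struct nfsd_request *deferred;	/* one-slot "deferral queue" */
static bool deleg_outstanding = true;

static void dispatch(struct nfsd_request *rqst);

/* Recall completed: re-dispatch any parked request. */
static void delegreturn_received(void)
{
	deleg_outstanding = false;
	if (deferred) {
		struct nfsd_request *rqst = deferred;

		deferred = NULL;
		dispatch(rqst);		/* would be queue_work() in-kernel */
	}
}

static void dispatch(struct nfsd_request *rqst)
{
	if (rqst->deleg_conflict && deleg_outstanding) {
		/* Don't block a thread: park the request and return. */
		printf("%s: deferred pending delegation recall\n", rqst->op);
		deferred = rqst;
		return;
	}
	printf("%s: processed\n", rqst->op);
}

int main(void)
{
	struct nfsd_request setattr = { .op = "V3 SETATTR",
					.deleg_conflict = true };
	struct nfsd_request read    = { .op = "V3 READ" };

	dispatch(&read);		/* no conflict, runs immediately */
	dispatch(&setattr);		/* parked instead of sleeping */
	delegreturn_received();		/* recall completes, SETATTR re-runs */
	return 0;
}

---- cut here ----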