From mboxrd@z Thu Jan 1 00:00:00 1970
From: Devesh Sharma
Subject: RE: [PATCH V1] NFS-RDMA: fix qp pointer validation checks
Date: Tue, 15 Apr 2014 18:25:47 +0000
Message-ID:
References: <014738b6-698e-4ea1-82f9-287378bfec19@CMEXHTCAS2.ad.emulex.com>
 <56C87770-7940-4006-948C-FEF3C0EC4ACC@oracle.com>
 <5710A71F-C4D5-408B-9B41-07F21B5853F0@oracle.com>
 <6837A427-B677-4CC7-A022-4FB9E52A3FC6@oracle.com>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8BIT
Return-path:
In-Reply-To:
Content-Language: en-US
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Chuck Lever
Cc: Linux NFS Mailing List,
 "linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org",
 Trond Myklebust
List-Id: linux-rdma@vger.kernel.org

> -----Original Message-----
> From: Chuck Lever [mailto:chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
> Sent: Tuesday, April 15, 2014 6:10 AM
> To: Devesh Sharma
> Cc: Linux NFS Mailing List; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Trond Myklebust
> Subject: Re: [PATCH V1] NFS-RDMA: fix qp pointer validation checks
>
> On Apr 14, 2014, at 6:46 PM, Devesh Sharma wrote:
>
> > Hi Chuck
> >
> >> -----Original Message-----
> >> From: Chuck Lever [mailto:chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
> >> Sent: Tuesday, April 15, 2014 2:24 AM
> >> To: Devesh Sharma
> >> Cc: Linux NFS Mailing List; linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Trond
> >> Myklebust
> >> Subject: Re: [PATCH V1] NFS-RDMA: fix qp pointer validation checks
> >>
> >> Hi Devesh-
> >>
> >> On Apr 13, 2014, at 12:01 AM, Chuck Lever wrote:
> >>
> >>> On Apr 11, 2014, at 7:51 PM, Devesh Sharma wrote:
> >>>
> >>>> Hi Chuck,
> >>>> Yes, that is the case. Following is the trace I got.
> >>>>
> >>>> <4>RPC: 355 setting alarm for 60000 ms
> >>>> <4>RPC: 355 sync task going to sleep
> >>>> <4>RPC: xprt_rdma_connect_worker: reconnect
> >>>> <4>RPC: rpcrdma_ep_disconnect: rdma_disconnect -1
> >>>> <4>RPC: rpcrdma_ep_connect: rpcrdma_ep_disconnect status -1
> >>>> <3>ocrdma_mbx_create_qp(0) rq_err
> >>>> <3>ocrdma_mbx_create_qp(0) sq_err
> >>>> <3>ocrdma_create_qp(0) error=-1
> >>>> <4>RPC: rpcrdma_ep_connect: rdma_create_qp failed -1
> >>>> <4>RPC: 355 __rpc_wake_up_task (now 4296956756)
> >>>> <4>RPC: 355 disabling timer
> >>>> <4>RPC: 355 removed from queue ffff880454578258 "xprt_pending"
> >>>> <4>RPC: __rpc_wake_up_task done
> >>>> <4>RPC: xprt_rdma_connect_worker: exit
> >>>> <4>RPC: 355 sync task resuming
> >>>> <4>RPC: 355 xprt_connect_status: error 1 connecting to server
> >> 192.168.1.1
> >>>
> >>> xprtrdma's connect worker is returning "1" instead of a negative errno.
> >>> That's the bug that triggers this chain of events.
> >>
> >> rdma_create_qp() has returned -EPERM. There's very little xprtrdma
> >> can do if the provider won't even create a QP. That seems like a rare
> >> and fatal problem.
> >>
> >> For the moment, I'm inclined to think that a panic is correct
> >> behavior, since there are outstanding registered memory regions that
> >> cannot be cleaned up without a QP (see below).
> > Well, I think the system should still remain alive.
>
> Sure, in the long run. I'm not suggesting we leave it this way.

Okay, agreed.

> > This will definitely cause a memory leak. But QP create failure does
> > not mean the system should also crash.
>
> It's more than leaked memory. A permanent QP creation failure can leave
> pages in the page cache registered and pinned, as I understand it.

Yes, true.
> Work-around fixes are typically rejected, and not with a happy face either. > > Once the connection tear-down code is fixed, it should be clear where NULL > checks need to go. Okay. > > >> > >>> RPC tasks waiting for the reconnect are awoken. > >>> xprt_connect_status() doesn't recognize a tk_status of "1", so it > >>> turns it into -EIO, and kills each waiting RPC task. > >> > >>>> <4>RPC: wake_up_next(ffff880454578190 "xprt_sending") > >>>> <4>RPC: 355 call_connect_status (status -5) > >>>> <4>RPC: 355 return 0, status -5 > >>>> <4>RPC: 355 release task > >>>> <4>RPC: wake_up_next(ffff880454578190 "xprt_sending") > >>>> <4>RPC: xprt_rdma_free: called on 0x(null) > >>> > >>> And as part of exiting, the RPC task has to free its buffer. > >>> > >>> Not exactly sure why req->rl_nchunks is not zero for an NFSv4 GETATTR. > >>> This is why rpcrdma_deregister_external() is invoked here. > >>> > >>> Eventually this gets around to attempting to post a LOCAL_INV WR > >>> with > >>> ->qp set to NULL, and the panic below occurs. > >> > >> This is a somewhat different problem. > >> > >> Not only do we need to have a good ->qp here, but it has to be > >> connected and in the ready-to-send state before LOCAL_INV work > >> requests can be posted. > >> > >> The implication of this is that if a server disconnects (server crash > >> or network partition), the client is stuck waiting for the server to > >> come back before it can deregister memory and retire outstanding RPC > requests. > > This is a real problem to solve. In the existing state of xprtrdma > > code. Even a Server reboot will cause Client to crash. > > I don't see how that can happen if the HCA/provider manages to create a > fresh QP successfully and then rdma_connect() succeeds. Okay yes, since QP creation will still succeed. > > A soft timeout or a ^C while the server is rebooting might be a problem. > > >> > >> This is bad for ^C or soft timeouts or umount ... when the server is > >> unavailable. 
> >>
> >> So I feel we need better clean-up when the client cannot reconnect.
> > Unreg old FRMRs with the help of the new QP? That works as long as the
> > new QP is created with the same PD, since an FRMR is bound to the PD
> > and not to the QP.
> >> Probably deregistering RPC chunk MRs before finally tearing down the
> >> old QP is what is necessary.
> >
> > We need a scheme that handles memory registrations separately from
> > connection establishment and does book-keeping of which regions are
> > registered and which are not.
> > Once the new connection is back, either start using the old mem-regions
> > as is, or invalidate the old ones and re-register on the new QP.
> > What is the existing scheme xprtrdma is following? Is it the same?
>
> This is what is going on now. Clearly, when managing its own memory
> resources, the client should never depend on the server ever coming back.
>
> The proposal is to deregister _before_ the old QP is torn down, using
> ib_dereg_mr() in the connect worker process. All RPC requests on that
> connection should be sleeping waiting for the reconnect to complete.
>
> If chunks are created and marshaled during xprt_transmit(), the waiting
> RPC requests should simply re-register when they are ready to be sent
> again.

Okay, I will try to change this and test it; it may take me a week to
understand the code and roll out V3.
> > I think it is possible to create an FRMR on qp->qp_num = x while
> > invalidating on qp->qp_num = y, provided qpx.pd == qpy.pd.
>
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html