From: swise@opengridcomputing.com (Steve Wise)
Date: Thu, 11 Aug 2016 08:58:38 -0500
Subject: nvmf/rdma host crash during heavy load and keep alive recovery
In-Reply-To: <2fb1129c-424d-8b2d-7101-b9471e897dc8@grimberg.me>
References: <018301d1e9e1$da3b2e40$8eb18ac0$@opengridcomputing.com> <20160801110658.GF16141@lst.de> <008801d1ec00$a0bcfbf0$e236f3d0$@opengridcomputing.com> <015801d1ec3d$0ca07ea0$25e17be0$@opengridcomputing.com> <010f01d1f31e$50c8cb40$f25a61c0$@opengridcomputing.com> <013701d1f320$57b185d0$07149170$@opengridcomputing.com> <018401d1f32b$792cfdb0$6b86f910$@opengridcomputing.com> <01a301d1f339$55ba8e70$012fab50$@opengridcomputing.com> <2fb1129c-424d-8b2d-7101-b9471e897dc8@grimberg.me>
Message-ID: <004701d1f3d8$760660b0$62132210$@opengridcomputing.com>

> >> The nvme_rdma_ctrl queue associated with the request is in the
> >> RECONNECTING state:
> >>
> >> ctrl = {
> >>     state = NVME_CTRL_RECONNECTING,
> >>     lock = {
> >>
> >> So it should not be posting SQ WRs...
> >
> > kato kicks error recovery, nvme_rdma_error_recovery_work(), which calls
> > nvme_cancel_request() on each request. nvme_cancel_request() sets
> > req->errors to NVME_SC_ABORT_REQ. It then completes the request, which
> > ends up at nvme_rdma_complete_rq(), where it is queued for retry:
> >
> > ...
> >         if (unlikely(rq->errors)) {
> >                 if (nvme_req_needs_retry(rq, rq->errors)) {
> >                         nvme_requeue_req(rq);
> >                         return;
> >                 }
> >
> >                 if (rq->cmd_type == REQ_TYPE_DRV_PRIV)
> >                         error = rq->errors;
> >                 else
> >                         error = nvme_error_status(rq->errors);
> >         }
> > ...
> >
> > The retry ends up at nvme_rdma_queue_rq(), which will try to post a send
> > WR to a freed QP...
> >
> > Should the canceled requests actually OR in the NVME_SC_DNR bit? That is
> > only done in nvme_cancel_request() if the blk queue is dying:
>
> the DNR bit should not be set normally, only when we either don't want
> to requeue or we can't.
>
> >
> > ...
> >         status = NVME_SC_ABORT_REQ;
> >         if (blk_queue_dying(req->q))
> >                 status |= NVME_SC_DNR;
> > ...
> >
>
> couple of questions:
> 1. bringing down the interface means generating a DEVICE_REMOVAL event?
>

No. Just ifconfig ethX down; sleep 10; ifconfig ethX up. This simply causes
the pending work requests to take longer to complete, which kicks in the
kato logic.

> 2. keep-alive timeout expiring means that nvme_rdma_timeout() kicks
> error_recovery and sets:
> rq->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR
>
> So I'm not at all convinced that the keep-alive is the request that is
> being re-issued. Did you verify that?

The request that caused the crash had rq->errors == NVME_SC_ABORT_REQ.
I'm not sure that is always the case, though. But this is very easy to
reproduce, so I should be able to drill down and add any debug code you
think might help.
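For example, something along these lines in nvme_rdma_queue_rq(), after
nvme_setup_cmd() has filled in the command, would show whether the
re-issued request really is the keep-alive and confirm the controller
state at that point. This is only a sketch from my reading of
drivers/nvme/host/rdma.c (the queue->ctrl->ctrl.state access and the
'c'/'rq' locals are assumptions about what nvme_rdma_queue_rq() already
has in scope):

	/*
	 * Debug-only sketch: complain if we are about to post a send WR
	 * while the controller is still RECONNECTING, and report the
	 * opcode so we can tell whether the offender is the keep-alive.
	 */
	if (unlikely(queue->ctrl->ctrl.state == NVME_CTRL_RECONNECTING)) {
		pr_err("nvme_rdma: queue_rq while RECONNECTING: opcode=0x%x keep-alive=%d rq->errors=0x%x\n",
			c->common.opcode,
			c->common.opcode == nvme_admin_keep_alive,
			rq->errors);
		WARN_ON_ONCE(1);
	}

That should confirm (or rule out) the keep-alive theory without disturbing
the timing much.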