From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Mon, 14 May 2018 09:18:21 -0600
From: Keith Busch
To: Ming Lei
Cc: Keith Busch, Jens Axboe, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, Jianchao Wang, Christoph Hellwig
Subject: Re: [PATCH V5 0/9] nvme: pci: fix & improve timeout handling
Message-ID: <20180514151821.GE7772@localhost.localdomain>
References: <20180511122933.27155-1-ming.lei@redhat.com>
 <20180511205028.GB7772@localhost.localdomain>
 <20180512002110.GA23631@ming.t460p>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20180512002110.GA23631@ming.t460p>
List-ID:

Hi Ming,

On Sat, May 12, 2018 at 08:21:22AM +0800, Ming Lei wrote:
> > [ 760.679960] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0xffff
> > [ 760.701468] nvme nvme1: EH 0: after shutdown, top eh: 1
> > [ 760.727099] pci_raw_set_power_state: 62 callbacks suppressed
> > [ 760.727103] nvme 0000:86:00.0: Refused to change power state, currently in D3
>
> EH may not cover this kind of failure, so it fails in the 1st try.

Indeed, the test is simulating a permanently broken link, so recovery is
not expected. A success in this case is just completing driver unbinding.

> > [ 760.727483] nvme nvme1: EH 0: state 4, eh_done -19, top eh 1
> > [ 760.727485] nvme nvme1: EH 0: after recovery -19
> > [ 760.727488] nvme nvme1: EH: fail controller
>
> The above issue (hang in nvme_remove()) is still an old issue, which
> is because queues are kept as quiesce during remove, so could you
> please test the following change?
>
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index 1dec353388be..c78e5a0cde06 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -3254,6 +3254,11 @@ void nvme_remove_namespaces(struct nvme_ctrl *ctrl)
>  	 */
>  	if (ctrl->state == NVME_CTRL_DEAD)
>  		nvme_kill_queues(ctrl);
> +	else {
> +		if (ctrl->admin_q)
> +			blk_mq_unquiesce_queue(ctrl->admin_q);
> +		nvme_start_queues(ctrl);
> +	}
>
>  	down_write(&ctrl->namespaces_rwsem);
>  	list_splice_init(&ctrl->namespaces, &ns_list);

The above won't actually do anything here since the broken link puts the
controller in the DEAD state, so we've killed the queues which also
unquiesces them.
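
For reference, the DEAD path already force-unquiesces everything when it
kills the queues. Roughly what nvme_kill_queues() does (paraphrased from
memory as a sketch, not the exact upstream code):

void nvme_kill_queues(struct nvme_ctrl *ctrl)
{
	struct nvme_ns *ns;

	down_read(&ctrl->namespaces_rwsem);

	/* forcibly unquiesce the admin queue so pending admin commands fail fast */
	if (ctrl->admin_q)
		blk_mq_unquiesce_queue(ctrl->admin_q);

	list_for_each_entry(ns, &ctrl->namespaces, list) {
		/* mark the namespace queue dying so new I/O fails immediately */
		blk_set_queue_dying(ns->queue);
		/* unquiesce it too, so already-queued I/O is failed rather than stuck */
		blk_mq_unquiesce_queue(ns->queue);
	}

	up_read(&ctrl->namespaces_rwsem);
}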

> BTW, in my environment, it is hard to trigger this failure, so not see
> this issue, but I did verify the nested EH which can recover from error
> in reset.

It's actually pretty easy to trigger this one. I just modify block/019 to
remove the check for a hotplug slot, then run it on a block device that's
not hot-pluggable.