From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Date: Mon, 21 May 2018 23:25:43 +0800
From: Ming Lei
To: Keith Busch
Cc: Jens Axboe, Keith Busch, Laurence Oberman, Sagi Grimberg,
	James Smart, linux-nvme@lists.infradead.org,
	linux-block@vger.kernel.org, Johannes Thumshirn, Christoph Hellwig
Subject: Re: [PATCH 1/6] nvme: Sync request queues on reset
Message-ID: <20180521152536.GB19099@ming.t460p>
References: <20180518163823.27820-1-keith.busch@intel.com>
 <20180518223210.GB18334@ming.t460p>
 <20180518234408.GA31749@localhost.localdomain>
 <20180519000141.GB19799@ming.t460p>
 <20180521140413.GA5528@localhost.localdomain>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
In-Reply-To: <20180521140413.GA5528@localhost.localdomain>
List-ID:

On Mon, May 21, 2018 at 08:04:13AM -0600, Keith Busch wrote:
> On Sat, May 19, 2018 at 08:01:42AM +0800, Ming Lei wrote:
> > > You keep saying that, but the controller state is global to the
> > > controller. It doesn't matter which namespace request_queue started the
> > > reset: every namespaces request queue sees the RESETTING controller state
> >
> > When timeouts come, the global state of RESETTING may not be updated
> > yet, so all the timeouts may not observe the state.
>
> Even prior to the RESETING state, every single command, no matter
> which namespace or request_queue it came on, is reclaimed by the driver.
> There *should* be no requests to timeout after nvme_dev_disable is called
> because the nvme driver returned control of all requests in the tagset
> to blk-mq.

The timed-out requests won't be canceled by nvme_dev_disable(). If a
timed-out request is handled as RESET_TIMER, a new timeout event may be
triggered for it again.

> In any case, if blk-mq decides it won't complete those requests, we
> can just swap the order in the reset_work: sync first, uncondintionally
> disable. Does the following snippet look more okay?
>
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index 17a0190bd88f..42af077ee07a 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -2307,11 +2307,14 @@ static void nvme_reset_work(struct work_struct *work)
>  		goto out;
>
>  	/*
> -	 * If we're called to reset a live controller first shut it down before
> -	 * moving on.
> +	 * Ensure there are no timeout work in progress prior to forcefully
> +	 * disabling the queue. There is no harm in disabling the device even
> +	 * when it was already disabled, as this will forcefully reclaim any
> +	 * IOs that are stuck due to blk-mq's timeout handling that prevents
> +	 * timed out requests from completing.
>  	 */
> -	if (dev->ctrl.ctrl_config & NVME_CC_ENABLE)
> -		nvme_dev_disable(dev, false);
> +	nvme_sync_queues(&dev->ctrl);
> +	nvme_dev_disable(dev, false);

That may not work reliably either. For example, suppose request A from
NS_0 times out and is handled as RESET_TIMER, while request B from NS_1
times out and is handled as EH_HANDLED.

When the above reset work runs to handle the timeout of req B, a new
timeout event for request A may arrive right between the above
nvme_sync_queues() and nvme_dev_disable(). Then nvme_dev_disable() can't
cover request A, and the timeout handler for req A will later call
nvme_dev_disable() itself while the current reset is still in progress.
The reset then can't make progress, and an IO hang is caused.
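Just to make that window concrete, here is a rough userspace toy model of
the interleaving (everything below - the fake_* helpers, the pthread
setup and the sleep values - is made up purely for illustration and is
not the real nvme code path):

/*
 * Toy model only: two "timeouts" plus one "reset work".  The fake_*
 * names are invented; build with "cc -pthread toy.c".
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static int reset_in_progress;

/* stands in for nvme_sync_queues(): only waits for timeout work that
 * is already running when it is called */
static void fake_sync_queues(void)
{
	printf("reset_work: sync queues\n");
	usleep(1000);		/* widen the window on purpose */
}

/* stands in for nvme_dev_disable(): reclaims whatever is known here */
static void fake_dev_disable(const char *who)
{
	pthread_mutex_lock(&lock);
	if (reset_in_progress)
		printf("%s: dev_disable while reset is in progress\n", who);
	else
		printf("%s: dev_disable\n", who);
	pthread_mutex_unlock(&lock);
}

/* request A times out again, landing in the sync/disable window */
static void *req_a_timeout(void *arg)
{
	(void)arg;
	usleep(500);
	printf("req A: new timeout fires (earlier one was RESET_TIMER)\n");
	fake_dev_disable("req A timeout handler");
	return NULL;
}

int main(void)
{
	pthread_t t;

	printf("req B: timeout -> EH_HANDLED, reset_work scheduled\n");
	reset_in_progress = 1;

	pthread_create(&t, NULL, req_a_timeout, NULL);

	fake_sync_queues();		/* sync first ... */
	fake_dev_disable("reset_work");	/* ... then disable, but req A
					 * has already slipped in */

	pthread_join(t, NULL);
	return 0;
}

On most runs the "req A timeout handler" line lands between the sync and
reset_work's own disable, which is exactly the ordering that leaves the
reset stuck in the real driver.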
Thanks,
Ming