Date: Mon, 30 Apr 2018 13:52:17 -0600
From: Keith Busch <keith.busch@linux.intel.com>
To: Ming Lei
Cc: Ming Lei, Jens Axboe, linux-block, Sagi Grimberg, linux-nvme,
 Keith Busch, Jianchao Wang, Christoph Hellwig
Subject: Re: [PATCH 1/2] nvme: pci: simplify timeout handling
Message-ID: <20180430195217.GD5938@localhost.localdomain>
References: <20180426123956.26039-1-ming.lei@redhat.com>
 <20180426123956.26039-2-ming.lei@redhat.com>
 <20180427175157.GB5073@localhost.localdomain>
 <20180428035015.GB5657@ming.t460p>
 <20180428133514.GB5938@localhost.localdomain>

On Sun, Apr 29, 2018 at 05:39:52AM +0800, Ming Lei wrote:
> On Sat, Apr 28, 2018 at 9:35 PM, Keith Busch wrote:
> > On Sat, Apr 28, 2018 at 11:50:17AM +0800, Ming Lei wrote:
> >> > I understand how the problems are happening a bit better now. It used
> >> > to be that blk-mq would lock an expired command one at a time, so when
> >> > we had a batch of IO timeouts, the driver was able to complete all of
> >> > them inside a single IO timeout handler.
> >> >
> >> > That's not the case anymore, so the driver is called for every IO
> >> > timeout even if it reaped all the commands at once.
> >>
> >> Actually that wasn't the case before either; even for the legacy path,
> >> one .timeout() handles one request only.
> >
> > That's not quite what I was talking about.
> >
> > Before, only the command that was about to be sent to the driver's
> > .timeout() was marked completed. The driver could (and did) complete
> > other timed out commands in a single .timeout(), and the tag would
> > clear, so we could handle all timeouts in a single .timeout().
> >
> > Now, blk-mq marks all timed out commands as aborted prior to calling
> > the driver's .timeout(). If the driver completes any of those commands,
> > the tag does not clear, so the driver's .timeout() just gets to be called
> > again for commands it already reaped.
>
> That won't happen, because the new timeout model marks timed-out requests
> as aborted first, then runs synchronize_rcu() before treating those
> requests as really expired, and the RCU read lock is now held in the
> normal completion handler (blk_mq_complete_request).
>
> Yes, Bart is working towards that, but there is still the same race
> between the timeout handler (nvme_dev_disable()) and reset_work(), and
> nothing changes wrt. the timeout model:

Yeah, the driver makes sure there are no possible outstanding commands at
the end of nvme_dev_disable. This should mean there's no timeout handler
running, because there are no possible commands for that handler. But
that's not really the case anymore, so we had been inadvertently depending
on the old behavior.

> - reset may take a while to complete because of nvme_wait_freeze(), and
>   a timeout can happen during the reset, and then the reset may hang
>   forever. Even without nvme_wait_freeze(), it is in theory possible for
>   a timeout to happen during the reset work as well.
>
> Actually, for the non-shutdown case it isn't necessary to freeze the
> queues at all; it is enough to just quiesce them to make the hardware
> happy for recovery. That has been part of my V2 patchset.

When we freeze, we prevent IOs from entering contexts that may not be
valid on the other side of the reset. It's not very common for the context
count to change, but it can happen.

Anyway, will take a look at your series and catch up on the notes from you
and Jianchao.
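
To illustrate the freeze-versus-quiesce point above, here is a toy
user-space analogy (not kernel code; the counter loosely stands in for
blk-mq's q_usage_counter and the wait loop for the freeze drain, everything
else is made up for illustration): freezing closes the submission gate and
waits until no submitter is inside a hardware context before the context
count may change, while quiescing only stops dispatch and does not keep
submitters out.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* "usage" plays the role of the queue usage counter, "frozen" the freeze
 * gate, "quiesced" the quiesce gate.  Single-threaded toy, no real queue. */
static atomic_int  usage;
static atomic_bool frozen;
static atomic_bool quiesced;

/* A submitter must get past the freeze gate before touching a hardware
 * context; quiesce does not keep it out. */
static bool enter_queue(void)
{
	if (atomic_load(&frozen))
		return false;           /* real blk-mq parks the caller instead */
	atomic_fetch_add(&usage, 1);
	return true;
}

static void exit_queue(void)
{
	atomic_fetch_sub(&usage, 1);
}

/* Freeze: close the gate, then wait until nobody is inside any hardware
 * context.  Only after this is it safe to change the context count. */
static void freeze_and_drain(void)
{
	atomic_store(&frozen, true);
	while (atomic_load(&usage) > 0)
		;                       /* drain-wait; never spins in this toy */
}

/* Quiesce: stop dispatching to hardware, but submitters can still enter
 * and queue requests, so the context count must not change. */
static bool dispatch_allowed(void)
{
	return !atomic_load(&quiesced);
}

int main(void)
{
	atomic_store(&quiesced, true);
	if (enter_queue()) {
		printf("quiesced: request entered (dispatch allowed: %d)\n",
		       dispatch_allowed());
		exit_queue();
	}

	freeze_and_drain();
	printf("frozen and drained: safe to change the context count\n");
	printf("frozen: request entered: %d\n", enter_queue());
	return 0;
}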
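
And for the tag-clearing behavior change discussed at the top of the
thread, a second stand-alone toy (again not the real blk-mq code; the
state names are made up): in the old model a driver completion always
released a timed-out request, so one .timeout() could reap the whole
batch, while in the new model every expired request is claimed before
.timeout() runs, so the driver's completions are ignored and the handler
fires again for requests it already reaped.

#include <stdbool.h>
#include <stdio.h>

/* Illustrative names only -- not the real blk-mq request states. */
enum rq_state { IN_FLIGHT, CLAIMED_BY_TIMEOUT, FREED };

struct request { int tag; enum rq_state state; };

/* A normal completion only releases the tag if the timeout code has not
 * already claimed the request. */
static void driver_completes(struct request *rq)
{
	if (rq->state == CLAIMED_BY_TIMEOUT)
		return;                 /* completion ignored, tag stays busy */
	rq->state = FREED;
}

static void run(bool new_model)
{
	struct request batch[3] = { {0, IN_FLIGHT}, {1, IN_FLIGHT}, {2, IN_FLIGHT} };
	int left = 0;

	/* Old model: only the request handed to .timeout() is claimed.
	 * New model: the whole expired batch is claimed before .timeout() runs. */
	for (int i = 0; i < 3; i++)
		if (new_model || i == 0)
			batch[i].state = CLAIMED_BY_TIMEOUT;

	/* .timeout() fires for tag 0, and the driver's recovery path (think
	 * nvme_dev_disable) tries to reap the whole batch at once. */
	for (int i = 0; i < 3; i++)
		driver_completes(&batch[i]);

	for (int i = 0; i < 3; i++)
		if (batch[i].state != FREED)
			left++;

	printf("%s model: %d request(s) still claimed after one .timeout()\n",
	       new_model ? "new" : "old", left);
}

int main(void)
{
	run(false);  /* old: 1 -- only the request .timeout() is already handling */
	run(true);   /* new: 3 -- .timeout() will be called again for tags 1 and 2 */
	return 0;
}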