All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] nvme/pci: Poll CQ on timeout
@ 2017-02-24 22:59 Keith Busch
  2017-02-27 14:42 ` Sagi Grimberg
  2017-02-28 14:10 ` Christoph Hellwig
  0 siblings, 2 replies; 12+ messages in thread
From: Keith Busch @ 2017-02-24 22:59 UTC (permalink / raw)


If an IO timeout occurs, it's helpful to know if the controller did not
post a completion or the driver missed an interrupt. While we never expect
the latter, this patch will make it possible to tell the difference so
we don't have to guess.

Signed-off-by: Keith Busch <keith.busch at intel.com>
---
 drivers/nvme/host/pci.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index f4f86d7..3ed1a84 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -745,10 +745,8 @@ static irqreturn_t nvme_irq_check(int irq, void *data)
 	return IRQ_NONE;
 }
 
-static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
+static int __nvme_poll(struct nvme_queue *nvmeq, unsigned int tag)
 {
-	struct nvme_queue *nvmeq = hctx->driver_data;
-
 	if (nvme_cqe_valid(nvmeq, nvmeq->cq_head, nvmeq->cq_phase)) {
 		spin_lock_irq(&nvmeq->q_lock);
 		__nvme_process_cq(nvmeq, &tag);
@@ -761,6 +759,13 @@ static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
 	return 0;
 }
 
+static int nvme_poll(struct blk_mq_hw_ctx *hctx, unsigned int tag)
+{
+	struct nvme_queue *nvmeq = hctx->driver_data;
+
+	return __nvme_poll(nvmeq, tag);
+}
+
 static void nvme_pci_submit_async_event(struct nvme_ctrl *ctrl, int aer_idx)
 {
 	struct nvme_dev *dev = to_nvme_dev(ctrl);
@@ -859,6 +864,16 @@ static enum blk_eh_timer_return nvme_timeout(struct request *req, bool reserved)
 	struct nvme_command cmd;
 
 	/*
+	 * Did we miss an interrupt?
+	 */
+	if (__nvme_poll(nvmeq, req->tag)) {
+		dev_warn(dev->ctrl.device,
+			 "I/O %d QID %d timeout, completion polled\n",
+			 req->tag, nvmeq->qid);
+		return BLK_EH_HANDLED;
+	}
+
+	/*
 	 * Shutdown immediately if controller times out while starting. The
 	 * reset work will see the pci device disabled when it gets the forced
 	 * cancellation error. All outstanding requests are completed on
-- 
2.5.5

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-02-24 22:59 [PATCH] nvme/pci: Poll CQ on timeout Keith Busch
@ 2017-02-27 14:42 ` Sagi Grimberg
  2017-02-28 14:10 ` Christoph Hellwig
  1 sibling, 0 replies; 12+ messages in thread
From: Sagi Grimberg @ 2017-02-27 14:42 UTC (permalink / raw)


> If an IO timeout occurs, it's helpful to know if the controller did not
> post a completion or the driver missed an interrupt. While we never expect
> the latter, this patch will make it possible to tell the difference so
> we don't have to guess.

Did you really got a missed interrupt?

It seems reasonable, I'm just curious...

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-02-24 22:59 [PATCH] nvme/pci: Poll CQ on timeout Keith Busch
  2017-02-27 14:42 ` Sagi Grimberg
@ 2017-02-28 14:10 ` Christoph Hellwig
  2017-02-28 16:00   ` Keith Busch
  1 sibling, 1 reply; 12+ messages in thread
From: Christoph Hellwig @ 2017-02-28 14:10 UTC (permalink / raw)


On Fri, Feb 24, 2017@05:59:28PM -0500, Keith Busch wrote:
> If an IO timeout occurs, it's helpful to know if the controller did not
> post a completion or the driver missed an interrupt. While we never expect
> the latter, this patch will make it possible to tell the difference so
> we don't have to guess.

Do you have any good real use case for it?  I mostly don't like it
becuase it ties us to polling for a specific tag, something I'd like
to change in the ->poll API.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-02-28 14:10 ` Christoph Hellwig
@ 2017-02-28 16:00   ` Keith Busch
  2017-02-28 17:44     ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2017-02-28 16:00 UTC (permalink / raw)


On Tue, Feb 28, 2017@03:10:17PM +0100, Christoph Hellwig wrote:
> On Fri, Feb 24, 2017@05:59:28PM -0500, Keith Busch wrote:
> > If an IO timeout occurs, it's helpful to know if the controller did not
> > post a completion or the driver missed an interrupt. While we never expect
> > the latter, this patch will make it possible to tell the difference so
> > we don't have to guess.
> 
> Do you have any good real use case for it?  I mostly don't like it
> becuase it ties us to polling for a specific tag, something I'd like
> to change in the ->poll API.

I don't expect this to often catch anything, but this is a cheap way of
constraining the problem just by having this in place: the "Timeout"
message definitely means the device did not post an entry on that
command's completion queue, so either the device is broken or we messed
up the queue mapping.

In the event it does trigger (I've seen this only a handlful of times
on new platforms and devices, as well as legacy IRQs), we know a whole
lot more about the problem, compared to just seeing "Timeout".

If you want to remove the tag specific polling, I can look into an
alternative.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-02-28 16:00   ` Keith Busch
@ 2017-02-28 17:44     ` Jens Axboe
  2017-04-20  8:56       ` Johannes Thumshirn
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2017-02-28 17:44 UTC (permalink / raw)


On 02/28/2017 09:00 AM, Keith Busch wrote:
> On Tue, Feb 28, 2017@03:10:17PM +0100, Christoph Hellwig wrote:
>> On Fri, Feb 24, 2017@05:59:28PM -0500, Keith Busch wrote:
>>> If an IO timeout occurs, it's helpful to know if the controller did not
>>> post a completion or the driver missed an interrupt. While we never expect
>>> the latter, this patch will make it possible to tell the difference so
>>> we don't have to guess.
>>
>> Do you have any good real use case for it?  I mostly don't like it
>> becuase it ties us to polling for a specific tag, something I'd like
>> to change in the ->poll API.
> 
> I don't expect this to often catch anything, but this is a cheap way of
> constraining the problem just by having this in place: the "Timeout"
> message definitely means the device did not post an entry on that
> command's completion queue, so either the device is broken or we messed
> up the queue mapping.
> 
> In the event it does trigger (I've seen this only a handlful of times
> on new platforms and devices, as well as legacy IRQs), we know a whole
> lot more about the problem, compared to just seeing "Timeout".
> 
> If you want to remove the tag specific polling, I can look into an
> alternative.

IMHO that can go at a later time, if we do remove polling for
specific entries. For now it's fine.

And I do think this is a nice addition - it's free, and it provides
us extra info for debugging an issue. That's a big deal, especially
if it's a user report.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-02-28 17:44     ` Jens Axboe
@ 2017-04-20  8:56       ` Johannes Thumshirn
  2017-04-20 14:20         ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Johannes Thumshirn @ 2017-04-20  8:56 UTC (permalink / raw)


On Tue, Feb 28, 2017@10:44:21AM -0700, Jens Axboe wrote:
> IMHO that can go at a later time, if we do remove polling for
> specific entries. For now it's fine.
> 
> And I do think this is a nice addition - it's free, and it provides
> us extra info for debugging an issue. That's a big deal, especially
> if it's a user report.

JFYI, we did have I/O timeout issues on one of our internal systems.
Keith suggested to give this patch a shot and indeed it cought missed IRQs,
which after being polled at least didn't result in filesystem errors.

So if you want you can have my:
Tested-by: Johannes Thumshirn <jthumshirn at suse.de>
Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

-- 
Johannes Thumshirn                                          Storage
jthumshirn at suse.de                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 N?rnberg
GF: Felix Imend?rffer, Jane Smithard, Graham Norton
HRB 21284 (AG N?rnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-04-20  8:56       ` Johannes Thumshirn
@ 2017-04-20 14:20         ` Keith Busch
  2017-04-20 15:45           ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2017-04-20 14:20 UTC (permalink / raw)


On Thu, Apr 20, 2017@10:56:09AM +0200, Johannes Thumshirn wrote:
> On Tue, Feb 28, 2017@10:44:21AM -0700, Jens Axboe wrote:
> > IMHO that can go at a later time, if we do remove polling for
> > specific entries. For now it's fine.
> > 
> > And I do think this is a nice addition - it's free, and it provides
> > us extra info for debugging an issue. That's a big deal, especially
> > if it's a user report.
> 
> JFYI, we did have I/O timeout issues on one of our internal systems.
> Keith suggested to give this patch a shot and indeed it cought missed IRQs,
> which after being polled at least didn't result in filesystem errors.
> 
> So if you want you can have my:
> Tested-by: Johannes Thumshirn <jthumshirn at suse.de>
> Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>

Thanks, I forgot about this one, but still think it's a good idea.
Applied it to for-next.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-04-20 14:20         ` Keith Busch
@ 2017-04-20 15:45           ` Jens Axboe
  2017-04-20 16:17             ` Sagi Grimberg
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2017-04-20 15:45 UTC (permalink / raw)


On 04/20/2017 08:20 AM, Keith Busch wrote:
> On Thu, Apr 20, 2017@10:56:09AM +0200, Johannes Thumshirn wrote:
>> On Tue, Feb 28, 2017@10:44:21AM -0700, Jens Axboe wrote:
>>> IMHO that can go at a later time, if we do remove polling for
>>> specific entries. For now it's fine.
>>>
>>> And I do think this is a nice addition - it's free, and it provides
>>> us extra info for debugging an issue. That's a big deal, especially
>>> if it's a user report.
>>
>> JFYI, we did have I/O timeout issues on one of our internal systems.
>> Keith suggested to give this patch a shot and indeed it cought missed IRQs,
>> which after being polled at least didn't result in filesystem errors.
>>
>> So if you want you can have my:
>> Tested-by: Johannes Thumshirn <jthumshirn at suse.de>
>> Reviewed-by: Johannes Thumshirn <jthumshirn at suse.de>
> 
> Thanks, I forgot about this one, but still think it's a good idea.
> Applied it to for-next.

Agree, it's a nice addition.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-04-20 15:45           ` Jens Axboe
@ 2017-04-20 16:17             ` Sagi Grimberg
  2017-04-20 16:28               ` Keith Busch
  0 siblings, 1 reply; 12+ messages in thread
From: Sagi Grimberg @ 2017-04-20 16:17 UTC (permalink / raw)



>> Thanks, I forgot about this one, but still think it's a good idea.
>> Applied it to for-next.
>
> Agree, it's a nice addition.

I think its harmless to have. especially when we do log that it
happened.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-04-20 16:17             ` Sagi Grimberg
@ 2017-04-20 16:28               ` Keith Busch
  2017-04-20 16:31                 ` Jens Axboe
  0 siblings, 1 reply; 12+ messages in thread
From: Keith Busch @ 2017-04-20 16:28 UTC (permalink / raw)


On Thu, Apr 20, 2017@07:17:40PM +0300, Sagi Grimberg wrote:
> 
> > > Thanks, I forgot about this one, but still think it's a good idea.
> > > Applied it to for-next.
> > 
> > Agree, it's a nice addition.
> 
> I think its harmless to have. especially when we do log that it
> happened.

It's actually proved to be quite useful and maybe good for stable. Any
objections to roll this into 4.11 instead 4.12?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-04-20 16:28               ` Keith Busch
@ 2017-04-20 16:31                 ` Jens Axboe
  2017-04-21  6:39                   ` Christoph Hellwig
  0 siblings, 1 reply; 12+ messages in thread
From: Jens Axboe @ 2017-04-20 16:31 UTC (permalink / raw)


On 04/20/2017 10:28 AM, Keith Busch wrote:
> On Thu, Apr 20, 2017@07:17:40PM +0300, Sagi Grimberg wrote:
>>
>>>> Thanks, I forgot about this one, but still think it's a good idea.
>>>> Applied it to for-next.
>>>
>>> Agree, it's a nice addition.
>>
>> I think its harmless to have. especially when we do log that it
>> happened.
> 
> It's actually proved to be quite useful and maybe good for stable. Any
> objections to roll this into 4.11 instead 4.12?

I'll object to that, we should not queue up anything for 4.11 that
isn't potentially fixing a major issue at this point.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH] nvme/pci: Poll CQ on timeout
  2017-04-20 16:31                 ` Jens Axboe
@ 2017-04-21  6:39                   ` Christoph Hellwig
  0 siblings, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2017-04-21  6:39 UTC (permalink / raw)


On Thu, Apr 20, 2017@10:31:36AM -0600, Jens Axboe wrote:
> On 04/20/2017 10:28 AM, Keith Busch wrote:
> > On Thu, Apr 20, 2017@07:17:40PM +0300, Sagi Grimberg wrote:
> >>
> >>>> Thanks, I forgot about this one, but still think it's a good idea.
> >>>> Applied it to for-next.
> >>>
> >>> Agree, it's a nice addition.
> >>
> >> I think its harmless to have. especially when we do log that it
> >> happened.
> > 
> > It's actually proved to be quite useful and maybe good for stable. Any
> > objections to roll this into 4.11 instead 4.12?
> 
> I'll object to that, we should not queue up anything for 4.11 that
> isn't potentially fixing a major issue at this point.

Agreed.  I'll send it to you for 4.12 with the nvme pull request later
today.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2017-04-21  6:39 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-02-24 22:59 [PATCH] nvme/pci: Poll CQ on timeout Keith Busch
2017-02-27 14:42 ` Sagi Grimberg
2017-02-28 14:10 ` Christoph Hellwig
2017-02-28 16:00   ` Keith Busch
2017-02-28 17:44     ` Jens Axboe
2017-04-20  8:56       ` Johannes Thumshirn
2017-04-20 14:20         ` Keith Busch
2017-04-20 15:45           ` Jens Axboe
2017-04-20 16:17             ` Sagi Grimberg
2017-04-20 16:28               ` Keith Busch
2017-04-20 16:31                 ` Jens Axboe
2017-04-21  6:39                   ` Christoph Hellwig

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.