* Oops when completing request on the wrong queue
@ 2016-08-10  4:04 Gabriel Krisman Bertazi
  2016-08-11 17:16 ` Keith Busch
  2016-08-19 13:28   ` Gabriel Krisman Bertazi
  0 siblings, 2 replies; 29+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-10  4:04 UTC (permalink / raw)


Hi,

We at IBM have been experiencing occasional Oopses when stressing IO
while adding/removing processors.  The Oops happens in the IRQ path,
when we try to complete a request that was apparently meant for another
queue.

In __nvme_process_cq, the driver uses cqe.command_id and nvmeq->tags to
look up, via blk_mq_tag_to_rq, the request that initiated the IO.
Occasionally, the request returned by that function is not initialized,
and we crash inside __blk_mq_complete_request, as shown in the oops
below.
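
For reference, the lookup path I am describing is roughly the following.
This is a simplified sketch paraphrasing the 4.x-era pci driver, not the
exact code, and nvme_handle_cqe_sketch is a made-up name:

static void nvme_handle_cqe_sketch(struct nvme_queue *nvmeq,
                                   struct nvme_completion *cqe)
{
        struct request *req;

        /*
         * cqe->command_id is the blk-mq tag stamped into the command at
         * submission time; nvmeq->tags points to the tag set of the hctx
         * that was bound to this hardware queue in .init_hctx().
         */
        req = blk_mq_tag_to_rq(*nvmeq->tags, cqe->command_id);

        /*
         * If the tag resolves against the wrong tag set, req points at an
         * uninitialized request and the completion below is what oopses.
         */
        blk_mq_complete_request(req, le16_to_cpu(cqe->status) >> 1);
}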

[ 2679.050701] Faulting instruction address: 0xc000000000549500
[ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
[ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
rpcsec_gss_krb5 auth_rpcgss
nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
autofs4 btrfs xor
raid6_pq ibmvscsi ibmveth nvme
[ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu
[ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000
[ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680
[ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300   Not tainted  (4.4.0-22-generic)
[ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28022428  XER: 00000000
[ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0 
GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00 
GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001 
GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68 
GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001 
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50 
GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002 
GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff 
GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00 
[ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0
[ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
[ 2679.050930] Call Trace:
[ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable)
[ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
[ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme]
[ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320
[ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0
[ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0
[ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80
[ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
[ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24
[ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
[ 2679.051054] [c000000451bc33a0] [c000000000002594]
hardware_interrupt_common+0x114/0x180

I added some debugging that skips the Oops, and it led me to think that
we are fetching the wrong request because we are actually looking at the
wrong set of tags.

The log below illustrates what I mean.  Instead of letting it crash, I
made the execution skip the request completion and simply treat that odd
CQE as handled.  The first line of the log comes from that path.

>  nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>  nvme nvme0: I/O 309 QID 63 timeout, aborting
>  nvme nvme0: Abort status: 0x0
>  nvme nvme0: I/O 309 QID 63 timeout, reset controller
>  nvme nvme0: completing aborted command with status: fffffffc
>  nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>  blk_update_request: I/O error, dev nvme0n1, sector 2105537536

In the first line, we find a completion for request 309 on queue 35,
which would have crashed the execution.  My debug patch just ignores it.

Then we can see that the same IO, 309, eventually times out on another
QID, the one that was actually waiting for it.  The Abort request gets
sent and completed, but we never receive the completion of the aborted
request, so we time out again and reset the controller.

This only happens when we are changing the SMT settings very
frequently, which points back to the way we correlate hctx->tags with
the nvmeq structure, but I have failed to find the exact spot in the
code where things get messed up.
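
For reference, the correlation I mean is established in .init_hctx();
roughly the following, a simplified sketch of the 4.x-era driver rather
than the exact code (nvme_init_hctx_sketch is a made-up name):

static int nvme_init_hctx_sketch(struct blk_mq_hw_ctx *hctx, void *data,
                                 unsigned int hctx_idx)
{
        struct nvme_dev *dev = data;
        struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];

        /*
         * Remember which tag set belongs to this hardware queue, so the
         * completion path can resolve cqe.command_id back to a request.
         * If this association ever points at the wrong hctx, completions
         * resolve against the wrong tag set, which matches the symptom
         * described above.
         */
        if (!nvmeq->tags)
                nvmeq->tags = &dev->tagset.tags[hctx_idx];

        hctx->driver_data = nvmeq;
        return 0;
}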

I've put some effort into this code lately, but I definitely need some
assistance on this one.  Since, as far as I understand, this issue is
much more likely to reproduce on boxes with a lot of hotpluggable
processors, I'm willing to test any patches you find suitable on our
systems.

Thanks,

-- 
Gabriel Krisman Bertazi

* Oops when completing request on the wrong queue
  2016-08-10  4:04 Oops when completing request on the wrong queue Gabriel Krisman Bertazi
@ 2016-08-11 17:16 ` Keith Busch
  2016-08-11 18:10   ` Gabriel Krisman Bertazi
  2016-08-19 13:28   ` Gabriel Krisman Bertazi
  1 sibling, 1 reply; 29+ messages in thread
From: Keith Busch @ 2016-08-11 17:16 UTC (permalink / raw)


On Wed, Aug 10, 2016 at 01:04:35AM -0300, Gabriel Krisman Bertazi wrote:
> Hi,
> 
> We, IBM, have been experiencing eventual Oops when stressing IO at the
> same time we add/remove processors.  The Oops happens in the IRQ path,
> when we try to complete a request that was apparently meant for another
> queue.
> 
> In __nvme_process_cq, the driver will use the cqe.command_id and the
> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
> initiated the IO.  Eventually, it happens that the request returned by
> that function is not initialized, and we crash inside
> __blk_mq_complete_request, as shown below.

Could you try the following patch and see if it resolves the issue?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index d7c33f9..d49ddfb 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -216,7 +216,7 @@ static int nvme_admin_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
 	return 0;
 }
 
-static void nvme_admin_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
+static void nvme_exit_hctx(struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
 {
 	struct nvme_queue *nvmeq = hctx->driver_data;
 
@@ -1133,7 +1133,7 @@ static struct blk_mq_ops nvme_mq_admin_ops = {
 	.complete	= nvme_complete_rq,
 	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_admin_init_hctx,
-	.exit_hctx      = nvme_admin_exit_hctx,
+	.exit_hctx      = nvme_exit_hctx,
 	.init_request	= nvme_admin_init_request,
 	.timeout	= nvme_timeout,
 };
@@ -1143,6 +1143,7 @@ static struct blk_mq_ops nvme_mq_ops = {
 	.complete	= nvme_complete_rq,
 	.map_queue	= blk_mq_map_queue,
 	.init_hctx	= nvme_init_hctx,
+	.exit_hctx      = nvme_exit_hctx,
 	.init_request	= nvme_init_request,
 	.timeout	= nvme_timeout,
 	.poll		= nvme_poll,
--

* Oops when completing request on the wrong queue
  2016-08-11 17:16 ` Keith Busch
@ 2016-08-11 18:10   ` Gabriel Krisman Bertazi
  0 siblings, 0 replies; 29+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-11 18:10 UTC (permalink / raw)


Keith Busch <keith.busch at intel.com> writes:

> On Wed, Aug 10, 2016 at 01:04:35AM -0300, Gabriel Krisman Bertazi wrote:
>> Hi,
>> 
>> We, IBM, have been experiencing eventual Oops when stressing IO at the
>> same time we add/remove processors.  The Oops happens in the IRQ path,
>> when we try to complete a request that was apparently meant for another
>> queue.
>> 
>> In __nvme_process_cq, the driver will use the cqe.command_id and the
>> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
>> initiated the IO.  Eventually, it happens that the request returned by
>> that function is not initialized, and we crash inside
>> __blk_mq_complete_request, as shown below.
>
> Could you try the following patch and see if it resolves the issue?

Hi Keith,

Thanks for your response.  I had tried this exact change already on 4.7
with no effect.  Do you think doing it on 4.8-rc1 will yield better
results?

I also verified that, in __nvme_process_cq, the iod points to the same
queue that queued the command, as expected; but in nvme_timeout,
according to the log I sent earlier, it points to a different nvmeq
(different nvmeq->qid).  This is very strange to me.
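
(For context, this is how I am reading the nvmeq off the request; a
fragment assuming the 4.x-era iod layout, not a verbatim quote of the
driver:)

        struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
        struct nvme_queue *nvmeq = iod->nvmeq;

        /*
         * At submission time this matches the hctx the request was
         * queued on; in nvme_timeout it sometimes reports a different
         * nvmeq->qid, as described above.
         */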

-- 
Gabriel Krisman Bertazi

* Re: Oops when completing request on the wrong queue
  2016-08-10  4:04 Oops when completing request on the wrong queue Gabriel Krisman Bertazi
@ 2016-08-19 13:28   ` Gabriel Krisman Bertazi
  2016-08-19 13:28   ` Gabriel Krisman Bertazi
  1 sibling, 0 replies; 29+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-19 13:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, linux-nvme, Jens Axboe, Brian King,
	linux-block, linux-scsi

Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:

> We, IBM, have been experiencing eventual Oops when stressing IO at the
> same time we add/remove processors.  The Oops happens in the IRQ path,
> when we try to complete a request that was apparently meant for another
> queue.
>
> In __nvme_process_cq, the driver will use the cqe.command_id and the
> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
> initiated the IO.  Eventually, it happens that the request returned by
> that function is not initialized, and we crash inside
> __blk_mq_complete_request, as shown below.
>
> [ 2679.050701] Faulting instruction address: 0xc000000000549500
> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
> rpcsec_gss_krb5 auth_rpcgss
> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
> autofs4 btrfs xor
> raid6_pq ibmvscsi ibmveth nvme
> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu
> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000
> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680
> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300   Not tainted  (4.4.0-22-generic)
> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28022428  XER: 00000000
> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0 
> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00 
> GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001 
> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68 
> GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001 
> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50 
> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002 
> GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff 
> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00 
> [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0
> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050930] Call Trace:
> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable)
> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme]
> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320
> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0
> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0
> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80
> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24
> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
> [ 2679.051054] [c000000451bc33a0] [c000000000002594]
> hardware_interrupt_common+0x114/0x180
>
> I added some debugging which skipped the Oops and made me think that we
> are fetching the wrong request because we are actually looking at the
> wrong set of tags.
>
> The log below exemplifies what I am saying.  Instead of crashing in the
> Oops, I made the execution skip the request completion and simply
> consider that odd CQE as handled.  The first line of the log does that.
>
>>  nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>>  nvme nvme0: I/O 309 QID 63 timeout, aborting
>>  nvme nvme0: Abort status: 0x0
>>  nvme nvme0: I/O 309 QID 63 timeout, reset controller
>>  nvme nvme0: completing aborted command with status: fffffffc
>>  nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>>  blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>
> In the first line, we found the request 309 for queue 35, which would
> have crashed the execution.  My debug patch just ignores it.
>
> Then, we can see that eventually, the same IO, 309, will timeout in
> another QID, which was actually expecting for it.  The Abort request
> gets sent and completed, but we never receive the completion of the
> aborted request, thus we timeout again and restart the device.
>
> This only happens when we are changing the SMT settings very
> frequently, which points back to the way we correlate the hctx->tags to
> the nvmeq structure, but I failed to find the exact code spot were
> things get messed up.
>
> I've put a some effort on this code lately, but I definitely need some
> assistance on this one.  Since, as far as I understand, this issue is
> much more likely to reproduce on boxes with a lot of hotpluggable
> processors, I'm willing to test any patches you may find suitable in our
> systems.
>

Hi,

[Adding linux-block, linux-scsi and Jens in CC for blk-mq input]

Made some progress on tracking down the issue.  It's looking more like
a blk-mq bug than an nvme bug to me now.  Let me know if I made any
wrong assumptions below:

I started to wonder whether blk-mq is issuing the IO on the wrong queue,
or whether the request is being moved to another queue during hotplug,
as would happen in blk_mq_hctx_cpu_offline.
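
(By "moved to another queue" I mean roughly the sketch below; heavily
simplified and from memory, not the actual blk_mq_hctx_cpu_offline code:)

        /* steal the pending requests of the CPU going offline ... */
        spin_lock(&ctx->lock);
        list_splice_init(&ctx->rq_list, &tmp);
        spin_unlock(&ctx->lock);
        /*
         * ... then requeue them on a software context that is still
         * online and re-run the hardware queue.
         */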

I verified that, when the failure occurs, the request was never touched
in blk_mq_hctx_cpu_offline.  Also, and more interesting, I added the
following BUG_ON to nvme_queue_rq:

    BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req);

From my experiments, I could never hit this during normal operation
(without SMT changes).  But as soon as I start triggering hotplug
changes, I hit it pretty fast.  By changing it to a WARN_ON, I see that
the Oops happens almost immediately after this request is submitted,
once __nvme_process_cq tries to complete it.

Also, with the patch below applied, we send a few -EIOs back to the
upper layers, but it prevents the crash.

I also think I saw a similar hang when running on a virtio_scsi disk.
I couldn't reproduce it a second time, so we still lack confirmation,
but if confirmed, it would indeed point away from nvme and back to the
block layer.

Any suggestions to further track this down?


diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b927ecf..10f2257 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -599,6 +599,11 @@ static int nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
        if (ret)
                return ret;

+       if (blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req) {
+               ret = BLK_MQ_RQ_QUEUE_ERROR;
+               goto out;
+       }
+
        ret = nvme_setup_cmd(ns, req, &cmnd);
        if (ret)
                goto out;

* Re: Oops when completing request on the wrong queue
  2016-08-19 13:28   ` Gabriel Krisman Bertazi
@ 2016-08-19 14:13     ` Jens Axboe
  -1 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2016-08-19 14:13 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, Keith Busch
  Cc: Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi

On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>
>> We, IBM, have been experiencing eventual Oops when stressing IO at the
>> same time we add/remove processors.  The Oops happens in the IRQ path,
>> when we try to complete a request that was apparently meant for another
>> queue.
>>
>> In __nvme_process_cq, the driver will use the cqe.command_id and the
>> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
>> initiated the IO.  Eventually, it happens that the request returned by
>> that function is not initialized, and we crash inside
>> __blk_mq_complete_request, as shown below.
>>
>> [ 2679.050701] Faulting instruction address: 0xc000000000549500
>> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
>> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
>> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
>> rpcsec_gss_krb5 auth_rpcgss
>> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
>> autofs4 btrfs xor
>> raid6_pq ibmvscsi ibmveth nvme
>> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted 4.4.0-22-generic #40-Ubuntu
>> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti: c000000451bc0000
>> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR: c000000000549680
>> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300   Not tainted  (4.4.0-22-generic)
>> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 28022428  XER: 00000000
>> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR: 40000000 SOFTE: 0
>> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300 c0000002799a0a00
>> GPR04: 0000000000000000 0000000000000105 0000000000000004 0000000000000001
>> GPR08: c0000002799a0a50 0000000000000000 0000000000000000 d0000000046bdc68
>> GPR12: c000000000549680 c000000007b1c780 0000000000000008 0000000000000001
>> GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000010005b50
>> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001 0000000000000002
>> GPR24: 00000000000001e3 0000000000000000 0000000000000000 ffffffffffffffff
>> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180 c0000002799a0a00
>> [ 2679.050908] NIP [c000000000549500] __blk_mq_complete_request+0x30/0x1b0
>> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
>> [ 2679.050930] Call Trace:
>> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40] 0xc0000004fff2fd40 (unreliable)
>> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
>> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c] nvme_irq+0x3c/0xb0 [nvme]
>> [ 2679.050972] [c0000004fff2fe10] [c000000000130660] handle_irq_event_percpu+0xa0/0x320
>> [ 2679.050981] [c0000004fff2fed0] [c000000000130948] handle_irq_event+0x68/0xc0
>> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c] handle_fasteoi_irq+0xec/0x2b0
>> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844] generic_handle_irq+0x54/0x80
>> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
>> [ 2679.051020] [c0000004fff2ff90] [c000000000024800] call_do_irq+0x14/0x24
>> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
>> [ 2679.051054] [c000000451bc33a0] [c000000000002594]
>> hardware_interrupt_common+0x114/0x180
>>
>> I added some debugging which skipped the Oops and made me think that we
>> are fetching the wrong request because we are actually looking at the
>> wrong set of tags.
>>
>> The log below exemplifies what I am saying.  Instead of crashing in the
>> Oops, I made the execution skip the request completion and simply
>> consider that odd CQE as handled.  The first line of the log does that.
>>
>>>  nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>>>  nvme nvme0: I/O 309 QID 63 timeout, aborting
>>>  nvme nvme0: Abort status: 0x0
>>>  nvme nvme0: I/O 309 QID 63 timeout, reset controller
>>>  nvme nvme0: completing aborted command with status: fffffffc
>>>  nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>>>  blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>>
>> In the first line, we found the request 309 for queue 35, which would
>> have crashed the execution.  My debug patch just ignores it.
>>
>> Then, we can see that eventually, the same IO, 309, will timeout in
>> another QID, which was actually expecting for it.  The Abort request
>> gets sent and completed, but we never receive the completion of the
>> aborted request, thus we timeout again and restart the device.
>>
>> This only happens when we are changing the SMT settings very
>> frequently, which points back to the way we correlate the hctx->tags to
>> the nvmeq structure, but I failed to find the exact code spot were
>> things get messed up.
>>
>> I've put a some effort on this code lately, but I definitely need some
>> assistance on this one.  Since, as far as I understand, this issue is
>> much more likely to reproduce on boxes with a lot of hotpluggable
>> processors, I'm willing to test any patches you may find suitable in our
>> systems.
>>
>
> Hi,
>
> [Adding linux-block, linux-scsi and Jens in CC for blk-mq input]
>
> Made some progress on tracking down the issue.  It's looking more of a
> blk-mq bug than nvme to me now.  Let me know if I made any wrong
> assumption below:
>
> I started to wonder whether blk-mq is requesting the IO in a wrong queue
> or if the request is being moved to another queue during hotplug, as it
> would happen in blk_mq_hctx_cpu_offline.
>
> I verified that when the fail occurs, the request never was never
> touched in blk_mq_hctx_cpu_offline.  Also, and more interesting, I added
> the following BUG_ON to nvme_queue_req:
>
>     BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req);
>
> From my experiments, I could never hit this during normal operation
> (without SMT). But, as soon as I start triggering hotplug changes, I hit
> it pretty fast.  By changing to WARN_ON, I see that the Oops will happen
> almost immediately after this request is submitted, once __process_cq
> tries to complete it.
>
> Also, when adding the patch below, we send a few -EIOs back to upper
> layers, but it prevents the crash.
>
> I also think i saw a similar hang when running on a virtio_scsi disk.  I
> couldn't reproduce it a second time, so we still lack confirmation, but
> if confirmed, it indeed points away from nvme and back to the block
> layer.
>
> Any suggestions to further track this down?

Some good detective work so far! I agree, this looks like a blk-mq core
bug. Do you have a trace of the BUG() triggering in nvme_queue_rq(),
when blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req? I don't
immediately see how this could happen; the freezing should protect us
from this, unless it's broken somehow. I'll take a look at this.
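
(For reference, the protection I mean is the freeze/unfreeze bracket the
core is supposed to put around the hotplug remap; very roughly, and not
the actual blk-mq code -- only blk_mq_freeze_queue() and
blk_mq_unfreeze_queue() are real APIs here, the rest is a made-up sketch:)

static void reinit_queue_for_hotplug_sketch(struct request_queue *q)
{
        blk_mq_freeze_queue(q);   /* drain in-flight requests, block new ones */
        /* ... rebuild the sw ctx -> hw ctx mapping while nothing is in flight ... */
        blk_mq_unfreeze_queue(q); /* allow submissions again */
}

If that bracket works as intended, no request should be in flight across
the remap, which is why the submission-side mismatch is surprising.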

-- 
Jens Axboe

* Re: Oops when completing request on the wrong queue
  2016-08-19 14:13     ` Jens Axboe
@ 2016-08-19 15:51       ` Jens Axboe
  -1 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2016-08-19 15:51 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi, Keith Busch
  Cc: Christoph Hellwig, linux-nvme, Brian King, linux-block, linux-scsi

On 08/19/2016 08:13 AM, Jens Axboe wrote:
> On 08/19/2016 07:28 AM, Gabriel Krisman Bertazi wrote:
>> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>>
>>> We, IBM, have been experiencing eventual Oops when stressing IO at the
>>> same time we add/remove processors.  The Oops happens in the IRQ path,
>>> when we try to complete a request that was apparently meant for another
>>> queue.
>>>
>>> In __nvme_process_cq, the driver will use the cqe.command_id and the
>>> nvmeq->tags to find out, via blk_mq_tag_to_rq, the request that
>>> initiated the IO.  Eventually, it happens that the request returned by
>>> that function is not initialized, and we crash inside
>>> __blk_mq_complete_request, as shown below.
>>>
>>> [ 2679.050701] Faulting instruction address: 0xc000000000549500
>>> [ 2679.050711] Oops: Kernel access of bad area, sig: 11 [#1]
>>> [ 2679.050716] SMP NR_CPUS=2048 NUMA pSeries
>>> [ 2679.050727] Modules linked in: minix nls_iso8859_1 xfs libcrc32c
>>> rpcsec_gss_krb5 auth_rpcgss
>>> nfsv4 nfs lockd grace fscache binfmt_misc pseries_rng rtc_generic sunrpc
>>> autofs4 btrfs xor
>>> raid6_pq ibmvscsi ibmveth nvme
>>> [ 2679.050771] CPU: 21 PID: 45045 Comm: ppc64_cpu Not tainted
>>> 4.4.0-22-generic #40-Ubuntu
>>> [ 2679.050780] task: c000000497b07e60 ti: c0000004fff2c000 task.ti:
>>> c000000451bc0000
>>> [ 2679.050786] NIP: c000000000549500 LR: d0000000046b5e84 CTR:
>>> c000000000549680
>>> [ 2679.050803] REGS: c0000004fff2fa80 TRAP: 0300   Not tainted
>>> (4.4.0-22-generic)
>>> [ 2679.050807] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE>  CR:
>>> 28022428  XER: 00000000
>>> [ 2679.050824] CFAR: c000000000008468 DAR: 00000000000000b0 DSISR:
>>> 40000000 SOFTE: 0
>>> GPR00: d0000000046b5e84 c0000004fff2fd00 c0000000015b4300
>>> c0000002799a0a00
>>> GPR04: 0000000000000000 0000000000000105 0000000000000004
>>> 0000000000000001
>>> GPR08: c0000002799a0a50 0000000000000000 0000000000000000
>>> d0000000046bdc68
>>> GPR12: c000000000549680 c000000007b1c780 0000000000000008
>>> 0000000000000001
>>> GPR16: 0000000000000000 0000000000000000 0000000000000000
>>> 0000000010005b50
>>> GPR20: 0000000010005dc8 c0000000015eaa60 0000000000000001
>>> 0000000000000002
>>> GPR24: 00000000000001e3 0000000000000000 0000000000000000
>>> ffffffffffffffff
>>> GPR28: 0000000000000000 0000000000000000 c0000004f8cd5180
>>> c0000002799a0a00
>>> [ 2679.050908] NIP [c000000000549500]
>>> __blk_mq_complete_request+0x30/0x1b0
>>> [ 2679.050924] LR [d0000000046b5e84] __nvme_process_cq+0xf4/0x2c0 [nvme]
>>> [ 2679.050930] Call Trace:
>>> [ 2679.050941] [c0000004fff2fd00] [c0000004fff2fd40]
>>> 0xc0000004fff2fd40 (unreliable)
>>> [ 2679.050952] [c0000004fff2fd40] [d0000000046b5e84]
>>> __nvme_process_cq+0xf4/0x2c0 [nvme]
>>> [ 2679.050961] [c0000004fff2fde0] [d0000000046b613c]
>>> nvme_irq+0x3c/0xb0 [nvme]
>>> [ 2679.050972] [c0000004fff2fe10] [c000000000130660]
>>> handle_irq_event_percpu+0xa0/0x320
>>> [ 2679.050981] [c0000004fff2fed0] [c000000000130948]
>>> handle_irq_event+0x68/0xc0
>>> [ 2679.050989] [c0000004fff2ff00] [c000000000135c2c]
>>> handle_fasteoi_irq+0xec/0x2b0
>>> [ 2679.050997] [c0000004fff2ff30] [c00000000012f844]
>>> generic_handle_irq+0x54/0x80
>>> [ 2679.051007] [c0000004fff2ff60] [c000000000011320] __do_irq+0x80/0x1d0
>>> [ 2679.051020] [c0000004fff2ff90] [c000000000024800]
>>> call_do_irq+0x14/0x24
>>> [ 2679.051037] [c000000451bc3350] [c000000000011508] do_IRQ+0x98/0x140
>>> [ 2679.051054] [c000000451bc33a0] [c000000000002594]
>>> hardware_interrupt_common+0x114/0x180
>>>
>>> I added some debugging which skipped the Oops and made me think that we
>>> are fetching the wrong request because we are actually looking at the
>>> wrong set of tags.
>>>
>>> The log below exemplifies what I am saying.  Instead of crashing in the
>>> Oops, I made the execution skip the request completion and simply
>>> consider that odd CQE as handled.  The first line of the log does that.
>>>
>>>>  nvme nvme0: Skip completion of I/O 309 on queue 35 due to missing q
>>>>  nvme nvme0: I/O 309 QID 63 timeout, aborting
>>>>  nvme nvme0: Abort status: 0x0
>>>>  nvme nvme0: I/O 309 QID 63 timeout, reset controller
>>>>  nvme nvme0: completing aborted command with status: fffffffc
>>>>  nvme 0003:03:00.0: Using 64-bit DMA iommu bypass
>>>>  blk_update_request: I/O error, dev nvme0n1, sector 2105537536
>>>
>>> In the first line, we found the request 309 for queue 35, which would
>>> have crashed the execution.  My debug patch just ignores it.
>>>
>>> Then, we can see that eventually, the same IO, 309, will timeout in
>>> another QID, which was actually expecting for it.  The Abort request
>>> gets sent and completed, but we never receive the completion of the
>>> aborted request, thus we timeout again and restart the device.
>>>
>>> This only happens when we are changing the SMT settings very
>>> frequently, which points back to the way we correlate the hctx->tags to
>>> the nvmeq structure, but I failed to find the exact code spot were
>>> things get messed up.
>>>
>>> I've put a some effort on this code lately, but I definitely need some
>>> assistance on this one.  Since, as far as I understand, this issue is
>>> much more likely to reproduce on boxes with a lot of hotpluggable
>>> processors, I'm willing to test any patches you may find suitable in our
>>> systems.
>>>
>>
>> Hi,
>>
>> [Adding linux-block, linux-scsi and Jens in CC for blk-mq input]
>>
>> Made some progress on tracking down the issue.  It's looking more of a
>> blk-mq bug than nvme to me now.  Let me know if I made any wrong
>> assumption below:
>>
>> I started to wonder whether blk-mq is requesting the IO in a wrong queue
>> or if the request is being moved to another queue during hotplug, as it
>> would happen in blk_mq_hctx_cpu_offline.
>>
>> I verified that when the fail occurs, the request never was never
>> touched in blk_mq_hctx_cpu_offline.  Also, and more interesting, I added
>> the following BUG_ON to nvme_queue_req:
>>
>>     BUG_ON(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) != req);
>>
>> From my experiments, I could never hit this during normal operation
>> (without SMT). But, as soon as I start triggering hotplug changes, I hit
>> it pretty fast.  By changing to WARN_ON, I see that the Oops will happen
>> almost immediately after this request is submitted, once __process_cq
>> tries to complete it.
>>
>> Also, when adding the patch below, we send a few -EIOs back to upper
>> layers, but it prevents the crash.
>>
>> I also think i saw a similar hang when running on a virtio_scsi disk.  I
>> couldn't reproduce it a second time, so we still lack confirmation, but
>> if confirmed, it indeed points away from nvme and back to the block
>> layer.
>>
>> Any suggestions to further track this down?
>
> Some good detective work so far! I agree, this looks like a blk-mq core
> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
> req->tag != nvmeq->tags? I don't immediately see how this could happen,
> the freezing should protect us from this, unless it's broken somehow.
> I'll take a look at this.

Can you share what you ran to online/offline CPUs? I can't reproduce
this here.

-- 
Jens Axboe

* Re: Oops when completing request on the wrong queue
  2016-08-19 15:51       ` Jens Axboe
@ 2016-08-19 16:38         ` Gabriel Krisman Bertazi
  -1 siblings, 0 replies; 29+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-19 16:38 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
	linux-block, linux-scsi

Jens Axboe <axboe@kernel.dk> writes:

>> Some good detective work so far! I agree, this looks like a blk-mq core
>> bug. Do you have a trace of a BUG() triggering in nvme_queue_rq(), when
>> req->tag != nvmeq->tags? I don't immediately see how this could happen,
>> the freezing should protect us from this, unless it's broken somehow.
>> I'll take a look at this.

Nothing much in the BUG() stack trace, at least to me.  We are coming
from an async execution of __blk_mq_run_hw_queue, as shown below.  This
is from 4.7; I'll give it another run on 4.8-rc1.

[ 1953.910860] kernel BUG at ../drivers/nvme/host/pci.c:602!
cpu 0x48: Vector: 700 (Program Check) at [c000007870d3b7a0]
    pc: d000000035aa2e88: nvme_queue_rq+0x1a8/0x7a0 [nvme]
    lr: d000000035aa2e78: nvme_queue_rq+0x198/0x7a0 [nvme]
    sp: c000007870d3ba20
   msr: 900000010282b033
  current = 0xc000007870d9e400
  paca    = 0xc000000002bd8800   softe: 0        irq_happened: 0x01
    pid   = 9205, comm = kworker/72:1H
kernel BUG at ../drivers/nvme/host/pci.c:602!
Linux version 4.7.0.mainline+ (root@iod76) (gcc version 5.3.1 20160413 (Ubuntu/IBM 5.3.1-14ubuntu2.
1) ) #150 SMP Wed Aug 17 14:53:47 CDT 2016
enter ? for help
[c000007870d3bb10] c0000000005697b4 __blk_mq_run_hw_queue+0x384/0x640
[c000007870d3bc50] c0000000000e8cf0 process_one_work+0x1e0/0x590
[c000007870d3bce0] c0000000000e9148 worker_thread+0xa8/0x660
[c000007870d3bd80] c0000000000f2090 kthread+0x110/0x130
[c000007870d3be30] c0000000000095f0 ret_from_kernel_thread+0x5c/0x6c
48:mon>


Another interesting data point is that we always reach the WARN_ON in
__blk_mq_run_hw_queue immediately before submitting the bad request.

Maybe we are touching some percpu variable from the wrong cpu?
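
(For reference, the check at block/blk-mq.c:840 in this kernel is, as far
as I can tell, the cross-CPU sanity check below; quoted from memory, so
treat it as approximate:)

        WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
                cpu_online(hctx->next_cpu));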

See the WARN_ON trace below.


[  369.953825] ------------[ cut here ]------------
[  369.954011] WARNING: CPU: 142 PID: 9475 at ../block/blk-mq.c:840 __blk_mq_run_hw_queue+0x338/0x6
40
[  369.954139] Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache binfmt_
misc xfs ofpart cmdlinepart powernv_flash mtd ipmi_powernv ipmi_msghandler at24 opal_prd ibmpowernv
 nvmem_core powernv_rng input_leds joydev uio_pdrv_genirq mac_hid uio ib_iser rdma_cm sunrpc iw_cm
ib_cm ib_core configfs iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 ra
id456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0
 multipath linear ast i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
 drm hid_generic uas usb_storage usbhid hid nvme ahci libahci nvme_core drbg ansi_cprng
[  369.955011] CPU: 142 PID: 9475 Comm: kworker/142:1H Not tainted 4.7.0.mainline+ #150
[  369.955085] Workqueue: kblockd blk_mq_run_work_fn
[  369.955153] task: c0000078cbb89a00 ti: c0000078cb024000 task.ti: c0000078cb024000
[  369.955253] NIP: c000000000569768 LR: c000000000569668 CTR: 0000000000000000
[  369.955314] REGS: c0000078cb027890 TRAP: 0700   Not tainted  (4.7.0.mainline+)
[  369.955372] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 28002224  XER:
 20000000
[  369.955579] CFAR: c000000000569a18 SOFTE: 1
GPR00: c0000000005695b0 c0000078cb027b10 c0000000013d0b00 0000000000000008
GPR04: 0000000000000008 0000000000000040 0000000000000000 0000000000000001
GPR08: c0000078c345d800 0000000000000001 0000000001700000 9000000102803003
GPR12: 0000000000002200 c000000002bffe00 c0000078bb96b000 0000000000000000
GPR16: 0000000000000008 c0000078bb930000 c000007954aafd00 c0000078bb961800
GPR20: c000007954aafd08 c000000000da3728 c0000078bea80000 c0000078cb027b88
GPR24: 0000000000000000 c0000078bea80000 c0000078bb961808 fffffffffffffffb
GPR28: 0000000000000000 c0000078cb027b98 0000000000000000 c0000078bd299700
[  369.956334] NIP [c000000000569768] __blk_mq_run_hw_queue+0x338/0x640
[  369.956383] LR [c000000000569668] __blk_mq_run_hw_queue+0x238/0x640
[  369.956433] Call Trace:
[  369.956454] [c0000078cb027b10] [c0000000005695b0] __blk_mq_run_hw_queue+0x180/0x640 (unreliable)
[  369.956535] [c0000078cb027c50] [c0000000000e8cf0] process_one_work+0x1e0/0x590
[  369.956605] [c0000078cb027ce0] [c0000000000e9148] worker_thread+0xa8/0x660
[  369.956664] [c0000078cb027d80] [c0000000000f2090] kthread+0x110/0x130
[  369.956723] [c0000078cb027e30] [c0000000000095f0] ret_from_kernel_thread+0x5c/0x6c
[  369.956791] Instruction dump:
[  369.956821] 815f0108 e91302d8 81280000 7f895040 409d01e8 e9280060 794a1f24 7d29502a
[  369.956920] 7d2afa78 312affff 7d295110 7d2907b4 <0b090000> e9210088 e95f0058 38810060
[  369.957020] ---[ end trace 1398dd60e3057065 ]---
[  369.957094] ------------[ cut here ]------------


> Can you share what you ran to online/offline CPUs? I can't reproduce
> this here.

I was using the ppc64_cpu tool, which shouldn't do anything more than
write to sysfs, but I just reproduced it with the script below.

Note that this is ppc64le.  I don't have an x86 machine in hand to
attempt to reproduce right now, but I'll look for one and see how it goes.


#!/bin/bash

MAXCPUS=159
STATE=1

while true; do
    # Toggle every CPU except those whose number is a multiple of 8
    # (and cpu0), so at least one thread per core stays online.
    for i in $(seq 1 ${MAXCPUS}); do
        if (($i % 8)); then
            echo $STATE > /sys/devices/system/cpu/cpu$i/online
        fi
    done
    if [[ $STATE -eq 1 ]]; then
        STATE=0
    else
        STATE=1
    fi
done

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-19 16:38         ` Gabriel Krisman Bertazi
@ 2016-08-23 20:54           ` Gabriel Krisman Bertazi
  -1 siblings, 0 replies; 29+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-23 20:54 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
	linux-block, linux-scsi

Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:

>> Can you share what you ran to online/offline CPUs? I can't reproduce
>> this here.
>
> I was using the ppc64_cpu tool, which shouldn't do nothing more than
> write to sysfs.  but I just reproduced it with the script below.
>
> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
> reproduce right now, but I'll look for one and see how it goes.

Hi,

Any luck on reproducing it?  We were initially reproducing with a
proprietary stress test, but I gave a generated fio jobfile a try,
combined with the SMT script I shared earlier, and I could reproduce
the crash consistently in less than 10 minutes of execution.  This was
still on ppc64le, though.  I couldn't get my hands on NVMe on x86 yet.

The job file I used, as well as the smt.sh script, in case you want to
give it a try:

jobfile: http://krisman.be/k/nvmejob.fio
smt.sh:  http://krisman.be/k/smt.sh

Still, the trigger seems to be consistently a heavy load of IO
associated with CPU addition/removal.

Let me share my progress from the last couple days in the hope that it
rings a bell for you.

Firstly, I verified that when we hit the BUG_ON in nvme_queue_rq, the
request_queue's freeze_depth is 0, which points away from a fault in the
freeze/unfreeze mechanism.  If a request was escaping and going through
the block layer during a freeze, we'd see freeze_depth >= 1.  Before
that, I had also tried to keep the q_usage_counter in atomic mode, in
case of a bug in the percpu refcount.  No luck, the BUG_ON was still
hit.
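
For reference, the debug check I used was essentially of this shape (a
rough sketch from memory, not the exact patch I ran; it sits in
drivers/nvme/host/pci.c next to the added BUG_ON in nvme_queue_rq and
assumes 4.7-era field names):

/*
 * Debug-only sketch: if the request's tag no longer resolves back to
 * the request through this queue's tags, report the queue's freeze
 * depth before blowing up, so a freeze/unfreeze race can be ruled out.
 */
static void nvme_debug_tag_mismatch(struct nvme_queue *nvmeq,
				    struct request *req)
{
	if (likely(blk_mq_tag_to_rq(*nvmeq->tags, req->tag) == req))
		return;

	pr_err("nvme: tag %d does not resolve to this request, freeze_depth=%d\n",
	       req->tag, atomic_read(&req->q->mq_freeze_depth));
	BUG();
}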

Also, I don't see anything special about the request that reaches the
BUG_ON.  It's a REQ_TYPE_FS request and, at least the last time I
reproduced this, it was a READ that came from the stress test task
through submit_bio.  So nothing remarkable about it either, as far as I
can see.

I'm still thinking about a case in which the mapping gets screwed up,
where a ctx would appear in two hctxs' bitmaps after a remap, or where a
ctx gets remapped to another hctx.  I'm still learning my way through the
cpumap code, so I'm not sure it's a real possibility, but I'm not
convinced it isn't.  Some preliminary tests don't suggest this is the case
at play, but I want to spend a little more time on this theory (maybe
for lack of better ideas :)
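
The preliminary check I have been running is roughly along these lines
(a sketch against the 4.7-era blk-mq internals, meant to be called from
the end of blk_mq_map_swqueue() in block/blk-mq.c; not a polished
patch):

/*
 * After a remap, every software ctx should be owned by exactly one
 * hardware queue.  Warn if a ctx shows up in more than one hctx's
 * ctx list.
 */
static void blk_mq_check_ctx_map(struct request_queue *q)
{
	struct blk_mq_hw_ctx *hctx;
	unsigned int i, j;
	int cpu;

	for_each_possible_cpu(cpu) {
		struct blk_mq_ctx *ctx = per_cpu_ptr(q->queue_ctx, cpu);
		int owners = 0;

		queue_for_each_hw_ctx(q, hctx, i) {
			for (j = 0; j < hctx->nr_ctx; j++)
				if (hctx->ctxs[j] == ctx)
					owners++;
		}

		WARN(owners > 1, "blk-mq: ctx of cpu %d owned by %d hctxs\n",
		     cpu, owners);
	}
}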

On a side note, probably unrelated to this crash, it also got me
thinking about the current usefulness of blk_mq_hctx_notify.  Since the
CPU is dead, no more requests would be coming through its ctx.  I think we
could force a queue run in blk_mq_queue_reinit_notify, before remapping,
which would cause the hctx to fetch the remaining requests from that
dead ctx (since it's not unmapped yet).  This way, we could maintain a
single hotplug notification hook and simplify the hotplug path.  I
haven't written a proper patch for it yet, but the rough shape is
sketched below; I'll see if I can come up with something complete and
send it to the list.
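
Something along these lines is what I have in mind (untested sketch,
4.7-era internals; it would be called from the CPU_DEAD leg of
blk_mq_queue_reinit_notify() while all_q_mutex is held, before the
queues are frozen and remapped):

/*
 * Kick every hardware queue once so it drains whatever is still
 * sitting on the dead CPU's software ctx, which at this point is
 * still mapped.
 */
static void blk_mq_drain_dead_cpu(void)
{
	struct request_queue *q;

	list_for_each_entry(q, &all_q_list, all_q_node)
		blk_mq_run_hw_queues(q, false);
}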

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-23 20:54           ` Gabriel Krisman Bertazi
@ 2016-08-23 21:11             ` Jens Axboe
  -1 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2016-08-23 21:11 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
	linux-block, linux-scsi

On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>
>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>> this here.
>>
>> I was using the ppc64_cpu tool, which shouldn't do nothing more than
>> write to sysfs.  but I just reproduced it with the script below.
>>
>> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
>> reproduce right now, but I'll look for one and see how it goes.
>
> Hi,
>
> Any luck on reproducing it?  We were initially reproducing with a
> proprietary stress test, but I gave a try to a generated fio jobfile
> associated with the SMT script I shared earlier and I could reproduce
> the crash consistently in less than 10 minutes of execution.  this was
> still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.

Nope, I have not been able to reproduce it. How long do the CPU
offline/online actions take on ppc64? It's pretty slow on x86, which may
hide the issue. I took out the various printk's associated with bringing
a CPU off/online, as well as the IRQ breaking parts, but that didn't
help in reproducing it.

> The job file I used, as well as the smt.sh script, in case you want to
> give it a try:
>
> jobfile: http://krisman.be/k/nvmejob.fio
> smt.sh:  http://krisman.be/k/smt.sh
>
> Still, the trigger seems to be consistently a heavy load of IO
> associated with CPU addition/removal.

My workload looks similar to yours, in that it's high depth and with a
lot of jobs to keep most CPUs loaded. My bash script is different than
yours, I'll try that and see if it helps here.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-23 21:11             ` Jens Axboe
@ 2016-08-23 21:14               ` Jens Axboe
  -1 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2016-08-23 21:14 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
	linux-block, linux-scsi

On 08/23/2016 03:11 PM, Jens Axboe wrote:
> On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
>> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>>
>>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>>> this here.
>>>
>>> I was using the ppc64_cpu tool, which shouldn't do nothing more than
>>> write to sysfs.  but I just reproduced it with the script below.
>>>
>>> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
>>> reproduce right now, but I'll look for one and see how it goes.
>>
>> Hi,
>>
>> Any luck on reproducing it?  We were initially reproducing with a
>> proprietary stress test, but I gave a try to a generated fio jobfile
>> associated with the SMT script I shared earlier and I could reproduce
>> the crash consistently in less than 10 minutes of execution.  this was
>> still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.
>
> Nope, I have not been able to reproduce it. How long does the CPU
> offline/online actions take on ppc64? It's pretty slow on x86, which may
> hide the issue. I took out the various printk's associated with bringing
> a CPU off/online, as well as IRQ breaking parts, but didn't help in
> reproducing it.
>
>> The job file I used, as well as the smt.sh script, in case you want to
>> give it a try:
>>
>> jobfile: http://krisman.be/k/nvmejob.fio
>> smt.sh:  http://krisman.be/k/smt.sh
>>
>> Still, the trigger seems to be consistently a heavy load of IO
>> associated with CPU addition/removal.
>
> My workload looks similar to yours, in that it's high depth and with a
> lot of jobs to keep most CPUs loaded. My bash script is different than
> yours, I'll try that and see if it helps here.

Actually, I take that back. You're not using O_DIRECT, hence all your
jobs are running at QD=1, not the 256 specified. That looks odd, but
I'll try, maybe it'll hit something different.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-23 21:14               ` Jens Axboe
@ 2016-08-23 22:49                 ` Keith Busch
  -1 siblings, 0 replies; 29+ messages in thread
From: Keith Busch @ 2016-08-23 22:49 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Gabriel Krisman Bertazi, Christoph Hellwig, linux-nvme,
	Brian King, linux-block, linux-scsi

On Tue, Aug 23, 2016 at 03:14:23PM -0600, Jens Axboe wrote:
> On 08/23/2016 03:11 PM, Jens Axboe wrote:
> >My workload looks similar to yours, in that it's high depth and with a
> >lot of jobs to keep most CPUs loaded. My bash script is different than
> >yours, I'll try that and see if it helps here.
> 
> Actually, I take that back. You're not using O_DIRECT, hence all your
> jobs are running at QD=1, not the 256 specified. That looks odd, but
> I'll try, maybe it'll hit something different.

I haven't recreated this either, but I think I can logically see why
this failure is happening.

I sent an nvme driver patch earlier on this thread to exit the hardware
context, which I thought would do the trick if the hctx's tags were
being moved. That turns out to be wrong for a couple of reasons.

First, we can't release nvmeq->tags when an hctx exits, because
that nvmeq may be used by other namespaces that need to point to
the device's tag set.

The other reason is that blk-mq doesn't exit or init hardware contexts
when remapping for a CPU event, leaving the nvme driver unaware that a
hardware context now points to a different tag set.

So I think I see why this test would fail; I don't know about a fix yet.
Maybe the nvme driver needs some indirection instead of pointing
directly to the tagset after init_hctx.
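
For readers following along, the code in question looks roughly like
this in the 4.7-era drivers/nvme/host/pci.c (quoted from memory, so
treat it as a sketch rather than an exact copy): the tags pointer is
cached once at init_hctx time and never revisited afterwards.

static int nvme_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
			  unsigned int hctx_idx)
{
	struct nvme_dev *dev = data;
	struct nvme_queue *nvmeq = dev->queues[hctx_idx + 1];

	/*
	 * Cache a pointer into the device's tag set; the completion
	 * path later resolves command IDs through *nvmeq->tags without
	 * ever being told about a CPU-hotplug remap.
	 */
	if (!nvmeq->tags)
		nvmeq->tags = &dev->tagset.tags[hctx_idx];

	hctx->driver_data = nvmeq;
	return 0;
}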

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-23 21:14               ` Jens Axboe
@ 2016-08-24 18:34                 ` Jens Axboe
  -1 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2016-08-24 18:34 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
	linux-block, linux-scsi

On 08/23/2016 03:14 PM, Jens Axboe wrote:
> On 08/23/2016 03:11 PM, Jens Axboe wrote:
>> On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
>>> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>>>
>>>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>>>> this here.
>>>>
>>>> I was using the ppc64_cpu tool, which shouldn't do nothing more than
>>>> write to sysfs.  but I just reproduced it with the script below.
>>>>
>>>> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
>>>> reproduce right now, but I'll look for one and see how it goes.
>>>
>>> Hi,
>>>
>>> Any luck on reproducing it?  We were initially reproducing with a
>>> proprietary stress test, but I gave a try to a generated fio jobfile
>>> associated with the SMT script I shared earlier and I could reproduce
>>> the crash consistently in less than 10 minutes of execution.  this was
>>> still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.
>>
>> Nope, I have not been able to reproduce it. How long does the CPU
>> offline/online actions take on ppc64? It's pretty slow on x86, which may
>> hide the issue. I took out the various printk's associated with bringing
>> a CPU off/online, as well as IRQ breaking parts, but didn't help in
>> reproducing it.
>>
>>> The job file I used, as well as the smt.sh script, in case you want to
>>> give it a try:
>>>
>>> jobfile: http://krisman.be/k/nvmejob.fio
>>> smt.sh:  http://krisman.be/k/smt.sh
>>>
>>> Still, the trigger seems to be consistently a heavy load of IO
>>> associated with CPU addition/removal.
>>
>> My workload looks similar to yours, in that it's high depth and with a
>> lot of jobs to keep most CPUs loaded. My bash script is different than
>> yours, I'll try that and see if it helps here.
>
> Actually, I take that back. You're not using O_DIRECT, hence all your
> jobs are running at QD=1, not the 256 specified. That looks odd, but
> I'll try, maybe it'll hit something different.

Can you try this patch? It's not perfect, but I'll be interested if it
makes a difference for you.

diff --git a/block/blk-mq.c b/block/blk-mq.c
index 758a9b5..41def54 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -810,11 +810,11 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  	struct list_head *dptr;
  	int queued;

-	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
-
  	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
  		return;

+	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
+
  	hctx->run++;

  	/*
@@ -1075,15 +1075,11 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
  }

  void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-		bool async)
+			   bool async)
  {
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
  	struct request_queue *q = rq->q;
  	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
-
-	current_ctx = blk_mq_get_ctx(q);
-	if (!cpu_online(ctx->cpu))
-		rq->mq_ctx = ctx = current_ctx;

  	hctx = q->mq_ops->map_queue(q, ctx->cpu);

@@ -1093,8 +1089,6 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,

  	if (run_queue)
  		blk_mq_run_hw_queue(hctx, async);
-
-	blk_mq_put_ctx(current_ctx);
  }

  static void blk_mq_insert_requests(struct request_queue *q,
@@ -1105,14 +1099,9 @@ static void blk_mq_insert_requests(struct request_queue *q,

  {
  	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *current_ctx;

  	trace_block_unplug(q, depth, !from_schedule);

-	current_ctx = blk_mq_get_ctx(q);
-
-	if (!cpu_online(ctx->cpu))
-		ctx = current_ctx;
  	hctx = q->mq_ops->map_queue(q, ctx->cpu);

  	/*
@@ -1125,14 +1114,12 @@ static void blk_mq_insert_requests(struct request_queue *q,

  		rq = list_first_entry(list, struct request, queuelist);
  		list_del_init(&rq->queuelist);
-		rq->mq_ctx = ctx;
  		__blk_mq_insert_req_list(hctx, ctx, rq, false);
  	}
  	blk_mq_hctx_mark_pending(hctx, ctx);
  	spin_unlock(&ctx->lock);

  	blk_mq_run_hw_queue(hctx, from_schedule);
-	blk_mq_put_ctx(current_ctx);
  }

  static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1692,6 +1679,11 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
  	while (!list_empty(&tmp)) {
  		struct request *rq;

+		/*
+		 * FIXME: we can't just move the req here. We'd have to
+		 * pull off the bio chain and add it to a new request
+		 * on the target hw queue
+		 */
  		rq = list_first_entry(&tmp, struct request, queuelist);
  		rq->mq_ctx = ctx;
  		list_move_tail(&rq->queuelist, &ctx->rq_list);

-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-24 18:34                 ` Jens Axboe
@ 2016-08-24 20:36                   ` Jens Axboe
  -1 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2016-08-24 20:36 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
	linux-block, linux-scsi

On 08/24/2016 12:34 PM, Jens Axboe wrote:
> On 08/23/2016 03:14 PM, Jens Axboe wrote:
>> On 08/23/2016 03:11 PM, Jens Axboe wrote:
>>> On 08/23/2016 02:54 PM, Gabriel Krisman Bertazi wrote:
>>>> Gabriel Krisman Bertazi <krisman@linux.vnet.ibm.com> writes:
>>>>
>>>>>> Can you share what you ran to online/offline CPUs? I can't reproduce
>>>>>> this here.
>>>>>
>>>>> I was using the ppc64_cpu tool, which shouldn't do nothing more than
>>>>> write to sysfs.  but I just reproduced it with the script below.
>>>>>
>>>>> Note that this is ppc64le.  I don't have a x86 in hand to attempt to
>>>>> reproduce right now, but I'll look for one and see how it goes.
>>>>
>>>> Hi,
>>>>
>>>> Any luck on reproducing it?  We were initially reproducing with a
>>>> proprietary stress test, but I gave a try to a generated fio jobfile
>>>> associated with the SMT script I shared earlier and I could reproduce
>>>> the crash consistently in less than 10 minutes of execution.  this was
>>>> still ppc64le, though.  I couldn't get my hands on nvme on x86 yet.
>>>
>>> Nope, I have not been able to reproduce it. How long does the CPU
>>> offline/online actions take on ppc64? It's pretty slow on x86, which may
>>> hide the issue. I took out the various printk's associated with bringing
>>> a CPU off/online, as well as IRQ breaking parts, but didn't help in
>>> reproducing it.
>>>
>>>> The job file I used, as well as the smt.sh script, in case you want to
>>>> give it a try:
>>>>
>>>> jobfile: http://krisman.be/k/nvmejob.fio
>>>> smt.sh:  http://krisman.be/k/smt.sh
>>>>
>>>> Still, the trigger seems to be consistently a heavy load of IO
>>>> associated with CPU addition/removal.
>>>
>>> My workload looks similar to yours, in that it's high depth and with a
>>> lot of jobs to keep most CPUs loaded. My bash script is different than
>>> yours, I'll try that and see if it helps here.
>>
>> Actually, I take that back. You're not using O_DIRECT, hence all your
>> jobs are running at QD=1, not the 256 specified. That looks odd, but
>> I'll try, maybe it'll hit something different.
>
> Can you try this patch? It's not perfect, but I'll be interested if it
> makes a difference for you.

This one should handle the WARN_ON() for running the hw queue on the
wrong CPU as well.


diff --git a/block/blk-mq.c b/block/blk-mq.c
index 758a9b5..b21a9b9 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -810,11 +810,12 @@ static void __blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx)
  	struct list_head *dptr;
  	int queued;

-	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask));
-
  	if (unlikely(test_bit(BLK_MQ_S_STOPPED, &hctx->state)))
  		return;

+	WARN_ON(!cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask) &&
+		cpu_online(hctx->next_cpu));
+
  	hctx->run++;

  	/*
@@ -1075,15 +1082,11 @@ static void __blk_mq_insert_request(struct blk_mq_hw_ctx *hctx,
  }

  void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,
-		bool async)
+			   bool async)
  {
+	struct blk_mq_ctx *ctx = rq->mq_ctx;
  	struct request_queue *q = rq->q;
  	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *ctx = rq->mq_ctx, *current_ctx;
-
-	current_ctx = blk_mq_get_ctx(q);
-	if (!cpu_online(ctx->cpu))
-		rq->mq_ctx = ctx = current_ctx;

  	hctx = q->mq_ops->map_queue(q, ctx->cpu);

@@ -1093,8 +1096,6 @@ void blk_mq_insert_request(struct request *rq, bool at_head, bool run_queue,

  	if (run_queue)
  		blk_mq_run_hw_queue(hctx, async);
-
-	blk_mq_put_ctx(current_ctx);
  }

  static void blk_mq_insert_requests(struct request_queue *q,
@@ -1105,14 +1106,9 @@ static void blk_mq_insert_requests(struct request_queue *q,

  {
  	struct blk_mq_hw_ctx *hctx;
-	struct blk_mq_ctx *current_ctx;

  	trace_block_unplug(q, depth, !from_schedule);

-	current_ctx = blk_mq_get_ctx(q);
-
-	if (!cpu_online(ctx->cpu))
-		ctx = current_ctx;
  	hctx = q->mq_ops->map_queue(q, ctx->cpu);

  	/*
@@ -1125,14 +1121,12 @@ static void blk_mq_insert_requests(struct request_queue *q,

  		rq = list_first_entry(list, struct request, queuelist);
  		list_del_init(&rq->queuelist);
-		rq->mq_ctx = ctx;
  		__blk_mq_insert_req_list(hctx, ctx, rq, false);
  	}
  	blk_mq_hctx_mark_pending(hctx, ctx);
  	spin_unlock(&ctx->lock);

  	blk_mq_run_hw_queue(hctx, from_schedule);
-	blk_mq_put_ctx(current_ctx);
  }

  static int plug_ctx_cmp(void *priv, struct list_head *a, struct list_head *b)
@@ -1692,6 +1686,11 @@ static int blk_mq_hctx_cpu_offline(struct blk_mq_hw_ctx *hctx, int cpu)
  	while (!list_empty(&tmp)) {
  		struct request *rq;

+		/*
+		 * FIXME: we can't just move the req here. We'd have to
+		 * pull off the bio chain and add it to a new request
+		 * on the target hw queue
+		 */
  		rq = list_first_entry(&tmp, struct request, queuelist);
  		rq->mq_ctx = ctx;
  		list_move_tail(&rq->queuelist, &ctx->rq_list);


-- 
Jens Axboe

^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-24 20:36                   ` Jens Axboe
@ 2016-08-29 18:06                     ` Gabriel Krisman Bertazi
  -1 siblings, 0 replies; 29+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-08-29 18:06 UTC (permalink / raw)
  To: Jens Axboe
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
	linux-block, linux-scsi

Jens Axboe <axboe@kernel.dk> writes:
>> Can you try this patch? It's not perfect, but I'll be interested if it
>> makes a difference for you.
>

Hi Jens,

Sorry for the delay.  I just got back to this and have been running your
patch on top of 4.8 without a crash for over 1 hour.  I want to give it
more time to make sure it's running properly, though.

Let me get back to you after a few more rounds of test.

> This one should handle the WARN_ON() for running the hw queue on the
> wrong CPU as well.

On the workaround you added to prevent the WARN_ON, we surely need to
prevent blk_mq_hctx_next_cpu from scheduling dead CPUs in the first
place, right?  How do you feel about the following RFC?  I know it's
not a complete fix, but it feels like a good improvement to me.

http://www.spinics.net/lists/linux-scsi/msg98608.html
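
For context, the direction of that RFC is to have the next-CPU
selection skip offline CPUs; very roughly something of this shape
(illustrative sketch only, not the patch actually posted at the link
above):

static int blk_mq_hctx_next_cpu(struct blk_mq_hw_ctx *hctx)
{
	if (hctx->queue->nr_hw_queues == 1)
		return WORK_CPU_UNBOUND;

	if (--hctx->next_cpu_batch <= 0) {
		int next_cpu;

		/*
		 * Only consider CPUs that are both in the hctx mask and
		 * currently online when picking where to run the queue.
		 */
		next_cpu = cpumask_next_and(hctx->next_cpu, hctx->cpumask,
					    cpu_online_mask);
		if (next_cpu >= nr_cpu_ids)
			next_cpu = cpumask_first_and(hctx->cpumask,
						     cpu_online_mask);

		hctx->next_cpu = next_cpu;
		hctx->next_cpu_batch = BLK_MQ_CPU_WORK_BATCH;
	}

	return hctx->next_cpu;
}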

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-29 18:06                     ` Gabriel Krisman Bertazi
@ 2016-08-29 18:40                       ` Jens Axboe
  -1 siblings, 0 replies; 29+ messages in thread
From: Jens Axboe @ 2016-08-29 18:40 UTC (permalink / raw)
  To: Gabriel Krisman Bertazi
  Cc: Keith Busch, Christoph Hellwig, linux-nvme, Brian King,
	linux-block, linux-scsi

On 08/29/2016 12:06 PM, Gabriel Krisman Bertazi wrote:
> Jens Axboe <axboe@kernel.dk> writes:
>>> Can you try this patch? It's not perfect, but I'll be interested if it
>>> makes a difference for you.
>>
>
> Hi Jens,
>
> Sorry for the delay.  I just got back to this and have been running your
> patch on top of 4.8 without a crash for over 1 hour.  I wanna give it
> more time to make sure it's running properly, though.
>
> Let me get back to you after a few more rounds of test.

Thanks, sounds good. The patches have landed in mainline too.

>> This one should handle the WARN_ON() for running the hw queue on the
>> wrong CPU as well.
>
> On the workaround you added to prevent WARN_ON, we surely need to
> prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first
> place, right..  How do you feel about the following RFC?  I know it's
> not a complete fix, but it feels like a good improvement to me.
>
> http://www.spinics.net/lists/linux-scsi/msg98608.html

But we can't completely prevent it, and I don't think we have to. I just
don't want to trigger a warning for something that's a valid condition.
I want the warning to trigger if this happens without the CPU going
offline, since then it's indicative of a real bug in the mapping. Your
patch isn't going to prevent it either - it'll shrink the window, at the
expense of making blk_mq_hctx_next_cpu() more expensive. So I don't
think it's worthwhile.


-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: Oops when completing request on the wrong queue
  2016-08-29 18:40                       ` Jens Axboe
@ 2016-09-05 12:02                         ` Gabriel Krisman Bertazi
  -1 siblings, 0 replies; 29+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-09-05 12:02 UTC (permalink / raw)
  To: Jens Axboe
  Cc: linux-block, linux-scsi, linux-nvme, Keith Busch, Brian King,
	Christoph Hellwig

Jens Axboe <axboe@kernel.dk> writes:

> On 08/29/2016 12:06 PM, Gabriel Krisman Bertazi wrote:
>> Jens Axboe <axboe@kernel.dk> writes:
>>>> Can you try this patch? It's not perfect, but I'll be interested if it
>>>> makes a difference for you.
>>>
>>
>> Hi Jens,
>>
>> Sorry for the delay.  I just got back to this and have been running your
>> patch on top of 4.8 without a crash for over 1 hour.  I wanna give it
>> more time to make sure it's running properly, though.
>>
>> Let me get back to you after a few more rounds of test.
>
> Thanks, sounds good. The patches have landed in mainline too.

Hi Jens,

Our test teams ran stress tests on several machines over the last week
on a test kernel with your patches applied, and were no longer able to
reproduce the issue.

Thanks a lot for helping out on this one.

>>> This one should handle the WARN_ON() for running the hw queue on the
>>> wrong CPU as well.
>>
>> On the workaround you added to prevent WARN_ON, we surely need to
>> prevent blk_mq_hctx_next_cpu from scheduling dead cpus in the first
>> place, right..  How do you feel about the following RFC?  I know it's
>> not a complete fix, but it feels like a good improvement to me.
>>
>> http://www.spinics.net/lists/linux-scsi/msg98608.html
>
> But we can't completely prevent it, and I don't think we have to. I just
> don't want to trigger a warning for something that's a valid condition.
> I want the warning to trigger if this happens without the CPU going
> offline, since then it's indicative of a real bug in the mapping. Your
> patch isn't going to prevent it either - it'll shrink the window, at the
> expense of making blk_mq_hctx_next_cpu() more expensive. So I don't
> think it's worthwhile.

Right, I got your point.  Your patch definitely prevents the WARN_ON
from occurring on CPU hotplug events too.  So thanks a lot for the help
on that front as well :)

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 29+ messages in thread
