* crash when connecting to targets using nr_io_queues < num cpus
@ 2016-08-31 20:12 Steve Wise
  2016-09-01  9:32 ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Steve Wise @ 2016-08-31 20:12 UTC (permalink / raw)


Hey all,

I'm testing smaller ioq sets with nvmf/rdma, and I see an issue.  If I connect
with 2, 4, 6, 8, 10, 16, or 32 for nr_io_queues, everything is happy.  It
seems, though, that if I connect with a value of 12, or 28, or some other
non-power-of-two value, I get intermittent crashes in
__blk_mq_get_reserved_tag() at line 337 when setting up a controller's IO
queues.  I'm not sure whether this is always non-power-of-two values or
something else, but it never seems to crash with power-of-two values (could be
a coincidence, I guess).

Here:

crash> gdb list *blk_mq_get_tag+0x29
0xffffffff8133b239 is in blk_mq_get_tag (block/blk-mq-tag.c:337).
332
333     static unsigned int __blk_mq_get_reserved_tag(struct blk_mq_alloc_data *data)
334     {
335             int tag, zero = 0;
336
337             if (unlikely(!data->hctx->tags->nr_reserved_tags)) { 
338                     WARN_ON_ONCE(1);
339                     return BLK_MQ_TAG_FAIL;
340             }
341

This is with linux-4.8-rc3.  Are there restrictions on the number of queues that
can be set up other than <= nr_cpus?

From my initial debugging, the tag allocation path is passed an hctx with a
NULL tags pointer, so data->hctx->tags is NULL, causing this crash:

[  125.225879] nvme nvme1: creating 26 I/O queues.
[  125.346655] BUG: unable to handle kernel NULL pointer dereference at
0000000000000004
[  125.355543] IP: [<ffffffff8133b239>] blk_mq_get_tag+0x29/0xc0
[  125.362332] PGD ff81e9067 PUD 1004ecc067 PMD 0
[  125.367955] Oops: 0000 [#1] SMP
[  125.372078] Modules linked in: nvme_rdma nvme_fabrics brd iw_cxgb4 cxgb4
ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM
iptable_mangle iptable_filter ip_tables bridge 8021q mrp garp stp llc cachefiles
fscache rdma_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad ocrdma be2net
iw_nes libcrc32c iw_cxgb3 cxgb3 mdio ib_qib rdmavt mlx5_ib mlx5_core mlx4_ib
mlx4_en mlx4_core ib_mthca ib_core binfmt_misc dm_mirror dm_region_hash dm_log
vhost_net macvtap macvlan vhost tun kvm irqbypass uinput iTCO_wdt
iTCO_vendor_support mxm_wmi pcspkr dm_mod i2c_i801 i2c_smbus sg lpc_ich mfd_core
mei_me mei nvme nvme_core igb dca ptp pps_core ipmi_si ipmi_msghandler wmi
ext4(E) mbcache(E) jbd2(E) sd_mod(E) ahci(E) libahci(E) libata(E) mgag200(E)
ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E) sysfillrect(E)
syscopyarea(E) i2c_algo_bit(E) i2c_core(E) [last unloaded: cxgb4]
[  125.475243] CPU: 0 PID: 11439 Comm: nvme Tainted: G            E
4.8.0-rc3-nvmf+block+reboot #26
[  125.485382] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
[  125.493530] task: ffff881034994140 task.stack: ffff8810319bc000
[  125.500667] RIP: 0010:[<ffffffff8133b239>]  [<ffffffff8133b239>]
blk_mq_get_tag+0x29/0xc0
[  125.510108] RSP: 0018:ffff8810319bfa58  EFLAGS: 00010202
[  125.516650] RAX: ffff880fe09c1800 RBX: ffff8810319bfae8 RCX: 0000000000000000
[  125.525038] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8810319bfae8
[  125.533423] RBP: ffff8810319bfa78 R08: 0000000000000000 R09: 0000000000000000
[  125.541814] R10: ffff88103e807200 R11: 0000000000000001 R12: 0000000000000001
[  125.550185] R13: 0000000000000000 R14: ffff880fe09c1800 R15: 0000000000000000
[  125.558548] FS:  00007fc764c0a700(0000) GS:ffff88103ee00000(0000)
knlGS:0000000000000000
[  125.567880] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  125.574873] CR2: 0000000000000004 CR3: 000000102869f000 CR4: 00000000000406f0
[  125.583264] Stack:
[  125.586547]  dead000000000200 0000000081332b5d 0000000000000001
ffff8810319bfae8
[  125.595292]  ffff8810319bfad8 ffffffff81336142 ffff881004ea35d0
ffff8810319bfc08
[  125.604062]  ffff8810319bfb48 ffffffff81332ccc 0000000000000000
ffff881004da91c0
[  125.612826] Call Trace:
[  125.616575]  [<ffffffff81336142>] __blk_mq_alloc_request+0x32/0x260
[  125.624142]  [<ffffffff81332ccc>] ? blk_execute_rq+0x8c/0x110
[  125.631187]  [<ffffffff81336d95>] blk_mq_alloc_request_hctx+0xb5/0x110
[  125.639012]  [<ffffffffa00affd7>] nvme_alloc_request+0x37/0x90 [nvme_core]
[  125.647170]  [<ffffffffa00b057c>] __nvme_submit_sync_cmd+0x3c/0xe0
[nvme_core]
[  125.655685]  [<ffffffffa065bdc4>] nvmf_connect_io_queue+0x114/0x160
[nvme_fabrics]
[  125.664551]  [<ffffffffa06388b7>] nvme_rdma_create_io_queues+0x1b7/0x210
[nvme_rdma]
[  125.673565]  [<ffffffffa0639643>] ?
nvme_rdma_configure_admin_queue+0x1e3/0x280 [nvme_rdma]
[  125.683198]  [<ffffffffa0639a83>] nvme_rdma_create_ctrl+0x3a3/0x4c0
[nvme_rdma]
[  125.691793]  [<ffffffff81205fcd>] ? kmem_cache_alloc_trace+0x14d/0x1a0
[  125.699582]  [<ffffffffa065bf92>] nvmf_create_ctrl+0x182/0x210 [nvme_fabrics]
[  125.707986]  [<ffffffffa065c0cc>] nvmf_dev_write+0xac/0x108 [nvme_fabrics]
[  125.716131]  [<ffffffff8122d144>] __vfs_write+0x34/0x120
[  125.722697]  [<ffffffff81003725>] ?
trace_event_raw_event_sys_enter+0xb5/0x130
[  125.731153]  [<ffffffff8122d2f1>] vfs_write+0xc1/0x130
[  125.737541]  [<ffffffff81249793>] ? __fdget+0x13/0x20
[  125.743813]  [<ffffffff8122d466>] SyS_write+0x56/0xc0
[  125.750070]  [<ffffffff81003e7d>] do_syscall_64+0x7d/0x230
[  125.756755]  [<ffffffff8106f057>] ? do_page_fault+0x37/0x90
[  125.763527]  [<ffffffff816e17e1>] entry_SYSCALL64_slow_path+0x25/0x25
[  125.771154] Code: 00 00 55 48 89 e5 53 48 83 ec 18 66 66 66 66 90 f6 47 08 02
48 89 fb 74 34 c7 45 ec 00 00 00 00 48 8b 47 18 4c 8b 80 90 01 00 00 <41> 8b 70
04 85 f6 74 5b 48 8d 4d ec 49 8d 70 38 31 d2 e8 80 fd
[  125.793923] RIP  [<ffffffff8133b239>] blk_mq_get_tag+0x29/0xc0
[  125.800957]  RSP <ffff8810319bfa58>
[  125.805583] CR2: 0000000000000004
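
For reference, here is a condensed, annotated sketch of the allocation path
involved (simplified and abridged, not the verbatim 4.8-rc3 source), showing
where the NULL tags pointer gets dereferenced:

/*
 * Condensed sketch of the 4.8-rc3 path, per the call trace above:
 *
 *   nvmf_connect_io_queue() -> __nvme_submit_sync_cmd()
 *     -> nvme_alloc_request() -> blk_mq_alloc_request_hctx()
 *
 * The Connect command must go out on one specific hardware context, so
 * the hctx is picked by index rather than from the submitting CPU.
 */
struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
		unsigned int flags, unsigned int hctx_idx)
{
	struct blk_mq_alloc_data alloc_data;
	struct blk_mq_hw_ctx *hctx;
	struct blk_mq_ctx *ctx;

	/* flag validation and queue-enter checks elided */

	/*
	 * With nr_io_queues < num CPUs the CPU-to-queue map can leave some
	 * hardware contexts with no CPUs at all; an unmapped hctx never
	 * gets a tag set, so hctx->tags stays NULL.  Nothing on this path
	 * checks for that before using the hctx.
	 */
	hctx = q->queue_hw_ctx[hctx_idx];
	ctx = __blk_mq_get_ctx(q, cpumask_first(hctx->cpumask));

	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);

	/*
	 * __blk_mq_alloc_request() -> blk_mq_get_tag() ->
	 * __blk_mq_get_reserved_tag() then reads
	 * data->hctx->tags->nr_reserved_tags, i.e. a load at offset 4 of a
	 * NULL pointer -- matching CR2: 0000000000000004 in the oops above.
	 */
	return __blk_mq_alloc_request(&alloc_data, rw, 0);
}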


* crash when connecting to targets using nr_io_queues < num cpus
  2016-08-31 20:12 crash when connecting to targets using nr_io_queues < num cpus Steve Wise
@ 2016-09-01  9:32 ` Sagi Grimberg
  2016-09-01 14:10   ` Steve Wise
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2016-09-01  9:32 UTC (permalink / raw)



> Hey all,
>
> I'm testing smaller ioq sets with nvmf/rdma, and I see an issue.  If I connect
> with 2, 4, 6, 8, 10, 16, or 32 for nr_io_queues, everything is happy.  It
> seems, though, that if I connect with a value of 12, or 28, or some other
> non-power-of-two value, I get intermittent crashes in
> __blk_mq_get_reserved_tag() at line 337 when setting up a controller's IO
> queues.  I'm not sure whether this is always non-power-of-two values or
> something else, but it never seems to crash with power-of-two values (could be
> a coincidence, I guess).

I think Ming sent a patch for this some time ago... Not sure what
happened with it though...


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-01  9:32 ` Sagi Grimberg
@ 2016-09-01 14:10   ` Steve Wise
  2016-09-01 19:01     ` Steve Wise
  0 siblings, 1 reply; 22+ messages in thread
From: Steve Wise @ 2016-09-01 14:10 UTC (permalink / raw)


> 
> > Hey all,
> >
> > I'm testing smaller ioq sets with nvmf/rdma, and I see an issue.  If I connect
> > with 2, 4, 6, 8, 10, 16, or 32 for nr_io_queues, everything is happy.  It
> > seems, though, that if I connect with a value of 12, or 28, or some other
> > non-power-of-two value, I get intermittent crashes in
> > __blk_mq_get_reserved_tag() at line 337 when setting up a controller's IO
> > queues.  I'm not sure whether this is always non-power-of-two values or
> > something else, but it never seems to crash with power-of-two values (could be
> > a coincidence, I guess).
> 
> I think Ming sent a patch for this some time ago... Not sure what
> happened with it though...

This?

http://lists.infradead.org/pipermail/linux-nvme/2016-June/004884.html


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-01 14:10   ` Steve Wise
@ 2016-09-01 19:01     ` Steve Wise
  2016-09-04  8:46       ` Sagi Grimberg
  0 siblings, 1 reply; 22+ messages in thread
From: Steve Wise @ 2016-09-01 19:01 UTC (permalink / raw)


> > > Hey all,
> > >
> > > I'm testing smaller ioq sets with nvmf/rdma, and I see an issue.  If I connect
> > > with 2, 4, 6, 8, 10, 16, or 32 for nr_io_queues, everything is happy.  It
> > > seems, though, that if I connect with a value of 12, or 28, or some other
> > > non-power-of-two value, I get intermittent crashes in
> > > __blk_mq_get_reserved_tag() at line 337 when setting up a controller's IO
> > > queues.  I'm not sure whether this is always non-power-of-two values or
> > > something else, but it never seems to crash with power-of-two values (could be
> > > a coincidence, I guess).
> >
> > I think Ming sent a patch for this some time ago... Not sure what
> > happened with it though...
> 
> This?
> 
> http://lists.infradead.org/pipermail/linux-nvme/2016-June/004884.html

This is indeed the same problem.  I don't have the noggin to propose a fix.
Sagi/Christoph, do you have any ideas on this?  I'm willing to take an idea
forward and test it out if you all have any clever ideas.  We should at least
prevent setting nr_io_queues to a value that will crash immediately when nvmf is
used...

Steve.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-01 19:01     ` Steve Wise
@ 2016-09-04  8:46       ` Sagi Grimberg
  2016-09-13 14:21         ` Steve Wise
  0 siblings, 1 reply; 22+ messages in thread
From: Sagi Grimberg @ 2016-09-04  8:46 UTC (permalink / raw)



>>> I think Ming sent a patch for this some time ago... Not sure what
>>> happened with it though...
>>
>> This?
>>
>> http://lists.infradead.org/pipermail/linux-nvme/2016-June/004884.html
>
> This is indeed the same problem.  I don't have the noggin to propose a fix.
> Sagi/Christoph, do you have any ideas on this?  I'm willing to take an idea
> forward and test it out if you all have any clever ideas.  We should at least
> prevent setting nr_io_queues to a value that will crash immediately when nvmf is
> used...

I think that Ming and Keith had a few suggestions.

++ Keith.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-04  8:46       ` Sagi Grimberg
@ 2016-09-13 14:21         ` Steve Wise
  2016-09-13 17:14           ` Ming Lin
  2016-09-13 17:52           ` Keith Busch
  0 siblings, 2 replies; 22+ messages in thread
From: Steve Wise @ 2016-09-13 14:21 UTC (permalink / raw)


> 
> >>> I think Ming sent a patch for this some time ago... Not sure what
> >>> happened with it though...
> >>
> >> This?
> >>
> >> http://lists.infradead.org/pipermail/linux-nvme/2016-June/004884.html
> >
> > This is indeed the same problem.  I don't have the noggin to propose a fix.
> > Sagi/Christoph, do you have any ideas on this?  I'm willing to take an idea
> > forward and test it out if you all have any clever ideas.  We should at least
> > prevent setting nr_io_queues to a value that will crash immediately when nvmf is
> > used...
> 
> I think that Ming and Keith had a few suggestions.
> 
> ++ Keith.
>

Ming has been silent. :(  Keith, any thoughts on this?  

Thanks,

Steve.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-13 14:21         ` Steve Wise
@ 2016-09-13 17:14           ` Ming Lin
  2016-09-13 17:52           ` Keith Busch
  1 sibling, 0 replies; 22+ messages in thread
From: Ming Lin @ 2016-09-13 17:14 UTC (permalink / raw)


On Tue, Sep 13, 2016 at 7:21 AM, Steve Wise <swise@opengridcomputing.com> wrote:
>>
>> >>> I think Ming sent a patch for this some time ago... Not sure what
>> >>> happened with it though...
>> >>
>> >> This?
>> >>
>> >> http://lists.infradead.org/pipermail/linux-nvme/2016-June/004884.html
>> >
>> > This is indeed the same problem.  I don't have the noggin to propose a fix.
>> > Sagi/Christoph, do you have any ideas on this?  I'm willing to take an idea
>> > forward and test it out if you all have any clever ideas.  We should at least
>> > prevent setting nr_io_queues to a value that will crash immediately when nvmf is
>> > used...
>>
>> I think that Ming and Keith had a few suggestions.
>>
>> ++ Keith.
>>
>
> Ming has been silent. :(  Keith, any thoughts on this?

Sorry, I've been busy with other internal projects :(

>
> Thanks,
>
> Steve.
>


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-13 14:21         ` Steve Wise
  2016-09-13 17:14           ` Ming Lin
@ 2016-09-13 17:52           ` Keith Busch
  2016-09-13 19:43             ` Steve Wise
  2016-09-16 14:10             ` Steve Wise
  1 sibling, 2 replies; 22+ messages in thread
From: Keith Busch @ 2016-09-13 17:52 UTC (permalink / raw)


On Tue, Sep 13, 2016 at 09:21:36AM -0500, Steve Wise wrote:
> > 
> > >>> I think Ming sent a patch for this some time ago... Not sure what
> > >>> happened with it though...
> > >>
> > >> This?
> > >>
> > >> http://lists.infradead.org/pipermail/linux-nvme/2016-June/004884.html
> > >
> > > This is indeed the same problem.  I don't have the noggin to propose a fix.
> > > Sagi/Christoph, do you have any ideas on this?  I'm willing to take an idea
> > > forward and test it out if you all have any clever ideas.  We should at least
> > > prevent setting nr_io_queues to a value that will crash immediately when nvmf is
> > > used...
> > 
> > I think that Ming and Keith had a few suggestions.
> > 
> > ++ Keith.
> >
> 
> Ming has been silent. :(  Keith, any thoughts on this?  

Sorry, I've also been side tracked. :(

Offline, I've been reviewing and testing new mappings from Christoph
and Thomas that should get all the queues assigned. It wasn't developed
specifically for this issue, but it should fix this anyway. I think the
new mapping is really close to being ready for public consideration,
but I'll wait for Christoph on that.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-13 17:52           ` Keith Busch
@ 2016-09-13 19:43             ` Steve Wise
  2016-09-16 14:10             ` Steve Wise
  1 sibling, 0 replies; 22+ messages in thread
From: Steve Wise @ 2016-09-13 19:43 UTC (permalink / raw)


> > > >>> I think Ming sent a patch for this some time ago... Not sure what
> > > >>> happened with it though...
> > > >>
> > > >> This?
> > > >>
> > > >> http://lists.infradead.org/pipermail/linux-nvme/2016-June/004884.html
> > > >
> > > > This is indeed the same problem.  I don't have the noggin to propose a fix.
> > > > Sagi/Christoph, do you have any ideas on this?  I'm willing to take an idea
> > > > forward and test it out if you all have any clever ideas.  We should at least
> > > > prevent setting nr_io_queues to a value that will crash immediately when nvmf is
> > > > used...
> > >
> > > I think that Ming and Keith had a few suggestions.
> > >
> > > ++ Keith.
> > >
> >
> > Ming has been silent. :(  Keith, any thoughts on this?
> 
> Sorry, I've also been side tracked. :(
> 
> Offline, I've been reviewing and testing new mappings from Christoph
> and Thomas that should get all the queues assigned. It wasn't developed
> specifically for this issue, but it should fix this anyway. I think the
> new mapping is really close to being ready for public consideration,
> but I'll wait for Christoph on that.

Hey Keith,

Sounds good.  I can test any proposed patches...

Thanks,

Steve.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-13 17:52           ` Keith Busch
  2016-09-13 19:43             ` Steve Wise
@ 2016-09-16 14:10             ` Steve Wise
  2016-09-16 14:26               ` 'Christoph Hellwig'
  1 sibling, 1 reply; 22+ messages in thread
From: Steve Wise @ 2016-09-16 14:10 UTC (permalink / raw)


> > > >>
> > > >> http://lists.infradead.org/pipermail/linux-nvme/2016-June/004884.html
> > > >
> > > > This is indeed the same problem.  I don't have the noggin to propose a fix.
> > > > Sagi/Christoph, do you have any ideas on this?  I'm willing to take an idea
> > > > forward and test it out if you all have any clever ideas.  We should at least
> > > > prevent setting nr_io_queues to a value that will crash immediately when nvmf is
> > > > used...
> > >
> > > I think that Ming and Keith had a few suggestions.
> > >
> > > ++ Keith.
> > >
> >
> > Ming has been silent. :(  Keith, any thoughts on this?
> 
> Sorry, I've also been side tracked. :(
> 
> Offline, I've been reviewing and testing new mappings from Christoph
> and Thomas that should get all the queues assigned. It wasn't developed
> specifically for this issue, but it should fix this anyway. I think the
> new mapping is really close to being ready for public consideration,
> but I'll wait for Christoph on that.
> 

Hey Christoph,

Is this the series?

https://lwn.net/Articles/700625/

If not, is there something I can try out?

Steve.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-16 14:10             ` Steve Wise
@ 2016-09-16 14:26               ` 'Christoph Hellwig'
  2016-09-22 21:02                 ` 'Christoph Hellwig'
  0 siblings, 1 reply; 22+ messages in thread
From: 'Christoph Hellwig' @ 2016-09-16 14:26 UTC (permalink / raw)


On Fri, Sep 16, 2016 at 09:10:48AM -0500, Steve Wise wrote:
> Is this the series?
> 
> https://lwn.net/Articles/700625/

I don't see how that would change any kind of mapping for fabrics.

> If not, is there something I can try out?

Let me dig through the thread and see what I can do.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-16 14:26               ` 'Christoph Hellwig'
@ 2016-09-22 21:02                 ` 'Christoph Hellwig'
  2016-09-22 21:38                   ` Steve Wise
  0 siblings, 1 reply; 22+ messages in thread
From: 'Christoph Hellwig' @ 2016-09-22 21:02 UTC (permalink / raw)


Steve,

can you test if the patch below properly fails the connect and avoids
the crash?

We could potentially also do something better than just returning the
error in that case.  From a quick look at the code, even just ignoring
an EXDEV return from nvmf_connect_io_queue might do the right thing,
so feel free to try that if you have some spare cycles.

---
From d76be818600d92341125b7c78dcab780a9833427 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <hch@lst.de>
Date: Thu, 22 Sep 2016 13:56:54 -0700
Subject: blk-mq: skip unmapped queues in blk_mq_alloc_request_hctx

This provides the caller feedback that a given hctx is not mapped and thus
no command can be sent on it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
---
 block/blk-mq.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/block/blk-mq.c b/block/blk-mq.c
index e9b8007..7b430ab 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -266,17 +266,29 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q, int rw,
 	if (ret)
 		return ERR_PTR(ret);
 
+	/*
+	 * Check if the hardware context is actually mapped to anything.
+	 * If not tell the caller that it should skip this queue.
+	 */
 	hctx = q->queue_hw_ctx[hctx_idx];
+	if (!blk_mq_hw_queue_mapped(hctx)) {
+		ret = -EXDEV;
+		goto out_queue_exit;
+	}
 	ctx = __blk_mq_get_ctx(q, cpumask_first(hctx->cpumask));
 
 	blk_mq_set_alloc_data(&alloc_data, q, flags, ctx, hctx);
 	rq = __blk_mq_alloc_request(&alloc_data, rw, 0);
 	if (!rq) {
-		blk_queue_exit(q);
-		return ERR_PTR(-EWOULDBLOCK);
+		ret = -EWOULDBLOCK;
+		goto out_queue_exit;
 	}
 
 	return rq;
+
+out_queue_exit:
+	blk_queue_exit(q);
+	return ERR_PTR(ret);
 }
 EXPORT_SYMBOL_GPL(blk_mq_alloc_request_hctx);
 
-- 
2.1.4


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-22 21:02                 ` 'Christoph Hellwig'
@ 2016-09-22 21:38                   ` Steve Wise
  2016-09-22 21:48                     ` 'Christoph Hellwig'
  0 siblings, 1 reply; 22+ messages in thread
From: Steve Wise @ 2016-09-22 21:38 UTC (permalink / raw)


> Steve,
> 
> can you test if the patch below properly fails the connect and avoids
> the crash?
>

Is this the expected error?

[root@stevo1 ~]# nvme connect --nr-io-queues=26 --transport=rdma --trsvcid=4420
--traddr=10.0.1.14 --nqn=test-ram0
Failed to write to /dev/nvme-fabrics: Invalid cross-device link
[root@stevo1 ~]#

 Steve.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-22 21:38                   ` Steve Wise
@ 2016-09-22 21:48                     ` 'Christoph Hellwig'
  2016-09-22 22:03                       ` Steve Wise
       [not found]                       ` <024201d2151d$28013b90$7803b2b0$@opengridcomputing.com>
  0 siblings, 2 replies; 22+ messages in thread
From: 'Christoph Hellwig' @ 2016-09-22 21:48 UTC (permalink / raw)


On Thu, Sep 22, 2016 at 04:38:48PM -0500, Steve Wise wrote:
> > Steve,
> > 
> > can you test if the patch below properly fails the connect and avoids
> > the crash?
> >
> 
> Is this the expected error?

Yes.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-22 21:48                     ` 'Christoph Hellwig'
@ 2016-09-22 22:03                       ` Steve Wise
       [not found]                       ` <024201d2151d$28013b90$7803b2b0$@opengridcomputing.com>
  1 sibling, 0 replies; 22+ messages in thread
From: Steve Wise @ 2016-09-22 22:03 UTC (permalink / raw)


> 
> On Thu, Sep 22, 2016 at 04:38:48PM -0500, Steve Wise wrote:
> > > Steve,
> > >
> > > can you test if the patch below properly fails the connect and avoids
> > > the crash?
> > >
> >
> > Is this the expected error?
> 
> Yes.
> 

Ok then.  Tested-by: Steve Wise <swise@opengridcomputing.com>

I haven't tried ignoring this error when connecting yet...

Stevo


* crash when connecting to targets using nr_io_queues < num cpus
       [not found]                       ` <024201d2151d$28013b90$7803b2b0$@opengridcomputing.com>
@ 2016-09-23  0:01                         ` Steve Wise
  2016-09-23  3:31                           ` 'Christoph Hellwig'
  0 siblings, 1 reply; 22+ messages in thread
From: Steve Wise @ 2016-09-23  0:01 UTC (permalink / raw)


> > On Thu, Sep 22, 2016 at 04:38:48PM -0500, Steve Wise wrote:
> > > > Steve,
> > > >
> > > > can you test if the patch below properly fails the connect and avoids
> > > > the crash?
> > > >
> > >
> > > Is this the expected error?
> >
> > Yes.
> >
> 
> Ok then.  Tested-by: Steve Wise <swise@opengridcomputing.com>
> 
> I haven't tried ignoring this error when connecting yet...
> 
> Stevo

This patch seems to work:

@@ -639,6 +639,8 @@ static int nvme_rdma_connect_io_queues(struct nvme_rdma_ctrl *ctrl)

        for (i = 1; i < ctrl->queue_count; i++) {
                ret = nvmf_connect_io_queue(&ctrl->ctrl, i);
+               if (ret == -EXDEV)
+                       ret = 0;
                if (ret)
                        break;
        }

The fabrics module displays these errors.  But the 28 rdma connections still
get set up.  I'm not sure this is what we want, but it does avoid failing the
connect altogether...


[ 9438.483765] nvme nvme1: creating 28 I/O queues.
[ 9438.619877] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.632542] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.644857] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.662090] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.667138] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.671875] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.681345] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.690364] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.697611] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.712055] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.719229] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.726399] nvme nvme1: Connect command failed, error wo/DNR bit: -16402
[ 9438.726406] nvme nvme1: new ctrl: NQN "test-ram0", addr 10.0.1.14:4420


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-23  0:01                         ` Steve Wise
@ 2016-09-23  3:31                           ` 'Christoph Hellwig'
  2016-09-23 13:58                             ` Steve Wise
  2016-09-23 16:21                             ` Jens Axboe
  0 siblings, 2 replies; 22+ messages in thread
From: 'Christoph Hellwig' @ 2016-09-23  3:31 UTC (permalink / raw)


On Thu, Sep 22, 2016 at 07:01:05PM -0500, Steve Wise wrote:
> The fabrics module displays these errors.  But the 28 rdma connections still
> get set up.  I'm not sure this is what we want, but it does avoid failing the
> connect altogether...

No, it's not really what we want.  I think we simply need to move
forward with the queue state machine, and use that to check if we have
a blk-mq queue allocated for the queue.  If not, we can simply skip it
later on.

So I'd say I'll send the crash fix in blk-mq to Jens ASAP, and you'll
look into the queue state machine for 4.9?
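
To make the "skip it later on" idea concrete, here is a rough sketch only --
not a patch from this thread; NVME_RDMA_Q_MAPPED and the per-queue flags field
are hypothetical placeholders for whatever the queue state machine ends up
providing:

/*
 * Rough sketch, not the eventual implementation.  NVME_RDMA_Q_MAPPED and
 * ctrl->queues[i].flags are hypothetical placeholders for a per-queue
 * state bit meaning "this queue got a usable (mapped) blk-mq hctx".
 */
static int nvme_rdma_connect_io_queues(struct nvme_rdma_ctrl *ctrl)
{
	int i, ret = 0;

	for (i = 1; i < ctrl->queue_count; i++) {
		/*
		 * Skip queues the block layer never mapped instead of
		 * issuing a Connect that can only fail with -EXDEV.
		 */
		if (!test_bit(NVME_RDMA_Q_MAPPED, &ctrl->queues[i].flags))
			continue;

		ret = nvmf_connect_io_queue(&ctrl->ctrl, i);
		if (ret)
			break;
	}

	return ret;
}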


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-23  3:31                           ` 'Christoph Hellwig'
@ 2016-09-23 13:58                             ` Steve Wise
  2016-09-23 16:21                             ` Jens Axboe
  1 sibling, 0 replies; 22+ messages in thread
From: Steve Wise @ 2016-09-23 13:58 UTC (permalink / raw)



> 
> On Thu, Sep 22, 2016 at 07:01:05PM -0500, Steve Wise wrote:
> > The fabrics module displays these errors.  But the 28 rdma connections still
> > get set up.  I'm not sure this is what we want, but it does avoid failing the
> > connect altogether...
> 
> No, it's not really what we want.  I think we simply need to move
> forward with the queue state machine, and use that to check if we have
> a blk-mq queue allocated for the queue.  If not, we can simply skip it
> later on.
> 
> So I'd say I'll send the crash fix in blk-mq to Jens ASAP, and you'll
> look into the queue state machine for 4.9?

Sure.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-23  3:31                           ` 'Christoph Hellwig'
  2016-09-23 13:58                             ` Steve Wise
@ 2016-09-23 16:21                             ` Jens Axboe
  2016-09-23 16:23                               ` 'Christoph Hellwig'
  1 sibling, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2016-09-23 16:21 UTC (permalink / raw)


On 09/22/2016 09:31 PM, 'Christoph Hellwig' wrote:
> On Thu, Sep 22, 2016 at 07:01:05PM -0500, Steve Wise wrote:
>> The fabrics module displays these errors.  But the 28 rdma connections still
>> get set up.  I'm not sure this is what we want, but it does avoid failing the
>> connect altogether...
>
> No, it's not really what we want.  I think we simply need to move
> forward with the queue state machine, and use that to check if we have
> a blk-mq queue allocated for the queue.  If not, we can simply skip it
> later on.
>
> So I'd say I'll send the crash fix in blk-mq to Jens ASAP, and you'll
> look into the queue state machine for 4.9?

I'm going to flush the remaining patches for 4.8 later today. Did you
send it out yet? If so, I haven't seen it. I'll be traveling the next 2
weeks. With internet, so I'll get some work done, but it'll be a bit
more intermittent.

--
Jens Axboe


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-23 16:21                             ` Jens Axboe
@ 2016-09-23 16:23                               ` 'Christoph Hellwig'
  2016-09-23 16:24                                 ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: 'Christoph Hellwig' @ 2016-09-23 16:23 UTC (permalink / raw)


On Fri, Sep 23, 2016 at 10:21:50AM -0600, Jens Axboe wrote:
> I'm going to flush the remaining patches for 4.8 later today. Did you
> send it out yet? If so, I haven't seen it. I'll be traveling the next 2
> weeks. With internet, so I'll get some work done, but it'll be a bit
> more intermittent.

It's about two patches earlier in this thread, but I can just resend it.


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-23 16:23                               ` 'Christoph Hellwig'
@ 2016-09-23 16:24                                 ` Jens Axboe
  2016-09-23 16:26                                   ` Jens Axboe
  0 siblings, 1 reply; 22+ messages in thread
From: Jens Axboe @ 2016-09-23 16:24 UTC (permalink / raw)


On 09/23/2016 10:23 AM, 'Christoph Hellwig' wrote:
> On Fri, Sep 23, 2016 at 10:21:50AM -0600, Jens Axboe wrote:
>> I'm going to flush the remaining patches for 4.8 later today. Did you
>> send it out yet? If so, I haven't seen it. I'll be traveling the next 2
>> weeks. With internet, so I'll get some work done, but it'll be a bit
>> more intermittent.
>
> It's about two patches earlier in this thread, but I can just resend it.

OK, I was just looking for an email after that reply, thinking you were
sending a new one.

-- 
Jens Axboe


* crash when connecting to targets using nr_io_queues < num cpus
  2016-09-23 16:24                                 ` Jens Axboe
@ 2016-09-23 16:26                                   ` Jens Axboe
  0 siblings, 0 replies; 22+ messages in thread
From: Jens Axboe @ 2016-09-23 16:26 UTC (permalink / raw)


On 09/23/2016 10:24 AM, Jens Axboe wrote:
> On 09/23/2016 10:23 AM, 'Christoph Hellwig' wrote:
>> On Fri, Sep 23, 2016 at 10:21:50AM -0600, Jens Axboe wrote:
>>> I'm going to flush the remaining patches for 4.8 later today. Did you
>>> send it out yet? If so, I haven't seen it. I'll be traveling the next 2
>>> weeks. With internet, so I'll get some work done, but it'll be a bit
>>> more intermittent.
>>
>> It's about two patches earlier in this thread, but I can just resend it.
>
> OK, I was just looking for an email after that reply, thinking you were
> sending a new one.

http://git.kernel.dk/cgit/linux-block/commit/?h=for-linus&id=c8712c6a674e3382fe4d26d108251ccfa55d08e0


-- 
Jens Axboe

