* nvmf/rdma host crash during heavy load and keep alive recovery
@ 2016-07-29 21:40 Steve Wise
  2016-08-01 11:06 ` Christoph Hellwig
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-07-29 21:40 UTC (permalink / raw)


Running many fio jobs on 10 NVMF/RDMA ram disks, and bringing down and back up
the interfaces in a loop uncovers this crash.  I'm not sure if this has been
reported/fixed?  I'm using the for-linus branch of linux-block + sagi's 5
patches on the host.  

What this test tickles is keep-alive recovery in the presence of heavy
raw/direct IO.  Before the crash there are lots of these logged, which is
probably expected:

[  295.497642] blk_update_request: I/O error, dev nvme6n1, sector 21004
[  295.497643] nvme nvme6: nvme_rdma_post_send failed with error code -22
[  295.497644] blk_update_request: I/O error, dev nvme6n1, sector 10852
[  295.497646] nvme nvme6: nvme_rdma_post_send failed with error code -22
[  295.497647] blk_update_request: I/O error, dev nvme6n1, sector 32004
[  295.497653] nvme nvme6: nvme_rdma_post_send failed with error code -22
[  295.497655] nvme nvme6: nvme_rdma_post_send failed with error code -22
[  295.497658] nvme nvme6: nvme_rdma_post_send failed with error code -22
[  295.497660] nvme nvme6: nvme_rdma_post_send failed with error code -22

and these right before the crash:

[  295.591290] nvme nvme6: nvme_rdma_post_send failed with error code -22
[  295.591291] nvme nvme6: Queueing INV WR for rkey 0x2a26eee failed (-22)
[  295.591316] nvme nvme6: nvme_rdma_post_send failed with error code -22
[  295.591317] nvme nvme6: Queueing INV WR for rkey 0x2a26eef failed (-22)
[  295.591342] nvme nvme6: nvme_rdma_post_send failed with error code -22


The crash is a GPF because the qp has been freed; due to the debug kernel's
slab poisoning, the freed qp struct memory is filled with 0x6b6b6b6b...

Any ideas?


Here's the log:

[  295.642191] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC
[  295.642228] Modules linked in: nvme_rdma nvme_fabrics brd iw_cxgb4
ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 cxgb4 xt_CHECKSUM
iptable_mangle iptable_filter ip_tables bridge 8021q mrp garp stp llc cachefiles
fscache rdma_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad ocrdma be2net
iw_nes libcrc32c iw_cxgb3 cxgb3 mdio ib_qib rdmavt mlx5_ib mlx5_core mlx4_ib
mlx4_en mlx4_core ib_mthca ib_core binfmt_misc dm_mirror dm_region_hash dm_log
vhost_net macvtap macvlan vhost tun kvm irqbypass uinput iTCO_wdt
iTCO_vendor_support mxm_wmi pcspkr dm_mod i2c_i801 sg lpc_ich mfd_core mei_me
mei nvme nvme_core igb dca ptp pps_core ipmi_si ipmi_msghandler wmi
[  295.642236]  ext4(E) mbcache(E) jbd2(E) sd_mod(E) ahci(E) libahci(E)
libata(E) mgag200(E) ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E)
sysfillrect(E) syscopyarea(E) i2c_algo_bit(E) i2c_core(E) [last unloaded: cxgb4]
[  295.642239] CPU: 8 PID: 18390 Comm: blkid Tainted: G            E
4.7.0-block-for-linus-debug+ #11
[  295.642240] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
[  295.642242] task: ffff880fb3418040 ti: ffff880fc06cc000 task.ti:
ffff880fc06cc000
[  295.642248] RIP: 0010:[<ffffffffa07fc083>]  [<ffffffffa07fc083>]
nvme_rdma_post_send+0x83/0xd0 [nvme_rdma]
[  295.642249] RSP: 0018:ffff880fc06cf7c8  EFLAGS: 00010286
[  295.642250] RAX: 6b6b6b6b6b6b6b6b RBX: ffff880fc37da698 RCX: 0000000000000001
[  295.642251] RDX: ffff880fc06cf7f0 RSI: ffff880fbf715d00 RDI: ffff880fc0afdfa8
[  295.642252] RBP: ffff880fc06cf808 R08: ffff880fbf715d00 R09: 0000000000000000
[  295.642253] R10: 0000000fb9f9b000 R11: ffff880fbf715d58 R12: ffff880fc37da698
[  295.642254] R13: ffff880fbf715cb0 R14: ffff880fee6522a8 R15: ffff880fbf6d1138
[  295.642255] FS:  00007f9abfe3e740(0000) GS:ffff881037000000(0000)
knlGS:0000000000000000
[  295.642256] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  295.642257] CR2: 0000003f691422e9 CR3: 0000000fcaab1000 CR4: 00000000000406e0
[  295.642258] Stack:
[  295.642260]  0000000000000000 ffff880fbf715cb8 ffff880fbf715cd0
0000000200000001
[  295.642262]  ffff880f00000000 ffff880fbf715cb0 ffff880fbf715b40
ffff880fc37da698
[  295.642264]  ffff880fc06cf858 ffffffffa07fdc92 ffff880fbf715b40
ffff880fc0eac748
[  295.642264] Call Trace:
[  295.642268]  [<ffffffffa07fdc92>] nvme_rdma_queue_rq+0x172/0x280 [nvme_rdma]
[  295.642273]  [<ffffffff81339e97>] blk_mq_make_request+0x2d7/0x560
[  295.642277]  [<ffffffff8119a5c5>] ? mempool_alloc_slab+0x15/0x20
[  295.642280]  [<ffffffff8132c6bc>] generic_make_request+0xfc/0x1d0
[  295.642282]  [<ffffffff8132c7f0>] submit_bio+0x60/0x130
[  295.642286]  [<ffffffff81264541>] submit_bh_wbc+0x161/0x1c0
[  295.642288]  [<ffffffff812645b3>] submit_bh+0x13/0x20
[  295.642291]  [<ffffffff81264eae>] block_read_full_page+0x20e/0x3e0
[  295.642293]  [<ffffffff81268c83>] ? __blkdev_get+0x193/0x420
[  295.642295]  [<ffffffff81266f60>] ? I_BDEV+0x20/0x20
[  295.642297]  [<ffffffff81199228>] ? pagecache_get_page+0x38/0x240
[  295.642300]  [<ffffffff811deae6>] ? page_add_file_rmap+0x56/0x160
[  295.642304]  [<ffffffff81111b3e>] ? is_module_text_address+0xe/0x20
[  295.642307]  [<ffffffff81267908>] blkdev_readpage+0x18/0x20
[  295.642309]  [<ffffffff81199b9a>] do_generic_file_read+0x20a/0x710
[  295.642314]  [<ffffffff8115566b>] ? rb_reserve_next_event+0xdb/0x230
[  295.642317]  [<ffffffff81040fbf>] ? save_stack_trace+0x2f/0x50
[  295.642319]  [<ffffffff81154baa>] ? rb_commit+0x10a/0x1a0
[  295.642321]  [<ffffffff8119a15c>] generic_file_read_iter+0xbc/0x110
[  295.642323]  [<ffffffff81154c64>] ? ring_buffer_unlock_commit+0x24/0xb0
[  295.642326]  [<ffffffff81267277>] blkdev_read_iter+0x37/0x40
[  295.642329]  [<ffffffff8122b37c>] __vfs_read+0xfc/0x120
[  295.642331]  [<ffffffff8122b44e>] vfs_read+0xae/0xf0
[  295.642335]  [<ffffffff812492a3>] ? __fdget+0x13/0x20
[  295.642337]  [<ffffffff8122bf56>] SyS_read+0x56/0xc0
[  295.642341]  [<ffffffff81003f5d>] do_syscall_64+0x7d/0x230
[  295.642345]  [<ffffffff8106f397>] ? do_page_fault+0x37/0x90
[  295.642350]  [<ffffffff816dbda1>] entry_SYSCALL64_slow_path+0x25/0x25
[  295.642367] Code: 47 08 83 c0 01 a8 1f 88 47 08 74 5b 45 84 c9 75 56 4d 85 c0
74 5a 48 8d 45 c0 49 89 00 48 8b 7b 30 48 8d 55 e8 4c 89 c6 48 8b 07 <ff> 90 c8
01 00 00 85 c0 41 89 c4 74 23 48 8b 43 18 44 89 e1 48
[  295.642371] RIP  [<ffffffffa07fc083>] nvme_rdma_post_send+0x83/0xd0
[nvme_rdma]
[  295.642371]  RSP <ffff880fc06cf7c8>
crash> mod -s nvme_rdma
     MODULE       NAME                      SIZE  OBJECT FILE
ffffffffa0801240  nvme_rdma                28672
/lib/modules/4.7.0-block-for-linus-debug+/kernel/drivers/nvme/host/nvme-rdma.ko
crash> gdb list *nvme_rdma_post_send+0x83
0xffffffffa07fc083 is in nvme_rdma_post_send (include/rdma/ib_verbs.h:2619).
2614     */
2615    static inline int ib_post_send(struct ib_qp *qp,
2616                                   struct ib_send_wr *send_wr,
2617                                   struct ib_send_wr **bad_send_wr)
2618    {
2619            return qp->device->post_send(qp, send_wr, bad_send_wr);
2620    }
2621
2622    /**
2623     * ib_post_recv - Posts a list of work requests to the receive queue of
crash>
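
Hand-decoding the Code: bytes against the register dump above (a sketch; the
0x1c8 offset is just wherever post_send happens to sit in struct ib_device for
this build):

    48 8b 07                mov    (%rdi),%rax        rax = qp->device
    <ff> 90 c8 01 00 00     callq  *0x1c8(%rax)       qp->device->post_send(...)

RDI is the qp (0xffff880fc0afdfa8) and RAX comes back as 0x6b6b6b6b6b6b6b6b,
the slab poison pattern, so the indirect call through the freed qp's device
pointer takes the GPF.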

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-07-29 21:40 nvmf/rdma host crash during heavy load and keep alive recovery Steve Wise
@ 2016-08-01 11:06 ` Christoph Hellwig
  2016-08-01 14:26   ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Christoph Hellwig @ 2016-08-01 11:06 UTC (permalink / raw)


On Fri, Jul 29, 2016 at 04:40:40PM -0500, Steve Wise wrote:
> Running many fio jobs on 10 NVMF/RDMA ram disks, and bringing down and back up
> the interfaces in a loop uncovers this crash.  I'm not sure if this has been
> reported/fixed?  I'm using the for-linus branch of linux-block + sagi's 5
> patches on the host.  
> 
> What this test tickles is keep-alive recovery in the presence of heavy
> raw/direct IO.  Before the crash there are lots of these logged, which is
> probably expected:

With what fixes does this happen?  This looks pretty similar to an
issue you reported before.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-01 11:06 ` Christoph Hellwig
@ 2016-08-01 14:26   ` Steve Wise
  2016-08-01 21:38     ` Steve Wise
       [not found]     ` <015801d1ec3d$0ca07ea0$25e17be0$@opengridcomputing.com>
  0 siblings, 2 replies; 53+ messages in thread
From: Steve Wise @ 2016-08-01 14:26 UTC (permalink / raw)


> On Fri, Jul 29, 2016 at 04:40:40PM -0500, Steve Wise wrote:
> > Running many fio jobs on 10 NVMF/RDMA ram disks, and bringing down and back
> up
> > the interfaces in a loop uncovers this crash.  I'm not sure if this has been
> > reported/fixed?  I'm using the for-linus branch of linux-block + sagi's 5
> > patches on the host.
> >
> > What this test tickles is keep-alive recovery in the presence of heavy
> > raw/direct IO.  Before the crash there are lots of these logged, which is
> > probably expected:
> 
> With what fixes does this happen?  This looks pretty similar to an
> issue you reported before.

As I said, I'm using the for-linus branch of the linux-block repo
(git://git.kernel.dk/linux-block) + sagi's 5 recent patches.   So I should be
using the latest and greatest, I think.  This problem was originally seen on
nvmf-all.3 as well.  Perhaps I have reported this previously.  But now I'm
trying to fix it :)

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-01 14:26   ` Steve Wise
@ 2016-08-01 21:38     ` Steve Wise
       [not found]     ` <015801d1ec3d$0ca07ea0$25e17be0$@opengridcomputing.com>
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-08-01 21:38 UTC (permalink / raw)


> > On Fri, Jul 29, 2016 at 04:40:40PM -0500, Steve Wise wrote:
> > > Running many fio jobs on 10 NVMF/RDMA ram disks, and bringing down and
> back
> > up
> > > the interfaces in a loop uncovers this crash.  I'm not sure if this has
been
> > > reported/fixed?  I'm using the for-linus branch of linux-block + sagi's 5
> > > patches on the host.
> > >
> > > What this test tickles is keep-alive recovery in the presence of heavy
> > > raw/direct IO.  Before the crash there are lots of these logged, which is
> > > probably expected:
> >
> > With what fixes does this happen?  This looks pretty similar to an
> > issue you reported before.
> 
> As I said, I'm using the for-linus branch of the linux-block repo
> (git://git.kernel.dk/linux-block) + sagi's 5 recent patches.   So I should be
> using the latest and greatest, I think.  This problem was originally seen on
> nvmf-all.3 as well.  Perhaps I have reported this previously.  But now I'm
> trying to fix it :)
> 

I do have two different problem reports internally at Chelsio that both show the
same signature.  I found the other one :)  For the 2nd problem report, there was
no ifup/down to induce keep alive recovery.  It just loads up 10 ram disks on a
64 core host/target pair in a similar manner, and after a while lots of
nvme_rdma_post_send() errors are logged (probably due to a connection death) and
then the crash.   I'm still gathering info on that one, but it appears the qp
again was freed somehow and then attempts to post to it cause the crash...

Steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]     ` <015801d1ec3d$0ca07ea0$25e17be0$@opengridcomputing.com>
@ 2016-08-10 15:46       ` Steve Wise
       [not found]       ` <010f01d1f31e$50c8cb40$f25a61c0$@opengridcomputing.com>
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-08-10 15:46 UTC (permalink / raw)


Hey guys, I've rebased the nvmf-4.8-rc branch on top of 4.8-rc2 so I have the
latest/greatest, and continued debugging this crash.  I see:

0) 10 ram disks attached via nvmf/iw_cxgb4, and fio started on all 10 disks.
This node has 8 cores, so that is 80 connections.
1) the cxgb4 interface brought down a few seconds later
2) kato fires on all connections
3) the interface is brought back up 8 seconds after #1
4) 10 seconds after #2 all the qps are destroyed
5) reconnects start happening
6) a blk request is executed and the nvme_rdma_request struct still has a
pointer to one of the qps destroyed in 3 and whamo...

I'm digging into the request cancel logic.  Any ideas/help is greatly
appreciated...

Thanks,

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]       ` <010f01d1f31e$50c8cb40$f25a61c0$@opengridcomputing.com>
@ 2016-08-10 16:00         ` Steve Wise
       [not found]         ` <013701d1f320$57b185d0$07149170$@opengridcomputing.com>
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-08-10 16:00 UTC (permalink / raw)


> Hey guys, I've rebased the nvmf-4.8-rc branch on top of 4.8-rc2 so I have
> the latest/greatest, and continued debugging this crash.  I see:
> 
> 0) 10 ram disks attached via nvmf/iw_cxgb4, and fio started on all 10
> disks.  This node has 8 cores, so that is 80 connections.
> 1) the cxgb4 interface brought down a few seconds later
> 2) kato fires on all connections
> 3) the interface is brought back up 8 seconds after #1
> 4) 10 seconds after #2 all the qps are destroyed
> 5) reconnects start happening
> 6) a blk request is executed and the nvme_rdma_request struct still has a
> pointer to one of the qps destroyed in 3 and whamo...
> 
> I'm digging into the request cancel logic.  Any ideas/help is greatly
> appreciated...
> 
> Thanks,
> 
> Steve.

Here is the stack that crashed processing a blk request:

crash> bt
PID: 402    TASK: ffff880397968040  CPU: 0   COMMAND: "kworker/0:1H"
 #0 [ffff8803970f7800] machine_kexec at ffffffff8105fc40
 #1 [ffff8803970f7870] __crash_kexec at ffffffff81116908
 #2 [ffff8803970f7940] crash_kexec at ffffffff811169dd
 #3 [ffff8803970f7970] oops_end at ffffffff81032be6
 #4 [ffff8803970f79a0] die at ffffffff810330db
 #5 [ffff8803970f79d0] do_general_protection at ffffffff81030144
 #6 [ffff8803970f7a00] general_protection at ffffffff816e4ca8
    [exception RIP: nvme_rdma_post_send+131]
    RIP: ffffffffa0414083  RSP: ffff8803970f7ab8  RFLAGS: 00010246
    RAX: 6b6b6b6b6b6b6b6b  RBX: ffff8802dd923598  RCX: 0000000000000002
    RDX: ffff8803970f7ae0  RSI: ffff8803970f7ab8  RDI: ffff8802dd9fc518
    RBP: ffff8803970f7af8   R8: ffff8803970f7ab8   R9: 0000000000000000
    R10: 0000000000000000  R11: ffff8802dde6ef58  R12: ffff8802dd923598
    R13: ffff8802dde6eeb0  R14: ffff880399f4c548  R15: ffff8802dde59db8
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #7 [ffff8803970f7b00] nvme_rdma_queue_rq at ffffffffa0415c72 [nvme_rdma]
 #8 [ffff8803970f7b50] __blk_mq_run_hw_queue at ffffffff81338324
 #9 [ffff8803970f7ca0] blk_mq_run_work_fn at ffffffff81338552
#10 [ffff8803970f7cb0] process_one_work at ffffffff810a1593
#11 [ffff8803970f7d90] worker_thread at ffffffff810a222d
#12 [ffff8803970f7ec0] kthread at ffffffff810a6d6c
#13 [ffff8803970f7f50] ret_from_fork at ffffffff816e2cbf

Here is the nvme_rdma_request:

crash> nvme_rdma_request ffff8802dde6eeb0
struct nvme_rdma_request {
  mr = 0xffff8802dde5c008,
  sqe = {
    cqe = {
      done = 0xffffffffa0414320 <nvme_rdma_send_done>
    },
    data = 0xffff8802dde59db8,
    dma = 12312747448
  },
  sge = {{
      addr = 12312747448,
      length = 64,
      lkey = 0
    }, {
      addr = 12138727424,
      length = 2048,
      lkey = 0
    }},
  num_sge = 2,
  nents = 1,
  inline_data = true,
  need_inval = false,
  reg_wr = {
    wr = {
      next = 0x0,
      {
        wr_id = 0,
        wr_cqe = 0x0
      },
      sg_list = 0x0,
      num_sge = 0,
      opcode = IB_WR_RDMA_WRITE,
      send_flags = 0,
      ex = {
        imm_data = 0,
        invalidate_rkey = 0
      }
    },
    mr = 0x0,
    key = 0,
    access = 0
  },
  reg_cqe = {
    done = 0x0
  },
  queue = 0xffff8802dd923598,
  sg_table = {
    sgl = 0xffff8802dde6ef58,
    nents = 1,
    orig_nents = 1
  },
  first_sgl = 0xffff8802dde6ef58
}

And here is the nvme_rdma_queue:

crash> nvme_rdma_queue 0xffff8802dd923598
struct nvme_rdma_queue {
  rsp_ring = 0xffff8802dd968008,
  sig_count = 200 '\310',
  queue_size = 128,
  cmnd_capsule_len = 4160,
  ctrl = 0xffff8802dbd5d3d8,
  device = 0xffff880384ceb5e8,
  ib_cq = 0xffff8802dd9d2e68,
  qp = 0xffff8802dd9fc518,
  flags = 0,
  cm_id = 0xffff8802dd9f8008,
  cm_error = 0,
  cm_done = {
    done = 0,
    wait = {
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      },
      task_list = {
        next = 0xffff8802dd9235f8,
        prev = 0xffff8802dd9235f8
      }
    }
  }
}

And see here the ib_qp has been freed:

crash> gdb x/8g 0xffff8802dd9fc518
0xffff8802dd9fc518:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff8802dd9fc528:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff8802dd9fc538:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff8802dd9fc548:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]         ` <013701d1f320$57b185d0$07149170$@opengridcomputing.com>
@ 2016-08-10 17:20           ` Steve Wise
  2016-08-10 18:59             ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-08-10 17:20 UTC (permalink / raw)


> Here is the stack that crashed processing a blk request:
> 
> crash> bt
> PID: 402    TASK: ffff880397968040  CPU: 0   COMMAND: "kworker/0:1H"
>  #0 [ffff8803970f7800] machine_kexec at ffffffff8105fc40
>  #1 [ffff8803970f7870] __crash_kexec at ffffffff81116908
>  #2 [ffff8803970f7940] crash_kexec at ffffffff811169dd
>  #3 [ffff8803970f7970] oops_end at ffffffff81032be6
>  #4 [ffff8803970f79a0] die at ffffffff810330db
>  #5 [ffff8803970f79d0] do_general_protection at ffffffff81030144
>  #6 [ffff8803970f7a00] general_protection at ffffffff816e4ca8
>     [exception RIP: nvme_rdma_post_send+131]
>     RIP: ffffffffa0414083  RSP: ffff8803970f7ab8  RFLAGS: 00010246
>     RAX: 6b6b6b6b6b6b6b6b  RBX: ffff8802dd923598  RCX: 0000000000000002
>     RDX: ffff8803970f7ae0  RSI: ffff8803970f7ab8  RDI: ffff8802dd9fc518
>     RBP: ffff8803970f7af8   R8: ffff8803970f7ab8   R9: 0000000000000000
>     R10: 0000000000000000  R11: ffff8802dde6ef58  R12: ffff8802dd923598
>     R13: ffff8802dde6eeb0  R14: ffff880399f4c548  R15: ffff8802dde59db8
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
>  #7 [ffff8803970f7b00] nvme_rdma_queue_rq at ffffffffa0415c72 [nvme_rdma]
>  #8 [ffff8803970f7b50] __blk_mq_run_hw_queue at ffffffff81338324
>  #9 [ffff8803970f7ca0] blk_mq_run_work_fn at ffffffff81338552
> #10 [ffff8803970f7cb0] process_one_work at ffffffff810a1593
> #11 [ffff8803970f7d90] worker_thread at ffffffff810a222d
> #12 [ffff8803970f7ec0] kthread at ffffffff810a6d6c
> #13 [ffff8803970f7f50] ret_from_fork at ffffffff816e2cbf
> 
> Here is the nvme_rdma_request:
> 
> crash> nvme_rdma_request ffff8802dde6eeb0
> struct nvme_rdma_request {
>   mr = 0xffff8802dde5c008,
>   sqe = {
>     cqe = {
>       done = 0xffffffffa0414320 <nvme_rdma_send_done>
>     },
>     data = 0xffff8802dde59db8,
>     dma = 12312747448
>   },
>   sge = {{
>       addr = 12312747448,
>       length = 64,
>       lkey = 0
>     }, {
>       addr = 12138727424,
>       length = 2048,
>       lkey = 0
>     }},
>   num_sge = 2,
>   nents = 1,
>   inline_data = true,
>   need_inval = false,
>   reg_wr = {
>     wr = {
>       next = 0x0,
>       {
>         wr_id = 0,
>         wr_cqe = 0x0
>       },
>       sg_list = 0x0,
>       num_sge = 0,
>       opcode = IB_WR_RDMA_WRITE,
>       send_flags = 0,
>       ex = {
>         imm_data = 0,
>         invalidate_rkey = 0
>       }
>     },
>     mr = 0x0,
>     key = 0,
>     access = 0
>   },
>   reg_cqe = {
>     done = 0x0
>   },
>   queue = 0xffff8802dd923598,
>   sg_table = {
>     sgl = 0xffff8802dde6ef58,
>     nents = 1,
>     orig_nents = 1
>   },
>   first_sgl = 0xffff8802dde6ef58
> }
> 
> And here is the nvme_rdma_queue:
> 
> crash> nvme_rdma_queue 0xffff8802dd923598
> struct nvme_rdma_queue {
>   rsp_ring = 0xffff8802dd968008,
>   sig_count = 200 '\310',
>   queue_size = 128,
>   cmnd_capsule_len = 4160,
>   ctrl = 0xffff8802dbd5d3d8,
>   device = 0xffff880384ceb5e8,
>   ib_cq = 0xffff8802dd9d2e68,
>   qp = 0xffff8802dd9fc518,
>   flags = 0,
>   cm_id = 0xffff8802dd9f8008,
>   cm_error = 0,
>   cm_done = {
>     done = 0,
>     wait = {
>       lock = {
>         {
>           rlock = {
>             raw_lock = {
>               val = {
>                 counter = 0
>               }
>             }
>           }
>         }
>       },
>       task_list = {
>         next = 0xffff8802dd9235f8,
>         prev = 0xffff8802dd9235f8
>       }
>     }
>   }
> }
> 
> And see here the ib_qp has been freed:
> 
> crash> gdb x/8g 0xffff8802dd9fc518
> 0xffff8802dd9fc518:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
> 0xffff8802dd9fc528:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
> 0xffff8802dd9fc538:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
> 0xffff8802dd9fc548:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b

The nvme_rdma_ctrl queue associated with the request is in RECONNECTING state:

  ctrl = {
    state = NVME_CTRL_RECONNECTING,
    lock = {

So it should not be posting SQ WRs...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-10 17:20           ` Steve Wise
@ 2016-08-10 18:59             ` Steve Wise
  2016-08-11  6:27               ` Sagi Grimberg
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-08-10 18:59 UTC (permalink / raw)



> The nvme_rdma_ctrl queue associated with the request is in RECONNECTING state:
> 
>   ctrl = {
>     state = NVME_CTRL_RECONNECTING,
>     lock = {
> 
> So it should not be posting SQ WRs...

kato kicks error recovery, nvme_rdma_error_recovery_work(), which calls
nvme_cancel_request() on each request.  nvme_cancel_request() sets req->errors
to NVME_SC_ABORT_REQ.  It then completes the request which ends up at
nvme_rdma_complete_rq() which queues it for retry:
...
        if (unlikely(rq->errors)) {
                if (nvme_req_needs_retry(rq, rq->errors)) {
                        nvme_requeue_req(rq);
                        return;
                }

                if (rq->cmd_type == REQ_TYPE_DRV_PRIV)
                        error = rq->errors;
                else
                        error = nvme_error_status(rq->errors);
        }
...

The retry will end up at nvme_rdma_queue_rq() which will try and post a send wr
to a freed qp...

Should the canceled requests actually OR in bit NVME_SC_DNR?  That is only done
in nvme_cancel_request() if the blk queue is dying:

...
        status = NVME_SC_ABORT_REQ;
        if (blk_queue_dying(req->q))
                status |= NVME_SC_DNR;
...
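
For reference, the whole cancel path is small.  Roughly this, going from memory
of the current nvme core, so double-check against the tree:

void nvme_cancel_request(struct request *req, void *data, bool reserved)
{
        int status;

        if (!blk_mq_request_started(req))
                return;

        dev_dbg_ratelimited(((struct nvme_ctrl *) data)->device,
                                "Cancelling I/O %d", req->tag);

        status = NVME_SC_ABORT_REQ;
        if (blk_queue_dying(req->q))
                status |= NVME_SC_DNR;
        blk_mq_complete_request(req, status);
}

So for a live (non-dying) queue DNR is never set, which is why these canceled
requests come back around through nvme_rdma_complete_rq() for a retry.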


Sagi, please put on your KATO hat and help! :)

Thanks,

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-10 18:59             ` Steve Wise
@ 2016-08-11  6:27               ` Sagi Grimberg
  2016-08-11 13:58                 ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Sagi Grimberg @ 2016-08-11  6:27 UTC (permalink / raw)




On 10/08/16 21:59, Steve Wise wrote:
>
>> The nvme_rdma_ctrl queue associated with the request is in RECONNECTING state:
>>
>>   ctrl = {
>>     state = NVME_CTRL_RECONNECTING,
>>     lock = {
>>
>> So it should not be posting SQ WRs...
>
> kato kicks error recovery, nvme_rdma_error_recovery_work(), which calls
> nvme_cancel_request() on each request.  nvme_cancel_request() sets req->errors
> to NVME_SC_ABORT_REQ.  It then completes the request which ends up at
> nvme_rdma_complete_rq() which queues it for retry:
> ...
>         if (unlikely(rq->errors)) {
>                 if (nvme_req_needs_retry(rq, rq->errors)) {
>                         nvme_requeue_req(rq);
>                         return;
>                 }
>
>                 if (rq->cmd_type == REQ_TYPE_DRV_PRIV)
>                         error = rq->errors;
>                 else
>                         error = nvme_error_status(rq->errors);
>         }
> ...
>
> The retry will end up at nvme_rdma_queue_rq() which will try and post a send wr
> to a freed qp...
>
> Should the canceled requests actually OR in bit NVME_SC_DNR?  That is only done
> in nvme_cancel_request() if the blk queue is dying:

the DNR bit should not be set normally, only when we either don't want
to requeue or we can't.

>
> ...
>         status = NVME_SC_ABORT_REQ;
>         if (blk_queue_dying(req->q))
>                 status |= NVME_SC_DNR;
> ...
>

couple of questions:

1. bringing down the interface means generating DEVICE_REMOVAL
event?

2. keep-alive timeout expires means that nvme_rdma_timeout() kicks
error_recovery and sets:
rq->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR

So I'm not at all convinced that the keep-alive is the request that is
being re-issued. Did you verify that?
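
One way to check would be something like this in nvme_rdma_queue_rq(), after
nvme_setup_cmd() has filled in the command (hypothetical debug code, untested):

        struct nvme_rdma_request *req = blk_mq_rq_to_pdu(rq);
        struct nvme_command *c = req->sqe.data;

        /* keep-alive is an admin command, so this should only ever hit
         * on the admin queue */
        if (c->common.opcode == nvme_admin_keep_alive)
                pr_info("nvme_rdma: keep-alive (re)issued on queue %p\n", queue);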

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-11  6:27               ` Sagi Grimberg
@ 2016-08-11 13:58                 ` Steve Wise
  2016-08-11 14:19                   ` Steve Wise
  2016-08-11 14:40                   ` Steve Wise
  0 siblings, 2 replies; 53+ messages in thread
From: Steve Wise @ 2016-08-11 13:58 UTC (permalink / raw)


> >> The nvme_rdma_ctrl queue associated with the request is in RECONNECTING
> state:
> >>
> >>   ctrl = {
> >>     state = NVME_CTRL_RECONNECTING,
> >>     lock = {
> >>
> >> So it should not be posting SQ WRs...
> >
> > kato kicks error recovery, nvme_rdma_error_recovery_work(), which calls
> > nvme_cancel_request() on each request.  nvme_cancel_request() sets
req->errors
> > to NVME_SC_ABORT_REQ.  It then completes the request which ends up at
> > nvme_rdma_complete_rq() which queues it for retry:
> > ...
> >         if (unlikely(rq->errors)) {
> >                 if (nvme_req_needs_retry(rq, rq->errors)) {
> >                         nvme_requeue_req(rq);
> >                         return;
> >                 }
> >
> >                 if (rq->cmd_type == REQ_TYPE_DRV_PRIV)
> >                         error = rq->errors;
> >                 else
> >                         error = nvme_error_status(rq->errors);
> >         }
> > ...
> >
> > The retry will end up at nvme_rdma_queue_rq() which will try and post a send
wr
> > to a freed qp...
> >
> > Should the canceled requests actually OR in bit NVME_SC_DNR?  That is only
> done
> > in nvme_cancel_request() if the blk queue is dying:
> 
> the DNR bit should not be set normally, only when we either don't want
> to requeue or we can't.
> 
> >
> > ...
> >         status = NVME_SC_ABORT_REQ;
> >         if (blk_queue_dying(req->q))
> >                 status |= NVME_SC_DNR;
> > ...
> >
> 
> couple of questions:
> 
> 1. bringing down the interface means generating DEVICE_REMOVAL
> event?
> 

No.  Just ifconfig ethX down; sleep 10; ifconfig ethX up.  This simply causes
the pending work requests to take longer to complete and kicks in the kato
logic.


> 2. keep-alive timeout expires means that nvme_rdma_timeout() kicks
> error_recovery and sets:
> rq->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR
> 
> So I'm not at all convinced that the keep-alive is the request that is
> being re-issued. Did you verify that?

The request that caused the crash had rq->errors == NVME_SC_ABORT_REQ.  I'm not
sure that is always the case though.  But this is very easy to reproduce, so I
should be able to drill down and add any debug code you think might help.
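
For example, something like this at the top of nvme_rdma_queue_rq() might
catch it (hypothetical debug-only hack, not tested):

        struct nvme_rdma_queue *queue = hctx->driver_data;

        /* debug: flag requests issued while the controller is not live */
        WARN_ONCE(queue->ctrl->ctrl.state != NVME_CTRL_LIVE,
                  "nvme_rdma: queue_rq on ctrl in state %d\n",
                  queue->ctrl->ctrl.state);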

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-11 13:58                 ` Steve Wise
@ 2016-08-11 14:19                   ` Steve Wise
  2016-08-11 14:40                   ` Steve Wise
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-08-11 14:19 UTC (permalink / raw)


> >
> > the DNR bit should not be set normally, only when we either don't want
> > to requeue or we can't.
> >

So when a request is requeued, when is it restarted?  It is getting restarted on
a controller that is in recovery mode and hasn't set up the new qp.  So the
nvme_rdma_queue struct associated with the request is pointing to a freed
ib_qp...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-11 13:58                 ` Steve Wise
  2016-08-11 14:19                   ` Steve Wise
@ 2016-08-11 14:40                   ` Steve Wise
  2016-08-11 15:53                     ` Steve Wise
       [not found]                     ` <00fe01d1f3e8$8992b330$9cb81990$@opengridcomputing.com>
  1 sibling, 2 replies; 53+ messages in thread
From: Steve Wise @ 2016-08-11 14:40 UTC (permalink / raw)


> > couple of questions:
> >
> > 1. bringing down the interface means generating DEVICE_REMOVAL
> > event?
> >
> 
> No.  Just ifconfig ethX down; sleep 10; ifconfig ethX up.  This simply causes
> the pending work requests to take longer to complete and kicks in the kato
> logic.
> 
> 
> > 2. keep-alive timeout expires means that nvme_rdma_timeout() kicks
> > error_recovery and sets:
> > rq->errors = NVME_SC_ABORT_REQ | NVME_SC_DNR
> >
> > So I'm not at all convinced that the keep-alive is the request that is
> > being re-issued. Did you verify that?
> 
> The request that caused the crash had rq->errors == NVME_SC_ABORT_REQ.  I'm
> not
> sure that is always the case though.  But this is very easy to reproduce, so I
> should be able to drill down and add any debug code you think might help.
> 

Here is the nvme_rdma_request and request that caused the crash this time:

crash> nvme_rdma_request ffff8801c7e3a5b0
struct nvme_rdma_request {
  mr = 0xffff8801c7e1d298,
  sqe = {
    cqe = {
      done = 0xffffffffa0853320 <nvme_rdma_send_done>
    },
    data = 0xffff8801c7e1b2c8,
    dma = 7648424648
  },
  sge = {{
      addr = 7648424648,
      length = 64,
      lkey = 0
    }, {
      addr = 7814166528,
      length = 2048,
      lkey = 0
    }},
  num_sge = 2,
  nents = 1,
  inline_data = true,
  need_inval = false,
  reg_wr = {
    wr = {
      next = 0x0,
      {
        wr_id = 0,
        wr_cqe = 0x0
      },
      sg_list = 0x0,
      num_sge = 0,
      opcode = IB_WR_RDMA_WRITE,
      send_flags = 0,
      ex = {
        imm_data = 0,
        invalidate_rkey = 0
      }
    },
    mr = 0x0,
    key = 0,
    access = 0
  },
  reg_cqe = {
    done = 0x0
  },
  queue = 0xffff8801d6484038,
  sg_table = {
    sgl = 0xffff8801c7e3a658,
    nents = 1,
    orig_nents = 1
  },
  first_sgl = 0xffff8801c7e3a658
}
crash> nvme_rdma_queue 0xffff8801d6484038
struct nvme_rdma_queue {
  rsp_ring = 0xffff8801ce699158,
  sig_count = 147 '\223',
  queue_size = 128,
  cmnd_capsule_len = 4160,
  ctrl = 0xffff8801d6308958,
  device = 0xffff8803836f44b8,
  ib_cq = 0xffff8801d65366b8,
  qp = 0xffff8801d6524518,
  flags = 0,
  cm_id = 0xffff8801d6525508,
  cm_error = 0,
  cm_done = {
    done = 0,
    wait = {
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      },
      task_list = {
        next = 0xffff8801d6484098,
        prev = 0xffff8801d6484098
      }
    }
  }
}
crash> gdb p ((struct request *)0xffff8801c7e3a5b0 - 1)
$1 = (struct request *) 0xffff8801c7e3a440
crash> request 0xffff8801c7e3a440
struct request {
  queuelist = {
    next = 0xffff8801c7e3a440,
    prev = 0xffff8801c7e3a440
  },
  {
    csd = {
      llist = {
        next = 0xffff8801c7e3b690
      },
      func = 0xffffffff81336c00 <__blk_mq_complete_request_remote>,
      info = 0xffff8801c7e3a440,
      flags = 0
    },
    fifo_time = 18446612139962775184
  },
  q = 0xffff8801d636dbb8,
  mq_ctx = 0xffffe8ffffc04100,
  cpu = -1,
  cmd_type = 1,
  cmd_flags = 2305843009784119304,
  atomic_flags = 2,
  __data_len = 2048,
  __sector = 13992,
  bio = 0xffff8801c34565c0,
  biotail = 0xffff8801c34565c0,
  {
    hash = {
      next = 0x0,
      pprev = 0x0
    },
    ipi_list = {
      next = 0x0,
      prev = 0x0
    }
  },
  {
    rb_node = {
      __rb_parent_color = 18446612139962770632,
      rb_right = 0x0,
      rb_left = 0x0
    },
    completion_data = 0xffff8801c7e3a4c8
  },
  {
    elv = {
      icq = 0x0,
      priv = {0x0, 0x0}
    },
    flush = {
      seq = 0,
      list = {
        next = 0x0,
        prev = 0x0
      },
      saved_end_io = 0x0
    }
  },
  rq_disk = 0xffff8801d63092a8,
  part = 0xffff8801d63092f0,
  start_time = 4361672618,
  rl = 0x0,
  start_time_ns = 67003651063676,
  io_start_time_ns = 0,
  nr_phys_segments = 1,
  nr_integrity_segments = 0,
  ioprio = 0,
  special = 0x0,
  tag = 121,
  errors = 7,
  __cmd = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
  cmd = 0xffff8801c7e3a548 "",
  cmd_len = 0,
  extra_len = 0,
  sense_len = 0,
  resid_len = 2048,
  sense = 0x0,
  deadline = 4361722943,
  timeout_list = {
    next = 0xffff8801c7e3a580,
    prev = 0xffff8801c7e3a580
  },
  timeout = 30000,
  retries = 0,
  end_io = 0x0,
  end_io_data = 0x0,
  next_rq = 0x0
}

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-11 14:40                   ` Steve Wise
@ 2016-08-11 15:53                     ` Steve Wise
       [not found]                     ` <00fe01d1f3e8$8992b330$9cb81990$@opengridcomputing.com>
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-08-11 15:53 UTC (permalink / raw)


> > The request that caused the crash had rq->errors == NVME_SC_ABORT_REQ.  I'm
> > not
> > sure that is always the case though.  But this is very easy to reproduce, so
I
> > should be able to drill down and add any debug code you think might help.
> >
> 
> Here is the nvme_rdma_request and request that caused the crash this time:
> 
> crash> nvme_rdma_request ffff8801c7e3a5b0
> struct nvme_rdma_request {
>   mr = 0xffff8801c7e1d298,
>   sqe = {
>     cqe = {
>       done = 0xffffffffa0853320 <nvme_rdma_send_done>
>     },
>     data = 0xffff8801c7e1b2c8,
>     dma = 7648424648
>   },
>   sge = {{
>       addr = 7648424648,
>       length = 64,
>       lkey = 0
>     }, {
>       addr = 7814166528,
>       length = 2048,
>       lkey = 0
>     }},
>   num_sge = 2,
>   nents = 1,
>   inline_data = true,
>   need_inval = false,
>   reg_wr = {
>     wr = {
>       next = 0x0,
>       {
>         wr_id = 0,
>         wr_cqe = 0x0
>       },
>       sg_list = 0x0,
>       num_sge = 0,
>       opcode = IB_WR_RDMA_WRITE,
>       send_flags = 0,
>       ex = {
>         imm_data = 0,
>         invalidate_rkey = 0
>       }
>     },
>     mr = 0x0,
>     key = 0,
>     access = 0
>   },
>   reg_cqe = {
>     done = 0x0
>   },
>   queue = 0xffff8801d6484038,
>   sg_table = {
>     sgl = 0xffff8801c7e3a658,
>     nents = 1,
>     orig_nents = 1
>   },
>   first_sgl = 0xffff8801c7e3a658
> }
> crash> nvme_rdma_queue 0xffff8801d6484038
> struct nvme_rdma_queue {
>   rsp_ring = 0xffff8801ce699158,
>   sig_count = 147 '\223',
>   queue_size = 128,
>   cmnd_capsule_len = 4160,
>   ctrl = 0xffff8801d6308958,
>   device = 0xffff8803836f44b8,
>   ib_cq = 0xffff8801d65366b8,
>   qp = 0xffff8801d6524518,
>   flags = 0,
>   cm_id = 0xffff8801d6525508,
>   cm_error = 0,
>   cm_done = {
>     done = 0,
>     wait = {
>       lock = {
>         {
>           rlock = {
>             raw_lock = {
>               val = {
>                 counter = 0
>               }
>             }
>           }
>         }
>       },
>       task_list = {
>         next = 0xffff8801d6484098,
>         prev = 0xffff8801d6484098
>       }
>     }
>   }
> }
> crash> gdb p ((struct request *)0xffff8801c7e3a5b0 - 1)
> $1 = (struct request *) 0xffff8801c7e3a440
> crash> request 0xffff8801c7e3a440
> struct request {
>   queuelist = {
>     next = 0xffff8801c7e3a440,
>     prev = 0xffff8801c7e3a440
>   },
>   {
>     csd = {
>       llist = {
>         next = 0xffff8801c7e3b690
>       },
>       func = 0xffffffff81336c00 <__blk_mq_complete_request_remote>,
>       info = 0xffff8801c7e3a440,
>       flags = 0
>     },
>     fifo_time = 18446612139962775184
>   },
>   q = 0xffff8801d636dbb8,
>   mq_ctx = 0xffffe8ffffc04100,
>   cpu = -1,
>   cmd_type = 1,
>   cmd_flags = 2305843009784119304,
>   atomic_flags = 2,
>   __data_len = 2048,
>   __sector = 13992,
>   bio = 0xffff8801c34565c0,
>   biotail = 0xffff8801c34565c0,
>   {
>     hash = {
>       next = 0x0,
>       pprev = 0x0
>     },
>     ipi_list = {
>       next = 0x0,
>       prev = 0x0
>     }
>   },
>   {
>     rb_node = {
>       __rb_parent_color = 18446612139962770632,
>       rb_right = 0x0,
>       rb_left = 0x0
>     },
>     completion_data = 0xffff8801c7e3a4c8
>   },
>   {
>     elv = {
>       icq = 0x0,
>       priv = {0x0, 0x0}
>     },
>     flush = {
>       seq = 0,
>       list = {
>         next = 0x0,
>         prev = 0x0
>       },
>       saved_end_io = 0x0
>     }
>   },
>   rq_disk = 0xffff8801d63092a8,
>   part = 0xffff8801d63092f0,
>   start_time = 4361672618,
>   rl = 0x0,
>   start_time_ns = 67003651063676,
>   io_start_time_ns = 0,
>   nr_phys_segments = 1,
>   nr_integrity_segments = 0,
>   ioprio = 0,
>   special = 0x0,
>   tag = 121,
>   errors = 7,
>   __cmd = "\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000",
>   cmd = 0xffff8801c7e3a548 "",
>   cmd_len = 0,
>   extra_len = 0,
>   sense_len = 0,
>   resid_len = 2048,
>   sense = 0x0,
>   deadline = 4361722943,
>   timeout_list = {
>     next = 0xffff8801c7e3a580,
>     prev = 0xffff8801c7e3a580
>   },
>   timeout = 30000,
>   retries = 0,
>   end_io = 0x0,
>   end_io_data = 0x0,
>   next_rq = 0x0
> }
> 


Ah, I see the nvme_rdma worker thread running nvme_rdma_reconnect_ctrl_work() on
the same nvme_rdma_queue that is handling the request and crashing:

crash> bt 371
PID: 371    TASK: ffff8803975a4300  CPU: 5   COMMAND: "kworker/5:2"
    [exception RIP: set_track+16]
    RIP: ffffffff81202070  RSP: ffff880397f2ba18  RFLAGS: 00000086
    RAX: 0000000000000001  RBX: ffff88039f407a00  RCX: ffffffffa0853234
    RDX: 0000000000000001  RSI: ffff8801d663e008  RDI: ffff88039f407a00
    RBP: ffff880397f2ba48   R8: ffff8801d663e158   R9: 000000000000005a
    R10: 00000000000000cc  R11: 0000000000000000  R12: ffff8801d663e008
    R13: ffffea0007598f80  R14: 0000000000000001  R15: ffff8801d663e008
    CS: 0010  SS: 0018
 #0 [ffff880397f2ba50] free_debug_processing at ffffffff81204820
 #1 [ffff880397f2bad0] __slab_free at ffffffff81204bfb
 #2 [ffff880397f2bb90] kfree at ffffffff81204dcd
 #3 [ffff880397f2bc00] nvme_rdma_free_qe at ffffffffa0853234 [nvme_rdma]
 #4 [ffff880397f2bc20] nvme_rdma_destroy_queue_ib at ffffffffa0853dbf
[nvme_rdma]
 #5 [ffff880397f2bc60] nvme_rdma_stop_and_free_queue at ffffffffa085402d
[nvme_rdma]
 #6 [ffff880397f2bc80] nvme_rdma_reconnect_ctrl_work at ffffffffa0854957
[nvme_rdma]
 #7 [ffff880397f2bcb0] process_one_work at ffffffff810a1593
 #8 [ffff880397f2bd90] worker_thread at ffffffff810a222d
 #9 [ffff880397f2bec0] kthread at ffffffff810a6d6c
#10 [ffff880397f2bf50] ret_from_fork at ffffffff816e2cbf

So  why is this request being processed during a reconnect?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]                     ` <00fe01d1f3e8$8992b330$9cb81990$@opengridcomputing.com>
@ 2016-08-15 14:39                       ` Steve Wise
  2016-08-16  9:26                         ` Sagi Grimberg
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-08-15 14:39 UTC (permalink / raw)


 
> Ah, I see the nvme_rdma worker thread running
> nvme_rdma_reconnect_ctrl_work() on the same nvme_rdma_queue that is
> handling the request and crashing:
> 
> crash> bt 371
> PID: 371    TASK: ffff8803975a4300  CPU: 5   COMMAND: "kworker/5:2"
>     [exception RIP: set_track+16]
>     RIP: ffffffff81202070  RSP: ffff880397f2ba18  RFLAGS: 00000086
>     RAX: 0000000000000001  RBX: ffff88039f407a00  RCX: ffffffffa0853234
>     RDX: 0000000000000001  RSI: ffff8801d663e008  RDI: ffff88039f407a00
>     RBP: ffff880397f2ba48   R8: ffff8801d663e158   R9: 000000000000005a
>     R10: 00000000000000cc  R11: 0000000000000000  R12: ffff8801d663e008
>     R13: ffffea0007598f80  R14: 0000000000000001  R15: ffff8801d663e008
>     CS: 0010  SS: 0018
>  #0 [ffff880397f2ba50] free_debug_processing at ffffffff81204820
>  #1 [ffff880397f2bad0] __slab_free at ffffffff81204bfb
>  #2 [ffff880397f2bb90] kfree at ffffffff81204dcd
>  #3 [ffff880397f2bc00] nvme_rdma_free_qe at ffffffffa0853234 [nvme_rdma]
>  #4 [ffff880397f2bc20] nvme_rdma_destroy_queue_ib at ffffffffa0853dbf
> [nvme_rdma]
>  #5 [ffff880397f2bc60] nvme_rdma_stop_and_free_queue at ffffffffa085402d
> [nvme_rdma]
>  #6 [ffff880397f2bc80] nvme_rdma_reconnect_ctrl_work at ffffffffa0854957
> [nvme_rdma]
>  #7 [ffff880397f2bcb0] process_one_work at ffffffff810a1593
>  #8 [ffff880397f2bd90] worker_thread at ffffffff810a222d
>  #9 [ffff880397f2bec0] kthread at ffffffff810a6d6c
> #10 [ffff880397f2bf50] ret_from_fork at ffffffff816e2cbf
> 
> So  why is this request being processed during a reconnect?

Hey Sagi, 

Do you have any ideas on this crash? I could really use some help.  Is it
possible that recovery/reconnect/restart of a different controller is somehow
restarting the requests for a controller still in recovery?  Here is one issue
perhaps:  nvme_rdma_reconnect_ctrl_work() calls blk_mq_start_stopped_hw_queues()
before calling nvme_rdma_init_io_queues().  Is that a problem?  I tried moving
blk_mq_start_stopped_hw_queues() to after the io queues are setup, but this
causes a stall in nvme_rdma_reconnect_ctrl_work().  I think the blk queues need
to be started to get the admin queue connected.   Thoughts?

Thanks,

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-15 14:39                       ` Steve Wise
@ 2016-08-16  9:26                         ` Sagi Grimberg
  2016-08-16 21:17                           ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Sagi Grimberg @ 2016-08-16  9:26 UTC (permalink / raw)




On 15/08/16 17:39, Steve Wise wrote:
>
>> Ah, I see the nvme_rdma worker thread running
>> nvme_rdma_reconnect_ctrl_work() on the same nvme_rdma_queue that is
>> handling the request and crashing:
>>
>> crash> bt 371
>> PID: 371    TASK: ffff8803975a4300  CPU: 5   COMMAND: "kworker/5:2"
>>     [exception RIP: set_track+16]
>>     RIP: ffffffff81202070  RSP: ffff880397f2ba18  RFLAGS: 00000086
>>     RAX: 0000000000000001  RBX: ffff88039f407a00  RCX: ffffffffa0853234
>>     RDX: 0000000000000001  RSI: ffff8801d663e008  RDI: ffff88039f407a00
>>     RBP: ffff880397f2ba48   R8: ffff8801d663e158   R9: 000000000000005a
>>     R10: 00000000000000cc  R11: 0000000000000000  R12: ffff8801d663e008
>>     R13: ffffea0007598f80  R14: 0000000000000001  R15: ffff8801d663e008
>>     CS: 0010  SS: 0018
>>  #0 [ffff880397f2ba50] free_debug_processing at ffffffff81204820
>>  #1 [ffff880397f2bad0] __slab_free at ffffffff81204bfb
>>  #2 [ffff880397f2bb90] kfree at ffffffff81204dcd
>>  #3 [ffff880397f2bc00] nvme_rdma_free_qe at ffffffffa0853234 [nvme_rdma]
>>  #4 [ffff880397f2bc20] nvme_rdma_destroy_queue_ib at ffffffffa0853dbf
>> [nvme_rdma]
>>  #5 [ffff880397f2bc60] nvme_rdma_stop_and_free_queue at ffffffffa085402d
>> [nvme_rdma]
>>  #6 [ffff880397f2bc80] nvme_rdma_reconnect_ctrl_work at ffffffffa0854957
>> [nvme_rdma]
>>  #7 [ffff880397f2bcb0] process_one_work at ffffffff810a1593
>>  #8 [ffff880397f2bd90] worker_thread at ffffffff810a222d
>>  #9 [ffff880397f2bec0] kthread at ffffffff810a6d6c
>> #10 [ffff880397f2bf50] ret_from_fork at ffffffff816e2cbf
>>
>> So  why is this request being processed during a reconnect?
>
> Hey Sagi,
>
> Do you have any ideas on this crash? I could really use some help.

Not yet :(

> Is it
> possible that recovery/reconnect/restart of a different controller is somehow
> restarting the requests for a controller still in recovery?

I don't think this is the case.
Can you try and find out if the request is from the admin tagset or from
the io tagset?

We rely on the fact that no I/O will be issued after we call
nvme_stop_queues(). can you trace that we indeed call nvme_stop_queues
when we start error recovery and do nvme_start_queues only when we
successfully reconnect and not anywhere in between?

If that is the case, I think we need to have a closer look at
nvme_stop_queues...
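
For reference, nvme_stop_queues() is roughly the below (quoting from memory,
so check against the tree); "stopped" here means QUEUE_FLAG_STOPPED on each
namespace request_queue plus blk_mq_stop_hw_queues():

void nvme_stop_queues(struct nvme_ctrl *ctrl)
{
        struct nvme_ns *ns;

        mutex_lock(&ctrl->namespaces_mutex);
        list_for_each_entry(ns, &ctrl->namespaces, list) {
                spin_lock_irq(ns->queue->queue_lock);
                queue_flag_set(QUEUE_FLAG_STOPPED, ns->queue);
                spin_unlock_irq(ns->queue->queue_lock);

                blk_mq_cancel_requeue_work(ns->queue);
                blk_mq_stop_hw_queues(ns->queue);
        }
        mutex_unlock(&ctrl->namespaces_mutex);
}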

> Here is one issue
> perhaps:  nvme_rdma_reconnect_ctrl_work() calls blk_mq_start_stopped_hw_queues()
> before calling nvme_rdma_init_io_queues().  Is that a problem?

It's for the admin queue, without that you won't be able to issue the
admin connect.
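
i.e. in nvme_rdma_reconnect_ctrl_work() the admin queue is restarted right
before the fabrics connect (roughly, if I'm reading the current code right):

        blk_mq_start_stopped_hw_queues(ctrl->ctrl.admin_q, true);

        ret = nvmf_connect_admin_queue(&ctrl->queues[0]);
        if (ret)
                goto stop_admin_q;

The namespace (I/O) queues are only restarted at the very end of the
reconnect via nvme_start_queues().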

> I tried moving
> blk_mq_start_stopped_hw_queues() to after the io queues are setup, but this
> causes a stall in nvme_rdma_reconnect_ctrl_work().

Make sense...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-16  9:26                         ` Sagi Grimberg
@ 2016-08-16 21:17                           ` Steve Wise
  2016-08-17 18:57                             ` Sagi Grimberg
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-08-16 21:17 UTC (permalink / raw)


> > Hey Sagi,
> >
> > Do you have any ideas on this crash? I could really use some help.
> 
> Not yet :(
> 
> > Is it
> > possible that recovery/reconnect/restart of a different controller is
somehow
> > restarting the requests for a controller still in recovery?
> 
> I don't think this is the case.
> Can you try and find out if the request is from the admin tagset or from
> the io tagset?

I wasn't sure how to tell easily which tagset the request was in, so instead I
changed nvme_rdma_admin_mq_ops.queue_rq to its own distinct function,
nvme_rdma_queue_admin_rq() which was a clone of nvme_rdma_queue_rq().  Thus when
it crashed, I could tell by the stack trace.  Anyway, the stack trace indicates
it was from the io tagset.

> 
> We rely on the fact that no I/O will be issued after we call
> nvme_stop_queues(). can you trace that we indeed call nvme_stop_queues
> when we start error recovery and do nvme_start_queues only when we
> successfully reconnect and not anywhere in between?
> 

I see from debugging that the nvme_ns->queue address of the queue being used
that causes the crash was stopped via nvme_stop_queues().  It was never started
via nvme_start_queues().

> If that is the case, I think we need to have a closer look at
> nvme_stop_queues...
> 

request_queue->queue_flags does have QUEUE_FLAG_STOPPED set:

#define QUEUE_FLAG_STOPPED      2       /* queue is stopped */

crash> request_queue.queue_flags -x 0xffff880397a13d28
  queue_flags = 0x1f07a04
crash> request_queue.mq_ops 0xffff880397a13d28
  mq_ops = 0xffffffffa084b140 <nvme_rdma_mq_ops>
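
Checking the bit by hand, with QUEUE_FLAG_STOPPED being bit 2 per the define
above:

        0x1f07a04 & (1 << QUEUE_FLAG_STOPPED) = 0x1f07a04 & 0x4 = 0x4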

So it appears the queue is stopped, yet a request is being processed for that
queue.  Perhaps there is a race where QUEUE_FLAG_STOPPED is set after a request
is scheduled?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-16 21:17                           ` Steve Wise
@ 2016-08-17 18:57                             ` Sagi Grimberg
  2016-08-17 19:07                               ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Sagi Grimberg @ 2016-08-17 18:57 UTC (permalink / raw)



>> If that is the case, I think we need to have a closer look at
>> nvme_stop_queues...
>>
>
> request_queue->queue_flags does have QUEUE_FLAG_STOPPED set:
>
> #define QUEUE_FLAG_STOPPED      2       /* queue is stopped */
>
> crash> request_queue.queue_flags -x 0xffff880397a13d28
>   queue_flags = 0x1f07a04
> crash> request_queue.mq_ops 0xffff880397a13d28
>   mq_ops = 0xffffffffa084b140 <nvme_rdma_mq_ops>
>
> So it appears the queue is stopped, yet a request is being processed for that
> queue.  Perhaps there is a race where QUEUE_FLAG_STOPPED is set after a request
> is scheduled?

Umm. When the keep-alive timeout triggers we stop the queues. only 10
seconds (or reconnect_delay) later we free the queues and reestablish
them, so I find it hard to believe that a request was queued, and spent
so long in queue_rq until we freed the queue-pair.

From your description of the sequence it seems that after 10 seconds we
attempt a reconnect and during that time an IO request crashes the
party.

I assume this means you ran traffic during the sequence yes?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-17 18:57                             ` Sagi Grimberg
@ 2016-08-17 19:07                               ` Steve Wise
  2016-09-01 19:14                                 ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-08-17 19:07 UTC (permalink / raw)


> >> If that is the case, I think we need to have a closer look at
> >> nvme_stop_queues...
> >>
> >
> > request_queue->queue_flags does have QUEUE_FLAG_STOPPED set:
> >
> > #define QUEUE_FLAG_STOPPED      2       /* queue is stopped */
> >
> > crash> request_queue.queue_flags -x 0xffff880397a13d28
> >   queue_flags = 0x1f07a04
> > crash> request_queue.mq_ops 0xffff880397a13d28
> >   mq_ops = 0xffffffffa084b140 <nvme_rdma_mq_ops>
> >
> > So it appears the queue is stopped, yet a request is being processed for
> that
> > queue.  Perhaps there is a race where QUEUE_FLAG_STOPPED is set after a
> request
> > is scheduled?
> 
> Umm. When the keep-alive timeout triggers we stop the queues. only 10
> seconds (or reconnect_delay) later we free the queues and reestablish
> them, so I find it hard to believe that a request was queued, and spent
> so long in queue_rq until we freed the queue-pair.

I agree.

> 
> From your description of the sequence it seems that after 10 seconds we
> attempt a reconnect and during that time an IO request crashes the
> party.
>

Yes. 
 
> I assume this means you ran traffic during the sequence yes?

Mega fio test streaming to all 10 devices.  I start the following script,
and then bring the link down a few seconds later, which triggers the kato,
then 10 seconds later reconnecting starts and whamo...


for i in $(seq 1 20) ; do

         fio --ioengine=libaio --rw=randwrite --name=randwrite --size=200m
--direct=1 \
        --invalidate=1 --fsync_on_close=1 --group_reporting --exitall
--runtime=20 \
        --time_based --filename=/dev/nvme0n1 --filename=/dev/nvme1n1 \
        --filename=/dev/nvme2n1 --filename=/dev/nvme3n1
--filename=/dev/nvme4n1 \
        --filename=/dev/nvme5n1 --filename=/dev/nvme6n1
--filename=/dev/nvme7n1 \
        --filename=/dev/nvme8n1 --filename=/dev/nvme9n1 --iodepth=4
--numjobs=32 \
        --bs=2K |grep -i "aggrb\|iops"
        sleep 3
        echo "### Iteration $i Done ###"
done

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-08-17 19:07                               ` Steve Wise
@ 2016-09-01 19:14                                 ` Steve Wise
  2016-09-04  9:17                                   ` Sagi Grimberg
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-01 19:14 UTC (permalink / raw)


> > >
> > > So it appears the queue is stopped, yet a request is being processed for
> > that
> > > queue.  Perhaps there is a race where QUEUE_FLAG_STOPPED is set after a
> > request
> > > is scheduled?
> >
> > Umm. When the keep-alive timeout triggers we stop the queues. only 10
> > seconds (or reconnect_delay) later we free the queues and reestablish
> > them, so I find it hard to believe that a request was queued, and spent
> > so long in queue_rq until we freed the queue-pair.
> 
> I agree.
> 
> >
> > From your description of the sequence it seems that after 10 seconds we
> > attempt a reconnect and during that time an IO request crashes the
> > party.
> >
> 
> Yes.
> 
> > I assume this means you ran traffic during the sequence yes?
> 
> Mega fio test streaming to all 10 devices.  I start the following script,
> and then bring the link down a few seconds later, which triggers the kato,
> then 10 seconds later reconnecting starts and whamo...
> 
> 
> for i in $(seq 1 20) ; do
> 
>          fio --ioengine=libaio --rw=randwrite --name=randwrite --size=200m
> --direct=1 \
>         --invalidate=1 --fsync_on_close=1 --group_reporting --exitall
> --runtime=20 \
>         --time_based --filename=/dev/nvme0n1 --filename=/dev/nvme1n1 \
>         --filename=/dev/nvme2n1 --filename=/dev/nvme3n1
> --filename=/dev/nvme4n1 \
>         --filename=/dev/nvme5n1 --filename=/dev/nvme6n1
> --filename=/dev/nvme7n1 \
>         --filename=/dev/nvme8n1 --filename=/dev/nvme9n1 --iodepth=4
> --numjobs=32 \
>         --bs=2K |grep -i "aggrb\|iops"
>         sleep 3
>         echo "### Iteration $i Done ###"
> done
> 

Ok, back to this issue. :)

The same crash happens with mlx4_ib, so this isn't related to cxgb4.  To sum up:

With pending NVME IO on the nvme-rdma host, and in the presence of kato
recovery/reconnect due to the target going away, some NVME requests get
restarted that are referencing nvmf controllers that have freed queues.  I see
this also with my recent v4 series that corrects the recovery problems with
nvme-rdma when the target is down, but without pending IO.  

So the crash in this email is yet another issue that we see when the nvme host
has lots of pending IO requests during kato recovery/reconnect...

My findings to date:  the IO is not an admin queue IO.  It is not the kato
messages.  The io queue has been stopped, yet the request is attempted and
causes the crash.

Any help is appreciated...

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-01 19:14                                 ` Steve Wise
@ 2016-09-04  9:17                                   ` Sagi Grimberg
  2016-09-07 21:08                                     ` Steve Wise
  2016-09-07 21:33                                     ` Steve Wise
  0 siblings, 2 replies; 53+ messages in thread
From: Sagi Grimberg @ 2016-09-04  9:17 UTC (permalink / raw)


Hey Steve,

> Ok, back to this issue. :)
>
> The same crash happens with mlx4_ib, so this isn't related to cxgb4.  To sum up:
>
> With pending NVME IO on the nvme-rdma host, and in the presence of kato
> recovery/reconnect due to the target going away, some NVME requests get
> restarted that are referencing nvmf controllers that have freed queues.  I see
> this also with my recent v4 series that corrects the recovery problems with
> nvme-rdma when the target is down, but without pending IO.
>
> So the crash in this email is yet another issue that we see when the nvme host
> has lots of pending IO requests during kato recovery/reconnect...
>
> My findings to date:  the IO is not an admin queue IO.  It is not the kato
> messages.  The io queue has been stopped, yet the request is attempted and
> causes the crash.
>
> Any help is appreciated...

So in the current state, my impression is that we are seeing a request
queued when we shouldn't (or at least assume we won't).

Given that you run heavy load to reproduce this, I can only suspect that
this is a race condition.

Does this happen if you change the reconnect delay to be something
different than 10 seconds? (say 30?)

Can you also give patch [1] a try? It's not a solution, but I want
to see if it hides the problem...

Now, given that you already verified that the queues are stopped with
BLK_MQ_S_STOPPED, I'm looking at blk-mq now.

I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() indeed take
BLK_MQ_S_STOPPED into account. Theoretically, if we free the queue
pairs after we passed these checks while the rq_list is being processed
then we can end up with this condition, but given that it takes
essentially forever (10 seconds) I tend to doubt this is the case.
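
Roughly the window I mean, as a stand-alone user-space sketch (the struct and
function names below are made up for illustration -- this is not the real
blk-mq or nvme-rdma code): one thread checks a "stopped" flag and then touches
the queue's qp pointer, while a second thread sets the flag and frees the qp
in between the check and the use.
--
/* Illustrative model of the check-then-use window, not kernel code. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

struct fake_queue {
        atomic_bool stopped;            /* stands in for BLK_MQ_S_STOPPED */
        void *qp;                       /* stands in for the RDMA queue pair */
};

/* Dispatch side: test the stopped flag, then use the qp. */
static void *dispatch(void *arg)
{
        struct fake_queue *q = arg;

        if (!atomic_load(&q->stopped)) {
                usleep(1000);           /* widen the race window for the demo */
                /* This read races with teardown(); that race is the bug
                 * being modeled. */
                printf("posting on qp=%p\n", q->qp);
        }
        return NULL;
}

/* Teardown side: mark the queue stopped, then free the qp. */
static void *teardown(void *arg)
{
        struct fake_queue *q = arg;

        atomic_store(&q->stopped, true);
        free(q->qp);
        q->qp = NULL;
        return NULL;
}

int main(void)
{
        struct fake_queue q = { .stopped = false, .qp = malloc(64) };
        pthread_t a, b;

        pthread_create(&a, NULL, dispatch, &q);
        pthread_create(&b, NULL, teardown, &q);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
}
--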

HCH, Jens, Keith, any useful pointers for us?

To summarize we see a stray request being queued long after we set
BLK_MQ_S_STOPPED (and by long I mean 10 seconds).



[1]:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index d2f891efb27b..38ea5dab4524 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -701,20 +701,13 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
         bool changed;
         int ret;

-       if (ctrl->queue_count > 1) {
-               nvme_rdma_free_io_queues(ctrl);
-
-               ret = blk_mq_reinit_tagset(&ctrl->tag_set);
-               if (ret)
-                       goto requeue;
-       }
-
-       nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);

         ret = blk_mq_reinit_tagset(&ctrl->admin_tag_set);
         if (ret)
                 goto requeue;

+       nvme_rdma_stop_and_free_queue(&ctrl->queues[0]);
+
         ret = nvme_rdma_init_queue(ctrl, 0, NVMF_AQ_DEPTH);
         if (ret)
                 goto requeue;
@@ -732,6 +725,12 @@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
         nvme_start_keep_alive(&ctrl->ctrl);

         if (ctrl->queue_count > 1) {
+               ret = blk_mq_reinit_tagset(&ctrl->tag_set);
+               if (ret)
+                       goto stop_admin_q;
+
+               nvme_rdma_free_io_queues(ctrl);
+
                 ret = nvme_rdma_init_io_queues(ctrl);
                 if (ret)
                         goto stop_admin_q;
--

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-04  9:17                                   ` Sagi Grimberg
@ 2016-09-07 21:08                                     ` Steve Wise
  2016-09-08  7:45                                       ` Sagi Grimberg
       [not found]                                       ` <7f09e373-6316-26a3-ae81-dab1205d88ab@grimberg.me>
  2016-09-07 21:33                                     ` Steve Wise
  1 sibling, 2 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-07 21:08 UTC (permalink / raw)


> Hey Steve,
> 
> > Ok, back to this issue. :)
> >
> > The same crash happens with mlx4_ib, so this isn't related to cxgb4.  To sum up:
> >
> > With pending NVME IO on the nvme-rdma host, and in the presence of kato
> > recovery/reconnect due to the target going away, some NVME requests get
> > restarted that are referencing nvmf controllers that have freed queues.  I see
> > this also with my recent v4 series that corrects the recovery problems with
> > nvme-rdma when the target is down, but without pending IO.
> >
> > So the crash in this email is yet another issue that we see when the nvme host
> > has lots of pending IO requests during kato recovery/reconnect...
> >
> > My findings to date:  the IO is not an admin queue IO.  It is not the kato
> > messages.  The io queue has been stopped, yet the request is attempted and
> > causes the crash.
> >
> > Any help is appreciated...
> 
> So in the current state, my impression is that we are seeing a request
> queued when we shouldn't (or at least assume we won't).
> 
> Given that you run heavy load to reproduce this, I can only suspect that
> this is a race condition.
> 
> Does this happen if you change the reconnect delay to be something
> different than 10 seconds? (say 30?)
>

Yes.  But I noticed something when performing this experiment that is an
important point, I think:  if I just bring the network interface down and leave
it down, we don't crash.  During this state, I see the host continually
reconnecting after the reconnect delay time, timing out trying to reconnect, and
retrying after another reconnect_delay period.  I see this for all 10 targets of
course.  The crash only happens when I bring the interface back up, and the
targets begin to reconnect.   So the process of successfully reconnecting the
RDMA QPs, and restarting the nvme queues is somehow triggering running an nvme
request too soon (or perhaps on the wrong queue).   
 
> Can you also give patch [1] a try? It's not a solution, but I want
> to see if it hides the problem...
> 

hmm.  I ran the experiment once with [1] and it didn't crash.  I ran it a 2nd
time and hit a new crash.  Maybe a problem with [1]?

[  379.864950] BUG: unable to handle kernel NULL pointer dereference at
0000000000000024
[  379.874330] IP: [<ffffffffa0307ad1>] ib_destroy_qp+0x21/0x1a0 [ib_core]
[  379.882489] PGD 1002571067 PUD fa121f067 PMD 0
[  379.888561] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  379.894526] Modules linked in: nvme_rdma nvme_fabrics brd iw_cxgb4 cxgb4
ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM
iptable_mangle iptable_filter ip_tables bridge 8021q mrp garp stp llc cachefiles
fscache rdma_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad ocrdma be2net
iw_nes libcrc32c iw_cxgb3 cxgb3 mdio ib_qib rdmavt mlx5_ib mlx5_core mlx4_ib
mlx4_en mlx4_core ib_mthca ib_core binfmt_misc dm_mirror dm_region_hash dm_log
vhost_net macvtap macvlan vhost tun kvm irqbypass uinput iTCO_wdt
iTCO_vendor_support mxm_wmi pcspkr dm_mod i2c_i801 i2c_smbus sg lpc_ich mfd_core
mei_me mei nvme nvme_core igb dca ptp pps_core ipmi_si ipmi_msghandler wmi
ext4(E) mbcache(E) jbd2(E) sd_mod(E) ahci(E) libahci(E) libata(E) mgag200(E)
ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E) sysfillrect(E)
syscopyarea(E) i2c_algo_bit(E) i2c_core(E) [last unloaded: cxgb4]
[  380.005238] CPU: 30 PID: 10981 Comm: kworker/30:2 Tainted: G            E
4.8.0-rc4-block-for-linus-dbg+ #32
[  380.017596] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
[  380.027002] Workqueue: nvme_rdma_wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
[  380.036664] task: ffff880fd014af40 task.stack: ffff881013b38000
[  380.045082] RIP: 0010:[<ffffffffa0307ad1>]  [<ffffffffa0307ad1>]
ib_destroy_qp+0x21/0x1a0 [ib_core]
[  380.056562] RSP: 0018:ffff881013b3bb88  EFLAGS: 00010286
[  380.064291] RAX: ffff880fd014af40 RBX: ffff880fba6d3a78 RCX: 0000000000000000
[  380.073733] RDX: ffff881037593d20 RSI: ffff88103758dfc8 RDI: 0000000000000000
[  380.083030] RBP: ffff881013b3bbe8 R08: 0000000000000000 R09: 0000000000000000
[  380.092230] R10: ffffffffa0030060 R11: 0000000000000000 R12: ffff880fba6d3c98
[  380.101442] R13: 00000000ffffff98 R14: ffff88101ce78008 R15: ffff88100507c968
[  380.110541] FS:  0000000000000000(0000) GS:ffff881037580000(0000)
knlGS:0000000000000000
[  380.120507] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  380.128048] CR2: 0000000000000024 CR3: 0000000fa12f6000 CR4: 00000000000406e0
[  380.136990] Stack:
[  380.140714]  ffff881013b3bb98 0000000000000246 000003e800000286
ffffffffa0733fe0
[  380.149939]  ffff881013b3bbd8 ffffffffa0274588 ffff881013b3bbe8
ffff880fba6d3a78
[  380.159113]  ffff880fba6d3c98 00000000ffffff98 ffff88101ce78008
ffff88100507c968
[  380.168205] Call Trace:
[  380.172280]  [<ffffffffa0733fe0>] ? cma_work_handler+0xa0/0xa0 [rdma_cm]
[  380.180546]  [<ffffffffa0731611>] rdma_destroy_qp+0x31/0x50 [rdma_cm]
[  380.188483]  [<ffffffffa0661e92>] nvme_rdma_destroy_queue_ib+0x52/0xb0
[nvme_rdma]
[  380.197566]  [<ffffffffa0662678>] nvme_rdma_init_queue+0x128/0x180
[nvme_rdma]
[  380.206264]  [<ffffffffa0662a09>] nvme_rdma_reconnect_ctrl_work+0x79/0x220
[nvme_rdma]
[  380.215606]  [<ffffffff810a15e3>] process_one_work+0x183/0x4d0
[  380.222820]  [<ffffffff816de9f0>] ? __schedule+0x1f0/0x5b0
[  380.229680]  [<ffffffff816deeb0>] ? schedule+0x40/0xb0
[  380.236147]  [<ffffffff810a227d>] worker_thread+0x16d/0x530
[  380.243051]  [<ffffffff810b43d5>] ?
trace_event_raw_event_sched_switch+0xe5/0x130
[  380.251821]  [<ffffffff816de9f0>] ? __schedule+0x1f0/0x5b0
[  380.258600]  [<ffffffff810cba86>] ? __wake_up_common+0x56/0x90
[  380.265676]  [<ffffffff810a2110>] ? maybe_create_worker+0x120/0x120
[  380.273153]  [<ffffffff816deeb0>] ? schedule+0x40/0xb0
[  380.279504]  [<ffffffff810a2110>] ? maybe_create_worker+0x120/0x120
[  380.286993]  [<ffffffff810a6dbc>] kthread+0xcc/0xf0
[  380.293078]  [<ffffffff810b177e>] ? schedule_tail+0x1e/0xc0
[  380.299857]  [<ffffffff816e2b7f>] ret_from_fork+0x1f/0x40

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-04  9:17                                   ` Sagi Grimberg
  2016-09-07 21:08                                     ` Steve Wise
@ 2016-09-07 21:33                                     ` Steve Wise
  2016-09-08  8:22                                       ` Sagi Grimberg
       [not found]                                       ` <9fd1f090-3b86-b496-d8c0-225ac0815fbe@grimberg.me>
  1 sibling, 2 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-07 21:33 UTC (permalink / raw)


> Now, given that you already verified that the queues are stopped with
> BLK_MQ_S_STOPPED, I'm looking at blk-mq now.
> 
> I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() indeed take
> BLK_MQ_S_STOPPED into account. Theoretically  if we free the queue
> pairs after we passed these checks while the rq_list is being processed
> then we can end-up with this condition, but given that it takes
> essentially forever (10 seconds) I tend to doubt this is the case.
> 
> HCH, Jens, Keith, any useful pointers for us?
> 
> To summarize we see a stray request being queued long after we set
> BLK_MQ_S_STOPPED (and by long I mean 10 seconds).

Does nvme-rdma need to call blk_mq_queue_reinit() after it reinits the tag set
for that queue as part of reconnecting?

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-07 21:08                                     ` Steve Wise
@ 2016-09-08  7:45                                       ` Sagi Grimberg
  2016-09-08 20:47                                         ` Steve Wise
  2016-09-08 21:00                                         ` Steve Wise
       [not found]                                       ` <7f09e373-6316-26a3-ae81-dab1205d88ab@grimberg.me>
  1 sibling, 2 replies; 53+ messages in thread
From: Sagi Grimberg @ 2016-09-08  7:45 UTC (permalink / raw)



>> Does this happen if you change the reconnect delay to be something
>> different than 10 seconds? (say 30?)
>>
>
> Yes.  But I noticed something when performing this experiment that is an
> important point, I think:  if I just bring the network interface down and leave
> it down, we don't crash.  During this state, I see the host continually
> reconnecting after the reconnect delay time, timing out trying to reconnect, and
> retrying after another reconnect_delay period.  I see this for all 10 targets of
> course.  The crash only happens when I bring the interface back up, and the
> targets begin to reconnect.   So the process of successfully reconnecting the
> RDMA QPs, and restarting the nvme queues is somehow triggering running an nvme
> request too soon (or perhaps on the wrong queue).

Interesting. Given this is easy to reproduce, can you record the:
(request_tag, *queue, *qp) for each request submitted?

I'd like to see that the *queue stays the same for each tag
but the *qp indeed changes.

>> Can you also give patch [1] a try? It's not a solution, but I want
>> to see if it hides the problem...
>>
>
> hmm.  I ran the experiment once with [1] and it didn't crash.  I ran it a 2nd
> time and hit a new crash.  Maybe a problem with [1]?

Strange, I don't see how we can visit rdma_destroy_qp twice given
that we have NVME_RDMA_IB_QUEUE_ALLOCATED bit protecting it.

Not sure if it fixes anything, but we probably need it regardless, can
you give another go with this on top:
--
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 43602cebf097..89023326f397 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -542,11 +542,12 @@ static int nvme_rdma_create_queue_ib(struct nvme_rdma_queue *queue,
                 goto out_destroy_qp;
         }
         set_bit(NVME_RDMA_IB_QUEUE_ALLOCATED, &queue->flags);
+       clear_bit(NVME_RDMA_Q_DELETING, &queue->flags);

         return 0;

  out_destroy_qp:
-       ib_destroy_qp(queue->qp);
+       rdma_destroy_qp(queue->qp);
  out_destroy_ib_cq:
         ib_free_cq(queue->ib_cq);
  out:
@@ -591,15 +592,16 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
         if (ret) {
                 dev_info(ctrl->ctrl.device,
                         "rdma_resolve_addr wait failed (%d).\n", ret);
-               goto out_destroy_cm_id;
-               goto out_destroy_queue_ib;
         }

         set_bit(NVME_RDMA_Q_CONNECTED, &queue->flags);

         return 0;

-out_destroy_cm_id:
+out_destroy_queue_ib:
         nvme_rdma_destroy_queue_ib(queue);
+out_destroy_cm_id:
         rdma_destroy_id(queue->cm_id);
         return ret;
  }
--

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-07 21:33                                     ` Steve Wise
@ 2016-09-08  8:22                                       ` Sagi Grimberg
  2016-09-08 17:19                                         ` Steve Wise
       [not found]                                       ` <9fd1f090-3b86-b496-d8c0-225ac0815fbe@grimberg.me>
  1 sibling, 1 reply; 53+ messages in thread
From: Sagi Grimberg @ 2016-09-08  8:22 UTC (permalink / raw)



>> Now, given that you already verified that the queues are stopped with
>> BLK_MQ_S_STOPPED, I'm looking at blk-mq now.
>>
>> I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() indeed take
>> BLK_MQ_S_STOPPED into account. Theoretically  if we free the queue
>> pairs after we passed these checks while the rq_list is being processed
>> then we can end-up with this condition, but given that it takes
>> essentially forever (10 seconds) I tend to doubt this is the case.
>>
>> HCH, Jens, Keith, any useful pointers for us?
>>
>> To summarize we see a stray request being queued long after we set
>> BLK_MQ_S_STOPPED (and by long I mean 10 seconds).
>
> Does nvme-rdma need to call blk_mq_queue_reinit() after it reinits the tag set
> for that queue as part of reconnecting?

I don't see how that'd help...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-08  8:22                                       ` Sagi Grimberg
@ 2016-09-08 17:19                                         ` Steve Wise
  2016-09-09 15:57                                           ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-08 17:19 UTC (permalink / raw)


> >> Now, given that you already verified that the queues are stopped with
> >> BLK_MQ_S_STOPPED, I'm looking at blk-mq now.
> >>
> >> I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() indeed take
> >> BLK_MQ_S_STOPPED into account. Theoretically  if we free the queue
> >> pairs after we passed these checks while the rq_list is being processed
> >> then we can end-up with this condition, but given that it takes
> >> essentially forever (10 seconds) I tend to doubt this is the case.
> >>
> >> HCH, Jens, Keith, any useful pointers for us?
> >>
> >> To summarize we see a stray request being queued long after we set
> >> BLK_MQ_S_STOPPED (and by long I mean 10 seconds).
> >
> > Does nvme-rdma need to call blk_mq_queue_reinit() after it reinits the tag set
> > for that queue as part of reconnecting?
> 
> I don't see how that'd help...
> 

I can't explain this, but the nvme_rdma_queue.flags field has a bit set that
shouldn't be set:

crash> nvme_rdma_queue.flags -x ffff880e52b8e7e8
  flags = 0x14

Bit 2 is set, NVME_RDMA_Q_DELETING, but bit 4 is also set and should never be...

enum nvme_rdma_queue_flags {
        NVME_RDMA_Q_CONNECTED = (1 << 0),
        NVME_RDMA_IB_QUEUE_ALLOCATED = (1 << 1),
        NVME_RDMA_Q_DELETING = (1 << 2),
};

The rest of the structure looks fine.  I've also seen crash dumps where bit 3 is
set which is also not used.  

/me confused...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]                                           ` <01bc01d209f5$1b7d7510$52785f30$@opengridcomputing.com>
@ 2016-09-08 19:15                                             ` Steve Wise
  0 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-08 19:15 UTC (permalink / raw)


 
> > >> Now, given that you already verified that the queues are stopped with
> > >> BLK_MQ_S_STOPPED, I'm looking at blk-mq now.
> > >>
> > >> I see that blk_mq_run_hw_queue() and __blk_mq_run_hw_queue() indeed take
> > >> BLK_MQ_S_STOPPED into account. Theoretically  if we free the queue
> > >> pairs after we passed these checks while the rq_list is being processed
> > >> then we can end-up with this condition, but given that it takes
> > >> essentially forever (10 seconds) I tend to doubt this is the case.
> > >>
> > >> HCH, Jens, Keith, any useful pointers for us?
> > >>
> > >> To summarize we see a stray request being queued long after we set
> > >> BLK_MQ_S_STOPPED (and by long I mean 10 seconds).
> > >
> > > Does nvme-rdma need to call blk_mq_queue_reinit() after it reinits the tag set
> > > for that queue as part of reconnecting?
> >
> > I don't see how that'd help...
> >
> 
> I can't explain this, but the nvme_rdma_queue.flags field has a bit set
> that shouldn't be set:
> 
> crash> nvme_rdma_queue.flags -x ffff880e52b8e7e8
>   flags = 0x14
> 
> Bit 2 is set, NVME_RDMA_Q_DELETING, but bit 4 is also set and should never
> be...
> 
> enum nvme_rdma_queue_flags {
>         NVME_RDMA_Q_CONNECTED = (1 << 0),
>         NVME_RDMA_IB_QUEUE_ALLOCATED = (1 << 1),
>         NVME_RDMA_Q_DELETING = (1 << 2),
> };
> 
> The rest of the structure looks fine.  I've also seen crash dumps where
> bit 3 is set which is also not used.
> 
> /me confused...
> 

While working this with debug code to verify that we never create a qp, cq, or
cm_id where one already exists for an nvme_rdma_queue, I discovered a bug where
the Q_DELETING flag is never cleared, and thus a reconnect can leak qps and
cm_ids.  The fix, I think, is this:

@@ -563,6 +572,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
        int ret;

        queue = &ctrl->queues[idx];
+       queue->flags = 0;
        queue->ctrl = ctrl;
        init_completion(&queue->cm_done);

I think maybe the clearing of the Q_DELETING flag was lost when we moved to
using the ib_client for device removal.   I'll polish this up and submit a
patch. It should go with the next 4.8-rc push I think.

This doesn't resolve the original failure I'm chasing in this thread though :(

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]                                           ` <01f201d20a05$6abde5f0$4039b1d0$@opengridcomputing.com>
@ 2016-09-08 19:26                                             ` Steve Wise
       [not found]                                             ` <01f401d20a06$d4cc8360$7e658a20$@opengridcomputing.com>
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-08 19:26 UTC (permalink / raw)


> While working this with debug code to verify that we never create a qp,
> cq, or cm_id where one already exists for an nvme_rdma_queue, I discovered
> a bug where the Q_DELETING flag is never cleared, and thus a reconnect can
> leak qps and cm_ids.  The fix, I think, is this:
> 
> @@ -563,6 +572,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
>         int ret;
> 
>         queue = &ctrl->queues[idx];
> +       queue->flags = 0;
>         queue->ctrl = ctrl;
>         init_completion(&queue->cm_done);
> 
> I think maybe the clearing of the Q_DELETING flag was lost when we moved
> to using the ib_client for device removal.   I'll polish this up and
> submit a patch. It should go with the next 4.8-rc push I think.

Actually, I think the Q_DELETING flag is no longer needed.  Sagi, can you have a
look at NVME_RDMA_Q_DELETING in the latest code?  I think the ib_client patch
made the original Q_DELETING patch obsolete.  And the original Q_DELETING patch
probably needed the above chunk for correctness...

Let me know if you want me to submit something for this issue.  We could fix the
original patches as they are still only in your nvmf-4.8-rc repo...

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]                                             ` <01f401d20a06$d4cc8360$7e658a20$@opengridcomputing.com>
@ 2016-09-08 20:44                                               ` Steve Wise
  0 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-08 20:44 UTC (permalink / raw)


> > While working this with debug code to verify that we never create a qp,
> > cq, or cm_id where one already exists for an nvme_rdma_queue, I discovered
> > a bug where the Q_DELETING flag is never cleared, and thus a reconnect can
> > leak qps and cm_ids.  The fix, I think, is this:
> >
> > @@ -563,6 +572,7 @@ static int nvme_rdma_init_queue(struct nvme_rdma_ctrl *ctrl,
> >         int ret;
> >
> >         queue = &ctrl->queues[idx];
> > +       queue->flags = 0;
> >         queue->ctrl = ctrl;
> >         init_completion(&queue->cm_done);
> >
> > I think maybe the clearing of the Q_DELETING flag was lost when we moved
> > to using the ib_client for device removal.   I'll polish this up and
> > submit a patch. It should go with the next 4.8-rc push I think.
> 
> Actually, I think the Q_DELETING flag is no longer needed.  Sagi, can you
> have look at NVME_RDMA_Q_DELETING in the latest code?  I think the
> ib_client patch made the original Q_DELETING patch obsolete.  And the
> original Q_DELETING patch probably needed the above chunk for
> correctness...
> 
> Let me know if you want me to submit something for this issue.  We could
> fix the original patches as they are still only in your nvmf-4.8-rc
> repo...

I see your debug patch v2 you sent me in an earlier email has a clear_bit() to
address the DELETING issue.  I haven't tried that patch yet. :)  It's next on my
list...

steve

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-08  7:45                                       ` Sagi Grimberg
@ 2016-09-08 20:47                                         ` Steve Wise
  2016-09-08 21:00                                         ` Steve Wise
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-08 20:47 UTC (permalink / raw)


> >> Does this happen if you change the reconnect delay to be something
> >> different than 10 seconds? (say 30?)
> >>
> >
> > Yes.  But I noticed something when performing this experiment that is an
> > important point, I think:  if I just bring the network interface down and leave
> > it down, we don't crash.  During this state, I see the host continually
> > reconnecting after the reconnect delay time, timing out trying to reconnect, and
> > retrying after another reconnect_delay period.  I see this for all 10 targets of
> > course.  The crash only happens when I bring the interface back up, and the
> > targets begin to reconnect.   So the process of successfully reconnecting the
> > RDMA QPs, and restarting the nvme queues is somehow triggering running an nvme
> > request too soon (or perhaps on the wrong queue).
> 
> Interesting. Given this is easy to reproduce, can you record the:
> (request_tag, *queue, *qp) for each request submitted?
> 
> I'd like to see that the *queue stays the same for each tag
> but the *qp indeed changes.
> 

I tried this, and didn't hit the BUG_ON(), yet still hit the crash.  I believe
this verifies that  *queue never changed...

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index c075ea5..a77729e 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -76,6 +76,7 @@ struct nvme_rdma_request {
        struct ib_reg_wr        reg_wr;
        struct ib_cqe           reg_cqe;
        struct nvme_rdma_queue  *queue;
+       struct nvme_rdma_queue  *save_queue;
        struct sg_table         sg_table;
        struct scatterlist      first_sgl[];
 };
@@ -354,6 +355,8 @@ static int __nvme_rdma_init_request(struct nvme_rdma_ctrl *ctrl,
        }

        req->queue = queue;
+       if (!req->save_queue)
+               req->save_queue = queue;

        return 0;

@@ -1434,6 +1436,9 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,

        WARN_ON_ONCE(rq->tag < 0);

+       BUG_ON(queue != req->queue);
+       BUG_ON(queue != req->save_queue);
+
        dev = queue->device->dev;
        ib_dma_sync_single_for_cpu(dev, sqe->dma,
                        sizeof(struct nvme_command), DMA_TO_DEVICE);

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-08  7:45                                       ` Sagi Grimberg
  2016-09-08 20:47                                         ` Steve Wise
@ 2016-09-08 21:00                                         ` Steve Wise
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-08 21:00 UTC (permalink / raw)


> 
> >> Can you also give patch [1] a try? It's not a solution, but I want
> >> to see if it hides the problem...
> >>
> >
> > hmm.  I ran the experiment once with [1] and it didn't crash.  I ran it a 2nd
> > time and hit a new crash.  Maybe a problem with [1]?
> 
> Strange, I don't see how we can visit rdma_destroy_qp twice given
> that we have NVME_RDMA_IB_QUEUE_ALLOCATED bit protecting it.
> 
> Not sure if it fixes anything, but we probably need it regardless, can
> you give another go with this on top:

Still hit it with this on top (had to tweak the patch a little).

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]                                           ` <021201d20a14$0f203b80$2d60b280$@opengridcomputing.com>
@ 2016-09-08 21:21                                             ` Steve Wise
  0 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-08 21:21 UTC (permalink / raw)


> >
> > >> Can you also give patch [1] a try? It's not a solution, but I want
> > >> to see if it hides the problem...
> > >>
> > >
> > > hmm.  I ran the experiment once with [1] and it didn't crash.  I ran it a 2nd
> > > time and hit a new crash.  Maybe a problem with [1]?
> >
> > Strange, I don't see how we can visit rdma_destroy_qp twice given
> > that we have NVME_RDMA_IB_QUEUE_ALLOCATED bit protecting it.
> >
> > Not sure if it fixes anything, but we probably need it regardless, can
> > you give another go with this on top:
> 
> Still hit it with this on top (had to tweak the patch a little).
> 
> Steve.

So with this patch, the crash is a little different.  One thread is in the usual
place crashed in nvme_rdma_post_send() called by nvme_rdma_queue_rq() because
the qp and cm_id in the nvme_rdma_queue have been freed.   Actually there are a
handful of CPUs processing different requests in the same type stack trace.  But
perhaps that is expected given the work load and number of controllers (10) and
queues (32 per controller)...

I also see another worker thread here:

PID: 3769   TASK: ffff880e18972f40  CPU: 3   COMMAND: "kworker/3:3"
 #0 [ffff880e2f7938d0] __schedule at ffffffff816dfa17
 #1 [ffff880e2f793930] schedule at ffffffff816dff00
 #2 [ffff880e2f793980] schedule_timeout at ffffffff816e2b1b
 #3 [ffff880e2f793a60] wait_for_completion_timeout at ffffffff816e0f03
 #4 [ffff880e2f793ad0] destroy_cq at ffffffffa061d8f3 [iw_cxgb4]
 #5 [ffff880e2f793b60] c4iw_destroy_cq at ffffffffa061dad5 [iw_cxgb4]
 #6 [ffff880e2f793bf0] ib_free_cq at ffffffffa0360e5a [ib_core]
 #7 [ffff880e2f793c20] nvme_rdma_destroy_queue_ib at ffffffffa0644e9b
[nvme_rdma]
 #8 [ffff880e2f793c60] nvme_rdma_stop_and_free_queue at ffffffffa0645083
[nvme_rdma]
 #9 [ffff880e2f793c80] nvme_rdma_reconnect_ctrl_work at ffffffffa0645a9f
[nvme_rdma]
#10 [ffff880e2f793cb0] process_one_work at ffffffff810a1613
#11 [ffff880e2f793d90] worker_thread at ffffffff810a22ad
#12 [ffff880e2f793ec0] kthread at ffffffff810a6dec
#13 [ffff880e2f793f50] ret_from_fork at ffffffff816e3bbf

I'm trying to identify if this reconnect is for the same controller that crashed
processing a request.  It probably is, but I need to search the stack frame to
try and find the controller pointer...

Steve.

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
       [not found]                                           ` <021401d20a16$ed60d470$c8227d50$@opengridcomputing.com>
       [not found]                                             ` <021501d20a19$327ba5b0$9772f110$@opengridcomputing.com>
@ 2016-09-08 21:37                                             ` Steve Wise
  2016-09-09 15:50                                               ` Steve Wise
  1 sibling, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-08 21:37 UTC (permalink / raw)


> > > Not sure if it fixes anything, but we probably need it regardless, can
> > > you give another go with this on top:
> >
> > Still hit it with this on top (had to tweak the patch a little).
> >
> > Steve.
> 
> So with this patch, the crash is a little different.  One thread is in the
> usual place crashed in nvme_rdma_post_send() called by
> nvme_rdma_queue_rq() because the qp and cm_id in the nvme_rdma_queue have
> been freed.   Actually there are a handful of CPUs processing different
> requests in the same type stack trace.  But perhaps that is expected given
> the work load and number of controllers (10) and queues (32 per
> controller)...
> 
> I also see another worker thread here:
> 
> PID: 3769   TASK: ffff880e18972f40  CPU: 3   COMMAND: "kworker/3:3"
>  #0 [ffff880e2f7938d0] __schedule at ffffffff816dfa17
>  #1 [ffff880e2f793930] schedule at ffffffff816dff00
>  #2 [ffff880e2f793980] schedule_timeout at ffffffff816e2b1b
>  #3 [ffff880e2f793a60] wait_for_completion_timeout at ffffffff816e0f03
>  #4 [ffff880e2f793ad0] destroy_cq at ffffffffa061d8f3 [iw_cxgb4]
>  #5 [ffff880e2f793b60] c4iw_destroy_cq at ffffffffa061dad5 [iw_cxgb4]
>  #6 [ffff880e2f793bf0] ib_free_cq at ffffffffa0360e5a [ib_core]
>  #7 [ffff880e2f793c20] nvme_rdma_destroy_queue_ib at ffffffffa0644e9b
> [nvme_rdma]
>  #8 [ffff880e2f793c60] nvme_rdma_stop_and_free_queue at ffffffffa0645083
> [nvme_rdma]
>  #9 [ffff880e2f793c80] nvme_rdma_reconnect_ctrl_work at ffffffffa0645a9f
> [nvme_rdma]
> #10 [ffff880e2f793cb0] process_one_work at ffffffff810a1613
> #11 [ffff880e2f793d90] worker_thread at ffffffff810a22ad
> #12 [ffff880e2f793ec0] kthread at ffffffff810a6dec
> #13 [ffff880e2f793f50] ret_from_fork at ffffffff816e3bbf
> 
> I'm trying to identify if this reconnect is for the same controller that
> crashed processing a request.  It probably is, but I need to search the
> stack frame to try and find the controller pointer...
> 
> Steve.

Both the thread that crashed and the thread doing reconnect are operating on
ctrl "nvme2"...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-08 21:37                                             ` Steve Wise
@ 2016-09-09 15:50                                               ` Steve Wise
  2016-09-12 20:10                                                 ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-09 15:50 UTC (permalink / raw)


> > So with this patch, the crash is a little different.  One thread is in the
> > usual place crashed in nvme_rdma_post_send() called by
> > nvme_rdma_queue_rq() because the qp and cm_id in the nvme_rdma_queue have
> > been freed.   Actually there are a handful of CPUs processing different
> > requests in the same type stack trace.  But perhaps that is expected given
> > the work load and number of controllers (10) and queues (32 per
> > controller)...
> >
> > I also see another worker thread here:
> >
> > PID: 3769   TASK: ffff880e18972f40  CPU: 3   COMMAND: "kworker/3:3"
> >  #0 [ffff880e2f7938d0] __schedule at ffffffff816dfa17
> >  #1 [ffff880e2f793930] schedule at ffffffff816dff00
> >  #2 [ffff880e2f793980] schedule_timeout at ffffffff816e2b1b
> >  #3 [ffff880e2f793a60] wait_for_completion_timeout at ffffffff816e0f03
> >  #4 [ffff880e2f793ad0] destroy_cq at ffffffffa061d8f3 [iw_cxgb4]
> >  #5 [ffff880e2f793b60] c4iw_destroy_cq at ffffffffa061dad5 [iw_cxgb4]
> >  #6 [ffff880e2f793bf0] ib_free_cq at ffffffffa0360e5a [ib_core]
> >  #7 [ffff880e2f793c20] nvme_rdma_destroy_queue_ib at ffffffffa0644e9b
> > [nvme_rdma]
> >  #8 [ffff880e2f793c60] nvme_rdma_stop_and_free_queue at ffffffffa0645083
> > [nvme_rdma]
> >  #9 [ffff880e2f793c80] nvme_rdma_reconnect_ctrl_work at ffffffffa0645a9f
> > [nvme_rdma]
> > #10 [ffff880e2f793cb0] process_one_work at ffffffff810a1613
> > #11 [ffff880e2f793d90] worker_thread at ffffffff810a22ad
> > #12 [ffff880e2f793ec0] kthread at ffffffff810a6dec
> > #13 [ffff880e2f793f50] ret_from_fork at ffffffff816e3bbf
> >
> > I'm trying to identify if this reconnect is for the same controller that
> > crashed processing a request.  It probably is, but I need to search the
> > stack frame to try and find the controller pointer...
> >
> > Steve.
> 
> Both the thread that crashed and the thread doing reconnect are operating on
> ctrl "nvme2"...

I'm reanalyzing the crash dump for this particular crash, and I've found the
blk_mq_hw_ctx struct that has ->driver_data == to the nvme_rdma_queue that
caused the crash.  hctx->state, though, is 2, which is the BLK_MQ_S_TAG_ACTIVE
bit.  IE the BLK_MQ_S_STOPPED bit is _not_ set!

Attached are the blk_mq_hw_ctx, nvme_rdma_queue, and nvme_rdma_ctrl structs, as
well as the nvme_rdma_request, request and request_queue structs if you want to
have a look...

Steve.


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: crash_analysis.txt
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20160909/6979423a/attachment-0001.txt>

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-08 17:19                                         ` Steve Wise
@ 2016-09-09 15:57                                           ` Steve Wise
  0 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-09 15:57 UTC (permalink / raw)


> >
> 
> I can't explain this, but the nvme_rdma_queue.flags field has a bit set that
> shouldn't be set:
> 
> crash> nvme_rdma_queue.flags -x ffff880e52b8e7e8
>   flags = 0x14
> 
> Bit 2 is set, NVME_RDMA_Q_DELETING, but bit 4 is also set and should never be...
> 
> enum nvme_rdma_queue_flags {
>         NVME_RDMA_Q_CONNECTED = (1 << 0),
>         NVME_RDMA_IB_QUEUE_ALLOCATED = (1 << 1),
>         NVME_RDMA_Q_DELETING = (1 << 2),
> };
> 
> The rest of the structure looks fine.  I've also seen crash dumps where bit 3 is
> set which is also not used.
> 
> /me confused...

I'm dumb:  1<<0 is 1, so CONNECTED is bit 1, QUEUE_ALLOCATED is bit 2, and
Q_DELETING is bit 4!  Bits 0 and 3 are not used.  So 0x14 is bits 4 and 2:
DELETING and QUEUE_ALLOCATED.  The queue_flags enum should not be using the
(1<<X) initialization.  Rather, they should be 0, 1, 2, etc...
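
A quick stand-alone demo of that decode (user-space sketch, not kernel code;
set_bit()/test_bit() take a bit *number*, so the (1 << X) enum values land on
bits 1, 2 and 4):
--
#include <stdio.h>

enum nvme_rdma_queue_flags {
        NVME_RDMA_Q_CONNECTED           = (1 << 0),     /* value 1 -> bit 1 */
        NVME_RDMA_IB_QUEUE_ALLOCATED    = (1 << 1),     /* value 2 -> bit 2 */
        NVME_RDMA_Q_DELETING            = (1 << 2),     /* value 4 -> bit 4 */
};

int main(void)
{
        unsigned long flags = 0;

        /* Model set_bit(nr, addr): it shifts by the bit number passed in. */
        flags |= 1UL << NVME_RDMA_IB_QUEUE_ALLOCATED;   /* sets bit 2 */
        flags |= 1UL << NVME_RDMA_Q_DELETING;           /* sets bit 4 */

        printf("flags = 0x%lx\n", flags);       /* prints 0x14, as in the dump */
        return 0;
}
--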
 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-09 15:50                                               ` Steve Wise
@ 2016-09-12 20:10                                                 ` Steve Wise
       [not found]                                                   ` <da2e918b-0f18-e032-272d-368c6ec49c62@grimberg.me>
                                                                     ` (2 more replies)
  0 siblings, 3 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-12 20:10 UTC (permalink / raw)


> I'm reanalyzing the crash dump for this particular crash, and I've found the
> blk_mq_hw_ctx struct that has ->driver_data == to the nvme_rdma_queue that
> caused the crash.  hctx->state, though, is 2, which is the BLK_MQ_S_TAG_ACTIVE
> bit.  IE the BLK_MQ_S_STOPPED bit is _not_ set!
> 
> Attached are the blk_mq_hw_ctx, nvme_rdma_queue, and nvme_rdma_ctrl structs, as
> well as the nvme_rdma_request, request and request_queue structs if you want to
> have a look...
> 
> Steve.
> 

Hey Sagi,

I've added this debug logic to try and assert that an nvme_rdma_queue is
associated with one and only one blk_mq_hw_ctx:

---
diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index 64434a1..ab7edf7 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -100,6 +100,7 @@ struct nvme_rdma_queue {
        struct rdma_cm_id       *cm_id;
        int                     cm_error;
        struct completion       cm_done;
+       struct blk_mq_hw_ctx *hctx;
 };

 struct nvme_rdma_ctrl {
@@ -383,7 +384,12 @@ static int nvme_rdma_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,
        BUG_ON(hctx_idx >= ctrl->queue_count);

        hctx->driver_data = queue;
+       queue->hctx = hctx;
        return 0;
 }

@@ -396,6 +398,7 @@ static int nvme_rdma_init_admin_hctx(struct blk_mq_hw_ctx *hctx, void *data,
        BUG_ON(hctx_idx != 0);

        hctx->driver_data = queue;
+       queue->hctx = hctx;
        return 0;
 }

@@ -622,6 +625,7 @@ static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
 {
        if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
                return;
+       BUG_ON(!test_bit(BLK_MQ_S_STOPPED, &queue->hctx->state));
        nvme_rdma_stop_queue(queue);
        nvme_rdma_free_queue(queue);
 }
@@ -1408,6 +1412,8 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,

        WARN_ON_ONCE(rq->tag < 0);

+       BUG_ON(hctx != queue->hctx);
+       BUG_ON(test_bit(BLK_MQ_S_STOPPED, &hctx->state));
        dev = queue->device->dev;
        ib_dma_sync_single_for_cpu(dev, sqe->dma,
                        sizeof(struct nvme_command), DMA_TO_DEVICE);
---

When I reran the test forcing reconnects, I hit the BUG_ON(hctx != queue->hctx)
in nvme_rdma_queue_rq() when doing the first reconnect (not when initially
connecting the targets).   Here is the back trace.  Is my debug logic flawed?
Or does this mean something is screwed up once we start reconnecting.

crash> bt
PID: 1819   TASK: ffff88101d0217c0  CPU: 0   COMMAND: "kworker/0:2"
 #0 [ffff8810090d34b0] machine_kexec at ffffffff8105fbd0
 #1 [ffff8810090d3520] __crash_kexec at ffffffff81116998
 #2 [ffff8810090d35f0] crash_kexec at ffffffff81116a6d
 #3 [ffff8810090d3620] oops_end at ffffffff81032bd6
 #4 [ffff8810090d3650] die at ffffffff810330cb
 #5 [ffff8810090d3680] do_trap at ffffffff8102fff1
 #6 [ffff8810090d36e0] do_error_trap at ffffffff8103032d
 #7 [ffff8810090d37a0] do_invalid_op at ffffffff81030480
 #8 [ffff8810090d37b0] invalid_op at ffffffff816e47be
    [exception RIP: nvme_rdma_queue_rq+621]
    RIP: ffffffffa065ce3d  RSP: ffff8810090d3868  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: ffff880e33640000  RCX: dead000000000200
    RDX: ffff8810090d3928  RSI: ffff8810090d38f8  RDI: ffff880e315cb528
    RBP: ffff8810090d38a8   R8: ffff880e33640000   R9: 0000000000000000
    R10: 0000000000000674  R11: ffff8810090d3a18  R12: ffff880e36ab91d0
    R13: ffff880e33640170  R14: ffff880e315cb528  R15: ffff880e36bc1138
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8810090d38b0] __blk_mq_run_hw_queue at ffffffff81338b1b
#10 [ffff8810090d3a00] blk_mq_run_hw_queue at ffffffff81338ffe
#11 [ffff8810090d3a20] blk_mq_insert_request at ffffffff8133a130
#12 [ffff8810090d3a90] blk_execute_rq_nowait at ffffffff813342dd
#13 [ffff8810090d3ad0] blk_execute_rq at ffffffff8133442e
#14 [ffff8810090d3b80] __nvme_submit_sync_cmd at ffffffffa02715d5 [nvme_core]
#15 [ffff8810090d3bd0] nvmf_connect_io_queue at ffffffffa064d134 [nvme_fabrics]
#16 [ffff8810090d3c80] nvme_rdma_reconnect_ctrl_work at ffffffffa065cafb
[nvme_rdma]
#17 [ffff8810090d3cb0] process_one_work at ffffffff810a1613
#18 [ffff8810090d3d90] worker_thread at ffffffff810a22ad
#19 [ffff8810090d3ec0] kthread at ffffffff810a6dec
#20 [ffff8810090d3f50] ret_from_fork at ffffffff816e3bbf
crash> gdb list *nvme_rdma_queue_rq+621
0xffffffffa065ce3d is in nvme_rdma_queue_rq (drivers/nvme/host/rdma.c:1415).
1410            unsigned int map_len;
1411            int ret;
1412
1413            WARN_ON_ONCE(rq->tag < 0);
1414
1415            BUG_ON(hctx != queue->hctx);
1416            BUG_ON(test_bit(BLK_MQ_S_STOPPED, &hctx->state));
1417            dev = queue->device->dev;
1418            ib_dma_sync_single_for_cpu(dev, sqe->dma,
1419                            sizeof(struct nvme_command), DMA_TO_DEVICE);

^ permalink raw reply related	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-12 20:10                                                 ` Steve Wise
       [not found]                                                   ` <da2e918b-0f18-e032-272d-368c6ec49c62@grimberg.me>
@ 2016-09-15  9:53                                                   ` Sagi Grimberg
  2016-09-15 14:44                                                     ` Steve Wise
  2016-09-15 14:00                                                   ` Gabriel Krisman Bertazi
  2 siblings, 1 reply; 53+ messages in thread
From: Sagi Grimberg @ 2016-09-15  9:53 UTC (permalink / raw)



> @@ -1408,6 +1412,8 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
>
>         WARN_ON_ONCE(rq->tag < 0);
>
> +       BUG_ON(hctx != queue->hctx);
> +       BUG_ON(test_bit(BLK_MQ_S_STOPPED, &hctx->state));
>         dev = queue->device->dev;
>         ib_dma_sync_single_for_cpu(dev, sqe->dma,
>                         sizeof(struct nvme_command), DMA_TO_DEVICE);
> ---
>
> When I reran the test forcing reconnects, I hit the BUG_ON(hctx != queue->hctx)
> in nvme_rdma_queue_rq() when doing the first reconnect (not when initially
> connecting the targets).   Here is the back trace.  Is my debug logic flawed?
> Or does this mean something is screwed up once we start reconnecting.

This is weird indeed.

The fact that you trigger this means that you successfully reconnect
correct?

If queue is corrupted it would explain the bogus post on a freed or
non-existing qp...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-12 20:10                                                 ` Steve Wise
       [not found]                                                   ` <da2e918b-0f18-e032-272d-368c6ec49c62@grimberg.me>
  2016-09-15  9:53                                                   ` Sagi Grimberg
@ 2016-09-15 14:00                                                   ` Gabriel Krisman Bertazi
  2016-09-15 14:31                                                     ` Steve Wise
  2 siblings, 1 reply; 53+ messages in thread
From: Gabriel Krisman Bertazi @ 2016-09-15 14:00 UTC (permalink / raw)


"Steve Wise" <swise at opengridcomputing.com> writes:


> @@ -622,6 +625,7 @@ static void nvme_rdma_stop_and_free_queue(struct nvme_rdma_queue *queue)
>  {
>         if (test_and_set_bit(NVME_RDMA_Q_DELETING, &queue->flags))
>                 return;
> +       BUG_ON(!test_bit(BLK_MQ_S_STOPPED, &queue->hctx->state));
>         nvme_rdma_stop_queue(queue);
>         nvme_rdma_free_queue(queue);
>  }
> @@ -1408,6 +1412,8 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
>
>         WARN_ON_ONCE(rq->tag < 0);
>
> +       BUG_ON(hctx != queue->hctx);
> +       BUG_ON(test_bit(BLK_MQ_S_STOPPED, &hctx->state));
>         dev = queue->device->dev;
>         ib_dma_sync_single_for_cpu(dev, sqe->dma,
>                         sizeof(struct nvme_command), DMA_TO_DEVICE);
>

This reminds me of the discussion I had with Jens a few weeks ago here:

http://lists.infradead.org/pipermail/linux-nvme/2016-August/005916.html

The BUG_ON I hit is similar to yours, but for nvme over PCI.  I think
the update queues code will reach a similar path of remapping, but I
didn't go out and check yet.

Can you check you are running with the patch he mentioned at:

http://lists.infradead.org/pipermail/linux-nvme/2016-August/005962.html

Thanks,

-- 
Gabriel Krisman Bertazi

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-15 14:00                                                   ` Gabriel Krisman Bertazi
@ 2016-09-15 14:31                                                     ` Steve Wise
  0 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-15 14:31 UTC (permalink / raw)


> 
> This reminds me of the discussion I had with Jens a few weeks ago here:
> 
> http://lists.infradead.org/pipermail/linux-nvme/2016-August/005916.html
> 
> The BUG_ON I hit is similar to yours, but for nvme over PCI.  I think
> the update queues code will reach a similar path of remapping, but I
> didnt go out and check yet.
> 
> Can you check you are running with the patch he mentioned at:
> 
> http://lists.infradead.org/pipermail/linux-nvme/2016-August/005962.html

I don't have this patch, and blk_mq_hctx_cpu_offline() is very different from
the code in that patch.

I'm using 4.8-rc5 plus these patches from nvmf-4.8-rc and linux-block.  This
matches Jens' for-linus tag at this point in time.  It will all hit -rc6.

cbee748 nvme-rdma: add back dependency on CONFIG_BLOCK
b8ce92e nvme-rdma: fix null pointer dereference on req->mr
c51b3c7 nvme-rdma: use ib_client API to detect device removal
2b24ee4 nvme-rdma: add DELETING queue flag
8c7d713 nvme-rdma: destroy nvme queue rdma resources on connect failure
4604d4e nvme_rdma: keep a ref on the ctrl during delete/flush
372a82d iw_cxgb4: block module unload until all ep resources are released
7053902 iw_cxgb4: call dev_put() on l2t allocation failure
82469c5 nvme: Don't suspend admin queue that wasn't created
c693593 Linux 4.8-rc5

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-15  9:53                                                   ` Sagi Grimberg
@ 2016-09-15 14:44                                                     ` Steve Wise
  2016-09-15 15:10                                                       ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-15 14:44 UTC (permalink / raw)


> > @@ -1408,6 +1412,8 @@ static int nvme_rdma_queue_rq(struct blk_mq_hw_ctx *hctx,
> >
> >         WARN_ON_ONCE(rq->tag < 0);
> >
> > +       BUG_ON(hctx != queue->hctx);
> > +       BUG_ON(test_bit(BLK_MQ_S_STOPPED, &hctx->state));
> >         dev = queue->device->dev;
> >         ib_dma_sync_single_for_cpu(dev, sqe->dma,
> >                         sizeof(struct nvme_command), DMA_TO_DEVICE);
> > ---
> >
> > When I reran the test forcing reconnects, I hit the BUG_ON(hctx != queue->hctx)
> > in nvme_rdma_queue_rq() when doing the first reconnect (not when initially
> > connecting the targets).   Here is the back trace.  Is my debug logic flawed?
> > Or does this mean something is screwed up once we start reconnecting.
> 
> This is weird indeed.
> 
> The fact that you trigger this means that you successfully reconnect
> correct?
>

The state of the controller is NVME_CTRL_RECONNECTING.  In fact, this BUG_ON()
happened on the reconnect worker thread.   Ah, this is probably the connect
command on the admin queue maybe?


PID: 1819   TASK: ffff88101d0217c0  CPU: 0   COMMAND: "kworker/0:2"
 #0 [ffff8810090d34b0] machine_kexec at ffffffff8105fbd0
 #1 [ffff8810090d3520] __crash_kexec at ffffffff81116998
 #2 [ffff8810090d35f0] crash_kexec at ffffffff81116a6d
 #3 [ffff8810090d3620] oops_end at ffffffff81032bd6
 #4 [ffff8810090d3650] die at ffffffff810330cb
 #5 [ffff8810090d3680] do_trap at ffffffff8102fff1
 #6 [ffff8810090d36e0] do_error_trap at ffffffff8103032d
 #7 [ffff8810090d37a0] do_invalid_op at ffffffff81030480
 #8 [ffff8810090d37b0] invalid_op at ffffffff816e47be
    [exception RIP: nvme_rdma_queue_rq+621]
    RIP: ffffffffa065ce3d  RSP: ffff8810090d3868  RFLAGS: 00010206
    RAX: 0000000000000000  RBX: ffff880e33640000  RCX: dead000000000200
    RDX: ffff8810090d3928  RSI: ffff8810090d38f8  RDI: ffff880e315cb528
    RBP: ffff8810090d38a8   R8: ffff880e33640000   R9: 0000000000000000
    R10: 0000000000000674  R11: ffff8810090d3a18  R12: ffff880e36ab91d0
    R13: ffff880e33640170  R14: ffff880e315cb528  R15: ffff880e36bc1138
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #9 [ffff8810090d38b0] __blk_mq_run_hw_queue at ffffffff81338b1b
#10 [ffff8810090d3a00] blk_mq_run_hw_queue at ffffffff81338ffe
#11 [ffff8810090d3a20] blk_mq_insert_request at ffffffff8133a130
#12 [ffff8810090d3a90] blk_execute_rq_nowait at ffffffff813342dd
#13 [ffff8810090d3ad0] blk_execute_rq at ffffffff8133442e
#14 [ffff8810090d3b80] __nvme_submit_sync_cmd at ffffffffa02715d5 [nvme_core]
#15 [ffff8810090d3bd0] nvmf_connect_io_queue at ffffffffa064d134 [nvme_fabrics]
#16 [ffff8810090d3c80] nvme_rdma_reconnect_ctrl_work at ffffffffa065cafb
[nvme_rdma]
#17 [ffff8810090d3cb0] process_one_work at ffffffff810a1613
#18 [ffff8810090d3d90] worker_thread at ffffffff810a22ad
#19 [ffff8810090d3ec0] kthread at ffffffff810a6dec
#20 [ffff8810090d3f50] ret_from_fork at ffffffff816e3bbf

 

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-15 14:44                                                     ` Steve Wise
@ 2016-09-15 15:10                                                       ` Steve Wise
  2016-09-15 15:53                                                         ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-15 15:10 UTC (permalink / raw)


> The state of the controller is NVME_CTRL_RECONNECTING.  In fact, this BUG_ON()
> happened on the reconnect worker thread.   Ah, this is probably the connect
> command on the admin queue maybe?
> 
> 
> 

The queue being used at the crash is nvme_rdma_ctrl->queues[1].  IE not the
admin queue.  The reconnect work thread is connecting the io queues here:

crash> gdb list *nvme_rdma_reconnect_ctrl_work+0x14b
0xffffffffa065cafb is in nvme_rdma_reconnect_ctrl_work
(drivers/nvme/host/rdma.c:647).
642     {
643             int i, ret = 0;
644
645             for (i = 1; i < ctrl->queue_count; i++) {
646                     ret = nvmf_connect_io_queue(&ctrl->ctrl, i);
647                     if (ret)
648                             break;
649             }
650
651             return ret;


nvmf_connect_io_queue() is here:

crash> gdb list *nvmf_connect_io_queue+0x114
0xffffffffa064d134 is in nvmf_connect_io_queue
(drivers/nvme/host/fabrics.c:454).
449             strncpy(data->hostnqn, ctrl->opts->host->nqn, NVMF_NQN_SIZE);
450
451             ret = __nvme_submit_sync_cmd(ctrl->connect_q, &cmd, &cqe,
452                             data, sizeof(*data), 0, qid, 1,
453                             BLK_MQ_REQ_RESERVED | BLK_MQ_REQ_NOWAIT);
454             if (ret) {
455                     nvmf_log_connect_error(ctrl, ret,
le32_to_cpu(cqe.result),
456                                            &cmd, data);
457             }
458             kfree(data);


The hctx passed into nvme_rdma_queue_rq() is in state BLK_MQ_S_TAG_ACTIVE.  And
hctx->driver_data is the nvme_rdma_queue to be used.  That nvme_rdma_queue has a
different hctx pointer (from my debug code) and that's why we hit the BUG_ON().
Anyway, nvme_rdma_queue->hctx->state is BLK_MQ_S_STOPPED.  So this is more
evidence that somehow an hctx is using an nvme_rdma_queue that wasn't originally
assigned to that hctx...

^ permalink raw reply	[flat|nested] 53+ messages in thread

* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-15 15:10                                                       ` Steve Wise
@ 2016-09-15 15:53                                                         ` Steve Wise
  2016-09-15 16:45                                                           ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-15 15:53 UTC (permalink / raw)


> 
> The hctx passed into nvme_rdma_queue_rq() is in state BLK_MQ_S_TAG_ACTIVE.  And
> hctx->driver_data is the nvme_rdma_queue to be used.  That nvme_rdma_queue has a
> different hctx pointer (from my debug code) and that's why we hit the BUG_ON().
> Anyway, nvme_rdma_queue->hctx->state is BLK_MQ_S_STOPPED.  So this is more
> evidence that somehow an hctx is using an nvme_rdma_queue that wasn't originally
> assigned to that hctx...
> 

I added this to my debug patch.  It will warn if an hctx is initialized twice,
or if a 2nd hctx is bound to an nvme_rdma_queue that is already bound to
another hctx:

@@ -383,7 +384,12 @@ static int nvme_rdma_init_hctx(struct blk_mq_hw_ctx *hctx, void *data,

        BUG_ON(hctx_idx >= ctrl->queue_count);

+       dev_warn(ctrl->ctrl.device, "%s hctx_idx %u hctx %p hctx->driver_data %p queue %p queue->hctx %p\n",
+               __func__, hctx_idx, hctx, hctx->driver_data, queue, queue->hctx);
+       WARN_ON_ONCE(hctx->driver_data);
        hctx->driver_data = queue;
+       WARN_ON_ONCE(queue->hctx);
+       queue->hctx = hctx;
        return 0;
 }

And I see that 2 sets of blk_mq_hw_ctx structs get assigned to the same 32
queues.  Here is the output for 1 target connect with 32 cores.  So is it
expected that the 32 nvme_rdma IO queues get assigned to 2 sets of hw_ctx
structs?  The 2nd set is getting initialized as part of namespace scanning...

[  652.782267] nvme nvme1: creating 32 I/O queues.
[  653.373979] nvme nvme1: nvme_rdma_init_hctx hctx_idx 0 hctx ffff880e75dd0ff8
hctx->driver_data           (null) queue ffff880fce71e860 queue->hctx
(null)
[  653.394416] nvme nvme1: nvme_rdma_init_hctx hctx_idx 1 hctx ffff880e75dd6a48
hctx->driver_data           (null) queue ffff880fce71e8d8 queue->hctx
(null)
[  653.414415] nvme nvme1: nvme_rdma_init_hctx hctx_idx 2 hctx ffff880e75dd1548
hctx->driver_data           (null) queue ffff880fce71e950 queue->hctx
(null)
[  653.434027] nvme nvme1: nvme_rdma_init_hctx hctx_idx 3 hctx ffff880e75dd64f8
hctx->driver_data           (null) queue ffff880fce71e9c8 queue->hctx
(null)
[  653.453489] nvme nvme1: nvme_rdma_init_hctx hctx_idx 4 hctx ffff880e75dd5fa8
hctx->driver_data           (null) queue ffff880fce71ea40 queue->hctx
(null)
[  653.472782] nvme nvme1: nvme_rdma_init_hctx hctx_idx 5 hctx ffff880e75dd1a98
hctx->driver_data           (null) queue ffff880fce71eab8 queue->hctx
(null)
[  653.491934] nvme nvme1: nvme_rdma_init_hctx hctx_idx 6 hctx ffff880e75dd1fe8
hctx->driver_data           (null) queue ffff880fce71eb30 queue->hctx
(null)
[  653.510957] nvme nvme1: nvme_rdma_init_hctx hctx_idx 7 hctx ffff880e75dd5a58
hctx->driver_data           (null) queue ffff880fce71eba8 queue->hctx
(null)
[  653.530070] nvme nvme1: nvme_rdma_init_hctx hctx_idx 8 hctx ffff880e75dd5508
hctx->driver_data           (null) queue ffff880fce71ec20 queue->hctx
(null)
[  653.551448] nvme nvme1: nvme_rdma_init_hctx hctx_idx 9 hctx ffff880e75dd2538
hctx->driver_data           (null) queue ffff880fce71ec98 queue->hctx
(null)
[  653.572524] nvme nvme1: nvme_rdma_init_hctx hctx_idx 10 hctx ffff880e75dd4fb8
hctx->driver_data           (null) queue ffff880fce71ed10 queue->hctx
(null)
[  653.593434] nvme nvme1: nvme_rdma_init_hctx hctx_idx 11 hctx ffff880e75dd4a68
hctx->driver_data           (null) queue ffff880fce71ed88 queue->hctx
(null)
[  653.614228] nvme nvme1: nvme_rdma_init_hctx hctx_idx 12 hctx ffff880e743e0008
hctx->driver_data           (null) queue ffff880fce71ee00 queue->hctx
(null)
[  653.634809] nvme nvme1: nvme_rdma_init_hctx hctx_idx 13 hctx ffff880e743e7a38
hctx->driver_data           (null) queue ffff880fce71ee78 queue->hctx
(null)
[  653.655223] nvme nvme1: nvme_rdma_init_hctx hctx_idx 14 hctx ffff880e743e74e8
hctx->driver_data           (null) queue ffff880fce71eef0 queue->hctx
(null)
[  653.675474] nvme nvme1: nvme_rdma_init_hctx hctx_idx 15 hctx ffff880e743e0558
hctx->driver_data           (null) queue ffff880fce71ef68 queue->hctx
(null)
[  653.695558] nvme nvme1: nvme_rdma_init_hctx hctx_idx 16 hctx ffff880e743e0aa8
hctx->driver_data           (null) queue ffff880fce71efe0 queue->hctx
(null)
[  653.715577] nvme nvme1: nvme_rdma_init_hctx hctx_idx 17 hctx ffff880e743e6f98
hctx->driver_data           (null) queue ffff880fce71f058 queue->hctx
(null)
[  653.735383] nvme nvme1: nvme_rdma_init_hctx hctx_idx 18 hctx ffff880e743e6a48
hctx->driver_data           (null) queue ffff880fce71f0d0 queue->hctx
(null)
[  653.755053] nvme nvme1: nvme_rdma_init_hctx hctx_idx 19 hctx ffff880e743e0ff8
hctx->driver_data           (null) queue ffff880fce71f148 queue->hctx
(null)
[  653.774627] nvme nvme1: nvme_rdma_init_hctx hctx_idx 20 hctx ffff880e743e1548
hctx->driver_data           (null) queue ffff880fce71f1c0 queue->hctx
(null)
[  653.794081] nvme nvme1: nvme_rdma_init_hctx hctx_idx 21 hctx ffff880e743e64f8
hctx->driver_data           (null) queue ffff880fce71f238 queue->hctx
(null)
[  653.813415] nvme nvme1: nvme_rdma_init_hctx hctx_idx 22 hctx ffff880e743e5fa8
hctx->driver_data           (null) queue ffff880fce71f2b0 queue->hctx
(null)
[  653.832665] nvme nvme1: nvme_rdma_init_hctx hctx_idx 23 hctx ffff880e743e1a98
hctx->driver_data           (null) queue ffff880fce71f328 queue->hctx
(null)
[  653.851791] nvme nvme1: nvme_rdma_init_hctx hctx_idx 24 hctx ffff880e743e1fe8
hctx->driver_data           (null) queue ffff880fce71f3a0 queue->hctx
(null)
[  653.870780] nvme nvme1: nvme_rdma_init_hctx hctx_idx 25 hctx ffff880e743e5a58
hctx->driver_data           (null) queue ffff880fce71f418 queue->hctx
(null)
[  653.889632] nvme nvme1: nvme_rdma_init_hctx hctx_idx 26 hctx ffff880e743e5508
hctx->driver_data           (null) queue ffff880fce71f490 queue->hctx
(null)
[  653.908318] nvme nvme1: nvme_rdma_init_hctx hctx_idx 27 hctx ffff880e743e2538
hctx->driver_data           (null) queue ffff880fce71f508 queue->hctx
(null)
[  653.927036] nvme nvme1: nvme_rdma_init_hctx hctx_idx 28 hctx ffff880e743e2a88
hctx->driver_data           (null) queue ffff880fce71f580 queue->hctx
(null)
[  653.945760] nvme nvme1: nvme_rdma_init_hctx hctx_idx 29 hctx ffff880e743e4fb8
hctx->driver_data           (null) queue ffff880fce71f5f8 queue->hctx
(null)
[  653.964475] nvme nvme1: nvme_rdma_init_hctx hctx_idx 30 hctx ffff880e743e4a68
hctx->driver_data           (null) queue ffff880fce71f670 queue->hctx
(null)
[  653.983174] nvme nvme1: nvme_rdma_init_hctx hctx_idx 31 hctx ffff880e743e2fd8
hctx->driver_data           (null) queue ffff880fce71f6e8 queue->hctx
(null)
[  654.192720] nvme nvme1: new ctrl: NQN "test-ram0", addr 10.0.1.14:4420
[  654.205669] nvme nvme1: nvme_rdma_init_hctx hctx_idx 0 hctx ffff880e75dd2a88
hctx->driver_data           (null) queue ffff880fce71e860 queue->hctx
ffff880e75dd0ff8
[  654.228208] ------------[ cut here ]------------
[  654.236514] WARNING: CPU: 31 PID: 279 at drivers/nvme/host/rdma.c:391
nvme_rdma_init_hctx+0xb7/0xe0 [nvme_rdma]
[  654.250060] Modules linked in: nvme_rdma nvme_fabrics brd iw_cxgb4 cxgb4
ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM
iptable_mangle iptable_filter ip_tables bridge 8021q mrp garp stp llc cachefiles
fscache rdma_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad ocrdma be2net
iw_nes libcrc32c iw_cxgb3 cxgb3 mdio ib_qib rdmavt mlx5_ib mlx5_core mlx4_ib
mlx4_en mlx4_core ib_mthca ib_core binfmt_misc dm_mirror dm_region_hash dm_log
vhost_net macvtap macvlan vhost tun kvm irqbypass uinput iTCO_wdt
iTCO_vendor_support mxm_wmi pcspkr dm_mod i2c_i801 i2c_smbus sg lpc_ich mfd_core
mei_me mei nvme nvme_core igb dca ptp pps_core ipmi_si ipmi_msghandler wmi
ext4(E) mbcache(E) jbd2(E) sd_mod(E) ahci(E) libahci(E) libata(E) mgag200(E)
ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E) sysimgblt(E) sysfillrect(E)
syscopyarea(E) i2c_algo_bit(E) i2c_core(E) [last unloaded: cxgb4]
[  654.368142] CPU: 31 PID: 279 Comm: kworker/31:1 Tainted: G            E
4.8.0-rc5-nvmf+block2-dbg+ #34
[  654.379719] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
[  654.388779] Workqueue: events nvme_scan_work [nvme_core]
[  654.396097]  0000000000000000 ffff881020dbb8b8 ffffffff8135b8b7
0000000000000117
[  654.405476]  0000000000000000 0000000000000000 0000000000000000
ffff881020dbb908
[  654.414792]  ffffffff81086eed ffff881020dbb8e8 00000187d3748534
ffffffffa066d200
[  654.424081] Call Trace:
[  654.428333]  [<ffffffff8135b8b7>] dump_stack+0x67/0x90
[  654.435266]  [<ffffffff81086eed>] __warn+0xfd/0x120
[  654.441894]  [<ffffffff81086f2d>] warn_slowpath_null+0x1d/0x20
[  654.449447]  [<ffffffffa0669407>] nvme_rdma_init_hctx+0xb7/0xe0 [nvme_rdma]
[  654.458118]  [<ffffffff8133a52e>] blk_mq_init_hctx+0x21e/0x2e0
[  654.465625]  [<ffffffff8133a6ea>] blk_mq_realloc_hw_ctxs+0xfa/0x240
[  654.473542]  [<ffffffff8133b342>] blk_mq_init_allocated_queue+0x92/0x410
[  654.481885]  [<ffffffff8132a969>] ? blk_alloc_queue_node+0x259/0x2c0
[  654.489833]  [<ffffffff8135ce84>] ? ida_pre_get+0xb4/0xe0
[  654.496798]  [<ffffffff8133b6ff>] blk_mq_init_queue+0x3f/0x70
[  654.504097]  [<ffffffffa0272998>] nvme_alloc_ns+0x88/0x240 [nvme_core]
[  654.512155]  [<ffffffffa02728bc>] ? nvme_find_get_ns+0x5c/0xb0 [nvme_core]
[  654.520533]  [<ffffffffa0273059>] nvme_validate_ns+0x79/0x90 [nvme_core]
[  654.528718]  [<ffffffffa0273166>] nvme_scan_ns_list+0xf6/0x1f0 [nvme_core]
[  654.537059]  [<ffffffffa027338b>] nvme_scan_work+0x12b/0x140 [nvme_core]
[  654.545199]  [<ffffffff810a1613>] process_one_work+0x183/0x4d0
[  654.552448]  [<ffffffff816dfa40>] ? __schedule+0x1f0/0x5b0
[  654.559314]  [<ffffffff816dff00>] ? schedule+0x40/0xb0
[  654.565810]  [<ffffffff810a22ad>] worker_thread+0x16d/0x530
[  654.572709]  [<ffffffff810a2140>] ? maybe_create_worker+0x120/0x120
[  654.580279]  [<ffffffff816dfa40>] ? __schedule+0x1f0/0x5b0
[  654.587061]  [<ffffffff810cbab6>] ? __wake_up_common+0x56/0x90
[  654.594146]  [<ffffffff810a2140>] ? maybe_create_worker+0x120/0x120
[  654.601644]  [<ffffffff816dff00>] ? schedule+0x40/0xb0
[  654.607998]  [<ffffffff810a2140>] ? maybe_create_worker+0x120/0x120
[  654.615459]  [<ffffffff810a6dec>] kthread+0xcc/0xf0
[  654.621505]  [<ffffffff810b17ae>] ? schedule_tail+0x1e/0xc0
[  654.628218]  [<ffffffff816e3bbf>] ret_from_fork+0x1f/0x40
[  654.634738]  [<ffffffff810a6d20>] ? kthread_freezable_should_stop+0x70/0x70
[  654.642845] ---[ end trace 83cb452b9aa631ae ]---
[  654.648809] nvme nvme1: nvme_rdma_init_hctx hctx_idx 1 hctx ffff880e75dd2fd8
hctx->driver_data           (null) queue ffff880fce71e8d8 queue->hctx
ffff880e75dd6a48
[  654.665803] nvme nvme1: nvme_rdma_init_hctx hctx_idx 2 hctx ffff880e75dd4518
hctx->driver_data           (null) queue ffff880fce71e950 queue->hctx
ffff880e75dd1548
[  654.682808] nvme nvme1: nvme_rdma_init_hctx hctx_idx 3 hctx ffff880e75dd3528
hctx->driver_data           (null) queue ffff880fce71e9c8 queue->hctx
ffff880e75dd64f8
[  654.699807] nvme nvme1: nvme_rdma_init_hctx hctx_idx 4 hctx ffff880e75dd3a78
hctx->driver_data           (null) queue ffff880fce71ea40 queue->hctx
ffff880e75dd5fa8
[  654.716822] nvme nvme1: nvme_rdma_init_hctx hctx_idx 5 hctx ffff880ff084ea48
hctx->driver_data           (null) queue ffff880fce71eab8 queue->hctx
ffff880e75dd1a98
[  654.733809] nvme nvme1: nvme_rdma_init_hctx hctx_idx 6 hctx ffff880e71c78008
hctx->driver_data           (null) queue ffff880fce71eb30 queue->hctx
ffff880e75dd1fe8
[  654.750808] nvme nvme1: nvme_rdma_init_hctx hctx_idx 7 hctx ffff880e71c7fa38
hctx->driver_data           (null) queue ffff880fce71eba8 queue->hctx
ffff880e75dd5a58
[  654.767860] nvme nvme1: nvme_rdma_init_hctx hctx_idx 8 hctx ffff880e71c7f4e8
hctx->driver_data           (null) queue ffff880fce71ec20 queue->hctx
ffff880e75dd5508
[  654.785002] nvme nvme1: nvme_rdma_init_hctx hctx_idx 9 hctx ffff880e71c78558
hctx->driver_data           (null) queue ffff880fce71ec98 queue->hctx
ffff880e75dd2538
[  654.802337] nvme nvme1: nvme_rdma_init_hctx hctx_idx 10 hctx ffff880e71c78aa8
hctx->driver_data           (null) queue ffff880fce71ed10 queue->hctx
ffff880e75dd4fb8
[  654.819704] nvme nvme1: nvme_rdma_init_hctx hctx_idx 11 hctx ffff880e71c7ef98
hctx->driver_data           (null) queue ffff880fce71ed88 queue->hctx
ffff880e75dd4a68
[  654.837312] nvme nvme1: nvme_rdma_init_hctx hctx_idx 12 hctx ffff880e71c7ea48
hctx->driver_data           (null) queue ffff880fce71ee00 queue->hctx
ffff880e743e0008
[  654.854892] nvme nvme1: nvme_rdma_init_hctx hctx_idx 13 hctx ffff880e71c78ff8
hctx->driver_data           (null) queue ffff880fce71ee78 queue->hctx
ffff880e743e7a38
[  654.872556] nvme nvme1: nvme_rdma_init_hctx hctx_idx 14 hctx ffff880e71c79548
hctx->driver_data           (null) queue ffff880fce71eef0 queue->hctx
ffff880e743e74e8
[  654.890332] nvme nvme1: nvme_rdma_init_hctx hctx_idx 15 hctx ffff880e71c7e4f8
hctx->driver_data           (null) queue ffff880fce71ef68 queue->hctx
ffff880e743e0558
[  654.908328] nvme nvme1: nvme_rdma_init_hctx hctx_idx 16 hctx ffff880e71c7dfa8
hctx->driver_data           (null) queue ffff880fce71efe0 queue->hctx
ffff880e743e0aa8
[  654.926268] nvme nvme1: nvme_rdma_init_hctx hctx_idx 17 hctx ffff880e71c79a98
hctx->driver_data           (null) queue ffff880fce71f058 queue->hctx
ffff880e743e6f98
[  654.944313] nvme nvme1: nvme_rdma_init_hctx hctx_idx 18 hctx ffff880e71c7da58
hctx->driver_data           (null) queue ffff880fce71f0d0 queue->hctx
ffff880e743e6a48
[  654.962455] nvme nvme1: nvme_rdma_init_hctx hctx_idx 19 hctx ffff880e71c79fe8
hctx->driver_data           (null) queue ffff880fce71f148 queue->hctx
ffff880e743e0ff8
[  654.980701] nvme nvme1: nvme_rdma_init_hctx hctx_idx 20 hctx ffff880e71c7a538
hctx->driver_data           (null) queue ffff880fce71f1c0 queue->hctx
ffff880e743e1548
[  654.999107] nvme nvme1: nvme_rdma_init_hctx hctx_idx 21 hctx ffff880e71c7d508
hctx->driver_data           (null) queue ffff880fce71f238 queue->hctx
ffff880e743e64f8
[  655.017552] nvme nvme1: nvme_rdma_init_hctx hctx_idx 22 hctx ffff880e71c7aa88
hctx->driver_data           (null) queue ffff880fce71f2b0 queue->hctx
ffff880e743e5fa8
[  655.036101] nvme nvme1: nvme_rdma_init_hctx hctx_idx 23 hctx ffff880e71c7afd8
hctx->driver_data           (null) queue ffff880fce71f328 queue->hctx
ffff880e743e1a98
[  655.054759] nvme nvme1: nvme_rdma_init_hctx hctx_idx 24 hctx ffff880e71c7cfb8
hctx->driver_data           (null) queue ffff880fce71f3a0 queue->hctx
ffff880e743e1fe8
[  655.073480] nvme nvme1: nvme_rdma_init_hctx hctx_idx 25 hctx ffff880e71c7ca68
hctx->driver_data           (null) queue ffff880fce71f418 queue->hctx
ffff880e743e5a58
[  655.092201] nvme nvme1: nvme_rdma_init_hctx hctx_idx 26 hctx ffff880e71c7b528
hctx->driver_data           (null) queue ffff880fce71f490 queue->hctx
ffff880e743e5508
[  655.110921] nvme nvme1: nvme_rdma_init_hctx hctx_idx 27 hctx ffff880e71c7ba78
hctx->driver_data           (null) queue ffff880fce71f508 queue->hctx
ffff880e743e2538
[  655.129659] nvme nvme1: nvme_rdma_init_hctx hctx_idx 28 hctx ffff880e71c7c518
hctx->driver_data           (null) queue ffff880fce71f580 queue->hctx
ffff880e743e2a88
[  655.148357] nvme nvme1: nvme_rdma_init_hctx hctx_idx 29 hctx ffff880e71c7bfc8
hctx->driver_data           (null) queue ffff880fce71f5f8 queue->hctx
ffff880e743e4fb8
[  655.167104] nvme nvme1: nvme_rdma_init_hctx hctx_idx 30 hctx ffff880e71cc8008
hctx->driver_data           (null) queue ffff880fce71f670 queue->hctx
ffff880e743e4a68
[  655.185814] nvme nvme1: nvme_rdma_init_hctx hctx_idx 31 hctx ffff880e71cc8558
hctx->driver_data           (null) queue ffff880fce71f6e8 queue->hctx
ffff880e743e2fd8


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-15 15:53                                                         ` Steve Wise
@ 2016-09-15 16:45                                                           ` Steve Wise
  2016-09-15 20:58                                                             ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-15 16:45 UTC (permalink / raw)


> And I see that 2 sets of blk_mq_hw_ctx structs get assigned to the same 32
> queues.  Here is the output for 1 target connect with 32 cores.  So is it
> expected that the 32 nvme_rdma IO queues get assigned to 2 sets of hw_ctx
> structs?  The 2nd set is getting initialized as part of namespace scanning...


So here is the stack for the first time the nvme_rdma_queue structs are bound to
an hctx:

[ 2006.826941]  [<ffffffffa066c452>] nvme_rdma_init_hctx+0x102/0x110 [nvme_rdma]
[ 2006.835409]  [<ffffffff8133a52e>] blk_mq_init_hctx+0x21e/0x2e0
[ 2006.842530]  [<ffffffff8133a6ea>] blk_mq_realloc_hw_ctxs+0xfa/0x240
[ 2006.850097]  [<ffffffff8133b342>] blk_mq_init_allocated_queue+0x92/0x410
[ 2006.858107]  [<ffffffff8132a969>] ? blk_alloc_queue_node+0x259/0x2c0
[ 2006.865765]  [<ffffffff8133b6ff>] blk_mq_init_queue+0x3f/0x70
[ 2006.872829]  [<ffffffffa066d9f9>] nvme_rdma_create_io_queues+0x189/0x210
[nvme_rdma]
[ 2006.881917]  [<ffffffffa066e813>] ?
nvme_rdma_configure_admin_queue+0x1e3/0x290 [nvme_rdma]
[ 2006.891611]  [<ffffffffa066ec65>] nvme_rdma_create_ctrl+0x3a5/0x4c0
[nvme_rdma]
[ 2006.900260]  [<ffffffffa0654d33>] ? nvmf_create_ctrl+0x33/0x210
[nvme_fabrics]
[ 2006.908799]  [<ffffffffa0654e82>] nvmf_create_ctrl+0x182/0x210 [nvme_fabrics]
[ 2006.917228]  [<ffffffffa0654fbc>] nvmf_dev_write+0xac/0x110 [nvme_fabrics]

And here is the 2nd time the same nvme_rdma_queue is bound to a different hctx:

[ 2007.263068]  [<ffffffffa066c40c>] nvme_rdma_init_hctx+0xbc/0x110 [nvme_rdma]
[ 2007.271656]  [<ffffffff8133a52e>] blk_mq_init_hctx+0x21e/0x2e0
[ 2007.279027]  [<ffffffff8133a6ea>] blk_mq_realloc_hw_ctxs+0xfa/0x240
[ 2007.286829]  [<ffffffff8133b342>] blk_mq_init_allocated_queue+0x92/0x410
[ 2007.295066]  [<ffffffff8132a969>] ? blk_alloc_queue_node+0x259/0x2c0
[ 2007.302962]  [<ffffffff8135ce84>] ? ida_pre_get+0xb4/0xe0
[ 2007.309894]  [<ffffffff8133b6ff>] blk_mq_init_queue+0x3f/0x70
[ 2007.317164]  [<ffffffffa0272998>] nvme_alloc_ns+0x88/0x240 [nvme_core]
[ 2007.325218]  [<ffffffffa02728bc>] ? nvme_find_get_ns+0x5c/0xb0 [nvme_core]
[ 2007.333612]  [<ffffffffa0273059>] nvme_validate_ns+0x79/0x90 [nvme_core]
[ 2007.341825]  [<ffffffffa0273166>] nvme_scan_ns_list+0xf6/0x1f0 [nvme_core]
[ 2007.350214]  [<ffffffffa027338b>] nvme_scan_work+0x12b/0x140 [nvme_core]
[ 2007.358427]  [<ffffffff810a1613>] process_one_work+0x183/0x4d0

Is this expected?


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-15 16:45                                                           ` Steve Wise
@ 2016-09-15 20:58                                                             ` Steve Wise
  2016-09-16 11:04                                                               ` 'Christoph Hellwig'
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-15 20:58 UTC (permalink / raw)


> > And I see that 2 sets of blk_mq_hw_ctx structs get assigned to the same 32
> > queues.  Here is the output for 1 target connect with 32 cores.  So is it
> > expected that the 32 nvme_rdma IO queues get assigned to 2 sets of hw_ctx
> > structs?  The 2nd set is getting initialized as part of namespace scanning...
> 
> 
> So here is the stack for the first time the nvme_rdma_queue structs are bound
> to an hctx:
> 
> [ 2006.826941]  [<ffffffffa066c452>] nvme_rdma_init_hctx+0x102/0x110
> [nvme_rdma]
> [ 2006.835409]  [<ffffffff8133a52e>] blk_mq_init_hctx+0x21e/0x2e0
> [ 2006.842530]  [<ffffffff8133a6ea>] blk_mq_realloc_hw_ctxs+0xfa/0x240
> [ 2006.850097]  [<ffffffff8133b342>] blk_mq_init_allocated_queue+0x92/0x410
> [ 2006.858107]  [<ffffffff8132a969>] ? blk_alloc_queue_node+0x259/0x2c0
> [ 2006.865765]  [<ffffffff8133b6ff>] blk_mq_init_queue+0x3f/0x70
> [ 2006.872829]  [<ffffffffa066d9f9>] nvme_rdma_create_io_queues+0x189/0x210
> [nvme_rdma]
> [ 2006.881917]  [<ffffffffa066e813>] ?
> nvme_rdma_configure_admin_queue+0x1e3/0x290 [nvme_rdma]
> [ 2006.891611]  [<ffffffffa066ec65>] nvme_rdma_create_ctrl+0x3a5/0x4c0
> [nvme_rdma]
> [ 2006.900260]  [<ffffffffa0654d33>] ? nvmf_create_ctrl+0x33/0x210
> [nvme_fabrics]
> [ 2006.908799]  [<ffffffffa0654e82>] nvmf_create_ctrl+0x182/0x210
[nvme_fabrics]
> [ 2006.917228]  [<ffffffffa0654fbc>] nvmf_dev_write+0xac/0x110 [nvme_fabrics]
> 

The above stack is creating hctx queues for the nvme_rdma_ctrl->ctrl.connect_q
request queue.

> And here is the 2nd time the same nvme_rdma_queue is bound to a different
> hctx:
> 
> [ 2007.263068]  [<ffffffffa066c40c>] nvme_rdma_init_hctx+0xbc/0x110
> [nvme_rdma]
> [ 2007.271656]  [<ffffffff8133a52e>] blk_mq_init_hctx+0x21e/0x2e0
> [ 2007.279027]  [<ffffffff8133a6ea>] blk_mq_realloc_hw_ctxs+0xfa/0x240
> [ 2007.286829]  [<ffffffff8133b342>] blk_mq_init_allocated_queue+0x92/0x410
> [ 2007.295066]  [<ffffffff8132a969>] ? blk_alloc_queue_node+0x259/0x2c0
> [ 2007.302962]  [<ffffffff8135ce84>] ? ida_pre_get+0xb4/0xe0
> [ 2007.309894]  [<ffffffff8133b6ff>] blk_mq_init_queue+0x3f/0x70
> [ 2007.317164]  [<ffffffffa0272998>] nvme_alloc_ns+0x88/0x240 [nvme_core]
> [ 2007.325218]  [<ffffffffa02728bc>] ? nvme_find_get_ns+0x5c/0xb0 [nvme_core]
> [ 2007.333612]  [<ffffffffa0273059>] nvme_validate_ns+0x79/0x90 [nvme_core]
> [ 2007.341825]  [<ffffffffa0273166>] nvme_scan_ns_list+0xf6/0x1f0 [nvme_core]
> [ 2007.350214]  [<ffffffffa027338b>] nvme_scan_work+0x12b/0x140 [nvme_core]
> [ 2007.358427]  [<ffffffff810a1613>] process_one_work+0x183/0x4d0
>

This stack is creating hctx queues for the namespace created for this target
device.
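
Both request queues are built from the same tag set, which is why
nvme_rdma_init_hctx() runs once per hardware context for each of them and binds
both sets of hctxs to the same nvme_rdma_queues.  Roughly (a simplified sketch
from my reading of the 4.8-era code, not a literal quote):

        /* nvme_rdma_create_io_queues(): the fabrics connect request queue */
        ctrl->ctrl.tagset = &ctrl->tag_set;
        ctrl->ctrl.connect_q = blk_mq_init_queue(&ctrl->tag_set);

        /* nvme_alloc_ns() during namespace scanning: the per-namespace I/O queue */
        ns->queue = blk_mq_init_queue(ctrl->tagset);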

Sagi,

Should nvme_rdma_error_recovery_work() be stopping the hctx queues for
ctrl->ctrl.connect_q too?

Something like:

@@ -781,6 +790,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
        if (ctrl->queue_count > 1)
                nvme_stop_queues(&ctrl->ctrl);
        blk_mq_stop_hw_queues(ctrl->ctrl.admin_q);
+       blk_mq_stop_hw_queues(ctrl->ctrl.connect_q);

        /* We must take care of fastfail/requeue all our inflight requests */
        if (ctrl->queue_count > 1)

And then restart these after the nvme_rdma_queue rdma resources are reallocated?
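
Maybe something like this on the restart side (untested sketch -- I'm assuming
the reconnect work, after the RDMA queues are re-established, is the right
place, and the context lines are approximate):

@@ static void nvme_rdma_reconnect_ctrl_work(struct work_struct *work)
        if (ctrl->queue_count > 1) {
+               blk_mq_start_stopped_hw_queues(ctrl->ctrl.connect_q, true);
                nvme_start_queues(&ctrl->ctrl);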


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-15 20:58                                                             ` Steve Wise
@ 2016-09-16 11:04                                                               ` 'Christoph Hellwig'
  2016-09-18 17:02                                                                 ` Sagi Grimberg
  0 siblings, 1 reply; 53+ messages in thread
From: 'Christoph Hellwig' @ 2016-09-16 11:04 UTC (permalink / raw)


On Thu, Sep 15, 2016 at 03:58:25PM -0500, Steve Wise wrote:
> This stack is creating hctx queues for the namespace created for this target
> device.
> 
> Sagi,
> 
> Should nvme_rdma_error_recovery_work() be stopping the hctx queues for
> ctrl->ctrl.connect_q too?

Oh.  Actually we'll probably need to take care of the connect_q just
about anywhere we do anything to the other queues..

Let me look into the details a bit more.


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-16 11:04                                                               ` 'Christoph Hellwig'
@ 2016-09-18 17:02                                                                 ` Sagi Grimberg
  2016-09-19 15:38                                                                   ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Sagi Grimberg @ 2016-09-18 17:02 UTC (permalink / raw)



>> This stack is creating hctx queues for the namespace created for this target
>> device.
>>
>> Sagi,
>>
>> Should nvme_rdma_error_recovery_work() be stopping the hctx queues for
>> ctrl->ctrl.connect_q too?
>
> Oh.  Actually we'll probably need to take care of the connect_q just
> about anywhere we do anything to the other queues..

Why should we?

We control the IOs on the connect_q (we only submit connect to it) and
we only submit to it if our queue is established.

I still don't see how this explains why Steve is seeing bogus
queue/hctx mappings...


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-18 17:02                                                                 ` Sagi Grimberg
@ 2016-09-19 15:38                                                                   ` Steve Wise
  2016-09-21 21:20                                                                     ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-19 15:38 UTC (permalink / raw)


> >> This stack is creating hctx queues for the namespace created for this
> >> target device.
> >>
> >> Sagi,
> >>
> >> Should nvme_rdma_error_recovery_work() be stopping the hctx queues for
> >> ctrl->ctrl.connect_q too?
> >
> > Oh.  Actually we'll probably need to take care of the connect_q just
> > about anywhere we do anything to the other queues..
> 
> Why should we?
> 
> We control the IOs on the connect_q (we only submit connect to it) and
> we only submit to it if our queue is established.
> 
> I still don't see how this explains why Steve is seeing bogus
> queue/hctx mappings...

I don't think I'm seeing bogus mappings necessarily.  I think my debug code
uncovered (to me at least) that connect_q hctxs use the same nvme_rdma_queues
as the ioq hctxs.  And I thought that was not a valid configuration, but
apparently it's normal.  So I still don't know how/why a pending request gets run
on an nvme_rdma_queue that has blown away its rdma qp and cm_id.  It _could_ be
due to bogus queue/hctx mappings, but I haven't proven it.  I'm not sure how to
prove it (or how to further debug this issue)...


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-19 15:38                                                                   ` Steve Wise
@ 2016-09-21 21:20                                                                     ` Steve Wise
  2016-09-23 23:57                                                                       ` Sagi Grimberg
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-21 21:20 UTC (permalink / raw)


> > >
> > > Oh.  Actually we'll probably need to take care of the connect_q just
> > > about anywhere we do anything to the other queues..
> >
> > Why should we?
> >
> > We control the IOs on the connect_q (we only submit connect to it) and
> > we only submit to it if our queue is established.
> >
> > I still don't see how this explains why Steve is seeing bogus
> > queue/hctx mappings...
> 
> I don't think I'm seeing bogus mappings necessarily.  I think my debug code
> uncovered (to me at least) that connect_q hctxs use the same nvme_rdma_queues
> as the ioq hctxs.  And I thought that was not a valid configuration, but
> apparently it's normal.  So I still don't know how/why a pending request gets
> run on an nvme_rdma_queue that has blown away its rdma qp and cm_id.  It
> _could_ be due to bogus queue/hctx mappings, but I haven't proven it.  I'm not
> sure how to prove it (or how to further debug this issue)...

I added debug code to save off the 2 blk_mq_hw_ctx pointers that get
associated with each nvme_rdma_queue.  This allows me to assert that the
hctx passed into nvme_rdma_queue_rq() is not bogus.  And indeed the hctx
passed in during the crash is the correct hctx.  So we know there isn't a
problem with a bogus hctx being used.
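
Roughly like this (a sketch of the debug check, not the exact patch; hctx1 and
hctx2 are fields I added to struct nvme_rdma_queue):

        /* nvme_rdma_init_hctx(): remember each hctx that gets bound to this queue */
        if (!queue->hctx1)
                queue->hctx1 = hctx;
        else
                queue->hctx2 = hctx;

        /* nvme_rdma_queue_rq(): the hctx passed in must be one of them */
        BUG_ON(hctx != queue->hctx1 && hctx != queue->hctx2);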

The hctx.state has BLK_MQ_S_TAG_ACTIVE set and _not_ BLK_MQ_S_STOPPED.  The
ns->queue->queue_flags has QUEUE_FLAG_STOPPED bit set.  So the blk_mq queue
is active and the nvme queue is STOPPED.  I don't know how it gets in this
state...

Steve.


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-21 21:20                                                                     ` Steve Wise
@ 2016-09-23 23:57                                                                       ` Sagi Grimberg
  2016-09-26 15:12                                                                         ` 'Christoph Hellwig'
  2016-09-27 14:07                                                                         ` Steve Wise
  0 siblings, 2 replies; 53+ messages in thread
From: Sagi Grimberg @ 2016-09-23 23:57 UTC (permalink / raw)



> The hctx.state has BLK_MQ_S_TAG_ACTIVE set and _not_ BLK_MQ_S_STOPPED.  The
> ns->queue->queue_flags has QUEUE_FLAG_STOPPED bit set.  So the blk_mq queue
> is active and the nvme queue is STOPPED.  I don't know how it gets in this
> state...

Christoph,

I'm still trying to understand how it is possible to
get to a point where the request queue is stopped while
the hardware context is not...

The code in rdma.c seems to do the right thing, but somehow
a stray request sneaks into our submission path when it's not
expected to.

Steve, is the request a normal read/write, or is it something
else triggered from the reconnect flow?


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-23 23:57                                                                       ` Sagi Grimberg
@ 2016-09-26 15:12                                                                         ` 'Christoph Hellwig'
  2016-09-26 22:29                                                                           ` 'Christoph Hellwig'
  2016-09-27 14:07                                                                         ` Steve Wise
  1 sibling, 1 reply; 53+ messages in thread
From: 'Christoph Hellwig' @ 2016-09-26 15:12 UTC (permalink / raw)


On Fri, Sep 23, 2016 at 04:57:21PM -0700, Sagi Grimberg wrote:
> I'm still trying to understand how it is possible to
> get to a point where the request queue is stopped while
> the hardware context is not...

Not sure either, I need to deep dive into the code.  But just a week
or two ago Bart posted some fixes in this area for SRP, so that's where
I will start my deep dive.


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-26 15:12                                                                         ` 'Christoph Hellwig'
@ 2016-09-26 22:29                                                                           ` 'Christoph Hellwig'
  2016-09-27 15:11                                                                             ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: 'Christoph Hellwig' @ 2016-09-26 22:29 UTC (permalink / raw)


Steve,

can you retest with the "Introduce blk_quiesce_queue() and blk_resume_queue()"
series from Bart applied?


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-23 23:57                                                                       ` Sagi Grimberg
  2016-09-26 15:12                                                                         ` 'Christoph Hellwig'
@ 2016-09-27 14:07                                                                         ` Steve Wise
  1 sibling, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-27 14:07 UTC (permalink / raw)


> Christoph,
> 
> I'm still trying to understand how it is possible to
> get to a point where the request queue is stopped while
> the hardware context is not...
> 
> The code in rdma.c seems to do the right thing, but somehow
> a stray request sneaks into our submission path when it's not
> expected to.
> 
> Steve, is the request a normal read/write, or is it something
> else triggered from the reconnect flow?

It is a normal IO request, I think: length 64, 1 sge.  Sometimes I see a REG_MR
also filled out in the nvme_rdma_request->reg_wr struct.

I'm going to try Bart's series now to see if it fixes this issue...

Steve.


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-26 22:29                                                                           ` 'Christoph Hellwig'
@ 2016-09-27 15:11                                                                             ` Steve Wise
  2016-09-27 15:31                                                                               ` Steve Wise
  0 siblings, 1 reply; 53+ messages in thread
From: Steve Wise @ 2016-09-27 15:11 UTC (permalink / raw)


> 
> Steve,
> 
> can you retest with the "Introduce blk_quiesce_queue() and blk_resume_queue()"
> series from Bart applied?
> 

Hey Christoph,

To apply Bart's series, I needed to use Jens' for-4.9/block branch.  But I also
wanted the latest nvme fixes in linux-4.8-rc8, so I rebased Jens' branch onto
rc8, then applied Bart's series (which needed a small tweak to patch 2). On top
of this I have some debug patches that will BUG_ON() if they detect freed RDMA
objects (requires memory debugging enabled, so freed memory has the 0x6b6b... stamp).   This
code base can be perused at:

https://github.com/larrystevenwise/nvme-fabrics/commits/block-for-4.9

I then tried to reproduce, and still hit a crash.  I'm debugging now.  

[  414.649731] nvme_rdma: nvme_rdma_queue_rq BAZINGA! hctx ffff880fb8fc1548 ns
ffff88101f049798 queue ffff880fbc1443a8 hctx1 ffff880fba47ca68 hctx2
ffff880fb8fc1548 rq ffff880fbb1736c0 req ffff880fbb173830
[  414.649830] nvme_rdma: nvme_rdma_queue_rq BAZINGA! hctx ffff880fbc1164f8 ns
ffff88101f049798 queue ffff880fbc144328 hctx1 ffff880fba47afd8 hctx2
ffff880fbc1164f8 rq ffff880fbb031240 req ffff880fbb0313b0
[  414.649869] ------------[ cut here ]------------
[  414.649870] kernel BUG at drivers/nvme/host/rdma.c:1434!
[  414.649872] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
[  414.649907] Modules linked in: nvme_rdma nvme_fabrics brd iw_cxgb4 cxgb4
ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4
nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM
iptable_mangle iptable_filter ip_tables bridge 8021q mrp garp stp llc cachefiles
fscache rdma_ucm rdma_cm iw_cm ib_ipoib ib_cm ib_uverbs ib_umad ocrdma be2net
iw_nes libcrc32c iw_cxgb3 cxgb3 mdio ib_qib rdmavt mlx5_ib mlx5_core mlx4_ib
mlx4_en mlx4_core ib_mthca ib_core binfmt_misc dm_mirror dm_region_hash dm_log
vhost_net macvtap macvlan vhost tun kvm irqbypass uinput iTCO_wdt
iTCO_vendor_support mxm_wmi pcspkr ipmi_si ipmi_msghandler dm_mod i2c_i801
i2c_smbus sg lpc_ich mfd_core mei_me mei nvme nvme_core igb dca ptp
[  414.649916]  pps_core wmi ext4(E) mbcache(E) jbd2(E) sd_mod(E) ahci(E)
libahci(E) libata(E) mgag200(E) ttm(E) drm_kms_helper(E) drm(E) fb_sys_fops(E)
sysimgblt(E) sysfillrect(E) syscopyarea(E) i2c_algo_bit(E) i2c_core(E) [last
unloaded: cxgb4]
[  414.649918] CPU: 0 PID: 14173 Comm: fio Tainted: G        W   E
4.8.0-rc8-stevo-for-harsha-dbg+ #48
[  414.649919] Hardware name: Supermicro X9DR3-F/X9DR3-F, BIOS 3.2a 07/09/2015
[  414.649920] task: ffff880fa4410040 task.stack: ffff880fa44fc000
[  414.649926] RIP: 0010:[<ffffffffa065eef0>]  [<ffffffffa065eef0>]
nvme_rdma_queue_rq+0x2d0/0x360 [nvme_rdma]
[  414.649927] RSP: 0018:ffff880fa44ff9a8  EFLAGS: 00010296
[  414.649928] RAX: 00000000000000be RBX: ffff880fbc144328 RCX: 0000000000000000
[  414.649929] RDX: 0000000000000001 RSI: 0000000000000286 RDI: 0000000000000286
[  414.649930] RBP: ffff880fa44ffa08 R08: 000000000005aeb2 R09: ffffffff820360de
[  414.649931] R10: 00000000000000e0 R11: 000000000000000f R12: ffff880fbb031240
[  414.649932] R13: ffff880fbc1164f8 R14: ffff880fbb0313b0 R15: ffff88101f049798
[  414.649933] FS:  00007ffb2afc5720(0000) GS:ffff881036e00000(0000)
knlGS:0000000000000000
[  414.649934] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  414.649935] CR2: 00007ffb0942e194 CR3: 0000000fa3636000 CR4: 00000000000406f0
[  414.649936] Stack:
[  414.649939]  ffff880fbc1164f8 ffff880fbb031240 ffff880fbb0313b0
ffff88103681f3f8
[  414.649941]  ffff880fa44ffa08 ffff880fbb7dcc88 ffff880fbb031240
ffff880fbb031240
[  414.649942]  ffff880fa44ffa78 0000000000000000 ffff880fb8f073e0
0000000000000001
[  414.649943] Call Trace:
[  414.649948]  [<ffffffff813394d1>] __blk_mq_run_hw_queue+0x251/0x400
[  414.649952]  [<ffffffff8115aa58>] ? trace_buffer_unlock_commit_regs+0x68/0x80
[  414.649956]  [<ffffffff8116c796>] ? trace_event_buffer_commit+0x146/0x1d0
[  414.649959]  [<ffffffff81330676>] ?
trace_event_raw_event_block_rq+0x116/0x180
[  414.649961]  [<ffffffff81339742>] blk_mq_run_hw_queue+0x72/0x80
[  414.649962]  [<ffffffff81339cd2>] blk_mq_insert_requests+0x112/0x1d0
[  414.649965]  [<ffffffff81197991>] ? __generic_file_write_iter+0xd1/0x1f0
[  414.649969]  [<ffffffff81204cdc>] ? alloc_debug_processing+0x6c/0x1a0
[  414.649971]  [<ffffffff81339f16>] blk_mq_flush_plug_list+0x186/0x1a0
[  414.649974]  [<ffffffff8127c326>] ? aio_run_iocb+0x126/0x2d0
[  414.649976]  [<ffffffff8132f2c2>] blk_flush_plug_list+0x132/0x2e0
[  414.649979]  [<ffffffff812086e5>] ? kmem_cache_alloc+0x165/0x1e0
[  414.649981]  [<ffffffff8127c731>] ? io_submit_one+0x261/0x530
[  414.649983]  [<ffffffff8116c796>] ? trace_event_buffer_commit+0x146/0x1d0
[  414.649984]  [<ffffffff8132f674>] blk_finish_plug+0x34/0x50
[  414.649986]  [<ffffffff8127cb32>] do_io_submit+0x132/0x210
[  414.649988]  [<ffffffff8127cc20>] SyS_io_submit+0x10/0x20
[  414.649992]  [<ffffffff81003e7d>] do_syscall_64+0x7d/0x230
[  414.649997]  [<ffffffff8106f097>] ? do_page_fault+0x37/0x90
[  414.650001]  [<ffffffff816e3f61>] entry_SYSCALL64_slow_path+0x25/0x25
[  414.650020] Code: 49 89 d8 4c 89 64 24 08 48 8b 43 78 4c 89 f9 4c 89 ea 48 c7
c6 50 11 66 a0 48 c7 c7 38 14 66 a0 48 89 04 24 31 c0 e8 d2 68 b3 e0 <0f> 0b eb
fe 80 3d 39 33 00 00 01 0f 84 6d fd ff ff be 8f 05 00
[  414.650023] RIP  [<ffffffffa065eef0>] nvme_rdma_queue_rq+0x2d0/0x360
[nvme_rdma]
[  414.650023]  RSP <ffff880fa44ff9a8>
crash> gdb list *nvme_rdma_queue_rq+0x2d0
0xffffffffa065eef0 is in nvme_rdma_queue_rq (drivers/nvme/host/rdma.c:1434).
1429            if ((uintptr_t)queue->qp->device ==
(uintptr_t)(0x6b6b6b6b6b6b6b6bUL) ||
1430                (uintptr_t)queue->qp->qp_num ==
(uintptr_t)(0x6b6b6b6b6b6b6b6bUL) ||
1431                (uintptr_t)queue->ib_cq->comp_handler ==
(uintptr_t)(0x6b6b6b6b6b6b6b6bUL) ||
1432                (uintptr_t)queue->ib_cq->device ==
(uintptr_t)(0x6b6b6b6b6b6b6b6bUL)) {
1433                    pr_err("%s BAZINGA! hctx %p ns %p queue %p hctx1 %p
hctx2 %p rq %p req %p\n", __func__, hctx, ns, queue, queue->hctx1, queue->hctx2,
rq, req);
1434                    BUG_ON(1);
1435            }
1436            dev = queue->device->dev;
1437            ib_dma_sync_single_for_cpu(dev, sqe->dma,
1438                            sizeof(struct nvme_command), DMA_TO_DEVICE);
crash> nvme_rdma_request ffff880fbb0313b0
struct nvme_rdma_request {
  mr = 0xffff880fbb7d9ad8,
  sqe = {
    cqe = {
      done = 0xffffffffa065d670 <nvme_rdma_send_done>
    },
    data = 0xffff880fbb7dcc88,
    dma = 67570093192
  },
  sge = {{
      addr = 67570093192,
      length = 64,
      lkey = 0
    }, {
      addr = 67902048256,
      length = 2048,
      lkey = 0
    }},
  num_sge = 2,
  nents = 1,
  inline_data = true,
  reg_wr = {
    wr = {
      next = 0xffff880fb65bf518,
      {
        wr_id = 18446612199876269112,
        wr_cqe = 0xffff880fbb031438
      },
      sg_list = 0x0,
      num_sge = 0,
      opcode = IB_WR_REG_MR,
      send_flags = 0,
      ex = {
        imm_data = 0,
        invalidate_rkey = 0
      }
    },
    mr = 0xffff880fbb7d9ad8,
    key = 40554307,
    access = 7
  },
  reg_cqe = {
    done = 0xffffffffa065d6b0 <nvme_rdma_memreg_done>
  },
  queue = 0xffff880fbc144328,
  sg_table = {
    sgl = 0xffff880fbb031458,
    nents = 1,
    orig_nents = 1
  },
  first_sgl = 0xffff880fbb031458
}
crash> nvme_rdma_queue 0xffff880fbc144328
struct nvme_rdma_queue {
  rsp_ring = 0xffff880fbe81c548,
  sig_count = 60 '<',
  queue_size = 128,
  cmnd_capsule_len = 4160,
  ctrl = 0xffff880fc98eca88,
  device = 0xffff880ff0b304b8,
  ib_cq = 0xffff880fbcac3508,
  qp = 0xffff880fbc1874e8,
  flags = 16,
  cm_id = 0xffff880fbc181548,
  cm_error = 0,
  cm_done = {
    done = 0,
    wait = {
      lock = {
        {
          rlock = {
            raw_lock = {
              val = {
                counter = 0
              }
            }
          }
        }
      },
      task_list = {
        next = 0xffff880fbc144388,
        prev = 0xffff880fbc144388
      }
    }
  },
  hctx1 = 0xffff880fba47afd8,
  hctx2 = 0xffff880fbc1164f8
}
crash> gdb x/1g 0xffff880fbcac3508
0xffff880fbcac3508:     0x6b6b6b6b6b6b6b6b
crash>


* nvmf/rdma host crash during heavy load and keep alive recovery
  2016-09-27 15:11                                                                             ` Steve Wise
@ 2016-09-27 15:31                                                                               ` Steve Wise
  0 siblings, 0 replies; 53+ messages in thread
From: Steve Wise @ 2016-09-27 15:31 UTC (permalink / raw)


> Hey Christoph,
> 
> To apply Bart's series, I needed to use Jens' for-4.9/block branch.  But I also
> wanted the latest nvme fixes in linux-4.8-rc8, so I rebased Jens' branch onto
> rc8, then applied Bart's series (which needed a small tweak to patch 2). On top
> of this I have some debug patches that will BUG_ON() if they detect freed RDMA
> objects (requires memory debugging enabled, so freed memory has the 0x6b6b...
> stamp).  This
> code base can be perused at:
> 
> https://github.com/larrystevenwise/nvme-fabrics/commits/block-for-4.9
> 
> I then tried to reproduce, and still hit a crash.  I'm debugging now.
> 

blk_mq_hw_ctx.state is: 2
nvme_ns.queue.queue_flags is: 0x1f07a00

So the hw_ctx is BLK_MQ_S_TAG_ACTIVE.  And the nvme_ns.queue request queue
doesn't have QUEUE_FLAG_STOPPED set. nvme_rdma_ctrl.ctrl state is RECONNECTING.
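
For reference, decoding those values with the bit numbers from the 4.8-era
headers (BLK_MQ_S_STOPPED = 0 and BLK_MQ_S_TAG_ACTIVE = 1 in blk-mq.h,
QUEUE_FLAG_STOPPED = 2 in blkdev.h -- treat the bit positions as assumptions if
your tree differs):

#include <stdio.h>

int main(void)
{
        unsigned long state = 0x2, queue_flags = 0x1f07a00;

        /* hctx: bit 0 (STOPPED) is clear, bit 1 (TAG_ACTIVE) is set */
        printf("hctx STOPPED=%d TAG_ACTIVE=%d\n",
               !!(state & (1UL << 0)), !!(state & (1UL << 1)));

        /* request queue: bit 2 (QUEUE_FLAG_STOPPED) is clear */
        printf("request queue STOPPED=%d\n", !!(queue_flags & (1UL << 2)));
        return 0;
}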



Thread overview: 53+ messages
2016-07-29 21:40 nvmf/rdma host crash during heavy load and keep alive recovery Steve Wise
2016-08-01 11:06 ` Christoph Hellwig
2016-08-01 14:26   ` Steve Wise
2016-08-01 21:38     ` Steve Wise
     [not found]     ` <015801d1ec3d$0ca07ea0$25e17be0$@opengridcomputing.com>
2016-08-10 15:46       ` Steve Wise
     [not found]       ` <010f01d1f31e$50c8cb40$f25a61c0$@opengridcomputing.com>
2016-08-10 16:00         ` Steve Wise
     [not found]         ` <013701d1f320$57b185d0$07149170$@opengridcomputing.com>
2016-08-10 17:20           ` Steve Wise
2016-08-10 18:59             ` Steve Wise
2016-08-11  6:27               ` Sagi Grimberg
2016-08-11 13:58                 ` Steve Wise
2016-08-11 14:19                   ` Steve Wise
2016-08-11 14:40                   ` Steve Wise
2016-08-11 15:53                     ` Steve Wise
     [not found]                     ` <00fe01d1f3e8$8992b330$9cb81990$@opengridcomputing.com>
2016-08-15 14:39                       ` Steve Wise
2016-08-16  9:26                         ` Sagi Grimberg
2016-08-16 21:17                           ` Steve Wise
2016-08-17 18:57                             ` Sagi Grimberg
2016-08-17 19:07                               ` Steve Wise
2016-09-01 19:14                                 ` Steve Wise
2016-09-04  9:17                                   ` Sagi Grimberg
2016-09-07 21:08                                     ` Steve Wise
2016-09-08  7:45                                       ` Sagi Grimberg
2016-09-08 20:47                                         ` Steve Wise
2016-09-08 21:00                                         ` Steve Wise
     [not found]                                       ` <7f09e373-6316-26a3-ae81-dab1205d88ab@grimberg.me>
     [not found]                                         ` <021201d20a14$0f203b80$2d60b280$@opengridcomputing.com>
     [not found]                                           ` <021201d20a14$0f203b80$2d60b280$@opengridcomputing.com>
2016-09-08 21:21                                             ` Steve Wise
     [not found]                                           ` <021401d20a16$ed60d470$c8227d50$@opengridcomputing.com>
     [not found]                                             ` <021501d20a19$327ba5b0$9772f110$@opengridcomputing.com>
2016-09-08 21:37                                             ` Steve Wise
2016-09-09 15:50                                               ` Steve Wise
2016-09-12 20:10                                                 ` Steve Wise
     [not found]                                                   ` <da2e918b-0f18-e032-272d-368c6ec49c62@grimberg.me>
2016-09-15  9:53                                                   ` Sagi Grimberg
2016-09-15 14:44                                                     ` Steve Wise
2016-09-15 15:10                                                       ` Steve Wise
2016-09-15 15:53                                                         ` Steve Wise
2016-09-15 16:45                                                           ` Steve Wise
2016-09-15 20:58                                                             ` Steve Wise
2016-09-16 11:04                                                               ` 'Christoph Hellwig'
2016-09-18 17:02                                                                 ` Sagi Grimberg
2016-09-19 15:38                                                                   ` Steve Wise
2016-09-21 21:20                                                                     ` Steve Wise
2016-09-23 23:57                                                                       ` Sagi Grimberg
2016-09-26 15:12                                                                         ` 'Christoph Hellwig'
2016-09-26 22:29                                                                           ` 'Christoph Hellwig'
2016-09-27 15:11                                                                             ` Steve Wise
2016-09-27 15:31                                                                               ` Steve Wise
2016-09-27 14:07                                                                         ` Steve Wise
2016-09-15 14:00                                                   ` Gabriel Krisman Bertazi
2016-09-15 14:31                                                     ` Steve Wise
2016-09-07 21:33                                     ` Steve Wise
2016-09-08  8:22                                       ` Sagi Grimberg
2016-09-08 17:19                                         ` Steve Wise
2016-09-09 15:57                                           ` Steve Wise
     [not found]                                       ` <9fd1f090-3b86-b496-d8c0-225ac0815fbe@grimberg.me>
     [not found]                                         ` <01bc01d209f5$1b7d7510$52785f30$@opengridcomputing.com>
     [not found]                                           ` <01bc01d209f5$1b7d7510$52785f30$@opengridcomputing.com>
2016-09-08 19:15                                             ` Steve Wise
     [not found]                                           ` <01f201d20a05$6abde5f0$4039b1d0$@opengridcomputing.com>
2016-09-08 19:26                                             ` Steve Wise
     [not found]                                             ` <01f401d20a06$d4cc8360$7e658a20$@opengridcomputing.com>
2016-09-08 20:44                                               ` Steve Wise
