* BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
From: Yi Zhang @ 2018-03-30  9:32 UTC
  To: linux-nvme, linux-block; +Cc: Ming Lei

Hello,
I hit this kernel BUG on 4.16.0-rc7 during my NVMeoF RDMA testing. Here are the reproducer and log; let me know if you need more info. Thanks.

Reproducer:
1. setup target
#nvmetcli restore /etc/rdma.json
2. connect target on host
#nvme connect-all -t rdma -a $IP -s 4420
3. do fio background on host
#fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -bs_unaligned -runtime=180 -size=-group_reporting -name=mytest -numjobs=60 &
4. offline cpu on host
#echo 0 > /sys/devices/system/cpu/cpu1/online
#echo 0 > /sys/devices/system/cpu/cpu2/online
#echo 0 > /sys/devices/system/cpu/cpu3/online
5. clear target
#nvmetcli clear
6. restore target
#nvmetcli restore /etc/rdma.json
7. check console log on host


[  167.054583] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  167.108410] nvme nvme0: creating 40 I/O queues.
[  167.421694] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  256.496376] smpboot: CPU 1 is now offline
[  256.525102] IRQ 37: no longer affine to CPU2
[  256.529872] IRQ 54: no longer affine to CPU2
[  256.534637] IRQ 70: no longer affine to CPU2
[  256.539405] IRQ 98: no longer affine to CPU2
[  256.544175] IRQ 140: no longer affine to CPU2
[  256.549036] IRQ 141: no longer affine to CPU2
[  256.553905] IRQ 166: no longer affine to CPU2
[  256.561042] smpboot: CPU 2 is now offline
[  256.796920] smpboot: CPU 3 is now offline
[  258.649993] print_req_error: operation not supported error, dev nvme0n1, sector 60151856
[  258.650031] print_req_error: operation not supported error, dev nvme0n1, sector 512220944
[  258.650040] print_req_error: operation not supported error, dev nvme0n1, sector 221050984
[  258.650047] print_req_error: operation not supported error, dev nvme0n1, sector 160854616
[  258.650058] print_req_error: operation not supported error, dev nvme0n1, sector 471080288
[  258.650083] print_req_error: operation not supported error, dev nvme0n1, sector 242366208
[  258.650093] print_req_error: operation not supported error, dev nvme0n1, sector 363042304
[  258.650100] print_req_error: operation not supported error, dev nvme0n1, sector 55054168
[  258.650106] print_req_error: operation not supported error, dev nvme0n1, sector 261203184
[  258.650110] print_req_error: operation not supported error, dev nvme0n1, sector 318931552
[  259.401504] nvme nvme0: Reconnecting in 10 seconds...
[  259.401508] Buffer I/O error on dev nvme0n1, logical block 218, lost async page write
[  259.415933] Buffer I/O error on dev nvme0n1, logical block 219, lost async page write
[  259.424709] Buffer I/O error on dev nvme0n1, logical block 267, lost async page write
[  259.433479] Buffer I/O error on dev nvme0n1, logical block 268, lost async page write
[  259.442248] Buffer I/O error on dev nvme0n1, logical block 269, lost async page write
[  259.451017] Buffer I/O error on dev nvme0n1, logical block 270, lost async page write
[  259.459784] Buffer I/O error on dev nvme0n1, logical block 271, lost async page write
[  259.468550] Buffer I/O error on dev nvme0n1, logical block 272, lost async page write
[  259.477319] Buffer I/O error on dev nvme0n1, logical block 273, lost async page write
[  259.486095] Buffer I/O error on dev nvme0n1, logical block 341, lost async page write
[  264.003845] nvme nvme0: Identify namespace failed
[  264.009222] print_req_error: 391720 callbacks suppressed
[  264.009223] print_req_error: I/O error, dev nvme0n1, sector 0
[  264.021610] print_req_error: I/O error, dev nvme0n1, sector 0
[  264.028048] print_req_error: I/O error, dev nvme0n1, sector 0
[  264.034486] print_req_error: I/O error, dev nvme0n1, sector 0
[  264.040922] print_req_error: I/O error, dev nvme0n1, sector 0
[  264.047359] print_req_error: I/O error, dev nvme0n1, sector 0
[  264.053794] Dev nvme0n1: unable to read RDB block 0
[  264.059261] print_req_error: I/O error, dev nvme0n1, sector 0
[  264.065699] print_req_error: I/O error, dev nvme0n1, sector 0
[  264.072134]  nvme0n1: unable to read partition table
[  264.082672] print_req_error: I/O error, dev nvme0n1, sector 524287872
[  264.090339] print_req_error: I/O error, dev nvme0n1, sector 524287872
[  269.481193] nvme nvme0: creating 37 I/O queues.
[  269.787024] BUG: unable to handle kernel paging request at 0000473023d3b6c8
[  269.795246] IP: blk_mq_get_request+0x23e/0x390
[  269.800599] PGD 0 P4D 0 
[  269.803810] Oops: 0002 [#1] SMP PTI
[  269.808089] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc ib_isert iscsir
[  269.890870]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core ahci libahci tg3 libata crc32c_intel i2c_core devlink dm_mirror dm_region_hash dm_log dm_mod
[  269.908864] CPU: 36 PID: 680 Comm: kworker/u369:8 Not tainted 4.16.0-rc7 #3
[  269.917207] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
[  269.926155] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
[  269.934239] RIP: 0010:blk_mq_get_request+0x23e/0x390
[  269.940392] RSP: 0018:ffffb237087cbca8 EFLAGS: 00010246
[  269.946841] RAX: 0000473023d3b680 RBX: ffff8b06546e0000 RCX: 000000000000001f
[  269.955443] RDX: 0000000000000000 RSI: ffffffdbc0ce8100 RDI: ffff8b0653431000
[  269.964053] RBP: ffffb237087cbce8 R08: ffffffffffffffff R09: 0000000000000002
[  269.972674] R10: ffff8af67eaa7160 R11: ffffd62c40186c00 R12: 0000000000000023
[  269.981285] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  269.989891] FS:  0000000000000000(0000) GS:ffff8af67ea80000(0000) knlGS:0000000000000000
[  269.999577] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.006654] CR2: 0000473023d3b6c8 CR3: 00000015ed40a001 CR4: 00000000001606e0
[  270.015300] Call Trace:
[  270.018716]  blk_mq_alloc_request_hctx+0xf2/0x140
[  270.024668]  nvme_alloc_request+0x36/0x60 [nvme_core]
[  270.031016]  __nvme_submit_sync_cmd+0x2b/0xd0 [nvme_core]
[  270.037762]  nvmf_connect_io_queue+0x10e/0x170 [nvme_fabrics]
[  270.044898]  nvme_rdma_start_queue+0x21/0x80 [nvme_rdma]
[  270.051566]  nvme_rdma_configure_io_queues+0x196/0x280 [nvme_rdma]
[  270.059199]  nvme_rdma_reconnect_ctrl_work+0x39/0xd0 [nvme_rdma]
[  270.066637]  process_one_work+0x158/0x360
[  270.071846]  worker_thread+0x47/0x3e0
[  270.076672]  kthread+0xf8/0x130
[  270.080918]  ? max_active_store+0x80/0x80
[  270.086142]  ? kthread_bind+0x10/0x10
[  270.090987]  ret_from_fork+0x35/0x40
[  270.095739] Code: 89 83 40 01 00 00 45 84 e4 48 c7 83 48 01 00 00 00 00 00 00 ba 01 00 00 00 48 8b 45 10 74 0c 31 d2 41 f7 c4 00 08 06 00 0f 95 c2 <48> 83 44 d0 48 01 41 81 e4 00 00 06  
[  270.118418] RIP: blk_mq_get_request+0x23e/0x390 RSP: ffffb237087cbca8
[  270.126422] CR2: 0000473023d3b6c8
[  270.130994] ---[ end trace 222e693b7ee07afa ]---
[  270.141098] Kernel panic - not syncing: Fatal exception
[  270.147812] Kernel Offset: 0x22800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[  270.164696] ---[ end Kernel panic - not syncing: Fatal exception
[  270.172257] WARNING: CPU: 36 PID: 680 at kernel/sched/core.c:1189 set_task_cpu+0x18c/0x1a0
[  270.182333] Modules linked in: nvme_rdma nvme_fabrics nvme_core sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge 8021q garp mrp stp llc ib_isert iscsir
[  270.268075]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core ahci libahci tg3 libata crc32c_intel i2c_core devlink dm_mirror dm_region_hash dm_log dm_mod
[  270.286750] CPU: 36 PID: 680 Comm: kworker/u369:8 Tainted: G      D          4.16.0-rc7 #3
[  270.296862] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
[  270.306088] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
[  270.314436] RIP: 0010:set_task_cpu+0x18c/0x1a0
[  270.320253] RSP: 0018:ffff8af67ea83ce0 EFLAGS: 00010046
[  270.326938] RAX: 0000000000000200 RBX: ffff8af65d9445c0 RCX: 0000005555555501
[  270.335764] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff8af65d9445c0
[  270.344591] RBP: 0000000000022380 R08: 0000000000000000 R09: 0000000000000010
[  270.353409] R10: 000000005abdf5ea R11: 0000000016684c67 R12: 0000000000000000
[  270.362223] R13: 0000000000000000 R14: 0000000000000046 R15: 0000000000000000
[  270.371030] FS:  0000000000000000(0000) GS:ffff8af67ea80000(0000) knlGS:0000000000000000
[  270.380913] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  270.388166] CR2: 0000473023d3b6c8 CR3: 00000015ed40a001 CR4: 00000000001606e0
[  270.396985] Call Trace:
[  270.400557]  <IRQ>
[  270.403621]  try_to_wake_up+0x167/0x460
[  270.408730]  ? enqueue_task_fair+0x67/0xa00
[  270.414224]  __wake_up_common+0x8f/0x160
[  270.419417]  ep_poll_callback+0xc4/0x2f0
[  270.424609]  __wake_up_common+0x8f/0x160
[  270.429796]  __wake_up_common_lock+0x7a/0xc0
[  270.435368]  irq_work_run_list+0x4c/0x70
[  270.440547]  ? tick_sched_do_timer+0x60/0x60
[  270.446115]  update_process_times+0x3b/0x50
[  270.451579]  tick_sched_handle+0x26/0x60
[  270.456752]  tick_sched_timer+0x34/0x70
[  270.461826]  __hrtimer_run_queues+0xfb/0x270
[  270.467388]  hrtimer_interrupt+0x122/0x270
[  270.472756]  smp_apic_timer_interrupt+0x62/0x130
[  270.478712]  apic_timer_interrupt+0xf/0x20
[  270.484066]  </IRQ>
[  270.487167] RIP: 0010:panic+0x206/0x25c
[  270.492195] RSP: 0018:ffffb237087cba60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
[  270.501406] RAX: 0000000000000034 RBX: 0000000000000000 RCX: 0000000000000006
[  270.510136] RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff8af67ea968b0
[  270.518863] RBP: ffffb237087cbad0 R08: 0000000000000000 R09: 0000000000000886
[  270.527578] R10: 00000000000003ff R11: 0000000000aaaaaa R12: ffffffffa4654b1a
[  270.536278] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[  270.544970]  oops_end+0xb0/0xc0
[  270.549179]  no_context+0x1b3/0x430
[  270.553753]  ? account_entity_dequeue+0xa3/0xd0
[  270.559473]  __do_page_fault+0x97/0x4c0
[  270.564396]  do_page_fault+0x32/0x140
[  270.569103]  page_fault+0x25/0x50
[  270.573398] RIP: 0010:blk_mq_get_request+0x23e/0x390
[  270.579516] RSP: 0018:ffffb237087cbca8 EFLAGS: 00010246
[  270.585906] RAX: 0000473023d3b680 RBX: ffff8b06546e0000 RCX: 000000000000001f
[  270.594422] RDX: 0000000000000000 RSI: ffffffdbc0ce8100 RDI: ffff8b0653431000
[  270.602929] RBP: ffffb237087cbce8 R08: ffffffffffffffff R09: 0000000000000002
[  270.611432] R10: ffff8af67eaa7160 R11: ffffd62c40186c00 R12: 0000000000000023
[  270.619927] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  270.628409]  ? blk_mq_get_request+0x212/0x390
[  270.633795]  blk_mq_alloc_request_hctx+0xf2/0x140
[  270.639565]  nvme_alloc_request+0x36/0x60 [nvme_core]
[  270.645721]  __nvme_submit_sync_cmd+0x2b/0xd0 [nvme_core]
[  270.652269]  nvmf_connect_io_queue+0x10e/0x170 [nvme_fabrics]
[  270.659209]  nvme_rdma_start_queue+0x21/0x80 [nvme_rdma]
[  270.665668]  nvme_rdma_configure_io_queues+0x196/0x280 [nvme_rdma]
[  270.673087]  nvme_rdma_reconnect_ctrl_work+0x39/0xd0 [nvme_rdma]
[  270.680314]  process_one_work+0x158/0x360
[  270.685302]  worker_thread+0x47/0x3e0
[  270.689897]  kthread+0xf8/0x130
[  270.693906]  ? max_active_store+0x80/0x80
[  270.698880]  ? kthread_bind+0x10/0x10
[  270.703473]  ret_from_fork+0x35/0x40
[  270.707967] Code: 8b 9c 08 00 00 04 e9 28 ff ff ff 0f 0b 66 90 e9 bf fe ff ff f7 83 88 00 00 00 fd ff ff ff 0f 84 c9 fe ff ff 0f 0b e9 c2 fe ff ff <0f> 0b e9 d1 fe ff ff 0f 1f 00 66 2e  
[  270.730149] ---[ end trace 222e693b7ee07afb ]---

Best Regards,
  Yi Zhang

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
From: Sagi Grimberg @ 2018-04-04 13:22 UTC
  To: Yi Zhang, linux-nvme, linux-block; +Cc: Ming Lei



On 03/30/2018 12:32 PM, Yi Zhang wrote:
> Hello
> I got this kernel BUG on 4.16.0-rc7, here is the reproducer and log, let me know if you need more info, thanks.
> 
> Reproducer:
> 1. setup target
> #nvmetcli restore /etc/rdma.json
> 2. connect target on host
> #nvme connect-all -t rdma -a $IP -s 4420
> 3. do fio background on host
> #fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -bs_unaligned -runtime=180 -size=-group_reporting -name=mytest -numjobs=60 &
> 4. offline cpu on host
> #echo 0 > /sys/devices/system/cpu/cpu1/online
> #echo 0 > /sys/devices/system/cpu/cpu2/online
> #echo 0 > /sys/devices/system/cpu/cpu3/online
> 5. clear target
> #nvmetcli clear
> 6. restore target
> #nvmetcli restore /etc/rdma.json
> 7. check console log on host

Hi Yi,

Does this happen with this applied?
--
diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
index 996167f1de18..b89da55e8aaa 100644
--- a/block/blk-mq-rdma.c
+++ b/block/blk-mq-rdma.c
@@ -35,6 +35,8 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
         const struct cpumask *mask;
         unsigned int queue, cpu;

+       goto fallback;
+
         for (queue = 0; queue < set->nr_hw_queues; queue++) {
                 mask = ib_get_vector_affinity(dev, first_vec + queue);
                 if (!mask)
--
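
(For context: the fallback label at the tail of blk_mq_rdma_map_queues() just
defers to the generic spread, roughly:

--
fallback:
        return blk_mq_map_queues(set);
--

so the change above forces the default CPU-to-queue mapping instead of the
ib_get_vector_affinity() based one.)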

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
From: Yi Zhang @ 2018-04-05 16:35 UTC
  To: Sagi Grimberg; +Cc: linux-nvme, linux-block, Ming Lei



On 04/04/2018 09:22 PM, Sagi Grimberg wrote:
>
>
> On 03/30/2018 12:32 PM, Yi Zhang wrote:
>> Hello
>> I got this kernel BUG on 4.16.0-rc7, here is the reproducer and log, let me know if you need more info, thanks.
>>
>> Reproducer:
>> 1. setup target
>> #nvmetcli restore /etc/rdma.json
>> 2. connect target on host
>> #nvme connect-all -t rdma -a $IP -s 4420
>> 3. do fio background on host
>> #fio -filename=/dev/nvme0n1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -bs_unaligned -runtime=180 -size=-group_reporting -name=mytest -numjobs=60 &
>> 4. offline cpu on host
>> #echo 0 > /sys/devices/system/cpu/cpu1/online
>> #echo 0 > /sys/devices/system/cpu/cpu2/online
>> #echo 0 > /sys/devices/system/cpu/cpu3/online
>> 5. clear target
>> #nvmetcli clear
>> 6. restore target
>> #nvmetcli restore /etc/rdma.json
>> 7. check console log on host
>
> Hi Yi,
>
> Does this happen with this applied?
> -- 
> diff --git a/block/blk-mq-rdma.c b/block/blk-mq-rdma.c
> index 996167f1de18..b89da55e8aaa 100644
> --- a/block/blk-mq-rdma.c
> +++ b/block/blk-mq-rdma.c
> @@ -35,6 +35,8 @@ int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
>         const struct cpumask *mask;
>         unsigned int queue, cpu;
>
> +       goto fallback;
> +
>         for (queue = 0; queue < set->nr_hw_queues; queue++) {
>                 mask = ib_get_vector_affinity(dev, first_vec + queue);
>                 if (!mask)
> -- 
>

Hi Sagi,

I can still reproduce this issue with the change:

[  133.469908] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  133.554025] nvme nvme0: creating 40 I/O queues.
[  133.947648] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  138.740870] smpboot: CPU 1 is now offline
[  138.778382] IRQ 37: no longer affine to CPU2
[  138.783153] IRQ 54: no longer affine to CPU2
[  138.787919] IRQ 70: no longer affine to CPU2
[  138.792687] IRQ 98: no longer affine to CPU2
[  138.797458] IRQ 140: no longer affine to CPU2
[  138.802319] IRQ 141: no longer affine to CPU2
[  138.807189] IRQ 166: no longer affine to CPU2
[  138.813622] smpboot: CPU 2 is now offline
[  139.043610] smpboot: CPU 3 is now offline
[  141.587283] print_req_error: operation not supported error, dev nvme0n1, sector 494622136
[  141.587303] print_req_error: operation not supported error, dev nvme0n1, sector 219643648
[  141.587304] print_req_error: operation not supported error, dev nvme0n1, sector 279256456
[  141.587306] print_req_error: operation not supported error, dev nvme0n1, sector 1208024
[  141.587322] print_req_error: operation not supported error, dev nvme0n1, sector 100575248
[  141.587335] print_req_error: operation not supported error, dev nvme0n1, sector 111717456
[  141.587346] print_req_error: operation not supported error, dev nvme0n1, sector 171939296
[  141.587348] print_req_error: operation not supported error, dev nvme0n1, sector 476420528
[  141.587353] print_req_error: operation not supported error, dev nvme0n1, sector 371566696
[  141.587356] print_req_error: operation not supported error, dev nvme0n1, sector 161758408
[  141.587463] Buffer I/O error on dev nvme0n1, logical block 54193430, lost async page write
[  141.587472] Buffer I/O error on dev nvme0n1, logical block 54193431, lost async page write
[  141.587478] Buffer I/O error on dev nvme0n1, logical block 54193432, lost async page write
[  141.587483] Buffer I/O error on dev nvme0n1, logical block 54193433, lost async page write
[  141.587532] Buffer I/O error on dev nvme0n1, logical block 54193476, lost async page write
[  141.587534] Buffer I/O error on dev nvme0n1, logical block 54193477, lost async page write
[  141.587536] Buffer I/O error on dev nvme0n1, logical block 54193478, lost async page write
[  141.587538] Buffer I/O error on dev nvme0n1, logical block 54193479, lost async page write
[  141.587540] Buffer I/O error on dev nvme0n1, logical block 54193480, lost async page write
[  141.587542] Buffer I/O error on dev nvme0n1, logical block 54193481, lost async page write
[  142.573522] nvme nvme0: Reconnecting in 10 seconds...
[  146.587532] buffer_io_error: 3743628 callbacks suppressed
[  146.587534] Buffer I/O error on dev nvme0n1, logical block 64832757, lost async page write
[  146.602837] Buffer I/O error on dev nvme0n1, logical block 64832758, lost async page write
[  146.612091] Buffer I/O error on dev nvme0n1, logical block 64832759, lost async page write
[  146.621346] Buffer I/O error on dev nvme0n1, logical block 64832760, lost async page write
[  146.630615] print_req_error: 556822 callbacks suppressed
[  146.630616] print_req_error: I/O error, dev nvme0n1, sector 518662176
[  146.643776] Buffer I/O error on dev nvme0n1, logical block 64832772, lost async page write
[  146.653030] Buffer I/O error on dev nvme0n1, logical block 64832773, lost async page write
[  146.662282] Buffer I/O error on dev nvme0n1, logical block 64832774, lost async page write
[  146.671542] print_req_error: I/O error, dev nvme0n1, sector 518662568
[  146.678754] Buffer I/O error on dev nvme0n1, logical block 64832821, lost async page write
[  146.688003] Buffer I/O error on dev nvme0n1, logical block 64832822, lost async page write
[  146.697784] print_req_error: I/O error, dev nvme0n1, sector 518662928
[  146.705450] Buffer I/O error on dev nvme0n1, logical block 64832866, lost async page write
[  146.715176] print_req_error: I/O error, dev nvme0n1, sector 518665376
[  146.722920] print_req_error: I/O error, dev nvme0n1, sector 518666136
[  146.730602] print_req_error: I/O error, dev nvme0n1, sector 518666920
[  146.738275] print_req_error: I/O error, dev nvme0n1, sector 518667880
[  146.745944] print_req_error: I/O error, dev nvme0n1, sector 518668096
[  146.753605] print_req_error: I/O error, dev nvme0n1, sector 518668960
[  146.761249] print_req_error: I/O error, dev nvme0n1, sector 518669616
[  149.010303] nvme nvme0: Identify namespace failed
[  149.016171] Dev nvme0n1: unable to read RDB block 0
[  149.022017]  nvme0n1: unable to read partition table
[  149.032192] nvme nvme0: Identify namespace failed
[  149.037857] Dev nvme0n1: unable to read RDB block 0
[  149.043695]  nvme0n1: unable to read partition table
[  153.081673] nvme nvme0: creating 37 I/O queues.
[  153.384977] BUG: unable to handle kernel paging request at 00003a9ed053bd48
[  153.393197] IP: blk_mq_get_request+0x23e/0x390
[  153.398585] PGD 0 P4D 0
[  153.401841] Oops: 0002 [#1] SMP PTI
[  153.406168] Modules linked in: nvme_rdma nvme_fabrics nvme_core nvmet_rdma nvmet sch_mqprio ebtable_filter ebtables ip6table_filter ip6_tabt
[  153.489688]  drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm mlx4_core ahci libahci crc32c_intel libata tg3 i2c_core dd
[  153.509370] CPU: 32 PID: 689 Comm: kworker/u369:6 Not tainted 4.16.0-rc7.sagi+ #4
[  153.518417] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.6.2 01/08/2016
[  153.527486] Workqueue: nvme-wq nvme_rdma_reconnect_ctrl_work [nvme_rdma]
[  153.535695] RIP: 0010:blk_mq_get_request+0x23e/0x390
[  153.541973] RSP: 0018:ffffb8cc0853fca8 EFLAGS: 00010246
[  153.548530] RAX: 00003a9ed053bd00 RBX: ffff9e2cbbf30000 RCX: 000000000000001f
[  153.557230] RDX: 0000000000000000 RSI: ffffffe19b5ba5d2 RDI: ffff9e2c90219000
[  153.565923] RBP: ffffb8cc0853fce8 R08: ffffffffffffffff R09: 0000000000000002
[  153.574628] R10: ffff9e1cbea27160 R11: fffff20780005c00 R12: 0000000000000023
[  153.583340] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  153.592062] FS:  0000000000000000(0000) GS:ffff9e1cbea00000(0000) knlGS:0000000000000000
[  153.601846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  153.609013] CR2: 00003a9ed053bd48 CR3: 00000014b560a003 CR4: 00000000001606e0
[  153.617732] Call Trace:
[  153.621221]  blk_mq_alloc_request_hctx+0xf2/0x140
[  153.627244]  nvme_alloc_request+0x36/0x60 [nvme_core]
[  153.633647]  __nvme_submit_sync_cmd+0x2b/0xd0 [nvme_core]
[  153.640429]  nvmf_connect_io_queue+0x10e/0x170 [nvme_fabrics]
[  153.647613]  nvme_rdma_start_queue+0x21/0x80 [nvme_rdma]
[  153.654300]  nvme_rdma_configure_io_queues+0x196/0x280 [nvme_rdma]
[  153.661947]  nvme_rdma_reconnect_ctrl_work+0x39/0xd0 [nvme_rdma]
[  153.669394]  process_one_work+0x158/0x360
[  153.674618]  worker_thread+0x47/0x3e0
[  153.679458]  kthread+0xf8/0x130
[  153.683717]  ? max_active_store+0x80/0x80
[  153.688952]  ? kthread_bind+0x10/0x10
[  153.693809]  ret_from_fork+0x35/0x40
[  153.698569] Code: 89 83 40 01 00 00 45 84 e4 48 c7 83 48 01 00 00 00 00 00 00 ba 01 00 00 00 48 8b 45 10 74 0c 31 d2 41 f7 c4 00 08 06 00 0
[  153.721261] RIP: blk_mq_get_request+0x23e/0x390 RSP: ffffb8cc0853fca8
[  153.729264] CR2: 00003a9ed053bd48
[  153.733833] ---[ end trace f77c1388aba74f1c ]---

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
From: Sagi Grimberg @ 2018-04-08 10:36 UTC
  To: Yi Zhang; +Cc: linux-nvme, linux-block, Ming Lei


> Hi Sagi
> 
> Still can reproduce this issue with the change:

Thanks for validating, Yi.

Would it be possible to test the following:
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
                 return ERR_PTR(-EXDEV);
         }
         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+       if (cpu >= nr_cpu_ids) {
+               pr_warn("no online cpu for hctx %d\n", hctx_idx);
+               cpu = cpumask_first(alloc_data.hctx->cpumask);
+       }
         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--
...


> [  153.384977] BUG: unable to handle kernel paging request at 00003a9ed053bd48
> [  153.393197] IP: blk_mq_get_request+0x23e/0x390

Also, would it be possible to provide the gdb output of:

l *(blk_mq_get_request+0x23e)

Thanks,

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
From: Ming Lei @ 2018-04-08 10:44 UTC
  To: Sagi Grimberg; +Cc: Yi Zhang, linux-nvme, linux-block

On Sun, Apr 08, 2018 at 01:36:27PM +0300, Sagi Grimberg wrote:
> 
> > Hi Sagi
> > 
> > Still can reproduce this issue with the change:
> 
> Thanks for validating Yi,
> 
> Would it be possible to test the following:
> --
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 75336848f7a7..81ced3096433 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>                 return ERR_PTR(-EXDEV);
>         }
>         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +       if (cpu >= nr_cpu_ids) {
> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> +       }
>         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> 
>         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> --
> ...
> 
> 
> > [  153.384977] BUG: unable to handle kernel paging request at 00003a9ed053bd48
> > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> 
> Also would it be possible to provide gdb output of:
> 
> l *(blk_mq_get_request+0x23e)

nvmf_connect_io_queue() is used in this way: it asks blk-mq to allocate a
request from one specific hw queue, but that hw queue may not have any
online CPUs mapped to it.
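
Roughly, the failing path looks like this (a simplified sketch from memory,
not the verbatim 4.16 source):

--
struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
		unsigned int op, blk_mq_req_flags_t flags,
		unsigned int hctx_idx)
{
	struct blk_mq_alloc_data alloc_data = { .flags = flags };
	unsigned int cpu;
	...
	alloc_data.hctx = q->queue_hw_ctx[hctx_idx];
	/*
	 * If every CPU mapped to this hctx is offline, the AND of the
	 * two masks is empty and cpumask_first_and() returns nr_cpu_ids.
	 */
	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	/*
	 * __blk_mq_get_ctx() is per_cpu_ptr(q->queue_ctx, cpu); with
	 * cpu == nr_cpu_ids that is a wild per-cpu address, and
	 * blk_mq_get_request() faults when it writes through it --
	 * matching the oops address above.
	 */
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

	return blk_mq_get_request(q, NULL, op, &alloc_data);
}
--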

Thanks,
Ming

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
From: Ming Lei @ 2018-04-08 10:48 UTC
  To: Sagi Grimberg; +Cc: Yi Zhang, linux-nvme, linux-block

On Sun, Apr 08, 2018 at 06:44:33PM +0800, Ming Lei wrote:
> On Sun, Apr 08, 2018 at 01:36:27PM +0300, Sagi Grimberg wrote:
> > 
> > > Hi Sagi
> > > 
> > > Still can reproduce this issue with the change:
> > 
> > Thanks for validating Yi,
> > 
> > Would it be possible to test the following:
> > --
> > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > index 75336848f7a7..81ced3096433 100644
> > --- a/block/blk-mq.c
> > +++ b/block/blk-mq.c
> > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
> >                 return ERR_PTR(-EXDEV);
> >         }
> >         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> > +       if (cpu >= nr_cpu_ids) {
> > +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
> > +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> > +       }
> >         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > 
> >         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > --
> > ...
> > 
> > 
> > > [  153.384977] BUG: unable to handle kernel paging request at 00003a9ed053bd48
> > > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> > 
> > Also would it be possible to provide gdb output of:
> > 
> > l *(blk_mq_get_request+0x23e)
> 
> nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
> request from one specific hw queue, but there may not be all online CPUs
> mapped to this hw queue.

And the following patchset would make this kind of allocation fail cleanly,
avoiding the kernel oops:

	https://marc.info/?l=linux-block&m=152318091025252&w=2
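
The idea is roughly the following (my simplified sketch, not the actual
patch from that link):

--
	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	/*
	 * No online CPU is mapped to this hw queue: fail the allocation
	 * cleanly instead of dereferencing an invalid per-cpu ctx.
	 */
	if (cpu >= nr_cpu_ids) {
		blk_queue_exit(q);
		return ERR_PTR(-EXDEV);
	}
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
--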

Thanks,
Ming

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
From: Sagi Grimberg @ 2018-04-08 10:58 UTC
  To: Ming Lei; +Cc: Yi Zhang, linux-nvme, linux-block


>>>> Hi Sagi
>>>>
>>>> Still can reproduce this issue with the change:
>>>
>>> Thanks for validating Yi,
>>>
>>> Would it be possible to test the following:
>>> --
>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>> index 75336848f7a7..81ced3096433 100644
>>> --- a/block/blk-mq.c
>>> +++ b/block/blk-mq.c
>>> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>>>                  return ERR_PTR(-EXDEV);
>>>          }
>>>          cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>>> +       if (cpu >= nr_cpu_ids) {
>>> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
>>> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
>>> +       }
>>>          alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>>>
>>>          rq = blk_mq_get_request(q, NULL, op, &alloc_data);
>>> --
>>> ...
>>>
>>>
>>>> [  153.384977] BUG: unable to handle kernel paging request at
>>>> 00003a9ed053bd48
>>>> [  153.393197] IP: blk_mq_get_request+0x23e/0x390
>>>
>>> Also would it be possible to provide gdb output of:
>>>
>>> l *(blk_mq_get_request+0x23e)
>>
>> nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
>> request from one specific hw queue, but there may not be all online CPUs
>> mapped to this hw queue.

Yes, this is what I suspect..

> And the following patchset may fail this kind of allocation and avoid
> the kernel oops.
> 
> 	https://marc.info/?l=linux-block&m=152318091025252&w=2

Thanks Ming,

But I don't want to fail the allocation: nvmf_connect_io_queue simply
needs a tag to issue the connect request, and I'd much rather take this
tag from an online cpu than fail it... We use this because we reserve
a tag per queue for this, but in this case I'd rather block until the
in-flight tag completes than fail the connect.
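
For context, a minimal sketch of what I mean (not the literal fabrics
code; the "qid - 1 == hctx index" convention is an assumption here):
--
/*
 * Sketch only: ask blk-mq for a tag on one specific hw queue, drawing
 * on the reserved tag we keep per queue for the connect command.
 */
static struct request *connect_rq_sketch(struct request_queue *q, int qid)
{
	return blk_mq_alloc_request_hctx(q, REQ_OP_DRV_OUT,
					 BLK_MQ_REQ_NOWAIT | BLK_MQ_REQ_RESERVED,
					 qid - 1);	/* I/O queues start at qid 1 */
}
--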

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-08 10:58               ` Sagi Grimberg
@ 2018-04-08 11:04                 ` Ming Lei
  -1 siblings, 0 replies; 36+ messages in thread
From: Ming Lei @ 2018-04-08 11:04 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Yi Zhang, linux-nvme, linux-block

On Sun, Apr 08, 2018 at 01:58:49PM +0300, Sagi Grimberg wrote:
> 
> > > > > Hi Sagi
> > > > > 
> > > > > Still can reproduce this issue with the change:
> > > > 
> > > > Thanks for validating Yi,
> > > > 
> > > > Would it be possible to test the following:
> > > > --
> > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > index 75336848f7a7..81ced3096433 100644
> > > > --- a/block/blk-mq.c
> > > > +++ b/block/blk-mq.c
> > > > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
> > > > request_queue *q,
> > > >                  return ERR_PTR(-EXDEV);
> > > >          }
> > > >          cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> > > > +       if (cpu >= nr_cpu_ids) {
> > > > +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
> > > > +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> > > > +       }
> > > >          alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > > > 
> > > >          rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > > > --
> > > > ...
> > > > 
> > > > 
> > > > > [  153.384977] BUG: unable to handle kernel paging request at
> > > > > 00003a9ed053bd48
> > > > > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> > > > 
> > > > Also would it be possible to provide gdb output of:
> > > > 
> > > > l *(blk_mq_get_request+0x23e)
> > > 
> > > nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
> > > request from one specific hw queue, but there may not be all online CPUs
> > > mapped to this hw queue.
> 
> Yes, this is what I suspect..
> 
> > And the following patchset may fail this kind of allocation and avoid
> > the kernel oops.
> > 
> > 	https://marc.info/?l=linux-block&m=152318091025252&w=2
> 
> Thanks Ming,
> 
> But I don't want to fail the allocation, nvmf_connect_io_queue simply
> needs a tag to issue the connect request, I much rather to take this
> tag from an online cpu than failing it... We use this because we reserve

The failure is only triggered when there isn't any online CPU mapped to
this hctx, so do you want to wait for one of this hctx's CPUs to become
online?

Or I may have understood you wrong. :-)

> a tag per-queue for this, but in this case, I'd rather block until the
> inflight tag complete than failing the connect.

No, there can't be any inflight request for this hctx.


Thanks,
Ming

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-08 11:04                 ` Ming Lei
@ 2018-04-08 11:53                   ` Sagi Grimberg
  -1 siblings, 0 replies; 36+ messages in thread
From: Sagi Grimberg @ 2018-04-08 11:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: Yi Zhang, linux-nvme, linux-block


>>>>>> Hi Sagi
>>>>>>
>>>>>> Still can reproduce this issue with the change:
>>>>>
>>>>> Thanks for validating Yi,
>>>>>
>>>>> Would it be possible to test the following:
>>>>> --
>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>> index 75336848f7a7..81ced3096433 100644
>>>>> --- a/block/blk-mq.c
>>>>> +++ b/block/blk-mq.c
>>>>> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
>>>>> request_queue *q,
>>>>>                   return ERR_PTR(-EXDEV);
>>>>>           }
>>>>>           cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>>>>> +       if (cpu >= nr_cpu_ids) {
>>>>> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
>>>>> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
>>>>> +       }
>>>>>           alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>>>>>
>>>>>           rq = blk_mq_get_request(q, NULL, op, &alloc_data);
>>>>> --
>>>>> ...
>>>>>
>>>>>
>>>>>> [  153.384977] BUG: unable to handle kernel paging request at
>>>>>> 00003a9ed053bd48
>>>>>> [  153.393197] IP: blk_mq_get_request+0x23e/0x390
>>>>>
>>>>> Also would it be possible to provide gdb output of:
>>>>>
>>>>> l *(blk_mq_get_request+0x23e)
>>>>
>>>> nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
>>>> request from one specific hw queue, but there may not be all online CPUs
>>>> mapped to this hw queue.
>>
>> Yes, this is what I suspect..
>>
>>> And the following patchset may fail this kind of allocation and avoid
>>> the kernel oops.
>>>
>>> 	https://marc.info/?l=linux-block&m=152318091025252&w=2
>>
>> Thanks Ming,
>>
>> But I don't want to fail the allocation, nvmf_connect_io_queue simply
>> needs a tag to issue the connect request, I much rather to take this
>> tag from an online cpu than failing it... We use this because we reserve
> 
> The failure is only triggered when there isn't any online CPU mapped to
> this hctx, so do you want to wait for CPUs for this hctx becoming online?

I was thinking of allocating a tag from that hctx even if it had no
online cpu; the execution is done on an online cpu (hence the call
to blk_mq_alloc_request_hctx).

> Or I may understand you wrong, :-)

In the report we connected 40 hctxs (which was exactly the number of
online cpus); after Yi removed 3 cpus, we tried to connect 37 hctxs.
I'm not sure why some hctxs are left without any online cpus.

This seems to be related to the queue mapping.

Let's say I have a 4-cpu system and my device always allocates
num_online_cpus() hctxs.

At first I get:
cpu0 -> hctx0
cpu1 -> hctx1
cpu2 -> hctx2
cpu3 -> hctx3

When cpu1 goes offline I think the new mapping will be:
cpu0 -> hctx0
cpu1 -> hctx0 (from cpu_to_queue_index) // offline
cpu2 -> hctx2
cpu3 -> hctx0 (from cpu_to_queue_index)

This means that hctx1 is now unmapped. I guess we can fix the nvmf code
to not connect it, but then we end up with fewer queues than cpus without
any good reason.

Optimally, I would want a different mapping that uses all
the queues:
cpu0 -> hctx0
cpu2 -> hctx1
cpu3 -> hctx2
* cpu1 -> hctx1 (doesn't matter, offline)

Something looks broken...
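
For illustration, a minimal userspace sketch of the failure mode,
assuming a plain cpu % nr_queues mapping (a simplification of what
blk_mq_map_queues() really does):
--
#include <stdio.h>

int main(void)
{
	int nr_cpus = 4, nr_queues = 4;
	int online[4] = { 1, 0, 1, 1 };		/* cpu1 offline */
	int has_online[4] = { 0 };

	for (int cpu = 0; cpu < nr_cpus; cpu++) {
		int q = cpu % nr_queues;	/* cpu_to_queue_index() style */

		printf("cpu%d -> hctx%d%s\n", cpu, q,
		       online[cpu] ? "" : " (offline)");
		if (online[cpu])
			has_online[q] = 1;
	}
	for (int q = 0; q < nr_queues; q++)
		if (!has_online[q])
			printf("hctx%d: no online cpu mapped\n", q);
	return 0;
}
--
With cpu1 offline this reports that hctx1 has no online cpu mapped,
which is exactly the hctx the connect then trips over.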

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-08 11:53                   ` Sagi Grimberg
@ 2018-04-08 12:57                     ` Ming Lei
  -1 siblings, 0 replies; 36+ messages in thread
From: Ming Lei @ 2018-04-08 12:57 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: Yi Zhang, linux-nvme, linux-block

On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote:
> 
> > > > > > > Hi Sagi
> > > > > > > 
> > > > > > > Still can reproduce this issue with the change:
> > > > > > 
> > > > > > Thanks for validating Yi,
> > > > > > 
> > > > > > Would it be possible to test the following:
> > > > > > --
> > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > index 75336848f7a7..81ced3096433 100644
> > > > > > --- a/block/blk-mq.c
> > > > > > +++ b/block/blk-mq.c
> > > > > > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
> > > > > > request_queue *q,
> > > > > >                   return ERR_PTR(-EXDEV);
> > > > > >           }
> > > > > >           cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> > > > > > +       if (cpu >= nr_cpu_ids) {
> > > > > > +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
> > > > > > +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> > > > > > +       }
> > > > > >           alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > > > > > 
> > > > > >           rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > > > > > --
> > > > > > ...
> > > > > > 
> > > > > > 
> > > > > > > [  153.384977] BUG: unable to handle kernel paging request at
> > > > > > > 00003a9ed053bd48
> > > > > > > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> > > > > > 
> > > > > > Also would it be possible to provide gdb output of:
> > > > > > 
> > > > > > l *(blk_mq_get_request+0x23e)
> > > > > 
> > > > > nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
> > > > > request from one specific hw queue, but there may not be all online CPUs
> > > > > mapped to this hw queue.
> > > 
> > > Yes, this is what I suspect..
> > > 
> > > > And the following patchset may fail this kind of allocation and avoid
> > > > the kernel oops.
> > > > 
> > > > 	https://marc.info/?l=linux-block&m=152318091025252&w=2
> > > 
> > > Thanks Ming,
> > > 
> > > But I don't want to fail the allocation, nvmf_connect_io_queue simply
> > > needs a tag to issue the connect request, I much rather to take this
> > > tag from an online cpu than failing it... We use this because we reserve
> > 
> > The failure is only triggered when there isn't any online CPU mapped to
> > this hctx, so do you want to wait for CPUs for this hctx becoming online?
> 
> I was thinking of allocating a tag from that hctx even if it had no
> online cpu, the execution is done on an online cpu (hence the call
> to blk_mq_alloc_request_hctx).

That can be done, but it doesn't follow the current blk-mq rule, because
blk-mq requires the request to be dispatched on a CPU mapped to this hctx.

Could you explain a bit why you want to do it this way?

> 
> > Or I may understand you wrong, :-)
> 
> In the report we connected 40 hctxs (which was exactly the number of
> online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
> I'm not sure why some hctxs are left without any online cpus.

That is possible after the following two commits:

4b855ad37194 ("blk-mq: Create hctx for each present CPU")
20e4d8139319 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")

And this can be triggered even without putting down any CPUs.

The blk-mq CPU hotplug handler was removed in 4b855ad37194, and we can't
remap queues any more when the CPU topology changes, so the static & fixed
mapping has to be set up from the beginning.

Then if there are fewer online CPUs than hw queues, some hctxs can be
mapped to offline CPUs only. For example, if one device has 4 hw queues
but there are only 2 online CPUs and 6 offline CPUs, at most 2 hw queues
are assigned to online CPUs, and the other two are left with only offline
CPUs.

> 
> This seems to be related to the queue mapping.

Yes.

> 
> Lets say I have 4-cpu system and my device always allocates
> num_online_cpus() hctxs.
> 
> at first I get:
> cpu0 -> hctx0
> cpu1 -> hctx1
> cpu2 -> hctx2
> cpu3 -> hctx3
> 
> When cpu1 goes offline I think the new mapping will be:
> cpu0 -> hctx0
> cpu1 -> hctx0 (from cpu_to_queue_index) // offline
> cpu2 -> hctx2
> cpu3 -> hctx0 (from cpu_to_queue_index)
> 
> This means that now hctx1 is unmapped. I guess we can fix nvmf code
> to not connect it. But we end up with less queues than cpus without
> any good reason.
> 
> I would have optimally want a different mapping that will use all
> the queues:
> cpu0 -> hctx0
> cpu2 -> hctx1
> cpu3 -> hctx2
> * cpu1 -> hctx1 (doesn't matter, offline)
> 
> Something looks broken...

No, it isn't broken.

Storage is a client/server model: a hw queue should only be active if
there are requests coming from a client (CPU), and the hw queue becomes
inactive if no online CPU is mapped to it.

That is why the normal rule is that request allocation needs CPU context
info, and the hctx is then obtained via the queue mapping.
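
In code terms, a rough sketch of that rule (names follow the 4.16
blk-mq internals loosely, not exactly):
--
/*
 * Sketch only: the submitting CPU picks the context, and the hctx is
 * derived from that CPU via the queue mapping, never the reverse.
 */
static struct blk_mq_hw_ctx *hctx_for_current_cpu(struct request_queue *q)
{
	int cpu = get_cpu();	/* CPU context comes first */
	struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, cpu);

	put_cpu();
	return hctx;
}
--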

Thanks
Ming

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-08 12:57                     ` Ming Lei
@ 2018-04-08 13:35                       ` Sagi Grimberg
  -1 siblings, 0 replies; 36+ messages in thread
From: Sagi Grimberg @ 2018-04-08 13:35 UTC (permalink / raw)
  To: Ming Lei; +Cc: Yi Zhang, linux-nvme, linux-block



On 04/08/2018 03:57 PM, Ming Lei wrote:
> On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote:
>>
>>>>>>>> Hi Sagi
>>>>>>>>
>>>>>>>> Still can reproduce this issue with the change:
>>>>>>>
>>>>>>> Thanks for validating Yi,
>>>>>>>
>>>>>>> Would it be possible to test the following:
>>>>>>> --
>>>>>>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>>>>>>> index 75336848f7a7..81ced3096433 100644
>>>>>>> --- a/block/blk-mq.c
>>>>>>> +++ b/block/blk-mq.c
>>>>>>> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
>>>>>>> request_queue *q,
>>>>>>>                    return ERR_PTR(-EXDEV);
>>>>>>>            }
>>>>>>>            cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>>>>>>> +       if (cpu >= nr_cpu_ids) {
>>>>>>> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
>>>>>>> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
>>>>>>> +       }
>>>>>>>            alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>>>>>>>
>>>>>>>            rq = blk_mq_get_request(q, NULL, op, &alloc_data);
>>>>>>> --
>>>>>>> ...
>>>>>>>
>>>>>>>
>>>>>>>> [  153.384977] BUG: unable to handle kernel paging request at
>>>>>>>> 00003a9ed053bd48
>>>>>>>> [  153.393197] IP: blk_mq_get_request+0x23e/0x390
>>>>>>>
>>>>>>> Also would it be possible to provide gdb output of:
>>>>>>>
>>>>>>> l *(blk_mq_get_request+0x23e)
>>>>>>
>>>>>> nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
>>>>>> request from one specific hw queue, but there may not be all online CPUs
>>>>>> mapped to this hw queue.
>>>>
>>>> Yes, this is what I suspect..
>>>>
>>>>> And the following patchset may fail this kind of allocation and avoid
>>>>> the kernel oops.
>>>>>
>>>>> 	https://marc.info/?l=linux-block&m=152318091025252&w=2
>>>>
>>>> Thanks Ming,
>>>>
>>>> But I don't want to fail the allocation, nvmf_connect_io_queue simply
>>>> needs a tag to issue the connect request, I much rather to take this
>>>> tag from an online cpu than failing it... We use this because we reserve
>>>
>>> The failure is only triggered when there isn't any online CPU mapped to
>>> this hctx, so do you want to wait for CPUs for this hctx becoming online?
>>
>> I was thinking of allocating a tag from that hctx even if it had no
>> online cpu, the execution is done on an online cpu (hence the call
>> to blk_mq_alloc_request_hctx).
> 
> That can be done, but not following the current blk-mq's rule, because
> blk-mq requires to dispatch the request on CPUs mapping to this hctx.
> 
> Could you explain a bit why you want to do in this way?

My device exposes nr_hw_queues, which is not higher than num_online_cpus,
so I want to connect all hctxs in the hope that they will be used.

I agree we don't want to connect an hctx which doesn't have an online
cpu; that's redundant. But this is not the case here.

>>> Or I may understand you wrong, :-)
>>
>> In the report we connected 40 hctxs (which was exactly the number of
>> online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
>> I'm not sure why some hctxs are left without any online cpus.
> 
> That is possible after the following two commits:
> 
> 4b855ad37194 ("blk-mq: Create hctx for each present CPU)
> 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU)
> 
> And this can be triggered even without putting down any CPUs.
> 
> The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we can't
> remap queue any more when CPU topo is changed, so the static & fixed mapping
> has to be setup from the beginning.
> 
> Then if there are less enough online CPUs compared with number of hw queues,
> some of hctxes can be mapped with all offline CPUs. For example, if one device
> has 4 hw queues, but there are only 2 online CPUs and 6 offline CPUs, at most
> 2 hw queues are assigned to online CPUs, and the other two are all with offline
> CPUs.

That is fine, but the problem is the example I gave below, which has
nr_hw_queues == num_online_cpus and yet, because of the mapping, still
leaves some hctxs unmapped.

>> Lets say I have 4-cpu system and my device always allocates
>> num_online_cpus() hctxs.
>>
>> at first I get:
>> cpu0 -> hctx0
>> cpu1 -> hctx1
>> cpu2 -> hctx2
>> cpu3 -> hctx3
>>
>> When cpu1 goes offline I think the new mapping will be:
>> cpu0 -> hctx0
>> cpu1 -> hctx0 (from cpu_to_queue_index) // offline
>> cpu2 -> hctx2
>> cpu3 -> hctx0 (from cpu_to_queue_index)
>>
>> This means that now hctx1 is unmapped. I guess we can fix nvmf code
>> to not connect it. But we end up with less queues than cpus without
>> any good reason.
>>
>> I would have optimally want a different mapping that will use all
>> the queues:
>> cpu0 -> hctx0
>> cpu2 -> hctx1
>> cpu3 -> hctx2
>> * cpu1 -> hctx1 (doesn't matter, offline)
>>
>> Something looks broken...
> 
> No, it isn't broken.

maybe broken is the wrong phrase, but it's suboptimal...

> Storage is client/server model, the hw queue should be only active if
> there is request coming from client(CPU),

Correct.

> and the hw queue becomes inactive if no online CPU is mapped to it.

But when we reset the controller, we call blk_mq_update_nr_hw_queues()
with the current nr_hw_queues, which never exceeds num_online_cpus.
This, in turn, remaps mq_map, which results in unmapped queues because
of the mapping function, not because we have more hctxs than online
cpus...

An easy fix is to allocate num_present_cpus queues and only connect
the online ones, but as you said, we have unused resources this way.

We also have an issue with blk_mq_rdma_map_queues: the only device that
supports it doesn't use managed affinity (the code was reverted) and can
have its irq affinity redirected in case of cpu offlining...

The goal here, I think, should be to allocate just enough queues (not
more than the number of online cpus), spread them 1:1 with the online
cpus, and also make sure to allocate completion vectors that align with
the online cpus. I just need to figure out how to do that...
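
Roughly this kind of spread, sketched in userspace (the cpu sets are
made up, and the real version would also have to steer completion
vectors the same way):
--
#include <stdio.h>

int main(void)
{
	int nr_cpus = 4;
	int online[4] = { 1, 0, 1, 1 };		/* cpu1 offline */
	int map[4], q = 0, nr_queues;

	/* online cpus get their own queue, 1:1 */
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		if (online[cpu])
			map[cpu] = q++;
	nr_queues = q;

	/* offline cpus are folded onto existing queues */
	for (int cpu = 0; cpu < nr_cpus; cpu++)
		if (!online[cpu])
			map[cpu] = cpu % nr_queues;

	for (int cpu = 0; cpu < nr_cpus; cpu++)
		printf("cpu%d -> hctx%d%s\n", cpu, map[cpu],
		       online[cpu] ? "" : " (offline)");
	return 0;
}
--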

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-08 13:35                       ` Sagi Grimberg
@ 2018-04-09  2:47                         ` Ming Lei
  -1 siblings, 0 replies; 36+ messages in thread
From: Ming Lei @ 2018-04-09  2:47 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-block, Yi Zhang, linux-nvme

On Sun, Apr 08, 2018 at 04:35:59PM +0300, Sagi Grimberg wrote:
> 
> 
> On 04/08/2018 03:57 PM, Ming Lei wrote:
> > On Sun, Apr 08, 2018 at 02:53:03PM +0300, Sagi Grimberg wrote:
> > > 
> > > > > > > > > Hi Sagi
> > > > > > > > > 
> > > > > > > > > Still can reproduce this issue with the change:
> > > > > > > > 
> > > > > > > > Thanks for validating Yi,
> > > > > > > > 
> > > > > > > > Would it be possible to test the following:
> > > > > > > > --
> > > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > > > index 75336848f7a7..81ced3096433 100644
> > > > > > > > --- a/block/blk-mq.c
> > > > > > > > +++ b/block/blk-mq.c
> > > > > > > > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
> > > > > > > > request_queue *q,
> > > > > > > >                    return ERR_PTR(-EXDEV);
> > > > > > > >            }
> > > > > > > >            cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> > > > > > > > +       if (cpu >= nr_cpu_ids) {
> > > > > > > > +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
> > > > > > > > +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> > > > > > > > +       }
> > > > > > > >            alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > > > > > > > 
> > > > > > > >            rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > > > > > > > --
> > > > > > > > ...
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > [  153.384977] BUG: unable to handle kernel paging request at
> > > > > > > > > 00003a9ed053bd48
> > > > > > > > > [  153.393197] IP: blk_mq_get_request+0x23e/0x390
> > > > > > > > 
> > > > > > > > Also would it be possible to provide gdb output of:
> > > > > > > > 
> > > > > > > > l *(blk_mq_get_request+0x23e)
> > > > > > > 
> > > > > > > nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
> > > > > > > request from one specific hw queue, but there may not be all online CPUs
> > > > > > > mapped to this hw queue.
> > > > > 
> > > > > Yes, this is what I suspect..
> > > > > 
> > > > > > And the following patchset may fail this kind of allocation and avoid
> > > > > > the kernel oops.
> > > > > > 
> > > > > > 	https://marc.info/?l=linux-block&m=152318091025252&w=2
> > > > > 
> > > > > Thanks Ming,
> > > > > 
> > > > > But I don't want to fail the allocation, nvmf_connect_io_queue simply
> > > > > needs a tag to issue the connect request, I much rather to take this
> > > > > tag from an online cpu than failing it... We use this because we reserve
> > > > 
> > > > The failure is only triggered when there isn't any online CPU mapped to
> > > > this hctx, so do you want to wait for CPUs for this hctx becoming online?
> > > 
> > > I was thinking of allocating a tag from that hctx even if it had no
> > > online cpu, the execution is done on an online cpu (hence the call
> > > to blk_mq_alloc_request_hctx).
> > 
> > That can be done, but not following the current blk-mq's rule, because
> > blk-mq requires to dispatch the request on CPUs mapping to this hctx.
> > 
> > Could you explain a bit why you want to do in this way?
> 
> My device exposes nr_hw_queues which is not higher than num_online_cpus
> so I want to connect all hctxs with hope that they will be used.

The issue is that CPU online & offline can happen any time, and after
blk-mq removed its CPU hotplug handler, there is no way to remap queues
when the CPU topology changes.

For example:

1) after nr_hw_queues is set as num_online_cpus() and the hw queues
are initialized, some CPUs become offline and the issue reported by
Zhang Yi is triggered; but in this case we should fail the allocation,
since a 1:1 mapping doesn't need to use this inactive hw queue.

2) when nr_hw_queues is set as num_online_cpus(), there may be far
fewer online CPUs at initialization time, so the hw queue count ends up
much smaller, and performance stays degraded even if some CPUs become
online later.

So the current policy is to map all possible CPUs for handling CPU
hotplug, and if you want to get a 1:1 mapping between hw queues and
online CPUs, nr_hw_queues can be set as num_possible_cpus.

Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as
num_possible_cpus() to pci_alloc_irq_vectors).
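
Along these lines (a simplified sketch, not the literal nvme-pci code;
error handling omitted):
--
/*
 * Size the interrupt vectors by possible CPUs, so no remapping is
 * needed when more CPUs come online later.
 */
static int nvme_setup_irqs_sketch(struct pci_dev *pdev)
{
	unsigned int nr_io_queues = num_possible_cpus();

	return pci_alloc_irq_vectors(pdev, 1, nr_io_queues,
				     PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY);
}
--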

It wastes some memory, just like a percpu variable, but it simplifies
the queue mapping logic a lot, and it supports both hard and soft CPU
online/offline without a CPU hotplug handler, which could otherwise
cause very complicated queue dependency issues.

> 
> I agree we don't want to connect hctx which doesn't have an online
> cpu, that's redundant, but this is not the case here.

OK, I will explain below, and it can be fixed by the following patch too:

https://marc.info/?l=linux-block&m=152318093725257&w=2

> 
> > > > Or I may understand you wrong, :-)
> > > 
> > > In the report we connected 40 hctxs (which was exactly the number of
> > > online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
> > > I'm not sure why some hctxs are left without any online cpus.
> > 
> > That is possible after the following two commits:
> > 
> > 4b855ad37194 ("blk-mq: Create hctx for each present CPU)
> > 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU)
> > 
> > And this can be triggered even without putting down any CPUs.
> > 
> > The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we can't
> > remap queue any more when CPU topo is changed, so the static & fixed mapping
> > has to be setup from the beginning.
> > 
> > Then if there are less enough online CPUs compared with number of hw queues,
> > some of hctxes can be mapped with all offline CPUs. For example, if one device
> > has 4 hw queues, but there are only 2 online CPUs and 6 offline CPUs, at most
> > 2 hw queues are assigned to online CPUs, and the other two are all with offline
> > CPUs.
> 
> That is fine, but the problem that I gave in the example below which has
> nr_hw_queues == num_online_cpus but because of the mapping, we still
> have unmapped hctxs.

For FC's case, there may be some hctxs not 'mapped', which is caused by
blk_mq_map_queues(), but that should be one bug.

So the patch (blk-mq: don't keep offline CPUs mapped to hctx 0)[1] is
fixing the issue:
	
[1]	https://marc.info/?l=linux-block&m=152318093725257&w=2

Once this patch is in, any hctx should be mapped by at least one CPU.

Then later, the patch (blk-mq: reimplement blk_mq_hw_queue_mapped)[2]
extends the mapping concept; maybe it should have been named
blk_mq_hw_queue_active() instead. I will do that in V2.

[2] https://marc.info/?l=linux-block&m=152318099625268&w=2

> 
> > > Lets say I have 4-cpu system and my device always allocates
> > > num_online_cpus() hctxs.
> > > 
> > > at first I get:
> > > cpu0 -> hctx0
> > > cpu1 -> hctx1
> > > cpu2 -> hctx2
> > > cpu3 -> hctx3
> > > 
> > > When cpu1 goes offline I think the new mapping will be:
> > > cpu0 -> hctx0
> > > cpu1 -> hctx0 (from cpu_to_queue_index) // offline
> > > cpu2 -> hctx2
> > > cpu3 -> hctx0 (from cpu_to_queue_index)
> > > 
> > > This means that now hctx1 is unmapped. I guess we can fix nvmf code
> > > to not connect it. But we end up with less queues than cpus without
> > > any good reason.
> > > 
> > > I would have optimally want a different mapping that will use all
> > > the queues:
> > > cpu0 -> hctx0
> > > cpu2 -> hctx1
> > > cpu3 -> hctx2
> > > * cpu1 -> hctx1 (doesn't matter, offline)
> > > 
> > > Something looks broken...
> > 
> > No, it isn't broken.
> 
> maybe broken is the wrong phrase, but its suboptimal...
> 
> > Storage is client/server model, the hw queue should be only active if
> > there is request coming from client(CPU),
> 
> Correct.
> 
> > and the hw queue becomes inactive if no online CPU is mapped to it.
> 
> But when we reset the controller, we call blk_mq_update_nr_hw_queues()
> with the current number of nr_hw_queues which never exceeds
> num_online_cpus. This in turn, remaps the mq_map which results
> in unmapped queues because of the mapping function, not because we
> have more hctx than online cpus...

As I mentioned, num_online_cpus() isn't a stable value; it can change
at any time.

After the patch (blk-mq: don't keep offline CPUs mapped to hctx 0) is in,
there won't be any unmapped queues any more.

> 
> An easy fix, is to allocate num_present_cpus queues, and only connect
> the oneline ones, but as you said, we have unused resources this way.

Yeah, it should be num_possible_cpus queues, because physical CPU hotplug
needs to be supported for KVM or S390, or even some x86_64 systems.

> 
> We also have an issue with blk_mq_rdma_map_queues with the only
> device that supports it because it doesn't use managed affinity (code
> was reverted) and can have irq affinity redirected in case of cpu
> offlining...

That can be one corner case; it looks like I have to reconsider the patch
(blk-mq: remove code for dealing with remapping queue)[3], which may cause
a regression for this RDMA case, though I guess CPU hotplug may break this
case easily anyway.

[3] https://marc.info/?l=linux-block&m=152318100625284&w=2

Also, this case will make blk-mq's queue mapping much more complicated.
Could you provide a link explaining the reason for reverting managed
affinity on this device?

Recently we fixed quite a few issues with managed affinity; maybe the
original RDMA affinity issue has been addressed already.

> 
> The goal here I think, should be to allocate just enough queues (not
> more than the number online cpus) and spread it 1x1 with online cpus,
> and also make sure to allocate completion vectors that align to online
> cpus. I just need to figure out how to do that...

I think we have to support CPU hotplug, so your goal may be hard to
reach if you don't want to waste memory.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 36+ messages in thread

* BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
@ 2018-04-09  2:47                         ` Ming Lei
  0 siblings, 0 replies; 36+ messages in thread
From: Ming Lei @ 2018-04-09  2:47 UTC (permalink / raw)


On Sun, Apr 08, 2018@04:35:59PM +0300, Sagi Grimberg wrote:
> 
> 
> On 04/08/2018 03:57 PM, Ming Lei wrote:
> > On Sun, Apr 08, 2018@02:53:03PM +0300, Sagi Grimberg wrote:
> > > 
> > > > > > > > > Hi Sagi
> > > > > > > > > 
> > > > > > > > > Still can reproduce this issue with the change:
> > > > > > > > 
> > > > > > > > Thanks for validating Yi,
> > > > > > > > 
> > > > > > > > Would it be possible to test the following:
> > > > > > > > --
> > > > > > > > diff --git a/block/blk-mq.c b/block/blk-mq.c
> > > > > > > > index 75336848f7a7..81ced3096433 100644
> > > > > > > > --- a/block/blk-mq.c
> > > > > > > > +++ b/block/blk-mq.c
> > > > > > > > @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
> > > > > > > > request_queue *q,
> > > > > > > >                    return ERR_PTR(-EXDEV);
> > > > > > > >            }
> > > > > > > >            cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> > > > > > > > +       if (cpu >= nr_cpu_ids) {
> > > > > > > > +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
> > > > > > > > +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> > > > > > > > +       }
> > > > > > > >            alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> > > > > > > > 
> > > > > > > >            rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> > > > > > > > --
> > > > > > > > ...
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > [? 153.384977] BUG: unable to handle kernel paging request at
> > > > > > > > > 00003a9ed053bd48
> > > > > > > > > [? 153.393197] IP: blk_mq_get_request+0x23e/0x390
> > > > > > > > 
> > > > > > > > Also would it be possible to provide gdb output of:
> > > > > > > > 
> > > > > > > > l *(blk_mq_get_request+0x23e)
> > > > > > > 
> > > > > > > nvmf_connect_io_queue() is used in this way by asking blk-mq to allocate
> > > > > > > request from one specific hw queue, but there may not be all online CPUs
> > > > > > > mapped to this hw queue.
> > > > > 
> > > > > Yes, this is what I suspect..
> > > > > 
> > > > > > And the following patchset may fail this kind of allocation and avoid
> > > > > > the kernel oops.
> > > > > > 
> > > > > > 	https://marc.info/?l=linux-block&m=152318091025252&w=2
> > > > > 
> > > > > Thanks Ming,
> > > > > 
> > > > > But I don't want to fail the allocation, nvmf_connect_io_queue simply
> > > > > needs a tag to issue the connect request, I much rather to take this
> > > > > tag from an online cpu than failing it... We use this because we reserve
> > > > 
> > > > The failure is only triggered when there isn't any online CPU mapped to
> > > > this hctx, so do you want to wait for CPUs for this hctx becoming online?
> > > 
> > > I was thinking of allocating a tag from that hctx even if it had no
> > > online cpu, the execution is done on an online cpu (hence the call
> > > to blk_mq_alloc_request_hctx).
> > 
> > That can be done, but not following the current blk-mq's rule, because
> > blk-mq requires to dispatch the request on CPUs mapping to this hctx.
> > 
> > Could you explain a bit why you want to do in this way?
> 
> My device exposes nr_hw_queues which is not higher than num_online_cpus
> so I want to connect all hctxs with hope that they will be used.

The issue is that CPU online & offline can happen any time, and after
blk-mq removes CPU hotplug handler, there is no way to remap queue
when CPU topo is changed.

For example:

1) after nr_hw_queues is set as num_online_cpus() and hw queues
are initialized, then some of CPUs become offline, and the issue
reported by Zhang Yi is triggered, but in this case, we should fail
the allocation since 1:1 mapping doesn't need to use this inactive
hw queue.

2) when nr_hw_queues is set as num_online_cpus(), there may be
much less online CPUs, so the hw queue number can be initialized as
much smaller, then performance is degraded much even if some CPUs
become online later.

So the current policy is to map all possible CPUs for handing CPU
hotplug, and if you want to get 1:1 mapping between hw queue and
online CPU, the nr_hw_queues can be set as num_possible_cpus.

Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as
num_possible_cpus() to pci_alloc_irq_vectors).

It will waste some memory resource just like percpu variable, but it
simplifies the queue mapping logic a lot, and can support both hard
and soft CPU online/offline without CPU hotplug handler, which may
cause very complicated queue dependency issue.

> 
> I agree we don't want to connect hctx which doesn't have an online
> cpu, that's redundant, but this is not the case here.

OK, I will explain below, and it can be fixed by the following patch too:

https://marc.info/?l=linux-block&m=152318093725257&w=2

> 
> > > > Or I may understand you wrong, :-)
> > > 
> > > In the report we connected 40 hctxs (which was exactly the number of
> > > online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
> > > I'm not sure why some hctxs are left without any online cpus.
> > 
> > That is possible after the following two commits:
> > 
> > 4b855ad37194 ("blk-mq: Create hctx for each present CPU)
> > 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU)
> > 
> > And this can be triggered even without putting down any CPUs.
> > 
> > The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we can't
> > remap queue any more when CPU topo is changed, so the static & fixed mapping
> > has to be setup from the beginning.
> > 
> > Then if there are less enough online CPUs compared with number of hw queues,
> > some of hctxes can be mapped with all offline CPUs. For example, if one device
> > has 4 hw queues, but there are only 2 online CPUs and 6 offline CPUs, at most
> > 2 hw queues are assigned to online CPUs, and the other two are all with offline
> > CPUs.
> 
> That is fine, but the problem that I gave in the example below which has
> nr_hw_queues == num_online_cpus but because of the mapping, we still
> have unmapped hctxs.

For FC's case, there may be some hctxs not 'mapped', which is caused by
blk_mq_map_queues(), but that should one bug.

So the patch(blk-mq: don't keep offline CPUs mapped to hctx 0)[1] is
fixing the issue:
	
[1]	https://marc.info/?l=linux-block&m=152318093725257&w=2

Once this patch is in, any hctx should be mapped by at least one CPU.

Then later, the patch(blk-mq: reimplement blk_mq_hw_queue_mapped)[2]
extends the mapping concept, maybe it should have been renamed as
blk_mq_hw_queue_active(), will do it in V2.

[2] https://marc.info/?l=linux-block&m=152318099625268&w=2

> 
> > > Lets say I have 4-cpu system and my device always allocates
> > > num_online_cpus() hctxs.
> > > 
> > > at first I get:
> > > cpu0 -> hctx0
> > > cpu1 -> hctx1
> > > cpu2 -> hctx2
> > > cpu3 -> hctx3
> > > 
> > > When cpu1 goes offline I think the new mapping will be:
> > > cpu0 -> hctx0
> > > cpu1 -> hctx0 (from cpu_to_queue_index) // offline
> > > cpu2 -> hctx2
> > > cpu3 -> hctx0 (from cpu_to_queue_index)
> > > 
> > > This means that now hctx1 is unmapped. I guess we can fix nvmf code
> > > to not connect it. But we end up with less queues than cpus without
> > > any good reason.
> > > 
> > > I would have optimally want a different mapping that will use all
> > > the queues:
> > > cpu0 -> hctx0
> > > cpu2 -> hctx1
> > > cpu3 -> hctx2
> > > * cpu1 -> hctx1 (doesn't matter, offline)
> > > 
> > > Something looks broken...
> > 
> > No, it isn't broken.
> 
> maybe broken is the wrong phrase, but its suboptimal...
> 
> > Storage is a client/server model; the hw queue should only be active
> > if there is a request coming from the client (CPU),
> 
> Correct.
> 
> > and the hw queue becomes inactive if no online CPU is mapped to it.
> 
> But when we reset the controller, we call blk_mq_update_nr_hw_queues()
> with the current number of nr_hw_queues, which never exceeds
> num_online_cpus. This, in turn, remaps the mq_map, which results in
> unmapped queues because of the mapping function, not because we have
> more hctx than online cpus...

As I mentioned, num_online_cpus() isn't a stable value, and it can
change at any time.

Once the patch (blk-mq: don't keep offline CPUs mapped to hctx 0) is
in, there won't be any unmapped queues any more.

> 
> An easy fix is to allocate num_present_cpus queues and only connect
> the online ones, but as you said, we have unused resources this way.

Yeah, it should be num_possible_cpus queues, because physical CPU
hotplug needs to be supported for KVM, S390, and even some X86_64
systems.

> 
> We also have an issue with blk_mq_rdma_map_queues with the only
> device that supports it, because that device doesn't use managed
> affinity (the code was reverted) and can have its irq affinity
> redirected in case of cpu offlining...

That can be a corner case; looks like I have to reconsider the patch
(blk-mq: remove code for dealing with remapping queue)[3], which may
cause a regression for this RDMA case, but I guess CPU hotplug may
break this case easily.

[3] https://marc.info/?l=linux-block&m=152318100625284&w=2

Also, this case will make blk-mq's queue mapping much more
complicated. Could you provide a link explaining the reason for
reverting managed affinity on this device?

Recently we fixed quite a few issues with managed affinity; maybe the
original RDMA affinity issue has been addressed already.

> 
> The goal here, I think, should be to allocate just enough queues (not
> more than the number of online cpus) and spread them 1:1 with online
> cpus, and also make sure to allocate completion vectors that align to
> online cpus. I just need to figure out how to do that...

I think we have to support CPU hotplug, so your goal may be hard to
reach if you don't want to waste memory resources.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-09  2:47                         ` Ming Lei
@ 2018-04-09  8:31                           ` Sagi Grimberg
  -1 siblings, 0 replies; 36+ messages in thread
From: Sagi Grimberg @ 2018-04-09  8:31 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Yi Zhang, linux-nvme


>> My device exposes nr_hw_queues which is not higher than num_online_cpus,
>> so I want to connect all hctxs with the hope that they will be used.
> 
> The issue is that CPU online & offline can happen at any time, and
> after blk-mq removed the CPU hotplug handler, there is no way to remap
> queues when the CPU topo is changed.
> 
> For example:
> 
> 1) after nr_hw_queues is set as num_online_cpus() and hw queues
> are initialized, then some of the CPUs become offline, and the issue
> reported by Zhang Yi is triggered, but in this case, we should fail
> the allocation since 1:1 mapping doesn't need to use this inactive
> hw queue.

Normal cpu offlining is fine, as the hctxs are already connected. When
we reset the controller and re-establish the queues, the issue triggers
because we call blk_mq_alloc_request_hctx.
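
(For context: the fabrics connect path has to issue the Connect command
on one specific hctx per I/O queue, so per queue it does roughly the
following; a from-memory sketch, not the exact code:)

--
	/* queue qid's Connect command must go through hctx qid - 1 */
	rq = blk_mq_alloc_request_hctx(ctrl->connect_q, REQ_OP_DRV_OUT,
				       BLK_MQ_REQ_NOWAIT, qid - 1);
--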

The question is, for this particular issue, given that the request
execution is guaranteed to run from an online cpu, will the below work?
--
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
                 return ERR_PTR(-EXDEV);
         }
         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+       if (cpu >= nr_cpu_ids) {
+               pr_warn("no online cpu for hctx %d\n", hctx_idx);
+               cpu = cpumask_first(alloc_data.hctx->cpumask);
+       }
         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--

> 2) when nr_hw_queues is set as num_online_cpus(), there may be
> many fewer online CPUs, so the hw queue number can be initialized
> much smaller, and then performance is degraded even if some CPUs
> come online later.

That is correct; when the controller is reset, though, more queues
will be added to the system. I agree it would be good if we could
change stuff dynamically.

> So the current policy is to map all possible CPUs for handling CPU
> hotplug, and if you want to get 1:1 mapping between hw queue and
> online CPU, the nr_hw_queues can be set as num_possible_cpus.

Having nr_hw_queues == num_possible_cpus cannot work, as it requires
establishing an RDMA queue-pair with a set of HW resources, both on
the host side _and_ on the controller side, which would sit idle.

> Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as
> num_possible_cpus() to pci_alloc_irq_vectors).

Yes, I am aware of this patch; however, I'm not sure it'll be a good
idea for nvmf, as it takes resources from both the host and the target
for cpus that may never come online...

> It wastes some memory, just like a percpu variable, but it simplifies
> the queue mapping logic a lot, and it supports both hard and soft CPU
> online/offline without a CPU hotplug handler, which could otherwise
> cause very complicated queue dependency issues.

Yes, but these memory resources become an issue when they take HW
(RDMA) resources on the local device and on the target device.

>> I agree we don't want to connect hctx which doesn't have an online
>> cpu, that's redundant, but this is not the case here.
> 
> OK, I will explain below, and it can be fixed by the following patch too:
> 
> https://marc.info/?l=linux-block&m=152318093725257&w=2
> 

I agree this patch is good!

>>>>> Or I may understand you wrong, :-)
>>>>
>>>> In the report we connected 40 hctxs (which was exactly the number of
>>>> online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
>>>> I'm not sure why some hctxs are left without any online cpus.
>>>
>>> That is possible after the following two commits:
>>>
>>> 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>>> 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU)
>>>
>>> And this can be triggered even without putting down any CPUs.
>>>
>>> The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we can't
>>> remap queues any more when the CPU topo is changed, so the static &
>>> fixed mapping has to be set up from the beginning.
>>>
>>> Then if there are fewer online CPUs than hw queues, some hctxs can be
>>> mapped to only offline CPUs. For example, if one device has 4 hw
>>> queues but there are only 2 online CPUs and 6 offline CPUs, at most 2
>>> hw queues are assigned to online CPUs, and the other two are left
>>> with only offline CPUs.
>>
>> That is fine, but the problem is the one I gave in the example below:
>> it has nr_hw_queues == num_online_cpus, but because of the mapping, we
>> still have unmapped hctxs.
> 
> For FC's case, there may be some hctxs left not 'mapped', which is
> caused by blk_mq_map_queues(), but that should be considered a bug.
> 
> So the patch (blk-mq: don't keep offline CPUs mapped to hctx 0)[1]
> fixes the issue:
> 
> [1]	https://marc.info/?l=linux-block&m=152318093725257&w=2
> 
> Once this patch is in, any hctx should be mapped by at least one CPU.

I think this will solve the problem Yi is stepping on.

> Then later, the patch (blk-mq: reimplement blk_mq_hw_queue_mapped)[2]
> extends the mapping concept; maybe it should have been renamed to
> blk_mq_hw_queue_active(), and I will do that in V2.
> 
> [2] https://marc.info/?l=linux-block&m=152318099625268&w=2

This is also a good patch.

...

>> But when we reset the controller, we call blk_mq_update_nr_hw_queues()
>> with the current number of nr_hw_queues, which never exceeds
>> num_online_cpus. This, in turn, remaps the mq_map, which results in
>> unmapped queues because of the mapping function, not because we have
>> more hctx than online cpus...
> 
> As I mentioned, num_online_cpus() isn't a stable value, and it can
> change at any time.

Correct, but I'm afraid num_possible_cpus might not work either.

> Once the patch (blk-mq: don't keep offline CPUs mapped to hctx 0) is
> in, there won't be any unmapped queues any more.

Yes.

>> An easy fix is to allocate num_present_cpus queues and only connect
>> the online ones, but as you said, we have unused resources this way.
> 
> Yeah, it should be num_possible_cpus queues, because physical CPU
> hotplug needs to be supported for KVM, S390, and even some X86_64
> systems.

num_present_cpus is a waste of resources (as I said, both on the host
and on the target), but num_possible_cpus is even worse as this is
all cpus that _can_ be populated.

>> We also have an issue with blk_mq_rdma_map_queues with the only
>> device that supports it, because that device doesn't use managed
>> affinity (the code was reverted) and can have its irq affinity
>> redirected in case of cpu offlining...
> 
> That can be a corner case; looks like I have to reconsider the patch
> (blk-mq: remove code for dealing with remapping queue)[3], which may
> cause a regression for this RDMA case, but I guess CPU hotplug may
> break this case easily.
> 
> [3] https://marc.info/?l=linux-block&m=152318100625284&w=2
> 
> Also, this case will make blk-mq's queue mapping much more
> complicated. Could you provide a link explaining the reason for
> reverting managed affinity on this device?

The problem was that users reported a regression because
/proc/irq/$IRQ/smp_affinity is now immutable. Looks like netdev users
modify this on a regular basis (and also rely on the irq balancer at
times) while nvme users (and other HBAs) do not care about it.

Thread starts here:
https://www.spinics.net/lists/netdev/msg464301.html

> Recently we fixed quite a few issues with managed affinity; maybe the
> original RDMA affinity issue has been addressed already.

That is not specific to RDMA affinity; it's because RDMA devices are
also network devices, and people want to apply their irq affinity
scripts to them like they're used to doing with other devices.

>> The goal here, I think, should be to allocate just enough queues (not
>> more than the number of online cpus) and spread them 1:1 with online
>> cpus, and also make sure to allocate completion vectors that align to
>> online cpus. I just need to figure out how to do that...
> 
> I think we have to support CPU hotplug, so your goal may be hard to
> reach if you don't want to waste memory resources.

Well, not so much if I make blk_mq_rdma_map_queues do the right thing?

As I said, for the first go, I'd like to fix the mapping for the simple
case where we map the queues with some cpus offlined. Having queues
being added dynamically is a different story and I agree would require
more work.
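
(For reference, blk_mq_rdma_map_queues() is essentially the sketch
below, from memory: map each queue over the cpus in its completion
vector's affinity mask, and fall back to the default mapping when the
device gives no affinity hint.)

--
	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		mask = ib_get_vector_affinity(dev, first_vec + queue);
		if (!mask)
			goto fallback;	/* no hint, e.g. affinity was reverted */

		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}
	return 0;

fallback:
	return blk_mq_map_queues(set);
--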

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-09  8:31                           ` Sagi Grimberg
@ 2018-04-09  8:54                             ` Yi Zhang
  -1 siblings, 0 replies; 36+ messages in thread
From: Yi Zhang @ 2018-04-09  8:54 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei; +Cc: linux-block, linux-nvme



On 04/09/2018 04:31 PM, Sagi Grimberg wrote:
>
>>> My device exposes nr_hw_queues which is not higher than num_online_cpus,
>>> so I want to connect all hctxs with the hope that they will be used.
>>
>> The issue is that CPU online & offline can happen at any time, and
>> after blk-mq removed the CPU hotplug handler, there is no way to remap
>> queues when the CPU topo is changed.
>>
>> For example:
>>
>> 1) after nr_hw_queues is set as num_online_cpus() and hw queues
>> are initialized, then some of the CPUs become offline, and the issue
>> reported by Zhang Yi is triggered, but in this case, we should fail
>> the allocation since 1:1 mapping doesn't need to use this inactive
>> hw queue.
>
> Normal cpu offlining is fine, as the hctxs are already connected. When
> we reset the controller and re-establish the queues, the issue triggers
> because we call blk_mq_alloc_request_hctx.
>
> The question is, for this particular issue, given that the request
> execution is guaranteed to run from an online cpu, will the below work?
Hi Sagi
Sorry for the late response; the patch below works. Here is the full log:

[  117.370832] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
[  117.427385] nvme nvme0: creating 40 I/O queues.
[  117.736806] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
[  122.531891] smpboot: CPU 1 is now offline
[  122.573007] IRQ 37: no longer affine to CPU2
[  122.577775] IRQ 54: no longer affine to CPU2
[  122.582532] IRQ 70: no longer affine to CPU2
[  122.587300] IRQ 98: no longer affine to CPU2
[  122.592069] IRQ 140: no longer affine to CPU2
[  122.596930] IRQ 141: no longer affine to CPU2
[  122.603166] smpboot: CPU 2 is now offline
[  122.840577] smpboot: CPU 3 is now offline
[  125.204901] print_req_error: operation not supported error, dev nvme0n1, sector 143212504
[  125.204907] print_req_error: operation not supported error, dev nvme0n1, sector 481004984
[  125.204922] print_req_error: operation not supported error, dev nvme0n1, sector 436594584
[  125.204924] print_req_error: operation not supported error, dev nvme0n1, sector 461363784
[  125.204945] print_req_error: operation not supported error, dev nvme0n1, sector 308124792
[  125.204957] print_req_error: operation not supported error, dev nvme0n1, sector 513395784
[  125.204959] print_req_error: operation not supported error, dev nvme0n1, sector 432260176
[  125.204961] print_req_error: operation not supported error, dev nvme0n1, sector 251704096
[  125.204963] print_req_error: operation not supported error, dev nvme0n1, sector 234819336
[  125.204966] print_req_error: operation not supported error, dev nvme0n1, sector 181874128
[  125.938858] nvme nvme0: Reconnecting in 10 seconds...
[  125.938862] Buffer I/O error on dev nvme0n1, logical block 367355, lost async page write
[  125.942587] Buffer I/O error on dev nvme0n1, logical block 586, lost async page write
[  125.942589] Buffer I/O error on dev nvme0n1, logical block 375453, lost async page write
[  125.942591] Buffer I/O error on dev nvme0n1, logical block 587, lost async page write
[  125.942592] Buffer I/O error on dev nvme0n1, logical block 588, lost async page write
[  125.942593] Buffer I/O error on dev nvme0n1, logical block 375454, lost async page write
[  125.942594] Buffer I/O error on dev nvme0n1, logical block 589, lost async page write
[  125.942595] Buffer I/O error on dev nvme0n1, logical block 590, lost async page write
[  125.942596] Buffer I/O error on dev nvme0n1, logical block 591, lost async page write
[  125.942597] Buffer I/O error on dev nvme0n1, logical block 592, lost async page write
[  130.205584] print_req_error: 537000 callbacks suppressed
[  130.205586] print_req_error: I/O error, dev nvme0n1, sector 471135288
[  130.218763] print_req_error: I/O error, dev nvme0n1, sector 471137240
[  130.225985] print_req_error: I/O error, dev nvme0n1, sector 471138328
[  130.233206] print_req_error: I/O error, dev nvme0n1, sector 471140096
[  130.240433] print_req_error: I/O error, dev nvme0n1, sector 471140184
[  130.247659] print_req_error: I/O error, dev nvme0n1, sector 471140960
[  130.254874] print_req_error: I/O error, dev nvme0n1, sector 471141864
[  130.262095] print_req_error: I/O error, dev nvme0n1, sector 471143296
[  130.269317] print_req_error: I/O error, dev nvme0n1, sector 471143776
[  130.276537] print_req_error: I/O error, dev nvme0n1, sector 471144224
[  132.954315] nvme nvme0: Identify namespace failed
[  132.959698] buffer_io_error: 3801549 callbacks suppressed
[  132.959699] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  132.974669] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  132.983078] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  132.991476] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  132.999859] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  133.008217] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  133.016575] Dev nvme0n1: unable to read RDB block 0
[  133.022423] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  133.030800] Buffer I/O error on dev nvme0n1, logical block 0, async page read
[  133.039151]  nvme0n1: unable to read partition table
[  133.050221] Buffer I/O error on dev nvme0n1, logical block 65535984, async page read
[  133.060154] Buffer I/O error on dev nvme0n1, logical block 65535984, async page read
[  136.334516] nvme nvme0: creating 37 I/O queues.
[  136.636012] no online cpu for hctx 1
[  136.640448] no online cpu for hctx 2
[  136.644832] no online cpu for hctx 3
[  136.650432] nvme nvme0: Successfully reconnected (1 attempts)
[  184.894584] x86: Booting SMP configuration:
[  184.899694] smpboot: Booting Node 1 Processor 1 APIC 0x20
[  184.913923] smpboot: Booting Node 0 Processor 2 APIC 0x2
[  184.929556] smpboot: Booting Node 1 Processor 3 APIC 0x22

And here is the debug output:
[root@rdma-virt-01 linux (test)]$ gdb vmlinux
GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
<http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /home/test/linux/vmlinux...done.
(gdb) l *(blk_mq_get_request+0x23e)
0xffffffff8136bf9e is in blk_mq_get_request (block/blk-mq.c:327).
322        rq->rl = NULL;
323        set_start_time_ns(rq);
324        rq->io_start_time_ns = 0;
325    #endif
326
327        data->ctx->rq_dispatched[op_is_sync(op)]++;
328        return rq;
329    }
330
331    static struct request *blk_mq_get_request(struct request_queue *q,



> -- 
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 75336848f7a7..81ced3096433 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
>                 return ERR_PTR(-EXDEV);
>         }
>         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +       if (cpu >= nr_cpu_ids) {
> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> +       }
>         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>
>         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> -- 
>
>> 2) when nr_hw_queues is set as num_online_cpus(), there may be
>> many fewer online CPUs, so the hw queue number can be initialized
>> much smaller, and then performance is degraded even if some CPUs
>> come online later.
>
> That is correct; when the controller is reset, though, more queues
> will be added to the system. I agree it would be good if we could
> change stuff dynamically.
>
>> So the current policy is to map all possible CPUs for handling CPU
>> hotplug, and if you want to get 1:1 mapping between hw queue and
>> online CPU, the nr_hw_queues can be set as num_possible_cpus.
>
> Having nr_hw_queues == num_possible_cpus cannot work, as it requires
> establishing an RDMA queue-pair with a set of HW resources, both on
> the host side _and_ on the controller side, which would sit idle.
>
>> Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as
>> num_possible_cpus() to pci_alloc_irq_vectors).
>
> Yes, I am aware of this patch; however, I'm not sure it'll be a good
> idea for nvmf, as it takes resources from both the host and the target
> for cpus that may never come online...
>
>> It wastes some memory, just like a percpu variable, but it simplifies
>> the queue mapping logic a lot, and it supports both hard and soft CPU
>> online/offline without a CPU hotplug handler, which could otherwise
>> cause very complicated queue dependency issues.
>
> Yes, but these memory resources become an issue when they take HW
> (RDMA) resources on the local device and on the target device.
>
>>> I agree we don't want to connect hctx which doesn't have an online
>>> cpu, that's redundant, but this is not the case here.
>>
>> OK, I will explain below, and it can be fixed by the following patch 
>> too:
>>
>> https://marc.info/?l=linux-block&m=152318093725257&w=2
>>
>
> I agree this patch is good!
>
>>>>>> Or I may understand you wrong, :-)
>>>>>
>>>>> In the report we connected 40 hctxs (which was exactly the number of
>>>>> online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
>>>>> I'm not sure why some hctxs are left without any online cpus.
>>>>
>>>> That is possible after the following two commits:
>>>>
>>>> 4b855ad37194 ("blk-mq: Create hctx for each present CPU")
>>>> 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each 
>>>> possisble CPU)
>>>>
>>>> And this can be triggered even without putting down any CPUs.
>>>>
>>>> The blk-mq CPU hotplug handler is removed in 4b855ad37194, and we
>>>> can't remap queues any more when the CPU topo is changed, so the
>>>> static & fixed mapping has to be set up from the beginning.
>>>>
>>>> Then if there are fewer online CPUs than hw queues, some hctxs can
>>>> be mapped to only offline CPUs. For example, if one device has 4 hw
>>>> queues but there are only 2 online CPUs and 6 offline CPUs, at most
>>>> 2 hw queues are assigned to online CPUs, and the other two are left
>>>> with only offline CPUs.
>>>
>>> That is fine, but the problem is the one I gave in the example
>>> below: it has nr_hw_queues == num_online_cpus, but because of the
>>> mapping, we still have unmapped hctxs.
>>
>> For FC's case, there may be some hctxs left not 'mapped', which is
>> caused by blk_mq_map_queues(), but that should be considered a bug.
>>
>> So the patch (blk-mq: don't keep offline CPUs mapped to hctx 0)[1]
>> fixes the issue:
>>
>> [1] https://marc.info/?l=linux-block&m=152318093725257&w=2
>>
>> Once this patch is in, any hctx should be mapped by at least one CPU.
>
> I think this will solve the problem Yi is stepping on.
>
>> Then later, the patch (blk-mq: reimplement blk_mq_hw_queue_mapped)[2]
>> extends the mapping concept; maybe it should have been renamed to
>> blk_mq_hw_queue_active(), and I will do that in V2.
>>
>> [2] https://marc.info/?l=linux-block&m=152318099625268&w=2
>
> This is also a good patch.
>
> ...
>
>>> But when we reset the controller, we call blk_mq_update_nr_hw_queues()
>>> with the current number of nr_hw_queues, which never exceeds
>>> num_online_cpus. This, in turn, remaps the mq_map, which results in
>>> unmapped queues because of the mapping function, not because we have
>>> more hctx than online cpus...
>>
>> As I mentioned, num_online_cpus() isn't a stable value, and it can
>> change at any time.
>
> Correct, but I'm afraid num_possible_cpus might not work either.
>
>> Once the patch (blk-mq: don't keep offline CPUs mapped to hctx 0) is
>> in, there won't be any unmapped queues any more.
>
> Yes.
>
>>> An easy fix is to allocate num_present_cpus queues and only connect
>>> the online ones, but as you said, we have unused resources this way.
>>
>> Yeah, it should be num_possible_cpus queues, because physical CPU
>> hotplug needs to be supported for KVM, S390, and even some X86_64
>> systems.
>
> num_present_cpus is a waste of resources (as I said, both on the host
> and on the target), but num_possible_cpus is even worse as this is
> all cpus that _can_ be populated.
>
>>> We also have an issue with blk_mq_rdma_map_queues with the only
>>> device that supports it, because that device doesn't use managed
>>> affinity (the code was reverted) and can have its irq affinity
>>> redirected in case of cpu offlining...
>>
>> That can be a corner case; looks like I have to reconsider the patch
>> (blk-mq: remove code for dealing with remapping queue)[3], which may
>> cause a regression for this RDMA case, but I guess CPU hotplug may
>> break this case easily.
>>
>> [3] https://marc.info/?l=linux-block&m=152318100625284&w=2
>>
>> Also, this case will make blk-mq's queue mapping much more
>> complicated. Could you provide a link explaining the reason for
>> reverting managed affinity on this device?
>
> The problem was that users reported a regression because
> /proc/irq/$IRQ/smp_affinity is now immutable. Looks like netdev users
> modify this on a regular basis (and also rely on the irq balancer at
> times) while nvme users (and other HBAs) do not care about it.
>
> Thread starts here:
> https://www.spinics.net/lists/netdev/msg464301.html
>
>> Recently we fixed quite a few issues with managed affinity; maybe the
>> original RDMA affinity issue has been addressed already.
>
> That is not specific to RDMA affinity; it's because RDMA devices are
> also network devices, and people want to apply their irq affinity
> scripts to them like they're used to doing with other devices.
>
>>> The goal here, I think, should be to allocate just enough queues (not
>>> more than the number of online cpus) and spread them 1:1 with online
>>> cpus, and also make sure to allocate completion vectors that align to
>>> online cpus. I just need to figure out how to do that...
>>
>> I think we have to support CPU hotplug, so your goal may be hard to
>> reach if you don't want to waste memory resources.
>
> Well, not so much if I make blk_mq_rdma_map_queues do the right thing?
>
> As I said, for the first go, I'd like to fix the mapping for the simple
> case where we map the queues with some cpus offlined. Having queues
> being added dynamically is a different story and I agree would require
> more work.
>
> _______________________________________________
> Linux-nvme mailing list
> Linux-nvme@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-09  8:54                             ` Yi Zhang
@ 2018-04-09  9:05                               ` Yi Zhang
  -1 siblings, 0 replies; 36+ messages in thread
From: Yi Zhang @ 2018-04-09  9:05 UTC (permalink / raw)
  To: Sagi Grimberg, Ming Lei; +Cc: linux-block, linux-nvme



On 04/09/2018 04:54 PM, Yi Zhang wrote:
>
>
> On 04/09/2018 04:31 PM, Sagi Grimberg wrote:
>>
>>>> My device exposes nr_hw_queues which is not higher than
>>>> num_online_cpus, so I want to connect all hctxs with the hope that
>>>> they will be used.
>>>
>>> The issue is that CPU online & offline can happen at any time, and
>>> after blk-mq removed the CPU hotplug handler, there is no way to
>>> remap queues when the CPU topo is changed.
>>>
>>> For example:
>>>
>>> 1) after nr_hw_queues is set as num_online_cpus() and hw queues
>>> are initialized, then some of the CPUs become offline, and the issue
>>> reported by Zhang Yi is triggered, but in this case, we should fail
>>> the allocation since 1:1 mapping doesn't need to use this inactive
>>> hw queue.
>>
>> Normal cpu offlining is fine, as the hctxs are already connected. When
>> we reset the controller and re-establish the queues, the issue triggers
>> because we call blk_mq_alloc_request_hctx.
>>
>> The question is, for this particular issue, given that the request
>> execution is guaranteed to run from an online cpu, will the below work?
> Hi Sagi
> Sorry for the late response; the patch below works. Here is the full log:

And this issue cannot be reproduced on 4.15, so I did bisect testing
today and found it was introduced by the commit below:
bf9ae8c blk-mq: fix bad clear of RQF_MQ_INFLIGHT in blk_mq_ct_ctx_init()

>
> [  117.370832] nvme nvme0: new ctrl: NQN "nqn.2014-08.org.nvmexpress.discovery", addr 172.31.0.90:4420
> [  117.427385] nvme nvme0: creating 40 I/O queues.
> [  117.736806] nvme nvme0: new ctrl: NQN "testnqn", addr 172.31.0.90:4420
> [  122.531891] smpboot: CPU 1 is now offline
> [  122.573007] IRQ 37: no longer affine to CPU2
> [  122.577775] IRQ 54: no longer affine to CPU2
> [  122.582532] IRQ 70: no longer affine to CPU2
> [  122.587300] IRQ 98: no longer affine to CPU2
> [  122.592069] IRQ 140: no longer affine to CPU2
> [  122.596930] IRQ 141: no longer affine to CPU2
> [  122.603166] smpboot: CPU 2 is now offline
> [  122.840577] smpboot: CPU 3 is now offline
> [  125.204901] print_req_error: operation not supported error, dev nvme0n1, sector 143212504
> [  125.204907] print_req_error: operation not supported error, dev nvme0n1, sector 481004984
> [  125.204922] print_req_error: operation not supported error, dev nvme0n1, sector 436594584
> [  125.204924] print_req_error: operation not supported error, dev nvme0n1, sector 461363784
> [  125.204945] print_req_error: operation not supported error, dev nvme0n1, sector 308124792
> [  125.204957] print_req_error: operation not supported error, dev nvme0n1, sector 513395784
> [  125.204959] print_req_error: operation not supported error, dev nvme0n1, sector 432260176
> [  125.204961] print_req_error: operation not supported error, dev nvme0n1, sector 251704096
> [  125.204963] print_req_error: operation not supported error, dev nvme0n1, sector 234819336
> [  125.204966] print_req_error: operation not supported error, dev nvme0n1, sector 181874128
> [  125.938858] nvme nvme0: Reconnecting in 10 seconds...
> [  125.938862] Buffer I/O error on dev nvme0n1, logical block 367355, lost async page write
> [  125.942587] Buffer I/O error on dev nvme0n1, logical block 586, lost async page write
> [  125.942589] Buffer I/O error on dev nvme0n1, logical block 375453, lost async page write
> [  125.942591] Buffer I/O error on dev nvme0n1, logical block 587, lost async page write
> [  125.942592] Buffer I/O error on dev nvme0n1, logical block 588, lost async page write
> [  125.942593] Buffer I/O error on dev nvme0n1, logical block 375454, lost async page write
> [  125.942594] Buffer I/O error on dev nvme0n1, logical block 589, lost async page write
> [  125.942595] Buffer I/O error on dev nvme0n1, logical block 590, lost async page write
> [  125.942596] Buffer I/O error on dev nvme0n1, logical block 591, lost async page write
> [  125.942597] Buffer I/O error on dev nvme0n1, logical block 592, lost async page write
> [  130.205584] print_req_error: 537000 callbacks suppressed
> [  130.205586] print_req_error: I/O error, dev nvme0n1, sector 471135288
> [  130.218763] print_req_error: I/O error, dev nvme0n1, sector 471137240
> [  130.225985] print_req_error: I/O error, dev nvme0n1, sector 471138328
> [  130.233206] print_req_error: I/O error, dev nvme0n1, sector 471140096
> [  130.240433] print_req_error: I/O error, dev nvme0n1, sector 471140184
> [  130.247659] print_req_error: I/O error, dev nvme0n1, sector 471140960
> [  130.254874] print_req_error: I/O error, dev nvme0n1, sector 471141864
> [  130.262095] print_req_error: I/O error, dev nvme0n1, sector 471143296
> [  130.269317] print_req_error: I/O error, dev nvme0n1, sector 471143776
> [  130.276537] print_req_error: I/O error, dev nvme0n1, sector 471144224
> [  132.954315] nvme nvme0: Identify namespace failed
> [  132.959698] buffer_io_error: 3801549 callbacks suppressed
> [  132.959699] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  132.974669] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  132.983078] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  132.991476] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  132.999859] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  133.008217] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  133.016575] Dev nvme0n1: unable to read RDB block 0
> [  133.022423] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  133.030800] Buffer I/O error on dev nvme0n1, logical block 0, async page read
> [  133.039151]  nvme0n1: unable to read partition table
> [  133.050221] Buffer I/O error on dev nvme0n1, logical block 65535984, async page read
> [  133.060154] Buffer I/O error on dev nvme0n1, logical block 65535984, async page read
> [  136.334516] nvme nvme0: creating 37 I/O queues.
> [  136.636012] no online cpu for hctx 1
> [  136.640448] no online cpu for hctx 2
> [  136.644832] no online cpu for hctx 3
> [  136.650432] nvme nvme0: Successfully reconnected (1 attempts)
> [  184.894584] x86: Booting SMP configuration:
> [  184.899694] smpboot: Booting Node 1 Processor 1 APIC 0x20
> [  184.913923] smpboot: Booting Node 0 Processor 2 APIC 0x2
> [  184.929556] smpboot: Booting Node 1 Processor 3 APIC 0x22
>
> And here is the debug output:
> [root@rdma-virt-01 linux (test)]$ gdb vmlinux
> GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-110.el7
> Copyright (C) 2013 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later 
> <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show 
> copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-redhat-linux-gnu".
> For bug reporting instructions, please see:
> <http://www.gnu.org/software/gdb/bugs/>...
> Reading symbols from /home/test/linux/vmlinux...done.
> (gdb) l *(blk_mq_get_request+0x23e)
> 0xffffffff8136bf9e is in blk_mq_get_request (block/blk-mq.c:327).
> 322        rq->rl = NULL;
> 323        set_start_time_ns(rq);
> 324        rq->io_start_time_ns = 0;
> 325    #endif
> 326
> 327        data->ctx->rq_dispatched[op_is_sync(op)]++;
> 328        return rq;
> 329    }
> 330
> 331    static struct request *blk_mq_get_request(struct request_queue *q,
>
>
>
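The faulting statement dereferences data->ctx, so ctx apparently was
never set up for this hctx. A minimal sketch of the failing path in
blk_mq_alloc_request_hctx() (abridged, not the verbatim 4.16 source;
the patch quoted below adds the fallback):

--
	/*
	 * Pick a software ctx for the caller-chosen hctx. If every CPU
	 * in hctx->cpumask is offline, cpumask_first_and() returns
	 * nr_cpu_ids, and __blk_mq_get_ctx() then yields a bogus
	 * per-cpu ctx pointer.
	 */
	cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
	alloc_data.ctx = __blk_mq_get_ctx(q, cpu);  /* cpu == nr_cpu_ids here */

	/*
	 * blk_mq_get_request() then crashes on:
	 *   data->ctx->rq_dispatched[op_is_sync(op)]++;   <-- +0x23e above
	 */
	rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--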
>> -- 
>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 75336848f7a7..81ced3096433 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct 
>> request_queue *q,
>>                 return ERR_PTR(-EXDEV);
>>         }
>>         cpu = cpumask_first_and(alloc_data.hctx->cpumask, 
>> cpu_online_mask);
>> +       if (cpu >= nr_cpu_ids) {
>> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
>> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
>> +       }
>>         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
>>
>>         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
>> -- 
>>
>>> 2) when nr_hw_queues is set as num_online_cpus(), there may be
>>> many fewer online CPUs, so the hw queue number can be initialized
>>> much smaller, and then performance is degraded a lot even if some
>>> CPUs become online later.
>>
>> That is correct; when the controller is reset, though, more queues
>> will be added to the system. I agree it would be good if we can change
>> stuff dynamically.
>>
>>> So the current policy is to map all possible CPUs for handling CPU
>>> hotplug, and if you want to get 1:1 mapping between hw queue and
>>> online CPU, the nr_hw_queues can be set as num_possible_cpus.
>>
>> Having nr_hw_queues == num_possible_cpus cannot work, as it requires
>> establishing an RDMA queue-pair with a set of HW resources both on
>> the host side _and_ on the controller side, which would then sit idle.
>>
>>> Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as
>>> num_possible_cpus() to pci_alloc_irq_vectors).
>>
>> Yes, I am aware of this patch, however I'm not sure it'll be a good
>> idea for nvmf as it takes resources from both the host and the target
>> for cpus that may never come online...
>>
>>> It will waste some memory resource just like percpu variable, but it
>>> simplifies the queue mapping logic a lot, and can support both hard
>>> and soft CPU online/offline without CPU hotplug handler, which may
>>> cause very complicated queue dependency issue.
>>
>> Yes, but these memory resources become an issue when they take
>> HW (RDMA) resources on the local device and on the target device.
>>
>>>> I agree we don't want to connect hctx which doesn't have an online
>>>> cpu, that's redundant, but this is not the case here.
>>>
>>> OK, I will explain below, and it can be fixed by the following patch 
>>> too:
>>>
>>> https://marc.info/?l=linux-block&m=152318093725257&w=2
>>>
>>
>> I agree this patch is good!
>>
>>>>>>> Or I may understand you wrong, :-)
>>>>>>
>>>>>> In the report we connected 40 hctxs (which was exactly the number of
>>>>>> online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
>>>>>> I'm not sure why some hctxs are left without any online cpus.
>>>>>
>>>>> That is possible after the following two commits:
>>>>>
>>>>> 4b855ad37194 ("blk-mq: Create hctx for each present CPU)
>>>>> 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each 
>>>>> possisble CPU)
>>>>>
>>>>> And this can be triggered even without putting down any CPUs.
>>>>>
>>>>> The blk-mq CPU hotplug handler was removed in 4b855ad37194, and we
>>>>> can't remap queues any more when the CPU topology changes, so the
>>>>> static & fixed mapping has to be set up from the beginning.
>>>>>
>>>>> Then if there are few enough online CPUs compared with the number
>>>>> of hw queues, some hctxes can be mapped with all-offline CPUs. For
>>>>> example, if one device has 4 hw queues, but there are only 2 online
>>>>> CPUs and 6 offline CPUs, at most 2 hw queues are assigned to online
>>>>> CPUs, and the other two are left with only offline CPUs.
>>>>
>>>> That is fine, but in the example I gave below, which has
>>>> nr_hw_queues == num_online_cpus, we still end up with unmapped
>>>> hctxs because of the mapping.
>>>
>>> For FC's case, there may be some hctxs not 'mapped', which is caused by
>>> blk_mq_map_queues(), but that should be one bug.
>>>
>>> So the patch(blk-mq: don't keep offline CPUs mapped to hctx 0)[1] is
>>> fixing the issue:
>>>
>>> [1] https://marc.info/?l=linux-block&m=152318093725257&w=2
>>>
>>> Once this patch is in, any hctx should be mapped by at least one CPU.
>>
>> I think this will solve the problem Yi is stepping on.
>>
>>> Then later, the patch(blk-mq: reimplement blk_mq_hw_queue_mapped)[2]
>>> extends the mapping concept, maybe it should have been renamed as
>>> blk_mq_hw_queue_active(), will do it in V2.
>>>
>>> [2] https://marc.info/?l=linux-block&m=152318099625268&w=2
>>
>> This is also a good patch.
>>
>> ...
>>
>>>> But when we reset the controller, we call blk_mq_update_nr_hw_queues()
>>>> with the current number of nr_hw_queues which never exceeds
>>>> num_online_cpus. This, in turn, remaps the mq_map, which results
>>>> in unmapped queues because of the mapping function, not because we
>>>> have more hctx than online cpus...
>>>
>>> As I mentioned, num_online_cpus() isn't one stable variable, and it
>>> can change any time.
>>
>> Correct, but I'm afraid num_possible_cpus might not work either.
>>
>>> After patch(blk-mq: don't keep offline CPUs mapped to hctx 0) is in,
>>> there won't be unmapped queue any more.
>>
>> Yes.
>>
>>>> An easy fix is to allocate num_present_cpus queues, and only connect
>>>> the online ones, but as you said, we have unused resources this way.
>>>
>>> Yeah, it should be num_possible_cpus queues because physical CPU
>>> hotplug needs to be supported for KVM or S390, or even some X86_64
>>> systems.
>>
>> num_present_cpus is a waste of resources (as I said, both on the host
>> and on the target), but num_possible_cpus is even worse as this is
>> all cpus that _can_ be populated.
>>
>>>> We also have an issue with blk_mq_rdma_map_queues with the only
>>>> device that supports it because it doesn't use managed affinity (code
>>>> was reverted) and can have irq affinity redirected in case of cpu
>>>> offlining...
>>>
>>> That can be one corner case; looks like I have to re-consider the patch
>>> (blk-mq: remove code for dealing with remapping queue), which may cause
>>> regression for this RDMA case, but I guess CPU hotplug may break this
>>> case easily.
>>>
>>> [3] https://marc.info/?l=linux-block&m=152318100625284&w=2
>>>
>>> Also this case will make blk-mq's queue mapping much more
>>> complicated; could you provide one link about the reason for
>>> reverting managed affinity on this device?
>>
>> The problem was that users reported a regression because now
>> /proc/irq/$IRQ/smp_affinity is immutable. Looks like netdev users do
>> this on a regular basis (and also rely on irq_balancer at times) while
>> nvme users (and other HBAs) do not care about it.
>>
>> Thread starts here:
>> https://www.spinics.net/lists/netdev/msg464301.html
>>
>>> Recently we fix quite a few issues on managed affinity, maybe the
>>> original issue for RDMA affinity has been addressed already.
>>
>> That is not specific to RDMA affinity, it's because RDMA devices are
>> also network devices and people want to apply their irq affinity
>> scripts to them like they're used to with other devices.
>>
>>>> The goal here, I think, should be to allocate just enough queues (not
>>>> more than the number of online cpus) and spread them 1:1 over the
>>>> online cpus, and also make sure to allocate completion vectors that
>>>> align to online cpus. I just need to figure out how to do that...
>>>
>>> I think we have to support CPU hotplug, so your goal may be hard to
>>> reach if you don't want to waste memory resource.
>>
>> Well, not so much if I make blk_mq_rdma_map_queues do the right thing?
>>
>> As I said, for the first go, I'd like to fix the mapping for the simple
>> case where we map the queues with some cpus offlined. Having queues
>> being added dynamically is a different story and I agree would require
>> more work.
>>

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-09  8:54                             ` Yi Zhang
@ 2018-04-09  9:13                               ` Sagi Grimberg
  -1 siblings, 0 replies; 36+ messages in thread
From: Sagi Grimberg @ 2018-04-09  9:13 UTC (permalink / raw)
  To: Yi Zhang, Ming Lei; +Cc: linux-block, linux-nvme


> Hi Sagi
> Sorry for the late response, below patch works, here is the full log:

Thanks for testing!

Now that we've isolated the issue, the question is whether this fix is
correct, given that we are guaranteed that the connect context will run
on an online cpu?

another reference to the patch (we can make the pr_warn a pr_debug):
-- 
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 75336848f7a7..81ced3096433 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct 
request_queue *q,
                 return ERR_PTR(-EXDEV);
         }
         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
+       if (cpu >= nr_cpu_ids) {
+               pr_warn("no online cpu for hctx %d\n", hctx_idx);
+               cpu = cpumask_first(alloc_data.hctx->cpumask);
+       }
         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);

         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
--

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-09  8:31                           ` Sagi Grimberg
@ 2018-04-09 12:15                             ` Ming Lei
  -1 siblings, 0 replies; 36+ messages in thread
From: Ming Lei @ 2018-04-09 12:15 UTC (permalink / raw)
  To: Sagi Grimberg; +Cc: linux-block, Yi Zhang, linux-nvme

On Mon, Apr 09, 2018 at 11:31:37AM +0300, Sagi Grimberg wrote:
> 
> > > My device exposes nr_hw_queues which is not higher than num_online_cpus
> > > so I want to connect all hctxs with hope that they will be used.
> > 
> > The issue is that CPU online & offline can happen at any time, and
> > after blk-mq removed the CPU hotplug handler, there is no way to
> > remap queues when the CPU topology changes.
> > 
> > For example:
> > 
> > 1) after nr_hw_queues is set as num_online_cpus() and hw queues
> > are initialized, then some of the CPUs become offline, and the issue
> > reported by Zhang Yi is triggered, but in this case, we should fail
> > the allocation since 1:1 mapping doesn't need to use this inactive
> > hw queue.
> 
> Normal cpu offlining is fine, as the hctxs are already connected. When
> we reset the controller and re-establish the queues, the issue triggers
> because we call blk_mq_alloc_request_hctx.

That is right, blk_mq_alloc_request_hctx() is one insane interface.

Also could you share a bit why the request has to be allocated in
this way?

I may have to read the NVMe connect protocol and related code to
understand this mechanism.
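
From a quick look, the 4.16-era call chain seems to be roughly the
following (a sketch as I understand it, abridged; see
drivers/nvme/host/fabrics.c and core.c):

--
/*
 * The fabrics connect command for I/O queue 'qid' must be issued on
 * that queue itself, so the request is allocated against the matching
 * hctx rather than whatever hctx the submitting CPU happens to map to:
 *
 *   nvmf_connect_io_queue(ctrl, qid)
 *     -> __nvme_submit_sync_cmd(ctrl->connect_q, &cmd, ..., qid, ...)
 *       -> nvme_alloc_request(q, cmd, flags, qid)
 *         -> blk_mq_alloc_request_hctx(q, op, flags, qid - 1)
 */
--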

> 
> The question is, for this particular issue, given that the request
> execution is guaranteed to run from an online cpu, will the below work?
> --
> diff --git a/block/blk-mq.c b/block/blk-mq.c
> index 75336848f7a7..81ced3096433 100644
> --- a/block/blk-mq.c
> +++ b/block/blk-mq.c
> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
> request_queue *q,
>                 return ERR_PTR(-EXDEV);
>         }
>         cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
> +       if (cpu >= nr_cpu_ids) {
> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
> +       }

We may do it this way for the special case, but it is ugly, IMO.

>         alloc_data.ctx = __blk_mq_get_ctx(q, cpu);
> 
>         rq = blk_mq_get_request(q, NULL, op, &alloc_data);
> --
> 
> > 2) when nr_hw_queues is set as num_online_cpus(), there may be
> > many fewer online CPUs, so the hw queue number can be initialized
> > much smaller, and then performance is degraded a lot even if some
> > CPUs become online later.
> 
> That is correct; when the controller is reset, though, more queues
> will be added to the system. I agree it would be good if we can change
> stuff dynamically.
> 
> > So the current policy is to map all possible CPUs for handling CPU
> > hotplug, and if you want to get 1:1 mapping between hw queue and
> > online CPU, the nr_hw_queues can be set as num_possible_cpus.
> 
> Having nr_hw_queues == num_possible_cpus cannot work, as it requires
> establishing an RDMA queue-pair with a set of HW resources both on
> the host side _and_ on the controller side, which would then sit idle.

OK, can I understand it as just being because there aren't so many hw
resources?

> 
> > Please see commit 16ccfff28976130 (nvme: pci: pass max vectors as
> > num_possible_cpus() to pci_alloc_irq_vectors).
> 
> Yes, I am aware of this patch, however I'm not sure it'll be a good
> idea for nvmf as it takes resources from both the host and the target
> for cpus that may never come online...
> 
> > It will waste some memory resource just like percpu variable, but it
> > simplifies the queue mapping logic a lot, and can support both hard
> > and soft CPU online/offline without CPU hotplug handler, which may
> > cause very complicated queue dependency issue.
> 
> Yes, but these memory resources become an issue when they take
> HW (RDMA) resources on the local device and on the target device.

Maybe both host & target resources can be left unallocated until some
CPU comes online for this hctx on the host side. But the CPU hotplug
handler has to be re-introduced; maybe callbacks like .hctx_activate or
.hctx_deactivate can be added for allocating/releasing these resources
in the CPU hotplug path. Since the queue mapping won't be changed, and
queue freezing may be avoided, it should be fine.
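
A rough sketch of what such callbacks could look like (hypothetical
only -- neither .hctx_activate nor .hctx_deactivate exists in blk-mq;
the names and signatures below just follow the suggestion above):

--
struct blk_mq_ops {
	...
	/*
	 * Hypothetical: called when the first CPU mapped to @hctx comes
	 * online, so the driver can set up per-hctx HW resources (e.g.
	 * an RDMA queue-pair) only once they can actually be used.
	 */
	int (*hctx_activate)(struct blk_mq_hw_ctx *hctx);

	/*
	 * Hypothetical: called when the last CPU mapped to @hctx goes
	 * offline, so those resources can be released again.
	 */
	void (*hctx_deactivate)(struct blk_mq_hw_ctx *hctx);
};
--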

> 
> > > I agree we don't want to connect hctx which doesn't have an online
> > > cpu, that's redundant, but this is not the case here.
> > 
> > OK, I will explain below, and it can be fixed by the following patch too:
> > 
> > https://marc.info/?l=linux-block&m=152318093725257&w=2
> > 
> 
> I agree this patch is good!
> 
> > > > > > Or I may understand you wrong, :-)
> > > > > 
> > > > > In the report we connected 40 hctxs (which was exactly the number of
> > > > > online cpus), after Yi removed 3 cpus, we tried to connect 37 hctxs.
> > > > > I'm not sure why some hctxs are left without any online cpus.
> > > > 
> > > > That is possible after the following two commits:
> > > > 
> > > > 4b855ad37194 ("blk-mq: Create hctx for each present CPU)
> > > > 20e4d8139319 (blk-mq: simplify queue mapping & schedule with each possisble CPU)
> > > > 
> > > > And this can be triggered even without putting down any CPUs.
> > > > 
> > > > The blk-mq CPU hotplug handler was removed in 4b855ad37194, and we can't
> > > > remap queues any more when the CPU topology changes, so the static & fixed
> > > > mapping has to be set up from the beginning.
> > > > 
> > > > Then if there are few enough online CPUs compared with the number of hw
> > > > queues, some hctxes can be mapped with all-offline CPUs. For example, if one
> > > > device has 4 hw queues, but there are only 2 online CPUs and 6 offline CPUs,
> > > > at most 2 hw queues are assigned to online CPUs, and the other two are left
> > > > with only offline CPUs.
> > > 
> > > That is fine, but in the example I gave below, which has
> > > nr_hw_queues == num_online_cpus, we still end up with unmapped
> > > hctxs because of the mapping.
> > 
> > For FC's case, there may be some hctxs not 'mapped', which is caused by
> > blk_mq_map_queues(), but that should be one bug.
> > 
> > So the patch(blk-mq: don't keep offline CPUs mapped to hctx 0)[1] is
> > fixing the issue:
> > 	
> > [1]	https://marc.info/?l=linux-block&m=152318093725257&w=2
> > 
> > Once this patch is in, any hctx should be mapped by at least one CPU.
> 
> I think this will solve the problem Yi is stepping on.
> 
> > Then later, the patch(blk-mq: reimplement blk_mq_hw_queue_mapped)[2]
> > extends the mapping concept, maybe it should have been renamed as
> > blk_mq_hw_queue_active(), will do it in V2.
> > 
> > [2] https://marc.info/?l=linux-block&m=152318099625268&w=2
> 
> This is also a good patch.
> 
> ...
> 
> > > But when we reset the controller, we call blk_mq_update_nr_hw_queues()
> > > with the current number of nr_hw_queues which never exceeds
> > > num_online_cpus. This, in turn, remaps the mq_map, which results
> > > in unmapped queues because of the mapping function, not because we
> > > have more hctx than online cpus...
> > 
> > As I mentioned, num_online_cpus() isn't one stable variable, and it
> > can change any time.
> 
> Correct, but I'm afraid num_possible_cpus might not work either

Why?

> 
> > After patch(blk-mq: don't keep offline CPUs mapped to hctx 0) is in,
> > there won't be unmapped queue any more.
> 
> Yes.
> 
> > > An easy fix is to allocate num_present_cpus queues, and only connect
> > > the online ones, but as you said, we have unused resources this way.
> > 
> > Yeah, it should be num_possible_cpus queues because physical CPU hotplug
> > needs to be supported for KVM or S390, or even some X86_64 systems.
> 
> num_present_cpus is a waste of resources (as I said, both on the host
> and on the target), but num_possible_cpus is even worse as this is
> all cpus that _can_ be populated.

Yes, that can be one direction for improving queue mapping.

> 
> > > We also have an issue with blk_mq_rdma_map_queues with the only
> > > device that supports it because it doesn't use managed affinity (code
> > > was reverted) and can have irq affinity redirected in case of cpu
> > > offlining...
> > 
> > That can be one corner case; looks like I have to re-consider the patch
> > (blk-mq: remove code for dealing with remapping queue), which may cause
> > regression for this RDMA case, but I guess CPU hotplug may break this
> > case easily.
> > 
> > [3] https://marc.info/?l=linux-block&m=152318100625284&w=2
> > 
> > Also this case will make blk-mq's queue mapping much more complicated;
> > could you provide one link about the reason for reverting managed affinity
> > on this device?
> 
> The problem was that users reported a regression because now
> /proc/irq/$IRQ/smp_affinity is immutable. Looks like netdev users do
> this on a regular basis (and also rely on irq_balancer at times) while
> nvme users (and other HBAs) do not care about it.
> 
> Thread starts here:
> https://www.spinics.net/lists/netdev/msg464301.html
> 
> > Recently we fix quite a few issues on managed affinity, maybe the
> > original issue for RDMA affinity has been addressed already.
> 
> That is not specific to RDMA affinity, it's because RDMA devices are
> also network devices and people want to apply their irq affinity
> scripts to them like they're used to with other devices.

OK, got it. Then it seems RDMA can't use managed IRQ affinity any more,
and it has to be treated a bit specially now.
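
For reference, blk_mq_rdma_map_queues() is roughly the following
(abridged from the 4.16-era block/blk-mq-rdma.c as I read it; treat it
as a sketch, not verbatim source). When the device does not expose
per-vector affinity masks via ib_get_vector_affinity() -- which is the
case once managed affinity is reverted -- it falls back to the generic
blk_mq_map_queues():

--
int blk_mq_rdma_map_queues(struct blk_mq_tag_set *set,
		struct ib_device *dev, int first_vec)
{
	const struct cpumask *mask;
	unsigned int queue, cpu;

	for (queue = 0; queue < set->nr_hw_queues; queue++) {
		mask = ib_get_vector_affinity(dev, first_vec + queue);
		if (!mask)
			goto fallback;

		/* map each CPU in the vector's affinity mask to this hctx */
		for_each_cpu(cpu, mask)
			set->mq_map[cpu] = queue;
	}

	return 0;

fallback:
	return blk_mq_map_queues(set);
}
--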

> 
> > > The goal here, I think, should be to allocate just enough queues (not
> > > more than the number of online cpus) and spread them 1:1 over the online
> > > cpus, and also make sure to allocate completion vectors that align to
> > > online cpus. I just need to figure out how to do that...
> > 
> > I think we have to support CPU hotplug, so your goal may be hard to
> > reach if you don't want to waste memory resource.
> 
> Well, not so much if I make blk_mq_rdma_map_queues do the right thing?
> 
> As I said, for the first go, I'd like to fix the mapping for the simple
> case where we map the queues with some cpus offlined. Having queues
> being added dynamically is a different story and I agree would require
> more work.

It may not be a simple case, since Zhang Yi is running a CPU hotplug
stress test with NVMe disconnection & connection in the meantime.

Thanks,
Ming

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7
  2018-04-09 12:15                             ` Ming Lei
@ 2018-04-11 13:24                               ` Sagi Grimberg
  -1 siblings, 0 replies; 36+ messages in thread
From: Sagi Grimberg @ 2018-04-11 13:24 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-block, Yi Zhang, linux-nvme, Christoph Hellwig


>> diff --git a/block/blk-mq.c b/block/blk-mq.c
>> index 75336848f7a7..81ced3096433 100644
>> --- a/block/blk-mq.c
>> +++ b/block/blk-mq.c
>> @@ -444,6 +444,10 @@ struct request *blk_mq_alloc_request_hctx(struct
>> request_queue *q,
>>                  return ERR_PTR(-EXDEV);
>>          }
>>          cpu = cpumask_first_and(alloc_data.hctx->cpumask, cpu_online_mask);
>> +       if (cpu >= nr_cpu_ids) {
>> +               pr_warn("no online cpu for hctx %d\n", hctx_idx);
>> +               cpu = cpumask_first(alloc_data.hctx->cpumask);
>> +       }
> 
> We may do this way for the special case, but it is ugly, IMO.

Christoph?

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2018-04-11 13:24 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1912441239.17517737.1522396297270.JavaMail.zimbra@redhat.com>
2018-03-30  9:32 ` BUG at IP: blk_mq_get_request+0x23e/0x390 on 4.16.0-rc7 Yi Zhang
2018-03-30  9:32   ` Yi Zhang
2018-04-04 13:22   ` Sagi Grimberg
2018-04-04 13:22     ` Sagi Grimberg
2018-04-05 16:35     ` Yi Zhang
2018-04-05 16:35       ` Yi Zhang
2018-04-08 10:36       ` Sagi Grimberg
2018-04-08 10:36         ` Sagi Grimberg
2018-04-08 10:44         ` Ming Lei
2018-04-08 10:44           ` Ming Lei
2018-04-08 10:48           ` Ming Lei
2018-04-08 10:48             ` Ming Lei
2018-04-08 10:58             ` Sagi Grimberg
2018-04-08 10:58               ` Sagi Grimberg
2018-04-08 11:04               ` Ming Lei
2018-04-08 11:04                 ` Ming Lei
2018-04-08 11:53                 ` Sagi Grimberg
2018-04-08 11:53                   ` Sagi Grimberg
2018-04-08 12:57                   ` Ming Lei
2018-04-08 12:57                     ` Ming Lei
2018-04-08 13:35                     ` Sagi Grimberg
2018-04-08 13:35                       ` Sagi Grimberg
2018-04-09  2:47                       ` Ming Lei
2018-04-09  2:47                         ` Ming Lei
2018-04-09  8:31                         ` Sagi Grimberg
2018-04-09  8:31                           ` Sagi Grimberg
2018-04-09  8:54                           ` Yi Zhang
2018-04-09  8:54                             ` Yi Zhang
2018-04-09  9:05                             ` Yi Zhang
2018-04-09  9:05                               ` Yi Zhang
2018-04-09  9:13                             ` Sagi Grimberg
2018-04-09  9:13                               ` Sagi Grimberg
2018-04-09 12:15                           ` Ming Lei
2018-04-09 12:15                             ` Ming Lei
2018-04-11 13:24                             ` Sagi Grimberg
2018-04-11 13:24                               ` Sagi Grimberg
