linux-nvme.lists.infradead.org archive mirror
 help / color / mirror / Atom feed
* nvme_poll() CQ processing
@ 2020-02-26 19:20 Bijan Mottahedeh
  2020-02-26 20:23 ` Keith Busch
  0 siblings, 1 reply; 3+ messages in thread
From: Bijan Mottahedeh @ 2020-02-26 19:20 UTC (permalink / raw)
  To: linux-nvme

Running an io_uring polled fio test crashes the system (fio test, vm 
config, and stack trace included below).

I think the problem may be in the nvme driver's handling of multiple 
pollers sharing the same CQ, due to the fact that nvme_poll() drops 
cq_poll_lock before completing the CQEs found with nvme_process_cq():

nvme_poll()
{
     ...
     spin_lock(&nvmeq->cq_poll_lock);
     found = nvme_process_cq(nvmeq, &start, &end, -1);
     spin_unlock(&nvmeq->cq_poll_lock);

     nvme_complete_cqes(nvmeq, start, end);
     ...
}

Furthermore, nvme_process_cq() rings the CQ doorbell after collecting 
the CQEs but before processing them:

static inline int nvme_process_cq(struct nvme_queue *nvmeq, u16 *start, 
u16 *end, unsigned int tag)
{
     ...
     while (nvme_cqe_pending(nvmeq)) {
         ...
         nvme_update_cq_head(nvmeq);
     }
     ...
         nvme_ring_cq_doorbell(nvmeq);
     return found;
}

Each poller effectively tells the controller that the CQ is empty when 
it rings the CQ doorbell. This is ok if there is only one poller but 
with many of them, I think enough tags can be freed and reissued that CQ 
could be overrun.

In one specific example:

- Poller 1 find a CQ full of entries in nvme_process_cq()
- Poller 1 processes CQEs, and more pollers find CQE ranges to process
   Pollers 2-4 start processing additional non-overlapping CQE ranges
- Poller 5 finds a CQE range that is overlapping with Poller 1

CQ size 1024

Poller          1   2    3    4    5
CQ start index  10  9    214  401  708
CQ end index    9   214  401  708  77
CQ start phase  1   0    0    0    0
CQ end phase    0   0    0    0    1

Poller 1 finds the CQ phase has flipped when processing CQE 821 and  
indeed the phase has flipped because of poller 5.  If I interpret this 
data correctly, it suggests that Pollers 1 and 5 overlap.

After that I start seeing errors.

A simpler theoretical example with two threads:

Thread 1 submits enough I/O to fill the CQ
Thread 1 then processes two CQEs, two block layer tags become available.
Thread 1 is preempted by thread 2.
Thread 2 submits two I/Os.
Thread 2 processes the two CQEs which it owns.
Thread 2 submits two more I/Os.
Those CQEs overwrite the next two CQEs that will be processed by thread 1.

Two of thread 1's IOs will not receive a completion.  Two of thread 2's 
IOs will receive two completions.

Just as a workaround, I held cq_poll_lock while completing the CQEs and 
see no errors.

Thanks.

--bijan

---------------------------------------------------------------------------------------------------

fio nvme.fio --readonly --ioengine=io_uring --iodepth 1024 --fixedbufs 
--hipri --numjobs=16

[global]
filename=/dev/nvme0n1
rw=randread
bs=256k
direct=1
time_based=1
randrepeat=1
gtod_reduce=1
[fiotest]

# lscpu

CPU(s):              16
On-line CPU(s) list: 0-15
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           16
NUMA node(s):        1

# cat /sys/kernel/debug/block/nvme0n1/hctx*/type

default
poll
poll
poll
poll
poll
default
default
default
default
default
default
poll
poll
poll

[  103.926786] ------------[ cut here ]------------
[  103.928373] DMA-API: nvme 0000:00:04.0: device driver tries to free DMA memory it has not allocated [device address=0x6b6b6b6b6b6b6b6b] [size=1802201963 bytes]
[  103.933647] WARNING: CPU: 15 PID: 13310 at kernel/dma/debug.c:1014 check_unmap+0x145/0x870
[  103.935617] Modules linked in: xfs dm_mod sd_mod sr_mod cdrom crc32c_intel nvme ata_piix nvme_core t10_pi e1000 libata virtio_pci 9pnet_virtio virtio_ring virtio 9p 9pnet
[  103.939255] CPU: 15 PID: 13310 Comm: fio Not tainted 5.6.0-rc3-next-20200225-bij #695
[  103.942658] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[  103.945177] RIP: 0010:check_unmap+0x145/0x870
[  103.946074] Code: 00 48 8b 1f e9 59 06 00 00 48 c7 c3 eb a2 14 82 48 89 de 4d 89 e8 4c 89 e1 48 89 da 48 c7 c7 58 9c 12 82 31 c0 e8 ab 29 f7 ff <0f> 0b 8b 15 13 2a 5e 01 85 d2 0f 85 08 07 00 00 8b 05 31 a0 40 01
[  103.950243] RSP: 0018:ffffc90000e87a98 EFLAGS: 00010282
[  103.951316] RAX: 0000000000000000 RBX: ffff8882300002a0 RCX: 0000000000000000
[  103.953592] RDX: ffff888236bfab80 RSI: ffff888236bea408 RDI: ffff888236bea408
[  103.956417] RBP: ffffc90000e87b00 R08: ffff888203c25580 R09: 00000000fffffff0
[  103.958941] R10: 0000000000000001 R11: 0000000000000000 R12: 6b6b6b6b6b6b6b6b
[  103.960565] R13: 000000006b6b6b6b R14: 0000000000000002 R15: 0000000000000001
[  103.962047] FS:  00007ff16d6a9740(0000) GS:ffff888236a00000(0000) knlGS:0000000000000000
[  103.964397] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  103.965616] CR2: 00007f697e3840d8 CR3: 00000002305cc002 CR4: 0000000000160ee0
[  103.967203] Call Trace:
[  103.967633]  ? do_raw_spin_unlock+0x83/0x90
[  103.968536]  debug_dma_unmap_sg+0xff/0x140
[  103.969592]  ? kvm_sched_clock_read+0xd/0x20
[  103.970416]  ? __lock_acquire.isra.28+0x526/0x6c0
[  103.971383]  ? kvm_sched_clock_read+0xd/0x20
[  103.973386]  ? __lock_acquire.isra.28+0x526/0x6c0
[  103.975465]  nvme_unmap_data+0x109/0x250 [nvme]
[  103.976734]  nvme_pci_complete_rq+0xda/0xf0 [nvme]
[  103.978688]  blk_mq_complete_request+0x47/0xf0
[  103.981003]  nvme_poll+0x256/0x2e0 [nvme]
[  103.982567]  ? kvm_sched_clock_read+0xd/0x20
[  103.983941]  blk_poll+0x24f/0x320
[  103.985445]  ? io_iopoll_check+0x38/0xc0
[  103.986355]  io_iopoll_getevents+0x11c/0x2e0
[  103.987373]  ? do_raw_spin_unlock+0x83/0x90
[  103.988268]  ? __mutex_unlock_slowpath+0x14e/0x250
[  103.989244]  io_iopoll_check+0x83/0xc0
[  103.990126]  __x64_sys_io_uring_enter+0x24e/0x400
[  103.991102]  ? __x64_sys_io_uring_enter+0x83/0x400
[  103.992192]  do_syscall_64+0x63/0x1a0
[  103.993298]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  103.994685] RIP: 0033:0x7ff16c5f52cd
[  103.995845] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 6b 2c 00 f7 d8 64 89 01 48
[  104.003817] RSP: 002b:00007ffcee452278 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[  104.007100] RAX: ffffffffffffffda RBX: 00007ff14b339680 RCX: 00007ff16c5f52cd
[  104.010153] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003
[  104.013045] RBP: 00007ff14b339680 R08: 0000000000000000 R09: 0000000000000000
[  104.015937] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000001182710
[  104.018951] R13: 00000000000004f3 R14: 0000000000000001 R15: 00000000011e50a8
[  104.022258] ---[ end trace e69e8de69a776ca6 ]---
[  104.026236] nvme 0000:00:04.0: dma_pool_free prp list page, 00000000bc9f82c1 (bad vaddr)/0x0000000093f49000
[  104.029873] ------------[ cut here ]------------
[  104.031695] refcount_t: underflow; use-after-free.
[  104.033659] WARNING: CPU: 15 PID: 13310 at lib/refcount.c:28 refcount_warn_saturate+0xa1/0xf0
[  104.036864] Modules linked in: xfs dm_mod sd_mod sr_mod cdrom crc32c_intel nvme ata_piix nvme_core t10_pi e1000 libata virtio_pci 9pnet_virtio virtio_ring virtio 9p 9pnet
[  104.042524] CPU: 15 PID: 13310 Comm: fio Tainted: G        W         5.6.0-rc3-next-20200225-bij #695
[  104.045934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[  104.050630] RIP: 0010:refcount_warn_saturate+0xa1/0xf0
[  104.052602] Code: ef 27 01 01 e8 50 4e c4 ff 0f 0b c3 80 3d e8 ef 27 01 00 75 59 48 c7 c7 88 2a 17 82 31 c0 c6 05 d6 ef 27 01 01 e8 2f 4e c4 ff <0f> 0b c3 80 3d c6 ef 27 01 00 75 38 48 c7 c7 b0 2a 17 82 31 c0 c6
[  104.059078] RSP: 0018:ffffc90000e87cb0 EFLAGS: 00010282
[  104.061022] RAX: 0000000000000000 RBX: ffff8882264a0f00 RCX: 0000000000000000
[  104.063112] RDX: ffff888236bfab80 RSI: ffff888236bea408 RDI: ffff888236bea408
[  104.066146] RBP: ffff888229adee78 R08: ffff888203c25580 R09: 00000000fffffff0
[  104.068666] R10: 0000000000000001 R11: 0000000000000000 R12: ffffe8ffffa00600
[  104.071430] R13: ffff8882264a0f00 R14: ffff888229adee78 R15: ffff8882277897f0
[  104.073970] FS:  00007ff16d6a9740(0000) GS:ffff888236a00000(0000) knlGS:0000000000000000
[  104.076743] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  104.078629] CR2: 00007f697e3840d8 CR3: 00000002305cc002 CR4: 0000000000160ee0
[  104.080861] Call Trace:
[  104.082232]  blk_mq_complete_request+0x47/0xf0
[  104.084001]  nvme_poll+0x256/0x2e0 [nvme]
[  104.085466]  ? kvm_sched_clock_read+0xd/0x20
[  104.086913]  blk_poll+0x24f/0x320
[  104.087970]  ? io_iopoll_check+0x38/0xc0
[  104.089375]  io_iopoll_getevents+0x11c/0x2e0
[  104.090782]  ? do_raw_spin_unlock+0x83/0x90
[  104.092125]  ? __mutex_unlock_slowpath+0x14e/0x250
[  104.093672]  io_iopoll_check+0x83/0xc0
[  104.094987]  __x64_sys_io_uring_enter+0x24e/0x400
[  104.096356]  ? __x64_sys_io_uring_enter+0x83/0x400
[  104.097731]  do_syscall_64+0x63/0x1a0
[  104.099346]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  104.101036] RIP: 0033:0x7ff16c5f52cd
[  104.102064] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 6b 2c 00 f7 d8 64 89 01 48
[  104.108623] RSP: 002b:00007ffcee452278 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[  104.111214] RAX: ffffffffffffffda RBX: 00007ff14b339680 RCX: 00007ff16c5f52cd
[  104.113584] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003
[  104.115863] RBP: 00007ff14b339680 R08: 0000000000000000 R09: 0000000000000000
[  104.118256] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000001182710
[  104.120504] R13: 00000000000004f3 R14: 0000000000000001 R15: 00000000011e50a8
[  104.122804] ---[ end trace e69e8de69a776ca7 ]---
[  104.126569] nvme 0000:00:04.0: dma_pool_free prp list page, 00000000f09b403e (bad vaddr)/0x0000000090a84000
[  104.133828] BUG: kernel NULL pointer dereference, address: 0000000000000010
[  104.136329] #PF: supervisor read access in kernel mode
[  104.137930] #PF: error_code(0x0000) - not-present page
[  104.139560] PGD 800000022c4c2067 P4D 800000022c4c2067 PUD 22c4c3067 PMD 0
[  104.141782] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[  104.143196] CPU: 15 PID: 13310 Comm: fio Tainted: G        W         5.6.0-rc3-next-20200225-bij #695
[  104.145776] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
[  104.148755] RIP: 0010:debug_dma_unmap_sg+0x8f/0x140
[  104.150568] Code: 31 ff c7 44 24 0c 00 00 00 00 31 c0 b9 10 00 00 00 48 89 df f3 48 ab 48 8b 04 24 48 ba 00 00 00 00 00 16 00 00 48 89 44 24 20 <49> 8b 44 24 10 48 89 44 24 28 41 8b 44 24 18 c7 44 24 38 01 00 00
[  104.157171] RSP: 0018:ffffc90000e87af0 EFLAGS: 00010246
[  104.159158] RAX: ffff8882315fa0b0 RBX: ffffc90000e87b00 RCX: 0000000000000000
[  104.160922] RDX: 0000160000000000 RSI: 00000000cb5926a1 RDI: ffffc90000e87b80
[  104.163226] RBP: ffffc90000e87c20 R08: ffff888203c25580 R09: 00000000fffffff0
[  104.165948] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  104.168605] R13: 0000000000000002 R14: 0000000000000002 R15: 0000000000000001
[  104.170870] FS:  00007ff16d6a9740(0000) GS:ffff888236a00000(0000) knlGS:0000000000000000
[  104.173846] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  104.176481] CR2: 0000000000000010 CR3: 00000002305cc003 CR4: 0000000000160ee0
[  104.177997] Call Trace:
[  104.178647]  ? kvm_sched_clock_read+0xd/0x20
[  104.179440]  ? __lock_acquire.isra.28+0x526/0x6c0
[  104.181143]  ? kvm_sched_clock_read+0xd/0x20
[  104.182543]  ? __lock_acquire.isra.28+0x526/0x6c0
[  104.184162]  nvme_unmap_data+0x109/0x250 [nvme]
[  104.185779]  nvme_pci_complete_rq+0xda/0xf0 [nvme]
[  104.187420]  blk_mq_complete_request+0x47/0xf0
[  104.189194]  nvme_poll+0x256/0x2e0 [nvme]
[  104.190479]  ? kvm_sched_clock_read+0xd/0x20
[  104.191631]  blk_poll+0x24f/0x320
[  104.192472]  ? io_iopoll_check+0x38/0xc0
[  104.193614]  io_iopoll_getevents+0x11c/0x2e0
[  104.194712]  ? do_raw_spin_unlock+0x83/0x90
[  104.195793]  ? __mutex_unlock_slowpath+0x14e/0x250
[  104.196882]  io_iopoll_check+0x83/0xc0
[  104.197728]  __x64_sys_io_uring_enter+0x24e/0x400
[  104.198886]  ? __x64_sys_io_uring_enter+0x83/0x400
[  104.200391]  do_syscall_64+0x63/0x1a0
[  104.201065]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  104.202340] RIP: 0033:0x7ff16c5f52cd
[  104.203050] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 8b 6b 2c 00 f7 d8 64 89 01 48
[  104.207871] RSP: 002b:00007ffcee452278 EFLAGS: 00000246 ORIG_RAX: 00000000000001aa
[  104.210498] RAX: ffffffffffffffda RBX: 00007ff14b339680 RCX: 00007ff16c5f52cd
[  104.212609] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000003
[  104.214606] RBP: 00007ff14b339680 R08: 0000000000000000 R09: 0000000000000000
[  104.217113] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000001182710
[  104.219215] R13: 00000000000004f3 R14: 0000000000000001 R15: 00000000011e50a8
[  104.221121] Modules linked in: xfs dm_mod sd_mod sr_mod cdrom crc32c_intel nvme ata_piix nvme_core t10_pi e1000 libata virtio_pci 9pnet_virtio virtio_ring virtio 9p 9pnet
[  104.224353] CR2: 0000000000000010
[  104.225289] ---[ end trace e69e8de69a776ca8 ]---
[  104.226923] RIP: 0010:debug_dma_unmap_sg+0x8f/0x140
[  104.228365] Code: 31 ff c7 44 24 0c 00 00 00 00 31 c0 b9 10 00 00 00 48 89 df f3 48 ab 48 8b 04 24 48 ba 00 00 00 00 00 16 00 00 48 89 44 24 20 <49> 8b 44 24 10 48 89 44 24 28 41 8b 44 24 18 c7 44 24 38 01 00 00
[  104.233092] RSP: 0018:ffffc90000e87af0 EFLAGS: 00010246
[  104.234267] RAX: ffff8882315fa0b0 RBX: ffffc90000e87b00 RCX: 0000000000000000
[  104.235910] RDX: 0000160000000000 RSI: 00000000cb5926a1 RDI: ffffc90000e87b80
[  104.237249] RBP: ffffc90000e87c20 R08: ffff888203c25580 R09: 00000000fffffff0
[  104.238881] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[  104.240319] R13: 0000000000000002 R14: 0000000000000002 R15: 0000000000000001
[  104.241706] FS:  00007ff16d6a9740(0000) GS:ffff888236a00000(0000) knlGS:0000000000000000
[  104.244033] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  104.245188] CR2: 0000000000000010 CR3: 00000002305cc003 CR4: 0000000000160ee0
[  104.247659] Kernel panic - not syncing: Fatal exception
[  104.250410] Kernel Offset: disabled
[  104.251030] ---[ end Kernel panic - not syncing: Fatal exception ]---


_______________________________________________
linux-nvme mailing list
linux-nvme@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-nvme

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2020-02-26 21:46 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-26 19:20 nvme_poll() CQ processing Bijan Mottahedeh
2020-02-26 20:23 ` Keith Busch
2020-02-26 21:45   ` Bijan Mottahedeh

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).