[PATCH] nvme: add cond_resched() to nvme_complete_batch()

* [PATCH] nvme: add cond_resched() to nvme_complete_batch()
@ 2023-05-02 12:54 jiweisun126
  2023-05-03 16:07 ` Christoph Hellwig
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: jiweisun126 @ 2023-05-02 12:54 UTC (permalink / raw)
  To: linux-nvme, linux-kernel
  Cc: kbusch, axboe, hch, sagi, ahuang12, sunjw10, jiweisun126

From: Jiwei Sun <sunjw10@lenovo.com>

A soft lockup issue will be triggered when do fio test on a 448-core
server, such as the following warning:

[  519.743511] watchdog: BUG: soft lockup - CPU#305 stuck for 144s! [fio:11226]
[  519.744834] CPU: 305 PID: 11226 Comm: fio Kdump: loaded Tainted: G             L     6.3.0 #48
[  519.744850] Hardware name: Lenovo ThinkSystem SR645 V3 MB,Genoa,DDR5,Oahu,1U/SB27B31174, BIOS KAE111F-2.10 04/17/2023
[  519.744867] RIP: 0010:__do_softirq+0x93/0x2d3
[  519.744901] Code: c3 c0 5a ff c7 44 24 08 0a 00 00 00 48 c7 c7 45 6b c5 a0 e8 1f ac fe ff 65 66 c7 05 dd 76 a8 5f 00 00 fb 49 c7 c2 c0 60 20 a1 <eb> 70 48 98 48 c7 c7 f8 fc c5 a0 4d 8d 3c c2 4c 89 fd 48 81 ed c0
[  519.744916] RSP: 0018:ff6e423a8b584fa0 EFLAGS: 00000202
[  519.744944] RAX: 0000000000000131 RBX: ff3143f856cd3f80 RCX: 0000000000000000
[  519.744959] RDX: 000000562c5e4652 RSI: ffffffffa0c56b45 RDI: ffffffffa0c35ab3
[  519.744974] RBP: 0000000000000000 R08: 0000000000000b01 R09: 0000000000031600
[  519.744988] R10: ffffffffa12060c0 R11: 00000054ea7837c0 R12: 0000000000000080
[  519.745003] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[  519.745017] FS:  00007fa41bb9e000(0000) GS:ff3143f7af840000(0000) knlGS:0000000000000000
[  519.745032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  519.745046] CR2: 00007fa3fc05b000 CR3: 00000008d6f56004 CR4: 0000000000771ee0
[  519.745060] PKRU: 55555554
[  519.745075] Call Trace:
[  519.745095]  <IRQ>
[  519.745160]  ? ftrace_regs_caller_end+0x61/0x61
[  519.745185]  irq_exit_rcu+0xcd/0x100
[  519.745217]  sysvec_apic_timer_interrupt+0xa2/0xd0
[  519.745251]  </IRQ>
[  519.745265]  <TASK>
[  519.745292]  asm_sysvec_apic_timer_interrupt+0x1a/0x20
[  519.745318] RIP: 0010:_raw_spin_unlock_irqrestore+0x1d/0x40
[  519.745339] Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa e8 77 9b 0b 20 c6 07 00 0f 1f 00 f7 c6 00 02 00 00 74 01 fb bf 01 00 00 00 <e8> be 65 58 ff 65 8b 05 9f 81 a8 5f 85 c0 74 05 c3 cc cc cc cc 0f
[  519.745354] RSP: 0018:ff6e423b244279f0 EFLAGS: 00000206
[  519.745381] RAX: 000000098b346200 RBX: ff3143f0f0fccc00 RCX: 0000000000000286
[  519.745396] RDX: 0000000000000700 RSI: 0000000000000286 RDI: 0000000000000001
[  519.745410] RBP: ff3143f835e5c600 R08: 000000058bb02000 R09: 0000000000000000
[  519.745424] R10: 00000107d08ab6ea R11: 0000000000000800 R12: 0000000000000820
[  519.745438] R13: ff3143f835e5c610 R14: ff6e423b24427a50 R15: ff3143f90b346200
[  519.745576]  dma_pool_alloc+0x184/0x200
[  519.745647]  nvme_prep_rq.part.58+0x416/0x840 [nvme]
[  519.745760]  nvme_queue_rq+0x7b/0x210 [nvme]
[  519.745821]  __blk_mq_try_issue_directly+0x153/0x1c0
[  519.745906]  blk_mq_try_issue_directly+0x15/0x50
[  519.745935]  blk_mq_submit_bio+0x4c4/0x510
[  519.746047]  submit_bio_noacct_nocheck+0x331/0x370
[  519.746135]  blkdev_direct_IO.part.22+0x138/0x5b0
[  519.746165]  ? __fsnotify_parent+0x119/0x350
[  519.746274]  blkdev_read_iter+0xe3/0x170
[  519.746325]  aio_read+0xf6/0x1b0
[  519.746423]  ? 0xffffffffc066309b
[  519.746567]  ? io_submit_one+0x18c/0xd60
[  519.746587]  ? aio_read+0x5/0x1b0
[  519.746608]  io_submit_one+0x18c/0xd60
[  519.746672]  ? __x64_sys_io_submit+0x4/0x180
[  519.746794]  ? __x64_sys_io_submit+0x93/0x180
[  519.746815]  __x64_sys_io_submit+0x93/0x180
[  519.746838]  ? __pfx___x64_sys_io_submit+0x10/0x10
[  519.746923]  do_syscall_64+0x3b/0x90
[  519.746958]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[  519.746983] RIP: 0033:0x7fa41b83ee5d
[  519.747009] Code: ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 93 af 1b 00 f7 d8 64 89 01 48
[  519.747024] RSP: 002b:00007fff698e8cc8 EFLAGS: 00000246 ORIG_RAX: 00000000000000d1
[  519.747053] RAX: ffffffffffffffda RBX: 00007fa41bb9df70 RCX: 00007fa41b83ee5d
[  519.747067] RDX: 000055a53098ba68 RSI: 0000000000000001 RDI: 00007fa3fc05b000
[  519.747082] RBP: 00007fa3fc05b000 R08: 0000000000000000 R09: 0000000000000048
[  519.747095] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001
[  519.747110] R13: 0000000000000000 R14: 000055a53098ba68 R15: 000055a530989440
[  519.747245]  </TASK>

and we have collected the corresponding ftrace log,
  515.948558 |   305)               |              nvme_irq [nvme]() {
+--1503 lines: 515.948560 |   305)               |                blk_mq_complete_request_remote() {---
  515.948995 |   305)               |                nvme_pci_complete_batch [nvme]() {
  515.948996 |   305)               |                  nvme_unmap_data [nvme]() {
+--  3 lines: 515.948996 |   305)               |                    dma_unmap_sg_attrs() {---
  515.948997 |   305)               |                    dma_pool_free() {
  515.948997 |   305)               |                      _raw_spin_lock_irqsave() {
  515.948997 |   305)   0.190 us    |                        preempt_count_add();
  515.948999 |   305) * 14893.70 us |                        native_queued_spin_lock_slowpath();
  515.963893 |   305) * 14895.59 us |                      }
  515.963929 |   305)               |                      _raw_spin_unlock_irqrestore() {
  515.963930 |   305)   0.160 us    |                        preempt_count_sub();
  515.963930 |   305)   1.011 us    |                      }
  515.963931 |   305) * 14933.71 us |                    }
+--  9 lines: 515.963931 |   305)               |                    mempool_free() {---
  515.963933 |   305) * 14937.45 us |                  }
  515.963933 |   305)   0.160 us    |                  nvme_complete_batch_req [nvme_core]();
  515.963934 |   305)               |                  nvme_unmap_data [nvme]() {
+--  3 lines: 515.963934 |   305)               |                    dma_unmap_sg_attrs() {---
  515.963935 |   305)               |                    dma_pool_free() {
  515.963935 |   305)               |                      _raw_spin_lock_irqsave() {
  515.963936 |   305)   0.170 us    |                        preempt_count_add();
  515.963936 |   305) * 19865.82 us |                        native_queued_spin_lock_slowpath();
  515.983803 |   305) * 19867.21 us |                      }
  515.983833 |   305)               |                      _raw_spin_unlock_irqrestore() {
  515.983833 |   305)   0.200 us    |                        preempt_count_sub();
  515.983834 |   305)   1.312 us    |                      }
  515.983834 |   305) * 19898.71 us |                    }
+--- 11 lines: 515.983834 |   305)               |                    mempool_free() {---
  515.983837 |   305) * 19902.73 us |                  }
  515.983837 |   305)   0.170 us    |                  nvme_complete_batch_req [nvme_core]();
+--40055 lines: 515.983837 |   305)               |                  nvme_unmap_data [nvme]() {---
  520.816867 |   305) $ 4867871 us  |                }
  520.816867 |   305) $ 4868309 us  |              }
  520.816868 |   305) $ 4868310 us  |            }

According to the above two logs, we can know the nvme_irq() cost too much
time, in the above case, about 4.8 second. And we can also know that the
main bottlenecks is in the competition for the spin lock pool->lock.

In order to avoid the lockup log, we can enable the nvme irq threading
by adding nvme.use_threaded_interrupts=1 in cmdline and with preempt=full,
but for the kernel with preempt=off, the nvme.use_threaded_interrupts=1
will fail. So, we can add cond_resched() in the loops of the
nvme_complete_batch().

Reviewed-by: Adrian Huang <ahuang12@lenovo.com>
Signed-off-by: Jiwei Sun <sunjw10@lenovo.com>
---
 drivers/nvme/host/nvme.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index bf46f122e9e1..e247cb664e9e 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -725,6 +725,7 @@ static __always_inline void nvme_complete_batch(struct io_comp_batch *iob,
 	rq_list_for_each(&iob->req_list, req) {
 		fn(req);
 		nvme_complete_batch_req(req);
+		cond_resched();
 	}
 	blk_mq_end_request_batch(iob);
 }
-- 
2.27.0



^ permalink raw reply related	[flat|nested] 10+ messages in thread