linux-kernel.vger.kernel.org archive mirror
* Random high CPU utilization in blk-mq with the none scheduler
@ 2021-12-10  3:30 Dexuan Cui
  2021-12-11  1:29 ` Dexuan Cui
  0 siblings, 1 reply; 17+ messages in thread
From: Dexuan Cui @ 2021-12-10  3:30 UTC (permalink / raw)
  To: Jens Axboe, ming.lei, Christoph Hellwig, linux-block
  Cc: Long Li, Michael Kelley (LINUX), linux-kernel

Hi all,
I found a random high CPU utilization issue with some database benchmark
program running on a 192-CPU virtual machine (VM). Originally the issue
was found with RHEL 8.4 and Ubuntu 20.04, and further tests show that the
issue also reproduces with the latest upstream stable kernel v5.15.7, but
*not* with v5.16-rc1. It looks like someone resolved the issue in v5.16-rc1
recently?

Details:
The database benchmark test starts 1000+ processes, but it is not
CPU-intensive. The processes access 24 iSCSI-backed disks (multipath is
used; no local disk is used). Normally the overall CPU utilization of
the 192 CPUs should be 2~5%, including %user (less than 1%),
%sys (2~5%), and %si (less than 1%). The iSCSI TCP traffic is not heavy
and the packets-per-second (PPS) rate is not high. By default the "none"
I/O scheduler is used in the test.
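
For reference, a quick way to check/set the scheduler per disk (sdb is
just a placeholder for one of the iSCSI disks):

  # the active scheduler is shown in brackets
  cat /sys/block/sdb/queue/scheduler
  # select "none" explicitly
  echo none > /sys/block/sdb/queue/scheduler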

The issue is: every 1~2 minutes, there is a >90% CPU utilization spike
that lasts 5~20 seconds, and the system's performance drops
significantly. When the CPU utilization spike is happening, *each* of
the 192 CPUs' %sys goes up to 60+%, and the %si goes up to 30%!
Sometimes, when I'm unlucky, the >90% CPU utilization persists continuously.

This is not a NIC driver issue because the NIC interrupt rate is low due
to NAPI, the PPS rate is also low (only about 250K/s), and the network
traffic is light (typically only 2~5 Gbps; the limit for the VM is
30 Gbps, and the NIC driver itself can easily reach a throughput of
20+ Gbps when the database benchmark test is not running).

The NIC driver uses 31 queues by default. When the CPU utilization spike
is happening, if I configure the NIC to use only 1 queue with
"ethtool -L eth0 combined 1", the CPU utilization goes back to normal
immediately, and the network throughput and PPS rate go up; if I use 2
queues, the CPU utilization is still abnormal (and it consistently
stays high -- see [1]), though not as high as in the 31-queue case.
In the case of [1], the iSCSI TCP packets are processed on CPU 7 and
CPU 30 ("cat /proc/irq/57/smp_affinity_list" returns "7", and
"cat /proc/irq/58/smp_affinity_list" returns "30"), to which the NIC's
MSI-X interrupts (IRQ 57 and 58) are bound. If I check the hottest
functions on CPU 7 and CPU 30, e.g. with
"perf top -C 7 --call-graph dwarf" (see [2]), it looks like the code path
blk_done_softirq() -> ... -> blk_mq_sched_restart() -> ...
-> kblockd_mod_delayed_work_on() -> ... -> try_to_grab_pending() is
even hotter than the NIC TX/RX handler and the TCP/IP stack! Since the
other 190 CPUs also have high %sys and %si (see [3]), I check
"perf top -C 0 --call-graph dwarf" (see [4]) and find that the same
blk_done_softirq() -> ... -> try_to_grab_pending() code path is very hot.
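
For reference, the commands behind the observations above (the IRQ
numbers and CPU IDs are specific to this VM):

  # reduce the NIC to 1 (or 2) queues
  ethtool -L eth0 combined 1
  # which CPUs handle the NIC's MSI-X interrupts
  cat /proc/irq/57/smp_affinity_list    # returns 7
  cat /proc/irq/58/smp_affinity_list    # returns 30
  # profile the hottest functions on those CPUs
  perf top -C 7 --call-graph dwarf -d 30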

My feeling is that the iSCSI hardware queues are flipping between
"not full -> full -> not full -> full -> ..." very frequently, so the
blk-mq softirq handler is invoked very frequently, trying to submit
new I/O requests? try_to_grab_pending() acquires and releases the pool->lock
spinlock, and I guess there may be a lock contention issue?? If I change
the NIC's second IRQ's affinity from CPU 30 to CPU 7 as well (i.e. both
NIC queues are handled by the same CPU 7) with
"echo 7 > /proc/irq/58/smp_affinity_list", the high CPU utilization issue
goes away immediately (i.e. the %sys and %si numbers on every CPU other
than CPU 7 go down immediately to normal, i.e. %sys = 1.9%, %si = 0.6%);
if I change the second IRQ's affinity back to CPU 30, the high CPU
utilization returns immediately. Any insights into this significant
difference between blk_done_softirq() running on 1 CPU vs. on 2 CPUs?
When both NIC IRQs are handled by the same CPU, the network throughput
and PPS rate are actually better.
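
For reference, the affinity toggling was done roughly like this (mpstat
is just one way to watch the per-CPU %sys/%si while toggling):

  # pin the second NIC IRQ to CPU 7 as well -> the issue goes away
  echo 7 > /proc/irq/58/smp_affinity_list
  # move it back to CPU 30 -> the issue returns
  echo 30 > /proc/irq/58/smp_affinity_list
  # observe per-CPU %sys/%si
  mpstat -P ALL 2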

When the high CPU utilization issue is happening, if I change the I/O
scheduler from "none" to "mq-deadline", the issue basically also goes
away immediately (%sys = 2.7%, %si = 1.4%; this is slightly higher than
the case where "none" is used and both NIC IRQs are handled by the same CPU).

It's unclear why the issue doesn't reproduce with v5.16-rc1. The kernel
difference from v5.15.7 (the bad one) is huge (1+ million lines of changes),
and a weird file system hang issue (with a kernel built from a git commit
in between, I/O requests on the iSCSI disks cannot complete at all -- see
[5]) prevents me from doing git-bisect. There are about 100 patches related
to blk-mq, and at first glance I have not found anything that can explain
why the issue does not reproduce with v5.16-rc1.
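
The bisect attempt looked roughly like this; it gets stuck because some
commits in the range hang iSCSI I/O (see [5]):

  git bisect start
  git bisect bad v5.16-rc1
  git bisect good v5.15
  # build/boot each suggested commit, run the benchmark, then mark it:
  git bisect good    # or "git bisect bad"; "git bisect skip" when I/O hangs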

Thanks,
-- Dexuan


[1] A snippet of the "top" output (v5.15.7 stable kernel; the NIC uses 2 queues)

top - 01:49:14 up 31 min,  2 users,  load average: 1052.88, 999.40, 706.60
Tasks: 2862 total, 207 running, 2654 sleeping,   1 stopped,   0 zombie
%Cpu(s):  0.1 us, 35.2 sy,  0.0 ni,  1.1 id, 49.6 wa,  0.0 hi, 14.0 si,  0.0 st
MiB Mem : 2064232.+total, 2053872.+free,   9031.1 used,   1328.6 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used. 2049269.+avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    195 root      20   0       0      0      0 R  92.7   0.0  13:37.47 ksoftirqd/30
     57 root      20   0       0      0      0 R  91.4   0.0  14:22.18 ksoftirqd/7
  11649 root      20   0   18624   3544   1536 R  70.3   0.0   9:02.73 GenerateIO.Linu
  11654 root      20   0   18624   3544   1536 R  69.8   0.0   9:05.26 GenerateIO.Linu
  11650 root      20   0   18624   3544   1536 R  69.0   0.0   9:00.95 GenerateIO.Linu
  11655 root      20   0   18624   3544   1536 R  68.9   0.0   8:39.59 GenerateIO.Linu
  11652 root      20   0   18624   3544   1536 R  68.8   0.0   8:37.07 GenerateIO.Linu
  11656 root      20   0   18624   3544   1536 R  68.5   0.0   8:40.68 GenerateIO.Linu
  11651 root      20   0   18624   3544   1536 R  68.2   0.0   8:38.72 GenerateIO.Linu
  11653 root      20   0   18624   3544   1536 S  67.5   0.0   8:39.97 GenerateIO.Linu
    869 root      20   0       0      0      0 S  10.2   0.0   1:32.12 ksoftirqd/142
    580 root      20   0       0      0      0 R  10.1   0.0   1:28.62 ksoftirqd/94
    629 root      20   0       0      0      0 S  10.1   0.0   1:28.80 ksoftirqd/102
    436 root      20   0       0      0      0 S  10.0   0.0   1:27.72 ksoftirqd/70
    556 root      20   0       0      0      0 S  10.0   0.0   1:29.44 ksoftirqd/90
    316 root      20   0       0      0      0 S   9.9   0.0   1:29.45 ksoftirqd/50
    592 root      20   0       0      0      0 S   9.9   0.0   1:32.97 ksoftirqd/96
    990 root      20   0       0      0      0 R   9.8   0.0   1:29.08 ksoftirqd/162
...

[2] "perf top -C 7 --call-graph dwarf -d 30"  (only the first lines of the output are shown here)

Samples: 1M of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 246356121644 lost: 0/47 drop: 0/0
  Children      Self  Shared Object            Symbol
-   98.01%     0.03%  [kernel]                 [k] __softirqentry_text_start
   - 12.10% __softirqentry_text_start
      - 7.10% blk_done_softirq
         - 7.22% blk_complete_reqs
            - 6.52% dm_softirq_done
               - 6.55% dm_done
                  - 6.52% blk_mq_end_request
                     - 6.53% blk_mq_free_request
                        - 6.49% __blk_mq_free_request
                           - 6.48% blk_mq_sched_restart
                              - 6.46% blk_mq_run_hw_queue
                                 - 6.46% __blk_mq_delay_run_hw_queue
                                    - 7.20% kblockd_mod_delayed_work_on
                                       - 51.73% mod_delayed_work_on
                                            try_to_grab_pending
      - 4.89% net_rx_action
         - 4.90% __napi_poll
            - 3.28% mlx5e_napi_poll
               - 3.03% mlx5e_poll_rx_cq
                  - 3.85% mlx5e_handle_rx_cqe
                     - 2.65% napi_gro_receive
                        - 2.54% dev_gro_receive
                           - 1.78% napi_gro_complete.constprop.0.isra.0
                              - 1.80% netif_receive_skb_list_internal
                                 - 1.82% __netif_receive_skb_list_core
                                    - 1.76% ip_list_rcv
                                       - 1.77% ip_sublist_rcv
                                          - 1.67% ip_sublist_rcv_finish
                                             - 1.70% ip_local_deliver
                                                - 1.61% ip_local_deliver_finish
                                                   - 1.71% ip_protocol_deliver_rcu
                                                      - 1.81% tcp_v4_rcv
                                                         - 1.63% tcp_v4_do_rcv
                                                            - 1.73% tcp_rcv_established
                                                               - 0.98% tcp_data_ready
                                                                  - 0.97% iscsi_sw_tcp_data_ready
                                                                     - 0.99% tcp_read_sock
                                                                        - 2.00% iscsi_sw_tcp_recv
                                                                           - 2.01% iscsi_tcp_recv_skb
                                                                             0.99% memcpy_erms
                     - 1.34% mlx5e_skb_from_cqe_linear
               - 0.86% mlx5e_poll_tx_cq
            - 1.93% napi_gro_flush
               - 1.95% napi_gro_complete.constprop.0.isra.0
                  - 1.95% netif_receive_skb_list_internal
                     - 1.98% __netif_receive_skb_list_core
                        - 1.92% ip_list_rcv
                           - 1.93% ip_sublist_rcv
                              - 1.84% ip_sublist_rcv_finish
                                 - 1.87% ip_local_deliver
                                    - 1.78% ip_local_deliver_finish
                                       - 1.89% ip_protocol_deliver_rcu
                                          - 1.97% tcp_v4_rcv
                                             - 1.82% tcp_v4_do_rcv
                                                - 1.93% tcp_rcv_established
                                                   - 1.10% tcp_data_ready
                                                      - 1.10% iscsi_sw_tcp_data_ready
                                                         - 1.12% tcp_read_sock
                                                            - 2.59% iscsi_sw_tcp_recv
                                                               - 2.64% iscsi_tcp_recv_skb
                                                                    0.78% iscsi_tcp_segment_done
                                                                 1.38% memcpy_erms
+   92.61%     0.01%  [kernel]                 [k] smpboot_thread_fn
+   92.58%     0.01%  [kernel]                 [k] run_ksoftirqd
+   56.55%     0.00%  [kernel]                 [k] blk_done_softirq
+   56.53%     0.02%  [kernel]                 [k] blk_complete_reqs
+   52.56%     0.00%  [kernel]                 [k] __blk_mq_delay_run_hw_queue
+   52.55%     0.01%  [kernel]                 [k] blk_mq_run_hw_queue
+   52.54%     0.00%  [kernel]                 [k] kblockd_mod_delayed_work_on
+   52.54%     0.87%  [kernel]                 [k] mod_delayed_work_on
+   52.10%     0.00%  [kernel]                 [k] dm_softirq_done
-   52.10%     0.01%  [kernel]                 [k] dm_done
   - 6.54% dm_done
      - 6.52% blk_mq_end_request
         - 6.53% blk_mq_free_request
            - 6.49% __blk_mq_free_request
               - 6.48% blk_mq_sched_restart
                  - 6.46% blk_mq_run_hw_queue
                     - 6.46% __blk_mq_delay_run_hw_queue
                        - 7.20% kblockd_mod_delayed_work_on
                           - 51.73% mod_delayed_work_on
                                try_to_grab_pending
+   51.97%     0.06%  [kernel]                 [k] blk_mq_free_request
+   51.88%     0.00%  [kernel]                 [k] blk_mq_end_request
+   51.88%     0.04%  [kernel]                 [k] __blk_mq_free_request
+   51.77%     0.02%  [kernel]                 [k] blk_mq_sched_restart
-   51.67%    51.64%  [kernel]                 [k] try_to_grab_pending
   - 6.34% ret_from_fork
      - kthread
         - 6.06% smpboot_thread_fn
            - 6.05% run_ksoftirqd
               - 48.66% __softirqentry_text_start
                    blk_done_softirq
                    blk_complete_reqs
                    dm_softirq_done
                    dm_done
                    blk_mq_end_request
                    blk_mq_free_request
                    __blk_mq_free_request
                    blk_mq_sched_restart
                    blk_mq_run_hw_queue
                    __blk_mq_delay_run_hw_queue
                    kblockd_mod_delayed_work_on
         - 1.53% worker_thread
            - 1.45% asm_sysvec_hyperv_stimer0
                 sysvec_hyperv_stimer0
                 irq_exit_rcu
                 __softirqentry_text_start
                 blk_done_softirq
                 blk_complete_reqs
                 dm_softirq_done
                 dm_done
                 blk_mq_end_request
                 blk_mq_free_request
                 __blk_mq_free_request
                 blk_mq_sched_restart
                 blk_mq_run_hw_queue
                 __blk_mq_delay_run_hw_queue
                 kblockd_mod_delayed_work_on
                 mod_delayed_work_on
                 try_to_grab_pending
+   40.07%     0.02%  [kernel]                 [k] net_rx_action
+   40.04%     0.01%  [kernel]                 [k] __napi_poll
+   33.42%     0.06%  [kernel]                 [k] netif_receive_skb_list_internal
+   33.23%     0.05%  [kernel]                 [k] __netif_receive_skb_list_core
+   32.96%     0.04%  [kernel]                 [k] ip_list_rcv
+   32.87%     0.05%  [kernel]                 [k] ip_sublist_rcv
+   31.34%     0.05%  [kernel]                 [k] ip_sublist_rcv_finish
+   31.27%     0.06%  [kernel]                 [k] ip_local_deliver
+   30.69%     0.02%  [kernel]                 [k] ip_local_deliver_finish
+   30.66%     0.04%  [kernel]                 [k] ip_protocol_deliver_rcu
+   30.57%     0.26%  [kernel]                 [k] tcp_v4_rcv
+   30.20%     0.03%  [kernel]                 [k] napi_gro_complete.constprop.0.isra.0
+   29.79%     0.03%  [kernel]                 [k] tcp_v4_do_rcv
+   29.68%     0.15%  [kernel]                 [k] tcp_rcv_established
+   25.13%     0.04%  [kernel]                 [k] tcp_data_ready
+   25.05%     0.06%  [kernel]                 [k] iscsi_sw_tcp_data_ready
+   24.94%     0.05%  [kernel]                 [k] tcp_read_sock
+   22.24%     0.02%  [kernel]                 [k] mlx5e_napi_poll
+   21.04%     0.05%  [kernel]                 [k] iscsi_sw_tcp_recv
+   20.26%     0.30%  [kernel]                 [k] mlx5e_poll_rx_cq
+   19.93%     0.18%  [kernel]                 [k] mlx5e_handle_rx_cqe
+   17.35%     0.76%  [kernel]                 [k] iscsi_tcp_recv_skb
+   17.31%     0.07%  [kernel]                 [k] napi_gro_receive
+   16.60%     0.03%  [kernel]                 [k] napi_gro_flush
+   14.88%     0.32%  [kernel]                 [k] dev_gro_receive
+   12.23%     0.00%  [kernel]                 [k] ret_from_fork
+   12.23%     0.00%  [kernel]                 [k] kthread
+   10.17%     0.04%  [kernel]                 [k] iscsi_complete_pdu
+   10.06%     0.23%  [kernel]                 [k] __iscsi_complete_pdu
+    9.33%     0.03%  [kernel]                 [k] iscsi_complete_task
+    8.95%     0.52%  [kernel]                 [k] __iscsi_put_task
+    8.41%     0.03%  [kernel]                 [k] blk_mq_complete_reque


[3]  The first lines of the "top" output are shown here:

top - 02:09:58 up 52 min,  2 users,  load average: 1050.01, 1038.20, 951.19
Tasks: 2862 total,  38 running, 2823 sleeping,   1 stopped,   0 zombie
%Cpu0  :  1.1 us, 37.0 sy,  0.0 ni,  1.3 id, 48.6 wa,  0.0 hi, 12.0 si,  0.0 st
%Cpu1  :  1.6 us, 26.1 sy,  0.0 ni,  7.9 id, 59.7 wa,  0.0 hi,  4.7 si,  0.0 st
%Cpu2  :  3.7 us, 34.4 sy,  0.0 ni,  1.3 id, 51.0 wa,  0.0 hi,  9.6 si,  0.0 st
%Cpu3  :  1.1 us, 24.6 sy,  0.0 ni,  8.0 id, 61.6 wa,  0.0 hi,  4.7 si,  0.0 st
%Cpu4  :  0.8 us, 36.0 sy,  0.0 ni,  0.4 id, 53.7 wa,  0.0 hi,  9.1 si,  0.0 st
%Cpu5  :  0.1 us, 25.9 sy,  0.0 ni,  5.6 id, 63.8 wa,  0.0 hi,  4.6 si,  0.0 st
%Cpu6  :  0.7 us, 25.1 sy,  0.0 ni,  2.8 id, 67.3 wa,  0.0 hi,  4.0 si,  0.0 st
%Cpu7  :  0.0 us,  2.1 sy,  0.0 ni,  0.1 id,  0.0 wa,  0.0 hi, 97.8 si,  0.0 st
%Cpu8  :  0.3 us, 38.2 sy,  0.0 ni,  0.1 id, 52.0 wa,  0.0 hi,  9.3 si,  0.0 st
%Cpu9  :  1.0 us, 28.3 sy,  0.0 ni,  1.4 id, 65.1 wa,  0.0 hi,  4.2 si,  0.0 st
%Cpu10 :  0.9 us, 38.2 sy,  0.0 ni,  0.5 id, 49.7 wa,  0.0 hi, 10.6 si,  0.0 st
%Cpu11 :  0.0 us, 27.0 sy,  0.0 ni,  2.3 id, 66.0 wa,  0.0 hi,  4.7 si,  0.0 st
%Cpu12 :  0.1 us, 37.8 sy,  0.0 ni,  1.2 id, 51.5 wa,  0.0 hi,  9.4 si,  0.0 st
%Cpu13 :  0.4 us, 27.3 sy,  0.0 ni,  2.9 id, 64.4 wa,  0.0 hi,  5.0 si,  0.0 st
%Cpu14 :  0.3 us, 38.2 sy,  0.0 ni,  0.6 id, 52.3 wa,  0.0 hi,  8.6 si,  0.0 st
%Cpu15 :  0.0 us, 26.4 sy,  0.0 ni,  2.9 id, 66.4 wa,  0.0 hi,  4.3 si,  0.0 st
%Cpu16 :  0.0 us, 36.3 sy,  0.0 ni,  0.5 id, 54.6 wa,  0.0 hi,  8.6 si,  0.0 st
%Cpu17 :  0.1 us, 26.8 sy,  0.0 ni,  5.0 id, 64.1 wa,  0.0 hi,  4.0 si,  0.0 st
%Cpu18 :  0.4 us, 38.2 sy,  0.0 ni,  0.8 id, 52.6 wa,  0.0 hi,  8.0 si,  0.0 st
%Cpu19 :  0.3 us, 24.5 sy,  0.0 ni,  6.0 id, 63.8 wa,  0.0 hi,  5.4 si,  0.0 st
%Cpu20 :  0.9 us, 37.0 sy,  0.0 ni,  0.6 id, 53.1 wa,  0.0 hi,  8.4 si,  0.0 st


[4] "perf top -C 0 --call-graph dwarf -d 30"  (only the first lines of the output are shown here)

Samples: 63K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 15949750000 lost: 77/77 drop: 0/0
  Children      Self  Shared Object                Symbol
-   85.84%     0.01%  [kernel]                     [k] __blk_mq_delay_run_hw_queue
   - 85.83% __blk_mq_delay_run_hw_queue
      - 69.75% __blk_mq_run_hw_queue
         - blk_mq_sched_dispatch_requests
            - 68.62% __blk_mq_sched_dispatch_requests
               - 42.70% blk_mq_dispatch_rq_list
                  - 26.92% __blk_mq_delay_run_hw_queue
                     - kblockd_mod_delayed_work_on
                        - 26.91% mod_delayed_work_on
                           - try_to_grab_pending
                              - 4.03% asm_sysvec_call_function_single
                                   sysvec_call_function_single
                                   irq_exit_rcu
                                 - __softirqentry_text_start
                                    - blk_done_softirq
                                      blk_complete_reqs
                                      dm_softirq_done
                                    - dm_done
                                       - 3.81% blk_mq_end_request
                                          - blk_mq_free_request
                                             - __blk_mq_free_request
                                               blk_mq_sched_restart
                                             - blk_mq_run_hw_queue
                                                - 3.81% __blk_mq_delay_run_hw_queue
                                                     kblockd_mod_delayed_work_on
                                                   - mod_delayed_work_on
                                                        try_to_grab_pending
                  - 15.03% blk_mq_run_hw_queue
                       __blk_mq_delay_run_hw_queue
                       kblockd_mod_delayed_work_on
                     - mod_delayed_work_on
                        - 14.82% try_to_grab_pending
                           - 2.41% asm_sysvec_call_function_single
                                sysvec_call_function_single
                                irq_exit_rcu
                              - __softirqentry_text_start
                                 - blk_done_softirq
                                 - blk_complete_reqs
                                    - dm_softirq_done
                                      dm_done
                                      blk_mq_end_request
                                    - blk_mq_free_request
                                       - 2.30% __blk_mq_free_request
                                          - blk_mq_sched_restart
                                            blk_mq_run_hw_queue
                                            __blk_mq_delay_run_hw_queue
                                            kblockd_mod_delayed_work_on
                                          - mod_delayed_work_on
                                               try_to_grab_pending
                  - 0.72% dm_mq_queue_rq
                       0.60% map_request
               - 25.85% blk_mq_do_dispatch_ctx
                  - 25.36% blk_mq_dispatch_rq_list
                     - 13.48% blk_mq_run_hw_queue
                          __blk_mq_delay_run_hw_queue
                        - kblockd_mod_delayed_work_on
                           - 13.48% mod_delayed_work_on
                              - 13.27% try_to_grab_pending
                                 - 2.21% asm_sysvec_call_function_single
                                      sysvec_call_function_single
                                      irq_exit_rcu
                                    - __softirqentry_text_start
                                       - blk_done_softirq
                                       - blk_complete_reqs
                                          - dm_softirq_done
                                             - 2.09% dm_done
                                                - 2.09% blk_mq_end_request
                                                   - 2.09% blk_mq_free_request
                                                      - __blk_mq_free_request
                                                        blk_mq_sched_restart
                                                        blk_mq_run_hw_queue
                                                        __blk_mq_delay_run_hw_queue
                                                        kblockd_mod_delayed_work_on
                                                      - mod_delayed_work_on
                                                           try_to_grab_pending
                     - 11.07% __blk_mq_delay_run_hw_queue
                          kblockd_mod_delayed_work_on
                        - mod_delayed_work_on
                           - 11.02% try_to_grab_pending
                              - 1.89% asm_sysvec_call_function_single
                                   sysvec_call_function_single
                                   irq_exit_rcu
                                 - __softirqentry_text_start
                                    - blk_done_softirq
                                      blk_complete_reqs
                                      dm_softirq_done
                                    - dm_done
                                       - 1.81% blk_mq_end_request
                                          - blk_mq_free_request
                                             - 1.81% __blk_mq_free_request
                                                  blk_mq_sched_restart
                                                  blk_mq_run_hw_queue
                                                  __blk_mq_delay_run_hw_queue
                                                  kblockd_mod_delayed_work_on
                                                - mod_delayed_work_on
                                                     try_to_grab_pending
                     - 0.77% dm_mq_queue_rq
                          0.59% map_request
            - 1.12% blk_mq_run_hw_queue
                 __blk_mq_delay_run_hw_queue
                 kblockd_mod_delayed_work_on
               - mod_delayed_work_on
                    1.10% try_to_grab_pending
      - 16.09% kblockd_mod_delayed_work_on
         - mod_delayed_work_on
              try_to_grab_pending
+   85.80%     0.01%  [kernel]                     [k] blk_mq_run_hw_queue
+   84.18%     0.00%  [kernel]                     [k] kblockd_mod_delayed_work_on
+   84.17%     0.95%  [kernel]                     [k] mod_delayed_work_on
-   83.36%    82.52%  [kernel]                     [k] try_to_grab_pending
   - 54.58% pread64 (inlined
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
      - __x64_sys_pread64
         - 54.57% vfs_read
              new_sync_read
              xfs_file_read_iter
              xfs_file_dio_read
            - iomap_dio_rw
               - 54.55% __iomap_dio_rw
                    blk_finish_plug
                    blk_flush_plug_list
                    blk_mq_flush_plug_list
                  - blk_mq_sched_insert_requests
                     - 53.64% blk_mq_run_hw_queue
                          __blk_mq_delay_run_hw_queue
                          __blk_mq_run_hw_queue
                        - blk_mq_sched_dispatch_requests
                           - 52.67% __blk_mq_sched_dispatch_requests
                              - 32.38% blk_mq_dispatch_rq_list
                                 - 19.96% __blk_mq_delay_run_hw_queue
                                    - kblockd_mod_delayed_work_on
                                       - 19.96% mod_delayed_work_on
                                          - try_to_grab_pending
                                             - 2.49% asm_sysvec_call_function_single
                                                  sysvec_call_function_single
                                                  irq_exit_rcu
                                                  __softirqentry_text_start
                                                  blk_done_softirq
                                                  blk_complete_reqs
                                                  dm_softirq_done
                                                  dm_done
                                                  blk_mq_end_request
                                                  blk_mq_free_request
                                                  __blk_mq_free_request
                                                  blk_mq_sched_restart
                                                  blk_mq_run_hw_queue
                                                  __blk_mq_delay_run_hw_queue
                                                  kblockd_mod_delayed_work_on
                                                  mod_delayed_work_on
                                                  try_to_grab_pending
                                 - 12.37% blk_mq_run_hw_queue
                                      __blk_mq_delay_run_hw_queue
                                      kblockd_mod_delayed_work_on
                                    - mod_delayed_work_on
                                       - 12.36% try_to_grab_pending
                                          - asm_sysvec_call_function_single
                                            sysvec_call_function_single
                                            irq_exit_rcu
                                            __softirqentry_text_start
                                            blk_done_softirq
                                            blk_complete_reqs
                                            dm_softirq_done
                                            dm_done
                                            blk_mq_end_request
                                            blk_mq_free_request
                                            __blk_mq_free_request
                                            blk_mq_sched_restart
                                            blk_mq_run_hw_queue
                                            __blk_mq_delay_run_hw_queue
                                            kblockd_mod_delayed_work_on
                                            mod_delayed_work_on
                                            try_to_grab_pending
                              - 20.29% blk_mq_do_dispatch_ctx
                                 - 20.02% blk_mq_dispatch_rq_list
                                    - 10.92% blk_mq_run_hw_queue
                                         __blk_mq_delay_run_hw_queue
                                       - kblockd_mod_delayed_work_on
                                          - 10.91% mod_delayed_work_on
                                             - 10.90% try_to_grab_pending
                                                - asm_sysvec_call_function_single
                                                  sysvec_call_function_single
                                                  irq_exit_rcu
                                                  __softirqentry_text_start
                                                  blk_done_softirq
                                                  blk_complete_reqs
                                                  dm_softirq_done
                                                  dm_done
                                                  blk_mq_end_request
                                                  blk_mq_free_request
                                                  __blk_mq_free_request
                                                  blk_mq_sched_restart
                                                  blk_mq_run_hw_queue
                                                  __blk_mq_delay_run_hw_queue
                                                  kblockd_mod_delayed_work_on
                                                  mod_delayed_work_on
                                                  try_to_grab_pending
                                    - 9.03% __blk_mq_delay_run_hw_queue
                                         kblockd_mod_delayed_work_on
                                         mod_delayed_work_on
                                       - try_to_grab_pending
                                          - asm_sysvec_call_function_single
                                            sysvec_call_function_single
                                            irq_exit_rcu
                                            __softirqentry_text_start
                                            blk_done_softirq
                                            blk_complete_reqs
                                            dm_softirq_done
                                            dm_done
                                            blk_mq_end_request
                                            blk_mq_free_request
                                            __blk_mq_free_request
                                            blk_mq_sched_restart
                                            blk_mq_run_hw_queue
                                            __blk_mq_delay_run_hw_queue
                                            kblockd_mod_delayed_work_on
                                            mod_delayed_work_on
                                            try_to_grab_pending
                           - 0.97% blk_mq_run_hw_queue
                                __blk_mq_delay_run_hw_queue
                                kblockd_mod_delayed_work_on
                                mod_delayed_work_on
                                try_to_grab_pending
                     - 0.91% blk_mq_try_issue_list_directly
                          blk_mq_request_bypass_insert
                          blk_mq_run_hw_queue
                          __blk_mq_delay_run_hw_queue
                          __blk_mq_run_hw_queue
                        - blk_mq_sched_dispatch_requests
                           - 0.91% __blk_mq_sched_dispatch_requests
                              - 0.83% blk_mq_dispatch_rq_list
                                 - 0.59% __blk_mq_delay_run_hw_queue
                                      kblockd_mod_delayed_work_on
                                      mod_delayed_work_on
                                      try_to_grab_pending
   - 11.05% syscall
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
      - __x64_sys_io_submit
         - 5.77% blk_finish_plug
              blk_flush_plug_list
              blk_mq_flush_plug_list
              blk_mq_sched_insert_requests
              blk_mq_run_hw_queue
              __blk_mq_delay_run_hw_queue
              __blk_mq_run_hw_queue
            - blk_mq_sched_dispatch_requests
               - 5.73% __blk_mq_sched_dispatch_requests
                  - 3.73% blk_mq_dispatch_rq_list
                     - 2.79% __blk_mq_delay_run_hw_queue
                          kblockd_mod_delayed_work_on
                          mod_delayed_work_on
                        - try_to_grab_pending
                           - 0.55% asm_sysvec_call_function_single
                                sysvec_call_function_single
                                irq_exit_rcu
                                __softirqentry_text_start
                                blk_done_softirq
                                blk_complete_reqs
                                dm_softirq_done
                                dm_done
                                blk_mq_end_request
                                blk_mq_free_request
                                __blk_mq_free_request
                                blk_mq_sched_restart
                                blk_mq_run_hw_queue
                               __blk_mq_delay_run_hw_queue
                                kblockd_mod_delayed_work_on
                                mod_delayed_work_on
                                try_to_grab_pending
                     - 0.93% blk_mq_run_hw_queue
                          __blk_mq_delay_run_hw_queue
                          kblockd_mod_delayed_work_on
                          mod_delayed_work_on
                          try_to_grab_pending
                  - 2.00% blk_mq_do_dispatch_ctx
                     - 1.99% blk_mq_dispatch_rq_list
                        - 1.14% blk_mq_run_hw_queue
                             __blk_mq_delay_run_hw_queue
                             kblockd_mod_delayed_work_on
                             mod_delayed_work_on
                             try_to_grab_pending
                        - 0.85% __blk_mq_delay_run_hw_queue
                             kblockd_mod_delayed_work_on
                             mod_delayed_work_on
                             try_to_grab_pending
         - 5.28% io_submit_one
              __io_submit_one.constprop.0
              aio_write
              xfs_file_write_iter
            - xfs_file_dio_write_aligned
               - 5.26% iomap_dio_rw
                  - __iomap_dio_rw
                     - 5.25% blk_finish_plug
                          blk_flush_plug_list
                          blk_mq_flush_plug_list
                        - blk_mq_sched_insert_requests
                           - 5.07% blk_mq_run_hw_queue
                                __blk_mq_delay_run_hw_queue
                                __blk_mq_run_hw_queue
                              - blk_mq_sched_dispatch_requests
                                 - 5.01% __blk_mq_sched_dispatch_requests
                                    - 3.40% blk_mq_dispatch_rq_list
                                       - 2.63% __blk_mq_delay_run_hw_queue
                                            kblockd_mod_delayed_work_on
                                            mod_delayed_work_on
                                          - try_to_grab_pending
                                             - asm_sysvec_call_function_single
                                               sysvec_call_function_single
                                               irq_exit_rcu
                                               __softirqentry_text_start
                                               blk_done_softirq
                                               blk_complete_reqs
                                               dm_softirq_done
                                               dm_done
                                               blk_mq_end_request
                                               blk_mq_free_request
                                               __blk_mq_free_request
                                               blk_mq_sched_restart
                                               blk_mq_run_hw_queue
                                               __blk_mq_delay_run_hw_queue
                                               kblockd_mod_delayed_work_on
                                               mod_delayed_work_on
                                               try_to_grab_pending
                                       - 0.76% blk_mq_run_hw_queue
                                            __blk_mq_delay_run_hw_queue
                                            kblockd_mod_delayed_work_on
                                            mod_delayed_work_on
                                            try_to_grab_pending
                                    - 1.62% blk_mq_do_dispatch_ctx
                                       - 1.57% blk_mq_dispatch_rq_list
                                          - 0.79% blk_mq_run_hw_queue
                                               __blk_mq_delay_run_hw_queue
                                               kblockd_mod_delayed_work_on
                                             - mod_delayed_work_on
                                                  0.79% try_to_grab_pending
                                          - 0.77% __blk_mq_delay_run_hw_queue
                                               kblockd_mod_delayed_work_on
                                               mod_delayed_work_on
                                               try_to_grab_pending

[5] the "dmesg" snippet for a kernel between v5.15 and v5.16-rc1 ( iSCSI I/O requests can
not complete due to some bug. Is this a known issue?):

[   45.991286] sd 21:0:0:1: [sdbp] Optimal transfer size 1048576 bytes
[   45.991347] sd 25:0:0:2: [sdbr] Write Protect is off
[   45.991349] sd 25:0:0:2: [sdbr] Mode Sense: 85 00 10 08
[   45.991541] sd 23:0:0:1: [sdbq] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   45.991590] sd 22:0:0:2: Attached scsi generic sg71 type 0
[   45.991746] sd 25:0:0:2: [sdbr] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   45.991877] sd 22:0:0:2: [sdbt] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[   45.992077] sd 23:0:0:1: [sdbq] Optimal transfer size 1048576 bytes
[   45.992093] sd 22:0:0:2: [sdbt] Write Protect is off
[   45.992094] sd 22:0:0:2: [sdbt] Mode Sense: 85 00 10 08
[   45.992165] sd 25:0:0:2: [sdbr] Optimal transfer size 1048576 bytes
[   45.992394] sd 24:0:0:1: [sdbs] 10737418240 512-byte logical blocks: (5.50 TB/5.00 TiB)
[   45.992426] sd 21:0:0:2: Attached scsi generic sg72 type 0
[   45.992533] sd 22:0:0:2: [sdbt] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   45.992691] sd 24:0:0:1: [sdbs] Write Protect is off
[   45.992693] sd 21:0:0:2: [sdbu] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[   45.992694] sd 24:0:0:1: [sdbs] Mode Sense: 85 00 10 08
[   45.992995] sd 22:0:0:2: [sdbt] Optimal transfer size 1048576 bytes
[   45.993000] sd 21:0:0:2: [sdbu] Write Protect is off
[   45.993002] sd 21:0:0:2: [sdbu] Mode Sense: 85 00 10 08
[   45.993367] sd 24:0:0:1: [sdbs] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   45.993480] sd 23:0:0:2: Attached scsi generic sg73 type 0
[   45.993603] sd 21:0:0:2: [sdbu] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   45.993993] sd 24:0:0:1: [sdbs] Optimal transfer size 1048576 bytes
[   45.994283] sd 21:0:0:2: [sdbu] Optimal transfer size 1048576 bytes
[   45.994359] sd 23:0:0:2: Power-on or device reset occurred
[   45.995506] sd 24:0:0:2: Attached scsi generic sg74 type 0
[   45.995754] sd 24:0:0:2: Power-on or device reset occurred
[   45.996180] sd 23:0:0:2: [sdbv] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[   45.996399] sd 23:0:0:2: [sdbv] Write Protect is off
[   45.996402] sd 23:0:0:2: [sdbv] Mode Sense: 85 00 10 08
[   45.996811] sd 23:0:0:2: [sdbv] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   45.997236] sd 23:0:0:2: [sdbv] Optimal transfer size 1048576 bytes
[   45.998626] sd 24:0:0:2: [sdbw] 268435456 512-byte logical blocks: (137 GB/128 GiB)
[   45.998948] sd 24:0:0:2: [sdbw] Write Protect is off
[   45.998949] sd 24:0:0:2: [sdbw] Mode Sense: 85 00 10 08
[   45.999562] sd 24:0:0:2: [sdbw] Write cache: disabled, read cache: enabled, supports DPO and FUA
[   46.000155] sd 24:0:0:2: [sdbw] Optimal transfer size 1048576 bytes
[   46.115711] sd 25:0:0:2: [sdbr] Attached SCSI disk
[   46.116112] sd 21:0:0:0: [sdbj] Attached SCSI disk
[   46.116162] sd 22:0:0:0: [sdbk] Attached SCSI disk
[   46.119829] sd 23:0:0:1: [sdbq] Attached SCSI disk
[   46.120044] sd 21:0:0:2: [sdbu] Attached SCSI disk
[   46.120146] sd 21:0:0:1: [sdbp] Attached SCSI disk
[   46.120234] sd 22:0:0:1: [sdbo] Attached SCSI disk
[   46.127808] sd 23:0:0:2: [sdbv] Attached SCSI disk
[   46.127810] sd 25:0:0:1: [sdbn] Attached SCSI disk
[   46.127941] sd 23:0:0:0: [sdbl] Attached SCSI disk
[   46.128571] sd 24:0:0:1: [sdbs] Attached SCSI disk
[   46.128584] sd 24:0:0:2: [sdbw] Attached SCSI disk
[   46.128703] sd 24:0:0:0: [sdbm] Attached SCSI disk
[   46.131549] sd 25:0:0:0: [sdbi] Attached SCSI disk
[   46.132250] sd 22:0:0:2: [sdbt] Attached SCSI disk
[   51.584107] device-mapper: multipath queue-length: version 0.2.0 loaded
[   54.932029] loop6: detected capacity change from 0 to 8
[   74.779368] hv_balloon: Max. dynamic memory size: 2097152 MB
[  128.762895] sysrq: Show Blocked State
[  128.827260] task:systemd-udevd   state:D stack:    0 pid: 4125 ppid:  2402 flags:0x00000000
[  128.827265] Call Trace:
[  128.827269]  __schedule+0xce6/0x13d0
[  128.827281]  ? blk_finish_plug+0x26/0x40
[  128.827288]  ? read_pages+0x1db/0x280
[  128.827295]  schedule+0x4e/0xb0
[  128.827298]  io_schedule+0x3f/0x70
[  128.827301]  wait_on_page_bit_common+0x16e/0x3b0
[  128.827308]  ? filemap_invalidate_unlock_two+0x40/0x40
[  128.827311]  put_and_wait_on_page_locked+0x4f/0x60
[  128.827313]  filemap_get_pages+0x643/0x660
[  128.827317]  filemap_read+0xbb/0x3e0
[  128.827319]  ? refill_obj_stock+0xc8/0x130
[  128.827324]  generic_file_read_iter+0xe5/0x150
[  128.827327]  blkdev_read_iter+0x38/0x70
[  128.827332]  new_sync_read+0x113/0x1a0
[  128.827341]  vfs_read+0xfe/0x1a0
[  128.827345]  ksys_read+0x67/0xe0
[  128.827347]  __x64_sys_read+0x1a/0x20
[  128.827350]  do_syscall_64+0x5c/0xc0
[  128.827356]  ? irqentry_exit_to_user_mode+0x9/0x20
[  128.827359]  ? irqentry_exit+0x19/0x30
[  128.827361]  ? exc_page_fault+0x89/0x160
[  128.827362]  ? asm_exc_page_fault+0x8/0x30
[  128.827367]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  128.827369] RIP: 0033:0x7fee5a814142
[  128.827373] RSP: 002b:00007ffee6f253d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  128.827375] RAX: ffffffffffffffda RBX: 00005644e54956a8 RCX: 00007fee5a814142
[  128.827376] RDX: 0000000000000400 RSI: 00005644e54956b8 RDI: 0000000000000006
[  128.827378] RBP: 00005644e52edec0 R08: 00005644e5495690 R09: 00007fee5a8ef200
[  128.827379] R10: 00005644e520c010 R11: 0000000000000246 R12: 0000000000000000
[  128.827380] R13: 0000000000000400 R14: 00005644e5495690 R15: 00005644e52edf10
[  128.828282] task:systemd-udevd   state:D stack:    0 pid: 5166 ppid:  2402 flags:0x00000000
[  128.828284] Call Trace:
[  128.828286]  __schedule+0xce6/0x13d0
[  128.828289]  ? blk_finish_plug+0x26/0x40
[  128.828291]  ? read_pages+0x1db/0x280
[  128.828293]  schedule+0x4e/0xb0
[  128.828296]  io_schedule+0x3f/0x70
[  128.828298]  wait_on_page_bit_common+0x16e/0x3b0
[  128.828301]  ? filemap_invalidate_unlock_two+0x40/0x40
[  128.828304]  put_and_wait_on_page_locked+0x4f/0x60
[  128.828307]  filemap_get_pages+0x643/0x660
[  128.828310]  ? __cond_resched+0x19/0x30
[  128.828313]  filemap_read+0xbb/0x3e0
[  128.828316]  ? sed_ioctl+0x7d1/0x8a0
[  128.828320]  ? cp_new_stat+0x152/0x180
[  128.828324]  generic_file_read_iter+0xe5/0x150
[  128.828326]  ? event_triggers_call+0x5d/0xe0
[  128.828332]  blkdev_read_iter+0x38/0x70
[  128.828334]  new_sync_read+0x113/0x1a0
[  128.828337]  vfs_read+0xfe/0x1a0
[  128.828340]  ksys_read+0x67/0xe0
[  128.828343]  __x64_sys_read+0x1a/0x20
[  128.828345]  do_syscall_64+0x5c/0xc0
[  128.828348]  ? irqentry_exit+0x19/0x30
[  128.828350]  ? exc_page_fault+0x89/0x160
[  128.828351]  ? asm_exc_page_fault+0x8/0x30
[  128.828353]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  128.828355] RIP: 0033:0x7fee5a814142
[  128.828357] RSP: 002b:00007ffee6f25318 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  128.828358] RAX: ffffffffffffffda RBX: 00005644e529ff68 RCX: 00007fee5a814142
[  128.828360] RDX: 0000000000000040 RSI: 00005644e529ff78 RDI: 0000000000000006
[  128.828360] RBP: 00005644e52dd960 R08: 00005644e529ff50 R09: 00005644e5459b90
[  128.828361] R10: 00005644e520c010 R11: 0000000000000246 R12: 0000001fffff0000
[  128.828362] R13: 0000000000000040 R14: 00005644e529ff50 R15: 00005644e52dd9b0
[  128.828430] task:systemd-udevd   state:D stack:    0 pid: 5178 ppid:  2402 flags:0x00000000
[  128.828433] Call Trace:
[  128.828434]  __schedule+0xce6/0x13d0
[  128.828437]  ? blk_finish_plug+0x26/0x40
[  128.828439]  ? read_pages+0x1db/0x280
[  128.828441]  schedule+0x4e/0xb0
[  128.828444]  io_schedule+0x3f/0x70
[  128.828446]  wait_on_page_bit_common+0x16e/0x3b0
[  128.828449]  ? filemap_invalidate_unlock_two+0x40/0x40
[  128.828452]  put_and_wait_on_page_locked+0x4f/0x60
[  128.828454]  filemap_get_pages+0x643/0x660
[  128.828458]  filemap_read+0xbb/0x3e0
[  128.828460]  ? page_counter_try_charge+0x34/0xb0
[  128.828466]  ? page_counter_cancel+0x2c/0x70
[  128.828468]  ? page_counter_uncharge+0x22/0x40
[  128.828470]  generic_file_read_iter+0xe5/0x150
[  128.828473]  blkdev_read_iter+0x38/0x70
[  128.828475]  new_sync_read+0x113/0x1a0
[  128.828478]  vfs_read+0xfe/0x1a0
[  128.828480]  ksys_read+0x67/0xe0
[  128.828483]  __x64_sys_read+0x1a/0x20
[  128.828485]  do_syscall_64+0x5c/0xc0
[  128.828488]  ? __audit_syscall_exit+0x233/0x290
[  128.828495]  ? exit_to_user_mode_prepare+0x3d/0x1c0
[  128.828503]  ? syscall_exit_to_user_mode+0x27/0x50
[  128.828505]  ? do_syscall_64+0x69/0xc0
[  128.828508]  ? do_syscall_64+0x69/0xc0
[  128.828510]  ? do_syscall_64+0x69/0xc0
[  128.828512]  ? syscall_exit_to_user_mode+0x27/0x50
[  128.828514]  ? do_syscall_64+0x69/0xc0
[  128.828516]  ? asm_exc_page_fault+0x8/0x30
[  128.828519]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  128.828520] RIP: 0033:0x7fee5a814142
[  128.828522] RSP: 002b:00007ffee6f25318 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[  128.828523] RAX: ffffffffffffffda RBX: 00005644e5215ba8 RCX: 00007fee5a814142
[  128.828524] RDX: 0000000000000040 RSI: 00005644e5215bb8 RDI: 0000000000000006
[  128.828525] RBP: 00005644e52ab260 R08: 00005644e5215b90 R09: 00005644e52b81e0
[  128.828526] R10: 00005644e520c010 R11: 0000000000000246 R12: 000004ffffff0000
[  128.828527] R13: 0000000000000040 R14: 00005644e5215b90 R15: 00005644e52ab2b0



* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-10  3:30 Random high CPU utilization in blk-mq with the none scheduler Dexuan Cui
@ 2021-12-11  1:29 ` Dexuan Cui
  2021-12-11  2:04   ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Dexuan Cui @ 2021-12-11  1:29 UTC (permalink / raw)
  To: Jens Axboe, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

> From: Dexuan Cui
> Sent: Thursday, December 9, 2021 7:30 PM
> 
> Hi all,
> I found a random high CPU utilization issue with some database benchmark
> program running on a 192-CPU virtual machine (VM). Originally the issue
> was found with RHEL 8.4 and Ubuntu 20.04, and further tests show that the
> issue also reproduces with the latest upstream stable kernel v5.15.7, but
> *not* with v5.16-rc1. It looks like someone resolved the issue in v5.16-rc1
> recently?

I did git-bisect on the linux-block tree's for-5.16/block branch and this patch
resolves the random high CPU utilization issue (I'm not sure how):
	dc5fc361d891 ("block: attempt direct issue of plug list")
	https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.16/block&id=dc5fc361d891e089dfd9c0a975dc78041036b906

Do you think it would be easy to backport it to earlier versions like 5.10?
It looks like there are a lot of prerequisite patches.
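
For reference, this is roughly how I looked at the backport effort (just
a sketch; the conflicts are what matters):

  # try the cherry-pick on a 5.10 base and see what conflicts
  git checkout -b backport-test v5.10
  git cherry-pick dc5fc361d891e089dfd9c0a975dc78041036b906
  # list what has changed in block/blk-mq.c (the file that commit touches) since 5.10
  git log --oneline v5.10..dc5fc361d891 -- block/blk-mq.c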

Thanks,
Dexuan


* Re: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11  1:29 ` Dexuan Cui
@ 2021-12-11  2:04   ` Jens Axboe
  2021-12-11  3:10     ` Dexuan Cui
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-12-11  2:04 UTC (permalink / raw)
  To: Dexuan Cui, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

On 12/10/21 6:29 PM, Dexuan Cui wrote:
>> From: Dexuan Cui
>> Sent: Thursday, December 9, 2021 7:30 PM
>>
>> Hi all,
>> I found a random high CPU utilization issue with some database benchmark
>> program running on a 192-CPU virtual machine (VM). Originally the issue
>> was found with RHEL 8.4 and Ubuntu 20.04, and further tests show that the
>> issue also reproduces with the latest upstream stable kernel v5.15.7, but
>> *not* with v5.16-rc1. It looks like someone resolved the issue in v5.16-rc1
>> recently?
> 
> I did git-bisect on the linux-block tree's for-5.16/block branch and this patch
> resolves the random high CPU utilization issue (I'm not sure how):
> 	dc5fc361d891 ("block: attempt direct issue of plug list")
> 	https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git/commit/?h=for-5.16/block&id=dc5fc361d891e089dfd9c0a975dc78041036b906
> 
> Do you think if it's easy to backport it to earlier versions like 5.10?
> It looks like there are a lot of prerequisite patches.

It's more likely the real fix is avoiding the repeated plug list scan,
which I guess makes sense. That is this commit:

commit d38a9c04c0d5637a828269dccb9703d42d40d42b
Author: Jens Axboe <axboe@kernel.dk>
Date:   Thu Oct 14 07:24:07 2021 -0600

    block: only check previous entry for plug merge attempt

If that's the case, try 5.15.x again and do:

echo 2 > /sys/block/<dev>/queue/nomerges

for each drive you are using in the IO test, and see if that gets
rid of the excess CPU usage.

-- 
Jens Axboe



* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11  2:04   ` Jens Axboe
@ 2021-12-11  3:10     ` Dexuan Cui
  2021-12-11  3:15       ` Jens Axboe
  2021-12-13  3:23       ` Ming Lei
  0 siblings, 2 replies; 17+ messages in thread
From: Dexuan Cui @ 2021-12-11  3:10 UTC (permalink / raw)
  To: Jens Axboe, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

> From: Jens Axboe <axboe@kernel.dk>
> Sent: Friday, December 10, 2021 6:05 PM
> ...
> It's more likely the real fix is avoiding the repeated plug list scan,
> which I guess makes sense. That is this commit:
> 
> commit d38a9c04c0d5637a828269dccb9703d42d40d42b
> Author: Jens Axboe <axboe@kernel.dk>
> Date:   Thu Oct 14 07:24:07 2021 -0600
> 
>     block: only check previous entry for plug merge attempt
> 
> If that's the case, try 5.15.x again and do:
> 
> echo 2 > /sys/block/<dev>/queue/nomerges
> 
> for each drive you are using in the IO test, and see if that gets
> rid of the excess CPU usage.
> 
> --
> Jens Axboe

Thanks for the reply! Unfortunately, this does not work.

I tried the below command:

for i in `ls /sys/block/*/queue/nomerges`; do echo 2 > $i; done

and verified that "nomerges" is changed to "2" for every disk, but the
excess CPU usage still reproduces easily.

(BTW, I got my 5.15.7 kernel from here:
https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.15.7/)

root@localhost:~# uname -r
5.15.7-051507-generic
root@localhost:~# grep  . /sys/block/*/queue/nomerges
/sys/block/dm-0/queue/nomerges:2
/sys/block/dm-1/queue/nomerges:2
/sys/block/dm-2/queue/nomerges:2
/sys/block/dm-3/queue/nomerges:2
/sys/block/loop0/queue/nomerges:2
/sys/block/loop1/queue/nomerges:2
/sys/block/loop2/queue/nomerges:2
/sys/block/loop3/queue/nomerges:2
/sys/block/loop4/queue/nomerges:2
/sys/block/loop5/queue/nomerges:2
/sys/block/loop6/queue/nomerges:2
/sys/block/loop7/queue/nomerges:2
/sys/block/sda/queue/nomerges:2
/sys/block/sdaa/queue/nomerges:2
/sys/block/sdab/queue/nomerges:2
/sys/block/sdac/queue/nomerges:2
/sys/block/sdad/queue/nomerges:2
/sys/block/sdae/queue/nomerges:2
/sys/block/sdaf/queue/nomerges:2
/sys/block/sdag/queue/nomerges:2
/sys/block/sdah/queue/nomerges:2
/sys/block/sdai/queue/nomerges:2
/sys/block/sdaj/queue/nomerges:2
/sys/block/sdak/queue/nomerges:2
/sys/block/sdal/queue/nomerges:2
/sys/block/sdam/queue/nomerges:2
/sys/block/sdan/queue/nomerges:2
/sys/block/sdao/queue/nomerges:2
/sys/block/sdap/queue/nomerges:2
/sys/block/sdaq/queue/nomerges:2
/sys/block/sdar/queue/nomerges:2
/sys/block/sdas/queue/nomerges:2
/sys/block/sdat/queue/nomerges:2
/sys/block/sdau/queue/nomerges:2
/sys/block/sdav/queue/nomerges:2
/sys/block/sdaw/queue/nomerges:2
/sys/block/sdax/queue/nomerges:2
/sys/block/sday/queue/nomerges:2
/sys/block/sdaz/queue/nomerges:2
/sys/block/sdb/queue/nomerges:2
/sys/block/sdba/queue/nomerges:2
/sys/block/sdbb/queue/nomerges:2
/sys/block/sdbc/queue/nomerges:2
/sys/block/sdbd/queue/nomerges:2
/sys/block/sdbe/queue/nomerges:2
/sys/block/sdbf/queue/nomerges:2
/sys/block/sdbg/queue/nomerges:2
/sys/block/sdbh/queue/nomerges:2
/sys/block/sdbi/queue/nomerges:2
/sys/block/sdbj/queue/nomerges:2
/sys/block/sdbk/queue/nomerges:2
/sys/block/sdbl/queue/nomerges:2
/sys/block/sdbm/queue/nomerges:2
/sys/block/sdbn/queue/nomerges:2
/sys/block/sdbo/queue/nomerges:2
/sys/block/sdbp/queue/nomerges:2
/sys/block/sdbq/queue/nomerges:2
/sys/block/sdbr/queue/nomerges:2
/sys/block/sdbs/queue/nomerges:2
/sys/block/sdbt/queue/nomerges:2
/sys/block/sdbu/queue/nomerges:2
/sys/block/sdbv/queue/nomerges:2
/sys/block/sdbw/queue/nomerges:2
/sys/block/sdc/queue/nomerges:2
/sys/block/sdd/queue/nomerges:2
/sys/block/sde/queue/nomerges:2
/sys/block/sdf/queue/nomerges:2
/sys/block/sdg/queue/nomerges:2
/sys/block/sdh/queue/nomerges:2
/sys/block/sdi/queue/nomerges:2
/sys/block/sdj/queue/nomerges:2
/sys/block/sdk/queue/nomerges:2
/sys/block/sdl/queue/nomerges:2
/sys/block/sdm/queue/nomerges:2
/sys/block/sdn/queue/nomerges:2
/sys/block/sdo/queue/nomerges:2
/sys/block/sdp/queue/nomerges:2
/sys/block/sdq/queue/nomerges:2
/sys/block/sdr/queue/nomerges:2
/sys/block/sds/queue/nomerges:2
/sys/block/sdt/queue/nomerges:2
/sys/block/sdu/queue/nomerges:2
/sys/block/sdv/queue/nomerges:2
/sys/block/sdw/queue/nomerges:2
/sys/block/sdx/queue/nomerges:2
/sys/block/sdy/queue/nomerges:2
/sys/block/sdz/queue/nomerges:2


* Re: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11  3:10     ` Dexuan Cui
@ 2021-12-11  3:15       ` Jens Axboe
  2021-12-11  3:44         ` Dexuan Cui
  2021-12-13  3:23       ` Ming Lei
  1 sibling, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-12-11  3:15 UTC (permalink / raw)
  To: Dexuan Cui, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

On 12/10/21 8:10 PM, Dexuan Cui wrote:
>> From: Jens Axboe <axboe@kernel.dk>
>> Sent: Friday, December 10, 2021 6:05 PM
>> ...
>> It's more likely the real fix is avoiding the repeated plug list scan,
>> which I guess makes sense. That is this commit:
>>
>> commit d38a9c04c0d5637a828269dccb9703d42d40d42b
>> Author: Jens Axboe <axboe@kernel.dk>
>> Date:   Thu Oct 14 07:24:07 2021 -0600
>>
>>     block: only check previous entry for plug merge attempt
>>
>> If that's the case, try 5.15.x again and do:
>>
>> echo 2 > /sys/block/<dev>/queue/nomerges
>>
>> for each drive you are using in the IO test, and see if that gets
>> rid of the excess CPU usage.
>>
>> --
>> Jens Axboe
> 
> Thanks for the reply! Unluckily this does not work.
> 
> I tried the below command:
> 
> for i in `ls /sys/block/*/queue/nomerges`; do echo 2 > $i; done
> 
> and verified that the "nomerges" are changed to "2", but the
> excess CPU usage can still reproduce easily.

Just out of curiosity, can you do:

# perf record -a -g -- sleep 3

when you see the excessive CPU usage, then attach the output of

# perf report -g

to a reply?

How confident are you in your bisect result?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11  3:15       ` Jens Axboe
@ 2021-12-11  3:44         ` Dexuan Cui
  2021-12-11  7:09           ` Dexuan Cui
  0 siblings, 1 reply; 17+ messages in thread
From: Dexuan Cui @ 2021-12-11  3:44 UTC (permalink / raw)
  To: Jens Axboe, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

> From: Jens Axboe <axboe@kernel.dk>
> 
> Just out of curiosity, can you do:
> 
> # perf record -a -g -- sleep 3
> 
> when you see the excessive CPU usage, then attach the output of
> 
> # perf report -g
> 
> to a reply?

I ran the commands against the 5.15.7 kernel and generated a
771-MB file. After I ran 'bzip2 -9', the file size is still 22 MB, which
I guess is too big to be shared via the list. I'll try to put it somewhere
and send you a link.
 
> How confident are you in your bisect result?
> 
> --
> Jens Axboe

I'm pretty confident:
1) I can't reproduce the issue with v5.16-rc4 even if I run the test 
10 times. Typically the issue repros every time.

2) If I revert the commit

dc5fc361d891 ("block: attempt direct issue of plug list")

and the related patches

ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
b22809092c70 ("block: replace always false argument with 'false'")

from v5.16-rc4, I'm able to repro the issue immediately.
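
For reference, a minimal sketch of how these reverts can be applied on top
of v5.16-rc4 (the revert order follows the list above, i.e. newest commit
first; minor conflicts may need manual resolution):

git checkout v5.16-rc4                                        # clean baseline
git revert --no-edit b22809092c70 ff1552232b36 dc5fc361d891   # newest first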

Thanks,
Dexuan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11  3:44         ` Dexuan Cui
@ 2021-12-11  7:09           ` Dexuan Cui
  2021-12-11 14:21             ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Dexuan Cui @ 2021-12-11  7:09 UTC (permalink / raw)
  To: Jens Axboe, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

[-- Attachment #1: Type: text/plain, Size: 998 bytes --]

> From: Dexuan Cui
> Sent: Friday, December 10, 2021 7:45 PM
> 
> > From: Jens Axboe <axboe@kernel.dk>
> >
> > Just out of curiosity, can you do:
> >
> > # perf record -a -g -- sleep 3
> >
> > when you see the excessive CPU usage, then attach the output of
> >
> > # perf report -g
> >
> > to a reply?

I realized you only asked for the output of "perf report -g", which
is much smaller. Please see the attachment for it. 
try_to_grab_pending() is the hottest function, e.g. see line 2479.

It's generated using v5.16-rc4 with the 3 commits reverted:
b22809092c70 ("block: replace always false argument with 'false'")
ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
dc5fc361d891 ("block: attempt direct issue of plug list")

> > How confident are you in your bisect result?
> >
> > Jens Axboe

I did some tests again and I'm pretty confident that the commit
b22809092c70 resolves the excessive CPU usage.

Thanks,
Dexuan

[-- Attachment #2: perf.data.output.log.bz2 --]
[-- Type: application/octet-stream, Size: 129840 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11  7:09           ` Dexuan Cui
@ 2021-12-11 14:21             ` Jens Axboe
  2021-12-11 18:54               ` Dexuan Cui
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-12-11 14:21 UTC (permalink / raw)
  To: Dexuan Cui, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

On 12/11/21 12:09 AM, Dexuan Cui wrote:
>> From: Dexuan Cui
>> Sent: Friday, December 10, 2021 7:45 PM
>>
>>> From: Jens Axboe <axboe@kernel.dk>
>>>
>>> Just out of curiosity, can you do:
>>>
>>> # perf record -a -g -- sleep 3
>>>
>>> when you see the excessive CPU usage, then attach the output of
>>>
>>> # perf report -g
>>>
>>> to a reply?
> 
> I realized you only asked for the output of "perf report -g", which
> is much smaller. Please see the attachment for it. 
> try_to_grab_pending() is the hottest function, e.g. see line 2479.

Sorry, can you do:

# perf report -g --no-children

instead?

-- 
Jens Axboe


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11 14:21             ` Jens Axboe
@ 2021-12-11 18:54               ` Dexuan Cui
  2021-12-13 18:43                 ` Jens Axboe
  0 siblings, 1 reply; 17+ messages in thread
From: Dexuan Cui @ 2021-12-11 18:54 UTC (permalink / raw)
  To: Jens Axboe, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

[-- Attachment #1: Type: text/plain, Size: 176 bytes --]

> From: Jens Axboe <axboe@kernel.dk>
> Sent: Saturday, December 11, 2021 6:21 AM
> 
> Sorry, can you do:
> 
> # perf report -g --no-children
> 
> instead?

Attached.

[-- Attachment #2: perf-report-g-no-children.log.bz2 --]
[-- Type: application/octet-stream, Size: 42444 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11  3:10     ` Dexuan Cui
  2021-12-11  3:15       ` Jens Axboe
@ 2021-12-13  3:23       ` Ming Lei
  2021-12-13  4:20         ` Dexuan Cui
  1 sibling, 1 reply; 17+ messages in thread
From: Ming Lei @ 2021-12-13  3:23 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Jens Axboe, 'Christoph Hellwig',
	'linux-block@vger.kernel.org',
	Long Li, Michael Kelley (LINUX),
	'linux-kernel@vger.kernel.org'

Hi Dexuan,

On Sat, Dec 11, 2021 at 03:10:43AM +0000, Dexuan Cui wrote:
> > From: Jens Axboe <axboe@kernel.dk>
> > Sent: Friday, December 10, 2021 6:05 PM
> > ...
> > It's more likely the real fix is avoiding the repeated plug list scan,
> > which I guess makes sense. That is this commit:
> > 
> > commit d38a9c04c0d5637a828269dccb9703d42d40d42b
> > Author: Jens Axboe <axboe@kernel.dk>
> > Date:   Thu Oct 14 07:24:07 2021 -0600
> > 
> >     block: only check previous entry for plug merge attempt
> > 
> > If that's the case, try 5.15.x again and do:
> > 
> > echo 2 > /sys/block/<dev>/queue/nomerges
> > 
> > for each drive you are using in the IO test, and see if that gets
> > rid of the excess CPU usage.
> > 
> > --
> > Jens Axboe
> 
> Thanks for the reply! Unluckily this does not work.
> 
> I tried the below command:
> 
> for i in `ls /sys/block/*/queue/nomerges`; do echo 2 > $i; done
> 
> and verified that the "nomerges" are changed to "2", but the
> excess CPU usage can still reproduce easily.

Can you provide the following blk-mq debugfs log?

(cd /sys/kernel/debug/block/dm-N && find . -type f -exec grep -aH . {} \;)

(cd /sys/kernel/debug/block/sdN && find . -type f -exec grep -aH . {} \;)

And it is enough to just collect log from one dm-mpath & one underlying iscsi disk,
so we can understand basic blk-mq setting, such as nr_hw_queues, queue depths, ...



Thanks,
Ming


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-13  3:23       ` Ming Lei
@ 2021-12-13  4:20         ` Dexuan Cui
  2021-12-13  7:38           ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Dexuan Cui @ 2021-12-13  4:20 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, 'Christoph Hellwig',
	'linux-block@vger.kernel.org',
	Long Li, Michael Kelley (LINUX),
	'linux-kernel@vger.kernel.org'

[-- Attachment #1: Type: text/plain, Size: 933 bytes --]

> From: Ming Lei <ming.lei@redhat.com>
>  ...
> Can you provide the following blk-mq debugfs log?
> 
> (cd /sys/kernel/debug/block/dm-N && find . -type f -exec grep -aH . {} \;)
> 
> (cd /sys/kernel/debug/block/sdN && find . -type f -exec grep -aH . {} \;)
> 
> And it is enough to just collect log from one dm-mpath & one underlying iscsi
> disk,
> so we can understand basic blk-mq setting, such as nr_hw_queues, queue
> depths, ...
> 
> Thanks,
> Ming

Attached. I collected the logs for all the dm-* and sd* devices against
v5.16-rc4 with the 3 commits reverted:
b22809092c70 ("block: replace always false argument with 'false'")
ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
dc5fc361d891 ("block: attempt direct issue of plug list")

v5.16-rc4 does not reproduce the issue, so I'm pretty sure b22809092c70 is the
patch that resolves the excessive CPU usage.

[-- Attachment #2: dm-sd.log.zip --]
[-- Type: application/x-zip-compressed, Size: 608996 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-13  4:20         ` Dexuan Cui
@ 2021-12-13  7:38           ` Ming Lei
  2021-12-14  0:31             ` Dexuan Cui
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2021-12-13  7:38 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Jens Axboe, 'Christoph Hellwig',
	'linux-block@vger.kernel.org',
	Long Li, Michael Kelley (LINUX),
	'linux-kernel@vger.kernel.org'

On Mon, Dec 13, 2021 at 04:20:49AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> >  ...
> > Can you provide the following blk-mq debugfs log?
> > 
> > (cd /sys/kernel/debug/block/dm-N && find . -type f -exec grep -aH . {} \;)
> > 
> > (cd /sys/kernel/debug/block/sdN && find . -type f -exec grep -aH . {} \;)
> > 
> > And it is enough to just collect log from one dm-mpath & one underlying iscsi
> > disk,
> > so we can understand basic blk-mq setting, such as nr_hw_queues, queue
> > depths, ...
> > 
> > Thanks,
> > Ming
> 
> Attached. I collected the logs for all the dm-* and sd* devices against
> v5.16-rc4 with the 3 commits reverted:
> b22809092c70 ("block: replace always false argument with 'false'")
> ff1552232b36 ("blk-mq: don't issue request directly in case that current is to be blocked")
> dc5fc361d891 ("block: attempt direct issue of plug list")
> 
> v5.16-rc4 does not reproduce the issue, so I'm pretty sure b22809092c70 is the
> patch that resolves the excessive CPU usage.

From the log:

1) dm-mpath:
- queue depth: 2048
- busy: 848, and 62 of them are in the sw queue, so the queue is re-run
  often
- nr_hw_queues: 1
- dm-2 is in use, and dm-1/dm-3 are idle
- dm-2's dispatch_busy is 8, which should be the reason why excessive CPU
usage is observed when flushing the plug list without commit dc5fc361d891,
in which hctx->dispatch_busy is simply bypassed

2) iscsi
- dispatch_busy is 0
- nr_hw_queues: 1
- queue depth: 113
- busy=~33, active_queues is 3, so each LUN/iscsi host is saturated
- 23 active LUNs, 23 * 33 = 759 in-flight commands
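
These figures are from the blk-mq debugfs files in the attached log; as a
minimal sketch, the same fields can be read on a live system like this (the
dm-2 and sdc device names are assumptions, not taken from the log):

# 'tags' shows nr_tags and active_queues; 'dispatch_busy' is the per-hctx busy estimate
(cd /sys/kernel/debug/block/dm-2 && grep -aH . hctx0/dispatch_busy hctx0/tags)
(cd /sys/kernel/debug/block/sdc && grep -aH . hctx0/dispatch_busy hctx0/tags)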

The high CPU utilization may be caused by:

1) the big queue depth of dm-mpath; the situation may be improved a lot if it
is reduced to 1024 or 800. The max allowed in-flight commands from the iscsi
hosts can be figured out; if dm's queue depth is much more than this number,
the extra commands still need to be dispatched, and the queue run can be
rescheduled immediately, so high CPU utilization is caused.

2) the single hw queue, so contention should be big, which should be avoided
on a big machine; nvme-tcp might be better than iscsi here

3) iscsi io latency is a bit big

Even if CPU utilization is reduced by commit dc5fc361d891, I/O performance
still can't be good with v5.16-rc, I guess.


Thanks,
Ming


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-11 18:54               ` Dexuan Cui
@ 2021-12-13 18:43                 ` Jens Axboe
  2021-12-14  0:43                   ` Dexuan Cui
  0 siblings, 1 reply; 17+ messages in thread
From: Jens Axboe @ 2021-12-13 18:43 UTC (permalink / raw)
  To: Dexuan Cui, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

On 12/11/21 11:54 AM, Dexuan Cui wrote:
>> From: Jens Axboe <axboe@kernel.dk>
>> Sent: Saturday, December 11, 2021 6:21 AM
>>
>> Sorry, can you do:
>>
>> # perf report -g --no-children
>>
>> instead?
> 
> Attached.

I wonder if this will help, SCSI artifact making us hit the
mod delayed work path all the time. Might be a race in there,
but should be fine for testing.


diff --git a/block/blk-core.c b/block/blk-core.c
index 4d8f5fe91588..cb2f4d604bad 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1625,7 +1625,9 @@ EXPORT_SYMBOL(kblockd_schedule_work);
 int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
 				unsigned long delay)
 {
-	return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
+	if (!work_pending(&dwork->work))
+		return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
+	return true;
 }
 EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
 

-- 
Jens Axboe


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-13  7:38           ` Ming Lei
@ 2021-12-14  0:31             ` Dexuan Cui
  2021-12-14  0:53               ` Ming Lei
  0 siblings, 1 reply; 17+ messages in thread
From: Dexuan Cui @ 2021-12-14  0:31 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, 'Christoph Hellwig',
	'linux-block@vger.kernel.org',
	Long Li, Michael Kelley (LINUX),
	'linux-kernel@vger.kernel.org'

> From: Ming Lei <ming.lei@redhat.com>
> Sent: Sunday, December 12, 2021 11:38 PM

Ming, thanks so much for the detailed analysis!

> From the log:
> 
> 1) dm-mpath:
> - queue depth: 2048
> - busy: 848, and 62 of them are in the sw queue, so the queue is re-run
>   often
> - nr_hw_queues: 1
> - dm-2 is in use, and dm-1/dm-3 are idle
> - dm-2's dispatch_busy is 8, which should be the reason why excessive CPU
> usage is observed when flushing the plug list without commit dc5fc361d891,
> in which hctx->dispatch_busy is simply bypassed
> 
> 2) iscsi
> - dispatch_busy is 0
> - nr_hw_queues: 1
> - queue depth: 113
> - busy=~33, active_queues is 3, so each LUN/iscsi host is saturated
> - 23 active LUNs, 23 * 33 = 759 in-flight commands
> 
> The high CPU utilization may be caused by:
> 
> 1) the big queue depth of dm-mpath; the situation may be improved a lot if it
> is reduced to 1024 or 800. The max allowed in-flight commands from the iscsi
> hosts can be figured out; if dm's queue depth is much more than this number,
> the extra commands still need to be dispatched, and the queue run can be
> rescheduled immediately, so high CPU utilization is caused.

I think you're correct:
with dm_mod.dm_mq_queue_depth=256, the max CPU utilization is 8%.
with dm_mod.dm_mq_queue_depth=400, the max CPU utilization is 12%. 
with dm_mod.dm_mq_queue_depth=800, the max CPU utilization is 88%.

The performance with queue_depth=800 is poor.
The performance with queue_depth=400 is good.
The performance with queue_depth=256 is also good, and there is only a 
small drop compared with the 400 case.
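
(For reference, a minimal sketch of how this override can be set; the
boot-time form matches the dm_mod.dm_mq_queue_depth=... syntax above, and
the modprobe.d form is an assumption for setups where dm_mod is built as a
module:)

# built-in dm_mod: append to the kernel command line
#     dm_mod.dm_mq_queue_depth=256
# modular dm_mod (assumed): persist the option, then reload the module
echo "options dm_mod dm_mq_queue_depth=256" > /etc/modprobe.d/dm-mq-depth.conf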

> 2) the single hw queue, so contention should be big, which should be avoided
> on a big machine; nvme-tcp might be better than iscsi here
> 
> 3) iscsi io latency is a bit big
> 
> Even if CPU utilization is reduced by commit dc5fc361d891, I/O performance
> still can't be good with v5.16-rc, I guess.
> 
> Thanks,
> Ming

Actually the I/O performance of v5.16-rc4 (which includes commit dc5fc361d891)
is good -- it's about the same as the case of v5.16-rc4 + reverting
dc5fc361d891 + dm_mod.dm_mq_queue_depth=400 (or 256).

Thanks,
Dexuan


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-13 18:43                 ` Jens Axboe
@ 2021-12-14  0:43                   ` Dexuan Cui
  0 siblings, 0 replies; 17+ messages in thread
From: Dexuan Cui @ 2021-12-14  0:43 UTC (permalink / raw)
  To: Jens Axboe, 'ming.lei@redhat.com',
	'Christoph Hellwig',
	'linux-block@vger.kernel.org'
  Cc: Long Li, Michael Kelley (LINUX), 'linux-kernel@vger.kernel.org'

> From: Jens Axboe <axboe@kernel.dk>
> Sent: Monday, December 13, 2021 10:43 AM
> 
> I wonder if this will help, SCSI artifact making us hit the
> mod delayed work path all the time. 

Yes, I think this is what's happening here.

> Might be a race in there, but should be fine for testing.
> 
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 4d8f5fe91588..cb2f4d604bad 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1625,7 +1625,9 @@ EXPORT_SYMBOL(kblockd_schedule_work);
>  int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork,
>  				unsigned long delay)
>  {
> -	return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay);
> +	if (!work_pending(&dwork->work))
> +		return mod_delayed_work_on(cpu, kblockd_workqueue, dwork,
> delay);
> +	return true;
>  }
>  EXPORT_SYMBOL(kblockd_mod_delayed_work_on);
> --
> Jens Axboe

Thanks very much! This patch works!

In the test, "v5.16-rc4 + reverting dc5fc361d891 
("block: attempt direct issue of plug list") + this patch" shows a good I/O
performance that's about the same as v5.16-rc4.

I suppose "v5.16-rc4 + this patch" would show even slightly better
performance.

It would be great if you can commit this small patch so that it can be
backported to old stable kernels.

Thanks,
Dexuan

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-14  0:31             ` Dexuan Cui
@ 2021-12-14  0:53               ` Ming Lei
  2021-12-14  3:09                 ` Dexuan Cui
  0 siblings, 1 reply; 17+ messages in thread
From: Ming Lei @ 2021-12-14  0:53 UTC (permalink / raw)
  To: Dexuan Cui
  Cc: Jens Axboe, 'Christoph Hellwig',
	'linux-block@vger.kernel.org',
	Long Li, Michael Kelley (LINUX),
	'linux-kernel@vger.kernel.org',
	Mike Snitzer, dm-devel

On Tue, Dec 14, 2021 at 12:31:23AM +0000, Dexuan Cui wrote:
> > From: Ming Lei <ming.lei@redhat.com>
> > Sent: Sunday, December 12, 2021 11:38 PM
> 
> Ming, thanks so much for the detailed analysis!
> 
> > From the log:
> > 
> > 1) dm-mpath:
> > - queue depth: 2048
> > - busy: 848, and 62 of them are in the sw queue, so the queue is re-run
> >   often
> > - nr_hw_queues: 1
> > - dm-2 is in use, and dm-1/dm-3 are idle
> > - dm-2's dispatch_busy is 8, which should be the reason why excessive CPU
> > usage is observed when flushing the plug list without commit dc5fc361d891,
> > in which hctx->dispatch_busy is simply bypassed
> > 
> > 2) iscsi
> > - dispatch_busy is 0
> > - nr_hw_queues: 1
> > - queue depth: 113
> > - busy=~33, active_queues is 3, so each LUN/iscsi host is saturated
> > - 23 active LUNs, 23 * 33 = 759 in-flight commands
> > 
> > The high CPU utilization may be caused by:
> > 
> > 1) the big queue depth of dm-mpath; the situation may be improved a lot if it
> > is reduced to 1024 or 800. The max allowed in-flight commands from the iscsi
> > hosts can be figured out; if dm's queue depth is much more than this number,
> > the extra commands still need to be dispatched, and the queue run can be
> > rescheduled immediately, so high CPU utilization is caused.
> 
> I think you're correct:
> with dm_mod.dm_mq_queue_depth=256, the max CPU utilization is 8%.
> with dm_mod.dm_mq_queue_depth=400, the max CPU utilization is 12%. 
> with dm_mod.dm_mq_queue_depth=800, the max CPU utilization is 88%.
> 
> The performance with queue_depth=800 is poor.
> The performance with queue_depth=400 is good.
> The performance with queue_depth=256 is also good, and there is only a 
> small drop compared with the 400 case.

That should be the reason why the issue isn't triggered when a real I/O
scheduler is used.

So far blk-mq doesn't provide a way to adjust the tags queue depth
dynamically.

But I don't understand the reason for the default dm_mq_queue_depth (2048):
in this situation, each LUN can queue at most 113/3 requests, and 3 LUNs are
attached to a single iscsi host.

Mike, can you share why the default dm_mq_queue_depth is so big? It seems it
doesn't take the underlying queue's depth into account. What is the biggest
dm rq queue depth that is needed to saturate all underlying paths?

> 
> > 2) the single hw queue, so contention should be big, which should be avoided
> > on a big machine; nvme-tcp might be better than iscsi here
> > 
> > 3) iscsi io latency is a bit big
> > 
> > Even if CPU utilization is reduced by commit dc5fc361d891, I/O performance
> > still can't be good with v5.16-rc, I guess.
> > 
> > Thanks,
> > Ming
> 
> Actually the I/O performance of v5.16-rc4 (which includes commit dc5fc361d891)
> is good -- it's about the same as the case of v5.16-rc4 + reverting
> dc5fc361d891 + dm_mod.dm_mq_queue_depth=400 (or 256).

The single hw queue may be the root cause of your issue: there is only a
single run_work, which can be touched by almost all of the ~200 CPUs, so
cache ping-pong could be very serious.
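
A minimal sketch of how the single-queue layout can be confirmed from sysfs
(the sdc device name is an assumption):

ls /sys/block/sdc/mq/              # one directory per hw queue; only "0" here
cat /sys/block/sdc/mq/0/cpu_list   # CPUs mapped to that single hw queue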

Jens's patch may improve it more or less; please test it.

Thanks,
Ming


^ permalink raw reply	[flat|nested] 17+ messages in thread

* RE: Random high CPU utilization in blk-mq with the none scheduler
  2021-12-14  0:53               ` Ming Lei
@ 2021-12-14  3:09                 ` Dexuan Cui
  0 siblings, 0 replies; 17+ messages in thread
From: Dexuan Cui @ 2021-12-14  3:09 UTC (permalink / raw)
  To: Ming Lei
  Cc: Jens Axboe, 'Christoph Hellwig',
	'linux-block@vger.kernel.org',
	Long Li, Michael Kelley (LINUX),
	'linux-kernel@vger.kernel.org',
	Mike Snitzer, dm-devel

> From: Ming Lei <ming.lei@redhat.com>
> Sent: Monday, December 13, 2021 4:53 PM
>  ...
> The single hw queue may be the root cause of your issue, and there
> is only single run_work, which can be touched by all CPUs(~200) almost, so
> cache
> ping-pong could be very serious.
> 
> Jens patch may improve it more or less, please test it.
> 
> Thanks,
> Ming

I replied to Jens's email with the test result before your email. :-)

Jens's patch does help in the absence of dc5fc361d891 
("block: attempt direct issue of plug list").

BTW, the throughput in the case of dm_mod.dm_mq_queue_depth=256
is slightly better than the case of dm_mod.dm_mq_queue_depth=2048 with
Jens's patch or dc5fc361d891.

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2021-12-14  3:09 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-10  3:30 Random high CPU utilization in blk-mq with the none scheduler Dexuan Cui
2021-12-11  1:29 ` Dexuan Cui
2021-12-11  2:04   ` Jens Axboe
2021-12-11  3:10     ` Dexuan Cui
2021-12-11  3:15       ` Jens Axboe
2021-12-11  3:44         ` Dexuan Cui
2021-12-11  7:09           ` Dexuan Cui
2021-12-11 14:21             ` Jens Axboe
2021-12-11 18:54               ` Dexuan Cui
2021-12-13 18:43                 ` Jens Axboe
2021-12-14  0:43                   ` Dexuan Cui
2021-12-13  3:23       ` Ming Lei
2021-12-13  4:20         ` Dexuan Cui
2021-12-13  7:38           ` Ming Lei
2021-12-14  0:31             ` Dexuan Cui
2021-12-14  0:53               ` Ming Lei
2021-12-14  3:09                 ` Dexuan Cui

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).