* Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
@ 2020-07-25 14:24 Hans van Kranenburg
  2020-07-25 15:37 ` Holger Hoffstätte
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-25 14:24 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4302 bytes --]

Hi,

I have a filesystem here that I'm filling up with data from elsewhere.
Most of it is done by rsync, and part by send/receive. So, receiving
data over the network, and then writing the files to disk. There can be
a dozen of these processes running in parallel.

Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
varying X) often is using nearly 100% cpu, while enormously slowing down
disk writes. This shows as disk IO wait for the rsync and btrfs receive
processes.

The underlying storage (iSCSI connected over 10Gb/s network) can easily
eat a few hundred MiB/s. When looking at actual disk busyness on the
storage device, percentages <5% utilization are reported for the actual
disks.

It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.

I just did a 'perf record -g -a sleep 60' while disk writes were down to
under 1MiB (!) per second and then 'perf report'. Attached some 'screen
shot' of it. Also attached an example of what nmon shows to give an idea
about the situation.

If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
see network and disk write speed ramping up, easily over 200 MiB/s,
until it soon plummets again.

I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
is booted now).

So, what I'm looking for is:
* Does anyone else see this happen when doing a lot of concurrent writes?
* What does this flush thing do?
* Why is it using 100% cpu all the time?
* How can I debug this more?
* Ultimately, of course... how can we improve this?

I can recompile the kernel image to e.g. put more trace points in, in
different places.

I just have no idea where to start.

Thanks,
Hans

P.S. /proc/<pid>/stack for the kworker/u16:X+flush-btrfs-2 mostly shows
nothing at all, and sometimes when it does show some output, it mostly
looks like this:

----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] submit_extent_page+0x102/0x210 [btrfs]
[<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
[<0>] __extent_writepage+0x101/0x300 [btrfs]
[<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[<0>] extent_writepages+0x44/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40

----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] extent_writepages+0x5d/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40

----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] submit_extent_page+0x102/0x210 [btrfs]
[<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
[<0>] __extent_writepage+0x101/0x300 [btrfs]
[<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[<0>] extent_writepages+0x44/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40

[-- Attachment #2: 2020-07-25-perf-report.txt --]
[-- Type: text/plain, Size: 6268 bytes --]

Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 311991500000
  Children      Self  Command          Shared Object         Symbol
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] secondary_startup_64
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] cpu_startup_entry
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] do_idle
+   80.52%     0.00%  swapper          [kernel.kallsyms]     [k] default_idle
+   80.52%    80.51%  swapper          [kernel.kallsyms]     [k] native_safe_halt
+   63.42%     0.00%  swapper          [kernel.kallsyms]     [k] start_secondary
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] ret_from_fork
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] kthread
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] worker_thread
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] process_one_work
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] wb_workfn
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] wb_writeback
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] __writeback_inodes_wb
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] writeback_sb_inodes
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] __writeback_single_inode
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] do_writepages
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] extent_writepages
+   19.08%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] extent_write_cache_pages
+   19.07%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] __extent_writepage
+   19.04%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] writepage_delalloc
+   19.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_run_delalloc_range
+   19.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] cow_file_range
+   19.01%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_reserve_extent
+   19.01%     4.26%  kworker/u16:11+  [kernel.kallsyms]     [k] find_free_extent
+   17.11%     0.00%  swapper          [kernel.kallsyms]     [k] start_kernel
+    7.18%     7.18%  kworker/u16:11+  [kernel.kallsyms]     [k] _raw_spin_lock
+    4.45%     4.45%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_get_block_group
+    1.49%     1.34%  kworker/u16:11+  [kernel.kallsyms]     [k] down_read
+    0.56%     0.56%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_put_block_group
     0.50%     0.50%  kworker/u16:11+  [kernel.kallsyms]     [k] __raw_callee_save___pv_queued_spin_unlock
     0.44%     0.44%  kworker/u16:11+  [kernel.kallsyms]     [k] up_read
     0.28%     0.13%  kworker/u16:11+  [kernel.kallsyms]     [k] _cond_resched
     0.15%     0.15%  kworker/u16:11+  [kernel.kallsyms]     [k] rcu_all_qs
     0.05%     0.00%  nmon             [kernel.kallsyms]     [k] entry_SYSCALL_64_after_hwframe
     0.05%     0.00%  nmon             [kernel.kallsyms]     [k] do_syscall_64
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] entry_SYSCALL_64_after_hwframe
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] do_syscall_64
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] ret_from_fork
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] kthread
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] transaction_kthread
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] btrfs_commit_transaction
     0.04%     0.00%  perf_5.7         libpthread-2.28.so    [.] __libc_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] ksys_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] vfs_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] new_sync_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] btrfs_file_write_iter
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] btrfs_buffered_write.isra.28
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] ret_from_fork
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] kthread
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] kswapd
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] balance_pgdat
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] shrink_node
     0.03%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] btrfs_run_delayed_refs
     0.03%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] __btrfs_run_delayed_refs
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] __extent_writepage_io
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] submit_one_bio
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_submit_bio_hook
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_map_bio
     0.02%     0.00%  kswapd0          [kernel.kallsyms]     [k] shrink_slab
     0.02%     0.00%  kswapd0          [kernel.kallsyms]     [k] do_shrink_slab
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] ret_from_fork
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] kthread
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] worker_thread
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] process_one_work
     0.02%     0.00%  kswapd0          [kernel.kallsyms]     [k] super_cache_scan
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] btrfs_work_helper
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] ret_from_fork
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] kthread
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] worker_thread
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] process_one_work
     0.02%     0.00%  perf_5.7         [unknown]             [k] 0xffffffff9d9bc42e
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] btrfs_work_helper
     0.02%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] alloc_reserved_file_extent
     0.02%     0.00%  nmon             libc-2.28.so          [.] open64
     0.02%     0.00%  nmon             [kernel.kallsyms]     [k] do_sys_open

[-- Attachment #3: 2020-07-25-nmon.txt --]
[-- Type: text/plain, Size: 5851 bytes --]

---- >8 ---- CPU Utilisation ---- >8 ----

---------------------------+-------------------------------------------------+
CPU User%  Sys% Wait% Steal|0          |25         |50          |75       100|
  1   5.2   3.1  90.2   1.6|UUsWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW >
  2   0.0   0.0  99.5   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  3   0.5   0.0  99.5   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  4   0.0   0.0 100.0   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  5   0.0 100.0   0.0   0.0|sssssssssssssssssssssssssssssssssssssssssssssssss>
  6   0.0   0.0 100.0   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  7   1.0   0.5  98.0   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW >
  8   1.0   0.0  98.5   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
---------------------------+-------------------------------------------------+
Avg   0.9  13.0  85.7   0.3|ssssssWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>|
---------------------------+-------------------------------------------------+

---- >8 ---- Memory and Swap ---- >8 ----

 PageSize:4KB   RAM-Memory  Swap-Space       High-Memory     Low-Memory
 Total (MB)        16015.3         0.0       - not in use   - not in use
 Free  (MB)         1471.1         0.0
 Free Percent          9.2%        0.0%
 Linux Kernel Internal Memory (MB)
                        Cached=   12582.6     Active=    3756.1
 Buffers=       0.2 Swapcached=       0.0  Inactive =    8994.9
 Dirty  =    2387.0 Writeback =       0.5  Mapped   =      27.3
 Slab   =    1653.0 Commit_AS =     321.2 PageTables=       4.4

---- >8 ---- Kernel and Load Average ---- >8 ----

Global-CPU-Stats---->     7.5% user      Load Average  CPU use since boottime
 /proc/stat line 1        0.0% user_nice  1 mins 14.02 Uptime Days Hours Mins
100 ticks per second    101.1% system     5 mins 14.05 Uptime   20    15   19
100%=1 CPUcoreThread      0.0% idle      15 mins 13.33 Idle     15   335 17633
       3   RunQueue     677.8% iowait                  Uptime has overflowed
      13   Blocked        0.0% irq
    2338.1 Context        2.0% softirq                 8 CPU core threads
           Switch         2.5% steal
       0.0 Forks          0.0% guest                   Boot time 1593903119
    3174.7 Interrupts     0.0% guest_nice              12:51 AM 05-Jul-2020

---- >8 ---- Network I/O ---- >8 ----

I/F Name Recv=KB/s Trans=KB/s packin packout insize outsize Peak->Recv Trans
      lo      0.0       0.0       0.0    0.0     0.0    0.0        0.0      0.0
    eth0  12250.1      88.6     968.6  999.0 12950.7   90.8   388740.8   2413.5

---- >8 ---- Disk I/O -- /proc/diskstats -- mostly in KB/s ---- >8 ----

DiskName Busy  Read WriteKB|0          |25         |50          |75       100|
xvda       0%    0.0    0.0|                                                 >
xvdb       0%   31.9    0.0|R                                                >
xvdc      23%    0.0 1171.3|WWWWWWWWWWWW                                     >
xvdd       0%    0.0    0.0|                                                 >
xvde      39%    0.0  970.1|WWWWWWWWWWWWWWWWWWWW                             >
Totals Read-MB/s=0.0      Writes-MB/s=2.1      Transfers/sec=21.9

---- >8 ---- Top Processes Procs=260-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Args ---- >8 ----

PID        %CPU      Size       Res      Res       Res       Res      Shared   Faults   Faults Command
              Used        KB       Set      Text      Data       Lib        KB      Min      Maj
     7909     99.6         0         0         0         0         0         0        0        0 kworker/u16:2+flush-btrfs-2
    22780      1.5     19780     10548       400      4588         0      5768        0        0 ssh
    21840      1.0     19772     10608       400      4580         0      5836        0        0 ssh
    22638      1.0     19776     10616       400      4584         0      5844        0        0 ssh
    23115      1.0     19772     10592       400      4580         0      5824        0        0 ssh
    23460      1.0     19816     10644       400      4624         0      5828        0        0 ssh
    23571      1.0     19780     10564       400      4588         0      5784        0        0 ssh
    23759      1.0     19768     10472       400      4576         0      5708        0        0 ssh
    23770      1.0     19740     10588       400      4548         0      5848        0        0 ssh
    30708      1.0     19996     10752       400      4804         0      5756        0        0 ssh
    30710      1.0      8544      1124       396       392         0       960        0        0 btrfs
      993      0.5      8672      3160       104      4652         0       440        0        0 nmon
    10290      0.5     19844     10688       400      4652         0      5848        0        0 ssh
    18186      0.5         0         0         0         0         0         0        0        0 kworker/u16:6-btrfs-endio-write
    18616      0.5      8544      1204       396       392         0      1036        0        0 btrfs
    18659      0.5      5420       788        56       440         0       724        0        0 pv
    18660      0.5      8544      1196       396       392         0      1032        0        0 btrfs
    21842      0.5      7896      1940       328      2300         0      1344        0        0 rsync
    22807      0.5      7896      1984       328      2300         0      1384        0        0 rsync
    23461      0.5      7896      1928       328      2300         0      1332        0        0 rsync
    23467      0.5     19724     10580       400      4532         0      5856        0        0 ssh
    23468      0.5      8476      3084       328      2880         0      1272        0        0 rsync

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
@ 2020-07-25 15:37 ` Holger Hoffstätte
  2020-07-25 16:43   ` Hans van Kranenburg
  2020-07-25 21:27 ` Hans van Kranenburg
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Holger Hoffstätte @ 2020-07-25 15:37 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

On 2020-07-25 16:24, Hans van Kranenburg wrote:
> Hi,
> 
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.
> 
> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.

<snip>

I cannot speak to anything btrfs-specific (other than the usual write
storms), however..

> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30

..this means that you have CONFIG_BLK_WBT{_MQ} enabled and are using
an IO scheduler that observes writeback throttling. AFAIK all MQ-capable
schedulers (also SQ ones in 4.19 IIRC!) do so except for BFQ, which has
its own mechanism to regulate fairness vs. latency and explicitly turns
WBT off.

WBT aka 'writeback throttling' throttles background writes according to
latency/throughput of the underlying block device in favor of readers.
It is meant to protect interactive/low-latency/desktop apps from heavy
bursts of background writeback activity. I tested early versions and
provided feedback to Jens Axboe; it really is helpful when it works,
but obviously cannot cater to every situation. There have been reports
that it is unhelpful for write-only/heavy workloads and may lead to
queueing pileup.

You can tune the expected latency of device writes via:
/sys/block/sda/queue/wbt_lat_usec.

You might also check whether your vm.dirty_{background}_bytes and
vm.dirty_expire_centisecs are too high; distro defaults almost always
are. Lowering them leads to more evenly spaced out write traffic.
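
For example, roughly like this (device name and values purely
illustrative, pick ones that match your setup):

  # raise the latency target so writeback is throttled less aggressively
  echo 75000 > /sys/block/sda/queue/wbt_lat_usec
  # cap dirty memory with absolute byte values instead of ratios
  sysctl -w vm.dirty_background_bytes=$((256*1024*1024))
  sysctl -w vm.dirty_bytes=$((1*1024*1024*1024))
  sysctl -w vm.dirty_expire_centisecs=1500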

Without knowing more it's difficult to say exactly what is going on,
but if your underlying storage has latency spikes it might be that
you are very likely looking at queueing pileup caused by multiple WBTs
choking each other. Having other unrelated queueing & throttling
mechanisms (in your case the network) in the mix is unlikely to help.
I'm not going to comment on iSCSI in general.. :^)

OTOH I also have 10G networking here and no such problems, even when
pushing large amounts of data over NFS at ~750 MB/s - and I have WBT
enabled everywhere.

So maybe start small and either ramp up the wbt latency sysctl or
decrease dirty_background bytes to start flushing sooner, depending
on how it's set. As last resort you can rebuild your kernels with
CONFIG_BLK_WBT/CONFIG_BLK_WBT_MQ disabled.

Hope this helps!

-h

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 15:37 ` Holger Hoffstätte
@ 2020-07-25 16:43   ` Hans van Kranenburg
  2020-07-25 19:44     ` Holger Hoffstätte
  0 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-25 16:43 UTC (permalink / raw)
  To: Holger Hoffstätte, linux-btrfs

On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
> On 2020-07-25 16:24, Hans van Kranenburg wrote:
>> Hi,
>>
>> I have a filesystem here that I'm filling up with data from elsewhere.
>> Most of it is done by rsync, and part by send/receive. So, receiving
>> data over the network, and then writing the files to disk. There can be
>> a dozen of these processes running in parallel.
>>
>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>> varying X) often is using nearly 100% cpu, while enormously slowing down
>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>> processes.
> 
> <snip>
> 
> I cannot speak to anything btrfs-specific (other than the usual write
> storms), however..
> 
>> [<0>] rq_qos_wait+0xfa/0x170
>> [<0>] wbt_wait+0x98/0xe0
>> [<0>] __rq_qos_throttle+0x23/0x30

I need to cat /proc/<pid>/stack a huge number of times in a loop to once
in a while get this sort of output shown.
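
The loop is nothing fancy, by the way, just something along these lines,
with the pid of the kworker thread taken from top/nmon:

  pid=7909   # the kworker/u16:X+flush-btrfs-2 thread at that moment
  while sleep 0.1; do cat /proc/$pid/stack; echo ----; done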

> ..this means that you have CONFIG_BLK_WBT{_MQ} enabled and are using
> an IO scheduler that observes writeback throttling. AFAIK all MQ-capable
> schedulers (also SQ ones in 4.19 IIRC!) do so except for BFQ, which has
> its own mechanism to regulate fairness vs. latency and explicitly turns
> WBT off.
> 
> WBT aka 'writeback throttling' throttles background writes according to
> latency/throughput of the underlying block device in favor of readers.
> It is meant to protect interactive/low-latency/desktop apps from heavy
> bursts of background writeback activity. I tested early versions and
> provided feedback to Jens Axboe; it really is helpful when it works,
> but obviously cannot cater to every situation. There have been reports
> that it is unhelpful for write-only/heavy workloads and may lead to
> queueing pileup.
> 
> You can tune the expected latency of device writes via:
> /sys/block/sda/queue/wbt_lat_usec.

Yes, I have been playing around with it earlier, without any effect on
the symptoms.

I just did this again, echo 0 > all of the involved block devices. When
looking at the events/wbt trace point, I see that wbt activity stops at
that moment.

No difference in symptoms.
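
Concretely, that was something like this (xvd* device names as in the
nmon output above; adjust as needed):

  for f in /sys/block/xvd?/queue/wbt_lat_usec; do echo 0 > $f; done
  echo 1 > /sys/kernel/debug/tracing/events/wbt/enable
  cat /sys/kernel/debug/tracing/trace_pipe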

> You might also check whether your vm.dirty_{background}_bytes and
> vm.dirty_expire_centisecs are too high; distro defaults almost always
> are. Lowering them leads to more evenly spaced out write traffic.

Dirty buffers were ~ 2G in size. I can modify the numbers to make it
bigger or smaller. There's absolutely no change in behavior of the system.

> Without knowing more it's difficult to say exactly what is going on,
> but if your underlying storage has latency spikes

It doesn't. It's idle, waiting to finally get some data sent to it.

> it might be that
> you are very likely looking at queueing pileup caused by multiple WBTs
> choking each other. Having other unrelated queueing & throttling
> mechanisms (in your case the network) in the mix is unlikely to help.
> I'm not going to comment on iSCSI in general.. :^)
> 
> OTOH I also have 10G networking here and no such problems, even when
> pushing large amounts of data over NFS at ~750 MB/s - and I have WBT
> enabled everywhere.
> 
> So maybe start small and either ramp up the wbt latency sysctl or
> decrease dirty_background bytes to start flushing sooner, depending
> on how it's set. As last resort you can rebuild your kernels with
> CONFIG_BLK_WBT/CONFIG_BLK_WBT_MQ disabled.

All processing speed is inversely proportional to the cpu usage of this
kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
everything slows down. The network is idle, the disks are idle. Incoming
rsync speed drops, the speed in which btrfs receive is reading input
drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
below ~ 95% again, throughput goes up.

I do not see how writeback problems would result in having a
kworker/u16:X+flush-btrfs-2 do 100% cpu. I think it's the other way
round, and that's why I want to know what this thread is actually busy
with doing instead of shoveling the data towards the disks.

As far as I can see, all the iowait reported by rsync and btrfs receive
is because they want to give their writes to the btrfs code in the
kernel, but this kworker/u16:X+flush-btrfs-2 is in their way, so they
are blocked. Even before anything gets queued anywhere. Or doesn't that
make sense?

So, the problem is located *before* all the things you mention above
even come into play yet.

Hans

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 16:43   ` Hans van Kranenburg
@ 2020-07-25 19:44     ` Holger Hoffstätte
  2020-07-25 21:03       ` Hans van Kranenburg
  0 siblings, 1 reply; 17+ messages in thread
From: Holger Hoffstätte @ 2020-07-25 19:44 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

On 2020-07-25 18:43, Hans van Kranenburg wrote:
> On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
>>> [<0>] rq_qos_wait+0xfa/0x170
>>> [<0>] wbt_wait+0x98/0xe0
>>> [<0>] __rq_qos_throttle+0x23/0x30
> 
> I need to cat /proc/<pid>/stack a huge number of times in a loop to once
> in a while get this sort of output shown.

Ahh! So that means you are just getting lucky and catching the occasional
throttle in action. Ok.

>> You can tune the expected latency of device writes via:
>> /sys/block/sda/queue/wbt_lat_usec.
> 
> Yes, I have been playing around with it earlier, without any effect on
> the symptoms.
> 
> I just did this again, echo 0 > all of the involved block devices. When
> looking at the events/wbt trace point, I see that wbt activity stops at
> that moment.
> 
> No difference in symptoms.

Ok. Worth a shot..but in retrospect not really, since I just realized
that I had your entire scenario backwards. Nevermind then.

> Dirty buffers were ~ 2G in size. I can modify the numbers to make it
> bigger or smaller. There's absolutely no change in behavior of the system.

Ok. I also have 2GB max-dirty and 1GB writeback-dirty, which is
plenty for a 10G pipe.

> It doesn't. It's idle, waiting to finally get some data sent to it.

Yup, looks like it.

> All processing speed is inversely proportional to the cpu usage of this
> kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
> everything slows down. The network is idle, the disks are idle. Incoming
> rsync speed drops, the speed in which btrfs receive is reading input
> drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
> below ~ 95% again, throughput goes up.

Only thing I can think of is that your rsync runs create insane amounts
of small COW extents that need to be ordered/merged. Multiply by number of
processes and you're probably hitting some super contended part. Since
the kworker isn't stuck forever but apparently makes progress it's not
dead, just slow/overloaded.

A few years ago I started using rsync exclusively with --whole-file since
it's not just much faster but also creates significantly less fragmentation
at the expense of some disk space, but whatever...in my case snapshot rotation
takes care of that. Maybe it's an option for you.
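
I.e. nothing more exotic than adding the plain option, something like
(paths obviously illustrative):

  rsync -a --whole-file /source/ /destination/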

So maybe things to try:

- run rsync with --whole-file
- run fewer rsyncs in parallel (might not be necessary with --whole-file!)

If that doesn't help someone else needs to chime in..

-h

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 19:44     ` Holger Hoffstätte
@ 2020-07-25 21:03       ` Hans van Kranenburg
  2020-07-26  1:00         ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-25 21:03 UTC (permalink / raw)
  To: Holger Hoffstätte, linux-btrfs

On 7/25/20 9:44 PM, Holger Hoffstätte wrote:
> On 2020-07-25 18:43, Hans van Kranenburg wrote:
>> On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
>>>> [<0>] rq_qos_wait+0xfa/0x170
>>>> [<0>] wbt_wait+0x98/0xe0
>>>> [<0>] __rq_qos_throttle+0x23/0x30
>>
>> I need to cat /proc/<pid>/stack a huge number of times in a loop to once
>> in a while get this sort of output shown.
> 
> Ahh! So that means you are just getting lucky and catching the occasional
> throttle in action. Ok.
> 
>>> You can tune the expected latency of device writes via:
>>> /sys/block/sda/queue/wbt_lat_usec.
>>
>> Yes, I have been playing around with it earlier, without any effect on
>> the symptoms.
>>
>> I just did this again, echo 0 > all of the involved block devices. When
>> looking at the events/wbt trace point, I see that wbt activity stops at
>> that moment.
>>
>> No difference in symptoms.
> 
> Ok. Worth a shot..but in retrospect not really, since I just realized
> that I had your entire scenario backwards. Nevermind then.

np

>> Dirty buffers were ~ 2G in size. I can modify the numbers to make it
>> bigger or smaller. There's absolutely no change in behavior of the system.
> 
> Ok. I also have 2GB max-dirty and 1GB writeback-dirty, which is
> plenty for a 10G pipe.
> 
>> It doesn't. It's idle, waiting to finally get some data sent to it.
> 
> Yup, looks like it.
> 
>> All processing speed is inversely proportional to the cpu usage of this
>> kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
>> everything slows down. The network is idle, the disks are idle. Incoming
>> rsync speed drops, the speed in which btrfs receive is reading input
>> drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
>> below ~ 95% again, throughput goes up.
> 
> Only thing I can think of is that your rsync runs create insane amounts
> of small COW extents that need to be ordered/merged. Multiply by number of
> processes and you're probably hitting some super contended part.

Yes, that's what we're looking at and I see that my other email sent to
the list some hours ago didn't show up yet. Maybe I should resend with a
link to the attachment instead of carrying an attachment, or I'll wait
until tomorrow.

Hm,

Jul 25 21:06:44 syrinx postfix/smtp[32364]: 155C4609B67CF:
to=<linux-btrfs@vger.kernel.org>,
relay=vger.kernel.org[2620:137:e000::1:18]:25, delay=2.2,
delays=0.15/0.02/1.2/0.87, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as
7B1D9C08C5C0)
Jul 25 21:06:44 syrinx postfix/qmgr[951]: 155C4609B67CF: removed

O_o, so it was accepted, but I don't see it.

> Since
> the kworker isn't stuck forever but apparently makes progress it's not
> dead, just slow/overloaded.
> 
> A few years ago I started using rsync exclusively with --whole-file since
> it's not just much faster but also creates significantly less fragmentation
> at the expense of some disk space, but whatever...in my case snapshot rotation
> takes care of that. Maybe it's an option for you.
> 
> So maybe things to try:
> 
> - run rsync with --whole-file
> - run fewer rsyncs in parallel (might not be necessary with --whole-file!)
> 
> If that doesn't help someone else needs to chime in..

That someone else could be me, if only I had the work time available to
do it, instead of using hobby time, which for the past few months has
rather been spent optimizing the vegetable garden and cutting down trees
for firewood together with the neighbours.

And yes, all rsync is already running with --whole-file since 2014 to
reduce the total amount of reflink garbage in this now ~100T pile of
stuff with >200k subvols.

Doing fewer things in parallel does help, indeed. If I abort everything
and just run 1 of the 20 at a time, it does something. But, I want to
get this over with before the year 2050 happens.

K

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
  2020-07-25 15:37 ` Holger Hoffstätte
@ 2020-07-25 21:27 ` Hans van Kranenburg
  2020-07-26  8:10   ` A L
  2020-07-26  0:50 ` Chris Murphy
  2020-07-27 11:09 ` Qu Wenruo
  3 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-25 21:27 UTC (permalink / raw)
  To: linux-btrfs

Ok,

On 7/25/20 4:24 PM, Hans van Kranenburg wrote:
> Hi,
> 
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.
> 
> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.

Both rsync and btrfs receive are handling incoming streams of data which
contain huge numbers of files of all kinds of sizes, but also a lot of
small files. (E.g. an imap server receive job is processing 1G split send
dump files right now, with maildir storage putting every email in a
separate file, and there are also customers storing tiny XML fragments
from XML webservice calls in files to process later.)

> The underlying storage (iSCSI connected over 10Gb/s network) can easily
> eat a few hundred MiB/s. When looking at actual disk busyness on the
> storage device, percentages <5% utilization are reported for the actual
> disks.
> 
> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
> 
> I just did a 'perf record -g -a sleep 60' while disk writes were down to
> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
> shot' of it. Also attached an example of what nmon shows to give an idea
> about the situation.
> 
> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
> see network and disk write speed ramping up, easily over 200 MiB/s,
> until it soon plummets again.
> 
> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
> is booted now).
> 
> So, what I'm looking for is:
> * Does anyone else see this happen when doing a lot of concurrent writes?
> * What does this flush thing do?

I have been playing around with the available trace point stuff in
/sys/kernel/debug/tracing/events/btrfs a bit more.
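
That boils down to nothing more than enabling the stock trace events and
reading the pipe, roughly like this:

  cd /sys/kernel/debug/tracing
  echo 1 > events/btrfs/find_free_extent/enable
  echo 1 > events/btrfs/btrfs_cow_block/enable
  cat trace_pipe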

After rsync or btrfs receive has some new data available for the fs to
write to disk... What needs to happen before it can actually go into the
write cache to end up on disk?

Well, we need to find a place where to put it.

What's kworker/u16:X+flush-btrfs-2 doing?

It's doing find_free_extent.

> * Why is it using 100% cpu all the time?

Because it spends all its time doing find_free_extent. And almost all of
it is happening for EXTENT_TREE.

    len=16384 empty_size=0 flags=4(METADATA)

I could have known :O And again it's the extent allocator making my life
miserable.

> * How can I debug this more?

I will read more relevant code and see if I can put some more custom
trace points in. I will also have a closer look at the empty_size thing,
again...

> * Ultimately, of course... how can we improve this?

Well, by being able to have an idea of the 'pressure' on it. If we're
slowing everyone down a lot, switch to some more yolo allocation which
just appends all new writes and flushes them to disk, instead of spending
100% kernel cpu to find some tiny old 16k free metadata space hole to put
a single write in.

So, yes, I could hack around a bit in the code again and flip around
some mount option which does the 'ssd_spread' behavior for metadata
only. I did that on another filesystem a while ago, which helped a lot
for throughput, but it also allocated a ridiculous amount of new
metadata block groups.

I'd like to redesign all of this, but -EPOSSIBLE because -EWORKTIME

Now I have a btrfs receive running for >1.5T of customer garbage...
Receive cannot stop and resume. So am I going to abort and redo it while
it's at 75.2GiB 5:57:46 [1.44MiB/s]? Or just sit it out?

Stay tuned for the next episode of Knorrie's btrfs adventures! :)

K


* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
  2020-07-25 15:37 ` Holger Hoffstätte
  2020-07-25 21:27 ` Hans van Kranenburg
@ 2020-07-26  0:50 ` Chris Murphy
  2020-07-27 11:09 ` Qu Wenruo
  3 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2020-07-26  0:50 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Btrfs BTRFS

On Sat, Jul 25, 2020 at 8:25 AM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> Hi,
>
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.

Each receive goes into its own subvolume. What about the rsyncs? How
many concurrent and how many per subvolume?



> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.
>
> The underlying storage (iSCSI connected over 10Gb/s network) can easily
> eat a few hundred MiB/s. When looking at actual disk busyness on the
> storage device, percentages <5% utilization are reported for the actual
> disks.
>
> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.


>
> I just did a 'perf record -g -a sleep 60' while disk writes were down to
> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
> shot' of it. Also attached an example of what nmon shows to give an idea
> about the situation.

Why is swapper involved in this?

+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] secondary_startup_64
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] cpu_startup_entry
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] do_idle
+   80.52%     0.00%  swapper          [kernel.kallsyms]     [k] default_idle
+   80.52%    80.51%  swapper          [kernel.kallsyms]     [k] native_safe_halt
+   63.42%     0.00%  swapper          [kernel.kallsyms]     [k] start_secondary

What's PSI look like while this problem is happening?

# grep . /proc/pressure/*
# cat /proc/meminfo

>
> So, what I'm looking for is:
> * Does anyone else see this happen when doing a lot of concurrent writes?
> * What does this flush thing do?
> * Why is it using 100% cpu all the time?
> * How can I debug this more?
> * Ultimately, of course... how can we improve this?
>
> I can recompile the kernel image to e.g. put more trace points in, in
> different places.

What about using bcc-tools to check for high latencies? I'd maybe
start with fileslower, syncsnoop, btrfsslower, biolatency. This is a
bit of a firehose at first, but not nearly as much as perf.
Strategy-wise you could start either at the top with fileslower or at
the bottom with biolatency. fileslower may not tell you anything you
don't already know, which is that everything is f'n slow and waiting on
something. So, OK, is there something hammering on sync()? Try that
next. And so on. There's a bunch of these tools:

https://github.com/iovisor/bcc/blob/master/images/bcc_tracing_tools_2019.png
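
For example (assuming the stock install location under
/usr/share/bcc/tools; adjust to your distro):

  /usr/share/bcc/tools/biolatency -D 10     # per-disk I/O latency histograms every 10s
  /usr/share/bcc/tools/btrfsslower 10       # btrfs file ops slower than 10 ms
  /usr/share/bcc/tools/fileslower 10        # synchronous reads/writes slower than 10 ms
  /usr/share/bcc/tools/syncsnoop            # who is calling sync()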


--
Chris Murphy

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 21:03       ` Hans van Kranenburg
@ 2020-07-26  1:00         ` Chris Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2020-07-26  1:00 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Holger Hoffstätte, Btrfs BTRFS

On Sat, Jul 25, 2020 at 3:03 PM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> And yes, all rsync is already running with --whole-file since 2014 to
> reduce the total amount of reflink garbage in this now ~100T pile of
> stuff with >200k subvols.

OK. So I was thinking maybe lock contention on subvols. But with that many
subvolumes I'm willing to bet every task has its own.

And you're already ahead of where I'm at anyway. A search for free
extent sounds like free space tree contention. Does this suck any less
if there's no space cache?
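
E.g., purely as a test (exact options depend on which cache is in use):

  # fresh mount without the free space cache
  mount -o nospace_cache /dev/xvdX /mountpoint
  # or clear the v1 cache entirely while unmounted
  btrfs check --clear-space-cache v1 /dev/xvdX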


-- 
Chris Murphy

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 21:27 ` Hans van Kranenburg
@ 2020-07-26  8:10   ` A L
  0 siblings, 0 replies; 17+ messages in thread
From: A L @ 2020-07-26  8:10 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs


> Now I have a btrfs receive running for >1.5T of customer garbage...
> Receive cannot stop and resume. So am I going to abort and redo it while
> it's at 75.2GiB 5:57:46 [1.44MiB/s]? Or just sit it out?
>
> Stay tuned for the next episode of Knorrie's btrfs adventures! :)
Always interesting stories :)

Anyway, can you throttle your send|receive by putting mbuffer in
between? I myself have had quite good success with mbuffer in reducing
overall I/O, not to mention the ability to limit the overall speed.
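
Something like this, for example (buffer size and rate limit purely
illustrative):

  btrfs send /path/to/snapshot | mbuffer -m 1G -r 100M | btrfs receive /target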


A

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
                   ` (2 preceding siblings ...)
  2020-07-26  0:50 ` Chris Murphy
@ 2020-07-27 11:09 ` Qu Wenruo
  2020-07-27 17:17   ` Hans van Kranenburg
  3 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2020-07-27 11:09 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs



On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
> Hi,
>
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.
>
> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.

The name is in fact pretty strange.
It doesn't follow the btrfs workqueue names.

There are two types of kernel threads used by btrfs:
- kthread
  This includes:
  * btrfs-uuid
  * btrfs-cleaner
  * btrfs-transaction
  * btrfs-devrepl
  * btrfs-ino-cache-*llu (deprecated)
  * btrfs-balance

- btrfs_workqueue (a wrapper around the kernel workqueue)
  This mostly includes the following ones; some of them may have another
  workqueue with a "-high" suffix:
  * btrfs-worker
  * btrfs-delalloc
  * btrfs-flush_delalloc
  * btrfs-cache
  * btrfs-fixup
  * btrfs-endio
  * btrfs-endio-meta
  * btrfs-endio-meta-write
  * btrfs-endio-raid56
  * btrfs-rmw
  * btrfs-endio-write
  * btrfs-freespace-write
  * btrfs-delayed-meta
  * btrfs-readahead
  * btrfs-qgroup-rescan
  * btrfs-scrub
  * btrfs-scrubwrc
  * btrfs-scrubparity

  (No wonder Linus is not happy with so many work queues)

As you can see, there is no one named like "flush-btrfs".

Thus I guess it's from another part of the stack.

Also, the call trace shows that the kernel thread is only doing
page writeback, which calls back into the page write hooks of btrfs.

So I guess it may not be btrfs, but something else trying to do all the
writeback.

But still, the CPU usage is a problem; it shouldn't cost so much
CPU time just to write back pages from btrfs.


>
> The underlying storage (iSCSI connected over 10Gb/s network) can easily
> eat a few hundred MiB/s. When looking at actual disk busyness on the
> storage device, percentages <5% utilization are reported for the actual
> disks.
>
> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>
> I just did a 'perf record -g -a sleep 60' while disk writes were down to
> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
> shot' of it. Also attached an example of what nmon shows to give an idea
> about the situation.
>
> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
> see network and disk write speed ramping up, easily over 200 MiB/s,
> until it soon plummets again.
>
> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
> is booted now).
>
> So, what I'm looking for is:
> * Does anyone else see this happen when doing a lot of concurrent writes?
> * What does this flush thing do?
> * Why is it using 100% cpu all the time?
> * How can I debug this more?

bcc based runtime probes I guess?

Since it's almost a dead CPU burner loop, regular sleep based lockup
detector won't help much.

You can try trace events first to see which trace event gets executed the
most frequently, then try to add probe points to pin down the real cause.
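
E.g. just counting the btrfs trace events for a while should already show
which one dominates, something like:

  perf stat -e 'btrfs:*' -a -- sleep 10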

But personally speaking, it's better to shrink the workload, to find a
minimal workload that reproduces the 100% CPU burn, so that you need fewer
probes and less time to pin down the problem.

> * Ultimately, of course... how can we improve this?
>
> I can recompile the kernel image to e.g. put more trace points in, in
> different places.
>
> I just have no idea where to start.
>
> Thanks,
> Hans
>
> P.S. /proc/<pid>/stack for the kworker/u16:X+flush-btrfs-2 mostly shows
> nothing at all, and sometimes when it does show some output, it mostly
> looks like this:
>
> ----
> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30
> [<0>] blk_mq_make_request+0x12a/0x5d0
> [<0>] generic_make_request+0xcf/0x310
> [<0>] submit_bio+0x42/0x1c0
> [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
> [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
> [<0>] submit_one_bio+0x31/0x50 [btrfs]
> [<0>] submit_extent_page+0x102/0x210 [btrfs]
> [<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
> [<0>] __extent_writepage+0x101/0x300 [btrfs]
> [<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
> [<0>] extent_writepages+0x44/0x90 [btrfs]
> [<0>] do_writepages+0x41/0xd0

Yeah, it's just trying to write dirty pages for btrfs.

I don't really believe it's btrfs causing the 100% CPU burn.

Maybe you could just try running btrfs on bare storage, without iSCSI,
just to verify whether it's btrfs that is to blame?

Thanks,
Qu

> [<0>] __writeback_single_inode+0x3d/0x340
> [<0>] writeback_sb_inodes+0x1e5/0x480
> [<0>] __writeback_inodes_wb+0x5d/0xb0
> [<0>] wb_writeback+0x25f/0x2f0
> [<0>] wb_workfn+0x2fe/0x3f0
> [<0>] process_one_work+0x1ad/0x370
> [<0>] worker_thread+0x30/0x390
> [<0>] kthread+0x112/0x130
> [<0>] ret_from_fork+0x1f/0x40
>
> ----
> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30
> [<0>] blk_mq_make_request+0x12a/0x5d0
> [<0>] generic_make_request+0xcf/0x310
> [<0>] submit_bio+0x42/0x1c0
> [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
> [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
> [<0>] submit_one_bio+0x31/0x50 [btrfs]
> [<0>] extent_writepages+0x5d/0x90 [btrfs]
> [<0>] do_writepages+0x41/0xd0
> [<0>] __writeback_single_inode+0x3d/0x340
> [<0>] writeback_sb_inodes+0x1e5/0x480
> [<0>] __writeback_inodes_wb+0x5d/0xb0
> [<0>] wb_writeback+0x25f/0x2f0
> [<0>] wb_workfn+0x2fe/0x3f0
> [<0>] process_one_work+0x1ad/0x370
> [<0>] worker_thread+0x30/0x390
> [<0>] kthread+0x112/0x130
> [<0>] ret_from_fork+0x1f/0x40
>
> ----
> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30
> [<0>] blk_mq_make_request+0x12a/0x5d0
> [<0>] generic_make_request+0xcf/0x310
> [<0>] submit_bio+0x42/0x1c0
> [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
> [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
> [<0>] submit_one_bio+0x31/0x50 [btrfs]
> [<0>] submit_extent_page+0x102/0x210 [btrfs]
> [<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
> [<0>] __extent_writepage+0x101/0x300 [btrfs]
> [<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
> [<0>] extent_writepages+0x44/0x90 [btrfs]
> [<0>] do_writepages+0x41/0xd0
> [<0>] __writeback_single_inode+0x3d/0x340
> [<0>] writeback_sb_inodes+0x1e5/0x480
> [<0>] __writeback_inodes_wb+0x5d/0xb0
> [<0>] wb_writeback+0x25f/0x2f0
> [<0>] wb_workfn+0x2fe/0x3f0
> [<0>] process_one_work+0x1ad/0x370
> [<0>] worker_thread+0x30/0x390
> [<0>] kthread+0x112/0x130
> [<0>] ret_from_fork+0x1f/0x40
>

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-27 11:09 ` Qu Wenruo
@ 2020-07-27 17:17   ` Hans van Kranenburg
  2020-07-27 19:23     ` Chris Murphy
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-27 17:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi!

On 7/27/20 1:09 PM, Qu Wenruo wrote:
> 
> 
> On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
>> Hi,
>>
>> I have a filesystem here that I'm filling up with data from elsewhere.
>> Most of it is done by rsync, and part by send/receive. So, receiving
>> data over the network, and then writing the files to disk. There can be
>> a dozen of these processes running in parallel.
>>
>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>> varying X) often is using nearly 100% cpu, while enormously slowing down
>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>> processes.
> 
> The name is in fact pretty strange.
> It doesn't follow the btrfs workqueue names.
> 
> There are two type of kernel threads used by btrfs:
>   [...]
> 
> As you can see, there is no one named like "flush-btrfs".

Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
currently. ;]

> Thus I guess it's from another part of the stack.
> 
> Also, the calltrace also shows that, that kernel thread is only doing
> page writeback, which calls back to the page write hooks of btrfs.
> 
> So I guess it may not be btrfs, but something else trying to do all the
> writeback.

Yes, so, from the writeback path it re-enters btrfs code, right?

> But still, the CPU usage is still a problem, it shouldn't cost so much
> CPU time just writing back pages from btrfs.

It's find_free_extent which is the culprit.

>> [...]
>>
>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>>
>> I just did a 'perf record -g -a sleep 60' while disk writes were down to
>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
>> shot' of it. Also attached an example of what nmon shows to give an idea
>> about the situation.
>>
>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
>> see network and disk write speed ramping up, easily over 200 MiB/s,
>> until it soon plummets again.
>>
>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
>> is booted now).
>>
>> So, what I'm looking for is:
>> * Does anyone else see this happen when doing a lot of concurrent writes?
>> * What does this flush thing do?
>> * Why is it using 100% cpu all the time?
>> * How can I debug this more?
> 
> bcc based runtime probes I guess?
> 
> Since it's almost a dead CPU burner loop, regular sleep based lockup
> detector won't help much.

Here's a flame graph of 180 seconds, taken from the kernel thread pid:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
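
Captured roughly like this, assuming the usual perf + FlameGraph scripts
(stackcollapse-perf.pl / flamegraph.pl):

  perf record -g -p <pid-of-the-kworker> -- sleep 180
  perf script | stackcollapse-perf.pl | flamegraph.pl > flush-btrfs.svg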

> You can try trace events first to see which trace event get executed the
> most frequently, then try to add probe points to pin down the real cause.

From the default collection, I already got the following, a few days
ago, by enabling find_free_extent and btrfs_cow_block:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt

From the timestamps you can see how long this takes. It's not much that
gets done per second.

The spin lock part must be spin_lock(&space_info->lock) because that's
the only one in find_free_extent.

So, what I think is that, like I mentioned on Saturday already,
find_free_extent is probably doing things in a really inefficient way:
scanning around for a small single free space gap all the time, in a
really expensive way, and doing that over again for each tiny metadata
block or file that needs to be placed somewhere (also notice the
empty_size=0), instead of just throwing all of it on disk one after the
other when it's otherwise slowing everyone down.

And then we have the spin lock part and btrfs_get_block_group, which are
also a significant part of the time spent. All of this together makes up
the continuous 100% cpu.

What I *can* try is switch to the ssd_spread option, to force it to do
much more yolo allocation, but I'm afraid this will result in a sudden
blast of metadata block groups getting allocated. Or, maybe try it for a
minute or so and compare the trace pipe output...
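
I.e. something as simple as:

  mount -o remount,ssd_spread /mountpoint
  # ...watch trace_pipe for a minute...
  mount -o remount,nossd_spread /mountpoint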

> But personally speaking, it's better to shrink the workload, to find a
> minimal workload to reproduce the 100% CPU burn, so that you need less
> probes/time to pindown the problem.

I think I can reproduce it with a single btrfs receive operation, as
long as it has a large number of small files in it.

Currently I'm almost only copying data onto the filesystem; the only data
removals come from a targeted dedupe script for known big files that are
identical, but which could not easily be handled by rsync.

So, when find_free_extent wants to write extent tree blocks, it might be
playing outside in the woods all the time, searching around for little
holes that were caused by previous cow operations.

Large files, like postgresql database dumps, are not a problem. Copying
them over just runs at >200-300M/s without problem, where the read speed
of the old storage I'm moving things off is actually the bottleneck.

>> * Ultimately, of course... how can we improve this?
>>
>> [...]

I will see if I can experiment a bit with putting more trace points in.

I remember reading around in the extent allocator code, until I got
dizzy from it. Deja vu...

https://www.spinics.net/lists/linux-btrfs/msg70624.html

Thanks
Hans

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-27 17:17   ` Hans van Kranenburg
@ 2020-07-27 19:23     ` Chris Murphy
  2020-07-27 23:16     ` Chris Murphy
  2020-07-28  0:51     ` Qu Wenruo
  2 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2020-07-27 19:23 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Qu Wenruo, Btrfs BTRFS

On Mon, Jul 27, 2020 at 11:17 AM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>
> From the timestamps you can see how long this takes. It's not much that
> gets done per second.
>
> The spin lock part must be spin_lock(&space_info->lock) because that's
> the only one in find_free_extent.
>
> So, what I think is that, like I mentioned on saturday already,
> find_free_extent is probably doing things in a really inefficient way,
> scanning around for a small single free space gap all the time in a
> really expensive way, and doing that over again for each tiny metadata
> block or file that needs to be placed somewhere (also notice the
> empty_size=0), instead of just throwing all of it on disk after each
> other, when it's otherwise slowing down everyone.

What are the latencies for the iscsi device while this is happening?
Does your trace have the ability to distinguish between inefficiency
in find_free_extent versus waiting on IO latency?

However, this hypothesis could be tested by completely balancing all
metadata block groups, then trying to reproduce this workload. If it's
really this expensive to do metadata insertions, eliminating the need
for insertions should show up as a remarkable performance improvement
that degrades as metadata block groups become fragmented again.
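
E.g. something like a full metadata balance, which rewrites (and thereby
compacts) every metadata block group:

  btrfs balance start -m /mountpoint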


> What I *can* try is switch to the ssd_spread option, to force it to do
> much more yolo allocation, but I'm afraid this will result in a sudden
> blast of metadata block groups getting allocated. Or, maybe try it for a
> minute or so and compare the trace pipe output...

OK or this.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-27 17:17   ` Hans van Kranenburg
  2020-07-27 19:23     ` Chris Murphy
@ 2020-07-27 23:16     ` Chris Murphy
  2020-07-28  0:51     ` Qu Wenruo
  2 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2020-07-27 23:16 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Qu Wenruo, Btrfs BTRFS

On Mon, Jul 27, 2020 at 11:17 AM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> Hi!
>
> On 7/27/20 1:09 PM, Qu Wenruo wrote:
>
> > As you can see, there is no one named like "flush-btrfs".
>
> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
> currently. ;]

On a send/receive doing well over 200M/s I see brief moments of

      0      0      0 D  10.5   0.0   0:06.15 kworker/u16:15+flush-btrfs-4

But it seems to coincide with 30s intervals, consistent with default commit=30.

Are you using flushoncommit? Or a commit time other than 30? Maybe see
if the problem gets better with commit=10. I wonder if it's just too
long of a commit window for this particular workload.
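
For reference, a sketch of trying that without unmounting (the mount
point is a placeholder):

  # show the options currently in effect, including commit= if set
  findmnt -no OPTIONS /mnt
  # shrink the commit interval from the default 30s to 10s
  mount -o remount,commit=10 /mnt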


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-27 17:17   ` Hans van Kranenburg
  2020-07-27 19:23     ` Chris Murphy
  2020-07-27 23:16     ` Chris Murphy
@ 2020-07-28  0:51     ` Qu Wenruo
  2020-07-28  1:52       ` Qu Wenruo
  2 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2020-07-28  0:51 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs



On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
> Hi!
>
> On 7/27/20 1:09 PM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
>>> Hi,
>>>
>>> I have a filesystem here that I'm filling up with data from elsewhere.
>>> Most of it is done by rsync, and part by send/receive. So, receiving
>>> data over the network, and then writing the files to disk. There can be
>>> a dozen of these processes running in parallel.
>>>
>>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>>> varying X) often is using nearly 100% cpu, while enormously slowing down
>>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>>> processes.
>>
>> The name is in fact pretty strange.
>> It doesn't follow the btrfs workqueue names.
>>
>> There are two type of kernel threads used by btrfs:
>>   [...]
>>
>> As you can see, there is no one named like "flush-btrfs".
>
> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
> currently. ;]
>
>> Thus I guess it's from other part of the stack.
>>
>> Also, the calltrace also shows that, that kernel thread is only doing
>> page writeback, which calls back to the page write hooks of btrfs.
>>
>> So I guess it may not be btrfs, but something else trying to do all the
>> writeback.
>
> Yes, so, from the writeback path it re-enters btrfs code, right?

Yep. Btrfs is still responsible for the problem.

>
>> But still, the CPU usage is still a problem, it shouldn't cost so much
>> CPU time just writing back pages from btrfs.
>
> It's find_free_extent which is the culprit.
>
>>> [...]
>>>
>>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>>>
>>> I just did a 'perf record -g -a sleep 60' while disk writes were down to
>>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
>>> shot' of it. Also attached an example of what nmon shows to give an idea
>>> about the situation.
>>>
>>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
>>> see network and disk write speed ramping up, easily over 200 MiB/s,
>>> until it soon plummets again.
>>>
>>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
>>> is booted now).
>>>
>>> So, what I'm looking for is:
>>> * Does anyone else see this happen when doing a lot of concurrent writes?
>>> * What does this flush thing do?
>>> * Why is it using 100% cpu all the time?
>>> * How can I debug this more?
>>
>> bcc based runtime probes I guess?
>>
>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>> detector won't help much.
>
> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg

That's really awesome!

>
>> You can try trace events first to see which trace event get executed the
>> most frequently, then try to add probe points to pin down the real cause.
>
> From the default collection, I already got the following, a few days
> ago, by enabling find_free_extent and btrfs_cow_block:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>
> From the timestamps you can see how long this takes. It's not much that
> gets done per second.
>
> The spin lock part must be spin_lock(&space_info->lock) because that's
> the only one in find_free_extent.
>
> So, what I think is that, like I mentioned on saturday already,
> find_free_extent is probably doing things in a really inefficient way,
> scanning around for a small single free space gap all the time in a
> really expensive way, and doing that over again for each tiny metadata
> block or file that needs to be placed somewhere (also notice the
> empty_size=0), instead of just throwing all of it on disk after each
> other, when it's otherwise slowing down everyone.

I think you're mostly right: find_free_extent() is not that efficient.

But that only becomes a problem when there are a lot of parallel
requests hitting it.

One example is btrfs_get_block_group().
That function only calls refcount_inc(), which normally shouldn't be
slow at all, unless there are tons of other callers contending here.
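
A crude way to see that concurrency, assuming the find_free_extent trace
event from your earlier capture is still enabled, is to count events per
task for half a minute:

  # tasks calling into find_free_extent most often over a 30s window
  timeout 30 cat /sys/kernel/debug/tracing/trace_pipe | \
      awk '{print $1}' | sort | uniq -c | sort -rn | head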

>
> And then we have the spin lock part and btrfs_get_block_group, which are
> also a significant part of the time spent. All together is the
> continuous 100% cpu.

So is the spinlock part. That spinlock itself should be pretty fast
(that's why we use a spinlock), but when it isn't, that's mostly caused
by high concurrency.

>
> What I *can* try is switch to the ssd_spread option, to force it to do
> much more yolo allocation, but I'm afraid this will result in a sudden
> blast of metadata block groups getting allocated. Or, maybe try it for a
> minute or so and compare the trace pipe output...

Another thing that can't be explained is that the slowdown you're
experiencing is from data writeback.
It should never conflict with metadata space reservation, as data and
metadata are separated into different space_infos, so they shouldn't
affect each other.
Either I'm missing something, or there are other problems.

>
>> But personally speaking, it's better to shrink the workload, to find a
>> minimal workload to reproduce the 100% CPU burn, so that you need less
>> probes/time to pindown the problem.
>
> I think I can reproduce it with a single btrfs receive operation, as
> long as it has a large amount of small files in it.

IMHO, if the culprit is really find_free_extent() itself, then a single
receive won't reach the concurrency we need to expose the slowdown.

Since receive is just doing serial writes, that shouldn't cause enough
contention in find_free_extent().

But it's worth trying, to see whether it confirms my assumption about
concurrency.

Thanks,
Qu

>
> Currently I'm almost only copying data onto the filesystem, only data
> removals are some targeted dedupe script for known big files that are
> identical, but which could not easily be handled by rsync.
>
> So, when find_free_extent wants to write extent tree blocks, it might be
> playing outside in the woods all the time, searching around for little
> holes that were caused by previous cow operations.
>
> Large files are not a problem, like postgresql database dumps. Copying
> them over just runs >200-300M/s without problem, where the read speed of
> the old storage I'm moving things off is the bottleneck, actually.
>
>>> * Ultimately, of course... how can we improve this?
>>>
>>> [...]
>
> I will see if I can experiment a bit with putting more trace points in.
>
> I remember reading around in the extent allocator code, until I got
> dizzy from it. Deja vu...
>
> https://www.spinics.net/lists/linux-btrfs/msg70624.html
>
> Thanks
> Hans
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-28  0:51     ` Qu Wenruo
@ 2020-07-28  1:52       ` Qu Wenruo
  2020-07-28 14:52         ` Hans van Kranenburg
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2020-07-28  1:52 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs



On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>
>
> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>> Hi!
>>
>> On 7/27/20 1:09 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
>>>> Hi,
>>>>
>>>> I have a filesystem here that I'm filling up with data from elsewhere.
>>>> Most of it is done by rsync, and part by send/receive. So, receiving
>>>> data over the network, and then writing the files to disk. There can be
>>>> a dozen of these processes running in parallel.
>>>>
>>>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>>>> varying X) often is using nearly 100% cpu, while enormously slowing down
>>>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>>>> processes.
>>>
>>> The name is in fact pretty strange.
>>> It doesn't follow the btrfs workqueue names.
>>>
>>> There are two type of kernel threads used by btrfs:
>>>   [...]
>>>
>>> As you can see, there is no one named like "flush-btrfs".
>>
>> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
>> currently. ;]
>>
>>> Thus I guess it's from other part of the stack.
>>>
>>> Also, the calltrace also shows that, that kernel thread is only doing
>>> page writeback, which calls back to the page write hooks of btrfs.
>>>
>>> So I guess it may not be btrfs, but something else trying to do all the
>>> writeback.
>>
>> Yes, so, from the writeback path it re-enters btrfs code, right?
>
> Yep. Btrfs is still responsible for the problem.
>
>>
>>> But still, the CPU usage is still a problem, it shouldn't cost so much
>>> CPU time just writing back pages from btrfs.
>>
>> It's find_free_extent which is the culprit.
>>
>>>> [...]
>>>>
>>>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>>>>
>>>> I just did a 'perf record -g -a sleep 60' while disk writes were down to
>>>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
>>>> shot' of it. Also attached an example of what nmon shows to give an idea
>>>> about the situation.
>>>>
>>>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
>>>> see network and disk write speed ramping up, easily over 200 MiB/s,
>>>> until it soon plummets again.
>>>>
>>>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
>>>> is booted now).
>>>>
>>>> So, what I'm looking for is:
>>>> * Does anyone else see this happen when doing a lot of concurrent writes?
>>>> * What does this flush thing do?
>>>> * Why is it using 100% cpu all the time?
>>>> * How can I debug this more?
>>>
>>> bcc based runtime probes I guess?
>>>
>>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>>> detector won't help much.
>>
>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>
> That's really awesome!
>
>>
>>> You can try trace events first to see which trace event get executed the
>>> most frequently, then try to add probe points to pin down the real cause.
>>
>> From the default collection, I already got the following, a few days
>> ago, by enabling find_free_extent and btrfs_cow_block:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt

This output is in fact pretty confusing, and may give you a false view
of the callers of find_free_extent().

It always shows "EXTENT_TREE" as the owner, but that's due to a bad
decision on the trace event.

I have submitted a patch addressing it, and added you to the CC.

Would you mind re-capturing the events with that patch?
That way we would have a clearer idea of which tree is seeing the most
concurrency.

Thanks,
Qu

>>
>> From the timestamps you can see how long this takes. It's not much that
>> gets done per second.
>>
>> The spin lock part must be spin_lock(&space_info->lock) because that's
>> the only one in find_free_extent.
>>
>> So, what I think is that, like I mentioned on saturday already,
>> find_free_extent is probably doing things in a really inefficient way,
>> scanning around for a small single free space gap all the time in a
>> really expensive way, and doing that over again for each tiny metadata
>> block or file that needs to be placed somewhere (also notice the
>> empty_size=0), instead of just throwing all of it on disk after each
>> other, when it's otherwise slowing down everyone.
>
> I think you're mostly right, find_free_extent() is not that efficient.
>
> But that only happens when there are a lot of parallel requests for it.
>
> One example is btrfs_get_block_group().
> That function is only calling one function, refcount_inc(), which
> normally shouldn't be that slow, unless there are tons of other callers
> here.
>
>>
>> And then we have the spin lock part and btrfs_get_block_group, which are
>> also a significant part of the time spent. All together is the
>> continuous 100% cpu.
>
> So is the spinlock part. That spinlock itself should be pretty fast
> (that's why we use spinlock), but when it's not, it's mostly caused by
> high concurrency.
>
>>
>> What I *can* try is switch to the ssd_spread option, to force it to do
>> much more yolo allocation, but I'm afraid this will result in a sudden
>> blast of metadata block groups getting allocated. Or, maybe try it for a
>> minute or so and compare the trace pipe output...
>
> Another thing can't be explained is, that slow down you're experiencing
> is from data write back.
> It should never conflicts with metadata space reservation, as data and
> metadata are separated by space_info first, thus they shouldn't affect
> each other.
> Either I'm missing something, or there are other problems.
>
>>
>>> But personally speaking, it's better to shrink the workload, to find a
>>> minimal workload to reproduce the 100% CPU burn, so that you need less
>>> probes/time to pindown the problem.
>>
>> I think I can reproduce it with a single btrfs receive operation, as
>> long as it has a large amount of small files in it.
>
> IMHO, if the culprit is really find_free_extent() itself, then single
> receive won't reach the concurrency we need to expose the slowdown.
>
> Since receive is just doing serial writes, that shouldn't cause enough
> race for find_free_extents().
>
> But that's worthy trying to see if it proofs my assumption of concurrency.
>
> Thanks,
> Qu
>
>>
>> Currently I'm almost only copying data onto the filesystem, only data
>> removals are some targeted dedupe script for known big files that are
>> identical, but which could not easily be handled by rsync.
>>
>> So, when find_free_extent wants to write extent tree blocks, it might be
>> playing outside in the woods all the time, searching around for little
>> holes that were caused by previous cow operations.
>>
>> Large files are not a problem, like postgresql database dumps. Copying
>> them over just runs >200-300M/s without problem, where the read speed of
>> the old storage I'm moving things off is the bottleneck, actually.
>>
>>>> * Ultimately, of course... how can we improve this?
>>>>
>>>> [...]
>>
>> I will see if I can experiment a bit with putting more trace points in.
>>
>> I remember reading around in the extent allocator code, until I got
>> dizzy from it. Deja vu...
>>
>> https://www.spinics.net/lists/linux-btrfs/msg70624.html
>>
>> Thanks
>> Hans
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-28  1:52       ` Qu Wenruo
@ 2020-07-28 14:52         ` Hans van Kranenburg
  2020-07-29  0:15           ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-28 14:52 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 7/28/20 3:52 AM, Qu Wenruo wrote:
> 
> 
> On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>>> Hi!
>>>
>>> [...]
>>>>
>>>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>>>> detector won't help much.
>>>
>>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>>
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>>
>> That's really awesome!
>>
>>>
>>>> You can try trace events first to see which trace event get executed the
>>>> most frequently, then try to add probe points to pin down the real cause.
>>>
>>> From the default collection, I already got the following, a few days
>>> ago, by enabling find_free_extent and btrfs_cow_block:
>>>
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
> 
> This output is in fact pretty confusing, and maybe give you a false view
> on the callers of find_free_extent().
> 
> It always shows "EXTENT_TREE" as the owner, but that's due to a bad
> decision on the trace event.
> 
> I have submitted a patch addressing it, and added you to the CC.

Oh right, thanks, that actually makes a lot of sense, lol.

I was misled because at first sight I was thinking, "yeah, obviously,
where else than in the extent tree are you going to do administration
about allocated blocks", and didn't yet realize that it made no sense.

> Would you mind to re-capture the events with that patch?
> So that we could have a clearer idea on which tree is having the most
> amount of concurrency?

Yes. I will do that and try to reproduce the symptoms with as few
actions in parallel as possible.

I've been away most of the day today; I will see how far I get later,
and otherwise continue tomorrow.

What you *can* see in the current output already, however, is that the
kworker/u16:3-13887 thing is doing all DATA work, while many different
processes (rsync, btrfs receive) all do the find_free_extent work for
METADATA. That's already an interesting difference.
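
For what it's worth, a quick breakdown of a local copy of that capture
(assuming the lines carry the task name in the first field and a
flags=...(DATA) / flags=...(METADATA) field) can be had with something
like:

  # tally captured find_free_extent lines by task and allocation type
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^flags=/) print $1, $i }' \
      2020-07-25-find_free_extent.txt | sort | uniq -c | sort -rn | head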

So, one direction to look into is who all is trying to grab that spin
lock, since if it's per 'space' (which sounds logical, since the workers
will never clash because a whole block group belongs to only 1 'space'),
then I don't see why kworker/u16:18-11336 would spend 1/3 of its time
busy-waiting on the lock, while it's the only process working on
METADATA.

But, I'll gather some more logs and pictures.

Do you know of a way (some RTFM pointer?) to debug who is contending on
the same lock? I haven't researched that yet.

Hans

> [...]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-28 14:52         ` Hans van Kranenburg
@ 2020-07-29  0:15           ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-07-29  0:15 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3758 bytes --]



On 2020/7/28 10:52 PM, Hans van Kranenburg wrote:
> On 7/28/20 3:52 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>>>> Hi!
>>>>
>>>> [...]
>>>>>
>>>>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>>>>> detector won't help much.
>>>>
>>>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>>>
>>> That's really awesome!
>>>
>>>>
>>>>> You can try trace events first to see which trace event get executed the
>>>>> most frequently, then try to add probe points to pin down the real cause.
>>>>
>>>> From the default collection, I already got the following, a few days
>>>> ago, by enabling find_free_extent and btrfs_cow_block:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>>
>> This output is in fact pretty confusing, and maybe give you a false view
>> on the callers of find_free_extent().
>>
>> It always shows "EXTENT_TREE" as the owner, but that's due to a bad
>> decision on the trace event.
>>
>> I have submitted a patch addressing it, and added you to the CC.
> 
> Oh right, thanks, that actually makes a lot of sense, lol.
> 
> I was misled because at first sight I was thinking, "yeah, obviously,
> where else than in the extent tree are you going to do administration
> about allocated blocks.", and didn't realize yet it did make no sense.
> 
>> Would you mind to re-capture the events with that patch?
>> So that we could have a clearer idea on which tree is having the most
>> amount of concurrency?
> 
> Yes. I will do that and try to reproduce the symptoms with as few
> actions in parallel as possible.
> 
> I've been away most of the day today, I will see how far I get later,
> otherwise continue tomorrow.
> 
> What you *can* see in the current output already, however, is that the
> kworker/u16:3-13887 thing is doing all DATA work, while many different
> processes (rsync, btrfs receive) all do the find_free_extent work for
> METADATA. That's already an interesting difference.

That's common; don't forget that btrfs does COW to modify its metadata.
After a data write reaches disk, btrfs will then modify metadata to
reflect that write, which causes a metadata reservation.

The good news is that btrfs does not always need to COW metadata, if the
needed tree blocks have already been CoWed and not yet written to disk.

So in theory, data reservations should happen more frequently than
metadata reservations.

> 
> So, one direction to look into is who is all trying to grab that spin
> lock, since if it's per 'space' (which sounds logical, since the workers
> will never clash because a whole block group belongs to only 1 'space'),
> then I don't see why kworker/u16:18-11336 would spend 1/3 of it's time
> in a busy locking situation waiting, while it's the only process working
> on METADATA.
> 
> But, I'll gather some more logs and pictures.
> 
> Do you know (some RTFM pointer?) about a way to debug who's locking on
> the same thing? I didn't research that yet.

I had a bcc-based script to do that analysis, but it's mostly for tree
blocks.
https://github.com/adam900710/btrfs-profiler/blob/master/tree_lock_wait.py

Feel free to use that as a base to develop your own tool.
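
Another option, if you can run a debug build, is the kernel's own lock
statistics. A sketch (it needs CONFIG_LOCK_STAT=y, slows everything down
while enabled, and the exact class name of the space_info lock is an
assumption):

  # collect lock contention statistics for 30 seconds
  echo 1 > /proc/sys/kernel/lock_stat
  sleep 30
  echo 0 > /proc/sys/kernel/lock_stat
  # waiters, wait times and hold times per lock class
  grep -A 3 space_info /proc/lock_stat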


BTW, this data allocation race reminds me that Josef and Nikolay
discussed this possible problem before, but there is not yet a solid
solution to it.

So I guess what we can do now is just to collect enough data to prove
that it's already a big problem.

Thanks,
Qu
> 
> Hans
> 
>> [...]


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-07-29  0:15 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
2020-07-25 15:37 ` Holger Hoffstätte
2020-07-25 16:43   ` Hans van Kranenburg
2020-07-25 19:44     ` Holger Hoffstätte
2020-07-25 21:03       ` Hans van Kranenburg
2020-07-26  1:00         ` Chris Murphy
2020-07-25 21:27 ` Hans van Kranenburg
2020-07-26  8:10   ` A L
2020-07-26  0:50 ` Chris Murphy
2020-07-27 11:09 ` Qu Wenruo
2020-07-27 17:17   ` Hans van Kranenburg
2020-07-27 19:23     ` Chris Murphy
2020-07-27 23:16     ` Chris Murphy
2020-07-28  0:51     ` Qu Wenruo
2020-07-28  1:52       ` Qu Wenruo
2020-07-28 14:52         ` Hans van Kranenburg
2020-07-29  0:15           ` Qu Wenruo
