* Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
@ 2020-07-25 14:24 Hans van Kranenburg
  2020-07-25 15:37 ` Holger Hoffstätte
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-25 14:24 UTC (permalink / raw)
  To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4302 bytes --]

Hi,

I have a filesystem here that I'm filling up with data from elsewhere.
Most of it is done by rsync, and part by send/receive. So, receiving
data over the network, and then writing the files to disk. There can be
a dozen of these processes running in parallel.

Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
varying X) often is using nearly 100% cpu, while enormously slowing down
disk writes. This shows as disk IO wait for the rsync and btrfs receive
processes.

The underlying storage (iSCSI connected over 10Gb/s network) can easily
eat a few hundred MiB/s. When looking at actual disk busyness on the
storage device, percentages <5% utilization are reported for the actual
disks.

It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.

I just did a 'perf record -g -a sleep 60' while disk writes were down to
under 1MiB (!) per second and then 'perf report'. Attached some 'screen
shot' of it. Also attached an example of what nmon shows to give an idea
about the situation.

If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
see network and disk write speed ramping up, easily over 200 MiB/s,
until it soon plummets again.

I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
is booted now).

So, what I'm looking for is:
* Does anyone else see this happen when doing a lot of concurrent writes?
* What does this flush thing do?
* Why is it using 100% cpu all the time?
* How can I debug this more?
* Ultimately, of course... how can we improve this?

I can recompile the kernel image to e.g. put more trace points in, in
different places.

I just have no idea where to start.

Thanks,
Hans

P.S. /proc/<pid>/stack for the kworker/u16:X+flush-btrfs-2 mostly shows
nothing at all, and sometimes when it does show some output, it mostly
looks like this:

----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] submit_extent_page+0x102/0x210 [btrfs]
[<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
[<0>] __extent_writepage+0x101/0x300 [btrfs]
[<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[<0>] extent_writepages+0x44/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40

----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] extent_writepages+0x5d/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40

----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] submit_extent_page+0x102/0x210 [btrfs]
[<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
[<0>] __extent_writepage+0x101/0x300 [btrfs]
[<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[<0>] extent_writepages+0x44/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40

[-- Attachment #2: 2020-07-25-perf-report.txt --]
[-- Type: text/plain, Size: 6268 bytes --]

Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 311991500000
  Children      Self  Command          Shared Object         Symbol
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] secondary_startup_64
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] cpu_startup_entry
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] do_idle
+   80.52%     0.00%  swapper          [kernel.kallsyms]     [k] default_idle
+   80.52%    80.51%  swapper          [kernel.kallsyms]     [k] native_safe_halt
+   63.42%     0.00%  swapper          [kernel.kallsyms]     [k] start_secondary
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] ret_from_fork
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] kthread
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] worker_thread
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] process_one_work
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] wb_workfn
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] wb_writeback
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] __writeback_inodes_wb
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] writeback_sb_inodes
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] __writeback_single_inode
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] do_writepages
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] extent_writepages
+   19.08%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] extent_write_cache_pages
+   19.07%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] __extent_writepage
+   19.04%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] writepage_delalloc
+   19.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_run_delalloc_range
+   19.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] cow_file_range
+   19.01%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_reserve_extent
+   19.01%     4.26%  kworker/u16:11+  [kernel.kallsyms]     [k] find_free_extent
+   17.11%     0.00%  swapper          [kernel.kallsyms]     [k] start_kernel
+    7.18%     7.18%  kworker/u16:11+  [kernel.kallsyms]     [k] _raw_spin_lock
+    4.45%     4.45%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_get_block_group
+    1.49%     1.34%  kworker/u16:11+  [kernel.kallsyms]     [k] down_read
+    0.56%     0.56%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_put_block_group
     0.50%     0.50%  kworker/u16:11+  [kernel.kallsyms]     [k] __raw_callee_save___pv_queued_spin_unlock
     0.44%     0.44%  kworker/u16:11+  [kernel.kallsyms]     [k] up_read
     0.28%     0.13%  kworker/u16:11+  [kernel.kallsyms]     [k] _cond_resched
     0.15%     0.15%  kworker/u16:11+  [kernel.kallsyms]     [k] rcu_all_qs
     0.05%     0.00%  nmon             [kernel.kallsyms]     [k] entry_SYSCALL_64_after_hwframe
     0.05%     0.00%  nmon             [kernel.kallsyms]     [k] do_syscall_64
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] entry_SYSCALL_64_after_hwframe
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] do_syscall_64
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] ret_from_fork
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] kthread
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] transaction_kthread
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] btrfs_commit_transaction
     0.04%     0.00%  perf_5.7         libpthread-2.28.so    [.] __libc_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] ksys_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] vfs_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] new_sync_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] btrfs_file_write_iter
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]     [k] btrfs_buffered_write.isra.28
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] ret_from_fork
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] kthread
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] kswapd
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] balance_pgdat
     0.03%     0.00%  kswapd0          [kernel.kallsyms]     [k] shrink_node
     0.03%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] btrfs_run_delayed_refs
     0.03%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] __btrfs_run_delayed_refs
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] __extent_writepage_io
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] submit_one_bio
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_submit_bio_hook
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]     [k] btrfs_map_bio
     0.02%     0.00%  kswapd0          [kernel.kallsyms]     [k] shrink_slab
     0.02%     0.00%  kswapd0          [kernel.kallsyms]     [k] do_shrink_slab
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] ret_from_fork
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] kthread
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] worker_thread
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] process_one_work
     0.02%     0.00%  kswapd0          [kernel.kallsyms]     [k] super_cache_scan
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]     [k] btrfs_work_helper
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] ret_from_fork
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] kthread
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] worker_thread
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] process_one_work
     0.02%     0.00%  perf_5.7         [unknown]             [k] 0xffffffff9d9bc42e
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]     [k] btrfs_work_helper
     0.02%     0.00%  btrfs-transacti  [kernel.kallsyms]     [k] alloc_reserved_file_extent
     0.02%     0.00%  nmon             libc-2.28.so          [.] open64
     0.02%     0.00%  nmon             [kernel.kallsyms]     [k] do_sys_open

[-- Attachment #3: 2020-07-25-nmon.txt --]
[-- Type: text/plain, Size: 5851 bytes --]

---- >8 ---- CPU Utilisation ---- >8 ----

---------------------------+-------------------------------------------------+
CPU User%  Sys% Wait% Steal|0          |25         |50          |75       100|
  1   5.2   3.1  90.2   1.6|UUsWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW >
  2   0.0   0.0  99.5   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  3   0.5   0.0  99.5   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  4   0.0   0.0 100.0   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  5   0.0 100.0   0.0   0.0|sssssssssssssssssssssssssssssssssssssssssssssssss>
  6   0.0   0.0 100.0   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  7   1.0   0.5  98.0   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW >
  8   1.0   0.0  98.5   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
---------------------------+-------------------------------------------------+
Avg   0.9  13.0  85.7   0.3|ssssssWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>|
---------------------------+-------------------------------------------------+

---- >8 ---- Memory and Swap ---- >8 ----

 PageSize:4KB   RAM-Memory  Swap-Space       High-Memory     Low-Memory
 Total (MB)        16015.3         0.0       - not in use   - not in use
 Free  (MB)         1471.1         0.0
 Free Percent          9.2%        0.0%
 Linux Kernel Internal Memory (MB)
                        Cached=   12582.6     Active=    3756.1
 Buffers=       0.2 Swapcached=       0.0  Inactive =    8994.9
 Dirty  =    2387.0 Writeback =       0.5  Mapped   =      27.3
 Slab   =    1653.0 Commit_AS =     321.2 PageTables=       4.4

---- >8 ---- Kernel and Load Average ---- >8 ----

Global-CPU-Stats---->     7.5% user      Load Average  CPU use since boottime
 /proc/stat line 1        0.0% user_nice  1 mins 14.02 Uptime Days Hours Mins
100 ticks per second    101.1% system     5 mins 14.05 Uptime   20    15   19
100%=1 CPUcoreThread      0.0% idle      15 mins 13.33 Idle     15   335 17633
       3   RunQueue     677.8% iowait                  Uptime has overflowed
      13   Blocked        0.0% irq
    2338.1 Context        2.0% softirq                 8 CPU core threads
           Switch         2.5% steal
       0.0 Forks          0.0% guest                   Boot time 1593903119
    3174.7 Interrupts     0.0% guest_nice              12:51 AM 05-Jul-2020

---- >8 ---- Network I/O ---- >8 ----

I/F Name Recv=KB/s Trans=KB/s packin packout insize outsize Peak->Recv Trans
      lo      0.0       0.0       0.0    0.0     0.0    0.0        0.0      0.0
    eth0  12250.1      88.6     968.6  999.0 12950.7   90.8   388740.8   2413.5

---- >8 ---- Disk I/O -- /proc/diskstats -- mostly in KB/s ---- >8 ----

DiskName Busy  Read WriteKB|0          |25         |50          |75       100|
xvda       0%    0.0    0.0|                                                 >
xvdb       0%   31.9    0.0|R                                                >
xvdc      23%    0.0 1171.3|WWWWWWWWWWWW                                     >
xvdd       0%    0.0    0.0|                                                 >
xvde      39%    0.0  970.1|WWWWWWWWWWWWWWWWWWWW                             >
Totals Read-MB/s=0.0      Writes-MB/s=2.1      Transfers/sec=21.9

---- >8 ---- Top Processes Procs=260-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Args ---- >8 ----

PID        %CPU      Size       Res      Res       Res       Res      Shared   Faults   Faults Command
              Used        KB       Set      Text      Data       Lib        KB      Min      Maj
     7909     99.6         0         0         0         0         0         0        0        0 kworker/u16:2+flush-btrfs-2
    22780      1.5     19780     10548       400      4588         0      5768        0        0 ssh
    21840      1.0     19772     10608       400      4580         0      5836        0        0 ssh
    22638      1.0     19776     10616       400      4584         0      5844        0        0 ssh
    23115      1.0     19772     10592       400      4580         0      5824        0        0 ssh
    23460      1.0     19816     10644       400      4624         0      5828        0        0 ssh
    23571      1.0     19780     10564       400      4588         0      5784        0        0 ssh
    23759      1.0     19768     10472       400      4576         0      5708        0        0 ssh
    23770      1.0     19740     10588       400      4548         0      5848        0        0 ssh
    30708      1.0     19996     10752       400      4804         0      5756        0        0 ssh
    30710      1.0      8544      1124       396       392         0       960        0        0 btrfs
      993      0.5      8672      3160       104      4652         0       440        0        0 nmon
    10290      0.5     19844     10688       400      4652         0      5848        0        0 ssh
    18186      0.5         0         0         0         0         0         0        0        0 kworker/u16:6-btrfs-endio-write
    18616      0.5      8544      1204       396       392         0      1036        0        0 btrfs
    18659      0.5      5420       788        56       440         0       724        0        0 pv
    18660      0.5      8544      1196       396       392         0      1032        0        0 btrfs
    21842      0.5      7896      1940       328      2300         0      1344        0        0 rsync
    22807      0.5      7896      1984       328      2300         0      1384        0        0 rsync
    23461      0.5      7896      1928       328      2300         0      1332        0        0 rsync
    23467      0.5     19724     10580       400      4532         0      5856        0        0 ssh
    23468      0.5      8476      3084       328      2880         0      1272        0        0 rsync

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
@ 2020-07-25 15:37 ` Holger Hoffstätte
  2020-07-25 16:43   ` Hans van Kranenburg
  2020-07-25 21:27 ` Hans van Kranenburg
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Holger Hoffstätte @ 2020-07-25 15:37 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

On 2020-07-25 16:24, Hans van Kranenburg wrote:
> Hi,
> 
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.
> 
> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.

<snip>

I cannot speak to anything btrfs-specific (other than the usual write
storms), however..

> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30

..this means that you have CONFIG_BLK_WBT{_MQ} enabled and are using
an IO scheduler that observes writeback throttling. AFAIK all MQ-capable
schedulers (also SQ ones in 4.19 IIRC!) do so except for BFQ, which has
its own mechanism to regulate fairness vs. latency and explicitly turns
WBT off.

WBT aka 'writeback throttling' throttles background writes according to
latency/throughput of the underlying block device in favor of readers.
It is meant to protect interactive/low-latency/desktop apps from heavy
bursts of background writeback activity. I tested early versions and
provided feedback to Jens Axboe; it really is helpful when it works,
but obviously cannot cater to every situation. There have been reports
that it is unhelpful for write-only/heavy workloads and may lead to
queueing pileup.

You can tune the expected latency of device writes via:
/sys/block/sda/queue/wbt_lat_usec.

You might also check whether your vm.dirty_{background}_bytes and
vm.dirty_expire_centisecs are too high; distro defaults almost always
are. Lowering them leads to more evenly spaced out write traffic.
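
For example, roughly like this (device name and values purely
illustrative, pick ones that match your setup):

  # raise the latency target so writeback is throttled less aggressively
  echo 75000 > /sys/block/sda/queue/wbt_lat_usec
  # cap dirty memory with absolute byte values instead of ratios
  sysctl -w vm.dirty_background_bytes=$((256*1024*1024))
  sysctl -w vm.dirty_bytes=$((1*1024*1024*1024))
  sysctl -w vm.dirty_expire_centisecs=1500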

Without knowing more it's difficult to say exactly what is going on,
but if your underlying storage has latency spikes it might be that
you are very likely looking at queueing pileup caused by multiple WBTs
choking each other. Having other unrelated queueing & throttling
mechanisms (in your case the network) in the mix is unlikely to help.
I'm not going to comment on iSCSI in general.. :^)

OTOH I also have 10G networking here and no such problems, even when
pushing large amounts of data over NFS at ~750 MB/s - and I have WBT
enabled everywhere.

So maybe start small and either ramp up the wbt latency sysctl or
decrease dirty_background bytes to start flushing sooner, depending
on how it's set. As last resort you can rebuild your kernels with
CONFIG_BLK_WBT/CONFIG_BLK_WBT_MQ disabled.

Hope this helps!

-h

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 15:37 ` Holger Hoffstätte
@ 2020-07-25 16:43   ` Hans van Kranenburg
  2020-07-25 19:44     ` Holger Hoffstätte
  0 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-25 16:43 UTC (permalink / raw)
  To: Holger Hoffstätte, linux-btrfs

On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
> On 2020-07-25 16:24, Hans van Kranenburg wrote:
>> Hi,
>>
>> I have a filesystem here that I'm filling up with data from elsewhere.
>> Most of it is done by rsync, and part by send/receive. So, receiving
>> data over the network, and then writing the files to disk. There can be
>> a dozen of these processes running in parallel.
>>
>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>> varying X) often is using nearly 100% cpu, while enormously slowing down
>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>> processes.
> 
> <snip>
> 
> I cannot speak to anything btrfs-specific (other than the usual write
> storms), however..
> 
>> [<0>] rq_qos_wait+0xfa/0x170
>> [<0>] wbt_wait+0x98/0xe0
>> [<0>] __rq_qos_throttle+0x23/0x30

I need to cat /proc/<pid>/stack a huge number of times in a loop to once
in a while get this sort of output shown.
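
The loop is nothing fancy, by the way, just something along these lines,
with the pid of the kworker thread taken from top/nmon:

  pid=7909   # the kworker/u16:X+flush-btrfs-2 thread at that moment
  while sleep 0.1; do cat /proc/$pid/stack; echo ----; done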

> ..this means that you have CONFIG_BLK_WBT{_MQ} enabled and are using
> an IO scheduler that observes writeback throttling. AFAIK all MQ-capable
> schedulers (also SQ ones in 4.19 IIRC!) do so except for BFQ, which has
> its own mechanism to regulate fairness vs. latency and explicitly turns
> WBT off.
> 
> WBT aka 'writeback throttling' throttles background writes according to
> latency/throughput of the underlying block device in favor of readers.
> It is meant to protect interactive/low-latency/desktop apps from heavy
> bursts of background writeback activity. I tested early versions and
> provided feedback to Jens Axboe; it really is helpful when it works,
> but obviously cannot cater to every situation. There have been reports
> that it is unhelpful for write-only/heavy workloads and may lead to
> queueing pileup.
> 
> You can tune the expected latency of device writes via:
> /sys/block/sda/queue/wbt_lat_usec.

Yes, I have been playing around with it earlier, without any effect on
the symptoms.

I just did this again, echo 0 > all of the involved block devices. When
looking at the events/wbt trace point, I see that wbt activity stops at
that moment.

No difference in symptoms.
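
Concretely, that was something like this (xvd* device names as in the
nmon output above; adjust as needed):

  for f in /sys/block/xvd?/queue/wbt_lat_usec; do echo 0 > $f; done
  echo 1 > /sys/kernel/debug/tracing/events/wbt/enable
  cat /sys/kernel/debug/tracing/trace_pipe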

> You might also check whether your vm.dirty_{background}_bytes and
> vm.dirty_expire_centisecs are too high; distro defaults almost always
> are. Lowering them leads to more evenly spaced out write traffic.

Dirty buffers were ~ 2G in size. I can modify the numbers to make it
bigger or smaller. There's absolutely no change in behavior of the system.

> Without knowing more it's difficult to say exactly what is going on,
> but if your underlying storage has latency spikes

It doesn't. It's idle, waiting to finally get some data sent to it.

> it might be that
> you are very likely looking at queueing pileup caused by multiple WBTs
> choking each other. Having other unrelated queueing & throttling
> mechanisms (in your case the network) in the mix is unlikely to help.
> I'm not going to comment on iSCSI in general.. :^)
> 
> OTOH I also have 10G networking here and no such problems, even when
> pushing large amounts of data over NFS at ~750 MB/s - and I have WBT
> enabled everywhere.
> 
> So maybe start small and either ramp up the wbt latency sysctl or
> decrease dirty_background bytes to start flushing sooner, depending
> on how it's set. As last resort you can rebuild your kernels with
> CONFIG_BLK_WBT/CONFIG_BLK_WBT_MQ disabled.

All processing speed is inversely proportional to the cpu usage of this
kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
everything slows down. The network is idle, the disks are idle. Incoming
rsync speed drops, the speed in which btrfs receive is reading input
drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
below ~ 95% again, throughput goes up.

I do not see how writeback problems would result in having a
kworker/u16:X+flush-btrfs-2 do 100% cpu. I think it's the other way
round, and that's why I want to know what this thread is actually busy
with doing instead of shoveling the data towards the disks.

As far as I can see, all the iowait reported by rsync and btrfs receive
is because they want to give their writes to the btrfs code in the
kernel, but this kworker/u16:X+flush-btrfs-2 is in their way, so they
are blocked. Even before anything gets queued anywhere. Or doesn't that
make sense?

So, the problem is located *before* all the things you mention above
even come into play yet.

Hans

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 16:43   ` Hans van Kranenburg
@ 2020-07-25 19:44     ` Holger Hoffstätte
  2020-07-25 21:03       ` Hans van Kranenburg
  0 siblings, 1 reply; 17+ messages in thread
From: Holger Hoffstätte @ 2020-07-25 19:44 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

On 2020-07-25 18:43, Hans van Kranenburg wrote:
> On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
>>> [<0>] rq_qos_wait+0xfa/0x170
>>> [<0>] wbt_wait+0x98/0xe0
>>> [<0>] __rq_qos_throttle+0x23/0x30
> 
> I need to cat /proc/<pid>/stack a huge number of times in a loop to once
> in a while get this sort of output shown.

Ahh! So that means you are just getting lucky and catching the occasional
throttle in action. Ok.

>> You can tune the expected latency of device writes via:
>> /sys/block/sda/queue/wbt_lat_usec.
> 
> Yes, I have been playing around with it earlier, without any effect on
> the symptoms.
> 
> I just did this again, echo 0 > all of the involved block devices. When
> looking at the events/wbt trace point, I see that wbt activity stops at
> that moment.
> 
> No difference in symptoms.

Ok. Worth a shot..but in retrospect not really, since I just realized
that I had your entire scenario backwards. Nevermind then.

> Dirty buffers were ~ 2G in size. I can modify the numbers to make it
> bigger or smaller. There's absolutely no change in behavior of the system.

Ok. I also have 2GB max-dirty and 1GB writeback-dirty, which is
plenty for a 10G pipe.

> It doesn't. It's idle, waiting to finally get some data sent to it.

Yup, looks like it.

> All processing speed is inversely proportional to the cpu usage of this
> kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
> everything slows down. The network is idle, the disks are idle. Incoming
> rsync speed drops, the speed in which btrfs receive is reading input
> drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
> below ~ 95% again, throughput goes up.

Only thing I can think of is that your rsync runs create insane amounts
of small COW extents that need to be ordered/merged. Multiply by number of
processes and you're probably hitting some super contended part. Since
the kworker isn't stuck forever but apparently makes progress it's not
dead, just slow/overloaded.

A few years ago I started using rsync exclusively with --whole-file since
it's not just much faster but also creates significantly less fragmentation
at the expense of some disk space, but whatever...in my case snapshot rotation
takes care of that. Maybe it's an option for you.
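
I.e. nothing more exotic than adding the plain option, something like
(paths obviously illustrative):

  rsync -a --whole-file /source/ /destination/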

So maybe things to try:

- run rsync with --whole-file
- run fewer rsyncs in parallel (might not be necessary with --whole-file!)

If that doesn't help someone else needs to chime in..

-h

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 19:44     ` Holger Hoffstätte
@ 2020-07-25 21:03       ` Hans van Kranenburg
  2020-07-26  1:00         ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-25 21:03 UTC (permalink / raw)
  To: Holger Hoffstätte, linux-btrfs

On 7/25/20 9:44 PM, Holger Hoffstätte wrote:
> On 2020-07-25 18:43, Hans van Kranenburg wrote:
>> On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
>>>> [<0>] rq_qos_wait+0xfa/0x170
>>>> [<0>] wbt_wait+0x98/0xe0
>>>> [<0>] __rq_qos_throttle+0x23/0x30
>>
>> I need to cat /proc/<pid>/stack a huge number of times in a loop to once
>> in a while get this sort of output shown.
> 
> Ahh! So that means you are just getting lucky and catching the occasional
> throttle in action. Ok.
> 
>>> You can tune the expected latency of device writes via:
>>> /sys/block/sda/queue/wbt_lat_usec.
>>
>> Yes, I have been playing around with it earlier, without any effect on
>> the symptoms.
>>
>> I just did this again, echo 0 > all of the involved block devices. When
>> looking at the events/wbt trace point, I see that wbt activity stops at
>> that moment.
>>
>> No difference in symptoms.
> 
> Ok. Worth a shot..but in retrospect not really, since I just realized
> that I had your entire scenario backwards. Nevermind then.

np

>> Dirty buffers were ~ 2G in size. I can modify the numbers to make it
>> bigger or smaller. There's absolutely no change in behavior of the system.
> 
> Ok. I also have 2GB max-dirty and 1GB writeback-dirty, which is
> plenty for a 10G pipe.
> 
>> It doesn't. It's idle, waiting to finally get some data sent to it.
> 
> Yup, looks like it.
> 
>> All processing speed is inversely proportional to the cpu usage of this
>> kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
>> everything slows down. The network is idle, the disks are idle. Incoming
>> rsync speed drops, the speed in which btrfs receive is reading input
>> drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
>> below ~ 95% again, throughput goes up.
> 
> Only thing I can think of is that your rsync runs create insane amounts
> of small COW extents that need to be ordered/merged. Multiply by number of
> processes and you're probably hitting some super contended part.

Yes, that's what we're looking at and I see that my other email sent to
the list some hours ago didn't show up yet. Maybe I should resend with a
link to the attachment instead of carrying an attachment, or I'll wait
until tomorrow.

Hm,

Jul 25 21:06:44 syrinx postfix/smtp[32364]: 155C4609B67CF:
to=<linux-btrfs@vger.kernel.org>,
relay=vger.kernel.org[2620:137:e000::1:18]:25, delay=2.2,
delays=0.15/0.02/1.2/0.87, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued as
7B1D9C08C5C0)
Jul 25 21:06:44 syrinx postfix/qmgr[951]: 155C4609B67CF: removed

O_o, so it was accepted, but I don't see it.

> Since
> the kworker isn't stuck forever but apparently makes progress it's not
> dead, just slow/overloaded.
> 
> A few years ago I started using rsync exclusively with --whole-file since
> it's not just much faster but also creates significantly less fragmentation
> at the expense of some disk space, but whatever...in my case snapshot rotation
> takes care of that. Maybe it's an option for you.
> 
> So maybe things to try:
> 
> - run rsync with --whole-file
> - run fewer rsyncs in parallel (might not be necessary with --whole-file!)
> 
> If that doesn't help someone else needs to chime in..

That someone else could be me, if only I had the work time available to
do it, instead of using hobby time, which for the past few months has
rather been spent optimizing the vegetable garden and cutting down trees
for firewood together with the neighbours.

And yes, all rsync is already running with --whole-file since 2014 to
reduce the total amount of reflink garbage in this now ~100T pile of
stuff with >200k subvols.

Doing fewer things in parallel does help, indeed. If I abort everything
and just run 1 of the 20 at a time, it does something. But, I want to
get this over with before the year 2050 happens.

K

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
  2020-07-25 15:37 ` Holger Hoffstätte
@ 2020-07-25 21:27 ` Hans van Kranenburg
  2020-07-26  8:10   ` A L
  2020-07-26  0:50 ` Chris Murphy
  2020-07-27 11:09 ` Qu Wenruo
  3 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-25 21:27 UTC (permalink / raw)
  To: linux-btrfs

Ok,

On 7/25/20 4:24 PM, Hans van Kranenburg wrote:
> Hi,
> 
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.
> 
> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.

Both rsync and btrfs receive are handling incoming streams of data which
contain huge numbers of files of all kinds of sizes, but also a lot of
small files. (E.g. an imap server receive job is processing 1G split send
dump files right now, with maildir storage putting every email in a
separate file, and there are also customers storing tiny XML fragments
from XML webservice calls in files to process later.)

> The underlying storage (iSCSI connected over 10Gb/s network) can easily
> eat a few hundred MiB/s. When looking at actual disk busyness on the
> storage device, percentages <5% utilization are reported for the actual
> disks.
> 
> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
> 
> I just did a 'perf record -g -a sleep 60' while disk writes were down to
> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
> shot' of it. Also attached an example of what nmon shows to give an idea
> about the situation.
> 
> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
> see network and disk write speed ramping up, easily over 200 MiB/s,
> until it soon plummets again.
> 
> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
> is booted now).
> 
> So, what I'm looking for is:
> * Does anyone else see this happen when doing a lot of concurrent writes?
> * What does this flush thing do?

I have been playing around with the available trace point stuff in
/sys/kernel/debug/tracing/events/btrfs a bit more.
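
That boils down to nothing more than enabling the stock trace events and
reading the pipe, roughly like this:

  cd /sys/kernel/debug/tracing
  echo 1 > events/btrfs/find_free_extent/enable
  echo 1 > events/btrfs/btrfs_cow_block/enable
  cat trace_pipe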

After rsync or btrfs receive has some new data available for the fs to
write to disk... What needs to happen before it can actually go into the
write cache to end up on disk?

Well, we need to find a place where to put it.

What's kworker/u16:X+flush-btrfs-2 doing?

It's doing find_free_extent.

> * Why is it using 100% cpu all the time?

Because it spends all its time doing find_free_extent. And almost all of
it is happening for EXTENT_TREE.

    len=16384 empty_size=0 flags=4(METADATA)

I could have known :O And again it's the extent allocator making my life
miserable.

> * How can I debug this more?

I will read more relevant code and see if I can put some more custom
trace points in. I will also have a closer look at the empty_size thing,
again...

> * Ultimately, of course... how can we improve this?

Well, by being able to have an idea of the 'pressure' on it. If we're
slowing everyone down a lot, switch to some more yolo allocation which
just appends all new writes and flushes them to disk, instead of spending
100% kernel cpu to find some tiny old 16k free metadata space hole to put
a single write in.

So, yes, I could hack around a bit in the code again and flip around
some mount option which does the 'ssd_spread' behavior for metadata
only. I did that on another filesystem a while ago, which helped a lot
for throughput, but it also allocated a ridiculous amount of new
metadata block groups.

I'd like to redesign all of this, but -EPOSSIBLE because -EWORKTIME

Now I have a btrfs receive running for >1.5T of customer garbage...
Receive cannot stop and resume. So am I going to abort and redo it while
it's at 75.2GiB 5:57:46 [1.44MiB/s]? Or just sit it out?

Stay tuned for the next episode of Knorrie's btrfs adventures! :)

K


* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
  2020-07-25 15:37 ` Holger Hoffstätte
  2020-07-25 21:27 ` Hans van Kranenburg
@ 2020-07-26  0:50 ` Chris Murphy
  2020-07-27 11:09 ` Qu Wenruo
  3 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2020-07-26  0:50 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Btrfs BTRFS

On Sat, Jul 25, 2020 at 8:25 AM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> Hi,
>
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.

Each receive goes into its own subvolume. What about the rsyncs? How
many concurrent and how many per subvolume?



> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.
>
> The underlying storage (iSCSI connected over 10Gb/s network) can easily
> eat a few hundred MiB/s. When looking at actual disk busyness on the
> storage device, percentages <5% utilization are reported for the actual
> disks.
>
> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.


>
> I just did a 'perf record -g -a sleep 60' while disk writes were down to
> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
> shot' of it. Also attached an example of what nmon shows to give an idea
> about the situation.

Why is swapper involved in this?

+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] secondary_startup_64
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] cpu_startup_entry
+   80.53%     0.00%  swapper          [kernel.kallsyms]     [k] do_idle
+   80.52%     0.00%  swapper          [kernel.kallsyms]     [k] default_idle
+   80.52%    80.51%  swapper          [kernel.kallsyms]     [k] native_safe_halt
+   63.42%     0.00%  swapper          [kernel.kallsyms]     [k] start_secondary

What's PSI look like while this problem is happening?

# grep . /proc/pressure/*
# cat /proc/meminfo

>
> So, what I'm looking for is:
> * Does anyone else see this happen when doing a lot of concurrent writes?
> * What does this flush thing do?
> * Why is it using 100% cpu all the time?
> * How can I debug this more?
> * Ultimately, of course... how can we improve this?
>
> I can recompile the kernel image to e.g. put more trace points in, in
> different places.

What about using bcc-tools to check for high latencies? I'd maybe
start with fileslower, syncsnoop, btrfsslower, biolatency. This is a
bit of a firehose at first, but not nearly as much as perf.
Strategy-wise you could start either at the top with fileslower or at
the bottom with biolatency. fileslower may not tell you anything you
don't already know, which is that everything is f'n slow and waiting on
something. So, OK, is there something hammering on sync()? Try that
next. And so on. There's a bunch of these tools:

https://github.com/iovisor/bcc/blob/master/images/bcc_tracing_tools_2019.png
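
For example (assuming the stock install location under
/usr/share/bcc/tools; adjust to your distro):

  /usr/share/bcc/tools/biolatency -D 10     # per-disk I/O latency histograms every 10s
  /usr/share/bcc/tools/btrfsslower 10       # btrfs file ops slower than 10 ms
  /usr/share/bcc/tools/fileslower 10        # synchronous reads/writes slower than 10 ms
  /usr/share/bcc/tools/syncsnoop            # who is calling sync()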


--
Chris Murphy

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 21:03       ` Hans van Kranenburg
@ 2020-07-26  1:00         ` Chris Murphy
  0 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2020-07-26  1:00 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Holger Hoffstätte, Btrfs BTRFS

On Sat, Jul 25, 2020 at 3:03 PM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> And yes, all rsync is already running with --whole-file since 2014 to
> reduce the total amount of reflink garbage in this now ~100T pile of
> stuff with >200k subvols.

OK. So I was thinking maybe lock contention on subvols. But with that many
subvolumes I'm willing to bet every task has its own.

And you're already ahead of where I'm at anyway. A search for free
extent sounds like free space tree contention. Does this suck any less
if there's no space cache?
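
E.g., purely as a test (exact options depend on which cache is in use):

  # fresh mount without the free space cache
  mount -o nospace_cache /dev/xvdX /mountpoint
  # or clear the v1 cache entirely while unmounted
  btrfs check --clear-space-cache v1 /dev/xvdX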


-- 
Chris Murphy

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 21:27 ` Hans van Kranenburg
@ 2020-07-26  8:10   ` A L
  0 siblings, 0 replies; 17+ messages in thread
From: A L @ 2020-07-26  8:10 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs


> Now I have a btrfs receive running for >1.5T of customer garbage...
> Receive cannot stop and resume. So am I going to abort and redo it while
> it's at 75.2GiB 5:57:46 [1.44MiB/s]? Or just sit it out?
>
> Stay tuned for the next episode of Knorrie's btrfs adventures! :)
Always interesting stories :)

Anyway, can you throttle your send|receive by putting mbuffer in
between? I myself have had quite good success with mbuffer in reducing
overall I/O, not to mention the ability to limit the overall speed.
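
Something like this, for example (buffer size and rate limit purely
illustrative):

  btrfs send /path/to/snapshot | mbuffer -m 1G -r 100M | btrfs receive /target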


A

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
                   ` (2 preceding siblings ...)
  2020-07-26  0:50 ` Chris Murphy
@ 2020-07-27 11:09 ` Qu Wenruo
  2020-07-27 17:17   ` Hans van Kranenburg
  3 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2020-07-27 11:09 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs



On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
> Hi,
>
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.
>
> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.

The name is in fact pretty strange.
It doesn't follow the btrfs workqueue names.

There are two types of kernel threads used by btrfs:
- kthread
  This includes:
  * btrfs-uuid
  * btrfs-cleaner
  * btrfs-transaction
  * btrfs-devrepl
  * btrfs-ino-cache-*llu (deprecated)
  * btrfs-balance

- btrfs_workqueue (a wrapper around the kernel workqueue)
  This mostly includes the following ones; some of them may have another
  workqueue with a "-high" suffix:
  * btrfs-worker
  * btrfs-delalloc
  * btrfs-flush_delalloc
  * btrfs-cache
  * btrfs-fixup
  * btrfs-endio
  * btrfs-endio-meta
  * btrfs-endio-meta-write
  * btrfs-endio-raid56
  * btrfs-rmw
  * btrfs-endio-write
  * btrfs-freespace-write
  * btrfs-delayed-meta
  * btrfs-readahead
  * btrfs-qgroup-rescan
  * btrfs-scrub
  * btrfs-scrubwrc
  * btrfs-scrubparity

  (No wonder Linus is not happy with so many work queues)

As you can see, there is no one named like "flush-btrfs".

Thus I guess it's from another part of the stack.

Also, the call trace shows that the kernel thread is only doing
page writeback, which calls back into the page write hooks of btrfs.

So I guess it may not be btrfs, but something else trying to do all the
writeback.

But still, the CPU usage is a problem; it shouldn't cost so much
CPU time just to write back pages from btrfs.


>
> The underlying storage (iSCSI connected over 10Gb/s network) can easily
> eat a few hundred MiB/s. When looking at actual disk busyness on the
> storage device, percentages <5% utilization are reported for the actual
> disks.
>
> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>
> I just did a 'perf record -g -a sleep 60' while disk writes were down to
> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
> shot' of it. Also attached an example of what nmon shows to give an idea
> about the situation.
>
> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
> see network and disk write speed ramping up, easily over 200 MiB/s,
> until it soon plummets again.
>
> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
> is booted now).
>
> So, what I'm looking for is:
> * Does anyone else see this happen when doing a lot of concurrent writes?
> * What does this flush thing do?
> * Why is it using 100% cpu all the time?
> * How can I debug this more?

bcc based runtime probes I guess?

Since it's almost a dead CPU burner loop, regular sleep based lockup
detector won't help much.

You can try trace events first to see which trace event gets executed the
most frequently, then try to add probe points to pin down the real cause.
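
E.g. just counting the btrfs trace events for a while should already show
which one dominates, something like:

  perf stat -e 'btrfs:*' -a -- sleep 10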

But personally speaking, it's better to shrink the workload, to find a
minimal workload that reproduces the 100% CPU burn, so that you need fewer
probes and less time to pin down the problem.

> * Ultimately, of course... how can we improve this?
>
> I can recompile the kernel image to e.g. put more trace points in, in
> different places.
>
> I just have no idea where to start.
>
> Thanks,
> Hans
>
> P.S. /proc/<pid>/stack for the kworker/u16:X+flush-btrfs-2 mostly shows
> nothing at all, and sometimes when it does show some output, it mostly
> looks like this:
>
> ----
> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30
> [<0>] blk_mq_make_request+0x12a/0x5d0
> [<0>] generic_make_request+0xcf/0x310
> [<0>] submit_bio+0x42/0x1c0
> [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
> [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
> [<0>] submit_one_bio+0x31/0x50 [btrfs]
> [<0>] submit_extent_page+0x102/0x210 [btrfs]
> [<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
> [<0>] __extent_writepage+0x101/0x300 [btrfs]
> [<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
> [<0>] extent_writepages+0x44/0x90 [btrfs]
> [<0>] do_writepages+0x41/0xd0

Yeah, it's just trying to write dirty pages for btrfs.

I don't really believe it's btrfs causing the 100% CPU burn.

Maybe you could just try running btrfs on bare storage, without iSCSI,
just to verify whether it's btrfs that is to blame?

Thanks,
Qu

> [<0>] __writeback_single_inode+0x3d/0x340
> [<0>] writeback_sb_inodes+0x1e5/0x480
> [<0>] __writeback_inodes_wb+0x5d/0xb0
> [<0>] wb_writeback+0x25f/0x2f0
> [<0>] wb_workfn+0x2fe/0x3f0
> [<0>] process_one_work+0x1ad/0x370
> [<0>] worker_thread+0x30/0x390
> [<0>] kthread+0x112/0x130
> [<0>] ret_from_fork+0x1f/0x40
>
> ----
> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30
> [<0>] blk_mq_make_request+0x12a/0x5d0
> [<0>] generic_make_request+0xcf/0x310
> [<0>] submit_bio+0x42/0x1c0
> [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
> [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
> [<0>] submit_one_bio+0x31/0x50 [btrfs]
> [<0>] extent_writepages+0x5d/0x90 [btrfs]
> [<0>] do_writepages+0x41/0xd0
> [<0>] __writeback_single_inode+0x3d/0x340
> [<0>] writeback_sb_inodes+0x1e5/0x480
> [<0>] __writeback_inodes_wb+0x5d/0xb0
> [<0>] wb_writeback+0x25f/0x2f0
> [<0>] wb_workfn+0x2fe/0x3f0
> [<0>] process_one_work+0x1ad/0x370
> [<0>] worker_thread+0x30/0x390
> [<0>] kthread+0x112/0x130
> [<0>] ret_from_fork+0x1f/0x40
>
> ----
> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30
> [<0>] blk_mq_make_request+0x12a/0x5d0
> [<0>] generic_make_request+0xcf/0x310
> [<0>] submit_bio+0x42/0x1c0
> [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
> [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
> [<0>] submit_one_bio+0x31/0x50 [btrfs]
> [<0>] submit_extent_page+0x102/0x210 [btrfs]
> [<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
> [<0>] __extent_writepage+0x101/0x300 [btrfs]
> [<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
> [<0>] extent_writepages+0x44/0x90 [btrfs]
> [<0>] do_writepages+0x41/0xd0
> [<0>] __writeback_single_inode+0x3d/0x340
> [<0>] writeback_sb_inodes+0x1e5/0x480
> [<0>] __writeback_inodes_wb+0x5d/0xb0
> [<0>] wb_writeback+0x25f/0x2f0
> [<0>] wb_workfn+0x2fe/0x3f0
> [<0>] process_one_work+0x1ad/0x370
> [<0>] worker_thread+0x30/0x390
> [<0>] kthread+0x112/0x130
> [<0>] ret_from_fork+0x1f/0x40
>

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-27 11:09 ` Qu Wenruo
@ 2020-07-27 17:17   ` Hans van Kranenburg
  2020-07-27 19:23     ` Chris Murphy
                       ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-27 17:17 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

Hi!

On 7/27/20 1:09 PM, Qu Wenruo wrote:
> 
> 
> On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
>> Hi,
>>
>> I have a filesystem here that I'm filling up with data from elsewhere.
>> Most of it is done by rsync, and part by send/receive. So, receiving
>> data over the network, and then writing the files to disk. There can be
>> a dozen of these processes running in parallel.
>>
>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>> varying X) often is using nearly 100% cpu, while enormously slowing down
>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>> processes.
> 
> The name is in fact pretty strange.
> It doesn't follow the btrfs workqueue names.
> 
> There are two type of kernel threads used by btrfs:
>   [...]
> 
> As you can see, there is no one named like "flush-btrfs".

Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
currently. ;]

> Thus I guess it's from another part of the stack.
> 
> Also, the calltrace also shows that, that kernel thread is only doing
> page writeback, which calls back to the page write hooks of btrfs.
> 
> So I guess it may not be btrfs, but something else trying to do all the
> writeback.

Yes, so, from the writeback path it re-enters btrfs code, right?

> But still, the CPU usage is still a problem, it shouldn't cost so much
> CPU time just writing back pages from btrfs.

It's find_free_extent which is the culprit.

>> [...]
>>
>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>>
>> I just did a 'perf record -g -a sleep 60' while disk writes were down to
>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
>> shot' of it. Also attached an example of what nmon shows to give an idea
>> about the situation.
>>
>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
>> see network and disk write speed ramping up, easily over 200 MiB/s,
>> until it soon plummets again.
>>
>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
>> is booted now).
>>
>> So, what I'm looking for is:
>> * Does anyone else see this happen when doing a lot of concurrent writes?
>> * What does this flush thing do?
>> * Why is it using 100% cpu all the time?
>> * How can I debug this more?
> 
> bcc based runtime probes I guess?
> 
> Since it's almost a dead CPU burner loop, regular sleep based lockup
> detector won't help much.

Here's a flame graph of 180 seconds, taken from the kernel thread pid:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
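
Captured roughly like this, assuming the usual perf + FlameGraph scripts
(stackcollapse-perf.pl / flamegraph.pl):

  perf record -g -p <pid-of-the-kworker> -- sleep 180
  perf script | stackcollapse-perf.pl | flamegraph.pl > flush-btrfs.svg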

> You can try trace events first to see which trace event get executed the
> most frequently, then try to add probe points to pin down the real cause.

From the default collection, I already got the following, a few days
ago, by enabling find_free_extent and btrfs_cow_block:

https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt

From the timestamps you can see how long this takes. It's not much that
gets done per second.

The spin lock part must be spin_lock(&space_info->lock) because that's
the only one in find_free_extent.

So, what I think is that, like I mentioned on Saturday already,
find_free_extent is probably doing things in a really inefficient way:
scanning around for a small single free space gap all the time, in a
really expensive way, and doing that over again for each tiny metadata
block or file that needs to be placed somewhere (also notice the
empty_size=0), instead of just throwing all of it on disk one after the
other when it's otherwise slowing everyone down.

And then we have the spin lock part and btrfs_get_block_group, which are
also a significant part of the time spent. All of this together makes up
the continuous 100% cpu.

What I *can* try is switch to the ssd_spread option, to force it to do
much more yolo allocation, but I'm afraid this will result in a sudden
blast of metadata block groups getting allocated. Or, maybe try it for a
minute or so and compare the trace pipe output...
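
I.e. something as simple as:

  mount -o remount,ssd_spread /mountpoint
  # ...watch trace_pipe for a minute...
  mount -o remount,nossd_spread /mountpoint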

> But personally speaking, it's better to shrink the workload, to find a
> minimal workload to reproduce the 100% CPU burn, so that you need less
> probes/time to pindown the problem.

I think I can reproduce it with a single btrfs receive operation, as
long as it has a large number of small files in it.

Currently I'm almost only copying data onto the filesystem; the only data
removals come from a targeted dedupe script for known big files that are
identical, but which could not easily be handled by rsync.

So, when find_free_extent wants to write extent tree blocks, it might be
playing outside in the woods all the time, searching around for little
holes that were caused by previous cow operations.

Large files, like postgresql database dumps, are not a problem. Copying
them over just runs at >200-300M/s without problem, where the read speed
of the old storage I'm moving things off is actually the bottleneck.

>> * Ultimately, of course... how can we improve this?
>>
>> [...]

I will see if I can experiment a bit with putting more trace points in.

I remember reading around in the extent allocator code, until I got
dizzy from it. Deja vu...

https://www.spinics.net/lists/linux-btrfs/msg70624.html

Thanks
Hans

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-27 17:17   ` Hans van Kranenburg
@ 2020-07-27 19:23     ` Chris Murphy
  2020-07-27 23:16     ` Chris Murphy
  2020-07-28  0:51     ` Qu Wenruo
  2 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2020-07-27 19:23 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Qu Wenruo, Btrfs BTRFS

On Mon, Jul 27, 2020 at 11:17 AM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>
> From the timestamps you can see how long this takes. It's not much that
> gets done per second.
>
> The spin lock part must be spin_lock(&space_info->lock) because that's
> the only one in find_free_extent.
>
> So, what I think is that, like I mentioned on saturday already,
> find_free_extent is probably doing things in a really inefficient way,
> scanning around for a small single free space gap all the time in a
> really expensive way, and doing that over again for each tiny metadata
> block or file that needs to be placed somewhere (also notice the
> empty_size=0), instead of just throwing all of it on disk after each
> other, when it's otherwise slowing down everyone.

What are the latencies for the iscsi device while this is happening?
Does your trace have the ability to distinguish between inefficiency
in find_free_extent versus waiting on IO latency?

However, this hypothesis could be tested by completely balancing all
metadata block groups, then trying to reproduce this workload. If it's
really this expensive to do metadata insertions, eliminating the need
for insertions should show up as a remarkable performance improvement
that degrades as metadata block groups become fragmented again.
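
E.g. something like a full metadata balance, which rewrites (and thereby
compacts) every metadata block group:

  btrfs balance start -m /mountpoint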


> What I *can* try is switch to the ssd_spread option, to force it to do
> much more yolo allocation, but I'm afraid this will result in a sudden
> blast of metadata block groups getting allocated. Or, maybe try it for a
> minute or so and compare the trace pipe output...

OK or this.


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-27 17:17   ` Hans van Kranenburg
  2020-07-27 19:23     ` Chris Murphy
@ 2020-07-27 23:16     ` Chris Murphy
  2020-07-28  0:51     ` Qu Wenruo
  2 siblings, 0 replies; 17+ messages in thread
From: Chris Murphy @ 2020-07-27 23:16 UTC (permalink / raw)
  To: Hans van Kranenburg; +Cc: Qu Wenruo, Btrfs BTRFS

On Mon, Jul 27, 2020 at 11:17 AM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> Hi!
>
> On 7/27/20 1:09 PM, Qu Wenruo wrote:
>
> > As you can see, there is no one named like "flush-btrfs".
>
> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
> currently. ;]

On a send/receive doing well over 200M/s I see brief moments of

      0      0      0 D  10.5   0.0   0:06.15 kworker/u16:15+flush-btrfs-4

But it seems to coincide with 30s intervals, consistent with default commit=30.

Are you using flushoncommit? Or a commit time other than 30? Maybe see
if the problem gets better with commit=10. I wonder if it's just too
long of a commit window for this particular workload.
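
For reference, a sketch of trying that without unmounting (the mount
point is a placeholder):

  # show the options currently in effect, including commit= if set
  findmnt -no OPTIONS /mnt
  # shrink the commit interval from the default 30s to 10s
  mount -o remount,commit=10 /mnt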


-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-27 17:17   ` Hans van Kranenburg
  2020-07-27 19:23     ` Chris Murphy
  2020-07-27 23:16     ` Chris Murphy
@ 2020-07-28  0:51     ` Qu Wenruo
  2020-07-28  1:52       ` Qu Wenruo
  2 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2020-07-28  0:51 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs



On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
> Hi!
>
> On 7/27/20 1:09 PM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
>>> Hi,
>>>
>>> I have a filesystem here that I'm filling up with data from elsewhere.
>>> Most of it is done by rsync, and part by send/receive. So, receiving
>>> data over the network, and then writing the files to disk. There can be
>>> a dozen of these processes running in parallel.
>>>
>>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>>> varying X) often is using nearly 100% cpu, while enormously slowing down
>>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>>> processes.
>>
>> The name is in fact pretty strange.
>> It doesn't follow the btrfs workqueue names.
>>
>> There are two type of kernel threads used by btrfs:
>>   [...]
>>
>> As you can see, there is no one named like "flush-btrfs".
>
> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
> currently. ;]
>
>> Thus I guess it's from other part of the stack.
>>
>> Also, the calltrace also shows that, that kernel thread is only doing
>> page writeback, which calls back to the page write hooks of btrfs.
>>
>> So I guess it may not be btrfs, but something else trying to do all the
>> writeback.
>
> Yes, so, from the writeback path it re-enters btrfs code, right?

Yep. Btrfs is still responsible for the problem.

>
>> But still, the CPU usage is still a problem, it shouldn't cost so much
>> CPU time just writing back pages from btrfs.
>
> It's find_free_extent which is the culprit.
>
>>> [...]
>>>
>>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>>>
>>> I just did a 'perf record -g -a sleep 60' while disk writes were down to
>>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
>>> shot' of it. Also attached an example of what nmon shows to give an idea
>>> about the situation.
>>>
>>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
>>> see network and disk write speed ramping up, easily over 200 MiB/s,
>>> until it soon plummets again.
>>>
>>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
>>> is booted now).
>>>
>>> So, what I'm looking for is:
>>> * Does anyone else see this happen when doing a lot of concurrent writes?
>>> * What does this flush thing do?
>>> * Why is it using 100% cpu all the time?
>>> * How can I debug this more?
>>
>> bcc based runtime probes I guess?
>>
>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>> detector won't help much.
>
> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg

That's really awesome!

>
>> You can try trace events first to see which trace event get executed the
>> most frequently, then try to add probe points to pin down the real cause.
>
> From the default collection, I already got the following, a few days
> ago, by enabling find_free_extent and btrfs_cow_block:
>
> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>
> From the timestamps you can see how long this takes. It's not much that
> gets done per second.
>
> The spin lock part must be spin_lock(&space_info->lock) because that's
> the only one in find_free_extent.
>
> So, what I think is that, like I mentioned on saturday already,
> find_free_extent is probably doing things in a really inefficient way,
> scanning around for a small single free space gap all the time in a
> really expensive way, and doing that over again for each tiny metadata
> block or file that needs to be placed somewhere (also notice the
> empty_size=0), instead of just throwing all of it on disk after each
> other, when it's otherwise slowing down everyone.

I think you're mostly right: find_free_extent() is not that efficient.

But that only becomes a problem when there are a lot of parallel
requests hitting it.

One example is btrfs_get_block_group().
That function only calls refcount_inc(), which normally shouldn't be
slow at all, unless there are tons of other callers contending here.
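
A crude way to see that concurrency, assuming the find_free_extent trace
event from your earlier capture is still enabled, is to count events per
task for half a minute:

  # tasks calling into find_free_extent most often over a 30s window
  timeout 30 cat /sys/kernel/debug/tracing/trace_pipe | \
      awk '{print $1}' | sort | uniq -c | sort -rn | head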

>
> And then we have the spin lock part and btrfs_get_block_group, which are
> also a significant part of the time spent. All together is the
> continuous 100% cpu.

So is the spinlock part. That spinlock itself should be pretty fast
(that's why we use a spinlock), but when it isn't, that's mostly caused
by high concurrency.

>
> What I *can* try is switch to the ssd_spread option, to force it to do
> much more yolo allocation, but I'm afraid this will result in a sudden
> blast of metadata block groups getting allocated. Or, maybe try it for a
> minute or so and compare the trace pipe output...

Another thing that can't be explained is that the slowdown you're
experiencing is from data writeback.
It should never conflict with metadata space reservation, as data and
metadata are separated into different space_infos, so they shouldn't
affect each other.
Either I'm missing something, or there are other problems.

>
>> But personally speaking, it's better to shrink the workload, to find a
>> minimal workload to reproduce the 100% CPU burn, so that you need less
>> probes/time to pindown the problem.
>
> I think I can reproduce it with a single btrfs receive operation, as
> long as it has a large amount of small files in it.

IMHO, if the culprit is really find_free_extent() itself, then a single
receive won't reach the concurrency we need to expose the slowdown.

Since receive is just doing serial writes, that shouldn't cause enough
contention in find_free_extent().

But it's worth trying, to see whether it confirms my assumption about
concurrency.

Thanks,
Qu

>
> Currently I'm almost only copying data onto the filesystem, only data
> removals are some targeted dedupe script for known big files that are
> identical, but which could not easily be handled by rsync.
>
> So, when find_free_extent wants to write extent tree blocks, it might be
> playing outside in the woods all the time, searching around for little
> holes that were caused by previous cow operations.
>
> Large files are not a problem, like postgresql database dumps. Copying
> them over just runs >200-300M/s without problem, where the read speed of
> the old storage I'm moving things off is the bottleneck, actually.
>
>>> * Ultimately, of course... how can we improve this?
>>>
>>> [...]
>
> I will see if I can experiment a bit with putting more trace points in.
>
> I remember reading around in the extent allocator code, until I got
> dizzy from it. Deja vu...
>
> https://www.spinics.net/lists/linux-btrfs/msg70624.html
>
> Thanks
> Hans
>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-28  0:51     ` Qu Wenruo
@ 2020-07-28  1:52       ` Qu Wenruo
  2020-07-28 14:52         ` Hans van Kranenburg
  0 siblings, 1 reply; 17+ messages in thread
From: Qu Wenruo @ 2020-07-28  1:52 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs



On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>
>
> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>> Hi!
>>
>> On 7/27/20 1:09 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/25 10:24 PM, Hans van Kranenburg wrote:
>>>> Hi,
>>>>
>>>> I have a filesystem here that I'm filling up with data from elsewhere.
>>>> Most of it is done by rsync, and part by send/receive. So, receiving
>>>> data over the network, and then writing the files to disk. There can be
>>>> a dozen of these processes running in parallel.
>>>>
>>>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>>>> varying X) often is using nearly 100% cpu, while enormously slowing down
>>>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>>>> processes.
>>>
>>> The name is in fact pretty strange.
>>> It doesn't follow the btrfs workqueue names.
>>>
>>> There are two type of kernel threads used by btrfs:
>>>   [...]
>>>
>>> As you can see, there is no one named like "flush-btrfs".
>>
>> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here,
>> currently. ;]
>>
>>> Thus I guess it's from other part of the stack.
>>>
>>> Also, the calltrace also shows that, that kernel thread is only doing
>>> page writeback, which calls back to the page write hooks of btrfs.
>>>
>>> So I guess it may not be btrfs, but something else trying to do all the
>>> writeback.
>>
>> Yes, so, from the writeback path it re-enters btrfs code, right?
>
> Yep. Btrfs is still responsible for the problem.
>
>>
>>> But still, the CPU usage is still a problem, it shouldn't cost so much
>>> CPU time just writing back pages from btrfs.
>>
>> It's find_free_extent which is the culprit.
>>
>>>> [...]
>>>>
>>>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.
>>>>
>>>> I just did a 'perf record -g -a sleep 60' while disk writes were down to
>>>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen
>>>> shot' of it. Also attached an example of what nmon shows to give an idea
>>>> about the situation.
>>>>
>>>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
>>>> see network and disk write speed ramping up, easily over 200 MiB/s,
>>>> until it soon plummets again.
>>>>
>>>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
>>>> is booted now).
>>>>
>>>> So, what I'm looking for is:
>>>> * Does anyone else see this happen when doing a lot of concurrent writes?
>>>> * What does this flush thing do?
>>>> * Why is it using 100% cpu all the time?
>>>> * How can I debug this more?
>>>
>>> bcc based runtime probes I guess?
>>>
>>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>>> detector won't help much.
>>
>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>
> That's really awesome!
>
>>
>>> You can try trace events first to see which trace event get executed the
>>> most frequently, then try to add probe points to pin down the real cause.
>>
>> From the default collection, I already got the following, a few days
>> ago, by enabling find_free_extent and btrfs_cow_block:
>>
>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt

This output is in fact pretty confusing, and may give you a false view
of the callers of find_free_extent().

It always shows "EXTENT_TREE" as the owner, but that's due to a bad
decision on the trace event.

I have submitted a patch addressing it, and added you to the CC.

Would you mind re-capturing the events with that patch?
That way we would have a clearer idea of which tree is seeing the most
concurrency.

Thanks,
Qu

>>
>> From the timestamps you can see how long this takes. It's not much that
>> gets done per second.
>>
>> The spin lock part must be spin_lock(&space_info->lock) because that's
>> the only one in find_free_extent.
>>
>> So, what I think is that, like I mentioned on saturday already,
>> find_free_extent is probably doing things in a really inefficient way,
>> scanning around for a small single free space gap all the time in a
>> really expensive way, and doing that over again for each tiny metadata
>> block or file that needs to be placed somewhere (also notice the
>> empty_size=0), instead of just throwing all of it on disk after each
>> other, when it's otherwise slowing down everyone.
>
> I think you're mostly right, find_free_extent() is not that efficient.
>
> But that only happens when there are a lot of parallel requests for it.
>
> One example is btrfs_get_block_group().
> That function is only calling one function, refcount_inc(), which
> normally shouldn't be that slow, unless there are tons of other callers
> here.
>
>>
>> And then we have the spin lock part and btrfs_get_block_group, which are
>> also a significant part of the time spent. All together is the
>> continuous 100% cpu.
>
> So is the spinlock part. That spinlock itself should be pretty fast
> (that's why we use spinlock), but when it's not, it's mostly caused by
> high concurrency.
>
>>
>> What I *can* try is switch to the ssd_spread option, to force it to do
>> much more yolo allocation, but I'm afraid this will result in a sudden
>> blast of metadata block groups getting allocated. Or, maybe try it for a
>> minute or so and compare the trace pipe output...
>
> Another thing can't be explained is, that slow down you're experiencing
> is from data write back.
> It should never conflicts with metadata space reservation, as data and
> metadata are separated by space_info first, thus they shouldn't affect
> each other.
> Either I'm missing something, or there are other problems.
>
>>
>>> But personally speaking, it's better to shrink the workload, to find a
>>> minimal workload to reproduce the 100% CPU burn, so that you need less
>>> probes/time to pindown the problem.
>>
>> I think I can reproduce it with a single btrfs receive operation, as
>> long as it has a large amount of small files in it.
>
> IMHO, if the culprit is really find_free_extent() itself, then single
> receive won't reach the concurrency we need to expose the slowdown.
>
> Since receive is just doing serial writes, that shouldn't cause enough
> race for find_free_extents().
>
> But that's worthy trying to see if it proofs my assumption of concurrency.
>
> Thanks,
> Qu
>
>>
>> Currently I'm almost only copying data onto the filesystem, only data
>> removals are some targeted dedupe script for known big files that are
>> identical, but which could not easily be handled by rsync.
>>
>> So, when find_free_extent wants to write extent tree blocks, it might be
>> playing outside in the woods all the time, searching around for little
>> holes that were caused by previous cow operations.
>>
>> Large files are not a problem, like postgresql database dumps. Copying
>> them over just runs >200-300M/s without problem, where the read speed of
>> the old storage I'm moving things off is the bottleneck, actually.
>>
>>>> * Ultimately, of course... how can we improve this?
>>>>
>>>> [...]
>>
>> I will see if I can experiment a bit with putting more trace points in.
>>
>> I remember reading around in the extent allocator code, until I got
>> dizzy from it. Deja vu...
>>
>> https://www.spinics.net/lists/linux-btrfs/msg70624.html
>>
>> Thanks
>> Hans
>>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-28  1:52       ` Qu Wenruo
@ 2020-07-28 14:52         ` Hans van Kranenburg
  2020-07-29  0:15           ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-28 14:52 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 7/28/20 3:52 AM, Qu Wenruo wrote:
> 
> 
> On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>>> Hi!
>>>
>>> [...]
>>>>
>>>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>>>> detector won't help much.
>>>
>>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>>
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>>
>> That's really awesome!
>>
>>>
>>>> You can try trace events first to see which trace event get executed the
>>>> most frequently, then try to add probe points to pin down the real cause.
>>>
>>> From the default collection, I already got the following, a few days
>>> ago, by enabling find_free_extent and btrfs_cow_block:
>>>
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
> 
> This output is in fact pretty confusing, and maybe give you a false view
> on the callers of find_free_extent().
> 
> It always shows "EXTENT_TREE" as the owner, but that's due to a bad
> decision on the trace event.
> 
> I have submitted a patch addressing it, and added you to the CC.

Oh right, thanks, that actually makes a lot of sense, lol.

I was misled because at first sight I was thinking, "yeah, obviously,
where else than in the extent tree are you going to do administration
about allocated blocks", and didn't yet realize that it made no sense.

> Would you mind to re-capture the events with that patch?
> So that we could have a clearer idea on which tree is having the most
> amount of concurrency?

Yes. I will do that and try to reproduce the symptoms with as few
actions in parallel as possible.

I've been away most of the day today; I will see how far I get later,
and otherwise continue tomorrow.

What you *can* see in the current output already, however, is that the
kworker/u16:3-13887 thing is doing all DATA work, while many different
processes (rsync, btrfs receive) all do the find_free_extent work for
METADATA. That's already an interesting difference.
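
For what it's worth, a quick breakdown of a local copy of that capture
(assuming the lines carry the task name in the first field and a
flags=...(DATA) / flags=...(METADATA) field) can be had with something
like:

  # tally captured find_free_extent lines by task and allocation type
  awk '{ for (i = 1; i <= NF; i++) if ($i ~ /^flags=/) print $1, $i }' \
      2020-07-25-find_free_extent.txt | sort | uniq -c | sort -rn | head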

So, one direction to look into is who all is trying to grab that spin
lock, since if it's per 'space' (which sounds logical, since the workers
will never clash because a whole block group belongs to only 1 'space'),
then I don't see why kworker/u16:18-11336 would spend 1/3 of its time
busy-waiting on the lock, while it's the only process working on
METADATA.

But, I'll gather some more logs and pictures.

Do you know of a way (some RTFM pointer?) to debug who is contending on
the same lock? I haven't researched that yet.

Hans

> [...]

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-28 14:52         ` Hans van Kranenburg
@ 2020-07-29  0:15           ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-07-29  0:15 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs


[-- Attachment #1.1: Type: text/plain, Size: 3758 bytes --]



On 2020/7/28 10:52 PM, Hans van Kranenburg wrote:
> On 7/28/20 3:52 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>>>> Hi!
>>>>
>>>> [...]
>>>>>
>>>>> Since it's almost a dead CPU burner loop, regular sleep based lockup
>>>>> detector won't help much.
>>>>
>>>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>>>
>>> That's really awesome!
>>>
>>>>
>>>>> You can try trace events first to see which trace event get executed the
>>>>> most frequently, then try to add probe points to pin down the real cause.
>>>>
>>>> From the default collection, I already got the following, a few days
>>>> ago, by enabling find_free_extent and btrfs_cow_block:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>>
>> This output is in fact pretty confusing, and maybe give you a false view
>> on the callers of find_free_extent().
>>
>> It always shows "EXTENT_TREE" as the owner, but that's due to a bad
>> decision on the trace event.
>>
>> I have submitted a patch addressing it, and added you to the CC.
> 
> Oh right, thanks, that actually makes a lot of sense, lol.
> 
> I was misled because at first sight I was thinking, "yeah, obviously,
> where else than in the extent tree are you going to do administration
> about allocated blocks.", and didn't realize yet it did make no sense.
> 
>> Would you mind to re-capture the events with that patch?
>> So that we could have a clearer idea on which tree is having the most
>> amount of concurrency?
> 
> Yes. I will do that and try to reproduce the symptoms with as few
> actions in parallel as possible.
> 
> I've been away most of the day today, I will see how far I get later,
> otherwise continue tomorrow.
> 
> What you *can* see in the current output already, however, is that the
> kworker/u16:3-13887 thing is doing all DATA work, while many different
> processes (rsync, btrfs receive) all do the find_free_extent work for
> METADATA. That's already an interesting difference.

That's common; don't forget that btrfs does COW to modify its metadata.
After a data write reaches disk, btrfs will then modify metadata to
reflect that write, which causes a metadata reservation.

The good news is that btrfs does not always need to COW metadata, if the
needed tree blocks have already been CoWed and not yet written to disk.

So in theory, data reservations should happen more frequently than
metadata reservations.

> 
> So, one direction to look into is who is all trying to grab that spin
> lock, since if it's per 'space' (which sounds logical, since the workers
> will never clash because a whole block group belongs to only 1 'space'),
> then I don't see why kworker/u16:18-11336 would spend 1/3 of it's time
> in a busy locking situation waiting, while it's the only process working
> on METADATA.
> 
> But, I'll gather some more logs and pictures.
> 
> Do you know (some RTFM pointer?) about a way to debug who's locking on
> the same thing? I didn't research that yet.

I had a bcc-based script to do that analysis, but it's mostly for tree
blocks.
https://github.com/adam900710/btrfs-profiler/blob/master/tree_lock_wait.py

Feel free to use that as a base to develop your own tool.
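
Another option, if you can run a debug build, is the kernel's own lock
statistics. A sketch (it needs CONFIG_LOCK_STAT=y, slows everything down
while enabled, and the exact class name of the space_info lock is an
assumption):

  # collect lock contention statistics for 30 seconds
  echo 1 > /proc/sys/kernel/lock_stat
  sleep 30
  echo 0 > /proc/sys/kernel/lock_stat
  # waiters, wait times and hold times per lock class
  grep -A 3 space_info /proc/lock_stat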


BTW, this data allocation race reminds me that Josef and Nikolay
discussed this possible problem before, but there is not yet a solid
solution to it.

So I guess what we can do now is just to collect enough data to prove
that it's already a big problem.

Thanks,
Qu
> 
> Hans
> 
>> [...]


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-07-29  0:15 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
2020-07-25 15:37 ` Holger Hoffstätte
2020-07-25 16:43   ` Hans van Kranenburg
2020-07-25 19:44     ` Holger Hoffstätte
2020-07-25 21:03       ` Hans van Kranenburg
2020-07-26  1:00         ` Chris Murphy
2020-07-25 21:27 ` Hans van Kranenburg
2020-07-26  8:10   ` A L
2020-07-26  0:50 ` Chris Murphy
2020-07-27 11:09 ` Qu Wenruo
2020-07-27 17:17   ` Hans van Kranenburg
2020-07-27 19:23     ` Chris Murphy
2020-07-27 23:16     ` Chris Murphy
2020-07-28  0:51     ` Qu Wenruo
2020-07-28  1:52       ` Qu Wenruo
2020-07-28 14:52         ` Hans van Kranenburg
2020-07-29  0:15           ` Qu Wenruo
