* Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
@ 2020-07-25 14:24 Hans van Kranenburg
  2020-07-25 15:37 ` Holger Hoffstätte
  ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread

From: Hans van Kranenburg @ 2020-07-25 14:24 UTC (permalink / raw)
To: linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 4302 bytes --]

Hi,

I have a filesystem here that I'm filling up with data from elsewhere.
Most of it is done by rsync, and part by send/receive. So, receiving
data over the network, and then writing the files to disk. There can be
a dozen of these processes running in parallel.

Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
varying X) often is using nearly 100% cpu, while enormously slowing down
disk writes. This shows as disk IO wait for the rsync and btrfs receive
processes.

The underlying storage (iSCSI connected over 10Gb/s network) can easily
eat a few hundred MiB/s. When looking at actual disk busyness on the
storage device, utilization percentages <5% are reported for the actual
disks.

It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here.

I just did a 'perf record -g -a sleep 60' while disk writes were down to
under 1MiB (!) per second and then 'perf report'. Attached some 'screen
shot' of it. Also attached an example of what nmon shows to give an idea
about the situation.

If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately
see network and disk write speed ramping up, easily over 200 MiB/s,
until it soon plummets again.

I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which
is booted now).

So, what I'm looking for is:
* Does anyone else see this happen when doing a lot of concurrent writes?
* What does this flush thing do?
* Why is it using 100% cpu all the time?
* How can I debug this more?
* Ultimately, of course... how can we improve this?

I can recompile the kernel image to e.g. put more trace points in, in
different places.
I just have no idea where to start.

Thanks,
Hans

P.S. /proc/<pid>/stack for the kworker/u16:X+flush-btrfs-2 mostly shows
nothing at all, and sometimes when it does show some output, it mostly
looks like this:

----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] submit_extent_page+0x102/0x210 [btrfs]
[<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
[<0>] __extent_writepage+0x101/0x300 [btrfs]
[<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[<0>] extent_writepages+0x44/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40
----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] extent_writepages+0x5d/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40
----
[<0>] rq_qos_wait+0xfa/0x170
[<0>] wbt_wait+0x98/0xe0
[<0>] __rq_qos_throttle+0x23/0x30
[<0>] blk_mq_make_request+0x12a/0x5d0
[<0>] generic_make_request+0xcf/0x310
[<0>] submit_bio+0x42/0x1c0
[<0>] btrfs_map_bio+0x1c0/0x380 [btrfs]
[<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs]
[<0>] submit_one_bio+0x31/0x50 [btrfs]
[<0>] submit_extent_page+0x102/0x210 [btrfs]
[<0>] __extent_writepage_io+0x1cf/0x380 [btrfs]
[<0>] __extent_writepage+0x101/0x300 [btrfs]
[<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs]
[<0>] extent_writepages+0x44/0x90 [btrfs]
[<0>] do_writepages+0x41/0xd0
[<0>] __writeback_single_inode+0x3d/0x340
[<0>] writeback_sb_inodes+0x1e5/0x480
[<0>] __writeback_inodes_wb+0x5d/0xb0
[<0>] wb_writeback+0x25f/0x2f0
[<0>] wb_workfn+0x2fe/0x3f0
[<0>] process_one_work+0x1ad/0x370
[<0>] worker_thread+0x30/0x390
[<0>] kthread+0x112/0x130
[<0>] ret_from_fork+0x1f/0x40

[-- Attachment #2: 2020-07-25-perf-report.txt --]
[-- Type: text/plain, Size: 6268 bytes --]

Samples: 1M of event 'cpu-clock:pppH', Event count (approx.): 311991500000
  Children      Self  Command          Shared Object       Symbol
+   80.53%     0.00%  swapper          [kernel.kallsyms]   [k] secondary_startup_64
+   80.53%     0.00%  swapper          [kernel.kallsyms]   [k] cpu_startup_entry
+   80.53%     0.00%  swapper          [kernel.kallsyms]   [k] do_idle
+   80.52%     0.00%  swapper          [kernel.kallsyms]   [k] default_idle
+   80.52%    80.51%  swapper          [kernel.kallsyms]   [k] native_safe_halt
+   63.42%     0.00%  swapper          [kernel.kallsyms]   [k] start_secondary
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] ret_from_fork
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] kthread
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] worker_thread
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] process_one_work
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] wb_workfn
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] wb_writeback
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] __writeback_inodes_wb
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] writeback_sb_inodes
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] __writeback_single_inode
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] do_writepages
+   19.10%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] extent_writepages
+   19.08%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] extent_write_cache_pages
+   19.07%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] __extent_writepage
+   19.04%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] writepage_delalloc
+   19.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] btrfs_run_delalloc_range
+   19.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] cow_file_range
+   19.01%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] btrfs_reserve_extent
+   19.01%     4.26%  kworker/u16:11+  [kernel.kallsyms]   [k] find_free_extent
+   17.11%     0.00%  swapper          [kernel.kallsyms]   [k] start_kernel
+    7.18%     7.18%  kworker/u16:11+  [kernel.kallsyms]   [k] _raw_spin_lock
+    4.45%     4.45%  kworker/u16:11+  [kernel.kallsyms]   [k] btrfs_get_block_group
+    1.49%     1.34%  kworker/u16:11+  [kernel.kallsyms]   [k] down_read
+    0.56%     0.56%  kworker/u16:11+  [kernel.kallsyms]   [k] btrfs_put_block_group
     0.50%     0.50%  kworker/u16:11+  [kernel.kallsyms]   [k] __raw_callee_save___pv_queued_spin_unlock
     0.44%     0.44%  kworker/u16:11+  [kernel.kallsyms]   [k] up_read
     0.28%     0.13%  kworker/u16:11+  [kernel.kallsyms]   [k] _cond_resched
     0.15%     0.15%  kworker/u16:11+  [kernel.kallsyms]   [k] rcu_all_qs
     0.05%     0.00%  nmon             [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
     0.05%     0.00%  nmon             [kernel.kallsyms]   [k] do_syscall_64
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]   [k] entry_SYSCALL_64_after_hwframe
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]   [k] do_syscall_64
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]   [k] ret_from_fork
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]   [k] kthread
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]   [k] transaction_kthread
     0.04%     0.00%  btrfs-transacti  [kernel.kallsyms]   [k] btrfs_commit_transaction
     0.04%     0.00%  perf_5.7         libpthread-2.28.so  [.] __libc_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]   [k] ksys_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]   [k] vfs_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]   [k] new_sync_write
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]   [k] btrfs_file_write_iter
     0.04%     0.00%  perf_5.7         [kernel.kallsyms]   [k] btrfs_buffered_write.isra.28
     0.03%     0.00%  kswapd0          [kernel.kallsyms]   [k] ret_from_fork
     0.03%     0.00%  kswapd0          [kernel.kallsyms]   [k] kthread
     0.03%     0.00%  kswapd0          [kernel.kallsyms]   [k] kswapd
     0.03%     0.00%  kswapd0          [kernel.kallsyms]   [k] balance_pgdat
     0.03%     0.00%  kswapd0          [kernel.kallsyms]   [k] shrink_node
     0.03%     0.00%  btrfs-transacti  [kernel.kallsyms]   [k] btrfs_run_delayed_refs
     0.03%     0.00%  btrfs-transacti  [kernel.kallsyms]   [k] __btrfs_run_delayed_refs
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] __extent_writepage_io
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] submit_one_bio
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] btrfs_submit_bio_hook
     0.03%     0.00%  kworker/u16:11+  [kernel.kallsyms]   [k] btrfs_map_bio
     0.02%     0.00%  kswapd0          [kernel.kallsyms]   [k] shrink_slab
     0.02%     0.00%  kswapd0          [kernel.kallsyms]   [k] do_shrink_slab
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]   [k] ret_from_fork
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]   [k] kthread
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]   [k] worker_thread
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]   [k] process_one_work
     0.02%     0.00%  kswapd0          [kernel.kallsyms]   [k] super_cache_scan
     0.02%     0.00%  kworker/u16:4-b  [kernel.kallsyms]   [k] btrfs_work_helper
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]   [k] ret_from_fork
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]   [k] kthread
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]   [k] worker_thread
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]   [k] process_one_work
     0.02%     0.00%  perf_5.7         [unknown]           [k] 0xffffffff9d9bc42e
     0.02%     0.00%  kworker/u16:3-b  [kernel.kallsyms]   [k] btrfs_work_helper
     0.02%     0.00%  btrfs-transacti  [kernel.kallsyms]   [k] alloc_reserved_file_extent
     0.02%     0.00%  nmon             libc-2.28.so        [.] open64
     0.02%     0.00%  nmon             [kernel.kallsyms]   [k] do_sys_open

[-- Attachment #3: 2020-07-25-nmon.txt --]
[-- Type: text/plain, Size: 5851 bytes --]

---- >8 ---- CPU Utilisation ---- >8 ----
---------------------------+-------------------------------------------------+
CPU User%  Sys% Wait% Steal|0          |25         |50          |75       100|
  1   5.2   3.1  90.2   1.6|UUsWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW >
  2   0.0   0.0  99.5   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  3   0.5   0.0  99.5   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  4   0.0   0.0 100.0   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  5   0.0 100.0   0.0   0.0|sssssssssssssssssssssssssssssssssssssssssssssssss>
  6   0.0   0.0 100.0   0.0|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
  7   1.0   0.5  98.0   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW >
  8   1.0   0.0  98.5   0.5|WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>
---------------------------+-------------------------------------------------+
Avg   0.9  13.0  85.7   0.3|ssssssWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW>|
---------------------------+-------------------------------------------------+

---- >8 ---- Memory and Swap ---- >8 ----
PageSize:4KB       RAM-Memory   Swap-Space   High-Memory   Low-Memory
Total (MB)            16015.3          0.0   - not in use  - not in use
Free  (MB)             1471.1          0.0
Free Percent             9.2%         0.0%
Linux Kernel Internal Memory (MB)
Cached=  12582.6   Active=     3756.1
Buffers=     0.2   Swapcached=    0.0   Inactive =  8994.9
Dirty =   2387.0   Writeback =    0.5   Mapped   =    27.3
Slab  =   1653.0   Commit_AS =  321.2   PageTables=    4.4

---- >8 ---- Kernel and Load Average ---- >8 ----
Global-CPU-Stats (/proc/stat line 1, 100 ticks per second,
100% = 1 CPU core thread, 8 CPU core threads):
    7.5% user     0.0% user_nice   101.1% system    0.0% idle
  677.8% iowait   0.0% irq           2.0% softirq   2.5% steal
    0.0% guest    0.0% guest_nice
Load Average:  1 min 14.02,  5 mins 14.05,  15 mins 13.33
CPU use since boottime:  Uptime Days/Hours/Mins 20/15/19,
Idle 15/335/17633 (Uptime has overflowed)
RunQueue 3,  Blocked 13,  Context Switch 2338.1,  Forks 0.0,
Interrupts 3174.7
Boot time 1593903119 (12:51 AM 05-Jul-2020)

---- >8 ---- Network I/O ---- >8 ----
I/F Name  Recv=KB/s  Trans=KB/s  packin  packout   insize  outsize  Peak->Recv   Trans
lo              0.0         0.0     0.0      0.0      0.0      0.0         0.0     0.0
eth0        12250.1        88.6   968.6    999.0  12950.7     90.8    388740.8  2413.5

---- >8 ---- Disk I/O -- /proc/diskstats -- mostly in KB/s ---- >8 ----
DiskName  Busy   Read  WriteKB |0          |25         |50          |75       100|
xvda        0%    0.0      0.0 |                                                 >
xvdb        0%   31.9      0.0 |R                                                >
xvdc       23%    0.0   1171.3 |WWWWWWWWWWWW                                     >
xvdd        0%    0.0      0.0 |                                                 >
xvde       39%    0.0    970.1 |WWWWWWWWWWWWWWWWWWWW                             >
Totals Read-MB/s=0.0  Writes-MB/s=2.1  Transfers/sec=21.9

---- >8 ---- Top Processes Procs=260-mode=3-1=Base 3=Perf 4=Size 5=I/O[RootOnly] u=Args ---- >8 ----
  PID   %CPU   Size    Res   Res   Res   Res  Shared  Faults  Faults  Command
                Used     KB   Set  Text  Data    Lib      KB     Min     Maj
 7909   99.6      0      0     0     0     0       0       0       0  kworker/u16:2+flush-btrfs-2
22780    1.5  19780  10548   400  4588     0    5768       0       0  ssh
21840    1.0  19772  10608   400  4580     0    5836       0       0  ssh
22638    1.0  19776  10616   400  4584     0    5844       0       0  ssh
23115    1.0  19772  10592   400  4580     0    5824       0       0  ssh
23460    1.0  19816  10644   400  4624     0    5828       0       0  ssh
23571    1.0  19780  10564   400  4588     0    5784       0       0  ssh
23759    1.0  19768  10472   400  4576     0    5708       0       0  ssh
23770    1.0  19740  10588   400  4548     0    5848       0       0  ssh
30708    1.0  19996  10752   400  4804     0    5756       0       0  ssh
30710    1.0   8544   1124   396   392     0     960       0       0  btrfs
  993    0.5   8672   3160   104  4652     0     440       0       0  nmon
10290    0.5  19844  10688   400  4652     0    5848       0       0  ssh
18186    0.5      0      0     0     0     0       0       0       0  kworker/u16:6-btrfs-endio-write
18616    0.5   8544   1204   396   392     0    1036       0       0  btrfs
18659    0.5   5420    788    56   440     0     724       0       0  pv
18660    0.5   8544   1196   396   392     0    1032       0       0  btrfs
21842    0.5   7896   1940   328  2300     0    1344       0       0  rsync
22807    0.5   7896   1984   328  2300     0    1384       0       0  rsync
23461    0.5   7896   1928   328  2300     0    1332       0       0  rsync
23467    0.5  19724  10580   400  4532     0    5856       0       0  ssh
23468    0.5   8476   3084   328  2880     0    1272       0       0  rsync

^ permalink raw reply	[flat|nested] 17+ messages in thread
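The /proc/<pid>/stack sampling described in the P.S. above can be scripted roughly like this; most reads come back empty, so only the non-empty stacks are collected and counted. The PID and sample count are placeholders:

```shell
#!/bin/sh
# Sample a kworker's kernel stack repeatedly and histogram the distinct
# non-empty stacks seen. PID and COUNT are example values.
PID=${1:-7909}
COUNT=${2:-200}
i=0
while [ "$i" -lt "$COUNT" ]; do
    # The read races against the scheduler; an empty result just means
    # the thread was running (not blocked) at that instant.
    stack=$(cat "/proc/$PID/stack" 2>/dev/null)
    [ -n "$stack" ] && printf '%s\n\n' "$stack"
    i=$((i + 1))
done |
awk 'BEGIN { RS = "" }           # blank-line-separated records
           { seen[$0]++ }        # count identical stacks
     END   { for (s in seen) printf "%d samples:\n%s\n\n", seen[s], s }'
```

Sorting the output by count then shows at a glance which wait site dominates, similar to a poor man's off-cpu profile.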
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
@ 2020-07-25 15:37 ` Holger Hoffstätte
  2020-07-25 16:43   ` Hans van Kranenburg
  2020-07-25 21:27 ` Hans van Kranenburg
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread

From: Holger Hoffstätte @ 2020-07-25 15:37 UTC (permalink / raw)
To: Hans van Kranenburg, linux-btrfs

On 2020-07-25 16:24, Hans van Kranenburg wrote:
> Hi,
>
> I have a filesystem here that I'm filling up with data from elsewhere.
> Most of it is done by rsync, and part by send/receive. So, receiving
> data over the network, and then writing the files to disk. There can be
> a dozen of these processes running in parallel.
>
> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
> varying X) often is using nearly 100% cpu, while enormously slowing down
> disk writes. This shows as disk IO wait for the rsync and btrfs receive
> processes.

<snip>

I cannot speak to anything btrfs-specific (other than the usual write
storms), however..

> [<0>] rq_qos_wait+0xfa/0x170
> [<0>] wbt_wait+0x98/0xe0
> [<0>] __rq_qos_throttle+0x23/0x30

..this means that you have CONFIG_BLK_WBT{_MQ} enabled and are using
an IO scheduler that observes writeback throttling. AFAIK all MQ-capable
schedulers (also SQ ones in 4.19 IIRC!) do so except for BFQ, which has
its own mechanism to regulate fairness vs. latency and explicitly turns
WBT off.

WBT aka 'writeback throttling' throttles background writes according to
latency/throughput of the underlying block device in favor of readers.
It is meant to protect interactive/low-latency/desktop apps from heavy
bursts of background writeback activity. I tested early versions and
provided feedback to Jens Axboe; it really is helpful when it works,
but obviously cannot cater to every situation. There have been reports
that it is unhelpful for write-only/heavy workloads and may lead to
queueing pileup.

You can tune the expected latency of device writes via:
/sys/block/sda/queue/wbt_lat_usec.

You might also check whether your vm.dirty_{background_}bytes and
vm.dirty_expire_centisecs are too high; distro defaults almost always
are. Lowering them leads to more evenly spaced out write traffic.

Without knowing more it's difficult to say exactly what is going on,
but if your underlying storage has latency spikes you are very likely
looking at queueing pileup caused by multiple WBTs choking each other.
Having other unrelated queueing & throttling mechanisms (in your case
the network) in the mix is unlikely to help. I'm not going to comment
on iSCSI in general.. :^)

OTOH I also have 10G networking here and no such problems, even when
pushing large amounts of data over NFS at ~750 MB/s - and I have WBT
enabled everywhere.

So maybe start small and either ramp up the wbt latency sysctl or
decrease dirty_background_bytes to start flushing sooner, depending
on how it's set. As a last resort you can rebuild your kernels with
CONFIG_BLK_WBT/CONFIG_BLK_WBT_MQ disabled.

Hope this helps!

-h

^ permalink raw reply	[flat|nested] 17+ messages in thread
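The knobs mentioned above can be poked at runtime roughly like this; the device name and the values are illustrative examples, not recommendations:

```shell
# Per-device writeback throttling latency target, in microseconds.
# Higher = throttle background writes less aggressively; 0 disables
# WBT for this queue entirely.
cat /sys/block/sda/queue/wbt_lat_usec
echo 75000 > /sys/block/sda/queue/wbt_lat_usec

# Switch the dirty page cache limits from the distro's percentage
# defaults to absolute bytes, so background flushing starts earlier
# on large-RAM machines and writes are spaced out more evenly.
sysctl vm.dirty_background_bytes=268435456   # start flushing at 256 MiB
sysctl vm.dirty_bytes=1073741824             # hard limit at 1 GiB
sysctl vm.dirty_expire_centisecs=1500        # consider data old after 15 s
```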
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 15:37 ` Holger Hoffstätte
@ 2020-07-25 16:43   ` Hans van Kranenburg
  2020-07-25 19:44     ` Holger Hoffstätte
  0 siblings, 1 reply; 17+ messages in thread

From: Hans van Kranenburg @ 2020-07-25 16:43 UTC (permalink / raw)
To: Holger Hoffstätte, linux-btrfs

On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
> On 2020-07-25 16:24, Hans van Kranenburg wrote:
>> Hi,
>>
>> I have a filesystem here that I'm filling up with data from elsewhere.
>> Most of it is done by rsync, and part by send/receive. So, receiving
>> data over the network, and then writing the files to disk. There can be
>> a dozen of these processes running in parallel.
>>
>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with
>> varying X) often is using nearly 100% cpu, while enormously slowing down
>> disk writes. This shows as disk IO wait for the rsync and btrfs receive
>> processes.
>
> <snip>
>
> I cannot speak to anything btrfs-specific (other than the usual write
> storms), however..
>
>> [<0>] rq_qos_wait+0xfa/0x170
>> [<0>] wbt_wait+0x98/0xe0
>> [<0>] __rq_qos_throttle+0x23/0x30

I need to cat /proc/<pid>/stack a huge number of times in a loop to once
in a while get this sort of output shown.

> ..this means that you have CONFIG_BLK_WBT{_MQ} enabled and are using
> an IO scheduler that observes writeback throttling. AFAIK all MQ-capable
> schedulers (also SQ ones in 4.19 IIRC!) do so except for BFQ, which has
> its own mechanism to regulate fairness vs. latency and explicitly turns
> WBT off.
>
> WBT aka 'writeback throttling' throttles background writes according to
> latency/throughput of the underlying block device in favor of readers.
> It is meant to protect interactive/low-latency/desktop apps from heavy
> bursts of background writeback activity. I tested early versions and
> provided feedback to Jens Axboe; it really is helpful when it works,
> but obviously cannot cater to every situation. There have been reports
> that it is unhelpful for write-only/heavy workloads and may lead to
> queueing pileup.
>
> You can tune the expected latency of device writes via:
> /sys/block/sda/queue/wbt_lat_usec.

Yes, I have been playing around with it earlier, without any effect on
the symptoms.

I just did this again, echo 0 > all of the involved block devices. When
looking at the events/wbt trace point, I see that wbt activity stops at
that moment.

No difference in symptoms.

> You might also check whether your vm.dirty_{background_}bytes and
> vm.dirty_expire_centisecs are too high; distro defaults almost always
> are. Lowering them leads to more evenly spaced out write traffic.

Dirty buffers were ~ 2G in size. I can modify the numbers to make it
bigger or smaller. There's absolutely no change in behavior of the
system.

> Without knowing more it's difficult to say exactly what is going on,
> but if your underlying storage has latency spikes

It doesn't. It's idle, waiting to finally get some data sent to it.

> you are very likely
> looking at queueing pileup caused by multiple WBTs choking each other.
> Having other unrelated queueing & throttling mechanisms (in your case
> the network) in the mix is unlikely to help. I'm not going to comment
> on iSCSI in general.. :^)
>
> OTOH I also have 10G networking here and no such problems, even when
> pushing large amounts of data over NFS at ~750 MB/s - and I have WBT
> enabled everywhere.
>
> So maybe start small and either ramp up the wbt latency sysctl or
> decrease dirty_background_bytes to start flushing sooner, depending
> on how it's set. As a last resort you can rebuild your kernels with
> CONFIG_BLK_WBT/CONFIG_BLK_WBT_MQ disabled.

All processing speed is inversely proportional to the cpu usage of this
kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
everything slows down. The network is idle, the disks are idle. Incoming
rsync speed drops, the speed at which btrfs receive is reading input
drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
below ~95% again, throughput goes up.

I do not see how writeback problems would result in having a
kworker/u16:X+flush-btrfs-2 do 100% cpu. I think it's the other way
round, and that's why I want to know what this thread is actually busy
with doing instead of shoveling the data towards the disks.

As far as I can see, all the iowait reported by rsync and btrfs receive
is because they want to give their writes to the btrfs code in the
kernel, but this kworker/u16:X+flush-btrfs-2 is in their way, so they
are blocked. Even before anything gets queued anywhere. Or doesn't that
make sense?

So, the problem is located *before* all the things you mention above
even come into play yet.

Hans

^ permalink raw reply	[flat|nested] 17+ messages in thread
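The "echo 0 plus watch the events/wbt trace point" check described above can be spelled out roughly like this; the device names are taken from the nmon output earlier in the thread and will differ on other systems:

```shell
# Disable writeback throttling on the btrfs member devices...
for dev in xvdc xvde; do
    echo 0 > "/sys/block/$dev/queue/wbt_lat_usec"
done

# ...then watch the wbt tracepoints for a while; if WBT is really
# out of the picture, no more events should scroll by.
cd /sys/kernel/debug/tracing
echo 1 > events/wbt/enable
timeout 30 cat trace_pipe
echo 0 > events/wbt/enable
```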
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 16:43 ` Hans van Kranenburg
@ 2020-07-25 19:44   ` Holger Hoffstätte
  2020-07-25 21:03     ` Hans van Kranenburg
  0 siblings, 1 reply; 17+ messages in thread

From: Holger Hoffstätte @ 2020-07-25 19:44 UTC (permalink / raw)
To: Hans van Kranenburg, linux-btrfs

On 2020-07-25 18:43, Hans van Kranenburg wrote:
> On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
>>> [<0>] rq_qos_wait+0xfa/0x170
>>> [<0>] wbt_wait+0x98/0xe0
>>> [<0>] __rq_qos_throttle+0x23/0x30
>
> I need to cat /proc/<pid>/stack a huge number of times in a loop to once
> in a while get this sort of output shown.

Ahh! So that means you are just getting lucky and catch the occasional
throttle in action. Ok.

>> You can tune the expected latency of device writes via:
>> /sys/block/sda/queue/wbt_lat_usec.
>
> Yes, I have been playing around with it earlier, without any effect on
> the symptoms.
>
> I just did this again, echo 0 > all of the involved block devices. When
> looking at the events/wbt trace point, I see that wbt activity stops at
> that moment.
>
> No difference in symptoms.

Ok. Worth a shot.. but in retrospect not really, since I just realized
that I had your entire scenario backwards. Nevermind then.

> Dirty buffers were ~ 2G in size. I can modify the numbers to make it
> bigger or smaller. There's absolutely no change in behavior of the
> system.

Ok. I also have 2GB max-dirty and 1GB writeback-dirty, which is
plenty for a 10G pipe.

> It doesn't. It's idle, waiting to finally get some data sent to it.

Yup, looks like it.

> All processing speed is inversely proportional to the cpu usage of this
> kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
> everything slows down. The network is idle, the disks are idle. Incoming
> rsync speed drops, the speed at which btrfs receive is reading input
> drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
> below ~95% again, throughput goes up.

Only thing I can think of is that your rsync runs create insane amounts
of small COW extents that need to be ordered/merged. Multiply by the
number of processes and you're probably hitting some super contended
part. Since the kworker isn't stuck forever but apparently makes
progress, it's not dead, just slow/overloaded.

A few years ago I started using rsync exclusively with --whole-file
since it's not just much faster but also creates significantly less
fragmentation at the expense of some disk space, but whatever... in my
case snapshot rotation takes care of that. Maybe it's an option for you.

So maybe things to try:

- run rsync with --whole-file
- run fewer rsyncs in parallel (might not be necessary with --whole-file!)

If that doesn't help someone else needs to chime in..

-h

^ permalink raw reply	[flat|nested] 17+ messages in thread
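The suggested rsync variant might look like this (source, destination and flags other than --whole-file are placeholders); --whole-file disables the delta-transfer algorithm, so a changed file is rewritten as one sequential stream instead of being patched into many small CoW extents:

```shell
rsync -aHX --whole-file backup@source.example:/srv/data/ /mnt/btrfs/data/
```

On a local or fast network link this is usually a win anyway, since the delta algorithm mainly pays off when bandwidth, not disk layout, is the bottleneck.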
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 19:44 ` Holger Hoffstätte
@ 2020-07-25 21:03   ` Hans van Kranenburg
  2020-07-26  1:00     ` Chris Murphy
  0 siblings, 1 reply; 17+ messages in thread

From: Hans van Kranenburg @ 2020-07-25 21:03 UTC (permalink / raw)
To: Holger Hoffstätte, linux-btrfs

On 7/25/20 9:44 PM, Holger Hoffstätte wrote:
> On 2020-07-25 18:43, Hans van Kranenburg wrote:
>> On 7/25/20 5:37 PM, Holger Hoffstätte wrote:
>>>> [<0>] rq_qos_wait+0xfa/0x170
>>>> [<0>] wbt_wait+0x98/0xe0
>>>> [<0>] __rq_qos_throttle+0x23/0x30
>>
>> I need to cat /proc/<pid>/stack a huge number of times in a loop to once
>> in a while get this sort of output shown.
>
> Ahh! So that means you are just getting lucky and catch the occasional
> throttle in action. Ok.
>
>>> You can tune the expected latency of device writes via:
>>> /sys/block/sda/queue/wbt_lat_usec.
>>
>> Yes, I have been playing around with it earlier, without any effect on
>> the symptoms.
>>
>> I just did this again, echo 0 > all of the involved block devices. When
>> looking at the events/wbt trace point, I see that wbt activity stops at
>> that moment.
>>
>> No difference in symptoms.
>
> Ok. Worth a shot.. but in retrospect not really, since I just realized
> that I had your entire scenario backwards. Nevermind then.

np

>> Dirty buffers were ~ 2G in size. I can modify the numbers to make it
>> bigger or smaller. There's absolutely no change in behavior of the
>> system.
>
> Ok. I also have 2GB max-dirty and 1GB writeback-dirty, which is
> plenty for a 10G pipe.
>
>> It doesn't. It's idle, waiting to finally get some data sent to it.
>
> Yup, looks like it.
>
>> All processing speed is inversely proportional to the cpu usage of this
>> kworker/u16:X+flush-btrfs-2 thread. If it reaches >95% kernel cpu usage,
>> everything slows down. The network is idle, the disks are idle. Incoming
>> rsync speed drops, the speed at which btrfs receive is reading input
>> drops, etc. As soon as kworker/u16:X+flush-btrfs-2 busy cpu usage gets
>> below ~95% again, throughput goes up.
>
> Only thing I can think of is that your rsync runs create insane amounts
> of small COW extents that need to be ordered/merged. Multiply by the
> number of processes and you're probably hitting some super contended
> part.

Yes, that's what we're looking at, and I see that my other email sent to
the list some hours ago didn't show up yet. Maybe I should resend with a
link to the attachment instead of carrying an attachment, or I'll wait
until tomorrow. Hm,

Jul 25 21:06:44 syrinx postfix/smtp[32364]: 155C4609B67CF:
to=<linux-btrfs@vger.kernel.org>,
relay=vger.kernel.org[2620:137:e000::1:18]:25, delay=2.2,
delays=0.15/0.02/1.2/0.87, dsn=2.0.0, status=sent (250 2.0.0 Ok: queued
as 7B1D9C08C5C0)
Jul 25 21:06:44 syrinx postfix/qmgr[951]: 155C4609B67CF: removed

O_o, so it was accepted, but I don't see it.

> Since
> the kworker isn't stuck forever but apparently makes progress, it's not
> dead, just slow/overloaded.
>
> A few years ago I started using rsync exclusively with --whole-file
> since it's not just much faster but also creates significantly less
> fragmentation at the expense of some disk space, but whatever... in my
> case snapshot rotation takes care of that. Maybe it's an option for you.
>
> So maybe things to try:
>
> - run rsync with --whole-file
> - run fewer rsyncs in parallel (might not be necessary with --whole-file!)
>
> If that doesn't help someone else needs to chime in..

That someone else could be me, if only I had the work time available to
do it, instead of using hobby time, which since a few months is rather
used to optimize the vegetable garden and cutting down trees for fire
wood together with the neighbours.

And yes, all rsync is already running with --whole-file since 2014 to
reduce the total amount of reflink garbage in this now ~100T pile of
stuff with >200k subvols.

Doing fewer things in parallel does help, indeed. If I abort everything
and just run 1 of the 20 at a time, it does something. But, I want to
get this over with before the year 2050 happens.

K

^ permalink raw reply	[flat|nested] 17+ messages in thread
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-25 21:03 ` Hans van Kranenburg
@ 2020-07-26  1:00   ` Chris Murphy
  0 siblings, 0 replies; 17+ messages in thread

From: Chris Murphy @ 2020-07-26  1:00 UTC (permalink / raw)
To: Hans van Kranenburg; +Cc: Holger Hoffstätte, Btrfs BTRFS

On Sat, Jul 25, 2020 at 3:03 PM Hans van Kranenburg <hans@knorrie.org> wrote:
>
> And yes, all rsync is already running with --whole-file since 2014 to
> reduce the total amount of reflink garbage in this now ~100T pile of
> stuff with >200k subvols.

OK. So I was thinking maybe lock contention on subvols. But with that
many subvolumes I'm willing to bet every task has its own. And you're
already ahead of where I'm at anyway.

A search for a free extent sounds like free space tree contention. Does
this suck any less if there's no space cache?

-- 
Chris Murphy

^ permalink raw reply	[flat|nested] 17+ messages in thread
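Trying the no-space-cache question out could look roughly like this; the device and mountpoint are placeholders, and the clear operation has to run on an unmounted filesystem:

```shell
# Drop the existing v1 free space cache, then mount without any cache,
# so allocation scans the extent tree directly:
btrfs check --clear-space-cache v1 /dev/xvdc
mount -o nospace_cache /dev/xvdc /mnt/scratch

# ...or alternatively convert to the free space tree (v2 cache),
# which is built once at the first mount with this option:
mount -o space_cache=v2 /dev/xvdc /mnt/scratch
```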
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg 2020-07-25 15:37 ` Holger Hoffstätte @ 2020-07-25 21:27 ` Hans van Kranenburg 2020-07-26 8:10 ` A L 2020-07-26 0:50 ` Chris Murphy 2020-07-27 11:09 ` Qu Wenruo 3 siblings, 1 reply; 17+ messages in thread From: Hans van Kranenburg @ 2020-07-25 21:27 UTC (permalink / raw) To: linux-btrfs Ok, On 7/25/20 4:24 PM, Hans van Kranenburg wrote: > Hi, > > I have a filesystem here that I'm filling up with data from elsewhere. > Most of it is done by rsync, and part by send/receive. So, receiving > data over the network, and then writing the files to disk. There can be > a dozen of these processes running in parallel. > > Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with > varying X) often is using nearly 100% cpu, while enormously slowing down > disk writes. This shows as disk IO wait for the rsync and btrfs receive > processes. Both rsync and btrfs receive are handling incoming streams of data which contain huge amounts of files of all kind of sizes, but also a lot of small files. (E.g. an imap server receive job from 1G split send dump files is processing now, with maildir storage with every email in a separate file, and there's also customers storing tiny XML fragments from XML webservice calls in files to process later). > The underlying storage (iSCSI connected over 10Gb/s network) can easily > eat a few hundred MiB/s. When looking at actual disk business on the > storage device, percentages <5% utilization are reported for the actual > disks. > > It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here. > > I just did a 'perf record -g -a sleep 60' while disk writes were down to > under 1MiB (!) per second and then 'perf report'. Attached some 'screen > shot' of it. 
Also attached an example of what nmon shows to give an idea > about the situation. > > If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately > see network and disk write speed ramping up, easily over 200 MiB/s, > until it soon plummets again. > > I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which > is booted now). > > So, what I'm looking for is: > * Does anyone else see this happen when doing a lot of concurrent writes? > * What does this flush thing do? I have been playing around with the available trace point stuff in /sys/kernel/debug/tracing/events/btrfs a bit more. After rsync or btrfs receive has some new data available for the fs to write to disk... What needs to happen before it can actually go into the write cache to end up on disk? Well, we need to find a place where to put it. What's kworker/u16:X+flush-btrfs-2 doing? It's doing find_free_extent. > * Why is it using 100% cpu all the time? Because it spends all its time doing find_free_extent. And almost all of it is happening for EXTENT_TREE. len=16384 empty_size=0 flags=4(METADATA) I could have known :O And again it's the extent allocator making my life miserable. > * How can I debug this more? I will read more relevant code and see if I can put some more custom trace points in. I will also have a closer look at the empty_size thing, again... > * Ultimately, of course... how can we improve this? Well, be able to have an idea of 'pressure' on it. If we're slowing down everyone a lot, switch to some more yolo allocation which just appends all new writes and flush them to disk, instead of spending 100% kernel cpu to find some tiny old 16k free metadata space hole to put a single write in. So, yes, I could hack around a bit in the code again and flip around some mount option which does the 'ssd_spread' behavior for metadata only. 
I did that on another filesystem a while ago, which helped a lot for throughput, but it also allocated a ridiculous amount of new metadata block groups. I'd like to redesign all of this, but -EPOSSIBLE because -EWORKTIME Now I have a btrfs receive running for >1.5T of customer garbage... Receive cannot stop and resume. So am I going to abort and redo it while it's at 75.2GiB 5:57:46 [1.44MiB/s]? Or just sit it out? Stay tuned for the next episode of Knorrie's btrfs adventures! :) K ^ permalink raw reply [flat|nested] 17+ messages in thread
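As a footnote to the message above: the trace point experiment Hans describes can be reproduced roughly as follows. This is a sketch, not an exact replay of his session; it assumes tracefs is available at the usual debugfs location and must be run as root on the affected machine.

```shell
# Enable the stock btrfs trace events mentioned in the message above,
# then watch the allocator at work. Requires root and tracefs.
cd /sys/kernel/debug/tracing
echo 1 > events/btrfs/find_free_extent/enable
echo 1 > events/btrfs/btrfs_cow_block/enable
# Look for the len=16384 empty_size=0 flags=METADATA lines described above.
cat trace_pipe
```

Disabling is the same echo with 0; trace_pipe is consuming, so reading it drains the buffer.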
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-25 21:27 ` Hans van Kranenburg @ 2020-07-26 8:10 ` A L 0 siblings, 0 replies; 17+ messages in thread From: A L @ 2020-07-26 8:10 UTC (permalink / raw) To: Hans van Kranenburg, linux-btrfs > Now I have a btrfs receive running for >1.5T of customer garbage... > Receive cannot stop and resume. So am I going to abort and redo it while > it's at 75.2GiB 5:57:46 [1.44MiB/s]? Or just sit it out? > > Stay tuned for the next episode of Knorrie's btrfs adventures! :) Always interesting stories :) Anyway, can you throttle your send|receive by putting mbuffer in between? I myself have had quite good success with mbuffer in reducing overall I/O, not to mention the ability to limit overall speed. A ^ permalink raw reply [flat|nested] 17+ messages in thread
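The mbuffer suggestion above might look roughly like this in practice. The buffer size and rate limit are illustrative guesses to tune against the actual storage, and the snapshot path and hostname are placeholders:

```shell
# Sketch: put mbuffer between send and receive to smooth out and cap
# the stream. -m sets the in-memory buffer size, -r limits the transfer
# rate, -q suppresses mbuffer's interactive status output.
btrfs send /mnt/src/snapshot \
  | mbuffer -q -m 1G -r 150M \
  | ssh backuphost 'mbuffer -q -m 1G | btrfs receive /mnt/dst'
```

The receiving-side mbuffer decouples network bursts from the receiver's write stalls, which is where the reduction in overall I/O pressure tends to come from.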
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg 2020-07-25 15:37 ` Holger Hoffstätte 2020-07-25 21:27 ` Hans van Kranenburg @ 2020-07-26 0:50 ` Chris Murphy 2020-07-27 11:09 ` Qu Wenruo 3 siblings, 0 replies; 17+ messages in thread From: Chris Murphy @ 2020-07-26 0:50 UTC (permalink / raw) To: Hans van Kranenburg; +Cc: Btrfs BTRFS On Sat, Jul 25, 2020 at 8:25 AM Hans van Kranenburg <hans@knorrie.org> wrote: > > Hi, > > I have a filesystem here that I'm filling up with data from elsewhere. > Most of it is done by rsync, and part by send/receive. So, receiving > data over the network, and then writing the files to disk. There can be > a dozen of these processes running in parallel. Each receive goes into its own subvolume. What about the rsyncs? How many concurrent and how many per subvolume? > Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with > varying X) often is using nearly 100% cpu, while enormously slowing down > disk writes. This shows as disk IO wait for the rsync and btrfs receive > processes. > > The underlying storage (iSCSI connected over 10Gb/s network) can easily > eat a few hundred MiB/s. When looking at actual disk business on the > storage device, percentages <5% utilization are reported for the actual > disks. > > It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here. > > I just did a 'perf record -g -a sleep 60' while disk writes were down to > under 1MiB (!) per second and then 'perf report'. Attached some 'screen > shot' of it. Also attached an example of what nmon shows to give an idea > about the situation. Why is swapper involved in this? 
+ 80.53% 0.00% swapper [kernel.kallsyms] [k] secondary_startup_64 + 80.53% 0.00% swapper [kernel.kallsyms] [k] cpu_startup_entry + 80.53% 0.00% swapper [kernel.kallsyms] [k] do_idle + 80.52% 0.00% swapper [kernel.kallsyms] [k] default_idle + 80.52% 80.51% swapper [kernel.kallsyms] [k] native_safe_halt + 63.42% 0.00% swapper [kernel.kallsyms] [k] start_secondary What's PSI look like while this problem is happening? # grep . /proc/pressure/* # cat /proc/meminfo > > So, what I'm looking for is: > * Does anyone else see this happen when doing a lot of concurrent writes? > * What does this flush thing do? > * Why is it using 100% cpu all the time? > * How can I debug this more? > * Ultimately, of course... how can we improve this? > > I can recompile the kernel image to e.g. put more trace points in, in > different places. What about using bcc-tools to check for high latencies? I'd maybe start with fileslower, syncsnoop, btrfsslower, biolatency. This is a bit of a firehose at first, but not nearly as much as perf. Strategy-wise you could start either at the top with fileslower or at the bottom with biolatency; fileslower may not tell you anything you don't already know, which is that everything is f'n slow and waiting on something. So, OK, is there something hammering on sync()? Try that next. And so on. There's a bunch of these tools: https://github.com/iovisor/bcc/blob/master/images/bcc_tracing_tools_2019.png -- Chris Murphy ^ permalink raw reply [flat|nested] 17+ messages in thread
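The checks Chris suggests might be started along these lines. Tool locations vary by distribution (many ship the bcc tools in /usr/share/bcc/tools), all of them need root, and the intervals and thresholds here are arbitrary starting points:

```shell
# PSI: how much time tasks are stalling on cpu/io/memory.
grep . /proc/pressure/*
# bcc-tools probes mentioned above (paths are distro-dependent).
/usr/share/bcc/tools/biolatency 10 3   # block I/O latency histogram, 3 x 10s
/usr/share/bcc/tools/btrfsslower 10    # btrfs operations slower than 10 ms
/usr/share/bcc/tools/fileslower 10     # VFS reads/writes slower than 10 ms
/usr/share/bcc/tools/syncsnoop         # who is hammering on sync()/fsync()
```

If biolatency shows low device latency while btrfsslower shows long operation times, the time is being spent above the block layer, which is consistent with the allocator suspicion later in this thread.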
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg ` (2 preceding siblings ...) 2020-07-26 0:50 ` Chris Murphy @ 2020-07-27 11:09 ` Qu Wenruo 2020-07-27 17:17 ` Hans van Kranenburg 3 siblings, 1 reply; 17+ messages in thread From: Qu Wenruo @ 2020-07-27 11:09 UTC (permalink / raw) To: Hans van Kranenburg, linux-btrfs On 2020/7/25 下午10:24, Hans van Kranenburg wrote: > Hi, > > I have a filesystem here that I'm filling up with data from elsewhere. > Most of it is done by rsync, and part by send/receive. So, receiving > data over the network, and then writing the files to disk. There can be > a dozen of these processes running in parallel. > > Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with > varying X) often is using nearly 100% cpu, while enormously slowing down > disk writes. This shows as disk IO wait for the rsync and btrfs receive > processes. The name is in fact pretty strange. It doesn't follow the btrfs workqueue names. There are two types of kernel threads used by btrfs: - kthread This includes: * btrfs-uuid * btrfs-cleaner * btrfs-transaction * btrfs-devrepl * btrfs-ino-cache-*llu (deprecated) * btrfs-balance - btrfs_workqueue (a wrapper around the kernel workqueue) This mostly includes the following ones; some of them may have another workqueue with a "-high" suffix: * btrfs-worker * btrfs-delalloc * btrfs-flush_delalloc * btrfs-cache * btrfs-fixup * btrfs-endio * btrfs-endio-meta * btrfs-endio-meta-write * btrfs-endio-raid56 * btrfs-rmw * btrfs-endio-write * btrfs-freespace-write * btrfs-delayed-meta * btrfs-readahead * btrfs-qgroup-rescan * btrfs-scrub * btrfs-scrubwrc * btrfs-scrubparity (No wonder Linus is not happy with so many work queues) As you can see, there is no one named like "flush-btrfs". Thus I guess it's from another part of the stack.
Also, the call trace shows that that kernel thread is only doing page writeback, which calls back to the page write hooks of btrfs. So I guess it may not be btrfs, but something else trying to do all the writeback. But still, the CPU usage is a problem; it shouldn't cost so much CPU time just writing back pages from btrfs. > > The underlying storage (iSCSI connected over 10Gb/s network) can easily > eat a few hundred MiB/s. When looking at actual disk business on the > storage device, percentages <5% utilization are reported for the actual > disks. > > It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here. > > I just did a 'perf record -g -a sleep 60' while disk writes were down to > under 1MiB (!) per second and then 'perf report'. Attached some 'screen > shot' of it. Also attached an example of what nmon shows to give an idea > about the situation. > > If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately > see network and disk write speed ramping up, easily over 200 MiB/s, > until it soon plummets again. > > I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which > is booted now). > > So, what I'm looking for is: > * Does anyone else see this happen when doing a lot of concurrent writes? > * What does this flush thing do? > * Why is it using 100% cpu all the time? > * How can I debug this more? bcc-based runtime probes I guess? Since it's almost a dead CPU burner loop, a regular sleep-based lockup detector won't help much. You can try trace events first to see which trace event gets executed the most frequently, then try to add probe points to pin down the real cause. But personally speaking, it's better to shrink the workload, to find a minimal workload to reproduce the 100% CPU burn, so that you need fewer probes and less time to pin down the problem. > * Ultimately, of course... how can we improve this? > > I can recompile the kernel image to e.g.
put more trace points in, in > different places. > > I just have no idea where to start. > > Thanks, > Hans > > P.S. /proc/<pid>/stack for the kworker/u16:X+flush-btrfs-2 mostly shows > nothing at all, and sometimes when it does show some output, it mostly > looks like this: > > ---- > [<0>] rq_qos_wait+0xfa/0x170 > [<0>] wbt_wait+0x98/0xe0 > [<0>] __rq_qos_throttle+0x23/0x30 > [<0>] blk_mq_make_request+0x12a/0x5d0 > [<0>] generic_make_request+0xcf/0x310 > [<0>] submit_bio+0x42/0x1c0 > [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs] > [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs] > [<0>] submit_one_bio+0x31/0x50 [btrfs] > [<0>] submit_extent_page+0x102/0x210 [btrfs] > [<0>] __extent_writepage_io+0x1cf/0x380 [btrfs] > [<0>] __extent_writepage+0x101/0x300 [btrfs] > [<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs] > [<0>] extent_writepages+0x44/0x90 [btrfs] > [<0>] do_writepages+0x41/0xd0 Yeah, it's just trying to write dirty pages for btrfs. I don't really believe it's btrfs causing the 100% CPU burn. Maybe you could just try running btrfs on bare storage, without iSCSI, just to verify whether btrfs is to blame?
Thanks, Qu > [<0>] __writeback_single_inode+0x3d/0x340 > [<0>] writeback_sb_inodes+0x1e5/0x480 > [<0>] __writeback_inodes_wb+0x5d/0xb0 > [<0>] wb_writeback+0x25f/0x2f0 > [<0>] wb_workfn+0x2fe/0x3f0 > [<0>] process_one_work+0x1ad/0x370 > [<0>] worker_thread+0x30/0x390 > [<0>] kthread+0x112/0x130 > [<0>] ret_from_fork+0x1f/0x40 > > ---- > [<0>] rq_qos_wait+0xfa/0x170 > [<0>] wbt_wait+0x98/0xe0 > [<0>] __rq_qos_throttle+0x23/0x30 > [<0>] blk_mq_make_request+0x12a/0x5d0 > [<0>] generic_make_request+0xcf/0x310 > [<0>] submit_bio+0x42/0x1c0 > [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs] > [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs] > [<0>] submit_one_bio+0x31/0x50 [btrfs] > [<0>] extent_writepages+0x5d/0x90 [btrfs] > [<0>] do_writepages+0x41/0xd0 > [<0>] __writeback_single_inode+0x3d/0x340 > [<0>] writeback_sb_inodes+0x1e5/0x480 > [<0>] __writeback_inodes_wb+0x5d/0xb0 > [<0>] wb_writeback+0x25f/0x2f0 > [<0>] wb_workfn+0x2fe/0x3f0 > [<0>] process_one_work+0x1ad/0x370 > [<0>] worker_thread+0x30/0x390 > [<0>] kthread+0x112/0x130 > [<0>] ret_from_fork+0x1f/0x40 > > ---- > [<0>] rq_qos_wait+0xfa/0x170 > [<0>] wbt_wait+0x98/0xe0 > [<0>] __rq_qos_throttle+0x23/0x30 > [<0>] blk_mq_make_request+0x12a/0x5d0 > [<0>] generic_make_request+0xcf/0x310 > [<0>] submit_bio+0x42/0x1c0 > [<0>] btrfs_map_bio+0x1c0/0x380 [btrfs] > [<0>] btrfs_submit_bio_hook+0x8c/0x180 [btrfs] > [<0>] submit_one_bio+0x31/0x50 [btrfs] > [<0>] submit_extent_page+0x102/0x210 [btrfs] > [<0>] __extent_writepage_io+0x1cf/0x380 [btrfs] > [<0>] __extent_writepage+0x101/0x300 [btrfs] > [<0>] extent_write_cache_pages+0x2bb/0x440 [btrfs] > [<0>] extent_writepages+0x44/0x90 [btrfs] > [<0>] do_writepages+0x41/0xd0 > [<0>] __writeback_single_inode+0x3d/0x340 > [<0>] writeback_sb_inodes+0x1e5/0x480 > [<0>] __writeback_inodes_wb+0x5d/0xb0 > [<0>] wb_writeback+0x25f/0x2f0 > [<0>] wb_workfn+0x2fe/0x3f0 > [<0>] process_one_work+0x1ad/0x370 > [<0>] worker_thread+0x30/0x390 > [<0>] kthread+0x112/0x130 > [<0>] 
ret_from_fork+0x1f/0x40 > ^ permalink raw reply [flat|nested] 17+ messages in thread
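A minimal workload along the lines Qu suggests (shrink it down, then crank up concurrency) might look like the following sketch. All counts and paths are placeholders: the defaults are tiny so the script runs anywhere, and to actually stress the allocator, DIR should point at the affected btrfs mount with the counts scaled up by several orders of magnitude.

```shell
# Hedged sketch: several parallel writers creating many small files,
# approximating a dozen concurrent rsync/receive streams.
DIR="${DIR:-$(mktemp -d)}"
WRITERS="${WRITERS:-4}"   # the report mentions about a dozen streams
FILES="${FILES:-100}"     # bump massively on the real filesystem
for i in $(seq 1 "$WRITERS"); do
  (
    mkdir -p "$DIR/stream$i"
    j=1
    while [ "$j" -le "$FILES" ]; do
      # 4 KiB of random data per file, mimicking maildir-style tiny files
      head -c 4096 /dev/urandom > "$DIR/stream$i/f$j"
      j=$((j + 1))
    done
  ) &
done
wait
find "$DIR" -type f | wc -l
```

If the 100% cpu flush kworker reappears with a single writer, concurrency is not the trigger; if it only shows up as WRITERS grows, that supports the contention theory discussed later in the thread.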
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-27 11:09 ` Qu Wenruo @ 2020-07-27 17:17 ` Hans van Kranenburg 2020-07-27 19:23 ` Chris Murphy ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: Hans van Kranenburg @ 2020-07-27 17:17 UTC (permalink / raw) To: Qu Wenruo, linux-btrfs Hi! On 7/27/20 1:09 PM, Qu Wenruo wrote: > > > On 2020/7/25 下午10:24, Hans van Kranenburg wrote: >> Hi, >> >> I have a filesystem here that I'm filling up with data from elsewhere. >> Most of it is done by rsync, and part by send/receive. So, receiving >> data over the network, and then writing the files to disk. There can be >> a dozen of these processes running in parallel. >> >> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with >> varying X) often is using nearly 100% cpu, while enormously slowing down >> disk writes. This shows as disk IO wait for the rsync and btrfs receive >> processes. > > The name is in fact pretty strange. > It doesn't follow the btrfs workqueue names. > > There are two type of kernel threads used by btrfs: > [...] > > As you can see, there is no one named like "flush-btrfs". Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here, currently. ;] > Thus I guess it's from other part of the stack. > > Also, the calltrace also shows that, that kernel thread is only doing > page writeback, which calls back to the page write hooks of btrfs. > > So I guess it may not be btrfs, but something else trying to do all the > writeback. Yes, so, from the writeback path it re-enters btrfs code, right? > But still, the CPU usage is still a problem, it shouldn't cost so much > CPU time just writing back pages from btrfs. It's find_free_extent which is the culprit. >> [...] >> >> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here. >> >> I just did a 'perf record -g -a sleep 60' while disk writes were down to >> under 1MiB (!) per second and then 'perf report'. 
Attached some 'screen >> shot' of it. Also attached an example of what nmon shows to give an idea >> about the situation. >> >> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately >> see network and disk write speed ramping up, easily over 200 MiB/s, >> until it soon plummets again. >> >> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which >> is booted now). >> >> So, what I'm looking for is: >> * Does anyone else see this happen when doing a lot of concurrent writes? >> * What does this flush thing do? >> * Why is it using 100% cpu all the time? >> * How can I debug this more? > > bcc based runtime probes I guess? > > Since it's almost a dead CPU burner loop, regular sleep based lockup > detector won't help much. Here's a flame graph of 180 seconds, taken from the kernel thread pid: https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg > You can try trace events first to see which trace event get executed the > most frequently, then try to add probe points to pin down the real cause. From the default collection, I already got the following, a few days ago, by enabling find_free_extent and btrfs_cow_block: https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt From the timestamps you can see how long this takes. It's not much that gets done per second. The spin lock part must be spin_lock(&space_info->lock) because that's the only one in find_free_extent. So, what I think is that, like I mentioned on Saturday already, find_free_extent is probably doing things in a really inefficient way, scanning around for a small single free space gap all the time in a really expensive way, and doing that over again for each tiny metadata block or file that needs to be placed somewhere (also notice the empty_size=0), instead of just throwing all of it onto disk one after the other, when it's otherwise slowing down everyone.
And then we have the spin lock part and btrfs_get_block_group, which are also a significant part of the time spent. All together is the continuous 100% cpu. What I *can* try is switching to the ssd_spread option, to force it to do much more yolo allocation, but I'm afraid this will result in a sudden blast of metadata block groups getting allocated. Or, maybe try it for a minute or so and compare the trace pipe output... > But personally speaking, it's better to shrink the workload, to find a > minimal workload to reproduce the 100% CPU burn, so that you need less > probes/time to pindown the problem. I think I can reproduce it with a single btrfs receive operation, as long as it has a large number of small files in it. Currently I'm almost only copying data onto the filesystem; the only data removals come from a targeted dedupe script for known big files that are identical, but which could not easily be handled by rsync. So, when find_free_extent wants to write extent tree blocks, it might be playing outside in the woods all the time, searching around for little holes that were caused by previous cow operations. Large files, like PostgreSQL database dumps, are not a problem. Copying them over just runs >200-300M/s without problem, where the read speed of the old storage I'm moving things off is the bottleneck, actually. >>> * Ultimately, of course... how can we improve this? >>> >>> [...] I will see if I can experiment a bit with putting more trace points in. I remember reading around in the extent allocator code, until I got dizzy from it. Deja vu... https://www.spinics.net/lists/linux-btrfs/msg70624.html Thanks Hans ^ permalink raw reply [flat|nested] 17+ messages in thread
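For reference, a flame graph like the one Hans links above can be produced roughly like this, using Brendan Gregg's FlameGraph scripts. The kworker pid is a placeholder to be looked up in top first, and the FlameGraph checkout location is an assumption:

```shell
# Sketch: sample the flush kworker's stacks for 180 seconds and render
# them as a flame graph. KWORKER_PID is a placeholder; requires root.
KWORKER_PID=1234   # pid of kworker/u16:X+flush-btrfs-2, taken from top
perf record -g -F 99 -p "$KWORKER_PID" -- sleep 180
perf script \
  | ~/FlameGraph/stackcollapse-perf.pl \
  | ~/FlameGraph/flamegraph.pl > flush-btrfs.svg
```

Pinning perf to the one busy thread, instead of the system-wide `perf record -g -a` from the first message, keeps the profile focused on the find_free_extent call chain.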
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-27 17:17 ` Hans van Kranenburg @ 2020-07-27 19:23 ` Chris Murphy 2020-07-27 23:16 ` Chris Murphy 2020-07-28 0:51 ` Qu Wenruo 2 siblings, 0 replies; 17+ messages in thread From: Chris Murphy @ 2020-07-27 19:23 UTC (permalink / raw) To: Hans van Kranenburg; +Cc: Qu Wenruo, Btrfs BTRFS On Mon, Jul 27, 2020 at 11:17 AM Hans van Kranenburg <hans@knorrie.org> wrote: > > https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt > > From the timestamps you can see how long this takes. It's not much that > gets done per second. > > The spin lock part must be spin_lock(&space_info->lock) because that's > the only one in find_free_extent. > > So, what I think is that, like I mentioned on saturday already, > find_free_extent is probably doing things in a really inefficient way, > scanning around for a small single free space gap all the time in a > really expensive way, and doing that over again for each tiny metadata > block or file that needs to be placed somewhere (also notice the > empty_size=0), instead of just throwing all of it on disk after each > other, when it's otherwise slowing down everyone. What are the latencies for the iSCSI device while this is happening? Does your trace have the ability to distinguish between inefficiency in find_free_extent and waiting on IO latency? However, this hypothesis could be tested by completely balancing all metadata block groups, then trying to reproduce this workload. If it's really this expensive to do metadata insertions, eliminating the need for insertions should show up as a remarkable performance improvement that degrades as metadata block groups become fragmented again. > What I *can* try is switch to the ssd_spread option, to force it to do > much more yolo allocation, but I'm afraid this will result in a sudden > blast of metadata block groups getting allocated.
Or, maybe try it for a > minute or so and compare the trace pipe output... OK or this. -- Chris Murphy ^ permalink raw reply [flat|nested] 17+ messages in thread
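The test Chris proposes above (rewrite all metadata block groups first, then rerun the workload) might be sketched as follows. /mnt is a placeholder for the affected mount point, and a full metadata balance on a large filesystem can take a long time:

```shell
# Compact all metadata (and system) block groups, then rerun the
# workload and see whether the find_free_extent cost drops.
btrfs balance start -m /mnt
btrfs balance status /mnt   # check progress from another terminal
```

If throughput is high right after the balance and decays as the freshly packed block groups fragment again, that would support the fragmentation hypothesis.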
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-27 17:17 ` Hans van Kranenburg 2020-07-27 19:23 ` Chris Murphy @ 2020-07-27 23:16 ` Chris Murphy 2020-07-28 0:51 ` Qu Wenruo 2 siblings, 0 replies; 17+ messages in thread From: Chris Murphy @ 2020-07-27 23:16 UTC (permalink / raw) To: Hans van Kranenburg; +Cc: Qu Wenruo, Btrfs BTRFS On Mon, Jul 27, 2020 at 11:17 AM Hans van Kranenburg <hans@knorrie.org> wrote: > > Hi! > > On 7/27/20 1:09 PM, Qu Wenruo wrote: > > > As you can see, there is no one named like "flush-btrfs". > > Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here, > currently. ;] On a send/receive doing well over 200M/s I see brief moments of 0 0 0 D 10.5 0.0 0:06.15 kworker/u16:15+flush-btrfs-4 But it seems to coincide with 30s intervals, consistent with default commit=30. Are you using flushoncommit? Or a commit time other than 30? Maybe see if the problem gets better if this is 10s? I wonder if it just has too long of a commit window for this particular workload? -- Chris Murphy ^ permalink raw reply [flat|nested] 17+ messages in thread
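Shortening the commit interval as suggested above does not need a reboot. A hedged sketch, with /mnt as a placeholder for the affected mount point:

```shell
# Remount with a 10-second commit interval instead of the default 30.
mount -o remount,commit=10 /mnt
grep ' /mnt ' /proc/mounts   # verify commit=10 took effect
```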
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-27 17:17 ` Hans van Kranenburg 2020-07-27 19:23 ` Chris Murphy 2020-07-27 23:16 ` Chris Murphy @ 2020-07-28 0:51 ` Qu Wenruo 2020-07-28 1:52 ` Qu Wenruo 2 siblings, 1 reply; 17+ messages in thread From: Qu Wenruo @ 2020-07-28 0:51 UTC (permalink / raw) To: Hans van Kranenburg, linux-btrfs On 2020/7/28 上午1:17, Hans van Kranenburg wrote: > Hi! > > On 7/27/20 1:09 PM, Qu Wenruo wrote: >> >> >> On 2020/7/25 下午10:24, Hans van Kranenburg wrote: >>> Hi, >>> >>> I have a filesystem here that I'm filling up with data from elsewhere. >>> Most of it is done by rsync, and part by send/receive. So, receiving >>> data over the network, and then writing the files to disk. There can be >>> a dozen of these processes running in parallel. >>> >>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with >>> varying X) often is using nearly 100% cpu, while enormously slowing down >>> disk writes. This shows as disk IO wait for the rsync and btrfs receive >>> processes. >> >> The name is in fact pretty strange. >> It doesn't follow the btrfs workqueue names. >> >> There are two type of kernel threads used by btrfs: >> [...] >> >> As you can see, there is no one named like "flush-btrfs". > > Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here, > currently. ;] > >> Thus I guess it's from other part of the stack. >> >> Also, the calltrace also shows that, that kernel thread is only doing >> page writeback, which calls back to the page write hooks of btrfs. >> >> So I guess it may not be btrfs, but something else trying to do all the >> writeback. > > Yes, so, from the writeback path it re-enters btrfs code, right? Yep. Btrfs is still responsible for the problem. > >> But still, the CPU usage is still a problem, it shouldn't cost so much >> CPU time just writing back pages from btrfs. > > It's find_free_extent which is the culprit. > >>> [...] 
>>> >>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here. >>> >>> I just did a 'perf record -g -a sleep 60' while disk writes were down to >>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen >>> shot' of it. Also attached an example of what nmon shows to give an idea >>> about the situation. >>> >>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately >>> see network and disk write speed ramping up, easily over 200 MiB/s, >>> until it soon plummets again. >>> >>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which >>> is booted now). >>> >>> So, what I'm looking for is: >>> * Does anyone else see this happen when doing a lot of concurrent writes? >>> * What does this flush thing do? >>> * Why is it using 100% cpu all the time? >>> * How can I debug this more? >> >> bcc based runtime probes I guess? >> >> Since it's almost a dead CPU burner loop, regular sleep based lockup >> detector won't help much. > > Here's a flame graph of 180 seconds, taken from the kernel thread pid: > > https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg That's really awesome! > >> You can try trace events first to see which trace event get executed the >> most frequently, then try to add probe points to pin down the real cause. > > From the default collection, I already got the following, a few days > ago, by enabling find_free_extent and btrfs_cow_block: > > https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt > > From the timestamps you can see how long this takes. It's not much that > gets done per second. > > The spin lock part must be spin_lock(&space_info->lock) because that's > the only one in find_free_extent. 
> > So, what I think is that, like I mentioned on saturday already, > find_free_extent is probably doing things in a really inefficient way, > scanning around for a small single free space gap all the time in a > really expensive way, and doing that over again for each tiny metadata > block or file that needs to be placed somewhere (also notice the > empty_size=0), instead of just throwing all of it on disk after each > other, when it's otherwise slowing down everyone. I think you're mostly right, find_free_extent() is not that efficient. But that only happens when there are a lot of parallel requests for it. One example is btrfs_get_block_group(). That function is only calling one function, refcount_inc(), which normally shouldn't be that slow, unless there are tons of other callers here. > > And then we have the spin lock part and btrfs_get_block_group, which are > also a significant part of the time spent. All together is the > continuous 100% cpu. So is the spinlock part. That spinlock itself should be pretty fast (that's why we use spinlock), but when it's not, it's mostly caused by high concurrency. > > What I *can* try is switch to the ssd_spread option, to force it to do > much more yolo allocation, but I'm afraid this will result in a sudden > blast of metadata block groups getting allocated. Or, maybe try it for a > minute or so and compare the trace pipe output... Another thing can't be explained is, that slow down you're experiencing is from data write back. It should never conflicts with metadata space reservation, as data and metadata are separated by space_info first, thus they shouldn't affect each other. Either I'm missing something, or there are other problems. > >> But personally speaking, it's better to shrink the workload, to find a >> minimal workload to reproduce the 100% CPU burn, so that you need less >> probes/time to pindown the problem. 
> > I think I can reproduce it with a single btrfs receive operation, as > long as it has a large amount of small files in it. IMHO, if the culprit is really find_free_extent() itself, then single receive won't reach the concurrency we need to expose the slowdown. Since receive is just doing serial writes, that shouldn't cause enough race for find_free_extents(). But that's worthy trying to see if it proofs my assumption of concurrency. Thanks, Qu > > Currently I'm almost only copying data onto the filesystem, only data > removals are some targeted dedupe script for known big files that are > identical, but which could not easily be handled by rsync. > > So, when find_free_extent wants to write extent tree blocks, it might be > playing outside in the woods all the time, searching around for little > holes that were caused by previous cow operations. > > Large files are not a problem, like postgresql database dumps. Copying > them over just runs >200-300M/s without problem, where the read speed of > the old storage I'm moving things off is the bottleneck, actually. > >>> * Ultimately, of course... how can we improve this? >>> >>> [...] > > I will see if I can experiment a bit with putting more trace points in. > > I remember reading around in the extent allocator code, until I got > dizzy from it. Deja vu... > > https://www.spinics.net/lists/linux-btrfs/msg70624.html > > Thanks > Hans > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 2020-07-28 0:51 ` Qu Wenruo @ 2020-07-28 1:52 ` Qu Wenruo 2020-07-28 14:52 ` Hans van Kranenburg 0 siblings, 1 reply; 17+ messages in thread From: Qu Wenruo @ 2020-07-28 1:52 UTC (permalink / raw) To: Hans van Kranenburg, linux-btrfs On 2020/7/28 上午8:51, Qu Wenruo wrote: > > > On 2020/7/28 上午1:17, Hans van Kranenburg wrote: >> Hi! >> >> On 7/27/20 1:09 PM, Qu Wenruo wrote: >>> >>> >>> On 2020/7/25 下午10:24, Hans van Kranenburg wrote: >>>> Hi, >>>> >>>> I have a filesystem here that I'm filling up with data from elsewhere. >>>> Most of it is done by rsync, and part by send/receive. So, receiving >>>> data over the network, and then writing the files to disk. There can be >>>> a dozen of these processes running in parallel. >>>> >>>> Now, when doing so, the kworker/u16:X+flush-btrfs-2 process (with >>>> varying X) often is using nearly 100% cpu, while enormously slowing down >>>> disk writes. This shows as disk IO wait for the rsync and btrfs receive >>>> processes. >>> >>> The name is in fact pretty strange. >>> It doesn't follow the btrfs workqueue names. >>> >>> There are two type of kernel threads used by btrfs: >>> [...] >>> >>> As you can see, there is no one named like "flush-btrfs". >> >> Well, I do have a kernel thread named kworker/u16:7-flush-btrfs-2 here, >> currently. ;] >> >>> Thus I guess it's from other part of the stack. >>> >>> Also, the calltrace also shows that, that kernel thread is only doing >>> page writeback, which calls back to the page write hooks of btrfs. >>> >>> So I guess it may not be btrfs, but something else trying to do all the >>> writeback. >> >> Yes, so, from the writeback path it re-enters btrfs code, right? > > Yep. Btrfs is still responsible for the problem. > >> >>> But still, the CPU usage is still a problem, it shouldn't cost so much >>> CPU time just writing back pages from btrfs. >> >> It's find_free_extent which is the culprit. 
>> >>>> [...] >>>> >>>> It's clearly kworker/u16:X+flush-btrfs-2 which is the bottleneck here. >>>> >>>> I just did a 'perf record -g -a sleep 60' while disk writes were down to >>>> under 1MiB (!) per second and then 'perf report'. Attached some 'screen >>>> shot' of it. Also attached an example of what nmon shows to give an idea >>>> about the situation. >>>> >>>> If the kworker/u16:X+flush-btrfs-2 cpu usage decreases, I immediately >>>> see network and disk write speed ramping up, easily over 200 MiB/s, >>>> until it soon plummets again. >>>> >>>> I see the same behavior with a recent 4.19 kernel and with 5.7.6 (which >>>> is booted now). >>>> >>>> So, what I'm looking for is: >>>> * Does anyone else see this happen when doing a lot of concurrent writes? >>>> * What does this flush thing do? >>>> * Why is it using 100% cpu all the time? >>>> * How can I debug this more? >>> >>> bcc based runtime probes I guess? >>> >>> Since it's almost a dead CPU burner loop, regular sleep based lockup >>> detector won't help much. >> >> Here's a flame graph of 180 seconds, taken from the kernel thread pid: >> >> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg > > That's really awesome! > >> >>> You can try trace events first to see which trace event get executed the >>> most frequently, then try to add probe points to pin down the real cause. >> >> From the default collection, I already got the following, a few days >> ago, by enabling find_free_extent and btrfs_cow_block: >> >> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt This output is in fact pretty confusing, and maybe give you a false view on the callers of find_free_extent(). It always shows "EXTENT_TREE" as the owner, but that's due to a bad decision on the trace event. I have submitted a patch addressing it, and added you to the CC. Would you mind to re-capture the events with that patch? 
So that we could have a clearer idea on which tree is having the most amount of concurrency? Thanks, Qu >> >> From the timestamps you can see how long this takes. It's not much that >> gets done per second. >> >> The spin lock part must be spin_lock(&space_info->lock) because that's >> the only one in find_free_extent. >> >> So, what I think is that, like I mentioned on saturday already, >> find_free_extent is probably doing things in a really inefficient way, >> scanning around for a small single free space gap all the time in a >> really expensive way, and doing that over again for each tiny metadata >> block or file that needs to be placed somewhere (also notice the >> empty_size=0), instead of just throwing all of it on disk after each >> other, when it's otherwise slowing down everyone. > > I think you're mostly right, find_free_extent() is not that efficient. > > But that only happens when there are a lot of parallel requests for it. > > One example is btrfs_get_block_group(). > That function is only calling one function, refcount_inc(), which > normally shouldn't be that slow, unless there are tons of other callers > here. > >> >> And then we have the spin lock part and btrfs_get_block_group, which are >> also a significant part of the time spent. All together is the >> continuous 100% cpu. > > So is the spinlock part. That spinlock itself should be pretty fast > (that's why we use spinlock), but when it's not, it's mostly caused by > high concurrency. > >> >> What I *can* try is switch to the ssd_spread option, to force it to do >> much more yolo allocation, but I'm afraid this will result in a sudden >> blast of metadata block groups getting allocated. Or, maybe try it for a >> minute or so and compare the trace pipe output... > > Another thing can't be explained is, that slow down you're experiencing > is from data write back. 
> It should never conflict with metadata space reservation, as data and
> metadata are separated by space_info first, thus they shouldn't affect
> each other.
>
> Either I'm missing something, or there are other problems.
>
>>
>>> But personally speaking, it's better to shrink the workload, to find a
>>> minimal workload to reproduce the 100% CPU burn, so that you need fewer
>>> probes and less time to pin down the problem.
>>
>> I think I can reproduce it with a single btrfs receive operation, as
>> long as it has a large number of small files in it.
>
> IMHO, if the culprit is really find_free_extent() itself, then a single
> receive won't reach the concurrency we need to expose the slowdown.
>
> Since receive is just doing serial writes, that shouldn't cause enough
> contention in find_free_extent().
>
> But it's worth trying, to see if it proves my assumption about concurrency.
>
> Thanks,
> Qu
>
>>
>> Currently I'm almost only copying data onto the filesystem; the only
>> data removals come from a targeted dedupe script for known big files
>> that are identical, but which could not easily be handled by rsync.
>>
>> So, when find_free_extent wants to write extent tree blocks, it might be
>> playing outside in the woods all the time, searching around for little
>> holes that were caused by previous cow operations.
>>
>> Large files are not a problem, like postgresql database dumps. Copying
>> them over just runs >200-300M/s without problem, where the read speed of
>> the old storage I'm moving things off is the bottleneck, actually.
>>
>>>> * Ultimately, of course... how can we improve this?
>>>>
>>>> [...]
>>
>> I will see if I can experiment a bit with putting more trace points in.
>>
>> I remember reading around in the extent allocator code, until I got
>> dizzy from it. Deja vu...
>>
>> https://www.spinics.net/lists/linux-btrfs/msg70624.html
>>
>> Thanks
>> Hans

^ permalink raw reply	[flat|nested] 17+ messages in thread
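[Editor's note] Hans's hypothesis above, that the allocator re-scans the same fragmented area from scratch for every tiny allocation instead of continuing where the last allocation ended, can be sketched with a toy first-fit model. Nothing below is btrfs code: `make_space`, `first_fit`, `allocate_many` and the probe counting are all invented for illustration of scanning-from-zero versus a cached cursor (roughly the behavior that clustered or ssd_spread-style allocation buys):

```python
# Toy model: free space as a list of booleans (True = free unit).
# A "probe" is one unit examined while searching for a free run.

def first_fit(free, size, start=0):
    """Find the first run of `size` free units at or after `start`.
    Returns (offset, probes); offset is None when nothing fits."""
    probes = 0
    run = 0
    for i in range(start, len(free)):
        probes += 1
        run = run + 1 if free[i] else 0
        if run == size:
            return i - size + 1, probes
    return None, probes

def allocate_many(free, count, size, use_cursor):
    """Perform `count` allocations; with use_cursor=True, resume scanning
    where the previous allocation ended instead of from offset 0."""
    total_probes = 0
    cursor = 0
    for _ in range(count):
        offset, probes = first_fit(free, size, cursor if use_cursor else 0)
        total_probes += probes
        if offset is None:
            break
        for i in range(offset, offset + size):
            free[i] = False        # mark the units as allocated
        if use_cursor:
            cursor = offset + size
    return total_probes

def make_space(n_holes, tail):
    """n_holes single-unit holes (too small for a 2-unit allocation),
    followed by a large contiguous free tail."""
    free = []
    for _ in range(n_holes):
        free += [True, False]
    return free + [True] * tail

# Scanning from 0 pays for the useless small holes on every single
# allocation; the cursor pays for them only once.
scan_probes = allocate_many(make_space(100, 200), 50, 2, use_cursor=False)
cursor_probes = allocate_many(make_space(100, 200), 50, 2, use_cursor=True)
print(f"probes without cursor: {scan_probes}, with cursor: {cursor_probes}")
```

With 100 too-small holes in front, the scan-from-zero variant walks past all of them 50 times, so its probe count is more than an order of magnitude higher, mirroring the per-allocation CPU cost Hans describes.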
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-28  1:52         ` Qu Wenruo
@ 2020-07-28 14:52           ` Hans van Kranenburg
  2020-07-29  0:15             ` Qu Wenruo
  0 siblings, 1 reply; 17+ messages in thread
From: Hans van Kranenburg @ 2020-07-28 14:52 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs

On 7/28/20 3:52 AM, Qu Wenruo wrote:
>
>
> On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>>> Hi!
>>>
>>> [...]
>>>>
>>>> Since it's almost a dead CPU-burner loop, a regular sleep-based lockup
>>>> detector won't help much.
>>>
>>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>>
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>>
>> That's really awesome!
>>
>>>
>>>> You can try trace events first to see which trace event gets executed the
>>>> most frequently, then try to add probe points to pin down the real cause.
>>>
>>> From the default collection, I already got the following a few days
>>> ago, by enabling find_free_extent and btrfs_cow_block:
>>>
>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>
> This output is in fact pretty confusing, and may give you a false view
> of the callers of find_free_extent().
>
> It always shows "EXTENT_TREE" as the owner, but that's due to a bad
> decision in the trace event.
>
> I have submitted a patch addressing it, and added you to the CC.

Oh right, thanks, that actually makes a lot of sense, lol.

I was misled because at first sight I was thinking, "yeah, obviously,
where else than in the extent tree are you going to do administration
about allocated blocks", and hadn't yet realized that the output made
no sense.

> Would you mind re-capturing the events with that patch?
> That way we could get a clearer idea of which tree sees the most
> concurrency.

Yes. I will do that and try to reproduce the symptoms with as few
actions in parallel as possible.
I've been away most of the day today; I will see how far I get later,
otherwise I'll continue tomorrow.

What you *can* see in the current output already, however, is that the
kworker/u16:3-13887 thing is doing all DATA work, while many different
processes (rsync, btrfs receive) all do the find_free_extent work for
METADATA. That's already an interesting difference.

So, one direction to look into is who all is trying to grab that spin
lock, since if it's per 'space' (which sounds logical, since the workers
will never clash because a whole block group belongs to only one
'space'), then I don't see why kworker/u16:18-11336 would spend 1/3 of
its time busy waiting on a lock, while it's the only process working
on METADATA.

But, I'll gather some more logs and pictures.

Do you know of a way (some RTFM pointer?) to debug who's locking on the
same thing? I haven't researched that yet.

Hans

> [...]

^ permalink raw reply	[flat|nested] 17+ messages in thread
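[Editor's note] For the "who is waiting on the same lock" question, the kernel-side answers are tools like lockstat or a bcc probe (one is linked in the next message). As a user-space illustration of the bookkeeping involved, here is a toy lock wrapper that records per-thread acquisition counts and wait time; all names here are invented, and this models the idea rather than any btrfs internals:

```python
import threading
import time
from collections import defaultdict

class InstrumentedLock:
    """A lock wrapper that records how often and how long each thread waits."""
    def __init__(self):
        self._lock = threading.Lock()
        self._stats_lock = threading.Lock()
        self.acquisitions = defaultdict(int)
        self.wait_time = defaultdict(float)

    def __enter__(self):
        t0 = time.perf_counter()
        self._lock.acquire()          # contended time shows up here
        waited = time.perf_counter() - t0
        me = threading.current_thread().name
        with self._stats_lock:
            self.acquisitions[me] += 1
            self.wait_time[me] += waited
        return self

    def __exit__(self, *exc):
        self._lock.release()

counter = 0
lock = InstrumentedLock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=worker, args=(1000,), name=f"worker-{i}")
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for name in sorted(lock.acquisitions):
    print(name, lock.acquisitions[name], f"{lock.wait_time[name]:.6f}s waited")
```

A high wait-time share for one thread, like the 1/3 Hans observes for the metadata kworker, points at whichever other names show up in the same table.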
* Re: Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2
  2020-07-28 14:52           ` Hans van Kranenburg
@ 2020-07-29  0:15             ` Qu Wenruo
  0 siblings, 0 replies; 17+ messages in thread
From: Qu Wenruo @ 2020-07-29  0:15 UTC (permalink / raw)
  To: Hans van Kranenburg, linux-btrfs

[-- Attachment #1.1: Type: text/plain, Size: 3758 bytes --]

On 2020/7/28 10:52 PM, Hans van Kranenburg wrote:
> On 7/28/20 3:52 AM, Qu Wenruo wrote:
>>
>>
>> On 2020/7/28 8:51 AM, Qu Wenruo wrote:
>>>
>>>
>>> On 2020/7/28 1:17 AM, Hans van Kranenburg wrote:
>>>> Hi!
>>>>
>>>> [...]
>>>>>
>>>>> Since it's almost a dead CPU-burner loop, a regular sleep-based lockup
>>>>> detector won't help much.
>>>>
>>>> Here's a flame graph of 180 seconds, taken from the kernel thread pid:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-27-perf-kworker-flush-btrfs.svg
>>>
>>> That's really awesome!
>>>
>>>>
>>>>> You can try trace events first to see which trace event gets executed the
>>>>> most frequently, then try to add probe points to pin down the real cause.
>>>>
>>>> From the default collection, I already got the following a few days
>>>> ago, by enabling find_free_extent and btrfs_cow_block:
>>>>
>>>> https://syrinx.knorrie.org/~knorrie/btrfs/keep/2020-07-25-find_free_extent.txt
>>
>> This output is in fact pretty confusing, and may give you a false view
>> of the callers of find_free_extent().
>>
>> It always shows "EXTENT_TREE" as the owner, but that's due to a bad
>> decision in the trace event.
>>
>> I have submitted a patch addressing it, and added you to the CC.
>
> Oh right, thanks, that actually makes a lot of sense, lol.
>
> I was misled because at first sight I was thinking, "yeah, obviously,
> where else than in the extent tree are you going to do administration
> about allocated blocks", and hadn't yet realized that the output made
> no sense.
>
>> Would you mind re-capturing the events with that patch?
>> That way we could get a clearer idea of which tree sees the most
>> concurrency.
>
> Yes. I will do that and try to reproduce the symptoms with as few
> actions in parallel as possible.
>
> I've been away most of the day today; I will see how far I get later,
> otherwise I'll continue tomorrow.
>
> What you *can* see in the current output already, however, is that the
> kworker/u16:3-13887 thing is doing all DATA work, while many different
> processes (rsync, btrfs receive) all do the find_free_extent work for
> METADATA. That's already an interesting difference.

That's common; don't forget that btrfs is doing COW to modify its
metadata.

After a data write reaches disk, btrfs will then modify metadata to
reflect that write, which will cause a metadata reservation.

The good news is, btrfs is not always doing metadata COW: if the needed
tree blocks have already been COWed and not yet written to disk, they
don't need to be COWed again.

So in theory, data reservations should happen more frequently than
metadata ones.

>
> So, one direction to look into is who all is trying to grab that spin
> lock, since if it's per 'space' (which sounds logical, since the workers
> will never clash because a whole block group belongs to only one
> 'space'), then I don't see why kworker/u16:18-11336 would spend 1/3 of
> its time busy waiting on a lock, while it's the only process working
> on METADATA.
>
> But, I'll gather some more logs and pictures.
>
> Do you know of a way (some RTFM pointer?) to debug who's locking on the
> same thing? I haven't researched that yet.

I had a bcc-based script to do that kind of analysis, but it's mostly
for tree blocks:

https://github.com/adam900710/btrfs-profiler/blob/master/tree_lock_wait.py

Feel free to use it as a base to develop your own tool.

BTW, this data allocation race reminds me that Josef and Nikolay
discussed this possible problem before, but there is not yet a solid
solution to it.
So I guess what we can do now is just collect enough data to prove
it's already a big problem.

Thanks,
Qu

>
> Hans
>
>> [...]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 17+ messages in thread
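[Editor's note] Once the patched trace event reports the real tree owner, the re-capture Qu asks for boils down to tallying find_free_extent events per owner. Here is a sketch of that tally step; the sample line format is invented, so check the `format` file under `events/btrfs/find_free_extent/` in tracefs on your kernel for the real field names before relying on them:

```python
import re
from collections import Counter

# Assumed field name: a "root=<owner>" token somewhere in the event line.
EVENT_RE = re.compile(r'find_free_extent:.*\broot=(\S+)')

def tally_owners(trace_lines):
    """Count find_free_extent events per reported tree owner."""
    counts = Counter()
    for line in trace_lines:
        m = EVENT_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

# Illustrative trace_pipe-style lines (not real output):
sample = [
    "kworker/u16:3-13887 [002] .... 1234.5: find_free_extent: root=DATA len=4096 empty_size=0",
    "rsync-4242 [001] .... 1234.6: find_free_extent: root=EXTENT_TREE len=16384 empty_size=0",
    "rsync-4242 [001] .... 1234.7: find_free_extent: root=EXTENT_TREE len=16384 empty_size=0",
]
print(tally_owners(sample).most_common())
```

In practice you would feed it from the live event stream, e.g. `tally_owners(open("/sys/kernel/debug/tracing/trace_pipe"))` while the workload runs, and the owner with the highest count is the tree driving the allocator hardest.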
end of thread, other threads:[~2020-07-29  0:15 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-25 14:24 Debugging abysmal write performance with 100% cpu kworker/u16:X+flush-btrfs-2 Hans van Kranenburg
2020-07-25 15:37 ` Holger Hoffstätte
2020-07-25 16:43   ` Hans van Kranenburg
2020-07-25 19:44     ` Holger Hoffstätte
2020-07-25 21:03       ` Hans van Kranenburg
2020-07-26  1:00         ` Chris Murphy
2020-07-25 21:27       ` Hans van Kranenburg
2020-07-26  8:10         ` A L
2020-07-26  0:50 ` Chris Murphy
2020-07-27 11:09 ` Qu Wenruo
2020-07-27 17:17   ` Hans van Kranenburg
2020-07-27 19:23     ` Chris Murphy
2020-07-27 23:16       ` Chris Murphy
2020-07-28  0:51     ` Qu Wenruo
2020-07-28  1:52       ` Qu Wenruo
2020-07-28 14:52         ` Hans van Kranenburg
2020-07-29  0:15           ` Qu Wenruo