From: Julian Taylor <julian.taylor@1und1.de>
To: <linux-btrfs@vger.kernel.org>
Subject: slow performance due to frequent memalloc_retry_wait in btrfs_alloc_page_array
Date: Tue, 12 Mar 2024 14:35:46 +0100
Message-ID: <8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de>

Hello,

After upgrading a machine using btrfs from a 5.10 to a 6.1 kernel we
are experiencing very low read performance on some (compressed) files
when most of the node's memory is in use by applications and the
filesystem cache. Reading these files does not exceed 5MiB/s while the
underlying disks can sustain ~800MiB/s. The load on the machine while
reading the files slowly is basically zero.

The filesystem is mounted with

  btrfs (rw,relatime,compress=zstd:3,space_cache=v2,subvolid=5,subvol=/)

The filesystem contains several snapshot volumes.

Checking with blktrace we noticed a lot of queue unplug events.
Tracing them showed that the cause is most likely io_schedule_timeout
being called extremely frequently from btrfs_alloc_page_array, which
since 5.19 (91d6ac1d62c3dc0f102986318f4027ccfa22c638) uses bulk page
allocations with a memalloc_retry_wait on failure:

$ perf record -e block:block_unplug -g

$ perf script

         ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246 ([kernel.kallsyms])
         ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246 ([kernel.kallsyms])
         ffffffffa3bb1205 __blk_flush_plug+0xf5 ([kernel.kallsyms])
         ffffffffa4213f15 io_schedule_timeout+0x45 ([kernel.kallsyms])
         ffffffffc0c74d42 btrfs_alloc_page_array+0x42 ([kernel.kallsyms])
         ffffffffc0ca8c2e btrfs_submit_compressed_read+0x16e ([kernel.kallsyms])
         ffffffffc0c724f8 submit_one_bio+0x48 ([kernel.kallsyms])
         ffffffffc0c75295 btrfs_do_readpage+0x415 ([kernel.kallsyms])
         ffffffffc0c766d1 extent_readahead+0x2e1 ([kernel.kallsyms])
         ffffffffa3904bf2 read_pages+0x82 ([kernel.kallsyms])

When bottlenecked in this code, allocations of fewer than 10 pages
only receive a single page per loop iteration, so they run into the
io_schedule_timeout every time.
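
For reference, the retry loop in question looks roughly like this (a
simplified sketch of btrfs_alloc_page_array as of 6.1, paraphrased
from memory rather than quoted from fs/btrfs/extent_io.c):

int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array)
{
        unsigned int allocated;

        for (allocated = 0; allocated < nr_pages;) {
                unsigned int last = allocated;

                /* May return fewer pages than requested. */
                allocated = alloc_pages_bulk_array(GFP_NOFS, nr_pages,
                                                   page_array);
                if (allocated == nr_pages)
                        return 0;

                /* Not even a single page: treat as out of memory. */
                if (allocated == last)
                        return -ENOMEM;

                /*
                 * Partial progress still ends up here, i.e. in a small
                 * sleep (io_schedule_timeout) on every iteration.
                 */
                memalloc_retry_wait(GFP_NOFS);
        }
        return 0;
}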

Tracing the arguments while reading on slow performance shows:

# bpftrace -e "kfunc:btrfs_alloc_page_array {@pages = 
lhist(args->nr_pages, 0, 20, 1)} kretfunc:__alloc_pages_bulk {@allocret 
= lhist(retval, 0, 20, 1)}"
Attaching 2 probes...


@allocret:
[1, 2)               298 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[2, 3)               295 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[3, 4)               295 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4, 5)               300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@pages:
[4, 5)               295 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


Checking further why the bulk page allocation only returns a single
page, we noticed this only happens when all memory of the node is tied
up, even if most of it is still reclaimable.

It can be reliably reproduced on the machine by filling the page cache
with data from the disk (just via cat * >/dev/null) until we have the
following memory situation on the two-socket machine:

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62
node 0 size: 192048 MB
node 0 free: 170340 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49 51 53 55 57 59 61 63
node 1 size: 193524 MB
node 1 free: 224 MB        <<< nothing free due to cache

$ top

MiB Mem : 385573.2 total, 170093.0 free,  19379.1 used, 201077.9 buff/cache
MiB Swap:   3812.0 total,   3812.0 free,      0.0 used. 366194.1 avail Mem


When now reading a file with a process bound to a CPU on node 1
(e.g. taskset -c 1 cat $file) we see the high io_schedule_timeout rate
and very low read performance.
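
Putting it together, the reproduction boils down to roughly the
following (the paths are placeholders; binding the cache-filling reads
to node 1 is only there to make sure that node's memory fills up
first):

  # fill node 1's memory with clean, reclaimable page cache
  numactl --cpunodebind=1 --membind=1 cat /srv/data/* > /dev/null

  # then read a compressed file with a process pinned to a node 1 CPU
  taskset -c 1 cat /srv/data/somefile > /dev/null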

This is seen with Linux 6.1.76 (Debian 12 stable) and Linux 6.7.9
(Debian unstable).


It appears the bulk page allocation used by btrfs_alloc_page_array has
a high failure rate when the per-cpu page lists are empty: it does not
attempt to reclaim memory from the page cache but instead falls back
to returning a single page via the normal page allocator. Combined
with memalloc_retry_wait being called on every iteration, this results
in the very slow read performance.

Increasing sysctl vm.percpu_pagelist_high_fraction did not yield any
improvement. The only workaround we found is to free the page cache on
the nodes before reading the data.
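
For completeness, the workaround amounts to dropping the clean page
cache before the reads (which of course throws away all cached data,
not just the busy node's):

  # workaround, not a fix: free the page cache before reading
  sync
  echo 1 > /proc/sys/vm/drop_caches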

Assuming the bulk page allocation functions are intended not to
reclaim memory when the per-cpu lists are empty, the way
btrfs_alloc_page_array handles a partial bulk allocation should
probably be revised.
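
One possible direction, sketched here purely as an illustration and
not as a tested patch, would be to keep retrying immediately while the
bulk call makes partial progress and only treat a call that returned
nothing as out of memory:

        /* Illustrative sketch only, not a tested or submitted patch. */
        for (allocated = 0; allocated < nr_pages;) {
                unsigned int last = allocated;

                allocated = alloc_pages_bulk_array(GFP_NOFS, nr_pages,
                                                   page_array);
                if (allocated == nr_pages)
                        return 0;

                /* No progress at all: genuinely out of memory. */
                if (allocated == last)
                        return -ENOMEM;

                /*
                 * Partial progress: retry right away for the remaining
                 * pages instead of sleeping in memalloc_retry_wait().
                 */
        }
        return 0;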


Cheers,

Julian Taylor

