* slow performance due to frequent memalloc_retry_wait in btrfs_alloc_page_array
@ 2024-03-12 13:35 Julian Taylor
  2024-03-13  6:26 ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Julian Taylor @ 2024-03-12 13:35 UTC (permalink / raw)
  To: linux-btrfs

Hello,

After upgrading a machine using btrfs from a 5.10 to a 6.1 kernel we are
experiencing very low read performance on some (compressed) files when
most of the node's memory is in use by applications and the filesystem
cache. Reading some files does not exceed 5MiB/s while the
underlying disks can sustain ~800MiB/s. The load on the machine while
reading the files slowly is basically zero.

The filesystem is mounted with

  btrfs (rw,relatime,compress=zstd:3,space_cache=v2,subvolid=5,subvol=/)

The filesystem contains several snapshot volumes.

Checking with blktrace we noticed a lot of queue unplug events which,
when traced, showed that the cause is most likely io_schedule_timeout
being called extremely frequently from btrfs_alloc_page_array, which
since 5.19 (91d6ac1d62c3dc0f102986318f4027ccfa22c638) uses bulk page
allocations with a memalloc_retry_wait on failure:

$ perf record -e block:block_unplug -g

$ perf script

         ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246 
([kernel.kallsyms])
         ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246 
([kernel.kallsyms])
         ffffffffa3bb1205 __blk_flush_plug+0xf5 ([kernel.kallsyms])
         ffffffffa4213f15 io_schedule_timeout+0x45 ([kernel.kallsyms])
         ffffffffc0c74d42 btrfs_alloc_page_array+0x42 ([kernel.kallsyms])
         ffffffffc0ca8c2e btrfs_submit_compressed_read+0x16e 
([kernel.kallsyms])
         ffffffffc0c724f8 submit_one_bio+0x48 ([kernel.kallsyms])
         ffffffffc0c75295 btrfs_do_readpage+0x415 ([kernel.kallsyms])
         ffffffffc0c766d1 extent_readahead+0x2e1 ([kernel.kallsyms])
         ffffffffa3904bf2 read_pages+0x82 ([kernel.kallsyms])

When bottlenecked in this code, allocations of fewer than 10 pages
only receive a single page per loop iteration, so they run into the
io_schedule_timeout every time.
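
For reference, the retry loop in btrfs_alloc_page_array
(fs/btrfs/extent_io.c), simplified here with the error-path page
freeing omitted, looks roughly like this:

	for (allocated = 0; allocated < nr_pages;) {
		unsigned int last = allocated;

		/* May return far fewer than nr_pages when the per-cpu
		 * lists are empty. */
		allocated = alloc_pages_bulk_array(GFP_NOFS, nr_pages,
						   page_array);
		if (allocated == nr_pages)
			return 0;

		/* No progress at all means we are really out of memory. */
		if (allocated == last)
			return -ENOMEM;

		/* A single page of progress per iteration still waits here. */
		memalloc_retry_wait(GFP_NOFS);
	}

So a request for e.g. 5 pages that is satisfied one page per iteration
sleeps in memalloc_retry_wait four times before completing.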

Tracing the arguments during a slow read shows:

# bpftrace -e "kfunc:btrfs_alloc_page_array {@pages = 
lhist(args->nr_pages, 0, 20, 1)} kretfunc:__alloc_pages_bulk {@allocret 
= lhist(retval, 0, 20, 1)}"
Attaching 2 probes...


@allocret:
[1, 2)               298 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[2, 3)               295 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[3, 4)               295 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[4, 5)               300 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|

@pages:
[4, 5)               295 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


Further checking why the bulk page allocation only returns a single page,
we noticed this only happens when all memory of the node is tied up,
even if it is still reclaimable.

It can be reliably reproduced on the machine by filling the page cache
with data from the disk (just via cat * >/dev/null) until we have the
following memory situation on the machine with two sockets:

$ numactl --hardware

available: 2 nodes (0-1)

node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 
42 44 46 48 50 52 54 56 58 60 62
node 0 size: 192048 MB
node 0 free: 170340 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 
43 45 47 49 51 53 55 57 59 61 63
node 1 size: 193524 MB
node 1 free: 224 MB        <<< nothing free due to cache

$ top

MiB Mem : 385573.2 total, 170093.0 free,  19379.1 used, 201077.9 buff/cache
MiB Swap:   3812.0 total,   3812.0 free,      0.0 used. 366194.1 avail Mem


When now reading a file with a process bound to a CPU on node 1 (taskset
-c 1 cat $file) we see the high io_schedule_timeout rate and very low read
performance.

This is seen with linux 6.1.76 (debian 12 stable) and linux 6.7.9 
(debian unstable).


It appears the bulk page allocation used by btrfs_alloc_page_array has
a high failure rate when the per-cpu page lists are empty: it does not
attempt to reclaim memory from the page cache but instead returns a
single page via the normal page allocator. Combined with
memalloc_retry_wait being called on each iteration, this causes very
slow performance.

Increasing sysctl vm.percpu_pagelist_high_fraction did not yield any
improvement; the only workaround seems to be to free the page cache on
the nodes before reading the data.

Assuming the bulk page allocation functions are intended not to reclaim
memory when the per-cpu lists are empty, the way btrfs_alloc_page_array
handles bulk allocation failure should probably be revised.


Cheers,

Julian Taylor



* Re: slow performance due to frequent memalloc_retry_wait in btrfs_alloc_page_array
  2024-03-12 13:35 slow performance due to frequent memalloc_retry_wait in btrfs_alloc_page_array Julian Taylor
@ 2024-03-13  6:26 ` Qu Wenruo
  2024-03-13  9:36   ` Julian Taylor
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2024-03-13  6:26 UTC (permalink / raw)
  To: Julian Taylor, linux-btrfs



On 2024/3/13 00:05, Julian Taylor wrote:
> Hello,
>
> After upgrading a machine using btrfs from a 5.10 to a 6.1 kernel we are
> experiencing very low read performance on some (compressed) files when
> most of the node's memory is in use by applications and the filesystem
> cache. Reading some files does not exceed 5MiB/s while the
> underlying disks can sustain ~800MiB/s. The load on the machine while
> reading the files slowly is basically zero.
>
> The filesystem is mounted with
>
>   btrfs (rw,relatime,compress=zstd:3,space_cache=v2,subvolid=5,subvol=/)
>
> The filesystem contains several snapshot volumes.
>
> Checking with blktrace we noticed a lot of queue unplug events which,
> when traced, showed that the cause is most likely io_schedule_timeout
> being called extremely frequently from btrfs_alloc_page_array, which
> since 5.19 (91d6ac1d62c3dc0f102986318f4027ccfa22c638) uses bulk page
> allocations with a memalloc_retry_wait on failure:
>
> $ perf record -e block:block_unplug -g
>
> $ perf script
>
>          ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246
> ([kernel.kallsyms])
>          ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246
> ([kernel.kallsyms])
>          ffffffffa3bb1205 __blk_flush_plug+0xf5 ([kernel.kallsyms])
>          ffffffffa4213f15 io_schedule_timeout+0x45 ([kernel.kallsyms])
>          ffffffffc0c74d42 btrfs_alloc_page_array+0x42 ([kernel.kallsyms])

Btrfs needs to allocate all the pages for the compressed extents, which
can be very large (as large as 128K, even if the read may only be 4K).

Furthermore, since your system has very high memory pressure, the page
cache doesn't have much chance to cache the decompressed contents.

Thus I'm afraid your high memory pressure case is not really a good
fit for compression. (Both compressed reads and writes need extra
pages beyond the inode page cache.)

And considering your storage is very fast (800+MiB/s), there is really
little benefit from compression (other than saving disk usage).

>          ffffffffc0ca8c2e btrfs_submit_compressed_read+0x16e
> ([kernel.kallsyms])
>          ffffffffc0c724f8 submit_one_bio+0x48 ([kernel.kallsyms])
>          ffffffffc0c75295 btrfs_do_readpage+0x415 ([kernel.kallsyms])
>          ffffffffc0c766d1 extent_readahead+0x2e1 ([kernel.kallsyms])
>          ffffffffa3904bf2 read_pages+0x82 ([kernel.kallsyms])
>
> When bottlenecked in this code, allocations of fewer than 10 pages
> only receive a single page per loop iteration, so they run into the
> io_schedule_timeout every time.
>
> Tracing the arguments during a slow read shows:
>
> # bpftrace -e "kfunc:btrfs_alloc_page_array {@pages =
> lhist(args->nr_pages, 0, 20, 1)} kretfunc:__alloc_pages_bulk {@allocret
> = lhist(retval, 0, 20, 1)}"
> Attaching 2 probes...
>
>
> @allocret:
> [1, 2)               298
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [2, 3)               295
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [3, 4)               295
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
> [4, 5)               300
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
> @pages:
> [4, 5)               295
> |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>
>
> Further checking why the bulk page allocation only returns a single page,
> we noticed this only happens when all memory of the node is tied up,
> even if it is still reclaimable.
>
> It can be reliably reproduced on the machine by filling the page cache
> with data from the disk (just via cat * >/dev/null) until we have the
> following memory situation on the machine with two sockets:
>
> $ numactl --hardware
>
> available: 2 nodes (0-1)
>
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
> 42 44 46 48 50 52 54 56 58 60 62
> node 0 size: 192048 MB
> node 0 free: 170340 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41
> 43 45 47 49 51 53 55 57 59 61 63
> node 1 size: 193524 MB
> node 1 free: 224 MB        <<< nothing free due to cache

This is interesting, such unbalanced free memory is indeed going to
cause problems.

>
> $ top
>
> MiB Mem : 385573.2 total, 170093.0 free,  19379.1 used, 201077.9 buff/cache
> MiB Swap:   3812.0 total,   3812.0 free,      0.0 used. 366194.1 avail Mem
>
>
> When now reading a file with a process bound to a CPU on node 1 (taskset
> -c 1 cat $file) we see the high io_schedule_timeout rate and very low read
> performance.
>
> This is seen with linux 6.1.76 (debian 12 stable) and linux 6.7.9
> (debian unstable).
>
>
> It appears the bulk page allocation used by btrfs_alloc_page_array has
> a high failure rate when the per-cpu page lists are empty: it does not
> attempt to reclaim memory from the page cache but instead returns a
> single page via the normal page allocator. Combined with
> memalloc_retry_wait being called on each iteration, this causes very
> slow performance.

Not an expert on NUMA, but I guess there should be some way to balance
the free memory between different numa nodes?

Can it be done automatically/periodically as a workaround?

>
> Increasing sysctl vm.percpu_pagelist_high_fraction did not yield any
> improvement; the only workaround seems to be to free the page cache on
> the nodes before reading the data.
>
> Assuming the bulk page allocation functions are intended not to reclaim
> memory when the per-cpu lists are empty, the way btrfs_alloc_page_array
> handles bulk allocation failure should probably be revised.

Any suggestion for improvement?

In our usage we cannot afford to reclaim page cache, as that may
trigger page writeback; meanwhile we may ourselves be in the page
writeback path, which can lead to deadlock.

On the other hand, if we allocate pages for compressed read/write from
other NUMA nodes, wouldn't that cause different performance problems?
E.g. we would still need to do the compression using pages from the
remote NUMA node; wouldn't that also greatly reduce the compression
speed?

Thanks,
Qu
>
>
> Cheers,
>
> Julian Taylor
>
>


* Re: slow performance due to frequent memalloc_retry_wait in btrfs_alloc_page_array
  2024-03-13  6:26 ` Qu Wenruo
@ 2024-03-13  9:36   ` Julian Taylor
  2024-03-22  7:37     ` Qu Wenruo
  0 siblings, 1 reply; 5+ messages in thread
From: Julian Taylor @ 2024-03-13  9:36 UTC (permalink / raw)
  To: linux-btrfs


On 13.03.24 07:26, Qu Wenruo wrote:
>
>
> On 2024/3/13 00:05, Julian Taylor wrote:
>> Hello,
>>
>> After upgrading a machine using btrfs from a 5.10 to a 6.1 kernel we are
>> experiencing very low read performance on some (compressed) files when
>> most of the node's memory is in use by applications and the filesystem
>> cache. Reading some files does not exceed 5MiB/s while the
>> underlying disks can sustain ~800MiB/s. The load on the machine while
>> reading the files slowly is basically zero.
>>
>> The filesystem is mounted with
>>
>>   btrfs (rw,relatime,compress=zstd:3,space_cache=v2,subvolid=5,subvol=/)
>>
>> The filesystem contains several snapshot volumes.
>>
>> Checking with blktrace we noticed a lot of queue unplug events which,
>> when traced, showed that the cause is most likely io_schedule_timeout
>> being called extremely frequently from btrfs_alloc_page_array, which
>> since 5.19 (91d6ac1d62c3dc0f102986318f4027ccfa22c638) uses bulk page
>> allocations with a memalloc_retry_wait on failure:
>>
>> $ perf record -e block:block_unplug -g
>>
>> $ perf script
>>
>>          ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246
>> ([kernel.kallsyms])
>>          ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246
>> ([kernel.kallsyms])
>>          ffffffffa3bb1205 __blk_flush_plug+0xf5 ([kernel.kallsyms])
>>          ffffffffa4213f15 io_schedule_timeout+0x45 ([kernel.kallsyms])
>>          ffffffffc0c74d42 btrfs_alloc_page_array+0x42 
>> ([kernel.kallsyms])
>
> Btrfs needs to allocate all the pages for the compressed extents, which
> can be very large (as large as 128K, even if the read may only be 4K).
>
> Furthermore, since your system has very high memory pressure, the page
> cache doesn't have much chance to cache the decompressed contents.
>
> Thus I'm afraid your high memory pressure case is not really a good
> fit for compression. (Both compressed reads and writes need extra
> pages beyond the inode page cache.)
>
> And considering your storage is very fast (800+MiB/s), there is really
> little benefit from compression (other than saving disk usage).

The machine does not have high memory pressure: it has 380GiB of memory
and the applications on it only use a small fraction of it; it is just a
machine handling backups most of the time.

The memory is all just used by the page cache and is reclaimable. The
bulk page allocation functions just do not reclaim it unless they fall
back to single page allocations.


>
>>
>>
>> Further checking why the bulk page allocation only returns a single page,
>> we noticed this only happens when all memory of the node is tied up,
>> even if it is still reclaimable.
>>
>> It can be reliably reproduced on the machine by filling the page cache
>> with data from the disk (just via cat * >/dev/null) until we have the
>> following memory situation on the machine with two sockets:
>>
>> $ numactl --hardware
>>
>> available: 2 nodes (0-1)
>>
>> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
>> 42 44 46 48 50 52 54 56 58 60 62
>> node 0 size: 192048 MB
>> node 0 free: 170340 MB
>> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41
>> 43 45 47 49 51 53 55 57 59 61 63
>> node 1 size: 193524 MB
>> node 1 free: 224 MB        <<< nothing free due to cache
>
> This is interesting, such unbalanced free memory is indeed going to
> cause problems.
>
>>
>> $ top
>>
>> MiB Mem : 385573.2 total, 170093.0 free,  19379.1 used, 201077.9 
>> buff/cache
>> MiB Swap:   3812.0 total,   3812.0 free,      0.0 used. 366194.1 
>> avail Mem
>>
>>
>> When now reading a file with a process bound to a CPU on node 1 (taskset
>> -c 1 cat $file) we see the high io_schedule_timeout rate and very low read
>> performance.
>>
>> This is seen with linux 6.1.76 (debian 12 stable) and linux 6.7.9
>> (debian unstable).
>>
>>
>> It appears the bulk page allocation used by btrfs_alloc_page_array has
>> a high failure rate when the per-cpu page lists are empty: it does not
>> attempt to reclaim memory from the page cache but instead returns a
>> single page via the normal page allocator. Combined with
>> memalloc_retry_wait being called on each iteration, this causes very
>> slow performance.
>
> Not an expert on NUMA, but I guess there should be some way to balance
> the free memory between different numa nodes?
>
> Can it be done automatically/periodically as a workaround?

Dropping data from the page cache is the workaround we are using, via
fadvise(DONTNEED) on the data.
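
For illustration, a minimal sketch of that workaround as a hypothetical
helper (assuming the file's pages are clean, so DONTNEED actually drops
them):

	#include <fcntl.h>
	#include <unistd.h>

	/* Drop a file's pages from the page cache after reading it. */
	static void drop_file_cache(const char *path)
	{
		int fd = open(path, O_RDONLY);

		if (fd < 0)
			return;
		posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
		close(fd);
	}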

Balancing the memory between NUMA nodes will not help. At some point
both nodes' memory is in the caches and the same situation will occur on
both nodes.

I have verified this by loading the caches on both nodes:

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 
42 44 46 48 50 52 54 56 58 60 62
node 0 size: 192048 MB
node 0 free: 2316 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 
43 45 47 49 51 53 55 57 59 61 63
node 1 size: 193524 MB
node 1 free: 327 MB

and now loading files with processes bound to either node is affected by
this.


>
>>
>> Increasing sysctl vm.percpu_pagelist_high_fraction did not yield any
>> improvement; the only workaround seems to be to free the page cache on
>> the nodes before reading the data.
>>
>> Assuming the bulk page allocation functions are intended not to reclaim
>> memory when the per-cpu lists are empty, the way btrfs_alloc_page_array
>> handles bulk allocation failure should probably be revised.
>
> Any suggestion for improvement?
>
> In our usage we cannot afford to reclaim page cache, as that may
> trigger page writeback; meanwhile we may ourselves be in the page
> writeback path, which can lead to deadlock.
>
> On the other hand, if we allocate pages for compressed read/write from
> other NUMA nodes, wouldn't that cause different performance problems?
> E.g. we would still need to do the compression using pages from the
> remote NUMA node; wouldn't that also greatly reduce the compression
> speed?

The problem we see is not the page allocation itself but the looping on
memalloc_retry_wait when the bulk allocation falls back to single page
allocations due to empty per-cpu page lists.

My naive suggestion would be to revert the bulk allocation
(91d6ac1d62c3dc0f102986318f4027ccfa22c638) and do single page
allocations again. As far as I can tell the bulk allocation was done for
performance reasons, not to avoid deadlocks due to writeback.

If the performance gain from the bulk allocation is very significant,
maybe the looping on memalloc_retry_wait can be done in some better way,
but I am unfamiliar with the details of why the single page allocation
did not need a retry-wait loop while the bulk page allocation does.


Cheers,

Julian Taylor



* Re: slow performance due to frequent memalloc_retry_wait in btrfs_alloc_page_array
  2024-03-13  9:36   ` Julian Taylor
@ 2024-03-22  7:37     ` Qu Wenruo
  2024-03-25 11:33       ` Julian Taylor
  0 siblings, 1 reply; 5+ messages in thread
From: Qu Wenruo @ 2024-03-22  7:37 UTC (permalink / raw)
  To: Julian Taylor, linux-btrfs

[-- Attachment #1: Type: text/plain, Size: 7950 bytes --]



On 2024/3/13 20:06, Julian Taylor wrote:
> 
> On 13.03.24 07:26, Qu Wenruo wrote:
>>
>>
>> On 2024/3/13 00:05, Julian Taylor wrote:
>>> Hello,
>>>
>>> After upgrading a machine using btrfs from a 5.10 to a 6.1 kernel we are
>>> experiencing very low read performance on some (compressed) files when
>>> most of the node's memory is in use by applications and the filesystem
>>> cache. Reading some files does not exceed 5MiB/s while the
>>> underlying disks can sustain ~800MiB/s. The load on the machine while
>>> reading the files slowly is basically zero.
>>>
>>> The filesystem is mounted with
>>>
>>>   btrfs (rw,relatime,compress=zstd:3,space_cache=v2,subvolid=5,subvol=/)
>>>
>>> The filesystem contains several snapshot volumes.
>>>
>>> Checking with blktrace we noticed a lot of queue unplug events which,
>>> when traced, showed that the cause is most likely io_schedule_timeout
>>> being called extremely frequently from btrfs_alloc_page_array, which
>>> since 5.19 (91d6ac1d62c3dc0f102986318f4027ccfa22c638) uses bulk page
>>> allocations with a memalloc_retry_wait on failure:
>>>
>>> $ perf record -e block:block_unplug -g
>>>
>>> $ perf script
>>>
>>>          ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246
>>> ([kernel.kallsyms])
>>>          ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246
>>> ([kernel.kallsyms])
>>>          ffffffffa3bb1205 __blk_flush_plug+0xf5 ([kernel.kallsyms])
>>>          ffffffffa4213f15 io_schedule_timeout+0x45 ([kernel.kallsyms])
>>>          ffffffffc0c74d42 btrfs_alloc_page_array+0x42 
>>> ([kernel.kallsyms])
>>
>> Btrfs needs to allocate all the pages for the compressed extents, which
>> can be very large (as large as 128K, even if the read may only be 4K).
>>
>> Furthermore, since your system has very high memory pressure, the page
>> cache doesn't have much chance to cache the decompressed contents.
>>
>> Thus I'm afraid your high memory pressure case is not really a good
>> fit for compression. (Both compressed reads and writes need extra
>> pages beyond the inode page cache.)
>>
>> And considering your storage is very fast (800+MiB/s), there is really
>> little benefit from compression (other than saving disk usage).
> 
> The machine does not have high memory pressure: it has 380GiB of memory
> and the applications on it only use a small fraction of it; it is just a
> machine handling backups most of the time.
>
> The memory is all just used by the page cache and is reclaimable. The
> bulk page allocation functions just do not reclaim it unless they fall
> back to single page allocations.
> 
> 
>>
>>>
>>>
>>> Further checking why the bulk page allocation only returns a single page,
>>> we noticed this only happens when all memory of the node is tied up,
>>> even if it is still reclaimable.
>>>
>>> It can be reliably reproduced on the machine by filling the page cache
>>> with data from the disk (just via cat * >/dev/null) until we have the
>>> following memory situation on the machine with two sockets:
>>>
>>> $ numactl --hardware
>>>
>>> available: 2 nodes (0-1)
>>>
>>> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
>>> 42 44 46 48 50 52 54 56 58 60 62
>>> node 0 size: 192048 MB
>>> node 0 free: 170340 MB
>>> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41
>>> 43 45 47 49 51 53 55 57 59 61 63
>>> node 1 size: 193524 MB
>>> node 1 free: 224 MB        <<< nothing free due to cache
>>
>> This is interesting, such unbalanced free memory is indeed going to
>> cause problems.
>>
>>>
>>> $ top
>>>
>>> MiB Mem : 385573.2 total, 170093.0 free,  19379.1 used, 201077.9 
>>> buff/cache
>>> MiB Swap:   3812.0 total,   3812.0 free,      0.0 used. 366194.1 
>>> avail Mem
>>>
>>>
>>> When now reading a file with a process bound to a CPU on node 1 (taskset
>>> -c 1 cat $file) we see the high io_schedule_timeout rate and very low read
>>> performance.
>>>
>>> This is seen with linux 6.1.76 (debian 12 stable) and linux 6.7.9
>>> (debian unstable).
>>>
>>>
>>> It appears the bulk page allocation used by btrfs_alloc_page_array has
>>> a high failure rate when the per-cpu page lists are empty: it does not
>>> attempt to reclaim memory from the page cache but instead returns a
>>> single page via the normal page allocator. Combined with
>>> memalloc_retry_wait being called on each iteration, this causes very
>>> slow performance.
>>
>> Not an expert on NUMA, but I guess there should be some way to balance
>> the free memory between different numa nodes?
>>
>> Can it be done automatically/periodically as a workaround?
> 
> Dropping data from the page cache is the workaround we are using, via
> fadvise(DONTNEED) on the data.
>
> Balancing the memory between NUMA nodes will not help. At some point
> both nodes' memory is in the caches and the same situation will occur on
> both nodes.
> 
> I have verified this by loading the caches on both nodes:
> 
> # numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 
> 42 44 46 48 50 52 54 56 58 60 62
> node 0 size: 192048 MB
> node 0 free: 2316 MB
> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 
> 43 45 47 49 51 53 55 57 59 61 63
> node 1 size: 193524 MB
> node 1 free: 327 MB
> 
> and now loading files with processes bound to either node is affected by
> this.
> 
> 
>>
>>>
>>> Increasing sysctl vm.percpu_pagelist_high_fraction did not yield any
>>> improvement; the only workaround seems to be to free the page cache on
>>> the nodes before reading the data.
>>>
>>> Assuming the bulk page allocation functions are intended not to reclaim
>>> memory when the per-cpu lists are empty, the way btrfs_alloc_page_array
>>> handles bulk allocation failure should probably be revised.
>>
>> Any suggestion for improvement?
>>
>> In our usage we cannot afford to reclaim page cache, as that may
>> trigger page writeback; meanwhile we may ourselves be in the page
>> writeback path, which can lead to deadlock.
>>
>> On the other hand, if we allocate pages for compressed read/write from
>> other NUMA nodes, wouldn't that cause different performance problems?
>> E.g. we would still need to do the compression using pages from the
>> remote NUMA node; wouldn't that also greatly reduce the compression
>> speed?
> 
> The problem we see is not the page allocation itself but the looping on
> memalloc_retry_wait when the bulk allocation falls back to single page
> allocations due to empty per-cpu page lists.
>
> My naive suggestion would be to revert the bulk allocation
> (91d6ac1d62c3dc0f102986318f4027ccfa22c638) and do single page
> allocations again. As far as I can tell the bulk allocation was done for
> performance reasons, not to avoid deadlocks due to writeback.
>
> If the performance gain from the bulk allocation is very significant,
> maybe the looping on memalloc_retry_wait can be done in some better way,
> but I am unfamiliar with the details of why the single page allocation
> did not need a retry-wait loop while the bulk page allocation does.

I believe you're right.

The common scheme for bulk allocation should be to try the
bulk/optimized version first and, if that fails, fall back to the
single allocation one.

In fact, that's exactly what I'm trying to do for larger folio support
for btrfs metadata.
In that case, we try a larger folio first, then fall back to the regular
btrfs_alloc_page_array().

So would you mind testing the attached patch to see if it solves the
problem for you?
The patch does exactly what I said above: try bulk allocation first,
then do single page allocation for the remaining pages. Since this
version no longer does any retry-wait, the behavior should be more or
less the same, while still keeping the bulk attempt to benefit from it.

Thanks,
Qu

> 
> 
> Cheers,
> 
> Julian Taylor
> 
> 

[-- Attachment #2: 0001-btrfs-fallback-to-single-page-allocation-to-avoid-bu.patch --]
[-- Type: text/x-patch, Size: 2562 bytes --]

From d0bdbdcd91faa8eebc17ff7aa4938d6d1bef9cbb Mon Sep 17 00:00:00 2001
Message-ID: <d0bdbdcd91faa8eebc17ff7aa4938d6d1bef9cbb.1711092994.git.wqu@suse.com>
From: Qu Wenruo <wqu@suse.com>
Date: Fri, 22 Mar 2024 17:56:42 +1030
Subject: [PATCH] btrfs: fallback to single page allocation to avoid bulk
 allocation latency

[BUG]
There is a recent report that compression is taking a lot of time
waiting for memory allocation.

[CAUSE]
For btrfs_alloc_page_array() we always go alloc_pages_bulk_array(), and
even if the bulk allocation failed we still retry but with extra
memalloc_retry_wait().

If the bulk alloc only returned one page a time, we would spend a lot of
time on the retry wait.

[FIX]
Instead of always trying the same bulk allocation, fallback to single
page allocation if the initial bulk allocation attempt doesn't fill the
whole request.

Reported-by: Julian Taylor <julian.taylor@1und1.de>
Link: https://lore.kernel.org/all/8966c095-cbe7-4d22-9784-a647d1bf27c3@1und1.de/
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
 fs/btrfs/extent_io.c | 35 ++++++++++++-----------------------
 1 file changed, 12 insertions(+), 23 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 7441245b1ceb..d49e7f0384ed 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -681,33 +681,22 @@ static void end_bbio_data_read(struct btrfs_bio *bbio)
 int btrfs_alloc_page_array(unsigned int nr_pages, struct page **page_array,
 			   gfp_t extra_gfp)
 {
+	const gfp_t gfp = GFP_NOFS | extra_gfp;
 	unsigned int allocated;
 
-	for (allocated = 0; allocated < nr_pages;) {
-		unsigned int last = allocated;
-
-		allocated = alloc_pages_bulk_array(GFP_NOFS | extra_gfp,
-						   nr_pages, page_array);
-
-		if (allocated == nr_pages)
-			return 0;
-
-		/*
-		 * During this iteration, no page could be allocated, even
-		 * though alloc_pages_bulk_array() falls back to alloc_page()
-		 * if  it could not bulk-allocate. So we must be out of memory.
-		 */
-		if (allocated == last) {
-			for (int i = 0; i < allocated; i++) {
-				__free_page(page_array[i]);
-				page_array[i] = NULL;
-			}
-			return -ENOMEM;
-		}
-
-		memalloc_retry_wait(GFP_NOFS);
+	allocated = alloc_pages_bulk_array(GFP_NOFS | gfp, nr_pages, page_array);
+	for (; allocated < nr_pages; allocated++) {
+		page_array[allocated] = alloc_page(gfp);
+		if (unlikely(!page_array[allocated]))
+			goto enomem;
 	}
 	return 0;
+enomem:
+	for (int i = 0; i < allocated; i++) {
+		__free_page(page_array[i]);
+		page_array[i] = NULL;
+	}
+	return -ENOMEM;
 }
 
 /*
-- 
2.44.0



* Re: slow performance due to frequent memalloc_retry_wait in btrfs_alloc_page_array
  2024-03-22  7:37     ` Qu Wenruo
@ 2024-03-25 11:33       ` Julian Taylor
  0 siblings, 0 replies; 5+ messages in thread
From: Julian Taylor @ 2024-03-25 11:33 UTC (permalink / raw)
  To: Qu Wenruo, linux-btrfs


On 22.03.24 08:37, Qu Wenruo wrote:
>
>
> On 2024/3/13 20:06, Julian Taylor wrote:
>>
>> On 13.03.24 07:26, Qu Wenruo wrote:
>>>
>>>
>>> On 2024/3/13 00:05, Julian Taylor wrote:
>>>> Hello,
>>>>
>>>> After upgrading a machine using btrfs from a 5.10 to a 6.1 kernel we are
>>>> experiencing very low read performance on some (compressed) files when
>>>> most of the node's memory is in use by applications and the filesystem
>>>> cache. Reading some files does not exceed 5MiB/s while the
>>>> underlying disks can sustain ~800MiB/s. The load on the machine while
>>>> reading the files slowly is basically zero.
>>>>
>>>> The filesystem is mounted with
>>>>
>>>>   btrfs 
>>>> (rw,relatime,compress=zstd:3,space_cache=v2,subvolid=5,subvol=/)
>>>>
>>>> The filesystem contains several snapshot volumes.
>>>>
>>>> Checking with blktrace we noticed a lot of queue unplug events which,
>>>> when traced, showed that the cause is most likely io_schedule_timeout
>>>> being called extremely frequently from btrfs_alloc_page_array, which
>>>> since 5.19 (91d6ac1d62c3dc0f102986318f4027ccfa22c638) uses bulk page
>>>> allocations with a memalloc_retry_wait on failure:
>>>>
>>>> $ perf record -e block:block_unplug -g
>>>>
>>>> $ perf script
>>>>
>>>>          ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246
>>>> ([kernel.kallsyms])
>>>>          ffffffffa3bbff86 blk_mq_flush_plug_list.part.0+0x246
>>>> ([kernel.kallsyms])
>>>>          ffffffffa3bb1205 __blk_flush_plug+0xf5 ([kernel.kallsyms])
>>>>          ffffffffa4213f15 io_schedule_timeout+0x45 ([kernel.kallsyms])
>>>>          ffffffffc0c74d42 btrfs_alloc_page_array+0x42 
>>>> ([kernel.kallsyms])
>>>
>>> Btrfs needs to allocate all the pages for the compressed extents, which
>>> can be very large (as large as 128K, even if the read may only be 4K).
>>>
>>> Furthermore, since your system has very high memory pressure, the page
>>> cache doesn't have much chance to cache the decompressed contents.
>>>
>>> Thus I'm afraid your high memory pressure case is not really a good
>>> fit for compression. (Both compressed reads and writes need extra
>>> pages beyond the inode page cache.)
>>>
>>> And considering your storage is very fast (800+MiB/s), there is really
>>> little benefit from compression (other than saving disk usage).
>>
>> The machine does not have high memory pressure: it has 380GiB of memory
>> and the applications on it only use a small fraction of it; it is just a
>> machine handling backups most of the time.
>>
>> The memory is all just used by the page cache and is reclaimable. The
>> bulk page allocation functions just do not reclaim it unless they fall
>> back to single page allocations.
>>
>>
>>>
>>>>
>>>>
>>>> Further checking why the bulk page allocation only returns a single page,
>>>> we noticed this only happens when all memory of the node is tied up,
>>>> even if it is still reclaimable.
>>>>
>>>> It can be reliably reproduced on the machine by filling the page cache
>>>> with data from the disk (just via cat * >/dev/null) until we have the
>>>> following memory situation on the machine with two sockets:
>>>>
>>>> $ numactl --hardware
>>>>
>>>> available: 2 nodes (0-1)
>>>>
>>>> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40
>>>> 42 44 46 48 50 52 54 56 58 60 62
>>>> node 0 size: 192048 MB
>>>> node 0 free: 170340 MB
>>>> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41
>>>> 43 45 47 49 51 53 55 57 59 61 63
>>>> node 1 size: 193524 MB
>>>> node 1 free: 224 MB        <<< nothing free due to cache
>>>
>>> This is interesting, such unbalanced free memory is indeed going to
>>> cause problems.
>>>
>>>>
>>>> $ top
>>>>
>>>> MiB Mem : 385573.2 total, 170093.0 free,  19379.1 used, 201077.9 
>>>> buff/cache
>>>> MiB Swap:   3812.0 total,   3812.0 free,      0.0 used. 366194.1 
>>>> avail Mem
>>>>
>>>>
>>>> When now reading a file with a process bound to a CPU on node 1 (taskset
>>>> -c 1 cat $file) we see the high io_schedule_timeout rate and very low
>>>> read performance.
>>>>
>>>> This is seen with linux 6.1.76 (debian 12 stable) and linux 6.7.9
>>>> (debian unstable).
>>>>
>>>>
>>>> It appears the bulk page allocation used by btrfs_alloc_page_array has
>>>> a high failure rate when the per-cpu page lists are empty: it does not
>>>> attempt to reclaim memory from the page cache but instead returns a
>>>> single page via the normal page allocator. Combined with
>>>> memalloc_retry_wait being called on each iteration, this causes very
>>>> slow performance.
>>>
>>> Not an expert on NUMA, but I guess there should be some way to balance
>>> the free memory between different numa nodes?
>>>
>>> Can it be done automatically/periodically as a workaround?
>>
>> Dropping data from the page cache is the workaround we are using, via
>> fadvise(DONTNEED) on the data.
>>
>> Balancing the memory between NUMA nodes will not help. At some point
>> both nodes' memory is in the caches and the same situation will occur
>> on both nodes.
>>
>> I have verified this by loading the caches on both nodes:
>>
>> # numactl --hardware
>> available: 2 nodes (0-1)
>> node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 
>> 40 42 44 46 48 50 52 54 56 58 60 62
>> node 0 size: 192048 MB
>> node 0 free: 2316 MB
>> node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 
>> 41 43 45 47 49 51 53 55 57 59 61 63
>> node 1 size: 193524 MB
>> node 1 free: 327 MB
>>
>> and now loading files with processes bound to either node is affected
>> by this.
>>
>>
>>>
>>>>
>>>> Increasing sysctl vm.percpu_pagelist_high_fraction did not yield any
>>>> improvement; the only workaround seems to be to free the page cache on
>>>> the nodes before reading the data.
>>>>
>>>> Assuming the bulk page allocation functions are intended not to reclaim
>>>> memory when the per-cpu lists are empty, the way btrfs_alloc_page_array
>>>> handles bulk allocation failure should probably be revised.
>>>
>>> Any suggestion for improvement?
>>>
>>> In our usage we cannot afford to reclaim page cache, as that may
>>> trigger page writeback; meanwhile we may ourselves be in the page
>>> writeback path, which can lead to deadlock.
>>>
>>> On the other hand, if we allocate pages for compressed read/write from
>>> other NUMA nodes, wouldn't that cause different performance problems?
>>> E.g. we would still need to do the compression using pages from the
>>> remote NUMA node; wouldn't that also greatly reduce the compression
>>> speed?
>>
>> The problem we see is not the page allocation itself but the looping on
>> memalloc_retry_wait when the bulk allocation falls back to single page
>> allocations due to empty per-cpu page lists.
>>
>> My naive suggestion would be to revert the bulk allocation
>> (91d6ac1d62c3dc0f102986318f4027ccfa22c638) and do single page
>> allocations again. As far as I can tell the bulk allocation was done
>> for performance reasons, not to avoid deadlocks due to writeback.
>>
>> If the performance gain from the bulk allocation is very significant,
>> maybe the looping on memalloc_retry_wait can be done in some better way,
>> but I am unfamiliar with the details of why the single page allocation
>> did not need a retry-wait loop while the bulk page allocation does.
>
> I believe you're right.
>
> The common scheme for bulk allocation should be to try the
> bulk/optimized version first and, if that fails, fall back to the
> single allocation one.
>
> In fact, that's exactly what I'm trying to do for larger folio support
> for btrfs metadata.
> In that case, we try a larger folio first, then fall back to the regular
> btrfs_alloc_page_array().
>
> So would you mind testing the attached patch to see if it solves the
> problem for you?
> The patch does exactly what I said above: try bulk allocation first,
> then do single page allocation for the remaining pages. Since this
> version no longer does any retry-wait, the behavior should be more or
> less the same, while still keeping the bulk attempt to benefit from it.


I have applied the patch to the running 6.1 kernel (it just needed
extra_gfp removed) and the problem is no longer reproducible.

# ensure all memory is used by page cache by reading arbitrary data.

find . -size +100M | taskset -c 1 xargs cat > /dev/null

6.1 unpatched:

# reading compressed file that triggers btrfs_alloc_page_array

python3 /tmp/drop-caches.py $f; taskset -c 1 cat $f | pv > /dev/null
  399MiB 0:02:03 [3.23MiB/s]


6.1 patched:

python3 /tmp/drop-caches.py $f; taskset -c 1 cat $f | pv > /dev/null
  399MiB 0:00:00 [ 710MiB/s] [   <=>


To verify that the bulk allocation still returns only one page in the
patched kernel I ran this trace while reading:

# bpftrace -e "kretfunc:__alloc_pages_bulk {if (args->nr_pages != 
retval) {@allocret = lhist(retval, 0, 20, 1);}}"
Attaching 1 probe...
^C

@allocret:
[1, 2)              9516 
|@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|


So the patch does solve the problem for us.



+    const gfp_t gfp = GFP_NOFS | extra_gfp;
+    allocated = alloc_pages_bulk_array(GFP_NOFS | gfp, nr_pages, 
page_array);

The GFP_NOFS | on the alloc_pages_bulk_array call is redundant.
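
i.e. the call could simply be

	allocated = alloc_pages_bulk_array(gfp, nr_pages, page_array);

since gfp already includes GFP_NOFS.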


Thanks,

Julian


