* [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
@ 2022-09-14  2:33 Hongchen Zhang
  2022-09-14 22:51 ` Andrew Morton
  0 siblings, 1 reply; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-14  2:33 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Hongchen Zhang

When a process takes a page fault and there is not enough free
memory, it does direct reclaim while holding mmap_lock, so in the
multi-threaded case it should exit the page fault path as soon as
possible.
When reclaiming memory, we do a scan adjustment between the anon and
file LRUs which may cost too much time and trigger hung-task warnings
for other threads. So a process which is not kswapd should only do a
small scan adjustment.

Signed-off-by: Hongchen Zhang <zhanghongchen@loongson.cn>
---
 mm/vmscan.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b2b1431352dc..07fb58db812b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3042,11 +3042,17 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
 		nr[lru] = targets[lru] * (100 - percentage) / 100;
 		nr[lru] -= min(nr[lru], nr_scanned);
 
+		if (!current_is_kswapd())
+			nr[lru] = min(nr[lru], nr_to_reclaim);
+
 		lru += LRU_ACTIVE;
 		nr_scanned = targets[lru] - nr[lru];
 		nr[lru] = targets[lru] * (100 - percentage) / 100;
 		nr[lru] -= min(nr[lru], nr_scanned);
 
+		if (!current_is_kswapd())
+			nr[lru] = min(nr[lru], nr_to_reclaim);
+
 		scan_adjusted = true;
 	}
 	blk_finish_plug(&plug);
-- 
2.34.1
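
For context, the hunk above lands in the proportional scan-adjust block
of shrink_lruvec(). A condensed sketch of that block (paraphrased from
mm/vmscan.c of this era; exact code varies by kernel version), with the
proposed cap marked:

	/*
	 * Sketch: nr_file/nr_anon are the remaining scan counts summed
	 * from nr[]; targets[] holds the original ones. Once
	 * nr_reclaimed >= nr_to_reclaim, stop scanning the LRU type that
	 * is proportionally further along and rescale the other type's
	 * remaining count to the same completion percentage.
	 */
	if (nr_file > nr_anon) {
		unsigned long scan_target = targets[LRU_INACTIVE_ANON] +
					targets[LRU_ACTIVE_ANON] + 1;
		lru = LRU_BASE;
		percentage = nr_anon * 100 / scan_target;
	} else {
		unsigned long scan_target = targets[LRU_INACTIVE_FILE] +
					targets[LRU_ACTIVE_FILE] + 1;
		lru = LRU_FILE;
		percentage = nr_file * 100 / scan_target;
	}

	/* Stop scanning the smaller of the two LRU types */
	nr[lru] = 0;
	nr[lru + LRU_ACTIVE] = 0;

	/* Rescale the other type's remaining scan count ... */
	lru = (lru == LRU_FILE) ? LRU_BASE : LRU_FILE;
	nr_scanned = targets[lru] - nr[lru];
	nr[lru] = targets[lru] * (100 - percentage) / 100;
	nr[lru] -= min(nr[lru], nr_scanned);

	/* ... and this is where the patch caps it for direct reclaim */
	if (!current_is_kswapd())
		nr[lru] = min(nr[lru], nr_to_reclaim);

The same rescale-and-cap pair is then repeated for the active list of
the same type before scan_adjusted is set, as in the hunk above.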



* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-14  2:33 [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd Hongchen Zhang
@ 2022-09-14 22:51 ` Andrew Morton
  2022-09-14 23:02   ` Matthew Wilcox
  2022-09-15  1:19   ` Hongchen Zhang
  0 siblings, 2 replies; 17+ messages in thread
From: Andrew Morton @ 2022-09-14 22:51 UTC (permalink / raw)
  To: Hongchen Zhang; +Cc: linux-mm, linux-kernel

On Wed, 14 Sep 2022 10:33:18 +0800 Hongchen Zhang <zhanghongchen@loongson.cn> wrote:

> When a process takes a page fault and there is not enough free
> memory, it does direct reclaim while holding mmap_lock, so in the
> multi-threaded case it should exit the page fault path as soon as
> possible.
> When reclaiming memory, we do a scan adjustment between the anon and
> file LRUs which may cost too much time and trigger hung-task warnings
> for other threads. So a process which is not kswapd should only do a
> small scan adjustment.

Well, that's a pretty nasty bug.  Before diving into a possible fix,
can you please tell us more about how this happens?  What sort of
machine, what sort of workload.  Can you suggest why others are not
experiencing this?

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -3042,11 +3042,17 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>  		nr[lru] = targets[lru] * (100 - percentage) / 100;
>  		nr[lru] -= min(nr[lru], nr_scanned);
>  
> +		if (!current_is_kswapd())
> +			nr[lru] = min(nr[lru], nr_to_reclaim);
> +
>  		lru += LRU_ACTIVE;
>  		nr_scanned = targets[lru] - nr[lru];
>  		nr[lru] = targets[lru] * (100 - percentage) / 100;
>  		nr[lru] -= min(nr[lru], nr_scanned);
>  
> +		if (!current_is_kswapd())
> +			nr[lru] = min(nr[lru], nr_to_reclaim);
> +
>  		scan_adjusted = true;
>  	}
>  	blk_finish_plug(&plug);

It would be better if these additions had code comments explaining why
they're there.  But let's more fully understand the problem before
altering your patch.



* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-14 22:51 ` Andrew Morton
@ 2022-09-14 23:02   ` Matthew Wilcox
  2022-09-15  1:19   ` Hongchen Zhang
  1 sibling, 0 replies; 17+ messages in thread
From: Matthew Wilcox @ 2022-09-14 23:02 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Hongchen Zhang, linux-mm, linux-kernel

On Wed, Sep 14, 2022 at 03:51:42PM -0700, Andrew Morton wrote:
> On Wed, 14 Sep 2022 10:33:18 +0800 Hongchen Zhang <zhanghongchen@loongson.cn> wrote:
> 
> > When a process takes a page fault and there is not enough free
> > memory, it does direct reclaim while holding mmap_lock, so in the
> > multi-threaded case it should exit the page fault path as soon as
> > possible.
> > When reclaiming memory, we do a scan adjustment between the anon and
> > file LRUs which may cost too much time and trigger hung-task warnings
> > for other threads. So a process which is not kswapd should only do a
> > small scan adjustment.
> 
> Well, that's a pretty nasty bug.  Before diving into a possible fix,
> can you please tell us more about how this happens?  What sort of
> machine, what sort of workload.  Can you suggest why others are not
> experiencing this?

One thing I'd like to know is whether the page fault is for an anonymous or
file-backed page.  We already drop the mmap_lock for doing file I/O
(or we should ...) and maybe we also need to drop the mmap_lock for
doing direct reclaim?
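
The existing "drop mmap_lock for file I/O" pattern looks roughly like
this sketch, paraphrased from maybe_unlock_mmap_for_io() in
mm/filemap.c of this era (the fault is then retried via VM_FAULT_RETRY
once the I/O completes):

	static struct file *maybe_unlock_mmap_for_io(struct vm_fault *vmf,
						     struct file *fpin)
	{
		unsigned int flags = vmf->flags;

		if (fpin)
			return fpin;

		/* Pin the file and drop mmap_lock before sleeping on I/O,
		 * but only on the first attempt and when retry is allowed.
		 */
		if (fault_flag_allow_retry_first(flags) &&
		    !(flags & FAULT_FLAG_RETRY_NOWAIT)) {
			fpin = get_file(vmf->vma->vm_file);
			mmap_read_unlock(vmf->vma->vm_mm);
		}
		return fpin;
	}

Doing the same around direct reclaim would mean extending this pattern
from the I/O paths into the allocation/reclaim part of the fault path.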


* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-14 22:51 ` Andrew Morton
  2022-09-14 23:02   ` Matthew Wilcox
@ 2022-09-15  1:19   ` Hongchen Zhang
  2022-09-15  7:28     ` Matthew Wilcox
  1 sibling, 1 reply; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-15  1:19 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-mm, linux-kernel, Matthew Wilcox

Hi Andrew,

On 2022/9/15 am 6:51, Andrew Morton wrote:
> On Wed, 14 Sep 2022 10:33:18 +0800 Hongchen Zhang <zhanghongchen@loongson.cn> wrote:
> 
>> When a process takes a page fault and there is not enough free
>> memory, it does direct reclaim while holding mmap_lock, so in the
>> multi-threaded case it should exit the page fault path as soon as
>> possible.
>> When reclaiming memory, we do a scan adjustment between the anon and
>> file LRUs which may cost too much time and trigger hung-task warnings
>> for other threads. So a process which is not kswapd should only do a
>> small scan adjustment.
> 
> Well, that's a pretty nasty bug.  Before diving into a possible fix,
> can you please tell us more about how this happens?  What sort of
> machine, what sort of workload.  Can you suggest why others are not
> experiencing this?
We got a hung task panic originally by doing ltpstress on a LoongArch
3A5000+7A1000 machine. Then we found the same problem on an x86 machine,
as follows:

[ 3748.453561] INFO: task float_bessel:77920 blocked for more than 120 seconds.
[ 3748.460839]       Not tainted 5.15.0-46-generic #49-Ubuntu
[ 3748.466490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3748.474618] task:float_bessel    state:D stack:    0 pid:77920 ppid: 77327 flags:0x00004002
[ 3748.483358] Call Trace:
[ 3748.485964]  <TASK>
[ 3748.488150]  __schedule+0x23d/0x590
[ 3748.491804]  schedule+0x4e/0xc0
[ 3748.495038]  rwsem_down_read_slowpath+0x336/0x390
[ 3748.499886]  ? copy_user_enhanced_fast_string+0xe/0x40
[ 3748.505181]  down_read+0x43/0xa0
[ 3748.508518]  do_user_addr_fault+0x41c/0x670
[ 3748.512799]  exc_page_fault+0x77/0x170
[ 3748.516673]  asm_exc_page_fault+0x26/0x30
[ 3748.520824] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x40
[ 3748.526764] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f 01 ca c3 cc cc cc cc 0f 1f 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3> a4 31 c0 0f 01 ca c3 cc cc cc cc 66 08
[ 3748.546120] RSP: 0018:ffffaa9248fffb90 EFLAGS: 00050206
[ 3748.551495] RAX: 00007f99faa1a010 RBX: ffffaa9248fffd88 RCX: 0000000000000010
[ 3748.558828] RDX: 0000000000001000 RSI: ffff9db397ab8ff0 RDI: 00007f99faa1a000
[ 3748.566160] RBP: ffffaa9248fffbf0 R08: ffffcc2fc2965d80 R09: 0000000000000014
[ 3748.573492] R10: 0000000000000000 R11: 0000000000000014 R12: 0000000000001000
[ 3748.580858] R13: 0000000000001000 R14: 0000000000000000 R15: ffffaa9248fffd98
[ 3748.588196]  ? copy_page_to_iter+0x10e/0x400
[ 3748.592614]  filemap_read+0x174/0x3e0
[ 3748.596354]  ? ima_file_check+0x6a/0xa0
[ 3748.600301]  generic_file_read_iter+0xe5/0x150
[ 3748.604884]  ext4_file_read_iter+0x5b/0x190
[ 3748.609164]  ? aa_file_perm+0x102/0x250
[ 3748.613125]  new_sync_read+0x10d/0x1a0
[ 3748.617009]  vfs_read+0x103/0x1a0
[ 3748.620423]  ksys_read+0x67/0xf0
[ 3748.623743]  __x64_sys_read+0x19/0x20
[ 3748.627511]  do_syscall_64+0x59/0xc0
[ 3748.631203]  ? syscall_exit_to_user_mode+0x27/0x50
[ 3748.636144]  ? do_syscall_64+0x69/0xc0
[ 3748.639992]  ? exit_to_user_mode_prepare+0x96/0xb0
[ 3748.644931]  ? irqentry_exit_to_user_mode+0x9/0x20
[ 3748.649872]  ? irqentry_exit+0x1d/0x30
[ 3748.653737]  ? exc_page_fault+0x89/0x170
[ 3748.657795]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ 3748.663030] RIP: 0033:0x7f9a852989cc
[ 3748.666713] RSP: 002b:00007f9a8497dc90 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
[ 3748.674487] RAX: ffffffffffffffda RBX: 00007f9a8497f5c0 RCX: 00007f9a852989cc
[ 3748.681817] RDX: 0000000000027100 RSI: 00007f99faa18010 RDI: 0000000000000061
[ 3748.689150] RBP: 00007f9a8497dd60 R08: 0000000000000000 R09: 00007f99faa18010
[ 3748.696493] R10: 0000000000000000 R11: 0000000000000246 R12: 00007f99faa18010
[ 3748.703841] R13: 00005605e11c406f R14: 0000000000000001 R15: 0000000000027100
[ 3748.711199]  </TASK>
...
[ 3750.943278] Kernel panic - not syncing: hung_task: blocked tasks
[ 3750.949399] CPU: 1 PID: 39 Comm: khungtaskd Not tainted 5.15.0-46-generic #49-Ubuntu
[ 3750.957305] Hardware name: LENOVO 90DWCTO1WW/30D9, BIOS M05KT70A 03/09/2017
[ 3750.964410] Call Trace:
[ 3750.966897]  <TASK>
[ 3750.969031]  show_stack+0x52/0x5c
[ 3750.972409]  dump_stack_lvl+0x4a/0x63
[ 3750.976129]  dump_stack+0x10/0x16
[ 3750.979491]  panic+0x149/0x321
[ 3750.982612]  check_hung_uninterruptible_tasks.cold+0x34/0x48
[ 3750.988383]  watchdog+0xad/0xb0
[ 3750.991562]  ? check_hung_uninterruptible_tasks+0x300/0x300
[ 3750.997285]  kthread+0x127/0x150
[ 3751.000587]  ? set_kthread_struct+0x50/0x50
[ 3751.004878]  ret_from_fork+0x1f/0x30
[ 3751.008527]  </TASK>
[ 3751.010794] Kernel Offset: 0x34600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 3751.034481] ---[ end Kernel panic - not syncing: hung_task: blocked tasks ]---

The difference from a normal ltpstress run is that we use a very large
swap partition, so the swap pressure is higher than normal and this
problem becomes more likely to occur.

>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -3042,11 +3042,17 @@ static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
>>   		nr[lru] = targets[lru] * (100 - percentage) / 100;
>>   		nr[lru] -= min(nr[lru], nr_scanned);
>>   
>> +		if (!current_is_kswapd())
>> +			nr[lru] = min(nr[lru], nr_to_reclaim);
>> +
>>   		lru += LRU_ACTIVE;
>>   		nr_scanned = targets[lru] - nr[lru];
>>   		nr[lru] = targets[lru] * (100 - percentage) / 100;
>>   		nr[lru] -= min(nr[lru], nr_scanned);
>>   
>> +		if (!current_is_kswapd())
>> +			nr[lru] = min(nr[lru], nr_to_reclaim);
>> +
>>   		scan_adjusted = true;
>>   	}
>>   	blk_finish_plug(&plug);
> 
> It would be better if these additions had code comments explaining why
> they're there.  But let's more fully understand the problem before
> altering your patch.
Thanks,
Hongchen Zhang



* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-15  1:19   ` Hongchen Zhang
@ 2022-09-15  7:28     ` Matthew Wilcox
  2022-09-15  8:02       ` Hongchen Zhang
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-09-15  7:28 UTC (permalink / raw)
  To: Hongchen Zhang; +Cc: Andrew Morton, linux-mm, linux-kernel

On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
> [ 3748.453561] INFO: task float_bessel:77920 blocked for more than 120
> seconds.
> [ 3748.460839]       Not tainted 5.15.0-46-generic #49-Ubuntu
> [ 3748.466490] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> this message.
> [ 3748.474618] task:float_bessel    state:D stack:    0 pid:77920 ppid:
> 77327 flags:0x00004002
> [ 3748.483358] Call Trace:
> [ 3748.485964]  <TASK>
> [ 3748.488150]  __schedule+0x23d/0x590
> [ 3748.491804]  schedule+0x4e/0xc0
> [ 3748.495038]  rwsem_down_read_slowpath+0x336/0x390
> [ 3748.499886]  ? copy_user_enhanced_fast_string+0xe/0x40
> [ 3748.505181]  down_read+0x43/0xa0
> [ 3748.508518]  do_user_addr_fault+0x41c/0x670
> [ 3748.512799]  exc_page_fault+0x77/0x170
> [ 3748.516673]  asm_exc_page_fault+0x26/0x30
> [ 3748.520824] RIP: 0010:copy_user_enhanced_fast_string+0xe/0x40
> [ 3748.526764] Code: 89 d1 c1 e9 03 83 e2 07 f3 48 a5 89 d1 f3 a4 31 c0 0f
> 01 ca c3 cc cc cc cc 0f 1f 00 0f 01 cb 83 fa 40 0f 82 70 ff ff ff 89 d1 <f3>
> a4 31 c0 0f 01 ca c3 cc cc cc cc 66 08
> [ 3748.546120] RSP: 0018:ffffaa9248fffb90 EFLAGS: 00050206
> [ 3748.551495] RAX: 00007f99faa1a010 RBX: ffffaa9248fffd88 RCX:
> 0000000000000010
> [ 3748.558828] RDX: 0000000000001000 RSI: ffff9db397ab8ff0 RDI:
> 00007f99faa1a000
> [ 3748.566160] RBP: ffffaa9248fffbf0 R08: ffffcc2fc2965d80 R09:
> 0000000000000014
> [ 3748.573492] R10: 0000000000000000 R11: 0000000000000014 R12:
> 0000000000001000
> [ 3748.580858] R13: 0000000000001000 R14: 0000000000000000 R15:
> ffffaa9248fffd98
> [ 3748.588196]  ? copy_page_to_iter+0x10e/0x400
> [ 3748.592614]  filemap_read+0x174/0x3e0

Interesting; it wasn't the process itself which triggered the page
fault; the process called read() and the kernel took the page fault to
satisfy the read() call.

> [ 3748.596354]  ? ima_file_check+0x6a/0xa0
> [ 3748.600301]  generic_file_read_iter+0xe5/0x150
> [ 3748.604884]  ext4_file_read_iter+0x5b/0x190
> [ 3748.609164]  ? aa_file_perm+0x102/0x250
> [ 3748.613125]  new_sync_read+0x10d/0x1a0
> [ 3748.617009]  vfs_read+0x103/0x1a0
> [ 3748.620423]  ksys_read+0x67/0xf0
> [ 3748.623743]  __x64_sys_read+0x19/0x20
> [ 3748.627511]  do_syscall_64+0x59/0xc0
> [ 3748.631203]  ? syscall_exit_to_user_mode+0x27/0x50
> [ 3748.636144]  ? do_syscall_64+0x69/0xc0
> [ 3748.639992]  ? exit_to_user_mode_prepare+0x96/0xb0
> [ 3748.644931]  ? irqentry_exit_to_user_mode+0x9/0x20
> [ 3748.649872]  ? irqentry_exit+0x1d/0x30
> [ 3748.653737]  ? exc_page_fault+0x89/0x170
> [ 3748.657795]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
> [ 3748.663030] RIP: 0033:0x7f9a852989cc
> [ 3748.666713] RSP: 002b:00007f9a8497dc90 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000000
> [ 3748.674487] RAX: ffffffffffffffda RBX: 00007f9a8497f5c0 RCX:
> 00007f9a852989cc
> [ 3748.681817] RDX: 0000000000027100 RSI: 00007f99faa18010 RDI:
> 0000000000000061
> [ 3748.689150] RBP: 00007f9a8497dd60 R08: 0000000000000000 R09:
> 00007f99faa18010
> [ 3748.696493] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00007f99faa18010
> [ 3748.703841] R13: 00005605e11c406f R14: 0000000000000001 R15:
> 0000000000027100

ORIG_RAX is 0, which matches sys_read.
RDI is file descriptor 0x61
RSI is plausibly a userspace pointer, 0x7f99faa18010
RDX is the length, 0x27100 or 160kB.

That all seems reasonable.

What I really want to know is who is _holding_ the lock.  We stash
a pointer to the task_struct in 'owner', so we could clearly find this
out in the 'blocked for too long' report, and print their stack trace.

You must have done something like this already in order to deduce that
it was the direct reclaim path that was the problem?



* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-15  7:28     ` Matthew Wilcox
@ 2022-09-15  8:02       ` Hongchen Zhang
  2022-09-15  9:00         ` Matthew Wilcox
  2022-09-15 13:38         ` Feng Tang
  0 siblings, 2 replies; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-15  8:02 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm, linux-kernel

Hi Matthew,
On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
> On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
>> [...]
> 
> Interesting; it wasn't the process itself which triggered the page
> fault; the process called read() and the kernel took the page fault to
> satisfy the read() call.
> 
>> [...]
> 
> ORIG_RAX is 0, which matches sys_read.
> RDI is file descriptor 0x61
> RSI is plausibly a userspace pointer, 0x7f99faa18010
> RDX is the length, 0x27100 or 160kB.
> 
> That all seems reasonable.
> 
> What I really want to know is who is _holding_ the lock.  We stash
> a pointer to the task_struct in 'owner', so we could clearly find this
> out in the 'blocked for too long' report, and print their stack trace.
> 
As described in the comment for __rwsem_set_reader_owned, it is hard to
track read owners, so we could not clearly find out who blocked the
process; it was caused by multiple tasks.
> You must have done something like this already in order to deduce that
> it was the direct reclaim path that was the problem?
> 
The method we used is to track direct reclaim using the
trace_mm_vmscan_direct_reclaim_{begin,end} tracepoints. When the
problem occurred, we saw a very large "nr_reclaimed", which is not a
desirable value for any process except kswapd.

Thanks
Hongchen Zhang



* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-15  8:02       ` Hongchen Zhang
@ 2022-09-15  9:00         ` Matthew Wilcox
  2022-09-15  9:30           ` Hongchen Zhang
  2022-09-16  0:57           ` Hongchen Zhang
  2022-09-15 13:38         ` Feng Tang
  1 sibling, 2 replies; 17+ messages in thread
From: Matthew Wilcox @ 2022-09-15  9:00 UTC (permalink / raw)
  To: Hongchen Zhang; +Cc: Andrew Morton, linux-mm, linux-kernel

On Thu, Sep 15, 2022 at 04:02:41PM +0800, Hongchen Zhang wrote:
> Hi Matthew,
> On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
> > On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
> > > [...]
> > 
> > Interesting; it wasn't the process itself which triggered the page
> > fault; the process called read() and the kernel took the page fault to
> > satisfy the read() call.
> > 
> > > [...]
> > 
> > ORIG_RAX is 0, which matches sys_read.
> > RDI is file descriptor 0x61
> > RSI is plausibly a userspace pointer, 0x7f99faa18010
> > RDX is the length, 0x27100 or 160kB.
> > 
> > That all seems reasonable.
> > 
> > What I really want to know is who is _holding_ the lock.  We stash
> > a pointer to the task_struct in 'owner', so we could clearly find this
> > out in the 'blocked for too long' report, and print their stack trace.
> > 
> As described in the comment for __rwsem_set_reader_owned, it is hard to
> track read owners, so we could not clearly find out who blocked the
> process; it was caused by multiple tasks.

Readers don't block readers.  You have a reader here, so it's being
blocked by a writer.  And that writer's task_struct is stashed in
rwsem->owner.  It would be nice if we dumped that information
automatically ... but we don't do that today.  Perhaps you could
grab that information from a crash dump if you have one.
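
For a crash dump, the owner can be decoded with a helper along these
lines (a sketch based on kernel/locking/rwsem.c of this era, where the
low bits of ->owner are flag bits and the rest is a task_struct
pointer; the flag values may differ between kernel versions):

	#include <linux/rwsem.h>

	#define RWSEM_READER_OWNED	(1UL << 0)
	#define RWSEM_NONSPINNABLE	(1UL << 1)
	#define RWSEM_OWNER_FLAGS_MASK	(RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)

	/* For a write-locked rwsem this is the holder; for a reader-owned
	 * one it is only the last reader to acquire it, which is exactly
	 * why read owners are hard to track.
	 */
	static struct task_struct *rwsem_owner(struct rw_semaphore *sem)
	{
		return (struct task_struct *)
			(atomic_long_read(&sem->owner) & ~RWSEM_OWNER_FLAGS_MASK);
	}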

> > You must have done something like this already in order to deduce that
> > it was the direct reclaim path that was the problem?
> > 
> The method we used is to track direct reclaim using the
> trace_mm_vmscan_direct_reclaim_{begin,end} tracepoints. When the
> problem occurred, we saw a very large "nr_reclaimed", which is not a
> desirable value for any process except kswapd.

I disagree.  If a process needs to allocate memory then it should be
paying the cost of reclaiming that memory itself.  kswapd is a last
resort to reclaim memory when we have a workload (eg a network router)
that does its memory allocation primarily in interrupt context.


* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-15  9:00         ` Matthew Wilcox
@ 2022-09-15  9:30           ` Hongchen Zhang
  2022-09-16  0:57           ` Hongchen Zhang
  1 sibling, 0 replies; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-15  9:30 UTC (permalink / raw)
  To: Matthew Wilcox; +Cc: Andrew Morton, linux-mm, linux-kernel

Hi Matthew,

On 2022/9/15 am 5:00, Matthew Wilcox wrote:
> On Thu, Sep 15, 2022 at 04:02:41PM +0800, Hongchen Zhang wrote:
>> Hi Matthew,
>> On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
>>> On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
>>>> [...]
>>>
>>> Interesting; it wasn't the process itself which triggered the page
>>> fault; the process called read() and the kernel took the page fault to
>>> satisfy the read() call.
>>>
>>>> [...]
>>>
>>> ORIG_RAX is 0, which matches sys_read.
>>> RDI is file descriptor 0x61
>>> RSI is plausibly a userspace pointer, 0x7f99faa18010
>>> RDX is the length, 0x27100 or 160kB.
>>>
>>> That all seems reasonable.
>>>
>>> What I really want to know is who is _holding_ the lock.  We stash
>>> a pointer to the task_struct in 'owner', so we could clearly find this
>>> out in the 'blocked for too long' report, and print their stack trace.
>>>
>> As described in the comment for __rwsem_set_reader_owned, it is hard to
>> track read owners, so we could not clearly find out who blocked the
>> process; it was caused by multiple tasks.
> 
> Readers don't block readers.  You have a reader here, so it's being
> blocked by a writer.  And that writer's task_struct is stashed in
> rwsem->owner.  It would be nice if we dumped that information
> automatically ... but we don't do that today.  Perhaps you could
> grab that information from a crash dump if you have one.
> 
>>> You must have done something like this already in order to deduce that
>>> it was the direct reclaim path that was the problem?
>>>
>> The method we used is to track direct reclaim using the
>> trace_mm_vmscan_direct_reclaim_{begin,end} tracepoints. When the
>> problem occurred, we saw a very large "nr_reclaimed", which is not a
>> desirable value for any process except kswapd.
> 
> I disagree.  If a process needs to allocate memory then it should be
> paying the cost of reclaiming that memory itself.  kswapd is a last
> resort to reclaim memory when we have a workload (eg a network router)
> that does its memory allocation primarily in interrupt context.
> 
Yes, the process should reclaim the required memory itself, but it
should not do too much scan adjustment; scan adjustment is not its
primary work. That should be done by kswapd.

Thanks
Hongchen Zhang



* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-15  8:02       ` Hongchen Zhang
  2022-09-15  9:00         ` Matthew Wilcox
@ 2022-09-15 13:38         ` Feng Tang
  1 sibling, 0 replies; 17+ messages in thread
From: Feng Tang @ 2022-09-15 13:38 UTC (permalink / raw)
  To: Hongchen Zhang; +Cc: Matthew Wilcox, Andrew Morton, linux-mm, linux-kernel

Hi Hongchen,

On Thu, Sep 15, 2022 at 04:02:41PM +0800, Hongchen Zhang wrote:
> Hi Matthew,
> On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
> > On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
> >> [...]
> > 
> > Interesting; it wasn't the process itself which triggered the page
> > fault; the process called read() and the kernel took the page fault to
> > satisfy the read() call.
> > 
> >> [...]
> > 
> > ORIG_RAX is 0, which matches sys_read.
> > RDI is file descriptor 0x61
> > RSI is plausibly a userspace pointer, 0x7f99faa18010
> > RDX is the length, 0x27100 or 160kB.
> > 
> > That all seems reasonable.
> > 
> > What I really want to know is who is _holding_ the lock.  We stash
> > a pointer to the task_struct in 'owner', so we could clearly find this
> > out in the 'blocked for too long' report, and print their stack trace.
> > 
> As described in the comment for __rwsem_set_reader_owned, it is hard to
> track read owners, so we could not clearly find out who blocked the
> process; it was caused by multiple tasks.

In your previous mail, you mentioned that a panic happened. Maybe you
can try appending "panic_print=0x1" to the kernel command line, as it
prints the call stacks of all tasks on panic (the output can be huge).

We initially added it to debug a similar deadlock (hung task) bug.
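
The same knob is also exposed as a sysctl on kernels of this era, so it
can be set at run time without a reboot, e.g.:

  echo 1 > /proc/sys/kernel/panic_print

where bit 0 selects dumping all tasks info on panic.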

Thanks,
Feng


* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-15  9:00         ` Matthew Wilcox
  2022-09-15  9:30           ` Hongchen Zhang
@ 2022-09-16  0:57           ` Hongchen Zhang
  2022-09-16  8:40             ` Matthew Wilcox
  1 sibling, 1 reply; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-16  0:57 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Matthew Wilcox, linux-mm, linux-kernel

Hi Andrew,

On 2022/9/15 pm 5:00, Matthew Wilcox wrote:
> On Thu, Sep 15, 2022 at 04:02:41PM +0800, Hongchen Zhang wrote:
>> Hi Matthew,
>> On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
>>> On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
>>>> [...]
>>>
>>> Interesting; it wasn't the process itself which triggered the page
>>> fault; the process called read() and the kernel took the page fault to
>>> satisfy the read() call.
>>>
>>>> [...]
>>>
>>> ORIG_RAX is 0, which matches sys_read.
>>> RDI is file descriptor 0x61
>>> RSI is plausibly a userspace pointer, 0x7f99faa18010
>>> RDX is the length, 0x27100 or 160kB.
>>>
>>> That all seems reasonable.
>>>
>>> What I really want to know is who is _holding_ the lock.  We stash
>>> a pointer to the task_struct in 'owner', so we could clearly find this
>>> out in the 'blocked for too long' report, and print their stack trace.
>>>
>> As described in the comment for __rwsem_set_reader_owned, it is hard to
>> track read owners, so we could not clearly find out who blocked the
>> process; it was caused by multiple tasks.
> 
> Readers don't block readers.  You have a reader here, so it's being
> blocked by a writer.  And that writer's task_struct is stashed in
> rwsem->owner.  It would be nice if we dumped that information
> automatically ... but we don't do that today.  Perhaps you could
> grab that information from a crash dump if you have one.
> 
>>> You must have done something like this already in order to deduce that
>>> it was the direct reclaim path that was the problem?
>>>
>> The method we used is to track direct reclaim using the
>> trace_mm_vmscan_direct_reclaim_{begin,end} tracepoints. When the
>> problem occurred, we saw a very large "nr_reclaimed", which is not a
>> desirable value for any process except kswapd.
> 
> I disagree.  If a process needs to allocate memory then it should be
> paying the cost of reclaiming that memory itself.  kswapd is a last
> resort to reclaim memory when we have a workload (eg a network router)
> that does its memory allocation primarily in interrupt context.
> 
What's your opinion about this scan adjust issue? Is there a better way
to fix it?

Thanks
Hongchen Zhang



* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-16  0:57           ` Hongchen Zhang
@ 2022-09-16  8:40             ` Matthew Wilcox
  2022-09-16 10:19               ` Hongchen Zhang
  0 siblings, 1 reply; 17+ messages in thread
From: Matthew Wilcox @ 2022-09-16  8:40 UTC (permalink / raw)
  To: Hongchen Zhang; +Cc: Andrew Morton, linux-mm, linux-kernel

On Fri, Sep 16, 2022 at 08:57:50AM +0800, Hongchen Zhang wrote:
> Hi Andrew,
> 
> On 2022/9/15 pm 5:00, Matthew Wilcox wrote:
> > On Thu, Sep 15, 2022 at 04:02:41PM +0800, Hongchen Zhang wrote:
> > > Hi Matthew,
> > > On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
> > > > On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
> > > > > [...]
> > > > 
> > > > Interesting; it wasn't the process itself which triggered the page
> > > > fault; the process called read() and the kernel took the page fault to
> > > > satisfy the read() call.
> > > > 
> > > > > [...]
> > > > 
> > > > ORIG_RAX is 0, which matches sys_read.
> > > > RDI is file descriptor 0x61
> > > > RSI is plausibly a userspace pointer, 0x7f99faa18010
> > > > RDX is the length, 0x27100 or 160kB.
> > > > 
> > > > That all seems reasonable.
> > > > 
> > > > What I really want to know is who is _holding_ the lock.  We stash
> > > > a pointer to the task_struct in 'owner', so we could clearly find this
> > > > out in the 'blocked for too long' report, and print their stack trace.
> > > > 
> > > As described in the comment for __rwsem_set_reader_owned, it is hard to
> > > track read owners, so we could not clearly find out who blocked the
> > > process; it was caused by multiple tasks.
> > 
> > Readers don't block readers.  You have a reader here, so it's being
> > blocked by a writer.  And that writer's task_struct is stashed in
> > rwsem->owner.  It would be nice if we dumped that information
> > automatically ... but we don't do that today.  Perhaps you could
> > grab that information from a crash dump if you have one.
> > 
> > > > You must have done something like this already in order to deduce that
> > > > it was the direct reclaim path that was the problem?
> > > > 
> > > The method we used is to track direct reclaim using the
> > > trace_mm_vmscan_direct_reclaim_{begin,end} tracepoints. When the
> > > problem occurred, we saw a very large "nr_reclaimed", which is not a
> > > desirable value for any process except kswapd.
> > 
> > I disagree.  If a process needs to allocate memory then it should be
> > paying the cost of reclaiming that memory itself.  kswapd is a last
> > resort to reclaim memory when we have a workload (eg a network router)
> > that does its memory allocation primarily in interrupt context.
> > 
> What's your opinion about this scan adjust issue? Is there a better way
> to fix it?

Yes, but we need you to gather more information about what's causing
the issue before we can suggest what that is.


* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-16  8:40             ` Matthew Wilcox
@ 2022-09-16 10:19               ` Hongchen Zhang
  2022-09-19 23:32                 ` Yosry Ahmed
  2022-09-20  2:23                 ` Hongchen Zhang
  0 siblings, 2 replies; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-16 10:19 UTC (permalink / raw)
  To: Matthew Wilcox, Andrew Morton; +Cc: linux-mm, linux-kernel

Hi Andrew and Matthew,

On 2022/9/16 pm 4:40, Matthew Wilcox wrote:
> On Fri, Sep 16, 2022 at 08:57:50AM +0800, Hongchen Zhang wrote:
>> Hi Andrew ,
>>
>> On 2022/9/15 pm 5:00, Matthew Wilcox wrote:
>>> On Thu, Sep 15, 2022 at 04:02:41PM +0800, Hongchen Zhang wrote:
>>>> Hi Matthew,
>>>> On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
>>>>> On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
>>>>>> [...]
>>>>>
>>>>> Interesting; it wasn't the process itself which triggered the page
>>>>> fault; the process called read() and the kernel took the page fault to
>>>>> satisfy the read() call.
>>>>>
>>>>>> [...]
>>>>>
>>>>> ORIG_RAX is 0, which matches sys_read.
>>>>> RDI is file descriptor 0x61
>>>>> RSI is plausibly a userspace pointer, 0x7f99faa18010
>>>>> RDX is the length, 0x27100 or 160kB.
>>>>>
>>>>> That all seems reasonable.
>>>>>
>>>>> What I really want to know is who is _holding_ the lock.  We stash
>>>>> a pointer to the task_struct in 'owner', so we could clearly find this
>>>>> out in the 'blocked for too long' report, and print their stack trace.
>>>>>
>>>> As described in the comment for __rwsem_set_reader_owned, it is hard to
>>>> track read owners, so we could not clearly find out who blocked the
>>>> process; it was caused by multiple tasks.
>>>
>>> Readers don't block readers.  You have a reader here, so it's being
>>> blocked by a writer.  And that writer's task_struct is stashed in
>>> rwsem->owner.  It would be nice if we dumped that information
>>> automatically ... but we don't do that today.  Perhaps you could
>>> grab that information from a crash dump if you have one.
>>>
>>>>> You must have done something like this already in order to deduce that
>>>>> it was the direct reclaim path that was the problem?
>>>>>
>>>> The method we used is to track direct reclaim using the
>>>> trace_mm_vmscan_direct_reclaim_{begin,end} tracepoints. When the
>>>> problem occurred, we saw a very large "nr_reclaimed", which is not a
>>>> desirable value for any process except kswapd.
>>>
>>> I disagree.  If a process needs to allocate memory then it should be
>>> paying the cost of reclaiming that memory itself.  kswapd is a last
>>> resort to reclaim memory when we have a workload (eg a network router)
>>> that does its memory allocation primarily in interrupt context.
>>>
>> What's your opinion about this scan adjust issue? Is there a better way
>> to fix it?
> 
> Yes, but we need you to gather more information about what's causing
> the issue before we can suggest what that is.
> 
I think the following scenario triggers the scan adjust issue:
in shrink_lruvec() we call get_scan_count() and get the following
values:

targets[LRU_INACTIVE_ANON] = 50000
targets[LRU_ACTIVE_ANON]   = 50000
targets[LRU_INACTIVE_FILE] = 128
targets[LRU_ACTIVE_FILE]   = 129

After the first scan pass we have reclaimed more than nr_to_reclaim
pages, but the file LRUs have been scanned almost completely: 256 of
the 257 target pages, i.e. a fraction of 256/257.

When we then do the scan adjustment, we must still scan (and possibly
reclaim) 256*(50000+50000)/257 - 256 = 99354 anon pages, which is far
too many and wastes too much time. If the reclaimer is not kswapd, it
is unacceptable to reclaim so many pages, so we should limit the
number of pages scanned during the adjustment.
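
A standalone toy calculation of the numbers above (hypothetical values
from the example, just to check the arithmetic):

	#include <stdio.h>

	int main(void)
	{
		unsigned long anon_target  = 50000 + 50000; /* targets[*_ANON] */
		unsigned long file_target  = 128 + 129;     /* targets[*_FILE] */
		unsigned long file_scanned = 256;           /* first pass */
		unsigned long anon_scanned = 256;           /* first pass (assumed) */

		/*
		 * Anon pages left to scan so that both LRU types end at the
		 * same scanned fraction (file_scanned / file_target).
		 */
		unsigned long remaining =
			file_scanned * anon_target / file_target - anon_scanned;

		printf("%lu\n", remaining); /* prints 99354 */
		return 0;
	}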

Thanks
Hongchen Zhang



* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-16 10:19               ` Hongchen Zhang
@ 2022-09-19 23:32                 ` Yosry Ahmed
  2022-09-20  5:53                   ` Hongchen Zhang
  2022-09-20  2:23                 ` Hongchen Zhang
  1 sibling, 1 reply; 17+ messages in thread
From: Yosry Ahmed @ 2022-09-19 23:32 UTC (permalink / raw)
  To: Hongchen Zhang
  Cc: Matthew Wilcox, Andrew Morton, Linux-MM, Linux Kernel Mailing List

On Fri, Sep 16, 2022 at 3:20 AM Hongchen Zhang
<zhanghongchen@loongson.cn> wrote:
>
> Hi Andrew and Matthew,
>
> On 2022/9/16 pm 4:40, Matthew Wilcox wrote:
> > On Fri, Sep 16, 2022 at 08:57:50AM +0800, Hongchen Zhang wrote:
> >> Hi Andrew ,
> >>
> >> On 2022/9/15 pm 5:00, Matthew Wilcox wrote:
> >>> On Thu, Sep 15, 2022 at 04:02:41PM +0800, Hongchen Zhang wrote:
> >>>> Hi Matthew,
> >>>> On 2022/9/15 pm 3:28, Matthew Wilcox wrote:
> >>>>> On Thu, Sep 15, 2022 at 09:19:48AM +0800, Hongchen Zhang wrote:
> >>>>>> [...]
> >>>>> RSI is plausibly a userspace pointer, 0x7f99faa18010
> >>>>> RDX is the length, 0x27100 or 160kB.
> >>>>>
> >>>>> That all seems reasonable.
> >>>>>
> >>>>> What I really want to know is who is _holding_ the lock.  We stash
> >>>>> a pointer to the task_struct in 'owner', so we could clearly find this
> >>>>> out in the 'blocked for too long' report, and print their stack trace.
> >>>>>
> >>>> As described in the comment for __rwsem_set_reader_owned, it is hard
> >>>> to track read owners. So we could not clearly find out who blocked the
> >>>> process; it was caused by multiple tasks.
> >>>
> >>> Readers don't block readers.  You have a reader here, so it's being
> >>> blocked by a writer.  And that writer's task_struct is stashed in
> >>> rwsem->owner.  It would be nice if we dumped that information
> >>> automatically ... but we don't do that today.  Perhaps you could
> >>> grab that information from a crash dump if you have one.
> >>>
> >>>>> You must have done something like this already in order to deduce that
> >>>>> it was the direct reclaim path that was the problem?
> >>>>>
> >>>> The method we used is to track the direct reclaim using the
> >>>> trace_mm_vmscan_direct_reclaim_{begin,end} interface. When the problem
> >>>> occurred, we could get a very large "nr_reclaimed", which is not a
> >>>> desirable value for any process except kswapd.
> >>>
> >>> I disagree.  If a process needs to allocate memory then it should be
> >>> paying the cost of reclaiming that memory itself.  kswapd is a last
> >>> resort to reclaim memory when we have a workload (eg a network router)
> >>> that does its memory allocation primarily in interrupt context.
> >>>
> >> What's your opinion about this scan adjust issue? Is there a better way to
> >> fix this issue?
> >
> > Yes, but we need you to gather more information about what's causing
> > the issue before we can suggest what that is.
> >
> I think the following scenario triggers the scan adjust issue:
> In function shrink_lruvec, we call get_scan_count and get the following
> values:
> targets[LRU_INACTIVE_ANON]=50000
> targets[LRU_ACTIVE_ANON]=50000
> targets[LRU_INACTIVE_FILE]=128
> targets[LRU_ACTIVE_FILE]=129
>
> After the first pass, we have reclaimed more than nr_to_reclaim pages,
> but the percentage of nr[LRU_INACTIVE_FILE+LRU_ACTIVE_FILE] scanned is
> 256/257.
>
> Then, when we scan adjust, we must scan (and possibly reclaim)
> 256*(50000+50000)/257-256=99354 more anon pages, which is far too many
> and wastes too much time.
> If it is not kswapd, it is unacceptable to reclaim so many pages.
> So we should limit the number of pages scanned during scan adjust.

IIUC commit 6eb90d649537 ("mm: vmscan: fix extreme overreclaim and
swap floods") that was recently sent by Johannes [1] addresses a
similar issue (reclaiming way beyond nr_to_reclaim when anon vs file
LRU sizes are very different), but in a slightly different scenario.
IIUC with Johannes's patch, scan adjustment is already limited for
scenarios where scan_adjust (aka proportional_reclaim) is not
initialized to true, which would be all cases except global direct
reclaim on DEF_PRIORITY. Is my understanding here correct?

[1] https://lore.kernel.org/lkml/20220802162811.39216-1-hannes@cmpxchg.org/
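
If it helps, this is how I read the gate after that patch -- a
compilable toy restating the condition, with DEF_PRIORITY and the
field/helper names assumed to match mm/vmscan.c rather than copied
from it:

/* proportional_gate.c - toy restatement of the post-fix gate in
 * shrink_lruvec(); kernel names here are assumptions, not excerpts. */
#include <stdbool.h>
#include <stdio.h>

#define DEF_PRIORITY 12			/* lowest scan priority */

struct scan_control {			/* stand-in for the kernel struct */
	int priority;
	bool cgroup_reclaim;
};

/*
 * true:  keep scanning the full proportional targets (the old behaviour,
 *        now reserved for global direct reclaim at DEF_PRIORITY);
 * false: once nr_to_reclaim is met, trim the remaining targets -- and
 *        with the "scan_adjusted = true" latch removed, the trim is
 *        re-applied on every iteration instead of only once.
 */
static bool proportional_reclaim(const struct scan_control *sc,
				 bool is_kswapd)
{
	return !sc->cgroup_reclaim && !is_kswapd &&
	       sc->priority == DEF_PRIORITY;
}

int main(void)
{
	struct scan_control sc = { .priority = DEF_PRIORITY,
				   .cgroup_reclaim = false };

	printf("direct@DEF_PRIORITY: %d\n", proportional_reclaim(&sc, false));
	printf("kswapd:              %d\n", proportional_reclaim(&sc, true));
	sc.priority = DEF_PRIORITY - 1;
	printf("direct, higher pressure: %d\n",
	       proportional_reclaim(&sc, false));
	return 0;
}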

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-16 10:19               ` Hongchen Zhang
  2022-09-19 23:32                 ` Yosry Ahmed
@ 2022-09-20  2:23                 ` Hongchen Zhang
  2022-09-21  9:13                   ` Mel Gorman
  1 sibling, 1 reply; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-20  2:23 UTC (permalink / raw)
  To: Mel Gorman; +Cc: Matthew Wilcox, Andrew Morton, linux-mm, linux-kernel

Hi Mel,

The scan adjust algorithm was originally introduced by you in
commit e82e0561dae9 ("mm: vmscan: obey proportional scanning
requirements for kswapd"); do you have any suggestions about this fix?
In short, proportional scanning is not friendly to processes other than
kswapd.

Thanks
Hongchen Zhang

On 2022/9/16 pm 6:19, Hongchen Zhang wrote:
> Hi Andrew and Matthew,
> 
> [...]


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-19 23:32                 ` Yosry Ahmed
@ 2022-09-20  5:53                   ` Hongchen Zhang
  0 siblings, 0 replies; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-20  5:53 UTC (permalink / raw)
  To: Yosry Ahmed
  Cc: Matthew Wilcox, Andrew Morton, Linux-MM,
	Linux Kernel Mailing List, Feng Tang

Hi Yosry,

Sorry for not replying in time; there was a problem with my email.

On 2022/9/20 am 7:32, Yosry Ahmed wrote:
> [...]
> 
> IIUC commit 6eb90d649537 ("mm: vmscan: fix extreme overreclaim and
> swap floods") that was recently sent by Johannes [1] addresses a
> similar issue (reclaiming way beyond nr_to_reclaim when anon vs file
> LRU sizes are very different), but in a slightly different scenario.
> IIUC with Johannes's patch, scan adjustment is already limited for
> scenarios where scan_adjust (aka proportional_reclaim) is not
> initialized to true, which would be all cases except global direct
> reclaim on DEF_PRIORITY. Is my understanding here correct?
> 
Yes, this patch fixes the same issue; let's discuss it there.
> [1] https://lore.kernel.org/lkml/20220802162811.39216-1-hannes@cmpxchg.org/


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-20  2:23                 ` Hongchen Zhang
@ 2022-09-21  9:13                   ` Mel Gorman
  2022-09-21 10:14                     ` Hongchen Zhang
  0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2022-09-21  9:13 UTC (permalink / raw)
  To: Hongchen Zhang
  Cc: Mel Gorman, Matthew Wilcox, Andrew Morton, linux-mm, linux-kernel

On Tue, Sep 20, 2022 at 10:23:05AM +0800, Hongchen Zhang wrote:
> Hi Mel,
> 
> The scan adjust algorithm was originally introduced by you in
> commit e82e0561dae9 ("mm: vmscan: obey proportional scanning
> requirements for kswapd"); do you have any suggestions about this fix?
> In short, proportional scanning is not friendly to processes other than
> kswapd.
> 

I suspect that 6eb90d649537 ("mm: vmscan: fix extreme overreclaim and
swap floods") is a more appropriate fix. While it also has a fairness
impact, it is a more general approach that is likely more robust; and
while fairness is important, completely thrashing a full LRU is neither
fair nor expected.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd
  2022-09-21  9:13                   ` Mel Gorman
@ 2022-09-21 10:14                     ` Hongchen Zhang
  0 siblings, 0 replies; 17+ messages in thread
From: Hongchen Zhang @ 2022-09-21 10:14 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mel Gorman, Matthew Wilcox, Andrew Morton, linux-mm, linux-kernel

Hi Mel,

Thanks for your reply.

On 2022/9/21 pm 5:13, Mel Gorman wrote:
> On Tue, Sep 20, 2022 at 10:23:05AM +0800, Hongchen Zhang wrote:
>> Hi Mel,
>>
>> The scan adjust algorithm was originally introduced by you in
>> commit e82e0561dae9 ("mm: vmscan: obey proportional scanning
>> requirements for kswapd"); do you have any suggestions about this fix?
>> In short, proportional scanning is not friendly to processes other than
>> kswapd.
>>
> 
> I suspect that 6eb90d649537 ("mm: vmscan: fix extreme overreclaim and
> swap floods") is a more appropriate fix. While it also has a fairness
> impact, it is a more general approach that is likely more robust; and
> while fairness is important, completely thrashing a full LRU is neither
> fair nor expected.
> 
OK, got it. Let's wait for 6eb90d649537 ("mm: vmscan: fix extreme
overreclaim and swap floods") to land in the stable tree.

Best Regards
Hongchen Zhang


^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2022-09-21 10:14 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-14  2:33 [PATCH] mm/vmscan: don't scan adjust too much if current is not kswapd Hongchen Zhang
2022-09-14 22:51 ` Andrew Morton
2022-09-14 23:02   ` Matthew Wilcox
2022-09-15  1:19   ` Hongchen Zhang
2022-09-15  7:28     ` Matthew Wilcox
2022-09-15  8:02       ` Hongchen Zhang
2022-09-15  9:00         ` Matthew Wilcox
2022-09-15  9:30           ` Hongchen Zhang
2022-09-16  0:57           ` Hongchen Zhang
2022-09-16  8:40             ` Matthew Wilcox
2022-09-16 10:19               ` Hongchen Zhang
2022-09-19 23:32                 ` Yosry Ahmed
2022-09-20  5:53                   ` Hongchen Zhang
2022-09-20  2:23                 ` Hongchen Zhang
2022-09-21  9:13                   ` Mel Gorman
2022-09-21 10:14                     ` Hongchen Zhang
2022-09-15 13:38         ` Feng Tang
