linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] mm/swap: fix system stuck due to infinite loop
@ 2021-04-02  7:03 Stillinux
  2021-04-03  0:44 ` Andrew Morton
       [not found] ` <20210406065944.08d8aa76@mail.inbox.lv>
  0 siblings, 2 replies; 6+ messages in thread
From: Stillinux @ 2021-04-02  7:03 UTC (permalink / raw)
  To: akpm; +Cc: linux-mm, linux-kernel, liuzhengyuan, liuyun01

[-- Attachment #1: Type: text/plain, Size: 1804 bytes --]

Under high memory and load pressure, we ran the ltp test and found
that the system hung: direct memory reclaim was stuck in io_schedule,
the pending request was held in the blk_plug of one process, and that
process had fallen into an infinite loop which never flushed out the
plugged request.

The call flow of this process is swap_cluster_readahead, which
brackets the work with blk_start_plug/blk_finish_plug and then goes
through swap_cluster_readahead->__read_swap_cache_async->swapcache_prepare.
When swapcache_prepare returns -EEXIST, __read_swap_cache_async falls
into an infinite loop. Even though cond_resched is called,
sched_submit_work decides based on tsk->state and will not flush out
the blk_plug request, so the I/O hangs, causing the whole system to hang.

Since this is our first time working in the swap code, we have no good
way to fix the problem at its root. As an engineering workaround, we
chose to make swap_cluster_readahead aware of memory pressure as early
as possible and do io_schedule to flush out the blk_plug request, by
changing the allocation flag in swap_readpage to GFP_NOIO so that the
allocation no longer performs reclaim that flushes I/O. With this
change the system operates normally, but it is not the most fundamental fix.

Signed-off-by: huangjinhui <huangjinhui@kylinos.cn>
---
 mm/page_io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_io.c b/mm/page_io.c
index c493ce9ebcf5..87392ffabb12 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -403,7 +403,7 @@ int swap_readpage(struct page *page, bool synchronous)
 	}

 	ret = 0;
-	bio = bio_alloc(GFP_KERNEL, 1);
+	bio = bio_alloc(GFP_NOIO, 1);
 	bio_set_dev(bio, sis->bdev);
 	bio->bi_opf = REQ_OP_READ;
 	bio->bi_iter.bi_sector = swap_page_sector(page);
-- 
2.25.1

[-- Attachment #2: Type: text/html, Size: 2010 bytes --]

^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop
  2021-04-02  7:03 [RFC PATCH] mm/swap: fix system stuck due to infinite loop Stillinux
@ 2021-04-03  0:44 ` Andrew Morton
  2021-04-04  9:26   ` Stillinux
       [not found] ` <20210406065944.08d8aa76@mail.inbox.lv>
  1 sibling, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2021-04-03  0:44 UTC (permalink / raw)
  To: Stillinux
  Cc: linux-mm, linux-kernel, liuzhengyuan, liuyun01, Johannes Weiner,
	Hugh Dickins

On Fri, 2 Apr 2021 15:03:37 +0800 Stillinux <stillinux@gmail.com> wrote:

> Under high memory and load pressure, we ran the ltp test and found
> that the system hung: direct memory reclaim was stuck in io_schedule,
> the pending request was held in the blk_plug of one process, and that
> process had fallen into an infinite loop which never flushed out the
> plugged request.
> 
> The call flow of this process is swap_cluster_readahead, which
> brackets the work with blk_start_plug/blk_finish_plug and then goes
> through swap_cluster_readahead->__read_swap_cache_async->swapcache_prepare.
> When swapcache_prepare returns -EEXIST, __read_swap_cache_async falls
> into an infinite loop. Even though cond_resched is called,
> sched_submit_work decides based on tsk->state and will not flush out
> the blk_plug request, so the I/O hangs, causing the whole system to hang.
> 
> Since this is our first time working in the swap code, we have no good
> way to fix the problem at its root. As an engineering workaround, we
> chose to make swap_cluster_readahead aware of memory pressure as early
> as possible and do io_schedule to flush out the blk_plug request, by
> changing the allocation flag in swap_readpage to GFP_NOIO so that the
> allocation no longer performs reclaim that flushes I/O. With this
> change the system operates normally, but it is not the most fundamental fix.
> 

Thanks.

I'm not understanding why swapcache_prepare() repeatedly returns
-EEXIST in this situation?

And how does the switch to GFP_NOIO fix this?  Simply by avoiding
direct reclaim altogether?

> ---
>  mm/page_io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_io.c b/mm/page_io.c
> index c493ce9ebcf5..87392ffabb12 100644
> --- a/mm/page_io.c
> +++ b/mm/page_io.c
> @@ -403,7 +403,7 @@ int swap_readpage(struct page *page, bool synchronous)
>  	}
> 
>  	ret = 0;
> -	bio = bio_alloc(GFP_KERNEL, 1);
> +	bio = bio_alloc(GFP_NOIO, 1);
>  	bio_set_dev(bio, sis->bdev);
>  	bio->bi_opf = REQ_OP_READ;
>  	bio->bi_iter.bi_sector = swap_page_sector(page);



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop
  2021-04-03  0:44 ` Andrew Morton
@ 2021-04-04  9:26   ` Stillinux
  0 siblings, 0 replies; 6+ messages in thread
From: Stillinux @ 2021-04-04  9:26 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-mm, linux-kernel, liuzhengyuan, liuyun01, Johannes Weiner,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 11188 bytes --]

> I'm not understanding why swapcache_prepare() repeatedly returns
> -EEXIST in this situation?

We hit this case, which hung the system, on several of our arm64 PCs.
Using sysrq we grabbed a vmcore based on the 5.4.18 kernel. Later we
also ran the 5.11.0 version, which can reproduce this case. According
to the vmcore analysis, the latest kernel code still preserves this
logic.

Our analysis steps are as follows:
1. We found a large number of D tasks, with kernel stacks mostly stuck
in io_schedule after rq_qos_wait. We analyzed the rq_wb structure and
found that inflight in rq_wait[0] is 1: one request has not completed,
and blk-wbt therefore blocks the I/O operations of the other tasks.

crash> ps -S
  RU: 5
  IN: 468
  ID: 65
  UN: 117
  ZO: 4

  ==============
  rq_wait = {{
wait = {

inflight = {
  counter = 1
}

2. We tracked down that the request was plugged in the blk_plug of
task 938.

crash> ps -A
   PID    PPID  CPU       TASK        ST  %MEM     VSZ    RSS  COMM
>     0      0   1  ffffffa0f5e0b4c0  RU   0.0       0      0  [swapper/1]
>     0      0   2  ffffffa0f5ec0040  RU   0.0       0      0  [swapper/2]
>     0      0   3  ffffffa0f5ec6940  RU   0.0       0      0  [swapper/3]
>   938    929   0  ffffffa0eb00c640  RU   0.0 2698236  15092  qaxsafed
crash> task 938 | grep TASK
PID: 938    TASK: ffffffa0eb00c640  CPU: 0   COMMAND: "qaxsafed"
crash> struct task_struct ffffffa0eb00c640 | grep plug
  plug = 0xffffffa0e918fbe8,
crash> struct blk_plug 0xffffffa0e918fbe8
struct blk_plug {
  mq_list = {
    next = 0xffffffa0e7f81a88,
    prev = 0xffffffa0e7f81a88
  },
  cb_list = {
    next = 0xffffffa0e918fbf8,
    prev = 0xffffffa0e918fbf8
  },
  rq_count = 1,
  multiple_queues = false
}

3. We grabbed the allocation time of the request and found that it was
allocated at 1062 seconds and then added to the plug of task 938. But
combined with the dmesg timestamps, the last scheduling of task 938 ran
up to 2183 seconds; together with the scheduling information of the
other tasks on cpu0, we confirmed that scheduling itself was normal. It
is clearly a problem that task 938 went that long without flushing out
the request on its plug.

crash> struct request ffffffa0e7f81a40 | grep start_time_ns
  start_time_ns = 1062021311093,
  io_start_time_ns = 0,
crash> log | tail
[ 2183.803249]  do_swap_page+0x1e4/0x968
[ 2183.808028]  __handle_mm_fault+0xc18/0x10a8
[ 2183.813328]  handle_mm_fault+0x144/0x198
[ 2183.818368]  do_page_fault+0x2d0/0x518
[ 2183.823235]  do_translation_fault+0x3c/0x50
[ 2183.828535]  do_mem_abort+0x3c/0x98
[ 2183.833140]  el0_da+0x1c/0x20
[ 2183.837229] SMP: stopping secondary CPUs
[ 2183.842451] Starting crashdump kernel...
[ 2183.876226] Bye!
crash> task 938 | grep "exec_start\|sum_exec_runtime"
    exec_start = 2183587475680,
    sum_exec_runtime = 1120286568937,
    prev_sum_exec_runtime = 1120278576499,
  last_sum_exec_runtime = 0,

4. We analyzed the scheduling: schedule->sched_submit_work decides
based on task->state, and for a runnable task the requests in the plug
will not be flushed. So the requests on the plug of a runnable process
must be flushed out by the process itself.

5. We combined the call stack of task 938 with the X0 register value
saved in its interrupt context, which is the swp_entry_t argument
passed to get_swap_device, and converted it to the entry value.

The get_swap_device function and its saved context are listed below,
and we disassembled get_swap_device:

struct swap_info_struct *get_swap_device(swp_entry_t entry)
====================================
#27 [ffffffa0e918fa90] el1_irq+0xb4 at ffffffc0100833b4
     PC: ffffffc01022df04  [get_swap_device+36]
     LR: ffffffc01022dfb4  [__swap_duplicate+28]
     SP: ffffffa0e918faa0  PSTATE: 20000005
    X29: ffffffa0e918faa0  X28: ffffffc01151e000  X27: 0000000000000000
    X26: ffffffa0f0d8cfc8  X25: 0000007f9e57a000  X24: 0000000000100cca
    X23: 0000000000000040  X22: 00000000001d1ca6  X21: 0000000000007b40
    X20: 00000000001d1ca6  X19: ffffffff00502800  X18: 0000000000000000
    X17: 0000000000000000  X16: 0000000000000000  X15: 0000000000000000
    X14: 0000000000000000  X13: 0000000000000001  X12: 0000000000000000
    X11: 0000000000000001  X10: 0000000000000bc0   X9: ffffffa0e918fa50
     X8: ffffffa0eb00d260   X7: ffffffa0e9875600   X6: 00000000ffffffff
     X5: 00000000000002af   X4: ffffffa0e918fa80   X3: 0000000000000002
     X2: 0000000000000000   X1: ffffffc0117caf60   X0: 00000000001d1ca6  <<<<---
    ffffffa0e918fa90: ffffffa0e918faa0 get_swap_device+36
#28 [ffffffa0e918faa0] get_swap_device+0x20 at ffffffc01022df00
    ffffffa0e918faa0: ffffffa0e918fab0 __swap_duplicate+28
#29 [ffffffa0e918fab0] __swap_duplicate+0x18 at ffffffc01022dfb0

============================
crash> dis -rx get_swap_device+36
0xffffffc01022dee0 <get_swap_device>: cbz x0, 0xffffffc01022df80
<get_swap_device+0xa0>
0xffffffc01022dee4 <get_swap_device+0x4>: adrp x1, 0xffffffc0117ca000
<memblock_reserved_init_regions+0x1500>
0xffffffc01022dee8 <get_swap_device+0x8>: add x1, x1, #0xf60
0xffffffc01022deec <get_swap_device+0xc>: stp x29, x30, [sp,#-16]!
0xffffffc01022def0 <get_swap_device+0x10>: lsr x2, x0, #58
0xffffffc01022def4 <get_swap_device+0x14>: mov x29, sp
0xffffffc01022def8 <get_swap_device+0x18>: ldr w3, [x1,#4]
0xffffffc01022defc <get_swap_device+0x1c>: cmp w3, w2
0xffffffc01022df00 <get_swap_device+0x20>: b.ls 0xffffffc01022df40
<get_swap_device+0x60>
0xffffffc01022df04 <get_swap_device+0x24>: dmb ishld



6. Combined with the swap information, we looked up the value of
swap_map corresponding to the entry and found that it is 0x41
(SWAP_HAS_CACHE | 0x1).

crash> swap
SWAP_INFO_STRUCT    TYPE       SIZE       USED     PCT  PRI  FILENAME
ffffffa0ebae9a00  PARTITION  9693180k   2713508k   27%   -2  /dev/sda6
crash> struct swap_info_struct ffffffa0ebae9a00 | grep swap_map
  swap_map = 0xffffffc012e01000 "?",
  frontswap_map = 0xffffffa0ea580000,
crash> px (0xffffffc012e01000+0x1d1ca6)
$4 = 0xffffffc012fd2ca6
crash> rd -8  0xffffffc012fd2ca6
ffffffc012fd2ca6:  41                                                A

7. Bringing 0x41 into the __read_swap_cache_async function:
swapcache_prepare will return -EEXIST, and as long as the value in
swap_map corresponding to the entry stays 0x41, swapcache_prepare will
always return -EEXIST, so task 938 loops forever.

8. A swap_map value of 0x41 means the swap-cache page corresponding to
the entry has not yet been added successfully. Once it is added,
__read_swap_cache_async can get the page from swap_address_space and
does not loop.

9. Based on the above, we deduced that whatever should add the
swap-cache page for task 938's entry is itself blocked. Looking at the
stacks of the other tasks confirms this: they triggered the shrink
flush io of direct memory reclaim in
swap_cluster_readahead->read_swap_cache_async->swap_readpage->get_swap_bio,
but were blocked by blk-wbt because of the one uncompleted request
above.



As far as we can tell, there is a potential logical deadlock loop, and
it is the root cause of our system's hang. Task 938 loops in
swapcache_prepare waiting for the page to be added to the swap cache.
The task that should read data from the swap area into that swap-cache
page is blocked on I/O, waiting for blk-wbt or hctx->tags resources.
Those resources in turn are held by the request still sitting in task
938's blk_plug, which task 938 never flushes out, and that plugged
request is what blocks the other tasks' I/O.

> And how does the switch to GFP_NOIO fix this?  Simply by avoiding
> direct reclaim altogether?

In our kernel stack dumps, the tasks that may be blocking task 938 are
mostly stuck in
swap_cluster_readahead->read_swap_cache_async->swap_readpage->get_swap_bio:
when get_swap_bio does direct memory reclaim, it enters the D state
flushing I/O that is stuck behind the request in task 938's blk_plug.
So, from an engineering perspective, we only solve this case urgently.
The change effectively reduces the probability of the hang: the
GFP_NOIO allocation can notice the memory pressure, readahead reading
is reduced, and the entry of the formal orig_pte is read a second time
via read_swap_cache_async, which gives a chance to recover. But risks
remain: if the failing entry is not the one corresponding to orig_pte
there is no big problem, but if it is the orig_pte entry, returning
-ENOMEM may cause serious problems. Solving this potential circular
dependency fundamentally is beyond my ability, so I ask the community
to discuss and research a fundamental fix.


On Sat, Apr 3, 2021 at 8:44 AM Andrew Morton <akpm@linux-foundation.org>
wrote:

> On Fri, 2 Apr 2021 15:03:37 +0800 Stillinux <stillinux@gmail.com> wrote:
>
> > Under high memory and load pressure, we ran the ltp test and found
> > that the system hung: direct memory reclaim was stuck in io_schedule,
> > the pending request was held in the blk_plug of one process, and that
> > process had fallen into an infinite loop which never flushed out the
> > plugged request.
> >
> > The call flow of this process is swap_cluster_readahead, which
> > brackets the work with blk_start_plug/blk_finish_plug and then goes
> > through swap_cluster_readahead->__read_swap_cache_async->swapcache_prepare.
> > When swapcache_prepare returns -EEXIST, __read_swap_cache_async falls
> > into an infinite loop. Even though cond_resched is called,
> > sched_submit_work decides based on tsk->state and will not flush out
> > the blk_plug request, so the I/O hangs, causing the whole system to
> > hang.
> >
> > Since this is our first time working in the swap code, we have no good
> > way to fix the problem at its root. As an engineering workaround, we
> > chose to make swap_cluster_readahead aware of memory pressure as early
> > as possible and do io_schedule to flush out the blk_plug request, by
> > changing the allocation flag in swap_readpage to GFP_NOIO so that the
> > allocation no longer performs reclaim that flushes I/O. With this
> > change the system operates normally, but it is not the most
> > fundamental fix.
> >
>
> Thanks.
>
> I'm not understanding why swapcache_prepare() repeatedly returns
> -EEXIST in this situation?
>
> And how does the switch to GFP_NOIO fix this?  Simply by avoiding
> direct reclaim altogether?
>
> > ---
> >  mm/page_io.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/mm/page_io.c b/mm/page_io.c
> > index c493ce9ebcf5..87392ffabb12 100644
> > --- a/mm/page_io.c
> > +++ b/mm/page_io.c
> > @@ -403,7 +403,7 @@ int swap_readpage(struct page *page, bool
> synchronous)
> >       }
> >
> >       ret = 0;
> > -     bio = bio_alloc(GFP_KERNEL, 1);
> > +     bio = bio_alloc(GFP_NOIO, 1);
> >       bio_set_dev(bio, sis->bdev);
> >       bio->bi_opf = REQ_OP_READ;
> >       bio->bi_iter.bi_sector = swap_page_sector(page);
>
>

[-- Attachment #2: Type: text/html, Size: 12866 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
       [not found] ` <20210406065944.08d8aa76@mail.inbox.lv>
@ 2021-04-06  0:15   ` kernel test robot
  2021-04-06  1:16   ` kernel test robot
  2021-04-06 22:49   ` [RFC PATCH] mm/swap: fix system stuck due to infinite loop Stillinux
  2 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2021-04-06  0:15 UTC (permalink / raw)
  To: Alexey Avramov, Stillinux
  Cc: kbuild-all, akpm, linux-mm, linux-kernel, liuzhengyuan, liuyun01

[-- Attachment #1: Type: text/plain, Size: 3048 bytes --]

Hi Alexey,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on linux/master]
[also build test ERROR on linus/master v5.12-rc6 next-20210401]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Alexey-Avramov/mm-vmscan-add-sysctl-knobs-for-protecting-the-specified/20210406-061034
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 5e46d1b78a03d52306f21f77a4e4a144b6d31486
config: parisc-randconfig-m031-20210405 (attached as .config)
compiler: hppa-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/0day-ci/linux/commit/a5eeb8d197a8e10c333422e9cc0f2c7d976a3426
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Alexey-Avramov/mm-vmscan-add-sysctl-knobs-for-protecting-the-specified/20210406-061034
        git checkout a5eeb8d197a8e10c333422e9cc0f2c7d976a3426
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross ARCH=parisc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All error/warnings (new ones prefixed by >>):

>> mm/vmscan.c:180:5: warning: "CONFIG_CLEAN_LOW_KBYTES" is not defined, evaluates to 0 [-Wundef]
     180 | #if CONFIG_CLEAN_LOW_KBYTES < 0
         |     ^~~~~~~~~~~~~~~~~~~~~~~
>> mm/vmscan.c:184:5: warning: "CONFIG_CLEAN_MIN_KBYTES" is not defined, evaluates to 0 [-Wundef]
     184 | #if CONFIG_CLEAN_MIN_KBYTES < 0
         |     ^~~~~~~~~~~~~~~~~~~~~~~
>> mm/vmscan.c:188:55: error: 'CONFIG_CLEAN_LOW_KBYTES' undeclared here (not in a function)
     188 | unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
         |                                                       ^~~~~~~~~~~~~~~~~~~~~~~
>> mm/vmscan.c:189:55: error: 'CONFIG_CLEAN_MIN_KBYTES' undeclared here (not in a function)
     189 | unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
         |                                                       ^~~~~~~~~~~~~~~~~~~~~~~


vim +/CONFIG_CLEAN_LOW_KBYTES +188 mm/vmscan.c

   179	
 > 180	#if CONFIG_CLEAN_LOW_KBYTES < 0
   181	#error "CONFIG_CLEAN_LOW_KBYTES must be >= 0"
   182	#endif
   183	
 > 184	#if CONFIG_CLEAN_MIN_KBYTES < 0
   185	#error "CONFIG_CLEAN_MIN_KBYTES must be >= 0"
   186	#endif
   187	
 > 188	unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
 > 189	unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
   190	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26587 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
       [not found] ` <20210406065944.08d8aa76@mail.inbox.lv>
  2021-04-06  0:15   ` [PATCH] mm/vmscan: add sysctl knobs for protecting the specified kernel test robot
@ 2021-04-06  1:16   ` kernel test robot
  2021-04-06 22:49   ` [RFC PATCH] mm/swap: fix system stuck due to infinite loop Stillinux
  2 siblings, 0 replies; 6+ messages in thread
From: kernel test robot @ 2021-04-06  1:16 UTC (permalink / raw)
  To: Alexey Avramov, Stillinux
  Cc: kbuild-all, clang-built-linux, akpm, linux-mm, linux-kernel,
	liuzhengyuan, liuyun01

[-- Attachment #1: Type: text/plain, Size: 16131 bytes --]

Hi Alexey,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v5.12-rc6 next-20210401]
[cannot apply to hnaz-linux-mm/master]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:    https://github.com/0day-ci/linux/commits/Alexey-Avramov/mm-vmscan-add-sysctl-knobs-for-protecting-the-specified/20210406-061034
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 5e46d1b78a03d52306f21f77a4e4a144b6d31486
config: s390-randconfig-r006-20210405 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project a46f59a747a7273cc439efaf3b4f98d8b63d2f20)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # install s390 cross compiling tool for clang build
        # apt-get install binutils-s390x-linux-gnu
        # https://github.com/0day-ci/linux/commit/a5eeb8d197a8e10c333422e9cc0f2c7d976a3426
        git remote add linux-review https://github.com/0day-ci/linux
        git fetch --no-tags linux-review Alexey-Avramov/mm-vmscan-add-sysctl-knobs-for-protecting-the-specified/20210406-061034
        git checkout a5eeb8d197a8e10c333422e9cc0f2c7d976a3426
        # save the attached .config to linux build tree
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross ARCH=s390 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <lkp@intel.com>

All warnings (new ones prefixed by >>):

   In file included from mm/vmscan.c:20:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:14:
   In file included from include/linux/blk-cgroup.h:23:
   In file included from include/linux/blkdev.h:26:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:80:
   include/asm-generic/io.h:464:31: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __raw_readb(PCI_IOBASE + addr);
                             ~~~~~~~~~~ ^
   include/asm-generic/io.h:477:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le16_to_cpu((__le16 __force)__raw_readw(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:36:59: note: expanded from macro '__le16_to_cpu'
   #define __le16_to_cpu(x) __swab16((__force __u16)(__le16)(x))
                                                             ^
   include/uapi/linux/swab.h:102:54: note: expanded from macro '__swab16'
   #define __swab16(x) (__u16)__builtin_bswap16((__u16)(x))
                                                        ^
   In file included from mm/vmscan.c:20:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:14:
   In file included from include/linux/blk-cgroup.h:23:
   In file included from include/linux/blkdev.h:26:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:80:
   include/asm-generic/io.h:490:61: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           val = __le32_to_cpu((__le32 __force)__raw_readl(PCI_IOBASE + addr));
                                                           ~~~~~~~~~~ ^
   include/uapi/linux/byteorder/big_endian.h:34:59: note: expanded from macro '__le32_to_cpu'
   #define __le32_to_cpu(x) __swab32((__force __u32)(__le32)(x))
                                                             ^
   include/uapi/linux/swab.h:115:54: note: expanded from macro '__swab32'
   #define __swab32(x) (__u32)__builtin_bswap32((__u32)(x))
                                                        ^
   In file included from mm/vmscan.c:20:
   In file included from include/linux/swap.h:9:
   In file included from include/linux/memcontrol.h:22:
   In file included from include/linux/writeback.h:14:
   In file included from include/linux/blk-cgroup.h:23:
   In file included from include/linux/blkdev.h:26:
   In file included from include/linux/scatterlist.h:9:
   In file included from arch/s390/include/asm/io.h:80:
   include/asm-generic/io.h:501:33: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writeb(value, PCI_IOBASE + addr);
                               ~~~~~~~~~~ ^
   include/asm-generic/io.h:511:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writew((u16 __force)cpu_to_le16(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:521:59: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           __raw_writel((u32 __force)cpu_to_le32(value), PCI_IOBASE + addr);
                                                         ~~~~~~~~~~ ^
   include/asm-generic/io.h:609:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsb(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:617:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsw(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:625:20: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           readsl(PCI_IOBASE + addr, buffer, count);
                  ~~~~~~~~~~ ^
   include/asm-generic/io.h:634:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesb(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:643:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesw(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
   include/asm-generic/io.h:652:21: warning: performing pointer arithmetic on a null pointer has undefined behavior [-Wnull-pointer-arithmetic]
           writesl(PCI_IOBASE + addr, buffer, count);
                   ~~~~~~~~~~ ^
>> mm/vmscan.c:2819:7: warning: variable 'clean' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
                   if (reclaimable_file > dirty)
                       ^~~~~~~~~~~~~~~~~~~~~~~~
   mm/vmscan.c:2822:25: note: uninitialized use occurs here
                   sc->clean_below_low = clean < sysctl_clean_low_kbytes;
                                         ^~~~~
   mm/vmscan.c:2819:3: note: remove the 'if' if its condition is always true
                   if (reclaimable_file > dirty)
                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   mm/vmscan.c:2812:47: note: initialize the variable 'clean' to silence this warning
                   unsigned long reclaimable_file, dirty, clean;
                                                               ^
                                                                = 0
   13 warnings generated.


vim +2819 mm/vmscan.c

  2706	
  2707	static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
  2708	{
  2709		struct reclaim_state *reclaim_state = current->reclaim_state;
  2710		unsigned long nr_reclaimed, nr_scanned;
  2711		struct lruvec *target_lruvec;
  2712		bool reclaimable = false;
  2713		unsigned long file;
  2714	
  2715		target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
  2716	
  2717	again:
  2718		memset(&sc->nr, 0, sizeof(sc->nr));
  2719	
  2720		nr_reclaimed = sc->nr_reclaimed;
  2721		nr_scanned = sc->nr_scanned;
  2722	
  2723		/*
  2724		 * Determine the scan balance between anon and file LRUs.
  2725		 */
  2726		spin_lock_irq(&target_lruvec->lru_lock);
  2727		sc->anon_cost = target_lruvec->anon_cost;
  2728		sc->file_cost = target_lruvec->file_cost;
  2729		spin_unlock_irq(&target_lruvec->lru_lock);
  2730	
  2731		/*
  2732		 * Target desirable inactive:active list ratios for the anon
  2733		 * and file LRU lists.
  2734		 */
  2735		if (!sc->force_deactivate) {
  2736			unsigned long refaults;
  2737	
  2738			refaults = lruvec_page_state(target_lruvec,
  2739					WORKINGSET_ACTIVATE_ANON);
  2740			if (refaults != target_lruvec->refaults[0] ||
  2741				inactive_is_low(target_lruvec, LRU_INACTIVE_ANON))
  2742				sc->may_deactivate |= DEACTIVATE_ANON;
  2743			else
  2744				sc->may_deactivate &= ~DEACTIVATE_ANON;
  2745	
  2746			/*
  2747			 * When refaults are being observed, it means a new
  2748			 * workingset is being established. Deactivate to get
  2749			 * rid of any stale active pages quickly.
  2750			 */
  2751			refaults = lruvec_page_state(target_lruvec,
  2752					WORKINGSET_ACTIVATE_FILE);
  2753			if (refaults != target_lruvec->refaults[1] ||
  2754			    inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
  2755				sc->may_deactivate |= DEACTIVATE_FILE;
  2756			else
  2757				sc->may_deactivate &= ~DEACTIVATE_FILE;
  2758		} else
  2759			sc->may_deactivate = DEACTIVATE_ANON | DEACTIVATE_FILE;
  2760	
  2761		/*
  2762		 * If we have plenty of inactive file pages that aren't
  2763		 * thrashing, try to reclaim those first before touching
  2764		 * anonymous pages.
  2765		 */
  2766		file = lruvec_page_state(target_lruvec, NR_INACTIVE_FILE);
  2767		if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
  2768			sc->cache_trim_mode = 1;
  2769		else
  2770			sc->cache_trim_mode = 0;
  2771	
  2772		/*
  2773		 * Prevent the reclaimer from falling into the cache trap: as
  2774		 * cache pages start out inactive, every cache fault will tip
  2775		 * the scan balance towards the file LRU.  And as the file LRU
  2776		 * shrinks, so does the window for rotation from references.
  2777		 * This means we have a runaway feedback loop where a tiny
  2778		 * thrashing file LRU becomes infinitely more attractive than
  2779		 * anon pages.  Try to detect this based on file LRU size.
  2780		 */
  2781		if (!cgroup_reclaim(sc)) {
  2782			unsigned long total_high_wmark = 0;
  2783			unsigned long free, anon;
  2784			int z;
  2785	
  2786			free = sum_zone_node_page_state(pgdat->node_id, NR_FREE_PAGES);
  2787			file = node_page_state(pgdat, NR_ACTIVE_FILE) +
  2788				   node_page_state(pgdat, NR_INACTIVE_FILE);
  2789	
  2790			for (z = 0; z < MAX_NR_ZONES; z++) {
  2791				struct zone *zone = &pgdat->node_zones[z];
  2792				if (!managed_zone(zone))
  2793					continue;
  2794	
  2795				total_high_wmark += high_wmark_pages(zone);
  2796			}
  2797	
  2798			/*
  2799			 * Consider anon: if that's low too, this isn't a
  2800			 * runaway file reclaim problem, but rather just
  2801			 * extreme pressure. Reclaim as per usual then.
  2802			 */
  2803			anon = node_page_state(pgdat, NR_INACTIVE_ANON);
  2804	
  2805			sc->file_is_tiny =
  2806				file + free <= total_high_wmark &&
  2807				!(sc->may_deactivate & DEACTIVATE_ANON) &&
  2808				anon >> sc->priority;
  2809		}
  2810	
  2811		if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
  2812			unsigned long reclaimable_file, dirty, clean;
  2813	
  2814			reclaimable_file =
  2815				node_page_state(pgdat, NR_ACTIVE_FILE) +
  2816				node_page_state(pgdat, NR_INACTIVE_FILE) +
  2817				node_page_state(pgdat, NR_ISOLATED_FILE);
  2818			dirty = node_page_state(pgdat, NR_FILE_DIRTY);
> 2819			if (reclaimable_file > dirty)
  2820				clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);
  2821	
  2822			sc->clean_below_low = clean < sysctl_clean_low_kbytes;
  2823			sc->clean_below_min = clean < sysctl_clean_min_kbytes;
  2824		} else {
  2825			sc->clean_below_low = false;
  2826			sc->clean_below_min = false;
  2827		}
  2828	
  2829		shrink_node_memcgs(pgdat, sc);
  2830	
  2831		if (reclaim_state) {
  2832			sc->nr_reclaimed += reclaim_state->reclaimed_slab;
  2833			reclaim_state->reclaimed_slab = 0;
  2834		}
  2835	
  2836		/* Record the subtree's reclaim efficiency */
  2837		vmpressure(sc->gfp_mask, sc->target_mem_cgroup, true,
  2838			   sc->nr_scanned - nr_scanned,
  2839			   sc->nr_reclaimed - nr_reclaimed);
  2840	
  2841		if (sc->nr_reclaimed - nr_reclaimed)
  2842			reclaimable = true;
  2843	
  2844		if (current_is_kswapd()) {
  2845			/*
  2846			 * If reclaim is isolating dirty pages under writeback,
  2847			 * it implies that the long-lived page allocation rate
  2848			 * is exceeding the page laundering rate. Either the
  2849			 * global limits are not being effective at throttling
  2850			 * processes due to the page distribution throughout
  2851			 * zones or there is heavy usage of a slow backing
  2852			 * device. The only option is to throttle from reclaim
  2853			 * context which is not ideal as there is no guarantee
  2854			 * the dirtying process is throttled in the same way
  2855			 * balance_dirty_pages() manages.
  2856			 *
  2857			 * Once a node is flagged PGDAT_WRITEBACK, kswapd will
  2858			 * count the number of pages under pages flagged for
  2859			 * immediate reclaim and stall if any are encountered
  2860			 * in the nr_immediate check below.
  2861			 */
  2862			if (sc->nr.writeback && sc->nr.writeback == sc->nr.taken)
  2863				set_bit(PGDAT_WRITEBACK, &pgdat->flags);
  2864	
  2865			/* Allow kswapd to start writing pages during reclaim.*/
  2866			if (sc->nr.unqueued_dirty == sc->nr.file_taken)
  2867				set_bit(PGDAT_DIRTY, &pgdat->flags);
  2868	
  2869			/*
  2870			 * If kswapd scans pages marked for immediate
  2871			 * reclaim and under writeback (nr_immediate), it
  2872			 * implies that pages are cycling through the LRU
  2873			 * faster than they are written so also forcibly stall.
  2874			 */
  2875			if (sc->nr.immediate)
  2876				congestion_wait(BLK_RW_ASYNC, HZ/10);
  2877		}
  2878	
  2879		/*
  2880		 * Tag a node/memcg as congested if all the dirty pages
  2881		 * scanned were backed by a congested BDI and
  2882		 * wait_iff_congested will stall.
  2883		 *
  2884		 * Legacy memcg will stall in page writeback so avoid forcibly
  2885		 * stalling in wait_iff_congested().
  2886		 */
  2887		if ((current_is_kswapd() ||
  2888		     (cgroup_reclaim(sc) && writeback_throttling_sane(sc))) &&
  2889		    sc->nr.dirty && sc->nr.dirty == sc->nr.congested)
  2890			set_bit(LRUVEC_CONGESTED, &target_lruvec->flags);
  2891	
  2892		/*
  2893		 * Stall direct reclaim for IO completions if underlying BDIs
  2894		 * and node is congested. Allow kswapd to continue until it
  2895		 * starts encountering unqueued dirty pages or cycling through
  2896		 * the LRU too quickly.
  2897		 */
  2898		if (!current_is_kswapd() && current_may_throttle() &&
  2899		    !sc->hibernation_mode &&
  2900		    test_bit(LRUVEC_CONGESTED, &target_lruvec->flags))
  2901			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
  2902	
  2903		if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
  2904					    sc))
  2905			goto again;
  2906	
  2907		/*
  2908		 * Kswapd gives up on balancing particular nodes after too
  2909		 * many failures to reclaim anything from them and goes to
  2910		 * sleep. On reclaim progress, reset the failure counter. A
  2911		 * successful direct reclaim run will revive a dormant kswapd.
  2912		 */
  2913		if (reclaimable)
  2914			pgdat->kswapd_failures = 0;
  2915	}
  2916	

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 11322 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] mm/swap: fix system stuck due to infinite loop
       [not found] ` <20210406065944.08d8aa76@mail.inbox.lv>
  2021-04-06  0:15   ` [PATCH] mm/vmscan: add sysctl knobs for protecting the specified kernel test robot
  2021-04-06  1:16   ` kernel test robot
@ 2021-04-06 22:49   ` Stillinux
  2 siblings, 0 replies; 6+ messages in thread
From: Stillinux @ 2021-04-06 22:49 UTC (permalink / raw)
  To: Alexey Avramov
  Cc: Andrew Morton, linux-mm, linux-kernel, liuzhengyuan, liuyun01

[-- Attachment #1: Type: text/plain, Size: 14429 bytes --]

  Hi Alexey, thank you for the patch! It looks promising; we will try it to
cut down I/O operations in our high-memory-pressure tests.

 After checking our vmcore, we can see that our system's I/O pressure comes
from swap_writepage and swap_readpage invoked under the shrink-list
operations.

On Tue, Apr 6, 2021 at 5:59 AM Alexey Avramov <hakavlad@inbox.lv> wrote:

> > In the case of high system memory and load pressure, we ran ltp test
> > and found that the system was stuck, the direct memory reclaim was
> > all stuck in io_schedule
>
> > For the first time involving the swap part, there is no good way to fix
> > the problem
>
> The solution is protecting the clean file pages.
>
> Look at this:
>
> > On ChromiumOS, we do not use swap. When memory is low, the only
> > way to free memory is to reclaim pages from the file list. This
> > results in a lot of thrashing under low memory conditions. We see
> > the system become unresponsive for minutes before it eventually OOMs.
> > We also see very slow browser tab switching under low memory. Instead
> > of an unresponsive system, we'd really like the kernel to OOM as soon
> > as it starts to thrash. If it can't keep the working set in memory,
> > then OOM. Losing one of many tabs is a better behaviour for the user
> > than an unresponsive system.
>
> > This patch creates a new sysctl, min_filelist_kbytes, which disables
> > reclaim of file-backed pages when there are less than min_filelist_kbytes
> > worth of such pages in the cache. This tunable is handy for low memory
> > systems using solid-state storage where interactive response is more
> > important than not OOMing.
>
> > With this patch and min_filelist_kbytes set to 50000, I see very little
> > block layer activity during low memory. The system stays responsive under
> > low memory and browser tab switching is fast. Eventually, a process gets
> > killed by OOM. Without this patch, the system gets wedged for minutes
> > before it eventually OOMs.
>
> — https://lore.kernel.org/patchwork/patch/222042/
>
> This patch can almost completely eliminate thrashing under memory pressure.
>
> Effects
> - Improving system responsiveness under low-memory conditions;
> - Improving performance in I/O-bound tasks under memory pressure;
> - OOM killer comes faster (with hard protection);
> - Fast system reclaiming after OOM.
>
> Read more: https://github.com/hakavlad/le9-patch
>
> The patch:
>
> From 371e3e5290652e97d5279d8cd215cd356c1fb47b Mon Sep 17 00:00:00 2001
> From: Alexey Avramov <hakavlad@inbox.lv>
> Date: Mon, 5 Apr 2021 01:53:26 +0900
> Subject: [PATCH] mm/vmscan: add sysctl knobs for protecting the specified
>  amount of clean file cache
>
> The kernel does not have a mechanism for targeted protection of clean
> file pages (CFP). A certain amount of CFP is required by userspace
> for normal operation. First of all, you need a cache of shared libraries
> and executable files. If the volume of the CFP cache falls below a certain
> level, thrashing and even livelock occur.
>
> Protection of CFP may be used to prevent thrashing and reducing I/O under
> memory pressure. Hard protection of CFP may be used to avoid high latency
> and prevent livelock in near-OOM conditions. The patch provides sysctl
> knobs for protecting the specified amount of clean file cache under memory
> pressure.
>
> The vm.clean_low_kbytes sysctl knob provides *best-effort* protection of
> CFP. The CFP on the current node won't be reclaimed under memory pressure
> when their volume is below vm.clean_low_kbytes *unless* we threaten to OOM
> or have no swap space or vm.swappiness=0. Setting it to a high value may
> result in an early eviction of anonymous pages into the swap space by
> attempting to hold the protected amount of clean file pages in memory. The
> default value is defined by CONFIG_CLEAN_LOW_KBYTES (suggested 0 in
> Kconfig).
>
> The vm.clean_min_kbytes sysctl knob provides *hard* protection of CFP. The
> CFP on the current node won't be reclaimed under memory pressure when their
> volume is below vm.clean_min_kbytes. Setting it to a high value may result
> in an early out-of-memory condition due to the inability to reclaim the
> protected amount of CFP when other types of pages cannot be reclaimed. The
> default value is defined by CONFIG_CLEAN_MIN_KBYTES (suggested 0 in
> Kconfig).
>
> Reported-by: Artem S. Tashkinov <aros@gmx.com>
> Signed-off-by: Alexey Avramov <hakavlad@inbox.lv>
> ---
>  Documentation/admin-guide/sysctl/vm.rst | 37 +++++++++++++++++++++
>  include/linux/mm.h                      |  3 ++
>  kernel/sysctl.c                         | 14 ++++++++
>  mm/Kconfig                              | 35 +++++++++++++++++++
>  mm/vmscan.c                             | 59 +++++++++++++++++++++++++++++++++
>  5 files changed, 148 insertions(+)
>
> diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
> index f455fa00c..5d5ddfc85 100644
> --- a/Documentation/admin-guide/sysctl/vm.rst
> +++ b/Documentation/admin-guide/sysctl/vm.rst
> @@ -26,6 +26,8 @@ Currently, these files are in /proc/sys/vm:
>
>  - admin_reserve_kbytes
>  - block_dump
> +- clean_low_kbytes
> +- clean_min_kbytes
>  - compact_memory
>  - compaction_proactiveness
>  - compact_unevictable_allowed
> @@ -113,6 +115,41 @@ block_dump enables block I/O debugging when set to a nonzero value. More
>  information on block I/O debugging is in Documentation/admin-guide/laptops/laptop-mode.rst.
>
>
> +clean_low_kbytes
> +=====================
> +
> +This knob provides *best-effort* protection of clean file pages. The clean file
> +pages on the current node won't be reclaimed under memory pressure when their
> +volume is below vm.clean_low_kbytes *unless* we threaten to OOM or have no
> +swap space or vm.swappiness=0.
> +
> +Protection of clean file pages may be used to prevent thrashing and
> +reducing I/O under low-memory conditions.
> +
> +Setting it to a high value may result in an early eviction of anonymous pages
> +into the swap space by attempting to hold the protected amount of clean file
> +pages in memory.
> +
> +The default value is defined by CONFIG_CLEAN_LOW_KBYTES.
> +
> +
> +clean_min_kbytes
> +=====================
> +
> +This knob provides *hard* protection of clean file pages. The clean file pages
> +on the current node won't be reclaimed under memory pressure when their volume
> +is below vm.clean_min_kbytes.
> +
> +Hard protection of clean file pages may be used to avoid high latency and
> +prevent livelock in near-OOM conditions.
> +
> +Setting it to a high value may result in an early out-of-memory condition due
> +to the inability to reclaim the protected amount of clean file pages when
> +other types of pages cannot be reclaimed.
> +
> +The default value is defined by CONFIG_CLEAN_MIN_KBYTES.
> +
> +
>  compact_memory
>  ==============
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index db6ae4d3f..7799f1555 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -202,6 +202,9 @@ static inline void __mm_zero_struct_page(struct page *page)
>
>  extern int sysctl_max_map_count;
>
> +extern unsigned long sysctl_clean_low_kbytes;
> +extern unsigned long sysctl_clean_min_kbytes;
> +
>  extern unsigned long sysctl_user_reserve_kbytes;
>  extern unsigned long sysctl_admin_reserve_kbytes;
>
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index afad08596..854b311cd 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -3083,6 +3083,20 @@ static struct ctl_table vm_table[] = {
>         },
>  #endif
>         {
> +               .procname       = "clean_low_kbytes",
> +               .data           = &sysctl_clean_low_kbytes,
> +               .maxlen         = sizeof(sysctl_clean_low_kbytes),
> +               .mode           = 0644,
> +               .proc_handler   = proc_doulongvec_minmax,
> +       },
> +       {
> +               .procname       = "clean_min_kbytes",
> +               .data           = &sysctl_clean_min_kbytes,
> +               .maxlen         = sizeof(sysctl_clean_min_kbytes),
> +               .mode           = 0644,
> +               .proc_handler   = proc_doulongvec_minmax,
> +       },
> +       {
>                 .procname       = "user_reserve_kbytes",
>                 .data           = &sysctl_user_reserve_kbytes,
>                 .maxlen         = sizeof(sysctl_user_reserve_kbytes),
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 390165ffb..3915c71e1 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -122,6 +122,41 @@ config SPARSEMEM_VMEMMAP
>           pfn_to_page and page_to_pfn operations.  This is the most
>           efficient option when sufficient kernel resources are available.
>
> +config CLEAN_LOW_KBYTES
> +       int "Default value for vm.clean_low_kbytes"
> +       depends on SYSCTL
> +       default "0"
> +       help
> +         The vm.clean_low_kbytes sysctl knob provides *best-effort*
> +         protection of clean file pages. The clean file pages on the current
> +         node won't be reclaimed under memory pressure when their volume is
> +         below vm.clean_low_kbytes *unless* we threaten to OOM or have
> +         no swap space or vm.swappiness=0.
> +
> +         Protection of clean file pages may be used to prevent thrashing and
> +         reducing I/O under low-memory conditions.
> +
> +         Setting it to a high value may result in an early eviction of
> +         anonymous pages into the swap space by attempting to hold the
> +         protected amount of clean file pages in memory.
> +
> +config CLEAN_MIN_KBYTES
> +       int "Default value for vm.clean_min_kbytes"
> +       depends on SYSCTL
> +       default "0"
> +       help
> +         The vm.clean_min_kbytes sysctl knob provides *hard* protection
> +         of clean file pages. The clean file pages on the current node
> +         won't be reclaimed under memory pressure when their volume is
> +         below vm.clean_min_kbytes.
> +
> +         Hard protection of clean file pages may be used to avoid high
> +         latency and prevent livelock in near-OOM conditions.
> +
> +         Setting it to a high value may result in an early out-of-memory
> +         condition due to the inability to reclaim the protected amount of
> +         clean file pages when other types of pages cannot be reclaimed.
> +
>  config HAVE_MEMBLOCK_PHYS_MAP
>         bool
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 7b4e31eac..77e98c43e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -120,6 +120,19 @@ struct scan_control {
>         /* The file pages on the current node are dangerously low */
>         unsigned int file_is_tiny:1;
>
> +       /*
> +        * The clean file pages on the current node won't be reclaimed when
> +        * their volume is below vm.clean_low_kbytes *unless* we threaten
> +        * to OOM or have no swap space or vm.swappiness=0.
> +        */
> +       unsigned int clean_below_low:1;
> +
> +       /*
> +        * The clean file pages on the current node won't be reclaimed when
> +        * their volume is below vm.clean_min_kbytes.
> +        */
> +       unsigned int clean_below_min:1;
> +
>         /* Allocation order */
>         s8 order;
>
> @@ -166,6 +179,17 @@ struct scan_control {
>  #define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
>  #endif
>
> +#if CONFIG_CLEAN_LOW_KBYTES < 0
> +#error "CONFIG_CLEAN_LOW_KBYTES must be >= 0"
> +#endif
> +
> +#if CONFIG_CLEAN_MIN_KBYTES < 0
> +#error "CONFIG_CLEAN_MIN_KBYTES must be >= 0"
> +#endif
> +
> +unsigned long sysctl_clean_low_kbytes __read_mostly = CONFIG_CLEAN_LOW_KBYTES;
> +unsigned long sysctl_clean_min_kbytes __read_mostly = CONFIG_CLEAN_MIN_KBYTES;
> +
>  /*
>   * From 0 .. 200.  Higher means more swappy.
>   */
> @@ -2283,6 +2307,16 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>         }
>
>         /*
> +        * Force-scan anon if clean file pages are under vm.clean_min_kbytes
> +        * or vm.clean_low_kbytes (unless the swappiness setting
> +        * disagrees with swapping).
> +        */
> +       if ((sc->clean_below_low || sc->clean_below_min) && swappiness) {
> +               scan_balance = SCAN_ANON;
> +               goto out;
> +       }
> +
> +       /*
>          * If there is enough inactive page cache, we do not reclaim
>          * anything from the anonymous working right now.
>          */
> @@ -2418,6 +2452,13 @@ static void get_scan_count(struct lruvec *lruvec, struct scan_control *sc,
>                         BUG();
>                 }
>
> +               /*
> +                * Don't reclaim clean file pages when their volume is below
> +                * vm.clean_min_kbytes.
> +                */
> +               if (file && sc->clean_below_min)
> +                       scan = 0;
> +
>                 nr[lru] = scan;
>         }
>  }
> @@ -2768,6 +2809,24 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
>                         anon >> sc->priority;
>         }
>
> +       if (sysctl_clean_low_kbytes || sysctl_clean_min_kbytes) {
> +               unsigned long reclaimable_file, dirty, clean;
> +
> +               reclaimable_file =
> +                       node_page_state(pgdat, NR_ACTIVE_FILE) +
> +                       node_page_state(pgdat, NR_INACTIVE_FILE) +
> +                       node_page_state(pgdat, NR_ISOLATED_FILE);
> +               dirty = node_page_state(pgdat, NR_FILE_DIRTY);
> +               if (reclaimable_file > dirty)
> +                       clean = (reclaimable_file - dirty) << (PAGE_SHIFT - 10);
> +
> +               sc->clean_below_low = clean < sysctl_clean_low_kbytes;
> +               sc->clean_below_min = clean < sysctl_clean_min_kbytes;
> +       } else {
> +               sc->clean_below_low = false;
> +               sc->clean_below_min = false;
> +       }
> +
>         shrink_node_memcgs(pgdat, sc);
>
>         if (reclaim_state) {
> --
> 2.11.0
>
>
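
If the patch above were applied, persisting the knobs would look like an ordinary sysctl fragment, e.g. a file under /etc/sysctl.d/. The values below are purely illustrative, not recommendations:

```ini
# Illustrative values only (both knobs are in KiB):
# best-effort floor of ~150 MiB of clean file cache,
# hard floor of ~50 MiB.
vm.clean_low_kbytes = 150000
vm.clean_min_kbytes = 50000
```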

[-- Attachment #2: Type: text/html, Size: 16760 bytes --]


end of thread, other threads:[~2021-04-06 22:49 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-02  7:03 [RFC PATCH] mm/swap: fix system stuck due to infinite loop Stillinux
2021-04-03  0:44 ` Andrew Morton
2021-04-04  9:26   ` Stillinux
     [not found] ` <20210406065944.08d8aa76@mail.inbox.lv>
2021-04-06  0:15   ` [PATCH] mm/vmscan: add sysctl knobs for protecting the specified kernel test robot
2021-04-06  1:16   ` kernel test robot
2021-04-06 22:49   ` [RFC PATCH] mm/swap: fix system stuck due to infinite loop Stillinux

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).