* [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
@ 2019-10-28 15:08 zhong jiang
  2019-10-28 15:27 ` David Hildenbrand
  2019-10-29  8:11 ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: zhong jiang @ 2019-10-28 15:08 UTC (permalink / raw)
  To: akpm; +Cc: mhocko, hannes, ktkhai, zhongjiang, linux-mm

Recently, I hit the following issue when running the upstream kernel.

kernel BUG at mm/vmscan.c:1521!
invalid opcode: 0000 [#1] SMP KASAN PTI
CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
PKRU: 55555554
Call Trace:
 reclaim_pages+0x499/0x800 mm/vmscan.c:2188
 madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
 walk_pmd_range mm/pagewalk.c:53 [inline]
 walk_pud_range mm/pagewalk.c:112 [inline]
 walk_p4d_range mm/pagewalk.c:139 [inline]
 walk_pgd_range mm/pagewalk.c:166 [inline]
 __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
 walk_page_range+0x179/0x310 mm/pagewalk.c:349
 madvise_pageout_page_range mm/madvise.c:506 [inline]
 madvise_pageout+0x1f0/0x330 mm/madvise.c:542
 madvise_vma mm/madvise.c:931 [inline]
 __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
 do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

madvise_pageout() walks the specified range of the vma, isolates the
pages in it, and then runs shrink_page_list() to reclaim the memory.
But it also isolates unevictable pages for reclaim. Hence, we can hit
the VM_BUG_ON_PAGE() in shrink_page_list().

We could fix it by preventing unevictable pages from being isolated in
the first place. Another way is to drop the PageUnevictable(page)
condition from the VM_BUG_ON_PAGE() in shrink_page_list(). I think the
latter is better, because shrink_page_list() already checks
page_evictable() and puts unevictable pages back on the correct LRU.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
---
 mm/vmscan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f7d1301..1c6e959 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		unlock_page(page);
 keep:
 		list_add(&page->lru, &ret_pages);
-		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
+		VM_BUG_ON_PAGE(PageLRU(page), page);
 	}
 
 	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
-- 
1.7.12.4



^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-28 15:08 [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout zhong jiang
@ 2019-10-28 15:27 ` David Hildenbrand
  2019-10-28 15:45   ` zhong jiang
  2019-10-29  8:11 ` Michal Hocko
  1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2019-10-28 15:27 UTC (permalink / raw)
  To: zhong jiang, akpm; +Cc: mhocko, hannes, ktkhai, linux-mm

On 28.10.19 16:08, zhong jiang wrote:
> Recently, I hit the following issue when running in the upstream.
> 
> [ oops and call trace snipped ]
> 
> madvise_pageout access the specified range of the vma and isolate
> them, then run shrink_page_list to reclaim the memory. But It also
> isolate the unevictable page to reclaim. Hence, we can catch the
> cases in shrink_page_list.
> 
> We can fix it by preventing unevictable page from isolating.
> Another way to fix the issue by removing the condition of
> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
> is better  to use the latter. Because We has taken the unevictable
> page and skip it into account in shrink_page_list.

I really don't understand the last sentence. Looks like
something got messed up :)


> 
> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> ---
>   mm/vmscan.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f7d1301..1c6e959 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>   		unlock_page(page);
>   keep:
>   		list_add(&page->lru, &ret_pages);
> -		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
> +		VM_BUG_ON_PAGE(PageLRU(page), page);

So, this comes from

commit b291f000393f5a0b679012b39d79fbc85c018233
Author: Nick Piggin <npiggin@suse.de>
Date:   Sat Oct 18 20:26:44 2008 -0700

    mlock: mlocked pages are unevictable
    
    Make sure that mlocked pages also live on the unevictable LRU, so kswapd
    will not scan them over and over again.


That patch is fairly old. How come we can suddenly trigger this?
Which commit is responsible for that? Was it always broken?

I can see that

commit ad6b67041a45497261617d7a28b15159b202cb5a
Author: Minchan Kim <minchan@kernel.org>
Date:   Wed May 3 14:54:13 2017 -0700

    mm: remove SWAP_MLOCK in ttu

Performed some changes in that area. But also some time ago.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-28 15:27 ` David Hildenbrand
@ 2019-10-28 15:45   ` zhong jiang
  2019-10-28 16:07     ` David Hildenbrand
  0 siblings, 1 reply; 23+ messages in thread
From: zhong jiang @ 2019-10-28 15:45 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: akpm, mhocko, hannes, ktkhai, linux-mm

On 2019/10/28 23:27, David Hildenbrand wrote:
> On 28.10.19 16:08, zhong jiang wrote:
>> Recently, I hit the following issue when running in the upstream.
>>
>> [ oops and call trace snipped ]
>>
>> madvise_pageout access the specified range of the vma and isolate
>> them, then run shrink_page_list to reclaim the memory. But It also
>> isolate the unevictable page to reclaim. Hence, we can catch the
>> cases in shrink_page_list.
>>
>> We can fix it by preventing unevictable page from isolating.
>> Another way to fix the issue by removing the condition of
>> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
>> is better  to use the latter. Because We has taken the unevictable
>> page and skip it into account in shrink_page_list.
> I really don't understand the last sentence. Looks like
> something got messed up :)
I mean that shrink_page_list() checks page_evictable(page); if the page
is unevictable, it puts the page back on the correct LRU.

Based on that check, I chose the latter option. It seems simpler. :-)

Thanks,
zhong jiang
>
>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
>> ---
>>   mm/vmscan.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f7d1301..1c6e959 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>   		unlock_page(page);
>>   keep:
>>   		list_add(&page->lru, &ret_pages);
>> -		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>> +		VM_BUG_ON_PAGE(PageLRU(page), page);
> So, this comes from
>
> commit b291f000393f5a0b679012b39d79fbc85c018233
> Author: Nick Piggin <npiggin@suse.de>
> Date:   Sat Oct 18 20:26:44 2008 -0700
>
>     mlock: mlocked pages are unevictable
>     
>     Make sure that mlocked pages also live on the unevictable LRU, so kswapd
>     will not scan them over and over again.
>
>
> That patch is fairly old. How come we can suddenly trigger this?
> Which commit is responsible for that? Was it always broken?
>
> I can see that
>
> commit ad6b67041a45497261617d7a28b15159b202cb5a
> Author: Minchan Kim <minchan@kernel.org>
> Date:   Wed May 3 14:54:13 2017 -0700
>
>     mm: remove SWAP_MLOCK in ttu
>
> Performed some changes in that area. But also some time ago.
I think the following patch introduced the issue.

commit 1a4e58cce84ee88129d5d49c064bd2852b481357
Author: Minchan Kim <minchan@kernel.org>
Date:   Wed Sep 25 16:49:15 2019 -0700

    mm: introduce MADV_PAGEOUT

    When a process expects no accesses to a certain memory range for a long

Thanks,
zhong jiang



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-28 15:45   ` zhong jiang
@ 2019-10-28 16:07     ` David Hildenbrand
  2019-10-28 16:15       ` zhong jiang
  2019-10-28 16:15       ` David Hildenbrand
  0 siblings, 2 replies; 23+ messages in thread
From: David Hildenbrand @ 2019-10-28 16:07 UTC (permalink / raw)
  To: zhong jiang; +Cc: akpm, mhocko, hannes, ktkhai, linux-mm, Minchan Kim

On 28.10.19 16:45, zhong jiang wrote:
> On 2019/10/28 23:27, David Hildenbrand wrote:
>> On 28.10.19 16:08, zhong jiang wrote:
>>> Recently, I hit the following issue when running in the upstream.
>>>
>>> [ oops and call trace snipped ]
>>>
>>> madvise_pageout access the specified range of the vma and isolate
>>> them, then run shrink_page_list to reclaim the memory. But It also
>>> isolate the unevictable page to reclaim. Hence, we can catch the
>>> cases in shrink_page_list.
>>>
>>> We can fix it by preventing unevictable page from isolating.
>>> Another way to fix the issue by removing the condition of
>>> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
>>> is better  to use the latter. Because We has taken the unevictable
>>> page and skip it into account in shrink_page_list.
>> I really don't understand the last sentence. Looks like
>> something got messed up :)
> I mean that we will check the page_evictable(page) in shrink_page_list,
> if it is unevictable page, we will put the page back to correct lru.
> 
> Based on the condition, I make the choice. It seems to more simpler.:-)
> 
> Thanks,
> zhong jiang
>>
>>> [ quoted patch and commit references snipped ]
> I think the following patch introduce the issue.
> 
> commit 1a4e58cce84ee88129d5d49c064bd2852b481357
> Author: Minchan Kim <minchan@kernel.org>
> Date:   Wed Sep 25 16:49:15 2019 -0700
> 
>      mm: introduce MADV_PAGEOUT
> 
>      When a process expects no accesses to a certain memory range for a long
> 

CCing Minchan Kim then.

If this is indeed the patch that introduced the issue, you should 
probably reference it in your cover letter somehow. (A Fixes: tag does 
not apply until the commit is upstream.)

I am absolutely no expert on vmscan.c, so I'm afraid I can't really 
comment on the details.

-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-28 16:07     ` David Hildenbrand
@ 2019-10-28 16:15       ` zhong jiang
  2019-10-28 16:15       ` David Hildenbrand
  1 sibling, 0 replies; 23+ messages in thread
From: zhong jiang @ 2019-10-28 16:15 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: akpm, mhocko, hannes, ktkhai, linux-mm, Minchan Kim

On 2019/10/29 0:07, David Hildenbrand wrote:
> On 28.10.19 16:45, zhong jiang wrote:
>> On 2019/10/28 23:27, David Hildenbrand wrote:
>>> On 28.10.19 16:08, zhong jiang wrote:
>>>> Recently, I hit the following issue when running in the upstream.
>>>>
>>>> [ oops and call trace snipped ]
>>>>
>>>> madvise_pageout access the specified range of the vma and isolate
>>>> them, then run shrink_page_list to reclaim the memory. But It also
>>>> isolate the unevictable page to reclaim. Hence, we can catch the
>>>> cases in shrink_page_list.
>>>>
>>>> We can fix it by preventing unevictable page from isolating.
>>>> Another way to fix the issue by removing the condition of
>>>> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
>>>> is better  to use the latter. Because We has taken the unevictable
>>>> page and skip it into account in shrink_page_list.
>>> I really don't understand the last sentence. Looks like
>>> something got messed up :)
>> I mean that we will check the page_evictable(page) in shrink_page_list,
>> if it is unevictable page, we will put the page back to correct lru.
>>
>> Based on the condition, I make the choice. It seems to more simpler.:-)
>>
>> Thanks,
>> zhong jiang
>>>
>>>> [ quoted patch and commit references snipped ]
>> I think the following patch introduce the issue.
>>
>> commit 1a4e58cce84ee88129d5d49c064bd2852b481357
>> Author: Minchan Kim <minchan@kernel.org>
>> Date:   Wed Sep 25 16:49:15 2019 -0700
>>
>>      mm: introduce MADV_PAGEOUT
>>
>>      When a process expects no accesses to a certain memory range for a long
>>
>
> CCing Minchan Kim then.
>
> If this is indeed the introducing patch, you probably reference that patch in your cover mail somehow. (Fixes: does not apply until upstream)
>
> I am absolutely no expert on vmscan.c, so I'm afraid I can't really comment on the details.
>
Yep, but thanks anyway for your concern and reply.

Sincerely,
zhong jiang



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-28 16:07     ` David Hildenbrand
  2019-10-28 16:15       ` zhong jiang
@ 2019-10-28 16:15       ` David Hildenbrand
  2019-10-29  2:29         ` zhong jiang
  1 sibling, 1 reply; 23+ messages in thread
From: David Hildenbrand @ 2019-10-28 16:15 UTC (permalink / raw)
  To: zhong jiang; +Cc: akpm, mhocko, hannes, ktkhai, linux-mm, Minchan Kim

On 28.10.19 17:07, David Hildenbrand wrote:
> On 28.10.19 16:45, zhong jiang wrote:
>> On 2019/10/28 23:27, David Hildenbrand wrote:
>>> On 28.10.19 16:08, zhong jiang wrote:
>>>> Recently, I hit the following issue when running in the upstream.
>>>>
>>>> [ oops and call trace snipped ]
>>>>
>>>> madvise_pageout access the specified range of the vma and isolate
>>>> them, then run shrink_page_list to reclaim the memory. But It also
>>>> isolate the unevictable page to reclaim. Hence, we can catch the
>>>> cases in shrink_page_list.
>>>>
>>>> We can fix it by preventing unevictable page from isolating.
>>>> Another way to fix the issue by removing the condition of
>>>> BUG_ON(PageUnevictable(page)) in shrink_page_list. I think it
>>>> is better  to use the latter. Because We has taken the unevictable
>>>> page and skip it into account in shrink_page_list.
>>> I really don't understand the last sentence. Looks like
>>> something got messed up :)
>> I mean that we will check the page_evictable(page) in shrink_page_list,
>> if it is unevictable page, we will put the page back to correct lru.
>>
>> Based on the condition, I make the choice. It seems to more simpler.:-)
>>
>> Thanks,
>> zhong jiang
>>>
>>>> [ quoted patch and commit references snipped ]
>> I think the following patch introduce the issue.
>>
>> commit 1a4e58cce84ee88129d5d49c064bd2852b481357
>> Author: Minchan Kim <minchan@kernel.org>
>> Date:   Wed Sep 25 16:49:15 2019 -0700
>>
>>       mm: introduce MADV_PAGEOUT
>>
>>       When a process expects no accesses to a certain memory range for a long
>>
> 
> CCing Minchan Kim then.
> 
> If this is indeed the introducing patch, you probably reference that
> patch in your cover mail somehow. (Fixes: does not apply until upstream)
> 
> I am absolutely no expert on vmscan.c, so I'm afraid I can't really
> comment on the details.
> 

Oh, and just wondering, is this the same BUG as in

https://lkml.org/lkml/2019/8/2/1506

where a fix has been proposed? That fix does not seem to be in 
next/master yet.

(I just realized that it is already upstream, so Fixes: 1a4e58cce84e 
("mm: introduce MADV_PAGEOUT") applies.)



-- 

Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-28 16:15       ` David Hildenbrand
@ 2019-10-29  2:29         ` zhong jiang
  0 siblings, 0 replies; 23+ messages in thread
From: zhong jiang @ 2019-10-29  2:29 UTC (permalink / raw)
  To: David Hildenbrand; +Cc: akpm, mhocko, hannes, ktkhai, linux-mm, Minchan Kim

On 2019/10/29 0:15, David Hildenbrand wrote:
> On 28.10.19 17:07, David Hildenbrand wrote:
>> On 28.10.19 16:45, zhong jiang wrote:
>>> On 2019/10/28 23:27, David Hildenbrand wrote:
>>>> On 28.10.19 16:08, zhong jiang wrote:
>>>>> Recently, I hit the following issue when running in the upstream.
>>>>>
>>>>> [ oops and call trace snipped ]
>>>>>
>>>>> madvise_pageout accesses the specified range of the vma, isolates
>>>>> the pages, then runs shrink_page_list to reclaim the memory. But it
>>>>> also isolates unevictable pages for reclaim. Hence, we can hit the
>>>>> BUG_ON in shrink_page_list.
>>>>>
>>>>> We can fix it by preventing unevictable pages from being isolated,
>>>>> or by removing the PageUnevictable condition from the
>>>>> VM_BUG_ON_PAGE in shrink_page_list. I think the latter is better,
>>>>> because shrink_page_list already checks for unevictable pages and
>>>>> puts them back on their lru.
>>>> I really don't understand the last sentence. Looks like
>>>> something got messed up :)
>>> I mean that we check page_evictable(page) in shrink_page_list;
>>> if a page is unevictable, we put it back on the correct lru.
>>>
>>> Based on that, I made the choice. It seems simpler. :-)
>>>
>>> Thanks,
>>> zhong jiang
>>>>
>>>>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
>>>>> ---
>>>>>     mm/vmscan.c | 2 +-
>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>>>> index f7d1301..1c6e959 100644
>>>>> --- a/mm/vmscan.c
>>>>> +++ b/mm/vmscan.c
>>>>> @@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>>>>             unlock_page(page);
>>>>>     keep:
>>>>>             list_add(&page->lru, &ret_pages);
>>>>> -        VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>>>>> +        VM_BUG_ON_PAGE(PageLRU(page), page);
>>>> So, this comes from
>>>>
>>>> commit b291f000393f5a0b679012b39d79fbc85c018233
>>>> Author: Nick Piggin <npiggin@suse.de>
>>>> Date:   Sat Oct 18 20:26:44 2008 -0700
>>>>
>>>>       mlock: mlocked pages are unevictable
>>>>             Make sure that mlocked pages also live on the unevictable LRU, so kswapd
>>>>       will not scan them over and over again.
>>>>
>>>>
>>>> That patch is fairly old. How come we can suddenly trigger this?
>>>> Which commit is responsible for that? Was it always broken?
>>>>
>>>> I can see that
>>>>
>>>> commit ad6b67041a45497261617d7a28b15159b202cb5a
>>>> Author: Minchan Kim <minchan@kernel.org>
>>>> Date:   Wed May 3 14:54:13 2017 -0700
>>>>
>>>>       mm: remove SWAP_MLOCK in ttu
>>>>
>>>> Performed some changes in that area. But also some time ago.
>>> I think the following patch introduce the issue.
>>>
>>> commit 1a4e58cce84ee88129d5d49c064bd2852b481357
>>> Author: Minchan Kim <minchan@kernel.org>
>>> Date:   Wed Sep 25 16:49:15 2019 -0700
>>>
>>>       mm: introduce MADV_PAGEOUT
>>>
>>>       When a process expects no accesses to a certain memory range for a long
>>>
>>
>> CCing Minchan Kim then.
>>
>> If this is indeed the introducing patch, you probably reference that
>> patch in your cover mail somehow. (Fixes: does not apply until upstream)
>>
>> I am absolutely no expert on vmscan.c, so I'm afraid I can't really
>> comment on the details.
>>
>
> Oh, and just wondering, is this the same BUG as in
>
> https://lkml.org/lkml/2019/8/2/1506
>
> Where a fix has been proposed? The fix does not seem to be in next/master yet.
>
> (I just realized that it is already upstream, so Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT") applies.)
>
>
>
I think that is not the same issue. Thanks,



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-28 15:08 [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout zhong jiang
  2019-10-28 15:27 ` David Hildenbrand
@ 2019-10-29  8:11 ` Michal Hocko
  2019-10-29  9:30   ` zhong jiang
  1 sibling, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-10-29  8:11 UTC (permalink / raw)
  To: zhong jiang; +Cc: akpm, hannes, ktkhai, linux-mm, Minchan Kim

[Cc Minchan]

On Mon 28-10-19 23:08:37, zhong jiang wrote:
> Recently, I hit the following issue when running in the upstream.
> 
> kernel BUG at mm/vmscan.c:1521!
> invalid opcode: 0000 [#1] SMP KASAN PTI
> CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
> RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
> Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
> RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
> RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
> RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
> RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
> R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
> R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
> FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
> DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
> PKRU: 55555554
> Call Trace:
>  reclaim_pages+0x499/0x800 mm/vmscan.c:2188
>  madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
>  walk_pmd_range mm/pagewalk.c:53 [inline]
>  walk_pud_range mm/pagewalk.c:112 [inline]
>  walk_p4d_range mm/pagewalk.c:139 [inline]
>  walk_pgd_range mm/pagewalk.c:166 [inline]
>  __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
>  walk_page_range+0x179/0x310 mm/pagewalk.c:349
>  madvise_pageout_page_range mm/madvise.c:506 [inline]
>  madvise_pageout+0x1f0/0x330 mm/madvise.c:542
>  madvise_vma mm/madvise.c:931 [inline]
>  __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
>  do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> madvise_pageout accesses the specified range of the vma, isolates
> the pages, then runs shrink_page_list to reclaim the memory. But it
> also isolates unevictable pages for reclaim. Hence, we can hit the
> BUG_ON in shrink_page_list.
> 
> We can fix it by preventing unevictable pages from being isolated,
> or by removing the PageUnevictable condition from the
> VM_BUG_ON_PAGE in shrink_page_list. I think the latter is better,
> because shrink_page_list already checks for unevictable pages and
> puts them back on their lru.

The justification is indeed not clear. This is essentially the same kind
of bug as a58f2cef26e1 ("mm/vmscan.c: fix trying to reclaim unevictable
LRU page") which has been fixed by checking PageUnevictable before
adding it to the list of pages to reclaim.
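
(For illustration only: a minimal userspace sketch of that caller-side pattern. The struct and function names mirror the kernel's but are hypothetical stand-ins, not kernel code; the point is that the PageUnevictable test happens before the page lands on the private reclaim list, so shrink_page_list() never sees an unevictable page.)

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for kernel objects. */
struct page {
	bool unevictable;      /* models PageUnevictable() */
	bool on_lru;           /* models PageLRU() */
	bool on_reclaim_list;
};

static int isolate_lru_page(struct page *p)
{
	if (!p->on_lru)
		return -1;     /* -EBUSY in the kernel */
	p->on_lru = false;
	return 0;
}

static void putback_lru_page(struct page *p)
{
	p->on_lru = true;
}

/*
 * Caller-side filter, as in a58f2cef26e1: an unevictable page is put
 * back on its LRU instead of being queued for shrink_page_list().
 */
static void maybe_queue_for_reclaim(struct page *p)
{
	if (isolate_lru_page(p))
		return;
	if (p->unevictable) {
		putback_lru_page(p);
		return;
	}
	p->on_reclaim_list = true;
}
```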

Removing a long existing BUG_ON begs for a much better explanation.
shrink_page_list is not a trivial piece of code but I _suspect_ that
removing it should be ok for mapped pages at least (try_to_unmap) but I
am not so sure how unmapped unevictable pages are handled from top of my
head.

Please also add Fixes: $sha

> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
> ---
>  mm/vmscan.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f7d1301..1c6e959 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>  		unlock_page(page);
>  keep:
>  		list_add(&page->lru, &ret_pages);
> -		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
> +		VM_BUG_ON_PAGE(PageLRU(page), page);
>  	}
>  
>  	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
> -- 
> 1.7.12.4
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-29  8:11 ` Michal Hocko
@ 2019-10-29  9:30   ` zhong jiang
  2019-10-29  9:40     ` Michal Hocko
  0 siblings, 1 reply; 23+ messages in thread
From: zhong jiang @ 2019-10-29  9:30 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, hannes, ktkhai, linux-mm, Minchan Kim

On 2019/10/29 16:11, Michal Hocko wrote:
> [Cc Minchan]
>
> On Mon 28-10-19 23:08:37, zhong jiang wrote:
>> Recently, I hit the following issue when running in the upstream.
>>
>> kernel BUG at mm/vmscan.c:1521!
>> invalid opcode: 0000 [#1] SMP KASAN PTI
>> CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
>> RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
>> Code: de f5 ff ff e8 ab 79 eb ff 4c 89 f7 e8 43 33 0d 00 e9 cc f5 ff ff e8 99 79 eb ff 48 c7 c6 a0 34 2b a0 4c 89 f7 e8 1a 4d 05 00 <0f> 0b e8 83 79 eb ff 48 89 d8 48 c1 e8 03 42 80 3c 38 00 0f 85 74
>> RSP: 0018:ffff88819a3df5a0 EFLAGS: 00010286
>> RAX: 0000000000040000 RBX: ffffea00061c3980 RCX: ffffffff814fba36
>> RDX: 00000000000056f7 RSI: ffffc9000c02c000 RDI: ffff8881f70268cc
>> RBP: ffff88819a3df898 R08: ffffed103ee05de0 R09: ffffed103ee05de0
>> R10: 0000000000000001 R11: ffffed103ee05ddf R12: ffff88819a3df6f0
>> R13: ffff88819a3df6f0 R14: ffffea00061c3980 R15: dffffc0000000000
>> FS:  00007f21b9d8e700(0000) GS:ffff8881f7000000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000001b2d621000 CR3: 00000001c8c46004 CR4: 00000000007606f0
>> DR0: 0000000020000140 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
>> PKRU: 55555554
>> Call Trace:
>>  reclaim_pages+0x499/0x800 mm/vmscan.c:2188
>>  madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
>>  walk_pmd_range mm/pagewalk.c:53 [inline]
>>  walk_pud_range mm/pagewalk.c:112 [inline]
>>  walk_p4d_range mm/pagewalk.c:139 [inline]
>>  walk_pgd_range mm/pagewalk.c:166 [inline]
>>  __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
>>  walk_page_range+0x179/0x310 mm/pagewalk.c:349
>>  madvise_pageout_page_range mm/madvise.c:506 [inline]
>>  madvise_pageout+0x1f0/0x330 mm/madvise.c:542
>>  madvise_vma mm/madvise.c:931 [inline]
>>  __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
>>  do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
>>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> madvise_pageout accesses the specified range of the vma, isolates
>> the pages, then runs shrink_page_list to reclaim the memory. But it
>> also isolates unevictable pages for reclaim. Hence, we can hit the
>> BUG_ON in shrink_page_list.
>>
>> We can fix it by preventing unevictable pages from being isolated,
>> or by removing the PageUnevictable condition from the
>> VM_BUG_ON_PAGE in shrink_page_list. I think the latter is better,
>> because shrink_page_list already checks for unevictable pages and
>> puts them back on their lru.
> The justification is indeed not clear. This is essentially the same kind
> of bug as a58f2cef26e1 ("mm/vmscan.c: fix trying to reclaim unevictable
> LRU page") which has been fixed by checking PageUnevictable before
> adding it to the list of pages to reclaim.
Given the earlier bugfix from Minchan mentioned above, maybe it is better
to do the same and keep the long-existing BUG_ON.
> Removing a long existing BUG_ON begs for a much better explanation.
> shrink_page_list is not a trivial piece of code but I _suspect_ that
> removing it should be ok for mapped pages at least (try_to_unmap) but I
> am not so sure how unmapped unevictable pages are handled from top of my
> head.
As to the unmapped unevictable pages, shrink_page_list has taken that into account.

shrink_page_list
     page_evictable     --> filters unevictable pages and puts them back on their lru.

That should not be a problem after removing the BUG_ON.
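
(A userspace sketch of that callee-side behaviour, with hypothetical names. The real shrink_page_list() is far more involved; this only models the early page_evictable() check that routes unevictable pages back to their lru instead of reclaiming them.)

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in: mlock is what makes a page unevictable here. */
struct page {
	bool mlocked;
};

/* Simplified model of page_evictable(). */
static bool page_evictable(const struct page *p)
{
	return !p->mlocked;
}

/*
 * Returns true if the page proceeds to reclaim, false if the early
 * page_evictable() check sends it back to its lru (the putback path
 * near the top of the real function's main loop).
 */
static bool shrink_one_page(const struct page *p)
{
	if (!page_evictable(p))
		return false;  /* put back, never reclaimed */
	return true;           /* continue with reclaim */
}
```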
> Please also ad Fixes: $sha
Will add in v2.  Thanks
>> Signed-off-by: zhong jiang <zhongjiang@huawei.com>
>> ---
>>  mm/vmscan.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index f7d1301..1c6e959 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1524,7 +1524,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
>>  		unlock_page(page);
>>  keep:
>>  		list_add(&page->lru, &ret_pages);
>> -		VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
>> +		VM_BUG_ON_PAGE(PageLRU(page), page);
>>  	}
>>  
>>  	pgactivate = stat->nr_activate[0] + stat->nr_activate[1];
>> -- 
>> 1.7.12.4
>>




^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-29  9:30   ` zhong jiang
@ 2019-10-29  9:40     ` Michal Hocko
  2019-10-29 10:45       ` zhong jiang
  0 siblings, 1 reply; 23+ messages in thread
From: Michal Hocko @ 2019-10-29  9:40 UTC (permalink / raw)
  To: zhong jiang; +Cc: akpm, hannes, ktkhai, linux-mm, Minchan Kim

On Tue 29-10-19 17:30:57, zhong jiang wrote:
> On 2019/10/29 16:11, Michal Hocko wrote:
> > [Cc Minchan]
[...]
> > Removing a long existing BUG_ON begs for a much better explanation.
> > shrink_page_list is not a trivial piece of code but I _suspect_ that
> > removing it should be ok for mapped pages at least (try_to_unmap) but I
> > am not so sure how unmapped unevictable pages are handled from top of my
> > head.
> As to the unmapped unevictable pages, shrink_page_list has taken that into account.
> 
> shrink_page_list
>      page_evictable     --> filters unevictable pages and puts them back on their lru.

Ohh, it is right there at the top. Missed it. The check was added
by Nick along with the BUG_ON, so it sounds more like a "this
shouldn't happen" BUG_ON. I wouldn't mind removing it with that
justification.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-29  9:40     ` Michal Hocko
@ 2019-10-29 10:45       ` zhong jiang
  2019-10-30 16:52         ` Minchan Kim
  0 siblings, 1 reply; 23+ messages in thread
From: zhong jiang @ 2019-10-29 10:45 UTC (permalink / raw)
  To: Michal Hocko; +Cc: akpm, hannes, ktkhai, linux-mm, Minchan Kim

On 2019/10/29 17:40, Michal Hocko wrote:
> On Tue 29-10-19 17:30:57, zhong jiang wrote:
>> On 2019/10/29 16:11, Michal Hocko wrote:
>>> [Cc Minchan]
> [...]
>>> Removing a long existing BUG_ON begs for a much better explanation.
>>> shrink_page_list is not a trivial piece of code but I _suspect_ that
>>> removing it should be ok for mapped pages at least (try_to_unmap) but I
>>> am not so sure how unmapped unevictable pages are handled from top of my
>>> head.
>> As to the unmapped unevictable pages, shrink_page_list has taken that into account.
>>
>> shrink_page_list
>>      page_evictable     --> filters unevictable pages and puts them back on their lru.
> Ohh, it is right there at the top. Missed it. The check was added
> by Nick along with the BUG_ON, so it sounds more like a "this
> shouldn't happen" BUG_ON. I wouldn't mind removing it with that
> justification.
As you said, Minchan fixed the same kind of bug by checking PageUnevictable (I did not notice that before).
Let's wait for Minchan to see whether he has a better reason. Thanks,

Sincerely,
zhong jiang



^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-29 10:45       ` zhong jiang
@ 2019-10-30 16:52         ` Minchan Kim
  2019-10-30 17:22           ` Johannes Weiner
  2019-10-30 17:45           ` Michal Hocko
  0 siblings, 2 replies; 23+ messages in thread
From: Minchan Kim @ 2019-10-30 16:52 UTC (permalink / raw)
  To: zhong jiang; +Cc: Michal Hocko, akpm, hannes, ktkhai, linux-mm

On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote:
> On 2019/10/29 17:40, Michal Hocko wrote:
> > On Tue 29-10-19 17:30:57, zhong jiang wrote:
> >> On 2019/10/29 16:11, Michal Hocko wrote:
> >>> [Cc Minchan]
> > [...]
> >>> Removing a long existing BUG_ON begs for a much better explanation.
> >>> shrink_page_list is not a trivial piece of code but I _suspect_ that
> >>> removing it should be ok for mapped pages at least (try_to_unmap) but I
> >>> am not so sure how unmapped unevictable pages are handled from top of my
> >>> head.
> >> As to the unmapped unevictable pages, shrink_page_list has taken that into account.
> >>
> >> shrink_page_list
> >>      page_evictable     --> filters unevictable pages and puts them back on their lru.
> > Ohh, it is right there at the top. Missed it. The check was added
> > by Nick along with the BUG_ON, so it sounds more like a "this
> > shouldn't happen" BUG_ON. I wouldn't mind removing it with that
> > justification.
> As you said, Minchan fixed the same kind of bug by checking PageUnevictable (I did not notice that before).
> Let's wait for Minchan to see whether he has a better reason. Thanks,

madvise_pageout can work on a shared page, and one of the vmas among the
processes sharing it could be mlocked, so unevictable LRU pages can be passed
into shrink_page_list. It's pointless to try to reclaim unevictable pages in
the first place, so I want to fix madvise_pageout by introducing an
only_evictable flag into the API, which madvise_pageout would pass as "true".

If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
I want to see a stronger reason why it happens and why the caller couldn't
filter those pages out from the beginning.

diff --git a/mm/gup.c b/mm/gup.c
index 8f236a335ae9..d1ad1c3ec596 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1468,7 +1468,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
 					drain_allow = false;
 				}
 
-				if (!isolate_lru_page(head)) {
+				if (!isolate_lru_page(head, false)) {
 					list_add_tail(&head->lru, &cma_page_list);
 					mod_node_page_state(page_pgdat(head),
 							    NR_ISOLATED_ANON +
diff --git a/mm/internal.h b/mm/internal.h
index 0d5f720c75ab..13319612bef0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -85,7 +85,7 @@ extern unsigned long highest_memmap_pfn;
 /*
  * in mm/vmscan.c:
  */
-extern int isolate_lru_page(struct page *page);
+extern int isolate_lru_page(struct page *page, bool only_evictable);
 extern void putback_lru_page(struct page *page);
 
 /*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 0a1b4b484ac5..095560f7f8ec 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -609,7 +609,7 @@ static int __collapse_huge_page_isolate(struct vm_area_struct *vma,
 		 * Isolate the page to avoid collapsing an hugepage
 		 * currently in use by the VM.
 		 */
-		if (isolate_lru_page(page)) {
+		if (isolate_lru_page(page, false)) {
 			unlock_page(page);
 			result = SCAN_DEL_PAGE_LRU;
 			goto out;
@@ -1642,7 +1642,7 @@ static void collapse_file(struct mm_struct *mm,
 			goto out_unlock;
 		}
 
-		if (isolate_lru_page(page)) {
+		if (isolate_lru_page(page, false)) {
 			result = SCAN_DEL_PAGE_LRU;
 			goto out_unlock;
 		}
diff --git a/mm/madvise.c b/mm/madvise.c
index 2be9f3fdb05e..2639de560a0b 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -363,7 +363,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
 		if (pageout) {
-			if (!isolate_lru_page(page))
+			if (!isolate_lru_page(page, true))
 				list_add(&page->lru, &page_list);
 		} else
 			deactivate_page(page);
@@ -441,7 +441,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
 		if (pageout) {
-			if (!isolate_lru_page(page))
+			if (!isolate_lru_page(page, true))
 				list_add(&page->lru, &page_list);
 		} else
 			deactivate_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 363106578876..6d913215b074 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -5847,7 +5847,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 		target_type = get_mctgt_type_thp(vma, addr, *pmd, &target);
 		if (target_type == MC_TARGET_PAGE) {
 			page = target.page;
-			if (!isolate_lru_page(page)) {
+			if (!isolate_lru_page(page, false)) {
 				if (!mem_cgroup_move_account(page, true,
 							     mc.from, mc.to)) {
 					mc.precharge -= HPAGE_PMD_NR;
@@ -5895,7 +5895,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
 			 */
 			if (PageTransCompound(page))
 				goto put;
-			if (!device && isolate_lru_page(page))
+			if (!device && isolate_lru_page(page, false))
 				goto put;
 			if (!mem_cgroup_move_account(page, false,
 						mc.from, mc.to)) {
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3151c87dff73..ef37c67a7bab 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -567,7 +567,7 @@ static const char * const action_page_types[] = {
  */
 static int delete_from_lru_cache(struct page *p)
 {
-	if (!isolate_lru_page(p)) {
+	if (!isolate_lru_page(p, false)) {
 		/*
 		 * Clear sensible page flags, so that the buddy system won't
 		 * complain when the page is unpoison-and-freed.
@@ -1782,7 +1782,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	 * handles a large number of cases for us.
 	 */
 	if (PageLRU(page))
-		ret = isolate_lru_page(page);
+		ret = isolate_lru_page(page, false);
 	else
 		ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
 	/*
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index df570e5c71cc..8ba483d3d8cd 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1314,7 +1314,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 */
 		if (PageHWPoison(page)) {
 			if (WARN_ON(PageLRU(page)))
-				isolate_lru_page(page);
+				isolate_lru_page(page, false);
 			if (page_mapped(page))
 				try_to_unmap(page, TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS);
 			continue;
@@ -1327,7 +1327,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		 * LRU and non-lru movable pages.
 		 */
 		if (PageLRU(page))
-			ret = isolate_lru_page(page);
+			ret = isolate_lru_page(page, false);
 		else
 			ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
 		if (!ret) { /* Success */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4ae967bcf954..585e5845f071 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -974,7 +974,7 @@ static int migrate_page_add(struct page *page, struct list_head *pagelist,
 	 * Avoid migrating a page that is shared with others.
 	 */
 	if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
-		if (!isolate_lru_page(head)) {
+		if (!isolate_lru_page(head, false)) {
 			list_add_tail(&head->lru, pagelist);
 			mod_node_page_state(page_pgdat(head),
 				NR_ISOLATED_ANON + page_is_file_cache(head),
diff --git a/mm/migrate.c b/mm/migrate.c
index 4fe45d1428c8..710e00317a8f 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1563,7 +1563,7 @@ static int add_page_for_migration(struct mm_struct *mm, unsigned long addr,
 		struct page *head;
 
 		head = compound_head(page);
-		err = isolate_lru_page(head);
+		err = isolate_lru_page(head, false);
 		if (err)
 			goto out_putpage;
 
@@ -1895,7 +1895,7 @@ static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 	if (!migrate_balanced_pgdat(pgdat, compound_nr(page)))
 		return 0;
 
-	if (isolate_lru_page(page))
+	if (isolate_lru_page(page, false))
 		return 0;
 
 	/*
@@ -2450,7 +2450,7 @@ static void migrate_vma_prepare(struct migrate_vma *migrate)
 				allow_drain = false;
 			}
 
-			if (isolate_lru_page(page)) {
+			if (isolate_lru_page(page, false)) {
 				if (remap) {
 					migrate->src[i] &= ~MIGRATE_PFN_MIGRATE;
 					migrate->cpages--;
diff --git a/mm/mlock.c b/mm/mlock.c
index a72c1eeded77..307e340fe2e0 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -70,7 +70,7 @@ void clear_page_mlock(struct page *page)
 	 *
 	 * See __pagevec_lru_add_fn for more explanation.
 	 */
-	if (!isolate_lru_page(page)) {
+	if (!isolate_lru_page(page, false)) {
 		putback_lru_page(page);
 	} else {
 		/*
@@ -97,7 +97,7 @@ void mlock_vma_page(struct page *page)
 		mod_zone_page_state(page_zone(page), NR_MLOCK,
 				    hpage_nr_pages(page));
 		count_vm_event(UNEVICTABLE_PGMLOCKED);
-		if (!isolate_lru_page(page))
+		if (!isolate_lru_page(page, false))
 			putback_lru_page(page);
 	}
 }
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ee4eecc7e1c2..c44fb52c745f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1793,7 +1793,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
  * (2) the lru_lock must not be held.
  * (3) interrupts must be enabled.
  */
-int isolate_lru_page(struct page *page)
+int isolate_lru_page(struct page *page, bool only_evictable)
 {
 	int ret = -EBUSY;
 
@@ -1805,6 +1805,8 @@ int isolate_lru_page(struct page *page)
 		struct lruvec *lruvec;
 
 		spin_lock_irq(&pgdat->lru_lock);
+		if (only_evictable && PageUnevictable(page))
+			goto out;
 		lruvec = mem_cgroup_page_lruvec(page, pgdat);
 		if (PageLRU(page)) {
 			int lru = page_lru(page);
@@ -1813,6 +1815,7 @@ int isolate_lru_page(struct page *page)
 			del_page_from_lru_list(page, lruvec, lru);
 			ret = 0;
 		}
+out:
 		spin_unlock_irq(&pgdat->lru_lock);
 	}
 	return ret;


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-30 16:52         ` Minchan Kim
@ 2019-10-30 17:22           ` Johannes Weiner
  2019-10-30 18:39             ` Minchan Kim
  2019-11-01  8:57             ` zhong jiang
  2019-10-30 17:45           ` Michal Hocko
  1 sibling, 2 replies; 23+ messages in thread
From: Johannes Weiner @ 2019-10-30 17:22 UTC (permalink / raw)
  To: Minchan Kim; +Cc: zhong jiang, Michal Hocko, akpm, ktkhai, linux-mm

On Wed, Oct 30, 2019 at 09:52:39AM -0700, Minchan Kim wrote:
> @@ -1468,7 +1468,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
>  					drain_allow = false;
>  				}
>  
> -				if (!isolate_lru_page(head)) {
> +				if (!isolate_lru_page(head, false)) {
>  					list_add_tail(&head->lru, &cma_page_list);
>  					mod_node_page_state(page_pgdat(head),
>  							    NR_ISOLATED_ANON +

It's not clear what that argument means at the callsite, and every
caller needs to pass it to support one niche usecase. Let's not do
that.

I think there are better options. Personally, I think it's a good idea
to keep the sanity check in shrink_page_list() because the mlock LRU
handling is quite tricky. madvise() is really the odd one out here
because it isolates LRU pages through page tables and then sends them
through the regular reclaim path, so IMO it should be the madvise
proper that handles the exceptional situation.

Why not just this? Maybe with a comment that points out that we're
coming from the page tables instead of a specific LRU list, and so
need to filter out the unevictable lru pages from our end.

diff --git a/mm/madvise.c b/mm/madvise.c
index 99dd06fecfa9..63e130800570 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -363,8 +363,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
 		if (pageout) {
-			if (!isolate_lru_page(page))
-				list_add(&page->lru, &page_list);
+			if (!isolate_lru_page(page)) {
+				if (PageUnevictable(page))
+					putback_lru_page(page);
+				else
+					list_add(&page->lru, &page_list);
+			}
 		} else
 			deactivate_page(page);
 huge_unlock:
@@ -441,8 +445,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		ClearPageReferenced(page);
 		test_and_clear_page_young(page);
 		if (pageout) {
-			if (!isolate_lru_page(page))
-				list_add(&page->lru, &page_list);
+			if (!isolate_lru_page(page)) {
+				if (PageUnevictable(page))
+					putback_lru_page(page);
+				else
+					list_add(&page->lru, &page_list);
+			}
 		} else
 			deactivate_page(page);
 	}


^ permalink raw reply related	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-30 16:52         ` Minchan Kim
  2019-10-30 17:22           ` Johannes Weiner
@ 2019-10-30 17:45           ` Michal Hocko
  2019-10-30 18:42             ` Minchan Kim
  2019-10-30 19:33             ` Johannes Weiner
  1 sibling, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2019-10-30 17:45 UTC (permalink / raw)
  To: Minchan Kim; +Cc: zhong jiang, akpm, hannes, ktkhai, linux-mm

On Wed 30-10-19 09:52:39, Minchan Kim wrote:
> On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote:
> > On 2019/10/29 17:40, Michal Hocko wrote:
> > > On Tue 29-10-19 17:30:57, zhong jiang wrote:
> > >> On 2019/10/29 16:11, Michal Hocko wrote:
> > >>> [Cc Minchan]
> > > [...]
> > >>> Removing a long existing BUG_ON begs for a much better explanation.
> > >>> shrink_page_list is not a trivial piece of code but I _suspect_ that
> > >>> removing it should be ok for mapped pages at least (try_to_unmap) but I
> > >>> am not so sure how unmapped unevictable pages are handled from top of my
> > >>> head.
> > >> As to the unmapped unevictable pages, shrink_page_list has taken that into account.
> > >>
> > >> shrink_page_list
> > >>      page_evictable     --> filters unevictable pages and puts them back on their lru.
> > > Ohh, it is right there at the top. Missed it. The check was added
> > > by Nick along with the BUG_ON, so it sounds more like a "this
> > > shouldn't happen" BUG_ON. I wouldn't mind removing it with that
> > > justification.
> > As you said, Minchan fixed the same kind of bug by checking PageUnevictable (I did not notice that before).
> > Let's wait for Minchan to see whether he has a better reason. Thanks,
> 
> madvise_pageout can work on a shared page, and one of the vmas among the
> processes sharing it could be mlocked, so unevictable LRU pages can be passed
> into shrink_page_list. It's pointless to try to reclaim unevictable pages in
> the first place, so I want to fix madvise_pageout by introducing an
> only_evictable flag into the API, which madvise_pageout would pass as "true".
> 
> If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
> I want to see a stronger reason why it happens and why the caller couldn't
> filter those pages out from the beginning.

Why is this preferable over removing the VM_BUG_ON condition? In other
words, why should we keep the PageUnevictable check there?

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-30 17:22           ` Johannes Weiner
@ 2019-10-30 18:39             ` Minchan Kim
  2019-11-01  8:57             ` zhong jiang
  1 sibling, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2019-10-30 18:39 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: zhong jiang, Michal Hocko, akpm, ktkhai, linux-mm

On Wed, Oct 30, 2019 at 01:22:41PM -0400, Johannes Weiner wrote:
> On Wed, Oct 30, 2019 at 09:52:39AM -0700, Minchan Kim wrote:
> > @@ -1468,7 +1468,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
> >  					drain_allow = false;
> >  				}
> >  
> > -				if (!isolate_lru_page(head)) {
> > +				if (!isolate_lru_page(head, false)) {
> >  					list_add_tail(&head->lru, &cma_page_list);
> >  					mod_node_page_state(page_pgdat(head),
> >  							    NR_ISOLATED_ANON +
> 
> It's not clear what that argument means at the callsite, and every
> caller needs to pass it to support one niche usecase. Let's not do
> that.
> 
> I think there are better options. Personally, I think it's a good idea
> to keep the sanity check in shrink_page_list() because the mlock LRU
> handling is quite tricky. madvise() is really the odd one out here
> because it isolates LRU pages through page tables and then sends them
> through the regular reclaim path, so IMO it should be the madvise
> proper that handles the exceptional situation.
> 
> Why not just this? Maybe with a comment that points out that we're
> coming from the page tables instead of a specific LRU list, and so
> need to filter out the unevictable lru pages from our end.

Totally agree.

> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 99dd06fecfa9..63e130800570 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -363,8 +363,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
>  		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
>  		} else
>  			deactivate_page(page);
>  huge_unlock:
> @@ -441,8 +445,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
>  		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
>  		} else
>  			deactivate_page(page);
>  	}
> 



* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-30 17:45           ` Michal Hocko
@ 2019-10-30 18:42             ` Minchan Kim
  2019-10-30 19:33             ` Johannes Weiner
  1 sibling, 0 replies; 23+ messages in thread
From: Minchan Kim @ 2019-10-30 18:42 UTC (permalink / raw)
  To: Michal Hocko; +Cc: zhong jiang, akpm, hannes, ktkhai, linux-mm

On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote:
> On Wed 30-10-19 09:52:39, Minchan Kim wrote:
> > On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote:
> > > On 2019/10/29 17:40, Michal Hocko wrote:
> > > > On Tue 29-10-19 17:30:57, zhong jiang wrote:
> > > >> On 2019/10/29 16:11, Michal Hocko wrote:
> > > >>> [Cc Minchan]
> > > > [...]
> > > >>> Removing a long existing BUG_ON begs for a much better explanation.
> > > >>> shrink_page_list is not a trivial piece of code but I _suspect_ that
> > > >>> removing it should be ok for mapped pages at least (try_to_unmap) but I
> > > >>> am not so sure how unmapped unevictable pages are handled from top of my
> > > >>> head.
> > > >> As to the unmapped unevictable pages.  shrink_page_list has taken that into account.
> > > >>
> > > >> shinkr_page_list
> > > >>      page_evictable     --> will filter the unevictable pages to putback its lru.
> > > > Ohh, it is right there at the top. Missed it. The check has been added
> > > > by Nick along with the BUG_ON. So it is sounds more like a "this
> > > > shouldn't happen" bugon. I wouldn't mind to remove it with that
> > > > justification.
> > > As you has said,   Minchan fix the same kind of bug by checking PageUnevictable (I did not notice before)
> > > Wait for Minchan to see whether  he has better reason. thanks,
> > 
> > madvise_pageout could work with a shared page and one of the vmas among processes
> > could do mlock so it could pass Unevictable LRU pages into shrink_page_list.
> > It's pointless to try reclaim unevictable pages from the beginning so I want to fix
> > madvise_pageout via introducing only_evictable flag into the API so that
> > madvise_pageout uses it as "true".
> > 
> > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
> > I want to see more strong reason why it happens and why caller couldn't
> > filter them out from the beginning.
> 
> Why is this preferable over removing the VM_BUG_ON condition? In other
> words why should we keep PageUnevictable check there?

I don't think it's reasonable to pass unevictable LRU pages into shrink_page_list,
so I wanted to know what race is at play here, and what we would be missing if we
removed the BUG_ON, since mlock is heavily complicated.

> 
> -- 
> Michal Hocko
> SUSE Labs



* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-30 17:45           ` Michal Hocko
  2019-10-30 18:42             ` Minchan Kim
@ 2019-10-30 19:33             ` Johannes Weiner
  2019-10-31  9:16               ` Michal Hocko
  2019-10-31  9:46               ` zhong jiang
  1 sibling, 2 replies; 23+ messages in thread
From: Johannes Weiner @ 2019-10-30 19:33 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Minchan Kim, zhong jiang, akpm, ktkhai, linux-mm

On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote:
> On Wed 30-10-19 09:52:39, Minchan Kim wrote:
> > On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote:
> > > On 2019/10/29 17:40, Michal Hocko wrote:
> > > > On Tue 29-10-19 17:30:57, zhong jiang wrote:
> > > >> On 2019/10/29 16:11, Michal Hocko wrote:
> > > >>> [Cc Minchan]
> > > > [...]
> > > >>> Removing a long existing BUG_ON begs for a much better explanation.
> > > >>> shrink_page_list is not a trivial piece of code but I _suspect_ that
> > > >>> removing it should be ok for mapped pages at least (try_to_unmap) but I
> > > >>> am not so sure how unmapped unevictable pages are handled from top of my
> > > >>> head.
> > > >> As to the unmapped unevictable pages.  shrink_page_list has taken that into account.
> > > >>
> > > >> shinkr_page_list
> > > >>      page_evictable     --> will filter the unevictable pages to putback its lru.
> > > > Ohh, it is right there at the top. Missed it. The check has been added
> > > > by Nick along with the BUG_ON. So it is sounds more like a "this
> > > > shouldn't happen" bugon. I wouldn't mind to remove it with that
> > > > justification.
> > > As you has said,   Minchan fix the same kind of bug by checking PageUnevictable (I did not notice before)
> > > Wait for Minchan to see whether  he has better reason. thanks,
> > 
> > madvise_pageout could work with a shared page and one of the vmas among processes
> > could do mlock so it could pass Unevictable LRU pages into shrink_page_list.
> > It's pointless to try reclaim unevictable pages from the beginning so I want to fix
> > madvise_pageout via introducing only_evictable flag into the API so that
> > madvise_pageout uses it as "true".
> > 
> > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
> > I want to see more strong reason why it happens and why caller couldn't
> > filter them out from the beginning.
> 
> Why is this preferable over removing the VM_BUG_ON condition? In other
> words why should we keep PageUnevictable check there?

The mlock LRU shuffling is a bit tricky and can race with page reclaim
or others isolating the page from the LRU list. If another isolator
wins, it has to move the page during putback on behalf of mlock.

See the implementation and comments in __pagevec_lru_add_fn().

That's why page reclaim can see !page_evictable(), but it must not see
pages that have the PageUnevictable lru bit already set. Because that
would mean the isolation/putback machinery messed up somewhere and the
page LRU state is corrupt.

As that machinery is non-trivial, it's useful to have that sanity
check in page reclaim.



* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-30 19:33             ` Johannes Weiner
@ 2019-10-31  9:16               ` Michal Hocko
  2019-10-31 14:48                 ` Minchan Kim
  2019-11-01 12:56                 ` zhong jiang
  2019-10-31  9:46               ` zhong jiang
  1 sibling, 2 replies; 23+ messages in thread
From: Michal Hocko @ 2019-10-31  9:16 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Minchan Kim, zhong jiang, akpm, ktkhai, linux-mm

On Wed 30-10-19 15:33:07, Johannes Weiner wrote:
> On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote:
> > On Wed 30-10-19 09:52:39, Minchan Kim wrote:
[...]
> > > madvise_pageout could work with a shared page and one of the vmas among processes
> > > could do mlock so it could pass Unevictable LRU pages into shrink_page_list.
> > > It's pointless to try reclaim unevictable pages from the beginning so I want to fix
> > > madvise_pageout via introducing only_evictable flag into the API so that
> > > madvise_pageout uses it as "true".
> > > 
> > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
> > > I want to see more strong reason why it happens and why caller couldn't
> > > filter them out from the beginning.
> > 
> > Why is this preferable over removing the VM_BUG_ON condition? In other
> > words why should we keep PageUnevictable check there?
> 
> The mlock LRU shuffling is a bit tricky and can race with page reclaim
> or others isolating the page from the LRU list. If another isolator
> wins, it has to move the page during putback on behalf of mlock.
> 
> See the implementation and comments in __pagevec_lru_add_fn().
> 
> That's why page reclaim can see !page_evictable(), but it must not see
> pages that have the PageUnevictable lru bit already set. Because that
> would mean the isolation/putback machinery messed up somewhere and the
> page LRU state is corrupt.
> 
> As that machinery is non-trivial, it's useful to have that sanity
> check in page reclaim.

Thanks for the clarification! This sounds reasonable (as much as the
mlock juggling does) to me. This is probably worth a comment right above
the bug_on.

I have to confess that I am still not clear on all the details here,
though. E.g. migrate_misplaced_transhuge_page sets the flag without
lru_lock and relies only on page lock IIUC and the bug on is done right
after the lock is released. Maybe I am just confused or maybe the race
window is too small to matter but isn't this race possible at least
theoretically?

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-30 19:33             ` Johannes Weiner
  2019-10-31  9:16               ` Michal Hocko
@ 2019-10-31  9:46               ` zhong jiang
  1 sibling, 0 replies; 23+ messages in thread
From: zhong jiang @ 2019-10-31  9:46 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Michal Hocko, Minchan Kim, akpm, ktkhai, linux-mm

On 2019/10/31 3:33, Johannes Weiner wrote:
> On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote:
>> On Wed 30-10-19 09:52:39, Minchan Kim wrote:
>>> On Tue, Oct 29, 2019 at 06:45:12PM +0800, zhong jiang wrote:
>>>> On 2019/10/29 17:40, Michal Hocko wrote:
>>>>> On Tue 29-10-19 17:30:57, zhong jiang wrote:
>>>>>> On 2019/10/29 16:11, Michal Hocko wrote:
>>>>>>> [Cc Minchan]
>>>>> [...]
>>>>>>> Removing a long existing BUG_ON begs for a much better explanation.
>>>>>>> shrink_page_list is not a trivial piece of code but I _suspect_ that
>>>>>>> removing it should be ok for mapped pages at least (try_to_unmap) but I
>>>>>>> am not so sure how unmapped unevictable pages are handled from top of my
>>>>>>> head.
>>>>>> As to the unmapped unevictable pages.  shrink_page_list has taken that into account.
>>>>>>
>>>>>> shinkr_page_list
>>>>>>      page_evictable     --> will filter the unevictable pages to putback its lru.
>>>>> Ohh, it is right there at the top. Missed it. The check has been added
>>>>> by Nick along with the BUG_ON. So it is sounds more like a "this
>>>>> shouldn't happen" bugon. I wouldn't mind to remove it with that
>>>>> justification.
>>>> As you has said,   Minchan fix the same kind of bug by checking PageUnevictable (I did not notice before)
>>>> Wait for Minchan to see whether  he has better reason. thanks,
>>> madvise_pageout could work with a shared page and one of the vmas among processes
>>> could do mlock so it could pass Unevictable LRU pages into shrink_page_list.
>>> It's pointless to try reclaim unevictable pages from the beginning so I want to fix
>>> madvise_pageout via introducing only_evictable flag into the API so that
>>> madvise_pageout uses it as "true".
>>>
>>> If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
>>> I want to see more strong reason why it happens and why caller couldn't
>>> filter them out from the beginning.
>> Why is this preferable over removing the VM_BUG_ON condition? In other
>> words why should we keep PageUnevictable check there?
> The mlock LRU shuffling is a bit tricky and can race with page reclaim
> or others isolating the page from the LRU list. If another isolator
> wins, it has to move the page during putback on behalf of mlock.
>
> See the implementation and comments in __pagevec_lru_add_fn().
I have read the comments in __pagevec_lru_add_fn, but I am still somewhat confused.
The reordering described there can leave an evictable page stranded on an unevictable
LRU without PageMlocked set.

If I understand it correctly, vmscan can see !page_evictable() on a page whose
PageMlocked was set while it sat on an evictable list. Is that the race window?

Thanks,
zhong jiang
> That's why page reclaim can see !page_evictable(), but it must not see
> pages that have the PageUnevictable lru bit already set. Because that
> would mean the isolation/putback machinery messed up somewhere and the
> page LRU state is corrupt.
>
> As that machinery is non-trivial, it's useful to have that sanity
> check in page reclaim.
>
> .
>





* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-31  9:16               ` Michal Hocko
@ 2019-10-31 14:48                 ` Minchan Kim
  2019-10-31 17:17                   ` Michal Hocko
  2019-11-01 12:56                 ` zhong jiang
  1 sibling, 1 reply; 23+ messages in thread
From: Minchan Kim @ 2019-10-31 14:48 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, zhong jiang, akpm, ktkhai, linux-mm

On Thu, Oct 31, 2019 at 10:16:01AM +0100, Michal Hocko wrote:
> On Wed 30-10-19 15:33:07, Johannes Weiner wrote:
> > On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote:
> > > On Wed 30-10-19 09:52:39, Minchan Kim wrote:
> [...]
> > > > madvise_pageout could work with a shared page and one of the vmas among processes
> > > > could do mlock so it could pass Unevictable LRU pages into shrink_page_list.
> > > > It's pointless to try reclaim unevictable pages from the beginning so I want to fix
> > > > madvise_pageout via introducing only_evictable flag into the API so that
> > > > madvise_pageout uses it as "true".
> > > > 
> > > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
> > > > I want to see more strong reason why it happens and why caller couldn't
> > > > filter them out from the beginning.
> > > 
> > > Why is this preferable over removing the VM_BUG_ON condition? In other
> > > words why should we keep PageUnevictable check there?
> > 
> > The mlock LRU shuffling is a bit tricky and can race with page reclaim
> > or others isolating the page from the LRU list. If another isolator
> > wins, it has to move the page during putback on behalf of mlock.
> > 
> > See the implementation and comments in __pagevec_lru_add_fn().
> > 
> > That's why page reclaim can see !page_evictable(), but it must not see
> > pages that have the PageUnevictable lru bit already set. Because that
> > would mean the isolation/putback machinery messed up somewhere and the
> > page LRU state is corrupt.
> > 
> > As that machinery is non-trivial, it's useful to have that sanity
> > check in page reclaim.
> 
> Thanks for the clarification! This sounds reasonable (as much as the
> mlock juggling does) to me. This is probably worth a comment right above
> the bug_on.
> 
> I have to confess that I am still not clear on all the details here,
> though. E.g. migrate_misplaced_transhuge_page sets the flag without
> lru_lock and relies only on page lock IIUC and the bug on is done right
> after the lock is released. Maybe I am just confused or maybe the race
> window is too small to matter but isn't this race possible at least
> theoretically?

IIUC, reclaim couldn't see the page on the LRU list because it had already been
isolated by numamigrate_isolate_page.

Thanks.



* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-31 14:48                 ` Minchan Kim
@ 2019-10-31 17:17                   ` Michal Hocko
  0 siblings, 0 replies; 23+ messages in thread
From: Michal Hocko @ 2019-10-31 17:17 UTC (permalink / raw)
  To: Minchan Kim; +Cc: Johannes Weiner, zhong jiang, akpm, ktkhai, linux-mm

On Thu 31-10-19 07:48:11, Minchan Kim wrote:
> On Thu, Oct 31, 2019 at 10:16:01AM +0100, Michal Hocko wrote:
> > On Wed 30-10-19 15:33:07, Johannes Weiner wrote:
> > > On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote:
> > > > On Wed 30-10-19 09:52:39, Minchan Kim wrote:
> > [...]
> > > > > madvise_pageout could work with a shared page and one of the vmas among processes
> > > > > could do mlock so it could pass Unevictable LRU pages into shrink_page_list.
> > > > > It's pointless to try reclaim unevictable pages from the beginning so I want to fix
> > > > > madvise_pageout via introducing only_evictable flag into the API so that
> > > > > madvise_pageout uses it as "true".
> > > > > 
> > > > > If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
> > > > > I want to see more strong reason why it happens and why caller couldn't
> > > > > filter them out from the beginning.
> > > > 
> > > > Why is this preferable over removing the VM_BUG_ON condition? In other
> > > > words why should we keep PageUnevictable check there?
> > > 
> > > The mlock LRU shuffling is a bit tricky and can race with page reclaim
> > > or others isolating the page from the LRU list. If another isolator
> > > wins, it has to move the page during putback on behalf of mlock.
> > > 
> > > See the implementation and comments in __pagevec_lru_add_fn().
> > > 
> > > That's why page reclaim can see !page_evictable(), but it must not see
> > > pages that have the PageUnevictable lru bit already set. Because that
> > > would mean the isolation/putback machinery messed up somewhere and the
> > > page LRU state is corrupt.
> > > 
> > > As that machinery is non-trivial, it's useful to have that sanity
> > > check in page reclaim.
> > 
> > Thanks for the clarification! This sounds reasonable (as much as the
> > mlock juggling does) to me. This is probably worth a comment right above
> > the bug_on.
> > 
> > I have to confess that I am still not clear on all the details here,
> > though. E.g. migrate_misplaced_transhuge_page sets the flag without
> > lru_lock and relies only on page lock IIUC and the bug on is done right
> > after the lock is released. Maybe I am just confused or maybe the race
> > window is too small to matter but isn't this race possible at least
> > theoretically?
> 
> IIUC, reclaim couldn't see the page from LRU list because it was isolated by
> numamigrate_isolate_page.

Right you are. I have missed numamigrate_isolate_page. Thanks!

-- 
Michal Hocko
SUSE Labs



* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-30 17:22           ` Johannes Weiner
  2019-10-30 18:39             ` Minchan Kim
@ 2019-11-01  8:57             ` zhong jiang
  1 sibling, 0 replies; 23+ messages in thread
From: zhong jiang @ 2019-11-01  8:57 UTC (permalink / raw)
  To: Johannes Weiner; +Cc: Minchan Kim, Michal Hocko, akpm, ktkhai, linux-mm

On 2019/10/31 1:22, Johannes Weiner wrote:
> On Wed, Oct 30, 2019 at 09:52:39AM -0700, Minchan Kim wrote:
>> @@ -1468,7 +1468,7 @@ static long check_and_migrate_cma_pages(struct task_struct *tsk,
>>  					drain_allow = false;
>>  				}
>>  
>> -				if (!isolate_lru_page(head)) {
>> +				if (!isolate_lru_page(head, false)) {
>>  					list_add_tail(&head->lru, &cma_page_list);
>>  					mod_node_page_state(page_pgdat(head),
>>  							    NR_ISOLATED_ANON +
> It's not clear what that argument means at the callsite, and every
> caller needs to pass it to support one niche usecase. Let's not do
> that.
>
> I think there are better options. Personally, I think it's a good idea
> to keep the sanity check in shrink_page_list() because the mlock LRU
> handling is quite tricky. madvise() is really the odd one out here
> because it isolates LRU pages through page tables and then sends them
> through the regular reclaim path, so IMO it should be the madvise
> proper that handles the exceptional situation.
>
> Why not just this? Maybe with a comment that points out that we're
> coming from the page tables instead of a specific LRU list, and so
> need to filter out the unevictable lru pages from our end.
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 99dd06fecfa9..63e130800570 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -363,8 +363,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
>  		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
>  		} else
>  			deactivate_page(page);
>  huge_unlock:
> @@ -441,8 +445,12 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>  		ClearPageReferenced(page);
>  		test_and_clear_page_young(page);
>  		if (pageout) {
> -			if (!isolate_lru_page(page))
> -				list_add(&page->lru, &page_list);
> +			if (!isolate_lru_page(page)) {
> +				if (PageUnevictable(page))
> +					putback_lru_page(page);
> +				else
> +					list_add(&page->lru, &page_list);
> +			}
>  		} else
>  			deactivate_page(page);
>  	}
>
> .
>
It seems that filtering out the unevictable pages is the correct fix, even though
I am still not entirely clear on how the small race window results in vmscan
seeing a PageMlocked page.

I will repost the patch in v2 as you have proposed.  Thanks

Sincerely,
zhong jiang




* Re: [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout
  2019-10-31  9:16               ` Michal Hocko
  2019-10-31 14:48                 ` Minchan Kim
@ 2019-11-01 12:56                 ` zhong jiang
  1 sibling, 0 replies; 23+ messages in thread
From: zhong jiang @ 2019-11-01 12:56 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Johannes Weiner, Minchan Kim, akpm, ktkhai, linux-mm

On 2019/10/31 17:16, Michal Hocko wrote:
> On Wed 30-10-19 15:33:07, Johannes Weiner wrote:
>> On Wed, Oct 30, 2019 at 06:45:33PM +0100, Michal Hocko wrote:
>>> On Wed 30-10-19 09:52:39, Minchan Kim wrote:
> [...]
>>>> madvise_pageout could work with a shared page and one of the vmas among processes
>>>> could do mlock so it could pass Unevictable LRU pages into shrink_page_list.
>>>> It's pointless to try reclaim unevictable pages from the beginning so I want to fix
>>>> madvise_pageout via introducing only_evictable flag into the API so that
>>>> madvise_pageout uses it as "true".
>>>>
>>>> If we want to remove the PageUnevictable VM_BUG_ON_PAGE in shrink_page_list,
>>>> I want to see more strong reason why it happens and why caller couldn't
>>>> filter them out from the beginning.
>>> Why is this preferable over removing the VM_BUG_ON condition? In other
>>> words why should we keep PageUnevictable check there?
>> The mlock LRU shuffling is a bit tricky and can race with page reclaim
>> or others isolating the page from the LRU list. If another isolator
>> wins, it has to move the page during putback on behalf of mlock.
>>
>> See the implementation and comments in __pagevec_lru_add_fn().
>>
>> That's why page reclaim can see !page_evictable(), but it must not see
>> pages that have the PageUnevictable lru bit already set. Because that
>> would mean the isolation/putback machinery messed up somewhere and the
>> page LRU state is corrupt.
>>
>> As that machinery is non-trivial, it's useful to have that sanity
>> check in page reclaim.
> Thanks for the clarification! This sounds reasonable (as much as the
> mlock juggling does) to me. This is probably worth a comment right above
> the bug_on.
Could you write a comment above the VM_BUG_ON()?  :-)

Thanks,
zhong jiang
> I have to confess that I am still not clear on all the details here,
> though. E.g. migrate_misplaced_transhuge_page sets the flag without
> lru_lock and relies only on page lock IIUC and the bug on is done right
> after the lock is released. Maybe I am just confused or maybe the race
> window is too small to matter but isn't this race possible at least
> theoretically?
>





end of thread, other threads:[~2019-11-01 12:56 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-28 15:08 [PATCH] mm: fix unevictable page reclaim when calling madvise_pageout zhong jiang
2019-10-28 15:27 ` David Hildenbrand
2019-10-28 15:45   ` zhong jiang
2019-10-28 16:07     ` David Hildenbrand
2019-10-28 16:15       ` zhong jiang
2019-10-28 16:15       ` David Hildenbrand
2019-10-29  2:29         ` zhong jiang
2019-10-29  8:11 ` Michal Hocko
2019-10-29  9:30   ` zhong jiang
2019-10-29  9:40     ` Michal Hocko
2019-10-29 10:45       ` zhong jiang
2019-10-30 16:52         ` Minchan Kim
2019-10-30 17:22           ` Johannes Weiner
2019-10-30 18:39             ` Minchan Kim
2019-11-01  8:57             ` zhong jiang
2019-10-30 17:45           ` Michal Hocko
2019-10-30 18:42             ` Minchan Kim
2019-10-30 19:33             ` Johannes Weiner
2019-10-31  9:16               ` Michal Hocko
2019-10-31 14:48                 ` Minchan Kim
2019-10-31 17:17                   ` Michal Hocko
2019-11-01 12:56                 ` zhong jiang
2019-10-31  9:46               ` zhong jiang
