* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy [not found] <20230909034207.5816-1-hdanton@sina.com> @ 2023-09-09 4:43 ` syzbot 0 siblings, 0 replies; 27+ messages in thread From: syzbot @ 2023-09-09 4:43 UTC (permalink / raw) To: hdanton, linux-kernel, syzkaller-bugs Hello, syzbot has tested the proposed patch and the reproducer did not trigger any issue: Reported-and-tested-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com Tested on: commit: 32bf43e4 Merge tag 'thermal-6.6-rc1-3' of git://git.ke.. git tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master console output: https://syzkaller.appspot.com/x/log.txt?x=15137480680000 kernel config: https://syzkaller.appspot.com/x/.config?x=e82a7781f9208c0d dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 patch: https://syzkaller.appspot.com/x/patch.diff?x=12eaf0e8680000 Note: testing is done by a robot and is best-effort only. ^ permalink raw reply [flat|nested] 27+ messages in thread
* [syzbot] [mm?] kernel BUG in vma_replace_policy @ 2023-09-06 1:03 syzbot 2023-09-08 18:04 ` syzbot 2023-09-12 5:30 ` Matthew Wilcox 0 siblings, 2 replies; 27+ messages in thread From: syzbot @ 2023-09-06 1:03 UTC (permalink / raw) To: akpm, linux-kernel, linux-mm, syzkaller-bugs Hello, syzbot found the following issue on: HEAD commit: a47fc304d2b6 Add linux-next specific files for 20230831 git tree: linux-next console+strace: https://syzkaller.appspot.com/x/log.txt?x=16502ddba80000 kernel config: https://syzkaller.appspot.com/x/.config?x=6ecd2a74f20953b9 dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=120e7d70680000 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1523f9c0680000 Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/b2e8f4217527/disk-a47fc304.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/ed6cdcc09339/vmlinux-a47fc304.xz kernel image: https://storage.googleapis.com/syzbot-assets/bd9b2475bf5a/bzImage-a47fc304.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com vma ffff888077381a00 start 0000000020c2a000 end 0000000021000000 mm ffff8880258a8980 prot 25 anon_vma 0000000000000000 vm_ops 0000000000000000 pgoff 20c2a file 0000000000000000 private_data 0000000000000000 flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty) ------------[ cut here ]------------ kernel BUG at include/linux/mm.h:733! 
invalid opcode: 0000 [#1] PREEMPT SMP KASAN CPU: 1 PID: 5040 Comm: syz-executor418 Not tainted 6.5.0-next-20230831-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 07/26/2023 RIP: 0010:vma_assert_write_locked include/linux/mm.h:733 [inline] RIP: 0010:vma_assert_write_locked include/linux/mm.h:729 [inline] RIP: 0010:vma_replace_policy+0x406/0x4e0 mm/mempolicy.c:783 Code: ff 48 89 ef e8 db 78 ff ff e9 83 fe ff ff e8 d1 7c ad ff 4c 89 e7 e8 a9 86 eb ff 0f 0b e8 c2 7c ad ff 48 89 df e8 fa 83 eb ff <0f> 0b e8 b3 7c ad ff 41 89 ec e9 58 fe ff ff 48 c7 c7 d0 55 ce 8e RSP: 0018:ffffc9000395fc58 EFLAGS: 00010282 RAX: 000000000000011b RBX: ffff888077381a00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff816b9a92 RDI: 0000000000000005 RBP: ffff888014a7e030 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000080000000 R11: 0000000000000001 R12: 0000000000000015 R13: 0000000000000016 R14: 0000000000000001 R15: 0000000021000000 FS: 0000555556684380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd277b0020 CR3: 00000000773e1000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> mbind_range+0x37c/0x530 mm/mempolicy.c:855 do_mbind+0x583/0xa00 mm/mempolicy.c:1345 kernel_mbind+0x1d4/0x1f0 mm/mempolicy.c:1502 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0033:0x7fb959069369 Code: 48 83 c4 28 c3 e8 37 17 00 00 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007ffd276bed98 EFLAGS: 00000246 ORIG_RAX: 00000000000000ed RAX: ffffffffffffffda RBX: 00007ffd276bef78 RCX: 00007fb959069369 RDX: 0000000000000004 RSI: 
0000000000c00000 RDI: 0000000020400000 RBP: 00007fb9590dc610 R08: 0000000000000000 R09: 0000000000000003 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000001 R13: 00007ffd276bef68 R14: 0000000000000001 R15: 0000000000000001 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:vma_assert_write_locked include/linux/mm.h:733 [inline] RIP: 0010:vma_assert_write_locked include/linux/mm.h:729 [inline] RIP: 0010:vma_replace_policy+0x406/0x4e0 mm/mempolicy.c:783 Code: ff 48 89 ef e8 db 78 ff ff e9 83 fe ff ff e8 d1 7c ad ff 4c 89 e7 e8 a9 86 eb ff 0f 0b e8 c2 7c ad ff 48 89 df e8 fa 83 eb ff <0f> 0b e8 b3 7c ad ff 41 89 ec e9 58 fe ff ff 48 c7 c7 d0 55 ce 8e RSP: 0018:ffffc9000395fc58 EFLAGS: 00010282 RAX: 000000000000011b RBX: ffff888077381a00 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffffff816b9a92 RDI: 0000000000000005 RBP: ffff888014a7e030 R08: 0000000000000005 R09: 0000000000000000 R10: 0000000080000000 R11: 0000000000000001 R12: 0000000000000015 R13: 0000000000000016 R14: 0000000000000001 R15: 0000000021000000 FS: 0000555556684380(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffd277b0020 CR3: 00000000773e1000 CR4: 00000000003506e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller@googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot. If the bug is already fixed, let syzbot know by replying with: #syz fix: exact-commit-title If you want syzbot to run the reproducer, reply with: #syz test: git://repo/address.git branch-or-commit-hash If you attach or paste a git patch, syzbot will apply it before testing. 
If you want to overwrite bug's subsystems, reply with: #syz set subsystems: new-subsystem (See the list of subsystem names on the web dashboard) If the bug is a duplicate of another bug, reply with: #syz dup: exact-subject-of-another-report If you want to undo deduplication, reply with: #syz undup ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-06 1:03 syzbot @ 2023-09-08 18:04 ` syzbot 2023-09-12 5:30 ` Matthew Wilcox 1 sibling, 0 replies; 27+ messages in thread From: syzbot @ 2023-09-08 18:04 UTC (permalink / raw) To: 42.hyeyoo, Liam.Howlett, agordeev, akpm, alexghiti, aou, borntraeger, cgroups, christophe.leroy, damon, david, eadavis, frankja, gerald.schaefer, gor, hannes, hca, imbrenda, jeeheng.sia, jglisse, kvm, leyfoon.tan, linmiaohe, linux-fsdevel, linux-kernel, linux-mm, linux-riscv, linux-s390, linuxppc-dev, mason.huo, mhocko, mpe, muchun.song, naoya.horiguchi, npiggin, palmer, paul.walmsley, roman.gushchin, sebastian.reichel, shakeelb, sj, surenb, svens, syzkaller-bugs, willy syzbot has bisected this issue to: commit 49b0638502da097c15d46cd4e871dbaa022caf7c Author: Suren Baghdasaryan <surenb@google.com> Date: Fri Aug 4 15:27:19 2023 +0000 mm: enable page walking API to lock vmas during the walk bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=11fd2348680000 start commit: 7733171926cc Merge tag 'mailbox-v6.6' of git://git.linaro... git tree: upstream final oops: https://syzkaller.appspot.com/x/report.txt?x=13fd2348680000 console output: https://syzkaller.appspot.com/x/log.txt?x=15fd2348680000 kernel config: https://syzkaller.appspot.com/x/.config?x=b273cdfbc13e9a4b dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=15d4ecd0680000 C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1055c284680000 Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com Fixes: 49b0638502da ("mm: enable page walking API to lock vmas during the walk") For information about bisection process see: https://goo.gl/tpsmEJ#bisection ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-06 1:03 syzbot 2023-09-08 18:04 ` syzbot @ 2023-09-12 5:30 ` Matthew Wilcox 2023-09-12 6:09 ` syzbot 2023-09-12 14:55 ` Matthew Wilcox 1 sibling, 2 replies; 27+ messages in thread From: Matthew Wilcox @ 2023-09-12 5:30 UTC (permalink / raw) To: syzbot; +Cc: akpm, linux-kernel, linux-mm, syzkaller-bugs On Tue, Sep 05, 2023 at 06:03:49PM -0700, syzbot wrote: > Hello, > > syzbot found the following issue on: > > HEAD commit: a47fc304d2b6 Add linux-next specific files for 20230831 > git tree: linux-next > console+strace: https://syzkaller.appspot.com/x/log.txt?x=16502ddba80000 > kernel config: https://syzkaller.appspot.com/x/.config?x=6ecd2a74f20953b9 > dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=120e7d70680000 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1523f9c0680000 > > Downloadable assets: > disk image: https://storage.googleapis.com/syzbot-assets/b2e8f4217527/disk-a47fc304.raw.xz > vmlinux: https://storage.googleapis.com/syzbot-assets/ed6cdcc09339/vmlinux-a47fc304.xz > kernel image: https://storage.googleapis.com/syzbot-assets/bd9b2475bf5a/bzImage-a47fc304.xz > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com #syz test diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 42b5567e3773..90ad5fe60824 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -1342,6 +1342,7 @@ static long do_mbind(unsigned long start, unsigned long len, vma_iter_init(&vmi, mm, start); prev = vma_prev(&vmi); for_each_vma_range(vmi, vma, end) { + vma_start_write(vma); err = mbind_range(&vmi, vma, &prev, start, end, new); if (err) break; ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-12 5:30 ` Matthew Wilcox @ 2023-09-12 6:09 ` syzbot 2023-09-12 14:55 ` Matthew Wilcox 1 sibling, 0 replies; 27+ messages in thread From: syzbot @ 2023-09-12 6:09 UTC (permalink / raw) To: akpm, linux-kernel, linux-mm, syzkaller-bugs, willy Hello, syzbot has tested the proposed patch and the reproducer did not trigger any issue: Reported-and-tested-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com Tested on: commit: 0bb80ecc Linux 6.6-rc1 git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=1574bfc8680000 kernel config: https://syzkaller.appspot.com/x/.config?x=ba194e5cfd385dbf dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 patch: https://syzkaller.appspot.com/x/patch.diff?x=11ad2dfc680000 Note: testing is done by a robot and is best-effort only. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-12 5:30 ` Matthew Wilcox 2023-09-12 6:09 ` syzbot @ 2023-09-12 14:55 ` Matthew Wilcox 2023-09-12 15:03 ` Suren Baghdasaryan 1 sibling, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-09-12 14:55 UTC (permalink / raw) To: syzbot; +Cc: akpm, linux-kernel, linux-mm, syzkaller-bugs, Suren Baghdasaryan On Tue, Sep 12, 2023 at 06:30:46AM +0100, Matthew Wilcox wrote: > On Tue, Sep 05, 2023 at 06:03:49PM -0700, syzbot wrote: > > Hello, > > > > syzbot found the following issue on: > > > > HEAD commit: a47fc304d2b6 Add linux-next specific files for 20230831 > > git tree: linux-next > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=16502ddba80000 > > kernel config: https://syzkaller.appspot.com/x/.config?x=6ecd2a74f20953b9 > > dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 > > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=120e7d70680000 > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1523f9c0680000 > > > > Downloadable assets: > > disk image: https://storage.googleapis.com/syzbot-assets/b2e8f4217527/disk-a47fc304.raw.xz > > vmlinux: https://storage.googleapis.com/syzbot-assets/ed6cdcc09339/vmlinux-a47fc304.xz > > kernel image: https://storage.googleapis.com/syzbot-assets/bd9b2475bf5a/bzImage-a47fc304.xz > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com > > #syz test > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 42b5567e3773..90ad5fe60824 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -1342,6 +1342,7 @@ static long do_mbind(unsigned long start, unsigned long len, > vma_iter_init(&vmi, mm, start); > prev = vma_prev(&vmi); > for_each_vma_range(vmi, vma, end) { > + vma_start_write(vma); > err = mbind_range(&vmi, vma, &prev, start, end, 
new);
> 		if (err)
> 			break;

Suren, can you take a look at this?  The VMA should be locked by the
call to queue_pages_range(), but by the time we get to here, the VMA
isn't locked.  I don't see anywhere that we cycle the mmap_lock (which
would unlock the VMA), but I could have missed something.  The two
VMA walks should walk over the same set of VMAs.  Certainly the VMA
being dumped should have been locked by the pagewalk:

vma ffff888077381a00 start 0000000020c2a000 end 0000000021000000 mm ffff8880258a8980
prot 25 anon_vma 0000000000000000 vm_ops 0000000000000000
pgoff 20c2a file 0000000000000000 private_data 0000000000000000
flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty)

syscall(__NR_mbind, /*addr=*/0x20400000ul, /*len=*/0xc00000ul, /*mode=*/4ul,
        /*nodemask=*/0ul, /*maxnode=*/0ul, /*flags=*/3ul);

20400000 + c00000 should overlap 20c2a000-21000000

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-12 14:55 ` Matthew Wilcox @ 2023-09-12 15:03 ` Suren Baghdasaryan 2023-09-12 16:00 ` Suren Baghdasaryan 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-12 15:03 UTC (permalink / raw) To: Matthew Wilcox; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Tue, Sep 12, 2023 at 7:55 AM Matthew Wilcox <willy@infradead.org> wrote: > > On Tue, Sep 12, 2023 at 06:30:46AM +0100, Matthew Wilcox wrote: > > On Tue, Sep 05, 2023 at 06:03:49PM -0700, syzbot wrote: > > > Hello, > > > > > > syzbot found the following issue on: > > > > > > HEAD commit: a47fc304d2b6 Add linux-next specific files for 20230831 > > > git tree: linux-next > > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=16502ddba80000 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=6ecd2a74f20953b9 > > > dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 > > > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=120e7d70680000 > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1523f9c0680000 > > > > > > Downloadable assets: > > > disk image: https://storage.googleapis.com/syzbot-assets/b2e8f4217527/disk-a47fc304.raw.xz > > > vmlinux: https://storage.googleapis.com/syzbot-assets/ed6cdcc09339/vmlinux-a47fc304.xz > > > kernel image: https://storage.googleapis.com/syzbot-assets/bd9b2475bf5a/bzImage-a47fc304.xz > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com > > > > #syz test > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > index 42b5567e3773..90ad5fe60824 100644 > > --- a/mm/mempolicy.c > > +++ b/mm/mempolicy.c > > @@ -1342,6 +1342,7 @@ static long do_mbind(unsigned long start, unsigned long len, > > vma_iter_init(&vmi, mm, start); > > prev = 
vma_prev(&vmi); > > for_each_vma_range(vmi, vma, end) { > > + vma_start_write(vma); > > err = mbind_range(&vmi, vma, &prev, start, end, new); > > if (err) > > break; > > Suren, can you take a look at this? The VMA should be locked by the > call to queue_pages_range(), but by the time we get to here, the VMA > isn't locked. I don't see anywhere that we cycle the mmap_lock (which > would unlock the VMA), but I could have missed something. The two > VMA walks should walk over the same set of VMAs. Certainly the VMA > being dumped should have been locked by the pagewalk: Sure, I'll look into this today. Somehow this report slipped by me unnoticed. Thanks! > > vma ffff888077381a00 start 0000000020c2a000 end 0000000021000000 mm ffff8880258a8980 > prot 25 anon_vma 0000000000000000 vm_ops 0000000000000000 > pgoff 20c2a file 0000000000000000 private_data 0000000000000000 > flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty) > > syscall(__NR_mbind, /*addr=*/0x20400000ul, /*len=*/0xc00000ul, /*mode=*/4ul, > /*nodemask=*/0ul, /*maxnode=*/0ul, /*flags=*/3ul); > > 20400000 + c00000 should overlap 20c2a000-21000000 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-12 15:03 ` Suren Baghdasaryan @ 2023-09-12 16:00 ` Suren Baghdasaryan 2023-09-13 16:05 ` Suren Baghdasaryan 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-12 16:00 UTC (permalink / raw) To: Matthew Wilcox; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Tue, Sep 12, 2023 at 8:03 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, Sep 12, 2023 at 7:55 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Tue, Sep 12, 2023 at 06:30:46AM +0100, Matthew Wilcox wrote: > > > On Tue, Sep 05, 2023 at 06:03:49PM -0700, syzbot wrote: > > > > Hello, > > > > > > > > syzbot found the following issue on: > > > > > > > > HEAD commit: a47fc304d2b6 Add linux-next specific files for 20230831 > > > > git tree: linux-next > > > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=16502ddba80000 > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=6ecd2a74f20953b9 > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 > > > > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 > > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=120e7d70680000 > > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1523f9c0680000 > > > > > > > > Downloadable assets: > > > > disk image: https://storage.googleapis.com/syzbot-assets/b2e8f4217527/disk-a47fc304.raw.xz > > > > vmlinux: https://storage.googleapis.com/syzbot-assets/ed6cdcc09339/vmlinux-a47fc304.xz > > > > kernel image: https://storage.googleapis.com/syzbot-assets/bd9b2475bf5a/bzImage-a47fc304.xz > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com > > > > > > #syz test > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > index 42b5567e3773..90ad5fe60824 100644 > > > --- a/mm/mempolicy.c > > > +++ 
b/mm/mempolicy.c > > > @@ -1342,6 +1342,7 @@ static long do_mbind(unsigned long start, unsigned long len, > > > vma_iter_init(&vmi, mm, start); > > > prev = vma_prev(&vmi); > > > for_each_vma_range(vmi, vma, end) { > > > + vma_start_write(vma); > > > err = mbind_range(&vmi, vma, &prev, start, end, new); > > > if (err) > > > break; > > > > Suren, can you take a look at this? The VMA should be locked by the > > call to queue_pages_range(), but by the time we get to here, the VMA > > isn't locked. I don't see anywhere that we cycle the mmap_lock (which > > would unlock the VMA), but I could have missed something. The two > > VMA walks should walk over the same set of VMAs. Certainly the VMA > > being dumped should have been locked by the pagewalk: Yeah, this looks strange. queue_pages_range() should have locked all the vmas and the tree can't change since we are holding mmap_lock for write. I'll try to reproduce later today and see what's going on. > > Sure, I'll look into this today. Somehow this report slipped by me > unnoticed. Thanks! > > > > > vma ffff888077381a00 start 0000000020c2a000 end 0000000021000000 mm ffff8880258a8980 > > prot 25 anon_vma 0000000000000000 vm_ops 0000000000000000 > > pgoff 20c2a file 0000000000000000 private_data 0000000000000000 > > flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty) > > > > syscall(__NR_mbind, /*addr=*/0x20400000ul, /*len=*/0xc00000ul, /*mode=*/4ul, > > /*nodemask=*/0ul, /*maxnode=*/0ul, /*flags=*/3ul); > > > > 20400000 + c00000 should overlap 20c2a000-21000000 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-12 16:00 ` Suren Baghdasaryan @ 2023-09-13 16:05 ` Suren Baghdasaryan 2023-09-13 16:46 ` Suren Baghdasaryan 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-13 16:05 UTC (permalink / raw) To: Matthew Wilcox; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Tue, Sep 12, 2023 at 4:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, Sep 12, 2023 at 8:03 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, Sep 12, 2023 at 7:55 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Tue, Sep 12, 2023 at 06:30:46AM +0100, Matthew Wilcox wrote: > > > > On Tue, Sep 05, 2023 at 06:03:49PM -0700, syzbot wrote: > > > > > Hello, > > > > > > > > > > syzbot found the following issue on: > > > > > > > > > > HEAD commit: a47fc304d2b6 Add linux-next specific files for 20230831 > > > > > git tree: linux-next > > > > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=16502ddba80000 > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=6ecd2a74f20953b9 > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 > > > > > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 > > > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=120e7d70680000 > > > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1523f9c0680000 > > > > > > > > > > Downloadable assets: > > > > > disk image: https://storage.googleapis.com/syzbot-assets/b2e8f4217527/disk-a47fc304.raw.xz > > > > > vmlinux: https://storage.googleapis.com/syzbot-assets/ed6cdcc09339/vmlinux-a47fc304.xz > > > > > kernel image: https://storage.googleapis.com/syzbot-assets/bd9b2475bf5a/bzImage-a47fc304.xz > > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > > Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com > > > > > > > > #syz test > > > > > > > > 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > index 42b5567e3773..90ad5fe60824 100644 > > > > --- a/mm/mempolicy.c > > > > +++ b/mm/mempolicy.c > > > > @@ -1342,6 +1342,7 @@ static long do_mbind(unsigned long start, unsigned long len, > > > > vma_iter_init(&vmi, mm, start); > > > > prev = vma_prev(&vmi); > > > > for_each_vma_range(vmi, vma, end) { > > > > + vma_start_write(vma); > > > > err = mbind_range(&vmi, vma, &prev, start, end, new); > > > > if (err) > > > > break; > > > > > > Suren, can you take a look at this? The VMA should be locked by the > > > call to queue_pages_range(), but by the time we get to here, the VMA > > > isn't locked. I don't see anywhere that we cycle the mmap_lock (which > > > would unlock the VMA), but I could have missed something. The two > > > VMA walks should walk over the same set of VMAs. Certainly the VMA > > > being dumped should have been locked by the pagewalk: > > Yeah, this looks strange. queue_pages_range() should have locked all > the vmas and the tree can't change since we are holding mmap_lock for > write. I'll try to reproduce later today and see what's going on. So far I was unable to reproduce the issue. I tried with Linus' ToT using the attached config. linux-next ToT does not boot with this config but defconfig boots and fails to reproduce the issue. I'll try to figure out why current linux-next does not like this config. > > > > > Sure, I'll look into this today. Somehow this report slipped by me > > unnoticed. Thanks! 
> > > > > > > > vma ffff888077381a00 start 0000000020c2a000 end 0000000021000000 mm ffff8880258a8980 > > > prot 25 anon_vma 0000000000000000 vm_ops 0000000000000000 > > > pgoff 20c2a file 0000000000000000 private_data 0000000000000000 > > > flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty) > > > > > > syscall(__NR_mbind, /*addr=*/0x20400000ul, /*len=*/0xc00000ul, /*mode=*/4ul, > > > /*nodemask=*/0ul, /*maxnode=*/0ul, /*flags=*/3ul); > > > > > > 20400000 + c00000 should overlap 20c2a000-21000000 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-13 16:05 ` Suren Baghdasaryan @ 2023-09-13 16:46 ` Suren Baghdasaryan 2023-09-14 18:20 ` Suren Baghdasaryan 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-13 16:46 UTC (permalink / raw) To: Matthew Wilcox; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Wed, Sep 13, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Tue, Sep 12, 2023 at 4:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, Sep 12, 2023 at 8:03 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Tue, Sep 12, 2023 at 7:55 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Tue, Sep 12, 2023 at 06:30:46AM +0100, Matthew Wilcox wrote: > > > > > On Tue, Sep 05, 2023 at 06:03:49PM -0700, syzbot wrote: > > > > > > Hello, > > > > > > > > > > > > syzbot found the following issue on: > > > > > > > > > > > > HEAD commit: a47fc304d2b6 Add linux-next specific files for 20230831 > > > > > > git tree: linux-next > > > > > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=16502ddba80000 > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=6ecd2a74f20953b9 > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 > > > > > > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=120e7d70680000 > > > > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1523f9c0680000 > > > > > > > > > > > > Downloadable assets: > > > > > > disk image: https://storage.googleapis.com/syzbot-assets/b2e8f4217527/disk-a47fc304.raw.xz > > > > > > vmlinux: https://storage.googleapis.com/syzbot-assets/ed6cdcc09339/vmlinux-a47fc304.xz > > > > > > kernel image: https://storage.googleapis.com/syzbot-assets/bd9b2475bf5a/bzImage-a47fc304.xz > > > > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to 
the commit: > > > > > > Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com > > > > > > > > > > #syz test > > > > > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > > index 42b5567e3773..90ad5fe60824 100644 > > > > > --- a/mm/mempolicy.c > > > > > +++ b/mm/mempolicy.c > > > > > @@ -1342,6 +1342,7 @@ static long do_mbind(unsigned long start, unsigned long len, > > > > > vma_iter_init(&vmi, mm, start); > > > > > prev = vma_prev(&vmi); > > > > > for_each_vma_range(vmi, vma, end) { > > > > > + vma_start_write(vma); > > > > > err = mbind_range(&vmi, vma, &prev, start, end, new); > > > > > if (err) > > > > > break; > > > > > > > > Suren, can you take a look at this? The VMA should be locked by the > > > > call to queue_pages_range(), but by the time we get to here, the VMA > > > > isn't locked. I don't see anywhere that we cycle the mmap_lock (which > > > > would unlock the VMA), but I could have missed something. The two > > > > VMA walks should walk over the same set of VMAs. Certainly the VMA > > > > being dumped should have been locked by the pagewalk: > > > > Yeah, this looks strange. queue_pages_range() should have locked all > > the vmas and the tree can't change since we are holding mmap_lock for > > write. I'll try to reproduce later today and see what's going on. > > So far I was unable to reproduce the issue. I tried with Linus' ToT > using the attached config. linux-next ToT does not boot with this > config but defconfig boots and fails to reproduce the issue. I'll try > to figure out why current linux-next does not like this config. Ok, I found a way to reproduce this using the config and kernel baseline reported on 2023/09/06 06:24 at https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023. 
I suspect mmap_lock is being dropped by a racing thread, similar to this
issue we fixed before here:
https://lore.kernel.org/all/CAJuCfpH8ucOkCFYrVZafUAppi5+mVhy=uD+BK6-oYX=ysQv5qQ@mail.gmail.com/
Anyway, I'm on it and will report once I figure out the issue.

>
> >
> > > Sure, I'll look into this today. Somehow this report slipped by me
> > > unnoticed. Thanks!
> > >
> > > >
> > > > vma ffff888077381a00 start 0000000020c2a000 end 0000000021000000 mm ffff8880258a8980
> > > > prot 25 anon_vma 0000000000000000 vm_ops 0000000000000000
> > > > pgoff 20c2a file 0000000000000000 private_data 0000000000000000
> > > > flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty)
> > > >
> > > > syscall(__NR_mbind, /*addr=*/0x20400000ul, /*len=*/0xc00000ul, /*mode=*/4ul,
> > > > /*nodemask=*/0ul, /*maxnode=*/0ul, /*flags=*/3ul);
> > > >
> > > > 20400000 + c00000 should overlap 20c2a000-21000000

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-13 16:46 ` Suren Baghdasaryan @ 2023-09-14 18:20 ` Suren Baghdasaryan 2023-09-14 19:09 ` Matthew Wilcox 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-14 18:20 UTC (permalink / raw) To: Matthew Wilcox; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Wed, Sep 13, 2023 at 4:46 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Wed, Sep 13, 2023 at 4:05 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Tue, Sep 12, 2023 at 4:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Tue, Sep 12, 2023 at 8:03 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > > > On Tue, Sep 12, 2023 at 7:55 AM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Tue, Sep 12, 2023 at 06:30:46AM +0100, Matthew Wilcox wrote: > > > > > > On Tue, Sep 05, 2023 at 06:03:49PM -0700, syzbot wrote: > > > > > > > Hello, > > > > > > > > > > > > > > syzbot found the following issue on: > > > > > > > > > > > > > > HEAD commit: a47fc304d2b6 Add linux-next specific files for 20230831 > > > > > > > git tree: linux-next > > > > > > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=16502ddba80000 > > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=6ecd2a74f20953b9 > > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 > > > > > > > compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 > > > > > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=120e7d70680000 > > > > > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1523f9c0680000 > > > > > > > > > > > > > > Downloadable assets: > > > > > > > disk image: https://storage.googleapis.com/syzbot-assets/b2e8f4217527/disk-a47fc304.raw.xz > > > > > > > vmlinux: https://storage.googleapis.com/syzbot-assets/ed6cdcc09339/vmlinux-a47fc304.xz > > > > > > > kernel image: 
https://storage.googleapis.com/syzbot-assets/bd9b2475bf5a/bzImage-a47fc304.xz > > > > > > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > > > > Reported-by: syzbot+b591856e0f0139f83023@syzkaller.appspotmail.com > > > > > > > > > > > > #syz test > > > > > > > > > > > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > > > > > > index 42b5567e3773..90ad5fe60824 100644 > > > > > > --- a/mm/mempolicy.c > > > > > > +++ b/mm/mempolicy.c > > > > > > @@ -1342,6 +1342,7 @@ static long do_mbind(unsigned long start, unsigned long len, > > > > > > vma_iter_init(&vmi, mm, start); > > > > > > prev = vma_prev(&vmi); > > > > > > for_each_vma_range(vmi, vma, end) { > > > > > > + vma_start_write(vma); > > > > > > err = mbind_range(&vmi, vma, &prev, start, end, new); > > > > > > if (err) > > > > > > break; > > > > > > > > > > Suren, can you take a look at this? The VMA should be locked by the > > > > > call to queue_pages_range(), but by the time we get to here, the VMA > > > > > isn't locked. I don't see anywhere that we cycle the mmap_lock (which > > > > > would unlock the VMA), but I could have missed something. The two > > > > > VMA walks should walk over the same set of VMAs. Certainly the VMA > > > > > being dumped should have been locked by the pagewalk: > > > > > > Yeah, this looks strange. queue_pages_range() should have locked all > > > the vmas and the tree can't change since we are holding mmap_lock for > > > write. I'll try to reproduce later today and see what's going on. > > > > So far I was unable to reproduce the issue. I tried with Linus' ToT > > using the attached config. linux-next ToT does not boot with this > > config but defconfig boots and fails to reproduce the issue. I'll try > > to figure out why current linux-next does not like this config. 
> > Ok, I found a way to reproduce this using the config and kernel > baseline reported on 2023/09/06 06:24 at > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023. I > suspect mmap_lock is being dropped by a racing thread, similar to this > issue we fixed before here: > https://lore.kernel.org/all/CAJuCfpH8ucOkCFYrVZafUAppi5+mVhy=uD+BK6-oYX=ysQv5qQ@mail.gmail.com/ > Anyway, I'm on it and will report once I figure out the issue. I think I found the problem and the explanation is much simpler. While walking the page range, queue_folios_pte_range() encounters an unmovable page and queue_folios_pte_range() returns 1. That causes a break from the loop inside walk_page_range() and no more VMAs get locked. After that the loop calling mbind_range() walks over all VMAs, even the ones which were skipped by queue_folios_pte_range() and that causes this BUG assertion. Thinking what's the right way to handle this situation (what's the expected behavior here)... I think the safest way would be to modify walk_page_range() and make it continue calling process_vma_walk_lock() for all VMAs in the range even when __walk_page_range() returns a positive err. Any objection or alternative suggestions? > > > > > > > > > > > > > > Sure, I'll look into this today. Somehow this report slipped by me > > > > unnoticed. Thanks! > > > > > > > > > > > > > > vma ffff888077381a00 start 0000000020c2a000 end 0000000021000000 mm ffff8880258a8980 > > > > > prot 25 anon_vma 0000000000000000 vm_ops 0000000000000000 > > > > > pgoff 20c2a file 0000000000000000 private_data 0000000000000000 > > > > > flags: 0x8100077(read|write|exec|mayread|maywrite|mayexec|account|softdirty) > > > > > > > > > > syscall(__NR_mbind, /*addr=*/0x20400000ul, /*len=*/0xc00000ul, /*mode=*/4ul, > > > > > /*nodemask=*/0ul, /*maxnode=*/0ul, /*flags=*/3ul); > > > > > > > > > > 20400000 + c00000 should overlap 20c2a000-21000000 ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-14 18:20 ` Suren Baghdasaryan @ 2023-09-14 19:09 ` Matthew Wilcox 2023-09-14 20:00 ` Suren Baghdasaryan 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-09-14 19:09 UTC (permalink / raw) To: Suren Baghdasaryan; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > I think I found the problem and the explanation is much simpler. While > walking the page range, queue_folios_pte_range() encounters an > unmovable page and queue_folios_pte_range() returns 1. That causes a > break from the loop inside walk_page_range() and no more VMAs get > locked. After that the loop calling mbind_range() walks over all VMAs, > even the ones which were skipped by queue_folios_pte_range() and that > causes this BUG assertion. > > Thinking what's the right way to handle this situation (what's the > expected behavior here)... > I think the safest way would be to modify walk_page_range() and make > it continue calling process_vma_walk_lock() for all VMAs in the range > even when __walk_page_range() returns a positive err. Any objection or > alternative suggestions? So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were specified. That means we're going to return an error, no matter what, and there's no point in calling mbind_range(). Right? +++ b/mm/mempolicy.c @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, ret = queue_pages_range(mm, start, end, nmask, flags | MPOL_MF_INVERT, &pagelist, true); + if (ret == 1) + ret = -EIO; if (ret < 0) { err = ret; goto up_out; (I don't really understand this code, so it can't be this simple, can it? Why don't we just return -EIO from queue_folios_pte_range() if this is the right answer?) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-14 19:09 ` Matthew Wilcox @ 2023-09-14 20:00 ` Suren Baghdasaryan 2023-09-14 20:53 ` Suren Baghdasaryan 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-14 20:00 UTC (permalink / raw) To: Matthew Wilcox; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > > I think I found the problem and the explanation is much simpler. While > > walking the page range, queue_folios_pte_range() encounters an > > unmovable page and queue_folios_pte_range() returns 1. That causes a > > break from the loop inside walk_page_range() and no more VMAs get > > locked. After that the loop calling mbind_range() walks over all VMAs, > > even the ones which were skipped by queue_folios_pte_range() and that > > causes this BUG assertion. > > > > Thinking what's the right way to handle this situation (what's the > > expected behavior here)... > > I think the safest way would be to modify walk_page_range() and make > > it continue calling process_vma_walk_lock() for all VMAs in the range > > even when __walk_page_range() returns a positive err. Any objection or > > alternative suggestions? > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were > specified. That means we're going to return an error, no matter what, > and there's no point in calling mbind_range(). Right? > > +++ b/mm/mempolicy.c > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, > ret = queue_pages_range(mm, start, end, nmask, > flags | MPOL_MF_INVERT, &pagelist, true); > > + if (ret == 1) > + ret = -EIO; > if (ret < 0) { > err = ret; > goto up_out; > > (I don't really understand this code, so it can't be this simple, can > it? Why don't we just return -EIO from queue_folios_pte_range() if > this is the right answer?) 
Yeah, I'm trying to understand the expected behavior of this function to make sure we are not missing anything. I tried a simple fix that I suggested in my previous email and it works but I want to understand a bit more about this function's logic before posting the fix.
Yeah, I'm trying to understand the expected behavior of this function to make sure we are not missing anything. I tried a simple fix that I suggested in my previous email and it works but I want to understand a bit more about this function's logic before posting the fix.
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-14 20:00 ` Suren Baghdasaryan @ 2023-09-14 20:53 ` Suren Baghdasaryan 2023-09-14 21:24 ` Matthew Wilcox 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-14 20:53 UTC (permalink / raw) To: Matthew Wilcox; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > > > I think I found the problem and the explanation is much simpler. While > > > walking the page range, queue_folios_pte_range() encounters an > > > unmovable page and queue_folios_pte_range() returns 1. That causes a > > > break from the loop inside walk_page_range() and no more VMAs get > > > locked. After that the loop calling mbind_range() walks over all VMAs, > > > even the ones which were skipped by queue_folios_pte_range() and that > > > causes this BUG assertion. > > > > > > Thinking what's the right way to handle this situation (what's the > > > expected behavior here)... > > > I think the safest way would be to modify walk_page_range() and make > > > it continue calling process_vma_walk_lock() for all VMAs in the range > > > even when __walk_page_range() returns a positive err. Any objection or > > > alternative suggestions? > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were > > specified. That means we're going to return an error, no matter what, > > and there's no point in calling mbind_range(). Right? 
> > > > +++ b/mm/mempolicy.c > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, > > ret = queue_pages_range(mm, start, end, nmask, > > flags | MPOL_MF_INVERT, &pagelist, true); > > > > + if (ret == 1) > > + ret = -EIO; > > if (ret < 0) { > > err = ret; > > goto up_out; > > > > (I don't really understand this code, so it can't be this simple, can > > it? Why don't we just return -EIO from queue_folios_pte_range() if > > this is the right answer?) > > Yeah, I'm trying to understand the expected behavior of this function > to make sure we are not missing anything. I tried a simple fix that I > suggested in my previous email and it works but I want to understand a > bit more about this function's logic before posting the fix. So, current functionality is that after queue_pages_range() encounters an unmovable page, terminates the loop and returns 1, mbind_range() will still be called for the whole range (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345), all pages in the pagelist will be migrated (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355) and only after that the -EIO code will be returned (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362). So, if we follow Matthew's suggestion we will be altering the current behavior which I assume is not what we want to do. 
The simple fix I was thinking about that would not alter this behavior is smth like this: diff --git a/mm/pagewalk.c b/mm/pagewalk.c index b7d7e4fcfad7..c37a7e8be4cb 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -493,11 +493,17 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, if (!vma) { /* after the last vma */ walk.vma = NULL; next = end; + if (err) + continue; + if (ops->pte_hole) err = ops->pte_hole(start, next, -1, &walk); } else if (start < vma->vm_start) { /* outside vma */ walk.vma = NULL; next = min(end, vma->vm_start); + if (err) + continue; + if (ops->pte_hole) err = ops->pte_hole(start, next, -1, &walk); } else { /* inside vma */ @@ -505,6 +511,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, walk.vma = vma; next = min(end, vma->vm_end); vma = find_vma(mm, vma->vm_end); + if (err) + continue; err = walk_page_test(start, next, &walk); if (err > 0) { @@ -520,8 +528,6 @@ int walk_page_range(struct mm_struct *mm, unsigned long start, break; err = __walk_page_range(start, next, &walk); } - if (err) - break; } while (start = next, start < end); return err; } WDYT? ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-14 20:53 ` Suren Baghdasaryan @ 2023-09-14 21:24 ` Matthew Wilcox 2023-09-14 22:21 ` Suren Baghdasaryan 0 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-09-14 21:24 UTC (permalink / raw) To: Suren Baghdasaryan; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote: > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > > > > I think I found the problem and the explanation is much simpler. While > > > > walking the page range, queue_folios_pte_range() encounters an > > > > unmovable page and queue_folios_pte_range() returns 1. That causes a > > > > break from the loop inside walk_page_range() and no more VMAs get > > > > locked. After that the loop calling mbind_range() walks over all VMAs, > > > > even the ones which were skipped by queue_folios_pte_range() and that > > > > causes this BUG assertion. > > > > > > > > Thinking what's the right way to handle this situation (what's the > > > > expected behavior here)... > > > > I think the safest way would be to modify walk_page_range() and make > > > > it continue calling process_vma_walk_lock() for all VMAs in the range > > > > even when __walk_page_range() returns a positive err. Any objection or > > > > alternative suggestions? > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were > > > specified. That means we're going to return an error, no matter what, > > > and there's no point in calling mbind_range(). Right? 
> > > > > > +++ b/mm/mempolicy.c > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, > > > ret = queue_pages_range(mm, start, end, nmask, > > > flags | MPOL_MF_INVERT, &pagelist, true); > > > > > > + if (ret == 1) > > > + ret = -EIO; > > > if (ret < 0) { > > > err = ret; > > > goto up_out; > > > > > > (I don't really understand this code, so it can't be this simple, can > > > it? Why don't we just return -EIO from queue_folios_pte_range() if > > > this is the right answer?) > > > > Yeah, I'm trying to understand the expected behavior of this function > > to make sure we are not missing anything. I tried a simple fix that I > > suggested in my previous email and it works but I want to understand a > > bit more about this function's logic before posting the fix. > > So, current functionality is that after queue_pages_range() encounters > an unmovable page, terminates the loop and returns 1, mbind_range() > will still be called for the whole range > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345), > all pages in the pagelist will be migrated > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355) > and only after that the -EIO code will be returned > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362). > So, if we follow Matthew's suggestion we will be altering the current > behavior which I assume is not what we want to do. Right, I'm intentionally changing the behaviour. My thinking is that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail. Should such a failure actually move the movable pages before reporting that it failed? I don't know. > The simple fix I was thinking about that would not alter this behavior > is smth like this: I don't like it, but can we run it past syzbot to be sure it solves the issue and we're not chasing a ghost here? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-14 21:24 ` Matthew Wilcox @ 2023-09-14 22:21 ` Suren Baghdasaryan 2023-09-15 4:26 ` Hugh Dickins 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-14 22:21 UTC (permalink / raw) To: Matthew Wilcox; +Cc: syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote: > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > > > > > I think I found the problem and the explanation is much simpler. While > > > > > walking the page range, queue_folios_pte_range() encounters an > > > > > unmovable page and queue_folios_pte_range() returns 1. That causes a > > > > > break from the loop inside walk_page_range() and no more VMAs get > > > > > locked. After that the loop calling mbind_range() walks over all VMAs, > > > > > even the ones which were skipped by queue_folios_pte_range() and that > > > > > causes this BUG assertion. > > > > > > > > > > Thinking what's the right way to handle this situation (what's the > > > > > expected behavior here)... > > > > > I think the safest way would be to modify walk_page_range() and make > > > > > it continue calling process_vma_walk_lock() for all VMAs in the range > > > > > even when __walk_page_range() returns a positive err. Any objection or > > > > > alternative suggestions? > > > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were > > > > specified. That means we're going to return an error, no matter what, > > > > and there's no point in calling mbind_range(). Right? 
> > > > > > > > +++ b/mm/mempolicy.c > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, > > > > ret = queue_pages_range(mm, start, end, nmask, > > > > flags | MPOL_MF_INVERT, &pagelist, true); > > > > > > > > + if (ret == 1) > > > > + ret = -EIO; > > > > if (ret < 0) { > > > > err = ret; > > > > goto up_out; > > > > > > > > (I don't really understand this code, so it can't be this simple, can > > > > it? Why don't we just return -EIO from queue_folios_pte_range() if > > > > this is the right answer?) > > > > > > Yeah, I'm trying to understand the expected behavior of this function > > > to make sure we are not missing anything. I tried a simple fix that I > > > suggested in my previous email and it works but I want to understand a > > > bit more about this function's logic before posting the fix. > > > > So, current functionality is that after queue_pages_range() encounters > > an unmovable page, terminates the loop and returns 1, mbind_range() > > will still be called for the whole range > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345), > > all pages in the pagelist will be migrated > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355) > > and only after that the -EIO code will be returned > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362). > > So, if we follow Matthew's suggestion we will be altering the current > > behavior which I assume is not what we want to do. > > Right, I'm intentionally changing the behaviour. My thinking is > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail. Should > such a failure actually move the movable pages before reporting that > it failed? I don't know. > > > The simple fix I was thinking about that would not alter this behavior > > is smth like this: > > I don't like it, but can we run it past syzbot to be sure it solves the > issue and we're not chasing a ghost here? 
Yes, I just finished running the reproducer on both upstream and linux-next builds listed in https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the problem does not happen anymore. I'm fine with your suggestion too, just wanted to point out it would introduce change in the behavior. Let me know how you want to proceed.
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-14 22:21 ` Suren Baghdasaryan @ 2023-09-15 4:26 ` Hugh Dickins 2023-09-15 16:09 ` Suren Baghdasaryan ` (2 more replies) 0 siblings, 3 replies; 27+ messages in thread From: Hugh Dickins @ 2023-09-15 4:26 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Matthew Wilcox, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs [-- Attachment #1: Type: text/plain, Size: 6012 bytes --] On Thu, 14 Sep 2023, Suren Baghdasaryan wrote: > On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote: > > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > > > > > > I think I found the problem and the explanation is much simpler. While > > > > > > walking the page range, queue_folios_pte_range() encounters an > > > > > > unmovable page and queue_folios_pte_range() returns 1. That causes a > > > > > > break from the loop inside walk_page_range() and no more VMAs get > > > > > > locked. After that the loop calling mbind_range() walks over all VMAs, > > > > > > even the ones which were skipped by queue_folios_pte_range() and that > > > > > > causes this BUG assertion. > > > > > > > > > > > > Thinking what's the right way to handle this situation (what's the > > > > > > expected behavior here)... > > > > > > I think the safest way would be to modify walk_page_range() and make > > > > > > it continue calling process_vma_walk_lock() for all VMAs in the range > > > > > > even when __walk_page_range() returns a positive err. Any objection or > > > > > > alternative suggestions? > > > > > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were > > > > > specified. 
That means we're going to return an error, no matter what, > > > > > and there's no point in calling mbind_range(). Right? > > > > > > > > > > +++ b/mm/mempolicy.c > > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, > > > > > ret = queue_pages_range(mm, start, end, nmask, > > > > > flags | MPOL_MF_INVERT, &pagelist, true); > > > > > > > > > > + if (ret == 1) > > > > > + ret = -EIO; > > > > > if (ret < 0) { > > > > > err = ret; > > > > > goto up_out; > > > > > > > > > > (I don't really understand this code, so it can't be this simple, can > > > > > it? Why don't we just return -EIO from queue_folios_pte_range() if > > > > > this is the right answer?) > > > > > > > > Yeah, I'm trying to understand the expected behavior of this function > > > > to make sure we are not missing anything. I tried a simple fix that I > > > > suggested in my previous email and it works but I want to understand a > > > > bit more about this function's logic before posting the fix. > > > > > > So, current functionality is that after queue_pages_range() encounters > > > an unmovable page, terminates the loop and returns 1, mbind_range() > > > will still be called for the whole range > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345), > > > all pages in the pagelist will be migrated > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355) > > > and only after that the -EIO code will be returned > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362). > > > So, if we follow Matthew's suggestion we will be altering the current > > > behavior which I assume is not what we want to do. > > > > Right, I'm intentionally changing the behaviour. My thinking is > > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail. Should > > such a failure actually move the movable pages before reporting that > > it failed? I don't know. 
> > > > > The simple fix I was thinking about that would not alter this behavior > > > is smth like this: > > > > I don't like it, but can we run it past syzbot to be sure it solves the > > issue and we're not chasing a ghost here? > > Yes, I just finished running the reproducer on both upstream and > linux-next builds listed in > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the > problem does not happen anymore. > I'm fine with your suggestion too, just wanted to point out it would > introduce change in the behavior. Let me know how you want to proceed. Well done, identifying the mysterious cause of this problem: I'm glad to hear that you've now verified that hypothesis. You're right, it would be a regression to follow Matthew's suggestion. Traditionally, modulo bugs and inconsistencies, the queue_pages_range() phase of do_mbind() has done the best it can, gathering all the pages it can that need migration, even if some were missed; and proceeds to do the mbind_range() phase if there was nothing "seriously" wrong (a gap causing -EFAULT). Then at the end, if MPOL_MF_STRICT was set, and not all the pages could be migrated (or MOVE was not specified and not all pages were well placed), it returns -EIO rather than 0 to inform the caller that not all could be done. There have been numerous tweaks, but I think most importantly 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s which stop the pagewalk early. In my opinion, not an improvement - makes it harder to get mbind() to do the best job it can (or is it justified as what you're asking for if you say STRICT?). But whatever, it would be a further regression for mbind() not to have done the mbind_range(), even though it goes on to return -EIO. 
I had a bad first reaction to your walk_page_range() patch (was expecting to see vma_start_write()s in mbind_range()), but perhaps your patch is exactly what process_mm_walk_lock() does now demand. [Why is Hugh responding on this? Because I have some long-standing mm/mempolicy.c patches to submit next week, but in reviewing what I could or could not afford to get into at this time, had decided I'd better stay out of queue_pages_range() for now - beyond the trivial preferring an MPOL_MF_WRLOCK flag to your bool lock_vma.] Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-15 4:26 ` Hugh Dickins @ 2023-09-15 16:09 ` Suren Baghdasaryan 2023-09-15 18:05 ` Suren Baghdasaryan 2023-09-15 18:26 ` Matthew Wilcox 2023-09-16 1:35 ` Yang Shi 2 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-15 16:09 UTC (permalink / raw) To: Hugh Dickins Cc: Matthew Wilcox, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Thu, Sep 14, 2023 at 9:26 PM Hugh Dickins <hughd@google.com> wrote: > > On Thu, 14 Sep 2023, Suren Baghdasaryan wrote: > > On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox <willy@infradead.org> wrote: > > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote: > > > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > > > > > > > I think I found the problem and the explanation is much simpler. While > > > > > > > walking the page range, queue_folios_pte_range() encounters an > > > > > > > unmovable page and queue_folios_pte_range() returns 1. That causes a > > > > > > > break from the loop inside walk_page_range() and no more VMAs get > > > > > > > locked. After that the loop calling mbind_range() walks over all VMAs, > > > > > > > even the ones which were skipped by queue_folios_pte_range() and that > > > > > > > causes this BUG assertion. > > > > > > > > > > > > > > Thinking what's the right way to handle this situation (what's the > > > > > > > expected behavior here)... > > > > > > > I think the safest way would be to modify walk_page_range() and make > > > > > > > it continue calling process_vma_walk_lock() for all VMAs in the range > > > > > > > even when __walk_page_range() returns a positive err. Any objection or > > > > > > > alternative suggestions? 
> > > > > > > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were > > > > > > specified. That means we're going to return an error, no matter what, > > > > > > and there's no point in calling mbind_range(). Right? > > > > > > > > > > > > +++ b/mm/mempolicy.c > > > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, > > > > > > ret = queue_pages_range(mm, start, end, nmask, > > > > > > flags | MPOL_MF_INVERT, &pagelist, true); > > > > > > > > > > > > + if (ret == 1) > > > > > > + ret = -EIO; > > > > > > if (ret < 0) { > > > > > > err = ret; > > > > > > goto up_out; > > > > > > > > > > > > (I don't really understand this code, so it can't be this simple, can > > > > > > it? Why don't we just return -EIO from queue_folios_pte_range() if > > > > > > this is the right answer?) > > > > > > > > > > Yeah, I'm trying to understand the expected behavior of this function > > > > > to make sure we are not missing anything. I tried a simple fix that I > > > > > suggested in my previous email and it works but I want to understand a > > > > > bit more about this function's logic before posting the fix. > > > > > > > > So, current functionality is that after queue_pages_range() encounters > > > > an unmovable page, terminates the loop and returns 1, mbind_range() > > > > will still be called for the whole range > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345), > > > > all pages in the pagelist will be migrated > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355) > > > > and only after that the -EIO code will be returned > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362). > > > > So, if we follow Matthew's suggestion we will be altering the current > > > > behavior which I assume is not what we want to do. > > > > > > Right, I'm intentionally changing the behaviour. 
My thinking is > > > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail. Should > > > such a failure actually move the movable pages before reporting that > > > it failed? I don't know. > > > > > > > The simple fix I was thinking about that would not alter this behavior > > > > is smth like this: > > > > > > I don't like it, but can we run it past syzbot to be sure it solves the > > > issue and we're not chasing a ghost here? > > > > Yes, I just finished running the reproducer on both upstream and > > linux-next builds listed in > > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the > > problem does not happen anymore. > > I'm fine with your suggestion too, just wanted to point out it would > > introduce change in the behavior. Let me know how you want to proceed. > > Well done, identifying the mysterious cause of this problem: > I'm glad to hear that you've now verified that hypothesis. > > You're right, it would be a regression to follow Matthew's suggestion. > > Traditionally, modulo bugs and inconsistencies, the queue_pages_range() > phase of do_mbind() has done the best it can, gathering all the pages it > can that need migration, even if some were missed; and proceeds to do the > mbind_range() phase if there was nothing "seriously" wrong (a gap causing > -EFAULT). Then at the end, if MPOL_MF_STRICT was set, and not all the > pages could be migrated (or MOVE was not specified and not all pages > were well placed), it returns -EIO rather than 0 to inform the caller > that not all could be done. > > There have been numerous tweaks, but I think most importantly > 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when > MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s > which stop the pagewalk early. In my opinion, not an improvement - makes > it harder to get mbind() to do the best job it can (or is it justified as > what you're asking for if you say STRICT?). 
> > But whatever, it would be a further regression for mbind() not to have > done the mbind_range(), even though it goes on to return -EIO. > > I had a bad first reaction to your walk_page_range() patch (was expecting > to see vma_start_write()s in mbind_range()), but perhaps your patch is > exactly what process_mm_walk_lock() does now demand. > > [Why is Hugh responding on this? Because I have some long-standing > mm/mempolicy.c patches to submit next week, but in reviewing what I > could or could not afford to get into at this time, had decided I'd > better stay out of queue_pages_range() for now - beyond the trivial > preferring an MPOL_MF_WRLOCK flag to your bool lock_vma.] Thanks for the feedback, Hugh! Yeah, this positive err handling is kinda weird. If this behavior (do as much as possible even if we fail eventually) is specific to mbind() then we could keep walk_page_range() as is and lock the VMAs inside the loop that calls mbind_range() with a condition that ret is positive. That would be the simplest solution IMHO. But if we expect walk_page_range() to always apply requested page_walk_lock policy to all VMAs even if some mm_walk_ops returns a positive error somewhere in the middle of the walk then my fix would work for that. So, to me the important question is how we want walk_page_range() to behave in these conditions. I think we should answer that first and document that. Then the fix will be easy. > > Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
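The failure mode Suren describes above can be sketched as a toy userspace model (not kernel code; the struct and helpers below are invented for illustration, standing in for walk_page_range()'s per-VMA locking and the later mbind_range() loop):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of the bug, not kernel code: "VMAs" are array slots with a
 * locked flag. walk() stands in for walk_page_range() with a write-lock
 * walk policy, and the positive return stands in for
 * queue_folios_pte_range() hitting an unmovable page. */
struct toy_vma { bool locked; };

/* Lock each VMA in order; a positive per-VMA result aborts the walk
 * early, like the "return 1" in queue_folios_pte_range(), so the
 * remaining VMAs are never locked. */
static int walk(struct toy_vma *v, size_t n, size_t unmovable_at)
{
	for (size_t i = 0; i < n; i++) {
		v[i].locked = true;	/* per-VMA write lock */
		if (i == unmovable_at)
			return 1;	/* early break: rest stay unlocked */
	}
	return 0;
}

/* The mbind_range() phase still iterates the whole range and asserts
 * every VMA it touches is write-locked; count the ones that are not. */
static size_t count_unlocked(struct toy_vma *v, size_t n)
{
	size_t unlocked = 0;
	for (size_t i = 0; i < n; i++)
		if (!v[i].locked)
			unlocked++;	/* each of these would trip the BUG */
	return unlocked;
}
```

With four VMAs and the walk stopping at the second, two VMAs are left unlocked, and each of those is one the assertion in vma_replace_policy() would trip over.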
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-15 16:09 ` Suren Baghdasaryan @ 2023-09-15 18:05 ` Suren Baghdasaryan 2023-09-16 2:43 ` Hugh Dickins 0 siblings, 1 reply; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-15 18:05 UTC (permalink / raw) To: Hugh Dickins Cc: Matthew Wilcox, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Fri, Sep 15, 2023 at 9:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > On Thu, Sep 14, 2023 at 9:26 PM Hugh Dickins <hughd@google.com> wrote: > > > > On Thu, 14 Sep 2023, Suren Baghdasaryan wrote: > > > On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote: > > > > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > > > > > > > > I think I found the problem and the explanation is much simpler. While > > > > > > > > walking the page range, queue_folios_pte_range() encounters an > > > > > > > > unmovable page and queue_folios_pte_range() returns 1. That causes a > > > > > > > > break from the loop inside walk_page_range() and no more VMAs get > > > > > > > > locked. After that the loop calling mbind_range() walks over all VMAs, > > > > > > > > even the ones which were skipped by queue_folios_pte_range() and that > > > > > > > > causes this BUG assertion. > > > > > > > > > > > > > > > > Thinking what's the right way to handle this situation (what's the > > > > > > > > expected behavior here)... > > > > > > > > I think the safest way would be to modify walk_page_range() and make > > > > > > > > it continue calling process_vma_walk_lock() for all VMAs in the range > > > > > > > > even when __walk_page_range() returns a positive err. 
Any objection or > > > > > > > > alternative suggestions? > > > > > > > > > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were > > > > > > > specified. That means we're going to return an error, no matter what, > > > > > > > and there's no point in calling mbind_range(). Right? > > > > > > > > > > > > > > +++ b/mm/mempolicy.c > > > > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, > > > > > > > ret = queue_pages_range(mm, start, end, nmask, > > > > > > > flags | MPOL_MF_INVERT, &pagelist, true); > > > > > > > > > > > > > > + if (ret == 1) > > > > > > > + ret = -EIO; > > > > > > > if (ret < 0) { > > > > > > > err = ret; > > > > > > > goto up_out; > > > > > > > > > > > > > > (I don't really understand this code, so it can't be this simple, can > > > > > > > it? Why don't we just return -EIO from queue_folios_pte_range() if > > > > > > > this is the right answer?) > > > > > > > > > > > > Yeah, I'm trying to understand the expected behavior of this function > > > > > > to make sure we are not missing anything. I tried a simple fix that I > > > > > > suggested in my previous email and it works but I want to understand a > > > > > > bit more about this function's logic before posting the fix. > > > > > > > > > > So, current functionality is that after queue_pages_range() encounters > > > > > an unmovable page, terminates the loop and returns 1, mbind_range() > > > > > will still be called for the whole range > > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345), > > > > > all pages in the pagelist will be migrated > > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355) > > > > > and only after that the -EIO code will be returned > > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362). > > > > > So, if we follow Matthew's suggestion we will be altering the current > > > > > behavior which I assume is not what we want to do. 
> > > > > > > > Right, I'm intentionally changing the behaviour. My thinking is > > > > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail. Should > > > > such a failure actually move the movable pages before reporting that > > > > it failed? I don't know. > > > > > > > > > The simple fix I was thinking about that would not alter this behavior > > > > > is smth like this: > > > > > > > > I don't like it, but can we run it past syzbot to be sure it solves the > > > > issue and we're not chasing a ghost here? > > > > > > Yes, I just finished running the reproducer on both upstream and > > > linux-next builds listed in > > > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the > > > problem does not happen anymore. > > > I'm fine with your suggestion too, just wanted to point out it would > > > introduce change in the behavior. Let me know how you want to proceed. > > > > Well done, identifying the mysterious cause of this problem: > > I'm glad to hear that you've now verified that hypothesis. > > > > You're right, it would be a regression to follow Matthew's suggestion. > > > > Traditionally, modulo bugs and inconsistencies, the queue_pages_range() > > phase of do_mbind() has done the best it can, gathering all the pages it > > can that need migration, even if some were missed; and proceeds to do the > > mbind_range() phase if there was nothing "seriously" wrong (a gap causing > > -EFAULT). Then at the end, if MPOL_MF_STRICT was set, and not all the > > pages could be migrated (or MOVE was not specified and not all pages > > were well placed), it returns -EIO rather than 0 to inform the caller > > that not all could be done. > > > > There have been numerous tweaks, but I think most importantly > > 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when > > MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s > > which stop the pagewalk early. 
In my opinion, not an improvement - makes > > it harder to get mbind() to do the best job it can (or is it justified as > > what you're asking for if you say STRICT?). > > > > But whatever, it would be a further regression for mbind() not to have > > done the mbind_range(), even though it goes on to return -EIO. > > > > I had a bad first reaction to your walk_page_range() patch (was expecting > > to see vma_start_write()s in mbind_range()), but perhaps your patch is > > exactly what process_mm_walk_lock() does now demand. > > > > [Why is Hugh responding on this? Because I have some long-standing > > mm/mempolicy.c patches to submit next week, but in reviewing what I > > could or could not afford to get into at this time, had decided I'd > > better stay out of queue_pages_range() for now - beyond the trivial > > preferring an MPOL_MF_WRLOCK flag to your bool lock_vma.] > > Thanks for the feedback, Hugh! > Yeah, this positive err handling is kinda weird. If this behavior (do > as much as possible even if we fail eventually) is specific to mbind() > then we could keep walk_page_range() as is and lock the VMAs inside > the loop that calls mbind_range() with a condition that ret is > positive. That would be the simplest solution IMHO. But if we expect > walk_page_range() to always apply requested page_walk_lock policy to > all VMAs even if some mm_walk_ops returns a positive error somewhere > in the middle of the walk then my fix would work for that. So, to me > the important question is how we want walk_page_range() to behave in > these conditions. I think we should answer that first and document > that. Then the fix will be easy. I looked at all the cases where we perform page walk while locking VMAs and mbind() seems to be the only one that would require walk_page_range() to lock all VMAs even for a failed walk. So, I suggest this fix instead and I can also document that if walk_page_range() fails it might not apply page_walk_lock policy to the VMAs. 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 42b5567e3773..cbc584e9b6ca 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1342,6 +1342,9 @@ static long do_mbind(unsigned long start, unsigned long len,
 		vma_iter_init(&vmi, mm, start);
 		prev = vma_prev(&vmi);
 		for_each_vma_range(vmi, vma, end) {
+			/* If queue_pages_range failed then not all VMAs might be locked */
+			if (ret)
+				vma_start_write(vma);
 			err = mbind_range(&vmi, vma, &prev, start, end, new);
 			if (err)
 				break;

If this looks good I'll post the patch. Matthew, Hugh, anyone else?

> >
> >
> > Hugh

^ permalink raw reply related	[flat|nested] 27+ messages in thread
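A toy userspace model of the fix proposed above (not kernel code; the names below are invented for illustration) shows why re-taking the lock inside the mbind_range() loop is sufficient even when the page walk bailed out early:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not kernel code: walk() aborts early on a positive result,
 * mimicking queue_pages_range(); bind_range() mimics the do_mbind()
 * loop, re-locking each VMA when ret is nonzero, as the patch above
 * does with vma_start_write(). */
struct toy_vma { bool locked; };

static int walk(struct toy_vma *v, size_t n, size_t stop_at)
{
	for (size_t i = 0; i < n; i++) {
		v[i].locked = true;
		if (i == stop_at)
			return 1;	/* early break leaves the rest unlocked */
	}
	return 0;
}

/* Returns how many VMAs would be found unlocked at "assertion" time;
 * with the conditional re-lock, the answer is always zero. */
static size_t bind_range(struct toy_vma *v, size_t n, int ret)
{
	size_t would_bug = 0;
	for (size_t i = 0; i < n; i++) {
		if (ret)
			v[i].locked = true;	/* the added vma_start_write() */
		if (!v[i].locked)
			would_bug++;
	}
	return would_bug;
}
```

Locking an already-locked VMA is harmless in this model, which is why the `if (ret)` guard is only an optimization, as Hugh notes below.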
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-15 18:05 ` Suren Baghdasaryan @ 2023-09-16 2:43 ` Hugh Dickins 2023-09-18 21:20 ` Suren Baghdasaryan 0 siblings, 1 reply; 27+ messages in thread From: Hugh Dickins @ 2023-09-16 2:43 UTC (permalink / raw) To: Suren Baghdasaryan Cc: Hugh Dickins, Matthew Wilcox, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs [-- Attachment #1: Type: text/plain, Size: 2568 bytes --] On Fri, 15 Sep 2023, Suren Baghdasaryan wrote: > On Fri, Sep 15, 2023 at 9:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > Thanks for the feedback, Hugh! > > Yeah, this positive err handling is kinda weird. If this behavior (do > > as much as possible even if we fail eventually) is specific to mbind() > > then we could keep walk_page_range() as is and lock the VMAs inside > > the loop that calls mbind_range() with a condition that ret is > > positive. That would be the simplest solution IMHO. But if we expect > > walk_page_range() to always apply requested page_walk_lock policy to > > all VMAs even if some mm_walk_ops returns a positive error somewhere > > in the middle of the walk then my fix would work for that. So, to me > > the important question is how we want walk_page_range() to behave in > > these conditions. I think we should answer that first and document > > that. Then the fix will be easy. > > I looked at all the cases where we perform page walk while locking > VMAs and mbind() seems to be the only one that would require > walk_page_range() to lock all VMAs even for a failed walk. Yes, I can well believe that. > So, I suggest this fix instead and I can also document that if > walk_page_range() fails it might not apply page_walk_lock policy to > the VMAs. 
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 42b5567e3773..cbc584e9b6ca 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1342,6 +1342,9 @@ static long do_mbind(unsigned long start,
> unsigned long len,
>                 vma_iter_init(&vmi, mm, start);
>                 prev = vma_prev(&vmi);
>                 for_each_vma_range(vmi, vma, end) {
> +                       /* If queue_pages_range failed then not all VMAs
> might be locked */
> +                       if (ret)
> +                               vma_start_write(vma);
>                         err = mbind_range(&vmi, vma, &prev, start, end, new);
>                         if (err)
>                                 break;
>
> If this looks good I'll post the patch. Matthew, Hugh, anyone else?

Yes, I do prefer this, to adding those pos ret mods into the generic
pagewalk.  The "if (ret)" above being just a minor optimization, that
I would probably not have bothered with (does it even save any atomics?)
- but I guess it helps as documentation.

I think it's quite likely that mbind() will be changed sooner or later
not to need this; but it's much the best to fix this vma locking issue
urgently as above, without depending on any mbind() behavioral discussions.

Thanks,
Hugh

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-16 2:43 ` Hugh Dickins @ 2023-09-18 21:20 ` Suren Baghdasaryan 0 siblings, 0 replies; 27+ messages in thread From: Suren Baghdasaryan @ 2023-09-18 21:20 UTC (permalink / raw) To: Hugh Dickins Cc: Matthew Wilcox, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Fri, Sep 15, 2023 at 7:44 PM Hugh Dickins <hughd@google.com> wrote: > > On Fri, 15 Sep 2023, Suren Baghdasaryan wrote: > > On Fri, Sep 15, 2023 at 9:09 AM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > > Thanks for the feedback, Hugh! > > > Yeah, this positive err handling is kinda weird. If this behavior (do > > > as much as possible even if we fail eventually) is specific to mbind() > > > then we could keep walk_page_range() as is and lock the VMAs inside > > > the loop that calls mbind_range() with a condition that ret is > > > positive. That would be the simplest solution IMHO. But if we expect > > > walk_page_range() to always apply requested page_walk_lock policy to > > > all VMAs even if some mm_walk_ops returns a positive error somewhere > > > in the middle of the walk then my fix would work for that. So, to me > > > the important question is how we want walk_page_range() to behave in > > > these conditions. I think we should answer that first and document > > > that. Then the fix will be easy. > > > > I looked at all the cases where we perform page walk while locking > > VMAs and mbind() seems to be the only one that would require > > walk_page_range() to lock all VMAs even for a failed walk. > > Yes, I can well believe that. > > > So, I suggest this fix instead and I can also document that if > > walk_page_range() fails it might not apply page_walk_lock policy to > > the VMAs. 
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 42b5567e3773..cbc584e9b6ca 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -1342,6 +1342,9 @@ static long do_mbind(unsigned long start,
> > unsigned long len,
> >                 vma_iter_init(&vmi, mm, start);
> >                 prev = vma_prev(&vmi);
> >                 for_each_vma_range(vmi, vma, end) {
> > +                       /* If queue_pages_range failed then not all VMAs
> > might be locked */
> > +                       if (ret)
> > +                               vma_start_write(vma);
> >                         err = mbind_range(&vmi, vma, &prev, start, end, new);
> >                         if (err)
> >                                 break;
> >
> > If this looks good I'll post the patch. Matthew, Hugh, anyone else?
>
> Yes, I do prefer this, to adding those pos ret mods into the generic
> pagewalk.  The "if (ret)" above being just a minor optimization, that
> I would probably not have bothered with (does it even save any atomics?)
> - but I guess it helps as documentation.
>
> I think it's quite likely that mbind() will be changed sooner or later
> not to need this; but it's much the best to fix this vma locking issue
> urgently as above, without depending on any mbind() behavioral discussions.

I posted this patch at
https://lore.kernel.org/all/20230918211608.3580629-1-surenb@google.com/
to fix the immediate problem.
Thanks!

>
> Thanks,
> Hugh

^ permalink raw reply	[flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-15 4:26 ` Hugh Dickins 2023-09-15 16:09 ` Suren Baghdasaryan @ 2023-09-15 18:26 ` Matthew Wilcox 2023-09-16 2:54 ` Hugh Dickins 2023-09-16 1:35 ` Yang Shi 2 siblings, 1 reply; 27+ messages in thread From: Matthew Wilcox @ 2023-09-15 18:26 UTC (permalink / raw) To: Hugh Dickins Cc: Suren Baghdasaryan, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Thu, Sep 14, 2023 at 09:26:15PM -0700, Hugh Dickins wrote: > On Thu, 14 Sep 2023, Suren Baghdasaryan wrote: > > Yes, I just finished running the reproducer on both upstream and > > linux-next builds listed in > > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the > > problem does not happen anymore. > > I'm fine with your suggestion too, just wanted to point out it would > > introduce change in the behavior. Let me know how you want to proceed. > > Well done, identifying the mysterious cause of this problem: > I'm glad to hear that you've now verified that hypothesis. > > You're right, it would be a regression to follow Matthew's suggestion. > > Traditionally, modulo bugs and inconsistencies, the queue_pages_range() > phase of do_mbind() has done the best it can, gathering all the pages it > can that need migration, even if some were missed; and proceeds to do the > mbind_range() phase if there was nothing "seriously" wrong (a gap causing > -EFAULT). Then at the end, if MPOL_MF_STRICT was set, and not all the > pages could be migrated (or MOVE was not specified and not all pages > were well placed), it returns -EIO rather than 0 to inform the caller > that not all could be done. > > There have been numerous tweaks, but I think most importantly > 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when > MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s > which stop the pagewalk early. 
In my opinion, not an improvement - makes > it harder to get mbind() to do the best job it can (or is it justified as > what you're asking for if you say STRICT?). I suspect you agree that it's inconsistent to stop early. Userspace doesn't know at which point we found an unmovable page, so it can't behave rationally. Perhaps we should remove the 'early stop' and attempt to migrate every page in the range, whether it's before or after the first unmovable page? ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-15 18:26 ` Matthew Wilcox @ 2023-09-16 2:54 ` Hugh Dickins 0 siblings, 0 replies; 27+ messages in thread From: Hugh Dickins @ 2023-09-16 2:54 UTC (permalink / raw) To: Matthew Wilcox Cc: Hugh Dickins, Suren Baghdasaryan, Yang Shi, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Fri, 15 Sep 2023, Matthew Wilcox wrote: > On Thu, Sep 14, 2023 at 09:26:15PM -0700, Hugh Dickins wrote: > > On Thu, 14 Sep 2023, Suren Baghdasaryan wrote: > > > Yes, I just finished running the reproducer on both upstream and > > > linux-next builds listed in > > > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the > > > problem does not happen anymore. > > > I'm fine with your suggestion too, just wanted to point out it would > > > introduce change in the behavior. Let me know how you want to proceed. > > > > Well done, identifying the mysterious cause of this problem: > > I'm glad to hear that you've now verified that hypothesis. > > > > You're right, it would be a regression to follow Matthew's suggestion. > > > > Traditionally, modulo bugs and inconsistencies, the queue_pages_range() > > phase of do_mbind() has done the best it can, gathering all the pages it > > can that need migration, even if some were missed; and proceeds to do the > > mbind_range() phase if there was nothing "seriously" wrong (a gap causing > > -EFAULT). Then at the end, if MPOL_MF_STRICT was set, and not all the > > pages could be migrated (or MOVE was not specified and not all pages > > were well placed), it returns -EIO rather than 0 to inform the caller > > that not all could be done. > > > > There have been numerous tweaks, but I think most importantly > > 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when > > MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s > > which stop the pagewalk early. 
In my opinion, not an improvement - makes > > it harder to get mbind() to do the best job it can (or is it justified as > > what you're asking for if you say STRICT?). > > I suspect you agree that it's inconsistent to stop early. Userspace > doesn't know at which point we found an unmovable page, so it can't behave > rationally. Perhaps we should remove the 'early stop' and attempt to > migrate every page in the range, whether it's before or after the first > unmovable page? Yes, that's what I was arguing for, and how it was done in olden days. Though (after Yang Shi's following comments, and looking back at my last attempted patch here) I may disagree with myself about the right behavior in the MPOL_MF_STRICT case. Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-15 4:26 ` Hugh Dickins 2023-09-15 16:09 ` Suren Baghdasaryan 2023-09-15 18:26 ` Matthew Wilcox @ 2023-09-16 1:35 ` Yang Shi 2023-09-16 3:57 ` Hugh Dickins 2 siblings, 1 reply; 27+ messages in thread From: Yang Shi @ 2023-09-16 1:35 UTC (permalink / raw) To: Hugh Dickins Cc: Suren Baghdasaryan, Matthew Wilcox, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Thu, Sep 14, 2023 at 9:26 PM Hugh Dickins <hughd@google.com> wrote: > > On Thu, 14 Sep 2023, Suren Baghdasaryan wrote: > > On Thu, Sep 14, 2023 at 9:24 PM Matthew Wilcox <willy@infradead.org> wrote: > > > On Thu, Sep 14, 2023 at 08:53:59PM +0000, Suren Baghdasaryan wrote: > > > > On Thu, Sep 14, 2023 at 8:00 PM Suren Baghdasaryan <surenb@google.com> wrote: > > > > > On Thu, Sep 14, 2023 at 7:09 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > > > > > > > On Thu, Sep 14, 2023 at 06:20:56PM +0000, Suren Baghdasaryan wrote: > > > > > > > I think I found the problem and the explanation is much simpler. While > > > > > > > walking the page range, queue_folios_pte_range() encounters an > > > > > > > unmovable page and queue_folios_pte_range() returns 1. That causes a > > > > > > > break from the loop inside walk_page_range() and no more VMAs get > > > > > > > locked. After that the loop calling mbind_range() walks over all VMAs, > > > > > > > even the ones which were skipped by queue_folios_pte_range() and that > > > > > > > causes this BUG assertion. > > > > > > > > > > > > > > Thinking what's the right way to handle this situation (what's the > > > > > > > expected behavior here)... > > > > > > > I think the safest way would be to modify walk_page_range() and make > > > > > > > it continue calling process_vma_walk_lock() for all VMAs in the range > > > > > > > even when __walk_page_range() returns a positive err. Any objection or > > > > > > > alternative suggestions? 
> > > > > > > > > > > > So we only return 1 here if MPOL_MF_MOVE* & MPOL_MF_STRICT were > > > > > > specified. That means we're going to return an error, no matter what, > > > > > > and there's no point in calling mbind_range(). Right? > > > > > > > > > > > > +++ b/mm/mempolicy.c > > > > > > @@ -1334,6 +1334,8 @@ static long do_mbind(unsigned long start, unsigned long len, > > > > > > ret = queue_pages_range(mm, start, end, nmask, > > > > > > flags | MPOL_MF_INVERT, &pagelist, true); > > > > > > > > > > > > + if (ret == 1) > > > > > > + ret = -EIO; > > > > > > if (ret < 0) { > > > > > > err = ret; > > > > > > goto up_out; > > > > > > > > > > > > (I don't really understand this code, so it can't be this simple, can > > > > > > it? Why don't we just return -EIO from queue_folios_pte_range() if > > > > > > this is the right answer?) > > > > > > > > > > Yeah, I'm trying to understand the expected behavior of this function > > > > > to make sure we are not missing anything. I tried a simple fix that I > > > > > suggested in my previous email and it works but I want to understand a > > > > > bit more about this function's logic before posting the fix. > > > > > > > > So, current functionality is that after queue_pages_range() encounters > > > > an unmovable page, terminates the loop and returns 1, mbind_range() > > > > will still be called for the whole range > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1345), > > > > all pages in the pagelist will be migrated > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1355) > > > > and only after that the -EIO code will be returned > > > > (https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L1362). > > > > So, if we follow Matthew's suggestion we will be altering the current > > > > behavior which I assume is not what we want to do. > > > > > > Right, I'm intentionally changing the behaviour. 
My thinking is > > > that mbind(MPOL_MF_MOVE | MPOL_MF_STRICT) is going to fail. Should > > > such a failure actually move the movable pages before reporting that > > > it failed? I don't know. > > > > > > > The simple fix I was thinking about that would not alter this behavior > > > > is smth like this: > > > > > > I don't like it, but can we run it past syzbot to be sure it solves the > > > issue and we're not chasing a ghost here? > > > > Yes, I just finished running the reproducer on both upstream and > > linux-next builds listed in > > https://syzkaller.appspot.com/bug?extid=b591856e0f0139f83023 and the > > problem does not happen anymore. > > I'm fine with your suggestion too, just wanted to point out it would > > introduce change in the behavior. Let me know how you want to proceed. > > Well done, identifying the mysterious cause of this problem: > I'm glad to hear that you've now verified that hypothesis. > > You're right, it would be a regression to follow Matthew's suggestion. > > Traditionally, modulo bugs and inconsistencies, the queue_pages_range() > phase of do_mbind() has done the best it can, gathering all the pages it > can that need migration, even if some were missed; and proceeds to do the > mbind_range() phase if there was nothing "seriously" wrong (a gap causing > -EFAULT). Then at the end, if MPOL_MF_STRICT was set, and not all the > pages could be migrated (or MOVE was not specified and not all pages > were well placed), it returns -EIO rather than 0 to inform the caller > that not all could be done. > > There have been numerous tweaks, but I think most importantly > 5.3's d883544515aa ("mm: mempolicy: make the behavior consistent when > MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") added those "return 1"s > which stop the pagewalk early. In my opinion, not an improvement - makes > it harder to get mbind() to do the best job it can (or is it justified as > what you're asking for if you say STRICT?). 
Hi Suren and Hugh, Thanks for figuring this out. The mbind behavior is a little bit messy and hard to follow. I tried my best to recall all the changes. IIUC, mbind did break the vma iteration early in the first place, then commit 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()") changed the behavior (didn't break vma iteration early for some cases anymore), but it messed up the return value and caused some test cases failure, also violated the manual. The return value issue was fixed by commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified"), this commit also restored the oldest behavior (break loop early). But it also breaks the loop early when MPOL_MF_MOVE|MOVEALL is set, kernel should actually continue the loop to try to migrate all existing pages per the manual. It sounds like a regression. I will take a look at it. So the logic should conceptually look like: if (MPOL_MF_MOVE|MOVEALL) continue; if (MPOL_MF_STRICT) break; So it is still possible that some VMAs are not locked if only MPOL_MF_STRICT is set. > > But whatever, it would be a further regression for mbind() not to have > done the mbind_range(), even though it goes on to return -EIO. > > I had a bad first reaction to your walk_page_range() patch (was expecting > to see vma_start_write()s in mbind_range()), but perhaps your patch is > exactly what process_mm_walk_lock() does now demand. > > [Why is Hugh responding on this? Because I have some long-standing > mm/mempolicy.c patches to submit next week, but in reviewing what I > could or could not afford to get into at this time, had decided I'd > better stay out of queue_pages_range() for now - beyond the trivial > preferring an MPOL_MF_WRLOCK flag to your bool lock_vma.] > > Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
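Yang Shi's conceptual rule above can be written out as a small standalone sketch. The flag values and names here are hypothetical, chosen only to model the decision; the real MPOL_MF_* definitions live in the kernel's uapi headers:

```c
#include <assert.h>

/* Hypothetical flag bits for modelling the decision only. */
#define MF_STRICT   0x1u
#define MF_MOVE     0x2u
#define MF_MOVE_ALL 0x4u

enum walk_action { WALK_CONTINUE, WALK_BREAK };

/* The proposed rule when the walk hits an unmovable or misplaced page:
 * with MOVE/MOVE_ALL set, keep scanning so every migratable page in the
 * range still gets queued; with STRICT alone, the outcome (-EIO) is
 * already decided, so stopping early loses nothing. */
static enum walk_action on_problem_page(unsigned int flags)
{
	if (flags & (MF_MOVE | MF_MOVE_ALL))
		return WALK_CONTINUE;
	if (flags & MF_STRICT)
		return WALK_BREAK;
	return WALK_CONTINUE;
}
```

Note that MOVE|STRICT continues under this rule, which is exactly the MOVE-plus-STRICT case whose behavior is debated later in the thread.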
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-16 1:35 ` Yang Shi @ 2023-09-16 3:57 ` Hugh Dickins 2023-09-18 22:34 ` Yang Shi 0 siblings, 1 reply; 27+ messages in thread From: Hugh Dickins @ 2023-09-16 3:57 UTC (permalink / raw) To: Yang Shi Cc: Hugh Dickins, Suren Baghdasaryan, Matthew Wilcox, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Fri, 15 Sep 2023, Yang Shi wrote: > > Hi Suren and Hugh, > > Thanks for figuring this out. The mbind behavior is a little bit messy > and hard to follow. I tried my best to recall all the changes. Messy and confusing yes; and for every particular behavior, I suspect that by now there exists some release which has done it that way. > > IIUC, mbind did break the vma iteration early in the first place, then > commit 6f4576e3687b ("mempolicy: apply page table walker on > queue_pages_range()") changed the behavior (didn't break vma iteration > early for some cases anymore), but it messed up the return value and > caused some test cases failure, also violated the manual. The return > value issue was fixed by commit a7f40cfe3b7a ("mm: mempolicy: make > mbind() return -EIO when MPOL_MF_STRICT is specified"), this commit > also restored the oldest behavior (break loop early). But it also > breaks the loop early when MPOL_MF_MOVE|MOVEALL is set, kernel should > actually continue the loop to try to migrate all existing pages per > the manual. Oh, I missed that aspect in my description: yes, I think that's the worst of it: MPOL_MF_STRICT alone could break out early because it had nothing more to learn by going further, but it was simply a mistake for the MOVEs to break out early (and arguable what MOVE|STRICT should do). I thought you and I were going to have a debate about this, but we appear to be in agreement. 
And I'm not sure whether I agree with myself about whether do_mbind() should apply the mbind_range()s when STRICT queue_pages_range() found an unmovable - there are consistency and regression arguments both ways. (I've been repeatedly puzzled by your comment in queue_folios_pte_range() if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { /* MPOL_MF_STRICT must be specified if we get here */ if (!vma_migratable(vma)) { Does that commment about MPOL_MF_STRICT actually belong inside the !vma_migratable(vma) block? Sometimes I think so, but sometimes I remember that the interaction of those flags, and the skipping arranged by queue_pages_test_walk(), is subtler than I imagine.) > It sounds like a regression. I will take a look at it. Thanks! Please do, I don't have the time for it. > > So the logic should conceptually look like: > > if (MPOL_MF_MOVE|MOVEALL) > continue; > if (MPOL_MF_STRICT) > break; > > So it is still possible that some VMAs are not locked if only > MPOL_MF_STRICT is set. Conditionally, I'll agree; but it's too easy for me to agree in the course of trying to get an email out, but on later reflection come to disagree. STRICT|MOVE behavior arguable. I think the best I can do is send you (privately) my approx-v5.2 patch for this (which I never got time to put into even a Google-internal kernel, though an earlier version was there). In part because I did more research back then, and its commit message cites several even older commits than you cite above, which might help to shed more light on the history (or might just be wrong). And in part because it may give you some more ideas of what needs doing: notably qp->nr_failed, because "man 2 migrate_pages" says "On success migrate_pages() returns the number of pages that could not be moved", but we seem to have lost sight of that (from which one may conclude that it's not very important, but I did find it useful when testing); but of course the usual doubts about the right way to count a page when compound. 
I'll check how easily that patch applies to a known base such as v5.2, maybe trim it to fit better, then send it off to you. Hugh ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-16 3:57 ` Hugh Dickins @ 2023-09-18 22:34 ` Yang Shi 2023-09-19 0:34 ` Hugh Dickins 0 siblings, 1 reply; 27+ messages in thread From: Yang Shi @ 2023-09-18 22:34 UTC (permalink / raw) To: Hugh Dickins Cc: Suren Baghdasaryan, Matthew Wilcox, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs On Fri, Sep 15, 2023 at 8:57 PM Hugh Dickins <hughd@google.com> wrote: > > On Fri, 15 Sep 2023, Yang Shi wrote: > > > > Hi Suren and Hugh, > > > > Thanks for figuring this out. The mbind behavior is a little bit messy > > and hard to follow. I tried my best to recall all the changes. > > Messy and confusing yes; and for every particular behavior, I suspect > that by now there exists some release which has done it that way. > > > > > IIUC, mbind did break the vma iteration early in the first place, then > > commit 6f4576e3687b ("mempolicy: apply page table walker on > > queue_pages_range()") changed the behavior (didn't break vma iteration > > early for some cases anymore), but it messed up the return value and > > caused some test cases failure, also violated the manual. The return > > value issue was fixed by commit a7f40cfe3b7a ("mm: mempolicy: make > > mbind() return -EIO when MPOL_MF_STRICT is specified"), this commit > > also restored the oldest behavior (break loop early). But it also > > breaks the loop early when MPOL_MF_MOVE|MOVEALL is set, kernel should > > actually continue the loop to try to migrate all existing pages per > > the manual. > > Oh, I missed that aspect in my description: yes, I think that's the > worst of it: MPOL_MF_STRICT alone could break out early because it had > nothing more to learn by going further, but it was simply a mistake for > the MOVEs to break out early (and arguable what MOVE|STRICT should do). > > I thought you and I were going to have a debate about this, but we > appear to be in agreement. 
And I'm not sure whether I agree with > myself about whether do_mbind() should apply the mbind_range()s > when STRICT queue_pages_range() found an unmovable - there are > consistency and regression arguments both ways. They will not be added into the migration list in the first place. Why waste time to try to migrate the unmovable? > > (I've been repeatedly puzzled by your comment in queue_folios_pte_range() > if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { > /* MPOL_MF_STRICT must be specified if we get here */ > if (!vma_migratable(vma)) { > Does that commment about MPOL_MF_STRICT actually belong inside the > !vma_migratable(vma) block? Sometimes I think so, but sometimes I > remember that the interaction of those flags, and the skipping arranged > by queue_pages_test_walk(), is subtler than I imagine.) It is because the below code snippet from queue_pages_test_walk(): if (!vma_migratable(vma) && !(flags & MPOL_MF_STRICT)) return 1; When queue_pages_test_walk() returns 1, queue_folios_pte_range() will be skipped. So if queue_folios_pte_range() sees unmigratable vma, it means MPOL_MF_STRICT must be set. > > > It sounds like a regression. I will take a look at it. > > Thanks! Please do, I don't have the time for it. > > > > > So the logic should conceptually look like: > > > > if (MPOL_MF_MOVE|MOVEALL) > > continue; > > if (MPOL_MF_STRICT) > > break; > > > > So it is still possible that some VMAs are not locked if only > > MPOL_MF_STRICT is set. > > Conditionally, I'll agree; but it's too easy for me to agree in the > course of trying to get an email out, but on later reflection come > to disagree. STRICT|MOVE behavior arguable. 
I thought the code should conceptually do: if (MPOL_MF_MOVE|MOVEALL) scan all vmas try to migrate the existing pages return success else if (MPOL_MF_MOVE* | MPOL_MF_STRICT) scan all vmas try to migrate the existing pages return -EIO if unmovable or migration failed else /* MPOL_MF_STRICT alone */ break early if meets unmovable and don't call mbind_range() at all So the vma scan will just be skipped when MPOL_MF_STRICT alone is specified and mbind_range() won't be called in this case. So Suren's fix may not be needed. > > I think the best I can do is send you (privately) my approx-v5.2 patch > for this (which I never got time to put into even a Google-internal > kernel, though an earlier version was there). In part because I did > more research back then, and its commit message cites several even > older commits than you cite above, which might help to shed more light > on the history (or might just be wrong). And in part because it may > give you some more ideas of what needs doing: notably qp->nr_failed, > because "man 2 migrate_pages" says "On success migrate_pages() returns > the number of pages that could not be moved", but we seem to have > lost sight of that (from which one may conclude that it's not very > important, but I did find it useful when testing); but of course > the usual doubts about the right way to count a page when compound. > > I'll check how easily that patch applies to a known base such as > v5.2, maybe trim it to fit better, then send it off to you. 
I'm thinking about the below fix (build test against the latest mm-unstable only): diff --git a/mm/mempolicy.c b/mm/mempolicy.c index 42b5567e3773..c9b768a042a8 100644 --- a/mm/mempolicy.c +++ b/mm/mempolicy.c @@ -426,6 +426,7 @@ struct queue_pages { unsigned long start; unsigned long end; struct vm_area_struct *first; + bool has_unmovable; }; /* @@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio, /* * queue_folios_pmd() has three possible return values: * 0 - folios are placed on the right node or queued successfully, or - * special page is met, i.e. huge zero page. - * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were - * specified. + * special page is met, i.e. zero page, or unmovable page is found + * but continue walking (indicated by queue_pages.has_unmovable). * -EIO - is migration entry or only MPOL_MF_STRICT was specified and an * existing folio was already on a node that does not follow the * policy. @@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr, if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { if (!vma_migratable(walk->vma) || migrate_folio_add(folio, qp->pagelist, flags)) { - ret = 1; + qp->has_unmovable |= 1; goto unlock; } } else @@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t *ptl, unsigned long addr, * * queue_folios_pte_range() has three possible return values: * 0 - folios are placed on the right node or queued successfully, or - * special page is met, i.e. zero page. - * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were - * specified. + * special page is met, i.e. zero page, or unmovable page is found + * but continue walking (indicated by queue_pages.has_unmovable). * -EIO - only MPOL_MF_STRICT was specified and an existing folio was already * on a node that does not follow the policy. 
*/ @@ -538,10 +537,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, if (!queue_folio_required(folio, qp)) continue; if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { - /* MPOL_MF_STRICT must be specified if we get here */ + /* + * MPOL_MF_STRICT must be specified if we get here. + * Continue walking vmas due to MPOL_MF_MOVE* flags. + */ if (!vma_migratable(vma)) { - has_unmovable = true; - break; + qp->has_unmovable |= 1; + continue; } /* @@ -550,16 +552,13 @@ static int queue_folios_pte_range(pmd_t *pmd, unsigned long addr, * need migrate other LRU pages. */ if (migrate_folio_add(folio, qp->pagelist, flags)) - has_unmovable = true; + has_unmovable |= 1; } else break; } pte_unmap_unlock(mapped_pte, ptl); cond_resched(); - if (has_unmovable) - return 1; - return addr != end ? -EIO : 0; } @@ -599,7 +598,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, * Detecting misplaced folio but allow migrating folios which * have been queued. */ - ret = 1; + qp->has_unmovable |= 1; goto unlock; } @@ -620,7 +619,7 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask, * Failed to isolate folio but allow migrating pages * which have been queued. */ - ret = 1; + qp->has_unmovable |= 1; } unlock: spin_unlock(ptl); @@ -756,12 +755,15 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end, .start = start, .end = end, .first = NULL, + .has_unmovable = false, }; const struct mm_walk_ops *ops = lock_vma ? 
&queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops; err = walk_page_range(mm, start, end, ops, &qp); + if (qp.has_unmovable) + err = 1; if (!qp.first) /* whole range in hole */ err = -EFAULT; @@ -1358,7 +1360,7 @@ static long do_mbind(unsigned long start, unsigned long len, putback_movable_pages(&pagelist); } - if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT))) + if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT)) err = -EIO; } else { up_out: > > Hugh ^ permalink raw reply related [flat|nested] 27+ messages in thread
* Re: [syzbot] [mm?] kernel BUG in vma_replace_policy 2023-09-18 22:34 ` Yang Shi @ 2023-09-19 0:34 ` Hugh Dickins 0 siblings, 0 replies; 27+ messages in thread From: Hugh Dickins @ 2023-09-19 0:34 UTC (permalink / raw) To: Yang Shi Cc: Hugh Dickins, Suren Baghdasaryan, Matthew Wilcox, Michal Hocko, Vlastimil Babka, syzbot, akpm, linux-kernel, linux-mm, syzkaller-bugs [-- Attachment #1: Type: text/plain, Size: 12429 bytes --] On Mon, 18 Sep 2023, Yang Shi wrote: > On Fri, Sep 15, 2023 at 8:57 PM Hugh Dickins <hughd@google.com> wrote: > > On Fri, 15 Sep 2023, Yang Shi wrote: > > > > > > Hi Suren and Hugh, > > > > > > Thanks for figuring this out. The mbind behavior is a little bit messy > > > and hard to follow. I tried my best to recall all the changes. > > > > Messy and confusing yes; and for every particular behavior, I suspect > > that by now there exists some release which has done it that way. > > > > > > > > IIUC, mbind did break the vma iteration early in the first place, then > > > commit 6f4576e3687b ("mempolicy: apply page table walker on > > > queue_pages_range()") changed the behavior (didn't break vma iteration > > > early for some cases anymore), but it messed up the return value and > > > caused some test cases failure, also violated the manual. The return > > > value issue was fixed by commit a7f40cfe3b7a ("mm: mempolicy: make > > > mbind() return -EIO when MPOL_MF_STRICT is specified"), this commit > > > also restored the oldest behavior (break loop early). But it also > > > breaks the loop early when MPOL_MF_MOVE|MOVEALL is set, kernel should > > > actually continue the loop to try to migrate all existing pages per > > > the manual. > > > > Oh, I missed that aspect in my description: yes, I think that's the > > worst of it: MPOL_MF_STRICT alone could break out early because it had > > nothing more to learn by going further, but it was simply a mistake for > > the MOVEs to break out early (and arguable what MOVE|STRICT should do). 
> > > > I thought you and I were going to have a debate about this, but we > > appear to be in agreement. And I'm not sure whether I agree with > > myself about whether do_mbind() should apply the mbind_range()s > > when STRICT queue_pages_range() found an unmovable - there are > > consistency and regression arguments both ways. > > They will not be added into the migration list in the first place. Why > waste time to try to migrate the unmovable? I don't understand you there. I was not proposing to try to migrate the unmovable. My doubts were really all about how to make sense of mbind() sometimes failing with EFAULT, in which case it has not applied the mbind_range()s, versus sometimes failing with EIO, in which case it may or may not have applied the mbind_range()s. And I've come to the conclusion (partially driven by precedent) that it makes best sense to imagine the collection of folios on pagelist as a part of MOVE's migration stage, and just an implementation detail that it happens to be done before the mbind_range()s. So when there's a MOVE involved, STRICT's EIO says that the mbind_ranges() were applied (but migrations were incomplete); but when no MOVE involved, EIO says that the mbind_range()s were not applied (because it's being STRICT). I don't think there's any disagreement between us on this: it was just hard for me to reach an understanding of behavior which I could defend. > > > > > (I've been repeatedly puzzled by your comment in queue_folios_pte_range() > > if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { > > /* MPOL_MF_STRICT must be specified if we get here */ > > if (!vma_migratable(vma)) { > > Does that commment about MPOL_MF_STRICT actually belong inside the > > !vma_migratable(vma) block? Sometimes I think so, but sometimes I > > remember that the interaction of those flags, and the skipping arranged > > by queue_pages_test_walk(), is subtler than I imagine.) 
> > It is because the below code snippet from queue_pages_test_walk(): > > if (!vma_migratable(vma) && > !(flags & MPOL_MF_STRICT)) > return 1; > > When queue_pages_test_walk() returns 1, queue_folios_pte_range() will > be skipped. So if queue_folios_pte_range() sees unmigratable vma, it > means MPOL_MF_STRICT must be set. Thanks, yes, I eventually came to see that, once I got back into the code (I had been right to remember "subtler than I imagine" above). Though I don't think there's any good reason for the queueing code to have to depend on such subtleties. > > > > > > It sounds like a regression. I will take a look at it. At one point I was thinking it a regression in all the MOVE cases; but it's only in the STRICT MOVE case, so maybe not so significant. > > > > Thanks! Please do, I don't have the time for it. I came back in private mail to say that I'd not managed a buildable v5.2 version of my qp->nr_failed patch, so reluctantly put in the time to think through it all again, and do a v6.6-rc1 version to add into my mm/mempolicy series. I have that now, I'll send you the preview privately in a moment; but leave posting it publicly until I've finished the commit messages for all the series. > > > > > > > > So the logic should conceptually look like: > > > > > > if (MPOL_MF_MOVE|MOVEALL) > > > continue; > > > if (MPOL_MF_STRICT) > > > break; > > > > > > So it is still possible that some VMAs are not locked if only > > > MPOL_MF_STRICT is set. > > > > Conditionally, I'll agree; but it's too easy for me to agree in the > > course of trying to get an email out, but on later reflection come > > to disagree. STRICT|MOVE behavior arguable. 
> > I thought the code should conceptually do: > > if (MPOL_MF_MOVE|MOVEALL) > scan all vmas > try to migrate the existing pages > return success > else if (MPOL_MF_MOVE* | MPOL_MF_STRICT) > scan all vmas > try to migrate the existing pages > return -EIO if unmovable or migration failed > else /* MPOL_MF_STRICT alone */ > break early if meets unmovable and don't call mbind_range() at all else /* none of those flags */ check the ranges in test_walk, EFAULT without mbind_range() if discontig. Yes: to quote my own patch: static bool strictly_unmovable(unsigned long flags) { /* * STRICT without MOVE flags lets do_mbind() fail immediately with -EIO * if any misplaced page is found. */ return (flags & (MPOL_MF_STRICT | MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) == MPOL_MF_STRICT; } > > So the vma scan will just be skipped when MPOL_MF_STRICT alone is > specified and mbind_range() won't be called in this case. So Suren's > fix may not be needed. Yes, Suren's fix can be reverted when your patch or mine goes in; but Suren's is important for fixing the VMA locking issue meanwhile. > > > > > I think the best I can do is send you (privately) my approx-v5.2 patch > > for this (which I never got time to put into even a Google-internal > > kernel, though an earlier version was there). In part because I did > > more research back then, and its commit message cites several even > > older commits than you cite above, which might help to shed more light > > on the history (or might just be wrong). And in part because it may > > give you some more ideas of what needs doing: notably qp->nr_failed, > > because "man 2 migrate_pages" says "On success migrate_pages() returns > > the number of pages that could not be moved", but we seem to have > > lost sight of that (from which one may conclude that it's not very > > important, but I did find it useful when testing); but of course > > the usual doubts about the right way to count a page when compound. 
> > > > I'll check how easily that patch applies to a known base such as > > v5.2, maybe trim it to fit better, then send it off to you. > > I'm thinking about the below fix (build test against the latest > mm-unstable only): Yes, that looks about right (more "|="ing than necessary, for something that's only going to be set to 1, er, I think would better be "true"). And it's much smaller (rightly so if it's aimed at v6.6) than my patch which is aimed at v6.7: mine doing quite a bit of cleanup, along with the qp->nr_failed instead of your qp->has_unmovable, in order that migrate_pages(2) can return the promised number of pages that could not be moved. Hugh > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c > index 42b5567e3773..c9b768a042a8 100644 > --- a/mm/mempolicy.c > +++ b/mm/mempolicy.c > @@ -426,6 +426,7 @@ struct queue_pages { > unsigned long start; > unsigned long end; > struct vm_area_struct *first; > + bool has_unmovable; > }; > > /* > @@ -446,9 +447,8 @@ static inline bool queue_folio_required(struct folio *folio, > /* > * queue_folios_pmd() has three possible return values: > * 0 - folios are placed on the right node or queued successfully, or > - * special page is met, i.e. huge zero page. > - * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were > - * specified. > + * special page is met, i.e. zero page, or unmovable page is found > + * but continue walking (indicated by queue_pages.has_unmovable). > * -EIO - is migration entry or only MPOL_MF_STRICT was specified and an > * existing folio was already on a node that does not follow the > * policy. 
> @@ -479,7 +479,7 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t > *ptl, unsigned long addr, > if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { > if (!vma_migratable(walk->vma) || > migrate_folio_add(folio, qp->pagelist, flags)) { > - ret = 1; > + qp->has_unmovable |= 1; > goto unlock; > } > } else > @@ -495,9 +495,8 @@ static int queue_folios_pmd(pmd_t *pmd, spinlock_t > *ptl, unsigned long addr, > * > * queue_folios_pte_range() has three possible return values: > * 0 - folios are placed on the right node or queued successfully, or > - * special page is met, i.e. zero page. > - * 1 - there is unmovable folio, and MPOL_MF_MOVE* & MPOL_MF_STRICT were > - * specified. > + * special page is met, i.e. zero page, or unmovable page is found > + * but continue walking (indicated by queue_pages.has_unmovable). > * -EIO - only MPOL_MF_STRICT was specified and an existing folio was already > * on a node that does not follow the policy. > */ > @@ -538,10 +537,13 @@ static int queue_folios_pte_range(pmd_t *pmd, > unsigned long addr, > if (!queue_folio_required(folio, qp)) > continue; > if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) { > - /* MPOL_MF_STRICT must be specified if we get here */ > + /* > + * MPOL_MF_STRICT must be specified if we get here. > + * Continue walking vmas due to MPOL_MF_MOVE* flags. > + */ > if (!vma_migratable(vma)) { > - has_unmovable = true; > - break; > + qp->has_unmovable |= 1; > + continue; > } > > /* > @@ -550,16 +552,13 @@ static int queue_folios_pte_range(pmd_t *pmd, > unsigned long addr, > * need migrate other LRU pages. > */ > if (migrate_folio_add(folio, qp->pagelist, flags)) > - has_unmovable = true; > + has_unmovable |= 1; > } else > break; > } > pte_unmap_unlock(mapped_pte, ptl); > cond_resched(); > > - if (has_unmovable) > - return 1; > - > return addr != end ? 
-EIO : 0; > } > > @@ -599,7 +598,7 @@ static int queue_folios_hugetlb(pte_t *pte, > unsigned long hmask, > * Detecting misplaced folio but allow migrating folios which > * have been queued. > */ > - ret = 1; > + qp->has_unmovable |= 1; > goto unlock; > } > > @@ -620,7 +619,7 @@ static int queue_folios_hugetlb(pte_t *pte, > unsigned long hmask, > * Failed to isolate folio but allow migrating pages > * which have been queued. > */ > - ret = 1; > + qp->has_unmovable |= 1; > } > unlock: > spin_unlock(ptl); > @@ -756,12 +755,15 @@ queue_pages_range(struct mm_struct *mm, unsigned > long start, unsigned long end, > .start = start, > .end = end, > .first = NULL, > + .has_unmovable = false, > }; > const struct mm_walk_ops *ops = lock_vma ? > &queue_pages_lock_vma_walk_ops : &queue_pages_walk_ops; > > err = walk_page_range(mm, start, end, ops, &qp); > > + if (qp.has_unmovable) > + err = 1; > if (!qp.first) > /* whole range in hole */ > err = -EFAULT; > @@ -1358,7 +1360,7 @@ static long do_mbind(unsigned long start, > unsigned long len, > putback_movable_pages(&pagelist); > } > > - if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT))) > + if (((ret > 0) || nr_failed) && (flags & MPOL_MF_STRICT)) > err = -EIO; > } else { > up_out: ^ permalink raw reply [flat|nested] 27+ messages in thread
end of thread, other threads:[~2023-09-19 0:34 UTC | newest] Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20230909034207.5816-1-hdanton@sina.com> 2023-09-09 4:43 ` [syzbot] [mm?] kernel BUG in vma_replace_policy syzbot 2023-09-06 1:03 syzbot 2023-09-08 18:04 ` syzbot 2023-09-12 5:30 ` Matthew Wilcox 2023-09-12 6:09 ` syzbot 2023-09-12 14:55 ` Matthew Wilcox 2023-09-12 15:03 ` Suren Baghdasaryan 2023-09-12 16:00 ` Suren Baghdasaryan 2023-09-13 16:05 ` Suren Baghdasaryan 2023-09-13 16:46 ` Suren Baghdasaryan 2023-09-14 18:20 ` Suren Baghdasaryan 2023-09-14 19:09 ` Matthew Wilcox 2023-09-14 20:00 ` Suren Baghdasaryan 2023-09-14 20:53 ` Suren Baghdasaryan 2023-09-14 21:24 ` Matthew Wilcox 2023-09-14 22:21 ` Suren Baghdasaryan 2023-09-15 4:26 ` Hugh Dickins 2023-09-15 16:09 ` Suren Baghdasaryan 2023-09-15 18:05 ` Suren Baghdasaryan 2023-09-16 2:43 ` Hugh Dickins 2023-09-18 21:20 ` Suren Baghdasaryan 2023-09-15 18:26 ` Matthew Wilcox 2023-09-16 2:54 ` Hugh Dickins 2023-09-16 1:35 ` Yang Shi 2023-09-16 3:57 ` Hugh Dickins 2023-09-18 22:34 ` Yang Shi 2023-09-19 0:34 ` Hugh Dickins