linux-mm.kvack.org archive mirror
* [syzbot] general protection fault in PageHeadHuge
@ 2022-09-23 16:05 syzbot
  2022-09-23 21:11 ` Mike Kravetz
  0 siblings, 1 reply; 24+ messages in thread
From: syzbot @ 2022-09-23 16:05 UTC (permalink / raw)
  To: akpm, linux-kernel, linux-mm, llvm, mike.kravetz, nathan,
	ndesaulniers, songmuchun, syzkaller-bugs, trix

Hello,

syzbot found the following issue on:

HEAD commit:    483fed3b5dc8 Add linux-next specific files for 20220921
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=16f0a418880000
kernel config:  https://syzkaller.appspot.com/x/.config?x=849cb9f70f15b1ba
dashboard link: https://syzkaller.appspot.com/bug?extid=152d76c44ba142f8992b
compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12b97b64880000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11fb9040880000

Downloadable assets:
disk image: https://storage.googleapis.com/syzbot-assets/1cb3f4618323/disk-483fed3b.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/cc02cb30b495/vmlinux-483fed3b.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+152d76c44ba142f8992b@syzkaller.appspotmail.com

general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
CPU: 1 PID: 3617 Comm: syz-executor722 Not tainted 6.0.0-rc6-next-20220921-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/16/2022
RIP: 0010:PagePoisoned include/linux/page-flags.h:304 [inline]
RIP: 0010:PageHead include/linux/page-flags.h:787 [inline]
RIP: 0010:PageHeadHuge+0x1d/0x200 mm/hugetlb.c:1892
Code: ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 54 55 48 89 fd 53 e8 54 c9 b9 ff 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 a2 01 00 00 48 8b 5d 00 48 c7 c7 ff ff ff ff 48
RSP: 0018:ffffc90003e7f5a0 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffc90003e7f788 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81c2cb2c RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90003e7f798
R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000003f4
FS:  00007f5642262700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000203f4ef0 CR3: 000000007adcc000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 folio_test_hugetlb include/linux/page-flags.h:831 [inline]
 folio_file_page include/linux/pagemap.h:683 [inline]
 shmem_fault+0x27c/0x8a0 mm/shmem.c:2130
 __do_fault+0x107/0x600 mm/memory.c:4191
 do_shared_fault mm/memory.c:4597 [inline]
 do_fault mm/memory.c:4675 [inline]
 handle_pte_fault mm/memory.c:4943 [inline]
 __handle_mm_fault+0x2200/0x3a40 mm/memory.c:5085
 handle_mm_fault+0x1c8/0x780 mm/memory.c:5206
 do_user_addr_fault+0x475/0x1210 arch/x86/mm/fault.c:1428
 handle_page_fault arch/x86/mm/fault.c:1519 [inline]
 exc_page_fault+0x94/0x170 arch/x86/mm/fault.c:1575
 asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
RIP: 0010:__put_user_nocheck_4+0x3/0x11
Code: 00 00 48 39 d9 73 54 0f 01 cb 66 89 01 31 c9 0f 01 ca c3 0f 1f 44 00 00 48 bb fd ef ff ff ff 7f 00 00 48 39 d9 73 34 0f 01 cb <89> 01 31 c9 0f 01 ca c3 66 0f 1f 44 00 00 48 bb f9 ef ff ff ff 7f
RSP: 0018:ffffc90003e7fa00 EFLAGS: 00050293
RAX: 0000000000000000 RBX: ffffc90003e7fdf4 RCX: 00000000203f4ef0
RDX: ffff888020c51d40 RSI: ffffffff8726d52f RDI: 0000000000000005
RBP: ffffc90003e7fdb0 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000
R13: 0000000000000002 R14: 00000000203f4ef0 R15: 0000000000000000
 ____sys_recvmsg+0x3ba/0x610 net/socket.c:2714
 ___sys_recvmsg+0xf2/0x180 net/socket.c:2743
 do_recvmmsg+0x25e/0x6e0 net/socket.c:2837
 __sys_recvmmsg net/socket.c:2916 [inline]
 __do_sys_recvmmsg net/socket.c:2939 [inline]
 __se_sys_recvmmsg net/socket.c:2932 [inline]
 __x64_sys_recvmmsg+0x20b/0x260 net/socket.c:2932
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f56422dabb9
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 91 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f5642262208 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
RAX: ffffffffffffffda RBX: 00007f564235c4b8 RCX: 00007f56422dabb9
RDX: 0000000000010106 RSI: 00000000200000c0 RDI: 0000000000000003
RBP: 00007f564235c4b0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000002 R11: 0000000000000246 R12: 00007f564235c4bc
R13: 00007fffbde3618f R14: 00007f5642262300 R15: 0000000000022000
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:PagePoisoned include/linux/page-flags.h:304 [inline]
RIP: 0010:PageHead include/linux/page-flags.h:787 [inline]
RIP: 0010:PageHeadHuge+0x1d/0x200 mm/hugetlb.c:1892
Code: ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 54 55 48 89 fd 53 e8 54 c9 b9 ff 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 a2 01 00 00 48 8b 5d 00 48 c7 c7 ff ff ff ff 48
RSP: 0018:ffffc90003e7f5a0 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffc90003e7f788 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff81c2cb2c RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90003e7f798
R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000003f4
FS:  00007f5642262700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f56419ff718 CR3: 000000007adcc000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
----------------
Code disassembly (best guess):
   0:	ff 66 66             	jmpq   *0x66(%rsi)
   3:	2e 0f 1f 84 00 00 00 	nopl   %cs:0x0(%rax,%rax,1)
   a:	00 00
   c:	90                   	nop
   d:	41 54                	push   %r12
   f:	55                   	push   %rbp
  10:	48 89 fd             	mov    %rdi,%rbp
  13:	53                   	push   %rbx
  14:	e8 54 c9 b9 ff       	callq  0xffb9c96d
  19:	48 89 ea             	mov    %rbp,%rdx
  1c:	48 b8 00 00 00 00 00 	movabs $0xdffffc0000000000,%rax
  23:	fc ff df
  26:	48 c1 ea 03          	shr    $0x3,%rdx
* 2a:	80 3c 02 00          	cmpb   $0x0,(%rdx,%rax,1) <-- trapping instruction
  2e:	0f 85 a2 01 00 00    	jne    0x1d6
  34:	48 8b 5d 00          	mov    0x0(%rbp),%rbx
  38:	48 c7 c7 ff ff ff ff 	mov    $0xffffffffffffffff,%rdi
  3f:	48                   	rex.W


---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this issue, for details see:
https://goo.gl/tpsmEJ#testing-patches


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-23 16:05 [syzbot] general protection fault in PageHeadHuge syzbot
@ 2022-09-23 21:11 ` Mike Kravetz
  2022-09-23 21:37   ` Yang Shi
                     ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Mike Kravetz @ 2022-09-23 21:11 UTC (permalink / raw)
  To: Matthew Wilcox, syzbot
  Cc: Hugh Dickins, akpm, linux-kernel, linux-mm, llvm, nathan,
	ndesaulniers, songmuchun, syzkaller-bugs, trix

On 09/23/22 09:05, syzbot wrote:
> [ syzbot report trimmed; see the full report above ]

While it is true that the addressing exception happens in the hugetlb
routine PageHeadHuge(), the root cause is that it is passed a NULL page
pointer, via the call to folio_file_page() at the end of shmem_fault():

	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
				  gfp, vma, vmf, &ret);
	if (err)
		return vmf_error(err);

	vmf->page = folio_file_page(folio, vmf->pgoff);
	return ret;

The code assumes that if shmem_get_folio_gfp() returns zero (success),
then the folio pointer will have been set to a folio.  However, there
are a few places in shmem_get_folio_gfp() where it returns zero without
setting folio.

In this specific case, it is the code block:

        if (vma && userfaultfd_missing(vma)) {
                *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
                return 0;
        }

I could try to sort this out, but I believe Matthew has the most context as
he has been changing this code recently.
-- 
Mike Kravetz



* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-23 21:11 ` Mike Kravetz
@ 2022-09-23 21:37   ` Yang Shi
  2022-09-24  0:14   ` Hugh Dickins
  2022-09-24 21:56   ` Matthew Wilcox
  2 siblings, 0 replies; 24+ messages in thread
From: Yang Shi @ 2022-09-23 21:37 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Matthew Wilcox, syzbot, Hugh Dickins, akpm, linux-kernel,
	linux-mm, llvm, nathan, ndesaulniers, songmuchun, syzkaller-bugs,
	trix

On Fri, Sep 23, 2022 at 2:11 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 09/23/22 09:05, syzbot wrote:
> > [ syzbot report trimmed; see the full report above ]
>
> While it is true that the addressing exception happens in the hugetlb
> routine PageHeadHuge(), the root cause is that it is passed a NULL page
> pointer, via the call to folio_file_page() at the end of shmem_fault():
>
>         err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
>                                   gfp, vma, vmf, &ret);
>         if (err)
>                 return vmf_error(err);
>
>         vmf->page = folio_file_page(folio, vmf->pgoff);
>         return ret;
>
> The code assumes that if shmem_get_folio_gfp() returns zero (success),
> then the folio pointer will have been set to a folio.  However, there
> are a few places in shmem_get_folio_gfp() where it returns zero without
> setting folio.
>
> In this specific case, it is the code block:
>
>         if (vma && userfaultfd_missing(vma)) {
>                 *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
>                 return 0;
>         }

This is not the first time we have been tricked by the return value of
shmem_get_folio_gfp() (or its old name, shmem_get_page_gfp()), sometimes
in the other direction; see
https://lore.kernel.org/linux-mm/CAHk-=whEG9pOPmVEYw+_uruxgHZLh6ewc7MmZXGBWjuBOwFB+Q@mail.gmail.com/

>
> I could try to sort this out, but I believe Matthew has the most context as
> he has been changing this code recently.
> --
> Mike Kravetz
>



* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-23 21:11 ` Mike Kravetz
  2022-09-23 21:37   ` Yang Shi
@ 2022-09-24  0:14   ` Hugh Dickins
  2022-09-24  0:18     ` Nathan Chancellor
                       ` (2 more replies)
  2022-09-24 21:56   ` Matthew Wilcox
  2 siblings, 3 replies; 24+ messages in thread
From: Hugh Dickins @ 2022-09-24  0:14 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Peter Xu, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	Hugh Dickins, akpm, linux-kernel, linux-mm, llvm, nathan,
	ndesaulniers, songmuchun, syzkaller-bugs, trix

On Fri, 23 Sep 2022, Mike Kravetz wrote:
> On 09/23/22 09:05, syzbot wrote:
> > [ syzbot report trimmed; see the full report above ]
> 
> While it is true that the addressing exception happens in the hugetlb
> routine PageHeadHuge(), the root cause is that it is passed a NULL page
> pointer, via the call to folio_file_page() at the end of shmem_fault():
> 
> 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> 				  gfp, vma, vmf, &ret);
> 	if (err)
> 		return vmf_error(err);
> 
> 	vmf->page = folio_file_page(folio, vmf->pgoff);
> 	return ret;
> 
> The code assumes that if shmem_get_folio_gfp() returns zero (success),
> then the folio pointer will have been set to a folio.  However, there
> are a few places in shmem_get_folio_gfp() where it returns zero without
> setting folio.
> 
> In this specific case, it is the code block:
> 
>         if (vma && userfaultfd_missing(vma)) {
>                 *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
>                 return 0;
>         }
> 
> I could try to sort this out, but I believe Matthew has the most context as
> he has been changing this code recently.

Thanks Mike.  Yes, it looks like userfaultfd's successful returns from
shmem_get_folio_gfp() have (correctly?) filled in vmf->page directly,
and are not setting *foliop at all.  So this would affect any
userfaultfd usage of shmem, not just an odd syzbot corner case.

I do think shmem_fault() ought at least to initialize its folio pointer
(as shmem_file_read_iter() correctly does); and if it does that, then we
don't need to weed out the userfaultfd cases as such, but just avoid
folio_file_page() when folio is still NULL.

I've neither tried the syzbot test case (though I see its .c does involve
userfaultfd), nor tried userfaultfd shmem: Peter, could you give this a
try - we'd expect userfaultfd shmem to crash on linux-next without the
patch below.

And it implies that nobody has been trying userfaultfd shmem on recent
linux-next, so whether vmf->page is then set correctly we're not sure.

But me, I cannot even get the latest linux-next to boot: next job.

Hugh

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
-	struct folio *folio;
+	struct folio *folio = NULL;
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
@@ -2127,7 +2127,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
-	vmf->page = folio_file_page(folio, vmf->pgoff);
+	if (folio)
+		vmf->page = folio_file_page(folio, vmf->pgoff);
 	return ret;
 }
 



* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-24  0:14   ` Hugh Dickins
@ 2022-09-24  0:18     ` Nathan Chancellor
  2022-09-24  7:28       ` Hugh Dickins
  2022-09-24  0:58     ` Peter Xu
  2022-09-25 18:55     ` Andrew Morton
  2 siblings, 1 reply; 24+ messages in thread
From: Nathan Chancellor @ 2022-09-24  0:18 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mike Kravetz, Peter Xu, Axel Rasmussen, Yang Shi, Matthew Wilcox,
	syzbot, akpm, linux-kernel, linux-mm, llvm, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On Fri, Sep 23, 2022 at 05:14:38PM -0700, Hugh Dickins wrote:
> On Fri, 23 Sep 2022, Mike Kravetz wrote:
> > On 09/23/22 09:05, syzbot wrote:
> > > [ syzbot report trimmed; see the full report above ]
> > > RSP: 0018:ffffc90003e7fa00 EFLAGS: 00050293
> > > RAX: 0000000000000000 RBX: ffffc90003e7fdf4 RCX: 00000000203f4ef0
> > > RDX: ffff888020c51d40 RSI: ffffffff8726d52f RDI: 0000000000000005
> > > RBP: ffffc90003e7fdb0 R08: 0000000000000005 R09: 0000000000000000
> > > R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000
> > > R13: 0000000000000002 R14: 00000000203f4ef0 R15: 0000000000000000
> > >  ____sys_recvmsg+0x3ba/0x610 net/socket.c:2714
> > >  ___sys_recvmsg+0xf2/0x180 net/socket.c:2743
> > >  do_recvmmsg+0x25e/0x6e0 net/socket.c:2837
> > >  __sys_recvmmsg net/socket.c:2916 [inline]
> > >  __do_sys_recvmmsg net/socket.c:2939 [inline]
> > >  __se_sys_recvmmsg net/socket.c:2932 [inline]
> > >  __x64_sys_recvmmsg+0x20b/0x260 net/socket.c:2932
> > >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > >  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > RIP: 0033:0x7f56422dabb9
> > > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 91 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > > RSP: 002b:00007f5642262208 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
> > > RAX: ffffffffffffffda RBX: 00007f564235c4b8 RCX: 00007f56422dabb9
> > > RDX: 0000000000010106 RSI: 00000000200000c0 RDI: 0000000000000003
> > > RBP: 00007f564235c4b0 R08: 0000000000000000 R09: 0000000000000000
> > > R10: 0000000000000002 R11: 0000000000000246 R12: 00007f564235c4bc
> > > R13: 00007fffbde3618f R14: 00007f5642262300 R15: 0000000000022000
> > >  </TASK>
> > > Modules linked in:
> > > ---[ end trace 0000000000000000 ]---
> > 
> > While it is true that the addressing exception happens in the hugetlb
> > routine PageHeadHuge(), the reason is that it is passed a NULL page
> > pointer, via the call to folio_file_page() at the end of shmem_fault():
> > 
> > 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> > 				  gfp, vma, vmf, &ret);
> > 	if (err)
> > 		return vmf_error(err);
> > 
> > 	vmf->page = folio_file_page(folio, vmf->pgoff);
> > 	return ret;
> > 
> > The code assumes that if zero is returned from shmem_get_folio_gfp,
> > then the folio pointer will have been set to a folio.  However, it looks
> > like there are a few places in shmem_get_folio_gfp where it will return
> > zero and not set folio.
> > 
> > In this specific case, it is the code block:
> > 
> >         if (vma && userfaultfd_missing(vma)) {
> >                 *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
> >                 return 0;
> >         }
> > 
> > I could try to sort this out, but I believe Matthew has the most context as
> > he has been changing this code recently.
> 
> Thanks Mike.  Yes, it looks like userfaultfd's successful returns from
> shmem_get_folio_gfp() have (correctly?) filled in vmf->page directly,
> and are not setting *foliop at all.  So this would affect any
> userfaultfd usage of shmem, not just an odd syzbot corner case.
> 
> I do think shmem_fault() ought at least to initialize its folio pointer
> (as shmem_file_read_iter() correctly does); and if it does that, then we
> don't need to weed out the userfaultfd cases as such, but just avoid
> folio_file_page() when folio is still NULL.
> 
> I've neither tried the syzbot test case (but see its .c does involve
> userfaultfd), nor tried userfaultfd shmem: Peter, could you give this a
> try - we'd expect userfaultfd shmem to crash on linux-next without the
> patch below.
> 
> And it implies that nobody has been trying userfaultfd shmem on recent
> linux-next, so whether vmf->page is then set correctly we're not sure.
> 
> But me, I cannot even get the latest linux-next to boot: next job.

Maybe I can save you a bisect:

https://lore.kernel.org/all/Yy4GoMMwiBaq3oJf@dev-arch.thelio-3990X/

> Hugh
> 
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct inode *inode = file_inode(vma->vm_file);
>  	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> -	struct folio *folio;
> +	struct folio *folio = NULL;
>  	int err;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
>  
> @@ -2127,7 +2127,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  				  gfp, vma, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
> -	vmf->page = folio_file_page(folio, vmf->pgoff);
> +	if (folio)
> +		vmf->page = folio_file_page(folio, vmf->pgoff);
>  	return ret;
>  }
>  


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-24  0:14   ` Hugh Dickins
  2022-09-24  0:18     ` Nathan Chancellor
@ 2022-09-24  0:58     ` Peter Xu
  2022-09-24 15:06       ` Peter Xu
  2022-09-25 18:55     ` Andrew Morton
  2 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-09-24  0:58 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mike Kravetz, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

Hi, Hugh, Mike,

On Fri, Sep 23, 2022 at 05:14:38PM -0700, Hugh Dickins wrote:
> On Fri, 23 Sep 2022, Mike Kravetz wrote:
> > On 09/23/22 09:05, syzbot wrote:
> > > [syzbot report snipped]
> > 
> > While it is true that the addressing exception happens in the hugetlb
> > routine PageHeadHuge(), the reason is that it is passed a NULL page
> > pointer, via the call to folio_file_page() at the end of shmem_fault():
> > 
> > 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> > 				  gfp, vma, vmf, &ret);
> > 	if (err)
> > 		return vmf_error(err);
> > 
> > 	vmf->page = folio_file_page(folio, vmf->pgoff);
> > 	return ret;
> > 
> > The code assumes that if zero is returned from shmem_get_folio_gfp,
> > then the folio pointer will have been set to a folio.  However, it looks
> > like there are a few places in shmem_get_folio_gfp where it will return
> > zero and not set folio.
> > 
> > In this specific case, it is the code block:
> > 
> >         if (vma && userfaultfd_missing(vma)) {
> >                 *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
> >                 return 0;
> >         }
> > 
> > I could try to sort this out, but I believe Matthew has the most context as
> > he has been changing this code recently.
> 
> Thanks Mike.  Yes, it looks like userfaultfd's successful returns from
> shmem_get_folio_gfp() have (correctly?) filled in vmf->page directly,

AFAICT handle_userfault() doesn't set vmf->page; it relies on setting
"*fault_type" to VM_FAULT_RETRY so the fault will be retried, on the
assumption that the pgtable has been installed properly by the time it
returns.
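
To make that concrete, here is a hypothetical userspace model of the
control flow (stand-in types and names, not kernel code): a successful
zero return can leave *foliop untouched when the fault is to be retried,
so the caller must not dereference it unconditionally.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types involved; this sketches
 * the control flow only, it is not kernel code. */
#define VM_FAULT_LOCKED 0x01
#define VM_FAULT_RETRY  0x02

struct folio_model { int id; };
struct vmf_model   { struct folio_model *page; };

/* Models shmem_get_folio_gfp(): on the userfaultfd-missing path it
 * returns 0 (no error) with *fault_type set to VM_FAULT_RETRY and
 * *foliop left untouched -- the case that crashed shmem_fault(). */
static int get_folio_model(struct folio_model **foliop, int *fault_type,
                           int uffd_missing)
{
    static struct folio_model f = { 42 };

    if (uffd_missing) {
        *fault_type = VM_FAULT_RETRY;   /* fault will be retried later */
        return 0;                       /* note: *foliop is NOT set */
    }
    *foliop = &f;
    *fault_type = VM_FAULT_LOCKED;
    return 0;
}

/* Models shmem_fault() with Hugh's fix applied: folio starts out NULL
 * and is only dereferenced when it was actually filled in. */
static int fault_model(struct vmf_model *vmf, int uffd_missing)
{
    struct folio_model *folio = NULL;   /* the fix: initialize */
    int ret;

    if (get_folio_model(&folio, &ret, uffd_missing))
        return -1;                      /* vmf_error() stand-in */
    if (folio)                          /* the fix: guard the deref */
        vmf->page = folio;
    return ret;
}
```

Without the NULL initialization and the guard, the uffd_missing path
would hand an uninitialized pointer to the vmf->page assignment, which
is the crash syzbot hit.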

> and are not setting *foliop at all.  So this would affect any
> userfaultfd usage of shmem, not just an odd syzbot corner case.
> 
> I do think shmem_fault() ought at least to initialize its folio pointer
> (as shmem_file_read_iter() correctly does); and if it does that, then we
> don't need to weed out the userfaultfd cases as such, but just avoid
> folio_file_page() when folio is still NULL.
> 
> I've neither tried the syzbot test case (but see its .c does involve
> > userfaultfd), nor tried userfaultfd shmem: Peter, could you give this a
> try - we'd expect userfaultfd shmem to crash on linux-next without the
> patch below.
> 
> And it implies that nobody has been trying userfaultfd shmem on recent
> linux-next, so whether vmf->page is then set correctly we're not sure.
> 
> But me, I cannot even get the latest linux-next to boot: next job.
> 
> Hugh
> 
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct inode *inode = file_inode(vma->vm_file);
>  	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> -	struct folio *folio;
> +	struct folio *folio = NULL;
>  	int err;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
>  
> @@ -2127,7 +2127,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  				  gfp, vma, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
> -	vmf->page = folio_file_page(folio, vmf->pgoff);
> +	if (folio)
> +		vmf->page = folio_file_page(folio, vmf->pgoff);
>  	return ret;
>  }

The patch itself looks correct to me.  Thanks.

Reviewed-by: Peter Xu <peterx@redhat.com>

Not sure whether it can be directly squashed into 371133f76a6d.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-24  0:18     ` Nathan Chancellor
@ 2022-09-24  7:28       ` Hugh Dickins
  0 siblings, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2022-09-24  7:28 UTC (permalink / raw)
  To: Nathan Chancellor
  Cc: Hugh Dickins, Mike Kravetz, Peter Xu, Axel Rasmussen, Yang Shi,
	Matthew Wilcox, syzbot, akpm, linux-kernel, linux-mm, llvm,
	ndesaulniers, songmuchun, syzkaller-bugs, trix

On Fri, 23 Sep 2022, Nathan Chancellor wrote:
> On Fri, Sep 23, 2022 at 05:14:38PM -0700, Hugh Dickins wrote:
> > 
> > But me, I cannot even get the latest linux-next to boot: next job.
> 
> Maybe I can save you a bisect:
> 
> https://lore.kernel.org/all/Yy4GoMMwiBaq3oJf@dev-arch.thelio-3990X/

Many thanks Nathan, yes, you did save me a bisect with that link.
And I managed to get a patch to revert the badness,
I'll reply there on that thread now.

Hugh


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-24  0:58     ` Peter Xu
@ 2022-09-24 15:06       ` Peter Xu
  2022-09-24 19:01         ` Mike Kravetz
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-09-24 15:06 UTC (permalink / raw)
  To: Hugh Dickins, Mike Kravetz
  Cc: Mike Kravetz, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On Fri, Sep 23, 2022 at 08:58:06PM -0400, Peter Xu wrote:
> Hi, Hugh, Mike,
> 
> On Fri, Sep 23, 2022 at 05:14:38PM -0700, Hugh Dickins wrote:
> > On Fri, 23 Sep 2022, Mike Kravetz wrote:
> > > On 09/23/22 09:05, syzbot wrote:
> > > > [syzbot report snipped]
> > > 
> > > While it is true that the addressing exception happens in the hugetlb
> > > routine PageHeadHuge(), the reason is that it is passed a NULL page
> > > pointer, via the call to folio_file_page() at the end of shmem_fault():
> > > 
> > > 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> > > 				  gfp, vma, vmf, &ret);
> > > 	if (err)
> > > 		return vmf_error(err);
> > > 
> > > 	vmf->page = folio_file_page(folio, vmf->pgoff);
> > > 	return ret;
> > > 
> > > The code assumes that if zero is returned from shmem_get_folio_gfp,
> > > then the folio pointer will have been set to a folio.  However, it looks
> > > like there are a few places in shmem_get_folio_gfp where it will return
> > > zero and not set folio.
> > > 
> > > In this specific case, it is the code block:
> > > 
> > >         if (vma && userfaultfd_missing(vma)) {
> > >                 *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
> > >                 return 0;
> > >         }
> > > 
> > > I could try to sort this out, but I believe Matthew has the most context as
> > > he has been changing this code recently.
> > 
> > Thanks Mike.  Yes, it looks like userfaultfd's successful returns from
> > shmem_get_folio_gfp() have (correctly?) filled in vmf->page directly,
> 
> AFAICT handle_userfault() doesn't set vmf->page; it relies on setting
> "*fault_type" to VM_FAULT_RETRY so the fault will be retried, on the
> assumption that the pgtable has been installed properly by the time it
> returns.
> 
> > and are not setting *foliop at all.  So this would affect any
> > userfaultfd usage of shmem, not just an odd syzbot corner case.
> > 
> > I do think shmem_fault() ought at least to initialize its folio pointer
> > (as shmem_file_read_iter() correctly does); and if it does that, then we
> > don't need to weed out the userfaultfd cases as such, but just avoid
> > folio_file_page() when folio is still NULL.
> > 
> > I've neither tried the syzbot test case (but see its .c does involve
> > userfaultfd), nor tried userfaultfd shmem: Peter, could you give this a
> > try - we'd expect userfaultfd shmem to crash on linux-next without the
> > patch below.

Sorry I forgot to reply on this one.

I didn't try linux-next, but I can easily reproduce this with mm-unstable
already, and I verified that Hugh's patch fixes the problem for shmem.

When I was testing I found hugetlb selftest is broken too but with some
other errors:

$ sudo ./userfaultfd hugetlb 100 10  
...
bounces: 6, mode: racing ver read, ERROR: unexpected write fault (errno=0, line=779)

The failing check makes sure that MISSING events are never triggered by
writes, but frankly I don't really know why that's required, and the check
has existed since the first commit that introduced the test.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c47174fc362a089b1125174258e53ef4a69ce6b8
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/vm/userfaultfd.c?id=c47174fc362a089b1125174258e53ef4a69ce6b8#n291

And obviously some recent hugetlb-related change caused that to happen.

Dropping that check can definitely work, but I'll have a closer look soon
too to make sure I didn't miss something.  Mike, please also let me know if
you are aware of this problem.
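
For reference, the core of the check that is failing can be reduced to a
few lines (the flag value matches include/uapi/linux/userfaultfd.h; all
of the surrounding uffd plumbing is omitted):

```c
#include <stdint.h>

/* Same value as UFFD_PAGEFAULT_FLAG_WRITE in the uapi header. */
#define UFFD_PAGEFAULT_FLAG_WRITE (1ULL << 0)

/* Reduced form of the selftest assertion at userfaultfd.c:779: when the
 * bounce round expects read faults only, any pagefault message whose
 * flags carry the WRITE bit is reported as "unexpected write fault". */
static int unexpected_write_fault(uint64_t pagefault_flags, int expect_writes)
{
    return !expect_writes &&
           (pagefault_flags & UFFD_PAGEFAULT_FLAG_WRITE) != 0;
}
```

So the selftest failure above means a MISSING event arrived with the
WRITE flag set during a read-only bounce round.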

> > 
> > And it implies that nobody has been trying userfaultfd shmem on recent
> > linux-next, so whether vmf->page is then set correctly we're not sure.
> > 
> > But me, I cannot even get the latest linux-next to boot: next job.
> > 
> > Hugh
> > 
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> >  	struct vm_area_struct *vma = vmf->vma;
> >  	struct inode *inode = file_inode(vma->vm_file);
> >  	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> > -	struct folio *folio;
> > +	struct folio *folio = NULL;
> >  	int err;
> >  	vm_fault_t ret = VM_FAULT_LOCKED;
> >  
> > @@ -2127,7 +2127,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
> >  				  gfp, vma, vmf, &ret);
> >  	if (err)
> >  		return vmf_error(err);
> > -	vmf->page = folio_file_page(folio, vmf->pgoff);
> > +	if (folio)
> > +		vmf->page = folio_file_page(folio, vmf->pgoff);
> >  	return ret;
> >  }
> 
> The patch itself looks correct to me.  Thanks.
> 
> Reviewed-by: Peter Xu <peterx@redhat.com>
> 
> Not sure whether it can be directly squashed into 371133f76a6d.
> 
> -- 
> Peter Xu

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-24 15:06       ` Peter Xu
@ 2022-09-24 19:01         ` Mike Kravetz
  2022-09-26  0:11           ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Kravetz @ 2022-09-24 19:01 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On 09/24/22 11:06, Peter Xu wrote:
> 
> Sorry I forgot to reply on this one.
> 
> I didn't try linux-next, but I can easily reproduce this with mm-unstable
> already, and I verified that Hugh's patch fixes the problem for shmem.
> 
> When I was testing I found hugetlb selftest is broken too but with some
> other errors:
> 
> $ sudo ./userfaultfd hugetlb 100 10  
> ...
> bounces: 6, mode: racing ver read, ERROR: unexpected write fault (errno=0, line=779)
> 
> The failing check was making sure all MISSING events are not triggered by
> writes, but frankly I don't really know why it's required, and that check
> existed since the 1st commit when test was introduced.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c47174fc362a089b1125174258e53ef4a69ce6b8
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/vm/userfaultfd.c?id=c47174fc362a089b1125174258e53ef4a69ce6b8#n291
> 
> And obviously some recent hugetlb-related change caused that to happen.
> 
> Dropping that check can definitely work, but I'll have a closer look soon
> too to make sure I didn't miss something.  Mike, please also let me know if
> you are aware of this problem.
> 

Peter, I am not aware of this problem.  I really should make running ALL
hugetlb tests part of my regular routine.

If you do not beat me to it, I will take a look in the next few days.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-23 21:11 ` Mike Kravetz
  2022-09-23 21:37   ` Yang Shi
  2022-09-24  0:14   ` Hugh Dickins
@ 2022-09-24 21:56   ` Matthew Wilcox
  2022-09-27  4:32     ` Hugh Dickins
  2 siblings, 1 reply; 24+ messages in thread
From: Matthew Wilcox @ 2022-09-24 21:56 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: syzbot, Hugh Dickins, akpm, linux-kernel, linux-mm, llvm, nathan,
	ndesaulniers, songmuchun, syzkaller-bugs, trix, vishal.moola

On Fri, Sep 23, 2022 at 02:11:24PM -0700, Mike Kravetz wrote:
> On 09/23/22 09:05, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:    483fed3b5dc8 Add linux-next specific files for 20220921
> > git tree:       linux-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=16f0a418880000
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=849cb9f70f15b1ba
> > dashboard link: https://syzkaller.appspot.com/bug?extid=152d76c44ba142f8992b
> > compiler:       gcc (Debian 10.2.1-6) 10.2.1 20210110, GNU ld (GNU Binutils for Debian) 2.35.2
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=12b97b64880000
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11fb9040880000
> > 
> > Downloadable assets:
> > disk image: https://storage.googleapis.com/syzbot-assets/1cb3f4618323/disk-483fed3b.raw.xz
> > vmlinux: https://storage.googleapis.com/syzbot-assets/cc02cb30b495/vmlinux-483fed3b.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+152d76c44ba142f8992b@syzkaller.appspotmail.com
> > 
> > general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
> > KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
> > CPU: 1 PID: 3617 Comm: syz-executor722 Not tainted 6.0.0-rc6-next-20220921-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/16/2022
> > RIP: 0010:PagePoisoned include/linux/page-flags.h:304 [inline]
> > RIP: 0010:PageHead include/linux/page-flags.h:787 [inline]
> > RIP: 0010:PageHeadHuge+0x1d/0x200 mm/hugetlb.c:1892
> > Code: ff 66 66 2e 0f 1f 84 00 00 00 00 00 90 41 54 55 48 89 fd 53 e8 54 c9 b9 ff 48 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 a2 01 00 00 48 8b 5d 00 48 c7 c7 ff ff ff ff 48
> > RSP: 0018:ffffc90003e7f5a0 EFLAGS: 00010246
> > RAX: dffffc0000000000 RBX: ffffc90003e7f788 RCX: 0000000000000000
> > RDX: 0000000000000000 RSI: ffffffff81c2cb2c RDI: 0000000000000000
> > RBP: 0000000000000000 R08: 0000000000000005 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000001 R12: ffffc90003e7f798
> > R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000003f4
> > FS:  00007f5642262700(0000) GS:ffff8880b9b00000(0000) knlGS:0000000000000000
> > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: 00000000203f4ef0 CR3: 000000007adcc000 CR4: 00000000003506e0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> >  <TASK>
> >  folio_test_hugetlb include/linux/page-flags.h:831 [inline]
> >  folio_file_page include/linux/pagemap.h:683 [inline]
> >  shmem_fault+0x27c/0x8a0 mm/shmem.c:2130
> >  __do_fault+0x107/0x600 mm/memory.c:4191
> >  do_shared_fault mm/memory.c:4597 [inline]
> >  do_fault mm/memory.c:4675 [inline]
> >  handle_pte_fault mm/memory.c:4943 [inline]
> >  __handle_mm_fault+0x2200/0x3a40 mm/memory.c:5085
> >  handle_mm_fault+0x1c8/0x780 mm/memory.c:5206
> >  do_user_addr_fault+0x475/0x1210 arch/x86/mm/fault.c:1428
> >  handle_page_fault arch/x86/mm/fault.c:1519 [inline]
> >  exc_page_fault+0x94/0x170 arch/x86/mm/fault.c:1575
> >  asm_exc_page_fault+0x22/0x30 arch/x86/include/asm/idtentry.h:570
> > RIP: 0010:__put_user_nocheck_4+0x3/0x11
> > Code: 00 00 48 39 d9 73 54 0f 01 cb 66 89 01 31 c9 0f 01 ca c3 0f 1f 44 00 00 48 bb fd ef ff ff ff 7f 00 00 48 39 d9 73 34 0f 01 cb <89> 01 31 c9 0f 01 ca c3 66 0f 1f 44 00 00 48 bb f9 ef ff ff ff 7f
> > RSP: 0018:ffffc90003e7fa00 EFLAGS: 00050293
> > RAX: 0000000000000000 RBX: ffffc90003e7fdf4 RCX: 00000000203f4ef0
> > RDX: ffff888020c51d40 RSI: ffffffff8726d52f RDI: 0000000000000005
> > RBP: ffffc90003e7fdb0 R08: 0000000000000005 R09: 0000000000000000
> > R10: 0000000000000002 R11: 0000000000000001 R12: 0000000000000000
> > R13: 0000000000000002 R14: 00000000203f4ef0 R15: 0000000000000000
> >  ____sys_recvmsg+0x3ba/0x610 net/socket.c:2714
> >  ___sys_recvmsg+0xf2/0x180 net/socket.c:2743
> >  do_recvmmsg+0x25e/0x6e0 net/socket.c:2837
> >  __sys_recvmmsg net/socket.c:2916 [inline]
> >  __do_sys_recvmmsg net/socket.c:2939 [inline]
> >  __se_sys_recvmmsg net/socket.c:2932 [inline]
> >  __x64_sys_recvmmsg+0x20b/0x260 net/socket.c:2932
> >  do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> >  do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> >  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > RIP: 0033:0x7f56422dabb9
> > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 91 18 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:00007f5642262208 EFLAGS: 00000246 ORIG_RAX: 000000000000012b
> > RAX: ffffffffffffffda RBX: 00007f564235c4b8 RCX: 00007f56422dabb9
> > RDX: 0000000000010106 RSI: 00000000200000c0 RDI: 0000000000000003
> > RBP: 00007f564235c4b0 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000002 R11: 0000000000000246 R12: 00007f564235c4bc
> > R13: 00007fffbde3618f R14: 00007f5642262300 R15: 0000000000022000
> >  </TASK>
> > Modules linked in:
> > ---[ end trace 0000000000000000 ]---
> 
> While it is true that the addressing exception happens in the hugetlb
> routine PageHeadHuge(), the reason is that it is passed a NULL page
> pointer, via the call to folio_file_page() at the end of shmem_fault:
> 
> 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> 				  gfp, vma, vmf, &ret);
> 	if (err)
> 		return vmf_error(err);
> 
> 	vmf->page = folio_file_page(folio, vmf->pgoff);
> 	return ret;
> 
> The code assumes that if a zero value is returned from shmem_get_folio_gfp,
> then the folio pointer will have been set to a folio.  However, it looks
> like there are a few places in shmem_get_folio_gfp where it will return
> zero without setting folio.
> 
> In this specific case, it is the code block:
> 
>         if (vma && userfaultfd_missing(vma)) {
>                 *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
>                 return 0;
>         }
> 
> I could try to sort this out, but I believe Matthew has the most context as
> he has been changing this code recently.

Vishal sent me a patch a few days ago, but I was on holiday so hadn't
seen it until now.

From: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
To: willy@infradead.org
Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
Subject: [PATCH 1/1] Fix an issue in shmem_fault()
Date: Wed, 21 Sep 2022 17:38:56 -0700
Message-Id: <20220922003855.23411-2-vishal.moola@gmail.com>
X-Mailer: git-send-email 2.36.1
In-Reply-To: <20220922003855.23411-1-vishal.moola@gmail.com>
References: <20220922003855.23411-1-vishal.moola@gmail.com>

If shmem_get_folio_gfp() returns 0 and does not assign folio,
folio_file_page() dereferences a NULL pointer.  Make sure to assign
vmf->page only if a folio is assigned, otherwise set it to NULL.

Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
---
 mm/shmem.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index d6921b8e2cb5..986c07362eab 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 	struct vm_area_struct *vma = vmf->vma;
 	struct inode *inode = file_inode(vma->vm_file);
 	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
-	struct folio *folio;
+	struct folio *folio = NULL;
 	int err;
 	vm_fault_t ret = VM_FAULT_LOCKED;
 
@@ -2127,7 +2127,10 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
 				  gfp, vma, vmf, &ret);
 	if (err)
 		return vmf_error(err);
-	vmf->page = folio_file_page(folio, vmf->pgoff);
+	if (folio)
+		vmf->page = folio_file_page(folio, vmf->pgoff);
+	else
+		vmf->page = NULL;
 	return ret;
 }
 
-- 
2.36.1



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-24  0:14   ` Hugh Dickins
  2022-09-24  0:18     ` Nathan Chancellor
  2022-09-24  0:58     ` Peter Xu
@ 2022-09-25 18:55     ` Andrew Morton
  2 siblings, 0 replies; 24+ messages in thread
From: Andrew Morton @ 2022-09-25 18:55 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Mike Kravetz, Peter Xu, Axel Rasmussen, Yang Shi, Matthew Wilcox,
	syzbot, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On Fri, 23 Sep 2022 17:14:38 -0700 (PDT) Hugh Dickins <hughd@google.com> wrote:

> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct inode *inode = file_inode(vma->vm_file);
>  	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> -	struct folio *folio;
> +	struct folio *folio = NULL;
>  	int err;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
>  
> @@ -2127,7 +2127,8 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  				  gfp, vma, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
> -	vmf->page = folio_file_page(folio, vmf->pgoff);
> +	if (folio)
> +		vmf->page = folio_file_page(folio, vmf->pgoff);
>  	return ret;
>  }

I grabbed this as a fix against
shmem-convert-shmem_fault-to-use-shmem_get_folio_gfp.patch.  Hopefully
someone can send along something more formal when ready.



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-24 19:01         ` Mike Kravetz
@ 2022-09-26  0:11           ` Peter Xu
  2022-09-27 16:24             ` Mike Kravetz
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-09-26  0:11 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On Sat, Sep 24, 2022 at 12:01:16PM -0700, Mike Kravetz wrote:
> On 09/24/22 11:06, Peter Xu wrote:
> > 
> > Sorry I forgot to reply on this one.
> > 
> > I didn't try linux-next, but I can easily reproduce this with mm-unstable
> > already, and I verified that Hugh's patch fixes the problem for shmem.
> > 
> > When I was testing I found hugetlb selftest is broken too but with some
> > other errors:
> > 
> > $ sudo ./userfaultfd hugetlb 100 10  
> > ...
> > bounces: 6, mode: racing ver read, ERROR: unexpected write fault (errno=0, line=779)
> > 
> > The failing check was making sure all MISSING events are not triggered by
> > writes, but frankly I don't really know why it's required, and that check
> > existed since the 1st commit when test was introduced.
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c47174fc362a089b1125174258e53ef4a69ce6b8
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/vm/userfaultfd.c?id=c47174fc362a089b1125174258e53ef4a69ce6b8#n291
> > 
> > And obviously some recent hugetlb-related change caused that to happen.
> > 
> > Dropping that check can definitely work, but I'll have a closer look soon
> > too to make sure I didn't miss something.  Mike, please also let me know if
> > you are aware of this problem.
> > 
> 
> Peter, I am not aware of this problem.  I really should make running ALL
> hugetlb tests part of my regular routine.
> 
> If you do not beat me to it, I will take a look in the next few days.

Just to update - my bisection points to 00cdec99f3eb ("hugetlbfs: revert
use i_mmap_rwsem to address page fault/truncate race", 2022-09-21).

I don't yet understand how they are related, though.  It must be a timing
thing, because the failure cannot be reproduced in a VM but only on the
host, and even on the host it can occasionally (though rarely) pass.

Logically all the uffd messages in the stress test should be generated by
the locking thread, upon:

		pthread_mutex_lock(area_mutex(area_dst, page_nr));

I thought the common fast path for lock() would already be a userspace
cmpxchg(), and that should already be a write fault.

For example, I did a quick hack on the test and I can easily trigger the
write-check failure with anonymous memory using an explicit cmpxchg on byte
offset 128:

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 74babdbc02e5..a7d6938d4553 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -637,6 +637,10 @@ static void *locking_thread(void *arg)
                } else
                        page_nr += 1;
                page_nr %= nr_pages;
+               char *ptr = area_dst + (page_nr * page_size) + 128;
+               char _old = 0, new = 1;
+               (void)__atomic_compare_exchange_n(ptr, &_old, new, false,
+                               __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
                pthread_mutex_lock(area_mutex(area_dst, page_nr));
                count = *area_count(area_dst, page_nr);
                if (count != count_verify[page_nr])

I'll need some more time thinking about it before I send a patch to drop
the write check..

Thanks,

-- 
Peter Xu



^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-24 21:56   ` Matthew Wilcox
@ 2022-09-27  4:32     ` Hugh Dickins
  0 siblings, 0 replies; 24+ messages in thread
From: Hugh Dickins @ 2022-09-27  4:32 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Mike Kravetz, syzbot, Hugh Dickins, akpm, linux-kernel, linux-mm,
	llvm, nathan, ndesaulniers, songmuchun, syzkaller-bugs, trix,
	vishal.moola, peterx

On Sat, 24 Sep 2022, Matthew Wilcox wrote:
> On Fri, Sep 23, 2022 at 02:11:24PM -0700, Mike Kravetz wrote:
> > On 09/23/22 09:05, syzbot wrote:
> > > Hello,
> > > 
> > > syzbot found the following issue on:
> > > 
> > > [ report details and crash dump snipped; quoted in full earlier in the thread ]
> > 
> > While it is true that the addressing exception happens in the hugetlb
> > routine PageHeadHuge(), the reason is that it is passed a NULL page
> > pointer, via the call to folio_file_page() at the end of shmem_fault:
> > 
> > 	err = shmem_get_folio_gfp(inode, vmf->pgoff, &folio, SGP_CACHE,
> > 				  gfp, vma, vmf, &ret);
> > 	if (err)
> > 		return vmf_error(err);
> > 
> > 	vmf->page = folio_file_page(folio, vmf->pgoff);
> > 	return ret;
> > 
> > The code assumes that if a zero value is returned from shmem_get_folio_gfp,
> > then the folio pointer will have been set to a folio.  However, it looks
> > like there are a few places in shmem_get_folio_gfp where it will return
> > zero without setting folio.
> > 
> > In this specific case, it is the code block:
> > 
> >         if (vma && userfaultfd_missing(vma)) {
> >                 *fault_type = handle_userfault(vmf, VM_UFFD_MISSING);
> >                 return 0;
> >         }
> > 
> > I could try to sort this out, but I believe Matthew has the most context as
> > he has been changing this code recently.
> 
> Vishal sent me a patch a few days ago, but I was on holiday so hadn't
> seen it until now.
> 
> From: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
> To: willy@infradead.org
> Cc: "Vishal Moola (Oracle)" <vishal.moola@gmail.com>
> Subject: [PATCH 1/1] Fix an issue in shmem_fault()
> Date: Wed, 21 Sep 2022 17:38:56 -0700
> Message-Id: <20220922003855.23411-2-vishal.moola@gmail.com>
> X-Mailer: git-send-email 2.36.1
> In-Reply-To: <20220922003855.23411-1-vishal.moola@gmail.com>
> References: <20220922003855.23411-1-vishal.moola@gmail.com>
> 
> If shmem_get_folio_gfp() returns 0 and does not assign folio,
> folio_file_page() dereferences a NULL pointer.  Make sure to assign
> vmf->page only if a folio is assigned, otherwise set it to NULL.
> 
> Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>

Acked-by: Hugh Dickins <hughd@google.com>

Peter, thank you for reviewing and advising and testing mine, and
Andrew, thank you for taking it into mm-unstable.  But I think that
since Vishal sent his first, thank you, it is his that should go in.

Me, I tend to avoid an "else", but that's not a very good reason to
prefer mine.  If, in my ignorance, I'd been right about userfaultfd
setting vmf->page, then mine would have been the right patch; but
Peter showed I was wrong on that, so Vishal's seems slightly the better.

Matthew, please send Vishal's with your signoff on to Andrew;
or I can do so (mutatis mutandis) if you prefer.

Thanks,
Hugh

> ---
>  mm/shmem.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d6921b8e2cb5..986c07362eab 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2060,7 +2060,7 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  	struct vm_area_struct *vma = vmf->vma;
>  	struct inode *inode = file_inode(vma->vm_file);
>  	gfp_t gfp = mapping_gfp_mask(inode->i_mapping);
> -	struct folio *folio;
> +	struct folio *folio = NULL;
>  	int err;
>  	vm_fault_t ret = VM_FAULT_LOCKED;
>  
> @@ -2127,7 +2127,10 @@ static vm_fault_t shmem_fault(struct vm_fault *vmf)
>  				  gfp, vma, vmf, &ret);
>  	if (err)
>  		return vmf_error(err);
> -	vmf->page = folio_file_page(folio, vmf->pgoff);
> +	if (folio)
> +		vmf->page = folio_file_page(folio, vmf->pgoff);
> +	else
> +		vmf->page = NULL;
>  	return ret;
>  }
>  
> -- 
> 2.36.1


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-26  0:11           ` Peter Xu
@ 2022-09-27 16:24             ` Mike Kravetz
  2022-09-27 16:45               ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Kravetz @ 2022-09-27 16:24 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On 09/25/22 20:11, Peter Xu wrote:
> On Sat, Sep 24, 2022 at 12:01:16PM -0700, Mike Kravetz wrote:
> > On 09/24/22 11:06, Peter Xu wrote:
> > > 
> > > Sorry I forgot to reply on this one.
> > > 
> > > I didn't try linux-next, but I can easily reproduce this with mm-unstable
> > > already, and I verified that Hugh's patch fixes the problem for shmem.
> > > 
> > > When I was testing I found hugetlb selftest is broken too but with some
> > > other errors:
> > > 
> > > $ sudo ./userfaultfd hugetlb 100 10  
> > > ...
> > > bounces: 6, mode: racing ver read, ERROR: unexpected write fault (errno=0, line=779)
> > > 
> > > The failing check was making sure all MISSING events are not triggered by
> > > writes, but frankly I don't really know why it's required, and that check
> > > existed since the 1st commit when test was introduced.
> > > 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c47174fc362a089b1125174258e53ef4a69ce6b8
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/vm/userfaultfd.c?id=c47174fc362a089b1125174258e53ef4a69ce6b8#n291
> > > 
> > > And obviously some recent hugetlb-related change caused that to happen.
> > > 
> > > Dropping that check can definitely work, but I'll have a closer look soon
> > > too to make sure I didn't miss something.  Mike, please also let me know if
> > > you are aware of this problem.
> > > 
> > 
> > Peter, I am not aware of this problem.  I really should make running ALL
> > hugetlb tests part of my regular routine.
> > 
> > If you do not beat me to it, I will take a look in the next few days.
> 
> Just to update - my bisection points to 00cdec99f3eb ("hugetlbfs: revert
> use i_mmap_rwsem to address page fault/truncate race", 2022-09-21).
> 
> I don't yet understand how they are related, though.  It must be a timing
> thing, because the failure cannot be reproduced in a VM but only on the
> host, and even on the host it can occasionally (though rarely) pass.

Thanks Peter!

After your analysis, I also started looking at this.
- I did reproduce a few times in a VM
- On BM (a laptop) I could reproduce but it would take several (10's of) runs

> Logically all the uffd messages in the stress test should be generated by
> the locking thread, upon:
> 
> 		pthread_mutex_lock(area_mutex(area_dst, page_nr));

I personally find that test program hard to understand and follow; it takes
me a day or so to work out what it is doing, and then I immediately lose
context when I stop looking at it. :(

So, as you mention below, the program depends on pthread_mutex_lock()
doing a read fault before a write.

> I thought the common fast path for lock() would already be a userspace
> cmpxchg(), and that should already be a write fault.
> 
> For example, I did a quick hack on the test and I can easily trigger the
> write-check failure with anonymous memory using an explicit cmpxchg on byte
> offset 128:
> 
> diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
> index 74babdbc02e5..a7d6938d4553 100644
> --- a/tools/testing/selftests/vm/userfaultfd.c
> +++ b/tools/testing/selftests/vm/userfaultfd.c
> @@ -637,6 +637,10 @@ static void *locking_thread(void *arg)
>                 } else
>                         page_nr += 1;
>                 page_nr %= nr_pages;
> +               char *ptr = area_dst + (page_nr * page_size) + 128;
> +               char _old = 0, new = 1;
> +               (void)__atomic_compare_exchange_n(ptr, &_old, new, false,
> +                               __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
>                 pthread_mutex_lock(area_mutex(area_dst, page_nr));
>                 count = *area_count(area_dst, page_nr);
>                 if (count != count_verify[page_nr])
> 
> I'll need some more time thinking about it before I send a patch to drop
> the write check..

I did another quick hack, and duplicated the statement:
	count = *area_count(area_dst, page_nr);
just before the
	pthread_mutex_lock(area_mutex(area_dst, page_nr));

This should guarantee a read fault independent of what pthread_mutex_lock
does.  However, it still results in the occasional "ERROR: unexpected write
fault".  So, something else is happening.  I will continue to experiment
and think about this.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-27 16:24             ` Mike Kravetz
@ 2022-09-27 16:45               ` Peter Xu
  2022-09-29 23:33                 ` Mike Kravetz
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-09-27 16:45 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On Tue, Sep 27, 2022 at 09:24:37AM -0700, Mike Kravetz wrote:
> This should guarantee a read fault independent of what pthread_mutex_lock
> does.  However, it still results in the occasional "ERROR: unexpected write
> fault".  So, something else is happening.  I will continue to experiment
> and think about this.

Thanks for verifying this, Mike.  I hadn't replied yet, but I do have some
updates on my side too.  I had planned to look deeper and wait to reply
until then, because I do think there's something special about hugetlb that
I still don't understand.  I keep getting distracted by other things, but
maybe I should share what I have already.

I think I already know what guaranteed the read faults: the NPTL pthread
library implements mutexes of different types, and the first thing lock()
does is fetch the mutex type (at offset 0x10) to decide how to proceed:

(gdb) disas pthread_mutex_lock
Dump of assembler code for function ___pthread_mutex_lock:
   0x00007ffff7e3b0d0 <+0>:     endbr64 
   0x00007ffff7e3b0d4 <+4>:     mov    0x10(%rdi),%eax       <---- read 0x10 of &mutex
   0x00007ffff7e3b0d7 <+7>:     mov    %eax,%edx
   0x00007ffff7e3b0d9 <+9>:     and    $0x17f,%edx
(gdb) ptype pthread_mutex_t
type = union {
    struct __pthread_mutex_s __data;
    char __size[40];
    long __align;
}
(gdb) ptype struct __pthread_mutex_s
type = struct __pthread_mutex_s {
    int __lock;
    unsigned int __count;
    int __owner;
    unsigned int __nusers;
    int __kind;                                              <--- 0x10 offset here
    short __spins;
    short __elision;
    __pthread_list_t __list;
}

I looked directly at the asm dumped from the binary I tested to make sure
this is accurate.  So, with the current uffd selftest, logically there
should never be a write MISSING fault (I still don't think it's good to
have the write check, but that's a separate question).

It also means hugetlb does something special here.  It really smells like
the hugetlb page table entry got evicted after UFFDIO_COPY while
locking_thread was running; then any further lock() (e.g. cmpxchg) or
modification of the counter could trigger a write fault.

OTOH this also explains why the futex code was never exercised by the
userfaultfd selftest: futex() always comes after that "read the type of
mutex" access.  That is something I want to rework a bit, so the uffd
selftest also covers gup, as you rightfully suggested before.  But that's
nothing urgent, and can wait until we know what's special about the new
hugetlb code.

I'll also keep you updated if I figure out anything more.

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-27 16:45               ` Peter Xu
@ 2022-09-29 23:33                 ` Mike Kravetz
  2022-09-30  1:29                   ` Peter Xu
  2022-09-30 16:05                   ` Peter Xu
  0 siblings, 2 replies; 24+ messages in thread
From: Mike Kravetz @ 2022-09-29 23:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On 09/27/22 12:45, Peter Xu wrote:
> On Tue, Sep 27, 2022 at 09:24:37AM -0700, Mike Kravetz wrote:
> > This should guarantee a read fault independent of what pthread_mutex_lock
> > does.  However, it still results in the occasional "ERROR: unexpected write
> > fault".  So, something else if happening.  I will continue to experiment
> > and think about this.
> 
> Thanks for verifying this, Mike.  I didn't yet reply but I did have some
> update on my side too.  I plan to look deeper and wanted to reply until
> that, because I do think there's something special on hugetlb and I still
> don't know. I just keep getting distracted by other things.. but maybe I
> should still share out what I have already.
> 
> I think I already know what had guaranteed the read faults - the NPTL
> pthread lib implemented mutex in different types, and the 1st instruction
> of lock() is to fetch the mutex type (at offset 0x10) then we know how we
> should move on:
> 
> (gdb) disas pthread_mutex_lock
> Dump of assembler code for function ___pthread_mutex_lock:
>    0x00007ffff7e3b0d0 <+0>:     endbr64 
>    0x00007ffff7e3b0d4 <+4>:     mov    0x10(%rdi),%eax       <---- read 0x10 of &mutex
>    0x00007ffff7e3b0d7 <+7>:     mov    %eax,%edx
>    0x00007ffff7e3b0d9 <+9>:     and    $0x17f,%edx
> (gdb) ptype pthread_mutex_t
> type = union {
>     struct __pthread_mutex_s __data;
>     char __size[40];
>     long __align;
> }
> (gdb) ptype struct __pthread_mutex_s
> type = struct __pthread_mutex_s {
>     int __lock;
>     unsigned int __count;
>     int __owner;
>     unsigned int __nusers;
>     int __kind;                                              <--- 0x10 offset here
>     short __spins;
>     short __elision;
>     __pthread_list_t __list;
> }
> 
> I looked directly at asm dumped from the binary I tested to make sure it's
> accurate.  So it means with current uffd selftest logically there should
> never be a write missing fault (I still don't think it's good to have the
> write check though.. but that's separate question to ask).
> 
> It also means hugetlb does something special here.  It smells really like
> for some reason the hugetlb pgtable got evicted after UFFDIO_COPY during
> locking_thread running, then any further lock() (e.g. cmpxchg) or modifying
> the counter could trigger a write fault.
> 
> OTOH this also explains why the futex code was never exercised by the
> userfaultfd selftest: futex() always comes after that "read the type of
> mutex" access.  That is something I want to rework a bit, so the uffd
> selftest also covers gup, as you rightfully suggested before.  But that's
> nothing urgent, and can wait until we know what's special about the new
> hugetlb code.
> 

I was able to do a little more debugging:

As you know the hugetlb calling path to handle_userfault is something
like this,

hugetlb_fault
	mutex_lock(&hugetlb_fault_mutex_table[hash]);
	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
	if (huge_pte_none_mostly())
		hugetlb_no_page()
			page = find_lock_page(mapping, idx);
			if (!page) {
				if (userfaultfd_missing(vma))
					mutex_unlock(&hugetlb_fault_mutex_table[hash]);
					return handle_userfault()
			}

For anon mappings, find_lock_page() will never find a page, so as long
as huge_pte_none_mostly() is true we will call into handle_userfault().

Since your analysis shows the testcase should never call handle_userfault() for
a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
the call to handle_userfault().  Sure enough, I saw plenty of printk messages.

Then, before calling handle_userfault() I added code to take the page table
lock and test huge_pte_none_mostly() again.  In every FAULT_FLAG_WRITE case,
this second test of huge_pte_none_mostly() was false.  So, the condition
changed between the check in hugetlb_fault and the check (with page table
lock held) in hugetlb_no_page.

IIUC, the only code that should be modifying the pte in this test is
hugetlb_mcopy_atomic_pte().  It also holds the hugetlb_fault_mutex while
updating the pte.

It 'appears' that hugetlb_fault is not seeing the updated pte and I can
only guess that it is due to some caching issues.

After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(dst_vma, dst_addr, dst_pte);

I suspect that is true.  However, it seems like this test depends on all
CPUs seeing the updated pte immediately?

I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
issues take me a while to figure out.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-29 23:33                 ` Mike Kravetz
@ 2022-09-30  1:29                   ` Peter Xu
  2022-09-30  1:35                     ` Peter Xu
  2022-09-30 16:05                   ` Peter Xu
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-09-30  1:29 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

Hi, Mike,

On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
> issues take me a while to figure out.

It seems the UFFDIO_COPY path for hugetlb is the major issue here, in that
for private mappings we don't insert the page into the page cache.

I think that made sense in that e.g. we don't want to allow a private
mapping to be able to write to the page cache.  But afaict it's not
correct, because UFFDIO_COPY resolves page faults exactly at the page
cache layer for file-backed memories.  So what we should do is insert the
page into the page cache but map it read-only, waiting for a CoW if
needed.

I'll attach a fix that starts inserting the page into the page cache for
UFFDIO_COPY+hugetlb even if the mapping is private.  A second patch
updates the selftest, because otherwise the private hugetlb selftest
won't work after the fix is applied: the selftest used MADV_DONTNEED to
drop the private mapping, but IMHO that's not enough; after the fix we
need to drop the page cache too.

Feel free to try with the two patches applied.  They make the current
issue go away for me.

I haven't posted them yet because after applying the two patches I found
other issues - the reserved pages are messed up and leaked.  I'll keep
looking at the leak issue tomorrow, but please also let me know if you
spot anything suspicious, as you're definitely much more fluent in the
reservation code.

And that's not the only issue I found - shmem can have other issues
regarding private mappings; shmem does the page cache insertion right but
not the rest, I think..  I'll look into them one by one.  It's quite
interesting to dig multiple things out of the write check symptoms..

-- 
Peter Xu




* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-30  1:29                   ` Peter Xu
@ 2022-09-30  1:35                     ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2022-09-30  1:35 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

[-- Attachment #1: Type: text/plain, Size: 2036 bytes --]

On Thu, Sep 29, 2022 at 09:29:54PM -0400, Peter Xu wrote:
> Hi, Mike,
> 
> On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> > any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
> > issues take me a while to figure out.
> 
> It seems the UFFDIO_COPY path for hugetlb is the major issue here, in
> that for private mappings we don't insert the page into the page cache.
> 
> I think that made sense in that e.g. we don't want to allow a private
> mapping to be able to write to the page cache.  But afaict it's not
> correct, because UFFDIO_COPY resolves page faults exactly at the page
> cache layer for file-backed memories.  So what we should do is insert
> the page into the page cache but map it read-only, waiting for a CoW if
> needed.
> 
> I'll attach a fix that starts inserting the page into the page cache for
> UFFDIO_COPY+hugetlb even if the mapping is private.  A second patch
> updates the selftest, because otherwise the private hugetlb selftest
> won't work after the fix is applied: the selftest used MADV_DONTNEED to
> drop the private mapping, but IMHO that's not enough; after the fix we
> need to drop the page cache too.
> 
> Feel free to try with the two patches applied.  They make the current
> issue go away for me.
> 
> I haven't posted them yet because after applying the two patches I found
> other issues - the reserved pages are messed up and leaked.  I'll keep
> looking at the leak issue tomorrow, but please also let me know if you
> spot anything suspicious, as you're definitely much more fluent in the
> reservation code.
> 
> And that's not the only issue I found - shmem can have other issues
> regarding private mappings; shmem does the page cache insertion right
> but not the rest, I think..  I'll look into them one by one.  It's quite
> interesting to dig multiple things out of the write check symptoms..

The patches..

-- 
Peter Xu

[-- Attachment #2: 0001-mm-hugetlb-Insert-page-cache-for-UFFDIO_COPY-even-if.patch --]
[-- Type: text/plain, Size: 4562 bytes --]

From 1255fabde4f950bef4751ef337791b93f5373a1f Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Wed, 28 Sep 2022 17:44:00 -0400
Subject: [PATCH 1/2] mm/hugetlb: Insert page cache for UFFDIO_COPY even if
 private
Content-type: text/plain

UFFDIO_COPY resolves page faults at the page cache layer for file-backed
memories on shmem and hugetlbfs.  That means for each UFFDIO_COPY we
should insert the new page into the page cache no matter whether the
mapping is private or shared.

We used to not do that, probably because for private mappings we should
not allow the page cache to be written by the privately mapping process.
However that can be handled by clearing the write bit (as this patch
does) so that CoW will trigger properly for the page cache.

Leaving the page cache empty could lead to below sequence:

  (1) map hugetlb privately, register with uffd missing+wp
  (2) read page, trigger MISSING event with READ
  (3) UFFDIO_COPY(wp=1) resolve page fault, keep wr-protected
  (4) write page, trigger MISSING event again (because page cache missing!)
      with WRITE

This behavior has existed since the initial commit of hugetlb MISSING mode
support, which is commit 1c9e8def43a3 ("userfaultfd: hugetlbfs: add
UFFDIO_COPY support for shared mappings", 2017-02-22).  In most cases it
should be fine as long as the hugetlb page/pte stays stable (e.g., no
wr-protect, no MADV_DONTNEED, ...).  However, if a further page fault is
triggered for any reason, it can cause issues.  Recently, due to the newly
introduced uffd-wp on hugetlbfs and also a recent locking rework from Mike,
the userfaultfd kselftest easily fails with hugetlb private mappings.

One further step is we can do early CoW if the private mapping is
writable, but let's leave that for later.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9679fe519b90..a43fc6852f27 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5933,14 +5933,12 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-	bool page_in_pagecache = false;
 
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, idx);
 		if (!page)
 			goto out;
-		page_in_pagecache = true;
 	} else if (!*pagep) {
 		/* If a page already exists, then it's UFFDIO_COPY for
 		 * a non-missing case. Return -EEXIST.
@@ -6014,8 +6012,8 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 */
 	__SetPageUptodate(page);
 
-	/* Add shared, newly allocated pages to the page cache. */
-	if (vm_shared && !is_continue) {
+	/* Add newly allocated pages to the page cache for UFFDIO_COPY. */
+	if (!is_continue) {
 		size = i_size_read(mapping->host) >> huge_page_shift(h);
 		ret = -EFAULT;
 		if (idx >= size)
@@ -6030,7 +6028,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		ret = hugetlb_add_to_page_cache(page, mapping, idx);
 		if (ret)
 			goto out_release_nounlock;
-		page_in_pagecache = true;
 	}
 
 	ptl = huge_pte_lock(h, dst_mm, dst_pte);
@@ -6044,18 +6041,13 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (!huge_pte_none_mostly(huge_ptep_get(dst_pte)))
 		goto out_release_unlock;
 
-	if (page_in_pagecache) {
-		page_dup_file_rmap(page, true);
-	} else {
-		ClearHPageRestoreReserve(page);
-		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
-	}
+	page_dup_file_rmap(page, true);
 
 	/*
-	 * For either: (1) CONTINUE on a non-shared VMA, or (2) UFFDIO_COPY
-	 * with wp flag set, don't set pte write bit.
+	 * For either: (1) a non-shared VMA, or (2) UFFDIO_COPY with wp
+	 * flag set, don't set pte write bit.
 	 */
-	if (wp_copy || (is_continue && !vm_shared))
+	if (wp_copy || !vm_shared)
 		writable = 0;
 	else
 		writable = dst_vma->vm_flags & VM_WRITE;
@@ -6083,18 +6075,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	spin_unlock(ptl);
 	if (!is_continue)
 		SetHPageMigratable(page);
-	if (vm_shared || is_continue)
-		unlock_page(page);
+	unlock_page(page);
 	ret = 0;
 out:
 	return ret;
 out_release_unlock:
 	spin_unlock(ptl);
-	if (vm_shared || is_continue)
-		unlock_page(page);
+	unlock_page(page);
 out_release_nounlock:
-	if (!page_in_pagecache)
-		restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }
-- 
2.32.0


[-- Attachment #3: 0002-selftests-vm-Use-memfd-for-hugetlb-tests.patch --]
[-- Type: text/plain, Size: 6123 bytes --]

From 046ba2d1f5a74d86c6546a4f1bc8081f7c0fd3f8 Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Thu, 29 Sep 2022 11:55:26 -0400
Subject: [PATCH 2/2] selftests/vm: Use memfd for hugetlb tests
Content-type: text/plain

We already use memfd for the shmem test; move hugetlb over to it too, so
that the user doesn't need to specify a hugetlb file path explicitly when
running the hugetlb shared tests.

More importantly, this patch starts to correctly release hugetlb pages
using fallocate(FALLOC_FL_PUNCH_HOLE) even for private mappings, because
for private mappings MADV_DONTNEED is not enough to test UFFDIO_COPY: it
only zaps the pgtables without evicting the page cache.

Unfortunately we need to cache the ptr<->offset relationship for hugetlb
(by adding mem_fd_[src|dst]_offset) in order to use fallocate(), because
that's the only way to evict the page cache for private mappings; IOW,
MADV_REMOVE doesn't work.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 98 +++++++++++++-----------
 1 file changed, 53 insertions(+), 45 deletions(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 74babdbc02e5..4561e9d443e4 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -93,10 +93,11 @@ static volatile bool test_uffdio_zeropage_eexist = true;
 static bool test_uffdio_wp = true;
 /* Whether to test uffd minor faults */
 static bool test_uffdio_minor = false;
-
 static bool map_shared;
-static int shm_fd;
-static int huge_fd;
+static int mem_fd;
+/* File offset for area_src/area_dst upon mem_fd.  Needed for hole punching */
+static off_t mem_fd_src_offset;
+static off_t mem_fd_dst_offset;
 static unsigned long long *count_verify;
 static int uffd = -1;
 static int uffd_flags, finished, *pipefd;
@@ -247,48 +248,59 @@ static void noop_alias_mapping(__u64 *start, size_t len, unsigned long offset)
 {
 }
 
+static off_t ptr_to_offset(char *ptr)
+{
+	if (ptr == area_src)
+		return mem_fd_src_offset;
+	else
+		return mem_fd_dst_offset;
+}
+
 static void hugetlb_release_pages(char *rel_area)
 {
+	off_t size = nr_pages * page_size, offset = ptr_to_offset(rel_area);
+
 	if (!map_shared) {
 		if (madvise(rel_area, nr_pages * page_size, MADV_DONTNEED))
 			err("madvise(MADV_DONTNEED) failed");
-	} else {
-		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
-			err("madvise(MADV_REMOVE) failed");
-	}
+
+		/* Evict page cache explicitly for private mappings */
+		if (fallocate(mem_fd,
+			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+			      offset, size))
+			err("fallocate(FALLOC_FL_PUNCH_HOLE) failed");
+	} else {
+		if (madvise(rel_area, nr_pages * page_size, MADV_REMOVE))
+			err("madvise(MADV_REMOVE) failed");
+	}
 }
 
 static void hugetlb_allocate_area(void **alloc_area, bool is_src)
 {
+	off_t size = nr_pages * page_size;
+	off_t offset = is_src ? 0 : size;
 	void *area_alias = NULL;
 	char **alloc_area_alias;
 
-	if (!map_shared)
-		*alloc_area = mmap(NULL,
-			nr_pages * page_size,
-			PROT_READ | PROT_WRITE,
-			MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB |
-				(is_src ? 0 : MAP_NORESERVE),
-			-1,
-			0);
-	else
-		*alloc_area = mmap(NULL,
-			nr_pages * page_size,
-			PROT_READ | PROT_WRITE,
-			MAP_SHARED |
-				(is_src ? 0 : MAP_NORESERVE),
-			huge_fd,
-			is_src ? 0 : nr_pages * page_size);
+	*alloc_area = mmap(NULL, size, PROT_READ | PROT_WRITE,
+			   (map_shared ? MAP_SHARED : MAP_PRIVATE) |
+			   (is_src ? 0 : MAP_NORESERVE),
+			   mem_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of hugetlbfs file failed");
 
+	/*
+	 * Only hugetlb needs to cache ptr<->offset because it needs to
+	 * fallocate(FALLOC_FL_PUNCH_HOLE).
+	 */
+	if (is_src)
+		mem_fd_src_offset = offset;
+	else
+		mem_fd_dst_offset = offset;
+
 	if (map_shared) {
-		area_alias = mmap(NULL,
-			nr_pages * page_size,
-			PROT_READ | PROT_WRITE,
-			MAP_SHARED,
-			huge_fd,
-			is_src ? 0 : nr_pages * page_size);
+		area_alias = mmap(NULL, size, PROT_READ | PROT_WRITE,
+				  MAP_SHARED, mem_fd, offset);
 		if (area_alias == MAP_FAILED)
 			err("mmap of hugetlb file alias failed");
 	}
@@ -334,14 +346,14 @@ static void shmem_allocate_area(void **alloc_area, bool is_src)
 	}
 
 	*alloc_area = mmap(p, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
-			   shm_fd, offset);
+			   mem_fd, offset);
 	if (*alloc_area == MAP_FAILED)
 		err("mmap of memfd failed");
 	if (test_collapse && *alloc_area != p)
 		err("mmap of memfd failed at %p", p);
 
 	area_alias = mmap(p_alias, bytes, PROT_READ | PROT_WRITE, MAP_SHARED,
-			  shm_fd, offset);
+			  mem_fd, offset);
 	if (area_alias == MAP_FAILED)
 		err("mmap of memfd alias failed");
 	if (test_collapse && area_alias != p_alias)
@@ -1629,7 +1641,7 @@ static int userfaultfd_stress(void)
 
 		/* prepare next bounce */
 		swap(area_src, area_dst);
-
+		swap(mem_fd_src_offset, mem_fd_dst_offset);
 		swap(area_src_alias, area_dst_alias);
 
 		uffd_stats_report(uffd_stats, nr_cpus);
@@ -1821,21 +1833,17 @@ int main(int argc, char **argv)
 	}
 	nr_pages = nr_pages_per_cpu * nr_cpus;
 
-	if (test_type == TEST_HUGETLB && map_shared) {
-		if (argc < 5)
-			usage();
-		huge_fd = open(argv[4], O_CREAT | O_RDWR, 0755);
-		if (huge_fd < 0)
-			err("Open of %s failed", argv[4]);
-		if (ftruncate(huge_fd, 0))
-			err("ftruncate %s to size 0 failed", argv[4]);
-	} else if (test_type == TEST_SHMEM) {
-		shm_fd = memfd_create(argv[0], 0);
-		if (shm_fd < 0)
+	if (test_type == TEST_SHMEM || test_type == TEST_HUGETLB) {
+		unsigned int memfd_flags = 0;
+
+		if (test_type == TEST_HUGETLB)
+			memfd_flags = MFD_HUGETLB;
+		mem_fd = memfd_create(argv[0], memfd_flags);
+		if (mem_fd < 0)
 			err("memfd_create");
-		if (ftruncate(shm_fd, nr_pages * page_size * 2))
+		if (ftruncate(mem_fd, nr_pages * page_size * 2))
 			err("ftruncate");
-		if (fallocate(shm_fd,
+		if (fallocate(mem_fd,
 			      FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, 0,
 			      nr_pages * page_size * 2))
 			err("fallocate");
-- 
2.32.0



* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-29 23:33                 ` Mike Kravetz
  2022-09-30  1:29                   ` Peter Xu
@ 2022-09-30 16:05                   ` Peter Xu
  2022-09-30 21:38                     ` Mike Kravetz
  1 sibling, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-09-30 16:05 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> I was able to do a little more debugging:
> 
> As you know the hugetlb calling path to handle_userfault is something
> like this,
> 
> hugetlb_fault
> 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> 	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> 	if (huge_pte_none_mostly())
> 		hugetlb_no_page()
> 			page = find_lock_page(mapping, idx);
> 			if (!page) {
> 				if (userfaultfd_missing(vma))
> 					mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> 					return handle_userfault()
> 			}
> 
> For anon mappings, find_lock_page() will never find a page, so as long
> as huge_pte_none_mostly() is true we will call into handle_userfault().
> 
> Since your analysis shows the testcase should never call handle_userfault() for
> a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
> the call to handle_userfault().  Sure enough, I saw plenty of printk messages.
> 
> Then, before calling handle_userfault() I added code to take the page table
> lock and test huge_pte_none_mostly() again.  In every FAULT_FLAG_WRITE case,
> this second test of huge_pte_none_mostly() was false.  So, the condition
> changed between the check in hugetlb_fault and the check (with page table
> lock held) in hugetlb_no_page.
> 
> IIUC, the only code that should be modifying the pte in this test is
> hugetlb_mcopy_atomic_pte().  It also holds the hugetlb_fault_mutex while
> updating the pte.
> 
> It 'appears' that hugetlb_fault is not seeing the updated pte and I can
> only guess that it is due to some caching issues.
> 
> After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.
> 
> 	/* No need to invalidate - it was non-present before */
> 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
> 
> I suspect that is true.  However, it seems like this test depends on all
> CPUs seeing the updated pte immediately?
> 
> I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
> issues take me a while to figure out.

This morning when I went back and rethought the matter, I realized that the
common hugetlb path handles private anonymous mappings with an entirely
empty page cache, as you explained above.  In that sense the two patches I
posted may not really make sense even if they pass the tests.. and maybe
that's also the reason why the reservations got messed up.  This matches
something I found after reading more of the reservation code: no matter
whether the hugetlb mapping is private or shared, we reserve only that
number of pages at mmap() time.

With that in mind, UFFDIO_COPY should also work as-is, because the hugetlb
fault handler checks the pte before the page cache, so uffd missing should
still work as expected.

It makes sense especially for hugetlb to do it that way, since otherwise
plenty of zeroed huge pages could pile up in the page cache.  I'm not sure
whether this is the reason hugetlb does it differently (comparing to
shmem, say); it'd be great if I could get a confirmation.  If it's true,
please ignore the two patches I posted.

I think what you analyzed is correct in that the pte shouldn't go away
after being armed once.  One more thing I tried (actually yesterday) was to
SIGBUS the process when the write missing event was generated, and I can
see that the user stack points at the cmpxchg() in pthread_mutex_lock().
That means the thread indeed moved forward past the mutex type check, and
it also means it must already have seen a !none pte with at least read
permission.  In that sense it matches the "missing TLB" experiment
mentioned above, because with a missing TLB entry the thread should keep
getting stuck at the read, not the write.  It's still uncertain why the
pte can somehow go away from under us and then quickly re-appear according
to your experiment.

-- 
Peter Xu




* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-30 16:05                   ` Peter Xu
@ 2022-09-30 21:38                     ` Mike Kravetz
  2022-10-01  2:47                       ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Kravetz @ 2022-09-30 21:38 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On 09/30/22 12:05, Peter Xu wrote:
> On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > I was able to do a little more debugging:
> > 
> > As you know the hugetlb calling path to handle_userfault is something
> > like this,
> > 
> > hugetlb_fault
> > 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > 	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> > 	if (huge_pte_none_mostly())
> > 		hugetlb_no_page()
> > 			page = find_lock_page(mapping, idx);
> > 			if (!page) {
> > 				if (userfaultfd_missing(vma))
> > 					mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > 					return handle_userfault()
> > 			}
> > 
> > For anon mappings, find_lock_page() will never find a page, so as long
> > as huge_pte_none_mostly() is true we will call into handle_userfault().
> > 
> > Since your analysis shows the testcase should never call handle_userfault() for
> > a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
> > the call to handle_userfault().  Sure enough, I saw plenty of printk messages.
> > 
> > Then, before calling handle_userfault() I added code to take the page table
> > lock and test huge_pte_none_mostly() again.  In every FAULT_FLAG_WRITE case,
> > this second test of huge_pte_none_mostly() was false.  So, the condition
> > changed between the check in hugetlb_fault and the check (with page table
> > lock held) in hugetlb_no_page.
> > 
> > IIUC, the only code that should be modifying the pte in this test is
> > hugetlb_mcopy_atomic_pte().  It also holds the hugetlb_fault_mutex while
> > updating the pte.
> > 
> > It 'appears' that hugetlb_fault is not seeing the updated pte and I can
> > only guess that it is due to some caching issues.
> > 
> > After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.
> > 
> > 	/* No need to invalidate - it was non-present before */
> > 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > 
> > I suspect that is true.  However, it seems like this test depends on all
> > CPUs seeing the updated pte immediately?
> > 
> > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> > any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
> > issues take me a while to figure out.
> 
> This morning when I went back and rethought the matter, I realized that
> the common hugetlb path handles private anonymous mappings with an
> entirely empty page cache, as you explained above.  In that sense the two
> patches I posted may not really make sense even if they pass the tests..
> and maybe that's also the reason why the reservations got messed up.
> This matches something I found after reading more of the reservation
> code: no matter whether the hugetlb mapping is private or shared, we
> reserve only that number of pages at mmap() time.
> 
> With that in mind, UFFDIO_COPY should also work as-is, because the
> hugetlb fault handler checks the pte before the page cache, so uffd
> missing should still work as expected.
> 
> It makes sense especially for hugetlb to do it that way, since otherwise
> plenty of zeroed huge pages could pile up in the page cache.  I'm not
> sure whether this is the reason hugetlb does it differently (comparing to
> shmem, say); it'd be great if I could get a confirmation.  If it's true,
> please ignore the two patches I posted.
> 
> I think what you analyzed is correct in that the pte shouldn't go away
> after being armed once.  One more thing I tried (actually yesterday) was
> to SIGBUS the process when the write missing event was generated, and I
> can see that the user stack points at the cmpxchg() in
> pthread_mutex_lock().  That means the thread indeed moved forward past
> the mutex type check, and it also means it must already have seen a !none
> pte with at least read permission.  In that sense it matches the "missing
> TLB" experiment mentioned above, because with a missing TLB entry the
> thread should keep getting stuck at the read, not the write.  It's still
> uncertain why the pte can somehow go away from under us and then quickly
> re-appear according to your experiment.
> 

I 'think' it is more of a race in when all cpus see the pte update.  To be
honest, I cannot wrap my head around how that can happen.

I did not really like your idea of adding anon (or private) pages to the
page cache.  As you discovered, there is code such as reservations which
depends on the current behavior.

It seems to me that for 'missing' hugetlb faults there are two specific cases:
1) Shared or file backed mappings.  In this case, the page cache is the
   'source of truth'.  If there is not a page in the page cache, then we
   hand off to userfault with VM_UFFD_MISSING.
2) anon or private mappings.  In this case, pages are not in the page cache.
   The page table is the 'source of truth'.  Early in hugetlb fault processing
   we check the page table (huge_pte_none_mostly).  However, as my debug code
   has shown, checking the page table again with lock held will reveal that
   the pte has in fact been updated.

My thought was that regular anon pages would have the same issue.  So, I looked
at the calling code there.  In do_anonymous_page() there is this block:

	/* Use the zero-page for reads */
	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
			!mm_forbids_zeropage(vma->vm_mm)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
						vma->vm_page_prot));
		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
				vmf->address, &vmf->ptl);
		if (!pte_none(*vmf->pte)) {
			update_mmu_tlb(vma, vmf->address, vmf->pte);
			goto unlock;
		}
		ret = check_stable_address_space(vma->vm_mm);
		if (ret)
			goto unlock;
		/* Deliver the page fault to userland, check inside PT lock */
		if (userfaultfd_missing(vma)) {
			pte_unmap_unlock(vmf->pte, vmf->ptl);
			return handle_userfault(vmf, VM_UFFD_MISSING);
		}
		goto setpte;
	}

Notice that here the pte is checked while holding the page table lock
before making the decision to call handle_userfault().

In my testing, if we check huge_pte_none_mostly() while holding the page
table lock before calling handle_userfault(), we do not experience the
failure.  Can you see if this also resolves the issue in your environment?
I do not love this solution, as I still cannot explain how this code is
missing the pte update.

From f910e7155d6831514165af35e0d75574124a4477 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Fri, 30 Sep 2022 13:45:08 -0700
Subject: [PATCH] hugetlb: check pte with page table lock before handing to
 userfault

In hugetlb_no_page we decide a page is missing if not present in the
page cache.  This is perfectly valid for shared/file mappings where
pages must exist in the page cache.  For anon/private mappings, the page
table must be checked.  This is done early in hugetlb_fault processing
and is the reason we enter hugetlb_no_page.  However, the early check is
made without holding the page table lock.  There could be racing updates
to the pte entry, so check again with page table lock held before
deciding to call handle_userfault.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 60e077ce6ca7..4cb44a4629b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5560,10 +5560,29 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		if (idx >= size)
 			goto out;
 		/* Check for page in userfault range */
-		if (userfaultfd_missing(vma))
+		if (userfaultfd_missing(vma)) {
+			/*
+			 * For missing pages, the page cache (checked above) is
+			 * the 'source of truth' for shared mappings.  For anon
+			 * mappings, the source of truth is the page table.  We
+			 * already checked huge_pte_none_mostly() in
+			 * hugetlb_fault.  However, there could be racing
+			 * updates.  Check again while holding page table lock
+			 * before handing off to userfault.
+			 */
+			if (!(vma->vm_flags & VM_MAYSHARE)) {
+				ptl = huge_pte_lock(h, mm, ptep);
+				if (!huge_pte_none_mostly(huge_ptep_get(ptep))) {
+					spin_unlock(ptl);
+					ret = 0;
+					goto out;
+				}
+				spin_unlock(ptl);
+			}
 			return hugetlb_handle_userfault(vma, mapping, idx,
 						       flags, haddr, address,
 						       VM_UFFD_MISSING);
+		}
 
 		page = alloc_huge_page(vma, haddr, 0);
 		if (IS_ERR(page)) {
-- 
2.37.3




* Re: [syzbot] general protection fault in PageHeadHuge
  2022-09-30 21:38                     ` Mike Kravetz
@ 2022-10-01  2:47                       ` Peter Xu
  2022-10-01  3:01                         ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-10-01  2:47 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

Hi, Mike,

On Fri, Sep 30, 2022 at 02:38:01PM -0700, Mike Kravetz wrote:
> On 09/30/22 12:05, Peter Xu wrote:
> > On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > > I was able to do a little more debugging:
> > > 
> > > As you know the hugetlb calling path to handle_userfault is something
> > > like this,
> > > 
> > > hugetlb_fault
> > > 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > > 	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> > > 	if (huge_pte_none_mostly())
> > > 		hugetlb_no_page()
> > > 			page = find_lock_page(mapping, idx);
> > > 			if (!page) {
> > > 				if (userfaultfd_missing(vma))
> > > 					mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > > 					return handle_userfault()
> > > 			}
> > > 
> > > For anon mappings, find_lock_page() will never find a page, so as long
> > > as huge_pte_none_mostly() is true we will call into handle_userfault().
> > > 
> > > Since your analysis shows the testcase should never call handle_userfault() for
> > > a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
> > > the call to handle_userfault().  Sure enough, I saw plenty of printk messages.
> > > 
> > > Then, before calling handle_userfault() I added code to take the page table
> > > lock and test huge_pte_none_mostly() again.  In every FAULT_FLAG_WRITE case,
> > > this second test of huge_pte_none_mostly() was false.  So, the condition
> > > changed from the check in hugetlb_fault until the check (with page table
> > > lock) in hugetlb_no_page.
> > > 
> > > IIUC, the only code that should be modifying the pte in this test is
> > > hugetlb_mcopy_atomic_pte().  It also holds the hugetlb_fault_mutex while
> > > updating the pte.
> > > 
> > > It 'appears' that hugetlb_fault is not seeing the updated pte and I can
> > > only guess that it is due to some caching issues.
> > > 
> > > After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.
> > > 
> > > 	/* No need to invalidate - it was non-present before */
> > > 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > > 
> > > I suspect that is true.  However, it seems like this test depends on all
> > > CPUs seeing the updated pte immediately?
> > > 
> > > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> > > any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
> > > issues take me a while to figure out.
> > 
> > This morning when I went back and rethought the matter, I found that the
> > common hugetlb path handles private anonymous mappings with an entirely
> > empty page cache, as you explained above.  In that sense the two patches I
> > posted may not really make sense even if they can pass the tests.. and
> > maybe that's also the reason why the reservations got messed up.  This is
> > also something I found after reading more of the reservation code, e.g. no
> > matter whether a hugetlb mapping is private or shared, we only reserve that
> > exact number of pages at mmap() time.
> > 
> > Indeed, with that in mind, UFFDIO_COPY should also work, because the
> > hugetlb fault handler checks the pte before the page cache, so uffd missing
> > should still work as expected.
> > 
> > It makes sense especially for hugetlb to do that, otherwise there could be
> > plenty of zero huge pages cached in the page cache.  I'm not sure whether
> > this is the reason hugetlb does it differently (e.g. compared to shmem?);
> > it'll be great if I can get a confirmation.  If it's true, please ignore
> > the two patches I posted.
> > 
> > I think what you analyzed is correct in that the pte shouldn't go away
> > after being armed once.  One more thing I tried (actually yesterday) was
> > SIGBUS the process when the write missing event was generated, and I can
> > see the user stack points to the cmpxchg() of the pthread_mutex_lock().  It
> > means it indeed moved forward and passed the mutex type check; it also
> > means it should already have seen a !none pte with at least read
> > permission.  In that sense it matches the "missing TLB" possibility from
> > the experiment mentioned above, because with a missing TLB entry it should
> > keep getting stuck at the read, not the write.  It's still uncertain why
> > the pte can go away from under us, and why it quickly re-appears according
> > to your experiment.
> > 
> 
> I 'think' it is more of a race with all cpus seeing the pte update.  To be
> honest, I can not wrap my head around how that can happen.
> 
> I did not really like your idea of adding anon (or private) pages to the
> page cache.

I don't like that either, especially when I notice it breaks the
reservations... :)

Note that my previous patch wasn't adding anon or private pages into page
cache.  PageAnon() will be false for those pages being added.  I was
literally doing insertion of page cache for UFFDIO_COPY, rather than
directly making it anonymous.  Actually that's what shmem does.

> As you discovered, there is code like reservations which depend
> on current behavior.
> 
> It seems to me that for 'missing' hugetlb faults there are two specific cases:
> 1) Shared or file backed mappings.  In this case, the page cache is the
>    'source of truth'.  If there is not a page in the page cache, then we
>    hand off to userfault with VM_UFFD_MISSING.
> 2) anon or private mappings.  In this case, pages are not in the page cache.
>    The page table is the 'source of truth'.

Sorry, I can't say I fully agree with it.

IMHO the missing mode is really about the page cache.  Imagine creating a
private mapping of a hugetlbfs file that already has some pages written.
When a page fault triggers, we should never generate a MISSING event on a
cache hit, even if the pte is null.  Maybe that's not exactly what you
meant, but the wording is kind of misleading here.

I'd say it's really that hugetlbfs is special in how it treats private
pages.  Note that shmem doesn't treat them like that, as I mentioned.  But
AFAICT hugetlbfs is different in a good way, because otherwise we could be
wasting quite a few useless zero pages dangling in the page cache, and
AFAICT that's still what we do with shmem - just create a 20M anonymous
shmem file, privately map and write to it, and see how much memory we
consume..  It's not optimal, but still correct IMHO.

> Early in hugetlb fault processing
>    we check the page table (huge_pte_none_mostly).  However, as my debug code
>    has shown, checking the page table again with lock held will reveal that
>    the pte has in fact been updated.

Right, I found that out the hard way.  I should have been reading the code
more carefully: it's caused by CoW, where we need a clear+flush to ensure
TLB consistency (in hugetlb_wp):

	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
		ClearHPageRestoreReserve(new_page);

		/* Break COW or unshare */
		huge_ptep_clear_flush(vma, haddr, ptep);
		mmu_notifier_invalidate_range(mm, range.start, range.end);
		page_remove_rmap(old_page, vma, true);
		hugepage_add_new_anon_rmap(new_page, vma, haddr);
		set_huge_pte_at(mm, haddr, ptep,
				make_huge_pte(vma, new_page, !unshare));
		SetHPageMigratable(new_page);
		/* Make the old page be freed below */
		new_page = old_page;
	}

The early TLB flush was trying to avoid inconsistent TLB entries in
different cores.

So far I still don't know why the hugetlb commit changed the timing, but
this time since I tracked the pgtables I'm more sure of its cause.

> 
> My thought was that regular anon pages would have the same issue.  So, I looked
> at the calling code there.  In do_anonymous_page() there is this block:
> 
> 	/* Use the zero-page for reads */
> 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> 			!mm_forbids_zeropage(vma->vm_mm)) {
> 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> 						vma->vm_page_prot));
> 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> 				vmf->address, &vmf->ptl);
> 		if (!pte_none(*vmf->pte)) {
> 			update_mmu_tlb(vma, vmf->address, vmf->pte);
> 			goto unlock;
> 		}
> 		ret = check_stable_address_space(vma->vm_mm);
> 		if (ret)
> 			goto unlock;
> 		/* Deliver the page fault to userland, check inside PT lock */
> 		if (userfaultfd_missing(vma)) {
> 			pte_unmap_unlock(vmf->pte, vmf->ptl);
> 			return handle_userfault(vmf, VM_UFFD_MISSING);
> 		}
> 		goto setpte;
> 	}
> 
> Notice that here the pte is checked while holding the page table lock
> to make the decision to call handle_userfault().

Right.  That's a great finding, thanks.  It's just that I found it great
only after I found the whole story on the CoW in progress..  I could have
done better.

> 
> In my testing, if we check huge_pte_none_mostly() while holding the page table
> lock before calling handle_userfault we will not experience the failure.  Can
> you see if this also resolves the issue in your environment?  I do not love
> this solution as I still can not explain how this code is missing the pte
> update.

Yes, this patch fixes it - I have tested it.  But IMHO there are quite a
few wordings that are misleading, as I tried to explain above on why the
page cache still matters (and matters the most, IMHO, for MISSING events).

Meanwhile, IMHO there's a better way to address this - we're in
hugetlb_no_page() but we're relying on a pte that was read _without_ the
pgtable lock.  That means it can be either unstable or already changed.  We
do a proper pte_same() check for most of the rest of the hugetlb_no_page()
code path, but we just forgot to do that for userfaultfd.

One example: IMO we shouldn't target this "check pte under lock" at private
mappings only.  If the pte changed for either a private or a shared
mapping, it's probably already illegal to be inside hugetlb_no_page().
Logically, for shared mappings the pte can change in any form too, as long
as it happens under the pgtable lock.  So the locked check should logically
apply to shared mappings too, IMHO.

In summary, I attached another patch that addresses it in the way I
described above (only compile tested, and without a commit message because
I need to go very soon).  Let me know what you think is the best way to
approach this.

(While debugging this problem I also found a few other issues; I won't
make this email any longer, but will verify them next week and follow up
from there.)

Thanks,

> 
> From f910e7155d6831514165af35e0d75574124a4477 Mon Sep 17 00:00:00 2001
> From: Mike Kravetz <mike.kravetz@oracle.com>
> Date: Fri, 30 Sep 2022 13:45:08 -0700
> Subject: [PATCH] hugetlb: check pte with page table lock before handing to
>  userfault
> 
> In hugetlb_no_page we decide a page is missing if not present in the
> page cache.  This is perfectly valid for shared/file mappings where
> pages must exist in the page cache.  For anon/private mappings, the page
> table must be checked.  This is done early in hugetlb_fault processing
> and is the reason we enter hugetlb_no_page.  However, the early check is
> made without holding the page table lock.  There could be racing updates
> to the pte entry, so check again with page table lock held before
> deciding to call handle_userfault.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 60e077ce6ca7..4cb44a4629b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5560,10 +5560,29 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  		if (idx >= size)
>  			goto out;
>  		/* Check for page in userfault range */
> -		if (userfaultfd_missing(vma))
> +		if (userfaultfd_missing(vma)) {
> +			/*
> +			 * For missing pages, the page cache (checked above) is
> +			 * the 'source of truth' for shared mappings.  For anon
> +			 * mappings, the source of truth is the page table.  We
> +			 * already checked huge_pte_none_mostly() in
> +			 * hugetlb_fault.  However, there could be racing
> +			 * updates.  Check again while holding page table lock
> +			 * before handing off to userfault.
> +			 */
> +			if (!(vma->vm_flags & VM_MAYSHARE)) {
> +				ptl = huge_pte_lock(h, mm, ptep);
> +				if (!huge_pte_none_mostly(huge_ptep_get(ptep))) {
> +					spin_unlock(ptl);
> +					ret = 0;
> +					goto out;
> +				}
> +				spin_unlock(ptl);
> +			}
>  			return hugetlb_handle_userfault(vma, mapping, idx,
>  						       flags, haddr, address,
>  						       VM_UFFD_MISSING);
> +		}
>  
>  		page = alloc_huge_page(vma, haddr, 0);
>  		if (IS_ERR(page)) {
> -- 
> 2.37.3
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-10-01  2:47                       ` Peter Xu
@ 2022-10-01  3:01                         ` Peter Xu
  2022-10-03  1:16                           ` Mike Kravetz
  0 siblings, 1 reply; 24+ messages in thread
From: Peter Xu @ 2022-10-01  3:01 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

[-- Attachment #1: Type: text/plain, Size: 13205 bytes --]

On Fri, Sep 30, 2022 at 10:47:45PM -0400, Peter Xu wrote:
> Hi, Mike,
> 
> On Fri, Sep 30, 2022 at 02:38:01PM -0700, Mike Kravetz wrote:
> > On 09/30/22 12:05, Peter Xu wrote:
> > > On Thu, Sep 29, 2022 at 04:33:53PM -0700, Mike Kravetz wrote:
> > > > I was able to do a little more debugging:
> > > > 
> > > > As you know the hugetlb calling path to handle_userfault is something
> > > > like this,
> > > > 
> > > > hugetlb_fault
> > > > 	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> > > > 	ptep = huge_pte_alloc(mm, vma, haddr, huge_page_size(h));
> > > > 	if (huge_pte_none_mostly())
> > > > 		hugetlb_no_page()
> > > > 			page = find_lock_page(mapping, idx);
> > > > 			if (!page) {
> > > > 				if (userfaultfd_missing(vma))
> > > > 					mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> > > > 					return handle_userfault()
> > > > 			}
> > > > 
> > > > For anon mappings, find_lock_page() will never find a page, so as long
> > > > as huge_pte_none_mostly() is true we will call into handle_userfault().
> > > > 
> > > > Since your analysis shows the testcase should never call handle_userfault() for
> > > > a write fault, I simply added a 'if (flags & FAULT_FLAG_WRITE) printk' before
> > > > the call to handle_userfault().  Sure enough, I saw plenty of printk messages.
> > > > 
> > > > Then, before calling handle_userfault() I added code to take the page table
> > > > lock and test huge_pte_none_mostly() again.  In every FAULT_FLAG_WRITE case,
> > > > this second test of huge_pte_none_mostly() was false.  So, the condition
> > > > changed from the check in hugetlb_fault until the check (with page table
> > > > lock) in hugetlb_no_page.
> > > > 
> > > > IIUC, the only code that should be modifying the pte in this test is
> > > > hugetlb_mcopy_atomic_pte().  It also holds the hugetlb_fault_mutex while
> > > > updating the pte.
> > > > 
> > > > It 'appears' that hugetlb_fault is not seeing the updated pte and I can
> > > > only guess that it is due to some caching issues.
> > > > 
> > > > After writing the pte in hugetlb_mcopy_atomic_pte() there is this comment.
> > > > 
> > > > 	/* No need to invalidate - it was non-present before */
> > > > 	update_mmu_cache(dst_vma, dst_addr, dst_pte);
> > > > 
> > > > I suspect that is true.  However, it seems like this test depends on all
> > > > CPUs seeing the updated pte immediately?
> > > > 
> > > > I added some TLB flushing to hugetlb_mcopy_atomic_pte, but it did not make
> > > > any difference.  Suggestions would be appreciated as cache/tlb/???  flushing
> > > > issues take me a while to figure out.
> > > 
> > > This morning when I went back and rethought the matter, I found that the
> > > common hugetlb path handles private anonymous mappings with an entirely
> > > empty page cache, as you explained above.  In that sense the two patches I
> > > posted may not really make sense even if they can pass the tests.. and
> > > maybe that's also the reason why the reservations got messed up.  This is
> > > also something I found after reading more of the reservation code, e.g. no
> > > matter whether a hugetlb mapping is private or shared, we only reserve that
> > > exact number of pages at mmap() time.
> > > 
> > > Indeed, with that in mind, UFFDIO_COPY should also work, because the
> > > hugetlb fault handler checks the pte before the page cache, so uffd missing
> > > should still work as expected.
> > > 
> > > It makes sense especially for hugetlb to do that, otherwise there could be
> > > plenty of zero huge pages cached in the page cache.  I'm not sure whether
> > > this is the reason hugetlb does it differently (e.g. compared to shmem?);
> > > it'll be great if I can get a confirmation.  If it's true, please ignore
> > > the two patches I posted.
> > > 
> > > I think what you analyzed is correct in that the pte shouldn't go away
> > > after being armed once.  One more thing I tried (actually yesterday) was
> > > SIGBUS the process when the write missing event was generated, and I can
> > > see the user stack points to the cmpxchg() of the pthread_mutex_lock().  It
> > > means it indeed moved forward and passed the mutex type check; it also
> > > means it should already have seen a !none pte with at least read
> > > permission.  In that sense it matches the "missing TLB" possibility from
> > > the experiment mentioned above, because with a missing TLB entry it should
> > > keep getting stuck at the read, not the write.  It's still uncertain why
> > > the pte can go away from under us, and why it quickly re-appears according
> > > to your experiment.
> > > 
> > 
> > I 'think' it is more of a race with all cpus seeing the pte update.  To be
> > honest, I can not wrap my head around how that can happen.
> > 
> > I did not really like your idea of adding anon (or private) pages to the
> > page cache.
> 
> I don't like that either, especially when I notice it breaks the
> reservations... :)
> 
> Note that my previous patch wasn't adding anon or private pages into page
> cache.  PageAnon() will be false for those pages being added.  I was
> literally doing insertion of page cache for UFFDIO_COPY, rather than
> directly making it anonymous.  Actually that's what shmem does.
> 
> > As you discovered, there is code like reservations which depend
> > on current behavior.
> > 
> > It seems to me that for 'missing' hugetlb faults there are two specific cases:
> > 1) Shared or file backed mappings.  In this case, the page cache is the
> >    'source of truth'.  If there is not a page in the page cache, then we
> >    hand off to userfault with VM_UFFD_MISSING.
> > 2) anon or private mappings.  In this case, pages are not in the page cache.
> >    The page table is the 'source of truth'.
> 
> Sorry, I can't say I fully agree with it.
> 
> IMHO the missing mode is really about the page cache.  Imagine creating a
> private mapping of a hugetlbfs file that already has some pages written.
> When a page fault triggers, we should never generate a MISSING event on a
> cache hit, even if the pte is null.  Maybe that's not exactly what you
> meant, but the wording is kind of misleading here.
> 
> I'd say it's really that hugetlbfs is special in how it treats private
> pages.  Note that shmem doesn't treat them like that, as I mentioned.  But
> AFAICT hugetlbfs is different in a good way, because otherwise we could be
> wasting quite a few useless zero pages dangling in the page cache, and
> AFAICT that's still what we do with shmem - just create a 20M anonymous
> shmem file, privately map and write to it, and see how much memory we
> consume..  It's not optimal, but still correct IMHO.
> 
> > Early in hugetlb fault processing
> >    we check the page table (huge_pte_none_mostly).  However, as my debug code
> >    has shown, checking the page table again with lock held will reveal that
> >    the pte has in fact been updated.
> 
> Right, I found that out the hard way.  I should have been reading the code
> more carefully: it's caused by CoW, where we need a clear+flush to ensure
> TLB consistency (in hugetlb_wp):
> 
> 	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
> 		ClearHPageRestoreReserve(new_page);
> 
> 		/* Break COW or unshare */
> 		huge_ptep_clear_flush(vma, haddr, ptep);
> 		mmu_notifier_invalidate_range(mm, range.start, range.end);
> 		page_remove_rmap(old_page, vma, true);
> 		hugepage_add_new_anon_rmap(new_page, vma, haddr);
> 		set_huge_pte_at(mm, haddr, ptep,
> 				make_huge_pte(vma, new_page, !unshare));
> 		SetHPageMigratable(new_page);
> 		/* Make the old page be freed below */
> 		new_page = old_page;
> 	}
> 
> The early TLB flush was trying to avoid inconsistent TLB entries in
> different cores.
> 
> So far I still don't know why the hugetlb commit changed the timing, but
> this time since I tracked the pgtables I'm more sure of its cause.
> 
> > 
> > My thought was that regular anon pages would have the same issue.  So, I looked
> > at the calling code there.  In do_anonymous_page() there is this block:
> > 
> > 	/* Use the zero-page for reads */
> > 	if (!(vmf->flags & FAULT_FLAG_WRITE) &&
> > 			!mm_forbids_zeropage(vma->vm_mm)) {
> > 		entry = pte_mkspecial(pfn_pte(my_zero_pfn(vmf->address),
> > 						vma->vm_page_prot));
> > 		vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd,
> > 				vmf->address, &vmf->ptl);
> > 		if (!pte_none(*vmf->pte)) {
> > 			update_mmu_tlb(vma, vmf->address, vmf->pte);
> > 			goto unlock;
> > 		}
> > 		ret = check_stable_address_space(vma->vm_mm);
> > 		if (ret)
> > 			goto unlock;
> > 		/* Deliver the page fault to userland, check inside PT lock */
> > 		if (userfaultfd_missing(vma)) {
> > 			pte_unmap_unlock(vmf->pte, vmf->ptl);
> > 			return handle_userfault(vmf, VM_UFFD_MISSING);
> > 		}
> > 		goto setpte;
> > 	}
> > 
> > Notice that here the pte is checked while holding the page table lock
> > to make the decision to call handle_userfault().
> 
> Right.  That's a great finding, thanks.  It's just that I found it great
> only after I found the whole story on the CoW in progress..  I could have
> done better.
> 
> > 
> > In my testing, if we check huge_pte_none_mostly() while holding the page table
> > lock before calling handle_userfault we will not experience the failure.  Can
> > you see if this also resolves the issue in your environment?  I do not love
> > this solution as I still can not explain how this code is missing the pte
> > update.
> 
> Yes, this patch fixes it - I have tested it.  But IMHO there are quite a
> few wordings that are misleading, as I tried to explain above on why the
> page cache still matters (and matters the most, IMHO, for MISSING events).
> 
> Meanwhile, IMHO there's a better way to address this - we're in
> hugetlb_no_page() but we're relying on a pte that was read _without_ the
> pgtable lock.  That means it can be either unstable or already changed.  We
> do a proper pte_same() check for most of the rest of the hugetlb_no_page()
> code path, but we just forgot to do that for userfaultfd.
> 
> One example: IMO we shouldn't target this "check pte under lock" at private
> mappings only.  If the pte changed for either a private or a shared
> mapping, it's probably already illegal to be inside hugetlb_no_page().
> Logically, for shared mappings the pte can change in any form too, as long
> as it happens under the pgtable lock.  So the locked check should logically
> apply to shared mappings too, IMHO.
> 
> In summary, I attached another patch that addresses it in the way I
> described above (only compile tested, and without a commit message because
> I need to go very soon).  Let me know what you think is the best way to
> approach this.

Obviously I never remember to attach things when I meant to.. Sorry for the
noise.  Attached this time.

> 
> (While debugging this problem I also found a few other issues; I won't
> make this email any longer, but will verify them next week and follow up
> from there.)
> 
> Thanks,
> 
> > 
> > From f910e7155d6831514165af35e0d75574124a4477 Mon Sep 17 00:00:00 2001
> > From: Mike Kravetz <mike.kravetz@oracle.com>
> > Date: Fri, 30 Sep 2022 13:45:08 -0700
> > Subject: [PATCH] hugetlb: check pte with page table lock before handing to
> >  userfault
> > 
> > In hugetlb_no_page we decide a page is missing if not present in the
> > page cache.  This is perfectly valid for shared/file mappings where
> > pages must exist in the page cache.  For anon/private mappings, the page
> > table must be checked.  This is done early in hugetlb_fault processing
> > and is the reason we enter hugetlb_no_page.  However, the early check is
> > made without holding the page table lock.  There could be racing updates
> > to the pte entry, so check again with page table lock held before
> > deciding to call handle_userfault.
> > 
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > ---
> >  mm/hugetlb.c | 21 ++++++++++++++++++++-
> >  1 file changed, 20 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 60e077ce6ca7..4cb44a4629b8 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5560,10 +5560,29 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >  		if (idx >= size)
> >  			goto out;
> >  		/* Check for page in userfault range */
> > -		if (userfaultfd_missing(vma))
> > +		if (userfaultfd_missing(vma)) {
> > +			/*
> > +			 * For missing pages, the page cache (checked above) is
> > +			 * the 'source of truth' for shared mappings.  For anon
> > +			 * mappings, the source of truth is the page table.  We
> > +			 * already checked huge_pte_none_mostly() in
> > +			 * hugetlb_fault.  However, there could be racing
> > +			 * updates.  Check again while holding page table lock
> > +			 * before handing off to userfault.
> > +			 */
> > +			if (!(vma->vm_flags & VM_MAYSHARE)) {
> > +				ptl = huge_pte_lock(h, mm, ptep);
> > +				if (!huge_pte_none_mostly(huge_ptep_get(ptep))) {
> > +					spin_unlock(ptl);
> > +					ret = 0;
> > +					goto out;
> > +				}
> > +				spin_unlock(ptl);
> > +			}
> >  			return hugetlb_handle_userfault(vma, mapping, idx,
> >  						       flags, haddr, address,
> >  						       VM_UFFD_MISSING);
> > +		}
> >  
> >  		page = alloc_huge_page(vma, haddr, 0);
> >  		if (IS_ERR(page)) {
> > -- 
> > 2.37.3
> > 
> 
> -- 
> Peter Xu

-- 
Peter Xu

[-- Attachment #2: 0001-mm-hugetlb-Fix-race-condition-of-uffd-missing-handli.patch --]
[-- Type: text/plain, Size: 2057 bytes --]

From fe9e50551f3fdb7107315784affca4f9b1c4720f Mon Sep 17 00:00:00 2001
From: Peter Xu <peterx@redhat.com>
Date: Fri, 30 Sep 2022 22:22:44 -0400
Subject: [PATCH] mm/hugetlb: Fix race condition of uffd missing handling
Content-type: text/plain

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 36 +++++++++++++++++++++++++++++++++---
 1 file changed, 33 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dd29cba46e9e..5015d8aa5da4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5557,9 +5557,39 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 	if (!page) {
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
-			ret = hugetlb_handle_userfault(vma, mapping, idx,
-						       flags, haddr, address,
-						       VM_UFFD_MISSING);
+			bool same;
+
+			/*
+			 * Since hugetlb_no_page() was examining pte
+			 * without pgtable lock, we need to re-test under
+			 * lock because the pte may not be stable and could
+			 * have changed from under us.  Try to detect
+			 * either changed or during-changing ptes and bail
+			 * out properly.
+			 *
+			 * One example of changing pte is in-progress CoW
+			 * of private mapping, which will clear+flush pte
+			 * then reinstall the new one.
+			 *
+			 * Note that userfaultfd is actually fine with
+			 * false positives (e.g. caused by pte changed),
+			 * but not wrong logical events (e.g. caused by
+			 * reading a pte during changing).  The latter can
+			 * confuse the userspace, so the strictness is very
+			 * much preferred.  E.g., MISSING event should
+			 * never happen on the page after UFFDIO_COPY has
+			 * correctly installed the page and returned.
+			 */
+			ptl = huge_pte_lock(h, mm, ptep);
+			same = pte_same(huge_ptep_get(ptep), old_pte);
+			spin_unlock(ptl);
+			if (same)
+				ret = hugetlb_handle_userfault(vma, mapping, idx,
+							       flags, haddr, address,
+							       VM_UFFD_MISSING);
+			else
+				/* PTE changed or unstable, retry */
+				ret = 0;
 			goto out;
 		}
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-10-01  3:01                         ` Peter Xu
@ 2022-10-03  1:16                           ` Mike Kravetz
  2022-10-03 15:24                             ` Peter Xu
  0 siblings, 1 reply; 24+ messages in thread
From: Mike Kravetz @ 2022-10-03  1:16 UTC (permalink / raw)
  To: Peter Xu
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On 09/30/22 23:01, Peter Xu wrote:
> On Fri, Sep 30, 2022 at 10:47:45PM -0400, Peter Xu wrote:
> From fe9e50551f3fdb7107315784affca4f9b1c4720f Mon Sep 17 00:00:00 2001
> From: Peter Xu <peterx@redhat.com>
> Date: Fri, 30 Sep 2022 22:22:44 -0400
> Subject: [PATCH] mm/hugetlb: Fix race condition of uffd missing handling
> Content-type: text/plain
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 36 +++++++++++++++++++++++++++++++++---
>  1 file changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index dd29cba46e9e..5015d8aa5da4 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5557,9 +5557,39 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  	if (!page) {
>  		/* Check for page in userfault range */
>  		if (userfaultfd_missing(vma)) {
> -			ret = hugetlb_handle_userfault(vma, mapping, idx,
> -						       flags, haddr, address,
> -						       VM_UFFD_MISSING);
> +			bool same;
> +
> +			/*
> +			 * Since hugetlb_no_page() was examining pte
> +			 * without pgtable lock, we need to re-test under
> +			 * lock because the pte may not be stable and could
> +			 * have changed from under us.  Try to detect
> +			 * either changed or during-changing ptes and bail
> +			 * out properly.
> +			 *
> +			 * One example of changing pte is in-progress CoW
> +			 * of private mapping, which will clear+flush pte
> +			 * then reinstall the new one.
> +			 *
> +			 * Note that userfaultfd is actually fine with
> +			 * false positives (e.g. caused by pte changed),
> +			 * but not wrong logical events (e.g. caused by
> +			 * reading a pte during changing).  The latter can
> +			 * confuse the userspace, so the strictness is very
> +			 * much preferred.  E.g., MISSING event should
> +			 * never happen on the page after UFFDIO_COPY has
> +			 * correctly installed the page and returned.
> +			 */

Thanks Peter!

The wording and the pte_same() check here are better than what I proposed.  I
think that last paragraph above should go into the commit message, as it
describes user-visible effects (a MISSING event after UFFDIO_COPY has
correctly installed the page and returned).

This seems to have existed since hugetlb userfault support was added.  It just
became exposed recently due to locking changes going into 6.1.  However, I
think it may also have been exposed in the window after hugetlb userfault
support was added and before the current i_mmap_sema locking for pmd sharing
was added.  Just a long way of saying I am not sure cc'ing stable is of much
value.

-- 
Mike Kravetz

> +			ptl = huge_pte_lock(h, mm, ptep);
> +			same = pte_same(huge_ptep_get(ptep), old_pte);
> +			spin_unlock(ptl);
> +			if (same)
> +				ret = hugetlb_handle_userfault(vma, mapping, idx,
> +							       flags, haddr, address,
> +							       VM_UFFD_MISSING);
> +			else
> +				/* PTE changed or unstable, retry */
> +				ret = 0;
>  			goto out;
>  		}
>  
> -- 
> 2.37.3
> 


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [syzbot] general protection fault in PageHeadHuge
  2022-10-03  1:16                           ` Mike Kravetz
@ 2022-10-03 15:24                             ` Peter Xu
  0 siblings, 0 replies; 24+ messages in thread
From: Peter Xu @ 2022-10-03 15:24 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hugh Dickins, Axel Rasmussen, Yang Shi, Matthew Wilcox, syzbot,
	akpm, linux-kernel, linux-mm, llvm, nathan, ndesaulniers,
	songmuchun, syzkaller-bugs, trix

On Sun, Oct 02, 2022 at 06:16:53PM -0700, Mike Kravetz wrote:
> On 09/30/22 23:01, Peter Xu wrote:
> > On Fri, Sep 30, 2022 at 10:47:45PM -0400, Peter Xu wrote:
> > From fe9e50551f3fdb7107315784affca4f9b1c4720f Mon Sep 17 00:00:00 2001
> > From: Peter Xu <peterx@redhat.com>
> > Date: Fri, 30 Sep 2022 22:22:44 -0400
> > Subject: [PATCH] mm/hugetlb: Fix race condition of uffd missing handling
> > Content-type: text/plain
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  mm/hugetlb.c | 36 +++++++++++++++++++++++++++++++++---
> >  1 file changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index dd29cba46e9e..5015d8aa5da4 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -5557,9 +5557,39 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
> >  	if (!page) {
> >  		/* Check for page in userfault range */
> >  		if (userfaultfd_missing(vma)) {
> > -			ret = hugetlb_handle_userfault(vma, mapping, idx,
> > -						       flags, haddr, address,
> > -						       VM_UFFD_MISSING);
> > +			bool same;
> > +
> > +			/*
> > +			 * Since hugetlb_no_page() was examining pte
> > +			 * without pgtable lock, we need to re-test under
> > +			 * lock because the pte may not be stable and could
> > +			 * have changed from under us.  Try to detect
> > +			 * ptes that either have changed or are being
> > +			 * changed, and bail out properly if so.
> > +			 *
> > +			 * One example of changing pte is in-progress CoW
> > +			 * of private mapping, which will clear+flush pte
> > +			 * then reinstall the new one.
> > +			 *
> > +			 * Note that userfaultfd is actually fine with
> > +			 * false positives (e.g. caused by pte changed),
> > +			 * but not wrong logical events (e.g. caused by
> > +			 * reading a pte while it is being changed).  The
> > +			 * latter can confuse userspace, so strictness is
> > +			 * very much preferred.  E.g., a MISSING event
> > +			 * should never happen on the page after UFFDIO_COPY
> > +			 * has correctly installed the page and returned.
> > +			 */
> 
> Thanks Peter!
> 
> The wording and the pte_same check here are better than what I proposed.  I
> think that last paragraph above should go into the commit message, as it
> describes user-visible effects (a MISSING event after UFFDIO_COPY has
> correctly installed the page and returned).

Will do.

> 
> This seems to have existed since hugetlb userfault support was added.  It just
> became exposed recently due to locking changes going into 6.1.  However, I
> think it may also have existed in the window after hugetlb userfault support
> was added and before the current i_mmap_sema locking for pmd sharing was added.

Agreed.

> Just a long way of saying I am not sure cc stable is of much value.

Logically the change is stable material.  My worry is that after the uffd-wp
integration with hugetlb it's indeed possible to trigger this on the CoWs
we're already encountering, so IMO it's still something good to have for 5.19.

I just saw that you proposed a similar fix in 4643d67e8cb0b35 for a similar
page migration race three years ago.  I'm not sure whether this can also
happen with the uffd missing mode even before uffd-wp was introduced.

I think I'll first post the patch with a Fixes tag attached but without
tagging stable, but let me know your thoughts.  No worries about the
backport; I can take care of doing that and the tests.

I also plan to add your Co-developed-by tag if you're fine with it, because
this patch is a collaboration effort IMO, but please let me know, either here
or by replying directly to the patch once it's posted, if you think that's
improper in any form.

Thanks Mike!

-- 
Peter Xu




end of thread, other threads:[~2022-10-03 15:24 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-23 16:05 [syzbot] general protection fault in PageHeadHuge syzbot
2022-09-23 21:11 ` Mike Kravetz
2022-09-23 21:37   ` Yang Shi
2022-09-24  0:14   ` Hugh Dickins
2022-09-24  0:18     ` Nathan Chancellor
2022-09-24  7:28       ` Hugh Dickins
2022-09-24  0:58     ` Peter Xu
2022-09-24 15:06       ` Peter Xu
2022-09-24 19:01         ` Mike Kravetz
2022-09-26  0:11           ` Peter Xu
2022-09-27 16:24             ` Mike Kravetz
2022-09-27 16:45               ` Peter Xu
2022-09-29 23:33                 ` Mike Kravetz
2022-09-30  1:29                   ` Peter Xu
2022-09-30  1:35                     ` Peter Xu
2022-09-30 16:05                   ` Peter Xu
2022-09-30 21:38                     ` Mike Kravetz
2022-10-01  2:47                       ` Peter Xu
2022-10-01  3:01                         ` Peter Xu
2022-10-03  1:16                           ` Mike Kravetz
2022-10-03 15:24                             ` Peter Xu
2022-09-25 18:55     ` Andrew Morton
2022-09-24 21:56   ` Matthew Wilcox
2022-09-27  4:32     ` Hugh Dickins
