* [PATCH v2] secretmem: fix unhandled fault in truncate
@ 2022-07-07 16:56 Mike Rapoport
2022-07-07 17:48 ` Yang Shi
2022-07-07 20:27 ` Axel Rasmussen
0 siblings, 2 replies; 8+ messages in thread
From: Mike Rapoport @ 2022-07-07 16:56 UTC (permalink / raw)
To: Andrew Morton
Cc: Axel Rasmussen, Eric Biggers, Hillf Danton, Matthew Wilcox,
Mike Rapoport, Mike Rapoport, linux-fsdevel, linux-mm,
syzkaller-bugs, syzbot+9bd2b7adbd34b30b87e4
From: Mike Rapoport <rppt@linux.ibm.com>
syzkaller reports the following issue:
BUG: unable to handle page fault for address: ffff888021f7e005
PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
Oops: 0002 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
zero_user_segments include/linux/highmem.h:272 [inline]
folio_zero_range include/linux/highmem.h:428 [inline]
truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
truncate_inode_pages mm/truncate.c:452 [inline]
truncate_pagecache+0x63/0x90 mm/truncate.c:753
simple_setattr+0xed/0x110 fs/libfs.c:535
secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
notify_change+0xb8c/0x12b0 fs/attr.c:424
do_truncate+0x13c/0x200 fs/open.c:65
do_sys_ftruncate+0x536/0x730 fs/open.c:193
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x46/0xb0
RIP: 0033:0x7fb29d900899
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
</TASK>
Modules linked in:
CR2: ffff888021f7e005
---[ end trace 0000000000000000 ]---
Eric Biggers suggested that this happens when
secretmem_setattr()->simple_setattr() races with secretmem_fault() so
that a page that is faulted in by secretmem_fault() (and thus removed
from the direct map) is zeroed by inode truncation right afterwards.
Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
secretmem_fault() prevents the race.
Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
---
v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
Axel, I didn't add your Reviewed-by because v2 is quite different.
mm/secretmem.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/mm/secretmem.c b/mm/secretmem.c
index 206ed6b40c1d..a4fabf705e4f 100644
--- a/mm/secretmem.c
+++ b/mm/secretmem.c
@@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
gfp_t gfp = vmf->gfp_mask;
unsigned long addr;
struct page *page;
+ vm_fault_t ret;
int err;
if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
return vmf_error(-EINVAL);
+ inode_lock_shared(inode);
+
retry:
page = find_lock_page(mapping, offset);
if (!page) {
page = alloc_page(gfp | __GFP_ZERO);
- if (!page)
- return VM_FAULT_OOM;
+ if (!page) {
+ ret = VM_FAULT_OOM;
+ goto out;
+ }
err = set_direct_map_invalid_noflush(page);
if (err) {
put_page(page);
- return vmf_error(err);
+ ret = vmf_error(err);
+ goto out;
}
__SetPageUptodate(page);
@@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
if (err == -EEXIST)
goto retry;
- return vmf_error(err);
+ ret = vmf_error(err);
+ goto out;
}
addr = (unsigned long)page_address(page);
@@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
}
vmf->page = page;
- return VM_FAULT_LOCKED;
+ ret = VM_FAULT_LOCKED;
+
+out:
+ inode_unlock_shared(inode);
+ return ret;
}
static const struct vm_operations_struct secretmem_vm_ops = {
base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
--
2.34.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH v2] secretmem: fix unhandled fault in truncate
2022-07-07 16:56 [PATCH v2] secretmem: fix unhandled fault in truncate Mike Rapoport
@ 2022-07-07 17:48 ` Yang Shi
2022-07-07 20:55 ` Darrick J. Wong
2022-07-07 20:27 ` Axel Rasmussen
1 sibling, 1 reply; 8+ messages in thread
From: Yang Shi @ 2022-07-07 17:48 UTC (permalink / raw)
To: Mike Rapoport
Cc: Andrew Morton, Axel Rasmussen, Eric Biggers, Hillf Danton,
Matthew Wilcox, Mike Rapoport, Linux FS-devel Mailing List,
Linux MM, syzkaller-bugs, syzbot+9bd2b7adbd34b30b87e4
On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> syzkaller reports the following issue:
>
> BUG: unable to handle page fault for address: ffff888021f7e005
> PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
> Oops: 0002 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
> RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
> RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
> R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
> R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
> FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> zero_user_segments include/linux/highmem.h:272 [inline]
> folio_zero_range include/linux/highmem.h:428 [inline]
> truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
> truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
> truncate_inode_pages mm/truncate.c:452 [inline]
> truncate_pagecache+0x63/0x90 mm/truncate.c:753
> simple_setattr+0xed/0x110 fs/libfs.c:535
> secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
> notify_change+0xb8c/0x12b0 fs/attr.c:424
> do_truncate+0x13c/0x200 fs/open.c:65
> do_sys_ftruncate+0x536/0x730 fs/open.c:193
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7fb29d900899
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
> RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
> RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
> RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
> R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
> </TASK>
> Modules linked in:
> CR2: ffff888021f7e005
> ---[ end trace 0000000000000000 ]---
>
> Eric Biggers suggested that this happens when
> secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> that a page that is faulted in by secretmem_fault() (and thus removed
> from the direct map) is zeroed by inode truncation right afterwards.
>
> Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> secretmem_fault() prevents the race.
Should invalidate_lock be used to serialize between page fault and truncate?
>
> Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> Suggested-by: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>
> v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
>
> Axel, I didn't add your Reviewed-by because v2 is quite different.
>
> mm/secretmem.c | 21 ++++++++++++++++-----
> 1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 206ed6b40c1d..a4fabf705e4f 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> gfp_t gfp = vmf->gfp_mask;
> unsigned long addr;
> struct page *page;
> + vm_fault_t ret;
> int err;
>
> if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> return vmf_error(-EINVAL);
>
> + inode_lock_shared(inode);
> +
> retry:
> page = find_lock_page(mapping, offset);
> if (!page) {
> page = alloc_page(gfp | __GFP_ZERO);
> - if (!page)
> - return VM_FAULT_OOM;
> + if (!page) {
> + ret = VM_FAULT_OOM;
> + goto out;
> + }
>
> err = set_direct_map_invalid_noflush(page);
> if (err) {
> put_page(page);
> - return vmf_error(err);
> + ret = vmf_error(err);
> + goto out;
> }
>
> __SetPageUptodate(page);
> @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> if (err == -EEXIST)
> goto retry;
>
> - return vmf_error(err);
> + ret = vmf_error(err);
> + goto out;
> }
>
> addr = (unsigned long)page_address(page);
> @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> }
>
> vmf->page = page;
> - return VM_FAULT_LOCKED;
> + ret = VM_FAULT_LOCKED;
> +
> +out:
> + inode_unlock_shared(inode);
> + return ret;
> }
>
> static const struct vm_operations_struct secretmem_vm_ops = {
>
> base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> --
> 2.34.1
>
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2] secretmem: fix unhandled fault in truncate
2022-07-07 16:56 [PATCH v2] secretmem: fix unhandled fault in truncate Mike Rapoport
2022-07-07 17:48 ` Yang Shi
@ 2022-07-07 20:27 ` Axel Rasmussen
1 sibling, 0 replies; 8+ messages in thread
From: Axel Rasmussen @ 2022-07-07 20:27 UTC (permalink / raw)
To: Mike Rapoport
Cc: Andrew Morton, Eric Biggers, Hillf Danton, Matthew Wilcox,
Mike Rapoport, linux-fsdevel, Linux MM, syzkaller-bugs, syzbot
Looks correct to me - at least, I'm confident it will prevent the
race. I'm not really familiar enough with filesystem APIs to know
whether Yang's suggestion is better or not, though.
Reviewed-by: Axel Rasmussen <axelrasmussen@google.com>
On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> From: Mike Rapoport <rppt@linux.ibm.com>
>
> syzkaller reports the following issue:
>
> BUG: unable to handle page fault for address: ffff888021f7e005
> PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
> Oops: 0002 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
> RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
> RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
> RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
> R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
> R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
> FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> <TASK>
> zero_user_segments include/linux/highmem.h:272 [inline]
> folio_zero_range include/linux/highmem.h:428 [inline]
> truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
> truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
> truncate_inode_pages mm/truncate.c:452 [inline]
> truncate_pagecache+0x63/0x90 mm/truncate.c:753
> simple_setattr+0xed/0x110 fs/libfs.c:535
> secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
> notify_change+0xb8c/0x12b0 fs/attr.c:424
> do_truncate+0x13c/0x200 fs/open.c:65
> do_sys_ftruncate+0x536/0x730 fs/open.c:193
> do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> entry_SYSCALL_64_after_hwframe+0x46/0xb0
> RIP: 0033:0x7fb29d900899
> Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
> RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
> RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
> RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
> R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
> </TASK>
> Modules linked in:
> CR2: ffff888021f7e005
> ---[ end trace 0000000000000000 ]---
>
> Eric Biggers suggested that this happens when
> secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> that a page that is faulted in by secretmem_fault() (and thus removed
> from the direct map) is zeroed by inode truncation right afterwards.
>
> Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> secretmem_fault() prevents the race.
>
> Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> Suggested-by: Eric Biggers <ebiggers@kernel.org>
> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> ---
>
> v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
>
> Axel, I didn't add your Reviewed-by because v2 is quite different.
>
> mm/secretmem.c | 21 ++++++++++++++++-----
> 1 file changed, 16 insertions(+), 5 deletions(-)
>
> diff --git a/mm/secretmem.c b/mm/secretmem.c
> index 206ed6b40c1d..a4fabf705e4f 100644
> --- a/mm/secretmem.c
> +++ b/mm/secretmem.c
> @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> gfp_t gfp = vmf->gfp_mask;
> unsigned long addr;
> struct page *page;
> + vm_fault_t ret;
> int err;
>
> if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> return vmf_error(-EINVAL);
>
> + inode_lock_shared(inode);
> +
> retry:
> page = find_lock_page(mapping, offset);
> if (!page) {
> page = alloc_page(gfp | __GFP_ZERO);
> - if (!page)
> - return VM_FAULT_OOM;
> + if (!page) {
> + ret = VM_FAULT_OOM;
> + goto out;
> + }
>
> err = set_direct_map_invalid_noflush(page);
> if (err) {
> put_page(page);
> - return vmf_error(err);
> + ret = vmf_error(err);
> + goto out;
> }
>
> __SetPageUptodate(page);
> @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> if (err == -EEXIST)
> goto retry;
>
> - return vmf_error(err);
> + ret = vmf_error(err);
> + goto out;
> }
>
> addr = (unsigned long)page_address(page);
> @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> }
>
> vmf->page = page;
> - return VM_FAULT_LOCKED;
> + ret = VM_FAULT_LOCKED;
> +
> +out:
> + inode_unlock_shared(inode);
> + return ret;
> }
>
> static const struct vm_operations_struct secretmem_vm_ops = {
>
> base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> --
> 2.34.1
>
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2] secretmem: fix unhandled fault in truncate
2022-07-07 17:48 ` Yang Shi
@ 2022-07-07 20:55 ` Darrick J. Wong
2022-07-07 22:09 ` Yang Shi
0 siblings, 1 reply; 8+ messages in thread
From: Darrick J. Wong @ 2022-07-07 20:55 UTC (permalink / raw)
To: Yang Shi
Cc: Mike Rapoport, Andrew Morton, Axel Rasmussen, Eric Biggers,
Hillf Danton, Matthew Wilcox, Mike Rapoport,
Linux FS-devel Mailing List, Linux MM, syzkaller-bugs,
syzbot+9bd2b7adbd34b30b87e4
On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > syzkaller reports the following issue:
> >
> > BUG: unable to handle page fault for address: ffff888021f7e005
> > PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
> > Oops: 0002 [#1] PREEMPT SMP KASAN
> > CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> > Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> > RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
> > RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
> > RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
> > RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
> > R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
> > R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
> > FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
> > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > Call Trace:
> > <TASK>
> > zero_user_segments include/linux/highmem.h:272 [inline]
> > folio_zero_range include/linux/highmem.h:428 [inline]
> > truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
> > truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
> > truncate_inode_pages mm/truncate.c:452 [inline]
> > truncate_pagecache+0x63/0x90 mm/truncate.c:753
> > simple_setattr+0xed/0x110 fs/libfs.c:535
> > secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
> > notify_change+0xb8c/0x12b0 fs/attr.c:424
> > do_truncate+0x13c/0x200 fs/open.c:65
> > do_sys_ftruncate+0x536/0x730 fs/open.c:193
> > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > RIP: 0033:0x7fb29d900899
> > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
> > RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
> > RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
> > RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
> > R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
> > R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
> > </TASK>
> > Modules linked in:
> > CR2: ffff888021f7e005
> > ---[ end trace 0000000000000000 ]---
> >
> > Eric Biggers suggested that this happens when
> > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > that a page that is faulted in by secretmem_fault() (and thus removed
> > from the direct map) is zeroed by inode truncation right afterwards.
> >
> > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > secretmem_fault() prevents the race.
>
> Should invalidate_lock be used to serialize between page fault and truncate?
I would have thought so, given Documentation/filesystems/locking.rst:
"->fault() is called when a previously not present pte is about to be
faulted in. The filesystem must find and return the page associated with
the passed in "pgoff" in the vm_fault structure. If it is possible that
the page may be truncated and/or invalidated, then the filesystem must
lock invalidate_lock, then ensure the page is not already truncated
(invalidate_lock will block subsequent truncate), and then return with
VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
IIRC page faults aren't supposed to take i_rwsem because the fault could
be in response to someone mmaping a file into memory and then write()ing
to the same file using the mmapped region. The write() takes
inode_lock and faults on the buffer, so the fault cannot take inode_lock
again.
That said... I don't think memfd_secret files /can/ be written to? Hard
to say, since I can't find a manpage describing what that syscall does.
--D
>
> >
> > Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> > Suggested-by: Eric Biggers <ebiggers@kernel.org>
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >
> > v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
> >
> > Axel, I didn't add your Reviewed-by because v2 is quite different.
> >
> > mm/secretmem.c | 21 ++++++++++++++++-----
> > 1 file changed, 16 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > index 206ed6b40c1d..a4fabf705e4f 100644
> > --- a/mm/secretmem.c
> > +++ b/mm/secretmem.c
> > @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > gfp_t gfp = vmf->gfp_mask;
> > unsigned long addr;
> > struct page *page;
> > + vm_fault_t ret;
> > int err;
> >
> > if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > return vmf_error(-EINVAL);
> >
> > + inode_lock_shared(inode);
> > +
> > retry:
> > page = find_lock_page(mapping, offset);
> > if (!page) {
> > page = alloc_page(gfp | __GFP_ZERO);
> > - if (!page)
> > - return VM_FAULT_OOM;
> > + if (!page) {
> > + ret = VM_FAULT_OOM;
> > + goto out;
> > + }
> >
> > err = set_direct_map_invalid_noflush(page);
> > if (err) {
> > put_page(page);
> > - return vmf_error(err);
> > + ret = vmf_error(err);
> > + goto out;
> > }
> >
> > __SetPageUptodate(page);
> > @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > if (err == -EEXIST)
> > goto retry;
> >
> > - return vmf_error(err);
> > + ret = vmf_error(err);
> > + goto out;
> > }
> >
> > addr = (unsigned long)page_address(page);
> > @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > }
> >
> > vmf->page = page;
> > - return VM_FAULT_LOCKED;
> > + ret = VM_FAULT_LOCKED;
> > +
> > +out:
> > + inode_unlock_shared(inode);
> > + return ret;
> > }
> >
> > static const struct vm_operations_struct secretmem_vm_ops = {
> >
> > base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> > --
> > 2.34.1
> >
> >
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2] secretmem: fix unhandled fault in truncate
2022-07-07 20:55 ` Darrick J. Wong
@ 2022-07-07 22:09 ` Yang Shi
2022-07-08 8:28 ` Mike Rapoport
0 siblings, 1 reply; 8+ messages in thread
From: Yang Shi @ 2022-07-07 22:09 UTC (permalink / raw)
To: Darrick J. Wong
Cc: Mike Rapoport, Andrew Morton, Axel Rasmussen, Eric Biggers,
Hillf Danton, Matthew Wilcox, Mike Rapoport,
Linux FS-devel Mailing List, Linux MM, syzkaller-bugs,
syzbot+9bd2b7adbd34b30b87e4
On Thu, Jul 7, 2022 at 1:55 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> > On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > > From: Mike Rapoport <rppt@linux.ibm.com>
> > >
> > > syzkaller reports the following issue:
> > >
> > > BUG: unable to handle page fault for address: ffff888021f7e005
> > > PGD 11401067 P4D 11401067 PUD 11402067 PMD 21f7d063 PTE 800fffffde081060
> > > Oops: 0002 [#1] PREEMPT SMP KASAN
> > > CPU: 0 PID: 3761 Comm: syz-executor281 Not tainted 5.19.0-rc4-syzkaller-00014-g941e3e791269 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > > RIP: 0010:memset_erms+0x9/0x10 arch/x86/lib/memset_64.S:64
> > > Code: c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 f3 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 <f3> aa 4c 89 c8 c3 90 49 89 fa 40 0f b6 ce 48 b8 01 01 01 01 01 01
> > > RSP: 0018:ffffc9000329fa90 EFLAGS: 00010202
> > > RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000ffb
> > > RDX: 0000000000000ffb RSI: 0000000000000000 RDI: ffff888021f7e005
> > > RBP: ffffea000087df80 R08: 0000000000000001 R09: ffff888021f7e005
> > > R10: ffffed10043efdff R11: 0000000000000000 R12: 0000000000000005
> > > R13: 0000000000000000 R14: 0000000000001000 R15: 0000000000000ffb
> > > FS: 00007fb29d8b2700(0000) GS:ffff8880b9a00000(0000) knlGS:0000000000000000
> > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > CR2: ffff888021f7e005 CR3: 0000000026e7b000 CR4: 00000000003506f0
> > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > Call Trace:
> > > <TASK>
> > > zero_user_segments include/linux/highmem.h:272 [inline]
> > > folio_zero_range include/linux/highmem.h:428 [inline]
> > > truncate_inode_partial_folio+0x76a/0xdf0 mm/truncate.c:237
> > > truncate_inode_pages_range+0x83b/0x1530 mm/truncate.c:381
> > > truncate_inode_pages mm/truncate.c:452 [inline]
> > > truncate_pagecache+0x63/0x90 mm/truncate.c:753
> > > simple_setattr+0xed/0x110 fs/libfs.c:535
> > > secretmem_setattr+0xae/0xf0 mm/secretmem.c:170
> > > notify_change+0xb8c/0x12b0 fs/attr.c:424
> > > do_truncate+0x13c/0x200 fs/open.c:65
> > > do_sys_ftruncate+0x536/0x730 fs/open.c:193
> > > do_syscall_x64 arch/x86/entry/common.c:50 [inline]
> > > do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
> > > entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > RIP: 0033:0x7fb29d900899
> > > Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 11 15 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
> > > RSP: 002b:00007fb29d8b2318 EFLAGS: 00000246 ORIG_RAX: 000000000000004d
> > > RAX: ffffffffffffffda RBX: 00007fb29d988408 RCX: 00007fb29d900899
> > > RDX: 00007fb29d900899 RSI: 0000000000000005 RDI: 0000000000000003
> > > RBP: 00007fb29d988400 R08: 0000000000000000 R09: 0000000000000000
> > > R10: 0000000000000000 R11: 0000000000000246 R12: 00007fb29d98840c
> > > R13: 00007ffca01a23bf R14: 00007fb29d8b2400 R15: 0000000000022000
> > > </TASK>
> > > Modules linked in:
> > > CR2: ffff888021f7e005
> > > ---[ end trace 0000000000000000 ]---
> > >
> > > Eric Biggers suggested that this happens when
> > > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > > that a page that is faulted in by secretmem_fault() (and thus removed
> > > from the direct map) is zeroed by inode truncation right afterwards.
> > >
> > > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > > secretmem_fault() prevents the race.
> >
> > Should invalidate_lock be used to serialize between page fault and truncate?
>
> I would have thought so, given Documentation/filesystems/locking.rst:
>
> "->fault() is called when a previously not present pte is about to be
> faulted in. The filesystem must find and return the page associated with
> the passed in "pgoff" in the vm_fault structure. If it is possible that
> the page may be truncated and/or invalidated, then the filesystem must
> lock invalidate_lock, then ensure the page is not already truncated
> (invalidate_lock will block subsequent truncate), and then return with
> VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
>
> IIRC page faults aren't supposed to take i_rwsem because the fault could
> be in response to someone mmaping a file into memory and then write()ing
> to the same file using the mmapped region. The write() takes
> inode_lock and faults on the buffer, so the fault cannot take inode_lock
> again.
Do you mean writing from one part of the file to the other part of the
file so the "from" buffer used by copy_from_user() is part of the
mmaped region?
Another possible deadlock issue by using inode_lock in page faults is
mmap_lock is acquired before inode_lock, but write may acquire
inode_lock before mmap_lock, it is a AB-BA lock pattern, but it should
not cause real deadlock since mmap_lock is not exclusive for page
faults. But such pattern should be avoided IMHO.
>
> That said... I don't think memfd_secret files /can/ be written to? Hard
> to say, since I can't find a manpage describing what that syscall does.
>
> --D
>
> >
> > >
> > > Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> > > Suggested-by: Eric Biggers <ebiggers@kernel.org>
> > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > ---
> > >
> > > v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
> > >
> > > Axel, I didn't add your Reviewed-by because v2 is quite different.
> > >
> > > mm/secretmem.c | 21 ++++++++++++++++-----
> > > 1 file changed, 16 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > > index 206ed6b40c1d..a4fabf705e4f 100644
> > > --- a/mm/secretmem.c
> > > +++ b/mm/secretmem.c
> > > @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > gfp_t gfp = vmf->gfp_mask;
> > > unsigned long addr;
> > > struct page *page;
> > > + vm_fault_t ret;
> > > int err;
> > >
> > > if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > > return vmf_error(-EINVAL);
> > >
> > > + inode_lock_shared(inode);
> > > +
> > > retry:
> > > page = find_lock_page(mapping, offset);
> > > if (!page) {
> > > page = alloc_page(gfp | __GFP_ZERO);
> > > - if (!page)
> > > - return VM_FAULT_OOM;
> > > + if (!page) {
> > > + ret = VM_FAULT_OOM;
> > > + goto out;
> > > + }
> > >
> > > err = set_direct_map_invalid_noflush(page);
> > > if (err) {
> > > put_page(page);
> > > - return vmf_error(err);
> > > + ret = vmf_error(err);
> > > + goto out;
> > > }
> > >
> > > __SetPageUptodate(page);
> > > @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > if (err == -EEXIST)
> > > goto retry;
> > >
> > > - return vmf_error(err);
> > > + ret = vmf_error(err);
> > > + goto out;
> > > }
> > >
> > > addr = (unsigned long)page_address(page);
> > > @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > }
> > >
> > > vmf->page = page;
> > > - return VM_FAULT_LOCKED;
> > > + ret = VM_FAULT_LOCKED;
> > > +
> > > +out:
> > > + inode_unlock_shared(inode);
> > > + return ret;
> > > }
> > >
> > > static const struct vm_operations_struct secretmem_vm_ops = {
> > >
> > > base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> > > --
> > > 2.34.1
> > >
> > >
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2] secretmem: fix unhandled fault in truncate
2022-07-07 22:09 ` Yang Shi
@ 2022-07-08 8:28 ` Mike Rapoport
2022-07-12 17:40 ` Yang Shi
0 siblings, 1 reply; 8+ messages in thread
From: Mike Rapoport @ 2022-07-08 8:28 UTC (permalink / raw)
To: Yang Shi
Cc: Darrick J. Wong, Andrew Morton, Axel Rasmussen, Eric Biggers,
Hillf Danton, Matthew Wilcox, Mike Rapoport,
Linux FS-devel Mailing List, Linux MM, syzkaller-bugs,
syzbot+9bd2b7adbd34b30b87e4
On Thu, Jul 07, 2022 at 03:09:32PM -0700, Yang Shi wrote:
> On Thu, Jul 7, 2022 at 1:55 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> > > On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > >
> > > > Eric Biggers suggested that this happens when
> > > > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > > > that a page that is faulted in by secretmem_fault() (and thus removed
> > > > from the direct map) is zeroed by inode truncation right afterwards.
> > > >
> > > > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > > > secretmem_fault() prevents the race.
> > >
> > > Should invalidate_lock be used to serialize between page fault and truncate?
> >
> > I would have thought so, given Documentation/filesystems/locking.rst:
> >
> > "->fault() is called when a previously not present pte is about to be
> > faulted in. The filesystem must find and return the page associated with
> > the passed in "pgoff" in the vm_fault structure. If it is possible that
> > the page may be truncated and/or invalidated, then the filesystem must
> > lock invalidate_lock, then ensure the page is not already truncated
> > (invalidate_lock will block subsequent truncate), and then return with
> > VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
> >
> > IIRC page faults aren't supposed to take i_rwsem because the fault could
> > be in response to someone mmaping a file into memory and then write()ing
> > to the same file using the mmapped region. The write() takes
> > inode_lock and faults on the buffer, so the fault cannot take inode_lock
> > again.
>
> Do you mean writing from one part of the file to the other part of the
> file so the "from" buffer used by copy_from_user() is part of the
> mmaped region?
>
> Another possible deadlock issue by using inode_lock in page faults is
> mmap_lock is acquired before inode_lock, but write may acquire
> inode_lock before mmap_lock, it is a AB-BA lock pattern, but it should
> not cause real deadlock since mmap_lock is not exclusive for page
> faults. But such pattern should be avoided IMHO.
>
> > That said... I don't think memfd_secret files /can/ be written to?
memfd_secret files cannot be written to, they can only be mmap()ed.
Synchronization is only required between
do_truncate()->...->simple_setatt() and secretmem->fault() and I don't see
how that can deadlock.
I'm not an fs expert though, so if you think that invalidate_lock() is
safer, I don't mind s/inode_lock/invalidate_lock/ in the patch.
> > Hard to say, since I can't find a manpage describing what that syscall
> > does.
Right, I don't see it's published :-/
There is a groff version:
https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/memfd_secret.2
> > --D
> >
> > >
> > > >
> > > > Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> > > > Suggested-by: Eric Biggers <ebiggers@kernel.org>
> > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > > ---
> > > >
> > > > v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
> > > >
> > > > Axel, I didn't add your Reviewed-by because v2 is quite different.
> > > >
> > > > mm/secretmem.c | 21 ++++++++++++++++-----
> > > > 1 file changed, 16 insertions(+), 5 deletions(-)
> > > >
> > > > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > > > index 206ed6b40c1d..a4fabf705e4f 100644
> > > > --- a/mm/secretmem.c
> > > > +++ b/mm/secretmem.c
> > > > @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > gfp_t gfp = vmf->gfp_mask;
> > > > unsigned long addr;
> > > > struct page *page;
> > > > + vm_fault_t ret;
> > > > int err;
> > > >
> > > > if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > > > return vmf_error(-EINVAL);
> > > >
> > > > + inode_lock_shared(inode);
> > > > +
> > > > retry:
> > > > page = find_lock_page(mapping, offset);
> > > > if (!page) {
> > > > page = alloc_page(gfp | __GFP_ZERO);
> > > > - if (!page)
> > > > - return VM_FAULT_OOM;
> > > > + if (!page) {
> > > > + ret = VM_FAULT_OOM;
> > > > + goto out;
> > > > + }
> > > >
> > > > err = set_direct_map_invalid_noflush(page);
> > > > if (err) {
> > > > put_page(page);
> > > > - return vmf_error(err);
> > > > + ret = vmf_error(err);
> > > > + goto out;
> > > > }
> > > >
> > > > __SetPageUptodate(page);
> > > > @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > if (err == -EEXIST)
> > > > goto retry;
> > > >
> > > > - return vmf_error(err);
> > > > + ret = vmf_error(err);
> > > > + goto out;
> > > > }
> > > >
> > > > addr = (unsigned long)page_address(page);
> > > > @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > }
> > > >
> > > > vmf->page = page;
> > > > - return VM_FAULT_LOCKED;
> > > > + ret = VM_FAULT_LOCKED;
> > > > +
> > > > +out:
> > > > + inode_unlock_shared(inode);
> > > > + return ret;
> > > > }
> > > >
> > > > static const struct vm_operations_struct secretmem_vm_ops = {
> > > >
> > > > base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> > > > --
> > > > 2.34.1
> > > >
> > > >
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2] secretmem: fix unhandled fault in truncate
2022-07-08 8:28 ` Mike Rapoport
@ 2022-07-12 17:40 ` Yang Shi
2022-07-13 13:27 ` Jan Kara
0 siblings, 1 reply; 8+ messages in thread
From: Yang Shi @ 2022-07-12 17:40 UTC (permalink / raw)
To: Mike Rapoport, Jan Kara
Cc: Darrick J. Wong, Andrew Morton, Axel Rasmussen, Eric Biggers,
Hillf Danton, Matthew Wilcox, Mike Rapoport,
Linux FS-devel Mailing List, Linux MM, syzkaller-bugs,
syzbot+9bd2b7adbd34b30b87e4
On Fri, Jul 8, 2022 at 1:29 AM Mike Rapoport <rppt@kernel.org> wrote:
>
> On Thu, Jul 07, 2022 at 03:09:32PM -0700, Yang Shi wrote:
> > On Thu, Jul 7, 2022 at 1:55 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> > > > On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > > >
> > > > > Eric Biggers suggested that this happens when
> > > > > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > > > > that a page that is faulted in by secretmem_fault() (and thus removed
> > > > > from the direct map) is zeroed by inode truncation right afterwards.
> > > > >
> > > > > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > > > > secretmem_fault() prevents the race.
> > > >
> > > > Should invalidate_lock be used to serialize between page fault and truncate?
> > >
> > > I would have thought so, given Documentation/filesystems/locking.rst:
> > >
> > > "->fault() is called when a previously not present pte is about to be
> > > faulted in. The filesystem must find and return the page associated with
> > > the passed in "pgoff" in the vm_fault structure. If it is possible that
> > > the page may be truncated and/or invalidated, then the filesystem must
> > > lock invalidate_lock, then ensure the page is not already truncated
> > > (invalidate_lock will block subsequent truncate), and then return with
> > > VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
> > >
> > > IIRC page faults aren't supposed to take i_rwsem because the fault could
> > > be in response to someone mmaping a file into memory and then write()ing
> > > to the same file using the mmapped region. The write() takes
> > > inode_lock and faults on the buffer, so the fault cannot take inode_lock
> > > again.
> >
> > Do you mean writing from one part of the file to the other part of the
> > file so the "from" buffer used by copy_from_user() is part of the
> > mmaped region?
> >
> > Another possible deadlock issue by using inode_lock in page faults is
> > mmap_lock is acquired before inode_lock, but write may acquire
> > inode_lock before mmap_lock, it is a AB-BA lock pattern, but it should
> > not cause real deadlock since mmap_lock is not exclusive for page
> > faults. But such pattern should be avoided IMHO.
> >
> > > That said... I don't think memfd_secret files /can/ be written to?
>
> memfd_secret files cannot be written to, they can only be mmap()ed.
> Synchronization is only required between
> do_truncate()->...->simple_setatt() and secretmem->fault() and I don't see
> how that can deadlock.
Sure, there is no deadlock.
>
> I'm not an fs expert though, so if you think that invalidate_lock() is
> safer, I don't mind s/inode_lock/invalidate_lock/ in the patch.
IIUC invalidate_lock should be preferred per the filesystem's locking
document. And I found Jan Kara's email of the invalidate_lock
patchset, please refer to
https://lore.kernel.org/linux-mm/20210715133202.5975-1-jack@suse.cz/.
>
> > > Hard to say, since I can't find a manpage describing what that syscall
> > > does.
>
> Right, I don't see it's published :-/
>
> There is a groff version:
> https://git.kernel.org/pub/scm/docs/man-pages/man-pages.git/tree/man2/memfd_secret.2
>
> > > --D
> > >
> > > >
> > > > >
> > > > > Reported-by: syzbot+9bd2b7adbd34b30b87e4@syzkaller.appspotmail.com
> > > > > Suggested-by: Eric Biggers <ebiggers@kernel.org>
> > > > > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > > > > ---
> > > > >
> > > > > v2: use inode_lock_shared() rather than add a new rw_sem to secretmem
> > > > >
> > > > > Axel, I didn't add your Reviewed-by because v2 is quite different.
> > > > >
> > > > > mm/secretmem.c | 21 ++++++++++++++++-----
> > > > > 1 file changed, 16 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/mm/secretmem.c b/mm/secretmem.c
> > > > > index 206ed6b40c1d..a4fabf705e4f 100644
> > > > > --- a/mm/secretmem.c
> > > > > +++ b/mm/secretmem.c
> > > > > @@ -55,22 +55,28 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > > gfp_t gfp = vmf->gfp_mask;
> > > > > unsigned long addr;
> > > > > struct page *page;
> > > > > + vm_fault_t ret;
> > > > > int err;
> > > > >
> > > > > if (((loff_t)vmf->pgoff << PAGE_SHIFT) >= i_size_read(inode))
> > > > > return vmf_error(-EINVAL);
> > > > >
> > > > > + inode_lock_shared(inode);
> > > > > +
> > > > > retry:
> > > > > page = find_lock_page(mapping, offset);
> > > > > if (!page) {
> > > > > page = alloc_page(gfp | __GFP_ZERO);
> > > > > - if (!page)
> > > > > - return VM_FAULT_OOM;
> > > > > + if (!page) {
> > > > > + ret = VM_FAULT_OOM;
> > > > > + goto out;
> > > > > + }
> > > > >
> > > > > err = set_direct_map_invalid_noflush(page);
> > > > > if (err) {
> > > > > put_page(page);
> > > > > - return vmf_error(err);
> > > > > + ret = vmf_error(err);
> > > > > + goto out;
> > > > > }
> > > > >
> > > > > __SetPageUptodate(page);
> > > > > @@ -86,7 +92,8 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > > if (err == -EEXIST)
> > > > > goto retry;
> > > > >
> > > > > - return vmf_error(err);
> > > > > + ret = vmf_error(err);
> > > > > + goto out;
> > > > > }
> > > > >
> > > > > addr = (unsigned long)page_address(page);
> > > > > @@ -94,7 +101,11 @@ static vm_fault_t secretmem_fault(struct vm_fault *vmf)
> > > > > }
> > > > >
> > > > > vmf->page = page;
> > > > > - return VM_FAULT_LOCKED;
> > > > > + ret = VM_FAULT_LOCKED;
> > > > > +
> > > > > +out:
> > > > > + inode_unlock_shared(inode);
> > > > > + return ret;
> > > > > }
> > > > >
> > > > > static const struct vm_operations_struct secretmem_vm_ops = {
> > > > >
> > > > > base-commit: 03c765b0e3b4cb5063276b086c76f7a612856a9a
> > > > > --
> > > > > 2.34.1
> > > > >
> > > > >
>
> --
> Sincerely yours,
> Mike.
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2] secretmem: fix unhandled fault in truncate
2022-07-12 17:40 ` Yang Shi
@ 2022-07-13 13:27 ` Jan Kara
0 siblings, 0 replies; 8+ messages in thread
From: Jan Kara @ 2022-07-13 13:27 UTC (permalink / raw)
To: Yang Shi
Cc: Mike Rapoport, Jan Kara, Darrick J. Wong, Andrew Morton,
Axel Rasmussen, Eric Biggers, Hillf Danton, Matthew Wilcox,
Mike Rapoport, Linux FS-devel Mailing List, Linux MM,
syzkaller-bugs, syzbot+9bd2b7adbd34b30b87e4
On Tue 12-07-22 10:40:11, Yang Shi wrote:
> On Fri, Jul 8, 2022 at 1:29 AM Mike Rapoport <rppt@kernel.org> wrote:
> >
> > On Thu, Jul 07, 2022 at 03:09:32PM -0700, Yang Shi wrote:
> > > On Thu, Jul 7, 2022 at 1:55 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > >
> > > > On Thu, Jul 07, 2022 at 10:48:00AM -0700, Yang Shi wrote:
> > > > > On Thu, Jul 7, 2022 at 9:57 AM Mike Rapoport <rppt@kernel.org> wrote:
> > > > > >
> > > > > > Eric Biggers suggested that this happens when
> > > > > > secretmem_setattr()->simple_setattr() races with secretmem_fault() so
> > > > > > that a page that is faulted in by secretmem_fault() (and thus removed
> > > > > > from the direct map) is zeroed by inode truncation right afterwards.
> > > > > >
> > > > > > Since do_truncate() takes inode_lock(), adding inode_lock_shared() to
> > > > > > secretmem_fault() prevents the race.
> > > > >
> > > > > Should invalidate_lock be used to serialize between page fault and truncate?
> > > >
> > > > I would have thought so, given Documentation/filesystems/locking.rst:
> > > >
> > > > "->fault() is called when a previously not present pte is about to be
> > > > faulted in. The filesystem must find and return the page associated with
> > > > the passed in "pgoff" in the vm_fault structure. If it is possible that
> > > > the page may be truncated and/or invalidated, then the filesystem must
> > > > lock invalidate_lock, then ensure the page is not already truncated
> > > > (invalidate_lock will block subsequent truncate), and then return with
> > > > VM_FAULT_LOCKED, and the page locked. The VM will unlock the page."
> > > >
> > > > IIRC page faults aren't supposed to take i_rwsem because the fault could
> > > > be in response to someone mmaping a file into memory and then write()ing
> > > > to the same file using the mmapped region. The write() takes
> > > > inode_lock and faults on the buffer, so the fault cannot take inode_lock
> > > > again.
> > >
> > > Do you mean writing from one part of the file to the other part of the
> > > file so the "from" buffer used by copy_from_user() is part of the
> > > mmaped region?
> > >
> > > Another possible deadlock issue by using inode_lock in page faults is
> > > mmap_lock is acquired before inode_lock, but write may acquire
> > > inode_lock before mmap_lock, it is a AB-BA lock pattern, but it should
> > > not cause real deadlock since mmap_lock is not exclusive for page
> > > faults. But such pattern should be avoided IMHO.
> > >
> > > > That said... I don't think memfd_secret files /can/ be written to?
> >
> > memfd_secret files cannot be written to, they can only be mmap()ed.
> > Synchronization is only required between
> > do_truncate()->...->simple_setatt() and secretmem->fault() and I don't see
> > how that can deadlock.
>
> Sure, there is no deadlock.
>
> >
> > I'm not an fs expert though, so if you think that invalidate_lock() is
> > safer, I don't mind s/inode_lock/invalidate_lock/ in the patch.
>
> IIUC invalidate_lock should be preferred per the filesystem's locking
> document. And I found Jan Kara's email of the invalidate_lock
> patchset, please refer to
> https://lore.kernel.org/linux-mm/20210715133202.5975-1-jack@suse.cz/.
Yeah, so using invalidate_lock for such synchronization would be certainly
more standard than using inode_lock. Although I agree that for filesystems
that do not support read(2) and write(2) there does not seem to be an
immediate risk of a deadlock when inode_lock is used inside a page fault.
Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2022-07-13 13:27 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-07 16:56 [PATCH v2] secretmem: fix unhandled fault in truncate Mike Rapoport
2022-07-07 17:48 ` Yang Shi
2022-07-07 20:55 ` Darrick J. Wong
2022-07-07 22:09 ` Yang Shi
2022-07-08 8:28 ` Mike Rapoport
2022-07-12 17:40 ` Yang Shi
2022-07-13 13:27 ` Jan Kara
2022-07-07 20:27 ` Axel Rasmussen
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.