* BUG: unable to handle kernel paging request in rb_erase
@ 2020-05-18  6:59 syzbot
  2020-06-02 21:55 ` J. Bruce Fields
       [not found] ` <20200603043435.13820-1-hdanton@sina.com>
  0 siblings, 2 replies; 8+ messages in thread
From: syzbot @ 2020-05-18  6:59 UTC
  To: bfields, chuck.lever, linux-kernel, linux-nfs, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    9b1f2cbd Merge tag 'clk-fixes-for-linus' of git://git.kern..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15dfdeaa100000
kernel config:  https://syzkaller.appspot.com/x/.config?x=c14212794ed9ad24
dashboard link: https://syzkaller.appspot.com/bug?extid=0e37e9d19bded16b8ab9
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+0e37e9d19bded16b8ab9@syzkaller.appspotmail.com

BUG: unable to handle page fault for address: ffff887ffffffff0
#PF: supervisor read access in kernel mode
#PF: error_code(0x0000) - not-present page
PGD 0 P4D 0 
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 8682 Comm: syz-executor.3 Not tainted 5.7.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__rb_erase_augmented include/linux/rbtree_augmented.h:201 [inline]
RIP: 0010:rb_erase+0x37/0x18d0 lib/rbtree.c:443
Code: 89 f7 41 56 41 55 49 89 fd 48 83 c7 08 48 89 fa 41 54 48 c1 ea 03 55 53 48 83 ec 18 80 3c 02 00 0f 85 89 10 00 00 49 8d 7d 10 <4d> 8b 75 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80
RSP: 0018:ffffc900178ffb58 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff8880354d0000 RCX: ffffc9000fb6d000
RDX: 1ffff10ffffffffe RSI: ffff88800011dfe0 RDI: ffff887ffffffff8
RBP: ffff887fffffffb0 R08: ffff888057284280 R09: fffffbfff185d12e
R10: ffffffff8c2e896f R11: fffffbfff185d12d R12: ffff88800011dfe0
R13: ffff887fffffffe8 R14: 000000000001dfe0 R15: ffff88800011dfe0
FS:  00007fa002d21700(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff887ffffffff0 CR3: 00000000a2164000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 nfsd_reply_cache_free_locked+0x198/0x380 fs/nfsd/nfscache.c:127
 nfsd_reply_cache_shutdown+0x150/0x350 fs/nfsd/nfscache.c:203
 nfsd_exit_net+0x189/0x4c0 fs/nfsd/nfsctl.c:1504
 ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186
 setup_net+0x50c/0x860 net/core/net_namespace.c:364
 copy_net_ns+0x293/0x590 net/core/net_namespace.c:482
 create_new_namespaces+0x3fb/0xb30 kernel/nsproxy.c:108
 unshare_nsproxy_namespaces+0xbd/0x1f0 kernel/nsproxy.c:229
 ksys_unshare+0x43d/0x8e0 kernel/fork.c:2970
 __do_sys_unshare kernel/fork.c:3038 [inline]
 __se_sys_unshare kernel/fork.c:3036 [inline]
 __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3036
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x45ca29
Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fa002d20c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
RAX: ffffffffffffffda RBX: 000000000050a1c0 RCX: 000000000045ca29
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000040000000
RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000c4e R14: 00000000004ce9bd R15: 00007fa002d216d4
Modules linked in:
CR2: ffff887ffffffff0
---[ end trace f929dcba0362906a ]---
RIP: 0010:__rb_erase_augmented include/linux/rbtree_augmented.h:201 [inline]
RIP: 0010:rb_erase+0x37/0x18d0 lib/rbtree.c:443
Code: 89 f7 41 56 41 55 49 89 fd 48 83 c7 08 48 89 fa 41 54 48 c1 ea 03 55 53 48 83 ec 18 80 3c 02 00 0f 85 89 10 00 00 49 8d 7d 10 <4d> 8b 75 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80
RSP: 0018:ffffc900178ffb58 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffff8880354d0000 RCX: ffffc9000fb6d000
RDX: 1ffff10ffffffffe RSI: ffff88800011dfe0 RDI: ffff887ffffffff8
RBP: ffff887fffffffb0 R08: ffff888057284280 R09: fffffbfff185d12e
R10: ffffffff8c2e896f R11: fffffbfff185d12d R12: ffff88800011dfe0
R13: ffff887fffffffe8 R14: 000000000001dfe0 R15: ffff88800011dfe0
FS:  00007fa002d21700(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff887ffffffff0 CR3: 00000000a2164000 CR4: 00000000001426e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
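
The register dump is enough to tell which load faulted. The marked bytes
in the Code: line, <4d> 8b 75 08, decode to mov 0x8(%r13),%r14, and the
earlier 49 89 fd copies RDI (the node argument of rb_erase()) into R13.
A minimal sketch of that arithmetic follows. The struct rb_node field
offsets and the direct-map base are assumptions about a 64-bit x86
kernel, not values taken from this report, and the helper names are
made up for illustration.

```python
# Register values copied from the oops above.
R13 = 0xffff887fffffffe8  # node pointer (copied from RDI on entry to rb_erase)
RDX = 0x1ffff10ffffffffe  # value KASAN computed for its shadow lookup
CR2 = 0xffff887ffffffff0  # faulting address reported by the CPU

# Assumption: struct rb_node layout on a 64-bit kernel, 8 bytes per field.
RB_NODE_OFFSETS = {"__rb_parent_color": 0, "rb_right": 8, "rb_left": 16}

# Assumption: default x86_64 direct-map base (4-level paging, no KASLR).
DIRECT_MAP_BASE = 0xffff888000000000


def faulting_field(node, fault_addr):
    """Return the rb_node field located at fault_addr, or None if no match."""
    for name, offset in RB_NODE_OFFSETS.items():
        if node + offset == fault_addr:
            return name
    return None


def kasan_shadow_index(addr):
    """KASAN tracks one shadow byte per 8 bytes of memory, hence addr >> 3."""
    return addr >> 3


if __name__ == "__main__":
    print(faulting_field(R13, CR2))
    print(hex(kasan_shadow_index(CR2)))
    print(hex(DIRECT_MAP_BASE - R13))
```

Run as a script, this prints rb_right, 0x1ffff10ffffffffe (matching RDX)
and 0x18. The last value only says that the bogus node pointer sits 0x18
bytes below the assumed direct-map base. It does not by itself say how
the pointer got there.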


* Re: BUG: unable to handle kernel paging request in rb_erase
  2020-05-18  6:59 BUG: unable to handle kernel paging request in rb_erase syzbot
@ 2020-06-02 21:55 ` J. Bruce Fields
       [not found] ` <20200603043435.13820-1-hdanton@sina.com>
  1 sibling, 0 replies; 8+ messages in thread
From: J. Bruce Fields @ 2020-06-02 21:55 UTC
  To: syzbot; +Cc: chuck.lever, linux-kernel, linux-nfs, syzkaller-bugs

As far as I know, this one's still unresolved.  I can't see the bug from
code inspection, and we don't have a reproducer.  If anyone else sees
this or has an idea what might be going wrong, I'd be interested.

--b.



* Re: BUG: unable to handle kernel paging request in rb_erase
       [not found] ` <20200603043435.13820-1-hdanton@sina.com>
@ 2020-06-03 14:43   ` J. Bruce Fields
  2020-06-03 16:48     ` J. Bruce Fields
       [not found]     ` <20200604035359.2516-1-hdanton@sina.com>
  0 siblings, 2 replies; 8+ messages in thread
From: J. Bruce Fields @ 2020-06-03 14:43 UTC
  To: Hillf Danton; +Cc: syzbot, chuck.lever, linux-kernel, linux-nfs, syzkaller-bugs

On Wed, Jun 03, 2020 at 12:34:35PM +0800, Hillf Danton wrote:
> 
> On Tue, 2 Jun 2020 17:55:17 -0400 "J. Bruce Fields" wrote:
> > 
> > As far as I know, this one's still unresolved.  I can't see the bug from
> > code inspection, and we don't have a reproducer.  If anyone else sees
> > this or has an idea what might be going wrong, I'd be interested.
> > 
> > --b.
> 
> It's a PF reported in the syz-executor.3 context (PID: 8682 on CPU:1),
> meanwhile there's another at 
> 
>  https://lore.kernel.org/lkml/20200603011425.GA13019@fieldses.org/T/#t
>  Reported-by: syzbot+a29df412692980277f9d@syzkaller.appspotmail.com
> 
> in the kworker context, and one of the quick questions is, is it needed
> to serialize the two players, say, using a mutex?

nfsd_reply_cache_shutdown() doesn't take any locks.  All the data
structures it's tearing down are per-network-namespace, and it's assumed
all the users of that structure are gone by the time we get here.

I wonder if that assumption's correct.  Looking at nfsd_exit_net()....

nfsd_reply_cache_shutdown() is one of the first things we do, so I think
we're depending on the assumption that the interfaces in that network
namespace, and anything referencing associated sockets (in particular,
any associated in-progress rpc's), must be gone before our net exit
method is called.

I wonder if that's a good assumption.

--b.



* Re: BUG: unable to handle kernel paging request in rb_erase
  2020-06-03 14:43   ` J. Bruce Fields
@ 2020-06-03 16:48     ` J. Bruce Fields
       [not found]     ` <20200604035359.2516-1-hdanton@sina.com>
  1 sibling, 0 replies; 8+ messages in thread
From: J. Bruce Fields @ 2020-06-03 16:48 UTC
  To: Hillf Danton; +Cc: syzbot, chuck.lever, linux-kernel, linux-nfs, syzkaller-bugs

On Wed, Jun 03, 2020 at 10:43:26AM -0400, J. Bruce Fields wrote:
> On Wed, Jun 03, 2020 at 12:34:35PM +0800, Hillf Danton wrote:
> > 
> > On Tue, 2 Jun 2020 17:55:17 -0400 "J. Bruce Fields" wrote:
> > > 
> > > As far as I know, this one's still unresolved.  I can't see the bug from
> > > code inspection, and we don't have a reproducer.  If anyone else sees
> > > this or has an idea what might be going wrong, I'd be interested.
> > > 
> > > --b.
> > 
> > It's a PF reported in the syz-executor.3 context (PID: 8682 on CPU:1),
> > meanwhile there's another at 
> > 
> >  https://lore.kernel.org/lkml/20200603011425.GA13019@fieldses.org/T/#t
> >  Reported-by: syzbot+a29df412692980277f9d@syzkaller.appspotmail.com
> > 
> > in the kworker context, and one of the quick questions is, is it needed
> > to serialize the two players, say, using a mutex?
> 
> nfsd_reply_cache_shutdown() doesn't take any locks.  All the data
> structures it's tearing down are per-network-namespace, and it's assumed
> all the users of that structure are gone by the time we get here.
> 
> I wonder if that assumption's correct.  Looking at nfsd_exit_net()....
> 
> nfsd_reply_cache_shutdown() is one of the first things we do, so I think
> we're depending on the assumption that the interfaces in that network
> namespace, and anything referencing associated sockets (in particular,
> any associated in-progress rpc's), must be gone before our net exit
> method is called.
> 
> I wonder if that's a good assumption.

I think that assumption must be the problem.

That would explain why the crashes are happening in nfsd_exit_net as
opposed to somewhere else, and why we're only seeing them since
3ba75830ce17 "nfsd4: drc containerization".

I wonder what *is* safe to assume when the net exit method is called?

--b.



* Re: BUG: unable to handle kernel paging request in rb_erase
       [not found]     ` <20200604035359.2516-1-hdanton@sina.com>
@ 2020-06-04 21:58       ` J. Bruce Fields
  2020-06-25 21:02         ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: J. Bruce Fields @ 2020-06-04 21:58 UTC
  To: Hillf Danton; +Cc: syzbot, chuck.lever, linux-kernel, linux-nfs, syzkaller-bugs

On Thu, Jun 04, 2020 at 11:53:59AM +0800, Hillf Danton wrote:
> 
> On Wed, 3 Jun 2020 12:48:49 -0400 J. Bruce Fields wrote:
> > On Wed, Jun 03, 2020 at 10:43:26AM -0400, J. Bruce Fields wrote:
> > > On Wed, Jun 03, 2020 at 12:34:35PM +0800, Hillf Danton wrote:
> > > > 
> > > > On Tue, 2 Jun 2020 17:55:17 -0400 "J. Bruce Fields" wrote:
> > > > > 
> > > > > As far as I know, this one's still unresolved.  I can't see the bug from
> > > > > code inspection, and we don't have a reproducer.  If anyone else sees
> > > > > this or has an idea what might be going wrong, I'd be interested.
> > > > > 
> > > > > --b.
> > > > 
> > > > It's a PF reported in the syz-executor.3 context (PID: 8682 on CPU:1),
> > > > meanwhile there's another at 
> > > > 
> > > >  https://lore.kernel.org/lkml/20200603011425.GA13019@fieldses.org/T/#t
> > > >  Reported-by: syzbot+a29df412692980277f9d@syzkaller.appspotmail.com
> > > > 
> > > > in the kworker context, and one of the quick questions is, is it needed
> > > > to serialize the two players, say, using a mutex?
> > > 
> > > nfsd_reply_cache_shutdown() doesn't take any locks.  All the data
> > > structures it's tearing down are per-network-namespace, and it's assumed
> > > all the users of that structure are gone by the time we get here.
> > > 
> > > I wonder if that assumption's correct.  Looking at nfsd_exit_net()....
> 
> IIUC it's correct for the kworker case where the ns in question is on
> the cleanup list, and for the syscall as well because the report is
> triggered in the error path, IOW the new ns is not yet visible to the
> kworker ATM.

Sorry, I'm not familiar with the namespace code and I'm not following
you.

I'm trying to figure out what prevents the network namespace exit method
from being called while nfsd is still processing an rpc call from that
network namespace.

--b.

> 
> Then we can not draw a race between the two parties, and the two reports
> are not related... but of independent glitches.
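
The "error path" mentioned above is visible in the call trace:
nfsd_exit_net() is reached from setup_net() via ops_exit_list() during
unshare(), that is, while a new namespace is being unwound after a
failed setup rather than from the cleanup kworker. A toy model of that
init-then-unwind pattern is sketched below. The class and function names
are hypothetical stand-ins, not the kernel's API, and the reverse unwind
order is an assumption of the sketch.

```python
class PernetOps:
    """Hypothetical stand-in for one subsystem's per-namespace init/exit pair."""

    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail

    def init(self, log):
        if self.fail:
            log.append("init %s failed" % self.name)
            return False
        log.append("init %s" % self.name)
        return True

    def exit(self, log):
        log.append("exit %s" % self.name)


def setup_net_model(ops_list, log):
    """Run every init; on failure, call exit for those already initialized."""
    done = []
    for ops in ops_list:
        if not ops.init(log):
            # Unwind in reverse order. The half-built namespace was never
            # published, so nothing else should be able to reach it.
            for prev in reversed(done):
                prev.exit(log)
            return False
        done.append(ops)
    return True


if __name__ == "__main__":
    log = []
    ops = [PernetOps("a"), PernetOps("nfsd"), PernetOps("later", fail=True)]
    setup_net_model(ops, log)
    print(log)
```

In this model the exit hook for "nfsd" runs even though the namespace
never carried any traffic, which is the situation the syscall trace
appears to show.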
> 
> > > 
> > > nfsd_reply_cache_shutdown() is one of the first things we do, so I think
> > > we're depending on the assumption that the interfaces in that network
> > > namespace, and anything referencing associated sockets (in particular,
> > > any associated in-progress rpc's), must be gone before our net exit
> > > method is called.
> > > 
> > > I wonder if that's a good assumption.
> > 
> > I think that assumption must be the problem.
> > 
> > That would explain why the crashes are happening in nfsd_exit_net as
> > opposed to somewhere else, and why we're only seeing them since
> > 3ba75830ce17 "nfsd4: drc containerization".
> > 
> > I wonder what *is* safe to assume when the net exit method is called?
> > 
> > --b.
> > 
> > > 
> > > --b.
> > > 
> > > > 
> > > > 
> > > > > On Sun, May 17, 2020 at 11:59:12PM -0700, syzbot wrote:
> > > > > > Hello,
> > > > > > 
> > > > > > syzbot found the following crash on:
> > > > > > 
> > > > > > HEAD commit:    9b1f2cbd Merge tag 'clk-fixes-for-linus' of git://git.kern..
> > > > > > git tree:       upstream
> > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15dfdeaa100000
> > > > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=c14212794ed9ad24
> > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=0e37e9d19bded16b8ab9
> > > > > > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > > > > > 
> > > > > > Unfortunately, I don't have any reproducer for this crash yet.
> > > > > > 
> > > > > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > > > > Reported-by: syzbot+0e37e9d19bded16b8ab9@syzkaller.appspotmail.com
> > > > > > 
> > > > > > BUG: unable to handle page fault for address: ffff887ffffffff0
> > > > > > #PF: supervisor read access in kernel mode
> > > > > > #PF: error_code(0x0000) - not-present page
> > > > > > PGD 0 P4D 0 
> > > > > > Oops: 0000 [#1] PREEMPT SMP KASAN
> > > > > > CPU: 1 PID: 8682 Comm: syz-executor.3 Not tainted 5.7.0-rc5-syzkaller #0
> > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > > > > > RIP: 0010:__rb_erase_augmented include/linux/rbtree_augmented.h:201 [inline]
> > > > > > RIP: 0010:rb_erase+0x37/0x18d0 lib/rbtree.c:443
> > > > > > Code: 89 f7 41 56 41 55 49 89 fd 48 83 c7 08 48 89 fa 41 54 48 c1 ea 03 55 53 48 83 ec 18 80 3c 02 00 0f 85 89 10 00 00 49 8d 7d 10 <4d> 8b 75 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80
> > > > > > RSP: 0018:ffffc900178ffb58 EFLAGS: 00010246
> > > > > > RAX: dffffc0000000000 RBX: ffff8880354d0000 RCX: ffffc9000fb6d000
> > > > > > RDX: 1ffff10ffffffffe RSI: ffff88800011dfe0 RDI: ffff887ffffffff8
> > > > > > RBP: ffff887fffffffb0 R08: ffff888057284280 R09: fffffbfff185d12e
> > > > > > R10: ffffffff8c2e896f R11: fffffbfff185d12d R12: ffff88800011dfe0
> > > > > > R13: ffff887fffffffe8 R14: 000000000001dfe0 R15: ffff88800011dfe0
> > > > > > FS:  00007fa002d21700(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
> > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > CR2: ffff887ffffffff0 CR3: 00000000a2164000 CR4: 00000000001426e0
> > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > Call Trace:
> > > > > >  nfsd_reply_cache_free_locked+0x198/0x380 fs/nfsd/nfscache.c:127
> > > > > >  nfsd_reply_cache_shutdown+0x150/0x350 fs/nfsd/nfscache.c:203
> > > > > >  nfsd_exit_net+0x189/0x4c0 fs/nfsd/nfsctl.c:1504
> > > > > >  ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186
> > > > > >  setup_net+0x50c/0x860 net/core/net_namespace.c:364
> > > > > >  copy_net_ns+0x293/0x590 net/core/net_namespace.c:482
> > > > > >  create_new_namespaces+0x3fb/0xb30 kernel/nsproxy.c:108
> > > > > >  unshare_nsproxy_namespaces+0xbd/0x1f0 kernel/nsproxy.c:229
> > > > > >  ksys_unshare+0x43d/0x8e0 kernel/fork.c:2970
> > > > > >  __do_sys_unshare kernel/fork.c:3038 [inline]
> > > > > >  __se_sys_unshare kernel/fork.c:3036 [inline]
> > > > > >  __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3036
> > > > > >  do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
> > > > > >  entry_SYSCALL_64_after_hwframe+0x49/0xb3


* Re: BUG: unable to handle kernel paging request in rb_erase
  2020-06-04 21:58       ` J. Bruce Fields
@ 2020-06-25 21:02         ` J. Bruce Fields
  2020-06-26 10:32           ` Dmitry Vyukov
  0 siblings, 1 reply; 8+ messages in thread
From: J. Bruce Fields @ 2020-06-25 21:02 UTC
  To: Hillf Danton; +Cc: syzbot, chuck.lever, linux-kernel, linux-nfs, syzkaller-bugs

On Thu, Jun 04, 2020 at 05:58:12PM -0400, J. Bruce Fields wrote:
> On Thu, Jun 04, 2020 at 11:53:59AM +0800, Hillf Danton wrote:
> > 
> > On Wed, 3 Jun 2020 12:48:49 -0400 J. Bruce Fields wrote:
> > > On Wed, Jun 03, 2020 at 10:43:26AM -0400, J. Bruce Fields wrote:
> > > > On Wed, Jun 03, 2020 at 12:34:35PM +0800, Hillf Danton wrote:
> > > > > 
> > > > > On Tue, 2 Jun 2020 17:55:17 -0400 "J. Bruce Fields" wrote:
> > > > > > 
> > > > > > As far as I know, this one's still unresolved.  I can't see the bug from
> > > > > > code inspection, and we don't have a reproducer.  If anyone else sees
> > > > > > this or has an idea what might be going wrong, I'd be interested.
> > > > > > 
> > > > > > --b.
> > > > > 
> > > > > It's a PF reported in the syz-executor.3 context (PID: 8682 on CPU:1),
> > > > > meanwhile there's another at 
> > > > > 
> > > > >  https://lore.kernel.org/lkml/20200603011425.GA13019@fieldses.org/T/#t
> > > > >  Reported-by: syzbot+a29df412692980277f9d@syzkaller.appspotmail.com
> > > > > 
> > > > > in the kworker context, and one of the quick questions is, is it needed
> > > > > to serialize the two players, say, using a mutex?
> > > > 
> > > > nfsd_reply_cache_shutdown() doesn't take any locks.  All the data
> > > > structures it's tearing down are per-network-namespace, and it's assumed
> > > > all the users of that structure are gone by the time we get here.
> > > > 
> > > > I wonder if that assumption's correct.  Looking at nfsd_exit_net()....
> > 
> > IIUC it's correct for the kworker case where the ns in question is on
> > the cleanup list, and for the syscall as well because the report is
> > triggered in the error path, IOW the new ns is not yet visible to the
> > kworker ATM.
> 
> Sorry, I'm not familiar with the namespace code and I'm not following
> you.
> 
> I'm trying to figure out what prevents the network namespace exit method
> being called while nfsd is still processing an rpc call from that
> network namespace.

Looking at this some more:

Each server socket (struct svc_xprt) holds a reference on the struct net
that's not released until svc_xprt_free().

The svc_xprt is itself referenced as long as an rpc for that socket is
being processed; the reference is released in svc_xprt_release(), which
isn't called until the rpc is processed and the reply sent.

So, assuming nfsd_exit_net() can't be called while we're still holding
references to the struct net, there can't still be any reply cache
processing going on when nfsd_reply_cache_shutdown() is called.

So, still a mystery to me how this is happening.

--b.
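
The reference chain described above can be written down as a toy model:
an in-flight rpc pins the svc_xprt, and the svc_xprt pins the struct
net. The classes and helpers below are hypothetical illustrations of
that argument, not kernel code. The sketch also assumes that the exit
methods run only once the last reference on the net is dropped, which is
exactly the assumption in question.

```python
class Net:
    """Toy struct net: exit work runs when the last reference is dropped."""

    def __init__(self):
        self.refs = 1        # reference held by the namespace's owner
        self.exited = False

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.exited = True   # stand-in for the net exit methods running


class Xprt:
    """Toy svc_xprt: holds a net reference until it is freed."""

    def __init__(self, net):
        net.get()
        self.net = net
        self.refs = 1        # reference held by the open socket

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.net.put()   # stand-in for what svc_xprt_free() does


def start_rpc(xprt):
    xprt.get()               # rpc processing pins the transport


def finish_rpc(xprt):
    xprt.put()               # stand-in for svc_xprt_release()


if __name__ == "__main__":
    net = Net()
    xprt = Xprt(net)
    start_rpc(xprt)
    net.put()                # owner of the namespace goes away
    xprt.put()               # socket is closed
    print(net.exited)        # the rpc is still in flight
    finish_rpc(xprt)
    print(net.exited)
```

Under these assumptions the exit work cannot start while an rpc is still
being processed. The model does not cover the setup_net() unwind seen in
the call trace, where the exit method is invoked directly.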

> 
> 
> > ---
> > v2: Use linux/highmem.h instead of asm/cacheflush.h
> > Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> > ---
> > net/sunrpc/svcsock.c | 1 +
> > 1 file changed, 1 insertion(+)
> > 
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 5c4ec9386f81..c537272f9c7e 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -44,6 +44,7 @@
> > #include <net/tcp.h>
> > #include <net/tcp_states.h>
> > #include <linux/uaccess.h>
> > +#include <linux/highmem.h>
> > #include <asm/ioctls.h>
> > 
> > #include <linux/sunrpc/types.h>
> > -- 
> > 2.25.0
> > 
> 
> --
> Chuck Lever
> 
> 

> > > > > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > > > > > > RIP: 0010:__rb_erase_augmented include/linux/rbtree_augmented.h:201 [inline]
> > > > > > > RIP: 0010:rb_erase+0x37/0x18d0 lib/rbtree.c:443
> > > > > > > Code: 89 f7 41 56 41 55 49 89 fd 48 83 c7 08 48 89 fa 41 54 48 c1 ea 03 55 53 48 83 ec 18 80 3c 02 00 0f 85 89 10 00 00 49 8d 7d 10 <4d> 8b 75 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 80
> > > > > > > RSP: 0018:ffffc900178ffb58 EFLAGS: 00010246
> > > > > > > RAX: dffffc0000000000 RBX: ffff8880354d0000 RCX: ffffc9000fb6d000
> > > > > > > RDX: 1ffff10ffffffffe RSI: ffff88800011dfe0 RDI: ffff887ffffffff8
> > > > > > > RBP: ffff887fffffffb0 R08: ffff888057284280 R09: fffffbfff185d12e
> > > > > > > R10: ffffffff8c2e896f R11: fffffbfff185d12d R12: ffff88800011dfe0
> > > > > > > R13: ffff887fffffffe8 R14: 000000000001dfe0 R15: ffff88800011dfe0
> > > > > > > FS:  00007fa002d21700(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
> > > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > > > > > CR2: ffff887ffffffff0 CR3: 00000000a2164000 CR4: 00000000001426e0
> > > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > > > > > Call Trace:
> > > > > > >  nfsd_reply_cache_free_locked+0x198/0x380 fs/nfsd/nfscache.c:127
> > > > > > >  nfsd_reply_cache_shutdown+0x150/0x350 fs/nfsd/nfscache.c:203
> > > > > > >  nfsd_exit_net+0x189/0x4c0 fs/nfsd/nfsctl.c:1504
> > > > > > >  ops_exit_list.isra.0+0xa8/0x150 net/core/net_namespace.c:186
> > > > > > >  setup_net+0x50c/0x860 net/core/net_namespace.c:364
> > > > > > >  copy_net_ns+0x293/0x590 net/core/net_namespace.c:482
> > > > > > >  create_new_namespaces+0x3fb/0xb30 kernel/nsproxy.c:108
> > > > > > >  unshare_nsproxy_namespaces+0xbd/0x1f0 kernel/nsproxy.c:229
> > > > > > >  ksys_unshare+0x43d/0x8e0 kernel/fork.c:2970
> > > > > > >  __do_sys_unshare kernel/fork.c:3038 [inline]
> > > > > > >  __se_sys_unshare kernel/fork.c:3036 [inline]
> > > > > > >  __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3036
> > > > > > >  do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
> > > > > > >  entry_SYSCALL_64_after_hwframe+0x49/0xb3


* Re: BUG: unable to handle kernel paging request in rb_erase
  2020-06-25 21:02         ` J. Bruce Fields
@ 2020-06-26 10:32           ` Dmitry Vyukov
  2020-06-26 16:47             ` J. Bruce Fields
  0 siblings, 1 reply; 8+ messages in thread
From: Dmitry Vyukov @ 2020-06-26 10:32 UTC (permalink / raw)
  To: J. Bruce Fields
  Cc: Hillf Danton, syzbot, chuck.lever, LKML, linux-nfs, syzkaller-bugs

On Thu, Jun 25, 2020 at 11:02 PM J. Bruce Fields <bfields@fieldses.org> wrote:
> > On Thu, Jun 04, 2020 at 11:53:59AM +0800, Hillf Danton wrote:
> > >
> > > On Wed, 3 Jun 2020 12:48:49 -0400 J. Bruce Fields wrote:
> > > > On Wed, Jun 03, 2020 at 10:43:26AM -0400, J. Bruce Fields wrote:
> > > > > On Wed, Jun 03, 2020 at 12:34:35PM +0800, Hillf Danton wrote:
> > > > > >
> > > > > > On Tue, 2 Jun 2020 17:55:17 -0400 "J. Bruce Fields" wrote:
> > > > > > >
> > > > > > > As far as I know, this one's still unresolved.  I can't see the bug from
> > > > > > > code inspection, and we don't have a reproducer.  If anyone else sees
> > > > > > > > this or has an idea what might be going wrong, I'd be interested.
> > > > > > > >
> > > > > > > > --b.
> > > > > >
> > > > > > It's a PF reported in the syz-executor.3 context (PID: 8682 on CPU:1),
> > > > > > meanwhile there's another at
> > > > > >
> > > > > >  https://lore.kernel.org/lkml/20200603011425.GA13019@fieldses.org/T/#t
> > > > > >  Reported-by: syzbot+a29df412692980277f9d@syzkaller.appspotmail.com
> > > > > >
> > > > > > in the kworker context, and one of the quick questions is, is it needed
> > > > > > to serialize the two players, say, using a mutex?
> > > > >
> > > > > nfsd_reply_cache_shutdown() doesn't take any locks.  All the data
> > > > > structures it's tearing down are per-network-namespace, and it's assumed
> > > > > all the users of that structure are gone by the time we get here.
> > > > >
> > > > > I wonder if that assumption's correct.  Looking at nfsd_exit_net()....
> > >
> > > IIUC it's correct for the kworker case where the ns in question is on
> > > the cleanup list, and for the syscall as well because the report is
> > > triggered in the error path, IOW the new ns is not yet visible to the
> > > kworker ATM.
> >
> > Sorry, I'm not familiar with the namespace code and I'm not following
> > you.
> >
> > I'm trying to figure out what prevents the network namespace exit method
> > being called while nfsd is still processing an rpc call from that
> > network namespace.
>
> Looking at this some more:
>
> Each server socket (struct svc_xprt) holds a reference on the struct net
> that's not released until svc_xprt_free().
>
> The svc_xprt is itself referenced as long as an rpc for that socket is
> being processed; the reference is released in svc_xprt_release(), which
> isn't called until the rpc is processed and the reply sent.
>
> So, assuming nfsd_exit_net() can't be called while we're still holding
> references to the struct net, there can't still be any reply cache
> processing going on when nfsd_reply_cache_shutdown() is called.
>
> So, still a mystery to me how this is happening.

Hi Bruce,

So far this crash has happened only once:
https://syzkaller.appspot.com/bug?extid=0e37e9d19bded16b8ab9

For continuous fuzzing on syzbot, that usually means either (1) it's a
super narrow race or (2) it's a previously unnoticed memory corruption.

Simpler bugs usually have much higher hit counts:
https://syzkaller.appspot.com/upstream
https://syzkaller.appspot.com/upstream/fixed

If you have made a reasonable search for any obvious bugs in the code
that would lead to such a failure, it can make sense to postpone any
additional actions until we have more info.
If no info comes, at some point syzbot will auto-obsolete it, and
then we can assume it was (2).





* Re: BUG: unable to handle kernel paging request in rb_erase
  2020-06-26 10:32           ` Dmitry Vyukov
@ 2020-06-26 16:47             ` J. Bruce Fields
  0 siblings, 0 replies; 8+ messages in thread
From: J. Bruce Fields @ 2020-06-26 16:47 UTC (permalink / raw)
  To: Dmitry Vyukov
  Cc: Hillf Danton, syzbot, chuck.lever, LKML, linux-nfs, syzkaller-bugs

On Fri, Jun 26, 2020 at 12:32:42PM +0200, Dmitry Vyukov wrote:
> So far this crash has happened only once:
> https://syzkaller.appspot.com/bug?extid=0e37e9d19bded16b8ab9
> 
> For continuous fuzzing on syzbot, that usually means either (1) it's a
> super narrow race or (2) it's a previously unnoticed memory corruption.
> 
> Simpler bugs usually have much higher hit counts:
> https://syzkaller.appspot.com/upstream
> https://syzkaller.appspot.com/upstream/fixed
> 
> If you have made a reasonable search for any obvious bugs in the code
> that would lead to such a failure, it can make sense to postpone any
> additional actions until we have more info.
> If no info comes, at some point syzbot will auto-obsolete it, and
> then we can assume it was (2).

OK, thanks.

It's a big, heavily used data structure; if there were random memory
corruption, then I guess this wouldn't be a surprising way for it to
show up.

--b.


Thread overview: 8+ messages
2020-05-18  6:59 BUG: unable to handle kernel paging request in rb_erase syzbot
2020-06-02 21:55 ` J. Bruce Fields
     [not found] ` <20200603043435.13820-1-hdanton@sina.com>
2020-06-03 14:43   ` J. Bruce Fields
2020-06-03 16:48     ` J. Bruce Fields
     [not found]     ` <20200604035359.2516-1-hdanton@sina.com>
2020-06-04 21:58       ` J. Bruce Fields
2020-06-25 21:02         ` J. Bruce Fields
2020-06-26 10:32           ` Dmitry Vyukov
2020-06-26 16:47             ` J. Bruce Fields
