* Re: possible deadlock in start_this_handle (2) [not found] <000000000000563a0205bafb7970@google.com> @ 2021-02-11 10:49 ` Jan Kara 2021-02-11 10:55 ` Michal Hocko ` (2 more replies) 0 siblings, 3 replies; 29+ messages in thread From: Jan Kara @ 2021-02-11 10:49 UTC (permalink / raw) To: syzbot Cc: jack, linux-ext4, linux-kernel, syzkaller-bugs, tytso, mhocko, linux-mm Hello, added mm guys to CC. On Wed 10-02-21 05:35:18, syzbot wrote: > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > userspace arch: i386 > > Unfortunately, I don't have any reproducer for this issue yet. > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > ====================================================== > WARNING: possible circular locking dependency detected > 5.11.0-rc6-syzkaller #0 Not tainted > ------------------------------------------------------ > kswapd0/2246 is trying to acquire lock: > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > but task is already holding lock: > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > which lock already depends on the new lock. 
> > the existing dependency chain (in reverse order) is: > > -> #2 (fs_reclaim){+.+.}-{0:0}: > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > might_alloc include/linux/sched/mm.h:193 [inline] > slab_pre_alloc_hook mm/slab.h:493 [inline] > slab_alloc_node mm/slub.c:2817 [inline] > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > kmalloc_node include/linux/slab.h:575 [inline] > kvmalloc_node+0x61/0xf0 mm/util.c:587 > kvmalloc include/linux/mm.h:781 [inline] > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > setxattr+0x1ff/0x290 fs/xattr.c:553 > path_setxattr+0x170/0x190 fs/xattr.c:572 > __do_sys_setxattr fs/xattr.c:587 [inline] > __se_sys_setxattr fs/xattr.c:583 [inline] > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c This stacktrace should never happen. ext4_xattr_set() starts a transaction. That internally goes through start_this_handle() which calls: handle->saved_alloc_context = memalloc_nofs_save(); and we restore the allocation context only in stop_this_handle() when stopping the handle. And with this fs_reclaim_acquire() should remove __GFP_FS from the mask and not call __fs_reclaim_acquire(). Now I have no idea why something here didn't work out. 
Given we don't have a reproducer it will probably be difficult to debug this. I'd note that about a year and a half ago a similar report happened (it got autoclosed), so it may be something real somewhere, but it may also be just some HW glitch or something like that. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 10:49 ` possible deadlock in start_this_handle (2) Jan Kara @ 2021-02-11 10:55 ` Michal Hocko 2021-02-11 11:22 ` Dmitry Vyukov 2021-02-13 14:26 ` Tetsuo Handa 2 siblings, 0 replies; 29+ messages in thread From: Michal Hocko @ 2021-02-11 10:55 UTC (permalink / raw) To: Jan Kara Cc: syzbot, jack, linux-ext4, linux-kernel, syzkaller-bugs, tytso, linux-mm On Thu 11-02-21 11:49:47, Jan Kara wrote: > Hello, > > added mm guys to CC. > > On Wed 10-02-21 05:35:18, syzbot wrote: > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > git tree: upstream > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > userspace arch: i386 > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > ====================================================== > > WARNING: possible circular locking dependency detected > > 5.11.0-rc6-syzkaller #0 Not tainted > > ------------------------------------------------------ > > kswapd0/2246 is trying to acquire lock: > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > but task is already holding lock: > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > which lock already depends on the new lock. 
> > > > the existing dependency chain (in reverse order) is: > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > might_alloc include/linux/sched/mm.h:193 [inline] > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > slab_alloc_node mm/slub.c:2817 [inline] > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > kmalloc_node include/linux/slab.h:575 [inline] > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > kvmalloc include/linux/mm.h:781 [inline] > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > __do_sys_setxattr fs/xattr.c:587 [inline] > > __se_sys_setxattr fs/xattr.c:583 [inline] > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > That internally goes through start_this_handle() which calls: > > handle->saved_alloc_context = memalloc_nofs_save(); > > and we restore the allocation context only in stop_this_handle() when > stopping the handle. 
And with this fs_reclaim_acquire() should remove > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > Now I have no idea why something here didn't work out. Given we don't have > a reproducer it will be probably difficult to debug this. I'd note that > about year and half ago similar report happened (got autoclosed) so it may > be something real somewhere but it may also be just some HW glitch or > something like that. Is it possible this is just a lockdep false positive? Is it possible that there is a pre-recorded lock dependency chain that happens outside of the transaction and that clashes with this one? I do not remember any recent changes in the way the scope API is handled, except for the CMA scope API changes, but those should be pretty much independent. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 10:49 ` possible deadlock in start_this_handle (2) Jan Kara 2021-02-11 10:55 ` Michal Hocko @ 2021-02-11 11:22 ` Dmitry Vyukov 2021-02-11 11:28 ` Dmitry Vyukov 2021-02-11 11:46 ` Jan Kara 2021-02-13 14:26 ` Tetsuo Handa 2 siblings, 2 replies; 29+ messages in thread From: Dmitry Vyukov @ 2021-02-11 11:22 UTC (permalink / raw) To: Jan Kara Cc: syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Michal Hocko, Linux-MM On Thu, Feb 11, 2021 at 11:49 AM Jan Kara <jack@suse.cz> wrote: > > Hello, > > added mm guys to CC. > > On Wed 10-02-21 05:35:18, syzbot wrote: > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > git tree: upstream > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > userspace arch: i386 > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > ====================================================== > > WARNING: possible circular locking dependency detected > > 5.11.0-rc6-syzkaller #0 Not tainted > > ------------------------------------------------------ > > kswapd0/2246 is trying to acquire lock: > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > but task is already holding lock: > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > which lock already depends on the new lock. 
> > > > the existing dependency chain (in reverse order) is: > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > might_alloc include/linux/sched/mm.h:193 [inline] > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > slab_alloc_node mm/slub.c:2817 [inline] > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > kmalloc_node include/linux/slab.h:575 [inline] > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > kvmalloc include/linux/mm.h:781 [inline] > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > __do_sys_setxattr fs/xattr.c:587 [inline] > > __se_sys_setxattr fs/xattr.c:583 [inline] > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > That internally goes through start_this_handle() which calls: > > handle->saved_alloc_context = memalloc_nofs_save(); > > and we restore the allocation context only in stop_this_handle() when > stopping the handle. 
And with this fs_reclaim_acquire() should remove > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > Now I have no idea why something here didn't work out. Given we don't have > a reproducer it will be probably difficult to debug this. I'd note that > about year and half ago similar report happened (got autoclosed) so it may > be something real somewhere but it may also be just some HW glitch or > something like that. HW glitch is theoretically possible. But if we are considering such causes, I would say a kernel memory corruption is way more likely, we have hundreds of known memory-corruption-capable bugs open. In most cases they are caught by KASAN before doing silent damage. But KASAN can miss some cases. I see at least 4 existing bugs with similar stack: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae https://syzkaller.appspot.com/bug?extid=a7ab8df042baaf42ae3c https://syzkaller.appspot.com/bug?id=c814a55a728493959328551c769ede4c8ff72aab https://syzkaller.appspot.com/bug?id=426ad9adca053dafcd698f3a48ad5406dccc972b All in all, I would not assume it's a memory corruption. When we had bugs that actually caused silent memory corruption, that caused a spike of random one-time crashes all over the kernel. This does not look like it. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 11:22 ` Dmitry Vyukov @ 2021-02-11 11:28 ` Dmitry Vyukov 2021-02-11 12:10 ` Jan Kara 2021-02-11 11:46 ` Jan Kara 1 sibling, 1 reply; 29+ messages in thread From: Dmitry Vyukov @ 2021-02-11 11:28 UTC (permalink / raw) To: Jan Kara Cc: syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Michal Hocko, Linux-MM On Thu, Feb 11, 2021 at 12:22 PM Dmitry Vyukov <dvyukov@google.com> wrote: > > On Thu, Feb 11, 2021 at 11:49 AM Jan Kara <jack@suse.cz> wrote: > > > > Hello, > > > > added mm guys to CC. > > > > On Wed 10-02-21 05:35:18, syzbot wrote: > > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > > git tree: upstream > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > userspace arch: i386 > > > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > > > ====================================================== > > > WARNING: possible circular locking dependency detected > > > 5.11.0-rc6-syzkaller #0 Not tainted > > > ------------------------------------------------------ > > > kswapd0/2246 is trying to acquire lock: > > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > > > but task is already holding lock: > > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > > > which lock already depends on the new lock. 
> > > > > > the existing dependency chain (in reverse order) is: > > > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > > might_alloc include/linux/sched/mm.h:193 [inline] > > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > > slab_alloc_node mm/slub.c:2817 [inline] > > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > > kmalloc_node include/linux/slab.h:575 [inline] > > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > > kvmalloc include/linux/mm.h:781 [inline] > > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > > > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > > __do_sys_setxattr fs/xattr.c:587 [inline] > > > __se_sys_setxattr fs/xattr.c:583 [inline] > > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > > That internally goes through start_this_handle() which calls: > > > > handle->saved_alloc_context = memalloc_nofs_save(); > > > > and we restore the allocation context only in stop_this_handle() when > > stopping the handle. 
And with this fs_reclaim_acquire() should remove > > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > > > Now I have no idea why something here didn't work out. Given we don't have > > a reproducer it will be probably difficult to debug this. I'd note that > > about year and half ago similar report happened (got autoclosed) so it may > > be something real somewhere but it may also be just some HW glitch or > > something like that. > > HW glitch is theoretically possible. But if we are considering such > causes, I would say a kernel memory corruption is way more likely, we > have hundreds of known memory-corruption-capable bugs open. In most > cases they are caught by KASAN before doing silent damage. But KASAN > can miss some cases. > > I see at least 4 existing bugs with similar stack: > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > https://syzkaller.appspot.com/bug?extid=a7ab8df042baaf42ae3c > https://syzkaller.appspot.com/bug?id=c814a55a728493959328551c769ede4c8ff72aab > https://syzkaller.appspot.com/bug?id=426ad9adca053dafcd698f3a48ad5406dccc972b > > All in all, I would not assume it's a memory corruption. When we had > bugs that actually caused silent memory corruption, that caused a > spike of random one-time crashes all over the kernel. This does not > look like it. I wonder if memalloc_nofs_save (or any other manipulation of current->flags) could have been invoked from interrupt context? I think it could cause the failure mode we observe (extremely rare disappearing flags). It may be useful to add a check for task context there. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 11:28 ` Dmitry Vyukov @ 2021-02-11 12:10 ` Jan Kara 2021-02-11 12:34 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: Jan Kara @ 2021-02-11 12:10 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jan Kara, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Michal Hocko, Linux-MM On Thu 11-02-21 12:28:48, Dmitry Vyukov wrote: > On Thu, Feb 11, 2021 at 12:22 PM Dmitry Vyukov <dvyukov@google.com> wrote: > > > > On Thu, Feb 11, 2021 at 11:49 AM Jan Kara <jack@suse.cz> wrote: > > > > > > Hello, > > > > > > added mm guys to CC. > > > > > > On Wed 10-02-21 05:35:18, syzbot wrote: > > > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > > > git tree: upstream > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > > userspace arch: i386 > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > > > > > ====================================================== > > > > WARNING: possible circular locking dependency detected > > > > 5.11.0-rc6-syzkaller #0 Not tainted > > > > ------------------------------------------------------ > > > > kswapd0/2246 is trying to acquire lock: > > > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > > > > > but task is already holding lock: > > > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > > > > > which lock already depends on the new lock. 
> > > > > > > > the existing dependency chain (in reverse order) is: > > > > > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > > > might_alloc include/linux/sched/mm.h:193 [inline] > > > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > > > slab_alloc_node mm/slub.c:2817 [inline] > > > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > > > kmalloc_node include/linux/slab.h:575 [inline] > > > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > > > kvmalloc include/linux/mm.h:781 [inline] > > > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > > > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > > > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > > > > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > > > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > > > __do_sys_setxattr fs/xattr.c:587 [inline] > > > > __se_sys_setxattr fs/xattr.c:583 [inline] > > > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > > > > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. 
> > > That internally goes through start_this_handle() which calls: > > > > > > handle->saved_alloc_context = memalloc_nofs_save(); > > > > > > and we restore the allocation context only in stop_this_handle() when > > > stopping the handle. And with this fs_reclaim_acquire() should remove > > > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > > > > > Now I have no idea why something here didn't work out. Given we don't have > > > a reproducer it will be probably difficult to debug this. I'd note that > > > about year and half ago similar report happened (got autoclosed) so it may > > > be something real somewhere but it may also be just some HW glitch or > > > something like that. > > > > HW glitch is theoretically possible. But if we are considering such > > causes, I would say a kernel memory corruption is way more likely, we > > have hundreds of known memory-corruption-capable bugs open. In most > > cases they are caught by KASAN before doing silent damage. But KASAN > > can miss some cases. > > > > I see at least 4 existing bugs with similar stack: > > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > https://syzkaller.appspot.com/bug?extid=a7ab8df042baaf42ae3c > > https://syzkaller.appspot.com/bug?id=c814a55a728493959328551c769ede4c8ff72aab > > https://syzkaller.appspot.com/bug?id=426ad9adca053dafcd698f3a48ad5406dccc972b > > > > All in all, I would not assume it's a memory corruption. When we had > > bugs that actually caused silent memory corruption, that caused a > > spike of random one-time crashes all over the kernel. This does not > > look like it. > > I wonder if memalloc_nofs_save (or any other manipulation of > current->flags) could have been invoked from interrupt context? I > think it could cause the failure mode we observe (extremely rare > disappearing flags). It may be useful to add a check for task context > there. That's an interesting idea. 
I'm not sure if anything does manipulate current->flags from inside an interrupt (memalloc_nofs_save() certainly doesn't seem to be invoked from one) but I'd think that in a fully preemptible kernel, the scheduler could preempt the task inside memalloc_nofs_save() and the current->flags manipulation could also clash with a manipulation of these flags by the scheduler, if there is any? Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 12:10 ` Jan Kara @ 2021-02-11 12:34 ` Michal Hocko 2021-02-11 12:57 ` Matthew Wilcox 0 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2021-02-11 12:34 UTC (permalink / raw) To: Jan Kara Cc: Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Thu 11-02-21 13:10:20, Jan Kara wrote: > On Thu 11-02-21 12:28:48, Dmitry Vyukov wrote: > > On Thu, Feb 11, 2021 at 12:22 PM Dmitry Vyukov <dvyukov@google.com> wrote: > > > > > > On Thu, Feb 11, 2021 at 11:49 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > Hello, > > > > > > > > added mm guys to CC. > > > > > > > > On Wed 10-02-21 05:35:18, syzbot wrote: > > > > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > > > > git tree: upstream > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > > > userspace arch: i386 > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > > > > > > > ====================================================== > > > > > WARNING: possible circular locking dependency detected > > > > > 5.11.0-rc6-syzkaller #0 Not tainted > > > > > ------------------------------------------------------ > > > > > kswapd0/2246 is trying to acquire lock: > > > > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > > > > > > > but task is already holding lock: > > > > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > > > > > > > which lock already depends on the new lock. > > > > > > > > > > the existing dependency chain (in reverse order) is: > > > > > > > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > > > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > > > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > > > > might_alloc include/linux/sched/mm.h:193 [inline] > > > > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > > > > slab_alloc_node mm/slub.c:2817 [inline] > > > > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > > > > kmalloc_node include/linux/slab.h:575 [inline] > > > > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > > > > kvmalloc include/linux/mm.h:781 [inline] > > > > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > > > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > > > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > > > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > > > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > > > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > > > > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > > > > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > > > > > __vfs_setxattr_noperm+0x11a/0x4c0 
fs/xattr.c:208 > > > > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > > > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > > > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > > > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > > > > __do_sys_setxattr fs/xattr.c:587 [inline] > > > > > __se_sys_setxattr fs/xattr.c:583 [inline] > > > > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > > > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > > > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > > > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > > > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > > > > > > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > > > > That internally goes through start_this_handle() which calls: > > > > > > > > handle->saved_alloc_context = memalloc_nofs_save(); > > > > > > > > and we restore the allocation context only in stop_this_handle() when > > > > stopping the handle. And with this fs_reclaim_acquire() should remove > > > > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > > > > > > > Now I have no idea why something here didn't work out. Given we don't have > > > > a reproducer it will be probably difficult to debug this. I'd note that > > > > about year and half ago similar report happened (got autoclosed) so it may > > > > be something real somewhere but it may also be just some HW glitch or > > > > something like that. > > > > > > HW glitch is theoretically possible. But if we are considering such > > > causes, I would say a kernel memory corruption is way more likely, we > > > have hundreds of known memory-corruption-capable bugs open. In most > > > cases they are caught by KASAN before doing silent damage. But KASAN > > > can miss some cases. 
> > > > > > I see at least 4 existing bugs with similar stack: > > > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > https://syzkaller.appspot.com/bug?extid=a7ab8df042baaf42ae3c > > > https://syzkaller.appspot.com/bug?id=c814a55a728493959328551c769ede4c8ff72aab > > > https://syzkaller.appspot.com/bug?id=426ad9adca053dafcd698f3a48ad5406dccc972b > > > > > > All in all, I would not assume it's a memory corruption. When we had > > > bugs that actually caused silent memory corruption, that caused a > > > spike of random one-time crashes all over the kernel. This does not > > > look like it. > > > > I wonder if memalloc_nofs_save (or any other manipulation of > > current->flags) could have been invoked from interrupt context? I > > think it could cause the failure mode we observe (extremely rare > > disappearing flags). It may be useful to add a check for task context > > there. > > That's an interesting idea. I'm not sure if anything does manipulate > current->flags from inside an interrupt (definitely memalloc_nofs_save() > doesn't seem to be) but I'd think that in fully preemtible kernel, > scheduler could preempt the task inside memalloc_nofs_save() and the > current->flags manipulation could also clash with a manipulation of these > flags by the scheduler if there's any? current->flags should always be manipulated from task context. But who knows, maybe there is a bug and some interrupt handler is calling it. This should be easy to catch, no? -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 12:34 ` Michal Hocko @ 2021-02-11 12:57 ` Matthew Wilcox 2021-02-11 13:07 ` Michal Hocko 2021-02-11 13:18 ` Dmitry Vyukov 0 siblings, 2 replies; 29+ messages in thread From: Matthew Wilcox @ 2021-02-11 12:57 UTC (permalink / raw) To: Michal Hocko Cc: Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Thu, Feb 11, 2021 at 01:34:48PM +0100, Michal Hocko wrote: > On Thu 11-02-21 13:10:20, Jan Kara wrote: > > On Thu 11-02-21 12:28:48, Dmitry Vyukov wrote: > > > On Thu, Feb 11, 2021 at 12:22 PM Dmitry Vyukov <dvyukov@google.com> wrote: > > > > > > > > On Thu, Feb 11, 2021 at 11:49 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > Hello, > > > > > > > > > > added mm guys to CC. > > > > > > > > > > On Wed 10-02-21 05:35:18, syzbot wrote: > > > > > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > > > > > git tree: upstream > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > > > > userspace arch: i386 > > > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > > > > > > > > > ====================================================== > > > > > > WARNING: possible circular locking dependency detected > > > > > > 5.11.0-rc6-syzkaller #0 Not tainted > > > > > > ------------------------------------------------------ > > > > > > kswapd0/2246 is trying to acquire lock: > > > > > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > > > > > > > > > but task is already holding lock: > > > > > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > > > > > > > > > which lock already depends on the new lock. > > > > > > > > > > > > the existing dependency chain (in reverse order) is: > > > > > > > > > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > > > > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > > > > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > > > > > might_alloc include/linux/sched/mm.h:193 [inline] > > > > > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > > > > > slab_alloc_node mm/slub.c:2817 [inline] > > > > > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > > > > > kmalloc_node include/linux/slab.h:575 [inline] > > > > > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > > > > > kvmalloc include/linux/mm.h:781 [inline] > > > > > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > > > > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > > > > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > > > > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > > > > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > > > > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > > > > > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > > > > > __vfs_setxattr+0x10e/0x170 
fs/xattr.c:177 > > > > > > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > > > > > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > > > > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > > > > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > > > > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > > > > > __do_sys_setxattr fs/xattr.c:587 [inline] > > > > > > __se_sys_setxattr fs/xattr.c:583 [inline] > > > > > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > > > > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > > > > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > > > > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > > > > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > > > > > > > > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > > > > > That internally goes through start_this_handle() which calls: > > > > > > > > > > handle->saved_alloc_context = memalloc_nofs_save(); > > > > > > > > > > and we restore the allocation context only in stop_this_handle() when > > > > > stopping the handle. And with this fs_reclaim_acquire() should remove > > > > > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > > > > > > > > > Now I have no idea why something here didn't work out. Given we don't have > > > > > a reproducer it will be probably difficult to debug this. I'd note that > > > > > about year and half ago similar report happened (got autoclosed) so it may > > > > > be something real somewhere but it may also be just some HW glitch or > > > > > something like that. > > > > > > > > HW glitch is theoretically possible. But if we are considering such > > > > causes, I would say a kernel memory corruption is way more likely, we > > > > have hundreds of known memory-corruption-capable bugs open. In most > > > > cases they are caught by KASAN before doing silent damage. But KASAN > > > > can miss some cases. 
> > > > > > > > I see at least 4 existing bugs with similar stack: > > > > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > > https://syzkaller.appspot.com/bug?extid=a7ab8df042baaf42ae3c > > > > https://syzkaller.appspot.com/bug?id=c814a55a728493959328551c769ede4c8ff72aab > > > > https://syzkaller.appspot.com/bug?id=426ad9adca053dafcd698f3a48ad5406dccc972b > > > > > > > > All in all, I would not assume it's a memory corruption. When we had > > > > bugs that actually caused silent memory corruption, that caused a > > > > spike of random one-time crashes all over the kernel. This does not > > > > look like it. > > > > > > I wonder if memalloc_nofs_save (or any other manipulation of > > > current->flags) could have been invoked from interrupt context? I > > > think it could cause the failure mode we observe (extremely rare > > > disappearing flags). It may be useful to add a check for task context > > > there. > > > > That's an interesting idea. I'm not sure if anything does manipulate > > current->flags from inside an interrupt (definitely memalloc_nofs_save() > > doesn't seem to be) but I'd think that in fully preemtible kernel, > > scheduler could preempt the task inside memalloc_nofs_save() and the > > current->flags manipulation could also clash with a manipulation of these > > flags by the scheduler if there's any? > > current->flags should be always manipulated from the user context. But > who knows maybe there is a bug and some interrupt handler is calling it. > This should be easy to catch no? Why would it matter if it were? We save the current value of the nofs flag and then restore it. That would happen before the end of the interrupt handler. So the interrupt isn't going to change the observed value of the flag by the task which is interrupted. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 12:57 ` Matthew Wilcox @ 2021-02-11 13:07 ` Michal Hocko 2021-02-11 13:25 ` Matthew Wilcox 2021-02-11 13:18 ` Dmitry Vyukov 1 sibling, 1 reply; 29+ messages in thread From: Michal Hocko @ 2021-02-11 13:07 UTC (permalink / raw) To: Matthew Wilcox Cc: Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Thu 11-02-21 12:57:17, Matthew Wilcox wrote: > On Thu, Feb 11, 2021 at 01:34:48PM +0100, Michal Hocko wrote: > > On Thu 11-02-21 13:10:20, Jan Kara wrote: > > > On Thu 11-02-21 12:28:48, Dmitry Vyukov wrote: > > > > On Thu, Feb 11, 2021 at 12:22 PM Dmitry Vyukov <dvyukov@google.com> wrote: > > > > > > > > > > On Thu, Feb 11, 2021 at 11:49 AM Jan Kara <jack@suse.cz> wrote: > > > > > > > > > > > > Hello, > > > > > > > > > > > > added mm guys to CC. > > > > > > > > > > > > On Wed 10-02-21 05:35:18, syzbot wrote: > > > > > > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > > > > > > git tree: upstream > > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > > > > > userspace arch: i386 > > > > > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. 
> > > > > > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > > > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > > > > > > > > > > > ====================================================== > > > > > > > WARNING: possible circular locking dependency detected > > > > > > > 5.11.0-rc6-syzkaller #0 Not tainted > > > > > > > ------------------------------------------------------ > > > > > > > kswapd0/2246 is trying to acquire lock: > > > > > > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > > > > > > > > > > > but task is already holding lock: > > > > > > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > > > > > > > > > > > which lock already depends on the new lock. > > > > > > > > > > > > > > the existing dependency chain (in reverse order) is: > > > > > > > > > > > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > > > > > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > > > > > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > > > > > > might_alloc include/linux/sched/mm.h:193 [inline] > > > > > > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > > > > > > slab_alloc_node mm/slub.c:2817 [inline] > > > > > > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > > > > > > kmalloc_node include/linux/slab.h:575 [inline] > > > > > > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > > > > > > kvmalloc include/linux/mm.h:781 [inline] > > > > > > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > > > > > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > > > > > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > > > > > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > > > > > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > > > > > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > > > > > > 
ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > > > > > > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > > > > > > > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > > > > > > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > > > > > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > > > > > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > > > > > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > > > > > > __do_sys_setxattr fs/xattr.c:587 [inline] > > > > > > > __se_sys_setxattr fs/xattr.c:583 [inline] > > > > > > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > > > > > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > > > > > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > > > > > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > > > > > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > > > > > > > > > > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > > > > > > That internally goes through start_this_handle() which calls: > > > > > > > > > > > > handle->saved_alloc_context = memalloc_nofs_save(); > > > > > > > > > > > > and we restore the allocation context only in stop_this_handle() when > > > > > > stopping the handle. And with this fs_reclaim_acquire() should remove > > > > > > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > > > > > > > > > > > Now I have no idea why something here didn't work out. Given we don't have > > > > > > a reproducer it will be probably difficult to debug this. I'd note that > > > > > > about year and half ago similar report happened (got autoclosed) so it may > > > > > > be something real somewhere but it may also be just some HW glitch or > > > > > > something like that. > > > > > > > > > > HW glitch is theoretically possible. But if we are considering such > > > > > causes, I would say a kernel memory corruption is way more likely, we > > > > > have hundreds of known memory-corruption-capable bugs open. 
In most > > > > > cases they are caught by KASAN before doing silent damage. But KASAN > > > > > can miss some cases. > > > > > > > > > > I see at least 4 existing bugs with similar stack: > > > > > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > > > https://syzkaller.appspot.com/bug?extid=a7ab8df042baaf42ae3c > > > > > https://syzkaller.appspot.com/bug?id=c814a55a728493959328551c769ede4c8ff72aab > > > > > https://syzkaller.appspot.com/bug?id=426ad9adca053dafcd698f3a48ad5406dccc972b > > > > > > > > > > All in all, I would not assume it's a memory corruption. When we had > > > > > bugs that actually caused silent memory corruption, that caused a > > > > > spike of random one-time crashes all over the kernel. This does not > > > > > look like it. > > > > > > > > I wonder if memalloc_nofs_save (or any other manipulation of > > > > current->flags) could have been invoked from interrupt context? I > > > > think it could cause the failure mode we observe (extremely rare > > > > disappearing flags). It may be useful to add a check for task context > > > > there. > > > > > > That's an interesting idea. I'm not sure if anything does manipulate > > > current->flags from inside an interrupt (definitely memalloc_nofs_save() > > > doesn't seem to be) but I'd think that in fully preemtible kernel, > > > scheduler could preempt the task inside memalloc_nofs_save() and the > > > current->flags manipulation could also clash with a manipulation of these > > > flags by the scheduler if there's any? > > > > current->flags should be always manipulated from the user context. But > > who knows maybe there is a bug and some interrupt handler is calling it. > > This should be easy to catch no? > > Why would it matter if it were? I was thinking about a clobbered state because updates to ->flags are not atomic because this shouldn't ever be updated concurrently. So maybe a racing interrupt could corrupt the flags state? 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 13:07 ` Michal Hocko @ 2021-02-11 13:25 ` Matthew Wilcox 2021-02-11 14:20 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: Matthew Wilcox @ 2021-02-11 13:25 UTC (permalink / raw) To: Michal Hocko Cc: Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Thu, Feb 11, 2021 at 02:07:03PM +0100, Michal Hocko wrote: > On Thu 11-02-21 12:57:17, Matthew Wilcox wrote: > > > current->flags should be always manipulated from the user context. But > > > who knows maybe there is a bug and some interrupt handler is calling it. > > > This should be easy to catch no? > > > > Why would it matter if it were? > > I was thinking about a clobbered state because updates to ->flags are > not atomic because this shouldn't ever be updated concurrently. So maybe > a racing interrupt could corrupt the flags state? I don't think that's possible. Same-CPU races between interrupt and process context are simpler because the CPU always observes its own writes in order and the interrupt handler completes "between" two instructions. eg a load-store CPU will do:

    load 0 from address A
    or 8 with result
    store 8 to A

Two CPUs can do:

    CPU 0              CPU 1
    load 0 from A      load 0 from A
    or 8 with 0        or 4 with 0
    store 8 to A       store 4 to A

and the store of 8 is lost.

    process            interrupt
    load 0 from A
                       load 0 from A
                       or 4 with 0
                       store 4 to A
    or 8 with 0
    store 8 to A

so the store of 4 would be lost. but we expect the interrupt handler to restore it. so we actually have this:

    load 0 from A
                       load 0 from A
                       or 4 with 0
                       store 4 to A
                       load 4 from A
                       clear 4 from 4
                       store 0 to A
    or 8 with 0
    store 8 to A

If we have a leak where someone forgets to restore the nofs, that might cause this. We could check for the allocation mask bits being clear at syscall exit (scheduling with these flags set is obviously ok). ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 13:25 ` Matthew Wilcox @ 2021-02-11 14:20 ` Michal Hocko 2021-02-11 14:26 ` Matthew Wilcox 0 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2021-02-11 14:20 UTC (permalink / raw) To: Matthew Wilcox Cc: Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Thu 11-02-21 13:25:33, Matthew Wilcox wrote: > On Thu, Feb 11, 2021 at 02:07:03PM +0100, Michal Hocko wrote: > > On Thu 11-02-21 12:57:17, Matthew Wilcox wrote: > > > > current->flags should be always manipulated from the user context. But > > > > who knows maybe there is a bug and some interrupt handler is calling it. > > > > This should be easy to catch no? > > > > > > Why would it matter if it were? > > > > I was thinking about a clobbered state because updates to ->flags are > > not atomic because this shouldn't ever be updated concurrently. So maybe > > a racing interrupt could corrupt the flags state? > > I don't think that's possible. Same-CPU races between interrupt and > process context are simpler because the CPU always observes its own writes > in order and the interrupt handler completes "between" two instructions. I have to confess I haven't really thought the scenario through. My idea was to simply add a simple check for an irq context into ->flags setting routine because this should never be done in the first place. Not only for scope gfp flags but any other PF_ flags IIRC. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 14:20 ` Michal Hocko @ 2021-02-11 14:26 ` Matthew Wilcox 2021-02-11 16:41 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: Matthew Wilcox @ 2021-02-11 14:26 UTC (permalink / raw) To: Michal Hocko Cc: Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Thu, Feb 11, 2021 at 03:20:41PM +0100, Michal Hocko wrote: > On Thu 11-02-21 13:25:33, Matthew Wilcox wrote: > > On Thu, Feb 11, 2021 at 02:07:03PM +0100, Michal Hocko wrote: > > > On Thu 11-02-21 12:57:17, Matthew Wilcox wrote: > > > > > current->flags should be always manipulated from the user context. But > > > > > who knows maybe there is a bug and some interrupt handler is calling it. > > > > > This should be easy to catch no? > > > > > > > > Why would it matter if it were? > > > > > > I was thinking about a clobbered state because updates to ->flags are > > > not atomic because this shouldn't ever be updated concurrently. So maybe > > > a racing interrupt could corrupt the flags state? > > > > I don't think that's possible. Same-CPU races between interrupt and > > process context are simpler because the CPU always observes its own writes > > in order and the interrupt handler completes "between" two instructions. > > I have to confess I haven't really thought the scenario through. My idea > was to simply add a simple check for an irq context into ->flags setting > routine because this should never be done in the first place. Not only > for scope gfp flags but any other PF_ flags IIRC. That's not automatically clear to me. There are plenty of places where an interrupt borrows the context of the task that it happens to have interrupted. 
Specifically, interrupts should be using GFP_ATOMIC anyway, so this doesn't really make a lot of sense, but I don't think it's necessarily wrong for an interrupt to call a function that says "Definitely don't make GFP_FS allocations between these two points". ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 14:26 ` Matthew Wilcox @ 2021-02-11 16:41 ` Michal Hocko 2021-02-12 11:18 ` Tetsuo Handa 0 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2021-02-11 16:41 UTC (permalink / raw) To: Matthew Wilcox Cc: Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Thu 11-02-21 14:26:30, Matthew Wilcox wrote: > On Thu, Feb 11, 2021 at 03:20:41PM +0100, Michal Hocko wrote: > > On Thu 11-02-21 13:25:33, Matthew Wilcox wrote: > > > On Thu, Feb 11, 2021 at 02:07:03PM +0100, Michal Hocko wrote: > > > > On Thu 11-02-21 12:57:17, Matthew Wilcox wrote: > > > > > > current->flags should be always manipulated from the user context. But > > > > > > who knows maybe there is a bug and some interrupt handler is calling it. > > > > > > This should be easy to catch no? > > > > > > > > > > Why would it matter if it were? > > > > > > > > I was thinking about a clobbered state because updates to ->flags are > > > > not atomic because this shouldn't ever be updated concurrently. So maybe > > > > a racing interrupt could corrupt the flags state? > > > > > > I don't think that's possible. Same-CPU races between interrupt and > > > process context are simpler because the CPU always observes its own writes > > > in order and the interrupt handler completes "between" two instructions. > > > > I have to confess I haven't really thought the scenario through. My idea > > was to simply add a simple check for an irq context into ->flags setting > > routine because this should never be done in the first place. Not only > > for scope gfp flags but any other PF_ flags IIRC. > > That's not automatically clear to me. There are plenty of places > where an interrupt borrows the context of the task that it happens to > have interrupted. 
Specifically, interrupts should be using GFP_ATOMIC > anyway, so this doesn't really make a lot of sense, but I don't think > it's necessarily wrong for an interrupt to call a function that says > "Definitely don't make GFP_FS allocations between these two points". Not sure I got your point. IRQ context never does reclaim, so anything outside of NOWAIT/ATOMIC is pointless. But you might be referring to future code where GFP_FS might have a meaning outside of the reclaim context? Anyway, if we are to allow modifying PF_ flags from an interrupt context then I believe we should at least make that code IRQ aware. I do not feel really comfortable about async modifications when this is stated to be safe to do in a non-atomic way. But I suspect we have drifted away from the original issue. I thought that a simple check would help us narrow down this particular case, and somebody messing up from the IRQ context didn't sound completely off. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 16:41 ` Michal Hocko @ 2021-02-12 11:18 ` Tetsuo Handa 2021-02-12 12:22 ` Matthew Wilcox 0 siblings, 1 reply; 29+ messages in thread From: Tetsuo Handa @ 2021-02-12 11:18 UTC (permalink / raw) To: Michal Hocko, Matthew Wilcox Cc: Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On 2021/02/12 1:41, Michal Hocko wrote: > But I suspect we have drifted away from the original issue. I thought > that a simple check would help us narrow down this particular case and > somebody messing up from the IRQ context didn't sound like a completely > off. > From my experience at https://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp , I think we can replace direct PF_* manipulation with macros which do not take a "struct task_struct *" argument. Since TASK_PFA_TEST()/TASK_PFA_SET()/TASK_PFA_CLEAR() are for manipulating PFA_* flags on a remote thread, we can define similar ones for manipulating PF_* flags on the current thread. Then, auditing dangerous users becomes easier. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-12 11:18 ` Tetsuo Handa @ 2021-02-12 12:22 ` Matthew Wilcox 2021-02-12 12:30 ` Michal Hocko 0 siblings, 1 reply; 29+ messages in thread From: Matthew Wilcox @ 2021-02-12 12:22 UTC (permalink / raw) To: Tetsuo Handa Cc: Michal Hocko, Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Fri, Feb 12, 2021 at 08:18:11PM +0900, Tetsuo Handa wrote: > On 2021/02/12 1:41, Michal Hocko wrote: > > But I suspect we have drifted away from the original issue. I thought > > that a simple check would help us narrow down this particular case and > > somebody messing up from the IRQ context didn't sound like a completely > > off. > > > > From my experience at https://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp , > I think we can replace direct PF_* manipulation with macros which do not receive "struct task_struct *" argument. > Since TASK_PFA_TEST()/TASK_PFA_SET()/TASK_PFA_CLEAR() are for manipulating PFA_* flags on a remote thread, we can > define similar ones for manipulating PF_* flags on current thread. Then, auditing dangerous users becomes easier. No, nobody is manipulating another task's GFP flags. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-12 12:22 ` Matthew Wilcox @ 2021-02-12 12:30 ` Michal Hocko 2021-02-12 12:58 ` Tetsuo Handa 0 siblings, 1 reply; 29+ messages in thread From: Michal Hocko @ 2021-02-12 12:30 UTC (permalink / raw) To: Matthew Wilcox Cc: Tetsuo Handa, Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Fri 12-02-21 12:22:07, Matthew Wilcox wrote: > On Fri, Feb 12, 2021 at 08:18:11PM +0900, Tetsuo Handa wrote: > > On 2021/02/12 1:41, Michal Hocko wrote: > > > But I suspect we have drifted away from the original issue. I thought > > > that a simple check would help us narrow down this particular case and > > > somebody messing up from the IRQ context didn't sound like a completely > > > off. > > > > > > > From my experience at https://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp , > > I think we can replace direct PF_* manipulation with macros which do not receive "struct task_struct *" argument. > > Since TASK_PFA_TEST()/TASK_PFA_SET()/TASK_PFA_CLEAR() are for manipulating PFA_* flags on a remote thread, we can > > define similar ones for manipulating PF_* flags on current thread. Then, auditing dangerous users becomes easier. > > No, nobody is manipulating another task's GFP flags. Agreed. And nobody should be manipulating PF flags on remote tasks either. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-12 12:30 ` Michal Hocko @ 2021-02-12 12:58 ` Tetsuo Handa 2021-02-12 13:12 ` Michal Hocko 2021-02-12 15:43 ` Michal Hocko 0 siblings, 2 replies; 29+ messages in thread From: Tetsuo Handa @ 2021-02-12 12:58 UTC (permalink / raw) To: Michal Hocko, Matthew Wilcox Cc: Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On 2021/02/12 21:30, Michal Hocko wrote: > On Fri 12-02-21 12:22:07, Matthew Wilcox wrote: >> On Fri, Feb 12, 2021 at 08:18:11PM +0900, Tetsuo Handa wrote: >>> On 2021/02/12 1:41, Michal Hocko wrote: >>>> But I suspect we have drifted away from the original issue. I thought >>>> that a simple check would help us narrow down this particular case and >>>> somebody messing up from the IRQ context didn't sound like a completely >>>> off. >>>> >>> >>> From my experience at https://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp , >>> I think we can replace direct PF_* manipulation with macros which do not receive "struct task_struct *" argument. >>> Since TASK_PFA_TEST()/TASK_PFA_SET()/TASK_PFA_CLEAR() are for manipulating PFA_* flags on a remote thread, we can >>> define similar ones for manipulating PF_* flags on current thread. Then, auditing dangerous users becomes easier. >> >> No, nobody is manipulating another task's GFP flags. > > Agreed. And nobody should be manipulating PF flags on remote tasks > either. > No. You are misunderstanding. The bug report above is an example of manipulating PF flags on remote tasks. You say "nobody should", but the reality is "there indeed was". There might be unnoticed others. The point of this proposal is to make it possible to "find such unnoticed users who are manipulating PF flags on remote tasks". ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-12 12:58 ` Tetsuo Handa @ 2021-02-12 13:12 ` Michal Hocko 2021-02-12 13:34 ` Tetsuo Handa 2021-02-12 15:43 ` Michal Hocko 1 sibling, 1 reply; 29+ messages in thread From: Michal Hocko @ 2021-02-12 13:12 UTC (permalink / raw) To: Tetsuo Handa Cc: Matthew Wilcox, Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Fri 12-02-21 21:58:15, Tetsuo Handa wrote: > On 2021/02/12 21:30, Michal Hocko wrote: > > On Fri 12-02-21 12:22:07, Matthew Wilcox wrote: > >> On Fri, Feb 12, 2021 at 08:18:11PM +0900, Tetsuo Handa wrote: > >>> On 2021/02/12 1:41, Michal Hocko wrote: > >>>> But I suspect we have drifted away from the original issue. I thought > >>>> that a simple check would help us narrow down this particular case and > >>>> somebody messing up from the IRQ context didn't sound like a completely > >>>> off. > >>>> > >>> > >>> From my experience at https://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp , > >>> I think we can replace direct PF_* manipulation with macros which do not receive "struct task_struct *" argument. > >>> Since TASK_PFA_TEST()/TASK_PFA_SET()/TASK_PFA_CLEAR() are for manipulating PFA_* flags on a remote thread, we can > >>> define similar ones for manipulating PF_* flags on current thread. Then, auditing dangerous users becomes easier. > >> > >> No, nobody is manipulating another task's GFP flags. > > > > Agreed. And nobody should be manipulating PF flags on remote tasks > > either. > > > > No. You are misunderstanding. The bug report above is an example of manipulating PF flags on remote tasks. Could you be more specific? I do not remember there was any theory that somebody is manipulating flags on a remote task. A very vague theory was that an interrupt context might be doing that on the _current_ context but even that is not based on any real evidence. It is a pure speculation. 
-- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-12 13:12 ` Michal Hocko @ 2021-02-12 13:34 ` Tetsuo Handa 0 siblings, 0 replies; 29+ messages in thread From: Tetsuo Handa @ 2021-02-12 13:34 UTC (permalink / raw) To: Michal Hocko Cc: Matthew Wilcox, Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On 2021/02/12 22:12, Michal Hocko wrote: > On Fri 12-02-21 21:58:15, Tetsuo Handa wrote: >> On 2021/02/12 21:30, Michal Hocko wrote: >>> On Fri 12-02-21 12:22:07, Matthew Wilcox wrote: >>>> On Fri, Feb 12, 2021 at 08:18:11PM +0900, Tetsuo Handa wrote: >>>>> On 2021/02/12 1:41, Michal Hocko wrote: >>>>>> But I suspect we have drifted away from the original issue. I thought >>>>>> that a simple check would help us narrow down this particular case and >>>>>> somebody messing up from the IRQ context didn't sound like a completely >>>>>> off. >>>>>> >>>>> >>>>> From my experience at https://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp , >>>>> I think we can replace direct PF_* manipulation with macros which do not receive "struct task_struct *" argument. >>>>> Since TASK_PFA_TEST()/TASK_PFA_SET()/TASK_PFA_CLEAR() are for manipulating PFA_* flags on a remote thread, we can >>>>> define similar ones for manipulating PF_* flags on current thread. Then, auditing dangerous users becomes easier. >>>> >>>> No, nobody is manipulating another task's GFP flags. >>> >>> Agreed. And nobody should be manipulating PF flags on remote tasks >>> either. >>> >> >> No. You are misunderstanding. The bug report above is an example of manipulating PF flags on remote tasks. > > Could you be more specific? I do not remember there was any theory that > somebody is manipulating flags on a remote task. A very vague theory was > that an interrupt context might be doing that on the _current_ context > but even that is not based on any real evidence. It is a pure > speculation. > Please read the link above. 
The report is an example of manipulating PF flags on a remote task. You are treating interrupt context as the only possible culprit, but you should also consider concurrent access as a possible culprit. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-12 12:58 ` Tetsuo Handa 2021-02-12 13:12 ` Michal Hocko @ 2021-02-12 15:43 ` Michal Hocko 2021-02-13 10:58 ` Dmitry Vyukov 1 sibling, 1 reply; 29+ messages in thread From: Michal Hocko @ 2021-02-12 15:43 UTC (permalink / raw) To: Tetsuo Handa Cc: Matthew Wilcox, Jan Kara, Dmitry Vyukov, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Fri 12-02-21 21:58:15, Tetsuo Handa wrote: > On 2021/02/12 21:30, Michal Hocko wrote: > > On Fri 12-02-21 12:22:07, Matthew Wilcox wrote: > >> On Fri, Feb 12, 2021 at 08:18:11PM +0900, Tetsuo Handa wrote: > >>> On 2021/02/12 1:41, Michal Hocko wrote: > >>>> But I suspect we have drifted away from the original issue. I thought > >>>> that a simple check would help us narrow down this particular case and > >>>> somebody messing up from the IRQ context didn't sound like a completely > >>>> off. > >>>> > >>> > >>> From my experience at https://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp , > >>> I think we can replace direct PF_* manipulation with macros which do not receive "struct task_struct *" argument. > >>> Since TASK_PFA_TEST()/TASK_PFA_SET()/TASK_PFA_CLEAR() are for manipulating PFA_* flags on a remote thread, we can > >>> define similar ones for manipulating PF_* flags on current thread. Then, auditing dangerous users becomes easier. > >> > >> No, nobody is manipulating another task's GFP flags. > > > > Agreed. And nobody should be manipulating PF flags on remote tasks > > either. > > > > No. You are misunderstanding. The bug report above is an example of > manipulating PF flags on remote tasks. The bug report you are referring to is ancient. And the cpuset code doesn't touch task->flags for a long time. I haven't checked exactly but it is years since regular and atomic flags have been separated unless I misremember. > You say "nobody should", but the reality is "there indeed was". 
There > might be unnoticed others. The point of this proposal is to make it > possible to "find such unnoticed users who are manipulating PF flags > on remote tasks". I am really confused what you are proposing here TBH and referring to an ancient bug doesn't really help. task->flags are _explicitly_ documented to be only used for _current_. Is it possible that somebody writes a buggy code? Sure, should we build a whole infrastructure around that to catch such a broken code? I am not really sure. One bug 6 years ago doesn't sound like a good reason for that. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 29+ messages in thread
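For reference, the helpers Tetsuo proposes would look roughly like the TASK_PFA_* macros but without a "struct task_struct *" argument, so that manipulating a remote task's flags simply cannot be expressed at the call site. This is a hypothetical user-space sketch of the idea, not an existing kernel interface:

```c
#define PF_MEMALLOC_NOFS 0x40000   /* value borrowed from the kernel for flavor */

static unsigned int current_flags; /* stands in for current->flags */

/* Hypothetical current-only PF_* helpers: no task argument, so a
 * remote-task manipulation has no way to be written with these. */
#define CURRENT_PF_SET(flag)	(current_flags |= (flag))
#define CURRENT_PF_CLEAR(flag)	(current_flags &= ~(flag))
#define CURRENT_PF_TEST(flag)	(!!(current_flags & (flag)))
```

With such helpers in place, any remaining direct `task->flags = ...` writes become easy to grep for and audit, which is the whole point of the proposal.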
* Re: possible deadlock in start_this_handle (2) 2021-02-12 15:43 ` Michal Hocko @ 2021-02-13 10:58 ` Dmitry Vyukov 0 siblings, 0 replies; 29+ messages in thread From: Dmitry Vyukov @ 2021-02-13 10:58 UTC (permalink / raw) To: Michal Hocko Cc: Tetsuo Handa, Matthew Wilcox, Jan Kara, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Fri, Feb 12, 2021 at 4:43 PM Michal Hocko <mhocko@suse.com> wrote: > > On Fri 12-02-21 21:58:15, Tetsuo Handa wrote: > > On 2021/02/12 21:30, Michal Hocko wrote: > > > On Fri 12-02-21 12:22:07, Matthew Wilcox wrote: > > >> On Fri, Feb 12, 2021 at 08:18:11PM +0900, Tetsuo Handa wrote: > > >>> On 2021/02/12 1:41, Michal Hocko wrote: > > >>>> But I suspect we have drifted away from the original issue. I thought > > >>>> that a simple check would help us narrow down this particular case and > > >>>> somebody messing up from the IRQ context didn't sound like a completely > > >>>> off. > > >>>> > > >>> > > >>> From my experience at https://lkml.kernel.org/r/201409192053.IHJ35462.JLOMOSOFFVtQFH@I-love.SAKURA.ne.jp , > > >>> I think we can replace direct PF_* manipulation with macros which do not receive "struct task_struct *" argument. > > >>> Since TASK_PFA_TEST()/TASK_PFA_SET()/TASK_PFA_CLEAR() are for manipulating PFA_* flags on a remote thread, we can > > >>> define similar ones for manipulating PF_* flags on current thread. Then, auditing dangerous users becomes easier. > > >> > > >> No, nobody is manipulating another task's GFP flags. > > > > > > Agreed. And nobody should be manipulating PF flags on remote tasks > > > either. > > > > > > > No. You are misunderstanding. The bug report above is an example of > > manipulating PF flags on remote tasks. > > The bug report you are referring to is ancient. And the cpuset code > doesn't touch task->flags for a long time. I haven't checked exactly but > it is years since regular and atomic flags have been separated unless I > misremember. 
> > > You say "nobody should", but the reality is "there indeed was". There > > might be unnoticed others. The point of this proposal is to make it > > possible to "find such unnoticed users who are manipulating PF flags > > on remote tasks". > > I am really confused what you are proposing here TBH and referring to an > ancient bug doesn't really help. task->flags are _explicitly_ documented > to be only used for _current_. Is it possible that somebody writes a > buggy code? Sure, should we build a whole infrastructure around that to > catch such a broken code? I am not really sure. One bug 6 years ago > doesn't sound like a good reason for that. Another similar one was just reported: https://syzkaller.appspot.com/bug?extid=1b2c6989ec12e467d65c WARNING: possible circular locking dependency detected 5.11.0-rc7-syzkaller #0 Not tainted ------------------------------------------------------ kswapd0/2232 is trying to acquire lock: ffff88801f552650 (sb_internal){.+.+}-{0:0}, at: evict+0x2ed/0x6b0 fs/inode.c:577 but task is already holding lock: ffffffff8be89240 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 might_alloc include/linux/sched/mm.h:193 [inline] slab_pre_alloc_hook mm/slab.h:493 [inline] slab_alloc_node mm/slab.c:3221 [inline] kmem_cache_alloc_node_trace+0x48/0x520 mm/slab.c:3596 __do_kmalloc_node mm/slab.c:3618 [inline] __kmalloc_node+0x38/0x60 mm/slab.c:3626 kmalloc_node include/linux/slab.h:575 [inline] kvmalloc_node+0x61/0xf0 mm/util.c:587 kvmalloc include/linux/mm.h:781 [inline] ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 vfs_setxattr+0x135/0x320 fs/xattr.c:291 setxattr+0x1ff/0x290 fs/xattr.c:553 path_setxattr+0x170/0x190 fs/xattr.c:572 __do_sys_setxattr fs/xattr.c:587 [inline] __se_sys_setxattr fs/xattr.c:583 [inline] __x64_sys_setxattr+0xc0/0x160 fs/xattr.c:583 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 12:57 ` Matthew Wilcox 2021-02-11 13:07 ` Michal Hocko @ 2021-02-11 13:18 ` Dmitry Vyukov 1 sibling, 0 replies; 29+ messages in thread From: Dmitry Vyukov @ 2021-02-11 13:18 UTC (permalink / raw) To: Matthew Wilcox Cc: Michal Hocko, Jan Kara, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Linux-MM On Thu, Feb 11, 2021 at 1:57 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > Hello, > > > > > > > > > > > > added mm guys to CC. > > > > > > > > > > > > On Wed 10-02-21 05:35:18, syzbot wrote: > > > > > > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > > > > > > git tree: upstream > > > > > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > > > > > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > > > > > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > > > > > userspace arch: i386 > > > > > > > > > > > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > > > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > > > > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > > > > > > > > > > > ====================================================== > > > > > > > WARNING: possible circular locking dependency detected > > > > > > > 5.11.0-rc6-syzkaller #0 Not tainted > > > > > > > ------------------------------------------------------ > > > > > > > kswapd0/2246 is trying to acquire lock: > > > > > > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > > > > > > > > > > > but task is already holding lock: > > > > > > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > > > > > > > > > > > which lock already depends on the new lock. 
> > > > > > > > > > > > > > the existing dependency chain (in reverse order) is: > > > > > > > > > > > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > > > > > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > > > > > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > > > > > > might_alloc include/linux/sched/mm.h:193 [inline] > > > > > > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > > > > > > slab_alloc_node mm/slub.c:2817 [inline] > > > > > > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > > > > > > kmalloc_node include/linux/slab.h:575 [inline] > > > > > > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > > > > > > kvmalloc include/linux/mm.h:781 [inline] > > > > > > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > > > > > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > > > > > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > > > > > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > > > > > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > > > > > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > > > > > > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > > > > > > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > > > > > > > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > > > > > > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > > > > > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > > > > > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > > > > > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > > > > > > __do_sys_setxattr fs/xattr.c:587 [inline] > > > > > > > __se_sys_setxattr fs/xattr.c:583 [inline] > > > > > > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > > > > > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > > > > > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > > > > > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > > > > > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > > > > > > > > > > > This stacktrace should never 
happen. ext4_xattr_set() starts a transaction. > > > > > > That internally goes through start_this_handle() which calls: > > > > > > > > > > > > handle->saved_alloc_context = memalloc_nofs_save(); > > > > > > > > > > > > and we restore the allocation context only in stop_this_handle() when > > > > > > stopping the handle. And with this fs_reclaim_acquire() should remove > > > > > > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > > > > > > > > > > > Now I have no idea why something here didn't work out. Given we don't have > > > > > > a reproducer it will be probably difficult to debug this. I'd note that > > > > > > about year and half ago similar report happened (got autoclosed) so it may > > > > > > be something real somewhere but it may also be just some HW glitch or > > > > > > something like that. > > > > > > > > > > HW glitch is theoretically possible. But if we are considering such > > > > > causes, I would say a kernel memory corruption is way more likely, we > > > > > have hundreds of known memory-corruption-capable bugs open. In most > > > > > cases they are caught by KASAN before doing silent damage. But KASAN > > > > > can miss some cases. > > > > > > > > > > I see at least 4 existing bugs with similar stack: > > > > > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > > > https://syzkaller.appspot.com/bug?extid=a7ab8df042baaf42ae3c > > > > > https://syzkaller.appspot.com/bug?id=c814a55a728493959328551c769ede4c8ff72aab > > > > > https://syzkaller.appspot.com/bug?id=426ad9adca053dafcd698f3a48ad5406dccc972b > > > > > > > > > > All in all, I would not assume it's a memory corruption. When we had > > > > > bugs that actually caused silent memory corruption, that caused a > > > > > spike of random one-time crashes all over the kernel. This does not > > > > > look like it. > > > > > > > > I wonder if memalloc_nofs_save (or any other manipulation of > > > > current->flags) could have been invoked from interrupt context? 
I > > > > think it could cause the failure mode we observe (extremely rare > > > > disappearing flags). It may be useful to add a check for task context > > > > there. > > > > > > That's an interesting idea. I'm not sure if anything does manipulate > > > current->flags from inside an interrupt (definitely memalloc_nofs_save() > > > doesn't seem to be) but I'd think that in fully preemtible kernel, > > > scheduler could preempt the task inside memalloc_nofs_save() and the > > > current->flags manipulation could also clash with a manipulation of these > > > flags by the scheduler if there's any? > > > > current->flags should be always manipulated from the user context. But > > who knows maybe there is a bug and some interrupt handler is calling it. > > This should be easy to catch no? > > Why would it matter if it were? We save the current value of the nofs > flag and then restore it. That would happen before the end of the > interrupt handler. So the interrupt isn't going to change the observed > value of the flag by the task which is interrupted. Good question. I just think that fixing some of these assumptions as runtime checks is useful, as it will allow us to reduce infinite space of possibilities. What is called from what context. Maybe checking that PF_MEMALLOC_NOFS is indeed set when we enter memalloc_nofs_restore(). ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 11:22 ` Dmitry Vyukov 2021-02-11 11:28 ` Dmitry Vyukov @ 2021-02-11 11:46 ` Jan Kara 1 sibling, 0 replies; 29+ messages in thread From: Jan Kara @ 2021-02-11 11:46 UTC (permalink / raw) To: Dmitry Vyukov Cc: Jan Kara, syzbot, Jan Kara, linux-ext4, LKML, syzkaller-bugs, Theodore Ts'o, Michal Hocko, Linux-MM On Thu 11-02-21 12:22:39, Dmitry Vyukov wrote: > On Thu, Feb 11, 2021 at 11:49 AM Jan Kara <jack@suse.cz> wrote: > > > > Hello, > > > > added mm guys to CC. > > > > On Wed 10-02-21 05:35:18, syzbot wrote: > > > HEAD commit: 1e0d27fc Merge branch 'akpm' (patches from Andrew) > > > git tree: upstream > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15cbce90d00000 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=bd1f72220b2e57eb > > > dashboard link: https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > > > userspace arch: i386 > > > > > > Unfortunately, I don't have any reproducer for this issue yet. > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > > Reported-by: syzbot+bfdded10ab7dcd7507ae@syzkaller.appspotmail.com > > > > > > ====================================================== > > > WARNING: possible circular locking dependency detected > > > 5.11.0-rc6-syzkaller #0 Not tainted > > > ------------------------------------------------------ > > > kswapd0/2246 is trying to acquire lock: > > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > > > but task is already holding lock: > > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > > > which lock already depends on the new lock. 
> > > > > > the existing dependency chain (in reverse order) is: > > > > > > -> #2 (fs_reclaim){+.+.}-{0:0}: > > > __fs_reclaim_acquire mm/page_alloc.c:4326 [inline] > > > fs_reclaim_acquire+0x117/0x150 mm/page_alloc.c:4340 > > > might_alloc include/linux/sched/mm.h:193 [inline] > > > slab_pre_alloc_hook mm/slab.h:493 [inline] > > > slab_alloc_node mm/slub.c:2817 [inline] > > > __kmalloc_node+0x5f/0x430 mm/slub.c:4015 > > > kmalloc_node include/linux/slab.h:575 [inline] > > > kvmalloc_node+0x61/0xf0 mm/util.c:587 > > > kvmalloc include/linux/mm.h:781 [inline] > > > ext4_xattr_inode_cache_find fs/ext4/xattr.c:1465 [inline] > > > ext4_xattr_inode_lookup_create fs/ext4/xattr.c:1508 [inline] > > > ext4_xattr_set_entry+0x1ce6/0x3780 fs/ext4/xattr.c:1649 > > > ext4_xattr_ibody_set+0x78/0x2b0 fs/ext4/xattr.c:2224 > > > ext4_xattr_set_handle+0x8f4/0x13e0 fs/ext4/xattr.c:2380 > > > ext4_xattr_set+0x13a/0x340 fs/ext4/xattr.c:2493 > > > ext4_xattr_user_set+0xbc/0x100 fs/ext4/xattr_user.c:40 > > > __vfs_setxattr+0x10e/0x170 fs/xattr.c:177 > > > __vfs_setxattr_noperm+0x11a/0x4c0 fs/xattr.c:208 > > > __vfs_setxattr_locked+0x1bf/0x250 fs/xattr.c:266 > > > vfs_setxattr+0x135/0x320 fs/xattr.c:291 > > > setxattr+0x1ff/0x290 fs/xattr.c:553 > > > path_setxattr+0x170/0x190 fs/xattr.c:572 > > > __do_sys_setxattr fs/xattr.c:587 [inline] > > > __se_sys_setxattr fs/xattr.c:583 [inline] > > > __ia32_sys_setxattr+0xbc/0x150 fs/xattr.c:583 > > > do_syscall_32_irqs_on arch/x86/entry/common.c:77 [inline] > > > __do_fast_syscall_32+0x56/0x80 arch/x86/entry/common.c:139 > > > do_fast_syscall_32+0x2f/0x70 arch/x86/entry/common.c:164 > > > entry_SYSENTER_compat_after_hwframe+0x4d/0x5c > > > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > > That internally goes through start_this_handle() which calls: > > > > handle->saved_alloc_context = memalloc_nofs_save(); > > > > and we restore the allocation context only in stop_this_handle() when > > stopping the handle. 
And with this fs_reclaim_acquire() should remove > > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > > > Now I have no idea why something here didn't work out. Given we don't have > > a reproducer it will be probably difficult to debug this. I'd note that > > about year and half ago similar report happened (got autoclosed) so it may > > be something real somewhere but it may also be just some HW glitch or > > something like that. > > HW glitch is theoretically possible. But if we are considering such > causes, I would say a kernel memory corruption is way more likely, we > have hundreds of known memory-corruption-capable bugs open. In most > cases they are caught by KASAN before doing silent damage. But KASAN > can miss some cases. > > I see at least 4 existing bugs with similar stack: > https://syzkaller.appspot.com/bug?extid=bfdded10ab7dcd7507ae > https://syzkaller.appspot.com/bug?extid=a7ab8df042baaf42ae3c > https://syzkaller.appspot.com/bug?id=c814a55a728493959328551c769ede4c8ff72aab > https://syzkaller.appspot.com/bug?id=426ad9adca053dafcd698f3a48ad5406dccc972b The last one looks different and likely unrelated (I don't see scoping API to be used anywhere in that subsystem) but the others look indeed valid. So I agree it seems to be some very hard to hit problem and likely not just a random corruption. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-11 10:49 ` possible deadlock in start_this_handle (2) Jan Kara 2021-02-11 10:55 ` Michal Hocko 2021-02-11 11:22 ` Dmitry Vyukov @ 2021-02-13 14:26 ` Tetsuo Handa 2021-02-15 12:45 ` Jan Kara 2 siblings, 1 reply; 29+ messages in thread From: Tetsuo Handa @ 2021-02-13 14:26 UTC (permalink / raw) To: Jan Kara Cc: jack, linux-ext4, linux-kernel, syzkaller-bugs, tytso, mhocko, linux-mm, syzbot On 2021/02/11 19:49, Jan Kara wrote: > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > That internally goes through start_this_handle() which calls: > > handle->saved_alloc_context = memalloc_nofs_save(); > > and we restore the allocation context only in stop_this_handle() when > stopping the handle. And with this fs_reclaim_acquire() should remove > __GFP_FS from the mask and not call __fs_reclaim_acquire(). Excuse me, but it seems to me that nothing prevents ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create() without memalloc_nofs_save() when hitting ext4_get_nojournal() path. Will you explain when ext4_get_nojournal() path is executed? ext4_xattr_set() { handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits) == __ext4_journal_start() { return __ext4_journal_start_sb() { journal = EXT4_SB(sb)->s_journal; if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)) return ext4_get_nojournal(); // Never calls memalloc_nofs_save() despite returning !IS_ERR() value. return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds, GFP_NOFS, type, line); // Calls memalloc_nofs_save() when start_this_handle() returns 0. 
} } } error = ext4_xattr_set_handle(handle, inode, name_index, name, value, value_len, flags); { ext4_write_lock_xattr(inode, &no_expand); // Grabs &ei->xattr_sem error = ext4_xattr_ibody_set(handle, inode, &i, &is) { error = ext4_xattr_set_entry(i, s, handle, inode, false /* is_block */) { ret = ext4_xattr_inode_lookup_create(handle, inode, i->value, i->value_len, &new_ea_inode); // Using GFP_KERNEL based on assumption that ext4_journal_start() called memalloc_nofs_save(). } } } } ^ permalink raw reply [flat|nested] 29+ messages in thread
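The asymmetry Tetsuo traces above reduces to a tiny model: the journaled path enters the NOFS allocation scope in start_this_handle(), while ext4_get_nojournal() only bumps a per-task refcount and never touches the flags. A user-space sketch under those assumptions (all names are illustrative, not the real ext4/jbd2 functions):

```c
#define PF_MEMALLOC_NOFS 0x40000    /* value borrowed from the kernel for flavor */

static unsigned int current_flags;  /* stands in for current->flags */
static int has_journal;             /* models EXT4_SB(sb)->s_journal != NULL */
static unsigned long nojournal_ref; /* models the fake-handle refcount in current->journal_info */

/* Model of __ext4_journal_start_sb(): the nojournal path hands back a
 * fake handle without ever entering the NOFS scope, while the journaled
 * path reaches start_this_handle(), which calls memalloc_nofs_save(). */
static void journal_start_model(void)
{
	if (!has_journal) {
		nojournal_ref++;	/* ext4_get_nojournal(): no nofs save */
		return;
	}
	current_flags |= PF_MEMALLOC_NOFS; /* start_this_handle(): nofs save */
}
```

This is why the GFP_KERNEL allocation in ext4_xattr_inode_lookup_create() is only implicitly NOFS-scoped on the journaled path — the nojournal path relies on there being no jbd2 transaction to deadlock against in the first place.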
* Re: possible deadlock in start_this_handle (2) 2021-02-13 14:26 ` Tetsuo Handa @ 2021-02-15 12:45 ` Jan Kara 2021-02-15 14:06 ` Tetsuo Handa 0 siblings, 1 reply; 29+ messages in thread From: Jan Kara @ 2021-02-15 12:45 UTC (permalink / raw) To: Tetsuo Handa Cc: Jan Kara, jack, linux-ext4, linux-kernel, syzkaller-bugs, tytso, mhocko, linux-mm, syzbot On Sat 13-02-21 23:26:37, Tetsuo Handa wrote: > On 2021/02/11 19:49, Jan Kara wrote: > > This stacktrace should never happen. ext4_xattr_set() starts a transaction. > > That internally goes through start_this_handle() which calls: > > > > handle->saved_alloc_context = memalloc_nofs_save(); > > > > and we restore the allocation context only in stop_this_handle() when > > stopping the handle. And with this fs_reclaim_acquire() should remove > > __GFP_FS from the mask and not call __fs_reclaim_acquire(). > > Excuse me, but it seems to me that nothing prevents > ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create() > without memalloc_nofs_save() when hitting ext4_get_nojournal() path. > Will you explain when ext4_get_nojournal() path is executed? That's a good question but sadly I don't think that's it. ext4_get_nojournal() is called when the filesystem is created without a journal. In that case we also don't acquire jbd2_handle lockdep map. In the syzbot report we can see: kswapd0/2246 is trying to acquire lock: ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 but task is already holding lock: ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 So this filesystem has very clearly been created with a journal. Also the journal lockdep tracking machinery uses: rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_); so a lockdep key is per-filesystem. Thus it is not possible that lockdep would combine lock dependencies from two different filesystems. 
But I guess we could narrow the search for this problem by adding WARN_ONs to ext4_xattr_set_handle() and ext4_xattr_inode_lookup_create() like: WARN_ON(ext4_handle_valid(handle) && !(current->flags & PF_MEMALLOC_NOFS)); It would narrow down a place in which PF_MEMALLOC_NOFS flag isn't set properly... At least that seems like the most plausible way forward to me. Honza > > ext4_xattr_set() { > handle = ext4_journal_start(inode, EXT4_HT_XATTR, credits) == __ext4_journal_start() { > return __ext4_journal_start_sb() { > journal = EXT4_SB(sb)->s_journal; > if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)) > return ext4_get_nojournal(); // Never calls memalloc_nofs_save() despite returning !IS_ERR() value. > return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds, GFP_NOFS, type, line); // Calls memalloc_nofs_save() when start_this_handle() returns 0. > } > } > } > error = ext4_xattr_set_handle(handle, inode, name_index, name, value, value_len, flags); { > ext4_write_lock_xattr(inode, &no_expand); // Grabs &ei->xattr_sem > error = ext4_xattr_ibody_set(handle, inode, &i, &is) { > error = ext4_xattr_set_entry(i, s, handle, inode, false /* is_block */) { > ret = ext4_xattr_inode_lookup_create(handle, inode, i->value, i->value_len, &new_ea_inode); // Using GFP_KERNEL based on assumption that ext4_journal_start() called memalloc_nofs_save(). > } > } > } > } > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-15 12:45 ` Jan Kara @ 2021-02-15 14:06 ` Tetsuo Handa 2021-02-15 14:29 ` Jan Kara 0 siblings, 1 reply; 29+ messages in thread From: Tetsuo Handa @ 2021-02-15 14:06 UTC (permalink / raw) To: Jan Kara Cc: jack, linux-ext4, linux-kernel, syzkaller-bugs, tytso, mhocko, linux-mm, syzbot On 2021/02/15 21:45, Jan Kara wrote: > On Sat 13-02-21 23:26:37, Tetsuo Handa wrote: >> Excuse me, but it seems to me that nothing prevents >> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create() >> without memalloc_nofs_save() when hitting ext4_get_nojournal() path. >> Will you explain when ext4_get_nojournal() path is executed? > > That's a good question but sadly I don't think that's it. > ext4_get_nojournal() is called when the filesystem is created without a > journal. In that case we also don't acquire jbd2_handle lockdep map. In the > syzbot report we can see: Since syzbot can test filesystem images, syzbot might have tested a filesystem image created both with and without journal within this boot. > > kswapd0/2246 is trying to acquire lock: > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > but task is already holding lock: > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > So this filesystem has very clearly been created with a journal. Also the > journal lockdep tracking machinery uses: While locks held by kswapd0/2246 are fs_reclaim, shrinker_rwsem, &type->s_umount_key#38 and jbd2_handle, isn't the dependency lockdep considers problematic is Chain exists of: jbd2_handle --> &ei->xattr_sem --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(&ei->xattr_sem); lock(fs_reclaim); lock(jbd2_handle); where CPU0 is kswapd/2246 and CPU1 is the case of ext4_get_nojournal() path? 
If someone has taken jbd2_handle and &ei->xattr_sem in this order, isn't this dependency true? > > rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_); > > so a lockdep key is per-filesystem. Thus it is not possible that lockdep > would combine lock dependencies from two different filesystems. > > But I guess we could narrow the search for this problem by adding WARN_ONs > to ext4_xattr_set_handle() and ext4_xattr_inode_lookup_create() like: > > WARN_ON(ext4_handle_valid(handle) && !(current->flags & PF_MEMALLOC_NOFS)); > > It would narrow down a place in which PF_MEMALLOC_NOFS flag isn't set > properly... At least that seems like the most plausible way forward to me. You can use CONFIG_DEBUG_AID_FOR_SYZBOT for adding such WARN_ONs on linux-next. ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-15 14:06 ` Tetsuo Handa @ 2021-02-15 14:29 ` Jan Kara 2021-02-19 10:15 ` Tetsuo Handa 2021-03-20 10:02 ` Tetsuo Handa 0 siblings, 2 replies; 29+ messages in thread From: Jan Kara @ 2021-02-15 14:29 UTC (permalink / raw) To: Tetsuo Handa Cc: Jan Kara, jack, linux-ext4, linux-kernel, syzkaller-bugs, tytso, mhocko, linux-mm, syzbot On Mon 15-02-21 23:06:15, Tetsuo Handa wrote: > On 2021/02/15 21:45, Jan Kara wrote: > > On Sat 13-02-21 23:26:37, Tetsuo Handa wrote: > >> Excuse me, but it seems to me that nothing prevents > >> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create() > >> without memalloc_nofs_save() when hitting ext4_get_nojournal() path. > >> Will you explain when ext4_get_nojournal() path is executed? > > > > That's a good question but sadly I don't think that's it. > > ext4_get_nojournal() is called when the filesystem is created without a > > journal. In that case we also don't acquire jbd2_handle lockdep map. In the > > syzbot report we can see: > > Since syzbot can test filesystem images, syzbot might have tested a filesystem > image created both with and without journal within this boot. a) I think that syzbot reboots the VM between executing different tests to get reproducible conditions. But in theory I agree the test may have contained one image with and one image without a journal. *but* b) as I wrote in the email you are replying to, the jbd2_handle key is private per filesystem. Thus for lockdep to complain about jbd2_handle->fs_reclaim->jbd2_handle deadlock, those jbd2_handle lockdep maps must come from the same filesystem. *and* c) filesystem without journal doesn't use jbd2_handle lockdep map at all so for such filesystems lockdep creates no dependency for jbd2_handle map. 
Honza > > > > > kswapd0/2246 is trying to acquire lock: > > ffff888041a988e0 (jbd2_handle){++++}-{0:0}, at: start_this_handle+0xf81/0x1380 fs/jbd2/transaction.c:444 > > > > but task is already holding lock: > > ffffffff8be892c0 (fs_reclaim){+.+.}-{0:0}, at: __fs_reclaim_acquire+0x0/0x30 mm/page_alloc.c:5195 > > > > So this filesystem has very clearly been created with a journal. Also the > > journal lockdep tracking machinery uses: > > While locks held by kswapd0/2246 are fs_reclaim, shrinker_rwsem, &type->s_umount_key#38 > and jbd2_handle, isn't the dependency lockdep considers problematic is > > Chain exists of: > jbd2_handle --> &ei->xattr_sem --> fs_reclaim > > Possible unsafe locking scenario: > > CPU0 CPU1 > ---- ---- > lock(fs_reclaim); > lock(&ei->xattr_sem); > lock(fs_reclaim); > lock(jbd2_handle); > > where CPU0 is kswapd/2246 and CPU1 is the case of ext4_get_nojournal() path? > If someone has taken jbd2_handle and &ei->xattr_sem in this order, isn't this > dependency true? > > > > > rwsem_acquire_read(&journal->j_trans_commit_map, 0, 0, _THIS_IP_); > > > > so a lockdep key is per-filesystem. Thus it is not possible that lockdep > > would combine lock dependencies from two different filesystems. > > > > But I guess we could narrow the search for this problem by adding WARN_ONs > > to ext4_xattr_set_handle() and ext4_xattr_inode_lookup_create() like: > > > > WARN_ON(ext4_handle_valid(handle) && !(current->flags & PF_MEMALLOC_NOFS)); > > > > It would narrow down a place in which PF_MEMALLOC_NOFS flag isn't set > > properly... At least that seems like the most plausible way forward to me. > > You can use CONFIG_DEBUG_AID_FOR_SYZBOT for adding such WARN_ONs on linux-next. > -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-15 14:29 ` Jan Kara @ 2021-02-19 10:15 ` Tetsuo Handa 2021-02-19 17:22 ` harshad shirwadkar 2021-03-20 10:02 ` Tetsuo Handa 1 sibling, 1 reply; 29+ messages in thread From: Tetsuo Handa @ 2021-02-19 10:15 UTC (permalink / raw) To: Jan Kara Cc: jack, linux-ext4, linux-kernel, syzkaller-bugs, tytso, mhocko, linux-mm, syzbot On 2021/02/15 23:29, Jan Kara wrote: > On Mon 15-02-21 23:06:15, Tetsuo Handa wrote: >> On 2021/02/15 21:45, Jan Kara wrote: >>> On Sat 13-02-21 23:26:37, Tetsuo Handa wrote: >>>> Excuse me, but it seems to me that nothing prevents >>>> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create() >>>> without memalloc_nofs_save() when hitting ext4_get_nojournal() path. >>>> Will you explain when ext4_get_nojournal() path is executed? >>> >>> That's a good question but sadly I don't think that's it. >>> ext4_get_nojournal() is called when the filesystem is created without a >>> journal. In that case we also don't acquire jbd2_handle lockdep map. In the >>> syzbot report we can see: >> >> Since syzbot can test filesystem images, syzbot might have tested a filesystem >> image created both with and without journal within this boot. > > a) I think that syzbot reboots the VM between executing different tests to > get reproducible conditions. But in theory I agree the test may have > contained one image with and one image without a journal. syzkaller reboots the VM upon a crash. syzkaller can test multiple filesystem images within one boot. https://storage.googleapis.com/syzkaller/cover/ci-qemu-upstream-386.html (this statistic snapshot is volatile) reports that ext4_get_nojournal() is partially covered ( https://github.com/google/syzkaller/blob/master/docs/coverage.md ) by syzkaller. 
/* Just increment the non-pointer handle value */ static handle_t *ext4_get_nojournal(void) { 86 handle_t *handle = current->journal_info; unsigned long ref_cnt = (unsigned long)handle; BUG_ON(ref_cnt >= EXT4_NOJOURNAL_MAX_REF_COUNT); 86 ref_cnt++; handle = (handle_t *)ref_cnt; current->journal_info = handle; 2006 return handle; } /* Decrement the non-pointer handle value */ static void ext4_put_nojournal(handle_t *handle) { unsigned long ref_cnt = (unsigned long)handle; 85 BUG_ON(ref_cnt == 0); 85 ref_cnt--; handle = (handle_t *)ref_cnt; current->journal_info = handle; } handle_t *__ext4_journal_start_sb(struct super_block *sb, unsigned int line, int type, int blocks, int rsv_blocks, int revoke_creds) { journal_t *journal; int err; 2006 trace_ext4_journal_start(sb, blocks, rsv_blocks, revoke_creds, 2006 _RET_IP_); 2006 err = ext4_journal_check_start(sb); if (err < 0) return ERR_PTR(err); 2005 journal = EXT4_SB(sb)->s_journal; 1969 if (!journal || (EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)) 2006 return ext4_get_nojournal(); 1969 return jbd2__journal_start(journal, blocks, rsv_blocks, revoke_creds, GFP_NOFS, type, line); } > > *but* > > b) as I wrote in the email you are replying to, the jbd2_handle key is > private per filesystem. Thus for lockdep to complain about > jbd2_handle->fs_reclaim->jbd2_handle deadlock, those jbd2_handle lockdep > maps must come from the same filesystem. > > *and* > > c) filesystem without journal doesn't use jbd2_handle lockdep map at all so > for such filesystems lockdep creates no dependency for jbd2_handle map. > What about "EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)" case? Does this case happen on filesystem with journal, and can this case be executed by fuzzing a crafted (a sort of erroneous) filesystem with journal, and are the jbd2_handle for calling ext4_get_nojournal() case and the jbd2_handle for calling jbd2__journal_start() case the same? 
Also, I worry that jbd2__journal_restart() is problematic: it calls stop_this_handle() (which calls memalloc_nofs_restore()) and then calls start_this_handle() (which does not call memalloc_nofs_save() if an error occurs). Perhaps an error from start_this_handle() during a journal restart needs special handling (e.g. remount read-only or panic)? ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: possible deadlock in start_this_handle (2) 2021-02-19 10:15 ` Tetsuo Handa @ 2021-02-19 17:22 ` harshad shirwadkar 0 siblings, 0 replies; 29+ messages in thread From: harshad shirwadkar @ 2021-02-19 17:22 UTC (permalink / raw) To: Tetsuo Handa Cc: Jan Kara, jack, Ext4 Developers List, linux-kernel, syzkaller-bugs, Theodore Y. Ts'o, mhocko, linux-mm, syzbot On Fri, Feb 19, 2021 at 2:20 AM Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> wrote: > > On 2021/02/15 23:29, Jan Kara wrote: > > On Mon 15-02-21 23:06:15, Tetsuo Handa wrote: > >> On 2021/02/15 21:45, Jan Kara wrote: > >>> On Sat 13-02-21 23:26:37, Tetsuo Handa wrote: > >>>> Excuse me, but it seems to me that nothing prevents > >>>> ext4_xattr_set_handle() from reaching ext4_xattr_inode_lookup_create() > >>>> without memalloc_nofs_save() when hitting ext4_get_nojournal() path. > >>>> Will you explain when ext4_get_nojournal() path is executed? > >>> > >>> That's a good question but sadly I don't think that's it. > >>> ext4_get_nojournal() is called when the filesystem is created without a > >>> journal. In that case we also don't acquire jbd2_handle lockdep map. In the > >>> syzbot report we can see: > >> > >> Since syzbot can test filesystem images, syzbot might have tested a filesystem > >> image created both with and without journal within this boot. > > > > a) I think that syzbot reboots the VM between executing different tests to > > get reproducible conditions. But in theory I agree the test may have > > contained one image with and one image without a journal. > > syzkaller reboots the VM upon a crash. > syzkaller can test multiple filesystem images within one boot. > > https://storage.googleapis.com/syzkaller/cover/ci-qemu-upstream-386.html (this > statistic snapshot is volatile) reports that ext4_get_nojournal() is partially covered > ( https://github.com/google/syzkaller/blob/master/docs/coverage.md ) by syzkaller. 
> [...]
>
> > *but*
> >
> > b) as I wrote in the email you are replying to, the jbd2_handle key is
> > private per filesystem. Thus for lockdep to complain about
> > jbd2_handle->fs_reclaim->jbd2_handle deadlock, those jbd2_handle lockdep
> > maps must come from the same filesystem.
> >
> > *and*
> >
> > c) filesystem without journal doesn't use jbd2_handle lockdep map at all so
> > for such filesystems lockdep creates no dependency for jbd2_handle map.
>
> What about the "(EXT4_SB(sb)->s_mount_state & EXT4_FC_REPLAY)" case?
> Does this case happen on filesystem with journal, and can this case be executed
> by fuzzing a crafted (a sort of erroneous) filesystem with journal, and are
> the jbd2_handle for calling ext4_get_nojournal() case and the jbd2_handle for
> calling jbd2__journal_start() case the same?

EXT4_FC_REPLAY is a mount state that is only set during jbd2 journal recovery. The only way for journal recovery to set EXT4_FC_REPLAY is if, after a crash, special fast-commit blocks are present in the journal. For those fast-commit blocks to be present, the ext4 file system must have had the "fast_commit" feature enabled before the crash. If we had a way to look at the ext4 partition syzbot used when reporting this bug, it would be very easy to check whether EXT4_FC_REPLAY could ever be set: running "debugfs <image>" and then the logdump command inside debugfs will show whether any fast-commit blocks are present in the journal.

Having said that, I have the following reason to believe this flag wasn't set during the syzbot failure: EXT4_FC_REPLAY is only set during journal recovery and is cleared immediately afterwards, which means it can only be set during mount, and as soon as mount returns the flag is cleared. The stack trace shows no evidence that we are in the journal recovery phase; most of the traces result from system calls made by the user. I also checked whether we accidentally leave this flag set after journal recovery, but that doesn't seem to be the case: on a successfully mounted file system we unconditionally clear it.

- Harshad
* Re: possible deadlock in start_this_handle (2) 2021-02-15 14:29 ` Jan Kara 2021-02-19 10:15 ` Tetsuo Handa @ 2021-03-20 10:02 ` Tetsuo Handa 1 sibling, 0 replies; 29+ messages in thread From: Tetsuo Handa @ 2021-03-20 10:02 UTC (permalink / raw) To: Jan Kara, Peter Zijlstra Cc: jack, linux-ext4, linux-kernel, syzkaller-bugs, tytso, mhocko, linux-mm, syzbot, harshad shirwadkar

Peter, would you answer the 6 questions listed below?

Below is a reproducer kernel module (prefixed with "my_" for distinction) for
https://syzkaller.appspot.com/bug?id=38c060d5757cbc13fdffd46e80557c645fbe79ba .

---------- test.c ----------
#include <linux/module.h>
#include <linux/jbd2.h>
#include <../fs/ext4/ext4.h>

static struct lockdep_map my__fs_reclaim_map =
	STATIC_LOCKDEP_MAP_INIT("my_fs_reclaim", &my__fs_reclaim_map);

static noinline void my_journal_init_common(journal_t *journal)
{
	static struct lock_class_key my_jbd2_trans_commit_key;

	lockdep_init_map(&journal->j_trans_commit_map, "my_jbd2_handle",
			 &my_jbd2_trans_commit_key, 0);
}

static noinline void my_init_once(void *foo)
{
	struct ext4_inode_info *my_ei = (struct ext4_inode_info *) foo;

	init_rwsem(&my_ei->xattr_sem);
}

static int __init lockdep_test_init(void)
{
	journal_t *journal1 = kzalloc(sizeof(*journal1), GFP_KERNEL | __GFP_NOFAIL);
	struct ext4_inode_info *ei1 = kzalloc(sizeof(*ei1), GFP_KERNEL | __GFP_NOFAIL);
	struct ext4_inode_info *ei2 = kzalloc(sizeof(*ei2), GFP_KERNEL | __GFP_NOFAIL);

	my_journal_init_common(journal1);
	my_init_once(ei1);
	my_init_once(ei2);

	rwsem_acquire_read(&journal1->j_trans_commit_map, 0, 0, _THIS_IP_);
	down_write(&ei1->xattr_sem);
	up_write(&ei1->xattr_sem);
	rwsem_release(&journal1->j_trans_commit_map, _THIS_IP_);

	down_write(&ei2->xattr_sem);
	lock_map_acquire(&my__fs_reclaim_map);
	lock_map_release(&my__fs_reclaim_map);
	up_write(&ei2->xattr_sem);

	lock_map_acquire(&my__fs_reclaim_map);
	rwsem_acquire_read(&journal1->j_trans_commit_map, 0, 0, _THIS_IP_);
	rwsem_release(&journal1->j_trans_commit_map, _THIS_IP_);
	lock_map_release(&my__fs_reclaim_map);
	return -EINVAL;
}

module_init(lockdep_test_init);
MODULE_LICENSE("GPL");
---------- test.c ----------

---------- dmesg ----------
[   32.938906][ T2776] test: loading out-of-tree module taints kernel.
[   32.940200][ T2776]
[   32.940240][ T2776] ======================================================
[   32.940306][ T2776] WARNING: possible circular locking dependency detected
[   32.940373][ T2776] 5.12.0-rc3+ #846 Tainted: G O
[   32.940434][ T2776] ------------------------------------------------------
[   32.940500][ T2776] insmod/2776 is trying to acquire lock:
[   32.940557][ T2776] ffff9f1d438d98e0 (my_jbd2_handle){.+.+}-{0:0}, at: lockdep_test_init+0x0/0x1000 [test]
[   32.940651][ T2776]
[   32.940651][ T2776] but task is already holding lock:
[   32.940719][ T2776] ffffffffc0631020 (my_fs_reclaim){+.+.}-{0:0}, at: lockdep_test_init+0x0/0x1000 [test]
[   32.940808][ T2776]
[   32.940808][ T2776] which lock already depends on the new lock.
[   32.940808][ T2776]
[   32.940897][ T2776]
[   32.940897][ T2776] the existing dependency chain (in reverse order) is:
[   32.940976][ T2776]
[   32.940976][ T2776] -> #2 (my_fs_reclaim){+.+.}-{0:0}:
[   32.941053][ T2776]        lock_acquire+0xb3/0x380
[   32.941112][ T2776]        lockdep_test_init+0xe2/0x1000 [test]
[   32.941176][ T2776]        do_one_initcall+0x58/0x2c0
[   32.941234][ T2776]        do_init_module+0x5b/0x210
[   32.941291][ T2776]        load_module+0x1884/0x19a0
[   32.941347][ T2776]        __do_sys_finit_module+0xc1/0x120
[   32.941408][ T2776]        __x64_sys_finit_module+0x15/0x20
[   32.941469][ T2776]        do_syscall_64+0x31/0x40
[   32.941527][ T2776]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[   32.941594][ T2776]
[   32.941594][ T2776] -> #1 (&my_ei->xattr_sem){+.+.}-{3:3}:
[   32.941673][ T2776]        lock_acquire+0xb3/0x380
[   32.941728][ T2776]        down_write+0x52/0x620
[   32.941783][ T2776]        lockdep_test_init+0xa2/0x1000 [test]
[   32.941846][ T2776]        do_one_initcall+0x58/0x2c0
[   32.941904][ T2776]        do_init_module+0x5b/0x210
[   32.941960][ T2776]        load_module+0x1884/0x19a0
[   32.942016][ T2776]        __do_sys_finit_module+0xc1/0x120
[   32.942077][ T2776]        __x64_sys_finit_module+0x15/0x20
[   32.942137][ T2776]        do_syscall_64+0x31/0x40
[   32.942193][ T2776]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[   32.942260][ T2776]
[   32.942260][ T2776] -> #0 (my_jbd2_handle){.+.+}-{0:0}:
[   32.942336][ T2776]        check_prevs_add+0x16a/0x1040
[   32.942395][ T2776]        __lock_acquire+0x118b/0x1580
[   32.942453][ T2776]        lock_acquire+0xb3/0x380
[   32.942509][ T2776]        lockdep_test_init+0x140/0x1000 [test]
[   32.942574][ T2776]        do_one_initcall+0x58/0x2c0
[   32.942631][ T2776]        do_init_module+0x5b/0x210
[   32.942687][ T2776]        load_module+0x1884/0x19a0
[   32.942743][ T2776]        __do_sys_finit_module+0xc1/0x120
[   32.942803][ T2776]        __x64_sys_finit_module+0x15/0x20
[   32.942915][ T2776]        do_syscall_64+0x31/0x40
[   32.942973][ T2776]        entry_SYSCALL_64_after_hwframe+0x44/0xae
[   32.943041][ T2776]
[   32.943041][ T2776] other info that might help us debug this:
[   32.943041][ T2776]
[   32.943138][ T2776] Chain exists of:
[   32.943138][ T2776]   my_jbd2_handle --> &my_ei->xattr_sem --> my_fs_reclaim
[   32.943138][ T2776]
[   32.943258][ T2776]  Possible unsafe locking scenario:
[   32.943258][ T2776]
[   32.943333][ T2776]        CPU0                    CPU1
[   32.943392][ T2776]        ----                    ----
[   32.943449][ T2776]   lock(my_fs_reclaim);
[   32.943500][ T2776]                                lock(&my_ei->xattr_sem);
[   32.943572][ T2776]                                lock(my_fs_reclaim);
[   32.943641][ T2776]   lock(my_jbd2_handle);
[   32.943693][ T2776]
[   32.943693][ T2776]  *** DEADLOCK ***
[   32.943693][ T2776]
[   32.943775][ T2776] 1 lock held by insmod/2776:
[   32.943829][ T2776]  #0: ffffffffc0631020 (my_fs_reclaim){+.+.}-{0:0}, at: lockdep_test_init+0x0/0x1000 [test]
[   32.943931][ T2776]
[   32.943931][ T2776] stack backtrace:
[   32.943996][ T2776] CPU: 0 PID: 2776 Comm: insmod Tainted: G O 5.12.0-rc3+ #846
[   32.944084][ T2776] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 02/27/2020
[   32.944246][ T2776] Call Trace:
[   32.944292][ T2776]  dump_stack+0x76/0x95
[   32.944376][ T2776]  print_circular_bug.isra.49.cold.70+0x13c/0x141
[   32.944446][ T2776]  check_noncircular+0xfe/0x110
[   32.944504][ T2776]  check_prevs_add+0x16a/0x1040
[   32.944561][ T2776]  __lock_acquire+0x118b/0x1580
[   32.944658][ T2776]  ? lock_is_held_type+0x97/0x100
[   32.944717][ T2776]  lock_acquire+0xb3/0x380
[   32.944771][ T2776]  ? 0xffffffffc0273000
[   32.944822][ T2776]  ? lock_release+0xb8/0x270
[   32.944877][ T2776]  lockdep_test_init+0x140/0x1000 [test]
[   32.944939][ T2776]  ? 0xffffffffc0273000
[   32.944989][ T2776]  ? 0xffffffffc0273000
[   32.945039][ T2776]  do_one_initcall+0x58/0x2c0
[   32.945094][ T2776]  ? kmem_cache_alloc_trace+0x745/0x820
[   32.945157][ T2776]  ? do_init_module+0x22/0x210
[   32.945212][ T2776]  do_init_module+0x5b/0x210
[   32.945267][ T2776]  load_module+0x1884/0x19a0
[   32.945324][ T2776]  __do_sys_finit_module+0xc1/0x120
[   32.945382][ T2776]  ? __do_sys_finit_module+0xc1/0x120
[   32.945444][ T2776]  __x64_sys_finit_module+0x15/0x20
[   32.945502][ T2776]  do_syscall_64+0x31/0x40
[   32.945555][ T2776]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   32.945619][ T2776] RIP: 0033:0x7fa7bb256d19
[   32.945672][ T2776] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 e1 2c 00 f7 d8 64 89 01 48
[   32.945842][ T2776] RSP: 002b:00007ffd756468e8 EFLAGS: 00000206 ORIG_RAX: 0000000000000139
[   32.945927][ T2776] RAX: ffffffffffffffda RBX: 0000000000c2f250 RCX: 00007fa7bb256d19
[   32.946007][ T2776] RDX: 0000000000000000 RSI: 000000000041a96e RDI: 0000000000000003
[   32.946088][ T2776] RBP: 000000000041a96e R08: 0000000000000000 R09: 00007ffd75646a88
[   32.946168][ T2776] R10: 0000000000000003 R11: 0000000000000206 R12: 0000000000000000
[   32.946248][ T2776] R13: 0000000000c2f220 R14: 0000000000000000 R15: 0000000000000000
---------- dmesg ----------

Q1: mm/page_alloc.c defines __fs_reclaim_map as

static struct lockdep_map __fs_reclaim_map =
	STATIC_LOCKDEP_MAP_INIT("fs_reclaim", &__fs_reclaim_map);

where STATIC_LOCKDEP_MAP_INIT(_name, _key) is defined by include/linux/lockdep.h as

#define STATIC_LOCKDEP_MAP_INIT(_name, _key) \
	{ .name = (_name), .key = (void *)(_key), }

which results in

static struct lockdep_map __fs_reclaim_map = {
	.name = "fs_reclaim",
	.key = &__fs_reclaim_map,
};

indicating that there is only one instance of "struct lockdep_map" and one
constant value for "struct lockdep_map"->key. Is my understanding correct?
Q2: journal_init_common() in fs/jbd2/journal.c defines jbd2_trans_commit_key as

static struct lock_class_key jbd2_trans_commit_key;

and journal_init_common() passes this constant value for "struct lockdep_map"->key
to all instances of "struct lockdep_map" via

lockdep_init_map(&journal->j_trans_commit_map, "jbd2_handle", &jbd2_trans_commit_key, 0) {
	lockdep_init_map_wait(&journal->j_trans_commit_map, "jbd2_handle", &jbd2_trans_commit_key, 0, LD_WAIT_INV) {
		lockdep_init_map_waits(&journal->j_trans_commit_map, "jbd2_handle", &jbd2_trans_commit_key, 0, LD_WAIT_INV, LD_WAIT_INV) {
			lockdep_init_map_type(&journal->j_trans_commit_map, "jbd2_handle", &jbd2_trans_commit_key, 0, 0, 0) {
				journal->j_trans_commit_map.name = "jbd2_handle";
				journal->j_trans_commit_map.key = &jbd2_trans_commit_key;
			}
		}
	}
}

on each journal, indicating that there can be multiple instances of "struct
lockdep_map" but every instance shares one constant value for that "struct
lockdep_map"->key. Is my understanding correct?

Q3: init_once() in fs/ext4/super.c initializes ei->xattr_sem on each inode via

init_rwsem(&ei->xattr_sem);

where init_rwsem() is defined in include/linux/rwsem.h as

#define init_rwsem(sem)					\
do {							\
	static struct lock_class_key __key;		\
							\
	__init_rwsem((sem), #sem, &__key);		\
} while (0)

and __init_rwsem() in kernel/locking/rwsem.c does

lockdep_init_map_wait(&ei->xattr_sem.dep_map, "&ei->xattr_sem", &__key, 0, LD_WAIT_SLEEP) {
	lockdep_init_map_waits(&ei->xattr_sem.dep_map, "&ei->xattr_sem", &__key, 0, LD_WAIT_SLEEP, LD_WAIT_INV) {
		lockdep_init_map_type(&ei->xattr_sem.dep_map, "&ei->xattr_sem", &__key, 0, LD_WAIT_SLEEP, 0) {
			ei->xattr_sem.dep_map.name = "&ei->xattr_sem";
			ei->xattr_sem.dep_map.key = &__key;
		}
	}
}

on each inode, indicating that there can be multiple instances of "struct
lockdep_map" but every instance shares one constant value for that "struct
lockdep_map"->key. Is my understanding correct?
Q4: check_prev_add() in kernel/locking/lockdep.c checks the possibility of
deadlock using "struct lock_class lock_classes[MAX_LOCKDEP_KEYS]" rather than
individual "struct lockdep_map" instances. Is this correct?

Q5: Because individual "struct lockdep_map" instances are not used for checking
the possibility of deadlock, even though there can be multiple
"journal->j_trans_commit_map" instances and multiple "ei->xattr_sem" instances,
lockdep treats "journal->j_trans_commit_map" and "ei->xattr_sem.dep_map" and
__fs_reclaim_map equally (in other words, lockdep behaves as if there were only
one instance of each). Is this correct?

Q6: If the 5 questions listed above are all "yes", then

Chain exists of:
  jbd2_handle --> &ei->xattr_sem --> fs_reclaim

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(&ei->xattr_sem);
                               lock(fs_reclaim);
  lock(jbd2_handle);

is simplified as if

Chain exists of:
  jbd2_handle --> ei_xattr_sem --> fs_reclaim

 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(fs_reclaim);
                               lock(ei_xattr_sem);
                               lock(fs_reclaim);
  lock(jbd2_handle);

. Is this correct?

If the 6 questions listed above are all "yes", Jan Kara's response that

  *but*

  b) as I wrote in the email you are replying to, the jbd2_handle key is
  private per filesystem. Thus for lockdep to complain about
  jbd2_handle->fs_reclaim->jbd2_handle deadlock, those jbd2_handle lockdep
  maps must come from the same filesystem.

  *and*

  c) filesystem without journal doesn't use jbd2_handle lockdep map at all so
  for such filesystems lockdep creates no dependency for jbd2_handle map.

will not help avoiding this lockdep report because

  a) I think that syzbot reboots the VM between executing different tests to
  get reproducible conditions. But in theory I agree the test may have
  contained one image with and one image without a journal.
can become true because syzkaller can test both "filesystem without journal"
and "filesystem with journal" within a single boot, and the same
ei->xattr_sem.dep_map key is used for both "filesystem without journal" and
"filesystem with journal"...
end of thread, other threads:[~2021-03-20 10:07 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <000000000000563a0205bafb7970@google.com>
2021-02-11 10:49 ` possible deadlock in start_this_handle (2) Jan Kara
2021-02-11 10:55 ` Michal Hocko
2021-02-11 11:22 ` Dmitry Vyukov
2021-02-11 11:28 ` Dmitry Vyukov
2021-02-11 12:10 ` Jan Kara
2021-02-11 12:34 ` Michal Hocko
2021-02-11 12:57 ` Matthew Wilcox
2021-02-11 13:07 ` Michal Hocko
2021-02-11 13:25 ` Matthew Wilcox
2021-02-11 14:20 ` Michal Hocko
2021-02-11 14:26 ` Matthew Wilcox
2021-02-11 16:41 ` Michal Hocko
2021-02-12 11:18 ` Tetsuo Handa
2021-02-12 12:22 ` Matthew Wilcox
2021-02-12 12:30 ` Michal Hocko
2021-02-12 12:58 ` Tetsuo Handa
2021-02-12 13:12 ` Michal Hocko
2021-02-12 13:34 ` Tetsuo Handa
2021-02-12 15:43 ` Michal Hocko
2021-02-13 10:58 ` Dmitry Vyukov
2021-02-11 13:18 ` Dmitry Vyukov
2021-02-11 11:46 ` Jan Kara
2021-02-13 14:26 ` Tetsuo Handa
2021-02-15 12:45 ` Jan Kara
2021-02-15 14:06 ` Tetsuo Handa
2021-02-15 14:29 ` Jan Kara
2021-02-19 10:15 ` Tetsuo Handa
2021-02-19 17:22 ` harshad shirwadkar
2021-03-20 10:02 ` Tetsuo Handa