* lockdep recursive locking warning on for-next
@ 2021-02-18 18:14 Brian Foster
  2021-02-18 18:49 ` Darrick J. Wong
From: Brian Foster @ 2021-02-18 18:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

Hi Darrick,

I'm seeing the warning below via xfs/167 on a test machine. It looks
like it's just complaining about nested freeze protection between the
scan invocation and an underlying transaction allocation for an inode
eofblocks trim. I suppose we could either refactor xfs_trans_alloc() to
drop and reacquire freeze protection around the scan, or alternatively
call __sb_writers_release() and __sb_writers_acquired() around the scan
to retain freeze protection and quiet lockdep. Hm?
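
Something like this for the latter, maybe (untested sketch; note that
__sb_writers_release() only drops the lockdep annotation, we'd still
hold the lock itself):

	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
	if (error == -ENOSPC) {
		/* still hold sb_internal; hide it from lockdep only */
		__sb_writers_release(mp->m_super, SB_FREEZE_FS);
		error = xfs_blockgc_free_space(mp, NULL);
		__sb_writers_acquired(mp->m_super, SB_FREEZE_FS);
		if (!error)
			error = xfs_trans_reserve(tp, resp, blocks,
						  rtextents);
	}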

BTW, the stack report also had me wondering whether we had or need any
nesting protection in these new scan invocations. For example, if we
have an fs with a bunch of tagged inodes and concurrent allocation
activity, would anything prevent an in-scan transaction allocation from
jumping back into the scan code to complete outstanding work? It looks
like that might not be possible right now because neither scan reserves
blocks, but they do both use transactions and that's quite a subtle
balance..

Brian

[  316.631387] ============================================
[  316.636697] WARNING: possible recursive locking detected
[  316.642010] 5.11.0-rc4 #35 Tainted: G        W I      
[  316.647148] --------------------------------------------
[  316.652462] fsstress/17733 is trying to acquire lock:
[  316.657515] ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_free_eofblocks+0x104/0x1d0 [xfs]
[  316.666405] 
               but task is already holding lock:
[  316.672239] ffff8e0fd1d90650 (sb_internal){++++}-{0:0}, at: xfs_trans_alloc_inode+0x5f/0x160 [xfs]
[  316.681269] 
...
               stack backtrace:
[  316.774735] CPU: 38 PID: 17733 Comm: fsstress Tainted: G        W I       5.11.0-rc4 #35
[  316.782819] Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 1.6.11 11/20/2018
[  316.790386] Call Trace:
[  316.792844]  dump_stack+0x8b/0xb0
[  316.796168]  __lock_acquire.cold+0x159/0x2ab
[  316.800441]  lock_acquire+0x116/0x370
[  316.804106]  ? xfs_free_eofblocks+0x104/0x1d0 [xfs]
[  316.809045]  ? rcu_read_lock_sched_held+0x3f/0x80
[  316.813750]  ? kmem_cache_alloc+0x287/0x2b0
[  316.817937]  xfs_trans_alloc+0x1ad/0x310 [xfs]
[  316.822445]  ? xfs_free_eofblocks+0x104/0x1d0 [xfs]
[  316.827376]  xfs_free_eofblocks+0x104/0x1d0 [xfs]
[  316.832134]  xfs_blockgc_scan_inode+0x24/0x60 [xfs]
[  316.837074]  xfs_inode_walk_ag+0x202/0x4b0 [xfs]
[  316.841754]  ? xfs_inode_free_cowblocks+0xf0/0xf0 [xfs]
[  316.847040]  ? __lock_acquire+0x382/0x1e10
[  316.851142]  ? xfs_inode_free_cowblocks+0xf0/0xf0 [xfs]
[  316.856425]  xfs_inode_walk+0x66/0xc0 [xfs]
[  316.860670]  xfs_trans_alloc+0x160/0x310 [xfs]
[  316.865179]  xfs_trans_alloc_inode+0x5f/0x160 [xfs]
[  316.870119]  xfs_alloc_file_space+0x105/0x300 [xfs]
[  316.875048]  ? down_write_nested+0x30/0x70
[  316.879148]  xfs_file_fallocate+0x270/0x460 [xfs]
[  316.883913]  ? lock_acquire+0x116/0x370
[  316.887752]  ? __x64_sys_fallocate+0x3e/0x70
[  316.892026]  ? selinux_file_permission+0x105/0x140
[  316.896820]  vfs_fallocate+0x14d/0x3d0
[  316.900572]  __x64_sys_fallocate+0x3e/0x70
[  316.904669]  do_syscall_64+0x33/0x40
[  316.908250]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
...



* Re: lockdep recursive locking warning on for-next
  2021-02-18 18:14 lockdep recursive locking warning on for-next Brian Foster
@ 2021-02-18 18:49 ` Darrick J. Wong
  2021-02-18 19:12   ` Brian Foster
From: Darrick J. Wong @ 2021-02-18 18:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 18, 2021 at 01:14:50PM -0500, Brian Foster wrote:
> Hi Darrick,
> 
> I'm seeing the warning below via xfs/167 on a test machine. It looks
> like it's just complaining about nested freeze protection between the
> scan invocation and an underlying transaction allocation for an inode
> eofblocks trim. I suppose we could either refactor xfs_trans_alloc() to
> drop and reacquire freeze protection around the scan, or alternatively
> call __sb_writers_release() and __sb_writers_acquired() around the scan
> to retain freeze protection and quiet lockdep. Hm?

Erk, isn't that a potential log grant livelock too?

Fill up the filesystem with real data and cow blocks until it's full,
then spawn exactly enough file writer threads to eat up all the log
reservation, then each _reserve() fails, so every thread starts a scan
and tries to allocate /another/ transaction ... but there's no space
left in the log, so those scans just block indefinitely.

So... I think the solution here is to go back to a previous version of
what that patchset did, where we'd drop the whole transaction, run the
scan, and jump back to the top of the function to get a fresh
transaction.
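
Roughly (untested sketch; want_retry makes it a one-shot so we can't
loop forever):

retry:
	tp = kmem_cache_zalloc(xfs_trans_zone, GFP_KERNEL | __GFP_NOFAIL);
	...
	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
	if (error == -ENOSPC && want_retry) {
		/* back out of the transaction completely before scanning */
		xfs_trans_cancel(tp);
		error = xfs_blockgc_free_space(mp, NULL);
		if (error)
			return error;
		want_retry = false;
		goto retry;
	}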

> BTW, the stack report also had me wondering whether we had or need any
> nesting protection in these new scan invocations. For example, if we
> have an fs with a bunch of tagged inodes and concurrent allocation
> activity, would anything prevent an in-scan transaction allocation from
> jumping back into the scan code to complete outstanding work? It looks
> like that might not be possible right now because neither scan reserves
> blocks, but they do both use transactions and that's quite a subtle
> balance..

Yes, that's a subtlety that screams for better documentation.

--D

> 
> Brian
> 
> [ lockdep splat snipped ]


* Re: lockdep recursive locking warning on for-next
  2021-02-18 18:49 ` Darrick J. Wong
@ 2021-02-18 19:12   ` Brian Foster
  2021-02-18 19:31     ` Darrick J. Wong
From: Brian Foster @ 2021-02-18 19:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 18, 2021 at 10:49:26AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 18, 2021 at 01:14:50PM -0500, Brian Foster wrote:
> > Hi Darrick,
> > 
> > I'm seeing the warning below via xfs/167 on a test machine. It looks
> > like it's just complaining about nested freeze protection between the
> > scan invocation and an underlying transaction allocation for an inode
> > eofblocks trim. I suppose we could either refactor xfs_trans_alloc() to
> > drop and reacquire freeze protection around the scan, or alternatively
> > call __sb_writers_release() and __sb_writers_acquired() around the scan
> > to retain freeze protection and quiet lockdep. Hm?
> 
> Erk, isn't that a potential log grant livelock too?
> 
> Fill up the filesystem with real data and cow blocks until it's full,
> then spawn exactly enough file writer threads to eat up all the log
> reservation, then each _reserve() fails, so every thread starts a scan
> and tries to allocate /another/ transaction ... but there's no space
> left in the log, so those scans just block indefinitely.
> 
> So... I think the solution here is to go back to a previous version of
> what that patchset did, where we'd drop the whole transaction, run the
> scan, and jump back to the top of the function to get a fresh
> transaction.
> 

But we don't call into the scan while holding log reservation. We hold
the transaction memory and freeze protection. It's probably debatable
whether we'd want to scan with freeze protection held or not, but I
don't see how dropping either of those changes anything wrt log
reservation..?

> > BTW, the stack report also had me wondering whether we had or need any
> > nesting protection in these new scan invocations. For example, if we
> > have an fs with a bunch of tagged inodes and concurrent allocation
> > activity, would anything prevent an in-scan transaction allocation from
> > jumping back into the scan code to complete outstanding work? It looks
> > like that might not be possible right now because neither scan reserves
> > blocks, but they do both use transactions and that's quite a subtle
> > balance..
> 
> Yes, that's a subtlety that screams for better documentation.
> 

TBH, I'm not sure that's enough. I think we should at least have some
kind of warning, even if only in DEBUG mode, that explicitly calls out
if we've become susceptible to this kind of scan reentry. Otherwise I
suspect that if this problem is ever truly introduced, the person who
first discovers it will probably be a user with a blown stack. :( Could we
set a flag on the task or something that warns as such (i.e. "WARNING:
attempted block reservation in block reclaim context") or perhaps just
prevents scan reentry in the first place?
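
(Strawman, with a made-up task flag, just to illustrate:)

	/* hypothetical reentry guard around the blockgc scan */
	if (WARN_ON_ONCE(current->flags & PF_BLOCKGC))
		return 0;	/* refuse to recurse into the scan */
	current->flags |= PF_BLOCKGC;
	error = xfs_blockgc_free_space(mp, NULL);
	current->flags &= ~PF_BLOCKGC;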

Brian

> --D
> 
> > 
> > Brian
> > 
> > [ lockdep splat snipped ]



* Re: lockdep recursive locking warning on for-next
  2021-02-18 19:12   ` Brian Foster
@ 2021-02-18 19:31     ` Darrick J. Wong
  2021-02-18 20:26       ` Brian Foster
  2021-02-19  4:14       ` Dave Chinner
From: Darrick J. Wong @ 2021-02-18 19:31 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Thu, Feb 18, 2021 at 02:12:52PM -0500, Brian Foster wrote:
> On Thu, Feb 18, 2021 at 10:49:26AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 18, 2021 at 01:14:50PM -0500, Brian Foster wrote:
> > > Hi Darrick,
> > > 
> > > I'm seeing the warning below via xfs/167 on a test machine. It looks
> > > like it's just complaining about nested freeze protection between the
> > > scan invocation and an underlying transaction allocation for an inode
> > > eofblocks trim. I suppose we could either refactor xfs_trans_alloc() to
> > > drop and reacquire freeze protection around the scan, or alternatively
> > > call __sb_writers_release() and __sb_writers_acquired() around the scan
> > > to retain freeze protection and quiet lockdep. Hm?
> > 
> > Erk, isn't that a potential log grant livelock too?
> > 
> > Fill up the filesystem with real data and cow blocks until it's full,
> > then spawn exactly enough file writer threads to eat up all the log
> > reservation, then each _reserve() fails, so every thread starts a scan
> > and tries to allocate /another/ transaction ... but there's no space
> > left in the log, so those scans just block indefinitely.
> > 
> > So... I think the solution here is to go back to a previous version of
> > what that patchset did, where we'd drop the whole transaction, run the
> > scan, and jump back to the top of the function to get a fresh
> > transaction.
> > 
> 
> But we don't call into the scan while holding log reservation. We hold
> the transaction memory and freeze protection. It's probably debatable
> whether we'd want to scan with freeze protection held or not, but I
> > don't see how dropping either of those changes anything wrt log
> reservation..?

Right, sorry about the noise.  We could just trick lockdep with
__sb_writers_release like you said.  Though I am a tad bit concerned
about the rwsem behavior -- what happens if:

T1 calls sb_start_intwrite (which is down_read on sb_writers), gets the
lock, and then hits ENOSPC and goes into our scan loop; meanwhile,

T2 calls sb_wait_write (which is down_write on sb_writers), and is
scheduled off because it was a blocking lock attempt; and then,

T1 finds some eofblocks to delete, and now it wants to sb_start_intwrite
again as part of allocating that second nested transaction.  Does that
actually work, or will T1 stall because we don't allow more readers once
something is waiting in down_write()?
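
To spell out the interleaving I'm worried about (assuming a waiting
writer blocks new readers, as with an ordinary rwsem):

	T1: sb_start_intwrite()	<- read lock granted
	T2: sb_wait_write()	<- write attempt sleeps, blocks new readers
	T1: sb_start_intwrite()	<- sleeps behind T2, which can't run until
				   T1 drops its first read lock: deadlock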

> > > BTW, the stack report also had me wondering whether we had or need any
> > > nesting protection in these new scan invocations. For example, if we
> > > have an fs with a bunch of tagged inodes and concurrent allocation
> > > activity, would anything prevent an in-scan transaction allocation from
> > > jumping back into the scan code to complete outstanding work? It looks
> > > like that might not be possible right now because neither scan reserves
> > > blocks, but they do both use transactions and that's quite a subtle
> > > balance..
> > 
> > Yes, that's a subtlety that screams for better documentation.
> > 
> 
> TBH, I'm not sure that's enough. I think we should at least have some
> kind of warning, even if only in DEBUG mode, that explicitly calls out
> if we've become susceptible to this kind of scan reentry. Otherwise I
> suspect that if this problem is ever truly introduced, the person who
> first discovers it will probably be a user with a blown stack. :( Could we
> set a flag on the task or something that warns as such (i.e. "WARNING:
> attempted block reservation in block reclaim context") or perhaps just
> prevents scan reentry in the first place?

What if we implemented an XFS_TRANS_TRYRESERVE flag that would skip the
scanning loops?  Then it would be at least a little more obvious when
xfs_free_eofblocks and xfs_reflink_cancel_cow_range kick in.
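
Sketch (flag value invented):

	#define XFS_TRANS_TRYRESERVE	(1u << 8)	/* made-up bit */

	error = xfs_trans_reserve(tp, resp, blocks, rtextents);
	if (error == -ENOSPC && !(flags & XFS_TRANS_TRYRESERVE)) {
		error = xfs_blockgc_free_space(mp, NULL);
		if (!error)
			error = xfs_trans_reserve(tp, resp, blocks,
						  rtextents);
	}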

OTOH that's problematic because both of those functions have other
callers, and "we're already doing a blockgc scan, don't start another"
is part of the thread context.

--D

> 
> Brian
> 
> > --D
> > 
> > > 
> > > Brian
> > > 
> > > [ lockdep splat snipped ]


* Re: lockdep recursive locking warning on for-next
  2021-02-18 19:31     ` Darrick J. Wong
@ 2021-02-18 20:26       ` Brian Foster
  2021-02-19  4:14       ` Dave Chinner
From: Brian Foster @ 2021-02-18 20:26 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs

On Thu, Feb 18, 2021 at 11:31:54AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 18, 2021 at 02:12:52PM -0500, Brian Foster wrote:
> > On Thu, Feb 18, 2021 at 10:49:26AM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 18, 2021 at 01:14:50PM -0500, Brian Foster wrote:
> > > > Hi Darrick,
> > > > 
> > > > I'm seeing the warning below via xfs/167 on a test machine. It looks
> > > > like it's just complaining about nested freeze protection between the
> > > > scan invocation and an underlying transaction allocation for an inode
> > > > eofblocks trim. I suppose we could either refactor xfs_trans_alloc() to
> > > > drop and reacquire freeze protection around the scan, or alternatively
> > > > call __sb_writers_release() and __sb_writers_acquired() around the scan
> > > > to retain freeze protection and quiet lockdep. Hm?
> > > 
> > > Erk, isn't that a potential log grant livelock too?
> > > 
> > > Fill up the filesystem with real data and cow blocks until it's full,
> > > then spawn exactly enough file writer threads to eat up all the log
> > > reservation, then each _reserve() fails, so every thread starts a scan
> > > and tries to allocate /another/ transaction ... but there's no space
> > > left in the log, so those scans just block indefinitely.
> > > 
> > > So... I think the solution here is to go back to a previous version of
> > > what that patchset did, where we'd drop the whole transaction, run the
> > > scan, and jump back to the top of the function to get a fresh
> > > transaction.
> > > 
> > 
> > But we don't call into the scan while holding log reservation. We hold
> > the transaction memory and freeze protection. It's probably debatable
> > whether we'd want to scan with freeze protection held or not, but I
> > don't see how dropping either of those changes anything wrt log
> > reservation..?
> 
> Right, sorry about the noise.  We could just trick lockdep with
> __sb_writers_release like you said.  Though I am a tad bit concerned
> about the rwsem behavior -- what happens if:
> 
> T1 calls sb_start_intwrite (which is down_read on sb_writers), gets the
> lock, and then hits ENOSPC and goes into our scan loop; meanwhile,
> 
> T2 calls sb_wait_write (which is down_write on sb_writers), and is
> scheduled off because it was a blocking lock attempt; and then,
> 
> T1 finds some eofblocks to delete, and now it wants to sb_start_intwrite
> again as part of allocating that second nested transaction.  Does that
> actually work, or will T1 stall because we don't allow more readers once
> something is waiting in down_write()?
> 

Good question. I'm not sure off the top of my head. In light of that, it
probably makes more sense to just cycle out of freeze protection during
the scan (patch sent and will test overnight).
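
Roughly (sketch; modulo whatever the actual patch ends up looking
like):

	if (error == -ENOSPC) {
		/* cycle out of freeze protection around the scan */
		sb_end_intwrite(mp->m_super);
		error = xfs_blockgc_free_space(mp, NULL);
		sb_start_intwrite(mp->m_super);
		if (!error)
			error = xfs_trans_reserve(tp, resp, blocks,
						  rtextents);
	}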

> > > > BTW, the stack report also had me wondering whether we had or need any
> > > > nesting protection in these new scan invocations. For example, if we
> > > > have an fs with a bunch of tagged inodes and concurrent allocation
> > > > activity, would anything prevent an in-scan transaction allocation from
> > > > jumping back into the scan code to complete outstanding work? It looks
> > > > like that might not be possible right now because neither scan reserves
> > > > blocks, but they do both use transactions and that's quite a subtle
> > > > balance..
> > > 
> > > Yes, that's a subtlety that screams for better documentation.
> > > 
> > 
> > TBH, I'm not sure that's enough. I think we should at least have some
> > kind of warning, even if only in DEBUG mode, that explicitly calls out
> > if we've become susceptible to this kind of scan reentry. Otherwise I
> > suspect that if this problem is ever truly introduced, the person who
> > first discovers it will probably be a user with a blown stack. :( Could we
> > set a flag on the task or something that warns as such (i.e. "WARNING:
> > attempted block reservation in block reclaim context") or perhaps just
> > prevents scan reentry in the first place?
> 
> What if we implemented an XFS_TRANS_TRYRESERVE flag that would skip the
> scanning loops?  Then it would be at least a little more obvious when
> xfs_free_eofblocks and xfs_reflink_cancel_cow_range kick in.
> 
> OTOH that's problematic because both of those functions have other
> callers, and "we're already doing a blockgc scan, don't start another"
> is part of the thread context.
> 

Yeah, so we'd probably still need task state, or to put something in
the eofb (which might be too ugly to plumb all the way through to the
transaction..?), to provide context information..

Brian

> --D
> 
> > 
> > Brian
> > 
> > > --D
> > > 
> > > > 
> > > > Brian
> > > > 
> > > > [ lockdep splat snipped ]



* Re: lockdep recursive locking warning on for-next
  2021-02-18 19:31     ` Darrick J. Wong
  2021-02-18 20:26       ` Brian Foster
@ 2021-02-19  4:14       ` Dave Chinner
  2021-02-19  4:28         ` Darrick J. Wong
From: Dave Chinner @ 2021-02-19  4:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Thu, Feb 18, 2021 at 11:31:54AM -0800, Darrick J. Wong wrote:
> On Thu, Feb 18, 2021 at 02:12:52PM -0500, Brian Foster wrote:
> > On Thu, Feb 18, 2021 at 10:49:26AM -0800, Darrick J. Wong wrote:
> > > On Thu, Feb 18, 2021 at 01:14:50PM -0500, Brian Foster wrote:
> > > > Hi Darrick,
> > > > 
> > > > I'm seeing the warning below via xfs/167 on a test machine. It looks
> > > > like it's just complaining about nested freeze protection between the
> > > > scan invocation and an underlying transaction allocation for an inode
> > > > eofblocks trim. I suppose we could either refactor xfs_trans_alloc() to
> > > > drop and reacquire freeze protection around the scan, or alternatively
> > > > call __sb_writers_release() and __sb_writers_acquired() around the scan
> > > > to retain freeze protection and quiet lockdep. Hm?
> > > 
> > > Erk, isn't that a potential log grant livelock too?
> > > 
> > > Fill up the filesystem with real data and cow blocks until it's full,
> > > then spawn exactly enough file writer threads to eat up all the log
> > > reservation, then each _reserve() fails, so every thread starts a scan
> > > and tries to allocate /another/ transaction ... but there's no space
> > > left in the log, so those scans just block indefinitely.
> > > 
> > > So... I think the solution here is to go back to a previous version of
> > > what that patchset did, where we'd drop the whole transaction, run the
> > > scan, and jump back to the top of the function to get a fresh
> > > transaction.
> > > 
> > 
> > But we don't call into the scan while holding log reservation. We hold
> > the transaction memory and freeze protection. It's probably debatable
> > whether we'd want to scan with freeze protection held or not, but I
> > don't see how dropping either of those changes anything wrt log
> > reservation..?
> 
> Right, sorry about the noise.  We could just trick lockdep with
> __sb_writers_release like you said.  Though I am a tad bit concerned
> about the rwsem behavior -- what happens if:
> 
> T1 calls sb_start_intwrite (which is down_read on sb_writers), gets the
> lock, and then hits ENOSPC and goes into our scan loop; meanwhile,
> 
> T2 calls sb_wait_write (which is down_write on sb_writers), and is
> scheduled off because it was a blocking lock attempt; and then,
> 
> T1 finds some eofblocks to delete, and now it wants to sb_start_intwrite
> again as part of allocating that second nested transaction.  Does that
> actually work, or will T1 stall because we don't allow more readers once
> something is waiting in down_write()?

The stack trace nesting inside xfs_trans_alloc() looks fundamentally
wrong to me. It screams "warning, dragons be here". We're not
allowed to nest transactions -anywhere-, so actually designing a call
path that ends up looking like we are nesting transactions, but then
plays whacky games to avoid the problems associated with nesting,
seems... poorly thought out.

> > > > BTW, the stack report also had me wondering whether we had or need any
> > > > nesting protection in these new scan invocations. For example, if we
> > > > have an fs with a bunch of tagged inodes and concurrent allocation
> > > > activity, would anything prevent an in-scan transaction allocation from
> > > > jumping back into the scan code to complete outstanding work? It looks
> > > > like that might not be possible right now because neither scan reserves
> > > > blocks, but they do both use transactions and that's quite a subtle
> > > > balance..
> > > 
> > > Yes, that's a subtlety that screams for better documentation.
> > > 
> > 
> > TBH, I'm not sure that's enough. I think we should at least have some
> > kind of warning, even if only in DEBUG mode, that explicitly calls out
> > if we've become susceptible to this kind of scan reentry. Otherwise I
> > suspect that if this problem is ever truly introduced, the person who
> > first discovers it will probably be a user with a blown stack. :( Could we
> > set a flag on the task or something that warns as such (i.e. "WARNING:
> > attempted block reservation in block reclaim context") or perhaps just
> > prevents scan reentry in the first place?
> 
> What if we implemented an XFS_TRANS_TRYRESERVE flag that would skip the
> scanning loops?  Then it would be at least a little more obvious when
> xfs_free_eofblocks and xfs_reflink_cancel_cow_range kick in.

Isn't detecting transaction reentry exactly what PF_FSTRANS is for?

Or have we dropped that regression fix on the ground *again*?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: lockdep recursive locking warning on for-next
  2021-02-19  4:14       ` Dave Chinner
@ 2021-02-19  4:28         ` Darrick J. Wong
  2021-02-19  5:48           ` Dave Chinner
From: Darrick J. Wong @ 2021-02-19  4:28 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, linux-xfs

On Fri, Feb 19, 2021 at 03:14:27PM +1100, Dave Chinner wrote:
> On Thu, Feb 18, 2021 at 11:31:54AM -0800, Darrick J. Wong wrote:
> > On Thu, Feb 18, 2021 at 02:12:52PM -0500, Brian Foster wrote:
> > > On Thu, Feb 18, 2021 at 10:49:26AM -0800, Darrick J. Wong wrote:
> > > > On Thu, Feb 18, 2021 at 01:14:50PM -0500, Brian Foster wrote:
> > > > > Hi Darrick,
> > > > > 
> > > > > I'm seeing the warning below via xfs/167 on a test machine. It looks
> > > > > like it's just complaining about nested freeze protection between the
> > > > > scan invocation and an underlying transaction allocation for an inode
> > > > > eofblocks trim. I suppose we could either refactor xfs_trans_alloc() to
> > > > > drop and reacquire freeze protection around the scan, or alternatively
> > > > > call __sb_writers_release() and __sb_writers_acquired() around the scan
> > > > > to retain freeze protection and quiet lockdep. Hm?
> > > > 
> > > > Erk, isn't that a potential log grant livelock too?
> > > > 
> > > > Fill up the filesystem with real data and cow blocks until it's full,
> > > > then spawn exactly enough file writer threads to eat up all the log
> > > > reservation, then each _reserve() fails, so every thread starts a scan
> > > > and tries to allocate /another/ transaction ... but there's no space
> > > > left in the log, so those scans just block indefinitely.
> > > > 
> > > > So... I think the solution here is to go back to a previous version of
> > > > what that patchset did, where we'd drop the whole transaction, run the
> > > > scan, and jump back to the top of the function to get a fresh
> > > > transaction.
> > > > 
> > > 
> > > But we don't call into the scan while holding log reservation. We hold
> > > the transaction memory and freeze protection. It's probably debatable
> > > whether we'd want to scan with freeze protection held or not, but I
> > > don't see how dropping either of those changes anything wrt log
> > > reservation..?
> > 
> > Right, sorry about the noise.  We could just trick lockdep with
> > __sb_writers_release like you said.  Though I am a tad bit concerned
> > about the rwsem behavior -- what happens if:
> > 
> > T1 calls sb_start_intwrite (which is down_read on sb_writers), gets the
> > lock, and then hits ENOSPC and goes into our scan loop; meanwhile,
> > 
> > T2 calls sb_wait_write (which is down_write on sb_writers), and is
> > scheduled off because it was a blocking lock attempt; and then,
> > 
> > T1 finds some eofblocks to delete, and now it wants to sb_start_intwrite
> > again as part of allocating that second nested transaction.  Does that
> > actually work, or will T1 stall because we don't allow more readers once
> > something is waiting in down_write()?
> 
> The stack trace nesting inside xfs_trans_alloc() looks fundamentally
> wrong to me. It screams "warning, dragons be here" to me. We're not
> allowed to nest transactions -anywhere- so actually designing a call
> path that ends up looking like we are nesting transactions but then
> plays whacky games to avoid problems associated with nesting
> seems... poorly thought out.

The original version of the code[1] in question cancelled the
transaction before starting the scan, but the reviewers didn't think it
would be a problem to hold on to the transaction or recurse intwrite.

[1] https://lore.kernel.org/linux-xfs/20210124094816.GE670331@infradead.org/

I probably should've just kept it the way it was.  Well, maybe I'll send
in my own fixpatch and see what the rest of you think.

> > > > > BTW, the stack report also had me wondering whether we had or need any
> > > > > nesting protection in these new scan invocations. For example, if we
> > > > > have an fs with a bunch of tagged inodes and concurrent allocation
> > > > > activity, would anything prevent an in-scan transaction allocation from
> > > > > jumping back into the scan code to complete outstanding work? It looks
> > > > > like that might not be possible right now because neither scan reserves
> > > > > blocks, but they do both use transactions and that's quite a subtle
> > > > > balance..
> > > > 
> > > > Yes, that's a subtlety that screams for better documentation.
> > > > 
> > > 
> > > TBH, I'm not sure that's enough. I think we should at least have some
> > > kind of warning, even if only in DEBUG mode, that explicitly calls out
> > > if we've become susceptible to this kind of scan reentry. Otherwise I
> > > suspect that if this problem is ever truly introduced, the person who
> > > first discovers it will probably be a user with a blown stack. :( Could we
> > > set a flag on the task or something that warns as such (i.e. "WARNING:
> > > attempted block reservation in block reclaim context") or perhaps just
> > > prevents scan reentry in the first place?
> > 
> > What if we implemented an XFS_TRANS_TRYRESERVE flag that would skip the
> > scanning loops?  Then it would be at least a little more obvious when
> > xfs_free_eofblocks and xfs_reflink_cancel_cow_range kick in.
> 
> Isn't detecting transaction reentry exactly what PF_FSTRANS is for?
> 
> Or have we dropped that regression fix on the ground *again*?

That series isn't making progress.

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: lockdep recursive locking warning on for-next
  2021-02-19  4:28         ` Darrick J. Wong
@ 2021-02-19  5:48           ` Dave Chinner
  2021-02-22 22:53             ` Dave Chinner
From: Dave Chinner @ 2021-02-19  5:48 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Thu, Feb 18, 2021 at 08:28:33PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 19, 2021 at 03:14:27PM +1100, Dave Chinner wrote:
> > > What if we implemented an XFS_TRANS_TRYRESERVE flag that would skip the
> > > scanning loops?  Then it would be at least a little more obvious when
> > > xfs_free_eofblocks and xfs_reflink_cancel_cow_range kick in.
> > 
> > Isn't detecting transaction reentry exactly what PF_FSTRANS is for?
> > 
> > Or have we dropped that regression fix on the ground *again*?
> 
> That series isn't making progress.

Ok, I just went back and looked at v15 from a month ago and noticed
multiple new bugs that weren't in v14. I'm so done with that slow
motion merry-go-round, so here's patch I wrote from scratch 10
minutes ago....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: use current->journal_info for detecting transaction recursion

From: Dave Chinner <dchinner@redhat.com>

Using PF_MEMALLOC_NOFS in the iomap code to detect transaction
recursion in XFS is just wrong. Remove it from the iomap code and
replace it with XFS-specific internal checks using
current->journal_info instead.

Fixes: 9070733b4efa ("xfs: abstract PF_FSTRANS to PF_MEMALLOC_NOFS")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c    |  7 -------
 fs/xfs/libxfs/xfs_btree.c |  4 +++-
 fs/xfs/xfs_aops.c         | 17 +++++++++++++++--
 fs/xfs/xfs_trans.c        | 21 +++++----------------
 fs/xfs/xfs_trans.h        | 30 ++++++++++++++++++++++++++++++
 5 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 16a1e82e3aeb..fcd4a0d71fc1 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1458,13 +1458,6 @@ iomap_do_writepage(struct page *page, struct writeback_control *wbc, void *data)
 			PF_MEMALLOC))
 		goto redirty;
 
-	/*
-	 * Given that we do not allow direct reclaim to call us, we should
-	 * never be called in a recursive filesystem reclaim context.
-	 */
-	if (WARN_ON_ONCE(current->flags & PF_MEMALLOC_NOFS))
-		goto redirty;
-
 	/*
 	 * Is this page beyond the end of the file?
 	 *
diff --git a/fs/xfs/libxfs/xfs_btree.c b/fs/xfs/libxfs/xfs_btree.c
index b56ff451adce..e6a15b920034 100644
--- a/fs/xfs/libxfs/xfs_btree.c
+++ b/fs/xfs/libxfs/xfs_btree.c
@@ -2805,7 +2805,7 @@ xfs_btree_split_worker(
 	struct xfs_btree_split_args	*args = container_of(work,
 						struct xfs_btree_split_args, work);
 	unsigned long		pflags;
-	unsigned long		new_pflags = PF_MEMALLOC_NOFS;
+	unsigned long		new_pflags = 0;
 
 	/*
 	 * we are in a transaction context here, but may also be doing work
@@ -2817,11 +2817,13 @@ xfs_btree_split_worker(
 		new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
 
 	current_set_flags_nested(&pflags, new_pflags);
+	xfs_trans_set_context(args->cur->bc_tp);
 
 	args->result = __xfs_btree_split(args->cur, args->level, args->ptrp,
 					 args->key, args->curp, args->stat);
 	complete(args->done);
 
+	xfs_trans_clear_context(args->cur->bc_tp);
 	current_restore_flags_nested(&pflags, new_pflags);
 }
 
diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4304c6416fbb..b4186d666157 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -62,7 +62,7 @@ xfs_setfilesize_trans_alloc(
 	 * We hand off the transaction to the completion thread now, so
 	 * clear the flag here.
 	 */
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_clear_context(tp);
 	return 0;
 }
 
@@ -125,7 +125,7 @@ xfs_setfilesize_ioend(
 	 * thus we need to mark ourselves as being in a transaction manually.
 	 * Similarly for freeze protection.
 	 */
-	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+	xfs_trans_set_context(tp);
 	__sb_writers_acquired(VFS_I(ip)->i_sb, SB_FREEZE_FS);
 
 	/* we abort the update if there was an IO error */
@@ -568,6 +568,12 @@ xfs_vm_writepage(
 {
 	struct xfs_writepage_ctx wpc = { };
 
+	if (WARN_ON_ONCE(current->journal_info)) {
+		redirty_page_for_writepage(wbc, page);
+		unlock_page(page);
+		return 0;
+	}
+
 	return iomap_writepage(page, wbc, &wpc.ctx, &xfs_writeback_ops);
 }
 
@@ -578,6 +584,13 @@ xfs_vm_writepages(
 {
 	struct xfs_writepage_ctx wpc = { };
 
+	/*
+	 * Writing back data in a transaction context can result in recursive
+	 * transactions. This is bad, so issue a warning and get out of here.
+	 */
+	if (WARN_ON_ONCE(current->journal_info))
+		return 0;
+
 	xfs_iflags_clear(XFS_I(mapping->host), XFS_ITRUNCATED);
 	return iomap_writepages(mapping, wbc, &wpc.ctx, &xfs_writeback_ops);
 }
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index 7d05694681e3..28c87eff11c0 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -72,6 +72,7 @@ xfs_trans_free(
 	xfs_extent_busy_clear(tp->t_mountp, &tp->t_busy, false);
 
 	trace_xfs_trans_free(tp, _RET_IP_);
+	xfs_trans_clear_context(tp);
 	if (!(tp->t_flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_end_intwrite(tp->t_mountp->m_super);
 	xfs_trans_free_dqinfo(tp);
@@ -123,7 +124,8 @@ xfs_trans_dup(
 
 	ntp->t_rtx_res = tp->t_rtx_res - tp->t_rtx_res_used;
 	tp->t_rtx_res = tp->t_rtx_res_used;
-	ntp->t_pflags = tp->t_pflags;
+
+	xfs_trans_switch_context(tp, ntp);
 
 	/* move deferred ops over to the new tp */
 	xfs_defer_move(ntp, tp);
@@ -157,9 +159,6 @@ xfs_trans_reserve(
 	int			error = 0;
 	bool			rsvd = (tp->t_flags & XFS_TRANS_RESERVE) != 0;
 
-	/* Mark this thread as being in a transaction */
-	current_set_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
-
 	/*
 	 * Attempt to reserve the needed disk blocks by decrementing
 	 * the number needed from the number available.  This will
@@ -167,10 +166,8 @@ xfs_trans_reserve(
 	 */
 	if (blocks > 0) {
 		error = xfs_mod_fdblocks(mp, -((int64_t)blocks), rsvd);
-		if (error != 0) {
-			current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
+		if (error != 0)
 			return -ENOSPC;
-		}
 		tp->t_blk_res += blocks;
 	}
 
@@ -244,9 +241,6 @@ xfs_trans_reserve(
 		xfs_mod_fdblocks(mp, (int64_t)blocks, rsvd);
 		tp->t_blk_res = 0;
 	}
-
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
-
 	return error;
 }
 
@@ -270,6 +264,7 @@ xfs_trans_alloc(
 	tp = kmem_cache_zalloc(xfs_trans_zone, GFP_KERNEL | __GFP_NOFAIL);
 	if (!(flags & XFS_TRANS_NO_WRITECOUNT))
 		sb_start_intwrite(mp->m_super);
+	xfs_trans_set_context(tp);
 
 	/*
 	 * Zero-reservation ("empty") transactions can't modify anything, so
@@ -892,8 +887,6 @@ __xfs_trans_commit(
 	xfs_trans_apply_dquot_deltas(tp);
 
 	xlog_cil_commit(mp->m_log, tp, &commit_seq, regrant);
-
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free(tp);
 
 	/*
@@ -925,7 +918,6 @@ __xfs_trans_commit(
 			xfs_log_ticket_ungrant(mp->m_log, tp->t_ticket);
 		tp->t_ticket = NULL;
 	}
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
 	xfs_trans_free_items(tp, !!error);
 	xfs_trans_free(tp);
 
@@ -985,9 +977,6 @@ xfs_trans_cancel(
 		tp->t_ticket = NULL;
 	}
 
-	/* mark this thread as no longer being in a transaction */
-	current_restore_flags_nested(&tp->t_pflags, PF_MEMALLOC_NOFS);
-
 	xfs_trans_free_items(tp, dirty);
 	xfs_trans_free(tp);
 }
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index d223d4f4e429..f7b0fc24027f 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -281,4 +281,34 @@ int xfs_trans_alloc_ichange(struct xfs_inode *ip, struct xfs_dquot *udqp,
 		struct xfs_dquot *gdqp, struct xfs_dquot *pdqp, bool force,
 		struct xfs_trans **tpp);
 
+static inline void
+xfs_trans_set_context(
+	struct xfs_trans	*tp)
+{
+	ASSERT(current->journal_info == NULL);
+	tp->t_pflags = memalloc_nofs_save();
+	current->journal_info = tp;
+}
+
+static inline void
+xfs_trans_clear_context(
+	struct xfs_trans	*tp)
+{
+	if (current->journal_info == tp) {
+		memalloc_nofs_restore(tp->t_pflags);
+		current->journal_info = NULL;
+	}
+}
+
+static inline void
+xfs_trans_switch_context(
+	struct xfs_trans	*old_tp,
+	struct xfs_trans	*new_tp)
+{
+	ASSERT(current->journal_info == old_tp);
+	new_tp->t_pflags = old_tp->t_pflags;
+	old_tp->t_pflags = 0;
+	current->journal_info = new_tp;
+}
+
 #endif	/* __XFS_TRANS_H__ */
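
With that in place, code that must never run inside a transaction can
check current->journal_info directly, e.g. (illustration only, not
part of the patch):

	/* bail if we would recurse into a transaction we already hold */
	if (WARN_ON_ONCE(current->journal_info))
		return -EDEADLOCK;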


* Re: lockdep recursive locking warning on for-next
  2021-02-19  5:48           ` Dave Chinner
@ 2021-02-22 22:53             ` Dave Chinner
From: Dave Chinner @ 2021-02-22 22:53 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Brian Foster, linux-xfs

On Fri, Feb 19, 2021 at 04:48:20PM +1100, Dave Chinner wrote:
> On Thu, Feb 18, 2021 at 08:28:33PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 19, 2021 at 03:14:27PM +1100, Dave Chinner wrote:
> > > > What if we implemented an XFS_TRANS_TRYRESERVE flag that would skip the
> > > > scanning loops?  Then it would be at least a little more obvious when
> > > > xfs_free_eofblocks and xfs_reflink_cancel_cow_range kick in.
> > > 
> > > Isn't detecting transaction reentry exactly what PF_FSTRANS is for?
> > > 
> > > Or have we dropped that regression fix on the ground *again*?
> > 
> > That series isn't making progress.
> 
> Ok, I just went back and looked at v15 from a month ago and noticed
> multiple new bugs that weren't in v14. I'm so done with that
> slow-motion merry-go-round, so here's a patch I wrote from scratch 10
> minutes ago....

And, yes, this patch's assert fails when the GC code re-enters
xfs_trans_alloc()....

[56063.434910] kernel BUG at fs/xfs/xfs_message.c:110!
[56063.434929] invalid opcode: 0000 [#8] PREEMPT SMP
[56063.434932] CPU: 8 PID: 436700 Comm: fsstress Tainted: G      D           5.11.0-dgc+ #2897
[56063.436666] kernel BUG at fs/xfs/xfs_message.c:110!
[56063.439129] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1 04/01/2014
[56063.439134] RIP: 0010:assfail+0x27/0x2d
[56063.439142] Code: 0b 5d c3 66 66 66 66 90 55 41 89 c8 48 89 d1 48 89 f2 48 89 e5 48 c7 c6 50 47 58 82 e8 79 f9 ff ff 80 3d 49 51 86 02 00 74 02 <0f> 0b 0f 0b 5d 9
[56063.439145] RSP: 0018:ffffc9000a91f968 EFLAGS: 00010202
[56063.439149] RAX: 0000000000000000 RBX: ffff8885cffbe200 RCX: 0000000000000000
[56063.439151] RDX: 00000000ffffffc0 RSI: 0000000000000000 RDI: ffffffff82511561
[56063.439154] RBP: ffffc9000a91f968 R08: 0000000000000000 R09: 000000000000000a
[56064.187006] R10: 000000000000000a R11: f000000000000000 R12: ffff888258a5e000
[56064.188957] R13: ffff888258a5e2fc R14: 0000000000000000 R15: 0000000000000000
[56064.190909] FS:  00007f498012b040(0000) GS:ffff8885fec00000(0000) knlGS:0000000000000000
[56064.193109] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[56064.194701] CR2: 00007f498033b000 CR3: 0000000240fab001 CR4: 0000000000060ee0
[56064.196648] Call Trace:
[56064.197333]  xfs_trans_alloc+0x28c/0x2c0
[56064.198423]  xfs_free_eofblocks+0x12b/0x200
[56064.199603]  xfs_inode_free_eofblocks.constprop.0+0xd7/0x120
[56064.201161]  xfs_blockgc_scan_inode+0x36/0x80
[56064.202358]  xfs_inode_walk_ag+0x1bc/0x450
[56064.203493]  ? xfs_inode_free_cowblocks+0xf0/0xf0
[56064.204795]  xfs_inode_walk+0x69/0xc0
[56064.205836]  ? xfs_inode_free_cowblocks+0xf0/0xf0
[56064.207128]  xfs_blockgc_free_space+0x36/0x90
[56064.208323]  xfs_trans_alloc+0x19b/0x2c0
[56064.209421]  xfs_trans_alloc_inode+0x6f/0x170
[56064.210633]  xfs_alloc_file_space+0x101/0x2c0
[56064.211828]  ? __might_sleep+0x4b/0x80
[56064.212874]  xfs_file_fallocate+0x2e1/0x480
[56064.214030]  ? __do_sys_newfstat+0x55/0x60
[56064.215171]  vfs_fallocate+0x152/0x330
[56064.216220]  __x64_sys_fallocate+0x44/0x70
[56064.217383]  do_syscall_64+0x32/0x50
[56064.218372]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

So this patch does detect this as transaction recursion....

Cheers,

Dave.

> [ patch quoted in full; snipped -- see the previous message ]

-- 
Dave Chinner
david@fromorbit.com

