* Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ?
@ 2022-04-04 13:16 Frank Hofmann
  2022-04-04 23:22 ` Dave Chinner
  0 siblings, 1 reply; 7+ messages in thread
From: Frank Hofmann @ 2022-04-04 13:16 UTC (permalink / raw)
  To: linux-xfs, kernel-team

Hi,

we see machines getting stuck with a large number of backed-up
processes that invoke statfs() (monitoring stuff), like:

[Sat Apr  2 09:54:32 2022] INFO: task node_exporter:244222 blocked for
more than 10 seconds.
[Sat Apr  2 09:54:32 2022]       Tainted: G           O
5.15.26-cloudflare-2022.3.4 #1
[Sat Apr  2 09:54:32 2022] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Apr  2 09:54:32 2022] task:node_exporter   state:D stack:    0
pid:244222 ppid:     1 flags:0x00004000
[Sat Apr  2 09:54:32 2022] Call Trace:
[Sat Apr  2 09:54:32 2022]  <TASK>
[Sat Apr  2 09:54:32 2022]  __schedule+0x2cd/0x950
[Sat Apr  2 09:54:32 2022]  schedule+0x44/0xa0
[Sat Apr  2 09:54:32 2022]  schedule_timeout+0xfc/0x140
[Sat Apr  2 09:54:32 2022]  ? try_to_wake_up+0x338/0x4e0
[Sat Apr  2 09:54:32 2022]  ? __prepare_to_swait+0x4b/0x70
[Sat Apr  2 09:54:32 2022]  wait_for_completion+0x86/0xe0
[Sat Apr  2 09:54:32 2022]  flush_work+0x5c/0x80
[Sat Apr  2 09:54:32 2022]  ? flush_workqueue_prep_pwqs+0x110/0x110
[Sat Apr  2 09:54:32 2022]  xfs_inodegc_flush.part.0+0x3b/0x90
[Sat Apr  2 09:54:32 2022]  xfs_fs_statfs+0x29/0x1c0
[Sat Apr  2 09:54:32 2022]  statfs_by_dentry+0x4d/0x70
[Sat Apr  2 09:54:32 2022]  user_statfs+0x57/0xc0
[Sat Apr  2 09:54:32 2022]  __do_sys_statfs+0x20/0x50
[Sat Apr  2 09:54:32 2022]  do_syscall_64+0x3b/0x90
[Sat Apr  2 09:54:32 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[Sat Apr  2 09:54:32 2022] RIP: 0033:0x4ac9db
...

A linear-over-time increasing number of 'D' state processes is usually
what alerts us to this.

The oldest thread found waiting appears always to be the inode gc
worker doing deferred inactivation:

[Sat Apr  2 09:55:47 2022] INFO: task kworker/25:0:3880332 blocked for
more than 85 seconds.
[Sat Apr  2 09:55:47 2022]       Tainted: G           O
5.15.26-cloudflare-2022.3.4 #1
[Sat Apr  2 09:55:47 2022] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Apr  2 09:55:47 2022] task:kworker/25:0    state:D stack:    0
pid:3880332 ppid:     2 flags:0x00004000
[Sat Apr  2 09:55:47 2022] Workqueue: xfs-inodegc/dm-5 xfs_inodegc_worker
[Sat Apr  2 09:55:47 2022] Call Trace:
[Sat Apr  2 09:55:47 2022]  <TASK>
[Sat Apr  2 09:55:47 2022]  __schedule+0x2cd/0x950
[Sat Apr  2 09:55:47 2022]  schedule+0x44/0xa0
[Sat Apr  2 09:55:47 2022]  schedule_timeout+0xfc/0x140
[Sat Apr  2 09:55:47 2022]  __down+0x7d/0xd0
[Sat Apr  2 09:55:47 2022]  down+0x43/0x60
[Sat Apr  2 09:55:47 2022]  xfs_buf_lock+0x29/0xa0
[Sat Apr  2 09:55:47 2022]  xfs_buf_find+0x2c4/0x590
[Sat Apr  2 09:55:47 2022]  xfs_buf_get_map+0x46/0x390
[Sat Apr  2 09:55:47 2022]  xfs_buf_read_map+0x52/0x270
[Sat Apr  2 09:55:47 2022]  ? xfs_read_agf+0x87/0x110
[Sat Apr  2 09:55:47 2022]  xfs_trans_read_buf_map+0x128/0x2a0
[Sat Apr  2 09:55:47 2022]  ? xfs_read_agf+0x87/0x110
[Sat Apr  2 09:55:47 2022]  xfs_read_agf+0x87/0x110
[Sat Apr  2 09:55:47 2022]  xfs_alloc_read_agf+0x34/0x1a0
[Sat Apr  2 09:55:47 2022]  xfs_alloc_fix_freelist+0x3d7/0x4f0
[Sat Apr  2 09:55:47 2022]  ? xfs_buf_find+0x2aa/0x590
[Sat Apr  2 09:55:47 2022]  ? cpumask_next_and+0x1f/0x20
[Sat Apr  2 09:55:47 2022]  ? update_sd_lb_stats.constprop.0+0xfe/0x7d0
[Sat Apr  2 09:55:47 2022]  ? down_trylock+0x25/0x30
[Sat Apr  2 09:55:47 2022]  xfs_alloc_vextent+0x22b/0x440
[Sat Apr  2 09:55:47 2022]  __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0
[Sat Apr  2 09:55:47 2022]  __xfs_btree_split+0xf2/0x610
[Sat Apr  2 09:55:47 2022]  ? xfs_btree_read_buf_block.constprop.0+0xae/0xd0
[Sat Apr  2 09:55:47 2022]  xfs_btree_split+0x4b/0x100
[Sat Apr  2 09:55:47 2022]  xfs_btree_make_block_unfull+0x193/0x1c0
[Sat Apr  2 09:55:47 2022]  xfs_btree_insrec+0x4a9/0x5a0
[Sat Apr  2 09:55:47 2022]  ? xfs_btree_read_buf_block.constprop.0+0x95/0xd0
[Sat Apr  2 09:55:47 2022]  xfs_btree_insert+0xa8/0x1f0
[Sat Apr  2 09:55:47 2022]  ? xfs_inobt_init_common+0x28/0x90
[Sat Apr  2 09:55:47 2022]  xfs_difree_finobt+0xa4/0x240
[Sat Apr  2 09:55:47 2022]  xfs_difree+0x126/0x1a0
[Sat Apr  2 09:55:47 2022]  xfs_ifree+0xca/0x4a0
[Sat Apr  2 09:55:47 2022]  ? xfs_trans_reserve+0x123/0x160
[Sat Apr  2 09:55:47 2022]  ? xfs_trans_alloc+0xec/0x1f0
[Sat Apr  2 09:55:47 2022]  xfs_inactive_ifree.isra.0+0x9e/0x1a0
[Sat Apr  2 09:55:47 2022]  xfs_inactive+0xf8/0x170
[Sat Apr  2 09:55:47 2022]  xfs_inodegc_worker+0x73/0xf0
[Sat Apr  2 09:55:47 2022]  process_one_work+0x1e6/0x380
[Sat Apr  2 09:55:47 2022]  worker_thread+0x50/0x3a0
[Sat Apr  2 09:55:47 2022]  ? rescuer_thread+0x380/0x380
[Sat Apr  2 09:55:47 2022]  kthread+0x127/0x150
[Sat Apr  2 09:55:47 2022]  ? set_kthread_struct+0x40/0x40
[Sat Apr  2 09:55:47 2022]  ret_from_fork+0x22/0x30
[Sat Apr  2 09:55:47 2022]  </TASK>

Once the system is in this state, I/O to the affected filesystem no
longer happens - all ops to it get stuck. Other than rebooting the
system, we have not found a way to recover from this.

This is a histogram (first column: number of processes 'D'-ed on that
call trace) of `/proc/<PID>/stack`:

336 flush_work+0x5c/0x80 <= xfs_inodegc_flush.part.0+0x3b/0x90 <=
xfs_fs_statfs+0x29/0x1c0 <= statfs_by_dentry+0x4d/0x70 <=
user_statfs+0x57/0xc0 <= __do_sys_statfs+0x20/0x50 <=
do_syscall_64+0x3b/0x90 <= entry_SYSCALL_64_after_hwframe+0x44/0xae
48 xfs_ilock_attr_map_shared+0x24/0x50 <= xfs_attr_get+0x96/0xc0 <=
xfs_xattr_get+0x75/0xb0 <= __vfs_getxattr+0x53/0x70 <=
get_vfs_caps_from_disk+0x6e/0x1d0 <= audit_copy_inode+0x95/0xd0 <=
path_openat+0x27f/0x1090 <= do_filp_open+0xa9/0x150 <=
do_sys_openat2+0x97/0x160 <= __x64_sys_openat+0x54/0x90 <=
do_syscall_64+0x3b/0x90 <= entry_SYSCALL_64_after_hwframe+0x44/0xae
32 down+0x43/0x60 <= xfs_buf_lock+0x29/0xa0 <=
xfs_buf_find+0x2c4/0x590 <= xfs_buf_get_map+0x46/0x390 <=
xfs_buf_read_map+0x52/0x270 <= xfs_trans_read_buf_map+0x128/0x2a0 <=
xfs_read_agi+0x8e/0x120 <= xfs_iunlink_remove+0x59/0x220 <=
xfs_ifree+0x7d/0x4a0 <= xfs_inactive_ifree.isra.0+0x9e/0x1a0 <=
xfs_inactive+0xf8/0x170 <= xfs_inodegc_worker+0x73/0xf0 <=
process_one_work+0x1e6/0x380 <= worker_thread+0x50/0x3a0 <=
kthread+0x127/0x150 <= ret_from_fork+0x22/0x30
5 down+0x43/0x60 <= xfs_buf_lock+0x29/0xa0 <= xfs_buf_find+0x2c4/0x590
<= xfs_buf_get_map+0x46/0x390 <= xfs_buf_read_map+0x52/0x270 <=
xfs_trans_read_buf_map+0x128/0x2a0 <= xfs_read_agi+0x8e/0x120 <=
xfs_ialloc_read_agi+0x26/0xa0 <= xfs_dialloc+0x171/0x720 <=
xfs_create+0x394/0x610 <= xfs_generic_create+0x11e/0x360 <=
path_openat+0xe1c/0x1090 <= do_filp_open+0xa9/0x150 <=
do_sys_openat2+0x97/0x160 <= __x64_sys_openat+0x54/0x90 <=
do_syscall_64+0x3b/0x90 <= entry_SYSCALL_64_after_hwframe+0x44/0xae
1 down+0x43/0x60 <= xfs_buf_lock+0x29/0xa0 <= xfs_buf_find+0x2c4/0x590
<= xfs_buf_get_map+0x46/0x390 <= xfs_buf_read_map+0x52/0x270 <=
xfs_trans_read_buf_map+0x128/0x2a0 <= xfs_read_agi+0x8e/0x120 <=
xfs_iunlink+0x6b/0x1c0 <= xfs_remove+0x2c5/0x410 <=
xfs_vn_unlink+0x53/0xa0 <= vfs_unlink+0x113/0x280 <=
do_unlinkat+0x1ab/0x2d0 <= __x64_sys_unlink+0x3e/0x60 <=
do_syscall_64+0x3b/0x90 <= entry_SYSCALL_64_after_hwframe+0x44/0xae
1 down+0x43/0x60 <= xfs_buf_lock+0x29/0xa0 <= xfs_buf_find+0x2c4/0x590
<= xfs_buf_get_map+0x46/0x390 <= xfs_buf_read_map+0x52/0x270 <=
xfs_trans_read_buf_map+0x128/0x2a0 <= xfs_read_agf+0x87/0x110 <=
xfs_alloc_read_agf+0x34/0x1a0 <= xfs_alloc_fix_freelist+0x3d7/0x4f0 <=
xfs_free_extent_fix_freelist+0x61/0xa0 <= __xfs_free_extent+0x6b/0x1a0
<= xfs_trans_free_extent+0x3b/0xd0 <=
xfs_extent_free_finish_item+0x24/0x40 <=
xfs_defer_finish_noroll+0x1f1/0x580 <= xfs_defer_finish+0x11/0x70 <=
xfs_itruncate_extents_flags+0xc1/0x220 <=
xfs_free_eofblocks+0xbc/0x120 <= xfs_icwalk_ag+0x3a9/0x640 <=
xfs_blockgc_worker+0x31/0xf0 <= process_one_work+0x1e6/0x380 <=
worker_thread+0x50/0x3a0 <= kthread+0x127/0x150 <=
ret_from_fork+0x22/0x30
1 down+0x43/0x60 <= xfs_buf_lock+0x29/0xa0 <= xfs_buf_find+0x2c4/0x590
<= xfs_buf_get_map+0x46/0x390 <= xfs_buf_read_map+0x52/0x270 <=
xfs_trans_read_buf_map+0x128/0x2a0 <= xfs_read_agf+0x87/0x110 <=
xfs_alloc_read_agf+0x34/0x1a0 <= xfs_alloc_fix_freelist+0x3d7/0x4f0 <=
xfs_alloc_vextent+0x22b/0x440 <=
__xfs_inobt_alloc_block.isra.0+0xc5/0x1a0 <=
__xfs_btree_split+0xf2/0x610 <= xfs_btree_split+0x4b/0x100 <=
xfs_btree_make_block_unfull+0x193/0x1c0 <=
xfs_btree_insrec+0x4a9/0x5a0 <= xfs_btree_insert+0xa8/0x1f0 <=
xfs_difree_finobt+0xa4/0x240 <= xfs_difree+0x126/0x1a0 <=
xfs_ifree+0xca/0x4a0 <= xfs_inactive_ifree.isra.0+0x9e/0x1a0 <=
xfs_inactive+0xf8/0x170 <= xfs_inodegc_worker+0x73/0xf0 <=
process_one_work+0x1e6/0x380 <= worker_thread+0x50/0x3a0 <=
kthread+0x127/0x150 <= ret_from_fork+0x22/0x30
1 down+0x43/0x60 <= xfs_buf_lock+0x29/0xa0 <= xfs_buf_find+0x2c4/0x590
<= xfs_buf_get_map+0x46/0x390 <= xfs_buf_read_map+0x52/0x270 <=
xfs_trans_read_buf_map+0x128/0x2a0 <= xfs_imap_to_bp+0x4e/0x70 <=
xfs_trans_log_inode+0x1d0/0x280 <= xfs_bmap_btalloc+0x75f/0x820 <=
xfs_bmapi_allocate+0xe4/0x310 <=
xfs_bmapi_convert_delalloc+0x273/0x490 <= xfs_map_blocks+0x1b5/0x400
<= iomap_do_writepage+0x11d/0x820 <= write_cache_pages+0x189/0x3e0 <=
iomap_writepages+0x1c/0x40 <= xfs_vm_writepages+0x71/0xa0 <=
do_writepages+0xc3/0x1e0 <= __writeback_single_inode+0x37/0x270 <=
writeback_sb_inodes+0x1ed/0x420 <= __writeback_inodes_wb+0x4c/0xd0 <=
wb_writeback+0x1ba/0x270 <= wb_workfn+0x292/0x4d0 <=
process_one_work+0x1e6/0x380 <= worker_thread+0x50/0x3a0 <=
kthread+0x127/0x150 <= ret_from_fork+0x22/0x30
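
(For context: these were gathered by walking /proc and reading
/proc/<PID>/stack for tasks in 'D' state, then counting identical
traces. A minimal sketch of such a collector - written in C here for
illustration, our actual tooling is an awk script, and the exact
filtering/normalisation is an assumption - would look roughly like
this:)

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

#define MAX_STACKS	1024
#define BUFSZ		8192

static char *stacks[MAX_STACKS];
static int counts[MAX_STACKS];
static int nstacks;

/* read a whole /proc file into buf, NUL-terminated */
static int read_file(const char *path, char *buf, size_t len)
{
	FILE *f = fopen(path, "r");
	size_t n;

	if (!f)
		return -1;
	n = fread(buf, 1, len - 1, f);
	fclose(f);
	buf[n] = '\0';
	return 0;
}

int main(void)
{
	DIR *proc = opendir("/proc");
	struct dirent *de;
	char path[64], buf[BUFSZ];
	int i;

	if (!proc)
		return 1;

	while ((de = readdir(proc)) != NULL) {
		char state, *p;

		if (!isdigit((unsigned char)de->d_name[0]))
			continue;

		/* task state is the field after "(comm)" in /proc/<pid>/stat */
		snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
		if (read_file(path, buf, sizeof(buf)))
			continue;
		p = strrchr(buf, ')');
		if (!p || sscanf(p + 1, " %c", &state) != 1 || state != 'D')
			continue;

		/* reading another task's stack needs privileges */
		snprintf(path, sizeof(path), "/proc/%s/stack", de->d_name);
		if (read_file(path, buf, sizeof(buf)))
			continue;

		/* count identical stack texts */
		for (i = 0; i < nstacks; i++)
			if (!strcmp(stacks[i], buf))
				break;
		if (i == nstacks) {
			if (nstacks == MAX_STACKS)
				continue;
			stacks[nstacks] = strdup(buf);
			counts[nstacks++] = 0;
		}
		counts[i]++;
	}
	closedir(proc);

	for (i = 0; i < nstacks; i++)
		printf("%d\n%s\n", counts[i], stacks[i]);
	return 0;
}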


Is this something that has been seen before ?

The codepath from xfs_inactive() through xfs_btree_split(),
xfs_alloc_fix_freelist() to eventually xfs_buf_lock() doesn't look
very new to me (an old report showing the callpath in a different
context is, for example,
https://www.spinics.net/lists/linux-mm/msg147472.html).

But the introduction of deferred inode inactivation last summer (in
5.15) - and this path going through xfs_inodegc_worker - appears to
roughly correlate with when we started seeing these hangs.

There's no "advance warning" for these (no OOMs preceding, no disk
full) as far as we've found.

How would we go about addressing this hang ?

Thanks in advance,
Frank Hofmann


* Re: Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ?
  2022-04-04 13:16 Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ? Frank Hofmann
@ 2022-04-04 23:22 ` Dave Chinner
  2022-04-05 14:52   ` Frank Hofmann
                     ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Dave Chinner @ 2022-04-04 23:22 UTC (permalink / raw)
  To: Frank Hofmann; +Cc: linux-xfs, kernel-team

On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote:
> Hi,
> 
> we see machines getting stuck with a large number of backed-up
> processes that invoke statfs() (monitoring stuff), like:
>
> [Sat Apr  2 09:54:32 2022] INFO: task node_exporter:244222 blocked for
> more than 10 seconds.
> [Sat Apr  2 09:54:32 2022]       Tainted: G           O
> 5.15.26-cloudflare-2022.3.4 #1

Is this a vanilla kernel, or one that has been patched extensively
by cloudflare?

> [Sat Apr  2 09:54:32 2022] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [Sat Apr  2 09:54:32 2022] task:node_exporter   state:D stack:    0
> pid:244222 ppid:     1 flags:0x00004000
> [Sat Apr  2 09:54:32 2022] Call Trace:
> [Sat Apr  2 09:54:32 2022]  <TASK>
> [Sat Apr  2 09:54:32 2022]  __schedule+0x2cd/0x950
> [Sat Apr  2 09:54:32 2022]  schedule+0x44/0xa0
> [Sat Apr  2 09:54:32 2022]  schedule_timeout+0xfc/0x140
> [Sat Apr  2 09:54:32 2022]  ? try_to_wake_up+0x338/0x4e0
> [Sat Apr  2 09:54:32 2022]  ? __prepare_to_swait+0x4b/0x70
> [Sat Apr  2 09:54:32 2022]  wait_for_completion+0x86/0xe0
> [Sat Apr  2 09:54:32 2022]  flush_work+0x5c/0x80
> [Sat Apr  2 09:54:32 2022]  ? flush_workqueue_prep_pwqs+0x110/0x110
> [Sat Apr  2 09:54:32 2022]  xfs_inodegc_flush.part.0+0x3b/0x90
> [Sat Apr  2 09:54:32 2022]  xfs_fs_statfs+0x29/0x1c0
> [Sat Apr  2 09:54:32 2022]  statfs_by_dentry+0x4d/0x70
> [Sat Apr  2 09:54:32 2022]  user_statfs+0x57/0xc0
> [Sat Apr  2 09:54:32 2022]  __do_sys_statfs+0x20/0x50
> [Sat Apr  2 09:54:32 2022]  do_syscall_64+0x3b/0x90
> [Sat Apr  2 09:54:32 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [Sat Apr  2 09:54:32 2022] RIP: 0033:0x4ac9db

Waiting for background inode gc to complete.

> A linear-over-time increasing number of 'D' state processes is usually
> what alerts us to this.
> 
> The oldest thread found waiting appears always to be the inode gc
> worker doing deferred inactivation:

OK.

> This is a histogram (first column: number of processes 'D'-ed on that
> call trace) of `/proc/<PID>/stack`:

It helps greatly if you reformat the stacks back to a readable stack
(s/=>/\r/g, s/^\n//, s/^ //) so the output is easily readable.

> 1 stuck on AGF, holding AGI, inode and inode buffer locks
> 
> down+0x43/0x60 
> xfs_buf_lock+0x29/0xa0 
> xfs_buf_find+0x2c4/0x590
> xfs_buf_get_map+0x46/0x390 
> xfs_buf_read_map+0x52/0x270 
> xfs_trans_read_buf_map+0x128/0x2a0 
> xfs_read_agf+0x87/0x110 
> xfs_alloc_read_agf+0x34/0x1a0 
> xfs_alloc_fix_freelist+0x3d7/0x4f0 
> xfs_alloc_vextent+0x22b/0x440 
> __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0 
> __xfs_btree_split+0xf2/0x610 
> xfs_btree_split+0x4b/0x100 
> xfs_btree_make_block_unfull+0x193/0x1c0 
> xfs_btree_insrec+0x4a9/0x5a0 
> xfs_btree_insert+0xa8/0x1f0 
> xfs_difree_finobt+0xa4/0x240 
> xfs_difree+0x126/0x1a0 
> xfs_ifree+0xca/0x4a0 
> xfs_inactive_ifree.isra.0+0x9e/0x1a0 
> xfs_inactive+0xf8/0x170 
> xfs_inodegc_worker+0x73/0xf0 
> process_one_work+0x1e6/0x380 
> worker_thread+0x50/0x3a0 
> kthread+0x127/0x150 
> ret_from_fork+0x22/0x30
> 
> 1	stuck on inode buffer, holding inode lock, holding AGF
> 
> down+0x43/0x60 
> xfs_buf_lock+0x29/0xa0 
> xfs_buf_find+0x2c4/0x590
> xfs_buf_get_map+0x46/0x390 
> xfs_buf_read_map+0x52/0x270 
> xfs_trans_read_buf_map+0x128/0x2a0 
> xfs_imap_to_bp+0x4e/0x70 
> xfs_trans_log_inode+0x1d0/0x280 
> xfs_bmap_btalloc+0x75f/0x820 
> xfs_bmapi_allocate+0xe4/0x310 
> xfs_bmapi_convert_delalloc+0x273/0x490 
> xfs_map_blocks+0x1b5/0x400
> iomap_do_writepage+0x11d/0x820 
> write_cache_pages+0x189/0x3e0 
> iomap_writepages+0x1c/0x40 
> xfs_vm_writepages+0x71/0xa0 
> do_writepages+0xc3/0x1e0 
> __writeback_single_inode+0x37/0x270 
> writeback_sb_inodes+0x1ed/0x420 
> __writeback_inodes_wb+0x4c/0xd0 
> wb_writeback+0x1ba/0x270 
> wb_workfn+0x292/0x4d0 
> process_one_work+0x1e6/0x380 
> worker_thread+0x50/0x3a0 
> kthread+0x127/0x150 
> ret_from_fork+0x22/0x30

That's the deadlock right there.

task 1				task 2
lock inode A			lock inode A+1
lock AGF B for allocation	lock inode cluster
				remove inode from unlinked list
				free inode
				mark inode free in finobt
				  insert new rec
				    split tree
				      lock AGF B for allocation
				      <blocks waiting for task 1>
<allocate blocks>

xfs_bmap_finish
  log inode
    lock inode cluster buffer
    <blocks waiting for task 2>

So this has nothing to do with background inode inactivation. It may
have made it easier to hit, but it's definitely not *caused* by
background inodegc as these two operations have always been able to
run concurrently.
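
To make the inversion concrete, the same ABBA pattern reduced to a
minimal userspace sketch - plain pthread mutexes standing in for the
AGF and inode cluster buffer locks, purely illustrative and not XFS
code - looks like this:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t agf_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t icluster_lock = PTHREAD_MUTEX_INITIALIZER;

/* "task 1": writeback path - takes the AGF, then the inode cluster */
static void *task1(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&agf_lock);		/* xfs_alloc_fix_freelist() */
	sleep(1);				/* widen the race window */
	pthread_mutex_lock(&icluster_lock);	/* xfs_trans_log_inode() */
	puts("task1: never reached");
	pthread_mutex_unlock(&icluster_lock);
	pthread_mutex_unlock(&agf_lock);
	return NULL;
}

/* "task 2": inactivation path - takes the inode cluster, then the AGF */
static void *task2(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&icluster_lock);	/* unlinked list removal */
	sleep(1);
	pthread_mutex_lock(&agf_lock);		/* finobt split allocation */
	puts("task2: never reached");
	pthread_mutex_unlock(&agf_lock);
	pthread_mutex_unlock(&icluster_lock);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, task1, NULL);
	pthread_create(&t2, NULL, task2, NULL);
	pthread_join(t1, NULL);		/* never returns: ABBA deadlock */
	pthread_join(t2, NULL);
	return 0;
}

(Build with cc -pthread; both threads block within a second or so.)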

The likely cause is going to be the async memory reclaim work from
late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer
to the inode log item") added the xfs_imap_to_bp() call to
xfs_trans_log_inode() to pin the inode cluster buffer in memory when
the inode was first dirtied.

Essentially, the problem is that inode unlink list manipulations are
not consistently ordered with inode allocation/freeing, hence not
consistently ordered against AGI and AGF locking. I didn't realise
that there was an AGF component to this problem, otherwise I would
have sent this patch upstream much sooner:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a

That commit is dated August 2020 - about the same time that the
async memory reclaim stuff was merged. What this will do is:

task 1				task 2
lock inode A			lock inode A+1
lock AGF B for allocation
				free inode
				mark inode free in finobt
				  insert new rec
				    split tree
				      lock AGF B for allocation
				      <blocks waiting for task 1>
<allocate blocks>

xfs_bmap_finish
  log inode
    lock inode cluster buffer
    attach inode
    unlock inode cluster buffer
xfs_trans_commit
  ...
  unlock AGF B
				      <unblocks, holds AGF B>
				      <allocate blocks>
				    <completes split>
				  <completes insert>
				<completes ifree>
				lock inode cluster buffer
				remove inode from unlinked list
				xfs_trans_commit
				  ...
				  unlock AGF B
				  unlock inode cluster buffer

And so the deadlock should go away.

I've attached the current patch from my local dev tree below. Can
you try it and see if the problem goes away?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


xfs: reorder iunlink remove operation in xfs_ifree

From: Dave Chinner <dchinner@redhat.com>

The O_TMPFILE creation implementation creates a specific order of
operations for inode allocation/freeing and unlinked list
modification. Currently both are serialised by the AGI, so the order
doesn't strictly matter as long as they are both in the same
transaction.

However, if we want to move the unlinked list insertions largely
out from under the AGI lock, then we have to be concerned about the
order in which we do unlinked list modification operations.
O_TMPFILE creation tells us this order is inode allocation/free,
then unlinked list modification.

Change xfs_ifree() to use this same ordering on unlinked list
removal. This way we always guarantee that when we enter the
iunlinked list removal code from this path, we already have the AGI
locked and we don't have to worry about lock nesting AGI reads
inside unlink list locks because it's already locked and attached to
the transaction.

We can do this safely as the inode freeing and unlinked list removal
are done in the same transaction and hence are atomic operations
with respect to log recovery.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

---
 fs/xfs/xfs_inode.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index c66c9585f1ba..f8eaeb83d05e 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -2277,14 +2277,13 @@ xfs_ifree_cluster(
 }
 
 /*
- * This is called to return an inode to the inode free list.
- * The inode should already be truncated to 0 length and have
- * no pages associated with it.  This routine also assumes that
- * the inode is already a part of the transaction.
+ * This is called to return an inode to the inode free list.  The inode should
+ * already be truncated to 0 length and have no pages associated with it.  This
+ * routine also assumes that the inode is already a part of the transaction.
  *
- * The on-disk copy of the inode will have been added to the list
- * of unlinked inodes in the AGI. We need to remove the inode from
- * that list atomically with respect to freeing it here.
+ * The on-disk copy of the inode will have been added to the list of unlinked
+ * inodes in the AGI. We need to remove the inode from that list atomically with
+ * respect to freeing it here.
  */
 int
 xfs_ifree(
@@ -2306,13 +2305,16 @@ xfs_ifree(
 	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
 
 	/*
-	 * Pull the on-disk inode from the AGI unlinked list.
+	 * Free the inode first so that we guarantee that the AGI lock is going
+	 * to be taken before we remove the inode from the unlinked list. This
+	 * makes the AGI lock -> unlinked list modification order the same as
+	 * used in O_TMPFILE creation.
 	 */
-	error = xfs_iunlink_remove(tp, pag, ip);
+	error = xfs_difree(tp, pag, ip->i_ino, &xic);
 	if (error)
-		goto out;
+		return error;
 
-	error = xfs_difree(tp, pag, ip->i_ino, &xic);
+	error = xfs_iunlink_remove(tp, pag, ip);
 	if (error)
 		goto out;
 


* Re: Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ?
  2022-04-04 23:22 ` Dave Chinner
@ 2022-04-05 14:52   ` Frank Hofmann
  2022-04-15 12:39   ` Holger Hoffstätte
  2022-04-25 16:17   ` Darrick J. Wong
  2 siblings, 0 replies; 7+ messages in thread
From: Frank Hofmann @ 2022-04-05 14:52 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, kernel-team

On Tue, Apr 5, 2022 at 12:22 AM Dave Chinner <david@fromorbit.com> wrote:
>

Hi Dave,

thank you for the quick reply. See inline/below. Will test the patch
and report back.

> On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote:
> > Hi,
> >
> > we see machines getting stuck with a large number of backed-up
> > processes that invoke statfs() (monitoring stuff), like:
> >
> > [Sat Apr  2 09:54:32 2022] INFO: task node_exporter:244222 blocked for
> > more than 10 seconds.
> > [Sat Apr  2 09:54:32 2022]       Tainted: G           O
> > 5.15.26-cloudflare-2022.3.4 #1
>
> Is this a vanilla kernel, or one that has been patched extensively
> by cloudflare?

Probably best to say "by and large vanilla". We carry a single-digit
number of non-mainlined feature and driver patches that our product
and infrastructure rely on (network code features and drivers). But
we do not usually modify either kernel, mm, fs or arch subtrees at
all. We sometimes backport certain (security) changes before
linux-stable gets them "ordinarily". Our 5.15.26 has no changes in mm,
fs, arch at all, vs. linux-stable.

>
[ ... ]
>
> Waiting for background inode gc to complete.

ack.

>
> > A linear-over-time increasing number of 'D' state processes is usually
> > what alerts us to this.
> >
> > The oldest thread found waiting appears always to be the inode gc
> > worker doing deferred inactivation:
>
> OK.
>
> > This is a histogram (first column: number of processes 'D'-ed on that
> > call trace) of `/proc/<PID>/stack`:
>
> It helps greatly if you reformat the stacks back to a readable stack
> (s/=>/\r/g, s/^\n//, s/^ //) so the output is easily readable.

Apologies; my awk script to grab these turns them around for the
counting; I will change the output when I post such traces again.

>
> > 1 stuck on AGF, holding AGI, inode and inode buffer locks
> >
> > down+0x43/0x60
> > xfs_buf_lock+0x29/0xa0
> > xfs_buf_find+0x2c4/0x590
[ ... ]
> > xfs_ifree+0xca/0x4a0
> > xfs_inactive_ifree.isra.0+0x9e/0x1a0
> > xfs_inactive+0xf8/0x170
> > xfs_inodegc_worker+0x73/0xf0
> > process_one_work+0x1e6/0x380
> > worker_thread+0x50/0x3a0
> > kthread+0x127/0x150
> > ret_from_fork+0x22/0x30
> >
> > 1     stuck on inode buffer, holding inode lock, holding AGF
> >
> > down+0x43/0x60
> > xfs_buf_lock+0x29/0xa0
[ ... ]
> > xfs_map_blocks+0x1b5/0x400
> > iomap_do_writepage+0x11d/0x820
> > write_cache_pages+0x189/0x3e0
> > iomap_writepages+0x1c/0x40
> > xfs_vm_writepages+0x71/0xa0
> > do_writepages+0xc3/0x1e0
[ ... ]


I've actually dug through our archives, and found one of these that
occurred in 5.10.53, with:

down+0x3b/0x50
xfs_buf_lock+0x2c/0xb0
xfs_buf_find+0x2c4/0x590
xfs_buf_get_map+0x44/0x270
xfs_buf_read_map+0x52/0x270
xfs_trans_read_buf_map+0x126/0x290
xfs_read_agf+0x87/0x110
xfs_alloc_read_agf+0x33/0x180
xfs_alloc_fix_freelist+0x32d/0x450
xfs_alloc_vextent+0x231/0x440
__xfs_inobt_alloc_block.isra.0+0xc1/0x1a0
__xfs_btree_split+0xef/0x660
xfs_btree_split+0x4b/0x100
xfs_btree_make_block_unfull+0x193/0x1c0
xfs_btree_insrec+0x49d/0x590
xfs_btree_insert+0xa8/0x1f0
xfs_difree_finobt+0xae/0x250
xfs_difree+0x121/0x190
xfs_ifree+0x79/0x150
xfs_inactive_ifree+0xa8/0x1c0
xfs_inactive+0x9e/0x140
xfs_fs_destroy_inode+0xa8/0x1a0
destroy_inode+0x3b/0x70
do_unlinkat+0x1db/0x2f0
do_syscall_64+0x33/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9

for the caller of xfs_ifree(), and

down+0x3b/0x50
xfs_buf_lock+0x2c/0xb0
xfs_buf_find+0x2c4/0x590
xfs_buf_get_map+0x44/0x270
xfs_buf_read_map+0x52/0x270
xfs_trans_read_buf_map+0x126/0x290
xfs_imap_to_bp+0x61/0xb0
xfs_trans_log_inode+0x19f/0x250
xfs_bmap_btalloc+0x75c/0x830
xfs_bmapi_allocate+0xe4/0x310
xfs_bmapi_convert_delalloc+0x240/0x470
xfs_map_blocks+0x1b3/0x400
iomap_do_writepage+0x15e/0x870
write_cache_pages+0x186/0x3d0
iomap_writepages+0x1c/0x40
xfs_vm_writepages+0x59/0x80
do_writepages+0x34/0xc0
__writeback_single_inode+0x37/0x290
writeback_sb_inodes+0x1ed/0x430
__writeback_inodes_wb+0x4c/0xd0
wb_writeback+0x1d6/0x290
wb_workfn+0x292/0x4d0
process_one_work+0x1ab/0x340
worker_thread+0x50/0x3a0
kthread+0x11b/0x140
ret_from_fork+0x22/0x30

for the writeback worker.


> That's the deadlock right there.

And yes, you're right - the async inode gc doesn't appear to be
necessary to hit this; the above looks like the same deadlock you
describe, without the inode gc changes.

>
[ ... ]
> So this has nothing to do with background inode inactivation. It may
> have made it easier to hit, but it's definitely not *caused* by
> background inodegc as these two operations have always been able to
> run concurrently.
>
> The likely cause is going to be the async memory reclaim work from
> late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer
> to the inode log item") added the xfs_imap_to_bp() call to
> xfs_trans_log_inode() to pin the inode cluster buffer in memory when
> the inode was first dirtied.

We've been seeing these "for a while" but the rate of occurrence
didn't use to be high (0.01% of systems encountering it once a week).
At the moment, we see it around ten times as frequently as that.
Will work on a reproducer and see how that goes.
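
What I have in mind is a stress tool that drives both sides of the
inversion at once: buffered writes whose delalloc extents get
converted during writeback, plus lots of unlinks to keep the finobt
busy. Untested sketch only, no guarantee it actually reproduces this:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS	32
#define FILESIZE	(4 << 20)

static const char *dir;

/* create / buffered-write / unlink in a tight loop */
static void *churn(void *arg)
{
	long id = (long)arg, iter = 0;
	char path[256], buf[65536];

	memset(buf, 'x', sizeof(buf));
	for (;;) {
		size_t done = 0;
		int fd;

		snprintf(path, sizeof(path), "%s/f.%ld.%ld", dir, id, iter++);
		fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
		if (fd < 0)
			continue;
		while (done < FILESIZE) {
			if (write(fd, buf, sizeof(buf)) < 0)
				break;
			done += sizeof(buf);
		}
		close(fd);
		unlink(path);		/* inode free -> finobt updates */
	}
	return NULL;
}

/* periodically force writeback so delalloc conversion runs */
static void *flusher(void *arg)
{
	int fd = open(dir, O_RDONLY | O_DIRECTORY);

	(void)arg;
	if (fd < 0)
		return NULL;
	for (;;) {
		syncfs(fd);
		usleep(100 * 1000);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	pthread_t tids[NTHREADS + 1];
	long i;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <dir on the xfs fs>\n", argv[0]);
		return 1;
	}
	dir = argv[1];

	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tids[i], NULL, churn, (void *)i);
	pthread_create(&tids[NTHREADS], NULL, flusher, NULL);
	for (i = 0; i <= NTHREADS; i++)
		pthread_join(tids[i], NULL);	/* runs until interrupted */
	return 0;
}

(To be pointed at a scratch directory on the affected filesystem and
left running.)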


>
> Essentially, the problem is that inode unlink list manipulations are
> not consistently ordered with inode allocation/freeing, hence not
> consistently ordered against AGI and AGF locking. I didn't realise
> that there was an AGF component to this problem, otherwise I would
> have sent this patch upstream much sooner:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a
>
> That commit is dated August 2020 - about the same time that the
> async memory reclaim stuff was merged. What this will do is:
[ ... ]
> And so the deadlock should go away.
>
> I've attached the current patch from my local dev tree below. Can
> you try it and see if the problem goes away?
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
[ ... ]

We'll give it a spin for sure !
As said, unless I can write a testcase/reproducer, it'll have to soak
a few weeks to give us the confidence it's "done it".

Thank you for the fast response,
Frank Hofmann


* Re: Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ?
  2022-04-04 23:22 ` Dave Chinner
  2022-04-05 14:52   ` Frank Hofmann
@ 2022-04-15 12:39   ` Holger Hoffstätte
  2022-04-15 16:29     ` Darrick J. Wong
  2022-04-25 16:17   ` Darrick J. Wong
  2 siblings, 1 reply; 7+ messages in thread
From: Holger Hoffstätte @ 2022-04-15 12:39 UTC (permalink / raw)
  To: Dave Chinner, Frank Hofmann; +Cc: linux-xfs, kernel-team

On 2022-04-05 01:22, Dave Chinner wrote:
> On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote:
>> Hi,
>>
>> we see machines getting stuck with a large number of backed-up
>> processes that invoke statfs() (monitoring stuff), like:
>>
>> [Sat Apr  2 09:54:32 2022] INFO: task node_exporter:244222 blocked for
>> more than 10 seconds.
>> [Sat Apr  2 09:54:32 2022]       Tainted: G           O
>> 5.15.26-cloudflare-2022.3.4 #1
> 
> Is this a vanilla kernel, or one that has been patched extensively
> by cloudflare?
> 
>> [Sat Apr  2 09:54:32 2022] "echo 0 >
>> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>> [Sat Apr  2 09:54:32 2022] task:node_exporter   state:D stack:    0
>> pid:244222 ppid:     1 flags:0x00004000
>> [Sat Apr  2 09:54:32 2022] Call Trace:
>> [Sat Apr  2 09:54:32 2022]  <TASK>
>> [Sat Apr  2 09:54:32 2022]  __schedule+0x2cd/0x950
>> [Sat Apr  2 09:54:32 2022]  schedule+0x44/0xa0
>> [Sat Apr  2 09:54:32 2022]  schedule_timeout+0xfc/0x140
>> [Sat Apr  2 09:54:32 2022]  ? try_to_wake_up+0x338/0x4e0
>> [Sat Apr  2 09:54:32 2022]  ? __prepare_to_swait+0x4b/0x70
>> [Sat Apr  2 09:54:32 2022]  wait_for_completion+0x86/0xe0
>> [Sat Apr  2 09:54:32 2022]  flush_work+0x5c/0x80
>> [Sat Apr  2 09:54:32 2022]  ? flush_workqueue_prep_pwqs+0x110/0x110
>> [Sat Apr  2 09:54:32 2022]  xfs_inodegc_flush.part.0+0x3b/0x90
>> [Sat Apr  2 09:54:32 2022]  xfs_fs_statfs+0x29/0x1c0
>> [Sat Apr  2 09:54:32 2022]  statfs_by_dentry+0x4d/0x70
>> [Sat Apr  2 09:54:32 2022]  user_statfs+0x57/0xc0
>> [Sat Apr  2 09:54:32 2022]  __do_sys_statfs+0x20/0x50
>> [Sat Apr  2 09:54:32 2022]  do_syscall_64+0x3b/0x90
>> [Sat Apr  2 09:54:32 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>> [Sat Apr  2 09:54:32 2022] RIP: 0033:0x4ac9db
> 
> Waiting for background inode gc to complete.
> 
>> A linear-over-time increasing number of 'D' state processes is usually
>> what alerts us to this.
>>
>> The oldest thread found waiting appears always to be the inode gc
>> worker doing deferred inactivation:
> 
> OK.
> 
>> This is a histogram (first column: number of processes 'D'-ed on that
>> call trace) of `/proc/<PID>/stack`:
> 
> It helps greatly if you reformat the stacks back to a readable stack
> (s/=>/\r/g, s/^\n//, s/^ //) so the output is easily readable.
> 
>> 1 stuck on AGF, holding AGI, inode and inode buffer locks
>>
>> down+0x43/0x60
>> xfs_buf_lock+0x29/0xa0
>> xfs_buf_find+0x2c4/0x590
>> xfs_buf_get_map+0x46/0x390
>> xfs_buf_read_map+0x52/0x270
>> xfs_trans_read_buf_map+0x128/0x2a0
>> xfs_read_agf+0x87/0x110
>> xfs_alloc_read_agf+0x34/0x1a0
>> xfs_alloc_fix_freelist+0x3d7/0x4f0
>> xfs_alloc_vextent+0x22b/0x440
>> __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0
>> __xfs_btree_split+0xf2/0x610
>> xfs_btree_split+0x4b/0x100
>> xfs_btree_make_block_unfull+0x193/0x1c0
>> xfs_btree_insrec+0x4a9/0x5a0
>> xfs_btree_insert+0xa8/0x1f0
>> xfs_difree_finobt+0xa4/0x240
>> xfs_difree+0x126/0x1a0
>> xfs_ifree+0xca/0x4a0
>> xfs_inactive_ifree.isra.0+0x9e/0x1a0
>> xfs_inactive+0xf8/0x170
>> xfs_inodegc_worker+0x73/0xf0
>> process_one_work+0x1e6/0x380
>> worker_thread+0x50/0x3a0
>> kthread+0x127/0x150
>> ret_from_fork+0x22/0x30
>>
>> 1	stuck on inode buffer, holding inode lock, holding AGF
>>
>> down+0x43/0x60
>> xfs_buf_lock+0x29/0xa0
>> xfs_buf_find+0x2c4/0x590
>> xfs_buf_get_map+0x46/0x390
>> xfs_buf_read_map+0x52/0x270
>> xfs_trans_read_buf_map+0x128/0x2a0
>> xfs_imap_to_bp+0x4e/0x70
>> xfs_trans_log_inode+0x1d0/0x280
>> xfs_bmap_btalloc+0x75f/0x820
>> xfs_bmapi_allocate+0xe4/0x310
>> xfs_bmapi_convert_delalloc+0x273/0x490
>> xfs_map_blocks+0x1b5/0x400
>> iomap_do_writepage+0x11d/0x820
>> write_cache_pages+0x189/0x3e0
>> iomap_writepages+0x1c/0x40
>> xfs_vm_writepages+0x71/0xa0
>> do_writepages+0xc3/0x1e0
>> __writeback_single_inode+0x37/0x270
>> writeback_sb_inodes+0x1ed/0x420
>> __writeback_inodes_wb+0x4c/0xd0
>> wb_writeback+0x1ba/0x270
>> wb_workfn+0x292/0x4d0
>> process_one_work+0x1e6/0x380
>> worker_thread+0x50/0x3a0
>> kthread+0x127/0x150
>> ret_from_fork+0x22/0x30
> 
> That's the deadlock right there.
> 
> task 1				task 2
> lock inode A			lock inode A+1
> lock AGF B for allocation	lock inode cluster
> 				remove inode from unlinked list
> 				free inode
> 				mark inode free in finobt
> 				  insert new rec
> 				    split tree
> 				      lock AGF B for allocation
> 				      <blocks waiting for task 1>
> <allocate blocks>
> 
> xfs_bmap_finish
>    log inode
>      lock inode cluster buffer
>      <blocks waiting for task 2>
> 
> So this has nothing to do with background inode inactivation. It may
> have made it easier to hit, but it's definitely not *caused* by
> background inodegc as these two operations have always been able to
> run concurrently.
> 
> The likely cause is going to be the async memory reclaim work from
> late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer
> to the inode log item") added the xfs_imap_to_bp() call to
> xfs_trans_log_inode() to pin the inode cluster buffer in memory when
> the inode was first dirtied.
> 
> Essentially, the problem is that inode unlink list manipulations are
> not consistently ordered with inode allocation/freeing, hence not
> consistently ordered against AGI and AGF locking. I didn't realise
> that there was an AGF component to this problem, otherwise I would
> have sent this patch upstream much sooner:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a
> 
> That commit is dated August 2020 - about the same time that the
> async memory reclaim stuff was merged. What this will do is:
> 
> task 1				task 2
> lock inode A			lock inode A+1
> lock AGF B for allocation
> 				free inode
> 				mark inode free in finobt
> 				  insert new rec
> 				    split tree
> 				      lock AGF B for allocation
> 				      <blocks waiting for task 1>
> <allocate blocks>
> 
> xfs_bmap_finish
>    log inode
>      lock inode cluster buffer
>      attach inode
>      unlock inode cluster buffer
> xfs_trans_commit
>    ...
>    unlock AGF B
> 				      <unblocks, holds AGF B>
> 				      <allocate blocks>
> 				    <completes split>
> 				  <completes insert>
> 				<completes ifree>
> 				lock inode cluster buffer
> 				remove inode from unlinked list
> 				xfs_trans_commit
> 				  ...
> 				  unlock AGF B
> 				  unlock inode cluster buffer
> 
> And so the deadlock should go away.
> 
> I've attached the current patch from my local dev tree below. Can
> you try it and see if the problem goes away?
> 
> Cheers,
> 
> Dave.
> 

Just a quick note on this. While I haven't come across the reported
self-deadlock (yet?), consistent lock ordering is obviously a Good Thing
and I'd like to see this applied, so that it can be merged into 5.15+.
It applies cleanly and has not caused any unexpected new issues during
almost two weeks of continuous use.

thanks,
Holger


* Re: Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ?
  2022-04-15 12:39   ` Holger Hoffstätte
@ 2022-04-15 16:29     ` Darrick J. Wong
  0 siblings, 0 replies; 7+ messages in thread
From: Darrick J. Wong @ 2022-04-15 16:29 UTC (permalink / raw)
  To: Holger Hoffstätte
  Cc: Dave Chinner, Frank Hofmann, linux-xfs, kernel-team

On Fri, Apr 15, 2022 at 02:39:51PM +0200, Holger Hoffstätte wrote:
> On 2022-04-05 01:22, Dave Chinner wrote:
> > On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote:
> > > Hi,
> > > 
> > > we see machines getting stuck with a large number of backed-up
> > > processes that invoke statfs() (monitoring stuff), like:
> > > 
> > > [Sat Apr  2 09:54:32 2022] INFO: task node_exporter:244222 blocked for
> > > more than 10 seconds.
> > > [Sat Apr  2 09:54:32 2022]       Tainted: G           O
> > > 5.15.26-cloudflare-2022.3.4 #1
> > 
> > Is this a vanilla kernel, or one that has been patched extensively
> > by cloudflare?
> > 
> > > [Sat Apr  2 09:54:32 2022] "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [Sat Apr  2 09:54:32 2022] task:node_exporter   state:D stack:    0
> > > pid:244222 ppid:     1 flags:0x00004000
> > > [Sat Apr  2 09:54:32 2022] Call Trace:
> > > [Sat Apr  2 09:54:32 2022]  <TASK>
> > > [Sat Apr  2 09:54:32 2022]  __schedule+0x2cd/0x950
> > > [Sat Apr  2 09:54:32 2022]  schedule+0x44/0xa0
> > > [Sat Apr  2 09:54:32 2022]  schedule_timeout+0xfc/0x140
> > > [Sat Apr  2 09:54:32 2022]  ? try_to_wake_up+0x338/0x4e0
> > > [Sat Apr  2 09:54:32 2022]  ? __prepare_to_swait+0x4b/0x70
> > > [Sat Apr  2 09:54:32 2022]  wait_for_completion+0x86/0xe0
> > > [Sat Apr  2 09:54:32 2022]  flush_work+0x5c/0x80
> > > [Sat Apr  2 09:54:32 2022]  ? flush_workqueue_prep_pwqs+0x110/0x110
> > > [Sat Apr  2 09:54:32 2022]  xfs_inodegc_flush.part.0+0x3b/0x90
> > > [Sat Apr  2 09:54:32 2022]  xfs_fs_statfs+0x29/0x1c0
> > > [Sat Apr  2 09:54:32 2022]  statfs_by_dentry+0x4d/0x70
> > > [Sat Apr  2 09:54:32 2022]  user_statfs+0x57/0xc0
> > > [Sat Apr  2 09:54:32 2022]  __do_sys_statfs+0x20/0x50
> > > [Sat Apr  2 09:54:32 2022]  do_syscall_64+0x3b/0x90
> > > [Sat Apr  2 09:54:32 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [Sat Apr  2 09:54:32 2022] RIP: 0033:0x4ac9db
> > 
> > Waiting for background inode gc to complete.
> > 
> > > A linear-over-time increasing number of 'D' state processes is usually
> > > what alerts us to this.
> > > 
> > > The oldest thread found waiting appears always to be the inode gc
> > > worker doing deferred inactivation:
> > 
> > OK.
> > 
> > > This is a histogram (first column: number of processes 'D'-ed on that
> > > call trace) of `/proc/<PID>/stack`:
> > 
> > It helps greatly if you reformat the stacks back to a readable stack
> > (s/=>/\r/g, s/^\n//, s/^ //) so the output is easily readable.
> > 
> > > 1 stuck on AGF, holding AGI, inode and inode buffer locks
> > > 
> > > down+0x43/0x60
> > > xfs_buf_lock+0x29/0xa0
> > > xfs_buf_find+0x2c4/0x590
> > > xfs_buf_get_map+0x46/0x390
> > > xfs_buf_read_map+0x52/0x270
> > > xfs_trans_read_buf_map+0x128/0x2a0
> > > xfs_read_agf+0x87/0x110
> > > xfs_alloc_read_agf+0x34/0x1a0
> > > xfs_alloc_fix_freelist+0x3d7/0x4f0
> > > xfs_alloc_vextent+0x22b/0x440
> > > __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0
> > > __xfs_btree_split+0xf2/0x610
> > > xfs_btree_split+0x4b/0x100
> > > xfs_btree_make_block_unfull+0x193/0x1c0
> > > xfs_btree_insrec+0x4a9/0x5a0
> > > xfs_btree_insert+0xa8/0x1f0
> > > xfs_difree_finobt+0xa4/0x240
> > > xfs_difree+0x126/0x1a0
> > > xfs_ifree+0xca/0x4a0
> > > xfs_inactive_ifree.isra.0+0x9e/0x1a0
> > > xfs_inactive+0xf8/0x170
> > > xfs_inodegc_worker+0x73/0xf0
> > > process_one_work+0x1e6/0x380
> > > worker_thread+0x50/0x3a0
> > > kthread+0x127/0x150
> > > ret_from_fork+0x22/0x30
> > > 
> > > 1	stuck on inode buffer, holding inode lock, holding AGF
> > > 
> > > down+0x43/0x60
> > > xfs_buf_lock+0x29/0xa0
> > > xfs_buf_find+0x2c4/0x590
> > > xfs_buf_get_map+0x46/0x390
> > > xfs_buf_read_map+0x52/0x270
> > > xfs_trans_read_buf_map+0x128/0x2a0
> > > xfs_imap_to_bp+0x4e/0x70
> > > xfs_trans_log_inode+0x1d0/0x280
> > > xfs_bmap_btalloc+0x75f/0x820
> > > xfs_bmapi_allocate+0xe4/0x310
> > > xfs_bmapi_convert_delalloc+0x273/0x490
> > > xfs_map_blocks+0x1b5/0x400
> > > iomap_do_writepage+0x11d/0x820
> > > write_cache_pages+0x189/0x3e0
> > > iomap_writepages+0x1c/0x40
> > > xfs_vm_writepages+0x71/0xa0
> > > do_writepages+0xc3/0x1e0
> > > __writeback_single_inode+0x37/0x270
> > > writeback_sb_inodes+0x1ed/0x420
> > > __writeback_inodes_wb+0x4c/0xd0
> > > wb_writeback+0x1ba/0x270
> > > wb_workfn+0x292/0x4d0
> > > process_one_work+0x1e6/0x380
> > > worker_thread+0x50/0x3a0
> > > kthread+0x127/0x150
> > > ret_from_fork+0x22/0x30
> > 
> > That's the deadlock right there.
> > 
> > task 1				task 2
> > lock inode A			lock inode A+1
> > lock AGF B for allocation	lock inode cluster
> > 				remove inode from unlinked list
> > 				free inode
> > 				mark inode free in finobt
> > 				  insert new rec
> > 				    split tree
> > 				      lock AGF B for allocation
> > 				      <blocks waiting for task 1>
> > <allocate blocks>
> > 
> > xfs_bmap_finish
> >    log inode
> >      lock inode cluster buffer
> >      <blocks waiting for task 2>
> > 
> > So this has nothing to do with background inode inactivation. It may
> > have made it easier to hit, but it's definitely not *caused* by
> > background inodegc as these two operations have always been able to
> > run concurrently.
> > 
> > The likely cause is going to be the async memory reclaim work from
> > late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer
> > to the inode log item") added the xfs_imap_to_bp() call to
> > xfs_trans_log_inode() to pin the inode cluster buffer in memory when
> > the inode was first dirtied.
> > 
> > Essentially, the problem is that inode unlink list manipulations are
> > not consistently ordered with inode allocation/freeing, hence not
> > consistently ordered against AGI and AGF locking. I didn't realise
> > that there was an AGF component to this problem, otherwise I would
> > have sent this patch upstream much sooner:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a
> > 
> > That commit is dated August 2020 - about the same time that the
> > async memory reclaim stuff was merged. What this will do is:
> > 
> > task 1				task 2
> > lock inode A			lock inode A+1
> > lock AGF B for allocation
> > 				free inode
> > 				mark inode free in finobt
> > 				  insert new rec
> > 				    split tree
> > 				      lock AGF B for allocation
> > 				      <blocks waiting for task 1>
> > <allocate blocks>
> > 
> > xfs_bmap_finish
> >    log inode
> >      lock inode cluster buffer
> >      attach inode
> >      unlock inode cluster buffer
> > xfs_trans_commit
> >    ...
> >    unlock AGF B
> > 				      <unblocks, holds AGF B>
> > 				      <allocate blocks>
> > 				    <completes split>
> > 				  <completes insert>
> > 				<completes ifree>
> > 				lock inode cluster buffer
> > 				remove inode from unlinked list
> > 				xfs_trans_commit
> > 				  ...
> > 				  unlock AGF B
> > 				  unlock inode cluster buffer
> > 
> > And so the deadlock should go away.
> > 
> > I've attached the current patch from my local dev tree below. Can
> > you try it and see if the problem goes away?
> > 
> > Cheers,
> > 
> > Dave.
> > 
> 
> Just a quick note on this. While I haven't come across the reported
> self-deadlock (yet?), consistent lock ordering is obviously a Good Thing
> and I'd like to see this applied, so that it can be merged into 5.15+.
> It applies cleanly and has not caused any unexpected new issues during
> almost two weeks of continuous use.

I just hit this for the first time ever, so I guess I'll add this patch
and see if anything shakes out.

--D

> 
> thanks,
> Holger


* Re: Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ?
  2022-04-04 23:22 ` Dave Chinner
  2022-04-05 14:52   ` Frank Hofmann
  2022-04-15 12:39   ` Holger Hoffstätte
@ 2022-04-25 16:17   ` Darrick J. Wong
  2022-04-25 16:19     ` Frank Hofmann
  2 siblings, 1 reply; 7+ messages in thread
From: Darrick J. Wong @ 2022-04-25 16:17 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Frank Hofmann, linux-xfs, kernel-team

On Tue, Apr 05, 2022 at 09:22:04AM +1000, Dave Chinner wrote:
> On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote:
> > Hi,
> > 
> > we see machines getting stuck with a large number of backed-up
> > processes that invoke statfs() (monitoring stuff), like:
> >
> > [Sat Apr  2 09:54:32 2022] INFO: task node_exporter:244222 blocked for
> > more than 10 seconds.
> > [Sat Apr  2 09:54:32 2022]       Tainted: G           O
> > 5.15.26-cloudflare-2022.3.4 #1
> 
> Is this a vanilla kernel, or one that has been patched extensively
> by cloudflare?
> 
> > [Sat Apr  2 09:54:32 2022] "echo 0 >
> > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > [Sat Apr  2 09:54:32 2022] task:node_exporter   state:D stack:    0
> > pid:244222 ppid:     1 flags:0x00004000
> > [Sat Apr  2 09:54:32 2022] Call Trace:
> > [Sat Apr  2 09:54:32 2022]  <TASK>
> > [Sat Apr  2 09:54:32 2022]  __schedule+0x2cd/0x950
> > [Sat Apr  2 09:54:32 2022]  schedule+0x44/0xa0
> > [Sat Apr  2 09:54:32 2022]  schedule_timeout+0xfc/0x140
> > [Sat Apr  2 09:54:32 2022]  ? try_to_wake_up+0x338/0x4e0
> > [Sat Apr  2 09:54:32 2022]  ? __prepare_to_swait+0x4b/0x70
> > [Sat Apr  2 09:54:32 2022]  wait_for_completion+0x86/0xe0
> > [Sat Apr  2 09:54:32 2022]  flush_work+0x5c/0x80
> > [Sat Apr  2 09:54:32 2022]  ? flush_workqueue_prep_pwqs+0x110/0x110
> > [Sat Apr  2 09:54:32 2022]  xfs_inodegc_flush.part.0+0x3b/0x90
> > [Sat Apr  2 09:54:32 2022]  xfs_fs_statfs+0x29/0x1c0
> > [Sat Apr  2 09:54:32 2022]  statfs_by_dentry+0x4d/0x70
> > [Sat Apr  2 09:54:32 2022]  user_statfs+0x57/0xc0
> > [Sat Apr  2 09:54:32 2022]  __do_sys_statfs+0x20/0x50
> > [Sat Apr  2 09:54:32 2022]  do_syscall_64+0x3b/0x90
> > [Sat Apr  2 09:54:32 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [Sat Apr  2 09:54:32 2022] RIP: 0033:0x4ac9db
> 
> Waiting for background inode gc to complete.
> 
> > A linear-over-time increasing number of 'D' state processes is usually
> > what alerts us to this.
> > 
> > The oldest thread found waiting appears always to be the inode gc
> > worker doing deferred inactivation:
> 
> OK.
> 
> > This is a histogram (first column: number of processes 'D'-ed on that
> > call trace) of `/proc/<PID>/stack`:
> 
> It helps greatly if you reformat the stacks back to a readable stack
> (s/=>/\r/g, s/^\n//, s/^ //) so the output is easily readable.
> 
> > 1 stuck on AGF, holding AGI, inode and inode buffer locks
> > 
> > down+0x43/0x60 
> > xfs_buf_lock+0x29/0xa0 
> > xfs_buf_find+0x2c4/0x590
> > xfs_buf_get_map+0x46/0x390 
> > xfs_buf_read_map+0x52/0x270 
> > xfs_trans_read_buf_map+0x128/0x2a0 
> > xfs_read_agf+0x87/0x110 
> > xfs_alloc_read_agf+0x34/0x1a0 
> > xfs_alloc_fix_freelist+0x3d7/0x4f0 
> > xfs_alloc_vextent+0x22b/0x440 
> > __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0 
> > __xfs_btree_split+0xf2/0x610 
> > xfs_btree_split+0x4b/0x100 
> > xfs_btree_make_block_unfull+0x193/0x1c0 
> > xfs_btree_insrec+0x4a9/0x5a0 
> > xfs_btree_insert+0xa8/0x1f0 
> > xfs_difree_finobt+0xa4/0x240 
> > xfs_difree+0x126/0x1a0 
> > xfs_ifree+0xca/0x4a0 
> > xfs_inactive_ifree.isra.0+0x9e/0x1a0 
> > xfs_inactive+0xf8/0x170 
> > xfs_inodegc_worker+0x73/0xf0 
> > process_one_work+0x1e6/0x380 
> > worker_thread+0x50/0x3a0 
> > kthread+0x127/0x150 
> > ret_from_fork+0x22/0x30
> > 
> > 1	stuck on inode buffer, holding inode lock, holding AGF
> > 
> > down+0x43/0x60 
> > xfs_buf_lock+0x29/0xa0 
> > xfs_buf_find+0x2c4/0x590
> > xfs_buf_get_map+0x46/0x390 
> > xfs_buf_read_map+0x52/0x270 
> > xfs_trans_read_buf_map+0x128/0x2a0 
> > xfs_imap_to_bp+0x4e/0x70 
> > xfs_trans_log_inode+0x1d0/0x280 
> > xfs_bmap_btalloc+0x75f/0x820 
> > xfs_bmapi_allocate+0xe4/0x310 
> > xfs_bmapi_convert_delalloc+0x273/0x490 
> > xfs_map_blocks+0x1b5/0x400
> > iomap_do_writepage+0x11d/0x820 
> > write_cache_pages+0x189/0x3e0 
> > iomap_writepages+0x1c/0x40 
> > xfs_vm_writepages+0x71/0xa0 
> > do_writepages+0xc3/0x1e0 
> > __writeback_single_inode+0x37/0x270 
> > writeback_sb_inodes+0x1ed/0x420 
> > __writeback_inodes_wb+0x4c/0xd0 
> > wb_writeback+0x1ba/0x270 
> > wb_workfn+0x292/0x4d0 
> > process_one_work+0x1e6/0x380 
> > worker_thread+0x50/0x3a0 
> > kthread+0x127/0x150 
> > ret_from_fork+0x22/0x30
> 
> That's the deadlock right there.
> 
> task 1				task 2
> lock inode A			lock inode A+1
> lock AGF B for allocation	lock inode cluster
> 				remove inode from unlinked list
> 				free inode
> 				mark inode free in finobt
> 				  insert new rec
> 				    split tree
> 				      lock AGF B for allocation
> 				      <blocks waiting for task 1>
> <allocate blocks>
> 
> xfs_bmap_finish
>   log inode
>     lock inode cluster buffer
>     <blocks waiting for task 2>
> 
> So this has nothing to do with background inode inactivation. It may
> have made it easier to hit, but it's definitely not *caused* by
> background inodegc as these two operations have always been able to
> run concurrently.
> 
> The likely cause is going to be the async memory reclaim work from
> late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer
> to the inode log item") added the xfs_imap_to_bp() call to
> xfs_trans_log_inode() to pin the inode cluster buffer in memory when
> the inode was first dirtied.
> 
> Essentially, the problem is that inode unlink list manipulations are
> not consistently ordered with inode allocation/freeing, hence not
> consistently ordered against AGI and AGF locking. I didn't realise
> that there was an AGF component to this problem, otherwise I would
> have sent this patch upstream much sooner:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a
> 
> That commit is dated August 2020 - about the same time that the
> async memory reclaim stuff was merged. What this will do is:
> 
> task 1				task 2
> lock inode A			lock inode A+1
> lock AGF B for allocation
> 				free inode
> 				mark inode free in finobt
> 				  insert new rec
> 				    split tree
> 				      lock AGF B for allocation
> 				      <blocks waiting for task 1>
> <allocate blocks>
> 
> xfs_bmap_finish
>   log inode
>     lock inode cluster buffer
>     attach inode
>     unlock inode cluster buffer
> xfs_trans_commit
>   ...
>   unlock AGF B
> 				      <unblocks, holds AGF B>
> 				      <allocate blocks>
> 				    <completes split>
> 				  <completes insert>
> 				<completes ifree>
> 				lock inode cluster buffer
> 				remove inode from unlinked list
> 				xfs_trans_commit
> 				  ...
> 				  unlock AGF B
> 				  unlock inode cluster buffer
> 
> And so the deadlock should go away.
> 
> I've attached the current patch from my local dev tree below. Can
> you try it and see if the problem goes away?
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> xfs: reorder iunlink remove operation in xfs_ifree
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> The O_TMPFILE creation implementation creates a specific order of
> operations for inode allocation/freeing and unlinked list
> modification. Currently both are serialised by the AGI, so the order
> doesn't strictly matter as long as they are both in the same
> transaction.
> 
> However, if we want to move the unlinked list insertions largely
> out from under the AGI lock, then we have to be concerned about the
> order in which we do unlinked list modification operations.
> O_TMPFILE creation tells us this order is inode allocation/free,
> then unlinked list modification.
> 
> Change xfs_ifree() to use this same ordering on unlinked list
> removal. This way we always guarantee that when we enter the
> iunlinked list removal code from this path, we already have the AGI
> locked and we don't have to worry about lock nesting AGI reads
> inside unlink list locks because it's already locked and attached to
> the transaction.
> 
> We can do this safely as the inode freeing and unlinked list removal
> are done in the same transaction and hence are atomic operations
> with respect to log recovery.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

No weird problems observed due to /this/ patch, so:

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> 
> ---
>  fs/xfs/xfs_inode.c | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index c66c9585f1ba..f8eaeb83d05e 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -2277,14 +2277,13 @@ xfs_ifree_cluster(
>  }
>  
>  /*
> - * This is called to return an inode to the inode free list.
> - * The inode should already be truncated to 0 length and have
> - * no pages associated with it.  This routine also assumes that
> - * the inode is already a part of the transaction.
> + * This is called to return an inode to the inode free list.  The inode should
> + * already be truncated to 0 length and have no pages associated with it.  This
> + * routine also assumes that the inode is already a part of the transaction.
>   *
> - * The on-disk copy of the inode will have been added to the list
> - * of unlinked inodes in the AGI. We need to remove the inode from
> - * that list atomically with respect to freeing it here.
> + * The on-disk copy of the inode will have been added to the list of unlinked
> + * inodes in the AGI. We need to remove the inode from that list atomically with
> + * respect to freeing it here.
>   */
>  int
>  xfs_ifree(
> @@ -2306,13 +2305,16 @@ xfs_ifree(
>  	pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
>  
>  	/*
> -	 * Pull the on-disk inode from the AGI unlinked list.
> +	 * Free the inode first so that we guarantee that the AGI lock is going
> +	 * to be taken before we remove the inode from the unlinked list. This
> +	 * makes the AGI lock -> unlinked list modification order the same as
> +	 * used in O_TMPFILE creation.
>  	 */
> -	error = xfs_iunlink_remove(tp, pag, ip);
> +	error = xfs_difree(tp, pag, ip->i_ino, &xic);
>  	if (error)
> -		goto out;
> +		return error;
>  
> -	error = xfs_difree(tp, pag, ip->i_ino, &xic);
> +	error = xfs_iunlink_remove(tp, pag, ip);
>  	if (error)
>  		goto out;
>  


* Re: Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ?
  2022-04-25 16:17   ` Darrick J. Wong
@ 2022-04-25 16:19     ` Frank Hofmann
  0 siblings, 0 replies; 7+ messages in thread
From: Frank Hofmann @ 2022-04-25 16:19 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, kernel-team

As original poster: Confirmed. We've not seen detrimental effects of it either.

On Mon, Apr 25, 2022 at 5:17 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Apr 05, 2022 at 09:22:04AM +1000, Dave Chinner wrote:
> > On Mon, Apr 04, 2022 at 02:16:23PM +0100, Frank Hofmann wrote:
> > > Hi,
> > >
> > > we see machines getting stuck with a large number of backed-up
> > > processes that invoke statfs() (monitoring stuff), like:
> > >
> > > [Sat Apr  2 09:54:32 2022] INFO: task node_exporter:244222 blocked for
> > > more than 10 seconds.
> > > [Sat Apr  2 09:54:32 2022]       Tainted: G           O
> > > 5.15.26-cloudflare-2022.3.4 #1
> >
> > Is this a vanilla kernel, or one that has been patched extensively
> > by cloudflare?
> >
> > > [Sat Apr  2 09:54:32 2022] "echo 0 >
> > > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > [Sat Apr  2 09:54:32 2022] task:node_exporter   state:D stack:    0
> > > pid:244222 ppid:     1 flags:0x00004000
> > > [Sat Apr  2 09:54:32 2022] Call Trace:
> > > [Sat Apr  2 09:54:32 2022]  <TASK>
> > > [Sat Apr  2 09:54:32 2022]  __schedule+0x2cd/0x950
> > > [Sat Apr  2 09:54:32 2022]  schedule+0x44/0xa0
> > > [Sat Apr  2 09:54:32 2022]  schedule_timeout+0xfc/0x140
> > > [Sat Apr  2 09:54:32 2022]  ? try_to_wake_up+0x338/0x4e0
> > > [Sat Apr  2 09:54:32 2022]  ? __prepare_to_swait+0x4b/0x70
> > > [Sat Apr  2 09:54:32 2022]  wait_for_completion+0x86/0xe0
> > > [Sat Apr  2 09:54:32 2022]  flush_work+0x5c/0x80
> > > [Sat Apr  2 09:54:32 2022]  ? flush_workqueue_prep_pwqs+0x110/0x110
> > > [Sat Apr  2 09:54:32 2022]  xfs_inodegc_flush.part.0+0x3b/0x90
> > > [Sat Apr  2 09:54:32 2022]  xfs_fs_statfs+0x29/0x1c0
> > > [Sat Apr  2 09:54:32 2022]  statfs_by_dentry+0x4d/0x70
> > > [Sat Apr  2 09:54:32 2022]  user_statfs+0x57/0xc0
> > > [Sat Apr  2 09:54:32 2022]  __do_sys_statfs+0x20/0x50
> > > [Sat Apr  2 09:54:32 2022]  do_syscall_64+0x3b/0x90
> > > [Sat Apr  2 09:54:32 2022]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [Sat Apr  2 09:54:32 2022] RIP: 0033:0x4ac9db
> >
> > Waiting for background inode gc to complete.
> >
> > > A linear-over-time increasing number of 'D' state processes is usually
> > > what alerts us to this.
> > >
> > > The oldest thread found waiting appears always to be the inode gc
> > > worker doing deferred inactivation:
> >
> > OK.
> >
> > > This is a histogram (first column: number of processes 'D'-ed on that
> > > call trace) of `/proc/<PID>/stack`:
> >
> > It helps greatly if you reformat the stacks back to a readable stack
> > (s/=>/\r/g, s/^\n//, s/^ //) so the output is easily readable.
> >
> > > 1 stuck on AGF, holding AGI, inode and inode buffer locks
> > >
> > > down+0x43/0x60
> > > xfs_buf_lock+0x29/0xa0
> > > xfs_buf_find+0x2c4/0x590
> > > xfs_buf_get_map+0x46/0x390
> > > xfs_buf_read_map+0x52/0x270
> > > xfs_trans_read_buf_map+0x128/0x2a0
> > > xfs_read_agf+0x87/0x110
> > > xfs_alloc_read_agf+0x34/0x1a0
> > > xfs_alloc_fix_freelist+0x3d7/0x4f0
> > > xfs_alloc_vextent+0x22b/0x440
> > > __xfs_inobt_alloc_block.isra.0+0xc5/0x1a0
> > > __xfs_btree_split+0xf2/0x610
> > > xfs_btree_split+0x4b/0x100
> > > xfs_btree_make_block_unfull+0x193/0x1c0
> > > xfs_btree_insrec+0x4a9/0x5a0
> > > xfs_btree_insert+0xa8/0x1f0
> > > xfs_difree_finobt+0xa4/0x240
> > > xfs_difree+0x126/0x1a0
> > > xfs_ifree+0xca/0x4a0
> > > xfs_inactive_ifree.isra.0+0x9e/0x1a0
> > > xfs_inactive+0xf8/0x170
> > > xfs_inodegc_worker+0x73/0xf0
> > > process_one_work+0x1e6/0x380
> > > worker_thread+0x50/0x3a0
> > > kthread+0x127/0x150
> > > ret_from_fork+0x22/0x30
> > >
> > > 1   stuck on inode buffer, holding inode lock, holding AGF
> > >
> > > down+0x43/0x60
> > > xfs_buf_lock+0x29/0xa0
> > > xfs_buf_find+0x2c4/0x590
> > > xfs_buf_get_map+0x46/0x390
> > > xfs_buf_read_map+0x52/0x270
> > > xfs_trans_read_buf_map+0x128/0x2a0
> > > xfs_imap_to_bp+0x4e/0x70
> > > xfs_trans_log_inode+0x1d0/0x280
> > > xfs_bmap_btalloc+0x75f/0x820
> > > xfs_bmapi_allocate+0xe4/0x310
> > > xfs_bmapi_convert_delalloc+0x273/0x490
> > > xfs_map_blocks+0x1b5/0x400
> > > iomap_do_writepage+0x11d/0x820
> > > write_cache_pages+0x189/0x3e0
> > > iomap_writepages+0x1c/0x40
> > > xfs_vm_writepages+0x71/0xa0
> > > do_writepages+0xc3/0x1e0
> > > __writeback_single_inode+0x37/0x270
> > > writeback_sb_inodes+0x1ed/0x420
> > > __writeback_inodes_wb+0x4c/0xd0
> > > wb_writeback+0x1ba/0x270
> > > wb_workfn+0x292/0x4d0
> > > process_one_work+0x1e6/0x380
> > > worker_thread+0x50/0x3a0
> > > kthread+0x127/0x150
> > > ret_from_fork+0x22/0x30
> >
> > That's the deadlock right there.
> >
> > task 1                                task 2
> > lock inode A                  lock inode A+1
> > lock AGF B for allocation     lock inode cluster
> >                               remove inode from unlinked list
> >                               free inode
> >                               mark inode free in finobt
> >                                 insert new rec
> >                                   split tree
> >                                     lock AGF B for allocation
> >                                     <blocks waiting for task 1>
> > <allocate blocks>
> >
> > xfs_bmap_finish
> >   log inode
> >     lock inode cluster buffer
> >     <blocks waiting for task 2>
> >
> > So this has nothing to do with background inode inactivation. It may
> > have made it easier to hit, but it's definitely not *caused* by
> > background inodegc as these two operations have always been able to
> > run concurrently.
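
(Not part of Dave's analysis, just to make the lock inversion concrete:
a minimal user-space sketch of the same ABBA pattern, with two pthread
mutexes standing in for the AGF buffer lock and the inode cluster
buffer lock. Whenever both threads manage to take their first lock,
each then blocks forever on the lock the other one holds.)

/* Illustrative only, not XFS code: build with cc -pthread abba.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t agf = PTHREAD_MUTEX_INITIALIZER;       /* "AGF B" */
static pthread_mutex_t icluster = PTHREAD_MUTEX_INITIALIZER;  /* inode cluster buffer */

/* task 1: allocation path, takes the AGF first, then the cluster buffer */
static void *task1(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&agf);          /* lock AGF B for allocation */
	usleep(1000);                      /* widen the race window */
	pthread_mutex_lock(&icluster);     /* log inode -> lock cluster buffer */
	printf("task1: got both locks\n");
	pthread_mutex_unlock(&icluster);
	pthread_mutex_unlock(&agf);
	return NULL;
}

/* task 2: inode free path, takes the cluster buffer first, then the AGF */
static void *task2(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&icluster);     /* remove inode from unlinked list */
	usleep(1000);
	pthread_mutex_lock(&agf);          /* finobt split -> lock AGF B */
	printf("task2: got both locks\n");
	pthread_mutex_unlock(&agf);
	pthread_mutex_unlock(&icluster);
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, task1, NULL);
	pthread_create(&t2, NULL, task2, NULL);
	pthread_join(t1, NULL);            /* hangs once the ABBA ordering bites */
	pthread_join(t2, NULL);
	return 0;
}
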
> >
> > The likely cause is going to be the async memory reclaim work from
> > late June 2020. Commit 298f7bec503f ("xfs: pin inode backing buffer
> > to the inode log item") added the xfs_imap_to_bp() call to
> > xfs_trans_log_inode() to pin the inode cluster buffer in memory when
> > the inode was first dirtied.
> >
> > Essentially, the problem is that inode unlink list manipulations are
> > not consistently ordered with inode allocation/freeing, hence not
> > consistently ordered against AGI and AGF locking. I didn't realise
> > that there was an AGF component to this problem, otherwise I would
> > have sent this patch upstream much sooner:
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-iunlink-item-2&id=17b71a2fba3549ea55e8bef764532fd42be1213a
> >
> > That commit is dated August 2020 - about the same time that the
> > async memory reclaim stuff was merged. What this will do is:
> >
> > task 1                                task 2
> > lock inode A                  lock inode A+1
> > lock AGF B for allocation
> >                               free inode
> >                               mark inode free in finobt
> >                                 insert new rec
> >                                   split tree
> >                                     lock AGF B for allocation
> >                                     <blocks waiting for task 1>
> > <allocate blocks>
> >
> > xfs_bmap_finish
> >   log inode
> >     lock inode cluster buffer
> >     attach inode
> >     unlock inode cluster buffer
> > xfs_trans_commit
> >   ...
> >   unlock AGF B
> >                                     <unblocks, holds AGF B>
> >                                     <allocate blocks>
> >                                   <completes split>
> >                                 <completes insert>
> >                               <completes ifree>
> >                               lock inode cluster buffer
> >                               remove inode from unlinked list
> >                               xfs_trans_commit
> >                                 ...
> >                                 unlock AGF B
> >                                 unlock inode cluster buffer
> >
> > And so the deadlock should go away.
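
(Same caveat as above: to mirror the post-patch ordering in that
user-space sketch, swap in the drop-in task2() below. Both paths then
take the AGF lock before the inode cluster buffer lock, so the worst
case is one thread waiting for the other to finish rather than a
deadlock.)

/* Drop-in replacement for task2() in the earlier sketch: with the
 * ordering fixed, both threads take "agf" before "icluster", so there
 * is no inversion and the program always runs to completion.
 */
static void *task2(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&agf);          /* finobt split: lock AGF B (may wait for task1) */
	usleep(1000);
	pthread_mutex_lock(&icluster);     /* only now touch the unlinked list */
	printf("task2: got both locks\n");
	pthread_mutex_unlock(&icluster);
	pthread_mutex_unlock(&agf);
	return NULL;
}
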
> >
> > I've attached the current patch from my local dev tree below. Can
> > you try it and see if the problem goes away?
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > david@fromorbit.com
> >
> >
> > xfs: reorder iunlink remove operation in xfs_ifree
> >
> > From: Dave Chinner <dchinner@redhat.com>
> >
> > The O_TMPFILE creation implementation creates a specific order of
> > operations for inode allocation/freeing and unlinked list
> > modification. Currently both are serialised by the AGI, so the order
> > doesn't strictly matter as long as they are both in the same
> > transaction.
> >
> > However, if we want to move the unlinked list insertions largely
> > out from under the AGI lock, then we have to be concerned about the
> > order in which we do unlinked list modification operations.
> > O_TMPFILE creation tells us this order is inode allocation/free,
> > then unlinked list modification.
> >
> > Change xfs_ifree() to use this same ordering on unlinked list
> > removal. This way we always guarantee that when we enter the
> > unlinked list removal code from this path, we already have the AGI
> > locked and we don't have to worry about nesting AGI reads inside
> > unlinked list locks, because the AGI is already locked and attached
> > to the transaction.
> >
> > We can do this safely as the inode freeing and unlinked list removal
> > are done in the same transaction and hence are atomic operations
> > with respect to log recovery.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
>
> No weird problems observed due to /this/ patch, so:
>
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>
>
> --D
>
> >
> > ---
> >  fs/xfs/xfs_inode.c | 24 +++++++++++++-----------
> >  1 file changed, 13 insertions(+), 11 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index c66c9585f1ba..f8eaeb83d05e 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -2277,14 +2277,13 @@ xfs_ifree_cluster(
> >  }
> >
> >  /*
> > - * This is called to return an inode to the inode free list.
> > - * The inode should already be truncated to 0 length and have
> > - * no pages associated with it.  This routine also assumes that
> > - * the inode is already a part of the transaction.
> > + * This is called to return an inode to the inode free list.  The inode should
> > + * already be truncated to 0 length and have no pages associated with it.  This
> > + * routine also assumes that the inode is already a part of the transaction.
> >   *
> > - * The on-disk copy of the inode will have been added to the list
> > - * of unlinked inodes in the AGI. We need to remove the inode from
> > - * that list atomically with respect to freeing it here.
> > + * The on-disk copy of the inode will have been added to the list of unlinked
> > + * inodes in the AGI. We need to remove the inode from that list atomically with
> > + * respect to freeing it here.
> >   */
> >  int
> >  xfs_ifree(
> > @@ -2306,13 +2305,16 @@ xfs_ifree(
> >       pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino));
> >
> >       /*
> > -      * Pull the on-disk inode from the AGI unlinked list.
> > +      * Free the inode first so that we guarantee that the AGI lock is going
> > +      * to be taken before we remove the inode from the unlinked list. This
> > +      * makes the AGI lock -> unlinked list modification order the same as
> > +      * used in O_TMPFILE creation.
> >        */
> > -     error = xfs_iunlink_remove(tp, pag, ip);
> > +     error = xfs_difree(tp, pag, ip->i_ino, &xic);
> >       if (error)
> > -             goto out;
> > +             return error;
> >
> > -     error = xfs_difree(tp, pag, ip->i_ino, &xic);
> > +     error = xfs_iunlink_remove(tp, pag, ip);
> >       if (error)
> >               goto out;
> >

end of thread, other threads:[~2022-04-25 16:19 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
2022-04-04 13:16 Self-deadlock (?) in xfs_inodegc_worker / xfs_inactive ? Frank Hofmann
2022-04-04 23:22 ` Dave Chinner
2022-04-05 14:52   ` Frank Hofmann
2022-04-15 12:39   ` Holger Hoffstätte
2022-04-15 16:29     ` Darrick J. Wong
2022-04-25 16:17   ` Darrick J. Wong
2022-04-25 16:19     ` Frank Hofmann
