* generic/475 deadlock?
@ 2019-03-20  5:04 Darrick J. Wong
  2019-03-20 17:03 ` Brian Foster
  2019-03-20 21:39 ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2019-03-20  5:04 UTC (permalink / raw)
  To: xfs

Hmmm.

Every now and then I see a generic/475 deadlock that generates the
hangcheck warning pasted below.

I /think/ this is ... the ail is processing an inode log item, for which
it locked the cluster buffer and pushed the cil to unpin the buffer.
However, the cil is cleaning up after the shut down and is trying to
simulate an EIO completion, but tries to grab the buffer lock and hence
the cil and ail deadlock.  Maybe the solution is to trylock in the
(freed && remove) case of xfs_buf_item_unpin, since we're tearing the
whole system down anyway?

Anyway, tired, going to bed now, really just backing up my brain to the
internet... :)

--D

INFO: task kworker/1:4:4668 blocked for more than 61 seconds.
      Not tainted 5.1.0-rc1-xfsx #rc1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/1:4     D11632  4668      2 0x80000000
Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
Call Trace:
 ? __schedule+0x309/0x9e0
 schedule+0x36/0x90
 schedule_timeout+0x1fd/0x500
 ? __next_timer_interrupt+0xc0/0xc0
 ? __down_timeout+0x9b/0x120
 __down_timeout+0xc3/0x120
 down_timeout+0x43/0x50
 xfs_buf_lock+0x7a/0x280 [xfs]
 xfs_buf_item_unpin+0x300/0x4a0 [xfs]
 xfs_trans_committed_bulk+0x105/0x3e0 [xfs]
 xlog_cil_committed+0x4b/0x410 [xfs]
 ? mark_held_locks+0x47/0x70
 ? xlog_cil_push+0x3f3/0x460 [xfs]
 ? xlog_cil_push+0x3f3/0x460 [xfs]
 ? rcu_read_lock_sched_held+0x74/0x80
 xlog_cil_push+0x400/0x460 [xfs]
 process_one_work+0x252/0x600
 worker_thread+0x1d9/0x390
 ? wq_calc_node_cpumask+0x80/0x80
 kthread+0x11f/0x140
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x24/0x30
INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds.
      Not tainted 5.1.0-rc1-xfsx #rc1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfsaild/dm-0    D13752  7063      2 0x80000000
Call Trace:
 ? __schedule+0x309/0x9e0
 ? wait_for_completion+0x12b/0x1c0
 schedule+0x36/0x90
 schedule_timeout+0x32a/0x500
 ? wait_for_completion+0x12b/0x1c0
 wait_for_completion+0x153/0x1c0
 ? wake_up_q+0x80/0x80
 __flush_work+0x1e1/0x300
 ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
 ? wait_for_completion+0x43/0x1c0
 xlog_cil_force_lsn+0x97/0x270 [xfs]
 ? xfs_iflush+0x38d/0x470 [xfs]
 xfs_log_force+0xff/0x4f0 [xfs]
 ? xfs_dinode_calc_crc+0x3b/0x60 [xfs]
 ? xfs_iflush_int+0x24d/0x570 [xfs]
 xfs_iflush+0x38d/0x470 [xfs]
 xfs_inode_item_push+0xe6/0x200 [xfs]
 xfsaild+0x50d/0x1210 [xfs]
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0xe/0xd0
 ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
 kthread+0x11f/0x140
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x24/0x30
INFO: task umount:7089 blocked for more than 61 seconds.
      Not tainted 5.1.0-rc1-xfsx #rc1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount          D13312  7089   6414 0x00000000
Call Trace:
 ? __schedule+0x309/0x9e0
 ? wait_for_completion+0x12b/0x1c0
 schedule+0x36/0x90
 schedule_timeout+0x32a/0x500
 ? wait_for_completion+0x12b/0x1c0
 wait_for_completion+0x153/0x1c0
 ? wake_up_q+0x80/0x80
 __flush_work+0x1e1/0x300
 ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
 ? wait_for_completion+0x43/0x1c0
 xlog_cil_force_lsn+0x97/0x270 [xfs]
 ? xfs_fs_sync_fs+0x21/0x50 [xfs]
 xfs_log_force+0xff/0x4f0 [xfs]
 xfs_fs_sync_fs+0x21/0x50 [xfs]
 sync_filesystem+0x6b/0x90
 generic_shutdown_super+0x22/0x100
 kill_block_super+0x21/0x50
 deactivate_locked_super+0x29/0x60
 cleanup_mnt+0x3b/0x70
 task_work_run+0x98/0xc0
 exit_to_usermode_loop+0x83/0x90
 do_syscall_64+0x14b/0x150
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fa9ff80e8c7
Code: Bad RIP value.
RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100
RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40
R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100
R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000

Showing all locks held in the system:
1 lock held by khungtaskd/34:
 #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f
1 lock held by in:imklog/1113:
 #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60
5 locks held by kworker/u13:4/3736:
2 locks held by kworker/1:4/4668:
 #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
 #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600
2 locks held by kworker/2:4/5394:
 #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
 #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600
1 lock held by xfsaild/dm-0/7063:
 #0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs]
1 lock held by umount/7089:
 #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50

=============================================

* Re: generic/475 deadlock?
  2019-03-20  5:04 generic/475 deadlock? Darrick J. Wong
@ 2019-03-20 17:03 ` Brian Foster
  2019-03-20 17:45   ` Darrick J. Wong
  2019-03-20 21:39 ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-03-20 17:03 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> Hmmm.
> 
> Every now and then I see a generic/475 deadlock that generates the
> hangcheck warning pasted below.
> 

Haven't seen it, but I'll kick off a few iterations..

> I /think/ this is ... the ail is processing an inode log item, for which
> it locked the cluster buffer and pushed the cil to unpin the buffer.
> However, the cil is cleaning up after the shut down and is trying to
> simulate an EIO completion, but tries to grab the buffer lock and hence
> the cil and ail deadlock.  Maybe the solution is to trylock in the
> (freed && remove) case of xfs_buf_item_unpin, since we're tearing the
> whole system down anyway?
> 

While the xfsaild will lock buffers via ->iop_push() callouts, I think
this is only to queue the buffer to the delwri queue (both buffer and
inode item pushes should cycle the buf lock before returning) and then
to submit the pending delwri queue. I don't think xfsaild should own any
such locks at the time it issues a log force..

> Anyway, tired, going to bed now, really just backing up my brain to the
> internet... :)
> 
> --D
> 
> INFO: task kworker/1:4:4668 blocked for more than 61 seconds.
>       Not tainted 5.1.0-rc1-xfsx #rc1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> kworker/1:4     D11632  4668      2 0x80000000
> Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
> Call Trace:
>  ? __schedule+0x309/0x9e0
>  schedule+0x36/0x90
>  schedule_timeout+0x1fd/0x500
>  ? __next_timer_interrupt+0xc0/0xc0
>  ? __down_timeout+0x9b/0x120
>  __down_timeout+0xc3/0x120
>  down_timeout+0x43/0x50
>  xfs_buf_lock+0x7a/0x280 [xfs]
>  xfs_buf_item_unpin+0x300/0x4a0 [xfs]
>  xfs_trans_committed_bulk+0x105/0x3e0 [xfs]
>  xlog_cil_committed+0x4b/0x410 [xfs]
>  ? mark_held_locks+0x47/0x70
>  ? xlog_cil_push+0x3f3/0x460 [xfs]
>  ? xlog_cil_push+0x3f3/0x460 [xfs]
>  ? rcu_read_lock_sched_held+0x74/0x80
>  xlog_cil_push+0x400/0x460 [xfs]
>  process_one_work+0x252/0x600
>  worker_thread+0x1d9/0x390
>  ? wq_calc_node_cpumask+0x80/0x80
>  kthread+0x11f/0x140
>  ? kthread_create_on_node+0x70/0x70
>  ret_from_fork+0x24/0x30

Ok, so there's a CIL push where the log write failed and the buf item
abort is waiting on the buf lock..

> INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds.
>       Not tainted 5.1.0-rc1-xfsx #rc1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> xfsaild/dm-0    D13752  7063      2 0x80000000
> Call Trace:
>  ? __schedule+0x309/0x9e0
>  ? wait_for_completion+0x12b/0x1c0
>  schedule+0x36/0x90
>  schedule_timeout+0x32a/0x500
>  ? wait_for_completion+0x12b/0x1c0
>  wait_for_completion+0x153/0x1c0
>  ? wake_up_q+0x80/0x80
>  __flush_work+0x1e1/0x300
>  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
>  ? wait_for_completion+0x43/0x1c0
>  xlog_cil_force_lsn+0x97/0x270 [xfs]
>  ? xfs_iflush+0x38d/0x470 [xfs]
>  xfs_log_force+0xff/0x4f0 [xfs]
>  ? xfs_dinode_calc_crc+0x3b/0x60 [xfs]
>  ? xfs_iflush_int+0x24d/0x570 [xfs]
>  xfs_iflush+0x38d/0x470 [xfs]
>  xfs_inode_item_push+0xe6/0x200 [xfs]
>  xfsaild+0x50d/0x1210 [xfs]
>  ? sched_clock+0x5/0x10
>  ? sched_clock_cpu+0xe/0xd0
>  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
>  kthread+0x11f/0x140
>  ? kthread_create_on_node+0x70/0x70
>  ret_from_fork+0x24/0x30

xfsaild forces the log and the associated CIL push blocks on the one
already in progress..

> INFO: task umount:7089 blocked for more than 61 seconds.
>       Not tainted 5.1.0-rc1-xfsx #rc1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> umount          D13312  7089   6414 0x00000000
> Call Trace:
>  ? __schedule+0x309/0x9e0
>  ? wait_for_completion+0x12b/0x1c0
>  schedule+0x36/0x90
>  schedule_timeout+0x32a/0x500
>  ? wait_for_completion+0x12b/0x1c0
>  wait_for_completion+0x153/0x1c0
>  ? wake_up_q+0x80/0x80
>  __flush_work+0x1e1/0x300
>  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
>  ? wait_for_completion+0x43/0x1c0
>  xlog_cil_force_lsn+0x97/0x270 [xfs]
>  ? xfs_fs_sync_fs+0x21/0x50 [xfs]
>  xfs_log_force+0xff/0x4f0 [xfs]
>  xfs_fs_sync_fs+0x21/0x50 [xfs]
>  sync_filesystem+0x6b/0x90
>  generic_shutdown_super+0x22/0x100
>  kill_block_super+0x21/0x50
>  deactivate_locked_super+0x29/0x60
>  cleanup_mnt+0x3b/0x70
>  task_work_run+0x98/0xc0
>  exit_to_usermode_loop+0x83/0x90
>  do_syscall_64+0x14b/0x150
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe

And unmount is doing a log force as well..

Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
I don't think it should be xfsaild because while the delwri submit locks
the buffers for I/O, that should essentially transfer lock ownership to
the async I/O (i.e., to be unlocked via xfs_buf_ioend() via bio
completion). Is the buf I/O associated completion wq stuck somewhere? Is
the buffer submission path working correctly in the shutdown case..?

Brian

> RIP: 0033:0x7fa9ff80e8c7
> Code: Bad RIP value.
> RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7
> RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100
> RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40
> R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100
> R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000
> 
> Showing all locks held in the system:
> 1 lock held by khungtaskd/34:
>  #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f
> 1 lock held by in:imklog/1113:
>  #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60
> 5 locks held by kworker/u13:4/3736:
> 2 locks held by kworker/1:4/4668:
>  #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
>  #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600
> 2 locks held by kworker/2:4/5394:
>  #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
>  #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600
> 1 lock held by xfsaild/dm-0/7063:
>  #0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs]
> 1 lock held by umount/7089:
>  #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50
> 
> =============================================
> 

* Re: generic/475 deadlock?
  2019-03-20 17:03 ` Brian Foster
@ 2019-03-20 17:45   ` Darrick J. Wong
  2019-03-20 17:59     ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2019-03-20 17:45 UTC (permalink / raw)
  To: Brian Foster; +Cc: xfs

On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > Hmmm.
> > 
> > Every now and then I see a generic/475 deadlock that generates the
> > hangcheck warning pasted below.
> > 
> 
> Haven't seen it, but I'll kick off a few iterations..
> 
> > I /think/ this is ... the ail is processing an inode log item, for which
> > it locked the cluster buffer and pushed the cil to unpin the buffer.
> > However, the cil is cleaning up after the shut down and is trying to
> > simulate an EIO completion, but tries to grab the buffer lock and hence
> > the cil and ail deadlock.  Maybe the solution is to trylock in the
> > (freed && remove) case of xfs_buf_item_unpin, since we're tearing the
> > whole system down anyway?
> > 
> 
> While the xfsaild will lock buffers via ->iop_push() callouts, I think
> this is only to queue the buffer to the delwri queue (both buffer and
> inode item pushes should cycle the buf lock before returning) and then
> to submit the pending delwri queue. I don't think xfsaild should own any
> such locks at the time it issues a log force..

...but I think it does (see below) because iflush forces the log when
the cluster buffer is pinned, and only after that does it add the buffer
to the delwri queue.

> > Anyway, tired, going to bed now, really just backing up my brain to the
> > internet... :)
> > 
> > --D
> > 
> > INFO: task kworker/1:4:4668 blocked for more than 61 seconds.
> >       Not tainted 5.1.0-rc1-xfsx #rc1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > kworker/1:4     D11632  4668      2 0x80000000
> > Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
> > Call Trace:
> >  ? __schedule+0x309/0x9e0
> >  schedule+0x36/0x90
> >  schedule_timeout+0x1fd/0x500
> >  ? __next_timer_interrupt+0xc0/0xc0
> >  ? __down_timeout+0x9b/0x120
> >  __down_timeout+0xc3/0x120
> >  down_timeout+0x43/0x50
> >  xfs_buf_lock+0x7a/0x280 [xfs]
> >  xfs_buf_item_unpin+0x300/0x4a0 [xfs]
> >  xfs_trans_committed_bulk+0x105/0x3e0 [xfs]
> >  xlog_cil_committed+0x4b/0x410 [xfs]
> >  ? mark_held_locks+0x47/0x70
> >  ? xlog_cil_push+0x3f3/0x460 [xfs]
> >  ? xlog_cil_push+0x3f3/0x460 [xfs]
> >  ? rcu_read_lock_sched_held+0x74/0x80
> >  xlog_cil_push+0x400/0x460 [xfs]
> >  process_one_work+0x252/0x600
> >  worker_thread+0x1d9/0x390
> >  ? wq_calc_node_cpumask+0x80/0x80
> >  kthread+0x11f/0x140
> >  ? kthread_create_on_node+0x70/0x70
> >  ret_from_fork+0x24/0x30
> 
> Ok, so there's a CIL push where the log write failed and the buf item
> abort is waiting on the buf lock..
> 
> > INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds.
> >       Not tainted 5.1.0-rc1-xfsx #rc1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > xfsaild/dm-0    D13752  7063      2 0x80000000
> > Call Trace:
> >  ? __schedule+0x309/0x9e0
> >  ? wait_for_completion+0x12b/0x1c0
> >  schedule+0x36/0x90
> >  schedule_timeout+0x32a/0x500
> >  ? wait_for_completion+0x12b/0x1c0
> >  wait_for_completion+0x153/0x1c0
> >  ? wake_up_q+0x80/0x80
> >  __flush_work+0x1e1/0x300
> >  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> >  ? wait_for_completion+0x43/0x1c0
> >  xlog_cil_force_lsn+0x97/0x270 [xfs]
> >  ? xfs_iflush+0x38d/0x470 [xfs]
> >  xfs_log_force+0xff/0x4f0 [xfs]
> >  ? xfs_dinode_calc_crc+0x3b/0x60 [xfs]
> >  ? xfs_iflush_int+0x24d/0x570 [xfs]
> >  xfs_iflush+0x38d/0x470 [xfs]
> >  xfs_inode_item_push+0xe6/0x200 [xfs]
> >  xfsaild+0x50d/0x1210 [xfs]
> >  ? sched_clock+0x5/0x10
> >  ? sched_clock_cpu+0xe/0xd0
> >  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
> >  kthread+0x11f/0x140
> >  ? kthread_create_on_node+0x70/0x70
> >  ret_from_fork+0x24/0x30
> 
> xfsaild forces the log and the associated CIL push blocks on the one
> already in progress..
> 
> > INFO: task umount:7089 blocked for more than 61 seconds.
> >       Not tainted 5.1.0-rc1-xfsx #rc1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > umount          D13312  7089   6414 0x00000000
> > Call Trace:
> >  ? __schedule+0x309/0x9e0
> >  ? wait_for_completion+0x12b/0x1c0
> >  schedule+0x36/0x90
> >  schedule_timeout+0x32a/0x500
> >  ? wait_for_completion+0x12b/0x1c0
> >  wait_for_completion+0x153/0x1c0
> >  ? wake_up_q+0x80/0x80
> >  __flush_work+0x1e1/0x300
> >  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> >  ? wait_for_completion+0x43/0x1c0
> >  xlog_cil_force_lsn+0x97/0x270 [xfs]
> >  ? xfs_fs_sync_fs+0x21/0x50 [xfs]
> >  xfs_log_force+0xff/0x4f0 [xfs]
> >  xfs_fs_sync_fs+0x21/0x50 [xfs]
> >  sync_filesystem+0x6b/0x90
> >  generic_shutdown_super+0x22/0x100
> >  kill_block_super+0x21/0x50
> >  deactivate_locked_super+0x29/0x60
> >  cleanup_mnt+0x3b/0x70
> >  task_work_run+0x98/0xc0
> >  exit_to_usermode_loop+0x83/0x90
> >  do_syscall_64+0x14b/0x150
> >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> And unmount is doing a log force as well..
> 
> Hmm.. yeah, I think we need to figure out how/why the buffer is locked.

AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
returns the locked cluster buffer (code extract from xfs_iflush):

	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK,
			       0);
	if (error == -EAGAIN) {
		xfs_ifunlock(ip);
		return error;
	}
	if (error)
		goto corrupt_out;

So if we make it to here we have bp pointing to the locked inode
cluster buffer...

	/*
	 * First flush out the inode that xfs_iflush was called with.
	 */
	error = xfs_iflush_int(ip, bp);
	if (error)
		goto corrupt_out;

...after which we flush the inode to the (still locked) cluster
buffer...

	/*
	 * If the buffer is pinned then push on the log now so we won't
	 * get stuck waiting in the write for too long.
	 */
	if (xfs_buf_ispinned(bp))
		xfs_log_force(mp, 0);

...but if the buffer is also pinned then we kick the log (while holding
the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
will just be trying to remove everything, but it needs to lock the
buffer to simulate EIO.
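
To make the shape of the cycle concrete, here is a tiny userspace analogy
(pthreads, not XFS code; the names only mirror the roles above).  One
thread takes the "buffer lock" and then waits for a worker to complete,
while the worker needs that same lock before it can complete:

	/*
	 * Minimal analogy of the hang; it deadlocks by design.  "aild"
	 * holds the buffer lock and waits for the "cil" worker, but the
	 * worker needs the buffer lock for its simulated EIO completion.
	 */
	#include <pthread.h>
	#include <stdio.h>

	static pthread_mutex_t buf_lock = PTHREAD_MUTEX_INITIALIZER;

	static void *cil_push(void *arg)
	{
		(void)arg;
		/* xlog_cil_committed -> xfs_buf_item_unpin wants the lock */
		pthread_mutex_lock(&buf_lock);
		puts("cil: simulated EIO completion");
		pthread_mutex_unlock(&buf_lock);
		return NULL;
	}

	int main(void)
	{
		pthread_t cil;

		/* xfsaild: xfs_iflush() returns with the cluster buffer locked */
		pthread_mutex_lock(&buf_lock);

		/* the log force ends up waiting for the CIL push work */
		pthread_create(&cil, NULL, cil_push, NULL);
		pthread_join(&cil, NULL);	/* hangs here forever */

		pthread_mutex_unlock(&buf_lock);
		return 0;
	}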

A naive fix to the deadlock is to see if the inode buf ops are attached
and downgrade to a trylock and go ahead even if we can't lock it, but ick.
It resolves the deadlock so the umount can proceed but we also trigger a
lot of unlocked buffer asserts.
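
For reference, the hack is roughly the following (a sketch only; the
(freed && remove) branch of xfs_buf_item_unpin() is reproduced from
memory, so treat everything other than the trylock itself as
approximate):

	} else if (freed && remove) {
		/*
		 * Hack: if this is an inode cluster buffer, the AIL may be
		 * holding it locked across the log force that got us here,
		 * so don't block on the lock and carry on even if the
		 * trylock fails.  This is what trips the unlocked buffer
		 * asserts mentioned above.
		 */
		if (bp->b_ops != &xfs_inode_buf_ops)
			xfs_buf_lock(bp);
		else
			xfs_buf_trylock(bp);
		xfs_buf_hold(bp);
		bp->b_flags |= XBF_ASYNC;
		xfs_buf_ioerror(bp, -EIO);
		bp->b_flags &= ~XBF_DONE;
		xfs_buf_stale(bp);
		xfs_buf_ioend(bp);
	}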

I'm not sure what "too long" means in that code chunk.  It was added to
reduce stalls in pdflush, which these days I think translates to trying
to reduce the amount of time reclaim can stall while flushing inodes...?

(and, heh, I think I've ended up back at "move the reclaim iflush to
deferred inactivation"...)

> I don't think it should be xfsaild because while the delwri submit locks
> the buffers for I/O, that should essentially transfer lock ownership to
> the async I/O (i.e., to be unlocked via xfs_buf_ioend() via bio
> completion). Is the buf I/O associated completion wq stuck somewhere?

I didn't see any threads waiting on the buffer io completion in the
sysrq-t output.

> Is the buffer submission path working correctly in the shutdown
> case..?

It looks like it's working ok...

--D

> Brian
> 
> > RIP: 0033:0x7fa9ff80e8c7
> > Code: Bad RIP value.
> > RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> > RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7
> > RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100
> > RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40
> > R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100
> > R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000
> > 
> > Showing all locks held in the system:
> > 1 lock held by khungtaskd/34:
> >  #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f
> > 1 lock held by in:imklog/1113:
> >  #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60
> > 5 locks held by kworker/u13:4/3736:
> > 2 locks held by kworker/1:4/4668:
> >  #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
> >  #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600
> > 2 locks held by kworker/2:4/5394:
> >  #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
> >  #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600
> > 1 lock held by xfsaild/dm-0/7063:
> >  #0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs]
> > 1 lock held by umount/7089:
> >  #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50
> > 
> > =============================================
> > 

* Re: generic/475 deadlock?
  2019-03-20 17:45   ` Darrick J. Wong
@ 2019-03-20 17:59     ` Brian Foster
  2019-03-20 21:49       ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-03-20 17:59 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > > Hmmm.
> > > 
> > > Every now and then I see a generic/475 deadlock that generates the
> > > hangcheck warning pasted below.
> > > 
> > 
> > Haven't seen it, but I'll kick off a few iterations..
> > 
> > > I /think/ this is ... the ail is processing an inode log item, for which
> > > it locked the cluster buffer and pushed the cil to unpin the buffer.
> > > However, the cil is cleaning up after the shut down and is trying to
> > > simulate an EIO completion, but tries to grab the buffer lock and hence
> > > the cil and ail deadlock.  Maybe the solution is to trylock in the
> > > (freed && remove) case of xfs_buf_item_unpin, since we're tearing the
> > > whole system down anyway?
> > > 
> > 
> > While the xfsaild will lock buffers via ->iop_push() callouts, I think
> > this is only to queue the buffer to the delwri queue (both buffer and
> > inode item pushes should cycle the buf lock before returning) and then
> > to submit the pending delwri queue. I don't think xfsaild should own any
> > such locks at the time it issues a log force..
> 
> ...but I think it does (see below) because iflush forces the log when
> the cluster buffer is pinned, and only after that does it add the buffer
> to the delwri queue.
> 

Oops, my mistake.. I read through the stack too quickly and misread it
as the log force directly via xfsaild_push() rather than iflush.

> > > Anyway, tired, going to bed now, really just backing up my brain to the
> > > internet... :)
> > > 
> > > --D
> > > 
> > > INFO: task kworker/1:4:4668 blocked for more than 61 seconds.
> > >       Not tainted 5.1.0-rc1-xfsx #rc1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > kworker/1:4     D11632  4668      2 0x80000000
> > > Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
> > > Call Trace:
> > >  ? __schedule+0x309/0x9e0
> > >  schedule+0x36/0x90
> > >  schedule_timeout+0x1fd/0x500
> > >  ? __next_timer_interrupt+0xc0/0xc0
> > >  ? __down_timeout+0x9b/0x120
> > >  __down_timeout+0xc3/0x120
> > >  down_timeout+0x43/0x50
> > >  xfs_buf_lock+0x7a/0x280 [xfs]
> > >  xfs_buf_item_unpin+0x300/0x4a0 [xfs]
> > >  xfs_trans_committed_bulk+0x105/0x3e0 [xfs]
> > >  xlog_cil_committed+0x4b/0x410 [xfs]
> > >  ? mark_held_locks+0x47/0x70
> > >  ? xlog_cil_push+0x3f3/0x460 [xfs]
> > >  ? xlog_cil_push+0x3f3/0x460 [xfs]
> > >  ? rcu_read_lock_sched_held+0x74/0x80
> > >  xlog_cil_push+0x400/0x460 [xfs]
> > >  process_one_work+0x252/0x600
> > >  worker_thread+0x1d9/0x390
> > >  ? wq_calc_node_cpumask+0x80/0x80
> > >  kthread+0x11f/0x140
> > >  ? kthread_create_on_node+0x70/0x70
> > >  ret_from_fork+0x24/0x30
> > 
> > Ok, so there's a CIL push where the log write failed and the buf item
> > abort is waiting on the buf lock..
> > 
> > > INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds.
> > >       Not tainted 5.1.0-rc1-xfsx #rc1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > xfsaild/dm-0    D13752  7063      2 0x80000000
> > > Call Trace:
> > >  ? __schedule+0x309/0x9e0
> > >  ? wait_for_completion+0x12b/0x1c0
> > >  schedule+0x36/0x90
> > >  schedule_timeout+0x32a/0x500
> > >  ? wait_for_completion+0x12b/0x1c0
> > >  wait_for_completion+0x153/0x1c0
> > >  ? wake_up_q+0x80/0x80
> > >  __flush_work+0x1e1/0x300
> > >  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> > >  ? wait_for_completion+0x43/0x1c0
> > >  xlog_cil_force_lsn+0x97/0x270 [xfs]
> > >  ? xfs_iflush+0x38d/0x470 [xfs]
> > >  xfs_log_force+0xff/0x4f0 [xfs]
> > >  ? xfs_dinode_calc_crc+0x3b/0x60 [xfs]
> > >  ? xfs_iflush_int+0x24d/0x570 [xfs]
> > >  xfs_iflush+0x38d/0x470 [xfs]
> > >  xfs_inode_item_push+0xe6/0x200 [xfs]
> > >  xfsaild+0x50d/0x1210 [xfs]
> > >  ? sched_clock+0x5/0x10
> > >  ? sched_clock_cpu+0xe/0xd0
> > >  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
> > >  kthread+0x11f/0x140
> > >  ? kthread_create_on_node+0x70/0x70
> > >  ret_from_fork+0x24/0x30
> > 
> > xfsaild forces the log and the associated CIL push blocks on the one
> > already in progress..
> > 
> > > INFO: task umount:7089 blocked for more than 61 seconds.
> > >       Not tainted 5.1.0-rc1-xfsx #rc1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > umount          D13312  7089   6414 0x00000000
> > > Call Trace:
> > >  ? __schedule+0x309/0x9e0
> > >  ? wait_for_completion+0x12b/0x1c0
> > >  schedule+0x36/0x90
> > >  schedule_timeout+0x32a/0x500
> > >  ? wait_for_completion+0x12b/0x1c0
> > >  wait_for_completion+0x153/0x1c0
> > >  ? wake_up_q+0x80/0x80
> > >  __flush_work+0x1e1/0x300
> > >  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> > >  ? wait_for_completion+0x43/0x1c0
> > >  xlog_cil_force_lsn+0x97/0x270 [xfs]
> > >  ? xfs_fs_sync_fs+0x21/0x50 [xfs]
> > >  xfs_log_force+0xff/0x4f0 [xfs]
> > >  xfs_fs_sync_fs+0x21/0x50 [xfs]
> > >  sync_filesystem+0x6b/0x90
> > >  generic_shutdown_super+0x22/0x100
> > >  kill_block_super+0x21/0x50
> > >  deactivate_locked_super+0x29/0x60
> > >  cleanup_mnt+0x3b/0x70
> > >  task_work_run+0x98/0xc0
> > >  exit_to_usermode_loop+0x83/0x90
> > >  do_syscall_64+0x14b/0x150
> > >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> > And unmount is doing a log force as well..
> > 
> > Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
> 
> AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
> returns the locked cluster buffer (code extract from xfs_iflush):
> 
> 	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK,
> 			       0);
> 	if (error == -EAGAIN) {
> 		xfs_ifunlock(ip);
> 		return error;
> 	}
> 	if (error)
> 		goto corrupt_out;
> 
> So if we make it to here we have bp pointing to the locked inode
> cluster buffer...
> 
> 	/*
> 	 * First flush out the inode that xfs_iflush was called with.
> 	 */
> 	error = xfs_iflush_int(ip, bp);
> 	if (error)
> 		goto corrupt_out;
> 
> ...after which we flush the inode to the (still locked) cluster
> buffer...
> 
> 	/*
> 	 * If the buffer is pinned then push on the log now so we won't
> 	 * get stuck waiting in the write for too long.
> 	 */
> 	if (xfs_buf_ispinned(bp))
> 		xfs_log_force(mp, 0);
> 
> ...but if the buffer is also pinned then we kick the log (while holding
> the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
> will just be trying to remove everything, but it needs to lock the
> buffer to simulate EIO.
> 

Yeah, makes sense.

> A naive fix to the deadlock is to see if the inode buf ops are attached
> and downgrade to a trylock and go ahead even if we can't lock it, but ick.
> It resolves the deadlock so the umount can proceed but we also trigger a
> lot of unlocked buffer asserts.
> 
> I'm not sure what "too long" means in that code chunk.  It was added to
> reduce stalls in pdflush, which these days I think translates to trying
> to reduce the amount of time reclaim can stall while flushing inodes...?
> 

I'm not sure either. I was wondering why we'd need that given that
xfsaild() has its own log force logic in cases where the delwri submit
didn't fully complete (i.e., for pinned buffers). Relying on that seems
more appropriate since we'd force the log per delwri submission for
multiple potential buffers as opposed to per inode cluster buffer.
Perhaps we could just kill this log force?

Brian

> (and, heh, I think I've ended up back at "move the reclaim iflush to
> deferred inactivation"...)
> 
> > I don't think it should be xfsaild because while the delwri submit locks
> > the buffers for I/O, that should essentially transfer lock ownership to
> > the async I/O (i.e., to be unlocked via xfs_buf_ioend() via bio
> > completion). Is the buf I/O associated completion wq stuck somewhere?
> 
> I didn't see any threads waiting on the buffer io completion in the
> sysrq-t output.
> 
> > Is the buffer submission path working correctly in the shutdown
> > case..?
> 
> It looks like it's working ok...
> 
> --D
> 
> > Brian
> > 
> > > RIP: 0033:0x7fa9ff80e8c7
> > > Code: Bad RIP value.
> > > RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> > > RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7
> > > RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100
> > > RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40
> > > R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100
> > > R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000
> > > 
> > > Showing all locks held in the system:
> > > 1 lock held by khungtaskd/34:
> > >  #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f
> > > 1 lock held by in:imklog/1113:
> > >  #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60
> > > 5 locks held by kworker/u13:4/3736:
> > > 2 locks held by kworker/1:4/4668:
> > >  #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
> > >  #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600
> > > 2 locks held by kworker/2:4/5394:
> > >  #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
> > >  #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600
> > > 1 lock held by xfsaild/dm-0/7063:
> > >  #0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs]
> > > 1 lock held by umount/7089:
> > >  #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50
> > > 
> > > =============================================
> > > 

* Re: generic/475 deadlock?
  2019-03-20  5:04 generic/475 deadlock? Darrick J. Wong
  2019-03-20 17:03 ` Brian Foster
@ 2019-03-20 21:39 ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2019-03-20 21:39 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: xfs

On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> Hmmm.
> 
> Every now and then I see a generic/475 deadlock that generates the
> hangcheck warning pasted below.
> 
> I /think/ this is ... the ail is processing an inode log item, for which
> it locked the cluster buffer and pushed the cil to unpin the buffer.
> However, the cil is cleaning up after the shut down and is trying to
> simulate an EIO completion, but tries to grab the buffer lock and hence
> the cil and ail deadlock.  Maybe the solution is to trylock in the
> (freed && remove) case of xfs_buf_item_unpin, since we're tearing the
> whole system down anyway?

Oh, that looks like a bug in xfs_iflush() - we are forcing the log
to unpin a buffer we already own the lock on. It's the same problem
we had in the discard code fixed by commit 8c81dd46ef3c ("Force log
to disk before reading the AGF during a fstrim").

It also means that the log forces in the busy extent code have the
same potential problem, as does xfs_qm_dqflush().

I'll move further down the discussion now....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: generic/475 deadlock?
  2019-03-20 17:59     ` Brian Foster
@ 2019-03-20 21:49       ` Dave Chinner
  2019-03-21 12:11         ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2019-03-20 21:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs

On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote:
> On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote:
> > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > > And unmount is doing a log force as well..
> > > 
> > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
> > 
> > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
> > returns the locked cluster buffer (code extract from xfs_iflush):
....
> > ...but if the buffer is also pinned then we kick the log (while holding
> > the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
> > will just be trying to remove everything, but it needs to lock the
> > buffer to simulate EIO.
> > 
> 
> Yeah, makes sense.

Yup, and this same problem means any buffer we hold locked when we
issue a blocking log force can deadlock in the same way. If it's a
non-blocking log force (i.e. a log flush we don't wait for) then we
can move onwards and eventually we unlock the buffer and the log can
make progress.

> > A naive fix to the deadlock is to see if the inode buf ops are attached
> > and downgrade to a trylock and go ahead even if we can't lock it, but ick.
> > It resolves the deadlock so the umount can proceed but we also trigger a
> > lot of unlocked buffer asserts.

Yuck!

> > I'm not sure what "too long" means in that code chunk.  It was added to
> > reduce stalls in pdflush, which these days I think translates to trying
> > to reduce the amount of time reclaim can stall while flushing inodes...?

"too long" means that if there is no log pressure, it might be the
next periodic log work that forces the log and unpins the buffer.
i.e. it could be several seconds before the buffer gets unpinned.
This is the same reason we have the log force in xfs_buf_lock() - if
we have to wait for the next log force to unlock/unpin a stale
buffer, we can stall for several seconds.
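
For reference, that check in xfs_buf_lock() is roughly the following
(paraphrased from memory, so the exact form may differ from tree to
tree).  A pinned stale buffer stays locked until it is unpinned, so we
force the log before sleeping on the semaphore:

	if (atomic_read(&bp->b_pin_count) && (bp->b_flags & XBF_STALE))
		xfs_log_force(bp->b_target->bt_mount, 0);
	down(&bp->b_sema);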

> I'm not sure either. I was wondering why we'd need that given that
> xfsaild() has its own log force logic in cases where the delwri submit
> didn't fully complete (i.e., for pinned buffers). Relying on that seems
> more appropriate since we'd force the log per delwri submission for
> multiple potential buffers as opposed to per inode cluster buffer.
> Perhaps we could just kill this log force?

If the writeback is coming from the AIL, then we could check if the
buffer is pinned and then return EDEADLOCK or something like that to
get the item push return value set to XFS_ITEM_PINNED, in which case
the AIL will then do the log force that will unpin the object. This
will happen after we have added the buffer to the AIL delwri list
and unlocked the buffer, so everything should work just fine in that
case. Similar for dquot flushing from the AIL.

If the writeback is coming from inode reclaim, we've already
guaranteed the inode is unpinned, so we'll never hit this case
through there.

As for the busy extent code, I'll need to look at that much more
closely to determine if there's a way around this....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: generic/475 deadlock?
  2019-03-20 21:49       ` Dave Chinner
@ 2019-03-21 12:11         ` Brian Foster
  2019-03-21 21:10           ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-03-21 12:11 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, xfs

On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote:
> On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote:
> > On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote:
> > > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> > > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > > > And unmount is doing a log force as well..
> > > > 
> > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
> > > 
> > > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
> > > returns the locked cluster buffer (code extract from xfs_iflush):
> ....
> > > ...but if the buffer is also pinned then we kick the log (while holding
> > > the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
> > > will just be trying to remove everything, but it needs to lock the
> > > buffer to simulate EIO.
> > > 
> > 
> > Yeah, makes sense.
> 
> Yup, and this same problem means any buffer we hold locked when we
> issue a blocking log force can deadlock in the same way. If it's a
> non-blocking log force (i.e. a log flush we don't wait for) then we
> can move onwards and eventually we unlock the buffer and the log can
> make progress.
> 
> > > A naive fix to the deadlock is to see if the inode buf ops are attached
> > > and downgrade to a trylock and go ahead even if we can't lock it, but ick.
> > > It resolves the deadlock so the umount can proceed but we also trigger a
> > > lot of unlocked buffer asserts.
> 
> Yuck!
> 
> > > I'm not sure what "too long" means in that code chunk.  It was added to
> > > reduce stalls in pdflush, which these days I think translates to trying
> > > to reduce the amount of time reclaim can stall while flushing inodes...?
> 
> "too long" means that if there is no log pressure, it might be the
> next periodic log work that forces the log and unpins the buffer.
> i.e. it could be several seconds before the buffer gets unpinned.
> This is the same reason we have the log force in xfs_buf_lock() - if
> we have to wait for the next log force to unlock/unpin a stale
> buffer, we can stall for several seconds.
> 

Ok.. so it kind of looks to me that this particular instance of the log
force is a historical artifact of xfsaild. The latter used to use
separate ->trylock() and ->push() log item callbacks where the former
would either lock the object or return LOCKED, PINNED, etc. So a pinned
buffer or inode would return pinned here and xfsaild does the log force
on its behalf once it's through the current cycle.

An unpinned inode with a pinned cluster buffer didn't communicate this
state because the inode ->trylock() callout grabbed the ilock and iflock
and returned success. The iflush doesn't occur until ->push(), which
means if the cluster buffer is pinned and we wanted to force the log
then xfs_iflush() is the last chance to do it.

(For reference, I'm looking at the code as of v2.6.33).

The current xfsaild() log force logic is still tied to PINNED objects,
but it has additional logic to detect skipped buffers at delwri queue
submit time (i.e., due to being pinned) that should cover the pinned
inode cluster buffer case.

> > I'm not sure either. I was wondering why we'd need that given that
> > xfsaild() has its own log force logic in cases where the delwri submit
> > didn't fully complete (i.e., for pinned buffers). Relying on that seems
> > more appropriate since we'd force the log per delwri submission for
> > multiple potential buffers as opposed to per inode cluster buffer.
> > Perhaps we could just kill this log force?
> 
> If the writeback is coming from the AIL, then we could check if the
> buffer is pinned and then return EDEADLOCK or something like that to
> get the item push return value set to XFS_ITEM_PINNED, in which case
> the AIL will then do the log force that will unpin the object. This
> will happen after we have added the buffer to the AIL delwri list
> and unlocked the buffer, so everything should work just fine in that
> case. Similar for dquot flushing from the AIL.
> 

Sounds plausible, though I'd probably just lift the pinned check into
the inode item callout after the xfs_iflush() since it already has a
reference to the buffer.
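
i.e. something like this at the tail of xfs_inode_item_push(), where the
existing code is paraphrased and only the pinned check is new (untested
sketch):

	error = xfs_iflush(ip, &bp);
	if (!error) {
		if (!xfs_buf_delwri_queue(bp, buffer_list))
			rval = XFS_ITEM_FLUSHING;
		/*
		 * If the cluster buffer is pinned, report it so xfsaild
		 * forces the log once it has dropped all the locks rather
		 * than forcing it from inside xfs_iflush().
		 */
		if (xfs_buf_ispinned(bp))
			rval = XFS_ITEM_PINNED;
		xfs_buf_relse(bp);
	}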

But given the above, couldn't we just remove the log force from
xfs_iflush() and let the existing xfsaild_push() logic handle it? It
should detect any pinned buffers after the delwri queue submit and do
the log force. It might require a closer look at the existing logic to
ensure the log force is timely enough, but ISTM that the purpose of this
logic is cover this case. AFAICT nothing prevents the buffer from
becoming pinned again once it's added to the queue anyways, even if we
just forced the log. Hm?

Brian

> If the writeback is coming from inode reclaim, we've already
> guaranteed the inode is unpinned, so we'll never hit this case
> through there.
> 
> As for the busy extent code, I'll need to look at that much more
> closely to determine if there's a way around this....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

* Re: generic/475 deadlock?
  2019-03-21 12:11         ` Brian Foster
@ 2019-03-21 21:10           ` Dave Chinner
  2019-03-21 21:53             ` Brian Foster
  2019-03-26 17:13             ` Brian Foster
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2019-03-21 21:10 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs

On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote:
> On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote:
> > On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote:
> > > On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote:
> > > > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> > > > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > > > > And unmount is doing a log force as well..
> > > > > 
> > > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
> > > > 
> > > > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
> > > > returns the locked cluster buffer (code extract from xfs_iflush):
> > ....
> > > > ...but if the buffer is also pinned then we kick the log (while holding
> > > > the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
> > > > will just be trying to remove everything, but it needs to lock the
> > > > buffer to simulate EIO.
> > > > 
> > > 
> > > Yeah, makes sense.
> > 
> > Yup, and this same problem means any buffer we hold locked when we
> > issue a blocking log force can deadlock in the same way. If it's a
> > non-blocking log force (i.e. a log flush we don't wait for) then we
> > can move onwards and eventually we unlock the buffer and the log can
> > make progress.
> > 
> > > > A naive fix to the deadlock is to see if the inode buf ops are attached
> > > > and downgrade to a trylock and go ahead even if we can't lock it, but ick.
> > > > It resolves the deadlock so the umount can proceed but we also trigger a
> > > > lot of unlocked buffer asserts.
> > 
> > Yuck!
> > 
> > > > I'm not sure what "too long" means in that code chunk.  It was added to
> > > > reduce stalls in pdflush, which these days I think translates to trying
> > > > to reduce the amount of time reclaim can stall while flushing inodes...?
> > 
> > "too long" means that if there is no log pressure, it might be the
> > next periodic log work that forces the log and unpins the buffer.
> > i.e. it could be several seconds before the buffer gets unpinned.
> > This is the same reason we have the log force in xfs_buf_lock() - if
> > we have to wait for the next log force to unlock/unpin a stale
> > buffer, we can stall for several seconds.
> > 
> 
> Ok.. so it kind of looks to me that this particular instance of the log
> force is a historical artifact of xfsaild. The latter used to use
> separate ->trylock() and ->push() log item callbacks where the former
> would either lock the object or return LOCKED, PINNED, etc. So a pinned
> buffer or inode would return pinned here and xfsaild does the log force
> on its behalf once it's through the current cycle.

Sure. That's no different to now. The only thing that is different
is that we combined ->trylock, ->push and ->pushbuf into one
function. The actual semantics and behaviour is unchanged.

i.e. if you look at xfs_iflush(ip, ASYNC) back then, it did:

        /*
         * If the buffer is pinned then push on the log now so we won't
         * get stuck waiting in the write for too long.
         */
        if (XFS_BUF_ISPINNED(bp))
                xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE);

<snip iflush_cluster to pull in all other dirty inodes>

        if (flags & INT_DELWRI) {
                xfs_bdwrite(mp, bp);
        } else if (flags & INT_ASYNC) {
                error = xfs_bawrite(mp, bp);
        } else {
                error = xfs_bwrite(mp, bp);
        }

And then it would write the inode buffer directly with an async write
via xfs_bawrite(). i.e. there was no delwri list or anything like
that - the inode flush controlled the inode buffer writeback and
hence the unpinning of the inode buffer. The only difference now is
that the caller controls how the buffer is written, not
xfs_iflush...

> An unpinned inode with a pinned cluster buffer didn't communicate this
> state because the inode ->trylock() callout grabbed the ilock and iflock
> and returned success. The iflush doesn't occur until ->push(), which
> means if the cluster buffer is pinned and we wanted to force the log
> then xfs_iflush() is the last chance to do it.
> 
> (For reference, I'm looking at the code as of v2.6.33).

Again, that's no different to what we have now - the push is for the
inode, not the backing buffer. If we can't lock the inode because
it's pinned or already flushing, then we return that info to AIL
loop.  The acutal state of the backing buffer is still completely
hidden from the caller and so that's why I left the unpin in that
code.

> The current xfsaild() log force logic is still tied to PINNED objects,
> but it has additional logic to detect skipped buffers at delwri queue
> submit time (i.e., due to being pinned) that should cover the pinned
> inode cluster buffer case.

And it's not skipped buffers, either. Skipped buffers are only for
determining timeouts - it's the ail->ail_log_flush counter that
determines whether a log flush should be done or not, and that's
only incremented on ITEM_PINNED objects. This is why I was
suggesting returning that if the backing buffer is pinned, but I
forgot all about how we already handle this case...

> But given the above, couldn't we just remove the log force from
> xfs_iflush() and let the existing xfsaild_push() logic handle it?

We can, but not for the reasons you are suggesting.

> It
> should detect any pinned buffers after the delwri queue submit and do
> the log force.

And this is what we already do with this code:

	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
		ailp->ail_log_flush++;

Which falls down to this code when we try to flush a pinned delwri
buffer:

	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
....
	return pinned;

Basically, we leave pinned buffers on the delwri list rather than
remove them and issue IO on them, and tell the caller that we left
pinned/locked buffers on the list that still need to be submitted.
The AIL takes this as an indication it needs to flush the log to
unpin the buffers it can't flush, and because they are still on its
delwri list, it will attempt to submit them again next time around
after the log has been forced.

So given that inode reclaim already waits for inodes to be unpinned
and there is an xfs_buf_unpin_wait() call in the
xfs_bwrite()->xfs_buf_submit() path that will issue a log force on
the buffer from the inode reclaim path, we don't actually need the
log force in xfs_iflush() at all. i.e.  the AIL will capture the
buffer state and unpin inode and dquot buffers automatically, and so
we can simply remove the pinned-buffer log forces from the
inode/dquot flush code....
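
Concretely, for the inode side that just means deleting the hunk Darrick
quoted earlier from xfs_iflush() and letting xfsaild's ail_log_flush
accounting issue the log force instead:

	/*
	 * If the buffer is pinned then push on the log now so we won't
	 * get stuck waiting in the write for too long.
	 */
	if (xfs_buf_ispinned(bp))
		xfs_log_force(mp, 0);

(xfs_qm_dqflush() has an equivalent pinned-buffer log force that would
go the same way.)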

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: generic/475 deadlock?
  2019-03-21 21:10           ` Dave Chinner
@ 2019-03-21 21:53             ` Brian Foster
  2019-03-21 23:50               ` Dave Chinner
  2019-03-26 17:13             ` Brian Foster
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-03-21 21:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, xfs

On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote:
> > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote:
> > > On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote:
> > > > On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote:
> > > > > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> > > > > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > > > > > And unmount is doing a log force as well..
> > > > > > 
> > > > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
> > > > > 
> > > > > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
> > > > > returns the locked cluster buffer (code extract from xfs_iflush):
> > > ....
> > > > > ...but if the buffer is also pinned then we kick the log (while holding
> > > > > the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
> > > > > will just be trying to remove everything, but it needs to lock the
> > > > > buffer to simulate EIO.
> > > > > 
> > > > 
> > > > Yeah, makes sense.
> > > 
> > > Yup, and this same problem means any buffer we hold locked when we
> > > issue a blocking log force can deadlock in the same way. If it's a
> > > non-blocking log force (i.e. a log flush we don't wait for) then we
> > > can move onwards and eventually we unlock the buffer and the log can
> > > make progress.
> > > 
> > > > > A naive fix to the deadlock is to see if the inode buf ops are attached
> > > > > and downgrade to a trylock and go ahead even if we can't lock it, but ick.
> > > > > It resolves the deadlock so the umount can proceed but we also trigger a
> > > > > lot of unlocked buffer asserts.
> > > 
> > > Yuck!
> > > 
> > > > > I'm not sure what "too long" means in that code chunk.  It was added to
> > > > > reduce stalls in pdflush, which these days I think translates to trying
> > > > > to reduce the amount of time reclaim can stall while flushing inodes...?
> > > 
> > > "too long" means that if there is no log pressure, it might be the
> > > next periodic log work that forces the log and unpins the buffer.
> > > i.e. it could be several seconds before the buffer gets unpinned.
> > > This is the same reason we have the log force in xfs_buf_lock() - if
> > > we have to wait for the next log force to unlock/unpin a stale
> > > buffer, we can stall for several seconds.
> > > 
> > 
> > Ok.. so it kind of looks to me that this particular instance of the log
> > force is a historical artifact of xfsaild. The latter used to use
> > separate ->trylock() and ->push() log item callbacks where the former
> > would either lock the object or return LOCKED, PINNED, etc. So a pinned
> > buffer or inode would return pinned here and xfsaild does the log force
> > on its behalf once it's through the current cycle.
> 
> Sure. That's no different to now. The only thing that is different
> is that we combined ->trylock, ->push and ->pushbuf into one
> function. The actual semantics and behaviour is unchanged.
> 
> i.e. if you look at xfs_iflush(ip, ASYNC) back then, it did:
> 
>         /*
>          * If the buffer is pinned then push on the log now so we won't
>          * get stuck waiting in the write for too long.
>          */
>         if (XFS_BUF_ISPINNED(bp))
>                 xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE);
> 
> <snip iflush_cluster to pull in all other dirty inodes>
> 
>         if (flags & INT_DELWRI) {
>                 xfs_bdwrite(mp, bp);
>         } else if (flags & INT_ASYNC) {
>                 error = xfs_bawrite(mp, bp);
>         } else {
>                 error = xfs_bwrite(mp, bp);
>         }
> 
> And then it would write the inode buffer directly with an async write
> via xfs_bawrite(). i.e. there was no delwri list or anything like
> that - the inode flush controlled the inode buffer writeback and
> hence the unpinning of the inode buffer. The only difference now is
> that the caller controls how the buffer is written, not
> xfs_iflush...
> 

What I'm saying is that the difference in the interface explains the
historical need for the log force in xfs_iflush(). The difference that
I'm pointing out is that ->push() had no means to return item state. The
sequence for an unpinned inode with a pinned cluster buffer in 2.6.32 is
for ->trylock() to return success after acquiring the ilock and flush
lock, followed by a call to ->push(). We didn't call xfs_iflush() (and
thus get the inode buffer) until the push call:

                switch (lock_result) {
                case XFS_ITEM_SUCCESS:
                        XFS_STATS_INC(xs_push_ail_success);
                        IOP_PUSH(lip);
                        last_pushed_lsn = lsn;
                        break;

... which doesn't even have a return value. Hence there is no way to
return a pinned state for an inode cluster buffer and xfs_iflush() would
have had to do the log force for a pinned cluster buffer itself.

That is no longer the case with the current code. I don't know with
certainty of course, but from that I surmised the xfs_iflush() ->
xfs_log_force() call existed to deal with this historical interface
limitation.

> > An unpinned inode with a pinned cluster buffer didn't communicate this
> > state because the inode ->trylock() callout grabbed the ilock and iflock
> > and returned success. The iflush doesn't occur until ->push(), which
> > means if the cluster buffer is pinned and we wanted to force the log
> > then xfs_iflush() is the last chance to do it.
> > 
> > (For reference, I'm looking at the code as of v2.6.33).
> 
> Again, that's no different to what we have now - the push is for the
> inode, not the backing buffer. If we can't lock the inode because
> it's pinned or already flushing, then we return that info to the AIL
> loop.  The actual state of the backing buffer is still completely
> hidden from the caller and so that's why I left the unpin in that
> code.
> 

Pretty much what I'm saying above.

> > The current xfsaild() log force logic is still tied to PINNED objects,
> > but it has additional logic to detect skipped buffers at delwri queue
> > submit time (i.e., due to being pinned) that should cover the pinned
> > inode cluster buffer case.
> 
> And it's not skipped buffers, either. Skipped buffers are only for
> determining timeouts - it's the ail->ail_log_flush counter that
> determines whether a log flush should be done or not, and that's
> only incremented on ITEM_PINNED objects. This is why I was
> suggesting returning that if the backing buffer is pinned, but I
> forgot all about how we already handle this case...
> 
> > But given the above, couldn't we just remove the log force from
> > xfs_iflush() and let the existing xfsaild_push() logic handle it?
> 
> We can, but not for the reasons you are suggesting.
> 

The reason I suggested is because we currently "detect skipped buffers
at delwri queue submit time (i.e., due to being pinned)," which causes
xfsaild to force the log. I'm not sure how exactly you read that, but it
refers exactly to the code snippet you posted below where we bump the
log flush counter due to skipped buffers on submit (because they were
pinned).

> > It
> > should detect any pinned buffers after the delwri queue submit and do
> > the log force.
> 
> And this is what we already do with this code:
> 

Yes.

> 	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
> 		ailp->ail_log_flush++;
> 
> Which falls down to this code when we try to flush a pinned delwri
> buffer:
> 
> 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> 		if (!wait_list) {
> 			if (xfs_buf_ispinned(bp)) {
> 				pinned++;
> 				continue;
> 			}
> ....
> 	return pinned;
> 

"detect skipped buffers at delwri queue submit time (i.e., due to being
pinned)"

> Basically, we leave pinned buffers on the delwri list rather than
> remove them and issue IO on them, and tell the caller that we left
> pinned/locked buffers on the list that still need to be submitted.
> The AIL takes this as an indication it needs to flush the log to
> unpin the buffers it can't flush, and because they are still on its
> delwri list, it will attempt to submit them again next time around
> after the log has been forced.
> 

Right, so this is exactly the reason I suggested.

Brian

> So given that inode reclaim already waits for inodes to be unpinned
> and there is a xfs_buf_unpin_wait() call in
> xfs_bwrite()->xfs_buf_submit() path that will issue a log force on
> the buffer from the inode reclaim path, we don't actually need the
> log force in xfs_iflush() at all. i.e.  the AIL will capture the
> buffer state and unpin inode and dquot buffers automatically, and so
> we can simply remove the pinned-buffer log forces from the
> inode/dquot flush code....
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-21 21:53             ` Brian Foster
@ 2019-03-21 23:50               ` Dave Chinner
  2019-03-22  0:07                 ` Darrick J. Wong
  2019-03-22 12:01                 ` Brian Foster
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2019-03-21 23:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs

On Thu, Mar 21, 2019 at 05:53:07PM -0400, Brian Foster wrote:
> On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote:
> > On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote:
> > > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote:
> > > The current xfsaild() log force logic is still tied to PINNED objects,
> > > but it has additional logic to detect skipped buffers at delwri queue
> > > submit time (i.e., due to being pinned) that should cover the pinned
> > > inode cluster buffer case.
> > 
> > And it's not skipped buffers, either. Skipped buffers are only for
> > determining timeouts - it's the ail->ail_log_flush counter that
> > determines whether a log flush should be done or not, and that's
> > only incremented on ITEM_PINNED objects. This is why I was
> > suggesting returning that if the backing buffer is pinned, but I
> > forgot all about how we already handle this case...
> > 
> > > But given the above, couldn't we just remove the log force from
> > > xfs_iflush() and let the existing xfsaild_push() logic handle it?
> > 
> > We can, but not for the reasons you are suggesting.
> > 
> 
> The reason I suggested is because we currently "detect skipped buffers
> at delwri queue submit time (i.e., due to being pinned)," which causes
> xfsaild to force the log.  I'm not sure how exactly you read that, but it
> refers exactly to the code snippet you posted below where we bump the
> log flush counter due to skipped buffers on submit (because they were
> pinned).

In the context of the AIL pushing, "skipped" has specific meaning -
it refers to items that could not be flushed for whatever reason.
And the actions that are taken based on the number of "skipped"
objects are taken after delwri queue submit time.

You used all the terminology that refers to /skipped/ item
handling, not /pinned/ item handling - if you re-read what I wrote
above with this difference in mind, then it will make a lot more
sense to you.

FWIW, this skipped/pinned terminology difference goes right down
into the delwri list handling I quoted - I'll add the next few lines
so it's clear what I'm talking about:


> > 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> > 		if (!wait_list) {
> > 			if (xfs_buf_ispinned(bp)) {
> > 				pinned++;
> > 				continue;
> > 			}
			if (!xfs_buf_trylock(bp))
				continue;
		}

This code branch also avoids blocking on locked buffers. In AIL
terms these locked buffers have been "skipped", but the code only
counts and indicates pinned buffers to the caller. And the delwri
submission code in the AIL pushing acts on this specific
distinction - it triggers a log force if pinned buffers were
encountered, not if the flush skipped locked buffers.
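
For completeness, the consumer of that counter at the top of
xfsaild_push() looks something like this (paraphrasing the current code
from memory, so the details may differ slightly):

	/*
	 * If we encountered pinned items or did not finish writing out all
	 * buffers the last time we ran, force the log first and wait for
	 * it before pushing again. (mp here is ailp->ail_mount.)
	 */
	if (ailp->ail_log_flush && ailp->ail_last_pushed_lsn == 0 &&
	    (!list_empty_careful(&ailp->ail_buf_list) ||
	     xfs_ail_min_lsn(ailp))) {
		ailp->ail_log_flush = 0;

		XFS_STATS_INC(mp, xs_push_ail_flush);
		xfs_log_force(mp, XFS_LOG_SYNC);
	}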

> > ....
> > 	return pinned;
> > 
> 
> "detect skipped buffers at delwri queue submit time (i.e., due to being
> pinned)"

If the AIL code was detecting skipped buffers, it would do:

	xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list);
	if (!list_empty(&ailp->ail_buf_list)) {
		/* we skipped buffers */
		.....
	}

But it's not doing this.

i.e. "Pinned" vs "skipped" may be a subtle terminology difference,
but it is an important difference and they are not interchangeable...

> > remove them and issue IO on them, and tell the caller that we left
> > pinned/locked buffers on the list that still need to be submitted.
> > The AIL takes this as an indication it needs to flush the log to
> > unpin the buffers it can't flush, and because they are still on its
> > delwri list, it will attempt to submit them again next time around
> > after the log has been forced.
> > 
> 
> Right, so this is exactly the reason I suggested.

OK, we're on the same page. :)

Do want to submit a patch that fixes the inode and dquot flushing?
I haven't looked any further at the busy extent stuff, that's likely
to be a more difficult problem to solve...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-21 23:50               ` Dave Chinner
@ 2019-03-22  0:07                 ` Darrick J. Wong
  2019-03-22 12:01                 ` Brian Foster
  1 sibling, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2019-03-22  0:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, xfs

On Fri, Mar 22, 2019 at 10:50:45AM +1100, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 05:53:07PM -0400, Brian Foster wrote:
> > On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote:
> > > On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote:
> > > > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote:
> > > > The current xfsaild() log force logic is still tied to PINNED objects,
> > > > but it has additional logic to detect skipped buffers at delwri queue
> > > > submit time (i.e., due to being pinned) that should cover the pinned
> > > > inode cluster buffer case.
> > > 
> > > And it's not skipped buffers, either. Skipped buffers are only for
> > > determining timeouts - it's the ail->ail_log_flush counter that
> > > determines whether a log flush should be done or not, and that's
> > > only incremented on ITEM_PINNED objects. This is why I was
> > > suggesting returning that if the backing buffer is pinned, but I
> > > forgot all about how we already handle this case...
> > > 
> > > > But given the above, couldn't we just remove the log force from
> > > > xfs_iflush() and let the existing xfsaild_push() logic handle it?
> > > 
> > > We can, but not for the reasons you are suggesting.
> > > 
> > 
> > The reason I suggested is because we currently "detect skipped buffers
> > at delwri queue submit time (i.e., due to being pinned)," which causes
> > xfsaild to force the log.  I'm not sure how exactly you read that, but it
> > refers exactly to the code snippet you posted below where we bump the
> > log flush counter due to skipped buffers on submit (because they were
> > pinned).
> 
> In the context of the AIL pushing, "skipped" has specific meaning -
> it refers to items that could not be flushed for whatever reason.
> And the actions that are taken based on the number of "skipped"
> objects are taken after delwri queue submit time.
> 
> You used all the terminology that refers to /skipped/ item
> handling, not /pinned/ item handling - if you re-read what I wrote
> above with this difference in mind, then it will make a lot more
> sense to you.
> 
> FWIW, this skipped/pinned terminology difference goes right down
> into the delwri list handling I quoted - I'll add the next few lines
> so it's clear what I'm talking about:
> 
> 
> > > 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> > > 		if (!wait_list) {
> > > 			if (xfs_buf_ispinned(bp)) {
> > > 				pinned++;
> > > 				continue;
> > > 			}
> 			if (!xfs_buf_trylock(bp))
> 				continue;
> 		}
> 
> This code branch also avoids blocking on locked buffers. In AIL
> terms these locked buffers have been "skipped", but the code only
> counts and indicates pinned buffers to the caller. And the delwri
> submission code in the AIL pushing acts on this specific
> distinction - it triggers a log force if pinned buffers were
> encountered, not if the flush skipped locked buffers.
> 
> > > ....
> > > 	return pinned;
> > > 
> > 
> > "detect skipped buffers at delwri queue submit time (i.e., due to being
> > pinned)"
> 
> If the AIL code was detecting skipped buffers, it would do:
> 
> 	xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list);
> 	if (!list_empty(&ailp->ail_buf_list)) {
> 		/* we skipped buffers */
> 		.....
> 	}
> 
> But it's not doing this.
> 
> i.e. "Pinned" vs "skipped" may be a subtle terminology difference,
> but it is an important difference and they are not interchangeable...
> 
> > > remove them and issue IO on them, and tell the caller that we left
> > > pinned/locked buffers on the list that still need to be submitted.
> > > The AIL takes this as an indication it needs to flush the log to
> > > unpin the buffers it can't flush, and because they are still on its
> > > delwri list, it will attempt to submit them again next time around
> > > after the log has been forced.

<nod> That's kinda what I thought too, having looked through the code,
but TBH I defer to the two of you as far as knowing how things work in
the logging code. :)

> > > 
> > 
> > Right, so this is exactly the reason I suggested.
> 
> OK, we're on the same page. :)
> 
> Do want to submit a patch that fixes the inode and dquot flushing?

I'd review any such patch that shows up. :)

[He says having not written a patch since the yucky one a few days ago
and not likely having time to write one anyway.]

> I haven't looked any further at the busy extent stuff, that's likely
> to be a more difficult problem to solve...

<nod>  Thank you both for taking time to dig through my question!

--D

> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-21 23:50               ` Dave Chinner
  2019-03-22  0:07                 ` Darrick J. Wong
@ 2019-03-22 12:01                 ` Brian Foster
  2019-03-24 23:03                   ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-03-22 12:01 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, xfs

On Fri, Mar 22, 2019 at 10:50:45AM +1100, Dave Chinner wrote:
> On Thu, Mar 21, 2019 at 05:53:07PM -0400, Brian Foster wrote:
> > On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote:
> > > On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote:
> > > > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote:
> > > > The current xfsaild() log force logic is still tied to PINNED objects,
> > > > but it has additional logic to detect skipped buffers at delwri queue
> > > > submit time (i.e., due to being pinned) that should cover the pinned
> > > > inode cluster buffer case.
> > > 
> > > And it's not skipped buffers, either. Skipped buffers are only for
> > > determining timeouts - it's the ail->ail_log_flush counter that
> > > determines whether a log flush should be done or not, and that's
> > > only incremented on ITEM_PINNED objects. This is why I was
> > > suggesting returning that if the backing buffer is pinned, but I
> > > forgot all about how we already handle this case...
> > > 
> > > > But given the above, couldn't we just remove the log force from
> > > > xfs_iflush() and let the existing xfsaild_push() logic handle it?
> > > 
> > > We can, but not for the reasons you are suggesting.
> > > 
> > 
> > The reason I suggested is because we currently "detect skipped buffers
> > at delwri queue submit time (i.e., due to being pinned)," which causes
> > xfsaild to force the log.  I'm not sure how exactly you read that, but it
> > refers exactly to the code snippet you posted below where we bump the
> > log flush counter due to skipped buffers on submit (because they were
> > pinned).
> 
> In the context of the AIL pushing, "skipped" has specific meaning -
> it refers to items that could not be flushed for whatever reason.
> And the actions that are taken based on the number of "skipped"
> objects are taken after delwri queue submit time.
> 
> You used all the terminology that refers to /skipped/ item
> handling, not /pinned/ item handling - if you re-read what I wrote
> above with this difference in mind, then it will make a lot more
> sense to you.
> 

Perhaps I could have phrased it better, but I specifically referred to
skipped buffers (not items), in the context of delwri submission and
qualified it with those skipped due to being pinned. ;) Even in my
earlier reply right after Darrick corrected my mischaracterization of
the deadlock, I said the following about the log force in xfs_iflush():

"I was wondering why we'd need that given that xfsaild() has its own log
force logic in cases where the delwri submit didn't fully complete
(i.e., for pinned buffers)."

But no matter, I can see where the disconnect is now at least. I just
think you misread me in this case.

> FWIW, this skipped/pinned terminology difference goes right down
> into the delwri list handling I quoted - I'll add the next few lines
> so it's clear what I'm talking about:
> 
> 
> > > 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> > > 		if (!wait_list) {
> > > 			if (xfs_buf_ispinned(bp)) {
> > > 				pinned++;
> > > 				continue;
> > > 			}
> 			if (!xfs_buf_trylock(bp))
> 				continue;
> 		}
> 
> This code branch also avoids blocking on locked buffers. In AIL
> terms these locked buffers have been "skipped", but the code only
> counts and indicates pinned buffers to the caller. And the delwri
> submission code in the AIL pushing acts on this specific
> distinction - it triggers a log force if pinned buffers were
> encountered, not if the flush skipped locked buffers.
> 

Yep..

> > > ....
> > > 	return pinned;
> > > 
> > 
> > "detect skipped buffers at delwri queue submit time (i.e., due to being
> > pinned)"
> 
> If the AIL code was detecting skipped buffers, it would do:
> 
> 	xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list);
> 	if (!list_empty(&ailp->ail_buf_list)) {
> 		/* we skipped buffers */
> 		.....
> 	}
> 
> But it's not doing this.
> 

In general, if you think I'm mischaracterizing code/behavior, it's
helpful to point out how you interpret the specified reasoning so we can
correct mistaken assumptions/understandings, clarify descriptions or use
better terminology or whatever. That helps to get onto the same page
more efficiently, particularly for cases where it so happens we were
talking about the same exact thing in the first place and just happen to
think about it a bit differently. ;)

For example, to just say "We can, but not for the reasons you are
suggesting." doesn't tell me much about how you interpreted my
reasoning. I thus assume you interpreted it exactly as I meant it. That
ambiguity leads to confusion when you then go on and explain the same
behavior I was (trying to) refer to. ;P If you said something like "we
can but not because of <reason>," it would have probably been
immediately apparent to me that the reason I intended to describe is not
exactly how you read it.

> i.e. "Pinned" vs "skipped" may be a subtle terminology difference,
> but it is an important difference and they are not interchangeable...
> 
> > > remove them and issue IO on them, and tell the caller that we left
> > > pinned/locked buffers on the list that still need to be submitted.
> > > The AIL takes this as an indication it needs to flush the log to
> > > unpin the buffers it can't flush, and because they are still on its
> > > delwri list, it will attempt to submit them again next time around
> > > after the log has been forced.
> > > 
> > 
> > Right, so this is exactly the reason I suggested.
> 
> OK, we're on the same page. :)
> 

Indeed.

> Do want to submit a patch that fixes the inode and dquot flushing?
> I haven't looked any further at the busy extent stuff, that's likely
> to be a more difficult problem to solve...
> 

I'll add it to the todo list and make a pass through the log items...

I missed where the busy extent stuff came into play here.. are you
pointing out that log forces from the busy extent code can occur under
buffer locks and thus we're susceptible to similar problems there? If
so, aren't we protected by the transaction in that context?

IOW, wouldn't a log force from busy extent context always occur with
locked buffers joined to a transaction? If so, then doesn't the active
transaction hold a bli reference and prevent such items from being
"freed" in log completion context (i.e. xfs_buf_item_unpin()) if they
happened to be pinned? Perhaps I'm missing something...
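
(For reference, the force I have in mind is the synchronous one in
xfs_extent_busy_flush(), which AFAICT is only reached from allocation
context with the AG header buffers locked in the transaction. Quoting
roughly from memory, so the details may be off:)

	void
	xfs_extent_busy_flush(
		struct xfs_mount	*mp,
		struct xfs_perag	*pag,
		unsigned		busy_gen)
	{
		DEFINE_WAIT		(wait);
		int			error;

		/* sync log force issued while the caller holds AG buffers */
		error = xfs_log_force(mp, XFS_LOG_SYNC);
		if (error)
			return;

		/* wait for the pag's busy extent generation to move on */
		do {
			prepare_to_wait(&pag->pagb_wait, &wait, TASK_KILLABLE);
			if (busy_gen != READ_ONCE(pag->pagb_gen))
				break;
			schedule();
		} while (1);
		finish_wait(&pag->pagb_wait, &wait);
	}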

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-22 12:01                 ` Brian Foster
@ 2019-03-24 23:03                   ` Dave Chinner
  2019-03-25 12:34                     ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2019-03-24 23:03 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs

On Fri, Mar 22, 2019 at 08:01:36AM -0400, Brian Foster wrote:
> On Fri, Mar 22, 2019 at 10:50:45AM +1100, Dave Chinner wrote:
> > > > 	return pinned;
> > > > 
> > > 
> > > "detect skipped buffers at delwri queue submit time (i.e., due to being
> > > pinned)"
> > 
> > If the AIL code was detecting skipped buffers, it would do:
> > 
> > 	xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list);
> > 	if (!list_empty(&ailp->ail_buf_list)) {
> > 		/* we skipped buffers */
> > 		.....
> > 	}
> > 
> > But it's not doing this.
> > 
> 
> In general, if you think I'm mischaracterizing code/behavior, it's
> helpful to point out how you interpret the specified reasoning so we can
> correct mistaken assumptions/understandings, clarify descriptions or use
> better terminology or whatever. That helps to get onto the same page
> more efficiently, particularly for cases where it so happens we were
> talking about the same exact thing in the first place and just happen to
> think about it a bit differently. ;)

Well, I didn't realise you had mis-applied the terminology until you
replied to me saying "we're describing the same thing". It's easy to
say in hindsight "-you- should have just pointed out the problem", but
that's laying blame for something that nobody is at fault for.

> For example, to just say "We can, but not for the reasons you are
> suggesting." doesn't tell me much about how you interpreted my
> reasoning.

I think you're being unreasonable here, Brian - I explained the
skipped vs pinned difference once I understood there was a
terminology mismatch issue. :/

I don't expect to have to rewrite emails in my own words just so the
other party can determine if I've understood the terms they used
correctly. We have definitions and try to use common, clear
terminology so that code and communication is efficient and we don't
need to endlessly repeat what the other person has said.

Misunderstandings over terminology are relatively rare - as soon as
I understood that you were using terminology with a different
definition to the code you were referring to, I pointed it out and
explained where the misunderstanding was.

I cannot do any more than that, and to expect more from /anyone/ is
entirely unreasonable.

> I missed where the busy extent stuff came into play here.. are you
> pointing out that log forces from the busy extent code can occur under
> buffer locks and thus we're susceptible to similar problems there? If
> so, aren't we protected by the transaction in that context?

I'm not sure - as I said I have not traced/reasoned the logic right
through yet.  All I was doing is pointing out that these are other
places we are doing a log force with buffers locked so might have
the same problem, and....

> IOW, wouldn't a log force from busy extent context always occur with
> locked buffers joined to a transaction? If so, then doesn't the active
> transaction hold a bli reference and prevent such items from being
> "freed" in log completion context (i.e. xfs_buf_item_unpin()) if they
> happened to be pinned? Perhaps I'm missing something...

... now you've looked at it in more detail than I have.

What you've described is why it was considered to be safe, but now
we have things like defered AGFL block freeing that pick up and drop
the AGF lock repeatedly as the single transaction rolls and -
eventually - starts calling xfs_trans_reserve() on transaction roll.
That's something we never used to do.  It may still be safe, but it
is unclear to me how this all interacts in the presence of
filesystem shutdown conditions.....

---

What it comes down to is that the only way I can answer questions
like this is to do the complete analysis myself, which simply does
not scale. Having to jump between multiple separate complex
questions completely derails my own workflow, and it's recently been
resulting in me spending >75% of every day answering complex
questions like this. Duplication of effort is not an efficient use
of our time or expertise.

If I see something and say "these other code paths look similar, we
should check them too", all it means that I noticed a pattern of
usage and they need to be checked for similar issues.  It does not
mean "dave has enough spare time on hand and can drop everything to
analyse the code in question and answer every little detail about
the similar problem path he noticed."

Brian, you're way better than me at analysing these complex
interactions. Be confident in your analysis, just say "no, it's safe
because...." rather than asking a question that requires me to
repeat the analysis you've already done.  If there's something that
you've missed that I spot, I'll point it out, but I really don't
want to have to repeat everything you've already done just to answer
a "did I miss anything" question.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-24 23:03                   ` Dave Chinner
@ 2019-03-25 12:34                     ` Brian Foster
  2019-03-27  1:22                       ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-03-25 12:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, xfs

On Mon, Mar 25, 2019 at 10:03:27AM +1100, Dave Chinner wrote:
> On Fri, Mar 22, 2019 at 08:01:36AM -0400, Brian Foster wrote:
> > On Fri, Mar 22, 2019 at 10:50:45AM +1100, Dave Chinner wrote:
> > > > > 	return pinned;
> > > > > 
> > > > 
> > > > "detect skipped buffers at delwri queue submit time (i.e., due to being
> > > > pinned)"
> > > 
> > > If the AIL code was detecting skipped buffers, it would do:
> > > 
> > > 	xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list);
> > > 	if (!list_empty(&ailp->ail_buf_list)) {
> > > 		/* we skipped buffers */
> > > 		.....
> > > 	}
> > > 
> > > But it's not doing this.
> > > 
> > 
> > In general, if you think I'm mischaracterizing code/behavior, it's
> > helpful to point out how you interpret the specified reasoning so we can
> > correct mistaken assumptions/understandings, clarify descriptions or use
> > better terminology or whatever. That helps to get onto the same page
> > more efficiently, particularly for cases where it so happens we were
> > talking about the same exact thing in the first place and just happen to
> > think about it a bit differently. ;)
> 
> Well, I didn't realise you had mis-applied the terminology until you
> replied to me saying "we're describing the same thing". It's easy to
> say in hindsight "-you- should have just pointed out the problem", but
> that's laying blame for something that nobody is at fault for.
> 
> > For example, to just say "We can, but not for the reasons you are
> > suggesting." doesn't tell me much about how you interpreted my
> > reasoning.
> 
> I think you're being unreasonable here, Brian - I explained the
> skipped vs pinned difference once I understood there was a
> terminology mismatch issue. :/
> 
> I don't expect to have to rewrite emails in my own words just so the
> other party can determine if I've understood the terms they used
> correctly. We have definitions and try to use common, clear
> terminology so that code and communication is efficient and we don't
> need to endlessly repeat what the other person has said.
> 
> Misunderstandings over terminology are relatively rare - as soon as
> I understood that you were using terminology with a different
> definition to the code you were referring to, I pointed it out and
> explained where the misunderstanding was.
> 
> I cannot do any more than that, and to expect more from /anyone/ is
> entirely unreasonable.
> 

Fair enough. I'm just trying to see if we can improve communication
signal a bit here, but I suppose some degree of miscommunication is
inevitable in a highly technical forum. Not worth fussing over too much
I guess..

> > I missed where the busy extent stuff came into play here.. are you
> > pointing out that log forces from the busy extent code can occur under
> > buffer locks and thus we're susceptible to similar problems there? If
> > so, aren't we protected by the transaction in that context?
> 
> I'm not sure - as I said I have not traced/reasoned the logic right
> through yet.  All I was doing is pointing out that these are other
> places we are doing a log force with buffers locked so might have
> the same problem, and....
> 

Ok, that's all I was asking..

> > IOW, wouldn't a log force from busy extent context always occur with
> > locked buffers joined to a transaction? If so, then doesn't the active
> > transaction hold a bli reference and prevent such items from being
> > "freed" in log completion context (i.e. xfs_buf_item_unpin()) if they
> > happened to be pinned? Perhaps I'm missing something...
> 
> ... now you've looked at it in more detail than I have.
> 
> What you've described is why it was considered to be safe, but now
> we have things like deferred AGFL block freeing that pick up and drop
> the AGF lock repeatedly as the single transaction rolls and -
> eventually - starts calling xfs_trans_reserve() on transaction roll.
> That's something we never used to do.  It may still be safe, but it
> is unclear to me how this all interacts in the presence of
> filesystem shutdown conditions.....
> 

Some higher level thoughts...

IME, we've had these kinds of quirky shutdown issues since I've been
working on XFS. Similar to log recovery, it's hit rarely enough that
keeping it robust and reliable across new
features/development is a bit of a challenge. LR is certainly more
critical and I think our test coverage in both areas has improved
significantly over the past few years.

The point is just that I don't think it's worth getting too crazy by
trying to audit every possible path to a sync log force or changing how
various bits of core infrastructure work just to accommodate a very rare
shutdown case. If this turns out to be the only place we require an
object lock in the synchronous log force sequence, it might be best to
find a way to remove it (as Darrick posited earlier). If not, it might
also be more useful to figure a way to detect the "sync log force while
holding (certain) locks" pattern dynamically and provide some assertions
against it to improve test coverage going forward. I'm not quite sure
how to do that off the top of my head. I'll have to think about that
some more. I'm sure we could always come up with a test that reproduces
this particular instance of the problem more reliably, but that test
becomes far less useful once we address the particular instance it
reproduces in the kernel.

> ---
> 
> What it comes down to is that the only way I can answer questions
> like this is to do the complete analysis myself, which simply does
> not scale. Having to jump between multiple separate complex
> questions completely derails my own workflow, and it's recently been
> resulting in me spending >75% of every day answering complex
> questions like this. Duplication of effort is not an efficient use
> of our time or expertise.
> 

Agreed.

> If I see something and say "these other code paths look similar, we
> should check them too", all it means that I noticed a pattern of
> usage and they need to be checked for similar issues.  It does not
> mean "dave has enough spare time on hand and can drop everything to
> analyse the code in question and answer every little detail about
> the similar problem path he noticed."
> 
> Brian, you're way better than me at analysing these complex
> interactions. Be confident in your analysis, just say "no, it's safe
> because...." rather than asking a question that requires me to
> repeat the analysis you've already done.  If there's something that
> you've missed that I spot, I'll point it out, but I really don't
> want to have to repeat everything you've already done just to answer
> a "did I miss anything" question.
> 

Note that my question was more around establishing context since I
didn't see an issue/concern pointed out for the busy extent code (even
after reading back through the thread), but point taken. Thanks for the
feedback.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-21 21:10           ` Dave Chinner
  2019-03-21 21:53             ` Brian Foster
@ 2019-03-26 17:13             ` Brian Foster
  2019-03-27  1:05               ` Dave Chinner
  1 sibling, 1 reply; 18+ messages in thread
From: Brian Foster @ 2019-03-26 17:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, xfs

On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote:
...
> 
> So given that inode reclaim already waits for inodes to be unpinned
> and there is a xfs_buf_unpin_wait() call in
> xfs_bwrite()->xfs_buf_submit() path that will issue a log force on
> the buffer from the inode reclaim path, we don't actually need the
> log force in xfs_iflush() at all. i.e.  the AIL will capture the
> buffer state and unpin inode and dquot buffers automatically, and so
> we can simply remove the pinned-buffer log forces from the
> inode/dquot flush code....
> 

Hmm, where is the log force you reference above in the inode reclaim (->
xfs_bwrite()) path if the buffer happens to be pinned? I see the
xfs_buf_wait_unpin() in the submit path, but that just waits on the pin
count (which I think is safe). It actually looks to me that reclaim
could be susceptible to the write delay problem you referenced earlier
in this thread without the log force from xfs_iflush()..

It also occurs to me that the xfs_iflush() log force isn't blocking
because it's a sync force, but rather because there is already a CIL
push in progress. The flush_work() call basically means that a non-sync
force can either queue a workqueue job and return or turn somewhat
synchronous by waiting for whatever push happens to be in progress. That
implies another potential workaround could be to find a way around this
sync behavior (assuming that would be fundamentally more simple than
just avoiding log forces entirely while holding non-transactional buffer
locks).
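
(For reference, the path I'm looking at is xlog_cil_force_lsn() ->
xlog_cil_push_now(), which as far as I can tell does roughly the
following - paraphrased from memory, so details may be off:)

	static void
	xlog_cil_push_now(
		struct xlog	*log,
		xfs_lsn_t	push_seq)
	{
		struct xfs_cil	*cil = log->l_cilp;

		if (!cil)
			return;

		/*
		 * Start on any pending background push to minimise wait
		 * time on it - this is where an async force can end up
		 * waiting behind an in-progress CIL push.
		 */
		flush_work(&cil->xc_push_work);

		/* do the push now if not done already */
		spin_lock(&cil->xc_push_lock);
		if (push_seq <= cil->xc_push_seq) {
			spin_unlock(&cil->xc_push_lock);
			return;
		}
		cil->xc_push_seq = push_seq;
		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
		spin_unlock(&cil->xc_push_lock);
	}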

Of course that assumes there aren't other variants of this. I still need
to poke around more to get a better idea of the scope. One that looks
like a possibility to me right now is the xfs_iflush_cluster() ->
xfs_force_shutdown() -> xfs_log_force() sequence (in the event of a
corrupt in-core inode on the same pinned buffer) down in the same iflush
path. Another could be in xfs_trans_read_buf_map() where if we somehow
grab a dirty+pinned buffer with b_error set, we can issue a shutdown if
the transaction is dirty before we release the failed buffer. That
combination of events should probably never happen (and the vector is
probably closed by just releasing the buffer before the shutdown), but
it kind of illustrates the potential for whack-a-mole with log forces
and shutdown calls scattered around in so many places.. :/

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-26 17:13             ` Brian Foster
@ 2019-03-27  1:05               ` Dave Chinner
  2019-03-27 14:13                 ` Brian Foster
  0 siblings, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2019-03-27  1:05 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs

On Tue, Mar 26, 2019 at 01:13:24PM -0400, Brian Foster wrote:
> On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote:
> ...
> > 
> > So given that inode reclaim already waits for inodes to be unpinned
> > and there is a xfs_buf_unpin_wait() call in
> > xfs_bwrite()->xfs_buf_submit() path that will issue a log force on
> > the buffer from the inode reclaim path, we don't actually need the
> > log force in xfs_iflush() at all. i.e.  the AIL will capture the
> > buffer state and unpin inode and dquot buffers automatically, and so
> > we can simply remove the pinned-buffer log forces from the
> > inode/dquot flush code....
> > 
> 
> Hmm, where is the log force you reference above in the inode reclaim (->
> xfs_bwrite()) path if the buffer happens to be pinned? I see the
> xfs_buf_wait_unpin() in the submit path, but that just waits on the pin
> count (which I think is safe). It actually looks to me that reclaim
> could be susceptible to the write delay problem you referenced earlier
> in this thread without the log force from xfs_iflush()..

I was thinking that the unpin wait code issued the log force.
Oh, we only issue a log force in xfs_iunpin_wait(), not in the
xfs_buf_wait_unpin(). Ugh. My mistake.

As for the write delay problem, it's entirely possible that this could
occur, but I have no evidence that it is actually a problem for
reclaim - I have only ever found reclaim blocking waiting on IO in
synchronous reclaim, never blocking on buffer unpin waits in async
reclaim. That's not to say it doesn't happen, I've just never seen
it and I've been looking at inode reclaim blocking vectors quite a
lot in the past couple of years....

> It also occurs to me that the xfs_iflush() log force isn't blocking
> > because it's a sync force, but rather because there is already a CIL
> push in progress. The flush_work() call basically means that a non-sync
> force can either queue a workqueue job and return or turn somewhat
> synchronous by waiting for whatever push happens to be in progress.

xfs_log_force() isn't a non-blocking operation - it is supposed to
write and dispatch all the pending changes in memory. If
XFS_LOG_SYNC is not set then it does not wait for IO completion - it
is expected that the caller will then wait on whatever it needs
unpinning, not wait for everything in the log flush to be completed.
e.g. this is why xfs_iunpin_wait() does not use XFS_LOG_SYNC - we
only need to wait for the inode to be unpinned, not everything in
the log.
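
i.e. the caller-side pattern looks roughly like this (paraphrasing
xfs_iunpin_wait() from memory, so details may be off):

	wait_queue_head_t *wq = bit_waitqueue(&ip->i_flags, __XFS_IPINNED_BIT);
	DEFINE_WAIT_BIT(wait, &ip->i_flags, __XFS_IPINNED_BIT);

	/* kick the log up to the inode's last commit lsn, don't wait on it */
	xfs_log_force_lsn(ip->i_mount, ip->i_itemp->ili_last_lsn, 0, NULL);

	/* then wait only for this inode's pin count to drop to zero */
	do {
		prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
		if (xfs_ipincount(ip))
			io_schedule();
	} while (xfs_ipincount(ip));
	finish_wait(wq, &wait.wq_entry);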

> That
> implies another potential workaround could be to find a way around this
> sync behavior (assuming that would be fundamentally more simple than
> just avoiding log forces entirely while holding non-transactional buffer
> locks).

Or we can just remove it altogether and put the necessary log forces
in the callers where the buffer is no longer held.

> Of course that assumes there aren't other variants of this. I still need
> to poke around more to get a better idea of the scope. One that looks
> like a possibility to me right now is the xfs_iflush_cluster() ->
> xfs_force_shutdown() -> xfs_log_force() sequence (in the event of a

The xfs_iflush_cluster error handling has always been full of
shutdown-related bugs.

xfs_iflush() gets this buffer-vs-shutdown ordering right, but if
it's ok to just release the buffer with the inode still attached to
it in the xfs_iflush() case, then why isn't it ok to do exactly the
same thing in xfs_iflush_cluster()?

Oh, it's because the xfs_iflush() error handling doesn't need to
remove the inode from the buffer to get a reference to it to call
xfs_iflush_abort() to remove the inode from the AIL.

IOWs, I think xfs_iflush_cluster() should really be doing the same
thing - pull the list of inodes off the buffer, release the buffer,
trigger a shutdown, then call xfs_iflush_abort() on each of those
inodes....

> corrupt in-core inode on the same pinned buffer) down in the same iflush
> path. Another could be in xfs_trans_read_buf_map() where if we somehow
> grab a dirty+pinned buffer with b_error set, we can issue a shutdown if
> the transaction is dirty before we release the failed buffer.

Again, that would be indicative of a bug somewhere else. A buffer
found by a buffer lookup should not be pinned and dirty in memory with a
b_error value on it.  Whatever recorded the error should have
processed and cleared the error, marked the buffer as stale, and/or
shut down the filesystem, not left it in cache for someone else
to trip over.

e.g. see xfs_buf_iodone_callback_error(). If it's a transient error,
it clears b_error before it returns. If it's a permanent error, it
shuts down the filesystem and stales the buffer and so it will not
ever be seen again on lookup.

i.e. The code in xfs_trans_read_buf_map() is detecting physical read
IO errors bringing new metadata into the cache and reacting
appropriately. Clean transactions can back out gracefully, while
dirty transactions will trigger a shutdown during cancellation
anyway, so the early shutdown is just a way of ensuring we error
out appropriately. Maybe all the callers get the error handling
right, and if that is the case then we can remove the shutdowns from
xfs_trans_read_buf_map() altogether.....

> That
> combination of events should probably never happen (and the vector is
> probably closed by just releasing the buffer before the shutdown), but
> it kind of illustrates the potential for whack-a-mole with log forces
> and shutdown calls scattered around in so many places.. :/

If we end up in this kind of "whack-a-mole" scenario, it indicates
we have other bugs that need fixing. Calling xfs_force_shutdown()
holding inappropriate locks is a bug that should be fixed; changing
how xfs_log_force() works is a much more risky proposition than
fixing the callers that aren't safe...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-25 12:34                     ` Brian Foster
@ 2019-03-27  1:22                       ` Dave Chinner
  0 siblings, 0 replies; 18+ messages in thread
From: Dave Chinner @ 2019-03-27  1:22 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, xfs

On Mon, Mar 25, 2019 at 08:34:59AM -0400, Brian Foster wrote:
> On Mon, Mar 25, 2019 at 10:03:27AM +1100, Dave Chinner wrote:
> > > IOW, wouldn't a log force from busy extent context always occur with
> > > locked buffers joined to a transaction? If so, then doesn't the active
> > > transaction hold a bli reference and prevent such items from being
> > > "freed" in log completion context (i.e. xfs_buf_item_unpin()) if they
> > > happened to be pinned? Perhaps I'm missing something...
> > 
> > ... now you've looked at it in more detail than I have.
> > 
> > What you've described is why it was considered to be safe, but now
> > we have things like deferred AGFL block freeing that pick up and drop
> > the AGF lock repeatedly as the single transaction rolls and -
> > eventually - starts calling xfs_trans_reserve() on transaction roll.
> > That's something we never used to do.  It may still be safe, but it
> > is unclear to me how this all interacts in the presence of
> > filesystem shutdown conditions.....
> > 
> 
> Some higher level thoughts...
> 
> IME, we've had these kinds of quirky shutdown issues since I've been
> working on XFS. Similar to log recovery, it's hit rarely enough that
> keeping it robust and reliable across new
> features/development is a bit of a challenge. LR is certainly more
> critical and I think our test coverage in both areas has improved
> significantly over the past few years.

Log recovery and shutdown have _always_ been a source of bugs, and
there are some particular parts of the code that are more
problematic than others. e.g. Inode cluster flushing error/shutdown
handling has been a regular source of bugs over the years.

And, yes, we do hammer on them a lot more than we ever have, and so
we are slowly peeling the onion and finding the more subtle bugs
that have been there for many, many years.

> The point is just that I don't think it's worth getting too crazy by
> trying to audit every possible path to a sync log force or changing how
> various bits of core infrastructure work just to accommodate a very rare
> shutdown case.

I think we still have to look at them and understand if the way the
problematic shutdowns are done makes any sense anymore. They might have
made sense when those shutdowns were added, but lots of the code is very
different now, as is the shutdown handling. Hence we might have
shutdowns in places we don't need them anymore (e.g.
xfs_trans_read_buf_map(), xfs_iflush_cluster(), etc).

> If this turns out to be the only place we require an
> object lock in the synchronous log force sequence, it might be best to
> find a way to remove it (as Darrick posited earlier). If not, it might
> also be more useful to figure a way to detect the "sync log force while
> holding (certain) locks" pattern dynamically and provide some assertions
> against it to improve test coverage going forward.  I'm not quite sure
> how to do that off the top of my head. I'll have to think about that
> some more.

That's what lockdep contexts are supposed to be used for. Perhaps
something similar to the memory reclaim contexts could be done here
- locks taken both above and below the shutdown state trigger
warnings....
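
Something along those lines might be enough - an entirely hypothetical
sketch, the map name and hook points are made up, and XFS buffer locks
are semaphores so they'd need their own lockdep annotation before any
of this could see them:

	/* pseudo-lock representing "inside a synchronous log force" */
	static struct lockdep_map xfs_sync_log_force_map =
		STATIC_LOCKDEP_MAP_INIT("xfs_sync_log_force",
					&xfs_sync_log_force_map);

	/*
	 * Wrapped around every synchronous log force so lockdep records
	 * "locks held here -> log force" dependencies.
	 */
	static inline void
	xfs_sync_log_force_check(void)
	{
		lock_map_acquire(&xfs_sync_log_force_map);
		lock_map_release(&xfs_sync_log_force_map);
	}

	/*
	 * The same map would be held for real across CIL/journal IO
	 * completion processing, so locks taken while unpinning items
	 * record "log force -> lock" dependencies. Holding one of those
	 * locks over a sync force then trips lockdep, in much the same
	 * way the fs_reclaim annotations work.
	 */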

> I'm sure we could always come up with a test that reproduces
> this particular instance of the problem more reliably, but that test
> becomes far less useful once we address the particular instance it
> reproduces in the kernel.

*nod*

That's why tests like generic/475 are so good - they keep finding
new bugs in shutdown/recovery...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: generic/475 deadlock?
  2019-03-27  1:05               ` Dave Chinner
@ 2019-03-27 14:13                 ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2019-03-27 14:13 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, xfs

On Wed, Mar 27, 2019 at 12:05:52PM +1100, Dave Chinner wrote:
> On Tue, Mar 26, 2019 at 01:13:24PM -0400, Brian Foster wrote:
> > On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote:
> > ...
> > > 
> > > So given that inode reclaim already waits for inodes to be unpinned
> > > and there is a xfs_buf_unpin_wait() call in
> > > xfs_bwrite()->xfs_buf_submit() path that will issue a log force on
> > > the buffer from the inode reclaim path, we don't actually need the
> > > log force in xfs_iflush() at all. i.e.  the AIL will capture the
> > > buffer state and unpin inode and dquot buffers automatically, and so
> > > we can simply remove the pinned-buffer log forces from the
> > > inode/dquot flush code....
> > > 
> > 
> > Hmm, where is the log force you reference above in the inode reclaim (->
> > xfs_bwrite()) path if the buffer happens to be pinned? I see the
> > xfs_buf_wait_unpin() in the submit path, but that just waits on the pin
> > count (which I think is safe). It actually looks to me that reclaim
> > could be susceptible to the write delay problem you referenced earlier
> > in this thread without the log force from xfs_iflush()..
> 
> I was thinking that the unpin wait code issued the log force.
> Oh, we only issue a log force in xfs_iunpin_wait(), not in the
> xfs_buf_wait_unpin(). Ugh. My mistake.
> 
> As for the write delay problem, it's entirely possible that this could
> occur, but I have no evidence that it is actually a problem for
> reclaim - I have only ever found reclaim blocking waiting on IO in
> synchronous reclaim, never blocking on buffer unpin waits in async
> reclaim. That's not to say it doesn't happen, I've just never seen
> it and I've been looking at inode reclaim blocking vectors quite a
> lot in the past couple of years....
> 

Good info, thanks. So this is something we should at least be cognizant
of, but it sounds like it's not a critical or likely state for reclaim.

> > It also occurs to me that the xfs_iflush() log force isn't blocking
> > because it's a sync force, but rather because there is already a CIL
> > push in progress. The flush_work() call basically means that a non-sync
> > force can either queue a workqueue job and return or turn somewhat
> > synchronous by waiting for whatever push happens to be in progress.
> 
> xfs_log_force() isn't a non-blocking operation - it is supposed to
> write and dispatch all the pending changes in memory. If
> XFS_LOG_SYNC is not set then it does not wait for IO completion - it
> is expected that the caller will then wait on whatever it needs
> unpinning, not wait for everything in the log flush to be completed.
> e.g. this is why xfs_iunpin_wait() does not use XFS_LOG_SYNC - we
> only need to wait for the inode to be unpinned, not everything in
> the log.
> 

Yes, I don't mean to imply it's nonblocking. I was thinking that the
case where we queue the CIL push could prevent the caller from blocking
on the force completion (even though we still wait on the commit
completion). Looking closer, I see that the CIL push is where we
actually abort the CIL context (xlog_cil_push() calls
xlog_cil_committed(..., XFS_LI_ABORTED)) and that occurs before the wake
of the commit completion, so the behavior would probably be the same
even without calling flush_work().

But the interesting thing to note here is that in the normal case,
xc_commit_wait is awoken after submission (at the tail end of
xlog_cil_push()) right when we assign ->commit_lsn. In the abort case,
xc_commit_wait is technically awoken after callback completion
processing in xlog_cil_committed(). This occurs after the
xfs_trans_committed_bulk() call and thus serializes on unpin processing
(where we need the buffer lock).

It's not clear to me why the wake is here. From the code, I assume it's
just to batch things into a single xc_push_lock critical section. We
need a wake somewhere of course because we haven't assigned a commit_lsn
in the abort case, but I wonder if without this particular ordering an
async log force would now be safe from deadlock. For example, if
xlog_cil_committed() awoke xc_commit_wait earlier, the CIL push could
unwind in parallel to the CIL context abort. It already doesn't wait on
log force completion because it's an async force (no XFS_LOG_SYNC), and
thus xfs_iflush() can unwind and (eventually) unlock the buffer. I.e.,
this would be safe for shutdown cases in the same general manner it's
currently safe for the non-shutdown case. Hmmm....

Darrick, since I haven't been able to reproduce this problem, could you
give the appended (untested) patch a try and let me know whether it
helps at all?

> > That
> > implies another potential workaround could be to find a way around this
> > sync behavior (assuming that would be fundamentally more simple than
> > just avoiding log forces entirely while holding non-transactional buffer
> > locks).
> 
> Or we can just remove it altogether and put the necessary log forces
> in the callers where the buffer is no longer held.
> 

Yep, that's what I meant by just avoiding the situation entirely..

> > Of course that assumes there aren't other variants of this. I still need
> > to poke around more to get a better idea of the scope. One that looks
> > like a possibility to me right now is the xfs_iflush_cluster() ->
> > xfs_force_shutdown() -> xfs_log_force() sequence (in the event of a
> 
> The xfs_iflush_cluster error handling has always been full of
> shutdown-related bugs.
> 

Indeed, I recall dealing with these various problems in the past.

> xfs_iflush() gets this buffer-vs-shutdown ordering right, but if
> it's ok to just release the buffer with the inode still attached to
> it in the xfs_iflush() case, then why isn't it ok to do exactly the
> same thing in xfs_iflush_cluster()?
> 
> Oh, it's because the xfs_iflush() error handling doesn't need to
> remove the inode from the buffer to get a reference to it to call
> xfs_iflush_abort() to remove the inode from the AIL.
> 
> IOWs, I think xfs_iflush_cluster() should really be doing the same
> thing - pull the list of inodes off the buffer, release the buffer,
> trigger a shutdown, then call xfs_iflush_abort() on each of those
> inodes....
> 

I'll have to take a closer look at the broader iflush[_cluster]() error
handling to wrap my head around this.

> > corrupt in-core inode on the same pinned buffer) down in the same iflush
> > path. Another could be in xfs_trans_read_buf_map() where if we somehow
> > grab a dirty+pinned buffer with b_error set, we can issue a shutdown if
> > the transaction is dirty before we release the failed buffer.
> 
> Again, that would be indicative of a bug somewhere else. A buffer
> found by a buffer lookup should not be pinned and dirty in memory with a
> b_error value on it.  Whatever recorded the error should have
> processed and cleared the error, marked the buffer as stale, and/or
> shut down the filesystem, not left it in cache for someone else
> to trip over.
> 

Completely agreed that this should never happen. I'm fine with
categorizing these kind of scenarios as a separate bug. I'm just
describing the complexity involved with trying to reason about the
scope of the general problem.

> e.g. see xfs_buf_iodone_callback_error(). If it's a transient error,
> it clears b_error before it returns. If it's a permanent error, it
> shuts down the filesystem and stales the buffer and so it will not
> ever be seen again on lookup.
> 
> i.e. The code in xfs_trans_read_buf_map() is detecting physical read
> IO errors bringing new metadata into the cache and reacting
> appropriately. Clean transactions can back out gracefully, while
> dirty transactions will trigger a shutdown during cancellation
> anyway, so the early shutdown is just a way of ensuring we error
> out appropriately. Maybe all the callers get the error handling
> right, and if that is the case then we can remove the shutdowns from
> xfs_trans_read_buf_map() altogether.....
> 
> > That
> > combination of events should probably never happen (and the vector is
> > probably closed by just releasing the buffer before the shutdown), but
> > it kind of illustrates the potential for whack-a-mole with log forces
> > and shutdown calls scattered around in so many places.. :/
> 
> If we end up in this kind of "whack-a-mole" scenario, it indicates
> we have other bugs that need fixing. Calling xfs_force_shutdown()
> holding inappropriate locks is a bug that should be fixed; changing
> how xfs_log_force() works is a much more risky proposition than
> fixing the callers that aren't safe...
> 

I agree in principle on changing xfs_log_force() behavior, but I think
there's a possibility that CIL context abort simply has a bug that
causes this problem for async log forces that happen to race with a
shutdown (re: the analysis above).

That aside, having some kind of guideline on when (not) to call a sync
log force or shutdown is partly what I'm trying to reason about. Your
statement above implies that such a guideline could be to not do so in
general while holding things like buffer locks. That sounds perfectly
reasonable to me and is much more straightforward of a concept to apply
to future code on review rather than trying to work out whether each
sync log force or shutdown site can somehow result in a shutdown
deadlock. The only open question I see is whether this needs to apply to
all log forces or only sync log forces...

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index d3884e08b43c..5e595948bc5a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -582,6 +582,19 @@ xlog_cil_committed(
 	struct xfs_cil_ctx	*ctx = args;
 	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
 
+	/*
+	 * If the I/O failed, we're aborting the commit and already shutdown.
+	 * Wake any commit waiters before aborting the log items so we don't
+	 * block async log pushers on callbacks. Async log pushers explicitly do
+	 * not wait on log force completion because they may be holding locks
+	 * required to unpin items.
+	 */
+	if (abort) {
+		spin_lock(&ctx->cil->xc_push_lock);
+		wake_up_all(&ctx->cil->xc_commit_wait);
+		spin_unlock(&ctx->cil->xc_push_lock);
+	}
+
 	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
 					ctx->start_lsn, abort);
 
@@ -589,15 +602,7 @@ xlog_cil_committed(
 	xfs_extent_busy_clear(mp, &ctx->busy_extents,
 			     (mp->m_flags & XFS_MOUNT_DISCARD) && !abort);
 
-	/*
-	 * If we are aborting the commit, wake up anyone waiting on the
-	 * committing list.  If we don't, then a shutdown we can leave processes
-	 * waiting in xlog_cil_force_lsn() waiting on a sequence commit that
-	 * will never happen because we aborted it.
-	 */
 	spin_lock(&ctx->cil->xc_push_lock);
-	if (abort)
-		wake_up_all(&ctx->cil->xc_commit_wait);
 	list_del(&ctx->committing);
 	spin_unlock(&ctx->cil->xc_push_lock);
 

