* generic/475 deadlock?
@ 2019-03-20  5:04 Darrick J. Wong
  2019-03-20 17:03 ` Brian Foster
  2019-03-20 21:39 ` Dave Chinner
  0 siblings, 2 replies; 18+ messages in thread
From: Darrick J. Wong @ 2019-03-20  5:04 UTC (permalink / raw)
To: xfs

Hmmm.

Every now and then I see a generic/475 deadlock that generates the
hangcheck warning pasted below.

I /think/ this is ... the ail is processing an inode log item, for which
it locked the cluster buffer and pushed the cil to unpin the buffer.
However, the cil is cleaning up after the shutdown and is trying to
simulate an EIO completion, but it tries to grab the buffer lock, and
hence the cil and ail deadlock.  Maybe the solution is to trylock in the
(freed && remove) case of xfs_buf_item_unpin, since we're tearing the
whole system down anyway?

Anyway, tired, going to bed now, really just backing up my brain to the
internet... :)

--D

INFO: task kworker/1:4:4668 blocked for more than 61 seconds.
      Not tainted 5.1.0-rc1-xfsx #rc1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kworker/1:4     D11632  4668      2 0x80000000
Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
Call Trace:
 ? __schedule+0x309/0x9e0
 schedule+0x36/0x90
 schedule_timeout+0x1fd/0x500
 ? __next_timer_interrupt+0xc0/0xc0
 ? __down_timeout+0x9b/0x120
 __down_timeout+0xc3/0x120
 down_timeout+0x43/0x50
 xfs_buf_lock+0x7a/0x280 [xfs]
 xfs_buf_item_unpin+0x300/0x4a0 [xfs]
 xfs_trans_committed_bulk+0x105/0x3e0 [xfs]
 xlog_cil_committed+0x4b/0x410 [xfs]
 ? mark_held_locks+0x47/0x70
 ? xlog_cil_push+0x3f3/0x460 [xfs]
 ? xlog_cil_push+0x3f3/0x460 [xfs]
 ? rcu_read_lock_sched_held+0x74/0x80
 xlog_cil_push+0x400/0x460 [xfs]
 process_one_work+0x252/0x600
 worker_thread+0x1d9/0x390
 ? wq_calc_node_cpumask+0x80/0x80
 kthread+0x11f/0x140
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x24/0x30
INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds.
      Not tainted 5.1.0-rc1-xfsx #rc1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
xfsaild/dm-0    D13752  7063      2 0x80000000
Call Trace:
 ? __schedule+0x309/0x9e0
 ? wait_for_completion+0x12b/0x1c0
 schedule+0x36/0x90
 schedule_timeout+0x32a/0x500
 ? wait_for_completion+0x12b/0x1c0
 wait_for_completion+0x153/0x1c0
 ? wake_up_q+0x80/0x80
 __flush_work+0x1e1/0x300
 ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
 ? wait_for_completion+0x43/0x1c0
 xlog_cil_force_lsn+0x97/0x270 [xfs]
 ? xfs_iflush+0x38d/0x470 [xfs]
 xfs_log_force+0xff/0x4f0 [xfs]
 ? xfs_dinode_calc_crc+0x3b/0x60 [xfs]
 ? xfs_iflush_int+0x24d/0x570 [xfs]
 xfs_iflush+0x38d/0x470 [xfs]
 xfs_inode_item_push+0xe6/0x200 [xfs]
 xfsaild+0x50d/0x1210 [xfs]
 ? sched_clock+0x5/0x10
 ? sched_clock_cpu+0xe/0xd0
 ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
 kthread+0x11f/0x140
 ? kthread_create_on_node+0x70/0x70
 ret_from_fork+0x24/0x30
INFO: task umount:7089 blocked for more than 61 seconds.
      Not tainted 5.1.0-rc1-xfsx #rc1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
umount          D13312  7089   6414 0x00000000
Call Trace:
 ? __schedule+0x309/0x9e0
 ? wait_for_completion+0x12b/0x1c0
 schedule+0x36/0x90
 schedule_timeout+0x32a/0x500
 ? wait_for_completion+0x12b/0x1c0
 wait_for_completion+0x153/0x1c0
 ? wake_up_q+0x80/0x80
 __flush_work+0x1e1/0x300
 ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
 ? wait_for_completion+0x43/0x1c0
 xlog_cil_force_lsn+0x97/0x270 [xfs]
 ? xfs_fs_sync_fs+0x21/0x50 [xfs]
 xfs_log_force+0xff/0x4f0 [xfs]
 xfs_fs_sync_fs+0x21/0x50 [xfs]
 sync_filesystem+0x6b/0x90
 generic_shutdown_super+0x22/0x100
 kill_block_super+0x21/0x50
 deactivate_locked_super+0x29/0x60
 cleanup_mnt+0x3b/0x70
 task_work_run+0x98/0xc0
 exit_to_usermode_loop+0x83/0x90
 do_syscall_64+0x14b/0x150
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7fa9ff80e8c7
Code: Bad RIP value.
RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7
RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100
RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40
R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100
R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000

Showing all locks held in the system:
1 lock held by khungtaskd/34:
 #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f
1 lock held by in:imklog/1113:
 #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60
5 locks held by kworker/u13:4/3736:
2 locks held by kworker/1:4/4668:
 #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
 #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600
2 locks held by kworker/2:4/5394:
 #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
 #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600
1 lock held by xfsaild/dm-0/7063:
 #0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs]
1 lock held by umount/7089:
 #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50

=============================================

^ permalink raw reply	[flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-20 5:04 generic/475 deadlock? Darrick J. Wong @ 2019-03-20 17:03 ` Brian Foster 2019-03-20 17:45 ` Darrick J. Wong 2019-03-20 21:39 ` Dave Chinner 1 sibling, 1 reply; 18+ messages in thread From: Brian Foster @ 2019-03-20 17:03 UTC (permalink / raw) To: Darrick J. Wong; +Cc: xfs On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote: > Hmmm. > > Every now and then I see a generic/475 deadlock that generates the > hangcheck warning pasted below. > Haven't seen it, but I'll kick off a few iterations.. > I /think/ this is ... the ail is processing an inode log item, for which > it locked the cluster buffer and pushed the cil to unpin the buffer. > However, the cil is cleaning up after the shut down and is trying to > simulate an EIO completion, but tries grabs the buffer lock and hence > the cil and ail deadlock. Maybe the solution is to trylock in the > (freed && remove) case of xfs_buf_item_unpin, since we're tearing the > whole system down anyway? > While the xfsaild will lock buffers via ->iop_push() callouts, I think this is only to queue the buffer to the delwri queue (both buffer and inode item pushes should cycle the buf lock before returning) and then to submit the pending delwri queue. I don't think xfsaild should own any such locks at the time it issues a log force.. > Anyway, tired, going to bed now, really just backing up my brain to the > internet... :) > > --D > > INFO: task kworker/1:4:4668 blocked for more than 61 seconds. > Not tainted 5.1.0-rc1-xfsx #rc1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > kworker/1:4 D11632 4668 2 0x80000000 > Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs] > Call Trace: > ? __schedule+0x309/0x9e0 > schedule+0x36/0x90 > schedule_timeout+0x1fd/0x500 > ? __next_timer_interrupt+0xc0/0xc0 > ? 
__down_timeout+0x9b/0x120 > __down_timeout+0xc3/0x120 > down_timeout+0x43/0x50 > xfs_buf_lock+0x7a/0x280 [xfs] > xfs_buf_item_unpin+0x300/0x4a0 [xfs] > xfs_trans_committed_bulk+0x105/0x3e0 [xfs] > xlog_cil_committed+0x4b/0x410 [xfs] > ? mark_held_locks+0x47/0x70 > ? xlog_cil_push+0x3f3/0x460 [xfs] > ? xlog_cil_push+0x3f3/0x460 [xfs] > ? rcu_read_lock_sched_held+0x74/0x80 > xlog_cil_push+0x400/0x460 [xfs] > process_one_work+0x252/0x600 > worker_thread+0x1d9/0x390 > ? wq_calc_node_cpumask+0x80/0x80 > kthread+0x11f/0x140 > ? kthread_create_on_node+0x70/0x70 > ret_from_fork+0x24/0x30 Ok, so there's a CIL push where the log write failed and the buf item abort is waiting on the buf lock.. > INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds. > Not tainted 5.1.0-rc1-xfsx #rc1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > xfsaild/dm-0 D13752 7063 2 0x80000000 > Call Trace: > ? __schedule+0x309/0x9e0 > ? wait_for_completion+0x12b/0x1c0 > schedule+0x36/0x90 > schedule_timeout+0x32a/0x500 > ? wait_for_completion+0x12b/0x1c0 > wait_for_completion+0x153/0x1c0 > ? wake_up_q+0x80/0x80 > __flush_work+0x1e1/0x300 > ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 > ? wait_for_completion+0x43/0x1c0 > xlog_cil_force_lsn+0x97/0x270 [xfs] > ? xfs_iflush+0x38d/0x470 [xfs] > xfs_log_force+0xff/0x4f0 [xfs] > ? xfs_dinode_calc_crc+0x3b/0x60 [xfs] > ? xfs_iflush_int+0x24d/0x570 [xfs] > xfs_iflush+0x38d/0x470 [xfs] > xfs_inode_item_push+0xe6/0x200 [xfs] > xfsaild+0x50d/0x1210 [xfs] > ? sched_clock+0x5/0x10 > ? sched_clock_cpu+0xe/0xd0 > ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs] > kthread+0x11f/0x140 > ? kthread_create_on_node+0x70/0x70 > ret_from_fork+0x24/0x30 xfsaild forces the log and the associated CIL push blocks on the one already in progress.. > INFO: task umount:7089 blocked for more than 61 seconds. > Not tainted 5.1.0-rc1-xfsx #rc1 > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
> umount D13312 7089 6414 0x00000000 > Call Trace: > ? __schedule+0x309/0x9e0 > ? wait_for_completion+0x12b/0x1c0 > schedule+0x36/0x90 > schedule_timeout+0x32a/0x500 > ? wait_for_completion+0x12b/0x1c0 > wait_for_completion+0x153/0x1c0 > ? wake_up_q+0x80/0x80 > __flush_work+0x1e1/0x300 > ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 > ? wait_for_completion+0x43/0x1c0 > xlog_cil_force_lsn+0x97/0x270 [xfs] > ? xfs_fs_sync_fs+0x21/0x50 [xfs] > xfs_log_force+0xff/0x4f0 [xfs] > xfs_fs_sync_fs+0x21/0x50 [xfs] > sync_filesystem+0x6b/0x90 > generic_shutdown_super+0x22/0x100 > kill_block_super+0x21/0x50 > deactivate_locked_super+0x29/0x60 > cleanup_mnt+0x3b/0x70 > task_work_run+0x98/0xc0 > exit_to_usermode_loop+0x83/0x90 > do_syscall_64+0x14b/0x150 > entry_SYSCALL_64_after_hwframe+0x49/0xbe And unmount is doing a log force as well.. Hmm.. yeah, I think we need to figure out how/why the buffer is locked. I don't think it should be xfsaild because while the delwri submit locks the buffers for I/O, that should essentially transfer lock ownership to the async I/O (i.e., to be unlocked via xfs_buf_ioend() via bio completion). Is the buf I/O associated completion wq stuck somewhere? Is the buffer submission path working correctly in the shutdown case..? Brian > RIP: 0033:0x7fa9ff80e8c7 > Code: Bad RIP value. 
> RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 > RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7 > RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100 > RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40 > R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100 > R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000 > > Showing all locks held in the system: > 1 lock held by khungtaskd/34: > #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f > 1 lock held by in:imklog/1113: > #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60 > 5 locks held by kworker/u13:4/3736: > 2 locks held by kworker/1:4/4668: > #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600 > #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600 > 2 locks held by kworker/2:4/5394: > #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600 > #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600 > 1 lock held by xfsaild/dm-0/7063: > #0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs] > 1 lock held by umount/7089: > #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50 > > ============================================= > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-20 17:03 ` Brian Foster @ 2019-03-20 17:45 ` Darrick J. Wong 2019-03-20 17:59 ` Brian Foster 0 siblings, 1 reply; 18+ messages in thread From: Darrick J. Wong @ 2019-03-20 17:45 UTC (permalink / raw) To: Brian Foster; +Cc: xfs On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote: > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote: > > Hmmm. > > > > Every now and then I see a generic/475 deadlock that generates the > > hangcheck warning pasted below. > > > > Haven't seen it, but I'll kick off a few iterations.. > > > I /think/ this is ... the ail is processing an inode log item, for which > > it locked the cluster buffer and pushed the cil to unpin the buffer. > > However, the cil is cleaning up after the shut down and is trying to > > simulate an EIO completion, but tries grabs the buffer lock and hence > > the cil and ail deadlock. Maybe the solution is to trylock in the > > (freed && remove) case of xfs_buf_item_unpin, since we're tearing the > > whole system down anyway? > > > > While the xfsaild will lock buffers via ->iop_push() callouts, I think > this is only to queue the buffer to the delwri queue (both buffer and > inode item pushes should cycle the buf lock before returning) and then > to submit the pending delwri queue. I don't think xfsaild should own any > such locks at the time it issues a log force.. ...but I think it does (see below) because iflush forces the log when the cluster buffer is pinned, and only after that does it add the buffer to the delwri queue. > > Anyway, tired, going to bed now, really just backing up my brain to the > > internet... :) > > > > --D > > > > INFO: task kworker/1:4:4668 blocked for more than 61 seconds. > > Not tainted 5.1.0-rc1-xfsx #rc1 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > kworker/1:4 D11632 4668 2 0x80000000 > > Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs] > > Call Trace: > > ? 
__schedule+0x309/0x9e0 > > schedule+0x36/0x90 > > schedule_timeout+0x1fd/0x500 > > ? __next_timer_interrupt+0xc0/0xc0 > > ? __down_timeout+0x9b/0x120 > > __down_timeout+0xc3/0x120 > > down_timeout+0x43/0x50 > > xfs_buf_lock+0x7a/0x280 [xfs] > > xfs_buf_item_unpin+0x300/0x4a0 [xfs] > > xfs_trans_committed_bulk+0x105/0x3e0 [xfs] > > xlog_cil_committed+0x4b/0x410 [xfs] > > ? mark_held_locks+0x47/0x70 > > ? xlog_cil_push+0x3f3/0x460 [xfs] > > ? xlog_cil_push+0x3f3/0x460 [xfs] > > ? rcu_read_lock_sched_held+0x74/0x80 > > xlog_cil_push+0x400/0x460 [xfs] > > process_one_work+0x252/0x600 > > worker_thread+0x1d9/0x390 > > ? wq_calc_node_cpumask+0x80/0x80 > > kthread+0x11f/0x140 > > ? kthread_create_on_node+0x70/0x70 > > ret_from_fork+0x24/0x30 > > Ok, so there's a CIL push where the log write failed and the buf item > abort is waiting on the buf lock.. > > > INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds. > > Not tainted 5.1.0-rc1-xfsx #rc1 > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > xfsaild/dm-0 D13752 7063 2 0x80000000 > > Call Trace: > > ? __schedule+0x309/0x9e0 > > ? wait_for_completion+0x12b/0x1c0 > > schedule+0x36/0x90 > > schedule_timeout+0x32a/0x500 > > ? wait_for_completion+0x12b/0x1c0 > > wait_for_completion+0x153/0x1c0 > > ? wake_up_q+0x80/0x80 > > __flush_work+0x1e1/0x300 > > ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 > > ? wait_for_completion+0x43/0x1c0 > > xlog_cil_force_lsn+0x97/0x270 [xfs] > > ? xfs_iflush+0x38d/0x470 [xfs] > > xfs_log_force+0xff/0x4f0 [xfs] > > ? xfs_dinode_calc_crc+0x3b/0x60 [xfs] > > ? xfs_iflush_int+0x24d/0x570 [xfs] > > xfs_iflush+0x38d/0x470 [xfs] > > xfs_inode_item_push+0xe6/0x200 [xfs] > > xfsaild+0x50d/0x1210 [xfs] > > ? sched_clock+0x5/0x10 > > ? sched_clock_cpu+0xe/0xd0 > > ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs] > > kthread+0x11f/0x140 > > ? 
kthread_create_on_node+0x70/0x70
> > ret_from_fork+0x24/0x30
> 
> xfsaild forces the log and the associated CIL push blocks on the one
> already in progress..
> 
> > INFO: task umount:7089 blocked for more than 61 seconds.
> >       Not tainted 5.1.0-rc1-xfsx #rc1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > umount          D13312  7089   6414 0x00000000
> > Call Trace:
> >  ? __schedule+0x309/0x9e0
> >  ? wait_for_completion+0x12b/0x1c0
> >  schedule+0x36/0x90
> >  schedule_timeout+0x32a/0x500
> >  ? wait_for_completion+0x12b/0x1c0
> >  wait_for_completion+0x153/0x1c0
> >  ? wake_up_q+0x80/0x80
> >  __flush_work+0x1e1/0x300
> >  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> >  ? wait_for_completion+0x43/0x1c0
> >  xlog_cil_force_lsn+0x97/0x270 [xfs]
> >  ? xfs_fs_sync_fs+0x21/0x50 [xfs]
> >  xfs_log_force+0xff/0x4f0 [xfs]
> >  xfs_fs_sync_fs+0x21/0x50 [xfs]
> >  sync_filesystem+0x6b/0x90
> >  generic_shutdown_super+0x22/0x100
> >  kill_block_super+0x21/0x50
> >  deactivate_locked_super+0x29/0x60
> >  cleanup_mnt+0x3b/0x70
> >  task_work_run+0x98/0xc0
> >  exit_to_usermode_loop+0x83/0x90
> >  do_syscall_64+0x14b/0x150
> >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> And unmount is doing a log force as well..
> 
> Hmm.. yeah, I think we need to figure out how/why the buffer is locked.

AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
returns the locked cluster buffer (code extract from xfs_iflush):

	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK,
			       0);
	if (error == -EAGAIN) {
		xfs_ifunlock(ip);
		return error;
	}
	if (error)
		goto corrupt_out;

So if we make it to here we have bp pointing to the locked inode
cluster buffer...

	/*
	 * First flush out the inode that xfs_iflush was called with.
	 */
	error = xfs_iflush_int(ip, bp);
	if (error)
		goto corrupt_out;

...after which we flush the inode to the (still locked) cluster
buffer...

	/*
	 * If the buffer is pinned then push on the log now so we won't
	 * get stuck waiting in the write for too long.
	 */
	if (xfs_buf_ispinned(bp))
		xfs_log_force(mp, 0);

...but if the buffer is also pinned then we kick the log (while holding
the buffer lock) to unpin the buffer.  Since the fs is shut down, the cil
will just be trying to remove everything, but it needs to lock the
buffer to simulate EIO.

A naive fix to the deadlock is to see if the inode buf ops are attached
and downgrade to a trylock and go ahead even if we can't lock it, but
ick.  It resolves the deadlock so the umount can proceed, but we also
trigger a lot of unlocked buffer asserts.

I'm not sure what "too long" means in that code chunk.  It was added to
reduce stalls in pdflush, which these days I think translates to trying
to reduce the amount of time reclaim can stall while flushing inodes...?

(and, heh, I think I've ended up back at "move the reclaim iflush to
deferred inactivation"...)

> I don't think it should be xfsaild because while the delwri submit locks
> the buffers for I/O, that should essentially transfer lock ownership to
> the async I/O (i.e., to be unlocked via xfs_buf_ioend() via bio
> completion). Is the buf I/O associated completion wq stuck somewhere?

I didn't see any threads waiting on the buffer io completion in the
sysrq-t output.

> Is the buffer submission path working correctly in the shutdown
> case..?

It looks like it's working ok...

--D

> Brian
> 
> > RIP: 0033:0x7fa9ff80e8c7
> > Code: Bad RIP value.
> > RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 > > RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7 > > RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100 > > RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40 > > R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100 > > R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000 > > > > Showing all locks held in the system: > > 1 lock held by khungtaskd/34: > > #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f > > 1 lock held by in:imklog/1113: > > #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60 > > 5 locks held by kworker/u13:4/3736: > > 2 locks held by kworker/1:4/4668: > > #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600 > > #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600 > > 2 locks held by kworker/2:4/5394: > > #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600 > > #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600 > > 1 lock held by xfsaild/dm-0/7063: > > #0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs] > > 1 lock held by umount/7089: > > #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50 > > > > ============================================= > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-20 17:45 ` Darrick J. Wong @ 2019-03-20 17:59 ` Brian Foster 2019-03-20 21:49 ` Dave Chinner 0 siblings, 1 reply; 18+ messages in thread From: Brian Foster @ 2019-03-20 17:59 UTC (permalink / raw) To: Darrick J. Wong; +Cc: xfs On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote: > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote: > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote: > > > Hmmm. > > > > > > Every now and then I see a generic/475 deadlock that generates the > > > hangcheck warning pasted below. > > > > > > > Haven't seen it, but I'll kick off a few iterations.. > > > > > I /think/ this is ... the ail is processing an inode log item, for which > > > it locked the cluster buffer and pushed the cil to unpin the buffer. > > > However, the cil is cleaning up after the shut down and is trying to > > > simulate an EIO completion, but tries grabs the buffer lock and hence > > > the cil and ail deadlock. Maybe the solution is to trylock in the > > > (freed && remove) case of xfs_buf_item_unpin, since we're tearing the > > > whole system down anyway? > > > > > > > While the xfsaild will lock buffers via ->iop_push() callouts, I think > > this is only to queue the buffer to the delwri queue (both buffer and > > inode item pushes should cycle the buf lock before returning) and then > > to submit the pending delwri queue. I don't think xfsaild should own any > > such locks at the time it issues a log force.. > > ...but I think it does (see below) because iflush forces the log when > the cluster buffer is pinned, and only after that does it add the buffer > to the delwri queue. > Oops, my mistake.. I read through the stack too quickly and misread it as the log force directly via xfsaild_push() rather than iflush. > > > Anyway, tired, going to bed now, really just backing up my brain to the > > > internet... 
:) > > > > > > --D > > > > > > INFO: task kworker/1:4:4668 blocked for more than 61 seconds. > > > Not tainted 5.1.0-rc1-xfsx #rc1 > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > kworker/1:4 D11632 4668 2 0x80000000 > > > Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs] > > > Call Trace: > > > ? __schedule+0x309/0x9e0 > > > schedule+0x36/0x90 > > > schedule_timeout+0x1fd/0x500 > > > ? __next_timer_interrupt+0xc0/0xc0 > > > ? __down_timeout+0x9b/0x120 > > > __down_timeout+0xc3/0x120 > > > down_timeout+0x43/0x50 > > > xfs_buf_lock+0x7a/0x280 [xfs] > > > xfs_buf_item_unpin+0x300/0x4a0 [xfs] > > > xfs_trans_committed_bulk+0x105/0x3e0 [xfs] > > > xlog_cil_committed+0x4b/0x410 [xfs] > > > ? mark_held_locks+0x47/0x70 > > > ? xlog_cil_push+0x3f3/0x460 [xfs] > > > ? xlog_cil_push+0x3f3/0x460 [xfs] > > > ? rcu_read_lock_sched_held+0x74/0x80 > > > xlog_cil_push+0x400/0x460 [xfs] > > > process_one_work+0x252/0x600 > > > worker_thread+0x1d9/0x390 > > > ? wq_calc_node_cpumask+0x80/0x80 > > > kthread+0x11f/0x140 > > > ? kthread_create_on_node+0x70/0x70 > > > ret_from_fork+0x24/0x30 > > > > Ok, so there's a CIL push where the log write failed and the buf item > > abort is waiting on the buf lock.. > > > > > INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds. > > > Not tainted 5.1.0-rc1-xfsx #rc1 > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > xfsaild/dm-0 D13752 7063 2 0x80000000 > > > Call Trace: > > > ? __schedule+0x309/0x9e0 > > > ? wait_for_completion+0x12b/0x1c0 > > > schedule+0x36/0x90 > > > schedule_timeout+0x32a/0x500 > > > ? wait_for_completion+0x12b/0x1c0 > > > wait_for_completion+0x153/0x1c0 > > > ? wake_up_q+0x80/0x80 > > > __flush_work+0x1e1/0x300 > > > ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 > > > ? wait_for_completion+0x43/0x1c0 > > > xlog_cil_force_lsn+0x97/0x270 [xfs] > > > ? xfs_iflush+0x38d/0x470 [xfs] > > > xfs_log_force+0xff/0x4f0 [xfs] > > > ? 
xfs_dinode_calc_crc+0x3b/0x60 [xfs] > > > ? xfs_iflush_int+0x24d/0x570 [xfs] > > > xfs_iflush+0x38d/0x470 [xfs] > > > xfs_inode_item_push+0xe6/0x200 [xfs] > > > xfsaild+0x50d/0x1210 [xfs] > > > ? sched_clock+0x5/0x10 > > > ? sched_clock_cpu+0xe/0xd0 > > > ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs] > > > kthread+0x11f/0x140 > > > ? kthread_create_on_node+0x70/0x70 > > > ret_from_fork+0x24/0x30 > > > > xfsaild forces the log and the associated CIL push blocks on the one > > already in progress.. > > > > > INFO: task umount:7089 blocked for more than 61 seconds. > > > Not tainted 5.1.0-rc1-xfsx #rc1 > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > > > umount D13312 7089 6414 0x00000000 > > > Call Trace: > > > ? __schedule+0x309/0x9e0 > > > ? wait_for_completion+0x12b/0x1c0 > > > schedule+0x36/0x90 > > > schedule_timeout+0x32a/0x500 > > > ? wait_for_completion+0x12b/0x1c0 > > > wait_for_completion+0x153/0x1c0 > > > ? wake_up_q+0x80/0x80 > > > __flush_work+0x1e1/0x300 > > > ? flush_workqueue_prep_pwqs+0x1b0/0x1b0 > > > ? wait_for_completion+0x43/0x1c0 > > > xlog_cil_force_lsn+0x97/0x270 [xfs] > > > ? xfs_fs_sync_fs+0x21/0x50 [xfs] > > > xfs_log_force+0xff/0x4f0 [xfs] > > > xfs_fs_sync_fs+0x21/0x50 [xfs] > > > sync_filesystem+0x6b/0x90 > > > generic_shutdown_super+0x22/0x100 > > > kill_block_super+0x21/0x50 > > > deactivate_locked_super+0x29/0x60 > > > cleanup_mnt+0x3b/0x70 > > > task_work_run+0x98/0xc0 > > > exit_to_usermode_loop+0x83/0x90 > > > do_syscall_64+0x14b/0x150 > > > entry_SYSCALL_64_after_hwframe+0x49/0xbe > > > > And unmount is doing a log force as well.. > > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked. 
> > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that > returns the locked cluster buffer (code extract from xfs_iflush): > > error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK, > 0); > if (error == -EAGAIN) { > xfs_ifunlock(ip); > return error; > } > if (error) > goto corrupt_out; > > So if we make it to here we've have bp pointing to the locked inode > cluster buffer... > > /* > * First flush out the inode that xfs_iflush was called with. > */ > error = xfs_iflush_int(ip, bp); > if (error) > goto corrupt_out; > > ...after which we flush the inode to the (still locked) cluster > buffer... > > /* > * If the buffer is pinned then push on the log now so we won't > * get stuck waiting in the write for too long. > */ > if (xfs_buf_ispinned(bp)) > xfs_log_force(mp, 0); > > ...but if the buffer is also pinned then we kick the log (while holding > the buffer lock) to unpin the buffer. Since the fs is shutdown, the cil > will just be trying to remove everything, but it needs to lock the > buffer to simulate EIO. > Yeah, makes sense. > A naive fix to the deadlock is to see if the inode buf ops are attached > and downgrade to a trylock and go ahead even if we can't lock it, but ick. > It resolves the deadlock so the umount can proceed but we also trigger a > lot of unlocked buffer asserts. > > I'm not sure what "too long" means in that code chunk. It was added to > reduce stalls in pdflush, which these days I think translates to trying > to reduce the amount of time reclaim can stall while flushing inodes...? > I'm not sure either. I was wondering why we'd need that given that xfsaild() has its own log force logic in cases where the delwri submit didn't fully complete (i.e., for pinned buffers). Relying on that seems more appropriate since we'd force the log per delwri submission for multiple potential buffers as opposed to per inode cluster buffer. Perhaps we could just kill this log force? 
Brian > (and, heh, I think I've ended up back at "move the reclaim iflush to > deferred inactivation"...) > > > I don't think it should be xfsaild because while the delwri submit locks > > the buffers for I/O, that should essentially transfer lock ownership to > > the async I/O (i.e., to be unlocked via xfs_buf_ioend() via bio > > completion). Is the buf I/O associated completion wq stuck somewhere? > > I didn't see any threads waiting on the buffer io completion in the > sysrq-t output. > > > Is the buffer submission path working correctly in the shutdown > > case..? > > It looks like it's working ok... > > --D > > > Brian > > > > > RIP: 0033:0x7fa9ff80e8c7 > > > Code: Bad RIP value. > > > RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6 > > > RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7 > > > RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100 > > > RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40 > > > R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100 > > > R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000 > > > > > > Showing all locks held in the system: > > > 1 lock held by khungtaskd/34: > > > #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f > > > 1 lock held by in:imklog/1113: > > > #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60 > > > 5 locks held by kworker/u13:4/3736: > > > 2 locks held by kworker/1:4/4668: > > > #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600 > > > #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600 > > > 2 locks held by kworker/2:4/5394: > > > #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600 > > > #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600 > > > 1 lock held by xfsaild/dm-0/7063: > > > 
#0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs] > > > 1 lock held by umount/7089: > > > #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50 > > > > > > ============================================= > > > ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-20 17:59 ` Brian Foster @ 2019-03-20 21:49 ` Dave Chinner 2019-03-21 12:11 ` Brian Foster 0 siblings, 1 reply; 18+ messages in thread From: Dave Chinner @ 2019-03-20 21:49 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, xfs On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote: > On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote: > > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote: > > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote: > > > And unmount is doing a log force as well.. > > > > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked. > > > > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that > > returns the locked cluster buffer (code extract from xfs_iflush): .... > > ...but if the buffer is also pinned then we kick the log (while holding > > the buffer lock) to unpin the buffer. Since the fs is shutdown, the cil > > will just be trying to remove everything, but it needs to lock the > > buffer to simulate EIO. > > > > Yeah, makes sense. Yup, and this same problem means any buffer we hold locked when we issue a blocking log force can deadlock in the same way. If it's a non-blocking log force (i.e. a log flush we don't wait for) then we can move onwards and eventually we unlock the buffer and the log can make progress. > > A naive fix to the deadlock is to see if the inode buf ops are attached > > and downgrade to a trylock and go ahead even if we can't lock it, but ick. > > It resolves the deadlock so the umount can proceed but we also trigger a > > lot of unlocked buffer asserts. Yuck! > > I'm not sure what "too long" means in that code chunk. It was added to > > reduce stalls in pdflush, which these days I think translates to trying > > to reduce the amount of time reclaim can stall while flushing inodes...? 
"too long" means that if there is no log pressure, it might be the next periodic log work that forces the log and unpins the buffer. i.e. it could be several seconds before the buffer gets unpinned. This is the same reason we have the log force in xfs_buf_lock() - if we have to wait for the next log force to unlock/unpin a stale buffer, we can stall for several seconds.

> I'm not sure either. I was wondering why we'd need that given that
> xfsaild() has its own log force logic in cases where the delwri submit
> didn't fully complete (i.e., for pinned buffers). Relying on that seems
> more appropriate since we'd force the log per delwri submission for
> multiple potential buffers as opposed to per inode cluster buffer.
> Perhaps we could just kill this log force?

If the writeback is coming from the AIL, then we could check if the buffer is pinned and then return EDEADLOCK or something like that to get the item push return value set to XFS_ITEM_PINNED, in which case the AIL will then do the log force that will unpin the object. This will happen after we have added the buffer to the AIL delwri list and unlocked the buffer, so everything should work just fine in that case. Similar for dquot flushing from the AIL.

If the writeback is coming from inode reclaim, we've already guaranteed the inode is unpinned, so we'll never hit this case through there.

As for the busy extent code, I'll need to look at that much more closely to determine if there's a way around this....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
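The idea in this message - report the pinned backing buffer to the AIL via the item push return value instead of forcing the log while the buffer is still locked - can be sketched in simplified userspace C. Everything below is a toy model: the types, the `log_flush` counter, and the function are illustrative stand-ins, not the real XFS definitions:

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified model of the suggested push-return-value approach. */
enum push_result { ITEM_SUCCESS, ITEM_PINNED };

struct buf { bool pinned; };
struct inode_item { struct buf *cluster_buf; };

struct ail {
	int log_flush;		/* models ail->ail_log_flush */
};

enum push_result inode_item_push(struct ail *ailp, struct inode_item *iip)
{
	/*
	 * The flush found the backing buffer pinned. Rather than issuing
	 * a blocking log force here - while the buffer is locked, which
	 * is exactly the deadlock above - hand the state back to the AIL
	 * loop, which forces the log after the buffer has been queued
	 * and unlocked.
	 */
	if (iip->cluster_buf->pinned) {
		ailp->log_flush++;	/* AIL forces the log once per cycle */
		return ITEM_PINNED;
	}
	return ITEM_SUCCESS;		/* buffer queued for delwri writeback */
}
```

The key property is ordering: the log force happens with no buffer lock held, so the CIL (or its shutdown path) can always acquire the buffer and complete.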
* Re: generic/475 deadlock? 2019-03-20 21:49 ` Dave Chinner @ 2019-03-21 12:11 ` Brian Foster 2019-03-21 21:10 ` Dave Chinner 0 siblings, 1 reply; 18+ messages in thread From: Brian Foster @ 2019-03-21 12:11 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, xfs On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote: > On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote: > > On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote: > > > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote: > > > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote: > > > > And unmount is doing a log force as well.. > > > > > > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked. > > > > > > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that > > > returns the locked cluster buffer (code extract from xfs_iflush): > .... > > > ...but if the buffer is also pinned then we kick the log (while holding > > > the buffer lock) to unpin the buffer. Since the fs is shutdown, the cil > > > will just be trying to remove everything, but it needs to lock the > > > buffer to simulate EIO. > > > > > > > Yeah, makes sense. > > Yup, and this same problem means any buffer we hold locked when we > issue a blocking log force can deadlock in the same way. If it's a > non-blocking log force (i.e. a log flush we don't wait for) then we > can move onwards and eventually we unlock the buffer and the log can > make progress. > > > > A naive fix to the deadlock is to see if the inode buf ops are attached > > > and downgrade to a trylock and go ahead even if we can't lock it, but ick. > > > It resolves the deadlock so the umount can proceed but we also trigger a > > > lot of unlocked buffer asserts. > > Yuck! > > > > I'm not sure what "too long" means in that code chunk. 
It was added to
> > > reduce stalls in pdflush, which these days I think translates to trying
> > > to reduce the amount of time reclaim can stall while flushing inodes...?
>
> "too long" means that if there is no log pressure, it might be the
> next periodic log work that forces the log and unpins the buffer.
> i.e. it could be several seconds before the buffer gets unpinned.
> This is the same reason we have the log force in xfs_buf_lock() - if
> we have to wait for the next log force to unlock/unpin a stale
> buffer, we can stall for several seconds.
>

Ok.. so it kind of looks to me that this particular instance of the log force is a historical artifact of xfsaild. The latter used to use separate ->trylock() and ->push() log item callbacks where the former would either lock the object or return LOCKED, PINNED, etc. So a pinned buffer or inode would return pinned here and xfsaild does the log force on its behalf once it's through the current cycle.

An unpinned inode with a pinned cluster buffer didn't communicate this state because the inode ->trylock() callout grabbed the ilock and iflock and returned success. The iflush doesn't occur until ->push(), which means if the cluster buffer is pinned and we wanted to force the log then xfs_iflush() is the last chance to do it.

(For reference, I'm looking at the code as of v2.6.33).

The current xfsaild() log force logic is still tied to PINNED objects, but it has additional logic to detect skipped buffers at delwri queue submit time (i.e., due to being pinned) that should cover the pinned inode cluster buffer case.

> > I'm not sure either. I was wondering why we'd need that given that
> > xfsaild() has its own log force logic in cases where the delwri submit
> > didn't fully complete (i.e., for pinned buffers). Relying on that seems
> > more appropriate since we'd force the log per delwri submission for
> > multiple potential buffers as opposed to per inode cluster buffer.
> > Perhaps we could just kill this log force?
> > If the writeback is coming from the AIL, then we could check if the > buffer is pinned and then return EDEADLOCK or something like that to > get the item push return value set to XFS_ITEM_PINNED, in which case > the AIL will then do the log force that will unpin the object. This > will happen after we have added the buffer to the AIL delwri list > and unlocked the buffer, so everything should work just fine in that > case. Similar for dquot flushing from the AIL. > Sounds plausible, though I'd probably just lift the pinned check into the inode item callout after the xfs_iflush() since it already has a reference to the buffer. But given the above, couldn't we just remove the log force from xfs_iflush() and let the existing xfsaild_push() logic handle it? It should detect any pinned buffers after the delwri queue submit and do the log force. It might require a closer look at the existing logic to ensure the log force is timely enough, but ISTM that the purpose of this logic is cover this case. AFAICT nothing prevents the buffer from becoming pinned again once it's added to the queue anyways, even if we just forced the log. Hm? Brian > If the writeback is coming from inode reclaim, we've already > guaranteed the inode is unpinned, so we'll never hit this case > through there. > > As for the busy extent code, I'll need to look at that much more > closely to determine if there's a way around this.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-21 12:11 ` Brian Foster @ 2019-03-21 21:10 ` Dave Chinner 2019-03-21 21:53 ` Brian Foster 2019-03-26 17:13 ` Brian Foster 0 siblings, 2 replies; 18+ messages in thread From: Dave Chinner @ 2019-03-21 21:10 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, xfs On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote: > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote: > > On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote: > > > On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote: > > > > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote: > > > > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote: > > > > > And unmount is doing a log force as well.. > > > > > > > > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked. > > > > > > > > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that > > > > returns the locked cluster buffer (code extract from xfs_iflush): > > .... > > > > ...but if the buffer is also pinned then we kick the log (while holding > > > > the buffer lock) to unpin the buffer. Since the fs is shutdown, the cil > > > > will just be trying to remove everything, but it needs to lock the > > > > buffer to simulate EIO. > > > > > > > > > > Yeah, makes sense. > > > > Yup, and this same problem means any buffer we hold locked when we > > issue a blocking log force can deadlock in the same way. If it's a > > non-blocking log force (i.e. a log flush we don't wait for) then we > > can move onwards and eventually we unlock the buffer and the log can > > make progress. > > > > > > A naive fix to the deadlock is to see if the inode buf ops are attached > > > > and downgrade to a trylock and go ahead even if we can't lock it, but ick. > > > > It resolves the deadlock so the umount can proceed but we also trigger a > > > > lot of unlocked buffer asserts. > > > > Yuck! 
> > > > I'm not sure what "too long" means in that code chunk. It was added to
> > > > reduce stalls in pdflush, which these days I think translates to trying
> > > > to reduce the amount of time reclaim can stall while flushing inodes...?
> > >
> > > "too long" means that if there is no log pressure, it might be the
> > > next periodic log work that forces the log and unpins the buffer.
> > > i.e. it could be several seconds before the buffer gets unpinned.
> > > This is the same reason we have the log force in xfs_buf_lock() - if
> > > we have to wait for the next log force to unlock/unpin a stale
> > > buffer, we can stall for several seconds.
> >
> > Ok.. so it kind of looks to me that this particular instance of the log
> > force is a historical artifact of xfsaild. The latter used to use
> > separate ->trylock() and ->push() log item callbacks where the former
> > would either lock the object or return LOCKED, PINNED, etc. So a pinned
> > buffer or inode would return pinned here and xfsaild does the log force
> > on its behalf once it's through the current cycle.

Sure, that's no different to now. The only thing that is different is that we combined ->trylock, ->push and ->pushbuf into one function. The actual semantics and behaviour are unchanged. i.e. if you look at xfs_iflush(ip, ASYNC) back then, it did:

	/*
	 * If the buffer is pinned then push on the log now so we won't
	 * get stuck waiting in the write for too long.
	 */
	if (XFS_BUF_ISPINNED(bp))
		xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE);

	<snip iflush_cluster to pull in all other dirty inodes>

	if (flags & INT_DELWRI) {
		xfs_bdwrite(mp, bp);
	} else if (flags & INT_ASYNC) {
		error = xfs_bawrite(mp, bp);
	} else {
		error = xfs_bwrite(mp, bp);
	}

And then it would write the inode buffer directly with an async write via xfs_bawrite(). i.e. there was no delwri list or anything like that - the inode flush controlled the inode buffer writeback and hence the unpinning of the inode buffer.
The only difference now is that the caller controls how the buffer is written, not xfs_iflush...

> An unpinned inode with a pinned cluster buffer didn't communicate this
> state because the inode ->trylock() callout grabbed the ilock and iflock
> and returned success. The iflush doesn't occur until ->push(), which
> means if the cluster buffer is pinned and we wanted to force the log
> then xfs_iflush() is the last chance to do it.
>
> (For reference, I'm looking at the code as of v2.6.33).

Again, that's no different to what we have now - the push is for the inode, not the backing buffer. If we can't lock the inode because it's pinned or already flushing, then we return that info to the AIL loop. The actual state of the backing buffer is still completely hidden from the caller, and so that's why I left the unpin in that code.

> The current xfsaild() log force logic is still tied to PINNED objects,
> but it has additional logic to detect skipped buffers at delwri queue
> submit time (i.e., due to being pinned) that should cover the pinned
> inode cluster buffer case.

And it's not skipped buffers, either. Skipped buffers are only for determining timeouts - it's the ail->ail_log_flush counter that determines whether a log flush should be done or not, and that's only incremented on ITEM_PINNED objects. This is why I was suggesting returning that if the backing buffer is pinned, but I forgot all about how we already handle this case...

> But given the above, couldn't we just remove the log force from
> xfs_iflush() and let the existing xfsaild_push() logic handle it?

We can, but not for the reasons you are suggesting.

> It
> should detect any pinned buffers after the delwri queue submit and do
> the log force.
And this is what we already do with this code:

	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
		ailp->ail_log_flush++;

Which falls down to this code when we try to flush a pinned delwri buffer:

	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
		if (!wait_list) {
			if (xfs_buf_ispinned(bp)) {
				pinned++;
				continue;
			}
	....
	return pinned;

Basically, we leave pinned buffers on the delwri list rather than remove them and issue IO on them, and tell the caller that we left pinned/locked buffers on the list that still need to be submitted. The AIL takes this as an indication it needs to flush the log to unpin the buffers it can't flush, and because they are still on its delwri list, it will attempt to submit them again next time around after the log has been forced.

So given that inode reclaim already waits for inodes to be unpinned and there is an xfs_buf_unpin_wait() call in the xfs_bwrite()->xfs_buf_submit() path that will issue a log force on the buffer from the inode reclaim path, we don't actually need the log force in xfs_iflush() at all. i.e. the AIL will capture the buffer state and unpin inode and dquot buffers automatically, and so we can simply remove the pinned-buffer log forces from the inode/dquot flush code....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
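The submit/force/retry cycle described in this message can be modeled in a few lines of userspace C. This is a toy sketch under stated assumptions - the struct, the function names, and the `log_flush` counter are simplified stand-ins for the kernel's delwri list machinery, not the real API:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: pinned buffers stay queued, the caller learns about
 * them, forces the log, and the next pass submits them. */
struct buf { bool pinned; bool submitted; };

/* models the !wait_list branch of the delwri submit path */
static int delwri_submit_nowait(struct buf **list, size_t n)
{
	int pinned = 0;

	for (size_t i = 0; i < n; i++) {
		struct buf *bp = list[i];

		if (bp->submitted)
			continue;		/* already written out */
		if (bp->pinned) {
			pinned++;		/* left on list for next time */
			continue;
		}
		bp->submitted = true;		/* unpinned: issue async write */
	}
	return pinned;
}

/* models one xfsaild push cycle: submit, then force the log if needed */
static void aild_cycle(struct buf **list, size_t n, int *log_flush)
{
	if (delwri_submit_nowait(list, n))
		(*log_flush)++;	/* the log force will unpin the stragglers */
}
```

The point of the model: no per-buffer log force is needed anywhere - one force per AIL cycle covers every pinned buffer that was left on the list, and the following cycle writes them out.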
* Re: generic/475 deadlock? 2019-03-21 21:10 ` Dave Chinner @ 2019-03-21 21:53 ` Brian Foster 2019-03-21 23:50 ` Dave Chinner 2019-03-26 17:13 ` Brian Foster 1 sibling, 1 reply; 18+ messages in thread From: Brian Foster @ 2019-03-21 21:53 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, xfs On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote: > On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote: > > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote: > > > On Wed, Mar 20, 2019 at 01:59:22PM -0400, Brian Foster wrote: > > > > On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote: > > > > > On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote: > > > > > > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote: > > > > > > And unmount is doing a log force as well.. > > > > > > > > > > > > Hmm.. yeah, I think we need to figure out how/why the buffer is locked. > > > > > > > > > > AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that > > > > > returns the locked cluster buffer (code extract from xfs_iflush): > > > .... > > > > > ...but if the buffer is also pinned then we kick the log (while holding > > > > > the buffer lock) to unpin the buffer. Since the fs is shutdown, the cil > > > > > will just be trying to remove everything, but it needs to lock the > > > > > buffer to simulate EIO. > > > > > > > > > > > > > Yeah, makes sense. > > > > > > Yup, and this same problem means any buffer we hold locked when we > > > issue a blocking log force can deadlock in the same way. If it's a > > > non-blocking log force (i.e. a log flush we don't wait for) then we > > > can move onwards and eventually we unlock the buffer and the log can > > > make progress. > > > > > > > > A naive fix to the deadlock is to see if the inode buf ops are attached > > > > > and downgrade to a trylock and go ahead even if we can't lock it, but ick. 
> > > > > It resolves the deadlock so the umount can proceed but we also trigger a > > > > > lot of unlocked buffer asserts. > > > > > > Yuck! > > > > > > > > I'm not sure what "too long" means in that code chunk. It was added to > > > > > reduce stalls in pdflush, which these days I think translates to trying > > > > > to reduce the amount of time reclaim can stall while flushing inodes...? > > > > > > "too long" means that if there is no log pressure, it might be the > > > next periodic log work that forces the log and unpins the buffer. > > > i.e. it could be several seconds before the buffer gets unpinned. > > > This is the same reason we have the log force in xfs_buf_lock() - if > > > we have to wait for the next log force to unlock/unpin a stale > > > buffer, we can stall for several seconds. > > > > > > > Ok.. so it kind of looks to me that this particular instance of the log > > force is a historical artifact of xfsaild. The latter used to use > > separate ->trylock() and ->push() log item callbacks where the former > > would either lock the object or return LOCKED, PINNED, etc. So a pinned > > buffer or inode would return pinned here and xfsaild does the log force > > on its behalf once it's through the current cycle. > > Sure. that's no different to now. The only thing that is different > is that we combined ->trylock, ->push and ->pushbuf into one > function. The actual semantics and behaviour is unchanged. > > i.e. if you look at xfs_iflush(ip, ASYNC) back then, it did: > > /* > * If the buffer is pinned then push on the log now so we won't > * get stuck waiting in the write for too long. 
> */ > if (XFS_BUF_ISPINNED(bp)) > xfs_log_force(mp, (xfs_lsn_t)0, XFS_LOG_FORCE); > > <snip iflush_cluster to pull in all other dirty inodes> > > if (flags & INT_DELWRI) { > xfs_bdwrite(mp, bp); > } else if (flags & INT_ASYNC) { > error = xfs_bawrite(mp, bp); > } else { > error = xfs_bwrite(mp, bp); > } > > And then it would write the inode buffer directly with an async write > via xfs_bawrite(). i.e. there was no delwri list or anything like > that - the inode flush controlled the inode buffer writeback and > hence the unpinning of the inode buffer. The only difference now is > that the caller controls how the buffer is written, not > xfs_iflush... > What I'm saying is that the difference in the interface explains the historical need for the log force in xfs_iflush(). The difference that I'm pointing out is that ->push() had no means to return item state. The sequence for an unpinned inode with a pinned cluster buffer in 2.6.32 is for ->trylock() to return success after acquiring the ilock and flush lock, followed by a call to ->push(). We didn't call xfs_iflush() (and thus get the inode buffer) until the push call: switch (lock_result) { case XFS_ITEM_SUCCESS: XFS_STATS_INC(xs_push_ail_success); IOP_PUSH(lip); last_pushed_lsn = lsn; break; ... which doesn't even have a return value. Hence there is no way to return a pinned state for an inode cluster buffer and xfs_iflush() would have had to do the log force for a pinned cluster buffer itself. That is no longer the case with the current code. I don't know with certainty of course, but from that I surmised the xfs_iflush() -> xfs_log_force() call existed to deal with this historical interface limitation. > > An unpinned inode with a pinned cluster buffer didn't communicate this > > state because the inode ->trylock() callout grabbed the ilock and iflock > > and returned success. 
The iflush doesn't occur until ->push(), which > > means if the cluster buffer is pinned and we wanted to force the log > > than xfs_iflush() is the last chance to do it. > > > > (For reference, I'm looking at the code as of v2.6.33). > > Again, that's no different to what we have now - the push is for the > inode, not the backing buffer. If we can't lock the inode because > it's pinned or already flushing, then we return that info to AIL > loop. The acutal state of the backing buffer is still completely > hidden from the caller and so that's why I left the unpin in that > code. > Pretty much what I'm saying above. > > The current xfsaild() log force logic is still tied to PINNED objects, > > but it has additional logic to detect skipped buffers at delwri queue > > submit time (i.e., due to being pinned) that should cover the pinned > > inode cluster buffer case. > > And it's not skipped buffers, either. Skipped buffers are only for > determining timeouts - it's the ail->ail_log_flush counter that > determines whether a log flush should be done or not, and that's > only incremented on ITEM_PINNED objects. This is why I was > suggesting returning that if the backing buffer is pinned, but I > forgot all about how we already handle this case... > > > But given the above, couldn't we just remove the log force from > > xfs_iflush() and let the existing xfsaild_push() logic handle it? > > We can, but not for the reasons you are suggesting. > The reason I suggested is because we currently "detect skipped buffers at delwri queue submit time (i.e., due to being pinned)," which causes xfsaild to force the log. I'm not sure how exactly you read that, but it refers exactly to the code snippet you posted below where we bump the log flush counter due to skipped buffers on submit (because they were pinned). > > It > > should detect any pinned buffers after the delwri queue submit and do > > the log force. > > And this is what we already do with this code: > Yes. 
> if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list)) > ailp->ail_log_flush++; > > Which falls down to this code when we try to flush a pinned delwri > buffer: > > list_for_each_entry_safe(bp, n, buffer_list, b_list) { > if (!wait_list) { > if (xfs_buf_ispinned(bp)) { > pinned++; > continue; > } > .... > return pinned; > "detect skipped buffers at delwri queue submit time (i.e., due to being pinned)" > Basically, we leave pinned buffers on the delwri list rather than > remove them and issue IO on them, and tell the caller that we left > pinned/locked buffers on the list that still need to be submitted. > The AIL takes this as an indication it needs to flush the log to > unpin the buffers it can't flush, and because they are still on it's > delwri list, it will attempt to submit them again next time around > after the log has been forced. > Right, so this is exactly the reason I suggested. Brian > So given that inode reclaim already waits for inodes to be unpinned > and there is a xfs_buf_unpin_wait() call in > xfs_bwrite()->xfs_buf_submit() path that will issue a log force on > the buffer from the inode reclaim path, we don't actually need the > log force in xfs_iflush() at all. i.e. the AIL will capture the > buffer state and unpin inode and dquot buffers automatically, and so > we can simply remove the pinned-buffer log forces from the > inode/dquot flush code.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-21 21:53 ` Brian Foster @ 2019-03-21 23:50 ` Dave Chinner 2019-03-22 0:07 ` Darrick J. Wong 2019-03-22 12:01 ` Brian Foster 0 siblings, 2 replies; 18+ messages in thread From: Dave Chinner @ 2019-03-21 23:50 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, xfs On Thu, Mar 21, 2019 at 05:53:07PM -0400, Brian Foster wrote: > On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote: > > On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote: > > > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote: > > > The current xfsaild() log force logic is still tied to PINNED objects, > > > but it has additional logic to detect skipped buffers at delwri queue > > > submit time (i.e., due to being pinned) that should cover the pinned > > > inode cluster buffer case. > > > > And it's not skipped buffers, either. Skipped buffers are only for > > determining timeouts - it's the ail->ail_log_flush counter that > > determines whether a log flush should be done or not, and that's > > only incremented on ITEM_PINNED objects. This is why I was > > suggesting returning that if the backing buffer is pinned, but I > > forgot all about how we already handle this case... > > > > > But given the above, couldn't we just remove the log force from > > > xfs_iflush() and let the existing xfsaild_push() logic handle it? > > > > We can, but not for the reasons you are suggesting. > > > > The reason I suggested is because we currently "detect skipped buffers > at delwri queue submit time (i.e., due to being pinned)," which causes > xfsaild to force the log. I'm not sure how exactly you read that, but it > refers exactly to the code snippet you posted below where we bump the > log flush counter due to skipped buffers on submit (because they were > pinned). In the context of the AIL pushing, "skipped" has specific meaning - it refers to items that could not be flushed for whatever reason. 
And the actions that are taken based on the number of "skipped" objects are taken after delwri queue submit time.

You used all the terminology that refers to /skipped/ item handling, not /pinned/ item handling - if you re-read what I wrote above with this difference in mind, then it will make a lot more sense to you.

FWIW, this skipped/pinned terminology difference goes right down into the delwri list handling I quoted - I'll add the next few lines so it's clear what I'm talking about:

> > 	list_for_each_entry_safe(bp, n, buffer_list, b_list) {
> > 		if (!wait_list) {
> > 			if (xfs_buf_ispinned(bp)) {
> > 				pinned++;
> > 				continue;
> > 			}
			if (!xfs_buf_trylock(bp))
				continue;
		}

This code branch also avoids blocking on locked buffers. In AIL terms these locked buffers have been "skipped", but the code only counts and indicates pinned buffers to the caller. And the delwri submission code in the AIL pushing acts on this specific distinction - it triggers a log force if pinned buffers were encountered, not if the flush skipped locked buffers.

> > 	....
> > 	return pinned;
>
> "detect skipped buffers at delwri queue submit time (i.e., due to being
> pinned)"

If the AIL code was detecting skipped buffers, it would do:

	xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list);
	if (!list_empty(&ailp->ail_buf_list)) {
		/* we skipped buffers */
		.....
	}

But it's not doing this.

i.e. "Pinned" vs "skipped" may be a subtle terminology difference, but it is an important difference and they are not interchangeable...

> > remove them and issue IO on them, and tell the caller that we left
> > pinned/locked buffers on the list that still need to be submitted.
> > The AIL takes this as an indication it needs to flush the log to
> > unpin the buffers it can't flush, and because they are still on its
> > delwri list, it will attempt to submit them again next time around
> > after the log has been forced.
>
> Right, so this is exactly the reason I suggested.

OK, we're on the same page.
:)

Do you want to submit a patch that fixes the inode and dquot flushing?

I haven't looked any further at the busy extent stuff; that's likely to be a more difficult problem to solve...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
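The pinned-vs-skipped distinction argued over in this exchange can be made concrete with a small sketch. Again, this is an illustrative userspace model with made-up names, not the kernel code: a locked buffer is skipped (left queued) but does not trigger a log force, while a pinned buffer is both left queued and counted, and only that count drives the AIL's log force:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct buf { bool pinned; bool locked; bool submitted; };

/* Toy submit pass: returns the pinned count, and reports via
 * *remaining how many buffers were left queued for any reason. */
static int submit_nowait(struct buf *list, size_t n, size_t *remaining)
{
	int pinned = 0;
	*remaining = 0;

	for (size_t i = 0; i < n; i++) {
		struct buf *bp = &list[i];

		if (bp->pinned) {	/* counted: caller forces the log */
			pinned++;
			(*remaining)++;
			continue;
		}
		if (bp->locked) {	/* skipped: retried later, no force */
			(*remaining)++;
			continue;
		}
		bp->submitted = true;
	}
	return pinned;			/* pinned count, not skipped count */
}
```

With one pinned, one locked, and one clean buffer, two are "skipped" but only one is "pinned" - exactly why the two terms are not interchangeable: a log force helps a pinned buffer, while a locked one just needs to be retried.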
* Re: generic/475 deadlock? 2019-03-21 23:50 ` Dave Chinner @ 2019-03-22 0:07 ` Darrick J. Wong 2019-03-22 12:01 ` Brian Foster 1 sibling, 0 replies; 18+ messages in thread From: Darrick J. Wong @ 2019-03-22 0:07 UTC (permalink / raw) To: Dave Chinner; +Cc: Brian Foster, xfs On Fri, Mar 22, 2019 at 10:50:45AM +1100, Dave Chinner wrote: > On Thu, Mar 21, 2019 at 05:53:07PM -0400, Brian Foster wrote: > > On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote: > > > On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote: > > > > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote: > > > > The current xfsaild() log force logic is still tied to PINNED objects, > > > > but it has additional logic to detect skipped buffers at delwri queue > > > > submit time (i.e., due to being pinned) that should cover the pinned > > > > inode cluster buffer case. > > > > > > And it's not skipped buffers, either. Skipped buffers are only for > > > determining timeouts - it's the ail->ail_log_flush counter that > > > determines whether a log flush should be done or not, and that's > > > only incremented on ITEM_PINNED objects. This is why I was > > > suggesting returning that if the backing buffer is pinned, but I > > > forgot all about how we already handle this case... > > > > > > > But given the above, couldn't we just remove the log force from > > > > xfs_iflush() and let the existing xfsaild_push() logic handle it? > > > > > > We can, but not for the reasons you are suggesting. > > > > > > > The reason I suggested is because we currently "detect skipped buffers > > at delwri queue submit time (i.e., due to being pinned)," which causes > > xfsaild to force the log. I'm not sure how exactly you read that, but it > > refers exactly to the code snippet you posted below where we bump the > > log flush counter due to skipped buffers on submit (because they were > > pinned). 
> > In the context of the AIL pushing, "skipped" has specific meaning - > it refers to items that could not be flushed for whatever reason. > And the actions that are taken based on the number of "skipped" > objects is taken after delwri queue submit time. > > You used all the terminology that refers to /skipped/ item > handling, not /pinned/ item handling - if you re-read what I wrote > above with this difference in mind, then it will make a lot more > sense to you. > > FWIW, this skipped/pinned terminology difference goes right down > into the delwri list handling I quoted - I'll add the next few lines > so it's clear what I'm talking about: > > > > > list_for_each_entry_safe(bp, n, buffer_list, b_list) { > > > if (!wait_list) { > > > if (xfs_buf_ispinned(bp)) { > > > pinned++; > > > continue; > > > } > if (!xfs_buf_trylock(bp)) > continue; > } > > This code branch also avoid blocking on locked buffers. In AIL > terms these locked buffers have been "skipped", but the code only > counts and indicates pinned buffers to the caller. And the delwri > submission code in the AIL pushing acts on this specific > distinction - it triggers a log force if pinned buffers were > encountered, not if the flush skipped locked buffers. > > > > .... > > > return pinned; > > > > > > > "detect skipped buffers at delwri queue submit time (i.e., due to being > > pinned)" > > If the AIL code was detecting skipped buffers, it would do: > > xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list); > if (!list_empty(&ailp->ail_buf_list)) { > /* we skipped buffers */ > ..... > } > > But it's not doing this. > > i.e. "Pinned" vs "skipped" may be a subtle terminology difference, > but it is an important difference and they are not interchangable... > > > > remove them and issue IO on them, and tell the caller that we left > > > pinned/locked buffers on the list that still need to be submitted. 
> > > The AIL takes this as an indication it needs to flush the log to > > > unpin the buffers it can't flush, and because they are still on it's > > > delwri list, it will attempt to submit them again next time around > > > after the log has been forced. <nod> That's kinda what I thought too, having looked through the code, but TBH I defer to the two of you as far as knowing how things work in the logging code. :) > > > > > > > Right, so this is exactly the reason I suggested. > > OK, we're on the same page. :) > > Do want to submit a patch that fixes the inode and dquot flushing? I'd review any such patch that shows up. :) [He says having not written a patch since the yucky one a few days ago and not likely having time to write one anyway.] > I haven't looked any further at the busy extent stuff, that's likely > to be a more difficult problem to solve... <nod> Thank you both for taking time to dig through my question! --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-21 23:50 ` Dave Chinner 2019-03-22 0:07 ` Darrick J. Wong @ 2019-03-22 12:01 ` Brian Foster 2019-03-24 23:03 ` Dave Chinner 1 sibling, 1 reply; 18+ messages in thread From: Brian Foster @ 2019-03-22 12:01 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, xfs On Fri, Mar 22, 2019 at 10:50:45AM +1100, Dave Chinner wrote: > On Thu, Mar 21, 2019 at 05:53:07PM -0400, Brian Foster wrote: > > On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote: > > > On Thu, Mar 21, 2019 at 08:11:52AM -0400, Brian Foster wrote: > > > > On Thu, Mar 21, 2019 at 08:49:38AM +1100, Dave Chinner wrote: > > > > The current xfsaild() log force logic is still tied to PINNED objects, > > > > but it has additional logic to detect skipped buffers at delwri queue > > > > submit time (i.e., due to being pinned) that should cover the pinned > > > > inode cluster buffer case. > > > > > > And it's not skipped buffers, either. Skipped buffers are only for > > > determining timeouts - it's the ail->ail_log_flush counter that > > > determines whether a log flush should be done or not, and that's > > > only incremented on ITEM_PINNED objects. This is why I was > > > suggesting returning that if the backing buffer is pinned, but I > > > forgot all about how we already handle this case... > > > > > > > But given the above, couldn't we just remove the log force from > > > > xfs_iflush() and let the existing xfsaild_push() logic handle it? > > > > > > We can, but not for the reasons you are suggesting. > > > > > > > The reason I suggested is because we currently "detect skipped buffers > > at delwri queue submit time (i.e., due to being pinned)," which causes > > xfsaild to force the log. I'm not sure how exactly you read that, but it > > refers exactly to the code snippet you posted below where we bump the > > log flush counter due to skipped buffers on submit (because they were > > pinned). 
> > In the context of the AIL pushing, "skipped" has a specific meaning - > it refers to items that could not be flushed for whatever reason. > And the actions that are taken based on the number of "skipped" > objects are taken after delwri queue submit time. > > You used all the terminology that refers to /skipped/ item > handling, not /pinned/ item handling - if you re-read what I wrote > above with this difference in mind, then it will make a lot more > sense to you. > Perhaps I could have phrased it better, but I specifically referred to skipped buffers (not items), in the context of delwri submission and qualified it with those skipped due to being pinned. ;) Even in my earlier reply right after Darrick corrected my mischaracterization of the deadlock, I said the following about the log force in xfs_iflush(): "I was wondering why we'd need that given that xfsaild() has its own log force logic in cases where the delwri submit didn't fully complete (i.e., for pinned buffers)." But no matter, I can see where the disconnect is now at least. I just think you misread me in this case. > FWIW, this skipped/pinned terminology difference goes right down > into the delwri list handling I quoted - I'll add the next few lines > so it's clear what I'm talking about: > > > > > list_for_each_entry_safe(bp, n, buffer_list, b_list) { > > > if (!wait_list) { > > > if (xfs_buf_ispinned(bp)) { > > > pinned++; > > > continue; > > > } > if (!xfs_buf_trylock(bp)) > continue; > } > > This code branch also avoids blocking on locked buffers. In AIL > terms these locked buffers have been "skipped", but the code only > counts and indicates pinned buffers to the caller. And the delwri > submission code in the AIL pushing acts on this specific > distinction - it triggers a log force if pinned buffers were > encountered, not if the flush skipped locked buffers. > Yep.. > > > ....
> > > return pinned; > > > > > > > "detect skipped buffers at delwri queue submit time (i.e., due to being > > pinned)" > > If the AIL code was detecting skipped buffers, it would do: > > xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list); > if (!list_empty(&ailp->ail_buf_list)) { > /* we skipped buffers */ > ..... > } > > But it's not doing this. > In general, if you think I'm mischaracterizing code/behavior, it's helpful to point out how you interpret the specified reasoning so we can correct mistaken assumptions/understandings, clarify descriptions or use better terminology or whatever. That helps to get onto the same page more efficiently, particularly for cases where it so happens we were talking about the same exact thing in the first place and just happen to think about it a bit differently. ;) For example, to just say "We can, but not for the reasons you are suggesting." doesn't tell me much about how you interpreted my reasoning. I thus assume you interpreted it exactly as I meant it. That ambiguity leads to confusion when you then go on and explain the same behavior I was (trying to) refer to. ;P If you said something like "we can but not because of <reason>," it would have probably been immediately apparent to me that the reason I intended to describe is not exactly how you read it. > i.e. "Pinned" vs "skipped" may be a subtle terminology difference, > but it is an important difference and they are not interchangable... > > > > remove them and issue IO on them, and tell the caller that we left > > > pinned/locked buffers on the list that still need to be submitted. > > > The AIL takes this as an indication it needs to flush the log to > > > unpin the buffers it can't flush, and because they are still on it's > > > delwri list, it will attempt to submit them again next time around > > > after the log has been forced. > > > > > > > Right, so this is exactly the reason I suggested. > > OK, we're on the same page. :) > Indeed. 
> Do you want to submit a patch that fixes the inode and dquot flushing? > I haven't looked any further at the busy extent stuff, that's likely > to be a more difficult problem to solve... > I'll add it to the todo list and make a pass through the log items... I missed where the busy extent stuff came into play here.. are you pointing out that log forces from the busy extent code can occur under buffer locks and thus we're susceptible to similar problems there? If so, aren't we protected by the transaction in that context? IOW, wouldn't a log force from busy extent context always occur with locked buffers joined to a transaction? If so, then doesn't the active transaction hold a bli reference and prevent such items from being "freed" in log completion context (i.e. xfs_buf_item_unpin()) if they happened to be pinned? Perhaps I'm missing something... Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-22 12:01 ` Brian Foster @ 2019-03-24 23:03 ` Dave Chinner 2019-03-25 12:34 ` Brian Foster 0 siblings, 1 reply; 18+ messages in thread From: Dave Chinner @ 2019-03-24 23:03 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, xfs On Fri, Mar 22, 2019 at 08:01:36AM -0400, Brian Foster wrote: > On Fri, Mar 22, 2019 at 10:50:45AM +1100, Dave Chinner wrote: > > > > return pinned; > > > > > > > > > > "detect skipped buffers at delwri queue submit time (i.e., due to being > > > pinned)" > > > > If the AIL code was detecting skipped buffers, it would do: > > > > xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list); > > if (!list_empty(&ailp->ail_buf_list)) { > > /* we skipped buffers */ > > ..... > > } > > > > But it's not doing this. > > > > In general, if you think I'm mischaracterizing code/behavior, it's > helpful to point out how you interpret the specified reasoning so we can > correct mistaken assumptions/understandings, clarify descriptions or use > better terminology or whatever. That helps to get onto the same page > more efficiently, particularly for cases where it so happens we were > talking about the same exact thing in the first place and just happen to > think about it a bit differently. ;) Well, I didn't realise you had mis-applied the terminology until you replied to me saying "we're describing the same thing". It's easy to say in hindsight "-you- should have just pointed out the problem", but that's laying blame for something that nobody is at fault for. > For example, to just say "We can, but not for the reasons you are > suggesting." doesn't tell me much about how you interpreted my > reasoning. I think you're being unreasonable here, Brian - I explained the skipped vs pinned difference once I understood there was a terminology mismatch issue.
We have definitions and try to use common, clear terminology so that code and communication is efficient and we don't need to endlessly repeat what the other person has said. Misunderstandings over terminology are relatively rare - as soon as I understood that you were using terminology with a different definition to the code you were referring to, I pointed it out and explained where the misunderstanding was. I cannot do any more than that, and to expect more from /anyone/ is entirely unreasonable. > I missed where the busy extent stuff came into play here.. are you > pointing out that log forces from the busy extent code can occur under > buffer locks and thus we're susceptible to similar problems there? If > so, aren't we protected by the transaction in that context? I'm not sure - as I said I have not traced/reasoned the logic right through yet. All I was doing is pointing out that these are other places we are doing a log force with buffers locked so might have the same problem, and.... > IOW, wouldn't a log force from busy extent context always occur with > locked buffers joined to a transaction? If so, then doesn't the active > transaction hold a bli reference and prevent such items from being > "freed" in log completion context (i.e. xfs_buf_item_unpin()) if they > happened to be pinned? Perhaps I'm missing something... ... now you've looked at it in more detail than I have. What you've described is why it was considered to be safe, but now we have things like deferred AGFL block freeing that pick up and drop the AGF lock repeatedly as the single transaction rolls and - eventually - starts calling xfs_trans_reserve() on transaction roll. That's something we never used to do. It may still be safe, but it is unclear to me how this all interacts in the presence of filesystem shutdown conditions..... --- What it comes down to is that the only way I can answer questions like this is to do the complete analysis myself, which simply does not scale.
Having to jump between multiple separate complex questions completely derails my own workflow, and it's recently been resulting in me spending >75% of every day answering complex questions like this. Duplication of effort is not an efficient use of our time or expertise. If I see something and say "these other code paths look similar, we should check them too", all it means is that I noticed a pattern of usage and they need to be checked for similar issues. It does not mean "dave has enough spare time on hand and can drop everything to analyse the code in question and answer every little detail about the similar problem path he noticed." Brian, you're way better than me at analysing these complex interactions. Be confident in your analysis, just say "no, it's safe because...." rather than asking a question that requires me to repeat the analysis you've already done. If there's something that you've missed that I spot, I'll point it out, but I really don't want to have to repeat everything you've already done just to answer a "did I miss anything" question. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-24 23:03 ` Dave Chinner @ 2019-03-25 12:34 ` Brian Foster 2019-03-27 1:22 ` Dave Chinner 0 siblings, 1 reply; 18+ messages in thread From: Brian Foster @ 2019-03-25 12:34 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, xfs On Mon, Mar 25, 2019 at 10:03:27AM +1100, Dave Chinner wrote: > On Fri, Mar 22, 2019 at 08:01:36AM -0400, Brian Foster wrote: > > On Fri, Mar 22, 2019 at 10:50:45AM +1100, Dave Chinner wrote: > > > > > return pinned; > > > > > > > > > > > > > "detect skipped buffers at delwri queue submit time (i.e., due to being > > > > pinned)" > > > > > > If the AIL code was detecting skipped buffers, it would do: > > > > > > xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list); > > > if (!list_empty(&ailp->ail_buf_list)) { > > > /* we skipped buffers */ > > > ..... > > > } > > > > > > But it's not doing this. > > > > > > > In general, if you think I'm mischaracterizing code/behavior, it's > > helpful to point out how you interpret the specified reasoning so we can > > correct mistaken assumptions/understandings, clarify descriptions or use > > better terminology or whatever. That helps to get onto the same page > > more efficiently, particularly for cases where it so happens we were > > talking about the same exact thing in the first place and just happen to > > think about it a bit differently. ;) > > Well, i didn't realise you had mis-applied the terminology until you > replied to me saying "we're describing the same thing". It's easy to > say in hindsight "-you- should have just point out the problem", but > that's laying blame for something that nobody is at fault for. > > > For example, to just say "We can, but not for the reasons you are > > suggesting." doesn't tell me much about how you interpreted my > > reasoning. > > I think you're being unreasonable here, Brian - I explained the > skipped vs pinned difference once I understood there was a > terminology mismatch issue. 
:/ > > I don't expect to have to rewrite emails in my own words just so the > other party can determine if I've understood the terms they used > correctly. We have definitions and try to use common, clear > terminology so that code and communication is efficient and we don't > need to endlessly repeat what the other person has said. > > Misunderstandings over terminology are relatively rare - as soon as > I understood that you were using terminology with a different > definition to the code you were refering to, I pointed it out and > explained where the misunderstanding was. > > I cannot do any more than that, and to expect more from /anyone/ is > entirely unreasonable. > Fair enough. I'm just trying to see if we can improve communication signal a bit here, but I suppose some degree of miscommunication is inevitable in a highly technical forum. Not worth fussing over too much I guess.. > > I missed where the busy extent stuff came into play here.. are you > > pointing out that log forces from the busy extent code can occur under > > buffer locks and thus we're susceptible to similar problems there? If > > so, aren't we protected by the transaction in that context? > > I'm not sure - as I said I have not traced/reasoned the logic right > through yet. All I was doing is pointing out that these are other > places we are doing a log force with buffers locked so might have > the same problem, and.... > Ok, that's all I was asking.. > > IOW, wouldn't a log force from busy extent context always occur with > > locked buffers joined to a transaction? If so, then doesn't the active > > transaction hold a bli reference and prevent such items from being > > "freed" in log completion context (i.e. xfs_buf_item_unpin()) if they > > happened to be pinned? Perhaps I'm missing something... > > ... now you've looked at it in more detail that I have. 
> > What you've described is why it was considered to be safe, but now > we have things like defered AGFL block freeing that pick up and drop > the AGF lock repeatedly as the single transaction rolls and - > eventually - starts calling xfs_trans_reserve() on transaction roll. > That's something we never used to do. It may still be safe, but it > is unclear to me how this all interacts in the presence of > filesystem shutdown conditions..... > Some higher level thoughts... IME, we've had these kind of quirky shutdown issues since I've been working on XFS. Similar to log recovery, it's a rarely enough hit scenario that keeping it robust and reliable across new features/development is a bit of a challenge. LR is certainly more critical and I think our test coverage in both areas has improved significantly over the past few years. The point is just that I don't think it's worth getting too crazy by trying to audit every possible path to a sync log force or changing how various bits of core infrastructure work just to accommodate a very rare shutdown case. If this turns out to be the only place we require an object lock in the synchronous log force sequence, it might be best to find a way to remove it (as Darrick posited earlier). If not, it might also be more useful to figure a way to detect the "sync log force while holding (certain) locks" pattern dynamically and provide some assertions against it to improve test coverage going forward. I'm not quite sure how to do that off the top of my head. I'll have to think about that some more. I'm sure we could always come up with a test that reproduces this particular instance of the problem more reliably, but that test becomes far less useful once we address the particular instance it reproduces in the kernel. > --- > > What it comes down to is that the only way I can answer questions > like this is to do the complete analysis myself, which simply does > not scale. 
Having to jump between multiple separate complex > questions completely derails my own workflow, and it's recently been > resulting in me spending >75% of every day answering complex > questions like this. Duplication of effort is not an efficient use > of our time or expertise. > Agreed. > If I see something and say "these other code paths look similar, we > should check them too", all it means that I noticed a pattern of > usage and they need to be checked for similar issues. It does not > mean "dave has enough spare time on hand and can drop everything to > analyse the code in question and answer every little detail about > the similar problem path he noticed." > > Brian, you're way better than me at analysing these complex > interactions. Be confident in your analysis, just say "no, it's safe > because...." rather than asking a question that requires me to > repeat the analysis you've already done. If there's something that > you've missed that I spot, I'll point it out, but I really don't > want to have to repeat everything you've already done just to answer > a "did I miss anything" question. > Note that my question was more around establishing context since I didn't see an issue/concern pointed out for the busy extent code (even after reading back through the thread), but point taken. Thanks for the feedback. Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-25 12:34 ` Brian Foster @ 2019-03-27 1:22 ` Dave Chinner 0 siblings, 0 replies; 18+ messages in thread From: Dave Chinner @ 2019-03-27 1:22 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, xfs On Mon, Mar 25, 2019 at 08:34:59AM -0400, Brian Foster wrote: > On Mon, Mar 25, 2019 at 10:03:27AM +1100, Dave Chinner wrote: > > > IOW, wouldn't a log force from busy extent context always occur with > > > locked buffers joined to a transaction? If so, then doesn't the active > > > transaction hold a bli reference and prevent such items from being > > > "freed" in log completion context (i.e. xfs_buf_item_unpin()) if they > > > happened to be pinned? Perhaps I'm missing something... > > > > ... now you've looked at it in more detail that I have. > > > > What you've described is why it was considered to be safe, but now > > we have things like defered AGFL block freeing that pick up and drop > > the AGF lock repeatedly as the single transaction rolls and - > > eventually - starts calling xfs_trans_reserve() on transaction roll. > > That's something we never used to do. It may still be safe, but it > > is unclear to me how this all interacts in the presence of > > filesystem shutdown conditions..... > > > > Some higher level thoughts... > > IME, we've had these kind of quirky shutdown issues since I've been > working on XFS. Similar to log recovery, it's a rarely enough hit > scenario that keeping it robust and reliable across new > features/development is a bit of a challenge. LR is certainly more > critical and I think our test coverage in both areas has improved > significantly over the past few years. Log recovery and shutdown have _always_ been a source of bugs, and there are some particular parts of the code that are more problematic than others. e.g. Inode cluster flushing error/shutdown handling has been a regular source of bugs over the years. 
And, yes, we do hammer on them a lot more than we ever have, and so we are slowly peeling the onion and finding the more subtle bugs that have been there for many, many years. > The point is just that I don't think it's worth getting too crazy by > trying to audit every possible path to a sync log force or changing how > various bits of core infrastructure work just to accommodate a very rare > shutdown case. I think we still have to look at them and understand if the way the problematic shutdowns are done makes any sense anymore. They might have when the shutdown was added, but lots of the code is very different now, as is the shutdown handling. Hence we might have shutdowns in places we don't need them anymore (e.g. xfs_trans_read_buf_map(), xfs_iflush_cluster(), etc). > If this turns out to be the only place we require an > object lock in the synchronous log force sequence, it might be best to > find a way to remove it (as Darrick posited earlier). If not, it might > also be more useful to figure a way to detect the "sync log force while > holding (certain) locks" pattern dynamically and provide some assertions > against it to improve test coverage going forward. I'm not quite sure > how to do that off the top of my head. I'll have to think about that > some more. That's what lockdep contexts are supposed to be used for. Perhaps something similar to the memory reclaim contexts could be done here - locks taken both above and below the shutdown state trigger warnings.... > I'm sure we could always come up with a test that reproduces > this particular instance of the problem more reliably, but that test > becomes far less useful once we address the particular instance it > reproduces in the kernel. *nod* That's why tests like generic/475 are so good - they keep finding new bugs in shutdown/recovery... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-21 21:10 ` Dave Chinner 2019-03-21 21:53 ` Brian Foster @ 2019-03-26 17:13 ` Brian Foster 2019-03-27 1:05 ` Dave Chinner 1 sibling, 1 reply; 18+ messages in thread From: Brian Foster @ 2019-03-26 17:13 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, xfs On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote: ... > > So given that inode reclaim already waits for inodes to be unpinned > and there is a xfs_buf_unpin_wait() call in > xfs_bwrite()->xfs_buf_submit() path that will issue a log force on > the buffer from the inode reclaim path, we don't actually need the > log force in xfs_iflush() at all. i.e. the AIL will capture the > buffer state and unpin inode and dquot buffers automatically, and so > we can simply remove the pinned-buffer log forces from the > inode/dquot flush code.... > Hmm, where is the log force you reference above in the inode reclaim (-> xfs_bwrite()) path if the buffer happens to be pinned? I see the xfs_buf_wait_unpin() in the submit path, but that just waits on the pin count (which I think is safe). It actually looks to me that reclaim could be susceptible to the write delay problem you referenced earlier in this thread without the log force from xfs_iflush().. It also occurs to me that the xfs_iflush() log force isn't blocking because it's a sync force, but rather because there is already a CIL push in progress. The flush_work() call basically means that a non-sync force can either queue a workqueue job and return or turn somewhat synchronous by waiting for whatever push happens to be in progress. That implies another potential workaround could be to find a way around this sync behavior (assuming that would be fundamentally more simple than just avoiding log forces entirely while holding non-transactional buffer locks). Of course that assumes there aren't other variants of this. I still need to poke around more to get a better idea of the scope.
One that looks like a possibility to me right now is the xfs_iflush_cluster() -> xfs_force_shutdown() -> xfs_log_force() sequence (in the event of a corrupt in-core inode on the same pinned buffer) down in the same iflush path. Another could be in xfs_trans_read_buf_map() where if we somehow grab a dirty+pinned buffer with b_error set, we can issue a shutdown if the transaction is dirty before we release the failed buffer. That combination of events should probably never happen (and the vector is probably closed by just releasing the buffer before the shutdown), but it kind of illustrates the potential for whack-a-mole with log forces and shutdown calls scattered around in so many places.. :/ Brian > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-26 17:13 ` Brian Foster @ 2019-03-27 1:05 ` Dave Chinner 2019-03-27 14:13 ` Brian Foster 0 siblings, 1 reply; 18+ messages in thread From: Dave Chinner @ 2019-03-27 1:05 UTC (permalink / raw) To: Brian Foster; +Cc: Darrick J. Wong, xfs On Tue, Mar 26, 2019 at 01:13:24PM -0400, Brian Foster wrote: > On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote: > ... > > > > So given that inode reclaim already waits for inodes to be unpinned > > and there is a xfs_buf_unpin_wait() call in > > xfs_bwrite()->xfs_buf_submit() path that will issue a log force on > > the buffer from the inode reclaim path, we don't actually need the > > log force in xfs_iflush() at all. i.e. the AIL will capture the > > buffer state and unpin inode and dquot buffers automatically, and so > > we can simply remove the pinned-buffer log forces from the > > inode/dquot flush code.... > > > > Hmm, where is the log force you reference above in the inode reclaim (-> > xfs_bwrite()) path if the buffer happens to be pinned? I see the > xfs_buf_wait_unpin() in the submit path, but that just waits on the pin > count (which I think is safe). It actually looks to me that reclaim > could be susceptible to the write delay problem you referenced earlier > in this thread without the log force from xfs_iflush().. I was thinking that the unpin wait code issued the log force. Oh, we only issue a log force in xfs_iunpin_wait(), not in the xfs_buf_wait_unpin(). Ugh. My mistake. As for the write delay problem, it's entirely possible that this could occur, but I have no evidence that it is actually a problem for reclaim - I have only ever found reclaim blocking waiting on IO in synchronous reclaim, never blocking on buffer unpin waits in async reclaim.
> It also occurs to me that the xfs_iflush() log force isn't blocking > because it's a a sync force, but rather because there is already a CIL > push in progress. The flush_work() call basically means that a non-sync > force can either queue a workqueue job and return or turn somewhat > synchronous by waiting for whatever push happens to be in progress. xfs_log_force() isn't a non-blocking operation - it is supposed to write and dispatch all the pending changes in memory. If XFS_LOG_SYNC is not set then it does not wait for IO completion - it is expected that the caller will then wait on whatever it needs unpinning, not wait for everything in the log flush to be completed. e.g. this is why xfs_iunpin_wait() does not use XFS_LOG_SYNC - we only need to wait for the inode to be unpinned, not everything in the log. > That > implies another potential workaround could be to find a way around this > sync behavior (assuming that would be fundamentally more simple than > just avoiding log forces entirely while holding non-transactional buffer > locks). Or we can just remove it altogether and put the necessary log forces in the callers where the buffer is no longer held. > Of course that assumes there aren't other variants of this. I still need > to poke around more to get a better idea of the scope. One that looks > like a possibility to me right now is the xfs_iflush_cluster() -> > xfs_force_shutdown() -> xfs_log_force() sequence (in the event of a The xfs_iflush_cluster error handling has always been full of shutdown-related bugs. xfs_iflush() gets this buffer-vs-shutdown ordering right, but if it's ok to just release the buffer with the inode still attached to it in the xfs_iflush() case, then why isn't it ok to do exactly the same thing in xfs_iflush_cluster()? Oh, it's because the xfs_iflush() error handling doesn't need to remove the inode from the buffer to get a reference to it to call xfs_iflush_abort() to remove the inode from the AIL. 
IOWs, I think xfs_iflush_cluster() should really be doing the same thing - pull the list of inodes off the buffer, release the buffer, trigger a shutdown, then call xfs_iflush_abort() on each of those inodes.... > corrupt in-core inode on the same pinned buffer) down in the same iflush > path. Another could be in xfs_trans_read_buf_map() where if we somehow > grab a dirty+pinned buffer with b_error set, we can issue a shutdown if > the transaction is dirty before we release the failed buffer. Again, that would be indicative of a bug somewhere else. A buffer found by a buffer lookup should not be pinned and dirty in memory with a b_error value on it. Whatever recorded the error should have processed and cleared the error, marked the buffer as stale, and/or shut down the filesystem, not left it in cache for someone else to trip over. e.g. see xfs_buf_iodone_callback_error(). If it's a transient error, it clears b_error before it returns. If it's a permanent error, it shuts down the filesystem and stales the buffer and so it will not ever be seen again on lookup. i.e. The code in xfs_trans_read_buf_map() is detecting physical read IO errors bringing new metadata into the cache and reacting appropriately. Clean transactions can back out gracefully, while dirty transactions will trigger a shutdown during cancellation anyway, so the early shutdown is just a way of ensuring we error out appropriately. Maybe all the callers get the error handling right, and if that is the case then we can remove the shutdowns from xfs_trans_read_buf_map() altogether.....
Calling xfs_force_shutdown() holding inappropriate locks is a bug that should be fixed; changing how xfs_log_force() works is a much more risky proposition than fixing the callers that aren't safe... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: generic/475 deadlock? 2019-03-27 1:05 ` Dave Chinner @ 2019-03-27 14:13 ` Brian Foster 0 siblings, 0 replies; 18+ messages in thread From: Brian Foster @ 2019-03-27 14:13 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, xfs On Wed, Mar 27, 2019 at 12:05:52PM +1100, Dave Chinner wrote: > On Tue, Mar 26, 2019 at 01:13:24PM -0400, Brian Foster wrote: > > On Fri, Mar 22, 2019 at 08:10:51AM +1100, Dave Chinner wrote: > > ... > > > > > > So given that inode reclaim already waits for inodes to be unpinned > > > and there is a xfs_buf_unpin_wait() call in > > > xfs_bwrite()->xfs_buf_submit() path that will issue a log force on > > > the buffer from the inode reclaim path, we don't actually need the > > > log force in xfs_iflush() at all. i.e. the AIL will capture the > > > buffer state and unpin inode and dquot buffers automatically, and so > > > we can simply remove the pinned-buffer log forces from the > > > inode/dquot flush code.... > > > > > > > Hmm, where is the log force you reference above in the inode reclaim (-> > > xfs_bwrite()) path if the buffer happens to be pinned? I see the > > xfs_buf_wait_unpin() in the submit path, but that just waits on the pin > > count (which I think is safe). It actually looks to me that reclaim > > could be susceptible to the write delay problem you referenced earlier > > in this thread without the log force from xfs_iflush().. > > I was thinking that the unpin wait code issued the log force. > Oh, we only issue a log force in xfs_iunpin_wait(), not in the > xfs_buf_wait_unpin(). Ugh. My mistake. > > As for the write delay problem, it's entire possible that this could > occur, but I have no evidence that it is actually a problem for > reclaim - I have only ever found reclaim blocking waiting on IO in > synchronous reclaim, never blocking on buffer unpin waits in async > reclaim. 
That's not to say it doesn't happen, I've just never seen > it and I've been looking at inode reclaim blocking vectors quite a > lot in the past couple of years.... > Good info, thanks. So this is something we should at least be cognizant of, but it sounds like it's not a critical or likely state for reclaim. > > It also occurs to me that the xfs_iflush() log force isn't blocking > > because it's a a sync force, but rather because there is already a CIL > > push in progress. The flush_work() call basically means that a non-sync > > force can either queue a workqueue job and return or turn somewhat > > synchronous by waiting for whatever push happens to be in progress. > > xfs_log_force() isn't a non-blocking operation - it is supposed to > write and dispatch all the pending changes in memory. If > XFS_LOG_SYNC is not set then it does not wait for IO completion - it > is expected that the caller will then wait on whatever it needs > unpinning, not wait for everything in the log flush to be completed. > e.g. this is why xfs_iunpin_wait() does not use XFS_LOG_SYNC - we > only need to wait for the inode to be unpinned, not everything in > the log. > Yes, I don't mean to imply it's nonblocking. I was thinking that the case where we queue the CIL push could prevent the caller from blocking on the force completion (even though we still wait on the commit completion). Looking closer, I see that the CIL push is where we actually abort the CIL context (xlog_cil_push() calls xlog_cil_committed(..., XFS_LI_ABORTED)) and that occurs before the wake of the commit completion, so the behavior would probably be the same even without calling flush_work(). But the interesting thing to note here is that in the normal case, xc_commit_wait is awoken after submission (at the tail end of xlog_cil_push()) right when we assign ->commit_lsn. In the abort case, xc_commit_wait is technically awoken after callback completion processing in xlog_cil_committed(). 
This occurs after the xfs_trans_committed_bulk() call and thus
serializes on unpin processing (where we need the buffer lock). It's not
clear to me why the wake is here. From the code, I assume it's just to
batch things into a single xc_push_lock critical section. We need a wake
somewhere of course because we haven't assigned a commit_lsn in the
abort case, but I wonder if without this particular ordering an async
log force would now be safe from deadlock.

For example, if xlog_cil_committed() awoke xc_commit_wait earlier, the
CIL push could unwind in parallel to the CIL context abort. It already
doesn't wait on log force completion because it's an async force (no
XFS_LOG_SYNC), and thus xfs_iflush() can unwind and (eventually) unlock
the buffer. I.e., this would be safe for shutdown cases in the same
general manner it's currently safe for the non-shutdown case. Hmmm....

Darrick, since I haven't been able to reproduce this problem, could you
give the appended (untested) patch a try and let me know whether it
helps at all?

> > That implies another potential workaround could be to find a way
> > around this sync behavior (assuming that would be fundamentally more
> > simple than just avoiding log forces entirely while holding
> > non-transactional buffer locks).
>
> Or we can just remove it altogether and put the necessary log forces
> in the callers where the buffer is no longer held.
>

Yep, that's what I meant by just avoiding the situation entirely..

> > Of course that assumes there aren't other variants of this. I still need
> > to poke around more to get a better idea of the scope. One that looks
> > like a possibility to me right now is the xfs_iflush_cluster() ->
> > xfs_force_shutdown() -> xfs_log_force() sequence (in the event of a
>
> The xfs_iflush_cluster error handling has always been full of
> shutdown-related bugs.
>

Indeed, I recall dealing with these various problems in the past.
> xfs_iflush() gets this buffer-vs-shutdown ordering right, but if
> it's ok to just release the buffer with the inode still attached to
> it in the xfs_iflush() case, then why isn't it ok to do exactly the
> same thing in xfs_iflush_cluster()?
>
> Oh, it's because the xfs_iflush() error handling doesn't need to
> remove the inode from the buffer to get a reference to it to call
> xfs_iflush_abort() to remove the inode from the AIL.
>
> IOWs, I think xfs_iflush_cluster() should really be doing the same
> thing - pull the list of inodes off the buffer, release the buffer,
> trigger a shutdown, then call xfs_iflush_abort() on each of those
> inodes....
>

I'll have to take a closer look at the broader iflush[_cluster]() error
handling to wrap my head around this.

> > corrupt in-core inode on the same pinned buffer) down in the same iflush
> > path. Another could be in xfs_trans_read_buf_map() where if we somehow
> > grab a dirty+pinned buffer with b_error set, we can issue a shutdown if
> > the transaction is dirty before we release the failed buffer.
>
> Again, that would be indicative of a bug somewhere else. A buffer
> found by a lookup should not be pinned and dirty in memory with a
> b_error value on it. Whatever recorded the error should have
> processed and cleared the error, marked the buffer as stale, and/or
> shut down the filesystem, not left it in cache for someone else
> to trip over.
>

Completely agreed that this should never happen. I'm fine with
categorizing these kinds of scenarios as a separate bug. I'm just
describing the complexity involved with trying to reason about the scope
of the general problem.

> e.g. see xfs_buf_iodone_callback_error(). If it's a transient error,
> it clears b_error before it returns. If it's a permanent error, it
> shuts down the filesystem and stales the buffer and so it will not
> ever be seen again on lookup.
>
> i.e.
> The code in xfs_trans_read_buf_map() is detecting physical read
> IO errors bringing new metadata into the cache and reacting
> appropriately. Clean transactions can back out gracefully, while
> dirty transactions will trigger a shutdown during cancellation
> anyway, so the early shutdown is just a way of ensuring we error
> out appropriately. Maybe all the callers get the error handling
> right, and if that is the case then we can remove the shutdowns from
> xfs_trans_read_buf_map() altogether.....
>
> > That combination of events should probably never happen (and the
> > vector is probably closed by just releasing the buffer before the
> > shutdown), but it kind of illustrates the potential for whack-a-mole
> > with log forces and shutdown calls scattered around in so many
> > places.. :/
>
> If we end up in this kind of "whack-a-mole" scenario, it indicates
> we have other bugs that need fixing. Calling xfs_force_shutdown()
> holding inappropriate locks is a bug that should be fixed; changing
> how xfs_log_force() works is a much more risky proposition than
> fixing the callers that aren't safe...
>

I agree in principle on changing xfs_log_force() behavior, but I think
there's a possibility that CIL context abort simply has a bug that
causes this problem for async log forces that happen to race with a
shutdown (re: the analysis above).

That aside, having some kind of guideline on when (not) to call a sync
log force or shutdown is partly what I'm trying to reason about. Your
statement above implies that such a guideline could be to not do so in
general while holding things like buffer locks. That sounds perfectly
reasonable to me and is a much more straightforward concept to apply to
future code on review than trying to work out whether each sync log
force or shutdown site can somehow result in a shutdown deadlock. The
only open question I see is whether this needs to apply to all log
forces or only sync log forces...
Brian

--- 8< ---

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index d3884e08b43c..5e595948bc5a 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -582,6 +582,19 @@ xlog_cil_committed(
 	struct xfs_cil_ctx	*ctx = args;
 	struct xfs_mount	*mp = ctx->cil->xc_log->l_mp;
 
+	/*
+	 * If the I/O failed, we're aborting the commit and already shutdown.
+	 * Wake any commit waiters before aborting the log items so we don't
+	 * block async log pushers on callbacks. Async log pushers explicitly do
+	 * not wait on log force completion because they may be holding locks
+	 * required to unpin items.
+	 */
+	if (abort) {
+		spin_lock(&ctx->cil->xc_push_lock);
+		wake_up_all(&ctx->cil->xc_commit_wait);
+		spin_unlock(&ctx->cil->xc_push_lock);
+	}
+
 	xfs_trans_committed_bulk(ctx->cil->xc_log->l_ailp, ctx->lv_chain,
 					ctx->start_lsn, abort);
 
@@ -589,15 +602,7 @@ xlog_cil_committed(
 	xfs_extent_busy_clear(mp, &ctx->busy_extents,
 			     (mp->m_flags & XFS_MOUNT_DISCARD) && !abort);
 
-	/*
-	 * If we are aborting the commit, wake up anyone waiting on the
-	 * committing list. If we don't, then a shutdown we can leave processes
-	 * waiting in xlog_cil_force_lsn() waiting on a sequence commit that
-	 * will never happen because we aborted it.
-	 */
 	spin_lock(&ctx->cil->xc_push_lock);
-	if (abort)
-		wake_up_all(&ctx->cil->xc_commit_wait);
 	list_del(&ctx->committing);
 	spin_unlock(&ctx->cil->xc_push_lock);
* Re: generic/475 deadlock?
  2019-03-20  5:04 generic/475 deadlock? Darrick J. Wong
  2019-03-20 17:03 ` Brian Foster
@ 2019-03-20 21:39 ` Dave Chinner
  1 sibling, 0 replies; 18+ messages in thread

From: Dave Chinner @ 2019-03-20 21:39 UTC (permalink / raw)
To: Darrick J. Wong; +Cc: xfs

On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> Hmmm.
>
> Every now and then I see a generic/475 deadlock that generates the
> hangcheck warning pasted below.
>
> I /think/ this is ... the ail is processing an inode log item, for which
> it locked the cluster buffer and pushed the cil to unpin the buffer.
> However, the cil is cleaning up after the shut down and is trying to
> simulate an EIO completion, but tries to grab the buffer lock and hence
> the cil and ail deadlock. Maybe the solution is to trylock in the
> (freed && remove) case of xfs_buf_item_unpin, since we're tearing the
> whole system down anyway?

Oh, that looks like a bug in xfs_iflush() - we are forcing the log to
unpin a buffer we already own the lock on. It's the same problem we had
in the discard code fixed by commit 8c81dd46ef3c ("Force log to disk
before reading the AGF during a fstrim").

It also means that the log forces in the busy extent code have the same
potential problem, as does xfs_qm_dqflush().

I'll move further down the discussion now....

Cheers,

Dave.
--
Dave Chinner
david@fromorbit.com
end of thread, other threads:[~2019-03-27 14:14 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
2019-03-20  5:04 generic/475 deadlock? Darrick J. Wong
2019-03-20 17:03 ` Brian Foster
2019-03-20 17:45   ` Darrick J. Wong
2019-03-20 17:59     ` Brian Foster
2019-03-20 21:49       ` Dave Chinner
2019-03-21 12:11         ` Brian Foster
2019-03-21 21:10           ` Dave Chinner
2019-03-21 21:53             ` Brian Foster
2019-03-21 23:50               ` Dave Chinner
2019-03-22  0:07                 ` Darrick J. Wong
2019-03-22 12:01                 ` Brian Foster
2019-03-24 23:03                   ` Dave Chinner
2019-03-25 12:34                     ` Brian Foster
2019-03-27  1:22                       ` Dave Chinner
2019-03-26 17:13                 ` Brian Foster
2019-03-27  1:05                   ` Dave Chinner
2019-03-27 14:13                     ` Brian Foster
2019-03-20 21:39 ` Dave Chinner