From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: xfs <linux-xfs@vger.kernel.org>
Subject: Re: generic/475 deadlock?
Date: Wed, 20 Mar 2019 13:59:22 -0400
Message-ID: <20190320175922.GB29010@bfoster>
In-Reply-To: <20190320174551.GA1186@magnolia>

On Wed, Mar 20, 2019 at 10:45:51AM -0700, Darrick J. Wong wrote:
> On Wed, Mar 20, 2019 at 01:03:05PM -0400, Brian Foster wrote:
> > On Tue, Mar 19, 2019 at 10:04:08PM -0700, Darrick J. Wong wrote:
> > > Hmmm.
> > > 
> > > Every now and then I see a generic/475 deadlock that generates the
> > > hangcheck warning pasted below.
> > > 
> > 
> > Haven't seen it, but I'll kick off a few iterations..
> > 
> > > I /think/ this is ... the ail is processing an inode log item, for which
> > > it locked the cluster buffer and pushed the cil to unpin the buffer.
> > > However, the cil is cleaning up after the shutdown and is trying to
> > > simulate an EIO completion, but tries to grab the buffer lock, and hence
> > > the cil and ail deadlock.  Maybe the solution is to trylock in the
> > > (freed && remove) case of xfs_buf_item_unpin, since we're tearing the
> > > whole system down anyway?
> > > 
> > 
> > While the xfsaild will lock buffers via ->iop_push() callouts, I think
> > this is only to queue the buffer to the delwri queue (both buffer and
> > inode item pushes should cycle the buf lock before returning) and then
> > to submit the pending delwri queue. I don't think xfsaild should own any
> > such locks at the time it issues a log force..
> 
> ...but I think it does (see below) because iflush forces the log when
> the cluster buffer is pinned, and only after that does it add the buffer
> to the delwri queue.
> 

Oops, my mistake.. I read through the stack too quickly and misread it
as a log force issued directly from xfsaild_push() rather than from
xfs_iflush().
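
Right, and the other side of the deadlock is the (freed && remove) case
of xfs_buf_item_unpin() that you mentioned, which IIRC looks roughly
like the following (paraphrased from memory, untested):

	} else if (freed && remove) {
		/*
		 * Simulate an async I/O failure on the final unpin of
		 * an aborted item after a shutdown. This is the
		 * xfs_buf_lock() the CIL worker is blocked in below.
		 */
		xfs_buf_lock(bp);
		xfs_buf_hold(bp);
		bp->b_flags |= XBF_ASYNC;
		xfs_buf_ioerror(bp, -EIO);
		bp->b_flags &= ~XBF_DONE;
		xfs_buf_stale(bp);
		xfs_buf_ioend(bp);
	}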

> > > Anyway, tired, going to bed now, really just backing up my brain to the
> > > internet... :)
> > > 
> > > --D
> > > 
> > > INFO: task kworker/1:4:4668 blocked for more than 61 seconds.
> > >       Not tainted 5.1.0-rc1-xfsx #rc1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > kworker/1:4     D11632  4668      2 0x80000000
> > > Workqueue: xfs-cil/dm-0 xlog_cil_push_work [xfs]
> > > Call Trace:
> > >  ? __schedule+0x309/0x9e0
> > >  schedule+0x36/0x90
> > >  schedule_timeout+0x1fd/0x500
> > >  ? __next_timer_interrupt+0xc0/0xc0
> > >  ? __down_timeout+0x9b/0x120
> > >  __down_timeout+0xc3/0x120
> > >  down_timeout+0x43/0x50
> > >  xfs_buf_lock+0x7a/0x280 [xfs]
> > >  xfs_buf_item_unpin+0x300/0x4a0 [xfs]
> > >  xfs_trans_committed_bulk+0x105/0x3e0 [xfs]
> > >  xlog_cil_committed+0x4b/0x410 [xfs]
> > >  ? mark_held_locks+0x47/0x70
> > >  ? xlog_cil_push+0x3f3/0x460 [xfs]
> > >  ? xlog_cil_push+0x3f3/0x460 [xfs]
> > >  ? rcu_read_lock_sched_held+0x74/0x80
> > >  xlog_cil_push+0x400/0x460 [xfs]
> > >  process_one_work+0x252/0x600
> > >  worker_thread+0x1d9/0x390
> > >  ? wq_calc_node_cpumask+0x80/0x80
> > >  kthread+0x11f/0x140
> > >  ? kthread_create_on_node+0x70/0x70
> > >  ret_from_fork+0x24/0x30
> > 
> > Ok, so there's a CIL push where the log write failed and the buf item
> > abort is waiting on the buf lock..
> > 
> > > INFO: task xfsaild/dm-0:7063 blocked for more than 61 seconds.
> > >       Not tainted 5.1.0-rc1-xfsx #rc1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > xfsaild/dm-0    D13752  7063      2 0x80000000
> > > Call Trace:
> > >  ? __schedule+0x309/0x9e0
> > >  ? wait_for_completion+0x12b/0x1c0
> > >  schedule+0x36/0x90
> > >  schedule_timeout+0x32a/0x500
> > >  ? wait_for_completion+0x12b/0x1c0
> > >  wait_for_completion+0x153/0x1c0
> > >  ? wake_up_q+0x80/0x80
> > >  __flush_work+0x1e1/0x300
> > >  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> > >  ? wait_for_completion+0x43/0x1c0
> > >  xlog_cil_force_lsn+0x97/0x270 [xfs]
> > >  ? xfs_iflush+0x38d/0x470 [xfs]
> > >  xfs_log_force+0xff/0x4f0 [xfs]
> > >  ? xfs_dinode_calc_crc+0x3b/0x60 [xfs]
> > >  ? xfs_iflush_int+0x24d/0x570 [xfs]
> > >  xfs_iflush+0x38d/0x470 [xfs]
> > >  xfs_inode_item_push+0xe6/0x200 [xfs]
> > >  xfsaild+0x50d/0x1210 [xfs]
> > >  ? sched_clock+0x5/0x10
> > >  ? sched_clock_cpu+0xe/0xd0
> > >  ? xfs_trans_ail_cursor_first+0x80/0x80 [xfs]
> > >  kthread+0x11f/0x140
> > >  ? kthread_create_on_node+0x70/0x70
> > >  ret_from_fork+0x24/0x30
> > 
> > xfsaild forces the log and the associated CIL push blocks on the one
> > already in progress..
> > 
> > > INFO: task umount:7089 blocked for more than 61 seconds.
> > >       Not tainted 5.1.0-rc1-xfsx #rc1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > umount          D13312  7089   6414 0x00000000
> > > Call Trace:
> > >  ? __schedule+0x309/0x9e0
> > >  ? wait_for_completion+0x12b/0x1c0
> > >  schedule+0x36/0x90
> > >  schedule_timeout+0x32a/0x500
> > >  ? wait_for_completion+0x12b/0x1c0
> > >  wait_for_completion+0x153/0x1c0
> > >  ? wake_up_q+0x80/0x80
> > >  __flush_work+0x1e1/0x300
> > >  ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
> > >  ? wait_for_completion+0x43/0x1c0
> > >  xlog_cil_force_lsn+0x97/0x270 [xfs]
> > >  ? xfs_fs_sync_fs+0x21/0x50 [xfs]
> > >  xfs_log_force+0xff/0x4f0 [xfs]
> > >  xfs_fs_sync_fs+0x21/0x50 [xfs]
> > >  sync_filesystem+0x6b/0x90
> > >  generic_shutdown_super+0x22/0x100
> > >  kill_block_super+0x21/0x50
> > >  deactivate_locked_super+0x29/0x60
> > >  cleanup_mnt+0x3b/0x70
> > >  task_work_run+0x98/0xc0
> > >  exit_to_usermode_loop+0x83/0x90
> > >  do_syscall_64+0x14b/0x150
> > >  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> > 
> > And unmount is doing a log force as well..
> > 
> > Hmm.. yeah, I think we need to figure out how/why the buffer is locked.
> 
> AFAICT it's xfs_inode_item_push -> xfs_iflush -> xfs_imap_to_bp that
> returns the locked cluster buffer (code extract from xfs_iflush):
> 
> 	error = xfs_imap_to_bp(mp, NULL, &ip->i_imap, &dip, &bp, XBF_TRYLOCK,
> 			       0);
> 	if (error == -EAGAIN) {
> 		xfs_ifunlock(ip);
> 		return error;
> 	}
> 	if (error)
> 		goto corrupt_out;
> 
> So if we make it to here we have bp pointing to the locked inode
> cluster buffer...
> 
> 	/*
> 	 * First flush out the inode that xfs_iflush was called with.
> 	 */
> 	error = xfs_iflush_int(ip, bp);
> 	if (error)
> 		goto corrupt_out;
> 
> ...after which we flush the inode to the (still locked) cluster
> buffer...
> 
> 	/*
> 	 * If the buffer is pinned then push on the log now so we won't
> 	 * get stuck waiting in the write for too long.
> 	 */
> 	if (xfs_buf_ispinned(bp))
> 		xfs_log_force(mp, 0);
> 
> ...but if the buffer is also pinned then we kick the log (while holding
> the buffer lock) to unpin the buffer.  Since the fs is shutdown, the cil
> will just be trying to remove everything, but it needs to lock the
> buffer to simulate EIO.
> 

Yeah, makes sense.
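
So the cycle, to summarize:

  xfsaild:   xfs_iflush() holds the inode cluster buffer locked
               -> xfs_log_force() -> blocks on the running CIL push
  CIL push:  log write failed, aborting log items
               -> xfs_buf_item_unpin() -> blocks on the buffer lock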

> A naive fix to the deadlock is to see if the inode buf ops are attached
> and downgrade to a trylock and go ahead even if we can't lock it, but ick.
> It resolves the deadlock so the umount can proceed, but it also
> triggers a lot of unlocked buffer asserts.
> 
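
(I'd guess the hack looks something like the below; my reconstruction,
untested, just to illustrate:)

 	} else if (freed && remove) {
-		xfs_buf_lock(bp);
+		/*
+		 * HACK: xfs_iflush() can hold the cluster buffer
+		 * locked across its log force, so don't block on the
+		 * lock for inode buffers.
+		 */
+		if (bp->b_ops != &xfs_inode_buf_ops)
+			xfs_buf_lock(bp);
+		else
+			xfs_buf_trylock(bp);	/* go ahead even if this fails */
 		xfs_buf_hold(bp);
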
> I'm not sure what "too long" means in that code chunk.  It was added to
> reduce stalls in pdflush, which these days I think translates to trying
> to reduce the amount of time reclaim can stall while flushing inodes...?
> 

I'm not sure either. I was wondering why we'd need that given that
xfsaild() has its own log force logic in cases where the delwri submit
didn't fully complete (i.e., for pinned buffers). Relying on that seems
more appropriate since we'd force the log once per delwri submission,
covering multiple buffers, as opposed to once per inode cluster buffer.
Perhaps we could just kill this log force?
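
E.g., something like this (completely untested; assumes the log force in
the xfsaild delwri submission path picks up the slack):

--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ ... @@ xfs_iflush(
-	/*
-	 * If the buffer is pinned then push on the log now so we won't
-	 * get stuck waiting in the write for too long.
-	 */
-	if (xfs_buf_ispinned(bp))
-		xfs_log_force(mp, 0);
-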

Brian

> (and, heh, I think I've ended up back at "move the reclaim iflush to
> deferred inactivation"...)
> 
> > I don't think it should be xfsaild because while the delwri submit locks
> > the buffers for I/O, that should essentially transfer lock ownership to
> > the async I/O (i.e., to be unlocked via xfs_buf_ioend() at bio
> > completion). Is the completion wq associated with the buf I/O stuck
> > somewhere?
> 
> I didn't see any threads waiting on the buffer io completion in the
> sysrq-t output.
> 
> > Is the buffer submission path working correctly in the shutdown
> > case..?
> 
> It looks like it's working ok...
> 
> --D
> 
> > Brian
> > 
> > > RIP: 0033:0x7fa9ff80e8c7
> > > Code: Bad RIP value.
> > > RSP: 002b:00007ffd647ea078 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
> > > RAX: 0000000000000000 RBX: 00005642818ed970 RCX: 00007fa9ff80e8c7
> > > RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005642818fa100
> > > RBP: 0000000000000000 R08: 00005642818f91e0 R09: 00007fa9ffad8c40
> > > R10: 000000000000000b R11: 0000000000000246 R12: 00005642818fa100
> > > R13: 00007fa9ffd308a4 R14: 00005642818f9980 R15: 0000000000000000
> > > 
> > > Showing all locks held in the system:
> > > 1 lock held by khungtaskd/34:
> > >  #0: 0000000056e8f88f (rcu_read_lock){....}, at: debug_show_all_locks+0x15/0x17f
> > > 1 lock held by in:imklog/1113:
> > >  #0: 000000007873f311 (&f->f_pos_lock){+.+.}, at: __fdget_pos+0x4d/0x60
> > > 5 locks held by kworker/u13:4/3736:
> > > 2 locks held by kworker/1:4/4668:
> > >  #0: 00000000cd34452f ((wq_completion)xfs-cil/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
> > >  #1: 000000003c563604 ((work_completion)(&cil->xc_push_work)){+.+.}, at: process_one_work+0x1d8/0x600
> > > 2 locks held by kworker/2:4/5394:
> > >  #0: 0000000072544c54 ((wq_completion)xfs-sync/dm-0){+.+.}, at: process_one_work+0x1d8/0x600
> > >  #1: 00000000c3e919b6 ((work_completion)(&(&log->l_work)->work)){+.+.}, at: process_one_work+0x1d8/0x600
> > > 1 lock held by xfsaild/dm-0/7063:
> > >  #0: 0000000049da3bb4 (&xfs_nondir_ilock_class){++++}, at: xfs_ilock_nowait+0x1ee/0x330 [xfs]
> > > 1 lock held by umount/7089:
> > >  #0: 0000000031dc2924 (&type->s_umount_key#50){++++}, at: deactivate_super+0x43/0x50
> > > 
> > > =============================================
> > > 
