linux-xfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
@ 2019-08-21 11:04 Chandan Rajendra
  2019-08-21 16:42 ` Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-21 11:04 UTC (permalink / raw)
  To: linux-xfs; +Cc: Chandan Rajendra, chandan, darrick.wong, hch, david

The following call trace is seen when executing generic/530 on a ppc64le
machine,

INFO: task mount:7722 blocked for more than 122 seconds.
      Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mount           D 8448  7722   7490 0x00040008
Call Trace:
[c000000629343210] [0000000000000001] 0x1 (unreliable)
[c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
[c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
[c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
[c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
[c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
[c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
[c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
[c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
[c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
[c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
[c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
[c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
[c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
[c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
[c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
[c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
[c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
[c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
[c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
[c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
[c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
[c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
[c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
[c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
[c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
[c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
[c000000629343e20] [c00000000000b864] system_call+0x5c/0x70

i.e. the mount task gets hung indefinitely due to the following sequence
of events,

1. Test creates lots of unlinked temp files and then shutsdown the
   filesystem.
2. During mount, a transaction started in the context of processing
   unlinked inode list causes several iclogs to be filled up. All but
   the last one is submitted for I/O.
3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
   18532 bytes of free space in the last iclog of the transaction which is
   greater than 2*sizeof(xlog_op_header_t). Hence
   xlog_state_get_iclog_space() does not switch over to using a newer iclog.
4. Meanwhile, the endio code processing iclogs of the transaction do not
   insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
   hasn't been submitted for I/O yet. Hence a major part of the on-disk
   log cannot be freed yet.
5. A new request for log space (via xfs_log_reserve()) will now wait
   indefinitely for on-disk log space to be freed.

To fix this issue, before waiting for log space to be freed, this commit
now submits xlog->l_iclog for write I/O if iclog->ic_state is
XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
AIL list to be populated and a later call to xlog_grant_push_ail() will
free up the on-disk log space.

Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
---
 fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 00e9f5c388d3..dc785a6b9f47 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -236,11 +236,32 @@ xlog_grant_head_wait(
 	int			need_bytes) __releases(&head->lock)
 					    __acquires(&head->lock)
 {
+	struct xlog_in_core	*iclog;
+
 	list_add_tail(&tic->t_queue, &head->waiters);
 
 	do {
 		if (XLOG_FORCED_SHUTDOWN(log))
 			goto shutdown;
+
+		if (xfs_ail_min(log->l_ailp) == NULL) {
+			spin_lock(&log->l_icloglock);
+			iclog = log->l_iclog;
+
+			if (iclog->ic_state == XLOG_STATE_ACTIVE
+				&& iclog->ic_offset) {
+				atomic_inc(&iclog->ic_refcnt);
+				xlog_state_want_sync(log, iclog);
+				spin_unlock(&log->l_icloglock);
+				xlog_state_release_iclog(log, iclog);
+
+				spin_lock(&log->l_icloglock);
+				xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
+			} else {
+				spin_unlock(&log->l_icloglock);
+			}
+		}
+
 		xlog_grant_push_ail(log, need_bytes);
 
 		__set_current_state(TASK_UNINTERRUPTIBLE);
-- 
2.19.1


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-21 11:04 [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space Chandan Rajendra
@ 2019-08-21 16:42 ` Brian Foster
  2019-08-22 12:46   ` Chandan Rajendra
  2019-08-21 22:18 ` Dave Chinner
       [not found] ` <20190830164750.GD26520@bfoster>
  2 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2019-08-21 16:42 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: linux-xfs, chandan, darrick.wong, hch, david

On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> The following call trace is seen when executing generic/530 on a ppc64le
> machine,
> 

Could you provide xfs_info of the scratch device that reproduces this
problem? I'm curious because I don't recall seeing this, but I'm also
not sure I've run those iunlink tests on ppc64...

> INFO: task mount:7722 blocked for more than 122 seconds.
>       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> mount           D 8448  7722   7490 0x00040008
> Call Trace:
> [c000000629343210] [0000000000000001] 0x1 (unreliable)
> [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> 
> i.e. the mount task gets hung indefinitely due to the following sequence
> of events,
> 
> 1. Test creates lots of unlinked temp files and then shutsdown the
>    filesystem.
> 2. During mount, a transaction started in the context of processing
>    unlinked inode list causes several iclogs to be filled up. All but
>    the last one is submitted for I/O.

So the inactivation transaction commits should basically start to
populate the CIL. At some point we cross a threshold and a commit
triggers a background CIL push (see xlog_cil_push_background()). I'm
assuming that occurs somewhere in here because otherwise we wouldn't
have started filling iclogs..

> 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
>    18532 bytes of free space in the last iclog of the transaction which is
>    greater than 2*sizeof(xlog_op_header_t). Hence
>    xlog_state_get_iclog_space() does not switch over to using a newer iclog.

Ok, so we've sent a bunch of iclogs to disk and the commit record for
the checkpoint ends up in the current iclog log, which remains active on
return from xlog_commit_record() -> xlog_write() etc.

> 4. Meanwhile, the endio code processing iclogs of the transaction do not
>    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
>    hasn't been submitted for I/O yet. Hence a major part of the on-disk
>    log cannot be freed yet.

Ok, so that final (still active) iclog holding the commit record is also
holding the ctx.

> 5. A new request for log space (via xfs_log_reserve()) will now wait
>    indefinitely for on-disk log space to be freed.
> 
> To fix this issue, before waiting for log space to be freed, this commit
> now submits xlog->l_iclog for write I/O if iclog->ic_state is
> XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> AIL list to be populated and a later call to xlog_grant_push_ail() will
> free up the on-disk log space.
> 

Interesting, so I think I follow the issue at least and what the code is
trying to do to fix it. I'm not totally clear on what the ideal approach
is as this code is complex and I'd need to think about it some more. My
first thought was perhaps whether we needed a checkpoint size limit or
to change the background push threshold or some such, but the more I
think about that the less I think that fixes anything.

A follow up thought is that it seems somewhat unfortunate that we can
build up and hold so much reservation on essentially a single commit
record and then just let it sit around in-core waiting for somebody else
to come along and ask for reservation for something unrelated. I think a
background log force should eventually come around and push that last
record out, so we're not in that state indefinitely, but IIUC that's a
lot of log consumption essentially unaccounted for until the associated
items land in the AIL.

I'm wondering if we should have some measure (if we don't already) of
how much reservation is tied into a single iclog and use that as an
additional metric to determine when to switch iclogs a bit more
aggressively (as opposed to only based on how much space is physically
left in the iclog buffer). For example, if we end a huge checkpoint with
a commit record in an active iclog that has a cil context attached
associated with a hundreds of MBs (handwaving) of log data, perhaps we
should just switch out of that iclog before the CIL push returns so that
consumption can be immediately reflected by the AIL.

Anyways, perhaps there are other thoughts/ideas on this..

Brian

> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 00e9f5c388d3..dc785a6b9f47 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -236,11 +236,32 @@ xlog_grant_head_wait(
>  	int			need_bytes) __releases(&head->lock)
>  					    __acquires(&head->lock)
>  {
> +	struct xlog_in_core	*iclog;
> +
>  	list_add_tail(&tic->t_queue, &head->waiters);
>  
>  	do {
>  		if (XLOG_FORCED_SHUTDOWN(log))
>  			goto shutdown;
> +
> +		if (xfs_ail_min(log->l_ailp) == NULL) {
> +			spin_lock(&log->l_icloglock);
> +			iclog = log->l_iclog;
> +
> +			if (iclog->ic_state == XLOG_STATE_ACTIVE
> +				&& iclog->ic_offset) {
> +				atomic_inc(&iclog->ic_refcnt);
> +				xlog_state_want_sync(log, iclog);
> +				spin_unlock(&log->l_icloglock);
> +				xlog_state_release_iclog(log, iclog);
> +
> +				spin_lock(&log->l_icloglock);
> +				xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
> +			} else {
> +				spin_unlock(&log->l_icloglock);
> +			}
> +		}
> +
>  		xlog_grant_push_ail(log, need_bytes);
>  
>  		__set_current_state(TASK_UNINTERRUPTIBLE);
> -- 
> 2.19.1
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-21 11:04 [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space Chandan Rajendra
  2019-08-21 16:42 ` Brian Foster
@ 2019-08-21 22:18 ` Dave Chinner
  2019-08-22 12:40   ` Chandan Rajendra
  2019-08-22 16:34   ` Brian Foster
       [not found] ` <20190830164750.GD26520@bfoster>
  2 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2019-08-21 22:18 UTC (permalink / raw)
  To: Chandan Rajendra; +Cc: linux-xfs, chandan, darrick.wong, hch

On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> The following call trace is seen when executing generic/530 on a ppc64le
> machine,
> 
> INFO: task mount:7722 blocked for more than 122 seconds.
>       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6

can you reproduce this on 5.3-rc5? There were bugs in log recovery
IO in -rc1 that could result in things going wrong...

> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> mount           D 8448  7722   7490 0x00040008
> Call Trace:
> [c000000629343210] [0000000000000001] 0x1 (unreliable)
> [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> 
> i.e. the mount task gets hung indefinitely due to the following sequence
> of events,
> 
> 1. Test creates lots of unlinked temp files and then shutsdown the
>    filesystem.
> 2. During mount, a transaction started in the context of processing
>    unlinked inode list causes several iclogs to be filled up. All but
>    the last one is submitted for I/O.
> 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
>    18532 bytes of free space in the last iclog of the transaction which is
>    greater than 2*sizeof(xlog_op_header_t). Hence
>    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> 4. Meanwhile, the endio code processing iclogs of the transaction do not
>    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
>    hasn't been submitted for I/O yet. Hence a major part of the on-disk
>    log cannot be freed yet.

So all those items are still pinned in memory.

> 5. A new request for log space (via xfs_log_reserve()) will now wait
>    indefinitely for on-disk log space to be freed.

Because nothing has issued a xfs_log_force() for write the iclog to
disk, unpin the objects that it pins in memory, and allow the tail
to be moved forwards.

The xfsaild normally takes care of thisi - it gets pushed byt the
log reserve when there's not enough space to in the log for the
transaction before transaction reserve goes to sleep in
xlog_grant_head_wait(). The AIL pushing code is then responsible for
making sure log space is eventually freed. It will issue log forces
if it isn't making progress and so this problem shouldn't occur.

So, why has it occurred?

The xfsaild kthread should be running at this point, so if it was
pushed it should be trying to empty the journal to move the tail
forward. Why hasn't it issue a log force?


> To fix this issue, before waiting for log space to be freed, this commit
> now submits xlog->l_iclog for write I/O if iclog->ic_state is
> XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> AIL list to be populated and a later call to xlog_grant_push_ail() will
> free up the on-disk log space.

hmmm.

> Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> ---
>  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 00e9f5c388d3..dc785a6b9f47 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -236,11 +236,32 @@ xlog_grant_head_wait(
>  	int			need_bytes) __releases(&head->lock)
>  					    __acquires(&head->lock)
>  {
> +	struct xlog_in_core	*iclog;
> +
>  	list_add_tail(&tic->t_queue, &head->waiters);
>  
>  	do {
>  		if (XLOG_FORCED_SHUTDOWN(log))
>  			goto shutdown;
> +
> +		if (xfs_ail_min(log->l_ailp) == NULL) {

This is indicative of the situation. If the AIL is empty, and the
log does not have room for an entire transaction reservation, then
we need to be issuing synchronous transactions in recovery until
such time the AIL pushing can actually function correctly to
guarantee forwards progress for async transaction processing.

sync transactions call xfs_log_force(XFS_LOG_SYNC) immediately after
writing the commit record, so this whole problem goes away.

perhaps in __xfs_trans_commit() we need somethign like this:

 	/*
 	 * If the transaction needs to be synchronous, then force the
 	 * log out now and wait for it.
+	 *
+	 * If we are in recovery and the AIL is empty, the log may only
+	 * have enough room for a single transaction and the AIL
+	 * cannot push the tail forwards. Hence while the AIL is
+	 * empty and we are in recovery, do synchronous transactions
+	 * to ensure the commit hits the journal and move into the
+	 * AIL. Once there are items in the AIL, it can move the
+	 * tail of the log forwards itself.
 	 */
-	if (sync) {
+	if (sync || 
+	    ((mp->m_flags & XFS_MOUNT_RECOVERY) && !xfs_ail_min(mp->m_ail)) {
 		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
 		XFS_STATS_INC(mp, xs_trans_sync);

The other option is that this can be put into the
xfs_trans_reserve() code to set the XFS_TRANS_SYNC flag to trigger
the log force in __xfs_trans_commit().

> +			spin_lock(&log->l_icloglock);
> +			iclog = log->l_iclog;
> +
> +			if (iclog->ic_state == XLOG_STATE_ACTIVE
> +				&& iclog->ic_offset) {
> +				atomic_inc(&iclog->ic_refcnt);
> +				xlog_state_want_sync(log, iclog);
> +				spin_unlock(&log->l_icloglock);
> +				xlog_state_release_iclog(log, iclog);
> +
> +				spin_lock(&log->l_icloglock);
> +				xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);

As I suspected, that's just an open coded, cut down log force. And
it has a bug in it - you can't drop the l_icloglock, then pick it
back up and sleep immediately without redoing all the state checks
again. The log IO may have completed and run the wakeups before you
get the spinlock back, and so this will simply hang forever here.

So, yeah, I think getting log forces into the right places are
the right way to go here...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-21 22:18 ` Dave Chinner
@ 2019-08-22 12:40   ` Chandan Rajendra
  2019-08-22 13:19     ` Chandan Rajendra
  2019-08-22 16:34   ` Brian Foster
  1 sibling, 1 reply; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-22 12:40 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, darrick.wong, hch

On Thursday, August 22, 2019 3:48 AM Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > The following call trace is seen when executing generic/530 on a ppc64le
> > machine,
> > 
> > INFO: task mount:7722 blocked for more than 122 seconds.
> >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> 
> can you reproduce this on 5.3-rc5? There were bugs in log recovery
> IO in -rc1 that could result in things going wrong...

Yes, I was able to recreate this bug on v5.3-rc5.

> 
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > mount           D 8448  7722   7490 0x00040008
> > Call Trace:
> > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > 
> > i.e. the mount task gets hung indefinitely due to the following sequence
> > of events,
> > 
> > 1. Test creates lots of unlinked temp files and then shutsdown the
> >    filesystem.
> > 2. During mount, a transaction started in the context of processing
> >    unlinked inode list causes several iclogs to be filled up. All but
> >    the last one is submitted for I/O.
> > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> >    18532 bytes of free space in the last iclog of the transaction which is
> >    greater than 2*sizeof(xlog_op_header_t). Hence
> >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> >    log cannot be freed yet.
> 
> So all those items are still pinned in memory.
> 
> > 5. A new request for log space (via xfs_log_reserve()) will now wait
> >    indefinitely for on-disk log space to be freed.
> 
> Because nothing has issued a xfs_log_force() for write the iclog to
> disk, unpin the objects that it pins in memory, and allow the tail
> to be moved forwards.
> 
> The xfsaild normally takes care of thisi - it gets pushed byt the
> log reserve when there's not enough space to in the log for the
> transaction before transaction reserve goes to sleep in
> xlog_grant_head_wait(). The AIL pushing code is then responsible for
> making sure log space is eventually freed. It will issue log forces
> if it isn't making progress and so this problem shouldn't occur.
> 
> So, why has it occurred?
> 
> The xfsaild kthread should be running at this point, so if it was
> pushed it should be trying to empty the journal to move the tail
> forward. Why hasn't it issue a log force?
> 
> 
> > To fix this issue, before waiting for log space to be freed, this commit
> > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > free up the on-disk log space.
> 
> hmmm.
> 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 00e9f5c388d3..dc785a6b9f47 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> >  	int			need_bytes) __releases(&head->lock)
> >  					    __acquires(&head->lock)
> >  {
> > +	struct xlog_in_core	*iclog;
> > +
> >  	list_add_tail(&tic->t_queue, &head->waiters);
> >  
> >  	do {
> >  		if (XLOG_FORCED_SHUTDOWN(log))
> >  			goto shutdown;
> > +
> > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> 
> This is indicative of the situation. If the AIL is empty, and the
> log does not have room for an entire transaction reservation, then
> we need to be issuing synchronous transactions in recovery until
> such time the AIL pushing can actually function correctly to
> guarantee forwards progress for async transaction processing.

Yes, In the case of this bug, the AIL list was empty.

> 
> sync transactions call xfs_log_force(XFS_LOG_SYNC) immediately after
> writing the commit record, so this whole problem goes away.
> 
> perhaps in __xfs_trans_commit() we need somethign like this:
> 
>  	/*
>  	 * If the transaction needs to be synchronous, then force the
>  	 * log out now and wait for it.
> +	 *
> +	 * If we are in recovery and the AIL is empty, the log may only
> +	 * have enough room for a single transaction and the AIL
> +	 * cannot push the tail forwards. Hence while the AIL is
> +	 * empty and we are in recovery, do synchronous transactions
> +	 * to ensure the commit hits the journal and move into the
> +	 * AIL. Once there are items in the AIL, it can move the
> +	 * tail of the log forwards itself.
>  	 */
> -	if (sync) {
> +	if (sync || 
> +	    ((mp->m_flags & XFS_MOUNT_RECOVERY) && !xfs_ail_min(mp->m_ail)) {
>  		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
>  		XFS_STATS_INC(mp, xs_trans_sync);
> 
> The other option is that this can be put into the
> xfs_trans_reserve() code to set the XFS_TRANS_SYNC flag to trigger
> the log force in __xfs_trans_commit().

I will implement the fixes, test and post the patch once again. Thanks for the
review comments.

> 
> > +			spin_lock(&log->l_icloglock);
> > +			iclog = log->l_iclog;
> > +
> > +			if (iclog->ic_state == XLOG_STATE_ACTIVE
> > +				&& iclog->ic_offset) {
> > +				atomic_inc(&iclog->ic_refcnt);
> > +				xlog_state_want_sync(log, iclog);
> > +				spin_unlock(&log->l_icloglock);
> > +				xlog_state_release_iclog(log, iclog);
> > +
> > +				spin_lock(&log->l_icloglock);
> > +				xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
> 
> As I suspected, that's just an open coded, cut down log force. And
> it has a bug in it - you can't drop the l_icloglock, then pick it
> back up and sleep immediately without redoing all the state checks
> again. The log IO may have completed and run the wakeups before you
> get the spinlock back, and so this will simply hang forever here.
> 
> So, yeah, I think getting log forces into the right places are
> the right way to go here...
> 
> Cheers,
> 
> Dave.
> 


-- 
chandan




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-21 16:42 ` Brian Foster
@ 2019-08-22 12:46   ` Chandan Rajendra
  0 siblings, 0 replies; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-22 12:46 UTC (permalink / raw)
  To: Brian Foster; +Cc: Chandan Rajendra, linux-xfs, darrick.wong, hch, david

On Wednesday, August 21, 2019 10:12 PM Brian Foster wrote:
> On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > The following call trace is seen when executing generic/530 on a ppc64le
> > machine,
> > 
> 
> Could you provide xfs_info of the scratch device that reproduces this
> problem? I'm curious because I don't recall seeing this, but I'm also
> not sure I've run those iunlink tests on ppc64...

# xfs_info /dev/loop1
meta-data=/dev/loop1             isize=512    agcount=4, agsize=159648 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1
data     =                       bsize=4096   blocks=638592, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=3693, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

> 
> > INFO: task mount:7722 blocked for more than 122 seconds.
> >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > mount           D 8448  7722   7490 0x00040008
> > Call Trace:
> > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > 
> > i.e. the mount task gets hung indefinitely due to the following sequence
> > of events,
> > 
> > 1. Test creates lots of unlinked temp files and then shutsdown the
> >    filesystem.
> > 2. During mount, a transaction started in the context of processing
> >    unlinked inode list causes several iclogs to be filled up. All but
> >    the last one is submitted for I/O.
> 
> So the inactivation transaction commits should basically start to
> populate the CIL. At some point we cross a threshold and a commit
> triggers a background CIL push (see xlog_cil_push_background()). I'm
> assuming that occurs somewhere in here because otherwise we wouldn't
> have started filling iclogs..
> 
> > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> >    18532 bytes of free space in the last iclog of the transaction which is
> >    greater than 2*sizeof(xlog_op_header_t). Hence
> >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> 
> Ok, so we've sent a bunch of iclogs to disk and the commit record for
> the checkpoint ends up in the current iclog log, which remains active on
> return from xlog_commit_record() -> xlog_write() etc.
> 
> > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> >    log cannot be freed yet.
> 
> Ok, so that final (still active) iclog holding the commit record is also
> holding the ctx.
> 
> > 5. A new request for log space (via xfs_log_reserve()) will now wait
> >    indefinitely for on-disk log space to be freed.
> > 
> > To fix this issue, before waiting for log space to be freed, this commit
> > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > free up the on-disk log space.
> > 
> 
> Interesting, so I think I follow the issue at least and what the code is
> trying to do to fix it. I'm not totally clear on what the ideal approach
> is as this code is complex and I'd need to think about it some more. My
> first thought was perhaps whether we needed a checkpoint size limit or
> to change the background push threshold or some such, but the more I
> think about that the less I think that fixes anything.
> 
> A follow up thought is that it seems somewhat unfortunate that we can
> build up and hold so much reservation on essentially a single commit
> record and then just let it sit around in-core waiting for somebody else
> to come along and ask for reservation for something unrelated. I think a
> background log force should eventually come around and push that last
> record out, so we're not in that state indefinitely, but IIUC that's a
> lot of log consumption essentially unaccounted for until the associated
> items land in the AIL.
> 
> I'm wondering if we should have some measure (if we don't already) of
> how much reservation is tied into a single iclog and use that as an
> additional metric to determine when to switch iclogs a bit more
> aggressively (as opposed to only based on how much space is physically
> left in the iclog buffer). For example, if we end a huge checkpoint with
> a commit record in an active iclog that has a cil context attached
> associated with a hundreds of MBs (handwaving) of log data, perhaps we
> should just switch out of that iclog before the CIL push returns so that
> consumption can be immediately reflected by the AIL.
> 
> Anyways, perhaps there are other thoughts/ideas on this..
> 
> Brian
> 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 00e9f5c388d3..dc785a6b9f47 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> >  	int			need_bytes) __releases(&head->lock)
> >  					    __acquires(&head->lock)
> >  {
> > +	struct xlog_in_core	*iclog;
> > +
> >  	list_add_tail(&tic->t_queue, &head->waiters);
> >  
> >  	do {
> >  		if (XLOG_FORCED_SHUTDOWN(log))
> >  			goto shutdown;
> > +
> > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > +			spin_lock(&log->l_icloglock);
> > +			iclog = log->l_iclog;
> > +
> > +			if (iclog->ic_state == XLOG_STATE_ACTIVE
> > +				&& iclog->ic_offset) {
> > +				atomic_inc(&iclog->ic_refcnt);
> > +				xlog_state_want_sync(log, iclog);
> > +				spin_unlock(&log->l_icloglock);
> > +				xlog_state_release_iclog(log, iclog);
> > +
> > +				spin_lock(&log->l_icloglock);
> > +				xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
> > +			} else {
> > +				spin_unlock(&log->l_icloglock);
> > +			}
> > +		}
> > +
> >  		xlog_grant_push_ail(log, need_bytes);
> >  
> >  		__set_current_state(TASK_UNINTERRUPTIBLE);
> 


-- 
chandan




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-22 12:40   ` Chandan Rajendra
@ 2019-08-22 13:19     ` Chandan Rajendra
  0 siblings, 0 replies; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-22 13:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, darrick.wong, hch

On Thursday, August 22, 2019 6:10 PM Chandan Rajendra wrote:
> On Thursday, August 22, 2019 3:48 AM Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > The following call trace is seen when executing generic/530 on a ppc64le
> > > machine,
> > > 
> > > INFO: task mount:7722 blocked for more than 122 seconds.
> > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > 
> > can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > IO in -rc1 that could result in things going wrong...
> 
> Yes, I was able to recreate this bug on v5.3-rc5.
> 
> > 
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > mount           D 8448  7722   7490 0x00040008
> > > Call Trace:
> > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > 
> > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > of events,
> > > 
> > > 1. Test creates lots of unlinked temp files and then shutsdown the
> > >    filesystem.
> > > 2. During mount, a transaction started in the context of processing
> > >    unlinked inode list causes several iclogs to be filled up. All but
> > >    the last one is submitted for I/O.
> > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > >    18532 bytes of free space in the last iclog of the transaction which is
> > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > >    log cannot be freed yet.
> > 
> > So all those items are still pinned in memory.
> > 
> > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > >    indefinitely for on-disk log space to be freed.
> > 
> > Because nothing has issued a xfs_log_force() for write the iclog to
> > disk, unpin the objects that it pins in memory, and allow the tail
> > to be moved forwards.
> > 
> > The xfsaild normally takes care of thisi - it gets pushed byt the
> > log reserve when there's not enough space to in the log for the
> > transaction before transaction reserve goes to sleep in
> > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > making sure log space is eventually freed. It will issue log forces
> > if it isn't making progress and so this problem shouldn't occur.
> > 
> > So, why has it occurred?
> > 
> > The xfsaild kthread should be running at this point, so if it was
> > pushed it should be trying to empty the journal to move the tail
> > forward. Why hasn't it issue a log force?
> > 
> > 
> > > To fix this issue, before waiting for log space to be freed, this commit
> > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > free up the on-disk log space.
> > 
> > hmmm.
> > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > >  1 file changed, 21 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > >  	int			need_bytes) __releases(&head->lock)
> > >  					    __acquires(&head->lock)
> > >  {
> > > +	struct xlog_in_core	*iclog;
> > > +
> > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > >  
> > >  	do {
> > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > >  			goto shutdown;
> > > +
> > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > 
> > This is indicative of the situation. If the AIL is empty, and the
> > log does not have room for an entire transaction reservation, then
> > we need to be issuing synchronous transactions in recovery until
> > such time the AIL pushing can actually function correctly to
> > guarantee forwards progress for async transaction processing.
> 
> Yes, In the case of this bug, the AIL list was empty.
> 
> > 
> > sync transactions call xfs_log_force(XFS_LOG_SYNC) immediately after
> > writing the commit record, so this whole problem goes away.
> > 
> > perhaps in __xfs_trans_commit() we need somethign like this:
> > 
> >  	/*
> >  	 * If the transaction needs to be synchronous, then force the
> >  	 * log out now and wait for it.
> > +	 *
> > +	 * If we are in recovery and the AIL is empty, the log may only
> > +	 * have enough room for a single transaction and the AIL
> > +	 * cannot push the tail forwards. Hence while the AIL is
> > +	 * empty and we are in recovery, do synchronous transactions
> > +	 * to ensure the commit hits the journal and move into the
> > +	 * AIL. Once there are items in the AIL, it can move the
> > +	 * tail of the log forwards itself.
> >  	 */
> > -	if (sync) {
> > +	if (sync || 
> > +	    ((mp->m_flags & XFS_MOUNT_RECOVERY) && !xfs_ail_min(mp->m_ail)) {
> >  		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
> >  		XFS_STATS_INC(mp, xs_trans_sync);
> > 
> > The other option is that this can be put into the
> > xfs_trans_reserve() code to set the XFS_TRANS_SYNC flag to trigger
> > the log force in __xfs_trans_commit().

Most likely, you meant (xlog->l_flags & XLOG_RECOVERY_NEEDED) instead of
(mp->m_flags & XFS_MOUNT_RECOVERY) in the above snippet.

-- 
chandan




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-21 22:18 ` Dave Chinner
  2019-08-22 12:40   ` Chandan Rajendra
@ 2019-08-22 16:34   ` Brian Foster
  2019-08-23  0:06     ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Brian Foster @ 2019-08-22 16:34 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, chandan, darrick.wong, hch

[-- Attachment #1: Type: text/plain, Size: 10958 bytes --]

On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > The following call trace is seen when executing generic/530 on a ppc64le
> > machine,
> > 
> > INFO: task mount:7722 blocked for more than 122 seconds.
> >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> 
> can you reproduce this on 5.3-rc5? There were bugs in log recovery
> IO in -rc1 that could result in things going wrong...
> 
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > mount           D 8448  7722   7490 0x00040008
> > Call Trace:
> > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > 
> > i.e. the mount task gets hung indefinitely due to the following sequence
> > of events,
> > 
> > 1. Test creates lots of unlinked temp files and then shutsdown the
> >    filesystem.
> > 2. During mount, a transaction started in the context of processing
> >    unlinked inode list causes several iclogs to be filled up. All but
> >    the last one is submitted for I/O.
> > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> >    18532 bytes of free space in the last iclog of the transaction which is
> >    greater than 2*sizeof(xlog_op_header_t). Hence
> >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> >    log cannot be freed yet.
> 
> So all those items are still pinned in memory.
> 
> > 5. A new request for log space (via xfs_log_reserve()) will now wait
> >    indefinitely for on-disk log space to be freed.
> 
> Because nothing has issued a xfs_log_force() for write the iclog to
> disk, unpin the objects that it pins in memory, and allow the tail
> to be moved forwards.
> 
> The xfsaild normally takes care of thisi - it gets pushed byt the
> log reserve when there's not enough space to in the log for the
> transaction before transaction reserve goes to sleep in
> xlog_grant_head_wait(). The AIL pushing code is then responsible for
> making sure log space is eventually freed. It will issue log forces
> if it isn't making progress and so this problem shouldn't occur.
> 
> So, why has it occurred?
> 
> The xfsaild kthread should be running at this point, so if it was
> pushed it should be trying to empty the journal to move the tail
> forward. Why hasn't it issue a log force?
> 
> 
> > To fix this issue, before waiting for log space to be freed, this commit
> > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > free up the on-disk log space.
> 
> hmmm.
> 
> > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > ---
> >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> >  1 file changed, 21 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 00e9f5c388d3..dc785a6b9f47 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> >  	int			need_bytes) __releases(&head->lock)
> >  					    __acquires(&head->lock)
> >  {
> > +	struct xlog_in_core	*iclog;
> > +
> >  	list_add_tail(&tic->t_queue, &head->waiters);
> >  
> >  	do {
> >  		if (XLOG_FORCED_SHUTDOWN(log))
> >  			goto shutdown;
> > +
> > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> 
> This is indicative of the situation. If the AIL is empty, and the
> log does not have room for an entire transaction reservation, then
> we need to be issuing synchronous transactions in recovery until
> such time the AIL pushing can actually function correctly to
> guarantee forwards progress for async transaction processing.
> 

Hmm, I don't think that addresses the fundamental problem. This
phenomenon doesn't require log recovery. The same scenario can present
itself after a clean mount or from an idle fs. I think the scenario that
plays out here, at a high level, is as follows:

- Heavy transaction workload commences. This continuously acquires log
  reservation and transfers it to the CIL as transactions commit.
- The CIL context grows until we cross the background threshold, at
  which point we schedule a background push.
- Background CIL push cycles the current context into the log via the
  iclog buffers. The commit record stays around in-core because the last
  iclog used for the CIL checkpoint isn't full. Hence, none of the
  associated log items make it into the AIL and the background CIL push
  had no effect with respect to freeing log reservation.
- The same transaction workload is still running and filling up the next
  CIL context. If we run out of log reservation before a second
  background CIL push comes along, we're basically stuck waiting on
  somebody to force the log.

IOW, the second background CIL push starts writing to the iclog where
the first left off, so that iclog I/O completion is where we end up
doing AIL insertion from the first (background CIL push) checkpoint.
Once the AIL is populated, I _think_ the workload stabilizes such that
AIL pushing can be used to free up space, etc., but I'm not totally sure
of that yet.

The things that prevent this at normal runtime are timely execution of
background CIL pushes and the background log worker. If for some reason
the background CIL push is not timely enough that we consume all log
reservation before two background CIL pushes occur from the time the
racing workload begins (i.e. starting from an idle system such that the
AIL is empty), then we're stuck waiting on the background log worker to
force the log from the first background CIL push, populate the AIL and
get things moving again.

IOW, the same essential problem is reproducible outside of log recovery
in the form of stalls as opposed to deadlocks via an artificial
background CIL push delay (i.e., think workqueue or xc_cil_ctx lock
starvation) and an elevated xfssyncd_centisecs. We aren't stuck forever
because the background log worker will run eventually, but it could
certainly be a dead stall of minutes before that occurs. I think this
could still be addressed at transaction commit or reservation time, but
I think the logic needs to be more generic and based on log reservation
pressure rather than the context from which this particular test happens
to reproduce.

Chandan,

If this is all on the right track, I'm still curious if/how you're
getting into a situation where all log reservation is held up in the CIL
before a couple background pushes occur. The background push threshold
is still fairly small relative to the size of the log. Could you apply
the attached patch to hack in some tracepoints to dump some CIL and
iclog state and 'trace-cmd record -e printk' a generic/530 run that
reproduces?

Brian

> sync transactions call xfs_log_force(XFS_LOG_SYNC) immediately after
> writing the commit record, so this whole problem goes away.
> 
> perhaps in __xfs_trans_commit() we need somethign like this:
> 
>  	/*
>  	 * If the transaction needs to be synchronous, then force the
>  	 * log out now and wait for it.
> +	 *
> +	 * If we are in recovery and the AIL is empty, the log may only
> +	 * have enough room for a single transaction and the AIL
> +	 * cannot push the tail forwards. Hence while the AIL is
> +	 * empty and we are in recovery, do synchronous transactions
> +	 * to ensure the commit hits the journal and move into the
> +	 * AIL. Once there are items in the AIL, it can move the
> +	 * tail of the log forwards itself.
>  	 */
> -	if (sync) {
> +	if (sync || 
> +	    ((mp->m_flags & XFS_MOUNT_RECOVERY) && !xfs_ail_min(mp->m_ail)) {
>  		error = xfs_log_force_lsn(mp, commit_lsn, XFS_LOG_SYNC, NULL);
>  		XFS_STATS_INC(mp, xs_trans_sync);
> 
> The other option is that this can be put into the
> xfs_trans_reserve() code to set the XFS_TRANS_SYNC flag to trigger
> the log force in __xfs_trans_commit().
> 
> > +			spin_lock(&log->l_icloglock);
> > +			iclog = log->l_iclog;
> > +
> > +			if (iclog->ic_state == XLOG_STATE_ACTIVE
> > +				&& iclog->ic_offset) {
> > +				atomic_inc(&iclog->ic_refcnt);
> > +				xlog_state_want_sync(log, iclog);
> > +				spin_unlock(&log->l_icloglock);
> > +				xlog_state_release_iclog(log, iclog);
> > +
> > +				spin_lock(&log->l_icloglock);
> > +				xlog_wait(&iclog->ic_force_wait, &log->l_icloglock);
> 
> As I suspected, that's just an open coded, cut down log force. And
> it has a bug in it - you can't drop the l_icloglock, then pick it
> back up and sleep immediately without redoing all the state checks
> again. The log IO may have completed and run the wakeups before you
> get the spinlock back, and so this will simply hang forever here.
> 
> So, yeah, I think getting log forces into the right places are
> the right way to go here...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

[-- Attachment #2: cil-tp.diff --]
[-- Type: text/plain, Size: 1611 bytes --]

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 7fc3c1ad36bc..737f66b5f231 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3106,6 +3106,8 @@ xlog_state_release_iclog(
 	ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE ||
 	       iclog->ic_state == XLOG_STATE_WANT_SYNC);
 
+	trace_printk("%d: iclog %p ic_state 0x%x \n", __LINE__, iclog, iclog->ic_state);
+
 	if (iclog->ic_state == XLOG_STATE_WANT_SYNC) {
 		/* update tail before writing to iclog */
 		xfs_lsn_t tail_lsn = xlog_assign_tail_lsn(log->l_mp);
diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index fa5602d0fd7f..7145ba6c9c10 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -803,6 +803,7 @@ xlog_cil_push(
 	lvhdr.lv_iovecp = &lhdr;
 	lvhdr.lv_next = ctx->lv_chain;
 
+	trace_printk("%d: ctx->space_used %d\n", __LINE__, ctx->space_used);
 	error = xlog_write(log, &lvhdr, tic, &ctx->start_lsn, NULL, 0);
 	if (error)
 		goto out_abort_free_ticket;
@@ -845,6 +846,7 @@ xlog_cil_push(
 	commit_lsn = xfs_log_done(log->l_mp, tic, &commit_iclog, false);
 	if (commit_lsn == -1)
 		goto out_abort;
+	trace_printk("%d: commit_iclog %p ic_state 0x%x\n", __LINE__, commit_iclog, commit_iclog->ic_state);
 
 	spin_lock(&commit_iclog->ic_callback_lock);
 	if (commit_iclog->ic_state & XLOG_STATE_IOERROR) {
@@ -919,6 +921,7 @@ xlog_cil_push_background(
 
 	spin_lock(&cil->xc_push_lock);
 	if (cil->xc_push_seq < cil->xc_current_sequence) {
+		trace_printk("%d: xlog_cil_push_background()\n", __LINE__);
 		cil->xc_push_seq = cil->xc_current_sequence;
 		queue_work(log->l_mp->m_cil_workqueue, &cil->xc_push_work);
 	}

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-22 16:34   ` Brian Foster
@ 2019-08-23  0:06     ` Dave Chinner
  2019-08-23 13:08       ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2019-08-23  0:06 UTC (permalink / raw)
  To: Brian Foster; +Cc: Chandan Rajendra, linux-xfs, chandan, darrick.wong, hch

On Thu, Aug 22, 2019 at 12:34:46PM -0400, Brian Foster wrote:
> On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > The following call trace is seen when executing generic/530 on a ppc64le
> > > machine,
> > > 
> > > INFO: task mount:7722 blocked for more than 122 seconds.
> > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > 
> > can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > IO in -rc1 that could result in things going wrong...
> > 
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > mount           D 8448  7722   7490 0x00040008
> > > Call Trace:
> > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > 
> > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > of events,
> > > 
> > > 1. Test creates lots of unlinked temp files and then shutsdown the
> > >    filesystem.
> > > 2. During mount, a transaction started in the context of processing
> > >    unlinked inode list causes several iclogs to be filled up. All but
> > >    the last one is submitted for I/O.
> > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > >    18532 bytes of free space in the last iclog of the transaction which is
> > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > >    log cannot be freed yet.
> > 
> > So all those items are still pinned in memory.
> > 
> > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > >    indefinitely for on-disk log space to be freed.
> > 
> > Because nothing has issued a xfs_log_force() for write the iclog to
> > disk, unpin the objects that it pins in memory, and allow the tail
> > to be moved forwards.
> > 
> > The xfsaild normally takes care of thisi - it gets pushed byt the
> > log reserve when there's not enough space to in the log for the
> > transaction before transaction reserve goes to sleep in
> > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > making sure log space is eventually freed. It will issue log forces
> > if it isn't making progress and so this problem shouldn't occur.
> > 
> > So, why has it occurred?
> > 
> > The xfsaild kthread should be running at this point, so if it was
> > pushed it should be trying to empty the journal to move the tail
> > forward. Why hasn't it issue a log force?
> > 
> > 
> > > To fix this issue, before waiting for log space to be freed, this commit
> > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > free up the on-disk log space.
> > 
> > hmmm.
> > 
> > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > ---
> > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > >  1 file changed, 21 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > >  	int			need_bytes) __releases(&head->lock)
> > >  					    __acquires(&head->lock)
> > >  {
> > > +	struct xlog_in_core	*iclog;
> > > +
> > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > >  
> > >  	do {
> > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > >  			goto shutdown;
> > > +
> > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > 
> > This is indicative of the situation. If the AIL is empty, and the
> > log does not have room for an entire transaction reservation, then
> > we need to be issuing synchronous transactions in recovery until
> > such time the AIL pushing can actually function correctly to
> > guarantee forwards progress for async transaction processing.
> > 
> 
> Hmm, I don't think that addresses the fundamental problem. This
> phenomenon doesn't require log recovery. The same scenario can present
> itself after a clean mount or from an idle fs. I think the scenario that
> plays out here, at a high level, is as follows:
> 

  - mount
  - [log recovery]
  - xfs_log_mount_finish
    - calls xfs_log_work_queue()

> - Heavy transaction workload commences. This continuously acquires log
>   reservation and transfers it to the CIL as transactions commit.
> - The CIL context grows until we cross the background threshold, at
>   which point we schedule a background push.
> - Background CIL push cycles the current context into the log via the
>   iclog buffers. The commit record stays around in-core because the last
>   iclog used for the CIL checkpoint isn't full. Hence, none of the
>   associated log items make it into the AIL and the background CIL push
>   had no effect with respect to freeing log reservation.
> - The same transaction workload is still running and filling up the next
>   CIL context. If we run out of log reservation before a second
>   background CIL push comes along, we're basically stuck waiting on
>   somebody to force the log.

- every 30s xfs_log_worker() runs, sees the log dirty, triggers
  a log force. pending commit is flushed to log, dirty objects get
  moved to AIL, then xfs_log_worker() pushes on the AIL to do
  periodic background metadata writeback.

> The things that prevent this at normal runtime are timely execution of
> background CIL pushes and the background log worker. If for some reason
> the background CIL push is not timely enough that we consume all log
> reservation before two background CIL pushes occur from the time the
> racing workload begins (i.e. starting from an idle system such that the
> AIL is empty), then we're stuck waiting on the background log worker to
> force the log from the first background CIL push, populate the AIL and
> get things moving again.

Right, this does not deadlock - it might pause for a short while
while waiting for the log worker to run and issue a log force. I
have never actually seen it happen in all my years of "mkfs; mount;
fsmark" testing that places a /massive/ metadata modification
workload on a pristine, newly mounted filesystems....

As it is, we've always used the log worker as a watchdog in this
way. The fact is that we have very few situations left in the code
where it needs to act as a watchdog - delayed logging actually
negated the vast majority of problems that required the periodic log
force to get out of trouble because individual transactions no
longer needed to wait on iclog space to make progress...

> IOW, the same essential problem is reproducible outside of log
> recovery in the form of stalls as opposed to deadlocks via an
> artificial background CIL push delay (i.e., think workqueue or
> xc_cil_ctx lock starvation) and an elevated xfssyncd_centisecs.
> We aren't stuck forever because the background log worker will run
> eventually, but it could certainly be a dead stall of minutes
> before that occurs.

I don't think addressing it in xlog_grant_head_wait() fixes the
problem fully, either.  If no other transaction comes in, then the
ones already blocked (because the AIL was not empty when they tried
to reserve space) will end up still blocked because nothing has
kicked the code in the transaction reservation code. So putting the
log force into the grant head wait code is not sufficient by itself.

> I think this
> could still be addressed at transaction commit or reservation time, but
> I think the logic needs to be more generic and based on log reservation
> pressure rather than the context from which this particular test happens
> to reproduce.

Log reservation pressure is what xlog_grant_push_ail() detects and
that pressure is transferred to the AIL pushing code to clean dirty
log items and move the tail of the log forward. It's right there
where Chandan added the log force. :)

IOWs, xlog_grant_push_ail() tells the xfsaild how much log space to
make available. If the log is full, then xlog_grant_push_ail() will
already be telling the AIL to push and will be waking it up.
However, the aild will see the AIL empty and go right back to sleep.
That's likely the runtime problem here - the mechanism that pushes
the log tail forwards is not realising that the log needs pushing
via a log force.

IOWs, I suspect the xfsaild is the right place to take the action,
because AIL pushing is triggered by much more than just log
reservations. It gets kicked by memory reclaim, the log worker, new
transactions, etc and so if a transaction doesn't kick it when it
gest stuck like this, something else will.

> If this is all on the right track, I'm still curious if/how you're
> getting into a situation where all log reservation is held up in the CIL
> before a couple background pushes occur.

I'd guess that it could be reproduced via a single CPU machine and
non-preempt kernel. We've already replayed all the unlink
transactions, so the buffer/inode caches are fully primed. If all
unlink removal transactions the buffer/inode cache, then it won't
block anywhere and will never yield the CPU. Hence the CIL push
kworker thread doesn't get to run before the unlinks run the log out
of space.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-23  0:06     ` Dave Chinner
@ 2019-08-23 13:08       ` Brian Foster
  2019-08-23 13:38         ` Chandan Rajendra
  0 siblings, 1 reply; 25+ messages in thread
From: Brian Foster @ 2019-08-23 13:08 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Chandan Rajendra, linux-xfs, chandan, darrick.wong, hch

On Fri, Aug 23, 2019 at 10:06:36AM +1000, Dave Chinner wrote:
> On Thu, Aug 22, 2019 at 12:34:46PM -0400, Brian Foster wrote:
> > On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> > > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > > The following call trace is seen when executing generic/530 on a ppc64le
> > > > machine,
> > > > 
> > > > INFO: task mount:7722 blocked for more than 122 seconds.
> > > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > > 
> > > can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > > IO in -rc1 that could result in things going wrong...
> > > 
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > mount           D 8448  7722   7490 0x00040008
> > > > Call Trace:
> > > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > > 
> > > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > > of events,
> > > > 
> > > > 1. Test creates lots of unlinked temp files and then shutsdown the
> > > >    filesystem.
> > > > 2. During mount, a transaction started in the context of processing
> > > >    unlinked inode list causes several iclogs to be filled up. All but
> > > >    the last one is submitted for I/O.
> > > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > > >    18532 bytes of free space in the last iclog of the transaction which is
> > > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> > > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > > >    log cannot be freed yet.
> > > 
> > > So all those items are still pinned in memory.
> > > 
> > > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > > >    indefinitely for on-disk log space to be freed.
> > > 
> > > Because nothing has issued a xfs_log_force() for write the iclog to
> > > disk, unpin the objects that it pins in memory, and allow the tail
> > > to be moved forwards.
> > > 
> > > The xfsaild normally takes care of thisi - it gets pushed byt the
> > > log reserve when there's not enough space to in the log for the
> > > transaction before transaction reserve goes to sleep in
> > > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > > making sure log space is eventually freed. It will issue log forces
> > > if it isn't making progress and so this problem shouldn't occur.
> > > 
> > > So, why has it occurred?
> > > 
> > > The xfsaild kthread should be running at this point, so if it was
> > > pushed it should be trying to empty the journal to move the tail
> > > forward. Why hasn't it issue a log force?
> > > 
> > > 
> > > > To fix this issue, before waiting for log space to be freed, this commit
> > > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > > free up the on-disk log space.
> > > 
> > > hmmm.
> > > 
> > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > ---
> > > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > > >  1 file changed, 21 insertions(+)
> > > > 
> > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > > --- a/fs/xfs/xfs_log.c
> > > > +++ b/fs/xfs/xfs_log.c
> > > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > > >  	int			need_bytes) __releases(&head->lock)
> > > >  					    __acquires(&head->lock)
> > > >  {
> > > > +	struct xlog_in_core	*iclog;
> > > > +
> > > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > > >  
> > > >  	do {
> > > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > > >  			goto shutdown;
> > > > +
> > > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > > 
> > > This is indicative of the situation. If the AIL is empty, and the
> > > log does not have room for an entire transaction reservation, then
> > > we need to be issuing synchronous transactions in recovery until
> > > such time the AIL pushing can actually function correctly to
> > > guarantee forwards progress for async transaction processing.
> > > 
> > 
> > Hmm, I don't think that addresses the fundamental problem. This
> > phenomenon doesn't require log recovery. The same scenario can present
> > itself after a clean mount or from an idle fs. I think the scenario that
> > plays out here, at a high level, is as follows:
> > 
> 
>   - mount
>   - [log recovery]
>   - xfs_log_mount_finish
>     - calls xfs_log_work_queue()
> 
> > - Heavy transaction workload commences. This continuously acquires log
> >   reservation and transfers it to the CIL as transactions commit.
> > - The CIL context grows until we cross the background threshold, at
> >   which point we schedule a background push.
> > - Background CIL push cycles the current context into the log via the
> >   iclog buffers. The commit record stays around in-core because the last
> >   iclog used for the CIL checkpoint isn't full. Hence, none of the
> >   associated log items make it into the AIL and the background CIL push
> >   had no effect with respect to freeing log reservation.
> > - The same transaction workload is still running and filling up the next
> >   CIL context. If we run out of log reservation before a second
> >   background CIL push comes along, we're basically stuck waiting on
> >   somebody to force the log.
> 
> - every 30s xfs_log_worker() runs, sees the log dirty, triggers
>   a log force. pending commit is flushed to log, dirty objects get
>   moved to AIL, then xfs_log_worker() pushes on the AIL to do
>   periodic background metadata writeback.
> 
> > The things that prevent this at normal runtime are timely execution of
> > background CIL pushes and the background log worker. If for some reason
> > the background CIL push is not timely enough that we consume all log
> > reservation before two background CIL pushes occur from the time the
> > racing workload begins (i.e. starting from an idle system such that the
> > AIL is empty), then we're stuck waiting on the background log worker to
> > force the log from the first background CIL push, populate the AIL and
> > get things moving again.
> 
> Right, this does not deadlock - it might pause for a short while
> while waiting for the log worker to run and issue a log force. I
> have never actually seen it happen in all my years of "mkfs; mount;
> fsmark" testing that places a /massive/ metadata modification
> workload on a pristine, newly mounted filesystems....
> 

Neither have I. That's why I'm asking for more details on how this
triggers. :) Whatever factors contribute to this particular instance may
or may not apply outside of log recovery context.

> As it is, we've always used the log worker as a watchdog in this
> way. The fact is that we have very few situations left in the code
> where it needs to act as a watchdog - delayed logging actually
> negated the vast majority of problems that required the periodic log
> force to get out of trouble because individual transactions no
> longer needed to wait on iclog space to make progress...
> 

Yes, the description I wrote up already explains that the background log
worker changes the symptom from a deadlock to a stall. The defaults
allow for a stall of up to 30s, but custom configuration allows for a
stall of up to 2 hours.

> > IOW, the same essential problem is reproducible outside of log
> > recovery in the form of stalls as opposed to deadlocks via an
> > artificial background CIL push delay (i.e., think workqueue or
> > xc_cil_ctx lock starvation) and an elevated xfssyncd_centisecs.
> > We aren't stuck forever because the background log worker will run
> > eventually, but it could certainly be a dead stall of minutes
> > before that occurs.
> 
> I don't think addressing it in xlog_grant_head_wait() fixes the
> problem fully, either.  If no other transaction comes in, then the
> ones already blocked (because the AIL was not empty when they tried
> to reserve space) will end up still blocked because nothing has
> kicked the code in the transaction reservation code. So putting the
> log force into the grant head wait code is not sufficient by itself.
> 
> > I think this
> > could still be addressed at transaction commit or reservation time, but
> > I think the logic needs to be more generic and based on log reservation
> > pressure rather than the context from which this particular test happens
> > to reproduce.
> 
> Log reservation pressure is what xlog_grant_push_ail() detects and
> that pressure is transferred to the AIL pushing code to clean dirty
> log items and move the tail of the log forward. It's right there
> where Chandan added the log force. :)
> 
> IOWs, xlog_grant_push_ail() tells the xfsaild how much log space to
> make available. If the log is full, then xlog_grant_push_ail() will
> already be telling the AIL to push and will be waking it up.
> However, the aild will see the AIL empty and go right back to sleep.
> That's likely the runtime problem here - the mechanism that pushes
> the log tail forwards is not realising that the log needs pushing
> via a log force.
> 
> IOWs, I suspect the xfsaild is the right place to take the action,
> because AIL pushing is triggered by much more than just log
> reservations. It gets kicked by memory reclaim, the log worker, new
> transactions, etc and so if a transaction doesn't kick it when it
> gest stuck like this, something else will.
> 

Perhaps, I could see that as an option too. TBH, it's not clear to me
where the best place to fix this is. I want to know more about the
reproducer first. My comment above was just that I'm not convinced the
previously proposed logic to force based on log recovery context is
correct.

> > If this is all on the right track, I'm still curious if/how you're
> > getting into a situation where all log reservation is held up in the CIL
> > before a couple background pushes occur.
> 
> I'd guess that it could be reproduced via a single CPU machine and
> non-preempt kernel. We've already replayed all the unlink
> transactions, so the buffer/inode caches are fully primed. If all
> unlink removal transactions the buffer/inode cache, then it won't
> block anywhere and will never yield the CPU. Hence the CIL push
> kworker thread doesn't get to run before the unlinks run the log out
> of space.
> 

Yeah, I wouldn't expect that for a ppc64 system, but I suppose this
could be a VM or something too. Chandan?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-23 13:08       ` Brian Foster
@ 2019-08-23 13:38         ` Chandan Rajendra
  2019-08-25 15:05           ` Chandan Rajendra
  0 siblings, 1 reply; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-23 13:38 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Friday, August 23, 2019 6:38 PM Brian Foster wrote:
> On Fri, Aug 23, 2019 at 10:06:36AM +1000, Dave Chinner wrote:
> > On Thu, Aug 22, 2019 at 12:34:46PM -0400, Brian Foster wrote:
> > > On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> > > > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > > > The following call trace is seen when executing generic/530 on a ppc64le
> > > > > machine,
> > > > > 
> > > > > INFO: task mount:7722 blocked for more than 122 seconds.
> > > > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > > > 
> > > > can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > > > IO in -rc1 that could result in things going wrong...
> > > > 
> > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > mount           D 8448  7722   7490 0x00040008
> > > > > Call Trace:
> > > > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > > > 
> > > > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > > > of events,
> > > > > 
> > > > > 1. Test creates lots of unlinked temp files and then shutsdown the
> > > > >    filesystem.
> > > > > 2. During mount, a transaction started in the context of processing
> > > > >    unlinked inode list causes several iclogs to be filled up. All but
> > > > >    the last one is submitted for I/O.
> > > > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > > > >    18532 bytes of free space in the last iclog of the transaction which is
> > > > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > > > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > > > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> > > > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > > > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > > > >    log cannot be freed yet.
> > > > 
> > > > So all those items are still pinned in memory.
> > > > 
> > > > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > > > >    indefinitely for on-disk log space to be freed.
> > > > 
> > > > Because nothing has issued a xfs_log_force() for write the iclog to
> > > > disk, unpin the objects that it pins in memory, and allow the tail
> > > > to be moved forwards.
> > > > 
> > > > The xfsaild normally takes care of thisi - it gets pushed byt the
> > > > log reserve when there's not enough space to in the log for the
> > > > transaction before transaction reserve goes to sleep in
> > > > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > > > making sure log space is eventually freed. It will issue log forces
> > > > if it isn't making progress and so this problem shouldn't occur.
> > > > 
> > > > So, why has it occurred?
> > > > 
> > > > The xfsaild kthread should be running at this point, so if it was
> > > > pushed it should be trying to empty the journal to move the tail
> > > > forward. Why hasn't it issue a log force?
> > > > 
> > > > 
> > > > > To fix this issue, before waiting for log space to be freed, this commit
> > > > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > > > free up the on-disk log space.
> > > > 
> > > > hmmm.
> > > > 
> > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > ---
> > > > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > > > >  1 file changed, 21 insertions(+)
> > > > > 
> > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > > > --- a/fs/xfs/xfs_log.c
> > > > > +++ b/fs/xfs/xfs_log.c
> > > > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > > > >  	int			need_bytes) __releases(&head->lock)
> > > > >  					    __acquires(&head->lock)
> > > > >  {
> > > > > +	struct xlog_in_core	*iclog;
> > > > > +
> > > > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > > > >  
> > > > >  	do {
> > > > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > > > >  			goto shutdown;
> > > > > +
> > > > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > > > 
> > > > This is indicative of the situation. If the AIL is empty, and the
> > > > log does not have room for an entire transaction reservation, then
> > > > we need to be issuing synchronous transactions in recovery until
> > > > such time the AIL pushing can actually function correctly to
> > > > guarantee forwards progress for async transaction processing.
> > > > 
> > > 
> > > Hmm, I don't think that addresses the fundamental problem. This
> > > phenomenon doesn't require log recovery. The same scenario can present
> > > itself after a clean mount or from an idle fs. I think the scenario that
> > > plays out here, at a high level, is as follows:
> > > 
> > 
> >   - mount
> >   - [log recovery]
> >   - xfs_log_mount_finish
> >     - calls xfs_log_work_queue()
> > 
> > > - Heavy transaction workload commences. This continuously acquires log
> > >   reservation and transfers it to the CIL as transactions commit.
> > > - The CIL context grows until we cross the background threshold, at
> > >   which point we schedule a background push.
> > > - Background CIL push cycles the current context into the log via the
> > >   iclog buffers. The commit record stays around in-core because the last
> > >   iclog used for the CIL checkpoint isn't full. Hence, none of the
> > >   associated log items make it into the AIL and the background CIL push
> > >   had no effect with respect to freeing log reservation.
> > > - The same transaction workload is still running and filling up the next
> > >   CIL context. If we run out of log reservation before a second
> > >   background CIL push comes along, we're basically stuck waiting on
> > >   somebody to force the log.
> > 
> > - every 30s xfs_log_worker() runs, sees the log dirty, triggers
> >   a log force. pending commit is flushed to log, dirty objects get
> >   moved to AIL, then xfs_log_worker() pushes on the AIL to do
> >   periodic background metadata writeback.
> > 
> > > The things that prevent this at normal runtime are timely execution of
> > > background CIL pushes and the background log worker. If for some reason
> > > the background CIL push is not timely enough that we consume all log
> > > reservation before two background CIL pushes occur from the time the
> > > racing workload begins (i.e. starting from an idle system such that the
> > > AIL is empty), then we're stuck waiting on the background log worker to
> > > force the log from the first background CIL push, populate the AIL and
> > > get things moving again.
> > 
> > Right, this does not deadlock - it might pause for a short while
> > while waiting for the log worker to run and issue a log force. I
> > have never actually seen it happen in all my years of "mkfs; mount;
> > fsmark" testing that places a /massive/ metadata modification
> > workload on a pristine, newly mounted filesystems....
> > 
> 
> Neither have I. That's why I'm asking for more details on how this
> triggers. :) Whatever factors contribute to this particular instance may
> or may not apply outside of log recovery context.
> 
> > As it is, we've always used the log worker as a watchdog in this
> > way. The fact is that we have very few situations left in the code
> > where it needs to act as a watchdog - delayed logging actually
> > negated the vast majority of problems that required the periodic log
> > force to get out of trouble because individual transactions no
> > longer needed to wait on iclog space to make progress...
> > 
> 
> Yes, the description I wrote up already explains that the background log
> worker changes the symptom from a deadlock to a stall. The defaults
> allow for a stall of up to 30s, but custom configuration allows for a
> stall of up to 2 hours.
> 
> > > IOW, the same essential problem is reproducible outside of log
> > > recovery in the form of stalls as opposed to deadlocks via an
> > > artificial background CIL push delay (i.e., think workqueue or
> > > xc_cil_ctx lock starvation) and an elevated xfssyncd_centisecs.
> > > We aren't stuck forever because the background log worker will run
> > > eventually, but it could certainly be a dead stall of minutes
> > > before that occurs.
> > 
> > I don't think addressing it in xlog_grant_head_wait() fixes the
> > problem fully, either.  If no other transaction comes in, then the
> > ones already blocked (because the AIL was not empty when they tried
> > to reserve space) will end up still blocked because nothing has
> > kicked the code in the transaction reservation code. So putting the
> > log force into the grant head wait code is not sufficient by itself.
> > 
> > > I think this
> > > could still be addressed at transaction commit or reservation time, but
> > > I think the logic needs to be more generic and based on log reservation
> > > pressure rather than the context from which this particular test happens
> > > to reproduce.
> > 
> > Log reservation pressure is what xlog_grant_push_ail() detects and
> > that pressure is transferred to the AIL pushing code to clean dirty
> > log items and move the tail of the log forward. It's right there
> > where Chandan added the log force. :)
> > 
> > IOWs, xlog_grant_push_ail() tells the xfsaild how much log space to
> > make available. If the log is full, then xlog_grant_push_ail() will
> > already be telling the AIL to push and will be waking it up.
> > However, the aild will see the AIL empty and go right back to sleep.
> > That's likely the runtime problem here - the mechanism that pushes
> > the log tail forwards is not realising that the log needs pushing
> > via a log force.
> > 
> > IOWs, I suspect the xfsaild is the right place to take the action,
> > because AIL pushing is triggered by much more than just log
> > reservations. It gets kicked by memory reclaim, the log worker, new
> > transactions, etc and so if a transaction doesn't kick it when it
> > gest stuck like this, something else will.
> > 
> 
> Perhaps, I could see that as an option too. TBH, it's not clear to me
> where the best place to fix this is. I want to know more about the
> reproducer first. My comment above was just that I'm not convinced the
> previously proposed logic to force based on log recovery context is
> correct.
> 
> > > If this is all on the right track, I'm still curious if/how you're
> > > getting into a situation where all log reservation is held up in the CIL
> > > before a couple background pushes occur.
> > 
> > I'd guess that it could be reproduced via a single CPU machine and
> > non-preempt kernel. We've already replayed all the unlink
> > transactions, so the buffer/inode caches are fully primed. If all
> > unlink removal transactions the buffer/inode cache, then it won't
> > block anywhere and will never yield the CPU. Hence the CIL push
> > kworker thread doesn't get to run before the unlinks run the log out
> > of space.
> > 
> 
> Yeah, I wouldn't expect that for a ppc64 system, but I suppose this
> could be a VM or something too. Chandan?
>

Indeed, This is a KVM guest on a ppc64le machine. The guest has 8 vcpus
allocated to it. Also, Linux on ppc64le is by default configured to be
non-preemptable.

Btw, the issue is still re-creatable after adding,

+               if ((log->l_flags & XLOG_RECOVERY_NEEDED)
+                       && !xfs_ail_min(tp->t_mountp->m_ail))
+                       tp->t_flags |= XFS_TRANS_SYNC;

... to xfs_trans_reserve(). I didn't get a chance to debug it yet. I will try
to find the root cause now.

-- 
chandan




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-23 13:38         ` Chandan Rajendra
@ 2019-08-25 15:05           ` Chandan Rajendra
  2019-08-26  0:32             ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-25 15:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Friday, August 23, 2019 7:08 PM Chandan Rajendra wrote:
> On Friday, August 23, 2019 6:38 PM Brian Foster wrote:
> > On Fri, Aug 23, 2019 at 10:06:36AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 22, 2019 at 12:34:46PM -0400, Brian Foster wrote:
> > > > On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> > > > > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > > > > The following call trace is seen when executing generic/530 on a ppc64le
> > > > > > machine,
> > > > > > 
> > > > > > INFO: task mount:7722 blocked for more than 122 seconds.
> > > > > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > > > > 
> > > > > can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > > > > IO in -rc1 that could result in things going wrong...
> > > > > 
> > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > mount           D 8448  7722   7490 0x00040008
> > > > > > Call Trace:
> > > > > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > > > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > > > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > > > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > > > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > > > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > > > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > > > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > > > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > > > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > > > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > > > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > > > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > > > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > > > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > > > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > > > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > > > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > > > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > > > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > > > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > > > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > > > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > > > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > > > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > > > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > > > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > > > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > > > > 
> > > > > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > > > > of events,
> > > > > > 
> > > > > > 1. Test creates lots of unlinked temp files and then shutsdown the
> > > > > >    filesystem.
> > > > > > 2. During mount, a transaction started in the context of processing
> > > > > >    unlinked inode list causes several iclogs to be filled up. All but
> > > > > >    the last one is submitted for I/O.
> > > > > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > > > > >    18532 bytes of free space in the last iclog of the transaction which is
> > > > > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > > > > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > > > > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> > > > > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > > > > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > > > > >    log cannot be freed yet.
> > > > > 
> > > > > So all those items are still pinned in memory.
> > > > > 
> > > > > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > > > > >    indefinitely for on-disk log space to be freed.
> > > > > 
> > > > > Because nothing has issued a xfs_log_force() for write the iclog to
> > > > > disk, unpin the objects that it pins in memory, and allow the tail
> > > > > to be moved forwards.
> > > > > 
> > > > > The xfsaild normally takes care of thisi - it gets pushed byt the
> > > > > log reserve when there's not enough space to in the log for the
> > > > > transaction before transaction reserve goes to sleep in
> > > > > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > > > > making sure log space is eventually freed. It will issue log forces
> > > > > if it isn't making progress and so this problem shouldn't occur.
> > > > > 
> > > > > So, why has it occurred?
> > > > > 
> > > > > The xfsaild kthread should be running at this point, so if it was
> > > > > pushed it should be trying to empty the journal to move the tail
> > > > > forward. Why hasn't it issue a log force?
> > > > > 
> > > > > 
> > > > > > To fix this issue, before waiting for log space to be freed, this commit
> > > > > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > > > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > > > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > > > > free up the on-disk log space.
> > > > > 
> > > > > hmmm.
> > > > > 
> > > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > > ---
> > > > > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > > > > >  1 file changed, 21 insertions(+)
> > > > > > 
> > > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > > > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > > > > --- a/fs/xfs/xfs_log.c
> > > > > > +++ b/fs/xfs/xfs_log.c
> > > > > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > > > > >  	int			need_bytes) __releases(&head->lock)
> > > > > >  					    __acquires(&head->lock)
> > > > > >  {
> > > > > > +	struct xlog_in_core	*iclog;
> > > > > > +
> > > > > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > > > > >  
> > > > > >  	do {
> > > > > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > > > > >  			goto shutdown;
> > > > > > +
> > > > > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > > > > 
> > > > > This is indicative of the situation. If the AIL is empty, and the
> > > > > log does not have room for an entire transaction reservation, then
> > > > > we need to be issuing synchronous transactions in recovery until
> > > > > such time the AIL pushing can actually function correctly to
> > > > > guarantee forwards progress for async transaction processing.
> > > > > 
> > > > 
> > > > Hmm, I don't think that addresses the fundamental problem. This
> > > > phenomenon doesn't require log recovery. The same scenario can present
> > > > itself after a clean mount or from an idle fs. I think the scenario that
> > > > plays out here, at a high level, is as follows:
> > > > 
> > > 
> > >   - mount
> > >   - [log recovery]
> > >   - xfs_log_mount_finish
> > >     - calls xfs_log_work_queue()
> > > 
> > > > - Heavy transaction workload commences. This continuously acquires log
> > > >   reservation and transfers it to the CIL as transactions commit.
> > > > - The CIL context grows until we cross the background threshold, at
> > > >   which point we schedule a background push.
> > > > - Background CIL push cycles the current context into the log via the
> > > >   iclog buffers. The commit record stays around in-core because the last
> > > >   iclog used for the CIL checkpoint isn't full. Hence, none of the
> > > >   associated log items make it into the AIL and the background CIL push
> > > >   had no effect with respect to freeing log reservation.
> > > > - The same transaction workload is still running and filling up the next
> > > >   CIL context. If we run out of log reservation before a second
> > > >   background CIL push comes along, we're basically stuck waiting on
> > > >   somebody to force the log.
> > > 
> > > - every 30s xfs_log_worker() runs, sees the log dirty, triggers
> > >   a log force. pending commit is flushed to log, dirty objects get
> > >   moved to AIL, then xfs_log_worker() pushes on the AIL to do
> > >   periodic background metadata writeback.
> > > 
> > > > The things that prevent this at normal runtime are timely execution of
> > > > background CIL pushes and the background log worker. If for some reason
> > > > the background CIL push is not timely enough that we consume all log
> > > > reservation before two background CIL pushes occur from the time the
> > > > racing workload begins (i.e. starting from an idle system such that the
> > > > AIL is empty), then we're stuck waiting on the background log worker to
> > > > force the log from the first background CIL push, populate the AIL and
> > > > get things moving again.
> > > 
> > > Right, this does not deadlock - it might pause for a short while
> > > while waiting for the log worker to run and issue a log force. I
> > > have never actually seen it happen in all my years of "mkfs; mount;
> > > fsmark" testing that places a /massive/ metadata modification
> > > workload on a pristine, newly mounted filesystems....
> > > 
> > 
> > Neither have I. That's why I'm asking for more details on how this
> > triggers. :) Whatever factors contribute to this particular instance may
> > or may not apply outside of log recovery context.
> > 
> > > As it is, we've always used the log worker as a watchdog in this
> > > way. The fact is that we have very few situations left in the code
> > > where it needs to act as a watchdog - delayed logging actually
> > > negated the vast majority of problems that required the periodic log
> > > force to get out of trouble because individual transactions no
> > > longer needed to wait on iclog space to make progress...
> > > 
> > 
> > Yes, the description I wrote up already explains that the background log
> > worker changes the symptom from a deadlock to a stall. The defaults
> > allow for a stall of up to 30s, but custom configuration allows for a
> > stall of up to 2 hours.
> > 
> > > > IOW, the same essential problem is reproducible outside of log
> > > > recovery in the form of stalls as opposed to deadlocks via an
> > > > artificial background CIL push delay (i.e., think workqueue or
> > > > xc_cil_ctx lock starvation) and an elevated xfssyncd_centisecs.
> > > > We aren't stuck forever because the background log worker will run
> > > > eventually, but it could certainly be a dead stall of minutes
> > > > before that occurs.
> > > 
> > > I don't think addressing it in xlog_grant_head_wait() fixes the
> > > problem fully, either.  If no other transaction comes in, then the
> > > ones already blocked (because the AIL was not empty when they tried
> > > to reserve space) will end up still blocked because nothing has
> > > kicked the code in the transaction reservation code. So putting the
> > > log force into the grant head wait code is not sufficient by itself.
> > > 
> > > > I think this
> > > > could still be addressed at transaction commit or reservation time, but
> > > > I think the logic needs to be more generic and based on log reservation
> > > > pressure rather than the context from which this particular test happens
> > > > to reproduce.
> > > 
> > > Log reservation pressure is what xlog_grant_push_ail() detects and
> > > that pressure is transferred to the AIL pushing code to clean dirty
> > > log items and move the tail of the log forward. It's right there
> > > where Chandan added the log force. :)
> > > 
> > > IOWs, xlog_grant_push_ail() tells the xfsaild how much log space to
> > > make available. If the log is full, then xlog_grant_push_ail() will
> > > already be telling the AIL to push and will be waking it up.
> > > However, the aild will see the AIL empty and go right back to sleep.
> > > That's likely the runtime problem here - the mechanism that pushes
> > > the log tail forwards is not realising that the log needs pushing
> > > via a log force.
> > > 
> > > IOWs, I suspect the xfsaild is the right place to take the action,
> > > because AIL pushing is triggered by much more than just log
> > > reservations. It gets kicked by memory reclaim, the log worker, new
> > > transactions, etc and so if a transaction doesn't kick it when it
> > > gest stuck like this, something else will.
> > > 
> > 
> > Perhaps, I could see that as an option too. TBH, it's not clear to me
> > where the best place to fix this is. I want to know more about the
> > reproducer first. My comment above was just that I'm not convinced the
> > previously proposed logic to force based on log recovery context is
> > correct.
> > 
> > > > If this is all on the right track, I'm still curious if/how you're
> > > > getting into a situation where all log reservation is held up in the CIL
> > > > before a couple background pushes occur.
> > > 
> > > I'd guess that it could be reproduced via a single CPU machine and
> > > non-preempt kernel. We've already replayed all the unlink
> > > transactions, so the buffer/inode caches are fully primed. If all
> > > unlink removal transactions the buffer/inode cache, then it won't
> > > block anywhere and will never yield the CPU. Hence the CIL push
> > > kworker thread doesn't get to run before the unlinks run the log out
> > > of space.
> > > 
> > 
> > Yeah, I wouldn't expect that for a ppc64 system, but I suppose this
> > could be a VM or something too. Chandan?
> >
> 
> Indeed, This is a KVM guest on a ppc64le machine. The guest has 8 vcpus
> allocated to it. Also, Linux on ppc64le is by default configured to be
> non-preemptable.
> 
> Btw, the issue is still re-creatable after adding,
> 
> +               if ((log->l_flags & XLOG_RECOVERY_NEEDED)
> +                       && !xfs_ail_min(tp->t_mountp->m_ail))
> +                       tp->t_flags |= XFS_TRANS_SYNC;
> 
> ... to xfs_trans_reserve(). I didn't get a chance to debug it yet. I will try
> to find the root cause now.
> 

Dave, With the above changes made in xfs_trans_reserve(), mount task is
deadlocking due to the following,
1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
flushed to the disk and hence log items to be ultimately moved to AIL.
2. xfsaild task is woken up, which acts in items on AIL.
3. After some time, we stop issuing synchronous transactions because AIL has
   log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
   false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
   evaluates to false on the first iteration of the while loop. This means we
   have a log item whose LSN is larger than xfs_ail->ail_target at the
   beginning of the AIL.
4. Now unlinked inode list processing blocks waiting for space reservation on
   the log.

I will debug and get more details as to how AIL ended up in this situation.

-- 
chandan




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-25 15:05           ` Chandan Rajendra
@ 2019-08-26  0:32             ` Dave Chinner
  2019-08-26 12:39               ` Brian Foster
  2019-08-29  5:21               ` Chandan Rajendra
  0 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2019-08-26  0:32 UTC (permalink / raw)
  To: Chandan Rajendra
  Cc: Brian Foster, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> On Friday, August 23, 2019 7:08 PM Chandan Rajendra wrote:
> > On Friday, August 23, 2019 6:38 PM Brian Foster wrote:
> > > On Fri, Aug 23, 2019 at 10:06:36AM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 22, 2019 at 12:34:46PM -0400, Brian Foster wrote:
> > > > > On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> > > > > > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > > > > > The following call trace is seen when executing generic/530 on a ppc64le
> > > > > > > machine,
> > > > > > > 
> > > > > > > INFO: task mount:7722 blocked for more than 122 seconds.
> > > > > > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > > > > > 
> > > > > > can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > > > > > IO in -rc1 that could result in things going wrong...
> > > > > > 
> > > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > mount           D 8448  7722   7490 0x00040008
> > > > > > > Call Trace:
> > > > > > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > > > > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > > > > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > > > > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > > > > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > > > > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > > > > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > > > > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > > > > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > > > > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > > > > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > > > > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > > > > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > > > > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > > > > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > > > > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > > > > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > > > > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > > > > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > > > > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > > > > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > > > > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > > > > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > > > > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > > > > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > > > > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > > > > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > > > > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > > > > > 
> > > > > > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > > > > > of events,
> > > > > > > 
> > > > > > > 1. Test creates lots of unlinked temp files and then shutsdown the
> > > > > > >    filesystem.
> > > > > > > 2. During mount, a transaction started in the context of processing
> > > > > > >    unlinked inode list causes several iclogs to be filled up. All but
> > > > > > >    the last one is submitted for I/O.
> > > > > > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > > > > > >    18532 bytes of free space in the last iclog of the transaction which is
> > > > > > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > > > > > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > > > > > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> > > > > > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > > > > > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > > > > > >    log cannot be freed yet.
> > > > > > 
> > > > > > So all those items are still pinned in memory.
> > > > > > 
> > > > > > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > > > > > >    indefinitely for on-disk log space to be freed.
> > > > > > 
> > > > > > Because nothing has issued a xfs_log_force() for write the iclog to
> > > > > > disk, unpin the objects that it pins in memory, and allow the tail
> > > > > > to be moved forwards.
> > > > > > 
> > > > > > The xfsaild normally takes care of thisi - it gets pushed byt the
> > > > > > log reserve when there's not enough space to in the log for the
> > > > > > transaction before transaction reserve goes to sleep in
> > > > > > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > > > > > making sure log space is eventually freed. It will issue log forces
> > > > > > if it isn't making progress and so this problem shouldn't occur.
> > > > > > 
> > > > > > So, why has it occurred?
> > > > > > 
> > > > > > The xfsaild kthread should be running at this point, so if it was
> > > > > > pushed it should be trying to empty the journal to move the tail
> > > > > > forward. Why hasn't it issue a log force?
> > > > > > 
> > > > > > 
> > > > > > > To fix this issue, before waiting for log space to be freed, this commit
> > > > > > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > > > > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > > > > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > > > > > free up the on-disk log space.
> > > > > > 
> > > > > > hmmm.
> > > > > > 
> > > > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > > > ---
> > > > > > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > > > > > >  1 file changed, 21 insertions(+)
> > > > > > > 
> > > > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > > > > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > > > > > --- a/fs/xfs/xfs_log.c
> > > > > > > +++ b/fs/xfs/xfs_log.c
> > > > > > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > > > > > >  	int			need_bytes) __releases(&head->lock)
> > > > > > >  					    __acquires(&head->lock)
> > > > > > >  {
> > > > > > > +	struct xlog_in_core	*iclog;
> > > > > > > +
> > > > > > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > > > > > >  
> > > > > > >  	do {
> > > > > > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > > > > > >  			goto shutdown;
> > > > > > > +
> > > > > > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > > > > > 
> > > > > > This is indicative of the situation. If the AIL is empty, and the
> > > > > > log does not have room for an entire transaction reservation, then
> > > > > > we need to be issuing synchronous transactions in recovery until
> > > > > > such time the AIL pushing can actually function correctly to
> > > > > > guarantee forwards progress for async transaction processing.
> > > > > > 
> > > > > 
> > > > > Hmm, I don't think that addresses the fundamental problem. This
> > > > > phenomenon doesn't require log recovery. The same scenario can present
> > > > > itself after a clean mount or from an idle fs. I think the scenario that
> > > > > plays out here, at a high level, is as follows:
> > > > > 
> > > > 
> > > >   - mount
> > > >   - [log recovery]
> > > >   - xfs_log_mount_finish
> > > >     - calls xfs_log_work_queue()
> > > > 
> > > > > - Heavy transaction workload commences. This continuously acquires log
> > > > >   reservation and transfers it to the CIL as transactions commit.
> > > > > - The CIL context grows until we cross the background threshold, at
> > > > >   which point we schedule a background push.
> > > > > - Background CIL push cycles the current context into the log via the
> > > > >   iclog buffers. The commit record stays around in-core because the last
> > > > >   iclog used for the CIL checkpoint isn't full. Hence, none of the
> > > > >   associated log items make it into the AIL and the background CIL push
> > > > >   had no effect with respect to freeing log reservation.
> > > > > - The same transaction workload is still running and filling up the next
> > > > >   CIL context. If we run out of log reservation before a second
> > > > >   background CIL push comes along, we're basically stuck waiting on
> > > > >   somebody to force the log.
> > > > 
> > > > - every 30s xfs_log_worker() runs, sees the log dirty, triggers
> > > >   a log force. pending commit is flushed to log, dirty objects get
> > > >   moved to AIL, then xfs_log_worker() pushes on the AIL to do
> > > >   periodic background metadata writeback.
> > > > 
> > > > > The things that prevent this at normal runtime are timely execution of
> > > > > background CIL pushes and the background log worker. If for some reason
> > > > > the background CIL push is not timely enough that we consume all log
> > > > > reservation before two background CIL pushes occur from the time the
> > > > > racing workload begins (i.e. starting from an idle system such that the
> > > > > AIL is empty), then we're stuck waiting on the background log worker to
> > > > > force the log from the first background CIL push, populate the AIL and
> > > > > get things moving again.
> > > > 
> > > > Right, this does not deadlock - it might pause for a short while
> > > > while waiting for the log worker to run and issue a log force. I
> > > > have never actually seen it happen in all my years of "mkfs; mount;
> > > > fsmark" testing that places a /massive/ metadata modification
> > > > workload on a pristine, newly mounted filesystems....
> > > > 
> > > 
> > > Neither have I. That's why I'm asking for more details on how this
> > > triggers. :) Whatever factors contribute to this particular instance may
> > > or may not apply outside of log recovery context.
> > > 
> > > > As it is, we've always used the log worker as a watchdog in this
> > > > way. The fact is that we have very few situations left in the code
> > > > where it needs to act as a watchdog - delayed logging actually
> > > > negated the vast majority of problems that required the periodic log
> > > > force to get out of trouble because individual transactions no
> > > > longer needed to wait on iclog space to make progress...
> > > > 
> > > 
> > > Yes, the description I wrote up already explains that the background log
> > > worker changes the symptom from a deadlock to a stall. The defaults
> > > allow for a stall of up to 30s, but custom configuration allows for a
> > > stall of up to 2 hours.
> > > 
> > > > > IOW, the same essential problem is reproducible outside of log
> > > > > recovery in the form of stalls as opposed to deadlocks via an
> > > > > artificial background CIL push delay (i.e., think workqueue or
> > > > > xc_cil_ctx lock starvation) and an elevated xfssyncd_centisecs.
> > > > > We aren't stuck forever because the background log worker will run
> > > > > eventually, but it could certainly be a dead stall of minutes
> > > > > before that occurs.
> > > > 
> > > > I don't think addressing it in xlog_grant_head_wait() fixes the
> > > > problem fully, either.  If no other transaction comes in, then the
> > > > ones already blocked (because the AIL was not empty when they tried
> > > > to reserve space) will end up still blocked because nothing has
> > > > kicked the code in the transaction reservation code. So putting the
> > > > log force into the grant head wait code is not sufficient by itself.
> > > > 
> > > > > I think this
> > > > > could still be addressed at transaction commit or reservation time, but
> > > > > I think the logic needs to be more generic and based on log reservation
> > > > > pressure rather than the context from which this particular test happens
> > > > > to reproduce.
> > > > 
> > > > Log reservation pressure is what xlog_grant_push_ail() detects and
> > > > that pressure is transferred to the AIL pushing code to clean dirty
> > > > log items and move the tail of the log forward. It's right there
> > > > where Chandan added the log force. :)
> > > > 
> > > > IOWs, xlog_grant_push_ail() tells the xfsaild how much log space to
> > > > make available. If the log is full, then xlog_grant_push_ail() will
> > > > already be telling the AIL to push and will be waking it up.
> > > > However, the aild will see the AIL empty and go right back to sleep.
> > > > That's likely the runtime problem here - the mechanism that pushes
> > > > the log tail forwards is not realising that the log needs pushing
> > > > via a log force.
> > > > 
> > > > IOWs, I suspect the xfsaild is the right place to take the action,
> > > > because AIL pushing is triggered by much more than just log
> > > > reservations. It gets kicked by memory reclaim, the log worker, new
> > > > transactions, etc and so if a transaction doesn't kick it when it
> > > > gest stuck like this, something else will.
> > > > 
> > > 
> > > Perhaps, I could see that as an option too. TBH, it's not clear to me
> > > where the best place to fix this is. I want to know more about the
> > > reproducer first. My comment above was just that I'm not convinced the
> > > previously proposed logic to force based on log recovery context is
> > > correct.
> > > 
> > > > > If this is all on the right track, I'm still curious if/how you're
> > > > > getting into a situation where all log reservation is held up in the CIL
> > > > > before a couple background pushes occur.
> > > > 
> > > > I'd guess that it could be reproduced via a single CPU machine and
> > > > non-preempt kernel. We've already replayed all the unlink
> > > > transactions, so the buffer/inode caches are fully primed. If all
> > > > unlink removal transactions the buffer/inode cache, then it won't
> > > > block anywhere and will never yield the CPU. Hence the CIL push
> > > > kworker thread doesn't get to run before the unlinks run the log out
> > > > of space.
> > > > 
> > > 
> > > Yeah, I wouldn't expect that for a ppc64 system, but I suppose this
> > > could be a VM or something too. Chandan?
> > >
> > 
> > Indeed, This is a KVM guest on a ppc64le machine. The guest has 8 vcpus
> > allocated to it. Also, Linux on ppc64le is by default configured to be
> > non-preemptable.
> > 
> > Btw, the issue is still re-creatable after adding,
> > 
> > +               if ((log->l_flags & XLOG_RECOVERY_NEEDED)
> > +                       && !xfs_ail_min(tp->t_mountp->m_ail))
> > +                       tp->t_flags |= XFS_TRANS_SYNC;
> > 
> > ... to xfs_trans_reserve(). I didn't get a chance to debug it yet. I will try
> > to find the root cause now.
> > 
> 
> Dave, With the above changes made in xfs_trans_reserve(), mount task is
> deadlocking due to the following,
> 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> flushed to the disk and hence log items to be ultimately moved to AIL.
> 2. xfsaild task is woken up, which acts in items on AIL.
> 3. After some time, we stop issuing synchronous transactions because AIL has
>    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
>    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
>    evaluates to false on the first iteration of the while loop. This means we
>    have a log item whose LSN is larger than xfs_ail->ail_target at the
>    beginning of the AIL.

The push target for xlog_grant_push_ail() is to free 25% of the log
space. So if all the items in the AIL are not within 25% of the tail
end of the log, there's nothing for the AIL to push. This indicates
that there is at least 25% of physical log space free.

I suspect that this means the CIL is overruning it's background push
target by more than expected probably because the log is so small. That leads
to the outstanding CIL pending commit size (the current CIL context
and the previous CIL commit that is held off by the uncommited
iclog) is greater than the AIL push target, and so nothing will free
up more log space and wake up the transaction waiting for grant
space.

e.g. the previous CIL context commit might take 15% of the log
space, and the current CIL has reserved 11% of the log space.
Now new transactions reservations have run out of grant space and we
push on the ail, but it's lowest item is at 26%, and so the AIL push
does nothing and we're stuck because the CIL has pinned 26% of the
log space.

As a test, can you run the test with larger log sizes? I think
the default used was about ~3600 blocks, so it you step that up by
500 blocks at a time we should get an idea of the size of the
overrun by the size of the log where the hang goes away. A
trace of the transaction reservations and AIL pushing would also be
insightful.

FWIW, when I originally sized the CIL space, I did actually try to
take this into account. The CIL space is "log size >> 3", which is
12.5% of the log, or exactly half the space of the
xlog_grant_push_ail() target window. Hence the space pinned by two
CIL pushes (on on it's way to disk and one filling) should be
covered by the 25% threshold in xlog_grant_push_ail()

However, what I didn't take into account is that the one going to
disk could stall indefinitely because we run out of log space before
the current CIL context is pushed and there's nothing to trigger the
CIL to be pushed to unwedge it.

There are a few solutions I can think of to this:

	1 increase the xlog_grant_push_ail() push window to a larger
	  amount of the log for small-to-medium logs. This doesn't
	  work if the overrun is too large, but it should fix the
	  "off-by-one" type of hang.

	2 reduce the CIL background push threshold. I already have
	  patches to do this for large logs (part of the
	  non-blocking shrinker work) but for small logs this is
	  somewhat equivalent to #1. I think #1 is a better option
	  for small logs they are bound on metadata writeback,
	  anyway, so starting writeback earlier isn't a big deal.

	3 have the aild push the CIL if it's woken and there is
	  nothing to push on (i.e. empty or everything > target).
	  This may be too agressive, so it might be a case of "if
	  the push target changed and there's still nothing to do".

	4 have the aild force the log if it's woken and there is
	  nothing to push on

	5 xlog_grant_push_ail() trigger background CIL pushes when
	  the space required is not available at all (i.e.
	  need_bytes > xlog_space_left(log, &log->l_reserve_head.grant)
	  as that is when we are about to sleep). Hence the CIL will
	  always get pushed when we are completely out of log grant
	  space.

If it is simply and overrun accounting issue, then I suspect #1 or
#5 are the most side-effect free solutions to the problem. the
others are likely to have performance degradations in unpredictable
corner cases...

Cheers,

Dave.


-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-26  0:32             ` Dave Chinner
@ 2019-08-26 12:39               ` Brian Foster
  2019-08-27 23:10                 ` Dave Chinner
  2019-08-29  5:21               ` Chandan Rajendra
  1 sibling, 1 reply; 25+ messages in thread
From: Brian Foster @ 2019-08-26 12:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chandan Rajendra, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Mon, Aug 26, 2019 at 10:32:53AM +1000, Dave Chinner wrote:
> On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> > On Friday, August 23, 2019 7:08 PM Chandan Rajendra wrote:
> > > On Friday, August 23, 2019 6:38 PM Brian Foster wrote:
> > > > On Fri, Aug 23, 2019 at 10:06:36AM +1000, Dave Chinner wrote:
> > > > > On Thu, Aug 22, 2019 at 12:34:46PM -0400, Brian Foster wrote:
> > > > > > On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> > > > > > > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > > > > > > The following call trace is seen when executing generic/530 on a ppc64le
> > > > > > > > machine,
> > > > > > > > 
> > > > > > > > INFO: task mount:7722 blocked for more than 122 seconds.
> > > > > > > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > > > > > > 
> > > > > > > can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > > > > > > IO in -rc1 that could result in things going wrong...
> > > > > > > 
> > > > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > > mount           D 8448  7722   7490 0x00040008
> > > > > > > > Call Trace:
> > > > > > > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > > > > > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > > > > > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > > > > > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > > > > > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > > > > > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > > > > > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > > > > > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > > > > > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > > > > > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > > > > > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > > > > > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > > > > > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > > > > > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > > > > > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > > > > > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > > > > > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > > > > > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > > > > > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > > > > > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > > > > > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > > > > > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > > > > > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > > > > > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > > > > > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > > > > > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > > > > > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > > > > > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > > > > > > 
> > > > > > > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > > > > > > of events,
> > > > > > > > 
> > > > > > > > 1. Test creates lots of unlinked temp files and then shutsdown the
> > > > > > > >    filesystem.
> > > > > > > > 2. During mount, a transaction started in the context of processing
> > > > > > > >    unlinked inode list causes several iclogs to be filled up. All but
> > > > > > > >    the last one is submitted for I/O.
> > > > > > > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > > > > > > >    18532 bytes of free space in the last iclog of the transaction which is
> > > > > > > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > > > > > > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > > > > > > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> > > > > > > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > > > > > > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > > > > > > >    log cannot be freed yet.
> > > > > > > 
> > > > > > > So all those items are still pinned in memory.
> > > > > > > 
> > > > > > > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > > > > > > >    indefinitely for on-disk log space to be freed.
> > > > > > > 
> > > > > > > Because nothing has issued a xfs_log_force() for write the iclog to
> > > > > > > disk, unpin the objects that it pins in memory, and allow the tail
> > > > > > > to be moved forwards.
> > > > > > > 
> > > > > > > The xfsaild normally takes care of thisi - it gets pushed byt the
> > > > > > > log reserve when there's not enough space to in the log for the
> > > > > > > transaction before transaction reserve goes to sleep in
> > > > > > > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > > > > > > making sure log space is eventually freed. It will issue log forces
> > > > > > > if it isn't making progress and so this problem shouldn't occur.
> > > > > > > 
> > > > > > > So, why has it occurred?
> > > > > > > 
> > > > > > > The xfsaild kthread should be running at this point, so if it was
> > > > > > > pushed it should be trying to empty the journal to move the tail
> > > > > > > forward. Why hasn't it issue a log force?
> > > > > > > 
> > > > > > > 
> > > > > > > > To fix this issue, before waiting for log space to be freed, this commit
> > > > > > > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > > > > > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > > > > > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > > > > > > free up the on-disk log space.
> > > > > > > 
> > > > > > > hmmm.
> > > > > > > 
> > > > > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > > > > ---
> > > > > > > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > > > > > > >  1 file changed, 21 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > > > > > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > > > > > > --- a/fs/xfs/xfs_log.c
> > > > > > > > +++ b/fs/xfs/xfs_log.c
> > > > > > > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > > > > > > >  	int			need_bytes) __releases(&head->lock)
> > > > > > > >  					    __acquires(&head->lock)
> > > > > > > >  {
> > > > > > > > +	struct xlog_in_core	*iclog;
> > > > > > > > +
> > > > > > > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > > > > > > >  
> > > > > > > >  	do {
> > > > > > > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > > > > > > >  			goto shutdown;
> > > > > > > > +
> > > > > > > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > > > > > > 
> > > > > > > This is indicative of the situation. If the AIL is empty, and the
> > > > > > > log does not have room for an entire transaction reservation, then
> > > > > > > we need to be issuing synchronous transactions in recovery until
> > > > > > > such time the AIL pushing can actually function correctly to
> > > > > > > guarantee forwards progress for async transaction processing.
> > > > > > > 
> > > > > > 
> > > > > > Hmm, I don't think that addresses the fundamental problem. This
> > > > > > phenomenon doesn't require log recovery. The same scenario can present
> > > > > > itself after a clean mount or from an idle fs. I think the scenario that
> > > > > > plays out here, at a high level, is as follows:
> > > > > > 
> > > > > 
> > > > >   - mount
> > > > >   - [log recovery]
> > > > >   - xfs_log_mount_finish
> > > > >     - calls xfs_log_work_queue()
> > > > > 
> > > > > > - Heavy transaction workload commences. This continuously acquires log
> > > > > >   reservation and transfers it to the CIL as transactions commit.
> > > > > > - The CIL context grows until we cross the background threshold, at
> > > > > >   which point we schedule a background push.
> > > > > > - Background CIL push cycles the current context into the log via the
> > > > > >   iclog buffers. The commit record stays around in-core because the last
> > > > > >   iclog used for the CIL checkpoint isn't full. Hence, none of the
> > > > > >   associated log items make it into the AIL and the background CIL push
> > > > > >   had no effect with respect to freeing log reservation.
> > > > > > - The same transaction workload is still running and filling up the next
> > > > > >   CIL context. If we run out of log reservation before a second
> > > > > >   background CIL push comes along, we're basically stuck waiting on
> > > > > >   somebody to force the log.
> > > > > 
> > > > > - every 30s xfs_log_worker() runs, sees the log dirty, triggers
> > > > >   a log force. pending commit is flushed to log, dirty objects get
> > > > >   moved to AIL, then xfs_log_worker() pushes on the AIL to do
> > > > >   periodic background metadata writeback.
> > > > > 
> > > > > > The things that prevent this at normal runtime are timely execution of
> > > > > > background CIL pushes and the background log worker. If for some reason
> > > > > > the background CIL push is not timely enough that we consume all log
> > > > > > reservation before two background CIL pushes occur from the time the
> > > > > > racing workload begins (i.e. starting from an idle system such that the
> > > > > > AIL is empty), then we're stuck waiting on the background log worker to
> > > > > > force the log from the first background CIL push, populate the AIL and
> > > > > > get things moving again.
> > > > > 
> > > > > Right, this does not deadlock - it might pause for a short while
> > > > > while waiting for the log worker to run and issue a log force. I
> > > > > have never actually seen it happen in all my years of "mkfs; mount;
> > > > > fsmark" testing that places a /massive/ metadata modification
> > > > > workload on a pristine, newly mounted filesystems....
> > > > > 
> > > > 
> > > > Neither have I. That's why I'm asking for more details on how this
> > > > triggers. :) Whatever factors contribute to this particular instance may
> > > > or may not apply outside of log recovery context.
> > > > 
> > > > > As it is, we've always used the log worker as a watchdog in this
> > > > > way. The fact is that we have very few situations left in the code
> > > > > where it needs to act as a watchdog - delayed logging actually
> > > > > negated the vast majority of problems that required the periodic log
> > > > > force to get out of trouble because individual transactions no
> > > > > longer needed to wait on iclog space to make progress...
> > > > > 
> > > > 
> > > > Yes, the description I wrote up already explains that the background log
> > > > worker changes the symptom from a deadlock to a stall. The defaults
> > > > allow for a stall of up to 30s, but custom configuration allows for a
> > > > stall of up to 2 hours.
> > > > 
> > > > > > IOW, the same essential problem is reproducible outside of log
> > > > > > recovery in the form of stalls as opposed to deadlocks via an
> > > > > > artificial background CIL push delay (i.e., think workqueue or
> > > > > > xc_cil_ctx lock starvation) and an elevated xfssyncd_centisecs.
> > > > > > We aren't stuck forever because the background log worker will run
> > > > > > eventually, but it could certainly be a dead stall of minutes
> > > > > > before that occurs.
> > > > > 
> > > > > I don't think addressing it in xlog_grant_head_wait() fixes the
> > > > > problem fully, either.  If no other transaction comes in, then the
> > > > > ones already blocked (because the AIL was not empty when they tried
> > > > > to reserve space) will end up still blocked because nothing has
> > > > > kicked the code in the transaction reservation code. So putting the
> > > > > log force into the grant head wait code is not sufficient by itself.
> > > > > 
> > > > > > I think this
> > > > > > could still be addressed at transaction commit or reservation time, but
> > > > > > I think the logic needs to be more generic and based on log reservation
> > > > > > pressure rather than the context from which this particular test happens
> > > > > > to reproduce.
> > > > > 
> > > > > Log reservation pressure is what xlog_grant_push_ail() detects and
> > > > > that pressure is transferred to the AIL pushing code to clean dirty
> > > > > log items and move the tail of the log forward. It's right there
> > > > > where Chandan added the log force. :)
> > > > > 
> > > > > IOWs, xlog_grant_push_ail() tells the xfsaild how much log space to
> > > > > make available. If the log is full, then xlog_grant_push_ail() will
> > > > > already be telling the AIL to push and will be waking it up.
> > > > > However, the aild will see the AIL empty and go right back to sleep.
> > > > > That's likely the runtime problem here - the mechanism that pushes
> > > > > the log tail forwards is not realising that the log needs pushing
> > > > > via a log force.
> > > > > 
> > > > > IOWs, I suspect the xfsaild is the right place to take the action,
> > > > > because AIL pushing is triggered by much more than just log
> > > > > reservations. It gets kicked by memory reclaim, the log worker, new
> > > > > transactions, etc and so if a transaction doesn't kick it when it
> > > > > gest stuck like this, something else will.
> > > > > 
> > > > 
> > > > Perhaps, I could see that as an option too. TBH, it's not clear to me
> > > > where the best place to fix this is. I want to know more about the
> > > > reproducer first. My comment above was just that I'm not convinced the
> > > > previously proposed logic to force based on log recovery context is
> > > > correct.
> > > > 
> > > > > > If this is all on the right track, I'm still curious if/how you're
> > > > > > getting into a situation where all log reservation is held up in the CIL
> > > > > > before a couple background pushes occur.
> > > > > 
> > > > > I'd guess that it could be reproduced via a single CPU machine and
> > > > > non-preempt kernel. We've already replayed all the unlink
> > > > > transactions, so the buffer/inode caches are fully primed. If all
> > > > > unlink removal transactions the buffer/inode cache, then it won't
> > > > > block anywhere and will never yield the CPU. Hence the CIL push
> > > > > kworker thread doesn't get to run before the unlinks run the log out
> > > > > of space.
> > > > > 
> > > > 
> > > > Yeah, I wouldn't expect that for a ppc64 system, but I suppose this
> > > > could be a VM or something too. Chandan?
> > > >
> > > 
> > > Indeed, This is a KVM guest on a ppc64le machine. The guest has 8 vcpus
> > > allocated to it. Also, Linux on ppc64le is by default configured to be
> > > non-preemptable.
> > > 
> > > Btw, the issue is still re-creatable after adding,
> > > 
> > > +               if ((log->l_flags & XLOG_RECOVERY_NEEDED)
> > > +                       && !xfs_ail_min(tp->t_mountp->m_ail))
> > > +                       tp->t_flags |= XFS_TRANS_SYNC;
> > > 
> > > ... to xfs_trans_reserve(). I didn't get a chance to debug it yet. I will try
> > > to find the root cause now.
> > > 
> > 
> > Dave, With the above changes made in xfs_trans_reserve(), mount task is
> > deadlocking due to the following,
> > 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> > flushed to the disk and hence log items to be ultimately moved to AIL.
> > 2. xfsaild task is woken up, which acts in items on AIL.
> > 3. After some time, we stop issuing synchronous transactions because AIL has
> >    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
> >    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
> >    evaluates to false on the first iteration of the while loop. This means we
> >    have a log item whose LSN is larger than xfs_ail->ail_target at the
> >    beginning of the AIL.
> 
> The push target for xlog_grant_push_ail() is to free 25% of the log
> space. So if all the items in the AIL are not within 25% of the tail
> end of the log, there's nothing for the AIL to push. This indicates
> that there is at least 25% of physical log space free.
> 
> I suspect that this means the CIL is overruning it's background push
> target by more than expected probably because the log is so small. That leads
> to the outstanding CIL pending commit size (the current CIL context
> and the previous CIL commit that is held off by the uncommited
> iclog) is greater than the AIL push target, and so nothing will free
> up more log space and wake up the transaction waiting for grant
> space.
> 
> e.g. the previous CIL context commit might take 15% of the log
> space, and the current CIL has reserved 11% of the log space.
> Now new transactions reservations have run out of grant space and we
> push on the ail, but it's lowest item is at 26%, and so the AIL push
> does nothing and we're stuck because the CIL has pinned 26% of the
> log space.
> 

This is pretty much the behavior I described in my previous mail. For
some unknown reason two background CIL pushes do not occur in enough
time before we exhaust log reservation and so all of our log reservation
ends up pinned by an iclog. I'd really like more data to explain what
might be going on in this case to prevent the background CIL push from
executing...

> As a test, can you run the test with larger log sizes? I think
> the default used was about ~3600 blocks, so it you step that up by
> 500 blocks at a time we should get an idea of the size of the
> overrun by the size of the log where the hang goes away. A
> trace of the transaction reservations and AIL pushing would also be
> insightful.
> 
> FWIW, when I originally sized the CIL space, I did actually try to
> take this into account. The CIL space is "log size >> 3", which is
> 12.5% of the log, or exactly half the space of the
> xlog_grant_push_ail() target window. Hence the space pinned by two
> CIL pushes (on on it's way to disk and one filling) should be
> covered by the 25% threshold in xlog_grant_push_ail()
> 
> However, what I didn't take into account is that the one going to
> disk could stall indefinitely because we run out of log space before
> the current CIL context is pushed and there's nothing to trigger the
> CIL to be pushed to unwedge it.
> 
> There are a few solutions I can think of to this:
> 
> 	1 increase the xlog_grant_push_ail() push window to a larger
> 	  amount of the log for small-to-medium logs. This doesn't
> 	  work if the overrun is too large, but it should fix the
> 	  "off-by-one" type of hang.
> 

Rather than special case this to log sizes, I'm wondering if we can
update this function to incorporate log pressure built up in the in-core
log. The current code already stubs the calculated threshold_lsn to the
last commit record that made it to the log, so if reservation is pinned
in-core then just increasing the window may have no effect.

What I'm wondering is that if the log reservation is primarily in-core,
if this function could easily detect that scenario based on the
difference between the reservation grant head and the head/tail of the
physical log. I.e., rename this function to something like
xlog_grant_push() and do just about all of the same calculations, but
rather than skip out if the AIL threshold is too far beyond the on-disk
head, use that to push on the log (not the CIL) before setting the AIL
threshold.

I suppose we'd have to do a sync log push here in order to set a valid
AIL push target, but we're down in the grant wait path anyways so I
don't see the harm in waiting on a log I/O before waiting on a grant
head. It does look like we'd need to fix up the locking a bit though.

Otherwise I think this is quite similar to the aforementioned diff.
Instead of checking for an empty AIL or special casing against log
recovery context or small logs, we directly calculate if some portion of
the log reservation requested by the caller is tied up in the in-core
log and push on that before we push on the AIL to ultimately free up the
space. Thoughts? An untested and very quickly hacked up diff is appended
below just to illustrate the idea. FWIW, this is also effectively
similar to something along the lines of allowing the excess target into
the AIL and letting the AIL use that to initiate a log force...

Brian

--- 8< ---

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 7fc3c1ad36bc..37d115826a87 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -112,6 +112,13 @@ xlog_verify_tail_lsn(
 STATIC int
 xlog_iclogs_empty(
 	struct xlog		*log);
+static int
+__xfs_log_force_lsn(
+	struct xfs_mount	*mp,
+	xfs_lsn_t		lsn,
+	uint			flags,
+	int			*log_flushed,
+	bool			already_slept);
 
 static void
 xlog_grant_sub_space(
@@ -241,11 +248,9 @@ xlog_grant_head_wait(
 	do {
 		if (XLOG_FORCED_SHUTDOWN(log))
 			goto shutdown;
+		spin_unlock(&head->lock);
 		xlog_grant_push_ail(log, need_bytes);
-
 		__set_current_state(TASK_UNINTERRUPTIBLE);
-		spin_unlock(&head->lock);
-
 		XFS_STATS_INC(log->l_mp, xs_sleep_logspace);
 
 		trace_xfs_log_grant_sleep(log, tic);
@@ -1558,8 +1563,15 @@ xlog_grant_push_ail(
 	 * so that it doesn't change between the compare and the set.
 	 */
 	last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
-	if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
-		threshold_lsn = last_sync_lsn;
+	if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0) {
+		int		flushed = 0;
+		xfs_lsn_t	force_lsn;
+
+		force_lsn = xlog_assign_lsn(log->l_prev_cycle,
+					    log->l_prev_block);
+		__xfs_log_force_lsn(log->l_mp, force_lsn, XFS_LOG_SYNC,
+				    &flushed, true);
+	}
 
 	/*
 	 * Get the transaction layer to kick the dirty buffers out to

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-26 12:39               ` Brian Foster
@ 2019-08-27 23:10                 ` Dave Chinner
  2019-08-28 12:06                   ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2019-08-27 23:10 UTC (permalink / raw)
  To: Brian Foster
  Cc: Chandan Rajendra, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Mon, Aug 26, 2019 at 08:39:34AM -0400, Brian Foster wrote:
> On Mon, Aug 26, 2019 at 10:32:53AM +1000, Dave Chinner wrote:
> > On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> > > > Btw, the issue is still re-creatable after adding,
> > > > 
> > > > +               if ((log->l_flags & XLOG_RECOVERY_NEEDED)
> > > > +                       && !xfs_ail_min(tp->t_mountp->m_ail))
> > > > +                       tp->t_flags |= XFS_TRANS_SYNC;
> > > > 
> > > > ... to xfs_trans_reserve(). I didn't get a chance to debug it yet. I will try
> > > > to find the root cause now.
> > > > 
> > > 
> > > Dave, With the above changes made in xfs_trans_reserve(), mount task is
> > > deadlocking due to the following,
> > > 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> > > flushed to the disk and hence log items to be ultimately moved to AIL.
> > > 2. xfsaild task is woken up, which acts in items on AIL.
> > > 3. After some time, we stop issuing synchronous transactions because AIL has
> > >    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
> > >    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
> > >    evaluates to false on the first iteration of the while loop. This means we
> > >    have a log item whose LSN is larger than xfs_ail->ail_target at the
> > >    beginning of the AIL.
> > 
> > The push target for xlog_grant_push_ail() is to free 25% of the log
> > space. So if all the items in the AIL are not within 25% of the tail
> > end of the log, there's nothing for the AIL to push. This indicates
> > that there is at least 25% of physical log space free.
> > 
> > I suspect that this means the CIL is overruning it's background push
> > target by more than expected probably because the log is so small. That leads
> > to the outstanding CIL pending commit size (the current CIL context
> > and the previous CIL commit that is held off by the uncommited
> > iclog) is greater than the AIL push target, and so nothing will free
> > up more log space and wake up the transaction waiting for grant
> > space.
> > 
> > e.g. the previous CIL context commit might take 15% of the log
> > space, and the current CIL has reserved 11% of the log space.
> > Now new transactions reservations have run out of grant space and we
> > push on the ail, but it's lowest item is at 26%, and so the AIL push
> > does nothing and we're stuck because the CIL has pinned 26% of the
> > log space.
> > 
> 
> This is pretty much the behavior I described in my previous mail. For
> some unknown reason two background CIL pushes do not occur in enough
> time before we exhaust log reservation and so all of our log reservation
> ends up pinned by an iclog. I'd really like more data to explain what
> might be going on in this case to prevent the background CIL push from
> executing...

Oh, it doesn't take much when the log is small. In this case, it's
3600 blocks in size - ~14MB, so 1% of the log is 140kB and that's
smaller than a single itruncate transaction reservation. The CIL
push threshold is ~1.7MB, and the push threshold is ~3.5MB. IOWs,
we might not push the cil the first time until it reaches 1.8MB
because of the size of the last transaction that pushes it over.
Then it gets pushed, leaving the last iclog waiting and pinning
that part of the log.

AFAICT, it's largely caused by the fact that transaction
reservations require significant single digit percentages of the
total log space. i.e. we're fine at 23%, next reservation requires
3% of the total log space, and then we're in trouble.

> > However, what I didn't take into account is that the one going to
> > disk could stall indefinitely because we run out of log space before
> > the current CIL context is pushed and there's nothing to trigger the
> > CIL to be pushed to unwedge it.
> > 
> > There are a few solutions I can think of to this:
> > 
> > 	1 increase the xlog_grant_push_ail() push window to a larger
> > 	  amount of the log for small-to-medium logs. This doesn't
> > 	  work if the overrun is too large, but it should fix the
> > 	  "off-by-one" type of hang.
> > 
> 
> Rather than special case this to log sizes, I'm wondering if we can
> update this function to incorporate log pressure built up in the in-core
> log. The current code already stubs the calculated threshold_lsn to the
> last commit record that made it to the log, so if reservation is pinned
> in-core then just increasing the window may have no effect.

AFAICT, the problem is actually related to log sizes, so we have to
consider that as an aspect of the problem...

> What I'm wondering is that if the log reservation is primarily in-core,
> if this function could easily detect that scenario based on the
> difference between the reservation grant head and the head/tail of the
> physical log. I.e., rename this function to something like
> xlog_grant_push() and do just about all of the same calculations, but
> rather than skip out if the AIL threshold is too far beyond the on-disk
> head, use that to push on the log (not the CIL) before setting the AIL
> threshold.
> 
> I suppose we'd have to do a sync log push here in order to set a valid
> AIL push target, but we're down in the grant wait path anyways so I
> don't see the harm in waiting on a log I/O before waiting on a grant
> head. It does look like we'd need to fix up the locking a bit though.

The problem with dropping the AIL lock and issuing and waiting for
log IO in the grant head wait loop is that it subverts the FIFO
waiting nature of the log reservations. i.e. first into the wait
code is the first woken. If we put a blocking log force in here,
the transaction reseration that triggers it now has to wait for
that to complete before it goes onto the wait queue, and there might
be a bunch more transactions that get in before it.

That also brings up the problem of thundering herd log forces.
Every concurrent transaction reservation that sees this state will
issue a log force before it goes to sleep waiting for space to be
made available. And every time they are woken and there's no space
available, they'll issue a log force. This will substantially
perturb behaviour when we are running in constant tail-pushing
workloads (i.e. all the time on small logs). That's why I suggested
the XFS_TRANS_SYNC mechanism so it didn't perturb the non-blocking
nature of the AIL push....

> Otherwise I think this is quite similar to the aforementioned diff.
> Instead of checking for an empty AIL or special casing against log
> recovery context or small logs, we directly calculate if some portion of
> the log reservation requested by the caller is tied up in the in-core
> log and push on that before we push on the AIL to ultimately free up the
> space. Thoughts? An untested and very quickly hacked up diff is appended
> below just to illustrate the idea. FWIW, this is also effectively
> similar to something along the lines of allowing the excess target into
> the AIL and letting the AIL use that to initiate a log force...

The difference is moving it to the AIL doesn't perturb the
transaction reservation waiting behaviour at all, and it's largely
indepedent on the log size, transaction reservation sizes, etc.

Logic might be:

	if the push target is approximately where the log grant tail
			pushing threshold should be; and
	if the AIL is empty or all the items in the AIL are > grant
			push threshold; and
	if we've got nothing else to do and will go to sleep

		issue a log force

i.e. I don't think we need to put any new code into the log
reservation side of things - all the triggers we need to handle this
in the AIL are already there...

> @@ -1558,8 +1563,15 @@ xlog_grant_push_ail(
>  	 * so that it doesn't change between the compare and the set.
>  	 */
>  	last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
> -	if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
> -		threshold_lsn = last_sync_lsn;
> +	if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0) {
> +		int		flushed = 0;
> +		xfs_lsn_t	force_lsn;
> +
> +		force_lsn = xlog_assign_lsn(log->l_prev_cycle,
> +					    log->l_prev_block);
> +		__xfs_log_force_lsn(log->l_mp, force_lsn, XFS_LOG_SYNC,
> +				    &flushed, true);

__xfs_log_force_lsn() takes a CIL lsn, not a log LSN, so this
doesn't work the way you thought it does. i.e. __xfs_log_force_lsn()
pushes on the CIL first, then syncs the iclog buffers. Indeed, if
this was to be used, we wouldn't even need XFS_LOG_SYNC, because all
we need to do is start the IO on all the iclog buffers, and the log
forces do that. XFS_LOG_SYNC() only waits for them all to complete,
it doesn't change the way they are written.

Largest problme with this is that the CIL flush is a blocking
operation. Same as if we just replaced this with xfs_log_force(0)
to just push the CIL and the iclogs to disk...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-27 23:10                 ` Dave Chinner
@ 2019-08-28 12:06                   ` Brian Foster
  0 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2019-08-28 12:06 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chandan Rajendra, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Wed, Aug 28, 2019 at 09:10:31AM +1000, Dave Chinner wrote:
> On Mon, Aug 26, 2019 at 08:39:34AM -0400, Brian Foster wrote:
> > On Mon, Aug 26, 2019 at 10:32:53AM +1000, Dave Chinner wrote:
> > > On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> > > > > Btw, the issue is still re-creatable after adding,
> > > > > 
> > > > > +               if ((log->l_flags & XLOG_RECOVERY_NEEDED)
> > > > > +                       && !xfs_ail_min(tp->t_mountp->m_ail))
> > > > > +                       tp->t_flags |= XFS_TRANS_SYNC;
> > > > > 
> > > > > ... to xfs_trans_reserve(). I didn't get a chance to debug it yet. I will try
> > > > > to find the root cause now.
> > > > > 
> > > > 
> > > > Dave, With the above changes made in xfs_trans_reserve(), mount task is
> > > > deadlocking due to the following,
> > > > 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> > > > flushed to the disk and hence log items to be ultimately moved to AIL.
> > > > 2. xfsaild task is woken up, which acts in items on AIL.
> > > > 3. After some time, we stop issuing synchronous transactions because AIL has
> > > >    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
> > > >    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
> > > >    evaluates to false on the first iteration of the while loop. This means we
> > > >    have a log item whose LSN is larger than xfs_ail->ail_target at the
> > > >    beginning of the AIL.
> > > 
> > > The push target for xlog_grant_push_ail() is to free 25% of the log
> > > space. So if all the items in the AIL are not within 25% of the tail
> > > end of the log, there's nothing for the AIL to push. This indicates
> > > that there is at least 25% of physical log space free.
> > > 
> > > I suspect that this means the CIL is overruning it's background push
> > > target by more than expected probably because the log is so small. That leads
> > > to the outstanding CIL pending commit size (the current CIL context
> > > and the previous CIL commit that is held off by the uncommited
> > > iclog) is greater than the AIL push target, and so nothing will free
> > > up more log space and wake up the transaction waiting for grant
> > > space.
> > > 
> > > e.g. the previous CIL context commit might take 15% of the log
> > > space, and the current CIL has reserved 11% of the log space.
> > > Now new transactions reservations have run out of grant space and we
> > > push on the ail, but it's lowest item is at 26%, and so the AIL push
> > > does nothing and we're stuck because the CIL has pinned 26% of the
> > > log space.
> > > 
> > 
> > This is pretty much the behavior I described in my previous mail. For
> > some unknown reason two background CIL pushes do not occur in enough
> > time before we exhaust log reservation and so all of our log reservation
> > ends up pinned by an iclog. I'd really like more data to explain what
> > might be going on in this case to prevent the background CIL push from
> > executing...
> 
> Oh, it doesn't take much when the log is small. In this case, it's
> 3600 blocks in size - ~14MB, so 1% of the log is 140kB and that's
> smaller than a single itruncate transaction reservation. The CIL
> push threshold is ~1.7MB, and the push threshold is ~3.5MB. IOWs,
> we might not push the cil the first time until it reaches 1.8MB
> because of the size of the last transaction that pushes it over.
> Then it gets pushed, leaving the last iclog waiting and pinning
> that part of the log.
> 

Yeah, that's the behavior I observed when testing with a ~14MB log
previously. The first push ends up right around 1.7-1.8MB of CIL, and
hence why I'm wondering where the extra delay comes from to the point of
locking things up.

> AFAICT, it's largely caused by the fact that transaction
> reservations require significant single digit percentages of the
> total log space. i.e. we're fine at 23%, next reservation requires
> 3% of the total log space, and then we're in trouble.
> 
> > > However, what I didn't take into account is that the one going to
> > > disk could stall indefinitely because we run out of log space before
> > > the current CIL context is pushed and there's nothing to trigger the
> > > CIL to be pushed to unwedge it.
> > > 
> > > There are a few solutions I can think of to this:
> > > 
> > > 	1 increase the xlog_grant_push_ail() push window to a larger
> > > 	  amount of the log for small-to-medium logs. This doesn't
> > > 	  work if the overrun is too large, but it should fix the
> > > 	  "off-by-one" type of hang.
> > > 
> > 
> > Rather than special case this to log sizes, I'm wondering if we can
> > update this function to incorporate log pressure built up in the in-core
> > log. The current code already stubs the calculated threshold_lsn to the
> > last commit record that made it to the log, so if reservation is pinned
> > in-core then just increasing the window may have no effect.
> 
> AFAICT, the problem is actually related to log sizes, so we have to
> consider that as an aspect of the problem...
> 

Sure, but we don't have a complete picture yet.

> > What I'm wondering is that if the log reservation is primarily in-core,
> > if this function could easily detect that scenario based on the
> > difference between the reservation grant head and the head/tail of the
> > physical log. I.e., rename this function to something like
> > xlog_grant_push() and do just about all of the same calculations, but
> > rather than skip out if the AIL threshold is too far beyond the on-disk
> > head, use that to push on the log (not the CIL) before setting the AIL
> > threshold.
> > 
> > I suppose we'd have to do a sync log push here in order to set a valid
> > AIL push target, but we're down in the grant wait path anyways so I
> > don't see the harm in waiting on a log I/O before waiting on a grant
> > head. It does look like we'd need to fix up the locking a bit though.
> 
> The problem with dropping the AIL lock and issuing and waiting for
> log IO in the grant head wait loop is that it subverts the FIFO
> waiting nature of the log reservations. i.e. first into the wait
> code is the first woken. If we put a blocking log force in here,
> the transaction reseration that triggers it now has to wait for
> that to complete before it goes onto the wait queue, and there might
> be a bunch more transactions that get in before it.
> 

This was just a quick hack to illustrate what I meant by connecting log
pressure back into the in-core log. The locking and scheduling are
incorrect because it's probably not safe to block on a log force while
the process is sitting on the reservation wait queue (where it is at
risk to be woken up) in the first place. That means a real
implementation would look quite different. It's kind of pointless to
assume existence of various runtime bugs without having a complete
implementation. I'd think preservation of reservation order would be an
obvious requirement for any change to this particular code.

> That also brings up the problem of thundering herd log forces.
> Every concurrent transaction reservation that sees this state will
> issue a log force before it goes to sleep waiting for space to be
> made available. And every time they are woken and there's no space
> available, they'll issue a log force. This will substantially
> perturb behaviour when we are running in constant tail-pushing
> workloads (i.e. all the time on small logs). That's why I suggested
> the XFS_TRANS_SYNC mechanism so it didn't perturb the non-blocking
> nature of the AIL push....
> 

See above. I don't have a strong preference for whether such logic lands
in the AIL or reservation blocking code because as I see it, the
resulting high level behavior needs to be the same either way. I don't
see the placement as a major impediment (so far), but the question to me
was more of what lends itself to a cleaner implementation.

> > Otherwise I think this is quite similar to the aforementioned diff.
> > Instead of checking for an empty AIL or special casing against log
> > recovery context or small logs, we directly calculate if some portion of
> > the log reservation requested by the caller is tied up in the in-core
> > log and push on that before we push on the AIL to ultimately free up the
> > space. Thoughts? An untested and very quickly hacked up diff is appended
> > below just to illustrate the idea. FWIW, this is also effectively
> > similar to something along the lines of allowing the excess target into
> > the AIL and letting the AIL use that to initiate a log force...
> 
> The difference is moving it to the AIL doesn't perturb the
> transaction reservation waiting behaviour at all, and it's largely
> indepedent on the log size, transaction reservation sizes, etc.
> 

If you'd prefer to just not mess with the transaction reservation code
in general due to risk/complexity, that's a different argument, but also
a reasonable enough one to me.

> Logic might be:
> 
> 	if the push target is approximately where the log grant tail
> 			pushing threshold should be; and

What do you mean by "where the log grant tail pushing threshold should
be?"

> 	if the AIL is empty or all the items in the AIL are > grant
> 			push threshold; and
> 	if we've got nothing else to do and will go to sleep
> 
> 		issue a log force
> 
> i.e. I don't think we need to put any new code into the log
> reservation side of things - all the triggers we need to handle this
> in the AIL are already there...
> 

I agree that this can be addressed in the AIL, but I'm not following the
logic above. I was expecting that we'd need to update the AIL push logic
because currently the push target can't be beyond the physical head
(where the reservation head being further out than that is what tracks
in-core reservation pressure). It's not clear to me if you're saying a
similar thing or using existing AIL state to infer the problematic
situation.

> > @@ -1558,8 +1563,15 @@ xlog_grant_push_ail(
> >  	 * so that it doesn't change between the compare and the set.
> >  	 */
> >  	last_sync_lsn = atomic64_read(&log->l_last_sync_lsn);
> > -	if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0)
> > -		threshold_lsn = last_sync_lsn;
> > +	if (XFS_LSN_CMP(threshold_lsn, last_sync_lsn) > 0) {
> > +		int		flushed = 0;
> > +		xfs_lsn_t	force_lsn;
> > +
> > +		force_lsn = xlog_assign_lsn(log->l_prev_cycle,
> > +					    log->l_prev_block);
> > +		__xfs_log_force_lsn(log->l_mp, force_lsn, XFS_LOG_SYNC,
> > +				    &flushed, true);
> 
> __xfs_log_force_lsn() takes a CIL lsn, not a log LSN, so this
> doesn't work the way you thought it does. i.e. __xfs_log_force_lsn()
> pushes on the CIL first, then syncs the iclog buffers. Indeed, if
> this was to be used, we wouldn't even need XFS_LOG_SYNC, because all
> we need to do is start the IO on all the iclog buffers, and the log
> forces do that. XFS_LOG_SYNC() only waits for them all to complete,
> it doesn't change the way they are written.
> 

Well I wasn't sure if using the l_prev_* variants was correct, but not
for the reason you state here. I think you're confusing
__xfs_log_force_lsn() with xfs_log_force_lsn(). The former doesn't touch
the CIL. In fact, the only thing __xfs_log_force_lsn() does with the LSN
is compare it to iclog header LSNs. xfs_log_force_lsn() pushes the CIL,
but it passes the return value of xlog_cil_force_lsn() (which is the
physical lsn of the iclog with the commit record) along to
__xfs_log_force_lsn().

Brian

> Largest problme with this is that the CIL flush is a blocking
> operation. Same as if we just replaced this with xfs_log_force(0)
> to just push the CIL and the iclogs to disk...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-26  0:32             ` Dave Chinner
  2019-08-26 12:39               ` Brian Foster
@ 2019-08-29  5:21               ` Chandan Rajendra
  2019-08-29 23:08                 ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-29  5:21 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Monday, August 26, 2019 6:02 AM Dave Chinner wrote: 
> On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> > On Friday, August 23, 2019 7:08 PM Chandan Rajendra wrote:
> > > On Friday, August 23, 2019 6:38 PM Brian Foster wrote:
> > > > On Fri, Aug 23, 2019 at 10:06:36AM +1000, Dave Chinner wrote:
> > > > > On Thu, Aug 22, 2019 at 12:34:46PM -0400, Brian Foster wrote:
> > > > > > On Thu, Aug 22, 2019 at 08:18:34AM +1000, Dave Chinner wrote:
> > > > > > > On Wed, Aug 21, 2019 at 04:34:48PM +0530, Chandan Rajendra wrote:
> > > > > > > > The following call trace is seen when executing generic/530 on a ppc64le
> > > > > > > > machine,
> > > > > > > > 
> > > > > > > > INFO: task mount:7722 blocked for more than 122 seconds.
> > > > > > > >       Not tainted 5.3.0-rc1-next-20190723-00001-g1867922e5cbf-dirty #6
> > > > > > > 
> > > > > > > can you reproduce this on 5.3-rc5? There were bugs in log recovery
> > > > > > > IO in -rc1 that could result in things going wrong...
> > > > > > > 
> > > > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> > > > > > > > mount           D 8448  7722   7490 0x00040008
> > > > > > > > Call Trace:
> > > > > > > > [c000000629343210] [0000000000000001] 0x1 (unreliable)
> > > > > > > > [c0000006293433f0] [c000000000021acc] __switch_to+0x2ac/0x490
> > > > > > > > [c000000629343450] [c000000000fbbbf4] __schedule+0x394/0xb50
> > > > > > > > [c000000629343510] [c000000000fbc3f4] schedule+0x44/0xf0
> > > > > > > > [c000000629343540] [c0000000007623b4] xlog_grant_head_wait+0x84/0x420
> > > > > > > > [c0000006293435b0] [c000000000762828] xlog_grant_head_check+0xd8/0x1e0
> > > > > > > > [c000000629343600] [c000000000762f6c] xfs_log_reserve+0x26c/0x310
> > > > > > > > [c000000629343690] [c00000000075defc] xfs_trans_reserve+0x28c/0x3e0
> > > > > > > > [c0000006293436e0] [c0000000007606ac] xfs_trans_alloc+0xfc/0x2f0
> > > > > > > > [c000000629343780] [c000000000749ca8] xfs_inactive_ifree+0x248/0x2a0
> > > > > > > > [c000000629343810] [c000000000749e58] xfs_inactive+0x158/0x300
> > > > > > > > [c000000629343850] [c000000000758554] xfs_fs_destroy_inode+0x104/0x3f0
> > > > > > > > [c000000629343890] [c00000000046850c] destroy_inode+0x6c/0xc0
> > > > > > > > [c0000006293438c0] [c00000000074c748] xfs_irele+0x168/0x1d0
> > > > > > > > [c000000629343900] [c000000000778c78] xlog_recover_process_one_iunlink+0x118/0x1e0
> > > > > > > > [c000000629343960] [c000000000778e10] xlog_recover_process_iunlinks+0xd0/0x130
> > > > > > > > [c0000006293439b0] [c000000000782408] xlog_recover_finish+0x58/0x130
> > > > > > > > [c000000629343a20] [c000000000763818] xfs_log_mount_finish+0xa8/0x1d0
> > > > > > > > [c000000629343a60] [c000000000750908] xfs_mountfs+0x6e8/0x9e0
> > > > > > > > [c000000629343b20] [c00000000075a210] xfs_fs_fill_super+0x5a0/0x7c0
> > > > > > > > [c000000629343bc0] [c00000000043e7fc] mount_bdev+0x25c/0x2a0
> > > > > > > > [c000000629343c60] [c000000000757c48] xfs_fs_mount+0x28/0x40
> > > > > > > > [c000000629343c80] [c0000000004956cc] legacy_get_tree+0x4c/0xb0
> > > > > > > > [c000000629343cb0] [c00000000043d690] vfs_get_tree+0x50/0x160
> > > > > > > > [c000000629343d30] [c0000000004775d4] do_mount+0xa14/0xc20
> > > > > > > > [c000000629343db0] [c000000000477d48] ksys_mount+0xc8/0x180
> > > > > > > > [c000000629343e00] [c000000000477e20] sys_mount+0x20/0x30
> > > > > > > > [c000000629343e20] [c00000000000b864] system_call+0x5c/0x70
> > > > > > > > 
> > > > > > > > i.e. the mount task gets hung indefinitely due to the following sequence
> > > > > > > > of events,
> > > > > > > > 
> > > > > > > > 1. Test creates lots of unlinked temp files and then shutsdown the
> > > > > > > >    filesystem.
> > > > > > > > 2. During mount, a transaction started in the context of processing
> > > > > > > >    unlinked inode list causes several iclogs to be filled up. All but
> > > > > > > >    the last one is submitted for I/O.
> > > > > > > > 3. After writing XLOG_COMMIT_TRANS record into the iclog, we will have
> > > > > > > >    18532 bytes of free space in the last iclog of the transaction which is
> > > > > > > >    greater than 2*sizeof(xlog_op_header_t). Hence
> > > > > > > >    xlog_state_get_iclog_space() does not switch over to using a newer iclog.
> > > > > > > > 4. Meanwhile, the endio code processing iclogs of the transaction do not
> > > > > > > >    insert items into the AIL since the iclog containing XLOG_COMMIT_TRANS
> > > > > > > >    hasn't been submitted for I/O yet. Hence a major part of the on-disk
> > > > > > > >    log cannot be freed yet.
> > > > > > > 
> > > > > > > So all those items are still pinned in memory.
> > > > > > > 
> > > > > > > > 5. A new request for log space (via xfs_log_reserve()) will now wait
> > > > > > > >    indefinitely for on-disk log space to be freed.
> > > > > > > 
> > > > > > > Because nothing has issued a xfs_log_force() for write the iclog to
> > > > > > > disk, unpin the objects that it pins in memory, and allow the tail
> > > > > > > to be moved forwards.
> > > > > > > 
> > > > > > > The xfsaild normally takes care of thisi - it gets pushed byt the
> > > > > > > log reserve when there's not enough space to in the log for the
> > > > > > > transaction before transaction reserve goes to sleep in
> > > > > > > xlog_grant_head_wait(). The AIL pushing code is then responsible for
> > > > > > > making sure log space is eventually freed. It will issue log forces
> > > > > > > if it isn't making progress and so this problem shouldn't occur.
> > > > > > > 
> > > > > > > So, why has it occurred?
> > > > > > > 
> > > > > > > The xfsaild kthread should be running at this point, so if it was
> > > > > > > pushed it should be trying to empty the journal to move the tail
> > > > > > > forward. Why hasn't it issue a log force?
> > > > > > > 
> > > > > > > 
> > > > > > > > To fix this issue, before waiting for log space to be freed, this commit
> > > > > > > > now submits xlog->l_iclog for write I/O if iclog->ic_state is
> > > > > > > > XLOG_STATE_ACTIVE and iclog has metadata written into it. This causes
> > > > > > > > AIL list to be populated and a later call to xlog_grant_push_ail() will
> > > > > > > > free up the on-disk log space.
> > > > > > > 
> > > > > > > hmmm.
> > > > > > > 
> > > > > > > > Signed-off-by: Chandan Rajendra <chandanrlinux@gmail.com>
> > > > > > > > ---
> > > > > > > >  fs/xfs/xfs_log.c | 21 +++++++++++++++++++++
> > > > > > > >  1 file changed, 21 insertions(+)
> > > > > > > > 
> > > > > > > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > > > > > > index 00e9f5c388d3..dc785a6b9f47 100644
> > > > > > > > --- a/fs/xfs/xfs_log.c
> > > > > > > > +++ b/fs/xfs/xfs_log.c
> > > > > > > > @@ -236,11 +236,32 @@ xlog_grant_head_wait(
> > > > > > > >  	int			need_bytes) __releases(&head->lock)
> > > > > > > >  					    __acquires(&head->lock)
> > > > > > > >  {
> > > > > > > > +	struct xlog_in_core	*iclog;
> > > > > > > > +
> > > > > > > >  	list_add_tail(&tic->t_queue, &head->waiters);
> > > > > > > >  
> > > > > > > >  	do {
> > > > > > > >  		if (XLOG_FORCED_SHUTDOWN(log))
> > > > > > > >  			goto shutdown;
> > > > > > > > +
> > > > > > > > +		if (xfs_ail_min(log->l_ailp) == NULL) {
> > > > > > > 
> > > > > > > This is indicative of the situation. If the AIL is empty, and the
> > > > > > > log does not have room for an entire transaction reservation, then
> > > > > > > we need to be issuing synchronous transactions in recovery until
> > > > > > > such time the AIL pushing can actually function correctly to
> > > > > > > guarantee forwards progress for async transaction processing.
> > > > > > > 
> > > > > > 
> > > > > > Hmm, I don't think that addresses the fundamental problem. This
> > > > > > phenomenon doesn't require log recovery. The same scenario can present
> > > > > > itself after a clean mount or from an idle fs. I think the scenario that
> > > > > > plays out here, at a high level, is as follows:
> > > > > > 
> > > > > 
> > > > >   - mount
> > > > >   - [log recovery]
> > > > >   - xfs_log_mount_finish
> > > > >     - calls xfs_log_work_queue()
> > > > > 
> > > > > > - Heavy transaction workload commences. This continuously acquires log
> > > > > >   reservation and transfers it to the CIL as transactions commit.
> > > > > > - The CIL context grows until we cross the background threshold, at
> > > > > >   which point we schedule a background push.
> > > > > > - Background CIL push cycles the current context into the log via the
> > > > > >   iclog buffers. The commit record stays around in-core because the last
> > > > > >   iclog used for the CIL checkpoint isn't full. Hence, none of the
> > > > > >   associated log items make it into the AIL and the background CIL push
> > > > > >   had no effect with respect to freeing log reservation.
> > > > > > - The same transaction workload is still running and filling up the next
> > > > > >   CIL context. If we run out of log reservation before a second
> > > > > >   background CIL push comes along, we're basically stuck waiting on
> > > > > >   somebody to force the log.
> > > > > 
> > > > > - every 30s xfs_log_worker() runs, sees the log dirty, triggers
> > > > >   a log force. pending commit is flushed to log, dirty objects get
> > > > >   moved to AIL, then xfs_log_worker() pushes on the AIL to do
> > > > >   periodic background metadata writeback.
> > > > > 
> > > > > > The things that prevent this at normal runtime are timely execution of
> > > > > > background CIL pushes and the background log worker. If for some reason
> > > > > > the background CIL push is not timely enough that we consume all log
> > > > > > reservation before two background CIL pushes occur from the time the
> > > > > > racing workload begins (i.e. starting from an idle system such that the
> > > > > > AIL is empty), then we're stuck waiting on the background log worker to
> > > > > > force the log from the first background CIL push, populate the AIL and
> > > > > > get things moving again.
> > > > > 
> > > > > Right, this does not deadlock - it might pause for a short while
> > > > > while waiting for the log worker to run and issue a log force. I
> > > > > have never actually seen it happen in all my years of "mkfs; mount;
> > > > > fsmark" testing that places a /massive/ metadata modification
> > > > > workload on a pristine, newly mounted filesystems....
> > > > > 
> > > > 
> > > > Neither have I. That's why I'm asking for more details on how this
> > > > triggers. :) Whatever factors contribute to this particular instance may
> > > > or may not apply outside of log recovery context.
> > > > 
> > > > > As it is, we've always used the log worker as a watchdog in this
> > > > > way. The fact is that we have very few situations left in the code
> > > > > where it needs to act as a watchdog - delayed logging actually
> > > > > negated the vast majority of problems that required the periodic log
> > > > > force to get out of trouble because individual transactions no
> > > > > longer needed to wait on iclog space to make progress...
> > > > > 
> > > > 
> > > > Yes, the description I wrote up already explains that the background log
> > > > worker changes the symptom from a deadlock to a stall. The defaults
> > > > allow for a stall of up to 30s, but custom configuration allows for a
> > > > stall of up to 2 hours.
> > > > 
> > > > > > IOW, the same essential problem is reproducible outside of log
> > > > > > recovery in the form of stalls as opposed to deadlocks via an
> > > > > > artificial background CIL push delay (i.e., think workqueue or
> > > > > > xc_cil_ctx lock starvation) and an elevated xfssyncd_centisecs.
> > > > > > We aren't stuck forever because the background log worker will run
> > > > > > eventually, but it could certainly be a dead stall of minutes
> > > > > > before that occurs.
> > > > > 
> > > > > I don't think addressing it in xlog_grant_head_wait() fixes the
> > > > > problem fully, either.  If no other transaction comes in, then the
> > > > > ones already blocked (because the AIL was not empty when they tried
> > > > > to reserve space) will end up still blocked because nothing has
> > > > > kicked the code in the transaction reservation code. So putting the
> > > > > log force into the grant head wait code is not sufficient by itself.
> > > > > 
> > > > > > I think this
> > > > > > could still be addressed at transaction commit or reservation time, but
> > > > > > I think the logic needs to be more generic and based on log reservation
> > > > > > pressure rather than the context from which this particular test happens
> > > > > > to reproduce.
> > > > > 
> > > > > Log reservation pressure is what xlog_grant_push_ail() detects and
> > > > > that pressure is transferred to the AIL pushing code to clean dirty
> > > > > log items and move the tail of the log forward. It's right there
> > > > > where Chandan added the log force. :)
> > > > > 
> > > > > IOWs, xlog_grant_push_ail() tells the xfsaild how much log space to
> > > > > make available. If the log is full, then xlog_grant_push_ail() will
> > > > > already be telling the AIL to push and will be waking it up.
> > > > > However, the aild will see the AIL empty and go right back to sleep.
> > > > > That's likely the runtime problem here - the mechanism that pushes
> > > > > the log tail forwards is not realising that the log needs pushing
> > > > > via a log force.
> > > > > 
> > > > > IOWs, I suspect the xfsaild is the right place to take the action,
> > > > > because AIL pushing is triggered by much more than just log
> > > > > reservations. It gets kicked by memory reclaim, the log worker, new
> > > > > transactions, etc and so if a transaction doesn't kick it when it
> > > > > gest stuck like this, something else will.
> > > > > 
> > > > 
> > > > Perhaps, I could see that as an option too. TBH, it's not clear to me
> > > > where the best place to fix this is. I want to know more about the
> > > > reproducer first. My comment above was just that I'm not convinced the
> > > > previously proposed logic to force based on log recovery context is
> > > > correct.
> > > > 
> > > > > > If this is all on the right track, I'm still curious if/how you're
> > > > > > getting into a situation where all log reservation is held up in the CIL
> > > > > > before a couple background pushes occur.
> > > > > 
> > > > > I'd guess that it could be reproduced via a single CPU machine and
> > > > > non-preempt kernel. We've already replayed all the unlink
> > > > > transactions, so the buffer/inode caches are fully primed. If all
> > > > > unlink removal transactions the buffer/inode cache, then it won't
> > > > > block anywhere and will never yield the CPU. Hence the CIL push
> > > > > kworker thread doesn't get to run before the unlinks run the log out
> > > > > of space.
> > > > > 
> > > > 
> > > > Yeah, I wouldn't expect that for a ppc64 system, but I suppose this
> > > > could be a VM or something too. Chandan?
> > > >
> > > 
> > > Indeed, This is a KVM guest on a ppc64le machine. The guest has 8 vcpus
> > > allocated to it. Also, Linux on ppc64le is by default configured to be
> > > non-preemptable.
> > > 
> > > Btw, the issue is still re-creatable after adding,
> > > 
> > > +               if ((log->l_flags & XLOG_RECOVERY_NEEDED)
> > > +                       && !xfs_ail_min(tp->t_mountp->m_ail))
> > > +                       tp->t_flags |= XFS_TRANS_SYNC;
> > > 
> > > ... to xfs_trans_reserve(). I didn't get a chance to debug it yet. I will try
> > > to find the root cause now.
> > > 
> > 
> > Dave, With the above changes made in xfs_trans_reserve(), mount task is
> > deadlocking due to the following,
> > 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> > flushed to the disk and hence log items to be ultimately moved to AIL.
> > 2. xfsaild task is woken up, which acts in items on AIL.
> > 3. After some time, we stop issuing synchronous transactions because AIL has
> >    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
> >    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
> >    evaluates to false on the first iteration of the while loop. This means we
> >    have a log item whose LSN is larger than xfs_ail->ail_target at the
> >    beginning of the AIL.
> 
> The push target for xlog_grant_push_ail() is to free 25% of the log
> space. So if all the items in the AIL are not within 25% of the tail
> end of the log, there's nothing for the AIL to push. This indicates
> that there is at least 25% of physical log space free.

Sorry for the late response. I was trying to understand the code flow.

Here is a snippet of perf trace explaining what is going on,

	 760881:           mount  8654 [002]   216.813041:                         probe:xlog_grant_push_ail: (c000000000765864) comm="xfsaild/loop1" threshold_cycle_s32=3 threshold_block_s32=3970 need_bytes_s32=389328 last_sync_cycle_u32=2 last_sync_block_u32=19330 free_threshold_s32=5120 free_bytes_s32=383756 free_blocks_s32=749 l_logsize=10485760 reserve_cycle_s32=3 reserve_block_s32=9513204(~18580 blocks) tail_cycle_s32=2 tail_block_s32=19330
	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
	 786577: kworker/4:1H-kb  1825 [004]   217.041087:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
	 793653:   xfsaild/loop1  8661 [004]   265.407708:                probe:xfsaild_push_last_pushed_lsn: (c000000000784644) comm="xfsaild/loop1" cycle_lsn_u32=0 block_lsn_u32=0 target_cycle_lsn_u32=2 target_block_lsn_u32=19330
	 793654:   xfsaild/loop1  8661 [004]   265.407717:              probe:xfsaild_push_min_lsn_less_than: (c0000000007846a0) comm="xfsaild/loop1" less_than_s32=0 cycle_lsn_u32=2 block_lsn_u32=19333 lip_x64=0xc000000303fb4a48

... Explaination follows,

760881:           mount  8654 [002]   216.813041:                         probe:xlog_grant_push_ail: (c000000000765864) comm="xfsaild/loop1" threshold_cycle_s32=3 threshold_block_s32=3970 need_bytes_s32=389328 last_sync_cycle_u32=2 last_sync_block_u32=19330 free_threshold_s32=5120 free_bytes_s32=383756 free_blocks_s32=749 l_logsize=10485760 reserve_cycle_s32=3 reserve_block_s32=9513204(~18580 blocks) tail_cycle_s32=2 tail_block_s32=19330
The above trace data shows that threshold_lsn is set to wrap around the
log. However since the last synced lsn has a much lesser value,
xlog_grant_push_ail() will set threshold_lsn to last_sync_lsn. This ends up
being the value of xfsaild_push()'s target.


786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
An item in AIL was relogged and hence the item was moved to the tail end of
AIL. Since the head of AIL has changed, the tail_lsn's value is updated.

793653:   xfsaild/loop1  8661 [004]   265.407708:                probe:xfsaild_push_last_pushed_lsn: (c000000000784644) comm="xfsaild/loop1" cycle_lsn_u32=0 block_lsn_u32=0 target_cycle_lsn_u32=2 target_block_lsn_u32=19330
Since the target LSN wasn't updated, we still have the old value.

793654:   xfsaild/loop1  8661 [004]   265.407717:              probe:xfsaild_push_min_lsn_less_than: (c0000000007846a0) comm="xfsaild/loop1" less_than_s32=0 cycle_lsn_u32=2 block_lsn_u32=19333 lip_x64=0xc000000303fb4a48
The lip at the head of AIL has an LSN greater than target.

xfs_info from the corresponding disk,
# xfs_info /dev/loop1
meta-data=/dev/loop1             isize=512    agcount=4, agsize=159648 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=638592, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

i.e the log size was 2560 * 4096 = 10485760 bytes.

> 
> I suspect that this means the CIL is overruning it's background push
> target by more than expected probably because the log is so small. That leads
> to the outstanding CIL pending commit size (the current CIL context
> and the previous CIL commit that is held off by the uncommited
> iclog) is greater than the AIL push target, and so nothing will free
> up more log space and wake up the transaction waiting for grant
> space.
> 
> e.g. the previous CIL context commit might take 15% of the log
> space, and the current CIL has reserved 11% of the log space.
> Now new transactions reservations have run out of grant space and we
> push on the ail, but it's lowest item is at 26%, and so the AIL push
> does nothing and we're stuck because the CIL has pinned 26% of the
> log space.
> 
> As a test, can you run the test with larger log sizes? I think
> the default used was about ~3600 blocks, so it you step that up by
> 500 blocks at a time we should get an idea of the size of the
> overrun by the size of the log where the hang goes away. A
> trace of the transaction reservations and AIL pushing would also be
> insightful.

After increasing the log size to 4193 blocks (i.e. 4193 * 4k = 17174528
bytes) and also the patch applied, I don't see the dead lock happening.

Please let me know if the trace provided above isn't sufficient and you need
a traces for the "larger log size" case also.

Meanwhile, I am planning to read more code to map the explaination
provided below.

> 
> FWIW, when I originally sized the CIL space, I did actually try to
> take this into account. The CIL space is "log size >> 3", which is
> 12.5% of the log, or exactly half the space of the
> xlog_grant_push_ail() target window. Hence the space pinned by two
> CIL pushes (on on it's way to disk and one filling) should be
> covered by the 25% threshold in xlog_grant_push_ail()
> 
> However, what I didn't take into account is that the one going to
> disk could stall indefinitely because we run out of log space before
> the current CIL context is pushed and there's nothing to trigger the
> CIL to be pushed to unwedge it.
> 
> There are a few solutions I can think of to this:
> 
> 	1 increase the xlog_grant_push_ail() push window to a larger
> 	  amount of the log for small-to-medium logs. This doesn't
> 	  work if the overrun is too large, but it should fix the
> 	  "off-by-one" type of hang.
> 
> 	2 reduce the CIL background push threshold. I already have
> 	  patches to do this for large logs (part of the
> 	  non-blocking shrinker work) but for small logs this is
> 	  somewhat equivalent to #1. I think #1 is a better option
> 	  for small logs they are bound on metadata writeback,
> 	  anyway, so starting writeback earlier isn't a big deal.
> 
> 	3 have the aild push the CIL if it's woken and there is
> 	  nothing to push on (i.e. empty or everything > target).
> 	  This may be too agressive, so it might be a case of "if
> 	  the push target changed and there's still nothing to do".
> 
> 	4 have the aild force the log if it's woken and there is
> 	  nothing to push on
> 
> 	5 xlog_grant_push_ail() trigger background CIL pushes when
> 	  the space required is not available at all (i.e.
> 	  need_bytes > xlog_space_left(log, &log->l_reserve_head.grant)
> 	  as that is when we are about to sleep). Hence the CIL will
> 	  always get pushed when we are completely out of log grant
> 	  space.
> 
> If it is simply and overrun accounting issue, then I suspect #1 or
> #5 are the most side-effect free solutions to the problem. the
> others are likely to have performance degradations in unpredictable
> corner cases...
> 
> Cheers,
> 
> Dave.
> 
> 
> 


-- 
chandan




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-29  5:21               ` Chandan Rajendra
@ 2019-08-29 23:08                 ` Dave Chinner
  2019-08-30  0:34                   ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2019-08-29 23:08 UTC (permalink / raw)
  To: Chandan Rajendra
  Cc: Brian Foster, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> On Monday, August 26, 2019 6:02 AM Dave Chinner wrote: 
> > On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> > > On Friday, August 23, 2019 7:08 PM Chandan Rajendra wrote:
> > > 
> > > Dave, With the above changes made in xfs_trans_reserve(), mount task is
> > > deadlocking due to the following,
> > > 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> > > flushed to the disk and hence log items to be ultimately moved to AIL.
> > > 2. xfsaild task is woken up, which acts in items on AIL.
> > > 3. After some time, we stop issuing synchronous transactions because AIL has
> > >    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
> > >    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
> > >    evaluates to false on the first iteration of the while loop. This means we
> > >    have a log item whose LSN is larger than xfs_ail->ail_target at the
> > >    beginning of the AIL.
> > 
> > The push target for xlog_grant_push_ail() is to free 25% of the log
> > space. So if all the items in the AIL are not within 25% of the tail
> > end of the log, there's nothing for the AIL to push. This indicates
> > that there is at least 25% of physical log space free.
> 
> Sorry for the late response. I was trying to understand the code flow.
> 
> Here is a snippet of perf trace explaining what is going on,
> 
> 	 760881:           mount  8654 [002]   216.813041:                         probe:xlog_grant_push_ail: (c000000000765864) comm="xfsaild/loop1" threshold_cycle_s32=3 threshold_block_s32=3970 need_bytes_s32=389328 last_sync_cycle_u32=2 last_sync_block_u32=19330 free_threshold_s32=5120 free_bytes_s32=383756 free_blocks_s32=749 l_logsize=10485760 reserve_cycle_s32=3 reserve_block_s32=9513204(~18580 blocks) tail_cycle_s32=2 tail_block_s32=19330

So this looks like last_sync_lsn is 2/19330, and the transaction
reservation is ~380kB, or close on 3% of the log. The reserve grant
head is at 3/18580, so we're ~700 * 512 = ~350kB of reservation
remaining. Yup, so we are definitely in the "go to sleep and wait"
situation here.

> 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501

200ms later the tail has moved, and last_sync_lsn is now 3/18501.
i.e. the iclog writes have made it to disk, and the items have been
moved into the AIL. I don't know where that came from, but I'm
assuming it's an IO completion based on it being run from a
kworker context that doesn't have an "xfs-" name prefix(*).

As the tail has moved, this should have woken the anything sleeping
on the log tail in xlog_grant_head_wait() via a call to
xfs_log_space_wake(). The first waiter should wake, see that there
still isn't room in the log (only 3 sectors were freed in the log,
we need at least 60). That woken process should then run
xlog_grant_push_ail() again and go back to sleep.

(*) I have a patch that shortens "s/kworker/kw/" so that you can
actually see the name of the kworker in the 16 byte field we have
for the task name. We really should just increase current->comm to
32 bytes.

> 	 786577: kworker/4:1H-kb  1825 [004]   217.041087:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> 	 793653:   xfsaild/loop1  8661 [004]   265.407708:                probe:xfsaild_push_last_pushed_lsn: (c000000000784644) comm="xfsaild/loop1" cycle_lsn_u32=0 block_lsn_u32=0 target_cycle_lsn_u32=2 target_block_lsn_u32=19330
> 	 793654:   xfsaild/loop1  8661 [004]   265.407717:              probe:xfsaild_push_min_lsn_less_than: (c0000000007846a0) comm="xfsaild/loop1" less_than_s32=0 cycle_lsn_u32=2 block_lsn_u32=19333 lip_x64=0xc000000303fb4a48

Ans some 40s later the xfsaild is woken by something, sees there's
nothing to do, and goes back to sleep. I don't see the process
sleeping on the grant head being ever being woken and calling
xlog_grant_push_ail(), which would see the new last_sync_lsn and
move the push target....

From this trace, it looks like the problem here is a missing or
incorrectly processed wakeup when the log tail moves.

Unfortunately, you haven't used the built in trace points for
debugging log space hangs so I can't tell anything more than this.
i.e the trace we need contains these build in tracepoints:

# trace-cmd record -e xfs_log\* -e xfs_ail\* sleep 120 &
# <run workload that hangs within 120s>

<wait for trace-cmd to exit>
# trace-cmd report | gzip > trace.txt.gz

as that will record all transaction reservations, grant head
manipulations, changes to the tail lsn, when processes sleep on the
grant head and are worken, AIL insert/move/delete, etc.

This will generate a -lot- of data. I often generate and analyse
traces in the order of tens of GBs of events to track down issues
like this, because the problem is often only seen in a single trace
event in amongst the millions that are recorded....

And if we need more info, then we add the appropriate tracepoints
into xlog_grant_push_ail, xfsaild_push, etc under those tracepoint
namespaces, so next time we have a problem we don't ahve to write
custom tracepoints.....

> i.e the log size was 2560 * 4096 = 10485760 bytes.

The default minimum size.

> > I suspect that this means the CIL is overruning it's background push
> > target by more than expected probably because the log is so small. That leads
> > to the outstanding CIL pending commit size (the current CIL context
> > and the previous CIL commit that is held off by the uncommited
> > iclog) is greater than the AIL push target, and so nothing will free
> > up more log space and wake up the transaction waiting for grant
> > space.
> > 
> > e.g. the previous CIL context commit might take 15% of the log
> > space, and the current CIL has reserved 11% of the log space.
> > Now new transactions reservations have run out of grant space and we
> > push on the ail, but it's lowest item is at 26%, and so the AIL push
> > does nothing and we're stuck because the CIL has pinned 26% of the
> > log space.
> > 
> > As a test, can you run the test with larger log sizes? I think
> > the default used was about ~3600 blocks, so it you step that up by
> > 500 blocks at a time we should get an idea of the size of the
> > overrun by the size of the log where the hang goes away. A
> > trace of the transaction reservations and AIL pushing would also be
> > insightful.
> 
> After increasing the log size to 4193 blocks (i.e. 4193 * 4k = 17174528
> bytes) and also the patch applied, I don't see the dead lock happening.

Likely because now the 380k transaction reservation is only 2% of the
log instead of close to 4% of the log, and so the overrun isn't
large enough to trigger whatever wakeup issue we have....

> Meanwhile, I am planning to read more code to map the explaination
> provided below.

Can you get a complete trace (as per above) of a hang? we're going
to need that trace to validate any analysis you do yourself,
anyway...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-29 23:08                 ` Dave Chinner
@ 2019-08-30  0:34                   ` Dave Chinner
  2019-08-30  2:13                     ` Dave Chinner
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2019-08-30  0:34 UTC (permalink / raw)
  To: Chandan Rajendra
  Cc: Brian Foster, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Fri, Aug 30, 2019 at 09:08:17AM +1000, Dave Chinner wrote:
> On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> > 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> 
> 200ms later the tail has moved, and last_sync_lsn is now 3/18501.
> i.e. the iclog writes have made it to disk, and the items have been
> moved into the AIL. I don't know where that came from, but I'm
> assuming it's an IO completion based on it being run from a
> kworker context that doesn't have an "xfs-" name prefix(*).
> 
> As the tail has moved, this should have woken the anything sleeping
> on the log tail in xlog_grant_head_wait() via a call to
> xfs_log_space_wake(). The first waiter should wake, see that there
> still isn't room in the log (only 3 sectors were freed in the log,
> we need at least 60). That woken process should then run
> xlog_grant_push_ail() again and go back to sleep.

Actually, it doesn't get woken because xlog_grant_head_wake() checks
how much space is available before waking waiters, and there clearly
isn't enough here. So that's one likely vector. Can you try this
patch?

-Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: push the AIL in xlog_grant_head_wake

From: Dave Chinner <dchinner@redhat.com>

In the situation where the log is full and the CIL has not recently
flushed, the AIL push threshold is throttled back to the where the
last write of the head of the log was completed. This is stored in
log->l_last_sync_lsn. Hence if the CIL holds > 25% of the log space
pinned by flushes and/or aggregation in progress, we can get the
situation where the head of the log lags a long way behind the
reservation grant head.

When this happens, the AIL push target is trimmed back from where
the reservation grant head wants to push the log tail to, back to
where the head of the log currently is. This means the push target
doesn't reach far enough into the log to actually move the tail
before the transaction reservation goes to sleep.

When the CIL push completes, it moves the log head forward such that
the AIL push target can now be moved, but that has no mechanism for
puhsing the log tail. Further, if the next tail movement of the log
is not large enough wake the waiter (i.e. still not enough space for
it to have a reservation granted), we don't wake anything up, and
hence we do not update the AIL push target to take into account the
head of the log moving and allowing the push target to be moved
forwards.

To avoid this particular condition, if we fail to wake the first
waiter on the grant head because we don't have enough space,
push on the AIL again. This will pick up any movement of the log
head and allow the push target to move forward due to completion of
CIL pushing.

XXX: this does not address the situation where there are no tail
updates after the log head moves forward. There is currently no
infrastructure for a log head update to trigger an AIL tail push if
necessary.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index b159a9e9fef0..941f10ff99d9 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -214,15 +214,20 @@ xlog_grant_head_wake(
 {
 	struct xlog_ticket	*tic;
 	int			need_bytes;
+	bool			woken_task = false;
 
 	list_for_each_entry(tic, &head->waiters, t_queue) {
 		need_bytes = xlog_ticket_reservation(log, head, tic);
-		if (*free_bytes < need_bytes)
+		if (*free_bytes < need_bytes) {
+			if (!woken_task)
+				xlog_grant_push_ail(log, need_bytes);
 			return false;
+		}
 
 		*free_bytes -= need_bytes;
 		trace_xfs_log_grant_wake_up(log, tic);
 		wake_up_process(tic->t_task);
+		woken_task = true;
 	}
 
 	return true;

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-30  0:34                   ` Dave Chinner
@ 2019-08-30  2:13                     ` Dave Chinner
  2019-08-30  6:05                       ` Chandan Rajendra
  2019-08-30 17:24                       ` Brian Foster
  0 siblings, 2 replies; 25+ messages in thread
From: Dave Chinner @ 2019-08-30  2:13 UTC (permalink / raw)
  To: Chandan Rajendra
  Cc: Brian Foster, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Fri, Aug 30, 2019 at 10:34:41AM +1000, Dave Chinner wrote:
> On Fri, Aug 30, 2019 at 09:08:17AM +1000, Dave Chinner wrote:
> > On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> > > 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > 
> > 200ms later the tail has moved, and last_sync_lsn is now 3/18501.
> > i.e. the iclog writes have made it to disk, and the items have been
> > moved into the AIL. I don't know where that came from, but I'm
> > assuming it's an IO completion based on it being run from a
> > kworker context that doesn't have an "xfs-" name prefix(*).
> > 
> > As the tail has moved, this should have woken the anything sleeping
> > on the log tail in xlog_grant_head_wait() via a call to
> > xfs_log_space_wake(). The first waiter should wake, see that there
> > still isn't room in the log (only 3 sectors were freed in the log,
> > we need at least 60). That woken process should then run
> > xlog_grant_push_ail() again and go back to sleep.
> 
> Actually, it doesn't get woken because xlog_grant_head_wake() checks
> how much space is available before waking waiters, and there clearly
> isn't enough here. So that's one likely vector. Can you try this
> patch?

And this one on top to address the situation the previous patch
doesn't....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

xfs: push the grant head when the log head moves forward

From: Dave Chinner <dchinner@redhat.com>

When the log fills up, we can get into the state where the
outstanding items in the CIL being committed and aggregated are
larger than the range that the reservation grant head tail pushing
will attempt to clean. This can result in the tail pushing range
being trimmed back to the the log head (l_last_sync_lsn) and so
may not actually move the push target at all.

When the iclogs associated with the CIL commit finally land, the
log head moves forward, and this removes the restriction on the AIL
push target. However, if we already have transactions sleeping on
the grant head, and there's nothing in the AIL still to flush from
the current push target, then nothing will move the tail of the log
and trigger a log reservation wakeup.

Hence the there is nothing that will trigger xlog_grant_push_ail()
to recalculate the AIL push target and start pushing on the AIL
again to write back the metadata objects that pin the tail of the
log and hence free up space and allow the transaction reservations
to be woken and make progress.

Hence we need to push on the grant head when we move the log head
forward, as this may be the only trigger we have that can move the
AIL push target forwards in this situation.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c | 65 ++++++++++++++++++++++++++++++++++----------------------
 1 file changed, 40 insertions(+), 25 deletions(-)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 941f10ff99d9..ce46bb244442 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -2612,6 +2612,45 @@ xlog_get_lowest_lsn(
 	return lowest_lsn;
 }
 
+
+/*
+ * Completion of a iclog IO does not imply that a transaction has completed, as
+ * transactions can be large enough to span many iclogs. We cannot change the
+ * tail of the log half way through a transaction as this may be the only
+ * transaction in the log and moving the tail to point to the middle of it
+ * will prevent recovery from finding the start of the transaction. Hence we
+ * should only update the last_sync_lsn if this iclog contains transaction
+ * completion callbacks on it.
+ *
+ * We have to do this before we drop the icloglock to ensure we are the only one
+ * that can update it.
+ *
+ * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
+ * the reservation grant head pushing. This is due to the fact that the push
+ * target is bound by the current last_sync_lsn value. Hence if we have a large
+ * amount of log space bound up in this committing transaction then the
+ * last_sync_lsn value may be the limiting factor preventing tail pushing from
+ * freeing space in the log. Hence once we've updated the last_sync_lsn we
+ * should push the AIL to ensure the push target (and hence the grant head) is
+ * no longer bound by the old log head location and can move forwards and make
+ * progress again.
+ */
+static void
+xlog_iclog_iodone(
+	struct xlog		*log,
+	struct xlog_in_core	*iclog)
+{
+	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
+		be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
+
+	if (list_empty_careful(&iclog->ic_callbacks))
+		return;
+
+	atomic64_set(&log->l_last_sync_lsn, be64_to_cpu(iclog->ic_header.h_lsn));
+	xlog_grant_push_ail(log, 0);
+
+}
+
 STATIC void
 xlog_state_do_callback(
 	struct xlog		*log,
@@ -2710,31 +2749,7 @@ xlog_state_do_callback(
 				}
 
 				iclog->ic_state = XLOG_STATE_CALLBACK;
-
-
-				/*
-				 * Completion of a iclog IO does not imply that
-				 * a transaction has completed, as transactions
-				 * can be large enough to span many iclogs. We
-				 * cannot change the tail of the log half way
-				 * through a transaction as this may be the only
-				 * transaction in the log and moving th etail to
-				 * point to the middle of it will prevent
-				 * recovery from finding the start of the
-				 * transaction. Hence we should only update the
-				 * last_sync_lsn if this iclog contains
-				 * transaction completion callbacks on it.
-				 *
-				 * We have to do this before we drop the
-				 * icloglock to ensure we are the only one that
-				 * can update it.
-				 */
-				ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
-					be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
-				if (!list_empty_careful(&iclog->ic_callbacks))
-					atomic64_set(&log->l_last_sync_lsn,
-						be64_to_cpu(iclog->ic_header.h_lsn));
-
+				xlog_iclog_iodone(log, iclog);
 			} else
 				ioerrors++;
 

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-30  2:13                     ` Dave Chinner
@ 2019-08-30  6:05                       ` Chandan Rajendra
  2019-08-30 17:24                       ` Brian Foster
  1 sibling, 0 replies; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-30  6:05 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Brian Foster, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Friday, August 30, 2019 7:43 AM Dave Chinner wrote: 
> On Fri, Aug 30, 2019 at 10:34:41AM +1000, Dave Chinner wrote:
> > On Fri, Aug 30, 2019 at 09:08:17AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> > > > 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > > 
> > > 200ms later the tail has moved, and last_sync_lsn is now 3/18501.
> > > i.e. the iclog writes have made it to disk, and the items have been
> > > moved into the AIL. I don't know where that came from, but I'm
> > > assuming it's an IO completion based on it being run from a
> > > kworker context that doesn't have an "xfs-" name prefix(*).
> > > 
> > > As the tail has moved, this should have woken the anything sleeping
> > > on the log tail in xlog_grant_head_wait() via a call to
> > > xfs_log_space_wake(). The first waiter should wake, see that there
> > > still isn't room in the log (only 3 sectors were freed in the log,
> > > we need at least 60). That woken process should then run
> > > xlog_grant_push_ail() again and go back to sleep.
> > 
> > Actually, it doesn't get woken because xlog_grant_head_wake() checks
> > how much space is available before waking waiters, and there clearly
> > isn't enough here. So that's one likely vector. Can you try this
> > patch?
> 
> And this one on top to address the situation the previous patch
> doesn't....
> 

Dave, with the 3 patches added (i.e. synchronous transactions during log
recovery and the two patches posted now), the deadlock is not recreated.

Tested-by: Chandan Rajendra <chandanrlinux@gmail.com>

-- 
chandan




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-30  2:13                     ` Dave Chinner
  2019-08-30  6:05                       ` Chandan Rajendra
@ 2019-08-30 17:24                       ` Brian Foster
  2019-08-30 22:10                         ` Dave Chinner
  1 sibling, 1 reply; 25+ messages in thread
From: Brian Foster @ 2019-08-30 17:24 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chandan Rajendra, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Fri, Aug 30, 2019 at 12:13:29PM +1000, Dave Chinner wrote:
> On Fri, Aug 30, 2019 at 10:34:41AM +1000, Dave Chinner wrote:
> > On Fri, Aug 30, 2019 at 09:08:17AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> > > > 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > > 
> > > 200ms later the tail has moved, and last_sync_lsn is now 3/18501.
> > > i.e. the iclog writes have made it to disk, and the items have been
> > > moved into the AIL. I don't know where that came from, but I'm
> > > assuming it's an IO completion based on it being run from a
> > > kworker context that doesn't have an "xfs-" name prefix(*).
> > > 
> > > As the tail has moved, this should have woken the anything sleeping
> > > on the log tail in xlog_grant_head_wait() via a call to
> > > xfs_log_space_wake(). The first waiter should wake, see that there
> > > still isn't room in the log (only 3 sectors were freed in the log,
> > > we need at least 60). That woken process should then run
> > > xlog_grant_push_ail() again and go back to sleep.
> > 
> > Actually, it doesn't get woken because xlog_grant_head_wake() checks
> > how much space is available before waking waiters, and there clearly
> > isn't enough here. So that's one likely vector. Can you try this
> > patch?
> 
> And this one on top to address the situation the previous patch
> doesn't....
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> xfs: push the grant head when the log head moves forward
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> When the log fills up, we can get into the state where the
> outstanding items in the CIL being committed and aggregated are
> larger than the range that the reservation grant head tail pushing
> will attempt to clean. This can result in the tail pushing range
> being trimmed back to the the log head (l_last_sync_lsn) and so
> may not actually move the push target at all.
> 
> When the iclogs associated with the CIL commit finally land, the
> log head moves forward, and this removes the restriction on the AIL
> push target. However, if we already have transactions sleeping on
> the grant head, and there's nothing in the AIL still to flush from
> the current push target, then nothing will move the tail of the log
> and trigger a log reservation wakeup.
> 
> Hence the there is nothing that will trigger xlog_grant_push_ail()
> to recalculate the AIL push target and start pushing on the AIL
> again to write back the metadata objects that pin the tail of the
> log and hence free up space and allow the transaction reservations
> to be woken and make progress.
> 
> Hence we need to push on the grant head when we move the log head
> forward, as this may be the only trigger we have that can move the
> AIL push target forwards in this situation.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c | 65 ++++++++++++++++++++++++++++++++++----------------------
>  1 file changed, 40 insertions(+), 25 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 941f10ff99d9..ce46bb244442 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -2612,6 +2612,45 @@ xlog_get_lowest_lsn(
>  	return lowest_lsn;
>  }
>  
> +
> +/*
> + * Completion of a iclog IO does not imply that a transaction has completed, as
> + * transactions can be large enough to span many iclogs. We cannot change the
> + * tail of the log half way through a transaction as this may be the only
> + * transaction in the log and moving the tail to point to the middle of it
> + * will prevent recovery from finding the start of the transaction. Hence we
> + * should only update the last_sync_lsn if this iclog contains transaction
> + * completion callbacks on it.
> + *
> + * We have to do this before we drop the icloglock to ensure we are the only one
> + * that can update it.
> + *
> + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> + * the reservation grant head pushing. This is due to the fact that the push
> + * target is bound by the current last_sync_lsn value. Hence if we have a large
> + * amount of log space bound up in this committing transaction then the
> + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> + * should push the AIL to ensure the push target (and hence the grant head) is
> + * no longer bound by the old log head location and can move forwards and make
> + * progress again.
> + */
> +static void
> +xlog_iclog_iodone(
> +	struct xlog		*log,
> +	struct xlog_in_core	*iclog)
> +{
> +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
> +		be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
> +
> +	if (list_empty_careful(&iclog->ic_callbacks))
> +		return;
> +
> +	atomic64_set(&log->l_last_sync_lsn, be64_to_cpu(iclog->ic_header.h_lsn));
> +	xlog_grant_push_ail(log, 0);

I think the intent of the patch makes sense but the wakeup is misplaced.
First, this wakeup technically occurs before AIL population because we
haven't invoked the iclog callbacks yet. Second, your first patch
already adds an AIL push in the log I/O completion path via
xfs_trans_ail_update_bulk() -> xfs_log_space_wake(). The problem may
just be that it's only invoked on tail changes and not empty ->
populated state changes. If so, that could be addressed with a check for
mlip == NULL in xfs_trans_ail_update_bulk().

(Actually, the first patch alone seems to be enough to prevent the
problem in my tests. I suppose I could be seeing a variation of the race
from what Chandan sees and a wakeup from the second CIL ctx ticket
release or something is saving me...).

Brian

> +
> +}
> +
>  STATIC void
>  xlog_state_do_callback(
>  	struct xlog		*log,
> @@ -2710,31 +2749,7 @@ xlog_state_do_callback(
>  				}
>  
>  				iclog->ic_state = XLOG_STATE_CALLBACK;
> -
> -
> -				/*
> -				 * Completion of a iclog IO does not imply that
> -				 * a transaction has completed, as transactions
> -				 * can be large enough to span many iclogs. We
> -				 * cannot change the tail of the log half way
> -				 * through a transaction as this may be the only
> -				 * transaction in the log and moving th etail to
> -				 * point to the middle of it will prevent
> -				 * recovery from finding the start of the
> -				 * transaction. Hence we should only update the
> -				 * last_sync_lsn if this iclog contains
> -				 * transaction completion callbacks on it.
> -				 *
> -				 * We have to do this before we drop the
> -				 * icloglock to ensure we are the only one that
> -				 * can update it.
> -				 */
> -				ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
> -					be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
> -				if (!list_empty_careful(&iclog->ic_callbacks))
> -					atomic64_set(&log->l_last_sync_lsn,
> -						be64_to_cpu(iclog->ic_header.h_lsn));
> -
> +				xlog_iclog_iodone(log, iclog);
>  			} else
>  				ioerrors++;
>  

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-30 17:24                       ` Brian Foster
@ 2019-08-30 22:10                         ` Dave Chinner
  2019-08-30 23:18                           ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2019-08-30 22:10 UTC (permalink / raw)
  To: Brian Foster
  Cc: Chandan Rajendra, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Fri, Aug 30, 2019 at 01:24:47PM -0400, Brian Foster wrote:
> On Fri, Aug 30, 2019 at 12:13:29PM +1000, Dave Chinner wrote:
> > On Fri, Aug 30, 2019 at 10:34:41AM +1000, Dave Chinner wrote:
> > > On Fri, Aug 30, 2019 at 09:08:17AM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> > > > > 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > > > 
> > > > 200ms later the tail has moved, and last_sync_lsn is now 3/18501.
> > > > i.e. the iclog writes have made it to disk, and the items have been
> > > > moved into the AIL. I don't know where that came from, but I'm
> > > > assuming it's an IO completion based on it being run from a
> > > > kworker context that doesn't have an "xfs-" name prefix(*).
> > > > 
> > > > As the tail has moved, this should have woken the anything sleeping
> > > > on the log tail in xlog_grant_head_wait() via a call to
> > > > xfs_log_space_wake(). The first waiter should wake, see that there
> > > > still isn't room in the log (only 3 sectors were freed in the log,
> > > > we need at least 60). That woken process should then run
> > > > xlog_grant_push_ail() again and go back to sleep.
> > > 
> > > Actually, it doesn't get woken because xlog_grant_head_wake() checks
> > > how much space is available before waking waiters, and there clearly
> > > isn't enough here. So that's one likely vector. Can you try this
> > > patch?
> > 
> > And this one on top to address the situation the previous patch
> > doesn't....
> > 
> > -Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> > xfs: push the grant head when the log head moves forward
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > When the log fills up, we can get into the state where the
> > outstanding items in the CIL being committed and aggregated are
> > larger than the range that the reservation grant head tail pushing
> > will attempt to clean. This can result in the tail pushing range
> > being trimmed back to the the log head (l_last_sync_lsn) and so
> > may not actually move the push target at all.
> > 
> > When the iclogs associated with the CIL commit finally land, the
> > log head moves forward, and this removes the restriction on the AIL
> > push target. However, if we already have transactions sleeping on
> > the grant head, and there's nothing in the AIL still to flush from
> > the current push target, then nothing will move the tail of the log
> > and trigger a log reservation wakeup.
> > 
> > Hence the there is nothing that will trigger xlog_grant_push_ail()
> > to recalculate the AIL push target and start pushing on the AIL
> > again to write back the metadata objects that pin the tail of the
> > log and hence free up space and allow the transaction reservations
> > to be woken and make progress.
> > 
> > Hence we need to push on the grant head when we move the log head
> > forward, as this may be the only trigger we have that can move the
> > AIL push target forwards in this situation.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c | 65 ++++++++++++++++++++++++++++++++++----------------------
> >  1 file changed, 40 insertions(+), 25 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index 941f10ff99d9..ce46bb244442 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -2612,6 +2612,45 @@ xlog_get_lowest_lsn(
> >  	return lowest_lsn;
> >  }
> >  
> > +
> > +/*
> > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > + * transactions can be large enough to span many iclogs. We cannot change the
> > + * tail of the log half way through a transaction as this may be the only
> > + * transaction in the log and moving the tail to point to the middle of it
> > + * will prevent recovery from finding the start of the transaction. Hence we
> > + * should only update the last_sync_lsn if this iclog contains transaction
> > + * completion callbacks on it.
> > + *
> > + * We have to do this before we drop the icloglock to ensure we are the only one
> > + * that can update it.
> > + *
> > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > + * the reservation grant head pushing. This is due to the fact that the push
> > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > + * amount of log space bound up in this committing transaction then the
> > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > + * should push the AIL to ensure the push target (and hence the grant head) is
> > + * no longer bound by the old log head location and can move forwards and make
> > + * progress again.
> > + */
> > +static void
> > +xlog_iclog_iodone(
> > +	struct xlog		*log,
> > +	struct xlog_in_core	*iclog)
> > +{
> > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
> > +		be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
> > +
> > +	if (list_empty_careful(&iclog->ic_callbacks))
> > +		return;
> > +
> > +	atomic64_set(&log->l_last_sync_lsn, be64_to_cpu(iclog->ic_header.h_lsn));
> > +	xlog_grant_push_ail(log, 0);
> 
> I think the intent of the patch makes sense but the wakeup is misplaced.
> First, this wakeup technically occurs before AIL population because we
> haven't invoked the iclog callbacks yet.

I *think* it's fine - we just need to push the AIL once the l_last_sync_lsn
has been updated. The items going into the AIL from this commit
will be at the head, anyway, so it won't make any difference to
pushing the items at tail of the AIL. What matters is the push
threshold is no longer truncated down by the l_last_sync_lsn....

Really, though, I just did this quick change because I didn't want
to have to do a major rework of this code without first knowing
whether it fixed the problem. This whole iclog completion function
needs cleaning up....

> Second, your first patch
> already adds an AIL push in the log I/O completion path via
> xfs_trans_ail_update_bulk() -> xfs_log_space_wake().

That will only trigger if we move something from the tail of the log
during insertion. There is no guarantee that will happen from this
iclog completion - there may be no relogging at all, so no wakeup
from the AIL insert process....

> The problem may
> just be that it's only invoked on tail changes and not empty ->
> populated state changes. If so, that could be addressed with a check for
> mlip == NULL in xfs_trans_ail_update_bulk().
> 
> (Actually, the first patch alone seems to be enough to prevent the
> problem in my tests. I suppose I could be seeing a variation of the race
> from what Chandan sees and a wakeup from the second CIL ctx ticket
> release or something is saving me...).

Right, the first patch addresses the hang where a tail update occurs
after this iclog hits the disk and inserts everything into the AIL.
This patch addresses the variant (that is much harder to hit) where
no tail updates occur after this iclog hits the disk and completes.

Both are necessary, AFAICT.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-30 22:10                         ` Dave Chinner
@ 2019-08-30 23:18                           ` Brian Foster
  0 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2019-08-30 23:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Chandan Rajendra, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Sat, Aug 31, 2019 at 08:10:46AM +1000, Dave Chinner wrote:
> On Fri, Aug 30, 2019 at 01:24:47PM -0400, Brian Foster wrote:
> > On Fri, Aug 30, 2019 at 12:13:29PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 30, 2019 at 10:34:41AM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 30, 2019 at 09:08:17AM +1000, Dave Chinner wrote:
> > > > > On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> > > > > > 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > > > > 
> > > > > 200ms later the tail has moved, and last_sync_lsn is now 3/18501.
> > > > > i.e. the iclog writes have made it to disk, and the items have been
> > > > > moved into the AIL. I don't know where that came from, but I'm
> > > > > assuming it's an IO completion based on it being run from a
> > > > > kworker context that doesn't have an "xfs-" name prefix(*).
> > > > > 
> > > > > As the tail has moved, this should have woken the anything sleeping
> > > > > on the log tail in xlog_grant_head_wait() via a call to
> > > > > xfs_log_space_wake(). The first waiter should wake, see that there
> > > > > still isn't room in the log (only 3 sectors were freed in the log,
> > > > > we need at least 60). That woken process should then run
> > > > > xlog_grant_push_ail() again and go back to sleep.
> > > > 
> > > > Actually, it doesn't get woken because xlog_grant_head_wake() checks
> > > > how much space is available before waking waiters, and there clearly
> > > > isn't enough here. So that's one likely vector. Can you try this
> > > > patch?
> > > 
> > > And this one on top to address the situation the previous patch
> > > doesn't....
> > > 
> > > -Dave.
> > > -- 
> > > Dave Chinner
> > > david@fromorbit.com
> > > 
> > > xfs: push the grant head when the log head moves forward
> > > 
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > When the log fills up, we can get into the state where the
> > > outstanding items in the CIL being committed and aggregated are
> > > larger than the range that the reservation grant head tail pushing
> > > will attempt to clean. This can result in the tail pushing range
> > > being trimmed back to the the log head (l_last_sync_lsn) and so
> > > may not actually move the push target at all.
> > > 
> > > When the iclogs associated with the CIL commit finally land, the
> > > log head moves forward, and this removes the restriction on the AIL
> > > push target. However, if we already have transactions sleeping on
> > > the grant head, and there's nothing in the AIL still to flush from
> > > the current push target, then nothing will move the tail of the log
> > > and trigger a log reservation wakeup.
> > > 
> > > Hence the there is nothing that will trigger xlog_grant_push_ail()
> > > to recalculate the AIL push target and start pushing on the AIL
> > > again to write back the metadata objects that pin the tail of the
> > > log and hence free up space and allow the transaction reservations
> > > to be woken and make progress.
> > > 
> > > Hence we need to push on the grant head when we move the log head
> > > forward, as this may be the only trigger we have that can move the
> > > AIL push target forwards in this situation.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c | 65 ++++++++++++++++++++++++++++++++++----------------------
> > >  1 file changed, 40 insertions(+), 25 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index 941f10ff99d9..ce46bb244442 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -2612,6 +2612,45 @@ xlog_get_lowest_lsn(
> > >  	return lowest_lsn;
> > >  }
> > >  
> > > +
> > > +/*
> > > + * Completion of a iclog IO does not imply that a transaction has completed, as
> > > + * transactions can be large enough to span many iclogs. We cannot change the
> > > + * tail of the log half way through a transaction as this may be the only
> > > + * transaction in the log and moving the tail to point to the middle of it
> > > + * will prevent recovery from finding the start of the transaction. Hence we
> > > + * should only update the last_sync_lsn if this iclog contains transaction
> > > + * completion callbacks on it.
> > > + *
> > > + * We have to do this before we drop the icloglock to ensure we are the only one
> > > + * that can update it.
> > > + *
> > > + * If we are moving the last_sync_lsn forwards, we also need to ensure we kick
> > > + * the reservation grant head pushing. This is due to the fact that the push
> > > + * target is bound by the current last_sync_lsn value. Hence if we have a large
> > > + * amount of log space bound up in this committing transaction then the
> > > + * last_sync_lsn value may be the limiting factor preventing tail pushing from
> > > + * freeing space in the log. Hence once we've updated the last_sync_lsn we
> > > + * should push the AIL to ensure the push target (and hence the grant head) is
> > > + * no longer bound by the old log head location and can move forwards and make
> > > + * progress again.
> > > + */
> > > +static void
> > > +xlog_iclog_iodone(
> > > +	struct xlog		*log,
> > > +	struct xlog_in_core	*iclog)
> > > +{
> > > +	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
> > > +		be64_to_cpu(iclog->ic_header.h_lsn)) <= 0);
> > > +
> > > +	if (list_empty_careful(&iclog->ic_callbacks))
> > > +		return;
> > > +
> > > +	atomic64_set(&log->l_last_sync_lsn, be64_to_cpu(iclog->ic_header.h_lsn));
> > > +	xlog_grant_push_ail(log, 0);
> > 
> > I think the intent of the patch makes sense but the wakeup is misplaced.
> > First, this wakeup technically occurs before AIL population because we
> > haven't invoked the iclog callbacks yet.
> 
> I *think* it's fine - we just need to push the AIL once the l_last_sync_lsn
> has been updated. The items going into the AIL from this commit
> will be at the head, anyway, so it won't make any difference to
> pushing the items at tail of the AIL. What matters is the push
> threshold is no longer truncated down by the l_last_sync_lsn....
> 

Hmm.. but that assumes the AIL is populated, which isn't what I'm
seeing. Sure, xlog_grant_push_ail() truncates the push target based on
->l_last_sync_lsn, but the AIL push itself is a no-op if the AIL is
empty.

> Really, though, I just did this quick change because I didn't want
> to have to do a major rework of this code without first knowing
> whether it fixed the problem. This whole iclog completion function
> needs cleaning up....
> 

Yep...

> > Second, your first patch
> > already adds an AIL push in the log I/O completion path via
> > xfs_trans_ail_update_bulk() -> xfs_log_space_wake().
> 
> That will only trigger if we move something from the tail of the log
> during insertion. There is no guarantee that will happen from this
> iclog completion - there may be no relogging at all, so no wakeup
> from the AIL insert process....
> 

Which is exactly why I noted below that we should probably check for
mlip == NULL there as well. I.e.:

	if (!mlip || mlip_changed) {
		...
		xfs_log_space_wake()
	}

Of course, that's based on my thinking that this was intended to address
the "pushing an empty AIL" case. Even if there is some other problematic
case, this patch is still basically a racy variant of just making that
existing wake in xfs_trans_ail_update_bulk() unconditional. FWIW, this
call also properly communicates the log reservation requirement of the
next transaction in the queue back to the AIL, particularly if the
request happens to be larger than whatever the min threshold heuristic
is in xlog_grant_push_ail().

Brian

> > The problem may
> > just be that it's only invoked on tail changes and not empty ->
> > populated state changes. If so, that could be addressed with a check for
> > mlip == NULL in xfs_trans_ail_update_bulk().
> > 
> > (Actually, the first patch alone seems to be enough to prevent the
> > problem in my tests. I suppose I could be seeing a variation of the race
> > from what Chandan sees and a wakeup from the second CIL ctx ticket
> > release or something is saving me...).
> 
> Right, the first patch addresses the hang where a tail update occurs
> after this iclog hits the disk and inserts everything into the AIL.
> This patch addresses the variant (that is much harder to hit) where
> no tail updates occur after this iclog hits the disk and completes.
> 
> Both are necessary, AFAICT.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
       [not found]   ` <2367290.sgLJaTIShl@localhost.localdomain>
@ 2019-08-31  6:58     ` Chandan Rajendra
  2019-08-31 13:07       ` Brian Foster
  0 siblings, 1 reply; 25+ messages in thread
From: Chandan Rajendra @ 2019-08-31  6:58 UTC (permalink / raw)
  To: Brian Foster; +Cc: Dave Chinner, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Saturday, August 31, 2019 10:29 AM Chandan Rajendra wrote: 
> On Friday, August 30, 2019 10:17 PM Brian Foster wrote: 
> > On Fri, Aug 30, 2019 at 09:08:17AM +1000, Dave Chinner wrote:
> > > On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> > > > On Monday, August 26, 2019 6:02 AM Dave Chinner wrote: 
> > > > > On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> > > > > > On Friday, August 23, 2019 7:08 PM Chandan Rajendra wrote:
> > > > > > 
> > > > > > Dave, With the above changes made in xfs_trans_reserve(), mount task is
> > > > > > deadlocking due to the following,
> > > > > > 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> > > > > > flushed to the disk and hence log items to be ultimately moved to AIL.
> > > > > > 2. xfsaild task is woken up, which acts in items on AIL.
> > > > > > 3. After some time, we stop issuing synchronous transactions because AIL has
> > > > > >    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
> > > > > >    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
> > > > > >    evaluates to false on the first iteration of the while loop. This means we
> > > > > >    have a log item whose LSN is larger than xfs_ail->ail_target at the
> > > > > >    beginning of the AIL.
> > > > > 
> > > > > The push target for xlog_grant_push_ail() is to free 25% of the log
> > > > > space. So if all the items in the AIL are not within 25% of the tail
> > > > > end of the log, there's nothing for the AIL to push. This indicates
> > > > > that there is at least 25% of physical log space free.
> > > > 
> > > > Sorry for the late response. I was trying to understand the code flow.
> > > > 
> > > > Here is a snippet of perf trace explaining what is going on,
> > > > 
> > > > 	 760881:           mount  8654 [002]   216.813041:                         probe:xlog_grant_push_ail: (c000000000765864) comm="xfsaild/loop1" threshold_cycle_s32=3 threshold_block_s32=3970 need_bytes_s32=389328 last_sync_cycle_u32=2 last_sync_block_u32=19330 free_threshold_s32=5120 free_bytes_s32=383756 free_blocks_s32=749 l_logsize=10485760 reserve_cycle_s32=3 reserve_block_s32=9513204(~18580 blocks) tail_cycle_s32=2 tail_block_s32=19330
> > > 
> > > So this looks like last_sync_lsn is 2/19330, and the transaction
> > > reservation is ~380kB, or close on 3% of the log. The reserve grant
> > > head is at 3/18580, so we're ~700 * 512 = ~350kB of reservation
> > > remaining. Yup, so we are definitely in the "go to sleep and wait"
> > > situation here.
> > > 
> > > > 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > > 
> > > 200ms later the tail has moved, and last_sync_lsn is now 3/18501.
> > > i.e. the iclog writes have made it to disk, and the items have been
> > > moved into the AIL. I don't know where that came from, but I'm
> > > assuming it's an IO completion based on it being run from a
> > > kworker context that doesn't have an "xfs-" name prefix(*).
> > > 
> > > As the tail has moved, this should have woken the anything sleeping
> > > on the log tail in xlog_grant_head_wait() via a call to
> > > xfs_log_space_wake(). The first waiter should wake, see that there
> > > still isn't room in the log (only 3 sectors were freed in the log,
> > > we need at least 60). That woken process should then run
> > > xlog_grant_push_ail() again and go back to sleep.
> > > 
> > > (*) I have a patch that shortens "s/kworker/kw/" so that you can
> > > actually see the name of the kworker in the 16 byte field we have
> > > for the task name. We really should just increase current->comm to
> > > 32 bytes.
> > > 
> > > > 	 786577: kworker/4:1H-kb  1825 [004]   217.041087:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > > > 	 793653:   xfsaild/loop1  8661 [004]   265.407708:                probe:xfsaild_push_last_pushed_lsn: (c000000000784644) comm="xfsaild/loop1" cycle_lsn_u32=0 block_lsn_u32=0 target_cycle_lsn_u32=2 target_block_lsn_u32=19330
> > > > 	 793654:   xfsaild/loop1  8661 [004]   265.407717:              probe:xfsaild_push_min_lsn_less_than: (c0000000007846a0) comm="xfsaild/loop1" less_than_s32=0 cycle_lsn_u32=2 block_lsn_u32=19333 lip_x64=0xc000000303fb4a48
> > > 
> > > Ans some 40s later the xfsaild is woken by something, sees there's
> > > nothing to do, and goes back to sleep. I don't see the process
> > > sleeping on the grant head being ever being woken and calling
> > > xlog_grant_push_ail(), which would see the new last_sync_lsn and
> > > move the push target....
> > > 
> > > From this trace, it looks like the problem here is a missing or
> > > incorrectly processed wakeup when the log tail moves.
> > > 
> > > Unfortunately, you haven't used the built in trace points for
> > > debugging log space hangs so I can't tell anything more than this.
> > > i.e the trace we need contains these build in tracepoints:
> > > 
> > > # trace-cmd record -e xfs_log\* -e xfs_ail\* sleep 120 &
> > > # <run workload that hangs within 120s>
> > > 
> > > <wait for trace-cmd to exit>
> > > # trace-cmd report | gzip > trace.txt.gz
> > > 
> > > as that will record all transaction reservations, grant head
> > > manipulations, changes to the tail lsn, when processes sleep on the
> > > grant head and are worken, AIL insert/move/delete, etc.
> > > 
> > > This will generate a -lot- of data. I often generate and analyse
> > > traces in the order of tens of GBs of events to track down issues
> > > like this, because the problem is often only seen in a single trace
> > > event in amongst the millions that are recorded....
> > > 
> > > And if we need more info, then we add the appropriate tracepoints
> > > into xlog_grant_push_ail, xfsaild_push, etc under those tracepoint
> > > namespaces, so next time we have a problem we don't ahve to write
> > > custom tracepoints.....
> > > 
> > > > i.e the log size was 2560 * 4096 = 10485760 bytes.
> > > 
> > > The default minimum size.
> > > 
> > > > > I suspect that this means the CIL is overruning it's background push
> > > > > target by more than expected probably because the log is so small. That leads
> > > > > to the outstanding CIL pending commit size (the current CIL context
> > > > > and the previous CIL commit that is held off by the uncommited
> > > > > iclog) is greater than the AIL push target, and so nothing will free
> > > > > up more log space and wake up the transaction waiting for grant
> > > > > space.
> > > > > 
> > > > > e.g. the previous CIL context commit might take 15% of the log
> > > > > space, and the current CIL has reserved 11% of the log space.
> > > > > Now new transactions reservations have run out of grant space and we
> > > > > push on the ail, but it's lowest item is at 26%, and so the AIL push
> > > > > does nothing and we're stuck because the CIL has pinned 26% of the
> > > > > log space.
> > > > > 
> > > > > As a test, can you run the test with larger log sizes? I think
> > > > > the default used was about ~3600 blocks, so it you step that up by
> > > > > 500 blocks at a time we should get an idea of the size of the
> > > > > overrun by the size of the log where the hang goes away. A
> > > > > trace of the transaction reservations and AIL pushing would also be
> > > > > insightful.
> > > > 
> > > > After increasing the log size to 4193 blocks (i.e. 4193 * 4k = 17174528
> > > > bytes) and also the patch applied, I don't see the dead lock happening.
> > > 
> > > Likely because now the 380k transaction reservation is only 2% of the
> > > log instead of close to 4% of the log, and so the overrun isn't
> > > large enough to trigger whatever wakeup issue we have....
> > > 
> > > > Meanwhile, I am planning to read more code to map the explaination
> > > > provided below.
> > > 
> > > Can you get a complete trace (as per above) of a hang? we're going
> > > to need that trace to validate any analysis you do yourself,
> > > anyway...
> > > 
> > 
> > FWIW, I finally managed to reproduce this (trace dump attached for
> > reference). I ultimately moved to another system, starting using loop
> > devices and bumped up the fstests LOAD_FACTOR. What I see is essentially
> > what looks like the lack of a wake up problem described above. More
> > specifically:
> > 
> > - Workload begins, background CIL push occurs for ~1.25MB context.
> > - While the first checkpoint is being written out, another background
> >   push is requested. By the time we can execute the next CIL push, the
> >   second context accrues to ~8MB in size, basically consuming the rest of
> >   the log.
> > 
> > So that essentially describes the delay. It's just a matter of timing
> > due to the amount of time it takes to complete the first checkpoint and
> > start the second vs. the log reservation consumption workload.
> > 
> > - The next CIL push executes, but all transaction waiters block before
> >   the iclog with the commit record for the first checkpoint hits the
> >   log.
> > 
> > So the AIL doesn't actually remain empty indefinitely here. Shortly
> > after the second CIL push starts doing I/O, the items from the first
> > checkpoint make it into the AIL. The problem is just that the AIL stays
> > empty long enough to absorb all of the AIL pushes that occur due to
> > log reservation pressure before everything stops.
> > 
> > I think the two extra AIL push patches Dave has posted are enough to
> > keep things moving. So far I haven't been able to reproduce with those
> > applied (though I applied a modified variant of the second)...
> > 
> 
> I am able to recreate the bug without the "synchronous transaction" patch. I
> am attaching the trace file with this mail.

I think it essentially boils down to the fact that without "synchronous
transactions" during log recovery, the last iclog containing the commit record
wouldn't be submitted for I/O since in the case of this bug we have more than
"2*sizeof(xlog_op_header_t)" space left in it.

Without the "synchronous transaction" change, the patch which initiates the
AIL push from the iclog endio function would be ineffective.

-- 
chandan




^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space
  2019-08-31  6:58     ` Chandan Rajendra
@ 2019-08-31 13:07       ` Brian Foster
  0 siblings, 0 replies; 25+ messages in thread
From: Brian Foster @ 2019-08-31 13:07 UTC (permalink / raw)
  To: Chandan Rajendra
  Cc: Dave Chinner, Chandan Rajendra, linux-xfs, darrick.wong, hch

On Sat, Aug 31, 2019 at 12:28:47PM +0530, Chandan Rajendra wrote:
> On Saturday, August 31, 2019 10:29 AM Chandan Rajendra wrote: 
> > On Friday, August 30, 2019 10:17 PM Brian Foster wrote: 
> > > On Fri, Aug 30, 2019 at 09:08:17AM +1000, Dave Chinner wrote:
> > > > On Thu, Aug 29, 2019 at 10:51:59AM +0530, Chandan Rajendra wrote:
> > > > > On Monday, August 26, 2019 6:02 AM Dave Chinner wrote: 
> > > > > > On Sun, Aug 25, 2019 at 08:35:17PM +0530, Chandan Rajendra wrote:
> > > > > > > On Friday, August 23, 2019 7:08 PM Chandan Rajendra wrote:
> > > > > > > 
> > > > > > > Dave, With the above changes made in xfs_trans_reserve(), mount task is
> > > > > > > deadlocking due to the following,
> > > > > > > 1. With synchronous transactions, __xfs_trans_commit() now causes iclogs to be
> > > > > > > flushed to the disk and hence log items to be ultimately moved to AIL.
> > > > > > > 2. xfsaild task is woken up, which acts in items on AIL.
> > > > > > > 3. After some time, we stop issuing synchronous transactions because AIL has
> > > > > > >    log items in its list and hence !xfs_ail_min(tp->t_mountp->m_ail) evaluates to
> > > > > > >    false. In xfsaild_push(), "XFS_LSN_CMP(lip->li_lsn, target) <= 0"
> > > > > > >    evaluates to false on the first iteration of the while loop. This means we
> > > > > > >    have a log item whose LSN is larger than xfs_ail->ail_target at the
> > > > > > >    beginning of the AIL.
> > > > > > 
> > > > > > The push target for xlog_grant_push_ail() is to free 25% of the log
> > > > > > space. So if all the items in the AIL are not within 25% of the tail
> > > > > > end of the log, there's nothing for the AIL to push. This indicates
> > > > > > that there is at least 25% of physical log space free.
> > > > > 
> > > > > Sorry for the late response. I was trying to understand the code flow.
> > > > > 
> > > > > Here is a snippet of perf trace explaining what is going on,
> > > > > 
> > > > > 	 760881:           mount  8654 [002]   216.813041:                         probe:xlog_grant_push_ail: (c000000000765864) comm="xfsaild/loop1" threshold_cycle_s32=3 threshold_block_s32=3970 need_bytes_s32=389328 last_sync_cycle_u32=2 last_sync_block_u32=19330 free_threshold_s32=5120 free_bytes_s32=383756 free_blocks_s32=749 l_logsize=10485760 reserve_cycle_s32=3 reserve_block_s32=9513204(~18580 blocks) tail_cycle_s32=2 tail_block_s32=19330
> > > > 
> > > > So this looks like last_sync_lsn is 2/19330, and the transaction
> > > > reservation is ~380kB, or close on 3% of the log. The reserve grant
> > > > head is at 3/18580, so we're ~700 * 512 = ~350kB of reservation
> > > > remaining. Yup, so we are definitely in the "go to sleep and wait"
> > > > situation here.
> > > > 
> > > > > 	 786576: kworker/4:1H-kb  1825 [004]   217.041079:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > > > 
> > > > 200ms later the tail has moved, and last_sync_lsn is now 3/18501.
> > > > i.e. the iclog writes have made it to disk, and the items have been
> > > > moved into the AIL. I don't know where that came from, but I'm
> > > > assuming it's an IO completion based on it being run from a
> > > > kworker context that doesn't have an "xfs-" name prefix(*).
> > > > 
> > > > As the tail has moved, this should have woken the anything sleeping
> > > > on the log tail in xlog_grant_head_wait() via a call to
> > > > xfs_log_space_wake(). The first waiter should wake, see that there
> > > > still isn't room in the log (only 3 sectors were freed in the log,
> > > > we need at least 60). That woken process should then run
> > > > xlog_grant_push_ail() again and go back to sleep.
> > > > 
> > > > (*) I have a patch that shortens "s/kworker/kw/" so that you can
> > > > actually see the name of the kworker in the 16 byte field we have
> > > > for the task name. We really should just increase current->comm to
> > > > 32 bytes.
> > > > 
> > > > > 	 786577: kworker/4:1H-kb  1825 [004]   217.041087:                       xfs:xfs_log_assign_tail_lsn: dev 7:1 new tail lsn 2/19333, old lsn 2/19330, last sync 3/18501
> > > > > 	 793653:   xfsaild/loop1  8661 [004]   265.407708:                probe:xfsaild_push_last_pushed_lsn: (c000000000784644) comm="xfsaild/loop1" cycle_lsn_u32=0 block_lsn_u32=0 target_cycle_lsn_u32=2 target_block_lsn_u32=19330
> > > > > 	 793654:   xfsaild/loop1  8661 [004]   265.407717:              probe:xfsaild_push_min_lsn_less_than: (c0000000007846a0) comm="xfsaild/loop1" less_than_s32=0 cycle_lsn_u32=2 block_lsn_u32=19333 lip_x64=0xc000000303fb4a48
> > > > 
> > > > Ans some 40s later the xfsaild is woken by something, sees there's
> > > > nothing to do, and goes back to sleep. I don't see the process
> > > > sleeping on the grant head being ever being woken and calling
> > > > xlog_grant_push_ail(), which would see the new last_sync_lsn and
> > > > move the push target....
> > > > 
> > > > From this trace, it looks like the problem here is a missing or
> > > > incorrectly processed wakeup when the log tail moves.
> > > > 
> > > > Unfortunately, you haven't used the built in trace points for
> > > > debugging log space hangs so I can't tell anything more than this.
> > > > i.e the trace we need contains these build in tracepoints:
> > > > 
> > > > # trace-cmd record -e xfs_log\* -e xfs_ail\* sleep 120 &
> > > > # <run workload that hangs within 120s>
> > > > 
> > > > <wait for trace-cmd to exit>
> > > > # trace-cmd report | gzip > trace.txt.gz
> > > > 
> > > > as that will record all transaction reservations, grant head
> > > > manipulations, changes to the tail lsn, when processes sleep on the
> > > > grant head and are worken, AIL insert/move/delete, etc.
> > > > 
> > > > This will generate a -lot- of data. I often generate and analyse
> > > > traces in the order of tens of GBs of events to track down issues
> > > > like this, because the problem is often only seen in a single trace
> > > > event in amongst the millions that are recorded....
> > > > 
> > > > And if we need more info, then we add the appropriate tracepoints
> > > > into xlog_grant_push_ail, xfsaild_push, etc under those tracepoint
> > > > namespaces, so next time we have a problem we don't ahve to write
> > > > custom tracepoints.....
> > > > 
> > > > > i.e the log size was 2560 * 4096 = 10485760 bytes.
> > > > 
> > > > The default minimum size.
> > > > 
> > > > > > I suspect that this means the CIL is overruning it's background push
> > > > > > target by more than expected probably because the log is so small. That leads
> > > > > > to the outstanding CIL pending commit size (the current CIL context
> > > > > > and the previous CIL commit that is held off by the uncommited
> > > > > > iclog) is greater than the AIL push target, and so nothing will free
> > > > > > up more log space and wake up the transaction waiting for grant
> > > > > > space.
> > > > > > 
> > > > > > e.g. the previous CIL context commit might take 15% of the log
> > > > > > space, and the current CIL has reserved 11% of the log space.
> > > > > > Now new transactions reservations have run out of grant space and we
> > > > > > push on the ail, but it's lowest item is at 26%, and so the AIL push
> > > > > > does nothing and we're stuck because the CIL has pinned 26% of the
> > > > > > log space.
> > > > > > 
> > > > > > As a test, can you run the test with larger log sizes? I think
> > > > > > the default used was about ~3600 blocks, so it you step that up by
> > > > > > 500 blocks at a time we should get an idea of the size of the
> > > > > > overrun by the size of the log where the hang goes away. A
> > > > > > trace of the transaction reservations and AIL pushing would also be
> > > > > > insightful.
> > > > > 
> > > > > After increasing the log size to 4193 blocks (i.e. 4193 * 4k = 17174528
> > > > > bytes) and also the patch applied, I don't see the dead lock happening.
> > > > 
> > > > Likely because now the 380k transaction reservation is only 2% of the
> > > > log instead of close to 4% of the log, and so the overrun isn't
> > > > large enough to trigger whatever wakeup issue we have....
> > > > 
> > > > > Meanwhile, I am planning to read more code to map the explaination
> > > > > provided below.
> > > > 
> > > > Can you get a complete trace (as per above) of a hang? we're going
> > > > to need that trace to validate any analysis you do yourself,
> > > > anyway...
> > > > 
> > > 
> > > FWIW, I finally managed to reproduce this (trace dump attached for
> > > reference). I ultimately moved to another system, starting using loop
> > > devices and bumped up the fstests LOAD_FACTOR. What I see is essentially
> > > what looks like the lack of a wake up problem described above. More
> > > specifically:
> > > 
> > > - Workload begins, background CIL push occurs for ~1.25MB context.
> > > - While the first checkpoint is being written out, another background
> > >   push is requested. By the time we can execute the next CIL push, the
> > >   second context accrues to ~8MB in size, basically consuming the rest of
> > >   the log.
> > > 
> > > So that essentially describes the delay. It's just a matter of timing
> > > due to the amount of time it takes to complete the first checkpoint and
> > > start the second vs. the log reservation consumption workload.
> > > 
> > > - The next CIL push executes, but all transaction waiters block before
> > >   the iclog with the commit record for the first checkpoint hits the
> > >   log.
> > > 
> > > So the AIL doesn't actually remain empty indefinitely here. Shortly
> > > after the second CIL push starts doing I/O, the items from the first
> > > checkpoint make it into the AIL. The problem is just that the AIL stays
> > > empty long enough to absorb all of the AIL pushes that occur due to
> > > log reservation pressure before everything stops.
> > > 
> > > I think the two extra AIL push patches Dave has posted are enough to
> > > keep things moving. So far I haven't been able to reproduce with those
> > > applied (though I applied a modified variant of the second)...
> > > 
> > 
> > I am able to recreate the bug without the "synchronous transaction" patch. I
> > am attaching the trace file with this mail.
> 

FYI I don't see a trace attached. Note that I attached one to a previous
mail yesterday (~800k) and the mail never made it through. I guess it
was too large.

> I think it essentially boils down to the fact that without "synchronous
> transactions" during log recovery, the last iclog containing the commit record
> wouldn't be submitted for I/O since in the case of this bug we have more than
> "2*sizeof(xlog_op_header_t)" space left in it.
> 

In my reproducer, the last iclog for the first CIL push isn't written
out as part of the push (as you describe), but another background CIL
push is queued before the first checkpoint finishes writing to disk.
This means that commit record iclog is not stuck indefinitely, but
rather all of the log reservation pressure that is translated into AIL
pushing before reservation is fully consumed is essentially thrown away
because it occurs before the AIL is populated. The full trace should
show us whether you're seeing similar behavior.

> Without the "synchronous transaction" change, the patch which initiates the
> AIL push from the iclog endio function would be ineffective.
> 

Hmm. So between your and Dave's comments on that patch, I'm not sure I'm
following the sequence the motivated it. It sounds like it is a similar
problem related to smaller checkpoints, which probably explains why it
has no effect in my tests. I think the reason it's unnecessary for the
generic/530 scenario is that the same followup CIL push that writes the
first checkpoint iclog out to disk ultimately releases the CIL context
ticket (regardless of whether the second checkpoint commit record is
committed) and that wakes up log waiters. After Dave's first patch, the
log waker now also pushes the AIL, which at this point has been
populated by the first checkpoint.

That's just a guess, but even if correct it's still a timing sensitive
pattern so I think it might make sense to have that additional wakeup.

Brian

> -- 
> chandan
> 
> 
> 

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2019-08-31 13:07 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-21 11:04 [RFC] xfs: Flush iclog containing XLOG_COMMIT_TRANS before waiting for log space Chandan Rajendra
2019-08-21 16:42 ` Brian Foster
2019-08-22 12:46   ` Chandan Rajendra
2019-08-21 22:18 ` Dave Chinner
2019-08-22 12:40   ` Chandan Rajendra
2019-08-22 13:19     ` Chandan Rajendra
2019-08-22 16:34   ` Brian Foster
2019-08-23  0:06     ` Dave Chinner
2019-08-23 13:08       ` Brian Foster
2019-08-23 13:38         ` Chandan Rajendra
2019-08-25 15:05           ` Chandan Rajendra
2019-08-26  0:32             ` Dave Chinner
2019-08-26 12:39               ` Brian Foster
2019-08-27 23:10                 ` Dave Chinner
2019-08-28 12:06                   ` Brian Foster
2019-08-29  5:21               ` Chandan Rajendra
2019-08-29 23:08                 ` Dave Chinner
2019-08-30  0:34                   ` Dave Chinner
2019-08-30  2:13                     ` Dave Chinner
2019-08-30  6:05                       ` Chandan Rajendra
2019-08-30 17:24                       ` Brian Foster
2019-08-30 22:10                         ` Dave Chinner
2019-08-30 23:18                           ` Brian Foster
     [not found] ` <20190830164750.GD26520@bfoster>
     [not found]   ` <2367290.sgLJaTIShl@localhost.localdomain>
2019-08-31  6:58     ` Chandan Rajendra
2019-08-31 13:07       ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).