All of lore.kernel.org
 help / color / mirror / Atom feed
* xfs: fix CIL push hang in for-next tree
@ 2021-06-15  6:46 Dave Chinner
  2021-06-15  6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Dave Chinner @ 2021-06-15  6:46 UTC (permalink / raw)
  To: linux-xfs

Hi folks,

This is the first fix for the problems Brian has reported from
generic/019. This has fixed the hang, but the other log recovery
problem he reported is still present (seen once with these patches
in place).

I've tested these out to a couple of hundred cycles of
continual looping generic/019 before the systems fall over with a
perag reference count underrun at unmount after a shutdown. I'm
pretty sure the hang is fixed, as it would manifest within 10-20
cycles without this patch.

The first patch is the iclogbuf state tracing I used to capture the
iclogbuf wrapping state. The second patch is the fix.

Cheers,

Dave.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] xfs: add iclog state trace events
  2021-06-15  6:46 xfs: fix CIL push hang in for-next tree Dave Chinner
@ 2021-06-15  6:46 ` Dave Chinner
  2021-06-15 15:26   ` Darrick J. Wong
  2021-06-15 15:37   ` Brian Foster
  2021-06-15  6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
  2021-06-15 17:57 ` xfs: fix CIL push hang in for-next tree Darrick J. Wong
  2 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2021-06-15  6:46 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

For the DEBUGS!

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c   | 18 ++++++++++++++++
 fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index e921b554b683..54fd6a695bb5 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -524,6 +524,7 @@ __xlog_state_release_iclog(
 		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
 		xlog_verify_tail_lsn(log, iclog, tail_lsn);
 		/* cycle incremented when incrementing curr_block */
+		trace_xlog_iclog_syncing(iclog, _RET_IP_);
 		return true;
 	}
 
@@ -543,6 +544,7 @@ xlog_state_release_iclog(
 {
 	lockdep_assert_held(&log->l_icloglock);
 
+	trace_xlog_iclog_release(iclog, _RET_IP_);
 	if (iclog->ic_state == XLOG_STATE_IOERROR)
 		return -EIO;
 
@@ -804,6 +806,7 @@ xlog_wait_on_iclog(
 {
 	struct xlog		*log = iclog->ic_log;
 
+	trace_xlog_iclog_wait_on(iclog, _RET_IP_);
 	if (!XLOG_FORCED_SHUTDOWN(log) &&
 	    iclog->ic_state != XLOG_STATE_ACTIVE &&
 	    iclog->ic_state != XLOG_STATE_DIRTY) {
@@ -1804,6 +1807,7 @@ xlog_write_iclog(
 	unsigned int		count)
 {
 	ASSERT(bno < log->l_logBBsize);
+	trace_xlog_iclog_write(iclog, _RET_IP_);
 
 	/*
 	 * We lock the iclogbufs here so that we can serialise against I/O
@@ -1950,6 +1954,7 @@ xlog_sync(
 	unsigned int		size;
 
 	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
+	trace_xlog_iclog_sync(iclog, _RET_IP_);
 
 	count = xlog_calc_iclog_size(log, iclog, &roundoff);
 
@@ -2488,6 +2493,7 @@ xlog_state_activate_iclog(
 	int			*iclogs_changed)
 {
 	ASSERT(list_empty_careful(&iclog->ic_callbacks));
+	trace_xlog_iclog_activate(iclog, _RET_IP_);
 
 	/*
 	 * If the number of ops in this iclog indicate it just contains the
@@ -2577,6 +2583,8 @@ xlog_state_clean_iclog(
 {
 	int			iclogs_changed = 0;
 
+	trace_xlog_iclog_clean(dirty_iclog, _RET_IP_);
+
 	dirty_iclog->ic_state = XLOG_STATE_DIRTY;
 
 	xlog_state_activate_iclogs(log, &iclogs_changed);
@@ -2636,6 +2644,7 @@ xlog_state_set_callback(
 	struct xlog_in_core	*iclog,
 	xfs_lsn_t		header_lsn)
 {
+	trace_xlog_iclog_callback(iclog, _RET_IP_);
 	iclog->ic_state = XLOG_STATE_CALLBACK;
 
 	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
@@ -2717,6 +2726,7 @@ xlog_state_do_iclog_callbacks(
 		__releases(&log->l_icloglock)
 		__acquires(&log->l_icloglock)
 {
+	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
 	spin_unlock(&log->l_icloglock);
 	spin_lock(&iclog->ic_callback_lock);
 	while (!list_empty(&iclog->ic_callbacks)) {
@@ -2736,6 +2746,7 @@ xlog_state_do_iclog_callbacks(
 	 */
 	spin_lock(&log->l_icloglock);
 	spin_unlock(&iclog->ic_callback_lock);
+	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
 }
 
 STATIC void
@@ -2827,6 +2838,7 @@ xlog_state_done_syncing(
 
 	spin_lock(&log->l_icloglock);
 	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
+	trace_xlog_iclog_sync_done(iclog, _RET_IP_);
 
 	/*
 	 * If we got an error, either on the first buffer, or in the case of
@@ -2899,6 +2911,8 @@ xlog_state_get_iclog_space(
 	atomic_inc(&iclog->ic_refcnt);	/* prevents sync */
 	log_offset = iclog->ic_offset;
 
+	trace_xlog_iclog_get_space(iclog, _RET_IP_);
+
 	/* On the 1st write to an iclog, figure out lsn.  This works
 	 * if iclogs marked XLOG_STATE_WANT_SYNC always write out what they are
 	 * committing to.  If the offset is set, that's how many blocks
@@ -3056,6 +3070,7 @@ xlog_state_switch_iclogs(
 {
 	ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
 	assert_spin_locked(&log->l_icloglock);
+	trace_xlog_iclog_switch(iclog, _RET_IP_);
 
 	if (!eventual_size)
 		eventual_size = iclog->ic_offset;
@@ -3138,6 +3153,8 @@ xfs_log_force(
 	if (iclog->ic_state == XLOG_STATE_IOERROR)
 		goto out_error;
 
+	trace_xlog_iclog_force(iclog, _RET_IP_);
+
 	if (iclog->ic_state == XLOG_STATE_DIRTY ||
 	    (iclog->ic_state == XLOG_STATE_ACTIVE &&
 	     atomic_read(&iclog->ic_refcnt) == 0 && iclog->ic_offset == 0)) {
@@ -3225,6 +3242,7 @@ xlog_force_lsn(
 		goto out_error;
 
 	while (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) {
+		trace_xlog_iclog_force_lsn(iclog, _RET_IP_);
 		iclog = iclog->ic_next;
 		if (iclog == log->l_iclog)
 			goto out_unlock;
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 71dca776c110..79ee973ba1ed 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -24,6 +24,7 @@ struct xlog_ticket;
 struct xlog_recover;
 struct xlog_recover_item;
 struct xlog_rec_header;
+struct xlog_in_core;
 struct xfs_buf_log_format;
 struct xfs_inode_log_format;
 struct xfs_bmbt_irec;
@@ -3927,6 +3928,57 @@ DEFINE_EVENT(xfs_icwalk_class, name,	\
 DEFINE_ICWALK_EVENT(xfs_ioc_free_eofblocks);
 DEFINE_ICWALK_EVENT(xfs_blockgc_free_space);
 
+DECLARE_EVENT_CLASS(xlog_iclog_class,
+	TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip),
+	TP_ARGS(iclog, caller_ip),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(uint32_t, state)
+		__field(int32_t, refcount)
+		__field(uint32_t, offset)
+		__field(unsigned long long, lsn)
+		__field(unsigned long, caller_ip)
+	),
+	TP_fast_assign(
+		__entry->dev = iclog->ic_log->l_mp->m_super->s_dev;
+		__entry->state = iclog->ic_state;
+		__entry->refcount = atomic_read(&iclog->ic_refcnt);
+		__entry->offset = iclog->ic_offset;
+		__entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn);
+		__entry->caller_ip = caller_ip;
+	),
+	TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->state,
+		  __entry->refcount,
+		  __entry->offset,
+		  __entry->lsn,
+		  (char *)__entry->caller_ip)
+
+);
+
+#define DEFINE_ICLOG_EVENT(name)	\
+DEFINE_EVENT(xlog_iclog_class, name,	\
+	TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), \
+	TP_ARGS(iclog, caller_ip))
+
+DEFINE_ICLOG_EVENT(xlog_iclog_activate);
+DEFINE_ICLOG_EVENT(xlog_iclog_clean);
+DEFINE_ICLOG_EVENT(xlog_iclog_callback);
+DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_start);
+DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_done);
+DEFINE_ICLOG_EVENT(xlog_iclog_force);
+DEFINE_ICLOG_EVENT(xlog_iclog_force_lsn);
+DEFINE_ICLOG_EVENT(xlog_iclog_get_space);
+DEFINE_ICLOG_EVENT(xlog_iclog_release);
+DEFINE_ICLOG_EVENT(xlog_iclog_switch);
+DEFINE_ICLOG_EVENT(xlog_iclog_sync);
+DEFINE_ICLOG_EVENT(xlog_iclog_syncing);
+DEFINE_ICLOG_EVENT(xlog_iclog_sync_done);
+DEFINE_ICLOG_EVENT(xlog_iclog_want_sync);
+DEFINE_ICLOG_EVENT(xlog_iclog_wait_on);
+DEFINE_ICLOG_EVENT(xlog_iclog_write);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-15  6:46 xfs: fix CIL push hang in for-next tree Dave Chinner
  2021-06-15  6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner
@ 2021-06-15  6:46 ` Dave Chinner
  2021-06-15 15:38   ` Brian Foster
  2021-06-15 16:19   ` Darrick J. Wong
  2021-06-15 17:57 ` xfs: fix CIL push hang in for-next tree Darrick J. Wong
  2 siblings, 2 replies; 14+ messages in thread
From: Dave Chinner @ 2021-06-15  6:46 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The iclogbuf ring attached to the struct xlog is circular, hence the
first and last iclogs in the ring can only be determined by
comparing them against the log->l_iclog pointer.

In xfs_cil_push_work(), we want to wait on previous iclogs that were
issued so that we can flush them to stable storage with the commit
record write, and it simply waits on the previous iclog in the ring.
This, however, leads to CIL push hangs in generic/019 like so:

task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
Workqueue: xfs-cil/pmem1 xlog_cil_push_work
Call Trace:
 __schedule+0x30b/0x9f0
 schedule+0x68/0xe0
 xlog_wait_on_iclog+0x121/0x190
 ? wake_up_q+0xa0/0xa0
 xlog_cil_push_work+0x994/0xa10
 ? _raw_spin_lock+0x15/0x20
 ? xfs_swap_extents+0x920/0x920
 process_one_work+0x1ab/0x390
 worker_thread+0x56/0x3d0
 ? rescuer_thread+0x3c0/0x3c0
 kthread+0x14d/0x170
 ? __kthread_bind_mask+0x70/0x70
 ret_from_fork+0x1f/0x30

With other threads blocking in either xlog_state_get_iclog_space()
waiting for iclog space or xlog_grant_head_wait() waiting for log
reservation space.

The problem here is that the previous iclog on the ring might
actually be a future iclog. That is, if log->l_iclog points at
commit_iclog, commit_iclog is the first (oldest) iclog in the ring
and there are no previous iclogs pending as they have all completed
their IO and been activated again. IOWs, commit_iclog->ic_prev
points to an iclog that will be written in the future, not one that
has been written in the past.

Hence, in this case, waiting on the ->ic_prev iclog is incorrect
behaviour, and depending on the state of the future iclog, we can
end up with a circular ABA wait cycle and we hang.

Fix this by only waiting on the previous iclog when the commit_iclog
is not the oldest iclog in the ring.

Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
Reported-by: Brian Foster <bfoster@redhat.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
index 705619e9dab4..398f00cf9cbf 100644
--- a/fs/xfs/xfs_log_cil.c
+++ b/fs/xfs/xfs_log_cil.c
@@ -1075,15 +1075,21 @@ xlog_cil_push_work(
 	ticket = ctx->ticket;
 
 	/*
-	 * If the checkpoint spans multiple iclogs, wait for all previous
-	 * iclogs to complete before we submit the commit_iclog. In this case,
-	 * the commit_iclog write needs to issue a pre-flush so that the
-	 * ordering is correctly preserved down to stable storage.
+	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
+	 * to complete before we submit the commit_iclog. If the commit iclog is
+	 * at the head of the iclog ring, then all other iclogs have completed
+	 * and are waiting on this one and hence we don't need to wait.
+	 *
+	 * Regardless of whether we need to wait or not, the the commit_iclog
+	 * write needs to issue a pre-flush so that the ordering for this
+	 * checkpoint is correctly preserved down to stable storage.
 	 */
 	spin_lock(&log->l_icloglock);
 	if (ctx->start_lsn != commit_lsn) {
-		xlog_wait_on_iclog(commit_iclog->ic_prev);
-		spin_lock(&log->l_icloglock);
+		if (commit_iclog != log->l_iclog) {
+			xlog_wait_on_iclog(commit_iclog->ic_prev);
+			spin_lock(&log->l_icloglock);
+		}
 		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
 	}
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] xfs: add iclog state trace events
  2021-06-15  6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner
@ 2021-06-15 15:26   ` Darrick J. Wong
  2021-06-15 15:37   ` Brian Foster
  1 sibling, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2021-06-15 15:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 15, 2021 at 04:46:57PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For the DEBUGS!
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c   | 18 ++++++++++++++++
>  fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index e921b554b683..54fd6a695bb5 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -524,6 +524,7 @@ __xlog_state_release_iclog(
>  		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  		xlog_verify_tail_lsn(log, iclog, tail_lsn);
>  		/* cycle incremented when incrementing curr_block */
> +		trace_xlog_iclog_syncing(iclog, _RET_IP_);
>  		return true;
>  	}
>  
> @@ -543,6 +544,7 @@ xlog_state_release_iclog(
>  {
>  	lockdep_assert_held(&log->l_icloglock);
>  
> +	trace_xlog_iclog_release(iclog, _RET_IP_);
>  	if (iclog->ic_state == XLOG_STATE_IOERROR)
>  		return -EIO;
>  
> @@ -804,6 +806,7 @@ xlog_wait_on_iclog(
>  {
>  	struct xlog		*log = iclog->ic_log;
>  
> +	trace_xlog_iclog_wait_on(iclog, _RET_IP_);
>  	if (!XLOG_FORCED_SHUTDOWN(log) &&
>  	    iclog->ic_state != XLOG_STATE_ACTIVE &&
>  	    iclog->ic_state != XLOG_STATE_DIRTY) {
> @@ -1804,6 +1807,7 @@ xlog_write_iclog(
>  	unsigned int		count)
>  {
>  	ASSERT(bno < log->l_logBBsize);
> +	trace_xlog_iclog_write(iclog, _RET_IP_);
>  
>  	/*
>  	 * We lock the iclogbufs here so that we can serialise against I/O
> @@ -1950,6 +1954,7 @@ xlog_sync(
>  	unsigned int		size;
>  
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
> +	trace_xlog_iclog_sync(iclog, _RET_IP_);
>  
>  	count = xlog_calc_iclog_size(log, iclog, &roundoff);
>  
> @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog(
>  	int			*iclogs_changed)
>  {
>  	ASSERT(list_empty_careful(&iclog->ic_callbacks));
> +	trace_xlog_iclog_activate(iclog, _RET_IP_);
>  
>  	/*
>  	 * If the number of ops in this iclog indicate it just contains the
> @@ -2577,6 +2583,8 @@ xlog_state_clean_iclog(
>  {
>  	int			iclogs_changed = 0;
>  
> +	trace_xlog_iclog_clean(dirty_iclog, _RET_IP_);
> +
>  	dirty_iclog->ic_state = XLOG_STATE_DIRTY;
>  
>  	xlog_state_activate_iclogs(log, &iclogs_changed);
> @@ -2636,6 +2644,7 @@ xlog_state_set_callback(
>  	struct xlog_in_core	*iclog,
>  	xfs_lsn_t		header_lsn)
>  {
> +	trace_xlog_iclog_callback(iclog, _RET_IP_);
>  	iclog->ic_state = XLOG_STATE_CALLBACK;
>  
>  	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
> @@ -2717,6 +2726,7 @@ xlog_state_do_iclog_callbacks(
>  		__releases(&log->l_icloglock)
>  		__acquires(&log->l_icloglock)
>  {
> +	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
>  	spin_unlock(&log->l_icloglock);
>  	spin_lock(&iclog->ic_callback_lock);
>  	while (!list_empty(&iclog->ic_callbacks)) {
> @@ -2736,6 +2746,7 @@ xlog_state_do_iclog_callbacks(
>  	 */
>  	spin_lock(&log->l_icloglock);
>  	spin_unlock(&iclog->ic_callback_lock);
> +	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
>  }
>  
>  STATIC void
> @@ -2827,6 +2838,7 @@ xlog_state_done_syncing(
>  
>  	spin_lock(&log->l_icloglock);
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
> +	trace_xlog_iclog_sync_done(iclog, _RET_IP_);
>  
>  	/*
>  	 * If we got an error, either on the first buffer, or in the case of
> @@ -2899,6 +2911,8 @@ xlog_state_get_iclog_space(
>  	atomic_inc(&iclog->ic_refcnt);	/* prevents sync */
>  	log_offset = iclog->ic_offset;
>  
> +	trace_xlog_iclog_get_space(iclog, _RET_IP_);
> +
>  	/* On the 1st write to an iclog, figure out lsn.  This works
>  	 * if iclogs marked XLOG_STATE_WANT_SYNC always write out what they are
>  	 * committing to.  If the offset is set, that's how many blocks
> @@ -3056,6 +3070,7 @@ xlog_state_switch_iclogs(
>  {
>  	ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
>  	assert_spin_locked(&log->l_icloglock);
> +	trace_xlog_iclog_switch(iclog, _RET_IP_);
>  
>  	if (!eventual_size)
>  		eventual_size = iclog->ic_offset;
> @@ -3138,6 +3153,8 @@ xfs_log_force(
>  	if (iclog->ic_state == XLOG_STATE_IOERROR)
>  		goto out_error;
>  
> +	trace_xlog_iclog_force(iclog, _RET_IP_);
> +
>  	if (iclog->ic_state == XLOG_STATE_DIRTY ||
>  	    (iclog->ic_state == XLOG_STATE_ACTIVE &&
>  	     atomic_read(&iclog->ic_refcnt) == 0 && iclog->ic_offset == 0)) {
> @@ -3225,6 +3242,7 @@ xlog_force_lsn(
>  		goto out_error;
>  
>  	while (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) {
> +		trace_xlog_iclog_force_lsn(iclog, _RET_IP_);
>  		iclog = iclog->ic_next;
>  		if (iclog == log->l_iclog)
>  			goto out_unlock;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 71dca776c110..79ee973ba1ed 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -24,6 +24,7 @@ struct xlog_ticket;
>  struct xlog_recover;
>  struct xlog_recover_item;
>  struct xlog_rec_header;
> +struct xlog_in_core;
>  struct xfs_buf_log_format;
>  struct xfs_inode_log_format;
>  struct xfs_bmbt_irec;
> @@ -3927,6 +3928,57 @@ DEFINE_EVENT(xfs_icwalk_class, name,	\
>  DEFINE_ICWALK_EVENT(xfs_ioc_free_eofblocks);
>  DEFINE_ICWALK_EVENT(xfs_blockgc_free_space);
>  
> +DECLARE_EVENT_CLASS(xlog_iclog_class,
> +	TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip),
> +	TP_ARGS(iclog, caller_ip),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(uint32_t, state)
> +		__field(int32_t, refcount)
> +		__field(uint32_t, offset)
> +		__field(unsigned long long, lsn)
> +		__field(unsigned long, caller_ip)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = iclog->ic_log->l_mp->m_super->s_dev;
> +		__entry->state = iclog->ic_state;
> +		__entry->refcount = atomic_read(&iclog->ic_refcnt);
> +		__entry->offset = iclog->ic_offset;
> +		__entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> +		__entry->caller_ip = caller_ip;
> +	),
> +	TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->state,

Any chance you want to embed XLOG_STATE_* decoding instead of dumping
the raw value here?

Otherwise, looks fine to me,

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D


> +		  __entry->refcount,
> +		  __entry->offset,
> +		  __entry->lsn,
> +		  (char *)__entry->caller_ip)
> +
> +);
> +
> +#define DEFINE_ICLOG_EVENT(name)	\
> +DEFINE_EVENT(xlog_iclog_class, name,	\
> +	TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), \
> +	TP_ARGS(iclog, caller_ip))
> +
> +DEFINE_ICLOG_EVENT(xlog_iclog_activate);
> +DEFINE_ICLOG_EVENT(xlog_iclog_clean);
> +DEFINE_ICLOG_EVENT(xlog_iclog_callback);
> +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_start);
> +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_done);
> +DEFINE_ICLOG_EVENT(xlog_iclog_force);
> +DEFINE_ICLOG_EVENT(xlog_iclog_force_lsn);
> +DEFINE_ICLOG_EVENT(xlog_iclog_get_space);
> +DEFINE_ICLOG_EVENT(xlog_iclog_release);
> +DEFINE_ICLOG_EVENT(xlog_iclog_switch);
> +DEFINE_ICLOG_EVENT(xlog_iclog_sync);
> +DEFINE_ICLOG_EVENT(xlog_iclog_syncing);
> +DEFINE_ICLOG_EVENT(xlog_iclog_sync_done);
> +DEFINE_ICLOG_EVENT(xlog_iclog_want_sync);
> +DEFINE_ICLOG_EVENT(xlog_iclog_wait_on);
> +DEFINE_ICLOG_EVENT(xlog_iclog_write);
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] xfs: add iclog state trace events
  2021-06-15  6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner
  2021-06-15 15:26   ` Darrick J. Wong
@ 2021-06-15 15:37   ` Brian Foster
  2021-06-15 22:47     ` Dave Chinner
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2021-06-15 15:37 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 15, 2021 at 04:46:57PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> For the DEBUGS!
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c   | 18 ++++++++++++++++
>  fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index e921b554b683..54fd6a695bb5 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -524,6 +524,7 @@ __xlog_state_release_iclog(
>  		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
>  		xlog_verify_tail_lsn(log, iclog, tail_lsn);
>  		/* cycle incremented when incrementing curr_block */
> +		trace_xlog_iclog_syncing(iclog, _RET_IP_);
>  		return true;
>  	}
>  
> @@ -543,6 +544,7 @@ xlog_state_release_iclog(
>  {
>  	lockdep_assert_held(&log->l_icloglock);
>  
> +	trace_xlog_iclog_release(iclog, _RET_IP_);
>  	if (iclog->ic_state == XLOG_STATE_IOERROR)
>  		return -EIO;
>  
> @@ -804,6 +806,7 @@ xlog_wait_on_iclog(
>  {
>  	struct xlog		*log = iclog->ic_log;
>  
> +	trace_xlog_iclog_wait_on(iclog, _RET_IP_);

Seems like this might be more informative if we actually wait.

>  	if (!XLOG_FORCED_SHUTDOWN(log) &&
>  	    iclog->ic_state != XLOG_STATE_ACTIVE &&
>  	    iclog->ic_state != XLOG_STATE_DIRTY) {
> @@ -1804,6 +1807,7 @@ xlog_write_iclog(
>  	unsigned int		count)
>  {
>  	ASSERT(bno < log->l_logBBsize);
> +	trace_xlog_iclog_write(iclog, _RET_IP_);
>  
>  	/*
>  	 * We lock the iclogbufs here so that we can serialise against I/O
> @@ -1950,6 +1954,7 @@ xlog_sync(
>  	unsigned int		size;
>  
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
> +	trace_xlog_iclog_sync(iclog, _RET_IP_);
>  
>  	count = xlog_calc_iclog_size(log, iclog, &roundoff);
>  
> @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog(
>  	int			*iclogs_changed)
>  {
>  	ASSERT(list_empty_careful(&iclog->ic_callbacks));
> +	trace_xlog_iclog_activate(iclog, _RET_IP_);
>  

I wonder a bit whether these state change oriented tracepoints could be
served by a single trace_ixlog_iclog_state() or some such, since the
tracepoint already prints the state. That wouldn't show the before ->
state in each tracepoint, but that should reflect in the full log.

>  	/*
>  	 * If the number of ops in this iclog indicate it just contains the
> @@ -2577,6 +2583,8 @@ xlog_state_clean_iclog(
>  {
>  	int			iclogs_changed = 0;
>  
> +	trace_xlog_iclog_clean(dirty_iclog, _RET_IP_);
> +
>  	dirty_iclog->ic_state = XLOG_STATE_DIRTY;
>  
>  	xlog_state_activate_iclogs(log, &iclogs_changed);
> @@ -2636,6 +2644,7 @@ xlog_state_set_callback(
>  	struct xlog_in_core	*iclog,
>  	xfs_lsn_t		header_lsn)
>  {
> +	trace_xlog_iclog_callback(iclog, _RET_IP_);
>  	iclog->ic_state = XLOG_STATE_CALLBACK;
>  
>  	ASSERT(XFS_LSN_CMP(atomic64_read(&log->l_last_sync_lsn),
> @@ -2717,6 +2726,7 @@ xlog_state_do_iclog_callbacks(
>  		__releases(&log->l_icloglock)
>  		__acquires(&log->l_icloglock)
>  {
> +	trace_xlog_iclog_callbacks_start(iclog, _RET_IP_);
>  	spin_unlock(&log->l_icloglock);
>  	spin_lock(&iclog->ic_callback_lock);
>  	while (!list_empty(&iclog->ic_callbacks)) {
> @@ -2736,6 +2746,7 @@ xlog_state_do_iclog_callbacks(
>  	 */
>  	spin_lock(&log->l_icloglock);
>  	spin_unlock(&iclog->ic_callback_lock);
> +	trace_xlog_iclog_callbacks_done(iclog, _RET_IP_);
>  }
>  
>  STATIC void
> @@ -2827,6 +2838,7 @@ xlog_state_done_syncing(
>  
>  	spin_lock(&log->l_icloglock);
>  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
> +	trace_xlog_iclog_sync_done(iclog, _RET_IP_);
>  
>  	/*
>  	 * If we got an error, either on the first buffer, or in the case of
> @@ -2899,6 +2911,8 @@ xlog_state_get_iclog_space(
>  	atomic_inc(&iclog->ic_refcnt);	/* prevents sync */
>  	log_offset = iclog->ic_offset;
>  
> +	trace_xlog_iclog_get_space(iclog, _RET_IP_);
> +
>  	/* On the 1st write to an iclog, figure out lsn.  This works
>  	 * if iclogs marked XLOG_STATE_WANT_SYNC always write out what they are
>  	 * committing to.  If the offset is set, that's how many blocks
> @@ -3056,6 +3070,7 @@ xlog_state_switch_iclogs(
>  {
>  	ASSERT(iclog->ic_state == XLOG_STATE_ACTIVE);
>  	assert_spin_locked(&log->l_icloglock);
> +	trace_xlog_iclog_switch(iclog, _RET_IP_);
>  
>  	if (!eventual_size)
>  		eventual_size = iclog->ic_offset;
> @@ -3138,6 +3153,8 @@ xfs_log_force(
>  	if (iclog->ic_state == XLOG_STATE_IOERROR)
>  		goto out_error;
>  
> +	trace_xlog_iclog_force(iclog, _RET_IP_);
> +

We have a log force tracepoint a few lines up. Do we need both?

>  	if (iclog->ic_state == XLOG_STATE_DIRTY ||
>  	    (iclog->ic_state == XLOG_STATE_ACTIVE &&
>  	     atomic_read(&iclog->ic_refcnt) == 0 && iclog->ic_offset == 0)) {
> @@ -3225,6 +3242,7 @@ xlog_force_lsn(
>  		goto out_error;
>  
>  	while (be64_to_cpu(iclog->ic_header.h_lsn) != lsn) {
> +		trace_xlog_iclog_force_lsn(iclog, _RET_IP_);
>  		iclog = iclog->ic_next;
>  		if (iclog == log->l_iclog)
>  			goto out_unlock;
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 71dca776c110..79ee973ba1ed 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -24,6 +24,7 @@ struct xlog_ticket;
>  struct xlog_recover;
>  struct xlog_recover_item;
>  struct xlog_rec_header;
> +struct xlog_in_core;
>  struct xfs_buf_log_format;
>  struct xfs_inode_log_format;
>  struct xfs_bmbt_irec;
> @@ -3927,6 +3928,57 @@ DEFINE_EVENT(xfs_icwalk_class, name,	\
>  DEFINE_ICWALK_EVENT(xfs_ioc_free_eofblocks);
>  DEFINE_ICWALK_EVENT(xfs_blockgc_free_space);
>  
> +DECLARE_EVENT_CLASS(xlog_iclog_class,
> +	TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip),
> +	TP_ARGS(iclog, caller_ip),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(uint32_t, state)
> +		__field(int32_t, refcount)
> +		__field(uint32_t, offset)
> +		__field(unsigned long long, lsn)
> +		__field(unsigned long, caller_ip)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = iclog->ic_log->l_mp->m_super->s_dev;
> +		__entry->state = iclog->ic_state;
> +		__entry->refcount = atomic_read(&iclog->ic_refcnt);
> +		__entry->offset = iclog->ic_offset;
> +		__entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> +		__entry->caller_ip = caller_ip;
> +	),
> +	TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->state,

It might be nice to see a string representation of the state like we do
for other things like log item type, etc.

Brian

> +		  __entry->refcount,
> +		  __entry->offset,
> +		  __entry->lsn,
> +		  (char *)__entry->caller_ip)
> +
> +);
> +
> +#define DEFINE_ICLOG_EVENT(name)	\
> +DEFINE_EVENT(xlog_iclog_class, name,	\
> +	TP_PROTO(struct xlog_in_core *iclog, unsigned long caller_ip), \
> +	TP_ARGS(iclog, caller_ip))
> +
> +DEFINE_ICLOG_EVENT(xlog_iclog_activate);
> +DEFINE_ICLOG_EVENT(xlog_iclog_clean);
> +DEFINE_ICLOG_EVENT(xlog_iclog_callback);
> +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_start);
> +DEFINE_ICLOG_EVENT(xlog_iclog_callbacks_done);
> +DEFINE_ICLOG_EVENT(xlog_iclog_force);
> +DEFINE_ICLOG_EVENT(xlog_iclog_force_lsn);
> +DEFINE_ICLOG_EVENT(xlog_iclog_get_space);
> +DEFINE_ICLOG_EVENT(xlog_iclog_release);
> +DEFINE_ICLOG_EVENT(xlog_iclog_switch);
> +DEFINE_ICLOG_EVENT(xlog_iclog_sync);
> +DEFINE_ICLOG_EVENT(xlog_iclog_syncing);
> +DEFINE_ICLOG_EVENT(xlog_iclog_sync_done);
> +DEFINE_ICLOG_EVENT(xlog_iclog_want_sync);
> +DEFINE_ICLOG_EVENT(xlog_iclog_wait_on);
> +DEFINE_ICLOG_EVENT(xlog_iclog_write);
> +
>  #endif /* _TRACE_XFS_H */
>  
>  #undef TRACE_INCLUDE_PATH
> -- 
> 2.31.1
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-15  6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
@ 2021-06-15 15:38   ` Brian Foster
  2021-06-15 22:09     ` Dave Chinner
  2021-06-15 16:19   ` Darrick J. Wong
  1 sibling, 1 reply; 14+ messages in thread
From: Brian Foster @ 2021-06-15 15:38 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The iclogbuf ring attached to the struct xlog is circular, hence the
> first and last iclogs in the ring can only be determined by
> comparing them against the log->l_iclog pointer.
> 
> In xfs_cil_push_work(), we want to wait on previous iclogs that were
> issued so that we can flush them to stable storage with the commit
> record write, and it simply waits on the previous iclog in the ring.
> This, however, leads to CIL push hangs in generic/019 like so:
> 
> task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> Call Trace:
>  __schedule+0x30b/0x9f0
>  schedule+0x68/0xe0
>  xlog_wait_on_iclog+0x121/0x190
>  ? wake_up_q+0xa0/0xa0
>  xlog_cil_push_work+0x994/0xa10
>  ? _raw_spin_lock+0x15/0x20
>  ? xfs_swap_extents+0x920/0x920
>  process_one_work+0x1ab/0x390
>  worker_thread+0x56/0x3d0
>  ? rescuer_thread+0x3c0/0x3c0
>  kthread+0x14d/0x170
>  ? __kthread_bind_mask+0x70/0x70
>  ret_from_fork+0x1f/0x30
> 
> With other threads blocking in either xlog_state_get_iclog_space()
> waiting for iclog space or xlog_grant_head_wait() waiting for log
> reservation space.
> 
> The problem here is that the previous iclog on the ring might
> actually be a future iclog. That is, if log->l_iclog points at
> commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> and there are no previous iclogs pending as they have all completed
> their IO and been activated again. IOWs, commit_iclog->ic_prev
> points to an iclog that will be written in the future, not one that
> has been written in the past.
> 
> Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> behaviour, and depending on the state of the future iclog, we can
> end up with a circular ABA wait cycle and we hang.
> 
> Fix this by only waiting on the previous iclog when the commit_iclog
> is not the oldest iclog in the ring.
> 
> Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 705619e9dab4..398f00cf9cbf 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -1075,15 +1075,21 @@ xlog_cil_push_work(
>  	ticket = ctx->ticket;
>  
>  	/*
> -	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog. In this case,
> -	 * the commit_iclog write needs to issue a pre-flush so that the
> -	 * ordering is correctly preserved down to stable storage.
> +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> +	 * to complete before we submit the commit_iclog. If the commit iclog is
> +	 * at the head of the iclog ring, then all other iclogs have completed
> +	 * and are waiting on this one and hence we don't need to wait.
> +	 *
> +	 * Regardless of whether we need to wait or not, the the commit_iclog
> +	 * write needs to issue a pre-flush so that the ordering for this
> +	 * checkpoint is correctly preserved down to stable storage.
>  	 */
>  	spin_lock(&log->l_icloglock);
>  	if (ctx->start_lsn != commit_lsn) {
> -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> -		spin_lock(&log->l_icloglock);
> +		if (commit_iclog != log->l_iclog) {
> +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> +			spin_lock(&log->l_icloglock);
> +		}

I'm confused at what this is attempting to accomplish. If we have a
single CIL force and that happens to span several iclogs, isn't the goal
to wait on the previous iclog to ensure the previously written content
of the current checkpoint is on-disk? If so, it looks to me that
commit_iclog will always match log->l_iclog in that scenario because in
the commit record path we don't (potentially) switch it out until toward
the end of this function. Hmm.. did you mean to check ->ic_prev against
->l_iclog perhaps, or am I just missing something else?

Brian

>  		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>  	}
>  
> -- 
> 2.31.1
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-15  6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
  2021-06-15 15:38   ` Brian Foster
@ 2021-06-15 16:19   ` Darrick J. Wong
  1 sibling, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2021-06-15 16:19 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The iclogbuf ring attached to the struct xlog is circular, hence the
> first and last iclogs in the ring can only be determined by
> comparing them against the log->l_iclog pointer.
> 
> In xfs_cil_push_work(), we want to wait on previous iclogs that were
> issued so that we can flush them to stable storage with the commit
> record write, and it simply waits on the previous iclog in the ring.
> This, however, leads to CIL push hangs in generic/019 like so:
> 
> task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> Call Trace:
>  __schedule+0x30b/0x9f0
>  schedule+0x68/0xe0
>  xlog_wait_on_iclog+0x121/0x190
>  ? wake_up_q+0xa0/0xa0
>  xlog_cil_push_work+0x994/0xa10
>  ? _raw_spin_lock+0x15/0x20
>  ? xfs_swap_extents+0x920/0x920
>  process_one_work+0x1ab/0x390
>  worker_thread+0x56/0x3d0
>  ? rescuer_thread+0x3c0/0x3c0
>  kthread+0x14d/0x170
>  ? __kthread_bind_mask+0x70/0x70
>  ret_from_fork+0x1f/0x30
> 
> With other threads blocking in either xlog_state_get_iclog_space()
> waiting for iclog space or xlog_grant_head_wait() waiting for log
> reservation space.
> 
> The problem here is that the previous iclog on the ring might
> actually be a future iclog. That is, if log->l_iclog points at
> commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> and there are no previous iclogs pending as they have all completed
> their IO and been activated again. IOWs, commit_iclog->ic_prev
> points to an iclog that will be written in the future, not one that
> has been written in the past.
> 
> Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> behaviour, and depending on the state of the future iclog, we can
> end up with a circular ABA wait cycle and we hang.
> 
> Fix this by only waiting on the previous iclog when the commit_iclog
> is not the oldest iclog in the ring.
> 
> Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> Reported-by: Brian Foster <bfoster@redhat.com>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
>  1 file changed, 12 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> index 705619e9dab4..398f00cf9cbf 100644
> --- a/fs/xfs/xfs_log_cil.c
> +++ b/fs/xfs/xfs_log_cil.c
> @@ -1075,15 +1075,21 @@ xlog_cil_push_work(
>  	ticket = ctx->ticket;
>  
>  	/*
> -	 * If the checkpoint spans multiple iclogs, wait for all previous
> -	 * iclogs to complete before we submit the commit_iclog. In this case,
> -	 * the commit_iclog write needs to issue a pre-flush so that the
> -	 * ordering is correctly preserved down to stable storage.
> +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> +	 * to complete before we submit the commit_iclog. If the commit iclog is
> +	 * at the head of the iclog ring, then all other iclogs have completed
> +	 * and are waiting on this one and hence we don't need to wait.
> +	 *
> +	 * Regardless of whether we need to wait or not, the the commit_iclog
> +	 * write needs to issue a pre-flush so that the ordering for this
> +	 * checkpoint is correctly preserved down to stable storage.
>  	 */
>  	spin_lock(&log->l_icloglock);
>  	if (ctx->start_lsn != commit_lsn) {
> -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> -		spin_lock(&log->l_icloglock);
> +		if (commit_iclog != log->l_iclog) {
> +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> +			spin_lock(&log->l_icloglock);
> +		}

I'm confused.  How can you tell that we need to wait for
commit_iclog->ic_prev to be written out by comparing commit_iclog to
log->l_iclog?  Can't you determine this by checking ic_prev for DIRTY
state?

--D


>  		commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>  	}
>  
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: xfs: fix CIL push hang in for-next tree
  2021-06-15  6:46 xfs: fix CIL push hang in for-next tree Dave Chinner
  2021-06-15  6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner
  2021-06-15  6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
@ 2021-06-15 17:57 ` Darrick J. Wong
  2 siblings, 0 replies; 14+ messages in thread
From: Darrick J. Wong @ 2021-06-15 17:57 UTC (permalink / raw)
  To: Dave Chinner, Allison Henderson; +Cc: linux-xfs

On Tue, Jun 15, 2021 at 04:46:56PM +1000, Dave Chinner wrote:
> Hi folks,
> 
> This is the first fix for the problems Brian has reported from
> generic/019. This has fixed the hang, but the other log recovery
> problem he reported is still present (seen once with these patches
> in place).
> 
> I've tested these out to a couple of hundred cycles of
> continual looping generic/019 before the systems fall over with a
> perag reference count underrun at unmount after a shutdown. I'm
> pretty sure the hang is fixed, as it would manifest within 10-20
> cycles without this patch.
> 
> The first patch is the iclogbuf state tracing I used to capture the
> iclogbuf wrapping state. The second patch is the fix.

I found another bug while testing for-next.  If I run generic/100 more
than about ~30 times with a 1k block size:

FSTYP         -- xfs (debug)
PLATFORM      -- Linux/x86_64 flax-mtr00 5.13.0-rc4-djwx #rc4 SMP
PREEMPT Mon Jun 7 11:17:23 PDT 2021
MKFS_OPTIONS  -- -f -b size=1024, /dev/sdf
MOUNT_OPTIONS -- -o usrquota,grpquota,prjquota, /dev/sdf /opt

I see this in dmesg:

run fstests generic/100 at 2021-06-15 10:41:45
XFS (sda): ctx ticket reservation ran out. Need to up reservation
XFS (sda): ticket reservation summary:
XFS (sda):   unit res    = 47168 bytes
XFS (sda):   current res = -404 bytes
XFS (sda):   original count  = 1
XFS (sda):   remaining count = 1
XFS (sda): xfs_do_force_shutdown(0x2) called from line 2440 of file fs/xfs/xfs_log.c. Return address = xlog_write+0x608/0x640 [xfs]
XFS (sda): Log I/O Error Detected. Shutting down filesystem
XFS (sda): Please unmount the filesystem and rectify the problem(s)
XFS (sda): Unmounting Filesystem

Looking up that line in gdb produces:

0xffffffffa038a0a8 is in xlog_write (fs/xfs/xfs_log.c:2439).
2434            int                     log_offset;
2435
2436            if (ticket->t_curr_res < 0) {
2437                    xfs_alert_tag(log->l_mp, XFS_PTAG_LOGRES,
2438                         "ctx ticket reservation ran out. Need to up reservation");
2439                    xlog_print_tic_res(log->l_mp, ticket);
2440                    xfs_force_shutdown(log->l_mp, SHUTDOWN_LOG_IO_ERROR);
2441            }

I haven't applied these two patches yet, but looking back through
fstests reports I never saw this before the recent for-next push.
I'm uncertain if it's the CIL work or the xattr refactoring that did
this, though AFAICT generic/100 itself does not generate any xattrs and
I don't have any LSMs enabled that would cause them to be created.

--D

> 
> Cheers,
> 
> Dave.
> 
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-15 15:38   ` Brian Foster
@ 2021-06-15 22:09     ` Dave Chinner
  2021-06-16 14:57       ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2021-06-15 22:09 UTC (permalink / raw)
  To: Brian Foster, Darrick J. Wong; +Cc: linux-xfs

On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote:
> On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The iclogbuf ring attached to the struct xlog is circular, hence the
> > first and last iclogs in the ring can only be determined by
> > comparing them against the log->l_iclog pointer.
> > 
> > In xfs_cil_push_work(), we want to wait on previous iclogs that were
> > issued so that we can flush them to stable storage with the commit
> > record write, and it simply waits on the previous iclog in the ring.
> > This, however, leads to CIL push hangs in generic/019 like so:
> > 
> > task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> > Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> > Call Trace:
> >  __schedule+0x30b/0x9f0
> >  schedule+0x68/0xe0
> >  xlog_wait_on_iclog+0x121/0x190
> >  ? wake_up_q+0xa0/0xa0
> >  xlog_cil_push_work+0x994/0xa10
> >  ? _raw_spin_lock+0x15/0x20
> >  ? xfs_swap_extents+0x920/0x920
> >  process_one_work+0x1ab/0x390
> >  worker_thread+0x56/0x3d0
> >  ? rescuer_thread+0x3c0/0x3c0
> >  kthread+0x14d/0x170
> >  ? __kthread_bind_mask+0x70/0x70
> >  ret_from_fork+0x1f/0x30
> > 
> > With other threads blocking in either xlog_state_get_iclog_space()
> > waiting for iclog space or xlog_grant_head_wait() waiting for log
> > reservation space.
> > 
> > The problem here is that the previous iclog on the ring might
> > actually be a future iclog. That is, if log->l_iclog points at
> > commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> > and there are no previous iclogs pending as they have all completed
> > their IO and been activated again. IOWs, commit_iclog->ic_prev
> > points to an iclog that will be written in the future, not one that
> > has been written in the past.
> > 
> > Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> > behaviour, and depending on the state of the future iclog, we can
> > end up with a circular ABA wait cycle and we hang.
> > 
> > Fix this by only waiting on the previous iclog when the commit_iclog
> > is not the oldest iclog in the ring.
> > 
> > Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> > Reported-by: Brian Foster <bfoster@redhat.com>
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
> >  1 file changed, 12 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > index 705619e9dab4..398f00cf9cbf 100644
> > --- a/fs/xfs/xfs_log_cil.c
> > +++ b/fs/xfs/xfs_log_cil.c
> > @@ -1075,15 +1075,21 @@ xlog_cil_push_work(
> >  	ticket = ctx->ticket;
> >  
> >  	/*
> > -	 * If the checkpoint spans multiple iclogs, wait for all previous
> > -	 * iclogs to complete before we submit the commit_iclog. In this case,
> > -	 * the commit_iclog write needs to issue a pre-flush so that the
> > -	 * ordering is correctly preserved down to stable storage.
> > +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> > +	 * to complete before we submit the commit_iclog. If the commit iclog is
> > +	 * at the head of the iclog ring, then all other iclogs have completed
> > +	 * and are waiting on this one and hence we don't need to wait.
> > +	 *
> > +	 * Regardless of whether we need to wait or not, the the commit_iclog
> > +	 * write needs to issue a pre-flush so that the ordering for this
> > +	 * checkpoint is correctly preserved down to stable storage.
> >  	 */
> >  	spin_lock(&log->l_icloglock);
> >  	if (ctx->start_lsn != commit_lsn) {
> > -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> > -		spin_lock(&log->l_icloglock);
> > +		if (commit_iclog != log->l_iclog) {
> > +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> > +			spin_lock(&log->l_icloglock);
> > +		}

Brian says:

> I'm confused at what this is attempting to accomplish. If we have a
> single CIL force and that happens to span several iclogs, isn't the goal
> to wait on the previous iclog to ensure the previously written content
> of the current checkpoint is on-disk?

Yes. I wouldn't be surprised if I broke something, though, because I
was only looking at situation where the log hang was occurring.

I think I should have added a state check on the commit_iclog being
in ACTIVE state, too, because if it is active then there are no
future iclogs in flight and we should definitely wait. But that
doesn't solve the problem, either, because if it is in WANT_SYNC
there are future iclogs in flight (because
xlog_state_get_iclog_space() has switched iclogs) and then we're in
trouble because....


Darrick says:

> I'm confused.  How can you tell that we need to wait for
> commit_iclog->ic_prev to be written out by comparing commit_iclog to
> log->l_iclog?  Can't you determine this by checking ic_prev for DIRTY
> state?

.... the problem is that we can't tell if ->ic_prev is a future or
past iclog based on it's state. If it is in the past the ic_prev
iclog could be in any state from WANT_SYNC to ACTIVE, depending on
where the IO completion is. And if it's in the future, it could be
in any state from ACTIVE to SYNC_DONE. It's the overlap in
past/future states that causes the problem here.

Being in DIRTY state just means that IO completion is running,
callbacks are complete and we're about to run a ic_force_wait queue
wakeup and activate the iclog again. When the hang occurs, the
future iclogs are stuck in the DONE_SYNC state and can't progress
through to the callback state, let alone dirty...

Even an iclog in ACTIVE state doesn't mean it is in the past - it
just means that it's ready to be written to. It could be being
filled at this point in time by another checkpoint via xlog_write(),
and it, at any point in time, can transition to WANT_SYNC, SYNCING,
or SYNC_DONE.

xlog_wait_on_iclog() will ignore ACTIVE/DIRTY iclogs, anyway, but
the problem is that a future iclog can still transition at any time
to WANT_SYNC or any other state indicating IO is in progress. Hence
we just can't tell if it is a past or future iclog and so we can't
blindly wait on it. In the common case, ic_prev is a past iclog and
it won't get stuck on the commit_iclog being ACTIVE/WANT_SYNC at the
head of the log->l_iclog ring in xlog_state_do_callback() and so
will be processed and go back to ACTIVE state. However, in the rare
case it is a future iclog, then it gets stuck waiting for IO
completion of the commit_iclog to reach xlog_state_do_callback()....

----

What I was trying to take into account via the (commit_iclog !=
log->l_iclog) check was a measure of past/future tense. The hang
always occurs where commit_iclog == log->l_iclog and all ic_next
iclogs are future iclogs rather than being past iclogs. As Brian
notes, this probably just broke the single commit case, but because
the failure stress tests load up the log greatly, they don't
exercise that case.

So I went and looked at all the cases where we check
ic_prev->ic_state to make decisions. There's only one place that
does that: log forces. The sync transaction/log force aggregation
code is the only thing that checks the state of the previous iclog,
and it appears to avoid the "future iclog" problem in two ways.
First, it only checks the previous iclog state when the current head
iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it
ic-prev is a future iclog, it only waits on the ic_write_wait queue,
which means the log force gets woken when IO completion occurs and
the iclog transitions to DONE_SYNC, hence it never waits for
ordering of callbacks (other log force code does that).

IOWs, there's only one place that actually tries to use the previous
iclog state for making decisions, and it does doesn't really handle
the case we need to handle here which is determining if the previous
iclog is a past or future iclog. I does, however, handle the case of
incorrect waiting safely.

That is, what we need to wait for in the CIL push is the completion
of the physical IO to be signalled, not for the callbacks associated
with that iclog to run. IOWs, neat as it may be,
xlog_wait_on_iclog() is not the right function to be using here. We
should be waiting on the ic_write_wait queue to be woken when iclog
IO completion occurs, not the ic_force_wait queue that waits on
ordered completion of entire journal checkpoints. We just want to
ensure that all previous iclogs are physically completed before we
do the pre-flush on the commit record IO, we don't need software
processing of those IOs to be complete.

So, I think the right thing to do here is change this code to wait
on the ic_write_wait queue for completion of the previous iclog and
then we just don't care if commit_iclog->ic_prev is a past or future
iclog....



-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] xfs: add iclog state trace events
  2021-06-15 15:37   ` Brian Foster
@ 2021-06-15 22:47     ` Dave Chinner
  2021-06-16 14:55       ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2021-06-15 22:47 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Jun 15, 2021 at 11:37:49AM -0400, Brian Foster wrote:
> On Tue, Jun 15, 2021 at 04:46:57PM +1000, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > For the DEBUGS!
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/xfs_log.c   | 18 ++++++++++++++++
> >  fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 70 insertions(+)
> > 
> > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > index e921b554b683..54fd6a695bb5 100644
> > --- a/fs/xfs/xfs_log.c
> > +++ b/fs/xfs/xfs_log.c
> > @@ -524,6 +524,7 @@ __xlog_state_release_iclog(
> >  		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> >  		xlog_verify_tail_lsn(log, iclog, tail_lsn);
> >  		/* cycle incremented when incrementing curr_block */
> > +		trace_xlog_iclog_syncing(iclog, _RET_IP_);
> >  		return true;
> >  	}
> >  
> > @@ -543,6 +544,7 @@ xlog_state_release_iclog(
> >  {
> >  	lockdep_assert_held(&log->l_icloglock);
> >  
> > +	trace_xlog_iclog_release(iclog, _RET_IP_);
> >  	if (iclog->ic_state == XLOG_STATE_IOERROR)
> >  		return -EIO;
> >  
> > @@ -804,6 +806,7 @@ xlog_wait_on_iclog(
> >  {
> >  	struct xlog		*log = iclog->ic_log;
> >  
> > +	trace_xlog_iclog_wait_on(iclog, _RET_IP_);
> 
> Seems like this might be more informative if we actually wait.

This tells us the state at the time the call is made, so we know
from that if we waited or not. Being able to see all calls to this
function was instrumental in understanding the stiuation where we
were waiting and when we weren't waiting. This is the iclog we got
stuck waiting on:

 xlog_iclog_wait_on:   dev 259:0 state 0x3 refcnt 0 offset 32256 lsn 0x2300004be0 caller xlog_cil_push_work

The state is XLOG_STATE_DONE_SYNC, indicating IO had been completed,
and the lsn tells me it is a future iclog we are waiting on, which
tells us that it's waiting for a previous iclog to complete IO
before it can move forward...

> >  	if (!XLOG_FORCED_SHUTDOWN(log) &&
> >  	    iclog->ic_state != XLOG_STATE_ACTIVE &&
> >  	    iclog->ic_state != XLOG_STATE_DIRTY) {
> > @@ -1804,6 +1807,7 @@ xlog_write_iclog(
> >  	unsigned int		count)
> >  {
> >  	ASSERT(bno < log->l_logBBsize);
> > +	trace_xlog_iclog_write(iclog, _RET_IP_);
> >  
> >  	/*
> >  	 * We lock the iclogbufs here so that we can serialise against I/O
> > @@ -1950,6 +1954,7 @@ xlog_sync(
> >  	unsigned int		size;
> >  
> >  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
> > +	trace_xlog_iclog_sync(iclog, _RET_IP_);
> >  
> >  	count = xlog_calc_iclog_size(log, iclog, &roundoff);
> >  
> > @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog(
> >  	int			*iclogs_changed)
> >  {
> >  	ASSERT(list_empty_careful(&iclog->ic_callbacks));
> > +	trace_xlog_iclog_activate(iclog, _RET_IP_);
> >  
> 
> I wonder a bit whether these state change oriented tracepoints could be
> served by a single trace_ixlog_iclog_state() or some such, since the
> tracepoint already prints the state. That wouldn't show the before ->
> state in each tracepoint, but that should reflect in the full log.

I disagree. We lose a -lot- of context by getting rid of the
per-call site trace names, especially those that get called from
multiple places. Not to mention tracing an ACTIVE, idle iclog can
result in zero state information except for the tracepoint name...
e.g:

384.204393: xlog_iclog_write:     dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004b60 caller xlog_sync
384.204406: xlog_iclog_sync_done: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004b60 caller xlog_ioend_work
384.204407: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x0 caller xlog_write_get_more_iclog_space
384.204408: xlog_iclog_switch:    dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x2300004ba0 caller xlog_state_get_iclog_space
384.204418: xlog_iclog_release:   dev 259:0 state 0x1 refcnt 1 offset 32256 lsn 0x2300004ba0 caller xlog_write_get_more_iclog_space
384.204419: xlog_iclog_syncing:   dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_write_get_more_iclog_space
384.204419: xlog_iclog_sync:      dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_state_release_iclog
384.204422: xlog_iclog_write:     dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_sync
384.204435: xlog_iclog_sync_done: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_ioend_work

See how the xlog_iclog_get_space trace has zero state/lsn? That
indicates that it has actually found an idle, active iclog in the
correct state for it to hand out. We can also see this progression
is through xlog_write() filling mulitple iclogs. However, we can get
different progressions from different contexts such as:

384.193766: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x0 caller xlog_write_get_more_iclog_space
384.193778: xlog_iclog_release:   dev 259:0 state 0x0 refcnt 1 offset 27352 lsn 0x23000042b0 caller xlog_write
384.193779: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 27352 lsn 0x23000042b0 caller xlog_write
384.193780: xlog_iclog_wait_on:   dev 259:0 state 0x0 refcnt 0 offset 0 lsn 0x0 caller xlog_cil_push_work
384.193781: xlog_iclog_release:   dev 259:0 state 0x0 refcnt 1 offset 27364 lsn 0x23000042b0 caller xlog_cil_push_work
384.201268: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 27364 lsn 0x23000042b0 caller xlog_write

.... the competion of a checkpoint where xlog_write() releases the
iclog and it's not a full iclog. This is followed immediately by the
commit record write which gets iclog space from a different caller
location, passes it back to xlog_cil_push_work which then releases
it. ANd then we try to wait on the previous iclog which is an idle
ACTIVE buffer, so we skip the wait.  And because this isn't a
"stable CIL push", we don't see a xlog_iclog_switch() trace, and the
so iclog is not flushed to disk when it is release here.

IOWs, if we don't have the individual trace names, we largely lose
all this context and then have to infer it from caller and state
changes that occur. That just makes things unnnecessarily difficult
when looking for problems in hundreds of thousands trace events....

> > @@ -3138,6 +3153,8 @@ xfs_log_force(
> >  	if (iclog->ic_state == XLOG_STATE_IOERROR)
> >  		goto out_error;
> >  
> > +	trace_xlog_iclog_force(iclog, _RET_IP_);
> > +
> 
> We have a log force tracepoint a few lines up. Do we need both?

Yes. The log force tracepoint tells us nothing about the iclog
state we are about to operate on.

> > +	TP_ARGS(iclog, caller_ip),
> > +	TP_STRUCT__entry(
> > +		__field(dev_t, dev)
> > +		__field(uint32_t, state)
> > +		__field(int32_t, refcount)
> > +		__field(uint32_t, offset)
> > +		__field(unsigned long long, lsn)
> > +		__field(unsigned long, caller_ip)
> > +	),
> > +	TP_fast_assign(
> > +		__entry->dev = iclog->ic_log->l_mp->m_super->s_dev;
> > +		__entry->state = iclog->ic_state;
> > +		__entry->refcount = atomic_read(&iclog->ic_refcnt);
> > +		__entry->offset = iclog->ic_offset;
> > +		__entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> > +		__entry->caller_ip = caller_ip;
> > +	),
> > +	TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS",
> > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > +		  __entry->state,
> 
> It might be nice to see a string representation of the state like we do
> for other things like log item type, etc.

Yup, I thought about that after I sent when looking at other code.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 1/2] xfs: add iclog state trace events
  2021-06-15 22:47     ` Dave Chinner
@ 2021-06-16 14:55       ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2021-06-16 14:55 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Jun 16, 2021 at 08:47:09AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 11:37:49AM -0400, Brian Foster wrote:
> > On Tue, Jun 15, 2021 at 04:46:57PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > For the DEBUGS!
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log.c   | 18 ++++++++++++++++
> > >  fs/xfs/xfs_trace.h | 52 ++++++++++++++++++++++++++++++++++++++++++++++
> > >  2 files changed, 70 insertions(+)
> > > 
> > > diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> > > index e921b554b683..54fd6a695bb5 100644
> > > --- a/fs/xfs/xfs_log.c
> > > +++ b/fs/xfs/xfs_log.c
> > > @@ -524,6 +524,7 @@ __xlog_state_release_iclog(
> > >  		iclog->ic_header.h_tail_lsn = cpu_to_be64(tail_lsn);
> > >  		xlog_verify_tail_lsn(log, iclog, tail_lsn);
> > >  		/* cycle incremented when incrementing curr_block */
> > > +		trace_xlog_iclog_syncing(iclog, _RET_IP_);
> > >  		return true;
> > >  	}
> > >  
> > > @@ -543,6 +544,7 @@ xlog_state_release_iclog(
> > >  {
> > >  	lockdep_assert_held(&log->l_icloglock);
> > >  
> > > +	trace_xlog_iclog_release(iclog, _RET_IP_);
> > >  	if (iclog->ic_state == XLOG_STATE_IOERROR)
> > >  		return -EIO;
> > >  
> > > @@ -804,6 +806,7 @@ xlog_wait_on_iclog(
> > >  {
> > >  	struct xlog		*log = iclog->ic_log;
> > >  
> > > +	trace_xlog_iclog_wait_on(iclog, _RET_IP_);
> > 
> > Seems like this might be more informative if we actually wait.
> 
> This tells us the state at the time the call is made, so we know
> from that if we waited or not. Being able to see all calls to this
> function was instrumental in understanding the stiuation where we
> were waiting and when we weren't waiting. This is the iclog we got
> stuck waiting on:
> 
>  xlog_iclog_wait_on:   dev 259:0 state 0x3 refcnt 0 offset 32256 lsn 0x2300004be0 caller xlog_cil_push_work
> 
> The state is XLOG_STATE_DONE_SYNC, indicating IO had been completed,
> and the lsn tells me it is a future iclog we are waiting on, which
> tells us that it's waiting for a previous iclog to complete IO
> before it can move forward...
> 

Then this tracepoint would have still triggered, wouldn't it? What's the
great significance in seeing a tracepoint for a function call in cases
where that call is otherwise essentially a no-op?

My point is that from a general standpoint (i.e. once this particular
issue is addressed), I find it obscured to see a bunch of blah_wait*
tracepoints in a full trace dump because that explicitly suggests that
something actually waited, when in reality, it's just a byproduct of an
unfortunately named (i.e. based on the function name) function call
based tracepoint.

Thinking more about it, this is the exact type of thing I almost always
end up modifying or replacing with my own trace_printk() when debugging
anything that already requires reproduction to root cause, because
otherwise I have to sift through more context just to filter out
function call noise from the actual meaningful calls.

There's also subtle details like how this doesn't reflect shutdown state
(so technically this doesn't actually provide all the information
required to determine whether a block actually occurred), or that your
_syncing() tracepoint is in a very similar situation, yet in that case
you opted to bind the tracepoint with the relevant logic vs. to the
function call/name. All I'm asking for is to do a similar thing here by
moving this one down a few lines. What's the problem?

> > >  	if (!XLOG_FORCED_SHUTDOWN(log) &&
> > >  	    iclog->ic_state != XLOG_STATE_ACTIVE &&
> > >  	    iclog->ic_state != XLOG_STATE_DIRTY) {
> > > @@ -1804,6 +1807,7 @@ xlog_write_iclog(
> > >  	unsigned int		count)
> > >  {
> > >  	ASSERT(bno < log->l_logBBsize);
> > > +	trace_xlog_iclog_write(iclog, _RET_IP_);
> > >  
> > >  	/*
> > >  	 * We lock the iclogbufs here so that we can serialise against I/O
> > > @@ -1950,6 +1954,7 @@ xlog_sync(
> > >  	unsigned int		size;
> > >  
> > >  	ASSERT(atomic_read(&iclog->ic_refcnt) == 0);
> > > +	trace_xlog_iclog_sync(iclog, _RET_IP_);
> > >  
> > >  	count = xlog_calc_iclog_size(log, iclog, &roundoff);
> > >  
> > > @@ -2488,6 +2493,7 @@ xlog_state_activate_iclog(
> > >  	int			*iclogs_changed)
> > >  {
> > >  	ASSERT(list_empty_careful(&iclog->ic_callbacks));
> > > +	trace_xlog_iclog_activate(iclog, _RET_IP_);
> > >  
> > 
> > I wonder a bit whether these state change oriented tracepoints could be
> > served by a single trace_ixlog_iclog_state() or some such, since the
> > tracepoint already prints the state. That wouldn't show the before ->
> > state in each tracepoint, but that should reflect in the full log.
> 
> I disagree. We lose a -lot- of context by getting rid of the
> per-call site trace names, especially those that get called from
> multiple places. Not to mention tracing an ACTIVE, idle iclog can
> result in zero state information except for the tracepoint name...
> e.g:
> 
> 384.204393: xlog_iclog_write:     dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004b60 caller xlog_sync
> 384.204406: xlog_iclog_sync_done: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004b60 caller xlog_ioend_work
> 384.204407: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x0 caller xlog_write_get_more_iclog_space
> 384.204408: xlog_iclog_switch:    dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x2300004ba0 caller xlog_state_get_iclog_space
> 384.204418: xlog_iclog_release:   dev 259:0 state 0x1 refcnt 1 offset 32256 lsn 0x2300004ba0 caller xlog_write_get_more_iclog_space
> 384.204419: xlog_iclog_syncing:   dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_write_get_more_iclog_space
> 384.204419: xlog_iclog_sync:      dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_state_release_iclog
> 384.204422: xlog_iclog_write:     dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_sync
> 384.204435: xlog_iclog_sync_done: dev 259:0 state 0x2 refcnt 0 offset 32256 lsn 0x2300004ba0 caller xlog_ioend_work
> 
> See how the xlog_iclog_get_space trace has zero state/lsn? That
> indicates that it has actually found an idle, active iclog in the
> correct state for it to hand out. We can also see this progression
> is through xlog_write() filling mulitple iclogs. However, we can get
> different progressions from different contexts such as:
> 
> 384.193766: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 0 lsn 0x0 caller xlog_write_get_more_iclog_space
> 384.193778: xlog_iclog_release:   dev 259:0 state 0x0 refcnt 1 offset 27352 lsn 0x23000042b0 caller xlog_write
> 384.193779: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 27352 lsn 0x23000042b0 caller xlog_write
> 384.193780: xlog_iclog_wait_on:   dev 259:0 state 0x0 refcnt 0 offset 0 lsn 0x0 caller xlog_cil_push_work
> 384.193781: xlog_iclog_release:   dev 259:0 state 0x0 refcnt 1 offset 27364 lsn 0x23000042b0 caller xlog_cil_push_work
> 384.201268: xlog_iclog_get_space: dev 259:0 state 0x0 refcnt 1 offset 27364 lsn 0x23000042b0 caller xlog_write
> 
> .... the competion of a checkpoint where xlog_write() releases the
> iclog and it's not a full iclog. This is followed immediately by the
> commit record write which gets iclog space from a different caller
> location, passes it back to xlog_cil_push_work which then releases
> it. ANd then we try to wait on the previous iclog which is an idle
> ACTIVE buffer, so we skip the wait.  And because this isn't a
> "stable CIL push", we don't see a xlog_iclog_switch() trace, and the
> so iclog is not flushed to disk when it is release here.
> 

Side note: I like how you incorporated the lack of an _iclog_switch()
tracepoint in your analysis. It's almost like we'd be capable of similar
reasoning wrt that _wait_on() tracepoint. ;)

That aside, fair enough on the names. The reason I suggest it is I think
it would be nicer to be able to do something like 'trace-cmd start -e
xfs:xlog_iclog_state*' to see all the state transitions and nothing
else, but that's not a huge deal and secondary to providing useful
information. I do wonder a bit if we could accomplish something similar
with a new state setting inline and a custom, single tracepoint to
include the calling function name, but that's not something worth
digging into now.

> IOWs, if we don't have the individual trace names, we largely lose
> all this context and then have to infer it from caller and state
> changes that occur. That just makes things unnnecessarily difficult
> when looking for problems in hundreds of thousands trace events....
> 
> > > @@ -3138,6 +3153,8 @@ xfs_log_force(
> > >  	if (iclog->ic_state == XLOG_STATE_IOERROR)
> > >  		goto out_error;
> > >  
> > > +	trace_xlog_iclog_force(iclog, _RET_IP_);
> > > +
> > 
> > We have a log force tracepoint a few lines up. Do we need both?
> 
> Yes. The log force tracepoint tells us nothing about the iclog
> state we are about to operate on.
> 

Ok.

Brian

> > > +	TP_ARGS(iclog, caller_ip),
> > > +	TP_STRUCT__entry(
> > > +		__field(dev_t, dev)
> > > +		__field(uint32_t, state)
> > > +		__field(int32_t, refcount)
> > > +		__field(uint32_t, offset)
> > > +		__field(unsigned long long, lsn)
> > > +		__field(unsigned long, caller_ip)
> > > +	),
> > > +	TP_fast_assign(
> > > +		__entry->dev = iclog->ic_log->l_mp->m_super->s_dev;
> > > +		__entry->state = iclog->ic_state;
> > > +		__entry->refcount = atomic_read(&iclog->ic_refcnt);
> > > +		__entry->offset = iclog->ic_offset;
> > > +		__entry->lsn = be64_to_cpu(iclog->ic_header.h_lsn);
> > > +		__entry->caller_ip = caller_ip;
> > > +	),
> > > +	TP_printk("dev %d:%d state 0x%x refcnt %d offset %u lsn 0x%llx caller %pS",
> > > +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> > > +		  __entry->state,
> > 
> > It might be nice to see a string representation of the state like we do
> > for other things like log item type, etc.
> 
> Yup, I thought about that after I sent when looking at other code.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-15 22:09     ` Dave Chinner
@ 2021-06-16 14:57       ` Brian Foster
  2021-06-16 22:20         ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Brian Foster @ 2021-06-16 14:57 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote:
> On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote:
> > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The iclogbuf ring attached to the struct xlog is circular, hence the
> > > first and last iclogs in the ring can only be determined by
> > > comparing them against the log->l_iclog pointer.
> > > 
> > > In xfs_cil_push_work(), we want to wait on previous iclogs that were
> > > issued so that we can flush them to stable storage with the commit
> > > record write, and it simply waits on the previous iclog in the ring.
> > > This, however, leads to CIL push hangs in generic/019 like so:
> > > 
> > > task:kworker/u33:0   state:D stack:12680 pid:    7 ppid:     2 flags:0x00004000
> > > Workqueue: xfs-cil/pmem1 xlog_cil_push_work
> > > Call Trace:
> > >  __schedule+0x30b/0x9f0
> > >  schedule+0x68/0xe0
> > >  xlog_wait_on_iclog+0x121/0x190
> > >  ? wake_up_q+0xa0/0xa0
> > >  xlog_cil_push_work+0x994/0xa10
> > >  ? _raw_spin_lock+0x15/0x20
> > >  ? xfs_swap_extents+0x920/0x920
> > >  process_one_work+0x1ab/0x390
> > >  worker_thread+0x56/0x3d0
> > >  ? rescuer_thread+0x3c0/0x3c0
> > >  kthread+0x14d/0x170
> > >  ? __kthread_bind_mask+0x70/0x70
> > >  ret_from_fork+0x1f/0x30
> > > 
> > > With other threads blocking in either xlog_state_get_iclog_space()
> > > waiting for iclog space or xlog_grant_head_wait() waiting for log
> > > reservation space.
> > > 
> > > The problem here is that the previous iclog on the ring might
> > > actually be a future iclog. That is, if log->l_iclog points at
> > > commit_iclog, commit_iclog is the first (oldest) iclog in the ring
> > > and there are no previous iclogs pending as they have all completed
> > > their IO and been activated again. IOWs, commit_iclog->ic_prev
> > > points to an iclog that will be written in the future, not one that
> > > has been written in the past.
> > > 
> > > Hence, in this case, waiting on the ->ic_prev iclog is incorrect
> > > behaviour, and depending on the state of the future iclog, we can
> > > end up with a circular ABA wait cycle and we hang.
> > > 
> > > Fix this by only waiting on the previous iclog when the commit_iclog
> > > is not the oldest iclog in the ring.
> > > 
> > > Fixes: 5fd9256ce156 ("xfs: separate CIL commit record IO")
> > > Reported-by: Brian Foster <bfoster@redhat.com>
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/xfs_log_cil.c | 18 ++++++++++++------
> > >  1 file changed, 12 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_log_cil.c b/fs/xfs/xfs_log_cil.c
> > > index 705619e9dab4..398f00cf9cbf 100644
> > > --- a/fs/xfs/xfs_log_cil.c
> > > +++ b/fs/xfs/xfs_log_cil.c
> > > @@ -1075,15 +1075,21 @@ xlog_cil_push_work(
> > >  	ticket = ctx->ticket;
> > >  
> > >  	/*
> > > -	 * If the checkpoint spans multiple iclogs, wait for all previous
> > > -	 * iclogs to complete before we submit the commit_iclog. In this case,
> > > -	 * the commit_iclog write needs to issue a pre-flush so that the
> > > -	 * ordering is correctly preserved down to stable storage.
> > > +	 * If the checkpoint spans multiple iclogs, wait for all previous iclogs
> > > +	 * to complete before we submit the commit_iclog. If the commit iclog is
> > > +	 * at the head of the iclog ring, then all other iclogs have completed
> > > +	 * and are waiting on this one and hence we don't need to wait.
> > > +	 *
> > > +	 * Regardless of whether we need to wait or not, the the commit_iclog
> > > +	 * write needs to issue a pre-flush so that the ordering for this
> > > +	 * checkpoint is correctly preserved down to stable storage.
> > >  	 */
> > >  	spin_lock(&log->l_icloglock);
> > >  	if (ctx->start_lsn != commit_lsn) {
> > > -		xlog_wait_on_iclog(commit_iclog->ic_prev);
> > > -		spin_lock(&log->l_icloglock);
> > > +		if (commit_iclog != log->l_iclog) {
> > > +			xlog_wait_on_iclog(commit_iclog->ic_prev);
> > > +			spin_lock(&log->l_icloglock);
> > > +		}
> 
> Brian says:
> 
> > I'm confused at what this is attempting to accomplish. If we have a
> > single CIL force and that happens to span several iclogs, isn't the goal
> > to wait on the previous iclog to ensure the previously written content
> > of the current checkpoint is on-disk?
> 
> Yes. I wouldn't be surprised if I broke something, though, because I
> was only looking at situation where the log hang was occurring.
> 
> I think I should have added a state check on the commit_iclog being
> in ACTIVE state, too, because if it is active then there are no
> future iclogs in flight and we should definitely wait. But that
> doesn't solve the problem, either, because if it is in WANT_SYNC
> there are future iclogs in flight (because
> xlog_state_get_iclog_space() has switched iclogs) and then we're in
> trouble because....
> 
> 
> Darrick says:
> 
> > I'm confused.  How can you tell that we need to wait for
> > commit_iclog->ic_prev to be written out by comparing commit_iclog to
> > log->l_iclog?  Can't you determine this by checking ic_prev for DIRTY
> > state?
> 
> .... the problem is that we can't tell if ->ic_prev is a future or
> past iclog based on it's state. If it is in the past the ic_prev
> iclog could be in any state from WANT_SYNC to ACTIVE, depending on
> where the IO completion is. And if it's in the future, it could be
> in any state from ACTIVE to SYNC_DONE. It's the overlap in
> past/future states that causes the problem here.
> 
> Being in DIRTY state just means that IO completion is running,
> callbacks are complete and we're about to run a ic_force_wait queue
> wakeup and activate the iclog again. When the hang occurs, the
> future iclogs are stuck in the DONE_SYNC state and can't progress
> through to the callback state, let alone dirty...
> 
> Even an iclog in ACTIVE state doesn't mean it is in the past - it
> just means that it's ready to be written to. It could be being
> filled at this point in time by another checkpoint via xlog_write(),
> and it, at any point in time, can transition to WANT_SYNC, SYNCING,
> or SYNC_DONE.
> 
> xlog_wait_on_iclog() will ignore ACTIVE/DIRTY iclogs, anyway, but
> the problem is that a future iclog can still transition at any time
> to WANT_SYNC or any other state indicating IO is in progress. Hence
> we just can't tell if it is a past or future iclog and so we can't
> blindly wait on it. In the common case, ic_prev is a past iclog and
> it won't get stuck on the commit_iclog being ACTIVE/WANT_SYNC at the
> head of the log->l_iclog ring in xlog_state_do_callback() and so
> will be processed and go back to ACTIVE state. However, in the rare
> case it is a future iclog, then it gets stuck waiting for IO
> completion of the commit_iclog to reach xlog_state_do_callback()....
> 
> ----
> 
> What I was trying to take into account via the (commit_iclog !=
> log->l_iclog) check was a measure of past/future tense. The hang
> always occurs where commit_iclog == log->l_iclog and all ic_next
> iclogs are future iclogs rather than being past iclogs. As Brian
> notes, this probably just broke the single commit case, but because
> the failure stress tests load up the log greatly, they don't
> exercise that case.
> 

That's kind of why I was wondering if you meant to check the state of
->ic_prev against ->l_iclog, but thinking about it again I'm not sure
that is valid either because I don't think anything prevents an iclog
from landing at the head before it might have completed pending I/O.
Maybe that doesn't make sense, I'm still wrapping my head around the
whole multi-push CIL model the original patch creates..

> So I went and looked at all the cases where we check
> ic_prev->ic_state to make decisions. There's only one place that
> does that: log forces. The sync transaction/log force aggregation
> code is the only thing that checks the state of the previous iclog,
> and it appears to avoid the "future iclog" problem in two ways.
> First, it only checks the previous iclog state when the current head
> iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it
> ic-prev is a future iclog, it only waits on the ic_write_wait queue,
> which means the log force gets woken when IO completion occurs and
> the iclog transitions to DONE_SYNC, hence it never waits for
> ordering of callbacks (other log force code does that).
> 
> IOWs, there's only one place that actually tries to use the previous
> iclog state for making decisions, and it does doesn't really handle
> the case we need to handle here which is determining if the previous
> iclog is a past or future iclog. I does, however, handle the case of
> incorrect waiting safely.
> 
> That is, what we need to wait for in the CIL push is the completion
> of the physical IO to be signalled, not for the callbacks associated
> with that iclog to run. IOWs, neat as it may be,
> xlog_wait_on_iclog() is not the right function to be using here. We
> should be waiting on the ic_write_wait queue to be woken when iclog
> IO completion occurs, not the ic_force_wait queue that waits on
> ordered completion of entire journal checkpoints. We just want to
> ensure that all previous iclogs are physically completed before we
> do the pre-flush on the commit record IO, we don't need software
> processing of those IOs to be complete.
> 

Ok, but doesn't that leave a gap in scenarios where iclog completion
occurs out of order? I thought the reason for waiting on the force of
the previous iclog was to encapsulate that the entire content of the
checkpoint has been written to disk, not just the content in the
immediately previous iclog.

If the issue is past/future tense, is there any reason we can't filter
against the record header lsn of the previous iclog? E.g., assuming the
iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block
on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0?

Hmm.. that also has me wondering if it's possible that the previous
iclog might be some other future checkpoint, but then skipping the wait
entirely means the checkpoint might not actually be on-disk [1]. I
wonder if we need to do something like track the last lsn and/or iclog
touched by the current checkpoint and make sure we wait on that,
regardless of where it is in the ring (but as above, I'm still wrapping
my head around all of this).

Brian

[1] Once we work out the functional behavior one way or another, how do
we actually test/confirm that the flush behavior works as expected? I
feel as though the discussions going all the way to the initial review
feedback on this change have provided constant reminders that things
like the I/O flush / ordering dependencies are very easy to get wrong.
If that is ever the case, it's not clear to me we have any way to detect
it. Not that I'd expect to have some explicit 5 second fstest for that
problem, but have we analyzed whether the few decent crash recovery
tests we do have are sensitive enough to fail? I wonder if we could come
up with something clever and more explicit via error injection..

> So, I think the right thing to do here is change this code to wait
> on the ic_write_wait queue for completion of the previous iclog and
> then we just don't care if commit_iclog->ic_prev is a past or future
> iclog....
> 
> 
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-16 14:57       ` Brian Foster
@ 2021-06-16 22:20         ` Dave Chinner
  2021-06-17 13:15           ` Brian Foster
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2021-06-16 22:20 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, linux-xfs

On Wed, Jun 16, 2021 at 10:57:29AM -0400, Brian Foster wrote:
> On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote:
> > On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote:
> > > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> > So I went and looked at all the cases where we check
> > ic_prev->ic_state to make decisions. There's only one place that
> > does that: log forces. The sync transaction/log force aggregation
> > code is the only thing that checks the state of the previous iclog,
> > and it appears to avoid the "future iclog" problem in two ways.
> > First, it only checks the previous iclog state when the current head
> > iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it
> > ic-prev is a future iclog, it only waits on the ic_write_wait queue,
> > which means the log force gets woken when IO completion occurs and
> > the iclog transitions to DONE_SYNC, hence it never waits for
> > ordering of callbacks (other log force code does that).
> > 
> > IOWs, there's only one place that actually tries to use the previous
> > iclog state for making decisions, and it does doesn't really handle
> > the case we need to handle here which is determining if the previous
> > iclog is a past or future iclog. I does, however, handle the case of
> > incorrect waiting safely.
> > 
> > That is, what we need to wait for in the CIL push is the completion
> > of the physical IO to be signalled, not for the callbacks associated
> > with that iclog to run. IOWs, neat as it may be,
> > xlog_wait_on_iclog() is not the right function to be using here. We
> > should be waiting on the ic_write_wait queue to be woken when iclog
> > IO completion occurs, not the ic_force_wait queue that waits on
> > ordered completion of entire journal checkpoints. We just want to
> > ensure that all previous iclogs are physically completed before we
> > do the pre-flush on the commit record IO, we don't need software
> > processing of those IOs to be complete.
> > 
> 
> Ok, but doesn't that leave a gap in scenarios where iclog completion
> occurs out of order? I thought the reason for waiting on the force of
> the previous iclog was to encapsulate that the entire content of the
> checkpoint has been written to disk, not just the content in the
> immediately previous iclog.

Yes. We either have to wait for all the iclogs to be written or
we have to use ic_force_wait for sequencing.

> If the issue is past/future tense, is there any reason we can't filter
> against the record header lsn of the previous iclog? E.g., assuming the
> iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block
> on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0?

That's what the patch I left running overnight does:

        /*
         * If the checkpoint spans multiple iclogs, wait for all previous iclogs
         * to complete before we submit the commit_iclog. We can't use state
         * checks for this - ACTIVE can be either a past completed iclog or a
         * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a
         * past or future iclog awaiting IO or ordered IO completion to be run.
         * In the latter case, if it's a future iclog and we wait on it, the we
         * will hang because it won't get processed through to ic_force_wait
         * wakeup until this commit_iclog is written to disk.  Hence we use the
         * iclog header lsn and compare it to the commit lsn to determine if we
         * need to wait on iclogs or not.
         */
        spin_lock(&log->l_icloglock);
        if (ctx->start_lsn != ctx->commit_lsn) {
                struct xlog_in_core     *iclog;

                for (iclog = commit_iclog->ic_prev;
                     iclog != commit_iclog;
                     iclog = iclog->ic_prev) {
                        xfs_lsn_t       hlsn;

                        /*
                         * If the LSN of the iclog is zero or in the future it
                         * means it has passed through IO completion and
                         * activation and hence all previous iclogs have also
                         * done so. We do not need to wait at all in this case.
                         */
                        hlsn = be64_to_cpu(iclog->ic_header.h_lsn);
                        if (!hlsn || XFS_LSN_CMP(hlsn, ctx->commit_lsn) > 0)
                                break;

                        /*
                         * If the LSN of the iclog is older than the commit lsn,
                         * we have to wait on it. Waiting on this via the
                         * ic_force_wait should also order the completion of all
                         * older iclogs, too, but we leave checking that to the
                         * next loop iteration.
                         */
                        ASSERT(XFS_LSN_CMP(hlsn, ctx->commit_lsn) < 0);
                        xlog_wait_on_iclog(iclog);
                        spin_lock(&log->l_icloglock);
                }

                /*
                 * Regardless of whether we need to wait or not, the the
                 * commit_iclog write needs to issue a pre-flush so that the
                 * ordering for this checkpoint is correctly preserved down to
                 * stable storage.
                 */
                commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
        }


> Hmm.. that also has me wondering if it's possible that the previous
> iclog might be some other future checkpoint, but then skipping the wait
> entirely means the checkpoint might not actually be on-disk [1]. I

If the previous iclog is a future iclog, then all iclogs are future
iclogs and this iclog is the oldest. The iclog ring only ages in one
direction....

> wonder if we need to do something like track the last lsn and/or iclog
> touched by the current checkpoint and make sure we wait on that,
> regardless of where it is in the ring (but as above, I'm still wrapping
> my head around all of this).

We have to guarantee the log is completely stable up to the commit
record, regardless of which checkpoint the iclogs belong to.  If we
don't do that, then log recovery will consider that the checkpoint
in the log is not complete before it finds the commit record. i.e.
there will be a hole in the log where the previous cycle numbers
show through between the end of the checkpoint and the commit
record, and hence the head of the log will be pointed at the last
fully completed checkpoint, not this one.

Hence we have to ensure all prior iclogs have completed to stable
storage before writing the commit record, regardless of whether the
previous iclogs are part of the current checkpoint or not.

> [1] Once we work out the functional behavior one way or another, how do
> we actually test/confirm that the flush behavior works as expected? I
> feel as though the discussions going all the way to the initial review
> feedback on this change have provided constant reminders that things
> like the I/O flush / ordering dependencies are very easy to get wrong.
> If that is ever the case, it's not clear to me we have any way to detect
> it. Not that I'd expect to have some explicit 5 second fstest for that
> problem, but have we analyzed whether the few decent crash recovery
> tests we do have are sensitive enough to fail? I wonder if we could come
> up with something clever and more explicit via error injection..

Given that we've found these problems soon after the code was
integrated and made available for wider testing, I think it shows
that, at the integration level, we have sufficiently sensitive crash
and crash recovery tests to discover issues like this. It's not
surprising that we may not find things like this at the
individual developer level because they only have so much time and
resources at their disposal.

IOWs, I think the process is working exactly the way it should be
working. It is expected that complex, subtle problems like these
will be found (if they exist) once more people are actively using
and testing the code in a variety of different environments. THis is
exactly the function that integration testing is supposed to
provide...

So while it's possible that we could have more reliable test to
exercise this stuff (no real idea how, though), the way we are
developing and merging and finding much harder to trigger problems
once the code is available for wider testing shows that our
engineering process is working as it should be....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL
  2021-06-16 22:20         ` Dave Chinner
@ 2021-06-17 13:15           ` Brian Foster
  0 siblings, 0 replies; 14+ messages in thread
From: Brian Foster @ 2021-06-17 13:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs

On Thu, Jun 17, 2021 at 08:20:54AM +1000, Dave Chinner wrote:
> On Wed, Jun 16, 2021 at 10:57:29AM -0400, Brian Foster wrote:
> > On Wed, Jun 16, 2021 at 08:09:44AM +1000, Dave Chinner wrote:
> > > On Tue, Jun 15, 2021 at 11:38:14AM -0400, Brian Foster wrote:
> > > > On Tue, Jun 15, 2021 at 04:46:58PM +1000, Dave Chinner wrote:
> > > So I went and looked at all the cases where we check
> > > ic_prev->ic_state to make decisions. There's only one place that
> > > does that: log forces. The sync transaction/log force aggregation
> > > code is the only thing that checks the state of the previous iclog,
> > > and it appears to avoid the "future iclog" problem in two ways.
> > > First, it only checks the previous iclog state when the current head
> > > iclog state is ACTIVE (i.e. no future iclogs). Secondly, even if it
> > > ic-prev is a future iclog, it only waits on the ic_write_wait queue,
> > > which means the log force gets woken when IO completion occurs and
> > > the iclog transitions to DONE_SYNC, hence it never waits for
> > > ordering of callbacks (other log force code does that).
> > > 
> > > IOWs, there's only one place that actually tries to use the previous
> > > iclog state for making decisions, and it does doesn't really handle
> > > the case we need to handle here which is determining if the previous
> > > iclog is a past or future iclog. I does, however, handle the case of
> > > incorrect waiting safely.
> > > 
> > > That is, what we need to wait for in the CIL push is the completion
> > > of the physical IO to be signalled, not for the callbacks associated
> > > with that iclog to run. IOWs, neat as it may be,
> > > xlog_wait_on_iclog() is not the right function to be using here. We
> > > should be waiting on the ic_write_wait queue to be woken when iclog
> > > IO completion occurs, not the ic_force_wait queue that waits on
> > > ordered completion of entire journal checkpoints. We just want to
> > > ensure that all previous iclogs are physically completed before we
> > > do the pre-flush on the commit record IO, we don't need software
> > > processing of those IOs to be complete.
> > > 
> > 
> > Ok, but doesn't that leave a gap in scenarios where iclog completion
> > occurs out of order? I thought the reason for waiting on the force of
> > the previous iclog was to encapsulate that the entire content of the
> > checkpoint has been written to disk, not just the content in the
> > immediately previous iclog.
> 
> Yes. We either have to wait for all the iclogs to be written or
> we have to use ic_force_wait for sequencing.
> 

Right, Ok. That wasn't stated clearly in your earlier comment.

> > If the issue is past/future tense, is there any reason we can't filter
> > against the record header lsn of the previous iclog? E.g., assuming the
> > iclog hasn't been cleaned/reactivated (i.e. h_lsn != 0), then only block
> > on the previous iclog if XFS_LSN_CMP(prev_lsn, commit_lsn) <= 0?
> 
> That's what the patch I left running overnight does:
> 

Ok.

>         /*
>          * If the checkpoint spans multiple iclogs, wait for all previous iclogs
>          * to complete before we submit the commit_iclog. We can't use state
>          * checks for this - ACTIVE can be either a past completed iclog or a
>          * future iclog being filled, while WANT_SYNC through SYNC_DONE can be a
>          * past or future iclog awaiting IO or ordered IO completion to be run.
>          * In the latter case, if it's a future iclog and we wait on it, the we
>          * will hang because it won't get processed through to ic_force_wait
>          * wakeup until this commit_iclog is written to disk.  Hence we use the
>          * iclog header lsn and compare it to the commit lsn to determine if we
>          * need to wait on iclogs or not.
>          */
>         spin_lock(&log->l_icloglock);
>         if (ctx->start_lsn != ctx->commit_lsn) {
>                 struct xlog_in_core     *iclog;
> 
>                 for (iclog = commit_iclog->ic_prev;
>                      iclog != commit_iclog;
>                      iclog = iclog->ic_prev) {
>                         xfs_lsn_t       hlsn;
> 
>                         /*
>                          * If the LSN of the iclog is zero or in the future it
>                          * means it has passed through IO completion and
>                          * activation and hence all previous iclogs have also
>                          * done so. We do not need to wait at all in this case.
>                          */
>                         hlsn = be64_to_cpu(iclog->ic_header.h_lsn);
>                         if (!hlsn || XFS_LSN_CMP(hlsn, ctx->commit_lsn) > 0)
>                                 break;
> 
>                         /*
>                          * If the LSN of the iclog is older than the commit lsn,
>                          * we have to wait on it. Waiting on this via the
>                          * ic_force_wait should also order the completion of all
>                          * older iclogs, too, but we leave checking that to the
>                          * next loop iteration.
>                          */

What is the point of the loop? ISTM the functional change only requires
some minimal tweaking from the first patch you posted.

>                         ASSERT(XFS_LSN_CMP(hlsn, ctx->commit_lsn) < 0);
>                         xlog_wait_on_iclog(iclog);
>                         spin_lock(&log->l_icloglock);
>                 }
> 
>                 /*
>                  * Regardless of whether we need to wait or not, the the
>                  * commit_iclog write needs to issue a pre-flush so that the
>                  * ordering for this checkpoint is correctly preserved down to
>                  * stable storage.
>                  */
>                 commit_iclog->ic_flags |= XLOG_ICL_NEED_FLUSH;
>         }
> 
> 
> > Hmm.. that also has me wondering if it's possible that the previous
> > iclog might be some other future checkpoint, but then skipping the wait
> > entirely means the checkpoint might not actually be on-disk [1]. I
> 
> If the previous iclog is a future iclog, then all iclogs are future
> iclogs and this iclog is the oldest. The iclog ring only ages in one
> direction....
> 

Hm, yes that makes sense. I might have been confusing the new push
concurrency with the fact that all log writers still must contend over
the head iclog.

> > wonder if we need to do something like track the last lsn and/or iclog
> > touched by the current checkpoint and make sure we wait on that,
> > regardless of where it is in the ring (but as above, I'm still wrapping
> > my head around all of this).
> 
> We have to guarantee the log is completely stable up to the commit
> record, regardless of which checkpoint the iclogs belong to.  If we
> don't do that, then log recovery will consider that the checkpoint
> in the log is not complete before it finds the commit record. i.e.
> there will be a hole in the log where the previous cycle numbers
> show through between the end of the checkpoint and the commit
> record, and hence the head of the log will be pointed at the last
> fully completed checkpoint, not this one.
> 

Yet another thus far undocumented side effect of concurrent pushes..

Yes, I know that recovery has always stamped over incomplete record
blocks out beyond the prospective head block. I'm referring to the fact
that recovery clearly has some historical warts that conflict with this
change in behavior to the runtime logging code and suggesting that the
upstream merge cycle should not be the avenue to identify and resolve
these problems.

It's rather evident to me at this point that this change should be
deferred, tested much more thoroughly (not because it wasn't tested or
anything of that nature, but because our log/crash recovery testing is
historically lacking and crude) and reposted with these issues clearly
documented and addressed, at the very least in a manner that allows for
proper consideration of potential alternatives, etc., as opposed to
racing with an arbitrary upstream release deadline.

(This is of course out of my hands as it's not my tree, not my call,
etc. Just my opinion. I've thrown your latest series into my testbed
regardless.)

> Hence we have to ensure all prior iclogs have completed to stable
> storage before writing the commit record, regardless of whether the
> previous iclogs are part of the current checkpoint or not.
> 
> > [1] Once we work out the functional behavior one way or another, how do
> > we actually test/confirm that the flush behavior works as expected? I
> > feel as though the discussions going all the way to the initial review
> > feedback on this change have provided constant reminders that things
> > like the I/O flush / ordering dependencies are very easy to get wrong.
> > If that is ever the case, it's not clear to me we have any way to detect
> > it. Not that I'd expect to have some explicit 5 second fstest for that
> > problem, but have we analyzed whether the few decent crash recovery
> > tests we do have are sensitive enough to fail? I wonder if we could come
> > up with something clever and more explicit via error injection..
> 
> Given that we've found these problems soon after the code was
> integrated and made available for wider testing, I think it shows
> that, at the integration level, we have sufficiently sensitive crash
> and crash recovery tests to discover issues like this. It's not
> surprising that we may not find things like this at the
> individual developer level because they only have so much time and
> resources at their disposal.
> 

What do the bugs that were found have to do with test coverage for the
updated flush semantics? I called that out separately precisely because
recent patches demonstrate how the logic is subtle and easy to break,
but I don't see any real coverage for it at all. So it would be nice to
think about how we might improve on that.

> IOWs, I think the process is working exactly the way it should be
> working. It is expected that complex, subtle problems like these
> will be found (if they exist) once more people are actively using
> and testing the code in a variety of different environments. THis is
> exactly the function that integration testing is supposed to
> provide...
> 
> So while it's possible that we could have more reliable test to
> exercise this stuff (no real idea how, though), the way we are
> developing and merging and finding much harder to trigger problems
> once the code is available for wider testing shows that our
> engineering process is working as it should be....
> 

Yeah, that's wishful thinking. I made a point to run some tests I've
recently been hammering on (for that bli patch I still can't seem to
progress) because 1.) I know we have a bunch of logging changes in the
pipeline that I otherwise haven't been able to keep up with and 2.) I'm
about to disappear for a few months on leave at any moment and so wanted
to prioritize some testing to at least help verify we maintain
correctness or otherwise get any issues reported to the list. And TBH if
it weren't for 2., this is probably not something I would have
prioritized over some other longer term development work in the first
place.

I'd like to hope that we would have caught these issues before release
regardless, but let's be realistic.. if developer testing doesn't catch
it, upstream review/merge cycle testing doesn't catch it, then for crash
and recovery issues (no regular users are experimenting with failure and
recovery of their filesystems to see if we've found new ways to lose
their data), the odds go way down from there that somebody is going to
catch it before it's released to the wild.

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2021-06-17 13:15 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-15  6:46 xfs: fix CIL push hang in for-next tree Dave Chinner
2021-06-15  6:46 ` [PATCH 1/2] xfs: add iclog state trace events Dave Chinner
2021-06-15 15:26   ` Darrick J. Wong
2021-06-15 15:37   ` Brian Foster
2021-06-15 22:47     ` Dave Chinner
2021-06-16 14:55       ` Brian Foster
2021-06-15  6:46 ` [PATCH 2/2] xfs: don't wait on future iclogs when pushing the CIL Dave Chinner
2021-06-15 15:38   ` Brian Foster
2021-06-15 22:09     ` Dave Chinner
2021-06-16 14:57       ` Brian Foster
2021-06-16 22:20         ` Dave Chinner
2021-06-17 13:15           ` Brian Foster
2021-06-15 16:19   ` Darrick J. Wong
2021-06-15 17:57 ` xfs: fix CIL push hang in for-next tree Darrick J. Wong

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.